The ultimate liquid cooling: heat rejection into water

Uptime Institute’s data on power usage effectiveness (PUE) is a testament to the progress the data center industry has made in energy efficiency over the past 10 years. However, the global average PUE has largely stalled at close to 1.6 since 2018, with only marginal gains. This makes sense: for the average figure to show substantial improvement, most facilities would require financially unviable overhauls of their cooling systems to achieve notably better efficiencies, while modern builds already operate near the physical limits of air cooling.

A growing number of operators are looking to direct liquid cooling (DLC) for the next leap in infrastructure efficiency. But a switch to liquid cooling at scale involves operational and supply chain complexities that challenge even the most resourceful technical organizations. Uptime is aware of only one major operator that runs DLC as standard: French hosting and cloud provider OVHcloud, which is an outlier with a vertically integrated infrastructure using custom in-house water cold plate and server designs.

When it comes to the use of liquid cooling, an often-overlooked part of the cooling infrastructure is heat rejection. Rejecting heat into the atmosphere is a major source of inefficiency, manifesting not only in energy use, but also in capital costs and in the large reserves of power held for worst-case (design day) cooling needs.

A small number of data centers have been using water features as heat sinks successfully for some years. Instead of rejecting heat through cooling towers, air-cooled chillers or other means that rely on ambient air, some facilities use a closed chilled water loop that rejects heat through a heat exchanger cooled by an open loop of water. These designs extend the benefits of water’s thermal properties from heat transport inside the data center to the rejection of heat outside the facility.

Figure 1. Schematic of a data center once-through system

The idea of using water for heat rejection, of course, is not new. Known as once-through cooling, these systems are used extensively in thermoelectric power generation and manufacturing industries for their efficiency and reliability in handling large heat loads. Because IT heat loads are relatively small by comparison, and data centers tend to cluster around population centers, which in turn tend to be situated near water, Uptime considers the approach to have wide geographical applicability in future data center construction projects.

Uptime’s research has identified more than a dozen data center sites, some operated by global brands, that use a water feature as a heat sink. All once-through cooling designs use some custom equipment: no off-the-shelf designs are currently commercially available for data centers. While the facilities we studied vary in size, location and some engineering choices, there are some commonalities between the projects.

Operators we’ve interviewed for the research (all of them colocation providers) considered their once-through cooling projects to be both a technical and a business success, achieving strong operational performance and attracting customers. The energy price crisis that started in 2021, combined with a corporate rush to claim strong sustainability credentials, reportedly boosted the level of return on these investments past even optimistic scenarios.

Rejecting heat into bodies of water allows for stable PUEs year-round, meaning that colocation providers can serve a larger IT load from the same site power envelope. Another benefit is the ability to lower computer room temperatures, to 64°F to 68°F (18°C to 20°C) for example, for “free”: this comes with no PUE penalty. Low-temperature air supply helps operators minimize IT component failures and accommodate future high-performance servers with sufficient cooling. If the water feature is naturally flowing or replenished, operators also eliminate the need for chillers or other large cooling systems, which would otherwise be required as backup.
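
As a rough sketch of why a stable, low PUE lets a provider serve more IT load from a fixed site power envelope, consider the arithmetic below. This is a minimal illustration: the 10 MW envelope and the 1.1 PUE for a once-through design are hypothetical figures, not data from the facilities studied.

```python
def max_it_load_kw(site_power_kw: float, pue: float) -> float:
    """IT load that fits within a fixed site power envelope at a given PUE.

    PUE = total facility power / IT power, so IT power = total / PUE.
    """
    return site_power_kw / pue

SITE_POWER_KW = 10_000  # hypothetical 10 MW site power envelope

# Air-cooled facility at the stalled ~1.6 global average PUE:
print(max_it_load_kw(SITE_POWER_KW, 1.6))  # 6250.0 kW of IT load

# Once-through water cooling at an assumed stable PUE of 1.1:
print(max_it_load_kw(SITE_POWER_KW, 1.1))  # ~9090.9 kW of IT load
```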

Still, it was far from certain during design and construction that these favorable outcomes would outweigh the required investments, as every undertaking involved nontrivial engineering effort and associated costs. Committed sponsorship from senior management was critical for these projects to be given the green light and to overcome unexpected difficulties. Encouraged by the positive experience of the facilities we studied, Uptime expects once-through cooling to gather more interest in the future. A more mature market for these designs will factor into siting decisions, and into jurisdictional permitting processes that treat efficiency and sustainability as criteria. Once-through systems will also help to maximize the energy efficiency benefits of future DLC rollouts through “free” low-temperature operations, creating an end-to-end liquid cooling infrastructure.

By: Jacqueline Davis, Research Analyst, Uptime Institute and Daniel Bizo, Research Director, Uptime Institute

Cloud SLAs punish, not compensate

A service level agreement (SLA) is a contract between a cloud provider and a user. The SLA describes the provider’s minimum level of service, specified by performance metrics, and the compensation due to the user should the provider fail to deliver this service.

In practice, however, this compensation is punitive. It seems designed to punish the provider for its failure, not to compensate the user for their loss.

A key metric used in cloud SLAs is availability. Cloud availability is generally expressed as the percentage of a defined period (usually a month) during which a resource has external connectivity. If a user can’t connect over the internet to the resource due to a cloud provider issue, the resource is deemed unavailable or ‘down’.

For example, Amazon Web Services (AWS) offers a 99.5% SLA on a virtual machine. There are 730 hours on average in a month, so the virtual machine should be available (or ‘up’) for around 726.5 hours of those 730 hours. Compensation is due if the virtual machine fails to meet this minimum availability during a month.

Alternatively, we could say AWS is not liable to pay compensation unless the virtual machine experiences total downtime of more than 0.5% of a month, or around 3.7 hours on average (given that the lengths of months differ). Google Cloud, Microsoft Azure and other cloud providers offer similar terms. Table 1 shows common availability SLA performance metrics.

Table 1. Common availability SLA performance metrics
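
The downtime allowances behind such availability percentages follow directly from the arithmetic described above. A minimal sketch, using the ~730.5-hour average month implied by the refund thresholds quoted in this article:

```python
AVG_HOURS_PER_MONTH = 8766 / 12  # 730.5 hours, averaged across leap years

def max_downtime_minutes(sla_percent: float) -> float:
    """Monthly downtime a resource can accrue before breaching its SLA."""
    return AVG_HOURS_PER_MONTH * 60 * (1 - sla_percent / 100)

for sla in (99.5, 99.9, 99.99):
    print(f"{sla}% SLA -> {max_downtime_minutes(sla):.1f} minutes/month")

# 99.5% SLA -> ~219 minutes/month (~3.7 hours)
# 99.9% SLA -> ~44 minutes/month
# 99.99% SLA -> ~4.4 minutes/month
```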

If the virtual machine is distributed across two availability zones (logical data centers, see Cloud scalability and resiliency from first principles), the typical SLA for availability increases to 99.99% (equivalent to an average of around 4 minutes of downtime per month).

The business impact of downtime to users depends on the situation. If short-lived outages occur overnight, the downtime may have little effect on revenue or productivity. However, a long outage in the middle of a major retail event, for example, is likely to hit income and reputation hard. In the 2021 Uptime Institute data center survey, the average cost of respondents’ most significant recent downtime incident was $973,000. This average does not include the 2% of respondents who estimate they lost more than $40M for their most recent worst downtime incident.

SLA compensation doesn’t even scratch the surface of these losses. If a single virtual machine is down for between 3 hours, 39 minutes and 7 hours, 18 minutes in a month (between 99% and 99.5% monthly availability), AWS will pay 10% of the monthly cost of that virtual machine. Considering the price of a small instance (a ‘t4g.nano’) in the large US-East-1 region (in Northern Virginia, US) is around $3 per month, total compensation for this outage would be 30 cents.

If a virtual machine is down for up to 36 hours, 31 minutes (between 95% and 99% availability in a month), the compensation is just 30%: just under a dollar. The user only receives a full refund for the month if the resource is down for more than one day, 12 hours and 31 minutes in total.
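
Expressed as code, the tiered credit schedule looks roughly like this. This is a sketch based on the tiers quoted above; the function name is hypothetical and the exact boundary handling should be verified against AWS’s current published terms:

```python
def aws_instance_credit(availability_percent: float, monthly_fee: float) -> float:
    """Service credit for a single instance under the tiered SLA above.

    Tiers as described in the text: 10% credit below 99.5% availability,
    30% below 99.0%, and a full refund below 95.0%.
    """
    if availability_percent < 95.0:
        return monthly_fee          # full refund
    if availability_percent < 99.0:
        return monthly_fee * 0.30
    if availability_percent < 99.5:
        return monthly_fee * 0.10
    return 0.0                      # SLA met: no credit due

# A $3/month 't4g.nano' down for ~5 hours (about 99.3% availability):
print(aws_instance_credit(99.3, 3.00))  # 0.3 -> 30 cents
```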

When a failure occurs, the user is responsible for measuring downtime and requesting compensation; this is not provided automatically. Users usually need to raise a support request, with service logs as proof of the outage. If the cloud provider approves the request, compensation is offered in service credits, not cash. Together, these conditions mean that users must detect an outage and apply for a service credit, which can only be redeemed with the offending cloud provider, and which is unlikely to cover the cost of the outage. Is the benefit of the compensation worth the effort?

Cloud providers are upfront about these compensation terms — they can be found easily on their websites. They are not being unreasonable by limiting their liabilities, but cloud users need to be realistic about the value of SLAs.

An SLA incentivizes the cloud provider to meet performance requirements with a threat of punitive losses. An SLA is also a marketing tool designed to convince the buyer that the cloud provider can be trusted to host business-critical applications.

Customers should not consider SLAs (or 99.9x% availability figures) as reliable predictors of future availability — or even a record of past levels of availability. The frequency of cloud outages (and their duration) suggests that actual performance of many service providers falls short of published SLAs. (For a more detailed discussion on outages, see our Annual outages analysis 2021: The causes and impacts of data center outages.)

In short, users shouldn’t put too much faith in SLAs. They are likely to indicate a cloud provider’s minimum standard of service: few cloud providers would guarantee compensation on metrics they didn’t feel they could realistically achieve, even if the compensation is minimal. But an SLA is not an insurance policy; it does not mitigate the business impact of downtime.

Is concern over cloud and third-party outages increasing?

As a result of some high-profile outages and the growing interest in running critical services in a public cloud, the reliability — and transparency — of public cloud services has come under scrutiny.

Cloud services are designed to operate with low failure rates. Large (at-scale) cloud and IT service providers, such as Amazon Web Services, Microsoft Azure and Google Cloud, incorporate layers of software and middleware, balance capacity across systems, networks and data centers, and reroute workloads and traffic away from failures and troubled sites.

On the whole, these architectures provide high levels of service availability at scale. Despite this, no architecture is fail-safe and many failures have been recorded that can be attributed to the difficulties of managing such complex, at-scale software, data and networks.

Our recent resiliency survey of data center and IT managers shows that enterprise managers have reasonable concerns about the resiliency of public cloud services (see Figure 1). Only one in seven (14%) respondents say public cloud services are resilient enough to run all their workloads. The same proportion say the cloud is not resilient enough to run any of their workloads; and 32% say the cloud is only resilient enough to run some of their workloads. The increase in the number of “don’t know” responses since our 2021 survey also suggests that confidence in the resiliency of the cloud has become shrouded in uncertainty and skepticism.

Figure 1. Most say cloud is only resilient enough for some workloads

Concerns over cloud resiliency may be partly due to several recent outages being attributed to third-party service providers. In our resiliency survey, 39% of organizations suffered an outage that was caused by a problem with a third-party supplier.

As more workloads are outsourced to external providers, these operators account for a growing share of high-profile public outages. Third-party commercial operators of IT and / or data centers (cloud, hosting, colocation, digital services, telecommunications, etc.) combined have accounted for almost 63% of all public outages since 2016, when Uptime Institute started tracking them. This percentage has crept up year by year: in 2021, the combined proportion of outages caused by these commercial operators was 70%.

When third-party IT and data center service providers do have an outage, customers are affected immediately. These customers may seek compensation — and a full explanation. Many regulators and enterprises now want increased visibility, accountability and improved service level agreements — especially for cloud providers.

Watch our 2022 Outage Report Webinar for more research on the causes, frequency and costs of digital infrastructure outages. The complete 2022 Annual Outage Analysis report is available exclusively to Uptime Institute members.

Outages: understanding the human factor

Analyzing human error — with a view to preventing it — has always been challenging for data center operators. The cause of a failure can lie in how well a process was taught; in how tired, how well trained or how well resourced the staff are; or in whether the equipment itself was unnecessarily difficult to operate.

There are also definitional questions: If a machine fails due to a software error at the factory, is that human error? In addition, human error can play a role in outages that are attributed to other causes. For these reasons, Uptime Institute tends to research human error in a different way to other causes — we view it as one causal factor in outages, and rarely a single or root cause. Using this methodology, Uptime estimates (based on 25 years of data) that human error plays some role in about two-thirds of all outages.

In Uptime’s recent surveys on resiliency, we have tried to understand the makeup of some of these human error-related failures. As the figure below shows, human error-related outages are most commonly caused either by staff failing to follow procedures (even where they have been agreed and codified) or because the procedures themselves are faulty.

Figure 1. Most common causes of major human error-related outages

This underscores a key tenet of Uptime: good training and processes both play a major role in outage reduction and can be implemented without great expense or capital expenditure.

Watch our 2022 Outage Report Webinar for more research on the causes, frequency and costs of digital infrastructure outages. The complete 2022 Annual Outage Analysis report is available exclusively to Uptime Institute members.

The weakest link dictates cloud outage compensation

Cloud providers offer services that are assembled by users into applications. An outage of any single cloud service can render an application unavailable. Importantly, cloud providers guarantee the availability of individual services, not of entire applications. Even if a whole application becomes unresponsive due to a provider outage, compensation is only due for the individual services that failed.

A small outage in a tiny part of an application may wreak havoc on the whole application (and even an entire business), but the provider will only award compensation for the weakest link: the individual service that caused the outage.

A service level agreement (SLA) sets out the likely or promised availability of a specific cloud service, plus compensation due to the user if the provider fails to meet this availability. There is a different SLA for each cloud service.

Consider the cloud architecture in Figure 1, which shows traffic flows between virtual machines and a virtual load balancer. The load balancer distributes traffic across nine virtual machines. The virtual machines use the infrastructure as a service (IaaS) model, meaning the user is responsible for architecting virtual machines to be resilient. The load balancer, however, is a platform as a service (PaaS), which should mean the provider has architected it for resiliency. (More information on cloud models can be found in Uptime Institute’s recent report Cloud scalability and resiliency from first principles.)

Figure 1. Simple load-balanced cloud architecture

If the load balancer becomes unresponsive, the entire application is unusable as traffic can’t be routed to the virtual machines. The load balancer in this architecture is the weakest link.

The provider is responsible for the resiliency of the load balancer. Is the load balancer a single point of failure? It depends on the perception of the user. It may not be regarded as a single point of failure if the user is fully confident the provider has architected resiliency correctly through, for example, redundant servers or automatic failover. If the user is less convinced, they may consider the load balancer as a single point of failure because it is controlled by a single entity over which the user has limited visibility and control.

The virtual machines are still “available” because they are still in operation and can be accessed by the user for administration or maintenance — they just can’t be accessed by the end user via the application. The load balancer is the problem, not the virtual machines. In this scenario, no compensation is payable for the virtual machines even though the application has gone down.

To understand the financial impact, we’ll assume each virtual machine and the load balancer cost $3 per month. As an example of a cloud SLA, Microsoft Azure offers compensation of 10% of the monthly cost of the load balancer if its availability in a month falls below 99.99% but remains at or above 99.9%. Similar terms also apply to virtual machines.

If the load balancer is down for 43 minutes, then Microsoft Azure is obliged to pay 10% of the monthly fee of the load balancer, so $0.30 in this case. It is not obliged to pay for the virtual machines as these continued in operation, despite the application becoming unresponsive. The total monthly cost of the application is $30, and the compensation is $0.30, which means the payment for the outage is 1% of the monthly fee paid to Microsoft Azure — a paltry sum compared with the likely impact of the outage. Table 1 provides an overview of the refunds due in this scenario.

Table 1. SLA refunds due in example scenario
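
A short sketch of the refund arithmetic in this scenario. The resource counts, the $3 monthly fees and the 10% credit tier come from the example above; the variable names are illustrative:

```python
MONTHLY_FEE = 3.00          # assumed price per resource, per month
CREDIT_RATE = 0.10          # 10% tier for 99.9% to 99.99% availability

# One load balancer and nine virtual machines make up the application.
resources = {"load_balancer": 1, "virtual_machine": 9}
failed = {"load_balancer"}  # the weakest link: the only service that breached its SLA

total_bill = sum(resources.values()) * MONTHLY_FEE
compensation = sum(
    count * MONTHLY_FEE * CREDIT_RATE
    for name, count in resources.items()
    if name in failed
)

print(f"Monthly bill:  ${total_bill:.2f}")    # $30.00
print(f"Compensation:  ${compensation:.2f}")  # $0.30, i.e., 1% of the bill
```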

This example demonstrates two points. First, cloud providers supply services that users assemble into applications. Their responsibility ends with those services. The analogy of cloud services as toy bricks may seem trite, but it effectively conveys some fundamental aspects of the cloud model. A toy company may guarantee the quality of its bricks, but it would not guarantee the quality of a model completed by an enthusiastic builder, no matter how well built.

In the architecture in Figure 1, for example, the user could have designed a more scalable application. They could have reduced the cost of the virtual machines by implementing autoscaling, which would automatically terminate them when not in use (such as when the load balancer went down). The user is ultimately responsible for building a resilient and scalable application, just as the quality of the toy model is the builder’s responsibility.

Second, this example also demonstrates that an SLA is not an insurance policy. It does not mitigate the business impact of downtime. In practice, compensation for cloud outages is likely to be less than originally assumed by users due to nuances in contract terms, which reduce the provider’s liability. Cloud customers must identify single points of failure in their applications and assess the risk and impact of an outage of these services. An outage of these services can render an application unavailable, but SLA compensation is unlikely to even cover the cost of the cloud services, let alone business losses. Ultimately, users are responsible for architecting greater resiliency in their applications to reduce the impact of outages.

The shock waves from Ukraine

How is the Ukraine conflict affecting digital infrastructure and the data center sector? In response to multiple queries, Uptime Institute Intelligence has identified six main areas where operators and customers of digital infrastructure are experiencing effects from the conflict, or will do so soon, not just in Europe but globally. (Russia and Ukraine host more than 150 data centers between them, but these are not included in the analysis.)

Energy prices

Electricity prices rose dramatically in 2021, partly because of the post-COVID boom in economic activity, and partly because of tight energy markets caused by a combination of weak renewable energy production and a Russian squeeze on gas supplies. Electricity can account for around a quarter to two-thirds of the operational costs of a data center. Prices are now near record highs, peaking at €400 per megawatt-hour in some European markets shortly after the invasion in February. Because electricity prices globally are linked to liquefied natural gas, forward prices across the world will remain high for at least the rest of 2022, especially in Europe.

To contain the effects of price volatility, most large data center operators attempt to reduce their exposure by buying in advance, usually in half-year periods. But some operators were caught out by the Russian invasion of Ukraine, having switched to the spot markets in anticipation of falling prices.

Exposure to high energy prices among data center operators varies according to business type and how well businesses have managed their risks. Colocation companies are the least exposed: they buy in advance and usually have the contractual right to pass power prices on to their customers. Some providers, however, offer forward pricing and price locks, and may be more exposed if they also offer commodity hosting and cloud-like services.

Enterprises cannot pass on energy costs, except by charging more for their own products and services. Many enterprise operators do not lock in their own electricity costs in advance, while their colocation partners usually do. Some enterprises are benefiting from the diligent purchasing of their colocation partners.

The biggest cloud data center operators can only pass on power price increases by raising cloud prices. Recently, they have been buying more energy on the spot market for their own facilities and haven’t locked in lower prices — exposing them to increases at scale. However, those who have invested in renewable power purchase agreements — which is common among large cloud providers — are benefiting because they have locked in their green power at lower prices for several years ahead.

Economic slowdown?

Prior to late February 2022, world economies were recovering from COVID, but the Ukraine crisis has created a darker and more complex picture. The United Nations Conference on Trade and Development (UNCTAD) cut its global growth forecast from 3.6% to 2.6% in March. In a recent report, it highlighted inflation, fuel and food prices, disruption of trade flows and supply chains, and rising debt levels as areas of concern.

Not all economies and industries will be affected equally. But while the COVID pandemic clearly boosted demand for data centers and digital services, a different, widespread slowdown may dampen demand for digital services and data centers, partly through a tightening of capital availability, and partly due to runaway energy costs, which make computing more expensive.

Cybersecurity

Cyberattacks on business have increased steadily in number and severity since the early 1980s. Organizations in countries that are considered critical of Russia — as well as those in Russia itself — have been warned of a sharp increase in attacks.

To date, the most serious attacks have been directly linked to the conflict. For example, hundreds of thousands of people across eastern Europe lost communications in the early days of the war due to a cyberattack on a satellite network. There have also been deepfakes inserted into media services on both sides; and thousands of Russian businesses have suffered distributed denial-of-service attacks.

Security analysts are warning that a perceived fall in cyberattacks on the West from Russian hackers is likely to be temporary. Governments, data centers and IT services companies, energy companies, financial services and media are all regarded as critical infrastructure and therefore likely targets.

Sustainability paradox

In Europe, high energy prices in 2021 and concerns over climate change caused both political and economic difficulties. This was not sufficient to trigger major structural changes in policy or investment, but this may now change.

For the data center sector, two trends are unfolding with diametrically opposite effects on carbon emissions. First, to ensure energy availability, utilities in Europe — with the support of governments — are planning to use more high-carbon coal to make up for a reduced supply of Russian gas. This will at times increase the carbon content of the grid and could increase the number of renewable energy certificates or offsets that data center operators must buy to achieve zero-carbon emissions status.

Second, there is also a countertrend: high prices reduce demand. To date, there has not been a dramatic increase in coal use, although coal prices have risen steeply. In the longer term, energy security concerns will drive up utility and government investment in renewable energy and in nuclear. The UK government, for example, is considering a plan to invest up to $140 billion in nuclear energy.

In the data center sector, energy-saving projects may now be more financially viable. These range from upgrading inefficient cooling systems and recycling heat to raising temperatures, reducing unnecessary redundancy and addressing low IT utilization.

But operators be warned: a switch from fossil fuels to renewable energy, with grid or localized storage, means new dependencies and new global supply chains.

Separately, those European and US legislators who have campaigned for Bitcoin’s energy-intensive proof-of-work protocol to be outlawed will now find new political support. Calls are growing for the energy devoted to Bitcoin to be used more productively (see Are proof-of-work blockchains a corporate sustainability issue?).

Supply chain and component shortages

The digital infrastructure sector (like many other sectors) has been dogged by supply chain problems, shortages and price spikes since the early days of the COVID pandemic. Data center projects are regularly delayed because of equipment supply problems.

The Ukraine conflict is exacerbating this. Semiconductor and power electronics fabrication, which requires various elements produced in Ukraine and Russia (e.g., neon, vanadium, palladium, nickel, titanium), may suffer shortages and delays if the war drags on.

The delivery of machines that use microelectronic components (which means nearly all of them) could also suffer delays. At the height of the pandemic, cooling equipment, uninterruptible power supplies and switchgear were the most delayed items, and this is likely to be the case again.

Following the pandemic, most of the larger operators took steps to alleviate the effects of operational component shortages, such as building inventory and arranging for second sources, which may cushion the effect of these shortages.

Border zones

Even before the pandemic and the war in Ukraine, there were concerns about geopolitical instability and the vulnerabilities of globalization. These concerns have intensified. Several large European and US corporations, for example, outsource some IT functions to Ukraine and Russia, particularly software development. Thousands of Ukrainian developers have been forced to flee the war, while Russian IT professionals have lost access to internet applications and cloud services — and the ability to be paid.

Organizations will react in different ways, such as by shortening and localizing supply chains and insourcing within national borders. Global cloud providers will likely need to introduce more regions to address country-specific requirements — as was seen in the lead-up to Brexit — and this is expected to require (yet) more data center capacity.


Uptime Institute will continue to monitor the Ukraine conflict and its impact on the global digital infrastructure sector.