Extreme heat stress-tests European data centers – again

An extreme heat wave swept across much of Western Europe on July 18 and 19, 2022, hitting some of the largest metropolitan areas, such as Frankfurt, London, Amsterdam and Paris — which are also global data center hubs with hundreds of megawatts of capacity each. In the London area, temperatures at Heathrow Airport surpassed 40°C / 104°F to set an all-time record for the United Kingdom.

Other areas in the UK and continental Europe escaped new records only because the July 2019 heatwave had already set historic highs in Germany, France and the Netherlands, among other countries. Since most data centers now in operation were built before 2019, this most recent heatwave approached or surpassed the ambient temperature design specifications of many facilities.

Extreme heat stresses cooling systems by making components such as compressors, pumps and fans work harder than usual, which increases the likelihood of failures. Failures happen not only because of increased wear on the cooling equipment, but also because of lapses in maintenance, such as the regular cleaning of heat-exchange coils. Most susceptible are air-cooled mechanical compressors in direct-expansion (DX) units and water chillers without economizers. DX cooling systems are more likely to rely on ambient air for heat rejection because they tend to be relatively small in scale and are often installed in buildings that do not lend themselves to the larger cooling infrastructure required for evaporative cooling units.

Cooling is not the only system at risk from extreme heat. Any external power equipment, such as backup generators, is also susceptible. If the utility grid falters amid extreme temperatures and a generator needs to take the load, it may not be able to deliver full nameplate power, and it may even shut down to avoid damage from overheating.

Although some cloud service providers reportedly saw disruption due to thermal events in recent weeks, most data centers likely made it through the heatwave without a major incident. Redundancy in power and cooling, combined with good upkeep of equipment, should nearly eliminate the chances of an outage even if some components fail. Many operators have an additional cushion because data centers typically do not run at full utilization: tapping into cooling capacity that is normally held in reserve can help maintain acceptable data hall temperatures during the peak of a heatwave. In contrast, cloud providers tend to drive their IT infrastructure harder, leaving less margin for error.

As the effects of climate change become more pronounced, making extreme weather events more likely, operators may need to revisit the climate resiliency of their sites. Uptime Institute recommends reassessing climatic conditions for each site regularly. Design conditions against which the data center was built may be out of date — in some cases by more than a decade. This can adversely impact a data center’s ability to support the original design load, even if there are no failures. And a loss of capacity may mean losing some redundancy in the event of an equipment failure. Should that coincide with high utilization of the infrastructure, a data center may not have sufficient reserves to maintain the load.

Operators have several potential responses to choose from, depending on the business case and the technical realities of the facility. One is to derate the maximum capacity of the facility on the view that its original sizing will not be needed. Operators can also increase the target supply air temperature, or allow it to rise temporarily wherever there is headroom (e.g., from 70°F / 21°C to 75°F / 24°C), to reduce the load on cooling systems and maintain full capacity. This could also involve elevating chilled water temperatures if there is sufficient pumping capacity. Another option is to tap into some of the redundancy to bring more cooling capacity online, including the option to operate at N capacity (no redundancy) on a temporary basis.
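
The trade-offs behind these options can be sketched in rough numbers. The example below is a minimal illustration rather than an engineering tool: the derating coefficient, setpoint gain, design ambient and load figures are all assumptions, and real values vary by equipment and site.

```python
# Rough sketch: can a facility still carry its heat load during a heatwave,
# and how much margin does raising the supply air setpoint recover?
# All coefficients and loads below are illustrative assumptions, not vendor data.

DESIGN_AMBIENT_C = 35.0     # ambient temperature the cooling plant was designed for (assumed)
DERATE_PER_C = 0.015        # assumed ~1.5% capacity lost per °C above design ambient
GAIN_PER_C_SETPOINT = 0.02  # assumed ~2% capacity recovered per °C of higher supply air

def usable_cooling_kw(nameplate_kw, ambient_c, setpoint_raise_c=0.0):
    """Estimate usable cooling capacity, capped at nameplate, for a given ambient."""
    derate = max(0.0, ambient_c - DESIGN_AMBIENT_C) * DERATE_PER_C
    boost = setpoint_raise_c * GAIN_PER_C_SETPOINT
    return nameplate_kw * (1.0 - max(0.0, derate - boost))

nameplate = 6000.0  # kW of cooling at N capacity (redundant units excluded)
heat_load = 5600.0  # kW of IT load plus other heat gains

for ambient in (35, 38, 40, 42):
    base = usable_cooling_kw(nameplate, ambient)
    raised = usable_cooling_kw(nameplate, ambient, setpoint_raise_c=3.0)  # e.g., 21°C to 24°C
    print(f"{ambient}°C ambient: {base:.0f} kW at normal setpoint, "
          f"{raised:.0f} kW with +3°C setpoint; heat load is {heat_load:.0f} kW")
```

Where even a higher setpoint leaves a shortfall, the remaining options apply: derate the facility, or temporarily bring redundant cooling units online.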

A lesson from recent heatwaves in Europe is that the temperature extremes did not coincide with high humidity. This means that evaporative (and adiabatic) cooling systems remained highly effective throughout, and stayed within design conditions for wet-bulb temperature (the lowest temperature to which ambient air can be cooled by evaporating water into it). Adding adiabatic spray systems to DX and chiller units, or evaporative economizers to the heat rejection loop, will be attractive to many operators.
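
The effectiveness of evaporative cooling during a dry heatwave can be checked with a standard approximation. The sketch below uses the Stull (2011) empirical formula for wet-bulb temperature from dry-bulb temperature and relative humidity; the 40°C / 20% relative humidity inputs are illustrative of a dry heatwave peak rather than measurements from any specific site.

```python
import math

def wet_bulb_stull(t_dry_c: float, rh_percent: float) -> float:
    """Stull (2011) approximation of wet-bulb temperature (°C) from
    dry-bulb temperature (°C) and relative humidity (%).
    Valid roughly for RH above 5% at typical ambient temperatures."""
    t, rh = t_dry_c, rh_percent
    return (t * math.atan(0.151977 * math.sqrt(rh + 8.313659))
            + math.atan(t + rh)
            - math.atan(rh - 1.676331)
            + 0.00391838 * rh ** 1.5 * math.atan(0.023101 * rh)
            - 4.686035)

# Illustrative dry-heatwave conditions: 40°C dry bulb at 20% relative humidity
print(round(wet_bulb_stull(40.0, 20.0), 1))  # ~23°C wet bulb
```

Even at a 40°C dry-bulb peak, low humidity keeps the wet-bulb temperature in the low 20s °C, far below the dry-bulb extreme, which is why evaporative and adiabatic systems retained their effectiveness.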

Longer term, the threat of climate change will likely prompt further adjustments in data centers. The difficulty (or even impossibility) of modeling future extreme weather events mean that hardening infrastructure against them may require operators to reconsider siting decisions, and / or adopt major changes to their cooling strategies, such as heat rejection into bodies of water and the transition to direct liquid cooling of IT systems.


Comment from Uptime Institute:

Uptime’s team of global consultants has inspected and certified thousands of enterprise-grade data center facilities worldwide, thoroughly evaluating considerations around extreme outside temperature fluctuations and hundreds of other risk areas at every site. Our Data Center Risk Assessment brings this expertise directly to owners and operators, resulting in a comprehensive review of your facility’s infrastructure, mechanical systems and operations protocols. Learn more here.

Crypto data centers: The good, the bad and the electric

A single tweet in May 2021 brought unprecedented public attention to a relatively unknown issue. Tesla CEO Elon Musk tweeted that because of the large energy consumption associated with the use of Bitcoin, Tesla would no longer accept it as currency. Instead, Tesla would favor other cryptocurrencies that use less than 1% of Bitcoin’s energy per transaction. There has since been much debate – and varying data – about just how much energy crypto data centers consume, yet little focus on how innovative and efficient these facilities can be.

A study by the UK’s Cambridge Centre for Alternative Finance in April 2021 put global electricity consumption by Bitcoin alone at 143 terawatt-hours (TWh) a year. By comparison (according to various estimates*), all the world’s data centers combined used between 210 TWh and 400 TWh — or even more — in 2020. (See Are proof-of-work blockchains a corporate sustainability issue?)

Despite the staggering (estimated) global energy consumption of cryptocurrencies, crypto data center operations are already highly optimized. Although there are legitimate concerns around over-burdening the grid, crypto facilities have deployed innovative approaches and technologies. Instead of focusing primarily on availability and reliability — a necessity for most other data centers — crypto facilities focus mainly on optimizing for cost of compute performance. This is often achieved by eliminating infrastructure (generators, uninterruptible power supplies), innovating silicon design and deploying direct liquid cooling (DLC).

Compared with traditional data centers, crypto IT operations tend to drive out cost and optimize performance, thanks in part to the heavy use of accelerator chips. Some of these accelerators are purpose-built to run cryptocurrency workloads. For example, chipmaker Intel recently launched an accelerator chip intended for crypto miners. Some of these new chips can also be liquid-cooled (or air-cooled), enabling greater efficiencies.

High-density server racks and a focus on return on investment have driven the crypto industry to pioneer and deploy immersion tanks for cooling. The most efficient crypto mining operations leverage ultra-high-density server configurations — about 100 kilowatts (kW) per rack is not unheard of — to potentially reduce both upfront building costs and operating costs.

Texas (US) has emerged as a crypto data center hub. Current estimates place the total installed capacity of crypto mining facilities in Texas at 500 to 1,000 megawatts (MW), with plans to increase this to between 3,000 and 5,000 MW by 2023. The combination of cheap (sometimes renewable) energy and a deregulated energy market has made Texas a top choice for large, industrial-scale crypto mining facilities. Some crypto facilities there have large lithium-ion battery storage capacity, which they exploit via a transactive relationship with a local utility.

One such facility, the Whinstone data center, has a total of 400 MW of installed capacity, with provisioning for up to about 700 MW. The facility uses on-site battery stores and has a power supply contract with the grid to pay only 2.5 cents per kilowatt-hour. Increasingly, large crypto miners also seek to claim better sustainability credentials, and the use of renewable energy has become a recent focus. Some crypto data centers can reportedly increase their utilization during periods of surplus renewable energy on the grid, as part of their transactional agreement with a utility.
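
The scale of such arrangements is easy to illustrate. The sketch below simply multiplies out the capacity and price figures quoted above; the utilization figure is our assumption, and real consumption will vary with curtailment and market events.

```python
# Back-of-the-envelope annual energy cost for a large crypto mining site.
# Capacity and price are the figures quoted above; utilization is an assumption.
installed_mw = 400      # installed capacity (MW)
price_per_kwh = 0.025   # contracted price ($/kWh)
utilization = 0.90      # assumed average utilization

annual_kwh = installed_mw * 1000 * 8760 * utilization
annual_cost = annual_kwh * price_per_kwh
print(f"{annual_kwh / 1e9:.2f} TWh per year, costing about ${annual_cost / 1e6:.0f} million")
# -> roughly 3.15 TWh and ~$79 million a year under these assumptions
```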

Despite all the heat on crypto data centers, it is a sector of digital infrastructure that is already more optimized for high efficiency than most.

*There have been several studies of global energy use by IT and data centers. A recent and insightful paper, by the authors who reached the lower 210 TWh figure, is Recalibrating global data center energy-use estimates (Masanet E, Shehabi A, Lei N, et al. Science 2020; 367:984–986).

Making sense of the outage numbers

In recent years, Uptime Institute has published regular reports examining both the rate and causes of data center and IT service outages. The reports, which have been widely read and reported in the media, paint a picture of an industry that is struggling with resiliency and reliability — and one where operators regularly suffer downtime, disruption, and reputational and financial damage.

Is this a fair picture? Rather like the authors of a scientific paper whose findings from a small experiment are hailed as a major breakthrough, Uptime Institute Intelligence has often felt a certain unease when these complex findings, drawn from an ever-changing environment, are distilled into sound bites and headlines.

In May this year, Uptime Intelligence published its Annual outage analysis 2022. The key findings were worded cautiously: outage rates are not falling; many outages have serious / severe consequences; the cost and impact of outages is increasing. This year, the media largely reported the findings accurately, picking different angles on the data — but this has not always been the case.

What does Uptime Institute think about the overall outages rate? Is reliability good or bad? Are outages rising or falling? If there were straightforward answers, there would be no need for this discussion. The reality, however, is that outages are both worsening and improving.

In our recent outage analysis report, four in five organizations surveyed say they’d had an outage in the past three years (Figure 1). This is in line with the previous years’ results and is consistent with various other Uptime studies. 

Figure 1. Most organizations experienced an outage in the past three years

A smaller proportion, about one in five, have had a “serious” or “severe” outage (Uptime classes outages on a severity scale of one to five; these are levels four and five), which means the outcome has serious or severe financial and reputational consequences. This is consistent with our previous studies and our data also shows the cost of outages is increasing. 

By combining this information, we can see that the rate of outages, and their severity and impact, are not improving — in some ways they are worsening. But hold on, say many providers of data center services and IT: we know our equipment is much better than it was, and we know that IT and data center technicians have better tools and skills — so why aren’t things improving? The fact is, they are.

Our data and findings are based on multiple sources: some reliable, others less so. The primary tools we use are large, multinational surveys of IT and data center operators. Respondents report on outages of the IT services delivered from their data center site(s) or IT operation. Therefore, the outage rate is “per site” or “per organization”.

This is important because the number of organizations with IT and data centers has increased significantly. Even more notable is the amount of IT (data, compute, IT services) per site / organization, which is rising dramatically every year.

What do we conclude? First, the rate of outages per site / company / IT operation is steady on average, neither rising nor falling. Second, the total number of outages is rising steadily, though not dramatically, as the number of organizations either using or offering IT increases. Lastly, the number of outages as a percentage of all IT delivered is falling steadily, if not dramatically.
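
A stylized calculation makes the distinction between these three measures concrete. The numbers below are invented purely for illustration and are not Uptime survey data.

```python
# Illustrative only: how a flat per-site outage rate can coexist with
# rising total outages and a falling rate per unit of IT delivered.
sites = {2019: 10_000, 2022: 13_000}    # number of sites/organizations (assumed)
it_per_site = {2019: 1.0, 2022: 1.8}    # relative IT delivered per site (assumed)
outages_per_site = 0.27                 # assumed flat annual outage rate per site

for year in (2019, 2022):
    total_outages = sites[year] * outages_per_site
    total_it = sites[year] * it_per_site[year]
    print(f"{year}: {total_outages:.0f} outages in total, "
          f"{total_outages / total_it:.3f} outages per unit of IT delivered")
# Total outages rise with the number of sites, while outages per unit of IT fall,
# because the IT delivered per site grows faster than the per-site outage rate.
```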

This analysis is not easy for the media to summarize in one headline. But let’s make one more observation as a comparison. In 1970, there were 298 air crashes, which resulted in 2,226 deaths; in 2021, there were 84 air crashes, which resulted in 359 deaths. This is an enormous improvement, particularly allowing for the huge increase in passenger miles flown. If the airline safety record were similar to the IT industry’s outage rate, there would still be many hundreds of crashes per year and thousands of deaths.

This is perhaps not a like-for-like comparison — flight safety is, after all, always a matter of life and death. It does, however, demonstrate the power of collective commitment, transparency of reporting, regulation and investment. As IT becomes more critical, the question for the sector, for regulators and for every IT service and data center operator is (as it always has been): what level of outage risk is acceptable?

Watch our 2022 Outage Report Webinar for more from Uptime Institute Intelligence on the causes, frequency and costs of digital infrastructure outages. The complete 2022 Annual Outage Analysis report is available exclusively to Uptime Institute members. Learn more about Uptime Institute Membership and request a free guest trial here.

Cloud price increases damage trust

In general, the prices of cloud services either remain level or decrease. There are occasional price increases, but these are typically restricted to specific features; blanket price increases across product families are rare.

Price cuts are often the result of improved operating efficiencies. Through automation, slicker processes, Moore’s law improvements in hardware and economies of scale, cloud providers can squeeze their costs and pass savings on to their customers (see Cloud generations drive down prices).

Cost base isn’t the only factor impacting price. Cloud providers need to demonstrate ongoing value. Value is the relationship between function and cost. Although central processing units are becoming increasingly powerful, most users don’t see the functionality of a slightly faster clock speed as valuable — they would rather pay less. As a result, the price of virtual machines tends to decrease.

The prices of other cloud services remain flat or decrease more slowly than those of virtual machines. With these services, the cloud provider pockets the benefits of improved operational efficiencies or invests them in new functionality to increase the perceived value of the service. Occasionally, cloud providers will cut the price of these services to garner attention and drive sales volume. Unlike with virtual machines, users of these services seek improved capability (at least for the time being) rather than lower prices.

Price increases are rare for two reasons. First, improvements in operational efficiency mean the provider doesn’t have to raise prices to maintain or grow margin. Second, the cloud provider doesn’t want to be perceived as taking advantage of customers that are already locked into its services.

Cloud buyers place significant trust in their cloud providers. Only a decade ago, cloud computing was viewed as being unsuitable for enterprise deployments by many buyers. Trusting a third party to host business-impacting applications takes a leap of faith: for example, there is no service level agreement that adequately covers the business cost of an outage. Cloud adoption has grown significantly over the past decade, and this reflects the increased trust in both the cloud model and individual cloud providers.

One of the major concerns of using cloud computing is vendor lock-in: users can’t easily migrate applications hosted on one cloud provider to another. (See Cloud scalability and resiliency from first principles.) If the price of the cloud services increases, the user has no choice but to accept the price increase or else plan a costly migration.

Despite this financial anxiety, price increases have not materialized. Most cloud providers have realized that increasing prices would damage the customer trust they’ve spent so long cultivating. Cloud providers want to maintain good relationships with their customers, so that they are the de facto provider of choice for new projects and developments.

However, cloud providers face new and ongoing challenges. The COVID-19 pandemic and the current Russia-Ukraine conflict have disrupted supply chains. Cloud providers may also face internal cost pressures and spend more on energy, hardware components and people. But raising prices could be perceived as price-gouging, especially as their customers are operating under similar economic pressures.

In light of these challenges, it’s surprising that Google Cloud has announced that some services will increase in price from October this year. These include a 50% increase to multiregion nearline storage and the doubling of some operations fees. Load balancers will also be subject to an outbound bandwidth charge. Google Cloud has focused on convincing users that it is a relationship-led, enterprise-focused company (not just a consumer business). To make such sweeping price increases would appear to damage its credibility in this regard.
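
How much these changes matter depends on the workload mix. The sketch below is hypothetical: the unit rates are placeholders rather than Google Cloud’s published prices, and only the multipliers reported above (a 50% increase for multiregion nearline storage and a doubling of some operations fees) are taken from the announcement.

```python
# Hypothetical cost impact of the reported multipliers on an example workload.
# Unit rates below are placeholders, NOT Google Cloud's actual price list.
stored_tb = 200          # TB held in multiregion nearline storage (example workload)
ops_millions = 50        # millions of operations per month (example workload)

old_storage_rate = 10.0  # $/TB/month (placeholder)
old_ops_rate = 5.0       # $/million operations (placeholder)

old_bill = stored_tb * old_storage_rate + ops_millions * old_ops_rate
new_bill = stored_tb * old_storage_rate * 1.5 + ops_millions * old_ops_rate * 2.0

print(f"old: ${old_bill:,.0f}/month, new: ${new_bill:,.0f}/month "
      f"({(new_bill / old_bill - 1) * 100:.0f}% increase)")
```

For a workload weighted toward the affected services, the compounded effect can easily exceed the headline percentages; workloads that mostly consume virtual machines may barely notice.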

How will these changes affect Google Cloud’s existing customers? It all depends on the customer’s application architecture. Google Cloud maintains it is raising prices to bring them in line with those of other cloud providers. It is worth noting, however, that a price increase doesn’t necessarily mean Google Cloud will be more expensive than its competitors.

In Q3 2021, Google Cloud revenue increased by 45% to $4.99 billion, up from $3.44 billion in Q3 2020. Despite this growth, the division reported an operating loss of $644 million. Google Cloud’s revenue trails Amazon Web Services and Microsoft Azure by a wide margin, so Google Cloud may be implementing these price increases with a view to building a more profitable and sustainable business.

Will current and prospective customers consider the price increases reasonable or will they feel their trust in Google Cloud has been misplaced? Vendor lock-in is a business risk that needs managing — what’s not clear today is how big a risk it is.

Direct liquid cooling (DLC): pressure is rising but constraints remain

Direct liquid cooling (DLC) is a collection of techniques that removes heat by circulating a coolant to IT electronics. Even though the process is far more efficient than using air, the move to liquid cooling has been largely confined to select applications in high-performance computing to cool extreme high-density IT systems. There are a few examples of operators using DLC at scale, such as OVHcloud in France, but generally DLC continues to be an exception to the air cooling norm. In a survey of enterprise data center operators in the first quarter of 2022, Uptime Institute found that approximately one in six currently uses DLC (see Figure 1).

Figure 1. Many would consider adopting DLC

Uptime Institute Intelligence believes this balance will soon start shifting toward DLC. Renewed market activity has built up around DLC in anticipation of demand, offering a broader set of DLC systems than ever before. Applications that require high-density infrastructure, such as high-performance technical computing, big data analytics and the rapidly emerging field of deep neural networks, are becoming larger and more common. In addition, the pressure on operators to further improve efficiency and sustainability is building, as practical efficiency gains from air cooling have run their course. DLC offers the potential for a step change in these areas.

Yet, it is the evolution of data center IT silicon that will likely start pushing operators toward DLC en masse. High-density racks can be handled tactically and sustainability credentials can be fudged, but escalating processor and server power, together with the tighter temperature limits that come with them, will gradually render air cooling impractical in the second half of this decade.

Air cooling will be unable to handle future server developments without major compromises — such as larger heat sinks, higher fan power or (worse still) the need to lower air temperature. Uptime believes that organizations that cannot handle these next-generation systems because of their cooling requirements will compromise their IT performance — and likely their efficiency — compared with organizations that embrace them (see Moore’s law resumes — but not for all).

While the future need for scaled-up DLC deployments seems clear, there are technical and business complexities. At the root of the challenges involved is the lack of standards within the DLC market. Unlike air cooling, where air acts both as the standard cooling medium and as the interface between facilities and IT, there is no comparable standard coolant or line of demarcation for DLC. Efforts to create some mechanical interface standards, notably within the Open Compute Project and more recently by Intel, will take years to materialize in product form.

However, the DLC systems that are currently available have evolved significantly in recent years and have become easier to adopt, install and operate. A growing number of DLC production installations in live environments are providing vendors with more data to inform product development in terms of reliability, ease of use and material compatibility. Crucially, a patchwork of partnerships between IT vendors, DLC technology providers and system integrators is growing to make liquid-cooled servers more readily available.

Uptime has identified six commercially available categories of DLC systems (three cold plates and three immersion) but expects additional categories to develop in the near future:

  • Water cold plates.
  • Single-phase dielectric cold plates.
  • Two-phase dielectric cold plates.
  • Chassis immersion.
  • Single-phase immersion.
  • Two-phase immersion.

There are more than a dozen specialist vendors actively developing DLC products, typically focusing on one of the above system categories, and numerous others have the capability to do so. Each category has a distinct profile of technical and business trade-offs. We exclude other systems that bring liquid to the rack, such as rear-door heat exchangers, because of their reliance on air as the medium to transfer heat from IT electronics to the facility’s cooling infrastructure. This is not an arbitrary distinction: there are major differences in the technical and business characteristics of these systems.

Uptime’s research indicates that most enterprises are open to making the change to DLC, and indeed project considerable adoption of DLC in the coming years. While the exact profile of the uptake of DLC cannot be realistically modeled, Uptime considers there to be strong evidence for a general shift to DLC. The pace of adoption, however, will likely be constrained by a fractured market of suppliers and vendors, organizational inertia — and a lack of formal standards and components.

Check out this DLC webinar for more in-depth insights on the state of direct liquid cooling in the digital infrastructure sector. The full Intelligence report “The coming era of direct liquid cooling: it’s when, not if” is available exclusively to Uptime Institute members. Learn more about Uptime Institute Membership and request a free guest trial here.

The ultimate liquid cooling: heat rejection into water

Uptime Institute’s data on power usage effectiveness (PUE) is a testament to the progress the data center industry has made in energy efficiency over the past 10 years. However, the global average PUE has largely stalled at close to 1.6 since 2018, with only marginal gains. This makes sense: for the average figure to show substantial improvement, most facilities would require financially unviable overhauls of their cooling systems to achieve notably better efficiencies, while modern builds already operate near the physical limits of air cooling.
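
PUE is the ratio of total facility energy to IT energy, so a stalled average translates directly into overhead energy. The short sketch below, using an arbitrary 10 MW IT load purely for illustration, shows the size of the gap between an average facility and a state-of-the-art build.

```python
# PUE = total facility energy / IT energy, so overhead energy = IT energy * (PUE - 1).
it_load_mw = 10.0        # arbitrary example IT load
hours_per_year = 8760

for pue in (1.6, 1.3, 1.1):
    overhead_gwh = it_load_mw * (pue - 1) * hours_per_year / 1000
    print(f"PUE {pue}: ~{overhead_gwh:.0f} GWh/year of non-IT energy "
          f"for a {it_load_mw:.0f} MW IT load")
# PUE 1.6 -> ~53 GWh/year of overhead; PUE 1.1 -> ~9 GWh/year
```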

A growing number of operators are looking to direct liquid cooling (DLC) for the next leap in infrastructure efficiency. But a switch to liquid cooling at scale involves operational and supply chain complexities that challenge even the most resourceful technical organizations. Uptime is aware of only one major operator that runs DLC as standard: French hosting and cloud provider OVHcloud, which is an outlier with a vertically integrated infrastructure using custom in-house water cold plate and server designs.

When it comes to the use of liquid cooling, an often-overlooked part of the cooling infrastructure is heat rejection. Rejecting heat into the atmosphere is a major source of inefficiencies, manifesting not only in energy use, but also in capital costs and in large reserves of power for worst case (design day) cooling needs.

A small number of data centers have been using water features as heat sinks successfully for some years. Instead of rejecting heat through cooling towers, air-cooled chillers or other means that rely on ambient air, these facilities keep a closed chilled water loop but reject its heat through a heat exchanger that is cooled by an open loop of water. These designs extend the benefits of water’s thermal properties from heat transport inside the data center to the rejection of heat outside the facility.

Figure 1. Schematic of a data center once-through system

The idea of using water for heat rejection is, of course, not new. Known as once-through cooling, these systems are used extensively in thermoelectric power generation and manufacturing industries for their efficiency and reliability in handling large heat loads. Because data center heat loads are relatively small by comparison, and because IT infrastructure tends to cluster around population centers, which in turn tend to be situated near water, Uptime considers the approach to have wide geographical applicability in future data center construction projects.

Uptime’s research has identified more than a dozen data center sites, some operated by global brands, that use a water feature as a heat sink. All once-through cooling designs use some custom equipment — there are currently no off-the-shelf designs that are commercially available for data centers. While the facilities we studied vary in size, location and some engineering choices, there are some commonalities between the projects.

Operators we’ve interviewed for the research (all of them colocation providers) considered their once-through cooling projects to be both a technical and a business success, achieving strong operational performance and attracting customers. The energy price crisis that started in 2021, combined with a corporate rush to claim strong sustainability credentials, reportedly boosted the level of return on these investments past even optimistic scenarios.

Rejecting heat into bodies of water allows for stable PUEs year-round, meaning that colocation providers can serve a larger IT load from the same site power envelope. Another benefit is the ability to lower computer room temperatures to, for example, 64°F to 68°F (18°C to 20°C) for “free”: this does not come with a PUE penalty. A low-temperature air supply helps operators minimize IT component failures and accommodate future high-performance servers with sufficient cooling. If the water feature is naturally flowing or replenished, operators also eliminate the need for chillers or other large cooling systems, which would otherwise be required as backup.
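
The capacity benefit follows from the same PUE arithmetic: for a fixed site power envelope, the supportable IT load is roughly the envelope divided by the worst-case design PUE. The figures in the sketch below are assumptions chosen to illustrate the point, not data from the facilities studied.

```python
# Illustrative: IT load supportable from a fixed site power envelope.
# The envelope and PUE figures are assumptions for the sake of the example.
site_power_envelope_mw = 20.0

for label, design_pue in (("air-cooled, hot design day", 1.5),
                          ("once-through water heat sink", 1.15)):
    it_capacity = site_power_envelope_mw / design_pue
    print(f"{label}: ~{it_capacity:.1f} MW of IT from a "
          f"{site_power_envelope_mw:.0f} MW site envelope")
# A lower, stable worst-case PUE frees several megawatts for revenue-generating IT load.
```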

Still, that these favorable outcomes would outweigh the required investments was far from certain during design and construction, as every undertaking involved nontrivial engineering effort and associated costs. Committed sponsorship from senior management was critical for these projects to be given the green light and to overcome unexpected difficulties. Encouraged by the positive experience of the facilities we studied, Uptime expects once-through cooling to gather more interest in the future. A more mature market for these designs will factor into siting decisions, as will jurisdictional permitting requirements that treat efficiency and sustainability as criteria. Once-through systems will also help to maximize the energy efficiency benefits of future DLC rollouts through “free” low-temperature operation, creating an end-to-end liquid cooling infrastructure.

By: Jacqueline Davis, Research Analyst, Uptime Institute and Daniel Bizo, Research Director, Uptime Institute