Costlier new cloud generations increase lock-in risk

Cloud providers tend to adopt the latest server technologies early, often many months ahead of enterprise buyers, to stay competitive. Providers regularly launch new generations of virtual machines with the same quantities of resources (such as core counts, memory capacity, and network and storage bandwidth) as the previous generation but powered by the latest technology.

Usually, the newer generations of virtual machines are also cheaper, incentivizing users to move to server platforms that are more cost-efficient for the cloud provider. Amazon Web Services’ (AWS’) latest Graviton-based virtual machines buck that trend by being priced higher than the previous generation. The new generation seven (c7g) virtual machines, based on AWS’ own Arm-based Graviton3 chips, carry a premium of around 7% over the previous c6g generation.

A new generation doesn’t replace an older version; the older generation is still available to purchase. The user can migrate their workloads to the newer generation if they wish, but it is their responsibility to do so.

Cloud operators create significant price incentives so that users gravitate towards newer generations (and, increasingly, between server architectures). In turn, cloud providers reap the benefits of improved energy efficiency and lower cost of compute. The cloud provider can also support newer technology for longer than technology that is already close to its end of life.

AWS’ higher price isn’t unreasonable, considering Graviton3 offers higher per-core performance and is built on TSMC’s (Taiwan Semiconductor Manufacturing Company, the world’s largest semiconductor foundry) bleeding-edge 5 nanometer wafers, which carry a premium. The virtual machines also come with DDR5 memory, which has 1.5 times the bandwidth of the DDR4 memory used in the sixth-generation equivalent but costs slightly more. With more network and storage backend bandwidth as well, AWS claims performance improvements of around 25% over the previous generation.

Supply chain difficulties, rising energy costs and political issues are raising data center costs, so AWS may not have the margin to cut prices from the previous generation to the latest. However, this complicates the business case for AWS customers, who are used to getting both lower prices and better performance from a new generation. With the new c7g launch, better performance alone is the selling point. Quantifying this performance gain can be challenging because it strongly depends on the application, but Figure 1 shows how two characteristics — network and storage performance — are decreasing in cost per unit due to improved capability (based on the “medium”-sized virtual machine). This additional value isn’t reflected in the overall price increase of the virtual machine from generation six to seven.

Figure 1 Price and unit costs for AWS’ c6g and c7g virtual machines

A 7% increase in price seems fair if it does drive a performance improvement of 25%, as AWS claims — even if the application gains only half of that, price-performance improves.
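
As a quick illustration of that arithmetic, the short Python sketch below applies the 7% price premium and the 25% claimed gain cited above; the fraction of the gain an application actually realizes is a hypothetical parameter.

```python
# Illustrative only: how price-performance changes for a generational upgrade.
# The 7% price premium and 25% claimed gain come from the text above;
# the realized fractions are hypothetical.

def price_performance_change(price_premium: float, perf_gain: float) -> float:
    """Relative change in performance per dollar (positive = improvement)."""
    return (1 + perf_gain) / (1 + price_premium) - 1

claimed_gain = 0.25   # AWS' claimed generational performance uplift
price_premium = 0.07  # approximate c7g premium over c6g

for realized_fraction in (1.0, 0.5, 0.25):
    gain = claimed_gain * realized_fraction
    change = price_performance_change(price_premium, gain)
    print(f"Realized gain {gain:.1%}: price-performance change {change:+.1%}")
```

With half the claimed gain, price-performance still improves by around 5%; if an application realizes only a quarter of it, the premium slightly outweighs the benefit.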

Because more performance costs more money, the user must justify why paying more will benefit the business. For example, will it create more revenue, allow consolidation of more of the IT estate or aid productivity? These impacts are not so easy to quantify. For many applications, the performance improvements on the underlying infrastructure won’t necessarily translate to valuable application performance gains. If the application already delivers on quality-of-service requirements, more performance might not drive more business value.

Using Arm’s 64-bit instruction set, Graviton-based virtual machines aren’t as universal in their software stack as those based on x86 processors. Most applications built for AWS’ Graviton-based range will have been architected and explicitly optimized to run best on AWS and Graviton systems for better price-performance. As a result, most existing users will probably migrate to the newer c7g virtual machine to gain further performance improvements, albeit more slowly than if it were cheaper.

Brand-new applications will likely be built for the latest generation, even if it is more expensive than before. We expect AWS to reduce the price of c7g Graviton-based virtual machines as the rollout ramps up, input costs gradually fall and sixth-generation systems become increasingly out of date and costlier to maintain.

It will be interesting to see if this pattern repeats with x86 virtual machines, complicating the business case for migration. With next-generation platforms from both Intel and AMD launching soon, the next six to 12 months will confirm whether generational price increases form a trend, even if only a transitory one.

Any price hikes could create challenges down the line. If a generation becomes outdated and difficult to maintain, will cloud providers then force users to migrate to newer virtual machines? If they do and the cost is higher, will buyers — now invested in and reliant on the cloud provider — be forced to pay more, or will they move elsewhere? Or will cloud providers increase the prices of older, legacy generations to continue support for those willing to pay, just as many software vendors charge enterprise support costs? Moving to another cloud provider isn’t trivial. A lack of standards and commonality in cloud provider services and application programming interfaces means cloud migration can be a complex and expensive task, with its own set of business risks. Being locked in to a cloud provider isn’t really a problem if the quality of service remains acceptable and prices remain flat or fall. But if users are being asked to pay more, they may be stuck between a rock and a hard place: pay more to move to the new generation or pay more to move to a different cloud provider.

Extreme heat stress-tests European data centers – again

An extreme heat wave swept across much of Western Europe on July 18 and 19, hitting some of the largest metropolitan areas such as Frankfurt, London, Amsterdam and Paris — which are also global data center hubs with hundreds of megawatts of capacity each. In the London area, temperatures at Heathrow Airport surpassed 40°C / 104°F to set an all-time record for the United Kingdom.

Other areas in the UK and continental Europe did not set new records only because the heatwave of July 2019 had already produced historic highs in Germany, France and the Netherlands, among other countries. Since most data centers in operation were built before 2019, this most recent heatwave either approached or surpassed their design specifications for ambient operating temperatures.

Extreme heat stresses cooling systems by making components, such as compressors, pumps and fans, work harder than usual, which increases the likelihood of failures. Failures happen not only because of increased wear on the cooling equipment, but also due to a lack of maintenance, such as regular cleaning of heat exchange coils. Most susceptible are air-cooled mechanical compressors in direct-expansion (DX) units and water chillers without economizers. DX cooling systems are more likely to rely on the ambient air for heat ejection because they tend to be relatively small in scale, and are often installed in buildings that do not lend themselves to the larger cooling infrastructure required for evaporative cooling units.

Cooling is not the only system at risk from extreme heat. Any external power equipment, such as backup power generators, is also susceptible. If the utility grid falters amid extreme temperatures and the generator needs to take the load, it may not be able to deliver its full nameplate power and may even shut down to avoid damage from overheating.

Although some cloud service providers reportedly saw disruption due to thermal events in recent weeks, most data centers likely made it through the heatwave without a major incident. Redundancy in power and cooling, combined with good upkeep of equipment, should nearly eliminate the chances of an outage even if some components fail. Many operators have an additional cushion because data centers typically don’t run at full utilization – tapping into extra cooling capacity that is normally in reserve can help maintain acceptable data hall temperatures during the peak of a heatwave. In contrast, cloud providers tend to drive their IT infrastructure harder, leaving less margin for error.

As the effects of climate change become more pronounced, making extreme weather events more likely, operators may need to revisit the climate resiliency of their sites. Uptime Institute recommends reassessing climatic conditions for each site regularly. Design conditions against which the data center was built may be out of date — in some cases by more than a decade. This can adversely impact a data center’s ability to support the original design load, even if there are no failures. And a loss of capacity may mean losing some redundancy in the event of an equipment failure. Should that coincide with high utilization of the infrastructure, a data center may not have sufficient reserves to maintain the load.

Operators have several potential responses to choose from, depending on the business case and the technical reality of the facility. One is to derate the maximum capacity of the facility with the view that its original sizing will not be needed. Operators can also decide to increase the target supply air temperature, or allow it to rise temporarily wherever there is headroom (e.g., from 70°F / 21°C to 75°F / 24°C), to reduce the load on cooling systems and maintain full capacity. This could also involve elevating chilled water temperatures if there is sufficient pumping capacity. Another option is to tap into some of the redundancy to bring more cooling capacity online, including the option to operate at N capacity (no redundancy) on a temporary basis.

A lesson from recent heatwaves in Europe is that the temperature extremes did not coincide with high humidity levels. This means that evaporative (and adiabatic) cooling systems remained highly effective throughout, and within design conditions for wet bulb (the lowest temperature of ambient air when fully saturated with moisture). Adding sprinkler systems to DX and chiller units, or evaporative economizers to the heat rejection loop, will be attractive for many.

Longer term, the threat of climate change will likely prompt further adjustments in data centers. The difficulty (or even impossibility) of modeling future extreme weather events means that hardening infrastructure against them may require operators to reconsider siting decisions, and/or adopt major changes to their cooling strategies, such as heat rejection into bodies of water and the transition to direct liquid cooling of IT systems.


Comment from Uptime Institute:

Uptime’s team of global consultants has inspected and certified thousands of enterprise-grade data center facilities worldwide, thoroughly evaluating considerations around extreme outside temperature fluctuations and hundreds of other risk areas at every site. Our Data Center Risk Assessment brings this expertise directly to owners and operators, resulting in a comprehensive review of your facility’s infrastructure, mechanical systems and operations protocols. Learn more here.

Crypto data centers: The good, the bad and the electric

A single tweet in May 2021 brought unprecedented public attention to a relatively unknown issue. Tesla CEO Elon Musk tweeted that because of the large energy consumption associated with the use of Bitcoin, Tesla would no longer accept it as currency. Instead, Tesla would favor other cryptocurrencies that use less than 1% of Bitcoin’s energy per transaction. There has since been much debate – and varying data – about just how much energy crypto data centers consume, yet little focus on how innovative and efficient these facilities can be.

A study by the UK’s Cambridge Centre for Alternative Finance in April 2021 put global electricity consumption by Bitcoin alone at 143 terawatt-hours (TWh) a year. By comparison (according to various estimates*), all the world’s data centers combined used between 210 TWh and 400 TWh — or even more — in 2020. (See Are proof-of-work blockchains a corporate sustainability issue?)

Despite the staggering (estimated) global energy consumption of cryptocurrencies, crypto data center operations are already highly optimized. Although there are legitimate concerns around over-burdening the grid, crypto facilities have deployed innovative approaches and technologies. Instead of focusing primarily on availability and reliability — a necessity for most other data centers — crypto facilities focus mainly on optimizing for cost of compute performance. This is often achieved by eliminating infrastructure (generators, uninterruptible power supplies), innovating silicon design and deploying direct liquid cooling (DLC).

Compared with traditional data centers, crypto IT operations tend to drive out cost and optimize performance, thanks in part to the heavy use of accelerator chips. Some of these accelerators are purpose-built to run cryptocurrency workloads. For example, chipmaker Intel recently launched an accelerator chip intended for crypto miners. Some of these new chips can also be liquid-cooled (or air-cooled), enabling greater efficiencies.

High-density server racks and a focus on return on investment have driven the crypto industry to pioneer and deploy immersion tanks for cooling. The most efficient crypto mining operations leverage ultra-high-density server configurations — about 100 kilowatts (kW) per rack is not unheard of — to potentially reduce both upfront building costs and operating costs.

Texas (US) has emerged as a crypto data center hub. Current estimates place the total installed capacity of crypto mining facilities in Texas at 500 to 1,000 megawatts (MW), with plans to increase this to between 3,000 and 5,000 MW by 2023. The combination of cheap (sometimes renewable) energy and a deregulated energy market has made Texas a top choice for large, industrial-scale crypto mining facilities. Some crypto facilities there have large lithium-ion battery storage capacity, which they exploit via a transactive relationship with a local utility.

One such facility, the Whinstone data center, has a total of 400 MW of installed capacity with provisioning for up to about 700 MW. The facility uses on-site battery stores and has a power supply contract with the grid to pay only 2.5 cents per kilowatt-hour. Increasingly, large crypto miners also seek to claim better sustainability credentials, and using renewable energy has become a recent focus. Some crypto data centers can reportedly increase their utilization during periods of surplus renewable energy on the grid, as part of their transactional agreement with a utility.
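
To put those figures in context, the back-of-the-envelope sketch below combines the roughly 100 kW rack densities and the 2.5 cents per kWh power price cited in this article; the facility size and utilization are hypothetical assumptions, not Whinstone’s actual figures.

```python
# Back-of-the-envelope crypto mining power economics.
# 100 kW per rack and $0.025/kWh are figures cited in the text;
# the rack count and utilization below are hypothetical.

RACK_POWER_KW = 100              # ultra-high-density, immersion-cooled rack
POWER_PRICE_USD_PER_KWH = 0.025  # reported contract price
HOURS_PER_YEAR = 8760

racks = 1_000                    # hypothetical facility (~100 MW of IT load)
utilization = 0.95               # hypothetical; miners run close to flat-out

annual_energy_kwh = racks * RACK_POWER_KW * HOURS_PER_YEAR * utilization
annual_power_cost = annual_energy_kwh * POWER_PRICE_USD_PER_KWH

print(f"Annual energy use: {annual_energy_kwh / 1_000:,.0f} MWh")
print(f"Annual power cost: ${annual_power_cost:,.0f}")
```

At that rate, a hypothetical 100 MW mining load costs roughly $21 million a year in electricity, and each additional cent per kWh adds more than $8 million, which is why cheap power and transactive grid arrangements dominate siting decisions for miners.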

Despite all the heat on crypto data centers, it is a sector of digital infrastructure that is already more optimized for high efficiency than most.

*There have been several studies of global energy use by IT and data centers. A recent and insightful paper, by the authors who reached the lower 210 TWh figure, is Recalibrating global data center energy-use estimates (Masanet E, Shehabi A, Lei N, et al. Science 2020; 367:984-986).

Making sense of the outage numbers

In recent years, Uptime Institute has published regular reports examining both the rate and causes of data center and IT service outages. The reports, which have been widely read and reported in the media, paint a picture of an industry that is struggling with resiliency and reliability — and one where operators regularly suffer downtime, disruption, and reputational and financial damage.

Is this a fair picture? Rather like the authors of a scientific paper whose findings from a small experiment are hailed as a major breakthrough, Uptime Institute Intelligence has often felt a certain unease when these complex findings, pulled from an ever-changing environment, are distilled into sound bites and headlines.

In May this year, Uptime Intelligence published its Annual outage analysis 2022. The key findings were worded cautiously: outage rates are not falling; many outages have serious or severe consequences; the cost and impact of outages are increasing. This year, the media largely reported the findings accurately, picking different angles on the data — but this has not always been the case.

What does Uptime Institute think about the overall outages rate? Is reliability good or bad? Are outages rising or falling? If there were straightforward answers, there would be no need for this discussion. The reality, however, is that outages are both worsening and improving.

In our recent outage analysis report, four in five organizations surveyed say they’d had an outage in the past three years (Figure 1). This is in line with the previous years’ results and is consistent with various other Uptime studies. 

Figure 1 Most organizations experienced an outage in the past three years

A smaller proportion, about one in five, have had a “serious” or “severe” outage (Uptime classes outages on a severity scale of one to five; these are levels four and five), meaning the outcome had serious or severe financial and reputational consequences. This is consistent with our previous studies, and our data also shows the cost of outages is increasing.

By combining this information, we can see that the rate of outages, and their severity and impact, is not improving — in some ways it’s worsening. But hold on, say many providers of data center services and IT, we know our equipment is much better than it was, and we know that IT and data center technicians have better tools and skills — so why aren’t things improving? The fact is, they are.

Our data and findings are based on multiple sources: some reliable, others less so. The primary tools we use are large, multinational surveys of IT and data center operators. Respondents report on outages of the IT services delivered from their data center site(s) or IT operation. Therefore, the outage rate is “per site” or “per organization”.

This is important because the number of organizations with IT and data centers has increased significantly. Even more notable is the amount of IT (data, compute, IT services) per site / organization, which is rising dramatically every year.

What do we conclude? First, the rate of outages per site, company or IT operation is steady on average, neither rising nor falling. Second, the total number of outages is rising steadily, but not substantially, even as the number of organizations either using or offering IT increases. Lastly, the number of outages as a percentage of all IT delivered is falling steadily, if not dramatically.
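
A simple numerical sketch helps untangle these three statements. All the figures below are hypothetical; they are chosen only to show how a flat per-site outage rate, a growing number of sites and IT per site growing even faster can all be true at once.

```python
# Hypothetical illustration: per-site outage rate vs total outages vs
# outages per unit of IT delivered. All numbers below are invented.

years = [2019, 2020, 2021, 2022]
sites = [1000, 1100, 1200, 1300]          # number of sites/organizations grows
it_units_per_site = [1.0, 1.3, 1.7, 2.2]  # IT delivered per site grows faster
outage_rate_per_site = 0.20               # flat: roughly 1 in 5 sites per year

for year, n_sites, it_per_site in zip(years, sites, it_units_per_site):
    total_outages = n_sites * outage_rate_per_site
    outages_per_it_unit = total_outages / (n_sites * it_per_site)
    print(f"{year}: total outages {total_outages:.0f}, "
          f"per unit of IT {outages_per_it_unit:.3f}")
```

In this toy example, total outages rise from 200 to 260 while outages per unit of IT delivered fall from 0.200 to around 0.091; both statements are true at once, which is exactly the nuance a single headline struggles to convey.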

This analysis is not easy for the media to summarize in one headline. But let’s make one more observation as a comparison. In 1970, there were 298 air crashes, which resulted in 2,226 deaths; in 2021, there were 84 air crashes, which resulted in 359 deaths. This is an enormous improvement, particularly allowing for the huge increase in passenger miles flown. If the airline safety record were similar to the IT industry’s outage rate, there would still be many hundreds of crashes per year and thousands of deaths.

This is perhaps not a like-for-like comparison — flight safety is, after all, always a matter of life and death. It does, however, demonstrate the power of collective commitment, transparency of reporting, regulation and investment. As IT becomes more critical, the question for the sector, for regulators and for every IT service and data center operator is (as it always has been): what level of outage risk is acceptable?

Watch our 2022 Outage Report Webinar for more from Uptime Institute Intelligence on the causes, frequency and costs of digital infrastructure outages. The complete 2022 Annual Outage Analysis report is available exclusively to Uptime Institute members. Learn more about Uptime Institute Membership and request a free guest trial here.

Cloud price increases damage trust

In general, the prices of cloud services either remain level or decrease. There are occasional price increases, but these are typically restricted to specific features; blanket price increases across product families are rare.

Price cuts are often the result of improved operating efficiencies. Through automation, slicker processes, Moore’s law improvements in hardware and economies of scale, cloud providers can squeeze their costs and pass savings on to their customers (see Cloud generations drive down prices).

Cost base isn’t the only factor impacting price. Cloud providers need to demonstrate ongoing value. Value is the relationship between function and cost. Although central processing units are becoming increasingly powerful, most users don’t see the functionality of a slightly faster clock speed as valuable — they would rather pay less. As a result, the price of virtual machines tends to decrease.

The price of other cloud services remains flat or decreases more slowly than that of virtual machines. With these services, the cloud provider pockets the benefits of improved operational efficiencies or invests them in new functionality to increase the perceived value of the service. Occasionally, cloud providers will cut the price of these services to garner attention and drive sales volume. Unlike with virtual machines, users of these services seek improved capability (at least for the time being) rather than lower prices.

Price increases are rare for two reasons. First, improvements in operational efficiencies mean the provider doesn’t have to increase prices to maintain or grow margin. Second, the cloud provider doesn’t want to be perceived as taking advantage of customers that are already locked into its services.

Cloud buyers place significant trust in their cloud providers. Only a decade ago, cloud computing was viewed by many buyers as unsuitable for enterprise deployments. Trusting a third party to host business-impacting applications takes a leap of faith: for example, there is no service level agreement that adequately covers the business cost of an outage. Cloud adoption has grown significantly over the past decade, and this reflects the increased trust in both the cloud model and individual cloud providers.

One of the major concerns of using cloud computing is vendor lock-in: users can’t easily migrate applications hosted on one cloud provider to another. (See Cloud scalability and resiliency from first principles.) If the price of the cloud services increases, the user has no choice but to accept the price increase or else plan a costly migration.

Despite this financial anxiety, price increases have not materialized. Most cloud providers have realized that increasing prices would damage the customer trust they’ve spent so long cultivating. Cloud providers want to maintain good relationships with their customers, so that they are the de facto provider of choice for new projects and developments.

However, cloud providers face new and ongoing challenges. The COVID-19 pandemic and the current Russia-Ukraine conflict have disrupted supply chains. Cloud providers may also face internal cost pressures and spend more on energy, hardware components and people. But raising prices could be perceived as price-gouging, especially as their customers are operating under similar economic pressures.

In light of these challenges, it’s surprising that Google Cloud has announced that some services will increase in price from October this year. These include a 50% increase to multiregion nearline storage and the doubling of some operations fees. Load balancers will also be subject to an outbound bandwidth charge. Google Cloud has focused on convincing users that it is a relationship-led, enterprise-focused company (not just a consumer business). To make such sweeping price increases would appear to damage its credibility in this regard.

How will these changes affect Google Cloud’s existing customers? It all depends on the customer’s application architecture. Google Cloud maintains it is raising prices to bring them into line with other cloud providers. It is worth noting, however, that a price increase doesn’t necessarily mean Google Cloud will be more expensive than other cloud providers.
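
The sketch below illustrates that point with entirely hypothetical unit prices and workload profiles (none of these figures are Google Cloud’s actual list prices). It simply applies a 50% storage increase and doubled operations fees to two different architectures.

```python
# Hypothetical illustration of how application architecture determines the
# impact of the announced changes (a 50% increase on multiregion nearline
# storage and doubled fees for some operations). All unit prices and
# workload volumes are invented, not Google Cloud list prices.

def monthly_bill(compute_units, storage_gb, operations_k, prices):
    return (compute_units * prices["compute"]
            + storage_gb * prices["storage_per_gb"]
            + operations_k * prices["ops_per_1k"])

old_prices = {"compute": 1.00, "storage_per_gb": 0.010, "ops_per_1k": 0.10}
new_prices = {"compute": 1.00,                 # unchanged
              "storage_per_gb": 0.010 * 1.5,   # +50%
              "ops_per_1k": 0.10 * 2.0}        # doubled

workloads = {
    "storage-heavy archive": dict(compute_units=100, storage_gb=500_000, operations_k=2_000),
    "compute-heavy web app": dict(compute_units=5_000, storage_gb=10_000, operations_k=500),
}

for name, workload in workloads.items():
    before = monthly_bill(**workload, prices=old_prices)
    after = monthly_bill(**workload, prices=new_prices)
    print(f"{name}: ${before:,.0f} -> ${after:,.0f} ({after / before - 1:+.1%})")
```

In this invented example, the storage-heavy workload’s bill rises by roughly 50% while the compute-heavy workload’s rises by under 2%, which is why customers cannot judge the impact of the announcement without modeling their own usage.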

In Q3 2021, Google Cloud revenue increased by 45% to $4.99 billion, up from $3.44 billion in Q3 2020. Despite this growth, the division reported an operating loss of $644 million. Google Cloud’s revenue trails Amazon Web Services and Microsoft Azure by a wide margin, so Google Cloud may be implementing these price increases with a view to building a more profitable and sustainable business.

Will current and prospective customers consider the price increases reasonable or will they feel their trust in Google Cloud has been misplaced? Vendor lock-in is a business risk that needs managing — what’s not clear today is how big a risk it is.

Direct liquid cooling (DLC): pressure is rising but constraints remain

Direct liquid cooling (DLC) is a collection of techniques that removes heat by circulating a coolant to IT electronics. Even though the process is far more efficient than using air, the move to liquid cooling has been largely confined to select applications in high-performance computing to cool extreme high-density IT systems. There are a few examples of operators using DLC at scale, such as OVHcloud in France, but generally DLC continues to be an exception to the air cooling norm. In a survey of enterprise data center operators in the first quarter of 2022, Uptime Institute found that approximately one in six currently uses DLC (see Figure 1).

Figure 1 Many would consider adopting DLC

Uptime Institute Intelligence believes this balance will soon start shifting toward DLC. Renewed market activity has built up around DLC in anticipation of demand, offering a broader set of DLC systems than ever before. Applications that require high-density infrastructure, such as high-performance technical computing, big data analytics and the rapidly emerging field of deep neural networks, are becoming larger and more common. In addition, the pressure on operators to further improve efficiency and sustainability is building, as practical efficiency gains from air cooling have run their course. DLC offers the potential for a step change in these areas.

Yet, it is the evolution of data center IT silicon that will likely start pushing operators toward DLC en masse. High-density racks can be handled tactically and sustainability credentials can be fudged, but escalating processor and server power, together with the tighter temperature limits that come with them, will gradually render air cooling impractical in the second half of this decade.

Air cooling will be unable to handle future server developments without major compromises — such as larger heat sinks, higher fan power or (worse still) the need to lower air temperature. Uptime believes that organizations that cannot handle these next-generation systems because of their cooling requirements will compromise their IT performance — and likely their efficiency — compared with organizations that embrace them (see Moore’s law resumes — but not for all).

While the future need for scaled-up DLC deployments seems clear, there are technical and business complexities. At the root of the challenges involved is the lack of standards within the DLC market. Unlike air cooling, where air acts both as the standard cooling medium and as the interface between facilities and IT, there is no comparable standard coolant or line of demarcation for DLC. Efforts to create some mechanical interface standards, notably within the Open Compute Project and more recently by Intel, will take years to materialize in product form.

However, the DLC systems that are currently available have evolved significantly in recent years and have become easier to adopt, install and operate. A growing number of DLC production installations in live environments are providing vendors with more data to inform product development in terms of reliability, ease of use and material compatibility. Crucially, a patchwork of partnerships between IT vendors, DLC technology providers and system integrators is growing to make liquid-cooled servers more readily available.

Uptime has identified six commercially available categories of DLC systems (three cold plates and three immersion) but expects additional categories to develop in the near future:

  • Water cold plates.
  • Single-phase dielectric cold plates.
  • Two-phase dielectric cold plates.
  • Chassis immersion.
  • Single-phase immersion.
  • Two-phase immersion.

There are more than a dozen specialist vendors that actively develop DLC products, typically focusing on one of the above system categories, and numerous others have the capability to do so. Each category has a distinct profile of technical and business trade-offs. We exclude other systems that bring liquid to the rack, such as rear-door heat exchangers, because of their reliance on air as the medium to transfer heat from IT electronics to the facility’s cooling infrastructure. This is not an arbitrary distinction: there are major differences in the technical and business characteristics of these systems.

Uptime’s research indicates that most enterprises are open to making the change to DLC, and indeed project considerable adoption of DLC in the coming years. While the exact profile of the uptake of DLC cannot be realistically modeled, Uptime considers there to be strong evidence for a general shift to DLC. The pace of adoption, however, will likely be constrained by a fractured market of suppliers and vendors, organizational inertia — and a lack of formal standards and components.

Check out this DLC webinar for more in-depth insights on the state of direct liquid cooling in the digital infrastructure sector. The full Intelligence report “The coming era of direct liquid cooling: it’s when, not if” is available exclusively to Uptime Institute members. Learn more about Uptime Institute Membership and request a free guest trial here.