Performance expectations of liquid cooling need a reality check

March 20, 2024/in Design, Executive, Operations/by Daniel Bizo, Research Director, Uptime Institute Intelligence

The idea of using liquids to cool IT hardware, exemplified by technologies such as cold plates and immersion cooling, is frequently hailed as the ultimate solution to the data center’s energy efficiency and sustainability challenges. If a data center replaces air cooling with direct liquid cooling (DLC), chilled water systems can operate at higher supply and return water temperatures, which are favorable for both year-round free cooling and waste heat recovery.

Indeed, there are some larger DLC system installations that use only dry coolers for heat rejection, and a few installations are integrated into heat reuse schemes. As supply chains remain strained and regulatory environments tighten, the attraction of leaner and more efficient data center infrastructure will only grow.

However, thermal trends in server silicon will challenge engineering assumptions, chiefly the DLC coolant design temperature points that ultimately underpin operators’ technical, economic and sustainability expectations of DLC. Some data center operators say the mix of technical and regulatory changes on the horizon is difficult to understand when planning for future capacity expansions, and the evolution of data center silicon will only add to the complications.

The only way is up: silicon power keeps escalating

Uptime Institute Intelligence has repeatedly noted the gradual but inescapable trend towards higher server power, barring a fundamental change in chip manufacturing technology (see Silicon heatwave: the looming change in data center climates). Not long ago, a typical enterprise server used less than 200 watts (W) on average, and stayed well below 400 W even when fully loaded. More recent highly performant dual-socket servers can reach 700 W to 800 W thermal power, even when lightly configured with memory, storage and networking. In a few years, mainstream data center servers with high-performance configurations will require as much as 1 kilowatt (kW) in cooling, even without the addition of power-hungry accelerators.

Two forces drive this trend: semiconductor physics and server economics. First, even though semiconductor circuits’ switching energy is dropping, the energy gains are being outpaced by an increase in the scale of integration. As semiconductor technology advances, the same area of silicon will gradually consume (and dissipate) ever more power as a result. Chips are also increasing in size, compounding this effect.

Second, many large server buyers prefer highly performant chips that can process greater software payloads faster because these chips drive infrastructure efficiency and business value. For some, such as financial traders and cloud services providers, higher performance can translate into more direct revenue. In return for these benefits, IT customers are ready to pay hefty price premiums and accept that high-end chips are more power-hungry.

DLC to wash cooling problems away

The escalation of silicon power is now supercharged by the high demand for artificial intelligence (AI) training and other supercomputing workloads, which will make the use of air cooling more costly. Fan power in high-performance servers can often account for 10% to 20% of total system power, in addition to silicon static power losses, due to operating near the upper temperature limit. There is also a loss of server density, resulting from the need to accommodate larger heat sinks and fans, and to allow more space between the electronics.

In addition, air cooling may soon see restrictions in operating temperatures after nearly two decades of gradual relaxation of set points. In its 2021 Equipment thermal guidelines for data processing environments, US industry body ASHRAE created a new environmental class for high-density servers with a recommended supply temperature maximum of 22°C (71.6°F) — a whole 5°C (9°F) lower than the general guidelines (Class A1 to A4), with a corresponding dip in data center energy efficiency (see New ASHRAE guidelines challenge efficiency drive).

Adopting DLC offers relief from the pressure of these trends. The superior thermal performance of liquids, whether water or engineered fluids, makes the job of removing several hundred watts of thermal energy from compact IT electronics more straightforward. Current top-of-the-line processors (up to 350 W thermal design power) and accelerators (up to 700 W on standard parts such as NVIDIA data center GPUs) can be effectively cooled even at high liquid coolant temperatures, allowing the facility water supply for the DLC system to be running as high as 40°C (104°F), and even up to 45°C (113°F).

High facility water temperatures could enable the use of dry coolers in most climates; or alternatively, the facility can offer valuable waste heat to a potential offtaker. The promise is attractive: much reduced IT and facility fan power, elimination of compressors that also lower capital and maintenance needs, and little to no water use for cooling. Today, several high-performance computing facilities with DLC systems take advantage of the heat-rejection or heat-reuse benefits of high temperatures.

Temperature expectations need to cool down

Achieving these benefits is not necessarily straightforward. Details of DLC system implementation, further increases in component thermal power, and temperature restrictions on some components all complicate the process further.

Temperatures depend on the type of DLC implementation. Many water-cooled IT systems, the most common type in use today, often serialize multiple cold plates within a server to simplify tubing, which means downstream components will receive a higher temperature coolant than the original supply. This is particularly true for densified compute systems with very compact chassis, and restricts coolant supply temperatures well below what would be theoretically permissible with a parallel supply to every single cold plate.
Thermal design power has not peaked. The forces underlying the rise in silicon power (discussed above) remain in play, and the data center industry widely expects even more power-hungry components in the coming years. Yet, these expectations remain in the realm of anecdotes, rumors and leaks in the trade press, rather than resting on publicly available information. Server chip vendors refuse to publicize the details of their roadmaps; only select customers under nondisclosure agreements have improved visibility. From our discussions with suppliers, Uptime Intelligence can surmise that more powerful processors are likely to surpass the 500 W mark by 2025. Some suppliers are running proofs of concept simulating silicon heat loads of 800 W and higher.
Temperature restrictions of processors. It is not necessarily the heat load that will cap facility water temperatures, but the changing silicon temperature requirements. As thermal power goes up, the maximum temperature permitted on the processor case (known as Tcase) is coming down, to create a larger temperature difference to the silicon and boost heat flux. Intel has also introduced processor models specified for liquid cooling, with Tcase as low as 57°C (134.6°F), which is more than a 20°C (36°F) drop from comparable air-cooled parts. These low-Tcase models are intended to take advantage of the lower operating temperature made possible by liquid cooling to maximize peak performance levels when running computationally intense code, which is typical in technical and scientific computing.
Memory module cooling. In all the speculation around high-power processors and accelerators, a potentially overlooked issue is the cooling of server memory modules, whose heat output was once treated as negligible. As module density, operating speeds and overall capacity increase with successive generations, maintaining healthy operating temperature ranges is becoming more challenging. Unlike logic chips, such as processors that can withstand higher operating temperatures, dynamic memory (DRAM) cells show performance degradation above 85°C (185°F), including elevated power use, higher latency, and, if thermal escalation is unchecked, bit errors and overwhelmed error correction schemes. Because some of the memory modules will typically be downstream of processors in a cold-plate system, they receive higher temperature coolant. In many cases it won’t be the processor’s Tcase that will restrict coolant supply temperatures, but the limits of memory chips.
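
The effect of serializing cold plates can be illustrated with a simple energy balance. The following is a minimal sketch rather than a design tool: it assumes steady state, water as the coolant, and that all component heat ends up in the coolant. The heat loads and flow rate are hypothetical values chosen for the example, not figures from any vendor specification.

```python
# Illustrative energy balance for cold plates plumbed in series.
# All loads and the flow rate are hypothetical values chosen for the example.

WATER_CP_J_PER_KG_K = 4186.0   # approximate specific heat of water
WATER_KG_PER_LITER = 1.0       # approximate density, ignoring temperature effects


def inlet_temps_in_series(supply_c, flow_l_per_min, heat_loads_w):
    """Return the coolant temperature entering each cold plate in a series loop.

    Each plate warms the coolant by dT = Q / (m_dot * cp), so every
    downstream plate sees a higher inlet temperature than the supply.
    """
    mass_flow_kg_s = flow_l_per_min * WATER_KG_PER_LITER / 60.0
    inlets = []
    temp_c = supply_c
    for load_w in heat_loads_w:
        inlets.append(temp_c)
        temp_c += load_w / (mass_flow_kg_s * WATER_CP_J_PER_KG_K)
    return inlets


if __name__ == "__main__":
    # Hypothetical loop: two 350 W processors followed by a 100 W memory bank.
    loads = [350.0, 350.0, 100.0]
    for supply in (32.0, 40.0, 45.0):
        temps = inlet_temps_in_series(supply, flow_l_per_min=1.0, heat_loads_w=loads)
        print(supply, [round(t, 1) for t in temps])
```

At the assumed 1 liter per minute, each 350 W plate warms the water by roughly 5°C, so the memory in this example receives coolant about 10°C warmer than the supply. Real loops differ in flow rate, plate ordering and thermal resistance, so the numbers only indicate the direction of the effect.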

The net effect of all these factors is clear: widespread deployment of DLC to promote virtually free heat rejection and heat reuse will remain aspirational in all but a few select cases where the facility infrastructure is designed around a specific liquid-cooled IT deployment.

There are too many moving parts to accurately assess the precise requirements of mainstream DLC systems in the next five years. What is clear, however, is that the very same forces that are pushing the data center industry towards liquid cooling will also challenge some of the engineering assumptions around its expected benefits.

Operators that are considering dedicated heat rejection for DLC installations will want to make sure they prepare the infrastructure for a gradual decrease in facility supply temperatures. They can achieve this by planning increased space for additional or larger heat rejection units — or by setting the water temperature conservatively from the outset.

Temperature set points are not dictated solely by IT requirements, but also by flow rate considerations — which has consequences for pipe and pump sizing. Operating close to temperature limits means loss of cooling capacity for the coolant distribution units (CDU), requiring either larger CDUs or more of them. Slim margins also mean any degradation or loss of cooling may have a near immediate effect at full load: a cooling failure in water or single-phase dielectric cold-plate systems may have less than 10 seconds of ride-through time.
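
The link between temperature set points and flow rate follows from a basic heat balance: for a fixed heat load, a smaller permitted temperature rise across the loop requires proportionally more flow. The sketch below assumes water as the coolant and a hypothetical 100 kW loop; it ignores pump curves, pressure drop and heat exchanger approach temperatures.

```python
# Required water flow for a given heat load and coolant temperature rise.
# The 100 kW load and the temperature rises are hypothetical.

WATER_CP_J_PER_KG_K = 4186.0   # approximate specific heat of water
WATER_KG_PER_LITER = 1.0       # approximate density


def required_flow_l_per_min(heat_load_w, delta_t_k):
    """Flow needed so the coolant absorbs heat_load_w with a rise of delta_t_k."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_s = heat_load_w / (WATER_CP_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_s / WATER_KG_PER_LITER * 60.0


if __name__ == "__main__":
    for delta_t in (10.0, 5.0):
        flow = required_flow_l_per_min(100_000.0, delta_t)
        print(f"dT {delta_t} K -> {flow:.0f} L/min")
```

Halving the allowed temperature rise doubles the required flow, which is why operating close to temperature limits feeds through to pipe, pump and CDU sizing.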

Today, temperatures seem to be converging around 32°C (89.6°F) for facility water — a good balance between facility efficiency, cooling capacity and support for a wide range of DLC systems. Site manuals for many water-cooled IT systems also have the same limit. Although this is far higher than any elevated water temperature for air-cooling systems, it still requires additional heat rejection infrastructure either in the form of water evaporation or mechanical cooling. Whether lower temperatures will be needed as server processors approach 500 W — with large memory arrays and even higher power accelerators — will depend on a number of factors, but it is fair to assume the likely answer will be “yes”, despite the high cost of larger mechanical plants.

These considerations and limitations are mostly defined by water cold-plate systems. Single-phase immersion with forced convection and two-phase coolants, probably in the form of cold-plate evaporators rather than immersion, offer alternative approaches to DLC that should help ease supply temperature restrictions. For the time being, water cold plates remain the most widely available and are commonly deployed, and mainstream data center operators will need to ensure they meet the IT system requirements that use them.

In many cases, Uptime Intelligence expects operators to opt for lower facility supply water temperatures for their DLC systems, which brings benefits in lower pumping energy and fewer CDUs for the same cooling capacity, and is also more future proof. Many operators have already opted for conservative water temperatures as they upgrade their facilities for a blend of air and liquid-cooled IT. Others will install DLC systems that are not connected to a water supply but are air-cooled using fans and large radiators.

The Uptime Intelligence View

The switch to liquid to cool IT electronics offers a host of energy and compute performance benefits. However, future expectations based on the past performance of DLC installations are unlikely to be met. The challenges of silicon thermal management will only become more difficult as new generations of high-power server and memory chips develop. This is due to stricter component temperature limits, with future maximum facility water temperatures to be set at more conservative levels. For now, the vision of a lean data center cooling plant without either compressors or evaporative water consumption remains elusive.

FinOps gives hope to those struggling with cloud costs

March 6, 2024/in Design, Executive, Operations/by John O’Brien, Senior Research Analyst, Uptime Institute

Cloud workloads are continuing to grow — sometimes adding to traditional on-premises workloads, sometimes replacing them. The Uptime Institute Global Data Center Survey 2023 shows that organizations expect the public cloud to account for 15% of their workloads by 2025. When private cloud hosting and software as a service (SaaS) are included, this share rises to approximately one-third of their workloads by 2025.

As enterprise managers decide where to put their workloads, they need to weigh up the cost, security, performance, accountability, skills and other factors. Mission-critical workloads are the most challenging. Uptime Institute research suggests around two-thirds of organizations do not host their mission-critical applications in the cloud because of concerns around data security, regulation and compliance, and the cost or return on investment (ROI). Cost overruns are a particular concern: the deployment of new resource-hungry workloads means that unless organizations have better visibility and control over their cloud costs, this situation is only going to get worse.

This report explains the role that the fast-emerging discipline of FinOps (a portmanteau of finance and operations) can play in helping to understand and control cloud costs, and in moving and placing workloads. FinOps has been hailed as a distinct new discipline and a major advance in cloud governance — but, as always, there are elements of hype and some implementation challenges that need to be considered.

Unpredictable cloud costs

The cost and ROI concerns identified by cloud customers are real. Almost every company that has used the cloud has experienced unwelcome cost surprises — many organizations have paid out millions of dollars in unbudgeted cloud fees. This highlights how difficult it has been to plan for and predict cloud costs accurately.

In the early days, cloud services were often promoted as being cheaper; however, that argument has moved on with a focus now on value, innovation and function. In fact, cloud costs have often proved to be higher than some of the alternatives, such as keeping applications or data in-house.

It can make for a complex and confusing picture for managers trying to understand why costs have spiraled. Three examples of why this can happen unexpectedly are:

Cloud migration. In many cases, organizations simply lifted and shifted on-premises workloads to the cloud, saving short-term recoding and development costs at the expense of later technical and operational limitations with cost implications. On-premises workloads were not built for the cloud, and without being modernized through cloud-native techniques, such as containerization and code refactoring, they will fail to benefit from the as-a-service and on-demand consumption models offered by the public cloud.
Technical issues. Cloud platform incompatibilities, coding errors, poor integration and undiscovered dependencies and latencies result in applications performing inefficiently and consuming cloud resources erratically.
On-demand pricing. On-demand pricing, the most common default way of buying cloud services, is cost-effective for some workloads, but not all. Amazon Web Services recommends on-demand pricing for “short-term, irregular workloads that cannot be interrupted.” On-demand therefore lends itself to workloads that can be switched on and off as required, such as build and test environments. However, mission-critical applications, such as enterprise resource planning and databases, need to be available 24/7 to ensure data synchronization happens continuously. If deployed on-demand, consumption costs will start to spiral, but an alternative pricing option, such as reserved instances, can enable substantial savings. It is critical to be able to match the right workload to the right cloud consumption model.
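
The trade-off between on-demand and reserved pricing can be sketched as a break-even calculation. The hourly rates below are hypothetical placeholders, not quotes from any provider, and the model deliberately ignores upfront payments, commitment terms and other discount schemes.

```python
# Break-even between on-demand and reserved pricing for one instance.
# Rates are hypothetical placeholders, not real provider prices.

HOURS_PER_MONTH = 730  # common approximation (8,760 hours / 12)


def monthly_cost_on_demand(rate_per_hour, hours_used):
    """On-demand: pay only for the hours the instance actually runs."""
    return rate_per_hour * hours_used


def monthly_cost_reserved(rate_per_hour):
    """Reserved: pay for every hour of the month, used or not."""
    return rate_per_hour * HOURS_PER_MONTH


def break_even_hours(on_demand_rate, reserved_rate):
    """Hours per month above which the reserved option is cheaper."""
    return reserved_rate * HOURS_PER_MONTH / on_demand_rate


if __name__ == "__main__":
    on_demand, reserved = 0.10, 0.06   # hypothetical $/hour
    print(break_even_hours(on_demand, reserved))               # hours per month
    print(monthly_cost_on_demand(on_demand, 160))              # a test environment
    print(monthly_cost_on_demand(on_demand, HOURS_PER_MONTH))  # a 24/7 database
    print(monthly_cost_reserved(reserved))
```

With these assumed rates, a build and test environment running 160 hours a month is cheaper on-demand, while an always-on database is cheaper reserved. The crossover point moves with the actual discount offered.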

Failing to know which workload is best suited to which cloud consumption model can result in applications and workloads being unavailable when needed or overprovisioned (because they keep running in the background when not in use). Being under or overprovisioned may lead to unsatisfactory results for the customer and, ultimately, poor ROI.

To mitigate some of these issues, cloud providers offer different pricing plans to support different workloads and consumption requirements. However, it is still imperative for the customer to understand their own needs and the implications of these decisions.

This is where FinOps comes in. Customers cannot rely on cloud vendors to be impartial. They need to be able to identify their own optimum pricing models based on their workload requirements, consumption demands and use-case requirements. And they need to be able to accurately compare different providers and their products against one another to achieve the best value.

What is FinOps?

The advocates and suppliers of FinOps tools and practices set out to provide much-needed visibility into the costs of running workloads and applications in the cloud.

The non-profit organization FinOps Foundation describes FinOps as a “financial management discipline and cultural practice that enables organizations to get maximum business value by helping engineering, finance, technology and business teams to collaborate on data-driven spending decisions.”

Many of the disciplines and methods of FinOps are extensions of management accounting, applied to the complexities of digital infrastructure and cloud computing. Tools falling under the FinOps label automate tasks that, in a slower-moving, less-automated environment, would be carried out by finance teams using Excel. These tools can track consumption, model it, set alerts, apply showback and chargeback, and help manage and conduct scenario analysis to model the impact of using different services or developing or introducing applications. Governance and processes may be needed to identify or prevent overspend at an early stage.
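
As a rough illustration of what such tooling does, the sketch below aggregates billing line items by team (a simple showback) and flags teams approaching their budget. The record layout, team names, budgets and the 80% alert threshold are invented for the example and do not correspond to any particular provider's billing export or FinOps product.

```python
# Minimal showback and budget-alert sketch over hypothetical billing records.
from collections import defaultdict


def showback(line_items):
    """Sum cost per team; untagged items are grouped under 'unallocated'."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("team") or "unallocated"
        totals[team] += item["cost"]
    return dict(totals)


def over_budget(totals, budgets, threshold=0.8):
    """Return teams whose spend has reached `threshold` of their budget."""
    return sorted(
        team for team, spent in totals.items()
        if team in budgets and spent >= threshold * budgets[team]
    )


if __name__ == "__main__":
    items = [
        {"service": "compute", "team": "web", "cost": 420.0},
        {"service": "database", "team": "web", "cost": 410.0},
        {"service": "storage", "team": "analytics", "cost": 150.0},
        {"service": "compute", "team": None, "cost": 75.0},
    ]
    totals = showback(items)
    print(totals)
    print(over_budget(totals, {"web": 1000.0, "analytics": 500.0}))
```

Grouping untagged spend under its own heading is a deliberate choice in this sketch: unallocated cost is often where unexpected consumption hides, so it is surfaced rather than dropped.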

The FinOps Foundation represents about 10,000 practitioners from various organizations worldwide, including around 90% of the Fortune 50. These companies have among the largest cloud spend of all corporates, and they are now helping to develop and standardize best practices for FinOps.

FinOps is helping organizations manage the growing complexity of their hybrid IT and increasingly multi-cloud environments. A recent report by management consultancy McKinsey & Company (The FinOps way: how to avoid the pitfalls to realizing cloud’s value) claimed that using FinOps can cut cloud costs by 20% to 30%. However, it also found that organizations do not develop at-scale FinOps processes until their cloud spend hits $100 million per year. This suggests that many organizations buying cloud services below this level have yet to adopt money-saving FinOps disciplines.

For the largest organizations, FinOps has rapidly become a critical part of modern cloud operations, alongside other important cloud operations disciplines, which include:

DevOps. A set of practices to unite cloud software development and IT operations team objectives and outcomes.
AIOps. The application of artificial intelligence (AI) and machine learning techniques for training and inferencing in IT operations.
DataOps. A set of collaborative practices for managing data quality, governance and continuous improvement across IT and operations teams.
Site reliability engineering. This combines software engineering with IT operations expertise to ensure cloud system availability through automation, monitoring, testing production environments and performing incident response.

Like these disciplines, FinOps fills a critical gap in knowledge and skills as more IT, data and systems are managed in the cloud. It can also help to break down operational silos between cloud and traditional enterprise functions. For instance, FinOps sits at the intersection of IT, cloud and finance — it enables enterprises to more efficiently report, analyze and optimize cloud and other IT costs.

Controlling, forecasting and optimizing the costs of running applications in the cloud can be challenging for several reasons, especially when, as is often the case, more than one cloud provider is either being used or being considered. Listed below are the main challenges in controlling cloud costs:

Cloud consumption is not always under the customer’s control, particularly in an on-demand pricing environment.
Each cloud provider measures the consumption of its service in different ways; there is no standard approach.
Each provider offers different incentives and discounts to customers.
Each cloud service has many metrics associated with it that need to be monitored, relating to utilization, optimization, performance and adhering to key performance indicators (KPIs). The more cloud vendor services that are consumed, the more complex this activity becomes.
Metrics are only sometimes related to tangible units that are easy for the customer to predict. For example, a transaction may generate a cost on a database platform, but the customer may have no understanding or visibility of how many of these transactions will be executed in a certain period.
Applications may scale up by accident due to errant code or human error and use resources without purpose. Similarly, applications may not scale down when able (reducing costs) due to incorrect configuration.
Resiliency levels can be achieved in different ways, with different costs. Higher levels of resiliency can add costs, some of which may not have been planned for initially, such as unexpected costs for using additional availability zones.

Why FinOps’ time is now

FinOps is not just about cost containment; it is also about identifying and realizing value from the cloud. One of FinOps’ goals is to help organizations analyze the value of new services by conducting a full cost analysis based on data. FinOps aims to close the gap between the different teams involved in commercial and financial calculations and in the development and deployment of new services.

The FinOps function — at least in some organizations — sits outside of the existing IT, finance and engineering teams to provide an independent, objective voice to arbitrate and negotiate when needed.

These are the key reasons driving FinOps adoption:

Cloud bills are increasing as adoption grows across the business and are now attracting attention from stakeholders as a significant long-term expense.
Growth in hybrid IT, where organizations use a mix of cloud locations and on-premises facilities, has stimulated the need for accurate data to make more informed decisions on workload placement.
AI model training and inferencing is a new asset that will drive (possibly explosive) demand for hybrid IT — both on-premises and public cloud consumption. Optimized financial processes that can predict capacity, consumption and bills will be essential.
In the years to come, many organizations will need to make strategic decisions about where to place large workloads and whether to use their own, colocation or cloud facilities. Better tools and disciplines are needed to model the very significant financial implications.
Macroeconomic conditions (notably rising inflation) are forcing organizations to reduce expenditure where possible to sustain gross margins.

Governance over digital infrastructure costs

Uptime believes that the more successful digital infrastructure organizations will develop FinOps capabilities over the coming years as cloud services (and their huge costs) become more embedded in the core of the business. This will bring a level of governance to cloud use that could be profound. In time, the larger organizations that depend on hybrid IT infrastructure will likely extend the discipline, or some integrated extension of it, to cover all IT, from on-premises to colocation, hosting and cloud.

Despite the hype, some measured skepticism is required. In its simplest iterations, FinOps can bring down obvious cloud overspend, but it can also add costs, complexity and slow innovation. Further, it is still unclear to what extent new and dominant tools, standards and disciplines will become firmly established, and how integrated these functions will become with the rest of the digital infrastructure for financial management. Ideally, chief information officers and chief financial officers do not want to battle with an array of accounting methodologies, tools and reporting lines but instead use integrated sets of tools.

Most cloud providers already supply documentation and tools related to FinOps. This free capability represents a good starting point for implementing FinOps practices. However, cloud provider support is unlikely to be useful in making the organizational changes needed to help bridge the gap between finance and IT. Furthermore, the support provided by a cloud provider will only extend to the services offered by the cloud provider — multi-cloud optimization is far more complex. FinOps capabilities presented by cloud providers cannot offer an unbiased view of expenditure.

Third-party tools, such as Apptio Cloudability, Spot by NetApp, Flexera One, Kubecost and VMware Aria Cost, provide independent FinOps toolsets that can be used across multiple clouds. But larger cloud customers have resorted to developing their own tools. Financial giant Capital One built its own cloud financial management tool because it found its FinOps activities had outgrown its original commercial off-the-shelf product.

FinOps is an emerging discipline, and there is still work being done to achieve standardization across all stakeholders and interests. This is an important area for organizations to watch before they go too far with their investments.

The Uptime Intelligence View

It is clear there is huge cloud overspend at many organizations. However, cloud expenditure can only be effectively managed with specialist domain skills. Finance and IT need to work together to manage cloud costs, and a FinOps function is required to help strike a balance between cutting costs and enabling scalability and innovation. FinOps, however, should not be embraced uncritically. Software, standards and disciplines will take many years to mature. In time, FinOps can become a foundational part of what is ultimately needed: a way to comprehensively manage and model all digital infrastructure costs.

The majority of enterprise IT is now off-premises

February 21, 2024/in Executive, Operations/by Max Smolaks, Research Analyst

Corporate data centers have been the backbone of enterprise IT since the 1960s and continue to play an essential role in supporting critical business and financial functions for much of the global economy. Yet, while their importance remains, their prominence as part of an enterprise’s digital infrastructure appears to be fading.

Today, businesses have more options for where to house their IT workloads than ever before. Colocation, edge sites, public cloud infrastructure and software as a service all offer mature alternatives that can take on many, if not all, enterprise workloads.

Findings from the 2023 Uptime Institute Global Data Center Survey, the longest-running survey of its kind, show for the first time that the proportion of IT workloads hosted in on-premises data centers now represents slightly less than half of the total enterprise footprint. This marks an important and long-anticipated moment for the industry.

It does not mean that corporate data center capacity, usage or expenditure is shrinking in absolute terms, but it does indicate that for new workloads, organizations are tending to choose third-party data centers and services.

The share of workloads in corporate facilities is likely to continue to fall, with organizations switching to third-party venues as the preferred deployment model for their applications. Each venue has its own set of advantages and drawbacks, but all deliver capacity without any capital expenditure.

Senior management loves outsourcing

In Uptime’s 2020 annual data center survey, respondents reported that, on average, 58% of their organization’s IT workloads were hosted in corporate data centers. In 2023, this percentage fell to 48%, with respondents forecasting that just 43% of workloads would be hosted in corporate data centers in 2025 (see Figure 1).

Figure 1. Cloud and hosting grow at the expense of corporate data centers

Increasingly, the economic odds are stacked against corporate facilities as businesses look to offload the financial burden and organizational complexity of building and managing data center capacity.

Specialized third-party data centers are typically more efficient than their on-premises counterparts. Larger cloud and colocation facilities benefit from economies of scale when purchasing mechanical and electrical equipment, helping them to achieve lower costs. For some applications, smaller third-party facilities are more attractive because they enable organizations to bring latency-sensitive or high-availability services closer to industrial or commercial sites.

Data center outsourcing has several other key benefits:

Capacity and change management. Outsourcing liberates corporate teams from the onerous task of finding the space and power required to expand their IT estates.
Staffing and skills. The data center industry is suffering from an acute shortage of staff. When outsourcing, this becomes someone else’s problem.
Environmental reporting. Outsourcing simplifies the process of compliance with current and upcoming sustainability regulations because it places much of the burden on the service provider.
Supporting new (or exotic) technologies. Outsourcing enables businesses to experiment with cutting-edge IT hardware without having to make large upfront investments (e.g., dense IT clusters that require liquid cooling).

Additional regulatory requirements that will come into force in the next two to three years — such as the EU’s Energy Efficiency Directive recast (see The EU’s Energy Efficiency Directive: ready or not here it comes) — will make data center outsourcing even more tempting.

The public cloud offers a further set of advantages over in-house data centers. These include a wide selection of hardware “instances” and the ability for customers to grow or shrink their actively used IT resources at will with no advance notification. In addition to flexible pay-as-you-go pricing models, customers can tweak their costs by opting for either on-demand, reserved or spot instances. It is no surprise that the public cloud accounts for a growing proportion of IT workloads today, with a share of 12%, up from 8% in 2020.

However, this does not mean that the public cloud is a perfect fit for every workload, particularly if the application is not rearchitected to take advantage of the technical and economic benefits provided by a cloud platform. One problem highlighted by Uptime Institute Intelligence is the lack of visibility into cloud providers’ platforms, which prevents customers from assessing their operational resiliency or gaining a better understanding of potential vulnerabilities. There is also the tendency for public cloud deployments to generate runaway costs. The intense competition between cloud providers and their proprietary software stacks makes multi-cloud strategies, which alleviate some of the inherent risks in cloud architectures, too costly and complex to implement.

In specific cases, cloud service providers enjoy an oligopolistic advantage in accessing the latest technologies. For example, Microsoft, Baidu, Google and Tencent recently spent billions of dollars securing very large numbers of GPUs to build out specialized artificial intelligence (AI) training clusters, depleting the supply chain and causing GPU shortages. In the near term, many businesses that opt to develop their own AI models will simply not be able to purchase the GPUs they need and will be forced to rent them from cloud providers.

In terms of regional differences, Asia-Pacific (excluding China) is leading the shift off-premises, with just 39% of IT workloads currently hosted in corporate data centers (see Figure 2). Historically, enterprise data centers have been more developed in North America and Europe, and this category of facilities may never develop to the same extent elsewhere.

Figure 2. The regions with fewer IT workloads in corporate data centers

Digital infrastructure ecosystems in Asia and Latin America are likely to emerge as examples of “leapfrogging” — when regions with poorly developed technology bases advance directly towards the adoption of modern systems without going through intermediary steps.

The Uptime Intelligence View

In theory, the gradual movement of workloads away from in-house data centers gives colocation, cloud, and hosting providers more leverage when influencing the development of IT equipment, new mechanical and electrical designs, data center topologies, and crucially, software architectures. The constraints of enterprise data centers that have defined more than 60 years of business IT will gradually become less important. Does this mean that almost all IT workloads will — eventually — end up running in third-party data centers? This is unlikely, but it will be almost impossible for this trend to reverse: for organizations “born” in a public cloud or using colocation, moving to their own enterprise data center will tend to be highly unattractive or cost prohibitive.

Large data centers are mostly more efficient, analysis confirms

February 7, 2024/in Executive, Operations/by Jacqueline Davis, Research Analyst, Uptime Institute

Uptime Institute calculates an industry average power usage effectiveness (PUE), which is a ratio of total site power to IT power, each year using data from the Uptime Institute Global Data Center Survey. This PUE data is pulled from a large sample over the course of 15 years and provides a reliable view of progress in facility efficiency.

Uptime’s data shows that industry PUE has remained at a high average (ranging from 1.55 to 1.59) since around 2020. Despite ongoing industry modernization, this overall PUE figure has remained almost static, in part because many older and less-efficient legacy facilities have a moderating effect on the average. In 2023, industry average PUE stood at 1.58.

For the 2023 annual survey, Uptime refined and expanded the survey questionnaire to provide deeper insights into the industry trends and improvements underlying the slow-moving average. This analysis builds on Uptime’s recent PUE research that focused on the influence of facility age and regional location (see Global PUEs — are they going anywhere?).

Influence of larger sites on PUE

Uptime’s headline PUE figure of 1.58 (weighted per respondent, for consistency with historical data) approximates the efficiency of an average facility. The new survey data allows us to analyze PUE in greater detail across a range of facility sizes. We applied provisioned IT capacity (in megawatts, MW) as a weighting factor, to examine PUE of a normalized unit of IT power. Using this approach, the capacity-weighted PUE figure is 1.47 — a result that many may expect, given the large amount of total IT power deployed in larger (often newer) data centers.
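The difference between the two averaging methods can be illustrated with a minimal sketch. The site figures below are invented for illustration and are not Uptime survey data, and the function names are hypothetical.

```python
from typing import Sequence


def mean_pue(pues: Sequence[float]) -> float:
    """Per-respondent average: every site counts equally."""
    return sum(pues) / len(pues)


def capacity_weighted_pue(pues: Sequence[float],
                          capacities_mw: Sequence[float]) -> float:
    """Average PUE weighted by provisioned IT capacity (MW)."""
    if len(pues) != len(capacities_mw):
        raise ValueError("pues and capacities_mw must have the same length")
    total_capacity = sum(capacities_mw)
    return sum(p * c for p, c in zip(pues, capacities_mw)) / total_capacity


# Hypothetical sites (NOT survey data): two small legacy rooms, one large campus.
site_pues = [1.8, 1.7, 1.3]
site_capacities_mw = [1.0, 2.0, 30.0]

per_site = mean_pue(site_pues)
weighted = capacity_weighted_pue(site_pues, site_capacities_mw)
print(f"per-site average: {per_site:.2f}, capacity-weighted: {weighted:.2f}")
```

Because the large, efficient campus holds most of the IT capacity in this made-up sample, the capacity-weighted figure falls below the per-site average, which is the same direction of effect as the 1.47 versus 1.58 result.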

Larger facilities tend to be more efficient — most are relatively new and use leading-edge equipment, with more efficient cooling designs and optimized controls. Modernization of smaller facilities is less likely to yield a return on investment from energy savings. In Figure 1, Uptime compares survey respondents’ annual average PUE based on their data center’s provisioned IT capacity, showing a clear trend of efficiency improvements as data centers increase in capacity size.

Figure 1. Weighted average PUE by data center IT capacity

The PUE metric was introduced to track the efficiency of a given data center over time, rather than to compare between different data centers. Uptime analyzes PUE in large sample sizes to track trends in the industry, including the influence of facility size on PUE. Other factors shaping PUE include IT equipment utilization, facility design and age, system redundancy and local climate conditions.

Capacity and scrutiny will grow

Data centers are expanding in capacity — and this will warrant closer attention to the influence of facility size on efficiency. Some campuses currently have capacities of 300 MW, and several others are planned to reach in excess of 1 gigawatt (GW), which is between 10 and 30 times more power than the largest data centers seen in recent years. Uptime has identified approximately 28 hyperscale colocation campuses in development in addition to existing large hyperscale cloud sites. If the planned capacity of these sites is realized, they would account for approximately one-quarter of data center energy consumption globally.

These hyperscale colocation campuses, in common with many large new colocation facilities in existing prime data center locations, are designed for PUEs significantly below the industry average (1.4 and lower). Scala Data Centers is one example: it is building its Tamboré Campus in São Paulo (Brazil), which is intended to reach 450 MW with a PUE of 1.4. To preserve an economic advantage, the organization will need to optimize efficiency as the number of tenants filling the data halls increases.

Cloud hyperscalers Google, Amazon Web Services and Microsoft already claim PUEs of 1.2 or lower at some sites. However, this is not always representative of the actual PUE of a customer application in the cloud. Customer workloads may be provisioned by a colocation partner whose PUE is higher, or they may need to be replicated across one or more availability zones, which drives up energy usage and means the effective PUE is an aggregate across multiple facilities.
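One way to reason about this, assuming the IT energy drawn at each site is known, is to divide the total facility energy by the total IT energy across all copies of the workload. The sketch below uses hypothetical energy figures and a hypothetical function name; it is an illustration, not a method described by Uptime.

```python
from typing import Sequence


def effective_pue(it_energy_kwh: Sequence[float],
                  site_pues: Sequence[float]) -> float:
    """Total facility energy divided by total IT energy across all sites."""
    total_it = sum(it_energy_kwh)
    total_facility = sum(e * p for e, p in zip(it_energy_kwh, site_pues))
    return total_facility / total_it


# Hypothetical workload: one copy in a hyperscale site (PUE 1.2) and a
# replica hosted by a colocation partner (PUE 1.5), 100 kWh of IT energy each.
it_kwh = [100.0, 100.0]
pues = [1.2, 1.5]

blended = effective_pue(it_kwh, pues)

# If the replica exists purely for resiliency, the facility energy spent per
# kWh of IT energy in the primary copy is higher still.
facility_kwh_per_primary_it_kwh = sum(
    e * p for e, p in zip(it_kwh, pues)
) / it_kwh[0]
```

In this example the blended PUE sits between the two site figures, while the energy spent per unit of primary IT work more than doubles because the replica's consumption is counted as overhead.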

PUE improvements will be demanded as legislatures start to reference PUE in binding regulations. The new Energy Efficiency Act in Germany, which came into force in September 2023, requires data centers in Germany to achieve a PUE of 1.5 from July 1, 2027 and a PUE of 1.3 from July 1, 2030. New data centers opening from July 1, 2026 are required to have a PUE of 1.2 or less, which even new-build operators may find challenging at higher levels of resiliency.
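The thresholds quoted above can be expressed as a simple lookup. This is a simplified sketch that treats each figure as a ceiling and ignores any exemptions or transition rules in the legislation; the function name is hypothetical and the code is not a statement of the law.

```python
from datetime import date
from typing import Optional


def required_pue(on: date, opened: date) -> Optional[float]:
    """Return the PUE ceiling applying on `on` for a site that opened on
    `opened`, or None if no threshold applies yet (simplified assumption)."""
    if opened >= date(2026, 7, 1):
        return 1.2  # new data centers opening from July 1, 2026
    if on >= date(2030, 7, 1):
        return 1.3  # existing data centers from July 1, 2030
    if on >= date(2027, 7, 1):
        return 1.5  # existing data centers from July 1, 2027
    return None


# Hypothetical example: a site opened in 2015, checked at three dates.
legacy_opened = date(2015, 1, 1)
for check in (date(2025, 1, 1), date(2028, 1, 1), date(2031, 1, 1)):
    print(check.isoformat(), required_pue(check, legacy_opened))
```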

The Uptime Intelligence View

Facility size is one of many important factors influencing facility efficiency and this is reflected in the capacity-weighted average PUE figure of 1.47, as opposed to the per-site average of 1.58. The data suggests that, over time, replacing older sites with larger, more efficient ones may produce a more impactful or immediate improvement in efficiency than modernizing smaller sites.

Jacqueline Davis, Research Analyst

John O’Brien, Senior Research Analyst

When net-zero goals meet harsh realities

January 24, 2024/in Executive, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute

For more than a decade, the data center industry — and the wider digital infrastructure that relies on it — has lived with the threat of much greater sustainability legislation or other forms of mandatory or semi-mandatory controls. But in a period of boom, it has mostly been a background worry, with legislators more concerned about disrupting an important new industry.

The EU, for example, first introduced the voluntary Code of Conduct for data centers in 2008, warning that legislation would follow if carbon and energy footprints were not brought under control. In the UK, a carbon reduction commitment was required of data centers but was later withdrawn.

Some countries, states and cities, including California, Amsterdam and Singapore, have introduced tighter planning restrictions for data centers and even moratoriums on new developments. However, some of these have been watered down or suspended.

Since 2018, Uptime Institute has repeatedly warned of the likelihood of more legislation and greater public pressure, advising operators to, at the very least, avoid over-ambitious statements, collect better data and prepare. But the pressure to do so has not been strong: improvements in energy efficiency and processors’ performance (Moore’s law), along with the greater use of cloud computing, have held down energy and carbon use, while facility efficiency has gradually improved.

This “green honeymoon” is, however, coming to an end and for some, this will be both painful and expensive. From 2024, new reporting laws and a toughening of requirements will enforce stricter carbon reporting in many countries. These will attempt to ensure that corporate promises are both realistic and evidence-based (see the end of the report for details), and their effects will not be confined to countries or states where there is primary legislation.

A difficult period ahead

Meeting these tougher public goals will not be easy. For several different reasons, ranging from software and processor developments to the availability of renewable energy in the power grid, it will become more difficult for organizations using digital infrastructure to contain or reduce their energy use and carbon emissions.

Ultimately, these pressures may combine to encourage the widespread adoption of more aggressive and thoughtful sustainability strategies as well as a period of progressive and effective investment.

But Uptime Intelligence is also forecasting a difficult period for the sector from 2024 to 2030, as organizations miss sustainability goals and reporting requirements, battle with regulators and even some partners, and struggle to align their corporate business goals with wider sustainability objectives.

For example, in August 2023, the UN-backed Science Based Targets initiative (SBTi) removed Amazon’s operations (including AWS) from its list of committed companies, as Amazon had failed to validate its net-zero emissions target to the SBTi criteria for science-based targets.

This is part of a wider trend. The CDP, previously known as the Carbon Disclosure Project and the most comprehensive global registry of corporate carbon emission commitments, recently said that of the 19,000 companies with registered plans on its platform, only 81 were credible.

A clear disconnect

In the coming years, larger and listed companies in most major economies will have to report their carbon emissions and climate-related risks, sometimes under financial reporting law and sometimes through special directives. The EU’s Corporate Sustainability Reporting Directive (CSRD) and California’s Climate Corporate Data Accountability Act (passed in September 2023) are two examples. The US Securities and Exchange Commission will also eventually require some emissions and risk disclosure from listed companies.

In some jurisdictions, the reporting and improvement of energy use will be required. The latest recast of the EU’s Energy Efficiency Directive (EED), finally published in October 2023, has detailed reporting requirements for data centers that include IT and network equipment use. The German implementation of the EED goes a step further, setting down PUE levels and requirements to reuse heat (with some exceptions). It also requires separate reporting by owners and operators of IT in colocation data centers.

There is a move towards greater precision and accountability at the non-governmental level, too. The principles of carbon emission measurement and reporting that underpin, for example, all corporate net-zero objectives tend to be agreed upon internationally by institutions such as the World Resources Institute and the World Business Council for Sustainable Development; in turn, these are used by bodies such as the SBTi and the CDP. Here too, standards are being rewritten, so that, for example, the use of carbon offsets is becoming less acceptable, forcing operators to buy carbon-free energy directly.

With all these developments under way, there is a startling disconnect between many of the public commitments by countries and companies, and what most digital infrastructure organizations are currently doing or are able to do. Figure 1 below shows that, according to two big surveys from Uptime Institute and IBM, far fewer than half of the managers polled in IT (digital infrastructure) organizations say they are currently tracking any kind of carbon emission data (fuel, purchased electricity, and purchased goods and services).

Figure 1. Digital infrastructure’s tracking of carbon emissions

The difference between the two surveys highlights a second disconnect. IBM’s findings, based on responses from senior IT and sustainability staff, show a much higher proportion of organizations collecting carbon emission data than Uptime’s. However, the Uptime group is more likely to be directly responsible for electricity bills and associated carbon emissions, as well as generator fuel use, and is therefore more likely to have the tools and knowledge to collect the data.

One explanation for this is that sustainability and senior IT staff may not always collect all the underlying data but may use higher-level models and estimates. While this may be legally acceptable, it will not provide the data to identify waste and make the critical, detailed improvements necessary to reduce the digital carbon footprint.

In interviews with enterprises and colocation companies, Uptime similarly found that most of those concerned with reducing energy consumption or collecting sustainability data have limited contact with sustainability or executive teams.

Further challenges

Accurate, timely reporting of carbon emissions and other data will be difficult enough for many digital infrastructure operators, especially as it extends to Scope 3 (embedded, third-party and supply chain emissions). But operators will ultimately face a challenge that will not only be more difficult but may require significant investment: reducing emissions, whether in absolute terms or relative to the overall business workload.

Reducing emissions has never been easy, so why may it become more difficult? The first set of problems relates to IT. In the past five years, Moore’s law-type improvements in processor performance have slowed and have been supplemented or replaced by multi-core processors and GPUs. These are doing more work but also require more power, which pushes up the power and cooling requirements both at a server level and in aggregate across data centers. Significantly improved cooling (e.g., direct liquid cooling), better utilization of IT and better, more intelligent management of workloads will be required to prevent runaway power consumption and carbon emissions.

A second set of problems relates to the energy grid. In most regions, it will take decades before grids are able to operate carbon-free most or all the time. But carbon reporting standards will increasingly require the use of in-region, carbon-free energy (or renewable energy). As more data center operators seek to buy carbon-free energy to meet net-zero goals, this renewable energy will rise in price — if it is available at all. Purchasing enough carbon-free energy to match all demand (24×7) will, at best, be expensive and, at worst, impossible.
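The gap between buying enough carbon-free energy in total and matching demand around the clock can be shown with a small sketch. The article does not define a matching metric, so the hour-by-hour formulation below is an assumption, and the load and supply profiles are hypothetical.

```python
from typing import Sequence


def period_match(demand_mwh: Sequence[float], cfe_mwh: Sequence[float]) -> float:
    """Share of demand covered when totals are netted over the whole period."""
    return min(1.0, sum(cfe_mwh) / sum(demand_mwh))


def hourly_match(demand_mwh: Sequence[float], cfe_mwh: Sequence[float]) -> float:
    """Share of demand covered when carbon-free supply only counts in the
    hour in which it is generated (surplus cannot be carried over)."""
    matched = sum(min(d, c) for d, c in zip(demand_mwh, cfe_mwh))
    return matched / sum(demand_mwh)


# Hypothetical four-hour profile: a flat 10 MWh load and a solar-like supply.
demand = [10.0, 10.0, 10.0, 10.0]
supply = [0.0, 25.0, 15.0, 0.0]

netted = period_match(demand, supply)
hourly = hourly_match(demand, supply)
print(f"netted over period: {netted:.0%}, matched hour by hour: {hourly:.0%}")
```

In this made-up profile, the purchased volume equals total demand, yet only half of the load is matched in the hours when it actually occurs. Closing that gap requires additional supply or storage in the uncovered hours, which is one reason 24×7 matching is costly.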

The third problem is the continuing explosion in workload growth. Energy use by data centers is currently estimated to be between 150 terawatt-hours (TWh) and 400 TWh a year. And even without generative AI, it is universally expected to increase significantly, with some predictions expecting this to double or more beyond 2030 as workloads increase. With generative AI — the overall impact of which is as yet not fully understood — energy use could skyrocket, straining power grids and supply chains, and rendering carbon emission targets yet more difficult to meet.

The Uptime Intelligence View

This analysis suggests that, for most operators of digital infrastructure, it will not only be very difficult to meet stated carbon emission targets, but new reporting requirements will mean that many will be seen to fail. This may be expensive and incur reputational damage. Managers should work closely with all appropriate functional departments and partners to develop a strategy based on realistic goals and real data. No one should announce public goals without first doing this work.

What role might generative AI play in the data center?

January 10, 2024/in Design, Executive, Operations/by John O’Brien, Senior Research Analyst, Uptime Institute

Advances in artificial intelligence (AI) are expected to change the way work is done across numerous organizations and job roles. This is especially true for generative AI tools, which are capable of synthesizing new content based on patterns learned from existing data.

This update explores the rise of large language models (LLMs) and generative AI. It examines whether data center managers should be as dismissive of this technology as many appear to be and considers whether generative AI will find a role in data center operations.

This is the first in a series of reports on AI and its impact on the data center sector. Future reports will cover the use of AI in data center management tools, the power and cooling demands of AI systems, training and inference processing, and how the technology will affect core and edge data centers.

Trust in AI is affected by the hype

Data center owners and operators are starting to develop an understanding of the potential benefits of AI in data center management. So long as the underlying models are robust, transparent and trusted, AI is proving beneficial and is increasingly being used in areas such as predictive maintenance, anomaly detection, physical security, and the filtering and prioritizing of alerts.

At the same time, a deluge of marketing messages and media coverage is creating confusion around the exact capabilities of AI-based products and services. The Uptime Institute Global Data Center Survey 2023 reveals that managers are significantly less likely to trust AI for their operational decision-making than they were a year ago. This fall in trust coincides with the sudden emergence of generative AI. Conversations with data center managers show that there is widespread caution in the industry.

Machine learning, deep learning and generative AI

Machine learning (ML) is an umbrella term for software techniques that involve training mathematical models on large data sets. The process enables ML models to analyze new data and make predictions or solve tasks without being explicitly programmed to do so, in a process called inferencing.

Deep learning — an advanced approach to ML inspired by the workings of the human brain — makes use of deep neural networks (DNNs) to identify patterns and trends in seemingly unrelated or uncorrelated data.

Generative AI is not a specific technology but a type of application that relies on the latest advances in DNN research. Much of the recent progress in generative AI is down to the transformer architecture — a method of building DNNs unveiled by Google in 2017 as part of its search engine technology and later used to create tools such as ChatGPT, which generates text, and DALL-E for images.

Transformers use attention mechanisms to learn the relationships between datapoints — such as words and phrases — largely without human oversight. The architecture manages to simultaneously improve output accuracy while reducing the duration of training required to create generative AI models. This technique kick-started a revolution in applied AI that became one of the key trends of 2023.

Transformer-based LLMs, such as those behind ChatGPT, are trained using hundreds of gigabytes of text and can generate essays, scholarly or journalistic articles and even poetry. However, they face an issue that differentiates them from other types of AI and prevents them from being embraced in a mission-critical setting: they cannot guarantee the accuracy of their output.

The good, the bad and the impossible

With the growing awareness of the benefits of AI comes greater knowledge of its limitations. One of the key limitations of LLMs is their tendency to “hallucinate” and provide false information in a convincing manner. These models are not looking up facts; they are pattern-spotting engines that guess the next best option in a sequence.

This has led to some much-publicized news stories about early adopters of tools like ChatGPT landing themselves in trouble because they relied on the output of generative AI models that contained factual errors. Such stories have likely contributed to the erosion of trust in AI as a tool for data center management — even if these types of issues are exclusive to generative AI.

It is a subject of debate among researchers and academics whether hallucinations can be eliminated entirely from the output of generative AI, but their prevalence can certainly be reduced.

One way to achieve this is called grounding and involves the automated cross-checking of LLM output against web search results or reliable data sources. Another way to minimize the chances of hallucinations is called process supervision. Here, models are rewarded during training for each correct step of their reasoning, rather than only for reaching the right conclusion.

Finally, there is the creation of domain-specific LLMs. These can be either built from scratch using data sourced from specific organizations or industry verticals, or they can be created through the fine-tuning of generic or “foundational” models to perform well-defined, industry-specific tasks. Domain-specific LLMs are much better at understanding jargon and are less likely to hallucinate when used in a professional setting because they have not been designed to cater to a wide variety of use cases.

The propensity to provide false information with confidence likely disqualifies generative AI tools from ever taking part in operational decision-making in the data center — this is better handled by other types of AI or traditional data analytics. However, there are other aspects of data center management that could be enhanced by generative AI, albeit with human supervision.

Generative AI and data center management

First, generative AI can be extremely powerful as an outlining tool for creating first-pass documents, models, designs and even calculations. For this reason, it will likely find a place in those parts of the industry that are concerned with the creation and planning of data centers and operations. However, the limitations of generative AI mean that its accuracy can never be assumed or guaranteed, and that human expertise and oversight will still be required.

Generative AI also has the potential to be valuable in certain management and operations activities within the data center as a productivity and administrative support tool.

The Uptime Institute Data Center Resiliency Survey 2023 reveals that 39% of data center operators have experienced a serious outage because of human error, of which 50% were the result of a failure to follow the correct procedures. To mitigate these issues, generative AI could be used to support the learning and development of staff with different levels of experience and knowledge.

Generative AI could also be used to create and update methods of procedure (MOPs), standard operating procedures (SOPs) and emergency operating procedures (EOPs), which are often overlooked due to time and management pressures. Other examples of potential applications of generative AI include the creation of:

Technical user guides and operating manuals that are pertinent to the specific infrastructure within the facility.
Step-by-step maintenance procedures.
Standard information guides for new employee and/or customer onboarding.
Recruitment materials.
Risk awareness information and updates.
Q&A resources that can be updated as required.

In our conversations with Uptime Institute members and other data center operators, some said they used generative AI for purposes like these — for example, when summarizing notes made at industry events and team meetings. These members agreed that LLMs will eventually be capable of creating documents like MOPs, SOPs and EOPs.

It is crucial to note that in these scenarios, AI-based tools would be used to draft documents that would be checked by experienced data center professionals before being used in operations. The question is whether the efficiencies gained in the process of using generative AI offset the risks that necessitate human validation.

The Uptime Intelligence View

There is considerable confusion around AI. The first point to understand is that ML employs several techniques, and many of the analytics methods and tools can be very useful in a data center setting. But it is generative AI that is causing most of the stir. Given the number of hurdles and uncertainties, data center managers are right to be skeptical about how far they can trust generative AI to provide actionable intelligence in the data center.

That said, it does have very real potential in management and operations, supporting managers and teams in their training, knowledge and documentation of operating procedures. AI — in many forms — is also likely to find its way into numerous software tools that play an important supporting role in data center design, development and operations. Managers should track the technology and understand where and how it can be safely used in their facilities.

John O’Brien, Senior Research Analyst, Uptime Institute

Max Smolaks, Research Analyst, Uptime Institute
