DLC will not come to the rescue of data center sustainability

A growing number of data center operators and equipment vendors are anticipating the proliferation of direct liquid cooling (DLC) systems over the next few years. Uptime Institute's survey data supports these projections: the industry consensus for the mainstream adoption of liquid-cooled IT converges on the latter half of the 2020s.

DLC systems, such as cold plate and immersion, have already proved themselves in technical computing applications as well as mainframe systems for decades. More recently, IT and facility equipment vendors, together with some of the larger data center operators, have started working on commercializing DLC systems for much broader adoption.

A common theme running through both operators’ expectations of DLC and vendors’ messaging is that a main benefit of DLC is improved energy efficiency. Specifically, the superior thermal performance of liquids compared with air will dramatically reduce the consumption of electricity and water in heat rejection systems, such as chillers, as well as increase opportunities for year-round free cooling in some climates. In turn, the data center’s operational sustainability credentials would improve significantly. Better still, the cooling infrastructure would become leaner, cost less and be easier to maintain.

These benefits will be out of reach for many facilities for several practical reasons. The reality of mainstream data centers combined with the varied requirements of generic IT workloads (as opposed to high-performance computing) means that cost and energy efficiency gains will be unevenly distributed across the sector. Many of the operators deploying DLC systems in the next few years will likely prioritize speed and ease of installation into existing environments, as well as focus on maintaining infrastructure resiliency — rather than aiming for maximum DLC efficiency.

Another major factor is time: the pace of adoption. The use of DLC in mission-critical facilities, let alone its adoption at large scale, represents a wholesale shift in cooling design and infrastructure operations, and industry best practices have yet to catch up. Adding to the hurdles, many data center operators will deem current DLC systems limited or uneconomical for their applications, slowing rollout across the industry.

Cooling in mixed company

Data center operators retrofitting a DLC system into their existing data center footprint will often do so gradually in an iterative process, accumulating operational experience. Operators will need to manage a potentially long period when liquid-cooled and air-cooled IT systems and infrastructure coexist in the same data center. This is because air-cooled IT systems will continue to be in production for many years to come, with typical life cycles of between five and seven years. In many cases, this will also mean a cooling infrastructure (for heat transport and rejection) shared between air and liquid systems.

In these hybrid environments, DLC’s energy efficiency will be constrained by the supply temperature requirements of air-cooling equipment, which puts a lid on operating at higher temperatures — compromising the energy and capital efficiency benefits of DLC on the facility side. This includes DLC systems that are integrated with chilled water systems (running the return facility loop as supply for DLC may deliver some marginal gains) and DLC implementations where the coolant distribution unit (CDU) is cooled by the cold air supply.

Even though DLC eliminates many, if not all, server fans and reduces airflow requirements for major gains in total infrastructure energy efficiency, these gains will be difficult to quantify for real-world reporting purposes because IT fan power is not a commonly tracked metric — it is hidden in the IT load.
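
The reporting gap can be illustrated with simple arithmetic (all figures below are assumed for illustration, not measured data): because server fans are metered as part of the IT load, removing them lowers total energy use, yet the reported PUE barely moves, and can even worsen, when the facility overhead is shared and unchanged.

```python
# Illustrative arithmetic only: all figures are assumptions, not measured data.
# PUE = total facility power / IT power; server fan power sits inside "IT power".

it_kw = 1000.0        # assumed IT load of an air-cooled hall, including ~10% fan power
fans_kw = 100.0       # assumed server fan power buried inside that IT figure
overhead_kw = 500.0   # assumed cooling and power distribution overhead (PUE 1.50)

pue_air = (it_kw + overhead_kw) / it_kw                  # 1.50

# Hypothetical DLC retrofit in a mixed hall: fans largely disappear, but the shared
# cooling plant is unchanged. Total energy falls by ~100 kW, yet reported PUE
# gets slightly worse because the IT denominator shrank.
it_dlc_kw = it_kw - fans_kw                              # 900 kW
pue_dlc = (it_dlc_kw + overhead_kw) / it_dlc_kw          # ~1.56

print(f"Air cooled: total {it_kw + overhead_kw:.0f} kW, PUE {pue_air:.2f}")
print(f"DLC, shared plant: total {it_dlc_kw + overhead_kw:.0f} kW, PUE {pue_dlc:.2f}")
```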

It will take years for DLC installations to reach the scale where a dedicated cooling infrastructure can be justified as a standard approach, and for energy efficiency gains to have a positive effect on the industry’s energy performance, such as in power usage effectiveness (PUE) numbers. Most likely, any impact on PUE or sustainability performance from DLC adoption will remain imperceptible for years.

Hidden trade-offs in temperature

There are other factors that will limit the cooling efficiency seen in DLC installations. At the core of DLC’s efficiency potential are the liquid coolants’ favorable thermal properties, which enable them to capture IT heat more effectively. The same thermal properties can, however, be exploited for a cooling performance advantage rather than for maximizing cooling system efficiency. When planning and configuring a DLC system, some operators will give performance, underpinned by lower operating temperatures, greater weight when balancing design trade-offs.

Facility water temperature is a crucial variable in this trade-off. Many DLC systems can cool IT effectively with facility water that is as high as 104°F (40°C) or even higher in specific cases. This minimizes capital and energy expenditure (and water consumption) for the heat rejection infrastructure, particularly for data centers in hotter climates.

Yet, even when presented with the choice, a significant number of facility and IT operators will choose lower supply temperatures for their DLC systems’ water supply. This is because there are substantial benefits to using lower water temperatures — often below 68°F (20°C) — despite the costs involved. Chiefly, a low facility water temperature reduces the flow rate needed for the same cooling capacity, which eases pressure on pipes and pumping.
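
The flow-rate benefit follows from a basic heat balance. The sketch below assumes water as the facility coolant and uses illustrative values for heat load and loop temperature rise; it shows how a wider temperature rise, made possible by a cooler supply, roughly halves the required flow for the same capacity.

```python
# Minimal heat-balance sketch, assuming water as the facility coolant.
# Q = m_dot * c_p * dT, so the required flow falls as the loop temperature rise widens.

RHO = 997.0    # kg/m^3, approximate density of water
CP = 4186.0    # J/(kg.K), approximate specific heat of water

def flow_m3_per_h(heat_kw: float, delta_t_k: float) -> float:
    """Volumetric flow needed to carry heat_kw with a loop temperature rise of delta_t_k."""
    m_dot_kg_s = heat_kw * 1000.0 / (CP * delta_t_k)
    return m_dot_kg_s / RHO * 3600.0

heat_kw = 1000.0   # assumed 1 MW of liquid-cooled IT

# A cool supply (e.g., 18 degC in, 30 degC out) permits a wider rise than a warm
# supply squeezed up against IT limits (e.g., 40 degC in, 46 degC out).
print(f"{flow_m3_per_h(heat_kw, 12.0):.0f} m3/h at a 12 K rise")   # ~72 m3/h
print(f"{flow_m3_per_h(heat_kw, 6.0):.0f} m3/h at a 6 K rise")     # ~144 m3/h
```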

Conversely, organizations that use warm water and DLC to enable data center designs with dry coolers face planning and design uncertainties. High facility water temperatures not only require higher flow rates and pumping power but also need to account for potential supply temperature reductions in the future as IT requirements become stricter due to evolving server silicon. For a given capacity, this could mean more or larger dry coolers, which potentially require upgrades with mechanical or evaporative assistance. Data center operators that want free cooling benefits and a light mechanical plant have a complex planning and design task ahead.

On the IT side, taking advantage of low temperatures makes sense when maximizing the performance and energy efficiency of processors because silicon exhibits lower static power losses at lower temperatures. This approach is already common today because the primary reason for most current DLC installations is to support high IT performance objectives. Data center operators currently use DLC primarily because they need to cool high-density IT rather than conserve energy.

The bulk of DLC system sales in the coming years will likely be to support high-performance IT systems, many of which will use processors with restricted temperature limits — these models are sold by chipmakers specifically to maximize compute speeds. Operators may select low water temperatures to accommodate these low-temperature processors and to maximize the cooling capacity of the CDU. In effect, a significant share of DLC adoption will likely represent an investment in performance rather than facility efficiency gains.

DLC changes more than the coolant

For all its potential benefits, a switch to DLC raises some challenges to resiliency design, maintenance and operation. These can be especially daunting in the absence of mature and application-specific guidance from standards organizations. Data center operators that support business-critical workloads are unlikely to accept compromises to resiliency standards and realized application availability for a new mode of cooling, regardless of the technical or economic benefits.

In the event of a failure in the DLC system, cold plates tend to offer much less than a minute of ride-through time because of their small coolant volume. The latest high-powered processors would have only a few seconds of ride-through at full load when using typical cold plate systems. Operating at high temperatures means that there are thin margins in a failure, something that operators of mainstream, mission-critical facilities will be mindful of when making these decisions.
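
An order-of-magnitude estimate helps explain these seconds-scale figures. The sketch below counts only the thermal mass of the trapped coolant and uses assumed values for coolant volume, rack load and remaining temperature margin; real ride-through also depends on cold plate and manifold thermal mass and on how quickly the IT throttles.

```python
# Rough ride-through estimate after a pump failure, counting only the thermal mass
# of the coolant trapped in the loop. All input values are assumptions.

CP_WATER = 4186.0   # J/(kg.K); roughly 1 kg per litre of water

def ride_through_s(coolant_litres: float, heat_load_kw: float, temp_margin_k: float) -> float:
    """Seconds until the trapped coolant warms through the remaining temperature margin."""
    stored_j = coolant_litres * CP_WATER * temp_margin_k
    return stored_j / (heat_load_kw * 1000.0)

# Assumed 30 kW rack with ~5 litres of coolant in the secondary loop:
print(f"{ride_through_s(5.0, 30.0, 10.0):.0f} s with a 10 K margin")   # ~7 s
print(f"{ride_through_s(5.0, 30.0, 3.0):.0f} s with a 3 K margin")     # ~2 s
```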

In addition, implementing concurrent maintainability or fault tolerance with some DLC equipment may not be practical. As a result, a conversion to DLC can demand that organizations maintain their infrastructure resiliency standard in a different way from air cooling. Operators may consider protecting coolant pumps with an uninterruptible power supply (UPS) and using software resiliency strategies when possible.

Organizational procedures for procurement, commissioning, maintenance and operations need to be re-examined because DLC disrupts the current division of facilities and IT infrastructure functions. For air-cooling equipment, there is strong consensus regarding the division of equipment between facilities and IT teams, as well as their corresponding responsibilities in procurement, maintenance and resiliency. No such consensus exists for liquid cooling equipment. A resetting of staff responsibilities will require much closer cooperation between facilities and IT infrastructure teams.

These considerations will temper the enthusiasm for large-scale use of DLC and make for a more measured approach to its adoption. As operators increasingly understand the ways in which DLC deployment is not straightforward, they will bide their time and wait for industry best practices to mature and fill their knowledge gaps.

In the long term (i.e., 10 years or more), DLC is likely to handle a large share of IT workloads, including a broad set of systems running business applications. This will happen as standardization efforts, real-world experience with DLC systems in production environments and mature guidance take shape in new, more robust products and best practices for the industry. To grow the number and size of deployments of cold plate and immersion systems in mission-critical facility infrastructure, DLC system designs will have to meet additional technical and economic objectives. This will complicate the case for efficiency improvements.

The cooling efficiency figures of today’s DLC products are often derived from niche applications that differ from typical commercial data centers — and real-world efficiency gains from DLC in mainstream data centers will necessarily be subject to more trade-offs and constraints.

In the near term, the business case for DLC is likely to tilt in favor of prioritizing IT performance and ease of retrofitting with a shared cooling infrastructure. Importantly, choosing lower, more traditional water supply temperatures and utilizing chillers appears to be an attractive proposition for added resiliency and future-proofing. Because many data center operators deem performance needs and mixed environments to be more pressing business concerns, free cooling aspirations, along with their sustainability benefits, will have to wait for much of the industry.

US mandates crypto energy reporting: will data centers be next?

Rising concerns about cryptocurrency mining energy use have led the US Energy Information Administration (EIA) to launch a six-month emergency data reporting mandate (on January 26, 2024) to obtain information from 82 cryptocurrency mining companies. The emergency order, which was approved by the Office of Management and Budget (OMB), requires cryptocurrency miners to provide information detailing their monthly energy consumption, average and maximum electricity demand, energy suppliers, mining unit counts, and the hash rate (compute power of a blockchain network) at each of their operating locations from February 2024 to July 2024. The order is expected to capture information from 150 facilities.

At the end of February 2024, the initiative (survey) was temporarily put on hold after a lawsuit brought by a cryptocurrency association and a bitcoin mining company alleged that the data collection initiative could harm businesses by forcing them to divulge confidential and sensitive information. The lawsuit contested the notion that cryptocurrency mining operations pose a danger to the reliability of the grid, which will now have to be proven in court.

While the legal action has halted the survey for at least a month, it does not dispute that the OMB and the EIA have the legal mechanisms available to launch this initiative — and the same mechanisms can be applied to other sectors of the data center industry.

EIA estimates rise in crypto mining energy use

The EIA administrator requested emergency reporting authority from the OMB because the administrator and EIA staff concluded that escalating cryptocurrency mining energy demand in the US could reasonably result in public harm for the following reasons:

  • Rising Bitcoin prices risk more electricity use as miners expand their operations.
  • The increased energy demand is occurring without an accompanying increase in energy supply, which is likely to increase energy prices and grid instability. 
  • There is no data available to assess the speed and extent of the potential energy use growth, making it difficult to mitigate the potential public harm.

The order was generated based on data models developed under an EIA in-depth analysis of cryptocurrency mining activities that estimated that these operations were responsible for 0.6% to 2.3% of US electricity consumption and that energy consumption was likely to grow rapidly (see Tracking electricity consumption from US cryptocurrency mining operations). In addition, concerns were expressed by members of Congress and the Administration (see First signs of federal data center reporting mandates appear in US) about cryptocurrency mining energy use, while grid planners have indicated that the growth in energy consumption will negatively affect electricity supply costs, the quantity of reserve supply, grid reliability and greenhouse gas emissions.

To continue data collection beyond the six-month emergency order, the EIA is currently using the agency’s authority to request a three-year extension to the data collection period.

Energy reporting for traditional data centers likely

While the emergency order warrants the data center industry’s attention, it is the fact that the EIA has the authority to require major energy users to supply this information that will be of real concern.

The rapid growth of conventional data center operations has increasingly been under public scrutiny, particularly in regard to data center expansion in the US. The US Office of Science and Technology Policy report (reviewed in First signs of federal data center reporting mandates appear in US) estimated that the energy consumption of traditional data centers was likely to be equivalent to the energy consumption of cryptocurrency mining operations, making traditional data centers a logical next target for facility and energy use reporting.

Data center operators should not be surprised if the EIA turns its attention to energy consumption in traditional data centers, proposing regulations for reporting sometime in 2024. The buildout of traditional data centers is eliciting the same criticisms leveled at cryptocurrency mining operations: they reduce available electricity supply and grid reliability, raise electricity costs, and increase emissions due to increased demand on fossil fuel-powered generation facilities. The final push for the EIA to act is likely to be the EU’s publication of region-wide data center energy use as reported under the Energy Efficiency Directive (EED), which is still being finalized, and associated regulations.

Conclusion

US data center operators should prepare for a potential EIA-mandated energy consumption reporting regulation in 2024. The reporting requirements are likely to resemble those mandated for cryptocurrency mining operations: facility information, energy consumption and demand data, and a count of installed equipment. Two items that will have to be addressed in the regulatory proposal are the criteria determining which data centers are required to report (such as data center type, installed IT capacity and installed power capacity) and the reporting frequency (monthly, quarterly or annually). Any US data center energy consumption reporting regulation will require publication in the US Federal Register and a comment period, giving the industry an opportunity to review and shape the final reporting requirements.


The Uptime Intelligence View

US data center operators have been sanguine about the potential for government regulation mandating data reporting or minimum performance requirements for key metrics. Unfortunately, the regulation establishing the US EIA contains a vehicle (15 USC 772) that authorizes the EIA to compel major energy users to report energy consumption and relevant facility data. Given the current public, legislator and regulator concerns relating to the projected growth of data center energy demand, which is anticipated to accelerate with the growth of AI offerings, it is highly likely that the EIA will propose a regulation mandating energy consumption reporting for data centers in 2024.

Addendum

The US Energy Information Administration (EIA) and Bitcoin industry groups have reached a settlement in the Bitcoin industry group’s lawsuit. EIA has agreed to “destroy” all the data collected. Under the settlement agreement, EIA intends to use its authority under 15 USC 772 (the Administrator’s information-gathering power) to request a three-year data collection period (Federal Register Vol. 89 No. 28; February 9, 2024). This process will allow the industry to provide comments and shape the requirements of the data collection process.

Performance expectations of liquid cooling need a reality check

The idea of using liquids to cool IT hardware, exemplified by technologies such as cold plates and immersion cooling, is frequently hailed as the ultimate solution to the data center’s energy efficiency and sustainability challenges. If a data center replaces air cooling with direct liquid cooling (DLC), chilled water systems can operate at higher supply and return water temperatures, which are favorable for both year-round free cooling and waste heat recovery.

Indeed, there are some larger DLC system installations that use only dry coolers for heat rejection, and a few installations are integrated into heat reuse schemes. As supply chains remain strained and regulatory environments tighten, the attraction of leaner and more efficient data center infrastructure will only grow.

However, thermal trends in server silicon will challenge engineering assumptions, chiefly the DLC coolant design temperature points that ultimately underpin operators’ technical, economic and sustainability expectations of DLC. Some data center operators say the mix of technical and regulatory changes on the horizon is difficult to understand when planning for future capacity expansions — and the evolution of data center silicon will only add to the complications.

The only way is up: silicon power keeps escalating

Uptime Institute Intelligence has repeatedly noted the gradual but inescapable trend towards higher server power — barring a fundamental change in chip manufacturing technology (see Silicon heatwave: the looming change in data center climates). Not long ago, a typical enterprise server used less than 200 watts (W) on average, and stayed well below 400 W even when fully loaded. More recent highly performant dual-socket servers can reach 700 W to 800 W thermal power, even when lightly configured with memory, storage and networking. In a few years, mainstream data center servers with high-performance configurations will require as much as 1 kilowatt (kW) in cooling, even without the addition of power-hungry accelerators.

The underlying driver for this trend is semiconductor physics combined with server economics for two key reasons. First, even though semiconductor circuits’ switching energy is dropping, the energy gains are being outpaced by an increase in the scale of integration. As semiconductor technology advances, the same area of silicon will gradually consume (and dissipate) ever more power as a result. Chips are also increasing in size, compounding this effect.

Second, many large server buyers prefer highly performant chips that can process greater software payloads faster because these chips drive infrastructure efficiency and business value. For some, such as financial traders and cloud services providers, higher performance can translate into more direct revenue. In return for these benefits, IT customers are ready to pay hefty price premiums and accept that high-end chips are more power-hungry.

DLC to wash cooling problems away

The escalation of silicon power is now supercharged by the high demand for artificial intelligence (AI) training and other supercomputing workloads, which will make the use of air cooling more costly. Fan power in high-performance servers can often account for 10% to 20% of total system power, in addition to silicon static power losses, due to operating near the upper temperature limit. There is also a loss of server density, resulting from the need to accommodate larger heat sinks and fans, and to allow more space between the electronics.

In addition, air cooling may soon see restrictions in operating temperatures after nearly two decades of gradual relaxation of set points. In its 2021 Equipment thermal guidelines for data processing environments, US industry body ASHRAE created a new environmental class for high-density servers with a recommended supply temperature maximum of 22°C (71.6°F) — a whole 5°C (9°F) lower than the general guidelines (Class A1 to A4), with a corresponding dip in data center energy efficiency (see New ASHRAE guidelines challenge efficiency drive).

Adopting DLC offers relief from the pressure of these trends. The superior thermal performance of liquids, whether water or engineered fluids, makes the job of removing several hundred watts of thermal energy from compact IT electronics more straightforward. Current top-of-the-line processors (up to 350 W thermal design power) and accelerators (up to 700 W on standard parts such as NVIDIA data center GPUs) can be effectively cooled even at high liquid coolant temperatures, allowing the facility water supply for the DLC system to be running as high as 40°C (104°F), and even up to 45°C (113°F).

High facility water temperatures could enable the use of dry coolers in most climates; alternatively, the facility can offer valuable waste heat to a potential offtaker. The promise is attractive: much reduced IT and facility fan power, elimination of compressors (which also lowers capital and maintenance needs), and little to no water use for cooling. Today, several high-performance computing facilities with DLC systems take advantage of the heat-rejection or heat-reuse benefits of high temperatures.

Temperature expectations need to cool down

Achieving these benefits is not necessarily straightforward. Details of DLC system implementation, further increases in component thermal power, and temperature restrictions on some components all complicate the picture.

  • Temperatures depend on the type of DLC implementation. Many water-cooled IT systems, the most common type in use today, often serialize multiple cold plates within a server to simplify tubing, which means downstream components will receive a higher temperature coolant than the original supply. This is particularly true for densified compute systems with very compact chassis, and restricts coolant supply temperatures well below what would be theoretically permissible with a parallel supply to every single cold plate.
  • Thermal design power has not peaked. The forces underlying the rise in silicon power (discussed above) remain in play, and the data center industry widely expects even more power-hungry components in the coming years. Yet, these expectations are based on anecdotes, rumors and leaks in the trade press rather than on publicly available information. Server chip vendors refuse to publicize the details of their roadmaps — only select customers under nondisclosure agreements have improved visibility. From our discussions with suppliers, Uptime Intelligence can surmise that more powerful processors are likely to surpass the 500 W mark by 2025. Some suppliers are running proofs of concept simulating 800 W silicon heat loads, and higher.
  • Temperature restrictions of processors. It is not necessarily the heat load that will cap facility water temperatures, but the changing silicon temperature requirements. As thermal power goes up, the maximum temperature permitted on the processor case (known as Tcase) is coming down — to create a larger temperature difference to the silicon and boost heat flux. Intel has also introduced processor models specified for liquid cooling, with Tcase as low as 57°C (134.6°F), which is more than a 20°C (36°F) drop from comparable air-cooled parts. These low-Tcase models are intended to take advantage of the lower operating temperature made possible by liquid cooling to maximize peak performance levels when running computationally intense code, which is typical in technical and scientific computing (see the back-of-envelope sketch after this list).
  • Memory module cooling. In all the speculation around high-power processors and accelerators, a potentially overlooked issue is the cooling of server memory modules, whose heat output was once treated as negligible. As module density, operating speeds and overall capacity increase with successive generations, maintaining healthy operating temperature ranges is becoming more challenging. Unlike logic chips, such as processors that can withstand higher operating temperatures, dynamic memory (DRAM) cells show performance degradation above 85°C (185°F), including elevated power use, higher latency, and — if thermal escalation is unchecked — bit errors and overwhelmed error correction schemes. Because some of the memory modules will typically be downstream of processors in a cold-plate system, they receive higher temperature coolant. In many cases it will not be the processor’s Tcase that restricts coolant supply temperatures, but the limits of memory chips.
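
To see how a low Tcase translates into facility water limits, the back-of-envelope sketch below works backward from the processor case temperature; the power level, cold-plate thermal resistance and serial preheat figures are assumptions for illustration only.

```python
# Back-of-envelope limit on coolant supply temperature for a cold-plate system,
# driven by processor Tcase rather than total heat load. Power, thermal resistance
# and preheat values below are assumptions for illustration.

def max_coolant_supply_c(tcase_c: float, power_w: float,
                         r_case_to_coolant_k_per_w: float,
                         upstream_preheat_k: float = 0.0) -> float:
    """Highest coolant temperature at the server inlet that still holds Tcase."""
    return tcase_c - power_w * r_case_to_coolant_k_per_w - upstream_preheat_k

# Assumed low-Tcase part: Tcase 57 C, 500 W, 0.04 K/W cold-plate thermal resistance,
# plus ~3 K of preheat from serialized cold plates upstream.
print(f"{max_coolant_supply_c(57.0, 500.0, 0.04, 3.0):.0f} C at the server inlet")  # ~34 C

# Facility water must then sit several kelvin lower again to cover the CDU heat
# exchanger approach, pushing it well below the 40 C "warm water" ideal.
```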

The net effect of all these factors is clear: widespread deployment of DLC to promote virtually free heat rejection and heat reuse will remain aspirational in all but a few select cases where the facility infrastructure is designed around a specific liquid-cooled IT deployment.

There are too many moving parts to accurately assess the precise requirements of mainstream DLC systems in the next five years. What is clear, however, is that the very same forces that are pushing the data center industry towards liquid cooling will also challenge some of the engineering assumptions around its expected benefits.

Operators that are considering dedicated heat rejection for DLC installations will want to make sure they prepare the infrastructure for a gradual decrease in facility supply temperatures. They can achieve this by planning increased space for additional or larger heat rejection units — or by setting the water temperature conservatively from the outset.

Temperature set points are not dictated solely by IT requirements, but also by flow rate considerations — which have consequences for pipe and pump sizing. Operating close to temperature limits means a loss of cooling capacity for the coolant distribution units (CDUs), requiring either larger CDUs or more of them. Slim margins also mean any degradation or loss of cooling may have a near immediate effect at full load: a cooling failure in water or single-phase dielectric cold-plate systems may have less than 10 seconds of ride-through time.

Today, temperatures seem to be converging around 32°C (89.6°F) for facility water — a good balance between facility efficiency, cooling capacity and support for a wide range of DLC systems. Site manuals for many water-cooled IT systems also have the same limit. Although this is far higher than any elevated water temperature for air-cooling systems, it still requires additional heat rejection infrastructure either in the form of water evaporation or mechanical cooling. Whether lower temperatures will be needed as server processors approach 500 W — with large memory arrays and even higher power accelerators — will depend on a number of factors, but it is fair to assume the likely answer will be “yes”, despite the high cost of larger mechanical plants.

These considerations and limitations are mostly defined by water cold-plate systems. Single-phase immersion with forced convection and two-phase coolants, probably in the form of cold-plate evaporators rather than immersion, offer alternative approaches to DLC that should help ease supply temperature restrictions. For the time being, water cold plates remain the most widely available and commonly deployed option, and mainstream data center operators will need to ensure they meet the requirements of the IT systems that use them.

In many cases, Uptime Intelligence expects operators to opt for lower facility supply water temperatures for their DLC systems, which brings benefits in lower pumping energy and fewer CDUs for the same cooling capacity, and is also more future proof. Many operators have already opted for conservative water temperatures as they upgrade their facilities for a blend of air and liquid-cooled IT. Others will install DLC systems that are not connected to a water supply but are air-cooled using fans and large radiators.


The Uptime Intelligence View

The switch to liquid to cool IT electronics offers a host of energy and compute performance benefits. However, future expectations based on the past performance of DLC installations are unlikely to be met. The challenges of silicon thermal management will only become more difficult as new generations of high-power server and memory chips develop. This is due to stricter component temperature limits, with future maximum facility water temperatures to be set at more conservative levels. For now, the vision of a lean data center cooling plant without either compressors or evaporative water consumption remains elusive.

FinOps gives hope to those struggling with cloud costs

Cloud workloads are continuing to grow — sometimes adding to traditional on-premises workloads, sometimes replacing them. The Uptime Institute Global Data Center Survey 2023 shows that organizations expect the public cloud to account for 15% of their workloads by 2025. When private cloud hosting and software as a service (SaaS) are included, this share rises to approximately one-third of their workloads by 2025.

As enterprise managers decide where to put their workloads, they need to weigh up the cost, security, performance, accountability, skills and other factors. Mission-critical workloads are the most challenging. Uptime Institute research suggests around two-thirds of organizations do not host their mission-critical applications in the cloud because of concerns around data security, regulation and compliance, and the cost or return on investment (ROI). Cost overruns are a particular concern: the deployment of new resource-hungry workloads means that unless organizations have better visibility and control over their cloud costs, this situation is only going to get worse.

This report explains the role that the fast-emerging discipline of FinOps (a portmanteau of finance and operations) can play in helping to understand and control cloud costs, and in moving and placing workloads. FinOps has been hailed as a distinct new discipline and a major advance in cloud governance — but, as always, there are elements of hype and some implementation challenges that need to be considered.

Unpredictable cloud costs

The cost and ROI concerns identified by cloud customers are real. Almost every company that has used the cloud has experienced unwelcome cost surprises — many organizations have paid out millions of dollars in unbudgeted cloud fees. This highlights how difficult it has been to plan for and predict cloud costs accurately.

In the early days, cloud services were often promoted as being cheaper; however, that argument has moved on with a focus now on value, innovation and function. In fact, cloud costs have often proved to be higher than some of the alternatives, such as keeping applications or data in-house.

It can make for a complex and confusing picture for managers trying to understand why costs have spiraled. Three examples of why this can happen unexpectedly are:

  • Cloud migration. In many cases, organizations simply lifted and shifted on-premises workloads to the cloud, saving short-term recoding and development costs at the expense of later technical and operational limitations with cost implications. On-premises workloads were not built for the cloud, and without being modernized through cloud-native techniques, such as containerization and code refactoring, they will fail to benefit from the as-a-service and on-demand consumption models offered by the public cloud.
  • Technical issues. Cloud platform incompatibilities, coding errors, poor integration and undiscovered dependencies and latencies result in applications performing inefficiently and consuming cloud resources erratically.
  • On-demand pricing. The most common default way of buying cloud services is cost-effective for some workloads, but not all. Amazon Web Services recommends on-demand pricing for “short-term, irregular workloads that cannot be interrupted.” On-demand therefore lends itself to workloads that can be switched on and off as required, such as build and test environments. However, mission-critical applications, such as enterprise resource planning and databases, need to be available 24/7 to ensure data synchronization happens continuously. If deployed on-demand, consumption costs will start to spiral — but an alternative pricing option, such as reserved instances, can enable substantial savings (see the illustrative comparison after this list). It is critical to be able to match the right workload to the right cloud consumption model.
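
A minimal comparison, using hypothetical hourly rates and discounts rather than any provider's published pricing, illustrates why an always-on workload is a poor fit for on-demand consumption while an intermittent one is well served by it.

```python
# Illustrative comparison of consumption models for an always-on workload versus an
# intermittent one. Hourly rates and the discount are hypothetical, not real pricing.

HOURS_PER_MONTH = 730

def monthly_cost(rate_per_hour: float, hours: float) -> float:
    return rate_per_hour * hours

on_demand_rate = 0.40                    # assumed $/hour for an on-demand instance
reserved_rate = on_demand_rate * 0.60    # assumed ~40% discount for a 1-year commitment

# A build/test environment used ~8 hours on ~22 weekdays suits on-demand pricing:
print(f"Build/test, on-demand: ${monthly_cost(on_demand_rate, 8 * 22):.0f}/month")          # ~$70

# An always-on database runs the full 730 hours, so the commitment wins clearly:
print(f"Database, on-demand: ${monthly_cost(on_demand_rate, HOURS_PER_MONTH):.0f}/month")   # ~$292
print(f"Database, reserved:  ${monthly_cost(reserved_rate, HOURS_PER_MONTH):.0f}/month")    # ~$175
```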

Failing to know which workload is best suited to which cloud consumption model can result in applications and workloads being unavailable when needed or overprovisioned (because they keep running in the background when not in use). Being under- or overprovisioned may lead to unsatisfactory results for the customer and, ultimately, poor ROI.

To mitigate some of these issues, cloud providers offer different pricing plans to support different workloads and consumption requirements. However, it is still imperative for the customer to understand their own needs and the implications of these decisions.

This is where FinOps comes in. Customers cannot rely on cloud vendors to be impartial. They need to be able to identify their own optimum pricing models based on their workload requirements, consumption demands and use-case requirements. And they need to be able to accurately compare different providers and their products against one another to achieve the best value.

What is FinOps?

The advocates and suppliers of FinOps tools and practices set out to provide much-needed visibility into the costs of running workloads and applications in the cloud.

The non-profit organization FinOps Foundation describes FinOps as a “financial management discipline and cultural practice that enables organizations to get maximum business value by helping engineering, finance, technology and business teams to collaborate on data-driven spending decisions.”

Many of the disciplines and methods of FinOps are extensions of management accounting, applied to the complexities of digital infrastructure and cloud computing. Tools falling under the FinOps label automate tasks that, in a slower-moving, less-automated environment, would be carried out by finance teams using Excel. These tools can track consumption, model it, set alerts, apply showback and chargeback, and conduct scenario analysis to model the impact of using different services or of developing and introducing new applications. Governance and processes may be needed to identify or prevent overspend at an early stage.
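
As a simplified illustration of the showback and chargeback mechanics such tools automate, consider the sketch below. The billing records, teams and budgets are made up; real tools ingest provider billing exports and must also handle untagged spend, amortization of commitments and allocation of shared costs.

```python
# Simplified showback/chargeback mechanics on made-up, tagged billing records.

from collections import defaultdict

billing_records = [   # hypothetical billing lines tagged by owning team
    {"team": "payments", "service": "compute", "cost_usd": 1200.0},
    {"team": "payments", "service": "database", "cost_usd": 800.0},
    {"team": "analytics", "service": "storage", "cost_usd": 300.0},
    {"team": "analytics", "service": "compute", "cost_usd": 2500.0},
]

monthly_budgets = {"payments": 1800.0, "analytics": 3000.0}   # assumed budgets

spend_by_team: dict[str, float] = defaultdict(float)
for record in billing_records:
    spend_by_team[record["team"]] += record["cost_usd"]

# Showback report with a simple overspend alert per team:
for team, spend in spend_by_team.items():
    budget = monthly_budgets[team]
    flag = "ALERT: over budget" if spend > budget else "within budget"
    print(f"{team}: ${spend:,.0f} of ${budget:,.0f} ({flag})")
```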

The FinOps Foundation represents about 10,000 practitioners from various organizations worldwide, including around 90% of the Fortune 50. These companies have among the largest cloud spend of all corporates, and they are now helping to develop and standardize best practices for FinOps.

FinOps is helping organizations manage the growing complexity of their hybrid IT and increasingly multi-cloud environments. A recent report by management consultancy McKinsey & Company (The FinOps way: how to avoid the pitfalls to realizing cloud’s value) claimed that using FinOps can cut cloud costs by 20% to 30%. However, it also found that organizations do not develop at-scale FinOps processes until their cloud spend hits $100 million per year. This suggests that many organizations buying cloud services below this level have yet to adopt money-saving FinOps disciplines.

For the largest organizations, FinOps has rapidly become a critical part of modern cloud operations, alongside other important cloud operations disciplines, which include:

  • DevOps. A set of practices to unite cloud software development and IT operations team objectives and outcomes.
  • AIOps. The application of artificial intelligence (AI) and machine learning techniques to automate and improve IT operations.
  • DataOps. A set of collaborative practices for managing data quality, governance and continuous improvement across IT and operations teams.
  • Site reliability engineering. This combines software engineering with IT operations expertise to ensure cloud system availability through automation, monitoring, testing of production environments and incident response.

Like these disciplines, FinOps fills a critical gap in knowledge and skills as more IT, data and systems are managed in the cloud. It can also help to break down operational silos between cloud and traditional enterprise functions. For instance, FinOps sits at the intersection of IT, cloud and finance — it enables enterprises to more efficiently report, analyze and optimize cloud and other IT costs.

Controlling, forecasting and optimizing the costs of running applications in the cloud can be challenging for several reasons, especially when, as is often the case, more than one cloud provider is either being used or being considered. Listed below are the challenges involved in controlling cloud costs:

  • Cloud consumption is not always under the customer’s control, particularly in an on-demand pricing environment.
  • Each cloud provider measures the consumption of its service in different ways; there is no standard approach.
  • Each provider offers different incentives and discounts to customers.
  • Each cloud service has many metrics associated with it that need to be monitored, relating to utilization, optimization, performance and adhering to key performance indicators (KPIs). The more cloud vendor services that are consumed, the more complex this activity becomes.
  • Metrics are only sometimes related to tangible units that are easy for the customer to predict. For example, a transaction may generate a cost on a database platform; however, the customer may have no understanding or visibility of how many of these transactions will be executed in a given period.
  • Applications may scale up by accident due to errant code or human error and use resources without purpose. Similarly, applications may not scale down when able (reducing costs) due to incorrect configuration.
  • Resiliency levels can be achieved in different ways, with different costs. Higher levels of resiliency can add costs, some of which may not have been planned for initially, such as unexpected costs for using additional availability zones.

Why FinOps’ time is now

FinOps is not just about cost containment, it is also about identifying and realizing value from the cloud. One of FinOps’ goals is to help organizations analyze the value of new services by conducting a full cost analysis based on data. FinOps aims to close the gap between the different teams involved in commercial and financial calculations and in the development and deployment of new services.

The FinOps function — at least in some organizations — sits outside of the existing IT, finance and engineering teams to provide an independent, objective voice to arbitrate and negotiate when needed.

These are the key reasons driving FinOps adoption:

  • Cloud bills are increasing as adoption grows across the business and are now attracting attention from stakeholders as a significant long-term expense.
  • Growth in hybrid IT, where organizations use a mix of cloud locations and on-premises facilities, has stimulated the need for accurate data to make more informed decisions on workload placement.
  • AI model training and inferencing is a new class of workload that will drive (possibly explosive) demand for hybrid IT — both on-premises and public cloud consumption. Optimized financial processes that can predict capacity, consumption and bills will be essential.
  • In the years to come, many organizations will need to make strategic decisions about where to place large workloads and whether to use their own, colocation or cloud facilities. Better tools and disciplines are needed to model the very significant financial implications.
  • Macroeconomic conditions (notably rising inflation) are forcing organizations to reduce expenditure where possible to sustain gross margins.

Governance over digital infrastructure costs

Uptime believes that the more successful digital infrastructure organizations will develop FinOps capabilities over the coming years as cloud services — and their huge costs — become more embedded in the core of the business. This will bring a level of governance to cloud use that could be profound. In time, the larger organizations that depend on hybrid IT infrastructure will likely extend the discipline — or some integrated extension of it — to cover all IT, from on-premises to colocation, hosting and cloud.

Despite the hype, some measured skepticism is required. In its simplest iterations, FinOps can bring down obvious cloud overspend, but it can also add costs and complexity, and slow innovation. Further, it is still unclear to what extent new and dominant tools, standards and disciplines will become firmly established, and how integrated these functions will become with financial management across the rest of the digital infrastructure. Ideally, chief information officers and chief financial officers do not want to battle with an array of accounting methodologies, tools and reporting lines but instead use integrated sets of tools.

Most cloud providers already supply documentation and tools related to FinOps. This free capability represents a good starting point for implementing FinOps practices. However, cloud provider support is unlikely to be useful in making the organizational changes needed to help bridge the gap between finance and IT. Furthermore, the support provided by a cloud provider will only extend to the services offered by the cloud provider — multi-cloud optimization is far more complex. FinOps capabilities presented by cloud providers cannot offer an unbiased view of expenditure.

Third-party tools, such as Apptio Cloudability, Spot by NetApp, Flexera One, Kubecost and VMware Aria Cost, provide independent FinOps toolsets that can be used across multiple clouds. But larger cloud customers have resorted to developing their own tools. Financial giant Capital One built its own cloud financial management tool because it found its FinOps activities had outgrown its original commercial off-the-shelf product.

FinOps is an emerging discipline, and there is still work being done to achieve standardization across all stakeholders and interests. This is an important area for organizations to watch before they go too far with their investments.


The Uptime Intelligence View

It is clear there is huge cloud overspend at many organizations. However, cloud expenditure can only be effectively managed with specialist domain skills. Finance and IT need to work together to manage cloud costs, and a FinOps function is required to help strike a balance between cutting costs and enabling scalability and innovation. FinOps, however, should not be embraced uncritically. Software, standards and disciplines will take many years to mature. Ultimately, FinOps can become a foundational part of what is ultimately needed — a way to comprehensively manage and model all digital infrastructure costs.

The majority of enterprise IT is now off-premises

Corporate data centers have been the backbone of enterprise IT since the 1960s and continue to play an essential role in supporting critical business and financial functions for much of the global economy. Yet, while their importance remains, their prominence as part of an enterprise’s digital infrastructure appears to be fading.

Today, businesses have more options for where to house their IT workloads than ever before. Colocation, edge sites, public cloud infrastructure and software as a service all offer mature alternatives capable of taking on many, if not all, enterprise workloads.

Findings from the 2023 Uptime Institute Global Data Center Survey, the longest-running survey of its kind, show for the first time that the proportion of IT workloads hosted in on-premises data centers now represents slightly less than half of the total enterprise footprint. This marks an important and long-anticipated moment for the industry.

It does not mean that corporate data center capacity, usage or expenditure is shrinking in absolute terms, but it does indicate that for new workloads, organizations are tending to choose third-party data centers and services.

The share of workloads in corporate facilities is likely to continue to fall, with organizations switching to third-party venues as the preferred deployment model for their applications — each with their own set of advantages and drawbacks but all delivering capacity without any capital expenditure.

Senior management loves outsourcing

In Uptime’s 2020 annual data center survey, respondents reported that, on average, 58% of their organization’s IT workloads were hosted in corporate data centers. In 2023, this percentage fell to 48%, with respondents forecasting that just 43% of workloads would be hosted in corporate data centers in 2025 (see Figure 1).

Figure 1. Cloud and hosting grow at the expense of corporate data centers

Diagram: Cloud and hosting grow at the expense of corporate data centers

Increasingly, the economic odds are stacked against corporate facilities as businesses look to offload the financial burden and organizational complexity of building and managing data center capacity.

Specialized third-party data centers are typically more efficient than their on-premises counterparts. Larger cloud and colocation facilities benefit from economies of scale when purchasing mechanical and electrical equipment, helping them to achieve lower costs. For some applications, smaller third-party facilities are more attractive because they enable organizations to bring latency-sensitive or high-availability services closer to industrial or commercial sites.

Data center outsourcing has several other key benefits:

  • Capacity and change management. Outsourcing liberates corporate teams from the onerous task of finding the space and power required to expand their IT estates.
  • Staffing and skills. The data center industry is suffering from an acute shortage of staff. When outsourcing, this becomes someone else’s problem.
  • Environmental reporting. Outsourcing simplifies the process of compliance with current and upcoming sustainability regulations because it places much of the burden on the service provider.
  • Supporting new (or exotic) technologies. Outsourcing enables businesses to experiment with cutting-edge IT hardware without having to make large upfront investments (e.g., dense IT clusters that require liquid cooling).

Additional regulatory requirements that will come into force in the next two to three years — such as the EU’s Energy Efficiency Directive recast (see The EU’s Energy Efficiency Directive: ready or not here it comes) — will make data center outsourcing even more tempting.

The public cloud offers a further set of advantages over in-house data centers. These include a wide selection of hardware “instances” and the ability for customers to grow or shrink their actively used IT resources at will with no advance notification. In addition to flexible pay-as-you-go pricing models, customers can tweak their costs by opting for either on-demand, reserved or spot instances. It is no surprise that the public cloud accounts for a growing proportion of IT workloads today, with a share of 12%, up from 8% in 2020.

However, this does not mean that the public cloud is a perfect fit for every workload, particularly if the application is not rearchitected to take advantage of the technical and economic benefits provided by a cloud platform. One problem highlighted by Uptime Institute Intelligence is the lack of visibility into cloud providers’ platforms, which prevents customers from assessing their operational resiliency or gaining a better understanding of potential vulnerabilities. There is also the tendency for public cloud deployments to generate runaway costs. The intense competition between cloud providers and their proprietary software stacks makes multi-cloud strategies, which alleviate some of the inherent risks in cloud architectures, too costly and complex to implement.

In specific cases, cloud service providers enjoy an oligopolistic advantage in accessing the latest technologies. For example, Microsoft, Baidu, Google and Tencent recently spent billions of dollars securing very large numbers of GPUs to build out specialized artificial intelligence (AI) training clusters, depleting the supply chain and causing GPU shortages. In the near term, many businesses that opt to develop their own AI models will simply not be able to purchase the GPUs they need and will be forced to rent them from cloud providers.

In terms of regional differences, Asia-Pacific (excluding China) is leading the shift off-premises, with just 39% of IT workloads currently hosted in corporate data centers (see Figure 2). Historically, enterprise data centers have been more developed in North America and Europe, and this category of facilities may never develop to the same extent elsewhere.

Figure 2. The regions with fewer IT workloads in corporate data centers

Diagram: The regions with fewer IT workloads in corporate data centers

Digital infrastructure ecosystems in Asia and Latin America are likely to emerge as examples of “leapfrogging” — when regions with poorly developed technology bases advance directly towards the adoption of modern systems without going through intermediary steps.


The Uptime Intelligence View

In theory, the gradual movement of workloads away from in-house data centers gives colocation, cloud, and hosting providers more leverage when influencing the development of IT equipment, new mechanical and electrical designs, data center topologies, and crucially, software architectures. The constraints of enterprise data centers that have defined more than 60 years of business IT will gradually become less important. Does this mean that almost all IT workloads will — eventually — end up running in third-party data centers? This is unlikely, but it will be almost impossible for this trend to reverse: for organizations “born” in a public cloud or using colocation, moving to their own enterprise data center will tend to be highly unattractive or cost prohibitive.

Large data centers are mostly more efficient, analysis confirms

Uptime Institute calculates an industry average power usage effectiveness (PUE), which is a ratio of total site power to IT power, each year using data from the Uptime Institute Global Data Center Survey. This PUE data is pulled from a large sample over the course of 15 years and provides a reliable view of progress in facility efficiency.

Uptime’s data shows that industry PUE has remained at a high average (ranging from 1.55 to 1.59) since around 2020. Despite ongoing industry modernization, this overall PUE figure has remained almost static, in part because many older and less-efficient legacy facilities have a moderating effect on the average. In 2023, industry average PUE stood at 1.58.

For the 2023 annual survey, Uptime refined and expanded the survey questionnaire to provide deeper insights into the industry trends and improvements underlying the slow-moving average. This analysis builds on Uptime’s recent PUE research that focused on the influence of facility age and regional location (see Global PUEs — are they going anywhere?).

Influence of larger sites on PUE

Uptime’s headline PUE figure of 1.58 (weighted per respondent, for consistency with historical data) approximates the efficiency of an average facility. The new survey data allows us to analyze PUE in greater detail across a range of facility sizes. We applied provisioned IT capacity (in megawatts, MW) as a weighting factor, to examine PUE of a normalized unit of IT power. Using this approach, the capacity-weighted PUE figure is 1.47 — a result that many may expect, given the large amount of total IT power deployed in larger (often newer) data centers.
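
The gap between the two figures comes down to how the average is weighted. The sketch below uses made-up survey responses, not Uptime data, to show the mechanics of per-respondent versus capacity-weighted averaging.

```python
# Mechanics of per-respondent versus capacity-weighted PUE averaging, using
# made-up sites rather than Uptime survey data.

sites = [                          # (annual PUE, provisioned IT capacity in MW)
    {"pue": 1.80, "it_mw": 1.0},   # small legacy facility
    {"pue": 1.60, "it_mw": 3.0},
    {"pue": 1.40, "it_mw": 10.0},
    {"pue": 1.25, "it_mw": 40.0},  # large new build
]

# Per-respondent average: every site counts equally, as in the headline figure.
per_respondent = sum(s["pue"] for s in sites) / len(sites)

# Capacity-weighted average: every megawatt of IT counts equally, so the large,
# efficient sites dominate the result.
capacity_weighted = (sum(s["pue"] * s["it_mw"] for s in sites)
                     / sum(s["it_mw"] for s in sites))

print(f"Per-respondent average PUE: {per_respondent:.2f}")        # 1.51 here
print(f"Capacity-weighted average PUE: {capacity_weighted:.2f}")  # 1.31 here
```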

Larger facilities tend to be more efficient — most are relatively new and use leading-edge equipment, with more efficient cooling designs and optimized controls. Modernization of smaller facilities is less likely to yield a return on investment from energy savings. In Figure 1, Uptime compares survey respondents’ annual average PUE based on their data center’s provisioned IT capacity, showing a clear trend of efficiency improvements as data centers increase in capacity size.

Figure 1. Weighted average PUE by data center IT capacity

Diagram: Weighted average PUE by data center IT capacity

The PUE metric was introduced to track the efficiency of a given data center over time, rather than to compare between different data centers. Uptime analyzes PUE in large sample sizes to track trends in the industry, including the influence of facility size on PUE. Other factors shaping PUE include IT equipment utilization, facility design and age, system redundancy and local climate conditions.

Capacity and scrutiny will grow

Data centers are expanding in capacity — and this will warrant closer attention to the influence of facility size on efficiency. Some campuses currently have capacities of 300 MW, and several others are planned to reach in excess of 1 gigawatt (GW), which is between 10 and 30 times more power than the largest data centers seen in recent years. Uptime has identified approximately 28 hyperscale colocation campuses in development in addition to existing large hyperscale cloud sites. If the planned capacity of these sites is realized, they would account for approximately one-quarter of data center energy consumption globally.

These hyperscale colocation campuses, in common with many large new colocation facilities in existing prime data center locations, are designed for PUEs significantly below the industry average (1.4 and lower). Scala Data Centers is one example: it is building its Tamboré Campus in São Paulo (Brazil), which is intended to reach 450 MW with a PUE of 1.4. To preserve an economic advantage, the organization will need to optimize efficiency as the number of tenants filling the data halls increases.

Cloud hyperscalers Google, Amazon Web Services and Microsoft already claim PUE of 1.2 or lower at some sites. However, this is not always representative of the actual PUE of a customer application in the cloud. Their workloads may be provisioned by a colocation partner whose PUE is higher, or cloud workloads may need to be replicated across one or more availability zones — driving up energy usage and aggregating PUE across multiple facilities.

PUE improvements will be demanded as legislatures start to reference PUE in binding regulations. The new Energy Efficiency Act in Germany, which came into force in September 2023, mandates data centers in Germany to achieve a PUE of 1.5 from July 1, 2027 and a PUE of 1.3 from July 1, 2030. New data centers opening from July 1, 2026 are required to have a PUE of 1.2 or less — which even new-build operators may find challenging at higher levels of resiliency.


The Uptime Intelligence View

Facility size is one of many important factors influencing facility efficiency, and this is reflected in the capacity-weighted average PUE figure of 1.47, as opposed to the per-site average of 1.58. The data suggests that, over time, the replacement of older sites with larger, more efficient ones may produce a more immediate improvement in industry efficiency than modernizing smaller sites.

Jacqueline Davis, Research Analyst, [email protected]

John O’Brien, Senior Research Analyst, [email protected]