Does the spread of direct liquid cooling make PUE less relevant?
The power usage effectiveness (PUE) metric is predominant thanks to its universal applicability and its simplicity: energy used by the entire data center, divided by energy used by the IT equipment. However, its simplicity could limit its future relevance, as techniques such as direct liquid cooling (DLC) profoundly change the profile of data center energy consumption.
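The calculation itself is trivially simple, which is much of the metric's appeal. A minimal sketch with hypothetical figures: a facility drawing 1,500 kW in total, of which 1,200 kW is consumed by IT equipment, has a PUE of 1.25.

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power usage effectiveness: total facility energy divided by IT energy."""
    return total_facility_kw / it_kw

# Hypothetical facility: 1,500 kW total draw, 1,200 kW of it consumed by IT.
print(pue(1500, 1200))  # 1.25
```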
Ever since The Green Grid developed it in 2007, PUE has been used beyond its original intention, including as a single defining efficiency metric and as a comparative benchmark between different data centers. Annualized PUE has become the global de facto standard for data center energy efficiency, in part because it can hide many sins: PUE doesn't account for important trade-offs in, for example, resiliency, water consumption and, perhaps most crucially, the efficiency of IT.
However, looming technical changes to facility infrastructure could, if extensively implemented, render PUE unsuitable for historical or contemporary benchmarking. One such change is the possibility of DLC entering mainstream adoption. While DLC has been an established technology for decades, it has remained niche; some in the data center sector think it is on the verge of wider use.
Among the drivers for DLC is the ongoing escalation of server processor power, which could mean new servers will increasingly be offered in both traditional and DLC configurations.
According to a recent Uptime survey, only one in four respondents think air cooling will remain dominant beyond the next decade in data centers larger than 1 megawatt (MW; see Figure 1).
Regardless of the form (full or partial immersion, or direct-to-chip cold plates), DLC reshapes the composition of energy consumption across the facility and IT infrastructure, beyond simply lowering the calculated PUE to near its absolute limit. Most DLC implementations achieve a partial PUE of 1.02 to 1.03, outperforming even the most efficient air-cooling systems by low single-digit percentages. But PUE does not capture most of DLC's energy gains, because DLC also lowers the power consumption of the IT equipment itself, raising questions about how to account for infrastructure efficiency.
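The arithmetic behind this blind spot is worth spelling out. In the sketch below (all figures hypothetical, chosen only to illustrate the mechanism), a DLC retrofit cuts both facility overhead and IT power, but PUE only credits the overhead reduction; savings inside the IT denominator are invisible to the metric, and in isolation they actually push PUE up.

```python
def pue(it_kw: float, overhead_kw: float) -> float:
    """PUE expressed as (IT power + facility overhead) over IT power."""
    return (it_kw + overhead_kw) / it_kw

# Hypothetical air-cooled baseline: 1,000 kW of IT load (including server
# fans) plus 250 kW of facility cooling and power-distribution overhead.
print(pue(1000, 250))  # 1.25

# Hypothetical DLC retrofit: fans largely eliminated and chips run cooler,
# so IT power drops to 900 kW, while facility overhead falls to 30 kW.
print(pue(900, 30))    # ~1.03

# Total draw falls from 1,250 kW to 930 kW (about 26%), yet PUE reflects
# only the overhead reduction. With overhead held constant, cutting IT
# power alone makes PUE *worse*, despite 100 kW of real savings:
print(pue(900, 250))   # ~1.28
```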
In other words, DLC changes enough variables outside the scope of PUE that its application as an energy efficiency metric becomes unsuitable.
There are two major reasons why DLC PUEs are qualitatively different from PUEs of air-cooled infrastructure. One is that DLC systems do not require most of the IT system fans that move air through the chassis (cold-plate systems still need some fans in power supplies, and for low-power electronics). Because server fans are powered by the server power supply, their consumption counts as IT power. Suppliers have modeled fan power consumption extensively, and it is a non-trivial amount. Estimates typically range between 5% and 10% of total IT power depending on fan efficiency, size and speeds (supply air temperature can also be a factor).
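Using the 5% to 10% range cited above, the fan share of a given IT load can be bracketed with a back-of-the-envelope calculation (the load figure below is hypothetical):

```python
def fan_power_range(it_kw: float, low: float = 0.05, high: float = 0.10):
    """Bracket server fan draw as a share of total IT power, using the
    commonly cited 5-10% range (actual share depends on fan efficiency,
    size, speeds and supply air temperature)."""
    return it_kw * low, it_kw * high

# For a hypothetical 1,000 kW IT load, fans account for roughly 50-100 kW,
# all of which is metered as IT power under PUE accounting.
print(fan_power_range(1000))  # (50.0, 100.0)
```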
The other, less-explored component of IT energy is semiconductor power loss due to temperature. Modern high-performance processors are prone to relatively high leakage currents, which flow even when the chip is not cycling (sleeping circuits with no clock signal). This is known as static power, as opposed to the dynamic (active) power consumed when a gate switches state to perform work. As the scale of integration grows with more advanced chip manufacturing technologies, so does the challenge of leakage. Despite chipmakers' efforts to contain it without giving up too much performance or transistor density, static power remains significant in the total power equation for large compute chips tuned for performance, such as server processors.
Static power, unlike dynamic power, correlates strongly with temperature. Because DLC systems can maintain chip operating temperatures far below those achievable with air (say, at 48 degrees Celsius/118.4 degrees Fahrenheit, as opposed to 72 degrees Celsius/161.6 degrees Fahrenheit for air-cooled systems), they can dramatically reduce static power. In a 2010 study on a supercomputer in Japan, Fujitsu estimated that water cooling lowered processor power by a little over 10% when the chip was cooled from 85 degrees Celsius/185 degrees Fahrenheit to 30 degrees Celsius/86 degrees Fahrenheit. Static power has likely become a bigger problem since this study was conducted, suggesting that cooler chip operation has the potential to curb total IT power by several percentage points.
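A toy model illustrates the shape of this effect. A commonly cited rule of thumb is that leakage roughly doubles for every 10 degrees Celsius of temperature rise; the figures below (a 30 W static power baseline, the doubling interval itself) are assumptions for illustration, not datasheet values.

```python
def static_power(p_ref_w: float, t_ref_c: float, t_c: float,
                 doubling_c: float = 10.0) -> float:
    """Toy leakage model: static power doubles every `doubling_c` degrees C
    relative to a reference point. A rough rule of thumb, not a datasheet
    figure; real leakage behavior varies by process and design."""
    return p_ref_w * 2 ** ((t_c - t_ref_c) / doubling_c)

# Hypothetical chip dissipating 30 W of static power at 85 C:
print(static_power(30, 85, 85))  # 30.0 W
print(static_power(30, 85, 30))  # ~0.66 W at 30 C under this model

# If static power were a low-double-digit share of total chip power at
# 85 C, eliminating nearly all of it by cooling to 30 C would cut total
# chip power on the order of the ~10% Fujitsu estimate cited above.
```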
Without guidance from chipmakers on the static power profile of their processors, the only way to quantify this energy benefit is via experimentation. Worse still, the impact on total power will vary across servers using different chips, for multiple reasons (e.g., processor utilization, workload intensity, and semiconductor technology and manufacturing variations between chipmakers or chip generations). All this complicates the case for including static power in a new efficiency metric, or in the business case for DLC. In other words, the effect is known to exist, but its extent is not.
There are other developments in infrastructure design that could undermine the relevance of PUE. For example, distributed, rack-integrated uninterruptible power supplies with small battery packs can become part of the IT infrastructure, rather than the purview of facilities management.

If the promise of widespread DLC adoption materializes, PUE in its current form may be heading toward the end of its usefulness. The absence of a useful PUE metric would represent a discontinuity in historical trending. Moreover, it would hollow out competitive benchmarking: all DLC data centers will be very efficient, with immaterial differences in energy performance. If liquid-cooled servers gain a stronger foothold (as many, but not all, in the data center sector expect), operators will likely need a new metric for energy efficiency, if not as a replacement for PUE, then as a supplement. Tracking IT utilization, together with an overall more granular approach to monitoring the power consumption of workloads, could quantify efficiency gains much better than any future version of PUE.