OT protection: is air-gapping the answer?

Cyberattacks on operational technology (OT) were virtually unknown until about five years ago, but their volume has roughly doubled each year since 2019. This threat is distinct from the IT-focused vulnerabilities that cybersecurity measures regularly address. The risk associated with OT compromise is substantial: power or cooling failures can cripple an entire facility, potentially for weeks or months.

Many organizations believe their OT systems are air-gapped (isolated from IT networks), protecting them against external threats. The term is often used incorrectly, however, which makes exploits harder to anticipate. Data center managers need to understand the nature of the threat and their defense options to protect their critical environments.

Physical OT consequences of rising cyberattacks

OT systems are used in most critical environments. They automate power generation, water and wastewater treatment, pipeline operations and other industrial processes. Unlike IT systems, which are inter-networked by design, these systems operate autonomously and are dedicated to the environment where they are deployed.

OT is essential to data center service delivery: this includes the technologies that control power and cooling and manage generators, uninterruptible power supplies (UPS) and other environmental systems.

Traditionally, OT has lacked robust native security, relying on air-gapping for defense. However, the integration of IT and OT has eroded OT’s segregation from the broader corporate attack surface, and threat sources are increasingly targeting OT as a vulnerable environment.

Figure 1. Publicly reported OT cyberattacks with physical consequences

Research from Waterfall Security shows a 15-year trend in publicly reported cyberattacks that resulted in physical OT consequences. The data shows that these attacks were rare before 2019, but their incidence has risen steeply over the past five years (see 2024 Threat report: OT cyberattacks with physical consequences).

This trend should concern data center professionals. OT power and cooling technologies are essential to data center operations. Most organizations can restore a compromised IT system effectively, but few (if any) can recover from a major OT outage. Unlike IT systems, OT equipment cannot be restored from backup.

Air-gapping is not air-tight

Many operators believe their OT systems are protected by air-gaps: systems that are entirely isolated from external networks. True air-gaps, however, are rare — and have been for decades. Suppliers and customers have long recognized that OT data supports high-value applications, particularly remote monitoring and predictive maintenance. In large-scale industrial applications — and data centers — predictive maintenance can drive better resource utilization, uptime and reliability. Sensor data enables firms to, for example, diagnose potential issues, automate replacement part orders and schedule technicians, all of which helps to increase equipment life spans, reduce maintenance costs and ease availability concerns.

In data centers, these sensing capabilities are often “baked in” to warranty and maintenance agreements. But the data required to drive IT predictive maintenance systems comes from OT systems, and this means the data needs a route from the OT equipment to an IT network. In most cases, the data route is designed to work in one direction, from the OT environment to the IT application.

Indeed, the Uptime Institute Data Center Security Survey 2023 found that nearly half of operators using six critical OT systems (including UPS, generators, fire systems, electrical management systems, cooling control and physical security / access control) have enabled remote monitoring. Only 12% of this group, however, have enabled remote control of these systems, which requires a path leading from IT back to the OT environment.

Remote control has operational benefits but increases security risk. Paths from OT to IT enable beneficial OT data use, but also open the possibility of an intruder following the same route back to attack vulnerable OT environments. A bi-directional route exposes OT to IT traffic (and, potentially, to IT attackers) by design. A true air-gap (an OT environment that is not connected in any way to IT applications) is better protected than either of these alternatives, but will not support IT applications that require OT data.

Defense in depth: a possible solution?

Many organizations use the Purdue model as a framework to protect OT equipment. The Purdue model divides the overall operating environment into six layers (see Table 1). Physical (mechanical) OT equipment is level 0. Level 1 refers to instrumentation directly connected to level 0 equipment. Level 2 systems manage the level 1 devices. IT/OT integration is a primary function at level 3, while levels 4 and 5 refer to IT networks, including IT systems such as enterprise resource planning systems that drive predictive maintenance. Cybersecurity typically focuses on the upper levels of the model; the lower-level OT environments typically lack security features found in IT systems.

Table 1. The Purdue model as applied to data center environments

Lateral and vertical movement challenges

When organizations report that OT equipment is air-gapped, they usually mean that the firewalls between the different layers are configured to only permit communication between adjacent layers. This is intended to prevent an intruder from jumping from the upper (IT) layers straight to the lower (OT) levels. However, data needs to move vertically across layers, from the physical equipment to the IT applications — otherwise the IT applications would not receive the required input from OT. Because these vertical paths exist, an adversary can “pivot” an attack from one layer to the next.

This is not the type of threat that most IT security organizations expect. Enterprise cyber strategies look for ways to reduce the impact of a breach by limiting lateral movement (an adversary’s ability to move from one IT system to another, such as from a compromised endpoint device to a server or application) across the corporate network. The enterprise searches for, responds to, and remediates compromised systems and networks.

OT networks also employ the principles of detecting, responding to and recovering from attacks. The priority of OT, however, is to prevent vertical movement and to avoid threats that can penetrate the IT/OT divide.
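
To make the adjacency rule concrete, the following is a minimal, hypothetical sketch of auditing permitted network flows against an "adjacent Purdue levels only" policy. The level assignments and the flow list are invented for illustration; real deployments would pull this data from firewall configurations.

```python
# Hypothetical audit of permitted flows against a Purdue-style policy:
# only flows between adjacent levels are allowed. All data is illustrative.

PURDUE_LEVEL = {
    "chiller-plc": 1,     # level 1: instrumentation/controllers
    "bms-server": 2,      # level 2: supervisory control
    "historian": 3,       # level 3: IT/OT integration
    "erp-app": 4,         # level 4/5: enterprise IT
}

# (source, destination) pairs taken from firewall rules - illustrative only
allowed_flows = [
    ("chiller-plc", "bms-server"),
    ("bms-server", "historian"),
    ("erp-app", "historian"),
    ("erp-app", "chiller-plc"),   # violates the policy: level 4 -> level 1
]

def audit(flows):
    """Flag any permitted flow that skips a Purdue level."""
    for src, dst in flows:
        gap = abs(PURDUE_LEVEL[src] - PURDUE_LEVEL[dst])
        if gap > 1:
            print(f"VIOLATION: {src} (L{PURDUE_LEVEL[src]}) <-> "
                  f"{dst} (L{PURDUE_LEVEL[dst]}) skips {gap - 1} level(s)")

audit(allowed_flows)
```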

Key defense considerations

Data center managers should establish multiple layers of data center OT defense. The key principles include:

  • Defending the conduits. Attacks propagate from the IT environment through the Purdue model via connections between levels. Programmable firewalls are points of potential failure. Critical facilities use network engineering solutions, such as unidirectional networks and/or gateways, to prevent vertical movement (see the sketch after this list).
  • Maintaining cyber, mechanical and physical protection. A data center OT security strategy combines “detect, respond, recover” capabilities, mechanical checks (e.g., governors that limit AC compressor speed or that respond to temperature or vibration), and physical security vigilance.
  • Preventing OT compromise. The term air-gapping is often applied to approaches that may not actually be effective in preventing OT compromise. Data center managers need to ensure that defensive measures will protect data center OT systems from outages that could cripple an entire facility.
  • Weighing up the true cost of IT/OT integration. Business cases for applications that rely on OT data generally anticipate reduced maintenance downtime. But the cost side of the ledger needs to extend to expenses associated with protecting OT.
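
As an illustration of the "defending the conduits" principle above, the sketch below shows a software one-way relay: telemetry is read on the OT side and pushed to an IT-side collector over UDP, with no listening socket on the OT side for return traffic. In production this role is played by hardware unidirectional gateways (data diodes); the collector address, port and payload here are hypothetical.

```python
# Minimal one-way telemetry relay sketch (illustrative only).
# Readings are pushed OT -> IT over UDP; the relay never listens for,
# or reads, traffic coming back from the IT side.
import json
import socket
import time

IT_COLLECTOR = ("192.0.2.50", 5140)  # hypothetical IT-side historian endpoint

def read_ot_sensors():
    """Placeholder for a local read from OT instrumentation (e.g., serial bus)."""
    return {"ups_load_pct": 42.0, "supply_air_c": 21.5, "ts": time.time()}

def main():
    # Send-only datagram socket: no bind(), no recv(), so there is no
    # software path for an IT-side peer to push commands back.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        payload = json.dumps(read_ot_sensors()).encode()
        sock.sendto(payload, IT_COLLECTOR)
        time.sleep(10)

if __name__ == "__main__":
    main()
```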

The Uptime Intelligence View

Attacks on OT systems are increasing, and defense strategies are often based on faulty assumptions and informed by strategies that reflect IT rather than OT requirements. A successful attack on data center OT could be catastrophic, resulting in long-term (potentially months long) facility outages.

Many OT security measures stem from the belief that OT is protected by air-gapping, but this description may not be accurate and the protection may not be adequate in the face of escalating OT attacks. Data center managers should consider OT security strategies aimed at threat prevention. These should combine the detect, respond, recover cyber principles and engineering-grade protection for OT equipment.

It is also crucial to integrate vigilant physical security to protect against unauthorized access to vulnerable OT environments. Prevention is the most critical line of defense.

Managing server performance for power: a missed opportunity

An earlier Uptime Intelligence report discussed the characteristics of processor idle power management (known as C-states) and explained how they can reduce server energy consumption, making a substantial contribution to the overall energy performance and sustainability of data center infrastructure (see Understanding how server power management works). During periods of low activity, such features can potentially lower server power requirements by more than 20%, in return for longer response times to requests.

But there is more to managing server power than just conserving energy when the machine is not busy — setting processor performance levels that are appropriate for the application is another way to optimize energy performance. This is the crux of the issue: there is often a mismatch between the performance delivered and the performance required for a good quality of service (QoS).

When the performance is too low, the consequences are often clear: employees lose productivity, customers leave. But when application performance exceeds needs, the cost remains hidden: excessive power use.

Server power management: enter P-states

Uptime Intelligence survey data indicates that power management remains an underused feature — most servers do not have it enabled (see Tools to watch and improve power use by IT are underused). The extra power use may appear small at first, amounting to only tens of watts per server. But when scaled to larger facilities or to the global data center footprint, those watts add up to a huge waste of power and money.

The potential to improve the energy performance of data center infrastructure is material, but the variables involved in adopting server power management mean it is not a trivial task. Modern chip design is what creates this potential. All server processors in operation today are equipped with mechanisms to change their clock frequency and supply voltage in predefined frequency-voltage pairs (a technique called dynamic voltage-frequency scaling). Initially, these techniques were devised to lower energy use in laptops and other low-power systems when running code that does not fully utilize resources. These operating points, known as P-states, are in addition to C-states (low-power modes used when the processor is idle).
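
On Linux, this operating-point machinery is exposed through the cpufreq subsystem in sysfs. The following is a minimal sketch for inspecting it; the paths are standard sysfs locations, but which files exist depends on the processor and the cpufreq driver in use.

```python
# Inspect Linux cpufreq settings for CPU 0 (availability depends on the
# cpufreq driver, e.g., acpi-cpufreq or intel_pstate).
from pathlib import Path

CPU0 = Path("/sys/devices/system/cpu/cpu0/cpufreq")

def read(name):
    f = CPU0 / name
    return f.read_text().strip() if f.exists() else "n/a"

print("driver:            ", read("scaling_driver"))
print("governor:          ", read("scaling_governor"))
print("min freq (kHz):    ", read("cpuinfo_min_freq"))
print("max freq (kHz):    ", read("cpuinfo_max_freq"))
print("current freq (kHz):", read("scaling_cur_freq"))
# Discrete frequency steps are only listed by some drivers:
print("available steps:   ", read("scaling_available_frequencies"))
```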

Later, mechanisms were added to do the opposite: increase clock speeds and voltages beyond nominal rates as long as the processor stays within hard limits for power, temperature and frequency. The effect of this approach, known as turbo mode, has gradually become more pronounced with ever-higher core counts, particularly in servers (see Cooling to play a more active role in IT performance and efficiency). As processors dynamically reallocate the power budget from lowly utilized or idling cores to highly stressed ones, clock speeds can well exceed nominal ratings — often close to doubling. In recent CPUs, even the power budget can be calibrated higher than the factory default.

As a result, server-processor behavior has become increasingly opportunistic in the past decade. When allowed, processors will dynamically seek out the electrical configuration that yields maximum performance if the software (signaled by the operating system, detected by the hardware mechanisms, or both) requests it. Such behavior is generally great for performance, particularly in a highly mixed application environment where certain software benefits from running on many cores in parallel while others prefer fewer but faster ones.

The unquantified costs of high performance

Ensuring top server performance comes at the cost of using more energy. For performance-critical applications such as technical computing, financial transactions, high-speed analytics and real-time operating systems, the use and cost of energy is often not a concern.

But for a large array of workloads, this will result in a considerable amount of energy waste. There are two main components to this waste. First, the energy consumption curve for semiconductors gets steeper the closer the chip pushes to the top of its performance envelope, because both dynamic (switching) and static (leakage) power rise disproportionately with voltage and frequency. All the while, the performance gains diminish because the rest of the system, including the memory, storage and network subsystems, will be unable to keep up with the processor’s race pace. This increases the amount of time the processor spends waiting for data or instructions.

Second, energy waste originates from a mismatch between performance and QoS. Select applications and systems, such as transaction processing and storage servers, tend to have defined QoS policies for performance (e.g., responding to 99% of queries within a second). QoS is typically about setting a floor below which performance should not drop — it is rarely about ensuring systems do not overperform, for example, by processing transactions or responding to queries unnecessarily fast.

If a second for a database query is still within tolerance, there is, by definition, limited value in a response under one-tenth of a second just because the server can process the query that fast when the load is light. And yet, it happens all the time. For many, if not most, workloads, this level of overperformance is neither defined nor tracked, which invites an exploration of what level of QoS is actually acceptable.
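
To make the QoS point concrete, here is a minimal sketch of checking a 99th-percentile response-time objective. The latency samples are synthetic and the one-second target is hypothetical.

```python
# Check a simple QoS floor: 99% of queries answered within 1.0 second.
# The latency samples and the target are hypothetical.
import numpy as np

target_s = 1.0
latencies_s = np.random.lognormal(mean=-2.5, sigma=0.7, size=10_000)  # synthetic

p99 = np.percentile(latencies_s, 99)
print(f"p99 latency: {p99:.3f}s  (target: {target_s:.1f}s)")
print("QoS met" if p99 <= target_s else "QoS violated")

# Headroom below the target is the margin available for slower, more
# power-efficient P-state settings without breaching the QoS policy.
print(f"headroom: {target_s - p99:.3f}s")
```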

Governing P-states for energy efficiency

At its core, the governance of P-states is like managing idle power through C-states, except with many more options, which adds complexity through choice. Similar to C-states, a higher number denotes a greater potential energy saving: P0 is the highest-performance state a processor can select, and P2 consumes less power than P1. This report does not discuss the exact number of P-states because it is highly dependent on the processor used. There are four broad approaches to controlling P-states:

  • No P-state control. This option tends to result in aggressive processor behavior that pushes for the maximum speeds available (electronically and thermally) for any and all of its cores. While this will result in the most energy consumed, it is preferable for high-performance applications, particularly latency-sensitive applications where every microsecond counts. If this level of performance is not justified by the workload, it can be an exceedingly wasteful control mode.
  • Hardware control. Also called the autonomous mode, this leaves P-state management for the processor to decide based on detected activity. While this mode allows for very fast transitions between states, it lacks the runtime information gathered by the operating system; hence, it will likely result in only marginal energy savings. On the other hand, this approach is agnostic of the operating system or hypervisor. The expected savings compared with no P-state control are up to around 10%, depending on the load and server configuration.
  • Software-hardware cooperation. In this mode, the operating system provider gives the processor hints on selecting the appropriate P-states. The theory is that this enables the processor control logic to make better decisions than pure hardware control while retaining the benefit of fast transitions between states to maintain system responsiveness. Power consumption reductions here can be as high as 15% to 20% at low to moderate utilization.
  • Software control. In this mode, the operating system uses a governor (a control mechanism that regulates a function, in this case clock speed) to make the performance decisions, which the processor executes if the electrical and thermal conditions (supply voltage and current, clock frequency and silicon temperature) allow it. This mode typically carries the biggest energy-saving potential when a sophisticated software governor is used. Both Windows and Linux operating systems offer predefined power plans that let the system administrator prioritize performance, balanced operation or lower energy use (see the sketch after this list).
    The trade-off here is additional latency: whenever the processor is in a low-performance state and transitions to a higher-performance state (e.g., P0) in response to a bout of compute or interrupt activity, it takes material time. Highly latency-sensitive and bursty workloads may see substantial impact.
    Power reductions can be outsized across most of the system load curve. Depending on the sophistication of the operating system governor and the selected power plan, energy savings can reach between 25% and 50%.
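
On Linux, the software-control and cooperative modes described above map onto the cpufreq governor and, where the driver supports it, the energy_performance_preference hint. A minimal sketch, assuming root access, of switching every core to a power-oriented policy; which governors and hints are available depends on the cpufreq driver.

```python
# Switch every CPU to a power-oriented cpufreq policy (requires root).
# Governor and hint names depend on the cpufreq driver in use.
from pathlib import Path

GOVERNOR = "powersave"          # e.g., "schedutil" or "powersave"
EPP_HINT = "balance_power"      # used by drivers that support HWP-style hints

for policy in Path("/sys/devices/system/cpu/cpufreq").glob("policy*"):
    (policy / "scaling_governor").write_text(GOVERNOR)
    epp = policy / "energy_performance_preference"
    if epp.exists():
        epp.write_text(EPP_HINT)
    print(policy.name, "->", GOVERNOR)
```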

While there are inevitable trade-offs between performance and efficiency, in all the control scenarios the impact on performance is often negligible. This is true for users of business and web applications, or for the total runtime of technical computing jobs. High-performance requirements alone do not preclude the use of P-state control: once the processor selects P0, there is no difference between a system with controls and one without.

Applications that do not tolerate dynamic P-state controls well are often the already-suspected exceptions. This is due to their latency-sensitive, bursty nature, where the processor is unable to match the change in performance needs (scaling voltages and frequency) fast enough, even though it takes microseconds.

Arguably, for most use cases, the main concern should be power consumption, not performance. Server efficiency benchmarking data, such as that published by the Standard Performance Evaluation Corporation and The Green Grid, indicates that modern servers achieve the best energy efficiency when their performance envelope is limited (e.g., to P2) because it prevents the chip from aggressively seeking the highest clock rates across many cores. This would result in disproportionately higher power use for little in return.
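
Linux does not expose P-state numbers directly, but capping the maximum scaling frequency is a rough proxy for limiting the performance envelope in the way described above. A minimal sketch, assuming root access; the 2.4 GHz cap is a hypothetical value.

```python
# Cap the maximum CPU frequency as a crude stand-in for limiting P-states
# (requires root; the cap value is illustrative).
from pathlib import Path

CAP_KHZ = 2_400_000  # 2.4 GHz, hypothetical

for policy in Path("/sys/devices/system/cpu/cpufreq").glob("policy*"):
    (policy / "scaling_max_freq").write_text(str(CAP_KHZ))
    print(policy.name, "max freq capped at", CAP_KHZ, "kHz")
```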

Upcoming Uptime Intelligence reports will identify the software tools that data center operators can use to monitor and manage the power and performance settings of server fleets.


The Uptime Intelligence View

Server power management in its multiple forms offers data center operators easy wins in IT efficiency and opportunities to lower operational expenditure. It will be especially attractive to small- and medium-sized enterprises that run mixed workloads with low criticality, yet the effects of server power management will be much more impressive when implemented at scale and for the right workloads.

Resiliency v low PUE: regulators a catalyst for innovation?

Research has shown that while data center owners and operators are mostly in favor of sustainability regulation, they have a low opinion of the regulators’ expertise. Some operators have cited Germany’s Energy Efficiency Act as evidence: the law lays down a set of extremely challenging power usage effectiveness (PUE) requirements that will likely force some data centers to close and prevent the building of others.

The headline requirement of the act (see Table 1) is that new data centers (i.e., those that become operational in 2026 and beyond) must have an annual operational PUE of 1.2 or below within two years of commissioning. While meeting this tough energy efficiency stipulation is possible, very few data centers today — even new and well-managed ones — are achieving this figure. Uptime Intelligence data shows an average PUE of 1.45 for data centers built in the past three years.
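
As a reminder of what the target means in arithmetic terms, here is a minimal sketch of computing an annualized operational PUE; the meter readings are hypothetical.

```python
# Annualized PUE = total facility energy / IT equipment energy over a year.
# Both readings below are hypothetical (roughly a 5 MW average IT load).
total_facility_kwh = 52_560_000   # utility meter, 12 months
it_equipment_kwh   = 43_800_000   # IT (UPS output or rack-level) meter, 12 months

pue = total_facility_kwh / it_equipment_kwh
print(f"Annualized PUE: {pue:.2f}")          # 1.20 in this example
print("Meets 1.2 limit" if pue <= 1.2 else "Exceeds 1.2 limit")
```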

This stretch goal of 1.2 is particularly challenging for those trying to grapple with competing business or technical requirements. For example, some are still trying to populate new data centers, meaning they are operating at partial load; many have overriding requirements to meet Tier III concurrent maintainability and Tier IV fault-tolerant objectives; and increasingly, many are aiming to support high-density racks or servers with strict cooling requirements. For these operators, achieving this PUE level routinely will require effort and innovation that is at the limits of modern data center engineering — and it may require considerable investment.

The rules for existing data centers are also tough. By 2030, all data centers will have to operate below a PUE of 1.3. This requirement will ultimately lead to data center closures, refurbishments and migration to newer colocations and the cloud — although it is far too early to say which of these strategies will dominate.

Table 1. PUE requirements under Germany’s Energy Efficiency Act

To the limits

The goal of Germany’s regulators is to push data center designers, owners and operators to the limits. In doing so, they hope to encourage innovation and best practices in the industry that will turn Germany into a model of efficient data center operations and thus encourage similar practices across Europe and beyond.

Whether the regulators are pushing too hard on PUE or if the act will trigger some unintended consequences (such as a movement of workloads outside of Germany) will likely only emerge over time. But the policy raises several questions and challenges for the entire industry. In particular, how can rigorous efficiency goals be achieved while maintaining high availability? And how will high-density workloads affect the achievement of efficiency goals?

We consulted several Uptime Institute experts* to review and respond to these questions. They take a mostly positive view that while the PUE requirements are tough, they are achievable for new (or recently constructed) mission-critical data centers — but there will be cost and design engineering consequences. Some of their observations are:

  • Achieving fault tolerance / concurrent maintainability. Higher availability (Tier III and Tier IV) data centers can achieve very low PUEs — contrary to some reports. A Tier IV data center (fully fault tolerant) is not inherently less efficient than a Tier II or Tier III (concurrently maintainable) data center, especially given recent advances in equipment technology (such as digital scroll compressors, regular use of variable frequency drives, and more sophisticated control and automation capabilities). These are some of the technologies that can help ensure that the use of redundant extra capacity or resilient components does not require significantly more energy.
    It is often thought that Tier IV uses a lot more energy than a Tier III data center. However, this is not necessarily the case — the difference can be negligible. The only definitive increase in power consumption that a Tier IV data center may require is an increased load on uninterruptible power supply (UPS) systems and a few extra components. However, this increase can be as little as a 3% to 5% loss on 10% to 20% of the total mechanical power load for most data centers.
    The idea that Tier IV data centers are less efficient usually comes from the requirement to have two systems in operation. But this does not necessarily mean that each of the two systems in operation can support the full current workload at a moment’s notice. The goal is to ensure that there is no interruption of service.
    On the electrical side, concurrent maintainability and fault tolerance may be achieved using a “swing UPS” and distributed N+1 UPS topologies (see the worked example after this list). In this well-established architecture, the batteries provide immediate energy storage and are connected to an online UPS. There is not necessarily a need for two fully powered (2N) UPS systems; rather, the distribution is arranged so that a single failure only impacts a percentage of the IT workload. Any single redundant UPS only needs to be able to support that percentage of IT workload. The total UPS installed capacity, therefore, is far less than twice the entire workload, meaning that power use and losses are lower.
    Similarly, power use and losses can be reduced in the distribution systems by designing with as few transformers, distribution boards and transfer switches as possible. Fault tolerance can be harder to achieve this way, because it typically calls for a larger number of smaller components, which increases complexity; the leaner distribution design, however, has the benefit of higher electrical efficiency.
    On the mechanical side, two cooling systems can be operational, with each supporting a part of the current workload. When a system fails, thermal storage is used to buy time to power up additional cooling units. Again, this is a well-established approach.
  • Cooling. Low PUEs assume that highly efficient cooling is in place, which in turn requires the use of ambient cooling. Regulation will drive the industry to greater use of economization — whether air-side economizers, water-side economizers or pumped refrigerant economizers.
    • While very low PUEs do not rule out the use of standard direct expansion (DX) units, their energy use may be problematic. The use of DX units on hot days may be required to such an extent that very low PUEs are not achievable.
    • More water, less power? The PUE limitation of 1.2 may encourage the use of cooling technologies that evaporate more water. In many places, air-cooled chillers may struggle to cool the workload on hot days, which will require mechanical assistance. This may happen too frequently to achieve low annualized PUEs. A water-cooled plant or the use of evaporative cooling will likely use less power — but, of course, require much more water.
  • Increasing density will challenge cooling — even direct liquid cooling (DLC). Current and forecasted high-end processors, including graphics processing units, require server inlet temperatures at the lower ends of the ASHRAE range. This will make it very difficult to achieve low PUEs.
    DLC can only provide a limited solution. While it is highly effective in efficiently cooling higher density systems at the processor level, the lower temperatures required necessitate colder chilled or refrigerated air / water — which, of course, increases power consumption and pushes up the PUE.
  • Regulation may drive location. Low PUEs are much easier and more economically achieved in cooler or less humid climates. Regulators that mandate low PUEs will have to take this into account. Germany, for example, has mostly cool or cold winters and warm or hot summers, but with low / manageable humidity — therefore, it may be well suited to economization. However, it will be easier and less expensive to achieve these low PUEs in Northern Germany, or even in Scandinavia, rather than in Southern Germany.
  • Build outs. Germany’s Energy Efficiency Act requires that operators reach a low PUE within two years of the data center’s commissioning. This, in theory, gives the operator time to fill up the data center and reach an optimum level of efficiency. However, most data centers actually fill out and/or build out over a much longer timescale (four years is more typical).
    This may have wider design implications. Achieving a PUE of 1.2 at full workload requires that the equipment is selected and powered to suit the workload. But at a partial workload, many of these systems will be over-sized or not be as efficient. To achieve optimal PUE at all workloads, it may be necessary to deploy more, smaller capacity components and take a more modular approach — possibly using repeatable, prefab subsystems. This will have cost implications; to achieve concurrent maintainability for all loads, designers may have to use innovative N+1 designs with the greater use of smaller components.
  • Capital costs may rise. Research suggests that low PUE data centers can also have low operational costs — notably because of reduced energy use. However, the topologies and the number of components, especially for higher availability facilities, may be more expensive. Regulators mandating lower PUEs may be forcing up capital costs, although these can be recouped later.
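
To illustrate the swing / distributed N+1 point flagged earlier in this list, the sketch below compares installed UPS capacity and indicative losses for a 2N layout versus a distributed N+1 layout. The load, module size and loss fraction are hypothetical, and losses are taken as a flat percentage of installed capacity purely to show the direction of the effect.

```python
# Compare installed UPS capacity and indicative losses: 2N vs distributed N+1.
# All figures are hypothetical and deliberately simplified.
import math

it_load_mw = 3.0
module_mw  = 1.0     # individual UPS module rating
loss_frac  = 0.04    # assumed loss fraction (illustrative)

# 2N: two full systems, each able to carry the entire IT load.
installed_2n = 2 * it_load_mw

# Distributed N+1: enough modules for the load, plus one redundant module;
# a single failure affects only the share carried by one module.
n_modules    = math.ceil(it_load_mw / module_mw) + 1
installed_n1 = n_modules * module_mw

for name, cap in (("2N", installed_2n), ("distributed N+1", installed_n1)):
    print(f"{name:16s} installed: {cap:.1f} MW, "
          f"indicative losses: {cap * loss_frac * 1000:.0f} kW")
```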

The Uptime Intelligence View

The data center industry is not so new. Many facilities — perhaps most — achieve very high availability as a result of proven technology and assiduous attention by management and operators. However, the great majority do not use state-of-the-art engineering to achieve high energy efficiency. Regulators want to push them in this direction. Uptime Intelligence’s assessment is that — by adopting well-thought-out designs and building in a modular, balanced way — it is possible to have and operate highly available, mission-critical data centers with very high energy efficiency at all workloads. However, there will likely be a considerable cost premium.


*The Uptime Institute experts consulted for this Update are:

Chris Brown, Chief Technical Officer, Uptime Institute

Ryan Orr, Vice President, Topology Services and Global Tier Authority, Uptime Institute

Jay Dietrich, Research Director Sustainability, Uptime Institute

Dr. Tomas Rahkonen, Research Director, Distributed Data Centers, Uptime Institute

Data center management software is evolving — at last

Despite their role as enablers of technological progress, data center operators have been slow to take advantage of developments in software, connectivity and sensor technologies that can help optimize and automate the running of critical infrastructure.

Most data center owners and operators currently use building management systems (BMS) and / or data center infrastructure management (DCIM) software as their primary interfaces for facility operations. These tools have important roles to play in data center operations but have limited analytics and automation capabilities, and they often do little to improve facility efficiency.

To handle the increasing complexity and scale of modern data centers — and to optimize for efficiency — operations teams need new software tools that can generate value from the data produced by the facilities equipment.

For more than a decade, DCIM software has been positioned as the main platform for organizing and presenting actionable data — but its role has been largely passive. Now, a new school of thought on data center management software is emerging, proposed by data scientists and statisticians. From their point of view, a data center is not a collection of physical components but a complex system of data patterns.

Seen in this way, every device has a known optimal state, and the overall system can be balanced so that as many devices as possible are working in this state. Any deviation would then indicate a fault with equipment, sensors or data.

This approach has been used to identify defective chiller valves, imbalanced computer room air conditioning (CRAC) fans, inefficient uninterruptible power supply (UPS) units and opportunities to lower floor pressure — all discovered in data, without anyone having to visit the facility. The overall system — the entire data center — can also be modelled in this way, so that it can be continuously optimized.

All about the data

Data centers are full of sensors. They can be found inside CRAC and computer room air handler units, chillers, coolant distribution units, UPS systems, power distribution units, generators and switchgear, and many operators install optional temperature, humidity and pressure sensors in their data halls.

These sensors can serve as a source of valuable operational insight — yet the data they produce is rarely analyzed. In most cases, applications of this information are limited to real-time monitoring and basic forecasting.

In recent years, major mechanical and electrical equipment suppliers have added network connectivity to their products as they look to harvest sensor data to understand where and how their equipment is used. This has benefits for both sides: customers can monitor their facilities from any device, anywhere, while suppliers have access to information that can be used in quality control, condition-based or predictive maintenance, and new product design.

The greatest benefits of this trend are yet to be harnessed. Aggregated sensor data can be used to train artificial intelligence (AI) models with a view to automating an increasing number of data center tasks. Data center owners and operators do not have to rely on equipment vendors to deliver this kind of innovation. They can tap into the same equipment data, which can be accessed using industry-standard protocols like SNMP or Modbus.
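
As a hypothetical example of tapping that equipment data directly, the sketch below polls a handful of holding registers from a Modbus TCP device using the pymodbus library. The register map, scaling and device address are invented, and exact call signatures vary between pymodbus versions, so treat this as an outline rather than a drop-in script.

```python
# Poll hypothetical holding registers from a power meter over Modbus TCP.
# Register map, scaling and IP address are illustrative; check the vendor's
# documentation and the installed pymodbus version for exact call signatures.
from pymodbus.client import ModbusTcpClient

client = ModbusTcpClient("192.0.2.10", port=502)   # documentation-range IP
client.connect()

# Hypothetical register map: 0 = voltage (x0.1 V), 1 = current (x0.01 A)
result = client.read_holding_registers(address=0, count=2, slave=1)
if not result.isError():
    voltage = result.registers[0] * 0.1
    current = result.registers[1] * 0.01
    print(f"voltage: {voltage:.1f} V, current: {current:.2f} A")

client.close()
```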

By combining sensor data with an emerging category of data center optimization tools — many of which rely on machine learning — data center operators can improve their infrastructure efficiency, achieve higher degrees of automation and lower the risk of human error.

The past few years have also spawned new platforms that simplify data manipulation and analysis. These enable larger organizations to develop their own applications that leverage equipment data — including their own machine learning models.

The new wave

Dynamic cooling optimization is the best-understood example of this new, data-centric approach to facilities management. These applications rely on sensor data and machine learning to determine and continually “learn” the relationships between variables such as rack temperatures, cooling equipment settings and overall cooling capacity. The software can then tweak the cooling equipment performance based on minute changes in temperature, enabling the facility to respond to the needs of the IT in near real-time.
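
A highly simplified sketch of this approach, using synthetic data, scikit-learn and an invented fan-power model, gives a flavor of how such a tool works: learn the relationship between cooling settings and rack inlet temperature, then choose the lowest-power setting that keeps the predicted temperature within limits.

```python
# Toy version of dynamic cooling optimization: learn how supply-air setpoint
# and fan speed relate to rack inlet temperature, then pick the most
# power-efficient setting that still meets the temperature limit.
# All data and the fan-power model are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic history: [supply_setpoint_C, fan_speed_frac, it_load_kW]
X = np.column_stack([
    rng.uniform(16, 24, 2000),     # supply air setpoint
    rng.uniform(0.4, 1.0, 2000),   # fan speed (fraction of max)
    rng.uniform(100, 300, 2000),   # IT load
])
# Synthetic response: inlet temperature rises with setpoint and load,
# falls with airflow; plus measurement noise.
y = X[:, 0] + 0.02 * X[:, 2] / X[:, 1] + rng.normal(0, 0.3, 2000)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Search candidate settings for the current IT load.
it_load = 220.0
limit_c = 27.0                      # ASHRAE-style inlet limit
best = None
for setpoint in np.arange(16, 24, 0.5):
    for fan in np.arange(0.4, 1.01, 0.05):
        inlet = model.predict([[setpoint, fan, it_load]])[0]
        power = 10.0 * fan ** 3     # invented fan-power proxy (kW)
        if inlet <= limit_c and (best is None or power < best[0]):
            best = (power, setpoint, fan, inlet)

power, setpoint, fan, inlet = best
print(f"setpoint {setpoint:.1f}C, fan {fan:.2f} -> "
      f"predicted inlet {inlet:.1f}C, fan power {power:.1f} kW")
```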

Many companies working in this field have close ties to the research community. AI-powered cooling optimization vendor TycheTools was founded by a team from the Technical University of Madrid. The Dutch startup Coolgradient was co-founded by a data scientist and collaborates with several universities. US-based Phaidra has brought together some of the talent that previously published and commercialized cutting-edge research as part of Google’s DeepMind.

Some of the benefits offered by data-centric management software include:

  • Improved facility efficiency: through the automated configuration of power and cooling equipment as well as the identification of inefficient or faulty hardware.
  • Better maintenance: by enabling predictive or condition-based maintenance strategies that consider the state of individual hardware components.
  • Discovery of stranded capacity: through the thorough analysis of all data center metrics, not just high-level indicators.
  • Elimination of human error: through either a higher degree of automation or automatically generated recommendations for human employees.
  • Improvements in skill management: by analyzing the skills of the most experienced staff and codifying them in software.

Not all machine learning models require extensive compute resources, rich datasets and long training times. In fact, many of the models used in data centers today are small and relatively simple. Both training and inference can run on general-purpose servers, and it is not always necessary to aggregate data from multiple sites — a model trained locally on a single facility’s data will often be sufficient to deliver the expected results.

New tools bring new challenges

The adoption of data-centric tools for infrastructure management will require owners and operators to recognize the importance of data quality. They will not be able to trust the output of machine learning models if they cannot trust their data — and that means additional work on standardizing and cleaning their operational data stores.

In some cases, data center operators will have to hire analysts and data scientists to work alongside the facilities and IT teams.

Data harvesting at scale will invariably require more networking inside the data center — some of it wireless — and this presents a potentially wider attack surface for cybercriminals. As such, cybersecurity will be an important consideration for any operational AI deployment and a key risk that will need to be continuously managed.

Evolution is inevitable

Uptime Institute has long argued that data center management tools need to evolve toward greater autonomy. The data center maturity model (see Table 1) was first proposed in 2019 and applied specifically to DCIM. Four years later, there is finally the beginning of a shift toward Level 4 and Level 5 autonomy, albeit with a caveat — DCIM software alone will likely never evolve these capabilities. Instead, it will need to be combined with a new generation of data-centric tools.

Table 1. The Data Center Management Maturity Model

Not every organization will, or should, take advantage of Level 4 and Level 5 functionality. These tools will provide an advantage to the operators of modern facilities that have exhausted the list of traditional efficiency measures, such as those achieving PUE values of less than 1.3.

For the rest, being an early adopter will not justify the expense. There are cheaper and easier ways to improve facility efficiency that do not require extensive data standardization efforts or additional skills in data science.

At present, AI and analytics innovation in data center management appears to be driven by startups rather than established software vendors. Few BMS and DCIM developers have integrated machine learning into their core products, and while some companies have features in development, these will take time to reach the market — if they ever leave the lab.

Uptime Intelligence is tracking six early-stage or private companies that use facilities data to create machine learning models and already have products or services on the market. It is likely more will emerge in the coming months and years.

These businesses are creating a new category of software and services that will require new types of interactions with all the moving parts inside the data center, as well as new commercial strategies and new methods of measuring the return on investment. Not all of them will be successful.

The speed of mainstream adoption will depend on how easy these tools will be to implement. Eventually, the industry will arrive at a specific set of processes and policies that focus on benefitting from equipment data.


The Uptime Intelligence View

The design and capabilities of facilities equipment have changed considerably over the past 10 years, and traditional data center management tools have not kept up. A new generation of software from less-established vendors now offers an opportunity to shift the focus from physical infrastructure to data. This introduces new risks — but the benefits are too great to ignore.

Generative AI and global power consumption: high, but not that high

In the past year, Uptime Intelligence has been asked more questions about generative AI and its impact on the data center sector than any other topic. The questions come from enterprise and colocation operators, suppliers of a wide variety of equipment and services, regulators and the media.

Most of the questions concern power consumption. The compute clusters — necessary for the efficient creation and use of generative AI models — are enormously power-hungry, creating a surge in projected demand for capacity and, for operators, challenges in power distribution and cooling in the data center.

The questions about power typically fall into one of three groups. The first is centered around, “What does generative AI mean for density, power distribution and cooling in the data center?”

Uptime Intelligence’s view is that the claims of density distress are exaggerated. While the AI systems for training generative AI models, equipped with Nvidia accelerators, are much denser than usual, they are not extreme and can be managed by spreading them out into more cabinets. Further Uptime Intelligence reports will cover this aspect since it is not the focus here.

Also, generative AI has an indirect effect on most data center operators in the form of added pressures on the supply chain. With lead times on some key equipment, such as engine generators, transformers and power distribution units, already abnormally long, unforeseen demand for even more capacity by generative AI will certainly not help. The issues of density and forms of indirect impact were initially addressed in the Uptime Intelligence report Hunger for AI will have limited impact on most operators.

The second set of questions relates to the availability of power in a given region or grid — and especially of low-carbon energy. This is also a critical, practical issue that operators and utilities are trying to understand. The largest clusters for training large generative AI models comprise many hundreds of server nodes and draw several megawatts of power at full load. However, the issues are largely localized, and they are also not the focus here.

Generative AI and global power

Instead, the focus of this report is the third typical question, “How much global power will AI use or require?” While in immediate practical terms this is not a key concern for most data center operators, the headline numbers will shape media coverage and public opinion, which ultimately will drive regulatory action.

However, some of the numbers on AI power circulating in the press and at conferences, cited by key influencers and other parties, are extremely high. If accurate, these figures suggest major infrastructural and regulatory challenges ahead — however, any unduly high forecasts may prompt regulators to overreact.

Among the forecasts at the higher end is from Schneider Electric’s respected research team, which estimated AI power demand at 4 gigawatts (GW), equivalent to 35 terawatt-hours (TWh) if annualized, in 2023, rising to around 15 GW (131 TWh if annualized) in 2028 (see Schneider White paper 110: The AI Disruption: Challenges and Guidance for Data Center Design). Most likely, these figures include all AI workloads and not only new generative models.

And Alex de Vries, of digital trends platform Digiconomist and whose calculations of Bitcoin energy use have been influential, has estimated AI workload use at 85 TWh to 134 TWh by 2027. These figures suggest AI could add 30% to 50% or more to global data center power demand over the next few years (see below).

There are two reasons why Uptime Intelligence considers these scenarios overly bullish. First, estimates of power for all AI workloads are problematic, both because of taxonomy (what counts as AI) and because such workloads are unlikely to be tracked meaningfully. Moreover, most forms of AI are already accounted for in capacity planning, even if generative AI was unexpected. Second, projections that span beyond a 12- to 18-month horizon carry high uncertainty.

Some of the numbers cited above imply an approximately thousand-fold increase in AI compute capacity in 3 to 4 years from the first quarter of 2024, when accounting for hardware and software technology evolution. That is not only unprecedented but also has weak business fundamentals when considering the hundreds of billions of dollars it would take to build all that AI infrastructure.

Uptime Intelligence takes a more conservative view with its estimates — but these estimates still indicate rapidly escalating energy use by new large generative AI models.

To reach our numbers, we have estimated the current shipments and installed base of Nvidia-based systems through to the first quarter of 2025 and the likely power consumption associated with their use. Systems based on Nvidia’s data center accelerators, derived from GPUs, dominate the generative AI model accelerator market and will continue to do so until at least mid-2025 due to an entrenched advantage in the software toolchain.

We have considered a range of technical and market factors to calculate the power requirements of generative AI infrastructure: workload profiles (workload activity, utilization and load concurrency in a cluster), the shifting balance between training and inferencing, and the average PUE of the data center housing the generative AI systems.

This data supports our earlier stated view that the initial impact of generative AI is limited, beyond a few dozen large sites. For the first quarter of 2024, we estimate the annualized power use by Nvidia systems installed to be around 5.8 TWh. This figure, however, will rise rapidly if Nvidia meets its forecast sales and shipment targets. By the first quarter of 2025, the generative AI infrastructure in place could account for 21.9 TWh of annual energy consumption.

We expect these numbers to shift as new information emerges, but they are indicative. To put these numbers into perspective, the total global data center energy use has been variously estimated at between 200 TWh and 450 TWh per year in the periods from 2020 to 2022. (The methodologies and terms of various studies vary widely and suggest that data centers use between 1% and 4% of all consumed electricity.) By taking a middle figure of 300 TWh for the annual global data center power consumption, Uptime Intelligence puts generative AI annualized energy at around 2.3% of the total grid power consumption by data centers in the first quarter of 2024. However, this could reach 7.3% by the first quarter of 2025.
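
The arithmetic behind estimates of this kind is straightforward once the installed base is pinned down. The sketch below shows the method only; the system count, per-system power, utilization and PUE figures are hypothetical placeholders, not Uptime Intelligence's model inputs.

```python
# Bottom-up sketch: annualized energy of an installed AI accelerator fleet.
# Every input below is a hypothetical placeholder for illustration.
installed_systems   = 120_000   # 8-accelerator servers in the field
avg_system_power_kw = 7.0       # average draw per system under load
avg_utilization     = 0.6       # fraction of time at that draw
pue                 = 1.4       # facility overhead multiplier
hours_per_year      = 8_760

annual_twh = (installed_systems * avg_system_power_kw * avg_utilization
              * pue * hours_per_year) / 1e9   # kWh -> TWh

print(f"Annualized energy: {annual_twh:.1f} TWh")
share = annual_twh / 300        # against a 300 TWh global data center estimate
print(f"Share of global data center energy: {share:.1%}")
```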

Outlook

These numbers indicate that generative AI’s power use is not, currently, disruptively impactful, given the data center sector’s explosive growth in recent years. Generative AI’s share of power consumption relative to its footprint is outsized, given its likely very high utilization, but its share of data center capacity by power will remain in the low-to-mid single digits even by the end of 2025.

It is nevertheless a dramatic increase, and it suggests that generative AI could account for a much larger percentage of overall data center power use in the years ahead. Inevitably, the surge will need to slow, in great part because newer AI systems built around vastly more efficient accelerators will displace the installed base en masse rather than add net new infrastructure.

While Uptime Intelligence believes some of the estimates of generative AI power use (and data center power use) to be too high, the sharp uptick — and the concentration of demand in certain regions — will still be high enough to attract and stimulate regulatory attention. Uptime Intelligence will continue to analyze the development of AI infrastructure and its impact on the data center industry.


Andy Lawrence, Executive Research Director alawrence@uptimeinstitute.com

Daniel Bizo, Research Director dbizo@uptimeinstitute.com

Understanding how server power management works

Uptime Intelligence regularly addresses IT infrastructure efficiency, particularly servers, in our reports on data center energy performance and sustainability. Without active contribution from IT operations, facility operations alone will not be able to meet future energy and sustainability demands on data center infrastructure. Purchases of renewable energy and renewable energy certificates will become increasingly — and, in many locations, prohibitively — expensive as demand outstrips supply, making the energy wasted by IT even more costly.

The power efficiency of a server fleet, that is, how much work servers perform for the energy they use, is influenced by multiple factors. Hardware features receive the most attention from IT buyers: the server’s technology generation, the configuration of the system and the selection of power supply or fan settings. The single most significant factor that affects server efficiency, however, is the level at which the servers are typically utilized; a seemingly obvious consideration — and enough for regulators to include it as a reporting requirement in the EU’s new Energy Efficiency Directive (see EED comes into force, creating an enormous task for the industry). Even so, the process of sourcing the correct utilization data for the purposes of power efficiency calculations (as opposed to capacity planning) remains arguably misunderstood (see Tools to watch and improve power use by IT are underused).

The primacy of server utilization in data center efficiency has increased in recent years. The latest server platforms are only able to deliver major gains in energy performance when put to heavy-duty work — either by carrying a larger software payload through workload consolidation, or by running scalable, large applications. If these conditions are not met, running lighter or bursty workloads on today’s servers (regardless of whether based on Intel or AMD chips) will deliver only a marginal, if any, improvement in the power efficiency compared with many of the supposedly outdated servers that are five to seven years old (see Server efficiency increases again — but so do the caveats).

Cycles of a processor’s sleep

This leads into the key discussion point of this report: the importance of taking advantage of dynamic energy saving features. Settings for power and performance management of servers are often an overlooked — and underused — lever in improving power efficiency. Server power management techniques affect power use and overall system efficiency significantly. This effect is even more pronounced for systems that are only lightly loaded or spend much of their time doing little work: for example, servers that run enterprise applications.

The reduction in server power demand resulting from power management can be substantial. In July 2023 Uptime Intelligence published a report discussing data (although sparse) that indicates 10% to 20% reductions in energy use from enabling certain power-saving modes in modern servers, with only a marginal performance penalty when running a Java-based business logic (see The strong case for power management). Energy efficiency gains will depend on the type of processor and hardware configuration, but we consider the results indicative for most servers. Despite this, our research indicates that many, if not most, IT operations do not use power management features.

So, what are these power management settings? Server power management is governed statically by the firmware (which modes are enabled at system start-up) and dynamically by the operating system or hypervisor at runtime, through the Advanced Configuration and Power Interface (ACPI).

There are many components in a server that may have power management features, enabling them to run slower or power off. Operating systems also have their own software mechanisms, such as suspending their operation and saving the machine state to central memory or the storage system.

But in servers, which tend to be always powered on, it is the processors’ power management modes that dictate most of the energy gains. Modern processors have sophisticated power management features for idling, that is, when the processor does not execute code. These are represented by various levels of C-states (the C stands for CPU) denoted by numbers, such as C1 and C2 (with C0 being the fully active state).

The number of these states has expanded over time as chip architects introduce new, more advanced power-saving features to help processors reduce their energy use when doing no work. The chief benefit of these techniques is to minimize leakage currents that would otherwise increasingly permeate modern processor silicon.

The higher the C-state number, the more of its circuitry the CPU sends to various states of sleep. In summary:

  • C0: processor active.
  • C1/C1E: processor core halts, not performing work, but is ready to immediately resume operation with negligible performance penalty, optionally reducing its voltage and frequency to save power.
  • C3: processor clock distribution is switched off and core caches are emptied.
  • C4: enhancement to C3 that extends the parts covered.
  • C6: essentially powers down entire cores after saving the state to resume from later.
  • C7 and higher: shared resources between cores may be powered down, or even the entire processor package.

Skipped numbers, such as C2 and C5, are incremental, transitionary processor power states between the main states. Not all of these C-states are available on all processor architectures; the sketch below shows how to inspect the idle states a given Linux system exposes.
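
A minimal sketch for listing the idle states Linux exposes for one CPU, with cumulative residency; these are standard cpuidle sysfs paths, but the states reported depend on the processor and the idle driver.

```python
# List the C-states (idle states) Linux exposes for CPU 0, with cumulative
# residency. Which states appear depends on the CPU and the cpuidle driver.
from pathlib import Path

states = Path("/sys/devices/system/cpu/cpu0/cpuidle").glob("state*")
for state in sorted(states, key=lambda p: int(p.name[5:])):
    name = (state / "name").read_text().strip()
    desc = (state / "desc").read_text().strip()
    usage = int((state / "usage").read_text())          # times entered
    time_us = int((state / "time").read_text())         # microseconds resident
    print(f"{state.name}: {name:8s} entered {usage:>10d} times, "
          f"{time_us / 1e6:.1f}s total ({desc})")
```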

A good sleep in a millisecond

The levels of C-states, and understanding them, matter because they largely define the cost in performance and the benefit in power. The measured 10% to 20% reduction in energy use from enabling certain power management features, discussed earlier, was achieved by allowing the server processor (an AMD model) to enter power states up to C6. These sleep states save power even when the server, on a human level of perception, is processing database transactions and responding to queries.

This is because processors operate on a timescale measured in nanoseconds, while software-level requests between commands can take milliseconds even on a busy machine. This is a factor of one million difference: milliseconds between work assignments represent millions of processor cycles waiting. For modern server processors, some of the many cores may often have no work to do for a second or more, which is an eternity on the processor’s time scale. On a human scale, a comparable time would be several years of inactivity.

However, there is a cost associated with the processor cores going to sleep. Entering ever deeper sleep states across processor cores or entire chips can take thousands of cycles, and as many as tens of thousands of cycles to wake up and reinstate operation. This added latency to respond to wake-up requests is what shows up as a loss of performance in measurements. In the reference measurement running the Java-based business logic, this is in the 5% to 6% range — arguably a small price to pay.

Workloads will vary greatly in the size of the performance penalty introduced by this added latency. Crucially, they will differ even more in how costly the lost application performance is for the business — high-frequency trading or processing of high volumes of mission-critical online transactions are areas where any loss of performance is unacceptable. Another area may include storage servers with heavy demand for handling random read-write operations at low latency. But a vast array of applications will not see material change to the quality of service.

Using server power management is not a binary decision either. IT buyers can also calibrate the depth of sleep they enable for the server processor (and other components) to enter. Limiting it to C3 or C1E may deliver better trade-offs. Many servers, however, are not running performance-critical applications and spend most of their time doing no work — even if it seems that they are often, by human standards, called upon. For servers that are often idle, the energy saved can be in the 20% to 40% range, which can amount to tens of watts for every lightly loaded or idle server.
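
On Linux, this calibration can be done at runtime through the cpuidle interface by disabling states deeper than a chosen threshold. A minimal sketch, assuming root access; state names vary by processor, so the "C3" cut-off shown here is an assumption to be checked against the name files on the target machine.

```python
# Disable idle states deeper than a chosen cut-off on every CPU (requires
# root). State names differ between processors; "C3" here is illustrative.
from pathlib import Path

DEEPEST_ALLOWED = "C3"   # keep states up to C3; disable C6 and deeper

for cpu in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
    past_cutoff = False
    states = cpu.glob("cpuidle/state*")
    for state in sorted(states, key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        # sysfs lists states from shallowest to deepest; disable everything
        # deeper than the chosen cut-off, re-enable everything up to it.
        (state / "disable").write_text("1" if past_cutoff else "0")
        if name == DEEPEST_ALLOWED:
            past_cutoff = True
```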

Optimizing server energy performance does not stop with C-states. Performance-governing features (setting performance levels when the processor is actively working), known as P-states, offer another set of possibilities to find better trade-offs between power and performance. Rather than minimizing waste when the processor idles, P-states direct how much power should be expended on getting the work done. Future reports will introduce P-states for a more complete view of server power and performance management for IT infrastructure operators that are looking for further options in meeting their efficiency and sustainability objectives.


The Uptime Intelligence View

A server processor’s power management is a seemingly minute function buried under layers of technical details of an infrastructure. Still, its role in the overall energy performance of a data center infrastructure will be outsized for many organizations. In the near term, blanket policies (or simply IT administrator habits) of keeping server power management features switched off will inevitably be challenged by internal stakeholders in pursuit of cost efficiencies and better sustainability credentials; or, possibly in the longer term, by regulators catching on to technicalities and industry practices. Technical organizations at enterprises and IT service providers will want to map out server power management opportunities ahead of time.