DCIM past and present: what’s changed?

Data center infrastructure management (DCIM) software is an important class of software that, despite some false starts, many operators regard as essential to running modern, flexible and efficient data centers. It has had a difficult history — many suppliers have struggled to meet customer requirements and adoption remains patchy. Critics argue that, because of the complexity of data center operations, DCIM software often requires expensive customization and feature development for which many operators have neither the expertise nor the budget.

This is the first of a series of reports by Uptime Intelligence exploring data center management software in 2024 — two decades or more after the first commercial products were introduced. Data center management software is a wider category than DCIM: many products are point solutions; some extend beyond a single site and others have control functions. Uptime Intelligence is referring to this category as data center management and control (DCM-C) software.

DCIM, however, remains at the core. This report identifies the key areas in which DCIM has changed over the past decade and, in future reports, Uptime Intelligence will explore the broader DCM-C software landscape.

What is DCIM?

DCIM refers to data center infrastructure management software, which collects and manages information about a data center’s IT and facility assets, resource use and operational status, often across multiple systems and distributed environments. DCIM primarily focuses on three areas:

  • IT asset management. This involves logging and tracking of assets in a single searchable database. This can include server and rack data, IP addresses, network ports, serial numbers, parts and operating systems.
  • Monitoring. This usually includes monitoring rack space, data and power (including power use by IT and connected devices), as well as environmental data (such as temperature, humidity, air flow, water and air pressure).
  • Dashboards and reporting. To track energy use, sustainability data, PUE and environmental health (thermal, pressure etc.), and monitor system performance, alerts and critical events. This may also include the ability to simulate and project forward – for example, for the purposes of capacity management.

In the past, some operators have taken the view that DCIM does not justify the investment, given its cost and the difficulty of successful implementation. However, these reservations may be product specific and can depend on the situation; many others have claimed a strong return on investment and better overall management of the data center with DCIM.

Growing need for DCIM

Uptime’s discussions with operators suggest there is a growing need for DCIM software, and related software tools, to help resolve some of the urgent operational issues around sustainability, resiliency and capacity management. The current potential benefits of DCIM include:

  • Improved facility efficiency and resiliency through automating IT updates and maintenance schedules, and the identification of inefficient or faulty hardware.
  • Improved capacity management by tracking power, space and cooling usage, and locating appropriate resources to reserve.
  • Procedures and rules are followed. Changes are documented systematically; asset changes are captured and stored — and permitted only if the requirements are met.  
  • Denser IT accommodated. By identifying available space and power for IT, it may be easier to densify racks and allocate resources to AI/machine learning and high-performance computing. The introduction of direct liquid cooling (DLC) will further complicate these environments.
  • Human error eliminated through a higher degree of task automation, as well as improved workflows, when making system changes or updating records.

Meanwhile, there will be new requirements from customers for improved monitoring, reporting and measurement of data, including:

  • Monitoring equipment performance to avoid undue wear and tear or system stress, which may reduce the risk of outages.
  • Shorter ride-through times may require more monitoring. In the event of a major power outage, for example, IT equipment may only have a short window of cooling supported by the UPS.
  • Greater variety of IT equipment (graphics processing units, central processing units, application-specific integrated circuits) may mean a less predictable, more unstable environment. Monitoring will be required to ensure that their different power loads, temperature ranges and cooling requirements are managed effectively.
  • Sustainability metrics (such as PUE), as well as other measurables (such as water usage effectiveness, carbon usage effectiveness and metrics to calculate Scope 1, 2 or 3 greenhouse gas emissions). Simple calculations of these headline metrics are sketched after this list.
  • Legal requirements for transparency of environmental, sustainability and resiliency data.
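
How these headline metrics are derived from metered totals is straightforward. The sketch below uses the commonly cited definitions, with hypothetical annual figures rather than data from this report:

```python
# Illustrative calculations of common data center sustainability metrics.
# Definitions follow the commonly cited formulations; the inputs are
# hypothetical annual totals, not figures from this report.

def pue(total_facility_energy_kwh: float, it_energy_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT energy."""
    return total_facility_energy_kwh / it_energy_kwh

def wue(site_water_liters: float, it_energy_kwh: float) -> float:
    """Water usage effectiveness: site water use (L) per kWh of IT energy."""
    return site_water_liters / it_energy_kwh

def cue(total_co2_kg: float, it_energy_kwh: float) -> float:
    """Carbon usage effectiveness: operational CO2 (kg) per kWh of IT energy."""
    return total_co2_kg / it_energy_kwh

if __name__ == "__main__":
    it_kwh = 8_760_000          # 1 MW average IT load over a year
    facility_kwh = 11_400_000   # includes cooling, UPS losses, lighting
    print(f"PUE: {pue(facility_kwh, it_kwh):.2f}")            # ~1.30
    print(f"WUE: {wue(15_000_000, it_kwh):.2f} L/kWh")
    print(f"CUE: {cue(4_000_000, it_kwh):.2f} kgCO2/kWh")
```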

Supplier landscape resets

In the past decade, many DCIM suppliers have reset, adapted and modernized their technology to meet customer demand. Many have now introduced mobile and browser-based offerings, colocation customer portals and better metrics tracking, data analytics, cloud and software as a service (SaaS).

Customers are also demanding more vendor-agnostic DCIM software. Operators have sometimes struggled with DCIM’s inability to work with existing building management systems from other vendors, which then requires additional costly work on application programming interfaces and integration. Some operators have noted that DCIM software from one specific vendor still only provides out-of-the-box monitoring for that vendor’s own brand of equipment. These concerns have influenced (and continue to influence) customer buying decisions.

Adaptation has been difficult for some of the largest DCIM suppliers, and some organizations have now exited the market. Vertiv, for example, one of the largest data center equipment vendors, discontinued its Trellis product in 2021 in a significant exit: customers had found Trellis too large and complex for most implementations. Even today, operators continue to migrate off Trellis onto other DCIM systems.

Other structural changes in the DCIM market include Carrier and Schneider Electric acquiring Nlyte and Aveva, respectively, and Sunbird spinning out from hardware vendor Raritan (Legrand) (see Table 1).

Table 1. A selection of current DCIM suppliers

Table: A selection of current DCIM suppliers

A growing number of independent service vendors now offer DCIM, each with different specializations. For example, Hyperview is solely cloud-based, RiT Tech focuses on universal data integration, while Device42 specializes in IT asset discovery. Independent service vendors benefit those unwilling to acquire DCIM software and data center equipment from the same supplier.

Those DCIM software businesses that have been acquired by equipment vendors are typically kept at arm’s length. Schneider and Carrier retain the Aveva and Nlyte brands and cultures, respectively, to preserve their differentiation and independence.

There are many products in the data center management area that are sometimes — in Uptime’s view — labeled incorrectly as DCIM. These include products that offer discrete or adjacent DCIM capabilities, such as: Vertiv Environet Alert (facility monitoring); IBM Maximo (asset management); AMI Data Center Manager (server monitoring); Vigilent (AI-based cooling monitoring and control); and EkkoSense (digital twin-based cooling optimization). Uptime views these as part of the wider DCM-C category, which will be discussed in a future report in this series.

Attendees at Uptime network member events between 2013 and 2020 may recall that complaints about DCIM products, implementation, integration and pricing were a regular feature. Much of the early software was market-driven and fragile, and suffered from performance issues, but DCIM software has undoubtedly improved from where it was a decade ago.

The next sections of this report discuss areas in which DCIM software has improved and where there is still room for improvement.

Modern development techniques

Modern development techniques, such as continuous integration / continuous delivery (CI/CD) and agile / DevOps, have encouraged a regular cadence of new releases and updates. Containerized applications have introduced modular DCIM, while SaaS has provided greater pricing and delivery flexibility.

Modularity

DCIM is no longer a monolithic software package. Previously, it was purchased as a core bundle, but now DCIM is more modular, with add-ons that can be purchased as required. This may make DCIM more cost-effective, with operators able to assess the return on investment more accurately before committing to further spending. Ten years ago, the main customers for DCIM were enterprises, with control over IT but limited budgets. Now, DCIM customers are more likely to be colocation providers with more specific requirements, little interest in the IT, and a need for more modular, targeted solutions with links into their own commercial systems.

SaaS

SaaS brings subscription-based pricing, with greater flexibility and visibility of costs. This is different from traditional DCIM license and support pricing, which typically locked customers into minimum-term contracts. Since SaaS is subscription-based, there is more onus on the supplier to respond to customer requests in a timely manner. While some DCIM vendors offer cloud-hosted versions of their products, most operators still opt for on-premises DCIM deployments, due to perceived data and security concerns.

IT and software integrations

Historically, DCIM suffered from configurability, responsiveness and integration issues. In recent years, more effort has been made toward third-party software and IT integration, and toward encouraging better data sharing between systems. Much DCIM software now uses application programming interfaces (APIs) and industry-standard protocols to achieve this.

Application programming interfaces

APIs have made it easier for DCIM to connect with third-party software, such as IT service management, IT operations management, and monitoring and observability tools, which are often used in other parts of the organization. The aim for operators is to achieve a comprehensive view across the IT and facilities landscape, and to help orchestrate requests that come in and out of the data center. Some DCIM systems, for example, come with pre-built integrations for tools such as ServiceNow and Salesforce, which are widely used by enterprise IT teams. These enterprise tools can provide a richer set of functionalities in IT and customer management and support. They also use robotic process automation technology to automate repetitive manual tasks, such as rekeying data between systems, updating records and automating responses.
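
As an illustration of this kind of integration, the sketch below posts a DCIM alert to a generic ITSM endpoint over REST. The URL, token and payload schema are hypothetical placeholders, not any vendor’s documented API:

```python
# Minimal sketch of pushing a DCIM alert into an ITSM tool over a REST API.
# The endpoint, token and payload schema are hypothetical placeholders; a real
# integration would follow the ITSM vendor's documented API.
import requests

ITSM_WEBHOOK = "https://itsm.example.com/api/incidents"  # placeholder URL
API_TOKEN = "..."                                        # placeholder credential

def raise_incident(asset_id: str, metric: str, value: float, threshold: float) -> None:
    """Open an incident when a monitored metric breaches its alarm threshold."""
    payload = {
        "short_description": f"{metric} threshold exceeded on {asset_id}",
        "details": {"metric": metric, "value": value, "threshold": threshold},
        "category": "facilities",
    }
    resp = requests.post(
        ITSM_WEBHOOK,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()

# Example: a rack PDU reports 9.2 kW against an 8 kW alarm threshold.
# raise_incident("rack-42-pdu-a", "power_kw", 9.2, 8.0)
```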

IT/OT protocols

Support for a growing number of IT/OT protocols has made it easier for DCIM to connect with a broader range of IT/OT systems. This helps operators to access the data needed to meet new sustainability requirements. For example, support for the Simple Network Management Protocol (SNMP) can provide DCIM with network performance data that can be used to monitor and detect connection faults. Meanwhile, support for the Intelligent Platform Management Interface (IPMI) can enable remote monitoring of servers.
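
For a sense of how such protocol support works in practice, the sketch below polls a device over SNMPv2c, assuming the classic synchronous pysnmp API (newer pysnmp releases use an asynchronous interface). The OID shown is the standard sysDescr; a real DCIM integration would poll power and thermal OIDs from the equipment vendor’s MIB:

```python
# Minimal sketch of polling a facility device over SNMP, a pattern DCIM tools
# use for monitoring. Assumes the classic synchronous pysnmp "hlapi" API and an
# SNMPv2c device at 192.0.2.10; the OID queried is the standard sysDescr.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),               # SNMPv2c community
        UdpTransportTarget(("192.0.2.10", 161)),          # device address
        ContextData(),
        ObjectType(ObjectIdentity("1.3.6.1.2.1.1.1.0")),  # sysDescr.0
    )
)

if error_indication or error_status:
    print("SNMP query failed:", error_indication or error_status.prettyPrint())
else:
    for oid, value in var_binds:
        print(f"{oid.prettyPrint()} = {value.prettyPrint()}")
```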

User experience has improved

DCIM provides a better user experience than a decade ago. However, operators still need to be vigilant that a sophisticated front end is not a substitute for functionality.

Visualization

Visualization for monitoring and planning has seen significant progress, with interactive 3D and augmented reality views of IT equipment, racks and data halls. Sensor data is being used, for example, to identify available capacity, hot spots or areas experiencing over-cooling. This information is presented visually to the user, who can follow changes over time and drag and drop assets into new configurations. On the fringes of DCIM, computational fluid dynamics can visualize air flows within the facility, which can then be used to make assumptions about the impact of specific changes on the environment. Meanwhile, the increasing adoption of computer-aided design can enable operators to render accurate and dynamic digital twin simulations for data center design and engineering and, ultimately, the management of assets across their life cycle.

Better workflow automation

At a process level, some DCIM suites offer workflow management modules to help managers initiate, manage and track service requests and changes. Drag and drop workflows can help managers optimize existing processes. This has the potential to reduce data entry omissions and errors, which have always been among the main barriers to successful DCIM deployments.

Rising demand for data and AI

Growing demand for more detailed data center metrics and insights related to performance, efficiency and regulations will make DCIM data more valuable. This, however, depends on how well DCIM software can capture, store and retrieve reliable data across the facility.

Customers today require greater levels of analytical intelligence from their DCIM. Greater use of AI and machine learning (ML) could enable the software to spot patterns and anomalies, and provide next-best-action recommendations. DCIM has not fared well in this area, which has opened the door to a new generation of AI-enabled optimization tools. The Uptime report What is the role of AI in digital infrastructure management? identifies three near-term applications of ML in the data center — predictive analytics, equipment setting optimization and anomaly detection.

DCIM is making progress in sustainability data monitoring and reporting and a number of DCIM suppliers are now actively developing sustainability modules and dashboards. One supplier, for example, is developing a Scope 1, 2 and 3 greenhouse gas emissions model based on a range of datasets, such as server product performance sheets, component catalogs and the international Environmental Product Declaration (EPD) database. Several suppliers are working on dashboards that bring together all the data required for compliance with the EU’s Energy Efficiency Directive. Once functional, these dashboards could compare data centers, devices and manufacturers, as well as provide progress reports.
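
To illustrate the type of calculation behind such dashboards, the location-based Scope 2 method multiplies metered electricity by a published grid emission factor. The figures below are illustrative and are not drawn from any supplier’s model:

```python
# Illustrative location-based Scope 2 calculation (GHG Protocol method):
# emissions = electricity consumed x grid emission factor.
# The consumption and emission-factor figures are made up for the example.

annual_electricity_mwh = 11_400          # metered facility consumption
grid_factor_tco2_per_mwh = 0.35          # published factor for the local grid

scope2_tco2 = annual_electricity_mwh * grid_factor_tco2_per_mwh
print(f"Location-based Scope 2: {scope2_tco2:,.0f} tCO2e")   # 3,990 tCO2e
```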


The Uptime Intelligence View

DCIM has matured as a software solution over the past decade. Improvements in function modularity, SaaS, remote working, integration, user experience and data analytics have all progressed to the point where DCIM is now considered a viable and worthwhile investment. DCIM data will also be increasingly valuable for regulatory reporting requirements. Nonetheless, there remains more work to be done. Customers still have legitimate concerns about its complexity, cost and accuracy, while DCIM’s ability to apply AI and analytics — although an area of great promise — is still viewed cautiously. Even when commercial DCIM packages were less robust and functional, those operators that researched them diligently and deployed them carefully found them to be largely effective. This remains true today.

Water cold plates lead in the small, but growing, world of DLC

Direct liquid cooling (DLC), including cold plate and immersion systems, is becoming more common in data centers — but so far this transition has been gradual and unevenly distributed, with some data centers using it widely and others not at all. The use of DLC in 2024 still accounts for only a small minority of the world’s IT servers, according to the Uptime Institute Cooling Systems Survey 2024. The adoption of DLC remains slow in general-purpose business IT, and the most substantial deployments currently concentrate on high-performance computing applications, such as academic research, engineering, AI model development and cryptocurrency mining.

This year’s cooling systems survey results continue a trend: of those operators that use DLC in some form, the greatest number deploy water cold plate systems, with other DLC types trailing significantly. Multiple DLC types will grow in adoption over the next few years, and many installations will be in hybrid cooling environments where they share space and infrastructure with air-cooling equipment.

Within this crowded field, water cold plates’ lead is not overwhelming. Water cold plate systems retain advantages that explain this result: ease of integration into shared facility infrastructure, a well-understood coolant chemistry, greater choice in IT and cooling equipment, and less disruption to IT hardware procurement and warranty compared with other current forms of DLC. Many of these advantages stem from the technology’s long-established history, spanning decades.

This year’s cooling systems survey provides additional insights into the DLC techniques operators are currently considering. Of those data center operators using DLC, many more (64%) have deployed water cold plates than the next-highest-ranking types: dielectric-cooled cold plates (30%) and single-phase immersion systems (26%) (see Figure 1).

Figure 1. Operators currently using DLC prefer water-cooled cold plates

Diagram: Operators currently using DLC prefer water-cooled cold plates

At present, most users say they use DLC for a small portion of their IT — typically for their densest, most difficult equipment to cool (see DLC momentum rises, but operators remain cautious). These small deployments favor hybrid approaches, rather than potentially expensive dedicated heat rejection infrastructure.

Hybrid cooling predominates — for now

Many DLC installations are in hybrid (mixed) setups in which DLC equipment sits alongside air cooling equipment in the data hall, sharing both heat transport and heat rejection infrastructure. This approach can compromise DLC’s energy efficiency advantages (see DLC will not come to the rescue of data center sustainability), but for operators with only small DLC deployments, it can be the only viable option. Indeed, when operators named the factors that make a DLC system viable, the greatest number (52%, n=392) chose ease of retrofitting DLC into their existing infrastructure.

For those operators who primarily serve mainstream business workloads, IT is rarely dense and powerful enough to require liquid cooling. Nearly half of operators (48%, n=94) only use DLC on less than 10% of their IT racks — and only one in three (33%, n=54) have heat transport and rejection equipment dedicated to their liquid-cooled IT. At this early stage of DLC growth, economics and operational risks dictate that operators prefer cooling technologies that integrate more readily into their existing space. Water cold plate systems can meet this need, despite potential drawbacks.

Cold plates are good neighbors, but not perfect

Water-cooled servers typically fit into standard racks, which simplifies deployment — especially when these servers coexist with air-cooled IT. Existing racks can be reused, either fully or partially loaded with water-cooled servers, and installing new racks is also straightforward.

IT suppliers prefer to sell the DLC solution integrated with their own hardware, ranging from a single server chassis to rows of racks including cooling distribution units (CDUs). Today, this approach typically favors a cold plate system, so that operators and IT teams have the broadest selection of equipment and compatible IT hardware with vendor warranty coverage.

The use of water in data center cooling has a long history. In the early years of mainframes, water was used in part due to its advantageous thermal properties compared with air cooling, but also because of the need to remove heat effectively from the relatively small rooms that computers shared with staff.

Today, water cold plates are used extensively in supercomputing, handling extremely dense cabinets. Operators benefit from water’s low cost and ready availability, and many are already skilled in maintaining its chemistry (even though quality requirements for the water coolant are more stringent for cold plates compared with a facility loop).

The risk (and, in some cases, the vivid memories) of water leakage onto electrified IT components is one reason some operators are hesitant to embrace this technology, but leaks are statistically rare and there are established best practices in mitigation. However, with positive-pressure cold plate loops, which are the type most widely deployed by operators, the risk is never zero. The possibility of water damage is perhaps the single key weakness of water cold plates, and it is driving interest in alternative dielectric DLC techniques.

In terms of thermal performance, water is not without competition. Two-phase dielectric coolants show strong performance by taking advantage of the added cooling effect from vaporization. Vendors offer this technology in the form of both immersion tanks and cold plates, with the latter edging ahead in popularity because it requires less change to products and data center operations. The downside of all engineered coolants is the added cost, as well as the environmental concerns around manufacturing and leaks.

Some in the data center industry predict two-phase cooling products will mature to capitalize on their performance potential and eventually become a major form of cooling in the world of IT. Uptime’s survey data suggests that water cold plate systems currently have a balance of benefits and risks that make practical and economic sense for a greater number of operators. But the sudden pressure on cooling and other facility infrastructure brought about by specialized hardware for generative AI will likely create new opportunities for a wider range of DLC techniques.

Outlook

Uptime’s surveys of data center operators are a useful indicator of how operators are meeting their cooling and other infrastructure needs. The data thus far suggests a gradual DLC rollout, with water cold plates holding a stable (if not overwhelming) lead. Uptime’s interviews with vendors and operators consistently paint a picture of widespread hybrid cooling environments, which incentivize cooling designs that are flexible and interoperable.

Many water cold plate systems on the market are well suited to these conditions. Looking five years ahead, densified IT for generative AI and other intensive workloads is likely to influence data center business priorities and designs more widely. DLC adoption and operator preferences for specific technology types are likely to shift in response. Pure thermal performance is key but not the sole factor. The success of any DLC technique will rely on overcoming the barriers to its adoption, availability from trusted suppliers and support for a wide range of IT configurations from multiple hardware manufacturers.

OT protection: is air-gapping the answer?

Cyberattacks on operational technology (OT) were virtually unknown until five years ago, but their volume has been doubling since 2019. This threat is distinct from IT-focused vulnerabilities that cybersecurity measures regularly address. The risk associated with OT compromise is substantial: power or cooling failures can cripple an entire facility, potentially for weeks or months.

Many organizations believe they have air-gapped (isolated from IT networks) OT systems, protecting them against external threats. The term is often used incorrectly, however, which means exploits are harder to anticipate. Data center managers need to understand the nature of the threat and their defense options to protect their critical environments.

Physical OT consequences of rising cyberattacks

OT systems are used in most critical environments. They automate power generation, water and wastewater treatment, pipeline operations and other industrial processes. Unlike IT systems, which are inter-networked by design, these systems operate autonomously and are dedicated to the environment where they are deployed.

OT is essential to data center service delivery: this includes the technologies that control power and cooling and manage generators, uninterruptible power supplies (UPS) and other environmental systems.

Traditionally, OT has lacked robust native security, relying on air-gapping for defense. However, the integration of IT and OT has eroded OT’s segregation from the broader corporate attack surface, and threat sources are increasingly targeting OT as a vulnerable environment.

Figure 1. Publicly reported OT cyberattacks with physical consequences

Diagram: Publicly reported OT cyberattacks with physical consequences

Research from Waterfall Security shows a 15-year trend in publicly reported cyberattacks that resulted in physical OT consequences. The data shows that these attacks were rare before 2019, but their incidence has risen steeply over the past five years (see 2024 Threat report: OT cyberattacks with physical consequences).

This trend should concern data center professionals. OT power and cooling technologies are essential to data center operations. Most organizations can restore a compromised IT system effectively, but few (if any) can recover from a major OT outage. Unlike IT systems, OT equipment cannot be restored from backup.

Air-gapping is not air-tight

Many operators believe their OT systems are protected by air-gaps: systems that are entirely isolated from external networks. True air-gaps, however, are rare — and have been for decades. Suppliers and customers have long recognized that OT data supports high-value applications, particularly in terms of remote monitoring and predictive maintenance. In large-scale industrial applications — and data centers — predictive maintenance can drive better resource utilization, uptime and reliability. Sensor data enables firms to, for example, diagnose potential issues, automate replacement part orders and schedule technicians, all of which will help to increase equipment life spans, and reduce maintenance costs and availability concerns.

In data centers, these sensing capabilities are often “baked in” to warranty and maintenance agreements. But the data required to drive IT predictive maintenance systems comes from OT systems, and this means the data needs a route from the OT equipment to an IT network. In most cases, the data route is designed to work in one direction, from the OT environment to the IT application.

Indeed, the Uptime Institute Data Center Security Survey 2023 found that nearly half of operators using six critical OT systems (including UPS, generators, fire systems, electrical management systems, cooling control and physical security / access control) have enabled remote monitoring. Only 12% of this group, however, have enabled remote control of these systems, which requires a path leading from IT back to the OT environment.

Remote control has operational benefits but increases security risk. Paths from OT to IT enable beneficial OT data use, but also open the possibility of an intruder following the same route back to attack vulnerable OT environments. A bi-directional route exposes OT to IT traffic (and, potentially, to IT attackers) by design. A true air-gap (an OT environment that is not connected in any way to IT applications) is better protected than either of these alternatives, but will not support IT applications that require OT data.

Defense in depth: a possible solution?

Many organizations use the Purdue model as a framework to protect OT equipment. The Purdue model divides the overall operating environment into six layers (see Table 1). Physical (mechanical) OT equipment is level 0. Level 1 refers to instrumentation directly connected to level 0 equipment. Level 2 systems manage the level 1 devices. IT/OT integration is a primary function at level 3, while levels 4 and 5 refer to IT networks, including IT systems such as enterprise resource planning systems that drive predictive maintenance. Cybersecurity typically focuses on the upper levels of the model; the lower-level OT environments typically lack security features found in IT systems.

Table 1. The Purdue model as applied to data center environments

Table: The Purdue model as applied to data center environments

Lateral and vertical movement challenges

When organizations report that OT equipment is air-gapped, they usually mean that the firewalls between the different layers are configured to only permit communication between adjacent layers. This prevents an intruder from moving from the upper (IT) layers to the lower (OT) levels. However, data needs to move vertically from the physical equipment to the IT applications across layers — otherwise the IT applications would not receive the required input from OT. If there are vertical paths through the model, an adversary would be able to “pivot” an attack from one layer to the next.
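
The adjacency rule described above can be expressed simply. The sketch below is an illustration of the concept only; the level names and flows are hypothetical, not a product configuration:

```python
# Toy illustration of the "adjacent layers only" firewall rule in the Purdue
# model. Level numbering follows the reference architecture summarized in the
# text; the names and example flows are illustrative.
PURDUE_LEVELS = {
    "physical_equipment": 0,   # pumps, breakers, CRAC fans
    "instrumentation": 1,      # sensors and actuators on that equipment
    "control_systems": 2,      # controllers managing level 1 devices
    "it_ot_integration": 3,    # historians, site operations systems
    "enterprise_it": 4,        # business networks and applications
    "enterprise_internet": 5,  # internet-facing services
}

def flow_allowed(src: str, dst: str) -> bool:
    """Permit traffic only between the same or adjacent Purdue levels."""
    return abs(PURDUE_LEVELS[src] - PURDUE_LEVELS[dst]) <= 1

print(flow_allowed("control_systems", "it_ot_integration"))   # True
print(flow_allowed("enterprise_it", "control_systems"))       # False: level jump
```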

This is not the type of threat that most IT security organizations expect. Enterprise cyber strategies look for ways to reduce the impact of a breach by limiting lateral movement (an adversary’s ability to move from one IT system to another, such as from a compromised endpoint device to a server or application) across the corporate network. The enterprise searches for, responds to, and remediates compromised systems and networks.

OT networks also employ the principles of detecting, responding to and recovering from attacks. The priority of OT, however, is to prevent vertical movement and to avoid threats that can penetrate the IT/OT divide.

Key defense considerations

Data center managers should establish multiple layers of data center OT defense. The key principles include:

  • Defending the conduits. Attacks propagate from the IT environment through the Purdue model via connections between levels. Programmable firewalls are points of potential failure. Critical facilities use network engineering solutions, such as unidirectional networks and/or gateways, to prevent vertical movement.
  • Maintaining cyber, mechanical and physical protection. A data center OT security strategy combines “detect, respond, recover” capabilities, mechanical checks (e.g., governors that limit AC compressor speed or that respond to temperature or vibration), and physical security vigilance.
  • Preventing OT compromise. Air-gapping is used to describe approaches that may not be effective in preventing OT compromises. Data center managers need to ensure that defensive measures will protect data center OT systems from outages that could cripple an entire facility.
  • Weighing up the true cost of IT/OT integration. Business cases for applications that rely on OT data generally anticipate reduced maintenance downtime. But the cost side of the ledger needs to extend to expenses associated with protecting OT.

The Uptime Intelligence View

Attacks on OT systems are increasing, and defense strategies are often based on faulty assumptions and informed by strategies that reflect IT rather than OT requirements. A successful attack on data center OT could be catastrophic, resulting in long-term (potentially months long) facility outages.

Many OT security measures stem from the belief that OT is protected by air-gapping, but this description may not be accurate and the protection may not be adequate in the face of escalating OT attacks. Data center managers should consider OT security strategies aimed at threat prevention. These should combine the detect, respond, recover cyber principles and engineering-grade protection for OT equipment.

It is also crucial to integrate vigilant physical security to protect against unauthorized access to vulnerable OT environments. Prevention is the most critical line of defense.

Managing server performance for power: a missed opportunity

An earlier Uptime Intelligence report discussed the characteristics of processor power management (known as C-states) and explained how they can reduce server energy consumption to make substantial contributions to the overall energy performance and sustainability of data center infrastructure (see Understanding how server power management works). During periods of low activity, such features can potentially lower the server power requirements by more than 20% in return for prolonging the time it takes to respond to requests.

But there is more to managing server power than just conserving energy when the machine is not busy — setting processor performance levels that are appropriate for the application is another way to optimize energy performance. This is the crux of the issue: there is often a mismatch between the performance delivered and the performance required for a good quality of service (QoS).

When the performance is too low, the consequences are often clear: employees lose productivity, customers leave. But when application performance exceeds needs, the cost remains hidden: excessive power use.

Server power management: enter P-states

Uptime Intelligence survey data indicates that power management remains an underused feature — most servers do not have it enabled (see Tools to watch and improve power use by IT are underused). The extra power use may appear small at first, amounting to only tens of watts per server. But when scaled to larger facilities or to the global data center footprint, they add up to a huge waste of power and money.

The potential to improve the energy performance of data center infrastructure is material, but the variables involved in adopting server power management mean it is not a trivial task. Modern chip design is what creates this potential. All server processors in operation today are equipped with mechanisms to change their clock frequency and supply voltage in preordained pairs of steps (called dynamic voltage-frequency scaling). Initially, these techniques were devised to lower energy use in laptops and other low-power systems when running code that does not fully utilize resources. Known as P-states, these are in addition to C-states (low-power modes during idle time).

Later, mechanisms were added to do the opposite: increase clock speeds and voltages beyond nominal rates as long as the processor stays within hard limits for power, temperature and frequency. The effect of this approach, known as turbo mode, has gradually become more pronounced with ever-higher core counts, particularly in servers (see Cooling to play a more active role in IT performance and efficiency). As processors dynamically reallocate the power budget from lowly utilized or idling cores to highly stressed ones, clock speeds can well exceed nominal ratings — often close to doubling. In recent CPUs, even the power budget can be calibrated higher than the factory default.
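
On Linux servers, the active scaling driver, governor and frequency limits that reflect this P-state machinery are exposed through the cpufreq sysfs interface. The read-only sketch below assumes the standard cpufreq layout; some entries (such as the energy performance preference) appear only with certain drivers:

```python
# Read-only look at Linux cpufreq settings that reflect P-state policy.
# Assumes the standard /sys/devices/system/cpu/.../cpufreq layout; files that
# are absent on a given driver are reported as "n/a".
from pathlib import Path

def read_setting(cpu_dir: Path, name: str) -> str:
    path = cpu_dir / "cpufreq" / name
    return path.read_text().strip() if path.exists() else "n/a"

for cpu_dir in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
    print(
        cpu_dir.name,
        "driver:", read_setting(cpu_dir, "scaling_driver"),
        "governor:", read_setting(cpu_dir, "scaling_governor"),
        "min/max kHz:", read_setting(cpu_dir, "cpuinfo_min_freq"),
        "/", read_setting(cpu_dir, "cpuinfo_max_freq"),
        "epp:", read_setting(cpu_dir, "energy_performance_preference"),
    )
```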

As a result, server-processor behavior has become increasingly opportunistic in the past decade. When allowed, processors will dynamically seek out the electrical configuration that yields maximum performance if the software (signaled by the operating system, detected by the hardware mechanisms, or both) requests it. Such behavior is generally great for performance, particularly in a highly mixed application environment where certain software benefits from running on many cores in parallel while others prefer fewer but faster ones.

The unquantified costs of high performance

Ensuring top server performance comes at the cost of using more energy. For performance-critical applications such as technical computing, financial transactions, high-speed analytics and real-time operating systems, the use and cost of energy is often not a concern.

But for a large array of workloads, this will result in a considerable amount of energy waste. There are two main components to this waste. First, the energy consumption curve for semiconductors gets steeper the closer the chip pushes to the top of its performance envelope, because both dynamic (switching) and static (leakage) power rise steeply: dynamic power grows roughly with the cube of clock frequency once supply voltage has to rise with it, while leakage increases roughly exponentially with voltage and temperature. All the while, the performance gains diminish because the rest of the system, including the memory, storage and network subsystems, will be unable to keep up with the processor’s race pace. This increases the amount of time that the processor needs to wait for data or instructions.
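
The textbook CMOS power relations behind this curve can be summarized as follows (a general approximation, not a figure from any benchmark cited in this report):

```latex
% Approximate CMOS power model: dynamic (switching) plus static (leakage) terms.
P_{total} \;\approx\; \underbrace{\alpha\, C\, V^{2} f}_{\text{dynamic}} \;+\; \underbrace{V\, I_{leak}(V, T)}_{\text{static}}
% Because the supply voltage V must rise with clock frequency f near the top of
% the P-state range, the dynamic term grows roughly with f^3, while leakage
% rises steeply with voltage and temperature.
```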

Second, energy waste originates from a mismatch between performance and QoS. Select applications and systems, such as transaction processing and storage servers, tend to have defined QoS policies for performance (e.g., responding to 99% of queries within a second). QoS is typically about setting a floor below which performance should not drop — it is rarely about ensuring systems do not overperform, for example, by processing transactions or responding to queries unnecessarily fast.

If a second for a database query is still within tolerance, there is, by definition, limited value to having a response under one-tenth of a second just because the server can process a query that fast when the load is light. And yet, it happens all the time. For many, if not most, workloads, this level of overperformance is neither defined nor tracked, which invites an exploration of acceptable QoS.

Governing P-states for energy efficiency

At its core, the governance of P-states is like managing idle power through C-states, except with many more options, which adds complexity through choice. This report does not discuss the number of P-states because this would be highly dependent on the processor used. Similarly to C-states, a higher number denotes a higher potential energy saving; for example, P2 consumes less power than P1. P0 is the highest-performance state a processor can select.

  • No P-state control. This option tends to result in aggressive processor behavior that pushes for the maximum speeds available (electronically and thermally) for any and all of its cores. While this will result in the most energy consumed, it is preferable for high-performance applications, particularly latency-sensitive applications where every microsecond counts. If this level of performance is not justified by the workload, it can be an exceedingly wasteful control mode.
  • Hardware control. Also called the autonomous mode, this leaves P-state management for the processor to decide based on detected activity. While this mode allows for very fast transitions between states, it lacks the runtime information gathered by the operating system; hence, it will likely result in only marginal energy savings. On the other hand, this approach is agnostic of the operating system or hypervisor. The expected savings compared with no P-state control are up to around 10%, depending on the load and server configuration.
  • Software-hardware cooperation. In this mode, the operating system provider gives the processor hints on selecting the appropriate P-states. The theory is that this enables the processor control logic to make better decisions than pure hardware control while retaining the benefit of fast transitions between states to maintain system responsiveness. Power consumption reductions here can be as high as 15% to 20% at low to moderate utilization.
  • Software control. In this mode, the operating system uses a governor (a control mechanism that regulates a function, in this case speed) to make the performance decisions executed by the processor if the electrical and thermal conditions (supply voltage and current, clock frequency and silicon temperature) allow it. This mode typically carries the biggest energy-saving potential when a sophisticated software governor is used. Both Windows and Linux operating systems offer predefined plans that let the system administrator prioritize performance, balance or lower energy use.
    The trade-off here is additional latency: whenever the processor is in a low-performance state and transitions to a higher-performance state (e.g., P0) in response to a bout of compute or interrupt activity, it takes material time. Highly latency-sensitive and bursty workloads may see substantial impact.
    Power reductions can be outsized across most of the system load curve. Depending on the sophistication of the operating system governor and the selected power plan, energy savings can reach between 25% and 50%.

While there are inevitable trade-offs between performance and efficiency, in all the control scenarios the impact on performance is often negligible. This is true for users of business and web applications, or for the total runtime of technical computing jobs. High-performance requirements alone do not preclude the use of P-state control: once the processor selects P0, there is no difference between a system with controls and one without.

Applications that do not tolerate dynamic P-state controls well are often the already-suspected exceptions. This is due to their latency-sensitive, bursty nature, where the processor is unable to match the change in performance needs (scaling voltages and frequency) fast enough, even though it takes microseconds.

Arguably, for most use cases, the main concern should be power consumption, not performance. Server efficiency benchmarking data, such as that published by the Standard Performance Evaluation Corporation and The Green Grid, indicates that modern servers achieve the best energy efficiency when their performance envelope is limited (e.g., to P2) because it prevents the chip from aggressively seeking the highest clock rates across many cores. This would result in disproportionately higher power use for little in return.

Upcoming Uptime Intelligence reports will identify the software tools that data center operators can use to monitor and manage the power and performance settings of server fleets.


The Uptime Intelligence View

Server power management in its multiple forms offers data center operators easy wins in IT efficiency and opportunities to lower operational expenditure. It will be especially attractive to small- and medium-sized enterprises that run mixed workloads with low criticality, yet the effects of server power management will be much more impressive when implemented at scale and for the right workloads.

Resiliency v low PUE: regulators a catalyst for innovation?

Research has shown that while data center owners and operators are mostly in favor of sustainability regulation, they have a low opinion of the regulators’ expertise. Some operators have cited Germany’s Energy Efficiency Act as evidence: the law lays down a set of extremely challenging power usage effectiveness (PUE) requirements that will likely force some data centers to close and prevent the building of others.

The headline requirement of the act (see Table 1) is that new data centers (i.e., those that become operational in 2026 and beyond) must have an annual operational PUE of 1.2 or below within two years of commissioning. While meeting this tough energy efficiency stipulation is possible, very few data centers today — even new and well-managed ones — are achieving this figure. Uptime Intelligence data shows an average PUE of 1.45 for data centers built in the past three years.
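
For context, the arithmetic behind that threshold is unforgiving (a simple illustration, not a figure from the act):

```latex
% What an annual PUE of 1.2 allows for a facility averaging a 1 MW IT load:
\mathrm{PUE} = \frac{E_{facility}}{E_{IT}} \le 1.2
\;\;\Rightarrow\;\;
E_{overhead} = E_{facility} - E_{IT} \le 0.2 \times 8{,}760\ \mathrm{MWh} = 1{,}752\ \mathrm{MWh}
% i.e., cooling, UPS losses and all other site loads must average no more than
% 200 kW against a 1 MW IT load, year round.
```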

This stretch goal of 1.2 is particularly challenging for those trying to grapple with competing business or technical requirements. For example, some are still trying to populate new data centers, meaning they are operating at partial load; many have overriding requirements to meet Tier III concurrent maintainability and Tier IV fault-tolerant objectives; and increasingly, many are aiming to support high-density racks or servers with strict cooling requirements. For these operators, achieving this PUE level routinely will require effort and innovation that is at the limits of modern data center engineering — and it may require considerable investment.

The rules for existing data centers are also tough. By 2030, all data centers will have to operate below a PUE of 1.3. This requirement will ultimately lead to data center closures, refurbishments and migration to newer colocations and the cloud — although it is far too early to say which of these strategies will dominate.

Table 1. PUE requirements under Germany’s Energy Efficiency Act

Table: PUE requirements under Germany’s Energy Efficiency Act

To the limits

The goal of Germany’s regulators is to push data center designers, owners and operators to the limits. In doing so, they hope to encourage innovation and best practices in the industry that will turn Germany into a model of efficient data center operations and thus encourage similar practices across Europe and beyond.

Whether the regulators are pushing too hard on PUE or if the act will trigger some unintended consequences (such as a movement of workloads outside of Germany) will likely only emerge over time. But the policy raises several questions and challenges for the entire industry. In particular, how can rigorous efficiency goals be achieved while maintaining high availability? And how will high-density workloads affect the achievement of efficiency goals?

We consulted several Uptime Institute experts* to review and respond to these questions. They take a mostly positive view that while the PUE requirements are tough, they are achievable for new (or recently constructed) mission-critical data centers — but there will be cost and design engineering consequences. Some of their observations are:

  • Achieving fault tolerance / concurrent maintainability. Higher availability (Tier III and Tier IV) data centers can achieve very low PUEs — contrary to some reports. A Tier IV data center (fully fault tolerant) is not inherently less efficient than a Tier II or Tier III (concurrently maintainable) data center, especially given recent advances in equipment technology (such as digital scroll compressors, regular use of variable frequency drives, more sophisticated control and automation capabilities). These are some of the technologies that can help ensure that the use of redundant extra capacity or resilient components does not require significantly more energy.
    It is often thought that Tier IV uses a lot more energy than a Tier III data center. However, this is not necessarily the case — the difference can be negligible. The only definitive increase in power consumption that a Tier IV data center may require is an increased load on uninterruptible power supply (UPS) systems and a few extra components. However, this increase can be as little as a 3% to 5% loss on 10% to 20% of the total mechanical power load for most data centers.
    The idea that Tier IV data centers are less efficient usually comes from the requirement to have two systems in operation. But this does not necessarily mean that each of the two systems in operation can support the full current workload at a moment’s notice. The goal is to ensure that there is no interruption of service.
    On the electrical side, concurrent maintainability and fault tolerance may be achieved using a “swing UPS” and distributed N+1 UPS topologies. In this well-established architecture, the batteries provide immediate energy storage and are connected to an online UPS. There is not necessarily a need for two fully powered (2N) UPS systems; rather, the distribution is arranged so that a single failure only impacts a percentage of the IT workload. Any single redundant UPS only needs to be able to support that percentage of IT workload. The total UPS installed capacity, therefore, is far less than twice the entire workload, meaning that power use and losses are lower.
    Similarly, power use and losses can be reduced in the distribution systems by designing with as few transformers and distribution boards and transfer switches as possible. Fault tolerance can be harder to achieve in this way, with a larger number of smaller components increasing complexity — but this has the benefit of higher electrical efficiency.
    On the mechanical side, two cooling systems can be operational, with each supporting a part of the current workload. When a system fails, thermal storage is used to buy time to power up additional cooling units. Again, this is a well-established approach.
  • Cooling. Low PUEs assume that highly efficient cooling is in place, which in turn requires the use of ambient cooling. Regulation will drive the industry to greater use of economization — whether these are air-side economizers, water-side economizers, or pumped refrigerant economizers.
    • While very low PUEs do not rule out the use of standard direct expansion (DX) units, their energy use may be problematic. The use of DX units on hot days may be required to such an extent that very low PUEs are not achievable.
    • More water, less power? The PUE limitation of 1.2 may encourage the use of cooling technologies that evaporate more water. In many places, air-cooled chillers may struggle to cool the workload on hot days, which will require mechanical assistance. This may happen too frequently to achieve low annualized PUEs. A water-cooled plant or the use of evaporative cooling will likely use less power — but, of course, require much more water.
  • Increasing density will challenge cooling — even direct liquid cooling (DLC). Current and forecasted high-end processors, including graphics processing units, require server inlet temperatures at the lower end of the ASHRAE range. This will make it very difficult to achieve low PUEs.
    DLC can only provide a limited solution. While it is highly effective in efficiently cooling higher density systems at the processor level, the lower temperatures required may necessitate chilled or refrigerated air or water — which, of course, increases power consumption and pushes up PUEs.
  • Regulation may drive location. Low PUEs are much easier and more economically achieved in cooler or less humid climates. Regulators that mandate low PUEs will have to take this into account. Germany, for example, has mostly cool or cold winters and warm or hot summers, but with low / manageable humidity — therefore, it may be well suited to economization. However, it will be easier and less expensive to achieve these low PUEs in Northern Germany, or even in Scandinavia, rather than in Southern Germany.
  • Build outs. Germany’s Energy Efficiency Act requires that operators reach a low PUE within two years of the data center’s commissioning. This, in theory, gives the operator time to fill up the data center and reach an optimum level of efficiency. However, most data centers actually fill out and/or build out over a much longer timescale (four years is more typical).
    This may have wider design implications. Achieving a PUE of 1.2 at full workload requires that the equipment is selected and powered to suit the workload. But at a partial workload, many of these systems will be over-sized or not be as efficient. To achieve optimal PUE at all workloads, it may be necessary to deploy more, smaller capacity components and take a more modular approach — possibly using repeatable, prefab subsystems. This will have cost implications; to achieve concurrent maintainability for all loads, designers may have to use innovative N+1 designs with the greater use of smaller components.
  • Capital costs may rise. Research suggests that low PUE data centers can also have low operational costs — notably because of reduced energy use. However, the topologies and the number of components, especially for higher availability facilities, may be more expensive. Regulators mandating lower PUEs may be forcing up capital costs, although these can be recouped later.

The Uptime Intelligence View

The data center industry is not so new. Many facilities — perhaps most — achieve very high availability as a result of proven technology and assiduous attention by management and operators. However, the great majority do not use state-of-the-art engineering to achieve high energy efficiency. Regulators want to push them in this direction. Uptime Intelligence’s assessment is that — by adopting well-thought-out designs and building in a modular, balanced way — it is possible to have and operate highly available, mission-critical data centers with very high energy efficiency at all workloads. However, there will likely be a considerable cost premium.


*The Uptime Institute experts consulted for this Update are:

Chris Brown, Chief Technical Officer, Uptime Institute

Ryan Orr, Vice President, Topology Services and Global Tier Authority, Uptime Institute

Jay Dietrich, Research Director, Sustainability, Uptime Institute

Dr. Tomas Rahkonen, Research Director, Distributed Data Centers, Uptime Institute

Data center management software is evolving — at last

Despite their role as enablers of technological progress, data center operators have been slow to take advantage of developments in software, connectivity and sensor technologies that can help optimize and automate the running of critical infrastructure.

Most data center owners and operators currently use building management systems (BMS) and / or data center infrastructure management (DCIM) software as their primary interfaces for facility operations. These tools have important roles to play in data center operations but have limited analytics and automation capabilities, and they often do little to improve facility efficiency.

To handle the increasing complexity and scale of modern data centers — and to optimize for efficiency — operations teams need new software tools that can generate value from the data produced by the facilities equipment.

For more than a decade, DCIM software has been positioned as the main platform for organizing and presenting actionable data — but its role has been largely passive. Now, a new school of thought on data center management software is emerging, proposed by data scientists and statisticians. From their point of view, a data center is not a collection of physical components but a complex system of data patterns.

Seen in this way, every device has a known optimal state, and the overall system can be balanced so that as many devices as possible are working in this state. Any deviation would then indicate a fault with equipment, sensors or data.

This approach has been used to identify defective chiller valves, imbalanced computer room air conditioning (CRAC) fans, inefficient uninterruptible power supply (UPS) units and opportunities to lower floor pressure — all discovered in data, without anyone having to visit the facility. The overall system — the entire data center — can also be modeled in this way, so that it can be continuously optimized.
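
A minimal sketch of this data-pattern view, using a simple statistical baseline rather than any vendor’s model, is shown below: the “normal” operating band for a sensor is learned from history, and deviations are flagged for investigation.

```python
# Toy example of the data-pattern view: learn a device's normal operating band
# from historical sensor readings, then flag deviations for investigation.
# The readings are synthetic; real tools use richer models than a z-score.
import statistics

history_c = [22.1, 22.3, 22.0, 22.4, 22.2, 22.5, 22.1, 22.3]  # supply air temp, °C
mean = statistics.mean(history_c)
stdev = statistics.stdev(history_c)

def deviates(reading_c: float, sigmas: float = 3.0) -> bool:
    """Flag readings outside the learned band (mean +/- sigmas * stdev)."""
    return abs(reading_c - mean) > sigmas * stdev

for reading in (22.2, 23.6):
    status = "anomaly - inspect CRAC/valve" if deviates(reading) else "normal"
    print(f"{reading:.1f} °C -> {status}")
```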

All about the data

Data centers are full of sensors. They can be found inside CRAC and computer room air handler units, chillers, coolant distribution units, UPS systems, power distribution units, generators and switchgear, and many operators install optional temperature, humidity and pressure sensors in their data halls.

These sensors can serve as a source of valuable operational insight — yet the data they produce is rarely analyzed. In most cases, applications of this information are limited to real-time monitoring and basic forecasting.

In recent years, major mechanical and electrical equipment suppliers have added network connectivity to their products as they look to harvest sensor data to understand where and how their equipment is used. This has benefits for both sides: customers can monitor their facilities from any device, anywhere, while suppliers have access to information that can be used in quality control, condition-based or predictive maintenance, and new product design.

The greatest benefits of this trend are yet to be harnessed. Aggregated sensor data can be used to train artificial intelligence (AI) models with a view to automating an increasing number of data center tasks. Data center owners and operators do not have to rely on equipment vendors to deliver this kind of innovation. They can tap into the same equipment data, which can be accessed using industry-standard protocols like SNMP or Modbus.

When combining sensor data with an emerging category of data center optimization tools — many of which rely on machine learning — data center operators can improve their infrastructure efficiency, achieve higher degrees of automation and lower the risk of human error.

The past few years have also spawned new platforms that simplify data manipulation and analysis. These enable larger organizations to develop their own applications that leverage equipment data — including their own machine learning models.

The new wave

Dynamic cooling optimization is the best-understood example of this new, data-centric approach to facilities management. These applications rely on sensor data and machine learning to determine and continually “learn” the relationships between variables such as rack temperatures, cooling equipment settings and overall cooling capacity. The software can then tweak the cooling equipment performance based on minute changes in temperature, enabling the facility to respond to the needs of the IT in near real-time.

Many companies working in this field have close ties to the research community. AI-powered cooling optimization vendor TycheTools was founded by a team from the Technical University of Madrid. The Dutch startup Coolgradient was co-founded by a data scientist and collaborates with several universities. US-based Phaidra has brought together some of the talent that previously published and commercialized cutting-edge research as part of Google’s DeepMind.

Some of the benefits offered by data-centric management software include:

  • Improved facility efficiency: through the automated configuration of power and cooling equipment as well as the identification of inefficient or faulty hardware.
  • Better maintenance: by enabling predictive or condition-based maintenance strategies that consider the state of individual hardware components.
  • Discovery of stranded capacity: through the thorough analysis of all data center metrics, not just high-level indicators.
  • Elimination of human error: through either a higher degree of automation or automatically generated recommendations for human employees.
  • Improvements in skill management: by analyzing the skills of the most experienced staff and codifying them in software.

Not all machine learning models require extensive compute resources, rich datasets and long training times. In fact, many of the models used in data centers today are small and relatively simple. Both training and inference can run on general-purpose servers, and it is not always necessary to aggregate data from multiple sites — a model trained locally on a single facility’s data will often be sufficient to deliver the expected results.

New tools bring new challenges

The adoption of data-centric tools for infrastructure management will require owners and operators to recognize the importance of data quality. They will not be able to trust the output of machine learning models if they cannot trust their data — and that means additional work on standardizing and cleaning their operational data stores.

In some cases, data center operators will have to hire analysts and data scientists to work alongside the facilities and IT teams.

Data harvesting at scale will invariably require more networking inside the data center — some of it wireless — and this presents a potentially wider attack surface for cybercriminals. As such, cybersecurity will be an important consideration for any operational AI deployment and a key risk that will need to be continuously managed.

Evolution is inevitable

Uptime Institute has long argued that data center management tools need to evolve toward greater autonomy. The data center maturity model (see Table 1) was first proposed in 2019 and applied specifically to DCIM. Four years later, there is finally the beginning of a shift toward Level 4 and Level 5 autonomy, albeit with a caveat — DCIM software alone will likely never evolve these capabilities. Instead, it will need to be combined with a new generation of data-centric tools.

Table 1. The Data Center Management Maturity Model

Table: The Data Center Management Maturity Model

Not every organization will, or should, take advantage of Level 4 and Level 5 functionality. These tools will provide an advantage to the operators of modern facilities that have exhausted the list of traditional efficiency measures, such as those achieving PUE values of less than 1.3.

For the rest, being an early adopter will not justify the expense. There are cheaper and easier ways to improve facility efficiency that do not require extensive data standardization efforts or additional skills in data science.

At present, AI and analytics innovation in data center management appears to be driven by startups rather than established software vendors. Few BMS and DCIM developers have integrated machine learning into their core products, and while some companies have features in development, these will take time to reach the market — if they ever leave the lab.

Uptime Intelligence is tracking six early-stage or private companies that use facilities data to create machine learning models and already have products or services on the market. It is likely more will emerge in the coming months and years.

These businesses are creating a new category of software and services that will require new types of interactions with all the moving parts inside the data center, as well as new commercial strategies and new methods of measuring the return on investment. Not all of them will be successful.

The speed of mainstream adoption will depend on how easy these tools will be to implement. Eventually, the industry will arrive at a specific set of processes and policies that focus on benefitting from equipment data.


The Uptime Intelligence View

The design and capabilities of facilities equipment have changed considerably over the past 10 years, and traditional data center management tools have not kept up. A new generation of software from less-established vendors now offers an opportunity to shift the focus from physical infrastructure to data. This introduces new risks — but the benefits are too great to ignore.