The advent of AI training and inference applications, combined with the continued expansion of the digital world and electrification of the economy, raises two questions about electricity generation capacity: where will the new energy be sourced from, and how can it be decarbonized?
Groups such as the International Energy Agency and the Electric Power Research Institute project that data center energy demand will rise at an annual rate of 5% or more in many markets over the next decade. When this growth is combined with mandated decarbonization and electrification initiatives, electricity generation capacity will need to expand to between two and three times its current level by 2050 to meet demand.
Many data center operators have promoted the need to add only carbon-free generation assets to increase capacity. This will be challenging: meaningful deployment of carbon-free generation, including energy from geothermal sources and small modular reactors (SMRs), and battery capacity, such as long duration energy storage (LDES), is at least five years away. Given the current state of technology development and the deployment of manufacturing capacity, it will likely take at least 10 years before these technologies are widely used on the electricity grid in most regions. This means that natural gas generation assets will have to be included in the grid expansion to maintain grid reliability.
Impact of wind / solar on grid reliability
To demonstrate the point, consider the following example. Figure 1 depicts the current generation capacity in Germany under five different weather and time of day scenarios (labeled A1 to A5). Table 1 provides the real-time capacity factors for the wind, solar and battery assets used for these scenarios. German electricity demand is 65 GW in summer (not shown) and 78 GW in winter (shown by the dotted line). The blue line is the total available generation capacity.
Figure 1. Grid generation capacity under different weather and time of day scenarios
Table 1. Generation capacity factors for scenarios A and B
Scenario A details the total available generation capacity of the German electricity grid in 2023. The grid has sufficient dispatchable generation to maintain grid reliability. It also has enough grid interconnects to import and export electricity production to address the over- or under-production due to the variable output of the wind and solar generation assets. An example of energy importation is the 5 GW of nuclear generation capacity, which comes from France.
Scenario A1 depicts the available generation capacity based on the average capacity factors of the different generation types. Fossil fuel and nuclear units typically have a 95% or greater capacity factor because they are only offline for minimal periods during scheduled maintenance. Wind and solar assets have much lower average capacity factors (see Table 1) due to the vagaries of the weather. Lithium-ion batteries have limited capacity because they discharge for two to four hours and a typical charge / discharge cycle is one day. As a result, the average available capacity for Germany is only half of the installed capacity.
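The arithmetic behind scenario A1 is straightforward: the available capacity of each generation type is its installed capacity multiplied by its capacity factor. The sketch below illustrates the calculation; the installed capacities and capacity factors shown are illustrative assumptions, not the values in Table 1.

```python
# Illustrative sketch: average available capacity = installed capacity x capacity factor.
# All figures below are assumptions for illustration only; they are not the Table 1 values.

installed_gw = {
    "fossil_fuel": 80.0,
    "nuclear_import": 5.0,
    "wind": 65.0,
    "solar": 80.0,
    "battery": 10.0,
}

avg_capacity_factor = {
    "fossil_fuel": 0.95,    # offline only for scheduled maintenance
    "nuclear_import": 0.95,
    "wind": 0.25,           # weather-dependent
    "solar": 0.12,          # weather- and daylight-dependent
    "battery": 0.15,        # two- to four-hour discharge per daily cycle
}

available_gw = {
    source: installed_gw[source] * avg_capacity_factor[source]
    for source in installed_gw
}

total_installed = sum(installed_gw.values())
total_available = sum(available_gw.values())

print(f"Installed capacity: {total_installed:.0f} GW")
print(f"Average available capacity: {total_available:.0f} GW "
      f"({total_available / total_installed:.0%} of installed)")
```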
Grid stress
The average capacity only tells half the story because the output from wind, solar and battery energy sources varies between zero and maximum capacity based on weather conditions and battery charge / discharge cycles.
Scenarios A2 and A3 illustrate daytime and nighttime situations with high solar and wind output. In these scenarios, the available generation capacity significantly exceeds electricity demand.
In scenario A2, the 139 GW of available wind and solar assets enable Germany to operate on 100% carbon-free energy, export energy to other grid regions and charge battery systems. Fossil fuel units will be placed in their minimal operating condition to minimize output but ensure availability as solar assets ramp down production in the evening. Some solar and wind-generating assets will likely have to be curtailed (disconnected from the grid) to maintain grid stability.
In scenario A3, the wind and fossil fuel assets provide sufficient generation capacity to match the demand. The output of the fossil fuel assets can be adjusted as the wind output modulates to maintain grid balance and reliability. Discharging the batteries and importing or exporting electricity to other grid regions are unnecessary.
Scenarios A4 and A5 show the challenges posed by solar and wind generation variability and limited battery discharge times (current lithium-ion battery technologies have a four-hour discharge limit). These scenarios represent the 10th to 20th percentile of wind and solar generation asset availability. Low wind and low or nonexistent solar output push the available generation capacity close to the electricity demand. If several fossil fuel assets are offline and/or imports are limited, the grid authority will have to rely on demand management capacity, battery capacity and available imports to keep the grid balanced and avoid rolling blackouts (one- to two-hour shutdowns of electricity in one or more grid sub-regions).
Demand management
The situation described above is not peculiar to Germany. Wind and solar generation assets represent the bulk of the new capacity being added to meet new data center and electrification demand and to replace retiring fossil fuel and nuclear assets, because they are (1) mandated by many state, provincial and national governments and (2) the only economically viable form of carbon-free electricity generation. Unfortunately, their variable output leaves significant supply gaps, lasting hours, days or whole seasons, that cannot currently be addressed with carbon-free generation assets. These gaps will have to be filled with a combination of new and existing fossil fuel assets, existing nuclear assets (new conventional or small modular nuclear reactors are eight years or more in the future) and demand management.
As the installed solar and wind generation capacity increases to a greater percentage of supply capacity, data center operators will play a major role in demand management strategies. Several strategies are available to the utilities / data center operators, but each has its drawbacks.
Backup generation systems
Significantly, emergency generator sets will need to be used to take data centers off the grid to address generation capacity shortfalls. This strategy is being deployed in the US (California), Ireland and other data center markets. In conversations, several operators reported that their grid authority asked them to run their emergency generators to relieve grid demand and ensure stability during the summer of 2023.
As new data centers are built and deployed, operators in stressed grid regions (i.e., those with a high percentage of capacity delivered by wind and solar assets) should plan and permit their emergency generators to be operated for 25% or more of the year to support grid reliability in the face of variable wind and solar generation asset production.
Workload reduction
Demand management can also be achieved by reducing data center energy consumption. Google has reported the development of protocols to shut down non-critical workloads, such as controllable batch jobs, or shift a workload to one or more data centers in other grid regions. The reports did not provide the volume of workloads moved or the demand reductions, which suggests that they were not significant. These tools are likely only used on workloads controlled by Google, such as search workloads or work on development servers. They are unlikely to be deployed on client cloud workloads because many IT operators are uncomfortable with unnecessarily stressing their operations and processes.
An example of an IT enterprise operation that could support demand management is a financial organization running twin data centers in two different grid regions. When grid stability is threatened, it could execute its emergency processes to move all workloads to a single data center. In addition to receiving a payment for reducing demand, this would be an opportunity to test and validate its emergency workload transfer processes. While there is a strong logic to justify this choice, IT managers will likely be hesitant to agree to this approach.
Outages and reliability problems are more likely to emerge when operational changes are being made, and demand management payments from the grid authority or energy retailer will not compensate for the risk of penalties under service level agreements. The use of backup generators will likely be the preferred response, although switching to generators carries its own risks, such as generators failing to start or synchronize, or transfer switch failures.
New solar and wind capacity needed
The energy demands of new data centers and the broader electrification of the economy will require the installation of new electricity generation capacity in grid regions around the world. Large colocation and cloud service providers have been particularly vocal that these generation assets should be carbon-free and not involve new fossil fuel generation assets. An analysis of the impact of increasing the wind, solar and battery generation capacity on the German grid by 20% to support a 15% increase in demand reveals the inherent dangers of this position.
Figure 2 details the available grid generation capacity under the five available capacity scenarios (see Table 1) when the wind, solar and battery capacities are increased by 20%. This increase in generating capacity is assumed to support a 15% rise in energy demand, which increases winter demand to 90 GW.
Figure 2. Impact of a 20% increase in wind and solar generation
The 20% generating capacity increase delivers sufficient available capacity for periods of moderate to high wind and solar capacity (scenarios B2 and B3). There is a sufficient capacity reserve (about 10% to 15% of demand) to provide the needed generation capacity if some fossil fuel-based generators are offline or the imports of nuclear-generated electricity are not available.
However, the capacity increase does not significantly raise the available capacity during periods of low solar and wind output, putting grid stability at risk in scenarios B4 and B5. The available capacity in scenario B4 increases by only 4 GW, which is insufficient to meet the new demand or provide any capacity reserve. Under scenario B5, there is barely enough capacity to provide a sufficient reserve (about 10% to 15% of demand). In both cases, grid stability is at risk and some combination of demand management, battery capacity and imports will be required.
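The reserve-margin arithmetic behind these scenarios can be expressed simply, as in the sketch below. The 90 GW winter demand figure comes from the scenario description; the available-capacity values for B4 and B5 are illustrative assumptions chosen only to show how thin the margin becomes.

```python
# Reserve-margin sketch for low wind/solar scenarios (B4/B5-style).
# The 90 GW demand is from the text; the available-capacity figures are assumptions.

winter_demand_gw = 90.0

def reserve_margin(available_gw: float, demand_gw: float) -> float:
    """Reserve margin as a fraction of demand (negative means a shortfall)."""
    return (available_gw - demand_gw) / demand_gw

scenarios = {
    "B4 (low wind, no solar)": 92.0,
    "B5 (low wind, low solar)": 101.0,
}

for name, available in scenarios.items():
    margin = reserve_margin(available, winter_demand_gw)
    status = "adequate" if margin >= 0.10 else "at risk"
    print(f"{name}: {available:.0f} GW available, "
          f"reserve margin {margin:+.0%} ({status}; target 10% to 15%)")
```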
Until reliable, dispatchable carbon-free electricity technologies, such as SMRs, geothermal generation and LDES, are developed and deployed at scale, grid stability will depend on the presence of sufficient fossil fuel generation assets to match energy supply to demand. The deployed capacity will likely have to be 75% to 100% of projected demand to address three- to 10-day periods of low solar and wind output and expected seasonal variations in wind and solar production.
To enable growth and ensure reliable electricity service while increasing renewable energy generation capacity, data center operators will need to balance and likely compromise their location selection criteria, decarbonization goals, and the size of their electrical demand growth.
Managing growing power demand
Data center operators will have to reevaluate their sustainability objectives in light of the need to increase the overall grid generation capacity and the percentage of that capacity that is carbon-free generation while maintaining grid stability and reliability. Operators should consider the following options to take meaningful steps to further decarbonize the grid:
Set realistic goals for global operations to consume 80% to 90% carbon-free energy over the next decade rather than making unrealistic near-term net-zero claims. Studies have shown significant technical and economic obstacles to decarbonizing the last 10% to 20% of electricity generation while maintaining grid reliability. Most data centers currently operate on 10% to 50% (or less) carbon-free energy. Working to achieve 80% to 90% consumption of carbon-free energy by 2035 at an enterprise level is a reasonable goal that will help to support the grid’s decarbonization.
Support research, development and deployment of reliable, dispatchable carbon-free electricity generation and LDES technologies. Many large data center operators are making exemplary efforts in this area by supporting innovative, on-site operational tests and entering into prospective contracts for future purchases of energy or equipment.
Design, permit and deploy backup generation systems for 2,000 to 3,000 hours of annual operation to support grid reliability during periods of low production from wind and solar generation assets. Operators are already activating these systems to manage grid demands in some jurisdictions, such as Ireland and California in the US, that have a large installed wind or solar generation base with high variability in day-to-day and seasonal output.
Intensify efforts to improve the efficiency of the IT infrastructure. Efforts to improve IT efficiency need to include the deployment of server power management functions and storage capacity optimization methods; increases in the utilization of installed server, storage and network capacity to higher levels; refreshing and consolidating IT equipment; and designing energy efficiency into software applications and algorithms with the same intensity and commitment given to providing performance and function. The IT infrastructure consumes 70% to 90% of the energy in most data centers yet data shows that more than 60% of IT operators largely ignore the opportunities to drive IT equipment efficiency improvements.
Improve and validate techniques to move workloads from a data center in a stressed grid region to one in an unstressed region. Cloud workloads can be shifted to underutilized data centers in unstressed regions, and an operator running paired data centers, such as the financial organization described earlier, can move the full workload to the facility on the unstressed grid. Data center operators will be hesitant to deploy these techniques because they can place critical operations at risk.
The data center industry can maintain its sustainability focus despite the expected growth in power demand. To do so, IT operators need to refocus on continually increasing the work delivered per unit of energy consumed by the IT infrastructure and the megawatt-hours of carbon-free energy consumed by data center operations. With available technologies, much can be done while developing the technologies needed to decarbonize the last 10% to 20% of the electricity grid.
The Uptime Intelligence View
Accelerating the growth of data center capacity and energy consumption does not have to imperil the industry’s drive toward sustainability. Instead, it requires that the sector pragmatically reassess its sustainability efforts. Sustainability strategies need to first intensify their focus on increased IT infrastructure utilization and work delivered per unit of energy consumed, and then take responsible steps to decarbonize the energy consumed by data centers while supporting necessary efforts to grow generation capacity on the grid and ensure grid stability.
Grid growth and decarbonization: An unhappy couple. Jay Dietrich, Research Director of Sustainability, Uptime Institute. Published January 8, 2025.
Many employees approach AI-based systems in the workplace with a level of mistrust. This lack of trust can slow the implementation of new tools and systems, alienate staff and reduce productivity. Data center managers can avoid this outcome by understanding the factors that drive mistrust in AI and devising a strategy to minimize them.
Perceived interpersonal trust is a key productivity driver for humans but is rarely discussed in a data center context. Researchers at the University of Cambridge in the UK have found that interpersonal trust and organizational trust have a strong correlation with staff productivity. In terms of resource allocation, lack of trust requires employees to invest time and effort organizing fail-safes to circumvent perceived risks. This takes attention away from the task at hand and results in less output.
In the data center industry, trust in AI-based decision-making has declined significantly in the past three years. In Uptime Institute’s 2024 global survey of data center managers, 42% of operators said they would not trust an adequately trained AI system to make operational decisions in the data center, which is up 18 percentage points since 2022 (Figure 1). If this decline in trust continues, it will be harder to introduce AI-based tools.
Figure 1. More operators distrust AI in 2024 than in previous years
Managers who wish to unlock the productivity gains associated with AI may need to create specific conditions to build perceived trust between employees and AI-based tools.
Balancing trust and cognitive loads
The trust-building cycle requires a level of uncertainty. In the Mayer, Davis and Schoorman trust model, this uncertainty occurs when an individual is presented with the option to transfer decision-making autonomy to another party, which, in the data center, might be an AI-based control system (see An integrative model of organizational trust). Individuals evaluate perceived characteristics of the other party against risk to determine whether they can relinquish decision-making control. If this leads to desirable outcomes, individuals gain trust and perceive less risk in the future.
Trust toward AI-based systems can be encouraged by using specific deployment techniques. In Uptime Institute’s Artificial Intelligence and Software Survey 2024, almost half of the operators that have deployed AI capabilities report that predictive maintenance is driving their use of AI.
Researchers from Australia’s University of Technology Sydney and University of Sydney tested human interaction with AI-based predictive maintenance systems, with participants having to decide how to manage a situation with a burst water pipe under different levels of uncertainty and cognitive load (cognitive load being the amount of working memory resources used). For all participants, trust in the automatically generated suggestions was significantly higher under low cognitive loads. AI systems that communicated decision risk odds prevented trust from decreasing, even when cognitive load increased.
Without decision risk odds displayed, employees devoted more cognitive resources toward deciphering ambiguity, leaving less space in their working memory for problem solving. Interpretability of the output of AI-based systems drives trust: it allows users to understand the context of specific suggestions, alerts and predictions. If a user cannot understand how a predictive maintenance system came to a certain conclusion, they will lose trust. In this situation, productivity will stall as workers devote cognitive resources toward attempting to retrace the steps the system made.
Team dynamics
In some cases, staff who work with AI systems personify them and treat them as co-workers rather than tools. Mirroring human social group dynamics, in which people hold a negative bias toward those outside their own group ("outgroup" bias), staff may then lack trust in these AI systems.
AI systems can engender anxiety relating to job security and may trigger the fear of being replaced — although this is less of a factor in the data center industry, where staff are in short supply and not at high risk of losing their jobs. Nonetheless, researchers at the Institute of Management Sciences in Pakistan find that adoption of AI in general is linked with cognitive job insecurity, which threatens workers’ perceived trust in an organization.
Introduction of AI-based tools in a data center may also cause a loss in expert status for some senior employees, who might then view these tools as a threat to their identity.
Practical solutions
Although there are many obstacles to introducing AI-based tools into a human team, the solutions to mitigating them are often intuitive and psychological, rather than technological. Data center team managers can improve trust in AI technology through the following options:
Choose AI tools that demonstrate risk transparently. Display a metric for estimated prediction accuracy (a sketch of such an alert follows this list).
Choose AI tools that emphasize interpretability. This could include descriptions of branching logic, statistical data, metrics or other context for AI-based suggestions or decisions.
Combat outgroup bias. Arrange for trusted “ingroup” team leads to demonstrate AI tools to the rest of the group (instead of the tool vendors or those unfamiliar to the team).
Implement training throughout the AI transition process. Many employees will experience cognitive job insecurity despite being told their positions are secure. Investing in and implementing training during the AI transition process allows staff to feel a sense of control over their ability to affect the situation, minimize the gap between known and needed skills and prevent a sense of losing expert status.
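As an illustration of the first two points, the sketch below shows how a predictive maintenance alert might present an estimated confidence figure alongside the signals that drove the prediction. The alert fields, asset name and numbers are hypothetical.

```python
# Hypothetical sketch: surfacing decision risk and supporting context with an
# AI-generated maintenance alert, so staff are not left to decipher ambiguity themselves.

from dataclasses import dataclass, field

@dataclass
class MaintenanceAlert:
    asset: str
    prediction: str
    probability: float          # estimated likelihood the prediction is correct
    contributing_signals: list = field(default_factory=list)

    def render(self) -> str:
        lines = [
            f"Asset: {self.asset}",
            f"Prediction: {self.prediction}",
            f"Estimated confidence: {self.probability:.0%}",
            "Why the system thinks so:",
        ]
        lines += [f"  - {signal}" for signal in self.contributing_signals]
        return "\n".join(lines)

alert = MaintenanceAlert(
    asset="CRAH-07",
    prediction="Fan bearing failure likely within 14 days",
    probability=0.82,
    contributing_signals=[
        "Vibration amplitude up 35% over 30-day baseline",
        "Bearing temperature trending +4°C week on week",
    ],
)
print(alert.render())
```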
Many of the solutions described above rely on social contracts—the transactional and relational agreements between employees and an organization. US psychologist Denise Rousseau (a professor at Carnegie Mellon University, Pittsburgh PA) describes relational trust as the expectation that a company will pay back an employee’s investments through growth, benefits and job security — all factors that go beyond the rewards of a salary.
When this relational contract is broken, staff will typically shift their behavior and deprioritize long-term company outcomes in favor of short-term personal gains.
Data center team leaders can use AI technologies to strengthen or break relational contracts in their organizations. Those who consider the factors outlined above will be more successful with maintaining an effective team.
The Uptime Intelligence View
An increasing number of operators cite plans to introduce AI-based tools into their data center teams, yet surveys increasingly report a mistrust in AI. When human factors, such as trust, are well managed, AI can be an asset to any data center team. If the current downward trend in trust continues, AI systems will become harder to implement. Solutions should focus on utilizing positive staff dynamics, such as organizational trust and social contracts.
Building trust: working with AI-based tools. Rose Weinschenk, Research Associate, Uptime Institute. Published December 19, 2024.
Data center infrastructure management (DCIM) software is an important class of software that, despite some false starts, many operators regard as essential to running modern, flexible and efficient data centers. It has had a difficult history — many suppliers have struggled to meet customer requirements and adoption remains patchy. Critics argue that, because of the complexity of data center operations, DCIM software often requires expensive customization and feature development for which many operators have neither the expertise nor the budget.
This is the first of a series of reports by Uptime Intelligence exploring data center management software in 2024 — two decades or more after the first commercial products were introduced. Data center management software is a wider category than DCIM: many products are point solutions; some extend beyond a single site and others have control functions. Uptime Intelligence is referring to this category as data center management and control (DCM-C) software.
DCIM, however, remains at the core. This report identifies the key areas in which DCIM has changed over the past decade and, in future reports, Uptime Intelligence will explore the broader DCM-C software landscape.
What is DCIM?
DCIM refers to data center infrastructure management software, which collects and manages information about a data center’s IT and facility assets, resource use and operational status, often across multiple systems and distributed environments. DCIM primarily focuses on three areas:
IT asset management. This involves logging and tracking of assets in a single searchable database. This can include server and rack data, IP addresses, network ports, serial numbers, parts and operating systems (a minimal sketch of such a record follows this list).
Monitoring. This usually includes monitoring rack space, data and power (including power use by IT and connected devices), as well as environmental data (such as temperature, humidity, air flow, water and air pressure).
Dashboards and reporting. To track energy use, sustainability data, PUE and environmental health (thermal, pressure etc.), and monitor system performance, alerts and critical events. This may also include the ability to simulate and project forward – for example, for the purposes of capacity management.
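As a simple illustration of the asset management function, the sketch below shows the kind of searchable asset record a DCIM database maintains. The field names and values are illustrative and do not reflect any particular product's schema.

```python
# Minimal sketch of a DCIM-style IT asset record and a searchable in-memory inventory.
# Field names are illustrative; real DCIM products define their own schemas.

from dataclasses import dataclass

@dataclass
class AssetRecord:
    serial_number: str
    asset_type: str          # e.g., server, switch, PDU
    rack: str
    rack_unit: int
    ip_address: str
    operating_system: str
    rated_power_w: int

inventory = [
    AssetRecord("SN-001", "server", "RACK-A01", 12, "10.0.1.21", "Linux", 650),
    AssetRecord("SN-002", "server", "RACK-A01", 14, "10.0.1.22", "Linux", 650),
    AssetRecord("SN-003", "switch", "RACK-A01", 42, "10.0.1.1", "NOS", 150),
]

def find_by_rack(rack: str):
    """Return all assets installed in a given rack."""
    return [a for a in inventory if a.rack == rack]

rack_assets = find_by_rack("RACK-A01")
total_rated_power = sum(a.rated_power_w for a in rack_assets)
print(f"RACK-A01: {len(rack_assets)} assets, {total_rated_power} W rated load")
```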
In the past, some operators have taken the view that DCIM does not justify the investment, given its cost and the difficulty of successful implementation. However, these reservations may be product specific and can depend on the situation; many others have claimed a strong return on investment and better overall management of the data center with DCIM.
Growing need for DCIM
Uptime’s discussions with operators suggest there is a growing need for DCIM software, and related software tools, to help resolve some of the urgent operational issues around sustainability, resiliency and capacity management. The current potential benefits of DCIM include:
Improved facility efficiency and resiliency through automating IT updates and maintenance schedules, and the identification of inefficient or faulty hardware.
Improved capacity management by tracking power, space and cooling usage, and locating appropriate resources to reserve.
Procedures and rules are followed. Changes are documented systematically; asset changes are captured and stored — and permitted only if the requirements are met.
Denser IT accommodated by identifying available space and power for IT, which may make it easier to densify racks, for example by allocating resources to AI / machine learning and high-performance computing. The introduction of direct liquid cooling (DLC) will further complicate these environments.
Human error eliminated through a higher degree of task automation, as well as improved workflows, when making system changes or updating records.
Meanwhile, there will be new requirements from customers for improved monitoring, reporting and measurement of data, including:
Monitoring of equipment performance to avoid undue wear and tear or system stress, which might reduce the risk of outages.
Shorter ride-through times, which may require more monitoring. For example, in the event of a major power outage, IT equipment may only have a short window of cooling supported by the UPS.
Greater variety of IT equipment (graphics processing units, central processing units, application-specific integrated circuits) may mean a less predictable, more unstable environment. Monitoring will be required to ensure that their different power loads, temperature ranges and cooling requirements are managed effectively.
Sustainability metrics (such as PUE), as well as other measurables (such as water usage effectiveness, carbon usage effectiveness and metrics to calculate Scope 1, 2 or 3 greenhouse gas emissions). The sketch after this list shows how these headline metrics are derived.
Legal requirements for transparency of environmental, sustainability and resiliency data.
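The headline sustainability metrics in this list are simple ratios against IT energy consumption. The sketch below shows how a DCIM dashboard might derive them; the input figures are illustrative.

```python
# Sketch of the standard sustainability metric calculations a DCIM dashboard might report.
# Input figures are illustrative annual values.

total_facility_energy_kwh = 12_000_000   # includes cooling, power distribution, lighting
it_energy_kwh = 8_000_000                # energy delivered to IT equipment
site_water_use_liters = 18_000_000       # annual water consumption
grid_emissions_kg_co2 = 4_200_000        # CO2 from purchased electricity

pue = total_facility_energy_kwh / it_energy_kwh     # power usage effectiveness
wue = site_water_use_liters / it_energy_kwh         # water usage effectiveness (L/kWh)
cue = grid_emissions_kg_co2 / it_energy_kwh         # carbon usage effectiveness (kgCO2/kWh)

print(f"PUE: {pue:.2f}")
print(f"WUE: {wue:.2f} L/kWh of IT energy")
print(f"CUE: {cue:.2f} kgCO2/kWh of IT energy")
```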
Supplier landscape resets
In the past decade, many DCIM suppliers have reset, adapted and modernized their technology to meet customer demand. Many have now introduced mobile and browser-based offerings, colocation customer portals and better metrics tracking, data analytics, cloud and software as a service (SaaS).
Customers are also demanding more vendor-agnostic DCIM software. Operators have sometimes struggled with DCIM's inability to work with existing building management systems from other vendors, which then requires additional costly work on application programming interfaces and integration. Some operators have noted that DCIM software from one specific vendor still only provides out-of-the-box monitoring for that vendor's own brand of equipment. These concerns have influenced (and continue to influence) customer buying decisions.
Adaptation has been difficult for some of the largest DCIM suppliers, and some organizations have now exited the market. Vertiv, for example, is one of the largest data center equipment vendors, and its discontinuation of Trellis in 2021 was a significant exit: customers found Trellis too large and complex for most implementations. Even today, operators continue to migrate off Trellis onto other DCIM systems.
Other structural changes in the DCIM market include Carrier and Schneider Electric acquiring Nlyte and Aveva, respectively, and Sunbird spinning out from hardware vendor Raritan (Legrand) (see Table 1).
Table 1. A selection of current DCIM suppliers
There is a growing number of independent software vendors offering DCIM, each with different specialisms. For example, Hyperview is solely cloud-based, RiT Tech focuses on universal data integration, while Device42 specializes in IT asset discovery. Independent vendors benefit those unwilling to acquire DCIM software and data center equipment from the same supplier.
Those DCIM software businesses that have been acquired by equipment vendors are typically kept at arm’s length. Schneider and Carrier both retain the Aveva and Nlyte brands and culture to preserve their differentiation and independence.
There are many products in the data center management area that are sometimes — in Uptime's view — labeled incorrectly as DCIM. These include products that offer discrete or adjacent DCIM capabilities, such as: Vertiv Environet Alert (facility monitoring); IBM Maximo (asset management); AMI Data Center Manager (server monitoring); Vigilent (AI-based cooling monitoring and control); and EkkoSense (digital twin-based cooling optimization). Uptime views these as part of the wider DCM-C control category, which will be discussed in a future report in this series.
Attendees at Uptime network member events between 2013 and 2020 may recall that complaints about DCIM products, implementation, integration and pricing were a regular feature. Much of the early software was market driven, fragile and suffered from performance issues, but DCIM software has undoubtedly improved from where it was a decade ago.
The next sections of this report discuss areas in which DCIM software has improved and where there is still room for improvement.
Modern development techniques
Modern development techniques, such as continuous integration / continuous delivery (CI/CD) and agile / DevOps, have encouraged a regular cadence of new releases and updates. Containerized applications have introduced modular DCIM, while SaaS has provided greater pricing and delivery flexibility.
Modularity
DCIM is no longer a monolithic software package. Previously, it was purchased as a core bundle, but now DCIM is more modular, with add-ons that can be purchased as required. This may make DCIM more cost-effective, with operators being able to more accurately assess the return on investment before committing to further spending. Ten years ago, the main customers for DCIM were enterprises, with control over IT but limited budgets. Now, DCIM customers are more likely to be colocation providers, which have more specific requirements and little interest in the IT, and which probably require more modular, targeted solutions with links into their own commercial systems.
SaaS
SaaS offers subscription-based pricing for greater flexibility and visibility of costs. This is different from traditional DCIM license and support pricing, which typically locked customers in for minimum-term contracts. Since SaaS is subscription-based, there is more onus on the supplier to respond to customer requests in a timely manner. While some DCIM vendors offer cloud-hosted versions of their products, most operators still opt for on-premises DCIM deployments due to perceived data and security concerns.
IT and software integrations
Historically, DCIM suffered from configurability, responsiveness and integration issues. In recent years, more effort has been made toward third-party software and IT integration and encouraging better data sharing between systems. Much DCIM software now uses application programming interfaces (APIs) and industry standard protocols to achieve this:
Application programming interfaces
APIs have made it easier for DCIM to connect with third-party software, such as IT service management, IT operations management, and monitoring and observability tools, which are often used in other parts of the organization. The aim for operators is to achieve a comprehensive view across the IT and facilities landscape, and to help orchestrate requests that come in and out of the data center. Some DCIM systems, for example, come with pre-built integrations for tools, such as ServiceNow and Salesforce, that are widely used by enterprise IT teams. These enterprise tools can provide a richer set of functionalities in IT and customer management and support. They also use robotic process automation technology to automate repetitive manual tasks, such as rekeying data between systems, updating records and automating responses.
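As an illustration of this integration pattern, the sketch below pulls fault-flagged assets from a DCIM API and raises incidents in an IT service management tool over REST. The endpoint paths, field names and payloads are hypothetical and do not correspond to any specific DCIM or ITSM product.

```python
# Hypothetical sketch of DCIM-to-ITSM integration over REST APIs.
# Endpoints, fields and payloads are assumptions for illustration only.

import requests

DCIM_API = "https://dcim.example.com/api/v1"
ITSM_API = "https://itsm.example.com/api/v1"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

def raise_tickets_for_faulty_assets():
    """Pull assets flagged as faulty from DCIM and open an incident in the ITSM tool."""
    resp = requests.get(f"{DCIM_API}/assets?status=fault", headers=HEADERS, timeout=10)
    resp.raise_for_status()
    for asset in resp.json().get("assets", []):
        ticket = {
            "short_description": f"DCIM fault alert: {asset['name']}",
            "description": f"Rack {asset['rack']}, alert: {asset['alert']}",
            "urgency": "high",
        }
        requests.post(f"{ITSM_API}/incidents", json=ticket, headers=HEADERS, timeout=10)

if __name__ == "__main__":
    raise_tickets_for_faulty_assets()
```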
IT/OT protocols
Support for a growing number of IT/OT protocols has made it easier for DCIM to connect with a broader range of IT/OT systems. This helps operators to access the data needed to meet new sustainability requirements. For example, support for the Simple Network Management Protocol (SNMP) can provide DCIM with network performance data that can be used to monitor and detect connection faults. Meanwhile, support for the Intelligent Platform Management Interface (IPMI) can enable remote monitoring of servers.
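The sketch below illustrates an SNMP poll of the kind a DCIM system performs, assuming the net-snmp command-line tools are installed and the target device permits SNMPv2c reads. The target address and community string are placeholders; the object identifiers are standard MIB-II values.

```python
# Sketch of polling device data over SNMP using the net-snmp command-line tools.
# Assumes snmpget is installed and the target allows SNMPv2c reads.
# The host address and community string are placeholders.

import subprocess

TARGET = "192.0.2.10"      # placeholder management IP of a PDU, UPS or switch
COMMUNITY = "public"       # placeholder read-only community string

# Standard MIB-II object identifiers
OIDS = {
    "sysUpTime": "1.3.6.1.2.1.1.3.0",
    "ifInOctets.1": "1.3.6.1.2.1.2.2.1.10.1",
}

def snmp_get(oid: str) -> str:
    """Run an SNMPv2c GET and return the raw response line."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, TARGET, oid],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

for name, oid in OIDS.items():
    print(name, "->", snmp_get(oid))
```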
User experience has improved
DCIM provides a better user experience than a decade ago. However, operators still need to be vigilant that a sophisticated front end is not a substitute for functionality.
Visualization
Visualization for monitoring and planning has seen significant progress, with interactive 3D and augmented reality views of IT equipment, racks and data halls. Sensor data is being used, for example, to identify available capacity, hot spots or areas experiencing overcooling. This information is presented visually to the user, who can follow changes over time and drag and drop assets into new configurations. On the fringes of DCIM, computational fluid dynamics can visualize air flows within the facility, which can then be used to make assumptions about the impact of specific changes on the environment. Meanwhile, the increasing adoption of computer-aided design can enable operators to render accurate and dynamic digital twin simulations for data center design and engineering and, ultimately, the management of assets across their life cycle.
Better workflow automation
At a process level, some DCIM suites offer workflow management modules to help managers initiate, manage and track service requests and changes. Drag and drop workflows can help managers optimize existing processes. This has the potential to reduce data entry omissions and errors, which have always been among the main barriers to successful DCIM deployments.
Rising demand for data and AI
Growing demand for more detailed data center metrics and insights related to performance, efficiency and regulations will make DCIM data more valuable. This, however, depends on how well DCIM software can capture, store and retrieve reliable data across the facility.
Customers today require greater levels of analytical intelligence from their DCIM. Greater use of AI and machine learning (ML) could enable the software to spot patterns and anomalies, and provide next-best-action recommendations. DCIM has not fared well in this area, which has opened the door to a new generation of AI-enabled optimization tools. The Uptime report What is the role of AI in digital infrastructure management? identifies three near-term applications of ML in the data center — predictive analytics, equipment setting optimization and anomaly detection.
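As a simple illustration of anomaly detection, the sketch below flags sensor readings that deviate sharply from their recent baseline using a z-score test. The readings and threshold are synthetic; production tools use far more sophisticated models.

```python
# Minimal sketch of the kind of anomaly detection a DCIM analytics module might apply
# to sensor telemetry: flag readings that deviate strongly from the recent baseline.
# The readings are synthetic.

from statistics import mean, stdev

supply_air_temp_c = [22.1, 22.0, 22.3, 21.9, 22.2, 22.1, 22.4, 22.0, 25.6, 22.2]

def flag_anomalies(readings, z_threshold=3.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu, sigma = mean(readings), stdev(readings)
    return [
        (i, x) for i, x in enumerate(readings)
        if sigma > 0 and abs(x - mu) / sigma > z_threshold
    ]

for index, value in flag_anomalies(supply_air_temp_c, z_threshold=2.5):
    print(f"Anomalous supply air temperature at sample {index}: {value}°C")
```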
DCIM is making progress in sustainability data monitoring and reporting and a number of DCIM suppliers are now actively developing sustainability modules and dashboards. One supplier, for example, is developing a Scope 1, 2 and 3 greenhouse gas emissions model based on a range of datasets, such as server product performance sheets, component catalogs and the international Environmental Product Declaration (EPD) database. Several suppliers are working on dashboards that bring together all the data required for compliance with the EU’s Energy Efficiency Directive. Once functional, these dashboards could compare data centers, devices and manufacturers, as well as provide progress reports.
The Uptime Intelligence View
DCIM has matured as a software solution over the past decade. Function modularity, SaaS delivery, remote working, integration, user experience and data analytics have all progressed to the point where DCIM is now considered a viable and worthwhile investment. DCIM data will also be increasingly valuable for regulatory reporting requirements. Nonetheless, there remains more work to be done. Customers still have legitimate concerns about its complexity, cost and accuracy, while DCIM's ability to apply AI and analytics — although an area of great promise — is still viewed cautiously. Even when commercial DCIM packages were less robust and functional, those operators that researched it diligently and deployed it carefully found it to be largely effective. This remains true today.
DCIM past and present: what's changed? John O'Brien, Senior Research Analyst, Uptime Institute. Published November 13, 2024.
Direct liquid cooling (DLC), including cold plate and immersion systems, is becoming more common in data centers — but so far this transition has been gradual and unevenly distributed with some data centers using it widely, others not at all. The use of DLC in 2024 still accounts for only a small minority of the world’s IT servers, according to the Uptime Institute Cooling Systems Survey 2024. The adoption of DLC remains slow in general-purpose business IT, and the most substantial deployments currently concentrate on high-performance computing applications, such as academic research, engineering, AI model development and cryptocurrency.
This year’s cooling systems survey results continue a trend: of those operators that use DLC in some form, the greatest number of operators deploy water cold plate systems, with other DLC types trailing significantly. Multiple DLC types will grow in adoption over the next few years, and many installations will be in hybrid cooling environments where they share space and infrastructure with air-cooling equipment.
Within this crowded field, water cold plates’ lead is not overwhelming. Water cold plate systems retain advantages that explain this result: ease of integration into shared facility infrastructure, a well-understood coolant chemistry, greater choice in IT and cooling equipment, and less disruption to IT hardware procurement and warranty compared with other current forms of DLC. Many of these advantages are down to its long-established history spanning decades.
This year’s cooling systems survey provides additional insights into the DLC techniques operators are currently considering. Of those data center operators using DLC, many more (64%) have deployed water cold plates than the next-highest-ranking types: dielectric-cooled cold plates (30%) and single-phase immersion systems (26%) (see Figure 1).
Figure 1. Operators currently using DLC prefer water-cooled cold plates
At present, most users say they use DLC for a small portion of their IT — typically for their densest, most difficult equipment to cool (see DLC momentum rises, but operators remain cautious). These small deployments favor hybrid approaches, rather than potentially expensive dedicated heat rejection infrastructure.
Hybrid cooling predominates — for now
Many DLC installations are in hybrid (mixed) setups in which DLC equipment sits alongside air cooling equipment in the data hall, sharing both heat transport and heat rejection infrastructure. This approach can compromise DLC’s energy efficiency advantages (see DLC will not come to the rescue of data center sustainability), but for operators with only small DLC deployments, it can be the only viable option. Indeed, when operators named the factors that make a DLC system viable, the greatest number (52%, n=392) chose ease of retrofitting DLC into their existing infrastructure.
For those operators who primarily serve mainstream business workloads, IT is rarely dense and powerful enough to require liquid cooling. Nearly half of operators (48%, n=94) only use DLC on less than 10% of their IT racks — and only one in three (33%, n=54) have heat transport and rejection equipment dedicated to their liquid-cooled IT. At this early stage of DLC growth, economics and operational risks dictate that operators prefer cooling technologies that integrate more readily into their existing space. Water cold plate systems can meet this need, despite potential drawbacks.
Cold plates are good neighbors, but not perfect
Water-cooled servers typically fit into standard racks, which simplifies deployment — especially when these servers coexist with air-cooled IT. Existing racks can be reused, either fully or partially loaded with water-cooled servers, and installing new racks is also straightforward.
IT suppliers prefer to sell the DLC solution integrated with their own hardware, ranging from a single server chassis to rows of racks including cooling distribution units (CDUs). Today, this approach typically favors a cold plate system, so that operators and IT teams have the broadest selection of equipment and compatible IT hardware with vendor warranty coverage.
The use of water in data center cooling has a long history. In the early years of mainframes, water was used partly for its advantageous thermal properties compared with air cooling, but also because of the need to remove heat effectively from the relatively small rooms that computers shared with staff.
Today, water cold plates are used extensively in supercomputing, handling extremely dense cabinets. Operators benefit from water’s low cost and ready availability, and many are already skilled in maintaining its chemistry (even though quality requirements for the water coolant are more stringent for cold plates compared with a facility loop).
The risk (and, in some cases, the vivid memories) of water leakage onto electrified IT components is one reason some operators are hesitant to embrace this technology, but leaks are statistically rare and there are established best practices for mitigation. However, with positive-pressure cold plate loops, which are the type most deployed by operators, the risk is never zero. The possibility of water damage is perhaps the single key weakness of water cold plates driving interest in alternative dielectric DLC techniques.
In terms of thermal performance, water is not without competition. Two-phase dielectric coolants show strong performance by taking advantage of the added cooling effect from vaporization. Vendors offer this technology in the form of both immersion tanks and cold plates, with the latter edging ahead in popularity because it requires less change to products and data center operations. The downside of all engineered coolants is the added cost, as well as the environmental concerns around manufacturing and leaks.
Some in the data center industry predict two-phase cooling products will mature to capitalize on their performance potential and eventually become a major form of cooling in the world of IT. Uptime’s survey data suggests that water cold plate systems currently have a balance of benefits and risks that make practical and economic sense for a greater number of operators. But the sudden pressure on cooling and other facility infrastructure brought about by specialized hardware for generative AI will likely create new opportunities for a wider range of DLC techniques.
Outlook
Uptime's surveys of data center operators are a useful indicator of how operators are meeting their cooling needs, among other requirements. The data thus far suggests a gradual DLC rollout, with water cold plates holding a stable (if not overwhelming) lead. Uptime's interviews with vendors and operators consistently paint a picture of widespread hybrid cooling environments, which incentivize cooling designs that are flexible and interoperable.
Many water cold plate systems on the market are well suited to these conditions. Looking five years ahead, densified IT for generative AI and other intensive workloads is likely to influence data center business priorities and designs more widely. DLC adoption and operator preferences for specific technology types are likely to shift in response. Pure thermal performance is key but not the sole factor. The success of any DLC technique will rely on overcoming the barriers to its adoption, availability from trusted suppliers and support for a wide range of IT configurations from multiple hardware manufacturers.
Water cold plates lead in the small, but growing, world of DLC. Jacqueline Davis, Research Analyst, Uptime Institute. Published October 30, 2024.
Cyberattacks on operational technology (OT) were virtually unknown until five years ago, but their volume has been roughly doubling each year since then. This threat is distinct from the IT-focused vulnerabilities that cybersecurity measures regularly address. The risk associated with OT compromise is substantial: power or cooling failures can cripple an entire facility, potentially for weeks or months.
Many organizations believe they have air-gapped (isolated from IT networks) OT systems, protecting them against external threats. The term is often used incorrectly, however, which means exploits are harder to anticipate. Data center managers need to understand the nature of the threat and their defense options to protect their critical environments.
Physical OT consequences of rising cyberattacks
OT systems are used in most critical environments. They automate power generation, water and wastewater treatment, pipeline operations and other industrial processes. Unlike IT systems, which are inter-networked by design, these systems operate autonomously and are dedicated to the environment where they are deployed.
OT is essential to data center service delivery: this includes the technologies that control power and cooling and manage generators, uninterruptible power supplies (UPS) and other environmental systems.
Traditionally, OT has lacked robust native security, relying on air-gapping for defense. However, the integration of IT and OT has eroded OT’s segregation from the broader corporate attack surface, and threat sources are increasingly targeting OT as a vulnerable environment.
Figure 1. Publicly reported OT cyberattacks with physical consequences
This trend should concern data center professionals. OT power and cooling technologies are essential to data center operations. Most organizations can restore a compromised IT system effectively, but few (if any) can recover from a major OT outage. Unlike IT systems, OT equipment cannot be restored from backup.
Air-gapping is not air-tight
Many operators believe their OT systems are protected by air-gaps: systems that are entirely isolated from external networks. True air-gaps, however, are rare — and have been for decades. Suppliers and customers have long recognized that OT data supports high-value applications, particularly remote monitoring and predictive maintenance. In large-scale industrial applications — and data centers — predictive maintenance can drive better resource utilization, uptime and reliability. Sensor data enables firms to, for example, diagnose potential issues, automate replacement part orders and schedule technicians, all of which will help to increase equipment life spans and reduce maintenance costs and availability concerns.
In data centers, these sensing capabilities are often “baked in” to warranty and maintenance agreements. But the data required to drive IT predictive maintenance systems comes from OT systems, and this means the data needs a route from the OT equipment to an IT network. In most cases, the data route is designed to work in one direction, from the OT environment to the IT application.
Indeed, the Uptime Institute Data Center Security Survey 2023 found that nearly half of operators using six critical OT systems (including UPS, generators, fire systems, electrical management systems, cooling control and physical security / access control) have enabled remote monitoring. Only 12% of this group, however, have enabled remote control of these systems, which requires a path leading from IT back to the OT environment.
Remote control has operational benefits but increases security risk. Paths from OT to IT enable beneficial OT data use, but also open the possibility of an intruder following the same route back to attack vulnerable OT environments. A bi-directional route exposes OT to IT traffic (and, potentially, to IT attackers) by design. A true air-gap (an OT environment that is not connected in any way to IT applications) is better protected than either of these alternatives, but will not support IT applications that require OT data.
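The sketch below illustrates the one-way pattern in software: an OT-side sender publishes telemetry outbound and never listens for a reply, so it offers no inbound path. A true unidirectional gateway enforces this property in hardware (a data diode); this software-only example is conceptual, and the addresses are placeholders.

```python
# Conceptual sketch of a one-way (OT-to-IT) telemetry flow: the OT-side sender pushes
# readings out over UDP and never listens for a reply, so no path is offered back into
# the OT network. Hardware unidirectional gateways enforce this; this sketch only
# illustrates the direction of the data flow. Addresses are placeholders.

import json
import socket
import time

IT_COLLECTOR = ("198.51.100.20", 5140)   # placeholder IT-side historian/collector

def publish_reading(sensor_id: str, value: float) -> None:
    """Send a single reading outbound; no response is expected or read."""
    payload = json.dumps({"sensor": sensor_id, "value": value, "ts": time.time()})
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), IT_COLLECTOR)

publish_reading("ups-01/battery_temp_c", 27.4)
```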
Defense in depth: a possible solution?
Many organizations use the Purdue model as a framework to protect OT equipment. The Purdue model divides the overall operating environment into six layers (see Table 1). Physical (mechanical) OT equipment is level 0. Level 1 refers to instrumentation directly connected to level 0 equipment. Level 2 systems manage the level 1 devices. IT/OT integration is a primary function at level 3, while levels 4 and 5 refer to IT networks, including IT systems such as enterprise resource planning systems that drive predictive maintenance. Cybersecurity typically focuses on the upper levels of the model; the lower-level OT environments typically lack security features found in IT systems.
Table 1. The Purdue model as applied to data center environments
Lateral and vertical movement challenges
When organizations report that OT equipment is air-gapped, they usually mean that the firewalls between the different layers are configured to only permit communication between adjacent layers. This prevents an intruder from moving from the upper (IT) layers to the lower (OT) levels. However, data needs to move vertically from the physical equipment to the IT applications across layers — otherwise the IT applications would not receive the required input from OT. If there are vertical paths through the model, an adversary would be able to “pivot” an attack from one layer to the next.
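The sketch below expresses that "adjacent layers only" policy as a simple check, using the level numbering in Table 1. It is a deliberate simplification of real firewall rule sets, intended only to show why data still has a multi-hop vertical path through the model.

```python
# Sketch of the "adjacent layers only" firewall policy often described as an air-gap:
# traffic is permitted only between the same or neighboring Purdue levels, so a path
# from the IT layers (4-5) to the OT layers (0-2) requires multiple hops.
# The rule is a simplification for illustration.

def adjacent_levels_only(src_level: int, dst_level: int) -> bool:
    """Allow communication only between the same or directly adjacent Purdue levels."""
    return abs(src_level - dst_level) <= 1

flows = [
    (4, 3),  # IT network to IT/OT integration layer: allowed
    (3, 2),  # integration layer to OT supervisory systems: allowed
    (4, 1),  # IT network straight to OT instrumentation: blocked
]

for src, dst in flows:
    verdict = "permit" if adjacent_levels_only(src, dst) else "deny"
    print(f"Level {src} -> level {dst}: {verdict}")
```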
This is not the type of threat that most IT security organizations expect. Enterprise cyber strategies look for ways to reduce the impact of a breach by limiting lateral movement (an adversary’s ability to move from one IT system to another, such as from a compromised endpoint device to a server or application) across the corporate network. The enterprise searches for, responds to, and remediates compromised systems and networks.
OT networks also employ the principles of detecting, responding to and recovering from attacks. The priority of OT, however, is to prevent vertical movement and to avoid threats that can penetrate the IT/OT divide.
Key defense considerations
Data center managers should establish multiple layers of data center OT defense. The key principles include:
Defending the conduits. Attacks propagate from the IT environment through the Purdue model via connections between levels. Programmable firewalls are points of potential failure. Critical facilities use network engineering solutions, such as unidirectional networks and/or gateways, to prevent vertical movement.
Maintaining cyber, mechanical and physical protection. A data center OT security strategy combines “detect, respond, recover” capabilities, mechanical checks (e.g., governors that limit AC compressor speed or that respond to temperature or vibration), and physical security vigilance.
Preventing OT compromise. Air-gapping is used to describe approaches that may not be effective in preventing OT compromises. Data center managers need to ensure that defensive measures will protect data center OT systems from outages that could cripple an entire facility.
Weighing up the true cost of IT/OT integration. Business cases for applications that rely on OT data generally anticipate reduced maintenance downtime. But the cost side of the ledger needs to extend to expenses associated with protecting OT.
The Uptime Intelligence View
Attacks on OT systems are increasing, and defense strategies are often based on faulty assumptions and informed by strategies that reflect IT rather than OT requirements. A successful attack on data center OT could be catastrophic, resulting in long-term (potentially months long) facility outages.
Many OT security measures stem from the belief that OT is protected by air-gapping, but this description may not be accurate and the protection may not be adequate in the face of escalating OT attacks. Data center managers should consider OT security strategies aimed at threat prevention. These should combine the detect, respond, recover cyber principles and engineering-grade protection for OT equipment.
It is also crucial to integrate vigilant physical security to protect against unauthorized access to vulnerable OT environments. Prevention is the most critical line of defense.
OT protection: is air-gapping the answer? Michael O'Neil, Consulting Analyst - Cybersecurity, Uptime Institute. Published October 21, 2024.
An earlier Uptime Intelligence report discussed the characteristics of processor power management (known as C-states) and explained how they can reduce server energy consumption to make substantial contributions to the overall energy performance and sustainability of data center infrastructure (see Understanding how server power management works). During periods of low activity, such features can potentially lower the server power requirements by more than 20% in return for prolonging the time it takes to respond to requests.
But there is more to managing server power than just conserving energy when the machine is not busy — setting processor performance levels that are appropriate for the application is another way to optimize energy performance. This is the crux of the issue: there is often a mismatch between the performance delivered and the performance required for a good quality of service (QoS).
When the performance is too low, the consequences are often clear: employees lose productivity, customers leave. But when application performance exceeds needs, the cost remains hidden: excessive power use.
Server power management: enter P-states
Uptime Intelligence survey data indicates that power management remains an underused feature — most servers do not have it enabled (see Tools to watch and improve power use by IT are underused). The extra power use may appear small at first, amounting to only tens of watts per server. But when scaled to larger facilities or to the global data center footprint, these watts add up to a huge waste of power and money.
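As a rough illustration of how these watts accumulate, the short calculation below can be adapted to any fleet; the per-server overhead, fleet size, PUE and tariff are assumptions made for the arithmetic, not survey figures:

# Illustrative only: assumed values, not Uptime survey data.
idle_overhead_watts = 30          # extra draw per server with power management disabled
servers = 10_000                  # a mid-sized fleet
pue = 1.5                         # facility overhead multiplier
price_per_kwh = 0.12              # assumed tariff, USD

extra_kw = idle_overhead_watts * servers / 1000 * pue
extra_kwh_per_year = extra_kw * 8760
print(f"{extra_kw:.0f} kW of extra load, "
      f"{extra_kwh_per_year:,.0f} kWh and ${extra_kwh_per_year * price_per_kwh:,.0f} per year")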
The potential to improve the energy performance of data center infrastructure is material, but the variables involved in adopting server power management mean it is not a trivial task. Modern chip design is what creates this potential. All server processors in operation today are equipped with mechanisms to change their clock frequency and supply voltage in preordained pairs of steps (called dynamic voltage-frequency scaling). Initially, these techniques were devised to lower energy use in laptops and other low-power systems when running code that does not fully utilize resources. Known as P-states, these are in addition to C-states (low-power modes during idle time).
Later, mechanisms were added to do the opposite: increase clock speeds and voltages beyond nominal rates as long as the processor stays within hard limits for power, temperature and frequency. The effect of this approach, known as turbo mode, has gradually become more pronounced with ever-higher core counts, particularly in servers (see Cooling to play a more active role in IT performance and efficiency). As processors dynamically reallocate the power budget from lowly utilized or idling cores to highly stressed ones, clock speeds can well exceed nominal ratings — often close to doubling. In recent CPUs, even the power budget can be calibrated higher than the factory default.
As a result, server-processor behavior has become increasingly opportunistic in the past decade. When allowed, processors will dynamically seek out the electrical configuration that yields maximum performance if the software (signaled by the operating system, detected by the hardware mechanisms, or both) requests it. Such behavior is generally great for performance, particularly in a highly mixed application environment where certain software benefits from running on many cores in parallel while others prefer fewer but faster ones.
The unquantified costs of high performance
Ensuring top server performance comes at the cost of using more energy. For performance-critical applications such as technical computing, financial transactions, high-speed analytics and real-time operating systems, the use and cost of energy is often not a concern.
But for a large array of workloads, this will result in a considerable amount of energy waste. There are two main components to this waste. First, the energy consumption curve for semiconductors gets steeper the closer the chip pushes to the top of its performance envelope because both dynamic (switching) and static (leakage) power increase exponentially. All the while, the performance gains diminish because the rest of the system, including the memory, storage and network subsystems, will be unable to keep up with the processor’s race pace. This increases the amount of time that the processor needs to wait for data or instructions.
Second, energy waste originates from a mismatch between performance and QoS. Select applications and systems, such as transaction processing and storage servers, tend to have defined QoS policies for performance (e.g., responding to 99% of queries within a second). QoS is typically about setting a floor below which performance should not drop — it is rarely about ensuring systems do not overperform, for example, by processing transactions or responding to queries unnecessarily fast.
If a second for a database query is still within tolerance, there is, by definition, limited value in a response arriving in under one-tenth of a second just because the server can process the query that fast when the load is light. And yet, it happens all the time. For many, if not most, workloads, this level of overperformance is neither defined nor tracked, which invites an exploration of what level of QoS is actually acceptable.
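One practical way to make the performance floor explicit is to track a high percentile of response times against the QoS target rather than the average. The sketch below is purely illustrative; the one-second target and the sample latencies are assumptions:

# Minimal sketch: check a p99 latency target instead of average response time.
# The target and the sample data are illustrative assumptions.
import statistics

qos_p99_target_s = 1.0            # "99% of queries within a second"

def p99(latencies):
    # 99th percentile: the last of the 99 cut points from quantiles(n=100)
    return statistics.quantiles(latencies, n=100)[98]

sample_latencies_s = [0.08, 0.11, 0.09, 0.35, 0.12, 0.95, 0.10, 0.22, 0.07, 0.81]
observed = p99(sample_latencies_s)

if observed <= qos_p99_target_s:
    print(f"p99 = {observed:.2f}s meets the {qos_p99_target_s:.1f}s target; "
          "there is headroom to run at a lower P-state")
else:
    print(f"p99 = {observed:.2f}s breaches the target; raise the performance floor")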
Governing P-states for energy efficiency
At its core, the governance of P-states is like managing idle power through C-states, except with many more options, which adds complexity through choice. This report does not discuss the number of P-states because this would be highly dependent on the processor used. Similarly to C-states, a higher number denotes a higher potential energy saving; for example, P2 consumes less power than P1. P0 is the highest-performance state a processor can select.
No P-state control. This option tends to result in aggressive processor behavior that pushes for the maximum speeds available (electronically and thermally) for any and all of its cores. While this will result in the most energy consumed, it is preferable for high-performance applications, particularly latency-sensitive applications where every microsecond counts. If this level of performance is not justified by the workload, it can be an exceedingly wasteful control mode.
Hardware control. Also called the autonomous mode, this leaves P-state management for the processor to decide based on detected activity. While this mode allows for very fast transitions between states, it lacks the runtime information gathered by the operating system; hence, it will likely result in only marginal energy savings. On the other hand, this approach is agnostic of the operating system or hypervisor. The expected savings compared with no P-state control are up to around 10%, depending on the load and server configuration.
Software-hardware cooperation. In this mode, the operating system provider gives the processor hints on selecting the appropriate P-states. The theory is that this enables the processor control logic to make better decisions than pure hardware control while retaining the benefit of fast transitions between states to maintain system responsiveness. Power consumption reductions here can be as high as 15% to 20% at low to moderate utilization.
Software control. In this mode, the operating system uses a governor (a control mechanism that regulates a function, in this case speed) to make the performance decisions executed by the processor if the electrical and thermal conditions (supply voltage and current, clock frequency and silicon temperature) allow it. This mode typically carries the biggest energy-saving potential when a sophisticated software governor is used. Both Windows and Linux operating systems offer predefined plans that let the system administrator prioritize performance, balance or lower energy use. The trade-off here is additional latency: whenever the processor is in a low-performance state and transitions to a higher-performance state (e.g., P0) in response to a bout of compute or interrupt activity, it takes material time. Highly latency-sensitive and bursty workloads may see substantial impact. Power reductions can be outsized across most of the system load curve. Depending on the sophistication of the operating system governor and the selected power plan, energy savings can reach between 25% and 50%.
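On Linux, for example, the operating system exposes its governor through the cpufreq interface in sysfs. The sketch below, which assumes a Linux host with a cpufreq driver loaded and root privileges for writes, shows how the current governor can be read and changed:

# Minimal sketch: inspect or change the Linux cpufreq governor per core.
# Assumes a Linux system exposing /sys/devices/system/cpu/cpu*/cpufreq
# (acpi-cpufreq, intel_pstate, etc.) and root privileges for writes.
import glob

def current_governors():
    out = {}
    for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor")):
        cpu = path.split("/")[5]          # e.g., "cpu0"
        with open(path) as f:
            out[cpu] = f.read().strip()
    return out

def set_governor(governor="powersave"):
    # Typical choices include "performance", "powersave", "schedutil", "ondemand"
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
        with open(path, "w") as f:
            f.write(governor)

print(current_governors())            # e.g., {'cpu0': 'performance', 'cpu1': 'performance', ...}
# set_governor("powersave")           # uncomment to trade peak performance for lower power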
While there are inevitable trade-offs between performance and efficiency, in all the control scenarios the impact on performance is often negligible. This is true for users of business and web applications, or for the total runtime of technical computing jobs. High-performance requirements alone do not preclude the use of P-state control: once the processor selects P0, there is no difference between a system with controls and one without.
Applications that do not tolerate dynamic P-state controls well are usually the exceptions one would already suspect: their latency-sensitive, bursty nature means the processor cannot scale voltages and frequency fast enough to match the change in performance needs, even though the transition takes only microseconds.
Arguably, for most use cases, the main concern should be power consumption, not performance. Server efficiency benchmarking data, such as that published by the Standard Performance Evaluation Corporation and The Green Grid, indicates that modern servers achieve the best energy efficiency when their performance envelope is limited (e.g., to P2) because it prevents the chip from aggressively seeking the highest clock rates across many cores. This would result in disproportionately higher power use for little in return.
Upcoming Uptime Intelligence reports will identify the software tools that data center operators can use to monitor and manage the power and performance settings of server fleets.
The Uptime Intelligence View
Server power management in its multiple forms offers data center operators easy wins in IT efficiency and opportunities to lower operational expenditure. It will be especially attractive to small- and medium-sized enterprises that run mixed workloads with low criticality, yet the effects of server power management will be much more impressive when implemented at scale and for the right workloads.
Grid growth and decarbonization: An unhappy couple
By Jay Dietrich, Research Director of Sustainability, Uptime Institute, [email protected]

The advent of AI training and inference applications, combined with the continued expansion of the digital world and electrification of the economy, raises two questions about electricity generation capacity: where will the new energy be sourced from, and how can it be decarbonized?
Groups, such as the International Energy Agency and the Electric Power Research Institute, project that data center energy demand will rise at an annual rate of 5% or more in many markets over the next decade. When combined with mandated decarbonization and electrification initiatives, electricity generation capacity will need to expand between two and three times by 2050 to meet this demand.
Many data center operators have promoted the need to add only carbon-free generation assets to increase capacity. This is going to be challenging: deployment of meaningful carbon-free generation, including energy from geothermal sources and small nuclear reactors (SMRs), and battery capacity, such as long duration energy storage (LDES), are at least five or more years away. Given the current state of technology development and deployment of manufacturing capacity, it will likely take at least 10 years before they are widely used on the electricity grid in most regions. This means that natural gas generation assets will have to be included in the grid expansion to maintain grid reliability.
Impact of wind / solar on grid reliability
To demonstrate the point, consider the following example. Figure 1 depicts the current generation capacity in Germany under five different weather and time of day scenarios (labeled as scenario A1 to A5). Table 1 provides the real-time capacity factors for the wind, solar and battery assets used for these scenarios. German energy demand in summer is 65 GW (not shown) and 78 GW (shown by the dotted line) in winter. The blue line is the total available generation capacity.
Figure 1. Grid generation capacity under different weather and time of day scenarios
Table 1. Generation capacity factors for scenarios A and B
Scenario A details the total available generation capacity of the German electricity grid in 2023. The grid has sufficient dispatchable generation to maintain grid reliability. It also has enough grid interconnects to import and export electricity production to address the over- or under-production due to the variable output of the wind and solar generation assets. An example of energy importation is the 5 GW of nuclear generation capacity, which comes from France.
Scenario A1 depicts the available generation capacity based on the average capacity factors of the different generation types. Fossil fuel and nuclear units typically have a 95% or greater capacity factor because they are only offline for minimal periods during scheduled maintenance. Wind and solar assets have much lower average capacity factors (see Table 1) due to the vagaries of the weather. Lithium-ion batteries have limited capacity because they discharge for two to four hours and a typical charge / discharge cycle is one day. As a result, the average available capacity for Germany is only half of the installed capacity.
Grid stress
The average capacity only tells half the story because the output from wind, solar and battery energy sources vary between zero and maximum capacity based on weather conditions and battery charge / discharge cycles.
In scenarios A4 and A5, low wind and low or nonexistent solar output push the available generation capacity close to the electricity demand. If several fossil fuel assets are offline and/or imports are limited, the grid authority will have to rely on demand management capacity, battery capacity and available imports to keep the grid balanced and avoid rolling blackouts (one- to two-hour shutdowns of electricity in one or more grid sub-regions).
Demand management
The scenario described above is not peculiar to Germany. Wind and solar generation assets represent the bulk of new capacity to meet new data center and electrification demand, and to replace retiring fossil fuel and nuclear assets, because they are (1) mandated by many state, province and national governments and (2) the only economically viable form of carbon-free electricity generation. Unfortunately, their variable output leaves significant supply gaps, lasting from hours to days to entire seasons, that cannot currently be filled with carbon-free generation assets. These gaps will have to be covered by a combination of new and existing fossil fuel assets, existing nuclear assets (not new plants, because new conventional or small modular nuclear reactors are eight years or more in the future) and demand management.
As installed solar and wind generation capacity grows to a greater percentage of supply capacity, data center operators will play a major role in demand management strategies. Several strategies are available to utilities and data center operators, but each has its drawbacks.
Backup generation systems
Significantly, emergency generator sets will need to be used to take data centers off the grid to address generation capacity shortfalls. This strategy is already being deployed in the US (California), Ireland and other data center markets. In conversations with operators, several reported that their grid authority asked them to run their emergency generators to relieve grid demand and maintain stability during the summer of 2023.
As new data centers are built and deployed, operators in stressed grid regions (i.e., those with a high percentage of capacity delivered by wind and solar assets) should plan and permit their emergency generators to be operated for 25% or more of the year to support grid reliability in the face of variable wind and solar generation asset production.
Workload reduction
Demand management can also be achieved by reducing data center energy consumption. Google has reported the development of protocols to shut down non-critical workloads, such as controllable batch jobs, or shift a workload to one or more data centers in other grid regions. The reports did not provide the volume of workloads moved or the demand reductions, which suggests that they were not significant. These tools are likely only used on workloads controlled by Google, such as search workloads or work on development servers. They are unlikely to be deployed on client cloud workloads because many IT operators are uncomfortable with unnecessarily stressing their operations and processes.
An example of an IT enterprise operation that could support demand management is a financial organization running twin data centers in two different grid regions. When grid stability is threatened, it could execute its emergency processes to move all workloads to a single data center. In addition to receiving a payment for reducing demand, this would be an opportunity to test and validate its emergency workload transfer processes. While there is a strong logic to justify this choice, IT managers will likely be hesitant to agree to this approach.
Outages and reliability problems are more likely to emerge when operational changes are being made, and demand management payments from the grid authority or energy retailer will not compensate for the risk of penalties under service level agreements. The use of backup generators will likely be the preferred response, although problems, such as failures to start or synchronize generators or transfer switch failures, can occur when switching to generators.
New solar and wind capacity needed
The energy demands of new data centers and the broader electrification of the economy will require the installation of new electricity generation capacity in grid regions around the world. Large colocation and cloud service providers have been particularly vocal that these generation assets should be carbon-free and not involve new fossil fuel generation assets. An analysis of the impact of increasing the wind, solar and battery generation capacity on the German grid by 20% to support a 15% increase in demand reveals the inherent dangers of this position.
Figure 2 details the available grid generation capacity under the five available capacity scenarios (see Table 1) when the wind, solar and battery capacities are increased by 20%. This increase in generating capacity is assumed to support a 15% rise in energy demand, which increases winter demand to 90 GW.
Figure 2. Impact of a 20% increase in wind and solar generation
The 20% generating capacity increase delivers sufficient available capacity for periods of moderate to high wind and solar capacity (scenarios B2 and B3). There is a sufficient capacity reserve (about 10% to 15% of demand) to provide the needed generation capacity if some fossil fuel-based generators are offline or the imports of nuclear-generated electricity are not available.
However, the capacity increase does not significantly increase the available capacity during periods of low solar and wind output, putting grid stability at risk in scenarios B4 and B5. The available capacity in scenario B4 increases by only 4 GW, which is insufficient to meet the new demand or provide any capacity reserve. Under scenario B5, there is barely enough capacity to provide a sufficient reserve (about 10% to 15% of demand). In both cases, grid stability is at risk and some combination of demand management, battery capacity and imports will be required.
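The arithmetic behind these scenarios is simple to reproduce: available capacity is the sum of installed capacity multiplied by the real-time capacity factor for each generation type, compared against demand plus a reserve margin. The installed capacities and capacity factors in the sketch below are illustrative assumptions for a low-wind, low-solar evening, not the Table 1 values:

# Minimal sketch of the scenario arithmetic. All numbers are illustrative
# assumptions, not the actual Table 1 or Figure 2 data.
installed_gw = {"wind": 83, "solar": 96, "battery": 12, "fossil": 77, "nuclear_import": 5}
capacity_factor = {"wind": 0.10, "solar": 0.00, "battery": 0.50, "fossil": 0.95, "nuclear_import": 1.00}

demand_gw = 90            # winter demand after a 15% rise
reserve_fraction = 0.125  # roughly 10% to 15% of demand

available_gw = sum(installed_gw[k] * capacity_factor[k] for k in installed_gw)
required_gw = demand_gw * (1 + reserve_fraction)

print(f"Available: {available_gw:.1f} GW, required incl. reserve: {required_gw:.1f} GW, "
      f"shortfall: {max(0, required_gw - available_gw):.1f} GW")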
Until reliable, dispatchable carbon-free electricity technologies, such as SMRs, geothermal generation and LDES, are developed and deployed at scale, grid stability will depend on the presence of sufficient fossil fuel generation assets to match energy supply to demand. The deployed capacity will likely have to be 75% to 100% of projected demand to address three- to 10-day periods of low solar and wind output and expected seasonal variations in wind and solar production.
To enable growth and ensure reliable electricity service while increasing renewable energy generation capacity, data center operators will need to balance, and likely compromise on, their location selection criteria, decarbonization goals and the size of their electrical demand growth.
Managing growing power demand
Data center operators will have to reevaluate their sustainability objectives in light of the need to increase the overall grid generation capacity and the percentage of that capacity that is carbon-free generation while maintaining grid stability and reliability. Operators should consider the following options to take meaningful steps to further decarbonize the grid:
The data center industry can maintain its sustainability focus despite the expected growth in power demand. To do so, IT operators need to refocus on continually increasing the work delivered per unit of energy consumed by the IT infrastructure and the megawatt-hours of carbon-free energy consumed by data center operations. With available technologies, much can be done while developing the technologies needed to decarbonize the last 10% to 20% of the electricity grid.
The Uptime Intelligence View
Accelerating the growth of data center capacity and energy consumption does not have to imperil the industry’s drive toward sustainability. Instead, it requires that the sector pragmatically reassess its sustainability efforts. Sustainability strategies need to first intensify their focus on increased IT infrastructure utilization and work delivered per unit of energy consumed, and then take responsible steps to decarbonize the energy consumed by data centers while supporting necessary efforts to grow generation capacity on the grid and ensure grid stability.
Building trust: working with AI-based tools
By Rose Weinschenk, Research Associate, Uptime Institute, [email protected]

Many employees approach AI-based systems in the workplace with a level of mistrust. This lack of trust can slow the implementation of new tools and systems, alienate staff and reduce productivity. Data center managers can avoid this outcome by understanding the factors that drive mistrust in AI and devising a strategy to minimize them.
Perceived interpersonal trust is a key productivity driver for humans but is rarely discussed in a data center context. Researchers at the University of Cambridge in the UK have found that interpersonal trust and organizational trust have a strong correlation with staff productivity. In terms of resource allocation, lack of trust requires employees to invest time and effort organizing fail-safes to circumvent perceived risks. This takes attention away from the task at hand and results in less output.
In the data center industry, trust in AI-based decision-making has declined significantly in the past three years. In Uptime Institute’s 2024 global survey of data center managers, 42% of operators said they would not trust an adequately trained AI system to make operational decisions in the data center, which is up 18 percentage points since 2022 (Figure 1). If this decline in trust continues, it will be harder to introduce AI-based tools.
Figure 1. More operators distrust AI in 2024 than in previous years
Managers who wish to unlock the productivity gains associated with AI may need to create specific conditions to build perceived trust between employees and AI-based tools.
Balancing trust and cognitive loads
The trust-building cycle requires a level of uncertainty. In the Mayer, Davis and Schoorman trust model, this uncertainty occurs when an individual is presented with the option to transfer decision-making autonomy to another party, which, in the data center, might be an AI-based control system (see An integrative model of organizational trust). Individuals evaluate perceived characteristics of the other party against risk to determine whether they can relinquish decision-making control. If this leads to desirable outcomes, individuals gain trust and perceive less risk in the future.
Trust toward AI-based systems can be encouraged by using specific deployment techniques. In Uptime Institute’s Artificial Intelligence and Software Survey 2024, almost half of the operators that have deployed AI capabilities report that predictive maintenance is driving their use of AI.
Researchers from Australia’s University of Technology Sydney and University of Sydney tested human interaction with AI-based predictive maintenance systems, with participants having to decide how to manage a situation with a burst water pipe under different levels of uncertainty and cognitive load (cognitive load being the amount of working memory resources used). For all participants, trust in the automatically generated suggestions was significantly higher under low cognitive loads. AI systems that communicated decision risk odds prevented trust from decreasing, even when cognitive load increased.
Without decision risk odds displayed, employees devoted more cognitive resources toward deciphering ambiguity, leaving less space in their working memory for problem solving. Interpretability of the output of AI-based systems drives trust: it allows users to understand the context of specific suggestions, alerts and predictions. If a user cannot understand how a predictive maintenance system came to a certain conclusion, they will lose trust. In this situation, productivity will stall as workers devote cognitive resources toward attempting to retrace the steps the system made.
Team dynamics
In some cases, staff who work with AI systems personify them and treat them as co-workers rather than tools. Mirroring human social group dynamics, in which people hold a negative bias toward those outside their own group (“outgroup” dynamics), staff may then lack trust in these AI systems.
AI systems can engender anxiety relating to job security and may trigger the fear of being replaced — although this is less of a factor in the data center industry, where staff are in short supply and not at high risk of losing their jobs. Nonetheless, researchers at the Institute of Management Sciences in Pakistan find that adoption of AI in general is linked with cognitive job insecurity, which threatens workers’ perceived trust in an organization.
Introduction of AI-based tools in a data center may also cause a loss in expert status for some senior employees, who might then view these tools as a threat to their identity.
Practical solutions
Although there are many obstacles to introducing AI-based tools into a human team, the ways to mitigate them are often intuitive and psychological, rather than technological. Data center team managers can improve trust in AI technology through the following options:
Many of the solutions described above rely on social contracts—the transactional and relational agreements between employees and an organization. US psychologist Denise Rousseau (a professor at Carnegie Mellon University, Pittsburgh PA) describes relational trust as the expectation that a company will pay back an employee’s investments through growth, benefits and job security — all factors that go beyond the rewards of a salary.
When this relational contract is broken, staff will typically shift their behavior and deprioritize long-term company outcomes in favor of short-term personal gains.
The way data center team leaders introduce AI technologies can either strengthen or break the relational contracts in their organizations. Those who consider the factors outlined above will be more successful in maintaining an effective team.
The Uptime Intelligence View
An increasing number of operators cite plans to introduce AI-based tools into their data center teams, yet surveys increasingly report a mistrust in AI. When human factors, such as trust, are well managed, AI can be an asset to any data center team. If the current downward trend in trust continues, AI systems will become harder to implement. Solutions should focus on utilizing positive staff dynamics, such as organizational trust and social contracts.
DCIM past and present: what’s changed?
By John O’Brien, Senior Research Analyst, Uptime Institute, [email protected]

Data center infrastructure management (DCIM) software has, despite some false starts, come to be regarded by many operators as essential to running modern, flexible and efficient data centers. It has had a difficult history — many suppliers have struggled to meet customer requirements and adoption remains patchy. Critics argue that, because of the complexity of data center operations, DCIM software often requires expensive customization and feature development for which many operators have neither the expertise nor the budget.
This is the first of a series of reports by Uptime Intelligence exploring data center management software in 2024 — two decades or more after the first commercial products were introduced. Data center management software is a wider category than DCIM: many products are point solutions; some extend beyond a single site and others have control functions. Uptime Intelligence is referring to this category as data center management and control (DCM-C) software.
DCIM, however, remains at the core. This report identifies the key areas in which DCIM has changed over the past decade and, in future reports, Uptime Intelligence will explore the broader DCM-C software landscape.
What is DCIM?
DCIM refers to data center infrastructure management software, which collects and manages information about a data center’s IT and facility assets, resource use and operational status, often across multiple systems and distributed environments. DCIM primarily focuses on three areas:
In the past, some operators have taken the view that DCIM does not justify the investment, given its cost and the difficulty of successful implementation. However, these reservations may be product specific and can depend on the situation; many others have claimed a strong return on investment and better overall management of the data center with DCIM.
Growing need for DCIM
Uptime’s discussions with operators suggest there is a growing need for DCIM software, and related software tools, to help resolve some of the urgent operational issues around sustainability, resiliency and capacity management. The current potential benefits of DCIM include:
Meanwhile, there will be new requirements from customers for improved monitoring, reporting and measurement of data, including:
Supplier landscape resets
In the past decade, many DCIM suppliers have reset, adapted and modernized their technology to meet customer demand. Many have now introduced mobile and browser-based offerings, colocation customer portals and better metrics tracking, data analytics, cloud and software as a service (SaaS).
Customers are also demanding more vendor-agnostic DCIM software. Operators have sometimes struggled with DCIM’s inability to work with existing building management systems from other vendors, which then requires additional costly work on application programming interfaces and integration. Some operators have noted that DCIM software from one specific vendor still only provides out-of-the-box monitoring for that vendor’s own brand of equipment. These concerns have influenced (and continue to influence) customer buying decisions.
Adaptation has been difficult for some of the largest DCIM suppliers, and some organizations have now exited the market. The discontinuation of Trellis in 2021 by Vertiv, one of the largest data center equipment vendors, was a significant exit: customers found Trellis too large and complex for most implementations. Even today, operators continue to migrate off Trellis onto other DCIM systems.
Other structural changes in the DCIM market include Carrier and Schneider Electric acquiring Nlyte and Aveva, respectively, and Sunbird spinning out from hardware vendor Raritan (Legrand) (see Table 1).
Table 1. A selection of current DCIM suppliers
A growing number of independent software vendors now offer DCIM, each with different specialisms. For example, Hyperview is solely cloud-based, RiT Tech focuses on universal data integration, while Device42 specializes in IT asset discovery. Independent vendors benefit those unwilling to acquire DCIM software and data center equipment from the same supplier.
Those DCIM software businesses that have been acquired by equipment vendors are typically kept at arm’s length. Schneider and Carrier both retain the Aveva and Nlyte brands and cultures, respectively, to preserve their differentiation and independence.
There are many products in the data center management area that are sometimes — in Uptime’s view — labeled incorrectly as DCIM. These include products that offer discrete or adjacent DCIM capabilities, such as: Vertiv Environet Alert (facility monitoring); IBM Maximo (asset management); AMI Data Center Manager (server monitoring); Vigilent (AI-based cooling monitoring and control); and EkkoSense (digital twin-based cooling optimization). Uptime views these as part of the wider DCM-C control category, which will be discussed in a future report in this series.
Attendees at Uptime network member events between 2013 and 2020 may recall that complaints about DCIM products, implementation, integration and pricing were a regular feature. Much of the early software was market driven, fragile and suffered from performance issues, but DCIM software has undoubtedly improved from where it was a decade ago.
The next sections of this report discuss areas in which DCIM software has improved and where there is still room for improvement.
Modern development techniques
Modern development techniques, such as continuous integration / continuous delivery and agile / DevOps, have encouraged a regular cadence of new releases and updates. Containerized applications have introduced modular DCIM, while SaaS has provided greater pricing and delivery flexibility.
Modularity
DCIM is no longer a monolithic software package. Previously, it was purchased as a core bundle, but now it is more modular, with add-ons that can be purchased as required. This may make DCIM more cost-effective, with operators able to assess the return on investment more accurately before committing to further spending. Ten years ago, the main customers for DCIM were enterprises, which had control over the IT but limited budgets. Now, DCIM customers are more likely to be colocation providers, which have more specific requirements and little interest in the IT, and which probably require more modular, targeted solutions with links into their own commercial systems.
SaaS
SaaS brings subscription-based pricing, offering greater flexibility and visibility of costs. This is different from traditional DCIM license and support pricing, which typically locked customers in for minimum-term contracts. Since SaaS is subscription-based, there is more onus on the supplier to respond to customer requests in a timely manner. While some DCIM vendors offer cloud-hosted versions of their products, most operators still opt for on-premises DCIM deployments due to perceived data and security concerns.
IT and software integrations
Historically, DCIM suffered from configurability, responsiveness and integration issues. In recent years, more effort has been made toward third-party software and IT integration and encouraging better data sharing between systems. Much DCIM software now uses application programming interfaces (APIs) and industry standard protocols to achieve this:
Application programming interfaces
APIs have made it easier for DCIM to connect with third-party software, such as IT service management, IT operations management, and monitoring and observability tools, which are often used in other parts of the organization. The aim for operators is to achieve a comprehensive view across the IT and facilities landscape, and to help orchestrate requests that come in and out of the data center. Some DCIM systems, for example, come with pre-built integrations for tools such as ServiceNow and Salesforce, which are widely used by enterprise IT teams. These enterprise tools can provide a richer set of functionality in IT and customer management and support. They can also use robotic process automation technology to automate repetitive manual tasks, such as rekeying data between systems, updating records and automating responses.
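As an illustration of this type of integration, the sketch below raises an incident in an IT service management queue from a DCIM alert using the ServiceNow Table API. The instance URL, credentials and alert fields are placeholders, and a production deployment would more likely rely on a vendor-supplied connector than on hand-written calls:

# Minimal sketch: raise an ITSM incident from a DCIM alert via the ServiceNow
# Table API. Requires the third-party "requests" package. Instance name,
# credentials and field values are placeholders.
import requests

INSTANCE = "https://example-instance.service-now.com"    # placeholder instance
AUTH = ("dcim_integration_user", "change-me")             # placeholder credentials

def raise_incident(alert: dict) -> str:
    payload = {
        "short_description": f"DCIM alert: {alert['summary']}",
        "description": f"Asset {alert['asset']} in {alert['location']} reported {alert['summary']}",
        "urgency": alert.get("urgency", "2"),
    }
    resp = requests.post(
        f"{INSTANCE}/api/now/table/incident",
        auth=AUTH,
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["result"]["number"]     # e.g., "INC0012345"

# Example: a hypothetical over-temperature alert from the DCIM system
print(raise_incident({"summary": "Rack inlet over 32C", "asset": "RACK-A-14", "location": "Hall 2"}))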
IT/OT protocols
Support for a growing number of IT/OT protocols has made it easier for DCIM to connect with a broader range of IT/OT systems. This helps operators to access the data needed to meet new sustainability requirements. For example, support for the Simple Network Management Protocol (SNMP) can provide DCIM with network performance data that can be used to monitor and detect connection faults. Meanwhile, support for the Intelligent Platform Management Interface (IPMI) can enable remote monitoring of servers.
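As a simple illustration, any device that exposes SNMP can be polled with the standard Net-SNMP command-line tools. The sketch below assumes those tools are installed; the host address, community string and the use of sysUpTime as the example object are placeholders:

# Minimal sketch: poll a device over SNMP v2c using the Net-SNMP snmpget tool.
# Host, community string and OID are placeholders; sysUpTime is used here only
# because it is available on virtually every SNMP-capable device.
import subprocess

def snmp_get(host: str, oid: str, community: str = "public") -> str:
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Ovq", host, oid],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"
print(snmp_get("10.0.0.50", SYS_UPTIME_OID))    # e.g., "123:4:56:07.89"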
User experience has improved
DCIM provides a better user experience than a decade ago. However, operators still need to be vigilant that a sophisticated front end is not a substitute for functionality.
Visualization
Visualization for monitoring and planning has seen significant progress, with interactive 3D and augmented reality views of IT equipment, racks and data halls. Sensor data is being used, for example, to identify available capacity, hot-spots or areas experiencing over-cooling. This information is presented visually to the user, who can follow changes over time and drag and drop assets into new configurations. On the fringes of DCIM, computational fluid dynamics can visualize air flows within the facility, which can then be used to make assumptions about the impact of specific changes on the environment. Meanwhile, the increasing adoption of computer-aided design can enable operators to render accurate and dynamic digital twin simulations for data center design and engineering and, ultimately, the management of assets across their life cycle.
Better workflow automation
At a process level, some DCIM suites offer workflow management modules to help managers initiate, manage and track service requests and changes. Drag and drop workflows can help managers optimize existing processes. This has the potential to reduce data entry omissions and errors, which have always been among the main barriers to successful DCIM deployments.
Rising demand for data and AI
Growing demand for more detailed data center metrics and insights related to performance, efficiency and regulations will make DCIM data more valuable. This, however, depends on how well DCIM software can capture, store and retrieve reliable data across the facility.
Customers today require greater levels of analytical intelligence from their DCIM. Greater use of AI and machine learning (ML) could enable the software to spot patterns and anomalies, and to provide next-best-action recommendations. DCIM has not fared well in this area, which has opened the door to a new generation of AI-enabled optimization tools. The Uptime report What is the role of AI in digital infrastructure management? identifies three near-term applications of ML in the data center — predictive analytics, equipment setting optimization and anomaly detection.
DCIM is making progress in sustainability data monitoring and reporting and a number of DCIM suppliers are now actively developing sustainability modules and dashboards. One supplier, for example, is developing a Scope 1, 2 and 3 greenhouse gas emissions model based on a range of datasets, such as server product performance sheets, component catalogs and the international Environmental Product Declaration (EPD) database. Several suppliers are working on dashboards that bring together all the data required for compliance with the EU’s Energy Efficiency Directive. Once functional, these dashboards could compare data centers, devices and manufacturers, as well as provide progress reports.
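At its core, such reporting rests on straightforward arithmetic: location-based Scope 2 emissions are metered electricity consumption multiplied by a grid emission factor. The consumption figures and emission factor in the sketch below are illustrative assumptions only:

# Minimal sketch of a location-based Scope 2 calculation. The consumption
# figures and grid emission factor are illustrative assumptions only.
monthly_consumption_mwh = {"2024-01": 1850, "2024-02": 1720, "2024-03": 1790}
grid_emission_factor_tco2_per_mwh = 0.35     # assumed local grid average

scope2_tco2 = {
    month: mwh * grid_emission_factor_tco2_per_mwh
    for month, mwh in monthly_consumption_mwh.items()
}

for month, tco2 in scope2_tco2.items():
    print(f"{month}: {tco2:,.0f} tCO2e (location-based)")
print(f"Quarter total: {sum(scope2_tco2.values()):,.0f} tCO2e")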
The Uptime Intelligence View
DCIM has matured as a software solution over the past decade. Functional modularity, SaaS delivery, support for remote working, integration, user experience and data analytics have all progressed to the point where DCIM is now considered a viable and worthwhile investment. DCIM data will also be increasingly valuable for regulatory reporting requirements. Nonetheless, there remains more work to be done. Customers still have legitimate concerns about its complexity, cost and accuracy, while DCIM’s ability to apply AI and analytics — although an area of great promise — is still viewed cautiously. Even when commercial DCIM packages were less robust and functional, those operators that researched it diligently and deployed it carefully found it to be largely effective. This remains true today.
Water cold plates lead in the small, but growing, world of DLC
By Jacqueline Davis, Research Analyst, Uptime Institute, [email protected]

Direct liquid cooling (DLC), including cold plate and immersion systems, is becoming more common in data centers — but so far this transition has been gradual and unevenly distributed, with some data centers using it widely and others not at all. The use of DLC in 2024 still accounts for only a small minority of the world’s IT servers, according to the Uptime Institute Cooling Systems Survey 2024. The adoption of DLC remains slow in general-purpose business IT, and the most substantial deployments currently concentrate on high-performance computing applications, such as academic research, engineering, AI model development and cryptocurrency.
This year’s cooling systems survey results continue a trend: of those operators that use DLC in some form, the greatest number of operators deploy water cold plate systems, with other DLC types trailing significantly. Multiple DLC types will grow in adoption over the next few years, and many installations will be in hybrid cooling environments where they share space and infrastructure with air-cooling equipment.
Within this crowded field, water cold plates’ lead is not overwhelming. Water cold plate systems retain advantages that explain this result: ease of integration into shared facility infrastructure, a well-understood coolant chemistry, greater choice in IT and cooling equipment, and less disruption to IT hardware procurement and warranty compared with other current forms of DLC. Many of these advantages are down to the technology’s long-established history spanning decades.
This year’s cooling systems survey provides additional insights into the DLC techniques operators are currently considering. Of those data center operators using DLC, many more (64%) have deployed water cold plates than the next-highest-ranking types: dielectric-cooled cold plates (30%) and single-phase immersion systems (26%) (see Figure 1).
Figure 1. Operators currently using DLC prefer water-cooled cold plates
At present, most users say they use DLC for a small portion of their IT — typically for their densest, most difficult-to-cool equipment (see DLC momentum rises, but operators remain cautious). These small deployments favor hybrid approaches, rather than potentially expensive dedicated heat rejection infrastructure.
Hybrid cooling predominates — for now
Many DLC installations are in hybrid (mixed) setups in which DLC equipment sits alongside air cooling equipment in the data hall, sharing both heat transport and heat rejection infrastructure. This approach can compromise DLC’s energy efficiency advantages (see DLC will not come to the rescue of data center sustainability), but for operators with only small DLC deployments, it can be the only viable option. Indeed, when operators named the factors that make a DLC system viable, the greatest number (52%, n=392) chose ease of retrofitting DLC into their existing infrastructure.
For those operators who primarily serve mainstream business workloads, IT is rarely dense and powerful enough to require liquid cooling. Nearly half of operators (48%, n=94) only use DLC on less than 10% of their IT racks — and only one in three (33%, n=54) have heat transport and rejection equipment dedicated to their liquid-cooled IT. At this early stage of DLC growth, economics and operational risks dictate that operators prefer cooling technologies that integrate more readily into their existing space. Water cold plate systems can meet this need, despite potential drawbacks.
Cold plates are good neighbors, but not perfect
Water-cooled servers typically fit into standard racks, which simplifies deployment — especially when these servers coexist with air cooled IT. Existing racks can be reused either fully or partially loaded with water-cooled servers, and installing new racks is also straightforward.
IT suppliers prefer to sell the DLC solution integrated with their own hardware, ranging from a single server chassis to rows of racks including cooling distribution units (CDUs). Today, this approach typically favors a cold plate system, so that operators and IT teams have the broadest selection of equipment and compatible IT hardware with vendor warranty coverage.
The use of water in data center cooling has a long history. In the early years of mainframes water was used in part due to its advantageous thermal properties compared with air cooling, but also because of the need to remove heat effectively from the relatively small rooms that computers shared with staff.
Today, water cold plates are used extensively in supercomputing, handling extremely dense cabinets. Operators benefit from water’s low cost and ready availability, and many are already skilled in maintaining its chemistry (even though quality requirements for the water coolant are more stringent for cold plates compared with a facility loop).
The risk (and, in some cases, the vivid memories) of water leakage onto electrified IT components is one reason some operators are hesitant to embrace this technology, but leaks are statistically rare and there are established best practices to mitigate them. However, with positive-pressure cold plate loops, which are the type most widely deployed by operators, the risk is never zero. The possibility of water damage is perhaps the single key weakness of water cold plates driving interest in alternative dielectric DLC techniques.
In terms of thermal performance, water is not without competition. Two-phase dielectric coolants show strong performance by taking advantage of the added cooling effect from vaporization. Vendors offer this technology in the form of both immersion tanks and cold plates, with the latter edging ahead in popularity because it requires less change to products and data center operations. The downside of all engineered coolants is the added cost, as well as the environmental concerns around manufacturing and leaks.
Some in the data center industry predict two-phase cooling products will mature to capitalize on their performance potential and eventually become a major form of cooling in the world of IT. Uptime’s survey data suggests that water cold plate systems currently have a balance of benefits and risks that make practical and economic sense for a greater number of operators. But the sudden pressure on cooling and other facility infrastructure brought about by specialized hardware for generative AI will likely create new opportunities for a wider range of DLC techniques.
Outlook
Uptime’s surveys of data center operators are a useful indicator of how operators are meeting their cooling needs, among other requirements. The data thus far suggests a gradual DLC rollout, with water cold plates holding a stable (if not overwhelming) lead. Uptime’s interviews with vendors and operators consistently paint a picture of widespread hybrid cooling environments, which incentivize cooling designs that are flexible and interoperable.
Many water cold plate systems on the market are well suited to these conditions. Looking five years ahead, densified IT for generative AI and other intensive workloads is likely to influence data center business priorities and designs more widely. DLC adoption and operator preferences for specific technology types are likely to shift in response. Pure thermal performance is key but not the sole factor. The success of any DLC technique will rely on overcoming the barriers to its adoption, availability from trusted suppliers and support for a wide range of IT configurations from multiple hardware manufacturers.
OT protection: is air-gapping the answer?
By Michael O’Neil, Consulting Analyst - Cybersecurity, Uptime Institute, [email protected]

Cyberattacks on operational technology (OT) were virtually unknown until five years ago, but their volume has been doubling since 2019. This threat is distinct from the IT-focused vulnerabilities that cybersecurity measures regularly address. The risk associated with OT compromise is substantial: power or cooling failures can cripple an entire facility, potentially for weeks or months.
Many organizations believe they have air-gapped (isolated from IT networks) OT systems, protecting them against external threats. The term is often used incorrectly, however, which means exploits are harder to anticipate. Data center managers need to understand the nature of the threat and their defense options to protect their critical environments.
Physical OT consequences of rising cyberattacks
OT systems are used in most critical environments. They automate power generation, water and wastewater treatment, pipeline operations and other industrial processes. Unlike IT systems, which are inter-networked by design, these systems operate autonomously and are dedicated to the environment where they are deployed.
OT is essential to data center service delivery: this includes the technologies that control power and cooling and manage generators, uninterruptible power supplies (UPS) and other environmental systems.
Traditionally, OT has lacked robust native security, relying on air-gapping for defense. However, the integration of IT and OT has eroded OT’s segregation from the broader corporate attack surface, and threat sources are increasingly targeting OT as a vulnerable environment.
Figure 1. Publicly reported OT cyberattacks with physical consequences
Research from Waterfall Security shows a 15-year trend in publicly reported cyberattacks that resulted in physical OT consequences. The data shows that these attacks were rare before 2019, but their incidence has risen steeply over the past five years (see 2024 Threat report: OT cyberattacks with physical consequences).
This trend should concern data center professionals. OT power and cooling technologies are essential to data center operations. Most organizations can restore a compromised IT system effectively, but few (if any) can recover from a major OT outage. Unlike IT systems, OT equipment cannot be restored from backup.
Air-gapping is not air-tight
Many operators believe their OT systems are protected by air-gaps: systems that are entirely isolated from external networks. True air-gaps, however, are rare — and have been for decades. Suppliers and customers have long recognized that OT data supports high-value applications, particularly in terms of remote monitoring and predictive maintenance. In large-scale industrial applications — and data centers — predictive maintenance can drive better resource utilization, uptime and reliability. Sensor data enables firms to, for example, diagnose potential issues, automate replacement part orders and schedule technicians, all of which helps to increase equipment life spans and reduce maintenance costs and availability concerns.
In data centers, these sensing capabilities are often “baked in” to warranty and maintenance agreements. But the data required to drive IT predictive maintenance systems comes from OT systems, and this means the data needs a route from the OT equipment to an IT network. In most cases, the data route is designed to work in one direction, from the OT environment to the IT application.
Indeed, the Uptime Institute Data Center Security Survey 2023 found that nearly half of operators using six critical OT systems (including UPS, generators, fire systems, electrical management systems, cooling control and physical security / access control) have enabled remote monitoring. Only 12% of this group, however, have enabled remote control of these systems, which requires a path leading from IT back to the OT environment.
Remote control has operational benefits but increases security risk. Paths from OT to IT enable beneficial OT data use, but also open the possibility of an intruder following the same route back to attack vulnerable OT environments. A bi-directional route exposes OT to IT traffic (and, potentially, to IT attackers) by design. A true air-gap (an OT environment that is not connected in any way to IT applications) is better protected than either of these alternatives, but will not support IT applications that require OT data.
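The one-way principle can be illustrated in software, although true unidirectional gateways enforce it in hardware rather than in code. In the sketch below, the OT side only ever sends telemetry to an IT-side collector and never listens for a response; the addresses and payload fields are placeholders:

# Illustrative only: a software analogy for one-way OT-to-IT telemetry.
# Real unidirectional gateways enforce this in hardware; the addresses and
# payload fields here are placeholders.
import json
import socket
import time

IT_COLLECTOR = ("10.20.0.10", 5140)    # placeholder address of the IT-side listener

def push_reading(reading: dict) -> None:
    # Send only: the OT side never listens for, or acts on, a reply.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(json.dumps(reading).encode(), IT_COLLECTOR)

for _ in range(3):    # in practice this loop would run continuously
    push_reading({"asset": "UPS-3", "metric": "battery_temp_c", "value": 27.4, "ts": time.time()})
    time.sleep(5)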
Defense in depth: a possible solution?
Many organizations use the Purdue model as a framework to protect OT equipment. The Purdue model divides the overall operating environment into six layers (see Table 1). Physical (mechanical) OT equipment is level 0. Level 1 refers to instrumentation directly connected to level 0 equipment. Level 2 systems manage the level 1 devices. IT/OT integration is a primary function at level 3, while levels 4 and 5 refer to IT networks, including IT systems such as enterprise resource planning systems that drive predictive maintenance. Cybersecurity typically focuses on the upper levels of the model; the lower-level OT environments typically lack security features found in IT systems.
Table 1. The Purdue model as applied to data center environments
Lateral and vertical movement challenges
When organizations report that OT equipment is air-gapped, they usually mean that the firewalls between the different layers are configured to only permit communication between adjacent layers. This prevents an intruder from moving from the upper (IT) layers to the lower (OT) levels. However, data needs to move vertically from the physical equipment to the IT applications across layers — otherwise the IT applications would not receive the required input from OT. If there are vertical paths through the model, an adversary would be able to “pivot” an attack from one layer to the next.
This is not the type of threat that most IT security organizations expect. Enterprise cyber strategies look for ways to reduce the impact of a breach by limiting lateral movement (an adversary’s ability to move from one IT system to another, such as from a compromised endpoint device to a server or application) across the corporate network. The enterprise searches for, responds to, and remediates compromised systems and networks.
OT networks also employ the principles of detecting, responding to and recovering from attacks. The priority of OT, however, is to prevent vertical movement and to avoid threats that can penetrate the IT/OT divide.
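The “adjacent layers only” rule described above, and the reason it does not by itself stop a patient adversary, can be sketched in a few lines; the rule is a deliberate simplification of real firewall policy:

# Minimal sketch: an "adjacent Purdue levels only" rule, and why pivoting
# still works. This is a simplification of real firewall policy.
def flow_permitted(src_level: int, dst_level: int) -> bool:
    # Only directly adjacent levels may communicate (e.g., 3 <-> 2, 2 <-> 1).
    return abs(src_level - dst_level) == 1

print(flow_permitted(4, 1))    # False: no direct path from IT down to instrumentation
print(flow_permitted(4, 3))    # True
print(flow_permitted(3, 2))    # True
print(flow_permitted(2, 1))    # True: a compromised level 4 host can still reach
                               # level 1 by pivoting one layer at a time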
Key defense considerations
Data center managers should establish multiple layers of data center OT defense. The key principles include:
The Uptime Intelligence View
Attacks on OT systems are increasing, and defense strategies are often based on faulty assumptions and informed by strategies that reflect IT rather than OT requirements. A successful attack on data center OT could be catastrophic, resulting in long-term (potentially months long) facility outages.
Many OT security measures stem from the belief that OT is protected by air-gapping, but this description may not be accurate and the protection may not be adequate in the face of escalating OT attacks. Data center managers should consider OT security strategies aimed at threat prevention. These should combine the detect, respond, recover cyber principles and engineering-grade protection for OT equipment.
It is also crucial to integrate vigilant physical security to protect against unauthorized access to vulnerable OT environments. Prevention is the most critical line of defense.
Managing server performance for power: a missed opportunity
By Daniel Bizo, Research Director, Uptime Institute Intelligence, [email protected]

An earlier Uptime Intelligence report discussed the characteristics of processor power management (known as C-states) and explained how they can reduce server energy consumption to make substantial contributions to the overall energy performance and sustainability of data center infrastructure (see Understanding how server power management works). During periods of low activity, such features can potentially lower the server power requirements by more than 20% in return for prolonging the time it takes to respond to requests.
But there is more to managing server power than just conserving energy when the machine is not busy — setting processor performance levels that are appropriate for the application is another way to optimize energy performance. This is the crux of the issue: there is often a mismatch between the performance delivered and the performance required for a good quality of service (QoS).
When the performance is too low, the consequences are often clear: employees lose productivity, customers leave. But when application performance exceeds needs, the cost remains hidden: excessive power use.
Server power management: enter P-states
Uptime Intelligence survey data indicates that power management remains an underused feature: most servers do not have it enabled (see Tools to watch and improve power use by IT are underused). The extra power use may appear small at first, amounting to only tens of watts per server, but when scaled to larger facilities or to the global data center footprint, these watts add up to a huge waste of power and money.
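A rough, back-of-the-envelope illustration of that scaling effect is shown below. Every input figure (per-server saving, fleet size, facility overhead, electricity price) is an assumption chosen for the example, not Uptime Institute data.

```python
# Hypothetical back-of-the-envelope calculation; all inputs are illustrative assumptions.
watts_saved_per_server = 30      # assumed average saving from power management
servers = 10_000                 # assumed fleet size
hours_per_year = 8_760
pue = 1.5                        # assumed facility overhead (cooling, power distribution)
price_per_kwh = 0.10             # assumed electricity price in USD

annual_kwh = watts_saved_per_server * servers * hours_per_year * pue / 1_000
print(f"{annual_kwh:,.0f} kWh per year (~${annual_kwh * price_per_kwh:,.0f} per year)")
# ~3,942,000 kWh per year (~$394,200 per year) under these assumptions
```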
The potential to improve the energy performance of data center infrastructure is material, but the variables involved in adopting server power management mean it is not a trivial task. Modern chip design is what creates this potential. All server processors in operation today are equipped with mechanisms to change their clock frequency and supply voltage in predefined pairs, stepping between them as needed (a technique called dynamic voltage and frequency scaling). Initially, these techniques were devised to lower energy use in laptops and other low-power systems when running code that does not fully utilize resources. Known as P-states, these operating points are in addition to C-states (low-power modes during idle time).
Later, mechanisms were added to do the opposite: increase clock speeds and voltages beyond nominal rates as long as the processor stays within hard limits for power, temperature and frequency. The effect of this approach, known as turbo mode, has gradually become more pronounced with ever-higher core counts, particularly in servers (see Cooling to play a more active role in IT performance and efficiency). As processors dynamically reallocate the power budget from lightly utilized or idle cores to highly stressed ones, clock speeds can substantially exceed nominal ratings, in some cases coming close to doubling them. In recent CPUs, even the power budget itself can be calibrated higher than the factory default.
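The sketch below shows one way to inspect the DVFS range, the active governor and the turbo/boost setting on a Linux server. It assumes the standard cpufreq sysfs interface; which boost file exists (if any) depends on the driver in use.

```python
# Minimal sketch: read the DVFS (P-state) range and turbo/boost status on a Linux host.
# Assumes the cpufreq sysfs interface; the boost/no_turbo files depend on the driver in use.
from pathlib import Path

CPUFREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq")

def read(path: Path) -> str:
    return path.read_text().strip() if path.exists() else "n/a"

if __name__ == "__main__":
    print("driver:       ", read(CPUFREQ / "scaling_driver"))
    print("governor:     ", read(CPUFREQ / "scaling_governor"))
    print("min freq kHz: ", read(CPUFREQ / "cpuinfo_min_freq"))
    print("max freq kHz: ", read(CPUFREQ / "cpuinfo_max_freq"))
    print("current kHz:  ", read(CPUFREQ / "scaling_cur_freq"))
    # Turbo/boost exposure differs by driver:
    print("boost (acpi-cpufreq):   ", read(Path("/sys/devices/system/cpu/cpufreq/boost")))
    print("no_turbo (intel_pstate):", read(Path("/sys/devices/system/cpu/intel_pstate/no_turbo")))
```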
As a result, server processor behavior has become increasingly opportunistic over the past decade. When allowed, processors will dynamically seek out the electrical configuration that yields maximum performance whenever the software (signaled by the operating system, detected by hardware mechanisms, or both) requests it. Such behavior is generally good for performance, particularly in a highly mixed application environment where some applications benefit from running on many cores in parallel while others prefer fewer, faster ones.
The unquantified costs of high performance
Ensuring top server performance comes at the cost of using more energy. For performance-critical applications such as technical computing, financial transactions, high-speed analytics and real-time operating systems, the use and cost of energy are often not a concern.
But for a large array of workloads, this will result in a considerable amount of energy waste. There are two main components to this waste. First, the energy consumption curve for semiconductors gets steeper the closer the chip pushes to the top of its performance envelope, because both dynamic (switching) and static (leakage) power rise disproportionately with the higher voltages and frequencies involved. All the while, the performance gains diminish because the rest of the system, including the memory, storage and network subsystems, cannot keep pace with the processor, increasing the time the processor spends waiting for data or instructions.
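The shape of that curve can be illustrated with the standard first-order approximation that dynamic CMOS power scales with capacitance, the square of supply voltage and clock frequency. The operating points below are invented for illustration and do not describe any specific processor.

```python
# Illustrative first-order model only: dynamic CMOS power scales roughly with C * V^2 * f,
# and supply voltage must rise with clock frequency near the top of the envelope.
# The operating points below are assumptions chosen to show the shape of the curve.

points = [  # (clock GHz, supply volts) - hypothetical DVFS operating points
    (2.0, 0.80),
    (2.6, 0.90),
    (3.2, 1.00),
    (3.8, 1.15),  # turbo region: a small frequency gain needs a larger voltage step
]

base_ghz, base_v = points[0]
for ghz, volts in points:
    rel_perf = ghz / base_ghz
    rel_power = (ghz * volts**2) / (base_ghz * base_v**2)
    print(f"{ghz:.1f} GHz: ~{rel_perf:.2f}x peak performance for ~{rel_power:.2f}x dynamic power")
```

Under these assumed points, the top operating point delivers roughly 1.9 times the clock rate of the lowest one for nearly four times the dynamic power, before accounting for the memory, storage and network bottlenecks noted above.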
Second, energy waste originates from a mismatch between performance and QoS. Select applications and systems, such as transaction processing and storage servers, tend to have defined QoS policies for performance (e.g., responding to 99% of queries within a second). QoS is typically about setting a floor below which performance should not drop — it is rarely about ensuring systems do not overperform, for example, by processing transactions or responding to queries unnecessarily fast.
If one second for a database query is within tolerance, there is, by definition, limited value in returning a response in under one-tenth of a second simply because a lightly loaded server can process the query that fast. Yet this happens all the time: for many, if not most, workloads, this level of overperformance is neither defined nor tracked, which invites an exploration of what constitutes acceptable QoS.
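As a simple illustration of the gap between a QoS floor and what is actually delivered, the sketch below checks a hypothetical 99th-percentile latency target and reports the untracked headroom. The latency figures are made up for the example.

```python
# Minimal sketch: evaluate a response-time QoS target (e.g., 99% of queries within 1 second)
# and report how far the system "overperforms" it. Latencies here are made-up numbers.
import random

def p99(samples: list[float]) -> float:
    ordered = sorted(samples)
    return ordered[int(0.99 * (len(ordered) - 1))]

random.seed(1)
latencies_s = [random.uniform(0.02, 0.12) for _ in range(10_000)]  # hypothetical measurements

target_s = 1.0
observed = p99(latencies_s)
print(f"p99 = {observed * 1000:.0f} ms against a {target_s * 1000:.0f} ms target")
print(f"headroom: ~{target_s / observed:.0f}x faster than the QoS floor requires")
```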
Governing P-states for energy efficiency
At its core, governing P-states is like managing idle power through C-states, except with many more options, which adds complexity through choice. This report does not discuss the exact number of P-states because it is highly dependent on the processor used. As with C-states, a higher number denotes a greater potential energy saving; for example, P2 consumes less power than P1. P0 is the highest-performance state a processor can select.
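On a Linux host, P-state selection is typically delegated to a cpufreq governor. The sketch below, which assumes the standard cpufreq sysfs interface and uses "powersave" purely as an illustrative choice, shows how the available and active governors can be inspected and changed.

```python
# Minimal sketch: inspect and (with root privileges) change the cpufreq governor that
# drives P-state selection on a Linux host. Governor names vary by driver; "powersave"
# is used here only as an illustrative assumption.
from pathlib import Path

def set_governor(governor: str) -> None:
    for gov_file in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_governor"):
        available = (gov_file.parent / "scaling_available_governors").read_text().split()
        if governor in available:
            gov_file.write_text(governor)  # requires root

if __name__ == "__main__":
    cpu0 = Path("/sys/devices/system/cpu/cpu0/cpufreq")
    print("available:", (cpu0 / "scaling_available_governors").read_text().strip())
    print("current:  ", (cpu0 / "scaling_governor").read_text().strip())
    # set_governor("powersave")  # uncomment to apply; typically run as root
```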
The trade-off here is additional latency: whenever the processor is in a low-performance state and transitions to a higher-performance state (e.g., P0) in response to a burst of compute or interrupt activity, the transition takes a material amount of time. Highly latency-sensitive and bursty workloads may see a substantial impact.
Power reductions can be outsized across most of the system load curve. Depending on the sophistication of the operating system governor and the selected power plan, energy savings can reach between 25% and 50%.
While there are inevitable trade-offs between performance and efficiency, the impact of P-state control on performance is often negligible. This is true for users of business and web applications, and for the total runtime of technical computing jobs. High-performance requirements alone do not preclude the use of P-state control: once the processor selects P0, there is no difference between a system with controls and one without.
Applications that do not tolerate dynamic P-state control well tend to be the exceptions already suspected: latency-sensitive, bursty workloads for which the processor cannot scale voltage and frequency fast enough to match sudden changes in performance needs, even though the transition takes only microseconds.
Arguably, for most use cases, the main concern should be power consumption, not performance. Server efficiency benchmarking data, such as that published by the Standard Performance Evaluation Corporation and The Green Grid, indicates that modern servers achieve the best energy efficiency when their performance envelope is limited (e.g., to P2), because this prevents the chip from aggressively seeking the highest clock rates across many cores, which would result in disproportionately higher power use for little in return.
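One rough way to approximate "limiting the performance envelope" on a Linux server is to cap the maximum scaling frequency. The sketch below uses an arbitrary 80% cap as an assumption; how a given frequency cap maps onto a named P-state depends on the processor.

```python
# Minimal sketch: cap the maximum clock frequency as a rough analogue of limiting the
# processor to a lower P-state. The 80% cap is an arbitrary illustrative choice.
from pathlib import Path

def cap_max_frequency(fraction: float = 0.8) -> None:
    for cpu in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq"):
        hw_max_khz = int((cpu / "cpuinfo_max_freq").read_text())
        capped_khz = int(hw_max_khz * fraction)
        (cpu / "scaling_max_freq").write_text(str(capped_khz))  # requires root

if __name__ == "__main__":
    cap_max_frequency(0.8)
```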
Upcoming Uptime Intelligence reports will identify the software tools that data center operators can use to monitor and manage the power and performance settings of server fleets.
The Uptime Intelligence View
Server power management in its multiple forms offers data center operators easy wins in IT efficiency and opportunities to lower operational expenditure. It will be especially attractive to small- and medium-sized enterprises that run mixed workloads with low criticality, yet the effects of server power management will be much more impressive when implemented at scale and for the right workloads.