Amazon Web Services (AWS) was the world’s first hyperscale cloud provider, and it remains the largest today. It represents around one-third of the global market, offering more than 200 infrastructure, platform and software services across 34 regions. To efficiently deliver so many services at such a scale, AWS designs and builds much of its own hardware.
The core AWS service is Amazon EC2 (Elastic Compute Cloud), which delivers virtual machines as a service. Amazon EC2 is not only a service sold to customers but also the underlying, hidden foundation for AWS’s platform and software services. The technology deployed in AWS data centers is often used by its parent company, Amazon, to deliver e-commerce, streaming and other consumer capabilities.
A hyperscale cloud provider does more than just manage “someone else’s computer,” as the joke goes. At the annual AWS re:Invent conference in November 2024, one speaker stated that AWS EC2 users create around 130 million new instances daily, which is well beyond anything colocation or enterprise data centers can achieve. Managing the IT infrastructure to meet such demand requires servers and silicon specifically designed for the task. Since 2017, a core capability in AWS infrastructure has been the Nitro system, which enables such scale by offloading virtualization, networking and storage management from the server processor and onto a custom chip.
Nitro architecture
Virtualization software divides a physical server into many virtual machines. It is a vital component of the public cloud because it enables the provider to create, sell and destroy computing units purchased on demand by users.
The AWS Nitro system consists of a custom network interface card containing a system-on-chip (SoC) and a lightweight hypervisor (virtualization software layer) installed on each server. Designed by Annapurna Labs, which Amazon acquired in 2015, the hardware and firmware are developed and maintained by AWS engineering teams.
The system offloads many of the functions of software virtualization onto dedicated hardware. This offloading reduces CPU overhead, freeing up resources previously consumed by virtualization software for running customer workloads. It also offloads some security and networking functionality.
A full breakdown of Nitro’s capability is provided in Table 1.
Table 1 Features of Nitro card
AWS has millions of servers that are connected and ready to use. Nitro enables users (or applications) to provision resources and start them up securely within seconds without requiring human interaction. It also provides AWS with the ability to control and optimize its estate.
Through Nitro, AWS can manage all its servers regardless of the underlying hardware, operating system, or the AWS service provisioned upon them. Nitro allows x86 and ARM servers to be managed using the same technology, and it can also support accelerators such as Nvidia GPUs and AWS’s own Inferentia and Trainium application-specific integrated circuits for AI workloads.
Although AWS uses servers from original equipment manufacturers, such as Dell and HPE, it also designs its own, manufacturing them via original design manufacturers (ODMs), usually based in Asia. These servers are stripped of nonessential components to reduce cost overheads and optimize performance for AWS’s specific requirements, such as running its ARM-based CPU, Graviton. In addition, AWS designs its own networking equipment, which is also manufactured by ODMs, reportedly including Wiwynn and Quanta.
The Graviton CPU
Graviton is AWS’s family of ARM-based chips, designed by Annapurna Labs. Just like Nitro, Graviton is becoming an increasingly important enabler for AWS, and the two capabilities are becoming more entwined.
The use of Graviton is growing, according to speakers at the re:Invent conference. In the past two years, 50% of AWS’s new CPU capacity has been based on Graviton. Customers can consume Graviton directly through a range of EC2 virtual machines, but AWS also utilizes Graviton to power platforms and services where the customer has no visibility into (or interest in) the underlying technology — for example, 150,000 Graviton chips power the AWS DynamoDB database service.
Graviton is also employed by the parent company: Amazon used 150,000 Graviton chips during its annual Prime Day sale to meet its e-commerce demand.
The growth in Graviton processor adoption is driven primarily by economics. Compared with instances using x86 designs by Intel and AMD, AWS prices Graviton instances lower at comparable configurations (vCPUs, memory, bandwidth) as it tries to steer customers towards its own platform.
For AWS, selling access to its own chips captures revenue that would otherwise have gone to its partners Intel and AMD. It also gives AWS a differentiator in the market and a degree of lock-in; AWS’s competitors are now offering ARM services, but Graviton is more mature and widely adopted in the cloud market.
The downside for cloud customers is that chips based on ARM instruction sets cannot run the vast library of existing x86 code and have a less mature software toolchain. This makes it harder for developers to implement some features or extract optimal performance, and it leaves ARM-based instances unsuitable for many commercial business applications.
Nitro enhances AWS’s latest Graviton chip (version 4) by providing a secure foundation through hardware-based attestation and isolation. Graviton4 processors and Nitro chips verify each other’s identity cryptographically and establish encrypted communication channels, which helps protect workloads running on AWS from unauthorized access with minimal performance impact.
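As an illustration of the general pattern only (and emphatically not AWS’s actual Nitro or Graviton4 protocol), the sketch below shows a simplified mutual challenge-response attestation between two devices that share a provisioned root of trust, after which both derive a session key for an encrypted channel. All names, keys and the handshake itself are hypothetical.

```python
# Conceptual sketch of mutual hardware attestation (illustrative only; not
# AWS's actual Nitro/Graviton4 mechanism). Each device proves possession of a
# key installed by a trusted provisioning authority, then both derive a shared
# session key for an encrypted channel.
import hmac, hashlib, os

class Device:
    def __init__(self, name: str, provisioned_key: bytes):
        self.name = name
        self._key = provisioned_key              # installed at manufacture (hypothetical)

    def challenge(self) -> bytes:
        return os.urandom(16)                    # fresh nonce prevents replay

    def respond(self, nonce: bytes) -> bytes:
        # Prove possession of the provisioned key without revealing it
        return hmac.new(self._key, self.name.encode() + nonce, hashlib.sha256).digest()

    def verify(self, peer: "Device", nonce: bytes, response: bytes) -> bool:
        expected = hmac.new(self._key, peer.name.encode() + nonce, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)

root_of_trust = os.urandom(32)                   # stands in for provisioned credentials
nitro = Device("nitro-card", root_of_trust)
graviton = Device("graviton4", root_of_trust)

# Mutual challenge-response: each side attests to the other
n_to_g, g_to_n = graviton.challenge(), nitro.challenge()
assert graviton.verify(nitro, n_to_g, nitro.respond(n_to_g))
assert nitro.verify(graviton, g_to_n, graviton.respond(g_to_n))

# Both sides can now derive the same session key for encrypted traffic
session_key = hmac.new(root_of_trust, n_to_g + g_to_n, hashlib.sha256).digest()
print("mutual attestation succeeded; session key:", session_key.hex()[:16], "...")
```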
Scalable storage
Nitro also enables storage to be disaggregated from compute, making it independently scalable.
Compute and storage do not necessarily scale with each other. One application might need a lot of compute and little disk, while another might need the complete opposite. This presents a problem in a static server with a fixed capacity of compute and storage.
In a traditional storage array, a head node is a server that manages the interactions between storage users and the actual disks. A storage array is provisioned with a head node and many disks connected directly to it.
The problem with this setup is that the maximum number of disks that the array can support is decided at setup. If an array is full, a new array has to be purchased.
As the size of the array design grows, practical challenges arise. AWS scaled a single storage array to 288 drives, with the hardware holding nearly six petabytes and weighing two tons. The sheer size of the appliance meant:
Data center floors had to be reinforced.
Specialized equipment was required to move and install arrays.
Vibrations from all drives moving in unison created performance issues.
A single failure of a head node would render 288 drives inaccessible.
To allow storage to scale independently of compute, reliably and without such deployment challenges, AWS designed its own storage system, effectively using Nitro as a lightweight head node.
In AWS’s method, each disk enclosure contains its own Nitro card. The Nitro card acts as a basic head node, managing the disks contained within the enclosure, and interacting with virtual machines hosted on servers elsewhere.
The primary benefits for AWS are easier maintenance and increased reliability. If a Nitro card fails, only a few drives lose connectivity, as opposed to an entire array of disks. Any failed drive can be removed from the service and a replacement added without causing downtime of the other disks or compute server. If a virtual machine goes down due to a failure of a compute server, it can be restarted elsewhere and the disks reconnected automatically, without loss of data.
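As a toy illustration of this failure-domain argument (conceptual only, not AWS’s implementation; the enclosure size is a hypothetical figure), the sketch below compares how many drives a single head failure strands in a monolithic array versus a per-enclosure design.

```python
# Toy failure-domain comparison (conceptual; not AWS's implementation).
TOTAL_DRIVES = 288              # size of the large array described above
DRIVES_PER_ENCLOSURE = 12       # hypothetical enclosure size behind one Nitro card

# Monolithic design: one head node fronts every drive, so its failure strands them all
drives_lost_monolithic = TOTAL_DRIVES

# Disaggregated design: each enclosure has its own lightweight head (the Nitro card),
# so a single card failure strands only that enclosure's drives
drives_lost_disaggregated = DRIVES_PER_ENCLOSURE

print(f"monolithic head failure strands {drives_lost_monolithic} drives")
print(f"single Nitro card failure strands {drives_lost_disaggregated} of {TOTAL_DRIVES} drives")
```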
The Uptime Intelligence View
Enterprises and colocation providers should focus on what the hyperscalers cannot do — supporting a wide range of hardware configured for each customer (internal or external), ensuring that hardware is secure (physically and virtually) and accessible only by that customer, and offering hands-on support tweaked to customer needs. They should also accept that customers will use the cloud for some applications simply because the hyperscalers can squeeze efficiency and provide scalability to a degree that is impossible for most organizations. Colocation providers and private facilities should enable the use of both on-premises and cloud infrastructure for their applications.
How AWS’s own silicon and software deliver cloud scalability, by Dr. Owen Rogers, Senior Research Director for Cloud Computing, Uptime Institute (published February 12, 2025).
Over the past year, demand for GPUs to train generative AI models has soared. Some organizations have invested in GPU clusters costing millions of dollars for this purpose. Cloud services offered by the major hyperscalers and a new wave of GPU-focused cloud providers deliver an alternative to dedicated infrastructure for those unwilling, or unable, to purchase their own GPU clusters.
There are many factors affecting the choice between dedicated and cloud-based GPUs. Among these are the ability to power and cool high density GPU clusters in a colocation facility or enterprise, the availability of relevant skills, and data sovereignty. But often, it is an economic decision. Dedicated infrastructure requires significant capital expenditure that not all companies can raise. Furthermore, many organizations are only just beginning to understand how (and if) they can use AI to create value. An investment in dedicated equipment for AI is a considerable risk, considering its uncertain future.
In contrast, cloud services can be consumed by the hour with no commitment or reservation required (although this is changing in some cases as cloud providers struggle to supply enough resources to meet demand). Companies can experiment with AI in the public cloud, eventually developing full-featured AI applications without making significant upfront investments, instead paying by the unit.
Ultimately, the public cloud consumption model allows customers to change their mind without financial consequence, by terminating resources if they are not needed and vice versa. Such flexibility is not possible when a big investment has been made in dedicated infrastructure.
Although the cloud may be more flexible than dedicated equipment, it is not necessarily cheaper. The cheapest choice depends on how the hardware is used. To beat the cloud on cost, dedicated equipment must be sweated — that is, used to its fullest potential — to obtain a maximum return on investment.
Those organizations that fail to sweat their infrastructure face higher costs. A dedicated cluster with a utilization of 40% is 25% cheaper per unit than cloud, but a utilization of just 8% makes dedicated infrastructure four times the price of cloud per unit.
Unit cost comparisons
Unit pricing is inherently built into the public cloud’s operating model: customers are charged per virtual machine, unit of storage or other consumption metric. Cloud users can purchase units spontaneously and delete them when needed. Since the provider is responsible for managing the capacity of the underlying hardware, cloud users are unaffected by how many servers the provider has — or how well they are utilized.
Dedicated equipment purchased with capital is not usually considered in terms of consumption units. However, unit costs are important because they help determine the potential return on an investment or purchase.
To compare the cost of dedicated equipment against the cost of cloud infrastructure, the total capital and operating expenses associated with dedicated equipment must be broken out into unit costs. These unit costs reflect how much a theoretical buyer would need to pay (per consumed unit) for the capital investment and operating costs to be fully repaid.
This concept is best explained hypothetically. Consider a server containing two CPUs. The server is only expected to be in service for a year. Many departments share the server, consuming units of CPU months (i.e., accessing a CPU for one month). There are 24 CPU months in a year (two CPUs x 12 months). Table 1 calculates each CPU month’s unit cost, comparing whether the server is highly utilized over its life or only partly utilized.
Table 1 Example unit cost calculations
Table 1 shows how utilization impacts the unit cost. Greater utilization equals a lower unit cost, as the investment delivers more units of value-adding capability. Lower utilization equals a higher unit cost, as the same investment delivers fewer units.
Note that utilization in this context means the average utilization across all servers, even those that are sitting idle. If the organization has purchased 10 servers in an AI cluster, but only one is being used, the average utilization is just 10%; the unused servers are wasted investments.
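The arithmetic behind this can be reduced to a single formula: unit cost equals the total cost of ownership divided by the units actually consumed, where consumed units are installed capacity multiplied by average utilization. The sketch below uses hypothetical figures (a $12,000 annual total cost for the two-CPU server) purely to mirror the example.

```python
def unit_cost(total_cost: float, capacity_units: float, avg_utilization: float) -> float:
    """Cost per consumed unit; it rises as average utilization falls."""
    consumed_units = capacity_units * avg_utilization
    return total_cost / consumed_units

# Hypothetical two-CPU server kept in service for one year: 24 CPU-months of capacity
total_cost = 12_000.0      # capital plus operating costs over the year (illustrative)
capacity = 24              # 2 CPUs x 12 months

print(unit_cost(total_cost, capacity, 1.0))    # fully utilized:  $500 per CPU-month
print(unit_cost(total_cost, capacity, 0.5))    # half utilized: $1,000 per CPU-month
print(unit_cost(total_cost, capacity, 0.1))    # e.g., 1 of 10 servers busy: $5,000 per CPU-month
```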
Dedicated infrastructure versus public cloud
The previous example shows a hypothetical scenario to demonstrate unit costs. Figure 1 shows a real unit cost comparison of Nvidia DGX H100 nodes hosted in a northern Virginia (US) colocation facility against an equivalent cloud instance, using pricing data averaged across AWS, Google Cloud, Microsoft Azure, CoreWeave, Lambda Labs and Nebius (collected in January 2025). Colocation costs reflected in this calculation include power, space and server capital, as described in Table 3 (later in this report). Dedicated unit costs are the same regardless of the number of DGX servers installed in a cluster. Notably, prices vary substantially between the hyperscalers and the smaller GPU providers (Uptime will publish a future report on this topic).
Figure 1 Variation in cost per server-hour by average cluster utilization over server lifetime
In Figure 1, the unit costs of cloud instances are constant because users buy instances using a per-unit model. The unit costs for the dedicated infrastructure vary with utilization.
Note that this report does not provide forensic analysis applicable to all scenarios, but rather illustrates how utilization is the key metric in on-premises versus cloud comparisons.
Dedicated is cheaper at scale
According to Figure 1, there is a breakeven point at 33% where dedicated infrastructure becomes cheaper per unit than public cloud. This breakeven means that over the amortization period, a third of the cluster’s capacity must be consumed for it to be cheaper than the cloud. The percentage may seem low, but it might be challenging to achieve in practice due to a multitude of factors such as the size of the model, network architecture and choice of software (Uptime will publish a future report explaining these factors).
Table 2 shows two different scenarios for training. In one scenario, a model is fine-tuned every quarter; in the other, the same model is fully retrained every other month.
Table 2 How training cycles impact utilization
When the dedicated cluster is being used for regular retraining, utilization increases, thereby lowering unit costs. However, when the cluster is only occasionally fine-tuning the model, utilization decreases, increasing unit costs.
In the occasional fine-tuning scenario, utilization of the dedicated infrastructure is just 8%. Using the dedicated equipment for an hour costs $250, compared with $66 for cloud, a cost increase of almost 300%. However, in the regular retraining scenario, the server has been used 40% of its lifetime, thereby undercutting the cloud on price ($50 versus $66 is a 25% cost saving).
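The breakeven logic can be reproduced with the rounded figures quoted above: a cloud rate of roughly $66 per server-hour, and a dedicated cluster that works out at about $20 per lifetime hour when fully utilized (a figure derived here for illustration, not an exact model input), as in the sketch below.

```python
CLOUD_RATE = 66.0           # $ per server-hour on demand (approximate figure from the text)
DEDICATED_FULL_UTIL = 20.0  # $ per lifetime hour at 100% utilization (derived, illustrative)

def dedicated_rate(avg_utilization: float) -> float:
    """Effective $ per consumed server-hour for dedicated equipment."""
    return DEDICATED_FULL_UTIL / avg_utilization

for u in (0.08, 0.40):
    print(f"utilization {u:.0%}: dedicated ${dedicated_rate(u):.0f}/h vs cloud ${CLOUD_RATE:.0f}/h")

breakeven = DEDICATED_FULL_UTIL / CLOUD_RATE
print(f"breakeven utilization ~ {breakeven:.0%}")   # ~30% with these rounded inputs (report: ~33%)
```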
Regular retraining would not necessarily improve the model’s performance enough to justify the increased expenditure.
Asymmetric risk
The cost impact of failing to meet the breakeven is greater than the benefit usually gained by doing so. In Figure 2, a dedicated cluster utilized 10 percentage points over the threshold makes a $16 saving per unit compared with the public cloud. But if an enterprise has been overly optimistic in its forecast, and misses the breakeven by 10 percentage points, it could have saved $30 by using public cloud.
Figure 2 Variation in unit costs by utilization, focusing on asymmetric risk
This risk of underutilizing an investment only exists with dedicated infrastructure. If an on-demand cloud resource is unused, it can be terminated at no cost. A cluster that has been purchased but is not being used is a sunk cost. In Figure 2, the dedicated infrastructure that fails to meet the threshold will continue to be more expensive than cloud until the utilization is increased. If cloud had been used, utilization would not be a consideration.
This risk is more pronounced because AI infrastructure is a brand new requirement for most enterprises. When cloud services were first launched, most organizations already had infrastructure, skills and procedures in place to manage their own data centers. As a result, there was no urgency to move to the cloud — in fact, migrating to the cloud required significant time and effort. Adoption was staggered.
Most organizations today, however, do not have existing AI infrastructure. Hype is triggering urgency and Chief Information Officers are under pressure to choose the best deployment method for AI requirements.
Using the model
The breakeven point will vary depending on an enterprise’s specific requirements. A discount on hardware list pricing or power can reduce the breakeven point, making dedicated infrastructure cheaper than the cloud, even if it is significantly underutilized.
Conversely, if a cloud buyer can reduce their costs through enterprise buying agreements or reserved instances, the breakeven point can move to the right, making the cloud cheaper than dedicated infrastructure. Cloud providers generally cut instance prices over time as demand shifts to newer instances equipped with the latest hardware.
While enterprise buyers need to conduct their own comparisons for their specific use case, the assumptions in this report provide a suitable benchmark.
Table 3 shows a list of assumptions. Rather than estimating the costs of an enterprise data center, the cluster is assumed to be hosted in colocation facilities, primarily for reasons of simplicity. Colocation service pricing aggregates the cost of building the facility and data center labor, which are difficult to analyze from scratch.
Table 3 Comparison model assumptions
Conclusion
Companies considering purchasing an AI cluster must place utilization at the heart of their calculations. What cluster capacity is needed, and how much will it be used for value-adding activity over its lifetime? The answers to these questions are not easy to determine, as they are affected by the roster of potential projects, the complexity of the models, the frequency of retraining and cluster upgrade cycles. The complexity is compounded by current hype, which makes predicting future demand and value challenging. Cloud presents a good place to start, as experimentation does not require capital investment. However, as AI requirements mature, dedicated equipment could make good financial sense, if used effectively.
Sweat dedicated GPU clusters to beat cloud on cost, by Dr. Owen Rogers, Senior Research Director for Cloud Computing, Uptime Institute (published January 30, 2025).
The advent of AI training and inference applications, combined with the continued expansion of the digital world and electrification of the economy, raises two questions about electricity generation capacity: where will the new energy be sourced from, and how can it be decarbonized?
Groups such as the International Energy Agency and the Electric Power Research Institute project that data center energy demand will rise at an annual rate of 5% or more in many markets over the next decade. Combined with mandated decarbonization and electrification initiatives, this means electricity generation capacity will need to expand to between two and three times its current level by 2050 to meet demand.
Many data center operators have promoted the need to add only carbon-free generation assets to increase capacity. This is going to be challenging: meaningful deployment of carbon-free generation, including energy from geothermal sources and small modular reactors (SMRs), and of battery capacity, such as long duration energy storage (LDES), is at least five years away. Given the current state of technology development and deployment of manufacturing capacity, it will likely take at least 10 years before they are widely used on the electricity grid in most regions. This means that natural gas generation assets will have to be included in the grid expansion to maintain grid reliability.
Impact of wind / solar on grid reliability
To demonstrate the point, consider the following example. Figure 1 depicts the current generation capacity in Germany under five different weather and time-of-day scenarios (labeled scenarios A1 to A5). Table 1 provides the real-time capacity factors for the wind, solar and battery assets used for these scenarios. German energy demand is 65 GW in summer (not shown) and 78 GW in winter (shown by the dotted line). The blue line is the total available generation capacity.
Figure 1. Grid generation capacity under different weather and time of day scenarios
Table 1. Generation capacity factors for scenarios A and B
Scenario A details the total available generation capacity of the German electricity grid in 2023. The grid has sufficient dispatchable generation to maintain grid reliability. It also has enough grid interconnects to import and export electricity production to address the over- or under-production due to the variable output of the wind and solar generation assets. An example of energy importation is the 5 GW of nuclear generation capacity, which comes from France.
Scenario A1 depicts the available generation capacity based on the average capacity factors of the different generation types. Fossil fuel and nuclear units typically have a 95% or greater capacity factor because they are only offline for minimal periods during scheduled maintenance. Wind and solar assets have much lower average capacity factors (see Table 1) due to the vagaries of the weather. Lithium-ion batteries have limited capacity because they discharge for two to four hours and a typical charge / discharge cycle is one day. As a result, the average available capacity for Germany is only half of the installed capacity.
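As a worked illustration of how capacity factors shrink the installed base, the sketch below multiplies each generation type by an assumed average capacity factor. The installed capacities and factors are hypothetical placeholders rather than the figures from Table 1, but they show why the average available capacity can be only around half of the nameplate total.

```python
# Hypothetical installed capacity (GW) and average capacity factors; placeholder
# values for illustration, not the actual German figures from Table 1.
fleet = {
    "fossil and nuclear": (95, 0.95),
    "wind":               (60, 0.30),
    "solar":              (80, 0.14),
    "battery":            (10, 0.20),
}

installed = sum(gw for gw, _ in fleet.values())
available = sum(gw * cf for gw, cf in fleet.values())
print(f"installed {installed} GW, average available {available:.0f} GW "
      f"({available / installed:.0%} of nameplate)")
```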
Grid stress
The average capacity only tells half the story, because the output from wind, solar and battery energy sources varies between zero and maximum capacity based on weather conditions and battery charge / discharge cycles.
Scenarios A2 and A3 illustrate daytime and nighttime situations with high solar and wind output. In these scenarios, the available generation capacity significantly exceeds electricity demand.
In scenario A2, the 139 GW of available wind and solar assets enable Germany to operate on 100% carbon-free energy, export energy to other grid regions and charge battery systems. Fossil fuel units will be placed in their minimal operating condition to minimize output but ensure availability as solar assets ramp down production in the evening. Some solar and wind-generating assets will likely have to be curtailed (disconnected from the grid) to maintain grid stability.
In scenario A3, the wind and fossil fuel assets provide sufficient generation capacity to match the demand. The output of the fossil fuel assets can be adjusted as the wind output modulates to maintain grid balance and reliability. Discharging the batteries and importing or exporting electricity to other grid regions are unnecessary.
Scenarios A4 and A5 show the challenges posed by solar and wind generation variability and limited battery discharge times (current lithium-ion battery technologies have a four-hour discharge limit). These scenarios represent the 10th to 20th percentile of wind and solar generation asset availability. Low wind and low or nonexistent solar output push the available generation capacity close to the electricity demand. If several fossil fuel assets are offline and/or imports are limited, the grid authority will have to rely on demand management capacity, battery capacity and available imports to keep the grid balanced and avoid rolling blackouts (one- to two-hour shutdowns of electricity in one or more grid sub-regions).
Demand management
The situation described above is not peculiar to Germany. Wind and solar generation assets represent the bulk of new capacity to meet new data center and electrification demand and to replace retiring fossil fuel and nuclear assets, because they are (1) mandated by many state, province and national governments and (2) the only economically viable form of carbon-free electricity generation. Unfortunately, their variable output leaves significant supply gaps, lasting hours, days or whole seasons, that cannot currently be addressed with carbon-free generation assets. These gaps will have to be filled with a combination of new and existing fossil fuel assets, existing nuclear assets (not new plants, as new conventional or small modular reactors are eight or more years away) and demand management.
As the installed solar and wind generation capacity increases to a greater percentage of supply capacity, data center operators will play a major role in demand management strategies. Several strategies are available to the utilities / data center operators, but each has its drawbacks.
Backup generation systems
Significantly, emergency generator sets will need to be used to take data centers off the grid to address generation capacity shortfalls. This strategy is being deployed in the US (California), Ireland and other data center markets. In conversations, several operators reported that their grid authority requested they run their emergency generators during the summer of 2023 to relieve grid demand and ensure stability.
As new data centers are built and deployed, operators in stressed grid regions (i.e., those with a high percentage of capacity delivered by wind and solar assets) should plan and permit their emergency generators to be operated for 25% or more of the year to support grid reliability in the face of variable wind and solar generation asset production.
Workload reduction
Demand management can also be achieved by reducing data center energy consumption. Google has reported the development of protocols to shut down non-critical workloads, such as controllable batch jobs, or shift a workload to one or more data centers in other grid regions. The reports did not provide the volume of workloads moved or the demand reductions, which suggests that they were not significant. These tools are likely only used on workloads controlled by Google, such as search workloads or work on development servers. They are unlikely to be deployed on client cloud workloads because many IT operators are uncomfortable with unnecessarily stressing their operations and processes.
An example of an IT enterprise operation that could support demand management is a financial organization running twin data centers in two different grid regions. When grid stability is threatened, it could execute its emergency processes to move all workloads to a single data center. In addition to receiving a payment for reducing demand, this would be an opportunity to test and validate its emergency workload transfer processes. While there is a strong logic to justify this choice, IT managers will likely be hesitant to agree to this approach.
Outages and reliability problems are more likely to emerge when operational changes are being made, and demand management payments from the grid authority or energy retailer will not compensate for the risk of penalties under service level agreements. The use of backup generators will likely be the preferred response, although problems, such as issues with starting and synchronizing generators or transfer switch failures, can occur when switching to generators.
New solar and wind capacity needed
The energy demands of new data centers and the broader electrification of the economy will require the installation of new electricity generation capacity in grid regions around the world. Large colocation and cloud service providers have been particularly vocal that these generation assets should be carbon-free and not involve new fossil fuel generation assets. An analysis of the impact of increasing the wind, solar and battery generation capacity on the German grid by 20% to support a 15% increase in demand reveals the inherent dangers of this position.
Figure 2 details the available grid generation capacity under the five available capacity scenarios (see Table 1) when the wind, solar and battery capacities are increased by 20%. This increase in generating capacity is assumed to support a 15% rise in energy demand, which increases winter demand to 90 GW.
Figure 2. Impact of a 20% increase in wind and solar generation
The 20% generating capacity increase delivers sufficient available capacity for periods of moderate to high wind and solar capacity (scenarios B2 and B3). There is a sufficient capacity reserve (about 10% to 15% of demand) to provide the needed generation capacity if some fossil fuel-based generators are offline or the imports of nuclear-generated electricity are not available.
However, the capacity increase does not significantly increase the available capacity at periods of low solar and wind output, putting grid stability at risk in scenarios B4 and B5. The available capacity in scenario B4 increases by only 4 GW, which is insufficient to meet the new demand or provide any capacity reserve. Under scenario B5, there is barely enough capacity to provide a sufficient reserve (about 10% to 15% of capacity). In both cases, grid stability is at risk and some combination of demand management, battery capacity and imports will be required.
Until reliable, dispatchable carbon-free electricity technologies, such as SMRs, geothermal generation and LDES, are developed and deployed at scale, grid stability will depend on the presence of sufficient fossil fuel generation assets to match energy supply to demand. The deployed capacity will likely have to be 75% to 100% of projected demand to address three- to 10-day periods of low solar and wind output and expected seasonal variations in wind and solar production.
To enable growth and ensure reliable electricity service while increasing renewable energy generation capacity, data center operators will need to balance and likely compromise their location selection criteria, decarbonization goals, and the size of their electrical demand growth.
Managing growing power demand
Data center operators will have to reevaluate their sustainability objectives in light of the need to increase the overall grid generation capacity and the percentage of that capacity that is carbon-free generation while maintaining grid stability and reliability. Operators should consider the following options to take meaningful steps to further decarbonize the grid:
Set realistic goals for global operations to consume 80% to 90% carbon-free energy over the next decade rather than making unrealistic near-term net-zero claims. Studies have shown significant technical and economic obstacles to decarbonizing the last 10% to 20% of electricity generation while maintaining grid reliability. Most data centers currently operate on 10% to 50% (or less) carbon-free energy. Working to achieve 80% to 90% consumption of carbon-free energy by 2035 at an enterprise level is a reasonable goal that will help to support the grid’s decarbonization.
Support research, development and deployment of reliable, dispatchable carbon-free electricity generation and LDES technologies. Many large data center operators are making exemplary efforts in this area by supporting innovative, on-site operational tests and entering into prospective contracts for future purchases of energy or equipment.
Design, permit and deploy backup generation systems for 2,000 to 3,000 hours of annual operation to support grid reliability during periods of low production from wind and solar generation assets. Operators are already activating these systems to manage grid demands in some jurisdictions, such as Ireland and California in the US, that have a large installed wind or solar generation base with high variability in day-to-day and seasonal output.
Intensify efforts to improve the efficiency of the IT infrastructure. These efforts need to include the deployment of server power management functions and storage capacity optimization methods; raising the utilization of installed server, storage and network capacity; refreshing and consolidating IT equipment; and designing energy efficiency into software applications and algorithms with the same intensity and commitment given to performance and function. The IT infrastructure consumes 70% to 90% of the energy in most data centers, yet data shows that more than 60% of IT operators largely ignore the opportunities to drive IT equipment efficiency improvements.
Improve and validate techniques to move workloads from a data center in a stressed grid region to one in an unstressed grid region. Cloud workloads can be shifted to underutilized data centers in an unstressed grid region, and where a financial operator runs paired, twin data centers, the full workload can be moved to the facility that is in an unstressed grid. Data center operators will be hesitant to deploy these techniques because they can place critical operations at risk.
The data center industry can maintain its sustainability focus despite the expected growth in power demand. To do so, IT operators need to refocus on continually increasing the work delivered per unit of energy consumed by the IT infrastructure and the megawatt-hours of carbon-free energy consumed by data center operations. With available technologies, much can be done while developing the technologies needed to decarbonize the last 10% to 20% of the electricity grid.
The Uptime Intelligence View
Accelerating the growth of data center capacity and energy consumption does not have to imperil the industry’s drive toward sustainability. Instead, it requires that the sector pragmatically reassess its sustainability efforts. Sustainability strategies need to first intensify their focus on increased IT infrastructure utilization and work delivered per unit of energy consumed, and then take responsible steps to decarbonize the energy consumed by data centers while supporting necessary efforts to grow generation capacity on the grid and ensure grid stability.
Grid growth and decarbonization: An unhappy couple, by Jay Dietrich, Research Director of Sustainability, Uptime Institute (published January 8, 2025).
Many employees approach AI-based systems in the workplace with a level of mistrust. This lack of trust can slow the implementation of new tools and systems, alienate staff and reduce productivity. Data center managers can avoid this outcome by understanding the factors that drive mistrust in AI and devising a strategy to minimize them.
Perceived interpersonal trust is a key productivity driver for humans but is rarely discussed in a data center context. Researchers at the University of Cambridge in the UK have found that interpersonal trust and organizational trust have a strong correlation with staff productivity. In terms of resource allocation, lack of trust requires employees to invest time and effort organizing fail-safes to circumvent perceived risks. This takes attention away from the task at hand and results in less output.
In the data center industry, trust in AI-based decision-making has declined significantly in the past three years. In Uptime Institute’s 2024 global survey of data center managers, 42% of operators said they would not trust an adequately trained AI system to make operational decisions in the data center, which is up 18 percentage points since 2022 (Figure 1). If this decline in trust continues, it will be harder to introduce AI-based tools.
Figure 1. More operators distrust AI in 2024 than in previous years
Managers who wish to unlock the productivity gains associated with AI may need to create specific conditions to build perceived trust between employees and AI-based tools.
Balancing trust and cognitive loads
The trust-building cycle requires a level of uncertainty. In the Mayer, Davis and Schoorman trust model, this uncertainty occurs when an individual is presented with the option to transfer decision-making autonomy to another party, which, in the data center, might be an AI-based control system (see An integrative model of organizational trust). Individuals evaluate the perceived characteristics of the other party against risk to determine whether they can relinquish decision-making control. If this leads to desirable outcomes, individuals gain trust and perceive less risk in the future.
Trust toward AI-based systems can be encouraged by using specific deployment techniques. In Uptime Institute’s Artificial Intelligence and Software Survey 2024, almost half of the operators that have deployed AI capabilities report that predictive maintenance is driving their use of AI.
Researchers from Australia’s University of Technology Sydney and University of Sydney tested human interaction with AI-based predictive maintenance systems, with participants having to decide how to manage a situation with a burst water pipe under different levels of uncertainty and cognitive load (cognitive load being the amount of working memory resources used). For all participants, trust in the automatically generated suggestions was significantly higher under low cognitive loads. AI systems that communicated decision risk odds prevented trust from decreasing, even when cognitive load increased.
Without decision risk odds displayed, employees devoted more cognitive resources toward deciphering ambiguity, leaving less space in their working memory for problem solving. Interpretability of the output of AI-based systems drives trust: it allows users to understand the context of specific suggestions, alerts and predictions. If a user cannot understand how a predictive maintenance system came to a certain conclusion, they will lose trust. In this situation, productivity will stall as workers devote cognitive resources toward attempting to retrace the steps the system made.
Team dynamics
In some cases, staff who work with AI systems personify them and treat them as co-workers rather than tools. In a parallel with human social group dynamics, and the negative bias felt toward those outside one’s group (“outgroup” dynamics), staff may then lack trust in these AI systems.
AI systems can engender anxiety relating to job security and may trigger the fear of being replaced — although this is less of a factor in the data center industry, where staff are in short supply and not at high risk of losing their jobs. Nonetheless, researchers at the Institute of Management Sciences in Pakistan find that adoption of AI in general is linked with cognitive job insecurity, which threatens workers’ perceived trust in an organization.
Introduction of AI-based tools in a data center may also cause a loss in expert status for some senior employees, who might then view these tools as a threat to their identity.
Practical solutions
Although there are many obstacles to introducing AI-based tools into a human team, the solutions to mitigating them are often intuitive and psychological, rather than technological. Data center team managers can improve trust in AI technology through the following options:
Choose AI tools that demonstrate risk transparently. Display a metric for estimated prediction accuracy.
Choose AI tools that emphasize interpretability. This could include descriptions of branching logic, statistical data, metrics or other context for AI-based suggestions or decisions.
Combat outgroup bias. Arrange for trusted “ingroup” team leads to demonstrate AI tools to the rest of the group (instead of the tool vendors or those unfamiliar to the team).
Implement training throughout the AI transition process. Many employees will experience cognitive job insecurity despite being told their positions are secure. Investing in and implementing training during the AI transition process allows staff to feel a sense of control over their ability to affect the situation, minimize the gap between known and needed skills and prevent a sense of losing expert status.
Many of the solutions described above rely on social contracts — the transactional and relational agreements between employees and an organization. US psychologist Denise Rousseau (a professor at Carnegie Mellon University, Pittsburgh, PA) describes relational trust as the expectation that a company will pay back an employee’s investments through growth, benefits and job security — all factors that go beyond the rewards of a salary.
When this relational contract is broken, staff will typically shift their behavior and deprioritize long-term company outcomes in favor of short-term personal gains.
Data center team leaders can use AI technologies to strengthen or break relational contracts in their organizations. Those who consider the factors outlined above will be more successful with maintaining an effective team.
The Uptime Intelligence View
An increasing number of operators cite plans to introduce AI-based tools into their data center teams, yet surveys increasingly report a mistrust in AI. When human factors, such as trust, are well managed, AI can be an asset to any data center team. If the current downward trend in trust continues, AI systems will become harder to implement. Solutions should focus on utilizing positive staff dynamics, such as organizational trust and social contracts.
Building trust: working with AI-based tools, by Rose Weinschenk, Research Associate, Uptime Institute (published December 19, 2024).
Data center infrastructure management (DCIM) software is an important class of software that, despite some false starts, many operators regard as essential to running modern, flexible and efficient data centers. It has had a difficult history — many suppliers have struggled to meet customer requirements and adoption remains patchy. Critics argue that, because of the complexity of data center operations, DCIM software often requires expensive customization and feature development for which many operators have neither the expertise nor the budget.
This is the first of a series of reports by Uptime Intelligence exploring data center management software in 2024 — two decades or more after the first commercial products were introduced. Data center management software is a wider category than DCIM: many products are point solutions; some extend beyond a single site and others have control functions. Uptime Intelligence is referring to this category as data center management and control (DCM-C) software.
DCIM, however, remains at the core. This report identifies the key areas in which DCIM has changed over the past decade and, in future reports, Uptime Intelligence will explore the broader DCM-C software landscape.
What is DCIM?
DCIM refers to data center infrastructure management software, which collects and manages information about a data center’s IT and facility assets, resource use and operational status, often across multiple systems and distributed environments. DCIM primarily focuses on three areas:
IT asset management. This involves logging and tracking of assets in a single searchable database. This can include server and rack data, IP addresses, network ports, serial numbers, parts and operating systems.
Monitoring. This usually includes monitoring rack space, data and power (including power use by IT and connected devices), as well as environmental data (such as temperature, humidity, air flow, water and air pressure).
Dashboards and reporting. To track energy use, sustainability data, PUE and environmental health (thermal, pressure, etc.), and to monitor system performance, alerts and critical events. This may also include the ability to simulate and project forward — for example, for the purposes of capacity management.
In the past, some operators have taken the view that DCIM does not justify the investment, given its cost and the difficulty of successful implementation. However, these reservations may be product specific and can depend on the situation; many others have claimed a strong return on investment and better overall management of the data center with DCIM.
Growing need for DCIM
Uptime’s discussions with operators suggest there is a growing need for DCIM software, and related software tools, to help resolve some of the urgent operational issues around sustainability, resiliency and capacity management. The current potential benefits of DCIM include:
Improved facility efficiency and resiliency through automating IT updates and maintenance schedules, and the identification of inefficient or faulty hardware.
Improved capacity management by tracking power, space and cooling usage, and locating appropriate resources to reserve.
Procedures and rules are followed. Changes are documented systematically; asset changes are captured and stored — and permitted only if the requirements are met.
Denser IT accommodated. By identifying available space and power, it may be easier to densify racks and allocate resources to AI/machine learning and high-performance computing. The introduction of direct liquid cooling (DLC) will further complicate these environments.
Human error reduced through a higher degree of task automation, as well as improved workflows, when making system changes or updating records.
Meanwhile, there will be new requirements from customers for improved monitoring, reporting and measurement of data, including:
Monitoring of equipment performance to avoid undue wear and tear or system stress, which might reduce the risk of outages.
Shorter ride-through times may require more monitoring. For example, in the event of a major power outage, IT equipment may only have a short window of cooling supported by the UPS.
Greater variety of IT equipment (graphics processing units, central processing units, application-specific integrated circuits) may mean a less predictable, more unstable environment. Monitoring will be required to ensure that their different power loads, temperature ranges and cooling requirements are managed effectively.
Sustainability metrics (such as PUE), as well as other measurables (such as water usage effectiveness, carbon usage effectiveness and metrics to calculate Scope 1, 2 or 3 greenhouse gas emissions); a minimal calculation sketch of these ratio metrics follows this list.
Legal requirements for transparency of environmental, sustainability and resiliency data.
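The headline efficiency metrics mentioned above are simple ratios (following The Green Grid’s definitions): PUE is total facility energy over IT energy, WUE is site water use over IT energy, and CUE is energy-related carbon emissions over IT energy. The sketch below is a minimal illustration using made-up meter readings.

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT energy."""
    return total_facility_kwh / it_kwh

def wue(water_liters: float, it_kwh: float) -> float:
    """Water usage effectiveness: site water use (L) / IT energy (kWh)."""
    return water_liters / it_kwh

def cue(co2_kg: float, it_kwh: float) -> float:
    """Carbon usage effectiveness: energy-related CO2 (kg) / IT energy (kWh)."""
    return co2_kg / it_kwh

# Made-up monthly meter readings, for illustration only
it_energy, facility_energy = 1_000_000.0, 1_400_000.0   # kWh
water_used, carbon = 1_800_000.0, 490_000.0             # liters, kg CO2e

print(f"PUE {pue(facility_energy, it_energy):.2f}, "
      f"WUE {wue(water_used, it_energy):.2f} L/kWh, "
      f"CUE {cue(carbon, it_energy):.2f} kgCO2/kWh")
```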
Supplier landscape resets
In the past decade, many DCIM suppliers have reset, adapted and modernized their technology to meet customer demand. Many have now introduced mobile and browser-based offerings, colocation customer portals and better metrics tracking, data analytics, cloud and software as a service (SaaS).
Customers are also demanding more vendor-agnostic DCIM software. Operators have sometimes struggled with DCIM’s inability to work with existing building management systems from other vendors, which then requires additional costly work on application programming interfaces and integration. Some operators have noted that DCIM software from one specific vendor still only provides out-of-the-box monitoring for that vendor’s own brand of equipment. These concerns have influenced (and continue to influence) customer buying decisions.
Adaptation has been difficult for some of the largest DCIM suppliers, and some organizations have now exited the market. Vertiv, one of the largest data center equipment vendors, discontinued Trellis in 2021 in a significant exit: customers found Trellis too large and complex for most implementations. Even today, operators continue to migrate off Trellis onto other DCIM systems.
Other structural changes in the DCIM market include Carrier and Schneider Electric acquiring Nlyte and Aveva, respectively, and Sunbird spinning out from hardware vendor Raritan (Legrand) (see Table 1).
Table 1. A selection of current DCIM suppliers
There is a growing number of independent software vendors offering DCIM, each with different specialisms. For example, Hyperview is solely cloud-based, RiT Tech focuses on universal data integration, while Device42 specializes in IT asset discovery. Independent vendors benefit those unwilling to acquire DCIM software and data center equipment from the same supplier.
Those DCIM software businesses that have been acquired by equipment vendors are typically kept at arm’s length. Schneider and Carrier both retain the Aveva and Nlyte brands and culture to preserve their differentiation and independence.
There are many products in the data center management area that are sometimes — in Uptime’s view — labeled incorrectly as DCIM. These include products that offer discrete or adjacent DCIM capabilities, such as: Vertiv Environet Alert (facility monitoring); IBM Maximo (asset management); AMI Data Center Manager (server monitoring); Vigilent (AI-based cooling monitoring and control); and EkkoSense (digital twin-based cooling optimization). Uptime views these as part of the wider DCM-C category, which will be discussed in a future report in this series.
Attendees at Uptime network member events between 2013 and 2020 may recall that complaints about DCIM products, implementation, integration and pricing were a regular feature. Much of the early software was market driven and fragile, and suffered from performance issues, but DCIM software has undoubtedly improved from where it was a decade ago.
The next sections of this report discuss areas in which DCIM software has improved and where there is still room for improvement.
Modern development techniques
Modern development techniques, such as continuous integration / continuous delivery and agile / DevOps, have encouraged a regular cadence of new releases and updates. Containerized applications have introduced modular DCIM, while SaaS has provided greater pricing and delivery flexibility.
Modularity
DCIM is no longer a monolithic software package. Previously, it was purchased as a core bundle, but now DCIM is more modular, with add-ons that can be purchased as required. This may make DCIM more cost-effective, with operators able to assess the return on investment more accurately before committing to further investment. Ten years ago, the main customers for DCIM were enterprises, with control over IT — but limited budgets. Now, DCIM customers are more likely to be colocation providers, which have more specific requirements and little interest in the IT, and which probably require more modular, targeted solutions with links into their own commercial systems.
SaaS
SaaS brings subscription-based pricing, offering greater flexibility and visibility of costs. This differs from traditional DCIM license and support pricing, which typically locked customers in for minimum-term contracts. Since SaaS is subscription-based, there is more onus on the supplier to respond to customer requests in a timely manner. While some DCIM vendors offer cloud-hosted versions of their products, most operators still opt for on-premises DCIM deployments, due to perceived data and security concerns.
IT and software integrations
Historically, DCIM suffered from configurability, responsiveness and integration issues. In recent years, more effort has been made toward third-party software and IT integration and encouraging better data sharing between systems. Much DCIM software now uses application programming interfaces (APIs) and industry standard protocols to achieve this:
Application programming interfaces
APIs have made it easier for DCIM to connect with third-party software, such as IT service management, IT operations management, and monitoring and observability tools, which are often used in other parts of the organization. The aim for operators is to achieve a comprehensive view across the IT and facilities landscape, and to help orchestrate requests that come in and out of the data center. Some DCIM systems, for example, come with pre-built integrations for tools, such as ServiceNow and Salesforce, that are widely used by enterprise IT teams. These enterprise tools can provide a richer set of functionalities in IT and customer management and support. They also use robotic process automation technology to automate repetitive manual tasks, such as rekeying data between systems, updating records and automating responses.
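As a simple illustration of this kind of API-level integration, the sketch below forwards a DCIM alert to an IT service management tool over a REST API. The endpoint, payload fields and token are hypothetical placeholders rather than any specific vendor’s API.

```python
# Minimal sketch of pushing a DCIM alert into an ITSM tool over a REST API.
# The endpoint, payload fields and token are hypothetical placeholders.
import json
import urllib.request

def forward_alert(alert: dict, endpoint: str, token: str) -> int:
    body = json.dumps({
        "short_description": f"DCIM alert: {alert['device']} {alert['metric']}",
        "severity": alert["severity"],
        "details": alert,
    }).encode()
    req = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:    # network call; fails without a live endpoint
        return resp.status

# Example (placeholder endpoint and token):
# forward_alert({"device": "PDU-A12", "metric": "load 92%", "severity": "major"},
#               "https://itsm.example.com/api/incidents", token="...")
```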
IT/OT protocols
Support for a growing number of IT/OT protocols has made it easier for DCIM to connect with a broader range of IT/OT systems. This helps operators to access the data needed to meet new sustainability requirements. For example, support for the Simple Network Management Protocol (SNMP) can provide DCIM with network performance data that can be used to monitor and detect connection faults. Meanwhile, support for the Intelligent Platform Management Interface (IPMI) can enable remote monitoring of servers.
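To make the protocol point concrete, the sketch below polls a device over SNMP by shelling out to the net-snmp snmpget command-line tool (which must be installed). The host address, community string and target OID are placeholders; 1.3.6.1.2.1.1.3.0 is the standard sysUpTime object.

```python
# Minimal sketch of polling a device over SNMP via the net-snmp 'snmpget' CLI.
# Host, community string and OID below are placeholders.
import subprocess

def snmp_get(host: str, oid: str, community: str = "public") -> str:
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, host, oid],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Example (placeholder address); returns the device's uptime as reported over SNMP
# print(snmp_get("192.0.2.10", "1.3.6.1.2.1.1.3.0"))
```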
User experience has improved
DCIM provides a better user experience than a decade ago. However, operators still need to be vigilant that a sophisticated front end is not a substitute for functionality.
Visualization
Visualization for monitoring and planning has seen significant progress, with interactive 3D and augmented reality views of IT equipment, racks and data halls. Sensor data is being used, for example, to identify available capacity, hot-spots or areas experiencing over-cooling. This information is presented visually to the user, who can follow changes over time and drag and drop assets into new configurations. On the fringes of DCIM, computational fluid dynamics can visualize air flows within the facility, which can then be used to make assumptions about the impact of specific changes on the environment. Meanwhile, the increasing adoption of computer-aided design can enable operators to render accurate and dynamic digital twin simulations for data center design and engineering and, ultimately, the management of assets across their life cycle.
Better workflow automation
At a process level, some DCIM suites offer workflow management modules to help managers initiate, manage and track service requests and changes. Drag and drop workflows can help managers optimize existing processes. This has the potential to reduce data entry omissions and errors, which have always been among the main barriers to successful DCIM deployments.
Rising demand for data and AI
Growing demand for more detailed data center metrics and insights related to performance, efficiency and regulations will make DCIM data more valuable. This, however, depends on how well DCIM software can capture, store and retrieve reliable data across the facility.
Customers today require greater levels of analytical intelligence from their DCIM. Greater use of AI and machine learning (ML) could enable the software to spot patterns and anomalies, and provide next-best-action recommendations. DCIM has not fared well in this area, which has opened the door to a new generation of AI-enabled optimization tools. The Uptime report What is the role of AI in digital infrastructure management? identifies three near-term applications of ML in the data center — predictive analytics, equipment setting optimization and anomaly detection.
DCIM is making progress in sustainability data monitoring and reporting and a number of DCIM suppliers are now actively developing sustainability modules and dashboards. One supplier, for example, is developing a Scope 1, 2 and 3 greenhouse gas emissions model based on a range of datasets, such as server product performance sheets, component catalogs and the international Environmental Product Declaration (EPD) database. Several suppliers are working on dashboards that bring together all the data required for compliance with the EU’s Energy Efficiency Directive. Once functional, these dashboards could compare data centers, devices and manufacturers, as well as provide progress reports.
The Uptime Intelligence View
DCIM has matured as a software solution over the past decade. Improvements in function modularity, SaaS, remote working, integration, user experience and data analytics, have all progressed to the point where DCIM is now considered a viable and worthwhile investment. DCIM data will also be increasingly valuable for regulatory reporting requirements. Nonetheless, there remains more work to be done. Customers still have legitimate concerns about its complexity, cost and accuracy, while DCIM’s ability to apply AI and analytics — although an area of great promise — is still viewed cautiously. Even when commercial DCIM packages were less robust and functional, those operators that researched it diligently and deployed it carefully found it to be largely effective. This remains true today.
https://journal.uptimeinstitute.com/wp-content/uploads/2024/11/DCIM-past-and-present-whats-changed-featured.jpg5401030John O’Brien, Senior Research Analyst, Uptime Institute, jobrien@uptimeinstitute.comhttps://journal.uptimeinstitute.com/wp-content/uploads/2022/12/uptime-institute-logo-r_240x88_v2023-with-space.pngJohn O’Brien, Senior Research Analyst, Uptime Institute, jobrien@uptimeinstitute.com2024-11-13 15:00:002024-11-13 10:14:33DCIM past and present: what’s changed?
Direct liquid cooling (DLC), including cold plate and immersion systems, is becoming more common in data centers — but so far this transition has been gradual and unevenly distributed with some data centers using it widely, others not at all. The use of DLC in 2024 still accounts for only a small minority of the world’s IT servers, according to the Uptime Institute Cooling Systems Survey 2024. The adoption of DLC remains slow in general-purpose business IT, and the most substantial deployments currently concentrate on high-performance computing applications, such as academic research, engineering, AI model development and cryptocurrency.
This year’s cooling systems survey results continue a trend: of those operators that use DLC in some form, the greatest number of operators deploy water cold plate systems, with other DLC types trailing significantly. Multiple DLC types will grow in adoption over the next few years, and many installations will be in hybrid cooling environments where they share space and infrastructure with air-cooling equipment.
Within this crowded field, water cold plates’ lead is not overwhelming. Water cold plate systems retain advantages that explain this result: ease of integration into shared facility infrastructure, a well-understood coolant chemistry, greater choice in IT and cooling equipment, and less disruption to IT hardware procurement and warranty compared with other current forms of DLC. Many of these advantages are down to its long-established history spanning decades.
This year’s cooling systems survey provides additional insights into the DLC techniques operators are currently considering. Of those data center operators using DLC, many more (64%) have deployed water cold plates than the next-highest-ranking types: dielectric-cooled cold plates (30%) and single-phase immersion systems (26%) (see Figure 1).
Figure 1. Operators currently using DLC prefer water-cooled cold plates
At present, most users say they use DLC for a small portion of their IT — typically for their densest, most difficult equipment to cool (see DLC momentum rises, but operators remain cautious). These small deployments favor hybrid approaches, rather than potentially expensive dedicated heat rejection infrastructure.
Hybrid cooling predominates — for now
Many DLC installations are in hybrid (mixed) setups in which DLC equipment sits alongside air cooling equipment in the data hall, sharing both heat transport and heat rejection infrastructure. This approach can compromise DLC’s energy efficiency advantages (see DLC will not come to the rescue of data center sustainability), but for operators with only small DLC deployments, it can be the only viable option. Indeed, when operators named the factors that make a DLC system viable, the greatest number (52%, n=392) chose ease of retrofitting DLC into their existing infrastructure.
For those operators who primarily serve mainstream business workloads, IT is rarely dense and powerful enough to require liquid cooling. Nearly half of operators (48%, n=94) only use DLC on less than 10% of their IT racks — and only one in three (33%, n=54) have heat transport and rejection equipment dedicated to their liquid-cooled IT. At this early stage of DLC growth, economics and operational risks dictate that operators prefer cooling technologies that integrate more readily into their existing space. Water cold plate systems can meet this need, despite potential drawbacks.
Cold plates are good neighbors, but not perfect
Water-cooled servers typically fit into standard racks, which simplifies deployment — especially when these servers coexist with air cooled IT. Existing racks can be reused either fully or partially loaded with water-cooled servers, and installing new racks is also straightforward.
IT suppliers prefer to sell the DLC solution integrated with their own hardware, ranging from a single server chassis to rows of racks including cooling distribution units (CDUs). Today, this approach typically favors a cold plate system, so that operators and IT teams have the broadest selection of equipment and compatible IT hardware with vendor warranty coverage.
The use of water in data center cooling has a long history. In the early years of mainframes water was used in part due to its advantageous thermal properties compared with air cooling, but also because of the need to remove heat effectively from the relatively small rooms that computers shared with staff.
Today, water cold plates are used extensively in supercomputing, handling extremely dense cabinets. Operators benefit from water’s low cost and ready availability, and many are already skilled in maintaining its chemistry (even though quality requirements for the water coolant are more stringent for cold plates compared with a facility loop).
The risk (and, in some cases, the vivid memories) of water leakage onto electrified IT components is one reason some operators are hesitant to embrace this technology, but leaks are statistically rare and there are established best practices in mitigation. However, with positive pressure cold plate loops, which is the type most deployed by operators, there is never zero risk. The possibility of water damage is perhaps the single key weakness of water cold plates driving interest in alternative dielectric DLC techniques.
In terms of thermal performance, water is not without competition. Two-phase dielectric coolants show strong performance by taking advantage of the added cooling effect from vaporization. Vendors offer this technology in the form of both immersion tanks and cold plates, with the latter edging ahead in popularity because it requires less change to products and data center operations. The downside of all engineered coolants is the added cost, as well as the environmental concerns around manufacturing and leaks.
Some in the data center industry predict two-phase cooling products will mature to capitalize on their performance potential and eventually become a major form of cooling in the world of IT. Uptime’s survey data suggests that water cold plate systems currently have a balance of benefits and risks that make practical and economic sense for a greater number of operators. But the sudden pressure on cooling and other facility infrastructure brought about by specialized hardware for generative AI will likely create new opportunities for a wider range of DLC techniques.
Outlook
Uptime’s surveys of data center operators are a useful indicator of how operators are meeting their cooling needs, among others. The data thus far suggests a gradual DLC rollout, with water cold plates holding a stable (if not overwhelming) lead. Uptime’s interviews with vendors and operators consistently paint a picture of widespread hybrid cooling environments, which incentivize cooling designs that are flexible and interoperable.
Many water cold plate systems on the market are well suited to these conditions. Looking five years ahead, densified IT for generative AI and other intensive workloads is likely to influence data center business priorities and designs more widely. DLC adoption and operator preferences for specific technology types are likely to shift in response. Pure thermal performance is key but not the sole factor. The success of any DLC technique will rely on overcoming the barriers to its adoption, availability from trusted suppliers and support for a wide range of IT configurations from multiple hardware manufacturers.
https://journal.uptimeinstitute.com/wp-content/uploads/2024/10/Water-cold-plates-lead-in-the-small-but-growing-world-of-DLC-featured.jpg5401030Jacqueline Davis, Research Analyst, Uptime Institute, jdavis@uptimeinstitute.comhttps://journal.uptimeinstitute.com/wp-content/uploads/2022/12/uptime-institute-logo-r_240x88_v2023-with-space.pngJacqueline Davis, Research Analyst, Uptime Institute, jdavis@uptimeinstitute.com2024-10-30 15:00:002024-11-13 10:21:48Water cold plates lead in the small, but growing, world of DLC
How AWS’s own silicon and software deliver cloud scalability
/in Design, Executive, Operations/by Dr. Owen Rogers, Senior Research Director for Cloud Computing, Uptime Institute, orogers@uptimeinstitute.comAmazon Web Services (AWS) was the world’s first hyperscale cloud provider, and it remains the largest today. It represents around one-third of the global market, offering more than 200 infrastructure, platform and software services across 34 regions. To efficiently deliver so many services at such a scale, AWS designs and builds much of its own hardware.
The core AWS service is Amazon EC2 (Elastic Cloud Compute), which delivers virtual machines as a service. Not only is Amazon EC2 a service for customers, but it is also the underlying, hidden foundation for AWS’s platform and software services. The technology deployed in AWS data centers is often used by its parent company, Amazon, to deliver e-commerce, streaming and other consumer capabilities.
A hyperscale cloud provider does more than just manage “someone else’s computer,” as the joke goes. At the annual AWS re:Invent conference in November 2024, one speaker stated that AWS EC2 users create around 130 million new instances daily, which is well beyond anything colocation or enterprise data centers can achieve. Managing the IT infrastructure to meet such demand requires servers and silicon specifically designed for the task. Since 2017, a core capability in AWS infrastructure has been the Nitro system, which enables such scale by offloading virtualization, networking and storage management from the server processor and onto a custom chip.
Nitro architecture
Virtualization software divides a physical server into many virtual machines. It is a vital component of the public cloud because it enables the provider to create, sell and destroy computing units purchased on demand by users.
The AWS Nitro system consists of a custom network interface card containing a system-on-chip (SoC) and a lightweight hypervisor (virtualization software layer) installed on each server. Designed by Annapurna Labs, which Amazon acquired in 2015, the hardware and firmware are developed and maintained by AWS engineering teams.
The system offloads many of the functions of software virtualization onto dedicated hardware. This offloading reduces CPU overhead, freeing up resources previously consumed by virtualization software for running customer workloads. It also offloads some security and networking functionality.
A full breakdown of Nitro’s capability is provided in Table 1.
Table 1 Features of Nitro card
AWS has millions of servers that are connected and ready to use. Nitro enables users (or applications) to provision resources and start them up securely within seconds without requiring human interaction. It also provides AWS with the ability to control and optimize its estate.
Through Nitro, AWS can manage all its servers regardless of the underlying hardware, operating system, or the AWS service provisioned upon them. Nitro allows x86 and ARM servers to be managed using the same technology, and it can also support accelerators such as Nvidia GPUs and AWS’s own Inferentia and Trainium application-specific integrated circuits for AI workloads.
Although AWS uses servers from original equipment manufacturers, such as Dell and HPE, it also designs its own, manufacturing them via original design manufacturers (ODMs), usually based in Asia. These servers are stripped of nonessential components to reduce cost overheads and optimize performance for AWS’s specific requirements, such as running its ARM-based CPU, Graviton. In addition, AWS designs its own networking equipment, which is also manufactured by ODMs, reportedly including Wiwynn and Quanta.
The Graviton CPU
Graviton is ASW’s family of ARM-based chips, designed and manufactured by Annapurna Labs. Just like Nitro, Graviton is becoming an increasingly important enabler for AWS, and the two capabilities are becoming more entwined.
The use of Graviton is growing, according to speakers at the re:Invent conference. In the past two years, 50% of AWS’s new CPU capacity has been based on Graviton. Customers can consume Graviton directly through a range of EC2 virtual machines, but AWS also utilizes Graviton to power platforms and services where the customer has no visibility to (or interest in) the underlying technology — for example, 150,000 Graviton chips power the AWS DynamoDB database service.
Graviton is also employed by the parent company: Amazon used 150,000 Graviton chips during its annual Prime Day sale to meet its e-commerce demand.
The growth in Graviton processor adoption is driven primarily by economics. Compared with instances using x86 designs by Intel and AMD, AWS prices Graviton instances lower at comparable configurations (vCPUs, memory, bandwidth) as it tries to steer customers towards its own platform.
For AWS, selling access to its own chips captures revenue that would otherwise have gone to its partners Intel and AMD. It also gives AWS a differentiator in the market and a degree of lock-in; AWS’s competitors are now offering ARM services, but Graviton is more mature and widely adopted in the cloud market.
The downside for cloud customers is that chips based on ARM instruction sets cannot run the vast library of x86 codes and have a less mature software toolchain. This makes it harder for developers to implement some features or extract optimal performance, making them unsuitable for many commercial business applications.
Nitro enhances AWS’s latest Graviton chip (version 4) by providing a secure foundation through hardware-based attestation and isolation. Graviton4 processors and Nitro chips verify each other’s identity cryptographically and establish encrypted communication channels, which helps protect workloads running on AWS from unauthorized access with minimal performance impact.
Scalable storage
Nitro also enables storage to be disaggregated from compute, making it independently scalable.
Compute and storage do not necessarily scale with each other. One application might need a lot of compute and little disk, while another might need the complete opposite. This presents a problem in a static server with a fixed capacity of compute and storage.
In a traditional storage array, a head node is a server that manages the interactions between storage users and the actual disks. A storage array is provisioned with a head node and many disks connected directly to it.
The problem with this setup is that the maximum number of disks that the array can support is decided at setup. If an array is full, a new array has to be purchased.
As the size of the array design grows, practical challenges arise. AWS scaled a single storage array to 288 drives, with the hardware holding nearly six petabytes and weighing two tons. The sheer size of the appliance meant:
To allow storage to scale independently and reliably from compute without such deployment challenges, AWS designed its own storage system, effectively utilizing Nitro as a lightweight head node.
In AWS’s method, each disk enclosure contains its own Nitro card. The Nitro card acts as a basic head node, managing the disks contained within the enclosure, and interacting with virtual machines hosted on servers elsewhere.
The primary benefits for AWS are easier maintenance and increased reliability. If a Nitro card fails, only a few drives lose connectivity, as opposed to an entire array of disks. Any failed drive can be removed from the service and a replacement added without causing downtime of the other disks or compute server. If a virtual machine goes down due to a failure of a compute server, it can be restarted elsewhere and the disks reconnected automatically, without loss of data.
The Uptime Intelligence View
Enterprises and colocation providers should focus on what the hyperscalers cannot do — supporting a wide range of hardware configured for each customer (internal or external), ensuring that hardware is secure (physically and virtually) and accessible only by that customer, and offering hands-on support tweaked to customer needs. They should also accept that customers will use the cloud for some applications simply because the hyperscalers can squeeze efficiency and provide scalability to a degree that is impossible for most organizations. Colocations and private facilities should enable the use of both on-premises and cloud infrastructure for their applications.
Sweat dedicated GPU clusters to beat cloud on cost
/in Executive, Operations/by Dr. Owen Rogers, Senior Research Director for Cloud Computing, Uptime Institute, orogers@uptimeinstitute.comOver the past year, demand for GPUs to train generative AI models has soared. Some organizations have invested in GPU clusters costing millions of dollars for this purpose. Cloud services offered by the major hyperscalers and a new wave of GPU-focused cloud providers deliver an alternative to dedicated infrastructure for those unwilling, or unable, to purchase their own GPU clusters.
There are many factors affecting the choice between dedicated and cloud-based GPUs. Among these are the ability to power and cool high density GPU clusters in a colocation facility or enterprise, the availability of relevant skills, and data sovereignty. But often, it is an economic decision. Dedicated infrastructure requires significant capital expenditure that not all companies can raise. Furthermore, many organizations are only just beginning to understand how (and if) they can use AI to create value. An investment in dedicated equipment for AI is a considerable risk, considering its uncertain future.
In contrast, cloud services can be consumed by the hour with no commitment or reservation required (although this is changing in some cases as cloud providers struggle to supply enough resources to meet demand). Companies can experiment with AI in the public cloud, eventually developing full-featured AI applications without making significant upfront investments, instead paying by the unit.
Ultimately, the public cloud consumption model allows customers to change their mind without financial consequence, by terminating resources if they are not needed and vice versa. Such flexibility is not possible when a big investment has been made in dedicated infrastructure.
Although the cloud may be more flexible than dedicated equipment, it is not necessarily cheaper. The cheapest choice depends on how the hardware is used. To beat the cloud on cost, dedicated equipment must be sweated — that is, used to its fullest potential — to obtain a maximum return on investment.
Those organizations that fail to sweat their infrastructure face higher costs. A dedicated cluster with a utilization of 40% is 25% cheaper per unit than cloud, but a utilization of just 8% makes dedicated infrastructure four times the price of cloud per unit.
Unit cost comparisons
Unit pricing is inherently built into the public cloud’s operating model: customers are charged per virtual machine, unit of storage or other consumption metric. Cloud users can purchase units spontaneously and delete them when needed. Since the provider is responsible for managing the capacity of the underlying hardware, cloud users are unaffected by how many servers the provider has — or how well they are utilized.
Dedicated equipment purchased with capital is not usually considered in terms of consumption units. However, unit costs are important because they help determine the potential return on an investment or purchase.
To compare the cost of dedicated equipment against the cost of cloud infrastructure, the total capital and operating expenses associated with dedicated equipment must be broken out into unit costs. These unit costs reflect how much a theoretical buyer would need to pay (per consumed unit) for the capital investment and operating costs to be fully repaid.
This concept is best explained hypothetically. Consider a server containing two CPUs. The server is only expected to be in service for a year. Many departments share the server, consuming units of CPU months (i.e., accessing a CPU for one month). There are 24 CPU months in a year (two CPUs x 12 months). Table 1 calculates each CPU month’s unit cost, comparing whether the server is highly utilized over its life or only partly utilized.
Table 1 Example unit cost calculations
Table 1 shows how utilization impacts the unit cost. Greater utilization equals a lower unit cost, as the investment delivers more units of value-adding capability. Lower utilization equals a higher unit cost, as the same investment delivers fewer units.
Note that utilization in this context means the average utilization across all servers, even those that are sitting idle. If the organization has purchased 10 servers in an AI cluster, but only one is being used, the average utilization is just 10%; the unused servers are wasted investments.
Dedicated infrastructure versus public cloud
The previous example shows a hypothetical scenario to demonstrate unit costs. Figure 1 shows a real unit cost comparison of Nvidia DGX H100 nodes hosted in a North Virginia (US) colocation facility against an equivalent cloud instance, averaged with pricing data from AWS, Google Cloud, Microsoft Azure, CoreWeave, Lambda Labs and Nebius (collected in January 2025). Colocation costs reflected in this calculation include power, space and server capital as described in Table 3 (later in this report). Dedicated unit costs are the same regardless of the number of DGX servers installed in a cluster. Notably, the price between the hyperscalers and smaller GPU providers varies substantially (Uptime will publish a future report on this topic).
Figure 1 Variation in cost per server-hour by average cluster utilization over server lifetime
In Figure 1, the unit costs of cloud instances are constant because users buy instances using a per-unit model. The unit costs for the dedicated infrastructure vary with utilization.
Note that this report does not provide forensic analysis applicable to all scenarios, but rather illustrates how utilization is the key metric in on-premises versus cloud comparisons.
Dedicated is cheaper at scale
According to Figure 1, there is a breakeven point at 33% where dedicated infrastructure becomes cheaper per unit than public cloud. This breakeven means that over the amortization period, a third of the cluster’s capacity must be consumed for it to be cheaper than the cloud. The percentage may seem low, but it might be challenging to achieve in practice due to a multitude of factors such as the size of the model, network architecture and choice of software (Uptime will publish a future report explaining these factors).
Table 2 shows two different scenarios for training. In one scenario, a model is fine-tuned every quarter; in the other, the same model is fully retrained every other month.
Table 2 How training cycles impact utilization
When the dedicated cluster is being used for regular retraining, utilization increases, thereby lowering unit costs. However, when the cluster is only occasionally fine-tuning the model, utilization decreases, increasing unit costs.
In the occasional fine-tuning scenario, utilization of the dedicated infrastructure is just 8%. Using the dedicated equipment for an hour costs $250, compared with $66 for cloud, a cost increase of almost 300%. However, in the regular retraining scenario, the server has been used 40% of its lifetime, thereby undercutting the cloud on price ($50 versus $66 is a 25% cost saving).
Regular retraining would not necessarily improve the model’s performance enough to justify the increased expenditure.
Asymmetric risk
The cost impact of failing to meet the breakeven is greater than the benefit usually gained by doing so. In Figure 2, a dedicated cluster utilized 10 percentage points over the threshold makes a $16 saving per unit compared with the public cloud. But if an enterprise has been overly optimistic in its forecast, and misses the breakeven by 10 percentage points, it could have saved $30 by using public cloud.
Figure 2 Variation in unit costs by utilization, focusing on asymmetric risk
This risk of underutilizing an investment only exists with dedicated infrastructure. If an on-demand cloud resource is unused, it can be terminated at no cost. A cluster that has been purchased but is not being used is a sunk cost. In Figure 2, the dedicated infrastructure that fails to meet the threshold will continue to be more expensive than cloud until the utilization is increased. If cloud had been used, utilization would not be a consideration.
This risk is more pronounced because AI infrastructure is a brand new requirement for most enterprises. When cloud services were first launched, most organizations already had infrastructure, skills and procedures in place to manage their own data centers. As a result, there was no urgency to move to the cloud — in fact, migrating to the cloud required significant time and effort. Adoption was staggered.
Most organizations today, however, do not have existing AI infrastructure. Hype is triggering urgency and Chief Information Officers are under pressure to choose the best deployment method for AI requirements.
Using the model
The breakeven point will vary depending on an enterprise’s specific requirements. A discount on hardware list pricing or power can reduce the breakeven point, making dedicated infrastructure cheaper than the cloud, even if it is significantly underutilized.
Conversely, if a cloud buyer can reduce their costs through enterprise buying agreements or reserved instances, the breakeven point can move to the right, making the cloud cheaper than dedicated infrastructure. Cloud providers generally cut instance prices over time as demand shifts to newer instances equipped with the latest hardware.
While enterprise buyers need to conduct their own comparisons for their specific use case, the assumptions in this report provide a suitable benchmark.
Table 3 shows a list of assumptions. Rather than estimating the costs of an enterprise data center, the cluster is assumed to be hosted in colocation facilities, primarily for reasons of simplicity. Colocation service pricing aggregates the cost of building the facility and data center labor, which are difficult to analyze from scratch.
Table 3 Comparison model assumptions
Conclusion
Companies considering purchasing an AI cluster must factor utilization at the heart of their calculations. What cluster capacity is needed, and how much will it be used for value-adding activity over its lifetime? The answers to these questions are not easy to determine, as they are affected by the roster of potential projects, the complexity of the models, the frequency of retraining and cluster upgrade cycles. The complexity is compounded by current hype, which makes predicting future demand and value challenging. Cloud presents a good place to start, as experimentation does not require capital investment. However, as AI requirements mature, dedicated equipment could make good financial sense, if used effectively.
Grid growth and decarbonization: An unhappy couple
/in Executive, Operations/by Jay Dietrich, Research Director of Sustainability, Uptime Institute, jdietrich@uptimeinstitute.comThe advent of AI training and inference applications, combined with the continued expansion of the digital world and electrification of the economy, raises two questions about electricity generation capacity: where will the new energy be sourced from, and how can it be decarbonized?
Groups, such as the International Energy Agency and the Electric Power Research Institute, project that data center energy demand will rise at an annual rate of 5% or more in many markets over the next decade. When combined with mandated decarbonization and electrification initiatives, electricity generation capacity will need to expand between two and three times by 2050 to meet this demand.
Many data center operators have promoted the need to add only carbon-free generation assets to increase capacity. This is going to be challenging: deployment of meaningful carbon-free generation, including energy from geothermal sources and small nuclear reactors (SMRs), and battery capacity, such as long duration energy storage (LDES), are at least five or more years away. Given the current state of technology development and deployment of manufacturing capacity, it will likely take at least 10 years before they are widely used on the electricity grid in most regions. This means that natural gas generation assets will have to be included in the grid expansion to maintain grid reliability.
Impact of wind / solar on grid reliability
To demonstrate the point, consider the following example. Figure 1 depicts the current generation capacity in Germany under five different weather and time of day scenarios (labeled as scenario A1 to A5). Table 1 provides the real-time capacity factors for the wind, solar and battery assets used for these scenarios. German energy demand in summer is 65 GW (not shown) and 78 GW (shown by the dotted line) in winter. The blue line is the total available generation capacity.
Figure 1. Grid generation capacity under different weather and time of day scenarios
Table 1. Generation capacity factors for scenarios A and B
Scenario A details the total available generation capacity of the German electricity grid in 2023. The grid has sufficient dispatchable generation to maintain grid reliability. It also has enough grid interconnects to import and export electricity production to address the over- or under-production due to the variable output of the wind and solar generation assets. An example of energy importation is the 5 GW of nuclear generation capacity, which comes from France.
Scenario A1 depicts the available generation capacity based on the average capacity factors of the different generation types. Fossil fuel and nuclear units typically have a 95% or greater capacity factor because they are only offline for minimal periods during scheduled maintenance. Wind and solar assets have much lower average capacity factors (see Table 1) due to the vagaries of the weather. Lithium-ion batteries have limited capacity because they discharge for two to four hours and a typical charge / discharge cycle is one day. As a result, the average available capacity for Germany is only half of the installed capacity.
Grid stress
The average capacity only tells half the story because the output from wind, solar and battery energy sources vary between zero and maximum capacity based on weather conditions and battery charge / discharge cycles.
Low wind and low or nonexistent solar output push the available generation capacity close to the electricity demand. If several fossil fuel assets are offline and/or imports are limited, the grid authority will have to rely on demand management capacity, battery capacity and available imports to keep the grid balanced and avoid rolling blackouts (one- to two-hour shutdowns of electricity in one or more grid sub-regions).
Demand management
The scenario described above is not peculiar to Germany. Wind and solar generation assets represent the bulk of new capacity to meet new data center and electrification demand and replace retiring fossil fuel and nuclear assets as they are (1) mandated by many state, province and national governments and (2) the only economically viable form of carbon-free electricity generation. Unfortunately, their variable output leaves significant supply gaps of hours, days and seasonally that cannot currently be addressed with carbon-free generation assets. These gaps will have to be filled with a combination of new and existing fossil fuel assets, existing nuclear assets (new plants, as new conventional or small modular nuclear reactors are eight years or more in the future) and demand management.
As the installed solar and wind generation capacity increases to a greater percentage of supply capacity, data center operators will play a major role in demand management strategies. Several strategies are available to the utilities / data center operators, but each has its drawbacks.
Backup generation systems
Significantly, emergency generator sets will need to be utilized to take data centers off the grid to address generation capacity shortfalls. This strategy is being deployed in the US (California), Ireland and other data center markets. In conversations with operators, several report being requested by their grid authority to deploy their emergency generators to relieve grid demand and ensure stability during the summer of 2023.
As new data centers are built and deployed, operators in stressed grid regions (i.e., those with a high percentage of capacity delivered by wind and solar assets) should plan and permit their emergency generators to be operated for 25% or more of the year to support grid reliability in the face of variable wind and solar generation asset production.
Workload reduction
Demand management can also be achieved by reducing data center energy consumption. Google has reported the development of protocols to shut down non-critical workloads, such as controllable batch jobs, or shift a workload to one or more data centers in other grid regions. The reports did not provide the volume of workloads moved or the demand reductions, which suggests that they were not significant. These tools are likely only used on workloads controlled by Google, such as search workloads or work on development servers. They are unlikely to be deployed on client cloud workloads because many IT operators are uncomfortable with unnecessarily stressing their operations and processes.
An example of an IT enterprise operation that could support demand management is a financial organization running twin data centers in two different grid regions. When grid stability is threatened, it could execute its emergency processes to move all workloads to a single data center. In addition to receiving a payment for reducing demand, this would be an opportunity to test and validate its emergency workload transfer processes. While there is a strong logic to justify this choice, IT managers will likely be hesitant to agree to this approach.
Outages and reliability problems are more likely to emerge when operational changes are being made and demand management payments from the grid authority or energy retailer will not compensate the risk of penalties under service level agreements. The use of backup generators will likely be the preferred response, though problems, such as issues with starting and synchronizing generators or transfer switch failures, can be experienced when switching to generators.
New solar and wind capacity needed
The energy demands of new data centers and the broader electrification of the economy will require the installation of new electricity generation capacity in grid regions around the world. Large colocation and cloud service providers have been particularly vocal that these generation assets should be carbon-free and not involve new fossil fuel generation assets. An analysis of the impact of increasing the wind, solar and battery generation capacity on the German grid by 20% to support a 15% increase in demand reveals the inherent dangers of this position.
Figure 2 details the available grid generation capacity under the five available capacity scenarios (see Table 1) when the wind, solar and battery capacities are increased by 20%. This increase in generating capacity is assumed to support a 15% rise in energy demand, which increases winter demand to 90 GW.
Figure 2. Impact of a 20% increase in wind and solar generation
The 20% generating capacity increase delivers sufficient available capacity for periods of moderate to high wind and solar capacity (scenarios B2 and B3). There is a sufficient capacity reserve (about 10% to 15% of demand) to provide the needed generation capacity if some fossil fuel-based generators are offline or the imports of nuclear-generated electricity are not available.
However, the capacity increase does not significantly increase the available capacity at periods of low solar and wind output, putting grid stability at risk in scenarios B4 and B5. The available capacity in scenario B4 increases by only 4 GW, which is insufficient to meet the new demand or provide any capacity reserve. Under scenario B5, there is barely enough capacity to provide a sufficient reserve (about 10% to 15% of capacity). In both cases, grid stability is at risk and some combination of demand management, battery capacity and imports will be required.
Until reliable, dispatchable carbon-free electricity technologies, such as SMRs, geothermal generation and LDES, are developed and deployed at scale, grid stability will depend on the presence of sufficient fossil fuel generation assets to match energy supply to demand. The deployed capacity will likely have to be 75% to 100% of projected demand to address three- to 10-day periods of low solar and wind output and expected seasonal variations in wind and solar production.
To enable growth and ensure reliable electricity service while increasing renewable energy generation capacity, data center operators will need to balance and likely compromise their location selection criteria, decarbonization goals, and the size of their electrical demand growth.
Managing growing power demand
Data center operators will have to reevaluate their sustainability objectives in light of the need to increase the overall grid generation capacity and the percentage of that capacity that is carbon-free generation while maintaining grid stability and reliability. Operators should consider the following options to take meaningful steps to further decarbonize the grid:
The data center industry can maintain its sustainability focus despite the expected growth in power demand. To do so, IT operators need to refocus on continually increasing the work delivered per unit of energy consumed by the IT infrastructure and the megawatt-hours of carbon-free energy consumed by data center operations. With available technologies, much can be done while developing the technologies needed to decarbonize the last 10% to 20% of the electricity grid.
The Uptime Intelligence View
Accelerating the growth of data center capacity and energy consumption does not have to imperil the industry’s drive toward sustainability. Instead, it requires that the sector pragmatically reassess its sustainability efforts. Sustainability strategies need to first intensify their focus on increased IT infrastructure utilization and work delivered per unit of energy consumed, and then take responsible steps to decarbonize the energy consumed by data centers while supporting necessary efforts to grow generation capacity on the grid and ensure grid stability.
Building trust: working with AI-based tools
/in Executive, Operations/by Rose Weinschenk, Research Associate, Uptime Institute, rweinschenk@uptimeinstitute.comMany employees approach AI-based systems in the workplace with a level of mistrust. This lack of trust can slow the implementation of new tools and systems, alienate staff and reduce productivity. Data center managers can avoid this outcome by understanding the factors that drive mistrust in AI and devising a strategy to minimize them.
Perceived interpersonal trust is a key productivity driver for humans but is rarely discussed in a data center context. Researchers at the University of Cambridge in the UK have found that interpersonal trust and organizational trust have a strong correlation with staff productivity. In terms of resource allocation, lack of trust requires employees to invest time and effort organizing fail-safes to circumvent perceived risks. This takes attention away from the task at hand and results in less output.
In the data center industry, trust in AI-based decision-making has declined significantly in the past three years. In Uptime Institute’s 2024 global survey of data center managers, 42% of operators said they would not trust an adequately trained AI system to make operational decisions in the data center, which is up 18 percentage points since 2022 (Figure 1). If this decline in trust continues, it will be harder to introduce AI-based tools.
Figure 1. More operators distrust AI in 2024 than in previous years
Managers who wish to unlock the productivity gains associated with AI may need to create specific conditions to build perceived trust between employees and AI-based tools.
Balancing trust and cognitive loads
The trust-building cycle requires a level of uncertainty. In the Mayer, Davis and Shoormen trust model, this uncertainty occurs when an individual is presented with the option to transfer decision-making autonomy to another party, which, in the data center, might be an AI-based control system (see An integrative model of organization trust). Individuals evaluate perceived characteristics of the other party against risk, to determine whether they can relinquish decision-making control. If this leads to desirable outcomes, individuals gain trust and perceive less risk in the future.
Trust toward AI-based systems can be encouraged by using specific deployment techniques. In Uptime Institute’s Artificial Intelligence and Software Survey 2024, almost half of the operators that have deployed AI capabilities report that predictive maintenance is driving their use of AI.
Researchers from Australia’s University of Technology Sydney and University of Sydney tested human interaction with AI-based predictive maintenance systems, with participants having to decide how to manage a situation with a burst water pipe under different levels of uncertainty and cognitive load (cognitive load being the amount of working memory resources used). For all participants, trust in the automatically generated suggestions was significantly higher under low cognitive loads. AI systems that communicated decision risk odds prevented trust from decreasing, even when cognitive load increased.
Without decision risk odds displayed, employees devoted more cognitive resources toward deciphering ambiguity, leaving less space in their working memory for problem solving. Interpretability of the output of AI-based systems drives trust: it allows users to understand the context of specific suggestions, alerts and predictions. If a user cannot understand how a predictive maintenance system came to a certain conclusion, they will lose trust. In this situation, productivity will stall as workers devote cognitive resources toward attempting to retrace the steps the system made.
Team dynamics
In some cases, staff who work with AI systems personify them and treat them as co-workers rather than tools. Similarly to human social group dynamics, and the negative bias felt toward those outside of one’s group (“outgroup” dynamics), staff may then lack trust in these AI systems.
AI systems can engender anxiety relating to job security and may trigger the fear of being replaced — although this is less of a factor in the data center industry, where staff are in short supply and not at high risk of losing their jobs. Nonetheless, researchers at the Institute of Management Sciences in Pakistan find that adoption of AI in general is linked with cognitive job insecurity, which threatens workers’ perceived trust in an organization.
Introduction of AI-based tools in a data center may also cause a loss in expert status for some senior employees, who might then view these tools as a threat to their identity.
Practical solutions
Although there are many obstacles to introducing AI-based tools into a human team, the solutions to mitigating them are often intuitive and psychological, rather than technological. Data center team managers can improve trust in AI technology through the following options:
Many of the solutions described above rely on social contracts—the transactional and relational agreements between employees and an organization. US psychologist Denise Rousseau (a professor at Carnegie Mellon University, Pittsburgh PA) describes relational trust as the expectation that a company will pay back an employee’s investments through growth, benefits and job security — all factors that go beyond the rewards of a salary.
When this relational contract is broken, staff will typically shift their behavior and deprioritize long-term company outcomes in favor of short-term personal gains.
Data center team leaders can use AI technologies to strengthen or break relational contracts in their organizations. Those who consider the factors outlined above will be more successful with maintaining an effective team.
The Uptime Intelligence View
An increasing number of operators cite plans to introduce AI-based tools into their data center teams, yet surveys increasingly report a mistrust in AI. When human factors, such as trust, are well managed, AI can be an asset to any data center team. If the current downward trend in trust continues, AI systems will become harder to implement. Solutions should focus on utilizing positive staff dynamics, such as organizational trust and social contracts.
DCIM past and present: what’s changed?
/in Executive, Operations/by John O’Brien, Senior Research Analyst, Uptime Institute, jobrien@uptimeinstitute.comData center infrastructure management (DCIM) software is an important class of software that, despite some false starts, many operators regard as essential to running modern, flexible and efficient data centers. It has had a difficult history — many suppliers have struggled to meet customer requirements and adoption remains patchy. Critics argue that, because of the complexity of data center operations, DCIM software often requires expensive customization and feature development for which many operators have neither the expertise nor the budget.
This is the first of a series of reports by Uptime Intelligence exploring data center management software in 2024 — two decades or more after the first commercial products were introduced. Data center management software is a wider category than DCIM: many products are point solutions; some extend beyond a single site and others have control functions. Uptime Intelligence is referring to this category as data center management and control (DCM-C) software.
DCIM, however, remains at the core. This report identifies the key areas in which DCIM has changed over the past decade and, in future reports, Uptime Intelligence will explore the broader DCM-C software landscape.
What is DCIM?
DCIM refers to data center infrastructure management software, which collects and manages information about a data center’s IT and facility assets, resource use and operational status, often across multiple systems and distributed environments. DCIM primarily focuses on three areas:
In the past, some operators have taken the view that DCIM does not justify the investment, given its cost and the difficulty of successful implementation. However, these reservations may be product specific and can depend on the situation; many others have claimed a strong return on investment and better overall management of the data center with DCIM.
Growing need for DCIM
Uptime’s discussions with operators suggest there is a growing need for DCIM software, and related software tools, to help resolve some of the urgent operational issues around sustainability, resiliency and capacity management. The current potential benefits of DCIM include:
Meanwhile, there will be new requirements from customers for improved monitoring, reporting and measurement of data, including:
Supplier landscape resets
In the past decade, many DCIM suppliers have reset, adapted and modernized their technology to meet customer demand. Many have now introduced mobile and browser-based offerings, colocation customer portals and better metrics tracking, data analytics, cloud and software as a service (SaaS).
Customers are also demanding more vendor agnostic DCIM software. Operators have sometimes struggled with DCIM’s inability to work with existing building management systems from other vendors, which then requires additional costly work on application programming interfaces and integration. Some operators have noted that DCIM software from one specific vendor still only provides out-of-the-box monitoring for their own brand of equipment. These concerns have influenced (and continue to influence) customer buying decisions.
Adaptation has been difficult for some of the largest DCIM suppliers, and some organizations have now exited the market. As one of the largest data center equipment vendors, for example, Vertiv’s discontinuation of Trellis in 2021 was a significant exit: customers found Trellis too large and complex for most implementations. Even today, operators continue to migrate off Trellis onto other DCIM systems.
Other structural change in the DCIM market include Carrier and Schneider Electric acquiring Nlyte and Aveva, respectively, and Sunbird spinning out from hardware vendor Raritan (Legrand) (see Table 1).
Table 1. A selection of current DCIM suppliers
There are currently a growing number of independent service vendors currently offering DCIM, each with different specialisms. For example, Hyperview is solely cloud-based, RiT Tech focuses on universal data integration, while Device42 specialises in IT asset discovery. Independent service vendors benefit those unwilling to acquire DCIM software and data center equipment from the same supplier.
Those DCIM software businesses that have been acquired by equipment vendors are typically kept at arm’s length. Schneider and Carrier both retain the Aveva and Nlyte brands and culture to preserve their differentiation and independence.
There are many products in the data center management area that are sometimes — in Uptime’s view —labeled incorrectly as DCIM. These include products that offer discrete or adjacent DCIM capabilities, such as: Vertiv Environet Alert (facility monitoring); IBM Maximo (asset management); AMI Data Center Manager (server monitoring); Vigilent (AI-based cooling monitoring and control); and EkkoSense (digital twin-based cooling optimization). Uptime views these as part of the wider DCM-C control category, which will be discussed in a future report in this series.
Attendees at Uptime network member events between 2013 and 2020 may recall that complaints about DCIM products, implementation, integration and pricing were a regular feature. Much of the early software was market driven, fragile and suffered from performance issues, but DCIM software has undoubtedly improved from where it was a decade ago.
The next sections of this report discuss areas in which DCIM software has improved and where there is still room for improvement.
Modern development techniques
Modern development techniques, such as continuous improvement / continuous delivery and agile / DevOps have encouraged a regular cadence of new releases and updates. Containerized applications have introduced modular DCIM, while SaaS has provided greater pricing and delivery flexibility.
Modularity
DCIM is no longer a monolithic software package. Previously, it was purchased as a core bundle, but now DCIM is more modular with add-ons that can be purchased as required. This may make DCIM more cost-effective, with operators being able to more accurately assess the return on investment, before committing to further investment. Ten years ago, the main customers for DCIM were enterprises, with control over IT — but limited budgets. Now, DCIM customers are more likely to be colocation providers with more specific requirements, little interest in the IT, and probably require more modular, targeted solutions with links into their own commercial systems.
SaaS
Subscription-based pricing for greater flexibility and visibility on costs. This is different from traditional DCIM licence and support software pricing, which typically locked customers in for minimum-term contracts. Since SaaS is subscription-based, there is more onus on the supplier to respond to customer requests in a timely manner. While some DCIM vendors offer cloud-hosted versions of their products, most operators still opt for on-premises DCIM deployments, due to perceived data and security concerns.
IT and software integrations
Historically, DCIM suffered from configurability, responsiveness and integration issues. In recent years, more effort has been made toward third-party software and IT integration and encouraging better data sharing between systems. Much DCIM software now uses application programming interfaces (APIs) and industry standard protocols to achieve this:
Application programming interfaces
APIs have made it easier for DCIM to connect with third-party software, such as IT service management, IT operations management, and monitoring and observability tools, which are often used in other parts of the organization. The aim for operators is to achieve a comprehensive view across the IT and facilities landscape, and to help orchestrate requests that come in and out of the data center. Some DCIM systems, for example, come with pre-built integrations and tools, such as ServiceNow and Salesforce, that are widely used by IT enterprise teams. These enterprise tools can provide a richer set of functionalities in IT and customer management and support. They also use robotic process automation technology, to automate repetitive manual tasks, such as rekeying data between systems, updating records and automating responses.
IT/OT protocols
Support for a growing number of IT/OT protocols has made it easier for DCIM to connect with a broader range of IT/OT systems. This helps operators to access the data needed to meet new sustainability requirements. For example, support for simple network management protocol can provide DCIM with network performance data that can be used to monitor and detect connection faults. Meanwhile, support for the intelligent platform management interface can enable remote monitoring of servers.
User experience has improved
DCIM provides a better user experience than a decade ago. However, operators still need to be vigilant that a sophisticated front end is not a substitute for functionality.
Visualization
Visualization for monitoring and planning has seen significant progress, with interactive 3D and augmented reality views of IT equipment, racks and data halls.Sensor data is being used, for example, to identify available capacity, hot-spots or areas experiencing over-cooling. This information is presented visually to the user, who can follow changes over time and drag and drop assets into new configurations. On the fringes of DCIM, computational fluid dynamics can visualize air flows within the facility, which can then be used to make assumptions about the impact of specific changes on the environment. Meanwhile, the increasing adoption of computer-aided design can enable operators to render accurate and dynamic digital twin simulations for data center design and engineering and, ultimately, the management of assets across their life cycle.
Better workflow automation
At a process level, some DCIM suites offer workflow management modules to help managers initiate, manage and track service requests and changes. Drag-and-drop workflows can help managers optimize existing processes. This has the potential to reduce data entry omissions and errors, which have always been among the main barriers to successful DCIM deployments.
Rising demand for data and AI
Growing demand for more detailed data center metrics and insights related to performance, efficiency and regulations will make DCIM data more valuable. This, however, depends on how well DCIM software can capture, store and retrieve reliable data across the facility.
Customers today require greater levels of analytical intelligence from their DCIM. Greater use of AI and machine learning (ML) could enable the software to spot patterns and anomalies, and to recommend the next best action. DCIM has not fared well in this area, which has opened the door to a new generation of AI-enabled optimization tools. The Uptime report What is the role of AI in digital infrastructure management? identifies three near-term applications of ML in the data center: predictive analytics, equipment setting optimization and anomaly detection.
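To show what the third of those applications can look like in its simplest form, the sketch below flags anomalous supply-air temperature readings with an isolation forest. The data, thresholds and model settings are fabricated for illustration and do not describe any particular DCIM product's analytics.

```python
# Illustrative anomaly detection on hourly supply-air temperature readings.
# All values are synthetic; this is a generic ML example, not a vendor method.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=1)
normal = rng.normal(loc=24.0, scale=0.5, size=(500, 1))   # typical supply air, degC
spikes = np.array([[29.5], [30.2], [31.0]])                # hot-spot excursions
readings = np.vstack([normal, spikes])

model = IsolationForest(contamination=0.01, random_state=1).fit(normal)
flags = model.predict(readings)            # -1 marks an anomaly
print(readings[flags == -1].ravel())       # the injected excursions are flagged
```

The same approach extends to power, humidity or fan-speed telemetry; the harder problem for DCIM is usually the quality and completeness of the underlying sensor data rather than the model itself.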
DCIM is making progress in sustainability data monitoring and reporting, and a number of DCIM suppliers are now actively developing sustainability modules and dashboards. One supplier, for example, is developing a Scope 1, 2 and 3 greenhouse gas emissions model based on a range of datasets, such as server product performance sheets, component catalogs and the international Environmental Product Declaration (EPD) database. Several suppliers are working on dashboards that bring together all the data required for compliance with the EU's Energy Efficiency Directive. Once functional, these dashboards could compare data centers, devices and manufacturers, as well as provide progress reports.
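The arithmetic underneath such dashboards is often straightforward; the value lies in gathering reliable inputs. As a back-of-envelope sketch, a location-based Scope 2 figure is annual electricity consumption multiplied by a grid emission factor. The figures below are illustrative assumptions, not reference values for any facility or grid.

```python
# Back-of-envelope, location-based Scope 2 calculation. All inputs are
# illustrative assumptions.
annual_it_energy_mwh = 8_000           # metered IT energy for the year
pue = 1.4                              # annualized PUE
grid_factor_kg_per_kwh = 0.25          # illustrative grid emission factor

total_energy_kwh = annual_it_energy_mwh * 1_000 * pue
scope2_tonnes = total_energy_kwh * grid_factor_kg_per_kwh / 1_000
print(f"Scope 2 (location-based): {scope2_tonnes:,.0f} tCO2e")  # ~2,800 tCO2e
```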
The Uptime Intelligence View
DCIM has matured as a software solution over the past decade. Function modularity, SaaS delivery, remote working, integration, user experience and data analytics have all progressed to the point where DCIM is now considered a viable and worthwhile investment. DCIM data will also become increasingly valuable for regulatory reporting requirements. Nonetheless, more work remains to be done. Customers still have legitimate concerns about complexity, cost and accuracy, while DCIM's ability to apply AI and analytics, although an area of great promise, is still viewed cautiously. Even when commercial DCIM packages were less robust and functional, operators that researched the software diligently and deployed it carefully found it largely effective. This remains true today.
Water cold plates lead in the small, but growing, world of DLC
Direct liquid cooling (DLC), including cold plate and immersion systems, is becoming more common in data centers — but so far this transition has been gradual and unevenly distributed, with some data centers using it widely and others not at all. The use of DLC in 2024 still accounts for only a small minority of the world’s IT servers, according to the Uptime Institute Cooling Systems Survey 2024. The adoption of DLC remains slow in general-purpose business IT, and the most substantial deployments currently concentrate on high-performance computing applications, such as academic research, engineering, AI model development and cryptocurrency.
This year’s cooling systems survey results continue a trend: of those operators that use DLC in some form, the greatest number deploy water cold plate systems, with other DLC types trailing significantly. Multiple DLC types will grow in adoption over the next few years, and many installations will be in hybrid cooling environments, where they share space and infrastructure with air-cooling equipment.
Within this crowded field, water cold plates’ lead is not overwhelming. Even so, water cold plate systems retain advantages that explain the result: ease of integration into shared facility infrastructure, a well-understood coolant chemistry, greater choice of IT and cooling equipment, and less disruption to IT hardware procurement and warranties compared with other current forms of DLC. Many of these advantages stem from the technology’s long-established history, which spans decades.
This year’s cooling systems survey provides additional insights into the DLC techniques operators are currently considering. Of those data center operators using DLC, many more (64%) have deployed water cold plates than the next-highest-ranking types: dielectric-cooled cold plates (30%) and single-phase immersion systems (26%) (see Figure 1).
Figure 1. Operators currently using DLC prefer water-cooled cold plates
At present, most users say they use DLC for a small portion of their IT, typically their densest and most difficult-to-cool equipment (see DLC momentum rises, but operators remain cautious). These small deployments favor hybrid approaches rather than potentially expensive dedicated heat rejection infrastructure.
Hybrid cooling predominates — for now
Many DLC installations are in hybrid (mixed) setups in which DLC equipment sits alongside air-cooling equipment in the data hall, sharing both heat transport and heat rejection infrastructure. This approach can compromise DLC’s energy efficiency advantages (see DLC will not come to the rescue of data center sustainability), but for operators with only small DLC deployments, it can be the only viable option. Indeed, when operators named the factors that make a DLC system viable, the greatest number (52%, n=392) chose ease of retrofitting DLC into their existing infrastructure.
For those operators who primarily serve mainstream business workloads, IT is rarely dense and powerful enough to require liquid cooling. Nearly half of operators (48%, n=94) only use DLC on less than 10% of their IT racks — and only one in three (33%, n=54) have heat transport and rejection equipment dedicated to their liquid-cooled IT. At this early stage of DLC growth, economics and operational risks dictate that operators prefer cooling technologies that integrate more readily into their existing space. Water cold plate systems can meet this need, despite potential drawbacks.
Cold plates are good neighbors, but not perfect
Water-cooled servers typically fit into standard racks, which simplifies deployment — especially when these servers coexist with air-cooled IT. Existing racks can be reused, either fully or partially loaded with water-cooled servers, and installing new racks is also straightforward.
IT suppliers prefer to sell DLC solutions integrated with their own hardware, ranging from a single server chassis to rows of racks, including cooling distribution units (CDUs). Today, this approach typically favors cold plate systems, which give operators and IT teams the broadest selection of cooling equipment and compatible IT hardware with vendor warranty coverage.
The use of water in data center cooling has a long history. In the early years of mainframes, water was used partly for its advantageous thermal properties compared with air cooling, but also because of the need to remove heat effectively from the relatively small rooms that computers shared with staff.
Today, water cold plates are used extensively in supercomputing, handling extremely dense cabinets. Operators benefit from water’s low cost and ready availability, and many are already skilled in maintaining its chemistry (even though quality requirements for the water coolant are more stringent for cold plates compared with a facility loop).
The risk (and, in some cases, the vivid memories) of water leaking onto electrified IT components is one reason some operators hesitate to embrace this technology, but leaks are statistically rare and there are established best practices for mitigating them. However, with positive-pressure cold plate loops, which are the type most commonly deployed by operators, the risk is never zero. The possibility of water damage is perhaps the single biggest weakness of water cold plates, and it is driving interest in alternative, dielectric DLC techniques.
In terms of thermal performance, water is not without competition. Two-phase dielectric coolants show strong performance by taking advantage of the added cooling effect from vaporization. Vendors offer this technology in the form of both immersion tanks and cold plates, with the latter edging ahead in popularity because it requires less change to products and data center operations. The downside of all engineered coolants is the added cost, as well as the environmental concerns around manufacturing and leaks.
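A rough calculation illustrates why vaporization matters: absorbing heat as latent heat can take far less coolant mass flow than warming a single-phase liquid by a modest temperature rise. The sketch below compares the flow needed to remove 1 kW with single-phase water (10 K rise across the cold plate) against a two-phase dielectric absorbing its latent heat of vaporization; the property values are rounded, illustrative figures rather than data for any specific fluid.

```python
# Rough comparison: coolant mass flow needed to remove 1 kW of heat.
# Property values and the 10 K temperature rise are illustrative assumptions.
HEAT_W = 1_000.0

cp_water = 4_180.0           # J/(kg*K), specific heat of water
delta_t = 10.0               # K, assumed rise across a single-phase cold plate
water_flow = HEAT_W / (cp_water * delta_t)       # kg/s, sensible heat only

h_fg_dielectric = 180_000.0  # J/kg, illustrative latent heat of an engineered fluid
dielectric_flow = HEAT_W / h_fg_dielectric       # kg/s, latent heat only

print(f"Water (sensible):    {water_flow * 1000:.1f} g/s")      # ~23.9 g/s
print(f"Dielectric (latent): {dielectric_flow * 1000:.1f} g/s")  # ~5.6 g/s
```

Under these assumptions the two-phase fluid needs roughly a quarter of the mass flow, which is the performance potential vendors are pursuing; the trade-off, as noted above, is the cost and environmental profile of the engineered coolant itself.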
Some in the data center industry predict two-phase cooling products will mature to capitalize on their performance potential and eventually become a major form of cooling in the world of IT. Uptime’s survey data suggests that water cold plate systems currently have a balance of benefits and risks that make practical and economic sense for a greater number of operators. But the sudden pressure on cooling and other facility infrastructure brought about by specialized hardware for generative AI will likely create new opportunities for a wider range of DLC techniques.
Outlook
Uptime’s surveys of data center operators are a useful indicator of how operators are meeting their cooling needs, among other requirements. The data so far suggests a gradual DLC rollout, with water cold plates holding a stable, if not overwhelming, lead. Uptime’s interviews with vendors and operators consistently paint a picture of widespread hybrid cooling environments, which incentivize cooling designs that are flexible and interoperable.
Many water cold plate systems on the market are well suited to these conditions. Looking five years ahead, densified IT for generative AI and other intensive workloads is likely to influence data center business priorities and designs more widely, and DLC adoption and operator preferences for specific technology types are likely to shift in response. Pure thermal performance will be key, but not the sole factor: the success of any DLC technique will also depend on overcoming the barriers to its adoption, on availability from trusted suppliers, and on support for a wide range of IT configurations from multiple hardware manufacturers.