24×7 carbon-free energy (part one): expectations and realities

Data center operators that set net-zero goals will ultimately have to transition to 100% 24×7 carbon-free energy. But current technological limitations mean it is not economically feasible in most grid regions.

In the past decade, the digital infrastructure industry has embraced a definition of renewable energy use that combines two components: the purchase of renewable energy credits (RECs), guarantees of origin (GOs) and carbon offsets to compensate for the emissions of fossil fuel-based generation, and the direct consumption of renewable energy in the data center.

The first component is relatively easy to achieve and has not proved prohibitively expensive, so far. The second, which will increasingly be required, will not be so easy and will cause difficulty and controversy in the digital infrastructure industry (and many other industries) for years to come.

In the Uptime Institute Sustainability and Climate Change Survey 2022, a third of data center owners / operators reported buying RECs, GOs and other carbon offsets to support their renewable energy procurement and net-zero sustainability goals.

But many organizations, including some regulators and other influential bodies, are not happy that some operators are using RECs and GOs to claim their operations are carbon neutral with 100% renewable energy use. They argue that, with or without offsets, at least some of the electricity the data center consumes is still responsible for carbon dioxide (CO2) emissions — with the associated carbon intensity measured in metric tonnes (MT) of CO2 per megawatt-hour (MWh) consumed.

The trend is clear — offsets are becoming less acceptable. Many digital infrastructure operators have started to (or are planning to) refocus their sustainability objectives towards 100% 24×7 carbon-free energy (CFE) consumption — consuming CFE for every hour of operation.

This Update, the first of two parts, will focus on the key challenges that data center operators face as they move to use more 24×7 CFE. Part two, 24×7 carbon-free energy: getting to 100%, will outline the steps that operators, grid authorities, utilities and other organizations should take to enable data centers to be completely powered by CFE.

Pairing a 24×7 CFE goal with a net-zero emissions goal changes an operator’s timeframe for achieving net-zero emissions, because it removes the option of relying on offsets. At present, 100% 24×7 CFE is technically and / or economically unviable in most grid regions and will not be viable in some regions until 2040 to 2050, or beyond. This is because solar and wind generation is highly variable, and the capacity and discharge time of energy storage technologies is limited. Even grid regions with high levels of renewable energy generation can only support a few facilities with 100% 24×7 CFE because they cannot produce enough MWh during periods of low solar and wind output. Consequently, this approach could be expensive and financially risky.

The fact that most grid regions cannot consistently generate enough CFE complicates net-zero carbon emissions goals. The Uptime Institute Sustainability and Climate Change Survey 2022 found that almost two-thirds of data center owners and operators expect to achieve a net-zero emissions goal by 2030 and that three-quarters expect to do so by 2035 (Figure 1). Operators will not be able to achieve these enterprise-wide goals by using 100% 24×7 CFE. They will have to rely extensively on RECs, GOs and offsets — and incur significant financial expenditure — to deliver on their promises.

Figure 1. Time period when operators expect to achieve their net-zero goals


Although achieving net-zero emissions goals through offsets may seem attractive, offset costs are rising and operators need to assess the costs and benefits of increasing renewable energy / CFE at their data centers. Over the long term, if organizations demand more CFE, they will incentivize energy suppliers to generate more, and to develop and deploy energy storage, grid management software and inter-region high voltage interconnects.

To establish a 24×7 CFE strategy, operators or their energy retailers should track and control CFE assets and energy delivery to their data centers, and include factors that affect the availability and cost of CFE MWh in procurement contracts.

Renewable energy intermittency

The MWh output of solar and wind generation assets varies by minute, hour, day and season. Figure 2 shows how wind generation varied over two days in January 2019 in the grid region of ERCOT (a grid operator in Texas, US). On January 8, wind turbines produced 32% to 48% of their total capacity (the capacity factor) and satisfied 19% to 33% of the entire grid demand. On January 14, the capacity factor fell to between 4% and 18%, and wind output met only 3% to 9% of total demand.

Figure 2. Wind generation data for two days in January 2019


To match generation to demand, ERCOT needs to maintain sufficient reliable generation capacity (typically fossil fuel-fired) to cover the hour-to-hour and day-to-day variation in wind and solar output. Meeting the intraday variations on these two days (roughly 3,000 MW on January 8 and 2,500 MW on January 14) requires the equivalent of four to five 700 MW natural gas turbines held in reserve. Further capacity is needed to cover the (roughly) 6,000 MW difference in wind output between the two days.

In 2020, wind generation in ERCOT satisfied less than 5% of grid demand for 600 hours — roughly 7% of the year. Given the number of hours of low MWh generation, only a few facilities in the grid region can aspire to 100% CFE. The challenge of significant hours of low generation exists in most grid regions that depend on wind and solar generation to meet carbon-free energy goals. This limits the ability of operators to set and achieve 100% CFE goals for specific facilities over the next 10 to 20 years.

Seasonal variations can be even more challenging. Wind generation in the Irish grid produced two to three times more gigawatt-hours (GWh) in January and February 2022 than in July and August 2022 (Figure 3, light blue line), leaving a seasonal production gap of up to 1,000 GWh. The total GWh output for each month also masks the hour-to-hour variations of wind generation (see Figure 3). These variations amplify the importance of having sufficient reliable generation available on the grid.

Figure 3. Monthly gigawatt-hours generated in Ireland by fuel type in 2022


The mechanics of a 24×7 CFE strategy

To manage a facility-level program to procure, track and consume CFE, it is necessary to understand how an electricity grid operates. Figure 4 shows a fictional electricity grid modelled on the Irish grid with the addition of a baseload 300 MW nuclear generation facility.

The operation of an electrical grid can be equated to a lake of electrons held at a specified level by a dam. A defined group of generation facilities continuously “fills” the electron lake. The electricity consumers draw electrons from the lake to power homes and commercial and industrial facilities (Figure 4). The grid operator or authority maintains the lake level (the grid potential in MW) by adding or subtracting online generation capacity (MW of generation) at the same “elevation” as the MW demand of the electricity consumers pulling electrons from the lake.

If the MWs of generation and consumption become unbalanced, the grid authority needs to quickly add or subtract generation capacity or remove portions of demand (demand response programs, brownouts and rolling blackouts) to keep the lake level balanced and prevent grid failure (a complete blackout).

Figure 4. Example of generation and demand balancing within an electricity grid


The real-time or hourly production of each generation facility supplying the grid can be matched to the hourly consumption of a grid-connected data center. The grid authority records the real-time MWh generation from each facility and the data can be made available to the public. At least two software tools, WattTime and Cleartrace, are available in some geographies to match CFE generation to facility consumption. Google, for example, uses a tracking tool and publishes periodic updates of the percentage utilization of CFE at each of its owned data centers (Figure 5).

Figure 5. Percentage of hourly carbon-free energy utilized by Google’s owned data centers

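To show what hourly matching involves in practice, the short sketch below computes an hourly CFE percentage from contracted CFE generation and facility consumption. It is a minimal illustration with hypothetical numbers, not a representation of how WattTime, Cleartrace or Google's tooling actually works.

```python
# Minimal sketch of hourly carbon-free energy (CFE) matching.
# All figures are hypothetical; real tools work from metered grid and contract data.

hourly_cfe_generation = [120, 90, 60, 20, 15, 70, 130, 160]       # MWh of contracted CFE per hour
hourly_consumption    = [100, 100, 100, 100, 100, 100, 100, 100]  # MWh of data center load per hour

# Only CFE generated in the same hour as the consumption counts toward the match.
matched = sum(min(generation, load)
              for generation, load in zip(hourly_cfe_generation, hourly_consumption))
total_load = sum(hourly_consumption)

print(f"Hourly CFE match: {matched / total_load:.0%}")
# Surplus CFE in one hour cannot offset a shortfall in another hour; that is the
# key difference between 24x7 CFE accounting and annual REC / GO accounting.
```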

Grid authorities depend on readily dispatchable generation assets to balance generation and demand. Lithium-ion batteries are a practical way to respond to rapid short-term changes (within a period of four hours or less) in wind and solar output. Because of the limited capacity and discharge time of lithium-ion batteries — the only currently available grid-scale battery technology — they cannot manage MW demand variation across 24 hours (Figure 2) or seasons (Figure 3). Consequently, grid authorities depend on large-scale fossil fuel and biomass generation, and hydroelectric where available, to maintain a reserve to meet large, time-varied MW demand fluctuations.

The grid authority can have agreements with energy consumers to reduce demand on the grid, but executing these agreements typically requires two to four hours’ notice and a day-ahead warning. These agreements cannot deliver real-time demand adjustments.

The energy capacity / demand bar charts at the bottom of Figure 4 illustrate the capacity and demand balancing process within a grid system at different times of the day and year. They show several critical points for managing a grid with significant, intermittent wind and solar generating capacity.

  • In periods of high wind output, such as January and February 2022, fossil fuel generation is lower because it is only used to meet demand when CFE options cannot. During these periods of high wind output, organizations can approach or achieve 100% CFE use at costs close to grid costs.

    Solar generation — confined to daylight hours — varies by the hour, day and season. As with wind generation, fluctuations in solar output need to be managed by dispatching batteries or reliable generation to cover short-term variations in solar intensity and longer reductions, such as on cloudy days and at night.
  • For periods of low wind output (Figure 4, July 2022 average day and night), the grid deploys reliable fossil fuel assets to support demand. These periods of low wind output prevent facilities from economically achieving 100% 24×7 CFE.
  • Nuclear plants provide a reliable, continuous supply of baseload CFE. An energy retailer could blend wind, solar and nuclear output in some grid regions to enable 100% 24×7 CFE consumption at an operating facility.

    Current nuclear generation technologies are designed to operate at a consistent MW level. They change their output over a day or more, not over minutes or hours. New advanced small modular reactor technologies adjust output over shorter periods to increase their value to the grid, but these will not be widely deployed for at least a decade.
  • For batteries to be considered a CFE asset, they must either be charged during high wind or solar output periods, or from nuclear plants. Software tools that limit charging to periods of CFE availability will be required.
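
As a simple illustration of such a rule, the sketch below only allows a battery to charge in hours when contracted CFE generation exceeds the facility load, so that the stored energy can credibly be counted as carbon free. It is a hypothetical, simplified example rather than a description of any actual charging controller.

```python
# Simplified charging rule: charge only from surplus CFE, discharge to cover shortfalls.
# All figures are hypothetical; a real controller would also respect battery power
# limits, round-trip losses and grid carbon-intensity signals.

BATTERY_CAPACITY_MWH = 200
state_of_charge = 0.0

hourly_cfe_generation = [150, 140, 60, 30, 120, 160]  # MWh per hour
hourly_consumption    = [100, 100, 100, 100, 100, 100]

for generation, load in zip(hourly_cfe_generation, hourly_consumption):
    surplus = generation - load
    if surplus > 0:
        # Charge only during a CFE surplus so the stored energy stays carbon free.
        state_of_charge = min(BATTERY_CAPACITY_MWH, state_of_charge + surplus)
    else:
        # Discharge (up to what is stored) to cover the CFE shortfall.
        state_of_charge -= min(-surplus, state_of_charge)
    print(f"generation={generation:>3} MWh  load={load} MWh  stored={state_of_charge:.0f} MWh")
```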

To approach or achieve 100% CFE with wind generation alone, data centers need to contract three to five times more wind capacity (MW) than their average demand. The Google Oklahoma (88% hourly CFE) and Iowa (97% hourly CFE) data centers (Figure 5) illustrate this. Power purchase contracts for the two locations (February 2023) show 1,012 MW of wind generation capacity purchased for the Oklahoma data center and 869 MW of wind capacity purchased for the Iowa data center. Assuming the current demand for these two data centers is 250 MW to 350 MW and 150 MW to 250 MW, respectively, the contracted wind capacity required to achieve the CFE percentages is three to five times the data center's average operating demand.
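
The ratio can be reproduced directly from these figures (the demand ranges are the assumptions stated above):

```python
# Contracted wind capacity relative to the assumed average data center demand.
sites = {
    "Oklahoma": {"wind_mw": 1012, "demand_mw": (250, 350)},  # 88% hourly CFE
    "Iowa":     {"wind_mw": 869,  "demand_mw": (150, 250)},  # 97% hourly CFE
}

for name, site in sites.items():
    low  = site["wind_mw"] / site["demand_mw"][1]  # ratio at the high end of assumed demand
    high = site["wind_mw"] / site["demand_mw"][0]  # ratio at the low end of assumed demand
    print(f"{name}: contracted wind is {low:.1f}x to {high:.1f}x average demand")
# Oklahoma: roughly 2.9x to 4.0x; Iowa: roughly 3.5x to 5.8x
```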

Table 1 lists retail energy contracts with guaranteed percentages of 24×7 CFE publicized over the past three years. These contracts are available in many markets and can provide data center operators with reliable power with a guaranteed quantity of CFE. The energy retailer takes on the risk of managing the intermittent renewable output and assuring supply continuity. In some cases, nuclear power may be used to meet the CFE commitment.

One detail absent from publicly available data for these contracts is the electricity rate. Conversations with one energy retailer indicated that these contracts charge a premium to market retail rates — the higher the percentage of guaranteed 24×7 CFE, the higher the premium. Operators can minimize the financial risk and premiums of this type of retail contract by agreeing to longer terms and lower guarantees — for example 5- to 10-year terms and 50% to 70% guaranteed 24×7 CFE. For small and medium size data center operators, a 24×7 renewables retail contract with a 50% to 70% guarantee will likely provide electric rate certainty with premiums of 5% or less (see Table 1) and minimal financial risk compared with the power purchase agreement (PPA) approach used by Alcoa (Table 2).

Table 1. Examples of 24×7 renewable energy retail contracts


Alcoa wind PPA contracts in Spain

Aluminum producer Alcoa Corporation recently procured 1.8 GW of wind PPA contracts for its smelter in Spain. Table 2 shows the financial and operational risks associated with an over-purchase of wind capacity. The smelter has 400 MW of power demand, and Alcoa estimates it would need at least 2 GW of wind capacity to achieve 100% CFE. The Alcoa contract information reported by the Wall Street Journal and detailed in Table 2 indicates that, with its 1.8 GW of PPAs, Alcoa is likely to reach more than 95% 24×7 CFE. The purchase had two objectives: to stabilize the facility's long-term electricity cost, and to produce near carbon-free aluminum.

The contracted quantity of wind power is 4.5 times the facility's power demand: based on its modelling of wind farm output, Alcoa estimates that this level of overcapacity is needed to get close to 100% 24×7 CFE. In practice, Alcoa will likely still have to buy some power on the spot market when wind output is low, but the overall emissions of the power it uses will be minimal.

Table 2 details the estimated total MWh of generation and financial implications of the PPAs. Assuming the wind farms have a capacity factor (CF) of 0.45, the PPA contracts secure more than twice as much electricity as the smelter consumes. This excess 3.6 million MWh will be sold into the electricity spot market and Alcoa’s profits or losses under the PPAs will be determined by the price at the time of generation.

The LevelTen Energy Q3 2022 PPA Price Index report was consulted for the P25 electricity rate (the rate at which 25% of available PPA contracts have a lower electricity rate) to estimate the rate of the signed PPAs: €75/MWh was selected. Two EU-wide average electricity rates, taken from the Ember European power price tracker, were chosen to bracket the potential profits and losses associated with the contracts for January 2021 (€50/MWh) and December 2022 (€200/MWh).

Table 2. Projection of power generation and economic returns for the Alcoa plant wind PPAs


The contracts have significant rewards and risks. If the December 2022 rate remains stable for a year, the agreement will generate €887 million in profit for Alcoa. Conversely, if the January 2021 price had remained stable for a year, Alcoa would have lost €180 million. The wind farms are slated to go online in 2024. The Alcoa plant will restart when the wind farms start to supply electricity to the grid. Current electricity rate projections suggest that the contracts will initially operate with a profit. However, profits are not guaranteed over the 10- to 20-year life of the PPAs.
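
A minimal sketch of the arithmetic behind these estimates, using only the assumptions stated above (1.8 GW of PPAs at €75/MWh, a 0.45 capacity factor, 400 MW of smelter demand, and average spot prices of €200/MWh and €50/MWh):

```python
# Reproduces the approximate profit / loss projection for the Alcoa wind PPAs.
# All inputs are the assumptions stated in the text, not contract data.

HOURS_PER_YEAR    = 8760
ppa_capacity_mw   = 1800
capacity_factor   = 0.45
ppa_rate_eur      = 75    # EUR/MWh (LevelTen Q3 2022 P25 estimate)
smelter_demand_mw = 400

ppa_generation_mwh = ppa_capacity_mw * capacity_factor * HOURS_PER_YEAR  # ~7.1 million MWh
smelter_load_mwh   = smelter_demand_mw * HOURS_PER_YEAR                  # ~3.5 million MWh
excess_mwh         = ppa_generation_mwh - smelter_load_mwh               # ~3.6 million MWh

for spot_eur in (200, 50):  # December 2022 and January 2021 EU average rates
    # Consumed MWh avoid purchases at the spot rate; excess MWh are sold at the spot
    # rate. Both effects net out to valuing all PPA output at (spot - PPA rate).
    annual_result_eur = ppa_generation_mwh * (spot_eur - ppa_rate_eur)
    print(f"Spot {spot_eur} EUR/MWh: {annual_result_eur / 1e6:+,.0f} million EUR per year")
# Roughly +887 million EUR at 200 EUR/MWh and -177 million EUR at 50 EUR/MWh
```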

Real-time spot market electricity rates fall toward zero during periods when wind and solar output exceeds demand, and such periods become more frequent as the total installed wind and solar generation capacity within a grid region increases. Meeting EU government commitments for a carbon-free grid requires significant wind and solar overcapacity on the overall grid. As this excess generation capacity is deployed, periods of power overproduction will increase, which is likely to depress spot market prices. There is a strong probability that Alcoa's contracts will generate losses in their later years as the electricity grid moves toward being carbon free by 2035.

The wind PPAs will provide Alcoa with two near-term revenue generators. First, Alcoa could sell its estimated 3.6 million MWh of excess GOs: bundled or unbundled GOs from its excess electricity generation should be in high demand from data center operators and other enterprises with 2030 net-zero carbon commitments, and selling them at 2022 year-end rates for Nordic hydropower GOs (€10/MWh) would realize roughly €36 million. Second, it could sell the excess electricity itself into the spot market.

Low or zero carbon aluminum will also be in high demand and command premium prices as companies seek to decarbonize their facilities or products. While the premium is uncertain, it will add to the benefits of wind power purchases and improve the economics of Alcoa operations. The Alcoa PPA contracts have many upsides, but the Alcoa CFO faces a range of possible financial outcomes inherent in this CFE strategy.

Conclusions

Deploying wind and solar generation overcapacity creates broader operational and financial challenges for grid regions. As overcapacity increases, the use of reliable fossil fuel and nuclear generation assets to maintain grid stability will decrease. As a result, these assets may not generate sufficient revenue to cover their operating and financing costs, which will force them to close. Some nuclear generation facilities in the US have already been retired early because of this.

Grid authorities, utility regulators and legislative bodies need to address these challenges. They need to plan grid capacity: this includes evaluating the shape curves of wind and solar MWh output to determine the quantities of reliable generation required to maintain grid reliability. They should target incentives at developing and deploying clean energy technologies that can boost capacity to meet demand — such as long duration energy storage, small modular nuclear reactors and hydrogen generation. Without sufficient quantities of reliable generation assets, grid stability will slowly erode, which will put data center operations at risk and potentially increase the runtime of backup generation systems beyond their permitted limits.

Cloud resiliency: plan to lose control of your planes

Cloud providers divide the technologies that underpin their services into two “planes”, each with a different architecture and availability goal. The control plane manages resources in the cloud; the data plane runs the cloud buyer’s application.

In this Update, Uptime Institute Intelligence presents research that shows control planes have poorer availability than data planes. This presents a risk to applications built using cloud-native architectures, which rely on the control plane to scale during periods of intense demand. We show how overprovisioning capacity is the primary way to reduce this risk. The downside is an increase in costs.

Data and control planes

An availability design goal is an unverified claim of service availability that is neither guaranteed by the cloud provider nor independently confirmed. Amazon Web Services (AWS), for example, states 99.99% and 99.95% availability design goals for many of its services’ data planes and control planes. Service level agreements (SLAs), which refund customers a proportion of their expenditure when resources are not available, are often based on these design goals.

Design goals and SLAs differ between control and data planes because each plane performs different tasks using different underlying architecture. Control plane availability refers to the availability of the mechanism it uses to manage and control services, such as:

  • Creating a virtual machine or other resource allocation on a physical server.
  • Provisioning a virtual network interface on that resource so it can be accessed over the network.
  • Installing security rules on the resource and setting access controls.
  • Configuring the resource with custom settings.
  • Hosting an application programming interface (API) endpoint so that users and code can programmatically manage resources.
  • Offering a management graphical user interface for operations teams to administrate their cloud estates.

Data plane availability refers to the availability of the mechanism that executes a service, such as:

  • Routing packets to and from resources.
  • Writing and reading to and from disks.
  • Executing application instructions on the server.

In practice, such separation means a control plane could be unavailable, preventing services from being created or turned off, while the data plane (and, therefore, the application) continues to operate.

Data plane and control plane issues can impact business, but a data plane problem is usually considered more significant as it would immediately affect customers already using the application. The precise impact of a data plane or control plane outage depends on the application architecture and the type of problem.

Measuring plane availability

Providers use the term “region” to describe a geographical area that contains a collection of independent availability zones (AZs), which are logical representations of data center facilities. A country may have many regions, and each region typically has two or three AZs. Cloud providers state that users must architect their applications to be resilient by distributing resources across AZs and/or regions.

To compare the respective availability of resilient cloud architectures (see Comparative availabilities of resilient cloud architectures), Uptime Institute used historical cloud provider status updates from AWS, Google Cloud and Microsoft Azure to determine historical availabilities for a simple load balancer application deployed in three architectures:

  • Two virtual machines deployed in the same AZ (“single-zone”).
  • Two virtual machines deployed in different AZs in the same region (“dual-zone”).
  • Two virtual machines deployed in different AZs in different regions (“dual-region”).

Figure 1 shows the worst-case availabilities (i.e., the worst availability of all regions and zones analyzed) for the control planes and data planes in these architectures.

Figure 1. Worst-case historical availabilities by architecture

Unsurprisingly, even in the worst-case scenario, a dual-region architecture is the most resilient, followed by a dual-zone architecture.

The data plane has a significantly higher availability than the control plane. In the dual-region category, the data plane had an availability of 99.96%, equivalent to less than 20 minutes of monthly downtime. The control plane had an availability of 99.80%, equal to more than 80 minutes of downtime — roughly five times that of the data plane. This difference is to be expected considering the different design goals and SLAs associated with control and data planes. Our research found that control plane outages do not typically happen at the same time as data plane outages.
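
The conversion from an availability percentage to downtime minutes is straightforward, as the short sketch below shows (it assumes a 30-day month; the percentages are the worst-case dual-region figures quoted above).

```python
# Convert an availability percentage into equivalent monthly downtime minutes.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month

def monthly_downtime_minutes(availability_pct: float) -> float:
    return (1 - availability_pct / 100) * MINUTES_PER_MONTH

for plane, availability in [("data plane (dual-region)", 99.96),
                            ("control plane (dual-region)", 99.80)]:
    print(f"{plane}: about {monthly_downtime_minutes(availability):.0f} minutes per month")
# 99.96% is roughly 17 minutes per month; 99.80% is roughly 86 minutes per month.
```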

However, availability in the control plane is far more consistent than in the data plane and isn’t greatly affected by the choice of architecture. This consistency is because the cloud provider manages availability and resiliency — the control plane effectively acts as a platform as a service (PaaS). Even if an organization hosts an application in a single zone, the APIs and the management interface used to administer that application are managed by the cloud provider and hosted across multiple zones and regions.

The warning for cloud customers here is that the control plane is more likely to be a point of failure than the application, and little can be done to make the control plane more resilient.

Assessing the risk

The crucial question organizations should seek to answer during risk assessments is, “What happens if we cannot add or remove resources to our application?”

For static applications that can’t scale, a control plane outage is often more of an inconvenience than a business-critical problem. Some maintenance tasks may be delayed until the outage is over, but the application will continue to run normally.

For cloud-native scalable applications, the risk is more considerable. A cloud-native application should be able to scale up and down dynamically, depending on demand, which involves creating (or terminating) a resource and configuring it to work with the application that is currently executing.

Scaling capacity in response to application demand is typically done using one or a combination of three methods:

  • Automatically, using cloud provider autoscaling services that detect a breach of a threshold of a metric (e.g., CPU utilization).
  • Automatically from an application, where code communicates with a cloud provider’s API.
  • Manually, by an administrator using a management portal or API.

The management portal, the API, the autoscaler service, and the creation or termination of the resource may all be part of the control plane.

If the application has been scaled back to cut costs but faces increased demand, any of these mechanisms can increase capacity. But if they have failed, the application will continue to run but will not accommodate the additional demand. In this scenario, the application may continue to service existing end users satisfactorily, but additional end users may have to wait. In the worst case, the application might fail catastrophically because of oversubscription.

Addressing the risk

The only real mitigation for this risk is to provision a buffer of capacity on the services that support the application. If a demanding period requires more resources, the application can immediately use this buffer, even if the control plane is out of service.

Having a buffer is a sensible approach for many reasons. Other issues can also delay the provisioning of capacity: for example, an AZ failure, or increased demand in a region causing a capacity shortage. A buffer prevents a time lag between end-user demand and resource access.

The question is how much buffer capacity to provision. A resilient architecture often automatically includes suitable buffer capacity because the application is distributed across AZs or regions. If an application is split across, say, two AZs, it is probably designed with enough buffer on each virtual machine to enable the application to continue seamlessly if one of the zones is lost. Each virtual machine has a 50% load, with the unused 50% available if the other AZ suffers an outage. In this case, no new resources can be added if the control plane fails, but the application is already operating at half load and has excess capacity available. Of course, if the other AZ fails at the same time as a control plane outage, the remaining zone will face a capacity squeeze.
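
The headroom arithmetic generalizes: for an application spread evenly across N zones, each zone must run at no more than (N-1)/N of its capacity if the loss of any one zone is to be absorbed without the control plane adding resources. A minimal sketch (the zone counts are illustrative):

```python
# Headroom needed per zone so an application survives the loss of one zone even if
# a control plane outage prevents new resources from being added. Illustrative only.

def max_safe_utilization(zones: int) -> float:
    """Highest fraction of each zone's capacity that can be used while still
    absorbing the loss of any single zone."""
    return (zones - 1) / zones

for zones in (2, 3, 4):
    safe = max_safe_utilization(zones)
    print(f"{zones} zones: run each zone at <= {safe:.0%} ({1 - safe:.0%} buffer per zone)")
# 2 zones: <= 50% (the 50/50 example above); 3 zones: <= 67%; 4 zones: <= 75%
```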

Similarly, databases may be deployed across multiple AZs using an active-active or active-failover configuration. If an AZ fails, the other database will automatically be available to support transactions, regardless of the control plane functionality.

Organizations need to be aware that this is their risk to manage. The application must be architected for control and data plane failures. Organizations also need to be mindful that there is a cost implication. As detailed in our Comparative availabilities of resilient cloud architectures report, deploying across multiple zones increases resiliency, but also increases costs by 43%. Carbon emissions, too, rise by 83%. Similarly, duplicating databases across zones increases availability — but at a price.

Summary

Organizations must consider how their applications will perform if a control plane failure prevents them adding, terminating or administrating cloud resources. The effect on static applications will be minimal because such applications can’t scale up and down in line with changing demand. However, the impact on cloud-native applications may be substantial because the application may struggle under additional demand without the capacity to scale.

The simplest solution is to provide a buffer of unused capacity to support unexpected demand. If additional capacity can’t be added because of a control plane outage, the buffer can meet additional demand in the interim.

The exact size of the buffer depends on the application and its typical demand pattern. However, most applications will already have a buffer built in so they can respond immediately to AZ or region outages. Often, this buffer will be enough to manage control plane failure risks.

Organizations face a tricky balancing act. Some end users might not get the performance they expect if the buffer is too small. If the buffer is too large, the organization pays for capacity it doesn’t need.

Server efficiency increases again — but so do the caveats

Early in 2022, Uptime Intelligence observed that the return of Moore’s law in the data center (or, more accurately, the performance and energy efficiency gains associated with it) would come with major caveats (see Moore’s law resumes — but not for all). Next-generation server technologies’ potential to improve energy efficiency, Uptime Intelligence surmised at the time, would be unlocked by more advanced enterprise IT teams and at-scale operators that can concentrate workloads for high utilization and make use of new hardware features.

Conversely, when servers built around the latest Intel and AMD chips do not have enough work, energy performance can deteriorate because of higher idle server power. Performance data published since the launch of these processors confirms this, and suggests that the traditional assumptions about server refresh cycles need a rethink.

Server efficiency back on track

First, the good news: new processor benchmarks confirm that best-in-class server energy efficiency is back in line with historical trends, following a long hiatus in the late 2010s.

The latest server chips from AMD (codenamed Genoa) deliver a major jump in core density (they integrate up to 192 cores in a standard dual-socket system) and energy efficiency potential. This is largely due to a step change in manufacturing technology by contract chipmaker Taiwan Semiconductor Manufacturing Company (TSMC). Compared with server technology from four to five years ago, this new crop of chips offers four times the performance at more than twice the energy efficiency, as measured by the Standard Performance Evaluation Corporation (SPEC). The SPEC Power benchmark simulates Java-based transaction processing logic to exercise processors and the memory subsystem, indicating compute performance and efficiency characteristics. Over the past 10 years, mainstream server technology has become six times more energy efficient based on this metric (Figure 1).

Figure 1. Best-in-class server energy performance (long-term trend)

This brings server energy performance back onto its original track before it was derailed by Intel’s protracted struggle to develop its then next-generation manufacturing technology. Although Intel continues to dominate server processor shipments due to its high manufacturing capacity, it is still fighting to recover from the crises it created nearly a decade ago (see Data center efficiency stalls as Intel struggles to deliver).

Now the bad news: server efficiency is also, less happily, following historical trends. Although the jump in performance with the latest generation of AMD server processors is sizeable, the long-term slowdown in performance improvements continues. The latest doubling of server performance took about five years; in the first half of the 2010s it took about two years, and between 2000 and 2010 far less than that. Advances in chip manufacturing have slowed for both technical and economic reasons: the science and engineering challenges behind semiconductor manufacturing are extremely difficult and the costs are huge.
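
Translated into annual rates (simple arithmetic on the figures above), the slowdown looks like this:

```python
# Implied annual improvement rates from the doubling times and the 10-year gain.

def annual_rate(total_gain: float, years: float) -> float:
    """Compound annual improvement implied by a total gain over a period."""
    return total_gain ** (1 / years) - 1

print(f"Doubling in ~2 years: {annual_rate(2, 2):.0%} per year")   # early 2010s
print(f"Doubling in ~5 years: {annual_rate(2, 5):.0%} per year")   # latest doubling
print(f"6x efficiency in 10 years: {annual_rate(6, 10):.0%} per year")
```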

An even bigger reason for the drop in development pace, however, is architectural: diminishing returns from design innovation. In the past, the addition of more cores, on-chip integration of memory controllers and peripheral device controllers all radically improved chip performance and overall system efficiency. Then server engineers increased their focus on energy performance, resulting in more efficient power supplies and other power electronics, as well as energy-optimized cooling via variable speed fans and better airflow. These major changes have reduced much of the server energy waste.

One big-ticket item in server efficiency remains: to tackle the memory (DRAM) performance and energy problems. Current memory chip technologies don’t score well on either metric — DRAM latency worsens with every generation, while energy efficiency (per bit) barely improves.

Server technology development will continue apace. Competition between Intel and AMD is energizing the market as they vie for the business of large IT buyers that are typically attracted to the economics of performant servers carrying ever larger software payloads. Energy performance is a definitive component of this. However, more intense competition is unlikely to overcome the technical boundaries highlighted above. While generic efficiency gains (lacking software-level optimization) from future server platforms will continue, the average pace of improvement is likely to slow further.

Based on the long-term trajectory, the next doubling in server performance will take five to six years, boosted by more processor cores per server, but partially offset by the growing consumption of memory power. Failing a technological breakthrough in semiconductor manufacturing, the past rates of server energy performance gains will not return in the foreseeable future.

Strong efficiency for heavy workloads

As well as the slowdown in overall efficiency gains, the profile of energy performance improvements has shifted from benefiting virtually all use cases towards favoring, sometimes exclusively, larger workloads. This trend began several years ago, but AMD’s latest generation of products offers a dramatic demonstration of its effect.

Based on the SPEC Power database, new processors from AMD offer vast processing capacity headroom (more than 65%) compared with previous generations (AMD’s 2019 and 2021 generations codenamed, respectively, Rome and Milan). However, these new processors use considerably more server power, which mutes overall energy performance benefits (Figure 2). The most significant opportunities offered by AMD’s Genoa processor are for aggressive workload consolidation (footprint compression) and running scale-out workloads such as large database engines, analytics systems and high-performance computing (HPC) applications.

Figure 2. Performance and power characteristics of recent AMD and Intel server generations

If the extra performance potential is not used, however, there is little reason to upgrade to the latest technology. SPEC Power data indicates that the energy performance of previous generations of AMD-based servers is often as good — or better — when running relatively small workloads (for example, where the application’s scaling is limited, or further consolidation is not technically practical or economical).

Figure 2 also demonstrates the size of the challenge faced by Intel: at any performance level, the current (Sapphire Rapids) and previous (Ice Lake) generations of Intel-based systems use more power — much more when highly loaded. The exception to this rule, not captured by SPEC Power, is when the performance-critical code paths of specific software are heavily optimized for an Intel platform, such as select technical, supercomputing and AI (deep neural network) applications that often take advantage of hardware acceleration features. In these specific cases, Intel’s latest products can often close the gap or even exceed the energy performance of AMD’s processors.

Servers that age well

The fact that the latest generation of processors from Intel and AMD is not always more efficient than the previous generation is the lesser problem. The greater problem is that server technology platforms released in the past two years do not offer automatic efficiency improvements over the enormous installed base of legacy servers, which are often more than five or six years old.

Figure 3 highlights the relationship between performance and power: without major workload consolidation (2:1 or higher), the business case to upgrade old servers (see blue crosses) remains dubious. This area is not fixed and will vary by application, but it approximately equals the real-world performance and power envelope of most Intel-based servers released before 2020. Typically, only the best performing, highly utilized legacy servers will warrant an upgrade without further consolidation.

Figure 3. Many older servers remain more efficient at running lighter loads

For many organizations, the processing demand in their applications is not rising fast enough to benefit from the performance and energy efficiency improvements the newer servers offer. There are many applications that, by today’s standards, only lightly exercise server resources, and that are not fully optimized for multi-core systems. Average utilization levels are often low (between 10% and 20%), because many servers are sized for expected peak demand, but spend most of their time in idle or reduced performance states.

Counterintuitively, perhaps, servers built with Intel’s previous processor generations in the second half of the 2010s (largely since 2017, using Intel’s 14-nanometer technology) tend to show better power economy when used lightly than their present-day counterparts. For many applications this means that, although they use more power when worked hard, they can conserve even more in periods of low demand.
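
The effect can be illustrated with a simple linear power model. The idle power, maximum power and relative performance figures below are assumptions chosen for illustration, not SPEC Power results; the point is that, for a fixed light workload, the machine with the lower idle power can draw less power overall even though it is slower and less efficient at full load.

```python
# Illustrative comparison: power drawn by an older and a newer server running the
# same fixed workload, using a linear power model P(u) = idle + u * (max - idle).
# Idle/max power and the performance ratio are assumptions for illustration only.

servers = {
    "older (c.2017)": {"idle_w": 45,  "max_w": 300, "relative_perf": 1.0},
    "latest":         {"idle_w": 110, "max_w": 550, "relative_perf": 3.0},
}

def power_for_workload(server: dict, workload: float) -> float:
    """Power (W) to run a workload expressed as a fraction of the older server's capacity."""
    utilization = workload / server["relative_perf"]
    return server["idle_w"] + utilization * (server["max_w"] - server["idle_w"])

for workload in (0.15, 0.5, 0.9):
    readings = ", ".join(f"{name}: {power_for_workload(server, workload):.0f} W"
                         for name, server in servers.items())
    print(f"workload = {workload:.0%} of the older server's capacity -> {readings}")
# With these assumptions the older server draws less power at light load (lower idle
# power), while the newer server wins once the load is heavy enough, and wins clearly
# when several older servers are consolidated onto one new machine.
```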

Several factors may undermine the business case for the consolidation of workloads onto fewer, higher performance systems: these include the costs and risks of migration, threats to infrastructure resiliency, and incompatibility (if the software stack is not tested or certified for the new platform). The latest server technologies may offer major efficiency gains to at-scale IT services providers, AI developers and HPC shops, but for enterprises the benefits will be fewer and harder to achieve.

The regulatory drive for data center sustainability will likely direct more attention to IT’s role soon — the lead example being the proposed recast of the EU’s Energy Efficiency Directive. Regulators, consultants and IT infrastructure owners will want proxy metrics for IT energy performance, and the age of a server is an intuitive choice. The data strongly indicates that this form of simplistic ageism is misplaced.

Data shows the cloud goes where the money is

Hyperscale cloud providers have opened numerous operating regions in all corners of the world over the past decade. The three most prominent — Amazon Web Services (AWS), Google Cloud and Microsoft Azure — now have 105 distinct regions (excluding government and edge locations) for customers to choose from to locate their applications and data. Over the next year, this will grow to 130 regions. Other large cloud providers such as IBM, Oracle and Alibaba are also expanding globally, and this trend is likely to continue.

Each region requires enormous investments in data centers, IT, software, people, and networks. The opening of a region may both develop and disrupt the digital infrastructure of the countries involved. This Update, part of Uptime Intelligence’s series of publications explaining and examining the development of the cloud, shows how investment can be tracked — and, to a degree, predicted — by looking at the size of the markets involved.

Providers use the term “region” to describe a geographical area containing a collection of independent availability zones (AZs), which are logical representations of data center facilities. A country may have many regions, with each region typically having two or three AZs. The three leading hyperscalers’ estates include more than 300 hyperscale AZs and many more data centers (including both hyperscale-owned and hyperscale-leased facilities) in operation today. Developers use AZs to build resilient applications in a single region.

The primary reason providers offer a range of regions is latency. In general, no matter how good the network infrastructure is, the further the end user is from the cloud application, the greater the delay and the poorer the end-user experience (especially on latency-sensitive applications, such as interactive gaming). Another important driver is that some cloud buyers are required to keep applications and user data in data centers in a specific jurisdiction for compliance, regulatory or governance reasons.

Figure 1 shows how many of the three largest cloud providers have regions in each country.

Figure 1. Count of the three largest cloud providers (AWS, Google, Microsoft) operating a cloud region in a country: current and planned

The economics of a hyperscale public cloud depends on scale. Implementing a cloud region of multiple AZs (and, therefore, data centers) requires substantial investment, even if it relies on colocation sites. Cloud providers need to expect enough return to justify such an investment.

To achieve this return on investment, a geographical area must have the telecommunications infrastructure to support the entire cloud region. In practical terms, the location must also be able to support the data centers themselves, providing reliable power, connectivity, security and skills.

Considering these requirements, cloud providers focus their expansion plans on economies with the largest gross domestic product (GDP). GDP measures economic activity but, more generally, is an indicator of the health of an economy. Typically, countries with a high GDP have broad and capable telecommunications infrastructure, high technology skills, robust legal and contractual frameworks, and the supporting infrastructure and supply chains required for data center implementation and operation. Furthermore, organizations in countries with higher GDPs have greater spending power and access to borrowing. In other words, they have the cash to spend on cloud applications to give the provider a high enough return on investment.

The 17 countries where all three hyperscalers currently operate cloud regions, or plan to, account for 56% of global GDP. The GDP of countries where at least one hyperscaler intends to operate is 87% of global GDP across just 40 countries (for comparison, the United Nations comprises 195 countries).

Figure 2 shows GDP against the number of hyperscalers present in a country. (The GDPs of the US and China are not shown because they are significant outliers.) The figure shows a trend: the greater a country’s GDP, the more likely a hyperscaler presence. Three countries buck this trend: Mexico, Turkey and Russia.

Figure 2. GDP against hyperscaler presence (China and US removed because of outlying GDP)

Observations

  • The US is due to grow to 24 hyperscaler cloud regions across 13 states (excluding the US government), which is substantially more than any other country. This widespread presence is because Google, Microsoft and AWS are US companies with significant experience of operating in the country. The US is the single most influential and competitive market for digital services, with a shared official language, an abundance of available land, a business-friendly environment, and relatively few differences in regulatory requirements between local authorities.
  • Despite China’s vast GDP, only two of the big three US hyperscalers operate there today: AWS and Microsoft Azure. However, unlike all other cloud regions, AWS and Microsoft regions are outsourced to Chinese companies to comply with local data protection requirements. AWS outsources its cloud regions to Sinnet and Ningxia Western Cloud Data Technology (NWCD); Azure outsources its cloud to 21Vianet. Notably, China’s cloud regions are totally isolated from all non-China cloud regions regarding connectivity, billing and governance. Google considered opening a China region in 2018 but abandoned the idea in 2020; one reason for this being a reluctance to operate through a partner, reportedly. China has its own hyperscaler clouds: Alibaba Cloud, Huawei Cloud, Tencent Cloud and Baidu AI Cloud. These hyperscalers have implemented regions beyond China and into greater Asia-Pacific, Europe, the US and the Middle East, primarily so that these China-based organizations can reach other markets.
  • Mexico has a high GDP but only one cloud region, which Microsoft Azure is currently developing. Mexico’s proximity to the US and a good international telecommunications infrastructure mean applications targeting Mexican users do not necessarily suffer significant latency. The success of the Mexico region will depend on the eventual price of cloud resources there. If Mexico does not offer substantially lower costs and higher revenues than nearby US regions (for example San Antonio in Texas, where Microsoft Azure operates), and if customers are not legally required to keep data local, Mexican users could be served from the US, despite minor latency effects and added network bandwidth costs. Uptime thinks other hyperscale cloud providers are unlikely to create new regions in Mexico in the next few years for this reason.
  • Today, no multinational hyperscaler cloud provider offers a Russia region. This is unlikely to change soon because of sanctions imposed by a raft of countries since Russia invaded Ukraine. Cloud providers have historically steered clear of Russia because of geopolitical tensions with the US and Europe. Even before the Ukraine invasion, AWS had a policy of not working with the Russian government. Other hyperscalers, such as IBM, Oracle Cloud and China’s Alibaba, are also absent from Russia. The wider Commonwealth of Independent States also has no hyperscaler presence. Yandex is Russia’s most-used search engine and the country’s key cloud provider.
  • A total of 16 European countries have either a current or planned hyperscaler presence and represent 70% of the continent’s GDP. Although latency is a driver, data protection is a more significant factor. European countries tend to have greater data protection requirements than the rest of the world, which drives the need to keep data within a jurisdiction.
  • Turkey has a high GDP but no hyperscaler presence today. This is perhaps because the country can be served, with low latency, by nearby EU regions. Data governance concerns may also be a barrier to investment. However, Turkey may be a target for future cloud provider investment.
  • Today, the three hyperscalers are only present in one African country, South Africa — even though Egypt and Nigeria have larger GDPs. Many applications aimed at a North African audience may be suitably located in Southern Europe with minimal latency. However, Nigeria could be a potential target for a future cloud. It has high GDP, good connectivity through several submarine cables, and would appeal to the central and western African markets.
  • South American cloud regions were previously restricted to Brazil, but Google now has a Chilean region, and Azure has one in the works. Argentina and Chile have high relative GDP. It would not be surprising if AWS followed suit.

Conclusions

As discussed in Cloud scalability and resiliency from first principles, building applications across different cloud providers is challenging and costly. As a result, customers will seek cloud providers that operate in all the regions they want to reach. To meet this need, providers are following the money. Higher GDP generally equates to more resilient, stable economies, where companies are likely to invest and infrastructure is readily available. The current exception is Russia. High GDP countries yet to have a cloud presence include Turkey and Nigeria.

In practice, most organizations will be able to meet most of their international needs using hyperscaler cloud infrastructure. However, they need to carefully consider where they may want to host applications in the future. Their current provider may not support a target location, but migrating to a new provider that does is often not feasible. (A future Uptime Intelligence update will further explore specific gaps in cloud provider coverage.)

There is an alternative to building data centers or using colocation providers in regions without hyperscalers: organizations seeking new markets could consider where hyperscaler cloud providers may expand next. Rather than directly tracking market demand, software vendors may launch new services when a suitable region is brought online. The cost of duplicating an existing cloud application into a new region is small (especially compared with a new data center build or multi-cloud development). Sales and technical support can often be provided remotely without an expensive in-country presence.

Similarly, colocation providers can follow the money and consider the cloud providers’ expansion plans. A location such as Nigeria, with high GDP and no hyperscaler presence (but good telecommunications infrastructure) may be ideal for data center buildouts for future hyperscaler requirements.

Colocation providers also have opportunities outside the GDP leaders. Many organizations still need local data centers for compliance or regulatory reasons, or for peace of mind, even if a hyperscaler data center is relatively close in terms of latency. In the Uptime Institute Data Center Capacity Trends Survey 2022, 44% of 65 respondents said they would use their own data center if their preferred public cloud provider was unavailable in a country, and 29% said they would use a colocation provider.

Cloud providers increasingly offer private cloud appliances that can be installed in a customer’s data center and connected to the public cloud for a hybrid deployment (e.g., AWS Outposts, VMware, Microsoft Azure Stack). Colocation providers should consider if partnerships with hyperscaler cloud providers can support hybrid cloud implementations outside the locations where hyperscalers operate.

Cloud providers have no limits in terms of country or market. If they see an opportunity to make money, they will take it. But they need to see a return on their investment. Such returns are more likely where demand is high (often where GDP is high) and infrastructure is sufficient.

Cooling to play a more active role in IT performance and efficiency

Data center operators and IT tenants have traditionally adopted a binary view of cooling performance: it either meets service level commitments, or it does not. The relationship is also coldly transactional: as long as sufficient volumes of air of the right temperature and quality (in accordance with service-level agreements that typically follow ASHRAE’s guidance) reach the IT rack, the data center facility’s mission has been accomplished. What happens after that point with IT cooling, and how it affects IT hardware, is not facilities’ business.

This practice was born in an era when the power density of IT hardware was much lower, and when server processors still had a fixed performance envelope. Processors ran at a nominal frequency, defined at the time of manufacture, under any load. This frequency was always guaranteed if sufficient cooling was available, whatever the workload.

Chipmakers guide IT system builders and customers to select the right components (heat sinks, fans) via processor thermal specifications. Every processor is assigned a power rating for the amount of heat its cooling system must be able to handle at the corresponding temperature limit. This is not theoretical maximum power but rather the maximum that can realistically be sustained (seconds or more) running real-world software. This maximum is called thermal design power (TDP).

The majority of software applications don’t stress the processor enough to get close to the TDP, even if they use 100% of the processor’s time — typically only high-performance computing code makes processors work that hard. With frequencies fixed, this results in power consumption (and thermal power) that is considerably below the TDP rating. Since the early 2000s, nominal processor speeds have tended to be limited by power rather than the maximum speed of circuitry, so for most applications there is untapped performance potential within the TDP envelope.

This gap is wider still in multicore processors when the software cannot benefit from all the cores present. This results in an even larger portion of the power budget not being used to increase application performance. The higher the core count, the bigger this gap can be unless the workload is highly multithreaded.

Processors looking for opportunities

Most server processors and accelerators that came to market in the past decade have mechanisms to address this (otherwise ever-growing) imbalance. Although implementation details differ between chipmakers (Intel, AMD, NVIDIA, IBM), they all dynamically deploy available power budget to maximize performance when and where it is needed most.

This balancing happens in two major ways: frequency scaling and management of power allocation to cores. When a modern server processor enters a phase of high utilization but remains under its thermal specification, it starts to increase supply voltage and then matches frequency in incremental steps. It continues to scale the steps until it reaches any one of the preset limits: frequency, current, power or temperature — whichever comes first.

If the workload is not evenly distributed across cores, or leaves some cores unused, the processor allocates unused power to highly utilized cores (if power was the limiting factor for their performance) to enable them to scale their frequencies even higher. The major beneficiary of independent core scaling is the vast repository of single- or lightly threaded software, but multithreaded applications also benefit where they struggle with Amdahl’s law (when the application is hindered by parts of the code that are not parallelized, so that overall performance depends largely on how fast a core can work through those segments).
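
In outline, the mechanism can be sketched as a simple control loop: raise the frequency of busy cores in small steps while the package power budget and other limits are respected, which lets a few busy cores climb much higher than a fully loaded chip. The sketch below is a deliberately crude illustration of that general behavior, not any vendor's actual algorithm; all limits and the per-GHz power cost are made-up values.

```python
# Toy model of opportunistic frequency scaling under a shared package power budget.
# Not a vendor algorithm: limits, step size and per-GHz power cost are invented.

PACKAGE_POWER_LIMIT_W = 200
TEMP_LIMIT_C = 95
BASE_FREQ_GHZ, MAX_FREQ_GHZ, STEP_GHZ = 2.0, 3.7, 0.1
POWER_PER_GHZ_W = 10  # crude per-core power cost per GHz (busy cores only)

def package_power(freqs, busy):
    # Only busy cores draw dynamic power in this toy model.
    return sum(f * POWER_PER_GHZ_W for f, b in zip(freqs, busy) if b)

def boost(busy, temperature_c):
    freqs = [BASE_FREQ_GHZ] * len(busy)
    stepped = True
    while stepped and temperature_c < TEMP_LIMIT_C:
        stepped = False
        for core, is_busy in enumerate(busy):
            if not is_busy or freqs[core] >= MAX_FREQ_GHZ:
                continue
            trial = freqs.copy()
            trial[core] = min(MAX_FREQ_GHZ, round(freqs[core] + STEP_GHZ, 1))
            if package_power(trial, busy) <= PACKAGE_POWER_LIMIT_W:
                freqs = trial  # budget left by idle cores flows to busy ones
                stepped = True
    return freqs

# Two busy cores out of eight boost far higher than eight busy cores can.
print(boost(busy=[True, True] + [False] * 6, temperature_c=70))
print(boost(busy=[True] * 8, temperature_c=70))
```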

This opportunistic behavior of modern processors means the quality of cooling, considering both supply of cold air and its distribution within the server, is not binary anymore. Considerably better cooling increases the performance envelope of the processor, a phenomenon that supercomputing vendors and users have been exploring for years. It also tends to improve overall efficiency because more work is done for the energy used.

Performance is best served cold

Better cooling unlocks performance and efficiency in two major ways:

  • The processor operates at lower temperatures (everything else being equal).
  • It can operate at higher thermal power levels.

The lowering of operational temperature through improved cooling brings many performance benefits such as enabling individual processor cores to run at elevated speeds for longer without hitting their temperature limit.

Another, likely sizeable, benefit lies in reducing static power in the silicon. Static power is power lost to leakage currents that perform no useful work, yet keep flowing through transistor gates even when they are in the “off” state. Static power was not an issue 25 years ago, but has become more difficult to suppress as transistor structures have become smaller, and their insulation properties correspondingly worse. High-performance logic designs, such as those in server processors, are particularly burdened by static power because they integrate a large number of fast-switching transistors.

Semiconductor technology engineers and chip designers have adopted new materials and sophisticated power-saving techniques to reduce leakage currents. However, the issue persists. Although chipmakers do not reveal the static power consumption of their products, it is likely to account for a considerable share of a processor’s power budget, probably a low double-digit percentage.

Various academic research papers have shown that static leakage currents depend on the temperature of silicon, but the exact profile of that correlation varies greatly across chip manufacturing technologies — such details remain hidden from the public eye.
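A commonly used simplification in that research treats leakage as growing roughly exponentially with silicon temperature. The sketch below applies that assumed form with illustrative coefficients (the reference leakage figure and the exponent are not published values) to show how cooler silicon returns part of the power budget to useful work.

```python
import math

def leakage_power_w(p_leak_ref_w, temp_c, temp_ref_c=85.0, k_per_c=0.02):
    """Assumed exponential leakage model: P_leak(T) = P_ref * exp(k * (T - T_ref)).
    The coefficient k and the reference leakage vary by process node and are
    illustrative here, not published figures."""
    return p_leak_ref_w * math.exp(k_per_c * (temp_c - temp_ref_c))

# Cooling the silicon from 85 C to 60 C with these assumed values cuts leakage
# by roughly 40%, freeing that share of the power budget for useful switching.
print(round(leakage_power_w(40.0, 85.0), 1))  # 40.0 W at the reference point
print(round(leakage_power_w(40.0, 60.0), 1))  # ~24.3 W
```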

Upgraded air coolers can measurably improve application performance when the processor is thermally limited during periods of high load, though such a speed-up tends to be in the low single digits. This can be achieved by lowering inlet air temperatures or, more commonly, by upgrading the processors’ cooling to lower its thermal resistance. Examples include: adding larger, CFD-optimized heat sinks built from alloys with better thermal conductivity (e.g., copper-based alloys); using better thermal interface materials; and introducing more powerful fans to increase airflow. If combined with better facility air delivery and lower inlet temperatures, the speed-up is higher still.
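The steady-state relationship behind these upgrades is simple: chip temperature equals inlet temperature plus dissipated power multiplied by the thermal resistance of the cooling path. The sketch below rearranges that relationship for power, with hypothetical numbers, to show how a lower thermal resistance or a cooler inlet widens the sustainable power envelope.

```python
def max_sustained_power_w(temp_limit_c, inlet_c, r_thermal_c_per_w):
    """Steady-state relation T_chip = T_inlet + P * R_th, rearranged for P.
    A simplification that ignores transient effects; all values are hypothetical."""
    return (temp_limit_c - inlet_c) / r_thermal_c_per_w

# Cutting thermal resistance from 0.25 C/W to 0.18 C/W at a 25 C inlet and a
# 95 C limit lifts sustainable power from 280 W to ~389 W; lowering the inlet
# to 20 C with the better cooler lifts it further, to ~417 W.
print(max_sustained_power_w(95, 25, 0.25))  # 280.0
print(max_sustained_power_w(95, 25, 0.18))  # ~388.9
print(max_sustained_power_w(95, 20, 0.18))  # ~416.7
```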

No silver bullets, just liquid cooling

But the markedly lower thermal resistance, and the consequently lower silicon temperature, that direct liquid cooling (DLC) brings make a more pronounced difference. Compared with air coolers at the same temperature, DLC (cold plate and immersion) can free up more power by reducing the temperature-dependent component of static leakage currents.

There is an even bigger performance potential in the better thermal properties of liquid cooling: prolonging the time that server processors can spend in controlled power excursions above their TDP level, without hitting critical temperature limits. This behavior, now common in server processors, is designed to offer bursts of extra performance, and can result in a short-term (tens of seconds) heat load that is substantially higher than the rated cooling requirement.

Typically, excursions reach 15% to 25% above the TDP, which did not previously pose a major challenge. However, in the latest generation of products from AMD and Intel, this results in up to 400 watts (W) and 420 W, respectively, of sustained thermal power per processor — up from less than 250 W about five years ago.
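As a rough illustration of the arithmetic (the 350 W TDP figure below is hypothetical, not a specific product rating), a 20% excursion above TDP briefly presents the cooling system with a heat load well above its nominal rating.

```python
def excursion_power_w(tdp_w, excursion_pct):
    """Short-term thermal power during a controlled excursion above TDP."""
    return tdp_w * (1.0 + excursion_pct / 100.0)

# A hypothetical 350 W TDP processor allowed a 20% excursion briefly presents
# a 420 W heat load that the cooling path must absorb for tens of seconds.
print(excursion_power_w(350, 20))  # 420.0
```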

Such high-power levels are not exclusive to processor models aimed at high-performance computing applications: a growing number of mainstream processor models intended for cloud, hosting and enterprise workload consolidation can have these demanding thermal requirements. The favorable economics of higher performance servers (including their energy efficiency across an array of applications) generates demand for powerful processors.

Although these TDPs and power excursion levels are still manageable with air when using high-performance heat sinks (at the cost of rack density because of very large heat sinks, and lots of fan power), peak performance levels will start to slip out of reach for standard air cooling. Server processor development roadmaps call for even more powerful models in the coming years, probably reaching 600 W in thermal excursion power by the mid-2020s.

As processor power escalates and temperature limits grow more restrictive, even DLC temperature choices will be a growing trade-off dilemma as data center and IT infrastructure operators try to balance capital costs, cooling performance, energy efficiency and sustainability credentials. Inevitably, the relationship between data center cooling, server performance and overall IT efficiency will demand more attention.

The effects of a failing power grid in South Africa

European countries narrowly avoided an energy crisis in the past winter months, as a shortfall in fossil fuel supplies from Russia threatened to destabilize power grids across the region. This elevated level of risk to the normally robust European grid has not been seen for decades.

A combination of unseasonably mild weather, energy saving initiatives and alternative gas supplies averted a full-blown energy crisis, at least for now, although business and home consumers are paying a heavy price through high energy bills. The potential risk to the grid forced European data center operators to re-evaluate both their power arrangements and their relationship with the grid. Even without an energy security crisis, power systems elsewhere are becoming less reliable, including some of the major grid regions in the US.

Most mission-critical data centers are designed not to depend on the availability of an electrical utility, but to benefit from its lower power costs. On-site power generation — usually provided by diesel engine generators — is the most common way to back up electricity supplies, because it is under the facility operator’s direct control.

A mission-critical design objective of power autonomy, however, does not shield data center operators from problems that affect utility power systems. The reliability of the grid affects:

  • The cost of powering the data center.
  • How much diesel to buy and store.
  • Maintenance schedules and costs.
  • Cascading risks to facility operations.

South Africa provides a case study in how grid instability affects data center operations. The country has emerged as a regional data center hub over the past decade (largely due to its economic and infrastructure head-start over other major African countries), despite experiencing its own energy crisis over the past 16 years.

A total of 11 major subsea network cables land in South Africa, and its telecommunications infrastructure is the most developed on the continent. Although it cannot match the capacity of other global data center hubs, South Africa’s data center market is highly active — and is expanding (including recent investments by global colocation providers Digital Realty and Equinix). Cloud vendors already present in South Africa include Amazon Web Services (AWS), Microsoft Azure, Huawei and Oracle, with Google Cloud joining soon. These organizations must contend with a notoriously unreliable grid.

Factors contributing to grid instability

Most of South Africa’s power grid is operated by state-owned Eskom, the largest producer of electricity in Africa. Years of under-investment in generation and transmission infrastructure have forced Eskom to impose periods of load-shedding — planned rolling blackouts based on a rotating schedule — since 2007.

Recent years have seen substation breakdowns, cost overruns, widespread theft of coal and diesel, industrial sabotage, multiple corruption scandals and a $5 billion government bail-out. Meanwhile, energy prices nearly tripled in real terms between 2007 and 2020.

In 2022, the crisis deepened, with more power outages than in any of the previous 15 years: nearly 300 load-shedding events, three times the previous record, set in 2020 (Figure 1). Customers are usually notified about upcoming disruption through the EskomSePush (ESP) app. Eskom’s load-shedding measures do not distinguish between commercial and residential properties.

diagram: Number of load-shedding instances initiated by Eskom from 2018 to 2022
Figure 1. Number of load-shedding instances initiated by Eskom from 2018 to 2022

Blackouts normally last for several hours, and there can be several a day. Eskom’s app recorded at least 3,212 hours of load-shedding across South Africa’s grid in 2022. For more than 83 hours, South Africa’s grid remained in “Stage 6”, which means the grid was in a power shortfall of at least 6,000 megawatts. A new record was set in late February 2023, when the grid entered “Stage 8” load-shedding for the first time. Eskom has, in the past, estimated that in “Stage 8”, an average South African could expect to be supplied with power for only 12 hours a day.

Reliance on diesel

In this environment, many businesses depend on diesel generators as a source of power — including data centers, hospitals, factories, water treatment facilities, shopping centers and bank branches. This increased demand for generator sets, spare parts and fuel has led to supply shortages.

Load-shedding in South Africa often disrupts road signs and traffic lights, which means fuel deliveries are frequently delayed. In addition, trucks often have to queue for hours to load fuel from refineries. As a result, most local data center operators have two or three fuel supply contracts, and some are expanding their on-site storage tanks to hold enough fuel for several days (as opposed to the 12 to 24 hours typical in Europe and the US).
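A rough tank-sizing calculation shows why storage requirements grow so quickly once multi-day autonomy is the target. The facility load, consumption rate and margin in the sketch below are assumptions for illustration only, not figures reported by operators.

```python
def diesel_storage_liters(load_kw, autonomy_hours, fuel_l_per_kwh=0.3, margin=1.2):
    """Rough tank sizing: fuel burned over the target runtime plus a delivery-delay
    margin. The consumption rate and margin are assumptions, not measured values."""
    return load_kw * autonomy_hours * fuel_l_per_kwh * margin

# Illustrative 2 MW facility load: roughly 17,000 L covers 24 hours of autonomy,
# while a three-day target pushes the requirement past 50,000 L.
print(round(diesel_storage_liters(2000, 24)))  # 17280
print(round(diesel_storage_liters(2000, 72)))  # 51840
```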

There is also the cost of fuel. The general rule in South Africa is that generating power on-site costs about seven to eight times as much as buying utility power. With increased runtime hours on generators, this quickly becomes a substantial expense compared with utility energy bills.
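The arithmetic behind that expense is straightforward. The sketch below applies the local seven-to-eight-times rule of thumb to a hypothetical load, runtime and utility tariff; none of these figures comes from an actual operator.

```python
def generation_premium_usd(load_kw, runtime_hours, utility_rate_usd_kwh,
                           cost_multiplier=7.5):
    """Extra cost of running on generators versus buying the same energy from the
    grid, using the rule of thumb that on-site power costs 7-8x utility power.
    Load, runtime and tariff are hypothetical inputs."""
    energy_kwh = load_kw * runtime_hours
    return energy_kwh * utility_rate_usd_kwh * (cost_multiplier - 1.0)

# Illustrative: 1 MW of facility load running 60 hours a month on generators at
# a $0.10/kWh utility tariff adds roughly $39,000 to that month's energy bill.
print(generation_premium_usd(1000, 60, 0.10))  # 39000.0
```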

Running generators for prolonged periods accelerates wear and heightens the risk of mechanical failure, with the result that the generator units need to be serviced more often. As data center operations staff spend more time monitoring and servicing generators and fuel systems, other maintenance tasks are often deferred, which creates additional risks elsewhere in the facility.

To minimize the risk of downtime, some operators are adapting their facilities to accommodate temporary external generator set connections. This enables them to provision additional power capacity in four to five hours. One city, Johannesburg, has access to gas turbines as an alternative to diesel generators, but these are not available in other cities.

Even if the data center remains operational through frequent power cuts, its connectivity providers, which also rely on generators, may not. Mobile network towers, equipped with UPS systems and batteries, are frequently offline because they do not get enough time to recharge between load-shedding periods if there are several occurrences a day. MTN, one of the country’s largest network operators, had to deploy 2,000 generators to keep its towers online and is thought to be using more than 400,000 liters of fuel a month.
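The failure mode is cumulative: each outage drains the battery by more than the shortened gaps can replenish. The sketch below simulates that cycle with hypothetical tower figures (battery capacity, site load and charge rate are assumptions, not MTN data).

```python
def simulate_tower_battery(outages, outage_h, gap_h, capacity_kwh, load_kw, charge_kw):
    """Track a tower battery through repeated load-shedding windows. If the gap
    between outages is too short to recharge, the site eventually goes dark.
    All figures are hypothetical."""
    soc_kwh = capacity_kwh
    for n in range(1, outages + 1):
        soc_kwh -= load_kw * outage_h                             # drain during outage
        if soc_kwh <= 0:
            return f"tower offline during outage {n}"
        soc_kwh = min(capacity_kwh, soc_kwh + charge_kw * gap_h)  # partial recharge
    return "tower stayed online"

# Repeated 2.5-hour outages with 4-hour gaps and a modest charger: the battery
# loses ground each cycle and the tower drops offline on the fourth outage.
print(simulate_tower_battery(6, 2.5, 4.0, 10.0, 2.0, 0.8))
```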

Frequent outages on the grid create another problem: cable theft. In one instance, a data center operator’s utility power did not return after a scheduled load-shedding period; thieves had used the load-shedding announcements to work out when the copper cables leading to the facility would be de-energized and safe to steal.

Lessons for operators in Europe and beyond

  • Frequent grid failures increase the cost of digital services and alter the terms of service level agreements.
  • Grid issues may take years to emerge. The data center industry needs to be vigilant and respond to early warning signs.
  • Data center operators must work with utilities, regulators and industry associations to shape the development of grids that power their facilities.
  • Uptime’s view is that data center operators will find a way to meet demand — even in hostile environments.

Issues with the supply of Russian gas to Europe might be temporary, but there are other concerns for power grids around the world. In the US, an aging electricity transmission infrastructure — much of it built in the 1970s and 1980s — requires urgent modernization, which will cost billions of dollars. It is not clear who will foot this bill. Meanwhile, power outages across the US over the past six years have more than doubled compared with the previous six years, according to federal data.

While the scale of grid disruption seen in South Africa is extreme, it offers lessons on what happens when a grid destabilizes and how to mitigate the consequences. An unstable grid will cause similar problems for data center operators around the world, ranging from ballooning power costs to a higher risk of equipment failure. This risk will creep into other parts of the facility infrastructure if operations staff do not have time to perform the necessary maintenance tasks. Generators may become the primary source of data center power in such conditions, but they are best used as an insurance policy.