Blog Single Author Small - Uptime Institute Blog

24x7 carbon-free energy (part two): getting to 100%

24×7 carbon-free energy (part two): getting to 100%

July 12, 2023/in Executive, Operations/by Jay Dietrich, Research Director of Sustainability, Uptime Institute, jdietrich@uptimeinstitute.com

Digital infrastructure operators have started to refocus their sustainability objectives on 100% 24×7 carbon-free energy (CFE) consumption: using carbon-free energy for every hour of operation.

To establish a 24×7 CFE strategy, operators must track and control CFE assets and the delivery of energy to their data centers and use procurement contracts designed to manage the factors that affect the availability and cost of renewable energy / CFE megawatt-hours (MWh).

The first Update in this two-part series, 24×7 carbon-free energy (part one): expectations and realities, focused on the challenges that data center operators face as they look to increase the share of CFE consumed in their facilities. This second report outlines the steps that operators, grid authorities, utilities and other organizations need to take to enable data centers to be 100% powered by CFE. As previously discussed, data centers that approach 100% CFE using only wind and solar generation currently have to buy several times more generation capacity than they need.

Figure 1 illustrates the economic challenges of approaching 100% 24×7 CFE. To reach 50% to 80% CFE consumption, the price premium is likely to be 5% or less in markets with high CFE penetration (30% or more of the generated MWh). The levelized cost of electricity (LCOE) is the average net present cost of the electricity based on the cost of generation over the lifetime of an individual or group of generation facilities. Beyond 80% 24×7 CFE, the electricity rate escalates because not enough CFE generation and storage assets are available to provide a reliable electricity supply during periods of low wind and solar generation.

To push toward 100% CFE consumption, operators will need to take actions — or support efforts — to increase grid region interconnects and to develop and deploy reliable, dispatchable, carbon-free generation and long duration energy storage (LDES) capacity. LDES are storage technologies that can store energy for extended time periods, discharge electricity continuously for one to 10 days or longer, and supply electricity at rates that are competitive with other generation technologies.

Figure 1. The cost of electricity as the percentage of 24×7 CFE approaches 100%

Diagram: The cost of electricity as the percentage of 24x7 CFE approaches 100%

Increase grid region interconnections

Wind and / or solar resources are prevalent in some regions and absent in others. In many cases, areas with productive wind and solar resources are distant from those with high electricity demand. Connecting abundant resources with high-demand areas requires the build-out of high voltage interconnects within and between grid regions and countries. Recent news and trade articles have detailed that the current lack of grid interconnections and flexibility to support more distributed generation assets is slowing the building and deployment of planned solar and wind generation facilities. Numerous published studies detail the high voltage distribution system buildouts needed to support 100% CFE in grid regions around the globe.

The untapped potential of inter-regional grid interconnections is illustrated by the excess generation capacity associated with Google’s wind power purchase agreements (PPAs) for its Oklahoma and Iowa data centers in the Midwest region of the US, where wind generation is high. Google has four data centers in the adjacent US Southeast region (where wind generation is low). These have a low percentage of 24×7 CFE so would benefit from access to the excess Iowa and Oklahoma wind capacity. However, the extra wind capacity cannot be transported from the Midwest Reliability Organization (MRO) grid region to the Southeast Reliability Corporation (SERC) because of a lack of high-voltage interconnection capacity (Figure 2).

Figure 2. Google wind and solar assets by grid region (US)

Diagram: Google wind and solar assets by grid region (US)

The buildout of these projects is complicated by permit issues and financing details. Again, the Google example is constructive. The Clean Line, a 2 GW (gigawatt) high voltage transmission line between MRO and SERC, was proposed by a development consortium around 2013. The line would have enabled Google and other purchasers of wind power in the MRO region to transport excess wind generation to their data centers and other facilities in the SERC region. Clean Line may have also transported excess solar power from the SERC region to the MRO region.

The permitting process extended over four years or more, slowed by legal challenges from property owners and others. The developer had difficulties securing finance, because financiers required evidence of transportation contracts from MRO to SERC, and of PPAs with energy retailers and utilities in SERC. Generators in MRO would not sign transmission contracts without a firm construction schedule. Regulated utilities, the primary energy retailers in SERC, were hesitant to sign PPAs for intermittent renewable power because they would have to hold reliable generation assets in reserve to manage output variations. The developer built a lower capacity section of the line from MRO to the western edge of SERC, but the remainder of the planned system was shelved.

Reliable CFE generation

Complete decarbonization of the electricity supply will require deploying reliable carbon-free or low-carbon energy generation across the global energy grid. Nuclear and geothermal technologies are proven and potentially financially viable, but designs must be refreshed.

Nuclear generation is undergoing a redesign, with an emphasis on small modular reactors. These designs consist of defined, repeatable modules primarily constructed in central manufacturing facilities and assembled at the production site. They are designed to provide variable power output to help the grid match generation to demand. Like other systems critical to grid decarbonization, new reactor types are in the development and demonstration stage and will not be deployed at scale for a decade or more.

Geothermal generation depends on access to the high-temperature zones underground. Opportunities for deployment are currently limited to areas such as Hawaii, where high-temperature subterranean zones are close to the surface.

Horizontal drilling technologies developed by the petroleum industry can access deeper high-temperature zones. However, the corrosive, high-pressure, and high-temperature conditions experienced by these systems present many engineering challenges. Several companies are developing and demonstrating installations that overcome these challenges.

Hydrolytic hydrogen generation is another technology that offers a means to deploy reliable electricity generation assets fueled with hydrogen. It has the advantage of providing a market for the excess wind and solar generation overcapacity discussed earlier in the report. The economics of hydrolytic systems suffer from current conversion efficiencies of 75% or less and uncertain economic conditions. Production incentives provided by the US Inflation Reduction Act, and similar programs in Japan, Australia, the EU and the UK, are attracting initial investments and accelerating the installation of hydrogen generation infrastructure to demonstrate system capabilities and reduce unit costs.

These three technologies illustrate the challenges in increasing the reliable carbon-free energy capacity available on the electricity grid. This discussion is not an exhaustive review, but an example of the technical and manufacturing challenges and extended timelines required to develop and deploy these technologies at scale.

Long duration energy storage

Long duration energy storage (LDES) is another critical technology needed to move the electricity grid to 100% 24×7 CFE. Table 1 details various battery and physical storage technologies currently being developed. The table represents a general survey of technologies and is not exhaustive.

Table 1. Long duration energy storage (LDES) technologies

Table: Long duration energy storage (LDES) technologies

Each type of battery is designed to fill a specific operating niche, and each type is vital to attaining a low-carbon or carbon-free electricity grid. All these technologies are intended to be paired with wind and solar generation assets to produce a firm, reliable supply of CFE for a specified time duration.

4-hour discharge duration: Lithium-ion batteries are a proven technology in this category, but they suffer from high costs, risk of fires, and deterioration of battery capacity with time. Other technologies with lower capital costs and better operating characteristics are under development. These batteries are designed to manage the short-term minute-to-minute, hour-to-hour variations in the output of wind and solar generation assets.
8- to 12-hour discharge duration: These storage systems address the need for medium-duration, dispatchable power to fill in variations in wind and solar generation over periods beyond 4 hours. The batteries will provide continuous, quasi-reliable round the clock power by charging during periods of excess wind and solar production and discharging during periods of no or low MWh output. These systems will rely on sophisticated software controls to manage the charge/discharge process and to integrate the power generation with grid demand.
≥10 day discharge duration: These storage systems are designed to support the grid during longer periods of low generation — such as low solar output caused by several cloudy days or a multiday period of low wind output — and to cover significant week-to-week output variations. As with the 8- to 12-hour LDES systems, the energy cost and the frequency of discharge to the grid will govern the economics of the operation.

The cost of energy from the battery depends on three factors: the energy cost to charge the batteries, the round-trip efficiency of the charging / discharging process, and the number of charge / discharge cycles achieved for a given period.

The impact of the cost of power to charge the batteries is evident. The critical point is that the economics of a battery system depend on buying power during periods of grid saturation when wholesale electricity prices are approaching zero.

Roundtrip efficiency dictates the quantity of excess energy that has to be purchased to compensate for the inefficiencies of the charge / discharge cycle. If the roundtrip efficiency is 80%, the battery operator must purchase 1.25 MWh of energy for every 1 MWh delivered back to the grid. To make the economics of a battery system work, charging power needs to be purchased at a near zero rate.

To be profitable, the revenue from energy power sales must cover the cost of buying power, operating and maintaining the battery system, and paying off the loan used to finance the facility. The economics of an 8- to 12-hour battery system will be optimized if the system charges / discharges every 24 hours. If the battery system can only dispatch energy on one out of seven days, the revenue is unlikely to cover the financing costs.

Effectively integrating storage systems into the grid will require sophisticated software systems to control the charging and discharging of individual battery systems and the flow to and from battery systems at a grid level. These software systems will likely need to include economic algorithms tuned to address the financial viability of all the generating assets on the grid.

Conclusions

Transitioning the global electricity grid to CFE will be a long-term endeavor that will take several decades. Given the challenges inherent in the transition, data center operators should develop plans to increase their electricity consumption to 70% to 80% CFE, while supporting extended grid interconnections and the development and commercialization of LDES technologies and non-intermittent CFE electricity generation.

There will be grid regions where the availability of wind, solar, hydroelectric and nuclear generation assets will facilitate a faster move to economic 100% CFE. There are other regions where 100% CFE use will not be achieved for decades. Data center operators will need to accommodate these regional differences in their sustainability strategies. Because of the economic and technical challenges of attaining 100% CFE in most grid regions, it is unrealistic to expect to reach 100% CFE consumption across a multi-facility, multi-regional data center portfolio until at least 2040.

Uptime Institute advises operators to reset the timing of their net zero commitments and to revise their strategy to depend less on procuring renewable energy credits, guarantees of origin, and carbon offsets. Data center operators will achieve a better outcome if they apply their resources to promote and procure CFE for their data center. But where CFE purchases are the key to operational emissions reductions, net zero commitments will need to move out in time.

Data center managers should craft CFE procurement strategies and plans to incrementally increase CFE consumption in all IT operations — owned, colocation and cloud. By focusing net zero strategies on CFE procurement, data center managers will achieve real progress toward reducing their greenhouse gas emissions and will accelerate the transition to a low-carbon electricity grid.

Where the cloud meets the edge

July 5, 2023/in Executive, Operations/by Dr. Owen Rogers, Senior Research Director for Cloud Computing, Uptime Institute, orogers@uptimeinstitute.com

Low latency is the main reason cloud providers offer edge services. Only a few years ago, the same providers argued that the public cloud (hosted in hyperscale data centers) was suitable for most workloads. But as organizations have remained steadfast in their need for low latency and better data control, providers have softened their resistance and created new capabilities that enable customers to deploy public cloud software in many more locations.

These customers need a presence close to end users because applications, such as gaming, video streaming and real-time automation, require low latency to perform well. End users — both consumers and enterprises — want more sophisticated capabilities with quicker responses, which developers building on the cloud want to deliver. Application providers also want to reduce the network costs and bandwidth constraints that result from moving data over wide distances — which further reinforces the need to keep data close to point of use.

Cloud providers offer a range of edge services to meet the demand for low latency. This Update explains what cloud providers offer to deliver edge capabilities. Figure 1 shows a summary of the key products and services available.

Figure 1. Key edge services offered by cloud providers

Diagram: Key edge services offered by cloud providers

In the architectures shown in Figure 1, public cloud providers generally consider edge and on-premises locations as extensions of the public cloud (in a hybrid cloud configuration), rather than as isolated private clouds operating independently. The providers regard their hyperscale data centers as default destinations for workloads and propose that on-premises and edge sites be used for specific purposes where a centralized location is not appropriate. Effectively, cloud providers see the edge location as a tactical staging post.

Cloud providers don’t want customers to view edge locations as similarly feature-rich as cloud regions. Providers only offer a limited number of services at the edge to address specific needs, in the belief that the central cloud region should be the mainstay of most applications. Because the edge cloud relies on the public cloud for some aspects of administration, there is a risk of data leakage and loss of control of the edge device. Furthermore, the connection between the public cloud and edge is a potential single point of failure.

Infrastructure as a service

In an infrastructure as a service (IaaS) model, the cloud provider manages all aspects of the data center, including the physical hardware and the orchestration software that delivers the cloud capability. Users are usually charged per resource consumed per period.

The IaaS providers use the term “region” to describe a geographical area that contains a collection of independent availability zones (AZs). An AZ consists of at least one data center. A country may have many regions, each typically having two or three AZs. Cloud providers offer nearly all their services that are located within regions or AZs as IaaS.

Cloud providers also offer metro and near-premises locations sited in smaller data centers, appliances or colocation sites nearer to the point of use. They manage these in a similar way to AZs. Providers claim millisecond-level connections between end-users and these edge locations. However, the edge locations usually have fewer capabilities, poorer resiliency, and higher prices than AZs and broader regions that are hosted from the (usually) larger data centers.

Furthermore, providers don’t have edge locations in all areas — they create new locations only where volume is likely to make the investment worthwhile, typically in cities. Similarly, the speed of connectivity between edge sites and end-users depends on the supporting network infrastructure and availability of 4G or 5G.

When a cloud provider customer wants to create a cloud resource or build out an application or service, they first choose a region and AZ using the cloud provider’s graphical user interface (GUI) or application programming interface (API). To the user of GUIs and APIs, metro and near-premises locations appear as options for deployment just as a major cloud region would. Buyers can set up a new cloud provider account and deploy resources in an IaaS edge location in minutes.

Metro locations

Metro locations do not offer the range of cloud services that regions do. They typically offer only compute, storage, container management and load balancing. A region has multiple AZs; a metro location does not. As a result, it is impossible to build a fully resilient application in a single metro location (see Cloud scalability and resiliency from first principles).

The prime example of a metro location is Amazon Web Services (AWS) Local Zones. Use cases usually focus on graphically intense applications such as virtual desktops or video game streaming, or real-time processing of video, audio or sensor data. Customers, therefore, should understand that although edge services in the cloud might cut latency and bandwidth costs, resiliency may also be lower.

These metro-based services, however, may still match many or most enterprise levels of resiliency. Data center infrastructure that supports metro locations is typically in the scale of megawatts, is staffed, and is built to be concurrently maintainable. Connection to end users is usually provided over redundant local fiber connections.

Near-premises (or far edge) locations

Like cloud metro locations, near-premises locations have a smaller range of services and AZs than regions do. The big difference between near-premises and metros is that resources in near-premises locations are deployed directly on top of 4G or 5G network infrastructure, perhaps only a single cell tower away from end-users. This reduces hops between networks, substantially reducing latency and delays caused by congestion.

Cloud providers partner with major network carriers to enable this, for example AWS’s Wavelengths service delivered in partnership with Vodafone, and Microsoft Azure Edge Zone’s service with AT&T. Use cases include, or may include, real-time applications, such as live video processing and analysis, autonomous vehicles and augmented reality. 5G enables connectivity where there is no fixed-line telecoms infrastructure or in temporary locations.

These sites may be cell tower locations or exchanges, operating tens of kilowatts (kW) to a few hundred kW. They are usually unstaffed (remotely monitored) with varying levels of redundancy and a single (or no) engine generator.

On-premises cloud extensions

Cloud providers also offer hardware and software that can be installed in a data center that the customer chooses. The customer is responsible for all aspects of data center management and maintenance, while the cloud provider manages the hardware or software remotely. The provider’s service is often charged per resource per period over an agreed term. A customer may choose these options over IaaS because no suitable cloud edge locations are available, or because regulations or strategy require them to use their own data centers.

Because the customer chooses the location, the equipment and data center may vary. In the edge domain, these locations are: typically 10 kW to a few hundred kW; owned or leased; and constructed from retrofitted rooms or using specialized edge data center products (see Edge data centers: A guide to suppliers). Levels of redundancy and staff expertise vary, so some edge data center product suppliers provide complementary remote monitoring services. Connectivity is supplied through telecom interconnection and local fiber, and latency between site and end user varies significantly.

Increasingly, colocation providers differentiate by directly peering with cloud providers’ networks to reduce latency. For example, Google Cloud recommends its Dedicated Interconnect service for applications where latency between public cloud and colocation site must be under 5ms. Currently, 143 colocation sites peer with Google Cloud, including those owned by companies such as Equinix, NTT, Global Switch, Interxion, Digital Realty and CenturyLink. Other cloud providers have similar arrangements with colocation operators.

Three on-premises options

Three categories of cloud extensions can be deployed on-premises. They differ in how easy it is to customize the combination of hardware and software. An edge cloud appliance is simple to implement but has limited configuration options; a pay-as-you-go server gives flexibility in capacity and cloud integration but requires more configuration; finally, a container platform gives flexibility in hardware and software and multi-cloud possibilities, but requires a high level of expertise.

Edge cloud appliance

An edge appliance is a pre-configured hardware appliance with pre-installed orchestration software. The customer installs the hardware in its data center and can configure it, to a limited degree, to connect to the public cloud provider. The customer generally has no direct access to the hardware or orchestration software.

Organizations deploy resources via the same GUI and APIs as they would use to deploy public cloud resources in regions and AZs. Typically, the appliance needs to be connected to the public cloud for administration purposes, with some exceptions (see Tweak to AWS Outposts reflects demand for greater cloud autonomy). The appliance remains the property of the cloud provider, and the buyer typically leases it based on resource capacity over three years. Examples include AWS Outposts, Azure Stack Hub and Oracle Roving Edge Infrastructure.

Pay-as-you-go server

A pay-as-you-go server is a physical server leased to the buyer and charged based on committed and consumed resources (see New server leasing models promise cloud-like flexibility). The provider maintains the server, measures consumption remotely, proposes capacity increases based on performance, and refreshes servers when appropriate. The provider may also include software on the server, again charged using a pay-as-you-go model. Such software may consist of cloud orchestration tools that provide private cloud capabilities and connect to the public cloud for a hybrid model. Customers can choose their hardware specifications and use the provider’s software or a third party’s. Examples include HPE GreenLake and Dell APEX.

Container software

Customers can also choose their own blend of hardware and software with containers as the underlying technology, to enable interoperability with the public cloud. Containers allow software applications to be decomposed into many small functions that can be maintained, scaled, and managed individually. Their portability enables applications to work across locations.

Cloud providers offer managed software for remote sites that is compatible with public clouds. Examples include Google Anthos, IBM Cloud Satellite and Red Hat OpenShift Container Platform. In this option, buyers can choose their hardware and some aspects of their orchestration software (e.g., container engines), but they are also responsible for building the system and managing the complex mix of components (see Is navigating cloud-native complexity worth the hassle?).

Considerations

Buyers can host applications in edge locations quickly and easily by deploying in metro and near-premises locations offered by cloud providers. Where a suitable edge location is not available, or the organization prefers to use on-premises data centers, buyers have multiple options for extending public cloud capability to an edge data center.

Edge locations differ in terms of resiliency, product availability and — most importantly — latency. Latency should be the main motivation for deploying at the edge. Generally, cloud buyers pay more for deploying applications in edge locations than they would in a cloud region. If there is no need for low latency, edge locations may be an expensive luxury.

Buyers must deploy applications to be resilient across edge locations and cloud regions. Edge locations have less resiliency, may be unstaffed, and may be more likely to fail. Applications must be architected to continue to operate if an entire edge location fails.

Cloud provider edge locations and products are not generally designed to operate in isolation — they are intended to serve as extensions of the public cloud for specific workloads. Often, on-premises and edge locations are managed via public cloud interfaces. If the connection between the edge site and the public cloud goes down, the site may continue to operate — but it will not be possible to deploy new resources until the site is reconnected to the public cloud platform that provides the management interface.

Data protection is often cited as a use case for edge, a reason why some operators may choose to locate applications and data at the edge. However, because the edge device and public cloud need to be connected, there is a risk of user data or metadata inadvertently leaving the edge and entering the public cloud, thereby breaching data protection requirements. This risk must be managed.

Dr. Owen Rogers, Research Director of Cloud Computing

Tomas Rahkonen, Research Director of Distributed Data Centers

Data center operators will face more grid disturbances

June 28, 2023/in Executive, Operations/by Jacqueline Davis, Research Analyst, Uptime Institute, jdavis@uptimeinstitute.com

The energy crisis of 2022, resulting from Russia’s invasion of Ukraine, caused serious problems for data center operators in Europe. Energy prices leapt up and are likely to stay high. This has resulted in ongoing concerns that utilities in some European countries, which have a mismatch of supply and demand, will have to shed loads.

Even before the current crisis, long-term trends in the energy sector pointed towards less reliable electrical grids, not only in Europe, but also in North America. Data center operators — which rarely have protected status due to their generator use — are having to adapt to a new reality: electrical grids in the future may be less reliable than they are now.

Frequency and voltage disturbances may occur more frequently, even if full outages do not, with consequent risk to equipment. Data center operators may need to test their generators more often and work closely with their utilities to ensure they are collaborating to combat grid instability. In some situations, colocation providers may have to work with their customers to ensure service level agreements (SLAs) are met, and even shift some workloads ahead of power events.

There are four key factors that affect the reliability of electrical grids:

Greater adoption of intermittent renewable energy sources.
Aging of electricity transmission systems.
More frequent (and more widespread) extreme weather events.
Geopolitical instability that threatens oil and gas supplies.

Individually, these factors are unlikely to deplete grid reserve margins, which measure the difference between expected maximum available supply and expected peak demand. When combined, however, this confluence of factors can create an environment that complicates grid operation and increases the likelihood of outages.

Two out of six regional grid operators in the US have reported that their reserve margins are currently under target. One of them — the Midcontinent Independent System Operator — is expected to have a 1,300 megawatt (MW) capacity shortfall in the summer of 2023. In Europe, Germany is turning to foreign power suppliers to meet its reserve margins; even before the war in Ukraine, the country was lacking 1,400 MW of capacity. There has been a marked increase in power event outage warnings — even if they are not happening yet.

Dispatchable versus intermittent

An often-cited reason for grid instability is that dispatchable (or firm) power generation — for example, from natural gas, coal and nuclear sources — is being replaced by intermittent renewable power generated by solar and wind energy. This hinders the ability of grid operators to match supply and demand of energy and makes them more vulnerable to variations in weather. Utilities find themselves caught between competing mandates: maintain strict levels of reserve power, decommission the most reliable fossil fuel power plants, and maintain profitability.

Historically, electrical grid operators have depended on firm power generation to buttress the power system. Due to the market preference for renewable energy, many power plants that provide grid stability now face severe economic and regulatory challenges. These include lower-than-expected capacity factors, which compare power plants’ output with how much they could produce at peak capacity; higher costs of fuel and upgrades; stricter emissions requirements; and uncertainty regarding license renewals.

The effects are seen in the levelized cost of electricity (LCOE), measured in dollars per megawatt-hour (MWh). The LCOE is the average net present value of electricity generation over the lifetime of a generation asset. It factors in debt payments, fuel costs and maintenance costs, and requires assumptions regarding fuel costs and interest rates. According to financial services company Lazard’s annual US analysis (2023 Levelized cost of energy+), utility scale solar and onshore wind LCOE (without subsidies) is $24-$96 per MWh and $24-$75 per MWh respectively. Combined cycle natural gas plants, for comparison, have an LCOE of $39-$101 per MWh, while natural gas peaking plants have an LCOE of $115-$221 per MWh.

The economic viability of generation assets, as represented by LCOE, is only a part of a very complex analysis to determine the reserve margin needed to maintain grid stability. Wind and solar generation varies significantly by the hour and the season, making it essential that the grid has sufficient dispatchable assets (see 24×7 carbon-free energy (part one): expectations and realities).

Grid-scale energy storage systems, such as electrochemical battery arrays or pumped storage hydropower, are a possible solution to the replacement of fossil fuel-based firm power. Aside from a few examples, these types of storage do not have the capacity to support the grid when renewable energy output is low. It is not merely a question of more installations, but affordability: grid-scale energy storage is not currently economical.

Grids are showing their age

Much of the infrastructure supporting electricity transmission in the US and Europe was built in the 1970s and 1980s. These grids were designed to transmit power from large, high-capacity generation facilities to consumers. They were not designed to manage the multi-directional power flow of widely distributed wind and solar generation assets, electric vehicle chargers which can consume or supply power, and other vagaries of a fully electrified economy. The number of end users served by electrical grids has since increased dramatically, with the world’s population doubling — and power networks are now struggling to keep up without expensive capacity upgrades and new transmission lines.

In the US, the Department of Energy found that the average age of large power transformers, which handle 90% of the country’s electricity flow, is more than 40 years. Three-quarters of the country’s 707,000 miles (1,138,000 km) of transmission lines — which have a lifespan of 40 to 50 years — are more than 25 years old. The number of transmission outages in the US has more than doubled in the past six years compared with the previous six years, according to North American Electric Reliability Corporation. This is partially due to an increase in severe weather events, which put more stress on the power grid (more on this below).

Renewable energy in some grid regions is being wasted because there is not enough transmission capacity to transport it to other regions that have low generation and high demand. In the US, for example, the lack of high-voltage transmission lines prevents energy delivery from expansive wind farms in Iowa and Oklahoma to data centers in the southeast. Germany is facing similar problems as it struggles to connect wind capacity in the north near the Baltic Sea to the large industrial consumers in the south.

Energy experts estimate that modernizing the US power grid will require an investment of $1 trillion to $2 trillion over the next two decades. Current business and regulatory processes cannot attract and support this level of funding; new funding structures and business processes will be required to transform the grid for a decarbonized future.

Weather is changing

The Earth is becoming warmer, which means long-term changes to broad temperature and weather patterns. These changes are increasing the risk of extreme weather events, such as heat waves, floods, droughts and fires. According to a recent study published in the Proceedings of the National Academy of Sciences, individual regions within the continental US were more than twice as likely to see a “once in a hundred years” extreme temperature event in 2019 than they were in 1979.

Extreme weather events put the grid system at risk. High temperatures reduce the capacity of high-voltage transmission, high winds knock power lines off the towers, and stormy conditions can result in the loss of generation assets. In the past three years, Gulf Coast hurricanes, West Coast wildfires, Midwest heat waves and a Texas deep freeze have all caused local power systems to fail. The key issue with extreme weather is that it can disrupt grid generation, transmission, and fuel supplies simultaneously.

The Texas freeze of February 2021 serves as a cautionary tale of interdependent systems. As the power outages began, some natural gas compressors became inoperable because they were powered by electricity without backup — leading to further blackouts and interrupting the vital gas supply needed to maintain or restart generation stations.

Geopolitical tensions create fuel supply risks

Fuel supply shocks resulting from the war in Ukraine rapidly increased oil and natural gas prices in Europe during the winter of 2022. The global energy supply chain is a complex system of generators, distributors and retailers that prioritizes economic efficiency and low cost, often at the expense of resiliency, which can create the conditions for cascading failures. Many countries in the EU warned that fuel shortages would cause rolling blackouts if the weather became too cold.

In the UK, the government advised that a reasonable worst-case scenario would see rolling power cuts across the country. Large industrial customers, including data center operators, would have been affected, losing power in 3-hour blocks without any warning. In France, planning for the worst-case scenario assumed that up to 60% of the country’s population would be affected by scheduled power cuts. Even Germany, a country with one of the most reliable grids in the world, made plans for short-term blackouts.

Fuel supply shocks are reflected in high energy prices and create short- and long-term risks for the electrical grid. Prior to the conflict in Ukraine, dispatchable fossil fuel peaker plants, which only operate during periods of high demand, struggled to maintain price competitiveness with renewable energy producers. This trend was exacerbated by high fuel costs and renewable energy subsidies.

Any political upheaval in the major oil-producing regions, such as the gulf countries or North Africa, would affect energy prices and energy supply. Development of existing shale gas deposits could offer some short-term energy security benefits in markets, such as the US, the UK and Germany, but political pressures are currently preventing such projects from getting off the ground.

Being prepared for utility power problems

Loss of power is still the most common cause of data center outages. Uptime Institute’s 2022 global survey of data center managers attributes 44% of outages to power issues — greater than the second, third and fourth most common outage causes combined.

The frequency of power-related outages is partly due to power system complexity and a dependence on moving parts: a loss of grid power often reveals weaknesses elsewhere, including equipment maintenance regimens and staff training.

The most common power-related outages result from faults with uninterruptible power supply (UPS) systems, transfer switches and generators. Compared with other data center systems, such as cooling, power is more prone to fail completely, rather than operating at partial capacity.

Disconnecting from the power grid and running on emergency backup systems for extended periods (hours and days) is not a common practice in most geographies. Diesel generators remain the de facto standard for data center backup power and these systems have inherent limitations.

Any issues with utility power create risks across the data center electrical system. Major concerns include:

Load transfer risks between the grid and the generators. It is recommended that data center operators fully disconnect from the electrical grid once a year to test procedures, yet many choose not to do so out of operational concerns in a production environment. This means that lurking failures in transfer switches and paralleling switchgears may go undetected and any operational mistakes remain undiscovered.
Fuel reserves and refueling. The volume of on-site fuel storage must meet cost and available space constraints; requires system maintenance; and spill and leak management. Longer grid outages can exceed on-site fuel capacity, making operators dependent on outside vendors for fuel resupply. These vendors are, in turn, dependent on the diesel supply chain, which may be disrupted during a wide-area grid outage because some diesel terminals may lack backup power to pump the fuel. Fuel delivery procedures may come under time pressure and may not be fully observed with the potential to create accidents, such as contamination and spills.
Increased likelihood of engine failures. More frequent warm-up and cool-down cycles, as well as higher than expected runtime hours, accelerate generator wear. As many as 27% of data center operators have experienced a generator-related power outage in the past three years, according to the Uptime Institute Annual outages analysis 2023. Ongoing supply chain bottlenecks may mean that rental generators are in short supply while replacement parts or new engines may take months to arrive. This may force the data center to operate at a derated capacity, lower redundancy, or both.
Pollutant emissions. Many jurisdictions limit generator operating hours to cap emissions of nitrogen oxides, sulfur oxides, and other particulates. For example, in the US, most diesel generators are limited to 100 hours of full load per year and non-compliance can result in fines.
Battery system wear and failures. Frequent deep discharge and recharge cycles wear battery cells out faster, particularly lead-acid batteries. Lithium-ion chemistries are not immune either: discharges create thermal stress for Lithium-ion cells when currents are high. Temperature can also spike towards the end-of-cell capacity, which increases the chance of a thermal event as a result of inherent manufacturing imperfections. Often, the loss of critical load caused by a failure in an uninterruptible power supply system is due to batteries not being monitored closely enough by experienced technicians.

It will get worse before it gets better

The likely worst-case scenario facing US and European data center operators in 2023 and beyond will consist primarily of load-shedding requests, brownouts and 2- to 4-hour, controlled rolling blackouts as opposed to widespread, long-lasting grid outages.

Ultimately, how much margin electrical grids will have in reserve is not possible to predict with any accuracy beyond a few weeks. Extreme weather events, in particular, are exceedingly difficult to model. Other unexpected events — such as power station maintenance, transmission failures and geopolitical developments that affect energy supply — might contribute to a deterioration of grid reliability.

Operators can take precautions to prepare for rolling blackouts, including developing a closer relationship with their energy provider. These steps are well understood, but evidence shows they are not always followed.

Considering historically low fuel reserves and long lead times for replacement components, all measures are best undertaken in advance. Data center operators should reach out to suppliers to confirm their capability to deliver fuel and their outlook for future supply — and determine whether the supplier is equipped to sell fuel in the event of a grid outage. Based on the response, operators should assess the need for additional on-site fuel storage or consider contracting additional vendors for their resupplies.

Data center operators should also test their backup power systems regularly. The so-called pull-the-plug test, which checks the entire backup system, is recommended annually. A building load test should be performed monthly, which involves less risk than pull-the-plug testing and checks most of the backup power system. This test may not run long enough to test certain components, such as fuel pumps and level switches, and these should be tested manually. Additionally, data center operators should test the condition of their stored fuel, and filter or treat it as necessary.

The challenges to grid reliability vary by location, but in general Uptime Intelligence sees evidence that the risk of rolling blackouts at times of peak demand is escalating. Operators can mitigate risk to their data centers by testing their backup power system and arranging for additional fuel suppliers as needed. They should also revisit their emergency procedures — even those with a very low chance of occurrence.

Jacqueline Davis, Research Analyst

Lenny Simon, Research Associate

Daniel Bizo, Research Director

Max Smolaks, Research Analyst

24x7 carbon-free energy (part one): expectations and realities

24×7 carbon-free energy (part one): expectations and realities

June 21, 2023/in Executive, Operations/by Jay Dietrich, Research Director of Sustainability, Uptime Institute, jdietrich@uptimeinstitute.com

Data center operators that set net-zero goals will ultimately have to transition to 100% 24×7 carbon-free energy. But current technological limitations mean it is not economically feasible in most grid regions.

In the past decade, the digital infrastructure industry has embraced a definition of renewable energy use that combines the use of renewable energy credits (RECs), guarantees of origin (GOs) and carbon offsets to compensate for the carbon emissions of fossil fuel-based generation, and the consumption of renewable energy in the data center.

The first component is relatively easy to achieve and has not proved prohibitively expensive, so far. The second, which will increasingly be required, will not be so easy and will cause difficulty and controversy in the digital infrastructure industry (and many other industries) for years to come.

In the Uptime Institute Sustainability and Climate Change Survey 2022, a third of data center owners / operators reported buying RECs, GOs and other carbon offsets to support their renewable energy procurement and net-zero sustainability goals.

But many organizations, including some regulators and other influential bodies, are not happy that some operators are using RECs and GOs to claim their operations are carbon neutral with 100% renewable energy use. They argue that, with or without offsets, at least some of the electricity the data center consumes is responsible for carbon dioxide (CO2) emissions — which are measured in metric tonnes (MT) of CO2 per megawatt-hour (MWh).

The trend is clear — offsets are becoming less acceptable. Many digital infrastructure operators have started to (or are planning to) refocus their sustainability objectives towards 100% 24×7 carbon-free energy (CFE) consumption — consuming CFE for every hour of operation.

This Update, the first of two parts, will focus on the key challenges that data center operators face as they move to use more 24×7 CFE. Part two, 24×7 carbon-free energy: getting to 100%, will outline the steps that operators, grid authorities, utilities and other organizations should take to enable data centers to be completely powered by CFE.

A pairing of 24×7 CFE and net-zero emissions goals alters an operator’s timeframe to achieve their net-zero emissions as it removes the ability to offset for success. At present, 100% 24×7 CFE is technically and / or economically unviable in most grid regions and will not be viable in some regions until 2040 to 2050, and beyond. This is because solar and wind generation is highly variable, and the capacity and discharge time of energy storage technologies is limited. Even grid regions with high levels of renewable energy generation can only support a few facilities with 100% 24×7 CFE because they cannot produce enough MWh during periods of low solar and wind output. Consequently, this approach could be expensive and financially risky.

The fact that most grid regions cannot consistently generate enough CFE complicates net-zero carbon emissions goals. The Uptime Institute Sustainability and Climate Change Survey 2022 found that almost two-thirds of data center owners and operators expect to achieve a net-zero emissions goal by 2030 and that three-quarters expect to by 2035 (Figure 1). Operators will not be able to achieve these enterprise-wide goals by using 100% 24×7 CFE. They will have to use RECs, GOs and offsets extensively — and significant financial expenditure — to deliver on their promises.

Figure 1. Time period when operators expect to achieve their net-zero goals

Diagram: Time period when operators expect to achieve their net-zero goals

Although achieving net-zero emissions goals through offsets may seem attractive, offset costs are rising and operators need to assess the costs and benefits of increasing renewable energy / CFE at their data centers. Over the long term, if organizations demand more CFE, they will incentivize energy suppliers to generate more, and to develop and deploy energy storage, grid management software and inter-region high voltage interconnects.

To establish a 24×7 CFE strategy, operators or their energy retailers should track and control CFE assets and energy delivery to their data centers, and include factors that affect the availability and cost of CFE MWh in procurement contracts.

Renewable energy intermittency

The MWh output of solar and wind generation assets varies by minute, hour, day and season. Figure 2 shows how wind generation varied over two days in January 2019 in the grid region of ERCOT (a grid operator in Texas, US). On January 8, wind turbines produced 32% to 48% of their total capacity (capacity factor) and satisfied 19% to 33% of the entire grid demand. On January 14, wind capacity fell to 4% to 18% and wind output met only 3% to 9% of the total demand.

Figure 2. Wind generation data for two days in January 2019

Diagram: Wind generation data for two days in January 2019

To match the generation output to grid demand, sufficient reliable generation capacity (typically fossil fuel-fired) needs to be available to address the variations in wind output during the two days: 3,000 MW on January 8 and 2,500 MW on January 14. There needs to be four to five natural gas turbine generators available. Capacity is also needed to cover the (roughly) 6,000 MW of output difference between the two days.

ERCOT needs to maintain sufficient reliable generation capacity (typically fossil fuel-fired) to match grid-wide generation capacity to demand, in the face of the hour-to-hour and day-to-day variation in wind and solar output. There has to be four or five 700 MW natural gas turbines in reserve to meet the 2,500 MW to 3,000 MW of output variation on January 14 and January 8, respectively.

In 2020, wind generation in ERCOT satisfied less than 5% of grid demand for 600 hours — that’s 7% of the year. Given the number of hours of low MWh generation, only a few facilities in the grid region can aspire to 100% CFE. The challenge of significant hours of low generation exists in most grid regions that strive to depend on wind and solar generation to meet carbon free energy goals. This limits the ability of operators to set and achieve 100% CFE goals for specific facilities over the next 10 to 20 years.

Seasonal variations can be even more challenging. The wind generation gigawatt hours (GWh) in the Irish grid were two to three times higher in January and February 2022 than in July and August 2022 (Figure 3, light blue line), which left a seasonal production gap of up to 1,000 GWh. The total GWh output for each month masks the hour-to-hour variations of wind generation (see Figure 3). These amplify the importance of having sufficient and reliable generation available on the grid.

Figure 3. Monthly gigawatt-hours generated in Ireland by fuel type in 2022

Diagram: Monthly gigawatt-hours generated in Ireland by fuel type in 2022

The mechanics of a 24×7 CFE strategy

To manage a facility-level program to procure, track and consume CFE, it is necessary to understand how an electricity grid operates. Figure 4 shows a fictional electricity grid modelled on the Irish grid with the addition of a baseload 300 MW nuclear generation facility.

The operation of an electrical grid can be equated to a lake of electrons held at a specified level by a dam. A defined group of generation facilities continuously “fills” the electron lake. The electricity consumers draw electrons from the lake to power homes and commercial and industrial facilities (Figure 4). The grid operator or authority maintains the lake level (the grid potential in MW) by adding or subtracting online generation capacity (MW of generation) at the same “elevation” as the MW demand of the electricity consumers pulling electrons from the lake.

If the MWs of generation and consumption become unbalanced, the grid authority needs to quickly add or subtract generation capacity or remove portions of demand (demand response programs, brownouts and rolling blackouts) to keep the lake level balanced and prevent grid failure (a complete blackout).

Figure 4. Example of generation and demand balancing within an electricity grid

Diagram: Example of generation and demand balancing within an electricity grid

The real-time or hourly production of each generation facility supplying the grid can be matched to the hourly consumption of a grid-connected data center. The grid authority records the real-time MWh generation from each facility and the data can be made available to the public. At least two software tools, WattTime and Cleartrace, are available in some geographies to match CFE generation to facility consumption. Google, for example, uses a tracking tool and publishes periodic updates of the percentage utilization of CFE at each of its owned data centers (Figure 5).

Figure 5. Percentage of hourly carbon-free energy utilized by Google’s owned data centers

Diagram: Percentage of hourly carbon-free energy utilized by Google's owned data centers

Grid authorities depend on readily dispatchable generation assets to balance generation and demand. Lithium-ion batteries are a practical way to respond to rapid short-term changes (within a period of four hours or less) in wind and solar output. Because of the limited capacity and discharge time of lithium-ion batteries — the only currently available grid-scale battery technology — they cannot manage MW demand variation across 24 hours (Figure 2) or seasons (Figure 3). Consequently, grid authorities depend on large-scale fossil fuel and biomass generation, and hydroelectric where available, to maintain a reserve to meet large, time-varied MW demand fluctuations.

The grid authority can have agreements with energy consumers to reduce demand on the grid, but executing these agreements typically requires two to four hours’ notice and a day-ahead warning. These agreements cannot deliver real-time demand adjustments.

The energy capacity / demand bar charts at the bottom of Figure 4 illustrates the capacity and demand balancing process within a grid system at different times of the day and year. They show several critical points for managing a grid with significant, intermittent wind and solar generating capacity.

In periods of high wind output, such as January and February 2022, fossil fuel generation is lower because it is only used to meet demand when CFE options cannot. During these periods of high wind output, organizations can approach or achieve 100% CFE use at costs close to grid costs.

Solar generation — confined to daylight hours — varies by the hour, day and season. As with wind generation, fluctuations in solar output need to be managed by dispatching batteries or reliable generation for short-term variations in solar intensity and longer reductions, such as cloudy days and at night.

For periods of low wind output (Figure 4, July 2022 average day and night), the grid deploys reliable fossil fuel assets to support demand. These periods of low wind output prevent facilities from economically achieving 100% 24×7 CFE.

Nuclear plants provide a reliable, continuous supply of baseload CFE. An energy retailer could blend wind, solar and nuclear output in some grid regions to enable 100% 24×7 CFE consumption at an operating facility.

Current nuclear generation technologies are designed to operate at a consistent MW level. They change their output over a day or more, not over minutes or hours. New advanced small modular reactor technologies adjust output over shorter periods to increase their value to the grid, but these will not be widely deployed for at least a decade.

For batteries to be considered a CFE asset, they must either be charged during high wind or solar output periods, or from nuclear plants. Software tools that limit charging to periods of CFE availability will be required.

To approach or achieve 100% CFE with only wind generation, data centers need to buy three to five times more MWs than they need. The Google Oklahoma (88% hourly CFE) and Iowa (97% hourly CFE) data centers (Figure 5) illustrate this. Power purchase contracts for the two locations (February 2023) show 1,012 MW of wind generation capacity purchased for the Oklahoma data center and 869 MW of wind capacity purchased for the Iowa data center. Assuming the current demand for these two data centers is 250 MW to 350 MW and 150 MW to 250 MW, respectively, the contracted wind capacity required to achieve the CFE percentages is three to five times the data center’s average operating demand.

Table 1 lists retail energy contracts with guaranteed percentages of 24×7 CFE publicized over the past three years. These contracts are available in many markets and can provide data center operators with reliable power with a guaranteed quantity of CFE. The energy retailer takes on the risk of managing the intermittent renewable output and assuring supply continuity. In some cases, nuclear power may be used to meet the CFE commitment.

One detail absent from publicly available data for these contracts is the electricity rate. Conversations with one energy retailer indicated that these contracts charge a premium to market retail rates — the higher the percentage of guaranteed 24×7 CFE, the higher the premium. Operators can minimize the financial risk and premiums of this type of retail contract by agreeing to longer terms and lower guarantees — for example 5- to 10-year terms and 50% to 70% guaranteed 24×7 CFE. For small and medium size data center operators, a 24×7 renewables retail contract with a 50% to 70% guarantee will likely provide electric rate certainty with premiums of 5% or less (see Table 1) and minimal financial risk compared with the power purchase agreement (PPA) approach used by Alcoa (Table 2).

Table 1. Examples of 24×7 renewable energy retail contracts

Table: Examples of 24x7 renewable energy retail contracts

Alcoa wind PPA contracts in Spain

Aluminum producer Alcoa Corporation recently procured 1.8 GW of wind PPA contracts for its smelter in Spain. Table 2 shows the financial and operational risks associated with an over-purchase of wind capacity. The smelter has 400 MW of energy demand. Alcoa estimates it would need at least 2 GW of wind capacity to achieve 100% CFE. The Wall Street Journal Alcoa contract information detailed in Table 2 indicates that with its 1.8 GW of PPAs, Alcoa is likely to reach more than 95% 24×7 CFE. The purchase had two objectives: to stabilize the facility’s long-term electricity cost, and to produce near carbon-free aluminum.

The contracted quantity of wind power is 4.5 times the facility’s power demand, so Alcoa has estimated that to get close to 100% 24×7 CFE, it needs a high level of overcapacity, based on its modelling of wind farm output. In practice, Alcoa will likely have to buy some power on the spot market when wind output is low, but the overall emissions of the power it uses will be minimal.

Table 2 details the estimated total MWh of generation and financial implications of the PPAs. Assuming the wind farms have a capacity factor (CF) of 0.45, the PPA contracts secure more than twice as much electricity as the smelter consumes. This excess 3.6 million MWh will be sold into the electricity spot market and Alcoa’s profits or losses under the PPAs will be determined by the price at the time of generation.

The LevelTen Energy Q3 2022 PPA Price Index report was consulted for the P25 electricity rate (the rate at which 25% of available PPA contracts have a lower electricity rate) to estimate the rate of the signed PPAs: €75/MWh was selected. Two EU-wide average electricity rates, taken from the Ember European power price tracker, were chosen to bracket the potential profits and losses associated with the contracts for January 2021 (€50/MWh) and December 2022 (€200/MWh).

Table 2. Projection of power generation and economic returns for the Alcoa plant wind PPAs

Table: Projection of power generation and economic returns for the Alcoa plant wind PPAs

The contracts have significant rewards and risks. If the December 2022 rate remains stable for a year, the agreement will generate €887 million in profit for Alcoa. Conversely, if the January 2021 price had remained stable for a year, Alcoa would have lost €180 million. The wind farms are slated to go online in 2024. The Alcoa plant will restart when the wind farms start to supply electricity to the grid. Current electricity rate projections suggest that the contracts will initially operate with a profit. However, profits are not guaranteed over the 10- to 20-year life of the PPAs.

Real-time spot market electricity rates approach zero over time as the total installed wind and solar generation capacity increases within a grid region. Meeting EU government commitments for a carbon-free grid requires significant wind and solar overcapacity on the overall grid. As excess generation capacity is deployed to the grid to meet these commitments, periods of power overproduction will increase, which is likely to depress spot market prices. There is a strong probability that Alcoa’s contracts will generate losses in their later years as the electricity grid moves toward being carbon free by 2035.

The wind PPAs will provide Alcoa with two near-term revenue generators. First, Alcoa could sell its estimated 3.6 GWh of excess GOs. Bundled or unbundled GOs from its excess electricity generation should be in high demand from data center operators and other enterprises with 2030 net-zero carbon commitments. Second, it could sell the electricity itself. Selling at 2022 year-end rates for Nordic hydropower GOs (€10/MWh) would realize a profit of €36 million.

Low or zero carbon aluminum will also be in high demand and command premium prices as companies seek to decarbonize their facilities or products. While the premium is uncertain, it will add to the benefits of wind power purchases and improve the economics of Alcoa operations. The Alcoa PPA contracts have many upsides, but the Alcoa CFO faces a range of possible financial outcomes inherent in this CFE strategy.

Conclusions

Deploying wind and solar generation overcapacity creates broader operational and financial challenges for grid regions. As overcapacity increases, the use of reliable fossil fuel and nuclear generation assets to maintain grid stability will decrease. As a result, these assets may not generate sufficient revenue to cover their operating and financing costs, which will force them to close. Some nuclear generation facilities in the US have already been retired early because of this.

Grid authorities, utility regulators and legislative bodies need to address these challenges. They need to plan grid capacity: this includes evaluating the shape curves of wind and solar MWh output to determine the quantities of reliable generation required to maintain grid reliability. They should target incentives at developing and deploying clean energy technologies that can boost capacity to meet demand — such as long duration energy storage, small modular nuclear reactors and hydrogen generation. Without sufficient quantities of reliable generation assets, grid stability will slowly erode, which will put data center operations at risk and potentially increase the runtime of backup generation systems beyond their permitted limits.

Cloud resiliency: plan to lose control of your planes

June 14, 2023/in Design, Executive, Operations/by Dr. Owen Rogers, Senior Research Director for Cloud Computing, Uptime Institute, orogers@uptimeinstitute.com

Cloud providers divide the technologies that underpin their services into two ”planes”, each with a different architecture and availability goal. The control plane manages resources in the cloud; the data plane runs the cloud buyer’s application.

In this Update, Uptime Institute Intelligence presents research that shows control planes have poorer availability than data planes. This presents a risk to applications built using cloud-native architectures, which rely on the control plane to scale during periods of intense demand. We show how overprovisioning capacity is the primary way to reduce this risk. The downside is an increase in costs.

Data and control planes

An availability design goal is an unverified claim of service availability that is neither guaranteed by the cloud provider nor independently confirmed. Amazon Web Services (AWS), for example, states 99.99% and 99.95% availability design goals for many of its services’ data planes and control planes. Service level agreements (SLAs), which refund customers a proportion of their expenditure when resources are not available, are often based on these design goals.

Design goals and SLAs differ between control and data planes because each plane performs different tasks using different underlying architecture. Control plane availability refers to the availability of the mechanism it uses to manage and control services, such as:

Creating a virtual machine or other resource allocation on a physical server.
Provisioning a virtual network interface on that resource so it can be accessed over the network.
Installing security rules on the resource and setting access controls.
Configuring the resource with custom settings.
Hosting an application programming interface (API) endpoint so that users and code can programmatically manage resources.
Offering a management graphical user interface for operations teams to administrate their cloud estates.

The data plane availability refers to the availability of the mechanism that executes a service, such as:

Routing packets to and from resources.
Writing and reading to and from disks.
Executing application instructions on the server.

In practice, such separation means a control plane could be unavailable, preventing services from being created or turned off, while the data plane (and, therefore, the application) continues to operate.

Data plane and control plane issues can impact business, but a data plane problem is usually considered more significant as it would immediately affect customers already using the application. The precise impact of a data plane or control plane outage depends on the application architecture and the type of problem.

Measuring plane availability

Providers use the term “region” to describe a geographical area that contains a collection of independent availability zones (AZs), which are logical representations of data center facilities. A country may have many regions, and each region typically has two or three AZs. Cloud providers state that users must architect their applications to be resilient by distributing resources across AZs and/or regions.

To compare the respective availability of resilient cloud architectures (see Comparative availabilities of resilient cloud architectures), Uptime Institute used historical cloud provider status updates from AWS, Google Cloud and Microsoft Azure to determine historical availabilities for a simple load balancer application deployed in three architectures:

Two virtual machines deployed in the same AZ (“single-zone”).
Two virtual machines deployed in different AZs in the same region (“dual-zone”).
Two virtual machines deployed in different AZs in different regions (“dual-region”).

Figure 1 shows the worst-case availabilities (i.e., the worst availability of all regions and zones analyzed) for the control planes and data planes in these architectures.

diagram: Worst-case historical availabilities by architecture — **Figure 1. Worst-case historical availabilities by architecture**

Unsurprisingly, even in the worst-case scenario, a dual-region architecture is the most resilient, followed by a dual-zone architecture.

The data plane has a significantly higher availability than the control plane. In the dual region category, the data plane had an availability of 99.96%, equivalent to 20 minutes of monthly downtime. The control plane had an availability of 99.80%, equal to 80 minutes of downtime — four times that of the data plane. This difference is to be expected considering the different design goals and SLAs associated with control and data planes. Our research found that control plane outages do not typically happen at the same time as data plane outages.

However, availability in the control plane is far more consistent than in the data plane and isn’t greatly affected by choice of architecture. This consistency is because the cloud provider manages availability and resiliency — the control plane effectively acts as a platform as a service (PaaS). Suppose an organization hosts an application in a single zone. In that case, the application programming interfaces (APIs) — and the management interface used to administer that application — are managed by the cloud provider and hosted in multiple zones and regions, despite the application only being hosted in a single zone.

The warning for cloud customers here is that the control plane is more likely to be a point of failure than the application, and little can be done to make the control plane more resilient.

Assessing the risk

The crucial question organizations should seek to answer during risk assessments is, “What happens if we cannot add or remove resources to our application?”

For static applications that can’t scale, a control plane outage is often more of an inconvenience than a business-critical problem. Some maintenance tasks may be delayed until the outage is over, but the application will continue to run normally.

For cloud-native scalable applications, the risk is more considerable. A cloud-native application should be able to scale up and down dynamically, depending on demand, which involves creating (or terminating) a resource and configuring it to work with the application that is currently executing.

Scaling capacity in response to application demand is typically done using one or a combination of three methods:

Automatically, using cloud provider autoscaling services that detect a breach of a threshold of a metric (e.g., CPU utilization).
Automatically from an application, where code communicates with a cloud provider’s API.
Manually, by an administrator using a management portal or API.

The management portal, the API, the autoscaler service, and the creation or termination of the resource may all be part of the control plane.

If the application has been scaled back to cut costs but faces increased demand, any of these mechanisms can increase capacity. But if they have failed, the application will continue to run but will not accommodate the additional demand. In this scenario, the application may continue to service existing end users satisfactorily, but additional end users may have to wait. In the worst case, the application might fail catastrophically because of oversubscription.

Addressing the risk

The only real redress for this risk is to provision a buffer of capacity on the services that support the application. If a demanding period requires more resources, the application can immediately use this buffer, even if the control plane is out of service.

Having a buffer is a sensible approach for many reasons. Suppose other issues cause delays in provisioning capacity, for example an AZ fails or increased demand in a region causes a lack of capacity. A buffer will prevent a time lag between end-user demand and resource access.

The question is how much buffer capacity to provision. A resilient architecture often automatically includes suitable buffer capacity because the application is distributed across AZs or regions. If an application is split across, say, two AZs, it is probably designed with enough buffer on each virtual machine to enable the application to continue seamlessly if one of the zones is lost. Each virtual machine has a 50% load, with the unused 50% available if the other AZ suffers an outage. In this case, no new resources can be added if the control plane fails, but the application is already operating at half load and has excess capacity available. Of course, if the other AZ fails at the same time as a control plane outage, the remaining zone will face a capacity squeeze.

Similarly, databases may be deployed across multiple AZs using an active-active or active-failover configuration. If an AZ fails, the other database will automatically be available to support transactions, regardless of the control plane functionality.

Organizations need to be aware that this is their risk to manage. The application must be architected for control and data plane failures. Organizations also need to be mindful that there is a cost implication. As detailed in our Comparative availabilities of resilient cloud architectures report, deploying across multiple zones increases resiliency, but also increases costs by 43%. Carbon emissions, too, rise by 83%. Similarly, duplicating databases across zones increases availability — but at a price.

Summary

Organizations must consider how their applications will perform if a control plane failure prevents them adding, terminating or administrating cloud resources. The effect on static applications will be minimal because such applications can’t scale up and down in line with changing demand. However, the impact on cloud-native applications may be substantial because the application may struggle under additional demand without the capacity to scale.

The simplest solution is to provide a buffer of unused capacity to support unexpected demand. If additional capacity can’t be added because of a control plane outage, the buffer can meet additional demand in the interim.

The exact size of the buffer depends on the application and its typical demand pattern. However, most applications will already have a buffer built in so they can respond immediately to AZ or region outages. Often, this buffer will be enough to manage control plane failure risks.

Organizations face a tricky balancing act. Some end users might not get the performance they expect if the buffer is too small. If the buffer is too large, the organization pays for capacity it doesn’t need.

Server efficiency increases again — but so do the caveats

June 7, 2023/in Design, Executive, Operations/by Daniel Bizo, Research Director, Uptime Institute Intelligence, dbizo@uptimeinstitute.com

Early in 2022, Uptime Intelligence observed that the return of Moore’s law in the data center (or, more accurately, the performance and energy efficiency gains associated with it) would come with major caveats (see Moore’s law resumes — but not for all). Next-generation server technologies’ potential to improve energy efficiency, Uptime Intelligence surmised at the time, would be unlocked by more advanced enterprise IT teams and at-scale operators that can concentrate workloads for high utilization and make use of new hardware features.

Conversely, when servers built around the latest Intel and AMD chips do not have enough work, energy performance can deteriorate due to higher levels of idle server power. Following the recent launch of new server processors, performance data confirms this, which suggests that a rethink on the traditional assumptions relating to server refresh cycles is needed.

Server efficiency back on track

First, the good news: new processor benchmarks confirm that best-in-class server energy efficiency is back in line with historical trends, following a long hiatus in the late 2010s.

The latest server chips from AMD (codenamed Genoa) deliver a major jump in core density (they integrate up to 192 cores in a standard dual-socket system) and energy efficiency potential. This is largely due to a step change in manufacturing technology by contract chipmaker Taiwan Semiconductor Manufacturing Company (TSMC). Compared with server technology from four to five years ago, this new crop of chips offers four times the performance at more than twice the energy efficiency, as measured by the Standard Performance Evaluation Corporation (SPEC). The SPEC Power benchmark simulates a Java-based transaction processing logic to exercise processors and the memory subsystem, indicating compute performance and efficiency characteristics. Over the past 10 years, mainstream server technology has become six times more energy efficient based on this metric (Figure 1).

diagram: Best-in-class server energy performance (long-term trend) — **Figure 1. Best-in-class server energy performance (long-term trend)**

This brings server energy performance back onto its original track before it was derailed by Intel’s protracted struggle to develop its then next-generation manufacturing technology. Although Intel continues to dominate server processor shipments due to its high manufacturing capacity, it is still fighting to recover from the crises it created nearly a decade ago (see Data center efficiency stalls as Intel struggles to deliver).

Now, the bad news, too, is that server efficiency is once again following historical trends. Although the jump in performance with the latest generation of AMD server processors is sizeable, the long-term slowdown in performance improvements continues. The latest doubling of server performance took about five years. In the first half of the 2010s it took about two years, and far less than two years between 2000 and 2010. Advances in chip manufacturing have slowed for both technical and economic reasons: the science and engineering challenges behind semiconductor manufacturing are extremely difficult and the costs are huge.

An even bigger reason for the drop in development pace, however, is architectural: diminishing returns from design innovation. In the past, the addition of more cores, on-chip integration of memory controllers and peripheral device controllers all radically improved chip performance and overall system efficiency. Then server engineers increased their focus on energy performance, resulting in more efficient power supplies and other power electronics, as well as energy-optimized cooling via variable speed fans and better airflow. These major changes have reduced much of the server energy waste.

One big-ticket item in server efficiency remains: to tackle the memory (DRAM) performance and energy problems. Current memory chip technologies don’t score well on either metric — DRAM latency worsens with every generation, while energy efficiency (per bit) barely improves.

Server technology development will continue apace. Competition between Intel and AMD is energizing the market as they vie for the business of large IT buyers that are typically attracted to the economics of performant servers carrying ever larger software payloads. Energy performance is a definitive component of this. However, more intense competition is unlikely to overcome the technical boundaries highlighted above. While generic efficiency gains (lacking software-level optimization) from future server platforms will continue, the average pace of improvement is likely to slow further.

Based on the long-term trajectory, the next doubling in server performance will take five to six years, boosted by more processor cores per server, but partially offset by the growing consumption of memory power. Failing a technological breakthrough in semiconductor manufacturing, the past rates of server energy performance gains will not return in the foreseeable future.

Strong efficiency for heavy workloads

As well as the slowdown in overall efficiency gains, the profile of energy performance improvements has shifted from benefiting virtually all use cases towards favoring, sometimes exclusively, larger workloads. This trend began several years ago, but AMD’s latest generation of products offers a dramatic demonstration of its effect.

Based on the SPEC Power database, new processors from AMD offer vast processing capacity headroom (more than 65%) compared with previous generations (AMD’s 2019 and 2021 generations codenamed, respectively, Rome and Milan). However, these new processors use considerably more server power, which mutes overall energy performance benefits (Figure 2). The most significant opportunities offered by AMD’s Genoa processor are for aggressive workload consolidation (footprint compression) and running scale-out workloads such as large database engines, analytics systems and high-performance computing (HPC) applications.

diagram: Performance and power characteristics of recent AMD and Intel server generations — **Figure 2. Performance and power characteristics of recent AMD and Intel server generations**

If the extra performance potential is not used, however, there is little reason to upgrade to the latest technology. SPEC Power data indicates that the energy performance of previous generations of AMD-based servers is often as good — or better — when running relatively small workloads (for example, where the application’s scaling is limited, or further consolidation is not technically practical or economical).

Figure 2 also demonstrates the size of the challenge faced by Intel: at any performance level, the current (Sapphire Rapids) and previous (Ice Lake) generations of Intel-based systems use more power — much more when highly loaded. The exception to this rule, not captured by SPEC Power, is when the performance-critical code paths of specific software are heavily optimized for an Intel platform, such as select technical, supercomputing and AI (deep neural network) applications that often take advantage of hardware acceleration features. In these specific cases, Intel’s latest products can often close the gap or even exceed the energy performance of AMD’s processors.

Servers that age well

The fact that the latest generation of processors from Intel and AMD is not more efficient than the previous generation is less of a problem. The greater problem is that server technology platforms released in the past two years do not offer automatic efficiency improvements over the enormous installed base of legacy servers, which are often more than five or six years old.

Figure 3 highlights the relationship between performance and power where, without a major workload consolidation (2:1 or higher), the business case to upgrade old servers (see blue crosses) remains dubious. This area is not fixed and will vary by application, but approximately equals the real-world performance and power envelope of most Intel-based servers released before 2020. Typically, only the best performing, highly utilized legacy servers will warrant an upgrade without further consolidation.

diagram: Many older servers remain more efficient at running lighter loads — **Figure 3. Many older servers remain more efficient at running lighter loads**

For many organizations, the processing demand in their applications is not rising fast enough to benefit from the performance and energy efficiency improvements the newer servers offer. There are many applications that, by today’s standards, only lightly exercise server resources, and that are not fully optimized for multi-core systems. Average utilization levels are often low (between 10% and 20%), because many servers are sized for expected peak demand, but spend most of their time in idle or reduced performance states.

Counterintuitively, perhaps, servers built with Intel’s previous processor generations in the second half of the 2010s (largely since 2017, using Intel’s 14-nanometer technology) tend to show better power economy when used lightly, than their present-day counterparts. For many applications this means that, although they use more power when worked hard, they can conserve even more in periods of low demand.

Several factors may undermine the business case for the consolidation of workloads onto fewer, higher performance systems: these include the costs and risks of migration, threats to infrastructure resiliency, and incompatibility (if the software stack is not tested or certified for the new platform). The latest server technologies may offer major efficiency gains to at-scale IT services providers, AI developers and HPC shops, but for enterprises the benefits will be fewer and harder to achieve.

The regulatory drive for data center sustainability will likely direct more attention to IT’s role soon — the lead example being the proposed EU’s recast of the Energy Efficiency Directive. Regulators, consultants and IT infrastructure owners will want a number of proxy metrics for IT energy performance, and the age of a server is an intuitive choice. Data strongly indicates this form of simplistic ageism is misplaced.

24×7 carbon-free energy (part two): getting to 100%

Increase grid region interconnections

Reliable CFE generation

Long duration energy storage

Conclusions

Data center operators will face more grid disturbances

Dispatchable versus intermittent

Grids are showing their age

Weather is changing

Geopolitical tensions create fuel supply risks

Being prepared for utility power problems

It will get worse before it gets better

24×7 carbon-free energy (part one): expectations and realities

Renewable energy intermittency

The mechanics of a 24×7 CFE strategy

Alcoa wind PPA contracts in Spain

Conclusions

Cloud resiliency: plan to lose control of your planes

Data and control planes

Measuring plane availability

Assessing the risk

Addressing the risk

Summary

Server efficiency increases again — but so do the caveats

Server efficiency back on track

Strong efficiency for heavy workloads

Servers that age well

Explaining the Uptime Institute’s Tier Classification System (April 2021 Update)

The Making of a Good Method of Procedure

A Look at Data Center Cooling Technologies

Data Center Cooling: CRAC/CRAH redundancy, capacity, and selection metrics

Implementing Data Center Cooling Best Practices