Lifting and shifting apps to the cloud: a source of risk creep?

Public cloud infrastructures have come a long way over the past 16 years to slowly earn the trust of enterprises in running their most important applications and storing sensitive data. In the Uptime Institute Global Data Center Survey 2022, more than a third of enterprises that operate their own IT infrastructure said they also placed some of their mission-critical workloads in a public cloud.

This gradual change in enterprises’ posture, however, can only be partially attributed to improved or more visible cloud resiliency. An equal, or arguably even bigger, component in this shift in attitude is enterprises’ willingness to make compromises when using the cloud, which includes sometimes accepting less resilient cloud data center facilities. However, a more glaring downgrade lies in the loss of the ability to configure IT hardware specifically for sensitive business applications.

In more traditional, monolithic applications, both the data center and IT hardware play a central role in their reliability and availability. Most critical applications that predate the cloud era depend heavily on hardware features because they run on a single or a small number of servers. By design, more application performance meant bigger, more powerful servers (scaling up as opposed to scaling out), and more reliability and availability meant picking servers engineered for mission-critical use.

In contrast, cloud-native applications should be designed to scale across tens or hundreds of servers, with the assumption that the hardware cannot be relied upon. Cloud providers are upfront that customers are expected to build in resiliency and reliability using software and services.

But such architectures are complex, may require specialist skills and come with high software management overheads. Legacy mission-critical applications, such as databases, are not always set up to look after their reliability on their own without depending on hardware and operating system / hypervisor mechanisms. To move such applications to the cloud and maintain their reliability, organizations may need to substantially refactor the code.

This Uptime Update discusses why organizations that are migrating critical workloads from their own IT infrastructure to the cloud will need to change their attitudes towards reliability to avoid creating risks.

Much more than availability

The language that surrounds infrastructure resiliency is often ambiguous and masks several interrelated but distinct aspects of engineering. Historically, the industry has largely discussed availability considerations around the public cloud, which most stakeholders understand as not experiencing outages to their cloud services.

In common public cloud parlance, availability is almost always used interchangeably with reliability. When offering advice on their reliability features or on how to architect cloud applications for reliability, cloud providers tend to discuss almost exclusively what falls under the high-availability engineering discipline (e.g., data replication, clustering and recovery schemes). In the software domain, physical and IT infrastructure reliability may be conflated with site reliability engineering, which is a software development and deployment framework.

These cross over in two significant ways. First, availability objectives, such as the likelihood that the system is ready to operate at any given time, are only a part of reliability engineering — or rather, one of its outcomes. Reliability engineering is primarily concerned with the system’s ability to perform its function free of errors. It also aims to suppress the likelihood that failures will affect the system’s health. Crucially, this includes the detection and containment of abnormal operations, such as a device making mistakes. In short, reliability is the likelihood of producing correct outputs.

For facilities, this typically translates to the ability to deliver conditioned power and air — even during times of maintenance and failures. For IT systems, reliability is about the capacity to perform compute and storage jobs without errors in calculations or data.

Second, the reliability of any system builds on the robustness of its constituent parts, which include the smallest components. In the cloud, however, the atomic unit of reliability that is visible to customers is a consumable cloud resource, such as a virtual machine or container, and more complex cloud services, such as data storage, network and an array of application interfaces.

Today, enterprises not only have limited information on the physical infrastructure resiliency of cloud data centers (whether topology or maintenance and operations practices), but also have little visibility of, and no choice in, the reliability features of the IT hardware and infrastructure software that underpin cloud services.

Engineering for reliability: a lost art?

This abstraction of hardware resources is a major departure from the classical infrastructure practices for IT systems that run mission-critical business and industrial applications. Server reliability greatly depends on the architectural features that detect and recover from errors occurring in processors and memory chips, often with the added help of the operating system.

Typical examples of errors include soft bit flips (transient bit errors, typically caused by an anomaly) and hard bit flips (permanent faults) in memory cell arrays. Bit errors can occur both within the processor and in external memory banks. Other examples are operational errors and design bugs in processor logic that can produce incorrect outputs or crash software.

For much of its history, the IT industry has gone to great and costly lengths to design mission-critical servers (and storage systems) that can be trusted to manage data and perform operations as intended. The engineering discipline addressing server robustness is generally known as reliability, availability and serviceability (RAS, which was originally coined by IBM five decades ago), with the serviceability aspect referring to maintenance and upgrades, including software, without causing any downtime.

Traditional examples of these servers include mainframes, UNIX-based and other proprietary software and hardware systems. However, in the past couple of decades x86-based mission-critical systems, which are distinct from volume servers in their RAS features, have also taken hold in the market.

What sets mission-critical hardware design apart is its extensive reliability mechanisms: error detection, correction and recovery capabilities that go beyond those found in mainstream hardware. While perfect resistance to errors is not possible, such features greatly reduce the chances of errors and software crashes.

Mission-critical systems tend to be able to isolate a range of faulty hardware components without resulting in any disruption. These components include memory chips (the most common source of data integrity and system stability issues), processor units or entire processors, or even an entire physical partition of a mission-critical server. Often, critical memory contents are mirrored within the system across different banks of memory to safeguard against hardware failures.

Server reliability doesn’t end with design, however. Vendors of mission-critical servers and storage systems test the final version of any new server platform for many months to ensure it performs correctly (a process known as validation) before volume manufacturing begins.

Entire sectors, such as financial services, e-commerce, manufacturing, transport and more, have come to depend on and trust such hardware for the correctness of their critical applications and data.

Someone else’s server, my reliability

Maintaining a mission-critical level of infrastructure reliability in the cloud (or even just establishing underlying infrastructure reliability in general), as opposed to “simple” availability, is not straightforward. Major cloud providers don’t address the topic of reliability in much depth to begin with.

It is difficult to know what techniques, if any, cloud operators deploy to safeguard customer applications against data corruption and application failures beyond the use of basic error correction code in memory, which can only handle random, single-bit errors. Currently, no hyperscale cloud instances offer enhanced RAS features comparable to mission-critical IT systems.

While IBM and Microsoft both offer migration paths directly for some mission-critical architectures, such as IBM Power and older s390x mainframes, their focus is on the modernization of legacy applications rather than maintaining reliability and availability levels that are comparable to on-premises systems. The reliability on offer is even less clear when it comes to more abstracted cloud services, such as software as a service and database as a service offerings or serverless computing.

Arguably, the future of reliability lies with software mechanisms. In particular, the software stack needs to adapt by getting rid of its dependency on hardware RAS features, whether this is achieved through verifying computations, memory coherency or the ability to remove and add hardware resources.

This puts the onus of RAS engineering almost solely on the cloud user. For new critical applications, purely software-based RAS by design is a must. However, the time and costs of refactoring or rearchitecting an existing mission-critical software stack to verify results and handle hardware-originating errors are not trivial — and are likely to be prohibitive in many cases, if possible.

Without the assistance of advanced RAS features in mission-critical IT systems, performance, particularly response times, will also likely take a hit if the same depth of reliability is required. At best, this means the need for more server resources to handle the same workload because the software mechanisms for extensive system reliability features will carry a substantial computational and data overhead.

These considerations should temper the pace at which mission-critical monolithic applications migrate to the cloud. Yet, these arguments are almost academic. The benefits of high reliability are difficult to quantify and compare (even more so than availability), in part because it is counterfactual — it is hard to measure what is being prevented.

Over time, cloud operators might invest more in generic infrastructure reliability and even offer products with enhanced RAS for legacy applications. But software-based RAS is the way forward in a world where hardware has become generic and abstracted.

Enterprise decision-makers should at least be mindful of the reliability (and availability) trade-offs involved with the migration of existing mission-critical applications to the cloud, and budget for investing in the necessary architectural and software changes if they expect the same level of service that an enterprise IT infrastructure can provide.

Use tools to control cloud costs before it’s too late

The public cloud’s on-demand pricing model is vital in enabling application scalability — the key benefit of cloud computing. Resources need to be readily available for a cloud application to scale when required without the customer having to give advance notification. Cloud providers can offer such flexibility by allowing customers to pay their bills in arrears based on the number of resources consumed during a specified period.

This flexibility does have a downside, however. If more resources are consumed than expected due to increased demand or configuration errors, the organization is still liable to pay for them — it is too late to control costs after the fact. A total of 42% of respondents to the Uptime Institute Data Center Capacity Trends Survey 2022 cited escalating costs as the top reason for moving workloads from the public cloud back to on-premises infrastructure. Chief information officers face a tricky balancing act when allowing applications to scale to meet business objectives without letting budgets spiral out of control.

This Uptime Intelligence Update summarizes the challenges that cloud customers face when forecasting, controlling and optimizing costs. It also provides simple steps that can help buyers take control of their spend. As organizations face increasing macroeconomic pressures, reducing cloud expenditure has never been more important (see Cloud migrations to face closer scrutiny).

Cloud complexity

A cloud application is usually architected from multiple cloud services, such as virtual machines, storage platforms and databases. Each cloud service has its own metrics for billing. For example, customers may be charged for storage services based on the amount of storage used, the number of transactions made, and the bandwidth consumed between the storage service and the end user. The result is that even a simple bill for a cloud application will have many different charges spread across various services.
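
To make this concrete, the sketch below (in Python) estimates one month’s bill for a single hypothetical storage service. The three billing dimensions and all unit prices are illustrative assumptions, not any provider’s actual price list.

```python
# Illustrative only: estimates a month's bill for a hypothetical object storage
# service metered on three dimensions. All unit prices are assumptions.

UNIT_PRICES = {
    "storage_gb_month": 0.023,   # $ per GB stored per month (assumed)
    "requests_per_1000": 0.005,  # $ per 1,000 transactions (assumed)
    "egress_gb": 0.09,           # $ per GB transferred out (assumed)
}

def storage_bill(gb_stored: float, requests: int, egress_gb: float) -> float:
    """Sum the charges across the service's separate billing metrics."""
    return (
        gb_stored * UNIT_PRICES["storage_gb_month"]
        + (requests / 1000) * UNIT_PRICES["requests_per_1000"]
        + egress_gb * UNIT_PRICES["egress_gb"]
    )

# 2 TB stored, 5 million requests and 500 GB of egress in the month
print(f"${storage_bill(2048, 5_000_000, 500):,.2f}")
```

Even this single service produces three separate line items; a real application spreads such charges across many services.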

Controlling, forecasting and optimizing the costs of cloud-native applications (i.e., applications built for the cloud that can scale automatically) is challenging for several reasons:

  • Consumption is not always under the customer’s control. For example, many end users might upload data to a storage platform — thus increasing the customer’s bill — without the customer being aware until the end of the billing period.
  • Each service has many metrics to consider and an application will typically use multiple cloud services. Each cloud provider measures the consumption of their service in different ways; there is no standard approach.
  • Metrics are not always related to tangible units that are easy for the customer to predict. For example, a specific and unknown type of transaction may generate a cost on a database platform; however, the customer may have no understanding or visibility of how many of these transactions will be executed in a certain period.
  • Applications may scale up by accident due to errant code or human error and use resources without purpose. Similarly, applications may not scale down when able (reducing costs) due to incorrect configuration.

Conversely, applications that don’t scale, such as those lifted and shifted from on-premises locations, are generally predictable and stable in capacity terms. However, without the ability to scale down (and reduce costs), infrastructure expenses are not always as low as hoped (see Asset utilization drives cloud repatriation economics).

A sudden and unexpected increase in a monthly cloud bill is often described as “bill shock,” a term coined initially for unexpectedly large consumer phone bills. Is bill shock a problem? Not necessarily. If an application is scaled to derive more revenue from the end users, for example, then paying more for the underlying infrastructure is not an issue. But although applications may be designed to be scalable, organizations and budgets are not. An IT department might generate new revenue for the organization from spending more on infrastructure — but if the department has a fixed budget, the chief financial officer might not understand why costs have increased. Most organizations would not report the cost of cloud services against any revenue created by the investment in those services — to senior management, cloud services may appear to be an expense rather than a value-generating activity.

The complexity of the situation has led to the creation of an open-source project, the FinOps Foundation. The foundation describes FinOps (a portmanteau of finance and operations) as a “financial management discipline and cultural practice that enables organizations to get maximum business value by helping engineering, finance, technology and business teams to collaborate on data-driven spending decisions.” At a high level, the foundation describes six principles to effectively manage cloud costs:

  • Teams need to collaborate.
  • Decisions should be driven by the business value of the cloud.
  • Everyone needs to take ownership of their cloud usage.
  • FinOps data should be accessible and timely.
  • FinOps needs to be driven by a centralized team.
  • Organizations should take advantage of the variable cost model of the cloud.

The need for a foundation dedicated to cloud finance demonstrates the complexity of managing cloud costs effectively. Fully executing the foundation’s six key steps requires substantial investment and motivation — and many organizations will need expert assistance in this endeavor.

Taking charge

There are some simple steps organizations can take to control their public cloud costs, most of which relate to the foundation’s six principles:

Set alerts to warn of overspending

All cloud providers allow customers to set custom spend alerts, which warn when a cost threshold has been reached. Such alerts enable the budget holders to determine if the spend is justified, if further funding should be sought, or if the spending is accidental and needs to be curtailed. Setting alerts is the minimal step all organizations should take to control their cloud expenditures. Organizations should ensure that alerts are configured and sent to a valid mailbox, phone number or event management system.
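
The underlying logic is simple, as the minimal sketch below shows. It assumes a hypothetical helper that returns month-to-date spend from the provider’s billing API and a local mail relay for notifications; it only illustrates what providers’ native budget alerts already do.

```python
# Minimal spend-alert sketch. get_month_to_date_spend() is a hypothetical helper
# that would query the provider's billing API; providers offer this natively.

import smtplib
from email.message import EmailMessage

MONTHLY_BUDGET_USD = 10_000
ALERT_THRESHOLDS = [0.5, 0.8, 1.0]  # warn at 50%, 80% and 100% of budget

def check_spend(month_to_date_spend: float) -> None:
    """Send an alert for every budget threshold the month-to-date spend has crossed."""
    for threshold in ALERT_THRESHOLDS:
        if month_to_date_spend >= MONTHLY_BUDGET_USD * threshold:
            send_alert(
                f"Cloud spend ${month_to_date_spend:,.0f} has passed "
                f"{threshold:.0%} of the ${MONTHLY_BUDGET_USD:,} budget"
            )

def send_alert(text: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = "Cloud spend alert"
    msg["From"] = "alerts@example.com"       # placeholder address
    msg["To"] = "budget-holder@example.com"  # must be a monitored mailbox
    msg.set_content(text)
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)
```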

Use free tools to forecast month-to-month consumption

Most cloud providers include tools to forecast future spending based on past performance. These tools aren’t perfect, but they do give some visibility into how an application consumes resources over time, for free. It’s better to inform leadership in advance if costs are expected to rise rather than after the bill is due.
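
A basic version of such a forecast is straightforward, as in the sketch below, which linearly extrapolates month-to-date daily spend to the end of the month; provider tools use more sophisticated models.

```python
# Naive month-end forecast: linear extrapolation of daily spend observed so far.

import calendar
from datetime import date

def forecast_month_end(daily_spend: list[float], today: date) -> float:
    """Extrapolate month-to-date spend to the end of the current month."""
    days_elapsed = len(daily_spend)
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    average_per_day = sum(daily_spend) / days_elapsed
    return average_per_day * days_in_month

# Ten days of spend observed so far this month (illustrative figures)
spend_so_far = [310, 295, 330, 400, 385, 290, 305, 350, 420, 390]
print(f"Projected month-end bill: ${forecast_month_end(spend_so_far, date.today()):,.0f}")
```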

Work with stakeholders to determine future needs

Ensure that all parts of the business that use the public cloud understand how costs may change. For example, a new product launch, sale or event may increase the use of a website, which increases costs. Knowing this in advance enables a more realistic forecast of future costs and an open discussion on who will pay.

Consider showback and chargeback models

In a showback model, the IT department shows individual departments and business units their monthly cloud spend. The idea is that they become more aware of how their decisions affect expenditure, which enables them to take steps to reduce it. In a chargeback model, IT invoices these departments for the cloud costs related to their applications. Each department is then responsible for its own costs and is obliged to justify the expenditure relative to the value gained (e.g., increased revenue, better customer satisfaction).

Showback can be set up relatively quickly by “tagging” resources appropriately with an owner and then using the cloud provider’s reporting tools to break down business owners’ spending. Chargeback is a more significant undertaking, which affects the culture and structure of a company — most non-IT teams may not have the understanding or appetite to be financially responsible for their IT bills.
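
As a rough illustration of showback, the sketch below aggregates line items from a hypothetical billing export by an “owner” tag; real export formats differ by provider, and the data shown is invented for the example.

```python
# Minimal showback sketch: aggregate billing line items by an "owner" tag.
# The export format and figures are hypothetical.

from collections import defaultdict

billing_export = [
    {"service": "vm", "cost": 1200.00, "tags": {"owner": "marketing"}},
    {"service": "storage", "cost": 310.50, "tags": {"owner": "marketing"}},
    {"service": "database", "cost": 2250.75, "tags": {"owner": "finance"}},
    {"service": "vm", "cost": 95.00, "tags": {}},  # untagged resource
]

def showback(line_items: list[dict]) -> dict[str, float]:
    """Total cost per owner tag; untagged resources are flagged for follow-up."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item["tags"].get("owner", "UNTAGGED")
        totals[owner] += item["cost"]
    return dict(totals)

for owner, total in showback(billing_export).items():
    print(f"{owner}: ${total:,.2f}")
```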

Take advantage of optimization tools

With an accurate forecast, organizations can use alternative pricing models to reduce their spend. These models give customers discounts of up to 70% compared with on-demand pricing in return for a commitment of up to three years or a minimum spend. Many cloud providers also offer spot instances, which provide cheap access to cloud resources on the understanding that this access can be terminated without warning. The best use of alternative pricing models will be discussed further in a future Uptime Intelligence Update. Most cloud providers offer tools that suggest alternative pricing models based on past performance. Such tools can also identify “orphaned” resources that cost money but don’t appear to be doing anything useful.
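
The trade-off can be illustrated with a simple comparison of on-demand and committed pricing. The 40% discount, hourly rate and utilization levels below are assumptions for illustration, not any provider’s actual terms.

```python
# Illustrative on-demand versus committed pricing comparison (assumed figures).
# Breakeven utilization equals one minus the discount (here 60%).

ON_DEMAND_RATE = 0.50        # $ per instance-hour (assumed)
COMMITTED_DISCOUNT = 0.40    # 40% off on-demand in return for a term commitment
HOURS_PER_MONTH = 730

def monthly_cost(utilization: float) -> tuple[float, float]:
    """Return (on-demand cost, committed cost) for one instance at a given utilization."""
    on_demand = ON_DEMAND_RATE * HOURS_PER_MONTH * utilization
    committed = ON_DEMAND_RATE * (1 - COMMITTED_DISCOUNT) * HOURS_PER_MONTH  # paid regardless of use
    return on_demand, committed

for utilization in (0.4, 0.6, 0.8, 1.0):
    od, com = monthly_cost(utilization)
    better = "commit" if com < od else "on-demand"
    print(f"{utilization:.0%} utilization: on-demand ${od:,.0f}, committed ${com:,.0f} -> {better}")
```

The point of an accurate forecast is to know which side of that breakeven a workload will sit on before signing a multi-year commitment.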

Security and governance practices prevent overspend

A well-tested application hosted in a secure cloud environment reduces the likelihood of things going wrong and costs increasing as a result. For example, organizations should use role-based access to ensure only those employees who need to create resources are permitted to do so. This prevents costly services from being set up and subsequently forgotten about. Similarly, cloud customers should take appropriate precautions to stop malicious scripts from executing in their environment and sending out large quantities of data that will increase bandwidth costs. IT teams should test code thoroughly before deployment to reduce the chance of accidental resource consumption.

Get help

Most hyperscale cloud providers, including Amazon Web Services, Google Cloud Platform, Microsoft Azure, Oracle Cloud, IBM Cloud and Alibaba Cloud, offer tools to aid cost forecasting, optimization and management. Smaller cloud providers are less likely to have these features, but they usually charge on fewer metrics and offer fewer services, which reduces complexity.

Some organizations use third-party platforms to track and optimize their spend. The key benefit of these platforms is that they can optimize across multiple cloud providers and are independent, which arguably provides a more unbiased view of costs. These platforms include Apptio Cloudability, Flexera, NetApp CloudCheckr, IBM Turbonomic and VMware CloudHealth.

There are also consultancies and managed service providers, such as Accenture, Deloitte and HCLTech, that integrate cost-optimization practices into organizations and optimize cloud costs on their customers’ behalf on an ongoing basis. The cost of not acting can be substantial: this analyst spent $4,000 on a bare-metal cloud server after accidentally leaving it running for two months. Without an alert set up, the analyst only became aware when the cloud provider posted an invoice to his home address. Organizations should check that warnings and limits are configured now, before it is too late. If cloud costs are a significant part of IT expenditure, expert advice is essential.

24×7 carbon-free energy (part two): getting to 100%

Digital infrastructure operators have started to refocus their sustainability objectives on 100% 24×7 carbon-free energy (CFE) consumption: using carbon-free energy for every hour of operation.

To establish a 24×7 CFE strategy, operators must track and control CFE assets and the delivery of energy to their data centers and use procurement contracts designed to manage the factors that affect the availability and cost of renewable energy / CFE megawatt-hours (MWh).

The first Update in this two-part series, 24×7 carbon-free energy (part one): expectations and realities, focused on the challenges that data center operators face as they look to increase the share of CFE consumed in their facilities. This second report outlines the steps that operators, grid authorities, utilities and other organizations need to take to enable data centers to be 100% powered by CFE. As previously discussed, data centers that approach 100% CFE using only wind and solar generation currently have to buy several times more generation capacity than they need.

Figure 1 illustrates the economic challenges of approaching 100% 24×7 CFE. To reach 50% to 80% CFE consumption, the price premium is likely to be 5% or less in markets with high CFE penetration (30% or more of the generated MWh). The levelized cost of electricity (LCOE) is the average net present cost of the electricity based on the cost of generation over the lifetime of an individual or group of generation facilities. Beyond 80% 24×7 CFE, the electricity rate escalates because not enough CFE generation and storage assets are available to provide a reliable electricity supply during periods of low wind and solar generation.
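
For reference, the conventional expression behind this definition of LCOE (a standard formulation, not specific to any study cited here) is:

```latex
% Conventional LCOE expression: discounted lifetime costs divided by
% discounted lifetime generation, for years t = 0..N at discount rate r.
\[
\mathrm{LCOE} = \frac{\sum_{t=0}^{N} \frac{I_t + O_t + F_t}{(1+r)^t}}{\sum_{t=0}^{N} \frac{E_t}{(1+r)^t}}
\]
% I_t: capital expenditure; O_t: operations and maintenance; F_t: fuel cost;
% E_t: electricity generated (MWh) in year t.
```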

To push toward 100% CFE consumption, operators will need to take actions — or support efforts — to increase grid region interconnects and to develop and deploy reliable, dispatchable, carbon-free generation and long duration energy storage (LDES) capacity. LDES are storage technologies that can store energy for extended time periods, discharge electricity continuously for one to 10 days or longer, and supply electricity at rates that are competitive with other generation technologies.

Figure 1. The cost of electricity as the percentage of 24×7 CFE approaches 100%

Diagram: The cost of electricity as the percentage of 24x7 CFE approaches 100%

Increase grid region interconnections

Wind and / or solar resources are prevalent in some regions and absent in others. In many cases, areas with productive wind and solar resources are distant from those with high electricity demand. Connecting abundant resources with high-demand areas requires the build-out of high voltage interconnects within and between grid regions and countries. Recent news and trade articles have detailed that the current lack of grid interconnections and flexibility to support more distributed generation assets is slowing the building and deployment of planned solar and wind generation facilities. Numerous published studies detail the high voltage distribution system buildouts needed to support 100% CFE in grid regions around the globe.

The untapped potential of inter-regional grid interconnections is illustrated by the excess generation capacity associated with Google’s wind power purchase agreements (PPAs) for its Oklahoma and Iowa data centers in the Midwest region of the US, where wind generation is high. Google has four data centers in the adjacent US Southeast region (where wind generation is low). These have a low percentage of 24×7 CFE so would benefit from access to the excess Iowa and Oklahoma wind capacity. However, the extra wind capacity cannot be transported from the Midwest Reliability Organization (MRO) grid region to the Southeast Reliability Corporation (SERC) because of a lack of high-voltage interconnection capacity (Figure 2).

Figure 2. Google wind and solar assets by grid region (US)

Diagram: Google wind and solar assets by grid region (US)

The buildout of these projects is complicated by permitting issues and financing details. Again, the Google example is instructive. The Clean Line, a 2 GW (gigawatt) high voltage transmission line between MRO and SERC, was proposed by a development consortium around 2013. The line would have enabled Google and other purchasers of wind power in the MRO region to transport excess wind generation to their data centers and other facilities in the SERC region. Clean Line may have also transported excess solar power from the SERC region to the MRO region.

The permitting process extended over four years or more, slowed by legal challenges from property owners and others. The developer had difficulties securing finance, because financiers required evidence of transportation contracts from MRO to SERC, and of PPAs with energy retailers and utilities in SERC. Generators in MRO would not sign transmission contracts without a firm construction schedule. Regulated utilities, the primary energy retailers in SERC, were hesitant to sign PPAs for intermittent renewable power because they would have to hold reliable generation assets in reserve to manage output variations. The developer built a lower capacity section of the line from MRO to the western edge of SERC, but the remainder of the planned system was shelved.

Reliable CFE generation

Complete decarbonization of the electricity supply will require deploying reliable carbon-free or low-carbon energy generation across the global energy grid. Nuclear and geothermal technologies are proven and potentially financially viable, but designs must be refreshed.

Nuclear generation is undergoing a redesign, with an emphasis on small modular reactors. These designs consist of defined, repeatable modules primarily constructed in central manufacturing facilities and assembled at the production site. They are designed to provide variable power output to help the grid match generation to demand. Like other systems critical to grid decarbonization, new reactor types are in the development and demonstration stage and will not be deployed at scale for a decade or more.

Geothermal generation depends on access to high-temperature zones underground. Opportunities for deployment are currently limited to areas such as Hawaii, where high-temperature subterranean zones are close to the surface.

Horizontal drilling technologies developed by the petroleum industry can access deeper high-temperature zones. However, the corrosive, high-pressure, and high-temperature conditions experienced by these systems present many engineering challenges. Several companies are developing and demonstrating installations that overcome these challenges.

Electrolytic hydrogen production is another technology that offers a means to deploy reliable electricity generation assets fueled with hydrogen. It has the advantage of providing a market for the wind and solar generation overcapacity discussed earlier in the report. The economics of electrolytic systems suffer from current conversion efficiencies of 75% or less and uncertain economic conditions. Production incentives provided by the US Inflation Reduction Act, and similar programs in Japan, Australia, the EU and the UK, are attracting initial investments and accelerating the installation of hydrogen generation infrastructure to demonstrate system capabilities and reduce unit costs.

These three technologies illustrate the challenges in increasing the reliable carbon-free energy capacity available on the electricity grid. This discussion is not an exhaustive review, but an example of the technical and manufacturing challenges and extended timelines required to develop and deploy these technologies at scale.

Long duration energy storage

Long duration energy storage (LDES) is another critical technology needed to move the electricity grid to 100% 24×7 CFE. Table 1 details various battery and physical storage technologies currently being developed. The table represents a general survey of technologies and is not exhaustive.

Table 1. Long duration energy storage (LDES) technologies

Table: Long duration energy storage (LDES) technologies

Each type of battery is designed to fill a specific operating niche, and each type is vital to attaining a low-carbon or carbon-free electricity grid. All these technologies are intended to be paired with wind and solar generation assets to produce a firm, reliable supply of CFE for a specified time duration.

  • 4-hour discharge duration: Lithium-ion batteries are a proven technology in this category, but they suffer from high costs, risk of fires, and deterioration of battery capacity with time. Other technologies with lower capital costs and better operating characteristics are under development. These batteries are designed to manage the short-term minute-to-minute, hour-to-hour variations in the output of wind and solar generation assets.
  • 8- to 12-hour discharge duration: These storage systems address the need for medium-duration, dispatchable power to fill in variations in wind and solar generation over periods beyond 4 hours. The batteries will provide continuous, quasi-reliable, round-the-clock power by charging during periods of excess wind and solar production and discharging during periods of no or low MWh output. These systems will rely on sophisticated software controls to manage the charge / discharge process and to integrate the power generation with grid demand.
  • ≥10 day discharge duration: These storage systems are designed to support the grid during longer periods of low generation — such as low solar output caused by several cloudy days or a multiday period of low wind output — and to cover significant week-to-week output variations. As with the 8- to 12-hour LDES systems, the energy cost and the frequency of discharge to the grid will govern the economics of the operation.

The cost of energy from the battery depends on three factors: the energy cost to charge the batteries, the round-trip efficiency of the charging / discharging process, and the number of charge / discharge cycles achieved for a given period.

The impact of the cost of power to charge the batteries is evident. The critical point is that the economics of a battery system depend on buying power during periods of grid saturation when wholesale electricity prices are approaching zero.

Roundtrip efficiency dictates the quantity of excess energy that has to be purchased to compensate for the inefficiencies of the charge / discharge cycle. If the roundtrip efficiency is 80%, the battery operator must purchase 1.25 MWh of energy for every 1 MWh delivered back to the grid. To make the economics of a battery system work, charging power needs to be purchased at a near zero rate.

To be profitable, the revenue from energy power sales must cover the cost of buying power, operating and maintaining the battery system, and paying off the loan used to finance the facility. The economics of an 8- to 12-hour battery system will be optimized if the system charges / discharges every 24 hours. If the battery system can only dispatch energy on one out of seven days, the revenue is unlikely to cover the financing costs.
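
The interplay of these factors can be sketched with a rough calculation. All the figures below (charging price, round-trip efficiency, cycling frequency and annual fixed costs) are assumptions chosen only to illustrate how strongly cycle frequency drives the delivered cost of energy.

```python
# Rough sketch of delivered-energy cost for an LDES system (all figures assumed).

def cost_per_mwh_delivered(
    charge_price: float,       # $ per MWh bought to charge (near zero at grid saturation)
    round_trip_eff: float,     # e.g., 0.8 means 1.25 MWh bought per MWh delivered
    mwh_per_discharge: float,  # energy delivered per full cycle
    cycles_per_year: int,      # how often the system can charge / discharge
    annual_fixed_cost: float,  # O&M plus debt service, $ per year (assumed)
) -> float:
    energy_cost = charge_price / round_trip_eff
    delivered_per_year = mwh_per_discharge * cycles_per_year
    return energy_cost + annual_fixed_cost / delivered_per_year

# Same assumed system, cycling daily versus discharging only once a week
print(cost_per_mwh_delivered(5, 0.8, 1000, 365, 20_000_000))  # roughly $61 per MWh
print(cost_per_mwh_delivered(5, 0.8, 1000, 52, 20_000_000))   # roughly $391 per MWh
```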

Effectively integrating storage systems into the grid will require sophisticated software systems to control the charging and discharging of individual battery systems and the flow to and from battery systems at a grid level. These software systems will likely need to include economic algorithms tuned to address the financial viability of all the generating assets on the grid.

Conclusions

Transitioning the global electricity grid to CFE will be a long-term endeavor that will take several decades. Given the challenges inherent in the transition, data center operators should develop plans to increase their electricity consumption to 70% to 80% CFE, while supporting extended grid interconnections and the development and commercialization of LDES technologies and non-intermittent CFE electricity generation.

There will be grid regions where the availability of wind, solar, hydroelectric and nuclear generation assets will facilitate a faster move to economic 100% CFE. There are other regions where 100% CFE use will not be achieved for decades. Data center operators will need to accommodate these regional differences in their sustainability strategies. Because of the economic and technical challenges of attaining 100% CFE in most grid regions, it is unrealistic to expect to reach 100% CFE consumption across a multi-facility, multi-regional data center portfolio until at least 2040.

Uptime Institute advises operators to reset the timing of their net zero commitments and to revise their strategy to depend less on procuring renewable energy credits, guarantees of origin, and carbon offsets. Data center operators will achieve a better outcome if they apply their resources to promote and procure CFE for their data center. But where CFE purchases are the key to operational emissions reductions, net zero commitments will need to move out in time.

Data center managers should craft CFE procurement strategies and plans to incrementally increase CFE consumption in all IT operations — owned, colocation and cloud. By focusing net zero strategies on CFE procurement, data center managers will achieve real progress toward reducing their greenhouse gas emissions and will accelerate the transition to a low-carbon electricity grid.

Where the cloud meets the edge

Low latency is the main reason cloud providers offer edge services. Only a few years ago, the same providers argued that the public cloud (hosted in hyperscale data centers) was suitable for most workloads. But as organizations have remained steadfast in their need for low latency and better data control, providers have softened their resistance and created new capabilities that enable customers to deploy public cloud software in many more locations.

These customers need a presence close to end users because applications, such as gaming, video streaming and real-time automation, require low latency to perform well. End users — both consumers and enterprises — want more sophisticated capabilities with quicker responses, which developers building on the cloud want to deliver. Application providers also want to reduce the network costs and bandwidth constraints that result from moving data over wide distances — which further reinforces the need to keep data close to point of use.

Cloud providers offer a range of edge services to meet the demand for low latency. This Update explains what cloud providers offer to deliver edge capabilities. Figure 1 shows a summary of the key products and services available.

Figure 1. Key edge services offered by cloud providers

Diagram: Key edge services offered by cloud providers

In the architectures shown in Figure 1, public cloud providers generally consider edge and on-premises locations as extensions of the public cloud (in a hybrid cloud configuration), rather than as isolated private clouds operating independently. The providers regard their hyperscale data centers as default destinations for workloads and propose that on-premises and edge sites be used for specific purposes where a centralized location is not appropriate. Effectively, cloud providers see the edge location as a tactical staging post.

Cloud providers don’t want customers to view edge locations as similarly feature-rich as cloud regions. Providers only offer a limited number of services at the edge to address specific needs, in the belief that the central cloud region should be the mainstay of most applications. Because the edge cloud relies on the public cloud for some aspects of administration, there is a risk of data leakage and loss of control of the edge device. Furthermore, the connection between the public cloud and edge is a potential single point of failure.

Infrastructure as a service

In an infrastructure as a service (IaaS) model, the cloud provider manages all aspects of the data center, including the physical hardware and the orchestration software that delivers the cloud capability. Users are usually charged per resource consumed per period.

The IaaS providers use the term “region” to describe a geographical area that contains a collection of independent availability zones (AZs). An AZ consists of at least one data center. A country may have many regions, each typically having two or three AZs. Nearly all the services that cloud providers locate within regions or AZs are offered as IaaS.

Cloud providers also offer metro and near-premises locations sited in smaller data centers, appliances or colocation sites nearer to the point of use. They manage these in a similar way to AZs. Providers claim millisecond-level connections between end users and these edge locations. However, the edge locations usually have fewer capabilities, poorer resiliency and higher prices than AZs and regions, which are hosted from (usually) larger data centers.

Furthermore, providers don’t have edge locations in all areas; they create new locations only where volume is likely to make the investment worthwhile, typically in cities. Similarly, the speed of connectivity between edge sites and end users depends on the supporting network infrastructure and the availability of 4G or 5G.

When a cloud provider customer wants to create a cloud resource or build out an application or service, they first choose a region and AZ using the cloud provider’s graphical user interface (GUI) or application programming interface (API). To the user of GUIs and APIs, metro and near-premises locations appear as options for deployment just as a major cloud region would. Buyers can set up a new cloud provider account and deploy resources in an IaaS edge location in minutes.
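
As an illustration of how an edge location appears as just another placement target, the sketch below uses the AWS SDK for Python (boto3) to launch an instance into a Local Zone. The AMI ID is a placeholder, the zone must first be enabled on the account, and instance-type availability varies by location.

```python
# Sketch: launching a VM into an AWS Local Zone with boto3. The Local Zone is
# addressed like any other availability zone name. The AMI ID is a placeholder.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # parent region of the Local Zone

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",                     # placeholder AMI
    InstanceType="t3.xlarge",                            # availability varies by zone
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-west-2-lax-1a"},  # Los Angeles Local Zone
)
print(response["Instances"][0]["InstanceId"])
```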

Metro locations

Metro locations do not offer the range of cloud services that regions do. They typically offer only compute, storage, container management and load balancing. A region has multiple AZs; a metro location does not. As a result, it is impossible to build a fully resilient application in a single metro location (see Cloud scalability and resiliency from first principles).

The prime example of a metro location is Amazon Web Services (AWS) Local Zones. Use cases usually focus on graphically intense applications such as virtual desktops or video game streaming, or real-time processing of video, audio or sensor data. Customers, therefore, should understand that although edge services in the cloud might cut latency and bandwidth costs, resiliency may also be lower.

These metro-based services, however, may still match many or most enterprise levels of resiliency. Data center infrastructure that supports metro locations is typically in the scale of megawatts, is staffed, and is built to be concurrently maintainable. Connection to end users is usually provided over redundant local fiber connections.

Near-premises (or far edge) locations

Like cloud metro locations, near-premises locations have a smaller range of services and AZs than regions do. The big difference between near-premises and metros is that resources in near-premises locations are deployed directly on top of 4G or 5G network infrastructure, perhaps only a single cell tower away from end-users. This reduces hops between networks, substantially reducing latency and delays caused by congestion.

Cloud providers partner with major network carriers to enable this: for example, AWS’s Wavelength service is delivered in partnership with Vodafone, and Microsoft’s Azure Edge Zones service with AT&T. Use cases include real-time applications, such as live video processing and analysis, autonomous vehicles and augmented reality. 5G enables connectivity where there is no fixed-line telecoms infrastructure or in temporary locations.

These sites may be cell tower locations or exchanges, operating tens of kilowatts (kW) to a few hundred kW. They are usually unstaffed (remotely monitored) with varying levels of redundancy and a single (or no) engine generator.

On-premises cloud extensions

Cloud providers also offer hardware and software that can be installed in a data center that the customer chooses. The customer is responsible for all aspects of data center management and maintenance, while the cloud provider manages the hardware or software remotely. The provider’s service is often charged per resource per period over an agreed term. A customer may choose these options over IaaS because no suitable cloud edge locations are available, or because regulations or strategy require them to use their own data centers.

Because the customer chooses the location, the equipment and data center may vary. In the edge domain, these locations are: typically 10 kW to a few hundred kW; owned or leased; and constructed from retrofitted rooms or using specialized edge data center products (see Edge data centers: A guide to suppliers). Levels of redundancy and staff expertise vary, so some edge data center product suppliers provide complementary remote monitoring services. Connectivity is supplied through telecom interconnection and local fiber, and latency between site and end user varies significantly.

Increasingly, colocation providers differentiate by directly peering with cloud providers’ networks to reduce latency. For example, Google Cloud recommends its Dedicated Interconnect service for applications where latency between public cloud and colocation site must be under 5ms. Currently, 143 colocation sites peer with Google Cloud, including those owned by companies such as Equinix, NTT, Global Switch, Interxion, Digital Realty and CenturyLink. Other cloud providers have similar arrangements with colocation operators.

Three on-premises options

Three categories of cloud extensions can be deployed on-premises. They differ in how easy it is to customize the combination of hardware and software. An edge cloud appliance is simple to implement but has limited configuration options; a pay-as-you-go server gives flexibility in capacity and cloud integration but requires more configuration; finally, a container platform gives flexibility in hardware and software and multi-cloud possibilities, but requires a high level of expertise.

Edge cloud appliance

An edge appliance is a pre-configured hardware appliance with pre-installed orchestration software. The customer installs the hardware in its data center and can configure it, to a limited degree, to connect to the public cloud provider. The customer generally has no direct access to the hardware or orchestration software.

Organizations deploy resources via the same GUI and APIs as they would use to deploy public cloud resources in regions and AZs. Typically, the appliance needs to be connected to the public cloud for administration purposes, with some exceptions (see Tweak to AWS Outposts reflects demand for greater cloud autonomy). The appliance remains the property of the cloud provider, and the buyer typically leases it based on resource capacity over three years. Examples include AWS Outposts, Azure Stack Hub and Oracle Roving Edge Infrastructure.

Pay-as-you-go server

A pay-as-you-go server is a physical server leased to the buyer and charged based on committed and consumed resources (see New server leasing models promise cloud-like flexibility). The provider maintains the server, measures consumption remotely, proposes capacity increases based on performance, and refreshes servers when appropriate. The provider may also include software on the server, again charged using a pay-as-you-go model. Such software may consist of cloud orchestration tools that provide private cloud capabilities and connect to the public cloud for a hybrid model. Customers can choose their hardware specifications and use the provider’s software or a third party’s. Examples include HPE GreenLake and Dell APEX.

Container software

Customers can also choose their own blend of hardware and software with containers as the underlying technology, to enable interoperability with the public cloud. Containers allow software applications to be decomposed into many small functions that can be maintained, scaled, and managed individually. Their portability enables applications to work across locations.

Cloud providers offer managed software for remote sites that is compatible with public clouds. Examples include Google Anthos, IBM Cloud Satellite and Red Hat OpenShift Container Platform. In this option, buyers can choose their hardware and some aspects of their orchestration software (e.g., container engines), but they are also responsible for building the system and managing the complex mix of components (see Is navigating cloud-native complexity worth the hassle?).

Considerations

Buyers can host applications in edge locations quickly and easily by deploying in metro and near-premises locations offered by cloud providers. Where a suitable edge location is not available, or the organization prefers to use on-premises data centers, buyers have multiple options for extending public cloud capability to an edge data center.

Edge locations differ in terms of resiliency, product availability and — most importantly — latency. Latency should be the main motivation for deploying at the edge. Generally, cloud buyers pay more for deploying applications in edge locations than they would in a cloud region. If there is no need for low latency, edge locations may be an expensive luxury.

Buyers must deploy applications to be resilient across edge locations and cloud regions. Edge locations have less resiliency, may be unstaffed, and may be more likely to fail. Applications must be architected to continue to operate if an entire edge location fails.

Cloud provider edge locations and products are not generally designed to operate in isolation — they are intended to serve as extensions of the public cloud for specific workloads. Often, on-premises and edge locations are managed via public cloud interfaces. If the connection between the edge site and the public cloud goes down, the site may continue to operate — but it will not be possible to deploy new resources until the site is reconnected to the public cloud platform that provides the management interface.

Data protection is often cited as a use case for edge, a reason why some operators may choose to locate applications and data at the edge. However, because the edge device and public cloud need to be connected, there is a risk of user data or metadata inadvertently leaving the edge and entering the public cloud, thereby breaching data protection requirements. This risk must be managed.


Dr. Owen Rogers, Research Director of Cloud Computing

Tomas Rahkonen, Research Director of Distributed Data Centers

Data center operators will face more grid disturbances

The energy crisis of 2022, resulting from Russia’s invasion of Ukraine, caused serious problems for data center operators in Europe. Energy prices leapt up and are likely to stay high. This has resulted in ongoing concerns that utilities in some European countries, which have a mismatch of supply and demand, will have to shed loads.

Even before the current crisis, long-term trends in the energy sector pointed towards less reliable electrical grids, not only in Europe, but also in North America. Data center operators — which rarely have protected status due to their generator use — are having to adapt to a new reality: electrical grids in the future may be less reliable than they are now.

Frequency and voltage disturbances may occur more frequently, even if full outages do not, with consequent risk to equipment. Data center operators may need to test their generators more often and work closely with their utilities to ensure they are collaborating to combat grid instability. In some situations, colocation providers may have to work with their customers to ensure service level agreements (SLAs) are met, and even shift some workloads ahead of power events.

There are four key factors that affect the reliability of electrical grids:

  • Greater adoption of intermittent renewable energy sources.
  • Aging of electricity transmission systems.
  • More frequent (and more widespread) extreme weather events.
  • Geopolitical instability that threatens oil and gas supplies.

Individually, these factors are unlikely to deplete grid reserve margins, which measure the difference between expected maximum available supply and expected peak demand. When combined, however, they can create an environment that complicates grid operation and increases the likelihood of outages.
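
Expressed as a formula (the conventional form of the definition above, stated as a share of peak demand):

```latex
% Reserve margin in its conventional form, expressed relative to peak demand.
\[
\text{Reserve margin} = \frac{\text{expected available capacity} - \text{expected peak demand}}{\text{expected peak demand}}
\]
```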

Two out of six regional grid operators in the US have reported that their reserve margins are currently under target. One of them — the Midcontinent Independent System Operator — is expected to have a 1,300 megawatt (MW) capacity shortfall in the summer of 2023. In Europe, Germany is turning to foreign power suppliers to meet its reserve margins; even before the war in Ukraine, the country was lacking 1,400 MW of capacity. There has been a marked increase in power event outage warnings — even if they are not happening yet.

Dispatchable versus intermittent

An often-cited reason for grid instability is that dispatchable (or firm) power generation — for example, from natural gas, coal and nuclear sources — is being replaced by intermittent renewable power generated by solar and wind energy. This hinders the ability of grid operators to match supply and demand of energy and makes them more vulnerable to variations in weather. Utilities find themselves caught between competing mandates: maintain strict levels of reserve power, decommission the most reliable fossil fuel power plants, and maintain profitability.

Historically, electrical grid operators have depended on firm power generation to buttress the power system. Due to the market preference for renewable energy, many power plants that provide grid stability now face severe economic and regulatory challenges. These include lower-than-expected capacity factors, which compare power plants’ output with how much they could produce at peak capacity; higher costs of fuel and upgrades; stricter emissions requirements; and uncertainty regarding license renewals.

The effects are seen in the levelized cost of electricity (LCOE), measured in dollars per megawatt-hour (MWh). The LCOE is the average net present value of electricity generation over the lifetime of a generation asset. It factors in debt payments, fuel costs and maintenance costs, and requires assumptions regarding fuel costs and interest rates. According to financial services company Lazard’s annual US analysis (2023 Levelized cost of energy+), utility scale solar and onshore wind LCOE (without subsidies) is $24-$96 per MWh and $24-$75 per MWh respectively. Combined cycle natural gas plants, for comparison, have an LCOE of $39-$101 per MWh, while natural gas peaking plants have an LCOE of $115-$221 per MWh.

The economic viability of generation assets, as represented by LCOE, is only a part of a very complex analysis to determine the reserve margin needed to maintain grid stability. Wind and solar generation varies significantly by the hour and the season, making it essential that the grid has sufficient dispatchable assets (see 24×7 carbon-free energy (part one): expectations and realities).

Grid-scale energy storage systems, such as electrochemical battery arrays or pumped storage hydropower, are a possible solution to the replacement of fossil fuel-based firm power. Aside from a few examples, these types of storage do not have the capacity to support the grid when renewable energy output is low. It is not merely a question of more installations, but affordability: grid-scale energy storage is not currently economical.

Grids are showing their age

Much of the infrastructure supporting electricity transmission in the US and Europe was built in the 1970s and 1980s. These grids were designed to transmit power from large, high-capacity generation facilities to consumers. They were not designed to manage the multi-directional power flow of widely distributed wind and solar generation assets, electric vehicle chargers which can consume or supply power, and other vagaries of a fully electrified economy. The number of end users served by electrical grids has since increased dramatically, with the world’s population doubling — and power networks are now struggling to keep up without expensive capacity upgrades and new transmission lines.

In the US, the Department of Energy found that the average age of large power transformers, which handle 90% of the country’s electricity flow, is more than 40 years. Three-quarters of the country’s 707,000 miles (1,138,000 km) of transmission lines — which have a lifespan of 40 to 50 years — are more than 25 years old. The number of transmission outages in the US has more than doubled in the past six years compared with the previous six years, according to the North American Electric Reliability Corporation. This is partially due to an increase in severe weather events, which put more stress on the power grid (more on this below).

Renewable energy in some grid regions is being wasted because there is not enough transmission capacity to transport it to other regions that have low generation and high demand. In the US, for example, the lack of high-voltage transmission lines prevents energy delivery from expansive wind farms in Iowa and Oklahoma to data centers in the southeast. Germany is facing similar problems as it struggles to connect wind capacity in the north near the Baltic Sea to the large industrial consumers in the south.

Energy experts estimate that modernizing the US power grid will require an investment of $1 trillion to $2 trillion over the next two decades. Current business and regulatory processes cannot attract and support this level of funding; new funding structures and business processes will be required to transform the grid for a decarbonized future.

Weather is changing

The Earth is becoming warmer, which means long-term changes to broad temperature and weather patterns. These changes are increasing the risk of extreme weather events, such as heat waves, floods, droughts and fires. According to a recent study published in the Proceedings of the National Academy of Sciences, individual regions within the continental US were more than twice as likely to see a “once in a hundred years” extreme temperature event in 2019 than they were in 1979.

Extreme weather events put the grid system at risk. High temperatures reduce the capacity of high-voltage transmission lines, high winds can knock power lines off their towers, and storms can result in the loss of generation assets. In the past three years, Gulf Coast hurricanes, West Coast wildfires, Midwest heat waves and a Texas deep freeze have all caused local power systems to fail. The key issue with extreme weather is that it can disrupt grid generation, transmission and fuel supplies simultaneously.

The Texas freeze of February 2021 serves as a cautionary tale of interdependent systems. As the power outages began, some natural gas compressors became inoperable because they were powered by electricity without backup — leading to further blackouts and interrupting the vital gas supply needed to maintain or restart generation stations.

Geopolitical tensions create fuel supply risks

Fuel supply shocks resulting from the war in Ukraine rapidly increased oil and natural gas prices in Europe during the winter of 2022. The global energy supply chain is a complex system of generators, distributors and retailers that prioritizes economic efficiency and low cost, often at the expense of resiliency, which can create the conditions for cascading failures. Many countries in the EU warned that fuel shortages would cause rolling blackouts if the weather became too cold.

In the UK, the government advised that a reasonable worst-case scenario would see rolling power cuts across the country. Large industrial customers, including data center operators, would have been affected, losing power in 3-hour blocks without any warning. In France, planning for the worst-case scenario assumed that up to 60% of the country’s population would be affected by scheduled power cuts. Even Germany, a country with one of the most reliable grids in the world, made plans for short-term blackouts.

Fuel supply shocks are reflected in high energy prices and create short- and long-term risks for the electrical grid. Prior to the conflict in Ukraine, dispatchable fossil fuel peaker plants, which only operate during periods of high demand, struggled to maintain price competitiveness with renewable energy producers. This trend was exacerbated by high fuel costs and renewable energy subsidies.

Any political upheaval in the major oil-producing regions, such as the Gulf countries or North Africa, would affect energy prices and energy supply. Development of existing shale gas deposits could offer some short-term energy security benefits in markets such as the US, the UK and Germany, but political pressures are currently preventing such projects from getting off the ground.

Being prepared for utility power problems

Loss of power is still the most common cause of data center outages. Uptime Institute’s 2022 global survey of data center managers attributes 44% of outages to power issues — greater than the second, third and fourth most common outage causes combined.

The frequency of power-related outages is partly due to power system complexity and a dependence on moving parts: a loss of grid power often reveals weaknesses elsewhere, including equipment maintenance regimens and staff training.

The most common power-related outages result from faults with uninterruptible power supply (UPS) systems, transfer switches and generators. Compared with other data center systems, such as cooling, power is more prone to fail completely, rather than operating at partial capacity.

Disconnecting from the power grid and running on emergency backup systems for extended periods (hours and days) is not a common practice in most geographies. Diesel generators remain the de facto standard for data center backup power and these systems have inherent limitations.

Any issues with utility power create risks across the data center electrical system. Major concerns include:

  • Load transfer risks between the grid and the generators. It is recommended that data center operators fully disconnect from the electrical grid once a year to test procedures, yet many choose not to do so out of operational concerns in a production environment. This means that lurking failures in transfer switches and paralleling switchgear may go undetected and any operational mistakes remain undiscovered.
  • Fuel reserves and refueling. On-site fuel storage is constrained by cost and available space, and requires system maintenance as well as spill and leak management. Longer grid outages can exceed on-site fuel capacity, making operators dependent on outside vendors for fuel resupply. These vendors are, in turn, dependent on the diesel supply chain, which may be disrupted during a wide-area grid outage because some diesel terminals may lack backup power to pump the fuel. Fuel delivery procedures may come under time pressure and may not be fully observed, with the potential to create accidents such as contamination and spills.
  • Increased likelihood of engine failures. More frequent warm-up and cool-down cycles, as well as higher than expected runtime hours, accelerate generator wear. As many as 27% of data center operators have experienced a generator-related power outage in the past three years, according to the Uptime Institute Annual outages analysis 2023. Ongoing supply chain bottlenecks may mean that rental generators are in short supply while replacement parts or new engines may take months to arrive. This may force the data center to operate at a derated capacity, lower redundancy, or both.
  • Pollutant emissions. Many jurisdictions limit generator operating hours to cap emissions of nitrogen oxides, sulfur oxides and particulate matter. For example, in the US, most diesel generators are limited to 100 hours of full load per year and non-compliance can result in fines.
  • Battery system wear and failures. Frequent deep discharge and recharge cycles wear battery cells out faster, particularly lead-acid batteries. Lithium-ion chemistries are not immune either: high-current discharges create thermal stress in lithium-ion cells, and temperatures can spike as cells approach the end of their capacity, which increases the chance of a thermal event arising from inherent manufacturing imperfections. Often, the loss of critical load caused by a failure in an uninterruptible power supply system is due to batteries not being monitored closely enough by experienced technicians.

It will get worse before it gets better

The likely worst-case scenario facing US and European data center operators in 2023 and beyond will consist primarily of load-shedding requests, brownouts and 2- to 4-hour, controlled rolling blackouts as opposed to widespread, long-lasting grid outages.

Ultimately, how much margin electrical grids will have in reserve is not possible to predict with any accuracy beyond a few weeks. Extreme weather events, in particular, are exceedingly difficult to model. Other unexpected events — such as power station maintenance, transmission failures and geopolitical developments that affect energy supply — might contribute to a deterioration of grid reliability.

Operators can take precautions to prepare for rolling blackouts, including developing a closer relationship with their energy provider. These steps are well understood, but evidence shows they are not always followed.

Considering historically low fuel reserves and long lead times for replacement components, all measures are best undertaken in advance. Data center operators should reach out to suppliers to confirm their capability to deliver fuel and their outlook for future supply — and determine whether the supplier is equipped to sell fuel in the event of a grid outage. Based on the response, operators should assess the need for additional on-site fuel storage or consider contracting additional vendors for their resupplies.

Data center operators should also test their backup power systems regularly. The so-called pull-the-plug test, which checks the entire backup system, is recommended annually. A building load test should be performed monthly, which involves less risk than pull-the-plug testing and checks most of the backup power system. This test may not run long enough to test certain components, such as fuel pumps and level switches, and these should be tested manually. Additionally, data center operators should test the condition of their stored fuel, and filter or treat it as necessary.

The challenges to grid reliability vary by location, but in general Uptime Intelligence sees evidence that the risk of rolling blackouts at times of peak demand is escalating. Operators can mitigate risk to their data centers by testing their backup power system and arranging for additional fuel suppliers as needed. They should also revisit their emergency procedures — even those with a very low chance of occurrence.


Jacqueline Davis, Research Analyst

Lenny Simon, Research Associate

Daniel Bizo, Research Director

Max Smolaks, Research Analyst

24×7 carbon-free energy (part one): expectations and realities

Data center operators that set net-zero goals will ultimately have to transition to 100% 24×7 carbon-free energy. But current technological limitations mean it is not economically feasible in most grid regions.

In the past decade, the digital infrastructure industry has embraced a definition of renewable energy use that combines the use of renewable energy credits (RECs), guarantees of origin (GOs) and carbon offsets to compensate for the carbon emissions of fossil fuel-based generation, and the consumption of renewable energy in the data center.

The first component is relatively easy to achieve and has not proved prohibitively expensive, so far. The second, which will increasingly be required, will not be so easy and will cause difficulty and controversy in the digital infrastructure industry (and many other industries) for years to come.

In the Uptime Institute Sustainability and Climate Change Survey 2022, a third of data center owners / operators reported buying RECs, GOs and other carbon offsets to support their renewable energy procurement and net-zero sustainability goals.

But many organizations, including some regulators and other influential bodies, are not happy that some operators are using RECs and GOs to claim their operations are carbon neutral with 100% renewable energy use. They argue that, with or without offsets, at least some of the electricity the data center consumes is responsible for carbon dioxide (CO2) emissions — which are measured in metric tonnes (MT) of CO2 per megawatt-hour (MWh).

The trend is clear — offsets are becoming less acceptable. Many digital infrastructure operators have started to (or are planning to) refocus their sustainability objectives towards 100% 24×7 carbon-free energy (CFE) consumption — consuming CFE for every hour of operation.

This Update, the first of two parts, will focus on the key challenges that data center operators face as they move to use more 24×7 CFE. Part two, 24×7 carbon-free energy: getting to 100%, will outline the steps that operators, grid authorities, utilities and other organizations should take to enable data centers to be completely powered by CFE.

A pairing of 24×7 CFE and net-zero emissions goals alters an operator’s timeframe to achieve their net-zero emissions as it removes the ability to offset for success. At present, 100% 24×7 CFE is technically and / or economically unviable in most grid regions and will not be viable in some regions until 2040 to 2050, and beyond. This is because solar and wind generation is highly variable, and the capacity and discharge time of energy storage technologies is limited. Even grid regions with high levels of renewable energy generation can only support a few facilities with 100% 24×7 CFE because they cannot produce enough MWh during periods of low solar and wind output. Consequently, this approach could be expensive and financially risky.  

The fact that most grid regions cannot consistently generate enough CFE complicates net-zero carbon emissions goals. The Uptime Institute Sustainability and Climate Change Survey 2022 found that almost two-thirds of data center owners and operators expect to achieve a net-zero emissions goal by 2030, and three-quarters expect to do so by 2035 (Figure 1). Operators will not be able to achieve these enterprise-wide goals using 100% 24×7 CFE alone. They will have to rely extensively on RECs, GOs and offsets, and commit significant financial expenditure, to deliver on their promises.

Figure 1. Time period when operators expect to achieve their net-zero goals

Diagram: Time period when operators expect to achieve their net-zero goals

Although achieving net-zero emissions goals through offsets may seem attractive, offset costs are rising and operators need to assess the costs and benefits of increasing renewable energy / CFE at their data centers. Over the long term, if organizations demand more CFE, they will incentivize energy suppliers to generate more, and to develop and deploy energy storage, grid management software and inter-region high voltage interconnects.

To establish a 24×7 CFE strategy, operators or their energy retailers should track and control CFE assets and energy delivery to their data centers, and include factors that affect the availability and cost of CFE MWh in procurement contracts.

Renewable energy intermittency

The MWh output of solar and wind generation assets varies by minute, hour, day and season. Figure 2 shows how wind generation varied over two days in January 2019 in the grid region of ERCOT (a grid operator in Texas, US). On January 8, wind turbines produced 32% to 48% of their total capacity (capacity factor) and satisfied 19% to 33% of the entire grid demand. On January 14, wind capacity fell to 4% to 18% and wind output met only 3% to 9% of the total demand.

Figure 2. Wind generation data for two days in January 2019

Diagram: Wind generation data for two days in January 2019

To match generation output to grid demand, ERCOT needs to maintain sufficient reliable generation capacity (typically fossil fuel-fired) to cover the hour-to-hour and day-to-day variation in wind and solar output. Addressing the intraday wind variation alone (roughly 3,000 MW on January 8 and 2,500 MW on January 14) requires the equivalent of four or five 700 MW natural gas turbines held in reserve. Further capacity is needed to cover the approximately 6,000 MW difference in wind output between the two days.

In 2020, wind generation in ERCOT satisfied less than 5% of grid demand for 600 hours, approximately 7% of the year. Given the number of hours of low MWh generation, only a few facilities in the grid region can aspire to 100% CFE. The challenge of significant periods of low generation exists in most grid regions that rely on wind and solar generation to meet carbon-free energy goals. This limits the ability of operators to set and achieve 100% CFE goals for specific facilities over the next 10 to 20 years.

Seasonal variations can be even more challenging. Wind generation in the Irish grid, measured in gigawatt-hours (GWh), was two to three times higher in January and February 2022 than in July and August 2022 (Figure 3, light blue line), leaving a seasonal production gap of up to 1,000 GWh. The total GWh output for each month also masks the hour-to-hour variations in wind generation (see Figure 3). These variations amplify the importance of having sufficient and reliable generation available on the grid.

Figure 3. Monthly gigawatt-hours generated in Ireland by fuel type in 2022

Diagram: Monthly gigawatt-hours generated in Ireland by fuel type in 2022

The mechanics of a 24×7 CFE strategy

To manage a facility-level program to procure, track and consume CFE, it is necessary to understand how an electricity grid operates. Figure 4 shows a fictional electricity grid modelled on the Irish grid with the addition of a baseload 300 MW nuclear generation facility.

The operation of an electrical grid can be equated to a lake of electrons held at a specified level by a dam. A defined group of generation facilities continuously “fills” the electron lake. The electricity consumers draw electrons from the lake to power homes and commercial and industrial facilities (Figure 4). The grid operator or authority maintains the lake level (the grid potential in MW) by adding or subtracting online generation capacity (MW of generation) at the same “elevation” as the MW demand of the electricity consumers pulling electrons from the lake.

If the MWs of generation and consumption become unbalanced, the grid authority needs to quickly add or subtract generation capacity or remove portions of demand (demand response programs, brownouts and rolling blackouts) to keep the lake level balanced and prevent grid failure (a complete blackout).

Figure 4. Example of generation and demand balancing within an electricity grid

Diagram: Example of generation and demand balancing within an electricity grid
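The balancing act the analogy describes can be reduced to a simple decision for each dispatch interval. The sketch below is purely illustrative (the function name and parameters are invented for this example) and ignores frequency regulation, ramp rates and market mechanisms:

```python
# Minimal sketch of the balancing decision described above (illustrative only;
# real grid operations involve frequency control, market dispatch and more).

def balance_interval(demand_mw, online_generation_mw, reserve_mw, sheddable_demand_mw):
    """Return the actions needed to keep generation and demand matched in one interval."""
    gap_mw = demand_mw - online_generation_mw
    if gap_mw <= 0:
        # Surplus: curtail output or take generation offline.
        return {"curtail_mw": -gap_mw, "dispatch_mw": 0, "shed_mw": 0}
    dispatch_mw = min(gap_mw, reserve_mw)              # bring reserve capacity online
    shortfall_mw = gap_mw - dispatch_mw
    shed_mw = min(shortfall_mw, sheddable_demand_mw)   # demand response, rolling blackouts
    if shortfall_mw > shed_mw:
        raise RuntimeError("Unserved demand: risk of complete grid failure (blackout)")
    return {"curtail_mw": 0, "dispatch_mw": dispatch_mw, "shed_mw": shed_mw}

# Example: 5,000 MW of demand against 4,200 MW online, with 600 MW of reserve
# and 300 MW of sheddable load.
print(balance_interval(5_000, 4_200, 600, 300))
# {'curtail_mw': 0, 'dispatch_mw': 600, 'shed_mw': 200}
```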

The real-time or hourly production of each generation facility supplying the grid can be matched to the hourly consumption of a grid-connected data center. The grid authority records the real-time MWh generation from each facility and the data can be made available to the public. At least two software tools, WattTime and Cleartrace, are available in some geographies to match CFE generation to facility consumption. Google, for example, uses a tracking tool and publishes periodic updates of the percentage utilization of CFE at each of its owned data centers (Figure 5).

Figure 5. Percentage of hourly carbon-free energy utilized by Google’s owned data centers

Diagram: Percentage of hourly carbon-free energy utilized by Google's owned data centers
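As a simplified illustration of the hour-by-hour matching such tools perform (a hypothetical sketch, not the actual methodology of WattTime, Cleartrace or Google), a facility’s hourly CFE percentage can be computed by capping the CFE credited in each hour at what the facility actually consumed in that hour:

```python
# Hypothetical sketch of hourly CFE matching: in each hour, consumption is
# covered by attributed CFE generation up to, but not beyond, what was consumed.

def hourly_cfe_percentage(consumption_mwh, cfe_generation_mwh):
    """Both arguments are equal-length lists of hourly MWh values."""
    matched = sum(min(c, g) for c, g in zip(consumption_mwh, cfe_generation_mwh))
    return 100 * matched / sum(consumption_mwh)

# Example: a constant 10 MWh/h load against variable attributed wind output.
# Surplus in one hour cannot offset a deficit in another.
load = [10, 10, 10, 10, 10, 10]
wind = [14, 12, 9, 3, 0, 11]
print(f"{hourly_cfe_percentage(load, wind):.0f}% hourly CFE")  # 70% hourly CFE
```

On an annual, volume-matched basis the same facility would appear roughly 82% renewable (49 MWh of wind against 60 MWh of load), which is why hourly matching is a stricter test than conventional REC accounting.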

Grid authorities depend on readily dispatchable generation assets to balance generation and demand. Lithium-ion batteries are a practical way to respond to rapid short-term changes (within a period of four hours or less) in wind and solar output. Because of the limited capacity and discharge time of lithium-ion batteries — the only currently available grid-scale battery technology — they cannot manage MW demand variation across 24 hours (Figure 2) or seasons (Figure 3). Consequently, grid authorities depend on large-scale fossil fuel and biomass generation, and hydroelectric where available, to maintain a reserve to meet large, time-varied MW demand fluctuations.

The grid authority can have agreements with energy consumers to reduce demand on the grid, but executing these agreements typically requires two to four hours’ notice and a day-ahead warning. These agreements cannot deliver real-time demand adjustments.

The energy capacity / demand bar charts at the bottom of Figure 4 illustrate the capacity and demand balancing process within a grid system at different times of the day and year. They show several critical points for managing a grid with significant, intermittent wind and solar generating capacity.

  • In periods of high wind output, such as January and February 2022, fossil fuel generation is lower because it is only dispatched when CFE output cannot meet demand. During these periods of high wind output, organizations can approach or achieve 100% CFE use at costs close to grid costs.

    Solar generation — confined to daylight hours — varies by the hour, day and season. As with wind generation, fluctuations in solar output need to be managed by dispatching batteries or reliable generation for short-term variations in solar intensity and longer reductions, such as cloudy days and at night.
  • For periods of low wind output (Figure 4, July 2022 average day and night), the grid deploys reliable fossil fuel assets to support demand. These periods of low wind output prevent facilities from economically achieving 100% 24×7 CFE.
  • Nuclear plants provide a reliable, continuous supply of baseload CFE. An energy retailer could blend wind, solar and nuclear output in some grid regions to enable 100% 24×7 CFE consumption at an operating facility.

    Current nuclear generation technologies are designed to operate at a consistent MW level. They change their output over a day or more, not over minutes or hours. New advanced small modular reactor technologies adjust output over shorter periods to increase their value to the grid, but these will not be widely deployed for at least a decade.
  • For batteries to be considered a CFE asset, they must either be charged during high wind or solar output periods, or from nuclear plants. Software tools that limit charging to periods of CFE availability will be required.

To approach or achieve 100% CFE with only wind generation, data centers need to contract three to five times more wind capacity (in MW) than their average demand. The Google Oklahoma (88% hourly CFE) and Iowa (97% hourly CFE) data centers (Figure 5) illustrate this. Power purchase contracts for the two locations (February 2023) show 1,012 MW of wind generation capacity purchased for the Oklahoma data center and 869 MW of wind capacity purchased for the Iowa data center. Assuming the current demand for these two data centers is 250 MW to 350 MW and 150 MW to 250 MW, respectively, the contracted wind capacity required to achieve the CFE percentages is three to five times the data center’s average operating demand.
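A back-of-the-envelope check of those ratios, using the contracted capacities above and the assumed (not disclosed) demand ranges:

```python
# Rough check of the contracted-wind-to-demand ratios cited above.
# Demand ranges are the assumptions stated in the text, not disclosed values.

sites = {
    "Oklahoma": {"contracted_wind_mw": 1_012, "assumed_demand_mw": (250, 350)},
    "Iowa": {"contracted_wind_mw": 869, "assumed_demand_mw": (150, 250)},
}

for name, site in sites.items():
    low, high = site["assumed_demand_mw"]
    print(f"{name}: {site['contracted_wind_mw'] / high:.1f}x to "
          f"{site['contracted_wind_mw'] / low:.1f}x average demand")
# Oklahoma: 2.9x to 4.0x average demand
# Iowa: 3.5x to 5.8x average demand
```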

Table 1 lists retail energy contracts with guaranteed percentages of 24×7 CFE publicized over the past three years. These contracts are available in many markets and can provide data center operators with reliable power with a guaranteed quantity of CFE. The energy retailer takes on the risk of managing the intermittent renewable output and assuring supply continuity. In some cases, nuclear power may be used to meet the CFE commitment.

One detail absent from publicly available data for these contracts is the electricity rate. Conversations with one energy retailer indicated that these contracts charge a premium to market retail rates — the higher the percentage of guaranteed 24×7 CFE, the higher the premium. Operators can minimize the financial risk and premiums of this type of retail contract by agreeing to longer terms and lower guarantees — for example 5- to 10-year terms and 50% to 70% guaranteed 24×7 CFE. For small and medium size data center operators, a 24×7 renewables retail contract with a 50% to 70% guarantee will likely provide electric rate certainty with premiums of 5% or less (see Table 1) and minimal financial risk compared with the power purchase agreement (PPA) approach used by Alcoa (Table 2).

Table 1. Examples of 24×7 renewable energy retail contracts

Table: Examples of 24x7 renewable energy retail contracts

Alcoa wind PPA contracts in Spain

Aluminum producer Alcoa Corporation recently procured 1.8 GW of wind PPA contracts for its smelter in Spain. Table 2 shows the financial and operational risks associated with an over-purchase of wind capacity. The smelter has 400 MW of power demand. Alcoa estimates it would need at least 2 GW of wind capacity to achieve 100% CFE. Contract information reported by The Wall Street Journal and detailed in Table 2 indicates that with its 1.8 GW of PPAs, Alcoa is likely to reach more than 95% 24×7 CFE. The purchase had two objectives: to stabilize the facility’s long-term electricity cost, and to produce near carbon-free aluminum.

The contracted quantity of wind power is 4.5 times the facility’s power demand: based on its modelling of wind farm output, Alcoa estimated that this level of overcapacity is needed to get close to 100% 24×7 CFE. In practice, Alcoa will likely have to buy some power on the spot market when wind output is low, but the overall emissions of the power it uses will be minimal.

Table 2 details the estimated total MWh of generation and financial implications of the PPAs. Assuming the wind farms have a capacity factor (CF) of 0.45, the PPA contracts secure more than twice as much electricity as the smelter consumes. This excess 3.6 million MWh will be sold into the electricity spot market and Alcoa’s profits or losses under the PPAs will be determined by the price at the time of generation.

The LevelTen Energy Q3 2022 PPA Price Index report was consulted for the P25 electricity rate (the rate at which 25% of available PPA contracts have a lower electricity rate) to estimate the rate of the signed PPAs: €75/MWh was selected. Two EU-wide average electricity rates, taken from the Ember European power price tracker, were chosen to bracket the potential profits and losses associated with the contracts for January 2021 (€50/MWh) and December 2022 (€200/MWh).

Table 2. Projection of power generation and economic returns for the Alcoa plant wind PPAs

Table: Projection of power generation and economic returns for the Alcoa plant wind PPAs

The contracts have significant rewards and risks. If the December 2022 rate remains stable for a year, the agreement will generate €887 million in profit for Alcoa. Conversely, if the January 2021 price had remained stable for a year, Alcoa would have lost €180 million. The wind farms are slated to go online in 2024. The Alcoa plant will restart when the wind farms start to supply electricity to the grid. Current electricity rate projections suggest that the contracts will initially operate with a profit. However, profits are not guaranteed over the 10- to 20-year life of the PPAs.
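A simplified reconstruction of the arithmetic behind these figures, using only the assumptions already stated (1.8 GW contracted, a 0.45 capacity factor, 400 MW of smelter demand and an estimated PPA rate of €75/MWh; the real contract terms are not public):

```python
# Simplified reconstruction of the Table 2 projection. All inputs are the
# assumptions stated in the text; actual Alcoa contract terms are not public.

HOURS_PER_YEAR = 8_760
contracted_mw = 1_800
capacity_factor = 0.45
ppa_rate_eur_mwh = 75            # estimated from the LevelTen P25 index
smelter_demand_mw = 400

generated_mwh = contracted_mw * capacity_factor * HOURS_PER_YEAR   # ~7.1 million MWh
consumed_mwh = smelter_demand_mw * HOURS_PER_YEAR                  # ~3.5 million MWh
excess_mwh = generated_mwh - consumed_mwh                          # ~3.6 million MWh

def annual_result_eur(market_rate_eur_mwh):
    # Excess output is sold at the market rate; consumed output avoids purchases
    # at the market rate. All contracted output is paid for at the PPA rate.
    margin = market_rate_eur_mwh - ppa_rate_eur_mwh
    return margin * excess_mwh + margin * consumed_mwh

for label, rate in [("January 2021 (EUR 50/MWh)", 50), ("December 2022 (EUR 200/MWh)", 200)]:
    print(f"{label}: {annual_result_eur(rate) / 1e6:+,.0f} million EUR per year")
# January 2021 (EUR 50/MWh): -177 million EUR per year
# December 2022 (EUR 200/MWh): +887 million EUR per year
```

Under these assumptions the annual result collapses to the difference between the market rate and the PPA rate multiplied by the total contracted output, which is why the outcome swings so sharply with spot prices.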

Real-time spot market electricity rates will tend toward zero for a growing number of hours as the total installed wind and solar generation capacity within a grid region increases. Meeting EU government commitments for a carbon-free grid requires significant wind and solar overcapacity on the overall grid. As excess generation capacity is deployed to meet these commitments, periods of power overproduction will increase, which is likely to depress spot market prices. There is a strong probability that Alcoa’s contracts will generate losses in their later years as the electricity grid moves toward being carbon free by 2035.

The wind PPAs will provide Alcoa with two near-term revenue generators. First, Alcoa could sell the GOs associated with its estimated 3.6 million MWh of excess generation: bundled or unbundled GOs should be in high demand from data center operators and other enterprises with 2030 net-zero carbon commitments, and selling at 2022 year-end rates for Nordic hydropower GOs (€10/MWh) would realize roughly €36 million. Second, it could sell the excess electricity itself on the spot market.

Low or zero carbon aluminum will also be in high demand and command premium prices as companies seek to decarbonize their facilities or products. While the premium is uncertain, it will add to the benefits of wind power purchases and improve the economics of Alcoa operations. The Alcoa PPA contracts have many upsides, but the Alcoa CFO faces a range of possible financial outcomes inherent in this CFE strategy.

Conclusions

Deploying wind and solar generation overcapacity creates broader operational and financial challenges for grid regions. As overcapacity increases, the use of reliable fossil fuel and nuclear generation assets to maintain grid stability will decrease. As a result, these assets may not generate sufficient revenue to cover their operating and financing costs, which will force them to close. Some nuclear generation facilities in the US have already been retired early because of this.

Grid authorities, utility regulators and legislative bodies need to address these challenges. They need to plan grid capacity: this includes evaluating the shape curves of wind and solar MWh output to determine the quantities of reliable generation required to maintain grid reliability. They should target incentives at developing and deploying clean energy technologies that can boost capacity to meet demand — such as long duration energy storage, small modular nuclear reactors and hydrogen generation. Without sufficient quantities of reliable generation assets, grid stability will slowly erode, which will put data center operations at risk and potentially increase the runtime of backup generation systems beyond their permitted limits.