The Energy Efficiency Directive: requirements come into focus

The European Commission (EC) continues to grapple with the challenges of implementing the Energy Efficiency Directive (EED) reporting and metrics mandates. The publication of the Task B report Labelling and minimum performance standards schemes for data centres and the Task C report EU repository for the reporting obligation of data centres on June 7, 2023 represents the next step on the implementation journey.

Data center operators need to monitor the evolution of the EC’s plans, recognizing that the final version will not be published until the end of 2023. The good news is that about 95% of the requirements have already been set out and the EC’s focus is now on collecting data to inform the future development and selection of efficiency metrics, minimum performance standards and a rating system.

The Task B and C reports clarify most of the data reporting requirements and set out the preferred policy options for assessing data center energy performance. The reports also indicate the direction and scope of the effort to establish a work per energy metric, supporting metrics — such as power usage effectiveness (PUE) and renewable energy factor (REF) — and the appropriate minimum performance thresholds.

Data reporting update

Operators will need to periodically update their data collection processes to keep pace with adjustments to the EED reporting requirements. The Task C report introduces a refined and expanded list of data reporting elements (see Tables 1 and 2).

The EC intends for IT operators to report the maximum work capacity of server equipment, as measured by the server efficiency rating tool (SERT), and the storage capacity of storage equipment, as well as the estimated CPU utilization of the server equipment with an assessment of its confidence level. The EC will use these data elements to assess the readiness of operators to report a work per energy metric and to determine how thresholds should be differentiated by data center type and equipment redundancy level.

Table 1. The EED’s indicator values

Table 2 describes the new data indicators that have been added to the reporting requirements — water use and renewable energy consumption were listed previously. The Task A report and earlier versions of the EED recast had identified the other elements but had not designated them for reporting.

Table 2. The indicator data to be reported to EU member states

The EC will use these data elements to develop an understanding of data center operating characteristics:

  • How backup generators are used to support data center and electrical grid operations.  
  • The percentage of annual water use that comes from potable water sources. In its comments on the Task B and C reports, Uptime Institute recommended that the EC also collect data on facilities’ cooling systems so that water use and water usage effectiveness (WUE) can be correlated to cooling system type.
  • The number of data centers that are capturing heat for reuse and the quality of the heat that these systems generate.
  • The quantity of renewable energy consumed to run the data center and the quantity of guarantees of origin (GOs) used to offset electricity purchases from the grid. In its comments to the EC, Uptime Institute recommended calculating the REF metric from the megawatt-hours (MWh) of renewable or carbon-free energy consumed in the operation, not from the combination of MWh of consumption and offsets (see the sketch after this list).
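A minimal calculation sketch of that recommendation, using hypothetical annual figures: EN 50600-4-3 broadly defines REF as renewable energy consumed divided by total energy consumption (capped at 1.0), and the snippet below contrasts a consumption-only REF with one that also counts purchased GOs.

```python
# Illustrative only: hypothetical annual figures for one data center.
total_consumption_mwh = 50_000      # total facility energy consumption
renewable_consumed_mwh = 30_000     # renewable energy actually consumed in the operation
gos_purchased_mwh = 15_000          # unbundled guarantees of origin bought as offsets

# REF per EN 50600-4-3 (roughly): renewable energy / total energy, capped at 1.0
ref_consumption_only = min(renewable_consumed_mwh / total_consumption_mwh, 1.0)

# The same metric if GOs are also counted, which Uptime Institute recommends against
ref_with_gos = min((renewable_consumed_mwh + gos_purchased_mwh) / total_consumption_mwh, 1.0)

print(f"REF (consumption only): {ref_consumption_only:.2f}")   # 0.60
print(f"REF (consumption + GOs): {ref_with_gos:.2f}")          # 0.90
```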

Section 4 of the Task C report details the full scope of the data elements that need to be reported under the EED. Five data elements specified in Annex VIa of the final EED draft are missing: temperature set points; installed power; annual incoming and outgoing data traffic; amount of data stored and processed; and power utilization. The final data reporting process needs to include these, as instructed by the directive. Table 3 lists the remaining data elements that are not covered in Tables 1, 2 and 4.

Table 3. Other reporting items mandated in the Task A report and the EED

Data center operators need to set up and exercise their data collection process in good time to ensure quality data for their first EED report on May 15, 2024. They should be able to easily collect and report most of the required data, with one exception.

The data sources for servers’ maximum work capacity and utilization, and the methodologies to estimate these values, are still under development. It is likely that these will not be available until the final Task A to D reports are published at the end of 2023. Operators are advised to track the development of these methodologies and be prepared to incorporate them into their reporting processes upon publication.

Colocation operators need to move quickly to establish their processes for collecting the IT equipment data from their tenants. The ideal solution for this challenge would be an industry-standard template that lets IT operators autoload their data to their colocation provider with embedded quality checks. The template would then aggregate the data for each data center location and autoload it to the EU-wide database, as proposed in the Task C document. It is possibly too late to prepare this solution for the May 2024 report; however, the data center industry should seriously consider creating, testing and deploying such a template for the 2025 report.

The reporting of metrics

The EC intends to assess data center locations for at least four of the eight EN 50600-4 metrics (see Table 4). The metrics will be calculated from the submitted indicator data. Task C designates the public reporting of PUE, WUE, REF and energy reuse factor (ERF) by data center location. Two ICT metrics, IT equipment energy efficiency for servers (ITEEsv) and IT equipment utilization for servers (ITEUsv), will be calculated from indicator data but not publicly reported.
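As an illustration, the facility-level metrics can be derived directly from the reported indicator data. The sketch below uses the commonly cited EN 50600-4 definitions (roughly stated) and invented annual values; REF is illustrated earlier in this report.

```python
# Hypothetical annual indicator values for one data center location.
total_energy_kwh = 50_000_000      # total data center energy consumption
it_energy_kwh = 33_000_000         # energy delivered to the IT equipment
water_use_liters = 60_000_000      # annual water consumption
energy_reused_kwh = 5_000_000      # heat exported for reuse, expressed as energy

pue = total_energy_kwh / it_energy_kwh          # power usage effectiveness
wue = water_use_liters / it_energy_kwh          # water usage effectiveness, L/kWh
erf = energy_reused_kwh / total_energy_kwh      # energy reuse factor, 0 to 1

print(f"PUE: {pue:.2f}  WUE: {wue:.2f} L/kWh  ERF: {erf:.2f}")
```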

The cooling efficiency ratio (CER) and the carbon usage effectiveness (CUE) are not designated for indicator data collection or calculation. Uptime Institute recommended that the EC collect the energy used and produced from the cooling system as well as the cooling system type to enable the EC to understand the relationship between CER, WUE and the cooling system type.

Table 4. The use of EN 50600-4 data center metrics for EED reporting

Public reporting of location-specific information

Table 5 lists the location-specific information that the EC recommends be made available to the public.

Table 5. The public reporting of indicator data and metrics

These data elements will reveal a significant level of detail about individual data center operations and will focus scrutiny on operators that are perceived as inefficient or as using excessive quantities of energy or water. Operators are advised to look at their data from 2020 to 2022 for these elements. They will need to consider how the public will perceive the data, determine whether an improvement plan is appropriate and develop a communication strategy to engage with stakeholders on the company’s management of its data center operations.

Reporting at the member state and EU level will provide aggregated data detailing the overall scope of data center operations in each jurisdiction.

Data center efficiency rating systems and thresholds

The Task B report reviews the variety of sources on which the EC is building its proposed data center efficiency rating systems and thresholds. In particular, the report evaluates:

  • A total of 25 national and regional laws, such as the North Holland regional regulation.
  • Voluntary initiatives, such as the European Code of Conduct for Energy Efficiency in Data Centres.
  • Voluntary certification schemes, such as Germany’s Blauer Engel Data Centers ecolabel.
  • Building-focused data center certification schemes, such as the Leadership in Energy and Environmental Design (LEED) and Building Research Establishment Environmental Assessment Method (BREEAM).
  • Self-regulation, such as the Climate Neutral Data Center Pact.
  • A maturity model for energy management and environmental sustainability (CLC/TS 50600-5-1).

The report distills this information into the four most promising policy options:

  1. Minimum performance thresholds for PUE and REF. These could be based on current commitments, such as those by the Climate Neutral Data Center Pact, and government targets for minimum renewable energy consumption percentages.
  2. Information requirements in the form of a rating-based labeling system. This would likely be differentiated by data center type and redundancy level, and built on metric performance thresholds and best practices:
    • Mandatory system. A multi-level rating system based on efficiency metric thresholds with a minimum number of energy management best practices.
    • Voluntary system. Pass / fail performance criteria would likely be set for a mix of efficiency metrics and best practices for energy management.
  3. Information requirements in the form of metrics. Indicator data and metric performance results would be published annually, enabling stakeholders to assess year-to-year performance improvements.

Integrating the Task A through C reports, the EC appears poised to publish the performance indicator data and metrics calculations detailed in option 3 in the above list for the May 15, 2024 public report. The EC’s intent seems to be to collect sufficient data to assess the current values of the key metrics (see Table 4) at operating data centers and the performance variations resulting from different data center types and redundancy levels. This information will help the EC to select the best set of metrics with performance threshold values to compel operators to improve their environmental performance and the quantity of work delivered for each unit of energy and water consumption.

Unfortunately, the EC will have only the 2024 data reports available for analysis ahead of the second assessment’s deadline of May 15, 2025, when it will recommend further measures. The quality of the 2024 data is likely to be suspect due to the short time that operators will have had to collect, aggregate, and report this data. It would benefit the European Parliament and the EC to delay the second assessment report until March 2026.

This extension would enable an analysis of two years’ worth of data reports, give operators time to establish and stabilize their reporting processes, and give the EC time to observe trends in the data and improve its recommendations.

Conclusion

The details of the EED data reporting requirements are slowly coming into focus with the publication of the draft Task B and Task C reports. There is the potential for some changes to both the final, approved EED and the final Task A to D reports that will be delivered to the European Parliament by the end of 2023, but they are likely to be minimal. The EED was approved by the European Parliament on July 11, 2023 and is scheduled for Council approval on July 27, 2023, with formal publication two to three weeks later.

The broad outline and most of the specifics of the required data reports are now clearly defined. Operators need to move quickly to ensure that they are collecting and validating the necessary data for the May 15, 2024 report.

A significant exception is the lack of clarity regarding the measurements and methodologies that IT operators can use to estimate their server work capacity and utilization. The data center industry and the EC need to develop, approve and make available the data sets and methodologies that can facilitate the calculation of these indicators.


The Uptime Intelligence View

The EC is quickly converging on the data and metrics that will be collected and reported for the May 15, 2024 report. The publicly reported, location-specific data will give the public an intimate view of data center resource consumption and its basic efficiency levels. More importantly, it will enable the EC to gather the data it needs to evaluate a work per energy metric and develop minimum performance thresholds that have the potential to alter an IT manager’s business goals and objectives toward a greater focus on environmental performance.

Lifting and shifting apps to the cloud: a source of risk creep?

Public cloud infrastructures have come a long way over the past 16 years to slowly earn the trust of enterprises in running their most important applications and storing sensitive data. In the Uptime Institute Global Data Center Survey 2022, more than a third of enterprises that operate their own IT infrastructure said they also placed some of their mission-critical workloads in a public cloud.

This gradual change in enterprises’ posture, however, can only be partially attributed to improved or more visible cloud resiliency. An equal, or arguably even bigger, component in this shift in attitude is enterprises’ willingness to make compromises when using the cloud, which includes sometimes accepting less resilient cloud data center facilities. However, a more glaring downgrade lies in the loss of the ability to configure IT hardware specifically for sensitive business applications.

In more traditional, monolithic applications, both the data center and IT hardware play a central role in their reliability and availability. Most critical applications that predate the cloud era depend heavily on hardware features because they run on a single or a small number of servers. By design, more application performance meant bigger, more powerful servers (scaling up as opposed to scaling out), and more reliability and availability meant picking servers engineered for mission-critical use.

In contrast, cloud-native applications should be designed to scale across tens or hundreds of servers, with the assumption that the hardware cannot be relied upon. Cloud providers are upfront that customers are expected to build in resiliency and reliability using software and services.

But such architectures are complex, may require specialist skills and come with high software management overheads. Legacy mission-critical applications, such as databases, are not always set up to look after their reliability on their own without depending on hardware and operating system / hypervisor mechanisms. To move such applications to the cloud and maintain their reliability, organizations may need to substantially refactor the code.

This Uptime Update discusses why organizations that are migrating critical workloads from their own IT infrastructure to the cloud will need to change their attitudes towards reliability to avoid creating risks.

Much more than availability

The language that surrounds infrastructure resiliency is often ambiguous and masks several interrelated but distinct aspects of engineering. Historically, the industry has largely discussed availability considerations around the public cloud, which most stakeholders understand as not experiencing outages to their cloud services.

In common public cloud parlance, availability is almost always used interchangeably with reliability. When offering advice on their reliability features or on how to architect cloud applications for reliability, cloud providers tend to discuss almost exclusively what falls under the high-availability engineering discipline (e.g., data replication, clustering and recovery schemes). In the software domain, physical and IT infrastructure reliability may be conflated with site reliability engineering, which is a software development and deployment framework.

These disciplines cross over in two significant ways. First, availability objectives, such as the likelihood that the system is ready to operate at any given time, are only a part of reliability engineering — or rather, one of its outcomes. Reliability engineering is primarily concerned with the system’s ability to perform its function free of errors. It also aims to reduce the likelihood that failures will affect the system’s health. Crucially, this includes the detection and containment of abnormal operations, such as a device making mistakes. In short, reliability is the likelihood of producing correct outputs.

For facilities, this typically translates to the ability to deliver conditioned power and air — even during times of maintenance and failures. For IT systems, reliability is about the capacity to perform compute and storage jobs without errors in calculations or data.
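The distinction can be made concrete with a simplified, textbook-style calculation (assuming a constant failure rate, which is not how either discipline is practiced in full): availability is the long-run fraction of time a system is ready to operate, while reliability here is the probability of completing a period of operation without a failure. Neither number captures the correctness of outputs, which is the aspect the discipline ultimately cares about.

```python
import math

# Simplified, textbook-style model: constant failure rate (exponential distribution).
mtbf_hours = 50_000    # mean time between failures
mttr_hours = 4         # mean time to repair
mission_hours = 8_760  # one year of continuous operation

# Availability: long-run fraction of time the system is ready to operate.
availability = mtbf_hours / (mtbf_hours + mttr_hours)

# Reliability: probability of running the full mission without a failure.
reliability = math.exp(-mission_hours / mtbf_hours)

print(f"Availability: {availability:.5f}")              # ~0.99992
print(f"Reliability over one year: {reliability:.3f}")  # ~0.839
```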

Second, the reliability of any system builds on the robustness of its constituent parts, down to the smallest components. In the cloud, however, the atomic units of reliability that are visible to customers are consumable cloud resources, such as virtual machines or containers, and more complex cloud services, such as data storage, networking and an array of application interfaces.

Today, enterprises not only have limited information on cloud data centers’ physical infrastructure resiliency (either topology or maintenance and operations practices), but also little visibility into, or choice over, the reliability features of the IT hardware and infrastructure software that underpin cloud services.

Engineering for reliability: a lost art?

This abstraction of hardware resources is a major departure from the classical infrastructure practices for IT systems that run mission-critical business and industrial applications. Server reliability greatly depends on the architectural features that detect and recover from errors occurring in processors and memory chips, often with the added help of the operating system.

Typical examples include soft errors (transient bit flips, typically caused by a particle strike or electrical anomaly) and hard errors (permanent faults) in memory cell arrays. Bit errors can occur both in the processor and in external memory banks; operational errors and design bugs in processor logic can also produce incorrect outputs or result in a software crash.

For much of its history, the IT industry has gone to great and costly lengths to design mission-critical servers (and storage systems) that can be trusted to manage data and perform operations as intended. The engineering discipline addressing server robustness is generally known as reliability, availability and serviceability (RAS, which was originally coined by IBM five decades ago), with the serviceability aspect referring to maintenance and upgrades, including software, without causing any downtime.

Traditional examples of these servers include mainframes, UNIX-based and other proprietary software and hardware systems. However, in the past couple of decades x86-based mission-critical systems, which are distinct from volume servers in their RAS features, have also taken hold in the market.

What sets mission-critical hardware design apart is its extensive error detection, correction and recovery capabilities, which go beyond those found in mainstream hardware. While perfect immunity to errors is not possible, such features greatly reduce the chances of data errors and software crashes.

Mission-critical systems tend to be able to isolate a range of faulty hardware components without resulting in any disruption. These components include memory chips (the most common source of data integrity and system stability issues), processor units or entire processors, or even an entire physical partition of a mission-critical server. Often, critical memory contents are mirrored within the system across different banks of memory to safeguard against hardware failures.

Server reliability doesn’t end with design, however. Vendors of mission-critical servers and storage systems test the final version of any new platform for many months to ensure it performs correctly (a process known as validation) before volume manufacturing begins.

Entire sectors, such as financial services, e-commerce, manufacturing, transport and more, have come to depend on and trust such hardware for the correctness of their critical applications and data.

Someone else’s server, my reliability

Maintaining a mission-critical level of infrastructure reliability in the cloud (or even just establishing underlying infrastructure reliability in general), as opposed to “simple” availability, is not straightforward. Major cloud providers don’t address the topic of reliability in much depth to begin with.

It is difficult to know what techniques, if any, cloud operators deploy to safeguard customer applications against data corruption and application failures beyond the use of basic error correction code in memory, which is only able to handle random, single-bit errors. Currently, there are no hyperscale cloud instances that offer enhanced RAS features comparable to mission-critical IT systems.

While IBM and Microsoft both offer migration paths directly for some mission-critical architectures, such as IBM Power and older s390x mainframes, their focus is on the modernization of legacy applications rather than maintaining reliability and availability levels that are comparable to on-premises systems. The reliability on offer is even less clear when it comes to more abstracted cloud services, such as software as a service and database as a service offerings or serverless computing.

Arguably, the future of reliability lies with software mechanisms. In particular, the software stack needs to adapt by getting rid of its dependency on hardware RAS features, whether this is achieved through verifying computations, memory coherency or the ability to remove and add hardware resources.

This puts the onus of RAS engineering almost solely on the cloud user. For new critical applications, purely software-based RAS by design is a must. However, the time and costs of refactoring or rearchitecting an existing mission-critical software stack to verify results and handle hardware-originating errors are not trivial — and are likely to be prohibitive in many cases, where it is possible at all.

Without the assistance of advanced RAS features in mission-critical IT systems, performance, particularly response times, will also likely take a hit if the same depth of reliability is required. At best, this means the need for more server resources to handle the same workload because the software mechanisms for extensive system reliability features will carry a substantial computational and data overhead.

These considerations should temper the pace at which mission-critical monolithic applications migrate to the cloud. Yet, these arguments are almost academic. The benefits of high reliability are difficult to quantify and compare (even more so than availability), in part because it is counterfactual — it is hard to measure what is being prevented.

Over time, cloud operators might invest more in generic infrastructure reliability and even offer products with enhanced RAS for legacy applications. But software-based RAS is the way forward in a world where hardware has become generic and abstracted.

Enterprise decision-makers should at least be mindful of the reliability (and availability) trade-offs involved with the migration of existing mission-critical applications to the cloud, and budget for investing in the necessary architectural and software changes if they expect the same level of service that an enterprise IT infrastructure can provide.

Use tools to control cloud costs before it’s too late

The public cloud’s on-demand pricing model is vital in enabling application scalability — the key benefit of cloud computing. Resources need to be readily available for a cloud application to scale when required without the customer having to give advance notification. Cloud providers can offer such flexibility by allowing customers to pay their bills in arrears, based on the quantity of resources consumed during a specified period.

This flexibility does have a downside, however. If more resources are consumed than expected due to increased demand or configuration errors, the organization is still liable to pay for them — it is too late to control costs after the fact. A total of 42% of respondents to the Uptime Institute Data Center Capacity Trends Survey 2022 cited escalating costs as the top reason for moving workloads from the public cloud back to on-premises infrastructure. Chief information officers face a tricky balancing act when allowing applications to scale to meet business objectives without letting budgets spiral out of control.

This Uptime Intelligence Update summarizes the challenges that cloud customers face when forecasting, controlling and optimizing costs. It also provides simple steps that can help buyers take control of their spend. As organizations face increasing macroeconomic pressures, reducing cloud expenditure has never been more important (see Cloud migrations to face closer scrutiny).

Cloud complexity

A cloud application is usually architected from multiple cloud services, such as virtual machines, storage platforms and databases. Each cloud service has its own metrics for billing. For example, customers may be charged for storage services based on the amount of storage used, the number of transactions made, and the bandwidth consumed between the storage service and the end user. The result is that even a simple bill for a cloud application will have many different charges spread across various services.
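To illustrate how a single service accrues several distinct charges, the sketch below prices a hypothetical object storage service on three metrics. The unit rates are invented for the example and do not reflect any provider’s actual price list.

```python
# Hypothetical unit rates for an object storage service (not a real price list).
RATE_PER_GB_MONTH = 0.023        # $ per GB stored per month
RATE_PER_10K_REQUESTS = 0.005    # $ per 10,000 transactions
RATE_PER_GB_EGRESS = 0.09        # $ per GB transferred out to end users

def storage_bill(gb_stored: float, requests: int, gb_egress: float) -> float:
    """Monthly charge for one storage service, built from three separate metrics."""
    return (gb_stored * RATE_PER_GB_MONTH
            + (requests / 10_000) * RATE_PER_10K_REQUESTS
            + gb_egress * RATE_PER_GB_EGRESS)

# 5 TB stored, 20 million requests, 2 TB served to end users in a month.
print(f"${storage_bill(5_000, 20_000_000, 2_000):,.2f}")
```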

Controlling, forecasting and optimizing the costs of cloud-native applications (i.e., applications built for the cloud that can scale automatically) is challenging for several reasons:

  • Consumption is not always under the customer’s control. For example, many end users might upload data to a storage platform — thus increasing the customer’s bill — without the customer being aware until the end of the billing period.
  • Each service has many metrics to consider and an application will typically use multiple cloud services. Each cloud provider measures the consumption of their service in different ways; there is no standard approach.
  • Metrics are not always related to tangible units that are easy for the customer to predict. For example, a specific and unknown type of transaction may generate a cost on a database platform; however, the customer may have no understanding or visibility of how many of these transactions will be executed in a certain period.
  • Applications may scale up by accident due to errant code or human error and use resources without purpose. Similarly, applications may not scale down when able (reducing costs) due to incorrect configuration.

Conversely, applications that don’t scale, such as those lifted and shifted from on-premises locations, are generally predictable and stable in capacity terms. However, without the ability to scale down (and reduce costs), infrastructure expenses are not always as low as hoped (see Asset utilization drives cloud repatriation economics).

A sudden and unexpected increase in a monthly cloud bill is often described as “bill shock,” a term coined initially for unexpectedly large consumer phone bills. Is bill shock a problem? Not necessarily. If an application is scaled to derive more revenue from the end users, for example, then paying more for the underlying infrastructure is not an issue. But although applications may be designed to be scalable, organizations and budgets are not. An IT department might generate new revenue for the organization from spending more on infrastructure — but if the department has a fixed budget, the chief financial officer might not understand why costs have increased. Most organizations would not report the cost of cloud services against any revenue created by the investment in those services — to senior management, cloud services may appear to be an expense rather than a value-generating activity.

The complexity of the situation has led to the creation of an open-source project, the FinOps Foundation. The foundation describes FinOps (a portmanteau of finance and operations) as a “financial management discipline and cultural practice that enables organizations to get maximum business value by helping engineering, finance, technology and business teams to collaborate on data-driven spending decisions.” At a high level, the foundation describes six principles to effectively manage cloud costs:

  • Teams need to collaborate.
  • Decisions should be driven by the business value of the cloud.
  • Everyone needs to take ownership of their cloud usage.
  • FinOps data should be accessible and timely.
  • FinOps needs to be driven by a centralized team.
  • Organizations should take advantage of the variable cost model of the cloud.

The need for a foundation dedicated to cloud finance demonstrates the complexity of managing cloud costs effectively. Fully executing the foundation’s six key steps requires substantial investment and motivation — and many organizations will need expert assistance in this endeavor.

Taking charge

There are some simple steps organizations can take to control their public cloud costs, most of which relate to the foundation’s six principles:

Set alerts to warn of overspending

All cloud providers allow customers to set custom spend alerts, which warn when a cost threshold has been reached. Such alerts enable the budget holders to determine if the spend is justified, if further funding should be sought, or if the spending is accidental and needs to be curtailed. Setting alerts is the minimal step all organizations should take to control their cloud expenditures. Organizations should ensure that alerts are configured and sent to a valid mailbox, phone number or event management system.
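As a supplement to provider-native alerts, a spend check can also be scripted against the provider’s billing API. The sketch below is a minimal example using the AWS Cost Explorer API via boto3 (other providers expose equivalent billing APIs); the threshold is a placeholder and the alert action is left as a print statement.

```python
# Minimal sketch: compare month-to-date spend against a fixed threshold.
# Assumes Cost Explorer is enabled, ce:GetCostAndUsage permissions are in place,
# and at least one full day of the month has elapsed.
import datetime
import boto3

THRESHOLD_USD = 10_000.0   # placeholder monthly budget

today = datetime.date.today()
start_of_month = today.replace(day=1)

ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": start_of_month.isoformat(), "End": today.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
)

spend = sum(
    float(period["Total"]["UnblendedCost"]["Amount"])
    for period in response["ResultsByTime"]
)

if spend > THRESHOLD_USD:
    # In practice this would page the budget holder, raise a ticket or send an email.
    print(f"ALERT: month-to-date spend ${spend:,.2f} exceeds ${THRESHOLD_USD:,.2f}")
else:
    print(f"Month-to-date spend ${spend:,.2f} is within budget")
```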

Use free tools to forecast month-to-month consumption

Most cloud providers include tools to forecast future spending based on past performance. These tools aren’t perfect, but they do give some visibility into how an application consumes resources over time, for free. It’s better to inform leadership in advance if costs are expected to rise rather than after the bill is due.
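A minimal sketch of the idea behind these forecasts: extrapolate month-to-date daily spend to a full month. Provider tools use more sophisticated models; the daily figures below are invented.

```python
# Naive forecast: project month-to-date daily spend to a full month.
import calendar
import datetime

daily_spend_usd = [310, 295, 320, 305, 340, 330, 315]  # hypothetical daily costs so far

today = datetime.date.today()
days_in_month = calendar.monthrange(today.year, today.month)[1]

average_daily = sum(daily_spend_usd) / len(daily_spend_usd)
forecast_month_end = average_daily * days_in_month

print(f"Average daily spend: ${average_daily:,.2f}")
print(f"Projected month-end bill: ${forecast_month_end:,.2f}")
```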

Work with stakeholders to determine future needs

Ensure that all parts of the business that use the public cloud understand how costs may change. For example, a new product launch, sale or event may increase the use of a website, which increases costs. Knowing this in advance enables a more realistic forecast of future costs and an open discussion on who will pay.

Consider showback and chargeback models

In a showback model, the IT department shows individual departments and business units their monthly cloud spends. The idea is that they become more aware of how their decisions affect expenditure, which enables them to take steps to reduce it. In a chargeback model, IT invoices these departments for the cloud costs related to their applications. Each department is then responsible for its own costs and is obliged to justify the expenditure relative to the value gained (e.g., increased revenue, better customer satisfaction).

Showback can be set up relatively quickly by “tagging” resources appropriately with an owner and then using the cloud provider’s reporting tools to break down business owners’ spending. Chargeback is a more significant undertaking, which affects the culture and structure of a company — most non-IT teams may not have the understanding or appetite to be financially responsible for their IT bills.
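A minimal sketch of a showback rollup from tagged billing records. A real implementation would parse the provider’s cost-and-usage export; the "owner" tag key and the records below are invented for illustration.

```python
# Showback sketch: roll up tagged cost records by owning department.
from collections import defaultdict

billing_records = [
    {"resource": "vm-web-01", "owner": "marketing", "cost_usd": 412.50},
    {"resource": "db-orders", "owner": "ecommerce", "cost_usd": 1280.00},
    {"resource": "vm-analytics", "owner": "finance", "cost_usd": 655.25},
    {"resource": "bucket-assets", "owner": "marketing", "cost_usd": 98.10},
    {"resource": "vm-legacy", "owner": None, "cost_usd": 210.00},  # untagged: needs chasing
]

showback = defaultdict(float)
for record in billing_records:
    showback[record["owner"] or "UNTAGGED"] += record["cost_usd"]

for owner, total in sorted(showback.items(), key=lambda item: -item[1]):
    print(f"{owner:<10} ${total:,.2f}")
```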

Take advantage of optimization tools

With an accurate forecast, organizations can use alternative pricing models to reduce their spend. These models give customers discounts of up to 70% compared with on-demand pricing in return for a commitment of up to three years or a minimum spend. Many cloud providers also offer spot instances, which provide cheap access to cloud resources on the understanding that this access can be terminated without warning. The best use of alternative pricing models will be discussed further in a future Uptime Intelligence Update. Most cloud providers offer tools that suggest alternative pricing models based on past performance. Such tools can also identify “orphaned” resources that cost money but don’t appear to be doing anything useful.
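The break-even arithmetic behind committed-use discounts is simple, as the sketch below shows. The 40% discount and the utilization figure are hypothetical, and real commitments add further terms (upfront payment, instance flexibility) that the provider tools account for.

```python
# Break-even arithmetic for a committed-use discount (figures are hypothetical).
on_demand_rate = 1.00          # $ per instance-hour on demand
commitment_discount = 0.40     # 40% discount in exchange for a 1- or 3-year commitment
committed_rate = on_demand_rate * (1 - commitment_discount)

# The commitment is paid whether or not the instance is used, so it only pays off
# if the instance runs for more than (1 - discount) of the committed hours.
break_even_utilization = committed_rate / on_demand_rate
print(f"Break-even utilization: {break_even_utilization:.0%}")  # 60%

expected_utilization = 0.75    # hypothetical: instance expected to run 75% of the time
hours = 8_760                  # one year
on_demand_cost = on_demand_rate * hours * expected_utilization
committed_cost = committed_rate * hours
print(f"On demand: ${on_demand_cost:,.0f}  Committed: ${committed_cost:,.0f}")
```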

Security and governance practices prevent overspend

A well-tested application hosted in a secure cloud environment reduces the likelihood of things going wrong and costs increasing as a result. For example, organizations should use role-based access to ensure only those employees who need to create resources are permitted to do so. This prevents costly services from being set up and subsequently forgotten about. Similarly, cloud customers should take appropriate precautions to stop malicious scripts from executing in their environment and sending out large quantities of data that will increase bandwidth costs. IT teams should test code thoroughly before deployment to reduce the chance of accidental resource consumption.

Get help

Most hyperscale cloud providers, including Amazon Web Services, Google Cloud Platform, Microsoft Azure, Oracle Cloud, IBM Cloud and Alibaba Cloud, offer tools to aid cost forecasting, optimization and management. Smaller cloud providers are less likely to have these features, but they usually bill on fewer metrics and offer fewer services, which reduces complexity.

Some organizations use third-party platforms to track and optimize their spend. The key benefit of these platforms is that they can optimize across multiple cloud providers and are independent, which arguably provides a more unbiased view of costs. These platforms include Apptio Cloudability, Flexera, NetApp CloudCheckr, IBM Turbonomic and VMware CloudHealth.

There are also consultancies and managed service providers, such as Accenture, Deloitte and HCLTech, that integrate cost-optimization practices into organizations and optimize cloud costs on their customers’ behalf on an ongoing basis.

The cost of not acting can be substantial. This analyst spent $4,000 on a bare-metal cloud server after accidentally leaving it running for two months. Without an alert set up, the analyst only became aware when the cloud provider posted an invoice to his home address. Organizations should check that warnings and limits are configured now, before it is too late. If cloud costs are a significant part of the IT expenditure, expert advice is essential.

24×7 carbon-free energy (part two): getting to 100%

Digital infrastructure operators have started to refocus their sustainability objectives on 100% 24×7 carbon-free energy (CFE) consumption: using carbon-free energy for every hour of operation.

To establish a 24×7 CFE strategy, operators must track and control CFE assets and the delivery of energy to their data centers and use procurement contracts designed to manage the factors that affect the availability and cost of renewable energy / CFE megawatt-hours (MWh).

The first Update in this two-part series, 24×7 carbon-free energy (part one): expectations and realities, focused on the challenges that data center operators face as they look to increase the share of CFE consumed in their facilities. This second report outlines the steps that operators, grid authorities, utilities and other organizations need to take to enable data centers to be 100% powered by CFE. As previously discussed, data centers that approach 100% CFE using only wind and solar generation currently have to buy several times more generation capacity than they need.

Figure 1 illustrates the economic challenges of approaching 100% 24×7 CFE. To reach 50% to 80% CFE consumption, the price premium is likely to be 5% or less in markets with high CFE penetration (30% or more of the generated MWh). The levelized cost of electricity (LCOE) is the average net present cost of the electricity based on the cost of generation over the lifetime of an individual or group of generation facilities. Beyond 80% 24×7 CFE, the electricity rate escalates because not enough CFE generation and storage assets are available to provide a reliable electricity supply during periods of low wind and solar generation.
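LCOE is usually expressed as discounted lifetime costs divided by discounted lifetime generation. The sketch below shows the calculation for a single hypothetical generation asset; all figures are invented.

```python
# LCOE sketch: discounted lifetime costs divided by discounted lifetime generation.
capex = 60_000_000            # upfront capital cost ($)
annual_opex = 1_200_000       # operations and maintenance per year ($)
annual_generation_mwh = 175_000
lifetime_years = 25
discount_rate = 0.07

discounted_costs = capex + sum(
    annual_opex / (1 + discount_rate) ** year for year in range(1, lifetime_years + 1)
)
discounted_generation = sum(
    annual_generation_mwh / (1 + discount_rate) ** year for year in range(1, lifetime_years + 1)
)

lcoe = discounted_costs / discounted_generation
print(f"LCOE: ${lcoe:.2f} per MWh")
```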

To push toward 100% CFE consumption, operators will need to take actions — or support efforts — to increase grid region interconnects and to develop and deploy reliable, dispatchable, carbon-free generation and long duration energy storage (LDES) capacity. LDES comprises storage technologies that can store energy for extended periods, discharge electricity continuously for one to 10 days or longer, and supply electricity at rates that are competitive with other generation technologies.

Figure 1. The cost of electricity as the percentage of 24×7 CFE approaches 100%

Increase grid region interconnections

Wind and / or solar resources are prevalent in some regions and absent in others. In many cases, areas with productive wind and solar resources are distant from those with high electricity demand. Connecting abundant resources with high-demand areas requires the build-out of high-voltage interconnects within and between grid regions and countries. Recent news and trade articles have detailed how the current lack of grid interconnections and flexibility to support more distributed generation assets is slowing the building and deployment of planned solar and wind generation facilities. Numerous published studies detail the high-voltage transmission system buildouts needed to support 100% CFE in grid regions around the globe.

The untapped potential of inter-regional grid interconnections is illustrated by the excess generation capacity associated with Google’s wind power purchase agreements (PPAs) for its Oklahoma and Iowa data centers in the Midwest region of the US, where wind generation is high. Google has four data centers in the adjacent US Southeast region (where wind generation is low). These have a low percentage of 24×7 CFE so would benefit from access to the excess Iowa and Oklahoma wind capacity. However, the extra wind capacity cannot be transported from the Midwest Reliability Organization (MRO) grid region to the Southeast Reliability Corporation (SERC) because of a lack of high-voltage interconnection capacity (Figure 2).

Figure 2. Google wind and solar assets by grid region (US)

The buildout of these projects is complicated by permitting issues and financing details. Again, the Google example is instructive. The Clean Line, a 2 GW (gigawatt) high-voltage transmission line between MRO and SERC, was proposed by a development consortium around 2013. The line would have enabled Google and other purchasers of wind power in the MRO region to transport excess wind generation to their data centers and other facilities in the SERC region. Clean Line could also have transported excess solar power from the SERC region to the MRO region.

The permitting process extended over four years or more, slowed by legal challenges from property owners and others. The developer had difficulties securing finance, because financiers required evidence of transportation contracts from MRO to SERC, and of PPAs with energy retailers and utilities in SERC. Generators in MRO would not sign transmission contracts without a firm construction schedule. Regulated utilities, the primary energy retailers in SERC, were hesitant to sign PPAs for intermittent renewable power because they would have to hold reliable generation assets in reserve to manage output variations. The developer built a lower capacity section of the line from MRO to the western edge of SERC, but the remainder of the planned system was shelved.

Reliable CFE generation

Complete decarbonization of the electricity supply will require deploying reliable carbon-free or low-carbon energy generation across the global energy grid. Nuclear and geothermal technologies are proven and potentially financially viable, but designs must be refreshed.

Nuclear generation is undergoing a redesign, with an emphasis on small modular reactors. These designs consist of defined, repeatable modules primarily constructed in central manufacturing facilities and assembled at the production site. They are designed to provide variable power output to help the grid match generation to demand. Like other systems critical to grid decarbonization, new reactor types are in the development and demonstration stage and will not be deployed at scale for a decade or more.

Geothermal generation depends on access to the high-temperature zones underground. Opportunities for deployment are currently limited to areas such as Hawaii, where high-temperature subterranean zones are close to the surface.

Horizontal drilling technologies developed by the petroleum industry can access deeper high-temperature zones. However, the corrosive, high-pressure, and high-temperature conditions experienced by these systems present many engineering challenges. Several companies are developing and demonstrating installations that overcome these challenges.

Electrolytic hydrogen generation is another technology that offers a means to deploy reliable electricity generation assets fueled with hydrogen. It has the advantage of providing a market for the excess wind and solar generation overcapacity discussed earlier in the report. The economics of electrolytic systems suffer from current conversion efficiencies of 75% or less and uncertain economic conditions. Production incentives provided by the US Inflation Reduction Act, and similar programs in Japan, Australia, the EU and the UK, are attracting initial investments and accelerating the installation of hydrogen generation infrastructure to demonstrate system capabilities and reduce unit costs.

These three technologies illustrate the challenges in increasing the reliable carbon-free energy capacity available on the electricity grid. This discussion is not an exhaustive review, but an example of the technical and manufacturing challenges and extended timelines required to develop and deploy these technologies at scale.

Long duration energy storage

Long duration energy storage (LDES) is another critical technology needed to move the electricity grid to 100% 24×7 CFE. Table 1 details various battery and physical storage technologies currently being developed. The table represents a general survey of technologies and is not exhaustive.

Table 1. Long duration energy storage (LDES) technologies

Each type of battery is designed to fill a specific operating niche, and each type is vital to attaining a low-carbon or carbon-free electricity grid. All these technologies are intended to be paired with wind and solar generation assets to produce a firm, reliable supply of CFE for a specified time duration.

  • 4-hour discharge duration: Lithium-ion batteries are a proven technology in this category, but they suffer from high costs, risk of fires, and deterioration of battery capacity with time. Other technologies with lower capital costs and better operating characteristics are under development. These batteries are designed to manage the short-term minute-to-minute, hour-to-hour variations in the output of wind and solar generation assets.
  • 8- to 12-hour discharge duration: These storage systems address the need for medium-duration, dispatchable power to fill in variations in wind and solar generation over periods beyond 4 hours. The batteries will provide continuous, quasi-reliable round-the-clock power by charging during periods of excess wind and solar production and discharging during periods of no or low MWh output. These systems will rely on sophisticated software controls to manage the charge / discharge process and to integrate the power generation with grid demand.
  • ≥10 day discharge duration: These storage systems are designed to support the grid during longer periods of low generation — such as low solar output caused by several cloudy days or a multiday period of low wind output — and to cover significant week-to-week output variations. As with the 8- to 12-hour LDES systems, the energy cost and the frequency of discharge to the grid will govern the economics of the operation.

The cost of energy from the battery depends on three factors: the energy cost to charge the batteries, the round-trip efficiency of the charging / discharging process, and the number of charge / discharge cycles achieved for a given period.

The impact of the cost of power to charge the batteries is evident. The critical point is that the economics of a battery system depend on buying power during periods of grid saturation when wholesale electricity prices are approaching zero.

Round-trip efficiency dictates the quantity of excess energy that has to be purchased to compensate for the inefficiencies of the charge / discharge cycle. If the round-trip efficiency is 80%, the battery operator must purchase 1.25 MWh of energy for every 1 MWh delivered back to the grid. To make the economics of a battery system work, charging power needs to be purchased at a near zero rate.

To be profitable, the revenue from energy sales must cover the cost of buying power, operating and maintaining the battery system, and paying off the loan used to finance the facility. The economics of an 8- to 12-hour battery system will be optimized if the system charges and discharges every 24 hours. If the battery system can only dispatch energy on one out of seven days, the revenue is unlikely to cover the financing costs.
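A minimal sketch of how round-trip efficiency and cycle frequency drive the delivered cost of energy. All figures are hypothetical, and the model ignores degradation, ancillary revenue and market price variation.

```python
# Simplified LDES economics: round-trip efficiency and cycle frequency drive
# the delivered cost of energy.
round_trip_efficiency = 0.80
charge_price_per_mwh = 5.0        # near-zero wholesale price during grid saturation
delivered_mwh_per_cycle = 400     # energy dispatched back to the grid per cycle
fixed_costs_per_year = 4_000_000  # O&M plus debt service on the facility

def delivered_cost(cycles_per_year: int) -> float:
    """Cost per MWh delivered to the grid for a given number of cycles per year."""
    purchased_mwh = delivered_mwh_per_cycle / round_trip_efficiency  # 1.25 MWh bought per MWh sold
    charging_cost = purchased_mwh * charge_price_per_mwh * cycles_per_year
    delivered_mwh = delivered_mwh_per_cycle * cycles_per_year
    return (charging_cost + fixed_costs_per_year) / delivered_mwh

print(f"Daily cycling:  ${delivered_cost(365):.2f} per MWh")
print(f"Weekly cycling: ${delivered_cost(52):.2f} per MWh")
```

With daily cycling, the fixed costs are spread over roughly seven times more delivered energy than with weekly cycling, which is why dispatch frequency dominates the economics.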

Effectively integrating storage systems into the grid will require sophisticated software systems to control the charging and discharging of individual battery systems and the flow to and from battery systems at a grid level. These software systems will likely need to include economic algorithms tuned to address the financial viability of all the generating assets on the grid.

Conclusions

Transitioning the global electricity grid to CFE will be a long-term endeavor that will take several decades. Given the challenges inherent in the transition, data center operators should develop plans to increase their electricity consumption to 70% to 80% CFE, while supporting extended grid interconnections and the development and commercialization of LDES technologies and non-intermittent CFE electricity generation.

There will be grid regions where the availability of wind, solar, hydroelectric and nuclear generation assets will facilitate a faster move to economic 100% CFE. There are other regions where 100% CFE use will not be achieved for decades. Data center operators will need to accommodate these regional differences in their sustainability strategies. Because of the economic and technical challenges of attaining 100% CFE in most grid regions, it is unrealistic to expect to reach 100% CFE consumption across a multi-facility, multi-regional data center portfolio until at least 2040.

Uptime Institute advises operators to reset the timing of their net zero commitments and to revise their strategy to depend less on procuring renewable energy credits, guarantees of origin, and carbon offsets. Data center operators will achieve a better outcome if they apply their resources to promote and procure CFE for their data center. But where CFE purchases are the key to operational emissions reductions, net zero commitments will need to move out in time.

Data center managers should craft CFE procurement strategies and plans to incrementally increase CFE consumption in all IT operations — owned, colocation and cloud. By focusing net zero strategies on CFE procurement, data center managers will achieve real progress toward reducing their greenhouse gas emissions and will accelerate the transition to a low-carbon electricity grid.

Where the cloud meets the edge

Low latency is the main reason cloud providers offer edge services. Only a few years ago, the same providers argued that the public cloud (hosted in hyperscale data centers) was suitable for most workloads. But as organizations have remained steadfast in their need for low latency and better data control, providers have softened their resistance and created new capabilities that enable customers to deploy public cloud software in many more locations.

These customers need a presence close to end users because applications, such as gaming, video streaming and real-time automation, require low latency to perform well. End users — both consumers and enterprises — want more sophisticated capabilities with quicker responses, which developers building on the cloud want to deliver. Application providers also want to reduce the network costs and bandwidth constraints that result from moving data over wide distances — which further reinforces the need to keep data close to point of use.

Cloud providers offer a range of edge services to meet the demand for low latency. This Update explains what cloud providers offer to deliver edge capabilities. Figure 1 shows a summary of the key products and services available.

Figure 1. Key edge services offered by cloud providers

In the architectures shown in Figure 1, public cloud providers generally consider edge and on-premises locations as extensions of the public cloud (in a hybrid cloud configuration), rather than as isolated private clouds operating independently. The providers regard their hyperscale data centers as default destinations for workloads and propose that on-premises and edge sites be used for specific purposes where a centralized location is not appropriate. Effectively, cloud providers see the edge location as a tactical staging post.

Cloud providers don’t want customers to view edge locations as being as feature-rich as cloud regions. Providers only offer a limited number of services at the edge to address specific needs, in the belief that the central cloud region should be the mainstay of most applications. Because the edge cloud relies on the public cloud for some aspects of administration, there is a risk of data leakage and loss of control of the edge device. Furthermore, the connection between the public cloud and the edge is a potential single point of failure.

Infrastructure as a service

In an infrastructure as a service (IaaS) model, the cloud provider manages all aspects of the data center, including the physical hardware and the orchestration software that delivers the cloud capability. Users are usually charged per resource consumed per period.

The IaaS providers use the term “region” to describe a geographical area that contains a collection of independent availability zones (AZs). An AZ consists of at least one data center. A country may have many regions, each typically having two or three AZs. Nearly all the services that cloud providers host within regions and AZs are offered as IaaS.

Cloud providers also offer metro and near-premises locations sited in smaller data centers, appliances or colocation sites nearer to the point of use, and manage these in a similar way to AZs. Providers claim millisecond-level connections between end users and these edge locations. However, the edge locations usually have fewer capabilities, poorer resiliency and higher prices than AZs and broader regions, which are hosted in (usually) larger data centers.

Furthermore, providers don’t have edge locations in all areas; they create new locations only where volume is likely to make the investment worthwhile, typically in cities. Similarly, the speed of connectivity between edge sites and end users depends on the supporting network infrastructure and the availability of 4G or 5G.

When a cloud provider customer wants to create a cloud resource or build out an application or service, they first choose a region and AZ using the cloud provider’s graphical user interface (GUI) or application programming interface (API). To the user of GUIs and APIs, metro and near-premises locations appear as options for deployment just as a major cloud region would. Buyers can set up a new cloud provider account and deploy resources in an IaaS edge location in minutes.
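For illustration, launching a compute instance into an edge location through the provider’s API looks much like launching into any other subnet. The sketch below assumes AWS’s boto3 SDK and a Local Zone that has already been opted into, with a subnet created there; the AMI, instance type and subnet ID are placeholders, and other providers expose equivalent mechanisms.

```python
# Minimal sketch: launching a compute instance into an edge (Local Zone) subnet
# is the same API call as launching into any other subnet. IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",        # placeholder AMI
    InstanceType="t3.medium",               # edge locations support a limited set of types
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",    # subnet created in the Local Zone
)

print(response["Instances"][0]["InstanceId"])
```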

Metro locations

Metro locations do not offer the range of cloud services that regions do. They typically offer only compute, storage, container management and load balancing. A region has multiple AZs; a metro location does not. As a result, it is impossible to build a fully resilient application in a single metro location (see Cloud scalability and resiliency from first principles).

The prime example of a metro location is Amazon Web Services (AWS) Local Zones. Use cases usually focus on graphically intense applications such as virtual desktops or video game streaming, or real-time processing of video, audio or sensor data. Customers, therefore, should understand that although edge services in the cloud might cut latency and bandwidth costs, resiliency may also be lower.

These metro-based services, however, may still match many or most enterprise levels of resiliency. Data center infrastructure that supports metro locations is typically in the scale of megawatts, is staffed, and is built to be concurrently maintainable. Connection to end users is usually provided over redundant local fiber connections.

Near-premises (or far edge) locations

Like cloud metro locations, near-premises locations have a smaller range of services and AZs than regions do. The big difference between near-premises and metros is that resources in near-premises locations are deployed directly on top of 4G or 5G network infrastructure, perhaps only a single cell tower away from end-users. This reduces hops between networks, substantially reducing latency and delays caused by congestion.

Cloud providers partner with major network carriers to enable this, for example AWS’s Wavelength service delivered in partnership with Vodafone, and Microsoft’s Azure Edge Zones service with AT&T. Use cases include (or are expected to include) real-time applications, such as live video processing and analysis, autonomous vehicles and augmented reality. 5G enables connectivity where there is no fixed-line telecoms infrastructure or in temporary locations.

These sites may be cell tower locations or exchanges, operating tens of kilowatts (kW) to a few hundred kW. They are usually unstaffed (remotely monitored) with varying levels of redundancy and a single (or no) engine generator.

On-premises cloud extensions

Cloud providers also offer hardware and software that can be installed in a data center that the customer chooses. The customer is responsible for all aspects of data center management and maintenance, while the cloud provider manages the hardware or software remotely. The provider’s service is often charged per resource per period over an agreed term. A customer may choose these options over IaaS because no suitable cloud edge locations are available, or because regulations or strategy require them to use their own data centers.

Because the customer chooses the location, the equipment and data center may vary. In the edge domain, these locations are: typically 10 kW to a few hundred kW; owned or leased; and constructed from retrofitted rooms or using specialized edge data center products (see Edge data centers: A guide to suppliers). Levels of redundancy and staff expertise vary, so some edge data center product suppliers provide complementary remote monitoring services. Connectivity is supplied through telecom interconnection and local fiber, and latency between site and end user varies significantly.

Increasingly, colocation providers differentiate by directly peering with cloud providers’ networks to reduce latency. For example, Google Cloud recommends its Dedicated Interconnect service for applications where latency between public cloud and colocation site must be under 5ms. Currently, 143 colocation sites peer with Google Cloud, including those owned by companies such as Equinix, NTT, Global Switch, Interxion, Digital Realty and CenturyLink. Other cloud providers have similar arrangements with colocation operators.

Three on-premises options

Three categories of cloud extensions can be deployed on-premises. They differ in how easy it is to customize the combination of hardware and software. An edge cloud appliance is simple to implement but has limited configuration options; a pay-as-you-go server gives flexibility in capacity and cloud integration but requires more configuration; finally, a container platform gives flexibility in hardware and software and multi-cloud possibilities, but requires a high level of expertise.

Edge cloud appliance

An edge appliance is a pre-configured hardware appliance with pre-installed orchestration software. The customer installs the hardware in its data center and can configure it, to a limited degree, to connect to the public cloud provider. The customer generally has no direct access to the hardware or orchestration software.

Organizations deploy resources via the same GUI and APIs as they would use to deploy public cloud resources in regions and AZs. Typically, the appliance needs to be connected to the public cloud for administration purposes, with some exceptions (see Tweak to AWS Outposts reflects demand for greater cloud autonomy). The appliance remains the property of the cloud provider, and the buyer typically leases it based on resource capacity over three years. Examples include AWS Outposts, Azure Stack Hub and Oracle Roving Edge Infrastructure.
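
As an illustration of the “same APIs” point, the minimal sketch below assumes an AWS Outposts-style appliance: an instance is launched with the standard EC2 API by targeting a subnet associated with the on-premises appliance. All identifiers are hypothetical placeholders.

```python
# Minimal sketch: launching a compute instance on an on-premises appliance
# using the same EC2 API as in a public region. The AMI and subnet IDs are
# hypothetical; the subnet is assumed to be associated with the appliance.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # appliance anchored to a parent region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder image
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",  # subnet created on the appliance
)
print(response["Instances"][0]["InstanceId"])
```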

Pay-as-you-go server

A pay-as-you-go server is a physical server leased to the buyer and charged based on committed and consumed resources (see New server leasing models promise cloud-like flexibility). The provider maintains the server, measures consumption remotely, proposes capacity increases based on performance, and refreshes servers when appropriate. The provider may also include software on the server, again charged using a pay-as-you-go model. Such software may consist of cloud orchestration tools that provide private cloud capabilities and connect to the public cloud for a hybrid model. Customers can choose their hardware specifications and use the provider’s software or a third party’s. Examples include HPE GreenLake and Dell APEX.
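
The sketch below is an illustrative billing calculation only, not any vendor’s actual pricing model. It shows the common pattern of a committed baseline plus an overage rate for consumption above the commitment; the rates and units are made up.

```python
# Illustrative pay-as-you-go charge: committed capacity billed at a base rate,
# plus consumption above the commitment billed at a (typically higher) rate.
def monthly_charge(committed_units: float, consumed_units: float,
                   base_rate: float, overage_rate: float) -> float:
    overage = max(0.0, consumed_units - committed_units)
    return committed_units * base_rate + overage * overage_rate

# Example: 100 units committed, 130 consumed
print(monthly_charge(100, 130, base_rate=8.0, overage_rate=10.0))  # 1100.0
```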

Container software

Customers can also choose their own blend of hardware and software with containers as the underlying technology, to enable interoperability with the public cloud. Containers allow software applications to be decomposed into many small functions that can be maintained, scaled, and managed individually. Their portability enables applications to work across locations.

Cloud providers offer managed software for remote sites that is compatible with public clouds. Examples include Google Anthos, IBM Cloud Satellite and Red Hat OpenShift Container Platform. In this option, buyers can choose their hardware and some aspects of their orchestration software (e.g., container engines), but they are also responsible for building the system and managing the complex mix of components (see Is navigating cloud-native complexity worth the hassle?).
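
To illustrate the portability argument, the sketch below (cluster context names and the container image are hypothetical) uses the Kubernetes Python client to apply the same Deployment specification to two clusters, one at an edge site and one in a public cloud region, by switching kubeconfig contexts. Managed platforms such as those named above wrap this pattern in their own tooling.

```python
# Apply the same container Deployment to an edge cluster and a cloud cluster.
# Context names and the image reference are hypothetical placeholders.
from kubernetes import client, config

def make_deployment(name: str, image: str, replicas: int) -> client.V1Deployment:
    container = client.V1Container(
        name=name, image=image,
        ports=[client.V1ContainerPort(container_port=8080)])
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": name}),
        spec=client.V1PodSpec(containers=[container]))
    spec = client.V1DeploymentSpec(
        replicas=replicas,
        selector=client.V1LabelSelector(match_labels={"app": name}),
        template=template)
    return client.V1Deployment(
        api_version="apps/v1", kind="Deployment",
        metadata=client.V1ObjectMeta(name=name), spec=spec)

for context in ("edge-site-cluster", "cloud-region-cluster"):  # hypothetical contexts
    api = client.AppsV1Api(api_client=config.new_client_from_config(context=context))
    api.create_namespaced_deployment(
        namespace="default",
        body=make_deployment("video-analytics", "example.com/video-analytics:1.0", 3))
```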

Considerations

Buyers can host applications in edge locations quickly and easily by deploying in metro and near-premises locations offered by cloud providers. Where a suitable edge location is not available, or the organization prefers to use on-premises data centers, buyers have multiple options for extending public cloud capability to an edge data center.

Edge locations differ in terms of resiliency, product availability and — most importantly — latency. Latency should be the main motivation for deploying at the edge. Generally, cloud buyers pay more for deploying applications in edge locations than they would in a cloud region. If there is no need for low latency, edge locations may be an expensive luxury.

Buyers must deploy applications to be resilient across edge locations and cloud regions. Edge locations have less resiliency, may be unstaffed, and may be more likely to fail. Applications must be architected to continue to operate if an entire edge location fails.
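
A minimal sketch of this principle, assuming the application exposes the same API at an edge site and in a parent cloud region (the URLs and timeouts are hypothetical placeholders): requests are sent to the edge endpoint first and fall back to the region if the edge tier is unreachable or unhealthy.

```python
# Edge-first request with regional fallback. Endpoints and timeouts are
# hypothetical; a production design would also handle retries and caching.
import requests

ENDPOINTS = [
    ("https://edge.example.com/api/v1/score", 0.25),   # low-latency edge tier
    ("https://region.example.com/api/v1/score", 2.0),  # regional fallback
]

def resilient_request(payload: dict) -> dict:
    last_error = None
    for url, timeout_s in ENDPOINTS:
        try:
            resp = requests.post(url, json=payload, timeout=timeout_s)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_error = exc  # edge tier failed; try the next tier
    raise RuntimeError(f"all endpoints failed: {last_error}")
```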

Cloud provider edge locations and products are not generally designed to operate in isolation — they are intended to serve as extensions of the public cloud for specific workloads. Often, on-premises and edge locations are managed via public cloud interfaces. If the connection between the edge site and the public cloud goes down, the site may continue to operate — but it will not be possible to deploy new resources until the site is reconnected to the public cloud platform that provides the management interface.

Data protection is often cited as a reason why some operators choose to locate applications and data at the edge. However, because the edge device and public cloud need to be connected, there is a risk of user data or metadata inadvertently leaving the edge and entering the public cloud, thereby breaching data protection requirements. This risk must be managed.


Dr. Owen Rogers, Research Director of Cloud Computing

Tomas Rahkonen, Research Director of Distributed Data Centers

Data center operators will face more grid disturbances

The energy crisis of 2022, resulting from Russia’s invasion of Ukraine, caused serious problems for data center operators in Europe. Energy prices rose sharply and are likely to stay high. This has resulted in ongoing concerns that utilities in some European countries, where supply and demand are mismatched, will have to shed loads.

Even before the current crisis, long-term trends in the energy sector pointed towards less reliable electrical grids, not only in Europe, but also in North America. Data center operators — which rarely have protected status due to their generator use — are having to adapt to a new reality: electrical grids in the future may be less reliable than they are now.

Frequency and voltage disturbances may occur more frequently, even if full outages do not, with consequent risk to equipment. Data center operators may need to test their generators more often and work more closely with their utilities to combat grid instability. In some situations, colocation providers may have to work with their customers to ensure service level agreements (SLAs) are met, and even shift some workloads ahead of power events.

There are four key factors that affect the reliability of electrical grids:

  • Greater adoption of intermittent renewable energy sources.
  • Aging of electricity transmission systems.
  • More frequent (and more widespread) extreme weather events.
  • Geopolitical instability that threatens oil and gas supplies.

Individually, these factors are unlikely to deplete grid reserve margins, which measure the difference between expected maximum available supply and expected peak demand. In combination, however, they can create an environment that complicates grid operation and increases the likelihood of outages.
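
As a simple illustration (the figures are hypothetical), reserve margin is commonly expressed as a percentage of expected peak demand:

\[
\text{Reserve margin} = \frac{\text{expected available capacity} - \text{expected peak demand}}{\text{expected peak demand}}
\]

so a grid expecting 110 GW of available capacity against a 100 GW peak would carry a 10% margin.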

Two out of six regional grid operators in the US have reported that their reserve margins are currently under target. One of them — the Midcontinent Independent System Operator — is expected to have a 1,300 megawatt (MW) capacity shortfall in the summer of 2023. In Europe, Germany is turning to foreign power suppliers to meet its reserve margins; even before the war in Ukraine, the country faced a 1,400 MW capacity shortfall. There has been a marked increase in warnings of power events and outages — even if widespread outages have not yet materialized.

Dispatchable versus intermittent

An often-cited reason for grid instability is that dispatchable (or firm) power generation — for example, from natural gas, coal and nuclear sources — is being replaced by intermittent renewable power generated by solar and wind energy. This hinders the ability of grid operators to match energy supply with demand and makes them more vulnerable to variations in weather. Utilities find themselves caught between competing mandates: maintain strict levels of reserve power, decommission the most reliable fossil fuel power plants, and maintain profitability.

Historically, electrical grid operators have depended on firm power generation to buttress the power system. Due to the market preference for renewable energy, many power plants that provide grid stability now face severe economic and regulatory challenges. These include lower-than-expected capacity factors, which compare power plants’ output with how much they could produce at peak capacity; higher costs of fuel and upgrades; stricter emissions requirements; and uncertainty regarding license renewals.
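
For reference, the capacity factor over a period is:

\[
\text{Capacity factor} = \frac{\text{energy generated over the period}}{\text{rated capacity} \times \text{hours in the period}}
\]

For example (illustrative figures), a 100 MW plant that generates 262,800 MWh over a year (8,760 hours) runs at a capacity factor of 30%.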

The effects are seen in the levelized cost of electricity (LCOE), measured in dollars per megawatt-hour (MWh). The LCOE is the average net present value of electricity generation over the lifetime of a generation asset. It factors in debt payments, fuel costs and maintenance costs, and requires assumptions regarding future fuel prices and interest rates. According to financial services company Lazard’s annual US analysis (2023 Levelized cost of energy+), the unsubsidized LCOE of utility-scale solar is $24-$96 per MWh and that of onshore wind is $24-$75 per MWh. Combined cycle natural gas plants, for comparison, have an LCOE of $39-$101 per MWh, while natural gas peaking plants have an LCOE of $115-$221 per MWh.
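
In its commonly used form, the LCOE divides the discounted lifetime costs of a generation asset by its discounted lifetime output:

\[
\text{LCOE} = \frac{\displaystyle\sum_{t=1}^{n} \frac{I_t + M_t + F_t}{(1+r)^t}}{\displaystyle\sum_{t=1}^{n} \frac{E_t}{(1+r)^t}}
\]

where \(I_t\), \(M_t\) and \(F_t\) are the investment (including debt service), maintenance and fuel costs in year \(t\), \(E_t\) is the electricity generated that year, \(r\) is the discount rate and \(n\) is the asset lifetime. The wide ranges quoted above reflect how sensitive this calculation is to the assumed fuel prices and interest rates.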

The economic viability of generation assets, as represented by LCOE, is only a part of a very complex analysis to determine the reserve margin needed to maintain grid stability. Wind and solar generation varies significantly by the hour and the season, making it essential that the grid has sufficient dispatchable assets (see 24×7 carbon-free energy (part one): expectations and realities).

Grid-scale energy storage systems, such as electrochemical battery arrays or pumped storage hydropower, are a possible solution to the replacement of fossil fuel-based firm power. Aside from a few examples, these types of storage do not have the capacity to support the grid when renewable energy output is low. It is not merely a question of building more installations, but of affordability: grid-scale energy storage is not currently economical.

Grids are showing their age

Much of the infrastructure supporting electricity transmission in the US and Europe was built in the 1970s and 1980s. These grids were designed to transmit power from large, high-capacity generation facilities to consumers. They were not designed to manage the multi-directional power flows of widely distributed wind and solar generation assets, electric vehicle chargers (which can consume or supply power) and other vagaries of a fully electrified economy. The number of end users served by electrical grids has increased dramatically since then, with the world’s population doubling, and power networks are now struggling to keep up without expensive capacity upgrades and new transmission lines.

In the US, the Department of Energy found that the average age of large power transformers, which handle 90% of the country’s electricity flow, is more than 40 years. Three-quarters of the country’s 707,000 miles (1,138,000 km) of transmission lines — which have a lifespan of 40 to 50 years — are more than 25 years old. The number of transmission outages in the US has more than doubled in the past six years compared with the previous six, according to the North American Electric Reliability Corporation. This is partially due to an increase in severe weather events, which put more stress on the power grid (more on this below).

Renewable energy in some grid regions is being wasted because there is not enough transmission capacity to transport it to other regions that have low generation and high demand. In the US, for example, the lack of high-voltage transmission lines prevents energy delivery from expansive wind farms in Iowa and Oklahoma to data centers in the southeast. Germany is facing similar problems as it struggles to connect wind capacity in the north near the Baltic Sea to the large industrial consumers in the south.

Energy experts estimate that modernizing the US power grid will require an investment of $1 trillion to $2 trillion over the next two decades. Current business and regulatory processes cannot attract and support this level of funding; new funding structures and business processes will be required to transform the grid for a decarbonized future.

Weather is changing

The Earth is becoming warmer, which means long-term changes to broad temperature and weather patterns. These changes are increasing the risk of extreme weather events, such as heat waves, floods, droughts and fires. According to a recent study published in the Proceedings of the National Academy of Sciences, individual regions within the continental US were more than twice as likely to see a “once in a hundred years” extreme temperature event in 2019 as they were in 1979.

Extreme weather events put the grid system at risk. High temperatures reduce the capacity of high-voltage transmission lines, high winds can knock power lines off towers, and stormy conditions can result in the loss of generation assets. In the past three years, Gulf Coast hurricanes, West Coast wildfires, Midwest heat waves and a Texas deep freeze have all caused local power systems to fail. The key issue with extreme weather is that it can disrupt grid generation, transmission and fuel supplies simultaneously.

The Texas freeze of February 2021 serves as a cautionary tale of interdependent systems. As the power outages began, some natural gas compressors became inoperable because they were powered by electricity without backup — leading to further blackouts and interrupting the vital gas supply needed to maintain or restart generation stations.

Geopolitical tensions create fuel supply risks

Fuel supply shocks resulting from the war in Ukraine rapidly increased oil and natural gas prices in Europe during the winter of 2022. The global energy supply chain is a complex system of generators, distributors and retailers that prioritizes economic efficiency and low cost, often at the expense of resiliency, which can create the conditions for cascading failures. Many countries in the EU warned that fuel shortages would cause rolling blackouts if the weather became too cold.

In the UK, the government advised that a reasonable worst-case scenario would see rolling power cuts across the country. Large industrial customers, including data center operators, would have been affected, losing power in 3-hour blocks without any warning. In France, planning for the worst-case scenario assumed that up to 60% of the country’s population would be affected by scheduled power cuts. Even Germany, a country with one of the most reliable grids in the world, made plans for short-term blackouts.

Fuel supply shocks are reflected in high energy prices and create short- and long-term risks for the electrical grid. Prior to the conflict in Ukraine, dispatchable fossil fuel peaker plants, which only operate during periods of high demand, struggled to maintain price competitiveness with renewable energy producers. This trend was exacerbated by high fuel costs and renewable energy subsidies.

Any political upheaval in the major oil-producing regions, such as the Gulf states or North Africa, would affect energy prices and energy supply. Development of existing shale gas deposits could offer some short-term energy security benefits in markets such as the US, the UK and Germany, but political pressures are currently preventing such projects from getting off the ground.

Being prepared for utility power problems

Loss of power is still the most common cause of data center outages. Uptime Institute’s 2022 global survey of data center managers attributes 44% of outages to power issues — greater than the second, third and fourth most common outage causes combined.

The frequency of power-related outages is partly due to power system complexity and a dependence on moving parts: a loss of grid power often reveals weaknesses elsewhere, including equipment maintenance regimens and staff training.

The most common power-related outages result from faults with uninterruptible power supply (UPS) systems, transfer switches and generators. Compared with other data center systems, such as cooling, power is more prone to fail completely, rather than operating at partial capacity.

Disconnecting from the power grid and running on emergency backup systems for extended periods (hours and days) is not a common practice in most geographies. Diesel generators remain the de facto standard for data center backup power and these systems have inherent limitations.

Any issues with utility power create risks across the data center electrical system. Major concerns include:

  • Load transfer risks between the grid and the generators. It is recommended that data center operators fully disconnect from the electrical grid once a year to test procedures, yet many choose not to do so because of operational concerns about testing in a production environment. This means that lurking failures in transfer switches and paralleling switchgear may go undetected, and operational mistakes remain undiscovered.
  • Fuel reserves and refueling. On-site fuel storage is constrained by cost and available space, and requires system maintenance as well as spill and leak management. Longer grid outages can exceed on-site fuel capacity, making operators dependent on outside vendors for fuel resupply (a simple runtime estimate is sketched after this list). These vendors are, in turn, dependent on the diesel supply chain, which may be disrupted during a wide-area grid outage because some diesel terminals may lack backup power to pump the fuel. Fuel delivery procedures may come under time pressure and may not be fully observed, with the potential to create accidents, such as contamination and spills.
  • Increased likelihood of engine failures. More frequent warm-up and cool-down cycles, as well as higher than expected runtime hours, accelerate generator wear. As many as 27% of data center operators have experienced a generator-related power outage in the past three years, according to the Uptime Institute Annual outages analysis 2023. Ongoing supply chain bottlenecks may mean that rental generators are in short supply while replacement parts or new engines may take months to arrive. This may force the data center to operate at a derated capacity, lower redundancy, or both.
  • Pollutant emissions. Many jurisdictions limit generator operating hours to cap emissions of nitrogen oxides, sulfur oxides, and particulate matter. For example, in the US, most diesel generators are limited to 100 hours of full-load operation per year and non-compliance can result in fines.
  • Battery system wear and failures. Frequent deep discharge and recharge cycles wear battery cells out faster, particularly lead-acid batteries. Lithium-ion chemistries are not immune either: discharges create thermal stress for lithium-ion cells when currents are high. Temperatures can also spike as cells approach the end of their capacity, which increases the chance of a thermal event resulting from inherent manufacturing imperfections. Often, the loss of critical load caused by a failure in an uninterruptible power supply system is due to batteries not being monitored closely enough by experienced technicians.
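
The sketch below gives a rough planning estimate of generator runtime from on-site fuel reserves. It assumes an indicative full-load consumption of about 0.3 liters per kWh of generator output; actual figures vary by engine, load and conditions, so this is a planning aid rather than an engineering calculation.

```python
# Rough runtime estimate for on-site diesel reserves. The consumption rate of
# ~0.3 liters per kWh is an indicative assumption, not a measured figure.
def runtime_hours(usable_fuel_liters: float, critical_load_kw: float,
                  liters_per_kwh: float = 0.3) -> float:
    burn_rate_lph = critical_load_kw * liters_per_kwh  # liters per hour
    return usable_fuel_liters / burn_rate_lph

# Example: 40,000 liters of usable fuel supporting a 2,000 kW load
print(f"{runtime_hours(40_000, 2_000):.1f} hours")  # ~66.7 hours
```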

It will get worse before it gets better

The likely worst-case scenario facing US and European data center operators in 2023 and beyond will consist primarily of load-shedding requests, brownouts and 2- to 4-hour, controlled rolling blackouts as opposed to widespread, long-lasting grid outages.

Ultimately, it is not possible to predict with any accuracy how much margin electrical grids will have in reserve beyond a few weeks. Extreme weather events, in particular, are exceedingly difficult to model. Other unexpected events — such as power station maintenance, transmission failures and geopolitical developments that affect energy supply — might contribute to a deterioration of grid reliability.

Operators can take precautions to prepare for rolling blackouts, including developing a closer relationship with their energy provider. These steps are well understood, but evidence shows they are not always followed.

Considering historically low fuel reserves and long lead times for replacement components, all measures are best undertaken in advance. Data center operators should reach out to suppliers to confirm their capability to deliver fuel and their outlook for future supply — and determine whether the supplier is equipped to sell fuel in the event of a grid outage. Based on the response, operators should assess the need for additional on-site fuel storage or consider contracting additional vendors for their resupplies.

Data center operators should also test their backup power systems regularly. The so-called pull-the-plug test, which checks the entire backup system, is recommended annually. A building load test should be performed monthly, which involves less risk than pull-the-plug testing and checks most of the backup power system. This test may not run long enough to test certain components, such as fuel pumps and level switches, and these should be tested manually. Additionally, data center operators should test the condition of their stored fuel, and filter or treat it as necessary.

The challenges to grid reliability vary by location, but in general Uptime Intelligence sees evidence that the risk of rolling blackouts at times of peak demand is escalating. Operators can mitigate risk to their data centers by testing their backup power system and arranging for additional fuel suppliers as needed. They should also revisit their emergency procedures — even those with a very low chance of occurrence.


Jacqueline Davis, Research Analyst

Lenny Simon, Research Associate

Daniel Bizo, Research Director

Max Smolaks, Research Analyst