Cloud resiliency: plan to lose control of your planes

Cloud providers divide the technologies that underpin their services into two “planes”, each with a different architecture and availability goal. The control plane manages resources in the cloud; the data plane runs the cloud buyer’s application.

In this Update, Uptime Institute Intelligence presents research that shows control planes have poorer availability than data planes. This presents a risk to applications built using cloud-native architectures, which rely on the control plane to scale during periods of intense demand. We show how overprovisioning capacity is the primary way to reduce this risk. The downside is an increase in costs.

Data and control planes

An availability design goal is an unverified claim of service availability that is neither guaranteed by the cloud provider nor independently confirmed. Amazon Web Services (AWS), for example, states 99.99% and 99.95% availability design goals for many of its services’ data planes and control planes. Service level agreements (SLAs), which refund customers a proportion of their expenditure when resources are not available, are often based on these design goals.
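
As a rough guide, an availability figure translates directly into a downtime budget. The short Python sketch below (an illustration only, assuming a 30-day month) converts the design goals quoted above into minutes of allowable monthly downtime.

  def monthly_downtime_minutes(availability_pct, days_in_month=30):
      """Downtime allowed per month at a given availability percentage."""
      minutes_in_month = days_in_month * 24 * 60
      return minutes_in_month * (1 - availability_pct / 100)

  print(round(monthly_downtime_minutes(99.99), 1))  # data plane goal: ~4.3 minutes
  print(round(monthly_downtime_minutes(99.95), 1))  # control plane goal: ~21.6 minutes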

Design goals and SLAs differ between control and data planes because each plane performs different tasks using a different underlying architecture. Control plane availability refers to the availability of the mechanisms the provider uses to manage and control services, such as:

  • Creating a virtual machine or other resource allocation on a physical server.
  • Provisioning a virtual network interface on that resource so it can be accessed over the network.
  • Installing security rules on the resource and setting access controls.
  • Configuring the resource with custom settings.
  • Hosting an application programming interface (API) endpoint so that users and code can programmatically manage resources.
  • Offering a management graphical user interface for operations teams to administer their cloud estates.

Data plane availability refers to the availability of the mechanisms that execute a service, such as:

  • Routing packets to and from resources.
  • Writing to and reading from disks.
  • Executing application instructions on the server.

In practice, such separation means a control plane could be unavailable, preventing services from being created or turned off, while the data plane (and, therefore, the application) continues to operate.
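
The distinction is visible in code. The minimal sketch below, which assumes the AWS SDK for Python (boto3), valid credentials and a placeholder machine image ID, issues a control plane request: if the control plane is unavailable the call fails, but virtual machines that are already running (the data plane) keep serving traffic.

  # Minimal sketch of a control plane call using the AWS SDK for Python (boto3).
  # The region, AMI ID and instance type are placeholders, not recommendations.
  import boto3
  from botocore.exceptions import BotoCoreError, ClientError

  ec2 = boto3.client("ec2", region_name="eu-west-1")

  try:
      # Creating a virtual machine is a control plane operation.
      ec2.run_instances(ImageId="ami-0123456789abcdef0",
                        InstanceType="t3.micro",
                        MinCount=1, MaxCount=1)
  except (BotoCoreError, ClientError) as err:
      # A control plane outage surfaces here: the request cannot be served,
      # yet instances that are already running continue to handle traffic.
      print(f"Control plane request failed: {err}")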

Data plane and control plane issues can impact business, but a data plane problem is usually considered more significant as it would immediately affect customers already using the application. The precise impact of a data plane or control plane outage depends on the application architecture and the type of problem.

Measuring plane availability

Providers use the term “region” to describe a geographical area that contains a collection of independent availability zones (AZs), which are logical representations of data center facilities. A country may have many regions, and each region typically has two or three AZs. Cloud providers state that users must architect their applications to be resilient by distributing resources across AZs and/or regions.

To compare the respective availability of resilient cloud architectures (see Comparative availabilities of resilient cloud architectures), Uptime Institute used historical cloud provider status updates from AWS, Google Cloud and Microsoft Azure to determine historical availabilities for a simple load balancer application deployed in three architectures:

  • Two virtual machines deployed in the same AZ (“single-zone”).
  • Two virtual machines deployed in different AZs in the same region (“dual-zone”).
  • Two virtual machines deployed in AZs in two different regions (“dual-region”).

Figure 1 shows the worst-case availabilities (i.e., the worst availability of all regions and zones analyzed) for the control planes and data planes in these architectures.

diagram: Worst-case historical availabilities by architecture
Figure 1. Worst-case historical availabilities by architecture

Unsurprisingly, even in the worst-case scenario, a dual-region architecture is the most resilient, followed by a dual-zone architecture.

The data plane has a significantly higher availability than the control plane. In the dual region category, the data plane had an availability of 99.96%, equivalent to 20 minutes of monthly downtime. The control plane had an availability of 99.80%, equal to 80 minutes of downtime — four times that of the data plane. This difference is to be expected considering the different design goals and SLAs associated with control and data planes. Our research found that control plane outages do not typically happen at the same time as data plane outages.

However, availability in the control plane is far more consistent than in the data plane and isn’t greatly affected by the choice of architecture. This consistency is because the cloud provider manages availability and resiliency — the control plane effectively acts as a platform as a service (PaaS). Even if an organization hosts an application in a single zone, the APIs and the management interface used to administer that application are managed by the cloud provider and hosted across multiple zones and regions.

The warning for cloud customers here is that the control plane is more likely to be a point of failure than the application, and little can be done to make the control plane more resilient.

Assessing the risk

The crucial question organizations should seek to answer during risk assessments is, “What happens if we cannot add or remove resources to our application?”

For static applications that can’t scale, a control plane outage is often more of an inconvenience than a business-critical problem. Some maintenance tasks may be delayed until the outage is over, but the application will continue to run normally.

For cloud-native scalable applications, the risk is greater. A cloud-native application should be able to scale up and down dynamically, depending on demand, which involves creating (or terminating) a resource and configuring it to work with the application that is currently executing.

Scaling capacity in response to application demand is typically done using one or a combination of three methods:

  • Automatically, using cloud provider autoscaling services that detect when a metric (e.g., CPU utilization) breaches a threshold.
  • Automatically from an application, where code communicates with a cloud provider’s API.
  • Manually, by an administrator using a management portal or API.

The management portal, the API, the autoscaler service, and the creation or termination of the resource may all be part of the control plane.

If the application has been scaled back to cut costs but then faces increased demand, any of these mechanisms can increase capacity. If the control plane has failed, however, the application will continue to run but cannot accommodate the additional demand. In this scenario, the application may continue to service existing end users satisfactorily, but additional end users may have to wait. In the worst case, the application might fail catastrophically because of oversubscription.
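
This dependency can be illustrated with a provider-agnostic sketch. The callables below are hypothetical stand-ins for a monitoring feed and a cloud provider’s control plane API; the point is that the scaling decision is made locally, but acting on it always requires a working control plane.

  # Provider-agnostic sketch of threshold-based autoscaling (illustrative only).
  SCALE_OUT_THRESHOLD = 75  # percent CPU utilization

  def autoscale(get_average_cpu, add_instances, instances_to_add=1):
      if get_average_cpu() > SCALE_OUT_THRESHOLD:
          try:
              add_instances(instances_to_add)  # this is the control plane call
          except ConnectionError:
              # The running instances (data plane) are unaffected, but no new
              # capacity arrives, so additional demand has to wait.
              print("Control plane unavailable: running at current capacity")

  def unavailable_api(count):
      raise ConnectionError("control plane API unreachable")

  autoscale(get_average_cpu=lambda: 82, add_instances=unavailable_api)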

Addressing the risk

The only practical mitigation for this risk is to provision a buffer of capacity on the services that support the application. If a period of high demand requires more resources, the application can use this buffer immediately, even if the control plane is out of service.

Having a buffer is a sensible approach for other reasons, too. If other issues delay the provisioning of capacity (for example, an AZ failure or a regional capacity shortage caused by increased demand), a buffer prevents a lag between end-user demand and resource availability.

The question is how much buffer capacity to provision. A resilient architecture often automatically includes suitable buffer capacity because the application is distributed across AZs or regions. If an application is split across, say, two AZs, it is probably designed with enough buffer on each virtual machine to enable the application to continue seamlessly if one of the zones is lost. Each virtual machine has a 50% load, with the unused 50% available if the other AZ suffers an outage. In this case, no new resources can be added if the control plane fails, but the application is already operating at half load and has excess capacity available. Of course, if the other AZ fails at the same time as a control plane outage, the remaining zone will face a capacity squeeze.
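
The same reasoning generalizes to any number of zones. As a rule of thumb (sketched below), each zone should run at no more than (N-1)/N of its capacity, so that the surviving zones can absorb the load of one failed zone without any control plane involvement.

  def max_utilization_per_zone(zones):
      """Highest steady-state load per zone that still survives one zone loss."""
      return (zones - 1) / zones

  print(max_utilization_per_zone(2))            # 0.5: each zone runs at 50% load
  print(round(max_utilization_per_zone(3), 2))  # 0.67: each zone runs at ~67% load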

Similarly, databases may be deployed across multiple AZs using an active-active or active-failover configuration. If an AZ fails, the other database will automatically be available to support transactions, regardless of the control plane functionality.

Organizations need to be aware that this is their risk to manage. The application must be architected for control and data plane failures. Organizations also need to be mindful that there is a cost implication. As detailed in our Comparative availabilities of resilient cloud architectures report, deploying across multiple zones increases resiliency, but also increases costs by 43%. Carbon emissions, too, rise by 83%. Similarly, duplicating databases across zones increases availability — but at a price.

Summary

Organizations must consider how their applications will perform if a control plane failure prevents them from adding, terminating or administering cloud resources. The effect on static applications will be minimal because such applications do not scale up and down in line with changing demand. However, the impact on cloud-native applications may be substantial because, without the capacity to scale, the application may struggle under additional demand.

The simplest solution is to provide a buffer of unused capacity to support unexpected demand. If additional capacity can’t be added because of a control plane outage, the buffer can meet additional demand in the interim.

The exact size of the buffer depends on the application and its typical demand pattern. However, most applications will already have a buffer built in so they can respond immediately to AZ or region outages. Often, this buffer will be enough to manage control plane failure risks.

Organizations face a tricky balancing act. Some end users might not get the performance they expect if the buffer is too small. If the buffer is too large, the organization pays for capacity it doesn’t need.

Server efficiency increases again — but so do the caveats

Early in 2022, Uptime Intelligence observed that the return of Moore’s law to the data center (or, more accurately, of the performance and energy efficiency gains associated with it) would come with major caveats (see Moore’s law resumes — but not for all). The potential of next-generation server technologies to improve energy efficiency, Uptime Intelligence surmised at the time, would be unlocked by more advanced enterprise IT teams and at-scale operators that can concentrate workloads for high utilization and make use of new hardware features.

Conversely, when servers built around the latest Intel and AMD chips do not have enough work, energy performance can deteriorate because of higher idle server power. Performance data following the recent launch of new server processors confirms this, and suggests that traditional assumptions about server refresh cycles need a rethink.

Server efficiency back on track

First, the good news: new processor benchmarks confirm that best-in-class server energy efficiency is back in line with historical trends, following a long hiatus in the late 2010s.

The latest server chips from AMD (codenamed Genoa) deliver a major jump in core density (they integrate up to 192 cores in a standard dual-socket system) and energy efficiency potential. This is largely due to a step change in manufacturing technology by contract chipmaker Taiwan Semiconductor Manufacturing Company (TSMC). Compared with server technology from four to five years ago, this new crop of chips offers four times the performance at more than twice the energy efficiency, as measured by the Standard Performance Evaluation Corporation (SPEC). The SPEC Power benchmark simulates Java-based transaction processing logic to exercise processors and the memory subsystem, indicating compute performance and efficiency characteristics. Over the past 10 years, mainstream server technology has become six times more energy efficient based on this metric (Figure 1).

diagram: Best-in-class server energy performance (long-term trend)
Figure 1. Best-in-class server energy performance (long-term trend)

This brings server energy performance back onto its original track before it was derailed by Intel’s protracted struggle to develop its then next-generation manufacturing technology. Although Intel continues to dominate server processor shipments due to its high manufacturing capacity, it is still fighting to recover from the crises it created nearly a decade ago (see Data center efficiency stalls as Intel struggles to deliver).

The bad news is that server efficiency is also, once again, following historical trends. Although the jump in performance with the latest generation of AMD server processors is sizeable, the long-term slowdown in performance improvements continues. The latest doubling of server performance took about five years; in the first half of the 2010s it took about two years, and between 2000 and 2010 it took far less than two years. Advances in chip manufacturing have slowed for both technical and economic reasons: the science and engineering challenges behind semiconductor manufacturing are extremely difficult and the costs are huge.
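
These doubling times imply very different annual improvement rates. The short calculation below, based only on the figures above, compares the compound annual rate implied by a doubling every two years with one every five years.

  def annual_rate_from_doubling(years_to_double):
      """Compound annual improvement implied by a given doubling time."""
      return 2 ** (1 / years_to_double) - 1

  print(f"{annual_rate_from_doubling(2):.0%}")  # ~41% a year (early 2010s pace)
  print(f"{annual_rate_from_doubling(5):.0%}")  # ~15% a year (the latest doubling)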

An even bigger reason for the drop in development pace, however, is architectural: diminishing returns from design innovation. In the past, the addition of more cores, on-chip integration of memory controllers and peripheral device controllers all radically improved chip performance and overall system efficiency. Then server engineers increased their focus on energy performance, resulting in more efficient power supplies and other power electronics, as well as energy-optimized cooling via variable speed fans and better airflow. Together, these changes have eliminated much of the energy waste in servers.

One big-ticket item in server efficiency remains: tackling the performance and energy problems of memory (DRAM). Current memory chip technologies don’t score well on either metric — DRAM latency worsens with every generation, while energy efficiency (per bit) barely improves.

Server technology development will continue apace. Competition between Intel and AMD is energizing the market as they vie for the business of large IT buyers that are typically attracted to the economics of performant servers carrying ever larger software payloads. Energy performance is a defining component of this. However, more intense competition is unlikely to overcome the technical boundaries highlighted above. While generic efficiency gains (lacking software-level optimization) from future server platforms will continue, the average pace of improvement is likely to slow further.

Based on the long-term trajectory, the next doubling in server performance will take five to six years, boosted by more processor cores per server, but partially offset by the growing consumption of memory power. Failing a technological breakthrough in semiconductor manufacturing, the past rates of server energy performance gains will not return in the foreseeable future.

Strong efficiency for heavy workloads

As well as the slowdown in overall efficiency gains, the profile of energy performance improvements has shifted from benefiting virtually all use cases towards favoring, sometimes exclusively, larger workloads. This trend began several years ago, but AMD’s latest generation of products offers a dramatic demonstration of its effect.

Based on the SPEC Power database, new processors from AMD offer vast processing capacity headroom (more than 65%) compared with previous generations (AMD’s 2019 and 2021 generations, codenamed Rome and Milan, respectively). However, these new processors use considerably more server power, which mutes overall energy performance benefits (Figure 2). The most significant opportunities offered by AMD’s Genoa processor are for aggressive workload consolidation (footprint compression) and running scale-out workloads such as large database engines, analytics systems and high-performance computing (HPC) applications.

diagram: Performance and power characteristics of recent AMD and Intel server generations
Figure 2. Performance and power characteristics of recent AMD and Intel server generations

If the extra performance potential is not used, however, there is little reason to upgrade to the latest technology. SPEC Power data indicates that the energy performance of previous generations of AMD-based servers is often as good — or better — when running relatively small workloads (for example, where the application’s scaling is limited, or further consolidation is not technically practical or economical).

Figure 2 also demonstrates the size of the challenge faced by Intel: at any performance level, the current (Sapphire Rapids) and previous (Ice Lake) generations of Intel-based systems use more power — much more when highly loaded. The exception to this rule, not captured by SPEC Power, is when the performance-critical code paths of specific software are heavily optimized for an Intel platform, such as select technical, supercomputing and AI (deep neural network) applications that often take advantage of hardware acceleration features. In these specific cases, Intel’s latest products can often close the gap or even exceed the energy performance of AMD’s processors.

Servers that age well

The fact that the latest generation of processors from Intel and AMD is not always more efficient than the previous generation is less of a problem. The greater problem is that server technology platforms released in the past two years do not offer automatic efficiency improvements over the enormous installed base of legacy servers, which are often more than five or six years old.

Figure 3 highlights the performance and power region within which, without major workload consolidation (2:1 or higher), the business case to upgrade old servers (blue crosses) remains dubious. This region is not fixed and will vary by application, but it approximately equals the real-world performance and power envelope of most Intel-based servers released before 2020. Typically, only the best performing, highly utilized legacy servers will warrant an upgrade without further consolidation.

diagram: Many older servers remain more efficient at running lighter loads
Figure 3. Many older servers remain more efficient at running lighter loads

For many organizations, the processing demand in their applications is not rising fast enough to benefit from the performance and energy efficiency improvements the newer servers offer. There are many applications that, by today’s standards, only lightly exercise server resources, and that are not fully optimized for multi-core systems. Average utilization levels are often low (between 10% and 20%), because many servers are sized for expected peak demand, but spend most of their time in idle or reduced performance states.

Counterintuitively, perhaps, servers built with Intel’s previous processor generations in the second half of the 2010s (largely since 2017, using Intel’s 14-nanometer technology) tend to show better power economy when used lightly than their present-day counterparts. For many applications this means that, although these older servers use more power when worked hard, they conserve even more in periods of low demand.
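
A simple way to see why is to compare power draw at a given utilization, interpolating linearly between idle and full-load power. The figures below are illustrative assumptions, not SPEC Power results, and the model is deliberately crude, but it reproduces the pattern: a newer server with higher idle power can draw more than an older one when both serve the same small workload.

  def power_at_utilization(idle_w, max_w, utilization):
      """Crude linear model of server power draw between idle and full load."""
      return idle_w + (max_w - idle_w) * utilization

  # Illustrative figures only (not measured data). The newer server is assumed
  # to be four times as performant, so 5% utilization on it delivers roughly
  # the same throughput as 20% on the older machine.
  old_server_w = power_at_utilization(idle_w=50, max_w=300, utilization=0.20)
  new_server_w = power_at_utilization(idle_w=120, max_w=500, utilization=0.05)
  print(old_server_w, new_server_w)  # 100.0 W versus 139.0 W for the same work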

Several factors may undermine the business case for the consolidation of workloads onto fewer, higher performance systems: these include the costs and risks of migration, threats to infrastructure resiliency, and incompatibility (if the software stack is not tested or certified for the new platform). The latest server technologies may offer major efficiency gains to at-scale IT services providers, AI developers and HPC shops, but for enterprises the benefits will be fewer and harder to achieve.

The regulatory drive for data center sustainability will likely direct more attention to IT’s role soon — the lead example being the proposed recast of the EU’s Energy Efficiency Directive. Regulators, consultants and IT infrastructure owners will want proxy metrics for IT energy performance, and the age of a server is an intuitive choice. The data strongly indicates that this form of simplistic ageism is misplaced.

Data shows the cloud goes where the money is

Hyperscale cloud providers have opened numerous operating regions in all corners of the world over the past decade. The three most prominent — Amazon Web Services (AWS), Google Cloud and Microsoft Azure — now have 105 distinct regions (excluding government and edge locations) for customers to choose from to locate their applications and data. Over the next year, this will grow to 130 regions. Other large cloud providers such as IBM, Oracle and Alibaba are also expanding globally, and this trend is likely to continue.

Each region requires enormous investments in data centers, IT, software, people, and networks. The opening of a region may both develop and disrupt the digital infrastructure of the countries involved. This Update, part of Uptime Intelligence’s series of publications explaining and examining the development of the cloud, shows how investment can be tracked — and, to a degree, predicted — by looking at the size of the markets involved.

Providers use the term “region” to describe a geographical area containing a collection of independent availability zones (AZs), which are logical representations of data center facilities. A country may have many regions, with each region typically having two or three AZs. The three leading hyperscalers’ estates include more than 300 hyperscale AZs and many more data centers (including both hyperscale-owned and hyperscale-leased facilities) in operation today. Developers use AZs to build resilient applications in a single region.

The primary reason providers offer a range of regions is latency. In general, no matter how good the network infrastructure is, the further the end user is from the cloud application, the greater the delay and the poorer the end-user experience (especially on latency-sensitive applications, such as interactive gaming). Another important driver is that some cloud buyers are required to keep applications and user data in data centers in a specific jurisdiction for compliance, regulatory or governance reasons.

Figure 1 shows how many of the three largest cloud providers have regions in each country.

diagram: Count of the three largest cloud providers (AWS, Google, Microsoft) operating a cloud region in a country: current and planned
Figure 1. Count of the three largest cloud providers (AWS, Google, Microsoft) operating a cloud region in a country: current and planned

The economics of hyperscale public cloud depend on scale. Implementing a cloud region of multiple AZs (and, therefore, data centers) requires substantial investment, even if it relies on colocation sites. Cloud providers need to expect a sufficient return to justify such an investment.

To achieve this return on investment, a geographical region must have the telecommunications infrastructure to support the entire cloud region. In practical terms, the location must also be able to support the data center itself, with reliable power, telecommunications, security and skills.

Considering these requirements, cloud providers focus their expansion plans on economies with the largest gross domestic product (GDP). GDP measures economic activity but, more generally, is an indicator of the health of an economy. Typically, countries with a high GDP have broad and capable telecommunications infrastructure, high technology skills, robust legal and contractual frameworks, and the supporting infrastructure and supply chains required for data center implementation and operation. Furthermore, organizations in countries with higher GDPs have greater spending power and access to borrowing. In other words, they have the cash to spend on cloud applications to give the provider a high enough return on investment.

The 17 countries where all three hyperscalers currently operate cloud regions, or plan to, account for 56% of global GDP. The countries where at least one hyperscaler operates or intends to operate account for 87% of global GDP across just 40 countries (for comparison, the United Nations has 193 member states).

Figure 2 plots GDP against the number of hyperscalers present in a country (the US and China are not shown because their GDPs are significant outliers). The figure shows a trend: the greater a country’s GDP, the more likely a hyperscaler presence. Three countries buck this trend: Mexico, Turkey and Russia.

diagram: GDP against hyperscaler presence (China and US removed because of outlying GDP)
Figure 2. GDP against hyperscaler presence (China and US removed because of outlying GDP)

Observations

  • The US is due to grow to 24 hyperscaler cloud regions across 13 states (excluding the US government), which is substantially more than any other country. This widespread presence is because Google, Microsoft and AWS are US companies with significant experience of operating in the country. The US is the single most influential and competitive market for digital services, with a shared official language, an abundance of available land, a business-friendly environment, and relatively few differences in regulatory requirements between local authorities.
  • Despite China’s vast GDP, only two of the big three US hyperscalers operate there today: AWS and Microsoft Azure. However, unlike all other cloud regions, the AWS and Microsoft regions in China are outsourced to Chinese companies to comply with local data protection requirements. AWS outsources its cloud regions to Sinnet and Ningxia Western Cloud Data Technology (NWCD); Azure outsources its cloud to 21Vianet. Notably, China’s cloud regions are totally isolated from all non-China cloud regions regarding connectivity, billing and governance. Google considered opening a China region in 2018 but abandoned the idea in 2020, reportedly in part because of a reluctance to operate through a partner. China has its own hyperscaler clouds: Alibaba Cloud, Huawei Cloud, Tencent Cloud and Baidu AI Cloud. These hyperscalers have implemented regions beyond China, across the wider Asia-Pacific, Europe, the US and the Middle East, primarily so that China-based organizations can reach other markets.
  • Mexico has a high GDP but only one cloud region, which Microsoft Azure is currently developing. Mexico’s proximity to the US and good international telecommunications infrastructure mean that applications targeting Mexican users do not necessarily suffer significant latency. The success of the Mexico region will depend on the eventual price of cloud resources there. If Mexico does not offer substantially lower costs and higher revenues than nearby US regions (for example, San Antonio in Texas, where Microsoft Azure operates), and if customers are not legally required to keep data local, Mexican users could be served from the US, despite minor latency effects and added network bandwidth costs. Uptime thinks other hyperscale cloud providers are unlikely to create new regions in Mexico in the next few years for this reason.
  • Today, no multinational hyperscaler cloud provider offers a Russia region. This is unlikely to change soon because of the sanctions imposed by a raft of countries since Russia invaded Ukraine. Cloud providers have historically steered clear of Russia because of geopolitical tensions with the US and Europe. Even before the Ukraine invasion, AWS had a policy of not working with the Russian government. Other hyperscalers, such as IBM, Oracle Cloud and China’s Alibaba, are also absent from Russia. The wider Commonwealth of Independent States likewise has no hyperscaler presence. Yandex is Russia’s most-used search engine and the country’s key cloud provider.
  • A total of 16 European countries have either a current or planned hyperscaler presence, and together they represent 70% of the continent’s GDP. Although latency is a driver, data protection is a more significant factor. European countries tend to have stricter data protection requirements than the rest of the world, which drives the need to keep data within a jurisdiction.
  • Turkey has a high GDP but no hyperscaler presence today, perhaps because the country can be served with low latency by nearby EU regions. Data governance concerns may also be a barrier to investment. However, Turkey may be a target for future cloud provider investment.
  • Today, the three hyperscalers are only present in one African country, South Africa — even though Egypt and Nigeria have larger GDPs. Many applications aimed at a North African audience may be suitably located in Southern Europe with minimal latency. However, Nigeria could be a potential target for a future cloud. It has high GDP, good connectivity through several submarine cables, and would appeal to the central and western African markets.
  • South American cloud regions were previously restricted to Brazil, but Google now has a Chilean region and Azure has one in the works. Argentina and Chile have relatively high GDPs, so it would not be surprising if AWS followed suit.

Conclusions

As discussed in Cloud scalability and resiliency from first principles, building applications across different cloud providers is challenging and costly. As a result, customers will seek cloud providers that operate in all the regions they want to reach. To meet this need, providers are following the money. Higher GDP generally equates to more resilient, stable economies, where companies are likely to invest and infrastructure is readily available. The current exception is Russia. High GDP countries yet to have a cloud presence include Turkey and Nigeria.

In practice, most organizations will be able to meet most of their international needs using hyperscaler cloud infrastructure. However, they need to carefully consider where they may want to host applications in the future. Their current provider may not support a target location, but migrating to a new provider that does is often not feasible. (A future Uptime Intelligence update will further explore specific gaps in cloud provider coverage.)

There is an alternative to building data centers or using colocation providers in regions without hyperscalers: organizations seeking new markets could consider where hyperscaler cloud providers may expand next. Rather than directly tracking market demand, software vendors may launch new services when a suitable region is brought online. The cost of duplicating an existing cloud application into a new region is small (especially compared with a new data center build or multi-cloud development). Sales and technical support can often be provided remotely without an expensive in-country presence.

Similarly, colocation providers can follow the money and consider the cloud providers’ expansion plans. A location such as Nigeria, with a high GDP, good telecommunications infrastructure and no hyperscaler presence, may be ideal for data center buildouts that anticipate future hyperscaler requirements.

Colocation providers also have opportunities outside the GDP leaders. Many organizations still need local data centers for compliance or regulatory reasons, or for peace of mind, even if a hyperscaler data center is relatively close in terms of latency. In the Uptime Institute Data Center Capacity Trends Survey 2022, 44% of 65 respondents said they would use their own data center if their preferred public cloud provider was unavailable in a country, and 29% said they would use a colocation provider.

Cloud providers increasingly offer private cloud appliances that can be installed in a customer’s data center and connected to the public cloud for a hybrid deployment (e.g., AWS Outposts, VMware, Microsoft Azure Stack). Colocation providers should consider whether partnerships with hyperscaler cloud providers can support hybrid cloud implementations outside the locations where hyperscalers operate.

Cloud providers have no limits in terms of country or market. If they see an opportunity to make money, they will take it. But they need to see a return on their investment. Such returns are more likely where demand is high (often where GDP is high) and infrastructure is sufficient.

Cooling to play a more active role in IT performance and efficiency

Data center operators and IT tenants have traditionally adopted a binary view of cooling performance: it either meets service level commitments, or it does not. The relationship is also coldly transactional: as long as sufficient volumes of air of the right temperature and quality (in accordance with service-level agreements that typically follow ASHRAE’s guidance) reach the IT rack, the data center facility’s mission has been accomplished. What happens after that point with IT cooling, and how it affects IT hardware, is not the facilities team’s business.

This practice was born in an era when the power density of IT hardware was much lower, and when server processors still had a fixed performance envelope. Processors ran at a nominal frequency, defined at the time of manufacturing, and that frequency was guaranteed under any workload as long as sufficient cooling was available.

Chipmakers guide IT system builders and customers to select the right components (heat sinks, fans) via processor thermal specifications. Every processor is assigned a power rating for the amount of heat its cooling system must be able to handle at the corresponding temperature limit. This is not theoretical maximum power but rather the maximum that can realistically be sustained (seconds or more) running real-world software. This maximum is called thermal design power (TDP).

The majority of software applications don’t stress the processor enough to get close to the TDP, even if they use 100% of the processor’s time — typically only high-performance computing code makes processors work that hard. With frequencies fixed, this results in power consumption (and thermal power) that is considerably below the TDP rating. Since the early 2000s, nominal processor speeds have tended to be limited by power rather than the maximum speed of circuitry, so for most applications there is untapped performance potential within the TDP envelope.

This gap is wider still in multicore processors when the software cannot benefit from all the cores present. This results in an even larger portion of the power budget not being used to increase application performance. The higher the core count, the bigger this gap can be unless the workload is highly multithreaded.

Processors looking for opportunities

Most server processors and accelerators that came to market in the past decade have mechanisms to address this (otherwise ever-growing) imbalance. Although implementation details differ between chipmakers (Intel, AMD, NVIDIA, IBM), they all dynamically deploy available power budget to maximize performance when and where it is needed most.

This balancing happens in two major ways: frequency scaling and management of power allocation to cores. When a modern server processor enters a phase of high utilization but remains under its thermal specification, it starts to increase the supply voltage and then the frequency in incremental steps. It continues stepping up until it reaches any one of the preset limits: frequency, current, power or temperature — whichever comes first.

If the workload is not evenly distributed across cores, or leaves some cores unused, the processor allocates unused power to highly utilized cores (if power was the limiting factor for their performance) to enable them to scale their frequencies even higher. The major beneficiary of independent core scaling is the vast repository of single- or lightly threaded software, but multithreaded applications also benefit where they struggle with Amdahl’s law (when the application is hindered by parts of the code that are not parallelized, so that overall performance depends largely on how fast a core can work through those segments).
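
A highly simplified model of this behavior is sketched below. It is conceptual only, with assumed step sizes and power costs rather than any vendor’s actual algorithm: busy cores step their frequency up until the package reaches a frequency, power or temperature limit, and the budget left unused by idle cores is what allows them to climb higher.

  # Conceptual sketch of opportunistic frequency scaling; not vendor logic.
  def boosted_frequency(package_power_w, package_limit_w, base_ghz, max_ghz,
                        step_ghz=0.1, step_cost_w=5.0, temperature_ok=lambda: True):
      """Step busy cores up from base frequency until a preset limit is hit.

      package_power_w is the current draw: idle or lightly loaded cores lower
      it, which leaves more headroom for the busy cores to climb."""
      freq = base_ghz
      while (freq + step_ghz <= max_ghz
             and package_power_w + step_cost_w <= package_limit_w
             and temperature_ok()):
          freq += step_ghz                # raise voltage, then frequency, one step
          package_power_w += step_cost_w  # each step consumes extra power budget
      return round(freq, 2)

  # A lightly loaded package (more unused budget) boosts further than a busy one.
  print(boosted_frequency(package_power_w=150, package_limit_w=280,
                          base_ghz=2.4, max_ghz=3.7))  # 3.6 GHz
  print(boosted_frequency(package_power_w=250, package_limit_w=280,
                          base_ghz=2.4, max_ghz=3.7))  # 3.0 GHz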

This opportunistic behavior of modern processors means the quality of cooling, considering both supply of cold air and its distribution within the server, is not binary anymore. Considerably better cooling increases the performance envelope of the processor, a phenomenon that supercomputing vendors and users have been exploring for years. It also tends to improve overall efficiency because more work is done for the energy used.

Performance is best served cold

Better cooling unlocks performance and efficiency in two major ways:

  • The processor operates at lower temperatures (everything else being equal).
  • It can operate at higher thermal power levels.

The lowering of operational temperature through improved cooling brings many performance benefits such as enabling individual processor cores to run at elevated speeds for longer without hitting their temperature limit.

Another, likely sizeable, benefit lies in reducing static power in the silicon. Static power is power lost to leakage currents that perform no useful work, yet keep flowing through transistor gates even when they are in the “off” state. Static power was not an issue 25 years ago, but has become more difficult to suppress as transistor structures have become smaller, and their insulation properties correspondingly worse. High-performance logic designs, such as those in server processors, are particularly burdened by static power because they integrate a large number of fast-switching transistors.

Semiconductor technology engineers and chip designers have adopted new materials and sophisticated power-saving techniques to reduce leakage currents. However, the issue persists. Although chipmakers do not reveal the static power consumption of their products, it is likely to account for a considerable share of the processor’s power budget, probably a low double-digit percentage.

Various academic research papers have shown that static leakage currents depend on the temperature of silicon, but the exact profile of that correlation varies greatly across chip manufacturing technologies — such details remain hidden from the public eye.
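
A first-order approximation often used in academic work treats leakage as growing roughly exponentially with silicon temperature. The sketch below is purely illustrative, with assumed coefficients rather than published figures for any real processor, but it shows why lowering the silicon temperature by 20°C can recover a meaningful slice of the power budget.

  import math

  def static_power_w(temp_c, p_ref_w=40.0, t_ref_c=90.0, k_per_c=0.02):
      """Illustrative exponential leakage model: P = P_ref * exp(k * (T - T_ref)).

      p_ref_w, t_ref_c and k_per_c are assumed values for illustration only;
      real coefficients vary by manufacturing process and are not public."""
      return p_ref_w * math.exp(k_per_c * (temp_c - t_ref_c))

  print(round(static_power_w(90), 1))  # 40.0 W at the assumed reference temperature
  print(round(static_power_w(70), 1))  # ~26.8 W if cooling lowers the silicon by 20°C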

Upgraded air coolers can measurably improve application performance when the processor is thermally limited during periods of high load, though such a speed-up tends to be in the low single digits in percentage terms. This can be achieved by lowering inlet air temperatures or, more commonly, by upgrading the processors’ cooling to lower its thermal resistance. Examples include: adding larger heat sinks, optimized using computational fluid dynamics (CFD) and built from alloys with better thermal conductivity (e.g., copper-based alloys); using better thermal interface materials; and introducing more powerful fans to increase airflow. If combined with better facility air delivery and lower inlet temperatures, the speed-up is higher still.

No silver bullets, just liquid cooling

However, the markedly lower thermal resistance, and the consequently lower silicon temperature, that direct liquid cooling (DLC) brings makes a more pronounced difference. Compared with air coolers at the same temperature, DLC (cold plate and immersion) can free up more power by reducing the temperature-dependent component of static leakage currents.

There is an even bigger performance potential in the better thermal properties of liquid cooling: prolonging the time that server processors can spend in controlled power excursions above their TDP level, without hitting critical temperature limits. This behavior, now common in server processors, is designed to offer bursts of extra performance, and can result in a short-term (tens of seconds) heat load that is substantially higher than the rated cooling requirement.

Typically, excursions reach 15% to 25% above the TDP, which did not previously pose a major challenge. However, in the latest generation of products from AMD and Intel, this results in up to 400 watts (W) and 420 W, respectively, of sustained thermal power per processor — up from less than 250 W about five years ago.

Such high-power levels are not exclusive to processor models aimed at high-performance computing applications: a growing number of mainstream processor models intended for cloud, hosting and enterprise workload consolidation can have these demanding thermal requirements. The favorable economics of higher performance servers (including their energy efficiency across an array of applications) generates demand for powerful processors.

Although these TDPs and power excursion levels are still manageable with air using high-performance heat sinks (at the cost of rack density, because of the very large heat sinks and high fan power), peak performance levels will start to slip out of reach for standard air cooling in the coming years. Server processor development roadmaps call for even more powerful processor models, probably reaching 600 W in thermal excursion power by the mid-2020s.

As processor power escalates and temperature limits grow more restrictive, even DLC temperature choices will be a growing trade-off dilemma as data center and IT infrastructure operators try to balance capital costs, cooling performance, energy efficiency and sustainability credentials. Inevitably, the relationship between data center cooling, server performance and overall IT efficiency will demand more attention.

The effects of a failing power grid in South Africa

European countries narrowly avoided an energy crisis in the past winter months, as a shortfall in fossil fuel supplies from Russia threatened to destabilize power grids across the region. This elevated level of risk to the normally robust European grid has not been seen for decades.

A combination of unseasonably mild weather, energy saving initiatives and alternative gas supplies averted a full-blown energy crisis, at least for now, although business and home consumers are paying a heavy price through high energy bills. The potential risk to the grid forced European data center operators to re-evaluate both their power arrangements and their relationship with the grid. Even without an energy security crisis, power systems elsewhere are becoming less reliable, including some of the major grid regions in the US.

Most mission-critical data centers are designed not to depend on the availability of an electrical utility, but to benefit from its lower power costs. On-site power generation — usually provided by diesel engine generators — is the most common option to back up electricity supplies, because it is under the facility operator’s direct control.

A mission-critical design objective of power autonomy, however, does not shield data center operators from problems that affect utility power systems. The reliability of the grid affects:

  • The cost of powering the data center.
  • How much diesel to buy and store.
  • Maintenance schedules and costs.
  • Cascading risks to facility operations.

South Africa provides a case study in how grid instability affects data center operations. The country has emerged as a regional data center hub over the past decade (largely due to its economic and infrastructure head-start over other major African countries), despite experiencing its own energy crisis over the past 16 years.

A total of 11 major subsea network cables land in South Africa, and its telecommunications infrastructure is the most developed on the continent. Although it cannot match the capacity of other global data center hubs, South Africa’s data center market is highly active — and is expanding (including recent investments by global colocation providers Digital Realty and Equinix). Cloud vendors already present in South Africa include Amazon Web Services (AWS), Microsoft Azure, Huawei and Oracle, with Google Cloud joining soon. These organizations must contend with a notoriously unreliable grid.

Factors contributing to grid instability

Most of South Africa’s power grid is operated by state-owned Eskom, the largest producer of electricity in Africa. Years of under-investment in generation and transmission infrastructure have forced Eskom to impose periods of load-shedding — planned rolling blackouts based on a rotating schedule — since 2007.

Recent years have seen substation breakdowns, cost overruns, widespread theft of coal and diesel, industrial sabotage, multiple corruption scandals and a $5 billion government bail-out. Meanwhile, energy prices nearly tripled in real terms between 2007 and 2020.

In 2022, the crisis deepened, with more power outages than in any of the previous 15 years — nearly 300 load-shedding events, three times the previous record, set in 2020 (Figure 1). Customers are usually notified about upcoming disruption through the EskomSePush (ESP) app. Eskom’s load-shedding measures do not distinguish between commercial and residential properties.

diagram: Number of load-shedding instances initiated by Eskom from 2018 to 2022
Figure 1. Number of load-shedding instances initiated by Eskom from 2018 to 2022

Blackouts normally last for several hours, and there can be several a day. Eskom’s app recorded at least 3,212 hours of load-shedding across South Africa’s grid in 2022. For more than 83 hours, South Africa’s grid remained in “Stage 6”, which means the grid was in a power shortfall of at least 6,000 megawatts. A new record was set in late February 2023, when the grid entered “Stage 8” load-shedding for the first time. Eskom has, in the past, estimated that in “Stage 8”, an average South African could expect to be supplied with power for only 12 hours a day.

Reliance on diesel

In this environment, many businesses depend on diesel generators as a source of power — including data centers, hospitals, factories, water treatment facilities, shopping centers and bank branches. This increased demand for generator sets, spare parts and fuel has led to supply shortages.

Load-shedding in South Africa often affects road signs and traffic lights, which means fuel deliveries are usually late. In addition, trucks often have to queue for hours to load fuel from refineries. As a result, most local data center operators have two or three fuel supply contracts, and some are expanding their on-site storage tanks to provide fuel for several days (as opposed to the 12-24 hours typical in Europe and the US).

There is also the cost of fuel. The general rule in South Africa is that generating power on-site costs about seven to eight times more than buying utility power. With increased runtime hours on generators, this quickly becomes a substantial expense compared with utility energy bills.
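
The scale of that expense is easy to approximate. The figures in the sketch below are illustrative assumptions (a notional IT load, utility tariff and annual generator runtime), not measured costs, but they show how quickly a seven to eight times price multiple adds up.

  def generator_premium_usd(load_kw, run_hours, utility_usd_per_kwh,
                            onsite_multiple=7.5):
      """Extra cost of generating on-site instead of buying utility power."""
      energy_kwh = load_kw * run_hours
      return energy_kwh * utility_usd_per_kwh * (onsite_multiple - 1)

  # Illustrative assumptions: 1 MW of load, 500 hours of generator running a
  # year, and a notional utility tariff of $0.10/kWh.
  print(f"${generator_premium_usd(1000, 500, 0.10):,.0f}")  # $325,000 extra a year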

Running generators for prolonged periods accelerates wear and heightens the risk of mechanical failure, with the result that the generator units need to be serviced more often. As data center operations staff spend more time monitoring and servicing generators and fuel systems, other maintenance tasks are often deferred, which creates additional risks elsewhere in the facility.

To minimize the risk of downtime, some operators are adapting their facilities to accommodate temporary external generator set connections. This enables them to provision additional power capacity in four to five hours. One city, Johannesburg, has access to gas turbines as an alternative to diesel generators, but these are not available in other cities.

Even if the data center remains operational through frequent power cuts, its connectivity providers, which also rely on generators, may not. Mobile network towers, equipped with UPS systems and batteries, are frequently offline because they do not get enough time to recharge between load-shedding periods if there are several occurrences a day. MTN, one of the country’s largest network operators, had to deploy 2,000 generators to keep its towers online and is thought to be using more than 400,000 liters of fuel a month.

Frequent outages on the grid create another problem: cable theft. In one instance, a data center operator’s utility power did not return following a scheduled load-shedding period. The copper cables leading to the facility had been stolen; the thieves used the load-shedding announcements to work out when it was safe to remove them.

Lessons for operators in Europe and beyond

  • Frequent grid failures increase the cost of digital services and alter the terms of service level agreements.
  • Grid issues may take years to emerge. The data center industry needs to be vigilant and respond to early warning signs.
  • Data center operators must work with utilities, regulators and industry associations to shape the development of grids that power their facilities.
  • Uptime’s view is that data center operators will find a way to meet demand — even in hostile environments.

Issues with the supply of Russian gas to Europe might be temporary, but there are other concerns for power grids around the world. In the US, an aging electricity transmission infrastructure — much of it built in the 1970s and 1980s — requires urgent modernization, which will cost billions of dollars. It is not clear who will foot this bill. Meanwhile, power outages across the US over the past six years have more than doubled compared with the previous six years, according to federal data.

While the scale of grid disruption seen in South Africa is extreme, it offers lessons on what happens when a grid destabilizes and how to mitigate the problems. An unstable grid will cause similar problems for data center operators around the world, ranging from ballooning power costs to a higher risk of equipment failure. This risk will creep into other parts of the facility infrastructure if operations staff do not have time to perform the necessary maintenance tasks. Generators may be the primary source of data center power, but they are best used as an insurance policy.

US operators scour Inflation Reduction Act for incentives

In the struggle to reduce carbon emissions and increase renewable energy, the US Inflation Reduction Act (IRA), passed in August 2022, is a landmark development. The misleadingly named Act, which is lauded by environmental experts and castigated by foreign leaders, is intended to rapidly accelerate the decarbonization of the world’s largest economy by introducing nearly $400 billion in federal funding over the next 10 years.

Reducing the carbon intensity of electricity production is a major focus of the act, and the US clean energy industry will benefit greatly from the tax credits encouraging renewable energy development. The act also includes provisions intended to “re-shore” and create jobs in the US, and to ensure that US companies have greater control over the energy supply chain. Abroad, foreign leaders have objected that these protectionist provisions create (or aggravate) a political rift between the US and its allies and trading partners. In response to the IRA, the EU has redirected funds to buoy its low-carbon industries, threatened retaliatory measures and is considering the adoption of similar legislation.

While the politicians argue, stakeholders in the US have been scouring the IRA’s 274 pages for opportunities to capitalize on these lucrative incentives. Organizations that use a lot of electricity are also likely to benefit — including large data centers and their suppliers. Lawyers, accountants and investors working for organizations planning large-scale digital infrastructure investments will see opportunities too. Some of these will be substantial.

Summary of opportunities

For digital infrastructure companies, it may be possible to secure support in the following areas:

  • Renewable energy prices / power purchase agreements. Demand for renewable energy and the associated renewable energy credits is likely to be very high in the next decade, so prices are likely to rise. The tax incentives in the IRA will help to bring these prices down for renewable energy generators. By working with electricity providers and possibly co-investing, some data center operators will be able to secure lower energy prices.
  • Energy efficiency. Commercial building operators will find it easier to earn tax credits for reducing energy use. However, data centers that have already improved energy efficiency will struggle to reach the 25% reduction required to qualify. Operators may want to reduce energy use on the IT side, but this would not meet the eligibility requirements for this credit.
  • Equipment discounts / tax benefits. The act provides incentives for energy storage equipment (batteries or other technologies), which is necessary to operate a carbon-free grid. There are tax concessions for low-carbon energy generation, storage and microgrid equipment. Vendors may also qualify for tax benefits that can be sold.
  • Renewable energy generation. Most data centers generate little or no onsite renewable energy. In most cases, the power generated on site can support only a tiny fraction of the IT load. Even so, the many new incentives for equipment and batteries may make this more cost effective; and large operators may find it worthwhile to invest in generation at scale.

Detailed provisions

Any US operator considering significant investments in renewable generation and/or energy storage (including, for example, a UPS) is advised to study the act closely.

Uptime Institute Intelligence’s view is that the following IRA tax credits will apply to operators:

  • Investment tax credit (ITC), section 48. Of the available tax credits, the most significant for operators is the ITC. The ITC encourages renewable, low-carbon energy use by reducing capital costs by up to 30% through 2032. It applies to capital spending on assets including solar, wind and geothermal equipment, electrochemical fuel cells, energy storage and microgrid controllers. By making these investments more attractive, the ITC is likely to catalyze investment in, and the deployment of, low-carbon energy technologies.
  • Energy efficiency commercial buildings deduction, section 179D. This tax credit will encourage investment in energy efficiency and retrofits in commercial buildings. The incentive applies to projects that deliver at least a 25% energy efficiency improvement (reduced from the existing 50% threshold) for a building, compared with ASHRAE’s 90.1 standard reference building. The energy efficiency tax credit applies to projects in the following categories: interior lighting, heating, cooling, ventilation or the building envelope. To meet the 25% threshold, operators can retrofit several building systems. Qualified projects earn a tax credit of between 50 cents and $5 a square foot, depending on efficiency gains and labor requirements.
  • Production tax credit (PTC), section 45. This incentive does not directly apply to data center operators but will affect their business if they buy renewable energy. This tax credit rewards low-carbon energy producers by increasing their profit margin. The PTC only applies to energy producers that sell to a third party, rather than consuming the energy directly. Qualifying projects include wind, solar and hydropower facilities. The PTC scales with inflation and lasts for 10 years. In 2022, the maximum value of the PTC was 2.6 cents per kilowatt-hour (kWh) — for reference, the average US industrial energy price in September 2022 was 10 cents per kWh. If the credit is fully passed on to consumers, energy costs will be reduced by about 25%, as shown in the short calculation below. (Note: eligible projects must choose between the PTC and the ITC.)
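
As a quick check on the pass-through figure above, the sketch below divides the 2022 maximum PTC by the quoted average industrial energy price, and also shows the range of the section 179D deduction for a hypothetical building footprint (the floor area is an assumption used purely for illustration).

  # PTC as a share of the average US industrial energy price (September 2022 figures).
  ptc_usd_per_kwh = 0.026
  industrial_price_usd_per_kwh = 0.10
  print(f"{ptc_usd_per_kwh / industrial_price_usd_per_kwh:.0%}")  # 26%, roughly 25%

  # Section 179D deduction range for a hypothetical 100,000 sq ft facility.
  floor_area_sq_ft = 100_000
  print(f"${floor_area_sq_ft * 0.50:,.0f} to ${floor_area_sq_ft * 5.00:,.0f}")  # $50,000 to $500,000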

For the tax credits mentioned above, organizations must meet prevailing wage and apprenticeship requirements to receive the maximum credit (the US Treasury Department and the Internal Revenue Service have published initial guidance), unless the nameplate generation capacity of the project is less than 1 megawatt for the ITC and PTC.

The incentives listed above will be available until 2032, creating certainty for operators considering an investment in renewables, efficiency retrofits or the renewable energy industry. Additionally, these tax credits are transferable: they can be sold — for cash — to another company with tax liability, such as a bank.

Hyperscalers and large colocation providers are best positioned to capitalize on these tax credits: they are building new capacity quickly, possess the expertise and staffing capacity to navigate the legal requirements, and have ambitious net-zero targets.

However, data center operators of all sizes will pursue these incentives where there is a compelling business case. Owners and operators responding to Uptime Institute’s 2022 global data center survey said that more renewable energy purchasing options would deliver the most significant gains in sustainability performance over the next three to five years.

The IRA may also lower the cost barriers for innovative data center designs and topologies. For example, IRA incentives will strengthen the business case for pairing a facility on a microgrid with renewable and long-duration energy storage (LDES). Emerging battery chemistries in development (including iron-air, liquid metal and nickel-zinc) offer discharge durations of 10 hours to 10 days and would benefit from large deployments to prove their viability.

LDES is essential for a reliable low-carbon grid. As the IRA speeds up the deployment of renewables, organizations will need multi-day energy storage to smooth out the high variability of intermittent generators such as solar and wind. Data center facilities may be ideal sites for LDES, even if the storage is not dedicated to data center use.

Additionally, low-carbon baseload generators such as nuclear, hydrogen and geothermal — all eligible for IRA tax credits — will be needed to replace reliable fossil fuel generators, such as gas turbines and coal power plants.

The incentives in the IRA, welcomed with enthusiasm by climate campaigners the world over, will strengthen the business case in the US for reducing energy consumption, deploying low-carbon energy and energy storage, and/or investing in the clean energy economy.

There is, however, a more problematic side: the rare earth materials and critical components the US will need to meet the objectives of the IRA may be hard to source in sufficient quantities, and allegations of protectionism may cause political rifts with other countries.


Lenny Simon, Senior Research Associate [email protected]

Andy Lawrence, Executive Director of Research [email protected]