Concerns over cloud concentration risk grow

Control over critical digital infrastructure is increasingly in the hands of a small number of major providers. While a public cloud provides a flexible, stable and distributed IT environment, there are growing concerns around its operational resiliency.

Following some recent high-profile cloud failures, and with regulators asking more questions, there is increasing anxiety that using a big cloud provider can be a single point of failure, not just technically but also from a business-risk perspective.

Many organizations and regulators take issue with the lack of transparency of cloud providers, and the lack of control (see Figure 1) that important clients have — some of which are part of the national critical infrastructure. Concentration risk, where key services are dependent on one or a few key suppliers, is a particular concern.

Figure 1. More mission-critical workloads in public clouds but visibility issues persist

However, because the range and scope of services, management tools and developer environments vary among major cloud providers, organizations are often forced to choose a single provider (at least, for each business function). Even in highly regulated and critical sectors, such as financial services, a multicloud strategy is often not feasible, and changing suppliers, for whatever reason, is rarely easy.

In 2021, for example, two major US financial firms, Bank of America and Morgan Stanley, announced they would each standardize on a primary public cloud provider (IBM and Microsoft Azure, respectively). Spreading workloads across multiple clouds that use different technologies, and retraining developers or hiring a range of specialists, had proved too complex and costly.

Big cloud providers argue that running workloads in a single cloud does not amount to over-reliance. Diversifying within that cloud, they say, can mitigate risk: for example, deploying workloads using platform as a service (PaaS) while maintaining an infrastructure as a service (IaaS) configuration for disaster recovery. Providers also point to the distributed nature of cloud computing, which, combined with good monitoring and automated recovery, makes it highly reliable.

Reliability and resiliency, however, are two different things. High reliability suggests there will be few outages and limited downtime; high resiliency means that a system is not only less likely to fail, but that it, and the systems that depend on it, can recover quickly when a failure does occur. In enterprise and colocation data centers, and in corporate IT, designs can be scrutinized, single points of failure eliminated and failure responses rehearsed. In cloud services, much of this is a black box: these processes are conducted by the cloud provider, behind the scenes and for the benefit of all its clients, not to ensure the best outcomes for just a few.

Our research shows that cloud providers have high levels of reliability, but they are not immune to failure. Complex backup regimes and availability zones, supported by load and traffic management, improve the resiliency and responsiveness of cloud providers, but they also come with their own problems. When issues do occur, many customers are affected at once, and recovery can be complex. In 2020, Uptime Institute recorded 21 cloud/internet giant outages that had significant financial or other negative consequences (see Annual Outage Analysis 2021).

Mindful of these risks, US financial giant JPMorgan, for example, is among the few in its sector taking a multicloud approach. JPMorgan managers have cited concerns over a lack of control with a single provider and, in the case of a major outage, the complexity and time needed to migrate to another provider and back again.

Regulators are also concerned — especially in the financial services industry, where new rules are forcing banks to conduct due diligence on cloud providers. In the UK, the Bank of England is introducing new rules to ensure better management oversight of large banks’ reliance on the cloud. And the European Banking Authority mandates that a cloud (or other third-party) operator allow site inspections of data centers.

A newer proposed EU law has wider implications: the Digital Operational Resilience Act (DORA) puts cloud providers under financial regulators’ purview for the first time. The act is expected to pass in 2022; once in force, cloud providers, among other suppliers, could face large fines if the loss of their services causes disruption in the financial services industry. European governments have also expressed political concerns over growing reliance on non-European providers.

In 2022, we expect these “concentration risk” concerns to rise up more managers’ agendas. In anticipation, some service providers plan to focus more on enabling multicloud configurations.

However, the concentration risk goes beyond cloud computing: problems at one or more big suppliers have been shown to cause technical issues for completely unrelated services. In 2021, for example, a technical problem at the content distribution network (CDN) provider Fastly led to global internet disruption; while an outage at the CDN provider Akamai took down access to cloud services from AWS and IBM (as well as online services for many banks and other companies). Each incident points to a broader issue: the concentration of control over core internet infrastructure services in relatively few major providers.

How will these concerns play out? Some large customers are demanding a better view of cloud suppliers’ infrastructure and a better understanding of potential vulnerabilities. As our research shows, more IT and data center managers would consider moving more of their mission-critical workloads into public clouds if visibility of the operational resiliency of the service improves.

While public cloud data centers may already have adequate risk profiles for most mission-critical enterprise workloads, the details providers disclose about the infrastructure and its risks will increasingly be considered inadequate by regulators and auditors. And legislation such as the proposed DORA, with penalties for outages that go far beyond service level agreements, is likely to spur greater regulatory attention in more regions and across more mission-critical sectors.

The full Five Data Center Predictions for 2022 report is available here.

Bring on regulations for data center sustainability, say Europe and APAC

As the data center sector increases its focus on becoming more environmentally sustainable, regulators still have a part to play; the question is to what extent. In a recent Uptime Institute survey of nearly 400 data center operators and suppliers worldwide, a strong majority said they would favor regulators playing a greater role in improving the overall sustainability of data centers — with the notable exception of respondents in North America.

Globally, more than three in five respondents favor greater reliance on statutory regulation. The strongest support (75% of respondents) is in Europe and APAC (Asia-Pacific, including China). However, in the US and Canada, fewer than half (41%) want more government involvement, with the majority of respondents saying the government plays an adequate role, or should play a lesser role, in sustainability regulation (See Figure 1).

Figure 1. Appetites for sustainability laws differ by region

Our survey did not probe the reasons behind these attitudes, but there are a few possible explanations for North America being an outlier. Globally, there is often a technical knowledge gap between industry professionals and government policymakers. As North America is the largest mature data center market, this gap may be more pronounced there, fueling a general distrust within the sector of legislators’ ability to create effective, meaningful laws. Indeed, North American participants have a lower opinion of their regulators’ understanding of data center matters than the rest of the world: four in 10 rate their regulators as “not at all informed or knowledgeable.”

There are, however, cases of non-US legislation lacking technical merit, such as Amsterdam’s annual power usage effectiveness (PUE) limit of 1.2 for new data center builds. Although low PUEs are important, this legislation lacks nuance and does not factor in changes in utilization: PUE tends to rise at low utilization levels (for example, below 20% of the facility’s rated capacity). The requirement for a low PUE could incentivize behavior that is counterproductive to the regulation’s intent, such as enterprises and service providers moving power-hungry applications (and operators of leased capacity commercially attracting them) simply to achieve a certain PUE number and avoid penalties. Also, these rules do not consider the energy efficiency of the IT.
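
The utilization effect is easy to demonstrate with a toy model. The sketch below uses purely illustrative assumptions (a notional 1 MW facility with a fixed overhead plus a load-proportional overhead), not measured data, to show how the same facility can exceed a 1.2 PUE limit simply by being lightly loaded.

```python
def pue(it_load_kw: float, fixed_overhead_kw: float, variable_overhead_ratio: float) -> float:
    """Toy PUE model: facility power = IT load + fixed overhead (lighting,
    transformer losses, base cooling) + overhead that scales with IT load."""
    facility_kw = it_load_kw + fixed_overhead_kw + variable_overhead_ratio * it_load_kw
    return facility_kw / it_load_kw

# Notional 1 MW-rated facility: 100 kW fixed overhead, 10% load-proportional overhead.
for utilization in (1.0, 0.5, 0.2):
    it_kw = 1000 * utilization
    print(f"{utilization:.0%} utilization -> PUE {pue(it_kw, 100, 0.10):.2f}")
# 100% -> 1.20, 50% -> 1.30, 20% -> 1.60
```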

Even if we accept the PUE’s limitations, the metric will likely have low utility as a regulatory dial in the future. Once a feature of state-of-the-art data centers, strong PUEs are now straightforward to achieve. Also, major technical shifts, such as the use of direct liquid cooling, may render PUE inconsequential. (See Does the spread of direct liquid cooling make PUE less relevant?)

The issue is not simply one of over-regulation: there are also instances of legislators setting the bar too low. The industry-led Climate Neutral Data Centre Pact is a case in point. Formed in the EU, this self-regulatory agreement commits signatory data center operators to working toward net-zero emissions by 2030 — 20 years earlier than the goal the EU itself has set as part of its European Green Deal.

Why, then, are most operators (outside of North America) receptive to more legislation? Perhaps because regulation has, in some cases, benefited the industry’s sustainability profile and attracted global attention as a workable framework. Although Amsterdam’s one-year ban on new data center construction in 2019 was largely met with disapproval from the sector, it resulted in policies (including the PUE mandate) offering a clearer path toward sustainable development.

The new regulations for Amsterdam include designated campuses for new facility construction within the municipal zones, along with standards for improving the efficient use of land and raw materials. There are also regulations relating to heat re-use and multistory designs, where possible — all of which force the sector to explore efficient, sustainable siting, design and operational choices.

Amsterdam’s temporary ban on new facilities provided a global case study for the impacts of extreme regulatory measures on the industry’s environmental footprint. Similar growth-control measures are planned in Frankfurt, Germany and Singapore. If they realize benefits similar to those experienced in Amsterdam, support for regulation may increase in these regions.

In the grand scheme of sustainability and local impact, regulatory upgrades may have minimal effect. A clear policy, however, builds business confidence by removing uncertainty — which is a boon for data center developments with an investment horizon beyond 10 years. As for North America’s overall resistance, it could simply be that the US is more averse to government regulation, in general, than elsewhere in the world.

By: Jacqueline Davis, Research Analyst, Uptime Institute and Douglas Donnellan, Research Associate, Uptime Institute

Are proof-of-work blockchains a corporate sustainability issue?

The data center and IT industry is a relatively minor — but nevertheless significant — contributor to greenhouse gas emissions. The issue of wasteful digital infrastructure energy consumption is now high on many corporate agendas and is prompting companies and overseeing authorities to act.

But one of the biggest and most problematic IT-based consumers of energy is not generally covered by these existing and planned sustainability initiatives: cryptocurrency networks — Bitcoin and Ethereum, in particular. While Bitcoin’s huge energy consumption is widely known and elicits strong reactions, the role of the energy-intensive blockchain proof-of-work (PoW) security mechanism is rarely discussed in a corporate context, or as part of the overall sustainability challenge. This is beginning to look like an important oversight.

Solving the PoW puzzle, a race among miners to be the first to process a block of transactions, is a compute-intensive activity that calls for power-hungry processors to work more or less continuously. At scale, this adds up to a lot of energy.
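
The mechanism can be illustrated with a minimal hash-puzzle sketch. This is a simplification, not the Bitcoin protocol itself (real mining hashes a specific block-header format with double SHA-256 against a numeric difficulty target), but it captures why the work is inherently brute force: the only way to find a valid nonce is to try hashes until one succeeds.

```python
import hashlib
from itertools import count

def mine(block_data: bytes, difficulty_bits: int) -> int:
    """Find a nonce such that SHA-256(block_data + nonce) falls below a target.
    Each extra bit of difficulty roughly doubles the expected number of hashes."""
    target = 2 ** (256 - difficulty_bits)
    for nonce in count():
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

# Low difficulty so the demo finishes quickly (~tens of thousands of hashes).
print(mine(b"example block of transactions", difficulty_bits=16))
```

The energy cost follows directly: every attempted hash is a unit of compute, and therefore of electricity, that is discarded unless it happens to win the race.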

The scale of cryptocurrency mining’s energy use can be seen when set alongside the energy used by the data center sector as a whole.

According to various studies, all the world’s data centers combined used between 210 TWh (terawatt-hours) and 400 TWh — or even more — in 2020.* The wide range is partly accounted for by the differing methodologies used by the authors, and by a lack of reliable information on how widely various technologies are deployed.

But it is clear that in these data centers, a combination of innovation (Moore’s law, virtualization, cloud, etc.) and energy conservation has held down potentially runaway energy consumption growth to a few percentage points in recent years.

With Bitcoin, the opposite is true. Not only is energy consumption extremely high, but it is climbing steadily. A study by the UK’s Cambridge Centre for Alternative Finance in April 2021 put annual electricity consumption by Bitcoin alone — just one of many cryptocurrencies — at 143 TWh. Other researchers have arrived at similar estimates. If Bitcoin were a country, it would rank among the 30 largest electricity consumers in the world, worthy of a seat at the next COP (Conference of the Parties) summit. The energy expended by the Bitcoin network to fully process a single transaction is many times that of a credit card transaction.

But it gets worse. As UK academic Oli Sharpe argues in the video Explaining Bitcoin’s deep problem, the level of difficulty required to solve the PoW algorithm must keep increasing to ensure the integrity of the currency. Furthermore, to maintain integrity, the cost of compute must also represent a sizeable fraction of the value being transferred through the network — again, forcing up the amount of equipment and energy use needed. It is as if the transaction fees once paid to central banks are now being diverted to energy and IT equipment companies. All this translates into more processing time and, therefore, more energy use. A recent Citigroup paper estimated there was a 66-fold increase in Bitcoin energy use from 2015 to 2020. At this growth rate, Bitcoin energy use will soon overtake energy use by the rest of the data center industry combined.

Many in the digital infrastructure industry have shrugged off the cryptocurrency energy problem, viewing blockchain technologies (distributed ledgers) as a technical breakthrough and an opportunity for business innovation. But that may be changing. In Uptime Institute’s recent environmental sustainability survey, we asked nearly 400 operators and vendors about their views on cryptocurrency energy use (see Figure 1). Nearly half (44%) think legislation is needed to limit cryptocurrency mining, and one in five (20%) think operators should stop any cryptocurrency mining until the energy consumption problem is solved (some say it already is, using alternative methods to secure transactions).

Figure 1. The sector’s view of cryptocurrency is not favorable

Some may argue that, because Bitcoin mining uses specialist machines and very little of it takes place in “formal” data centers, it is not a mainstream digital infrastructure issue. But this view is too simplistic, for three reasons.

First, all blockchains use IT, processors and networks, and many make use of large IT nodes that are operated in buildings that, if not data centers, are at least Tier 1 server rooms. The public and regulators generally see these as small data centers, and all part of the same serious problem.

Second, there are now thousands of cryptocurrencies and other blockchain services and applications. Many of these are developed and supported by large organizations, running in large data centers, and rely on energy-hungry PoW protocols.

Blockchain services offered by the cloud giants are an example: Amazon Web Services (AWS), Google and Microsoft all offer blockchain-as-a-service products that use the Ethereum blockchain platform or a variation of it (Ethereum is the blockchain platform; ether is the coin). The core Ethereum platform currently uses the energy-intensive PoW protocol, although some less power-hungry versions of the protocol are in use or in development. Ethereum 2, which will use a proof-of-stake (PoS) protocol, is due to be offered in 2022 and promises a 90% reduction in energy use per transaction.

Ethereum lags Bitcoin in energy use, but its consumption is still very significant. According to the Ethereum Energy Consumption Index, published on Digiconomist.net, annualized consumption to support Ethereum stood at 88.8 TWh as of November 2021, a four-fold increase in less than a year.

Where is all this processing done? According to AWS’s own website, 25% of all Ethereum workloads globally run on AWS. This means either that energy-intensive blockchain workloads are currently running in AWS hyperscale and colocation data centers around the world (it is not clear how much of this is PoW calculation), or that AWS customers using Ethereum-based blockchains depend on PoW processing being done elsewhere.

There is a third issue. Many corporations and big cloud services are switching to PoS services because they use so much less energy. But given the growing importance of sustainability and carbon reporting, businesses may also need to understand the energy use of PoS technologies. Although these can be a thousand times more efficient than PoW, they can also be a thousand times more or less efficient than each other (see Energy Footprint of Blockchain Consensus Mechanisms Beyond Proof-of-Work, published by University College London’s Centre for Blockchain Technologies). Fortunately, compared with PoW’s huge footprint, these differences are still tiny.

—–

*There have been several studies of global energy use by IT and by data centers. A recent and insightful paper, by the authors who reached the lower 210 TWh figure, is Recalibrating global data center energy-use estimates (Masanet E, Shehabi A, Lei N, et al. Science 2020; 367:984–986).

Climate change: More operators prepare to weather the storms

In the 2020 Uptime Institute Intelligence report The gathering storm: Climate change and data center resiliency, the author noted that, “While sustainability… features heavily in the marketing of many operators, the threat of extreme weather to continued operations has received far less attention… some complacency may have set in.”

Events of 2021 have changed that. Fires, floods, big freezes and heat waves — coupled with investor activism and threatened legislation mandating greater resiliency to climate change impacts — have driven up awareness of the risks to critical infrastructure. More data center operators are now carrying out both internal and external assessments of their facilities’ vulnerability to climate change-related events and long-term changes. A growing proportion is now reacting to what it perceives to be a dramatic increase in risk.

To track changing attitudes and responses to climate change and the related threats to infrastructure, Uptime Institute Intelligence conducted two global surveys, in the fourth quarters of 2020 and 2021, respectively. The latest survey took place before the 2021 COP26 (Conference of the Parties) meeting in Glasgow, Scotland, which, according to some surveys, raised further awareness of the seriousness of climate change in some countries.

In our 2021 survey, 70% of respondents said they or their managers have conducted climate change risk assessments of their data centers. This number is up from 64% a year earlier (see Table 1). In around half of these cases, managers are taking, or plan to take, steps to upgrade the resiliency of their critical infrastructure.

Table 1. Operators respond to growing climate threats

The figures are not wholly reassuring. First, one in 10 data center operators now sees a dramatic increase in the risk to their facilities — a figure that suggests many hundreds of billions of dollars of data center assets are currently believed to be at risk by those who are managing them. Insurers, investors and customers are taking note.

Second, one in three owners/operators still has not formally conducted a risk assessment related to weather and/or climate change. Of course, data center operators and owners are risk averse, and most facilities are sited and built in a very cautious way. Even so, no region or site is beyond the reach of climate change’s effects, especially as many supply chains are vulnerable. Both the data and the legislative activity suggest that more formal and regular assessments will be needed.

The surveys further revealed that about one in three organizations uses external experts to assess their critical infrastructure climate change risks, up from about a quarter a year ago. This is likely to increase as regulators and investors seek to quantify, validate and reduce risks. The Task Force on Climate-related Financial Disclosures (TCFD), a nongovernmental organization influencing investor and government policy in many countries, advises organizations to adopt and disclose processes and standards for identifying climate change risks to all corporate assets.

Mixed resiliency at the edge

Many analysts have forecast an explosion in demand for edge data centers. After a long, slow start, demand is beginning to build, with small, prefabricated and mostly remotely operated data centers ready to be deployed to support a wide array of applications.

There are still many uncertainties surrounding the edge market, ranging from business models to ownership, and from numbers to types of deployment. One open question is how much resiliency will be needed, and how it will be achieved.

While on-site infrastructure redundancy (site-level resiliency) remains the most common approach to achieving edge data center resiliency, Uptime Institute’s research shows increased interest in software- and network-based distributed resiliency. Nine in 10 edge data center owners and operators believe distributed resiliency will be very or somewhat commonly used within two to three years.

Distributed resiliency, which involves synchronous or asynchronous replication of data across multiple sites, has, until recently, mainly been used by large cloud and internet service providers. It is commonly deployed in cloud availability zones and combined with site-level resiliency at three or more connected physical data centers.

While site-level redundancy is primarily a defense against equipment faults at a site, distributed resiliency can harden against major local events or sabotage (taking out a full site). It can also reduce infrastructure costs (by reducing site-level redundancy needs) and increase business agility through flexible placement and shifting of IT workloads. Edge data centers making use of distributed resiliency are connected and operated in a coordinated manner, as illustrated in Figure 1. The redundant element in this case is at least one full edge data center (not a component or system). When a disruption occurs, when capacity limits are reached, or when planned maintenance is required, some (or all) of the IT workloads in an edge data center are shifted to one or more other edge data centers.

Figure 1. Different resiliency approaches are used for edge data centers

Site-level resiliency relies on redundant capacity components (also including major equipment) for critical power, cooling, and network connectivity — the approach widely adopted by almost all data centers of any size. Edge data centers using only site-level resiliency tend to run their own IT workloads independently from other edge data centers.

The coordination between connected edge sites commonly uses either a hierarchical topology or a mesh topology to deliver multisite resiliency.

None of these approaches or topologies are mutually exclusive, although distributed resiliency creates opportunities to reduce component redundancy at individual edge sites without risking service continuity.

Uptime Institute’s research suggests that organizations deploying edge data centers can benefit from the combined use of site-level resiliency and distributed resiliency.

Organizations deploying distributed resiliency should expect some challenges before the system works flawlessly, due to the increased software and network complexity. Because edge data centers are typically unstaffed, resilient remote monitoring and good network management/IT monitoring are essential for early detection of disruption and capacity limitations, regardless of the resiliency approach used.
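
For illustration, the workload-shifting decision described above can be reduced to a simple control rule. The sketch below is hypothetical (the class and function names are invented for this example, and real deployments would typically rely on an orchestration platform rather than hand-written logic), but it shows the essential behavior: move workloads off a site that is disrupted, over capacity or entering maintenance to peer sites with headroom.

```python
from dataclasses import dataclass, field

@dataclass
class EdgeSite:
    name: str
    capacity_kw: float
    healthy: bool = True
    in_maintenance: bool = False
    workloads: dict = field(default_factory=dict)  # workload name -> kW draw

    def load_kw(self) -> float:
        return sum(self.workloads.values())

    def can_accept(self, kw: float) -> bool:
        return self.healthy and not self.in_maintenance and self.load_kw() + kw <= self.capacity_kw

def evacuate(source: EdgeSite, peers: list) -> None:
    """Shift workloads from a disrupted, full or maintained site to peers with headroom."""
    for name, kw in list(source.workloads.items()):
        for peer in peers:
            if peer is not source and peer.can_accept(kw):
                peer.workloads[name] = source.workloads.pop(name)
                break
```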

Does the spread of direct liquid cooling make PUE less relevant?

The power usage effectiveness (PUE) metric is predominant thanks to its universal applicability and its simplicity: energy used by the entire data center, divided by energy used by the IT equipment. However, its simplicity could limit its future relevance, as techniques such as direct liquid cooling (DLC) profoundly change the profile of data center energy consumption.
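
Written out, with a one-line worked example using illustrative numbers:

\[
\mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{IT equipment}}}
\]

A facility drawing 1,200 kW in total while its IT equipment draws 1,000 kW therefore has a PUE of 1.2.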

Ever since it was developed by The Green Grid in 2007, PUE has been used beyond its original intention, including as a single defining efficiency metric and as a comparative benchmark between different data centers. Annualized PUE has become the global de facto standard for data center energy efficiency, in part because it can hide many sins: PUE doesn’t account for important trade-offs in, for example, resiliency, water consumption and, perhaps most crucially, the efficiency of the IT.

However, looming technical changes to facility infrastructure could, if extensively implemented, render PUE unsuitable for historical or contemporary benchmarking. One such change is the possibility of DLC entering mainstream adoption. While DLC has been an established yet niche technology for decades, some in the data center sector think it is on the verge of being more widely used.

Among the drivers for DLC is the ongoing escalation of server processor power, which could mean new servers will increasingly be offered in both traditional and DLC configurations.

Figure 1. Few think air cooling will remain dominant beyond 10 years

According to a recent Uptime survey, only one in four respondents think air cooling will remain dominant beyond the next decade in data centers larger than 1 megawatt (MW; see Figure 1).

Regardless of the form (full or partial immersion, or direct-to-chip cold plates), DLC reshapes the composition of energy consumption across the facility and IT infrastructure, going beyond simply lowering the calculated PUE to near its absolute limit. Most DLC implementations achieve a partial PUE of 1.02 to 1.03, outperforming the most efficient air-cooling systems by low single-digit percentages. But PUE does not capture most of DLC’s energy gains, because DLC also lowers the power consumption of the IT itself, raising questions about how to account for infrastructure efficiency.

In other words, DLC changes enough variables outside the scope of PUE that its application as an energy efficiency metric becomes unsuitable.

There are two major reasons why DLC PUEs are qualitatively different from PUEs of air-cooled infrastructure. One is that DLC systems do not require most of the IT system fans that move air through the chassis (cold-plate systems still need some fans in power supplies, and for low-power electronics). Because server fans are powered by the server power supply, their consumption counts as IT power. Suppliers have modeled fan power consumption extensively, and it is a non-trivial amount. Estimates typically range between 5% and 10% of total IT power depending on fan efficiency, size and speeds (supply air temperature can also be a factor).
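
A simplified before-and-after comparison shows why this matters for the metric. The figures below are illustrative assumptions, not measurements: an air-cooled facility with 1,000 kW of IT load (of which roughly 80 kW is server fans) and 200 kW of facility overhead, versus a DLC retrofit that eliminates most fan power and achieves a partial-PUE-like 3% overhead.

```python
def pue(it_kw: float, overhead_kw: float) -> float:
    return (it_kw + overhead_kw) / it_kw

air_it, air_overhead = 1000.0, 200.0          # air-cooled: fans counted inside IT power
dlc_it, dlc_overhead = 920.0, 0.03 * 920.0    # DLC: fans removed, ~3% facility overhead

print(f"Air-cooled: PUE {pue(air_it, air_overhead):.2f}, total {air_it + air_overhead:.0f} kW")
print(f"DLC:        PUE {pue(dlc_it, dlc_overhead):.2f}, total {dlc_it + dlc_overhead:.0f} kW")
# PUE improves from 1.20 to 1.03, reflecting the facility overhead savings,
# but the ~80 kW saved inside the IT number (fans) is invisible to the metric.
```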

The other, less-explored component of IT energy is semiconductor power loss due to temperature. Modern high-performance processors are prone to relatively high leakage currents that flow even when the chip is not cycling (sleeping circuits with no clock signal). This is known as static power, as opposed to the dynamic (active) power consumed when a gate switches state to perform work. As the scale of integration grows with more advanced chip manufacturing technologies, so does the challenge of leakage. Despite chipmakers’ efforts to contain it without giving up too much performance or transistor density, static power remains a significant part of the total power equation for large compute chips tuned for performance, such as server processors.
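
In rough terms, and as a simplified model rather than a vendor specification, total processor power can be written as:

\[
P_{\text{total}} = P_{\text{dynamic}} + P_{\text{static}}, \qquad
P_{\text{dynamic}} \approx \alpha\, C\, V^{2} f, \qquad
P_{\text{static}} = V \cdot I_{\text{leak}}(T)
\]

where \(\alpha\) is the switching activity factor, \(C\) the switched capacitance, \(V\) the supply voltage, \(f\) the clock frequency and \(I_{\text{leak}}(T)\) the leakage current, which rises steeply (roughly exponentially, for subthreshold leakage) with junction temperature \(T\).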

Static power, unlike dynamic power, correlates strongly with temperature. Because DLC systems can maintain chip operating temperatures far below those of air-cooled ones (say, 48 degrees Celsius/118.4 degrees Fahrenheit, as opposed to 72 degrees Celsius/161.6 degrees Fahrenheit for air-cooled systems), they can dramatically reduce static power. In a 2010 study on a supercomputer in Japan, Fujitsu estimated that water cooling lowered processor power by a little over 10% when the chip was cooled from 85 degrees Celsius/185 degrees Fahrenheit to 30 degrees Celsius/86 degrees Fahrenheit. Static power has likely become a bigger problem since this study was conducted, suggesting that cooler chip operation has the potential to curb total IT power by several percentage points.

Without guidance from chipmakers on the static power profile of their processors, the only way to quantify this energy benefit is via experimentation. Worse still, the impact on total power will vary across servers using different chips, for multiple reasons (e.g., processor utilization, workload intensity, and semiconductor technology and manufacturing variations between different chipmakers or chip generations). All this complicates the case for including static power in a new efficiency metric — or in the business case for DLC. In other words, it is a known factor, but to what extent is unknown.

There are other developments in infrastructure design that can undermine the relevance of PUE. For example, distributed, rack-integrated uninterruptible power supplies with small battery packs can become part of the IT infrastructure, rather than the purview of facilities management.

If the promise of widespread DLC adoption materializes, PUE, in its current form, may be heading toward the end of its usefulness. The absence of a useful PUE metric would break the continuity of historical trending. Moreover, it would hollow out competitive benchmarking: all DLC data centers will be very efficient, with immaterial energy differences. If liquid-cooled servers gain a stronger foothold (as many, but not all, in the data center sector expect), operators will likely need a new metric for energy efficiency, if not as a replacement for PUE, then as a supplement. Tracking of IT utilization, and a more granular approach to monitoring the power consumption of workloads, could quantify efficiency gains much better than any future version of PUE.