New ASHRAE guidelines challenge efficiency drive

Earlier in 2021, ASHRAE’s Technical Committee 9.9 published an update — the fifth edition — of its Thermal Guidelines for Data Processing Environments. The update recommends important changes to data center thermal operating envelopes: the presence of pollutants is now a factor, and it introduces a new class of IT equipment for high-density computing. The new advice can, in some cases, lead operators not only to alter operational practices but also to shift set points, a change that may affect both energy efficiency and contractual service level agreements (SLAs) with data center service providers.
Since the original release in 2004, ASHRAE’s Thermal Guidelines have been instrumental in setting cooling standards for data centers globally. The 9.9 committee collects input from a wide cross-section of the IT and data center industry to promote an evidence-based approach to climatic controls, one which helps operators better understand both risks and optimization opportunities. Historically, most changes to the guidelines pointed data center operators toward further relaxations of climatic set points (e.g., temperature, relative humidity, dew point), which also stimulated equipment makers to develop more efficient air economizer systems.
In the fifth edition, ASHRAE adds some major caveats to its thermal guidance. While the recommendations for relative humidity (RH) extend the range up to 70% (the previous cutoff was 60%), this is conditional on the data hall having low concentrations of pollutant gases. If the presence of corrosive gases is above the set thresholds, ASHRAE now recommends operators keep RH under 50% — below its previous recommended limit. To monitor, operators should place metal strips, known as “reactivity coupons,” in the data hall and measure corroded layer formation; the limit for silver is 200 ångström per month and for copper, 300 ångström per month.
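To illustrate how these thresholds might be applied in practice, the short sketch below maps monthly reactivity-coupon readings to an RH limit using the figures quoted above. It is a minimal illustration only; the function and variable names are ours, and real deployments would follow the guideline text and site-specific engineering judgment rather than a two-branch check.

```python
# Minimal sketch (not from ASHRAE): map monthly reactivity-coupon readings to a
# relative humidity limit, using the thresholds quoted above
# (silver < 200 angstrom/month, copper < 300 angstrom/month).

def recommended_rh_limit(silver_angstrom_per_month: float,
                         copper_angstrom_per_month: float) -> int:
    """Return a suggested upper RH limit (%) for the data hall."""
    low_corrosion = (silver_angstrom_per_month < 200 and
                     copper_angstrom_per_month < 300)
    if low_corrosion:
        return 70   # extended recommended envelope applies
    return 50       # corrosive gases likely present: keep RH under 50%

print(recommended_rh_limit(150, 250))  # -> 70
print(recommended_rh_limit(150, 450))  # -> 50
```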
ASHRAE bases its enhanced guidance on an experimental study on the effects of gaseous pollutants and humidity on electronics, performed between 2015 and 2018 with researchers from Syracuse University (US). The experiments found that the presence of chlorine and hydrogen sulfide accelerates copper corrosion under higher humidity conditions. Without chlorine, hydrogen sulfide or similarly strong catalysts, there was no significant corrosion up to 70% RH, even when other, less aggressive gaseous pollutants (such as ozone, nitrogen dioxide and sulfur dioxide) were present.
Because corrosion from chlorine and hydrogen sulfide at 50% RH is still above acceptable levels, ASHRAE suggests operators consider chemical filtration to decontaminate.
While the data ASHRAE uses is relatively new, its conclusions echo previous standards. Those acquainted with the environmental requirements of data storage systems may find the guidance familiar — data storage vendors have been following specifications set out in ANSI/ISA-71.04 since 1985 (last updated in 2013). Long after the era of tapes, storage drives (hard disks and solid state alike) remain the foremost victims of corrosion, as their low-temperature operational requirements mean increased moisture absorption and adsorption.
However, many data center operators do not routinely measure gaseous contaminant levels, and so do not monitor for corrosion. If strong catalysts are present but undetected, this might lead to higher than expected failure rates even if temperature and RH are within target ranges. Worse still, lowering supply air temperature in an attempt to counter failures might make them more likely. ASHRAE recommends operators consider a 50% RH limit if they don’t perform reactivity coupon measurements. Somewhat confusingly, it also makes an allowance for following specifications set out in its previous update (the fourth edition), which recommends a 60% RH limit.
Restricted envelope for high-density IT systems
Another major change in the latest update is the addition of a new class of IT equipment, separate from the pre-existing classes of A1 through A4. The new class, H1, includes systems that tightly integrate a number of high-powered components (server processors, accelerators, memory chips and networking controllers). ASHRAE says these high-density systems need narrower air temperature bands — it recommends 18°C/64.4°F to 22°C/71.6°F (as opposed to 18°C/64.4°F to 27°C/80.6°F) — to meet their cooling requirements. The allowable envelope has become tighter as well, with an upper limit of 25°C/77°F for class H1, instead of 32°C/89.6°F (see Figure 1).
This is because, according to ASHRAE, there is simply not enough room in some dense systems for the higher performance heat sinks and fans that could keep components below temperature limits across the generic (classes A1 through A4) recommended envelope. ASHRAE does not stipulate what makes a system class H1, leaving it to the IT vendor to specify its products as such.
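For reference, the sketch below encodes only the supply-air figures cited in this post (the generic 18°C to 27°C recommended band with a 32°C allowable ceiling corresponds to class A1; classes A2 through A4 allow higher temperatures, and lower allowable bounds are not modeled). It is an illustrative simplification, not a reproduction of the guideline tables.

```python
# Sketch of the supply-air bands quoted above (degrees Celsius). Only the
# figures cited in the text are encoded; lower allowable bounds and the wider
# A2-A4 allowable ranges are not modeled.

ENVELOPES_C = {
    "H1": {"recommended": (18.0, 22.0), "allowable_max": 25.0},
    "A1": {"recommended": (18.0, 27.0), "allowable_max": 32.0},
}

def classify_supply_temp(equipment_class: str, temp_c: float) -> str:
    env = ENVELOPES_C[equipment_class]
    low, high = env["recommended"]
    if low <= temp_c <= high:
        return "within recommended envelope"
    if temp_c <= env["allowable_max"]:
        return "outside recommended, within allowable envelope"
    return "outside allowable envelope"

print(classify_supply_temp("A1", 24.0))  # within recommended envelope
print(classify_supply_temp("H1", 24.0))  # outside recommended, within allowable envelope
```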
There are some potentially far-reaching implications of these new envelopes. Operators have over the past decade built and equipped a large number of facilities based on ASHRAE’s previous guidance. Many of these relatively new data centers take advantage of the recommended temperature bands by using less mechanical refrigeration and more economization. In several locations — Dublin, London and Seattle, for example — it is even possible for operators to completely eliminate mechanical cooling yet stay within ASHRAE guidelines by marrying the use of evaporative and adiabatic air handlers with advanced air-flow design and operational discipline. The result is a major leap in energy efficiency and the ability to support more IT load from a substation.
Such optimized facilities will not typically lend themselves well to the new envelopes. That most of these data centers can support 15- to 20-kilowatt IT racks doesn’t help either, since H1 is a new equipment class requiring a lower maximum for temperature — regardless of the rack’s density. To maintain the energy efficiency of highly optimized new data center designs, dense IT may need to have its own dedicated area with independent cooling. Indeed, ASHRAE says that operators should separate H1 and other more restricted equipment into areas with their own controls and cooling equipment.
Uptime will be watching with interest how colocation providers, in particular, handle this challenge, as their typical SLAs depend heavily on the ASHRAE thermal guidelines. What may be considered an oddity today may soon become common, given that semiconductor power keeps escalating with every generation. Facility operators may deploy direct liquid cooling for high-density IT as a way out of this bind.
Too big to fail? Facebook’s global outage
The bigger the outage, the greater the need for explanations and, most importantly, for taking steps to avoid a repeat.
By any standards, the outage that affected Facebook on Monday, October 4th, was big. For more than six hours, Facebook and its other businesses, including WhatsApp, Instagram and Oculus VR, disappeared from the internet — not just in a few regions or countries, but globally. So many users and machines kept retrying these websites, it caused a slowdown of the internet and issues with cellular networks.
While Facebook is large enough to ride through the immediate financial impact, that impact should not be dismissed. Market watchers estimate that the outage cost Facebook roughly $60 million in revenue over the more than six hours it lasted. The company’s shares fell 4.9% on the day, which translated into more than $47 billion in lost market cap.
Facebook may recover those losses, but the bigger ramifications may be reputational and legal. Uptime Institute research shows that the level of outages from hyperscale operators is similar to that experienced by colocation companies and enterprises — despite their huge investments in distributed availability zones and global load and traffic management. In 2020, Uptime Institute recorded 21 cloud/internet giant outages, with associated financial and reputational damage. With antitrust, data privacy and, most recently, children’s mental health concerns swirling about Facebook, the company is unlikely to welcome further reputational and legal scrutiny.
What was the cause of Facebook’s outage? The company said an errant command was issued during planned network maintenance. An automated auditing tool would ordinarily catch such a command, but a bug in the tool prevented it from doing so. The command led to configuration changes on Facebook’s backbone routers, which coordinate network traffic among its data centers. This had a cascading effect that halted Facebook’s services.
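As a generic illustration of the kind of pre-change gate such an auditing tool represents (a hypothetical sketch of the concept, not a description of Facebook's actual tooling), a maintenance change would be validated and blocked before it could propagate:

```python
# Hypothetical sketch of a pre-change audit gate for maintenance commands.
# It is not Facebook's tooling; it only illustrates the idea that a change
# should be checked, and rejected if unsafe, before it is applied.

from dataclasses import dataclass

@dataclass
class ChangeRequest:
    command: str
    affects_backbone: bool
    withdraws_routes: bool

def audit(change: ChangeRequest) -> list:
    """Return reasons to block the change; an empty list means approved."""
    problems = []
    if change.withdraws_routes and change.affects_backbone:
        problems.append("change would withdraw backbone routes")
    if change.affects_backbone and "all" in change.command.split():
        problems.append("command targets all backbone capacity at once")
    return problems

request = ChangeRequest(command="drain all backbone links",
                        affects_backbone=True, withdraws_routes=True)
issues = audit(request)
print("Blocked: " + "; ".join(issues) if issues else "Approved; applying change")
```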
Setting aside theories of deliberate sabotage, there is evidence that Facebook’s internet routes (Border Gateway Protocol, or BGP) were withdrawn by mistake as part of these configuration changes.
BGP is a mechanism for large internet routers to constantly exchange information about the possible routes for them to deliver network packets. BGP effectively provides very long lists of potential routing paths that are constantly updated. When Facebook stopped broadcasting its presence — something observed by sites that monitor and manage internet traffic — other networks could not find it.
One factor that exacerbated the outage is that Facebook has an atypical internet infrastructure design, specifically related to BGP and another three-letter acronym: DNS, the domain name system. While BGP functions as the internet’s routing map, the DNS serves as its address book. (The DNS translates human-friendly names for online resources into machine-friendly internet protocol addresses.)
Facebook has its own DNS registrar, which manages and broadcasts its domain names. Because of Facebook’s architecture — designed to improve flexibility and control — when its BGP configuration error happened, the Facebook registrar went offline. (As an aside, this caused some domain tools to erroneously show that the Facebook.com domain was available for sale.) As a result, internet service providers and other networks simply could not find Facebook’s network.
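The following toy model (ours, not a description of Facebook's systems; the prefixes, addresses and domain name are illustrative placeholders) shows why withdrawing a network's BGP routes can take its DNS down with it: once the prefixes hosting the authoritative name servers disappear from the routing table, the names stop resolving.

```python
# Toy model of the dependency described above: withdraw a network's prefixes
# from the routing table and its authoritative name servers become
# unreachable, so its domain names stop resolving. Prefixes, addresses and the
# domain name below are illustrative placeholders.

routing_table = {
    "203.0.113.0/24": "reachable",   # example prefix serving web traffic
    "198.51.100.0/24": "reachable",  # example prefix hosting the name servers
}

authoritative_ns = {"example-social.net": "198.51.100.53"}

def prefix_for(ip: str):
    # Crude stand-in for longest-prefix matching: compare the first three octets.
    for prefix in routing_table:
        if ip.split(".")[:3] == prefix.split("/")[0].split(".")[:3]:
            return prefix
    return None

def resolve(name: str) -> str:
    ns_ip = authoritative_ns[name]
    prefix = prefix_for(ns_ip)
    if prefix is None or routing_table.get(prefix) != "reachable":
        return "SERVFAIL: authoritative name server unreachable"
    return "answer from " + ns_ip

print(resolve("example-social.net"))  # answer from 198.51.100.53
routing_table.clear()                 # the BGP withdrawal: all prefixes gone
print(resolve("example-social.net"))  # SERVFAIL: authoritative name server unreachable
```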
How did this then cause a slowdown of the internet? Billions of systems, including mobile devices running a Facebook-owned application in the background, were constantly requesting new “coordinates” for these sites. These requests are ordinarily cached in servers located at the edge, but when the BGP routes disappeared, so did those caches. Requests were routed upstream to large internet servers in core data centers.
The situation was compounded by a self-reinforcing feedback loop, caused in part by application logic and in part by user behavior. Web applications will not accept a BGP routing error as an answer to a request and so they retry, often aggressively. Users and their mobile devices running these applications in the background also won’t accept an error and will repeatedly reload the website or restart the application. The result was an increase of up to 40% in DNS request traffic, which slowed down other networks (and, therefore, increased latency and request timeouts for other web applications). The increased traffic also reportedly led to issues with some cellular networks, including users being unable to make voice-over-IP phone calls.
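A simplified sketch of the amplification at work (illustrative numbers and logic, not measurements from the event): a client that retries on a short fixed interval sends thousands of requests over a six-hour outage, whereas capped exponential backoff keeps the total to tens of requests.

```python
# Illustrative retry-amplification arithmetic (not data from the outage):
# compare a fixed, aggressive retry interval with capped exponential backoff
# over a roughly six-hour outage window.

def requests_during_outage(outage_seconds: int, retry_delays) -> int:
    """Count requests one client sends before the outage window closes."""
    elapsed, count = 0.0, 0
    for delay in retry_delays:
        if elapsed >= outage_seconds:
            break
        count += 1
        elapsed += delay
    return count

def fixed_interval(seconds: float):
    while True:
        yield seconds

def capped_exponential_backoff(base: float = 1.0, cap: float = 300.0):
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)

OUTAGE_SECONDS = 6 * 3600
print(requests_during_outage(OUTAGE_SECONDS, fixed_interval(5)))            # ~4,320 requests
print(requests_during_outage(OUTAGE_SECONDS, capped_exponential_backoff())) # ~80 requests
```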
Facebook’s outage was initially caused by routine network maintenance gone wrong, but the error was missed by an auditing tool and propagated via an automated system, which were likely both built by Facebook. The command error reportedly blocked remote administrators from reverting the configuration change. What’s more, the people with access to Facebook’s physical routers (in Facebook’s data centers) did not have access to the network/logical system. This suggests two things: the network maintenance auditing tool and process were inadequately tested, and there was a lack of specialized staff with network-system access physically inside Facebook’s data centers.
When the only people who can remedy a potential network maintenance problem rely on the network that is being worked on, it seems obvious that a contingency plan needs to be in place.
Facebook, which like other cloud/internet giants has rigorous processes for applying lessons learned, should be better protected next time. But Uptime Institute’s research shows there are no guarantees — cloud/internet giants are particularly vulnerable to network and software configuration errors, a function of their complexity and the interdependency of many data centers, zones, systems and separately managed networks. Ten of the 21 outages in 2020 that affected cloud/internet giants were caused by software/network errors. That these errors can cause traffic pileups that can then snarl completely unrelated applications globally will further concern all those who depend on publicly shared digital infrastructure, including the internet.

More detailed information about the causes and costs of outages is available in our report Annual outage analysis 2021: The causes and impacts of data center outages.
Vertiv’s DCIM ambitions wither on Trellis’ demise

Operators often say that data center infrastructure management (DCIM) software is a necessary evil. Modern facilities need centralized, real-time management and analytics, but DCIM is notoriously difficult to deploy. The software is also not easy to rip and replace — yet some operators unexpectedly will have to. Vertiv, a pioneer and leader in equipment monitoring and management, recently told customers it has discontinued its flagship DCIM platform Trellis. Support for existing contracts will end in 2023.
In addition to dismaying these customers, Vertiv’s decision to drop Trellis raises questions about the company’s overall strategy — and value proposition — for data center management software.
Vertiv’s software business has fallen far in the past decade. Its Aperture asset management software was the most widely used DCIM product as recently as 2015 (when the company was Emerson Network Power). Trellis launched with fanfare in 2012 and became one of the most expansive DCIM platforms, promising to integrate data center facility and IT equipment monitoring.
For years, Trellis was the centerpiece of Vertiv’s strategy toward more intelligent, more automated data centers. It was once positioned as a rival to today’s leading DCIM products, notably from Schneider Electric and Nlyte Software. At one point, IBM and HP were Trellis resellers.
Ultimately, however, Trellis was both overengineered and underwhelming. It was built using the Oracle Fusion application development system so that new functionality and software could be built on top of Trellis. This was a benefit for customers active in the Oracle environment, but not for most others; the architecture was too sprawling and heavy. Vertiv says this was a primary reason for discontinuing the product — it was simply too big and complex.
Other factors were at play. Over several years, operators reported various problems and concerns with Trellis. For example, while Trellis’ asset management functionality was eventually strong, for years it lagged Aperture, the product it was designed to replace. (Vertiv discontinued Aperture in 2017, with support of existing contracts to end by 2023.) This left customers to choose between maintaining a discontinued product or migrating to a newer one with less functionality. Many ended up migrating to a competitor’s platform.
Vertiv’s overall share of the DCIM market has steadily shrunk. Trellis, Aperture, Data Center Planner and SiteScan are all well-known Vertiv software products that are now discontinued. (There is no date yet for when support for existing SiteScan contracts will end.) Moreover, Trellis was a proof point for Vertiv’s software ambitions.
So where is Vertiv heading? The company is focusing on a few remaining software products that are solid but narrowly focused. It is investing in Environet, which it acquired in 2018 when it bought Geist Global, a relatively small vendor of rack power distribution and related sensors. At the time, Environet had a small, loyal customer base. Vertiv has since extended Environet’s power and thermal monitoring, as well as its facility space management features. Yet the software lacks the more sophisticated predictive maintenance capabilities and, notably, the IT device management features of Trellis and other DCIM suites.
Separately, Vertiv continues to develop its IT out-of-band management software, as well as its iCOM-S thermal management tools. Vertiv’s recent deal to acquire the electrical supplier E+I Engineering (for $1.2 billion) means it could potentially add more power monitoring features.
Taken as a portfolio, Vertiv appears underinvested in software. This is not just when compared with some of its equipment rivals, but also in the context of the overall evolution of data center management.
The future of data center management will increasingly be predictive, automated and remote — thanks to sophisticated DCIM software driven by artificial intelligence (AI) and other big-data approaches. New developments such as AI-driven DCIM as a service are helping move smaller facilities, including at the edge, toward this vision. At the same time, DCIM deployed on-premises in larger, mission-critical facilities is becoming increasingly AI-driven, predictive, and able to assimilate a variety of data to extend real-time alarming and closed-loop automation. Schneider and Eaton, for example, both offer cloud-delivered predictive DCIM. Several DCIM vendors are investing in more AI to boost the accuracy and remit of analysis. And earlier this year, Schneider said it was expanding development of its long-standing and comprehensive on-premises DCIM platform, EcoStruxure for Data Centers. (Schneider has also quietly standardized its tools and processes to make it easier to migrate other vendors’ DCIM data into its software, among other things.)
While Vertiv has a long-held strategy to embed more software in its facilities equipment for monitoring and management, it has not executed on, or even articulated, broader ambitions. As one of the largest equipment and services suppliers to data centers, Vertiv, like its peers, does not need to be a primary supplier of software to succeed. However, DCIM is increasingly one of the most critical management levers available to data center operators. Major infrastructure product and service suppliers often view DCIM as a strategic product line, in part because it should increase the “stickiness” of their core business. More importantly, the software’s analytics and control features can show how a supplier’s broader offerings fit easily into an intelligent infrastructure.
Startups brew new chemistries for fresh battery types

Through their public commitments to net-zero carbon emission targets, cloud providers have re-energized talk of a major redesign of critical power systems within the data center sector. The elimination of diesel fuel use is chief among the goals, but the desire for a leaner facility power infrastructure that closely tracks IT capacity needs is also attracting interest. New types of batteries, among other tools such as clean fuels, will likely prove instrumental in this endeavor.
Operators have already started displacing standard valve-regulated lead-acid (VRLA) banks with lithium-ion (Li-ion) batteries to refashion data center backup power. According to the upcoming 2021 Uptime Institute Global Data Center Survey report, nearly half of operators have adopted the technology for their centralized uninterruptible power supply (UPS) plant, up from about a quarter three years ago. Compared with VRLA, Li-ion batteries are smaller, easier to monitor and maintain, and require less stringent climatic controls. (See our post: Lithium-ion batteries for the data center: Are they ready for production yet? for more technical details.)
In addition, Li-ion batteries have a much longer service life, which offsets most of the higher initial cost of Li-ion packs. The change in chemistry also makes possible use cases that were not previously feasible with VRLA — for example, demand-response schemes in which operators can use their Li-ion UPS systems for peak load shaving when there is a price incentive from the electric utility.
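A minimal sketch of the peak-shaving idea follows (all capacities, thresholds and prices below are hypothetical): discharge the UPS batteries toward the IT load when the utility price is high, but never dip into the energy reserved for backup runtime.

```python
# Hypothetical peak-shaving dispatch logic; every number here is illustrative.
# Discharge the Li-ion UPS bank when energy is expensive, while protecting the
# energy reserved for ride-through during a utility outage.

BACKUP_RESERVE_KWH = 300.0   # energy that must remain available for outages
PRICE_THRESHOLD = 0.25       # $/kWh above which shaving is worthwhile (hypothetical)

def shave_kwh(price_per_kwh: float, state_of_charge_kwh: float,
              it_load_kw: float, interval_hours: float = 0.25) -> float:
    """Return how many kWh to discharge toward the IT load this interval."""
    if price_per_kwh <= PRICE_THRESHOLD:
        return 0.0
    headroom_kwh = max(0.0, state_of_charge_kwh - BACKUP_RESERVE_KWH)
    interval_demand_kwh = it_load_kw * interval_hours
    return min(headroom_kwh, interval_demand_kwh)

print(shave_kwh(price_per_kwh=0.40, state_of_charge_kwh=380.0, it_load_kw=400.0))
# -> 80.0: the discharge is capped by the backup reserve, not by the load
```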
But these advantages are not without significant trade-offs. According to recent research by the US-based National Fire Protection Association, Li-ion batteries (regardless of their specific chemistries and construction) represent a higher fire risk than VRLA. Worse still, Li-ion fires are notoriously difficult to suppress, as the breakdown of cells produces combustible gases — a situation that creates the conditions for life-threatening explosions. Even though some chemistries, such as lithium iron phosphate, have a comparatively lower risk of fire (due to a higher ignition point and no release of oxygen), the possibility of thermal runaway persists.
Then there is the matter of sustainability. VRLA batteries are effectively fully recycled to produce new batteries, at no cost to the owner (owing to lead toxicity-related regulation). That is not the case for large Li-ion battery packs, where the industrial base for recycling has yet to develop — most Li-ion-powered products haven’t reached their end of life. Repackaging and repurposing packs for less demanding applications seems a more likely solution than breaking them down into raw materials, but it is still challenging. Furthermore, the supply chains for Li-ion production, along with their social and environmental impacts, remain opaque to buyers.
These issues have already given some data center operators and their contractors pause. But they have also created an opening in the market for other branches of battery chemistry. One promising family for high-power applications relies on sodium (Na) ions instead of lithium as the main charge carrier, while developments in iron-air batteries raise hopes of building economical and safe energy storage for very long backup times.
Na-ion cells pose no thermal runaway risk and offer much higher power-to-capacity ratios. Even better, they can repeatedly discharge their full energy at very high rates, in one to two minutes, then fully recharge in 10 to 20 minutes without overheating. Durability is also better than that of Li-ion cells.
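As a quick back-of-envelope translation (our arithmetic, not vendor figures): a C-rate of N means a full charge or discharge in 1/N hours, so the times quoted above correspond to roughly 30C to 60C on discharge and 3C to 6C on recharge.

```python
# Back-of-envelope C-rate arithmetic for the times quoted above (our
# calculation). A C-rate of N means full capacity in 1/N hours.

def c_rate(minutes_for_full_cycle: float) -> float:
    return 60.0 / minutes_for_full_cycle

print(c_rate(2.0), c_rate(1.0))    # 30.0 60.0 -> discharge in 1-2 minutes is 30-60C
print(c_rate(20.0), c_rate(10.0))  # 3.0 6.0   -> recharge in 10-20 minutes is 3-6C
```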
However, a significant disadvantage is the size of Na-ion batteries compared with state-of-the-art Li-ion (although they are still more compact than VRLA). This profile (high power but lower energy density) makes Na-ion batteries interesting for UPS applications with shorter runtime targets (e.g., up to 10 minutes). More importantly, perhaps, Na-ion seems safe for installation in data halls, where operators can use it in distributed UPS topologies that closely match power service levels to IT capacity requirements. Social and environmental impacts from sodium battery manufacturing also appear less concerning, as these batteries tend to use no rare earth elements or minerals tainted by the conditions of extraction (although practical recyclability is an outstanding question).
Multiple startups have already attracted funding to commercialize sodium batteries for various use cases, both stationary and mobile (including industrial vehicles). Examples are Natron Energy (US), Faradion (UK), and CATL (China). The fabrication process for Na-ion batteries shares many commonalities with that of Li-ion, making industrialization relatively straightforward — producers can tap into a large pool of toolmakers and manufacturing know-how. All the same, it will take another few years (and hundreds of millions of dollars) for Na-ion battery production to ramp up.
Meanwhile, US startup Form Energy has recently raised $200 million (bringing total life-to-date funding past $325 million) to commercialize an entirely different type of battery that aims to solve not the power but the energy capacity problem. The company says it has figured out how to make effective iron-air batteries that can store energy for long periods and discharge it slowly over many hours (the design target is 100 hours) when needed. Because of the inexpensive ingredients, Form claims that its iron-air cells cost a small fraction (about 15%) of the per-kilowatt-hour cost of Li-ion.
Form Energy is now targeting 2025 for the first battery deployments. There seems to be a catch, however: space. According to the firm, a Form Energy system that can hold a 3-megawatt load for four days occupies an acre of land. Even if this can be reduced by lowering runtime to, say, 24 hours, it will still require several times the footprint of a diesel generator set and associated fuel storage. Nonetheless, if product development goes to plan, Form Energy may provide the data center sector with a new piece of its power technology puzzle.
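Some rough arithmetic on the figures quoted above (our calculation, for illustration only):

```python
# Rough arithmetic on Form Energy's quoted figures: 3 MW held for about four
# days on roughly one acre. Our calculation, for illustration only.

POWER_MW = 3.0
HOURS = 4 * 24                # "four days", close to the 100-hour design target
SQ_M_PER_ACRE = 4046.86

energy_mwh = POWER_MW * HOURS                       # about 288 MWh of stored energy
kwh_per_sq_m = energy_mwh * 1000.0 / SQ_M_PER_ACRE  # areal energy density

print(f"{energy_mwh:.0f} MWh on one acre, roughly {kwh_per_sq_m:.0f} kWh per square metre")
# -> 288 MWh on one acre, roughly 71 kWh per square metre
```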
In upcoming reports, Uptime Institute Intelligence will explore developments in energy storage, power infrastructure, and operational methods that may make the stretch goal of a diesel-free data center achievable.
Eyeing an uptick in edge data center demand
Edge data centers are generally small data center facilities — designed for IT workloads up to a few hundred kilowatts — that process data closer to the population of local end users and devices. Earlier in 2021, we surveyed data center owners/operators and product and service suppliers to gauge demand and deployments. The findings were consistent: Data center end-user organizations and suppliers alike expect an uptick in edge data center demand in the near term.
The study suggests a small majority of owners/operators today use between one and five edge data centers and that this is unlikely to change in the next two to three years. This is likely an indicator of overall demand for edge computing — most organizations have some requirement for edge, but do not expect this to change significantly in the short term.
Many others, however, do expect growth. The share of owners/operators that do not use any edge data centers drops from 31% today to 12% in two to three years’ time — indicating a significant increase in owner/operator uptake. Furthermore, the portion of owners/operators in our study using more than 20 edge data centers today is expected to double in the next two to three years (from 9% today to 20% in two to three years), as shown in Figure 1.
Over 90% of respondents in North America are planning to use more than five edge data centers in two to three years’ time, a far higher proportion than the 30% to 60% of respondents in other regions. The largest portion of owners/operators planning to use more than 20 edge data centers within the next few years is in the US and Canada, closely followed by Asia-Pacific and China.
The main drivers for edge data center expansion are a need to reduce latency to improve or add new IT services, as well as requirements to reduce network costs and/or bandwidth constraints in data transport over wide distances.
Uptime Institute’s research shows that large deployments of hundreds or more edge data centers are expected for many applications, including telecom networks, the internet of things in oil and gas, retail, cloud gaming, video streaming, public transportation systems, and the growth of large international industrial companies with multiple offices. Several large-scale edge data center projects are today at a prototyping stage with trials planned using one to 10 sites. Full-scale deployments involving tens of sites are planned for the coming three years.
The view from suppliers of edge data centers in our study was similarly strong. Most suppliers today help to build, supply or maintain between six and 10 edge data centers per year. Many expect this will grow to 21 to 50 annually in two to three years’ time, as shown in Figure 2. Only a minimal portion of suppliers built, supplied or maintained more than 100 edge data centers in 2020; looking ahead two to three years, 15% of suppliers in our study expect they will be handling 100+ edge data center projects (a threefold increase from today).
Suppliers in the US and Canada are particularly bullish; almost one in three (32%) expect a yearly volume of more than 100 edge data centers in two to three years’ time. Those in Europe are also optimistic; roughly one-fifth of suppliers (18%) expect an annual volume of more than 100 edge data centers in the coming two to three years. Some suppliers report a recent change in market demand, with orders and tenders going from being sporadic and for a single-digit number of sites, to more requests for larger projects with tens of sites. The change may be attributed to a solidification of clients’ corporate strategy for edge computing.
It seems clear that demand for edge data centers is expected to grow across the world in the near term and especially in North America. Owners, operators and suppliers alike anticipate growth across different industry verticals. The digitization effects of COVID-19, the expansion by big clouds to the edge, and the mostly speculative deployments of 5G are among key factors driving demand. However, there is complexity involved in developing edge business cases and it is not yet clear that any single use case will drive edge data centers in high volumes.
The full report Demand and speculation fuel edge buildout is available here.
Fastly outage underscores slow creep of digital services risk
A recent outage at content delivery network Fastly took down thousands of websites in different countries, including big names, such as Amazon, Twitter and Spotify, for nearly an hour. It is the latest large outage highlighting the downside of a key trend in digital infrastructure: The growth in dependency on digital service providers can undermine infrastructure resilience and business continuity.
The conundrum here is one of conflicting double standards. While many companies expend substantial money and time on hardening and testing their own data center and IT infrastructure against disruptions, they also increasingly rely on cloud service providers whose architectural details and operational practices are largely hidden from them.
Some confusion stems from high expectations and a high level of trust in the resiliency of the greatly distributed topologies of cloud and network technologies — often amounting to a leap of faith. Large providers routinely quote five nines (99.999%) or better availability, or imply that system-wide failure is nearly impossible — but this is clearly not the case. While many do deliver on these promises most of the time, outages still happen. When they do, the impact is often widespread.
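For context, some quick arithmetic on what those availability claims allow (a simple calculation, not a statement about any provider's contractual terms): five nines permits only a few minutes of downtime per year, far less than the nearly hour-long Fastly outage described above.

```python
# What "five nines" and "four nines" allow per year, versus an outage of
# nearly an hour. Simple arithmetic, not a contractual definition.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def allowed_downtime_minutes(availability: float) -> float:
    return MINUTES_PER_YEAR * (1.0 - availability)

print(round(allowed_downtime_minutes(0.99999), 1))  # ~5.3 minutes per year at 99.999%
print(round(allowed_downtime_minutes(0.9999), 1))   # ~52.6 minutes per year at 99.99%
```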
Fastly is far from being a lone culprit. In our report Annual outage analysis 2021: The causes and impacts of data center outages, Uptime Institute counts over 80 major outages at cloud providers in the last five years (not including other digital services and telecommunication providers). The Fastly outage serves as a reminder that cloud- and network-based services are not invulnerable to service interruptions — in fact, they sometimes prove to be surprisingly fragile, susceptible to tiny errors.
Fastly’s own initial postmortem of the event notes that the global outage was a result of a legitimate and valid configuration change by a single customer. That change led to an unexpected effect on the wider system, ultimately taking down dozens of other customers’ services. Fastly said it did not know how the bug made it into production. In some ways, this Fastly outage is similar to an Amazon Web Services (AWS) incident a few years ago, when a single storage configuration error made by an administrator ground some AWS services to a near halt on the US East Coast, causing major disruption.
Major outages at large digital service providers point to a need for greater transparency of their operational and architectural resiliency, so customers can better assess their own risk exposure. Research from Uptime Institute Intelligence shows that only a small proportion of businesses — fewer than one in five in our study — thought they had sufficient understanding of the resilience of their cloud providers to trust them to run their mission-critical workloads (see Figure 1). Nearly a quarter said they would likely use public cloud for such workloads if they gained greater visibility into the cloud providers’ resiliency design and procedures.
Furthermore, Uptime Institute’s surveys have shown that most businesses manage a rich mix of digital infrastructures and services, spanning their own private (on-premises) data centers, leased facilities, hosted capacity, and various public cloud-powered services. This comes at the cost of added complexity, which generates risks, both known and unknown, if not managed with full visibility. In a recent survey, of those who have a view on how their services’ resilience has changed as a result of adopting hybrid environments, a strong majority reported a welcome improvement. Still, one in seven reported a decrease in overall resilience.
The bottom line is that without the possibility of verifying availability and reliability claims from cloud and other IT services providers, and without proactively mitigating the challenges that hybrid architectures pose, enterprises might be signing up to added risks to their business continuity and data security. Fastly’s customers have received a stark reminder of this growing reality at their own expense.
https://journal.uptimeinstitute.com/wp-content/uploads/2021/06/Fastly-blog-featured-diagram.jpg4661024Daniel Bizo, Research Director, Uptime Institute Intelligence, [email protected]https://journal.uptimeinstitute.com/wp-content/uploads/2022/12/uptime-institute-logo-r_240x88_v2023-with-space.pngDaniel Bizo, Research Director, Uptime Institute Intelligence, [email protected]2021-06-11 15:15:072021-06-11 15:17:15Fastly outage underscores slow creep of digital services risk
New ASHRAE guidelines challenge efficiency drive
/in Design, Executive, Operations/by Daniel Bizo, Research Director, Uptime Institute Intelligence, [email protected]Earlier in 2021, ASHRAE’s Technical Committee 9.9 published an update — the fifth edition — of its Thermal Guidelines for Data Processing Environments. The update recommends important changes to data center thermal operating envelopes: the presence of pollutants is now a factor, and it introduces a new class of IT equipment for high-density computing. The new advice can, in some cases, lead operators to not only alter operational practices but also shift set points, a change that may impact both energy efficiency and contractual service level agreements (SLA) with data center services providers.
Since the original release in 2004, ASHRAE’s Thermal Guidelines have been instrumental in setting cooling standards for data centers globally. The 9.9 committee collects input from a wide cross-section of the IT and data center industry to promote an evidence-based approach to climatic controls, one which helps operators better understand both risks and optimization opportunities. Historically, most changes to the guidelines pointed data center operators toward further relaxations of climatic set points (e.g., temperature, relative humidity, dew point), which also stimulated equipment makers to develop more efficient air economizer systems.
In the fifth edition, ASHRAE adds some major caveats to its thermal guidance. While the recommendations for relative humidity (RH) extend the range up to 70% (the previous cutoff was 60%), this is conditional on the data hall having low concentrations of pollutant gases. If the presence of corrosive gases is above the set thresholds, ASHRAE now recommends operators keep RH under 50% — below its previous recommended limit. To monitor, operators should place metal strips, known as “reactivity coupons,” in the data hall and measure corroded layer formation; the limit for silver is 200 ångström per month and for copper, 300 ångström per month.
ASHRAE bases its enhanced guidance on an experimental study on the effects of gaseous pollutants and humidity on electronics, performed between 2015 and 2018 with researchers from Syracuse University (US). The experiments found that the presence of chlorine and hydrogen sulfide accelerates copper corrosion under higher humidity conditions. Without chlorine, hydrogen sulfide or similarly strong catalysts, there was no significant corrosion up to 70% RH, even when other, less aggressive gaseous pollutants (such as ozone, nitrogen dioxide and sulfur dioxide) were present.
Because corrosion from chlorine and hydrogen sulfide at 50% RH is still above acceptable levels, ASHRAE suggests operators consider chemical filtration to decontaminate.
While the data ASHRAE uses is relatively new, its conclusions echo previous standards. Those acquainted with the environmental requirements of data storage systems may find the guidance familiar — data storage vendors have been following specifications set out in ANSI/ISA-71.04 since 1985 (last updated in 2013). Long after the era of tapes, storage drives (hard disks and solid state alike) remain the foremost victims of corrosion, as their low-temperature operational requirements mean increased moisture absorption and adsorption.
However, many data center operators do not routinely measure gaseous contaminant levels, and so do not monitor for corrosion. If strong catalysts are present but undetected, this might lead to higher than expected failure rates even if temperature and RH are within target ranges. Worse still, lowering supply air temperature in an attempt to counter failures might make them more likely. ASHRAE recommends operators consider a 50% RH limit if they don’t perform reactivity coupon measurements. Somewhat confusingly, it also makes an allowance for following specifications set out in its previous update (the fourth edition), which recommends a 60% RH limit.
Restricted envelope for high-density IT systems
Another major change in the latest update is the addition of a new class of IT equipment, separate from the pre-existing classes of A1 through A4. The new class, H1, includes systems that tightly integrate a number of high-powered components (server processors, accelerators, memory chips and networking controllers). ASHRAE says these high-density systems need more narrow air temperature bands — it recommends 18°C/64.4°F to 22°C/71.6°F (as opposed to 18°C/64.4°F to 27°C/80.6°F) — to meet its cooling requirements. The allowable envelope has become tighter as well, with upper limits of 25°C/77°F for class H1, instead of 32°C/89.6°F (see Figure 1).
This is because, according to ASHRAE, there is simply not enough room in some dense systems for the higher performance heat sinks and fans that could keep components below temperature limits across the generic (classes A1 through A4) recommended envelope. ASHRAE does not stipulate what makes a system class H1, leaving it to the IT vendor to specify its products as such.
There are some potentially far-reaching implications of these new envelopes. Operators have over the past decade built and equipped a large number of facilities based on ASHRAE’s previous guidance. Many of these relatively new data centers take advantage of the recommended temperature bands by using less mechanical refrigeration and more economization. In several locations — Dublin, London and Seattle, for example — it is even possible for operators to completely eliminate mechanical cooling yet stay within ASHRAE guidelines by marrying the use of evaporative and adiabatic air handlers with advanced air-flow design and operational discipline. The result is a major leap in energy efficiency and the ability to support more IT load from a substation.
Such optimized facilities will not typically lend themselves well to the new envelopes. That most of these data centers can support 15- to 20-kilowatt IT racks doesn’t help either, since H1 is a new equipment class requiring a lower maximum for temperature — regardless of the rack’s density. To maintain the energy efficiency of highly optimized new data center designs, dense IT may need to have its own dedicated area with independent cooling. Indeed, ASHRAE says that operators should separate H1 and other more restricted equipment into areas with their own controls and cooling equipment.
Uptime will be watching with interest how colocation providers, in particular, will handle this challenge, as their typical SLAs depend heavily on the ASHRAE thermal guidelines. What may be considered an oddity today may soon become common, given that semiconductor power keeps escalating with every generation. Facility operators may deploy direct liquid cooling for high-density IT as a way out of this bind.
Too big to fail? Facebook’s global outage
/in Executive, News, Operations/by Rhonda Ascierto, Vice President, Research, Uptime InstituteThe bigger the outage, the greater the need for explanations and, most importantly, for taking steps to avoid a repeat.
By any standards, the outage that affected Facebook on Monday, October 4th, was big. For more than six hours, Facebook and its other businesses, including WhatsApp, Instagram and Oculus VR, disappeared from the internet — not just in a few regions or countries, but globally. So many users and machines kept retrying these websites, it caused a slowdown of the internet and issues with cellular networks.
While Facebook is large enough to ride through the immediate financial impact, it should not be dismissed. Market watchers estimate that the outage cost Facebook roughly $60 million in revenues over its more than six-hour period. The company’s shares fell 4.9% on the day, which translated into more than $47 billion in lost market cap.
Facebook may recover those losses, but the bigger ramifications may be reputational and legal. Uptime Institute research shows that the level of outages from hyperscale operators is similar to that experienced by colocation companies and enterprises — despite their huge investments in distributed availability zones and global load and traffic management. In 2020, Uptime Institute recorded 21 cloud/internet giant outages, with associated financial and reputational damage. With antitrust, data privacy and, most recently, children’s mental health concerns swirling about Facebook, the company is unlikely to welcome further reputational and legal scrutiny.
What was the cause of Facebook’s outage? The company said there was an errant command issued during planned network maintenance. While an automated auditing tool would ordinarily catch an errant command, there was a bug in the tool that didn’t properly stop it. The command led to configuration changes on Facebook’s backbone routers that coordinate network traffic among its data centers. This had a cascading effect that halted Facebook’s services.
Setting aside theories of deliberate sabotage, there is evidence that Facebook’s internet routes (Border Gateway Protocol, or BGP) were withdrawn by mistake as part of these configuration changes.
BGP is a mechanism for large internet routers to constantly exchange information about the possible routes for them to deliver network packets. BGP effectively provides very long lists of potential routing paths that are constantly updated. When Facebook stopped broadcasting its presence — something observed by sites that monitor and manage internet traffic — other networks could not find it.
One factor that exacerbated the outage is that Facebook has an atypical internet infrastructure design, specifically related to BGP and another three-letter acronym: DNS, the domain name system. While BGP functions as the internet’s routing map, the DNS serves as its address book. (The DNS translates human-friendly names for online resources into machine-friendly internet protocol addresses.)
Facebook has its own DNS registrar, which manages and broadcasts its domain names. Because of Facebook’s architecture — designed to improve flexibility and control — when its BPG configuration error happened, the Facebook registrar went offline. (As an aside, this caused some domain tools to erroneously show that the Facebook.com domain was available for sale.) As a result, internet service providers and other networks simply could not find Facebook’s network.
How did this then cause a slowdown of the internet? Billions of systems, including mobile devices running a Facebook-owned application in the background, were constantly requesting new “coordinates” for these sites. These requests are ordinarily cached in servers located at the edge, but when the BGP routes disappeared, so did those caches. Requests were routed upstream to large internet servers in core data centers.
The situation was compounded by a negative feedback loop, caused in part by application logic and in part by user behavior. Web applications will not accept a BGP routing error as an answer to a request and so they retry, often aggressively. Users and their mobile devices running these applications in the background also won’t accept an error and will repeatedly reload the website or reboot the application. The result was an up to 40% increase in DNS request traffic, which slowed down other networks (and, therefore, increased latency and timeout requests for other web applications). The increased traffic also reportedly led to issues with some cellular networks, including users being unable to make voice-over-IP phone calls.
Facebook’s outage was initially caused by routine network maintenance gone wrong, but the error was missed by an auditing tool and propagated via an automated system, which were likely both built by Facebook. The command error reportedly blocked remote administrators from reverting the configuration change. What’s more, the people with access to Facebook’s physical routers (in Facebook’s data centers) did not have access to the network/logical system. This suggests two things: the network maintenance auditing tool and process were inadequately tested, and there was a lack of specialized staff with network-system access physically inside Facebook’s data centers.
When the only people who can remedy a potential network maintenance problem rely on the network that is being worked on, it seems obvious that a contingency plan needs to be in place.
Facebook, which like other cloud/internet giants has rigorous processes for applying lessons learned, should be better protected next time. But Uptime Institute’s research shows there are no guarantees — cloud/internet giants are particularly vulnerable to network and software configuration errors, a function of their complexity and the interdependency of many data centers, zones, systems and separately managed networks. Ten of the 21 outages in 2020 that affected cloud/internet giants were caused by software/network errors. That these errors can cause traffic pileups that can then snarl completely unrelated applications globally will further concern all those who depend on publicly shared digital infrastructure – including the internet.
More detailed information about the causes and costs of outages is available in our report Annual outage analysis 2021: The causes and impacts of data center outages.
Vertiv’s DCIM ambitions wither on Trellis’ demise
/in Executive, News, Operations/by Rhonda Ascierto, Vice President, Research, Uptime InstituteOperators often say that data center infrastructure management (DCIM) software is a necessary evil. Modern facilities need centralized, real-time management and analytics, but DCIM is notoriously difficult to deploy. The software is also not easy to rip and replace — yet some operators unexpectedly will have to. Vertiv, a pioneer and leader in equipment monitoring and management, recently told customers it has discontinued its flagship DCIM platform Trellis. Support for existing contracts will end in 2023.
In addition to dismaying these customers, Vertiv’s decision to drop Trellis raises questions about the company’s overall strategy — and value proposition — for data center management software.
Vertiv’s software business has fallen far in the past decade. Its Aperture asset management software was the most widely used DCIM product as recently as 2015 (when the company was Emerson Network Power). Trellis launched with fanfare in 2012 and became one of the most expansive DCIM platforms, promising to integrate data center facility and IT equipment monitoring.
For years, Trellis was the centerpiece of Vertiv’s strategy toward more intelligent, more automated data centers. It was once positioned as a rival to today’s leading DCIM products, notably from Schneider Electric and Nlyte Software. At one point, IBM and HP were Trellis resellers.
Ultimately, however, Trellis was both overengineered and underwhelming. It was built using the Oracle Fusion application development system so that new functionality and software could be built on top of Trellis. This was a benefit for customers active in the Oracle environment, but not for most everyone else; the architecture was too sprawling and heavy. Vertiv says this was a primary reason for it discontinuing the product — it was simply too big and complex.
Other factors were at play. Over several years, operators reported various problems and concerns with Trellis. For example, while Trellis’ asset management functionality was eventually strong, for years it lagged Aperture, the product it was designed to replace. (Vertiv discontinued Aperture in 2017, with support of existing contracts to end by 2023.) This left customers to choose between maintaining a discontinued product or migrating to a newer one with less functionality. Many ended up migrating to a competitor’s platform.
Vertiv’s overall share of the DCIM market has steadily shrunk. Trellis, Aperture, Data Center Planner, SiteScan – all are well-known Vertiv software products that are now discontinued. (No date yet for when support for existing SiteScan contracts will end.) Moreover, Trellis was a proof point for Vertiv’s software ambitions.
So where is Vertiv heading? The company is focusing on a few remaining software products that are solid but narrowly focused. It is investing in Environet, which it acquired in 2018 when it bought Geist Global, a relatively small vendor of rack power distribution and related sensors. At the time, Environet had a small, loyal customer base. Vertiv has since extended its power and thermal monitoring, as well as its facilities space management features. Yet the software lacks the more sophisticated predictive maintenance and, notably, the IT device management capabilities of Trellis and other DCIM suites.
Separately, Vertiv continues to develop its IT out-of-band management software, as well as its iCOM-S thermal management tools. Vertiv’s recent deal to acquire the electrical supplier E+I Engineering (for $1.2 billion) means it could potentially add more power monitoring features.
Taken as a portfolio, Vertiv appears underinvested in software. This is not just when compared with some of its equipment rivals, but also in the context of the overall evolution of data center management.
The future of data center management will increasingly be predictive, automated and remote — thanks to sophisticated DCIM software driven by artificial intelligence (AI) and other big-data approaches. New developments such as AI-driven DCIM as a service are helping move smaller facilities, including at the edge, toward this vision. At the same time, DCIM deployed on-premises in larger, mission-critical facilities is becoming increasingly AI-driven, predictive, and able to assimilate a variety of data to extend real-time alarming and closed-loop automation. Schneider and Eaton, for example, both offer cloud-delivered predictive DCIM. Several DCIM vendors are investing in more AI to boost the accuracy and remit of analysis. And earlier this year, Schneider said it was expanding development of its long-standing and comprehensive on-premises DCIM platform, EcoStruxure for Data Centers. (Schneider has also quietly standardized its tools and processes to make it easier to migrate other vendors’ DCIM data into its software, among other things.)
While Vertiv has a long-held strategy to embed more software in its facilities equipment for monitoring and management, it has not executed or even articulated broader ambitions. As one of the largest equipment and services suppliers to data centers, Vertiv, like its peers, does not need to be a primary supplier of software to succeed. However, DCIM is increasingly one of the most critical management levers available to data center operators. Major infrastructure product and service suppliers often view DCIM as a strategic product line – in part because it should increase the “stickiness” of their core business. More importantly, the software’s analytics and control features can show how a supplier’s broader offerings easily fit into an intelligence infrastructure.
Startups brew new chemistries for fresh battery types
/in Executive, Operations/by Daniel Bizo, Research Director, Uptime Institute Intelligence, [email protected]Through their public commitments to net-zero carbon emission targets, cloud providers have re-energized talks of a major redesign of critical power systems within the data center sector. The elimination of diesel fuel use is chief among the goals, but the desire for a leaner facility power infrastructure that closely tracks IT capacity needs is also attracting interest. New types of batteries, among other tools such as clean fuels, will likely prove instrumental in this endeavor.
Operators have already started displacing standard valve-regulated lead-acid (VRLA) banks with lithium-ion (Li-ion) batteries to refashion data center backup power. According to the upcoming 2021 Uptime Institute Global Data Center Survey report, nearly half of operators have adopted the technology for their centralized uninterruptible power supply (UPS) plant, up from about a quarter three years ago. Compared with VRLA, Li-ion batteries are smaller, easier to monitor and maintain, and require less stringent climatic controls. (See our post: Lithium-ion batteries for the data center: Are they ready for production yet? for more technical details.)
In addition, Li-ion batteries have a much longer service life, which offsets most of the higher initial cost of Li-ion packs. The change in chemistry also makes possible use cases that were not previously feasible with VRLA — for example, demand-response schemes in which operators can use their Li-ion UPS systems for peak load shaving when there is a price incentive from the electric utility.
But these advantages are not without significant trade-offs. According to recent research by the US-based National Fire Protection Association, Li-ion batteries (regardless of their specific chemistries and construction) represent a higher fire risk than VRLA. Worse still, Li-ion fires are notoriously difficult to suppress, as the breakdown of cells produces combustible gases — a situation that creates the conditions for life-threatening explosions. Even though some chemistries, such as lithium iron phosphate, have a comparatively lower risk of fire (due to a higher ignition point and no release of oxygen), the possibility of thermal runaway persists.
Then there is the matter of sustainability. VRLA batteries are effectively fully recycled to produce new batteries, at no cost to the owner (owing to lead toxicity-related regulation). That is not the case for large Li-ion battery packs, where the industrial base for recycling has yet to develop — most Li-ion-powered products haven’t reached their end of life. Repackaging and repurposing for less demanding applications seem a more likely solution than breaking down to materials but is still challenging. Furthermore, supply chains for Li-ion production, along with their social and environmental impacts, stay obscure to buyers.
These issues have already given some data center operators and their contractors pause. But they have also created an opening in the market for other branches of battery chemistry. One promising family for high-power applications relies on sodium (Na) ions rather than lithium as the charge carrier, while developments in iron-air batteries raise hopes of building economical and safe energy storage for very long backup times.
Na-ion cells pose no thermal runaway risk and offer much higher power-to-capacity ratios than Li-ion. Even better, they can repeatedly discharge their full energy at very high rates, in one to two minutes, then fully recharge in 10 to 20 minutes without overheating. Durability is also better than that of Li-ion cells.
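To put those figures in perspective, the sketch below converts the quoted discharge and recharge times into C-rates and sizes a hypothetical short-runtime UPS. The 1 MW load and five-minute runtime are assumptions chosen for illustration, not vendor specifications.

```python
# Illustrative arithmetic only: converts the quoted discharge and recharge times
# into C-rates, then sizes a hypothetical Na-ion string for a short-runtime UPS.
# The 1 MW load and five-minute runtime are assumptions for the example,
# not figures from any vendor datasheet.

def c_rate(minutes_to_empty_or_fill: float) -> float:
    """C-rate implied by emptying (or filling) the full capacity in the given time."""
    return 60.0 / minutes_to_empty_or_fill

# Discharge in one to two minutes, recharge in 10 to 20 minutes, per the text.
print(f"Discharge rate: {c_rate(2):.0f}C to {c_rate(1):.0f}C")    # 30C to 60C
print(f"Recharge rate:  {c_rate(20):.0f}C to {c_rate(10):.0f}C")  # 3C to 6C

# Energy needed for an assumed 1 MW UPS load held for five minutes:
load_kw, runtime_min = 1000, 5
energy_kwh = load_kw * runtime_min / 60
print(f"Required energy: {energy_kwh:.0f} kWh")  # roughly 83 kWh
```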
However, a significant disadvantage is the size of Na-ion batteries compared with state-of-the-art Li-ion (although they are still more compact than VRLA). This combination of high power and lower energy density makes Na-ion batteries interesting for UPS applications with shorter runtime targets (e.g., up to 10 minutes). More importantly, perhaps, Na-ion appears safe for installation in data halls, where operators can use it in distributed UPS topologies that closely match power service levels to IT capacity requirements. Social and environmental impacts from sodium battery manufacturing also appear less concerning, as these batteries tend to use no rare earth elements or minerals tainted by the conditions of their extraction (although practical recyclability remains an open question).
Multiple startups have already attracted funding to commercialize sodium batteries for various use cases, both stationary and mobile (including industrial vehicles). Examples are Natron Energy (US), Faradion (UK), and CATL (China). The fabrication process for Na-ion batteries shares many commonalities with Li-ion, making its industrialization relatively straightforward — producers can tap into a large pool of toolmakers and manufacturing know-how. All the same, it will take another few years (and hundreds of millions of dollars) for Na-ion battery production to ramp up.
Meanwhile, US startup Form Energy has recently raised $200 million (bringing total life-to-date funding past $325 million) to commercialize an entirely different type of battery, one that aims to solve not the power problem but the energy capacity problem. The company says it has figured out how to make effective iron-air batteries that can store energy for long periods and discharge it slowly over many hours (the design target is 100 hours) when needed. Because of the inexpensive ingredients, Form claims that iron-air cells cost about 15% as much as Li-ion per kilowatt-hour.
Form Energy is now targeting 2025 for the first battery deployments. There seems to be a catch, however: space. According to the firm, a Form Energy system that can hold a 3-megawatt load for four days occupies an acre of land. Even if this can be reduced by lowering runtime to, say, 24 hours, it will still require several times the footprint of a diesel generator set and associated fuel storage. Nonetheless, if product development goes to plan, Form Energy may provide the data center sector with a new piece of its power technology puzzle.
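As a rough check of those figures, here is a back-of-envelope calculation based on the quoted 3-megawatt, four-day, one-acre system. The assumption that footprint scales roughly linearly with stored energy is ours, for illustration only.

```python
# Back-of-envelope check of the quoted figures: a 3 MW load held for four days
# on roughly one acre of land. The linear scaling of footprint with stored
# energy is our assumption, for illustration only.

ACRE_M2 = 4046.86           # square meters per acre
load_mw, runtime_h = 3, 96  # 3 MW for four days, per the text

energy_mwh = load_mw * runtime_h             # 288 MWh on roughly one acre
areal_density = energy_mwh * 1000 / ACRE_M2  # ~71 kWh per square meter of site

# If footprint scaled linearly with stored energy, a 24-hour runtime would need about:
footprint_24h_acres = 24 / runtime_h         # ~0.25 acre for the same 3 MW load

print(f"{energy_mwh} MWh, ~{areal_density:.0f} kWh/m2, ~{footprint_24h_acres:.2f} acre at 24 h")
```

Even at a quarter of an acre, the footprint would remain several times that of a diesel generator set with its fuel storage, which is the trade-off described above.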
In upcoming reports, Uptime Institute Intelligence will explore developments in energy storage, power infrastructure, and operational methods that may make the stretch goal of a diesel-free data center achievable.
Eyeing an uptick in edge data center demand
By Dr. Tomas Rahkonen, Research Director of Distributed Data Centers, Uptime Institute

Edge data centers are generally small data center facilities — designed for IT workloads up to a few hundred kilowatts — that process data closer to the population of local end users and devices. Earlier in 2021, we surveyed data center owners/operators and product and service suppliers to gauge demand and deployments. The findings were consistent: Data center end-user organizations and suppliers alike expect an uptick in edge data center demand in the near term.
The study suggests a small majority of owners/operators today use between one and five edge data centers and that this is unlikely to change in the next two to three years. This is likely an indicator of overall demand for edge computing — most organizations have some requirement for edge, but do not expect this to change significantly in the short term.
Many others, however, do expect growth. The share of owners/operators that do not use any edge data centers drops from 31% today to 12% in two to three years’ time — indicating a significant increase in owner/operator uptake. Furthermore, the portion of owners/operators in our study using more than 20 edge data centers today is expected to double in the next two to three years (from 9% today to 20% in two to three years), as shown in Figure 1.
Over 90% of respondents in North America are planning to use more than five edge data centers in two to three years’ time, a far higher proportion than the 30% to 60% of respondents in other regions. The largest portion of owners/operators planning to use more than 20 edge data centers within the next few years is in the US and Canada, closely followed by Asia-Pacific and China.
The main drivers for edge data center expansion are the need to reduce latency, in order to improve existing IT services or add new ones, as well as the need to reduce network costs and bandwidth constraints when transporting data over wide distances.
Uptime Institute’s research shows that large deployments of hundreds or more edge data centers are expected for many applications, including telecom networks, the internet of things in oil and gas, retail, cloud gaming, video streaming, public transportation systems, and support for large international industrial companies with multiple sites. Several large-scale edge data center projects are today at a prototyping stage, with trials planned using one to 10 sites. Full-scale deployments involving tens of sites are planned for the coming three years.
The view from suppliers of edge data centers in our study was similarly strong. Most suppliers today help to build, supply or maintain between six and 10 edge data centers per year. Many expect this will grow to 21 to 50 annually in two to three years’ time, as shown in Figure 2. Only a small portion of suppliers built, supplied or maintained more than 100 edge data centers in 2020; looking ahead two to three years, 15% of suppliers in our study expect to handle 100 or more edge data center projects annually (a threefold increase from today).
Suppliers in the US and Canada are particularly bullish; almost one in three (32%) expect a yearly volume of more than 100 edge data centers in two to three years’ time. Those in Europe are also optimistic; roughly one-fifth of suppliers (18%) expect an annual volume of more than 100 edge data centers in the coming two to three years. Some suppliers report a recent change in market demand, with orders and tenders shifting from sporadic requests for a single-digit number of sites to more frequent requests for larger projects with tens of sites. The change may be attributed to a solidification of clients’ corporate strategies for edge computing.
It seems clear that demand for edge data centers is expected to grow across the world in the near term, and especially in North America. Owners, operators and suppliers alike anticipate growth across different industry verticals. The digitization effects of COVID-19, the expansion by big clouds to the edge, and the mostly speculative deployments of 5G are among the key factors driving demand. However, edge business cases are complex to develop, and it is not yet clear that any single use case will drive edge data centers in high volumes.
The full report Demand and speculation fuel edge buildout is available here.
Fastly outage underscores slow creep of digital services risk
By Daniel Bizo, Research Director, Uptime Institute Intelligence

A recent outage at content delivery network Fastly took down thousands of websites in different countries, including big names such as Amazon, Twitter and Spotify, for nearly an hour. It is the latest large outage highlighting the downside of a key trend in digital infrastructure: The growth in dependency on digital service providers can undermine infrastructure resilience and business continuity.
The conundrum here is one of double standards. While many companies expend substantial money and time on hardening and testing their own data center and IT infrastructure against disruptions, they also increasingly rely on cloud service providers whose architectural details and operational practices are largely hidden from them.
Some confusion stems from high expectations and a high level of trust in the resiliency of the highly distributed topologies of cloud and network technologies — often amounting to a leap of faith. Large providers routinely quote five nines or more of availability, or imply that system-wide failure is nearly impossible — but this is clearly not the case. While many do deliver on these promises most of the time, outages still happen. When they do, the impact is often broad and severe.
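For context on what such claims imply, the short sketch below converts an availability percentage into the downtime it allows per year; this is standard arithmetic, not any provider’s published figure.

```python
# Converts an availability percentage into the downtime it allows per year.
# Standard arithmetic, not any provider's published figures.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% availability allows "
          f"{downtime_minutes_per_year(nines):.1f} minutes of downtime per year")

# Five nines allows roughly 5.3 minutes of downtime a year; a single
# near-hour-long incident exceeds that budget many times over.
```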
Fastly is far from being a lone culprit. In our report Annual outage analysis 2021: The causes and impacts of data center outages, Uptime Institute counts over 80 major outages at cloud providers in the last five years (not including other digital services and telecommunication providers). The Fastly outage serves as a reminder that cloud- and network-based services are not invulnerable to service interruptions — in fact, they sometimes prove to be surprisingly fragile, susceptible to tiny errors.
Fastly’s own initial postmortem of the event notes that the global outage was a result of a legitimate and valid configuration change by a single customer. That change led to an unexpected effect on the wider system, ultimately taking down dozens of other customers’ services. Fastly said it did not know how the bug made it into production. In some ways, this Fastly outage is similar to an Amazon Web Services (AWS) incident a few years ago, when a single storage configuration error made by an administrator ground some AWS services to a near halt on the US East Coast, causing major disruption.
Major outages at large digital service providers point to a need for greater transparency of their operational and architectural resiliency, so customers can better assess their own risk exposure. Research from Uptime Institute Intelligence shows that only a small proportion of businesses — fewer than one in five in our study — thought they had sufficient understanding of the resilience of their cloud providers to trust them to run their mission-critical workloads (see Figure 1). Nearly a quarter said they would likely use public cloud for such workloads if they gained greater visibility into the cloud providers’ resiliency design and procedures.
Furthermore, Uptime Institute’s surveys have shown that most businesses manage a rich mix of digital infrastructures and services, spanning their own private (on-premises) data centers, leased facilities, hosted capacity, and various public cloud-powered services. This comes at the cost of added complexity, which generates risks, both known and unknown, if not managed with full visibility. In a recent survey, most of those with a view on how their services’ resilience has changed as a result of adopting hybrid environments reported a welcome improvement. Still, one in seven reported a decrease in overall resilience.
The bottom line is that without the possibility of verifying availability and reliability claims from cloud and other IT services providers, and without proactively mitigating the challenges that hybrid architectures pose, enterprises might be signing up to added risks to their business continuity and data security. Fastly’s customers have received a stark reminder of this growing reality at their own expense.