New ASHRAE guidelines challenge efficiency drive

Earlier in 2021, ASHRAE’s Technical Committee 9.9 published an update — the fifth edition — of its Thermal Guidelines for Data Processing Environments. The update recommends important changes to data center thermal operating envelopes: the presence of pollutants is now a factor, and it introduces a new class of IT equipment for high-density computing. The new advice can, in some cases, lead operators not only to alter operational practices but also to shift set points, a change that may affect both energy efficiency and contractual service level agreements (SLAs) with data center service providers.
Since the original release in 2004, ASHRAE’s Thermal Guidelines have been instrumental in setting cooling standards for data centers globally. The 9.9 committee collects input from a wide cross-section of the IT and data center industry to promote an evidence-based approach to climatic controls, one which helps operators better understand both risks and optimization opportunities. Historically, most changes to the guidelines pointed data center operators toward further relaxations of climatic set points (e.g., temperature, relative humidity, dew point), which also stimulated equipment makers to develop more efficient air economizer systems.
In the fifth edition, ASHRAE adds some major caveats to its thermal guidance. While the recommendations for relative humidity (RH) extend the range up to 70% (the previous cutoff was 60%), this is conditional on the data hall having low concentrations of pollutant gases. If the presence of corrosive gases is above the set thresholds, ASHRAE now recommends operators keep RH under 50% — below its previous recommended limit. To monitor, operators should place metal strips, known as “reactivity coupons,” in the data hall and measure corroded layer formation; the limit for silver is 200 ångström per month and for copper, 300 ångström per month.
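To illustrate how these thresholds might be applied in practice, the short sketch below maps monthly reactivity-coupon readings to an RH limit using the figures quoted above. It is a minimal illustration only; the function and variable names are ours, and real deployments would follow the guideline text and site-specific engineering judgment rather than a two-branch check.

```python
# Minimal sketch (not from ASHRAE): map monthly reactivity-coupon readings to a
# relative humidity limit, using the thresholds quoted above
# (silver < 200 angstrom/month, copper < 300 angstrom/month).

def recommended_rh_limit(silver_angstrom_per_month: float,
                         copper_angstrom_per_month: float) -> int:
    """Return a suggested upper RH limit (%) for the data hall."""
    low_corrosion = (silver_angstrom_per_month < 200 and
                     copper_angstrom_per_month < 300)
    if low_corrosion:
        return 70   # extended recommended envelope applies
    return 50       # corrosive gases likely present: keep RH under 50%

print(recommended_rh_limit(150, 250))  # -> 70
print(recommended_rh_limit(150, 450))  # -> 50
```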
ASHRAE bases its enhanced guidance on an experimental study on the effects of gaseous pollutants and humidity on electronics, performed between 2015 and 2018 with researchers from Syracuse University (US). The experiments found that the presence of chlorine and hydrogen sulfide accelerates copper corrosion under higher humidity conditions. Without chlorine, hydrogen sulfide or similarly strong catalysts, there was no significant corrosion up to 70% RH, even when other, less aggressive gaseous pollutants (such as ozone, nitrogen dioxide and sulfur dioxide) were present.
Because corrosion from chlorine and hydrogen sulfide at 50% RH is still above acceptable levels, ASHRAE suggests operators consider chemical filtration to decontaminate.
While the data ASHRAE uses is relatively new, its conclusions echo previous standards. Those acquainted with the environmental requirements of data storage systems may find the guidance familiar — data storage vendors have been following specifications set out in ANSI/ISA-71.04 since 1985 (last updated in 2013). Long after the era of tapes, storage drives (hard disks and solid state alike) remain the foremost victims of corrosion, as their low-temperature operational requirements mean increased moisture absorption and adsorption.
However, many data center operators do not routinely measure gaseous contaminant levels, and so do not monitor for corrosion. If strong catalysts are present but undetected, this might lead to higher than expected failure rates even if temperature and RH are within target ranges. Worse still, lowering supply air temperature in an attempt to counter failures might make them more likely. ASHRAE recommends operators consider a 50% RH limit if they don’t perform reactivity coupon measurements. Somewhat confusingly, it also makes an allowance for following specifications set out in its previous update (the fourth edition), which recommends a 60% RH limit.
Restricted envelope for high-density IT systems
Another major change in the latest update is the addition of a new class of IT equipment, separate from the pre-existing classes of A1 through A4. The new class, H1, includes systems that tightly integrate a number of high-powered components (server processors, accelerators, memory chips and networking controllers). ASHRAE says these high-density systems need narrower air temperature bands — it recommends 18°C/64.4°F to 22°C/71.6°F (as opposed to 18°C/64.4°F to 27°C/80.6°F) — to meet their cooling requirements. The allowable envelope has become tighter as well, with an upper limit of 25°C/77°F for class H1, instead of 32°C/89.6°F (see Figure 1).
This is because, according to ASHRAE, there is simply not enough room in some dense systems for the higher performance heat sinks and fans that could keep components below temperature limits across the generic (classes A1 through A4) recommended envelope. ASHRAE does not stipulate what makes a system class H1, leaving it to the IT vendor to specify its products as such.
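For reference, the sketch below encodes only the supply-air figures cited in this post (the generic 18°C to 27°C recommended band with a 32°C allowable ceiling corresponds to class A1; classes A2 through A4 allow higher temperatures, and lower allowable bounds are not modeled). It is an illustrative simplification, not a reproduction of the guideline tables.

```python
# Sketch of the supply-air bands quoted above (degrees Celsius). Only the
# figures cited in the text are encoded; lower allowable bounds and the wider
# A2-A4 allowable ranges are not modeled.

ENVELOPES_C = {
    "H1": {"recommended": (18.0, 22.0), "allowable_max": 25.0},
    "A1": {"recommended": (18.0, 27.0), "allowable_max": 32.0},
}

def classify_supply_temp(equipment_class: str, temp_c: float) -> str:
    env = ENVELOPES_C[equipment_class]
    low, high = env["recommended"]
    if low <= temp_c <= high:
        return "within recommended envelope"
    if temp_c <= env["allowable_max"]:
        return "outside recommended, within allowable envelope"
    return "outside allowable envelope"

print(classify_supply_temp("A1", 24.0))  # within recommended envelope
print(classify_supply_temp("H1", 24.0))  # outside recommended, within allowable envelope
```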
There are some potentially far-reaching implications of these new envelopes. Operators have over the past decade built and equipped a large number of facilities based on ASHRAE’s previous guidance. Many of these relatively new data centers take advantage of the recommended temperature bands by using less mechanical refrigeration and more economization. In several locations — Dublin, London and Seattle, for example — it is even possible for operators to completely eliminate mechanical cooling yet stay within ASHRAE guidelines by marrying the use of evaporative and adiabatic air handlers with advanced air-flow design and operational discipline. The result is a major leap in energy efficiency and the ability to support more IT load from a substation.
Such optimized facilities will not typically lend themselves well to the new envelopes. That most of these data centers can support 15- to 20-kilowatt IT racks doesn’t help either, since H1 is a new equipment class requiring a lower maximum for temperature — regardless of the rack’s density. To maintain the energy efficiency of highly optimized new data center designs, dense IT may need to have its own dedicated area with independent cooling. Indeed, ASHRAE says that operators should separate H1 and other more restricted equipment into areas with their own controls and cooling equipment.
Uptime will be watching with interest how colocation providers, in particular, handle this challenge, as their typical SLAs depend heavily on the ASHRAE thermal guidelines. What may be considered an oddity today may soon become common, given that semiconductor power keeps escalating with every generation. Facility operators may deploy direct liquid cooling for high-density IT as a way out of this bind.
Too big to fail? Facebook’s global outage
The bigger the outage, the greater the need for explanations and, most importantly, for taking steps to avoid a repeat.
By any standards, the outage that affected Facebook on Monday, October 4th, was big. For more than six hours, Facebook and its other businesses, including WhatsApp, Instagram and Oculus VR, disappeared from the internet — not just in a few regions or countries, but globally. So many users and machines kept retrying these websites, it caused a slowdown of the internet and issues with cellular networks.
While Facebook is large enough to ride through the immediate financial impact, that impact should not be dismissed. Market watchers estimate that the outage cost Facebook roughly $60 million in revenue over the more than six hours it lasted. The company’s shares fell 4.9% on the day, which translated into more than $47 billion in lost market cap.
Facebook may recover those losses, but the bigger ramifications may be reputational and legal. Uptime Institute research shows that the level of outages from hyperscale operators is similar to that experienced by colocation companies and enterprises — despite their huge investments in distributed availability zones and global load and traffic management. In 2020, Uptime Institute recorded 21 cloud/internet giant outages, with associated financial and reputational damage. With antitrust, data privacy and, most recently, children’s mental health concerns swirling about Facebook, the company is unlikely to welcome further reputational and legal scrutiny.
What was the cause of Facebook’s outage? The company said an errant command was issued during planned network maintenance. An automated auditing tool would ordinarily catch such a command, but a bug in the tool prevented it from doing so. The command led to configuration changes on Facebook’s backbone routers, which coordinate network traffic among its data centers. This had a cascading effect that halted Facebook’s services.
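As a generic illustration of the kind of pre-change gate such an auditing tool represents (a hypothetical sketch of the concept, not a description of Facebook's actual tooling), a maintenance change would be validated and blocked before it could propagate:

```python
# Hypothetical sketch of a pre-change audit gate for maintenance commands.
# It is not Facebook's tooling; it only illustrates the idea that a change
# should be checked, and rejected if unsafe, before it is applied.

from dataclasses import dataclass

@dataclass
class ChangeRequest:
    command: str
    affects_backbone: bool
    withdraws_routes: bool

def audit(change: ChangeRequest) -> list:
    """Return reasons to block the change; an empty list means approved."""
    problems = []
    if change.withdraws_routes and change.affects_backbone:
        problems.append("change would withdraw backbone routes")
    if change.affects_backbone and "all" in change.command.split():
        problems.append("command targets all backbone capacity at once")
    return problems

request = ChangeRequest(command="drain all backbone links",
                        affects_backbone=True, withdraws_routes=True)
issues = audit(request)
print("Blocked: " + "; ".join(issues) if issues else "Approved; applying change")
```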
Setting aside theories of deliberate sabotage, there is evidence that Facebook’s internet routes (Border Gateway Protocol, or BGP) were withdrawn by mistake as part of these configuration changes.
BGP is a mechanism for large internet routers to constantly exchange information about the possible routes for them to deliver network packets. BGP effectively provides very long lists of potential routing paths that are constantly updated. When Facebook stopped broadcasting its presence — something observed by sites that monitor and manage internet traffic — other networks could not find it.
One factor that exacerbated the outage is that Facebook has an atypical internet infrastructure design, specifically related to BGP and another three-letter acronym: DNS, the domain name system. While BGP functions as the internet’s routing map, the DNS serves as its address book. (The DNS translates human-friendly names for online resources into machine-friendly internet protocol addresses.)
Facebook has its own DNS registrar, which manages and broadcasts its domain names. Because of Facebook’s architecture — designed to improve flexibility and control — when its BGP configuration error happened, the Facebook registrar went offline. (As an aside, this caused some domain tools to erroneously show that the Facebook.com domain was available for sale.) As a result, internet service providers and other networks simply could not find Facebook’s network.
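The following toy model (ours, not a description of Facebook's systems; the prefixes, addresses and domain name are illustrative placeholders) shows why withdrawing a network's BGP routes can take its DNS down with it: once the prefixes hosting the authoritative name servers disappear from the routing table, the names stop resolving.

```python
# Toy model of the dependency described above: withdraw a network's prefixes
# from the routing table and its authoritative name servers become
# unreachable, so its domain names stop resolving. Prefixes, addresses and the
# domain name below are illustrative placeholders.

routing_table = {
    "203.0.113.0/24": "reachable",   # example prefix serving web traffic
    "198.51.100.0/24": "reachable",  # example prefix hosting the name servers
}

authoritative_ns = {"example-social.net": "198.51.100.53"}

def prefix_for(ip: str):
    # Crude stand-in for longest-prefix matching: compare the first three octets.
    for prefix in routing_table:
        if ip.split(".")[:3] == prefix.split("/")[0].split(".")[:3]:
            return prefix
    return None

def resolve(name: str) -> str:
    ns_ip = authoritative_ns[name]
    prefix = prefix_for(ns_ip)
    if prefix is None or routing_table.get(prefix) != "reachable":
        return "SERVFAIL: authoritative name server unreachable"
    return "answer from " + ns_ip

print(resolve("example-social.net"))  # answer from 198.51.100.53
routing_table.clear()                 # the BGP withdrawal: all prefixes gone
print(resolve("example-social.net"))  # SERVFAIL: authoritative name server unreachable
```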
How did this then cause a slowdown of the internet? Billions of systems, including mobile devices running a Facebook-owned application in the background, were constantly requesting new “coordinates” for these sites. These requests are ordinarily cached in servers located at the edge, but when the BGP routes disappeared, so did those caches. Requests were routed upstream to large internet servers in core data centers.
The situation was compounded by a self-reinforcing feedback loop, caused in part by application logic and in part by user behavior. Web applications will not accept a BGP routing error as an answer to a request and so they retry, often aggressively. Users and their mobile devices running these applications in the background also won’t accept an error and will repeatedly reload the website or restart the application. The result was an increase of up to 40% in DNS request traffic, which slowed down other networks (and, therefore, increased latency and request timeouts for other web applications). The increased traffic also reportedly led to issues with some cellular networks, including users being unable to make voice-over-IP phone calls.
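A simplified sketch of the amplification at work (illustrative numbers and logic, not measurements from the event): a client that retries on a short fixed interval sends thousands of requests over a six-hour outage, whereas capped exponential backoff keeps the total to tens of requests.

```python
# Illustrative retry-amplification arithmetic (not data from the outage):
# compare a fixed, aggressive retry interval with capped exponential backoff
# over a roughly six-hour outage window.

def requests_during_outage(outage_seconds: int, retry_delays) -> int:
    """Count requests one client sends before the outage window closes."""
    elapsed, count = 0.0, 0
    for delay in retry_delays:
        if elapsed >= outage_seconds:
            break
        count += 1
        elapsed += delay
    return count

def fixed_interval(seconds: float):
    while True:
        yield seconds

def capped_exponential_backoff(base: float = 1.0, cap: float = 300.0):
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)

OUTAGE_SECONDS = 6 * 3600
print(requests_during_outage(OUTAGE_SECONDS, fixed_interval(5)))            # ~4,320 requests
print(requests_during_outage(OUTAGE_SECONDS, capped_exponential_backoff())) # ~80 requests
```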
Facebook’s outage was initially caused by routine network maintenance gone wrong, but the error was missed by an auditing tool and propagated via an automated system, which were likely both built by Facebook. The command error reportedly blocked remote administrators from reverting the configuration change. What’s more, the people with access to Facebook’s physical routers (in Facebook’s data centers) did not have access to the network/logical system. This suggests two things: the network maintenance auditing tool and process were inadequately tested, and there was a lack of specialized staff with network-system access physically inside Facebook’s data centers.
When the only people who can remedy a potential network maintenance problem rely on the network that is being worked on, it seems obvious that a contingency plan needs to be in place.
Facebook, which like other cloud/internet giants has rigorous processes for applying lessons learned, should be better protected next time. But Uptime Institute’s research shows there are no guarantees — cloud/internet giants are particularly vulnerable to network and software configuration errors, a function of their complexity and the interdependency of many data centers, zones, systems and separately managed networks. Ten of the 21 outages in 2020 that affected cloud/internet giants were caused by software/network errors. That these errors can cause traffic pileups that can then snarl completely unrelated applications globally will further concern all those who depend on publicly shared digital infrastructure, including the internet.

More detailed information about the causes and costs of outages is available in our report Annual outage analysis 2021: The causes and impacts of data center outages.
Vertiv’s DCIM ambitions wither on Trellis’ demise

Operators often say that data center infrastructure management (DCIM) software is a necessary evil. Modern facilities need centralized, real-time management and analytics, but DCIM is notoriously difficult to deploy. The software is also not easy to rip and replace — yet some operators unexpectedly will have to. Vertiv, a pioneer and leader in equipment monitoring and management, recently told customers it has discontinued its flagship DCIM platform Trellis. Support for existing contracts will end in 2023.
In addition to dismaying these customers, Vertiv’s decision to drop Trellis raises questions about the company’s overall strategy — and value proposition — for data center management software.
Vertiv’s software business has fallen far in the past decade. Its Aperture asset management software was the most widely used DCIM product as recently as 2015 (when the company was Emerson Network Power). Trellis launched with fanfare in 2012 and became one of the most expansive DCIM platforms, promising to integrate data center facility and IT equipment monitoring.
For years, Trellis was the centerpiece of Vertiv’s strategy toward more intelligent, more automated data centers. It was once positioned as a rival to today’s leading DCIM products, notably from Schneider Electric and Nlyte Software. At one point, IBM and HP were Trellis resellers.
Ultimately, however, Trellis was both overengineered and underwhelming. It was built using the Oracle Fusion application development system so that new functionality and software could be built on top of Trellis. This was a benefit for customers active in the Oracle environment, but not for most others; the architecture was too sprawling and heavy. Vertiv says this was a primary reason for discontinuing the product — it was simply too big and complex.
Other factors were at play. Over several years, operators reported various problems and concerns with Trellis. For example, while Trellis’ asset management functionality was eventually strong, for years it lagged Aperture, the product it was designed to replace. (Vertiv discontinued Aperture in 2017, with support of existing contracts to end by 2023.) This left customers to choose between maintaining a discontinued product or migrating to a newer one with less functionality. Many ended up migrating to a competitor’s platform.
Vertiv’s overall share of the DCIM market has steadily shrunk. Trellis, Aperture, Data Center Planner and SiteScan are all well-known Vertiv software products that are now discontinued. (There is no date yet for when support for existing SiteScan contracts will end.) Moreover, Trellis was a proof point for Vertiv’s software ambitions.
So where is Vertiv heading? The company is focusing on a few remaining software products that are solid but narrowly focused. It is investing in Environet, which it acquired in 2018 when it bought Geist Global, a relatively small vendor of rack power distribution and related sensors. At the time, Environet had a small, loyal customer base. Vertiv has since extended Environet’s power and thermal monitoring, as well as its facility space management features. Yet the software lacks the more sophisticated predictive maintenance capabilities and, notably, the IT device management features of Trellis and other DCIM suites.
Separately, Vertiv continues to develop its IT out-of-band management software, as well as its iCOM-S thermal management tools. Vertiv’s recent deal to acquire the electrical supplier E+I Engineering (for $1.2 billion) means it could potentially add more power monitoring features.
Taken as a portfolio, Vertiv appears underinvested in software. This is not just when compared with some of its equipment rivals, but also in the context of the overall evolution of data center management.
The future of data center management will increasingly be predictive, automated and remote — thanks to sophisticated DCIM software driven by artificial intelligence (AI) and other big-data approaches. New developments such as AI-driven DCIM as a service are helping move smaller facilities, including at the edge, toward this vision. At the same time, DCIM deployed on-premises in larger, mission-critical facilities is becoming increasingly AI-driven, predictive, and able to assimilate a variety of data to extend real-time alarming and closed-loop automation. Schneider and Eaton, for example, both offer cloud-delivered predictive DCIM. Several DCIM vendors are investing in more AI to boost the accuracy and remit of analysis. And earlier this year, Schneider said it was expanding development of its long-standing and comprehensive on-premises DCIM platform, EcoStruxure for Data Centers. (Schneider has also quietly standardized its tools and processes to make it easier to migrate other vendors’ DCIM data into its software, among other things.)
While Vertiv has a long-held strategy to embed more software in its facilities equipment for monitoring and management, it has not executed on, or even articulated, broader ambitions. As one of the largest equipment and services suppliers to data centers, Vertiv, like its peers, does not need to be a primary supplier of software to succeed. However, DCIM is increasingly one of the most critical management levers available to data center operators. Major infrastructure product and service suppliers often view DCIM as a strategic product line, in part because it should increase the “stickiness” of their core business. More importantly, the software’s analytics and control features can show how a supplier’s broader offerings fit easily into an intelligent infrastructure.
Startups brew new chemistries for fresh battery types

Through their public commitments to net-zero carbon emission targets, cloud providers have re-energized talk of a major redesign of critical power systems within the data center sector. The elimination of diesel fuel use is chief among the goals, but the desire for a leaner facility power infrastructure that closely tracks IT capacity needs is also attracting interest. New types of batteries, among other tools such as clean fuels, will likely prove instrumental in this endeavor.
Operators have already started displacing standard valve-regulated lead-acid (VRLA) banks with lithium-ion (Li-ion) batteries to refashion data center backup power. According to the upcoming 2021 Uptime Institute Global Data Center Survey report, nearly half of operators have adopted the technology for their centralized uninterruptible power supply (UPS) plant, up from about a quarter three years ago. Compared with VRLA, Li-ion batteries are smaller, easier to monitor and maintain, and require less stringent climatic controls. (See our post: Lithium-ion batteries for the data center: Are they ready for production yet? for more technical details.)
In addition, Li-ion batteries have a much longer service life, which offsets most of the higher initial cost of Li-ion packs. The change in chemistry also makes possible use cases that were not previously feasible with VRLA — for example, demand-response schemes in which operators can use their Li-ion UPS systems for peak load shaving when there is a price incentive from the electric utility.
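A minimal sketch of the peak-shaving idea follows (all capacities, thresholds and prices below are hypothetical): discharge the UPS batteries toward the IT load when the utility price is high, but never dip into the energy reserved for backup runtime.

```python
# Hypothetical peak-shaving dispatch logic; every number here is illustrative.
# Discharge the Li-ion UPS bank when energy is expensive, while protecting the
# energy reserved for ride-through during a utility outage.

BACKUP_RESERVE_KWH = 300.0   # energy that must remain available for outages
PRICE_THRESHOLD = 0.25       # $/kWh above which shaving is worthwhile (hypothetical)

def shave_kwh(price_per_kwh: float, state_of_charge_kwh: float,
              it_load_kw: float, interval_hours: float = 0.25) -> float:
    """Return how many kWh to discharge toward the IT load this interval."""
    if price_per_kwh <= PRICE_THRESHOLD:
        return 0.0
    headroom_kwh = max(0.0, state_of_charge_kwh - BACKUP_RESERVE_KWH)
    interval_demand_kwh = it_load_kw * interval_hours
    return min(headroom_kwh, interval_demand_kwh)

print(shave_kwh(price_per_kwh=0.40, state_of_charge_kwh=380.0, it_load_kw=400.0))
# -> 80.0: the discharge is capped by the backup reserve, not by the load
```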
But these advantages are not without significant trade-offs. According to recent research by the US-based National Fire Protection Association, Li-ion batteries (regardless of their specific chemistries and construction) represent a higher fire risk than VRLA. Worse still, Li-ion fires are notoriously difficult to suppress, as the breakdown of cells produces combustible gases — a situation that creates the conditions for life-threatening explosions. Even though some chemistries, such as lithium iron phosphate, have a comparatively lower risk of fire (due to a higher ignition point and no release of oxygen), the possibility of thermal runaway persists.
Then there is the matter of sustainability. VRLA batteries are effectively fully recycled to produce new batteries, at no cost to the owner (owing to lead toxicity-related regulation). That is not the case for large Li-ion battery packs, where the industrial base for recycling has yet to develop — most Li-ion-powered products haven’t reached their end of life. Repackaging and repurposing packs for less demanding applications seems a more likely solution than breaking them down into raw materials, but it is still challenging. Furthermore, the supply chains for Li-ion production, along with their social and environmental impacts, remain opaque to buyers.
These issues have already given some data center operators and their contractors pause. But they have also created an opening in the market for other branches of battery chemistry. One promising family for high-power applications relies on sodium (Na) ions instead of lithium as the main charge carrier, while developments in iron-air batteries raise hopes of building economical and safe energy storage for very long backup times.
Na-ion cells pose no thermal runaway risk and offer much higher power-to-capacity ratios. Even better, they can repeatedly discharge their full energy at very high rates, in one to two minutes, then fully recharge in 10 to 20 minutes without overheating. Durability is also better than that of Li-ion cells.
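As a quick back-of-envelope translation (our arithmetic, not vendor figures): a C-rate of N means a full charge or discharge in 1/N hours, so the times quoted above correspond to roughly 30C to 60C on discharge and 3C to 6C on recharge.

```python
# Back-of-envelope C-rate arithmetic for the times quoted above (our
# calculation). A C-rate of N means full capacity in 1/N hours.

def c_rate(minutes_for_full_cycle: float) -> float:
    return 60.0 / minutes_for_full_cycle

print(c_rate(2.0), c_rate(1.0))    # 30.0 60.0 -> discharge in 1-2 minutes is 30-60C
print(c_rate(20.0), c_rate(10.0))  # 3.0 6.0   -> recharge in 10-20 minutes is 3-6C
```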
However, a significant disadvantage is the size of Na-ion batteries compared with state-of-the-art Li-ion (although they are still more compact than VRLA). This profile (high power but lower energy density) makes Na-ion batteries interesting for UPS applications with shorter runtime targets (e.g., up to 10 minutes). More importantly, perhaps, Na-ion seems safe for installation in data halls, where operators can use it in distributed UPS topologies that closely match power service levels to IT capacity requirements. Social and environmental impacts from sodium battery manufacturing also appear less concerning, as these batteries tend to use no rare earth elements or minerals tainted by the conditions of extraction (although practical recyclability is an outstanding question).
Multiple startups have already attracted funding to commercialize sodium batteries for various use cases, both stationary and mobile (including industrial vehicles). Examples are Natron Energy (US), Faradion (UK), and CATL (China). The fabrication process for Na-ion batteries shares many commonalities with that of Li-ion, making industrialization relatively straightforward — producers can tap into a large pool of toolmakers and manufacturing know-how. All the same, it will take another few years (and hundreds of millions of dollars) for Na-ion battery production to ramp up.
Meanwhile, US startup Form Energy has recently raised $200 million (bringing total life-to-date funding past $325 million) to commercialize an entirely different type of battery that aims to solve not the power but the energy capacity problem. The company says it has figured out how to make effective iron-air batteries that can store energy for long periods and discharge it slowly over many hours (the design target is 100 hours) when needed. Because of the inexpensive ingredients, Form claims that its iron-air cells cost a small fraction (about 15%) of the per-kilowatt-hour cost of Li-ion.
Form Energy is now targeting 2025 for the first battery deployments. There seems to be a catch, however: space. According to the firm, a Form Energy system that can hold a 3-megawatt load for four days occupies an acre of land. Even if this can be reduced by lowering runtime to, say, 24 hours, it will still require several times the footprint of a diesel generator set and associated fuel storage. Nonetheless, if product development goes to plan, Form Energy may provide the data center sector with a new piece of its power technology puzzle.
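Some rough arithmetic on the figures quoted above (our calculation, for illustration only):

```python
# Rough arithmetic on Form Energy's quoted figures: 3 MW held for about four
# days on roughly one acre. Our calculation, for illustration only.

POWER_MW = 3.0
HOURS = 4 * 24                # "four days", close to the 100-hour design target
SQ_M_PER_ACRE = 4046.86

energy_mwh = POWER_MW * HOURS                       # about 288 MWh of stored energy
kwh_per_sq_m = energy_mwh * 1000.0 / SQ_M_PER_ACRE  # areal energy density

print(f"{energy_mwh:.0f} MWh on one acre, roughly {kwh_per_sq_m:.0f} kWh per square metre")
# -> 288 MWh on one acre, roughly 71 kWh per square metre
```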
In upcoming reports, Uptime Institute Intelligence will explore developments in energy storage, power infrastructure, and operational methods that may make the stretch goal of a diesel-free data center achievable.
Eyeing an uptick in edge data center demand
Edge data centers are generally small data center facilities — designed for IT workloads up to a few hundred kilowatts — that process data closer to the population of local end users and devices. Earlier in 2021, we surveyed data center owners/operators and product and service suppliers to gauge demand and deployments. The findings were consistent: Data center end-user organizations and suppliers alike expect an uptick in edge data center demand in the near term.
The study suggests a small majority of owners/operators today use between one and five edge data centers and that this is unlikely to change in the next two to three years. This is likely an indicator of overall demand for edge computing — most organizations have some requirement for edge, but do not expect this to change significantly in the short term.
Many others, however, do expect growth. The share of owners/operators that do not use any edge data centers drops from 31% today to 12% in two to three years’ time — indicating a significant increase in owner/operator uptake. Furthermore, the portion of owners/operators in our study using more than 20 edge data centers today is expected to double in the next two to three years (from 9% today to 20% in two to three years), as shown in Figure 1.
Over 90% of respondents in North America are planning to use more than five edge data centers in two to three years’ time, a far higher proportion than the 30% to 60% of respondents in other regions. The largest portion of owners/operators planning to use more than 20 edge data centers within the next few years is in the US and Canada, closely followed by Asia-Pacific and China.
The main drivers for edge data center expansion are a need to reduce latency to improve or add new IT services, as well as requirements to reduce network costs and/or bandwidth constraints in data transport over wide distances.
Uptime Institute’s research shows that large deployments of hundreds or more edge data centers are expected for many applications, including telecom networks, the internet of things in oil and gas, retail, cloud gaming, video streaming, public transportation systems, and the growth of large international industrial companies with multiple offices. Several large-scale edge data center projects are today at a prototyping stage with trials planned using one to 10 sites. Full-scale deployments involving tens of sites are planned for the coming three years.
The view from suppliers of edge data centers in our study was similarly strong. Most suppliers today help to build, supply or maintain between six and 10 edge data centers per year. Many expect this will grow to 21 to 50 annually in two to three years’ time, as shown in Figure 2. Only a minimal portion of suppliers built, supplied or maintained more than 100 edge data centers in 2020; looking ahead two to three years, 15% of suppliers in our study expect they will be handling 100+ edge data center projects (a threefold increase from today).
Suppliers in the US and Canada are particularly bullish; almost one in three (32%) expect a yearly volume of more than 100 edge data centers in two to three years’ time. Those in Europe are also optimistic; roughly one-fifth of suppliers (18%) expect an annual volume of more than 100 edge data centers in the coming two to three years. Some suppliers report a recent change in market demand, with orders and tenders going from being sporadic and for a single-digit number of sites, to more requests for larger projects with tens of sites. The change may be attributed to a solidification of clients’ corporate strategy for edge computing.
It seems clear that demand for edge data centers is expected to grow across the world in the near term and especially in North America. Owners, operators and suppliers alike anticipate growth across different industry verticals. The digitization effects of COVID-19, the expansion by big clouds to the edge, and the mostly speculative deployments of 5G are among key factors driving demand. However, there is complexity involved in developing edge business cases and it is not yet clear that any single use case will drive edge data centers in high volumes.
The full report Demand and speculation fuel edge buildout is available here.
Fastly outage underscores slow creep of digital services risk
A recent outage at content delivery network Fastly took down thousands of websites in different countries, including big names, such as Amazon, Twitter and Spotify, for nearly an hour. It is the latest large outage highlighting the downside of a key trend in digital infrastructure: The growth in dependency on digital service providers can undermine infrastructure resilience and business continuity.
The conundrum here is one of conflicting double standards. While many companies expend substantial money and time on hardening and testing their own data center and IT infrastructure against disruptions, they also increasingly rely on cloud service providers whose architectural details and operational practices are largely hidden from them.
Some confusion stems from high expectations and a high level of trust in the resiliency of the greatly distributed topologies of cloud and network technologies — often amounting to a leap of faith. Large providers routinely quote five nines (99.999%) or better availability, or imply that system-wide failure is nearly impossible — but this is clearly not the case. While many do deliver on these promises most of the time, outages still happen. When they do, the impact is often widespread.
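For context, some quick arithmetic on what those availability claims allow (a simple calculation, not a statement about any provider's contractual terms): five nines permits only a few minutes of downtime per year, far less than the nearly hour-long Fastly outage described above.

```python
# What "five nines" and "four nines" allow per year, versus an outage of
# nearly an hour. Simple arithmetic, not a contractual definition.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def allowed_downtime_minutes(availability: float) -> float:
    return MINUTES_PER_YEAR * (1.0 - availability)

print(round(allowed_downtime_minutes(0.99999), 1))  # ~5.3 minutes per year at 99.999%
print(round(allowed_downtime_minutes(0.9999), 1))   # ~52.6 minutes per year at 99.99%
```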
Fastly is far from being a lone culprit. In our report Annual outage analysis 2021: The causes and impacts of data center outages, Uptime Institute counts over 80 major outages at cloud providers in the last five years (not including other digital services and telecommunication providers). The Fastly outage serves as a reminder that cloud- and network-based services are not invulnerable to service interruptions — in fact, they sometimes prove to be surprisingly fragile, susceptible to tiny errors.
Fastly’s own initial postmortem of the event notes that the global outage was a result of a legitimate and valid configuration change by a single customer. That change led to an unexpected effect on the wider system, ultimately taking down dozens of other customers’ services. Fastly said it did not know how the bug made it into production. In some ways, this Fastly outage is similar to an Amazon Web Services (AWS) incident a few years ago, when a single storage configuration error made by an administrator ground some AWS services to a near halt on the US East Coast, causing major disruption.
Major outages at large digital service providers point to a need for greater transparency of their operational and architectural resiliency, so customers can better assess their own risk exposure. Research from Uptime Institute Intelligence shows that only a small proportion of businesses — fewer than one in five in our study — thought they had sufficient understanding of the resilience of their cloud providers to trust them to run their mission-critical workloads (see Figure 1). Nearly a quarter said they would likely use public cloud for such workloads if they gained greater visibility into the cloud providers’ resiliency design and procedures.
Furthermore, Uptime Institute’s surveys have shown that most businesses manage a rich mix of digital infrastructures and services, spanning their own private (on-premises) data centers, leased facilities, hosted capacity, and various public cloud-powered services. This comes at the cost of added complexity, which generates risks, both known and unknown, if not managed with full visibility. In a recent survey, of those who have a view on how their services’ resilience has changed as a result of adopting hybrid environments, a strong majority reported a welcome improvement. Still, one in seven reported a decrease in overall resilience.
The bottom line is that without the possibility of verifying availability and reliability claims from cloud and other IT services providers, and without proactively mitigating the challenges that hybrid architectures pose, enterprises might be signing up to added risks to their business continuity and data security. Fastly’s customers have received a stark reminder of this growing reality at their own expense.
https://journal.uptimeinstitute.com/wp-content/uploads/2021/06/Fastly-blog-featured-diagram.jpg4661024Daniel Bizo, Research Director, Uptime Institute Intelligence, [email protected]https://journal.uptimeinstitute.com/wp-content/uploads/2022/12/uptime-institute-logo-r_240x88_v2023-with-space.pngDaniel Bizo, Research Director, Uptime Institute Intelligence, [email protected]2021-06-11 15:15:072021-06-11 15:17:15Fastly outage underscores slow creep of digital services risk
New ASHRAE guidelines challenge efficiency drive
/in Design, Executive, Operations/by Daniel Bizo, Research Director, Uptime Institute Intelligence, [email protected]Earlier in 2021, ASHRAE’s Technical Committee 9.9 published an update — the fifth edition — of its Thermal Guidelines for Data Processing Environments. The update recommends important changes to data center thermal operating envelopes: the presence of pollutants is now a factor, and it introduces a new class of IT equipment for high-density computing. The new advice can, in some cases, lead operators to not only alter operational practices but also shift set points, a change that may impact both energy efficiency and contractual service level agreements (SLA) with data center services providers.
Since the original release in 2004, ASHRAE’s Thermal Guidelines have been instrumental in setting cooling standards for data centers globally. The 9.9 committee collects input from a wide cross-section of the IT and data center industry to promote an evidence-based approach to climatic controls, one which helps operators better understand both risks and optimization opportunities. Historically, most changes to the guidelines pointed data center operators toward further relaxations of climatic set points (e.g., temperature, relative humidity, dew point), which also stimulated equipment makers to develop more efficient air economizer systems.
In the fifth edition, ASHRAE adds some major caveats to its thermal guidance. While the recommendations for relative humidity (RH) extend the range up to 70% (the previous cutoff was 60%), this is conditional on the data hall having low concentrations of pollutant gases. If the presence of corrosive gases is above the set thresholds, ASHRAE now recommends operators keep RH under 50% — below its previous recommended limit. To monitor, operators should place metal strips, known as “reactivity coupons,” in the data hall and measure corroded layer formation; the limit for silver is 200 ångström per month and for copper, 300 ångström per month.
ASHRAE bases its enhanced guidance on an experimental study on the effects of gaseous pollutants and humidity on electronics, performed between 2015 and 2018 with researchers from Syracuse University (US). The experiments found that the presence of chlorine and hydrogen sulfide accelerates copper corrosion under higher humidity conditions. Without chlorine, hydrogen sulfide or similarly strong catalysts, there was no significant corrosion up to 70% RH, even when other, less aggressive gaseous pollutants (such as ozone, nitrogen dioxide and sulfur dioxide) were present.
Because corrosion from chlorine and hydrogen sulfide at 50% RH is still above acceptable levels, ASHRAE suggests operators consider chemical filtration to decontaminate.
While the data ASHRAE uses is relatively new, its conclusions echo previous standards. Those acquainted with the environmental requirements of data storage systems may find the guidance familiar — data storage vendors have been following specifications set out in ANSI/ISA-71.04 since 1985 (last updated in 2013). Long after the era of tapes, storage drives (hard disks and solid state alike) remain the foremost victims of corrosion, as their low-temperature operational requirements mean increased moisture absorption and adsorption.
However, many data center operators do not routinely measure gaseous contaminant levels, and so do not monitor for corrosion. If strong catalysts are present but undetected, this might lead to higher than expected failure rates even if temperature and RH are within target ranges. Worse still, lowering supply air temperature in an attempt to counter failures might make them more likely. ASHRAE recommends operators consider a 50% RH limit if they don’t perform reactivity coupon measurements. Somewhat confusingly, it also makes an allowance for following specifications set out in its previous update (the fourth edition), which recommends a 60% RH limit.
Restricted envelope for high-density IT systems
Another major change in the latest update is the addition of a new class of IT equipment, separate from the pre-existing classes of A1 through A4. The new class, H1, includes systems that tightly integrate a number of high-powered components (server processors, accelerators, memory chips and networking controllers). ASHRAE says these high-density systems need more narrow air temperature bands — it recommends 18°C/64.4°F to 22°C/71.6°F (as opposed to 18°C/64.4°F to 27°C/80.6°F) — to meet its cooling requirements. The allowable envelope has become tighter as well, with upper limits of 25°C/77°F for class H1, instead of 32°C/89.6°F (see Figure 1).
This is because, according to ASHRAE, there is simply not enough room in some dense systems for the higher performance heat sinks and fans that could keep components below temperature limits across the generic (classes A1 through A4) recommended envelope. ASHRAE does not stipulate what makes a system class H1, leaving it to the IT vendor to specify its products as such.
There are some potentially far-reaching implications of these new envelopes. Operators have over the past decade built and equipped a large number of facilities based on ASHRAE’s previous guidance. Many of these relatively new data centers take advantage of the recommended temperature bands by using less mechanical refrigeration and more economization. In several locations — Dublin, London and Seattle, for example — it is even possible for operators to completely eliminate mechanical cooling yet stay within ASHRAE guidelines by marrying the use of evaporative and adiabatic air handlers with advanced air-flow design and operational discipline. The result is a major leap in energy efficiency and the ability to support more IT load from a substation.
Such optimized facilities will not typically lend themselves well to the new envelopes. That most of these data centers can support 15- to 20-kilowatt IT racks doesn’t help either, since H1 is a new equipment class requiring a lower maximum for temperature — regardless of the rack’s density. To maintain the energy efficiency of highly optimized new data center designs, dense IT may need to have its own dedicated area with independent cooling. Indeed, ASHRAE says that operators should separate H1 and other more restricted equipment into areas with their own controls and cooling equipment.
Uptime will be watching with interest how colocation providers, in particular, will handle this challenge, as their typical SLAs depend heavily on the ASHRAE thermal guidelines. What may be considered an oddity today may soon become common, given that semiconductor power keeps escalating with every generation. Facility operators may deploy direct liquid cooling for high-density IT as a way out of this bind.
Too big to fail? Facebook’s global outage
/in Executive, News, Operations/by Rhonda Ascierto, Vice President, Research, Uptime InstituteThe bigger the outage, the greater the need for explanations and, most importantly, for taking steps to avoid a repeat.
By any standards, the outage that affected Facebook on Monday, October 4th, was big. For more than six hours, Facebook and its other businesses, including WhatsApp, Instagram and Oculus VR, disappeared from the internet — not just in a few regions or countries, but globally. So many users and machines kept retrying these websites, it caused a slowdown of the internet and issues with cellular networks.
While Facebook is large enough to ride through the immediate financial impact, it should not be dismissed. Market watchers estimate that the outage cost Facebook roughly $60 million in revenues over its more than six-hour period. The company’s shares fell 4.9% on the day, which translated into more than $47 billion in lost market cap.
Facebook may recover those losses, but the bigger ramifications may be reputational and legal. Uptime Institute research shows that the level of outages from hyperscale operators is similar to that experienced by colocation companies and enterprises — despite their huge investments in distributed availability zones and global load and traffic management. In 2020, Uptime Institute recorded 21 cloud/internet giant outages, with associated financial and reputational damage. With antitrust, data privacy and, most recently, children’s mental health concerns swirling about Facebook, the company is unlikely to welcome further reputational and legal scrutiny.
What was the cause of Facebook’s outage? The company said there was an errant command issued during planned network maintenance. While an automated auditing tool would ordinarily catch an errant command, there was a bug in the tool that didn’t properly stop it. The command led to configuration changes on Facebook’s backbone routers that coordinate network traffic among its data centers. This had a cascading effect that halted Facebook’s services.
Setting aside theories of deliberate sabotage, there is evidence that Facebook’s internet routes (Border Gateway Protocol, or BGP) were withdrawn by mistake as part of these configuration changes.
BGP is a mechanism for large internet routers to constantly exchange information about the possible routes for them to deliver network packets. BGP effectively provides very long lists of potential routing paths that are constantly updated. When Facebook stopped broadcasting its presence — something observed by sites that monitor and manage internet traffic — other networks could not find it.
One factor that exacerbated the outage is that Facebook has an atypical internet infrastructure design, specifically related to BGP and another three-letter acronym: DNS, the domain name system. While BGP functions as the internet’s routing map, the DNS serves as its address book. (The DNS translates human-friendly names for online resources into machine-friendly internet protocol addresses.)
Facebook has its own DNS registrar, which manages and broadcasts its domain names. Because of Facebook’s architecture — designed to improve flexibility and control — when its BPG configuration error happened, the Facebook registrar went offline. (As an aside, this caused some domain tools to erroneously show that the Facebook.com domain was available for sale.) As a result, internet service providers and other networks simply could not find Facebook’s network.
How did this then cause a slowdown of the internet? Billions of systems, including mobile devices running a Facebook-owned application in the background, were constantly requesting new “coordinates” for these sites. These requests are ordinarily cached in servers located at the edge, but when the BGP routes disappeared, so did those caches. Requests were routed upstream to large internet servers in core data centers.
The situation was compounded by a negative feedback loop, caused in part by application logic and in part by user behavior. Web applications will not accept a BGP routing error as an answer to a request and so they retry, often aggressively. Users and their mobile devices running these applications in the background also won’t accept an error and will repeatedly reload the website or reboot the application. The result was an up to 40% increase in DNS request traffic, which slowed down other networks (and, therefore, increased latency and timeout requests for other web applications). The increased traffic also reportedly led to issues with some cellular networks, including users being unable to make voice-over-IP phone calls.
Facebook’s outage was initially caused by routine network maintenance gone wrong, but the error was missed by an auditing tool and propagated via an automated system, which were likely both built by Facebook. The command error reportedly blocked remote administrators from reverting the configuration change. What’s more, the people with access to Facebook’s physical routers (in Facebook’s data centers) did not have access to the network/logical system. This suggests two things: the network maintenance auditing tool and process were inadequately tested, and there was a lack of specialized staff with network-system access physically inside Facebook’s data centers.
When the only people who can remedy a potential network maintenance problem rely on the network that is being worked on, it seems obvious that a contingency plan needs to be in place.
Facebook, which like other cloud/internet giants has rigorous processes for applying lessons learned, should be better protected next time. But Uptime Institute’s research shows there are no guarantees — cloud/internet giants are particularly vulnerable to network and software configuration errors, a function of their complexity and the interdependency of many data centers, zones, systems and separately managed networks. Ten of the 21 outages in 2020 that affected cloud/internet giants were caused by software/network errors. That these errors can cause traffic pileups that can then snarl completely unrelated applications globally will further concern all those who depend on publicly shared digital infrastructure – including the internet.
More detailed information about the causes and costs of outages is available in our report Annual outage analysis 2021: The causes and impacts of data center outages.
Vertiv’s DCIM ambitions wither on Trellis’ demise
/in Executive, News, Operations/by Rhonda Ascierto, Vice President, Research, Uptime InstituteOperators often say that data center infrastructure management (DCIM) software is a necessary evil. Modern facilities need centralized, real-time management and analytics, but DCIM is notoriously difficult to deploy. The software is also not easy to rip and replace — yet some operators unexpectedly will have to. Vertiv, a pioneer and leader in equipment monitoring and management, recently told customers it has discontinued its flagship DCIM platform Trellis. Support for existing contracts will end in 2023.
In addition to dismaying these customers, Vertiv’s decision to drop Trellis raises questions about the company’s overall strategy — and value proposition — for data center management software.
Vertiv’s software business has fallen far in the past decade. Its Aperture asset management software was the most widely used DCIM product as recently as 2015 (when the company was Emerson Network Power). Trellis launched with fanfare in 2012 and became one of the most expansive DCIM platforms, promising to integrate data center facility and IT equipment monitoring.
For years, Trellis was the centerpiece of Vertiv’s strategy toward more intelligent, more automated data centers. It was once positioned as a rival to today’s leading DCIM products, notably from Schneider Electric and Nlyte Software. At one point, IBM and HP were Trellis resellers.
Ultimately, however, Trellis was both overengineered and underwhelming. It was built using the Oracle Fusion application development system so that new functionality and software could be built on top of Trellis. This was a benefit for customers active in the Oracle environment, but not for most everyone else; the architecture was too sprawling and heavy. Vertiv says this was a primary reason for it discontinuing the product — it was simply too big and complex.
Other factors were at play. Over several years, operators reported various problems and concerns with Trellis. For example, while Trellis’ asset management functionality was eventually strong, for years it lagged Aperture, the product it was designed to replace. (Vertiv discontinued Aperture in 2017, with support of existing contracts to end by 2023.) This left customers to choose between maintaining a discontinued product or migrating to a newer one with less functionality. Many ended up migrating to a competitor’s platform.
Vertiv’s overall share of the DCIM market has steadily shrunk. Trellis, Aperture, Data Center Planner, SiteScan – all are well-known Vertiv software products that are now discontinued. (No date yet for when support for existing SiteScan contracts will end.) Moreover, Trellis was a proof point for Vertiv’s software ambitions.
So where is Vertiv heading? The company is focusing on a few remaining software products that are solid but narrowly focused. It is investing in Environet, which it acquired in 2018 when it bought Geist Global, a relatively small vendor of rack power distribution and related sensors. At the time, Environet had a small, loyal customer base. Vertiv has since extended its power and thermal monitoring, as well as its facilities space management features. Yet the software lacks the more sophisticated predictive maintenance and, notably, the IT device management capabilities of Trellis and other DCIM suites.
Separately, Vertiv continues to develop its IT out-of-band management software, as well as its iCOM-S thermal management tools. Vertiv’s recent deal to acquire the electrical supplier E+I Engineering (for $1.2 billion) means it could potentially add more power monitoring features.
Taken as a portfolio, Vertiv appears underinvested in software. This is not just when compared with some of its equipment rivals, but also in the context of the overall evolution of data center management.
The future of data center management will increasingly be predictive, automated and remote — thanks to sophisticated DCIM software driven by artificial intelligence (AI) and other big-data approaches. New developments such as AI-driven DCIM as a service are helping move smaller facilities, including at the edge, toward this vision. At the same time, DCIM deployed on-premises in larger, mission-critical facilities is becoming increasingly AI-driven, predictive, and able to assimilate a variety of data to extend real-time alarming and closed-loop automation. Schneider and Eaton, for example, both offer cloud-delivered predictive DCIM. Several DCIM vendors are investing in more AI to boost the accuracy and remit of analysis. And earlier this year, Schneider said it was expanding development of its long-standing and comprehensive on-premises DCIM platform, EcoStruxure for Data Centers. (Schneider has also quietly standardized its tools and processes to make it easier to migrate other vendors’ DCIM data into its software, among other things.)
While Vertiv has a long-held strategy to embed more software in its facilities equipment for monitoring and management, it has not executed or even articulated broader ambitions. As one of the largest equipment and services suppliers to data centers, Vertiv, like its peers, does not need to be a primary supplier of software to succeed. However, DCIM is increasingly one of the most critical management levers available to data center operators. Major infrastructure product and service suppliers often view DCIM as a strategic product line – in part because it should increase the “stickiness” of their core business. More importantly, the software’s analytics and control features can show how a supplier’s broader offerings easily fit into an intelligence infrastructure.
Startups brew new chemistries for fresh battery types
/in Executive, Operations/by Daniel Bizo, Research Director, Uptime Institute Intelligence, [email protected]Through their public commitments to net-zero carbon emission targets, cloud providers have re-energized talks of a major redesign of critical power systems within the data center sector. The elimination of diesel fuel use is chief among the goals, but the desire for a leaner facility power infrastructure that closely tracks IT capacity needs is also attracting interest. New types of batteries, among other tools such as clean fuels, will likely prove instrumental in this endeavor.
Operators have already started displacing standard valve-regulated lead-acid (VRLA) banks with lithium-ion (Li-ion) batteries to refashion data center backup power. According to the upcoming 2021 Uptime Institute Global Data Center Survey report, nearly half of operators have adopted the technology for their centralized uninterruptible power supply (UPS) plant, up from about a quarter three years ago. Compared with VRLA, Li-ion batteries are smaller, easier to monitor and maintain, and require less stringent climatic controls. (See our post: Lithium-ion batteries for the data center: Are they ready for production yet? for more technical details.)
In addition, Li-ion batteries have a much longer service life, which offsets most of the higher initial cost of Li-ion packs. The change in chemistry also makes possible use cases that were not previously feasible with VRLA — for example, demand-response schemes in which operators can use their Li-ion UPS systems for peak load shaving when there is a price incentive from the electric utility.
But these advantages are not without significant trade-offs. According to recent research by the US-based National Fire Protection Association, Li-ion batteries (regardless of their specific chemistries and construction) represent a higher fire risk than VRLA. Worse still, Li-ion fires are notoriously difficult to suppress, as the breakdown of cells produces combustible gases — a situation that creates the conditions for life-threatening explosions. Even though some chemistries, such as lithium iron phosphate, have a comparatively lower risk of fire (due to a higher ignition point and no release of oxygen), the possibility of thermal runaway persists.
Then there is the matter of sustainability. VRLA batteries are effectively fully recycled to produce new batteries, at no cost to the owner (owing to lead toxicity-related regulation). That is not the case for large Li-ion battery packs, where the industrial base for recycling has yet to develop — most Li-ion-powered products haven’t reached their end of life. Repackaging and repurposing for less demanding applications seem a more likely solution than breaking down to materials but is still challenging. Furthermore, supply chains for Li-ion production, along with their social and environmental impacts, stay obscure to buyers.
These issues have already given some data center operators and their contractors pause. But they have also created an opening in the market for other branches of battery chemistry. One promising family for high-power applications relies on sodium (Na) ions rather than lithium as the charge carrier, while developments in iron-air batteries raise hopes of building economical and safe energy storage for very long backup times.
Na-ion cells pose no thermal runaway risk and offer much higher power-to-capacity ratios than Li-ion. Even better, they can repeatedly discharge their full energy at very high rates, in one to two minutes, then fully recharge in 10 to 20 minutes without overheating. Durability is also better than that of Li-ion cells.
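To put those figures in perspective, the sketch below converts the quoted discharge and recharge times into C-rates and sizes a hypothetical short-runtime UPS. The 1 MW load and five-minute runtime are assumptions chosen for illustration, not vendor specifications.

```python
# Illustrative arithmetic only: converts the quoted discharge and recharge times
# into C-rates, then sizes a hypothetical Na-ion string for a short-runtime UPS.
# The 1 MW load and five-minute runtime are assumptions for the example,
# not figures from any vendor datasheet.

def c_rate(minutes_to_empty_or_fill: float) -> float:
    """C-rate implied by emptying (or filling) the full capacity in the given time."""
    return 60.0 / minutes_to_empty_or_fill

# Discharge in one to two minutes, recharge in 10 to 20 minutes, per the text.
print(f"Discharge rate: {c_rate(2):.0f}C to {c_rate(1):.0f}C")    # 30C to 60C
print(f"Recharge rate:  {c_rate(20):.0f}C to {c_rate(10):.0f}C")  # 3C to 6C

# Energy needed for an assumed 1 MW UPS load held for five minutes:
load_kw, runtime_min = 1000, 5
energy_kwh = load_kw * runtime_min / 60
print(f"Required energy: {energy_kwh:.0f} kWh")  # roughly 83 kWh
```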
However, a significant disadvantage is the size of Na-ion batteries compared with state-of-the-art Li-ion (although they are still more compact than VRLA). This combination of high power and lower energy density makes Na-ion batteries interesting for UPS applications with shorter runtime targets (e.g., up to 10 minutes). More importantly, perhaps, Na-ion appears safe for installation in data halls, where operators can use it in distributed UPS topologies that closely match power service levels to IT capacity requirements. Social and environmental impacts from sodium battery manufacturing also appear less concerning, as these batteries tend to use no rare earth elements or minerals tainted by the conditions of their extraction (although practical recyclability remains an open question).
Multiple startups have already attracted funding to commercialize sodium batteries for various use cases, both stationary and mobile (including industrial vehicles). Examples are Natron Energy (US), Faradion (UK), and CATL (China). The fabrication process for Na-ion batteries shares many commonalities with Li-ion, making its industrialization relatively straightforward — producers can tap into a large pool of toolmakers and manufacturing know-how. All the same, it will take another few years (and hundreds of millions of dollars) for Na-ion battery production to ramp up.
Meanwhile, US startup Form Energy has recently raised $200 million (bringing total life-to-date funding past $325 million) to commercialize an entirely different type of battery, one that aims to solve not the power problem but the energy capacity problem. The company says it has figured out how to make effective iron-air batteries that can store energy for long periods and discharge it slowly over many hours (the design target is 100 hours) when needed. Because of the inexpensive ingredients, Form claims that iron-air cells cost about 15% as much as Li-ion per kilowatt-hour.
Form Energy is now targeting 2025 for the first battery deployments. There seems to be a catch, however: space. According to the firm, a Form Energy system that can hold a 3-megawatt load for four days occupies an acre of land. Even if this can be reduced by lowering runtime to, say, 24 hours, it will still require several times the footprint of a diesel generator set and associated fuel storage. Nonetheless, if product development goes to plan, Form Energy may provide the data center sector with a new piece of its power technology puzzle.
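As a rough check of those figures, here is a back-of-envelope calculation based on the quoted 3-megawatt, four-day, one-acre system. The assumption that footprint scales roughly linearly with stored energy is ours, for illustration only.

```python
# Back-of-envelope check of the quoted figures: a 3 MW load held for four days
# on roughly one acre of land. The linear scaling of footprint with stored
# energy is our assumption, for illustration only.

ACRE_M2 = 4046.86           # square meters per acre
load_mw, runtime_h = 3, 96  # 3 MW for four days, per the text

energy_mwh = load_mw * runtime_h             # 288 MWh on roughly one acre
areal_density = energy_mwh * 1000 / ACRE_M2  # ~71 kWh per square meter of site

# If footprint scaled linearly with stored energy, a 24-hour runtime would need about:
footprint_24h_acres = 24 / runtime_h         # ~0.25 acre for the same 3 MW load

print(f"{energy_mwh} MWh, ~{areal_density:.0f} kWh/m2, ~{footprint_24h_acres:.2f} acre at 24 h")
```

Even at a quarter of an acre, the footprint would remain several times that of a diesel generator set with its fuel storage, which is the trade-off described above.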
In upcoming reports, Uptime Institute Intelligence will explore developments in energy storage, power infrastructure, and operational methods that may make the stretch goal of a diesel-free data center achievable.
Eyeing an uptick in edge data center demand
By Dr. Tomas Rahkonen, Research Director of Distributed Data Centers, Uptime Institute

Edge data centers are generally small data center facilities — designed for IT workloads up to a few hundred kilowatts — that process data closer to the population of local end users and devices. Earlier in 2021, we surveyed data center owners/operators and product and service suppliers to gauge demand and deployments. The findings were consistent: Data center end-user organizations and suppliers alike expect an uptick in edge data center demand in the near term.
The study suggests a small majority of owners/operators today use between one and five edge data centers and that this is unlikely to change in the next two to three years. This is likely an indicator of overall demand for edge computing — most organizations have some requirement for edge, but do not expect this to change significantly in the short term.
Many others, however, do expect growth. The share of owners/operators that do not use any edge data centers drops from 31% today to 12% in two to three years’ time — indicating a significant increase in owner/operator uptake. Furthermore, the portion of owners/operators in our study using more than 20 edge data centers today is expected to double in the next two to three years (from 9% today to 20% in two to three years), as shown in Figure 1.
Over 90% of respondents in North America are planning to use more than five edge data centers in two to three years’ time, a far higher proportion than the 30% to 60% of respondents in other regions. The largest portion of owners/operators planning to use more than 20 edge data centers within the next few years is in the US and Canada, closely followed by Asia-Pacific and China.
The main drivers for edge data center expansion are the need to reduce latency, in order to improve existing IT services or add new ones, as well as the need to reduce network costs and bandwidth constraints when transporting data over wide distances.
Uptime Institute’s research shows that large deployments of hundreds or more edge data centers are expected for many applications, including telecom networks, the internet of things in oil and gas, retail, cloud gaming, video streaming, public transportation systems, and support for large international industrial companies with multiple sites. Several large-scale edge data center projects are today at a prototyping stage, with trials planned using one to 10 sites. Full-scale deployments involving tens of sites are planned for the coming three years.
The view from suppliers of edge data centers in our study was similarly strong. Most suppliers today help to build, supply or maintain between six and 10 edge data centers per year. Many expect this will grow to 21 to 50 annually in two to three years’ time, as shown in Figure 2. Only a small portion of suppliers built, supplied or maintained more than 100 edge data centers in 2020; looking ahead two to three years, 15% of suppliers in our study expect to handle 100 or more edge data center projects annually (a threefold increase from today).
Suppliers in the US and Canada are particularly bullish; almost one in three (32%) expect a yearly volume of more than 100 edge data centers in two to three years’ time. Those in Europe are also optimistic; roughly one-fifth of suppliers (18%) expect an annual volume of more than 100 edge data centers in the coming two to three years. Some suppliers report a recent change in market demand, with orders and tenders shifting from sporadic requests for a single-digit number of sites to more frequent requests for larger projects with tens of sites. The change may be attributed to a solidification of clients’ corporate strategies for edge computing.
It seems clear that demand for edge data centers is expected to grow across the world in the near term, and especially in North America. Owners, operators and suppliers alike anticipate growth across different industry verticals. The digitization effects of COVID-19, the expansion by big clouds to the edge, and the mostly speculative deployments of 5G are among the key factors driving demand. However, edge business cases are complex to develop, and it is not yet clear that any single use case will drive edge data centers in high volumes.
The full report Demand and speculation fuel edge buildout is available here.
Fastly outage underscores slow creep of digital services risk
By Daniel Bizo, Research Director, Uptime Institute Intelligence

A recent outage at content delivery network Fastly took down thousands of websites in different countries, including big names such as Amazon, Twitter and Spotify, for nearly an hour. It is the latest large outage highlighting the downside of a key trend in digital infrastructure: The growth in dependency on digital service providers can undermine infrastructure resilience and business continuity.
The conundrum here is one of double standards. While many companies expend substantial money and time on hardening and testing their own data center and IT infrastructure against disruptions, they also increasingly rely on cloud service providers whose architectural details and operational practices are largely hidden from them.
Some confusion stems from high expectations and a high level of trust in the resiliency of the highly distributed topologies of cloud and network technologies — often amounting to a leap of faith. Large providers routinely quote five nines or more of availability, or imply that system-wide failure is nearly impossible — but this is clearly not the case. While many do deliver on these promises most of the time, outages still happen. When they do, the impact is often broad and severe.
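For context on what such claims imply, the short sketch below converts an availability percentage into the downtime it allows per year; this is standard arithmetic, not any provider’s published figure.

```python
# Converts an availability percentage into the downtime it allows per year.
# Standard arithmetic, not any provider's published figures.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.9, 99.99, 99.999):
    print(f"{nines}% availability allows "
          f"{downtime_minutes_per_year(nines):.1f} minutes of downtime per year")

# Five nines allows roughly 5.3 minutes of downtime a year; a single
# near-hour-long incident exceeds that budget many times over.
```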
Fastly is far from being a lone culprit. In our report Annual outage analysis 2021: The causes and impacts of data center outages, Uptime Institute counts over 80 major outages at cloud providers in the last five years (not including other digital services and telecommunication providers). The Fastly outage serves as a reminder that cloud- and network-based services are not invulnerable to service interruptions — in fact, they sometimes prove to be surprisingly fragile, susceptible to tiny errors.
Fastly’s own initial postmortem of the event notes that the global outage was a result of a legitimate and valid configuration change by a single customer. That change led to an unexpected effect on the wider system, ultimately taking down dozens of other customers’ services. Fastly said it did not know how the bug made it into production. In some ways, this Fastly outage is similar to an Amazon Web Services (AWS) incident a few years ago, when a single storage configuration error made by an administrator ground some AWS services to a near halt on the US East Coast, causing major disruption.
Major outages at large digital service providers point to a need for greater transparency of their operational and architectural resiliency, so customers can better assess their own risk exposure. Research from Uptime Institute Intelligence shows that only a small proportion of businesses — fewer than one in five in our study — thought they had sufficient understanding of the resilience of their cloud providers to trust them to run their mission-critical workloads (see Figure 1). Nearly a quarter said they would likely use public cloud for such workloads if they gained greater visibility into the cloud providers’ resiliency design and procedures.
Furthermore, Uptime Institute’s surveys have shown that most businesses manage a rich mix of digital infrastructures and services, spanning their own private (on-premises) data centers, leased facilities, hosted capacity, and various public cloud-powered services. This comes at the cost of added complexity, which generates risks, both known and unknown, if not managed with full visibility. In a recent survey, most of those with a view on how their services’ resilience has changed as a result of adopting hybrid environments reported a welcome improvement. Still, one in seven reported a decrease in overall resilience.
The bottom line is that without the possibility of verifying availability and reliability claims from cloud and other IT services providers, and without proactively mitigating the challenges that hybrid architectures pose, enterprises might be signing up to added risks to their business continuity and data security. Fastly’s customers have received a stark reminder of this growing reality at their own expense.