Too big to fail? Facebook’s global outage

The bigger the outage, the greater the need for explanations and, most importantly, for taking steps to avoid a repeat.

By any standards, the outage that affected Facebook on Monday, October 4th, was big. For more than six hours, Facebook and its other businesses, including WhatsApp, Instagram and Oculus VR, disappeared from the internet — not just in a few regions or countries, but globally. So many users and machines kept retrying these websites that the resulting traffic slowed parts of the wider internet and caused issues with some cellular networks.

Facebook is large enough to ride out the immediate financial impact, but that impact should not be dismissed. Market watchers estimate that the outage cost Facebook roughly $60 million in revenue over its more than six-hour duration. The company’s shares fell 4.9% on the day, which translated into more than $47 billion in lost market capitalization.

Facebook may recover those losses, but the bigger ramifications may be reputational and legal. Uptime Institute research shows that the level of outages from hyperscale operators is similar to that experienced by colocation companies and enterprises — despite their huge investments in distributed availability zones and global load and traffic management. In 2020, Uptime Institute recorded 21 cloud/internet giant outages, with associated financial and reputational damage. With antitrust, data privacy and, most recently, children’s mental health concerns swirling about Facebook, the company is unlikely to welcome further reputational and legal scrutiny.

What was the cause of Facebook’s outage? The company said an errant command was issued during planned network maintenance. An automated auditing tool would ordinarily catch such a command, but a bug in the tool prevented it from doing so. The command led to configuration changes on Facebook’s backbone routers, which coordinate network traffic among its data centers. This had a cascading effect that halted Facebook’s services.

Setting aside theories of deliberate sabotage, there is evidence that Facebook’s internet routes (announced via the Border Gateway Protocol, or BGP) were withdrawn by mistake as part of these configuration changes.

BGP is the mechanism by which large internet routers constantly exchange information about the possible routes for delivering network packets. In effect, BGP provides very long, constantly updated lists of potential routing paths. When Facebook stopped announcing its presence — something observed by sites that monitor and manage internet traffic — other networks could not find it.
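
The simplified sketch below illustrates the principle: once a network’s announcements are withdrawn, other routers have no path left to it. The prefix, autonomous system numbers and function names are reserved documentation values and assumptions, not Facebook’s actual routing data.

```python
# Simplified illustration of BGP route advertisement and withdrawal.
# Prefixes and AS numbers are reserved documentation values; real BGP
# involves policy, best-path selection and incremental updates between peers.

routing_table = {}  # prefix -> list of AS paths learned from peers

def advertise(prefix, as_path):
    """A peer announces that it can reach `prefix` via `as_path`."""
    routing_table.setdefault(prefix, []).append(as_path)

def withdraw(prefix):
    """All routes for `prefix` are withdrawn."""
    routing_table.pop(prefix, None)

def best_path(prefix):
    """Return the shortest known AS path, or None if the prefix is unreachable."""
    paths = routing_table.get(prefix)
    return min(paths, key=len) if paths else None

# Normal operation: two paths to a (documentation) prefix are known.
advertise("203.0.113.0/24", ["AS64500", "AS64496"])
advertise("203.0.113.0/24", ["AS64501", "AS64502", "AS64496"])
print(best_path("203.0.113.0/24"))   # ['AS64500', 'AS64496']

# A faulty configuration change withdraws the announcements:
withdraw("203.0.113.0/24")
print(best_path("203.0.113.0/24"))   # None: no route, the network effectively disappears
```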

One factor that exacerbated the outage is that Facebook has an atypical internet infrastructure design, specifically related to BGP and another three-letter acronym: DNS, the domain name system. While BGP functions as the internet’s routing map, the DNS serves as its address book. (The DNS translates human-friendly names for online resources into machine-friendly internet protocol addresses.)
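
As a minimal illustration of the “address book” role, the Python standard-library call below translates a name into IP addresses; during the outage, equivalent lookups for Facebook’s domains failed because the authoritative name servers could not be reached. The domain used here is a reserved example name, not Facebook’s.

```python
import socket

# Translate a human-friendly name into machine-friendly IP addresses.
# "example.com" is a reserved name used purely for illustration.
try:
    for family, _, _, _, sockaddr in socket.getaddrinfo("example.com", 443,
                                                        proto=socket.IPPROTO_TCP):
        print(family.name, sockaddr[0])
except socket.gaierror as err:
    # Applications saw this class of error during the outage, once resolvers
    # could no longer obtain answers for Facebook's domains.
    print("DNS resolution failed:", err)
```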

Facebook acts as its own DNS registrar and runs its own name servers, which manage and serve its domain records. Because of Facebook’s architecture — designed to improve flexibility and control — when the BGP configuration error occurred, this DNS infrastructure went offline with it. (As an aside, this caused some domain tools to erroneously show that the Facebook.com domain was available for sale.) As a result, internet service providers and other networks simply could not find Facebook’s network.

How did this then cause a slowdown of the internet? Billions of systems, including mobile devices running a Facebook-owned application in the background, were constantly requesting new “coordinates” for these sites. The answers to such requests are ordinarily cached by DNS servers at the edge, but once the BGP routes disappeared, those cached entries expired and could not be refreshed. Requests were instead routed upstream to large DNS servers in core data centers.
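
A minimal sketch of that caching behavior, under the assumption of a simple time-to-live (TTL) cache (real resolvers apply many more rules, including caching failures):

```python
import time

cache = {}  # name -> (answer, expiry_timestamp)

def resolve(name, query_upstream, ttl=60):
    """Serve a cached answer while it is fresh; otherwise ask upstream.

    `query_upstream` stands in for the recursive lookup that ultimately
    reaches the domain's authoritative servers; `ttl` is an assumed value.
    """
    entry = cache.get(name)
    now = time.time()
    if entry and entry[1] > now:
        return entry[0]              # answered locally, no upstream traffic
    answer = query_upstream(name)    # cache miss or expiry: traffic goes upstream
    cache[name] = (answer, now + ttl)
    return answer

# Once the authoritative servers became unreachable, every expiry turned into
# an upstream query that failed, and clients immediately asked again.
```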

The situation was compounded by a self-reinforcing feedback loop, caused in part by application logic and in part by user behavior. Web applications will not accept a failed lookup or routing error as an answer to a request and so they retry, often aggressively. Users and their mobile devices running these applications in the background also won’t accept an error and will repeatedly reload the website or restart the application. The result was an increase of up to 40% in DNS request traffic, which slowed down other networks (and, therefore, increased latency and timeouts for other web applications). The increased traffic also reportedly led to issues with some cellular networks, including users being unable to make voice-over-IP phone calls.
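
Well-behaved clients damp this kind of amplification by retrying with exponential backoff and jitter rather than hammering the resolver. The sketch below is generic client logic with assumed parameter values, not any particular vendor’s code.

```python
import random
import time

def request_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a failing request with exponential backoff plus jitter.

    `send` is any callable that raises an exception on failure; the delay
    parameters are illustrative defaults, not a standard.
    """
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                     # give up after the last attempt
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay))  # jitter spreads retries out

# An aggressive client, by contrast, loops with no delay at all, multiplying the
# load on resolvers at exactly the moment they are least able to absorb it.
```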

Facebook’s outage was initially caused by routine network maintenance gone wrong, but the error was missed by an auditing tool and propagated via an automated system, both of which were likely built by Facebook. The command error reportedly blocked remote administrators from reverting the configuration change. What’s more, the people with access to Facebook’s physical routers (in Facebook’s data centers) did not have access to the network/logical system. This suggests two things: the network maintenance auditing tool and process were inadequately tested, and there was a lack of specialized staff with network-system access physically present inside Facebook’s data centers.

When the only people who can remedy a potential network maintenance problem rely on the network that is being worked on, it seems obvious that a contingency plan needs to be in place.

Facebook, which like other cloud/internet giants has rigorous processes for applying lessons learned, should be better protected next time. But Uptime Institute’s research shows there are no guarantees — cloud/internet giants are particularly vulnerable to network and software configuration errors, a function of their complexity and the interdependency of many data centers, zones, systems and separately managed networks. Ten of the 21 outages in 2020 that affected cloud/internet giants were caused by software/network errors. That these errors can cause traffic pileups that can then snarl completely unrelated applications globally will further concern all those who depend on publicly shared digital infrastructure – including the internet.

More detailed information about the causes and costs of outages is available in our report Annual outage analysis 2021: The causes and impacts of data center outages.

Vertiv’s DCIM ambitions wither on Trellis’ demise

Operators often say that data center infrastructure management (DCIM) software is a necessary evil. Modern facilities need centralized, real-time management and analytics, but DCIM is notoriously difficult to deploy. The software is also not easy to rip and replace — yet some operators unexpectedly will have to. Vertiv, a pioneer and leader in equipment monitoring and management, recently told customers it has discontinued its flagship DCIM platform Trellis. Support for existing contracts will end in 2023.

In addition to dismaying these customers, Vertiv’s decision to drop Trellis raises questions about the company’s overall strategy — and value proposition — for data center management software.

Vertiv’s software business has fallen far in the past decade. Its Aperture asset management software was the most widely used DCIM product as recently as 2015 (when the company was Emerson Network Power). Trellis launched with fanfare in 2012 and became one of the most expansive DCIM platforms, promising to integrate data center facility and IT equipment monitoring.

For years, Trellis was the centerpiece of Vertiv’s strategy toward more intelligent, more automated data centers. It was once positioned as a rival to today’s leading DCIM products, notably from Schneider Electric and Nlyte Software. At one point, IBM and HP were Trellis resellers.

Ultimately, however, Trellis was both overengineered and underwhelming. It was built on the Oracle Fusion application development system so that new functionality and software could be layered on top of it. This was a benefit for customers active in the Oracle environment, but not for most others; the architecture was too sprawling and heavy. Vertiv says this was a primary reason for discontinuing the product — it was simply too big and complex.

Other factors were at play. Over several years, operators reported various problems and concerns with Trellis. For example, while Trellis’ asset management functionality was eventually strong, for years it lagged Aperture, the product it was designed to replace. (Vertiv discontinued Aperture in 2017, with support for existing contracts ending by 2023.) This left customers to choose between maintaining a discontinued product or migrating to a newer one with less functionality. Many ended up migrating to a competitor’s platform.

Vertiv’s overall share of the DCIM market has steadily shrunk. Trellis, Aperture, Data Center Planner, SiteScan – all are well-known Vertiv software products that are now discontinued. (There is no date yet for when support for existing SiteScan contracts will end.) Trellis, moreover, was the proof point for Vertiv’s software ambitions.

So where is Vertiv heading? The company is concentrating on a few remaining software products that are solid but narrowly focused. It is investing in Environet, which it acquired in 2018 when it bought Geist Global, a relatively small vendor of rack power distribution and related sensors. At the time, Environet had a small, loyal customer base. Vertiv has since extended Environet’s power and thermal monitoring, as well as its facilities space management features. Yet the software lacks the more sophisticated predictive maintenance and, notably, the IT device management capabilities of Trellis and other DCIM suites.

Separately, Vertiv continues to develop its IT out-of-band management software, as well as its iCOM-S thermal management tools. Vertiv’s recent deal to acquire the electrical supplier E+I Engineering (for $1.2 billion) means it could potentially add more power monitoring features. 

Taken as a portfolio, Vertiv appears underinvested in software. This is not just when compared with some of its equipment rivals, but also in the context of the overall evolution of data center management.  

The future of data center management will increasingly be predictive, automated and remote — thanks to sophisticated DCIM software driven by artificial intelligence (AI) and other big-data approaches. New developments such as AI-driven DCIM as a service are helping move smaller facilities, including at the edge, toward this vision. At the same time, DCIM deployed on-premises in larger, mission-critical facilities is becoming increasingly AI-driven, predictive, and able to assimilate a variety of data to extend real-time alarming and closed-loop automation. Schneider and Eaton, for example, both offer cloud-delivered predictive DCIM. Several DCIM vendors are investing in more AI to boost the accuracy and remit of analysis. And earlier this year, Schneider said it was expanding development of its long-standing and comprehensive on-premises DCIM platform, EcoStruxure for Data Centers. (Schneider has also quietly standardized its tools and processes to make it easier to migrate other vendors’ DCIM data into its software, among other things.)
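
As a purely illustrative sketch of one building block of such analytics-driven monitoring, the code below flags a sensor reading that departs sharply from its recent rolling average. The window size, tolerance and sensor values are assumptions; this is not a description of any vendor’s product.

```python
from collections import deque

def make_anomaly_detector(window=12, tolerance=0.15):
    """Return a checker that alarms when a reading deviates from its rolling mean."""
    history = deque(maxlen=window)
    def check(reading):
        baseline = sum(history) / len(history) if history else reading
        history.append(reading)
        return abs(reading - baseline) > tolerance * baseline   # True -> raise an alarm
    return check

check_supply_temp = make_anomaly_detector()
for temp_c in [22.0, 22.3, 22.1, 22.4, 27.8]:   # a sudden jump in supply-air temperature
    print(temp_c, "ALARM" if check_supply_temp(temp_c) else "ok")
```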

While Vertiv has a long-held strategy to embed more software in its facilities equipment for monitoring and management, it has not executed on, or even articulated, broader ambitions. As one of the largest equipment and services suppliers to data centers, Vertiv, like its peers, does not need to be a primary supplier of software to succeed. However, DCIM is increasingly one of the most critical management levers available to data center operators. Major infrastructure product and service suppliers often view DCIM as a strategic product line – in part because it should increase the “stickiness” of their core business. More importantly, the software’s analytics and control features can show how a supplier’s broader offerings fit into an intelligent infrastructure.

Startups brew new chemistries for fresh battery types

Through their public commitments to net-zero carbon emission targets, cloud providers have re-energized discussion of a major redesign of critical power systems within the data center sector. The elimination of diesel fuel use is chief among the goals, but the desire for a leaner facility power infrastructure that closely tracks IT capacity needs is also attracting interest. New types of batteries, among other tools such as clean fuels, will likely prove instrumental in this endeavor.

Operators have already started displacing standard valve-regulated lead-acid (VRLA) banks with lithium-ion (Li-ion) batteries to refashion data center backup power. According to the upcoming 2021 Uptime Institute Global Data Center Survey report, nearly half of operators have adopted the technology for their centralized uninterruptible power supply (UPS) plant, up from about a quarter three years ago. Compared with VRLA, Li-ion batteries are smaller, easier to monitor and maintain, and require less stringent climatic controls. (See our post: Lithium-ion batteries for the data center: Are they ready for production yet? for more technical details.)

In addition, Li-ion batteries have a much longer service life, which offsets most of the higher initial cost of Li-ion packs. The change in chemistry also makes possible use cases that were not previously feasible with VRLA — for example, demand-response schemes in which operators can use their Li-ion UPS systems for peak load shaving when there is a price incentive from the electric utility.
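
To make the demand-response idea concrete, here is a minimal decision sketch for peak shaving with a UPS battery. The price threshold, reserve floor and power figures are assumed values, and the logic is illustrative rather than any vendor’s actual control scheme.

```python
# Illustrative peak-shaving logic: discharge the UPS battery when the spot
# price spikes, but never below a reserve that protects backup runtime.

def peak_shave_kw(spot_price, threshold_price, battery_soc, reserve_soc, max_discharge_kw):
    """Return how much battery power (kW) to feed the load this interval."""
    if spot_price <= threshold_price:
        return 0.0              # grid power is cheap enough: do not discharge
    if battery_soc <= reserve_soc:
        return 0.0              # protect the backup-runtime requirement first
    return max_discharge_kw     # shave the peak with battery power

# Example: price spike to $0.40/kWh against a $0.20/kWh threshold, with the
# battery at 90% state of charge and a 60% reserve floor.
print(peak_shave_kw(0.40, 0.20, 0.90, 0.60, 250.0))   # -> 250.0 kW from the battery
```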

But these advantages are not without significant trade-offs. According to recent research by the US-based National Fire Protection Association, Li-ion batteries (regardless of their specific chemistries and construction) represent a higher fire risk than VRLA. Worse still, Li-ion fires are notoriously difficult to suppress, as the breakdown of cells produces combustible gases — a situation that creates the conditions for life-threatening explosions. Even though some chemistries, such as lithium iron phosphate, have a comparatively lower risk of fire (due to a higher ignition point and no release of oxygen), the possibility of thermal runaway persists.

Then there is the matter of sustainability. VRLA batteries are effectively fully recycled to produce new batteries, at no cost to the owner (owing to lead toxicity-related regulation). That is not the case for large Li-ion battery packs, where the industrial base for recycling has yet to develop — most Li-ion-powered products haven’t reached their end of life. Repackaging and repurposing packs for less demanding applications seems a more likely route than breaking them down into raw materials, but is still challenging. Furthermore, supply chains for Li-ion production, along with their social and environmental impacts, remain opaque to buyers.

These issues have already given some data center operators and their contractors pause. But they have also created an opening in the market for other branches of battery chemistry. One promising family for high-power applications relies on sodium (Na) ions instead of lithium as the main charge carrier, while developments in iron-air batteries raise hopes of building economical and safe energy storage for very long backup times.

Na-ion cells pose no thermal runaway risk and offer much higher power-to-capacity ratios than Li-ion. Even better, they can repeatedly discharge their full energy at very high rates, in one to two minutes, then fully recharge in 10 to 20 minutes without overheating. Durability is also better than that of Li-ion cells.
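
A rough calculation, using the common convention that C-rate is the inverse of the time to a full discharge in hours (and ignoring efficiency losses and depth-of-discharge limits), shows how demanding those figures are:

```python
# Back-of-the-envelope C-rates implied by the quoted discharge and recharge times.

def c_rate(minutes_to_full):
    """C-rate under the simple convention C = 1 / (hours to full discharge or charge)."""
    return 60.0 / minutes_to_full

print(f"Full discharge in 1-2 minutes  -> roughly {c_rate(2):.0f}C to {c_rate(1):.0f}C")
print(f"Full recharge in 10-20 minutes -> roughly {c_rate(20):.0f}C to {c_rate(10):.0f}C")
```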

However, a significant disadvantage is the size of Na-ion batteries compared with state-of-the-art Li-ion (although they are still more compact than VRLA). Taken together, these characteristics make Na-ion batteries interesting for UPS applications with shorter runtime targets (e.g., up to 10 minutes). More importantly, perhaps, Na-ion seems safe for installation in data halls, where operators can use it in distributed UPS topologies that closely match power service levels to IT capacity requirements. Social and environmental impacts from sodium battery manufacturing also appear less concerning, as these batteries tend to use no rare earth elements or minerals tainted by the conditions of their extraction (although practical recyclability is an outstanding question).

Multiple companies have already attracted funding to commercialize sodium batteries for various use cases, both stationary and mobile (including industrial vehicles); examples include the startups Natron Energy (US) and Faradion (UK), as well as the established battery maker CATL (China). The fabrication process for Na-ion batteries shares many commonalities with Li-ion, making industrialization relatively straightforward — producers can tap into a large pool of toolmakers and manufacturing know-how. All the same, it will take another few years (and hundreds of millions of dollars) for Na-ion battery production to ramp up.

Meanwhile, US startup Form Energy has recently raised $200 million (bringing total life-to-date funding past $325 million) to commercialize an entirely different type of battery, one that aims to solve not the power problem but the energy capacity problem. The company says it has figured out how to make effective iron-air batteries that can store energy for long periods and discharge it slowly, over many hours (the design target is 100 hours), when needed. Because of the inexpensive ingredients, Form claims that iron-air cells cost a small fraction (15%) of Li-ion per kilowatt-hour.

Form Energy is now targeting 2025 for the first battery deployments. There seems to be a catch, however: space. According to the firm, a Form Energy system that can hold a 3-megawatt load for four days occupies an acre of land. Even if this can be reduced by lowering runtime to, say, 24 hours, it will still require several times the footprint of a diesel generator set and associated fuel storage. Nonetheless, if product development goes to plan, Form Energy may provide the data center sector with a new piece of its power technology puzzle.
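
Some rough arithmetic based on the figures quoted above (and the simplifying assumption that footprint scales linearly with stored energy) illustrates the space issue:

```python
# Footprint implied by "3 MW for four days on one acre", scaled to a shorter runtime.

ACRE_M2 = 4046.86           # square meters per acre

load_mw = 3.0
quoted_runtime_h = 96.0     # four days
quoted_area_acres = 1.0

area_per_mwh = quoted_area_acres / (load_mw * quoted_runtime_h)   # ~1 acre per 288 MWh

for runtime_h in (96.0, 24.0):
    energy_mwh = load_mw * runtime_h
    acres = energy_mwh * area_per_mwh
    print(f"{runtime_h:>4.0f} h at {load_mw:.0f} MW -> {energy_mwh:.0f} MWh, "
          f"~{acres:.2f} acres (~{acres * ACRE_M2:,.0f} m^2)")
```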

In upcoming reports, Uptime Institute Intelligence will explore developments in energy storage, power infrastructure, and operational methods that may make the stretch goal of a diesel-free data center achievable.

Eyeing an uptick in edge data center demand

Edge data centers are generally small data center facilities — designed for IT workloads up to a few hundred kilowatts — that process data closer to the population of local end users and devices. Earlier in 2021, we surveyed data center owners/operators and product and service suppliers to gauge demand and deployments. The findings were consistent: Data center end-user organizations and suppliers alike expect an uptick in edge data center demand in the near term.

The study suggests a small majority of owners/operators today use between one and five edge data centers and that this is unlikely to change in the next two to three years. This is likely an indicator of overall demand for edge computing — most organizations have some requirement for edge, but do not expect this to change significantly in the short term.

Many others, however, do expect growth. The share of owners/operators that do not use any edge data centers drops from 31% today to 12% in two to three years’ time — indicating a significant increase in owner/operator uptake. Furthermore, the portion of owners/operators in our study using more than 20 edge data centers today is expected to double in the next two to three years (from 9% today to 20% in two to three years), as shown in Figure 1.

Figure 1. Edge data center use is expected to grow from low numbers

Over 90% of respondents in North America are planning to use more than five edge data centers in two to three years’ time, a far higher proportion than the 30% to 60% of respondents in other regions. The largest portion of owners/operators planning to use more than 20 edge data centers within the next few years is in the US and Canada, closely followed by Asia-Pacific and China.

The main drivers for edge data center expansion are a need to reduce latency to improve or add new IT services, as well as requirements to reduce network costs and/or bandwidth constraints in data transport over wide distances.
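
A rough propagation-delay calculation shows why proximity matters for latency. The distances are assumed examples, and the figures cover fiber propagation only, ignoring routing, queuing and processing delays.

```python
# Approximate round-trip propagation delay over optical fiber.

FIBER_KM_PER_MS = 200.0      # light travels roughly 200 km per millisecond in fiber

def round_trip_ms(distance_km):
    return 2 * distance_km / FIBER_KM_PER_MS

for label, km in [("distant regional data center", 1500), ("metro edge site", 50)]:
    print(f"{label:30s} {km:>5d} km away -> ~{round_trip_ms(km):.1f} ms round trip")
```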

Uptime Institute’s research shows that large deployments of hundreds or more edge data centers are expected for many applications, including telecom networks, the internet of things in oil and gas, retail, cloud gaming, video streaming, public transportation systems, and the expansion of large international industrial companies with multiple sites. Several large-scale edge data center projects are today at a prototyping stage, with trials planned using one to 10 sites. Full-scale deployments involving tens of sites are planned for the coming three years.

The view from suppliers of edge data centers in our study was similarly strong. Most suppliers today help to build, supply or maintain between six and 10 edge data centers per year. Many expect this will grow to 21 to 50 annually in two to three years’ time, as shown in Figure 2. Only a minimal portion of suppliers built, supplied or maintained more than 100 edge data centers in 2020; looking ahead two to three years, 15% of suppliers in our study expect to be handling 100 or more edge data center projects annually (a threefold increase from today).

Figure 2: Suppliers expect an edge data center expansion

Suppliers in the US and Canada are particularly bullish; almost one in three (32%) expect a yearly volume of more than 100 edge data centers in two to three years’ time. Those in Europe are also optimistic; roughly one-fifth of suppliers (18%) expect an annual volume of more than 100 edge data centers in the coming two to three years. Some suppliers report a recent change in market demand, with orders and tenders going from being sporadic and for a single-digit number of sites, to more requests for larger projects with tens of sites. The change may be attributed to a solidification of clients’ corporate strategy for edge computing.

It seems clear that demand for edge data centers is expected to grow across the world in the near term and especially in North America. Owners, operators and suppliers alike anticipate growth across different industry verticals. The digitization effects of COVID-19, the expansion by big clouds to the edge, and the mostly speculative deployments of 5G are among key factors driving demand. However, there is complexity involved in developing edge business cases and it is not yet clear that any single use case will drive edge data centers in high volumes.

The full report Demand and speculation fuel edge buildout is available here.  

Fastly outage underscores slow creep of digital services risk

A recent outage at content delivery network Fastly took down thousands of websites in different countries, including big names, such as Amazon, Twitter and Spotify, for nearly an hour. It is the latest large outage highlighting the downside of a key trend in digital infrastructure: The growth in dependency on digital service providers can undermine infrastructure resilience and business continuity.

The conundrum here is one of double standards. While many companies expend substantial money and time on hardening and testing their own data center and IT infrastructure against disruptions, they also increasingly rely on cloud service providers whose architectural details and operational practices are largely hidden from them.

Some confusion stems from high expectations and a high level of trust in the resiliency of the highly distributed topologies of cloud and network technologies — often amounting to a leap of faith. Large providers routinely quote five nines (99.999%) or better availability, or imply that system-wide failure is nearly impossible — but this is clearly not the case. While many do deliver on these promises most of the time, outages still happen. When they do, the impact is often widespread.
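
Putting those availability claims into perspective, the short calculation below converts “nines” into allowed downtime per year; the comparison with the outages described here uses the durations reported above.

```python
# Allowed downtime per year implied by availability expressed in "nines".

MINUTES_PER_YEAR = 365.25 * 24 * 60

def allowed_downtime_minutes(nines):
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for nines in (3, 4, 5):
    print(f"{nines} nines -> ~{allowed_downtime_minutes(nines):.1f} minutes of downtime per year")

# Five nines allows roughly 5 minutes per year; the Fastly incident lasted close
# to an hour, and Facebook's outage ran for more than six hours.
```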

Fastly is far from being a lone culprit. In our report Annual outage analysis 2021: The causes and impacts of data center outages, Uptime Institute counts over 80 major outages at cloud providers in the last five years (not including other digital services and telecommunication providers). The Fastly outage serves as a reminder that cloud- and network-based services are not invulnerable to service interruptions — in fact, they sometimes prove to be surprisingly fragile, susceptible to tiny errors.

Fastly’s own initial postmortem of the event notes that the global outage was a result of a legitimate and valid configuration change by a single customer. That change led to an unexpected effect on the wider system, ultimately taking down dozens of other customers’ services. Fastly said it did not know how the bug made it into production. In some ways, this Fastly outage is similar to an Amazon Web Services (AWS) incident a few years ago, when a single storage configuration error made by an administrator ground some AWS services to a near halt on the US East Coast, causing major disruption.

Major outages at large digital service providers point to a need for greater transparency of their operational and architectural resiliency, so customers can better assess their own risk exposure. Research from Uptime Institute Intelligence shows that only a small proportion of businesses — fewer than one in five in our study — thought they had sufficient understanding of the resilience of their cloud providers to trust them to run their mission-critical workloads (see Figure 1). Nearly a quarter said they would likely use public cloud for such workloads if they gained greater visibility into the cloud providers’ resiliency design and procedures.

Figure 1. More would use public cloud if providers gave better visibility into options and operations

Furthermore, Uptime Institute’s surveys have shown that most businesses manage a rich mix of digital infrastructures and services, spanning their own private (on-premises) data centers, leased facilities, hosted capacity, and various public cloud-powered services. This comes at the cost of added complexity, which generates risks, both known and unknown, if not managed with full visibility. In a recent survey, among those who had a view on how their services’ resilience has changed as a result of adopting hybrid environments, a strong majority reported a welcome improvement. Still, one in seven reported a decrease in overall resilience.

The bottom line is that without the possibility of verifying availability and reliability claims from cloud and other IT services providers, and without proactively mitigating the challenges that hybrid architectures pose, enterprises might be signing up to added risks to their business continuity and data security. Fastly’s customers have received a stark reminder of this growing reality at their own expense.

Green tariff renewable energy purchases

Until recently, power purchase agreements (PPAs) and unbundled renewable energy certificates (RECs) were the primary means for data center operators or managers to procure renewable electricity and RECs for their operations. Many companies are not comfortable with the eight- to 20-year terms and financial risks of a PPA. Unbundled RECs are an ongoing expense and do not necessarily help to increase the amount of renewable power generated.

As the supply of renewably generated electricity has grown across markets, utilities and energy retailers are offering sophisticated retail renewable energy supply contracts, referred to as green tariff contracts, which physically deliver firm (24/7 reliable) renewable power to data center facilities. The retailer creates a firm power product by combining renewable power, brown power (generated from fossil fuels, such as natural gas, and from nuclear) and RECs from a defined set of generation projects within the grid region serving the facility.

The base of the contract is committed renewable generation capacity from one or more renewable energy projects, which delivers the megawatt-hours (MWh) of electricity and the associated RECs. Brown power is purchased from the spot market to fill supply gaps and ensure 24/7 supply at the meter (see schematic below). All the power purchased under the contract is physically supplied to the data center operator’s facility and matched with RECs.
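
The hourly settlement logic can be sketched numerically. The load and generation figures below are invented for illustration; the point is simply that the committed renewable output covers part of the load, spot-market brown power fills the gaps, and the brown portion is matched with RECs.

```python
# Illustrative hourly settlement of a green tariff contract (made-up values,
# not any retailer's actual contract terms).

hourly_load_mwh      = [10, 10, 12, 14, 12, 10]   # facility demand
hourly_renewable_mwh = [ 4, 12, 15,  6,  0,  9]   # committed wind/solar output

total_load = sum(hourly_load_mwh)
renewable_used = sum(min(load, gen) for load, gen in zip(hourly_load_mwh, hourly_renewable_mwh))
brown_fill = sum(max(load - gen, 0) for load, gen in zip(hourly_load_mwh, hourly_renewable_mwh))
recs_to_match = brown_fill    # brown purchases are matched with RECs from the grid region

print(f"Load: {total_load} MWh | delivered renewable: {renewable_used} MWh | "
      f"spot brown fill: {brown_fill} MWh | RECs to match: {recs_to_match}")
```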



In contrast to PPAs, these contracts have typical terms of four to eight years; charge a retail price inclusive of transmission, distribution costs and fees; and physically deliver 24/7 reliable power to the consuming facility. Perhaps most importantly, the financial and supply management risks inherent in a PPA are carried by the retailer. This relieves the data center operator or manager of the responsibility of managing a long-term energy supply contract. In exchange for carrying these risks, the retailer receives a 1% to 5% premium over a grid power supply contract. Paying a fixed retail price for the life of the contract brings budget certainty to the data center operator. The contract term of four to eight years better fits the business planning horizon of most data center operators. It eliminates the uncertainty associated with estimating the location, size and energy demand of a data center operations portfolio over the full term of a 20- to 25-year PPA.
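
For a sense of what the 1% to 5% retail premium means in practice, here is a small worked example; the consumption volume and grid price are assumed figures, not market data.

```python
# Rough annual cost of the green tariff premium, for illustration only.

annual_consumption_mwh = 50_000    # assumed facility consumption
grid_price_per_mwh = 80.0          # assumed standard retail supply price, in $/MWh

for premium in (0.01, 0.05):       # the 1% to 5% range quoted above
    green_price = grid_price_per_mwh * (1 + premium)
    extra_cost = (green_price - grid_price_per_mwh) * annual_consumption_mwh
    print(f"{premium:.0%} premium -> ${green_price:.2f}/MWh, "
          f"about ${extra_cost:,.0f} per year over standard grid supply")
```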

The use of green tariff contracts often does not enable the purchaser to claim additionality for the renewable electricity purchases, which may be important for some. Many green tariff contracts procure energy from older hydro facilities (in the US and France), from generation facilities coming off subsidy (in the US and Germany) or from operating generators that are selling into the spot market and are looking for a firm, reliable price over a defined period for their output (in the US, EU and China). These latter situations do not satisfy the additionality criteria. But in situations where the contract procures renewably generated electricity through a new PPA or a repowered renewable energy facility, additionality may be claimed.

But most importantly, green tariff contracts do allow data center operators to claim that they are powering their facilities with a quantified number of MWh of renewably generated power, and that they are matching their brown power purchases with RECs from renewable generation in their grid region. A green tariff purchase enables the data center operator, the energy retailer and the grid region authority to develop experience in deploying renewable generation assets to directly power a company’s operations.

This approach is arguably more valuable and responsible than buying renewably generated power under a virtual power purchase agreement (VPPA), where there is no connection between energy generation and consumption. In the case of a VPPA, the dispatch of the generation into the market can increase price volatility and has the potential to destabilize grid reliability over time. This has been demonstrated by recent market events in Germany and in the CAISO (California Independent System Operator), ERCOT (Electric Reliability Council of Texas) and SPP (Southwest Power Pool) grid regions in the US.

Green tariffs encourage the addition of more renewable power in the grid region. Requests for green tariff contracts signal to the energy market that data center operators want more renewable procurement options for the power they consume — a signal the market is responding to, as demonstrated by the increase in these offerings in many electricity markets over the past five years.

Uptime Institute believes that a green tariff contract offers many data center operators a business-friendly solution for the procurement of renewably generated electricity. But like PPAs, green tariff electricity contracts are complex vehicles. Data center operators need to develop a basic understanding of the market to fully appreciate the many nuances of the issues that must be settled to finalize a contract. It may be advisable to work with a third party with trading and market experience to identify and minimize any price and supply risks in the contract.