• Link to X
  • Link to LinkedIn
  • Link to Mail
  • ABOUT UI
    • Business Partners
    • Careers
    • Contact Us
    • News & Press
    • Our Team
    • Press Releases
    • Branding Guidelines
  • CONTACT
Uptime Institute Blog
  • Journal
    • Journal Home
    • Executive
    • Operations
    • Design
  • AI Services
    • AI Infrastructure Advisory
    • AI Custom Support
  • Tier Certification
    • Overview
    • Design
    • Construction
    • Operations
    • Tier Gap Analysis
    • Prefabricated/Modular
    • Tier Certifications List
  • Professional Services
    • Overview
    • Infrastructure Services
    • Management and Operations Services
    • Energy and Sustainability Services
    • Consulting Services
  • Education
    • Course Details
    • Course Calendar
    • Competency & Confidence Assessments
    • Private Education
    • Graduate Roster
  • Events
    • Industry Events
    • Leadership Events
    • Network Events
  • Network
    • Overview
    • Network Calendar
    • Network Roster
    • Request Corporate Access
    • Request Guest Access
    • Uptime Network Portal
  • Intelligence
  • Clients
    • Client Stories
  • Resources
    • Data Center Industry Surveys
    • Ebooks
    • Journal Blog
    • Product Datasheets
    • Research & Reports
    • Tier Specification Documents
    • Tools
    • Webinars
  • Click to open the search input field Click to open the search input field Search
  • Menu Menu
Blog - Latest News
Too big to fail? Facebook’s global outage

Too big to fail? Facebook’s global outage

October 12, 2021/in Executive, News, Operations/by Rhonda Ascierto, Vice President, Research, Uptime Institute

The bigger the outage, the greater the need for explanations and, most importantly, for taking steps to avoid a repeat.

By any standards, the outage that affected Facebook on Monday, October 4th, was big. For more than six hours, Facebook and its other businesses, including WhatsApp, Instagram and Oculus VR, disappeared from the internet — not just in a few regions or countries, but globally. So many users and machines kept retrying these websites, it caused a slowdown of the internet and issues with cellular networks.

While Facebook is large enough to ride through the immediate financial impact, it should not be dismissed. Market watchers estimate that the outage cost Facebook roughly $60 million in revenues over its more than six-hour period. The company’s shares fell 4.9% on the day, which translated into more than $47 billion in lost market cap.

Facebook may recover those losses, but the bigger ramifications may be reputational and legal. Uptime Institute research shows that the level of outages from hyperscale operators is similar to that experienced by colocation companies and enterprises — despite their huge investments in distributed availability zones and global load and traffic management. In 2020, Uptime Institute recorded 21 cloud/internet giant outages, with associated financial and reputational damage. With antitrust, data privacy and, most recently, children’s mental health concerns swirling about Facebook, the company is unlikely to welcome further reputational and legal scrutiny.

What was the cause of Facebook’s outage? The company said there was an errant command issued during planned network maintenance. While an automated auditing tool would ordinarily catch an errant command, there was a bug in the tool that didn’t properly stop it. The command led to configuration changes on Facebook’s backbone routers that coordinate network traffic among its data centers. This had a cascading effect that halted Facebook’s services.

Setting aside theories of deliberate sabotage, there is evidence that Facebook’s internet routes (Border Gateway Protocol, or BGP) were withdrawn by mistake as part of these configuration changes.

BGP is a mechanism for large internet routers to constantly exchange information about the possible routes for them to deliver network packets. BGP effectively provides very long lists of potential routing paths that are constantly updated. When Facebook stopped broadcasting its presence — something observed by sites that monitor and manage internet traffic — other networks could not find it.

One factor that exacerbated the outage is that Facebook has an atypical internet infrastructure design, specifically related to BGP and another three-letter acronym: DNS, the domain name system. While BGP functions as the internet’s routing map, the DNS serves as its address book. (The DNS translates human-friendly names for online resources into machine-friendly internet protocol addresses.)

Facebook has its own DNS registrar, which manages and broadcasts its domain names. Because of Facebook’s architecture — designed to improve flexibility and control — when its BPG configuration error happened, the Facebook registrar went offline. (As an aside, this caused some domain tools to erroneously show that the Facebook.com domain was available for sale.) As a result, internet service providers and other networks simply could not find Facebook’s network.

How did this then cause a slowdown of the internet? Billions of systems, including mobile devices running a Facebook-owned application in the background, were constantly requesting new “coordinates” for these sites. These requests are ordinarily cached in servers located at the edge, but when the BGP routes disappeared, so did those caches. Requests were routed upstream to large internet servers in core data centers.

The situation was compounded by a negative feedback loop, caused in part by application logic and in part by user behavior. Web applications will not accept a BGP routing error as an answer to a request and so they retry, often aggressively. Users and their mobile devices running these applications in the background also won’t accept an error and will repeatedly reload the website or reboot the application. The result was an up to 40% increase in DNS request traffic, which slowed down other networks (and, therefore, increased latency and timeout requests for other web applications). The increased traffic also reportedly led to issues with some cellular networks, including users being unable to make voice-over-IP phone calls.

Facebook’s outage was initially caused by routine network maintenance gone wrong, but the error was missed by an auditing tool and propagated via an automated system, which were likely both built by Facebook. The command error reportedly blocked remote administrators from reverting the configuration change. What’s more, the people with access to Facebook’s physical routers (in Facebook’s data centers) did not have access to the network/logical system. This suggests two things: the network maintenance auditing tool and process were inadequately tested, and there was a lack of specialized staff with network-system access physically inside Facebook’s data centers.

When the only people who can remedy a potential network maintenance problem rely on the network that is being worked on, it seems obvious that a contingency plan needs to be in place.

Facebook, which like other cloud/internet giants has rigorous processes for applying lessons learned, should be better protected next time. But Uptime Institute’s research shows there are no guarantees — cloud/internet giants are particularly vulnerable to network and software configuration errors, a function of their complexity and the interdependency of many data centers, zones, systems and separately managed networks. Ten of the 21 outages in 2020 that affected cloud/internet giants were caused by software/network errors. That these errors can cause traffic pileups that can then snarl completely unrelated applications globally will further concern all those who depend on publicly shared digital infrastructure – including the internet.

More detailed information about the causes and costs of outages is available in our report Annual outage analysis 2021: The causes and impacts of data center outages.

Share this:

  • Share on Facebook (Opens in new window) Facebook
  • Share on X (Opens in new window) X
  • Share on LinkedIn (Opens in new window) LinkedIn
  • Share on Reddit (Opens in new window) Reddit
  • Share on WhatsApp (Opens in new window) WhatsApp
  • Email a link to a friend (Opens in new window) Email
Tags: Data Center, IT Outages, Outage
https://journal.uptimeinstitute.com/wp-content/uploads/2021/10/Facebook-Global-Outage-a.jpg 520 1200 Rhonda Ascierto, Vice President, Research, Uptime Institute https://journal.uptimeinstitute.com/wp-content/uploads/2022/12/uptime-institute-logo-r_240x88_v2023-with-space.png Rhonda Ascierto, Vice President, Research, Uptime Institute2021-10-12 10:07:362021-10-12 10:33:38Too big to fail? Facebook’s global outage
You might also like
Retail vs wholesale: finding the right colo pricing model Retail vs wholesale: finding the right colo pricing model
Learning from the OVHcloud data center fire
Crypto data centers: The good, the bad and the electric Crypto data centers: The good, the bad and the electric
COVID19: Digital Infrastructure Q&A COVID-19: Q&A (Part 1): Staff Management
Why are governments investigating cloud competitiveness? Why are governments investigating cloud competitiveness?
Are data center workforce initiatives effective? Are data center workforce initiatives effective?
Free Air Cooling Data Center Free Air Cooling Trends
Data Center Security Data center insecurity: Online exposure threatens critical systems

Content Categories

  • Journal Home
  • Executive
  • Operations
  • Design

Subscribe to Journal via Email

Enter your email address to subscribe to Uptime Institute Journal and receive notifications of new articles by email.

  • Recent

Tags

Accredited Tier Designer (9) AI (21) artificial intelligence (16) ATD (10) Carbon Emissions (7) Climate Change (13) Cloud (22) Cloud Computing (17) Cloud Costs (15) Cloud Infrastructure (29) Cloud Migration (8) Colocation (6) cooling (9) Data Center (252) Data Center Availability (40) Data Center Cooling (13) Data Center Design (45) Data Center Disaster Recovery (7) Data Center Energy Efficiency (34) Data Center Facilities Management (43) Data Center Operations (66) data center power (8) Data Center Staffing (18) DCIM (9) digital Infrastructure (117) energy (8) Energy Efficiency (38) Environmental Sustainability (18) IT (7) IT Efficiency (16) IT Outages (10) M&O (6) outages (11) Public Cloud (7) PUE (10) Regulations (24) Resiliency (9) security (7) Sustainability (34) Sustainability Reporting (7) Tier Certification (26) Tier Certification Constructed Facility (16) Uptime Institute FORCSS (6) Uptime Institute Network (13) Uptime Institute Symposium (6)
© 2014-2025 Uptime Institute, LLC All rights reserved.
  • Link to X
  • Link to LinkedIn
  • Link to Mail
Link to: Vertiv’s DCIM ambitions wither on Trellis’ demise Link to: Vertiv’s DCIM ambitions wither on Trellis’ demise Vertiv’s DCIM ambitions wither on Trellis’ demiseVertiv’s DCIM ambitions wither on Trellis’ demise Link to: New ASHRAE guidelines challenge efficiency drive Link to: New ASHRAE guidelines challenge efficiency drive ASHRAE GuidelinesNew ASHRAE guidelines challenge efficiency drive
Scroll to top Scroll to top Scroll to top