Major data center fire highlights criticality of IT services
Uptime Institute’s outages database suggests data center fires are infrequent, and rarely have a significant impact on operations. Uptime has identified 14 publicly reported, high-profile data center outages caused by fire or fire suppression systems since 2020. The frequency of fires is not increasing relative to the IT load or number of data centers but, uncontained, they are potentially disastrous to facilities, and subsequent outages can be ruinous for the business.
SK Group, South Korea’s second-largest conglomerate, is the latest high-profile organization to suffer a major data center fire, after a blaze broke out at a multistory colocation facility operated by its SK Inc. C&C subsidiary in Pangyo (just south of Seoul) on October 15, 2022. According to police reports, the fire started in a battery room before spreading quickly to the rest of the building. It took firefighters around eight hours to bring the blaze under control.
While there were no reported injuries, this incident could prove to be the largest data center outage caused by fire to date. It is a textbook example of how seemingly minor incidents can escalate to wreak havoc through cascading interdependencies in IT services.
The incident took tens of thousands of servers offline, including not only SK Group’s own systems but also the IT infrastructure running South Korea’s most popular messaging and single sign-on platform, KakaoTalk. The outage disrupted KakaoTalk’s integrated mobile payment system, transport app, gaming platform and music service, all of which are used by millions. The outage also affected domestic cloud giant Naver (the “Google of South Korea”), which reported disruption to its online search, shopping, media and blogging services.
While SK Group has yet to disclose the root cause of the fire, Kakao, the company behind KakaoTalk, has pointed to the lithium-ion (Li-ion) batteries deployed at the facility, which were manufactured by SK on, another SK Group subsidiary. In response, SK Group has released what it claims are records from its battery management system (BMS) showing no deviation from normal operations prior to the incident. Some local media reports contradict this, however, claiming multiple warnings were, in fact, generated by the BMS. Only a thorough investigation will settle these claims. In the meantime, both sides are reported to be “lawyering up.”
The fallout from the outage is not limited to service disruptions or lost revenue, and has prompted a statement from the country’s president, Yoon Suk-yeol, who has promised a thorough investigation into the causes of, and the extent of the damages arising from, the fire. The incident has, so far, led to a police raid on SK Inc. C&C headquarters; the resignation of Kakao co-CEO Whon Namkoong; and the establishment of a national task force for disaster prevention involving military officials and the national intelligence agency. Multiple class-action lawsuits against Kakao are in progress, mainly based on claims that the company has prioritized short-term profits over investment in more resilient IT infrastructure.
The South Korean government has announced a raft of measures aimed at preventing large-scale digital service failures. All large data centers will now be subject to disaster management procedures defined by the government, including regular inspections and safety drills. Longer-term, the country’s Ministry of Science and ICT will be pushing for the development of battery technologies posing a lower fire risk — a matter of national interest for South Korea, home to some of the world’s largest Li-ion cell manufacturers including Samsung SDI and LG Chem, in addition to SK on.
The fire in South Korea will inevitably draw comparisons with the fire that brought down the OVHcloud Strasbourg facility in 2021. That incident affected some 65,000 customers, many of whom lost their data in the blaze (see Learning from the OVHcloud data center fire). As in Pangyo, the fire was thought to have involved uninterruptible power supply (UPS) systems. According to the French Bureau of Investigation and Analysis on Industrial Risks (BEA-RI), the lack of an automatic fire extinguishing system, a delayed electrical cutoff and the building design all contributed to the spread of the blaze.
A further issue arising from this outage, and one that remains to be determined, is the financial cost to SK Group, Kakao and Naver. The fire at the OVHcloud Strasbourg facility was estimated to cost the operator more than €105 million — with less than half of this being covered by insurance. The cost of the fire in Pangyo is likely to run into tens (if not hundreds) of millions of dollars. This should serve as a timely reminder of the importance of fire suppression, particularly in battery rooms.
Li-ion batteries in mission-critical applications — risk creep?
Li-ion batteries present a greater fire risk than valve-regulated lead-acid batteries, regardless of their specific chemistries and construction, a position endorsed by the US National Fire Protection Association and others. The breakdown of cells in Li-ion batteries releases combustible gases as well as oxygen, which can result in a major thermal-runaway event: the fire spreads uncontrollably between cells, across battery packs and, potentially, even between cabinets if these are inappropriately distanced. As a result, the fires they cause are notoriously difficult to suppress.
Many operators have, hitherto, found the risk-reward profile of Li-ion batteries (in terms of their smaller footprint and longer lifespan) to be acceptable. Uptime surveys show major UPS vendors reporting strong uptake of Li-ion batteries in data center and industrial applications: some vendors report shipping more than half their major three-phase UPS systems with Li-ion battery strings. According to the Uptime Institute Global Data Center Survey 2021, nearly half of operators have adopted this technology for their centralized UPS plants, up from about a quarter three years ago. The Uptime Institute Global Data Center Survey 2022 found Li-ion adoption levels to be increasing still further (see Figure 1).
The incident at the SK Inc. C&C facility highlights the importance of selecting appropriate fire suppression systems, and the importance of fire containment as part of resiliency. Most local regulation governing fire prevention and mitigation concentrates (rightly) on securing people’s safety, rather than on protecting assets. Data center operators, however, have other critically important issues to consider — including equipment protection, operational continuity, disaster recovery and mean time to recovery.
While gaseous (or clean agent) suppression is effective at slowing down the spread of a fire in the early stages of Li-ion cell failure (when coupled with early detection), it is arguably less suitable for handling a major thermal-runaway event. The cooling effects of water and foam mean these are likely to perform better; double-interlock pre-action sprinklers also limit the spread. Placing battery cabinets farther apart can help prevent or limit the spread of a major fire. Dividing battery rooms into fire-resistant compartments (a measure mandated by Uptime Institute’s Tier IV resiliency requirements) can further decrease the risk of facility-wide outages.
Such extensive fire prevention measures could, however, compromise the benefits of Li-ion batteries in terms of their higher volumetric energy density, lower cooling needs and overall advantage in lifespan costs (particularly where space is at a premium).
Advances in Li-ion chemistries and cell assembly will address operational safety concerns; lithium iron phosphate, with its higher ignition point and no release of oxygen during decomposition, is a case in point. Longer term, inherently safer, innovative chemistries (such as sodium-ion and nickel-zinc) will probably offer a more lasting solution to the safety (and sustainability) conundrum around Li-ion. Until then, the growing prevalence of large quantities of Li-ion batteries in data centers means the propensity for violent fires can only grow, with potentially dire financial consequences.
By: Max Smolaks, Analyst, Uptime Institute Intelligence and Daniel Bizo, Research Director, Uptime Institute Intelligence