Learning from the OVHcloud data center fire
The fire that destroyed a data center (and damaged others) at the OVHcloud facility in Strasbourg, France, on March 10-11, 2021, has raised a multitude of questions from concerned data center operators and customers around the world. Chief among these is, “What was the main cause, and could it have been prevented?”
Fires at data centers are rare but do occur — Uptime Institute Intelligence has some details of more than 10 data center fires (see our upcoming blog about the frequency of fire incidents). But most of these are quickly isolated and extinguished; it is extremely uncommon for a fire to rage out of control, especially at larger data centers, where strict fire prevention and containment protocols are usually followed. Unfortunately for OVHcloud, the fire occurred just two days after the owners announced plans for a public listing on the Paris Stock Exchange in 2022.
While this Note addresses some of the known facts and provides some context, more complete and informed answers will have to wait for the full analysis by OVHcloud, the fire services and other parties. OVHcloud has access to extensive closed-circuit television footage and some thermal camera images that will help in the investigation.
OVHcloud
OVHcloud is a high-profile European data center operator and one of the largest hosting companies globally. Founded in 1999 by Octave Klaba, OVHcloud is centered in France but has expanded rapidly, with facilities in several countries offering a range of hosting, colocation and cloud services. It has been championed as a European alternative to the giant US cloud operators and is a key participant in the European Union’s GAIA-X cloud project. It has partnerships with big IT services operators, such as Deutsche Telekom, Atos and Capgemini.
Among OVHcloud’s customers are tens of thousands of small businesses running millions of websites. But it also has many major enterprise, government and commercial customers, including various departments of the French government, the UK’s Driver and Vehicle Licensing Agency, and the European Space Agency. Many have been affected by the fire.
OVHcloud is hailed as a bold innovator, offering a range of cloud services and using advanced low-energy, free air cooling designs and, unusually for commercial operators, direct liquid cooling. But it has also suffered some significant outages, most notably two serious incidents in 2017. After that, then-Chief Executive Officer and chairman Octave Klaba spoke of the need for OVHcloud to be “even more paranoid than it is already.” Some critics at the time believed these outages were due to poor design and operational practices, coupled with a high emphasis on innovation. The need to compete on cost with large-scale competitors (Amazon Web Services, Microsoft and others) is an ever-present factor.
The campus at Strasbourg (SBG) is based on a site acquired from ArcelorMittal, a steel and mining company. It houses four data centers, serving customers internationally. The oldest and smallest two, SBG1 and SBG4, were originally based on prefab containers. SBG2, destroyed by the fire, was a 2 MW facility capable of housing 30,000 servers. It used an innovative free air cooling system. SBG3, a newer 4 MW facility that was partially damaged, uses a newer design that may have proved more resilient.
Chronology
The fire in SBG2 started after midnight and was picked up by sensors and alarms. Black smoke prevented staff from intervening effectively. The fire spread within minutes, destroying the entire data center. Using thermal cameras, firefighters identified that two uninterruptible power supplies (UPSs) were at the heart of the blaze, one of which had been extensively worked on that morning.
All of the data centers have been out of action in the days immediately following the fire, although all but SBG2 are due to come back online shortly. SBG1 suffered significant damage to some rooms, with recovery planned to take a week or so. Many customers were advised to invoke disaster recovery plans, but OVHcloud has spare capacity in other data centers and has been working to get customers up and running.
Causes, design and operation
Only a thorough root-cause analysis will reveal exactly what happened and whether this fire was preventable. However, the many customers and ecosystem partners of OVHcloud have highlighted some design and operational issues:
- UPS and electrical fires. Early indicators point to the failure of a UPS, causing a fire that spread quickly. At least one of the UPSs had been extensively worked on earlier in the day, suggesting a maintenance issue may have been a main contributor. Although it is not best practice, battery cabinets (when using valve-regulated lead-acid, or VRLA, batteries) are often installed next to the UPS units themselves. This may not have been the case at SBG2, but in this type of configuration a UPS fire can heat up the batteries until they start to burn, causing the fire to spread rapidly.
- Cooling tower design. SBG2 was built in 2011 using a tower design that has convection-cooling based “auto-ventilation.” Cool air enters, passes through a heat exchanger for the (direct liquid) cooling system, and warm air rises through the tower in the center of the building. OVHcloud has four other data centers using the same principle. OVHcloud says this is an environmentally sound, energy-efficient design, but since the fire, concerns have been raised that it can act rather like a chimney. Vents that allow external air to enter would need to be shut immediately in the event of a potential fire (the nearby, newer SBG3 data center, which uses an updated design, suffered less damage).
- VESDA and fire suppression. It is being reported that SBG2 had neither a VESDA (very early smoke detection apparatus) system nor a water or gas fire suppression system. Rather, staff relied on smoke detectors and fire extinguishers. It is not known if these reports are accurate. Most data centers do have early detection and fire suppression systems.
- Backup and cloud services. Cloud (and many hosting) companies cite high availability figures and extremely low figures for data loss. But full storage management and recovery across multiple sites costs extra, especially for hosted services. Many customers, especially smaller ones, pay for basic backup only. Statements from OVHcloud since the fire suggest that some customers will have lost data. Some backups were in the same data center, or on the same campus, and not all data was replicated elsewhere.
Fire and resiliency certification
Fire prevention, like building regulations, is mostly the responsibility of local planning authorities (AHJs, or authorities having jurisdiction). Requirements vary widely across geographies.
Uptime Institute has been asked whether Tier certification would help prevent fires. Uptime’s Chief Technical Officer Chris Brown responds:
“Tiers has limited fire system requirements, and they are geared to how the systems can impact the critical MEP (mechanical, electrical and plumbing) infrastructure. This is the case because in most locations, fire detection and suppression are tightly controlled by life/safety codes. If the Tier standard were to include specific fire detection and suppression requirements, it would add little value and would run the risk of clashing with local codes.
This is always under review.
Tier IV does have a compartmentalization requirement. It requires a 1 hour fire-rated barrier between complementary systems. This is to protect complementary systems from being impacted by a single fire event. This does assume the facility is properly protected by fire suppression systems.”
A separate Uptime Data Center Risk Assessment (DCRA) would document the condition (or absence) of a fire suppression system, the lack of a double-interlocked suppression system, and even a pre-action system using only compressed air to charge the lines.