Learning from the OVHcloud data center fire

The fire that destroyed a data center (and damaged others) at the OVHcloud facility in Strasbourg, France, on March 10-11, 2021, has raised a multitude of questions from concerned data center operators and customers around the world. Chief among these is, “What was the main cause, and could it have been prevented?”

Fires at data centers are rare but do occur — Uptime Institute Intelligence has some details of more than 10 data center fires (see our upcoming blog about the frequency of fire incidents). But most of these are quickly isolated and extinguished; it is extremely uncommon for a fire to rage out of control, especially at larger data centers, where strict fire prevention and containment protocols are usually followed. Unfortunately for OVHcloud, the fire occurred just two days after the owners announced plans for a public listing on the Paris Stock Exchange in 2022.

While this Note will address some of the known facts and provide some context, more complete and informed answers will have to wait for the full analysis by OVHcloud, the fire services and other parties. OVHcloud has access to a lot of closed-circuit television and some thermal camera images that will help in the investigation.

OVHcloud

OVHcloud is a high-profile European data center operator and one of the largest hosting companies globally. Founded in 1999 by Octave Klaba, OVHcloud is headquartered in France but has expanded rapidly, with facilities in several countries offering a range of hosting, colocation and cloud services. It has been championed as a European alternative to the giant US cloud operators and is a key participant in the European Union’s GAIA-X cloud project. It has partnerships with big IT services operators, such as Deutsche Telekom, Atos and Capgemini.

Among OVHcloud customers are tens of thousands of small businesses running millions of websites. But it also has many major enterprise, government and commercial customers, including various departments of the French government, the UK’s Driver and Vehicle Licensing Agency (DVLA), and the European Space Agency. Many of these were affected by the fire.

OVHcloud is hailed as a bold innovator, offering a range of cloud services and using advanced low-energy, free-air cooling designs and, unusually for commercial operators, direct liquid cooling. But it has also suffered some significant outages, most notably two serious incidents in 2017. After those, then-Chief Executive Officer and Chairman Octave Klaba spoke of the need for OVHcloud to be “even more paranoid than it is already.” Some critics at the time attributed the outages to poor design and operational practices, coupled with a heavy emphasis on innovation. The need to compete on cost with hyperscale rivals such as Amazon Web Services and Microsoft is an ever-present factor.

The campus at Strasbourg (SBG) is based on a site acquired from ArcelorMittal, a steel and mining company. It houses four data centers serving customers internationally. The oldest and smallest two, SBG1 and SBG4, were originally based on prefabricated containers. SBG2, destroyed by the fire, was a 2 MW facility capable of housing 30,000 servers; it used an innovative free-air cooling system. SBG3, a larger 4 MW facility that was partially damaged, uses an updated design that may have proved more resilient.

Chronology

The fire in SBG2 started after midnight and was picked up by sensors and alarms. Black smoke prevented staff from intervening effectively, and the fire spread rapidly within minutes, going on to destroy the entire data center. Using thermal cameras, firefighters identified two uninterruptible power supplies (UPSs) at the heart of the blaze, one of which had undergone extensive maintenance earlier that day.

All four data centers were out of action in the days immediately following the fire, although all but SBG2 are due to come back online shortly. SBG1 suffered significant damage to some rooms, with recovery expected to take a week or so. Many customers were advised to invoke disaster recovery plans, but OVHcloud has spare capacity in other data centers and has been working to get customers up and running.

Causes, design and operation

Only a thorough root-cause analysis will reveal exactly what happened and whether this fire was preventable. However, customers and ecosystem partners of OVHcloud have highlighted several design and operational issues:

  • UPS and electrical fires. Early indicators point to the failure of a UPS, causing a fire that spread quickly. At least one of the UPSs had been extensively worked on earlier in the day, suggesting a maintenance issue may have been a main contributor. Although it is not best practice, battery cabinets (when using valve-regulated lead-acid, or VRLA, batteries) are often installed next to the UPS units themselves. Although this may not have been the case at SBG2, this type of configuration can allow a UPS fire to heat the batteries until they ignite, causing the fire to spread rapidly.
  • Cooling tower design. SBG2 was built in 2011 using a tower design that relies on convection-based “auto-ventilation.” Cool air enters, passes through a heat exchanger for the (direct liquid) cooling system, and warm air rises through the tower in the center of the building. OVHcloud has four other data centers using the same principle. OVHcloud says this is an environmentally sound, energy efficient design — but since the fire, concerns have been raised that it can act rather like a chimney. Vents that allow external air to enter would need to be shut immediately in the event of a potential fire. (The nearby, newer SBG3 data center, which uses an updated design, suffered less damage.)
  • VESDA and fire suppression. It is being reported that SBG2 had neither a VESDA (very early smoke detection apparatus) system nor a water or gas fire suppression system. Rather, staff relied on smoke detectors and fire extinguishers. It is not known if these reports are accurate. Most data centers do have early detection and fire suppression systems.
  • Backup and cloud services. Cloud (and many hosting) companies cite high availability figures and extremely low figures for data loss. But full storage management and recovery across multiple sites costs extra, especially for hosted services. Many customers, especially smaller ones, pay for basic backup only. Statements from OVHcloud since the fire suggest that some customers will have lost data: some backups were kept in the same data center, or on the same campus, and not all data was replicated elsewhere.

Fire and resiliency certification

Responsibility for fire prevention and building regulations mostly rests with local planning authorities (AHJs, or authorities having jurisdiction), and requirements vary widely across geographies.

Uptime Institute has been asked whether Tier certification would help prevent fires. Uptime’s Chief Technical Officer Chris Brown responds:

“Tiers has limited fire system requirements, and they are geared to how the systems can impact the critical MEP (mechanical, electrical and plumbing) infrastructure. This is the case because in most locations, fire detection and suppression are tightly controlled by life/safety codes. If the Tier standard were to include specific fire detection and suppression requirements, it would add little value and would run the risk of clashing with local codes.

This is always under review.

Tier IV does have a compartmentalization requirement. It requires a 1-hour fire-rated barrier between complementary systems. This is to protect complementary systems from being impacted by a single fire event. This does assume the facility is properly protected by fire suppression systems.”

A separate Uptime Data Center Risk Assessment (DCRA) would document the condition of a fire suppression system (or its absence), the lack of a double-interlocked suppression system, or even a pre-action system that uses only compressed air to charge the lines.

How data center operators can transition to renewable energy

Transitioning to renewable energy use is an important, but not easily achieved, goal. Although the past decade has seen significant improvements in IT energy efficiency, there are indications that this may not continue. Moore’s Law may be slowing, more people are coming online, and internet traffic is growing faster than ever before. As energy consumption increases, data center operators will need to transition to 100% renewable energy use 100% of the time.

Uptime Institute has recently published a report covering the key components of renewable energy sustainability strategies for data centers. The tech industry has made considerable effort to improve energy efficiency and is the largest purchaser of renewable energy. Even so, most data center sustainability strategies still focus on renewable energy certificates (RECs). RECs are now considered low-quality products because they cannot credibly be used to back claims of 100% renewable energy use.

To avoid accusations of greenwashing, data center operators must consider investing in a portfolio of renewable energy products. RECs may play a part, but power purchase agreements (PPAs) are becoming more popular, even though there can be financial risks involved.

A stepwise approach will ease the process. There are four steps that data center operators need to take on the journey toward the use of sustainable, renewable energy.

1. Measure, report, offset

Electricity is just one component of an organization’s carbon footprint, but it is relatively easy to measure because exact consumption is regularly reported for billing purposes. Tracking how much electricity comes from renewable and nonrenewable sources allows decisions to be made about offsets and renewables matching. This breakdown can be obtained from the electricity supplier, or grid-level emissions factors can be used; in the US, for example, the Energy Information Administration publishes these factors annually by state, and other countries provide similar resources. (A more formal methodology is explained in the Greenhouse Gas Protocol Scope 2 guidance.)
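As a rough illustration of the arithmetic, the sketch below multiplies billed electricity consumption by a grid emissions factor to produce a location-based estimate. The monthly figures and the factor are hypothetical placeholders, not published values.

```python
# Minimal sketch of a location-based (grid-average) emissions estimate for a
# data center's electricity use. All numbers are illustrative placeholders,
# not published factors or real consumption data.

# Monthly electricity consumption taken from utility bills (kWh).
monthly_consumption_kwh = [
    1_450_000, 1_380_000, 1_420_000, 1_390_000, 1_460_000, 1_510_000,
    1_580_000, 1_600_000, 1_520_000, 1_470_000, 1_440_000, 1_480_000,
]

# Hypothetical grid emissions factor for the site's region, kg CO2e per kWh.
# In practice this comes from the electricity supplier, the EIA's state-level
# data (US), or an equivalent national source.
GRID_FACTOR_KG_CO2E_PER_KWH = 0.38

annual_kwh = sum(monthly_consumption_kwh)
annual_emissions_tonnes = annual_kwh * GRID_FACTOR_KG_CO2E_PER_KWH / 1_000

print(f"Annual electricity consumption: {annual_kwh:,} kWh")
print(f"Estimated Scope 2 emissions (location-based): "
      f"{annual_emissions_tonnes:,.0f} t CO2e")
```

This annual total (alongside the rest of the organization’s emissions) is the quantity that subsequent offsetting and renewables-matching decisions are measured against.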

Once total emissions from electricity are known (the full organizational emissions also need to be calculated), the next step is to buy offset products to mitigate the existing impact. However, there are significant challenges in ensuring offset quality, so offsetting is only a stopgap measure. Ideally, offsets should be reserved for emissions that cannot be reduced by other means (e.g., by switching to 100% renewable energy).

Measurement and reporting are crucial to understanding carbon footprint and are effectively becoming a necessary function of doing business. The reporting of carbon emissions is becoming a legal requirement for larger companies in some jurisdictions (e.g., the UK). Some companies are starting to require carbon reporting for their suppliers (for example, Apple and Microsoft, because of their own goals to be carbon-neutral/negative by 2030).

Data center operators who are not already tracking carbon emissions associated with electricity purchases may have to invest significant resources to catch up should reporting become required.

2. 100% renewables matching

Ideally, all electricity used should be matched 100% by renewable energy production. So far in the data center industry, 100% renewables matching has generally been achieved by purchasing RECs, but PPAs (direct, financial, or through green supply arrangements) must now take over as the dominant approach. RECs can act as a stopgap between taking no action and using tools such as PPAs, but they should eventually be only a small component of the overall data center sustainability strategy.

3. Power purchase agreements

Purchasing RECs is a reasonable first step in a sustainability strategy but is insufficient on its own. Establishing direct/physical PPAs with a nearby renewable energy generator, combined with their associated RECs, is the gold standard necessary to truly claim 100% renewable energy use. However, even this does not mean 100% renewable energy is actually being used by the data center, just that an amount of purchased renewable energy equivalent to the data center’s energy use was added to the grid. Virtual/financial PPAs are an option where the power market does not allow direct retail PPAs.

Both types of PPA involve pricing risk, but they can also act as a hedge against wholesale price changes. For direct PPAs, the fixed price provides certainty, but if wholesale prices fall, buyers may be stuck in a long-term contract paying more than the current market price. Virtual/financial PPAs introduce further complexity and financial risk: if the wholesale price falls below the strike price agreed at the time of purchase, the buyer must pay the supplier the difference, which may be significant.
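To illustrate the mechanics, the sketch below settles a virtual PPA as a contract for differences over several periods. The strike price, wholesale prices and contracted volumes are entirely hypothetical.

```python
# Minimal sketch of how a virtual (financial) PPA settles as a contract for
# differences. Prices, volumes and the strike price are hypothetical
# illustrations only.

def vppa_settlement(strike_price, wholesale_prices, volumes_mwh):
    """Return the buyer's net settlement across periods.

    Positive values mean the buyer pays the generator (wholesale below
    strike); negative values mean the generator pays the buyer (wholesale
    above strike).
    """
    net = 0.0
    for price, volume in zip(wholesale_prices, volumes_mwh):
        net += (strike_price - price) * volume
    return net

# Hypothetical monthly wholesale prices ($/MWh) and contracted volumes (MWh).
wholesale = [32.0, 28.5, 41.0, 55.0, 30.0, 26.0]
volumes = [10_000] * 6
STRIKE = 35.0  # $/MWh agreed at contract signing

net_payment = vppa_settlement(STRIKE, wholesale, volumes)
if net_payment > 0:
    print(f"Buyer pays generator ${net_payment:,.0f} for the period")
else:
    print(f"Generator pays buyer ${-net_payment:,.0f} for the period")
```

The sign of the settlement is what creates both the hedge and the risk: sustained low wholesale prices turn the contract into a recurring cost for the buyer.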

Despite these risks, the use of PPAs is growing rapidly in the US, particularly in the tech and communications sectors. Operators with advanced sustainability programs have been buying PPAs for several years, either directly through financial/virtual PPAs, or by using green supply agreements through a utility. Our report covers these options in more detail.

4. 24/7 renewable energy use

Most matching (of renewable energy purchased against energy actually used) happens on an annual basis, but shifts in generation (the grid mix) happen at a much finer granularity, often hour by hour. There are strategies to smooth this out: different sources of renewable energy can be combined to create blended PPAs, for example by pairing wind and solar production with storage capacity. This is useful because different sources generate at different times (wind, for instance, can generate energy at night when solar energy is unavailable).

In 2019, Microsoft and Vattenfall announced a new product to provide hourly renewables matching. The pilot started at Microsoft’s Sweden headquarters and will provide hourly matching to a new Azure cloud region in 2021. No data center operator has achieved 24/7 matching for its entire global fleet, although some are almost there for individual sites (e.g., in 2019, Google achieved 96% renewable energy use in Oklahoma [US], and 61% on an hourly basis globally).
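To see why annual matching can overstate progress, the sketch below computes both annual-style and hourly matching from the same data. The hourly load and generation profiles are made up for illustration; the point is that a surplus in one hour cannot cover a shortfall in another.

```python
# Minimal sketch contrasting annual renewables matching with hourly (24/7)
# matching, using made-up hourly profiles.

consumption_mwh = [10, 10, 10, 10, 12, 14, 16, 16]   # hypothetical hourly load
renewables_mwh  = [ 2,  1,  0,  4, 20, 25, 30, 18]   # hypothetical solar-heavy PPA output

# Annual-style matching: total renewable MWh vs. total consumption.
annual_match = min(1.0, sum(renewables_mwh) / sum(consumption_mwh))

# Hourly matching: in each hour, only generation up to that hour's
# consumption counts; surpluses are not carried over.
hourly_matched = sum(min(c, r) for c, r in zip(consumption_mwh, renewables_mwh))
hourly_match = hourly_matched / sum(consumption_mwh)

print(f"Annual-style matching: {annual_match:.0%}")   # reads as 100%
print(f"Hourly (24/7) matching: {hourly_match:.0%}")  # considerably lower
```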

This is the objective: 24/7 renewable energy use, 100% of the time. Matching on an annual basis is not enough to reach the goal of decarbonizing the electricity grid. All demand ― 100% ― must be supplied by 100% renewable energy, 100% of the time. This will, of course, take many decades to achieve in most economies.


The full report Renewable energy for data centers: Renewable energy certificates, power purchase agreements and beyond is available to members of the Uptime Institute. More information about membership is available here.

Climate Change and Digital Infrastructure

Extreme weather affects nearly half of data centers

Recent extreme weather-related events in the US (the big freeze in Texas, fires on the west coast) have once again highlighted the need for data center operators to reassess their risks in the face of climate change. The topic is discussed in depth in the Uptime Institute report (available to Uptime Institute members) entitled, The gathering storm: Climate change and data center resiliency.

Data centers are designed and built to withstand bad weather. But extreme weather is becoming more common, and it can trigger all kinds of unforeseen problems — especially for utilities and support services.

In a recent Uptime Intelligence survey, almost half (45%) of respondents said they had experienced an extreme weather event that threatened their continuous operation — a surprisingly large number. While most said operations continued without problems, nearly one in 10 respondents (8.8%) did suffer an outage or significant service disruption as a result. This makes extreme weather one of the top causes of outages or disruption.



More events, higher costs

The industry — and that means investors, designers, insurers, operators and other contractors — is now braced for more challenging conditions and higher costs in the years ahead. Three in five respondents (59%) think there will be more IT service outages as a direct result of the impact of climate change. Nearly nine in 10 (86%) think that climate change and weather-related events will drive up the cost of data center infrastructure and operations over the next 10 years.

Most operators are very aware of the risks and costs of climate change, but many do not appear to consider their own sites to be facing any immediate challenges. Over a third (36%) report that their management has yet to formally assess the vulnerability of their data centers to climate change. Almost a third (31%) believe they already have adequate protection in place, though it is not clear whether this belief is backed by recent analysis.

Perhaps most dramatically, only one in 20 managers sees a dramatic increase in risks due to climate change and is taking steps to improve resiliency as a result. Such steps can range from simple changes to processes and maintenance to expensive investments in flood barriers, changes to cooling systems or even re-siting and closure.



The need for assessments

Any investments in resiliency need, of course, to be based on sound risk analysis. Uptime Institute strongly recommends that operators conduct regular reviews of climate change-related risks to their data centers. The risk profile for a data center may be far less rosy in 2021 than it was when the facility was built, even just a few years ago. Four in five data center operators (81%) agree that data center resiliency assessments will need to be updated regularly due to the impact of climate change.

As recent events show, such reviews may need to consider water and power grid resilience, potential impacts to roads and staff access, and even the economics of operating for long periods without free cooling.

The figure below shows the top areas typically reviewed by organizations conducting climate change/weather-related data center resiliency assessments.



Data center managers do appear to have a good understanding of what to assess. But the findings also highlight the Achilles heel of data center resiliency: the difficulty of mitigating (or even accurately analyzing) the risks of failures by outside suppliers. Extreme weather events can hit power, water, fuel supplies, maintenance services and staff availability all at once. Good planning, however, can dramatically reduce the impact of such challenges.

Data center staff shortages don’t need to be a crisis

In every region of the world, data center capacity is being dramatically expanded, and the scale of that growth is stretching the critical infrastructure sector’s talent supply. The availability (or lack) of specialist staff will be an increasing concern for all types of data centers, from fast-growing hyperscale facilities to smaller private enterprise sites.

Uptime Institute forecasts that global data center staff requirements will grow from about 2.0 million full-time equivalent employees in 2019 to nearly 2.3 million in 2025 (see Figure 1). This estimate, the sector’s first, covers more than 230 specialist job roles across different types and sizes of data centers, with varying criticality requirements, from design through operation.


Figure 1. Global data center staff projections

Our research shows that the proportion of data center owners or operators globally that are having difficulty finding qualified candidates for open jobs rose to 50% in 2020. While there is hope that new technologies to manage and operate facilities will reduce staff burdens over time, their effect is expected to be limited, at least until 2025.

There is also a concern that many employees in some mature data center markets, such as the US and Western Europe, are due to retire around the same time, causing an additional surge in demand, particularly for senior roles.

However, the growth in demand does not need to represent a crisis. Individual employers can take steps to address the issue, and the sector can act together to raise the profile of opportunities and to improve recruitment and training. Globally, the biggest employers are investing in more training and education, by not only developing internal programs but also working with universities/colleges and technical schools. Of course, additional training requires additional resources, but more operators of all sizes and types are beginning to view this type of investment as a necessity.

Education and background requirements for many job roles may also need to be revisited. In reality, most jobs do not require a high level of formal education, even in positions where employers have initially demanded it. Relevant experience, an internship or traineeship, or on-the-job training can often compensate for the lack of a formal qualification.

The growing, long-term requirement for more trained people has also caught the attention of private equity and other investors. More are backing data center facilities management suppliers, which offer services that can help overcome skills shortages. Raising awareness of the opportunities and offering training can be part of the investment. While the data center sector faces staff challenges, with focused investment, industry initiatives and more data center-specific education programs, it can rise to the challenge.

Several resources related to this topic are available to members of the Uptime Institute community, including “The people challenge: Global data center staffing forecast 2021-2025” and “Critical facility management: Guidance on using third parties.” Click here to find out more about joining our community.

Extreme cold — a neglected threat to availability?

In Uptime Institute’s recent report on preparing for the extreme effects of climate change, there were over a dozen references to the dangers of extremely hot weather, which can overwhelm cooling systems and trigger regional fires that disrupt power, connectivity and staff access.

But the effects of extreme cold were discussed only in passing. The main thermal challenge for a data center, after all, is keeping temperatures down, and most data centers subject to extreme cold (e.g., Facebook’s data center in Luleå, Sweden, just outside the Arctic Circle) have usually been designed with adequate protective measures in place.

But climate change (if that is the cause) is known to trigger wild swings in the weather, and that may include, for some, a period of unexpected extreme cold. This occurred in Texas this month (February 2021), when the state experienced record cold temperatures, breaking the previous record from 1909 by some margin. Temperatures in Austin (TX) on Monday, February 15, fell to 4 degrees Fahrenheit (-16 degrees Celsius), with a wind chill effect taking it down to -16 degrees Fahrenheit (-27 degrees Celsius).

The impact on digital infrastructure was dramatic. First, the grid shut down for more than seven hours, affecting two million homes and forcing data centers to use generators. This, reportedly, was due to multiple failures in the power grid, from the shutdown of gas wells and power plants (due mainly to frozen components and loss of power for pumping gas) to low power generation from renewable sources (due to low wind/solar availability and frozen wind turbines). The failures have triggered further discussions about the way the Texas grid is managed and the amount of capacity it can call on at critical times. In addition, AT&T and T-Mobile reported some issues with connectivity services.

Data center managers struggled with multiple issues. Those successful in moving to generator power faced fuel delivery issues due to road conditions, while anyone buying power on the spot markets saw a surge in power prices (although most data center operators buy at a fixed price). The city of Austin’s own data center was one of those that suffered a lengthy outage.

All this raises the question: What can data center staff do to reduce the likelihood of an outage or service degradation due to low temperatures (and possible snowy/icy conditions)? Below we provide advice from Uptime’s Chief Technical Officer, Chris Brown.

For backup power systems (usually diesel generators):

• Check start battery condition.
• Check diesel additive to ensure it is protective below the anticipated temperatures.
• Ensure block heaters and jacket water heaters are operational.
• Check filters, as they are more likely to clog at low temperatures.

For cooling systems:

• Ensure freeze protection is in place or de-icing procedures are followed on cooling towers. Consider reversing fans to remove built-up ice.
• Ensure all externally mounted equipment is rated for the anticipated temperatures. (Direct expansion compressors and air-cooled chillers will not operate in extreme temperatures.)
• Ensure freeze protection on all external piping is operational.
• Evaluate the use of free cooling where available.

And of course, remember to ensure critical staff are housed near the data center in case transportation becomes an issue. Also consider reducing some IT loads and turning off a generator or two; this is better than running generators for long periods at a low load.

Given the weather-related February 2021 failure of the Texas grid at a critical time, it may also be advisable for all data center operators to review the resiliency and capacity of their local energy utilities, especially with regard to planning for extreme weather events, including heat, cold, rain and wind. Increasing use of renewable energy may require that greater reserve capacity is available.

For more information on climate change and weather risks, along with the litany of new challenges facing today’s infrastructure owners and operators, consider becoming a member of Uptime Institute. Members enjoy an entire portfolio of experiential knowledge and hands-on understanding from more than 100 of the world’s most respected companies. Members can access our report The gathering storm: Climate change and data center resiliency.

Network problems causing ever more outages

Power failures have always been one of the top causes of serious IT service outages. The loss of power to a data center can be devastating, and its consequences have fully justified the huge expense and effort that go into preventing such events.

But in recent years, other causes are catching up, with networking issues now emerging as one of the more common — if not the most common — causes of downtime. In our most recent survey of nearly 300 data center and IT service operators, network issues were cited as the most common reason for any IT service outage — more common even than power problems (see Figure 1).


Figure 1. Among survey respondents, networking issues were the most common cause of IT service outages in the past three years.


The reasons are clear enough: modern applications and data are increasingly spread across and among data centers, making the network ever more critical. To add to the mix, software-defined networking has brought great flexibility and programmability, but it has also introduced failure-prone complexity.

Delving a little deeper confirms the complexity diagnosis. Configuration errors, firmware errors, and corrupted routing tables all play a big role, while the more traditional worries of weather and cable breaks are a relatively minor concern. Congestion and capacity issues also cause failures, but these are often themselves the result of programming/configuration issues.
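As one purely illustrative example of the kind of guardrail that can catch this class of error before deployment, the sketch below compares a proposed routing configuration against the running one and flags critical prefixes that would disappear. The data structures and prefixes are invented for this example; real operations would rely on vendor or intent-based validation tooling rather than a hand-rolled script.

```python
# Illustrative sketch only: a simple pre-change check that compares a proposed
# router configuration against the running one and flags any critical prefixes
# that would be removed. All names and prefixes are hypothetical.

RUNNING_ROUTES = {
    "10.0.0.0/8":      "core-uplink-1",
    "172.16.0.0/12":   "core-uplink-2",
    "192.168.10.0/24": "mgmt-vlan",
}

PROPOSED_ROUTES = {
    "10.0.0.0/8":      "core-uplink-1",
    "192.168.10.0/24": "mgmt-vlan",
    # 172.16.0.0/12 accidentally dropped in the change request
}

CRITICAL_PREFIXES = {"10.0.0.0/8", "172.16.0.0/12"}

def missing_critical_routes(running, proposed, critical):
    """Return critical prefixes present in the running config but absent
    from the proposed one."""
    return sorted((set(running) - set(proposed)) & critical)

problems = missing_critical_routes(RUNNING_ROUTES, PROPOSED_ROUTES, CRITICAL_PREFIXES)
if problems:
    print("Change blocked; critical routes would be removed:", problems)
else:
    print("No critical routes removed; change can proceed to review.")
```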

Networks are complex not only technically, but also operationally. While enterprise data centers may be served by only one or two providers, multi-carrier colocation hubs can be served by many telecommunications providers. Some of these links may, further down the line, share cables or facilities — adding possible overlapping points of failure or capacity pinch points. Ownership, visibility and accountability can also be complicated. These factors combined help account for the fact that 39% of survey respondents said they had experienced an outage caused by a third-party networking issue — something over which they had little control (see Figure 2).


Figure 2. Configuration/change management failures caused almost half of all network-related outages reported by survey respondents.


A few of those organizations that avoided any network-related downtime put this down to luck (!). (We know of one operator who suffered two separate, unrelated critical cable breaks at the same time.) But the majority of those who avoided downtime credited a factor that is more controllable: investment in systems and training (see Figure 3).


Figure 3. Over three-quarters of survey respondents that avoided network-related downtime attributed it to their investments in resiliency and training.

The Bottom Line: As with the prevention of power issues, money spent on expertise, redundancy, monitoring, diagnostics and recovery — along with staff training and processes — will be paid back with more hours of uptime.