Data Center Security

Data center insecurity: Online exposure threatens critical systems

In early March 2021, a hacker group publicly exposed the username and password of an administrative account of a security camera vendor. The credentials enabled them to access 150,000 commercial security systems and, potentially, set up subsequent attacks on other critical equipment. A few weeks earlier, leaked credentials for the collaboration software TeamViewer gave hackers a way into a system controlling a city water plant in Florida (US). They remotely raised the sodium hydroxide setting to a dangerous level (the attack was detected, and harm avoided).

These are just some of the most recent examples of exploits where critical infrastructure was disrupted by remote access to IT systems, including some high-profile attacks at power plants.

The threat of cybersecurity breaches also applies to physical data centers, and it is growing. Cloud computing, increased automation and remote monitoring have broadened the attack surface. (See our recent report Data center security: Reassessing physical, human and digital risks.)

So, how widespread is the problem of insecure facility assets? Our research of vulnerable systems on the open internet suggests it is not uncommon.

For close to a decade, the website Shodan has been used by hackers, benevolent and malevolent, to search for targets. Instead of fetching results that are webpages, Shodan crawls the internet for devices and industrial control systems (ICSs) that are connected to the internet and left exposed.

Shodan and similar search engine websites (BinaryEdge, Censys and others) provide a compendium of port-scan data (locating open ports, which are a path to attack) on the internet. Expert users identify interesting characteristics about certain systems and set out to gain as much access as they can. Automation tools make the process more efficient, speeding up and also expanding what is possible for an exploit (e.g., by defeating login safeguards).

In a recent demonstration of Shodan for the Uptime Institute, the cybersecurity firm Phobos Group showed more than 98,000 ICSs exposed globally, including data center equipment and devices. Phobos quickly discovered access to the login screens of control systems for most major data center equipment providers. In Figure 1 (as in all figures), screenshots of aggregate search results are shown with specific details hidden to ensure privacy.

The login process itself can be highly problematic. Sometimes installers or users do not change the default credentials supplied by the manufacturers, which can often be found online. During our demonstration, for example, Phobos used a default login to gain access to the control system for cooling units supplied by a widely used data center equipment vendor. If this exercise were carried out by a genuine intruder, they would be able to change setpoint temperatures and alarms.
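One way to catch unchanged factory logins before an intruder does is to audit the equipment inventory against a list of known defaults. The following is a minimal sketch in Python, not a description of the method Phobos used. The `DeviceLogin` record, the `KNOWN_DEFAULTS` entries and the device names are all hypothetical; a real audit would draw on the operator's own asset register and on the defaults published in each vendor's documentation.

```python
from dataclasses import dataclass

# Hypothetical factory-default logins, for illustration only.
KNOWN_DEFAULTS = {("admin", "admin"), ("admin", "password"), ("root", "root")}


@dataclass(frozen=True)
class DeviceLogin:
    """One login recorded for a piece of facility equipment (hypothetical schema)."""
    device: str
    username: str
    password: str


def devices_with_default_logins(inventory, known_defaults=KNOWN_DEFAULTS):
    """Return the names of devices whose login matches a known default."""
    return [
        entry.device
        for entry in inventory
        if (entry.username.lower(), entry.password) in known_defaults
    ]


# Hypothetical inventory: the first unit still uses a factory login.
inventory = [
    DeviceLogin("cooling-unit-01", "admin", "admin"),
    DeviceLogin("cooling-unit-02", "ops", "k7#Vq2!xLp"),
]
flagged = devices_with_default_logins(inventory)
```

In this sketch `flagged` contains only the first unit, which is the one an installer would need to revisit.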

Users’ customized login credentials can sometimes be obtained from a data breach of one service and then used by a hacker to try to log into another service, a type of cyberattack known as credential stuffing. The availability of lists of credentials has proliferated, and automated credential-stuffing tools have become more sophisticated, using bots to thwart traditional login protections. (Data breaches can happen without leaving any trace in corporate systems and can go undetected.)
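A common defense against credential stuffing is to reject, at the moment a password is set, any password that already appears in a breach corpus. The sketch below illustrates the idea under stated assumptions: `breached_hashes` is a hypothetical local set standing in for a real corpus, and SHA-1 is used only as a lookup key (some public corpuses are distributed as SHA-1 hashes), never as a way to store passwords.

```python
import hashlib


def sha1_hex(password: str) -> str:
    """Return the uppercase hexadecimal SHA-1 digest used as a lookup key."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def is_known_breached(candidate: str, breached_hashes: set[str]) -> bool:
    """Return True if the candidate password appears in the breach corpus."""
    return sha1_hex(candidate) in breached_hashes


# Hypothetical stand-in for a breach corpus loaded from a file or service.
breached_hashes = {sha1_hex("Summer2020!"), sha1_hex("password123")}

reuse_risk = is_known_breached("Summer2020!", breached_hashes)
fresh_password_ok = not is_known_breached("k7#Vq2!xLp", breached_hashes)
```

A check of this kind does not replace multifactor authentication; it only narrows the set of passwords a credential-stuffing bot can succeed with.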

As cybersecurity exploits of critical infrastructure in recent years have shown, control system interfaces may be the primary targets — but access to them is often through another system. Using the Shodan tool, the security company Phobos searched for exposed remote desktops, which can then provide access to multiple systems. This method can be particularly troubling if a control system is accessible through a remote desktop and if the user employs the same or similar passwords across systems.

There are many remote desktops exposed online. As Figure 2 shows, in a recent Shodan search, over 86,700 remote desktops were exposed in the US city of Ashburn, Virginia, alone (a city known as the world’s data center capital). This list includes a set of addresses for a global data center capacity provider (not shown).

Password reuse is one of the biggest security vulnerabilities humans introduce, but it can be minimized with training and tools, and by multifactor authentication where practicable. Installers and users should also be prevented from removing password protection controls (another vulnerability that Phobos demonstrated). There are also cybersecurity tools to continuously scan for assets exposed online and to provide attack simulations. Services used at some facilities include threat intelligence and penetration tests on IP addresses and infrastructure. Low-tech approaches such as locked workstations and clean-desk policies also help protect sensitive information.
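The continuous scanning mentioned above can be approximated, very roughly, by checking whether management ports on an operator's own addresses accept connections from outside. The following is a minimal sketch using only Python's standard library, intended for hosts the operator owns and is authorized to test. The `MANAGEMENT_PORTS` list is a hypothetical starting point, and a commercial exposure-monitoring tool does considerably more than this.

```python
import socket

# Hypothetical shortlist of ports worth checking; adjust to the actual estate.
MANAGEMENT_PORTS = {
    443: "HTTPS login page",
    502: "Modbus/TCP",
    3389: "remote desktop (RDP)",
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat as not reachable.
        return False


def exposed_services(host: str, ports: dict[int, str] = MANAGEMENT_PORTS,
                     timeout: float = 1.0) -> list[str]:
    """List the checked ports on host that accepted a TCP connection."""
    return [
        f"{host}:{port} {label}"
        for port, label in sorted(ports.items())
        if is_port_open(host, port, timeout)
    ]
```

Run from outside the facility network, a non-empty result from `exposed_services` for a control-system address would be a prompt to review firewall rules or remote access design.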

Cybersecurity of data center control systems and other internet protocol (IP)-enabled assets is multilayered and requires a combination of ongoing strategies. The threat is real and the likelihood of physical breaches, unauthorized access to information, and the destruction of or tampering with data and services is higher than ever before.


The full report Data center security: Reassessing physical, human and digital risks is available to members of the Uptime Institute community here.

Datacenter fire frequency trends

The catastrophic fire that occurred at OVHcloud’s SBG2 data center in Strasbourg, France (see last week’s blog about it) has led many operators to question their vulnerability to fires.

Fires at data centers are a constant concern, but are rare. In almost all cases of data center fire, the source is quickly located, the equipment isolated, and damage contained to a small area.

Uptime Institute’s database of abnormal incidents, which documents over 8,000 incidents shared by members since its inception in 1994, records 11 fires in data centers — less than 0.5 per year. All of these were successfully contained, causing minimal damage/disruption.
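The "less than 0.5 per year" figure follows from simple arithmetic, sketched below in Python. The end year of 2021 is an assumption based on the date of the OVHcloud fire discussed in this article, and because the database holds "over 8,000" incidents, the share computed here is an upper bound.

```python
FIRES_RECORDED = 11      # fires recorded in the database, from the text
TOTAL_INCIDENTS = 8000   # "over 8,000" in the text, so a lower bound
START_YEAR = 1994        # inception of the database, from the text
END_YEAR = 2021          # assumption: the year this article was written


def fires_per_year(fires: int, start_year: int, end_year: int) -> float:
    """Average number of recorded fires per year over the period."""
    return fires / (end_year - start_year)


def share_of_incidents_pct(fires: int, total_incidents: int) -> float:
    """Fires as a percentage of all recorded incidents."""
    return 100.0 * fires / total_incidents


rate = fires_per_year(FIRES_RECORDED, START_YEAR, END_YEAR)
share = share_of_incidents_pct(FIRES_RECORDED, TOTAL_INCIDENTS)
```

Under these assumptions the rate works out to roughly 0.4 fires per year, and fires account for no more than about 0.14% of recorded incidents.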

A separate Uptime Institute database of publicly recorded incidents around the world — which includes only those that receive public/media attention — also shows fires are rare, with outages often caused by fire suppression equipment.

One or two of these fires have been serious and have led to some destruction of equipment and data. However, the data centers or communications rooms involved have been small, with minimal long-term disruption.

The majority of incidents begin and end in the electrical room (although “people doing dumb things,” such as overloading power strips or working with open flames near flammable materials, is also a cause, says Uptime Institute Chief Technical Officer Chris Brown). Faults in uninterruptible power supplies can create heat and smoke and can require that the equipment be immediately isolated, but the risk rarely goes beyond this, due to the lack of nearby combustible materials.

If batteries are nearby, they can catch fire and will burn until the fuel is consumed — which can take some time. Lithium-ion batteries, which are commonly perceived as a fire risk, contain internal monitoring at the cell level, which cuts the battery power if heating occurs.

In recent years (before OVHcloud Strasbourg), accidental discharge of fire suppression systems, especially high pressure clean agent gas systems, has actually caused significantly more serious disruption than fires, with some banking and financial trading data centers affected by this issue. Fires near a data center, or preventative measures taken to reduce the likelihood of forest fires, have also led to some data center disruption (not included in the numbers reported above).

Responsibility for fire regulation is covered by the local AHJ (authority having jurisdiction), and requirements are usually strict. But rules may be stricter for newer facilities, so good operational management is critical for older data centers.

Uptime Institute advises that all data centers use VESDA (very early smoke detection apparatus) systems and maintain appropriate fire barriers and separation of systems. Well-maintained water sprinkler or low pressure clean agent fire suppression systems are preferred. Risk assessments, primarily aimed at reducing the likelihood of outages, will also pick up obvious issues with these systems.

The Uptime Tier IV certification requires 1 hour fire-rated partitions between complementary critical systems. This is to help ensure a fire in one area does not immediately shut down a data center. It does assume proper fire suppression in the facility.

Learning from the OVHcloud data center fire

The fire that destroyed a data center (and damaged others) at the OVHcloud facility in Strasbourg, France, on March 10-11, 2021, has raised a multitude of questions from concerned data center operators and customers around the world. Chief among these is, “What was the main cause, and could it have been prevented?”

Fires at data centers are rare but do occur — Uptime Institute Intelligence has some details of more than 10 data center fires (see our upcoming blog about the frequency of fire incidents). But most of these are quickly isolated and extinguished; it is extremely uncommon for a fire to rage out of control, especially at larger data centers, where strict fire prevention and containment protocols are usually followed. Unfortunately for OVHcloud, the fire occurred just two days after the owners announced plans for a public listing on the Paris Stock Exchange in 2022.

While this Note will address some of the known facts and provide some context, more complete and informed answers will have to wait for the full analysis by OVHcloud, the fire services and other parties. OVHcloud has access to a lot of closed-circuit television and some thermal camera images that will help in the investigation.

OVHcloud

OVHcloud is a high-profile European data center operator and one of the largest hosting companies globally. Founded in 1999 by Octave Klaba, OVHcloud is centered in France but has expanded rapidly, with facilities in several countries offering a range of hosting, colocation and cloud services. It has been championed as a European alternative to the giant US cloud operators and is a key participant in the European Union’s GAIA-X cloud project. It has partnerships with big IT services operators, such as Deutsche Telekom, Atos and Capgemini.

Among OVHcloud customers are tens of thousands of small businesses running millions of websites. But it has many major enterprise, government and commercial customers, including various departments of the French government, the UK’s Driver and Vehicle Licensing Agency, and the European Space Agency. Many have been affected by the fire.

OVHcloud is hailed as a bold innovator, offering a range of cloud services and using advanced low energy, free air cooling designs and, unusually for commercial operators, direct liquid cooling. But it has also suffered some significant outages, most notably two serious incidents in 2017. After that, then-Chief Executive Officer and chairman Octave Klaba spoke of the need for OVHcloud to be “even more paranoid than it is already.” Some critics at the time believed these outages were due to poor design and operational practices, coupled with a high emphasis on innovation. The need to compete on a cost basis with large-scale competitors (Amazon Web Services, Microsoft and others) is an ever-present factor.

The campus at Strasbourg (SBG) is based on a site acquired from ArcelorMittal, a steel and mining company. It houses four data centers, serving customers internationally. The oldest and smallest two, SBG1 and SBG4, were originally based on prefab containers. SBG2, destroyed by the fire, was a 2 MW facility capable of housing 30,000 servers. It used an innovative free air cooling system. SBG3, a newer 4 MW facility that was partially damaged, uses a newer design that may have proved more resilient.

Chronology

The fire in SBG2 started after midnight and was picked up by sensors and alarms. Black smoke prevented staff from effectively intervening. The fire spread rapidly within minutes, destroying the entire data center. Using thermal cameras, firefighters identified that two uninterruptible power supplies (UPSs) were at the heart of the blaze, one of which had been extensively worked on the morning before.

All of the data centers have been out of action in the days immediately following the fire, although all but SBG2 are due to come back online shortly. SBG1 suffered significant damage to some rooms, with recovery planned to take a week or so. Many customers were advised to invoke disaster recovery plans, but OVHcloud has spare capacity in other data centers and has been working to get customers up and running.

Causes, design and operation

Only a thorough root-cause analysis will reveal exactly what happened and whether this fire was preventable. However, some design and operational issues have been highlighted among the many customers and ecosystem partners of OVHcloud:

  • UPS and electrical fires. Early indicators point to the failure of a UPS, causing a fire that spread quickly. At least one of the UPSs had been extensively worked on earlier in the day, suggesting a maintenance issue may have been a main contributor. Although it is not best practice, battery cabinets (when using valve-regulated lead-acid, or VRLA, batteries) are often installed next to the UPS units themselves. Although this may not have been the case at SBG2, this type of configuration can create a situation where a UPS fire heats up batteries until they start to burn and can cause fire to spread rapidly.
  • Cooling tower design. SBG2 was built in 2011 using a tower design that has convection-cooling based “auto-ventilation.” Cool air enters, passes through a heat exchange for the (direct liquid) cooling system, and warm air rises through the tower in the center of the building. OVHcloud has four other data centers using the same principle. OVHcloud says this is an environmentally sound, energy efficient design — but since the fire, concerns have been raised that it can act rather like a chimney. Vents that allow external air to enter would need to be immediately shut in the event of a potential fire (the nearby, newer SBG3 data center, which uses an updated design, suffered less damage).
  • VESDA and fire suppression. It is being reported that SBG2 had neither a VESDA (very early smoke detection apparatus) system nor a water or gas fire suppression system. Rather, staff relied on smoke detectors and fire extinguishers. It is not known if these reports are accurate. Most data centers do have early detection and fire suppression systems.
  • Backup and cloud services. Cloud (and many hosting) companies cite high availability figures and extremely low figures for data loss. But full storage management and recovery across multiple sites costs extra, especially for hosted services. Many customers, especially smaller ones, usually pay for basic backup only. Statements from OVHcloud since the fire suggest that some customers will have lost data. Some backups were in the same data center, or on the same campus, and not all data was replicated elsewhere.

Fire and resiliency certification

Responsibility for fire prevention — and building regulations — is mostly dealt with by local planning authorities (AHJs – authorities having jurisdiction). These vary widely across geographies.

Uptime Institute has been asked whether Tier certification would help prevent fires. Uptime’s Chief Technical Officer Chris Brown responds:

“Tiers has limited fire system requirements, and they are geared to how the systems can impact the critical MEP (mechanical, electrical and plumbing) infrastructure. This is the case because in most locations, fire detection and suppression are tightly controlled by life/safety codes. If the Tier standard were to include specific fire detection and suppression requirements, it would add little value and would run the risk of clashing with local codes.

This is always under review.

Tier IV does have a compartmentalization requirement. It requires a 1 hour fire-rated barrier between complementary systems. This is to protect complementary systems from being impacted by a single fire event. This does assume the facility is properly protected by fire suppression systems.”

A separate Uptime Data Center Risk Assessment (DCRA) would document the condition (or absence) of a fire suppression system, any lack of a double-interlocked suppression system, and even a pre-action system using only compressed air to charge the lines.

How data center operators can transition to renewable energy

Transitioning to renewable energy use is an important, but not easily achieved, goal. Although the past decade has seen significant improvements in IT energy efficiency, there are indications that this may not continue. Moore’s Law may be slowing, more people are coming online, and internet traffic is growing faster than ever before. As energy consumption increases, data center operators will need to transition to 100% renewable energy use 100% of the time.

Uptime Institute has recently published a report covering the key components of renewable energy sustainability strategies for data centers. The tech industry has made considerable effort to improve energy efficiency and is the largest purchaser of renewable energy. Even so, most data center sustainability strategies still focus on renewable energy certificates (RECs). RECs are now considered to be low quality products because they cannot credibly be used to back claims of 100% renewable energy use.

To avoid accusations of greenwashing, data center operators must consider investing in a portfolio of renewable energy products. RECs may play a part, but power purchase agreements (PPAs) are becoming more popular, even though there can be financial risks involved.

A stepwise approach will ease the process. There are four steps that data center operators need to take on the journey toward the use of sustainable, renewable energy.

1. Measure, report, offset

Electricity is just one component of the carbon footprint of an organization, but it is relatively easy to measure because exact consumption is regularly reported for billing purposes. Tracking how much electricity comes from renewable and nonrenewable sources allows decisions to be made about offsets and renewables matching. This breakdown can be obtained from the electricity supplier, or grid-level emissions factors can be used. For example, in the US this is published annually by state by the Energy Information Administration; other countries provide similar resources. (A more formal methodology is explained in the Greenhouse Gas Protocol Scope 2 guidance.)
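The grid-factor approach described above reduces to multiplying metered consumption by an emissions factor. The following is a minimal sketch in Python, loosely in the spirit of a location-based Scope 2 estimate rather than a full implementation of the Greenhouse Gas Protocol guidance. The consumption figures and the factor of 400 kg CO2e per MWh are hypothetical placeholders; real values would come from utility bills and the published factor for the relevant grid.

```python
def electricity_emissions_tonnes(consumption_mwh: float,
                                 grid_factor_kg_per_mwh: float) -> float:
    """Estimate emissions in tonnes CO2e from consumption and a grid factor."""
    if consumption_mwh < 0 or grid_factor_kg_per_mwh < 0:
        raise ValueError("consumption and grid factor must be non-negative")
    return consumption_mwh * grid_factor_kg_per_mwh / 1000.0


def renewable_share(renewable_mwh: float, total_mwh: float) -> float:
    """Fraction of consumption covered by renewable sources (0.0 to 1.0)."""
    if total_mwh <= 0:
        raise ValueError("total consumption must be positive")
    return min(1.0, renewable_mwh / total_mwh)


# Hypothetical annual figures for one site.
annual_consumption_mwh = 10_000
grid_factor = 400  # kg CO2e per MWh, placeholder value
emissions = electricity_emissions_tonnes(annual_consumption_mwh, grid_factor)
share = renewable_share(2_500, annual_consumption_mwh)
```

With these placeholder inputs the site would report 4,000 tonnes CO2e from electricity, with a quarter of its consumption tracked as renewable.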

Once total emissions are known (that is, total emissions from electricity ― the full organizational emissions also need to be calculated), the next step is to buy offset products to mitigate the existing impact. However, there are significant challenges with ensuring offset quality, and so offsetting is only a stopgap measure. Ideally, offsets must be reserved for emissions that cannot be reduced by other measures (e.g., by switching to 100% renewable energy).

Measurement and reporting are crucial to understanding carbon footprint and are effectively becoming a necessary function of doing business. The reporting of carbon emissions is becoming a legal requirement for larger companies in some jurisdictions (e.g., the UK). Some companies are starting to require carbon reporting for their suppliers (for example, Apple and Microsoft, because of their own goals to be carbon-neutral/negative by 2030).

Data center operators who are not already tracking carbon emissions associated with electricity purchases may have to invest significant resources to catch up should reporting become required.

2. 100% renewables matching

Ideally, all electricity used should be 100% matched by renewable energy production. So far in the data center industry, 100% renewables matching has generally been achieved through purchasing RECs, but PPAs (direct, financial, or through green supply arrangements) must now take over as the dominant approach. RECs can act as a stopgap between taking no action and using tools such as PPAs, but they should eventually be a small component in the overall data center sustainability strategy.

3. Power purchase agreements

Purchasing RECs is a reasonable first step in a sustainability strategy but is insufficient on its own. Establishing direct/physical PPAs with a nearby renewable energy generator, combined with their associated RECs, is the gold standard necessary to truly claim 100% renewable energy use. However, even this does not mean 100% renewable energy is actually being used by the data center, just that an amount of purchased renewable energy equivalent to the data center’s energy use was added to the grid. Virtual/financial PPAs are an option where the power market does not allow direct retail PPAs.

Both types of PPA involve pricing risk but can also act as a hedge against wholesale price changes. For direct PPAs, the fixed price provides certainty, but if wholesale prices fall, buyers may be stuck in a long-term contract paying more than the current market price. Virtual/financial PPAs introduce further complexity and financial risk: if the wholesale price falls below the agreed-upon strike price at the time of purchase, the buyer must pay the supplier the difference, which may be significant.
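The settlement mechanics of a virtual/financial PPA can be illustrated with a simplified contract-for-differences calculation. This is a sketch under stated assumptions, not a model of any particular contract: it ignores fees, basis risk and settlement intervals, and the prices and volume are hypothetical.

```python
def vppa_settlement(strike_price: float, wholesale_price: float,
                    volume_mwh: float) -> float:
    """Net cash flow to the buyer for one settlement period.

    Positive: the supplier pays the buyer (wholesale price above strike).
    Negative: the buyer pays the supplier (wholesale price below strike).
    """
    return (wholesale_price - strike_price) * volume_mwh


# Hypothetical contract: strike price of 40 per MWh on 1,000 MWh.
buyer_pays = vppa_settlement(strike_price=40.0, wholesale_price=30.0, volume_mwh=1000)
buyer_receives = vppa_settlement(strike_price=40.0, wholesale_price=55.0, volume_mwh=1000)
```

In the first case the buyer owes 10,000 to the supplier; in the second the buyer receives 15,000, which is how the same contract can act as either a cost or a hedge.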

Despite these risks, the use of PPAs is growing rapidly in the US, particularly in the tech and communications sectors. Operators with advanced sustainability programs have been buying PPAs for several years, either directly through financial/virtual PPAs, or by using green supply agreements through a utility. Our report covers these options in more detail.

4. 24/7 renewable energy use

Most matching (of renewable energy purchased against energy actually used) happens on an annual basis, but shifts in generation (the grid mix) happen at a much finer granularity. There are strategies to smooth this out: different sources of renewable energy can be combined to create blended PPAs, such as combining wind and solar energy production with storage capacity. This is useful because different sources generate at different times (for example, wind can generate energy at night when solar energy is unavailable).

In 2019, Microsoft and Vattenfall announced a new product to provide hourly renewables matching. The pilot started at Microsoft’s Sweden headquarters and will provide hourly matching to a new Azure cloud region in 2021. No data center operator has achieved 24/7 matching for its entire global fleet, although some are almost there for individual sites (e.g., in 2019, Google achieved 96% renewable energy use in Oklahoma [US], and 61% on an hourly basis globally).
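The gap between annual and hourly matching can be shown with a small calculation. In the Python sketch below, the four-interval load and generation profiles are hypothetical and chosen only to make the arithmetic easy to follow; real profiles would have 8,760 hourly values per year.

```python
def annual_matching(load_mwh: list[float], renewable_mwh: list[float]) -> float:
    """Share of total load covered when surplus in one interval offsets another."""
    return min(1.0, sum(renewable_mwh) / sum(load_mwh))


def hourly_matching(load_mwh: list[float], renewable_mwh: list[float]) -> float:
    """Share of load covered when each interval must be matched on its own."""
    if len(load_mwh) != len(renewable_mwh):
        raise ValueError("profiles must cover the same intervals")
    # Renewable energy beyond the load in an interval cannot be carried over.
    matched = sum(min(load, gen) for load, gen in zip(load_mwh, renewable_mwh))
    return matched / sum(load_mwh)


# Hypothetical profiles: flat load, solar-like generation in the middle intervals.
load = [10.0, 10.0, 10.0, 10.0]
renewable = [0.0, 20.0, 20.0, 0.0]

annual = annual_matching(load, renewable)   # totals are equal
hourly = hourly_matching(load, renewable)   # only the middle intervals are covered
```

Here the site is 100% matched on a total basis but only 50% matched interval by interval, which is why blending sources and adding storage matters for 24/7 claims.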

This is the objective: 24/7 renewable energy use, 100% of the time. Matching on an annual basis is not enough to reach the goal of decarbonizing the electricity grid. All demand ― 100% ― must be supplied by 100% renewable energy, 100% of the time. This will, of course, take many decades to achieve in most economies.


The full report Renewable energy for data centers: Renewable energy certificates, power purchase agreements and beyond is available to members of the Uptime Institute. Membership can be found here.

Climate Change and Digital Infrastructure

Extreme weather affects nearly half of data centers

Recent extreme weather-related events in the US (the big freeze in Texas, fires on the west coast) have once again highlighted the need for data center operators to reassess their risks in the face of climate change. The topic is discussed in depth in the Uptime Institute report (available to Uptime Institute members) entitled, The gathering storm: Climate change and data center resiliency.

Data centers are designed and built to withstand bad weather. But extreme weather is becoming more common, and it can trigger all kinds of unforeseen problems — especially for utilities and support services.

In a recent Uptime Intelligence survey, almost half (45%) of respondents said they had experienced an extreme weather event that threatened their continuous operation — a surprisingly large number. While most said operations continued without problems, nearly one in 10 respondents (8.8%) did suffer an outage or significant service disruption as a result. This makes extreme weather one of the top causes of outages or disruption.



More events, higher costs

The industry — and that means investors, designers, insurers, operators and other contractors — is now braced for more challenging conditions and higher costs in the years ahead. Three in five respondents (59%) think there will be more IT service outages as a direct result of the impact of climate change. Nearly nine in 10 (86%) think that climate change and weather-related events will drive up the cost of data center infrastructure and operations over the next 10 years.

Although most operators are very aware of the risks and costs of climate change, many do not appear to consider their own sites to be facing any immediate challenges. Over a third (36%) report their management has yet to formally assess the vulnerability of data centers to climate change. Almost a third (31%) believe they already have adequate protection in place, but it is not clear if this belief is backed by recent analysis.

Perhaps most strikingly, only one in 20 managers sees a dramatic increase in risks due to climate change and is taking steps to improve resiliency as a result. Such steps can range from simple changes to processes and maintenance to expensive investments in flood barriers, changes to cooling systems or even re-siting and closure.



The need for assessments

Any investments in resiliency need, of course, to be based on sound risk analysis. Uptime Institute strongly recommends that operators conduct regular reviews of climate change-related risks to their data centers. The risk profile for a data center may be far less rosy in 2021 than it was when it was built even a few years ago. Four in five data center operators (81%) agree that data center resiliency assessments will need to be regularly updated due to the impact of climate change.

As recent events show, such reviews may need to consider water and power grid resilience, potential impacts to roads and staff access, and even the economics of operating for long periods without free cooling.

The figure below shows the top areas typically reviewed by organizations conducting climate change/weather-related data center resiliency assessments.



Data center managers do appear to have a good understanding of what to assess. But the findings also highlight the Achilles heel of data center resiliency: the difficulty of mitigating (or even accurately analyzing) the risks of failures by outside suppliers. Extreme weather events can hit power, water, fuel supplies, maintenance services and staff availability all at once. Good planning, however, can dramatically reduce the impact of such challenges.

Data center staff shortages don’t need to be a crisis

In every region of the world, data center capacity is being dramatically expanded. Across the board, the scale of capacity growth is stretching the critical infrastructure sector’s talent supply. The availability (or lack) of specialist staff will be an increasing concern for all types of data centers, from mega-growth hyperscales to smaller private enterprise facilities.

Uptime Institute forecasts that global data center staff requirements will grow from about 2.0 million full-time employee equivalents in 2019 to nearly 2.3 million in 2025 (see Figure 1). This estimate, the sector’s first, covers more than 230 specialist job roles for different types and sizes of data centers, with varying criticality requirements, from design through operation.


Figure 1. Global data center staff projections
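The forecast can be expressed as an implied annual growth rate. The Python sketch below uses the rounded figures quoted in the text (2.0 million and 2.3 million), so the result is approximate; the compound-growth framing is an illustration and is not taken from the report itself.

```python
def compound_annual_growth_rate(start_value: float, end_value: float,
                                years: int) -> float:
    """Constant yearly growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1 / years) - 1


STAFF_2019_MILLIONS = 2.0  # "about 2.0 million", from the text
STAFF_2025_MILLIONS = 2.3  # "nearly 2.3 million", from the text

growth = compound_annual_growth_rate(
    STAFF_2019_MILLIONS, STAFF_2025_MILLIONS, 2025 - 2019
)
added_roles_millions = STAFF_2025_MILLIONS - STAFF_2019_MILLIONS
```

On these rounded inputs, the forecast implies growth of roughly 2.4% a year, or close to 300,000 additional full-time equivalents over six years.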

Our research shows that the proportion of data center owners or operators globally that are having difficulty finding qualified candidates for open jobs rose to 50% in 2020. While there is hope that new technologies to manage and operate facilities will reduce staff burdens over time, their effect is expected to be limited, at least until 2025.

There is also a concern that many employees in some mature data center markets, such as the US and Western Europe, are due to retire around the same time, causing an additional surge in demand, particularly for senior roles.

However, the growth in demand does not need to represent a crisis. Individual employers can take steps to address the issue, and the sector can act together to raise the profile of opportunities and to improve recruitment and training. Globally, the biggest employers are investing in more training and education, by not only developing internal programs but also working with universities/colleges and technical schools. Of course, additional training requires additional resources, but more operators of all sizes and types are beginning to view this type of investment as a necessity.

Education and background requirements for many job roles may also need to be revisited. In reality, most jobs do not require a high level of formal education to carry out the role, even in positions where the employer may have initially required it. Relevant experience, an internship/traineeship, or on-the-job training can often more than compensate for the lack of a formal qualification in most job roles.

The growing, long-term requirement for more trained people has also caught the attention of private equity and other investors. More are backing data center facilities management suppliers, which offer services that can help overcome skills shortages. Raising awareness of the opportunities and offering training can be part of the investment. While the data center sector faces staff challenges, with focused investment, industry initiatives and more data center-specific education programs, it can rise to the challenge.

Several resources related to this topic are available to members of the Uptime Institute community, including “The people challenge: Global data center staffing forecast 2021-2025” and “Critical facility management: Guidance on using third parties.” Click here to find out more about joining our community.