“Ask the Expert”: Data Center Management Q&A with Uptime Institute CTO, Chris Brown

On April 24th, Uptime Institute CTO, Chris Brown, participated in an “Ask the Expert” session on Data Center Infrastructure Management with BrightTALK senior content manager Kelly Harris.

The 45-minute session covered topics ranging from best practices of a well-run data center to a list of the most common data center issues Uptime Institute’s global consultant team is seeing in the field every day. Toward the end of the session, Harris asked Brown for his views on the challenges and changes facing the data center industry. His answers sparked a flurry of questions from the audience focusing on operations, edge computing, hybrid-resiliency strategies and new cyber security risks facing data center owners and operators.

You can view a recording of the full session on our website here: https://uptimeinstitute.com/webinars/webinar-ask-the-expert-data-center-infrastructure-management

We have captured the top twelve audience questions and provided answers below:

If you could only focus on one thing to improve data center performance, what would that be?

Focus on improving operations discipline and protocols through training, procedures and support from management. A well-run data center can overcome shortcomings in the design, but a poorly run data center can easily jeopardize a great design. Many people have assumed the challenges they see in their infrastructure are technology-driven, but we are seeing a huge swing toward operational excellence and the ability to defend the operational choices made.

What is Uptime seeing in the way of IT density growth in the industry?

We are seeing a big delta between what people are anticipating for the future and what they are practicing today. New facilities are being designed to support high density, but not a lot of density growth is showing up yet. Most of the density growth in the industry is coming from the hyperscalers right now (Facebook, Google, etc.). One factor being much more heavily leveraged is the ability of virtualization and software-defined technology to create virtual infrastructure without additional physical resources. So growth in tangible space, power and cooling may appear to be slowing, but growth in processing capacity and work delivered is actually increasing rapidly.

Is IoT affecting density at this point?

Right now, IoT is driving the amount of compute that is needed, but we don't yet see it driving large increases in density at existing sites. A main premise of IoT is that the first layer of computing can and should move closer to the point where the data originates, so various edge computing strategies are coming forward to support this goal. The density of those edge sites may not be a factor, but their existence is. In developing regions, we're seeing more data centers built to house regional demand rather than a dramatic increase in density. At some point we'll run out of physical space to build new data centers, and at that time I'd expect density to increase dramatically, but it isn't happening yet.

Do you know if anyone is working on writing a spec/standard for “Edge”?

There are a number of entities trying to put together a standard for Edge, but there hasn't been much traction thus far. That said, getting Edge computing right is critically important.

At Uptime Institute, we're working with edge vendors to keep the focus of Edge where it needs to be: delivering required business services. Historically, many in the industry have considered Edge data centers small and, therefore, less important. Uptime Institute takes the position that Edge data centers are already becoming a major piece of any company's data center strategy, so design, construction, equipment and implementation are just as important in an edge facility as in a centralized facility.

Are these edge data centers remotely managed?

Yes, most Edge data centers are remotely managed. Edge data centers are typically small, with perhaps 1 to 22 racks, and physically placed near the demand point, so it is usually cost-prohibitive to staff those facilities. Various layers of hardware and software move workloads in and out of physical locations, so the amount of on-site management needed has been greatly reduced.

How is data center infrastructure management and security changing?

On the IT side, the devices that face the end user and provide compute have been a cyber security focus for many years, with a great deal of effort put into securing and hardening these systems. Millions of dollars of cyber protection equipment is now installed in most data centers to protect data and applications from intrusion.
But one of the things we are seeing is that management of building control systems is becoming more IP-based. MEP equipment and its controllers are connected to building networks. Some data centers place these management systems on the same company internet/intranet as their other production systems and use the same protocols to communicate. This puts building management systems at risk as well: they can be accessed from the outside but are not as well protected, because they are not considered mainstream data. (Even air-gapped facilities are not safe, because someone can easily bring malware in on a technician's laptop, connect it to the IT equipment, and the malware can then replicate across the facility through the building management system and infrastructure.)

Consequently, we are starting to see more data center facilities apply the same security standards to their internal systems that they have been applying to their customer-facing systems for the last several years, to address this new risk.

Will the Uptime Institute Tier levels be updated to account for the geographic redundancies that newer network technologies allow owners to utilize?

Uptime Institute has been looking into distributed compute and resiliency and how they apply to the standard Tier levels. The Tier levels apply to a specific data center and ensure that the facility meets the needed level of resilience at the component level. Tier levels do not need to be updated to address hybrid resiliency. The measure of hybrid resiliency is based on achieving resiliency and redundancy across the individual components within a portfolio, as viewed from the delivery of the business service itself. We liken this to the calculation of MTBF for complex systems, which rolls up the individual components and evaluates them at the complete-system level.
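To make the analogy concrete, here is a minimal sketch (in Python, with purely illustrative numbers and a hypothetical function name, not an Uptime Institute formula) of how resiliency can be evaluated at the portfolio level rather than at any single facility, in the same spirit as a system-level MTBF calculation:

    # Hypothetical sketch: availability of a business service that can be served
    # from any one of several independent sites in a portfolio.
    def composite_availability(site_availabilities):
        # Probability that at least one site is up, assuming sites fail independently.
        prob_all_down = 1.0
        for a in site_availabilities:
            prob_all_down *= (1.0 - a)
        return 1.0 - prob_all_down

    # Three sites, each individually 99.9% available:
    print(composite_availability([0.999, 0.999, 0.999]))  # ~0.999999999

The point is the same as in the answer above: the resiliency that matters is measured at the level of the delivered business service, not at any one facility.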

What is the most important metric of a data center (PUE, RCI, RTI, etc)?

If you are looking at a single metric to measure performance, then you are probably approaching the data center incorrectly. PUE looks at how efficiently a data center delivers a kW to the IT equipment, not how effectively the IT equipment is actually utilized. For example, UPS systems increase in efficiency with load, so more total load in the data center can improve your PUE; but if your servers are only at 25% utilization, you are not running an efficient facility despite having a favorable PUE. This is an example of why single metrics are rarely an effective way to run a facility. If you're relying on a single metric to measure efficiency and performance, you are missing out on a lot of opportunity to drive improvement in your facility.
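As a quick illustration of that point (the figures below are invented for the example, not measured data), PUE compares total facility power to IT power and says nothing about how well that IT power is used:

    # Illustrative only: PUE measures facility overhead, not how well IT capacity is used.
    total_facility_power_kw = 1200.0   # everything the facility draws
    it_power_kw = 1000.0               # power delivered to the IT equipment

    pue = total_facility_power_kw / it_power_kw
    print(f"PUE = {pue:.2f}")          # 1.20 -- looks efficient

    server_utilization = 0.25          # but the servers are only 25% utilized
    effective_fraction = (server_utilization * it_power_kw) / total_facility_power_kw
    print(f"Share of facility power supporting utilized capacity: {effective_fraction:.2f}")  # ~0.21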

How are enterprises evaluating risk regarding their cloud workloads? For example, if their on-premises data center is Tier IV – how do they assess the equivalent SLA for cloud instances?

There are two primary ways that enterprises can reduce risk and create a "Tier IV-like" cloud environment. The first, and increasingly popular, way is to purchase High Availability (HA) as a service from a cloud provider such as Rackspace or Google. The second is for the enterprise to architect a bespoke redundancy solution itself, using a combination of two or more public or private cloud computing instances.

In either case, the underlying approach to creating a high-availability cloud-based service is fundamentally the same, and the combination has redundancy characteristics similar to those of a "Tier IV" data center. In practice, servers are clustered behind a load balancer, which distributes requests to all the servers within that zone, so that if an individual server fails, the workload is picked up and executed non-disruptively by the remaining servers. This implementation will often have one more server node (N+1) than is actually required, so that if a single node fails, the client won't experience increased latency and the remaining systems are at lower risk of being overtaxed. The same concept can be applied across dispersed regions to account for much larger geographic outages.

This approach ensures that data remains available to clients even if a server, an entire region or a site fails. Enterprises can further strengthen their high-availability capabilities by using multiple cloud providers across multiple locations, which greatly reduces the chance of a single-provider failure; even the chance of planned maintenance windows overlapping between providers is very small.
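A minimal sketch of the N+1 sizing idea described above (all figures are hypothetical, chosen only to illustrate the arithmetic):

    import math

    # Hypothetical N+1 sizing: provision one node beyond what peak demand requires,
    # so a single node failure does not overtax the remaining cluster.
    peak_requests_per_sec = 9000
    requests_per_node = 2000

    nodes_required = math.ceil(peak_requests_per_sec / requests_per_node)  # N = 5
    nodes_provisioned = nodes_required + 1                                 # N + 1 = 6

    # Cluster utilization if a single node fails:
    utilization_after_failure = peak_requests_per_sec / ((nodes_provisioned - 1) * requests_per_node)
    print(nodes_provisioned, f"{utilization_after_failure:.0%}")           # 6 nodes, 90% after a failure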

What is the best range of PUE, based on Uptime Institute’s experience?

There is no right or wrong answer as to the best range for PUE. PUE is a basic metric for gauging mechanical and electrical (M&E) infrastructure energy efficiency. It is intended to be just that: a gauge, so that operators can better understand where their infrastructure energy efficiencies lie and establish a plan for continuous improvement. We have, however, seen data centers that use very efficient M&E equipment but still have PUEs higher than 1.6 because of inefficiencies in IT hardware management. In addition, the PUE metric is not tied in any way to the actual work being done, so although a data center may exhibit an attractive PUE, it may not be doing any work at all!

The value of the metric is less important than the acknowledgement and the continuous improvement plan. Uptime Institute recommends a business-results approach to improving overall IT efficiency, in which PUE is a very small part of the overall assessment. Efficient IT is about following a sustainable, documented continuous improvement approach led from the top down, addressing IT operations and IT hardware utilization, with only a small portion of the effort addressing the data center and PUE.

I would like to know if there are general guidelines for decommissioning a data center.

There are no general guidelines for when to decommission a data center, because business circumstances differ for practically every facility. The decision can depend on a number of facility factors that lead or assist the business in making it, with infrastructure age typically a leading factor. However, the impact of cloud, edge computing and hybrid IT has driven IT strategy changes that have recently led to many data centers being decommissioned.

We have also seen a lot of very well-maintained data centers that are more than 20 years old, where the companies presently have no intent of decommissioning the sites. These companies typically have an IT data center strategy plan in place and are tracking against it; IT and data center investment decisions are made with this plan in mind. The key is to make sure the plan is not developed in a vacuum by just IT, just Finance, or just Facilities/Real Estate. The best IT data center strategy is developed with input from all of these groups, creating a partnership. It should also be a living plan, reviewed and adjusted as necessary on a regular basis.

Can you speak to where you see multi data center DCIM tools going over the next year or two?

Multi-tenant DCIM is likely to evolve from basic isolated power and environmental monitoring features (including alarming and reporting) to also include facilities asset management features such as change and configuration management. Multi-tenant data center providers that offer remote-hands services will, in particular, make use of DCIM asset management to enable customers to remotely track the on-site work being done, including through auditable workflows.

Looking forward, service delivery will be measured with qualitative metrics that identify not just IF a service is available, but at what cost and at what capacity. Hence, DCIM will begin to include full-stack analytics to understand how work is hosted and to keep track of it as it migrates. To get there, multi-tenant DCIM will likely also start to include 'out-of-the-box' pre-built connectors to other management software tools, such as ITSM and VM management, so that customers can match specific workloads to physical data center assets, enabling end-to-end costing, 'right-sized' asset/workload provisioning, and so on.

You can watch the full “Ask the Expert” session with Uptime Institute CTO, Chris Brown, by visiting the recorded session page on our website at:
https://uptimeinstitute.com/webinars/webinar-ask-the-expert-data-center-infrastructure-management

How Edge Computing Is Transforming IT Infrastructure

New technologies such as IoT and cloud architectures are driving computing to the edge. Companies must embrace this trend in order to survive.

The definition of computing infrastructure is changing. While large traditional data centers have been the mainstay of information technology for the past 60 years, we're seeing a perfect storm today where mobility, economics, and technology are all converging to fundamentally redefine the IT challenge as well as the IT solution. In a nutshell, almost everything we learned as IT professionals about leveraging economies of scale in the delivery of corporate IT is being turned on its side, and is instead being viewed from the users' perspective. Read the Full Story Here.

Data Center Security

Hacking the Physical Data Center – Not just for Hollywood Movies

We have all seen big headline stories for years about overseas network hackers able to extract millions of account details and social security numbers from retail, financial and a litany of other industries. And Hollywood has done a great job of painting the picture of bad guys physically breaking into data centers to steal information, making themselves rich overnight. But a less considered situation is the one that fits somewhere in the middle: online hacking of the physical data center itself. The reality of what this hacking entails lies somewhere between what you see in the Hollywood movies and the stories your IT staffers share around the water cooler. Data center hacking is real, occurs nearly every day, and goes far beyond the downloading of customer data that shows up on the evening news. And in many cases, it may not even be noticed for days, weeks or years!

Every organization needs to be aware of all the ways in which its business can be compromised. In fact, now more than ever these threats need to be taken seriously, because the company's core business is transforming to digital. Every process and asset is represented digitally. Workflows and actions are defined digitally. Building systems and energy usage are managed digitally too. And all of this is connected to systems in the data center. Get into the data center, logically or physically, and a person can wreak havoc throughout an entire organization.

Security Starts in the Physical World

Security Starts with Policies

But let's focus on the here and now: the ground-level reality of what this all means and how it happens…

It actually started quite innocently. Over the years, manufacturers realized that hardwired RS232/RS485 and other proprietary control wiring interfaces were a dead end, so they turned to open-systems approaches and added IP access to their infrastructure control devices. This open, connected approach simplified the system designs and integrations needed to get the desired results. As a result, nearly every company has now placed the vast majority of its infrastructure control systems on company intranets: the air conditioning, the power distribution systems, the security cameras, the ID readers and door access. Yup, it's now all digital and all IP-controlled on the network. It's easy to see the huge upside for efficiency and intelligence, but the downside is that essentially anyone with access to the systems or network can turn all of these systems on and off, wipe security footage, and unlock doors.

So why has this gone unnoticed? We publicly hear about the security breaches that result in the bad guys gathering sensitive customer data because they affect us all. But all of the other breaches are mostly invisible to the public. Who really cares about a breach that simply shuts off the air conditioning in the data center? Or the breach that unlocks all of the perimeter doors and allows criminals to wander around? Or the one that turns off the security cameras where your diesel generator fuel is stored? And these types of hacking aren't just invisible to the public; in many cases they are invisible to the executive team as well.

So how did we get here? This situation is very different from decades past, because most of the people responsible for these control systems are not accustomed to worrying about such 'invisible' security risks. Many of these folks still think about bigger padlocks rather than firewalls. They think about taller fences and night-vision cameras. The very idea that all of those physical/mechanical systems can be rendered useless when a hacker "rides the wire" is hard for many of them to imagine.

And it all starts with policy and procedure. A huge percentage of the hacking we hear about today actually originates from good people who HAVE AUTHORIZED ACCESS to the data center to do their regular jobs. What? Good people hack? Well, not intentionally. Data center operators regularly plug in flash drives they used at home, which have become infected. And as bad as that sounds with operators, higher-level technicians who have access to deeper or more strategic systems can do the same thing, unintentionally releasing a storm! They walk into a data center to work on some type of control system and the first thing they do is connect THEIR laptop to the systems. That first hop is essential to any bad-guy hack, so policies that prevent these types of common 'innocent' practices can offer huge reductions in risk and incidents.

It's hard to comprehend the true cost of physical systems being hacked. Security cameras that protect millions of dollars of capital and other assets can be compromised, enabling theft. Cooling systems can be switched off for a mere few minutes, taking down entire data centers holding hundreds of millions of dollars of equipment, with all of the lost business and valuation that goes with massive downtime. And this hacking isn't even focused on the customer data breaches we all hear about every month.

So for 2018, commit to a top-to-bottom assessment of how you fare in this security framework. You likely already have lots of people focused on the network and server portions of security, but I suspect you have far fewer people working at the very top (policy and physical) or the very bottom (application and device) of the security framework we are discussing.

Get on it!

Shrinking Data Center

As the Data Center Footprint Shrinks, the Importance of the Data Center Grows!

Today, your data center is more important and essential than it has ever been!

Let me explain. We are at a crossroads in thinking about IT infrastructure. For nearly 40 years (about as long as connected computing systems have existed), bigger has always been better. Bigger processors, bigger disk drives, bigger servers, and bigger data centers to house it all. Most seasoned IT professionals have taken great pride in the amount of stuff they could acquire and then cobble together to create bespoke IT solutions that could meld and adjust to changing business conditions. And for many years, bigger worked. Every imaginable application was thrown at the IT organization, and more stuff would be added to the infrastructure to allow for these new services and expanded capacity. Bigger really was better!

Balancing Public and Private Clouds is essential for 2018

But that all changed in the last ten years, as applications and infrastructures were virtualized and all of the essential active devices increased in density while shrinking in physical size. And all of this happened at the very same time that the public cloud was hitting its stride and offering new ways of running applications without consuming any owned-and-operated gear at all! In-house data centers began to shrink in size and take on many of the attributes of the Public Cloud, forming Private Clouds.

Now, some of our colleagues are already equating this shrinking trend with the elimination of data centers altogether. They see the physical size trending downward and extrapolate this to zero footprint at some point in time. Not so fast, partner! This isn't a statistical and theoretical discussion. It's about identifying all of the business services needed to run a business, and then determining the best platform to deliver those services on. And the platform choice is not one-size-fits-all… it is a balancing act based upon cost, capacity, timing, security, support and regulatory needs!

It is perhaps counterintuitive, but the data centers whose footprints are shrinking, which you and your team are operating and optimizing into private clouds today, are actually becoming MORE important to the business itself! Think about it. What are the applications and services being kept in these owned and/or operated data centers? Well, typically it's the most critical ones. It's the secret sauce that you hold close to the vest. It's the stuff that runs the very essence of your business. It's your customer detail and your confidential data. It's the stuff that can't leave your four walls for jurisdiction, regulatory and/or security reasons.

Make no mistake, the public cloud will help us all offload capacity and give us access to new applications that cannot be brought in-house easily or cost-effectively. The Public Cloud provides a great platform for some applications but, as we have all learned in our IT careers, this is NEVER a one-size-fits-all game, and focusing on a zero-footprint end game is irrational, and contrary to good business and technology understanding. Each IT business service needs to be delivered on the most suitable platform. And it is the definition of "most suitable" that will get us all in trouble if we are not careful. In cases of security, jurisdiction or regulatory compliance, that definition will be more stringent than for many other applications, and your ability to reliably deliver those 'bet the business' core applications has never been more critical.

In 2018, those IT professionals who really understand their business's needs, and can defend WHY they have satisfied those needs with the various technology platforms available to them today, will be handsomely rewarded….

Delivering IT in 2018 is All About Accountability!

It has always been incredibly difficult to align an organization's business needs and strategies with the underlying information technologies required to get there. For many years, the long-tenured IT organization was given a tremendous amount of leeway due to the complex nature of delivering IT. IT was viewed as an almost magical discipline, something that had to exist at the genetic level of the people who called themselves IT professionals. In fact, the executive team knew that IT was a domain few humans could comprehend, so they rarely asked any deep questions at all! They didn't really challenge budgets, timing or strategy, for that matter. Instead, they hired IT experts with impressive resumes and trusted them to deliver. In doing so, they simply let those experts TELL THEM what was possible with IT, how soon they could bring it online, and how much it would cost. And since these were IT experts, everyone listening just wrote down what they heard and considered it fact. Not a lot of discussion or challenge to their statements. The tail really was wagging the dog!

Last summer I conducted a survey of IT professionals during the VMworld trade show held in Las Vegas. Not surprisingly, almost 80% of the people I surveyed indicated that they had been personally focused on delivering IT services for at least 10 years. A bit more surprising was that half of those surveyed indicated it was almost twice that long! The IT world is full of long-term practitioners. During such long stints, many individuals find an approach or 'rhythm' that works for them. From their perspective, this rhythm is tried and true and will always meet the needs of their constituents. And given the lack of questions from the executive team and other stakeholders, and the apparently free-wheeling nature of IT deployment and direction, times were pretty good for all those years.

In 2018… this is a problem. In an era where all businesses are fast becoming digitally centric, this model no longer works. There must be a plan that is well understood and very defendable, in ALL contexts. There is massive accountability for every action taken and every dollar spent, and just being able to deliver raw capacity and make systems talk to each other is expected as table stakes.

And the strategies must be well conceived and sound. Think back about five years, when nearly every company on the planet was blindly jumping on the Public Cloud bandwagon like drunken sailors on holiday leave. That was their 'strategy'. It sounded really good on paper and appeared to be very high tech. Consequently, this Public Cloud cut-over plan was presented as the savior for all of the IT woes that existed….

Then reality set in, and these same drunken sailors sobered up and realized that their actual business needs were not always being met by simply abandoning what they already had and foolishly running en masse to the Public Cloud. Whether it was cost or access or predictability, there were one or more aspects of their shiny new "Cloud" strategy that simply didn't pass real-world muster. Oops.

So companies are now (re-)thinking their IT strategy plans. They realize that, due to economics, applications or audience, they need to create a defendable plan that blends existing resources (like their billion dollars invested in data center infrastructure) with new resources that are now available as a service. They need operational plans that account for a long list of "what happens if" scenarios. They need to understand the hardware, software and management costs of each component, and then apply those costs to each business service on a per-unit basis to determine whether the business can afford 'their plan'.

So the point is, true innovation is available everywhere you look in IT, but there is a white-collar wrapper needed for everything we do. New data centers can be built if the business model makes sense. Existing data centers can be run better and more efficiently. Colocation space can be added based on cost or capacity build-outs, and Public Cloud services can be adopted for dynamic capacity when the business needs it. This is not a one-size-fits-all game.

Accountability is not a FOUR-letter word (it's actually 14); it's your way of life if you are going to be a successful IT professional in 2018.


By Mark Harris, Senior Vice President of Marketing, Uptime Institute

Fire Suppression Systems Bring Unexpected Risk

Hazards include server damage from loud noise during discharge of inert gas fire suppression systems

By Kevin Heslin, with contributions from Scott Good and Pitt Turner

A downtime incident in Europe has rekindled interest in a topic that never seems more than a spark away from becoming a heated discussion. In that incident, the accidental discharge of an inert gas fire suppression system during testing damaged the servers in a mission-critical facility.

The incident, which took place during fire suppression system testing on 10 September 2016 at an ING facility in Bucharest, destroyed dozens of hard drives, according to reports published by the BBC and elsewhere. As a result, ING was forced to rely on a nearby backup facility to support its Romanian operations. The event was of great interest to Uptime Institute's EMEA Network because of the universal requirement for fire protection in data centers. Uptime Institute and Network principals inaugurated a number of information exchanges at ensuing Network meetings.

Fires that originate in data centers are relatively rare and are usually caused by human error during testing and maintenance or by electrical failures, which tend to be self-extinguishing. Other fires spread to the data center from other spaces. In those moments, the need for an effective and functioning fire suppression system is obvious: the system must provide life safety and protect expensive gear and mission-critical data. However, a fire suppression system can pose a risk to operations when inadvertently activated during testing and maintenance. In addition, fire suppression systems, when deployed, can themselves cause damage in a facility.

These considerations mean that the choice and design of a fire suppression system must meet the business needs and fire threats the facility is likely to face. Water-based systems, for example, will destroy sensitive IT gear when deployed. In general, however, the loss of IT gear in a fire is acceptable to insurance companies and local authorities having jurisdiction (AHJs), who view equipment as replaceable as long as the system saves lives and preserves the building. Data center operators will place a higher value on the data and operations.

Some data centers, however, do deploy inert gas fire suppression systems. In general, these systems are used to protect irreplaceable or extremely costly gear. High-performance computers, for example, tend to be far more expensive to replace than standard x86 servers. In theory, inert gas fire suppression systems prevent water from entering the server room via sprinkler systems. However, discharge of an inert gas system has been shown to damage data center servers even when there is no fire, so many facilities are turning to pre-action systems, which also keep water off the data center floor except when activated. According to one major vendor of both types of fire suppression systems, inert gas systems better protect IT equipment because they do not damage electric and electronic circuits, even under full-load operation. In addition, inert gas systems can suppress deep-seated fires, including those inside a cabinet.

Uptime Institute agrees that accidental discharges of inert gas fire suppression systems are rare. At the same time, according to the 2017 Uptime Institute Data Center Industry Survey, about one-third of data center operators have experienced an accidental discharge. In fact, in the same survey, respondents were three times more likely to have experienced an accidental discharge than an actual fire.

Beyond that point of agreement, however, consensus throughout the industry is rare, with much uncertainty about exactly how IT gear is damaged by the discharge of inert gas, how to protect against the damage, or even whether inert gas fire suppression system vendors or IT manufacturers are best positioned to eliminate the problem. Still, anecdotes continue to surface, and vendors have documented, under test conditions, the phenomenon in which loud noise from fire suppression systems impaired the performance of data center servers or disabled them, either temporarily or permanently, leading to data loss.

Uptime Institute notes that vendors have tried to address problems tied to the release of inert gases by redesigning nozzles and improving sensors to reduce false positive signals. Uptime Institute also agrees with vendor recommendations regarding the use of inert gas systems, including:

  1. Installing racks that have doors to muffle noise
  2. Installing sound-insulating cabinets
  3. Using high-quality servers or even solid-state servers and memory
  4. Slowing the rate of inert gas discharge
  5. Installing walls and ceilings that incorporate sound-muffling materials
  6. Aiming gas discharge nozzles away from servers
  7. Removing power from IT gear before testing inert gas fire suppression systems
  8. Muffling alarms during testing
  9. Replicating data to off-site disk storage

A more dramatic step would be to move to pre-action (dry pipe) water sprinkler or chemical suppression systems, but at least one insurance broker recommends the use of inert gas systems in conjunction with a water system as part of a two-phase fire suppression system.

Regardless, pre-action fire suppression systems have become more common. The use of water means that facility owners are protected against the total loss of a data center, and the dry-pipe feature—originally developed to protect against fire in cold environments such as parking garages or refrigerated coolers—protects facilities from the consequences of an accidental discharge in white spaces. In many applications, they are also the more economical choice, especially as local codes and authorities may require the use of a water suppression system, and the inert gas system then becomes not a replacement, but a fairly expensive supplement.

Still, inert gas fire suppression systems continue to have their adherents, and they may make business sense in select applications. Data center operators may consider using inert gas in locations where water is scarce or when an application makes use of very expensive and unique IT gear, such as supercomputers in HPC facilities or old-style tape-drive storage. In addition, inert gas systems may be the best choice when water damage would cause irreplaceable data to be irretrievably lost. Even in these instances, however, Uptime Institute believes that organizations would be better served by developing improved backup and business continuity plans.

Those considering inert gas suppression may be somewhat relieved to learn that vendors have taken steps to minimize damage from discharges of inert gas systems, perhaps the most important of these being improved sensors that register fewer false positives. It is also entirely possible to develop rigorous procedures that reduce the likelihood of an inadvertent discharge due to human error, which is by far the most common cause of accidental discharge.

Some industry sources believe that the problem first began to manifest around 2008, as inert gas systems became popular for fire suppression in data centers. Others note that server density increased at about the same time.

An examination of Uptime Institute Network's Abnormal Incident Reports (AIRs) database does not support this belief. It includes reports dating back as far as 1994 and 1995, with no obvious increase in 2008. In total, the AIRs database includes 54 incidents involving inert gas fire suppression systems. Of these reported incidents, 15 involved accidental discharge, two of which resulted in downtime. However, many of the incidents took place in support areas, with no possibility of server downtime. Still, the documented possibility of damage to IT gear and facility downtime worries many data center operators.

Uptime Institute data finds that operations and management issues lead to most accidental discharges of fire suppression systems.

Uptime Institute Network members commonly use the AIRs database to identify and prevent problems experienced by their colleagues. In this case, Network members reported that 27 incidents were caused by technician error, with another 14 resulting from lack of maintenance or poor procedures, 9 resulting from a manufacturing problem, and 4 from a design omission or installation problem.

Although these results are typical for most system failures, they are particularly relevant to discussions of all fire suppression systems, as differences of opinion exist about how exactly the discharge of inert gas systems damages IT gear. Manufacturers believe that the sound of the discharge damages the drives, but others say it is the noise of the fire alarm, either independently or by contributing to the noise level in the facility.

Uptime Institute does not believe this to be a realistic concern, as the AIRs database includes no instances where fire alarms by themselves resulted in data center downtime or server damage.

The fire suppression industry is aware of the problem. During a 2014 meeting with server manufacturers, fire suppression vendors acknowledged the problem, noting concerns about the noise (actually a pressure wave moving through the room) emitted by inert gas suppression systems during discharge. The vendors theorized that the volume (decibels) and frequency (hertz) of sound emitted during system discharge combine to damage servers. In response, many vendors redesigned nozzles to reduce the pressure, and therefore the decibel level, of the discharge. This, one vendor said, would not eliminate the problem, as each server type has a different sensitivity and new server types are susceptible to different volume and frequency combinations. In heterogeneous environments, sensitivity may vary from server to server or rack to rack.

Vendors say that the time required for testing makes it impossible to develop inert gas suppression systems that discharge without affecting the many server types already available in the marketplace. Many environments are heterogeneous, so a discharge system may affect some of the equipment in a rack or room without doing any apparent damage to other equipment. Vendors have also introduced sensors that are more accurate, reducing the likelihood that a false alarm will trigger an unnecessary discharge.

These same vendors note that higher-grade enterprise servers are less susceptible to damage from inert gas discharges. These servers, they note, are more likely to be installed in quality racks that have doors and other features that muffle noise and cushion servers against the shock from the sound waves. In addition, enterprise servers are designed to be operated in large cabinets where many drives spin at the same time, creating both noise and vibration. These drives are tested to withstand harsher environments than consumer drives. They track more precisely and have high sustained data rates, making them resistant to sound and vibration. According to one vendor, even slamming doors can degrade the performance of the consumer drives sometimes found in data centers.

These measures can be effective, according to a Data Center Journal article written in 2012 by IBM's Brian P. Rawson and Kent C. Green. They explain that noise causes the read-write element to go off the data track: "Current-generation HDDs have up to about 250,000 data tracks per inch on their disks. To read and write, the element must be within ±15% of the data track spacing. This means the HDD can tolerate less than 1/1,000,000 of an inch offset from the center of the data track—any more than that will halt reads and writes." They theorize that decreased spacing between data tracks has made servers more susceptible to damage or degraded performance from noise. In the same article, Rawson and Green cite a YouTube video that shows how even a low-decibel noise such as a human voice can degrade HDD performance.
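A quick back-of-the-envelope check of those figures (a sketch using only the numbers Rawson and Green cite):

    # 250,000 tracks per inch and a +/-15% positioning tolerance imply an
    # allowable offset of well under one millionth of an inch.
    tracks_per_inch = 250_000
    track_pitch_in = 1 / tracks_per_inch       # 0.000004 in between track centers
    tolerance_in = 0.15 * track_pitch_in       # +/-15% of the track spacing
    print(f"{tolerance_in:.7f} in")            # 0.0000006 in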

Uptime Institute notes that common security practices along with strong operational procedures relating to testing fire suppression systems can mitigate most risk associated with inert gas fire suppression systems.

Uptime Institute recommends that IT management teams work with risk managers to ensure that all stakeholders understand a facility’s fire suppression requirements and options before selecting a fire suppression system. Operational considerations should also be included, so that the system is well suited to an organization’s risk exposure and the business requirements.

Uptime Institute believes that most data centers would be best served by a combination of a pre-action (dry pipe) sprinkler system and high-sensitivity smoke detection. Most AHJs, risk managers, and insurance companies will support this choice as long as other operating requirements are met, such as having educated and trained staff providing building coverage. These authorities are generally quite familiar with water-based fire suppression systems, as these constitute the vast majority of installations in the U.S.; however, they may not always be familiar with pre-action systems.

In instances when risk managers or insurers require an inert gas fire suppression system, operations staff may be able to mitigate the risk of accidental discharge by implementing documented policies, procedures, and practices. These documents should include as many of the vendor recommendations listed above as possible. In this way, a risk manager's requirement for inert gas fire suppression should not be taken as the end of the discussion but rather the start of a dialog.

Finally, IT should continuously evaluate its fire suppression system and consider removing inert gas systems from spaces when their use changes. Uptime Institute has documented the use of inert gas fire suppression in spaces that were converted from IT to storage. In this instance, the facility increased its risk of accidental discharge but gained no benefit at all.


Kevin Heslin is chief editor and director of Ancillary Projects at Uptime Institute. In these roles, he supports Uptime Institute communications and education efforts. Previously, he served as an editor at BNP Media, where he founded Mission Critical, a commercial publication dedicated to data center and backup power professionals. He also served as editor at New York Construction News and CEE and was the editor of LD+A and JIES at the IESNA. In addition, Heslin served as communications manager at the Lighting Research Center of Rensselaer Polytechnic Institute. He earned a BA in journalism from Fordham University in 1981 and a BS in technical communications from Rensselaer Polytechnic Institute in 2000.