Uptime Institute has prepared this brief FAQ on airline outages to help the industry, the media, and the general public understand why data centers fail.
The failure of a power control module on Monday, August 8, 2016, at a Delta Air Lines data center caused hundreds of flight cancellations, inconvenienced thousands of customers, and cost the airline millions of dollars. And while equipment failure is the proximate cause of the data center failure, Uptime Institute’s long experience evaluating the design, construction, and operations of facilities suggests that many enterprise data centers are similarly vulnerable because of design and construction flaws or poor operations practices.
What happened to Delta Air Lines?
While software is blamed for many well-publicized IT problems, Delta Air Lines is blaming a piece of infrastructure hardware. According to the airline, “…a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power. The universal power was stabilized and power was restored quickly. But when this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.” We take Delta at its word.
Some of the first reports blamed switchgear failure or a generator fire for the outage. Later reports suggested that critical services were housed on single-corded servers or that both cords of dual-corded servers were plugged into the same feed, which would explain why backup power failed to keep some critical services online.
However, pointing to the failure of a single piece of equipment can be very misleading. Delta’s facility was designed with redundant systems, so it should have remained operational had those systems performed as designed. In short, a design flaw, a construction error or change, or poor operations procedures set the stage for the catastrophic failure.
What does this mean?
Equipment like Delta’s power control module (more often called a UPS) should be deployed in a redundant configuration to allow for maintenance and to support IT operations in the event of a fault. However, IT demand can grow over time until the initial redundancy is compromised and each remaining piece of equipment is overloaded when one fails or is taken offline. At this point, only Delta really knows what happened.
Organizations can compromise redundancy in this way by failing to track load growth, lacking a process to manage load growth, or making a poor business decision because of unanticipated or uncontrolled load growth.
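The arithmetic behind this kind of silent redundancy loss is simple enough to sketch. The example below is purely illustrative — the function name, capacities, and loads are hypothetical and are not based on Delta’s actual configuration — but it shows how a design that tolerates a single UPS failure on day one can stop doing so once unmanaged load growth pushes total demand past what the surviving modules can carry:

```python
# Hypothetical sketch: checking whether an N+1 power configuration still
# tolerates the loss of one module as IT load grows over time.
# All names and figures are illustrative, not Delta's actual numbers.

def redundancy_holds(total_load_kw, module_capacity_kw, modules=2):
    """True only if the surviving modules can carry the full load after
    any single module fails or is taken offline for maintenance."""
    surviving_capacity = module_capacity_kw * (modules - 1)
    return total_load_kw <= surviving_capacity

# Day one: 400 kW load on two 500 kW modules -- one module can fail safely.
print(redundancy_holds(400, 500))   # True

# After years of untracked growth: 600 kW on the same two modules -- a
# single failure now overloads the survivor, and the "redundant" design fails.
print(redundancy_holds(600, 500))   # False
```

The point of the sketch is that nothing breaks on the day redundancy is lost; the facility runs normally until the next fault or maintenance window, which is why load growth must be tracked against a defined redundancy threshold rather than against total installed capacity.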
Delta Air Lines is like many other organizations. Its IT operations are complex and expensive, change is constant, and business demands are constantly growing. As a result, IT must squeeze every cent out of its budgets while maintaining 24x7x365 operations. Inevitably, these demands will expose any vulnerability in the physical infrastructure and operations.
Failures like the one at Delta can have many causes, including a single point of failure, lack of totally redundant and independent systems, or poorly documented maintenance and operations procedures.
Delta’s early report ignores key questions: why a single equipment failure could cause such damage, what could have been done to prevent it, and how Delta will respond in the future.
Delta and the media are rightly focused on the human cost of the failure. Uptime Institute is sure that Delta’s IT personnel will be trying to reconstruct the incident once operations have been restored and customer needs met. This incident will cost millions of dollars and hurt Delta’s brand; Delta will not want a recurrence. It has only started to calculate what changes will be required to prevent another incident and how much they will cost.
Why is Uptime Institute publishing this FAQ?
Uptime Institute has spent many years helping organizations avoid this exact scenario, and we want to share our experience. In addition, we have seen many published reports that simply miss the point of how to achieve and maintain redundancy. Equipment fails all the time. Facilities that are well designed, built, and operated have systems in place to minimize data center failure. Furthermore, these systems are tested and validated regularly, with special attention to upgrades, load growth, and modernization plans.
How can organizations avoid catastrophic failures?
Organizations spend millions of dollars to build highly reliable data centers and keep them running. They don’t always achieve the desired outcome because of a failure to understand data center infrastructure.
The temptation to reduce costs in data centers is great because data centers require a great deal of energy to power and cool servers, and they require experienced and qualified staff to operate and maintain them.
Value engineering in the design process and mistakes and changes in the building process can result in vulnerabilities even in new data centers. Poor change management processes and incomplete procedures in older data centers are another cause for concern.
Uptime Institute believes that independent third-party verifications are the best way to identify vulnerabilities in IT infrastructure. For instance, Uptime Institute consultants have Certified more than 900 facilities worldwide as meeting the requirements of the Tier Standards. Certification is the only way to ensure that a new data center has been built and verified to meet business standards for reliability. In almost all these facilities, Uptime Institute consultants made recommendations to reduce risks that were unknown to the data center owner and operators.
Over time, however, even new data centers can become vulnerable. Change management procedures must help organizations control IT growth, and maintenance procedures must be updated to account for equipment and configuration changes. Third-party verifications such as Uptime Institute’s Tier Certification for Operational Sustainability and Management & Operations Stamp of Approval ensure that an organization’s procedures and processes are complete, accurate, and up-to-date. Maintaining up-to-date procedures is next to impossible without management processes that recognize that data centers change over time as demand grows, equipment changes, and business requirements change.
I’m in IT, what can I do to keep my company out of the headlines?
It all depends on your company’s appetite for risk. If IT failures would not materially affect your operations, then you can probably relax. But if your business relies on IT for mission-critical services, whether customer-facing or internal, then you should consider having Uptime Institute evaluate your data center’s management and operations procedures and then implement the resulting recommendations.
If you are building a new facility or renovating an existing facility, Tier Certification will verify that your data center will meet your business requirements.
These assessments and certifications require a significant organizational commitment in time and money, but these costs are only a fraction of the time and money that Delta will be spending as the result of this one issue. In addition, insurance companies increasingly have begun to reduce premiums for companies that have their operational preparedness verified by a third party.
Published reports suggest that Delta thought it was safe from this type of outage, and that other airlines may be similarly vulnerable because of skimpy IT budgets, poor prioritization, and bad processes and procedures. All companies should undertake a top-to-bottom review of their facilities and operations before spending millions of dollars on new equipment.