Uptime Institute Blog

How to avoid outages: Try harder!

September 23, 2019/in Executive, Operations/by Kevin Heslin

Uptime Institute has spent years analyzing the root causes of data center and service outages, surveying thousands of IT professionals throughout the year on this topic. According to the data, the vast majority of data center failures are caused by human error. Some industry experts report numbers as high as 75%, but Uptime Institute generally reports about 70% based on the wealth of data we gather continuously. That finding immediately raises an important question: Just how preventable are most outages?

Certainly, the number of outages remains persistently high, and the associated costs of these outages are also high. Uptime Institute data from the past two years demonstrates that almost one-third of data center owners and operators experienced a downtime incident or severe degradation of service in the past year, and half in the previous three years. Many of these incidents had severe financial consequences, with 10% of the 2019 respondents reporting that their most recent incident cost more than $1 million.

These findings, and others related to the causes of outages, are perhaps not unexpected. More surprisingly, in Uptime Institute’s April 2019 survey, 60% of respondents believed that their most recent significant downtime incident could have been prevented with better management/processes or configuration. For outages that cost more than $1 million, this figure jumped to 74%, and then leveled out around 50% as the outage costs increased to more than $40 million. These numbers remain persistently high, given the existing knowledge available on the causes and sources of downtime incidents and the costs of many downtime incidents.

Data center owners and operators know that on-premises power failures continue to cause the most outages (33%), with network and connectivity issues close behind (31%). Additional failures attributed to colocation providers could also have been prevented by the provider.

These findings should be alarming to everyone in the digital infrastructure business. After years of building data centers, and adding complex layers of features and functionality, not to mention dynamic workload migration and orchestration, the industry’s report card on actual service delivery performance is less than stellar. And while these sorts of failures should be very rare in concurrently maintainable and fault-tolerant facilities when appropriate and complete procedures are in place, what we are finding is that the operational part of the story falls flat. Simply put, if humans worked harder to MANAGE the well-designed and constructed facilities better, we would have fewer outages.

Uptime Institute consultants have underscored the critically important role procedures play in data center operations. They remind listeners that having and maintaining appropriate and complete procedures is essential to achieving performance and service availability goals. These same procedures can also help data centers meet efficiency goals, even in conditions that exceed planned design days. Among other benefits, well-conceived procedures and the extreme discipline to follow them help operators cope with strong storms, properly perform maintenance and upgrades, manage costs and, perhaps most relevant, restore operations quickly after an outage.

So why, then, does the industry continue to experience downtime incidents, given that the causes have been so well pinpointed, the costs are so well-known and the solution to reducing their frequency (better processes and procedures) is so obvious? We just don’t try hard enough.

When we ask our constituents about the causes of their outages, there are perhaps as many explanations as there are respondents. Here are just a few questions to consider when looking internally at your own risks and processes:

  • Does the complexity of your infrastructure, especially the distributed nature of it, increase the risk that simple errors will cascade into a service outage?
  • Is your organization expanding critical IT capacity faster than it can attract and apply the resources to manage that infrastructure?
  • Has your organization started to see any staffing and skills shortage, which may be starting to impair mission-critical operations?
  • Do your concerns about cyber vulnerability and data security outweigh concerns about physical infrastructure?
  • Does your organization shortchange training and education programs when budgeting?
  • Does your organization under-invest in IT operations, management, and other business management functions?
  • Does your organization truly understand change management, especially when many of your workloads may already be shared across multiple servers, in multiple facilities or in entirely different types of IT environments including colocation and the cloud?
  • Does your organization consider the needs at the application level when designing new facilities or cloud adoptions?

Perhaps there is simply a limit to what can be achieved in an industry that still relies heavily on people to perform many of the most basic and critical tasks and thus is subject to human error, which can never be completely eliminated. However, a quick survey of the issues suggests that management failure — not human error — is the main reason that outages persist. By under-investing in training, failing to enforce policies, allowing procedures to grow outdated, and underestimating the importance of qualified staff, management sets the stage for a cascade of circumstances that leads to downtime. If we try harder, we can make progress. If we leverage the investments in physical infrastructure by applying the right level of operational expertise and business management, outages will decline.

We just need to try harder.


More information on this and similar topics is available to members of the Uptime Institute.

Tags: Data Center, Data Center Availability, Data Center Disaster Recovery