Avoiding IT service outages is a big concern for any operator or service provider, especially one providing a business-critical service. But when an outage does occur, the business impact can vary from “barely noticeable” to “huge and expensive.” Anticipating and modeling the impact of a service interruption should be a part of incident planning and is key to determining the level of investment that should be made to reduce incidents and their impact.
In recent years, Uptime Institute has been collecting data about service outages, including the costs, the consequences and, most notably, the most common causes. One of our findings is that organizations often don’t collect full financial data about the impact of outages, or if they do, it might take months for the full costs to become apparent. Many of the costs are hidden, even if the outcry from managers and even non-paying customers is most certainly not. But cost is not a proxy for impact: even a relatively short and inexpensive outage at a big, consumer-facing service provider can attract negative national headlines.
Another clear trend, now that so many applications are distributed and interlinked, is that “outages” can often be partial, affecting users in different ways. This has, in some cases, enabled some major operators to claim very impressive availability figures in spite of poor customer experience. Their argument: Just because a service is slow or can’t perform some functions doesn’t mean it is “down.”
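To make the arithmetic behind such claims concrete, here is a minimal sketch of how the same year of service can yield two very different availability figures depending on whether degraded time counts as downtime. The function name, the `degraded_weight` parameter and the sample durations are illustrative assumptions, not figures or methods taken from Uptime Institute data.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability(total_minutes, down_minutes, degraded_minutes=0.0, degraded_weight=0.0):
    """Return availability as a percentage.

    degraded_weight is the fraction of each degraded minute that is counted
    as downtime: 0.0 ignores degradation entirely, 1.0 treats a degraded
    minute as a full outage. (Hypothetical parameter, for illustration only.)
    """
    effective_down = down_minutes + degraded_weight * degraded_minutes
    return 100.0 * (total_minutes - effective_down) / total_minutes


# Hypothetical year: 30 minutes fully down, 20 hours (1,200 minutes) slow or partially working.
down_only = availability(MINUTES_PER_YEAR, 30, 1200, degraded_weight=0.0)
with_degraded = availability(MINUTES_PER_YEAR, 30, 1200, degraded_weight=1.0)

print(f"Counting only full outages: {down_only:.3f}%")
print(f"Counting degraded time too: {with_degraded:.3f}%")
```

Under these assumed numbers, the operator can report availability above 99.99% while users experienced a service closer to 99.77%. The choice of weight is a policy decision, which is exactly where the disagreement over what counts as “down” arises.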
To give managers a shorthand way to talk about the impact of a service outage, Uptime Institute developed the Outage Severity Rating (below). The rating is not scientific and might be compared to the internationally used Beaufort Scale, which describes how various wind speeds are experienced on land and sea.
By applying this scale to widely reported outages from 2016-2018, Uptime Institute tracked 11 “Severe” Category 5 outages and 46 “Serious” Category 4 outages. Of these 11 severe outages, no fewer than five occurred at airlines. In each case, multi-million-dollar losses occurred, as flights were cancelled and travelers stranded. Compensation was paid, and negative headlines ensued.
Analysis suggests both obvious and less obvious reasons why airlines were hit so hard. The obvious one is that airlines are not only highly dependent on IT for almost all elements of their operations, but the impact of disruption is also immediate and expensive. Less obviously, many airlines have been disrupted by low-cost competition and forced to “do more with less” in the field of IT. This leads to errors and over-thrifty outsourcing, and it makes incidents more likely.
If we consider Categories 4 and 5 together, the banking and financial services sector is the most over-represented. For this sector, outage causes varied widely, and in some cases, cost cutting was a factor. More commonly, the real challenge was simply managing complexity and recovering from failures fast enough to reduce the impact.
Members of Uptime Institute Network experience half of the incidents that cause these types of service disruptions. Members share a wealth of experiences with their peers from some of the largest companies in the world. Membership instills a strong awareness of operational efficiency and best practices that can be put into action every day.