Why do some industries and organizations suffer more serious, high profile outages than others?

In a recent Uptime Institute Intelligence note, we considered a June 2019 report issued by the US General Accounting Office (GAO) on the IT resiliency of US airlines. The GAO wanted to better understand if the all-too-frequent IT outages and resultant chaos passengers face have any common causes and if so, how they could be addressed. (Since we published that Note, the UK-owned carrier British Airways suffered its second big outage in two years, once again stranding tens of thousands of passengers and facing heavy costs.)

The GAO report didn’t really uncover much new: there was, in some cases, a need for better testing, a little more redundancy needed here and there, certainly some improved processes. But despite suspicions of under-investment, there was nothing systemic. The causes of the outages were varied and, although often avoidable when looked at in hindsight, not necessarily predictable.

There is, however, still an undeniable pattern. Uptime Institute’s own analysis of three years of public, media-reported outages shows that two industries, airlines and retail financial services, do appear to suffer from significantly more, highly disruptive (category 4 and 5), high profile outages than other industries.

To be clear: these businesses do not necessarily have more outages, but rather they suffer a higher number of highly disruptive outages, and as a result, get more negative publicity when there is a problem. Cloud providers are not far behind.

Why is this? The reasons may vary, but these businesses very often offer services on which large numbers of people depend, for which almost any interruption causes immediate losses and negative publicity, and in which it may not be easy to get back to the status quo.

Another trait that seems to set these businesses apart is that their almost complete dependency on IT is relatively recent (or they may be a new IT service or industry), so they may not yet have invested to the same levels as, for example, an investment bank, stock exchange or a power utility. In these last examples, the mission-critical nature of the business has long been clear, they are probably regulated, and so have investments and processes fully in place.

Organizations have long conducted business impact analyses, and there are various methodologies and tools available to help carry these out. Uptime Institute has been researching this area, particularly to see how organizations might specifically address business/impacts of failures in digital infrastructure. One simple approach is to create a “vulnerability” rating for each application/service, with scores attributed across a number of factors. Some of our thinking — and this is not comprehensive — is outlined below:

Profile. Certain industries are consumer facing, large scale or have a very public brand. A high score in this area means even small failures — Facebook’s outages are a good example — will have a big public impact.
Failure sensitivity. Sensitive industries are those for which an outage has immediate and high impact. If an investment bank can’t trade, planes can’t take off or clients can’t access their money, the sensitivity is high.
Recover-ability. Organizations that can take a lengthy time to restore normal service will suffer more seriously from IT failures. The costs of an outage may be multiplied many times over if the recovery time is lengthy. For example, airlines may find it takes days to get all planes and crews in the right location to restore normal operations.
Regulatory/compliance. Failures in the certain industries either must be reported or will attract attention from regulators. Emergency services (e.g., 911, 999, 112), power companies and hospitals are good examples … and this list is growing.
Platform dependents. Organizations whose customers include service providers — such as software as a service; infrastructure as a service; and co-location, hosting and cloud-based service providers — will not only breach service level agreements but also lose paying clients. (There are many examples of this.)

One of the challenges of carrying out assessments is that the impact of any particular service’s or application’s failing is changing, in two ways. First, in most cases, it is increasing, along with the dependency of all businesses and consumers on IT. And second, it is becoming more complicated and harder to determine accurately, largely because of the inter-dependency of many different systems and applications, intertwined to support different processes and services. There may even a logarithmic hockey stick curve, with the impact of failures growing rapidly as more systems, people and businesses are involved.

Looked at like this, it is clear that certain organizations have become more vulnerable to high impact outages than they were a year or two previously, because while the immediate impact on sales/revenue may not have the changed, the scale, profile or recover-ability may have. It may be that airlines, which only two years ago could board passengers manually, can no longer do so without IT; similarly, retail banking customers used to carry sufficient cash or checks to get themselves a meal and get home. Not anymore. These organizations now have a very tricky problem: How do they upgrade their infrastructure, and their processes, to a level of mission critically for which they were not designed?

All this raises a further tricky question that Uptime Institute is researching: Which industries, businesses or services have become (or will become) critical to the national infrastructure — even if a few years ago they certainly were not (or they are not currently)? And how should these services be regulated (if they should …)? We are seeking partners to help with this research. Organizations are not the only ones struggling with these questions — governments are as well.

More information on this topic is available to members of the Uptime Institute Network here.