Outages: understanding the human factor
Analyzing human error, with a view to preventing it, has always been challenging for data center operators. The cause of a failure can lie in how well a process was taught; in how tired, well trained or well resourced the staff are; or in whether the equipment itself was unnecessarily difficult to operate.
There are also definitional questions: if a machine fails due to a software error at the factory, is that human error? In addition, human error can play a role in outages that are attributed to other causes. For these reasons, Uptime Institute tends to research human error differently from other causes: we view it as one causal factor in outages, and rarely as the single or root cause. Using this methodology, Uptime estimates (based on 25 years of data) that human error plays some role in about two-thirds of all outages.
In Uptime’s recent surveys on resiliency, we have tried to understand the makeup of some of these human error-related failures. As the figure below shows, human error-related outages are most commonly caused either by staff failing to follow procedures (even where these have been agreed and codified) or by procedures that are themselves faulty.
This underscores a key tenet of Uptime: good training and processes both play a major role in outage reduction and can be implemented without great expense or capital expenditure.
Watch our 2022 Outage Report Webinar for more research on the causes, frequency and costs of digital infrastructure outages. The complete 2022 Annual Outage Analysis report is available exclusively to Uptime Institute members.