Don’t pit man against machine
By Charles Selkirk, ATD
While data centers in Southern Africa utilize world-class designs and construction techniques, ongoing operational sustainability models struggle to attract and retain sufficient qualified and motivated personnel, in part due to the lack of recognition of the importance of their work (unless something goes wrong!).
The Uptime Institute’s Tier models have wide recognition and acceptance in the region, and many of our data center customers have recently opted for Tier III designs. In most cases, customers require active-active A and B paths, both with UPS backup, because of the power instability experienced throughout the region. The recently launched Operational Sustainability Standard, however, has had less impact in this region and is only now starting to gain some traction. As a result, data center operators in Southern Africa seem to take one of two very different approaches to the working relationship between personnel and facility, which we have labeled Availability-Centric and Safety-Centric.
We based this observation on a catalog of data center failure and near-miss events compiled through our work in building facilities and providing sustainable support of ongoing operations in the region (especially in South Africa). That catalog led us to examine how the customers’ focus on operator safety might affect availability and reliability. We wanted to test two general business perceptions: that availability is paramount, and that operators and maintenance support personnel can be assumed to have the skills and motivation to meet their own safety needs.
Whatever the agreed-upon availability level of a facility, businesses consider availability to be paramount. This view is founded on the belief that business will suffer in the event of any downtime, and availability cannot be compromised unless the circumstances are either unavoidable or dire. In high-availability facilities, owners may believe there is no need for downtime and that maintenance can be done only in strictly controlled time slots, with generally no tolerance for faults or errors. The issue of operator and maintenance personnel safety is rarely, if ever, raised or discussed. Without generalizing too much, this attitude is prevalent in the financial services and retail industries.
In a progressive corporate culture, the safety of employees is paramount, and all business activities are undertaken with this understanding. This is not to say that availability is unimportant, only that “safety is king.” These businesses have better-informed engineering support, and without exception, they view accidents and other incidents as potentially significant losses that must be considered and minimized. This culture drives, empowers and enables designers and operators to value themselves and their workplaces – and, in our experience, it also leads to improved availability.
We have found this culture to be more prevalent within the resources and manufacturing industries, although some commercial and colocation operators have also adopted this perspective.
In the schedules below, we summarize 49 incidents that have occurred in data centers over a period of 13 years throughout Southern Africa. Luckily, none of these incidents led to any form of personal injury, although many of them led to outages that damaged reputations, sapped customer confidence and caused financial losses.
In the schedules, we list the centricity characteristic of the owners, the markets served, the availability levels, the presence of dual-cabinet feeds, a description of the root cause of the problem, and whether an outage occurred – plus generalized comments on the incidents reported. The assignment of a centricity characteristic label may appear somewhat subjective, but in our observation, the difference between the Safety-Centric and Availability-Centric perspectives is easily discernible.
We would make the following observations based on this study:
A. The study was limited to the 49 incidents reported. A wider study with a larger number of incidents may yield somewhat different results.
B. There has been a shift to higher-reliability facilities over the last decade, in accordance with the higher uptime requirements of customers.
C. Despite the shift to higher-availability facilities, it is worth noting that the number of downtime incidents reported by Safety-Centric customers remains significantly lower than the number reported by Availability-Centric customers.
D. There is a clear trend indicating that shifting data center operations and maintenance to a Safety-Centric approach significantly benefits customers in terms of uptime experienced.
Figure 1. In our study, half the incidents reported resulted in downtime of some form to the data center.
Figure 2. Clearly, a higher Tier-level design has the desired effect of reducing downtime.
*The availability assessments are the author’s subjective evaluations.
Figure 3. If we exclude downtime incidents at lower-availability facilities, the ratio adjusts. Three times as many incidents occur in facilities designed to meet Tier III certification as in Tier IV or similar facilities, in part because of the relative number of the lower-availability facilities.
Figure 4. Of the data center owners included in our report, 75% are clearly more focused on availability, while we would classify only 25% as Safety-Centric owners.
Figure 5. We can see from this chart that the breakdown is largely in agreement with the breakdown by centricity type. If we exclude the incidents reported for lower-availability facilities, the incident breakdown remains largely unchanged.
Figure 6. This figure illustrates the relative proportion of downtime incidents by centricity type.
Figure 7. Of the total of 33 incidents recorded by Availability-Centric customers, 22 (67%) caused either partial or total outages. Of the total of 16 incidents recorded by Safety-Centric customers, just three (19%) caused either partial or total outages. In all three cases reported, the downtime incidents were the result of equipment failures in older facilities where maintenance practices and scheduling were not optimal. If we reexamine the proportion of downtime incidents when lower-availability facilities are excluded, the incident breakdown is substantially similar.
Figure 8. Of the 12 downtime incidents reported, 11 were recorded by Availability-Centric customers while only one was recorded by a Safety-Centric customer. That single failure was attributable to equipment failure in a legacy facility.
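The proportions quoted in the Figure 7 caption can be rechecked with a few lines of arithmetic. The sketch below is illustrative only: the counts (33 and 16 incidents, 22 and 3 outages) come from the caption, while the variable names and the "rate ratio" comparison are our own assumptions and not part of the original study. With only 49 incidents, the ratio is a rough indicator rather than a statistically robust result.

```python
# Incident counts as quoted in the Figure 7 caption.
# The dictionary layout and names are illustrative assumptions.
incidents = {
    "Availability-Centric": {"total": 33, "outages": 22},
    "Safety-Centric": {"total": 16, "outages": 3},
}


def outage_rate(group):
    """Share of a group's incidents that caused a partial or total outage."""
    return group["outages"] / group["total"]


rates = {name: outage_rate(group) for name, group in incidents.items()}

# Totals across both groups, for comparison with the Figure 1 caption.
total_incidents = sum(group["total"] for group in incidents.values())
total_outages = sum(group["outages"] for group in incidents.values())
overall_rate = total_outages / total_incidents

# How many times more often an Availability-Centric incident led to an outage.
rate_ratio = rates["Availability-Centric"] / rates["Safety-Centric"]

for name, rate in rates.items():
    print(f"{name}: {rate:.0%} of incidents caused an outage")
print(f"All incidents: {total_outages} of {total_incidents} ({overall_rate:.0%})")
print(f"Outage-rate ratio: {rate_ratio:.1f}x")
```

On these counts, the two groups together account for all 49 incidents, and 25 of them (about half, as stated for Figure 1) involved downtime. An incident at an Availability-Centric site was roughly 3.6 times as likely to end in an outage as one at a Safety-Centric site.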
The following tables list the 13 years of events the author used to draw the conclusions presented in this article.
Charles Selkirk was born in Zimbabwe and grew up in Harare, completing high school in 1977. He served as a crime investigator in rural districts for the local police force and later became a circuit court prosecutor. Mr. Selkirk then followed his brother into electrical engineering, earning a degree from the University of Cape Town in 1984. In 1985, he married and moved to ‘The Reef’ near Johannesburg and started working on a deep-level gold mine, initially serving as a foreman and supervisor. He eventually became a section engineer in charge of engineering construction and maintenance operations, a position he held for five years.
Mr. Selkirk left the mining industry in 1989 and moved to Cape Town, where he joined his brother in an engineering consulting practice. In the early years of the firm, they consulted on a wide range of projects for building services. Mr. Selkirk’s brother emigrated to the U.S. in 2000 as the firm shifted its focus to specialize in data center MEP services, turnkey data center construction and running data centers. More recently, his two sons have joined the business – one as a site engineer and the other as a programmer.