Phasing Out Data Center Hot Work
Despite years of discussion, warnings and strict regulations in some countries, hot work remains a contentious issue in the data center industry. Hot work is the practice of working on energized electrical circuits (voltage limits differ regionally). It is usually done, in spite of the risks, to reduce the possibility of a downtime incident during maintenance.
Uptime Institute advises against hot work in almost all instances. The safety concerns are just too great, and data suggests that work on energized circuits may, at best, only reduce the number of manageable incidents, while increasing the risk of arc flash and other events that damage expensive equipment and may lead to an outage or injury. In addition, concurrently maintainable or fault-tolerant designs, as described in Uptime Institute’s Tier Standard, make hot work unnecessary.
The pressure against hot work continues to mount. In the US, electrical contractors have begun to decline some work on energized circuits, even when an energized work permit has been created and signed by appropriate management, as required by National Fire Protection Association (NFPA) 70E (Standard for Electrical Safety in the Workplace). In addition, the US Department of Labor’s Occupational Safety and Health Administration (OSHA) has repeatedly rejected business continuity as an exception to hot work restrictions, making it harder for management to justify hot work and to find executives willing to sign the energized work permit.
OSHA statistics make clear that work on energized systems is a dangerous practice, especially for construction trades workers; installation, maintenance, and repair occupations; and grounds maintenance workers. For this reason, NFPA 70E sharply limits the situations in which organizations are allowed to work on energized equipment. Personnel safety is not the only issue; personal protective equipment (PPE) protects only workers, not equipment, so an arc flash can destroy many thousands of dollars of IT gear.
Ignoring local and national standards can also be costly. OSHA reported 2,923 lockout/tagout and 1,528 PPE violations in 2017, among the many safety concerns it addressed that year. New minimum penalties for a single violation exceed $13,000, with top total fines for numerous, willful and repeated violations running into the millions of dollars. Wrongful death and injury suits add to the cost, and violations can lead to higher insurance premiums.
Participants in a recent Uptime Institute discussion roundtable agreed that the remaining firms performing work on live loads should begin preparing to end the practice. They said that senior management is often the biggest impediment to ending hot work, at least at some organizations, despite the well-known and documented risks. Executive resistance can be tied to concerns about power supplies or failure to maintain independent A/B feeds. In some cases, service level agreements contain restrictions against powering down equipment.
Despite executive resistance at some companies, the trend is clearly against hot work. By 2015, more than two-thirds of facilities operators had already eliminated the practice, according to Uptime Institute data. A tighter regulatory environment, heightened safety concerns, increased financial risk and improved equipment should combine to all but eliminate hot work in the near future. But there are still holdouts, and the practice is far more acceptable in some countries — China is an example — than in others, such as the US, where NFPA 70E severely limits the practice in all industries.
Also, hot work does not eliminate IT failure risk. Uptime Institute has been tracking data center abnormal incidents for more than 20 years, and the data shows that at least 71 failures occurred during hot work. While these failures are generally attributed to poor procedures or maintenance, a recent, more careful analysis concluded that better procedures or maintenance (or both) would have made it possible to perform the work safely, and without any failures, on de-energized systems.
The Uptime Institute abnormal incident database includes only four injury reports; all occurred during work on energized systems. In addition, the database includes 16 reports of arc flash. One occurred during normal preventive maintenance and one during an infrared scan. Neither caused injury, but the potential risk to personnel is apparent, as is the potential for equipment damage (and legal exposure).
Undoubtedly, eliminating hot work is a difficult process. One large retailer that has just begun the process expects the transition to take several years. And not all organizations succeed: Uptime Institute is aware of at least one organization in which incidents involving failed power supplies caused senior management to cancel their plan to disallow work on energized equipment.
According to several Uptime Institute Network community members, building a culture of safety is the most time-consuming part of the transition away from hot work. The procedural side tends to come more easily, because data centers are goal-oriented organizations, well-practiced at developing and following programs to identify and eliminate risk.
It is not necessary or even prudent to eliminate all hot work at once. The IT team can help slowly retire the practice by eliminating the most dangerous hot work first, building experience on less critical loads, or reducing the number of circuits affected at any one time. To prevent common failures when de-energizing servers, the Operations team can increase scrutiny on power supplies and ensure that dual-corded servers are properly fed.
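As an illustration of the check on dual-corded servers described above, the following is a minimal Python sketch of a pre-maintenance audit. It assumes a simple inventory that records, for each server, which upstream feed each power supply is plugged into and whether that supply passed its last check. The names PowerSupply, Server, at_risk_servers and the sample inventory are hypothetical and do not come from any Uptime Institute tool or standard.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PowerSupply:
    feed: str      # upstream feed label, e.g. "A" or "B" (hypothetical naming)
    healthy: bool  # result of the most recent power-supply check


@dataclass(frozen=True)
class Server:
    name: str
    supplies: Tuple[PowerSupply, ...]


def at_risk_servers(servers: List[Server], feed_to_isolate: str) -> List[str]:
    """Return names of servers that would lose power if the feed were isolated."""
    at_risk = []
    for server in servers:
        # A server survives only if a healthy supply sits on some other feed.
        survives = any(
            ps.healthy and ps.feed != feed_to_isolate for ps in server.supplies
        )
        if not survives:
            at_risk.append(server.name)
    return at_risk


# Hypothetical sample inventory, for illustration only.
inventory = [
    Server("web-01", (PowerSupply("A", True), PowerSupply("B", True))),
    Server("db-01", (PowerSupply("A", True), PowerSupply("A", True))),    # both cords on feed A
    Server("app-01", (PowerSupply("A", True), PowerSupply("B", False))),  # failed B-side supply
    Server("log-01", (PowerSupply("A", True),)),                          # single-corded
]

print(at_risk_servers(inventory, "A"))
```

In this sample, isolating feed A would flag db-01, app-01 and log-01, which are the mis-corded, failed-supply and single-corded cases that would need to be corrected before the feed is de-energized. A real audit would draw on the site's own asset and monitoring records rather than a hand-built list.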
In early data centers, the practice of hot work was understandable — necessary, even. However, Uptime Institute has long advocated against hot work. Modern equipment and higher resiliency architectures based on dual-corded servers make it possible to switch power feeds in the case of an electrical equipment failure. These advances not only improve data center availability, they also make it possible to isolate equipment for maintenance purposes.