Here is Part 2 of our Q&A regarding digital infrastructure considerations during the COVID-19 crisis. Keep in mind that we are all handling this crisis in varied ways, and learning from each other along the way. In that process, we really ARE all finding our own “New Normal” and ultimately we will come out of this crisis stronger due to the renewed focus on operational effectiveness, risk avoidance and contingency planning.
As part of our infrastructure community leadership, we are regularly adding new materials, guidance and recommendations to help digital infrastructure owners and operators in these times.
We’ve conducted a number of webinars attended by thousands of listeners where we shared first-hand COVID-19 experience. Below is Part 2 of the Q&A responses brought up during these webinars. These questions deal with Site Sanitation and Security. (Part 1 dealt with Staffing, and future published info will focus on Deferred Maintenance, Tier Standard, Remote Working, Supply Chain and Long-Term expectations.)
SITE SANITIZATION
Q: If an average of 50 people per day enter the data center, how often is filter change recommended? What parameters do I use to make that change?
A: Under normal operating scenarios, filter changes are typically triggered by an increased pressure differential measured across the filter. As the filter clogs with debris and particulate, the pressure drop increases. For operations during COVID-19, there does not appear to be any reason to change the outlook on filter changes, including what should trigger a filter change, based on information currently available on how the virus spreads (although that information is changing over time).
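To make the pressure-differential trigger concrete, here is a minimal monitoring sketch. It is only illustrative: the pressure-drop limit and the way readings are obtained are assumptions for the example; actual end-of-life limits come from the filter manufacturer and your building management system.

```python
# Illustrative sketch of a differential-pressure filter-change check.
# The limit below is an assumed example value; real end-of-life limits
# come from the filter manufacturer and site monitoring data.

FILTER_DP_LIMIT_PA = 250.0  # assumed end-of-life pressure drop, in pascals

def filter_needs_change(dp_readings_pa: list[float], limit_pa: float = FILTER_DP_LIMIT_PA) -> bool:
    """Flag a filter change when the recent average pressure drop exceeds the limit."""
    if not dp_readings_pa:
        return False
    recent_avg = sum(dp_readings_pa) / len(dp_readings_pa)
    return recent_avg >= limit_pa

# Example: readings trending upward as the filter loads with particulate.
readings = [230.0, 245.0, 260.0, 275.0]
if filter_needs_change(readings):
    print("Schedule filter replacement: pressure drop above limit.")
```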
Q: What types of access controls or filters are recommended to implement? Are there air conditioning filters on the market that limit the circulation of viruses in the data center?
A: Currently, it is not believed that filters will play a large part in mitigating spread of the virus in data centers. Some research suggests that high-end filters (for example, high efficiency particulate air filters) are capable of filtering particles the size of COVID-19. However, based on current information on how the virus spreads, it is not generally believed that it spreads in a true airborne manner to the point where it gets in ventilation systems; it spreads via proximity to infected people sneezing, coughing, etc. Data centers are unique in that they have multiple air changes per minute, which is different from other facility types. It is important to note that while filtering could theoretically reduce spread, the National Air Filtration Association does not believe this will happen from a practical perspective.
Q: My concern is that, in a closed-loop environment, if COVID-19 enters the data center, the virus will live and could infect more people.
A: That is an accurate statement. That is also why it is important to take any reasonable steps possible to keep it out of the data center via strong site-access control requirements and checks. It is also why many data center operators are implementing regular disinfecting so that if it is in the data center, spread is mitigated. Uptime Institute is aware that many operators have found specialized data center cleaning companies that are capable of disinfecting sites in accordance with guidelines for this pandemic from the WHO (World Health Organization) and/or the CDC (US Centers for Disease Control and Prevention).
Q: How long can airborne virus particles last in a data center, given the cold air?
A: According to the WHO: “It is not certain how long the virus that causes COVID-19 survives on surfaces, but it seems to behave like other coronaviruses. Studies suggest that coronaviruses (including preliminary information on the COVID-19 virus) may persist on surfaces for a few hours or up to several days. This may vary under different conditions (e.g., type of surface, temperature or humidity of the environment).” We believe that this is an area that is still being studied. There does not appear to be any consensus, other than there are a number of factors, including temperature and humidity, that can impact this.
Q: Considering that the virus has an incubation period, would it be possible to bring contamination into the data center?
A: Based on the information presented by the WHO and CDC, it is likely that the virus can be introduced and that there can be contamination in the data center. The most likely vector for transmission is infected individuals who do not know they are infected. Please refer to the WHO, the CDC and/or your local authority for more information.
Q: Are there any types of clothing, masks or gloves recommended for customers or suppliers accessing the data center, so as not to expose our staff? Is it more feasible for staff to carry this type of PPE or to require it from customers or suppliers?
A: The CDC is recommending the N95 mask; other masks do not seal tightly enough around the nose and mouth to provide proper protection. To be fully effective, an N95 must be fitted properly. Specialists receive training annually on how to fit these respirators around the nose, cheeks and chin, ensuring that wearers don’t breathe around the edges of the respirator. When properly fitted, the thick filter material makes breathing noticeably harder; wearers have to work to breathe in and out. All personnel accessing the data center should be wearing PPE in accordance with the current policies related to the COVID-19 pandemic and future similar events.
It is important to note that we are seeing companies implement their pandemic plans, which to a large extent includes limiting site access to customers and employees. The pandemic plans vary between companies, but we are hearing of restrictions being implemented to include the use of masks, gloves, etc., primarily to follow CDC guidelines. Our suggestion is to follow CDC guidelines, as well as follow your approved pandemic plan. Please note that sanitizing methods may be a better thing to focus on than the use of masks. Also, while masks would provide an additional layer of protection, until production of masks ramps up, Uptime Institute is now somewhat cautious about recommending data center owners and operators stock up on masks. The “good” masks are in short supply and should be allocated to healthcare professionals until there is sufficient supply to go around.
Q: If we decide to sanitize [our data center], the fire system detectors can go off. What do you recommend?
A: It is common during various housekeeping operations to put the fire system into bypass. This is especially important if there is a VESDA (very early smoke detection apparatus) system present, which can be triggered by disturbances of even very small particulate (they are specifically designed to be highly sensitive). Our recommendation is to put the fire system in bypass while maintaining compliance with local jurisdictional requirements. This may require fire watch or similar measures be taken while the system is in bypass.
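To illustrate pairing a bypass window with compensating measures, a hypothetical sketch follows. The function and workflow are invented for illustration and do not correspond to any real fire-panel API; local jurisdictional requirements always govern.

```python
# Hypothetical sketch: pair a fire-detection bypass with a fire watch
# and guarantee the system is restored afterward. Not a real fire-panel API.
from contextlib import contextmanager

@contextmanager
def fire_system_bypass(zone: str, log=print):
    """Post a fire watch, place a detection zone in bypass, and always restore it."""
    log(f"Fire watch posted for zone {zone} (per local jurisdictional requirements).")
    log(f"Zone {zone} placed in bypass for housekeeping/sanitization.")
    try:
        yield
    finally:
        log(f"Zone {zone} restored to normal detection.")
        log(f"Fire watch stood down for zone {zone}.")

# Example usage during a disinfection window.
with fire_system_bypass("Data Hall 1"):
    print("Disinfection work under way...")
```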
SITE SECURITY
Q: Insecurity is likely to increase in different regions; do you recommend increasing security?
A: Certainly; facilities operating in severely affected areas could face an elevated level of security risk. In these areas, management must adopt enhanced security policies, including prescreening all scheduled visitors before arrival on-site; prohibiting all unscheduled visitors; and if possible and applicable, creating a separate, secure entrance for all parties involved in essential on-site construction projects and establishing a policy that they (or any other visitors) are not allowed to interact with duty operations personnel.
We are all in this together. We are all finding our own “New Normal”. We are all learning from each other and will come out of this crisis stronger due to the renewed focus on operational effectiveness, risk avoidance and contingency planning.
To this end, Uptime Institute has been creating many types of guidance and recommendations to help digital infrastructure owners and operators ride through the COVID-19 crisis. We want to help the entire community deal with their present and prepare for their future, where the New Normal will be a way of life.
We’ve created COVID-19 operational guideline reports based on our 25 years of data center risk management experience, published associated real-time updates and bulletins, and conducted a number of webinars attended by thousands of listeners where we shared first-hand COVID-19 experience. As a community, it’s clear we are all looking for each other’s support, guidance and experience.
As part of these webinars, we have published a series of digital infrastructure Q&A documents, focused on various categories of COVID-19-specific topics: Staff Management, Site Sanitation, Site Security, Deferred Maintenance, Tier Standard, Remote Working, Supply Chain and Long-Term expectations. Presented below is PART 1 of these Q&A topics, covering Staff Management during COVID-19.
STAFF MANAGEMENT
Q: What is your view on continuous, long shifts for four or seven days in the data center?
A: It is not best practice to move operations to continuous four- or seven-day shifts, as fatigue and stress increase the human risk factor that can cause abnormal incidents. Instead, we recommend assessing an extension of shift time from 8 hours to 12 hours, limited to a maximum of two or three consecutive days. Any extended continuous shifts should include long, regular breaks each shift to avoid fatigue. There needs to be a careful balance between the increased risk of human fatigue and the mitigated risk of virus spread. Managers should also ensure that the total hours worked per person do not increase, that overtime does not exceed 10%, and that shifts are arranged so that staff can rest adequately between them.
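The constraints above (12-hour maximum shifts, no more than two or three consecutive days, total hours held flat and overtime capped at 10%) can be expressed as a simple roster check. The sketch below assumes a 40-hour baseline week purely for the example; it is not Uptime Institute tooling.

```python
# Illustrative roster check against the shift guidance above.
# The 40-hour baseline week is an assumption made for this example.

def roster_ok(shift_hours: list[float], baseline_weekly_hours: float = 40.0,
              max_shift_hours: float = 12.0, max_consecutive_days: int = 3,
              overtime_cap: float = 0.10) -> bool:
    """Return True if a week's shifts respect length, consecutive-day and overtime limits."""
    consecutive = 0
    for hours in shift_hours:
        consecutive = consecutive + 1 if hours > 0 else 0
        if hours > max_shift_hours or consecutive > max_consecutive_days:
            return False
    return sum(shift_hours) <= baseline_weekly_hours * (1 + overtime_cap)

# Example: three 12-hour days, a rest day, then one 8-hour day (44 hours total).
print(roster_ok([12, 12, 12, 0, 8, 0, 0]))  # True: within the 10% overtime cap
```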
Q: What procedures should we follow to identify an infected staff member?
A: As described in our report COVID-19: Minimizing critical facility risk, we recommend contact tracing systems. Register the health information and location of your organization’s personnel, suppliers’ personnel and other related personnel every day to monitor possible exposure to the virus and/or any symptoms (including those of the common cold). We recommend prescreening all scheduled visitors before they arrive on-site, including sending a questionnaire via email 48 hours prior to their visit. Require completion of the questionnaire before the appointment is confirmed. Verify that all answers remain unchanged upon arrival and institute temperature checks using noncontact thermometers before entry to the facility.
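As a simple illustration of that prescreening sequence (questionnaire completed in advance, answers re-verified on arrival, noncontact temperature check at the door), a hypothetical sketch follows. The 38.0 °C fever threshold is an assumption for the example, not official guidance.

```python
# Hypothetical sketch of the visitor prescreening flow described above.
# The fever threshold is an assumed example value, not official guidance.
from dataclasses import dataclass

@dataclass
class Visitor:
    name: str
    questionnaire_complete: bool       # returned before the appointment was confirmed
    answers_unchanged_on_arrival: bool
    temperature_c: float               # noncontact reading at the entrance

def may_enter(v: Visitor, fever_threshold_c: float = 38.0) -> bool:
    """Admit only visitors who completed prescreening, re-confirmed their answers,
    and pass the temperature check before entering the facility."""
    return (v.questionnaire_complete
            and v.answers_unchanged_on_arrival
            and v.temperature_c < fever_threshold_c)

print(may_enter(Visitor("Scheduled vendor", True, True, 36.8)))     # True
print(may_enter(Visitor("Unscreened visitor", False, True, 36.8)))  # False
```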
For a confirmed COVID-19 case at the site, we recommend that cleaning personnel use bio-hazard suits, gloves, shoe coverings, etc. and that all personal protective equipment (PPE) is bagged and removed from the site once cleaning is complete. With or without a confirmed case at the site, ensure the availability of PPE, including masks, gloves and hazardous materials or hazmat suits. Depending on the appropriate medical or management advice, workers should use masks during shift turnover. Training pairs (e.g., senior engineer and trainee) must wear masks at all times.
Q: How feasible is it to move families to the data center?
A: Although housing staff on-site should be considered only as a last resort, regions could go into lockdown mid-shift, so you may need to prepare for that eventuality. There are disaster recovery plans that include providing accommodation for several family members for up to 2 weeks to avoid traveling to and from the data center. While the data center is perceived as a controlled-access space, it is not a safe space. Therefore, any organization considering this option should also consider offering a specialized training program for family members that includes awareness of the hazards and the associated risks, emergency evacuation procedures, etc.
Q: Is it always recommended to keep personnel 24×7 for Tier III and Tier IV data centers?
A: Yes. It is a required criterion of the Uptime Institute Tier Certification of Operational Sustainability to have a minimum of one qualified full-time staff member on-site per shift, 24 hours a day, 7 days a week, for Tier III data centers, and a minimum of two per shift, 24 hours a day, 7 days a week, for Tier IV data centers.
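Those per-shift minimums can be captured in a trivial lookup. The sketch below encodes only the two figures stated above and is, of course, no substitute for the Tier Certification requirements themselves.

```python
# Minimal sketch of the per-shift, 24x7 qualified-staff minimums mentioned above.
MIN_QUALIFIED_STAFF_PER_SHIFT = {"Tier III": 1, "Tier IV": 2}

def shift_staffing_ok(tier: str, qualified_staff_on_shift: int) -> bool:
    """Check a shift roster against the minimum on-site qualified-staff presence."""
    required = MIN_QUALIFIED_STAFF_PER_SHIFT.get(tier)
    if required is None:
        raise ValueError(f"No staffing minimum recorded for {tier}")
    return qualified_staff_on_shift >= required

print(shift_staffing_ok("Tier IV", 1))  # False: Tier IV requires at least two per shift
```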
Q: We must not forget the following considerations for staff that may need to stay at the data center for 24 hours or more: the need to prepare food; a supply of canned food for more than 40 days, as well as alkaline water and the ability to purify it by reverse osmosis in case of water contamination; and cardiopulmonary resuscitation equipment for emergencies.
A: Correct, all these initiatives are proactive and preventative. Uptime Institute’s COVID-19: Minimizing critical facility risk report provides additional information related to what measures data center management should consider for the health and safety of staff and the protection of the site.
Q: Do you recommend interviewing all internal staff to determine their personal situation, and whether this should be done by a psychologist, particularly if staff are in the data center for a long time?
A: Organizations should maintain open and continuous communication with staff, customers and relevant third parties on a daily basis or even twice daily. Briefings may be appropriate as the conditions change. We also recommend sharing news updates and links to public resources to keep staff informed of the current status of the pandemic and the best practices for maintaining a safe and healthy work environment. As appropriate for each case, emotional support should be provided to reduce stress. Special attention should be given to any changes to continuous, long shifts that could increase the risk of human error, which may cause abnormal incidents.
One of the findings of Uptime Institute’s recently published report Annual outage analysis 2020 is that the most serious categories of outages — those that cause a significant disruption in services — are becoming more severe and more costly. This isn’t entirely surprising: individuals and businesses alike are becoming ever more dependent on IT, and it is becoming harder to replicate or replace an IT service with a manual system.
But one of the findings raises both red flags and questions: Setting aside those that were partial and incidental (which had minimal impact), publicly reported outages over the past three years appear to be getting longer. And this, in turn, is likely to be one of the reasons why the costs and severity of outages have been rising.
The data Uptime Institute collected on publicly reported outages in the years 2017-2019 (excluding those that did not have a financial or customer impact and those for which there was no known cause) shows outages are on the rise. This is due to a number of factors, including greater deployment of IT services and better reporting. But the data also shows a tilt towards longer outages — especially those that lasted more than 48 hours. (This is true even though one of the biggest causes of lengthy outages — ransomware — was excluded from our sample.)
The outage times reported are until full IT service recovery, not business recovery — it may take longer, for example, to move aircraft back to where they are supposed to be, or to deal with backlogs in insurance claims. This trend is not dramatic, but it is real and it is concerning, because a 48-hour interruption can be lethal for many organizations.
Why is it happening? Complexity and interdependency of IT systems and greater dependency on software and data are very likely to be big reasons. For example, Uptime Institute’s research shows that fewer big outages are caused by power failure in the data center and more by IT systems configuration now than in the past. While resolving facility engineering issues may not be easy, it is usually a relatively predictable matter: failures are often binary, and very often recovery processes have been drilled into the operators and spare parts are kept at hand. Software, data integrity and damaged/interrupted cross-organizational business processes, however, can be much more difficult issues to resolve or sometimes even to diagnose — and these types of failure are becoming much more common (and yes, sometimes they are triggered by a power failure). Furthermore, because failures can be partial, files may become out of sync or even be corrupted.
There are lessons to be drawn. The biggest is that the resiliency regimes that facilities staff have lived by for three decades or more need to be extended and integrated into IT and DevOps and fully supported and invested in by management. Another is that while disaster recovery may be slowly disappearing as a type of commercial backup service, the principles of vigilance, recovery and failover – especially when under stress – are more important than ever.
The full report Annual outage analysis 2020 is available to members of the Uptime Institute Network; membership can be requested here.
Over the past few weeks, Uptime Institute held multiple customer roundtables to discuss the impact of the COVID-19 virus on data center operations and potential operational responses to its spread. We gathered our community’s insights and best practices, combined them with our own 25 years’ worth of infrastructure operational management knowledge, and we are now making this information freely available to the data center industry HERE.
A little background to get you started right away….
Dozens of organizations were represented at these roundtables, which were open to a global audience of Uptime Institute Network membership. What we learned is that while most organizations have a plan for foreseen emergency situations, few have one specific to a global pandemic. As a result, many have been hurriedly modifying existing plans based on gut feel and good intentions: creating tiered response levels, identifying events that would trigger the next level of response, and researching concerns specific to a pandemic (e.g., what does “deep cleaning” mean in a white space, and what are the implications for different data center environments — raised floors, multi-tenant data centers, mixed-use facilities, etc.?).
But this is clearly uncharted territory for ALL of us. For many organizations, the Human Resources and/or Environmental Health & Safety department(s) take the lead in generating an organization-wide response plan, and specific business units, such as data center operations, incorporate that guidance into a plan tailored to their business mission and setting. Because many organizations have data centers in multiple regions, responses may vary by location or facility characteristics. A sample but very broad Emergency Response Plan by the US government’s FDA (with portions pertaining to the delivery of IT services contained within) can be seen here.
But immediately actionable, tangible advice goes a long way in times like these. Several participants mentioned that their facilities now screen all potential visitors with a questionnaire. They do not admit anyone who reports symptoms (personally or in family members) or who has traveled recently to areas with high levels of COVID-19 cases. Some respondents advised that an additional security measure involves prescreening all scheduled visitors: Send the visitor the questionnaire via email 4-8 hours prior to their visit and require completion before the appointment is confirmed. Only permit entry if the questionnaire indicates a low probability of infection (confirm all answers remain unchanged upon arrival) and prohibit unscheduled visitors altogether.
Some facilities – for example, multi-tenant data centers or mixed-use facilities – have a higher volume of visitors, and thus greater potential for COVID-19 spread. To avoid inconvenience and potential client dissatisfaction, be proactive: Inform all affected parties of the COVID-19 preparedness plan in place and its impact on their access to the facility in advance.
Sanitization is a particular challenge, with several participants reporting disinfectant/hand sanitizer shortages. Many had questions specific to deep cleaning the white space environment, given its high rate/volume of air exchange, highly specialized electronic equipment and possible raised floor configuration. Spray techniques are more effective than simply wiping surfaces with disinfectant solutions, as the antiseptic mist coats surfaces for a longer period. Many organizations are hiring specialist cleaning firms and/or following CDC recommendations for disinfection.
As COVID-19 spreads, more organizations are shifting their energy from academically tweaking their written response plans to implementing them. In many companies, that decision is made by a business unit, based on site environment, number of COVID-19 cases in the area and government-mandated restrictions. Mission-critical facilities have a particular remit, though, so they need to create and implement plans specific to their business mission.
Good preparation simplifies decision-making. Roundtable participants suggest the following:
Categorizing essential versus nonessential tasks and calendaring them in advance (makes it easier to identify maintenance items you can postpone, and for how long; see the sketch after this list).
Cross-training personnel and maintaining up-to-date skill inventories/certifications (helps ensure core capabilities are always available).
Having contingency plans in place (means you’re prepared to manage supply chain disruption and staff shortages).
Stress-testing technologies and procedures in advance (gives you confidence that you can accommodate a move to remote work: shifting procedures that are usually performed manually to an automated process, monitoring remotely, interacting virtually with other team members, etc.).
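As referenced in the first item above, a minimal sketch of how essential versus deferrable tasks might be calendared is shown below. The tasks and deferral windows are invented examples, not recommendations.

```python
# Illustrative sketch: flag which calendared maintenance tasks can be postponed, and until when.
# The example tasks and deferral windows are invented for illustration only.
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class MaintenanceTask:
    name: str
    essential: bool
    due: date
    max_deferral_days: int  # 0 means the task cannot slip

def deferrable_until(task: MaintenanceTask) -> Optional[date]:
    """Return the latest acceptable date for a deferrable task, or None if it must run as planned."""
    if task.essential or task.max_deferral_days == 0:
        return None
    return task.due + timedelta(days=task.max_deferral_days)

tasks = [
    MaintenanceTask("Generator load bank test", essential=True, due=date(2020, 4, 15), max_deferral_days=0),
    MaintenanceTask("Non-critical lighting PM", essential=False, due=date(2020, 4, 20), max_deferral_days=90),
]
for task in tasks:
    print(f"{task.name}: {deferrable_until(task) or 'cannot be postponed'}")
```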
It’s no longer a question of if a plan like this will be needed; we know it is! Most facility operators need to quickly craft and then implement their response plan, and learn from this incident for the future.
Uptime Institute has created a number of resources and will be providing specific guidance regarding the COVID-19 situation here.
As enterprises continue to move from a focus on capital expenditures to operating expenditures, more data center components will also be consumed on a pay-as-you-go, “as a service” basis.
“-aaS” goes mainstream
The trend toward everything “as a service” (XaaS) is now mainstream in IT, ranging from cloud (infrastructure-aaS) and software-aaS (SaaS) to newer offerings, such as bare metal-aaS, container-aaS, and artificial intelligence-aaS (AI-aaS). At the IT level, service providers are winning over more clients to the service-based approach by reducing capital expenditures (capex) in favor of operational expenditures (opex), by offering better products, and by investing heavily to improve security and compliance. More organizations are now willing to trust them.
But this change is not confined to IT: a similar trend is underway in data centers.
Why buy and not build?
While the cost to build new data centers is generally falling, driven partly by the availability of more prefabricated components, enterprise operators have been increasingly competing against lower-cost options to host their IT — notably colocation, cloud and SaaS.
Cost is rarely the biggest motivation for moving to cloud, but it is a factor. Large cloud providers continue to build and operate data centers at scale and enjoy the proportional cost savings as well as the fruits of intense value engineering. They also spread costs among customers and tend to have much higher utilization rates compared with other data centers. And, of course, they invest in innovative, leading-edge IT tools that can be rolled out almost instantly. This all adds up to ever-improving IT and infrastructure services from cloud providers that are cheaper (and often better) than using or developing equivalent services based in a smaller-scale enterprise data center.
Many organizations have now come to view data center ownership as a big capital risk — one that only some want to take. Even when it’s cheaper to deliver IT from their own “on-premises” data center, the risks of data center early obsolescence, under-utilization, technical noncompliance or unexpected technological or local problems are all factors. And, of course, most businesses want to avoid a big capital outlay: Our research shows that, in 2017, the total cost of ownership of an “average” concurrently maintainable 3 megawatt (MW) enterprise data center amortized over 15 years was about $90 million, and that roughly half of the cost is invested in three installments over the first six years, assuming a typical phased build and bricks-and-mortar construction.
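To put those figures in perspective, a back-of-the-envelope breakdown using only the numbers already quoted (roughly $90 million over 15 years for a 3 MW concurrently maintainable facility, with about half invested in three installments over the first six years) looks like this:

```python
# Back-of-the-envelope view of the enterprise data center cost figures quoted above.
tco_usd = 90_000_000       # total cost of ownership over the amortization period
amortization_years = 15
capacity_mw = 3.0

annualized = tco_usd / amortization_years    # ~$6 million per year
per_mw_year = annualized / capacity_mw       # ~$2 million per MW per year
early_capital = tco_usd * 0.5                # roughly half invested in the first six years
per_installment = early_capital / 3          # spread over three installments

print(f"Annualized TCO:      ${annualized:,.0f}")
print(f"Per MW per year:     ${per_mw_year:,.0f}")
print(f"Early-phase capital: ${early_capital:,.0f} (three installments of ~${per_installment:,.0f})")
```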
This represents a significant amount of risk. To be economically viable, the enterprise must typically operate a facility at a high level of utilization — yet forecasting future data center capacity remains enterprises’ top challenge, according to our research.
Demand for enterprise data centers remains sizable, in spite of the alternatives. Many enterprises with smaller data centers are closing them and consolidating into premium, often larger, centralized data centers and outsourcing as much else as possible.
The appeal of the cloud will continue to convince executives and drive strategy. Increasingly, public cloud is an alternative way to deliver workloads faster and cheaper without having to build additional on-premises capacity. Scalability, portability, reduced risk, better tools, high levels of resiliency, infrastructure avoidance and fewer staff requirements are other key drivers for cloud adoption. Innovation and access to leading-edge IT will likely be bigger factors in the future, as will more cloud-first remits from upper management.
Colocation, including sale leasebacks
Although rarely thought of in this way, colocation is the most widely used “data center-aaS” offering today. Sale with leaseback of the data center by enterprise to colos is also becoming more common, a trend that will continue to build (see UII Note 38: Capital inflow boosts the data center market).
Colo interconnection services will attract even more businesses. More will likely seek to lease space in the same facility as their cloud or other third-party service provider, enabling lower latency, lower costs and more security for third-party services, such as storage-aaS and disaster recovery-aaS.
While more enterprise IT is moving to colos and managed services (whether or not it is cloud), enterprise data centers will not disappear. More than 600 IT and data center managers told Uptime Institute that, in 2021, about half of all workloads will still be in enterprise data centers, and only 18% of workloads in public cloud/SaaS.
Other “as a service” trends in data centers
Data center monitoring and analysis is another relatively new example of a pay-as-you-go service. Introduced in late 2016, data center management as a service is a big data-driven cloud service that provides customized analysis and is paid for on a recurring basis. The move to a pay-as-you-go service has helped unlock the data center infrastructure management market, which was struggling for growth because of costs and complexity.
Energy backup and generation is another area to watch. Suppliers have introduced various pay-as-you-go models for their equipment. These include leased fuel cells owned by the supplier (notably Bloom Energy), which charges customers only for the energy produced. By eliminating the client’s risk and capital outlay, it can make the supplier’s sale easier (although they have to wait to be paid). Some suppliers have ventured into UPS-aaS, but with limited success to date.
More alternatives to ownership are likely for data center electrical assets, such as batteries. Given the high and fast rate of innovation in the technology, leasing large-scale battery installations delivers the capacity and innovation benefits without the risks.
It’s also likely that more large data centers will use energy service companies (ESCOs) to produce, manage and deliver energy from renewable microgrids. Demand for green energy, for energy security (that is, energy produced off-grid) and energy-price stability is growing; ESCOs can deliver all this for dedicated customers that sign long-term energy-purchase agreements but don’t have the capital required to build or the expertise necessary to run a green microgrid.
Demand for enterprise data centers will continue but alongside the use of more cloud and more colo. More will be consumed “as a service,” ranging from data center monitoring to renewable energy from nearby dedicated microgrids.
The full report Ten data center industry trends in 2020 is available to members of the Uptime Institute Network. Membership information can be found here.
Despite years of discussion, warnings and strict regulations in some countries, data center hot work remains a contentious issue in the data center industry. Hot work is the practice of working on energized electrical circuits (voltage limits differ regionally) — and it is usually done, in spite of the risks, to reduce the possibility of a downtime incident during maintenance.
Uptime Institute advises against hot work in almost all instances. The safety concerns are just too great, and data suggests work on energized circuits may — at best — only reduce the number of manageable incidents, while increasing the risk of arc flash and other events that damage expensive equipment and may lead to an outage or injury. In addition, concurrently maintainable or fault tolerant designs as described in Uptime Institute’s Tier Standard make hot work unnecessary.
The pressure against hot work continues to mount. In the US, electrical contractors have begun to decline some work that involves working on energized circuits, even if an energized work permit has been created and signed by appropriate management, as required by National Fire Protection Association (NFPA) 70E (Standard for Electrical Safety in the Workplace). In addition, the US Department of Labor’s Occupational Safety and Health Administration (OSHA) has repeatedly rejected business continuity as an exception to hot work restrictions, making it harder for management to justify hot work and to find executives willing to sign the energized work permit.
OSHA statistics make clear that work on energized systems is a dangerous practice, especially for construction trades workers; installation, maintenance, and repair occupations; and grounds maintenance workers. For this reason, NFPA 70E sharply limits the situations in which organizations are allowed to work on energized equipment. Personnel safety is not the only issue; personal protective equipment (PPE) protects only workers, not equipment, so an arc flash can destroy many thousands of dollars of IT gear.
Ignoring local and national standards can be costly, too. OSHA reported 2,923 lockout/tagout and 1,528 PPE violations in 2017, among the many safety concerns it addressed that year. New minimum penalties for a single violation exceed $13,000, with top total fines for numerous, willful and repeated violations running into the millions of dollars. Wrongful death and injury suits add to the cost, and violations can lead to higher insurance premiums, too.
Participants in a recent Uptime Institute discussion roundtable agreed that the remaining firms performing work on live loads should begin preparing to end the practice. They said that senior management is often the biggest impediment to ending hot work, at least at some organizations, despite the well-known and documented risks. Executive resistance can be tied to concerns about power supplies or failure to maintain independent A/B feeds. In some cases, service level agreements contain restrictions against powering down equipment.
Despite executive resistance at some companies, the trend is clearly against hot work. By 2015, more than two-thirds of facilities operators had already eliminated the practice, according to Uptime Institute data. A tighter regulatory environment, heightened safety concerns, increased financial risk and improved equipment should combine to all but eliminate hot work in the near future. But there are still holdouts, and the practice is far more acceptable in some countries — China is an example — than in others, such as the US, where NFPA 70E severely limits the practice in all industries.
Also, hot work does not eliminate IT failure risk. Uptime Institute has been tracking data center abnormal incidents for more than 20 years, and the data shows that at least 71 failures occurred during hot work. While these failures are generally attributed to poor procedures or maintenance, a recent, more careful analysis concluded that better procedures or maintenance (or both) would have made it possible to perform the work safely — and without any failures — on de-energized systems.
The Uptime Institute abnormal incident database includes only four injury reports; all occurred during work on energized systems. In addition, the database includes 16 reports of arc flash. One occurred during normal preventive maintenance and one during an infrared scan. Neither caused injury, but the potential risk to personnel is apparent, as is the potential for equipment damage (and legal exposure).
Undoubtedly, eliminating hot work is a difficult process. One large retailer that has just begun the process expects the transition to take several years. And not all organizations succeed: Uptime Institute is aware of at least one organization in which incidents involving failed power supplies caused senior management to cancel their plan to disallow work on energized equipment.
According to several Uptime Institute Network community members, building a culture of safety is the most time-consuming part of the transition from hot work, as data centers are goal-oriented organizations, well-practiced at developing and following programs to identify and eliminate risk.
It is not necessary or even prudent to eliminate all hot work at once. The IT team can help slowly retire the practice by eliminating the most dangerous hot work first, building experience on less critical loads, or reducing the number of circuits affected at any one time. To prevent common failures when de-energizing servers, the Operations team can increase scrutiny on power supplies and ensure that dual-corded servers are properly fed.
In early data centers, the practice of hot work was understandable — necessary, even. However, Uptime Institute has long advocated against hot work. Modern equipment and higher resiliency architectures based on dual-corded servers make it possible to switch power feeds in the case of an electrical equipment failure. These advances not only improve data center availability, they also make it possible to isolate equipment for maintenance purposes.
COVID-19: Q&A (Part 2): Site Sanitation and Security
/in Executive, Operations/by Rhonda Ascierto, Vice President, Research, Uptime InstituteHere is Part 2 of our Q&A regarding digital infrastructure considerations during the COVID19 crisis. Keep in mind that we are all handling this crisis in varied ways, and learning from each other along the way. In that process, we really ARE all finding our own “New Normal” and ultimately we will come out of this crisis stronger due to the renewed focus on operational effectiveness, risk avoidance and contingency planning.
As part of our infrastructure community leadership, we are regularly adding new materials, guidance and recommendations to help digital infrastructure owners and operators in these times.
We’ve conducted a number of webinars attended by thousands of listeners where we shared first-hand COVID19 experience. Below is Part 2 of the Q&A responses brought up during these webinars. These questions deal with Site Sanitation and Security. (Part 1 dealt with Staffing, and future published info will focus on Deferred Maintenance, Tier Standard, Remote Working, Supply Chain and Long-Term expectations.
SITE SANITIZATION
Q: If an average of 50 people per day enter the data center, how often is filter change recommended? What parameters do I use to make that change?
A: Under normal operating scenarios, filter changes are typically triggered by an increased pressure differential measured across the filter. As the filter clogs with debris and particulate, pressure drop will increase. As far as operations during COVID-19, there does not appear to be any reasons for a change to the outlook on filter changes, including what should trigger a filter change, based on information currently available on how the virus spreads (although information seems to be changing over time.).
Q: What types of access controls or filters are recommended to implement? Are there air conditioning filters on the market that limit the circulation of viruses in the data center?
A: Currently, it is not believed that filters will play a large part in mitigating spread of the virus in data centers. Some research suggests that high-end filters (for example, high efficiency particulate air filters) are capable of filtering particles the size of COVID-19. However, based on current information on how the virus spreads, it is not generally believed that it spreads in a true airborne manner to the point where it gets in ventilation systems; it spreads via proximity to infected people sneezing, coughing, etc. Data centers are unique in that they have multiple air changes per minute, which is different from other facility types. It is important to note that while filtering could theoretically reduce spread, the National Air Filtration Association does not believe this will happen from a practical perspective.
Q: My concern is that, in a closed-loop environment, if COVID-19 enters the data center, the virus will live and could infect more people.
A: That is an accurate statement. That is also why it is important to take any reasonable steps possible to keep it out of the data center via strong site-access control requirements and checks. It is also why many data center operators are implementing regular disinfecting so that if it is in the data center, spread is mitigated. Uptime Institute is aware that many operators have found specialized data center cleaning companies that are capable of disinfecting sites in accordance with guidelines for this pandemic from the WHO (World Health Organization) and/or the CDC (US Centers for Disease Control and Prevention).
Q: How long can the airborne virus particle last in a data center because it is cold air?
A: According to the WHO: “It is not certain how long the virus that causes COVID-19 survives on surfaces, but it seems to behave like other coronaviruses. Studies suggest that coronaviruses (including preliminary information on the COVID-19 virus) may persist on surfaces for a few hours or up to several days. This may vary under different conditions (e.g., type of surface, temperature or humidity of the environment).” We believe that this is an area that is still being studied. There does not appear to be any consensus, other than there are a number of factors, including temperature and humidity, that can impact this.
Q: Taking into consideration that the virus lasts for an incubation period, would it be possible to bring contamination into the data center?
A: Based on the information presented by the WHO and CDC, it is likely that the virus can be introduced and that there can be contamination in the data center. The most likely vector for transmission is infected individuals who do not know they are infected. Please refer to the WHO, the CDC and/or your local authority for more information.
Q: Is there any type of clothing, masks or gloves that is recommended for access to the data center by customers or suppliers so as not to expose our staff? Is it more feasible for staff to carry this type of PPE or to demand it from the customers or suppliers?
A: The CDC is recommending the N95 mask. Other masks do not seal tightly around the nose and mouth to provide proper protection. To be fully effective it must be fitted properly. Specialists receive training annually on how to properly fit these respirators around the nose, cheeks and chin, ensuring that wearers don’t breathe around the edges of the respirator. When you do that, it turns out that the work of breathing, since you’re going through a very thick material, is harder. You have to work to breathe in and out. All personnel accessing the data center should be wearing PPE in accordance with the current policies related to the COVID-19 pandemic and future similar events.
It is important to note that we are seeing companies implement their pandemic plans, which to a large extent includes limiting site access to customers and employees. The pandemic plans vary between companies, but we are hearing of restrictions being implemented to include the use of masks, gloves, etc., primarily to follow CDC guidelines. Our suggestion is to follow CDC guidelines, as well as follow your approved pandemic plan. Please note that sanitizing methods may be a better thing to focus on than the use of masks. Also, while masks would provide an additional layer of protection, until production of masks ramps up, Uptime Institute is now somewhat cautious about recommending data center owners and operators stock up on masks. The “good” masks are in short supply and should be allocated to healthcare professionals until there is sufficient supply to go around.
Q: If we decide to sanitize [our data center], the fire system detectors can go off. What do you recommend?
A: It is common during various housekeeping operations to put the fire system into bypass. This is especially important if there is a VESDA (very early smoke detection apparatus) system present, which can be triggered by disturbances of even very small particulate (they are specifically designed to be highly sensitive). Our recommendation is to put the fire system in bypass while maintaining compliance with local jurisdictional requirements. This may require fire watch or similar measures be taken while the system is in bypass.
SITE SECURITY
Q: Insecurity is likely to increase in different regions, do you recommend increasing security?
A: Certainly, in facilities operating in severely affected areas the level of security risk could be affected. In these areas, management must adopt enhanced security policies, including prescreening all scheduled visitors before arrival on-site; prohibiting all unscheduled visitors; and if possible and applicable, creating a separate, secure entrance for all parties involved in essential on-site construction projects and establishing a policy that they (or any other visitors) are not allowed to interact with duty operations personnel.
COVID-19: Q&A (Part 1): Staff Management
/in Executive, Operations/by Rhonda Ascierto, Vice President, Research, Uptime InstituteWe are all in this together. We are all finding our own “New Normal”. We are all learning from each other and will come out of this crisis stronger due to the renewed focus on operational effectiveness, risk avoidance and contingency planning.
To this end, Uptime Institute has been creating many types of guidance and recommendations to help digital infrastructure owners and operators ride through the COVID19 crisis. We want to help the entire community deal with their present and prepare for their future, where the New Normal will be a way of life.
We’ve created COVID19 operational guideline reports based on our 25 years of data center risk management experience, published associated real-time updates and bulletins, and conducted a number of webinars attended by thousands of listeners where we shared first-hand COVID19 experience. As a community, it’s clear we are all looking for each other’s support, guidance and experience.
As part of these webinars, we have published a series of digital infrastructure Q&A documents, focused on various categories of COVID19 specific topics: Staff Management, Site Sanitation, Site Security, Deferred Maintenance, Tier Standard, Remote Working, Supply Chain and Long-Term expectations. Presented below is PART 1 of these Q&A topics, covering Staff Management during COVID19.
STAFF MANAGEMENT
Q: What is your view on continuous, long shifts for four or seven days in the data center?
A: It is not best practice for operations to change to continuous shifts from four days to seven days, as fatigue and stress will increase the human risk factor that can cause abnormal incidents. Instead, we recommend assessing extending the shift time from 8 hours to 12 hours and limiting this to a maximum of two or three consecutive days. Any extended continuous shifts should include long regular breaks each shift to avoid fatigue. There needs to be a careful balance between the increased risk of human fatigue and the mitigated risk of virus spread. Managers should also consider that the total hours worked per person does not increase, that overtime will not be over 10%, and that the shifts are arranged so that staff can rest adequately between shifts.
Q: What procedures should we follow to identify an infected staff member?
A: As described in our report COVID-19: Minimizing critical facility risk, we recommend contact tracing systems. Register the health information and location of your organization’s personnel, suppliers’ personnel and other related personnel every day to monitor possible exposure to the virus and/or any symptoms (including those of the common cold). We recommend prescreening all scheduled visitors before they arrive on-site, including sending a questionnaire via email 48 hours prior to their visit. Require completion of the questionnaire before the appointment is confirmed. Verify that all answers remain unchanged upon arrival and institute temperature checks using noncontact thermometers before entry to the facility.
For a confirmed COVID-19 case at the site, we recommend that cleaning personnel use bio-hazard suits, gloves, shoe coverings, etc. and that all personal protective equipment (PPE) is bagged and removed from the site once cleaning is complete. With or without a confirmed case at the side, ensure the availability of PPE, including masks, gloves and hazardous materials or hazmat suits. Depending on the appropriate medical or management advice, workers should use masks during shift turnover. Training pairs (e.g., senior engineer and trainee) must wear masks at all times.
For further information, please refer to our report COVID-19: Minimizing critical facility risk.
Q: How feasible is it to move families to the data center?
A: Although housing staff on-site should be considered only as a last resort, regions could go into lockdown mid-shift, so you may need to prepare for that eventuality. There are disaster recovery plans that include providing accommodation for several family members for up to 2 weeks to avoid traveling to and from the data center. While the data center is perceived as a controlled-access space, it is not a safe space. Therefore, any organization considering this option should also consider offering a specialized training program for family members that includes awareness of the hazards and the associated risks, emergency evacuation procedures, etc.
Q: Is it always recommended to keep personnel 24×7 for Tier III and Tier IV data centers?
A: Yes, it is a required criteria of the Uptime Institute Tier Certification of Operational Sustainability to have a minimum of one 24-hour, 7-day-a-week qualified staff presence (full-time employee) for Tier III data centers per shift and a minimum of two 24-hr/7-day-a-week staff presence (full-time employee) per shift for Tier IV data centers.
Q: We must not forget the following considerations for staff that may need to stay at the data center for 24 hours or more: the need to prepare food; a supply of canned food for more than 40 days, as well as alkaline water and the ability to purify it by reverse osmosis in case of water contamination; and cardiopulmonary resuscitation equipment for emergencies.
A: Correct, all these initiatives are proactive and preventative. Uptime Institute’s COVID-19: Minimizing critical facility risk report provides additional information related to what measures data center management should consider for the health and safety of staff and the protection of the site.
Q: Do you recommend interviewing all internal staff to determine their personal situation, and whether this should be done by a psychologist, particularly if staff are in the data center for a long time?
A: Organizations should maintain open and continuous communication with staff, customers and relevant third parties on a daily basis or even twice daily. Briefings may be appropriate as the conditions change. We also recommend sharing news updates and links to public resources to keep staff informed of the current status of the pandemic and the best practices for maintaining a safe and healthy work environment. As appropriate for each case, emotional support should be provided to reduce stress. Special attention should be given to any changes to continuous, long shifts that could increase the risk of human error, which may cause abnormal incidents.
Are IT Infrastructure Outages Getting Longer?
/in Executive, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected]One of the findings of Uptime Institute’s recently published report Annual outage analysis 2020 is that the most serious categories of outages — those that cause a significant disruption in services — are becoming more severe and more costly. This isn’t entirely surprising: individuals and businesses alike are becoming ever more dependent on IT, and it is becoming harder to replicate or replace an IT service with a manual system.
But one of the findings raises both red flags and questions: Setting aside those that were partial and incidental (which had minimal impact), publicly reported outages over the past three years appear to be getting longer. And this, in turn, is likely to be one of the reasons why the costs and severity of outages have been rising.
The table above shows the numbers of publicly reported outages collected by Uptime Institute in the years 2017-2019, except for those that did not have a financial or customer impact or those for which there was no known cause. The figures show outages are on the rise. This is due to a number of factors, including greater deployment of IT services and better reporting. But they also show a tilt in the data towards longer outages — especially those that lasted more than 48 hours. (This is true even though one of the biggest causes of lengthy outages — ransomware — was excluded from our sample.)
The outage times reported are until full IT service recovery, not business recovery — it may take longer, for example, to move aircraft back to where they are supposed to be, or to deal with backlogs in insurance claims. This trend is not dramatic, but it is real and it is concerning, because a 48-hour interruption can be lethal for many organizations.
Why is it happening? Complexity and interdependency of IT systems and greater dependency on software and data are very likely to be big reasons. For example, Uptime’s Institute’s research shows that fewer big outages are caused by power failure in the data center and more by IT systems configuration now than in the past. While resolving facility engineering issues may not be easy, it is usually a relatively predictable matter: failures are often binary, and very often recovery processes have been drilled into the operators and spare parts are kept at hand. Software, data integrity and damaged/interrupted cross-organizational business processes, however, can be much more difficult issues to resolve or sometimes even to diagnose — and these types of failure are becoming much more common (and yes, sometimes they are triggered by a power failure). Furthermore, because failures can be partial, files may become out of sync or even be corrupted.
There are lessons to be drawn. The biggest is that the resiliency regimes that facilities staff have lived by for three decades or more need to be extended and integrated into IT and DevOps and fully supported and invested in by management. Another is that while disaster recovery may be slowly disappearing as a type of commercial backup service, the principles of vigilance, recovery, and fail over – especially when under stress – are more important than ever.
The full report Annual outage analysis 2020 is available to members of the Uptime Institute Network which can be requested here.
COVID-19: IT organizations move from planning to implementation
/in Executive, News, Operations/by Sandra VailOver the past few weeks, Uptime Institute held multiple customer roundtables to discuss the impact of the COVID-19 virus on data center operations and potential operational responses to its spread. We gathered our communities insights and best practices, we combined with our own 25 years worth of infrastructure operational management knowledge and we are now making this information available freely to the data center industry. HERE.
A little background to get you started right away….
Dozens of organizations were represented at these roundtables, which was open to a global audience of Uptime Institute Network membership. What we learned is that while most organizations have a plan for foreseen emergency situations, few have one specific to a global pandemic. As a result, many have been hurrily modifying existing plans based on gut feel and good intentions: creating tiered response levels, identifying events that would trigger the next level of response, researching concerns specific to a pandemic (e.g., what does “deep cleaning” mean in a white space, and what are the implications for different data center environments — raised floors, multi-tenant data centers, mixed-use facilities, etc.?).
But this is clearly unchartered territory for ALL of us. For many organizations, the Human Resources and/or Environmental Health & Safety department(s) take the lead in generating an organization-wide response plan, and specific business units, such as data center operations, incorporate that guidance into a plan tailored to their business mission and setting. Because many organizations have data centers in multiple regions, responses may vary by location or facility characteristics. A sample but very broad Emergency Response Plan by the US government’s FDA (with portions pertaining to the delivery of IT services contained within) can be seen here.
But immediately actionable tangible advice goes a long way in times like these. Several participants mentioned that their facilities now screen all potential visitors with a questionnaire. They do not admit anyone who reports symptoms (personally or in family members) or who has traveled recently to areas with high levels of COVID-19 cases. Some repondants advised that nn additional measure of their security involves prescreening all scheduled visitors: Send the visitor the questionnaire via email 4-8 hours prior to their visit and require completion before the appointment is confirmed. Only permit entry if the questionnaire indicates a low probability of infection (confirm all answers remain unchanged upon arrival) and prohibit unscheduled visitors altogether.
Some facilities – for example, multi-tenant data centers or mixed-use facilities – have a higher volume of visitors, and thus greater potential for COVID-19 spread. To avoid inconvenience and potential client dissatisfaction, be proactive: Inform all affected parties of the COVID-19 preparedness plan in place and its impact on their access to the facility in advance.
Sanitization is a particular challenge, with several participants reporting disinfectant/hand sanitizer shortages. Many had questions specific to deep cleaning the white space environment, given its high rate/volume of air exchange, highly specialized electronic equipment and possible raised floor configuration. Spray techniques are more effective than simply wiping surfaces with disinfectant solutions, as the antiseptic mist coats surfaces for a longer period. Many organizations are hiring specialist cleaning firms and/or following CDC recommendations for disinfection.
As COVID-19 spreads, more organizations are shifting their energy from academically tweaking their written response plans to implementing them. In many companies, that decision is made by a business unit, based on the site environment, the number of COVID-19 cases in the area and government-mandated restrictions. Mission-critical facilities have a particular remit, though, and so need to create and implement plans specific to their business mission.
Good preparation simplifies decision-making. Roundtable participants suggest the following:
It's no longer a question of whether a plan like this will be needed; we know it is. Most facility operators need to quickly craft and then implement their response plan, and learn from this incident for the future.
Uptime Institute has created a number of resources and will be providing specific guidance regarding the COVID-19 situation here.
Pay-as-you-go model spreads to critical components
By Rhonda Ascierto, Vice President, Research, Uptime Institute

As enterprises continue to move from a focus on capital expenditures to operating expenditures, more data center components will also be consumed on a pay-as-you-go, "as a service" basis.
“-aaS” goes mainstream
The trend toward everything “as a service” (XaaS) is now mainstream in IT, ranging from cloud (infrastructure-aaS) and software-aaS (SaaS) to newer offerings, such as bare metal-aaS, container-aaS, and artificial intelligence-aaS (AI-aaS). At the IT level, service providers are winning over more clients to the service-based approach by reducing capital expenditures (capex) in favor of operational expenditures (opex), by offering better products, and by investing heavily to improve security and compliance. More organizations are now willing to trust them.
But this change is not confined to IT: a similar trend is underway in data centers.
Why buy and not build?
While the cost to build new data centers is generally falling, driven partly by the availability of more prefabricated components, enterprise operators have been increasingly competing against lower-cost options to host their IT — notably colocation, cloud and SaaS.
Cost is rarely the biggest motivation for moving to cloud, but it is a factor. Large cloud providers continue to build and operate data centers at scale and enjoy the proportional cost savings as well as the fruits of intense value engineering. They also spread costs among customers and tend to have much higher utilization rates compared with other data centers. And, of course, they invest in innovative, leading-edge IT tools that can be rolled out almost instantly. This all adds up to ever-improving IT and infrastructure services from cloud providers that are cheaper (and often better) than using or developing equivalent services based in a smaller-scale enterprise data center.
Many organizations have now come to view data center ownership as a big capital risk — one that only some want to take. Even when it's cheaper to deliver IT from their own "on-premises" data center, the risks of early data center obsolescence, under-utilization, technical noncompliance or unexpected technological or local problems are all factors. And, of course, most businesses want to avoid a big capital outlay: Our research shows that, in 2017, the total cost of ownership of an "average" concurrently maintainable 3 megawatt (MW) enterprise data center amortized over 15 years was about $90 million, and that roughly half of the cost is invested in three installments over the first six years, assuming a typical phased build and bricks-and-mortar construction.
This represents a significant amount of risk. To be economically viable, the enterprise must typically operate a facility at a high level of utilization — yet forecasting future data center capacity remains enterprises’ top challenge, according to our research.
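The sketch below works through the arithmetic using the figures quoted above (roughly $90 million over 15 years for a 3 MW site) to show why utilization is the lever that matters. The utilization scenarios are illustrative assumptions, not survey data.

```python
# Back-of-the-envelope sketch: how utilization drives the effective cost of an
# owned facility, using the approximate figures cited above.
TOTAL_TCO_USD = 90_000_000      # ~15-year total cost of ownership
LIFETIME_YEARS = 15
DESIGN_CAPACITY_KW = 3_000      # 3 MW concurrently maintainable facility

annual_cost = TOTAL_TCO_USD / LIFETIME_YEARS            # ~$6.0M per year
cost_per_design_kw = annual_cost / DESIGN_CAPACITY_KW   # ~$2,000 per kW-year

for utilization in (0.9, 0.6, 0.3):   # illustrative utilization scenarios
    used_kw = DESIGN_CAPACITY_KW * utilization
    print(f"{utilization:.0%} utilized: "
          f"${annual_cost / used_kw:,.0f} per deployed kW per year")

# At 30% utilization the same facility costs roughly three times as much per
# deployed kW as it does at 90%, which is why capacity forecasting matters.
```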
Demand for enterprise data centers remains sizable, in spite of the alternatives. Many enterprises with smaller data centers are closing them and consolidating into premium, often larger, centralized data centers and outsourcing as much else as possible.
The appeal of the cloud will continue to convince executives and drive strategy. Increasingly, public cloud is an alternative way to deliver workloads faster and cheaper without having to build additional on-premise capacity. Scalability, portability, reduced risk, better tools, high levels of resiliency, infrastructure avoidance and fewer staff requirements are other key drivers for cloud adoption. Innovation and access to leading-edge IT will likely be bigger factors in the future, as will more cloud-first remits from upper management.
Colocation, including sale leasebacks
Although rarely thought of in this way, colocation is the most widely used "data center-aaS" offering today. Sale of enterprise data centers to colos, with leaseback to the enterprise, is also becoming more common, a trend that will continue to build (see UII Note 38: Capital inflow boosts the data center market).
Colo interconnection services will attract even more businesses. More will likely seek to lease space in the same facility as their cloud or other third-party service provider, enabling lower latency, lower costs and better security for third-party services, such as storage-aaS and disaster recovery-aaS.
While more enterprise IT is moving to colos and managed services (whether or not it is cloud), enterprise data centers will not disappear. More than 600 IT and data center managers told Uptime Institute that, in 2021, about half of all workloads will still be in enterprise data centers, and only 18% of workloads in public cloud/SaaS.
Other “as a service” trends in data centers
Data center monitoring and analysis is another relatively new example of a pay-as-you-go service. Introduced in late 2016, data center management as a service is a big data-driven cloud service that provides customized analysis and is paid for on a recurring basis. The move to a pay-as-you-go service has helped unlock the data center infrastructure management market, which was struggling for growth because of costs and complexity.
Energy backup and generation is another area to watch. Suppliers have introduced various pay-as-you-go models for their equipment. These include leased fuel cells owned by the supplier (notably Bloom Energy), which charges customers only for the energy produced. By eliminating the client's risk and capital outlay, this model can make the supplier's sale easier (although the supplier has to wait to be paid). Some suppliers have ventured into UPS-aaS, but with limited success to date.
More alternatives to ownership are likely for data center electrical assets, such as batteries. Given the high and fast rate of innovation in the technology, leasing large-scale battery installations delivers the capacity and innovation benefits without the risks.
It’s also likely that more large data centers will use energy service companies (ESCOs) to produce, manage and deliver energy from renewable microgrids. Demand for green energy, for energy security (that is, energy produced off-grid) and energy-price stability is growing; ESCOs can deliver all this for dedicated customers that sign long-term energy-purchase agreements but don’t have the capital required to build or the expertise necessary to run a green microgrid.
Demand for enterprise data centers will continue but alongside the use of more cloud and more colo. More will be consumed “as a service,” ranging from data center monitoring to renewable energy from nearby dedicated microgrids.
The full report Ten data center industry trends in 2020 is available to members of the Uptime Institute Network. Membership information can be found here.
Phasing Out Data Center Hot Work
By Kevin Heslin

Despite years of discussion, warnings and strict regulations in some countries, hot work remains a contentious issue in the data center industry. Hot work is the practice of working on energized electrical circuits (voltage limits differ regionally), and it is usually done, in spite of the risks, to reduce the possibility of a downtime incident during maintenance.
Uptime Institute advises against hot work in almost all instances. The safety concerns are just too great, and data suggests work on energized circuits may — at best — only reduce the number of manageable incidents, while increasing the risk of arc flash and other events that damage expensive equipment and may lead to an outage or injury. In addition, concurrently maintainable or fault tolerant designs as described in Uptime Institute’s Tier Standard make hot work unnecessary.
The pressure against hot work continues to mount. In the US, electrical contractors have begun to decline some work that involves working on energized circuits, even if an energized work permit has been created and signed by appropriate management, as required by National Fire Protection Association (NFPA) 70E (Standard for Electrical Safety in the Workplace). In addition, the US Department of Labor's Occupational Safety and Health Administration (OSHA) has repeatedly rejected business continuity as an exception to hot work restrictions, making it harder for management to justify hot work and to find executives willing to sign the energized work permit.
OSHA statistics make clear that work on energized systems is a dangerous practice, especially for construction trades workers; installation, maintenance, and repair occupations; and grounds maintenance workers. For this reason, NFPA 70E sharply limits the situations in which organizations are allowed to work on energized equipment. Personnel safety is not the only issue; personal protective equipment (PPE) protects only workers, not equipment, so an arc flash can destroy many thousands of dollars of IT gear.
Ignoring local and national standards can be costly, too. OSHA reported 2,923 lockout/tagout and 1,528 PPE violations in 2017, among the many safety concerns it addressed that year. New minimum penalties for a single violation exceed $13,000, with top total fines for numerous, willful and repeated violations running into the millions of dollars. Wrongful death and injury suits add to the cost, and violations can lead to higher insurance premiums, too.
Participants in a recent Uptime Institute discussion roundtable agreed that the remaining firms performing work on live loads should begin preparing to end the practice. They said that senior management is often the biggest impediment to ending hot work, at least at some organizations, despite the well-known and documented risks. Executive resistance can be tied to concerns about power supplies or failure to maintain independent A/B feeds. In some cases, service level agreements contain restrictions against powering down equipment.
Despite executive resistance at some companies, the trend is clearly against hot work. By 2015, more than two-thirds of facilities operators had already eliminated the practice, according to Uptime Institute data. A tighter regulatory environment, heightened safety concerns, increased financial risk and improved equipment should combine to all but eliminate hot work in the near future. But there are still holdouts, and the practice is far more acceptable in some countries — China is an example — than in others, such as the US, where NFPA 70E severely limits the practice in all industries.
Also, hot work does not eliminate IT failure risk. Uptime Institute has been tracking abnormal data center incidents for more than 20 years, and the data shows that at least 71 failures occurred during hot work. While these failures are generally attributed to poor procedures or maintenance, a recent, more careful analysis concluded that better procedures or maintenance (or both) would have made it possible to perform the work safely, and without any failures, on de-energized systems.
The Uptime Institute abnormal incident database includes only four injury reports; all occurred during work on energized systems. In addition, the database includes 16 reports of arc flash. One occurred during normal preventive maintenance and one during an infrared scan. Neither caused injury, but the potential risk to personnel is apparent, as is the potential for equipment damage (and legal exposure).
Undoubtedly, eliminating hot work is a difficult process. One large retailer that has just begun the process expects the transition to take several years. And not all organizations succeed: Uptime Institute is aware of at least one organization in which incidents involving failed power supplies caused senior management to cancel their plan to disallow work on energized equipment.
According to several Uptime Institute Network community members, building a culture of safety is the most time-consuming part of the transition from hot work, as data centers are goal-oriented organizations, well-practiced at developing and following programs to identify and eliminate risk.
It is not necessary or even prudent to eliminate all hot work at once. The IT team can help slowly retire the practice by eliminating the most dangerous hot work first, building experience on less critical loads, or reducing the number of circuits affected at any one time. To prevent common failures when de-energizing servers, the Operations team can increase scrutiny on power supplies and ensure that dual-corded servers are properly fed.
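As one way of picturing that pre-work scrutiny, the sketch below checks a simple inventory to confirm that every server on an affected circuit is dual-corded and drawing from the other feed before one feed is de-energized. The inventory structure and field names are hypothetical, purely for illustration, and are not an Uptime Institute procedure.

```python
# Minimal sketch of a pre-work check before de-energizing one power feed:
# confirm every server on the affected circuit can ride through on the other feed.
from typing import NamedTuple

class Server(NamedTuple):
    name: str
    dual_corded: bool
    feed_a_ok: bool   # power supply on the A feed present and healthy
    feed_b_ok: bool   # power supply on the B feed present and healthy

def at_risk_if_deenergized(servers: list[Server], feed: str) -> list[str]:
    """Return names of servers that would lose power if 'feed' ('A' or 'B') is
    de-energized; an empty list means the work can proceed on dead circuits."""
    at_risk = []
    for s in servers:
        other_feed_ok = s.feed_b_ok if feed == "A" else s.feed_a_ok
        if not (s.dual_corded and other_feed_ok):
            at_risk.append(s.name)
    return at_risk

inventory = [
    Server("db-01", dual_corded=True, feed_a_ok=True, feed_b_ok=True),
    Server("app-07", dual_corded=True, feed_a_ok=True, feed_b_ok=False),   # B supply failed
    Server("legacy-02", dual_corded=False, feed_a_ok=True, feed_b_ok=False),
]
print(at_risk_if_deenergized(inventory, feed="A"))  # ['app-07', 'legacy-02']
```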
In early data centers, the practice of hot work was understandable — necessary, even. However, Uptime Institute has long advocated against hot work. Modern equipment and higher resiliency architectures based on dual-corded servers make it possible to switch power feeds in the case of an electrical equipment failure. These advances not only improve data center availability, they also make it possible to isolate equipment for maintenance purposes.