Meeting OSHA and NFPA 70E arc flash safety requirements while balancing prevention and production demands
By Ed Rafter
Uninterruptible uptime, 24 x 7, zero downtime…these are some of the terms that characterize data center business goals for IT clients. Given these demands, facility managers and technicians in the industry are skilled at managing the infrastructure that supports these goals, including essential electrical and mechanical systems that are paramount to maintaining the availability of business-critical systems.
Electrical accidents such as arc flash occur all too often in facility environments that have high-energy use requirements, a multitude of high-voltage electrical systems and components, and frequent maintenance and equipment installation activities. A series of stringent standards with limited published exceptions govern work on these systems and associated equipment. The U.S. Occupational Safety and Health Administration (OSHA) and National Fire Protection Association (NFPA) Standard 70E set safety and operating requirements to prevent arc flash and electric shock accidents in the workplace. Many other countries have similar regulatory requirements for electrical safety in the workplace.
When these accidents occur they can derail operations and cause serious harm to workers and equipment. Costs to businesses can include lost work time, downtime, OSHA investigation, fines, medical costs, litigation, lost business, equipment damage, and most tragically, loss of life. According to the Workplace Safety Awareness Council (WPSAC), the average cost of hospitalization for electrical accidents is US$750,000, with many exceeding US$1,000,000.
There are reasonable steps data center operators can—and must—take to ensure the safety of personnel, facilities, and equipment. These steps offer a threefold benefit: the same measures taken to protect personnel also serve to protect infrastructure, and thus protect data center operations.
Across all industries, many accidents are caused by basic mistakes, for example, electrical workers not being properly prepared, working on opened equipment that was not well understood, or magnifying risks through a lack of due diligence. Data center operators, however, are already attuned to the discipline and planning it takes to run and maintain high-availability environments.
While complying with OSHA and NFPA 70E requirements may seem daunting at first, the maintenance and operating standards in place at many data centers enable this industry to effectively meet the challenge of adhering to these mandates. The performance and rigor required to maintain 24 x 7 reliability means the gap between current industry practices and the requirements of these regulatory standards is smaller than it might at first appear, allowing data centers to balance safety with the demands of mission critical production environments.
In this article we describe arc flash and electrical safety issues, provide an overview of the essential measures data centers must follow to meet OSHA and NFPA 70E requirements, and discuss how many of the existing operational practices and adherence to Tier Standards already places many data centers well along the road to compliance.
Figure 1. An arc flash explosion demonstration. Source: Open Electrical
UNDERSTANDING ARC FLASH
Arc flash is a discharge of electrical energy characterized by an explosion that generates light, noise, shockwave, and heat. OSHA defines it as “a phenomenon where a flashover of electric current leaves its intended path and travels through the air from one conductor to another, or to ground (see Figure 1). The results are often violent and when a human is in close proximity to the arc flash, serious injury and even death can occur.” The resulting radiation and shrapnel can cause severe skin burns and eye injuries, and pressure waves can have enough explosive force to propel people and objects across a room and cause lung and hearing damage. OSHA reports that up to 80% of all “qualified” electrical worker injuries and fatalities are not due to shock (electrical current passing through the body) but to external burn injuries caused by the intense radiant heat and energy of an arc fault/arc blast.1
An arc flash results from an arcing electrical fault, which can be caused by dust particles in the air, moisture condensation or corrosion on electrical/mechanical components, material failure, or by human factors such as improper electrical system design, faulty installation, negligent maintenance procedures, dropped tools, or accidentally touching a live electrical circuit. In short, there are numerous opportunities for arc flash to occur in industrial settings, especially those in which there is inconsistency or a lack of adherence to rigorous maintenance, training, and operating procedures.
Variables that affect the power of an arc flash include amperage, voltage, the distance of the arc gap, closure time, three-phase vs. single-phase circuits, and confinement in an enclosed space. The power of the arc at the flash location, the distance of the worker from the arc, and the duration of exposure all affect the extent of skin damage. The WPSAC reports that fatal burns can occur even at distances greater than 10 feet (ft) from an arc location; in fact, serious injury and fatalities can occur up to 20 ft away. The majority of hospital admissions for electrical accidents are due to arc flash burns, with 30,000 arc incidents and 7,000 people suffering burn injuries per year, 2,000 of whom require admission to burn centers with severe arc flash burns.2
The severity of an arc flash incident is determined by several factors, including temperature, the available fault current, and the time for a circuit to break. The total clearing time of the overcurrent protective device during a fault is not necessarily linear, as lower fault currents can sometimes result in a breaker or fuse taking longer to clear, thus extending the arc duration and thereby raising the arc flash energy.
Unlike the bolted fault (in which high current flows through a solid conductive material typically tripping a circuit breaker or protective device), an arcing fault uses ionized air as a conductor, with current jumping a gap between two conductive objects. The cause of the fault normally burns away during the initial flash, but a highly conductive, intensely hot plasma arc established by the initial arc sustains the event. Arc flash temperatures can easily reach 14,000–16,000°F (7,760–8,870°C) with some projections as high as 35,000°F (19,400°C)—more than three times hotter than the surface of the sun.
These temperatures can be reached by an arc fault event in as little as a few seconds or even a few cycles. The heat generated by the high current flow may melt or vaporize the conductive material and create an arc characterized by a brilliant flash, intense heat, and a fast-moving pressure wave that propels the arcing products. The pressure of an arc blast [up to 2,000 pounds/square foot (9765 kilograms/square meter)] is due to the expansion of the metal as it vaporizes and the heating of the air by the arc. This accounts for the expulsion of molten metal up to 10 ft away. Given these extremes of heat and energy, arc flashes often cause fires, which can rapidly spread through a facility.
INDUSTRY STANDARDS AND REGULATIONS
To prevent these kinds of accidents and injuries, it is imperative that data center operators understand and follow appropriate safety standards for working with electrical equipment. Both the NFPA and OSHA have established standards and regulations that help protect workers against electrical hazards and prevent electrical accidents in the workplace.
OSHA is a federal agency (part of the U.S. Department of Labor) that ensures safe and healthy working conditions for Americans by enforcing standards and providing workplace safety training. OSHA 29 CFR Part 1910, Subpart S and OSHA 29 CFR Part 1926, Subpart K include requirements for electrical installation, equipment, safety-related work practices, and maintenance for general industry and construction workplaces, including data centers.
NFPA 70E is a set of detailed standards (issued at the request of OSHA and updated periodically) that address electrical safety in the workplace. It covers safe work practices associated with electrical tasks and safe work practices for performing other non-electrical tasks that may expose an employee to electrical hazards. OSHA revised its electrical standard to reference NFPA 70E-2000 and continues to recognize NFPA 70E today.
The OSHA standard outlines prevention and control measures for hazardous energies including electrical, mechanical, hydraulic, pneumatic, chemical, thermal, and other energy sources. OSHA requires that facilities:
• Provide and be able to demonstrate a safety program with defined responsibilities.
• Calculate the degree of arc flash hazard.
• Use correct personal protective equipment (PPE) for workers.
• Train workers on the hazards of arc flash.
• Use appropriate tools for safe working.
• Provide warning labels on equipment.
NFPA 70E further defines “electrically safe work conditions” to mean that equipment is not and cannot be energized. To ensure these conditions, personnel must identify all power sources, interrupt the load and disconnect power, visually verify that a disconnect has opened the circuit, lock out and tag the circuit, test for absence of voltage, and ground all power conductors, if necessary.
LOCKOUT/TAGOUT
Most data center technicians will be familiar with lockout and tagging procedures for disabling machinery or equipment. A single qualified individual should be responsible for de-energizing one set of conditions (unqualified personnel should never perform lockout/tagout, work on energized equipment, or enter high risk areas). An appropriate lockout or tagout device should be affixed to the de-energized equipment identifying the responsible individual (see Figure 2).
Figure 2. Equipment lockout/tagout
OVERVIEW: WORKING ON ENERGIZED EQUIPMENT
As the WPSAC states, “the most effective and foolproof way to eliminate the risk of electrical shock or arc flash is to simply de-energize the equipment.” However, both NFPA 70E and OSHA clarify that working “hot” (on live, energized systems) is allowed, subject to set safety limits on voltage exposure, work zone boundary requirements, and other prescribed measures. Required safety elements include personnel qualifications, hazard analysis, protective boundaries, and the use of PPE by workers.
Only qualified persons should work on electrical conductors or circuit parts that have not been put into an electrically safe work condition. A qualified person is one who has received training in and possesses skills and knowledge in the construction and operation of electric equipment and installations and the hazards involved with this type of work. Knowledge or training should encompass the skill to distinguish exposed live parts from other parts of electric equipment, determine the nominal voltage of exposed live parts, and calculate the necessary clearance distances and the corresponding voltages to which a worker will be exposed.
An arc flash hazard analysis must be conducted for any work to determine the appropriate arc flash boundary, the incident energy at the working distance, and the necessary protective equipment for the task. Arc flash energy is measured in thermal units of calories per square centimeter (cal/cm2), and the result of an arc flash analysis is referred to as the incident energy of the circuit. Incident energy is both radiant and convective. It is inversely proportional to the square of the working distance and directly proportional to the duration of the arc and to the available bolted fault current. Time has a greater effect on intensity than the available bolted fault current.
The incident energy and flash protection boundary are both calculated in an arc flash hazard analysis. There are two calculation methods, one outlined in NFPA 70E-2012 Annex D and the other in Institute of Electrical and Electronics Engineers (IEEE) Standard 1584.
In practice, calculating the arc flash (incident energy) at a location requires knowing the available fault current and the time it takes for the upstream protective device to trip. A data center should model the distribution system in a software program such as SKM Power System Analysis, calculate the short circuit fault current levels, and use the settings of the protective devices feeding switchboards, panelboards, industrial control panels, and motor control centers to determine the incident energy level.
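For readers who want a feel for how these variables interact, the following is a minimal sketch, not an engineering tool: it applies the Lee method as reproduced in NFPA 70E Annex D (and in IEEE 1584) to estimate incident energy at a working distance. The 2.142 x 10^6 constant, the unit conversions, and the example circuit values are assumptions to verify against the current editions of the standards; an actual study should rely on modeled fault currents and real device clearing times, as described above.

```python
# A minimal sketch, not an engineering tool: it applies the Lee method as
# reproduced in NFPA 70E Annex D / IEEE 1584 to show how incident energy
# scales with fault current, clearing time, and working distance. The
# 2.142e6 constant and the example inputs below are assumptions to verify
# against the current editions of the standards; real studies should use
# modeled fault currents and actual protective device clearing times.
J_PER_CAL = 4.184  # joules per calorie

def incident_energy_cal_cm2(voltage_kv: float, bolted_fault_ka: float,
                            clearing_time_s: float, distance_mm: float) -> float:
    """Estimated incident energy (cal/cm2) at the working distance.

    Energy rises with bolted fault current and clearing time, and falls
    with the square of the distance from the arc.
    """
    e_j_cm2 = (2.142e6 * voltage_kv * bolted_fault_ka
               * clearing_time_s / distance_mm**2)
    return e_j_cm2 / J_PER_CAL

if __name__ == "__main__":
    # Hypothetical 480 V board: 30 kA bolted fault, 0.1 s clearing time,
    # 455 mm (18 in.) working distance -> roughly 3.6 cal/cm2.
    print(f"{incident_energy_cal_cm2(0.48, 30.0, 0.1, 455.0):.1f} cal/cm2")
```

With the hypothetical inputs shown in the code, the estimate comes out to a few calories per square centimeter, which illustrates why clearing time and working distance matter as much as the available fault current.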
BOUNDARIES
NFPA has defined several protection boundaries: Limited Approach, Restricted, and Prohibited. The intent of NFPA 70E regarding arc flash is to provide guidelines that will limit injury to the onset of second degree burns. Where these boundaries are drawn for any specific task is based on the employee’s level of training, the use of PPE, and the voltage of the energized equipment (see Figure 3).
Figure 3. Protection boundaries. Source: Open Electrical
The closer a worker approaches an exposed, energized conductor or circuit part the greater the chance of inadvertent contact and the more severe the injury that an arc flash is likely to cause that person. When an energized conductor is exposed, the worker may not approach closer than the flash boundary without wearing appropriate personal protective clothing and PPE.
IEEE defines Flash Protection Boundary as “an approach limit at a distance from live parts operating at 50 V or more that are un-insulated or exposed within which a person could receive a second degree burn.” NFPA defines approach boundaries and workspaces as shown in Figure 4. See the sidebar Protection Boundary Definitions.
Figure 4. PPE: typical arc flash suit. Source: Open Electrical
Calculating the specific boundaries for any given piece of machinery, equipment, or electrical component can be done using a variety of methods, including referencing NFPA tables (easiest to do but the least accurate) or using established formulas, an approach calculator tool (provided by IEEE), or one of the software packages available for this purpose.
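As a companion to the incident energy sketch above, the hedged example below estimates the flash protection boundary, i.e., the distance at which the incident energy falls to the 1.2 cal/cm2 second-degree-burn threshold described in the sidebar. The Lee-method constant and the sample inputs are illustrative assumptions, not a substitute for the NFPA tables, the IEEE calculator tool, or the commercial software packages mentioned above.

```python
# A minimal sketch, assuming the same Lee-method relationship used in the
# incident energy example: the flash protection boundary is the distance
# at which incident energy falls to the 1.2 cal/cm2 threshold cited in the
# sidebar. The constant and sample inputs are illustrative assumptions,
# not a substitute for the NFPA tables, the IEEE calculator, or software.
import math

J_PER_CAL = 4.184  # joules per calorie

def flash_boundary_m(voltage_kv: float, bolted_fault_ka: float,
                     clearing_time_s: float,
                     threshold_cal_cm2: float = 1.2) -> float:
    """Distance (meters) at which incident energy drops to the burn threshold."""
    e_b_j_cm2 = threshold_cal_cm2 * J_PER_CAL
    d_mm = math.sqrt(2.142e6 * voltage_kv * bolted_fault_ka
                     * clearing_time_s / e_b_j_cm2)
    return d_mm / 1000.0

# Hypothetical 480 V board: 30 kA bolted fault, 0.1 s clearing time
# -> boundary on the order of 0.8 m.
print(f"{flash_boundary_m(0.48, 30.0, 0.1):.2f} m")
```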
PROTECTIVE EQUIPMENT
NFPA 70E outlines strict standards for the type of PPE required for employees working in areas where electrical hazards are present, based on the task, the parts of the body that need protection, and the arc rating needed to match the potential flash exposure. PPE includes items such as a flash suit, switching coat, mask, hood, gloves, and leather protectors. Flame-resistant clothing underneath the PPE gear is also required.
After an arc flash hazard analysis has been performed, the correct PPE can be selected according to its arc thermal performance exposure value (ATPV) and breakopen threshold energy rating (EBT). Together, these ratings determine the level of hazard, measured in calories per square centimeter, that a given piece of PPE can protect a worker against. For example, a hard hat with an attached face shield provides adequate protection for Hazard/Risk Category 2, whereas an arc flash protection hood is needed for a worker exposed to Hazard/Risk Category 4.
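To make the selection step concrete, the sketch below maps a calculated incident energy to the lowest Hazard/Risk Category whose minimum arc rating covers it. The 4/8/25/40 cal/cm2 thresholds are the ratings commonly associated with NFPA 70E Categories 1 through 4, but they, and the specific clothing required for each category, are assumptions that must be verified against the tables in the edition of NFPA 70E in force at the site.

```python
# A minimal sketch of the selection step described above: map a calculated
# incident energy to the lowest Hazard/Risk Category whose minimum arc
# rating covers it. The 4/8/25/40 cal/cm2 thresholds are the ratings
# commonly associated with NFPA 70E Categories 1-4; they, and the specific
# clothing required for each category, are assumptions to verify against
# the tables in the edition of NFPA 70E in force at the site.
HRC_MIN_ARC_RATING = [  # (category, minimum arc rating in cal/cm2)
    (1, 4.0),
    (2, 8.0),
    (3, 25.0),
    (4, 40.0),
]

def required_hazard_risk_category(incident_energy_cal_cm2: float) -> int:
    """Return the lowest category whose rating covers the incident energy."""
    if incident_energy_cal_cm2 <= 1.2:
        return 0  # below the second-degree-burn onset threshold
    for category, rating in HRC_MIN_ARC_RATING:
        if incident_energy_cal_cm2 <= rating:
            return category
    raise ValueError("Incident energy exceeds Category 4 ratings; "
                     "the task should not be performed energized.")

print(required_hazard_risk_category(3.6))   # -> 1
print(required_hazard_risk_category(12.0))  # -> 3
```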
PPE is the last line of defense in an arc flash incident; it’s not intended to prevent all injuries, but to mitigate the impact of a flash, should one occur. In many cases, the use of PPE has saved lives or prevented serious injury.
OTHER SAFETY MEASURES
Additional safety-related practices for working on energized systems could include conducting a pre-work job briefing, using insulated tools, maintaining a written safety program, applying flash hazard labels (labels should indicate the flash hazard boundaries for a piece of equipment and the PPE needed to work within those boundaries), and completing an Energized Electrical Work Permit. According to NFPA, an Energized Electrical Work Permit is required for a task when live parts over 50 volts are involved. The permit outlines conditions and work practices needed to protect employees from arc flash or contact with live parts, and includes the following information:
• Circuit, equipment, and location
• Reason for working while energized
• Shock and arc flash hazard analysis
• Safe work practices
• Approach boundaries
• Required PPE and tools
• Access control
• Proof of job briefing.
DECIDING WHEN TO WORK HOT
NFPA 70E and OSHA require employers to prove that working in a de-energized state would create more or worse hazards than working on live components, or is not practical because of equipment design or operational limitations, for example, when working on circuits that are part of a continuous process that cannot be completely shut down. Other exceptions include situations in which isolating and deactivating system components would create a hazard for people not associated with the work, for example, when working on life-support systems, emergency alarm systems, or ventilation equipment for hazardous locations, or when de-energizing would remove illumination from an area.
In addition, OSHA makes provision for situations in which it would be “infeasible” to shut down equipment; for example, some maintenance and testing operations can only be performed on live electric circuits or equipment. The decision to work hot should be made only after careful analysis of what constitutes infeasibility. In recent years, some well-publicized OSHA actions and statements have centered on how to interpret this term.
ELECTRICAL SAFETY MEASURES IN PRACTICE
Many operational and maintenance practices will help minimize the potential for arc flash, reduce the incident energy or arcing time, or move the worker away from the energy source. In fact, many of these practices are consistent with the rigorous operational and maintenance processes and procedures of a mission-critical data center.
Although the electrical industry is aware of the risks of arc flash, according to the National Institute for Occupational Safety and Health, electrical shock remains the biggest worksite personnel hazard in all but the construction and utility industries. In his presentation at an IEEE-Industry Applications Society (IAS) workshop, Ken Mastrullo of the NFPA compared OSHA 1910 Subpart S citations with accidents and fatalities between 1 Oct. 2003 and 30 Sept. 2004. Installations accounted for 80% of the citations, while safe work practice issues were cited 20% of the time. However, installations accounted for only 9% of the accidents, while safe work practice issues accounted for 91% of all electrical-related accidents. In other words, while the majority of OSHA citations were for installation issues, the majority of injuries stemmed from work practice issues.
Can OSHA cite a company that does not comply with NFPA 70E? The simple answer is: Yes. If employees are involved in a serious electrical incident, OSHA will likely present the employer/owner with several citations. In fact, OSHA assessed more than 2,880 fines between 2007 and 2011 for sites not meeting Regulation 1910.132(d), an average of roughly 1.5 fines a day.
On the other hand, an OSHA inspection may actually help uncover issues. A May 2012 study of 800 California companies found that those receiving an inspection saw a decline of 9.4% in injuries. On average, these companies saved US$350,000 over the five years following the OSHA inspections,3 an outcome far preferable to being fined for noncompliance or experiencing an electrical accident. Beyond the matter of fines, however, any organization that wishes to effectively avoid putting its personnel in danger—and endangering infrastructure and operations—should endeavor to follow NFPA 70E guidelines (or their regional equivalent).
REDUCING ARC FLASH HAZARDS IN THE FACILITY
While personnel-oriented safety measures are the most important (and mandated) steps to reduce the risk of arc flash accidents, numerous equipment and component elements can also be incorporated into facility systems to help reduce that risk. These include metal-clad switchgear, arc-resistant switchgear, current-limiting power circuit breakers, and current-limiting reactors. Setting up zone-selective interlocking of circuit breakers can also be an effective prevention measure.
TIER STANDARDS & DATA CENTER PRACTICES ALIGN WITH ARC FAULT PREVENTION
Data centers are already ahead of many industries in conforming to many provisions of OSHA and NFPA 70E. Many electrical accidents are caused by issues such as dust in the environment, improper equipment installation, and human factors. To maintain the performance and reliability demanded by customers, data center operators have adopted a rigorous approach to cleaning, maintenance, installation, training, and other tasks that forestall arc flash. Organizations that subscribe to Tier standards and maintain stringent operational practices are better prepared to take on the challenges of compliance with OSHA and NFPA 70E requirements, in particular the requirements for safely performing work on energized systems, when such work is allowed per the safety standards.
For example, commissioning procedures eliminate the risk of improper installation. Periodically load testing engine generators and UPS systems demonstrates that equipment capacity is available and helps identify out-of-tolerance conditions that are indicative of degrading hardware or calibration and alignment issues. Thermographic scanning of equipment, distribution boards, and conduction paths can identify loose or degraded connections before they reach a point of critical failure.
Adherence to rigorous processes and procedures helps avoid operator error, and those processes and procedures serve as tools for personnel training and refresher classes. Facility and equipment design and capabilities, maintenance programs, and operating procedures are typically well defined and established in a mission critical data center, especially those at a Tier III or Tier IV Certification level.
Beyond the Tier Topology, the operational requirements for every infrastructure classification, as defined in the Tier Standard: Operational Sustainability, include the implementation of processes and procedures for all work activities. Completing comprehensive predictive and preventive maintenance increases reliability, which in turn improves availability. Methods of procedure are generally very detailed and task specific. Maintenance technicians meet stringent qualifications to perform work activities. Training is essential, and planning, practice, and preparation are key to managing an effective data center facility.
This industry focus on rigor and reliability in both systems and operational practices, reinforced by the Tier Standards, will enable data center teams to rapidly adopt and adhere to the practices required for compliance with OSHA and NFPA 70E. What remains in question is whether a given data center meets the infeasibility test prescribed by these governing bodies, based on either equipment design or operational limitations.
It can be argued that some of today’s data center operations approach the status of being “essential” for much of the underlying infrastructure that runs our 24 x 7 digitized society. Data centers support the functioning of global financial systems, power grids and utilities, air traffic control operations, communication networks, and the information processing that supports vital activities ranging from daily commerce to national security. Each facility must assess its operations and system capabilities to enable adherence to safe electrical work practices as much as possible without jeopardizing critical mission functions. In many cases, the answer for a specific data center business requirement may ultimately be a jurisdictional decision.
No measure will ever completely remove the risk of working on live, energized equipment. In instances where working on live systems is necessary and allowed by NFPA 70E rules, the application of Uptime Institute Tier III and Tier IV criteria can help minimize the risks. Tier III and IV both require the design and installation of systems that enable equipment to be fully de-energized, allowing planned activities such as repair, maintenance, replacement, or upgrade without exposing personnel to the risks of working on energized electrical equipment.
CONCLUSION
Over the last several decades, data centers and the information processing power they provide have become a fundamental necessity in our global, interconnected society. Balancing the need for appropriate electrical safety measures and compliance with the need to maintain and sustain uninterrupted production capacity in an energy-intensive environment is a challenge. But it is a challenge the data center industry is perhaps better prepared to meet than many other industry segments. Those in the data center industry who subscribe to high-availability concepts such as the Tier Standards: Topology and Operational Sustainability are well positioned to meet the requirements of NFPA 70E and OSHA from an execution perspective.
SIDEBAR: PROTECTION BOUNDARY DEFINITIONS
The flash protection boundary is the closest approach allowed by qualified or unqualified persons without the use of PPE. If the flash protection boundary is crossed, PPE must be worn. The boundary is a calculated number based upon several factors such as voltage, available fault current, and time for the protective device to operate and clear the fault. It is defined as the distance at which the worker is exposed to 1.2 cal/cm2 for 0.1 second.
LIMITED APPROACH BOUNDARY
The limited approach boundary is the minimum distance from the energized item where untrained personnel may safely stand. No unqualified (untrained) personnel may approach any closer to the energized item than this boundary. The boundary is determined by NFPA 70E Table 130.4-(1) (2) (3) and is based on the voltage of the equipment (2012 Edition).
RESTRICTED APPROACH BOUNDARY
The restricted approach boundary is the distance where qualified personnel may not cross without wearing appropriate PPE. In addition, they must have a written approved plan for the work that they will perform. This boundary is determined from NFPA Table 130.4-(1) (4) (2012 Edition) and is based on the voltage of the equipment.
PROHIBITED APPROACH BOUNDARY
Only qualified personnel wearing appropriate PPE can cross a prohibited approach boundary. Crossing this boundary is considered the same as contacting the exposed energized part. Therefore, personnel must obtain a risk assessment before the prohibited boundary is crossed. This boundary is determined by NFPA 70E Table 130.4-(1) (5) (2012 Edition) and is based upon the voltage of the equipment.
Ed Rafter
Edward P. Rafter has been a consultant to Uptime Institute Professional Services (ComputerSite Engineering) since 1999 and assumed a full time position with Uptime Institute in 2013 as principal of Education and Training. He currently serves as vice president-Technology. Mr. Rafter is responsible for the daily management and direction of the professional education staff to deliver all Uptime Institute training services. This includes managing the activities of the faculty/staff delivering the Accredited Tier Designer (ATD) and Accredited Tier Specialist (ATS) programs, and any other courses to be developed and delivered by Uptime Institute.
ADDITIONAL RESOURCES
To review the complete NFPA-70E standards as set forth in NFPA 70E: Standard For Electrical Safety In The Workplace, visit www.NFPA.org
For resources to assist with calculating flash protection boundaries, visit:
To determine what PPE is required, the tables in NFPA 70E-2012 provide the simplest method. They give instant answers with almost no field data needed, but they have limited applicability and are conservative for most applications (the tables are not intended as a substitute for an arc flash hazard analysis, only as a guide).
A simplified two-category PPE approach is found in NFPA 70E-2012, Table H-2 of Annex H. This table ensures adequate PPE for electrical workers within facilities with large and diverse electrical systems. Other good resources include:
• Controlling Electrical Hazards. OSHA Publication 3076, (2002). 71 pages. Provides a basic overview of electrical safety on the job, including information on how electricity works, how to protect against electricity, and how OSHA can help.
• Electrical Safety: Safety and Health for Electrical Trades Student Manual, U.S. Department of Health and Human Services (DHHS), National Institute for Occupational Safety and Health (NIOSH), Publication No. 2002-123, (2002, January). This student manual is part of a safety and health curriculum for secondary and post-secondary electrical trades courses. It is designed to engage the learner in recognizing and controlling hazards associated with electrical work.
• Electrocutions Fatality Investigation Reports. National Institute for Occupational Safety and Health (NIOSH) Safety and Health Topic. Provides information regarding hundreds of fatal incidents involving electrocutions investigated by NIOSH and state investigators.
• Working Safely with Electricity. OSHA Fact Sheet. Provides safety information on working with generators, power lines, extension cords, and electrical equipment.
• Lockout/Tagout OSHA Fact Sheet, (2002).
• Lockout-Tagout Interactive Training Program. OSHA. Includes selected references for training and interactive case studies.
2. Common Electrical Hazards in the Workplace including Arc Flash, Workplace Safety Awareness Council (www.wpsac.org), produced under Grant SH-16615-07-60-F-12 from the Occupational Safety and Health Administration, U.S. Department of Labor.
3. “The Business Case For Safety and Health,” U.S. Department of Labor, https://www.osha.gov/dcsp/products/topics/businesscase/
A statistical justification for 24×7 coverage
By Richard Van Loo
As a result of performing numerous operational assessments at data centers around the world, Uptime Institute has observed that staffing levels at data centers vary greatly from site to site. This observation is discouraging, but not surprising, because while staffing is an important function for data centers attempting to maintain operational excellence, many factors influence an organization’s decision on appropriate staffing levels.
Factors that can affect overall staffing numbers include the complexity of the data center, the level of IT turnover, the number of support activity hours required, the number of vendors contracted to support operations, and business objectives for availability. Cost is also a concern because each staff member represents a direct cost. Because of these numerous factors, data center staffing levels must be constantly reviewed in an attempt to achieve effective data center support at a reasonable cost.
Uptime Institute is often asked, “What is the proper staffing level for my data center?” Unfortunately, there is no quick answer that works for every data center, since proper staffing depends on a number of variables.
The time required to perform maintenance tasks and the shift coverage needed for support are two basic variables. Staffing for maintenance requirements is relatively fixed but is affected by which activities are performed by data center personnel and which are performed by vendors. Shift coverage support is defined as staffing for data center monitoring and rounds and for responding to any incidents or events. Staffing levels to support shift coverage can be provided in a number of different ways, and each method has potential impacts on operations depending on how that coverage is focused.
TRENDS IN SHIFT COVERAGE
The primary purpose of having qualified personnel on site is to mitigate the risk of an outage caused by abnormal incidents or events, either by preventing the incident or by containing and isolating it and keeping it from spreading or impacting other systems. Many data centers still support shift presence with a team of qualified electricians, mechanics, and other technicians who provide 24 x 7 coverage. Remote monitoring technology, designs that incorporate redundancy, campus data center environments, the desire to balance costs, and other practices can lead organizations to deploy personnel differently.
Managing shift presence without having qualified personnel on site at all times can elevate risks due to delayed response to abnormal incidents. Ultimately, the acceptable level of risk must be a company decision.
Other shift presence models include:
• Training security personnel to respond to alarms and execute an escalation procedure
• Monitoring the data center through a local or regional building monitoring system (BMS) and having technicians on call
• Having personnel on site during normal business hours and on call during nights and weekends
• Operating multiple data centers as a campus or portfolio so that a team supports multiple data centers without necessarily being on site at each individual data center at a given time
These and other models have to be individually assessed for effectiveness. To assess the effectiveness of any shift presence model, the data center must determine the potential risks of incidents to the operations of the data center and the impact on the business.
For the last 20 years, Uptime Institute has built the Abnormal Incident Reports (AIRs) database using information reported by Uptime Institute Network members. Uptime Institute analyzes the data annually and reports its findings to Network members. The AIRs database provides interesting insights relating to staffing concerns and effective staffing models.
INCIDENTS OCCUR OUTSIDE BUSINESS HOURS
In 2013, a slight majority of incidents (out of 277 total incidents) occurred during normal business hours. However, 44% of incidents happened between midnight and 8:00 a.m., which underscores the potential need for 24 x 7 coverage (see Figure 1).
Figure 1. Approximately half the AIRs that occurred in 2013 took place between 8 a.m. and 12 p.m., the other half between 12 a.m. and 8 a.m.
Similarly, incidents can happen at any time of the year. As a result, focusing shift presence activities toward a certain time of year over others would not be productive. Incident occurrence is pretty evenly spread out over the year.
Figure 2 details the day of the week when incidents occurred. The chart shows that incidents occur on nearly an equal basis every day of the week, which suggests that shift presence requirement levels should be the same every day of the week. To do otherwise would leave shifts with little or no shift presence to mitigate risks. This is an important finding because some data centers focus their shift presence support Monday through Friday and leave weekends to more remote monitoring (see Figure 2).
Figure 2. Data center staff must be ready every day of the week.
INCIDENTS BY INDUSTRY
Figure 3 further breaks down the incidents by industry and shows no significant difference in those trends between industries. The chart does show that the financial services industry reported far more incidents than other industries, but that number reflects the makeup of the sample more than anything.
Figure 3. Incidents in data centers take place all year round.
INCIDENT BREAKDOWNS
Knowing when incidents occur does little to say what personnel should be on site. Knowing what kinds of incidents occur most often will help shape the composition of the on-site staff, as will knowing how incidents are most often identified. Figure 4 shows that electrical systems experience the most incidents, followed by mechanical systems. By contrast, critical IT load causes relatively few incidents.
Figure 4. More than half the AIRs reported in 2013 involved the electrical system.
As a result, it would seem to make sense that shift presence teams should have sufficient electrical experience to respond to the most common incidents. The shift presence team must also respond to other types of incidents, but cross training electrical staff in mechanical and building systems might provide sufficient coverage. And, on-call personnel might cover the relatively rare IT-related incidents.
The AIRs database also sheds some light on how incidents are discovered. Figure 5 suggests that over half of all incidents discovered in 2013 were detected by alarms and more than 40% were discovered by technicians on site, together accounting for about 95% of incidents. The biggest change over the years covered by the chart is a slow growth in the share of incidents discovered by alarm.
Figure 5. Alarms are now the source for most AIRs; however, availability failures are more likely to be found by technicians.
Alarms, however, cannot respond to or mitigate incidents. Uptime Institute has witnessed a number of methods for saving a data center from going down and reducing the impact of an incident. These methods include having personnel on site to respond to the incident, building redundancy into critical systems, and running strong predictive maintenance programs to forecast potential failures before they occur. Figure 6 breaks down how often each of these methods produced actual saves.
Figure 6. Equipment redundancy was responsible for more saves in 2013 than in previous years.
The chart also appears to suggest that in recent years, equipment redundancy and predictive maintenance are producing more saves and technicians fewer. There are several possible explanations for this finding, including more robust systems, greater use of predictive maintenance, and budget cuts that reduce staffing or move it off site.
FAILURES
The data show that all the availability failures in 2013 were caused by electrical system incidents. A majority of the failures occurred because maintenance procedures were not followed. This finding underscores the importance of having proper procedures and well trained staff, and ensuring that vendors are familiar with the site and procedures.
Figure 7. Almost half the AIRs reported in 2013 were In Service.
Figure 7 further explores the causes of incidents in 2013. Roughly half the incidents were described as “In Service,” which is defined as inadequate maintenance, equipment adjustment, operated to failure, or no root cause found. The incidents attributed to preventive maintenance actually refer to preventive maintenance that was performed improperly. Data center staff caused just 2% of incidents, showing that the interface of personnel and equipment is not a main cause of incidents and outages.
SUMMARY
The increasing sophistication of data center infrastructure management (DCIM), building management systems (BMS), and building automation systems (BAS) raises the question of whether staffing can be reduced at data centers. The advances in these systems are real and can enhance the operations of a data center; however, as the AIRs data show, mitigation of incidents often requires on-site personnel. This is why it is still a prescriptive behavior for Tier III and Tier IV Operational Sustainability Certified data centers to have qualified full time equivalent (FTE) personnel on site at all times. The driving purpose is to provide quick response time to mitigate any incidents and events. The data show no pattern as to when incidents occur; their occurrence is spread fairly evenly across all hours of the day and all days of the week. Watching data centers continue to evolve, with increased remote access and more redundancy built in, will show whether these trends continue on their current path. As with any data center operations program, the fundamental objective is risk avoidance. Each data center is unique, with its own set of inherent risks. Shift presence is just one factor, but an important one; decisions about how many people to staff on each shift, and with what qualifications, can have a major impact on risk avoidance and continued data center availability. Choose wisely.
Rich Van Loo
Rich Van Loo is Vice President, Operations for Uptime Institute. He performs Uptime Institute Professional Services audits and Operational Sustainability Certifications. He also serves as an instructor for the Accredited Tier Specialist course.
Mr. Van Loo’s work in critical facilities includes responsibilities ranging from project manager of a major facility infrastructure service contract for a data center, to space planning for the design/construction of several data center modifications, to facilities IT support. As a contractor for the Department of Defense, Mr. Van Loo provided planning, design, construction, operation, and maintenance of worldwide mission critical data center facilities. Mr. Van Loo’s 27-year career includes 11 years as a facility engineer and 15 years as a data center manager.
Municipal requirements imposed difficult space considerations for Italian insurance company’s Tier IV data center
By Andrea Ambrosi and Roberto Del Nero
Space planning is often the key to a successful data center project. Organizing a facility into functional blocks is a fundamental way to limit interference between systems, reduce any problem related to power distribution, and simplify project development. However, identifying functional blocks and optimizing space within an existing building can be extremely complex. Converting an office building into a data center can cause further complexity.
This was the challenge facing Maestrale, a consortium of four engineering companies active in the building field, including Ariatta Ingegneria dei Sistemi Srl as the mechanical and electrical engineer, when a major Italian insurance company, UnipolSai Assicurazioni S.p.a (UnipolSai), asked it to design a data center in an office building in Bologna, Italy, that had been built at the end of the 1980s. UnipolSai set ambitious performance goals by requiring the electrical and mechanical infrastructure to achieve Uptime Institute Tier IV Certification of Constructed Facility and to be very energy efficient.
In addition, the completed facility, designed and built to meet UnipolSai’s requirements, has the following attributes:
• 1,200 kilowatts (kW) maximum overall IT equipment load
• UPS capacity: not less than 10 minutes
• Four equipment rooms having a total area of 1,400 square meters (m2)
• Cold Aisle/Hot Aisle floor layouts
In an energy-conscious project, all innovative free-cooling technologies must be considered.
After a thorough investigation of these free-cooling technologies, Ariatta chose direct free cooling for the UnipolSai project.
MUNICIPAL RESTRICTIONS
The goal of the architectural and structural design of a data center is to accommodate, contain, and protect the mechanical, electrical, and IT equipment. The size, location, and configuration of the mechanical and electrical infrastructure will determine the architecture of the rest of the building. In a pre-existing building, however, this approach does not always apply. More often, builders must work around limitations such as fixed perimeter length and floor height, floors not capable of bearing the weight of expected IT equipment, lack of adjoining external spaces, and other restrictions imposed by municipal regulations. In this project, a series of restrictions and duties imposed by the Municipality of Bologna had a direct impact on technical choices, in particular:
• Any part or any piece of equipment more than 1.8 meters (m) high in the outside yard or on any external surface of the building (e.g., the roof) would be considered added volume, and therefore not allowed.
• Any modification or remodeling activity that changed the shape of the building was considered incompatible with municipal regulations.
• The location was part of a residential area with strict noise limits (noise levels at property lines of 50 decibels [dbA] during the day and 40 dbA at night).
New structural work would also be subject to seismic laws, now in force throughout the country. In addition, UnipolSai’s commitment to Uptime Institute Tier IV Certification required it to find solutions that achieve Continuous Cooling to IT equipment and Compartmentalization of ancillary systems.
The final design incorporates a diesel rotary UPS (DRUPS) in a 2N distribution scheme, a radial double-feed electrical system, and an N+1 mechanical system with a dual-water distribution backbone (Line 1 and Line 2), enabling the UnipolSai facility to meet Uptime Institute Tier IV requirements. Refrigerated water chillers with magnetic levitation bearings and air exchangers, arranged in an N+1 hydraulic scheme, serve the mechanical systems. The chillers are provided with a double electric service entrance controlled by an automatic transfer switch (ATS).
The DRUPS combine UPS and diesel engine-generator functions and do not require the battery systems that are normally part of static UPS installations. Uptime Institute Tier IV requires Compartmentalization, which demands more space, so eliminating the batteries saved a great deal of room. In addition, using the DRUPS to feed the chillers ensured that the facility would meet Tier IV requirements for Continuous Cooling with no need for storage tanks, which would have been difficult to place on this site. The DRUPS also completely eliminated cooling requirements in the UPS room, because the design ambient temperature would be around 30°C (maximum 40°C). Finally, using the DRUPS greatly simplified the distribution structure, limiting the ATSs on primary electric systems to a minimum.
Municipal restrictions meant that the best option for locating the DRUPS and chillers would require radically transforming some areas inside the building. For example, Ariatta uncovered an office floor and adapted the structures and waterproofing to install the chiller plant (see Figures 1 and 2).
Figures 1 and 2. Bird’s eye and ground level views of the chiller plant.
Positioning the DRUPS posed another challenge. In another municipality, the units’ dimensions (12-m length by 4.5-m height), weight, and maintenance requirements would have guided the design team toward a simple solution, such as installing them in containers directly at grade. However, the municipal restrictions for this location (the 1.8-m limit above street level) required an alternative solution. Geological, geotechnical, and hydrogeological studies of the site of the underground garage showed that:
• Soil conditions met the technical and structural requirements of the DRUPS installation.
• The stratum was lower than the foundations.
• Flood indexes are fixed 30 centimeters above street level (taking zero level as reference).
The garage area was therefore opened and completely modified to contain a watertight tank housing the DRUPS. The tank included a 1.2-m high parapet to prevent flooding and was equipped with redundant water lifting systems fed by the DRUPS (see Figures 3 and 4).
Figures 3 and 4. Particular care was given to protect the DRUPS against water intrusions. Soundproofing was also necessary.
Meeting the city’s acoustic requirements required soundproofing the DRUPS machines, so the DRUPS systems were double shielded, reducing noise levels to 40 decibels (dbA) at 10 m during normal operation when connected to mains power. Low-noise chillers and high-performance acoustic barriers helped the entire facility meet its acoustical goals.
After identifying technical rooms and allocating space for equipment rooms, Ariatta had to design systems that met UnipolSai’s IT and mechanical and electrical requirements, IT distribution needs, and Uptime Institute Tier IV Compartmentalization requirements.
The floors of the existing building did not always align, especially on lower stories, and these changes in elevation were hard to read in plans and sections. To meet this challenge, Starching S.R.L. Studio Architettura & Ingegneria and Redesco Progetti Srl, both part of the Maestrale Consortium, developed a three-dimensional Revit model that included information about the mechanical and electrical systems. The Revit model helped identify problems caused by the misalignment of the floors and conflicts between systems during the design phase. It also helped communicate new information about the project to contractors during the construction phase (see Figures 5 and 6).
Figures 5 and 6. Revit models helped highlight changes in building elevations that were hard to discern in other media and also aided in communication with contractors.
The use of 3D models is becoming a common way to eliminate interference between systems in final design solutions, with positive effects on the execution of work in general and only a moderate increase in engineering costs.
Figure 7. Fire-rated pipe enclosure
At UnipolSai, Compartmentalizing ancillary systems represented one of the main problems to be resolved to obtain Uptime Institute Tier IV Certification because of restrictions imposed by the existing building. Ariatta engaged in continuous dialogue with Uptime Institute to identify technical solutions. This dialogue, along with studies and functional demonstrations carried out jointly with sector specialists, led to a shared solution in which the two complementary systems that form the technological backbone are compartmentalized with respect to one another (see Figure 7). The enclosures, which run largely parallel to each other, have:
• An external fire-resistant layer (60 minutes, same as the building structure)
• An insulation layer to keep the temperature of the technological systems within design limits for 60 minutes
• A channel that contains and protects against leaks and absorbs shocks
• Dedicated independent mounting brackets
This solution was needed where the architectural characteristics of the building affected the technological backbone (see Figure 8).
Figure 8. The layout of the building limited the potential paths for pipe runs.
ENERGY EFFICIENCY
The choice of direct free cooling followed an environmental study intended to determine and analyze the time periods when outdoor thermo-hygrometric conditions are favorable to the indoor IT microclimate of the UnipolSai data center, as well as the technical, economic, and energy impact of free cooling on the facility. The next-generation IT equipment used at UnipolSai allows it to modify the environmental parameters used as reference.
Figure 9. UnipolSai sized equipment to meet the requirements of ASHRAE’s “Thermal Guidelines for Data Processing Environments, 3rd edition,” as illustrated by that publication’s Figure 2.
The air conditioning systems in the data center were sized to guarantee temperatures of 24–26°C (75–79°F) for Class A1 equipment rooms per ASHRAE’s “Thermal Guidelines for Data Processing Environments, 3rd edition,” in accordance with the ASHRAE psychrometric chart (see Figure 9). The studies showed that, in the Bologna region specifically, outdoor thermo-hygrometric conditions are favorable to the IT microclimate of the data center about 70% of the time, with energy savings of approximately 2,000 megawatt-hours. Direct free cooling brought undeniable advantages in terms of energy efficiency but introduced a significant functional complication related to Tier compliance. The Tier Standards do not reference direct free cooling or other economization systems, as the Tier requirements apply regardless of the technology.
Eventually, it was decided that the free cooling system had to be subordinated to IT equipment continuous operation and excluded every time there was a problem with the mechanical and electrical systems, in which case Continuous Cooling would be ensured by the chiller plant. The direct free cooling arrangement, with unchanneled hot air rejection imposed by the pre-existing architectural restrictions, dictated the room layout and drove the choice of Cold Aisle containment. The direct free-cooling system consists of N+1 CRACs placed along the perimeter of the room, blowing cool air into a 60-inch plenum created by the access floor. The same units manage the free-cooling system. Every machine is equipped with a dual-feed electric entrance controlled by an ATS and connected to a dual water circuit through a series of automatic valves (see Figure 10).
Figure 10. CRACs are connected with a dual-feed electric entrance controlled by an ATS and connected to a dual water circuit.
Containing the Cold Aisles caused a behavioral response among the IT operators, who normally work in a cold environment; at UnipolSai’s data center, they encounter hot air when entering the room. Design return air temperatures in the circulation areas are 32–34°C (90–93°F), and design supply air temperatures are 24–26°C (75–79°F). It became necessary to start an informational campaign to prevent alarm over room temperatures in the areas outside the functional aisles (see Figures 11-13).
Figures 11-13. Pictures show underfloor piping, containers, and raised floor environment.
Prefabricated electric busbars placed on the floor at regular intervals provide power to the IT racks. This decision was made in collaboration with UnipolSai technicians, who considered it the most flexible solution in terms of installation and power draw, both initially and to accommodate future changes (see Figures 14 and 15).
Figure 14. Electric busbar
Figure 15. Taps on the busbar allow great flexibility on the data center floor and feed servers on the white space floor below.
In addition, a labeling system based on a unique alphanumeric code and color coding, which allows quick visual identification of any part of any system, simplifies the process of testing, operating, and managing all building systems (see Figures 16 and 17).
Figures 16-17. UnipolSai benefits from a well-thought-out labeling system, which simplifies many aspects of operations.
FINAL TESTING
Functional tests were carried out at nominal load with the support of electric heaters, distributed in a regular manner within the equipment rooms and connected to the infrastructure feeding the IT equipment (see Figures 18 and 19). Uptime Institute also observed technical and functional tests as part of Tier IV Certification of Constructed Facility (TCCF). The results of all the tests were positive; final demonstrations are pending. The data center has received Uptime Institute Tier IV Certification of Design Documents and is in progress for Tier IV Certification of Constructed Facility.
Figures 18-19. Two views of the data center floor, including heaters, during final testing.
To fully respond to UnipolSai’s energy saving and consumption control policy, the site was equipped with a network of heat/cooling energy meters and electrical meters connected to the central supervision system. Each chiller, pumping system, and air handling system is individually metered on the electrical side, with the chillers also metered on the thermal side. Each electric system feeding IT loads is also metered.
UnipolSai also adopted DCIM software that, if properly used, can represent the first step toward an effective organization of the maintenance process, which is essential for keeping a system efficient and operational, independent of its level of redundancy and sophistication.
Andrea Ambrosi
Andrea Ambrosi is project manager, design team manager, and site manager at Ariatta Ingegneria dei Sistemi Srl (Ariatta). He is responsible for the executive planning and management of operations of electrical power, control and supervision systems, safety, and security and fire detection systems to be installed in data centers. He has specific experience in domotics and special systems for the high-tech residential sector. He has been an Accredited Tier Designer since 2013 and an Accredited Tier Specialist since 2014.
Roberto Del Nero
Roberto Del Nero is a project manager and design team manager at Ariatta, where he is responsible for the executive planning and management of mechanical plants, control and supervision systems, fire systems, plumbing, and drainage to be installed in data centers. He has been a LEED AP (Accredited Professional) since 2009, an Accredited Tier Designer since 2013, and an Accredited Tier Specialist since 2014.
Conventional wisdom blames “human error” for the majority of outages, but those failures are incorrectly attributed to front-line operator errors, rather than management mistakes
By Julian Kudritzki, with Anne Corning
Data centers, oil rigs, ships, power plants, and airplanes may seem like vastly different entities, but all are large and complex systems that can be subject to failure, sometimes catastrophic failure. Natural events like earthquakes or storms may initiate a complex system failure. But blame is often assigned to “human error”: front-line operator mistakes that combine with a lack of appropriate procedures and resources, or with compromised structures that result from poor management decisions.
“Human error” is an insufficient and misleading term. The front-line operator’s presence at the site of the incident ascribes responsibility to the operator for failure to rescue the situation. But this masks the underlying causes of an incident. It is more helpful to consider the site of the incident as a spectacle of mismanagement.
Responsibility for an incident, in most cases, can be attributed to a senior management decision (e.g., design compromises, budget cuts, staff reductions, vendor selection and resourcing) seemingly disconnected in time and space from the site of the incident. What decisions led to a situation where front-line operators were unprepared or untrained to respond to an incident and mishandled it?
To safeguard against failures, standards and practices have evolved in many industries that encompass strict criteria and requirements for the design and operation of systems, often including inspection regimens and certifications. Compiled, codified, and enforced by agencies and entities in each industry, these programs and requirements help protect the service user from the bodily injuries or financial effects of failures and spur industries to maintain preparedness and best practices.
Twenty years of Uptime Institute research into the causes of data center incidents places predominant accountability for failures at the management level and finds only single-digit percentages of spontaneous equipment failure.
This fundamental and permanent truth compelled the Uptime Institute to step further into standards and certifications that were unique to the data center and IT industry. Uptime Institute undertook a collaborative approach with a variety of stakeholders to develop outcome-based criteria that would be lasting and developed by and for the industry. Uptime Institute's Certifications were conceived to evaluate, in an unbiased fashion, front-line operations within the context of management structure and organizational behaviors.
EXAMINING FAILURES
The sinking of the Titanic. The Deepwater Horizon oil spill. DC-10 air crashes in the 1970s. The failure of New Orleans’ levee system. The Three Mile Island nuclear release. The northeast (U.S.) blackout of 2003. Battery fires in Boeing 787s. The space shuttle Challenger disaster. Fukushima Daiichi nuclear disaster. The grounding of the Kulluk arctic drilling rig. These are a few of the most infamous, and in some cases tragic, engineering system failures in history. While the examples come from vastly different industries and each story unfolded in its own unique way, they all have something in common with each other—and with data centers. All exemplify highly complex systems operating in technologically sophisticated industries.
The hallmarks of so-called complex systems are “a large number of interacting components, emergent properties difficult to anticipate from the knowledge of single components, adaptability to absorb random disruptions, and highly vulnerable to widespread failure under adverse conditions” (Dueñas-Osorio and Vemuru 2009). Additionally, the components of complex systems typically interact in non-linear fashion, operating in large interconnected networks.
Large systems and the industries that use them have many safeguards against failure and multiple layers of protection and backup. Thus, when they fail it is due to much more than a single element or mistake.
It is a truism that complex systems tend to fail in complex ways. Looking at just a few examples from various industries, again and again we see that it was not a single factor but the compound effect of multiple factors that disrupted these sophisticated systems. Often referred to as “cascading failures,” complex system breakdowns usually begin when one component or element of the system fails, requiring nearby “nodes” (or other components in the system network) to take up the workload or service obligation of the failed component. If this increased load is too great, it can cause other nodes to overload and fail as well, creating a waterfall effect as every component failure increases the load on the other, already stressed components. The following transferable concept is drawn from the power industry:
Power transmission systems are heterogeneous networks of large numbers of components that interact in diverse ways. When component operating limits are exceeded, protection acts to disconnect the component and the component “fails” in the sense of not being available… Components can also fail in the sense of misoperation or damage due to aging, fire, weather, poor maintenance, or incorrect design or operating settings…. The effects of the component failure can be local or can involve components far away, so that the loading of many other components throughout the network is increased… the flows all over the network change (Dobson, et al. 2009).
A component of the network can be mechanical, structural or human agent, as front-line operators respond to an emerging crisis. Just as engineering components can fail when overloaded, so can human effectiveness and decision-making capacity diminish under duress. A defining characteristic of a high risk organization is that it provides structure and guidance despite extenuating circumstances—duress is its standard operating condition.
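The load-redistribution dynamic described above can be made concrete with a minimal sketch: when a component trips, its load is shared among the survivors, any of which may then exceed its own limit and trip in turn. The capacities, loads, and equal-sharing rule below are illustrative assumptions, not data or models from any of the cited studies.

# A minimal sketch of a cascading failure driven by load redistribution.
def cascade(loads, capacities, initial_failure):
    """Return the set of failed components after the cascade settles."""
    failed = {initial_failure}
    while True:
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            break
        # Load shed by failed components is shared equally by survivors
        shed = sum(loads[i] for i in failed)
        extra = shed / len(survivors)
        newly_failed = {i for i in survivors if loads[i] + extra > capacities[i]}
        if not newly_failed:
            break
        failed |= newly_failed
    return failed

if __name__ == "__main__":
    loads = [70, 80, 75, 60, 85]            # current loading of each component
    capacities = [100, 100, 100, 100, 100]  # trip limits
    # A single trip of component 4 overloads the rest, one wave at a time
    print(sorted(cascade(loads, capacities, initial_failure=4)))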
The sinking of the Titanic is perhaps the most well-known complex system failure in history. This disaster was caused by the compound effect of structural issues, management decisions, and operating mistakes that led to the tragic loss of 1,495 lives. Just a few of the critical contributing factors include design compromises (e.g., reducing the height of the watertight bulkheads that allowed water to flow over the tops and limiting the number of lifeboats for aesthetic considerations), poor discretionary decisions (e.g., sailing at excessive speed on a moonless night despite reports of icebergs ahead), operator error (e.g., the lookout in the crow’s nest had no binoculars—a cabinet key had been left behind in Southampton), and misjudgment in the crisis response (e.g., the pilot tried to reverse thrust when the iceberg was spotted, instead of continuing at full speed and using the momentum of the ship to turn course and reduce impact). And, of course, there was the hubris of believing the ship was unsinkable.
Figure 1a. (Left) NTSB photo of the burned auxiliary power unit battery from a JAL Boeing 787 that caught fire on January 7, 2013 at Boston's Logan International Airport. Photo credit: By National Transportation Safety Board (NTSB) [Public domain], via Wikimedia Commons. Figure 1b. (Right) A side-by-side comparison of an original Boeing Dreamliner (787) battery and a damaged Japan Airlines battery. Photo credit: By National Transportation Safety Board (NTSB) [Public domain], via Wikimedia Commons.
Looking at a more recent example, the issue of battery fires in Japan Airlines (JAL) Boeing 787s, which came to light in 2013 (see Figure 1), was ultimately blamed on a combination of design, engineering, and process management shortfalls (Gallagher 2014). Following its investigation, the U.S. National Transportation Safety Board reported (NTSB 2014):
• Manufacturer errors in design and quality control. The manufacturer failed to adequately account for the thermal runaway phenomenon: an initial overheating of the batteries triggered a chemical reaction that generated more heat, thus causing the batteries to explode or catch fire. Battery “manufacturing defects and lack of oversight in the cell manufacturing process” resulted in the development of lithium mineral deposits in the batteries. Called lithium dendrites, these deposits can cause a short circuit that reacts chemically with the battery cell, creating heat. Lithium dendrites occurred in wrinkles that were found in some of the battery electrolyte material, a manufacturing quality control issue.
• Shortfall in certification processes. The NTSB found shortcomings in U.S. Federal Aviation Administration (FAA) guidance and certification processes. Some important factors were overlooked that should have been considered during safety assessment of the batteries.
• Lack of contractor oversight and proper change orders. A cadre of contractors and subcontractors were involved in the manufacture of the 787’s electrical systems and battery components. Certain entities made changes to the specifications and instructions without proper approval or oversight. When the FAA performed an audit, it found that Boeing’s prime contractor wasn’t following battery component assembly and installation instructions and was mislabeling parts. A lack of “adherence to written procedures and communications” was cited.
How many of these circumstances parallel those that can happen during the construction and operation of a data center? It is all too common to find deviations from as-designed systems during the construction process, inconsistent quality control oversight, and the use of multiple subcontractors. Insourced and outsourced resources may disregard or hurry past written procedures, documentation, and communication protocols (search Avoiding Data Center Construction Problems @ journal.uptimeinstitute.com).
THE NATURE OF COMPLEX SYSTEM FAILURES
Large industrial and engineered systems are risky by their very nature. The greater the number of components and the higher the energy and heat levels, velocity, and size and weight of these components, the greater the skill and teamwork required to plan, manage, and operate the systems safely. Between mechanical components and human actions, there are thousands of possible points where an error can occur and potentially trigger a chain of failures.
In his seminal article on complex system failure, How Complex Systems Fail, first published in 1998 and still widely referenced today, Dr. Richard I. Cook identifies and discusses 18 core elements of failure in complex systems:
1. Complex systems are intrinsically hazardous systems.
2. Complex systems are heavily and successfully defended against failure.
3. Catastrophe requires multiple failures—single point failures are not enough.
4. Complex systems contain changing mixtures of failures latent within them.
5. Complex systems run in degraded mode.
6. Catastrophe is always just around the corner.
7. Post-accident attribution to a root cause is fundamentally wrong.
8. Hindsight biases post-accident assessments of human performance.
9. Human operators have dual roles: as producers and as defenders against failure.
10. All practitioner actions are gambles.
11. Actions at the sharp end resolve all ambiguity.
12. Human practitioners are the adaptable element of complex systems.
13. Human expertise in complex systems is constantly changing.
14. Change introduces new forms of failure.
15. Views of cause limit the effectiveness of defenses against future events.
16. Safety is a characteristic of systems and not of their components.
17. People continuously create safety.
18. Failure-free operations require experience with failure (Cook 1998).
Let’s examine some of these principles in the context of a data center. Certainly high-voltage electrical systems, large-scale mechanical and infrastructure components, high-pressure water piping, power generators, and other elements create hazards [Element 1] for both humans and mechanical systems/structures. Data center systems are defended from failure by a broad range of measures [Element 2], both technical (e.g., redundancy, alarms, and safety features of equipment) and human (e.g., knowledge, training, and procedures). Because of these multiple layers of protection, a catastrophic failure would require the breakdown of multiple systems or multiple individual points of failure [Element 3].
RUNNING NEAR CRITICAL FAILURE
Complex systems science suggests that most large-scale complex systems, even well-run ones, by their very nature are operating in “degraded mode” [Element 5], i.e., close to the critical failure point. This is due to the progression over time of various factors, including steadily increasing load demand, engineering forces, and economic factors.
The enormous investments in data center and other highly available infrastructure systems perversely incentivize conditions of elevated risk and a higher likelihood of failure. Maximizing capacity, increasing density, and hastening production from installed infrastructure improve the return on investment (ROI) on these major capital investments. Deferred maintenance, whether due to lack of budget or to hands-off periods during heightened production, further pushes equipment toward its performance limits—the breaking point.
The increasing density of data center infrastructure exemplifies the dynamics that continually and inexorably push a system towards critical failure. Server density is driven by a mixture of engineering forces (advancements in server design and efficiency) and economic pressures (demand for more processing capacity without increasing facility footprint). Increased density then necessitates corresponding increases in the number of critical heating and cooling elements. Now the system is running at higher risk, with more components (each of which is subject to individual fault/failure), more power flowing through the facility, more heat generated, etc.
This development trajectory demonstrates just a few of the powerful “self-organizing” forces in any complex system. According to Dobson, et al (2009), “these forces drive the system to a dynamic equilibrium that keeps [it] near a certain pattern of operating margins relative to the load. Note that engineering improvements and load growth are driven by strong, underlying economic and societal forces that are not easily modified.”
Because of this dynamic mix of forces, the potential for a catastrophic outcome is inherent in the very nature of complex systems [Element 6]. For large-scale mission critical and business critical systems, the profound implication is that designers, system planners, and operators must acknowledge the potential for failure and build in safeguards.
WHY IS IT SO EASY TO BLAME HUMAN ERROR?
Human error is often cited as the root cause of many engineering system failures, yet it does not often cause a major disaster on its own. Based on analysis of 20 years of data center incidents, Uptime Institute holds that human error must signify management failure to drive change and improvement. Leadership decisions and priorities that result in a lack of adequate staffing and training, an organizational culture that becomes dominated by a fire drill mentality, or budget cutting that reduces preventive/proactive maintenance could result in cascading failures that truly flow from the top down.
Although front-line operator error may sometimes appear to cause an incident, a single mistake (just like a single data center component failure) is not often sufficient to bring down a large and robust complex system unless conditions are such that the system is already teetering on the edge of critical failure and has multiple underlying risk factors. For example, media reports after the 1989 Exxon Valdez oil spill zeroed in on the fact that the captain, Joseph Hazelwood, was not on the bridge at the time of the accident and accused him of drinking heavily that night. However, more measured assessments of the accident by the NTSB and others found that Exxon had consistently failed to supervise the captain or provide sufficient crew for necessary rest breaks (see Figure 2).
Figure 2. Shortly after leaving the Port of Valdez, the Exxon Valdez ran aground on Bligh Reef. The picture was taken three days after the vessel grounded, just before a storm arrived. Photo credit: Office of Response and Restoration, National Ocean Service, National Oceanic and Atmospheric Administration [Public domain], via Wikimedia Commons.
Perhaps even more critical was the lack of essential navigation systems: the tanker's radar was not operational at the time of the accident. Reports indicate that Exxon's management had allowed the RAYCAS radar system to stay broken for an entire year before the vessel ran aground because it was expensive to operate. There was also inadequate disaster preparedness and an insufficient quantity of oil spill containment equipment in the region, despite the experiences of previous small oil spills. Four years before the accident, a letter written by Captain James Woodle, who at that time was the Exxon oil group's Valdez port commander, warned upper management, “Due to a reduction in manning, age of equipment, limited training and lack of personnel, serious doubt exists that [we] would be able to contain and clean-up effectively a medium or large size oil spill” (Palast 1999).
As Dr. Cook points out, post-accident attribution to a root cause is fundamentally wrong [Element 7]. Complete failure requires multiple faults, thus attribution of blame to a single isolated element is myopic and, arguably, scapegoating. Exxon blamed Captain Hazelwood for the accident, and his share of the blame obscures the underlying mismanagement that led to the failure. Inadequate enforcement by the U.S. Coast Guard and other regulatory agencies further contributed to the disaster.
Similarly, the grounding of the oil rig Kulluk was the direct result of a cascade of discrete failures, errors, and mishaps, but the disaster was first set in motion by Royal Dutch Shell’s executive decision to move the rig off of the Alaskan coastline to avoid tax liability, despite high risks (Lavelle 2014). As a result, the rig and its tow vessels undertook a challenging 1,700-nautical-mile journey across the icy and storm-tossed waters of the Gulf of Alaska in December 2012 (Funk 2014).
There had already been a chain of engineering and inspection compromises and shortfalls surrounding the Kulluk, including the installation of used and uncertified tow shackles, a rushed refurbishment of the tow vessel Discovery, and electrical system issues with the other tow vessel, the Aivik, which had not been reported to the Coast Guard as required. (Discovery experienced an exhaust system explosion and other mechanical issues in the following months. Ultimately the tow company—a contractor—was charged with a felony for multiple violations.)
This journey would be the Kulluk’s last, and it included a series of additional mistakes and mishaps. Gale-force winds put continual stress on the tow line and winches. The tow ship was captained on this trip by an inexperienced replacement, who seemingly mistook tow line tensile alarms (set to go off when tension exceeded 300 tons) for another alarm that was known to be falsely annunciating. At one point the Aivik, in attempting to circle back and attach a new tow line, was swamped by a wave, sending water into the fuel pumps (a problem that had previously been identified but not addressed), which caused the engines to begin to fail over the next several hours (see Figure 3).
Figure 3. Waves crash over the mobile offshore drilling unit Kulluk where it sits aground on the southeast side of Sitkalidak Island, Alaska, Jan. 1, 2013. A Unified Command, consisting of the Coast Guard, federal, state, local and tribal partners, and industry representatives, was established in response to the grounding. U.S. Coast Guard photo by Petty Officer 3rd Class Jonathan Klingenberg.
Despite harrowing conditions, Coast Guard helicopters were eventually able to rescue the 18 crew members aboard the Kulluk. Valiant last-ditch tow attempts were made by the (repaired) Aivik and Coast Guard tugboat Alert, before the effort had to be abandoned and the oil rig was pushed aground by winds and currents.
Poor management decision making, lack of adherence to proper procedures and safety requirements, shortcuts in the repair of critical mechanical equipment, insufficient contractor oversight, and lack of personnel training and experience: all of these elements of complex system failure are readily seen as contributing factors in the Kulluk disaster.
EXAMINING DATA CENTER SYSTEM FAILURES
Two recent incidents demonstrate how the dynamics of complex systems failures can quickly play out in the data center environment.
Example A
Tier III Concurrent Maintenance data center criteria (see Uptime Institute Tier Standard: Topology) require multiple, diverse, independent distribution paths serving all critical equipment to allow maintenance activity without impacting the critical load. The data center in this example had been designed appropriately, with fuel pumps and engine-generator controls powered from multiple circuit panels. As built, however, a single panel powered both, whether due to implementation oversight or cost reduction measures. At issue is not the installer, but rather the quality of communication between the implementation team and the operations team.
In the course of operations, technicians had to shut off utility power while performing routine maintenance on electrical switchgear, which meant the building was running on engine-generator sets. When the engine-generator sets started to surge due to a clogged fuel line, the UPS automatically switched the facility to battery power, and the day tanks for the engine-generator sets began to run dry. If quick-thinking operators had not discovered the fuel pump issue in time, the entire facility would have experienced an outage: a cascade of events leading down a rapid pathway from simple routine maintenance activity to complete system failure.
Example B
Tier IV Fault Tolerant data center criteria require the ability to detect and isolate a fault while maintaining capacity to handle critical load. In this example, a Tier IV enterprise data center shared space with corporate offices in the same building, with a single chilled water plant used to cool both sides of the building. The office air handling units also brought in outside air to reduce cooling costs.
One night, the site experienced particularly cold temperatures, and the control system did not switch from outside air to chilled water for office building cooling, which affected data center cooling as well. The freeze stat (a temperature sensing device that monitors a heat exchanger to prevent its coils from freezing) failed to trip; thus the temperature continued to drop and the cooling coil froze and burst, leaking chilled water onto the floor of the data center. A limited leak detection system was in place and connected, but it had not yet been fully tested. Chilled water continued to leak until pressure dropped, and the chilled water machines then began to go offline in response. Once the chilled water machines went offline, neither the office building nor the data center had active cooling.
At this point, despite the extreme outside cold, temperatures in the data hall rose through the night. As a result of the elevated indoor temperatures, the facility experienced myriad device-level failures (e.g., servers, disc drives, and fans) over the following several weeks. Though a critical shutdown was avoided, damage to components and systems—and the cost of cleanup, replacement parts, and labor—were significant. One single initiating factor—a cold night—combined with other elements in a cascade of failures.
In both of these cases, severe disaster was averted, but relying on front-line operators to save the situation is neither robust nor reliable.
PREVENTING FAILURES IN THE DATA CENTER
Organizations that adhere to the principles of Concurrent Maintainability and/or Fault Tolerance, as outlined in Tier Standard: Topology, take a vital first step toward reducing the risk of a data center failure or outage.
However, facility infrastructure is only one component of failure prevention; how a facility is run and operated on a day-to-day basis is equally critical. As Dr. Cook noted, humans have a dual role in complex systems as both the potential producers (causes) of failure as well as, simultaneously, some of the best defenders against failure [Element 9].
The fingerprints of human error can be seen in both data center examples. In Example A, the electrical panel was not set up as originally designed; in Example B, the leak detection system, which could have alerted operators to the problem, had not been fully activated.
Dr. Cook also points out that human operators are the most adaptable component of complex systems [Element 12], as they “actively adapt the system to maximize production and minimize accidents.” For example, operators may “restructure the system to reduce exposure of vulnerable parts,” reorganize critical resources to focus on areas of high demand, provide “pathways for retreat or recovery,” and “establish means for early detection of changed system performance in order to allow graceful cutbacks in production or other means of increasing resiliency.” Given the highly dynamic nature of complex system environments, this human-driven adaptability is key.
STANDARDIZATION CAN ADDRESS MANAGEMENT SHORTFALLS
In most of the notable failures in recent decades, there was a breakdown or circumvention of established standards and certifications. It was not a lack of standards, but a lack of compliance or sloppiness that contributed the most to the disastrous outcomes. For example, in the case of the Boeing batteries, the causes were bad design, poor quality inspections, and lack of contractor oversight. In the case of the Exxon Valdez, inoperable navigation systems and inadequate crew manpower and oversight, along with insufficient disaster preparedness, were critical factors. If leadership, operators, and oversight agencies had adhered to their own policies and requirements and had not cut corners for economics or expediency, these disasters might have been avoided.
Ongoing operating and management practices and adherence to recognized standards and requirements, therefore, must be the focus of long-term risk mitigation. In fact, Dr. Cook states that “failure-free operations are the result of activities of people who work to keep the system within the boundaries of tolerable performance….human practitioner adaptations to changing conditions actually create safety from moment to moment” [Element 17]. This emphasis on human activities as decisive in preventing failures dovetails with Uptime Institute's advocacy of operational excellence as set forth in the Tier Standard: Operational Sustainability. This was the data center industry's first standard, developed by and for data centers, to address the management shortfalls that could unwind the most advanced, complex, and intelligent of solutions. Uptime Institute was compelled by its findings that the vast majority of data center incidents could be attributed to operations, despite advancements in technology, monitoring, and automation.
The Operational Sustainability criteria pinpoint the elements that impact long-term data center performance, encompassing site management and operating behaviors, and documentation and mitigation of site-specific risks. The detailed criteria include personnel qualifications and training and policies and procedures that support operating teams in effectively preventing failures and responding appropriately when small failures occur to avoid having them cascade into large critical failures. As Dr. Cook states, “Failure-free operations require experience with failure” [Element 18]. We have the opportunity to learn from the experience of other industries, and, more importantly, from the data center industry's own experience, as collected and analyzed in Uptime Institute's Abnormal Incident Reports database. Uptime Institute has captured and catalogued the lessons learned from more than 5,000 errors and incidents over the last 20 years and used that research knowledgebase to help develop an authoritative set of benchmarks. It has ratified these with leading industry experts and gained the consensus of global stakeholders from each sector of the industry. Uptime Institute's Tier Certifications and Management & Operations (M&O) Stamp of Approval provide the most definitive guidelines for and verification of effective risk mitigation and operations management.
Dr. Cook explains, “More robust system performance is likely to arise in systems where operators can discern the edge of the envelope. It also depends on calibrating how their actions move system performance towards or away from the edge of the envelope” [Element 18]. Uptime Institute's deep subject matter expertise, long experience, and evidence-based standards can help data center operators identify and stay on the right side of that edge. Organizations like CenturyLink are recognizing the value of applying a consistent set of standards to ensure operational excellence and minimize the risk of failure in the complex systems represented by their data center portfolio (see the sidebar CenturyLink and the M&O Stamp of Approval).
CONCLUSION
Complex systems fail in complex ways, a reality exacerbated by the business need to operate complex systems on the very edge of failure. The highly dynamic environments of building and operating an airplane, ship, or oil rig share many traits with running a high availability data center. The risk tolerance for a data center is similarly very low, and data centers are susceptible to the heroics and missteps of many disciplines. The coalescing element is management, which ensures that front-line operators are equipped with the hands, tools, parts, and processes they need, and with the unbiased oversight and certifications to identify risks and drive continuous improvement against the continuous exposure to complex failure.
REFERENCES
ASME (American Society of Mechanical Engineers). 2011. Initiative to Address Complex Systems Failure: Prevention and Mitigation of Consequences. Report prepared by Nexight Group for ASME (June). Silver Spring MD: Nexight Group. http://nexightgroup.com/wp-content/uploads/2013/02/initiative-to-address-complex-systems-failure.pdf
Bassett, Vicki. 1998. “Causes and effects of the rapid sinking of the Titanic,” working paper. Department of Mechanical Engineering, the University of Wisconsin. http://writing.engr.vt.edu/uer/bassett.html#authorinfo.
BBC News. 2015. “Safety worries lead US airline to ban battery shipments.” March 3, 2015. http://www.bbc.com/news/technology-31709198
Brown, Christopher and Matthew Mescal. 2014. View From the Field. Webinar presented by Uptime Institute, May 29, 2014. https://uptimeinstitute.com/research-publications/asset/webinar-recording-view-from-the-field
Cook, Richard I. 1998. “How Complex Systems Fail (Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety).” Chicago, IL: Cognitive Technologies Laboratory, University of Chicago. Copyright 1998, 1999, 2000 by R.I. Cook, MD, for CtL. Revision D (00.04.21), http://web.mit.edu/2.75/resources/rando/How%20Complex%20Systems%20Fail.pdf
Dobson, Ian, Benjamin A. Carreras, Vickie E. Lynch and David E. Newman. 2009. “Complex systems analysis of a series of blackouts: Cascading failure, critical points, and self-organization.” Chaos: An Interdisciplinary Journal of Nonlinear Science 17: 026103 (published by the American Institute of Physics).
Dueñas-Osorio, Leonard and Srivishnu Mohan Vemuru. 2009. Abstract for “Cascading failures in complex infrastructure systems.” Structural Safety 31 (2): 157-167.
Funk, McKenzie. 2014. “The Wreck of the Kulluk.” New York Times Magazine December 30, 2014. http://www.nytimes.com/2015/01/04/magazine/the-wreck-of-the-kulluk.html?_r=0
Gallagher, Sean. 2014. “NTSB blames bad battery design - and bad management - in Boeing 787 fires.” Ars Technica, December 2, 2014. http://arstechnica.com/information-technology/2014/12/ntsb-blames-bad-battery-design-and-bad-management-in-boeing-787-fires/
Glass, Robert, Walt Beyeler, Kevin Stamber, Laura Glass, Randall LaViolette, Stephen Contrad, Nancy Brodsky, Theresa Brown, Andy Scholand, and Mark Ehlen. 2005. Simulation and Analysis of Cascading Failure in Critical Infrastructure. Presentation (annotated version), Los Alamos National Laboratory, National Infrastructure Simulation and Analysis Center (Department of Homeland Security), and Sandia National Laboratories, July 2005. New Mexico: Sandia National Laboratories. http://www.sandia.gov/CasosEngineering/docs/Glass_annotatedpresentation.pdf
Kirby, R. Lee. 2012. “Reliability Centered Maintenance: A New Approach.” Mission Critical, June 12, 2012. http://www.missioncriticalmagazine.com/articles/84992-reliability-centered-maintenance–a-new-approach
Klesner, Keith. 2015. “Avoiding Data Center Construction Problems.” The Uptime Institute Journal 5 (Spring 2015): 6-12. https://journal.uptimeinstitute.com/avoiding-data-center-construction-problems/
Lipsitz, Lewis A. 2012. “Understanding Health Care as a Complex System: The Foundation for Unintended Consequences.” Journal of the American Medical Association 308 (3): 243-244. http://jama.jamanetwork.com/article.aspx?articleid=1217248
Lavelle, Marianne. 2014. “Coast Guard blames Shell risk taking in the wreck of the Kulluk.” National Geographic, April 4, 2014. http://news.nationalgeographic.com/news/energy/2014/04/140404-coast-guard-blames-shell-in-kulluk-rig-accident/
“Exxon Valdez Oil Spill.” New York Times. On NYTimes.com, last updated August 3, 2010. http://topics.nytimes.com/top/reference/timestopics/subjects/e/exxon_valdez_oil_spill_1989/index.html
NTSB (National Transportation Safety Board). 2014. “Auxiliary Power Unit Battery Fire Japan Airlines Boeing 787-8, JA829J.” Aircraft Incident Report released 11/21/14. Washington, DC: National Transportation Safety Board. http://www.ntsb.gov/Pages/..%5Cinvestigations%5CAccidentReports%5CPages%5CAIR1401.aspx
Palast, Greg. 1999. “Ten Years After But Who Was to Blame?” for Observer/Guardian UK, March 20, 1999. http://www.gregpalast.com/ten-years-after-but-who-was-to-blame/
Pederson, Brian. 2014. “Complex systems and critical missions - today's data center.” Lehigh Valley Business, November 14, 2014. http://www.lvb.com/article/20141114/CANUDIGIT/141119895/complex-systems-and-critical-missions–todays-data-center
Plsek, Paul. 2003. Complexity and the Adoption of Innovation in Healthcare. Presentation, Accelerating Quality Improvement in Health Care Strategies to Speed the Diffusion of Evidence-Based Innovations, conference in Washington, DC, January 27-28, 2003. Roswell, GA: Paul E Plsek & Associates, Inc. http://www.nihcm.org/pdf/Plsek.pdf
Reason, J. 2000. “Human Error: Models and Management.” British Medical Journal 320 (7237): 768-770. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1117770/
Reuters. 2014. “Design flaws led to lithium-ion battery fires in Boeing 787: U.S. NTSB.” December 2, 2014. http://www.reuters.com/article/2014/12/02/us-boeing-787-batteryidUSKCN0JF35G20141202
Wikipedia, s.v. “Cascading Failure,” last modified April 12, 2015. https://en.wikipedia.org/wiki/Cascading_failure
Wikipedia, s.v. “The Sinking of The Titanic,” last modified July 21, 2015. https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic
Wikipedia, s.v. “SOLAS Convention,” last modified June 21, 2015. https://en.wikipedia.org/wiki/SOLAS_Convention
John Maclean, author of numerous books analyzing deadly wildland fires, including Fire on the Mountain (Morrow 1999), suggests rebranding high reliability organizations, a concept fundamental to firefighting crews, the military, and the commercial airline industry, as high risk organizations. A high reliability organization can only fail, like a goalkeeper, because flawless performance is expected of it. A high risk organization is tasked with averting or minimizing impact and may gauge success in a non-binary fashion. It is a recurring theme in Mr. Maclean's forensic analyses of deadly fires that front-line operators, including those who perished, carry the blame for the outcome, while management shortfalls are far less exposed.
CENTURYLINK AND THE M&O STAMP OF APPROVAL
The IT industry has a growing awareness of the importance of management-people-process issues. That's why Uptime Institute's Management & Operations (M&O) Stamp of Approval focuses on assessing and evaluating both operations activities and management as equally critical to ensuring data center reliability and performance. The M&O Stamp can be applied to a single data center facility or administered across an entire portfolio to ensure consistency.
Recognizing the necessity of making a commitment to excellence at all levels of an organization, CenturyLink is the first service provider to embrace the M&O assessment for all of its data centers. It has contracted Uptime Institute to assess 57 data center facilities across a global portfolio. This decision shows the company is willing to hold itself to a uniform set of high standards and operate with transparency. The company has committed to achieving M&O Stamp of Approval standards and certification across the board, protecting its vital networks and assets from failure and downtime and providing its customers with assurance.
Julian Kudritzki
Julian Kudritzki joined Uptime Institute in 2004 and currently serves as Chief Operating Officer. He is responsible for the global proliferation of Uptime Institute standards. He has supported the founding of Uptime Institute offices in numerous regions, including Brasil, Russia, and North Asia. He has collaborated on the development of numerous Uptime Institute publications, education programs, and unique initiatives such as Server Roundup and FORCSS. He is based in Seattle, WA.
Anne Corning
Anne Corning is a technical and business writer with more than 20 years experience in the high tech, healthcare, and engineering fields. She earned her B.A. from the University of Chicago and her M.B.A. from the University of Washington’s Foster School of Business. She has provided marketing, research, and writing for organizations such as Microsoft, Skanska USA Mission Critical, McKesson, Jetstream Software, Hitachi Consulting, Seattle Children’s Hospital Center for Clinical Research, Adaptive Energy, Thinking Machines Corporation (now part of Oracle), BlueCross BlueShield of Massachusetts, and the University of Washington Institute for Translational Health Sciences. She has been a part of several successful entrepreneurial ventures and is a Six Sigma Green Belt.
Allocating IT costs to internal customers improves accountability, cuts waste
By Scott Killian
You’ve heard the complaints many times before: IT costs too much. I have no idea what I’m paying for. I can’t accurately budget for IT costs. I can do better getting IT services myself.
The problem is that end-user departments and organizations can sometimes see IT operations as just a black box. In recent years, IT chargeback systems have attracted more interest as a way to address all those concerns and rising energy use and costs. In fact, IT chargeback can be a cornerstone of practical, enterprise-wide efficiency efforts.
IT chargeback is a method of charging internal consumers (e.g., departments, functional units) for the IT services they use. Instead of bundling all IT costs under the IT department, a chargeback program allocates the various costs of delivering IT (e.g., services, hardware, software, maintenance) to the business units that consume them.
Many organizations already use some form of IT chargeback, but many others don't, instead treating IT as corporate overhead. Resistance to IT chargeback often comes from the perception that it requires too much effort. It's true that time, administrative cost, and organizational maturity are needed to implement chargeback.
However, the increased adoption of private and public cloud computing is causing organizations to re-evaluate and reconsider IT chargeback methods. Cloud computing has led some enterprises to ask their IT organizations to explain their internal costs. Cloud options can shave a substantial amount from IT budgets, which pressures IT organizations to improve cost modeling to either fend off or justify a cloud transition. In some cases, IT is now viewed as more of a commodity—with market competition. In these circumstances, accountability and efficiency improvements can bring significant cost savings that make chargeback a more attractive path.
CHARGEBACK vs. UNATTRIBUTED ACCOUNTING
In traditional IT accounting, all costs are centralized: one central department pays for all IT equipment and activities, typically out of the CTO or CIO's budget, and these costs are treated as corporate overhead shared evenly by multiple departments. In an IT chargeback accounting model, individual cost centers are charged for their IT services based on use and activity. As a result, all IT costs are “zeroed out” because they have all been assigned to user groups. IT is no longer considered overhead; instead, it can be viewed as part of each department's business and operating expenses (OpEx).
With the adoption of IT chargeback, an organization can expect to see significant shifts in awareness, culture, and accountability, including:
• Increased transparency due to accurate allocation of IT costs and usage. Chargeback allows consumers to see their costs and understand how those costs are determined.
• Improved IT financial management, as groups become more aware of the cost of their IT usage and business choices. With chargeback, consumers become more interested and invested in the costs of delivering IT as a service.
• Increased awareness of how IT contributes to the business of the organization. IT is not just overhead but is seen as providing real business value.
• Responsibility for controlling IT costs shifts to business units, which become accountable for their own use.
• Alignment of IT operations and expenditures with the business. IT is no longer just an island of overhead costs but becomes integrated into business planning, strategy, and operations.
The benefits of an IT chargeback model include simplified IT investment decision making, reduced resource consumption, improved relationships between business units and IT, and a greater perception of IT value. Holding departments accountable leads them to modify their behaviors and improve efficiency. For example, chargeback tends to reduce overall resource consumption as business units stop hoarding surplus servers or other resources to avoid the cost of maintaining these underutilized assets. At the same time, organizations experience increased internal customer satisfaction as IT and the business units become more closely aligned and begin working together to analyze and improve efficiency.
Perhaps most importantly, IT chargeback drives cost control. As users become aware of the direct costs of their activities, they become more willing to improve their utilization, optimize their software and activities, and analyze cost data to make better spending decisions. This can extend the life of existing resources and infrastructure, defer resource upgrades, and identify underutilized resources that can be deployed more efficiently. Just as we have seen in organizations that adopt a server decommissioning program (such as the successful initiatives of Uptime Institute’s Server Roundup) (https://uptimeinstitute.com/training-events/server-roundup), IT chargeback identifies underutilized assets that can be reassigned or decommissioned. As a result, more space and power becomes available to other equipment and services, thus extending the life of existing infrastructure. An organization doesn’t have to build new infrastructure if it can get more from current equipment and systems.
IT chargeback also allows organizations to make fully informed decisions about outsourcing. Chargeback provides useful metrics that can be compared against cloud providers and other outsource IT options. As IT organizations are being driven to emulate cloud provider services, a chargeback applies free-market principles to IT (with appropriate governance and controls). The IT group becomes more akin to a service provider, tracking and reporting the same metrics on a more apples-to-apples basis.
Showback is closely related to chargeback and offers many of the same advantages without some of the drawbacks. This strategy employs the same approach as chargeback, with tracking and cost-center allocation of IT expenses. Showback measures and displays the IT cost breakdown by consumer unit just as chargeback does, but without actually transferring costs back. Costs remain in the IT group, but information is still transparent about consumer utilization. Showback can be easy to implement since there is no immediate budgetary impact on user groups.
The premise behind showback and chargeback is the same: awareness drives accountability. However, since business units know they will not be charged in a showback system, their attention to efficiency and improving utilization may not be as focused. Many organizations have found that starting with a showback approach for an initial 3-6 months is an effective way to introduce chargeback, testing the methodology and metrics and allowing consumer groups to get used to the approach before full implementation of chargeback accountability.
The stakeholders affected by chargeback/showback include:
• Consumers: Business units that consume IT resources, e.g., organizational entities, departments, applications, and end users.
• Internal service providers: Groups responsible for providing IT services, e.g., data center teams, network teams, and storage.
• Project sponsor: The group funding the effort and ultimately responsible for its success. Often this is someone under the CTO, but it can also be a finance/accounting leader.
• Executive team: The C-suite individuals responsible for setting chargeback as an organizational priority and ensuring enterprise-wide participation to bring it to fruition.
• Administrator: The group responsible for operating the chargeback program (e.g., IT finance and accounting).
CHARGEBACK METHODS
A range of approaches have been developed for implementing chargeback in an organization, as summarized in Figure 1. The degree of complexity, degree of difficulty, and cost to implement decrease from the top of the chart [service-based pricing (SBP)] to the bottom [high-level allocation (HLA)]. HLA is the simplest method; it uses a straight division of IT costs based on a generic metric such as headcount. Slightly more effort to implement is low-level allocation (LLA), which bases consumer costs on something more closely related to IT activity, such as the number of users or servers. Direct cost (DC) more closely resembles a time and materials charge but is often tied to headcount as well.
Figure 1. Methods for chargeback allocation.
Measured resource usage (MRU) focuses on the amount of actual resource usage by each department, using metrics such as power (in kilowatts), network bandwidth, and terabytes of storage. Tiered flat rate (TFR), negotiated flat rate (NFR), and service-based pricing (SBP) are all increasingly sophisticated applications of measuring actual usage by service.
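To make the difference between the simpler methods concrete, the following minimal sketch compares high-level allocation (HLA), which splits total IT cost by a generic metric such as headcount, with low-level allocation (LLA), which uses a metric closer to IT activity, such as server count. The departments and figures are hypothetical.

# Minimal sketch: HLA vs. LLA cost allocation with hypothetical figures.
def allocate(total_cost, weights):
    """Split total_cost across keys in proportion to their weights."""
    total_weight = sum(weights.values())
    return {k: total_cost * w / total_weight for k, w in weights.items()}

if __name__ == "__main__":
    total_it_cost = 1_200_000  # US$ per year, example figure only

    headcount = {"sales": 400, "engineering": 150, "finance": 50}  # HLA metric
    servers = {"sales": 40, "engineering": 180, "finance": 20}     # LLA metric

    print("HLA (by headcount):", allocate(total_it_cost, headcount))
    print("LLA (by servers):  ", allocate(total_it_cost, servers))

Note how the same total cost lands very differently on each department depending on the allocation metric chosen, which is why method selection deserves executive attention.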
THE CHARGEBACK SWEET SPOT
Measured resource usage (MRU) is often the sweet spot for chargeback implementation. It makes use of readily available data that are likely already known or collected. For example, data center teams typically measure power consumption at the server level, and storage groups know how many terabytes are being used by different users/departments. MRU is a straight allocation of IT costs, thus it is fairly intuitive for consumer organizations to accept. It is not quite as simple as other methods to implement but does provide fairness and is easily controllable.
MRU treats IT services as a utility, consumed and reserved based on key activities:
• Data center = power
• Network = bandwidth
• Storage = bytes
• Cloud = virtual machines or other metric
• Network Operations Center = ticket count or total time to resolve per customer
PREPARING FOR CHARGEBACK IMPLEMENTATION
If an organization is to successfully implement chargeback, it must choose the method that best fits its objectives and apply the method with rigor and consistency. Executive buy-in is critical. Without top-down leadership, chargeback initiatives often fail to take hold. It is human nature to resist accountability and extra effort, so leadership is needed to ensure that chargeback becomes an integral part of the business operations.
To start, it's important that an organization know its infrastructure capital expense (CapEx) and OpEx costs. Measuring, tracking, reporting, and questioning these costs, and acting on the information to base investment and operating decisions on real costs, are critical to becoming an efficient IT organization. To understand CapEx costs, organizations should consider the following:
• Facility construction or acquisition
• Power and cooling infrastructure equipment: new, replacement, or upgrades
• IT hardware: server, network, and storage hardware
• Software licenses, including operating system and application software
• Racks, cables: initial costs (i.e., items installed in the initial set up of the data room)
OpEx incorporates all the ongoing costs of running an IT facility. These costs are ultimately larger than CapEx in the long run and include:
• FTE/payroll
• Utility expenses
• Critical facility maintenance (e.g., critical power and cooling, fire and life safety, fuel systems)
• Housekeeping and grounds (e.g., cleaning, landscaping, snow removal)
• Disposal/recycling
• Lease expenses
• Hardware maintenance
• Other facility fees such as insurance, legal, and accounting fees
• General building maintenance (e.g., office area, roof, plumbing)
• Network expenses (in some circumstances)
The first three items (FTE/payroll, utilities, and critical facility maintenance) typically make up the largest portion of these costs. For example, utilities can account for a significant portion of the IT budget. If IT is operated in a colocation environment, the biggest costs could be lease expenses. The charges from a colocation provider typically will include all the other costs, often negotiated. For enterprise-owned data centers, all these OpEx categories can fluctuate monthly depending on activities, seasonality, maintenance schedules, etc. Organizations can still budget and plan for OpEx effectively, but it takes an awareness of fluctuations and expense patterns.
At a fundamental level, the goal is to identify resource consumption by consumer, for example the actual kilowatts per department. More sophisticated resource metrics might include the cost of hardware installation (moves, adds, changes) or the cost per maintenance ticket. For example, in the healthcare industry, applications for managing patient medical data are typically large and energy intensive. If 50% of a facility’s servers are used for managing patient medical data, the company could determine the kilowatt per server and multiply total OpEx by the percentage of total IT critical power used for this activity as a way to allocate costs. If 50% of its servers are only using 30% of the total IT critical load, then it could use 30% to determine the allocation of data center operating costs. The closer the data can get to representing actual IT usage, the better.
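A minimal sketch of that healthcare example, using hypothetical figures: the patient-records application owns 50% of the servers but draws only 30% of total IT critical power, so the measured 30% share, not the server count, drives the MRU allocation of operating costs.

# Minimal sketch of MRU allocation for the healthcare example above.
# Server counts, kilowatts, and the OpEx figure are illustrative assumptions.
def power_share(app_kw, total_it_kw):
    """Fraction of total IT critical load drawn by one application."""
    return app_kw / total_it_kw

if __name__ == "__main__":
    total_servers, app_servers = 1000, 500  # 50% of the server fleet
    total_it_kw, app_kw = 800.0, 240.0      # but only 30% of the critical load
    annual_opex = 5_000_000                 # US$, example figure only

    server_share = app_servers / total_servers
    kw_share = power_share(app_kw, total_it_kw)
    print(f"Server-count share: {server_share:.0%}")
    print(f"Measured power share: {kw_share:.0%}")
    print(f"MRU allocation of OpEx: US${annual_opex * kw_share:,.0f}")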
An organization that can compile this type of data for about 95% of its IT costs will usually find it sufficient for implementing a very effective chargeback program. It isn't necessary for every dollar to be accounted for. Expense allocations will be closely proportional to the kilowatts and/or bandwidth consumed and reserved by each user organization. Excess resources typically are absorbed proportionally by all. Even IT staff costs can be allocated by tracking and charging their activity to different customers using timesheets, or by headcount where staff is dedicated to specific customers.
Another step in preparing an organization to adopt an IT chargeback methodology is defining service levels. The key is setting expectations appropriately so that end users, just like customers, know what they are getting for what they are paying. Defining uptime expectations (e.g., a Tier level such as Tier III Concurrent Maintainability or Tier IV Fault Tolerant infrastructure, or other uptime and/or downtime requirements, if any) and outlining a detailed service catalog are important.
IT CHARGEBACK DRIVES EFFICIENT IT
Adopting an IT chargeback model may sound daunting, and doing so does take some organizational commitment and resources, but the results are worthwhile. Organizations that have implemented IT chargeback have experienced reductions in resource consumption due to increased customer accountability, and higher, more efficient utilization of hardware, space, power, and cooling due to reduction in servers. IT chargeback brings a new, enterprise-wide focus on lowering data center infrastructure costs with diverse teams working together from the same transparent data to achieve common goals, now possible because everyone has “skin in the game.”
Essentially, achieving efficient IT outcomes demands a “follow the money” mindset. IT chargeback drives a holistic approach in which optimizing data center and IT resource consumption becomes the norm. A chargeback model also helps to propel organizational maturity, as it drives the need for more automation and integrated monitoring, for example the use of a DCIM system. To collect data and track resources and key performance indicators manually is too tedious and time consuming, so stakeholders have an incentive to improve automated tracking, which ultimately improves overall business performance and effectiveness.
IT chargeback is more than just an accounting methodology; it helps drive the process of optimizing business operations and efficiency, improving competitiveness and adding real value to support the enterprise mission.
IT CHARGEBACK DOs AND DON’Ts
On 19 May 2015, Uptime Institute gathered a group of senior stakeholders for the Executive Assembly for Efficient IT. The group was composed of leaders from large financial, healthcare, retail, and web-scale IT organizations, and the purpose of the meeting was to share experiences, success stories, and challenges related to improving IT efficiency.
At Uptime Institute’s 2015 Symposium, executives from leading data center organizations that have implemented IT chargeback discussed the positive results they had achieved. They also shared the following recommendations for companies considering adopting an IT chargeback methodology.
DO:
• Partner with the Finance department. Finance has to completely buy in to implementing chargeback.
• Inventory assets and determine who is using them. A complete inventory of the number of data centers, number of servers, etc., is needed to develop a clear picture of what is being used.
• Chargeback needs strong senior-level support; it will not succeed as a bottom-up initiative. Similarly, don't try to implement it from the field. Insist that C-suite representatives (COO/CFO) visit the data center so the C-suite understands the concept and requirements.
• Focus on cash management as the goal, not finance issues (e.g., depreciation) or IT equipment (e.g., server models and UPS equipment specifications). Know the audience, and get everyone on the same page talking about straight dollars and cents.
• Don’t give teams too much budget; ratchet it down. Force departments to make trade-offs so they begin making smarter decisions.
• Build a dedicated team to develop the chargeback model. Then show people the steps and help them understand the decision process.
• Data is critical: show all the data, including data from the configuration management database (CMDB), in monthly discussions.
• Be transparent to build credibility. For example, explain clearly, “Here’s where we are and here’s where we are trying to get to.”
• Above all, communicate. People will need time to get used to the idea.
DON’TS:
• Don’t try to drive chargeback from the bottom up.
• Simpler is better: don’t overcomplicate the model. Simplify the rules and prioritize; don’t get hung up perfecting every detail because it doesn’t save much money. Approximations can be sufficient.
• Don’t move too quickly: start with showback. Test it out first; then, move to chargeback.
• To get a real return, get rid of the old hardware. Move quickly to remove old hardware when new items are purchased. The efficiency gains are worth it.
• The most challenging roadblocks can turn out to be the business units themselves. Organizational changes might need to extend to the second level within a business unit if it has functions and layers beneath it that should be charged separately.
Scott Killian
Scott Killian joined the Uptime Institute in 2014 and currently serves as VP for Efficient IT Program. He surveys the industry for current practices and develops new products to facilitate industry adoption of best practices. Mr. Killian directly delivers consulting at the site management, reporting, and governance levels. He is based in Virginia.
Prior to joining Uptime Institute, Mr. Killian led AOL’s holistic resource consumption initiative, which resulted in AOL winning two Uptime Institute Server Roundups for decommissioning more than 18,000 servers and reducing operating expenses more than US$6 million. In addition, AOL received three awards in the Green Enterprise IT (GEIT) program. AOL accomplished all this in the context of a five-year plan developed by Mr. Killian to optimize data center resources, which saved US$17 million annually.
NEXTDC deploys new isolated parallel diesel rotary uninterruptible power supply systems and other innovative technologies
By Jeffrey Van Zetten
NEXTDC’s colocation data centers in Australia incorporate innovation in engineering design, equipment selection, commissioning, testing, and operation. This quality-first philosophy saw NEXTDC become one of 15 organizations globally to win a 2015 Brill Award for Efficient IT. NEXTDC’s B1 Brisbane, M1 Melbourne, S1 Sydney, C1 Canberra, and P1 Perth data centers (see Figures 1-3) have a combined 40-megawatt (MW) IT load (see Figure 4).
Figure 1. Exterior of NEXTDC’s 11.5-MW S1 Sydney data center, which is Uptime Institute Tier III Design Documents and Constructed Facility Certified
Figure 2. Exterior of NEXTDC’s 5.5-MW P1 Perth data center, which is Uptime Institute Tier III Design Documents and Constructed Facility Certified
Figure 3. Exterior of NEXTDC’s 12-MW M1 Melbourne data center, which is Uptime Institute Tier III Design Documents Certified
Figure 4. NEXTDC’s current footprint: 40+ MW IT capacity distributed across Australia
In order to accomplish the business goals NEXTDC established, it set the following priorities:
1. High reliability, so that clients can trust NEXTDC facilities with their mission critical IT equipment
2. The most energy-efficient design possible, especially where it can also assist reliability and total cost of ownership, but not to the detriment of high reliability
3. Efficient total cost of ownership and day one CapEx by utilizing innovative design and technology
4. Capital efficiency and scalability to allow flexible growth and cash flow according to demand
5. Speed to market, as NEXTDC was committed to build and open five facilities within just a few years using a small team across the entire 5,000-kilometer-wide continent, to be the first truly national carrier-neutral colocation provider in the Australian market
6. Flexible design suitable for the different regions and climates of Australia, ranging from subtropical to near desert.
NEXTDC put key engineering design decisions for these facilities through rigorous engineering decision matrices that weighed and scored the risks, reliability, efficiency, maintenance, total cost of ownership, day one CapEx, full final day CapEx, scalability, vendor local after-sales support, delivery, and references. The company extensively examined all the possible alternative designs to obtain accurate costing and modeling. NEXTDC Engineering and management worked closely to ensure the design would be in accord with the driving brief and the mandate on which the company is based.
NEXTDC also carefully selected the technology providers, manufacturers, and contractors for its projects. This scrutiny was critical, as the quality of local support in Australia can vary from city to city and region to region. NEXTDC paid as much attention to the track record of after-sales service teams as to the initial service or technology.
“Many companies promote innovative technologies; however, we were particularly interested in the after-sales support and the track record of the people who would support the technology,” said Mr. Van Zetten. “We needed to know if they were a stable and reliable team and had in-built resilience and reliability, not only in their equipment, but in their personnel.” NEXTDC’s Perth and Sydney data centers successfully achieved Uptime Institute Tier III Certification of Design Documents (TCDD) and Tier III Certification of Constructed Facilities (TCCF) using Piller’s isolated parallel (IP) diesel rotary uninterruptible power supply (DRUPS) system. A thorough and exhaustive engineering analysis was performed on all available electrical system design options and manufacturers, including static uninterruptible power supply (UPS) designs with distributed redundant and block redundant distribution, along with more innovative options such as the IP DRUPS solution. Final scale and capacity were key design inputs for making the final decision; indeed, for smaller scale data centers a more traditional static UPS design is still favored by NEXTDC. For facilities larger than 5 MW, the IP DRUPS allows NEXTDC to:
• Eliminate batteries, which fail after 5 to 7 years, causing downtime and loss of redundancy, and can cause hydrogen explosions
• Eliminate the risks of switching procedures, as human error causes most failures
• Maintain power to both A & B supplies without switching, even if more than one engine-generator set or UPS is out of service
• Eliminate problematic static switches.
NEXTDC benefits because:
• If a transformer fails, only the related DRUPS engine generator needs to start. The other units in parallel can all remain on mains [editor’s note: incoming utility] power.
• Electrically decoupled cabinet rotary UPS are easier to maintain, providing less downtime and more long-term reliability, which reduces the total cost of ownership.
• The N+1 IP DRUPS maintain higher loaded UPS/engine generators to reduce risk of cylinder glazing/damage at low and growing loads.
• Four levels of independently witnessed, loaded integrated systems testing were applied to verify the performance.
• The IP topology shares the +1 UPS capacity across the facility and enables fewer UPS to run at higher loads for better efficiency.
• The rotary UPSs utilize magnetic-bearing helium-gas enclosures for low friction and optimal efficiency.
• The IP allows scalable installation of engine generators and UPS.
For example, the 11.5-MW S1 Sydney data center is based on 12+1 1,336-kilowatt (kW) continuous-rated Piller DRUPS with 12+1 IP power distribution boards. The facility includes sectionalized fire-segregated IP and main switchboard rooms. This design ensures that a failure of any one DRUPS, IP, or main switchboard does not cause a data center failure. The return ring IP bus is also fire segregated.
Figure 5. Scalable IP overall concept design
Differential protection also provides a level of Fault Tolerance. Because the design is scalable, NEXTDC is now increasing the system to a 14+1 DRUPS and IP design to increase the final design capacity from 11.5 to 13.8 MW of IT load to meet rapid growth. All NEXTDC stakeholders, especially those with significant operational experience, were particularly keen to eliminate the risks associated with batteries, static switches, and complex facilities management switching procedures. The IP solution successfully eliminated these risks with additional benefits for CapEx and OpEx efficiency. From a CapEx perspective, the design allows a common set of N+1 DRUPS units to be deployed based on actual IT load for the entire facility (see Figure 5). From an OpEx perspective, the advantage is that the design always operates in an N+1 configuration across the entire facility to match actual IT load, so the load is maintained at a higher percentage and thus at efficiencies approaching 98%, whereas lightly loaded UPSs in a distributed redundant design, for example, can often have actual efficiencies of less than 95%. Operating engine-generator sets at higher loads also reduces the risks of engine cylinder glazing and damage, further reducing risks and maintenance costs (see Figure 6).
Figure 6. Distribution of the +1 IP DRUPS across more units provides higher load and thus efficiency
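The efficiency argument above comes down to per-unit loading. The short sketch below illustrates it with simple arithmetic; the DRUPS unit rating and the 12+1 full build are taken from the article, while the interim build-out stages are assumptions for illustration.

```python
# Illustrative arithmetic only (unit rating and the 12+1 full build from the
# article; interim stage counts are hypothetical). Because the IP bus lets the
# whole facility share a single +1 unit, capacity can be added in step with IT
# load and every online DRUPS stays comparatively heavily loaded, which
# underpins the ~98% efficiency cited versus lightly loaded static UPS modules.

UNIT_KW = 1336    # continuous rating of one Piller DRUPS (per the article)

stages = [        # (IT load in kW, DRUPS units installed) - interim stages assumed
    (4_000, 5),   # hypothetical early stage: 4 MW on 4+1 units
    (8_000, 8),   # hypothetical mid stage:   8 MW on 7+1 units
    (11_500, 13), # full S1 design load on the 12+1 units described in the article
]

for load_kw, units in stages:
    loading = load_kw / (units * UNIT_KW)
    print(f"{load_kw/1000:>5.1f} MW on {units} units -> ~{loading:.0%} load per DRUPS")
```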
NEXTDC repeated integrated systems tests four times. Testing, commissioning, and tuning are the keys to delivering a successful project. Each set of tests—by the subcontractors, NEXTDC Engineering, independent commissioning agent engineers, and those required for Uptime Institute TCCF—helped to identify potential improvements, which were immediately implemented (see Figure 7). In particular, the TCCF review identified some improvements that NEXTDC could make to Piller’s software logic so that the control became truly distributed, redundant, and Concurrently Maintainable. This improvement ensured that even the complete failure of any panel in the entire system would not cause loss of N IP and main switchboards, even if the number of DRUPS is fewer than the number of IP main switchboards installed. This change improves CapEx efficiency without adding risks. The few skeptics we had regarding Uptime Institute Tier Certification became believers once they saw the professionalism, thoroughness, and helpfulness of Uptime Institute professionals on site.
Figure 7. Concurrently maintainable electrical design utilized in NEXTDC’s facilities
From an operational perspective, NEXTDC found that eliminating static switches and complex switching procedures for the facilities managers also reduced risk and delivered optimal uptime in reality.
MECHANICAL SYSTEMS
The mechanical system designs and equipment were also run through equally rigorous engineering decision matrices, which assessed the overall concept designs and supported the careful selection of individual valves, testing, and commissioning equipment. For example, the final design of the S1 facility includes 5+1 2,700-kW (cooling), high-efficiency, water-cooled, oil-free magnetic-bearing, multi-compressor chillers in an N+1 configuration; the facility received Uptime Institute Tier III Design Documents and Constructed Facility Certifications. The chillers are supplemented by both water-side and air-side free-cooling economization with Cold Aisle containment and electronically commutated (EC) variable-speed CRAC fans. Primary/secondary pump configurations are utilized, although a degree of primary variable flow control is implemented for significant additional energy savings. Furthermore, NEXTDC implemented extreme oversight on testing and commissioning and continues to work with the facilities management teams to carefully tune and optimize the systems. This reduces not only energy use but also wear on critical equipment, extending equipment life, reducing maintenance, and increasing long-term reliability. The entire mechanical plant is supported by the IP DRUPS for continuously available compressor cooling even in the event of a mains power outage. This eliminates the risks associated with buffer tanks and chiller/compressor restarts that occur in most conventional static-UPS-supported data centers, a common cause of facility outages.
The central cooling plant achieved its overall goals because of the following additional key design decisions:
• Turbocor magnetic oil-free bearing, low-friction compressors developed in Australia provide both reliability and efficiency (see Figure 8).
• Multi-compressor chillers provide redundancy within the chillers and improved part-load operation.
• Single compressors can be replaced while the chiller keeps running.
• N+1 chillers are utilized to increase thermal transfer area for better part-load coefficient of performance (COP) and Fault Tolerance, as the +1 chiller is already on-line and operating should one chiller fail.
• Variable-speed drive, magnetic-bearing, super-low-friction chillers provide leading COPs.
• Variable number of compressors can optimize COPs.
• Seasonal chilled water temperature reset enables even higher COPs and greater free economization in winter.
• Every CRAC is fitted with innovative pressure-independent self-balancing characterized control valves (PICCV) to ensure no part of the system is starved of chilled water during scalable dynamic staged expansions, and to ensure minimal flow per IT power to minimize pumping energy.
• Variable speed drives (VSDs) are installed on all pumps for less wear and reduced failure.
• 100% testing, tuning, commissioning, and independent witnessing of all circuits, and minimization of pump ∆P for reduced wear.
• High ∆T and return water temperatures optimize water-side free cooling.
• High ∆T optimizes seasonal water reset free-cooling economization.
The cooling systems utilize evaporative cooling, which takes advantage of Australia’s climate, with return water precooling heat exchangers that remove load from the chiller compressors for more efficient overall plant performance. The implementation of the water-side and air-side free economization systems is a key to the design.
Very early smoke detection apparatus (VESDA) air-quality automatic damper shutdown is designed and tested along the facility’s entire façade. Practical live witness testing was performed with smoke bombs directed at the façade intakes, using a crane to simulate the worst possible bushfire event with a sudden change of wind direction, to ensure that false discharges of the gas suppression system could be mitigated.
The free-cooling economization systems provide the following benefits to reliability and efficiency (see Figures 9-12):
• Two additional cooling methods provide backup to the chillers for most of the year.
• Reduced running time on chillers and pumps extends their life and reduces failures and maintenance.
• The design is flexible enough to use either water-side or air-side economization depending on geographic location and outside air quality.
• Actual results have demonstrated reduced total cooling plant energy use.
• Reduced loads on chillers provide even better chiller COPs at partial loads.
• Reduced pump energy is achieved when air-side economization is utilized.
Figure 9. (Top Left) Water-side free-cooling economization design principle. Note: not all equipment shown for simplicity
Figure 12. Air-side free-cooling economization actual results
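As a rough illustration of how higher return water temperatures widen the free-cooling window, the sketch below checks water-side economizer availability against outdoor wet-bulb conditions; the tower and heat exchanger approach temperatures are assumptions for illustration, not NEXTDC design values.

```python
# Minimal sketch of a water-side economizer availability check (assumed
# approach temperatures; not NEXTDC's control logic). Free cooling is usable
# when the outdoor wet-bulb temperature plus the cooling tower and heat
# exchanger approaches is below the required chilled water return setpoint.

TOWER_APPROACH_C = 4.0      # assumed cooling tower approach
HX_APPROACH_C = 1.5         # assumed plate heat exchanger approach

def economizer_available(wet_bulb_c, chw_return_setpoint_c):
    achievable_water_temp = wet_bulb_c + TOWER_APPROACH_C + HX_APPROACH_C
    return achievable_water_temp < chw_return_setpoint_c

# A higher return water temperature (high delta-T) widens the window:
for setpoint in (18.0, 24.0):
    hours = sum(economizer_available(wb, setpoint) for wb in range(-5, 30))
    print(f"Return setpoint {setpoint}°C -> usable in {hours} of 35 sample wet-bulb bins")
```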
WHITE SPACE
NEXTDC’s design consultants specified raised floor in the first two data rooms in the company’s M1 Melbourne facility (the company’s first builds) as a means of supplying cold air to the IT gear. A Hot Aisle containment system prevents intermixing and returns hot air to the CRACs via chimneys and an overhead return plenum. This system minimizes fan speeds, reducing wear and maintenance. Containment also makes it simpler to run the correct number of redundant fans, which provides a high level of redundancy and, due to fan laws, reduces fan wear and maintenance. At NEXTDC, containment means higher return air temperatures, which enables more air-side economization and energy efficiency, supported by an innovative, in-house floor grille management tool that minimizes fan energy according to IT load (see Figure 13). For all later builds, however, NEXTDC chose Cold Aisle containment to eliminate the labor costs and time needed to build the overhead plenum and chimneys required for Hot Aisle containment, which shortened payback and improved return on investment. NEXTDC now specifies Cold Aisle containment in all its data centers.
Figure 13. Cold Aisle containment is mandated design for all new NEXTDC facilities to minimize CRAC fan energy
The common-sense implementation of containment has proved worthwhile and enabled genuine energy savings. Operational experience suggested, however, that containment alone captures only part of the total possible savings. To capture more, NEXTDC Engineering developed a program that uses the actual contracted loads and data from PDU branch circuit monitoring to automatically calculate the ideal floor grille balance for each rack. This intelligent system tuning saved an additional 60% of NEXTDC’s CRAC fan power by increasing air-side ∆T and reducing airflow rates (see Figure 14).
Figure 14. Innovative floor grille tuning methods applied in conjunction with containment yielded significant savings
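A simplified sketch of the kind of calculation such a tool might perform is shown below; it is not NEXTDC’s program, and the rack loads and grille capacity are hypothetical. It converts measured rack power into a target airflow using the standard sensible-heat relationship for air and then sets each grille opening proportionally.

```python
# Simplified per-rack airflow calculation (hypothetical data; not NEXTDC's tool).
# Measured rack power from PDU branch circuit monitoring sets a target airflow;
# grille dampers are then opened in proportion, so total CRAC airflow (and fan
# energy) tracks the real IT load.

SENSIBLE_HEAT_FACTOR = 1.08   # BTU/hr = 1.08 x CFM x dT(°F), standard air-side constant
BTU_PER_KW = 3412

def target_cfm(rack_kw, delta_t_f=20.0):
    """Airflow needed to remove rack_kw of heat at the design air-side delta-T."""
    return rack_kw * BTU_PER_KW / (SENSIBLE_HEAT_FACTOR * delta_t_f)

racks = {"A01": 6.2, "A02": 3.8, "A03": 9.5}          # measured kW per rack (hypothetical)
grille_full_open_cfm = 900                             # assumed capacity of one floor grille

for rack, kw in racks.items():
    cfm = target_cfm(kw)
    print(f"{rack}: {cfm:.0f} CFM -> grille {min(cfm / grille_full_open_cfm, 1.0):.0%} open")
# Raising the achievable delta-T lowers the required CFM, which is where the
# article's ~60% CRAC fan energy saving comes from (fan power falls with flow).
```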
NEXTDC also learned not to expect mechanical subcontractors to have long-term operational expenses and energy bills as their primary concern. NEXTDC installed pressure/temperature test points across all strainers and equipment and specified that all strainers had to be cleaned prior to commissioning. During the second round of tests, NEXTDC Engineering found that secondary pump differential pressures and energy were almost double what they theoretically should be. Using its own testing instruments, NEXTDC Engineering determined that some critical strainers on the index circuit had in fact been dirtied with construction debris—the contractors had simply increased the system differential pressure setting to deliver the correct flow rates and specified capacity. After cleaning the relevant equipment, the secondary pump energy was almost halved. NEXTDC would have paid the price for the next 20 years had these thorough checks not been performed.
Similarly, the primary pumps and plant needed to be appropriately tuned and balanced based on actual load, as the subcontractors had a tendency to set up equipment to ensure capacity but not minimal energy consumption. IT loads are very stable, so it is possible to adjust the primary flow rates and still maintain N+1 redundancy, thanks to pump laws, with massive savings on pump energy. The system was designed with pressure-independent self-balancing control valves and testing and commissioning sets to ensure scalable, efficient flow distribution and high water-side ∆Ts to enable optimal use of water-side free-cooling economization. The challenge then was personally witnessing all flow tests to ensure that the subcontractors had correctly adjusted the equipment. Another lesson learned was that a single flushing bypass left open by a contractor can seriously reduce the return water temperature and prevent the water-side economization system from operating entirely if it is not tracked down and resolved during commissioning. Hunting down all such incorrect bypasses helped to increase the return water temperature by almost 11ºF (6ºC) for a massive improvement in economization.
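The pump-law arithmetic behind those savings is straightforward; the sketch below uses the cube-law approximation with an assumed pump rating for illustration.

```python
# Pump affinity laws, as referenced above: for a fixed system, flow scales with
# speed, head with speed squared, and shaft power with speed cubed. Trimming
# primary flow to match the actual (stable) IT load therefore yields outsized
# pump energy savings. Figures below are illustrative, not site data.

def pump_power(rated_power_kw, flow_fraction):
    """Approximate shaft power at a reduced flow, via the cube law."""
    return rated_power_kw * flow_fraction ** 3

rated_kw = 45.0            # hypothetical primary pump rating
for flow in (1.0, 0.8, 0.6):
    print(f"{flow:.0%} flow -> ~{pump_power(rated_kw, flow):.1f} kW")
# 80% flow needs only ~51% of the power; 60% flow needs ~22%.
```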
Figure 15. Energy saving trends – actual typical results achieved for implementation
Operational tuning through the first year, with the Engineering and facilities management teams comparing actual trends to the theoretical design model, provided savings exceeding even NEXTDC’s optimistic hopes. Creating clear and simple procedures with the facilities management teams and running carefully overseen, trended trials was critical before rolling out these initiatives nationally. Every tuning initiative implemented nationally after a facility’s go-live date is trended, recorded, and collated into a master national energy savings register. Figure 15 provides just a few examples. Tuning has so far yielded a 24% reduction in power consumption for the mechanical plant while retaining conservative safety factors. Over time, with additional trend data and machine learning, power consumption is expected to improve considerably further through continuous improvement. NEXTDC expects an additional 10–20% saving and is on target to operate Australia’s first National Australian Built Environment Rating System (NABERS) 5-star-rated mega data centers.
The design philosophy didn’t end with the electrical and mechanical cooling systems, but also applied to the hydraulics and fire protection systems:
• Rainwater collection is implemented on site to supply cooling towers, which provides additional hours of water most of the year.
• The water tanks are scalable.
• Rainwater collection minimizes mains water consumption.
• VESDA laser optical early detection developed in Australia and licensed internationally was utilized.
• The design completely eliminated water-based sprinkler systems from within the critical IT equipment data halls, instead utilizing IG55 inert-gas suppression, so that IT equipment can continue to run even if a single server has an issue (see Figure 16). Water-based pre-action sprinklers risk catastrophic damage to IT equipment that is not suffering from an overheating or fire event, causing unnecessary client IT outages.
• The gas suppression system is facility staff friendly, unlike alternatives that dangerously deplete oxygen levels in the data hall.
• The design incorporates a fully standby set of gas suppression bottle banks onsite.
• The gas suppression bottle banks are scalable.
• The IG55 advanced gas suppression is considered one of the world’s most environmentally friendly gas suppression systems.
Figure 16. Rainwater, environmentally friendly inert gas fire-suppression and solar power generation innovations
The design of NEXTDC’s data centers is one of the fundamental reasons IT, industrial, and professional services companies are choosing NEXTDC as a colocation data center partner for the region. This has resulted in very rapid top- and bottom-line financial growth, leading to profitability and commercial success in just a few years. NEXTDC was named Australia’s fastest-growing communications and technology company at Deloitte Australia’s 2014 Technology Fast 50 awards. Mr. Van Zetten said, “What we often found was that when innovation was primarily sought to provide improved resilience and reliability, it also provided improved energy efficiency, better total cost of ownership, and CapEx efficiency. The IP power distribution system is a great example of this. Innovations that were primarily sought for energy efficiency and total cost of ownership likewise often provided higher reliability. The water-side and air-side free-cooling economization are great examples. Not only do they reduce our power costs, they also provide legitimate alternative cooling redundancy for much of the year and reduce wear and maintenance on chillers, which improves overall reliability for the long term. Cold Aisle containment, which was primarily sought to reduce fan energy, eliminates client problems associated with air mixing and bypassing, thus providing improved client IT reliability.”
Jeffrey Van Zetten
Jeffrey Van Zetten has been involved with NEXTDC since it was founded in 2010 as Australia’s first national data center company. He is now responsible for the overall design, commissioning, Uptime Institute Tier III Certification process, on-going performance, and energy tuning for the B1 Brisbane, M1 Melbourne, S1 Sydney, C1 Canberra, and P1 Perth data centers. Prior to joining NEXTDC, Mr. Van Zetten was based in Singapore as the Asia Pacific technical director for a leading high-performance, buildings technology company. While based in Singapore, he was also the lead engineer for a number of successful energy-efficient high tech and mega projects across Asia Pacific, such as the multi-billion dollar Marina Bay Sands. Mr. Van Zetten has experience in on-site commissioning and troubleshooting data center and major projects throughout Asia, Australia, Europe, North America, and South America.
Arc Flash Mitigation in the Data Center
Meeting OSHA and NFPA 70E arc flash safety requirements while balancing prevention and production demands
By Ed Rafter
Uninterruptible uptime, 24 x 7, zero downtime…these are some of the terms that characterize data center business goals for IT clients. Given these demands, facility managers and technicians in the industry are skilled at managing the infrastructure that supports these goals, including essential electrical and mechanical systems that are paramount to maintaining the availability of business-critical systems.
Electrical accidents such as arc flash occur all too often in facility environments that have high-energy use requirements, a multitude of high-voltage electrical systems and components, and frequent maintenance and equipment installation activities. A series of stringent standards with limited published exceptions govern work on these systems and associated equipment. The U.S. Occupational Safety and Health Administration (OSHA) and National Fire Protection Association (NFPA) Standard 70E set safety and operating requirements to prevent arc flash and electric shock accidents in the workplace. Many other countries have similar regulatory requirements for electrical safety in the workplace.
When these accidents occur they can derail operations and cause serious harm to workers and equipment. Costs to businesses can include lost work time, downtime, OSHA investigation, fines, medical costs, litigation, lost business, equipment damage, and most tragically, loss of life. According to the Workplace Safety Awareness Council (WPSAC), the average cost of hospitalization for electrical accidents is US$750,000, with many exceeding US$1,000,000.
There are reasonable steps data center operators can—and must—take to ensure the safety of personnel, facilities, and equipment. These steps offer a threefold benefit: the same measures taken to protect personnel also serve to protect infrastructure, and thus protect data center operations.
Across all industries, many accidents are caused by basic mistakes, for example, electrical workers not being properly prepared, working on opened equipment that was not well understood, or magnifying risks through a lack of due diligence. Data center operators, however, are already attuned to the discipline and planning it takes to run and maintain high-availability environments.
While complying with OSHA and NFPA 70E requirements may seem daunting at first, the maintenance and operating standards in place at many data centers enable this industry to effectively meet the challenge of adhering to these mandates. The performance and rigor required to maintain 24 x 7 reliability means the gap between current industry practices and the requirements of these regulatory standards is smaller than it might at first appear, allowing data centers to balance safety with the demands of mission critical production environments.
In this article we describe arc flash and electrical safety issues, provide an overview of the essential measures data centers must follow to meet OSHA and NFPA 70E requirements, and discuss how many of the existing operational practices and adherence to Tier Standards already places many data centers well along the road to compliance.
Figure 1. An arc flash explosion demonstration. Source: Open Electrical
UNDERSTANDING ARC FLASH
Arc flash is a discharge of electrical energy characterized by an explosion that generates light, noise, shockwave, and heat. OSHA defines it as “a phenomenon where a flashover of electric current leaves its intended path and travels through the air from one conductor to another, or to ground (see Figure 1). The results are often violent and when a human is in close proximity to the arc flash, serious injury and even death can occur.” The resulting radiation and shrapnel can cause severe skin burns and eye injuries, and pressure waves can have enough explosive force to propel people and objects across a room and cause lung and hearing damage. OSHA reports that up to 80% of all “qualified” electrical worker injuries and fatalities are not due to shock (electrical current passing through the body) but to external burn injuries caused by the intense radiant heat and energy of an arc fault/arc blast.1
An arc flash results from an arcing electrical fault, which can be caused by dust particles in the air, moisture condensation or corrosion on electrical/mechanical components, material failure, or by human factors such as improper electrical system design, faulty installation, negligent maintenance procedures, dropped tools, or accidentally touching a live electrical circuit. In short, there are numerous opportunities for arc flash to occur in industrial settings, especially those in which there is inconsistency or a lack of adherence to rigorous maintenance, training, and operating procedures.
Variables that affect the power of an arc flash are amperage, voltage, the distance of the arc gap, closure time, three-phase vs. single-phase circuit, and being in a confined space. The power of an arc at the flash location, the distance a worker is from the arc, and the duration of their exposure to the arc will all affect the extent of skin damage. The WPSAC reports that fatal burns can occur even at distances greater than 10 feet (ft) from an arc location; in fact, serious injury and fatalities can occur up to 20 ft away. The majority of hospital admissions for electrical accidents are due to arc flash burns, with 30,000 arc incidents and 7,000 people suffering burn injuries per year, 2,000 of those requiring admission to burn centers with severe arc flash burns.2
The severity of an arc flash incident is determined by several factors, including temperature, the available fault current, and the time for a circuit to break. The total clearing time of the overcurrent protective device during a fault is not necessarily linear, as lower fault currents can sometimes result in a breaker or fuse taking longer to clear, thus extending the arc duration and thereby raising the arc flash energy.
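A small example makes the point. The sketch below uses one common inverse-time relay curve (the IEC 60255 standard-inverse shape) with assumed settings; actual device curves will differ, but the trend is the same: a lower-current arcing fault can take far longer to clear.

```python
# Illustration of why lower fault current can mean a longer clearing time and
# therefore more arc energy. Uses the IEC 60255 "standard inverse" curve shape
# as an example; actual device curves and settings will differ.

def clearing_time_s(fault_current_a, pickup_a, time_multiplier=0.3):
    """IEC standard-inverse approximation: t = TMS * 0.14 / ((I/Is)^0.02 - 1)."""
    ratio = fault_current_a / pickup_a
    return time_multiplier * 0.14 / (ratio ** 0.02 - 1)

pickup = 800  # A, assumed relay pickup setting
for fault in (20_000, 8_000, 3_000):
    t = clearing_time_s(fault, pickup)
    print(f"{fault:>6} A arcing fault -> clears in ~{t:.2f} s")
# The 3 kA fault clears far more slowly than the 20 kA fault, so the arc
# persists longer and the incident energy can be higher despite the lower current.
```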
Unlike the bolted fault (in which high current flows through a solid conductive material typically tripping a circuit breaker or protective device), an arcing fault uses ionized air as a conductor, with current jumping a gap between two conductive objects. The cause of the fault normally burns away during the initial flash, but a highly conductive, intensely hot plasma arc established by the initial arc sustains the event. Arc flash temperatures can easily reach 14,000–16,000°F (7,760–8,870°C) with some projections as high as 35,000°F (19,400°C)—more than three times hotter than the surface of the sun.
These temperatures can be reached by an arc fault event in as little as a few seconds or even a few cycles. The heat generated by the high current flow may melt or vaporize the conductive material and create an arc characterized by a brilliant flash, intense heat, and a fast-moving pressure wave that propels the arcing products. The pressure of an arc blast [up to 2,000 pounds/square foot (9765 kilograms/square meter)] is due to the expansion of the metal as it vaporizes and the heating of the air by the arc. This accounts for the expulsion of molten metal up to 10 ft away. Given these extremes of heat and energy, arc flashes often cause fires, which can rapidly spread through a facility.
INDUSTRY STANDARDS AND REGULATIONS
To prevent these kinds of accidents and injuries, it is imperative that data center operators understand and follow appropriate safety standards for working with electrical equipment. Both the NFPA and OSHA have established standards and regulations that help protect workers against electrical hazards and prevent electrical accidents in the workplace.
OSHA is a federal agency (part of the U.S. Department of Labor) that ensures safe and healthy working conditions for Americans by enforcing standards and providing workplace safety training. OSHA 29 CFR Part 1910, Subpart S and OSHA 29 CFR Part 1926, Subpart K include requirements for electrical installation, equipment, safety-related work practices, and maintenance for general industry and construction workplaces, including data centers.
NFPA 70E is a set of detailed standards (issued at the request of OSHA and updated periodically) that address electrical safety in the workplace. It covers safe work practices associated with electrical tasks and safe work practices for performing other non-electrical tasks that may expose an employee to electrical hazards. OSHA revised its electrical standard to reference NFPA 70E-2000 and continues to recognize NFPA 70E today.
The OSHA standard outlines prevention and control measures for hazardous energies including electrical, mechanical, hydraulic, pneumatic, chemical, thermal, and other energy sources. OSHA requires that facilities:
• Provide and be able to demonstrate a safety program with defined responsibilities.
• Calculate the degree of arc flash hazard.
• Use correct personal protective equipment (PPE) for workers.
• Train workers on the hazards of arc flash.
• Use appropriate tools for safe working.
• Provide warning labels on equipment.
NFPA 70E further defines “electrically safe work conditions” to mean that equipment is not and cannot be energized. To ensure these conditions, personnel must identify all power sources, interrupt the load and disconnect power, visually verify that a disconnect has opened the circuit, lock out and tag the circuit, test for absence of voltage, and ground all power conductors, if necessary.
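As a minimal illustration, the steps above can be treated as an ordered checklist that must be completed in full before a circuit is considered electrically safe; the sketch below is illustrative only and is no substitute for the written procedures NFPA 70E requires.

```python
# A minimal sketch encoding the NFPA 70E steps listed above as an ordered
# checklist: an "electrically safe work condition" exists only when every step
# has been completed, in order, and verified.

ESWC_STEPS = [
    "Identify all power sources",
    "Interrupt the load and disconnect power",
    "Visually verify the disconnect has opened the circuit",
    "Lock out and tag the circuit",
    "Test for absence of voltage",
    "Ground all power conductors where required",
]

def electrically_safe(completed_steps):
    """True only if the completed steps match the required sequence exactly."""
    return list(completed_steps) == ESWC_STEPS

print(electrically_safe(ESWC_STEPS[:4]))   # False: voltage test and grounding not done
print(electrically_safe(ESWC_STEPS))       # True
```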
LOCKOUT/TAGOUT
Most data center technicians will be familiar with lockout and tagging procedures for disabling machinery or equipment. A single qualified individual should be responsible for de-energizing one set of conditions (unqualified personnel should never perform lockout/tagout, work on energized equipment, or enter high risk areas). An appropriate lockout or tagout device should be affixed to the de-energized equipment identifying the responsible individual (see Figure 2).
Figure 2. Equipment lockout/tagout
OVERVIEW: WORKING ON ENERGIZED EQUIPMENT
As the WPSAC states, “the most effective and foolproof way to eliminate the risk of electrical shock or arc flash is to simply de-energize the equipment.” However, both NFPA 70E and OSHA clarify that working “hot” (on live, energized systems) is allowed within the set safety limits on voltage exposures, work zone boundary requirements, and other measures to take in these instances. Required safety elements include personnel qualifications, hazard analysis, protective boundaries, and the use of PPE by workers.
Only qualified persons should work on electrical conductors or circuit parts that have not been put into an electrically safe work condition. A qualified person is one who has received training in and possesses skills and knowledge in the construction and operation of electric equipment and installations and the hazards involved with this type of work. Knowledge or training should encompass the skill to distinguish exposed live parts from other parts of electric equipment, determine the nominal voltage of exposed live parts, and calculate the necessary clearance distances and the corresponding voltages to which a worker will be exposed.
An arc flash hazard analysis for any work must be conducted to determine the appropriate arc flash boundary, the incident energy at the working distance, and the necessary protective equipment for the task. Arc flash energy is measured in thermal units of calories per square centimeter (cal/cm2), and the result of the analysis is referred to as the incident energy of the circuit. Incident energy is both radiant and convective. It is inversely proportional to the square of the working distance and directly proportional to the duration of the arc and to the available bolted fault current. Time has a greater effect on intensity than the available bolted fault current.
The incident energy and flash protection boundary are both calculated in an arc flash hazard analysis. There are two calculation methods, one outlined in NFPA 70E-2012 Annex D and the other in Institute of Electrical and Electronics Engineers (IEEE) Standard 1584.
In practice, to calculate the arc flash (incident energy) at a location, the amount of fault current and the amount of time it takes for the upstream device to trip must be known. A data center should model the distribution system into a software program such as SKM Power System Analysis, calculate the short circuit fault current levels and use the protective device settings feeding switchboards, panelboards, industrial control panels, and motor control centers to determine the incident energy level.
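The proportionalities above can be illustrated with a toy calculation. The sketch below shows relative scaling only and is not the NFPA 70E Annex D or IEEE 1584 method; a real study must use those equations with actual fault currents and device clearing times.

```python
# Illustrative scaling only, based on the proportionalities stated above
# (energy ~ fault current x time / distance^2). A real study must use the
# NFPA 70E Annex D or IEEE 1584 equations and actual protective device settings.

def relative_incident_energy(fault_ka, clearing_time_s, working_distance_in, k=1.0):
    """Relative incident energy in arbitrary units (k is a placeholder constant)."""
    return k * fault_ka * clearing_time_s / working_distance_in ** 2

base = relative_incident_energy(25, 0.1, 18)     # 25 kA cleared in 6 cycles at 18 in.
slow = relative_incident_energy(25, 0.5, 18)     # same fault, 0.5 s clearing time
far  = relative_incident_energy(25, 0.1, 36)     # same fault, twice the working distance

print(f"slower clearing: {slow / base:.1f}x the energy")   # 5.0x
print(f"double distance: {far / base:.2f}x the energy")    # 0.25x
```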
BOUNDARIES
NFPA has defined several protection boundaries: Limited Approach, Restricted, and Prohibited. The intent of NFPA 70E regarding arc flash is to provide guidelines that will limit injury to the onset of second degree burns. Where these boundaries are drawn for any specific task is based on the employee’s level of training, the use of PPE, and the voltage of the energized equipment (see Figure 3).
Figure 3. Protection boundaries. Source: Open Electrical
The closer a worker approaches an exposed, energized conductor or circuit part the greater the chance of inadvertent contact and the more severe the injury that an arc flash is likely to cause that person. When an energized conductor is exposed, the worker may not approach closer than the flash boundary without wearing appropriate personal protective clothing and PPE.
IEEE defines Flash Protection Boundary as “an approach limit at a distance from live parts operating at 50 V or more that are un-insulated or exposed within which a person could receive a second degree burn.” NFPA defines approach boundaries and workspaces as shown in Figure 4. See the sidebar Protection Boundary Definitions.
Figure 4. PPE: typical arc flash suit. Source: Open Electrical
Calculating the specific boundaries for any given piece of machinery, equipment, or electrical component can be done using a variety of methods, including referencing NFPA tables (easiest to do but the least accurate) or using established formulas, an approach calculator tool (provided by IEEE), or one of the software packages available for this purpose.
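As a rough sanity check on boundary distances, the simplified inverse-square relationship described earlier can be used to scale a study’s incident energy result out to the 1.2 cal/cm2 threshold; the example below assumes a hypothetical study value and is not a substitute for the NFPA 70E tables or IEEE 1584 calculations.

```python
# Hedged illustration: given an incident energy already calculated at a known
# working distance (from a proper study), estimate the flash protection
# boundary as the distance where exposure falls to 1.2 cal/cm2, using the
# simplified inverse-square relationship described earlier. IEEE 1584 uses
# empirically fitted distance exponents, so treat this only as a sanity check.

import math

ONSET_2ND_DEGREE_BURN = 1.2   # cal/cm2, per the sidebar definition

def flash_boundary_in(incident_energy_cal_cm2, working_distance_in):
    return working_distance_in * math.sqrt(incident_energy_cal_cm2 / ONSET_2ND_DEGREE_BURN)

# Example: a study reports 8 cal/cm2 at an 18-inch working distance.
print(f"~{flash_boundary_in(8.0, 18):.0f} inches")   # roughly 46 in.
```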
PROTECTIVE EQUIPMENT
NFPA 70E outlines strict standards for the type of PPE required for any employees working in areas where electrical hazards are present based on the task, the parts of the body that need protection, and the suitable arc rating to match the potential flash exposure. PPE includes items such as a flash suit, switching coat, mask, hood, gloves, and leather protectors. Flame-resistant clothing underneath the PPE gear is also required.
After an arc flash hazard analysis has been performed, the correct PPE can be selected according to the equipment’s arc thermal performance exposure value (ATPV) and the break open threshold energy rating (EBT). Together, these components determine the calculated hazard level that any piece of equipment is capable of protecting a worker from (measured in calories per square centimeter). For example, a hard hat with an attached face shield provides adequate protection for Hazard/Risk Category 2, whereas an arc flash protection hood is needed for a worker exposed to Hazard/Risk Category 4.
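The selection logic can be sketched as a simple lookup against the arc-rating thresholds commonly associated with the NFPA 70E hazard/risk categories (4, 8, 25, and 40 cal/cm2); the example below is illustrative, and the governing values remain the site’s own hazard analysis and the PPE’s tested ATPV/EBT ratings.

```python
# A sketch of PPE selection by calculated incident energy, using the arc-rating
# thresholds commonly associated with NFPA 70E hazard/risk categories 1-4
# (4, 8, 25, and 40 cal/cm2). The governing value is always the analysis result
# and the PPE's tested arc rating (ATPV/EBT), not this simplified lookup.

PPE_CATEGORIES = [          # (max incident energy in cal/cm2, category)
    (4.0, 1),
    (8.0, 2),
    (25.0, 3),
    (40.0, 4),
]

def ppe_category(incident_energy_cal_cm2):
    for max_energy, category in PPE_CATEGORIES:
        if incident_energy_cal_cm2 <= max_energy:
            return category
    return "de-energize or re-engineer: exposure exceeds category 4 PPE"

print(ppe_category(6.5))    # 2 -> e.g., hard hat with arc-rated face shield
print(ppe_category(32.0))   # 4 -> arc flash suit and hood
print(ppe_category(55.0))   # beyond category 4
```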
PPE is the last line of defense in an arc flash incident; it’s not intended to prevent all injuries, but to mitigate the impact of a flash, should one occur. In many cases, the use of PPE has saved lives or prevented serious injury.
OTHER SAFETY MEASURES
Additional safety-related practices for working on energized systems could include conducting a pre-work job briefing, using insulated tools, having a written safety program, applying flash hazard labeling (labels should indicate the flash hazard boundaries for a piece of equipment and the PPE needed to work within those boundaries), and completing an Energized Electrical Work Permit. According to NFPA, an Energized Electrical Work Permit is required for a task when live parts over 50 volts are involved. The permit outlines conditions and work practices needed to protect employees from arc flash or contact with live parts, and includes the following information:
• Circuit, equipment, and location
• Reason for working while energized
• Shock and arc flash hazard analysis
• Safe work practices
• Approach boundaries
• Required PPE and tools
• Access control
• Proof of job briefing.
DECIDING WHEN TO WORK HOT
NFPA 70E and OSHA require employers to prove that working in a de-energized state creates more or worse hazards than the risk presented by working on live components or is not practical because of equipment design or operational limitations, for example, when working on circuits that are part of a continuous process that cannot be completely shut down. Other exceptions include situations in which isolating and deactivating system components would create a hazard for people not associated with the work, for example, when working on life-support systems, emergency alarm systems, ventilation equipment for hazardous locations, or extinguishing illumination for an area.
In addition, OSHA makes provision for situations in which it would be “infeasible” to shut down equipment; for example, some maintenance and testing operations can only be done on live electric circuits or equipment. The decision to work hot should only be made after careful analysis of what constitutes infeasibility. In recent years, some well-publicized OSHA actions and statements have centered on how to interpret this term.
ELECTRICAL SAFETY MEASURES IN PRACTICE
Many operational and maintenance practices will help minimize the potential for arc flash, reduce the incident energy or arcing time, or move the worker away from the energy source. In fact, many of these practices are consistent with the rigorous operational and maintenance processes and procedures of a mission-critical data center.
Although the electrical industry is aware of the risks of arc flash, according to the National Institute for Occupational Safety and Health, the biggest worksite personnel hazard is still electrical shock in all but the construction and utility industries. In his presentation at an IEEE-Industry Applications Society (IAS) workshop, Ken Mastrullo of the NFPA compared OSHA 1910 Subpart S citations against accidents and fatalities between 1 Oct. 2003 and 30 Sept. 2004. Installations accounted for 80% of the citations, while safe work practice issues were cited 20% of the time. However, installations accounted for only 9% of the accidents, while safe work practice issues accounted for 91% of all electrical-related accidents. In other words, while the majority of OSHA citations were for installation issues, the majority of injuries stemmed from work practice issues.
Can OSHA cite you as a company that does not comply with NFPA 70E? The simple answer is: Yes. If employees are involved in a serious electrical incident, OSHA likely will present the employer/owner with several citations. In fact, OSHA assessed more than 2,880 fines between 2007–2011 for sites not meeting Regulation 1910.132(d), averaging 1.5 fines a day.
On the other hand, an OSHA inspection may actually help uncover issues. A May 2012 study of 800 California companies found that those receiving an inspection saw a decline of 9.4% in injuries. On average, these companies saved US$350,000 over the five years following the OSHA inspections,3 an outcome far preferable to being fined for noncompliance or experiencing an electrical accident. Beyond the matter of fines, however, any organization that wishes to effectively avoid putting its personnel in danger—and endangering infrastructure and operations—should endeavor to follow NFPA 70E guidelines (or their regional equivalent).
REDUCING ARC FLASH HAZARDS IN THE FACILITY
While personnel-oriented safety measures are the most important (and mandated) steps to reduce the risk of arc flash accidents, there are numerous equipment and component elements that can be incorporated into facility systems to further reduce the risk. These include metal-clad switchgear, arc-resistant switchgear, current-limiting power circuit breakers, and current-limiting reactors. Setting up zone-selective interlocking of circuit breakers can also be an effective prevention measure.
TIER STANDARDS & DATA CENTER PRACTICES ALIGN WITH ARC FAULT PREVENTION
Data centers are already ahead of many industries in conforming to many provisions of OSHA and NFPA 70E. Many electrical accidents are caused by issues such as dust in the environment, improper equipment installation, and human factors. To maintain the performance and reliability demanded by customers, data center operators have adopted a rigorous approach to cleaning, maintenance, installation, training, and other tasks that forestall arc flash. Organizations that subscribe to Tier standards and maintain stringent operational practices are better prepared to take on the challenges of compliance with OSHA and NFPA 70E requirements, in particular the requirements for safely performing work on energized systems, when such work is allowed per the safety standards.
For example, commissioning procedures reduce the risk of improper installation. Periodically load testing engine generators and UPS systems demonstrates that equipment capacity is available and helps identify out-of-tolerance conditions that are indicative of degrading hardware or calibration and alignment issues. Thermographic scanning of equipment, distribution boards, and conduction paths can identify loose or degraded connections before they reach a point of critical failure.
Adherence to rigorous processes and procedures helps avoid operator error, and these procedures also serve as tools in personnel training and refresher classes. Facility and equipment design and capabilities, maintenance programs, and operating procedures are typically well defined and established in a mission critical data center, especially those at a Tier III or Tier IV Certification level.
Beyond the Tier Topology, the operational requirements for every infrastructure classification, as defined in the Tier Standard: Operational Sustainability, include the implementation of processes and procedures for all work activities. Completing comprehensive predictive and preventive maintenance increases reliability, which in turn improves availability. Methods of procedure are generally very detailed and task specific. Maintenance technicians meet stringent qualifications to perform work activities. Training is essential, and planning, practice, and preparation are key to managing an effective data center facility.
This industry focus on rigor and reliability in both systems and operational practices, reinforced by the Tier Standards, will enable data center teams to rapidly adopt and adhere to the practices required for compliance with OSHA and NFPA 70E. What still remains in question is whether or not a data center meets the infeasibility test prescribed by these governing bodies in either the equipment design or operational limitations.
It can be argued that some of today’s data center operations approach the status of being “essential” for much of the underlying infrastructure that runs our 24 x 7 digitized society. Data centers support the functioning of global financial systems, power grids and utilities, air traffic control operations, communication networks, and the information processing that supports vital activities ranging from daily commerce to national security. Each facility must assess its operations and system capabilities to enable adherence to safe electrical work practices as much as possible without jeopardizing critical mission functions. In many cases, the answer for a specific data center may become a jurisdictional decision driven by its business requirements.
No measure will ever completely remove the risk of working on live, energized equipment. In instances where working on live systems is necessary and allowed by NFPA 70E rules, the application of Uptime Institute Tier III and Tier IV criteria can help minimize the risks. Tier III and IV both require the design and installation of systems that enable equipment to be fully de-energized for planned activities such as repair, maintenance, replacement, or upgrade without exposing personnel to the risks of working on energized electrical equipment.
CONCLUSION
Over the last several decades, data centers and the information processing power they provide have become a fundamental necessity in our global, interconnected society. Balancing the need for appropriate electrical safety measures and compliance with the need to maintain and sustain uninterrupted production capacity in an energy-intensive environment is a challenge. But it is a challenge the data center industry is perhaps better prepared to meet than many other industry segments. It is apparent that those in the data center industry who subscribe to high-availability concepts such as the Tier Standards: Topology and Operational Sustainability are in a position to readily meet the requirements of NFPA 70E and OSHA from an execution perspective.
SIDEBAR: PROTECTION BOUNDARY DEFINITIONS
The flash protection boundary is the closest approach allowed by qualified or unqualified persons without the use of PPE. If the flash protection boundary is crossed, PPE must be worn. The boundary is a calculated number based upon several factors such as voltage, available fault current, and time for the protective device to operate and clear the fault. It is defined as the distance at which the worker is exposed to 1.2 cal/cm2 for 0.1 second.
LIMITED APPROACH BOUNDARY
The limited approach boundary is the minimum distance from the energized item where untrained personnel may safely stand. No unqualified (untrained) personnel may approach any closer to the energized item than this boundary. The boundary is determined by NFPA 70E Table 130.4-(1) (2) (3) and is based on the voltage of the equipment (2012 Edition).
RESTRICTED APPROACH BOUNDARY
The restricted approach boundary is the distance where qualified personnel may not cross without wearing appropriate PPE. In addition, they must have a written approved plan for the work that they will perform. This boundary is determined from NFPA Table 130.4-(1) (4) (2012 Edition) and is based on the voltage of the equipment.
PROHIBITED APPROACH BOUNDARY
Only qualified personnel wearing appropriate PPE can cross a prohibited approach boundary. Crossing this boundary is considered the same as contacting the exposed energized part. Therefore, personnel must obtain a risk assessment before the prohibited boundary is crossed. This boundary is determined by NFPA 70E Table 130.4-(1) (5) (2012 Edition) and is based upon the voltage of the equipment.
Ed Rafter
Edward P. Rafter has been a consultant to Uptime Institute Professional Services (ComputerSite Engineering) since 1999 and assumed a full time position with Uptime Institute in 2013 as principal of Education and Training. He currently serves as vice president-Technology. Mr. Rafter is responsible for the daily management and direction of the professional education staff to deliver all Uptime Institute training services. This includes managing the activities of the faculty/staff delivering the Accredited Tier Designer (ATD) and Accredited Tier Specialist (ATS) programs, and any other courses to be developed and delivered by Uptime Institute.
ADDITIONAL RESOURCES
To review the complete NFPA-70E standards as set forth in NFPA 70E: Standard For Electrical Safety In The Workplace, visit www.NFPA.org
For resources to assist with calculating flash protection boundaries, visit:
• http://www.littelfuse.com/arccalc/calc.html
• http://www.pnl.gov/contracts/esh-procedures/forms/sp00e230.xls
• http://www.bussmann.com/arcflash/index.aspx
To determine what PPE is required, the tables in NFPA 70E-2012 provide the simplest methods for determining PPE requirements. They provide instant answers with almost no field data needed. The tables provide limited application and are conservative for most applications (the tables are not intended as a substitution for an arc hazard analysis but only as a guide).
A simplified two-category PPE approach is found in NFPA 70E-2012, Table H-2 of Annex H. This table ensures adequate PPE for electrical workers within facilities with large and diverse electrical systems. Other good resources include:
• Controlling Electrical Hazards. OSHA Publication 3076, (2002). 71 pages. Provides a basic overview of electrical safety on the job, including information on how electricity works, how to protect against electricity, and how OSHA can help.
• Electrical Safety: Safety and Health for Electrical Trades Student Manual, U.S. Department of Health and Human Services (DHHS). National Institute for Occupational Safety and Health (NIOSH), Publication No. 2002-123, (2002, January). This student manual is part of a safety and health curriculum for secondary and post-secondary electrical trades courses. It is designed to engage the learner in recognizing, evaluating, and controlling hazards associated with electrical work.
• Electrocutions Fatality Investigation Reports. National Institute for Occupational Safety and Health (NIOSH) Safety and Health Topic. Provides information regarding hundreds of fatal incidents involving electrocutions investigated by NIOSH and state investigators.
• Working Safely with Electricity. OSHA Fact sheet. Provides safety information on working with generators, power lines, extension cords, and electrical equipment.
• Lockout/Tagout OSHA Fact Sheet, (2002).
• Lockout-Tagout Interactive Training Program. OSHA. Includes selected references for training and interactive case studies.
• NIOSH Arc Flash Awareness, NIOSH Publication No. 2007-116D.
ENDNOTES
1. http://www.arcsafety.com/resources/arc-flash-statistics
2. Common Electrical Hazards in the Workplace including Arc Flash, Workplace Safety Awareness Council (www.wpsac.org), produced under Grant SH-16615-07-60-F-12 from the Occupational Safety and Health Administration, U.S. Department of Labor.
3. “The Business Case For Safety and Health,” U.S. Department of Labor, https://www.osha.gov/dcsp/products/topics/businesscase/
Failure Doesn’t Keep Business Hours: 24×7 Coverage
A statistical justification for 24×7 coverage
By Richard Van Loo
As a result of performing numerous operational assessments at data centers around the world, Uptime Institute has observed that staffing levels at data centers vary greatly from site to site. This observation is discouraging, but not surprising, because while staffing is an important function for data centers attempting to maintain operational excellence, many factors influence an organization’s decision on appropriate staffing levels.
Factors that can affect overall staffing numbers include the complexity of the data center, the level of IT turnover, the number of support activity hours required, the number of vendors contracted to support operations, and business objectives for availability. Cost is also a concern because each staff member represents a direct cost. Because of these numerous factors, data center staffing levels must be constantly reviewed in an attempt to achieve effective data center support at a reasonable cost.
Uptime Institute is often asked, “What is the proper staffing level for my data center?” Unfortunately, there is no quick answer that works for every data center, since proper staffing depends on a number of variables.
The time required to perform maintenance tasks and the need to provide shift coverage support are two basic variables. Staffing for maintenance requirements is relatively fixed, but it is affected by which activities are performed by data center personnel and which are performed by vendors. Shift coverage support is defined as staffing for data center monitoring and rounds and for responding to any incidents or events. Staffing levels to support shift coverage can be provided in a number of different ways, and each method of providing shift coverage has potential impacts on operations depending on how that coverage is focused.
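As a rough illustration of how these two variables interact, the back-of-envelope sketch below (written in Python, with purely hypothetical figures for productive hours per person, annual maintenance workload, and the number of continuously staffed posts) estimates a full-time-equivalent head count. It is an illustrative aid only, not an Uptime Institute staffing formula.

# Back-of-envelope staffing sketch; all inputs are illustrative assumptions.
import math

HOURS_PER_YEAR = 24 * 365            # hours a 24 x 7 post must be covered
PRODUCTIVE_HOURS_PER_FTE = 1800      # assumed, after leave, training, etc.

def maintenance_ftes(task_hours_per_year: float) -> float:
    """FTEs needed just to absorb planned maintenance performed in-house."""
    return task_hours_per_year / PRODUCTIVE_HOURS_PER_FTE

def coverage_ftes(posts: int, people_per_post: int = 1) -> float:
    """FTEs needed to keep each post staffed around the clock."""
    return posts * people_per_post * HOURS_PER_YEAR / PRODUCTIVE_HOURS_PER_FTE

if __name__ == "__main__":
    # Assumed: 3,500 h/yr of in-house maintenance; one electrical and one
    # mechanical post staffed continuously.
    total = maintenance_ftes(3500) + coverage_ftes(posts=2)
    print(f"Indicative requirement: about {math.ceil(total)} FTEs")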
TRENDS IN SHIFT COVERAGE
The primary purpose of having qualified personnel on site is to mitigate the risk of an outage caused by abnormal incidents or events, either by preventing the incident or by containing and isolating it and keeping it from spreading or impacting other systems. Many data centers still support shift presence with a team of qualified electricians, mechanics, and other technicians who provide 24 x 7 coverage. Remote monitoring technology, designs that incorporate redundancy, campus data center environments, the desire to balance costs, and other practices can lead organizations to deploy personnel differently.
Managing shift presence without having qualified personnel on site at all times can elevate risks due to delayed response to abnormal incidents. Ultimately, the acceptable level of risk must be a company decision.
Other shift presence models include:
• Training security personnel to respond to alarms and execute an escalation procedure
• Monitoring the data center through a local or regional building monitoring system (BMS) and having technicians on call
• Having personnel on site during normal business hours and on call during nights and weekends
• Operating multiple data centers as a campus or portfolio so that a team supports multiple data centers without necessarily being on site at each individual data center at a given time
These and other models have to be individually assessed for effectiveness. To assess the effectiveness of any shift presence model, the data center must determine the potential risks of incidents to the operations of the data center and the impact on the business.
For the last 20 years, Uptime Institute has built the Abnormal Incident Reports (AIRs) database using information reported by Uptime Institute Network members. Uptime Institute analyzes the data annually and reports its findings to Network members. The AIRs database provides interesting insights relating to staffing concerns and effective staffing models.
INCIDENTS OCCUR OUTSIDE BUSINESS HOURS
In 2013, a slight majority of incidents (out of 277 total incidents) occurred during normal business hours. However, 44% of incidents happened between midnight and 8:00 a.m., which underscores the potential need for 24 x 7 coverage (see Figure 1).
Figure 1. Approximately half the AIRs reported in 2013 occurred between 8 a.m. and 12 p.m.; the other half occurred between 12 a.m. and 8 a.m.
Similarly, incidents can happen at any time of the year. As a result, focusing shift presence activities toward a certain time of year over others would not be productive. Incident occurrence is pretty evenly spread out over the year.
Figure 2 details the day of the week when incidents occurred. The chart shows that incidents occur on nearly an equal basis every day of the week, which suggests that shift presence requirement levels should be the same every day of the week. To do otherwise would leave shifts with little or no shift presence to mitigate risks. This is an important finding because some data centers focus their shift presence support Monday through Friday and leave weekends to more remote monitoring (see Figure 2).
Figure 2. Data center staff must be ready every day of the week.
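A site can run the same kind of tabulation on its own incident history before deciding to thin overnight or weekend coverage. The minimal sketch below assumes a hypothetical incident_log.csv export with an ISO-formatted occurred_at column; the file name and column are illustrative and are not part of the AIRs database format.

# Group an incident log by hour of day and day of week (illustrative only).
from collections import Counter
import csv
from datetime import datetime

def load_timestamps(path: str, column: str = "occurred_at"):
    """Yield incident timestamps from a CSV export (assumed ISO format)."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield datetime.fromisoformat(row[column])

def tabulate(path: str):
    by_hour, by_weekday = Counter(), Counter()
    for ts in load_timestamps(path):
        by_hour[ts.hour] += 1
        by_weekday[ts.strftime("%A")] += 1
    return by_hour, by_weekday

if __name__ == "__main__":
    hours, weekdays = tabulate("incident_log.csv")  # hypothetical export
    overnight = sum(hours[h] for h in range(0, 8))
    total = sum(hours.values())
    print(f"Incidents between midnight and 08:00: {overnight} of {total}")
    print("Incidents by weekday:", dict(weekdays))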
INCIDENTS BY INDUSTRY
Figure 3 further breaks down the incidents by industry and shows no significant difference in those trends between industries. The chart does show that the financial services industry reported far more incidents than other industries, but that number reflects the makeup of the sample more than anything.
Figure 3. Incidents in data centers take place all year round.
INCIDENT BREAKDOWNS
Knowing when incidents occur does little to say what personnel should be on site. Knowing what kinds of incidents occur most often will help shape the composition of the on-site staff, as will knowing how incidents are most often identified. Figure 4 shows that electrical systems experience the most incidents, followed by mechanical systems. By contrast, critical IT load causes relatively few incidents.
Figure 4. More than half the AIRs reported in 2013 involved the electrical system.
As a result, it would seem to make sense that shift presence teams should have sufficient electrical experience to respond to the most common incidents. The shift presence team must also respond to other types of incidents, but cross training electrical staff in mechanical and building systems might provide sufficient coverage. And, on-call personnel might cover the relatively rare IT-related incidents.
The AIRs database also sheds some light on how incidents are discovered. Figure 5 suggests that over half of all incidents in 2013 were discovered by alarms and more than 40% were discovered by technicians on site, together accounting for about 95% of incidents. The biggest change over the years covered by the chart is a slow growth in the share of incidents discovered by alarm.
Figure 5. Alarms are now the source for most AIRs; however, availability failures are more likely to be found by technicians.
Alarms, however, cannot respond to or mitigate incidents. Uptime Institute has witnessed a number of methods for saving a data center from going down and reducing the impact of an incident. These methods rely on having personnel available to respond to the incident, building redundancy into critical systems, and maintaining strong predictive maintenance programs to forecast potential failures before they occur. Figure 6 breaks down how often each of these methods produced actual saves.
Figure 6. Equipment redundancy was responsible for more saves in 2013 than in previous years.
The chart also appears to suggest that in recent years, equipment redundancy and predictive maintenance are producing more saves and technicians fewer. There are several possible explanations for this finding, including more robust systems, greater use of predictive maintenance, and budget cuts that reduce staffing or move it off site.
FAILURES
The data show that all the availability failures in 2013 were caused by electrical system incidents. A majority of the failures occurred because maintenance procedures were not followed. This finding underscores the importance of having proper procedures and well trained staff, and ensuring that vendors are familiar with the site and procedures.
Figure 7. Almost half the AIRs reported in 2013 were In Service.
Figure 7 further explores the causes of incidents in 2013. Roughly half the incidents were described as “In Service,” which is defined as inadequate maintenance, equipment adjustment, operated to failure, or no root cause found. The incidents attributed to preventive maintenance actually refer to preventive maintenance that was performed improperly. Data center staff caused just 2% of incidents, showing that the interface of personnel and equipment is not a main cause of incidents and outages.
SUMMARY
The increasing sophistication of data center infrastructure management (DCIM), building management systems (BMS), and building automation systems (BAS) raises the question of whether staffing can be reduced at data centers. The advances in these systems are significant and can enhance the operations of your data center; however, as the AIRs data show, mitigation of incidents often requires on-site personnel. This is why it is still a prescriptive behavior for Tier III and Tier IV Operational Sustainability Certified data centers to have qualified full-time equivalent (FTE) personnel on site at all times. The driving purpose is to provide quick response time to mitigate any incidents and events.
The data show no pattern as to when incidents occur; their occurrence is spread fairly evenly across all hours of the day and all days of the week. As data centers continue to evolve, with increased remote access and more redundancy built in, time will tell whether these trends continue on their current path. As with any data center operations program, the fundamental objective is risk avoidance. Each data center is unique, with its own set of inherent risks. Shift presence is just one factor, but an important one: decisions about how many people to staff on each shift, and with what qualifications, can have a major impact on risk avoidance and continued data center availability. Choose wisely.
Rich Van Loo
Rich Van Loo is Vice President, Operations for Uptime Institute. He performs Uptime Institute Professional Services audits and Operational Sustainability Certifications. He also serves as an instructor for the Accredited Tier Specialist course.
Mr. Van Loo’s work in critical facilities includes responsibilities ranging from project manager of a major facility infrastructure service contract for a data center to space planning for the design/construction of several data center modifications and facilities IT support. As a contractor for the Department of Defense, Mr. Van Loo provided planning, design, construction, operation, and maintenance of worldwide mission critical data center facilities. Mr. Van Loo’s 27-year career includes 11 years as a facility engineer and 15 years as a data center manager.
Unipol Takes Space Planning to a New Level
Municipal requirements imposed difficult space considerations for Italian insurance company’s Tier IV data center
By Andrea Ambrosi and Roberto Del Nero
Space planning is often the key to a successful data center project. Organizing a facility into functional blocks is a fundamental way to limit interference between systems, reduce any problem related to power distribution, and simplify project development. However, identifying functional blocks and optimizing space within an existing building can be extremely complex. Converting an office building into a data center can cause further complexity.
This was the challenge facing Maestrale, a consortium of four engineering companies active in the building field, including Ariatta Ingegneria dei Sistemi Srl as the mechanical and electrical engineer, when a major Italian insurance company, UnipolSai Assicurazioni S.p.a (UnipolSai), asked it to design a data center in an office building that had been built at the end of the 1980s in Bologna, Italy. UnipolSai set ambitious performance goals, requiring the electrical and mechanical infrastructure to achieve Uptime Institute Tier IV Certification of Constructed Facility and to be very energy efficient.
In addition, the completed facility, designed and built to meet UnipolSai’s requirements, has the following attributes:
• 1,200 kilowatts (kW) maximum overall IT equipment load
• UPS capacity: not less than 10 minutes
• Four equipment rooms having a total area of 1,400 square meters (m2)
• Cold Aisle/Hot Aisle floor layouts
In an energy-conscious project, all innovative free-cooling technologies must be considered. After a thorough investigation of these free-cooling technologies, Ariatta chose direct free cooling for the UnipolSai project.
MUNICIPAL RESTRICTIONS
The goal of the architectural and structural design of a data center is to accommodate, contain, and protect the mechanical, electrical, and IT equipment. The size, location, and configuration of the mechanical and electrical infrastructure will determine the architecture of the rest of the building. In a pre-existing building, however, this approach does not always apply. More often, builders must work around limitations such as fixed perimeter length and floor height, floors not capable of bearing the weight of expected IT equipment, lack of adjoining external spaces, and other restrictions imposed by municipal regulations. In this project, a series of restrictions and duties imposed by the Municipality of Bologna had a direct impact on technical choices, in particular:
• Any part or any piece of equipment more than 1.8-meters high on the outside yard or on any external surface of the building (e.g., the roof) would be considered added volume, and therefore not allowed.
• Any modification or remodeling activity that changed the shape of the building was considered incompatible with municipal regulations.
• The location was part of a residential area with strict noise limits (noise levels at property lines of 50 decibels [dbA] during the day and 40 dbA at night).
New structural work would also be subject to seismic laws, now in force throughout the country. In addition, UnipolSai’s commitment to Uptime Institute Tier IV Certification required it to find solutions to achieve Continuous Cooling to IT equipment and to Compartmentalize ancillary systems.
The final design incorporates a diesel rotary UPS (DRUPS) in a 2N distribution scheme, a radial double-feed electrical system, and an N+1 mechanical system with a dual-water distribution backbone (Line 1 and Line 2) that enable the UnipolSai facility to meet Uptime Institute Tier IV requirements. Refrigerated water chillers with magnetic levitation bearings and air exchangers inserted in an N+1 hydraulic scheme serve the mechanical systems. The chillers are provided with a double electric service entrance controlled by an automatic transfer switch (ATS). The DRUPS combine UPS and diesel engine-generator functions and do not require the battery systems that are normally part of static UPS systems.
Uptime Institute Tier IV requires Compartmentalization, which necessitates more space; eliminating the batteries saved a great deal of space. In addition, using the DRUPS to feed the chillers ensured that the facility would meet Tier IV requirements for Continuous Cooling with no need for storage tanks, which would have been difficult to place on this site. The DRUPS also completely eliminated cooling requirements in the UPS room because the design ambient temperature would be around 30°C (maximum 40°C). Finally, using the DRUPS greatly simplified the distribution structure, limiting the ATSs on primary electric systems to a minimum.
Municipal restrictions meant that the best option for locating the DRUPS and chillers would require radically transforming some areas inside the building. For example, Ariatta uncovered an office floor and adapted structures and waterproofing to install the chiller plant (see Figures 1 and 2).
Figures 1 and 2. Bird’s eye and ground level views of the chiller plant.
Positioning the DRUPS posed another challenge. In another municipality, its dimensions (12-m length by 4.5-m height), weight, and maintenance requirements would have guided the design team towards a simple solution, such as installing them in containers directly on the floor. However, municipal restrictions for this location (1.8-m limit above street level) required an alternative solution. As a result, geological, geotechnical, and hydrogeological studies of the site of the underground garage showed that:
• Soil conditions met the technical and structural requirements of the DRUPS installation.
• The stratum was lower than the foundations.
• Flood indexes are fixed 30 centimeters above street level (taking zero level as reference).
The garage area was therefore opened and completely modified to hold a watertight tank containing the DRUPS. The tank included a 1.2-m high parapet to prevent flooding and was equipped with redundant water lifting systems fed by the DRUPS (see Figures 3 and 4).
Figures 3 and 4. Particular care was given to protect the DRUPS against water intrusions. Soundproofing was also necessary.
Meeting the city’s acoustic requirements meant soundproofing the DRUPS machines, so the DRUPS systems were double shielded, reducing noise levels to 40 decibels (dbA) at 10 m during normal operation when connected to mains power. Low-noise chillers and high-performance acoustic barriers helped the entire facility meet its acoustical goals.
After identifying technical rooms and allocating space for equipment rooms, Ariatta had to design systems that met UnipolSai’s IT and mechanical and electrical requirements, IT distribution needs, and Uptime Institute Tier IV Compartmentalization requirements.
The floors of the existing building did not always align, especially on lower stories. These changes in elevation were hard to read in plans and sections. To meet this challenge, Starching S.R.L. Studio Architettura & Ingegneria and Redesco Progetti Srl, both part of the Maestrale Consortium, developed a three-dimensional Revit model, which included information about the mechanical and electrical systems. The Revit model helped identify problems caused by the misalignment of the floors and conflicts between systems in the design phase. It also helped communicate new information about the project to contractors during the construction phase (see Figures 5 and 6).
Figures 5 and 6. Revit models helped highlight changes in building elevations that were hard to discern in other media and also aided in communication with contractors.
The use of 3D models is becoming a common way to eliminate interference between systems in final design solutions, with positive effects on the execution of work in general and only a moderate increase in engineering costs.
Figure 7. Fire-rated pipe enclosure
At UnipolSai, Compartmentalizing ancillary systems represented one of the main problems to be resolved to obtain Uptime Institute Tier IV Certification because of restrictions imposed by the existing building. Ariatta engaged in continuous dialogue with Uptime Institute to identify technical solutions. This dialogue, along with studies and functional demonstrations carried out jointly with sector specialists, led to a shared solution in which the two complementary systems that form the technological backbone are compartmentalized with respect to one another (see Figure 7). The enclosures, which basically run parallel to each other, have:
• An external fire-resistant layer (60 minutes, same as the building structure)
• An insulation layer to keep the temperature of the technological systems within design limits for 60 minutes
• A channel that contains and protects against leaks and absorbs shocks
• Dedicated independent mounting brackets
This solution was needed where the architectural characteristics of the building affected the technological backbone (see Figure 8).
Figure 8. The layout of the building limited the potential paths for pipe runs.
ENERGY EFFICIENCY
The choice of direct free cooling was made following an environmental study intended to determine and analyze the time periods when outdoor thermo-hygrometric conditions are favorable to the indoor IT microclimate of the UnipolSai data center, as well as the relevant technical, economic, and energy impact of free cooling on the facility. The next-generation IT equipment used at UnipolSai allows it to modify the environmental parameters used as reference.
Figure 9. UnipolSai sized equipment to meet the requirements of ASHRAE’s “Thermal Guidelines for Data Processing Environments, 3rd edition,” as illustrated by that publication’s Figure 2.
The air conditioning systems in the data center were sized to guarantee temperatures between 24–26°C (75–79°F) in Class A1 equipment rooms per ASHRAE “Thermal Guidelines for Data Processing Environments, 3rd edition,” in accordance with the ASHRAE psychrometric chart (see Figure 9). The studies carried out showed that, in the Bologna region specifically, the outdoor thermo-hygrometric conditions are favorable to the IT microclimate of the data center about 70% of the time, with energy savings of approximately 2,000 megawatt-hours. Direct free cooling brought undeniable advantages in terms of energy efficiency but introduced a significant functional complication related to Tier compliance. The Tier Standards do not reference direct free cooling or other economization systems, as the Tier requirements apply regardless of the technology.
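As a simplified illustration of the screening step in such a study, the sketch below counts the hours in a year of hourly weather data during which outdoor air, plus an assumed allowance for fan and mixing heat pickup, stays at or below the 26°C upper end of the supply band. The file name, column name, and allowance are assumptions, and a real thermo-hygrometric analysis, like the one performed for UnipolSai, would also account for humidity limits.

# Screen hourly weather data for direct free-cooling hours (illustrative only).
import csv

SUPPLY_MAX_C = 26.0        # upper end of the design supply band
MIXING_ALLOWANCE_C = 1.0   # assumed heat pickup from fans and mixing

def free_cooling_fraction(path: str, temp_col: str = "dry_bulb_c") -> float:
    """Fraction of hours when outdoor air alone can hold the supply band."""
    total = usable = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if float(row[temp_col]) + MIXING_ALLOWANCE_C <= SUPPLY_MAX_C:
                usable += 1
    return usable / total if total else 0.0

if __name__ == "__main__":
    frac = free_cooling_fraction("bologna_hourly_weather.csv")  # hypothetical file
    print(f"Direct free cooling available about {frac:.0%} of the year")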
Eventually, it was decided that the free cooling system had to be subordinated to continuous IT equipment operation and excluded every time there was a problem with the mechanical and electrical systems, in which case Continuous Cooling would be ensured by the chiller plant. The direct free cooling functional setting, with unchanneled hot air rejection imposed by the pre-existing architectural restrictions, dictated the room layout and drove the choice of Cold Aisle containment. The direct free-cooling system consists of N+1 CRACs placed along the perimeter of the room, blowing cool air into a 60-inch plenum created by the access floor. The same units manage the free-cooling system. Every machine is equipped with a dual-feed electric entrance controlled by an ATS and connected to a dual water circuit through a series of automatic valves (see Figure 10).
Figure 10. CRACs are connected with a dual-feed electric entrance controlled by an ATS and connected to a dual water circuit.
Containing the Cold Aisles caused a behavioral response among the IT operators, who normally work in a cold environment. At UnipolSai’s data center, they feel hot air when entering the data center. Design return air temperatures in the circulation areas are 32–34°C (90–93°F), and design supply air temperatures are 24–26°C (75–79°F). It became necessary to start an informational campaign to prevent alarmism in connection with room temperatures in the areas outside the functional aisles (see Figures 11-13).
Figures 11-13. Pictures show underfloor piping, containers, and raised floor environment.
Prefabricated electric busbars placed on the floor at regular intervals provide power supply to the IT racks. This decision was made in collaboration with UnipolSai technicians, who considered it the most flexible solution in terms of installation and power draw, both initially and to accommodate future changes (see Figures 14 and 15).
Figure 14. Electric busbar
Figure 15. Taps on the busbar allow great flexibility on the data center floor and feed servers on the white space floor below.
In addition, a labeling system based on a univocal synthetic description (alphanumeric code) and color coding allows quick visual identification of any part of any system and simplifies the process of testing, operating, and managing all building systems (see Figures 16 and 17).
Figures 16-17. UnipolSai benefits from a well-thought-out labeling system, which simplifies many aspects of operations.
FINAL TESTING
Functional tests were carried out at nominal load with the support of electric heaters, distributed in a regular manner within the equipment rooms and connected to the infrastructure feeding the IT equipment (see Figures 18-19). Uptime Institute also observed technical and functional tests as part of Tier IV Certification of Constructed Facility (TCCF). The results of all the tests were positive; final demonstrations are pending. The data center has received Uptime Institute Tier IV Certification of Design Documents, and Tier IV Certification of Constructed Facility is in progress.
Figures 18-19. Two views of the data center floor, including heaters, during final testing.
To fully respond to UnipolSai’s energy saving and absorption control policy, the site was equipped with a network of heat/cooling energy meters and electrical meters connected to the central supervision system. Each chiller, pumping, and air handling system was individually metered on the electrical side, with the chillers also metered on the thermal side. Each electric system feeding IT loads is also metered.
UnipolSai also adopted DCIM software that, if properly used, can represent the first step towards an effective organization of the maintenance process, essential for keeping a system efficient and operational, independently from its level of redundancy and sophistication.
Andrea Ambrosi
Andrea Ambrosi is a project manager, design team manager, and site manager at Ariatta Ingegneria dei Sistemi Srl (Ariatta). He is responsible for the executive planning and management of operations of electrical power, control and supervision systems, safety and security, and fire detection systems to be installed in data centers. He has specific experience in domotics and special systems for the high-tech residential sector. He has been an Accredited Tier Designer since 2013 and an Accredited Tier Specialist since 2014.
Roberto Del Nero
Roberto Del Nero is a project manager and design team manager at Ariatta, where he is responsible for the executive planning and management of mechanical plants, control and supervision systems, fire systems, and plumbing and drainage to be installed in data centers. He has been a LEED AP (Accredited Professional) since 2009, an Accredited Tier Designer since 2013, and an Accredited Tier Specialist since 2014.
Examining and Learning from Complex Systems Failures
Conventional wisdom blames “human error” for the majority of outages, but those failures are incorrectly attributed to front-line operator errors rather than management mistakes
By Julian Kudritzki, with Anne Corning
Data centers, oil rigs, ships, power plants, and airplanes may seem like vastly different entities, but all are large and complex systems that can be subject to failures—sometimes catastrophic failure. Natural events like earthquakes or storms may initiate a complex system failure. But often blame is assigned to “human error”—front-line operator mistakes, which combine with a lack of appropriate procedures and resources or compromised structures that result from poor management decisions.
“Human error” is an insufficient and misleading term. The front-line operator’s presence at the site of the incident ascribes responsibility to the operator for failure to rescue the situation. But this masks the underlying causes of an incident. It is more helpful to consider the site of the incident as a spectacle of mismanagement.
Responsibility for an incident, in most cases, can be attributed to a senior management decision (e.g., design compromises, budget cuts, staff reductions, vendor selecting and resourcing) seemingly disconnected in time and space from the site of the incident. What decisions led to a situation where front line operators were unprepared or untrained to respond to an incident and mishandled it?
To safeguard against failures, standards and practices have evolved in many industries that encompass strict criteria and requirements for the design and operation of systems, often including inspection regimens and certifications. Compiled, codified, and enforced by agencies and entities in each industry, these programs and requirements help protect the service user from the bodily injuries or financial effects of failures and spur industries to maintain preparedness and best practices.
Twenty years of Uptime Institute research into the causes of data center incidents places predominant accountability for failures at the management level and finds only single-digit percentages of spontaneous equipment failure.
This fundamental and permanent truth compelled the Uptime Institute to step further into standards and certifications that were unique to the data center and IT industry. Uptime Institute undertook a collaborative approach with a variety of stakeholders to develop outcome-based criteria that would be lasting and developed by and for the industry. Uptime Institute’s Certifications were conceived to evaluate, in an unbiased fashion, front-line operations within the context of management structure and organizational behaviors.
EXAMINING FAILURES
The sinking of the Titanic. The Deepwater Horizon oil spill. DC-10 air crashes in the 1970s. The failure of New Orleans’ levee system. The Three Mile Island nuclear release. The northeast (U.S.) blackout of 2003. Battery fires in Boeing 787s. The space shuttle Challenger disaster. Fukushima Daiichi nuclear disaster. The grounding of the Kulluk arctic drilling rig. These are a few of the most infamous, and in some cases tragic, engineering system failures in history. While the examples come from vastly different industries and each story unfolded in its own unique way, they all have something in common with each other—and with data centers. All exemplify highly complex systems operating in technologically sophisticated industries.
The hallmarks of so-called complex systems are “a large number of interacting components, emergent properties difficult to anticipate from the knowledge of single components, adaptability to absorb random disruptions, and highly vulnerable to widespread failure under adverse conditions (Dueñas-Osorio and Vemuru 2009).” Additionally, the components of complex systems typically interact in non-linear fashion, operating in large interconnected networks.
Large systems and the industries that use them have many safeguards against failure and multiple layers of protection and backup. Thus, when they fail it is due to much more than a single element or mistake.
It is a truism that complex systems tend to fail in complex ways. Looking at just a few examples from various industries, again and again we see that it was not a single factor but the compound effect of multiple factors that disrupted these sophisticated systems. Often referred to as “cascading failures,” complex system breakdowns usually begin when one component or element of the system fails, requiring nearby “nodes” (or other components in the system network) to take up the workload or service obligation of the failed component. If this increased load is too great, it can cause other nodes to overload and fail as well, creating a waterfall effect as every component failure increases the load on the other, already stressed components. The following transferable concept is drawn from the power industry:
Power transmission systems are heterogeneous networks of large numbers of components that interact in diverse ways. When component operating limits are exceeded, protection acts to disconnect the component and the component “fails” in the sense of not being available… Components can also fail in the sense of misoperation or damage due to aging, fire, weather, poor maintenance, or incorrect design or operating settings…. The effects of the component failure can be local or can involve components far away, so that the loading of many other components throughout the network is increased… the flows all over the network change (Dobson, et al. 2009).
A component of the network can be mechanical, structural or human agent, as front-line operators respond to an emerging crisis. Just as engineering components can fail when overloaded, so can human effectiveness and decision-making capacity diminish under duress. A defining characteristic of a high risk organization is that it provides structure and guidance despite extenuating circumstances—duress is its standard operating condition.
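The load-redistribution dynamic described above can be made concrete with a deliberately simple toy model: each component carries a load up to a fixed capacity, and when one fails its load is spread across the survivors, any of which may then fail in turn. The numbers below are arbitrary, and the model is only a sketch of the cascade mechanism, not of any real power or cooling network.

# Toy cascade model: redistribute the load of failed components to survivors.
def simulate_cascade(loads, capacities, first_failure):
    failed = {first_failure}
    while True:
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            return failed  # total collapse
        # Spread the load of every failed component evenly over the survivors.
        extra = sum(loads[i] for i in failed) / len(survivors)
        newly_failed = {i for i in survivors if loads[i] + extra > capacities[i]}
        if not newly_failed:
            return failed  # cascade arrested
        failed |= newly_failed

if __name__ == "__main__":
    # Five components all running near their limits ("degraded mode").
    loads = [80, 85, 75, 90, 70]
    capacities = [100, 100, 100, 100, 100]
    print("Components lost:", sorted(simulate_cascade(loads, capacities, first_failure=3)))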
The sinking of the Titanic is perhaps the most well-known complex system failure in history. This disaster was caused by the compound effect of structural issues, management decisions, and operating mistakes that led to the tragic loss of 1,495 lives. Just a few of the critical contributing factors include design compromises (e.g., reducing the height of the watertight bulkheads that allowed water to flow over the tops and limiting the number of lifeboats for aesthetic considerations), poor discretionary decisions (e.g., sailing at excessive speed on a moonless night despite reports of icebergs ahead), operator error (e.g., the lookout in the crow’s nest had no binoculars—a cabinet key had been left behind in Southampton), and misjudgment in the crisis response (e.g., the pilot tried to reverse thrust when the iceberg was spotted, instead of continuing at full speed and using the momentum of the ship to turn course and reduce impact). And, of course, there was the hubris of believing the ship was unsinkable.
Figure 1a. (Left) NTSB photo of the burned auxiliary power unit battery from a JAL Boeing 787 that caught fire on January 7, 2013 at Boston’s Logan International Airport. Photo credit: By National Transportation Safety Board (NTSB) [Public domain], via Wikimedia Commons. Figure 1b. (Right) A side-by-side comparison of an original Boeing Dreamliner (787) battery and a damaged Japan Airlines battery. Photo credit: By National Transportation Safety Board (NTSB) [Public domain], via Wikimedia Commons.
Looking at a more recent example, the issue of battery fires in Japan Airlines (JAL) Boeing 787s, which came to light in 2013 (see Figure 1), was ultimately blamed on a combination of design, engineering, and process management shortfalls (Gallagher 2014). Following its investigation, the U.S. National Transportation Safety Board reported (NTSB 2014):
• Manufacturer errors in design and quality control. The manufacturer failed to adequately account for the thermal runaway phenomenon: an initial overheating of the batteries triggered a chemical reaction that generated more heat, thus causing the batteries to explode or catch fire. Battery “manufacturing defects and lack of oversight in the cell manufacturing process” resulted in the development of lithium mineral deposits in the batteries. Called lithium dendrites, these deposits can cause a short circuit that reacts chemically with the battery cell, creating heat. Lithium dendrites occurred in wrinkles that were found in some of the battery electrolyte material, a manufacturing quality control issue.
• Shortfall in certification processes. The NTSB found shortcomings in U.S. Federal Aviation Administration (FAA) guidance and certification processes. Some important factors were overlooked that should have been considered during safety assessment of the batteries.
• Lack of contractor oversight and proper change orders. A cadre of contractors and subcontractors were involved in the manufacture of the 787’s electrical systems and battery components. Certain entities made changes to the specifications and instructions without proper approval or oversight. When the FAA performed an audit, it found that Boeing’s prime contractor wasn’t following battery component assembly and installation instructions and was mislabeling parts. A lack of “adherence to written procedures and communications” was cited.
How many of these circumstances parallel those that can happen during the construction and operation of a data center? It is all too common to find deviations from as-designed systems during the construction process, inconsistent quality control oversight, and the use of multiple subcontractors. Insourced and outsourced resources may disregard or hurry past written procedures, documentation, and communication protocols (search Avoiding Data Center Construction Problems @ journal.uptimeinstitute.com).
THE NATURE OF COMPLEX SYSTEM FAILURES
Large industrial and engineered systems are risky by their very nature. The greater the number of components and the higher the energy and heat levels, velocity, and size and weight of these components, the greater the skill and teamwork required to plan, manage, and operate the systems safely. Between mechanical components and human actions, there are thousands of possible points where an error can occur and potentially trigger a chain of failures.
In his seminal article on complex system failure, How Complex Systems Fail, first published in 1998 and still widely referenced today, Dr. Richard I. Cook identifies and discusses 18 core elements of failure in complex systems:
1. Complex systems are intrinsically hazardous systems.
2. Complex systems are heavily and successfully defended against failure.
3. Catastrophe requires multiple failures—single point failures are not enough.
4. Complex systems contain changing mixtures of failures latent within them.
5. Complex systems run in degraded mode.
6. Catastrophe is always just around the corner.
7. Post-accident attribution to a root cause is fundamentally wrong.
8. Hindsight biases post-accident assessments of human performance.
9. Human operators have dual roles: as producers and as defenders against failure.
10. All practitioner actions are gambles.
11. Actions at the sharp end resolve all ambiguity.
12. Human practitioners are the adaptable element of complex systems.
13. Human expertise in complex systems is constantly changing.
14. Change introduces new forms of failure.
15. Views of cause limit the effectiveness of defenses against future events.
16. Safety is a characteristic of systems and not of their components.
17. People continuously create safety.
18. Failure-free operations require experience with failure (Cook 1998).
Let’s examine some of these principles in the context of a data center. Certainly high-voltage electrical systems, large-scale mechanical and infrastructure components, high-pressure water piping, power generators, and other elements create hazards [Element 1] for both humans and mechanical systems/structures. Data center systems are defended from failure by a broad range of measures [Element 2], both technical (e.g., redundancy, alarms, and safety features of equipment) and human (e.g., knowledge, training, and procedures). Because of these multiple layers of protection, a catastrophic failure would require the breakdown of multiple systems or multiple individual points of failure [Element 3].
RUNNING NEAR CRITICAL FAILURE
Complex systems science suggests that most large-scale complex systems, even well-run ones, by their very nature are operating in “degraded mode” [Element 5], i.e., close to the critical failure point. This is due to the progression over time of various factors including steadily increasing load demand, engineering forces, and economic factors.
The enormous investments in data center and other highly available infrastructure systems perversely incent conditions of elevated risk and a higher likelihood of failure. Maximizing capacity, increasing density, and hastening production from installed infrastructure improve the return on investment (ROI) on these major capital investments. Deferred maintenance, whether due to lack of budget or hands-off periods during heightened production, further pushes equipment towards performance limits—the breaking point.
The increasing density of data center infrastructure exemplifies the dynamics that continually and inexorably push a system towards critical failure. Server density is driven by a mixture of engineering forces (advancements in server design and efficiency) and economic pressures (demand for more processing capacity without increasing facility footprint). Increased density then necessitates corresponding increases in the number of critical heating and cooling elements. Now the system is running at higher risk, with more components (each of which is subject to individual fault/failure), more power flowing through the facility, more heat generated, etc.
This development trajectory demonstrates just a few of the powerful “self-organizing” forces in any complex system. According to Dobson, et al (2009), “these forces drive the system to a dynamic equilibrium that keeps [it] near a certain pattern of operating margins relative to the load. Note that engineering improvements and load growth are driven by strong, underlying economic and societal forces that are not easily modified.”
Because of this dynamic mix of forces, the potential for a catastrophic outcome is inherent in the very nature of complex systems [Element 6]. For large-scale mission critical and business critical systems, the profound implication is that designers, system planners, and operators must acknowledge the potential for failure and build in safeguards.
WHY IS IT SO EASY TO BLAME HUMAN ERROR?
Human error is often cited as the root cause of many engineering system failures, yet it does not often cause a major disaster on its own. Based on analysis of 20 years of data center incidents, Uptime Institute holds that human error must signify management failure to drive change and improvement. Leadership decisions and priorities that result in a lack of adequate staffing and training, an organizational culture that becomes dominated by a fire drill mentality, or budget cutting that reduces preventive/proactive maintenance could result in cascading failures that truly flow from the top down.
Although front-line operator error may sometimes appear to cause an incident, a single mistake (just like a single data center component failure) is not often sufficient to bring down a large and robust complex system unless conditions are such that the system is already teetering on the edge of critical failure and has multiple underlying risk factors. For example, media reports after the 1989 Exxon Valdez oil spill zeroed in on the fact that the captain, Joseph Hazelwood, was not at the bridge at the time of the accident and accused him of drinking heavily that night. However, more measured assessments of the accident by the NTSB and others found that Exxon had consistently failed to supervise the captain or provide sufficient crew for necessary rest breaks (see Figure 2).
Figure 2. Shortly after leaving the Port of Valdez, the Exxon Valdez ran aground on Bligh Reef. The picture was taken three days after the vessel grounded, just before a storm arrived. Photo credit: Office of Response and Restoration, National Ocean Service, National Oceanic and Atmospheric Administration [Public domain], via Wikimedia Commons.
Perhaps even more critical was the lack of essential navigation systems: the tanker’s radar was not operational at the time of the accident. Reports indicate that Exxon’s management had allowed the RAYCAS radar system to stay broken for an entire year before the vessel ran aground because it was expensive to operate. There was also inadequate disaster preparedness and an insufficient quantity of oil spill containment equipment in the region, despite the experiences of previous small oil spills. Four years before the accident, a letter written by Captain James Woodle, who at that time was the Exxon oil group’s Valdez port commander, warned upper management, “Due to a reduction in manning, age of equipment, limited training and lack of personnel, serious doubt exists that [we] would be able to contain and clean-up effectively a medium or large size oil spill” (Palast 1999).
As Dr. Cook points out, post-accident attribution to a root cause is fundamentally wrong [Element 7]. Complete failure requires multiple faults, thus attribution of blame to a single isolated element is myopic and, arguably, scapegoating. Exxon blamed Captain Hazelwood for the accident, and his share of the blame obscures the underlying mismanagement that led to the failure. Inadequate enforcement by the U.S. Coast Guard and other regulatory agencies further contributed to the disaster.
Similarly, the grounding of the oil rig Kulluk was the direct result of a cascade of discrete failures, errors, and mishaps, but the disaster was first set in motion by Royal Dutch Shell’s executive decision to move the rig off of the Alaskan coastline to avoid tax liability, despite high risks (Lavelle 2014). As a result, the rig and its tow vessels undertook a challenging 1,700-nautical-mile journey across the icy and storm-tossed waters of the Gulf of Alaska in December 2012 (Funk 2014).
There had already been a chain of engineering and inspection compromises and shortfalls surrounding the Kulluk, including the installation of used and uncertified tow shackles, a rushed refurbishment of the tow vessel Discovery, and electrical system issues with the other tow vessel, the Aivik, which had not been reported to the Coast Guard as required. (Discovery experienced an exhaust system explosion and other mechanical issues in the following months. Ultimately the tow company—a contractor—was charged with a felony for multiple violations.)
This journey would be the Kulluk’s last, and it included a series of additional mistakes and mishaps. Gale-force winds put continual stress on the tow line and winches. The tow ship was captained on this trip by an inexperienced replacement, who seemingly mistook tow line tensile alarms (set to go off when tension exceeded 300 tons) for another alarm that was known to be falsely annunciating. At one point the Aivik, in attempting to circle back and attach a new tow line, was swamped by a wave, sending water into the fuel pumps (a problem that had previously been identified but not addressed), which caused the engines to begin to fail over the next several hours (see Figure 3).
Figure 3. Waves crash over the mobile offshore drilling unit Kulluk where it sits aground on the southeast side of Sitkalidak Island, Alaska, January 1, 2013. A Unified Command, consisting of the Coast Guard, federal, state, local, and tribal partners and industry representatives, was established in response to the grounding. U.S. Coast Guard photo by Petty Officer 3rd Class Jonathan Klingenberg.
Despite harrowing conditions, Coast Guard helicopters were eventually able to rescue the 18 crew members aboard the Kulluk. Valiant last-ditch tow attempts were made by the (repaired) Aivik and Coast Guard tugboat Alert, before the effort had to be abandoned and the oil rig was pushed aground by winds and currents.
Poor management decision making, lack of adherence to proper procedures and safety requirements, shortcuts in the repair of critical mechanical equipment, insufficient contractor oversight, and lack of personnel training and experience: all of these elements of complex system failure are readily seen as contributing factors in the Kulluk disaster.
EXAMINING DATA CENTER SYSTEM FAILURES
Two recent incidents demonstrate how the dynamics of complex systems failures can quickly play out in the data center environment.
Example A
Tier III Concurrent Maintenance data center criteria (see Uptime Institute Tier Standard: Topology) require multiple, diverse, independent distribution paths serving all critical equipment to allow maintenance activity without impacting critical load. The data center in this example had been designed appropriately, with fuel pumps and engine-generator controls powered from multiple circuit panels. As built, however, a single panel powered both, whether due to implementation oversight or cost reduction measures. At issue is not the installer, but rather the quality of communications between the implementation team and the operations team.
In the course of operations, technicians had to shut off utility power while performing routine maintenance on electrical switchgear, which meant the building was running on engine-generator sets. However, when the engine-generator sets started to surge due to a clogged fuel line, the UPS automatically switched the facility to battery power. Meanwhile, the day tanks for the engine-generator sets were starting to run dry. If quick-thinking operators had not discovered the fuel pump issue in time, there would have been an outage to the entire facility: a cascade of events leading down a rapid pathway from a simple routine maintenance activity to complete system failure.
Example B
Tier IV Fault Tolerant data center criteria require the ability to detect and isolate a fault while maintaining capacity to handle critical load. In this example, a Tier IV enterprise data center shared space with corporate offices in the same building, with a single chilled water plant used to cool both sides of the building. The office air handling units also brought in outside air to reduce cooling costs.
One night, the site experienced particularly cold temperatures and the control system did not switch from outside air to chilled water for office building cooling, which affected data center cooling as well. The freeze stat (a temperature sensing device that monitors a heat exchanger to prevent its coils from freezing) failed to trip; thus the temperature continued to drop and the cooling coil froze and burst, leaking chilled water onto the floor of the data center. There was a limited leak detection system in place and connected, but it had not been fully tested yet. Chilled water continued to leak until pressure dropped and then the chilled water machines started to spin offline in response. Once the chilled water machines went offline neither the office building nor data center had active cooling.
At this point, despite the extreme outside cold, temperatures in the data hall rose through the night. As a result of the elevated indoor temperature conditions, the facility experienced myriad device-level (e.g., servers, disc drives, and fans) failures over the following several weeks. Though a critical shut down was not the issue, damage to components and systems—and the cost of cleanup and replacement parts and labor—were significant. One single initiating factor—a cold night—combined with other elements in a cascade of failures.
In both of these cases, severe disaster was averted, but relying on front-line operators to save the situation is neither robust nor reliable.
PREVENTING FAILURES IN THE DATA CENTER
Organizations that adhere to the principles of Concurrent Maintainability and/or Fault Tolerance, as outlined in Tier Standard: Topology, take a vital first step toward reducing the risk of a data center failure or outage.
However, facility infrastructure is only one component of failure prevention; how a facility is run and operated on a day-to-day basis is equally critical. As Dr. Cook noted, humans have a dual role in complex systems as both the potential producers (causes) of failure as well as, simultaneously, some of the best defenders against failure [Element 9].
The fingerprints of human error can be seen in both data center examples. In Example A, the electrical panel was not set up as originally designed; in Example B, the leak detection system, which could have alerted operators to the problem, had not been fully activated.
Dr. Cook also points out that human operators are the most adaptable component of complex systems [Element 12], as they “actively adapt the system to maximize production and minimize accidents.” For example, operators may “restructure the system to reduce exposure of vulnerable parts,” reorganize critical resources to focus on areas of high demand, provide “pathways for retreat or recovery,” and “establish means for early detection of changed system performance in order to allow graceful cutbacks in production or other means of increasing resiliency.” Given the highly dynamic nature of complex system environments, this human-driven adaptability is key.
STANDARDIZATION CAN ADDRESS MANAGEMENT SHORTFALLS
In most of the notable failures of recent decades, there was a breakdown or circumvention of established standards and certifications. It was not a lack of standards but a lack of compliance, or sloppiness, that contributed most to the disastrous outcomes. For example, in the case of the Boeing batteries, the causes were bad design, poor quality inspections, and lack of contractor oversight. In the case of the Exxon Valdez, inoperable navigation systems and inadequate crew manpower and oversight, along with insufficient disaster preparedness, were critical factors. If leadership, operators, and oversight agencies had adhered to their own policies and requirements and had not cut corners for economics or expediency, these disasters might have been avoided.
Ongoing operating and management practices and adherence to recognized standards and requirements, therefore, must be the focus of long-term risk mitigation. In fact, Dr. Cook states that “failure-free operations are the result of activities of people who work to keep the system within the boundaries of tolerable performance…. human practitioner adaptations to changing conditions actually create safety from moment to moment” [Element 17]. This emphasis on human activities as decisive in preventing failures dovetails with Uptime Institute’s advocacy of operational excellence as set forth in the Tier Standard: Operational Sustainability. This was the data center industry’s first standard, developed by and for data centers, to address the management shortfalls that could unwind the most advanced, complex, and intelligent of solutions. Uptime Institute was compelled by its findings that the vast majority of data center incidents could be attributed to operations, despite advancements in technology, monitoring, and automation.
The Operational Sustainability criteria pinpoint the elements that impact long-term data center performance, encompassing site management and operating behaviors, and documentation and mitigation of site-specific risks. The detailed criteria include personnel qualifications and training, as well as policies and procedures that support operating teams in effectively preventing failures and responding appropriately when small failures occur, to keep them from cascading into large critical failures. As Dr. Cook states, “Failure free operations require experience with failure” [Element 18].
We have the opportunity to learn from the experience of other industries and, more importantly, from the data center industry’s own experience, as collected and analyzed in Uptime Institute’s Abnormal Incident Reports database. Uptime Institute has captured and catalogued the lessons learned from more than 5,000 errors and incidents over the last 20 years and used that research knowledgebase to help develop an authoritative set of benchmarks. It has ratified these with leading industry experts and gained the consensus of global stakeholders from each sector of the industry. Uptime Institute’s Tier Certifications and Management & Operations (M&O) Stamp of Approval provide the most definitive guidelines for and verification of effective risk mitigation and operations management.
Dr. Cook explains, “More robust system performance is likely to arise in systems where operators can discern the edge of the envelope. It also depends on calibrating how their actions move system performance towards or away from the edge of the envelope” [Element 18]. Uptime Institute’s deep subject matter expertise, long experience, and evidence-based standards can help data center operators identify and stay on the right side of that edge. Organizations like CenturyLink are recognizing the value of applying a consistent set of standards to ensure operational excellence and minimize the risk of failure in the complex systems represented by their data center portfolios (see the sidebar CenturyLink and the M&O Stamp of Approval).
CONCLUSION
Complex systems fail in complex ways, a reality exacerbated by the business need to operate complex systems on the very edge of failure. The highly dynamic environments of building and operating an airplane, ship, or oil rig share many traits with running a high availability data center. The risk tolerance for a data center is similarly very low, and data centers are susceptible to the heroics and missteps of many disciplines. The coalescing element is management, which makes sure that front-line operators are equipped with the hands, tools, parts, and processes they need, and that unbiased oversight and certifications are in place to identify risks and drive continuous improvement against the continuous exposure to complex failure.
REFERENCES
ASME (American Society of Mechanical Engineers). 2011. Initiative to Address Complex Systems Failure: Prevention and Mitigation of Consequences. Report prepared by Nexight Group for ASME (June). Silver Spring MD: Nexight Group. http://nexightgroup.com/wp-content/uploads/2013/02/initiative-to-address-complex-systems-failure.pdf
Bassett, Vicki. (1998). “Causes and effects of the rapid sinking of the Titanic,” working paper. Department of Mechanical Engineering, the University of Wisconsin. http://writing.engr.vt.edu/uer/bassett.html#authorinfo.
BBC News. 2015. “Safety worries lead US airline to ban battery shipments.” March 3, 2015. http://www.bbc.com/news/technology-31709198
Brown, Christopher and Matthew Mescal. 2014. View From the Field. Webinar presented by Uptime Institute, May 29, 2014. https://uptimeinstitute.com/research-publications/asset/webinar-recording-view-from-the-field
Cook, Richard I. 1998. “How Complex Systems Fail (Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety).” Chicago, IL: Cognitive Technologies Laboratory, University of Chicago. Copyright 1998, 1999, 2000 by R.I. Cook, MD, for CtL. Revision D (00.04.21), http://web.mit.edu/2.75/resources/rando/How%20Complex%20Systems%20Fail.pdf
Dobson, Ian, Benjamin A. Carreras, Vickie E. Lynch and David E. Newman. 2009. “Complex systems analysis of a series of blackouts: Cascading failure, critical points, and self-organization.” Chaos: An Interdisciplinary Journal of Nonlinear Science 17: 026103 (published by the American Institute of Physics).
Dueñas-Osorio, Leonard and Srivishnu Mohan Vemuru. 2009. Abstract for “Cascading failures in complex infrastructure systems.” Structural Safety 31 (2): 157-167.
Funk, McKenzie. 2014. “The Wreck of the Kulluk.” New York Times Magazine December 30, 2014. http://www.nytimes.com/2015/01/04/magazine/the-wreck-of-the-kulluk.html?_r=0
Gallagher, Sean. 2014. “NTSB blames bad battery design and bad management in Boeing 787 fires.” Ars Technica, December 2, 2014. http://arstechnica.com/information-technology/2014/12/ntsb-blames-bad-battery-design-and-bad-management-in-boeing-787-fires/
Glass, Robert, Walt Beyeler, Kevin Stamber, Laura Glass, Randall LaViolette, Stephen Contrad, Nancy Brodsky, Theresa Brown, Andy Scholand, and Mark Ehlen. 2005. Simulation and Analysis of Cascading Failure in Critical Infrastructure. Presentation (annotated version), Los Alamos National Laboratory, National Infrastructure Simulation and Analysis Center (Department of Homeland Security), and Sandia National Laboratories, July 2005. New Mexico: Sandia National Laboratories. http://www.sandia.gov/CasosEngineering/docs/Glass_annotatedpresentation.pdf
Kirby, R. Lee. 2012. “Reliability Centered Maintenance: A New Approach.” Mission Critical, June 12, 2012. http://www.missioncriticalmagazine.com/articles/84992-reliability-centered-maintenance–a-new-approach
Klesner, Keith. 2015. “Avoiding Data Center Construction Problems.” The Uptime Institute Journal 5 (Spring 2014): 6-12. https://journal.uptimeinstitute.com/avoiding-data-center-construction-problems/
Lipsitz, Lewis A. 2012. “Understanding Health Care as a Complex System: The Foundation for Unintended Consequences.” Journal of the American Medical Association 308 (3): 243-244. http://jama.jamanetwork.com/article.aspx?articleid=1217248
Lavelle, Marianne. 2014. “Coast Guard blames Shell risk taking in the wreck of the Kulluk.” National Geographic, April 4, 2014. http://news.nationalgeographic.com/news/energy/2014/04/140404-coast-guard-blames-shell-in-kulluk-rig-accident/
“Exxon Valdez Oil Spill.” New York Times. On NYTimes.com, last updated August 3, 2010. http://topics.nytimes.com/top/reference/timestopics/subjects/e/exxon_valdez_oil_spill_1989/index.html
NTSB (National Transportation Safety Board). 2014. “Auxiliary Power Unit Battery Fire Japan Airlines Boeing 787-8, JA829J.” Aircraft Incident Report released 11/21/14. Washington, DC: National Transportation Safety Board. http://www.ntsb.gov/Pages/..%5Cinvestigations%5CAccidentReports%5CPages%5CAIR1401.aspx
Palast, Greg. 1999. “Ten Years After But Who Was to Blame?” for Observer/Guardian UK, March 20, 1999. http://www.gregpalast.com/ten-years-after-but-who-was-to-blame/
Pederson, Brian. 2014. “Complex systems and critical missions: today’s data center.” Lehigh Valley Business, November 14, 2014. http://www.lvb.com/article/20141114/CANUDIGIT/141119895/complex-systems-and-critical-missions–todays-data-center
Plsek, Paul. 2003. Complexity and the Adoption of Innovation in Healthcare. Presentation, Accelerating Quality Improvement in Health Care Strategies to Speed the Diffusion of Evidence-Based Innovations, conference in Washington, DC, January 27-28, 2003. Roswell, GA: Paul E Plsek & Associates, Inc. http://www.nihcm.org/pdf/Plsek.pdf
Reason, J. 2000. “Human Error: Models and Management.” British Medical Journal 320 (7237): 768-770. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1117770/
Reuters. 2014. “Design flaws led to lithium-ion battery fires in Boeing 787: U.S. NTSB.” December 2, 2014. http://www.reuters.com/article/2014/12/02/us-boeing-787-batteryidUSKCN0JF35G20141202
Wikipedia, s.v. “Cascading Failure,” last modified April 12, 2015. https://en.wikipedia.org/wiki/Cascading_failure
Wikipedia, s.v. “The Sinking of The Titanic,” last modified July 21, 2015. https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic
Wikipedia, s.v. “SOLAS Convention,” last modified June 21, 2015. https://en.wikipedia.org/wiki/SOLAS_Convention
John Maclean, author of numerous books analyzing deadly wildland fires, including Fire on the Mountain (Morrow 1999), suggests rebranding the high reliability organization, a concept fundamental to firefighting crews, the military, and the commercial airline industry, as the high risk organization. A high reliability organization can only fail, like a goalkeeper, because flawless performance is so thoroughly expected. A high risk organization, by contrast, is tasked with averting or minimizing impact and may gauge success in a non-binary fashion. A recurring theme in Mr. Maclean’s forensic analyses of deadly fires is that front-line operators, including those who perished, carry the blame for the outcome, while management shortfalls are far less exposed.
CENTURYLINK AND THE M&O STAMP OF APPROVAL
The IT industry has a growing awareness of the importance of management, people, and process issues. That’s why Uptime Institute’s Management & Operations (M&O) Stamp of Approval focuses on assessing and evaluating both operations activities and management as equally critical to ensuring data center reliability and performance. The M&O Stamp can be applied to a single data center facility or administered across an entire portfolio to ensure consistency.
Recognizing the necessity of making a commitment to excellence at all levels of an organization, CenturyLink is the first service provider to embrace the M&O assessment for all of its data centers. It has contracted Uptime Institute to assess 57 data center facilities across a global portfolio. This decision shows the company is willing to hold itself to a uniform set of high standards and operate with transparency. The company has committed to achieve M&O Stamp of Approval standards and certification across the board, protecting its vital networks and assets from failure and downtime and providing its customers with assurance.
Julian Kudritzki
Julian Kudritzki joined Uptime Institute in 2004 and currently serves as Chief Operating Officer. He is responsible for the global proliferation of Uptime Institute standards. He has supported the founding of Uptime Institute offices in numerous regions, including Brasil, Russia, and North Asia. He has collaborated on the development of numerous Uptime Institute publications, education programs, and unique initiatives such as Server Roundup and FORCSS. He is based in Seattle, WA.
Anne Corning
Anne Corning is a technical and business writer with more than 20 years of experience in the high tech, healthcare, and engineering fields. She earned her B.A. from the University of Chicago and her M.B.A. from the University of Washington’s Foster School of Business. She has provided marketing, research, and writing for organizations such as Microsoft, Skanska USA Mission Critical, McKesson, Jetstream Software, Hitachi Consulting, Seattle Children’s Hospital Center for Clinical Research, Adaptive Energy, Thinking Machines Corporation (now part of Oracle), BlueCross BlueShield of Massachusetts, and the University of Washington Institute for Translational Health Sciences. She has been a part of several successful entrepreneurial ventures and is a Six Sigma Green Belt.
IT Chargeback Drives Efficiency
Allocating IT costs to internal customers improves accountability, cuts waste
By Scott Killian
You’ve heard the complaints many times before: IT costs too much. I have no idea what I’m paying for. I can’t accurately budget for IT costs. I can do better getting IT services myself.
The problem is that end-user departments and organizations can sometimes see IT operations as just a black box. In recent years, IT chargeback systems have attracted more interest as a way to address all those concerns and rising energy use and costs. In fact, IT chargeback can be a cornerstone of practical, enterprise-wide efficiency efforts.
IT chargeback is a method of charging internal consumers (e.g., departments, functional units) for the IT services they use. Instead of bundling all IT costs under the IT department, a chargeback program allocates the various costs of delivering IT (e.g., services, hardware, software, maintenance) to the business units that consume them.
Many organizations already use some form of IT chargeback, but many don’t, instead treating IT as corporate overhead. Resistance to IT chargeback often comes from the perception that it requires too much effort. It’s true that time, administrative cost, and organizational maturity are needed to implement chargeback.
However, the increased adoption of private and public cloud computing is causing organizations to re-evaluate and reconsider IT chargeback methods. Cloud computing has led some enterprises to ask their IT organizations to explain their internal costs. Cloud options can shave a substantial amount from IT budgets, which pressures IT organizations to improve cost modeling to either fend off or justify a cloud transition. In some cases, IT is now viewed as more of a commodity—with market competition. In these circumstances, accountability and efficiency improvements can bring significant cost savings that make chargeback a more attractive path.
CHARGEBACK vs. UNATTRIBUTED ACCOUNTING
All costs are centralized in traditional IT accounting. One central department pays for all IT equipment and activities, typically out of the CTO or CIO’s budget, and these costs are treated as corporate overhead shared evenly by multiple departments. In an IT chargeback accounting model, individual cost centers are charged for their IT service based on use and activity. As a result, all IT costs are “zeroed out” because they have all been assigned to user groups. IT is no longer considered overhead; instead, it can be viewed as part of each department’s business and operating expenses (OpEx).
With the adoption of IT chargeback, an organization can expect to see significant shifts in awareness, culture, and accountability, including:
• Increased transparency due to accurate allocation of IT costs and usage. Chargeback allows consumers to see their costs and understand how those costs are determined.
• Improved IT financial management, as groups become more aware of the cost of their IT usage and business choices. With chargeback, consumers become more interested and invested in the costs of delivering IT as a service.
• Increased awareness of how IT contributes to the business of the organization. IT is not just overhead but is seen as providing real business value.
• Responsibility for controlling IT costs shifts to business units, which become accountable for their own use.
• Alignment of IT operations and expenditures with the business. IT is no longer just an island of overhead costs but becomes integrated into business planning, strategy, and operations.
The benefits of an IT chargeback model include simplified IT investment decision making, reduced resource consumption, improved relationships between business units and IT, and greater perception of IT value. Holding departments accountable leads them to modify their behaviors and improve efficiency. For example, chargeback tends to reduce overall resource consumption as business units stop hoarding surplus servers or other resources to avoid the cost of maintaining these underutilized assets. At the same time, organizations experience increased internal customer satisfaction as IT and the business units become more closely aligned and begin working together to analyze and improve efficiency.
Perhaps most importantly, IT chargeback drives cost control. As users become aware of the direct costs of their activities, they become more willing to improve their utilization, optimize their software and activities, and analyze cost data to make better spending decisions. This can extend the life of existing resources and infrastructure, defer resource upgrades, and identify underutilized resources that can be deployed more efficiently. Just as we have seen in organizations that adopt a server decommissioning program (such as the successful initiatives of Uptime Institute’s Server Roundup, https://uptimeinstitute.com/training-events/server-roundup), IT chargeback identifies underutilized assets that can be reassigned or decommissioned. As a result, more space and power become available to other equipment and services, thus extending the life of existing infrastructure. An organization doesn’t have to build new infrastructure if it can get more from current equipment and systems.
IT chargeback also allows organizations to make fully informed decisions about outsourcing. Chargeback provides useful metrics that can be compared against cloud providers and other outsource IT options. As IT organizations are being driven to emulate cloud provider services, a chargeback applies free-market principles to IT (with appropriate governance and controls). The IT group becomes more akin to a service provider, tracking and reporting the same metrics on a more apples-to-apples basis.
Showback is closely related to chargeback and offers many of the same advantages without some of the drawbacks. This strategy employs the same approach as chargeback, with tracking and cost-center allocation of IT expenses. Showback measures and displays the IT cost breakdown by consumer unit just as chargeback does, but without actually transferring costs back. Costs remain in the IT group, but information is still transparent about consumer utilization. Showback can be easy to implement since there is no immediate budgetary impact on user groups.
The premise behind showback and chargeback is the same: awareness drives accountability. However, since business units know they will not be charged in a showback system, their attention to efficiency and improving utilization may not be as focused. Many organizations have found that starting with a showback approach for an initial 3-6 months is an effective way to introduce chargeback, testing the methodology and metrics and allowing consumer groups to get used to the approach before full implementation of chargeback accountability.
The stakeholders affected by chargeback/showback include:
• Consumers: Business units that consume IT resources, e.g., organizational entities, departments, applications, and end users.
• Internal service providers: Groups responsible for providing IT services, e.g., data center teams, network teams, and storage.
• Project sponsor: The group funding the effort and ultimately responsible for its success. Often this is someone under the CTO, but it can also be a finance/accounting leader.
• Executive team: The C-suite individuals responsible for setting chargeback as an organizational priority and ensuring enterprise-wide participation to bring it to fruition.
• Administrator: The group responsible for operating the chargeback program (e.g., IT finance and accounting).
CHARGEBACK METHODS
A range of approaches has been developed for implementing chargeback in an organization, as summarized in Figure 1. The degree of complexity, degree of difficulty, and cost to implement decrease from the top of the chart [service-based pricing (SBP)] to the bottom [high-level allocation (HLA)]. HLA is the simplest method; it uses a straight division of IT costs based on a generic metric such as headcount. Low-level allocation (LLA) takes slightly more effort to implement; it bases consumer costs on something more closely related to IT activity, such as the number of users or servers. Direct cost (DC) more closely resembles a time and materials charge but is often tied to headcount as well.
Figure 1. Methods for chargeback allocation.
Measured resource usage (MRU) focuses on the amount of actual resource usage of each department, using metrics such as power (in kilowatts), network bandwidth, and terabytes of storage. Tiered flat rate (TFR), negotiated flat rate (NFR), and service-based pricing (SBP) are all increasingly sophisticated applications of measuring actual usage by service.
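To make the contrast between the simpler methods concrete, here is a minimal sketch, assuming hypothetical departments, headcounts, server counts, and a total IT cost (none of these figures come from the article), of high-level allocation by headcount versus low-level allocation by server count:

```python
# Hypothetical illustration of the two simplest chargeback methods.
# Department names, headcounts, server counts, and costs are invented.

TOTAL_IT_COST = 1_200_000  # annual IT cost to allocate, US$ (assumed)

departments = {
    "Marketing":   {"headcount": 40,  "servers": 10},
    "Engineering": {"headcount": 120, "servers": 180},
    "Finance":     {"headcount": 40,  "servers": 10},
}

def high_level_allocation(depts, total_cost):
    """HLA: split total IT cost by a generic metric such as headcount."""
    total_heads = sum(d["headcount"] for d in depts.values())
    return {name: total_cost * d["headcount"] / total_heads
            for name, d in depts.items()}

def low_level_allocation(depts, total_cost):
    """LLA: split by something closer to IT activity, e.g., server count."""
    total_servers = sum(d["servers"] for d in depts.values())
    return {name: total_cost * d["servers"] / total_servers
            for name, d in depts.items()}

if __name__ == "__main__":
    for method in (high_level_allocation, low_level_allocation):
        print(method.__name__)
        for name, charge in method(departments, TOTAL_IT_COST).items():
            print(f"  {name:12s} US${charge:,.0f}")
```

In this sketch, the engineering group pays the same per-head rate as everyone else under HLA despite running most of the servers, while LLA shifts the charge toward actual IT activity, which is the trade-off between simplicity and fairness described above.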
THE CHARGEBACK SWEET SPOT
Measured resource usage (MRU) is often the sweet spot for chargeback implementation. It makes use of readily available data that are likely already known or collected. For example, data center teams typically measure power consumption at the server level, and storage groups know how many terabytes are being used by different users/departments. MRU is a straight allocation of IT costs, thus it is fairly intuitive for consumer organizations to accept. It is not quite as simple as other methods to implement but does provide fairness and is easily controllable.
MRU treats IT services as a utility, consumed and reserved based on key activities:
• Data center = power
• Network = bandwidth
• Storage = bytes
• Cloud = virtual machines or other metric
• Network Operations Center = ticket count or total time to resolve per customer
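As a rough illustration of MRU in practice, the sketch below converts each consumer's measured power, storage, and bandwidth into a monthly charge using unit rates. All rates and consumption figures are hypothetical assumptions for illustration; in practice they would come from PDU metering, storage reports, and network monitoring.

```python
# Hypothetical MRU chargeback sketch: measured consumption x unit rate.

UNIT_RATES = {          # assumed cost per unit per month
    "kW": 350.0,        # per kilowatt of IT critical power
    "TB": 25.0,         # per terabyte of storage
    "Mbps": 4.0,        # per megabit/s of reserved bandwidth
}

usage = {               # measured/reserved consumption per department (assumed)
    "Claims":    {"kW": 120, "TB": 400, "Mbps": 500},
    "Analytics": {"kW": 300, "TB": 900, "Mbps": 200},
    "Web":       {"kW": 80,  "TB": 150, "Mbps": 1500},
}

def mru_charge(dept_usage, rates):
    """Return the monthly charge for one department's measured usage."""
    return sum(dept_usage[metric] * rate for metric, rate in rates.items())

if __name__ == "__main__":
    for dept, consumption in usage.items():
        print(f"{dept:10s} US${mru_charge(consumption, UNIT_RATES):,.0f}/month")
```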
PREPARING FOR CHARGEBACK IMPLEMENTATION
If an organization is to successfully implement chargeback, it must choose the method that best fits its objectives and apply the method with rigor and consistency. Executive buy-in is critical. Without top-down leadership, chargeback initiatives often fail to take hold. It is human nature to resist accountability and extra effort, so leadership is needed to ensure that chargeback becomes an integral part of the business operations.
To start, it’s important that an organization know its infrastructure capital expense (CapEx) and OpEx costs. Measuring, tracking, reporting, and questioning these costs, and acting on that information to base investment and operating decisions on real costs, is critical to becoming an efficient IT organization. To understand CapEx costs, organizations should consider the following:
• Facility construction or acquisition
• Power and cooling infrastructure equipment: new, replacement, or upgrades
• IT hardware: server, network, and storage hardware
• Software licenses, including operating system and application software
• Racks, cables: initial costs (i.e., items installed in the initial setup of the data room)
OpEx incorporates all the ongoing costs of running an IT facility. These costs are ultimately larger than CapEx in the long run and include:
• FTE/payroll
• Utility expenses
• Critical facility maintenance (e.g., critical power and cooling, fire and life safety, fuel systems)
• Housekeeping and grounds (e.g., cleaning, landscaping, snow removal)
• Disposal/recycling
• Lease expenses
• Hardware maintenance
• Other facility fees such as insurance, legal, and accounting fees
• Office charges (e.g., telephones, PCs, office supplies)
• Depreciation of facility assets
• General building maintenance (e.g., office area, roof, plumbing)
• Network expenses (in some circumstances)
The first three items (FTE/payroll, utilities, and critical facility maintenance) typically make up the largest portion of these costs. For example, utilities can account for a significant portion of the IT budget. If IT is operated in a colocation environment, the biggest costs could be lease expenses. The charges from a colocation provider typically will include all the other costs, often negotiated. For enterprise-owned data centers, all these OpEx categories can fluctuate monthly depending on activities, seasonality, maintenance schedules, etc. Organizations can still budget and plan for OpEx effectively, but it takes an awareness of fluctuations and expense patterns.
At a fundamental level, the goal is to identify resource consumption by consumer, for example the actual kilowatts per department. More sophisticated resource metrics might include the cost of hardware installation (moves, adds, changes) or the cost per maintenance ticket. For example, in the healthcare industry, applications for managing patient medical data are typically large and energy intensive. If 50% of a facility’s servers are used for managing patient medical data, the company could determine the kilowatts per server and multiply total OpEx by the percentage of total IT critical power used for this activity as a way to allocate costs. If 50% of its servers are only using 30% of the total IT critical load, then it could use 30% to determine the allocation of data center operating costs. The closer the data can get to representing actual IT usage, the better.
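As a worked version of that example, the short sketch below compares the two allocation bases; the annual OpEx figure is assumed purely for illustration, since the article does not give one.

```python
# Worked allocation for the healthcare example above.
# The annual OpEx figure is an assumption for illustration only.

annual_opex = 4_000_000          # assumed total data center OpEx, US$
share_of_servers = 0.50          # 50% of servers run the patient-data apps
share_of_it_load = 0.30          # ...but they draw only 30% of IT critical load

by_server_count = annual_opex * share_of_servers    # US$2,000,000
by_measured_load = annual_opex * share_of_it_load   # US$1,200,000

print(f"Allocated by server count:     US${by_server_count:,.0f}")
print(f"Allocated by measured IT load: US${by_measured_load:,.0f}")
# The measured-load figure better reflects actual consumption, which is the
# point above: the closer the data gets to actual IT usage, the better.
```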
An organization that can compile this type of data for about 95% of its IT costs will usually find it sufficient for implementing a very effective chargeback program. It isn’t necessary for every dollar to be accounted for. Expense allocations will be closely proportional to the kilowatts and/or bandwidth consumed and reserved by each user organization. Excess resources typically are absorbed proportionally by all. Even IT staff costs can be allocated by tracking and charging their activity to different customers using timesheets, or by headcount where staff is dedicated to specific customers.
Another step in preparing an organization to adopt an IT chargeback methodology is defining service levels. What’s key is setting expectations appropriately so that end users, just like customers, know what they are getting for what they are paying. Defining uptime (e.g., Tier level such as Tier III Concurrent Maintainability or Tier IV Fault Tolerant infrastructure or other uptime and/or downtime requirements, if any), and outlining a detailed service catalog are important.
IT CHARGEBACK DRIVES EFFICIENT IT
Adopting an IT chargeback model may sound daunting, and doing so does take some organizational commitment and resources, but the results are worthwhile. Organizations that have implemented IT chargeback have experienced reductions in resource consumption due to increased customer accountability, and higher, more efficient utilization of hardware, space, power, and cooling due to a reduction in servers. IT chargeback brings a new, enterprise-wide focus on lowering data center infrastructure costs with diverse teams working together from the same transparent data to achieve common goals, now possible because everyone has “skin in the game.”
Essentially, achieving efficient IT outcomes demands a “follow the money” mindset. IT chargeback drives a holistic approach in which optimizing data center and IT resource consumption becomes the norm. A chargeback model also helps to propel organizational maturity, as it drives the need for more automation and integrated monitoring, for example the use of a DCIM system. To collect data and track resources and key performance indicators manually is too tedious and time consuming, so stakeholders have an incentive to improve automated tracking, which ultimately improves overall business performance and effectiveness.
IT chargeback is more than just an accounting methodology; it helps drive the process of optimizing business operations and efficiency, improving competitiveness and adding real value to support the enterprise mission.
IT CHARGEBACK DOs AND DON’Ts
On 19 May 2015, Uptime Institute gathered a group of senior stakeholders for the Executive Assembly for Efficient IT. The group was composed of leaders from large financial, healthcare, retail, and web-scale IT organizations, and the purpose of the meeting was to share experiences, success stories, and challenges to improving IT efficiency.
At Uptime Institute’s 2015 Symposium, executives from leading data center organizations that have implemented IT chargeback discussed the positive results they had achieved. They also shared the following recommendations for companies considering adopting an IT chargeback methodology.
DO:
• Partner with the Finance department. Finance has to completely buy in to implementing chargeback.
• Inventory assets and determine who is using them. A complete inventory of the number of data centers, number of servers, etc., is needed to develop a clear picture of what is being used.
• Chargeback needs strong senior-level support; it will not succeed as a bottom-up initiative. Similarly, don’t try to implement it from the field. Insist that C-suite representatives (COO/CFO) visit the data center so the C-suite understands the concept and requirements.
• Focus on cash management as the goal, not finance issues (e.g., depreciation) or IT equipment (e.g., server models and UPS equipment specifications). Know the audience, and get everyone on the same page talking about straight dollars and cents.
• Don’t give teams too much budget—ratchet it down. Force departments to make trade-offs so they begin to make smarter decisions.
• Build a dedicated team to develop the chargeback model. Then show people the steps and help them understand the decision process.
• Data is critical: show all the data, including data from the configuration management database (CMDB), in monthly discussions.
• Be transparent to build credibility. For example, explain clearly, “Here’s where we are and here’s where we are trying to get to.”
• Above all, communicate. People will need time to get used to the idea.
DON’T:
• Don’t try to drive chargeback from the bottom up.
• Simpler is better: don’t overcomplicate the model. Simplify the rules and prioritize; don’t get hung up perfecting every detail because it doesn’t save much money. Approximations can be sufficient.
• Don’t move too quickly: start with showback. Test it out first; then, move to chargeback.
• To get a real return, get rid of the old hardware. Move quickly to remove old hardware when new items are purchased. The efficiency gains are worth it.
• The most challenging roadblocks can turn out to be the business units themselves. Organizational changes might need to go to the second level within a business unit if it has functions and layers under it that should be treated separately.
Scott Killian
Scott Killian joined the Uptime Institute in 2014 and currently serves as VP for Efficient IT Program. He surveys the industry for current practices and develops new products to facilitate industry adoption of best practices. Mr. Killian directly delivers consulting at the site management, reporting, and governance levels. He is based in Virginia.
Prior to joining Uptime Institute, Mr. Killian led AOL’s holistic resource consumption initiative, which resulted in AOL winning two Uptime Institute Server Roundups for decommissioning more than 18,000 servers and reducing operating expenses more than US$6 million. In addition, AOL received three awards in the Green Enterprise IT (GEIT) program. AOL accomplished all this in the context of a five-year plan developed by Mr. Killian to optimize data center resources, which saved US$17 million annually.
Australian Colo Provider Achieves High Reliability Using Innovative Techniques
NEXTDC deploys new isolated parallel diesel rotary uninterruptible power supply systems and other innovative technologies
By Jeffrey Van Zetten
NEXTDC’s colocation data centers in Australia incorporate innovation in engineering design, equipment selection, commissioning, testing, and operation. This quality-first philosophy saw NEXTDC become one of 15 organizations globally to win a 2015 Brill Award for Efficient IT. NEXTDC’s B1 Brisbane, M1 Melbourne, S1 Sydney, C1 Canberra, and P1 Perth data centers (see Figures 1-3) have a combined 40-megawatt (MW) IT load (see Figure 4).
Figure 1. Exterior of NEXTDC’s 11.5-MW S1 Sydney data center, which is Uptime Institute Tier III Design Documents and Constructed Facility Certified
Figure 2. Exterior of NEXTDC’s 5.5-MW P1 Perth data center, which is Uptime Institute Tier III Design Documents and Constructed Facility Certified
Figure 3. Exterior of NEXTDC’s 12-MW M1 Melbourne data center, which is Uptime Institute Tier III Design Documents Certified
Figure 4. NEXTDC’s current footprint: 40+ MW IT capacity distributed across Australia
In order to accomplish the business goals NEXTDC established, it set the following priorities:
1. High reliability, so that clients can trust NEXTDC facilities with their mission critical IT equipment
2. The most energy-efficient design possible, especially where it can also assist reliability and total cost of ownerships, but not at the detriment of high reliability
3. Efficient total cost of ownership and day one CapEx by utilizing innovative design and technology
4. Capital efficiency and scalability to allow flexible growth and cash flow according to demand
5. Speed to market, as NEXTDC was committed to build and open five facilities within just a few years using a small team across the entire 5,000-kilometer-wide continent, to be the first truly national carrier-neutral colocation provider in the Australian market
6. Flexible design suitable for the different regions and climates of Australia, ranging from subtropical to near desert.
NEXTDC put key engineering design decisions for these facilities through rigorous engineering decision matrices that weighed and scored the risks, reliability, efficiency, maintenance, total cost of ownership, day one CapEx, full final day CapEx, scalability, vendor local after-sales support, delivery, and references. The company extensively examined all the possible alternative designs to obtain accurate costing and modeling. NEXTDC Engineering and management worked closely to ensure the design would be in accord with the driving brief and the mandate on which the company is based.
NEXTDC also carefully selected the technology providers, manufacturers, and contractors for its projects. This scrutiny was critical, as the quality of local support in Australia can vary from city to city and region to region. NEXTDC paid as much attention to the track record of after-sales service teams as to the initial service or technology.
“Many companies promote innovative technologies; however, we were particularly interested in the after-sales support and the track record of the people who would support the technology,” said Mr. Van Zetten. “We needed to know if they were a stable and reliable team and had in-built resilience and reliability, not only in their equipment, but in their personnel.”

NEXTDC’s Perth and Sydney data centers successfully achieved Uptime Institute Tier III Certification of Design Documents (TCDD) and Tier III Certification of Constructed Facilities (TCCF) using Piller’s isolated parallel (IP) diesel rotary uninterruptible power supply (DRUPS) system. A thorough and exhaustive engineering analysis was performed on all available electrical system design options and manufacturers, including static uninterruptible power supply (UPS) designs with distributed redundant and block redundant distribution, along with more innovative options such as the IP DRUPS solution. Final scale and capacity was a key design input for the decision; indeed, for smaller-scale data centers a more traditional static UPS design is still favored by NEXTDC. For facilities larger than 5 MW, the IP DRUPS allows NEXTDC to:
• Eliminate batteries, which fail after 5 to 7 years, causing downtime and loss of redundancy, and can cause hydrogen explosions
• Eliminate the risks of switching procedures, as human error causes most failures
• Maintain power to both A & B supplies without switching, even if more than one engine-generator set or UPS is out of service
• Eliminate problematic static switches.
NEXTDC benefits because:
• If a transformer fails, only the related DRUPS engine generator needs to start. The other units in parallel can all remain on mains [editor’s note: incoming utility] power.
• Electrically decoupled cabinet rotary UPS are easier to maintain, providing less downtime and more long-term reliability, which reduces the total cost of ownership.
• The N+1 IP DRUPS maintain higher loaded UPS/engine generators to reduce risk of cylinder glazing/damage at low and growing loads.
• Four levels of independently witnessed, loaded integrated systems testing were applied to verify the performance.
• The IP topology shares the +1 UPS capacity across the facility and enables fewer UPS to run at higher loads for better efficiency.
• The rotary UPSs utilize magnetic-bearing, helium-gas enclosures for low friction and optimal efficiency.
• The IP allows scalable installation of engine generators and UPS.
For example, the 11.5-MW S1 Sydney data center is based on 12+1 1,336-kilowatt (kW) continuous-rated Piller DRUPS with 12+1 IP power distribution boards. The facility includes sectionalized fire-segregated IP and main switchboard rooms. This design ensures that a failure of any one DRUPS, IP, or main switchboard does not cause a data center failure. The return ring IP bus is also fire segregated.
Figure 5. Scalable IP overall concept design
Differential protection also provides a level of Fault Tolerance. Because the design is scalable, NEXTDC is now increasing the system to a 14+1 DRUPS and IP design to raise the final design capacity from 11.5 to 13.8 MW of IT load to meet rapid growth. All NEXTDC stakeholders, especially those with significant operational experience, were particularly keen to eliminate the risks associated with batteries, static switches, and complex facilities management switching procedures. The IP solution successfully eliminated these risks, with additional benefits for CapEx and OpEx efficiency.

From a CapEx perspective, the design allows a common set of N+1 DRUPS units to be deployed based on actual IT load for the entire facility (see Figure 5). From an OpEx perspective, the design always operates in an N+1 configuration across the entire facility to match actual IT load, so each unit is maintained at a higher load percentage and thus at efficiencies approaching 98%, whereas lightly loaded UPS in a distributed redundant design, for example, can often have actual efficiencies of less than 95%. Operating engine-generator sets at higher loads also reduces the risks of engine cylinder glazing and damage, further reducing risks and maintenance costs (see Figure 6).
Figure 6. Distribution of the +1 IP DRUPS across more units provides higher load and thus efficiency
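A back-of-the-envelope sketch shows why the shared +1 capacity keeps individual units better loaded. The per-unit rating and design IT load come from the figures quoted above; the distributed redundant comparison point is an assumption for illustration, not measured data.

```python
# Back-of-the-envelope loading figures for the S1 IP bus.

unit_rating_kw = 1336        # continuous rating per DRUPS unit (from the article)
installed_units = 12 + 1     # N+1 units sharing the IP ring
design_it_load_kw = 11_500   # 11.5-MW design IT load

def per_unit_load_fraction(it_load_kw, units, rating_kw):
    """Average load fraction when all installed units share the bus."""
    return it_load_kw / (units * rating_kw)

ip_fraction = per_unit_load_fraction(design_it_load_kw, installed_units, unit_rating_kw)
print(f"IP N+1 sharing: each unit runs at roughly {ip_fraction:.0%} of rating")

# Assumed contrast: in a 2N/distributed redundant static UPS layout, each
# module is typically held near or below 50% of rating so its partner can
# absorb the full load after a failure.
print("Distributed redundant (assumed): modules capped near 50% of rating")
```

Running units at roughly two-thirds of rating rather than half or less is what pushes efficiency toward the approximately 98% figure cited above, versus less than 95% for lightly loaded UPS.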
NEXTDC repeated integrated systems tests four times. Testing, commissioning, and tuning are the keys to delivering a successful project. Each set of tests—by the subcontractors, NEXTDC Engineering, independent commissioning agent engineers, and those required for Uptime Institute TCCF—helped to identify potential improvements, which were immediately implemented (see Figure 7). In particular, the TCCF review identified some improvements that NEXTDC could make to Piller’s software logic so that the control became truly distributed, redundant, and Concurrently Maintainable. This improvement ensured that even the complete failure of any panel in the entire system would not cause loss of N IP and main switchboards, even if the number of DRUPS is fewer than the number of IP main switchboards installed. This change improves CapEx efficiency without adding risks. The few skeptics we had regarding Uptime Institute Tier Certification became believers once they saw the professionalism, thoroughness, and helpfulness of Uptime Institute professionals on site.
Figure 7. Concurrently maintainable electrical design utilized in NEXTDC’s facilities
From an operational perspective, NEXTDC found that eliminating static switches and complex switching procedures for the facilities managers also reduced risk and delivered optimal uptime in reality.
MECHANICAL SYSTEMS
The mechanical system designs and equipment were also run through equally rigorous engineering decision matrices, which assessed the overall concept designs and supported the careful selection of individual valves, testing, and commissioning equipment. For example, the final design of the S1 facility includes 5+1 2,700-kW (cooling), high-efficiency, water-cooled, magnetic oil-free bearing, multi-compressor chillers in an N+1 configuration and received Uptime Institute Tier III Design Documents and Constructed Facility Certifications. The chillers are supplemented by both water-side and air-side free-cooling economization with Cold Aisle containment and electronically commutated (EC) variable-speed CRAC fans. Primary/secondary pump configurations are utilized, although a degree of primary variable flow control is implemented for significant additional energy savings.

Furthermore, NEXTDC implemented extreme oversight on testing and commissioning and continues to work with the facilities management teams to carefully tune and optimize the systems. This reduces not only energy use but also wear on critical equipment, extending equipment life, reducing maintenance, and increasing long-term reliability. The entire mechanical plant is supported by the IP DRUPS for continuously available compressor cooling even in the event of a mains power outage. This eliminates the risks associated with buffer tanks and chiller/compressor restarts that occur on most conventional static-UPS-supported data centers and are a common cause of facility outages.
Figure 8. Multi-magnetic bearing, oil-free, low-friction compressor chillers
The central cooling plant achieved its overall goals because of the following additional key design decisions:
• Turbocor magnetic oil-free bearing, low-friction compressors developed in Australia provide both reliability and efficiency (see Figure 8).
• Multi-compressor chillers provide redundancy within the chillers and improved part-load operation.
• Single compressors can be replaced while the chiller keeps running.
• N+1 chillers are utilized to increase thermal transfer area for better part-load coefficient of performance (COP) and Fault Tolerance, as the +1 chiller is already on-line and operating should one chiller fail.
• Variable-speed drive, magnetic-bearing, super-low-friction chillers provide leading COPs.
• Variable number of compressors can optimize COPs.
• Seasonal chilled water temperature reset enables even higher COPs and greater free economization in winter.
• Every CRAC is fitted with innovative pressure-independent self-balancing characterized control valves (PICCV) to ensure no part of the system is starved of chilled water during scalable dynamic staged expansions, and also to ensure minimal flow per IT power to minimize pumping energy.
• Variable speed drives (VSDs) are installed on all pumps for less wear and reduced failure.
• 100% testing, tuning, commissioning, and independent witnessing of all circuits, and minimization of pump ∆P for reduced wear.
• High ∆T and return water temperatures optimize water-side free cooling.
• High ∆T optimizes seasonal water reset free-cooling economization.
The cooling systems utilize evaporative cooling, which takes advantage of Australia’s climate, with return water precooling heat exchangers that remove load from the chiller compressors for more efficient overall plant performance. The implementation of the water-side and air-side free economization systems is a key to the design.
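As a rough illustration of the control decision behind water-side economization, the sketch below picks a cooling mode from the outdoor wet-bulb temperature. The approach temperatures, setpoints, and mode names are assumptions for illustration and are not NEXTDC's published control logic.

```python
# Rough sketch of a water-side economizer mode decision, assuming a cooling
# tower and precooling heat exchanger ahead of the chillers.
# Setpoints and approach temperatures are illustrative assumptions.

TOWER_APPROACH_C = 4.0   # tower water leaves ~4 C above outdoor wet bulb (assumed)
HX_APPROACH_C = 1.5      # plate heat exchanger approach (assumed)

def economizer_mode(outdoor_wet_bulb_c, return_water_c, supply_setpoint_c):
    """Pick a cooling mode based on how cold the tower water can get."""
    achievable_c = outdoor_wet_bulb_c + TOWER_APPROACH_C + HX_APPROACH_C
    if achievable_c <= supply_setpoint_c:
        return "full free cooling (chillers off)"
    if achievable_c < return_water_c:
        return "partial free cooling (precool return water, chillers trimmed)"
    return "mechanical cooling only"

if __name__ == "__main__":
    for wb in (8, 14, 22):   # example outdoor wet-bulb temperatures, C
        print(f"wet bulb {wb:>2} C ->",
              economizer_mode(wb, return_water_c=24.0, supply_setpoint_c=16.0))
```

The partial mode in this sketch only pays off when the achievable water temperature is below the return water temperature, which is why the high ∆T and elevated return water temperatures listed above extend the free-cooling window.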
Very early smoke detection apparatus (VESDA) air-quality automatic damper shutdown is designed and tested along the facility’s entire façade. Practical live witness testing was performed with smoke bombs directed at the façade intakes, using a crane to simulate the worst possible bush-fire event, with a sudden change of wind direction, to ensure that false discharges of the gas suppression system could be mitigated.
The free-cooling economization systems provide the following benefits to reliability and efficiency (see Figures 9-12):
• Two additional cooling methods provide backup in addition to chillers for most of the year.
• Reduced running time on chillers and pumps extends their life and reduces failure and maintenance.
• The design is flexible to use either water side or air side depending on geographic locations and outside air quality.
• Actual results have proven reduced total cooling plant energy use.
• Reduced loads on chillers provide even better chiller COPs at partial loads.
• Reduced pump energy is achieved when air-side economization is utilized.
Figure 9. (Top Left) Water-side free-cooling economization design principle. Note: not all equipment shown for simplicity
Figure 10. (Top Right) Water-side free-cooling economization design principle
Figure 11a-d. Air-side free-cooling economization design principle
Figure 12. Air-side free-cooling economization actual results
WHITE SPACE
NEXTDC’s design consultants specified raised floor in the first two data rooms in the company’s M1 Melbourne facility (the company’s first builds) as a means of supplying cold air to the IT gear. A Hot Aisle containment system prevents intermixing and returns hot air to the CRACs via chimneys and an overhead return hot air plenum. This system minimizes fan speeds, reducing wear and maintenance. Containment also makes it simpler to run the correct number of redundant fans, which provides a high level of redundancy and, due to fan laws, reduces fan wear and maintenance. At NEXTDC, containment also means higher return air temperatures, which enables more air-side economization and energy efficiency, supported by an innovative, in-house floor grille management tool that minimizes fan energy according to IT load (see Figure 13). For all later builds, however, NEXTDC chose Cold Aisle containment to eliminate the labor costs and time needed to build the overhead plenum and chimneys of the Hot Aisle containment system, which shortened the payback period and improved return on investment. NEXTDC now specifies Cold Aisle containment in all its data centers.
Figure 13. Cold Aisle containment is mandated design for all new NEXTDC facilities to minimize CRAC fan energy
The common-sense implementation of containment has proved to be worthwhile and enabled genuine energy savings. Operational experience suggested, however, that containment alone captures only part of the total possible energy savings. To capture even more savings, NEXTDC Engineering developed a program that uses the actual contracted loads and data from PDU branch circuit monitoring to automatically calculate the ideal floor grille balance for each rack. This intelligent system tuning saved an additional 60% of NEXTDC’s CRAC fan power by increasing air-side ∆T and reducing airflow rates (see Figure 14).
Figure 14. Innovative floor grille tuning methods applied in conjunction with containment yielded significant savings
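The physics behind that result can be sketched as follows: the sensible-heat relation gives the airflow a rack actually needs for a target air-side ∆T, and the fan affinity law shows why reducing airflow cuts fan power disproportionately. The rack load and ∆T values below are hypothetical, and this is only a sketch of the principle, not NEXTDC's in-house tool.

```python
# Sketch of the airflow calculation behind grille tuning (illustrative only).
# Sensible heat: airflow [m^3/s] = kW / (rho * cp * deltaT),
# with rho ~1.2 kg/m^3 and cp ~1.005 kJ/(kg.K) for air.

RHO_CP = 1.2 * 1.005   # ~1.206 kJ/(m^3.K)

def required_airflow_m3s(rack_kw, delta_t_c):
    """Airflow needed to remove rack_kw at the target air-side delta-T."""
    return rack_kw / (RHO_CP * delta_t_c)

def fan_power_ratio(new_flow, old_flow):
    """Fan affinity law: power scales roughly with the cube of airflow."""
    return (new_flow / old_flow) ** 3

if __name__ == "__main__":
    rack_kw = 6.0                                        # hypothetical rack load
    old = required_airflow_m3s(rack_kw, delta_t_c=8.0)   # loosely tuned grilles
    new = required_airflow_m3s(rack_kw, delta_t_c=12.0)  # tighter containment
    print(f"airflow at deltaT  8 C: {old:.2f} m^3/s")
    print(f"airflow at deltaT 12 C: {new:.2f} m^3/s")
    print(f"fan power falls to ~{fan_power_ratio(new, old):.0%} of the original")
```

In this illustration, raising the air-side ∆T by half cuts the required airflow by a third and, via the cube law, drops fan power to roughly 30% of its original value, the same mechanism behind the savings reported above.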
NEXTDC also learned not to expect mechanical subcontractors to have long-term operational expenses and energy bills as their primary concern. NEXTDC installed pressure/temperature test points across all strainers and equipment and specified that all strainers had to be cleaned prior to commissioning. During the second round of tests, NEXTDC Engineering found that secondary pump differential pressures and energy were almost double what they theoretically should be. Using its own testing instruments, NEXTDC Engineering determined that some critical strainers on the index circuit had in fact been dirtied with construction debris—the contractors had simply increased the system differential pressure setting to deliver the correct flow rates and specified capacity. After cleaning the relevant equipment, the secondary pump energy was almost halved. NEXTDC would have paid the price for the next 20 years had these thorough checks not been performed.
Similarly, primary pumps and the plant needed to be appropriately tuned and balanced based on actual load, as the subcontractors had a tendency to set up equipment to ensure capacity but not minimal energy consumption. IT loads are very stable, so it is possible to adjust the primary flow rates and still maintain N+1 redundancy, thanks to pump laws—with massive savings on pump energy. The system was designed with pressure-independent self-balancing control valves and testing and commissioning sets to ensure scalable, efficient flow distribution and high water-side ∆Ts to enable optimal use of water-side free-cooling economization. The challenge then was personally witnessing all flow tests to ensure that the subcontractors had correctly adjusted the equipment. Another lesson learned was that a single flushing bypass left open by a contractor can seriously reduce the return water temperature and, if not tracked down and resolved during commissioning, prevent the water-side economization system from operating entirely. Hunting down all such incorrect bypasses helped to increase the return water temperature by almost 11ºF (6ºC) for a massive improvement in economization.
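The pump laws referred to here are the affinity laws: flow scales with pump speed, head with speed squared, and power roughly with speed cubed, so a modest flow reduction yields a large energy saving. A minimal sketch, with an assumed pump rating and arbitrary flow fractions rather than NEXTDC data:

```python
# Pump affinity laws: for a given impeller, flow ~ speed, head ~ speed^2,
# power ~ speed^3. The rating and flow fractions below are illustrative.

def pump_power_kw(rated_power_kw, flow_fraction):
    """Approximate shaft power at a reduced flow via the cube law."""
    return rated_power_kw * flow_fraction ** 3

if __name__ == "__main__":
    rated_kw = 45.0          # hypothetical primary pump rating
    for flow in (1.0, 0.8, 0.6):
        print(f"{flow:.0%} flow -> ~{pump_power_kw(rated_kw, flow):.1f} kW")
```

Trimming flow to 80% of design in this sketch roughly halves pump power, which is why balancing to actual IT load rather than nameplate capacity matters so much.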
Figure 15. Energy saving trends – actual typical results achieved for implementation
Operational tuning through the first year, with the Engineering and facilities management teams comparing actual trends to the theoretical design model, provided savings exceeding even NEXTDC’s optimistic hopes. Creating clear and simple procedures with the facilities management teams and running carefully overseen, trended trials was critical before rolling out these initiatives nationally. Every tuning initiative implemented nationally after a facility’s go-live date is trended, recorded, and collated into a master national energy savings register. Figure 15 provides just a few examples. Tuning has so far yielded a 24% reduction in the power consumption of the mechanical plant, with still-conservative safety factors. Over time, with additional trend data and machine learning, power consumption is expected to improve considerably through continuous improvement. NEXTDC expects a further 10–20% saving and is on target to operate Australia’s first National Australian Built Environment Rating System (NABERS) 5-star-rated mega data centers.
The design philosophy didn’t end with the electrical and mechanical cooling systems, but also applied to the hydraulics and fire protection systems:
• Rainwater collection is implemented on site to supply cooling towers, which provides additional hours of water most of the year.
• The water tanks are scalable.
• Rainwater collection minimizes mains water consumption.
• VESDA laser optical early detection developed in Australia and licensed internationally was utilized.
• The design completely eliminated water-based sprinkler systems from within the critical IT equipment data halls, instead utilizing IG55 inert-gas suppression, so that IT equipment can continue to run even if a single server has an issue (see Figure 16). Water-based pre-action sprinklers risk catastrophic damage to IT equipment that is not suffering from an over-heating or fire event, risking unnecessary client IT outages.
• The gas suppression system is facility staff friendly, unlike alternatives that dangerously deplete oxygen levels in the data hall.
• The design incorporates a fully standby set of gas suppression bottle banks onsite.
• The gas suppression bottle banks are scalable.
• The IG55 advanced gas suppression is considered one of the world’s most environmentally friendly gas suppression systems.
Figure 16. Rainwater, environmentally friendly inert gas fire-suppression and solar power generation innovations
The design of NEXTDC’s data centers is one of the fundamental reasons IT, industrial, and professional services companies are choosing NEXTDC as a colocation data center partner for the region. This has resulted in very rapid top and bottom line financial growth, leading to profitability and commercial success in just a few years. NEXTDC was named Australia’s fastest-growing communications and technology company at Deloitte Australia’s 2014 Technology Fast 50 awards. Mr. Van Zetten said, “What we often found was that when innovation was primarily sought to provide improved resilience and reliability, it also provided improved energy efficiency, better total cost of ownership, and CapEx efficiency. The IP power distribution system is a great example of this. Innovations that were primarily sought for energy efficiency and total cost of ownership likewise often provided higher reliability. The water-side and air-side economization free cooling are great examples. Not only do they reduce our power costs, they also provide legitimate alternative cooling redundancy for much of the year and reduce wear and maintenance on chillers, which improves overall reliability for the long term. Cold Aisle containment, which was primarily sought to reduce fan energy, eliminates client problems associated with air mixing and bypassing, thus providing improved client IT reliability.”
Jeffrey Van Zetten
Jeffrey Van Zetten has been involved with NEXTDC since it was founded in 2010 as Australia’s first national data center company. He is now responsible for the overall design, commissioning, Uptime Institute Tier III Certification process, on-going performance, and energy tuning for the B1 Brisbane, M1 Melbourne, S1 Sydney, C1 Canberra, and P1 Perth data centers. Prior to joining NEXTDC, Mr. Van Zetten was based in Singapore as the Asia Pacific technical director for a leading high-performance, buildings technology company. While based in Singapore, he was also the lead engineer for a number of successful energy-efficient high tech and mega projects across Asia Pacific, such as the multi-billion dollar Marina Bay Sands. Mr. Van Zetten has experience in on-site commissioning and troubleshooting data center and major projects throughout Asia, Australia, Europe, North America, and South America.