Moscow State University Meets HPC Demands

HPC supercomputing and traditional enterprise IT facilities operate very differently
By Andrey Brechalov

In recent years, the need to solve complex problems in science, education, and industry, including the fields of meteorology, ecology, mining, engineering, and others, has added to the demand for high-performance computing (HPC). The result is the rapid development of HPC systems, or supercomputers, as they are sometimes called. The trend shows no signs of slowing, as the application of supercomputers is constantly expanding. Research today requires ever more detailed modeling of complex physical and chemical processes, global atmospheric phenomena, and the behavior of distributed systems in dynamic environments. Supercomputer modeling delivers strong results in these areas and others at relatively low cost.

Supercomputer performance can be described in petaflops (pflops), with modern systems operating at tens of pflops. However, performance improvements cannot be achieved solely by increasing the number of existing computing nodes in a system, due to weight, size, power, and cost considerations. As a result, designers of supercomputers attempt to improve their performance by optimizing their architecture and components, including interconnection technologies (networks) and by developing and incorporating new types of computing nodes having greater computational density per unit of area. These higher-density nodes require the use of new (or well-forgotten old) and highly efficient methods of removing heat. All this has a direct impact on the requirements for site engineering infrastructure.

HPC DATA CENTERS
Supercomputers can be described as a collection of interlinked components and assemblies—specialized servers, network switches, storage devices, and links between the system and the outside world. All this equipment can be placed in standard or custom racks, which require conditions of power, climate, security, etc., to function properly—just like the server-based IT equipment found in more conventional facilities.

Low- or medium-performance supercomputers can usually be placed in general purpose data centers and even in server rooms, as they have infrastructure requirements similar to other IT equipment, apart from somewhat higher power density. There are even supercomputers for workgroups that can be placed directly in an office or lab. In most cases, any data center designed to accommodate high power density zones should be able to host one of these supercomputers.

On the other hand, powerful supercomputers usually get placed in dedicated rooms or even buildings that include unique infrastructure optimized for a specific project. These facilities are broadly similar to general-purpose data centers. However, dedicated facilities for powerful supercomputers host a great deal of high power density equipment, packed closely together. As a result, these facilities must use techniques suitable for removing the higher heat loads. In addition, the composition and characteristics of IT equipment for an HPC data center are known before site design begins, and the configuration does not change, or changes only subtly, during the facility's lifetime, except for planned expansions. Thus it is possible to define an HPC data center as a data center intended specifically for the placement of a supercomputer.

Figure 1. Types of HPC data center IT equipment


The IT equipment in an HPC data center built using the currently popular cluster-based architecture can generally be divided into two types, each having its own requirements for engineering infrastructure Fault Tolerance and component redundancy (see Figure 1 and Table 1).

Table 1. Types of IT equipment located in HPC data centers


The difference in the requirements for redundancy for the two types of IT equipment is because applications running on a supercomputer usually have a reduced sensitivity to failures of computational nodes, interconnection leaf switches, and other computing equipment (see Figure 2). These differences enable HPC facilities to incorporate segmented infrastructures that meet the different needs of the two kinds of IT equipment.

Figure 2. Generic HPC data center IT equipment and engineering infrastructure.


ENGINEERING INFRASTRUCTURE FOR COMPUTATIONAL EQUIPMENT
Supercomputing computational equipment usually incorporates cutting-edge technologies and has extremely high power density. These features affect the specific requirements for engineering infrastructure. In 2009, more than 380 processors, each having a thermal design power (TDP) of 95 watts, were placed in one standard 42U (19-inch) cabinet at Moscow State University’s (MSU) Lomonosov Data Center, for a total of 36 kilowatts per rack (kW/rack). Adding 29 kW/rack for auxiliary components, such as motherboards, fans, and switches, brings the total power requirement to 65 kW/rack. Power density for such air-cooled equipment has remained at about that level since.
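The rack power budget quoted above can be reproduced with simple arithmetic (a sketch using the article's own figures; the breakdown of the auxiliary load is not itemized in the source):

```python
# Rack power budget for the 2009 Lomonosov configuration described above.
cpus_per_rack = 380
tdp_watts = 95

cpu_load_kw = cpus_per_rack * tdp_watts / 1000   # processors alone
aux_load_kw = 29                                  # motherboards, fans, switches, etc.
total_kw_per_rack = cpu_load_kw + aux_load_kw

print(f"CPU load: {cpu_load_kw:.1f} kW/rack")    # ~36 kW, matching the article
print(f"Total:    {total_kw_per_rack:.1f} kW/rack")  # ~65 kW, matching the article
```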

On the other hand, failures of computing IT equipment do not cause the system as a whole to fail, because of the cluster architecture of supercomputers and software features. For example, job checkpointing and automated job restart enable applications to isolate computing hardware failures, and the task management software ensures that applications use only operational nodes even when some faulty or disabled nodes are present. Therefore, although failures in engineering infrastructure segments that serve computational IT equipment increase the time required to complete computing tasks, these failures do not lead to a catastrophic loss of data.
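The checkpoint-and-restart behavior described above can be sketched as a toy simulation (this is illustrative only, not T-Platforms' actual scheduler; the function and parameter names are invented for the example):

```python
# Minimal sketch of checkpoint/restart: on a node failure the job rolls
# back to its last checkpoint, the scheduler marks the node faulty, and
# work continues on the remaining healthy nodes.
def run_job(steps, nodes, checkpoint_every, failures):
    """failures maps a step number to the node that dies at that step."""
    healthy = set(nodes)
    checkpoint = 0
    step = checkpoint
    restarts = 0
    while step < steps:
        if step in failures and failures[step] in healthy:
            healthy.discard(failures[step])   # isolate the faulty node
            step = checkpoint                 # roll back; the job is not lost
            restarts += 1
            continue
        step += 1
        if step % checkpoint_every == 0:
            checkpoint = step                 # persist progress

    return restarts, healthy

restarts, healthy = run_job(steps=100, nodes={"n1", "n2", "n3"},
                            checkpoint_every=10, failures={42: "n2"})
print(restarts, sorted(healthy))   # job completes despite losing n2
```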

Supercomputing computational equipment usually operates on single or N+1 redundant power supplies, with the same level of redundancy throughout the electric circuit to the power input. In the case of a single hardware failure, segmentation of the IT loads and supporting equipment limits the effects of the failure to only a part of IT equipment.

Customers often decline to install standby-rated engine-generator sets, relying completely on utility power. In these cases, the electrical system design is defined by the time required for normal IT equipment shutdown, and the UPS system mainly rides through brief interruptions (a few minutes) in utility power.

Cooling systems are designed to meet similar requirements. In some cases, owners will lower redundancy and increase segmentation without significant loss of operational qualities to optimize capital expense (CapEx) and operations expense (OpEx). However, the more powerful supercomputers expected in the next few years will require the use of technologies, including partial or full liquid cooling, with greater heat removal capacity.

OTHER IT EQUIPMENT
Auxiliary IT equipment in an HPC data center includes air-cooled servers (sometimes as part of blade systems), storage systems, and switches in standard 19-inch racks that only rarely reach a power density of 20 kW/rack. Uptime Institute’s annual Data Center Survey reports that typical densities are less than 5 kW/rack.

This equipment is critical to cluster functionality and, therefore, usually has redundant power supplies (most commonly N+2) that draw power from independent sources. Where hardware cannot draw on redundant power supplies, rack-level automatic transfer switches (ATS) provide power failover capability. The electrical infrastructure for this equipment is usually designed to be Concurrently Maintainable, except that standby-rated engine-generator sets are not always specified. The UPS system is designed to provide sufficient time and energy for normal IT equipment shutdown.

The auxiliary IT equipment must be operated in an environment cooled to 18–27°C (64–81°F), according to ASHRAE recommendations, which means that solutions used in general data centers are adequate for the heat load generated by this equipment. These solutions often meet or exceed Concurrent Maintainability or Fault Tolerant performance requirements.

ENERGY EFFICIENCY
In recent years, data center operators have put a greater priority on energy efficiency. This focus on energy saving also applies to specialized HPC data centers. Because of the numerous similarities between the types of data centers, the same methods of improving energy efficiency are used. These include the use of various combinations of free cooling modes, higher coolant and set point temperatures, economizers, evaporative systems, and variable frequency drives and pumps as well as numerous other technologies and techniques.

Reusing the energy consumed by servers and computing equipment is one of the most promising of these efficiency improvements. Until recently, all that energy was simply dissipated. Moreover, based on average power usage effectiveness (PUE), even efficient data centers must use significant energy to dissipate the heat they generate.

Facilities that include chillers and first-generation liquid cooling systems generate “low potential heat” [coolant temperatures of 10–15°C (50–59°F), 12–17°C (54–63°F), and even 20–25°C (68–77°F)] that can be used rather than dissipated, but doing so requires significant CapEx and OpEx (e.g., use of heat pumps) that lead to long investment return times that are usually considered unacceptable.

Increasing the heat potential of the liquid coolants improves the effectiveness of this approach, absent direct expansion technologies. And while reusing the heat load is not very feasible in server-based spaces, there have been positive applications in supercomputing computational spaces. Increasing the heat potential can also create additional opportunities to use free cooling in any climate, which makes year-round free cooling, a critical requirement for an HPC data center, achievable.

A SEGMENTED HPC DATA CENTER
Earlier this year the Russian company T-Platforms deployed the Lomonosov-2 supercomputer at MSU using the segmented infrastructure approach. T-Platforms has experience in supercomputer design and complex HPC data center construction in Russia and abroad. When T-Platforms built the first Lomonosov supercomputer, it ranked 12th in the global TOP500 HPC ratings. Lomonosov-1 has been used at 100% of its capacity, with about 200 tasks waiting in the job queue on average. The new supercomputer will significantly expand MSU’s Supercomputing Center capabilities.

The engineering systems for the new facility were designed to support the highest supercomputer performance, combining new and proven technologies to create an energy-efficient scalable system. The engineering infrastructure for this supercomputer was completed in June 2014, and computing equipment is being gradually added to the system, as requested by MSU. The implemented infrastructure allows system expansion with currently available A-class computing hardware and future generations of IT equipment without further investment in the engineering systems.

THE COMPUTATIONAL SEGMENT
The supercomputer is based on T-Platforms’ A-class high-density computing system and makes use of a liquid cooling system (see Figure 3). A-class supercomputers support designs of virtually any scale. The peak performance of one A-class enclosure is 535 teraflops (tflops), and a system based on it can easily be extended to more than 100 pflops. For example, the combined performance of the five A-class systems already deployed at MSU reached 2,576 tflops in 2014 (22nd in the November 2014 TOP500) and was about 2,675 tflops in July 2015. This is approximately 50% greater than the peak performance of the entire first Lomonosov supercomputer (1,700 tflops, 58th in the same TOP500 edition). A supercomputer made of about 100 A-class enclosures would perform on par with the Tianhe-2 (Milky Way-2) system at the National Supercomputer Center in Guangzhou (China), which leads the current TOP500 list at about 55 pflops.
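The scaling arithmetic behind these figures is straightforward (a quick check against the article's own per-enclosure peak number):

```python
# A-class scaling: enclosure count maps linearly to aggregate peak performance.
tflops_per_enclosure = 535

print(5 * tflops_per_enclosure)            # the five MSU systems: 2675 tflops
print(100 * tflops_per_enclosure / 1000)   # ~100 enclosures: 53.5 pflops, Tianhe-2 class
```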

Figure 3. The supercomputer is based on T-Platforms’s A-class high-density computing system and makes use of a liquid cooling system


All A-class subsystems, including computing and service nodes, switches, and cooling and power supply equipment, are tightly integrated in a single enclosure as modules with hot swap support (including those with hydraulic connections). The direct liquid cooling system is the key feature of the HPC data center infrastructure. It almost completely eliminates air as the medium of heat exchange. This solution improves the energy efficiency of the entire complex by making these features possible:

• IT equipment installed in the enclosure has no fans

• Heat from the high-efficiency air-cooled power supply units (PSU) is removed using water/air heat exchangers embedded in the enclosure

• Electronic components in the cabinet do not require computer room air for cooling

• Heat dissipated to the computer room is minimized because the cabinet is sealed and insulated

• Coolant is supplied to the cabinet at 44°C (111°F), with up to 50°C (122°F) outbound under full load, which enables year-round free cooling at ambient summer temperatures of up to 35°C (95°F) and the use of dry coolers without adiabatic systems (see Figure 4)
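The free-cooling claim in the last bullet is easy to verify: a dry cooler only needs the coolant supply temperature to sit comfortably above the worst-case ambient (the 9 K figure below is derived from the article's temperatures, not stated in it):

```python
# Dry-cooler feasibility check for the warm-water loop described above.
supply_c = 44        # coolant supplied to the cabinet
return_c = 50        # coolant leaving the cabinet under full load
ambient_max_c = 35   # design summer extreme

approach_k = supply_c - ambient_max_c
print(approach_k)    # 9 K of approach available for heat rejection, year-round
```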


Figure 4. Coolant is supplied to the cabinets at 44°C (50°C outbound under full load), which enables year-round free cooling at ambient summer temperatures of up to 35°C and the use of dry coolers without adiabatic systems


In addition, noise levels are low because liquid cooling eliminates the powerful node fans that generate noise in air-cooled systems. The only remaining fans in A-class systems are embedded in the PSUs inside the cabinets, and these fans are rather quiet. The cabinet design contains most of the noise from this source.

INFRASTRUCTURE SUPPORT
The power and cooling systems for Lomonosov-2 follow the general segmentation guidelines. In addition, they must meet the demands of the facility’s IT equipment and engineering systems at full load, which includes up to 64 A-class systems (peak performance over 34 pflops) and up to 80 auxiliary equipment racks in 42U, 48U, and custom cabinets. At full capacity these systems require 12,000 kW of peak electric power.

Utility power is provided by eight 20/0.4-kV substations, each having two redundant power transformers, making a total of 16 low-voltage power lines with a limit of 875 kW per line in normal operation.
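A quick sanity check shows the utility feed comfortably covers the stated peak load (the headroom figure is derived, not stated in the article):

```python
# Utility capacity vs. stated peak demand for the fully built-out facility.
substations = 8
lines_per_substation = 2          # two redundant transformers per substation
kw_per_line = 875                 # normal-operation limit per low-voltage line

total_capacity_kw = substations * lines_per_substation * kw_per_line
peak_load_kw = 12_000             # 64 A-class systems + ~80 auxiliary racks

print(total_capacity_kw)                   # 14000 kW available
print(total_capacity_kw - peak_load_kw)    # 2000 kW of headroom
```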

Although no backup engine-generator sets have been provisioned, at least 28% of the computing equipment and 100% of the auxiliary IT equipment is protected by UPS systems providing at least 10 minutes of battery life for all connected equipment.

The engineering infrastructure also includes two independent cooling systems: a warm-water, dry-cooler type for the computational equipment and a cold-water, chiller system for auxiliary equipment. These systems are designed for normal operation in temperatures ranging from -35 to +35°C (-31 to +95°F), with year-round free cooling for the computing hardware. The facility also contains an emergency cooling system for auxiliary IT equipment.

The facility’s first floor includes four 480-square-meter (m2) rooms for computing equipment (17.3 kW/m2) and four 280-m2 rooms for auxiliary equipment (3 kW/m2), with 2,700 m2 of site engineering rooms on an underground level.

POWER DISTRIBUTION
The power distribution system is built on standard switchboard equipment and is based on the typical topology for general data centers. In this facility, however, the main function of the UPS is to ride through brief blackouts of the utility power supply for select computing equipment (2,340 kW), all auxiliary IT equipment (510 kW), and engineering equipment systems (1,410 kW). In the case of a longer blackout, the system supplies power for proper shutdown of connected IT equipment.
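Combining the UPS-protected loads above with the 10-minute ride-through quoted earlier gives a back-of-the-envelope battery energy requirement (losses and discharge derating are ignored here for simplicity; the result is a derived estimate, not an article figure):

```python
# Rough battery energy needed to hold the three UPS-protected groups
# for the ~10-minute ride-through stated in the article.
protected_loads_kw = {
    "computing (protected share)": 2340,
    "auxiliary IT": 510,
    "engineering systems": 1410,
}
total_kw = sum(protected_loads_kw.values())
ride_through_min = 10

energy_kwh = total_kw * ride_through_min / 60
print(total_kw)           # 4260 kW of protected load
print(round(energy_kwh))  # ~710 kWh of usable battery energy
```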

The UPS system is divided into three independent subsystems. The first is for computing equipment, the second is for auxiliary equipment, and the third is for engineering systems. In fact, the UPS system is deeply segmented because of the large number of input power lines. This minimizes the impact of failures of engineering equipment on supercomputer performance in general.

The segmentation principle is also applied to the physical location of the power supply equipment. Batteries are placed in three separate rooms. In addition, there are three UPS rooms and one switchboard room for the computing equipment that is unprotected by UPS. Figure 5 shows one UPS-battery room pair.

Figure 5. A typical pair of UPS-battery rooms


Three independent parallel UPS systems, each featuring N+1 redundancy (see Figure 6), feed the protected computing equipment. This redundancy, along with bypass availability and segmentation, simplifies UPS maintenance and the task of localizing a failure. Because each UPS can receive power from two mutually redundant transformers, the overall reliability of the system meets the owner’s requirements.

Figure 6. Power supply plan for computing equipment


Three independent parallel UPS systems are also used for the auxiliary IT equipment because it requires greater failover capabilities. The topology incorporates a distributed redundancy scheme that was developed in the late 1990s. The topology is based on use of three or more UPS modules with independent input and output feeders (see Figure 7).

Figure 7. Power supply for auxiliary equipment


This system is more economical than a 2N-redundant configuration while providing the same reliability and availability levels. Cable lines connect each parallel UPS to the auxiliary equipment computer rooms. Thus, the computer room has three UPS-protected switchboards. The IT equipment in these rooms, being mostly dual fed, is divided into three groups, each of which is powered by two switchboards. Single-feed and N+1 devices are connected through a local rack-level ATS (see Figure 8).
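The economics of the distributed-redundancy scheme can be sketched with textbook sizing arithmetic (the loading rule below is the generic form of the scheme, not necessarily the exact Lomonosov-2 module ratings):

```python
# Distributed redundancy: with M modules and loads spread across module
# pairs, each module may carry at most (M-1)/M of its rating so any single
# module failure can be absorbed by the survivors.
def installed_capacity(load_kw, modules):
    return load_kw * modules / (modules - 1)

load_kw = 510   # the auxiliary IT load quoted in the article

print(round(installed_capacity(load_kw, 3)))   # 765 kW installed with 3 modules
print(round(load_kw * 2))                      # 1020 kW for an equivalent 2N design
```

This is why the article calls the topology more economical than 2N: three modules at two-thirds loading replace two full-size systems.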

Figure 8. Single-feed and N+1-redundant devices are connected through a local rack-level ATS


ENGINEERING EQUIPMENT
Some of the engineering infrastructure also requires uninterrupted power in order to provide the required Fault Tolerance. The third UPS system meets this requirement. It consists of five completely independent single UPSs. Technological redundancy is fundamental: redundancy is applied not to the power lines and switchboard equipment but directly to the engineering infrastructure devices.

The number of UPSs in the group (Figure 9 shows three of five) sets the maximum redundancy at 4+1; the system can also provide 3+2 and 2N configurations. Most of the protected equipment is at N+1 (see Figure 9).

Figure 9. Power supply for engineering equipment


In general, this architecture allows decommissioning of any power supply or cooling unit, power line, switchboard, UPS, etc., without affecting the serviced IT equipment. Simultaneous duplication of power supply and cooling system components is not necessary.

OVERALL COOLING SYSTEM
Lomonosov-2 makes use of a cooling system that consists of two independent segments, each designed for its own type of IT equipment (see Table 2). Both segments use a two-loop scheme with plate heat exchangers between the loops. The outer loops use a 40% ethylene-glycol mixture as coolant; water is used in the inner loops. Both segments have N+1 components (N+2 for the dry coolers in the supercomputing segment).

Table 2. Lomonosov-2 makes use of a cooling system that consists of two independent segments, each of which is designed to meet the different requirements of the supercomputing and auxiliary IT equipment.


This system, designed to serve the 64 A-class enclosures, has been designated the hot-water segment. It removes almost all of the heat from this extremely energy-intensive equipment without chillers (see Figure 10). Dry coolers dissipate all the heat generated by the supercomputing equipment at ambient temperatures up to 35°C (95°F). Power is required only for the circulation pumps of both loops, the dry cooler fans, and the automation systems.

Figure 10. The diagram shows the cooling system’s hot water segment.


Under full load and in the most adverse conditions, the instant PUE would be expected to be about 1.16 for the fully deployed system of 64 A-class racks (see Figure 11).
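PUE is the ratio of total facility power to IT power, so an instantaneous PUE of 1.16 implies roughly 16% overhead for pumps, fans, distribution losses, and automation. The absolute load figures below are illustrative, not from the article:

```python
# What an instantaneous PUE of 1.16 implies for non-IT overhead.
def pue(it_kw, overhead_kw):
    return (it_kw + overhead_kw) / it_kw

it_kw = 10_000        # hypothetical IT load for the hot-water segment
overhead_kw = 1_600   # pumps, dry cooler fans, losses, automation

print(round(pue(it_kw, overhead_kw), 2))   # 1.16
```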

Figure 11. Under full load and under the most adverse conditions, the instant PUE (power usage effectiveness) would be expected to be about 1.16 for the fully deployed system of 64 A-class racks


The water in the inner loop has been purified and contains corrosion inhibitors. It is supplied to computer rooms that contain only liquid-cooled computer enclosures. Since the enclosures do not use computer room air for cooling, the temperature in these rooms is set at 30°C (86°F) and can be raised to 40°C (104°F) without any effect on equipment performance. The inner loop piping is made of PVC/CPVC (polyvinyl chloride/chlorinated polyvinyl chloride), thus avoiding electrochemical corrosion.

COOLING AUXILIARY IT EQUIPMENT
It is difficult to avoid using air-cooled IT equipment, even in an HPC project, so MSU deployed a separate cold-water [12–17°C (54–63°F)] cooling system. The cooling topology in these four spaces is almost identical to the hot-water segment deployed in the A-class rooms, except that chillers dissipate the excess heat from the auxiliary IT spaces to the atmosphere. In the white spaces, temperatures are maintained using isolated Hot Aisles and in-row cooling units. The instant PUE for this isolated system is about 1.80, which is not particularly efficient (see Figure 12).

Figure 12. In white spaces, temperatures are maintained using isolated hot aisles and in-row cooling units.


If necessary, some of the capacity of this segment can be used to cool the air in the A-class computing rooms. The cooling system in these spaces can absorb up to 10% of the total heat output of each A-class enclosure. Although sealed, the enclosures still heat the computer room air through convection; in practice, however, passive heat dissipation from A-class enclosures is less than 5% of the total power they consume.

EMERGENCY COOLING
An emergency-cooling mode exists to deal with utility power input blackouts, when both cooling segments are operating on power from the UPS. In emergency mode, each cooling segment has its own requirements. As all units in the first segment (both pump groups, dry coolers, and automation) are connected to the UPS, the system continues to function until the batteries discharge completely.

In the second segment, the UPS services only the inner cooling loop pumps, air conditioners in computer rooms, and automation equipment. The chillers and outer loop pumps are switched off during the blackout.

Since the spaces allocated for cooling equipment are limited, it was impossible to use the more traditional method of storing cold water at the outlet of the heat exchangers (see Figure 13). Instead, the second segment of the emergency system features accumulator tanks with water stored at a lower temperature than in the loop [about 5°C (41°F), versus 12°C (54°F) in the loop] to keep system parameters within a predetermined range. Thus, the required tank volume was reduced from 75 cubic meters (m3) to 24 m3, which allowed the equipment to fit in the allocated area. A special three-way valve allows mixing of chilled water from the tanks into the loop when necessary. Two separate low-capacity 55-kW chillers are responsible for charging the tanks with cold water. The system charges the cold-water tanks in about the time it takes to charge the UPS batteries.
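The tank-size reduction follows from sensible-heat storage: Q = ρ·V·c·ΔT, so for a fixed ride-through energy the required volume scales as 1/ΔT. The loop and tank temperatures below are from the article; the 430-kWh target and the usable return temperature of 17°C (the top of the 12–17°C loop) are assumptions made for the illustration:

```python
# Sensible-heat storage sizing: volume needed scales inversely with the
# usable temperature swing of the stored water.
RHO = 1000    # kg/m3, density of water
C = 4.186     # kJ/(kg*K), specific heat of water

def tank_volume_m3(energy_kwh, dt_kelvin):
    """Water volume needed to absorb energy_kwh over a rise of dt_kelvin."""
    return energy_kwh * 3600 / (RHO * C * dt_kelvin)

target_kwh = 430   # hypothetical emergency ride-through requirement

# Conventional: water stored at the 12 C loop temperature (usable dT ~ 5 K)
print(round(tank_volume_m3(target_kwh, 17 - 12)))   # ~74 m3
# As built: tanks charged to 5 C (usable dT ~ 12 K)
print(round(tank_volume_m3(target_kwh, 17 - 5)))    # ~31 m3
```

The article's 24 m3 figure implies a somewhat larger usable temperature swing in practice, but this scaling explains most of the reduction from 75 m3.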

Figure 13. Cold accumulators are used to keep system parameters within a predetermined range.


MSU estimates that segmented cooling with a high-temperature direct water cooling segment reduces its total cost of ownership by 30% compared to data center cooling architectures based on direct expansion (DX) technologies. MSU believes this project shows that combining the most advanced and classical technologies with system optimization allows significant savings on CapEx and OpEx while keeping the prerequisite performance, failover, reliability, and availability levels.


Andrey Brechalov


Andrey Brechalov is Chief Infrastructure Solutions Engineer at T-Platforms, a provider of high performance computing (HPC) systems, services, and solutions headquartered in Moscow, Russia. Mr. Brechalov has been responsible for building the engineering infrastructure for the SKIF K-1000, SKIF Cyberia, and MSU Lomonosov-2 supercomputers, as well as smaller T-Platforms systems. He has worked for more than 20 years in the computer industry, including over 12 years in HPC, and specializes in designing, building, and running supercomputer centers.

Achieving Uptime Institute Tier III Gold Certification of Operational Sustainability

Vantage Data Centers certifies design, facility, and operational sustainability at its Quincy, WA site

By Mark Johnson

In February 2015, Vantage Data Centers earned Tier III Gold Certification of Operational Sustainability (TCOS) from Uptime Institute for its first build at its 68-acre Quincy, WA campus. This project is a bespoke design for a customer that expects a fully redundant, mission critical, and environmentally sensitive data center environment for its company business and mission critical applications.

Achieving TCOS verifies that practices and procedures (according to the Uptime Institute Tier Standard: Operational Sustainability) are in place to avoid preventable errors, maintain IT functionality, and support effective site operation. The Tier Certification process ensures operations are in alignment with an organization’s business objectives, availability expectations, and mission imperatives. The Tier III Gold TCOS provides evidence that the 134,000-square-foot (ft2) Quincy facility, which qualified as a Tier III Certified Constructed Facility (TCCF) in September 2014, would meet the customer’s operational expectations.

Vantage believes that TCOS is a validation that its practices, procedures, and facilities management are among the best in the world. Uptime Institute professionals verified not only that all the essential components for success are in place but also that each team member demonstrates tangible evidence of adhering strictly to procedure. It also provides verification to potential tenants that everything from maintenance practices to procedures, training, and documentation is done properly.

Recognition at this level is a career highlight for data center operators and engineers—the equivalent of receiving a 4.0 grade-point average from Vantage’s most elite peers. This recognition of hard work is a morale booster for everyone involved—including the tenant, vendors, and contractors, who all worked together and demonstrated a real commitment to process in order to obtain Tier Certification at this level. This commitment from all parties is essential to ensuring that human error does not undermine the capital investment required to build a 2N+1 facility capable of supporting up to 9 megawatts of critical load.

Data centers looking to achieve TCOS (for Tier-track facilities) or the Uptime Institute Management & Operations (M&O) Stamp of Approval (independent of Tiers) should recognize that the task is first and foremost a management challenge involving building a team, training, developing procedures, and ensuring consistent implementation and follow up.

BUILDING THE RIGHT TEAM

The right team is the foundation of an effectively run data center. Assembling the team was Vantage’s highest priority and required a careful examination of the organization’s strengths and weaknesses, culture, and appeal to prospective employees.

Having a team of skilled heating, ventilation and air conditioning (HVAC) mechanics, electricians, and other highly trained experts in the field is crucial to running a data center effectively. Vantage seeks technical expertise but also demonstrable discipline, accountability, responsibility, and drive in its team members.

Beyond these must-have features is a subset of nice-to-have characteristics, and at the top of that list is diversity. A team that includes diverse skill sets, backgrounds, and expertise not only ensures a more versatile organization but also enables more work to be done in-house. This is a cost saving and quality control measure, and yet another way to foster pride and ownership in the team.

Time invested upfront in selecting the best team members helps reduce headaches down the road and gives managers a clear reference for what an effective hire looks like. A poorly chosen hire costs more in the long run, even if it seems like an urgent decision in the moment, so a rigorous, competency-based interview process is a must. If the existing team does not unanimously agree on a potential hire, organizations must move on and keep searching until the right person is found.

Recruiting is a continuous process. The best time to look for top talent is before it’s desperately needed. Universities, recruiters, and contractors can be sources of local talent. The opportunity to join an elite team can be a powerful inducement to promising young talent.

TRAINING

Talent, by itself, is not enough. It is just as important to train the employees who represent the organization. Like medicine or finance, the data center world is constantly evolving: standards shift, equipment changes, and processes are streamlined. Training is both about certification (external requirements) and ongoing learning (internal advancement and education). To accomplish these goals, Vantage maintains and mandates a video library of training modules at its facilities in Quincy and Santa Clara, CA. The company has also developed an online learning management system that augments safety training, on-site video training, and personnel qualification standards that require every employee to be trained on every piece of equipment on site.

The first component of a successful training program is fostering on-the-job learning in every situation. Structuring on-the-job learning requires that senior staff work closely with junior staff and that employees with different types and levels of expertise pair up to learn from one another. A diverse hiring strategy can lead to the creation of small educational partnerships.

It’s impossible to ensure the most proficient team members will be available for every problem and shift, so it’s essential that all employees have the ability to maintain and operate the data center. Data center management should encourage and challenge employees to try new tasks and require peer reviews to demonstrate competency. Improving overall competency reduces over-dependence on key employees and helps encourage a healthier work-life balance.

Formalized, continuous training programs should be designed to evaluate and certify employees using a multi-level process through which varying degrees of knowledge, skill, and experience are attained. The objectives are ensuring overall knowledge, keeping engineers apprised of any changes to systems and equipment, and identifying and correcting any knowledge shortfalls.

PROCEDURES

Ultimately, discipline and adherence to fine-tuned procedures are essential to operational excellence within a data center. The world’s best-run data centers even have procedures on how to write procedures. Any element that requires human interaction or consideration—from protective equipment to approvals—should have its own section in the operating procedures, including step-by-step instructions and potential risks. Cutting corners, while always tempting, should be avoided; data centers live and die by procedure.

Managing and updating procedures is equally important. For example, major fires broke out just a few miles from Vantage’s Quincy facility not long ago. The team carefully monitored and tracked the fires, noting that they were still several miles away and seemingly headed away from the site. That information, however, was not communicated directly to the largest customer at the site, which called in the middle of the night to ask about possible evacuation and the recovery plan. Vantage collaborated with the customer to develop a standardized system for emergency notifications, which it incorporated into its procedures to mitigate the possibility of future miscommunications.

Once procedures are created, they should go through a careful vetting process involving a peer review, to verify the technical accuracy of each written step, including lockout/tagout and risk identification. Vetting procedures means physically walking on site and carrying out each step to validate the procedure for accuracy and precision.

Effective work order management is part of a well-organized procedure. Vantage’s work order management process:

• Predefines scope of service documents to stay ahead of work

• Manages key work order types, such as corrective work orders, preventive maintenance work orders, and project work orders

• Measures and reports on performance at every step

Maintaining regular, detailed reporting practices adds yet another layer of procedural security. A work order system can maintain and manage all action items. Reporting should be reviewed with the parties involved in each step, with everyone held accountable for the results and mistakes analyzed and rectified on an ongoing basis.

Peer review is also essential to maintaining quality methods of procedure (MOPs) and standard operating procedures (SOPs). As with training, pairing up employees for peer review processes helps ensure excellence at all stages.

IMPLEMENTATION AND DISCIPLINE

Disciplined enforcement of processes that are proven to work is the most important component of effective standards and procedures. Procedures are not there to be followed when time allows or when it is convenient. For instance, if a contractor shows up on site without a proper work order or without having followed proper procedure, that’s not an invitation to make an exception. Work must be placed on hold until procedures can be adhered to, with those who did not follow protocol bearing accountability for the delay.

For example, Vantage developed emergency operating procedures (EOPs) for any piece of equipment that could possibly fail. And, sure enough, an uninterruptible power supply (UPS) failed during routine maintenance. Because proper procedures had been developed and employees properly trained, they followed the EOP to the letter, solving the problem quickly and entirely eliminating human error from the process. The loads were diverted, the crisis averted, and everything was properly stabilized to work on the UPS system without fear of interrupting critical loads.

Similarly, proper preparation for maintenance procedures eliminates risk of losing uptime during construction. Vantage develops and maintains scope of service documents for each piece of equipment in the data center, and what is required to maintain them. The same procedures for diverting critical loads for maintenance were used during construction to ensure the build didn’t interfere with critical infrastructure despite the load being moved more than 20 times.

Transparency and open communication between data center operators and customers while executing preventative maintenance is key. Vantage notifies the customer team at the Quincy facility prior to executing any preventative maintenance that may pose a risk to their data hall. The customer then puts in a snap record, which notifies their internal teams about the work. Following these procedures and getting the proper permissions ensures that the customer won’t be subjected to any uncontrolled risk and covers all bases should any unexpected issues arise.

When procedure breaks down and fails due to lack of employee discipline, it puts both the company and managerial staff in a difficult position. First, the lack of discipline undermines the effectiveness of the procedures. Second, management must make a difficult choice—retrain or replace the offending employee. For those given a second chance, managers put their own jobs on the line—a tough prospect in a business that requires to-the-letter precision at every stage.

To ensure that discipline is instilled deeply in every employee, it’s important that the team take ownership of every component. Vantage keeps all its work in-house and consistently trains its employees in multiple disciplines rather than outsourcing. This makes the core team better and more robust and avoids reliance on outside sources. Additionally, Vantage does not allow contractors to turn breakers on and off, because the company ultimately bears the responsibility of an interrupted load. Keeping everything under one roof and knowing every aspect of the data center inside and out is a competitive advantage.

Vantage’s accomplishment of Tier III Gold Certification of Operational Sustainability validates everything the company does to develop and support its operational excellence.


Mark Johnson

Mark Johnson is Site Operations Manager at Vantage Data Centers. Prior to joining Vantage, Mr. Johnson was data center facilities manager at Yahoo. He was responsible for the critical facilities infrastructure for the Wenatchee and Quincy, WA, data centers. He was also a CITS Facilities Engineer at Level 3 Communications, where he was responsible for the critical facilities infrastructure for two Sunnyvale, CA, data centers. Before that, Mr. Johnson was an Engineer III responsible for critical facilities at VeriSign, where he was responsible for two data centers, and a chief facilities engineer at Abovenet.

 

Economizers in Tier Certified Data Centers

Achieving the efficiency and cost savings benefits of economizers without compromising Tier level objectives
By Keith Klesner

In their efforts to achieve lower energy use and greater mechanical efficiency, data center owners and operators are increasingly willing to consider and try economizers. At the same time, many new vendor solutions are coming to market. In Tier Certified data center environments, however, economizers, just as any other significant infrastructure system, must operate consistently with performance objectives.

Observation by Uptime Institute consultants indicates that roughly one-third of new data center construction designs include an economizer function. Existing data centers are also looking at retrofitting economizer technologies to improve efficiency and lower costs. Economizers use external ambient air to help cool IT equipment. In some climates, the electricity savings from implementing economizers can be so significant that the method has been called “free cooling.” But, all cooling solutions require fans, pumps, and/or other systems that draw power; thus, the technology is not really free, and the term economizer is more accurate.
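The arithmetic behind “free cooling” claims comes down to PUE: total facility power divided by IT power. A minimal sketch of how economizer hours lower PUE, using purely illustrative load figures (not data from any facility mentioned here):

```python
# PUE (power usage effectiveness) = total facility power / IT power.
# Economizers lower PUE by cutting the mechanical (cooling) share of
# facility power. All load figures below are illustrative assumptions.

def pue(it_kw: float, cooling_kw: float, other_overhead_kw: float) -> float:
    """PUE = (IT + cooling + other overhead) / IT."""
    return (it_kw + cooling_kw + other_overhead_kw) / it_kw

# Hypothetical 1-MW IT load: compressor-based cooling vs. economizer mode
print(pue(1000, 450, 100))  # 1.55 with chillers running
print(pue(1000, 150, 100))  # 1.25 during economization
```

The same facility swings between the two figures over the year, which is why annualized PUE, not the best-hour PUE, is the number that matters for savings estimates.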

When The Green Grid surveyed large data centers (2,500 square feet and larger) in 2011, 49% of the respondents (primarily U.S. and European facilities) reported using economizers and another 24% were considering them. In the last 4 years, these numbers have continued to grow. In virtually all climatic regions, adoption of these technologies appears to be on the rise. Uptime Institute has seen an increase in the use of economizers in both enterprise and commercial data centers, as facilities attempt to lower their power usage effectiveness (PUE) and increase efficiency. This increased adoption is due in large part to fears about rising energy costs (predicted to grow significantly in the next 10 years). In addition, outside organizations, such as ASHRAE, are advocating for greater efficiencies, and internal corporate and client sustainability initiatives at many organizations drive the push to be more efficient and reduce costs.

The marketplace includes a broad array of economizer solutions:

• Direct air cooling: Fans blow cooler outside air into a data center, typically through filters

• Indirect evaporative cooling: A wetted medium or water spray promotes evaporation to supply cool air into a data center

• Pumped refrigerant dry coolers: A closed-loop fluid, similar to an automotive radiator, rejects heat to external air and provides cooling to the data center

• Water-side economizing: Traditional cooling tower systems incorporating heat exchangers bypass chillers to cool the data center

IMPLICATIONS FOR TIER
Organizations that plan to utilize an economizer system and desire to attain Tier Certification must consider how best to incorporate these technologies into a data center in a way that meets Tier requirements. For example, Tier III Certified Constructed Facilities have Concurrently Maintainable critical systems. Tier IV Certified Facilities must be Fault Tolerant.

Some economizer technologies and/or their implementation methods can affect critical systems that are integral to meeting Tier Objectives. For instance, many technologies were not originally designed for data center use, and manufacturers may not have thought through all the implications.

For example, true Fault Tolerance is difficult to achieve and requires sophisticated controls. Detailed planning and testing is essential for a successful implementation. Uptime Institute does not endorse or recommend any specific technology solution or vendor; each organization must make its own determination of what solution will meet the business, operating, and environmental needs of its facility.

ECONOMIZER TECHNOLOGIES
Economizer technologies include commercial direct air rooftop units, direct air plus evaporative systems, indirect evaporative cooling systems, water-side economizers, and direct air plus dry cooler systems.

DIRECT AIR
Direct air units used as rooftop economizers are often the same units used for commercial heating, ventilation, and air-conditioning (HVAC) systems. Designed for office and retail environments, this equipment has been adapted for 24 x 7 applications. Select direct air systems also use evaporative cooling, but all of them combine direct air and multi-stage direct expansion (DX) or chilled water. These units require low capital investment because they are generally available commercially, service technicians are readily available, and the systems typically consume very little water. Direct air units also yield good reported PUE (1.30–1.60).

On the other hand, commercial direct air rooftop units may require outside air filtration, as many units do not have adequate filtration to prevent the introduction of outside air directly into critical spaces, which increases the risk of particulate contamination.

Outside air units suitable for mission critical spaces require the capability of 100% air recirculation during certain air quality events (e.g., high pollution events and forest or brush fires) that will temporarily negate the efficiency gains of the units.

Figure 1. A direct expansion unit with an air-side economizer unit provides four operating modes including direct air, 100% recirculation, and two mixed modes. It is a well-established technology, designed to go from full stop (no power) to full cooling in 120 seconds or less, and allowing for PUE as low as 1.30-1.40.


Because commercial HVAC systems do not always meet the needs of mission critical facilities, owners and operators must identify the design limitations of any particular solution. Systems may integrate critical cooling and the air handling unit or provide a mechanical solution that incorporates air handling and chilled water. These units will turn off the other cooling mechanism when outside air cooling permits. Units that offer dual modes are typically not significantly more expensive. These commercial units require reliable controls that ensure that functional settings align with the mission critical environment. It is essential that the controls sequence be dialed in before performing thorough testing and commissioning of all control possibilities (see Figure 1). Commercial direct air rooftop units have been used successfully in Tier III and Tier IV applications (see Figure 2).


Figure 2. Chilled water with air economizer and wetting media provides nine operating modes including direct air plus evaporative cooling. With multiple operating modes, the testing regimen is extensive (required for all modes).


A key step in adapting commercial units to mission critical applications is considering the data center’s worst-case scenario. Most commercial applications are rated at 95°F (35°C), and HVAC units will typically allow some fluctuation in temperature and discomfort for workers in commercial settings. The temperature requirements for data centers, however, are more stringent. Direct air or chilled water coils must be designed for peak day—the location’s ASHRAE dry bulb temperature and/or extreme maximum wet bulb temperature. Systems must be commissioned and tested in Tier demonstrations for any event that would require 100% recirculation. If the unit includes evaporative cooling, makeup (process) water must meet all Tier requirements or the evaporative capacity must be excluded from the Tier assessment.

In Tier IV facilities, Continuous Cooling is required, including during any transition from utility power to engine generators. Select facilities have achieved Continuous Cooling using chilled water storage. In the case of one Tier IV site, the rooftop chilled water unit included very large thermal storage tanks to provide Continuous Cooling via the chilled water coil.

Controller capabilities and building pressure are also considerations. As these are commercial units, their controls are usually not optimized for the transition of power from the utility to the engine-generator sets and back. Typically, over- or under-pressure imbalance in a data center increases following a utility loss or mode change due to outside air damper changes and supply and exhaust fans starting and ramping up. This pressure can be significant. Uptime Institute consultants have even seen an entire wall blow out from over-pressure in a data hall. Facility engineers have to adjust controls for the initial building pressure and fine-tune them to adjust to the pressure in the space.

To achieve Tier III objectives, each site must determine if a single or shared controller will meet its Concurrent Maintainability requirements. In a Tier IV environment, Fault Tolerance is required in each operating mode to prevent a fault from impacting the critical cooling of other units. It is acceptable to have multiple rooftop units, but they must not be on a single control or single weather sensor/control system component. It is important to have some form of distributed or lead-lag (master/slave) system to control these components, enabling them to operate in a coordinated fashion with no points of commonality. If any one component fails, the master control system will switch to the other unit, so that a fault will not impact critical cooling. For one Tier IV project demonstration, Uptime Institute consultants found additional operating modes while on site. Each required additional testing and controls changes to ensure Fault Tolerance.

DIRECT AIR PLUS EVAPORATIVE SYSTEMS
One economizer solution on the market, along with many other similar designs, involves a modular data center with direct air cooling and wetted media fed from a fan wall. The fan wall provides air to a flooded Cold Aisle in a layout that includes a contained Hot Aisle. This proprietary solution is modular and scalable, with direct air cooling via an air optimizer. This system is factory built with well-established performance across multiple global deployments. The systems have low reported PUEs and excellent partial load efficiency. Designed as a prefabricated modular cooling system and computer room, the system comes with a control algorithm that is designed for mission critical performance.

These units are described as being somewhat like a “data center in a box” but without the electrical infrastructure, which must be site designed to go with the mechanical equipment. Cost may be another disadvantage, as there have been no deployments to date in North America. In operation, the system determines whether direct air or evaporative cooling is appropriate, depending upon external temperature and conditions. Air handling units are integrated into the building envelope rather than placed on a rooftop.

Figure 3. Bladeroom’s prefabricated modular data center uses direct air with DX and evaporative cooling.


One company has used a prefabricated modular data center solution with integrated cooling optimization between indirect, evaporative, and DX cooling in Tier III facilities. In these facilities, a DX cooling system provides redundancy to the evaporative cooler. If there is a critical failure of the water supply to the evaporative cooler (or of the water pump, which is monitored by a flow switch), the building management system starts DX cooling and puts the air optimizer into full recirculation mode. In this setup, from a Tier objective perspective, the evaporative system and the water system supporting it are not critical systems. Fans are installed in an N+20% configuration to provide resilience. The design calls for DX cooling less than 6% of the year at the installations in Japan and Australia; for the remainder of the year, the DX system acts as redundant mechanical cooling, able to meet 100% of the IT capacity. The redundant mechanical cooling system itself is an N+1 design (see Figures 3 and 4).

Figure 4. Supply air from the Bladeroom “air optimizer” brings direct air with DX and evaporative cooling into flooded cold aisles in the data center.


This data center solution has seen multiple Tier I and Tier II deployments, as well as several Tier III installations, providing good efficiency results. Achieving Tier IV may be difficult with this type of DX plus evaporative system because of Compartmentalization and Fault Tolerant capacity requirements. For example, Compartmentalization of the two different air optimizers is a challenge that must be solved; the louvers and louver controls in the Cold Aisles are not Fault Tolerant and would require modification; and Compartmentalization of electrical controls (for example, one in the Hot Aisle and one in the Cold Aisle) has not been incorporated into the concept.

INDIRECT EVAPORATIVE COOLING SYSTEMS
Another type of economizer employs evaporative cooling to indirectly cool the data center using a heat exchanger. There are multiple suppliers of these types of systems. New technologies incorporate cooling media of hybrid plastic polymers or other materials. This approach excludes outside air from the facility. The result is a very clean solution; pollutants, over-pressure/under-pressure, and changes in humidity from outside events like thunderstorms are not concerns. Additionally, a more traditional, large chilled water plant is not necessary (although makeup water storage will be needed) because chilled water is not required.

As with many economizing technologies, greater efficiency can enable facilities to avoid upsizing the electrical plant to accommodate the cooling. A reduced mechanical footprint may mean lower engine-generator capacity, fewer transformers and switchgear, and an overall reduction in the often sizable electrical generation systems traditionally seen in a mission critical facility. For example, one data center eliminated an engine generator set and switchgear, saving approximately US$1M (although the cooling units themselves were more expensive than some other solutions on the market).

The performance of these types of systems is climate dependent. Generally, no supplemental cooling systems are required in more northern locations. For most temperate and warmer climates, some supplemental critical cooling will be needed for hotter days during the year. The systems have to be sized appropriately; however, a small supplemental DX top-up system can meet all critical cooling requirements even in warmer climates. These cooling systems have produced low observed PUEs (1.20 or less) with good partial load PUEs. Facilities employing these systems in conjunction with air management systems and Hot Aisle containment to supply air inlet temperatures up to the ASHRAE recommendation of 27°C (81°F) have achieved Tier III certification with no refrigeration or DX systems needed.

Indirect air/evaporative solutions have two drawbacks: a relative lack of skilled service technicians to service the units and high water requirements. For example, one fairly typical unit on the market can use approximately 1,500 cubic meters (≈400,000 gallons) of water per megawatt annually. Facilities need to budget for water treatment and prepare for a peak water scenario to avoid an impactful water shortage for critical cooling.
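A rough water budget follows directly from the per-megawatt figure above. The sketch below uses that 1,500 m³/MW/year number; the facility size and utility rate are illustrative assumptions, not figures from the text:

```python
# Rough annual water budget for an indirect evaporative plant, using the
# ~1,500 cubic meters per megawatt per year figure cited in the article.
# IT load and water rate below are illustrative assumptions.

WATER_M3_PER_MW_YEAR = 1_500  # varies with climate and actual load

def annual_water_m3(it_load_mw: float) -> float:
    """Estimated annual makeup water consumption in cubic meters."""
    return it_load_mw * WATER_M3_PER_MW_YEAR

def annual_water_cost(it_load_mw: float, usd_per_m3: float = 2.0) -> float:
    """Estimated annual water cost; $2/m3 is a placeholder utility rate."""
    return annual_water_m3(it_load_mw) * usd_per_m3

# Hypothetical 4-MW facility
print(annual_water_m3(4.0))   # cubic meters per year
print(annual_water_cost(4.0)) # US$ per year at the placeholder rate
```

Treatment chemicals and filtration add to this figure, so the water line item in an operating budget should be larger than the raw consumption estimate.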

Makeup water storage must meet Tier criteria. Water treatment, distribution, pumps, and other parts of the water system must meet the same requirements as the critical infrastructure. Water treatment is an essential, long-term operation performed using methods such as filtration, reverse osmosis, or chemical dosing. Untreated or insufficiently treated water can potentially foul or scale equipment, and thus, water-based systems require vigilance.

It is important to accurately determine how much makeup water is needed on site. For example, a Tier III facility requires 12 hours of Concurrently Maintainable makeup water, which means multiple makeup water tanks. Designing capacity to account for a worst-case scenario can mean handling and treating a lot of water. Over the 20-30 year life of a data center, thousands of gallons (tens of cubic meters) of stored water may be required, which becomes a site planning issue. Many owners have chosen to exceed 12 hours for additional risk avoidance. For more information, refer to the Accredited Tier Designer Technical Paper Series: Makeup Water.
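The 12-hour requirement translates into a straightforward sizing calculation. A simplified sketch, assuming a hypothetical peak evaporation rate (real sizing must use the site's worst-case wet bulb conditions, and the full Tier analysis is more involved than this):

```python
import math

# Sketch of makeup-water storage sizing against a Tier III objective of
# 12 hours of Concurrently Maintainable reserve. The peak usage rate in
# the example is an illustrative assumption.

def storage_required_m3(peak_usage_m3_per_hr: float, hours: float = 12.0) -> float:
    """Minimum stored makeup water (m3) to ride through `hours` at peak draw."""
    return peak_usage_m3_per_hr * hours

def tanks_needed(total_m3: float, tank_m3: float) -> int:
    """Tank count with one extra tank, so the 12-hour reserve survives any
    single tank being removed for maintenance (a simplification of the
    Concurrent Maintainability analysis)."""
    return math.ceil(total_m3 / tank_m3) + 1

total = storage_required_m3(peak_usage_m3_per_hr=1.5)  # hypothetical peak
print(total)                       # required reserve in cubic meters
print(tanks_needed(total, 10.0))   # tanks of 10 m3 each
```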

WATER-SIDE ECONOMIZERS
Water-side economizer solutions combine a traditional water-cooled chilled water plant with heat exchangers to bypass chiller units. These systems are well known, which means that skilled service technicians are readily available. Data centers have reduced mechanical plant power consumption by 10–25% using water-side economizers. Systems of this type provide perhaps the most traditional form of economizer/mechanical systems power reduction. The technology uses much of the infrastructure that is already in place in older data centers, so it can be the easiest option to adopt. These systems introduce heat exchangers, so cooling comes directly from cooling towers and bypasses chiller units. For example, in a climate like that in the northern U.S., a facility can run water through a cooling tower during the winter to reject heat and supply the data center with cool water without operating a chiller unit.

Controls and automation for transitions between chilled water and heat exchanger modes are operationally critical but can be difficult to achieve smoothly. Some operators may bypass the water-side economizers if they don’t have full confidence in the automation controls. In some instances, operators may choose not to make the switch when a facility is not going to utilize more than four or six hours of economization. Thus, energy savings may actually turn out to be much less than expected.

Relatively high capital expense (CapEx) investment is another drawback. Significant infrastructure must be in place on Day 1 to account for water consumption and treatment, heat exchangers, water pumping, and cooling towers. Additionally, the annualized PUE reduction that results from water-side systems is often not significant, most often in the 0.1–0.2 range. Data center owners will want a realistic cost/ROI analysis to determine if this cooling approach will meet business objectives.

Figure 5. Traditional heat exchanger typical of a water-side economizer system.


Water-side economizers are proven in Tier settings and are found in multiple Tier III facilities. Tier demonstrations are focused on the critical cooling system, not necessarily the economizer function. Because the water-side economizer itself is not considered critical capacity, Tier III demonstrations are performed under chiller operations, as with typical rooftop units. Demonstrations also include isolation of the heat exchanger systems and valves, and control of the economizer functions is critical (see Figures 5 and 6). However, for Tier IV settings where Fault Tolerance is required, the system must be able to respond autonomously. For example, one data center in Spain had an air-side heat recovery system with a connected office building. If an economizer fault occurred, the facility would need to ensure it would not impact the data center. The solution was to have a leak detection system that would shut off the economizer to maintain critical cooling of the data hall in isolation.

Figure 6. The cooling tower takes in hot air from the sides and blows hot, wet air out of the top, cooling the condenser water as it falls down the cooling tower. In operation it can appear that steam is coming off the unit, but this is a traditional cooling tower.


CRACs WITH PUMPED REFRIGERANT ECONOMIZER
Another system adds an economizer function to a computer room air conditioner (CRAC) unit using a pumped liquid refrigerant. In some ways, this technology operates similarly to standard refrigeration units, which use a compressor to convert a liquid refrigerant to a gas. However, instead of using a compressor, newer technology blows air across a radiator unit to reject the heat externally without converting the liquid refrigerant to a gas. This technology has been implemented and tested in several data centers with good results in two Tier III facilities.

The advantages of this system include low capital cost compared to many other mechanical cooling solutions. These systems can also be fairly inexpensive to operate and require no water. Because they use existing technology that has been modified just slightly, it is easy to find service technicians. It is a proven cooling method with low estimated PUE (1.25–1.35), a substantial improvement over traditional CRACs, which typically yield 1.60–1.80. These systems offer distributed control of mode changes. In traditional facilities, switching from chillers to coolers typically happens using one master control. A typical DX CRAC installation will have 10-12 units (or even up to 30) that will self-determine the cooling situation and individually select the appropriate operating mode. Distributed control is less likely to cause a critical cooling problem even if one or several units fail. Additionally, these units do not use any outside air. They recirculate inside air, thus avoiding any outside air issues like pollution and humidity.

The purchase of DX CRAC units with dry coolers does require more CapEx investment, a 50–100% premium over traditional CRACs. Other cooling technologies may offer higher energy efficiency. Additional space is required for the liquid pumping units, typically on the roof or beside the data center.

Multiple data centers that use this technology have achieved Tier III Certification. From a Tier standpoint, these CRACs are the same as the typical CRAC. In particular, the distributed control supports Tier III requirements, including Concurrent Maintainability. The use of DX CRAC systems needs to be considered early in the building design process. For example, the need to pump refrigerant limits the number of building stories. With a CRAC in the computer room and condenser units on the roof, two stories seem to be the building height limit at this time. The suitability of this solution for Tier IV facilities is still undetermined. The local control mechanism is an important step to Fault Tolerance, and Compartmentalization of refrigerant and power must be considered.

OPERATIONS CONSIDERATIONS WITH ECONOMIZERS
Economizer solutions present a number of operational ramifications, including short- and long-term impacts, risk, CapEx, commissioning, and ongoing operating costs. An efficiency gain is one obvious impact, although an economizer can increase some operational maintenance expenses:
• Several types require water filtration and/or water treatment

• Select systems require additional outside air filtration

• Water-side economizing can require additional cooling tower maintenance

Unfortunately, in certain applications, economization may not be a sustainable practice overall, from either a cost or “green” perspective, even though it reduces energy use. For example, high water use is not an ideal solution in dry or water-limited climates. Additionally, extreme use of materials such as filters and chemicals for water treatment can increase costs and also reduce the sustainability of some economizer solutions.

CONCLUSION
Uptime Institute experience has amply shown that, with careful evaluation, planning, and implementation, economizers can be effective at reducing energy use and costs without sacrificing performance, availability, or Tier objectives. Even so, modern data centers have begun to see diminishing PUE returns overall, with many data centers experiencing a leveling off after initial gains. These and all facilities can find it valuable to consider whether investing in mechanical efficiency or broader IT efficiency measures such as server utilization and decommissioning will yield the most significant gains and greater holistic efficiencies.

Economizer solutions can introduce additional risks into the data center, where changes in operating modes increase the risk of equipment failure or operator error. These multi-modal systems are inherently more complex and have more components than traditional cooling solutions. In the event of a failure, operators must know how to manually isolate the equipment or transition modes to ensure critical cooling is maintained.

Any economizer solution must fit both the uptime requirement and business objective, especially if it uses newer technologies or was not originally designed for mission critical facilities. Equally important is ensuring that system selection and installation takes Tier requirements into consideration.

Many data centers with economizers have attained Tier Certification; however, in the majority of facilities, Uptime Institute consultants discovered flaws in the operational sequences or system installation during site inspections that were defeating Tier objectives. In all cases so far, the issues were correctible, but extra diligence is required.

Many economizer solutions are newer technologies, or new applications of existing technology outside their original intended environment; therefore, careful attention should be paid to elements such as control systems to ensure compatibility with mission critical data center operation. Single shared control systems or shared mechanical control components are a particular risk. A single controller, workstation, or weather sensor may fault or require removal from service for maintenance or upgrade over the lifespan of a data center. Neither the occurrence of a component fault nor taking a component offline for maintenance should impact critical cooling. These factors are particularly important when evaluating the impact of economizers on a facility’s Tier objective.

Despite the drawbacks and challenges of properly implementing and managing economizers, their increased use represents a trend for data center operational and ecological sustainability. For successful economizer implementation, designers and owners need to consider the overarching design objectives and data center objectives to ensure those are not compromised in pursuit of efficiency.


ECONOMIZER SUCCESS STORY

Digital Realty’s Profile Park facility in Ireland implemented compressor-less cooling by employing an indirect evaporative economizer, using technology adapted from commercial applications. The system is a success, but it took some careful consideration, adaptation, and fine-tuning to optimize the technology for a Tier III mission critical data center.

Figure 7. The unit operates as a scavenger air system (red area at left) taking the external air and running it across a media. That scavenger air is part of the evaporative process, with the air used to cool the media directly or cool the return air. This image shows summer operation where warm outside air is cooled by the addition of moisture. In winter, outside air cools the return air.


Achieving the desired energy savings first required attention to water (storage, treatment, and consumption). The water storage needs were significant: approximately 60,000 liters (about 15,850 gallons) for the 3.8-megawatt (MW) facility. Water treatment and filtration are critical in this type of system and were a significant challenge. The facility implemented very fine filtration at a particulate size of 1 micron (10 times stricter than would typically be required for potable water). This type of indirect air system eliminates the need for chiller units but does require significant water pressure.

To achieve Tier III Certification, the system also had to be Concurrently Maintainable. A bi-directional water loop with isolation valves separating the units, similar to what would be used with a chilled water system, helped the system meet the Concurrent Maintainability requirement. Two valves in series are located between each unit on the loop (see Figures 7 and 8).

As with any installation that makes use of new technology, the facility required additional testing and operations sequence modification for a mission critical Tier III setting. For example, initially the units were overconsuming power, not responding to a loss of power as expected, and were draining all of the water when power was lost. After adjustments, the system performance was corrected.


Figure 8 (a and b). The system requires roughly twice the fan energy needed by typical rooftop units or CRACs but does not use a compressor refrigeration unit, which reduces overall energy use. Additionally, the fans themselves are high efficiency with optimized motors. Thus, while the facility has approximately twice the number of fans and twice the airflow, it can run the many small units more efficiently.


Ultimately, this facility with its indirect cooling system was Tier III Certified, proving that it is possible to sustain mechanical cooling year-round without compressors. Digital Realty experienced a significant reduction in PUE with this solution, improving from 1.60 with chilled water to 1.15. With this anticipated annualized PUE reduction, the solution is expected to result in approximately €643,000 (US$711,000) in savings per year. Digital Realty was recognized with an Uptime Institute Brill Award for Efficient IT in 2014.
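The reported savings are consistent with a simple energy model. The sketch below is hypothetical: the 3.8-MW IT load is taken from the water-storage figure earlier in this sidebar, and the electricity price is an assumed value chosen for illustration, not a figure published by Digital Realty:

```python
HOURS_PER_YEAR = 8760

it_load_kw = 3_800          # IT load, per the article's 3.8-MW figure
pue_before, pue_after = 1.60, 1.15
price_eur_per_kwh = 0.043   # assumed electricity price, illustration only

# Energy avoided per year: the overhead eliminated by the PUE improvement.
kwh_saved = it_load_kw * (pue_before - pue_after) * HOURS_PER_YEAR
annual_savings = kwh_saved * price_eur_per_kwh
print(f"~{kwh_saved / 1e6:.1f} GWh/year avoided, ~EUR {annual_savings:,.0f}/year")
```

Under these assumptions the model lands near the reported €643,000 per year, though the actual figure depends on tariff structure and real annualized PUE.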


CHOOSING THE RIGHT ECONOMIZER SOLUTION FOR YOUR FACILITY

Organizations that are considering implementing economizers—whether retrofitting an existing facility or building a new one—have to look at a range of criteria. The specifications of any one facility need to be explored with mechanical, electrical, plumbing (MEP), and other vendors, but key factors to consider are:

Geographical area/climate: This is perhaps the most important factor in determining which economizer technologies are viable options for a facility. For example, direct outside air can be a very effective solution in northern locations that have an extended cold winter, while some industrial environments preclude the use of outside air because of high pollutant content. Likewise, some solutions work well in tropical climates but not in arid regions, where water-side solutions are less appropriate.

New build or retrofit: Retrofitting an existing facility can rule out some economizer options, usually due to space considerations but also because systems such as direct air plus evaporative and DX CRAC need to be incorporated at the design stage as part of the building envelope.

Supplier history: Beware of suppliers from other industries entering the data center space. Limited experience with mission critical requirements, including utility loss restarts, control architecture, power consumption, and water consumption, can mean systems need substantial modification to meet 24 x 7 data center operating objectives. Before entering into any agreements, consider which new suppliers will be around for the long term, to ensure that parts supplies and skilled service capabilities will remain available to maintain the system throughout its life cycle.

Financial considerations: Economizers have both CapEx and operating expense (OpEx) impact. Whether an organization wants to invest capital up front or focus on long-term operating budgets depends on the business objectives.

Some general CapEx/OpEx factors to keep in mind include:

• Some newer cooling technologies carry high costs and thus require more up-front CapEx.

• A low initial capital outlay with higher OpEx may be justified in some settings.

• Enterprise owners/operators should consider insertion of economizers into the capital project budget with long-term savings justifications.

ROI objectives: As an organization, what payback horizon is needed for significant PUE reduction? Is it one to two years, five years, or ten? Assumptions about economizer performance should be based on real-world savings; expected annual hours of use and performance should be discounted from the best-case scenarios provided by suppliers. A simple payback model should show the investment recovered from energy savings within three to five years.
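The payback reasoning above can be sketched as a simple model. The derating factor, CapEx, and savings figures below are hypothetical placeholders, not supplier data:

```python
def simple_payback_years(capex: float, supplier_est_savings: float,
                         derate: float = 0.7) -> float:
    """Years to recover CapEx, derating the supplier's best-case annual
    savings to reflect real-world hours of economizer use."""
    return capex / (supplier_est_savings * derate)

# Hypothetical: EUR 1.2M economizer CapEx, supplier claims EUR 500k/year.
payback = simple_payback_years(capex=1_200_000, supplier_est_savings=500_000)
print(f"{payback:.1f} years")  # 3.4 years, within a 3-5 year target
```

Even a crude model like this makes the supplier-claim derating explicit, which is the point: undiscounted best-case savings routinely understate the true payback horizon.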

Depending on an organization’s status and location, it may be possible to utilize sustainability or alternate funding. When it comes to economizers, geography/climate and ROI are typically the most significant decision factors. Uptime Institute’s FORCSS model can aid in evaluating the various economizer technology and deployment options, balancing Financial, Opportunity, Risk, Compliance, Sustainability, and Service Quality considerations (see more about FORCSS at https://journal.uptimeinstitute.com/introducing-uptime-institutes-forcss-system/).


Keith Klesner is Uptime Institute’s Vice President of Strategic Accounts. Mr. Klesner’s career in critical facilities spans 16 years and includes responsibilities ranging from planning, engineering, design, and construction to start-up and ongoing operation of data centers and mission critical facilities. He has a B.S. in Civil Engineering from the University of Colorado-Boulder and an MBA from the University of LaVerne. He maintains status as a professional engineer (PE) in Colorado and is a LEED Accredited Professional.

Tier Certification for Modular and Phased Construction

Special care must be taken on modular and phased construction projects to avoid compromising reliability goals. Shared system coordination could defeat your Tier Certification objective
By Chris Brown

Today, we often see data center owners taking a modular or phased construction approach to reduce design, construction, and operating costs as well as build time. Taking a modular or phased construction approach allows companies to make a smaller initial investment and to delay some capital expenditures by scaling capacity with business growth.

The modular and phased construction approaches bring some challenges, including the need for multiple design drawings for each phase, potential interruption of regular operations and systems during expansion, and the logistics of installation and commissioning alongside a live production environment. Meticulous planning can minimize the risks of downtime or disruption to operations and enable a facility to achieve the same high level of performance and resilience as conventionally built data centers. In fact, with appropriate planning in the design stage and by aligning Tier Certification with the commissioning process for each construction phase, data center owners can simultaneously reap the business and operating benefits of phased construction along with the risk management and reliability validation benefits of Tier Certification Constructed Facility (TCCF).

DEFINING MODULAR AND PHASED CONSTRUCTION
The terms modular construction and phased construction, though sometimes used interchangeably, are distinct. Both refer to the practice of building production capacity in increments over time based on expanding need.

Figure 1. Phased construction allows for the addition of IT capacity over time but relies on infrastructure design to support each additional IT increment.


However, though all modular construction is by its nature phased, not all phased construction projects are modular. Uptime Institute classifies phased construction as any project in which critical capacity components are installed over time (see Figure 1). Such projects often include common distribution systems. Modular construction describes projects that add capacity in blocks over time, typically in repeated, sequential units, each with self-contained infrastructure sufficient to support the capacity of the expansion unit rather than accessing shared infrastructure (see Figure 2).

Figure 2. Modular design supports the IT capacity growth over time by allowing for separate and independent expansions of infrastructure.


For example, a phased construction facility might be built with adequate electrical distribution systems and wiring to support the ultimate intended design capacity, with additional power supply added as needed to support growing IT load. Similarly, cooling piping systems might be constructed for the entire facility at the outset of a project, with additional pumps or chiller units added later, all using a shared distribution system.

Figure 3. Simplified modular electrical system with each phase utilizing independent equipment and distribution systems


For modular facilities, the design may specify an entire electrical system module that encompasses all the engine-generator sets, uninterruptible power supply (UPS) capacities, and associated distribution systems needed to support a given IT load. Then, for each incremental increase in capacity, the design may call for adding another separate and independent electrical system module to support the IT load growth. These two modules would operate independently, without sharing distribution systems (see Figure 3). Taking this same approach, a design may specify a smaller chiller, pump, piping, and an air handler to support a given heat load. Then, as load increases, the design would include the addition of another small chiller, pump, piping, and air handler to support the incremental heat load growth instead of adding onto the existing chilled water or piping system. In both examples, the expansion increments do not share distribution systems and therefore are distinct modules (see Figure 4).

Figure 4. Simplified modular mechanical system with each phase utilizing independent equipment and distribution systems


CERTIFICATION IN A PHASED MODEL­: DESIGN THROUGH CONSTRUCTION
Organizations desiring a Tier Certified data center must first obtain Tier Certification of Design Documents (TCDD). For phased construction projects, the Tier Certification process culminates with TCCF after construction. (For conventional data center projects the Tier Certification process culminates in Tier Certification of Operational Sustainability.) TCCF validates the facility Tier level as it has been built and commissioned. It is not uncommon for multiple infrastructure and/or system elements to be altered during construction, which is why Tier Certification does not end with TCDD; a facility must undergo TCCF to ensure that the facility was built and performs as designed, without any alterations that would compromise its reliability. This applies whether a conventional, phased, or modular construction approach is used.

In a phased construction project, planning for Tier Certification begins in the design stage. To receive TCDD, Uptime Institute will review each phase and all design documents from the initial build through the final construction phase to ensure compliance with Tier Standards. All phases should meet the requirements for the Tier objective.

Certification of each incremental phase of the design depends on meaningful changes to data center capacity, meaningful being the key concept. For example, upgrading a mechanical system may increase cooling capacity, but if it does not increase processing capacity, it is not a meaningful increment. An upgrade to mechanical and/or electrical systems that expands a facility’s overall processing capacity would be considered a meaningful change and necessitate that a facility have its Certification updated.

In some cases, organizations may not yet have fully defined long-term construction phases that would enable Certification of the ultimate facility. In these situations, Uptime Institute will review design documents for only those phases that are fully defined, for Tier Certification specific to those phases. Tier Certification (Tier I-IV) is limited to that specific phase alone. Knowing the desired endpoint is important: if Phases 1 and 2 of a facility do not meet Tier criteria but Phase 3 does, then completion of a TCCF review must wait until Phase 3 is finished.

TCCF includes a site visit with live functional demonstrations of all critical systems, which is typically completed immediately following commissioning. For a phased construction project, Tier Certification of the Phase 1 facility can be the same as Tier Certification for conventional (non-phased) projects in virtually all respects. In both cases, there is no live load at the time, allowing infrastructure demonstrations to be performed easily without risking interruption to the production environment.

Figure 5. Simplified phased electrical system with each additional phase adding equipment while sharing distribution components


The process for Tier Certification of later phases can be as easy as it is for Phase 1 or more difficult, depending on the construction approach. Truly modular expansion designs minimize risk during later phases of commissioning and TCCF because they do not rely on shared distribution systems. Because modules consist of independent, discrete systems, installing additional capacity segments over time does not put facility-wide systems at risk. However, when there is shared infrastructure, as in phased (not modular) projects, commissioning and TCCF can be more complex. Installing new capacity components on top of shared distribution paths, e.g., adding or upgrading an engine generator or UPS module, requires that all testing and demonstrations be repeated across the whole system. It’s important to ensure that all of the system settings work together, for example, verifying that all circuit breaker settings remain appropriate for the new capacity load, so that the new production load will not trip the breakers.
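The breaker-setting verification described above can be expressed as a simple audit pass. The sketch below is a hypothetical illustration: the ratings, loads, and margin are placeholders for what a real coordination study would supply, and a real study would also verify trip curves and selectivity, not just capacity:

```python
def check_breakers(breakers: dict[str, float], loads: dict[str, float],
                   margin: float = 1.25) -> list[str]:
    """Return names of breakers whose trip rating falls below the
    post-expansion design load times a safety margin (assumed 1.25x).
    Flags only obvious capacity shortfalls, not coordination issues."""
    return [name for name, rating_a in breakers.items()
            if rating_a < loads.get(name, 0.0) * margin]

# Hypothetical feeder ratings (amps) vs. post-expansion design loads (amps).
ratings = {"UPS-A feed": 800, "UPS-B feed": 800, "CRAC panel": 225}
new_loads = {"UPS-A feed": 600, "UPS-B feed": 700, "CRAC panel": 150}

print(check_breakers(ratings, new_loads))  # ['UPS-B feed'] needs review
```

The value of scripting even this trivial check is repeatability: after each phase adds capacity to a shared distribution path, the whole set of settings is re-verified, not just the new equipment.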

Pre-planning for later phases can help ensure a smooth commissioning and Tier Certification process even with shared infrastructure. As long as the design phases support a Tier Certification objective, there is no reason why phased construction projects cannot be Tier Certified.

COMMISSIONING AND TIER CERTIFICATION
TCCF demonstrations align with commissioning; both must be completed at the same stage (following installation, prior to live load). If a data center design allows full commissioning to be completed at each phase of construction, Tier Certification is achievable for both modular and non-modular phased projects. TCCF demonstrations would be done at the same expansion stages designated for the TCDD at the outset of the project.

For a modular installation, commissioning and Tier Certification demonstrations can be conducted as normal using load banks inside a common data hall, with relatively low risk. The only significant risk is that, if not managed properly, load banks can direct hot air at server intakes; this risk is readily prevented.

For phased installations that share infrastructure, later phases of commissioning and Tier Certification carry increased risk, because load banks are running in common data halls with shared distribution paths and capacity systems that are supporting a concurrent live load. The best way to reduce the risks of later phase commissioning and Tier Certification is to conduct demonstrations as early in the Certification as possible.

Figure 6. Simplified phased mechanical system with each additional phase adding equipment while sharing distribution components


Shared critical infrastructure distribution systems included in the initial phase of construction can be commissioned and Tier Certified at full (planned) capacity during the initial TCCF review, so these demonstrations can be front loaded and will not need to be repeated at future expansion phases.

The case studies offer examples of how two data centers approached the process of incorporating phased construction practices without sacrificing Tier Certification vital to supporting their business and operating objectives.

CONCLUSION
Modular and phased construction approaches can be less expensive at each phase and require less up-front capital than traditional construction, but installing equipment that is outside of that specified for the TCDD or beyond the capacity of the TCCF demonstrations puts not only the Tier Certification at risk, but the entire operation. Tier Certification remains valid only until there has been a change to the infrastructure. Beyond that, regardless of an organization’s Tier objective, if construction phases are designed and built in a manner that prevents effective commissioning, then there are greater problems than the status of Tier Certification.
A data center that cannot be commissioned at the completion of a phase incurs increased risk of downtime or system error for that phase of operation and all later phases. Successful commissioning and Tier Certification of phased or modular projects requires thinking through the business and operational impacts of the design philosophy and the decisions made regarding facility expansion strategies. Design decisions must be made with an understanding of which factors are and are not consistent with achieving Tier Certification; these are essentially the same factors that allow commissioning. In cases where a facility expansion or system upgrade cannot be Tier Certified, Uptime Institute usually finds the cause in limitations inherent in the design of the facility or in business choices made long before.

It is incumbent upon organizations to think through not only the business rationale but also the potential operational impacts of various design and construction choices. Organizations can simultaneously protect their data center investment and achieve the Tier Certification level that supports the business and operating mission, including modular and phased construction plans, by properly anticipating the need for commissioning in Phase 2 and beyond.

Planning design and construction activities to allow for commissioning greatly reduces the organization’s overall risk. TCCF is the formal validation of the reliability of the built facility.


Case Study: Tier III Certification of Constructed Facility: Phased Construction
An organization planned a South African Tier III facility capital infrastructure project in two build phases with shared infrastructure (i.e., non-modular, phased construction). The original design drawings specified two chilled-water plants: an air-cooled chiller plant and an absorption chiller plant, although the absorption chiller plant was not installed initially due to a limited natural gas supply. The chilled-water system piping was installed up front and connected to the air-cooled chiller plant. Two air-cooled chillers capable of supporting the facility load were then installed.

The organization installed all the data hall air-handling units (AHUs), including two Kyoto Cooling AHUs, on day one. Because the Kyoto AHUs would be very difficult to install once the facility was built, the facility was essentially designed around them. In other words, it was more cost effective to install both AHUs during the initial construction phase, even if their full capacity would not be reached until after Phase 2.

The facility design utilizes a common infrastructure with a single data hall. Phase 1 called for installing 154 kilowatts (kW) of IT capacity; an additional 306 kW of capacity would be added in Phase 2 for a total planned capacity of 460 kW. Phase 1 TCCF demonstrations were conducted first for the 154 kW of IT load that the facility would be supporting initially. In order to minimize the risk to IT assets when Phase 2 TCCF demonstrations are performed, the commissioning team next demonstrated both AHUs at full capacity. They increased the loading on the data hall to a full 460 kW, successfully demonstrating that the AHUs could support that load in accordance with Tier III requirements.

For Tier Certification of Phase 2, the facility will have to demonstrate that the overall chilled water piping system and additional electrical systems support the full 460-kW capacity, but it will not have to demonstrate the AHUs again. During Phase 1 demonstrations, the chillers and engine generators ran at N capacity (both units operating) to provide ample power and cooling to show that the AHUs could support 460 kW in a Concurrently Maintainable manner. The Phase 2 demonstrations will not require placing extra load on the UPS, but they will test the effects of putting more load into the data hall, possibly raising the temperature for the systems under live load.


Case Study: Tier III Expanded to Tier IV
The design for a U.S.-based cloud data center, validated as a Tier III Certified Constructed Facility after the first construction phase, calls for a second construction phase and relies on a common infrastructure (i.e., non-modular, phased construction). The ultimate business objective for the facility is Tier IV, and the facility design supports that objective. The organization was reluctant to spend on the mechanical UPS required to provide Continuous Cooling for the full capacity of the center until it had secured a client requiring Tier IV performance, which would then justify the capital investment in increased cooling capacity.

The organization was only able to achieve this staged Tier expansion because it worked with Uptime Institute consultants to plan both phases and the Tier demonstrations. For Phase 1, the organization installed all systems and infrastructure needed to support a Tier IV operation, except for the mechanical UPS, thus the Tier Certification objective for Phase 1 was to attain Tier III. Phase 1 Tier Certification included all of the required demonstrations normally conducted to validate Tier III, with load banks located in the data hall. Additionally, because all systems except for the mechanical UPS were already installed, Uptime Institute was able to observe all of the demonstrations that would normally be required for Tier IV TCCF, with the exception of Continuous Cooling.

As a result, when the facility is ready to proceed with the Phase 2 expansion, the only demonstration required to qualify for Tier IV TCCF will be Continuous Cooling. The organization will have to locate load banks within the data hall but will not be required to power those load banks from the IT UPS nor simulate faults on the IT UPS system, because that capability has already been satisfactorily observed. Thus, the organization can avoid any risk of interruption to the live customer load the facility will have in place during Phase 2.

The Tier III Certification of Constructed Facility demonstrations require Concurrent Maintainability. The data center must be able to provide baseline power and cooling capacity in each and every maintenance configuration required to operate and maintain the site for an indefinite period. The topology and procedures to isolate each and every component for maintenance, repair, or replacement without affecting the baseline power and cooling capacity in the computer rooms should be in place, with a total of 750 kW of critical IT load spread across the data hall. All other house and infrastructure loads required to sustain the baseline load must also be supported in parallel with, and without affecting, the baseline computer room load.

Tier Certification requirements are cumulative; Tier IV encompasses Concurrent Maintainability, with the additional requirements of Fault Tolerance and Continuous Cooling. To demonstrate Fault Tolerance, a facility must have the systems and redundancy in place so that a single failure of a capacity system, capacity component, or distribution element will not impact the IT equipment. The organization must demonstrate that the system automatically responds to a failure to prevent further impact to the site operations. Assessing Continuous Cooling capabilities requires demonstrations of computer room air conditioning (CRAC) units under various conditions and simulated fault situations.


Chris Brown


Christopher Brown joined Uptime Institute in 2010 and currently serves as Vice President, Global Standards and is the Global Tier Authority. He manages the technical standards for which Uptime Institute delivers services and ensures the technical delivery staff is properly trained and prepared to deliver the services. Mr. Brown continues to actively participate in the technical services delivery including Tier Certifications, site infrastructure audits, and custom strategic-level consulting engagements.

 

Arc Flash Mitigation in the Data Center

Meeting OSHA and NFPA 70E arc flash safety requirements while balancing prevention and production demands
By Ed Rafter

Uninterruptible uptime, 24 x 7, zero downtime…these are some of the terms that characterize data center business goals for IT clients. Given these demands, facility managers and technicians in the industry are skilled at managing the infrastructure that supports these goals, including essential electrical and mechanical systems that are paramount to maintaining the availability of business-critical systems.

Electrical accidents such as arc flash occur all too often in facility environments that have high-energy use requirements, a multitude of high-voltage electrical systems and components, and frequent maintenance and equipment installation activities. A series of stringent standards with limited published exceptions govern work on these systems and associated equipment. The U.S. Occupational Safety and Health Administration (OSHA) and National Fire Protection Association (NFPA) Standard 70E set safety and operating requirements to prevent arc flash and electric shock accidents in the workplace. Many other countries have similar regulatory requirements for electrical safety in the workplace.

When these accidents occur they can derail operations and cause serious harm to workers and equipment. Costs to businesses can include lost work time, downtime, OSHA investigation, fines, medical costs, litigation, lost business, equipment damage, and most tragically, loss of life. According to the Workplace Safety Awareness Council (WPSAC), the average cost of hospitalization for electrical accidents is US$750,000, with many exceeding US$1,000,000.

There are reasonable steps data center operators can—and must—take to ensure the safety of personnel, facilities, and equipment. These steps offer a threefold benefit: the same measures taken to protect personnel also serve to protect infrastructure, and thus protect data center operations.

Across all industries, many accidents are caused by basic mistakes, for example, electrical workers not being properly prepared, working on opened equipment that was not well understood, or magnifying risks through a lack of due diligence. Data center operators, however, are already attuned to the discipline and planning it takes to run and maintain high-availability environments.

While complying with OSHA and NFPA 70E requirements may seem daunting at first, the maintenance and operating standards in place at many data centers enable this industry to effectively meet the challenge of adhering to these mandates. The performance and rigor required to maintain 24 x 7 reliability means the gap between current industry practices and the requirements of these regulatory standards is smaller than it might at first appear, allowing data centers to balance safety with the demands of mission critical production environments.

In this article we describe arc flash and electrical safety issues, provide an overview of the essential measures data centers must follow to meet OSHA and NFPA 70E requirements, and discuss how many of the existing operational practices and adherence to Tier Standards already places many data centers well along the road to compliance.

Figure 1. An arc flash explosion demonstration. Source: Open Electrical

UNDERSTANDING ARC FLASH

Arc flash is a discharge of electrical energy characterized by an explosion that generates light, noise, shockwave, and heat. OSHA defines it as “a phenomenon where a flashover of electric current leaves its intended path and travels through the air from one conductor to another, or to ground (see Figure 1). The results are often violent and when a human is in close proximity to the arc flash, serious injury and even death can occur.” The resulting radiation and shrapnel can cause severe skin burns and eye injuries, and pressure waves can have enough explosive force to propel people and objects across a room and cause lung and hearing damage. OSHA reports that up to 80% of all “qualified” electrical worker injuries and fatalities are not due to shock (electrical current passing through the body) but to external burn injuries caused by the intense radiant heat and energy of an arc fault/arc blast.1

An arc flash results from an arcing electrical fault, which can be caused by dust particles in the air, moisture condensation or corrosion on electrical/mechanical components, material failure, or by human factors such as improper electrical system design, faulty installation, negligent maintenance procedures, dropped tools, or accidentally touching a live electrical circuit. In short, there are numerous opportunities for arc flash to occur in industrial settings, especially those in which there is inconsistency or a lack of adherence to rigorous maintenance, training, and operating procedures.

Variables that affect the power of an arc flash are amperage, voltage, the distance of the arc gap, closure time, three-phase vs. single-phase circuit, and being in a confined space. The power of an arc at the flash location, the distance a worker is from the arc, and the time duration of their exposure to the arc will all affect the extent of skin damage. The WPSAC reports that fatal burns can occur even at distances greater than 10 feet (ft) from an arc location; in fact, serious injury and fatalities can occur up to 20 ft away. The majority of hospital admissions for electrical accidents are due to arc flash burns, with 30,000 arc incidents and 7,000 people suffering burn injuries per year; 2,000 of those require admission to burn centers with severe arc flash burns.2

The severity of an arc flash incident is determined by several factors, including temperature, the available fault current, and the time for a circuit to break. The total clearing time of the overcurrent protective device during a fault is not necessarily linear, as lower fault currents can sometimes result in a breaker or fuse taking longer to clear, thus extending the arc duration and thereby raising the arc flash energy.
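This non-linear relationship between fault current and clearing time can be sketched numerically. The inverse-time trip curve and all constants below are illustrative assumptions, not data for any real protective device; the point is only that a lower fault current can hold an arc for longer and so deliver more total energy:

```python
# Illustrative sketch: why a LOWER fault current can yield a HIGHER
# arc flash energy. The trip curve t = k / ((I/Ip)^2 - 1) and all
# constants are hypothetical, not taken from any real breaker.

def clearing_time(current_a, pickup_a=100.0, k=10.0):
    """Simplified inverse-time trip curve: faster clearing at higher current."""
    ratio = current_a / pickup_a
    if ratio <= 1.0:
        return float("inf")  # below pickup, the device never trips
    return k / (ratio ** 2 - 1.0)

def relative_arc_energy(current_a):
    """Arc energy is roughly proportional to current x arc duration."""
    return current_a * clearing_time(current_a)

# A 200-A fault clears slowly and releases more total energy than a
# 1,000-A fault that trips almost instantly.
for amps in (200.0, 1000.0):
    print(f"{amps:6.0f} A -> clears in {clearing_time(amps):6.3f} s, "
          f"relative energy {relative_arc_energy(amps):6.1f}")
```

Under these assumed constants, the 200-A fault takes tens of times longer to clear than the 1,000-A fault, which is exactly the effect that extends arc duration and raises arc flash energy.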

Unlike the bolted fault (in which high current flows through a solid conductive material typically tripping a circuit breaker or protective device), an arcing fault uses ionized air as a conductor, with current jumping a gap between two conductive objects. The cause of the fault normally burns away during the initial flash, but a highly conductive, intensely hot plasma arc established by the initial arc sustains the event. Arc flash temperatures can easily reach 14,000–16,000°F (7,760–8,870°C) with some projections as high as 35,000°F (19,400°C)—more than three times hotter than the surface of the sun.

These temperatures can be reached by an arc fault event in as little as a few seconds or even a few cycles. The heat generated by the high current flow may melt or vaporize the conductive material and create an arc characterized by a brilliant flash, intense heat, and a fast-moving pressure wave that propels the arcing products. The pressure of an arc blast [up to 2,000 pounds/square foot (9765 kilograms/square meter)] is due to the expansion of the metal as it vaporizes and the heating of the air by the arc. This accounts for the expulsion of molten metal up to 10 ft away. Given these extremes of heat and energy, arc flashes often cause fires, which can rapidly spread through a facility.

INDUSTRY STANDARDS AND REGULATIONS
To prevent these kinds of accidents and injuries, it is imperative that data center operators understand and follow appropriate safety standards for working with electrical equipment. Both the NFPA and OSHA have established standards and regulations that help protect workers against electrical hazards and prevent electrical accidents in the workplace.

OSHA is a federal agency (part of the U.S. Department of Labor) that ensures safe and healthy working conditions for Americans by enforcing standards and providing workplace safety training. OSHA 29 CFR Part 1910, Subpart S and OSHA 29 CFR Part 1926, Subpart K include requirements for electrical installation, equipment, safety-related work practices, and maintenance for general industry and construction workplaces, including data centers.

NFPA 70E is a set of detailed standards (issued at the request of OSHA and updated periodically) that address electrical safety in the workplace. It covers safe work practices associated with electrical tasks and safe work practices for performing other non-electrical tasks that may expose an employee to electrical hazards. OSHA revised its electrical standard to reference NFPA 70E-2000 and continues to recognize NFPA 70E today.

The OSHA standard outlines prevention and control measures for hazardous energies including electrical, mechanical, hydraulic, pneumatic, chemical, thermal, and other energy sources. OSHA requires that facilities:

•   Provide and be able to demonstrate a safety program with defined responsibilities.

•   Calculate the degree of arc flash hazard.

•   Use correct personal protective equipment (PPE) for workers.

•   Train workers on the hazards of arc flash.

•   Use appropriate tools for safe working.

•   Provide warning labels on equipment.

NFPA 70E further defines “electrically safe work conditions” to mean that equipment is not and cannot be energized. To ensure these conditions, personnel must identify all power sources, interrupt the load and disconnect power, visually verify that a disconnect has opened the circuit, lock out and tag the circuit, test for absence of voltage, and ground all power conductors, if necessary.

LOCKOUT/TAGOUT
Most data center technicians will be familiar with lockout and tagging procedures for disabling machinery or equipment. A single qualified individual should be responsible for de-energizing one set of conditions (unqualified personnel should never perform lockout/tagout, work on energized equipment, or enter high risk areas). An appropriate lockout or tagout device should be affixed to the de-energized equipment identifying the responsible individual (see Figure 2).

Figure 2. Equipment lockout/tagout

OVERVIEW: WORKING ON ENERGIZED EQUIPMENT
As the WPSAC states, “the most effective and foolproof way to eliminate the risk of electrical shock or arc flash is to simply de-energize the equipment.” However, both NFPA 70E and OSHA clarify that working “hot” (on live, energized systems) is allowed, subject to set safety limits on voltage exposure, work zone boundary requirements, and other measures that must be taken in these instances. Required safety elements include personnel qualifications, hazard analysis, protective boundaries, and the use of PPE by workers.

Only qualified persons should work on electrical conductors or circuit parts that have not been put into an electrically safe work condition. A qualified person is one who has received training in and possesses skills and knowledge in the construction and operation of electric equipment and installations and the hazards involved with this type of work. Knowledge or training should encompass the skill to distinguish exposed live parts from other parts of electric equipment, determine the nominal voltage of exposed live parts, and calculate the necessary clearance distances and the corresponding voltages to which a worker will be exposed.

An arc flash hazard analysis for any work must be conducted to determine the appropriate arc flash boundary, the incident energy at the working distance, and the necessary protective equipment for the task. Arc flash exposure is measured in thermal energy units of calories per square centimeter (cal/cm2), and the result of an arc flash hazard analysis is referred to as the incident energy of the circuit. Incident energy is both radiant and convective. It is inversely proportional to the square of the working distance and directly proportional to the time duration of the arc and to the available bolted fault current. Time has a greater effect on intensity than the available bolted fault current.

The incident energy and flash protection boundary are both calculated in an arc flash hazard analysis. There are two calculation methods, one outlined in NFPA 70E-2012 Annex D and the other in Institute of Electrical and Electronics Engineers (IEEE) Standard 1584.
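As a rough sketch of how these inputs combine, the Ralph Lee method given in NFPA 70E Annex D estimates incident energy from system voltage, bolted fault current, arc duration, and working distance. The constant and units below follow the commonly published form of that formula, and the 480-V panel numbers are hypothetical; treat this as an illustration of the proportionalities, not a substitute for a study performed per IEEE 1584 in engineering software:

```python
def lee_incident_energy(v_kv, i_bf_ka, t_s, d_mm):
    """Ralph Lee method (NFPA 70E Annex D, commonly published form):
    E = 2.142e6 * V * I_bf * (t / D^2), with E in J/cm^2.
    V in kV, I_bf in kA, t in seconds, D in mm."""
    e_j_cm2 = 2.142e6 * v_kv * i_bf_ka * t_s / d_mm ** 2
    return e_j_cm2 / 4.184  # convert J/cm^2 to cal/cm^2

# Hypothetical 480-V panel: 20-kA bolted fault clearing in 0.1 s,
# evaluated at an 18-in. (455-mm) working distance.
e = lee_incident_energy(v_kv=0.48, i_bf_ka=20.0, t_s=0.1, d_mm=455.0)
print(f"Incident energy: {e:.2f} cal/cm^2")

# Doubling the working distance cuts the exposure by 4x (inverse square).
assert abs(lee_incident_energy(0.48, 20.0, 0.1, 910.0) - e / 4) < 1e-6
```

The inverse-square check at the end mirrors the relationship described above: distance is the worker's cheapest protection, while every extra cycle of clearing time adds energy linearly.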

In practice, to calculate the arc flash (incident energy) at a location, the amount of fault current and the amount of time it takes for the upstream device to trip must be known. A data center should model the distribution system into a software program such as SKM Power System Analysis, calculate the short circuit fault current levels and use the protective device settings feeding switchboards, panelboards, industrial control panels, and motor control centers to determine the incident energy level.

BOUNDARIES
NFPA has defined several protection boundaries: Limited Approach, Restricted, and Prohibited. The intent of NFPA 70E regarding arc flash is to provide guidelines that will limit injury to the onset of second degree burns. Where these boundaries are drawn for any specific task is based on the employee’s level of training, the use of PPE, and the voltage of the energized equipment (see Figure 3).

Figure 3. Protection boundaries. Source: Open Electrical

The closer a worker approaches an exposed, energized conductor or circuit part the greater the chance of inadvertent contact and the more severe the injury that an arc flash is likely to cause that person. When an energized conductor is exposed, the worker may not approach closer than the flash boundary without wearing appropriate personal protective clothing and PPE.

IEEE defines Flash Protection Boundary as “an approach limit at a distance from live parts operating at 50 V or more that are un-insulated or exposed within which a person could receive a second degree burn.” NFPA defines approach boundaries and workspaces as shown in Figure 4. See the sidebar Protection Boundary Definitions.

Figure 4. PPE: typical arc flash suit. Source: Open Electrical

Calculating the specific boundaries for any given piece of machinery, equipment, or electrical component can be done using a variety of methods, including referencing NFPA tables (easiest to do but the least accurate) or using established formulas, an approach calculator tool (provided by IEEE), or one of the software packages available for this purpose.
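As one hedged illustration of the formula-based approach, the Ralph Lee relation from NFPA 70E-2012 Annex D can be inverted to find the distance at which incident energy falls to the 1.2-cal/cm2 (about 5.0 J/cm2) second-degree burn threshold. The constant and the worked numbers are assumptions for illustration only; a real study should use the NFPA tables, IEEE 1584, or one of the software packages mentioned above:

```python
import math

def lee_flash_boundary_mm(v_kv, i_bf_ka, t_s, e_b_j_cm2=5.0):
    """Distance (mm) at which incident energy falls to e_b_j_cm2,
    found by inverting the Lee formula E = 2.142e6 * V * I_bf * t / D^2.
    The default 5.0 J/cm^2 approximates the 1.2 cal/cm^2 threshold."""
    return math.sqrt(2.142e6 * v_kv * i_bf_ka * t_s / e_b_j_cm2)

# Hypothetical 480-V system, 20-kA bolted fault, clearing in 0.1 s:
d = lee_flash_boundary_mm(0.48, 20.0, 0.1)
print(f"Flash protection boundary: {d:.0f} mm ({d / 304.8:.1f} ft)")
```

Note how a longer clearing time pushes the boundary outward as the square root of the duration, which is why breaker settings and maintenance of protective devices figure directly in the hazard analysis.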

PROTECTIVE EQUIPMENT
NFPA 70E outlines strict standards for the type of PPE required for any employees working in areas where electrical hazards are present based on the task, the parts of the body that need protection, and the suitable arc rating to match the potential flash exposure. PPE includes items such as a flash suit, switching coat, mask, hood, gloves, and leather protectors. Flame-resistant clothing underneath the PPE gear is also required.

After an arc flash hazard analysis has been performed, the correct PPE can be selected according to the equipment’s arc thermal performance exposure value (ATPV) and the break open threshold energy rating (EBT). Together, these components determine the calculated hazard level that any piece of equipment is capable of protecting a worker from (measured in calories per square centimeter). For example, a hard hat with an attached face shield provides adequate protection for Hazard/Risk Category 2, whereas an arc flash protection hood is needed for a worker exposed to Hazard/Risk Category 4.
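The selection step can be illustrated with a simple lookup. The sketch below uses the minimum arc ratings commonly associated with the NFPA 70E-2012 Hazard/Risk Categories (4, 8, 25, and 40 cal/cm2 for Categories 1 through 4); these thresholds and the selection logic are a simplification for illustration, and actual PPE selection must follow the standard's tables and the site's hazard analysis:

```python
# Hypothetical mapping of calculated incident energy to NFPA 70E
# Hazard/Risk Category, using commonly cited minimum arc ratings
# in cal/cm^2. Illustration only; not a substitute for the standard.
HRC_RATINGS = [(1, 4.0), (2, 8.0), (3, 25.0), (4, 40.0)]

def hazard_risk_category(incident_energy_cal_cm2):
    """Return the lowest category whose arc rating covers the exposure,
    or None if the exposure exceeds Category 4 (de-energize instead)."""
    for category, rating in HRC_RATINGS:
        if incident_energy_cal_cm2 <= rating:
            return category
    return None

print(hazard_risk_category(6.5))   # hard hat + face shield class of task
print(hazard_risk_category(30.0))  # requires an arc flash protection hood
```

This mirrors the examples in the text: an exposure a little above 4 cal/cm2 lands in Category 2, while one approaching 40 cal/cm2 demands full Category 4 gear.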

PPE is the last line of defense in an arc flash incident; it’s not intended to prevent all injuries, but to mitigate the impact of a flash, should one occur. In many cases, the use of PPE has saved lives or prevented serious injury.

OTHER SAFETY MEASURES
Additional safety-related practices for working on energized systems could include conducting a pre-work job briefing, using insulated tools, having a written safety program, applying flash hazard labeling (labels should indicate the flash hazard boundaries for a piece of equipment and the PPE needed to work within those boundaries), and completing an Energized Electrical Work Permit. According to NFPA, an Energized Electrical Work Permit is required for a task when live parts over 50 volts are involved. The permit outlines conditions and work practices needed to protect employees from arc flash or contact with live parts, and includes the following information:

•   Circuit, equipment, and location

•   Reason for working while energized

•   Shock and arc flash hazard analysis

•   Safe work practices

•   Approach boundaries

•   Required PPE and tools

•   Access control

•   Proof of job briefing.

DECIDING WHEN TO WORK HOT
NFPA 70E and OSHA require employers to prove that working in a de-energized state creates more or worse hazards than the risk presented by working on live components or is not practical because of equipment design or operational limitations, for example, when working on circuits that are part of a continuous process that cannot be completely shut down. Other exceptions include situations in which isolating and deactivating system components would create a hazard for people not associated with the work, for example, when working on life-support systems, emergency alarm systems, ventilation equipment for hazardous locations, or extinguishing illumination for an area.

In addition, OSHA makes provision for situations in which it would be “infeasible” to shut down equipment, for example, some maintenance and testing operations can only be done on live electric circuits or equipment. The decision to work hot should only be made after careful analysis of the determination of what constitutes infeasibility. In recent years, some well publicized OSHA actions and statements have centered on the matter of how to interpret this term.

ELECTRICAL SAFETY MEASURES IN PRACTICE
Many operational and maintenance practices will help minimize the potential for arc flash, reduce the incident energy or arcing time, or move the worker away from the energy source. In fact, many of these practices are consistent with the rigorous operational and maintenance processes and procedures of a mission-critical data center.

Although the electrical industry is aware of the risks of arc flash, according to the National Institute for Occupational Safety and Health, the biggest worksite personnel hazard is still electrical shock in all but the construction and utility industries. In his presentation at an IEEE-Industry Applications Society (IAS) workshop, Ken Mastrullo of the NFPA compared OSHA 1910 Subpart S citations versus accidents and fatalities between 1 Oct. 2003, and 30 Sept. 2004. Installations accounted for 80% of the citations, while safe work practice issues were cited 20% of the time. However, installations accounted for only 9% of the accidents, while safe work practice issues accounted for 91% of all electrical-related accidents. In other words, while the majority of OSHA citations were for installation issues, the majority of injuries stemmed from work practice issues.

Can OSHA cite a company that does not comply with NFPA 70E? The simple answer is: Yes. If employees are involved in a serious electrical incident, OSHA likely will present the employer/owner with several citations. In fact, OSHA assessed more than 2,880 fines between 2007 and 2011 for sites not meeting Regulation 1910.132(d), averaging 1.5 fines a day.

On the other hand, an OSHA inspection may actually help uncover issues. A May 2012 study of 800 California companies found that those receiving an inspection saw a decline of 9.4% in injuries. On average, these companies saved US$350,000 over the five years following the OSHA inspections,3 an outcome far preferable to being fined for noncompliance or experiencing an electrical accident. Beyond the matter of fines, however, any organization that wishes to effectively avoid putting its personnel in danger—and endangering infrastructure and operations—should endeavor to follow NFPA 70E guidelines (or their regional equivalent).

REDUCING ARC FLASH HAZARDS IN THE FACILITY
While personnel-oriented safety measures are the most important (and mandated) steps to reduce the risk of arc flash accidents, there are numerous equipment and component elements that can be incorporated into facility systems that also help reduce the risk. These include metal-clad switchgear, arc resistant switchgear, current-limiter power circuit breakers, and current-limiting reactors. Setting up zone selective interlocking of circuit breakers can also be an effective prevention measure.

TIER STANDARDS & DATA CENTER PRACTICES ALIGN WITH ARC FAULT PREVENTION
Data centers are already ahead of many industries in conforming to many provisions of OSHA and NFPA 70E. Many electrical accidents are caused by issues such as dust in the environment, improper equipment installation, and human factors. To maintain the performance and reliability demanded by customers, data center operators have adopted a rigorous approach to cleaning, maintenance, installation, training, and other tasks that forestall arc flash. Organizations that subscribe to Tier standards and maintain stringent operational practices are better prepared to take on the challenges of compliance with OSHA and NFPA 70E requirements, in particular the requirements for safely performing work on energized systems, when such work is allowed per the safety standards.

For example, commissioning procedures eliminate the risk of improper installation. Periodically load testing engine generators and UPS systems demonstrates that equipment capacity is available and helps identify out-of-tolerance conditions that are indicative of degrading hardware or calibration and alignment issues. Thermographic scanning of equipment, distribution boards, and conduction paths can identify loose or degraded connections before they reach a point of critical failure.

Adherence to rigorous processes and procedures helps avoid operator error, and these processes also serve as tools in personnel training and refresher classes. Facility and equipment design and capabilities, maintenance programs, and operating procedures are typically well defined and established in a mission critical data center, especially those at a Tier III or Tier IV Certification level.

Beyond the Tier Topology, the operational requirements for every infrastructure classification, as defined in the Tier Standard: Operational Sustainability, include the implementation of processes and procedures for all work activities. Completing comprehensive predictive and preventive maintenance increases reliability, which in turn improves availability. Methods of procedure are generally very detailed and task specific. Maintenance technicians meet stringent qualifications to perform work activities. Training is essential, and planning, practice, and preparation are key to managing an effective data center facility.

This industry focus on rigor and reliability in both systems and operational practices, reinforced by the Tier Standards, will enable data center teams to rapidly adopt and adhere to the practices required for compliance with OSHA and NFPA 70E. What still remains in question is whether or not a data center meets the infeasibility test prescribed by these governing bodies in either the equipment design or operational limitations.

It can be argued that some of today’s data center operations approach the status of being “essential” for much of the underlying infrastructure that runs our 24 x 7 digitized society. Data centers support the functioning of global financial systems, power grids and utilities, air traffic control operations, communication networks, and the information processing that supports vital activities ranging from daily commerce to national security. Each facility must assess its operations and system capabilities to enable adherence to safe electrical work practices as much as possible without jeopardizing critical mission functions. In many cases, it may become a jurisdictional decision as to the answer for a specific data center business requirement.

No measure will ever completely remove the risk of working on live, energized equipment. In instances where working on live systems is necessary and allowed by NFPA 70E rules, the application of Uptime Institute Tier III and Tier IV criteria can help minimize the risks. Tier III and IV both require the design and installation of systems that enable equipment to be fully de-energized to allow planned activities such as repair, maintenance, replacement, or upgrade without exposing personnel to the risks of working on energized electrical equipment.

CONCLUSION
Over the last several decades, data centers and the information processing power they provide have become a fundamental necessity in our global, interconnected society. Balancing the need for appropriate electrical safety measures and compliance with the need to maintain and sustain uninterrupted production capacity in an energy-intensive environment is a challenge. But it is a challenge the data center industry is perhaps better prepared to meet than many other industry segments. It is apparent that those in the data center industry who subscribe to high-availability concepts such as the Tier Standards: Topology and Operational Sustainability are in a position to readily meet the requirements of NFPA 70E and OSHA from an execution perspective.


 

SIDEBAR: PROTECTION BOUNDARY DEFINITIONS
The flash protection boundary is the closest approach allowed by qualified or unqualified persons without the use of PPE. If the flash protection boundary is crossed, PPE must be worn. The boundary is a calculated number based upon several factors such as voltage, available fault current, and time for the protective device to operate and clear the fault. It is defined as the distance at which the worker is exposed to 1.2 cal/cm2 for 0.1 second.

LIMITED APPROACH BOUNDARY
The limited approach boundary is the minimum distance from the energized item where untrained personnel may safely stand. No unqualified (untrained) personnel may approach any closer to the energized item than this boundary. The boundary is determined by NFPA 70E Table 130.4-(1) (2) (3) and is based on the voltage of the equipment (2012 Edition).

RESTRICTED APPROACH BOUNDARY
The restricted approach boundary is the distance where qualified personnel may not cross without wearing appropriate PPE. In addition, they must have a written approved plan for the work that they will perform. This boundary is determined from NFPA Table 130.4-(1) (4)  (2012 Edition) and is based on the voltage of the equipment.

PROHIBITED APPROACH BOUNDARY
Only qualified personnel wearing appropriate PPE can cross a prohibited approach boundary. Crossing this boundary is considered the same as contacting the exposed energized part. Therefore, personnel must obtain a risk assessment before the prohibited boundary is crossed. This boundary is determined by NFPA 70E Table 130.4-(1) (5)  (2012 Edition) and is based upon the voltage of the equipment.


Ed Rafter

Edward P. Rafter has been a consultant to Uptime Institute Professional Services (ComputerSite Engineering) since 1999 and assumed a full time position with Uptime Institute in 2013 as principal of Education and Training. He currently serves as vice president-Technology. Mr. Rafter is responsible for the daily management and direction of the professional education staff to deliver all Uptime Institute training services. This includes managing the activities of the faculty/staff delivering the Accredited Tier Designer (ATD) and Accredited Tier Specialist (ATS) programs, and any other courses to be developed and delivered by Uptime Institute.

 

ADDITIONAL RESOURCES
To review the complete NFPA-70E standards as set forth in NFPA 70E: Standard For Electrical Safety In The Workplace, visit www.NFPA.org

For resources to assist with calculating flash protection boundaries, visit:

•   http://www.littelfuse.com/arccalc/calc.html

•   http://www.pnl.gov/contracts/esh-procedures/forms/sp00e230.xls

•   http://www.bussmann.com/arcflash/index.aspx

To determine what PPE is required, the tables in NFPA 70E-2012 provide the simplest methods for determining PPE requirements. They provide instant answers with almost no field data needed. The tables have limited application and are conservative for most applications (they are not intended as a substitute for an arc hazard analysis but only as a guide).

A simplified two-category PPE approach is found in NFPA 70E-2012, Table H-2 of Annex H. This table ensures adequate PPE for electrical workers within facilities with large and diverse electrical systems. Other good resources include:

•   Controlling Electrical Hazards. OSHA Publication 3076, (2002). 71 pages. Provides an overview
of basic electrical safety on the job, including information on how electricity works, how to protect
against electricity, and how OSHA can help.

•   Electrical Safety: Safety and Health for Electrical Trades Student Manual, U.S. Department of Health and
Human Services (DHHS). National Institute for Occupational Safety and Health (NIOSH), Publication
No. 2002-123, (2002, January). This student manual is part of a safety and health curriculum for
secondary and post-secondary electrical trades courses. It is designed to engage the learner in
recognizing and controlling hazards associated with electrical work.

•   Electrocutions Fatality Investigation Reports. National Institute for Occupational Safety and Health
(NIOSH) Safety and Health Topic. Provides information regarding hundreds of fatal incidents involving
electrocutions investigated by NIOSH and state investigators.

•   Working Safely with Electricity. OSHA Fact sheet. Provides safety information on working with
generators, power lines, extension cords, and electrical equipment.

•   Lockout/Tagout OSHA Fact Sheet, (2002).

•   Lockout-Tagout Interactive Training Program. OSHA. Includes selected references for training and
interactive case studies.

•   NIOSH Arc Flash Awareness, NIOSH Publication No. 2007-116D.

ENDNOTES
1.  http://www.arcsafety.com/resources/arc-flash-statistics

2. Common Electrical Hazards in the Workplace including Arc Flash, Workplace Safety Awareness Council (www.wpsac.org), produced under Grant SH-16615-07-60-F-12 from the Occupational Safety and Health Administration, U.S. Department of Labor.

3. “The Business Case For Safety and Health,” U.S. Department of Labor, https://www.osha.gov/dcsp/products/topics/businesscase/

Failure Doesn’t Keep Business Hours: 24×7 Coverage

A statistical justification for 24×7 coverage
By Richard Van Loo

As a result of performing numerous operational assessments at data centers around the world, Uptime Institute has observed that staffing levels at data centers vary greatly from site to site. This observation is discouraging, but not surprising, because while staffing is an important function for data centers attempting to maintain operational excellence, many factors influence an organization’s decision on appropriate staffing levels.

Factors that can affect overall staffing numbers include the complexity of the data center, the level of IT turnover, the number of support activity hours required, the number of vendors contracted to support operations, and business objectives for availability. Cost is also a concern because each staff member represents a direct cost. Because of these numerous factors, data center staffing levels must be constantly reviewed in an attempt to achieve effective data center support at a reasonable cost.

Uptime Institute is often asked, “What is the proper staffing level for my data center?” Unfortunately, there is no quick answer that works for every data center since proper staffing depends on a number of variables.

The time required to perform maintenance tasks and to provide shift coverage support are two basic variables. Staffing for maintenance requirements is relatively fixed, but it is affected by which activities are performed by data center personnel and which are performed by vendors. Shift coverage support is defined as staffing for data center monitoring and rounds and for responding to any incidents or events. Staffing levels to support shift coverage can be provided in a number of different ways. Each method of providing shift coverage has potential impacts on operations depending on how that coverage is focused.

TRENDS IN SHIFT COVERAGE
The primary purpose of having qualified personnel on site is to mitigate the risk of an outage caused by abnormal incidents or events, either by preventing the incident or by containing and isolating it to keep it from spreading to or impacting other systems. Many data centers still maintain shift presence with a team of qualified electricians, mechanics, and other technicians who provide 24 x 7 coverage. Remote monitoring technology, designs that incorporate redundancy, campus data center environments, the desire to balance costs, and other practices can lead organizations to deploy personnel differently.

Managing shift presence without having qualified personnel on site at all times can elevate risks due to delayed response to abnormal incidents. Ultimately, the acceptable level of risk must be a company decision.

Other shift presence models include:

• Training security personnel to respond to alarms and execute an escalation procedure

• Monitoring the data center through a local or regional building monitoring system (BMS) and having technicians on call

• Having personnel on site during normal business hours and on call during nights and weekends

• Operating multiple data centers as a campus or portfolio so that a team supports multiple data centers without necessarily being on site at each individual data center at a given time

These and other models have to be individually assessed for effectiveness. To assess the effectiveness of any shift presence model, the data center must determine the potential risks of incidents to the operations of the data center and the impact on the business.
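One hedged way to frame that assessment is to compare the expected annual outage cost of each model, weighing incident frequency against the response delay the model implies. The sketch below uses entirely assumed placeholder numbers (not AIRs data) and a deliberately simple cost model.

```python
# Hypothetical comparison of shift presence models by expected annual outage
# cost. Every number is an assumed placeholder, not AIRs data.

def expected_annual_cost(incidents_per_year: float,
                         escalation_prob: float,
                         response_minutes: float,
                         cost_per_minute: float) -> float:
    """Expected cost = incidents that would escalate without intervention,
    multiplied by the downtime accrued while waiting for a responder."""
    return incidents_per_year * escalation_prob * response_minutes * cost_per_minute

# Assumed: 12 incidents/year, 10% escalate without intervention,
# downtime costs 1,000 currency units per minute.
on_site = expected_annual_cost(12, 0.10, 5, 1000)    # technician already present
on_call = expected_annual_cost(12, 0.10, 45, 1000)   # responder must travel in

print(on_site < on_call)  # the delay, not the incident count, drives the gap
```

A real assessment would also weigh staffing cost against avoided outage cost, but even this toy model shows how response delay translates directly into business impact.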

For the last 20 years, Uptime Institute has built the Abnormal Incident Reports (AIRs) database using information reported by Uptime Institute Network members. Uptime Institute analyzes the data annually and reports its findings to Network members. The AIRs database provides interesting insights relating to staffing concerns and effective staffing models.

INCIDENTS OCCUR OUTSIDE BUSINESS HOURS
In 2013, a slight majority of incidents (out of 277 total incidents) occurred during normal business hours. However, 44% of incidents happened between midnight and 8:00 a.m., which underscores the potential need for 24 x 7 coverage (see Figure 1).

Figure 1. Approximately half the AIRs reported in 2013 took place between 8 a.m. and 12 a.m., the other half between 12 a.m. and 8 a.m.


Similarly, incidents can happen at any time of the year, and their occurrence is spread fairly evenly across it. As a result, concentrating shift presence at certain times of year would not be productive.

Figure 2 details the day of the week when incidents occurred. The chart shows that incidents occur on nearly an equal basis every day of the week, which suggests that shift presence requirement levels should be the same every day of the week. To do otherwise would leave shifts with little or no shift presence to mitigate risks. This is an important finding because some data centers focus their shift presence support Monday through Friday and leave weekends to more remote monitoring (see Figure 2).

Figure 2. Data center staff must be ready every day of the week.


INCIDENTS BY INDUSTRY
Figure 3 further breaks down the incidents by industry and shows no significant difference in those trends between industries. The chart does show that the financial services industry reported far more incidents than other industries, but that number reflects the makeup of the sample more than anything.


Figure 3. Incidents in data centers take place all year round.

INCIDENT BREAKDOWNS

Knowing when incidents occur does little to determine which personnel should be on site. Knowing what kinds of incidents occur most often will help shape the composition of the on-site staff, as will knowing how incidents are most often identified. Figure 4 shows that electrical systems experience the most incidents, followed by mechanical systems. By contrast, critical IT load causes relatively few incidents.

Figure 4. More than half the AIRs reported in 2013 involved the electrical system.


As a result, it would seem to make sense that shift presence teams should have sufficient electrical experience to respond to the most common incidents. The shift presence team must also respond to other types of incidents, but cross training electrical staff in mechanical and building systems might provide sufficient coverage. And, on-call personnel might cover the relatively rare IT-related incidents.

The AIRs database also sheds some light on how incidents are discovered. Figure 5 suggests that over half of all incidents in 2013 were discovered by alarms and more than 40% were discovered by technicians on site, together accounting for about 95% of incidents. The biggest change over the years covered by the chart is the slow growth of incidents discovered by alarm.

Figure 5. Alarms are now the source for most AIRs; however, availability failures are more likely to be found by technicians.


Alarms, however, cannot respond to or mitigate incidents. Uptime Institute has witnessed a number of methods for saving a data center from going down and reducing the impact of an incident. These methods depend on having personnel on site to respond to the incident, on redundancy built into critical systems, and on strong predictive maintenance programs that forecast potential failures before they occur. Figure 6 breaks down how often each of these methods produced actual saves.

Figure 6. Equipment redundancy was responsible for more saves in 2013 than in previous years.


The chart also appears to suggest that in recent years, equipment redundancy and predictive maintenance are producing more saves and technicians fewer. There are several possible explanations for this finding, including more robust systems, greater use of predictive maintenance, and budget cuts that reduce staffing or move it off site.

FAILURES
The data show that all the availability failures in 2013 were caused by electrical system incidents. A majority of the failures occurred because maintenance procedures were not followed. This finding underscores the importance of having proper procedures and well-trained staff, and of ensuring that vendors are familiar with the site and its procedures.

Figure 7. Almost half the AIRs reported in 2013 were In Service.


Figure 7 further explores the causes of incidents in 2013. Roughly half the incidents were described as “In Service,” which is defined as inadequate maintenance, equipment adjustment, operated to failure, or no root cause found. The incidents attributed to preventive maintenance actually refer to preventive maintenance that was performed improperly. Data center staff caused just 2% of incidents, showing that the interface of personnel and equipment is not a main cause of incidents and outages.

SUMMARY
The increasing sophistication of data center infrastructure management (DCIM), building management systems (BMS), and building automation systems (BAS) raises the question of whether staffing at data centers can be reduced. The advances in these systems are real and can enhance the operations of a data center; however, as the AIRs data show, mitigating incidents often requires on-site personnel. This is why it is still a prescriptive behavior for Tier III and Tier IV Operational Sustainability Certified data centers to have qualified full-time equivalent (FTE) personnel on site at all times: the driving purpose is quick response to mitigate any incidents and events.

The data show no pattern as to when incidents occur; their occurrence is spread fairly evenly across all hours of the day and all days of the week. As data centers continue to evolve, with increased remote access and more built-in redundancy, time will tell whether these trends continue on their current path. As with any data center operations program, the fundamental objective is risk avoidance. Each data center is unique, with its own set of inherent risks. Shift presence is just one factor, but an important one; decisions about how many people to staff on each shift, and with what qualifications, can have a major impact on risk avoidance and continued data center availability. Choose wisely.


Rich Van Loo


Rich Van Loo is Vice President, Operations for Uptime Institute. He performs Uptime Institute Professional Services audits and Operational Sustainability Certifications. He also serves as an instructor for the Accredited Tier Specialist course.

Mr. Van Loo’s work in critical facilities includes responsibilities ranging from project manager of a major facility infrastructure service contract for a data center to space planning for the design and construction of several data center modifications and facilities IT support. As a contractor for the Department of Defense, Mr. Van Loo provided planning, design, construction, operation, and maintenance of worldwide mission critical data center facilities. Mr. Van Loo’s 27-year career includes 11 years as a facility engineer and 15 years as a data center manager.