Solving Air Contaminant Problems in Data Centers

RoHS-compliant products exacerbate the problem

By Christopher Muller, Dr. Prabjit Singh, G. Henry White, and Paul Finch

End users have worried about the reliability of electronic gear almost since the introduction of the circuit board. Restriction of Hazardous Substances (RoHS), or lead-free, manufacturing regulations for electronic equipment that went into effect in 2006 only served to increase those concerns. Today companies selling consumer electronics, industrial process, and control systems products in the European Union and a few other European nations that adhere to RoHS regulations must be aware of a number of additional obligations. RoHS-compliant datacom and IT equipment, in particular, is at risk in locations with poor ambient air quality. Some data centers in urban locations have reported failures of servers and hard disk drives caused by sulfur corrosion.

As a result, new industry-accepted specifications include particulate contamination limits that specify the quantity and deliquescent relative humidity of dust. Additionally, research by ASHRAE’s Technical Committee 9.9 for Mission Critical Facilities, Technology Spaces, and Electronic Equipment led to the publication of a white paper on contamination guidelines for data centers1 and the formulation of new gaseous contamination limits incorporated into the updated International Society of Automation (ISA) Standard 71.04-2013.2 This research also led to the publication of an International Electronics Manufacturing Initiative (iNEMI) position paper3 and efforts to update the Chinese data center design guide GB 50174-2008.4

The Lead-Free Transition
Industry did not foresee one failure mechanism caused by the transition to lead-free products mandated by RoHS. Products with an immersion silver (ImmAg) surface finish will creep corrode in environments that electronic equipment manufacturers consider to be high sulfur (ISA Class G2 or higher).5 As a result, the number and types of corrosion failures in high-pollution locations around the world have increased dramatically. The common component failures include hard disk drives (HDD), graphic cards, motherboards, DIMMs, capacitors, and transistors. In fact, the rate of failure has become so severe that many of the world’s leading IT and datacom equipment manufacturers have changed their warranties to include requirements for the control of corrosion due to gaseous contamination:

• Dell PowerEdge R310 Rack Server – The following airborne contaminant level supersedes the information that is present in the Getting Started Guide of your system: Airborne Contaminant Level Class G1 as defined by ISA-S71.04-1985.6

• IBM Power7 Server Series – Severity level G1 as per ANSI/ISA 71.04-1985, which states that the reactivity rate of copper coupons shall be less than 300 angstroms (Å)/month (≈ 0.0039 micrograms per square centimeter-hour [μg/cm2-hour] weight gain). In addition, the reactivity rate of silver coupons shall be less than 300 Å/month (≈ 0.0035 μg/cm2-hour weight gain). The reactive monitoring of gaseous corrosivity should be conducted approximately 2 inches (5 cm) in front of the rack on the air inlet side at one-quarter and three-quarter frame height off the floor or where the air velocity is much higher.7

• HP Integrity Superdome 2 Enclosure –  Specification: Gaseous contaminants must be at the G1 level or less as defined by ISA Standard ISA-71.04-1985.8

Contamination Control Process
With the changes to IT and datacom equipment mandated by various RoHS directives, data center owners, managers, and operators should include an environmental contamination monitoring and control section as part of an overall site planning, risk management, mitigation, and improvement plan.

The three parts of this plan should comprise:
1. Considerations for the assessment of the outdoor air and indoor environment with regards to corrosion potential. ISA Standard 71.04 can be used to provide site-specific data about the types and levels of gaseous contamination in terms of the amount of corrosion being formed. Corrosion classification coupons (CCCs) can be used as a survey tool to establish the baseline data necessary to determine whether environmental controls are needed and, if so, which ones.

2. A specific contamination control strategy. Corrosion in an indoor environment is most often caused by a short list of chemical contaminants or combination of contaminants. The contaminants present in a specific area are highly dependent on the controls put in place to mitigate them. Most often this involves the selection and application of appropriate chemical filtration systems to clean both the outdoor air used for pressurization and/or ventilation and any recirculated air.

3. A real-time environmental monitoring program based on the severity levels established in ISA Standard 71.04. Real-time atmospheric corrosion monitors can provide accurate and timely data on the performance of the chemical filtration systems as well as the room air quality.

Often the relationship between corrosion levels and hardware failures in data centers is overlooked or unknown. However, AMD, Cisco, Cray, Dell, EMC, Huawei, Hitachi, HP, IBM, Intel, Oracle, Seagate, SGI, and others are working hard to increase awareness of the problem and its solutions. These manufacturers are also working to develop successful corrosion monitoring and control programs.

Data Center Design and Operation Requirements
Data centers are dynamic environments where maintenance operations, infrastructure upgrades, and equipment changes take place regularly, leading to possible introduction of airborne contaminants. Data centers also house other contaminants, such as chlorine, that can be emitted from PVC insulation on wires and cables if temperatures get too high.

However, outdoor air used for ventilation, pressurization, and/or cooling remains the primary source of airborne contaminants. The growing use of air-side economizers for free cooling means that even data centers located in regions without major air quality concerns may struggle to maintain an environment conducive to the protection of sensitive electronic equipment. Air used for any of these purposes should be cleaned before being introduced into the data center.

To meet warranty requirements for new IT and datacom equipment, data center owners and operators must take action to eliminate airborne contaminants from these sources. These steps include:

• Measure the reactivity (corrosion) rates inside the data center and in the outdoor air

• Seal all doors, windows, and wall penetrations in the data center

• Install room pressure monitors if the data center envelope is designed to be positively pressured

• Measure airflow at the supply and exhaust air grills, and at each computer room air-conditioning (CRAC) unit

• Develop a temperature and humidity profile
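
The last step lends itself to a simple illustration: a temperature and humidity profile can be reduced to a few summary statistics per measurement location. The sketch below is a minimal Python example, assuming readings are collected as (temperature °C, relative humidity %) pairs; the Magnus-formula constants used for the dew-point estimate are standard approximations, and the sample values are hypothetical.

  import math

  def dew_point_c(temp_c, rh_percent):
      # Magnus approximation for dew point; adequate for typical data center conditions
      a, b = 17.62, 243.12
      gamma = math.log(rh_percent / 100.0) + (a * temp_c) / (b + temp_c)
      return (b * gamma) / (a - gamma)

  def profile(readings):
      # readings: list of (temp_c, rh_percent) samples from one location
      temps = [t for t, _ in readings]
      rhs = [rh for _, rh in readings]
      return {
          "temp_min": min(temps), "temp_max": max(temps),
          "temp_avg": sum(temps) / len(temps),
          "rh_min": min(rhs), "rh_max": max(rhs),
          "dew_point_max": max(dew_point_c(t, rh) for t, rh in readings),
      }

  # Hypothetical samples from one cold-aisle location
  print(profile([(22.0, 45.0), (23.5, 50.0), (24.0, 48.0)]))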

It is incumbent upon data center managers to maintain hardware reliability by monitoring and controlling both the gaseous and particulate contamination in their data centers. ASHRAE, in cooperation with many of the world’s leading manufacturers of computer systems, has developed guidelines that summarize the acceptable levels of contamination (See Table 1).

Table 1. Particulate and gaseous contamination guidelines for data centers

Environmental Assessments
ISA Standard 71.04-2013 Environmental Conditions for Process Measurement and Control Systems: Airborne Contaminants describes a simple quantitative method to determine the airborne corrosivity in a data center environment. This method, called “reactivity monitoring,” involves the analysis of copper and silver sensors that have been exposed to the environment for a period of time to determine corrosion film thickness and chemistry. Silver reactivity monitoring done as part of the assessment provides a complete accounting of the types of corrosive chemical species in the data center environment.

ISA 71.04 classifies four levels of environmental severity for electrical and electronic systems, providing a measure of the corrosion potential of an environment (See Table 2). The overall classification is based on the higher of the total copper and silver reactivity rates.
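
That classification logic is easy to express in code. The Python sketch below is a minimal illustration: the copper breakpoints (G1 below 300, G2 below 1,000, G3 below 2,000 Å/month, GX above that) follow the commonly published ISA 71.04 copper severity levels, and applying the same breakpoints to silver is an assumption made here for illustration, since Table 2 is not reproduced in this text. The overall class is taken as the worse of the two metals, as described above.

  # Å/month breakpoints; applying the copper limits to silver is an assumption for illustration
  LEVELS = [(300, "G1"), (1000, "G2"), (2000, "G3")]

  def severity(rate_angstrom_per_month):
      for limit, label in LEVELS:
          if rate_angstrom_per_month < limit:
              return label
      return "GX"

  def overall_class(copper_rate, silver_rate):
      # The overall classification is the higher (worse) of the two severities
      order = ["G1", "G2", "G3", "GX"]
      return max(severity(copper_rate), severity(silver_rate), key=order.index)

  print(overall_class(180, 420))   # -> "G2": copper alone is G1, but silver pushes the class to G2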

Corrosion Monitoring
Many options can be considered with respect to air quality monitoring for data center applications. Proper assessment of environmental conditions in the data center requires monitoring of outdoor and ambient air at various locations inside and outside the data center. In addition, consideration should be given to room size and layout to determine the proper placement and type of CCCs, which can help determine compliance with air-quality specifications, and real-time atmospheric corrosion monitors (ACMs). Data from the ACMs can help provide a statistically valid environmental assessment of corrosion rates and of corrosion control strategies. Either monitoring technique may be used to provide the data necessary to troubleshoot and mitigate contamination issues inside the data center.

Table 2. Classification of reactive environments

CCCs are typically used for an initial survey of ambient (outdoor) air quality and the data center environment and may be used on a continuing basis to provide historical data. This is especially important where equipment warranties specify establishing and maintaining an ISA Class G1 environment. Seasonality is a major issue, and outdoor air should be assessed at different times during the year.

Real-time monitoring may also be used but should be limited to the data center environment. Where corrosion problems have been identified, ACMs placed in a number of locations can help determine if contamination is widespread or limited to a specific area. Once a baseline has been established, some of the monitors could be redeployed around the problem area(s) to gauge the effectiveness of contamination control strategies. Once the data center environment is under control and meets the conditions set forth in the manufacturers’ warranties, one can determine the best permanent ACM locations for specific needs.

There is general confidence that corrosion monitoring can be used to identify contaminant types, e.g., active sulfur, sulfur oxides, and inorganic chloride. These determinations can then be checked against independent sources of environmental data (air pollution indices, satellite data, etc.) to confirm the results obtained from corrosion monitoring.

Contamination Control Case Studies
The following sections present the results of environmental assessments, the design and application of chemical filtration systems, and the ongoing monitoring of mission critical areas at a number of data centers around the world where corrosion-related electronic equipment failures were reported.

Case Study 1. ISP Data Center
A representative of an internet service provider (ISP) reported a number of equipment failures in one of its data centers. The primary IT equipment vendor determined that the majority of failures were due to sulfur creep corrosion, most likely caused by high levels of motor vehicle pollution.

Reactivity monitoring performed inside the data center and the adjoining UPS battery room according to ISA Standard 71.04 indicated a Class GX – Severe classification for both rooms, with high levels of sulfur contamination (copper sulfide, Cu2S) in both rooms and extremely high levels of chlorine contamination (silver chloride, AgCl) in the battery room. These results are summarized in Table 3.

Table 3. CCC monitoring results – baseline data

Based on these results and advice from the IT equipment vendor, the ISP facility manager added chemical filtration inside the data center and the battery room. Based on the size of the two spaces and the amount of air that needed to be cleaned, the data center required nine stand-alone self-contained chemical filtration systems and the battery room required two systems. These systems were optimized for the control of the acidic corrosive contaminants identified by the CCC monitoring.

Table 4. CCC monitoring results – with chemical filtration systems installed

Within a few days after the chemical filtration systems were installed and operational, dramatic decreases in the copper and silver corrosion rates were observed in both rooms with each now showing severity levels of ISA Class G1 (See Table 4). Further, there was no evidence of either sulfur or chlorine contamination inside either room.

Case Study 2. Internet Data Center
A small data center for a global consulting and IT outsourcing firm was experiencing repeated failures of a single component. The manufacturer performed monitoring with CCCs and found that the continuing failures were due to high sulfur content in the ambient air. The firm installed a single air cleaning system to clean and recirculate this air within the data center and an ACM to gauge the effectiveness of the chemical filtration system.

Figure 1. Reactivity monitoring data before and after chemical filtration was installed.

Data from the ACM were continuously collected the week before the air cleaning unit was installed and the week after chemical filtration was in place. Examination of the copper data indicated a Class G1 environment throughout the monitoring period, whereas the silver corrosion rate dropped from ISA Class G2 to ISA Class G1 (See Figure 1). No further equipment failures have been reported.
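
For readers who want to see how ACM output maps onto the ISA classes, the following Python sketch converts cumulative film-thickness readings into an equivalent monthly corrosion rate by taking the slope over the sampled interval. The sample values and the 730-hour average month are illustrative assumptions, not data from this case study.

  HOURS_PER_MONTH = 730.0  # average month, assumed for the conversion

  def rate_angstrom_per_month(samples):
      # samples: list of (hours_elapsed, cumulative_film_thickness_angstrom) readings
      (t0, d0), (t1, d1) = samples[0], samples[-1]
      return (d1 - d0) / (t1 - t0) * HOURS_PER_MONTH

  # Hypothetical silver readings over one week, before and after chemical filtration
  before = [(0, 0.0), (168, 120.0)]   # ~520 Å/month -> ISA Class G2
  after = [(0, 0.0), (168, 40.0)]     # ~175 Å/month -> ISA Class G1
  print(rate_angstrom_per_month(before), rate_angstrom_per_month(after))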

Case Study 3. Auto Manufacturer
An automobile company was planning to expand its manufacturing capacity by building a second production facility. Ambient air quality near its location was extremely poor due to high levels of motor vehicle pollution as well as significant industrial activity in the region. A large number of IT equipment failures had been experienced in the original production facility’s IT center, which resulted in the addition of chemical filtration systems and the use of several ACMs.

During the planning stage for this second production facility, the manufacturer decided to provide for chemical filtration and air monitoring in the design of the IT center to prevent corrosion problems inside the equipment rooms (See Figure 2).

The auto manufacturer started reactivity monitoring with ACMs at the time the IT equipment became operational, and the results indicated a significant reduction in the total amount of contamination in the IT center. However, the silver corrosion rates fluctuated around the Class G1/G2 breakpoint (See Figure 3).

After re-examining the layout of the IT center, the manufacturer determined that significant amounts of untreated outdoor air were being introduced into the protected spaces through the main entrance.

Figure 2. Automotive manufacturer’s IT center layout showing location of chemical filter systems.

Although the silver corrosion rate was near the specified ISA Class G1, the owner and the IT equipment vendors agreed that steps should be taken to eliminate untreated outdoor air from the IT center. Modifications to the facility were proposed, and work is ongoing.

Case Study 4. Bank
The other three case studies describe the use of chemical filtration inside the protected spaces. In this example, the owner wanted to clean all the outdoor air that was being used to pressurize the data center space. This bank building was located in a major metropolitan area with high levels of sulfur contamination from coal-burning power plants as well as motor vehicle traffic.

Because the air quality was so poor relative to the IT equipment manufacturers’ air quality guidelines, air cleaning would be accomplished by:

1. Installing a chemical filtration system at the outdoor air intake of the existing HVAC system

2. Adding another chemical filtration system to deliver additional clean pressurization air under the raised floor in the data center to supplement the outdoor ventilation air

Figure 3. Reactivity monitoring data for an auto manufacturer’s IT center

3. Installing three standalone chemical filtration systems inside the data center to provide for complete distribution of the clean air.

Air monitoring after chemical filtration had been installed showed a tremendous improvement in the data center air quality. Average results from 15 monitoring locations indicated an ISA Class G1 environment for both copper and silver, with no evidence of sulfur contamination. These results are summarized in Table 5.

A few locations showed ISA Class G2 severity levels, with no specific external or internal sources of contamination identified. Team members suggested moving the air cleaning units to provide better air distribution in these locations in an effort to maintain a Class G1 environment throughout the data center. Monitoring with CCCs is ongoing.

Case Study 5. Telecommunications Company
This company experienced continuing problems with corrosion-related failures of switchgear and network cards in one of its mobile switching centers (MSC). A chemical filtration system reduced the number of failures from an average of 36 per month to ~20 per month after five months of operation. Although this represented a significant improvement, the owner wanted to improve this performance even further.

The chemical filtration system had been designed to deliver the specified amount of cleaned air to the MSC when running at 70% capacity. The system was adjusted to increase the amount of air delivered to the MSC. The system is now operating at 90% capacity and since then the number of failures has dropped below 10 per month (See Figure 4).

The IT manager is currently considering whether to add additional chemical filtration systems to see if corrosion-related failures can be eliminated altogether.

Figure 4. Reduction in MGW/MSS cards with corrosion failures.

Table 5. Reactivity monitoring results after addition of chemical filtration systems.

Case Study 6. Telecommunications Company 2
This company experienced problems with IT equipment failures and was not willing to provide space inside the data center for chemical filtration systems. It was also discovered that the existing HVAC system supplying ventilation air could not be modified to accept chemical filters. Therefore, the only option available was to replace the existing particulate filters installed in the CRAC units located inside the data center with combination particulate/chemical filters.

Before committing to upgrading all of the CRAC units serving the data center, the owner wanted to determine whether this was a practical and effective solution. As a test, the filters in one CRAC unit were replaced with these combination filters and reactivity monitoring was performed at the inlet of the CRAC unit and in the Cold Aisle supplied by this unit.

Reactivity monitoring performed with an ACM for several days prior to the installation of the new filters indicated an ISA Class G1 copper severity rate but a mid-to-high Class G2 rate for silver. After installation of the filters, the silver reactivity rate immediately fell almost 90% (See Figure 5). This result convinced the customer to put these combination chemical/particulate filters in all of the remaining CRAC units. The data center has not reported additional equipment failures.

Conclusions
Data centers located in areas with high ambient air pollution, whether from stationary or mobile sources, can experience corrosion-related hardware failures due to the changes in electronic equipment first mandated by the implementation of the European Union (EU) RoHS regulations in 2002 and in more than a dozen countries since that time.

Figure 5. Performance of chemical filters in a CRAC unit.

These regulations, along with the continuing reductions in circuit board feature sizes and the miniaturization of components necessary to improve hardware performance, make today’s electronic hardware more prone to attack by airborne contaminants. Manufacturers have to maintain the reliability of their equipment; therefore, the need to control airborne contaminants and to specify acceptable limits in data centers is now considered critical to the continued reliable operation of datacom and IT equipment.

Increases in corrosion-related electronic hardware failures have led to new IT and datacom equipment warranties that require environmental corrosion (reactivity) monitoring and control of airborne contamination where necessary.  These additional measures are especially important for urban areas with elevated pollution levels and for locations near industrial facilities, seashores, and other sources that could produce corrosive airborne contaminants.

ISA Standard 71.04 has been updated to now include silver corrosion monitoring as a requirement in determining environmental severity levels. Many manufacturers of datacom and IT equipment currently reference this standard in their site planning / preparation guidelines as well as their terms and conditions for warranty compliance. The addition of silver corrosion rates as a required metric serves to bridge the gap between ambient environmental conditions and the reliability of RoHS-compliant (lead-free) electronic equipment.

Ongoing research will serve to further refine Standard 71.04 both quantitatively and qualitatively. This along with continuing advancements in the monitoring and control of corrosive contaminants will help to prevent costly and potentially catastrophic failure of critical electronic equipment.

References
1. ASHRAE. 2011. 2011 Gaseous and Particulate Contamination Guidelines for Data Centers. Atlanta: American Society of Heating, Refrigerating, and Air-Conditioning Engineers, Inc.

2. ISA. 2013. ANSI/ISA 71.04-2013: Environmental Conditions for Process Measurement and Control Systems: Airborne Contaminants. Research Triangle Park, NC: International Society of Automation.

3. iNEMI. 2012. iNEMI Position Statement on the Limits of Temperature, Humidity and Gaseous Contamination in Data Centers and Telecommunication Rooms to Avoid Creep Corrosion on Printed Circuit Boards.

4. China National Standard GB 50174-2008: Code for Design of Electronic Information System.

5. Muller, C.O. and Yu, H., “Controlling Gaseous and Particulate Contamination in Data Centers,” Proceedings of SMTA China South Technical Conference, Shenzhen, China, 2012.

6. Airborne Contaminant Level Update. July, 2010. © 2013 Dell Inc.

7. Power7 information: Environmental design criteria. July, 2012. © 2013 IBM Corporation.

8. Muller, C. and Yu, H., 2013. “Air Quality Monitoring for Mission Critical / Data Center Environments,” CEEDI White Paper on Data Center Monitoring Systems, China Electronics Engineering Design Institute, Beijing, China.

Chris Muller is the technical director and Global Mission Critical Technology manager at Purafil, Inc. (Doraville, GA), and is responsible for Purafil’s Data Center Business Development program as well as for technical support services and various research and development functions. Prior to joining Purafil, he worked in the chemical process and pharmaceutical manufacturing industries in plant management and quality assurance/quality control.

He has written and spoken extensively on the subject of environmental air quality and the application and use of gas-phase air filtration, corrosion control and monitoring, electronic equipment reliability, and RoHS and counts over 120 articles and peer-reviewed papers, more than 100 seminars, and 7 handbooks to his credit. Mr. Muller has edited chapters in two handbooks on the application and use of gas-phase air filtration, wrote the chapter on gas-phase air filtration in the NAFA Air Filtration Handbook and the chapter on airborne molecular contamination in the Semiconductor Manufacturing Handbook published by McGraw-Hill.
Mr. Muller has consulted on the development of environmental air quality guidelines in mission critical applications for companies such as Dell, Google, HP, Huawei, IBM, and Morgan Stanley. He has worked with the China Electronics Engineering Design Institute (CEEDI) to update China National Standard GB 50174-2008: Code for Design of Electronic Information System.

Dr. Prabjit (PJ) Singh is a senior technical staff member in the Materials and Processes Department at IBM Poughkeepsie, NY, with 34 years of experience in the metallurgical engineering aspects of mainframe computer power, packaging, cooling, and reliability. He has more than 36 patents and is an IBM Master Inventor.
Dr. Singh received a B. Tech. from the Indian Institute of Technology and an MS and Ph.D. from the Stevens Institute of Technology, all in the field of metallurgical engineering. Recently, he received an MS in microelectronic manufacturing from Rensselaer Polytechnic Institute and an MS in electrical engineering from the National Technological University. He is an adjunct professor of electrical engineering at the State University of New York at New Paltz.

 

Henry White joined HP in 1980, where he has mainly served in HP’s Focus on Customer Site Environment for product reliability, safety, and POR-like efficiencies. His experience in this area includes site-caused product corrosion, power quality, lightning protection, high-density data center cooling, electrical codes, safety, and regulatory engineering.
Within that framework, he has visited many customer sites worldwide to measure and gauge problems. He used developing techniques to understand zinc whiskers and their effects. He has also developed on-site measurement techniques in China for site-specific corrosion.

Mr. White has conducted and delivered training for Worldwide HP field teams to understand site-specific topics and better serve HP customers. In parallel with his duties for site environment characterization, he utilizes HP’s ISO 9001 Quality Management processes for new large server introductions and HP Cloud programs to continually improve its product offerings.

 

Paul Finch is a technical director, EMEA Design & Construction, at Digital Realty. He is the leading expert for mechanical engineering and energy engineering, with both strategic and project responsibilities. Since joining the company in 2010, he has published a collection of European design engineering guides and led many successful major developments across the London, Amsterdam, Paris, Dublin, and Singapore markets. Mr. Finch has more than 25 years of progressive experience in engineering, construction, and property consulting, focused on technical real estate and high dependency mission critical environments in the banking and finance, technology, and telecommunications sectors.

TELUS SIDCs Provide Support to Widespread Canadian Network

SIDCs showcase best-in-class innovation and efficiency

By Pete Hegarty

TELUS built two Super Internet Data Center (SIDC) facilities as part of an initiative to help its customers reap the benefits of flexible, reliable, and secure IT and communication solutions. These SIDCs, located in Rimouski, Quebec, and Kamloops, British Columbia, contain the hardware required to support the operations and diverse database applications of TELUS’ business customers, while simultaneously housing and supporting the company’s telecommunications network facilities and services (see Figures 1 and 2). Both facilities required sophisticated and redundant power, cooling, and security systems to meet the organization’s environmental and sustainability goals as well as effectively support and protect customers’ valuable data. A cross-functional project team, led by Lloyd Switzer, senior vice-president of Network Transformation, implemented a cutting-edge design solution that makes both SIDCs among the greenest and most energy-efficient data centers in North America.

Figure 1. Exterior photo of the Kamloops SIDC

TELUS achieved its energy-efficiency goals through a state-of-the-art mechanical design that significantly improves overall operating efficiencies. The modular design supports Concurrent Maintenance, a contractually guaranteed PUE of 1.15, and the ability to add subsequent phases with no disruption to existing operations.

Both SIDCs allow TELUS to provide its customers with next-generation cloud computing and unified communication solutions. To achieve this, teams across the company came together with the industry’s best external partners to collaborate and drive innovation into every aspect of the architecture, design, build, and operations.

Best-in-Class Facilities
The first phase of a seven-phase SIDC plan was completed and commissioned in the fall of 2012. The average cabinet density is 14 kilowatts (kW) per cabinet with the ability to support up to 20 kW per rack. The multi-phased, inherently modular approach was programmed so that subsequent phases will be constructed and added with no disruption to existing operations. Concurrent Maintainability of all major electrical and mechanical systems is achieved through a combination of adequate component redundancy and appropriate system isolation.

Both SIDCs have been constructed to LEED environmental standards and are 80% more efficient than traditional data center facilities.

Other expected results of the project include:

Figure 2. Exterior photo of Rimouski SIDC

• Water savings: 17,000,000 gallons per year

• Energy savings: 10,643,000 kilowatt-hours (kWh) per year

• Carbon savings: 329 tons per year

The SIDCs are exceptional in many other ways. They both:

• Use technology that minimizes water consumption, resulting in a Water Usage Effectiveness (WUE) of 0.23 liters (l)/kWh, approximately one-quarter that of a traditional data center (see the sketch following this list).

• Provide protected power by diesel rotary UPS (DRUPS), which eliminates the need for massive numbers of lead-acid batteries that are hazardous elements requiring regular replacement and safe disposal. The use of DRUPS also avoids the need for large rooms to house UPS and their batteries.

• Use a modular concept that enables rapid expansion to leverage the latest IT, environmental, power, and cooling breakthroughs providing efficiency gains and flexibility to meet customers’ growing needs (see Figures 3 and 4).
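
The WUE figure quoted in the first bullet can be turned into an annual water estimate once the IT energy is known. The Python sketch below uses the common definition of WUE (annual site water use divided by IT energy); the IT load and hours of operation are assumptions for illustration only, not TELUS figures.

  LITERS_PER_GALLON = 3.785

  def annual_water_liters(wue_l_per_kwh, it_load_kw, hours=8760):
      # WUE (L/kWh) multiplied by annual IT energy (kWh) gives annual site water use
      return wue_l_per_kwh * it_load_kw * hours

  # Hypothetical 5 MW of IT load running year-round at a WUE of 0.23 L/kWh
  water_l = annual_water_liters(0.23, 5000)
  print(round(water_l / 1e6, 1), "million liters, or",
        round(water_l / LITERS_PER_GALLON / 1e6, 1), "million gallons")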


TELUS’ data center in Rimouski is strategically located in a province where more than 94% of the generation plants are hydroelectric; the facility operates in free-cooling mode year-round, with only 40 hours of mechanically assisted cooling required annually. The facility in Kamloops enjoys the same low humidity and benefits from the same efficiencies.

Both SIDCs received Tier III Certification of Design Documents and Tier III Certification of Constructed Facility by the Uptime Institute. The city of Rimouski awarded TELUS the Prix rimouskois du mérite architectural (Rimouski Architectural Award) recognizing the data center for exceptional design and construction, including its environmental integration.

 

 

Figure 3. Both TELUS SIDCs are based on modular designs

Green Is Good
TELUS teams challenged themselves to develop innovative solutions and new processes that would change the game for TELUS and its customers. Designed around an advanced foundation of efficiency, scalability, reliability, and security, both SIDCs are technologically exceptional in many ways.

Natural cooling: With a maximum IT demand of 16.2 megawatts, the facilities will require 168 gigawatt-hours per year. To reduce the environmental impact of such consumption, TELUS chose to place both facilities in areas well known for their green power generation and advantageous climates (see Table 1).
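
The 168 GWh figure can be sanity-checked from the stated 16.2 MW of IT demand. The short calculation below assumes continuous operation at full IT load and an average PUE of roughly 1.18, a value between the facilities’ quoted 1.15 full-load and 1.3 no-load figures; both assumptions are made here purely for illustration.

  it_demand_mw = 16.2
  hours_per_year = 8760
  assumed_avg_pue = 1.18   # illustrative; the actual average depends on the load profile

  it_energy_gwh = it_demand_mw * hours_per_year / 1000    # ~141.9 GWh of IT energy
  facility_energy_gwh = it_energy_gwh * assumed_avg_pue   # ~167.5 GWh, close to the quoted 168
  print(round(it_energy_gwh, 1), round(facility_energy_gwh, 1))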

When mechanical systems are operating, a Turbocor compressor (0.25 kW/ton) with magnetic bearings provides 4-5ºF ∆T to supplement cooling to the Cold Aisles. Additionally, the compressor has been re-programmed to work at even higher efficiency to operate within the condensing/evaporating temperature ranges of the glycol/refrigerant of this system.

This innovative approach was achieved through collaboration between TELUS and a partner specializing in energy management systems and has been so successful that TELUS is delivering these same dramatic cooling system improvements in some of its existing data centers.

Modular design: The modular concept enables rapid expansion while leveraging the latest IT, environmental, power and cooling breakthroughs. Additional modules can be added in as little as 16-18 weeks and are non-disruptive to existing operations. The mechanical system is completely independent, and the new module “taps” into a centralized critical power spinal system. By comparison, traditional data center capacity expansions can take up to 18 months.

Figure 4. Module in fabrication at the factory

The modular mechanical system provides a dual refrigerant and water-cooled air-conditioning system to the racks. From a mechanical perspective, each pod can provide up to 2N redundancy with the use of two independent, individually pumped refrigerant cooling coil circuits. The two cooling circuits are fully independent of one another and consist of a pump, a condenser, and a coil array (see Figure 5).

The approach uses a pumped refrigerant that utilizes latent heat to remove excess server heat. The use of micro-channelized rear-door coils substantially increases overall coil capacity and decreases the pressure drop across the coil, which improves the energy efficiency.

The individual modules are cooled by closed-circuit fluid coolers that can provide heat rejection utilizing both water and non-water (air-cooled) processes. This innovative cooling solution results in industry-leading energy efficiencies. The modules are completely built off site, then transported and re-assembled on site.

As a result, the efficiency across utilization (measured in terms of PUE) is significantly improved compared to traditional data center designs. While the PUE of a traditionally designed data center can range from as high as 3.0 to as low as 1.6 in a best-case scenario (approximately at 80% usage), both TELUS facilities have a PUE utilization curve that is almost flat, varying from 1.3 (at no load) to 1.15 (at 100% load). (See Table 2).
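
A nearly flat PUE curve can be approximated by linear interpolation between the quoted endpoints (1.3 at no load, 1.15 at full load). The Python sketch below uses that assumption to estimate facility power at intermediate utilizations; the linear shape is an illustration, not TELUS’ published curve.

  def pue_at(utilization, pue_empty=1.30, pue_full=1.15):
      # utilization: 0.0 (no load) through 1.0 (full load); linear model assumed
      return pue_empty + (pue_full - pue_empty) * utilization

  def facility_power_mw(it_capacity_mw, utilization):
      it_load_mw = it_capacity_mw * utilization
      return it_load_mw * pue_at(utilization)

  for u in (0.25, 0.50, 0.80, 1.00):
      print(u, round(pue_at(u), 3), round(facility_power_mw(16.2, u), 2))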

The Electrical System
A world-class SIDC runs on solid pillars, with one of them being the electrical system (See Figure 6). This system acts as the foundation of the data center, and consistent reliability was a key component in the planning and design of both TELUS facilities.

Table 1. The moderate climates of Kamloops and Rimouski enable the SIDCs to achieve low emissions.

The main electrical systems of both SIDCs operate in medium voltage, including the critical distribution systems. Incoming power is conditioned and protected right from the entrance point, so there are no unprotected loads in either facility.

Figure 5. TELUS cooling system employed at Kamloops and Rimouski

Figure 6. Physical separation of A and B electrical systems ensures that the system meets Uptime Institute Tier III requirements.

Behind the fully protected electrical architecture is a simple design that eliminates multiple distribution systems. In addition, the architecture ensures that the high-density environments in the IT module do not undergo temperature fluctuations—due to loss of cooling—when switching from normal to emergency power. Other important benefits include keeping short-circuit currents and arc flashes to manageable levels.

The DRUPS are located in individual enclosures outside the building, and adding units to increase capacity is a relatively easy and non-intrusive activity. Since the system is designed to be fully N+1, any of its machines can be removed from service without causing any disruption or limitation of capacity on the building operations. Each of these enclosures sits on its own fuel belly tank that stores 72 hours of fuel at full capacity of the diesel engines.

The output of the DRUPS is split and collected into a complete 2N distribution system—A and B. The main components of this A and B system are located in different rooms, keeping fire separation between them. The entire distribution system was designed with the concept of Concurrent Maintainability. This means that the A or B side of the distribution can be safely de-energized without causing any disruption to IT or mechanical loads.

The electrical system uses the same concept as the architectural distribution. Each module connects to the main distribution, and the introduction of new modules is performed without affecting the operation of the existing ones.

In order to do this, each module includes its own set of critical and mechanical substations (A and B in both cases) that deliver low voltage to the loads (RPPs, mechanical CDUs, pod fans) in the form of bus ducts that run along both service corridors. Each rack receives two sets of electrical feeders from the A and B systems, where power is distributed to the servers through intelligent and metered PDUs.

 

Figure 7. The mechanical cooling system avoids any single point of failure.

The Mechanical System

If the electrical system is one of the pillars of the SIDCs, the mechanical system is another (see Figure 7). In today’s IT world, server density keeps growing, and because servers are installed in relatively tight spaces, maintaining a controlled environment, in which temperature does not undergo large or fast fluctuations and air flows clean and unrestricted, is key to the reliability of those servers. As indicated above, the entire cooling system is protected with UPS power and provides the first line of defense against those fluctuations.

The SIDC modules are cooled with a sophisticated yet architecturally simple mechanical system that not only provides an adequate environment for the IT equipment but is also extremely efficient.

Many technologies today are achieving extremely low PUEs—especially compared to the old days of IDCs with PUEs of 2.0 or worse—but they do so at the expense of significant water consumption. The TELUS SIDC cooling technology, the first of its kind in the world, not only runs at low PUE levels but also is extremely frugal in its use of water. For ease of illustration, the system can be divided into three main components:

1. The elements that capture heat immediately after it is produced, which maximizes efficiency because the hot air does not have to travel far. The facilities achieve this with edge devices—rear door coils or coils situated at the top of the Hot Aisle. Multiple fans push the cooled air into the Cold Aisle to re-enter the IT gear through the front of the racks.

With this arrangement of fully contained Cold Aisles/Hot Aisles, air recirculates with only a minimum contribution of outside air provided by a make-up system. Among many advantages of the system, the need for small quantities of outside air makes it appropriate for installation in virtually any environment as the system is not prone to be affected by pollution, smoke, or any other agent that should not enter the IT space.

2. The CDUs are extremely sophisticated heat exchangers, which ensure the refrigerant lines are always kept within temperature limits. The heat interchange is done at maximum efficiency based on the temperature of the coolant in the return lines. If the temperature of this return is too high and doesn’t allow the refrigerant loop to remain within range, the CDU also has the intelligence and the capability to push this temperature down with the assistance of high-efficiency compressors.

3. The fluid cooler yard, which is situated outside the facilities and is responsible for the final rejection of the heat to the atmosphere. The fluid coolers run in free (and dry) cooling mode year-round except when the wet-bulb temperature exceeds a certain threshold, and then water is sprayed to assist in the rejection.
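
The operating logic of the third element (dry cooling by default, with water spray only above a wet-bulb threshold) can be captured in a few lines. The Python sketch below is conceptual; the threshold and hysteresis values are placeholders, since the article does not state the actual setpoint.

  def fluid_cooler_mode(wet_bulb_c, spray_on_above_c=18.0, hysteresis_c=1.0, currently_spraying=False):
      # Run dry (free cooling) whenever the ambient wet-bulb allows; spray water only above the threshold.
      # A small hysteresis band keeps the spray from cycling on and off around the setpoint.
      if currently_spraying:
          return "wet-assist" if wet_bulb_c > spray_on_above_c - hysteresis_c else "dry"
      return "wet-assist" if wet_bulb_c > spray_on_above_c else "dry"

  print(fluid_cooler_mode(12.0))                            # dry (free cooling)
  print(fluid_cooler_mode(19.5))                            # wet-assist
  print(fluid_cooler_mode(17.5, currently_spraying=True))   # stays wet inside the hysteresis band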

Table 2. The TELUS facilities maintain a relatively flat PUE across utilization

All of the above mentioned elements are redundant, and the loss of any element does not result in any disruption of the cooling they provide. The fluid coolers have been designed in an N+1 configuration, while the CDUs and edge devices are fully 2N redundant, as are both the refrigerant and coolant loops.

Figure 8. Inside shot of a Cold Aisle at the Kamloops SIDC

Figure 9. Inside shot of a Cold Aisle at the Rimouski SIDC

The TELUS Team

The TELUS project team challenged traditional approaches to data center design, construction, and operations. In doing so, it created opportunities to transform its approach to future data center builds and enhance existing facilities. The current SIDCs have helped TELUS develop a data center strategy that will enable it to:

• Achieve low total cost of ownership (TCO)

• Support long-term customer growth

• Maximize the advantages of technology change: density and efficiency

• Align with environmental sustainability and climate change strategy

• Enable end-to-end service capabilities: cloud/network/device

• Provide robust and integrated physical and logical security

More than 50 team members from across the organization participated in the SIDC project, which involved a multiple-year process of collaboration across many TELUS departments such as Network Transformation, Network Standards, IT, Operations, and Real Estate.

Figure 10. Cold Aisle in fabrication at the factory

The completion of the SIDCs represented an ‘above and beyond’ achievement for the TELUS organization. The magnitude of the undertaking was significantly larger than that of any past infrastructure work. In addition, the introduction of a completely new infrastructure model and technology, in many ways revolutionary compared to the company’s legacy data centers, added pressure to ensure the team was choosing the right model and completing the necessary due diligence to confirm it was a good fit for its business.

Many of these industry-leading concepts are now being deployed across TELUS’ existing data centers, such as the use of ultra-efficient cooling systems. In addition, as a result of the knowledge and best practices created through this project, TELUS has established a center of excellence across IT and data infrastructure that empowers teams to continue working together to drive innovation across their solutions, provide stronger governance, and create shared accountability for driving its data center strategy forward.

Overcoming Challenges
A clear objective, combined with the magnitude and high visibility of the undertaking, enabled the multi-disciplinary team to achieve an exceptional goal. One of the most important lessons for TELUS is that when teams are assembled for a project, no matter how large it is, they are capable of succeeding if the scope and timelines are clear.

During construction of the Rimouski SIDC, the team experienced an issue with server racks, specifically with the power bars on the back. Although well designed, they ended up being too large, preventing removal of the power supplies from the backs of the servers. Even though the proper due diligence had been completed, it wasn’t until the problem arose physically that the team was able to identify it. As a result, TELUS modified the design for the Kamloops SIDC and built mock-ups or prototypes to ensure this would not happen again.

Partners
TELUS worked with external partners on some of the project’s key activities, including Skanska (build partner), Inertech (module design and manufacturer), Cosentini (engineering firm), and Callison (architects). These partners were integral in helping TELUS achieve a scalable and sustainable design for its data centers.

Future Applications
Once TELUS realized the operational energy savings achieved by implementing the modular mechanical system, it quickly explored the applicability within its existing legacy data centers. In fact, TELUS has recently completed the conceptual design of a retrofit within one of its existing facilities, and an additional feasibility study is currently underway for another of its existing data centers. The team has learned that it can utilize the existing central plant mechanical system with a customized modular rack and cooling system similar to the one installed in TELUS’ new SIDCs to achieve increased capacity (density) and much improved energy efficiency. Of significant importance, all of this can be accomplished without eliminating the legacy mechanical system.

TELUS’ intention is to start applying these highly efficient solutions to some of its large network sites as well. Some of the traditional legacy equipment is rapidly moving toward much more server-like gear with a small footprint and large power consumption, which translates into more heat rejection. Large telecom rooms can be relatively easily retrofitted to host an almost completely contained “pod” that will be used for the deployment of those new technologies.

An Ongoing Commitment
When TELUS first started planning the SIDCs in Rimouski and Kamloops, it focused on four main pillars: efficiency, reliability, scalability, and security. And all four pillars were integral to achieving overall sustainability. These pillars are the foundation of the SIDCs and ensure they’re able to meet customer needs now and in the future. With that commitment in place, TELUS’ forward-thinking strategy and innovative approach ensure it will be considered a front-runner in data center design for years to come.

 

Pete Hegarty

Pete Hegarty has acquired a wealth of industry knowledge and experience in his more than 10 years at TELUS Communications, a leading national communications company, and 25 years in the telecommunications industry. His journey with TELUS has taken him through various positions in Technology Strategy, Network Planning, and Network Transformation. In his current role as Director of TELUS’ Data Centre Strategy, Mr. Hegarty leads an integrated team of talented and highly motivated individuals who are driving the transformation of TELUS’ data center infrastructure to provide the most efficient, reliable, and secure information and communications technology (ICT) solutions in Canada.

Data Center Facility Owners See Modules as Efficient Way to Deploy Capital

Compass sets the course with new modular data center approach.

The term modular has been used to describe a variety of approaches to data center design. Historically, the first commercially available modular design was a 2007 Sun Microsystems container-based product called the Black Box. Today, the term describes products that range from shipping containers and simple repeated designs to fully manufactured IT spaces and MEP systems built in factories and shipped to various sites.

Many modular data center providers have written white papers that tout the benefits of their proprietary designs, but there are also some informative reports on the ins and outs of modular design. John Stanley of 451 Research presented the results of his survey work at the Uptime Institute Symposium in May 2012. He surveyed 35 companies that use and provide various types of modular solutions for data centers. His research pointed out that 65% of those surveyed deployed capacity in ‘chunky’ increments, specifically highlighting the need for small chunks of capacity per module or deployment to match growth of their data centers.

Compass has based its business strategy on this industry driver and five main features of modular design:

  • High quality
  • Low construction costs
  • Speed to deploy
  • Integrated supply chain
  • Low operating costs

In the Green Grid paper, “Deploying and Using Containerized/Modular Data Center Facilities,” published in November 2011, the Green Grid demonstrated that modular data centers can follow capacity requirements more closely, freeing capital and keeping MEP systems more fully loaded than large-scale deployments. The authors of the paper showed that end users could employ their utility systems and floor spaces more fully, without undue concern for system exhaustion or space starvation. The authors discuss how small increments of capacity can follow compute demands more closely, with the caveat that the facility must also have a faster implementation cycle in order to match the changing IT requirements.

The data center industry is beginning to realize benefits first seen in the early industrial revolution. Standardized modular power center designs provide some of the same benefits to design and construction personnel. Instead of hand-building custom electrical systems for each data center, the modular approach allows for greater deployment speed, improved quality, and lower costs, all achieved by using factory-based labor. The use of modules also relieves labor stacking on the job site, while reducing the overall cost of the work by a significant amount.

In his 1936 paper, Factors Affecting the Cost of Airplanes, T.P. Wright first quantified the cost savings that could be attained using factory labor. Wright showed that the direct labor cost of assembling an airframe decreased by roughly 20% for each doubling of the production numbers. In other words, if the labor cost of building one airframe was $1,000,000, the labor cost of building two of the same model might be $800,000 each. Doubling again would mean the cost of four of the same airframes would be $640,000 each, or 36% lower than the first unique model.
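
Wright’s relationship is straightforward to reproduce. The Python sketch below implements the 80% learning curve described above: the unit cost falls by 20% each time cumulative production doubles, so the cost of the nth unit is the cost of the first unit multiplied by n raised to log2(0.8).

  import math

  def unit_cost(first_unit_cost, unit_number, learning_rate=0.80):
      # Wright's law: cost of unit n = cost of unit 1 * n ** log2(learning_rate)
      return first_unit_cost * unit_number ** math.log2(learning_rate)

  for n in (1, 2, 4, 8):
      print(n, round(unit_cost(1_000_000, n)))   # 1,000,000 / 800,000 / 640,000 / 512,000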

Compass Data Centers intended to take advantage of this principle as it developed its patent-pending solution. In doing so, Compass has found that modular designs enable it to:

  • Benefit from learning-curve dynamics to improve quality and speed of implementation with each iteration
  • Lower operating costs from design efficiencies that tune initial innovative ideas
  • Provide predictable reliability attributes and repeatable, low-cost operations
  • Deliver a standard product

As a result, Compass has moved from building a prototype on every data center project to productizing the data center program and build.

Although the industry has not settled on a clear definition of the ‘modular data center,’ it is attempting to sidestep this ambiguity by proclaiming the coming of ‘Version 2,’ which seems to include prefabrication as a fundamental element. Previously, modular construction was associated with ISO containers and packaged chillers; now, modularity is being embraced as synonymous with sound design and sound capital usage.

The Compass Deployment
Every data center design begins with the establishment of availability and power requirements – reliability and power density are the cornerstones of any data center specification. The business needs of the enterprise and the IT applications to be contained within the data center should drive these decisions.

Virtually every initial design addresses the project power requirements and references the Uptime Institute’s well-known Tier specifications to define reliability requirements. Uptime’s Tier Standard helps define the equipment and systems required to support the reliability and power demands of the business’s IT model. If the needs of a business are truly defined by an Uptime Tier III specification, then its data center design should be certifiable as a Tier III design.

Compass’ Truly Modular Architecture design provides for standard features such as RHI Modular Power Solutions’ Modular Power Center units (also built off site), power modules that can be configured to provide a 2N power infrastructure, and N+1 mechanical systems. Other Compass architecture features include:

  • 10,000 square feet of column-free raised floor space
  • Hardened structure (Seismic 1.5 and Category 4 Hurricane Winds)
  • The capability to support up to 20-kW racks
  • Convenient operations, staging and storage spaces
  • Uptime Tier III-certifiable design
  • Uptime Tier III-certifiable facility

The standardized design of the Compass Modular Architecture provides assurance that the facility will be Tier III design certified. Internal auditors and external customers can not only be assured of the design certification, but also have the option to certify the facility itself. Importantly, the ability to productize a unique modular data center solution into a repeatable and well-defined process was a top priority. Modularizing data center components permits control over three primary variables:

  • Cost
  • Quality
  • Schedule

Compass shares the view that these three variables are the only three points that end users actually care about. Compass also believes that modularity:

  • Is cost effective
  • Is not always a prefabricated solution
  • Increases quality
  • Supports continuous improvement
  • May bring jurisdictional issues

Some jurisdictions have misconceptions about modular or containerized solutions. They envision standard ISO shipping containers, assembled on-site, as the basis of the construction process.

Meeting the local authority having jurisdiction (AHJ) early in the project’s developmental cycle is an effective way to establish trust and dispel misconceptions. Providing an extensive amount of specific information on the modular components of the proposed data centers is a good way to start.

Next Phase
As part of its schematic designs, Compass decided that its basic architectural scheme would be a shell that would contain a 10,000-square-foot (ft2) column-free, raised-floor hall and all the traditional support spaces. Compass made this decision based on the collective experience of its internal and consultant teams, as well as market analysis.

Compass decided that the systems, called Modular Power Centers (MPCs), would be packaged and delivered as standardized systems. This work was provided by Modular Power Solutions. The MPCs could be installed either internally or externally to the data center’s brick-and-mortar shell structure. Crosby and the MPC executives have filed for patent protection of the intellectual property.

Compass chose not to employ a central mechanical plant, but rather to deploy an N+1 redundant rooftop mechanical system to support the facility’s HVAC requirements. So, the schematic design established the following parameters:

Data Hall:

  • 10,000 ft2 data hall
  • Column free
  • 36-inch raised-floor white space, supporting up to 500 racks
  • 120 watts per square foot, with the ability to cool heterogeneous loads and accommodate racks up to 30 kW
  • Hardened for wind and seismic conditions
  • N+1 rooftop mechanical units
  • Tier III-certifiable

Support Spaces:

  • Office area, breakroom, loading dock, security station, engineering office, storage room, staging area, PPOP Room, fire riser room, restroom and janitorial.

Modular Power Centers:

  • 1.2-MW Modular Power Centers, which could be either internally or externally installed at the facility.

The basis-of-design (BOD) goals are then used for the next step in the design process, the creation of a single-line diagram. The development of the schematic design (SD) phase single-line diagram brings together the conceptual components and requirements of the BOD (see Figure 1).

Figure 1. Schematic design single-line diagram

The initial single-line is the tool that Compass used to determine the structure and components required for the electrical distribution system. The MPCs must be sized to be able to support the demands of the 1.2-MW data hall, the mechanical systems and the support spaces. The single-line diagram is also used to validate the Tier III redundancy and concurrent maintainability requirements. The redundancy of critical components and systems and the requirement for de-energized maintenance must be worked through at this stage of the design process.

The major electrical distribution system components were identified as:

  • One 2500-kilovolt-ampere (kVA) utility transformer
  • One 2500-kVA/2000-kW standby-rated generator
  • One redundant 2500-kVA/2000-kW standby generator
  • One 480-volt (V), 3000-amp (A) main distribution switchboard
  • Two 1200-kW high-efficiency uninterruptible power supplies (UPS)
  • Two 1600-A static transfer switch (STS) bypasses
  • Two 1600-A UPS output distribution switchboards
  • Two 1600-A maintenance bypasses
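
As a rough illustration of how these component ratings relate to the block load the single-line must support, a short sizing sketch follows. The 1.2-MW critical load and the equipment ratings come from the article; the mechanical and support-space loads are assumed values for the example only, not Compass design figures.

    # Rough block-load check behind the single-line diagram (illustrative only).
    critical_it_kw = 1200        # UPS-backed data hall load (from the article)
    mechanical_kw = 600          # assumed rooftop mechanical load on a design day
    support_kw = 100             # assumed offices, lighting and other house loads

    total_demand_kw = critical_it_kw + mechanical_kw + support_kw
    generator_kw = 2000          # 2500-kVA/2000-kW standby generator
    transformer_kva = 2500       # utility transformer

    print(f"Estimated block demand: {total_demand_kw} kW")           # 1,900 kW
    print(f"Generator margin: {generator_kw - total_demand_kw} kW")  # 100 kW
    # At an assumed 0.9 power factor, the 2500-kVA transformer supports ~2,250 kW,
    # so a single utility transformer or a single generator can carry the block.
    print(f"Transformer capacity at PF 0.9: {transformer_kva * 0.9:.0f} kW")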

The Compass one-line, based on the Tier III requirements, provided a look at the electrical system’s redundancy (see Figure 2).

Figure 2. Electrical system redundancy

Then Compass added two isolation or tie breakers to the main switchboard in order to meet the Tier III concurrently maintainable requirement. These breakers were required to allow either side of the switchboard to be shut down and completely de-energized for service. The tie breakers can also be used to enhance fault tolerance. If they are forced to trip before the main circuit breakers on either side, the non-faulted side of the system can remain operational.

The decision was made to look to the local utility provider to supply an N redundant 2500-kVA utility power transformer. The 2500-kVA/2000-kW standby generator in the base system is N redundant. A second N+1 or 2N redundant generator added to the lineup meets the Tier III redundancy requirements. The second generator will be connected to the opposite side of the switchboard lineup. Compass preferred generator redundancy over utility redundancy, finding generators to be more reliable than the utility, based upon its experience that blackouts tend to be regional in nature (as seen with Hurricane Sandy).

The remaining equipment would be installed in a Power Center module. That equipment was identified as:

  • Two 480-V, 3000-A main distribution switchboards
  • Two 1200-kW high-efficiency UPS units
  • Two 1600-A STS bypasses
  • Two 1600-A UPS output distribution switchboards
  • Two 1600-A maintenance bypasses

Based on the equipment defined in the single-line, Compass determined that at least two MPCs would be required.

The tie breakers provide a natural point to split the system. This split became the basis of how the system is packaged. The size and weight of the MPCs are constrained by the need to ship them over highways from the assembly facility to the job site. Typically, those shipping packages are not to exceed 50 feet by 12 feet and 100,000 pounds.

Each MPC switchboard lineup features a 3000-A and a 1600-A UL 891-listed switchboard. Each switchboard is equipped with two 3000-A, four 400-A, and four 450-A UL 489-listed circuit breakers. All circuit breakers larger than 200-A are 100% duty rated. All circuit breakers feature zone selective interlocks (ZSIs). A ZSI ties the circuit breaker trip units together, allowing them to communicate in order to ensure that the circuit breaker closest to the fault trips first. Increasing the fault isolation capabilities increases the data center’s ability to maintain operational continuity.
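
The coordination idea behind ZSI can be sketched in a few lines of logic. The Python sketch below is a generic illustration of the restraint scheme, not the trip-unit firmware used in these switchboards.

    # Illustrative zone-selective interlocking (ZSI) logic, not vendor firmware.
    # A trip unit that detects fault current sends a restraint signal upstream.
    # A breaker that sees the fault and receives no restraint from below is the
    # breaker closest to the fault, so it trips with no intentional delay;
    # restrained breakers hold off for a short coordination delay as backup.

    def zsi_decision(detects_fault, restrained_from_downstream):
        if not detects_fault:
            return "stay closed"
        if restrained_from_downstream:
            return "trip after backup delay"   # let the downstream breaker clear it
        return "trip instantly"                # closest breaker to the fault

    # Fault on a 400-A feeder: feeder and main both see fault current, but the
    # feeder restrains the main, so only the feeder clears instantly.
    print("feeder:", zsi_decision(True, False))
    print("main:  ", zsi_decision(True, True))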

The main switchboards are configured as main-tie-main-tie-main. Each MPC has a dedicated programmable logic controller (PLC). The PLCs are hot swappable, meaning that if either processor goes down, the other processor will automatically take control. The I/O rack is located in the A MPC. There is no power bussing in the I/O section. The switchgear can be manually operated if the I/O rack is de-energized for maintenance.
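
A hot-standby controller pair of this kind is commonly supervised with a heartbeat and an ownership flag. The sketch below is a generic illustration of that supervision, not the actual MPC control program.

    # Generic hot-standby supervision sketch (illustrative only).
    # The standby PLC watches the primary's heartbeat and assumes control
    # of the switchgear lineup if the heartbeat stops.
    import time

    HEARTBEAT_TIMEOUT_S = 2.0    # assumed supervision window

    class StandbyController:
        def __init__(self):
            self.last_heartbeat = time.monotonic()
            self.in_control = False

        def heartbeat_received(self):
            self.last_heartbeat = time.monotonic()

        def poll(self):
            silent = time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S
            if silent and not self.in_control:
                self.in_control = True   # primary is silent: take control
            return self.in_control

    standby = StandbyController()
    standby.heartbeat_received()
    print(standby.poll())   # False while the primary is still healthy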

Modbus protocol data is provided to the Schneider Electric StruxureWare management system at the PLC gateway for each main switchboard. The main switchgear has integrated revenue-grade power-quality metering.
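
Modbus itself is a simple register-oriented request/response protocol. The sketch below builds a raw Modbus TCP “read holding registers” request using only the Python standard library; the meter address, unit ID, and register numbers are placeholders, not values from the actual metering installation.

    # Minimal Modbus TCP "read holding registers" (function code 0x03) request.
    # Host, unit ID and register address are placeholders for illustration.
    import socket
    import struct

    METER_HOST = "192.0.2.10"    # documentation/example address
    UNIT_ID = 1
    START_REGISTER = 0           # placeholder register
    REGISTER_COUNT = 2

    # MBAP header (transaction ID, protocol ID = 0, length, unit ID) + PDU.
    request = struct.pack(">HHHB", 1, 0, 6, UNIT_ID) + struct.pack(
        ">BHH", 0x03, START_REGISTER, REGISTER_COUNT)

    with socket.create_connection((METER_HOST, 502), timeout=2) as sock:
        sock.sendall(request)
        response = sock.recv(256)

    # Skip the 7-byte MBAP header, function code and byte count, then unpack
    # the 16-bit register values returned by the meter.
    values = struct.unpack(">" + "H" * REGISTER_COUNT,
                           response[9:9 + 2 * REGISTER_COUNT])
    print(values)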

Each side of the redundant power system features a 1.2-MW Schneider Electric APC Symmetra Megawatt UPS (see Figure 3). Each UPS has a dedicated external 1600-A continuous-duty-rated static bypass switch. Power to the two UPS systems is delivered from two separate (A/B) switchboards. Each switchboard is able to support the entire data center. The two tie breakers operate in the normally closed position.

Figure 3. UPS efficiency

Figure 4. Modular Power Center – top view (patent pending)

Figure 5. Modular Power Center – plan view (patent pending)

Power for the data hall will be derived from two identical MPC modules, each of which is factory-assembled prior to on-site delivery. The first layouts were completed for the MPCs with the addition of the following components:

  • Four battery cabinets per side for 5 minutes of battery backup at 1.2 MW
  • Two automatic transfer switches for N+1 rooftop units
  • One 208/120-V distribution panel for local power needs

Figure 6. Modular Power Center – interior

The MPCs (see Figures 4-8) are IBC-rated R17 structures, which meet Miami-Dade County 149-mph wind-pressure loading requirements. The MPCs are constructed to protect the equipment from the harmful effects of water ingress (rain, sleet, snow) and to remain undamaged by the external formation of ice on the enclosure or by seismic events.

Compass employs a 2.0-MW/2.5-MVA, 277/480-V, 3ø, four-wire generator rated at 1825 kW to provide standby power (see Figures 9 and 10).

Compass determined that the data center infrastructure will require support from a 2.0-MW/2.5-MVA generator. The sequence of operation of the total system is controlled automatically through deployment of redundant PLC control units installed in each of the 3000-A main switchboards. Should the primary standby generator fail to come online after loss of the utility source, the swing generator will pick up the critical loads of the system. Each generator will be provided with a weather-protective enclosure. All generator permits (including all operations, fuel storage, noise and air) will be obtained and maintained with the appropriate AHJ. Generators are equipped with 4,000-gallon fuel storage belly tanks for 24 hours of fuel capacity at full load.
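
The failover sequence the PLCs implement can be summarized as a short priority list, and the stated fuel autonomy can be checked with simple arithmetic. The sketch below is a simplified illustration, not the actual PLC program; the full-load burn rate is an assumed round figure for a 2-MW diesel genset.

    # Simplified standby-power source selection (illustrative only).
    def select_source(utility_ok, primary_gen_online, swing_gen_online):
        """Return the source that should carry the critical load."""
        if utility_ok:
            return "utility"
        if primary_gen_online:
            return "primary generator"
        if swing_gen_online:
            return "swing generator"   # picks up if the primary fails to start
        return "UPS batteries (ride-through only)"

    print(select_source(utility_ok=False, primary_gen_online=False,
                        swing_gen_online=True))

    # Fuel autonomy: 4,000-gal belly tank at an assumed ~165 gal/h full-load
    # burn rate, consistent with the stated 24 hours at full load.
    print(f"Autonomy: {4000 / 165:.1f} hours")   # ~24 hours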

Figure 7. Modular Power Center – interior

Figure 8. Modular Power Center – exterior

Figure 9. Standby generator

The output of each UPS module feeds a 1600-A distribution board equipped with a maintenance bypass (MBP). A solenoid key release unit (SKRU) is provided to ensure that the UPS has transferred to bypass before the MBP breaker can be engaged to the output switchboard. This will always be a closed-transition transfer so that critical load power is never lost.
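
The SKRU enforces a simple permissive: the key that releases the maintenance bypass breaker becomes available only after the UPS is confirmed on static bypass, which is what makes the transfer closed transition. A minimal sketch of that permissive follows (illustrative, not the vendor interlock design).

    # Illustrative maintenance-bypass permissive, not the vendor interlock design.
    # The MBP breaker may close only while the UPS is already on static bypass,
    # so the critical load sees a make-before-break (closed-transition) transfer.
    def mbp_close_permitted(ups_on_static_bypass, skru_key_released):
        return ups_on_static_bypass and skru_key_released

    print(mbp_close_permitted(ups_on_static_bypass=True, skru_key_released=True))    # True
    print(mbp_close_permitted(ups_on_static_bypass=False, skru_key_released=True))   # False: blocked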

Critical power to the IT load will be provided by eight 300-kVA PDUs installed in an alternating A/B arrangement in the data hall to provide 208/120-V power to either overhead busway or remote power panels. Each PDU has a 300-kVA K-13 rated transformer and six 225-A breakers. Additionally, each PDU has six integrated revenue-grade power-monitoring meters.
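
The PDU arrangement can be sanity-checked with standard three-phase arithmetic using only the figures above.

    import math

    # Sanity check of the PDU arrangement (figures from the article).
    pdu_kva = 300
    secondary_volts = 208              # line-to-line
    pdus_total = 8                     # alternating A/B, i.e., four per side

    full_load_amps = pdu_kva * 1000 / (math.sqrt(3) * secondary_volts)
    print(f"PDU secondary full-load current: {full_load_amps:.0f} A")   # ~833 A

    per_side_kva = (pdus_total // 2) * pdu_kva
    print(f"Capacity per A/B side: {per_side_kva} kVA")   # 1,200 kVA, in line with
    # the 1.2-MW UPS modules (ignoring power factor).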

Compass determined that the mechanical requirements for each data hall can be supported by four 120-ton rooftop units (RTUs). The four RTUs provide N+1 system redundancy. Power for each RTU will be available from either the A or B system through dedicated automatic transfer switches (see Figure 11). The RTUs feature integrated controls that coordinate efficient airside economization across all units. A proprietary rapid-restart feature ensures full air movement within 30 seconds after restoration of power. The system controls deliver uniform under-floor pressures.
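
The N+1 claim can be checked with a short conversion (1 ton of refrigeration is approximately 3.517 kW of cooling).

    # N+1 check for the rooftop units using the stated ratings.
    TON_TO_KW = 3.517                  # 1 ton of refrigeration ~ 3.517 kW
    rtu_tons = 120
    rtus_installed = 4

    capacity_one_out_kw = (rtus_installed - 1) * rtu_tons * TON_TO_KW
    print(f"Cooling with one RTU out of service: {capacity_one_out_kw:.0f} kW")   # ~1,266 kW
    # Three of the four 120-ton units cover the 1.2-MW data hall load,
    # so the fourth unit is the +1 redundant unit.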

Figure 10. Generator exterior

This single-line diagram (see Figure 12) represents the next step in the construction document development process: the design development (DD) phase. The updated single-line reflects significant developments in the design process. The new features are:

  • The division of the electrical distribution system into two completely separate MPCs
  • The addition of a second generator to provide N+1 redundancy
  • Provisions for ‘A side’ and ‘B side’ mechanical feeders
  • The integration of the main and output switchboards
  • The integration of the maintenance bypass (MBP) breaker into the new contiguous switchboard lineup
  • The introduction of dual programmable logic controllers (PLCs)
  • The introduction of a remote main utility circuit breaker

The new single-line diagram now reflects the separate MPCs. The MPCs in this arrangement actually complement each other, providing 2N redundancy to the data hall.

A second generator was added to provide 2N generator redundancy. In the future, this second generator could be shared with additional data halls on the same campus. The redundancy of the generators would then be N+1.

Each MPC was outfitted with provisions to provide power for the entire mechanical system. Normally, power for one-half of the mechanical equipment is supplied by each MPC. Dedicated ATSs automatically roll power to the remaining active MPC if power from either MPC is lost.

Figure 11. Power for each RTU is available from dedicated automatic transfer switches.

The new single-line shows how both the input and output switchboards have been integrated into a single switchgear lineup. The new lineup provides a simplified interconnection scheme when installed in the MPC.

The use of dual MPCs facilitates the use of dual PLCs. The ability of the PLCs to stay in synchronous operation allows for a seamless transfer of control between either unit.

High levels of arc-flash energy in the main switchboards became a concern when the decision was made to have the utility provide the main 2500-kVA transformer and the transformer’s primary-side protection. Most utilities design their protection schemes to protect the utility’s own equipment. This typically doesn’t translate into limiting the arc-flash energy levels on the secondary side of these large transformers. Remoting the main utility breaker outside the MPC confines the highest arc-flash energy to the remote switchboard. The arc-flash energy inside the MPC is now significantly reduced.

Figure 12. Design-Development (DD) single-line diagram (patent pending)

The deployment of the MPCs is just part of the modular concept developed for Compass Data Centers, and Compass has continued to develop the concept further. This work and Uptime Institute Tier III compliance are important elements of its business plan to control costs and align data center facilities with customers’ business requirements.


Steve Emert earned a BSEE from San Jose State University. He is currently a Registered Professional Engineer in more than 15 states and is director of Mission-Critical Engineering at Rosendin Electric, the largest privately owned electrical contractor in the U.S. Mr. Emert began his career performing design and analysis of industrial, commercial and utility power systems, cogeneration plant design and coordination studies. In the mid-1990s, he worked at the Ames Research Center located at Moffett Field, CA, where he began a mission-critical-focused career working on the NAS supercomputer and many of the technically advanced NASA facilities.

Since joining Rosendin Electric, Mr. Emert has provided the engineering foundation for the company’s design-build mission-critical construction business. Today Rosendin Electric is the largest design-build electrical contractor for mission-critical facilities in the U.S. Mr. Emert is an active member of the IEEE P1584 Working Group (IEEE Guide for Performing Arc-Flash Hazard Calculations). He is a member of the IEEE P1584 Configuration Task Group, currently engaged in the process of defining a new set of standards for the next issue of the P1584 publication. Mr. Emert has coauthored IEEE EMC Society presentations and written articles on electrical power systems for Electrical Contractor Magazine.

 


Dual-Corded Power and Fault Tolerance: Past, Present, and Future

Details of dual-corded power change, but the theme remains the same.

Uptime Institute has worked with owners and operators of data centers since the early 1990s. At the time, data center owners used single-corded IT devices for even their most critical IT assets. Figure 1 shows a selection of the many potential sources of outage in the single path.

Early on, Site Uptime Network (now the Uptime Institute Network) founder Ken Brill recognized that outages due to faults or maintenance in the critical distribution system were a major problem in high availability computing. The Uptime Institute considers the critical distribution to include the power supply to IT devices from the UPS output to any PDU (power distribution unit), panel, or remote power panel (RPP), and the power distribution down to the rack via whip or bus duct.

Ahead of their time, Ken Brill and the Network created the Fault Tolerant Power Compliance Specification in 2000 to address the sources of outages, and updated it in 2002. Then, in 2004 Uptime Institute produced the paper Fault Tolerant Power Certification is Essential When Buying Products for High-Availability to directly address the issue. When this paper was written, four years after the Fault Tolerant Power Compliance Specification was first issued, critical distribution failures continued to cause the majority of data center outages.

“Fault-Tolerant Power Compliance Specification Version 2.0” (reproduced at the end of this article) lists the required functionality of Fault Tolerant dual-corded IT devices as defined by the Uptime Institute.

In the mid-1990s, the Uptime Institute led the data center industry in establishing Tiers as a way to define the performance characteristics of data centers. Each Tier builds upon the previous Tier, adding maintenance opportunity and Fault Tolerance. This progress culminated in the 2009 publication of the Tier Standard: Topology, which established Tiers as progressive maintenance opportunities and fault tolerance. The Tier Standard also included the requirement for dual-corded devices in Tier III and Tier IV objective data centers. Tier III data centers have dual power paths to provide Concurrent Maintainability of each and every component and path. Tier IV data centers require the same dual power paths for Concurrent Maintainability and add the ability to autonomously respond to failures.

Figure 1. Single-corded IT equipment

Present
The Fault Tolerant Power Compliance Specification, Version 2.0 is clearly relevant 12 years later. Devices originally called Fault Tolerant IT devices are today commonly known as dual-corded devices, and they have become the basis of high availability. The terms Fault Tolerant IT device and dual-corded IT device are used interchangeably.

Tier III and Tier IV data center designs continue to be based upon the use of dual-corded architecture and require an active-active, dual-path distribution. The dual-corded concept is cemented into high-availability architecture in enterprise data centers, hyper-scale internet providers, and third-party data center spaces. Even the innovative Open Compute Project, sponsored by Facebook, which uses cutting-edge electrical architecture, utilizes dual-corded, Fault Tolerant IT devices.

Confoundingly, though, more than half of the more than 5,000 reported incidents in the Uptime Institute Network’s Abnormal Incident Reports (AIRs) database relate to the critical distribution system.

Dual-corded assets have increased maintenance opportunities for data center facilities management. Operations teams no longer need to wait for inconveniently timed maintenance windows to perform maintenance; instead they can maintain their facilities without IT impact during safe and regular hours. If there is an anomaly, the facilities and IT staff are on hand to address it.

Figure 2. Fault Tolerant and dual-corded IT equipment

Uptime Institute Network members today recognize the benefits of dual-corded devices. COO Jason Weckworth of RagingWire recently said, “Dual-corded IT devices allow RagingWire the maintenance and operations flexibility that are consistent with our Concurrently Maintainable objective and provide that extra level of availability assurance below the UPS system where any problem may have consequential impacts to availability.”

Uptime Institute Network adoption of dual-corded devices has clearly improved, as indicated by the declining number of outages attributed to critical distribution. Properly applied, dual-corded devices are unaffected by the loss of a single source. Analysis of the AIRs database from 2007 to 2012 showed a reduction of more than 90% in critical distribution failures impacting the IT load.

Some data center owners or IT teams try to achieve dual power paths to IT equipment using large static transfer switches (STS) or STS power distribution units (PDU) (see Figure 3). However, maintenance, replacement, or a fault of an STS threatens the critical load from that device onward. One data center suffered a fault on an STS-PDU that affected one third of its IT equipment, and the loss of those systems rendered the entire data center unavailable. As noted in Figure 3, the single large STS solution does not meet Tier III or Tier IV criteria.

Figure 3. Static transfer switches

Uptime Institute recognizes that some heritage devices or legacy systems may end up in data centers, due to systems migrations challenges, mergers and acquisitions, consolidations, or historical clients. Data center infrastructure professionals need to question the justifications that lead to these conditions: If the system is so important, why is it not migrated to a high-availability, dual-corded IT asset?

The Tier Standard: Topology does include an accommodation for single-corded equipment as shown in Figure 4, depicting a local, rack-mounted transfer switch. The rack-mounted or point-of-use transfer switch allows for distribution of risk as low as possible in the critical distribution.

Still, many in IT have not yet gotten the message and bring in more than the occasional one-off device. Single-corded devices are found in a larger percentage of installations than should be expected. Rob McClary, SVP and GM of FORTRUST, said, “FORTRUST utilizes infrastructure with dual-power paths, yet we estimate that greater than 50% of our clients continue to deploy at least one or more single-corded devices that do not utilize our power infrastructure and could impact their own availability. FORTRUST strongly supports continued education to our end user community to utilize all dual-corded IT assets for a true high-availability solution.” The loss of even one single-corded asset in a deployment can render the deployment’s platforms or applications unavailable. The disconnect between data center infrastructure and IT hardware continues to exist.

Figure 4. Point-of-use transfer switch

Uptime Institute teams still find the following configurations that continue to plague data center operators:

  • Single-corded network devices
  • Mainframes that degrade or are lost on loss of a single source of power
  • IT devices with an odd number of cords

The Future: A Call to Action
Complex systems such as data center infrastructure and the IT equipment and systems within them require a comprehensive team approach to management, which means breaking down the barriers between the organizations: integrating Facilities and IT staff, allowing the integrated organization to manage the data center, and educating end users who do not understand power infrastructure. If we can’t integrate, then educate.

If a merger of IT and facilities just won’t work in an enterprise data center, a regular meeting will at least enable teams to share knowledge and review change management and facilities maintenance actions. In addition, codifying change management and maintenance window procedures in terms IT can understand using an ITIL-based system will enable IT counterparts to start to understand the criticality of power distribution as they see the how and why of data center facility operations firsthand.

Colocation and third-party data centers understand that many client IT organizations have limited in-house staff, expertise, and familiarity with high-availability data centers. The need to educate these clients is clear. Several ways to educate include:

  • Compile incident reports involving single-corded devices and share them with new tenants and deployment teams
  • Create a one-page fact sheet on dual-corded infrastructure with a schematic and benefits summary that those users can understand
  • Create a policy that requires rack-mounted or point-of-use transfer switches for all single-corded devices
  • Require all devices that support a high-availability application or IT deployment to be dual corded

These actions will pay dividends with increased ease of maintenance and reduced client coordination.

Facilities teams also need to look within themselves. Improved monitoring and data center infrastructure management (DCIM) solutions provide windows into the infrastructure but do not replace good management. Anecdotal evidence has shown 1-10% of servers in a data center may be improperly corded, i.e., both cords are plugged into the A distribution.
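
A minimal sketch of the kind of cording audit implied here follows, assuming a hypothetical inventory export with one row per power cord; the field names and sample data are illustrative, not drawn from any particular DCIM product.

    # Illustrative A/B cording audit against a hypothetical inventory export.
    from collections import defaultdict

    cords = [
        {"server": "app-01", "feed": "A"},
        {"server": "app-01", "feed": "B"},
        {"server": "db-02",  "feed": "A"},
        {"server": "db-02",  "feed": "A"},   # miscorded: both cords on the A feed
        {"server": "net-03", "feed": "A"},   # single corded
    ]

    feeds_by_server = defaultdict(set)
    for cord in cords:
        feeds_by_server[cord["server"]].add(cord["feed"])

    for server, feeds in sorted(feeds_by_server.items()):
        if feeds != {"A", "B"}:
            print(f"{server}: review cording (feeds seen: {sorted(feeds)})")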

Management can address these challenges by

  • Clearly and consistently labeling A and B power
  • Training all staff working in critical areas about data center policies, including the dual-corded policy
  • Performing quality control to verify A/B cording, phase balancing, and installation documentation
  • Capturing the configuration of the data center
  • Regularly tracking single-corded installations to pressure owners of those systems to modernize

Summary
Millions of dollars are regularly invested in the dual-power path infrastructure in data centers for high availability because of business needs. The business need is clearly represented by the rising cost of downtime, from lost business to damaged reputations and goodwill. It is essential that Facilities and IT, including the procurement and installation teams, work together to safeguard the investment, making sure dual-power path technology is utilized for business-critical applications. In addition, owners and operators of data centers must continue to educate customers who lack knowledge of or familiarity with data center practices and manage the data center to ensure high-availability principles such as dual-corded architecture are fully utilized.


Fault-Tolerant Power Compliance Specification Version 2.0

Fault-tolerant power equipment refers to computer or communication hardware that is capable of receiving AC input from two different AC power sources. The objective is to maintain full equipment functionality when operating from A and B power sources or from A alone or from B alone. Equipment with an odd number of external power inputs (line cords) generally will not meet this requirement. It is desirable for equipment to have the least number of external power inputs while still meeting the requirement for receiving AC input from two different AC power sources. Products requiring more than two external power inputs risk being rejected by some sites. For equipment to qualify as truly fault-tolerant power compliant, it must meet all of the following criteria as initially installed, at ultimate capacity, and under any configuration or combination of options. (The designation of A and B power sources is used for clarity in the following descriptions.)

  • If either one of two AC power sources fails or is out-of-tolerance, the equipment must still be able to start up or continue uninterrupted operation with no loss of data, reduction in hardware functionality, performance, capacity, or cooling.
  • After the return of either AC power source from a failed or out-of-tolerance condition, during which acceptable power was continuously available from the other AC power source, the equipment will not require a power-down, IPL, or human intervention to restore data, hardware functionality, performance, or capacity.
  • The first or second AC power source may then subsequently fail no later than 10 seconds after the return of the first or second AC power source from a failed or out-of-tolerance condition with no loss of data, reduction in hardware functionality, performance, capacity, or cooling.
  • The two AC power sources can be out of synchronization with each having a different voltage, frequency, phase rotation, and phase angle as long as the power characteristics for each separate AC source remain within the range of the manufacturer’s published specifications and tolerances.
  • Both external AC power inputs must terminate within the manufacturer’s fault-tolerant power compliant computer equipment. In the event that the external AC power input is a detachable power cord, the equipment must provide for positive retention of the female plug so the plug cannot be pulled loose accidentally. Within the equipment, the AC power train (down to and including the AC-to-DC power supplies) must be compartmentalized such that any power train component on either side can be safely serviced without affecting computer equipment availability or performance and without putting the AC power train of the other side at risk.
  • For single- or three-phase power sources, the neutral conductor in the AC power input shall not be bonded to the chassis ground inside the equipment. This will prevent circulating ground currents between the two external power sources.
  • Internal or external active AC input switching devices (e.g., mechanical or electronic transfer switches) are not acceptable.
  • A fault inside the manufacturer’s equipment that results in the failure of one AC power source shall not be transferred to the second AC power source causing it to also fail.
  • For single- or three-phase power sources, with both AC power inputs available and with both inputs operating at approximately the same voltage, the normal load on each power source will be shared within 10% of the average.
  • For three-phase power source configurations, the normal load on each phase will be within 10% of the average.
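
A minimal worked check of the last two load-sharing criteria, using made-up meter readings, might look like the following sketch.

    # Check that each reading falls within 10% of the average (illustrative).
    def within_10_percent_of_average(readings):
        avg = sum(readings) / len(readings)
        return all(abs(r - avg) <= 0.10 * avg for r in readings)

    # A/B source sharing: a device drawing 5.3 kW from A and 4.7 kW from B.
    # The average is 5.0 kW, so each source must carry between 4.5 and 5.5 kW.
    print(within_10_percent_of_average([5.3, 4.7]))          # True: compliant

    # Per-phase balance on a three-phase input (amps per phase, made-up values).
    print(within_10_percent_of_average([30.0, 28.0, 35.0]))  # False: one phase is
    # more than 10% above the 31-A average.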

Keith Klesner’s career in critical facilities spans 14 years and includes responsibilities ranging from planning, engineering, design and construction to start-up and ongoing operation of data centers and mission-critical facilities. In the role of Uptime Institute vice president of Engineering, Mr. Klesner has provided leadership and strategic direction to maintain the highest levels of availability for leading organizations around the world. Mr. Klesner performs strategic-level consulting engagements, Tier Certifications and industry outreach—in addition to instructing premiere professional accreditation courses. Prior to joining the Uptime Institute, Mr. Klesner was responsible for the planning, design, construction, operation and maintenance of critical facilities for the U.S. government worldwide. His early career includes six years as a U.S. Air Force officer. He has a Bachelor of Science degree in Civil Engineering from the University of Colorado-Boulder and a Master of Business Administration from the University of La Verne. He maintains status as a professional engineer (PE) in Colorado and is a LEED-accredited professional.


Resolving the Data Center Staffing Shortage

The availability of qualified candidates is just part of the problem with data center staffing; the industry also lacks training and clear career paths attractive to recruits.

The data center industry is experiencing a shortage of personnel. Uptime Institute Founder Ken Brill, as always, was among the first to note the trend, mentioning it more than 10 years ago. The trend reflects, in part, an aging global demographic but also increasing demand for data center personnel, a shortage that Uptime Institute Network members have described as chronic. The aging population threatens many industries but few more so than the data center industry, where the Network Abnormal Incident Reports (AIRs) database documents a relationship between downtime and inexperienced personnel.

As a result, at recent meetings North American Network members discussed how enterprises could successfully attract, train and retain staff. Network members blame the shortage on increasing market demand for data centers, inflexible organizational structures and an aging workforce retiring in greater numbers. This shortfall has already caused interruptions in service and reduced availability to mission-critical business applications. Some say that the shortage of skilled personnel has already created conditions that could lead to downtime. If not addressed in the near term, the problem could affect sections of the economy and company valuations.

Prior to 2000, data center infrastructure was fairly static; power and cooling demand generally grew following a linear curve. Enterprises could manage the growth in demand during this period fairly easily. Then technology advances and increased market adoption rates changed the demand for data center capacity so that it no longer followed a linear growth model. This trend continues, with one recent survey from TheInfoPro, a service of 451 Research, finding that 37% of data center operators had added data center space from July 2012 to June 2013. Similarly, the 2013 Uptime Institute Data Center Industry Survey found that 70% of 1,000 respondents had built new data center space or added space in the last five years. The survey reported even more data center construction projects in 2012 and 2011 (see Figures 1-3). The full 2013 survey, showing more detail about industry growth, appears elsewhere in this journal.

Figure 1: New data center space reported in the last 12 months.

Drivers of the data center market are similar to those that drive overall internet growth and include increasing broadband penetration, e-commerce, video delivery, gaming, social networking, VOIP, cloud computing and web applications that make the internet and data networking a key enabler of business and consumer activity. More qualified personnel are required to respond to this accelerated growth.

The organization models of many companies kept IT and Facilities or real-estate groups totally separate. IT did IT work while Facilities maintained the building, striped the parking lot, and—oh, by the way—supported the UPS systems. The groups did not share goals, schedules, meetings or ideas. This organizational structure worked well until technology accentuated the importance of, and the actual lack of, middle ground between the two groups.

Figure 2. Demand significantly outpaced supply since 2010.

Efforts to bridge the gap between the two groups foundered because of conflicting work processes and multiple definitions for shared terms (such as mission critical maintenance). Simply put, the two groups spoke different languages, followed different leaders and pursued unreconciled goals. Companies that recognized the implications of the situation immediately reorganized. Some of these companies established mission critical teams and others moved Facilities and IT into the same department. This organizational challenge is by no means worked out and will continue well into the next decade.

Though no government agency or private enterprise keeps track of employment trends in data centers, U.S. Social Security Administration (SSA) statistics for the general population support anecdotes shared by Network members. According to the SSA, the agency that administers the federal retirement benefits program in the U.S., 10,000 people per day apply for social security benefits, a rate expected to continue through 2025 as the baby boomers retire, a phenomenon apparently first dubbed the “silver tsunami” by the Alliance for Aging Research in 2006. The populations of Europe and large parts of Asia, including China and Japan, are also aging.

The direct experiences shared by Uptime Institute Network members suggest that the data center industry is highly vulnerable to, if not already diminished by, this larger societal trend. Network members estimate that 40% of the facilities engineering community is older than 50. One member of the Network expects that 50% of its staff will retire in the next two years. Network members remain concerned that many qualified candidates—science, technology, engineering and mathematics (STEM) students—are unaware of the employment opportunities offered by the industry and may not be attracted to the 24 x 7 nature of the work.

Tony Ulichnie, who presided over many of these discussions as Network Director, North America (before retiring in July of this year), described the cost of the wisdom and experience lost when this generation retires as “the price of silver,” referring to the loss to the organization when a longstanding and silver-haired data center operations specialist retires.

Military and civilian nuclear programs have proven to be a source of excellent candidates for data center facilities but yield only so many graduates. These “Navy Nukes” and seasoned facilities engineers command very competitive salaries and find themselves being courted by the industry.

Industry leaders say that the pipeline for replacement engineers has slowed to a dribble. Tactics such as poaching and counteroffers have become commonplace in the field.

Potential employers are also reluctant to risk hiring green (inexperienced) recruits. The practices of mission-critical maintenance require much discipline and patience, especially when dealing with IT schedules and meeting application availability requirements. Deliberate processes along with clear communications skills become necessary elements of an effective facilities organization. Identifying individuals with these capabilities is the trick: one Uptime Institute Network member found a key recruit working at a bakery. Another member puts HVAC students through an 18-month training program after hiring them from a local vocational school, with a 70% success rate.

Figure 3. Growth in new whitespace by size. Those reporting new space in the last five years in the Uptime Institute Survey (see p. 142 for the full 2013 Uptime Institute Data Center Survey) also reported that a wide variety of spaces had been built.

The hunt for unexplored candidate pools will increase in intensity as the demand for talent escalates in the next decade, and availability and reliability will also suffer unless the industry addresses the problem in a comprehensive manner. To mitigate the silver tsunami, some combination of industry, individual enterprises and academia must create effective training, development and apprenticeship programs to prepare replacements for retirees at all levels of responsibility. In particular, data center operators must develop ways to identify and recruit talented young individuals who possess the key attributes needed to succeed in a pressure-packed environment.

A Resource Pool

Veterans form an often overlooked and/or misunderstood talent pool for technical and high-precision jobs in many fields, including data centers. Statistics suggest that unemployment among veterans exceeds the national rate, which is counterintuitive to those who have served. With more than one million service members projected to leave the military between now and 2016 due to the drawdown in combat operations, according to U.S. Department of Defense estimates, unemployment among veterans could be a growing national problem in the U.S.

From the industry perspective, however, the national problem of unemployed veterans could prove an opportunity to “do well by doing good.” While experienced nuclear professionals represent a small pool of high-end and experienced talent, the pool of unemployed but trainable veterans represents a nearly inexhaustible source of talent suitable, with appropriate preparation, for all kinds of data center challenges.

Data centers compete with other industries for personnel, so now is the time to seize the opportunity because other industries are already moving to capitalize on this pool of talent. For example, Walmart has committed to hiring any veteran who was honorably discharged in the past 12 months, JP Morgan Chase has teamed with the U.S. Chamber of Commerce to hire over 100,000 veterans, and iHeartRadio’s Show Your Stripes program features many large and small enterprises, including some that own or operate data centers, committed to meeting the employment needs of veterans. For its own good, the data center industry must broadly participate in these efforts and drive to acquire and train candidates from this talent pool.

In North America, some data center staffs already include veterans who received technical training in the military and were able to land a job because they could quickly apply those skills to data centers. These technicians have proven the value of hiring veterans for data center work, not only for their relevant skills but also for their personal attributes of discipline and performance excellence.

The data center industry can take further advantage of the talent pool of veterans by establishing effective training and onboarding programs (mechanisms that enable new employees to acquire the necessary knowledge, skills and behaviors to become effective organizational members and insiders) for veterans who do not have the technical training (e.g., infantry, armor) that translates easily to the data center industry but have all the other important characteristics, including a proven ability to learn. Providing clear pathways for veterans of all backgrounds to enter the industry will ensure that it benefits from the growing talent pool and will be able to compete effectively with the other industries.

While technically trained veterans can enter the data center industry needing only mentoring and experience to become near-term replacements for retiring mid-level personnel, reaching out to a broader pool that requires technical training will create a generation of junior staff who can grow into mid-level positions and beyond with time and experience. The leadership, discipline and drive that veterans have will enable them to more quickly grasp and master the technical requirements of the job and adapt with ease to the rigor of data center operations.

Veterans’ Value to Industries

Military training and experience is unequaled in the breadth and depth of skills that it develops and the conditions in which these skills are vetted. Service members are trained to be intellectually, mentally and emotionally strong. They are then continuously tested in the most extreme conditions. Combat veterans have made life and death decisions, 24 hours a day for months without a break. They perform complex tasks, knowing that the consequences of failure could result in harm or even the death of themselves and others. This resilience and strength can be relied on in the civilian marketplace.

Basic training teaches the men and women of the military that the needs of the team are greater than their individual needs. They are taught to lead and to follow. They are taught to communicate up, down and across. They learn that they can achieve things they never thought possible because of these skills, and with a humble confidence can do the same in any work environment.

The military is in a constant state of learning, producing individuals with uncanny adaptive thinking and a capacity and passion for continuing to learn. This learning environment focuses not only on personal development but also on training and developing subordinates and peers. This experience acts as a force multiplier when a veteran who is used to knowing his job plus that of the entire team is added to the staff. The veteran is used to making sure that the team as a whole is performing well rather than focusing on the individual. This unwavering commitment to a greater cause becomes an ingrained ethos that can improve the work habits of the entire team.

The public commonly stereotypes military personnel as unable to think outside of a chain of command, but following a chain of command is only a facet of understanding how to perform in a team. Service members are also trained to be problem solvers. In this author’s experience, Iraq and Afghanistan were highly complex operations where overlooking the smallest detail could change outcomes. The military could not succeed at any mission if everyone waited for specific orders/instructions from their superiors before reacting to a situation. The mindset of a veteran is to focus on the mission: mission leaders impart a thorough understanding of the intent of a plan to troops, who then apply problem-solving skills to each situation in order to get the most positive outcome. They are trained to be consummate planners, engaging in a continuous process of assessment, planning, execution, feedback and fine tuning to ensure mission success.

Reliability is another key attribute that comes from military service. Veterans know that a mission that starts a minute late can be fatal. This precision translates to little things like punctuality and big things like driving projects to meet due dates and budgets. This level of dependability is a cornerstone of being a good teammate and leader.

Finally, an often overlooked value of military service is the scope of responsibility that veterans have had, which is often much larger than their non-veteran peers. It is not uncommon for servicemen and women in their twenties to have managed multi-million dollar budgets and hundreds of people. Their planning and management experience is gained in situations where bad decisions can cause troops to drive into an ambush that might also prevent supplies or reinforcements from reaching an under-provisioned unit.

Because military experience produces individuals who demonstrate strong leadership skills, reliability, dependability, integrity, problem-solving ability, proven ability to learn and a team-first attitude, veterans are the best source of talent available. Salute Inc. is an example of a company that helps bring veterans into the data center industry, and in less than six months has proven the value proposition.

Challenges

Recent Uptime Institute Network discussions identified the need for standard curriculum and job descriptions to help establish a pathway for veterans to more easily enter the industry, and Network members are forming a subcommittee to examine the issue. The subcommittee’s first priority is establishing a foundation of training for veterans whose military specialty did not include technical training. Training programs should allow each veteran to enter the data center industry at an appropriate level.

At the same time, the subcommittee will assess and recommend human resource (HR) policies to address a myriad of systemic issues that should be expected. For example, once trainees become qualified, how should companies adjust their salaries? Pay adjustments might exceed normal increases; however, the market value of these trainees has changed, and, unlike other entry-level trainees, veterans have proven high retention rates. The subcommittee has already defined several entry-level positions:

  • Network operations center trainee
  • Data center operations trainee
  • Security administration trainee
  • IT equipment installation trainee
  • Asset management administrator trainee

Resources for Veterans

The Center for New American Security (CNAS) conducted in-depth interviews with 69 companies and found that more than 80% named one or two negative perceptions about veterans. The two most common are skill translation and concerns about post-traumatic stress (PTS).

Many organizations have looked at the issue of skill translation. Some of them have developed online resources to help veterans translate their experiences into meaningful and descriptive civilian terms (www.careerinfonet.org/moc/). They also provide tools to help veterans articulate their value in terms that civilian organizations will understand.

Organizations that access these resources will also gain a better understanding of how a veteran’s training and skills suit the data center environment. In addition, the military has established comprehensive transition programs that all service members go through when re-entering the civilian job market, including resume preparation and interview planning. The combination of government-sponsored programs and resources, a veteran’s own initiative and a civilian organization’s desire to understand can offset concerns about skill translation.

PTS is an issue that cannot be ignored. It is one of the more misunderstood problems in America, even among some in the medical community. It is important to understand more about PTS before assuming this problem affects only veterans. It is estimated that 8% of all Americans suffer from PTS, which is about 25 million people. The number of returning military who have been diagnosed with PTS is 300,000, which is about 30% of Iraq/Afghanistan combat veterans, yet only a very small proportion of the total PTS sufferers in the U.S. The mass media—where most people learn about PTS—often describes PTS as a military issue because the military approach to PTS is very visible: there is a formal process for identifying it and also enormous resources focused on helping veterans cope with it. Given that there are 80 times more non-veterans suffering from PTS, the focus for any HR organization should be ensuring that a company’s practices (from the interview to employee assistance programs and retention) are effectively addressing the issue in general.

Conclusion
The data center industry needs the discipline, leadership and flexibility skills of veterans to serve as a foundation on which it can build the next generation of data center operators. The Uptime Institute Network is establishing a subcommittee and has called for volunteers to help define the fundamentals required for an effective onboarding, training and development program in the industry. This group will address everything from job descriptions to clearly defined career paths for both entry-level trainees and experienced technicians transitioning from the military. For further information or if you are interested in contributing to this effort, please contact Rob Costa, Network Director, North America ([email protected]).

Resources

The following list provides a good starting point for understanding the many resources available for veterans and employers to connect.


Lee Kirby is Uptime Institute senior vice president, CTO. In his role he is responsible for serving Uptime Institute clients throughout the life cycle of the data center from design through operations. Mr. Kirby’s experience includes senior executive positions at Skanska, Lee Technologies and Exodus Communications. Prior to joining the Uptime Institute, he was CEO and founder of Salute Inc. He has more than 30 years of experience in all aspects of information systems, strategic business development, finance, planning, human resources and administration both in the private and public sectors. Mr. Kirby has successfully led several technology startups and turnarounds as well as built and run world-class global operations. In addition to an MBA from University of Washington and further studies at Henley School of Business in London and Stanford University, Mr. Kirby holds professional certifications in management and security (ITIL v3 Expert, Lean Six Sigma, CCO). In addition to his many years as a successful technology industry leader, he masterfully balanced a successful military career over 36 years (Ret. Colonel) and continues to serve as an advisor to many veteran support organizations.

Mr. Kirby also has extensive experience working cooperatively with leading organizations across many industries, including Morgan Stanley, Citibank, Digital Realty, Microsoft, Cisco and BP.


Resolving Conflicts between Data Center Owners and Designers

Improving communication between the enterprise and design engineers during a capital project

For over 10 years, Uptime Institute has sought to improve the relationship between data center design engineers and data center owners. Yet, it is clear that issues remain.

Uptime Institute’s uniquely unbiased position—it does not design, construct, commission, operate, or provision equipment to data centers—affords direct insight into data center capital projects throughout the world. Uptime Institute develops this insight through relationships with Network members in North America, Latin America, EMEA, and Asia Pacific; the Accredited Tier Designer (ATD) community; and the owner/operators of 392 Tier Certified, high-performance data centers in 56 countries.

Despite increasingly sophisticated analyses and tools available to the industry, Uptime Institute continues to find that when an enterprise data center owner’s underlying assumptions at the outset of a capital project are not attuned to its business needs for performance and capacity, problematic operations issues can plague the data center for its entire life.

The most extreme cases can result in disrupted service life of the new data center. Disrupted service life may be classified in three broad categories.

1. Limited flexibility

  • The resulting facility does not meet the performance requirements of an IT deployment that could have been reasonably forecast
  • The resulting facility is difficult to operate, and staff may avoid using performance or efficiency features in the design

2. Insufficient capacity

  • Another deployment (either new build, expansion, or colocation) must be launched earlier than expected
  • The Enterprise must again fund and resource a definition, justification, and implementation phase with all the associated business disruptions
  • The project is cancelled and capacity sought elsewhere

3. Excess capacity

  • Stranded assets in terms of space, power, and/or cooling represent a poor use of the Enterprise’s capital resources
  • Efficiency is diminished over the long term due to low utilization of equipment and space
  • Capital and operating cost per piece of IT or network equipment is untenable

Any data center capital project is subject to complex challenges. Some causes of schedule and budget overruns, such as inclement weather, delayed equipment delivery, overwhelmed local resources, slow-moving permitting and approval bureaucracies, lack of availability of public utilities (power, water, gas), merger or acquisition, or other shifts in corporate strategy, may be outside the direct control of the enterprise.

But other causes of schedule and budget overruns are avoidable and can be dealt with effectively during the pre-design phase. Unfortunately, many of these issues become clear to the Enterprise only after the project management, design, construction, and commissioning teams have completed their obligations.

Planning and justifying major data center projects has been a longstanding topic of research and education for Uptime Institute. Nevertheless, the global scale of planning shortfalls and project communication issues only became clear due to insight gained through the rapid expansion of Tier Certifications.

Even before a Tier Certification contract is signed, Uptime Institute requests a project profile, composed of key characteristics including size, capacity, density, phasing, and Tier objective(s). This information helps Uptime Institute determine the level of effort required for Tier Certification, based on similar projects. Additionally, this allows Uptime Institute to provide upfront counsel on common shortfalls and items of concern based upon our experience of similar projects.

Furthermore, a project team may update or amend the project profile to maintain cost controls. Yet, Uptime Institute noted significant variances in these updated profiles in terms of density, capacity, and Tier. It is acknowledged that an owner may decide to amend the size of a data center, or to adjust phasing, to limit initial capital costs or otherwise better respond to business needs. But a project that moves up and down the Tier levels or varies dramatically in density from one profile to another indicates management and communication issues.

These issues result in project delays, work stoppages, or cancellations. And if the project is completed, it can be expected to fall short in terms of capacity (either too much or too little), performance requirements (in the design or the facility), and flexibility.

Typically, a Tier Certification inquiry occurs after a business need has been established for a data center project and a data center design engineer has been contracted. Unstable Certification profiles show that a project may have been moved prematurely into the design phase, with cost, schedule, and credibility consequences for a number of parties—notably, the owner and the designer.

Addressing the Communications Gap

Beginning in May 2013, Uptime Institute engaged the industry to address this management and communication issue on a broader basis. Anecdotally, both sides, via the Network or ATD courses, had voiced concern that one had insufficient insight into the scope of responsibility, or unrealistic expectations, of the other. For example, a design engineer would typically be contracted to produce an executable design but soon find out that the owner was not ready to make the decisions that would allow the design process to begin. On the other hand, owners found that the design engineers lacked commitment to innovation, and they would be delivered a solution that was similar to a previous project rather than vetted against established performance and operations requirements. This initiative was entitled Owners vs Designers (OvD) to call attention to a tension evident between these two responsibilities.

Uptime Institute’s approach was to meet with the designers and owners separately to gather feedback and recommendations, and then to reconcile them in a publication.

OvD began with the ATD community during a special session at Uptime Institute Symposium in May 2013. Participants were predominantly senior design engineers with experience in the U.S., Canada, Brazil, Mexico, Kenya, Australia, Saudi Arabia, Lebanon, Germany, Oman, and Russia. This initial session verified the need for more attention to this issue.

The design engineers’ overwhelming guidance to owners could be summarized as “know what you want.” The following issues were raised specifically and repeatedly:

1. Lack of credible IT forecasting

  • Without a credible IT requirements definition, it is difficult to establish the basic project profile (size, density, capacity, phasing, and Tier). As this information is discovered, the profile changes, requiring significant delays and rework.
  • In the absence of an IT forecast, designers have to make assumptions about IT equipment. The designers felt that this task is outside their design contract; they are hired to be data center design experts, not IT planning experts.

2. Lack of detailed Facilities Technical Requirements

  • The absence of detailed Facilities Technical Requirements forces designers to complete a project definition document themselves because a formal design exercise cannot be launched in its absence
  • Some designers offer, or are forced, to complete the Facilities Technical Requirements, although it is out of scope
  • Others hesitated to do so as this is an extensive effort that requires knowledge and input from a variety of stakeholders
  • Others acknowledged that this process is outside their core competency and the result could be compromised by schedule pressures or limited experience

3. Misalignment of available budget and performance expectations

  • Owners wanted low capital expense, operating expense, and cost of ownership over the life of the project.
  • Most solutions cannot satisfy all three. The owners should establish the highest priority (Capex, Opex, TCO).
  • Designers felt unduly criticized for not prioritizing energy efficiency in data center designs when the owners did not understand the correlation between Capex and efficiency: “Saving money takes money; a cheap data center design is rarely efficient.”

Data Center Owners Respond
Following the initial meeting with the data center design community, Uptime Institute brought the discussion to data center owners and operators in the Uptime Institute Network throughout 2013, at the North America Network Meeting in Seattle, WA; the APAC Network Meeting in Shenzhen, China; and the Fall Network Meeting in Scottsdale, AZ.

Uptime Institute solicited input from the owners and also presented the designers’ perspective to the Network members. The problems the engineering community identified resonated with the Operations professionals. However, the owners also identified multiple problems encountered on the design side of a capital project.

In the owners’ words, “designers, do your job.”

According to the owners, the design community is responsible for drawing out the owners’ requirements, providing multiple options, and identifying and explaining potential costs. Common problems in the owners’ experience include:

  • Conflicts often arise between the design team and outside consultants hired by owners
  • Various stakeholders in the owner’s organization have different agendas that confuse priorities
  • Isolated IT and Facilities teams result in capacity planning problems
  • Design teams are reluctant to stray from their preferred designs

The data center owner community agreed with the designers’ perspective and took responsibility for those shortcomings. But the owners pointed out that many design firms promote cookie-cutter solutions and are reluctant to stray from their preferred topologies and equipment-based solutions. One participant shared that he received data center design documents for a project with the name of the design firm’s previous customer still on the paperwork.

Recommendations

Throughout this process, Uptime Institute worked to collect and synthesize the feedback and potential solutions to chronic communication problems between these two constituencies. The following best practices will improve management and communication throughout project planning and development, with a lasting positive effect on the operations lifecycle.

Pre-Design Phase
All communities that participated in OvD discussions understood the need to unite stakeholders throughout the project and the importance of reviewing documentation and tracking changes along the way. Owners and designers also agreed on the need to invest time and budget in pre-design, specifically including documenting the IT Capacity Plan with near-term, mid-term, and long-term scenarios.

The owners and designers also agreed on the importance of building Facilities Technical Requirements that are responsive to the IT Capacity Plan and include essential project parameters:

  • Capacity (initial and ultimate)
  • Tier(s)
  • Redundancy
  • Density
  • Phased implementation strategy
  • Configuration preferences
  • Technology preferences
  • Operations requirements
  • Level of innovation
  • Energy efficiency objectives
  • Computer Room Master Plan

Workshop Computer Room Master Plans with IT, Facilities, Corporate Real Estate, Security, and other stakeholders, and then incorporate them into the Facilities Technical Requirements. After preparing the Facilities Technical Requirements, invite key stakeholders to ratify the document. This recommendation does not prohibit changes later but provides a basis of understanding and a launch point for the project. Following ratification, brief the executive (or board). This and subsequent briefings can provide the appropriate forum for communicating not only the costs associated with various design alternatives but also how they deliver business value.

RFP and Hiring
Provide as much detail about project requirements as possible in the RFP, including an excerpt of the Facilities Technical Requirements as well as technology and operations preferences and requirements. This allows respondents to begin to understand the project and to respond with their most relevant experience. Also, given that many RFPs compel some level of at-risk design work, a detailed RFP will best guide this qualification period and facilitate the choice of the right design firm. Inclusion of details in the RFP does not prohibit the design from changing during its development and implementation.

Negotiate in person as much as possible. Owners regretted not spending more time with the design firm(s) before a formal engagement, as misalignments only became evident once it was too late. Also, multiple owners remarked with pride that they had walked out of a negotiation at least once. This demonstrated their own commitment to their projects and set a tone of consequences and accountability for poor or insufficient communication.

Assess and score the culture of the design firms for alignment with the owner’s preferred mode and tone of operations. Owners commented that they preferred a small, local design firm: this may require some additional investment in training, but they were confident they would receive more careful and closer attention in return.

Notify the design engineer from the outset of specific requirements and indicators of success to pre-empt receiving a generic or reconstituted design.

Should the owner engage an outside consultant, avoid setting an aggressive tone for that consultant. Owners may want to augment their internal team with a trusted advisor, yet this role can inadvertently result in the consultant acting as a guard dog rather than focusing on collaboration and facilitation.

Design and Subsequent Phases
Owners and designers agreed that a design effort is a management challenge rather than a technical one. An active and engaged owner yields a more responsive and operable design. Owners who viewed design as outsourcing the production/fabrication of a data center struggled with the resulting solution. The following recommendations will reduce surprises during or after the project.

Success was defined not as a discrete number of meetings or reports, but as contingent upon establishing and managing a communication system. Key components of this system include the following:

  • Glossary of terms: Stakeholders will have varying experience or expertise, and some terms may be foreign or misconceived. A glossary of terms establishes a consistent vocabulary, encourages questions, and builds common understanding.
  • List of stakeholders: Stakeholders may vary, but identifying the ‘clients’ of the data center helps to establish and maintain accountability.
  • Document all changes: The owner must be able to evidence the circumstances and reasons behind any changes. Changes are a natural aspect of a complex data center project, but knowing what decision was made and why is key to setting expectations and to successful operation of the data center.
  • Notify stakeholders of changes to IT Capacity Plans, Facilities Technical Requirements, and design documents: This helps executive and non-technical stakeholders feel engaged without disrupting the project flow and allows the project stakeholders to provide accurate and timely answers when decisions are questioned during or after the project.

As the recommendations from the OvD initiative were compiled, many of them resonated with Uptime Institute guidance of years past. More than 10 years ago, Ken Brill and Pitt Turner held seminars on project governance that touched upon a number of the items herein. It is an old problem, but just as relevant today.


Key Quotes from the Design Community

Owners want to design to Tier III, but they want to pay for Tier II and get Tier IV performance.

Owners want technologies or designs that don’t work in their region or budget.

The IT people are not at the table, and engineers don’t have adequate opportunity to understand their requirements. Designers are often trying to meet the demands of an absent, remote, or shielded IT client who lives in a state of constant crisis.

Once the project is defined, it’s in the hands of the general contractor and commercial real estate group. Intermediaries may not have data center experience, and engineers aren’t in direct contract with the end user anymore.


Industry Perspectives

Chris Crosby, CEO, Compass Datacenters

There are some days when I’d like to throw architects and engineers off the roof. They don’t read their own documents, for example, putting in boilerplate that has nothing to do with the current project in a spec. They can also believe that they know better than the owner—making assumptions and changes independent of what you have clearly told them on paper that you want. It drives me nuts because as an owner you may not catch it until it has cost you a hundred grand, since it just gets slipsheeted into some detail or RFI response with no communication back to you.

 

Dennis R. Julian, PE, ATD, Principal, Integrated Design Group, Inc.

Data center designs are detail oriented. Missing a relatively minor item (e.g., a control circuit) could result in shutting down the IT equipment. When schedules are compressed, it is more difficult and requires more experienced design personnel to flesh out the details, do the analysis, and provide options with recommendations for a successful solution.

There are pressures to stick with proven designs when:

  • Fees are low. Standard designs are used so that less experienced personnel may be assigned to meet budgets.
  • Schedules are compressed. Reusing existing designs and minimizing options and analysis speed up completion of the design.

Good design saves capital and operating costs over the life of the facility, and those savings vastly dwarf any savings in design fees. Selecting designers based on qualifications rather than fees, similar to the Brooks Act regulating the selection of engineers by the U.S. federal government (Public Law 92-582, 92nd Congress, H.R. 12807, October 27, 1972), and allowing reasonable schedules will permit discussion of the client’s goals and needs and the time to review alternatives for the most cost-effective solution based on total cost of ownership.




Julian Kudritzki joined the Uptime Institute in 2004 and currently serves as Chief Operating Officer. He is responsible for the global proliferation of Uptime Institute Standards. He has supported the founding of Uptime Institute offices in numerous regions, including Brasil, Russia, and North Asia. He has collaborated on the development of numerous Uptime Institute publications, education programs, and unique initiatives such as Server Roundup and FORCSS. He is based in Seattle, WA.

 

 

Matt Stansberry is director of Content and Publications for the Uptime Institute and also serves as program director for the Uptime Institute Symposium, an annual spring event that brings together 1,500 stakeholders in enterprise IT, data center facilities, and corporate real estate to deal with the critical issues surrounding enterprise computing. He was formerly Editorial Director for TechTarget’s Data Center and Virtualization media group, and was managing editor of Today’s Facility Manager magazine. He has reported on the convergence of IT and Facilities for more than a decade.