Saudi Aramco’s Cold Aisle Containment Saves Energy

Oil exploration and drilling require HPC

By Issa A. Riyani and Nacianceno L. Mendoza

Saudi Aramco’s Exploration and Petroleum Engineering Computer Center (ECC), a three-story data center built in 1982 in Dhahran, Kingdom of Saudi Arabia, provides computing capability to the company’s geologists, geophysicists, and petroleum engineers, enabling them to explore, develop, and manage Saudi Arabia’s oil and gas reserves. Transitioning the facility from mainframes to rack-mounted servers was just the first of several transitions that challenged the IT organization over the last three decades. More recently, Saudi Aramco reconfigured the legacy data center to a Cold Aisle/Hot Aisle arrangement, increasing rack densities from 3 kilowatts per rack (kW/rack) in 2003 to 8 kW/rack and nearly doubling capacity. To further increase efficiency, Saudi Aramco also sealed openings around and under the computer racks, cooling units, and the computer power distribution panel, in addition to blanking unused rack space.

The use of computational fluid dynamics (CFD) simulation software to manage the hardware deployment process enabled Saudi Aramco to increase the total number of racks and rack density in each data hall. Saudi Aramco used the software to analyze various proposed configurations prior to deployment, eliminating the risk of trial and error.

In 2015 one of the ECC’s five data halls was modified to accommodate a Cold Aisle Containment System. This installation supports the biggest single deployment so far in the data center, 124 racks of high performance computers (HPC) with a total power demand of 994 kW. As a result, the data hall now hosts 219 racks on a 10,113-square-foot (940-square-meter) raised floor. To date, the data center hall has not experienced any temperature problems.
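For readers who want to check the arithmetic, the short Python sketch below derives the average rack density and floor power density from the figures quoted in this article; it is illustrative only and not a Saudi Aramco tool.

```python
# Cross-check of the deployment figures quoted in the article (illustrative only).
hpc_power_kw = 994        # total power demand of the HPC deployment
hpc_racks = 124           # racks in that deployment
floor_area_m2 = 940       # raised-floor area (10,113 sq ft)
hall_it_load_kw = 1084    # total IT load cited for the data hall

avg_hpc_density = hpc_power_kw / hpc_racks                   # ~8.0 kW/rack
floor_density_w_m2 = hall_it_load_kw * 1000 / floor_area_m2  # ~1,150 W/m^2

print(f"Average HPC rack density: {avg_hpc_density:.1f} kW/rack")
print(f"Floor power density: {floor_density_w_m2:.0f} W/m^2")
```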

Business Drivers

Increasing demand by ECC customers requiring the deployment of IT hardware and software technology advances necessitated a major reconfiguration in the data center. Each new configuration increased the heat that needed to be dissipated from the ECC. At each step, several measures were employed to mitigate potential impact to the hardware, ensuring safety and reliability during each deployment and project implementation.

For instance, Saudi Aramco developed a hardware deployment master plan based on a projected life cycle and refresh rate of 3–5 years to transition to the Cold Aisle/Hot Aisle configuration. This plan allows for advance planning of space and power source allocation with no compromise to existing operation as well as fund allocation and material procurement (see Figures 1 and 2).

Figure 1. Data center configuration (current day)

Figure 2. Data center plan view (master plan)

Because of the age of the building and its construction methodology, the company’s engineering and consulting department was asked to evaluate the building structure based on the initial master plan. This department determined the maximum weight capacity of the building structure, which was used to establish the maximum rack weight to avoid compromising structural stability.

In addition, the engineering and consulting department evaluated the chilled water pipe network and determined the maximum number of cooling units to be deployed in each data hall, based on the maximum allowable chilled water flow. Similarly, the department determined the total heat to be dissipated per Hot Aisle to optimize the heat rejection capability of the cooling units. The department also determined the amount of heat to be dissipated per rack to ensure sufficient cooling per the manufacturer’s recommendation.
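The chilled water constraint follows from the standard sensible-heat relation Q = ṁ·cp·ΔT. The Python sketch below illustrates that calculation for the 125-kW per-Hot-Aisle limit cited in this article; the 6 K chilled water temperature rise is an assumed value for illustration, not a Saudi Aramco design figure.

```python
# Illustrative chilled water flow estimate for one Hot Aisle's heat load.
# Q = m_dot * cp * dT  ->  m_dot = Q / (cp * dT)
heat_load_kw = 125      # per-Hot-Aisle heat limit cited in this article
cp_water = 4.186        # kJ/(kg*K), specific heat of water
delta_t_k = 6.0         # assumed chilled water temperature rise (not an Aramco figure)

m_dot_kg_s = heat_load_kw / (cp_water * delta_t_k)   # ~5.0 kg/s
flow_l_s = m_dot_kg_s / 0.998                        # water density ~0.998 kg/L
print(f"Approximate chilled water flow: {flow_l_s:.1f} L/s "
      f"(~{flow_l_s * 15.85:.0f} US gpm)")
```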

Subsequently, facility requirements based on these limiting factors were shared with the technology planning team and IT system support. The checklist includes maximum weight, rack dimensions, and the requirement for blanking panels and sealing technologies to prevent air mixing.

Other features of the data center include:

  • A 1.5-foot (ft) [0.45 meter (m)] raised floor containing chilled water supply and return pipes for the CRAH units, cable trays for network connectivity, sweet water line for the humidifier, liquid-tight flexible conduits for power, and computer power system (CPS) junction boxes
  • A 9-ft (2.8 m) ceiling height
  • False ceilings
  • Down-flow chilled water computer room air handling (CRAH) units
  • CRAH units located at the end of each Hot Aisle
  • Perforated floor tiles (56% open with manually controlled dampers)
  • No overhead obstructions
  • Total data center heat load of 1,200 kW
  • Total IT load of 1,084 kW, which is constant for all three models
  • Sealed cable penetrations (modeled at 20% leakage)

The 42U cabinets in the ECC have solid sides and tops, with 64% perforated front and rear doors on each cabinet. Each is 6.5 ft high by 2 ft wide by 3.61 ft deep (2 m by 0.6 m by 1.10 m) and weighs 1,874 pounds (850 kilograms). Rack density ranges from 6.0–8.0 kW. The total nominal cooling capacity is 1,582 kW from 25 18-ton computer room air conditioning (CRAC) units.
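As a quick consistency check, one refrigeration ton is roughly 3.517 kW, so 25 units of 18 tons each come to approximately 1,582 kW, in line with the quoted figure. The sketch below compares that capacity with the 1,200-kW data hall heat load; it is a simple illustration, not a substitute for the CFD analysis described next.

```python
# Quick consistency check of the quoted cooling capacity (illustrative only).
TON_TO_KW = 3.517             # 1 refrigeration ton is approximately 3.517 kW
units, tons_per_unit = 25, 18
total_heat_load_kw = 1200     # total data hall heat load from the article

capacity_kw = units * tons_per_unit * TON_TO_KW      # ~1,582 kW nominal
margin_kw = capacity_kw - total_heat_load_kw
print(f"Nominal cooling capacity: {capacity_kw:,.0f} kW")
print(f"Margin over heat load: {margin_kw:.0f} kW "
      f"(~{margin_kw / (tons_per_unit * TON_TO_KW):.1f} units' worth of spare capacity)")
```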

Modeling Software

In 2007, Saudi Aramco commissioned the CFD modeling software company to prepare baseline models for all the data halls. The software is capable of performing transient analysis, which suits the company’s requirements. The company uses the modeling software to simulate proposed hardware deployments, investigate deployment scenarios, and identify any stranded capacity. The modeling company developed several simulations based on different hardware iterations of the master plan to help establish the final hardware master plan, with each 16-rack Hot Aisle not exceeding a 125-kW heat load and no rack exceeding 8 kW. After the modeling software company completed the initial iterations, Saudi Aramco acquired a perpetual license and support contract for the CFD simulation software in January 2010.
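Those master-plan limits lend themselves to a simple programmatic check when vetting a proposed layout. The Python sketch below is a hypothetical illustration of such a check, not Saudi Aramco's planning tool; the example rack loads are invented.

```python
# Hypothetical check of a proposed Hot Aisle layout against the master-plan limits.
MAX_AISLE_KW = 125
MAX_RACKS_PER_AISLE = 16
MAX_RACK_KW = 8

def aisle_ok(rack_loads_kw):
    """Return True if a list of per-rack loads (kW) satisfies all three limits."""
    return (len(rack_loads_kw) <= MAX_RACKS_PER_AISLE
            and all(load <= MAX_RACK_KW for load in rack_loads_kw)
            and sum(rack_loads_kw) <= MAX_AISLE_KW)

proposed = [7.5] * 16                 # 16 racks at 7.5 kW each = 120 kW total
print(aisle_ok(proposed))             # True
print(aisle_ok([8.5] + [7.0] * 15))   # False: one rack exceeds the 8-kW limit
```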

Saudi Aramco finds that the CFD simulation software makes it easier to identify and address heat stratification, recirculation, and even short-circuiting of cool air. By identifying the issues in this way, Saudi Aramco was able to take several precautionary measures and improve its capacity management procedures, including increasing cooling efficiency and optimizing load distribution.

Temperature and Humidity Monitoring System

With the CFD software simulation results at hand, the facilities management team looked for other means to gather data for use in future cooling optimization simulations while validating the results of CFD simulations. As a result, the facilities management group decided to install a temperature and humidity monitoring system. The initial deployment was carried out in 2008, with the monitoring of subfloor air supply temperature and hardware entering temperature.

At that time, three sensors were installed in each Cold Aisle for a total of six sensors. The sensors were positioned at each end of the row and in the middle, at the highest point of the racks. Saudi Aramco chose these points to get a better understanding of the temperature variance (∆T) between the subfloor and the highest rack inlet temperature. Additionally, Saudi Aramco uses this data to monitor and ensure that all inlet temperatures are within the ranges recommended by ASHRAE and the manufacturer.
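A minimal sketch of the kind of check this sensor data supports appears below. The 18–27°C (64.4–80.6°F) envelope is ASHRAE's published recommended inlet range for most IT equipment classes; the sensor names, readings, and subfloor temperature are invented for illustration.

```python
# Minimal illustration: flag rack inlet temperatures outside the ASHRAE
# recommended envelope (18-27 C) and report the subfloor-to-inlet delta-T.
ASHRAE_LOW_C, ASHRAE_HIGH_C = 18.0, 27.0

subfloor_supply_c = 16.5                      # example subfloor supply reading
inlet_readings_c = {"aisle1_end_a": 19.2,     # hypothetical sensor readings
                    "aisle1_mid": 21.8,
                    "aisle1_end_b": 27.9}

for sensor, temp in inlet_readings_c.items():
    delta_t = temp - subfloor_supply_c
    status = "OK" if ASHRAE_LOW_C <= temp <= ASHRAE_HIGH_C else "ALERT"
    print(f"{sensor}: inlet {temp:.1f} C, dT {delta_t:.1f} K -> {status}")
```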

The real-time temperature and humidity monitoring system enabled the operations and facility management teams to monitor and document unusual and sudden temperature variances, allowing proactive responses and early resolution of potential cooling issues. The monitoring system gathers data that can be used to validate the CFD simulations and for further evaluation and iteration.

The Prototyping

The simulation models identified stratification, short circuiting, and recirculation issues in the data halls, which prompted the facilities management team to develop more optimization projects, including a containment system. In December 2008, a prototype was installed in one of the Cold Aisles (see Figure 3) using ordinary plastic sheets as refrigerated doors and Plexiglass sheets on an aluminum frame. Saudi Aramco monitored the resulting inlet and core temperatures using the temperature and humidity monitoring system and internal system monitors prior to, during, and upon completion of installation to ensure no adverse effect on the hardware. The prototype was observed over the course of three months with no reported hardware issues.

Figure 3. Prototype containment system

Following the successful installation of the prototype, various simulation studies were further conducted to ensure the proposed deployment’s benefit and savings. In parallel, Saudi Aramco looked for the most suitable materials to comply with all applicable standards, giving prime consideration to the safety of assets and personnel and minimizing risk to IT operations.

Table 1. Installation dates

When the Cold Aisle was contained, Saudi Aramco noticed considerable improvement in the overall environment. Containment improved cold air distribution by preventing hot air from mixing with the supply air from the subfloor, so that air temperature at the front of the servers was close to the subfloor supply temperature. With cooler air entering the hardware, core temperatures improved markedly, resulting in lower exhaust and return air temperatures at the cooling units. As a result, the data hall was able to support more hardware.

Material Selection and Cold Aisle Containment System Installation

From 2009 to 2012, the facility management team evaluated and screened several products. It secured and reviewed the material data sheets and submitted them to the Authority Having Jurisdiction (AHJ) for evaluation and concurrence. Each of the solutions would have required some modifications to the facility before being implemented. The facility management team evaluated and weighed the impact of these modifications as part of the procurement process.

Of all the products, one stood out from the rest: its easy-to-install, transparent material not only addressed safety but also eliminated the need for modifications to the existing infrastructure, which translated into considerable savings in both project execution time and money.

Movement in and out of the aisle is easy and safe as people can see through the doors and walls. Additionally, the data hall lighting did not need to be modified since it was not obstructed. Even the fire suppression system was not affected since it has a fusible link and lanyard connector. The only requirement by AHJ prior to deployment was additional smoke detectors in the Cold Aisle itself.

To comply with this requirement, an engineering work order was raised for the preparation of the necessary design package for the modification of the smoke detection system. After completing the required design package including certification from a chartered fire protection engineer as mandated by the National Fire Protection Association (NFPA), it was established that four smoke detectors were to be relocated and an additional seven smoke detectors installed in the data hall.

Implementation and Challenges

Optimizations and improvements always come with challenges; the reconfiguration process necessitated close coordination among the technology planning team, IT system support, ECC customers, the network management group, Operations, and facility management. These teams had to identify hardware that could be decommissioned without impacting operations, prepare temporary spaces for interim operations, and then move the decommissioned hardware out of the data hall, allowing the immediate deployment of new hardware in the Cold Aisle/Hot Aisle configuration. Succeeding deployments followed the master plan, allowing the complete realignment to be finished within five years.

Installation of the Cold Aisle Containment System did not come without challenges; all optimization activities, including relocating luminaires that were in the way of the required smoke detectors, had to be completed with no impact to system operations. To meet this requirement, ECC followed a strict no work permit–no work procedure; work permits are countersigned by the operations management staff on duty at issuance and prior to closeout. This enabled close monitoring of all activities within the data halls, ensuring safety and no impact to daily operation and hardware reliability. Additionally, a strict change management documentation process was adhered to by the facility management team and monitored by operations management staff; all activities within the data halls have to undergo a change request approval process.

Operations management and facility management worked hand in hand to overcome these challenges. Operations management, working in three shifts, closely monitored the implementation process, especially after regular working hours. Continuous coordination between contractors, vendors, operation staff, and facility management team enabled smooth transition and project implementations eliminating any showstoppers along the way.

Summary

The simulation comparison in Figure 4 clearly shows the benefits of the Cold Aisle Containment System. Figure 4a shows hot air recirculating around the end of the rows and mixing with the cold air supply to the Cold Aisles. In Figure 4b, mixing of hot and cold air is considerably reduced with the installation of the 14 Cold Aisle containment systems. The Cold Aisles are better defined and clearly visible in the figures, with less hot air recirculation, but the three rows without containment still suffer from recirculation. In Figure 4c, the Cold Aisles are far better defined, and hot air recirculation and short circuiting are reduced. Additionally, the exhaust air temperature from the hardware has dropped considerably.

Figure 4a. Without Cold Aisle Containment

Figure 4b. With current Cold Aisle Containment (14 of 17 aisles)

Figure 4c. With full Cold Aisle Containment

Figures 5–11 show that the actual power and temperature readings taken from the sensors installed in the racks validated the simulation results. As shown in Figures 5 and 6, the power draw of the racks in Aisles 1 and 2 fluctuates while the corresponding entering and leaving temperatures were maintained. In Week 40, the temperature even dropped slightly despite a slight increase in the power draw. The same can also be observed in Figures 7 and 8. All these aisles are fitted with a Cold Aisle Containment System.

Figure 5. Actual power utilization, entering temperature, and leaving temperature, Aisle 1 (installed on July 30, 2015 – week 31)

Figure 6. Aisle 2 (installed on July 28, 2015 – week 31)

Figure 7. Aisle 3 (installed on March 7, 2015 – week 10)

Figure 8. Aisle 6 (installed on April 09, 2015 – week 15)

Figure 9. Aisle 7a (a and b) and Aisle 7b (c and d)

Figure 10. Aisle 8 (installed on February 28, 2015 – week 09)

Figure 11. Aisle 17 (no Cold Aisle Containment installed)

Additionally, Figure 11 clearly shows slightly higher entering and leaving temperatures as well as fluctuations in the temperature readings that coincide with the power draw fluctuations of the racks within the aisle. This aisle has no containment.

The installation of the Cold Aisle Containment System greatly improved the overall cooling environment of the data hall (see Figure 12). Eliminating hot and cold air mixing and short circuiting allowed for more efficient cooling unit performance and cooler supply and leaving air. Return air temperature readings in the CRAH units were also monitored and sampled in Figure 12, which shows the actual return air temperature variance as a result of the improved overall data hall room temperature.

Figure 12. Computer room air handling unit actual return air temperature graphs

Figure 13. Cold Aisle floor layout

The installation of the Cold Aisle Containment System allows the same data hall to host the company’s MAKMAN and MAKMAN-2 supercomputers (see Figure 14). Both MAKMAN and MAKMAN-2 appear on the June 2015 Top500 Supercomputers list.

Figure 14. Installed Cold Aisle Containment System


Issa Riyani

Issa A. Riyani joined the Saudi Aramco Exploration Computer Center (ECC) in January 1993. He graduated from King Fahd University of Petroleum and Minerals (KFUPM) in Dhahran, Kingdom of Saudi Arabia, with a bachelor’s degree in electrical engineering. Mr. Riyani currently leads the ECC Facility Planning & Management Group and has more than 23 years of experience managing ECC facilities.

Nacianceno L. Mendoza

Nacianceno L. Mendoza joined the Saudi Aramco Exploration Computer Center (ECC) in March 2002. He holds a bachelor of science in civil engineering and has more than 25 years of diverse experience in project design, review, construction management, supervision, coordination and implementation. Mr. Mendoza spearheaded the design and implementation of the temperature and humidity monitoring system and deployment of Cold Aisle Containment System in the ECC.


Data-Driven Approach to Reduce Failures

Operations teams use the Uptime Institute Network’s Abnormal Incident Reports (AIRs) database to enrich site knowledge, enhance preventative maintenance, and improve preparedness

By Ron Davis

The AIRs system is one of the most valuable resources available to Uptime Institute Network members. It comprises more than 5,000 data center incidents and errors spanning two decades of site operations. Using AIRs to leverage the collective learning experiences of many of the world’s leading data center organizations helps Network members improve their operating effectiveness and risk management.

A quick search of the database, using various parameters or keywords, turns up invaluable documentation on a broad range of facility and equipment topics. The results can support evidence-based decision making and operational planning to guide process improvement, identify gaps in documentation and procedures, refine training and drills, benchmark against successful organizations, inform purchasing decisions, fine-tune preventive maintenance (PM) programs to minimize failure risk, help maximize uptime, and support financial planning.

THE VALUE OF AIRs IN OPERATIONS
The philosopher, essayist, poet, and novelist George Santayana wrote, “Those who cannot remember the past are doomed to repeat it.” Using records of past data center incidents, errors, and outages can help inform operational practices to help prevent future incidents.

All Network member organizations participate in the AIRs program, ensuring a broad sample of incident information from data center organizations representing diverse sizes, business sectors, and geographies. The database contains records of data center facility infrastructure incidents and outages for a period beginning in 1994 and continuing through the present. This volume of incident data allows for meaningful and extremely valuable analysis of trends and emerging patterns. Annually, Uptime Institute presents aggregated results and analysis of the AIRs database, spotlighting issues from the year along with both current and historical trends.

Going beyond the annual aggregate trend reporting, there is also significant insight to be gained from looking at individual incidents. Detailed incident information is particularly relevant to front-line operators, helping to inform key facility activities including:

• Operational documentation creation or improvement

• Planned maintenance process development or improvement

• Training

• PM program improvement

• Purchasing

• Effective practices, failure analysis, lessons learned

AIRs reporting is confidential and subject to a non-disclosure agreement (NDA), but the following hypothetical case study illustrates how AIRs information can be applied to improve an organization’s operations and effectiveness.

USING THE AIRs DATA IN OPERATIONS: CASE STUDY
A hypothetical “Site X” is installing a popular model of uninterruptible power supply (UPS) modules.

The facility engineer decides to research equipment incident reports for any useful information to help the site prepare for a smooth installation and operation of these critical units.

Figure 1. The page where members regularly go to submit an AIR. The circled link takes you directly to the “AIR Search” page.

Figure 2. The AIR Search page is the starting point for Abnormal Incident research. The page is structured to facilitate broad searches but includes user friendly filters that permit efficient and effective narrowing of the desired search results.

The facility engineer searches the AIRs database using specific filter criteria (see Figures 1 and 2), looking for any incidents within the last 10 years (2005-2015) involving the specific manufacturer and model where there was a critical load loss. The database search returns seven incidents meeting those criteria (see Figure 3).
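Although the AIRs search runs through the members' web interface, the filtering logic is easy to picture in code. The Python sketch below applies the same kind of query to a small list of hypothetical incident records; the field names and data are invented for illustration and are not the actual AIRs schema.

```python
# Hypothetical reproduction of an AIRs-style filter: incidents from 2005-2015
# for a given UPS manufacturer/model where critical load was lost.
incidents = [
    {"year": 2011, "manufacturer": "VendorA", "model": "UPS-X",
     "critical_load_loss": True,  "keywords": ["fan", "bypass"]},
    {"year": 2009, "manufacturer": "VendorA", "model": "UPS-X",
     "critical_load_loss": False, "keywords": ["capacitor"]},
    {"year": 2003, "manufacturer": "VendorA", "model": "UPS-X",
     "critical_load_loss": True,  "keywords": ["breaker"]},
]

def search(records, manufacturer, model, start, end, load_loss=None):
    """Filter incident records by vendor, model, year range, and load-loss flag."""
    return [r for r in records
            if r["manufacturer"] == manufacturer and r["model"] == model
            and start <= r["year"] <= end
            and (load_loss is None or r["critical_load_loss"] == load_loss)]

hits = search(incidents, "VendorA", "UPS-X", 2005, 2015, load_loss=True)
print(len(hits), "matching incident(s)")   # 1 in this toy data set
```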

Figure 3. The results page of our search for incidents within the last 10 years (2005-2015) involving a specific manufacturer/model where there was a critical load loss. We selected the first result returned for further analysis.

Figure 4. The overview page of the abnormal incident report selected for detailed analysis.

The first incident report on the list (see Figure 4) reveals that the unit involved was built in 2008. A ventilation fan failed in the unit (a common occurrence for UPS modules of any manufacturer/model). Replacing the fan required technicians to implement a UPS maintenance bypass, which qualifies as a site configuration change. At this site, vendor personnel were permitted to perform site configuration changes. The UPS vendor technician was working in concert with one of the facility’s own operations engineers but was not being directly supervised (observed) at the time the incident occurred; he was out of the line of sight.

Social scientist Brené Brown said, “Maybe stories are just data with a soul.” If so, the narrative portion of each report is where we find the soul of the AIR (Figure 5). Drilling down into the story (Description, Action Taken, Final Resolution, and Synopsis) reveals what really happened, how the incident played out, and what the site did to address any issues. The detailed information found in these reports offers the richest value that can be mined for current and future operations. Reviewing this information yields insights and cautions and points toward prevention and solution steps that can be taken to avoid (or respond to) a similar problem at other sites.

Figure 5. The detail page of the abnormal incident report selected for analysis. This is where the “story” of our incident is told.

In this incident, the event occurred when the UPS vendor technician opened the output breaker before bringing the module to bypass, causing a loss of power and dropping the load. This seemingly small but crucial error in communication and timing interrupted critical production operations—a downtime event.

Back-up systems and safeguards, training, procedures, and precautions, detailed documentation, and investments in redundant equipment—all are in vain the moment there is a critical load loss. The very rationale for the site having a UPS was negated by one error. However, this site’s hard lesson can be of use if other data center operators learn from the mistake and use this example to shore up their own processes, procedures, and incident training. Data center operators do not have to witness an incident to learn from it; the AIRs database opens up this history so that others may benefit.

As the incident unfolded, the vendor quickly reset the breaker to restore power, as instructed by the facility technician. Subsequently, to prevent this type of incident from happening in the future, the site:

• Created a more detailed method of procedure (MOP) for UPS maintenance

• Placed warning signs near the output breaker

• Placed switch tags at breakers

• Instituted a process improvement that now requires the presence of two technicians, an MOP supervisor and an MOP performer, with both technicians required to verify each step

These four steps are Uptime Institute-recommended practices for data center operations. However, this narrative raises the question of how many sites have actually made the effort to follow through on each of these elements, checked and double-checked, and drilled their teams to respond to an incident like this. Today’s operators can use this incident as a cautionary tale to shore up efforts in all relevant areas: operational documentation creation/improvement, planned maintenance process development/improvement, training, and PM program improvement.

Operational Documentation Creation/Improvement
In this hypothetical, the site added content and detail to its MOP for UPS maintenance. This can inspire other sites to review their UPS bypass procedures to determine if there is sufficient content and detail.

The consequences of having too little detail are obvious. Having too much content can also be a problem if it causes a technician to focus more on the document than on the task.

The AIR in this hypothetical did not say whether facility staff followed an emergency operating procedure (EOP), so there is not enough information to say whether they handled it correctly. This event may never happen in this exact manner again, but anyone who has been around data centers long enough knows that UPS output breakers can and will fail in a variety of unexpected ways. All sites should examine their EOP for unexpected failure/trip of UPS output breaker.

In this incident, the technician reclosed the breaker immediately, which is an understandable human reaction in the heat of the moment. However, this was probably not the best course of action. Full system start-up and shutdown should be orderly affairs, with IT personnel fully informed, if not present as active participants. A prudent EOP might require recording the time of the incident, following an escalation tree, gathering white space data, and confirming redundant equipment status, along with additional steps before undertaking a controlled, fully scripted restart.

Another response to failure was the addition of warning signs and improved equipment labeling as improvements to the facility’s site configuration procedures (SCPs). This change can motivate other sites to review their nomenclature and signage. Some sites include a document that gives expected or steady state system/equipment information. Other sites rely on labeling and warning signs or tools like stickers or magnets located beside equipment to indicate proper position. If a site has none of these safeguards in place, then assessment of this incident should prompt the site team to implement them.

Examining AIRs can provide specific examples of potential failure points, which can be used by other sites as a checklist of where to improve policies. The AIRs data can also be a spur to evaluate whether site policies match practices and ensure that documented procedures are being followed.

Planned Maintenance Process Improvement
After this incident, the site that submitted the AIR incident report changed its entire methodology for performing procedures. Now two technicians must be present, each with strictly defined roles: one technician reads the MOP and supervises the process, and the second technician verifies, performs, and confirms. Both technicians must sign off on the proper and correct completion of the task. It is unclear whether there was a change in vendor privileges.

When reviewing AIRs as a learning and improvement tool, facilities teams can benefit by implementing measures that are not already in place, as well as any improvements they determine they would make if a similar incident occurred at their own site. For example, a site may conclude that configuration changes should be reserved only for those individuals who:

• Have a comprehensive understanding of site policy

• Have completed necessary site training

• Have more at stake for site performance and business outcomes

Training
The primary objective of site training is to increase adherence to site policy and knowledge of effective mission critical facility operations. Incorporating information gleaned from AIRs analysis helps maximize these benefits. Training materials should be geared to ensure that technicians are fully qualified to apply their skills and abilities to operate the installed infrastructure within a mission critical environment, not to certify electricians or mechanics. In addition, training provides an opportunity for professional development and interdisciplinary education among the operations team, which can help enterprises retain key personnel.

The basic components of an effective site-training program are an instructor, scheduled class times that can be tracked by student and instructor, on-the-job training (OJT), reference material, and a metric(s) for success.

With these essentials in place, the documentation and maintenance process improvement steps derived from the AIR incident report can be applied immediately for training. Newly optimized SOPs/MOPs/EOPs can be incorporated into the training, as well as process improvements such as the two-person rule. Improved documentation can be a training reference and study material, and improved SCPs will reduce confusion during OJT and daily rounds. Training drills can be created directly from real-world incidents, with outcomes not just predicted but also chronicled from actual events. Trainer development is enhanced by the involvement of an experienced technician in the AIR review process and creation of any resulting documentation/process improvement.

Combining AIRs data with existing resources enables sites to take a systematic approach to personnel training, as in the following example (a minimal tracking sketch follows the list):

1. John Doe is an experienced construction electrician who was recently hired. He needs UPS bypass training.

2. Jane Smith is a facility operations tech/operating engineer with 10 years of experience as a UPS technician. She was instrumental in the analysis of the AIRs incident and consequent improvements in the UPS bypass procedures and processes; she is the site’s SME in this area.

3. Using a learning management system (LMS) or a simple spreadsheet, John Doe’s training is scheduled.

• Scheduled Class: UPS bypass procedure discussion and walk-through

• Instructor: Jane Smith

• Student: John Doe

• Reference material: the new and improved UPS BYPASS SOP XXXX_20150630, along with the EOP and SCP

• Metric might include:

o Successful simulation of procedure as a performer

o Successful simulation of procedure as a supervisor

o Both of the above

o Successful completion of procedure during a PM activity

o Success at providing training to another technician
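The tracking sketch referenced above is shown below; the class structure, field names, and completion metrics are illustrative assumptions, not an actual LMS schema.

```python
# Minimal sketch of tracking a training assignment in code rather than a
# spreadsheet/LMS; names, course, and metrics mirror the example above.
from dataclasses import dataclass, field

@dataclass
class TrainingRecord:
    student: str
    instructor: str
    course: str
    references: list = field(default_factory=list)
    metrics_passed: dict = field(default_factory=dict)   # metric name -> bool

record = TrainingRecord(
    student="John Doe",
    instructor="Jane Smith",
    course="UPS bypass procedure discussion and walk-through",
    references=["UPS BYPASS SOP XXXX_20150630", "EOP", "SCP"],
)
record.metrics_passed["simulation as performer"] = True
record.metrics_passed["simulation as supervisor"] = False   # still pending

complete = all(record.metrics_passed.values())
print(f"{record.student}: training complete = {complete}")
```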

Drills for both new trainees and seasoned personnel are important. Because an AIRs-based training exercise is drawn from an actual event, not an imaginary scenario, it lends greater credibility to the exercise and validates the real risks. Anyone who has led a team drill has probably encountered that one participant who questions the premise of a procedure or suggests a different one. Far from being a roadblock to effective drills, such a participant is proving to be actively engaged and can contribute to program improvement by helping to create drills and assess AIRs scenarios.

PM Program Improvement
The goal of any PM program is to prevent the failure of equipment. The incident detailed in the AIR incident report was triggered by a planned maintenance event, a UPS fan replacement. Typically, a fan replacement requires that the system be put on bypass, as do annual PM procedures. Since any change of equipment configuration, such as changing a fan, introduces risk, it is worth asking whether predictive/proactive fan replacement performed during PM makes more sense than waiting for a fan to fail. The risk of configuration change must be weighed against the risk of inaction.

Figure 6. Our incident was not caused by UPS fan failure, but occurred as a result of human error during its replacement. So how many AIRs involving fan failures for our manufacturer/model of UPS exist within our database? Figure 6 shows the filters we chose to obtain this information.

Examining this and similar incidents in the AIRs database yields information about UPS fan life expectancy that can be used to develop an “evidence-based” replacement strategy. Start by searching the AIRs database for the keyword “fan” using the same dates, manufacturer, and model criteria, with no filter for critical load loss (see Figure 6). This search returns eight reports with fan failure (see Figure 7). The data show that the average life span of the units with fan failure was 5.5 years. The limited sample size means that this result should not be relied on, but this experience at other sites can help guide planning. Less restrictive search criteria can return even more specific data.
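The life-expectancy figure is simply the mean of the unit ages at failure across the returned reports. The sketch below illustrates that calculation with invented ages, since the underlying AIRs records are member-confidential.

```python
# Illustrative calculation of mean UPS fan service life from incident reports.
# Ages at failure are hypothetical; the real AIRs data are member-confidential.
ages_at_fan_failure_years = [4.0, 5.0, 5.5, 5.5, 6.0, 6.0, 6.0, 6.0]

mean_life = sum(ages_at_fan_failure_years) / len(ages_at_fan_failure_years)
print(f"Mean fan service life: {mean_life:.1f} years across "
      f"{len(ages_at_fan_failure_years)} reports")
# A site might schedule proactive replacement somewhat before this mean age,
# weighing the configuration-change risk discussed above against fan failure risk.
```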

Figure 7. A sample of the results (showing 10 of 25 results reports) returned from the search described in Figure 6. Analysis of these incidents may help us to determine and develop the best strategy for cooling fan replacement.

Additional Incidents Yield Additional Insight
The initial database search at the start of this hypothetical case returned seven AIRs in total. What can we learn from the other six? Three of the remaining six reports involved capacitor failures. At one site, the capacitor was 12 years old, and the report noted, “No notification provided of the 7-year life cycle by the vendor.” Another incident occurred in 2009 and involved a capacitor with a 2002 manufacture date, which would match (perhaps coincidentally) a 7-year life cycle. The third capacitor failure was in a 13-year-old piece of equipment, and the AIR notes that it was “outside of 4–5-year life cycle.” These results highlight the importance of having an equipment/component life-cycle replacement strategy. The AIRs database is a great starting point.

A fourth AIR describes a driver board failure in a 13-year-old UPS. Driver board failure could fall into any of the AIR root-cause types. Examples of insufficient maintenance might include a case where the maintenance performed was limited in scope and did not consider end of life. Perhaps there was no procedure to diagnose equipment for a condition or measurement indicative of component deterioration, or maybe the maintenance frequency was insufficient. Without further data in the report it is hard to draw an actionable insight, but the analysis does raise several important topics for discussion regarding the status of a site’s preventive and predictive maintenance approach. A fifth AIR involves an overload condition resulting from flawed operational documentation. The lesson there is obvious.

The last of the six reports describes a lightning strike that made it through the UPS and interrupted the critical load. Other sites might want to check transient voltage surge suppressor (TVSS) integrity during rounds. With approximately 138,000,000 lightning strikes per year worldwide, any data center can be hit. A site can implement an EOP that dictates checking summary alarms, ensuring redundant equipment integrity, performing a facility walk-through by space priority, and providing an escalation tree with contact information.

Each of the AIRs casts light on the types of shortfalls and gaps that can be found in even the most capably run facilities. With data centers made up of vast numbers of components and systems operating in a complex interplay, it can be difficult to anticipate and prevent every single eventuality. AIRs may not provide the most definitive information on equipment specifications, but assessing these incidents provides an opportunity for other sites to identify potential risks and plan how to avoid them.

PURCHASING/EQUIPMENT PROCUREMENT DECISIONS
In addition to the operational uses described above, AIRs information can also support effective procurement. However, as with using almost any type of aggregated statistics, one should be cautious about making broad assumptions based on the limited sample size of the AIRs database.

For example, a search using simply the keywords ‘fan failure’ and ‘UPS’ could return 50 incidents involving Vendor A products and five involving Vendor B’s products (or vice versa). This does not necessarily mean that Vendor A has a UPS fan failure problem; the higher count could simply reflect Vendor A’s larger installed base and market share.
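That caution can be made concrete by normalizing incident counts against an estimate of each vendor's installed base before comparing. The numbers in the sketch below are invented purely to illustrate the point.

```python
# Raw incident counts are misleading without normalizing by installed base.
# All figures here are invented for illustration.
raw_incidents = {"VendorA": 50, "VendorB": 5}
estimated_installed_units = {"VendorA": 20000, "VendorB": 1000}

for vendor, count in raw_incidents.items():
    rate = count / estimated_installed_units[vendor] * 1000
    print(f"{vendor}: {count} incidents, "
          f"~{rate:.1f} reported incidents per 1,000 installed units")
# VendorA: 2.5 per 1,000; VendorB: 5.0 per 1,000, so the comparison reverses.
```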

Further, one must be careful of jumping to conclusions regarding manufacturing defects. For example, the first AIR incident report made no mention of how the UPS may (or may not) have been designed to help mitigate the risk of the incident. Some UPS modules have HMI (human machine interface) menu-driven bypass/shutdown procedures that dictate action and provide an expected outcome indication. These equipment options can help mitigate the risk of such an event but may also increase the unit cost. Incorporating AIRs information as just one element in a more detailed performance evaluation and cost-benefit analysis will help operators accurately decide which unit will be the best fit for a specific environment and budget.

LEARNING FROM FAILURES
If adversity is the best teacher, then every failure in life is an opportunity to learn, and that certainly applies in the data center environment and other mission critical settings. The value of undertaking failure analysis and applying lessons learned to continually develop and refine procedures is what makes an organization resilient and successful over the long term.

To use an example from my own experience, I was working one night at a site when the operations team was transferring the UPS to maintenance bypass during PM. Both the UPS output breaker and the UPS bypass breaker were in the CLOSED position, and they were sharing the connected load. The MOP directed personnel to visually confirm that the bypass breaker was closed and then directed them to open the UPS output breaker. The team followed these steps as written, but the critical load was dropped.

Immediately, the team followed EOP steps to stabilization. Failure analysis revealed that the breaker had suffered internal failure; although the handle was in the CLOSED position, the internal contacts were not closed. Further analysis yielded a more detailed picture of events. For instance, the MOP did not require verification of the status of the equipment. Maintenance records also revealed that the failed breaker had passed primary injection testing within the previous year, well within the site-required 3-year period. Although meticulous compliance with the site’s maintenance standards had eliminated negligence as a root cause, the operational documentation could have required verification of critical component test status as a preliminary step. There was even a dated TEST PASSED sticker on the breaker.

Indeed, eliminating key gaps in the procedures would have prevented the incident. As stated, the breaker appeared to be in the closed position as directed, but the team had not followed the load during switching activities (i.e., had not confirmed the transfer of the amperage to the bypass breaker). Had we done so, we would have seen the problem and initiated a back-out of the procedure. Subsequently, these improvements were added to the MOP.

FLASH REPORTS
Flash reports are a particularly useful AIRs service because they provide early warning about incidents identified as immediate risks, with root causes and fixes to help Network members prevent a service interruption. These reports are an important source of timely front-line risk information.

For example, searching the AIRs database for any FLASH AIR since 2005 involving one popular UPS model returns two results. Both reports detailed a rectifier shutdown caused by faulty trap filter components; the vendor consequently performed a redesign and recommended a replacement strategy. The FLASH report mechanism became a crucial channel for communicating the manufacturer’s recommendation to equipment owners. Receiving a FLASH notification can spur a team to check maintenance records and consult with trusted vendors to ensure that manufacturer bulletins or suggested modifications have been addressed.

When FLASH incidents are reported, Uptime Institute’s AIRs program team contacts the manufacturer as part of its validation and reporting process. Uptime Institute strives for and considers its relationships with OEMs (original equipment manufacturers) to be cooperative, rather than confrontational. All parties understand that no piece of complex equipment is perfect, so the common goal is to identify and resolve issues as quickly and smoothly as possible.

CONCLUSION
It is virtually impossible for an organization’s site culture, procedures, and processes to be so refined that there are no details left unaddressed and no improvements that can be made. There is also a need to beware of hidden disparities between site policy and actual practice. Will a team be ready when something unexpected does go wrong? Just because an incident has not happened yet does not mean it will not happen. In fact, if a site has not experienced an issue, complacency can set in; steady state can get boring. Operators with foresight will use AIRs as opportunities to create drills and get the team engaged with troubleshooting and implementing new, improved procedures.

Instead of trying to ferret out gaps or imagine every possible failure, the AIRs database provides a ready source of real-world incidents to draw from. Using this information can help hone team function and fine tune operating practices. Technicians do not have to guess at what could happen to equipment but can benefit from the lessons learned by other sites. Team leaders do not have to just hope that personnel are ready to face a crisis; they can use AIRs information to prepare for operating eventualities and to help keep personnel responses sharp.

AIRs is much more than a database; it is a valuable tool for raising awareness of what can happen, mitigating the risk that it will happen, and for preparing an operations team for when/if it does happen. With uses that extend to purchasing, training, and maintenance activities, the AIRs database truly is Uptime Institute Network members’ secret weapon for operational success.


Ron Davis

Ron Davis

Ron Davis is a Consultant for Uptime Institute, specializing in Operational Sustainability. Mr. Davis brings more than 20 years of experience in mission critical facility operations in various roles supporting data center portfolios, including facility management, management and operations consultant, and central engineering subject matter expert. Mr. Davis manages the Uptime Institute Network’s Abnormal Incident Reports (AIRs) database, performing root-cause and trending analysis of data center outages and near outages to improve industry performance and vigilance.

Reduce Data Center Insurance Premiums

Uptime Institute President Lee Kirby and Stephen Douglas, Risk Control Director for CNA, an insurance and risk control provider for the software and IT services industry, recently coauthored an article for Data Center Knowledge: Lowering Your Data Center’s Exposure to Insurance Claims. In this follow-on, Kirby discusses how companies can reduce insurance premiums by providing insurance providers with an Uptime Institute Tier Certification of Operational Sustainability or Management & Operations Stamp of Approval.

Uptime Institute has provided data center expertise for more than 20 years to mission-critical and high-reliability data centers. It has identified a comprehensive set of evidence-based methods, processes, and procedures at both the management and operations level that have been proven to dramatically reduce data center risk, as outlined in the Tier Standard: Operational Sustainability.

Organizations that apply and maintain the Standard are taking the most effective actions available to protect their investment in infrastructure and systems and to reduce the risk of costly incidents and downtime. The elements outlined in the Standard have been developed from the industry’s most comprehensive database of information about real-world data center incidents, errors, and failures: Uptime Institute’s Abnormal Incident Reports (AIRs) database. Many of the key Standards elements are based on analysis of 20 years of AIRs data collected on thousands of data center incidents, pinpointing causes and contributing factors. The Standards focus on specific behaviors and criteria that have been proven to decrease the risk of downtime.

To assess and validate whether a data center organization is meeting this operating Standard, Uptime Institute administers the industry’s leading operations certifications. These independent, third-party credentials signify that a data center is managed and operated in a manner that will reduce risk and support availability. There are two types of operations credentials:

Tier Certification of Operational Sustainability (TCOS) is for organizations that have been designed and built to meet Tier Topology criteria. Earning a TCOS credential signifies that a data center upholds the most stringent criteria for quality, consistency, and risk prevention in its facility and operations.

The Management & Operations (M&O) Stamp of Approval is for any existing data center that does not have Tier Certification for Design and Construction. The M&O assessment evaluates management, staffing, and procedures independent of topology, and ensures that the facility is being operated to maximize the uptime potential and minimize the risks to the existing infrastructure.

Both credentials are based on the same rigorous Standards for data center operations management, with detailed behaviors and factors that have been shown to impact availability and performance. The Standards encompass all aspects of data center planning, policies and procedures, staffing and organization, training, maintenance, operating conditions, and disaster preparedness. Earning one of these credentials demonstrates to all stakeholders that a data center is following the principles of effective operations and is being managed with transparency following industry best practices.

The process for a data center to receive either TCOS or the M&O Stamp of Approval includes review of each facility’s policies and documentation, but also includes on-site inspections and live demonstrations to verify that critical systems, backups, and procedures are effective—not just on paper but in daily practice. It’s analogous to putting a vehicle operator through a live driving test before issuing a license. These credentials offer the only comprehensive risk assessment in the data center industry, zeroing in on the risk factors that are the most critical.

The data center environment is never static; continuous review of performance metrics and vigilant attention to changing operating conditions is vital. The data center environment is so dynamic that if policies, procedures, and practices are not revisited on a regular basis, they can quickly become obsolete. Even the best procedures implemented by solid teams are subject to erosion. Staff may become complacent, or bad habits begin to creep in.

Just as ‘good driver’ discounts use an individual’s track record as a reliable indicator of good ongoing behaviors (such as effective maintenance and safe driving habits), periodic data center recertification (biannually at a minimum) provides a key indicator of ongoing effective facility management and operational best practices. Uptime Institute’s data center credentials have built-in expiry periods, with reassessment required at regular intervals.

There is tremendous value for organizations that hold themselves to a consistent set of standards over time, evaluating, fine tuning, and retraining on a routine basis. This discipline creates resiliency, ensuring that maintenance and operations procedures are appropriate and effective, and that teams are prepared to respond to contingencies, prevent errors, and keep small issues from becoming large problems.

Insurance is priced competitively based on the insurer’s assessment of the exposure presented. Data center operations credentials provide the consistent benchmarking of an unbiased third-party review that can be used by service providers at all levels of the data supply chain to demonstrate the quality of the organization’s risk management efforts. This demonstration of risk quality allows infrastructure and service providers to obtain more competitive terms and pricing across their insurance programs.

When data centers obtain the relevant Uptime Institute credential, the result is a level of expert scrutiny unmatched in the industry, giving insurance companies the risk management proof they need. Insurers can validate risk levels against a consistent set of reliable Standards. As a result, facilities with good operations, as validated by TCOS or the M&O Stamp of Approval, can benefit from reduced insurance costs. When a data center has a current certification, underwriters can be assured that it has withstood the rigorous evaluation of an unbiased third party, meets globally recognized Standards, and that its management has taken effective steps to maintain uninterrupted performance and mitigate the risk of loss.

Identifying Lurking Vulnerabilities in the World’s Best-Run Data Centers

Peer-based critiques drive continuous improvement, identify lurking data center vulnerabilities

By Kevin Heslin

Shared information is one of the distinctive features of the Uptime Institute Network and its activities. Under non-disclosure agreements, Network members not only share information, but they also collaborate on projects of mutual interest. Uptime Institute facilitates the information sharing and helps draw conclusions and identify trends from the raw data and information submitted by members representing industries such as banking and finance, telecommunications, manufacturing, retail, transportation, government, and colocation. In fact, information sharing is required of all members.

As a result of their Network activities, longtime Network members report reduced frequency and duration of unplanned downtime in their data centers. They also say that they’ve experienced enhanced facilities and IT operations because of ideas, proven solutions, and best practices that they’ve gleaned from other members. In that way, the Network is more than the sum of its parts. Obvious examples of exclusive Network benefits include the Abnormal Incident Reports (AIRs) database, real-time Flash Reports, email inquiries, and peer-to-peer interactions at twice-annual Network conferences. No single enterprise or organization would be able to replicate the benefits created by the collective wisdom of the Network membership.

Perhaps the best examples of shared learning are the data center site tours (140 tours through the fall of 2015) held in conjunction with Network conferences. During these in-depth tours of live, technologically advanced data centers, Network members share their experiences, hear about vendor experiences, and gather new ideas—often within the facility in which they were first road tested. Ideas and observations generated during the site tours are collated during detailed follow-up discussions with the site team. Uptime Institute notes that both site hosts and guests express high satisfaction with these tours. Hosts remark that visitors raised interesting and useful observations about their data centers, and participants witness new ideas in action.

Rob Costa, Uptime Institute North America Network Director, has probably attended more Network data center tours than anyone else, first as a senior manager for The Boeing Co. and then as Network Director. In addition, Boeing and Costa have hosted two tours of the company’s facilities since Boeing joined the Network in 1997. As a result, Costa is very knowledgeable about what happens during site tours and how Network members benefit.

“One of the many reasons Boeing joined the Uptime Institute Network was the opportunity to visit world-class, mission-critical data centers. We learned a lot from the tours and after returning from the conferences, we would meet with our facility partners and review the best practices and areas of improvement that we noted during the tour,” said Costa.

“We also hosted two Network tours at our Boeing data centers. The value of hosting a tour was the honest and thoughtful feedback from our peer Network members. We focused on the areas of improvement noted from the feedback sessions,” he said.

Recently, Costa noted the improvement in the quality of the data centers hosting tours, as Network members have had the opportunity to make improvements based on the lessons learned from participating in the Network. He said, “It is all about continuous improvement and the drive for zero downtime.” In addition, he has noted an increased emphasis on safety and physical security.

Fred Dickerman, Uptime Institute, Senior Vice President, Management Services, who also hosted a tour of DataSpace facilities, said, “In preparing to host a tour you tend to look at your own data center from a different point of view, which helps you to see things which get overlooked in the day to day. Basically you walk around the data center asking yourself, ‘How will others see my data center?’

Prior to the tour, Network members study design and engineering documents for the facility to get an understanding of the site’s topology.

“A manager who has had problems at a data center with smoke from nearby industrial sites entering the makeup air intakes will look at the filters on the host’s data center and suggest improvement. Managers from active seismic zones will look at your structure. Managers who have experienced a recent safety incident will look at your safety procedures, etc.” Dickerman’s perspective summarizes why normally risk-averse organizations are happy to have groups of Network members in their data centers.

Though tour participants have generated thousands of comments since 1994, when the first tour was held, recent data suggest that more work remains to improve facilities. In 2015, Network staff collated all the areas of improvement suggested by participants in the data center tours since 2012. In doing so, Uptime Institute counted more than 300 unique comments in 15 loosely defined categories, with airflow management, energy efficiency, labeling, operations, risk management, and safety meriting the most attention. The power/backup power categories also received many comments. Although categories such as testing, raised floor issues, and natural disasters did not receive many comments, some participants had special interest in these areas, which made the results of the tours all the more comprehensive.

Uptime Institute works with tour hosts to address all security requirements. All participants have signed non-disclosure agreements through the Network.


Dickerman suggested that the comments highlighted the fact that even the best-managed data centers have vulnerabilities. He highlighted comments related to human action (including negligence due to employee incompetence, lack of training, and procedural or management error) as significant. He also pointed out that some facilities are vulnerable because they lack contingency and emergency response plans.

Site tours last two hours and cover the raised floor, the site’s command center, switchgear, UPS systems, batteries, generators, electrical distribution, and cooling systems.


He notes that events become failures when operators respond too slowly, respond incorrectly, or don’t respond at all, whether the cause of the event is human action or natural disaster. “In almost every major technological disaster, subsequent analysis shows that timely, correct response by the responsible operators would have prevented or minimized the failure. The same is true for every serious data center failure I’ve looked at,” Dickerman said.

It should be emphasized that a lot of the recommendations deal with operations and processes, things that can be corrected in any data center. It is always nice to talk about the next data center “I would build,” but the reality is that not many people will have that opportunity. Everyone, however, can improve how they operate the site.

For example, tour participants continue to find vulnerabilities in electrical systems, perhaps because any electrical system problem may appear instantaneously, leaving little or no time to react. In addition, tour participants also continue to focus on safety, training, physical security, change management, and the presence of procedures.

In recent years, energy efficiency has become more of an issue. Related to this is the nagging sense that Operations is not making full use of management systems to provide early warning about potential problems. In addition, most companies are not using an interdisciplinary approach to improving IT efficiency.

Dickerman notes that changes in industry practices and regulations explain why comments tend to cluster in the same categories year after year. Tour hosts are very safety conscious and tend to be very proud of their safety records, but new U.S. Occupational Safety and Health Administration (OSHA) regulations limiting hot work, for example, increase pressure on facility operators to implement redundant systems that allow for the shutdown of electrical systems to enable maintenance to be performed safely. Tour participants can share experiences about how to effectively and efficiently develop appropriate procedures to track every piece of IT gear and ensure that the connectivity of the gear is known and that it is installed and plugged in correctly.

Of course, merely creating a list of comments after a walk-through is not necessarily helpful to tour hosts. Each Network tour concludes with a discussion in which comments are compiled and reviewed. Most importantly, Uptime Institute staff moderate these discussions as comments are evaluated and the rationales for construction and operations decisions are explained. These discussions ensure that all ideas are vetted for accuracy and that the expertise of the full group is tapped before a comment gets recorded.

Finally, Uptime Institute moderators prepare a final report for use by the tour host, so that the most valid ideas can be implemented.

Pitt Turner, Uptime Institute Executive Director Emeritus, notes that attending or hosting tours is not sufficient by itself, “There has to be motivation to improve. And people with a bias toward action do especially well. They have the opportunity to access technical ideas without worrying about cost justifications or gaining buy in. Then, when they get back to work, they can implement the easy and low-cost ideas and begin to do cost justifications on those with budget impacts.”

TESTIMONIALS

One long-time Network member illustrates that site tours are a two-way street of continuous improvement by telling two stories, separated by several years, from different perspectives. “In 1999, I learned two vitally important things during a facility tour at Company A. During that tour my team saw that Company A color coded both its electrical feeds and CEVAC (command, echo, validate, acknowledge, control), which is a process that minimizes the chance for errors when executing procedures. We still color code in this way to this day.”

Years later the same Network member hosted a data center tour and learned an important lesson during the site tour. “We had power distribution unit (PDU) breakers installed in the critical distribution switchgear,” he said. “We had breaker locks on the breaker handles that are used to open and close the breakers to prevent an accidental trip. I thought we had protected our breakers well, but I hadn’t noticed a very small red button at the bottom of the breaker that read ‘push to trip’ under it. A Network member brought it to my attention during a tour. I was shocked when I saw it. We now have removable plastic covers over those buttons.”


Kevin Heslin


Kevin Heslin is Chief Editor and Director of Ancillary Projects at Uptime Institute. In these roles, he supports Uptime Institute communications and education efforts. Previously, he served as an editor at BNP Media, where he founded Mission Critical, a commercial publication dedicated to data center and backup power professionals. He also served as editor at New York Construction News and CEE and was the editor of LD+A and JIES at the IESNA. In addition, Heslin served as communications manager at the Lighting Research Center of Rensselaer Polytechnic Institute. He earned a B.A. in Journalism from Fordham University in 1981 and a B.S. in Technical Communications from Rensselaer Polytechnic Institute in 2000.



Ignore Data Center Water Consumption at Your Own Peril

Will drought dry up the digital economy? With water scarcity a pressing concern, data center owners are re-examining water consumption for cooling.

By Ryan Orr and Keith Klesner

In the midst of a historic drought in the western U.S., 70% of California experienced “extreme” drought in 2015, according to the U.S. Drought Monitor.

The state’s governor issued an Executive Order requiring a 25% reduction in urban water usage compared to 2013. The Executive Order also authorizes the state’s Water Resources Control Board to implement restrictions on individual users to meet overall water savings objectives. Data centers, in large part, do not appear to have been impacted by new restrictions. However, there is no telling what steps may be deemed necessary as the state continues to push for savings.

The water shortage is putting a premium on the existing resources. In 2015, water costs in the state increased dramatically, with some customers seeing rate increases as high as 28%. California is home to many data centers, and strict limitations on industrial use would dramatically increase the cost of operating a data center in the state.

The problem in California is severe and extends beyond the state’s borders.

Population growth and climate change will create additional demand for water, so the problem of water scarcity is not going away, and it will not be limited to California or even the western U.S.; it is a global issue.

On June 24, 2015, The Wall Street Journal published an article focusing on data center water usage, “Data Centers and Hidden Water Use.” With the industry still facing environmental scrutiny over carbon emissions, and water poised to be the next resource to come under public examination, IT organizations need to have a better understanding of how data centers consume water, the design choices that can limit water use, and the IT industry’s ability to address this issue.

HOW DATA CENTERS USE WATER
Data centers generally use water to aid heat rejection (i.e., cooling IT equipment). Many data centers use a water-cooled chilled water system, which distributes cool water to computer room cooling units. A fan blows across the chilled water coil, providing cool, conditioned air to IT equipment. That water then flows back to the chiller and is recooled.

Figure 1. Photo of traditional data center cooling tower

Water-cooled chiller systems rely on a large box-like unit called a cooling tower to reject heat collected by this system (see Figure 1). These cooling towers are the main culprits for water consumption in traditional data center designs. Cooling towers cool warm condenser water from the chillers by pulling ambient air in from the sides, which passes over a wet media, causing the water to evaporate. The cooling tower then rejects the heat by blowing hot, wet air out of the top. The cooled condenser water then returns back to the chiller to again accept heat to be rejected. A 1-megawatt (MW) data center will pump 855 gallons of condenser water per minute through a cooling tower, based on a design flow rate of 3 gallons per minute (GPM) per ton.

Figure 2. Cooling towers “consume” or lose water through evaporation, blow down, and drift.


Cooling towers “consume” or lose water through evaporation, blow down, and drift (see Figure 2). Evaporation is caused by the heat actually removed from the condenser water loop. Typical design practice allows evaporation to be estimated at 1% of the cooling tower water flow rate, which equates to 8.55 GPM in a fairly typical 1-MW system. Blow down describes the replacement cycle, during which the cooling tower dumps condenser water to eliminate minerals, dust, and other contaminants. Typical design practices allow for blow down to be estimated at 0.5% of the condenser water flow rate, though this could vary widely depending on the water treatment and water quality. In this example, blow down would be about 4.27 GPM. Drift describes the water that is blown away from the cooling tower by wind or from the fan. Typical design practices allow drift to be estimated at 0.005%, though poor wind protection could increase this value. In this example, drift would be practically negligible.

In total, a 1-MW data center using traditional cooling methods would use about 6.75 million gallons of water per year.
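The arithmetic behind these figures is straightforward. The following is a minimal sketch of the calculation, assuming the common conversion of roughly 3.517 kW of heat rejection per ton of cooling (a value not stated in the article); the 3 GPM per ton design flow rate and the 1%, 0.5%, and 0.005% loss factors come from the text.

# Hedged sketch of the cooling tower water math described above.
KW_PER_TON = 3.517      # assumed kW of heat per ton of refrigeration (not from the article)
GPM_PER_TON = 3.0       # condenser water design flow rate (from the article)

it_load_kw = 1000                                 # 1-MW data center
tons = it_load_kw / KW_PER_TON                    # ~284 tons
condenser_flow_gpm = tons * GPM_PER_TON           # ~853 GPM; the article rounds to 855

evaporation_gpm = 0.01 * condenser_flow_gpm       # ~8.5 GPM (1% of flow)
blow_down_gpm = 0.005 * condenser_flow_gpm        # ~4.3 GPM (0.5% of flow)
drift_gpm = 0.00005 * condenser_flow_gpm          # negligible (0.005% of flow)

minutes_per_year = 60 * 24 * 365
annual_gallons = (evaporation_gpm + blow_down_gpm + drift_gpm) * minutes_per_year
print(f"{annual_gallons / 1e6:.2f} million gallons per year")   # ~6.75 million

Under these assumptions the sketch reproduces the article’s figures: roughly 855 GPM of condenser water flow and about 6.75 million gallons of makeup water per year for a 1-MW IT load.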

CHILLERLESS ALTERNATIVES
Many data centers are adopting new chillerless cooling methods that are more energy efficient and use less water than the chiller and cooling tower combinations. These technologies still reject heat to the atmosphere using cooling towers. However, chillerless cooling methodologies incorporate an economizer that utilizes outdoor air, which means that water is not evaporated all day long or even every day.

Some data centers use direct air cooling, which introduces outside air to the data hall, where it directly cools the IT gear without any conditioning. Christian Belady, Microsoft’s general manager for Data Center Services, once demonstrated the potential of this method by running servers for long periods in a tent. Climate and, more importantly, an organization’s willingness to accept the risk of IT equipment failure due to fluctuating temperatures and airborne particulate contamination limit the use of this unusual approach. The majority of organizations that use this method do so in combination with other cooling methods.

Direct evaporative cooling employs outside air that is cooled by a water-saturated medium or via misting. A blower circulates this air to cool the servers (see Figure 3). This approach, while more common than direct outside air cooling, still exposes IT equipment to risk from outside contaminants from external events like forest fires, dust storms, agricultural activity, or construction, which can impair server reliability. These contaminants can be filtered, but many organizations will not tolerate a contamination risk.

Figure 3. Direct evaporative vs. indirect evaporative cooling


Some data centers use what is called indirect evaporative cooling. This process uses two air streams: a closed-loop air supply for IT equipment and an outside air stream that cools the primary air supply. The outside (scavenger) air stream is cooled using direct evaporative cooling. The cooled secondary air stream goes through a heat exchanger, where it cools the primary air stream. A fan circulates the cooled primary air stream to the servers.

WATERLESS ALTERNATIVES
Some existing data center cooling technologies do not evaporate water at all. Air-cooled chilled water systems do not include evaporative cooling towers. These systems are closed loop and do not use makeup water; however, they are much less energy efficient than nearly all the other cooling options, which may offset any water savings of this technology. Air-cooled systems can be fitted with water sprays to provide evaporative cooling that increases capacity and/or cooling efficiency, but this approach is somewhat rare in data centers.

The direct expansion (DX) computer room air conditioner (CRAC) system includes a dry cooler that rejects heat via an air-to-refrigerant heat exchanger. These types of systems do not evaporate water to reject heat. Select new technologies pair this equipment with a pumped refrigerant economizer that makes the unit capable of cooling without the use of the compressor. The resulting compressorless system does not evaporate water to cool air either, which improves both water and energy efficiency. Uptime Institute has seen these technologies operate at power usage effectiveness (PUE) values of approximately 1.40, even while in full DX cooling mode, and they meet California’s strict Title 24 Building Energy Efficiency Standards.

Table 1. Energy, water, and resource costs and consumption compared for generic cooling technologies.


Table 1 compares a typical water-cooled chiller system to an air-cooled chilled water system in a 1-MW data center, assuming that the water-cooled chiller plant operates at a PUE of 1.6 and the air-cooled chiller plant operates at a PUE of 1.8, with an electric rate of $0.16/kilowatt-hour (kWh) and a water rate of $6/unit, with one unit defined as 748 gallons.

The table shows that although air-cooled chillers do not consume any water, they can still cost more to operate over the course of a year because water, even though a potentially scarce resource, is still relatively cheap for data center users compared to power. It is crucial to evaluate the potential trade-offs between energy and water during the design process. This analysis does not include the upstream costs or resource consumption associated with water production and energy production; however, these should also be weighed carefully against a data center’s sustainability goals.
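For readers who want to reproduce the comparison, here is a minimal sketch under the stated assumptions (PUEs of 1.6 and 1.8, $0.16/kWh, $6 per 748-gallon unit, and the roughly 6.75 million gallons per year estimated earlier). It treats PUE as constant year-round and ignores maintenance and capital costs, which the actual table may account for differently.

# Hedged sketch of the annual energy-plus-water cost trade-off for a 1-MW IT load.
HOURS_PER_YEAR = 8760
ELECTRIC_RATE = 0.16       # $/kWh (from the article)
WATER_RATE = 6.0 / 748     # $/gallon; $6 per 748-gallon unit (from the article)
IT_LOAD_KW = 1000

def annual_cost(pue, annual_gallons):
    # Total facility energy approximated as IT load x PUE, held constant all year.
    energy_kwh = IT_LOAD_KW * pue * HOURS_PER_YEAR
    return energy_kwh * ELECTRIC_RATE + annual_gallons * WATER_RATE

water_cooled = annual_cost(pue=1.6, annual_gallons=6.75e6)   # ~$2.30 million
air_cooled = annual_cost(pue=1.8, annual_gallons=0)          # ~$2.52 million
print(f"water-cooled: ${water_cooled:,.0f}, air-cooled: ${air_cooled:,.0f}")

Even with zero water cost, the air-cooled plant’s extra energy (about 1.75 million kWh per year at these assumptions, or roughly $280,000) outweighs the approximately $54,000 annual water bill of the water-cooled plant.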

LEADING BY EXAMPLE

Some prominent data centers using alternative cooling methods include:

• Vantage Data Centers’ Quincy, WA, site uses Munters Indirect Evaporative Cooling systems.

• Rackspace’s London data center and Digital Realty’s Profile Park facility in Dublin use roof-mounted indirect outside air technology coupled with evaporative cooling from ExCool.

• A first phase of Facebook’s Prineville, OR, data center uses direct evaporative cooling and humidification. Small nozzles attached to water pipes spray a fine mist across the air pathway, cooling the air and adding humidity. In a second phase, Facebook uses a dampened media.

• Yahoo’s upstate New York data center uses direct outside air cooling when weather conditions allow.

• Metronode, a telecommunications company in Australia, uses direct air cooling (as well as direct evaporative and DX for backup) in its facilities.

• Dupont Fabros is utilizing recycled gray water for cooling towers in its Silicon Valley and Ashburn, VA, facilities. The municipal gray water supply saves on water cost, reduces water treatment for the municipality, and reuses a less precious form of water.

Facebook reports that its Prineville cooling system uses 10% of the water of a traditional chiller and cooling tower system. ExCool claims that its system requires roughly 260,000 gallons annually in a 1-MW data center, about 3.3% of traditional data center water consumption, and data centers using pumped refrigerant systems consume even less water: none at all. These companies save water by eliminating evaporative technologies or by combining evaporative technologies with outside air economizers, meaning that they do not have to evaporate water 24×7.

DRAWBACKS TO LOW WATER COOLING SYSTEMS
These cooling systems can cost much more than traditional cooling systems. At current rates for water and electricity, return on investment (ROI) on these more expensive systems can take years to achieve. Compass Datacenters recently published a study showing the potential negative ROI for an evaporative cooling system.

These systems also tend to take up a lot of space. For many data centers, water-cooled chiller plants make more sense because an owner can pack a large-capacity system into a relatively small footprint without modifying the building envelope.

There are also implications for data center owners who want to achieve Tier Certification. Achieving Concurrently Maintainable Tier III Constructed Facility Certification requires the isolation of each and every component of the cooling system without impact to the design day cooling temperature. This means an owner needs to be able to tolerate the shutdown of cooling units, control systems, makeup water tanks and distribution, and heat exchangers. Fault Tolerance (Tier IV) requires the system to sustain operations without impact to the critical environment after any single, consequential event. While Uptime Institute has Certified many data centers that use newer cooling designs, these designs do add a level of complexity to the process.

Organizations also need to factor temperature considerations into their decision. If a data center is not prepared to run its server inlet air temperature at 22 degrees Celsius (72 degrees Fahrenheit) or higher, there is little payback on the extra investment because the potential for economization is reduced. Also, companies need to improve their computer room management, including optimizing airflow for efficient cooling and potentially adding containment, which can drive up costs. Additionally, some of these cooling systems simply won’t work in hot and humid climates.

As with any newer technology, alternative cooling systems present operations challenges. Organizations will likely need to implement new training to operate and maintain unfamiliar equipment configurations. Companies will need to conduct particularly thorough due diligence on new, proprietary vendors entering the mission critical data center space for the first time.

And last, there is significant apathy about water conservation across the data center industry as a whole. Uptime Institute survey data shows that less than one-third of data center operators track water usage or use the water usage effectiveness (WUE) metric. Furthermore, Uptime Institute’s 2015 Data Center Industry Survey found (see The Uptime Institute Journal, vol. 6, p. 60) that data center operators ranked water conservation as a low priority.
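For reference, WUE as commonly defined by The Green Grid is annual site water use in liters divided by annual IT equipment energy in kilowatt-hours. A minimal sketch for the 1-MW example above, assuming that definition and the traditional cooling tower consumption estimated earlier:

# Hedged WUE sketch, assuming The Green Grid definition:
# WUE = annual site water use (liters) / annual IT equipment energy (kWh).
LITERS_PER_GALLON = 3.785
annual_water_liters = 6.75e6 * LITERS_PER_GALLON   # ~25.5 million liters
annual_it_energy_kwh = 1000 * 8760                 # 1-MW IT load running all year
wue = annual_water_liters / annual_it_energy_kwh
print(f"WUE = {wue:.1f} L/kWh")                    # ~2.9 L/kWh

Tracking this single ratio over time is a simple way for an operator to see whether cooling changes are actually reducing water intensity.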

But the volumes of water or power used by data centers make them easy targets for criticism. While there are good reasons to choose traditional water-cooled chilled water systems, especially when dealing with existing buildings, for new data center builds, owners should evaluate alternative cooling designs against overall business requirements, which might include sustainability factors.

Uptime Institute has invested decades of research toward reducing data center resource consumption. The water topic needs to be assessed within a larger context such as the holistic approach to efficient IT described in Uptime Institute’s Efficient IT programs. With this framework, data center operators can learn how to better justify and explain business requirements and demonstrate that they can be responsible stewards of our environment and corporate resources.

Matt Stansberry contributed to this article.


WATER SOURCES

Data centers can use water from almost any source, with the vast majority of those visited by Uptime Institute using municipal water, which typically comes from reservoirs. Other data centers use groundwater, which is precipitation that seeps down through the soil and is stored below ground. Data center operators must drill wells to access this water. However, drought and overuse are depleting groundwater tables worldwide. The United States Geological Survey has published a resource to track groundwater depletion in the U.S.

Other sources of water include rainfall, gray water, and surface water. Very few data centers use these sources, for a variety of reasons. Because rainfall can be unpredictable, for instance, it is mostly collected and used as a secondary or supplemental water supply. Similarly, only a handful of data centers around the world are sited near lakes, rivers, or the ocean, but those data center operators could pump water from these sources through a heat exchanger. Data centers also sometimes use a body of water as an emergency water source for cooling towers or evaporative cooling systems. Finally, gray water, which is partially treated wastewater, can be utilized as a non-potable water source for irrigation or cooling tower use. These water sources are interdependent and may be in short supply during a sustained regional drought.


Ryan Orr


Ryan Orr joined Uptime Institute in 2012 and currently serves as a senior consultant. He performs Design and Constructed Facility Certifications, Operational Sustainability Certifications, and customized Design and Operations Consulting and Workshops. Mr. Orr’s work in critical facilities includes serving as project engineer on major upgrades for legacy enterprise data centers, space planning for the design and construction of multiple new data center builds, and data center M&O support.

Keith Klesner


Keith Klesner is Uptime Institute’s Senior Vice President, North America. Mr. Klesner’s career in critical facilities spans 16 years and includes responsibilities ranging from planning, engineering, design, and construction to start-up and ongoing operation of data centers and mission critical facilities. He has a B.S. in Civil Engineering from the University of Colorado-Boulder and an MBA from the University of LaVerne. He maintains status as a professional engineer (PE) in Colorado and is a LEED Accredited Professional.

Bank of Canada Achieves Operational Excellence

The team approach helped the Bank earn Uptime Institute’s M&O Stamp of Approval

By Matt Stansberry

The Bank of Canada is the nation’s central bank. The Bank acts as the fiscal agent of the Canadian government, managing its public debt programs and foreign exchange reserves and setting its monetary policy. It also designs, issues, and distributes Canada’s bank notes. The Bank plays a critical role in supporting the Canadian government and Canada’s financial system. The organization manages a relatively small footprint of high-criticality data centers.

Over the last several years, the Bank has worked with Uptime Institute to significantly upgrade its data center operations framework, and it has also implemented a cross-disciplinary management team that includes stakeholders from IT, Facilities Management, and Security.  The Bank adopted Uptime Institute’s Integrated Critical Environment (ICE) team concept to enhance the effectiveness of the collaboration and shared accountability framework between the three disciplines.

These efforts paid off when the Bank of Canada received a 93% score on Uptime Institute’s M&O Assessment, surpassing the 80% pass requirement and the 83% median score achieved by approximately 70 other accredited organizations worldwide. This score helped the Bank achieve the M&O Stamp of Approval in October 2015.

ICE Program Project Manager Megan Murphy and ICE Program Chairperson David Schroeter explain the challenges and benefits of implementing a multidisciplinary team approach and earning the M&O Stamp of Approval from Uptime Institute.

Uptime Institute: Uptime Institute has been advocating that companies develop multidisciplinary teams for about a decade. Some leading organizations have deployed this kind of management framework, while many more still struggle with interdisciplinary communication gaps and misaligned incentives. Multidisciplinary teams are a highly effective management structure for continuously improving performance and efficiency, while increasing organizational transparency and collaboration. How has your organization deployed this team structure?

Megan Murphy: The Bank likes to shape things in its own way. Certain disciplines are near and dear to our hearts, like security. So our multidisciplinary approach integrates not just IT and Facilities but also Security and our Continuity of Operations Program.

The integrated team looks after the availability and reliability of the Bank’s critical infrastructure supporting the data center.  The team is the glue that binds the different departments together, with a common objective, same language and terminologies, and unified processes. It ultimately allows us to be more resilient and nimble.

The integrated team is virtual, in that each representative reports to a home position on a day-to-day basis.  The virtual team meets regularly, and representatives have the authority to make decisions on behalf of their individual stakeholder groups.

The team functions like a committee. However, where the term “committee” may sound passive, the Bank’s team functions more like a “super committee with teeth.”

David Schroeter: We come together as a committee to review and discuss changes, incidents, schedules as well as coordinate work flows. It requires a lot of effort from the individuals in these departments because of the rigor of the collaborative process, but it has paid off.

As an example, there was recently a facilities infrastructure issue. As a result of the multidisciplinary framework, we had the right people in the room to identify the immediate risks associated with this issue and to recognize that it had a significant impact on other critical infrastructure. We shifted our focus from a simple facilities repair to consider how that change might affect our overall business continuity and security posture.

This information was then quickly escalated to the Bank’s Continuity of Operations office, which activated the corporate incident management process.

Uptime Institute: It sounds like the collaboration is delivering significant benefits. Why did your organization take this on?

Schroeter: Like other large corporations, our IT and Facilities teams worked within their own organizations, with their own unique perspectives and lenses. We adopted a multidisciplinary approach to bring the stakeholders together and to understand how the things they do every day will inherently impact the other groups. We realized that by not managing our infrastructure using a collaborative team approach, we were incurring needless risk to our operations.

Murphy: This concept is new to the organization, but it reflects a key tenet of the Bank’s overall vision—bringing cross-functional groups together to solve complex issues. That’s really what our team approach does.

Uptime Institute: How did you get started?

Schroeter: We used an iterative approach to develop the program through a working group and an interim committee, looking at interdependencies and overlaps between our departments’ procedures. Gaps in our processes were revisited and addressed as required.

Murphy: We weren’t trying to reinvent things that already existed in the Bank but rather to leverage the best processes, practices, and wisdom from each department. For example, the IT department has a mature change management process in place. We took their change management template and applied it across the entire environment. Similarly we adopted Security’s comprehensive policy suite for governing and managing access control. We integrated these policies into the process framework.

Schroeter: In addition, we expanded the traditional facilities management framework to include processes to mitigate potential cyber security threats.

The multidisciplinary team allows us to continually improve, be proactive, and respond to a constantly changing environment and technology landscape.

Uptime Institute: Why did you pursue the M&O Stamp of Approval?

Murphy: It gave us a goal, a place to shoot for. Also, the M&O Stamp of Approval needs to be renewed every two years, which means we are going to be evaluated on a regular basis. So we have to stay current and continue to build on this established framework.

Schroeter: We needed a structured approach to managing the critical environment. This is not to say that our teams weren’t professional or didn’t have the competencies to do the work. But when challenged on how or why we did things, they didn’t have a consistent response.

To prepare for the M&O Assessment with Uptime Institute Consultant Richard Van Loo, we took a structured approach that encourages us to constantly look for improvements and to plan for long-term sustainability with a shared goal. It’s not just about keeping the wheels from falling off the bus. It’s about being proactive—making sure the wheels are properly balanced so it rolls efficiently.

It’s about always looking ahead for issues rather than letting them happen.

Uptime Institute: Did you have any concerns about how you might score on the M&O Assessment?

Schroeter: We were confident that we were going to pass and pass well. We invested a significant amount of time and effort into creating the framework of the program and worked closely with Richard to help ensure that we continued on the right path as we developed documentation. Although we were tracking in the right direction and believed we would pass the assessment, we were not expecting to achieve such a high score. Ultimately our objective was to establish a robust program with a proper governance structure that could sustain operations into the future.

Uptime Institute: What was the most difficult aspect of the process?

Schroeter: Team members spent a lot of time talking to people at the Bank, advocating for the program and explaining why it matters and why it is important. We drank a lot of coffee as we built the support and relationships necessary to ensure the success of the program.

Uptime Institute: What surprised you about the assessment?

Murphy: Despite the initial growing pains that occur when any team comes together, establishing a collective goal and a sense of trust provided the team with stronger focus and clarity. Even with the day-to-day distractions and priorities and different viewpoints, the virtual team became truly integrated. This integration did not happen serendipitously; it took time, persistence, and a lot of hard work.

Uptime Institute: Did having a multidisciplinary team in place make the M&O Assessment process more or less difficult?

Murphy: The multidisciplinary approach made the M&O Assessment easier. In 2014, at the beginning of the process, there were some growing pains as the group was forming and learning how to come together. But by October 2015 when we scored the M&O Assessment, the group had solidified and team members trusted each other, having gone through the inherent ups and downs of the building process. As a result of having a cohesive team, we were confident that our management framework was strong and comprehensive as we had all collaborated and participated in the structure and its associated processes and procedures.

Having an interdisciplinary team provided us with a structured framework in which to have the open and transparent discussions necessary to drive and meet the objectives of our mandate.

Uptime Institute developed the concept of Integrated Critical Environments teams nearly a decade ago, encouraging organizations to adopt a combined IT-Facilities operating model. This structure became a lynchpin of the M&O Stamp of Approval. During the assessment, the participant’s organizational structure is weighted very heavily to address the dangerous misperception that outages are a result of human error (the individual) when they are statistically a result of inadequate training, resources, or protocols (the organization).


Matt Stansberry


Matt Stansberry is Director of Content and Publications for Uptime Institute and also serves as program director for the Uptime Institute Symposium, an annual event that brings together 1,500 stakeholders in enterprise IT, data center facilities, and corporate real estate to deal with the critical issues surrounding enterprise computing. He was formerly editorial director for TechTarget’s Data Center and Virtualization media group and managing editor of Today’s Facility Manager magazine. He has reported on the convergence of IT and facilities for more than a decade.