Blog Single Author Small

Implementing Data Center Cooling Best Practices

June 25, 2014/in Operations/by Kevin Heslin

Step-by-step guide to data center cooling best practices will help data center managers take greater advantage of the energy savings opportunities available while providing improved cooling of IT systems

The nature of data center temperature management underwent a dramatic evolution when American Society of Heating, Refrigeration, and Air-conditioning Engineers (ASHRAE) adopted new operating temperature guidelines. A range of higher intake temperatures at the IT devices has enabled substantial increases in computer room temperatures and the ability to cool using a range of ‘free cooling’ options. However, recent Uptime Institute research has demonstrated that the practices used in computer rooms to manage basic airflow and the actual adoption of increased temperatures have not kept pace with the early adopters or even the established best practices. This gap means that many site operations teams are not only missing an opportunity to reduce energy consumption but also to demonstrate to executive management that the site operations team is keeping up with the industry’s best practices.

The purpose of this technical paper is to raise awareness of the relatively simple steps that can be implemented to reduce the cost of cooling a computer room. The engineering underlying these simple steps was contained in two Uptime Institute papers. The first, Reducing Bypass Airflow is Essential for Eliminating Computer Room Hotspots, published in 2004, contained substantial research demonstrating the poor practices common in the industry. In a nutshell, the cooling capacity of the units found operating in a large sample of data centers was 2.6 times what was required to meet the IT requirement, well beyond any reasonable level of redundant capacity. In addition, an average of only 40% of the cooling air supplied to the data centers studies was used for cooling IT equipment. The remaining 60% was effectively wasted capacity, required because of mismanagement of the airflow and cooling capacity.

Uptime Institute’s subsequent paper, How to Meet ‘24 by Forever’ Cooling Demands of Your Data Center was published in 2005. This paper contained 27 recommendations that were offered to remedy air management and over capacity conditions.

New research by Upsite Technologies®, which also sponsored research leading to the Uptime Institute’s 2004 publication on bypass airflow, found that the average ratio of operating nameplate cooling capacity in use in a newer, separate and significantly larger sample had increased from 2.6 to 3.9 times the IT requirement. Clearly, the data are still relevant and, disturbingly, this trend is going the wrong direction.

The 2013 Uptime Institute user survey (see p. 108), consisting of 1,000 respondents, found some interesting and conflicting responses. Between 50 and 71% (this varied by geography) reported that it was ‘very important’ to reduce data center energy consumption. Yet, 80% (up from 71% in 2012) indicated the electrical costs are paid by real estate, and therefore, generally not readily available to the IT decision makers when specifying equipment. Most strikingly, 43% (unchanged from 2012) use cooling air below 70°F/21°C, and 3% more (down from 6% in 2012) did not know what the temperatures were. More than half (57%) were using overall room air or return air temperatures to control cooling. These are the two least effective control points.

There is fundamental conflict between the desire to reduce operational expense (OPEX) and the lack of implementation of higher cooling temperatures and the resulting benefits. The goal of this paper is to give data center managers the tools to reduce the mechanical cooling power consumption through more efficient use of cooling units. Improving airflow management will allow for elevated return air temperatures and a larger ΔT across the cooling unit resulting in increased capacity of the cooling unit. This allows for the cooling to be performed by fewer cooling units to reduce mechanical energy consumption. Further cooling system best practices are discussed, including transition to supply air control, increasing chilled water temperatures, and refurbishment or replacement of fixed speed cooling units with variable speed capability.

Uptime Institute has consistently found that many site managers are waiting for the ‘next big breakthrough’ before springing into action. The dramatic OPEX, subsequent environmental and cooling stability benefits of improved cooling air management, increased operating temperatures, variable speed cooling fans and other best practices are not being exploited by the data center industry at large.

In some cases, managers are overwhelmed by engineering speak and lack a straightforward guideline. This technical paper is designed to assist site managers with a simple high-level guide to implement these best practices quickly and successfully, and includes some straightforward metrics to get the attention of the leadership team. One such metric is the ratio of operating cooling capacity in the computer room vs. the cooling load in the computer room, which is a simple, but effective, way to identify stranded capacity and unnecessary operation of cooling units.

This paper lists 29 steps intended to focus effort on reducing the cost of cooling. Select actions may require the involvement of consultants, engineers or vendors. The first 17 steps have been divided into two groups, one called Preparatory Metrics and the other Initial Action to Manage Airflow.

Taking these steps will establish a baseline to measure the program’s effectiveness and to ensure proper airflow management. Three sets of subsequent actions describe will describe 12 steps for reducing the number of cooling units, raising temperatures in the computer room and measuring the effectiveness of these actions. The third set of actions will also set the stage driving continuous improvement in the facility.

Reductions in the cost of cooling can be achieved without any capital investment: none of the 29 actions in this paper require capital! Additional energy-efficient actions are also described but will require a more comprehensive approach, often involving engineering design and potentially capital investment. Other benefits will be realized when the cost of cooling is reduced. Correcting air management reduces hot spots, and taking cooling units off line can reduce maintenance costs.

Preparatory Metrics.
This first group of steps captures the basic data collection necessary to capture the operating status of the computer room for later comparison. It is essential that this be the first group of steps as these metrics will provide the basis for reporting later success in reducing energy consumption and reducing or eliminating hot spots. Develop the work plan covering the proposed changes.

Initial Action to Manage Airflow.
The second group of steps guides step-by-step actions to rearrange existing perforated tiles and close bypass airflow openings such as cable holes and the face of the rack. Proper air management is necessary before changing the temperature in the computer room.

Subsequent Action One. Reducing the number of operating cooling units. The number of cooling units running should be adjusted to represent the necessary capacity plus the desired level of redundancy. This will involve turning selected units off.
Subsequent Action Two. Raising the temperature. Over time, increase the temperature of the air being supplied to the IT device intakes. A very slow and conservative approach is presented.
Subsequent Action Three. Reviewing and reporting metrics. Regular measurement and reporting allows the data center integrated operations team of both IT and Facilities professionals to make informed data center management decisions and will drive continuous improvement within the organization.

Selectively, some groups of steps can be done on a common initiative depending on how a particular project is being implemented. The approach in this paper is intentionally conservative.

Limitations and Conditions
Implementing change in any critical environment must be done carefully. It is important to develop and coordinate with IT a detailed site plan to supplement the 29 steps from this technical paper; this is a critical requirement for any site concerned with availability, stability and continuity of service. Retain appropriate consultants, engineers or contractors as necessary to completely understand proposed changes, especially with the physical cooling equipment changes outlined in Additional Advanced Energy Efficient Actions.

Because of air-cooling density limitations, higher density equipment may need to be supplemented with localized cooling solutions, direct liquid cooling or other innovations such as extraction hoods or chimneys. Traditionally, the cap for all ‘air cooling ’ was a power density of 7 kilowatts (kw) per rack, and that was only possible when all established best practices were being followed. Failure to follow best practices can lead to problems cooling even two kilowatt racks.

Experience shows that consistent and complete adoption of the steps in this technical paper enables successful cooling of server racks and other IT equipment at higher densities. Sites that excel at the adoption of these practices, sometimes coupled with commonly available technologies, can successfully air cool racks exceeding 10 kW (see Case Study 2). Other contemporary examples have successfully applied air cooling to even higher densities.

For sites with rack densities greater than 6 kW per rack, the Uptime Institute recommends Continuous Cooling. Continuous Cooling is where the computer room heat removal is uninterrupted during the transition from utility power to engine generator power. This is the direct analogy to uninterrupted power to the IT equipment. This recommendation is based on multiple recorded temperature excursions in the industry where the temperature and the rate of change of the temperature both exceed ASHRAE guidelines. In one case, a computer room with 250 racks at 6 kW went from 72°F to 90+°F (22°C to 32+°C) in 75 seconds when the cooling was lost.

Increasing the operating temperature of a computer room may shorten the ridethrough time for a momentary loss of cooling. Real-event experience, however, suggests the change in the ride-through time will be seconds, not minutes or hours. This explanation, which is often cited for not raising the operating temperature, is not based on fact.

ASHRAE Temperature Guidelines
Current ASHRAE TC 9.9 publications can be obtained at http://tc99.ashraetcs.org/documents.html. The TC9.9 publications should be fully understood before selecting a higher operating temperature. The ability of legacy or heritage equipment to operate in the newer temperature ranges should be validated by the IT team.

Cold Aisle/Hot Aisle
The physical arrangement of the IT equipment into what is known as Cold Aisle/Hot Aisle is another best practice. This configuration places the intake of IT equipment on the Cold (supply) Aisle and the discharge of warmer air toward the Hot (return) Aisle. Current practices permit most computer rooms to use 75°F/24°C supply in the Cold Aisle, understanding that the only temperature that matters in a computer room is the air at the intake to the computer hardware. The Hot Aisle will be substantially warmer. Noticing dramatically different temperatures while walking through a computer room with Cold Aisles/Hot Aisles is a demonstration of successful implementation and operating practices.

Proper adoption of Cold Aisle/Hot Aisle methodology improves the stability and predictability of cooling and is a key step to minimizing the mixing of hot and cold air. However, an existing legacy computer room that does not utilize a Cold Aisle/Hot Aisle configuration can still benefit from improved airflow management. Perforated floor tiles providing raised-floor cooling air should be placed immediately adjacent to the front of the rack they are cooling. Temperatures in a legacy environment will not be as distinctive as in a Cold Aisle/Hot Aisle environment, and extra caution is required when implementing these best practices. Additional check points should be added to ensure that undesired temperature increases are identified and quickly addressed during the rearrangement of floor tiles.

For maximum benefit, a strategic computer room master plan should be developed. This master plan will identify the arrangement of the rows of IT equipment, the positions in rows or the room for power support equipment, and the number and arrangement of the cooling units to support the ultimate capacity of the space. The master plan also reserves locations for IT, power and cooling equipment that may not be installed for some time, so that when the equipment is required, there is not a crisis to determine where to install it; it is preplanned. A computer room master plan is of value for a new room or reconfiguring an existing room.

29 Cooling Best Practices
Group one: Preparatory metrics

1. Determine the IT load in kilowatts.

2. Measure the intake temperature across the data center, including where hot spots have been experienced. At a minimum, record the temperature at mid-height at the end of each row of racks and at the top of a rack in the center of each row. Record locations, temperatures and time/date manually or using an automated system. These data will be used later as a point of comparison.

3. Measure the power draw to operate the cooling units in kilowatts. Often, there is a dedicated panel serving these devices. This may be accessible from a monitoring system. These data will be used later as a point of comparison and calculation of kilowatt-hours and long-term cost savings.

4. Measure the room’s sensible cooling load as found. This can be done by measuring the airflow volume for each cooling unit and recording the supply and return temperatures for each unit that is running. The sensible capacity of each operating unit in kW for each unit can be calculated using the expression Q sensible (kW) = 0.316*CFM*(Return Temp°F – Supply Temp°F)/1000 [Q sensible (kW) = 1.21*CMH*(Return Temp°C – Supply Temp°C)/3600]

The room’s sensible load is the sum of each cooling unit running in the room. Organize the data in a spreadsheet or similar tool to record each data point, supply and return temperature, ΔT (return-supply), airflow, kilowatt sensible and sum to provide overall room sensible load. Confirm that entire IT load is being serviced by the cooling system, that there is not additional IT load in another room and that the cooling plant is not supporting any non-IT load.

Compare the cooling load to the IT load for a point of reference.

5. Using the airflow and return air temperature measured in the previous step, request the sensible capacity of each unit in kilowatts from the equipment vendor. Enter this nominal sensible capacity in the spreadsheet. Sum the sensible capacity for the units that are operating. This is a quick and easy approximation of the sensible-cooling capacity of the cooling units in use. More complex methods are available but were not included here for simplicity.

6. Divide the overall room operating cooling sensible capacity in kilowatts determined in previous step by the sensible operating cooling load determined in Step 4. This number presents a simple ratio of operating capacity versus cooling load. The ratio could be as low as 1.15 (where the cooling capacity matches IT load with one cooling unit in six redundant), but will likely be much higher. This ratio is a benchmark to evaluate subsequent improvement as reduction of this ratio will occur with the reduction of operating cooling units.

7. Cooperatively with IT, determine the maximum allowable intake temperature for the new operating environment. For temperatures above 75°F, there may be offsetting power increases within the IT devices from their onboard cooling fans.

8. Using the data developed above, create a work plan outlining goals of the effort, detail the metrics that will be monitored to ensure the cooling environment is not disrupted, specify a back-out plan should a problem be discovered and identify the performance metrics that will be tracked for rack inlet temperatures, power consumption, etc. This plan should be fully communicated with IT as a key member of the critical operations team.

Group two: Initial action to manage airflow

9. Relocate all raised-floor perforated tiles to the Cold Aisle. Replace perforated tile supporting a given rack should be placed immediately in front of the rack. As a starting baseline, each rack with IT equipment producing cooling load should have a single perforated floor tile in front of it. Empty racks and racks with no electrical load should have a solid tile in front of them. Give special consideration to racks populated with network equipment like switches or patch panels.

10. If a return air plenum is used, ensure that the placement of ceiling return grilles aligns with the Hot Aisles.

11. Seal the vertical front of the racks using blanking plates or other systems to prevent Cold Aisle air from moving through the empty rack positions into the Hot Aisle.

12. Seal the cable holes in raised floor, if used.

13. Seal any holes in the walls or structure surrounding the computer room. Check walls below the raised floor and above the ceiling. Check for unsealed openings in the structure under or over the computer room. Typical issues are found where conduits, cables or pipes enter the computer room.

14. Seal any supply air leakage around the cooling units, if they are located in the computer room.

15. Ensure that all cooling units in the space are equipped with a backflow damper or fabricate covers for units designated as off to prevent backflow of cooling through powered-off cooling units.

16. Seal any leakage of cooling air at power distribution units, distribution panels or other electrical devices mounted on the raised floor.

17. Record the temperatures at the same locations as in Step 2. Compare this and prior readings to determine if any intake temperatures have increased. If an increase is observed, increase airflow to the problem rack by moving a raised-floor perforated tile from an area that does not have any hot spots or by replacing the standard perforated floor tile with a higher flow rate perforated floor tile.

Subsequent action one: Reducing the number of operating cooling units by shutting units off.

18. Identify the cooling units with the lowest sensible load. These units will be the first units to be turned off as the following process is implemented.

19. Estimate the number of required cooling units by dividing the IT load by the smallest sensible cooling unit capacity. This will approximate the number of cooling units needed to support the IT load.

20. Uptime Institute recommends redundancy in large computer rooms of 1 redundant unit for every six cooling units. Smaller or oddly shaped rooms may require a higher level of redundancy. Divide the required number of cooling units by six to determine the number of redundant cooling units. Round up to the next higher whole number. This is the suggested number of redundant cooling units for the IT load.

21. Add the results of the two previous steps. This will be the desired number of cooling units necessary to operate to meet the IT load and provide 1:6 redundant cooling units. This should be fewer than the total number of cooling units installed.

22. Turn off one operating cooling unit each week until only the desired number of cooling units determined in the previous step remains operating.

23. With each ‘excess’ cooling unit being shut off, examine reducing the number of perforated tiles to match the reduced airflow being supplied to the room. If a temperature increase is observed, increase airflow to the problem rack by adding perforated floor tiles or utilizing higher flow rate perforated floor tiles. The placement of floor tiles to appropriately match cooling supply is an iterative process.

24. Before each subsequent reduction in the number of operating cooling units, record the temperatures at the same locations as in Step 2. Compare this and prior readings to determine if any server intake temperatures have increased. If an increase is observed, increase airflow to the problem rack by moving a perforated tile from an area that does not have any hot spots.

25. When the desired number of units has been reached, record the power draw for the cooling units as in Step 3. This may be possible through existing monitoring. Compare the before and after to observe the reduction in power draw.

Subsequent action two: Raising the temperature in the computer room.

26. Increase the temperature set point on each cooling unit one or two degrees a week until the server intake air is at the desired level, for example the highest server intake temperature is 75°F/24°C. To accomplish this faster, a greater rate of change should be discussed with and agreed to by the data center integrated operations team. (Note: this step can be completed regardless of whether the cooling unit uses supply or return air control. If the unit has return air control, the set point will seem quite high, but the supply air will be in the acceptable range.)

27. Before each subsequent increase in set point temperature, record the temperatures at the same locations as in Step 2. Compare this and prior readings to determine if any server intake temperatures have increased. If an increase is observed, increase airflow to the problem rack by moving a perforated tile from an area that does not have any hot spots.

Subsequent action three: Review and report metrics.

28. Re-create the preparatory metrics from Steps 1-6 and compare to the original values. This includes the following metrics:

Total cooling unit power consumption
Ratio of operating cooling capacity and cooling load
IT device inlet temperatures

29. Calculate the overall power savings based on the implemented measures and estimate annual operations savings based upon kilowatt-hours reduced and the average cost of power. Report the progress to the integrated data center operations team.

Additional Advanced Energy-Efficient Actions
There are additional measures, which can be taken in order to further increase the overall efficiency of the cooling systems supporting the data center, including implementing variable speed/frequency fan drives, utilizing supply air control and increasing the chilled-water temperature where chilled-water systems are utilized. These measures, while requiring increased engineering study and operational awareness prior to implementation, can continue to lower the amount of electricity required for the cooling plant, while simultaneously maintaining a more consistent ambient environment in the data center.

Implementing supply air control
Supply air control typically requires the addition of supply air temperature sensors, as legacy air conditioners are typically not equipped with these devices. The supply air sensors will be the modulation control point for the units to ensure a consistent supply air temperature for the data center.
Most legacy data centers are still utilizing standard return air control methodologies. However, data centers with supply air control methods are realizing two important benefits. First, in conjunction with good airflow management practices, supply air control guarantees a consistent inlet air temperature through all cold aisles in the data center. Second, it reduces the importance of the ΔT of the air handlers. Data centers utilizing return air control are forced to use an artificially low return air set point in order to guarantee that the cold aisle temperatures do not exceed their normal design temperature. But, this is not an issue for data centers utilizing supply air control.

Considerations
Supply air control typically reacts much faster than return air control, so there is the possibility of increased ‘hunting ’ to find the set point. Often, implementing supply air control requires modification of the control algorithms that the units use to modulate cooling. This should be done in close concert with the original equipment manufacturer (OEM).

This control scheme should use an adjustable set point for supply air, and additionally there should not be any automatic reset on the desired supply air temperature. The objective is to have all underfloor air at the same temperature.

Figure 2. The power consumption of a fan is a function of the cube of the fan speed.

Implementing variable speed fan drives
Since the power consumption of a fan is a function of the cube of the fan speed, operating additional variable speed cooling units or fans will further reduce power consumption (see Figure 2). This can be accomplished by retrofitting any constant speed cooling fans to variable speed or by replacing legacy units with new units with built in variable speed capability. Implementing variable speed fan drives will require close interaction with the OEM of the cooling units, and possibly, other assistance, depending on the qualifications of the facility staff.

Considerations
There are various ways of controlling the variable speed fans of the cooling units, but the most common is the use of underfloor static pressure. A less consistently applied control method is controlling variable speed fans based on temperature sensors within the cold aisle. Static pressure can be difficult to measure accurately.

Therefore, it is a good practice to have groups or even all units operating at the same speed. This helps to ensure that units are not constantly fighting each other and excessively modulating to maintain the set point.

As a best practice, the implementation of variable speed fan drives should not be done without prior installation of supply air control on the cooling units. Using variable speed fan drives in conjunction with return air control can lead to lower temperature differentials (ΔT’s), which can in turn lead to increased Cold Aisle temperatures, potentially even above the desired limit. To alleviate this issue, the return air set point has to be reduced, thereby defeating the potential savings realized by implementing the variable speed fan drives.

Using variable speed fan drives will typically change the airflow characteristics of the room. Therefore, it is important to monitor temperatures at the inlet of the IT devices. If higher temperatures are observed, perforated floor tile type and location may need modification.

Increasing Chilled Water Temperature
A chilled water system is one of the largest single facility power consumers in the data center. Raising the air handler temperature set point provides the opportunity to increase the temperature of the chilled water and further reduce energy consumption. Typical legacy data centers have chilled water set points between 42-45°F (6-7°C). Facilities that have gone through optimization of their cooling systems have successfully raised their chilled water temperatures to 50°F (10°C) or higher.

Considerations
The maximum allowable chilled water temperature will be based on the performance capabilities of both the chiller and the air handler units. Therefore, it is imperative to work in close contact with both the chiller and air handler unit OEMs to ensure that the chilled water temperature can be increased without impacting the data center. Increasing the chilled water temperature will have an impact on the capacity of the cooling units and should be implemented very slowly, potentially only increasing the set point by a single degree per week.

Some facilities require colder chilled water temperature in order to dehumidify the air in the data center. An engineering analysis should be performed to determine whether or not a low chilled water temperature is required for dehumidification processes. The savings from implementing higher chilled water temperatures can be so great that investigating alternative dehumidification methods may be warranted.

Case Study 1: How NOT to implement changes in a computer room.
A mechanical system diagnostic assessment was performed for a confidential client’s global support data center’s 40,000-ft2 (3,700-m2) computer room with over 70 chilled-water cooling units. Only 35 units (including redundancy) were required to keep the room cool. Upon receiving the preliminary study, the facility staff began turning cooling units off in a random and unplanned manner and without communicating with IT. Within five minutes, critical computer and storage equipment had shut down, interrupting the mission-critical computing (internal protective sensors that had sensed high temperatures automatically shut down some of the hardware). IT management, which was not aware of the initiative to reduce energy use, immediately demanded all cooling units be turned back on and the hot spots went away. After this very unpleasant experience, no further attempts were made at cooling optimization.

The failed attempt to reduce facility energy consumption outlined in this case study illustrates the necessity of making changes in a computer environment in a methodical, planned and well communicated way. Purposefully implementing the best practices established by Uptime Institute research can be done with great success if done in a prudent manner with the involvement and support of IT management.

Case Study 2: Overcoming legacy capabilities in a 20-year-old data center.
Designed for less than 2 kW per rack, Agriculture and Agri-Food Canada (AAFC)’s 20-year old data center was updated to successfully support multiple 10-kW racks using only air cooling. Best practices were followed for managing airflow, including selective use of hot-air containment and extraction technology. The temperatures in the room stabilized and previous hot spots were eliminated, in addition to supporting racks of a dramatically higher density than previously envisioned. This project earned AAFC and Opengate a 2011 GEIT award winner in the Outstanding Facilities Product in a User Deployment category. In addition, Opengate made the project the subject of a 2012 Symposium Technology Innovation presentation.

Case Study 3: A 40-year-old data center updates fan monitoring and controls to reduce fan energy consumption by 24%.
Lawrence Berkeley National Laboratories (a 2013 GEIT Finalist) used variable- speed controls to slow fan speeds. Since fan energy consumption is a function of the cube of fan speed, slowing down the fans brings an immediate and measurable reduction in energy consumption. This implementation used available technology and was performed after best practices in air management were met.

About the Authors

W. Pitt Turner IV, PE, is the executive director emeritus of Uptime Institute, LLC. Since 1993 he has been a senior consultant and principal for

Uptime Institute Professional Services (UIPS), previously known as ComputerSite Engineering, Inc. Uptime Institute includes UIPS, the Institute Uptime Network, a peer group of over 100 data center owners and operators in North America, EMEA, Latin America, and Asia Pacific, as well as the Uptime Institute Symposium, a major annual international conference. Uptime Institute is an independent operating unit of The 451 Group.

Before joining Uptime Institute, Mr.Turner was with Pacific Bell, now AT&T, for more than 20 years in their corporate real estate group, where he held a wide variety of leadership positions in real estate property management, major construction projects, capital project justification and capital budget management. He has a BS Mechanical Engineering from UC Davis and an MBA with honors in Management from Golden Gate University in San Francisco. He is a registered professional engineer in several states. He travels extensively and is a school-trained chef.

Keith Klesner’s career in critical facilities spans 14 years and includes responsibilities ranging from planning, engineering, design and construction to start-up and ongoing operation of data centers and mission-critical facilities. In the role of Uptime Institute vice president of Engineering, Mr. Klesner provides leadership and strategic direction to maintain the highest levels of availability for leading organizations around the world. Mr. Klesner performs strategic-level consulting engagements, Tier Certifications and industry outreach—in addition to instructing premiere professional accreditation courses. Prior to joining the Uptime Institute, Mr. Klesner was responsible for the planning, design, construction, operation and maintenance of critical facilities for the U.S. government worldwide. His early career includes six years as a U.S. Air Force officer. He has a Bachelor of Science degree in Civil Engineering from the University of Colorado- Boulder and a Masters in Business Administration from the University of LaVerne. He maintains status as a professional engineer (PE) in Colorado and is a LEED-accredited professional.

Ryan Orr has been an Uptime Institute Professional Services Consultant since January 2012. In this role, he performs Tier Certifications of Design, Constructed Facility and Operational Sustainability. Additionally, he performs M&O Stamp of Approval reviews and custom consulting activities. Prior to joining the Uptime Institute, Mr. Orr was with the Boeing Company, where he provided engineering, maintenance and operations support to both legacy and new data centers nationwide. This support included project planning, design, commissioning and implementation of industry best practices for mechanical systems. Mr. Orr holds a BS degree in Mechanical Engineering from Oregon State University.

Code mandates for data center operations

June 24, 2014/in Operations/by Matt Stansberry

Data center operations and management are increasingly the target of codes that impact safety, communications, and economic systems. Going well beyond the guidelines laid out by NFPA 75 and 76, the new 2011 National Electrical Code (NFPA 70) actually provides for regulatory oversight of mission-critical facility management. This video will provide a history of codes as they apply to data centers, and a strategy for answering and implementing code provisions to avoid problems with inspectors.

This video, shot at Uptime Institute Symposium 2013, features Richard L Sawyer, Global Strategist, HP Critical Facilities Services.

Server Decommissioning as a Discipline: Paul Nally, Barclays

June 18, 2014/in Operations/by Matt Stansberry

In 2013, Barclays removed 9,124 physical servers from its data centers globally, winning Uptime Institute’s Server Roundup competition for the second year running. This effort is part of a server footprint reduction across the entire company. The organization reaped significant cost benefits, but this project has also reduced risk and complexity, providing Barclays’ internal IT customers better uptime, and positions these teams to get either on to, or ready for, next-generation solutions like cloud.

The direct cost benefit of this effort sells itself, but the lasting legacy of the server reduction program at Barclays will be the benefits of getting off technologies that are holding the business back, and replacing them with IT services that are more future-friendly.

This keynote video from Uptime Institute Symposium 2014 will provide you with a roadmap for starting your own server decommissioning project.

ATD Interview: Christopher Johnston, Syska Hennessy

June 18, 2014/in Design/by Kevin Heslin

Industry Perspective: Three Accredited Tier Designers discuss 25 years of data center evolution

The experiences of Christopher Johnston, Elie Siam, and Dennis Julian are very different. Yet, their experiences as Accredited Tier Designers (ATDs) all highlight the pace of change in the industry. Somehow, though, despite significant changes in how IT services are delivered and business practices affect design, the challenge of meeting reliability goals remains very much the same, complicated only by greater energy concerns and increased density. All three men agree that Uptime Institute’s Tiers and ATD programs have helped raise the level of practice worldwide and the quality of facilities. In this installment, SyskaHennessy Group’s Christopher Johnston looks forward by examining current trends. This is the first of three ATD interviews from the May 2014 Issue of the Uptime Institute Journal.

Christopher Johnston is a senior vice president and chief engineer for Syska’s Critical Facilities Team. He leads research and development efforts to address current and impending technical issues (cooling, power, etc.) in critical and hypercritical facilities and specializes in the planning, design, construction, testing, and commissioning of critical 7×24 facilities. He has more than 40 years of engineering experience. Mr. Johnston has served as quality assurance officer and supervising engineer on many projects, as well as designing electrical and mechanical systems for major critical facilities projects. He has been actively involved in the design of critical facility projects throughout the U.S. and internationally for both corporate and institutional clients. He is a regular presenter at technical conferences for organizations such as 7×24 Exchange, Datacenter Dynamics, and the Uptime Institute. He regularly authors articles in technical publications. He heads Syska’s Critical Facilities’ Technical Leadership Committee and is a member of Syska’s Green Critical Facilities Committee.

Christopher, you are pretty well known in this business for your professional experience and prominence at various industry events. I bet a lot of people don’t know how you got your start.

I was born in Georgia, where I attended the public schools and then did my undergraduate training at Georgia Tech, where I earned a Bachelor’s in Electrical Engineering and effectively a minor in Mechanical Engineering. Then, I went to the University of Tennessee at Chattanooga. For the first 18 years of my career, I was involved in the design of industrial electrical systems, water systems, wastewater systems, pumping stations, and treatment stations. That’s an environment where power and controls have a premium.

In 1988, I decided to relocate my practice (Johnston Engineers) to Atlanta and merge it with RL Daniell & Associates. RL Daniell & Associates had been doing data center work since the mid-1970s.

Raymond Daniell was my senior partner, and he, Paul Katzaroff, and Charlie Krieger were three of the original top-level designers in data center design.

That was 25 years ago, Raymond and I parted ways in 2000, and I was at Carlson until 2003. Then, I opened up the Atlanta office for EYP Mission Critical Facilities and moved over to Syska in 2005.

What are the biggest changes since you began practicing as a data center designer?

From a technical standpoint, when I began we were still in the mainframe era, so every shop was running a mainframe, be it Hitachi, Amdahl, NAS, or IBM. Basically we dealt with enterprise-type clients. People working at their desks might have a dumb terminal keyboard and video. Very few had mice, and they would connect into a terminal controller, which would connect to the mainframe.

The next big change was when they went to distributed computing, which were the IBM AS400 and the Digital VAX systems. Then, in the 1990s, we started seeing server-based technologies, which is what we have today.

Of course, we now have virtualized servers that make the servers operate like mainframes, so the technology has gone in a circle. That’s my experience with IT hardware. Of course, today’s hardware is typically more forgiving than the mainframes were.

We’re seeing much bigger data centers today. A 400-kilowatt (kW) UPS system was very sizable. Today, we have data centers with tens of megawatts of critical load.

I’m the Engineer-of-Record now for a data center with 60 megawatts (MW) of critical load. SyskaHennessy Group has a data center design sitting on the shelf with 100 MW of load. I’ve actually done a design for kind of a prototype for a telecom in China that wanted 315 MW of critical load.

There is also much more use of medium voltage because we have to move tremendous amounts of electricity at minimal costs to meet client requirements.

The business of data center design has become commoditized since the late 1980s. Data center design was very much a specialty, and we were a boutique operation. There was not much competition, and we could typically demand three times the fee on a percentage basis than we can now.

Now everything is commoditized; there is a lot of competition. We encounter people who have not done a data center in 5-7 years who consider themselves data center experts. They say to themselves, “Oh data centers, I did one of those 6 or 7 years ago. How difficult can it be? Nothing’s changed.” Well the whole world has changed.

Please tell me more about the China facility. At 315 MW, obviously it has a huge footprint, but does it have the kind of densities that I think you are implying? And, if so, what kind of systems are you looking to incorporate?

It was not extremely dense. It was maybe 200 watts (W)/ft2 or 2,150 W/square meter (m2). The Chinese like low voltage, they like domestically manufactured equipment, and there are no low-voltage domestically manufactured UPS systems in China.

The project basically involved setting up a 1,200-kW building block. You would have two UPS systems for every 1,200-kW block of white space, and then you would have another white space supplied by two more 1,200-kW UPS systems.

It was just the same design repeated over and over again, because that is what fit how the Chinese like to do data centers. They do not like to buy products from outside the country, and the largest unit substation transformer you can get is 2,500 kilovolt-amperes so perhaps you can go up to a 1,600-kW system. It was just many, many, many multiple systems, and it was multiple buildings as well.

I think a lot of people did conceptual designs. We didn’t end up doing the project.

GE Appliances Tier III data center in Louisville, KY

What projects are you doing?

We’ve got a large project that I cannot mention. It’s very large, very high density, and very cutting edge. We also have a number of greenfield data centers on the board. We always have greenfield projects on the board.

You might find it interesting that when I came to Syska, they had done only one greenfield; Syska had always lived in other people’s data centers. I was part of the team that helped corporate leadership develop its critical facilities team to where it is today.

We’re also seeing the beginning of a growing market in the U.S. in upgrading and replacing legacy data center equipment. We’re beginning to see the first generation of large enterprise data centers bumping up against the end-of-usable-life spans of the UPS systems, cooling systems, switchgear, and other infrastructure. Companies want to upgrade and replace these critical components, which, by the way, must be done in live data centers. It’s like doing open-heart surgery on people while they work.

What challenge does that pose?

One of the challenges is figuring out how to perform the work without posing risk to the operating computing load. You can’t shut down the data center, and the people operating the data center want minimal risk to their computers. Let’s say I am going to replace a piece of switchgear. Basically, there are a few ways I can do it. I can pull the old switchgear out, put the new switchgear in its place, connect everything back together, then commission it, and put everything back into service. That’s what we use to call a push pull.

The downside is that the computers are possibly going to be more at risk because you are not going to be able to pull the old switchgear out, put new equipment in, and commission it very quickly. If you are working on a data center with A and B sides and don’t have ability to do a tie, perhaps the A side will be down for 2 weeks, so there is certainly more risk in terms of time there.

If you have unused space in the building where you can put new equipment or if you can put the switchgear in containers or another building, then you can put the new switchboard in, test it, commission it, and make sure everything is fine. Then you just need to have a day or perhaps a couple of days to cut the feeders over. There may be less risk, but it might end up costing more in terms of capital expense.

So each facility has to be assessed to understand its restrictions and vulnerabilities?

Yes, it is not a one-size-fits all. You have to go and look. And an astute designer has got to do the work in his or her head. The worst situation a designer can put the contractor in is to turn out a set of drawings that say, “Here’s the snapshot of the data center today and here’s what we want to look at when we finish. How you get from point A to point B is your problem.” You can’t do that because the contractor may not be able to build it without an outage.

We are looking at data centers that were built in the 1990s up to the dot bomb in 2002-2003 that are getting close to the end, particularly on the UPS systems, of usable life. The problem is not only the technology, but it is also being able to find technicians who can work on the older equipment.

What happens is the UPS manufacturer may say, “I’m not going to make this particular UPS system any more after next year. I’ve got a new model. I’ll agree to support that system normally, for, let’s say, 7 years. I’ll provide full parts and service support.” Past 7 years they don’t guarantee the equipment.

So let’s say that two years before the UPS was obsoleted you bought one and you are planning on keeping it 20 years. You are going to have to live with it for years afterwards. The manufacturer hasn’t trained any technicians, and the technicians are retiring. And, the manufacturer may have repair parts, but then again after 7 years, they may not.

What other changes do you see?

We’re beginning to see the enterprise data centers in the U.S. beginning to settle out into two groups. We are seeing the very, very, very small, 300-500 kW, maybe a 1000 kW, and then we see very big data centers of 50 MW or maybe 60 MW. We’re not seeing many in the middle.

But, we are seeing a tremendous amount of building by colo operators like T5, Digital Realty, and DuPont Fabros. I think they are having a real effect in the market, and I think there are a lot of customers who are tending to go to colo or to cloud colo. We see that as being more of an interest in the future, so the growth of very large data centers is going to continue.

At the same time, what we see from the colos is that they want to master plan and build out capacity as they need it, just in time. They don’t want to spend a lot of money building out a huge amount of capacity that they might not be able to sell, so maybe they build out 6 MW, and after they’ve sold 4 MW, maybe they build 4-6 more MW. So that’s certainly a dynamic that we’re seeing as people look to reduce the capital requirements of data centers.

We’re also seeing the cost per megawatt going down. We have clients doing Uptime Institute Certified Tier III Constructed Facility data centers in the $9-$11 million/MW range, which is a lot less than the $25 million/MW people have expected historically. People are just not willing to pay $25 million/MW. Even the large financials are looking for ways to cut costs, both capital costs and operating costs.

You have been designing data centers for a long time. Yet, you became an ATD just a few years ago. What was your motivation?

December 2010. My ATD foil number is 233. The motivation at the time was that we were seeing more interest in Uptime Institute Certification from clients, particularly in China and the Mideast. I went primarily because we felt we needed an ATD, so we could meet client expectations.

If a knowledgeable data center operator outside goes to China and sets up a data center, the Chinese look at an Uptime Institute Certification as a Good Housekeeping Seal of Approval. It gives them evidence that all is well.

In my view, data center operators in the U.S. tend to be a bit more knowledgeable, and they don’t necessarily feel the need for Certification.

Since then, we got three more people accredited in the U.S. because I didn’t want to be a single point of failure. And, our Mideast office decided to get a couple of our staff in Dubai accredited, so all told Syska has six ATDs.

This article was written by Kevin Heslin, senior editor at the Uptime Institute. He served as an editor at New York Construction News, Sutton Publishing, the IESNA, and BNP Media, where he founded Mission Critical, the leading publication dedicated to data center and backup power professionals. In addition, Heslin served as communications manager at the Lighting Research Center of Rensselaer Polytechnic Institute. He earned the B.A. in Journalism from Fordham University in 1981 and a B.S. in Technical Communications from Rensselaer Polytechnic Institute in 2000.

Fuel System Design and Reliability

June 18, 2014/in Design/by Kevin Heslin

Topology and operational sustainability considerations must drive fuel system design.

When Superstorm Sandy hit the Northeastern U.S., some data center owners and operators discovered that they hadn’t considered and treated their fuel systems as part of a mission-critical system. In one highly publicized story, a data center operator in New York City was forced to use a “fuel bucket brigade,” moving fuel by hand up 18 stories for 48 hours. The brigade’s heroics kept the data center online, but thoughtful design of the fuel topology would have led to better operational sustainability. Superstorm Sandy taught many other New York and New Jersey data center owners that fuel system planning is essential if critical power is to be maintained in an emergency.

Uptime Institute has observed that designers may not be well versed in fuel solutions for data centers. Possible explanations include the delegation of the fuel system design to a fuel or engine generator supplier and a focus on other systems. As a critical system, however, fuel systems require all the consideration paid to the topology of other data center subsystems from data center designers and owners, with integration of Operational Sustainability principles to ensure high availability.

Tier Topology of Fuel Systems
Recall that the Tier Standard is based on the resiliency of the lowest-rated data center subsystem. The topology of the fuel system includes paths of fuel distribution and each and every fuel component, all of which must correspond to the Tier objective of the data center. For Tier III and Tier IV data centers, the path of fuel supply and potentially the return lines must be either Concurrently Maintainable or Fault Tolerant. The fuel system components—namely pumps, manual valves, automated valves, control panels, bulk tanks and day tanks – all must meet either the Concurrently Maintainable or Fault Tolerant objective.

Determining whether a fuel system is Concurrently Maintainable or Fault Tolerant means that it must be evaluated in the same way as the chilled-water schematic or an electrical one-line drawing, starting from a given point – often the bulk storage tanks—and methodically working through the paths and components to the fuel treatment supplies and day tanks. Removing each component and path reveals points in the system where “N,” or the needed fuel flow and/or storage, are unavailable.

Tier Topology Fuel Requirement
The Uptime Institute Owners Advisory Committee defined 12 hours of minimum fuel storage as a starting point for Tier-defined data centers. The Tier Standard: Topology requires this 12-hour fuel storage minimum for all Tiers at 12 hours of runtime at “N” load while meeting the facility’s stated topology objective. Put another way, the fuel storage must be adequate to support the data center design load for 12 hours while on engine generators while meeting the Concurrently Maintainable or Fault Tolerant objective. Exceeding the 12-hour minimum is an Operational Sustainability issue that requires careful analysis of the risks to the data center energy supply.

Fuel Storage Topology
Many owners reference the amount of fuel on hand based on total capacity. Just as with engine generators, chillers and other capacity components, the true fuel storage capacity is evaluated by removing the redundant component(s). A common example is an owner’s claim of 48 hours of fuel; however, the configuration is two 24-hour tanks, which is only 24 hours of Concurrently Maintainable fuel. The amount of Concurrently Maintainable or Fault Tolerant fuel establishes the baseline as the amount always available, not the best-case scenario of raw fuel storage.

Common fuel system design discrepancies in Tier III and Tier IV systems include:

The fuel system valving and piping does not match the Concurrently Maintainable objective. An example of the most common fuel error is a single valve between two pumps or storage tanks when using an N+1 configuration, which cannot be maintained without shutdown of more than the redundant number of components.
The power to fuel pumps and fuel control panels is not Concurrently Maintainable or Fault Tolerant. A single panel or single ATS feeding all fuel pumps is a common example. The entire power path to fuel pumps and controls must be Concurrently Maintainable or Fault Tolerant.
Depending on the topology of the fuel system (constant flow, etc.), the return fuel path may be critical. If the fuel supply is pressurized or part of a continuously functioning return system from the day tanks to the bulk tanks, the return path must meet the Concurrently Maintainable or Fault Tolerant objective.
The fuel control system is not Concurrently Maintainable. A fuel control panel may be replicated or a method of manual fuel operations must be available in case of maintenance or replacement of the fuel control panel. In this case, the manual method must be a designed and installed methodology.

Other considerations include fuel cooling, the fault tolerance of fuel systems and fuel type.

The internal supply pumps of most engine generators continuously oversupply diesel to the engine generator and return superheated diesel fuel to the day tank. During extended operations or during high ambient temperatures, it is possible to overheat the day tank to such an extent that it reduces engine generator power or even causes thermal shutdown of the unit. There are two primary ways to prevent fuel overheat:

Recirculating day tank fuel to the underground bulk tank. This system utilizes the thermal inertia of the underground system with little or no bulk tank temperature change. Of course, the return path is a critical element of the fuel system and must meet all the same Tier criteria.
Specifying a fuel cooler as part of the engine generator package. Fuel coolers function well as long as their net power impacts on the engine generator are considered and the fuel cooler is confirmed to function at extreme ambient temperatures.

Tier IV requires autonomous response to failure, which encompasses the ability to detect a fault, isolate the fault and sustain operations with alternate systems. The simple test of Fault Tolerance for fuel is that “N,” or the needed amount of fuel for the data center, is available after any failure. Some failure scenarios of fuel systems include loss of power, leaks, failure of components or fire. Leak detection of fuel systems is crucial to Fault Tolerance. Fault detection may be achieved using a variety of methods, but the chosen method must be integrated with the system response to isolate the fuel leak. The isolation must limit fuel loss such that the ability of alternate systems to deliver “N” fuel is not impacted.

Tier IV also requires the compartmentalization of fuel systems. Compartmentalizing complementary systems within separate spaces limits the damage of a single catastrophic event to a single fuel system. This includes fuel lines, fuel controllers, fuel pumps and all types of fuel tanks.

Owners considering propane or natural gas to feed either gas turbines or more recent innovations such as fuel cells face additional fuel considerations.

Fuel must have completely redundant supply lines in order to be Concurrently Maintainable.

Tier topology views gas supply lines as a utility similar to water or electricity. Thus, as a utility, a minimum of 12 hours of gas is required to be stored on-site in a manner consistent with the site’s Tier design objective. The result is one or more large gas tanks for storage. While Tiers allow this solution, not many jurisdictions have shown tolerance for large gas storage under local or national code. And, from an Operational Sustainability perspective, multiple large gas tanks could be an explosion hazard. Explosion risk may be mitigated but certainly is a major design consideration.

Fuel System Design Impacts to Operational Sustainability
Operational Sustainability defines the behaviors and risks beyond Tier Topology that impact the ability of a data center to meet its business objectives over the long term. Of the three elements of Operational Sustainability, the data center design team influences the building characteristics element the most. An improperly designed fuel system can have serious negative Operational Sustainability impacts.

Sustainability Impacts
Recent weather and disaster events have shown that planning for more extreme circumstances than previously anticipated is being realized. Properly designed fuel storage can mitigate the operational risks related to long-term outages and fuel availability caused by disaster. Some questions to ask include:

What is the owner’s business continuity objective and corporate policy?
How will the data center operator manage the bulk fuel on hand?
What is the site’s risk of a natural disaster?
What is the site’s risk of a man-made disaster?
What fuel service can be provided in a local disaster, and can local suppliers meet fuel requirements?
What fuel service can be provided in a regional emergency, and within what time period?

The answers to questions about fuel storage requirements can come from dialogue with the data center owner, who must clearly define the site’s autonomy objective. The design team can then apply the Tier topology along with Operational Sustainability risk mitigation to create the fuel storage and delivery system.

Operational Sustainability is about mitigating risks through defining specific behaviors. Human error is one of the biggest risks to operations. In order to mitigate the human factor, the design of infrastructure, including fuel, must incorporate simplicity of design and operations. In essence, the design must make the fuel system easier to operate and maintain. Fuel distribution isolation valves and control systems must be accessible. Underground vaults for fuel systems should be routinely accessible and provide adequate space for maintenance activities. Fuel tank site glasses located in readable locations can provide visual clues of infrastructure status, confirm fuel levels and give operators direct feedback before the start of maintenance activities.

Fuel treatment is not required by Tier topology, but it is certainly an important Operational Sustainability site characteristic. Diesel fuel requires regular management to remove water, algal growth and sediment to ensure that engine generators are available to provide power. Permanently installed fuel treatment systems encourage regular fuel maintenance and improve long-term availability. Local fuel vendors can help address local concerns based on diesel fuel and climate. Finally, the Institute encourages designers to integrate fuel treatment systems carefully to ensure that the Tier objective is not compromised.

The Institute has seen a growing use of engine generators in exterior enclosures with fuel systems located outside. Physical protection of these assets is an Operation Sustainability concern. With the level of investment made in data center infrastructure, designers must prevent weather and incidental damage to the fuel and engine generator systems. Protecting fuel lines, tanks and engine generators from errant operators or vehicles is paramount; bollards or permanent structures are good solutions. The fuel system must consider Operational Sustainability to achieve high availability. Simplicity of design and consideration of operations can be the most effective long-term investment in uptime. Analysis of Operational Sustainability risks with appropriate mitigations will ensure uptime in an emergency.

Summary
Fuel systems must match the topology objective set for the entire facility. The chart outlines the benefits, drawbacks and Operational Sustainability considerations of different fuel solutions.

Keith Klesner’s career in critical facilities spans 14 years and includes responsibilities ranging from planning, engineering, design and construction to start-up and ongoing operation of data centers and mission-critical facilities. In the role of Uptime Institute vice president of Engineering, Mr. Klesner has provided leadership and strategic direction to maintain the highest levels of availability for leading organizations around the world. Mr. Klesner performs strategic-level consulting engagements, Tier Certifications and industry outreach—in addition to instructing premiere professional accreditation courses. Prior to joining the Uptime Institute, Mr. Klesner was responsible for the planning, design, construction, operation and maintenance of critical facilities for the U.S. government worldwide. His early career includes six years as a U.S. Air Force officer. He has a Bachelor of Science degree in Civil Engineering from the University of Colorado-Boulder and a Masters in Business Administration from the University of LaVerne. He maintains status as a professional engineer (PE) in Colorado and is a LEED-accredited professional.

2014 Uptime Institute Data Center Survey Results

June 10, 2014/in Executive/by Mark Harris

Uptime Institute Director of Content and Publications, Matt Stansberry, delivers the opening keynote from Uptime Institute Symposium 2014: Empowering the Data Center Professional. This is Uptime Institute’s fourth annual survey, and looks at data center budgets, energy efficiency metrics, and adoption trends in colocation and cloud computing.

Implementing Data Center Cooling Best Practices

Code mandates for data center operations

Server Decommissioning as a Discipline: Paul Nally, Barclays

ATD Interview: Christopher Johnston, Syska Hennessy

Fuel System Design and Reliability

Explaining the Uptime Institute’s Tier Classification System (April 2021 Update)

The Making of a Good Method of Procedure

A Look at Data Center Cooling Technologies

Data Center Cooling: CRAC/CRAH redundancy, capacity, and selection metrics

Implementing Data Center Cooling Best Practices