A Look at Data Center Cooling Technologies

Sabey optimizes air-cooled data centers through containment
By John Sasser

The sole purpose of data center cooling technology is to maintain environmental conditions suitable for information technology equipment (ITE) operation. Achieving this goal requires removing the heat produced by the ITE and transferring that heat to some heat sink. In most data centers, the operators expect the cooling system to operate continuously and reliably.

I clearly recall a conversation with a mechanical engineer who had operated data centers for many years. He felt that most mechanical engineers did not truly understand data center operations and design. He explained that most HVAC engineers start in office or residential design, focusing on comfort cooling, before getting into data center design. He thought that the paradigms they learn in those design projects don’t necessarily translate well to data centers.

It is important to understand that comfort cooling is not the primary purpose of data center cooling systems, even though data centers must be safe for the people who work in them. In fact, it is perfectly acceptable (and typical) for areas within a data center to be uncomfortable for long-term occupancy.

As with any well-engineered system, a data center cooling system should efficiently serve its function. Data centers can be very energy intensive, and it is quite possible for a cooling system to use as much (or more) energy as the computers it supports. Conversely, a well-designed and operated cooling system may use only a small fraction of the energy used by ITE.

In this article, I will provide some history on data center cooling. I will then discuss some of the technical elements of data center cooling, along with a comparison of data center cooling technologies, including some that we use in Sabey’s data centers.

The Economic Meltdown of Moore’s Law
In the early to mid-2000s, designers and operators worried about the ability of air-cooling technologies to cool increasingly power-hungry servers. With design densities approaching or exceeding 5 kilowatts (kW) per cabinet, some believed that operators would have to resort to technologies such as rear-door heat exchangers and other kinds of in-row cooling to keep up with the increasing densities.

In 2007, Ken Brill of the Uptime Institute famously predicted the Economic Meltdown of Moore’s Law. He said that the increasing amount of heat resulting from fitting more and more transistors onto a chip would reach an endpoint at which it would no longer be economically feasible to cool the data center without significant advances in technology (see Figure 1).

Figure 1. ASHRAE New Datacom Equipment Power Chart, published February 1, 2005


The U.S. Congress even got involved. National leaders had become aware of data centers and the amount of energy they require. Congress directed the U.S. Environmental Protection Agency (EPA) to submit a report on data center energy consumption (Public Law 109-341). This law also directed the EPA to identify efficiency strategies and drive the market for efficiency. This report projected vastly increasing energy use by data centers unless measures were taken to significantly increase efficiency (see Figure 2).

Figure 2. Chart ES-1 from EPA report dated August 2, 2007


As of 2014, Moore’s Law has not yet failed. When it does, the end will be a result of physical limitations involved in the design of chips and transistors, having nothing to do with the data center environment.

At about the same time that EPA published its data center report, industry leaders took note of efficiency issues. ITE manufacturers began to place a greater emphasis on efficiency, in addition to performance, in their designs; data center designers and operators began designing for efficiency as well as reliability and cost; and operators started to realize that efficiency does not require a sacrifice of reliability.

Legacy Cooling and the End of Raised Floor
For decades, computer rooms and data centers utilized raised floor systems to deliver cold air to servers. Cold air from a computer room air conditioner (CRAC) or computer room air handler (CRAH) pressurized the space below the raised floor. Perforated tiles provided a means for the cold air to leave the plenum and enter the main space—ideally in front of server intakes. After passing through the server, the heated air returned to the CRAC/CRAH to be cooled, usually after mixing with the cold air. Very often, the CRAC unit’s return temperature was the set point used to control the cooling system’s operation. Most commonly the CRAC unit fans ran at a constant speed, and the CRAC had a humidifier within the unit that produced steam. The primary benefit of a raised floor, from a cooling standpoint, is to deliver cold air where it is needed, with very little effort, by simply swapping a solid tile for a perforated tile (see Figure 3).

Figure 3. Legacy raised floor cooling


For many years, this system was the most common design for computer rooms and data centers. It is still employed today. In fact, I still find many operators who are surprised to enter a modern data center and not find raised floor and CRAC units.

The legacy system relies on one of the principles of comfort cooling: deliver a relatively small quantity of conditioned air and let that small volume of conditioned air mix with the larger volume of air in the space to reach the desired temperature. This system worked okay when ITE densities were low. Low densities enabled the system to meet its primary objective despite its flaws, such as poor efficiency and uneven cooling.

At this point, it is an exaggeration to say the raised floor is obsolete. Companies still build data centers with raised floor air delivery. However, more and more modern data centers do not have raised floor simply because improved air delivery techniques have rendered it unnecessary.

How Cold is Cold Enough?
“Grab a jacket. We’re going in the data center.”

Heat must be removed from the vicinity of the ITE electrical components to avoid overheating the components. If a server gets too hot, onboard logic will turn it off to avoid damage to the server.

ASHRAE Technical Committee 9.9 (TC 9.9) has done considerable work in the area of determining suitable environments for ITE. I believe their publications, especially Thermal Guidelines for Data Processing Equipment, have facilitated the transformation of data centers from the “meat lockers” of legacy data centers to more moderate temperatures. [Editor’s note: The ASHRAE Technical Committee TC9.9 guideline recommends that the device inlet be between 18-27°C and 20-80% relative humidity (RH) to meet the manufacturer’s established criteria. Uptime Institute further recommends that the upper limit be reduced to 25°C to allow for upsets, variable conditions in operation, or to compensate for errors inherent in temperature sensors and/or controls systems.]

It is extremely important to understand that the TC 9.9 guidelines are based on server inlet temperatures, not internal server temperatures, not room temperatures, and certainly not server exhaust temperatures. It is also important to understand the concepts of Recommended and Allowable conditions.

If a server is kept too hot, but not so hot that it turns itself off, its lifespan could be reduced. Generally speaking, this lifespan reduction is a function of the high temperatures the server experiences and the duration of that exposure. In providing a broader Allowable range, ASHRAE TC 9.9 suggests that ITE can be exposed to the higher temperatures for more hours each year.
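
As a rough illustration of the Recommended/Allowable distinction, the sketch below classifies a server inlet reading. The Recommended band is the 18-27°C range quoted in the editor's note above; the Allowable band shown here is only an illustrative assumption and should be replaced with the value for the actual equipment class. The names `InletRange` and `classify_inlet` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InletRange:
    """Inclusive temperature band in degrees Celsius."""
    low_c: float
    high_c: float

    def contains(self, temp_c: float) -> bool:
        return self.low_c <= temp_c <= self.high_c


# Recommended band quoted in the editor's note (18-27 C).
RECOMMENDED = InletRange(18.0, 27.0)
# ASSUMPTION: illustrative Allowable band only; look up the band for the
# actual equipment class in the ASHRAE guideline before relying on it.
ALLOWABLE_EXAMPLE = InletRange(15.0, 32.0)


def classify_inlet(temp_c: float,
                   recommended: InletRange = RECOMMENDED,
                   allowable: InletRange = ALLOWABLE_EXAMPLE) -> str:
    """Classify a server INLET temperature (not room or exhaust air)."""
    if recommended.contains(temp_c):
        return "recommended"
    if allowable.contains(temp_c):
        return "allowable"
    return "out of range"
```

An operator following the Uptime Institute suggestion in the editor's note could pass `InletRange(18.0, 25.0)` as the `recommended` argument to leave margin for sensor error and upsets.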

Given that technology refreshes can occur as often as every 3 years, ITE operators should consider how relevant the lifespan reduction is to their operations. The answer may depend on the specifics of a given situation. In a homogeneous environment with a refresh rate of 4 years or less, the increase in failure rate at higher temperatures may be insufficient to drive cooling design, especially if the manufacturer will warrant the ITE at higher temperatures. In a mixed environment with equipment of longer expected life spans, temperatures may warrant increased scrutiny.

In addition to temperature, humidity and contamination can affect ITE. Humidity and contamination tend to only affect ITE when the ITE is exposed to unacceptable conditions for a long period of time. Of course, in extreme cases (if someone dumped a bucket of water or dirt on a computer) one would expect to see an immediate effect.

The concern about low humidity involves electrostatic discharge (ESD). As most people have experienced, in an environment with less moisture in the air (lower humidity), ESD events are more likely. However, ESD concerns related to low humidity in a data center have been largely debunked. In “Humidity Controls for Data Centers – Are They Necessary” (ASHRAE Journal, March 2010), Mark Hydeman and David Swenson wrote that ESD was not a real threat to ITE, as long as it stayed in the chassis. On the flip side, tight humidity control is no guarantee of protection against ESD for ITE with its casing removed. A technician removing the casing to work on components should use a wrist strap.

High humidity, on the other hand, does appear to pose a realistic threat to ITE. While condensation should definitely not occur, it is not a significant threat in most data centers. The primary threat is something called hygroscopic dust particles. Basically, higher humidity can make dust in the air more likely to stick to electrical components in the computer. When dust sticks, it can reduce heat transfer and possibly cause corrosion to those components. The effect of reduced heat transfer is very similar to that caused by high temperatures.

There are several threats related to contamination. Dust can coat electronic components, reducing heat transfer. Certain types of dust, called zinc whiskers, are conductive. Zinc whiskers have been most commonly found in electroplated raised floor tiles. The zinc whiskers can become airborne and land inside a computer. Since they are conductive, they can actually cause damaging shorts in tiny internal components. Uptime Institute documented this phenomenon in a paper entitled “Zinc Whiskers Growing on Raised-Floor Tiles Are Causing Conductive Failures and Equipment Shutdowns.”

In addition to the threats posed by physical particulate contamination, there are threats related to gaseous contamination. Certain gases can be corrosive to the electronic components.

Cooling Process
The cooling process can be broken into steps:

1. Server Cooling. Removing heat from ITE

2. Space Cooling. Removing heat from the space housing the ITE

3. Heat Rejection. Rejecting the heat to a heat sink outside the data center

4. Fluid Conditioning. Tempering and returning fluid to the white space, to maintain appropriate conditions within the space.

Server Cooling
ITE generates heat as the electronic components within the ITE use electricity. It’s basic physics: energy is conserved. When we say a server uses electricity, we mean the server’s components are effectively converting the energy from electricity to heat.

Heat transfers from a solid (the electrical component) to a fluid (typically air) within the server, often via another solid (heat sinks within the server). ITE fans draw air across the internal components, facilitating this heat transfer.

Some systems make use of liquids to absorb and carry heat from ITE. In general, liquids perform this function more efficiently than air. I have seen three such systems:

• Liquid contact with a heat sink. A liquid flows through a server and makes contact with a heat sink inside the equipment, absorbing heat and removing it from the ITE.

• Immersion cooling. ITE components are immersed in a non-conductive liquid. The liquid absorbs the heat and transfers it away from the components.

• Dielectric fluid with state change. ITE components are sprayed with a non-conductive liquid. The liquid changes state and takes heat away to another heat exchanger, where the fluid rejects the heat and changes state back into a liquid.

In this article, I focus on systems associated with air-cooled ITE, as that is by far the most common method used in the industry.

Space Cooling
In legacy data center designs, heated air from servers mixes with other air in the space and eventually makes its way back to a CRAC/CRAH unit. The air transfers its heat, via a coil, to a fluid within the CRAC/CRAH. In the case of a CRAC, the fluid is a refrigerant. In the case of a CRAH, the fluid is chilled water. The refrigerant or chilled water removes the heat from the space. The air coming out of the CRAC/CRAH often has a discharge temperature of 55-60°F (13-15.5°C). The CRAC/CRAH blows the air into a raised floor plenum—typically using constant-speed fans. The standard CRAC/CRAH configuration from many manufacturers and designers controls the unit’s cooling based on return air temperature.

Layout and Heat Rejection Options
While legacy raised floor cooling worked okay in low-density spaces where no one paid attention to efficiency, it could not meet the demands of increasing heat density and efficiency, at least not as it had been historically used. I have been in legacy data centers with temperature gauges, and I’ve measured temperatures around 60°F (15.5°C) at the base of a rack and temperatures near 80°F (27°C) at the top of the same rack. I have also calculated PUEs well in excess of two.
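
For readers less familiar with the metric, PUE (power usage effectiveness) is total facility power divided by ITE power, so a PUE above two means the supporting systems draw more power than the ITE itself. The minimal sketch below shows the arithmetic; the function names and the sample loads in the usage note are hypothetical.

```python
def pue(total_facility_kw: float, ite_kw: float) -> float:
    """Power usage effectiveness: total facility power / ITE power."""
    if ite_kw <= 0:
        raise ValueError("ITE power must be positive")
    return total_facility_kw / ite_kw


def overhead_per_ite_kw(pue_value: float) -> float:
    """Non-ITE power (cooling, electrical losses, lighting) per kW of ITE."""
    return pue_value - 1.0
```

For example, a hypothetical facility drawing 2,200 kW in total to support 1,000 kW of ITE has `pue(2200, 1000)` of 2.2, meaning 1.2 kW of overhead for every kW delivered to the computers.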

People began to employ best practices and technologies including Hot Aisles and Cold Aisles, ceiling return plenums, raised floor management, and server blanking panels to improve the cooling performance in raised floor environments. These methods are definitely beneficial, and operators should use them.

Around 2005, design professionals and operators began to experiment with the idea of containment. The idea is simple: use a physical barrier to separate cool server intake air from heated server exhaust air. Preventing cool supply air and heated exhaust air from mixing provides a number of benefits, including:

• More consistent inlet air temperatures

• The temperature of air supplied to the white space can be raised, improving options for efficiency

• The temperature of air returning to the coil is higher, which typically makes it operate more efficiently

• The space can accommodate higher density equipment

Ideally, in a contained environment, air leaves the air handling equipment at a temperature and humidity suitable for ITE operation. The air goes through the ITE only once and then returns to the air handling equipment for conditioning.
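
To see why a single pass through the ITE matters, consider a simplified steady-state model in which some fraction of the air entering a server is recirculated exhaust rather than fresh supply air. This is only a sketch under that assumption (real rooms have stratification and uneven airflow), and the function name and parameters are hypothetical.

```python
def inlet_temp_with_recirculation(supply_f: float,
                                  server_delta_t_f: float,
                                  recirculated_fraction: float) -> float:
    """Steady-state server inlet temperature (deg F) when a fraction r of
    the inlet air is recirculated exhaust instead of supply air.

    Solves T_in = (1 - r) * T_supply + r * (T_in + dT) for T_in.
    """
    r = recirculated_fraction
    if not 0.0 <= r < 1.0:
        raise ValueError("recirculated_fraction must be in [0, 1)")
    return supply_f + r * server_delta_t_f / (1.0 - r)
```

With perfect containment (`recirculated_fraction=0.0`) the inlet temperature equals the supply temperature, which is what allows the supply temperature to be raised. As the recirculated fraction grows, the supply air must be made colder to hold the same inlet temperature.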

Hot Aisle Containment vs. Cold Aisle Containment
In a Cold Aisle containment system, cool air from air handlers is contained, while hot server exhaust air is allowed to return freely to the air handlers. In a Hot Aisle containment system, hot exhaust air is contained and returns to the air handlers, usually via a ceiling return plenum (see Figure 4).

Figure 4. Hot Aisle containment


Cold Aisle containment can be very useful in a raised floor retrofit, especially if there is no ceiling return plenum. In such a case, it might be possible to leave the cabinets more or less as they are, as long as they are in a Cold Aisle/Hot Aisle arrangement. One builds the containment system around the existing Cold Aisles.

Most Cold Aisle containment environments are used in conjunction with raised floor. It is also possible to use Cold Aisle containment with another delivery system, such as overhead ducting. The raised floor option allows for some flexibility; it is much more difficult to move a duct, once it is installed.

In a raised floor environment with multiple Cold Aisle pods, the volume of cold air delivered to each pod depends largely on the number of floor tiles deployed within each of the containment areas. Unless one builds an extremely high raised floor, the amount of air that can go to a given pod is going to be limited. High raised floors can be expensive to build because the heavy ITE must sit on top of the raised floor.

In a Cold Aisle containment data center, one must typically assume that airflow requirements for a pod will not vary significantly on a regular basis. It is not practical to frequently switch out floor tiles or even adjust floor tile dampers. In some cases, a software system that uses CFD modeling to determine airflows based on real-time information can then control air handler fan speeds in an attempt to get the right amount of air to the right pods. Even so, there are limits to how much air can be delivered to a pod with any given tile configuration; one must still try to have about the right number of floor tiles in the proper position.

In summary, Cold Aisle containment works best in instances where the designer and operator have confidence in the layout of ITE cabinets and in instances where the loading of the ITE does not change much, nor vary widely.

I prefer Hot Aisle containment in new data centers because it increases flexibility. In a properly designed Hot Aisle containment data center, the operator can deploy a full pod or chimney cabinets, and the cabinet layouts can vary. One simply connects the pod or chimney to the ceiling plenum and cuts or removes ceiling tiles to allow hot air to enter it.

In a properly controlled Hot Aisle containment environment, the ITE determines how much air is needed. There is significant flexibility in density. The cooling system floods the room with temperate air. As air is removed from the cool side of the room by server fans, the lower pressure area causes more air to flow to replace it.

Ideally, the server room has a large, open ceiling plenum, with clear returns to the air handling equipment. It is easier to have a large, open ceiling plenum than a large, open raised floor, because the ceiling plenum does not have to support the server cabinets. The air handlers remove air from the ceiling return plenum. Sabey typically controls fan speed based on differential pressure (dP) between the cool air space and the ceiling return plenum. Sabey attempts to keep the dP slightly negative in the ceiling return plenum, with respect to the cool air space. In this manner, any small leaks in containment cause cool air to go into the plenum. The air handler fans ramp up or down to maintain the proper airflow.
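
The differential-pressure scheme described above can be sketched as a simple proportional adjustment. This is not Sabey's actual control sequence: the set point, gain, speed limits, and units (inches of water column) are all assumptions chosen for illustration, and dP is defined here as plenum pressure minus cool-space pressure.

```python
def next_fan_speed(current_speed_pct: float,
                   measured_dp_in_wc: float,
                   target_dp_in_wc: float = -0.01,    # ASSUMED set point
                   gain_pct_per_in_wc: float = 500.0,  # ASSUMED tuning gain
                   min_speed_pct: float = 20.0,
                   max_speed_pct: float = 100.0) -> float:
    """Return the next air handler fan speed command in percent.

    dP = (ceiling return plenum pressure) - (cool air space pressure).
    If dP drifts above the slightly negative target, hot air is building
    up in the plenum, so the fans speed up to draw more air out of it.
    If dP falls below the target, the fans slow down.
    """
    error = measured_dp_in_wc - target_dp_in_wc
    new_speed = current_speed_pct + gain_pct_per_in_wc * error
    # Clamp to the fan's permitted operating range.
    return max(min_speed_pct, min(max_speed_pct, new_speed))
```

A production sequence would normally add integral action, rate limiting, and sensor validation; the point of the sketch is only that the ITE's own airflow demand drives the fan speed through the pressure signal.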

Hot Aisle containment requires a much simpler control scheme and provides more flexible cabinet layouts than a typical Cold Aisle containment system.

In one rather extreme example, Sabey deployed six customer racks in a 6,000 ft² space pulling a little more than 35 kilowatts (kW) per rack. The racks were all placed in a row. Sabey allowed about 24 inches between the racks and built a Hot Aisle containment pod around them. Many data centers would have trouble accommodating such high density racks. A more typical utilization in the same space might be 200 racks (30 ft² per rack) at 4.5 kW/rack. Other than building the pod, Sabey did not have to take any sort of custom measures for the cooling. The operations sequence worked as intended, simply ramping up the air handler fans a bit to compensate for the increased airflow. These racks have been operating well for almost a year.
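
The arithmetic behind this example is worth making explicit. The sketch below uses only the figures quoted above (35 kW is taken as a lower bound for the "little more than 35 kW" racks); the helper names are hypothetical.

```python
def total_load_kw(racks: int, kw_per_rack: float) -> float:
    """Total ITE load for a number of identical racks."""
    return racks * kw_per_rack


def density_w_per_ft2(load_kw: float, area_ft2: float) -> float:
    """Average heat density over the floor area, in watts per square foot."""
    return load_kw * 1000.0 / area_ft2


AREA_FT2 = 6000.0

typical_kw = total_load_kw(200, 4.5)   # 200 racks at 4.5 kW each
dense_kw = total_load_kw(6, 35.0)      # six racks at roughly 35 kW each

typical_density = density_w_per_ft2(typical_kw, AREA_FT2)
dense_density = density_w_per_ft2(dense_kw, AREA_FT2)
```

The typical layout works out to 900 kW, or 150 W/ft² averaged over the room, while the six dense racks total about 210 kW, or roughly 35 W/ft². The room-level load is modest; the challenge is that it is concentrated in a single row, which is what the containment pod addresses.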

Hot Aisle containment systems tend to provide higher volumes of conditioned air compared to Cold Aisle containment, which is a minor benefit. In a Cold Aisle containment system, the volume of conditioned air in a data center at any given time is the volume of air in the supply plenum (whether that is a raised floor or overhead duct) plus the volume of air in the contained Cold Aisles. This volume is typically less than the volume in the remainder of the room. In a Hot Aisle containment system, the room is flooded with conditioned air. The volume of hot air is typically limited to the air inside the Hot Aisle containment and the ceiling return plenum.

Hot Aisle containment also allows operators to remove raised floor from the design. Temperate air floods the room, often from the perimeter. The containment prevents mixing, so air does not have to be delivered immediately in front of the ITE. Removing raised floor reduces the initial costs and the continuing management headache.

There is one factor that could lead operators to continue to install raised floor. If one anticipates direct liquid cooling during the lifespan of the data center, a raised floor may make a very good location for the necessary piping.

Close-Coupled Cooling
There are other methods of removing heat from white spaces, including in-row and in-cabinet solutions. For example, rear-door heat exchangers accept heat from servers and remove it from a data center via a liquid.

In-row cooling devices are placed near the servers, typically as a piece of equipment placed in a row of ITE cabinets. There are also systems that are located above the server cabinets.

These close-coupled cooling systems reduce the fan energy required to move air. These types of systems do not strike me as being optimal for Sabey’s business model. I believe such a system would likely be more expensive and less flexible than Hot Aisle containment layouts for accommodating unknown future customer requirements, which is important for Sabey’s operation. Close-coupled cooling solutions can have good applications, such as increasing density in legacy data centers.

Heat Rejection
After server heat is removed from a white space, it must be rejected to a heat sink. The most common heat sink is the atmosphere. Other choices include bodies of water or the ground.

There are various methods of transferring data center heat to its ultimate heat sink. Here is a partial list:

• CRAH units with water-cooled chillers and cooling towers

• CRAH units with air-cooled chillers

• Split system CRAC units

• CRAC units with cooling towers or fluid coolers

• Pumped liquid (e.g., from in-row cooling) and cooling towers

• Airside economization

• Airside economization with direct evaporative cooling (DEC)

• Indirect evaporative cooling (IDEC)

Economizer Cooling
Most legacy systems include some form of refrigerant-based thermodynamic cycle to obtain the desired environmental conditions. Economization is cooling in which the refrigerant cycle is turned off—either part or all of the time.

Airside economizers draw in outside air, which is often mixed with return air to obtain the right conditions before it enters the data center. IDEC is a variation of this in which the outside air does not enter the data center but receives heat from the inside air via a solid heat exchanger.

Evaporative cooling (either direct or indirect) systems use evaporated water to supplement the availability of economizer cooling or more efficient refrigerant-based cooling. The state change of water absorbs energy, lowering the dry bulb temperature to a point where it approaches the wet bulb (saturated) temperature of the air (see Figure 5).

Figure 5. Direct evaporative cooling (simplified)
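
A common way to estimate the effect of direct evaporative cooling is the saturation effectiveness relation, in which the leaving dry bulb temperature moves a fixed fraction of the way from the entering dry bulb toward the entering wet bulb. The sketch below assumes that relation; the default effectiveness of 0.85 is an illustrative assumption, not a figure from this article.

```python
def dec_leaving_dry_bulb_f(entering_dry_bulb_f: float,
                           entering_wet_bulb_f: float,
                           effectiveness: float = 0.85) -> float:
    """Estimate the dry bulb temperature leaving a direct evaporative
    cooler: T_out = T_db - effectiveness * (T_db - T_wb).

    effectiveness = 1.0 would bring the air all the way to its wet bulb
    temperature (saturation); real media fall somewhat short of that.
    """
    if not 0.0 <= effectiveness <= 1.0:
        raise ValueError("effectiveness must be between 0 and 1")
    if entering_wet_bulb_f > entering_dry_bulb_f:
        raise ValueError("wet bulb cannot exceed dry bulb")
    depression = entering_dry_bulb_f - entering_wet_bulb_f
    return entering_dry_bulb_f - effectiveness * depression
```

The relation also shows why evaporative cooling suits dry climates: the larger the gap between dry bulb and wet bulb, the more cooling is available.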


In waterside economizer systems, the refrigerant cycle is not required when outside conditions are cold enough to achieve the desired chilled water temperature set points. The chilled water passes through a heat exchanger and rejects the heat directly to the condenser water loop.
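
Whether outside conditions are "cold enough" can be approximated by adding equipment approach temperatures to the outdoor wet bulb temperature and comparing the result with the chilled water set point. The sketch below is a simplification under assumed approach values (7°F for the cooling tower, 3°F for the heat exchanger); actual values depend on the equipment selected, and the function names are hypothetical.

```python
def economizer_water_temp_f(outdoor_wet_bulb_f: float,
                            tower_approach_f: float = 7.0,  # ASSUMED
                            hx_approach_f: float = 3.0      # ASSUMED
                            ) -> float:
    """Coldest chilled water temperature the economizer can produce:
    outdoor wet bulb plus the tower and heat exchanger approaches."""
    return outdoor_wet_bulb_f + tower_approach_f + hx_approach_f


def chillers_required(outdoor_wet_bulb_f: float,
                      chw_setpoint_f: float,
                      tower_approach_f: float = 7.0,
                      hx_approach_f: float = 3.0) -> bool:
    """True when the economizer alone cannot reach the chilled water
    set point, so the refrigerant cycle must run."""
    available = economizer_water_temp_f(
        outdoor_wet_bulb_f, tower_approach_f, hx_approach_f)
    return available > chw_setpoint_f
```

One consequence is that raising the chilled water set point, which containment makes possible, increases the number of hours per year in which `chillers_required` returns `False`.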

Design Criteria
In order to design a cooling system, the design team must agree upon certain criteria.

Heat load (most often measured in kilowatts) typically gets the most attention. Most often, heat load actually includes two elements: total heat to be rejected and the density of that heat. Traditionally, data centers have measured heat density in watts per square foot. Many postulate that density should actually be measured in kilowatts per cabinet, which is very defensible in cases where one knows the number of cabinets to be deployed.

Airflow receives less attention than heat load. Many people use computational fluid dynamics (CFD) software to model airflow. These programs can be especially useful in non-contained raised floor environments.

In all systems, but especially in contained environments, it is important that the volume of air produced by the cooling system meet the ITE requirement. There is a direct relationship between heat gain through a server, power consumed by the server, and airflow through that server. Heat gain through a server is typically measured by the temperature difference between the server intake and server exhaust or delta T (∆T). Airflow is measured in volume over time, typically cubic feet per minute (CFM).

Assuming load has already been determined, a designer should know (or, more realistically, assume) a ∆T. If the designer does not assume a ∆T, the designer leaves it to the equipment manufacturer to determine the design ∆T, which could result in airflow that does not match the requirements.

I typically ask designers to assume a 20°F (11°C) ∆T. Higher density equipment, such as blades, typically has higher ∆T. However, most commodity servers are doing well to get as high as a 20°F (11°C) ∆T. (Proper containment and various set points can also make a tremendous difference.)
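
The relationship between load, ∆T, and airflow can be sketched with the standard sensible heat approximation for air, Q (BTU/hr) = 1.08 × CFM × ∆T (°F). The 1.08 factor assumes air at roughly sea-level density, so treat the results as estimates; the function name is hypothetical.

```python
BTU_PER_HR_PER_KW = 3412.14   # 1 kW expressed in BTU/hr
SENSIBLE_HEAT_FACTOR = 1.08   # BTU/hr per (CFM * deg F), standard air


def required_cfm(load_kw: float, delta_t_f: float) -> float:
    """Airflow (CFM) needed to carry load_kw of heat at a given air
    temperature rise across the ITE, using Q = 1.08 * CFM * dT."""
    if delta_t_f <= 0:
        raise ValueError("delta T must be positive")
    return load_kw * BTU_PER_HR_PER_KW / (SENSIBLE_HEAT_FACTOR * delta_t_f)
```

At a 20°F ∆T this works out to roughly 158 CFM per kW. Because airflow is inversely proportional to ∆T, equipment that achieves only a 10°F rise needs twice the airflow for the same load.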

The risk of designing a system in which the design ∆T is higher than the actual ∆T is that the system will not be able to deliver the necessary airflow/cooling, because ITE with a lower ∆T needs more air to carry away the same amount of heat. The risk in going the other way is that the owner will have purchased more capacity than the design goals otherwise warrant.

The Design Day equals the most extreme outside air conditions the design is intended to handle. The owner and designers have to decide how hot is hot enough, as it affects the operation of the equipment. In Seattle, in the 100 years before July 29, 2009, there was not a recorded ambient temperature above 100°F (38°C) (as measured at SeaTac airport). Also keep in mind that equipment is often located (especially on the roof) where temperatures are higher than are experienced at official weather stations.

An owner must determine what the temperature and humidity should be in the space. Typically, this is specified for a Design Day when N equipment is operating and redundant units are off-line. Depending on the system, the designers will determine air handler discharge set points based on these conditions, making assumptions and/or calculations of temperature increases between the air handler discharge and the server inlet. There can be opportunities for more efficient systems if the owner is willing to go into the ASHRAE Allowable range during extreme outside temperatures and/or during upset conditions such as utility interruptions. Sabey typically seeks to stay within the ASHRAE Recommended range in its business model.

The owner and designer should understand the reliability goals of the data center and design mechanical, electrical, and controls to support these reliability goals. Of course, when considering these items, the design team may be prone to overbuilding. If the design team assumes an extreme Design Day, adds in redundant equipment, specifies the low end of the ASHRAE Recommended range, and then maybe adds a little percentage on top, just in case, the resulting system can be highly reliable, if designed and operated appropriately. It can also be too expensive to build and inefficient to operate.

It is worth understanding that data centers do not typically operate at design load. In fact, during much of a data center’s lifespan, it may operate in a lightly loaded state. Operators and designers should spend some time making the data center efficient in those conditions, not just as it approaches design load. Sabey has made design choices that allow it not only to cool efficiently, but also to cool efficiently at light loads. Figure 6 shows that Sabey reached average PUE conditions of 1.20 at only 10% loading at one of its operating data centers.

Figure 6. PUE and design load (%) over time.


Crystal Ball
While very high density ITE is still being built and deployed, the density of most ITE has not kept up with the increases projected 10 years ago. Sabey was designing data centers at an average 150 watts/ft² 6 years ago, and the company has not yet seen a reason to increase that. Of course, Sabey can accommodate significantly higher localized densities where needed.

In the near future, I expect air-based cooling systems with containment to continue to be the system of choice for cooling data centers. In the long term, I would not be surprised to see increasing adoption of liquid-cooling technologies.

Conclusion
Sabey Data Centers develops and operates data centers. It has customers in many different verticals and of many different sizes. As a service provider, Sabey does not typically know the technology or layout its customers will require. Sabey’s data centers use different cooling technologies, suitable to the location. Sabey has data centers in the mild climate of Seattle, the semi-arid climate of central Washington, and in downtown New York City. Sabey’s data centers are housed in single-story greenfield buildings and in a redeveloped high-rise.

Despite these variations and uncertainties, all the data centers Sabey designs and operates have certain common elements. They all use Hot Aisle containment without raised floor. All have a ceiling return plenum for server exhaust air and flood the room for the server inlet air. These data centers all employ some form of economizer. Sabey seeks to operate efficiently in lightly loaded conditions, with variable speed motors for fans, pumps, and chillers, where applicable.

Sabey has used a variety of different mechanical systems with Hot Aisle containment, and I tend to prefer IDEC air handlers, where practical. Sabey has found that this is a very efficient system with lower water use than the name implies. Much of the time, the system is operating in dry heat exchanger mode. The system tends to facilitate very simple control sequencing, and that simplicity enhances reliability. The systems restart rapidly, which is good in utility interruptions. The fans keep spinning and ramp up as soon as the generators start providing power. Water remains in the sump, so the evaporative cooling process requires essentially no restart time. Sabey has successfully cooled racks drawing 35-40 kW with no problem.

Until there is broad adoption of liquid-cooled servers, the primary opportunities appear to be in optimizing air-cooled, contained data centers.


John Sasser


John Sasser brings more than 20 years of management experience to the operations of Sabey Data Centers’ portfolio of campuses. In addition to all day-to-day operations, start-ups and transitions, he is responsible for developing the conceptual bases of design and operations for all Sabey data centers, managing client relationships, overseeing construction projects, and overall master planning.

Mr. Sasser and his team have received recognition from a variety of organizations, including continuous uptime awards from the Uptime Institute and energy conservation awards from Seattle City Light and the Association of Energy Engineers.

Prior to joining Sabey, he worked for Capital One and Walt Disney Company. Mr. Sasser also spent 7 years with the Navy Civil Engineer Corps.

AIG Tells How It Raised Its Level of Operations Excellence

By Kevin Heslin and Lee Kirby

Driving operational excellence across multiple data centers is exponentially more difficult than managing just one. Technical complexity multiplies as you move to different sites, regions, and countries where codes, cultures, climates, and other factors are different. Organizational complexity further complicates matters when the data centers in your portfolio have different business requirements.

With little difficulty, an organization can focus on staffing, maintenance planning and execution, training and operations for a single site. Managing a portfolio turns the focus from projects to programs and from activity to outcomes. Processes become increasingly complex and critical. In this series of interviews, you will hear from practitioners about the challenges and lessons they have drawn from their experiences. You will find that those who thrive in this role share the understanding that Operational Excellence is not an end state, but a state of mind.

This interview is part of a series of conversations with executives who are managing diverse data center portfolios. The interviewees in this series participated in a panel at Uptime Institute Symposium 2015, discussing their use of the Uptime Institute Management & Operations (M&O) Stamp of Approval to drive standardization across data center operations.

Herb Alvarez: Director of Global Engineering and Critical Facilities
American International Group

An experienced staff was empowered to improve infrastructure, staffing, processes, and programs

What’s the greatest challenge managing your current footprint?

Providing global support and oversight via a thin staffing model can be difficult, but due to the organizational structure and the relationship with our global FM alliance partner (CBRE) we have been able to improve service delivery, manage cost, and enhance reliability. From my perspective, the greatest challenges have been managing the cultural differences of the various regions, followed by the limited availability of qualified staffing in some of the regions. With our global FM partner, we can provide qualified coverage for approximately 90% of our portfolio; the remaining 10% is where we see some of these challenges.

 

Do you have reliability or energy benchmarks?

We continue to make energy efficiency and sustainability a core requirement of our data center management practice. Over the last few years we retrofitted two existing data center pods at our two global data centers and we replaced EOL (end of life) equipment with best-in-class, higher efficiency systems. The UPS systems that we installed achieve a 98% efficiency rating while operating in ESS mode and a 94 to 96% rating while operating in VMMS mode. In addition, the new cooling systems were installed with variable flow controls and VFDs for the chillers, pumps, and CRAHs, along with full cold aisle containment and multiple control algorithms to enhance operating efficiency. Our target operating model for the new data center pods was to achieve a Tier III level of reliability along with a 1.75 PUE, and we achieved both of these objectives. The next step on our energy and sustainability path is to seek Energy Star and other industry recognitions.

 

Can you tell me about your governance model and how that works?

My group in North America is responsible for the strategic direction and the overall management for the critical environments around the world. We set the standards (design, construction, operations, etc.), guidelines, and processes. Our regional engineering managers, in turn, carry these out at the regional level. At the country level, we have the tactical management (FM) that ultimately implements the strategy. We subscribe to a system of checks and balances, and we have incorporated global and regional auditing to ensure that we have consistency throughout the execution phase. We also incorporate KPIs to promote the high level of service delivery that we expect.

 

From your perspective, what is the greatest difficulty in making that model work, ensuring that the design ideas are appropriate for each facility, and that they are executed according to your standards?

The greatest difficulties encountered were attributed to the cultural differences between regions. Initially, we encountered some resistance at the international level with regard to broad acceptance of design standards and operating standards. However, with the support of executive senior leadership and the on-going consolidation effort, we achieved global acceptance through a persistent and focused effort. We now have the visibility and oversight to ensure that our standards and guidelines are being enforced across the regions. It is important to mention that our standards, although rigid, do have flexible components embedded in them because a “one size fits all” regimen is not always feasible. For these instances, we incorporated an exception process that grants the required flexibility to deviate from a documented standard. In terms of execution, we now have the ability via “in-country” resources to validate designs and their execution.

It also requires changing the culture, even within our own corporate group. For example, we have a Transactions group that starts the search for facilities. Our group said that we should only be in this certain type of building, this quality of building, so we created some standards and minimum requirements. We said, “We are AIG. We are an insurance company. We can’t go into a shop house.” This was a cultural change, because Transactions always looked for the lowest cost option first.

The AIG name is at stake. Anything we do that is deficient has the potential to blemish the brand.

 

Herb, it sounds like you are describing a pretty successful program. And yet, I am wondering if there are things that you would do differently if you were starting from scratch.

If it were a clean slate, and a completely new start, I would look to use an M&O type of assessment at the onset of any new initiatives as it relates to data center space acquisition. Utilizing M&O as a widely accepted and recognized tool would help us achieve consistency across data centers and would validate colo provider capabilities as it relates to their operational practices.

 

How do M&O stamps help the organization, and which parts of your operations do they influence the most?

I see two clear benefits. From the management and operations perspective, the M&O Stamp offers us a proven methodology of assessing our M&O practice, not only validating our program but also offering a level of benchmarking against other participants of the assessments. The other key benefit is that the M&O stamp helps us promote our capabilities within the AIG organization. Often, we believe that we are operationally on par with the industry, but a third-party validation from a globally accepted and recognized organization helps further validate our beliefs and our posture as it relates to the quality of the service delivery that we provide. We look at the M&O stamp as an on-going certification process that ensures that we continually uphold the underlying principles of management and operations excellence, a badge of honor if you will.

 

AIG has been awarded two M&O Stamps of Approval in the U.S. I know you had similar scores on the two facilities. Were the recommendations similar?

I expected more commonality between both of the facilities. When you have a global partner, you expect consistency across sites. In these cases, there were about five recommendations for each site; two of them were common to both sites. The others were not. It highlighted the need for us to re-assess the operation in several areas, and remediate where necessary.

 

Of course you have way more than two facilities. Were you able to look at those reports and those recommendations and apply them universally?

Oh, absolutely. If there was a recommendation specific to one site, we did not look at it just for that site. We looked to leverage that across the portfolio. It only makes sense, as it applies to our core operating principles of standardizing across the portfolio.

 

Is setting KPIs for operations performance part of your FM vendor management strategy?

KPIs are very important to the way we operate. They allow us to set clear and measurable performance indicators that we utilize to gauge our performance. The KPIs drive our requirement for continuous improvement and development. We incentivize our alliance partner and its employees based on KPI performance, which helps drive operational excellence.

 

Who do you share the information with and who holds you accountable for improvements in your KPIs?

That’s an interesting question. This information is shared with our senior management as it forms our year-over-year objectives and is used as a basis for our own performance reviews and incentive packages. We review our KPIs on an on-going basis to ensure that we are trending positively; we re-assess the KPIs on an annual basis to ensure that they remain relevant to the desired corporate objectives. During the last several years one of our primary KPIs has been to drive cost reductions to the tune of 5% reductions across the portfolio.

 

Does implementing those reductions become part of staff appraisals?

For my direct reports, the answer is yes. The reductions become part of their annual objectives; they have to be measurable, and we have to agree that they are achievable. We track progress on a regular basis and communicate progress via our quarterly employee reviews. Again, we are very careful that any such reductions do not adversely impact our operations or detract us from achieving our uptime requirements.

 

Do you feel that AIG has mastered demand management so you can effectively plan, deploy, and manage capacity at the speed of the client?

I think that we have made significant improvements over the last few years in terms of capacity planning, but I do believe that this is an area where we can still continue to improve. Our capacity planning team does a very good job of tracking, trending, and projecting workloads. But there is ample opportunity for us to become more granular on the projections side of the reporting, so that we have a very clear and transparent view of what is planned, its anticipated arrival, and its anticipated deployment time line. We recognize that we all play a role, and the expectation is that we will all work collaboratively to implement these types of enhancements to our demand/capacity management practice.

 

So you are viewing all of this as a competitive advantage.

You have to. That’s a clear objective for all of senior management. We have to have a competitive edge in the marketplace, whether that’s on the technology side, product side, or how we deliver services to our clients. We need to be best in class. We need to champion the cause and drive this message throughout the organization.

 

Staffing is a huge part of maintaining data center operational excellence. We hear from our Network members that finding and keeping talent is a challenge. Is this something you are seeing as well?

I definitely do think there is a shortage of data center talent. We have experienced this first hand. I do believe that the industry needs to have a focused data center education program to train data center personnel. I am not referring to the theoretical or on-line programs, which already exist, but hands-on training that is specific to data center infrastructure. Typical trade school programs focus on general systems and equipment but do not have a track that is specific to data centers, one that also includes operational practices in critical environments. I think there has got to be something in the industry that’s specialized and hands-on. Training that covers the complex systems found in data centers, such as UPS systems, switchgear, EPMS, BMS, fire suppression, etc.

 

How do you retain your own good talent?

Keep them happy, keep them trained, and above all keep it interesting. You have to have a succession track, a practice that allows growth from within but also accounts for employee turnover. The succession track has to ensure that we have operational continuity when a team member moves on to pursue other opportunities.

The data center environment is a very demanding environment, and so you have to keep staff members focused and engaged. We focus on building a team, and as part of team development we ensure team members are properly trained and developed to the point where we can help them achieve their personal goals, which often include upward mobility. Our development track is based on the CBRE Foundations training program. In addition to the training program, AIG and CBRE provide multiple avenues for staff members to pursue growth opportunities.

 

When the staff is stable, what kinds of things can you do to keep them happy when you can’t promote them?

Oftentimes, it is the small things you do that resonate the most. I am a firm believer that above-average performance needs to be rewarded. We are pro-active and at times very creative in how we acknowledge those that are considered top performers. The Brill Award, which we achieved as a team, is just one example. We acknowledged the team members with a very focused and sincere thank you communication, acknowledging not only their participation but also the fact that it could not have been achieved without them. From a senior management perspective, we can’t lose sight of the fact that in order to cultivate a team environment you have to be part of the team. We advocate for a culture of inclusion, development, and opportunity.


Herb Alvarez

Herb Alvarez is director of Global Engineering & Critical Facilities, American International Group, Inc. Mr. Alvarez is responsible for engineering and critical facilities management for the AIG portfolio, which comprises 970 facilities spread across 130 countries. Mr. Alvarez has overarching responsibility for the global data center facilities and their building operations. He works closely and in collaboration with AIG’s Global Services group, which is the company’s IT division.

AIG operates three purpose-built data centers in the U.S., including a 235,000 square foot (ft2) facility in New Jersey and a 205,000-ft2 facility in Texas, and eight regional colo data centers in Asia Pacific, EMEA, and Japan.

Mr. Alvarez helped implement the Global Infrastructure Utility (GIU), a consolidation and standardization effort that AIG’s CEO Robert Benmosche initiated in 2010. This initiative was completed in 2013.


 

Kevin Heslin

Kevin Heslin is chief editor and director of ancillary projects at the Uptime Institute. He served as an editor at New York Construction News, Sutton Publishing, the IESNA, and BNP Media, where he founded Mission Critical, the leading commercial publication dedicated to data center and backup power professionals. In addition, Heslin served as communications manager at the Lighting Research Center of Rensselaer Polytechnic Institute. He earned a B.A. in Journalism from Fordham University in 1981 and a B.S. in Technical Communications from Rensselaer Polytechnic Institute in 2000.

Meeting the M&O Challenge of Managing a Diverse Data Center Footprint: John Sheputis and Don Jenkins, Infomart

By Matt Stansberry and Lee Kirby

Driving operational excellence across multiple data centers is exponentially more difficult than managing just one. Technical complexity multiplies as you move to different sites, regions, and countries where codes, cultures, climates and other factors are different. Organizational complexity further complicates matters when the data centers in your portfolio have different business requirements.

With little difficulty, an organization can focus on staffing, maintenance planning and execution, training and operations for a single site. Managing a portfolio turns the focus from projects to programs and from activity to outcomes. Processes become increasingly complex and critical. In this series of interviews, you will hear from practitioners about the challenges and lessons they have drawn from their experiences. You will find that those who thrive in this role share the understanding that Operational Excellence is not an end state, but a state of mind.

This interview is part of a series of conversations with executives who are managing diverse data center portfolios. The interviewees in this series participated in a panel at Uptime Institute Symposium 2015, discussing their use of the Uptime Institute Management & Operations (M&O) Stamp of Approval to drive standardization across data center operations.

John Sheputis: President, Infomart Data Centers

Don Jenkins: VP Operations, Infomart Data Centers

Give our readers a sense of your current data center footprint.

Sheputis: The portfolio includes about 2.2 million square feet (ft2) of real estate, mostly data center space. The facilities in both of our West Coast locations are data center exclusive. The Dallas facility is enormous, at 1.6 million ft2, and is a combination of mission critical and non-mission critical space. Our newest site in Ashburn, VA, is 180,000 ft2 and undergoing re-development now, with commissioning on the new critical load capacity expected to complete early next year.

The Dallas site has been operational since the 1980s. We assumed the responsibility for the data center pods in that building in Q4 2014 and brought on staff from that site to our team.

What is the greatest challenge of managing your current footprint?

Jenkins: There are several challenges, but communicating standards across the portfolio is a big one. Also, different municipalities have varying local codes and governmental regulations. We need to adapt our standards to the different regions.

For example, air quality control standards vary at different sites. We have to meet very high air quality standards in California, which means we adhere to very strict requirements for engine-generator runtimes and exhaust filter media. But in other locations, the regulations are less strict, and that variance impacts our maintenance schedules and parts procurement.

Sheputis: It may sound trivial to go from an area where air quality standards are high to one that is less stringent, but it still represents a change in our standards. If you’re going to do development, it’s probably best to start in California or somewhere with more restrictive standards and then go somewhere else. It would be very difficult to go the other way.

More generally, the Infomart merger was a big bite. It includes a lot of responsibility for non-data center space. So now we have two operating standards. We have over 500,000 ft2 of office-use real estate that uses the traditional break-fix operation model. We also have over two dozen data center suites with another 500,000 ft2 of mission critical space, where nothing breaks, or if it does, there can be no interruption of service. These different types of property have two different operations objectives and require different skill sets. Putting those varying levels of operations under one team expands the number of challenges you absorb. It pushes us from managing a few sites to a “many sites” level of complexity.

How do you benchmark performance goals?

Sheputis: I’m going to restrict my response to our mission critical space. When we start or assume control of a project, we have some pretty unforgiving standards. We want concurrent maintenance, industry-leading PUE, on time, on budget, and no injuries—and we want our project to meet critical load capacity and quality standards.

But picking up somebody else’s capital project after they’ve already completed their design and begun the work, yet before they finished? That is the hardest thing in the world. The Dallas Infomart site is so big, there are two or three construction projects going on at any time. Show up any weekend, and you’ll see somebody doing a crane pick or using a helicopter to deliver equipment to be installed on the roof. It’s that big. It’s a damn good thing that we have great staff on site in Dallas and someone like Don Jenkins to make sure everything goes smoothly.

We hear a lot about data center operations staffing shortages. What has been your experience at Infomart?

Jenkins: Good help is hard to find anywhere. Data center skills are very specific. It’s a lot harder to find good data center people. One of the things we try to do is hire veterans. Over half our operating engineers have military backgrounds, including myself. We do this not just out of patriotism or to meet security concerns, but because we understand and appreciate the similarity of a mission critical operation and a military operation (see http://journal.uptimeinstitute.com/resolving-data-center-staffing-shortage/).

Sheputis: If you have high standards, there is always a shortage of people for any job. But the corollary for that is that if you’re known for doing your job very well, the best people often find you. Don deserves credit for building low turnover teams. Creating a culture of continuity requires more than strong technical skillsets; you have to begin recruiting the kinds of people who can play on a team.

Don uses this phrase a lot to describe the type he’s looking for—people who are capable of both leading and being led. He wants candidates with low egos who care about outcomes, strong ethics, and who want to learn. We invest heavily in our training program, and we are rigorous in finding people who buy into our process. We don’t want people who want to be heroes. The ideal candidate is a responsible team player with an aptitude for learning, and we fill in the technical gaps as necessary over time. No one has all the skills they need day one. Our training is industry leading. To date, we have had no voluntary turnover.

Jenkins: We do about 250 man-hours of training for each staff member. It’s not cheap, but we feel it’s necessary and the guys love it. They want to learn. They ask for it. Greater skill attainment is a win-win for them, our tenants, and us.

Sheputis: When you build a data center, you often meet the technically strongest people at either the beginning of the project during design or the end of the project during the commissioning phase. Every project we do is Level 5 Commissioned. That’s when you find and address all of the odd or unusual use cases that the manufacturer may not have anticipated. More than once, we have had a UPS troubleshooting specialist say to Don, “You guys do it right. Let me know when you have an opening in your organization.”

Jenkins: I think it’s a testament that shows how passionate we are about what we do.

Are you standardizing management practices across multiple sites?

Sheputis: When we had one or two sites, it wasn’t a challenge because we were copying from California to Oregon. But with three or more sites it becomes much more difficult. With the inclusion of Dallas and Ashburn, we have had to raise our game. It is tempting to say we do the same thing everywhere, but that would be unrealistic at best.

Broadly speaking, we have two families of standards: Content and Process. For functional content we have specs for staffing, maintenance, security, monitoring, and the like. We apply these with the knowledge that there will be local exceptions—such as different codes and different equipment choices. An operator from one site has to appreciate the deviations at the other sites. We also have process-based standards, and these are more meticulously applied across sites. While the OEM equipment may be different, shouldn’t the process for change management be consistent? Same goes for the problem management process. Compliance is another area where consistency is expected.

The challenge with projecting any standard is to efficiently create evidence of acceptance and verification. We try to create a working feedback loop, and we are always looking for ways to do it better. We can centrally document standard policies and procedures, but we rely on field acceptance of the standard, and we leverage our systems to measure execution versus expectation. We can say please complete work orders on time and to the following spec, and we can delegate scheduling to the field, but the loop isn’t complete until we confirm execution and offer feedback on whether the work and documentation were acceptable.

What technology or methodology has helped your organization to significantly improve data center management?

Jenkins: Our standard building management system (BMS) is a Niagara product with an open framework. This allows our legacy equipment to talk over open protocols. All of our dashboards and data look the same and feel the same across all of the sites so that anybody could pull up another site and it would look the same to the operator.

Sheputis: Whatever system you’re using, there has to be a high premium on keeping it open. If you run on a closed system, it eventually becomes a lost island. This is especially true as you scale your operation. You have to have open systems.

How does your organization use the M&O Stamp?

Sheputis: The M&O stamp is one of the most important things we have ever achieved. And I’m not saying this to flatter you or the Uptime Institute. We believe data center operations are very important, and we have always believed we were pretty good. But I have to believe that many operators think they do a good job as well. So who is right? How does anyone really know? The challenge to the casual observer is that the data center industry is fairly closed. Operations are secure and private.

We started the process to see how good we were, and if we were good, we also thought it would be great to have a credible third party to acknowledge that. Saying I think I’m good is one thing; having a credentialed organization like Uptime Institute say so is much more.

But the M&O process is more than the Stamp of Approval. Our operations have matured and improved by participating in this process. Every year that we reassess and recertify, we feel like we learn new things, and we’re tracking our progress. The bigger benefit may be that the process forces us to think procedurally. When we’re setting up a new site, it helps us set a roadmap for what we want to achieve. Compared to all other forms of certification, we get something out of this beyond the credential; we get a path to improve.

Jenkins: Lots of people run a SWOT (strengths, weaknesses, opportunities, and threats) analysis or internal audit, but that feedback often lacks external reference points. You can give yourself an audit, and you can say “we’re great.” But what are you learning? How do you expand your knowledge? The M&O Stamp of Approval provides learning opportunities for us by providing a neutral experienced outsider viewpoint on where, and more importantly, how we can do better.

On one of the assessments, one of Uptime Institute’s consultants demonstrated how we could set up our chiller plant so that an operator could see all the key variables easily at a glance, with fewer steps to see what valves are open or closed. The advice was practical and easy to implement: markers on a chain, little flags on a chiller, LED lights on a pump. Very simple things to do, but we hadn’t thought of it. They’d seen it in Europe, it was easy to do, and it helps. That’s one specific example, but we used the knowledge of the M&O team to help us grow.

We think the M&O criteria and content will get better and deeper as time goes on. This is a solid standard for people to grow on.

Sheputis: We are for certifications, as they remove doubt, but most of the work and value is had in obtaining the first certification. I can see why others are cynical about value and cost to recertify. But I do think there’s real value in the ongoing M&O certification, mainly because it shows continuous improvement. No other certification process does that.

Jenkins: A lot of certifications are binary in that you pass if you have enough checked boxes—the content is specific, but operationally shallow. We feel that we get a lot more content out of the M&O process.

Sheputis: As I said before, we are for compliance and transparency. As we are often fulfilling a compliance requirement for someone else, there is clear value in saying we are PCI compliant or SSAE certified. But the M&O Stamp of Approval process is more like seeing a professional instructor. All other certifications should address the M&O stamp as “Sir.”


Matt Stansberry

Matt Stansberry is director of Content and Publications for the Uptime Institute and also serves as program director for the Uptime Institute Symposium, an annual spring event that brings together 1,500 stakeholders in enterprise IT, data center facilities, and corporate real estate to deal with the critical issues surrounding enterprise computing. He was formerly editorial director for Tech Target’s Data Center and Virtualization media group, and was managing editor of Today’s Facility Manager magazine. He has reported on the convergence of IT and Facilities for more than a decade.

 

The Calibrated Data Center:  Using Predictive Modeling

Better information leads to better decisions
By Jose Ruiz

New tools have dramatically enhanced the ability of data center operators to base decisions regarding capacity planning and operational performance, such as moves, adds, and changes, on actual data. The combined use of modeling technologies to effectively calibrate the data center during the commissioning process and the use of these benchmarks in modeling prospective configuration scenarios enable end users to optimize the efficiency of their facilities prior to the movement or addition of a single rack.

Data center construction is expected to continue growing in coming years to house the compute and storage capacity needed to support the geometric increases in data volume that will characterize our technological environment for the foreseeable future. As a result, data center operators will find themselves under ever-increasing pressure to fulfill dynamic requirements in the most optimized environment possible. Every kilowatt (kW) of cooling capacity will become increasingly precious, and operators will need to understand the best way to deliver it proactively.

As Uptime Institute’s Lee Kirby explains in Start With the End in Mind, a data center’s ongoing operations should be the driving force behind its design, construction, and commissioning processes.

This paper examines performance calibration and its impact on ongoing operations. To maximize data center resources, Compass performs a variety of analyses using Future Facilities’ 6SigmaDC and Romonet’s Software Suite. In the sections that follow, I will discuss predictive modeling during data center design, the commissioning process, and finally, how the calibration process validates the predictive models. Armed with the calibrated model, a customer can study the impact of proposed modifications on data center performance before any IT equipment is physically installed in the data center. This practice helps data center operators account for the three key elements during facility operations: availability, capacity, and efficiency. Compass calls this continuous modeling.

Figure 1. CFD software creates a virtual facility model and studies the physics of the cooling and power elements of the data center

What is a Predictive Model?
A predictive model, in a general sense, combines the physical attributes and operating data of a system and uses them to calculate a future outcome. The 6Sigma model provides a complete 3D representation of a data center at any given point in its life cycle. Combining the physical elements of IT equipment, racks, cables, air handling units (AHUs), power distribution units (PDUs), etc., with computational fluid dynamics (CFD) and power modeling enables designers and operators to predict the impact of their configuration on future data center performance. Compass uses commercially available performance modeling and CFD tools to model data center performance in the following ways:

• CFD software creates a virtual facility model and studies the physics of the cooling and power elements of the data center (see Figure 1).

• The modeling tool interrogates the individual components that make up the data center and compares their actual performance with the initial modeling prediction.

This proactive modeling process allows operators to fine-tune performance and identify potential operational issues at the component level. A service provider, for example, could use this process to maximize the sellable capacity of the facility and/or its ability to meet the service level agreement (SLA) requirements for new as well as existing customers.

Case Study Essentials

For the purpose of this case study all of the calibrations and modeling are based upon Compass Data Center’s Shakopee, MN, facility with the following specifications (see Figure 2):

• 13,000 square feet (ft2) of raised floor space

• No columns on the data center floor

• 12-foot (ft) false ceiling used as a return air plenum

• 36-inch (in.) raised floor

• 1.2 megawatt (MW) of critical IT load

• Four rooftop air handlers in an N+1 configuration

• 336 perforated tiles (25% open) with dampers installed

• Customer type: service provider


Figure 2. Data center room with rooftop AHUs

 

Cooling Baseline
The cooling system of this data center comprises four 120-ton rooftop air handler units in an N+1 configuration (see Figure 3). The system provides a net cooling capacity that a) supports the data center’s 1.2-MW power requirement and b) delivers 156,000 cubic feet per minute (CFM) of airflow to the white space. The cooling units are controlled based on the total IT load present in the space. This method turns on AHUs as the load increases. Table 1 describes the scheme.



These units have outside air economizers to leverage free cooling and increase efficiency. For the purpose of the calibration, the system was set to full recirculation mode with the outside air economization feature turned off. This allows the cooling system to operate at 100% mechanical cooling, which is representative of a standard operating day under the Design Day conditions.
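As a rough cross-check of the N+1 sizing described in the cooling baseline, the sketch below converts nominal tons of refrigeration to kilowatts (1 ton is about 3.517 kW) and confirms that three of the four units can carry the 1.2-MW critical load. It uses nominal rather than net capacity, which is an assumption, and the function and variable names are illustrative only.

```python
# Rough N+1 capacity check using nominal unit ratings (an assumption;
# net sensible capacity will be lower than nominal).
KW_PER_TON = 3.517  # 1 ton of refrigeration = 12,000 BTU/h, about 3.517 kW


def capacity_with_units_out(unit_tons, unit_count, units_out=1):
    """Return nominal cooling capacity in kW with `units_out` units offline."""
    available = unit_count - units_out
    return available * unit_tons * KW_PER_TON


it_load_kw = 1200  # 1.2 MW of critical IT load
n_plus_1_kw = capacity_with_units_out(unit_tons=120, unit_count=4)

print(f"Capacity with one AHU out: {n_plus_1_kw:.0f} kW")
print("Covers IT load:", n_plus_1_kw >= it_load_kw)
```

With one unit out of service, the remaining nominal capacity sits only slightly above the design IT load, which is why the net capacity figures from the manufacturer matter in practice.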


Figure 3. Rooftop AHUs


Figure 4. Cabinet and perforated tile layout. Note: Upon turnover, the customer is responsible for racking and stacking the IT equipment.

Cabinet Layout
The default cabinet layout is based on a standard Cold Aisle/Hot Aisle configuration (see Figure 4).

Airflow Delivery and Extraction
Because the cooling units are effectively outside the building, a long opening on one side of the room serves as a supply air plenum. The air travels down the 36-in.-wide plenum to a patent-pending air dam before entering the raised floor. The placement of the air dam ensures even pressurization of the raised floor during both normal and maintenance failure modes. Once past the air dam, the air enters a 36-in. raised floor and is released into the space above the floor through 336 perforated tiles (25% open) (see Figure 5).

Figure 5. Airflow


Hot air from the servers then passes through ventilation grilles placed in the 12-ft false ceiling.

Commissioning and Calibration
Commissioning is a critical step in the calibration process because it eliminates extraneous variables that may affect subsequent reporting values. Upon the completion of the Integrated Systems Testing (IST), the calibration process begins. This calibration exercise is designed to enable the data center operator to compare actual data center performance against the modeled values.


Figure 6. Inconsistencies between model values and actual performance can be explored and examined prior to placing the facility into actual operation. These results provide a unique insight into whether the facility will operate as per the design intent in the local climate.

The actual process consists of conducting partial load tests in 25% increments and monitoring actual readings from specific building management system points, sensors, and devices that account for all the data center’s individual components.


Figure 7. Load bank and PDUs during the test

As a result of this testing, inconsistencies between model values and actual performance can be explored and examined prior to placing the facility into actual operation. These results provide a unique insight into whether the facility will operate as per the design intent in the local climate or whether there are issues that will affect future operation that must be addressed. Figure 6 shows the process. Figure 7 shows load banks and PDUs as arranged for testing.


Table 2. Tests performed during calibration

All testing at Shakopee was performed by a third-party entity to eliminate the potential for any reporting bias in the testing. The end result of this calibration exercise is that the operator now has a clear understanding of the benchmark performance standards unique to their data center. This provides specific points of reference for all future analysis and modeling to determine the prospective performance impact of site moves, adds, or changes. Table 2 lists the tests performed during the calibration.


Table 3. Perforated tile configuration during testing

During the calibration, dampers on an appropriate number of tiles were closed proportionally to coincide with the load step. Table 3 shows the perforated tile damper configuration used during the test.
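The 25% load steps and proportional tile dampering can be sketched as a simple schedule. The example below assumes that airflow and the number of open tiles both scale linearly with IT load; that linearity is an assumption made for illustration, and the actual damper configuration is the one given in Table 3.

```python
# Load-step schedule for a calibration run, assuming airflow and open tiles
# scale linearly with IT load (an assumption; see Table 3 for the real setup).
DESIGN_LOAD_KW = 1200
DESIGN_AIRFLOW_CFM = 156_000
TOTAL_TILES = 336


def load_steps(increment_pct=25):
    """Yield (percent, kW, CFM, open tiles) for each partial-load step."""
    for pct in range(increment_pct, 101, increment_pct):
        fraction = pct / 100
        yield (
            pct,
            DESIGN_LOAD_KW * fraction,
            DESIGN_AIRFLOW_CFM * fraction,
            round(TOTAL_TILES * fraction),
        )


steps = list(load_steps())
for pct, kw, cfm, tiles in steps:
    print(f"{pct:>3}%  {kw:>6.0f} kW  {cfm:>8.0f} CFM  {tiles:>3} tiles open")
```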


Table 4. CPM goals, test results, and potential adjustments

Analysis & Results
To properly interpret the results of the initial calibration testing, it’s important to understand the concept of cooling path management (CPM), which is the process of stepping through the full route taken by the cooling air and systematically minimizing or eliminating potential breakdowns. The ultimate goal of this exercise is meeting the air intake requirement for each unit of IT equipment. The objectives and associated changes are shown in Table 4.

Cooling paths are influenced by a number of variables, including the room configuration, the IT equipment and its arrangement, and any subsequent changes to either. CPM is therefore essential both to the initial design of the room and to configuration management throughout the data center’s life span, so that cooling problems and inefficiencies do not creep in over time.

AHU Fans to Perforated Tiles (Cooling Path #1). CPM begins by tracing the airflow from the source (AHU fans) to the returns (AHU returns). The initial step consists of investigating the underfloor pressure. Figure 8 shows the pressure distribution in the raised floor. In this example, the underfloor pressure is uniform from the very onset, thereby ensuring an even flow rate distribution.


Figure 8. Pressure distribution in the raised floor

From a calibration perspective, Figure 9 demonstrates that the results obtained from the simulation are aligned with the data collected during commissioning/calibration testing. The average underfloor pressure captured by software during the commissioning process was 0.05 in. of H2O, compared with 0.047 in. of H2O predicted by 6SigmaDC.

The airflow variation across the 336 perforated tiles was determined to be 51 CFM. These data guaranteed an average target cooling capacity of 4 kW/cabinet compared to the installed 3.57 kW/cabinet (assuming that the data center operator uses the same type of perforated tiles as those initially installed). In this instance, the calibration efforts provided the benchmark for ongoing operations, and verified that the customer target requirements could be fulfilled prior to their taking ownership of the facility.

The important takeaway in this example is the ability of calibration testing to not only validate that the facility is capable of supporting its initial requirements but also to offer the end user a cost-saving mechanism to determine the impact of proposed modifications on the site’s performance, prior to their implementation. In short, hard experience no longer needs to be the primary mode of determining the performance impact of prospective moves, adds, and changes.

Table 5. Airflow simulations and measured results


During the commissioning process, all 336 perforated tiles were measured.

Table 5 compares the measured and simulated flow from the perforated tiles.


Table 6. Airflow distribution at the perforated tiles

The results show a 1% error between measured and simulated values. Let’s take a look at the flow distribution at the perforated tiles (see Table 6).

The flows appear to match up quite well. It is worth noting that the locations of the minimum and maximum flows differ between the measured and simulated values. However, this is not of concern because the flows are within an acceptable margin of error. Any large discrepancy (> 10%) between simulated and measured values would warrant further investigation (see Table 7). The next step in the calibration process examined the AHU supply temperatures.
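The comparison logic used throughout the calibration can be expressed as a small check. The sketch below computes the relative error between a measured and a simulated value and flags anything beyond the 10% threshold mentioned above. Applying that same threshold to the underfloor pressure reading is an assumption made for illustration, and the function names are hypothetical.

```python
# Flag simulated-vs-measured discrepancies above a tolerance.
TOLERANCE = 0.10  # the >10% investigation threshold mentioned in the text


def relative_error(measured, simulated):
    """Return the signed relative error of the simulation vs. the measurement."""
    return (simulated - measured) / measured


def needs_investigation(measured, simulated, tolerance=TOLERANCE):
    """Return True when the discrepancy exceeds the tolerance."""
    return abs(relative_error(measured, simulated)) > tolerance


# Average underfloor pressure, in. H2O: measured 0.05, simulated 0.047
err = relative_error(measured=0.05, simulated=0.047)
print(f"Underfloor pressure error: {err:+.1%}")
print("Investigate:", needs_investigation(0.05, 0.047))
```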

Perforated Tiles to Cabinets (Cooling Path #2). Perforated tile to cabinet airflow (see Figure 10) is another key point of reference that should be included in calibration testing and determination. Airflow leaving the perforated tiles enters the inlets of the IT equipment with minimal bypass.


Figure 9. Simulated flow through the perforated tiles


Figure 10. The blue particles cool the IT equipment, but the gray particles bypass the equipment.

Figure 10 shows how effective the perforated tiles are in terms of delivering the cold air to the IT equipment. The blue particles cool the IT equipment, while the gray particles bypass the equipment.

A key point of this testing is the ability to proactively identify solutions that can increase efficiency. For example, during this phase, testing helped determine that reducing fan speed would improve the site’s efficiency. As a result, the AHU fans were fitted with variable frequency drives (VFDs), which enable Compass to more effectively regulate this grille-to-cabinet airflow.


Figure 11. Inlet temperatures

It was also determined that inlet temperatures to the cabinets were at the low end of the ASHRAE allowable range (see Figure 11), creating the potential to raise the air temperature within the room for operations. If the operator takes action and raises the supply air temperature, they will have immediate efficiency gains and see significant cost savings.


Table 8. Savings estimates based on IT loads

The analytical model can estimate these savings quickly. Table 8 shows the estimated annual cost savings based on IT load, the supply air temperature setting for the facility, and a power cost of seven cents per kilowatt-hour (U.S. national average). It is important to note the location of the data center because the model uses specific EnergyPlus TMY3 weather files published by the U.S. Department of Energy for its calculation.
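To give a sense of scale for this kind of estimate, the sketch below applies the seven-cent rate to a constant average power reduction. It is a flat-rate simplification: the analytical model works from hourly weather data, and the 50-kW reduction used here is a hypothetical input, not a figure from Table 8.

```python
# Flat-rate estimate of annual cooling cost savings. The 50-kW reduction is a
# hypothetical input, not a figure from Table 8.
HOURS_PER_YEAR = 8760
POWER_COST_USD_PER_KWH = 0.07  # seven cents per kWh, as in the text


def annual_savings_usd(avg_kw_saved, rate=POWER_COST_USD_PER_KWH):
    """Return yearly savings for a constant average power reduction."""
    return avg_kw_saved * HOURS_PER_YEAR * rate


savings = annual_savings_usd(avg_kw_saved=50)
print(f"Estimated annual savings: ${savings:,.0f}")
```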


Figure 12. Cooling path three tracks airflow from the equipment exhaust to the returns of the AHU units

Cabinet Exhaust to AHU Returns (Cooling Path #3). Cooling path three tracks airflow from the equipment exhaust to the returns of the AHUs (see Figure 12). In this case, the inlet temperatures recorded during calibration testing suggest that there was very little external or internal cabinet recirculation. The return temperatures and the capacities of the AHUs are fairly uniform. The table shows the comparison between measured and simulated AHU return temperatures:

Looking at the percentage of cooling load utilized for each AHU, the measured load was around 75%, and the simulated values show an average of 80% for each AHU. This slight discrepancy was acceptable given the differences between the measured and simulated supply and return temperatures, thereby establishing the acceptable parameters for ongoing operation within the site.

Introducing Continuous Modeling
Up to this point, I have illustrated how calibration efforts can be used both to verify the suitability of the data center to successfully perform as originally designed and to prescribe the specific benchmarks for the site. This knowledge can be used to evaluate the impact of future operational modifications, which is the basis of continuous modeling.

The essential value of continuous modeling is its ability to facilitate more effective capacity planning. By modeling prospective changes before moving IT equipment in, many important what-ifs can be answered (and costs avoided) while meeting all the SLA requirements.

Examples of continuous modeling applications include, but are not limited to:

• Creating custom cabinet layouts to predict the impact of various configurations

• Increasing cabinet power density or modeling custom cabinets

• Modeling Cold Aisle/Hot Aisle containment

• Changing the control systems that regulate VFDs to move capacity where needed

• Increasing the air temperature safely without breaking the temperature SLA

• Investigating upcoming AHU maintenance or AHU failure scenarios that can’t be tested in a production environment

In each of these applications, the appropriate modeling tools are used in concert with initial calibration data to determine the best method of implementing a desired change. The ability to proactively identify the level of deviation from the site’s initial system benchmarks can aid in the identification of more effective alternatives that not only improve operational performance but also reduce the time and cost associated with their implementation.

Case History: Continuous Modeling
Total airflow in the facility described in this case study is based on the percentage of IT load in the data hall, with a design criterion of 25°F (14°C) ∆T. Careful tile management must be practiced in order to maintain proper static pressure under the raised floor and avoid potential hot spots. Using the calibrated model, Compass created two scenarios to understand the airflow behavior. This resulted in installing fewer perforated tiles than originally planned and better SLA compliance. Having the calibrated model gave a higher level of confidence in the results. The two scenarios are summarized below.
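The relationship between the design airflow and the design ∆T can be cross-checked with the standard-air sensible heat approximation, q [BTU/h] = 1.08 × CFM × ∆T [°F]. The sketch below is a back-of-the-envelope check only: it assumes standard air density, and the function names are illustrative.

```python
# Sensible-heat cross-check of the design airflow using the standard-air
# approximation q [BTU/h] = 1.08 x CFM x delta-T [deg F]. Standard air density
# is an assumption; actual site conditions will shift the result.
BTU_PER_HR_PER_KW = 3412.14


def sensible_cooling_kw(airflow_cfm, delta_t_f):
    """Return the heat carried away by the airflow, in kW."""
    return 1.08 * airflow_cfm * delta_t_f / BTU_PER_HR_PER_KW


def delta_t_f_to_c(delta_t_f):
    """Convert a temperature difference (not a reading) from deg F to deg C."""
    return delta_t_f * 5 / 9


design_kw = sensible_cooling_kw(airflow_cfm=156_000, delta_t_f=25)
print(f"Heat removed at design airflow: {design_kw:.0f} kW")
print(f"25 F delta-T = {delta_t_f_to_c(25):.1f} C")
```

Under these assumptions, 156,000 CFM at a 25°F rise carries away a little over 1.2 MW, which is consistent with the facility's critical IT load.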


Figure 13. Case history equipment layout

Scenario 1: Less Than Ideal Management
There are 72 4-kW racks in one area of the raised floor and six 20-kW racks in the opposite corner (see Figure 13). The total IT load is 408 kW, which is equal to 34% of the total IT load available. The total design airflow at 1,200 kW is 156,000 CFM, meaning the total airflow delivered in this example is 53,040 CFM. A leakage rate of 12% is assumed, which means that 88% of the 53,040 CFM is distributed using the perforated tiles. Perforated tiles were provided in front of each rack. The 25% open tiles were used in front of the 4-kW racks and Tate GrateAire tiles were used in front of the 20-kW racks.
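The Scenario 1 arithmetic can be reproduced in a few lines. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
# Reproduce the Scenario 1 airflow arithmetic from the figures in the text.
DESIGN_LOAD_KW = 1200
DESIGN_AIRFLOW_CFM = 156_000
LEAKAGE_RATE = 0.12

racks = {4: 72, 20: 6}  # rack power (kW) -> number of racks

it_load_kw = sum(kw * count for kw, count in racks.items())
load_fraction = it_load_kw / DESIGN_LOAD_KW
delivered_cfm = DESIGN_AIRFLOW_CFM * load_fraction
tile_cfm = delivered_cfm * (1 - LEAKAGE_RATE)

print(f"IT load: {it_load_kw} kW ({load_fraction:.0%} of design)")
print(f"Airflow delivered: {delivered_cfm:,.0f} CFM")
print(f"Airflow through perforated tiles: {tile_cfm:,.0f} CFM")
```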


Figure 14. Scenario 1 data hall temperatures

The results of Scenario 1 demonstrate the temperature differences between the hot and cold aisles. For the area with 4-kW racks there is an average temperature difference of around 10°F (5.5°C) between the hot and cold aisles, and the 20-kW racks have a temperature difference of around 30°F (16°C) (see Figure 14).

Scenario 2: Ideal Management
In this scenario, the racks were left in the same location, but the perforated tiles were adjusted to better distribute air based on the IT load. The 20-kW racks account for 120 kW of the total IT load while the 4-kW racks account for 288 kW of the total IT load. In an ideal floor layout, 29.4% of the airflow will be delivered to the 20-kW racks and 70.6% of the airflow will be delivered to the 4-kW racks. This will allow for an ideal average temperature difference across all racks.
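The load-proportional split described for Scenario 2 can be sketched as follows. The tile airflow figure carries over the 12% leakage assumption from Scenario 1, and the per-rack airflow values are derived here for illustration rather than taken from the model.

```python
# Split tile airflow across rack groups in proportion to IT load (Scenario 2).
TILE_AIRFLOW_CFM = 53_040 * 0.88  # delivered airflow less 12% leakage

groups = {
    "4-kW racks": {"rack_kw": 4, "count": 72},
    "20-kW racks": {"rack_kw": 20, "count": 6},
}

total_kw = sum(g["rack_kw"] * g["count"] for g in groups.values())

shares = {}
for name, g in groups.items():
    group_kw = g["rack_kw"] * g["count"]
    share = group_kw / total_kw  # fraction of total IT load in this group
    shares[name] = share
    per_rack_cfm = TILE_AIRFLOW_CFM * share / g["count"]
    print(f"{name}: {share:.1%} of airflow, about {per_rack_cfm:,.0f} CFM per rack")
```

Because the split is proportional to load, every rack receives the same airflow per kilowatt, which is what produces a uniform temperature rise across the room in the ideal case.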


Figure 15. Scenario 2 data hall temperatures

Scenario 2 shows a much better airflow distribution than Scenario 1. The 20-kW racks now have around 25°F (14°C) difference between the hot and cold aisles (see Figure 15).

In general, it may stand to reason that if there are a total of 336 perforated tiles in the space and the space is running at 34% IT load, 114 perforated tiles should be open. The model showed, however, that if 114 perforated tiles were opened, the underfloor static pressure would drop off and potentially cause hot spots due to lack of airflow.

Furthermore, continuous modeling will allow operators a better opportunity to match growth with actual demand. Using this process, operators can validate capacity and avoid wasted capital expense due to poor capacity planning.

Conclusion
To a large extent, a lack of evaluative tools has historically forced data center operators to accept on faith their new data center’s ability to meet its design requirements. Recent developments in modeling applications not only address this long-standing shortcoming but also provide operators with an unprecedented level of control. The availability of these tools provides end users with proactive analytical capabilities that manifest themselves in more effective capacity planning and efficient data center operation.


Table 9. Summary of the techniques used in each step of model development and verification

Through the combination of rigorous calibration testing, measurement, and continuous modeling, operators can evaluate the impact of prospective operational modifications prior to their implementation and ensure that they are cost-effectively implemented without negatively affecting site performance. This enhanced level of control is essential for effectively managing data centers in an environment that will continue to be characterized by its dynamic nature and increasing application complexity. Finally, Table 9 summarizes the reasons why these techniques are valuable and have a positive impact on data center operations.

Most importantly, all of these help the data center owner and operator make a more informed decision.


Jose Ruiz


Jose Ruiz is an accomplished data center professional with a proven track record of success. Mr. Ruiz serves as Compass Datacenters’ director of Engineering where he is responsible for all of the company’s sales engineering and development support activities. Prior to joining Compass, he spent four years serving in various sales engineering positions and was responsible for a global range of projects at Digital Realty Trust. Mr. Ruiz is an expert on CFD modeling.

Prior to Digital Realty Trust, Mr. Ruiz was a pilot in the United States Navy where he was awarded two Navy Achievement Medals for leadership and outstanding performance. He continues to serve in the Navy’s Individual Ready Reserve. Mr. Ruiz is a graduate of the University of Massachusetts with a degree in Bio-Mechanical Engineering.

 

Retainers Improve the Effectiveness of IEC Plugs

These small devices prevent accidental disconnection of mission critical gear

By Scott Good


Today IEC plugs are used at the rack-level PDU and the IT device. IEC plugs backing out of sockets create a significant concern, since these plugs feed UPS power to the device. In the past, twist-lock cord caps were used, but these did not address the connection of the IEC plug at the IT device. Retainers are a way the industry has addressed this problem.

In one case, Uptime Institute evaluated a facility in the Caribbean (a Tier Certified Constructed Facility) that was not using retainers. While operators had checked all the connections two weeks earlier, when they isolated one UPS during the TCCF process, a single cord on a single device belonging to the largest customer was found to be loose, and the device suffered an interruption of power.


The International Electrotechnical Commission (IEC) plug is the most common device used to connect rack-mounted IT hardware to power. In recent years, the use of IEC 60320 cords with IEC plugs has become more common, replacing twist-lock and field-constructed hard-wired type IEC plug connections. During several recent site evaluations, Uptime Institute has observed that the IEC 60320 plug-in electrical cords may fit loosely and accidentally disconnect during routine site network maintenance. Some incidents have involved plugs that were not fully inserted at the connections to the power distribution units (PDUs) in the IT rack or became loose due to temperature fluctuations. This technical paper provides information on cable and connector installation methods that can be used to ensure a secure connection at the PDU.

IT Hardware Power Cables

The IEC publishes consensus-based international standards and manages conformity assessment systems for electric and electronic products, systems and services, collectively known as electrotechnology. The IEC 60320 standard describes the devices used to couple IT hardware to power systems. The plugs and cords described by this standard come in various configurations to meet the current and voltages found in each region. This standard is intended to ensure that proper voltage and current are provided to IT appliances wherever they are deployed (see http://www.iec.ch/worldplugs/?ref=extfooter).

The most common cables used to power standard PCs, monitors, and servers are designated C13 and C19. Cable connectors have male and female versions, with the female always carrying an odd number label. The male version carries the next higher even number as its designation. C19 and C20 connectors are becoming more common for use with servers and PDUs in high-power applications.
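The numbering rule described above (odd number for the female connector, next higher even number for the male counterpart) can be captured in a small helper. This is an illustrative sketch of the rule as stated in this article, not a lookup against the standard itself, and the function name is hypothetical.

```python
# Pair an IEC 60320 female connector with its male counterpart using the
# numbering rule described above (odd = female, next even number = male).
def mating_inlet(connector):
    """Return the male designation for a female connector such as 'C13'."""
    number = int(connector.upper().lstrip("C"))
    if number % 2 == 0:
        raise ValueError(f"{connector} is an even-numbered (male) designation")
    return f"C{number + 1}"


for connector in ("C13", "C19"):
    print(f"{connector} connector mates with a {mating_inlet(connector)} inlet")
```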

Most standard PCs accept a C13 female cable end, which connects a standard 5-15 plug cord set that plugs into a 120-volt (V) outlet to a C14 male inlet on the device end. In U.S. data centers, a C14/C13 coupler includes a C14 (male) end that plugs into a PDU and a C13 (female) end that plugs into the server. Couplers in EU data centers also include C13s at the IT appliance end but have different male connectors to the PDU. These male ends are identified as C or CEE types. For example, the CEE /7 has two rounded prongs and provides power at 220 V.

IEC Plug Installation Methods

In data centers, PDUs are typically configured to support dual-corded IT hardware. Power cords are plugged into PDU receptacles that are powered from A and B power sources. During installation, installers typically plug a cable coupler in a server outlet first and then into a PDU.

Figure 1. Coiled cable


Sometimes the cord is longer than the distance between the server outlet and the PDU, so the installer will coil the cable and secure the coil with cable ties or Velcro (see Figures 1 and 2). This practice adds weight on the cable and stress to the closest connection, which is at the PDU. If the connection at the PDU is not properly supported, the connector can easily pull or fall out during network maintenance activity. Standard methods for securing PDU connections include cable retention clips, plug locks, and IEC Plug Lock and IEC Lock Plus.

Figure 2. Velcro ties


 

Cable retention clips are the original solution developed for IT hardware cable installations. These clips are manufactured to install at the connection point and clip to retention receptacles on the side of the PDU. Supports on the PDU receive the clip and hold the connector in the receptacle slot (see Figure 3).

Figure 3. A retention clip to PDU in use


Plug lock inserts prevent power cords from accidentally disconnecting from C13 output receptacles (see Figure 4). A plug lock insert placed over any C14 input cord strengthens the connection of the plug to the C13 outlet, keeping critical equipment plugged in and running during routine rack access and maintenance.

Figure 4. Plug lock


C13 and C19 IEC Lock connectors include lockable female cable ends suitable for use with standard C14 or C20 outlets. They cannot be accidentally dislodged or vibrated out of the outlets (see Figure 5).

The IEC Plug Lock and IEC Lock Plus are also alternatives. Both products have an integral locking mechanism that secures C13 and C19 plugs to the power pins of all C13 and C19 outlets.

 

 

Summary

In recent years, manufacturers of IEC plugs have developed technologies in new and existing plug and cable products to help mitigate the issue of plugs working their way out of the sockets on both IT hardware and PDU power feeds.


 

Figure 5. IEC plug lock

As these connections are audited in the data center, it is good practice to see where these conditions exist or could be created. Having a plan to change out older style and suspect cables will help mitigate or avoid incidents during maintenance and change processes in data centers.


Scott Good


Scott Good is a senior consultant of Uptime Institute Professional Services, facilitating prospective engagements and delivering Tier Topology and Facilities Certifications to contracted clients. Mr. Good has been in the data center industry for more than 25 years and has developed data center programs for enterprise clients globally. He has been involved in the execution of Tier programs in alignment with the Uptime Institute and was one of the first to be involved in the creation of the original Tier IV facilities. Mr. Good developed and executed a systematic approach to commissioning these facilities, and the processes he created are used by the industry to this day.

Avoiding Data Center Construction Problems

Experience, teamwork, and third-party verification are keys to avoiding data center construction problems
By Keith Klesner

In 2014, Uptime Institute spoke to the common conflicts between data center owners and designers. In our paper, “Resolving Conflicts Between Data Center Owners and Designers” [The Uptime Institute Journal, Volume 3, p 111], we noted that both the owner and designer bear a certain degree of fault for data center projects that fail to meet the needs of the enterprise or require expensive and time-consuming remediation when problems are uncovered during commissioning or Tier Certification.

Further analysis reveals that not all the communications failures can be attributed to owners or designers. In a number of cases, data center failures, delays, or cost overruns occur during the construction phase because of misaligned construction incentives or poor contractor performance. In reality, the seeds of both these issues are sown in the earliest phases of the capital project, when design objectives, budgets, and schedules are developed, RFPs and RFIs issued, and the construction team assembled. The global scale of planning shortfalls and project communication issues became clear due to insight gained through the rapid expansion of the Tier Certification program.

Many construction problems related to data center functionality are avoidable. This article will provide real-life examples and ways to avoid these problems.

In Uptime Institute’s experience from more than 550 Tier Certifications in over 65 countries, problems in construction resulting in poor data center performance can be attributed to:

•   Poor integration of complex systems

•   Lack of thorough commissioning or compressed commissioning schedules

•   Design changes

•   Substitution of materials or products

These issues arise during construction, commissioning, or even after operations have commenced and may impact cost, schedule, or IT operations. These construction problems often occur because of poor change management processes, inexperienced project teams, misaligned objectives of project participants, or lack of third-party verification.

Lapses in construction oversight, planning, and budget can mean that a new facility will fail to meet the owner’s expectations for resilience or require additional time or budget to cure problems that become obvious during commissioning—or even afterwards.

APPOINTING AN OWNER’S REPRESENTATIVE

At the project outset, all parties should recognize that owner objectives differ greatly from builder objectives. The owner wants a data center that best meets cost, schedule, and overall business needs, including data center availability. The builder wants to meet project budget and schedule requirements while preserving project margin. Data center uptime (availability) and operations considerations are usually outside the builder’s scope and expertise.

Thus, it is imperative that the project owner—or owner’s representatives—devise contract language, processes, and controls that limit the contractors’ ability to change or undermine design decisions while making use of the contractors’ experience in materials and labor costs, equipment availability, and local codes and practices, which can save money and help construction follow the planned timeline without compromising availability and reliability.

Data center owners should appoint an experienced owner’s representative to properly vet contractors. This representative should review contractor qualifications, experience, staffing, leadership, and communications. Less experienced and cheaper contractors can often lead to quality control problems and design compromises.

The owner or owner’s representative must work through all the project requirements and establish an agreed upon sequence of operations and an appropriate and incentivized construction schedule that includes sufficient time for rigorous and complete commissioning. In addition, the owner’s representative should regularly review the project schedule and apprise team members of the project status to ensure that the time allotted for testing and commissioning is not reduced.

Project managers, or contractors, looking to keep on schedule may perform tasks out of sequence. Tasks performed out of sequence often have to be reworked to allow access to space allocated to another system or to correct misplaced electrical service, conduits, ducts, etc., which only exacerbates scheduling problems.

Construction delays should not be allowed to compromise commissioning. Incorporating penalties for delays into the construction contract is one solution that should be considered.

VALUE ENGINEERING

Value Engineering (VE) is a regularly accepted construction practice employed by owners to reduce the expected cost of building a completed design. The VE process has its benefits, but it tends to focus just on the first costs of the build. Often conducted by a building contractor, the practice has a poor reputation among designers because it often leads to changes that compromise the design intent. Yet other designers believe that in qualified hands, VE, even in data centers, can yield savings for the project owner, without affecting reliability, availability, or operations.

If VE is performed without input from Operations and appropriate design review, any initial savings realized from VE changes may be far less than charges for remedial work needed to restore features necessary to achieve Concurrent Maintainability or Fault Tolerance and increased operating costs over the life of the data center (See Start with the End in Mind, The Uptime Institute Journal, Volume 3, p.104).

Uptime Institute believes that data center owners should be wary of changes suggested by VE that deviate from either the owner’s project requirements (OPR) or design intent. Cost savings may be elusive if changes resulting from VE substantially alter the design. As a result, each and every change must be scrutinized for its effect on the design. Retaining the original design engineer or a project engineer with experience in data centers may reduce the number of inappropriate changes generated during the process. Even so, data center owners should be aware that Uptime Institute personnel have observed that improperly conducted VE has led to equipment substitutions or systems consolidations that compromised owner expectations of Fault Tolerance or Concurrent Maintainability. Contractors may substitute lower-priced equipment that has different capacity, control methodology, tolerances, or specifications without realizing the effect on reliability.

Examples of VE changes include:

•   Eliminating valves needed for Concurrent Maintainability (see Figure 1)

•   Reducing the number of automatic transfer switches (ATS) by consolidating equipment onto a single ATS

•   Deploying one distinct panel rather than two, confounding Fault Tolerance

•   Integrating economizer and energy-efficiency systems in a way that does not allow for Concurrent Maintainability or Fault Tolerant operation

Figure 1. Above, an example of a design that meets Tier III Certification requirements. Below, an example of a built system that underwent value engineering. Note that there is only one valve between components instead of the two shown in the design.

ADEQUATE TIME FOR COMMISSIONING

Problems attributed to construction delays sometimes result when the initial construction timeline does not include adequate time for Level 4 and Level 5 testing. Construction teams that are insufficiently experienced in the rigors of data center commissioning (Cx) are most susceptible to this mistake. This is not to say that builders do not contribute to the problem by working to a deadline and regarding the commissioning period as a kind of buffer that can be accessed when work runs late. For both these reasons, it is important that the owner or owner’s representative take care to schedule adequate time for commissioning and ensure that contractors meet or beat construction deadlines. One recommendation is to engage the Commissioning Agent (CxA) and General Contractor early in the process as partners in developing the project schedule.

In addition, data center capital projects include requirements that might be unfamiliar to teams lacking experience in mission critical environments; these requirements often have budgetary impacts.

For example, owners and owner’s representatives must scrutinize construction bids to ensure that they include funding and time for:

•   Factory witness tests of critical equipment

•   Extended Level 4 and Level 5 commissioning with vendor support

•   Load banks to simulate full IT load within the critical environment

•   Diesel fuel to test and verify engine-generator systems

EXAMPLES OF DATA CENTER CONSTRUCTION MISTAKES

Serious mistakes can take place at almost any time during the construction process, including during the bidding process. In one such instance, an owner’s procurement department tried to maximize a vendor discount for a UPS but failed to order bus and other components to connect the UPS.

In another example, consider the contractor who won a bid based on the cost of transporting completely assembled generators on skids for more than 800 miles. When the vendor threatened to void warranty support for this creative use of product, the contractor was forced to absorb the substantial costs of transporting equipment in a more conventional way. In such instances, owners might be wise to watch closely whether the contractor tries to recoup these costs by changing the design or making other equipment substitutions.

During the Tier Certification of a Constructed Facility (TCCF) for a large financial firm, Uptime Institute uncovered a problematic installation of electrical bus duct. Experienced designers and contractors, or those willing to involve Operations in the construction process, know that these bus ducts should be regularly scanned under load at all joints. Doing so ensures that the connections do not loosen and overheat, which can lead to an arc-based failure. Locating the bus over production equipment or in hard-to-reach locations may prevent thorough infrared scanning and eventual maintenance.

Labeling the critical feeders is just as important so Operations knows how to respond to an incident and which systems to shut down (see Figure 2).

Figure 2. A contractor that understands data centers and a construction management team that focuses on a high reliability data center can help owners achieve their desired goals. In this case, design specifications and build team closely followed the intent of a major data center developer for a clear labeling system of equipment with amber (primary) side and blue (alternate) equipment and all individual feeders. The TCCF process found no issues with Concurrent Maintainability of power systems.

In this case, the TCCF team found that the builder implemented a design as it saw fit, without considering maintenance access or labeling of this critical infrastructure. The builder had instead rerouted the bus ducts into a shared compartment and neglected to label any of the conductors.

In another such case, a contractor in Latin America simply did not want to meet the terms of the contract. After bidding on a scope of work, the contractor made an unapproved change that was approved by the local engineer. Only later did the experienced project engineer hired by the owner note the discrepancy, which began a months-long struggle to get the contractor to perform. During this time, when he was reminded of his obligation, he simply deflected responsibility and eventually admitted that he didn’t want to do the work as specified. The project engineer still does not know the source of the contractor’s intransigence but speculates that inexperience led him to submit an unrealistically low bid.

Uptime Institute has witnessed all the following cooling system problems in facilities with Tier III objectives:

•   When the rooftop unit (RTU) control sequence was not well understood and coordinated, RTU supply air fan and outside air dampers did not react at the same speed, creating over/under pressure conditions in the data hall. In one case, over-pressurization blew a wall out. In another case, over/under pressure created door opening and closing hazards.

•   A fire detection and suppression system was specifically reviewed for Concurrent Maintainability to ensure no impact to power or cooling during any maintenance or repair activities. At the TCCF, Uptime Institute recognized that a dual-fed UPS power supply to a CRAC shutdown relay that fed a standing voltage system was still an active power supply to the switchboard, even though the mechanical board had been completely isolated. Removing that relay caused the loss of all voltage, the breakers for all the CRACs to open, and critical cooling to the data halls and UPS rooms to be lost. The problem was traced to a minor construction change to the Concurrently Maintainable design of a US$22-million data center.

•   In yet another instance, Uptime Institute discovered during a TCCF that a builder had fed power to a motorized building air supply and return using a single ATS, which would have defeated all critical cooling. The solution involved the application of multiple distributed damper control power ATS devices.
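The single-ATS problem in the last example is the kind of dependency that a simple check can surface before a TCCF does. The following is a minimal sketch, not an Uptime Institute method: the function name, the load and ATS labels, and the idea of counting how many loads must remain powered are all assumptions made for illustration.

```python
def single_points_of_failure(feeds: dict[str, set[str]], required: int) -> list[str]:
    """Return each ATS whose loss leaves fewer than `required` loads powered.

    `feeds` maps a load name to the set of ATS devices able to power it.
    All names are hypothetical; this is an illustrative model only.
    """
    # Collect every ATS mentioned anywhere in the mapping.
    sources = sorted({ats for ats_set in feeds.values() for ats in ats_set})
    failures = []
    for lost in sources:
        # A load stays powered if at least one of its ATS devices remains.
        still_powered = sum(1 for ats_set in feeds.values() if ats_set - {lost})
        if still_powered < required:
            failures.append(lost)
    return failures


# As built (hypothetical labels): both dampers depend on one ATS.
as_built = {"supply_damper": {"ATS-1"}, "return_damper": {"ATS-1"}}

# Distributed control power (hypothetical): each air path has its own ATS,
# and one path alone is assumed sufficient for critical cooling.
distributed = {"air_path_A": {"ATS-A"}, "air_path_B": {"ATS-B"}}

print(single_points_of_failure(as_built, required=1))
print(single_points_of_failure(distributed, required=1))
```

In this toy model the as-built arrangement reports its only ATS as a single point of failure, while the distributed arrangement reports none. A real review would work from the actual one-line diagrams and the capacity each path provides.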

Fuel supply systems are also susceptible to construction errors. Generally, diesel fuel for engine generators is pumped from bulk storage tanks through a control and filtration room to day tanks near the engine generators.

But in one instance, the fuel subcontractor built the system incorrectly and failed to do adequate quality control. The commissioning team also did not rigorously confirm that the system was built as designed, which is a major oversight. In fact, the commissioning agent was only manually testing the valves as the TCCF team arrived on site (see Figure 3). In this example, an experienced data center developer created an overly complex design for which the architect provisioned too little space. Operating the valves required personnel to climb on and over the piping. Much of the system was removed and rebuilt at the contractor’s expense. The owner also suffered added project time and additional commissioning and TCCF testing after the fact.

Figure 3. A commissioning team operating valves manually to properly test a fuel supply system. Prior to Uptime Institute’s arrival for the TCCF, this task had not been performed.

AVOIDING CONSTRUCTION PROBLEMS

Once a design has been finalized and meets the OPR, change control processes are essential to managing and reducing risk during the construction phase. For various reasons, many builders, and even some owners, may be unfamiliar with the criticality of change control as it relates to data center projects. No project will be completely error free; however, good processes and documentation will reduce the number and severity of errors and sometimes make the errors that do occur easier to fix. Uptime Institute recommends that anyone contemplating a data center project take the following steps to protect against errors and other problems that can occur during construction.

Gather a design, construction, and project management team with extensive data center experience. If necessary, bring in outside experts to focus on the OPR. Keep in mind that an IT group may not understand schedule risk or the complexity of a project. Experienced teams push back on unrealistic schedules or VE suggestions that do not meet OPR, which prevents commissioning schedule compression and leads to good Operational Sustainability. In addition, experienced teams have data center operations and commissioning experience, which means that project changes will more likely benefit the owner. The initial costs may be higher, but experienced teams bring better ROI.

Because experienced teams understand the importance of data center specific Cx, the CxA will be able to work more effectively early in the process, setting the stage for the transition to operations. The Cx requirements and focus on functionality will be clear from the start.

In addition, Operations should be part of the design and construction team from the start. Including Operations in change management gives it the opportunity to share and learn key information about how that data center will run, including set points, equipment rotation, change management, training, and spare inventory, that will be essential in everyday operations and in dealing with incidents.

Finally, vendors should be a resource to the construction team, but almost by definition, their interests and those of the owner are not aligned.

Assembling an experienced team only provides benefits if they work as a team. The owner and owner’s representatives can encourage collaboration among team members who have divergent interests and strong opinions by structuring contracts with designers, project engineering, and builders to prioritize meeting the OPR. Many data center professionals find Design-Build or Design-Bid-Build contracts that use a guaranteed maximum price (GMP) and share cost savings conducive to developing a team approach.

Third-party verifications can assure the owner that the project delivered meets the OPR. Uptime Institute has witnessed third-party verification improve contractor performance. The verifications motivate the contractors to work better, perhaps because verification increases the likelihood that shortcuts or corner cutting will be found and repaired at the contractor’s expense. Uptime Institute does not believe that contractors, as a whole, engage in such activities, but it is logical that the threat of verification may make contractors more cautious about “interpreting contract language” and making changes that inexperienced project engineers and owner’s representatives may not detect.

Certifications and verifications are only effective when conducted by an unbiased, vendor-neutral third party. Many certifications in the market fail to meet this threshold. Some certifications and verification processes are little more than a vendor stamp of approval on pieces of equipment. Others take a checklist approach, without examining causes of test failures. Worthwhile verification and certification approaches insist on identifying the causes of anomalous results, so they do not repeat in a live environment.

Similarly, the CxA should be independent and not the designer or project engineer. In addition, the Cx team should have extensive data center experience.

The CxA should focus on proving the design and installation meet OPR. The CxA should be just as inquisitive as the verification and certification agencies, and for the same reasons: if the root cause of abnormal performance during commissioning is not identified and addressed, it will likely recur during operations.

Third-party verifications and certifications provide peer review of design changes and VE. The truth is that construction is messy: On-site teams can get caught up in the demands of meeting budget and schedule and may lose sight of the objective. A third-party resource that reviews major RFIs, VE, or design changes can keep a project on track, because an independent third party can remain uninfluenced by project pressure.

TIER CERTIFICATION IS THE WRONG TIME TO FIND THESE PROBLEMS

Uptime Institute believes that the Tier Certification process is not the appropriate time to identify design and construction errors or to find that a facility is not Concurrently Maintainable or Fault Tolerant, as the owner may require. In fact, we note with alarm that a great number of the examples in this article were first identified during the Tier Certification process, at a time when correcting problems is most costly.

In this regard, then, the number of errors discovered during commissioning and Tier Certifications points out one value of third-party review of the design and built facility. By identifying problems that would have gone unnoticed until a facility failed, the third-party reviewer saves the enterprise a potentially existential incident.

More often, though, Uptime Institute believes that a well-organized construction process, including independent Level 4 and Level 5 Commissioning and Tier Certification, includes enough checks and balances to catch errors as early as possible and to eliminate any contractor incentive to “paper over” or minimize the need for corrective action when deviations from design are identified.

SIDEBAR: FUEL SUPPLY SYSTEM ISSUE OVERLOOKED DURING COMMISSIONING

Uptime Institute recently examined a facility that used diesel rotary UPS (DRUPS) systems to meet the IT loads in a data center. The facility also had separate engine generators for mechanical loads. The DRUPS were located on the lower of two basement levels, with bulk fuel storage tanks buried outside the building. As a result, the DRUPS and day tanks were lower than the bulk storage tanks.

The Tier Certification of a Constructed Facility (TCCF) demonstrations required that the building operate on engine-generator sets for the majority of the testing. During the day, the low fuel alarm tripped on multiple DRUPS.

UNDETECTED ISSUE

The ensuing investigation faulted the sequence of operations for the fuel transfer from the bulk storage tanks to the day tanks. When the day tanks called for fuel, the system would open the electrical solenoid valve at the day tank and delay the start of the fuel transfer pump to flow fuel. This sequence was intended to ensure the solenoid valve had opened so the pump would not deadhead against a closed valve.

Unfortunately, when the valve opened, gravity caused the fuel in the pipe to flow into the day tank before the pump started, which caused an automatic fuel leak detection valve to close. When the fuel pump finally started, it was pumping against a closed valve.

The fuel supply problem had not manifested previously, although the facility had undergone a number of commissioning exercises, because the engine-generator sets had not run long enough to deplete the fuel from the day tanks. In these exercises, the engine generators would run for a period of time and not start again until the next day. By then, the pumps running against the closed valve pushed enough fuel by the closed valves to refill the day tanks. The TCCF demonstrations caused the engine generators to run non-stop for an entire day, which emptied the day tanks and required the system to refill the day tanks in real time.

CORRECTIVE STEPS

The solution to this problem did not require drastic remediation, as sometimes occurs. Instead, engineers removed the time delay after the opening of the valve from the sequence of operation so that fuel could flow as desired.
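The interaction between the start delay and the leak detection valve can be sketched as a toy simulation. This is an illustration only, not the facility's control logic: the function and field names are hypothetical, and the model assumes that the leak detection valve closes whenever fuel flows while the pump is off, which is one plausible reading of the behavior described above.

```python
from dataclasses import dataclass


@dataclass
class TransferResult:
    leak_valve_closed: bool
    pump_deadheaded: bool


def run_fill_sequence(pump_start_delay_s: int, duration_s: int = 10) -> TransferResult:
    """Toy second-by-second model of one day-tank fill request."""
    leak_valve_closed = False
    pump_deadheaded = False
    for t in range(duration_s):
        # The solenoid valve opens at t = 0, when the day tank calls for fuel.
        solenoid_open = True
        # The pump starts only after the configured delay has elapsed.
        pump_running = t >= pump_start_delay_s
        # Assumption: the bulk tank sits above the day tank, so fuel flows by
        # gravity as soon as the solenoid opens, unless the leak valve is shut.
        flow = solenoid_open and not leak_valve_closed
        # Assumption: leak detection treats flow with the pump off as a leak.
        if flow and not pump_running:
            leak_valve_closed = True
        # A running pump behind a closed leak valve is deadheading.
        if pump_running and leak_valve_closed:
            pump_deadheaded = True
    return TransferResult(leak_valve_closed, pump_deadheaded)


original = run_fill_sequence(pump_start_delay_s=3)   # sequence with the delay
corrected = run_fill_sequence(pump_start_delay_s=0)  # delay removed
print(original)
print(corrected)
```

With a nonzero delay, the modeled leak valve closes before the pump starts and the pump deadheads. With the delay removed, the pump is already running when flow begins, so the valve stays open.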

MORAL OF THE STORY

Commissioning is an important exercise. It ensures that data center infrastructure is ready on day one to support a facility’s mission and business objectives. Commissioning activities must be planned so that every system is required to operate under real-world conditions. In this instance, the engine-generator set runs were too brief to test the fuel system in a real-world condition.

TCCF brings another perspective, which made all the difference in this case. In the effort to test everything during commissioning, the big picture can be lost. The TCCF focuses on demonstrating that the systems work together as a whole to support the critical load.


Keith Klesner

Keith Klesner’s career in critical facilities spans 15 years. In the role of Uptime Institute Vice President of Strategic Accounts, Mr. Klesner has provided leadership and results-driven consulting for leading organizations around the world. Prior to joining Uptime Institute, Mr. Klesner was responsible for the planning, design, construction, operation, and maintenance of critical facilities for the U.S. government worldwide. His early career includes six years as a U.S. Air Force Officer. He has a Bachelor of Science degree in Civil Engineering from the University of Colorado-Boulder and a Masters in Business Administration from the University of LaVerne. He maintains status as a Professional Engineer (PE) in Colorado and is a LEED Accredited Professional.