Improving Performance in Ever-Changing Mission-Critical IT Infrastructures

CenturyLink incorporates lessons learned and best practices for high reliability and energy efficiency.

By Alan Lachapelle

CenturyLink Technology Solutions and its antecedents (Exodus, Cable and Wireless, Qwest, and Savvis) have a long tradition of building mission critical data centers. With the advent of its Internet Data Centers in the mid-1990s, Exodus broke new ground by building facilities at unprecedented scale. Even today, legacy Exodus data centers are among the largest, highest capacity, and most robust data centers in CenturyLink’s portfolio, which the company uses to deliver innovative managed services for global businesses on virtual, dedicated, and colocation platforms (see Executive Perspectives on the Colocation and Wholesale Markets, p.51).

Through the years CenturyLink has seen significant advances not only in IT technology, but in mission-critical IT infrastructures as well; adapting to and capitalizing on those advances have been critical to the company’s success.

Applying new technologies and honing best-practice facility design standards is an ongoing process. But the best technology and design alone will not deliver the efficient, high-quality data center that CenturyLink’s customers demand. It takes experienced, well-trained staff with a commitment to rigorous adherence to standards and methods to deliver on the promise of a well-designed and well-constructed facility. Specifically, that promise is to always be up and running, come what may, to be the “perfect data center.”

The Quest
As its build process matured, CenturyLink infrastructure began to take on a phased approach, pushing the envelope and leading the industry in effective deployment of capital for mission critical infrastructures. As new technologies developed, CenturyLink introduced them to the design. As the potential densities of customers’ IT infrastructure environments increased, so too did the densities planned into new data center builds. And as the customer base embraced new environmental guidelines, designs changed to more efficiently accommodate these emerging best practices.

Not many can claim a pedigree of 56 (and counting) unique data center builds, with the continuous innovation necessary to stay on top in an industry in which constant change is the norm. The demand for continuous innovation has inspired CenturyLink’s multi-decade quest for the perfect data center design model and process. We’re currently on our fourth generation of the perfect data center, and it certainly won’t be the last.

The design focus of the perfect data center has shifted many times.

Dramatically increasing the efficiency of white space in the data centers is likely the biggest such shift. Under the model in use in 2000, a 10-megawatt (MW) IT load may have required 150,000 square feet (ft2) of white space. Today, the same capacity requires only a third the space. Better still, we have deployments of 1 MW in 2,500 ft2—six times denser than the year-2000 design. Figure 1 shows the average densities in four recent customer installations.
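
A quick back-of-the-envelope check, using only the figures quoted above, shows the scale of that shift (the short script below is ours, for illustration):

```python
# Rough density check using only the figures quoted in the text.
def density_w_per_ft2(load_mw: float, area_ft2: float) -> float:
    """IT power density in watts per square foot."""
    return load_mw * 1_000_000 / area_ft2

year_2000 = density_w_per_ft2(10, 150_000)  # ~67 W/ft2 under the year-2000 model
today = density_w_per_ft2(10, 50_000)       # the same load in a third of the space: 200 W/ft2
densest = density_w_per_ft2(1, 2_500)       # 400 W/ft2 for the 1-MW deployment

print(round(densest / year_2000))           # 6, i.e., six times the year-2000 density
```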

Figure 1. The average densities for four recent customer installations show how significantly IT infrastructure density has risen in recent years.

Our data centers are rarely homogenous, so the designs need to be flexible enough to support multiple densities in the same footprint. A high-volume trading firm might sit next to a sophisticated 24/7 e-retailer, next to a disaster recovery site for a health-care provider with a tape robot. Building in flexibility is a hurdle all successful colocation providers must overcome to effectively address their clients’ varied needs.

Concurrent with differences in power density are different cooling needs. Being able to accommodate a wide range of densities efficiently, from the lowest (storage and backup) to the highest (high-frequency trading, bitcoin mining, etc.), is a chief concern. By harnessing the latest technologies (e.g., pumped-refrigerant economizers, water-cooled chillers, high-efficiency rooftop units), we match an efficient, flexible cooling solution to the climate, helping ensure our ability to deliver value while maximizing capital efficiency.

Mechanical systems are not alone in seeing significant technological development. Electrical infrastructures have changed at nearly the same pace. All iterations of our design have safely and reliably supplied customer loads, and we have led the way in developing many best practices. Today, we continue to minimize component count, increase the mean time between failures, and pursue high operating efficiency infrastructures. To this end, we employ the latest technologies, such as delta conversion UPS systems for high availability and Eco Mode UPS systems that actually have a higher availability than double-conversion UPS systems. We consistently re-evaluate existing technologies and test new ones, including Bloom Energy’s Bloom Box solid oxide fuel cell, which we are testing in our OC2 facility in Irvine, CA. Only once a new technology is proven and has shown a compelling advantage will we implement it more broadly.

All the improvements in electrical and mechanical efficiencies could scarcely be realized in real-world data centers if controls were overlooked. Each iteration of our control scheme is more robust than the last, thanks to a multi-disciplinary team of controls experts who have built fault tolerance into the control systems. The current design, honed through much iteration, allows components to function independently, if necessary, but generates significant benefit by networking them together, so that they can be controlled collaboratively to achieve optimal overall efficiency. To be clear, each piece of equipment is able to function solely on its own if it loses communication with the network, but by allowing components with part-load efficiencies to communicate with each other effectively, the system intelligently selects ideal operating points to ensure maximum overall efficiency.

For example, the chilled-water pumping and staging software analyzes current chilled-water conditions (supply temperature, return temperature, and system flow) and chooses the appropriate number of chilled-water pumps and chillers to operate to minimize chiller plant energy consumption. To do this, the software evaluates current chiller performance against ambient temperature, load, and pumping efficiency. The entire system is simple enough to allow for effective troubleshooting and for each component to maintain required parameters under any circumstance, including failure of other components.
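
The staging idea can be sketched as follows. This is only an illustration: the equipment sizes, part-load curve, and power model are hypothetical placeholders, not CenturyLink’s actual control software.

```python
from itertools import product

# Illustrative staging logic: choose the chiller/pump combination with the
# lowest predicted plant power for the current load and ambient conditions.
CHILLER_CAPACITY_TONS = 500    # assumed capacity per chiller
PUMP_DESIGN_FLOW_GPM = 1200    # assumed design flow per pump

def plant_kw(n_chillers: int, n_pumps: int, load_tons: float, ambient_f: float) -> float:
    """Estimate total plant power for one candidate staging combination."""
    plr = load_tons / (n_chillers * CHILLER_CAPACITY_TONS)   # chiller part-load ratio
    if plr > 1.0:
        return float("inf")                                  # chillers cannot carry the load
    # Illustrative part-load curve with a mild high-ambient penalty.
    kw_per_ton = 0.75 - 0.9 * plr + 0.7 * plr ** 2 + 0.004 * max(ambient_f - 65.0, 0.0)
    chiller_kw = kw_per_ton * load_tons + 15.0 * n_chillers  # fixed auxiliaries per running chiller
    flow_gpm = load_tons * 2.4                               # ~2.4 gpm per ton at a 10F delta-T
    per_pump_flow = flow_gpm / (n_pumps * PUMP_DESIGN_FLOW_GPM)
    if per_pump_flow > 1.0:
        return float("inf")                                  # pumps cannot deliver the required flow
    pump_kw = n_pumps * (10.0 + 20.0 * per_pump_flow ** 3)   # fixed losses plus an affinity-law term
    return chiller_kw + pump_kw

def select_staging(load_tons: float, ambient_f: float, max_units: int = 4):
    """Return the (chillers, pumps) counts with the lowest predicted plant power."""
    candidates = product(range(1, max_units + 1), repeat=2)
    return min(candidates, key=lambda c: plant_kw(c[0], c[1], load_tons, ambient_f))

print(select_staging(load_tons=800, ambient_f=55))           # (2, 3) with these example curves
```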

Finally, our commissioning process has grown and matured. Learning lessons from past commissioning procedures, as well as from the industry as a whole, has made the current process increasingly rigorous. Today, simulations used to test new data centers before they come on-line closely represent actual conditions in our other facilities. A thorough commissioning process has helped us ensure our buildings are turned over to operators as efficient, reliable, and easy to operate as our designs intended.

Design and Construction Certification
As the data center industry has matured, among the things that became clear to CenturyLink was the value of Tier Certification. The Uptime Institute’s Tier Standard: Topology makes the benchmark for performance clear and attainable. While our facilities have always been resilient and maintainable, CenturyLink’s partnership with the Uptime Institute to certify our designs to its well-known and recognizable standards creates customer certainty.

CenturyLink currently has five Uptime Institute Tier III Certified Facilities in Minneapolis, MN; Chicago, IL; Toronto, ON; Orange County, CA; and Hong Kong, with a sixth underway. By having our facilities Tier Certified, we do more than simply show commitment to transparency in design. Customers who can’t view and participate in commissioning of facilities can rest assured knowing the Uptime Institute has Certified these facilities as Concurrently Maintainable. We invite comparison to other providers and know that our commitments will provide value for our customers in the long run.

Application of Design to Existing Data Centers
Our build team uses data center expansions to improve the capacity, efficiency, and reliability of existing data centers. This includes (but is not limited to) optimizing power distribution, aligning cooling infrastructure to utilize ASHRAE guidance, or upgrading controls to increase reliability and efficiency.

Meanwhile, our operations engineers continuously implement best practices and leading-edge technologies to improve the energy efficiency, capacity, and reliability of data center facilities. Portfolio-wide, the engineering team has enhanced control sequences for cooling systems, implemented electronically commutated (EC) and variable frequency drive (VFD) fans, and deployed Cold Aisle/Hot Aisle containment. These best practices serve to increase total cooling capacity and efficiency, ensuring customer server inlet conditions are homogenous and within tolerance. Figure 2 shows the total impact of all such design improvements on our company’s aggregate Power Usage Effectiveness (PUE). Working hand-in-hand with the build group, CenturyLink’s operations engineers ensure continuous improvement in perfect data center design, enhancing some areas while eliminating unneeded and unused features and functions—often based on feedback from customers.
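
For reference, the PUE values tracked in Figure 2 follow the standard industry definition:

\[
\mathrm{PUE} = \frac{\text{total facility energy}}{\text{IT equipment energy}}
\]

A lower value means less energy is spent on cooling, power distribution, and other facility overhead for each unit of IT load.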

Figure 2. As designs improve over time, CenturyLink implements best practices and lessons learned into its existing portfolio to continuously improve its aggregate PUE. It’s worth noting that the PUEs shown here include considerable office space as well as extra resiliency built into unutilized capacity.

Staffing Models
CenturyLink relies on expert data center facilities and operations teams to respond swiftly and intelligently to incidents. Systems are designed to withstand failures, but it is the facilities team that promptly corrects failures, maintaining equipment at a high level of availability that continually assures Fault Tolerance.

Each facility hosts its own Facility and Operations teams. The Facility team consists of a facility manager, a lead mechanical engineer, a lead electrical engineer, and a team of facility technicians. They maintain the building and its electrical and mechanical systems. They are experts on their locations. They ensure equipment is maintained in concurrence with CenturyLink’s maintenance standards, respond to incidents, and create detailed Methods of Procedure (MOPs) for all actions and activities. They also are responsible for provisioning new customers and maintaining facility capacity.

The Operations team consists of the operations manager, operations lead, and several operations technicians. This group staffs the center 24/7, providing our colocation customers with CenturyLink’s “Gold Support” (remote hands) for their environment so that they don’t need to dispatch someone to the data center. This team also handles structured cabling and interconnections.

Regional directors and regional engineers supplement the location teams. The regional directors and engineers serve as subject matter experts (SMEs) but, when required, can also marshal the resources of CenturyLink’s entire organization to rapidly and effectively resolve issues and ensure potential problems are addressed on a portfolio-wide basis.

The regional teams work as peers, providing each individual team member’s expertise when and where appropriate, including to operations teams outside their region when needed. Collaborating on projects and objectives, this team ensures the highest standards of excellence are consistently maintained across a wide portfolio. Thus, regional engineers and directors engage in trusted and familiar relationships with site teams, while ensuring the effective exchange of information and learning across the global footprint.

Global Standard Process Model
A well-designed, -constructed, and -staffed data center is not enough to ensure a superior level of availability. The facility is only as good as the methods and procedures that are used to operate it. A culture that embraces process is also essential in operating a data center efficiently and delivering the reliability necessary to support the world’s most demanding businesses.

Uptime and latency are primary concerns for CenturyLink. The CenturyLink brand depends upon a sustained track record of excellence. Maintaining consistent reliability and availability across a varied and changing footprint requires an intensive and dynamic facilities management program built on uncompromising adherence to well-planned standards. These standards have been modeled in the spirit of the IT Infrastructure Library (ITIL) and are the result of years of planning, consideration, and trial and error. Adherence further requires close monitoring of many critical metrics, which is facilitated by the dashboard shown in Figure 3.

Figure 3. CenturyLink developed a dynamic dashboard that tracks and trends important data: site capacity, PUE, available raised floor space, operational costs, abnormal Incidents, uptime metrics, and much more to provide a central source of up-to-date information for all levels of the organization.

Early in the development of Savvis as a company, management established many of the organizational structures that exist today in CenturyLink Technology Solutions. As CenturyLink grew on many fronts and serviced increasingly demanding customers (in increasing volume), these structures continued to evolve to suit the ever-changing needs of the company and its customers.

First, the management team developed a staff capable of administering the many programs that would be required to maintain the standard of excellence demanded by the industry. Savvis developed models analyzing labor and maintenance requirements across the company and determined the most appropriate places to invest in personnel. Training was emphasized, and teams of SMEs were developed to implement the more detailed aspects of facilities operations initiatives. The management team is centralized, in a sense, because it is one global organization; this enhances the objective of global standardization. Yet the team is geographically diverse, subdivided into teams dedicated to each site and regional teams working with multiple sites, ensuring that all standards are applied globally throughout the company. And all teams contribute to the ongoing evolution of those standards and practices—for example, participating in two global conference calls per week.

Next, it was important to set up protocols to handle and resolve issues as they developed, inform customers of any impact, and help customers respond to and manage situations. No process guarantees that situations will play out exactly as anticipated, so a protocol to handle unexpected events was crucial. This process relied on an escalation schedule that brought decisions through the SMEs for guidance and gave decision makers the proper tools for decision making and risk mitigation. In parallel, a process was developed to ensure that any incident affecting customers triggered notifications to those customers so they could prepare for or mitigate the impact of an event.

A tracking system accomplished many things. For example, it ensured follow up on items that might create problems in the future, identified similar scenarios or locations where a common problem might recur, established a review and training process to prevent future incidents through operator education, justified necessary improvements in systems creating problems, and tracked performance over longer periods to analyze success in implementation and evaluate need for plan improvement. The tracking system is inclusive of all types of problems, including those related to internal equipment, employees, and vendors.

Data centers, being dynamic, require frequent change. Yet unmanaged change can present a significant threat to business continuity. Congruent with the other programs, CenturyLink set up a Change Management program. This program tracked changes, their impacts, and their completion. It ensured that risks were understood and planned for and that unnecessary risks were not taken.

Any request for change, either internal or from a customer, must go through the Change Management process and be evaluated on metrics for risk. These metrics determine the level of controls associated with that work and what approvals are required. The key risk factors considered in this analysis include the possible number of customers impacted, likelihood of impact, and level of impact. Even more importantly, the process evaluates the risk of not completing a task and balances these factors. The Change Management program and standardization of risk analysis necessitated standardizing maintenance procedures and protocols as well.
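
A simplified sketch of such a risk evaluation follows; the factor weights and approval thresholds are hypothetical illustrations, not CenturyLink’s actual criteria.

```python
from dataclasses import dataclass

# Hypothetical sketch of a change-risk score like the one described above.
@dataclass
class ChangeRequest:
    customers_impacted: int      # possible number of customers affected
    likelihood_of_impact: float  # 0.0 (very unlikely) to 1.0 (certain)
    level_of_impact: int         # 1 (minor) to 5 (outage)
    risk_of_deferral: int        # 1 (none) to 5 (severe) risk of NOT doing the work

def risk_score(cr: ChangeRequest) -> float:
    """Balance the risk of doing the work against the risk of deferring it."""
    execution_risk = cr.customers_impacted * cr.likelihood_of_impact * cr.level_of_impact
    return execution_risk - 2.0 * cr.risk_of_deferral

def required_approval(cr: ChangeRequest) -> str:
    """Map the score to an approval level (thresholds are made up for illustration)."""
    score = risk_score(cr)
    if score < 5:
        return "site facility manager"
    if score < 25:
        return "regional engineer review"
    return "regional director and SME review"

print(required_approval(ChangeRequest(40, 0.2, 3, 4)))  # "regional engineer review"
```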

Standards, policies, and best practices were established, documented, and reviewed by management. These create the operating frameworks for implementing IT Infrastructure Library (ITIL) methodology, enabling CenturyLink to incorporate industry best practices and standards, as well as develop internal operating best practices, all of which maximize uptime and resource utilization.

A rigid document-control program was established utilizing peer review, and all activities or actions performed were scripted, reviewed, and approved. Peer review also contributed to personnel training, ensuring that as documentation was developed, peers collaborated and maintained expertise on the affected systems. Document standardization was extended to casualty response as well. Even responding to failures requires use of approved procedures, and the response to every alarm or failure is scripted so the team can respond in a manner minimizing risk. In other words, there is a scripted procedure even for dealing with things we’ve never before encountered. This document control program and standardization has enabled personnel to easily support other facilities during periods of heightened risk, without requiring significant training for staff to become familiar with the facilities receiving the additional support.

Conclusion
All the factors described in this paper combine to allow CenturyLink to operate a mission-critical business on a grand scale, with uniform operational excellence. Without this framework in place, CenturyLink would not be able to maintain the high availability on which its reputation is built. Managing these factors while continuing to grow has obvious challenges. However, as CenturyLink grows, these practices are increasingly improved and refined. CenturyLink strives for continuous improvement and views reliability as a competitive advantage. The protocols CenturyLink follows are second-to-none, and help ensure the long-term viability of not only data center operations but also the company as a whole. The scalability and flexibility of these processes can be seen from the efficiency with which CenturyLink has integrated them into its new data center builds as well as data centers it acquired. As CenturyLink continues to grow, these programs will continue to be scaled to meet the needs of demanding enterprise businesses.

Site Spotlight: CH2
In 2013, we undertook a massive energy efficiency initiative at our CH2 data center in Chicago, IL. More than two years of planning went into the project, and the energy savings were considerable.

Projects included:
• Occupancy sensors for lighting
• VFDs on direct expansion computer room air conditioning units
• Hot Aisle containment
• Implementing advanced economization controls
• Replacing cooling tower fill with high-efficiency evaporative material
• Installing high-efficiency cooling tower fan blades

These programs combined to reduce our winter PUE by over 17%, and our summer PUE by 20%. Additionally, the winter-time period has grown as we have greatly expanded the full free cooling window and added a large partial free cooling window, using the evaporative impact of the cooling towers to great advantage.

Working with Commonwealth Edison, we secured energy efficiency rebates that contributed over US$500,000 to the project’s return, alongside annual savings of nearly 10,000,000 kilowatt-hours. With costs of approximately US$1,000,000, this project proved to be an incredible investment, with an internal rate of return of ≈100% and a Net Present Value of over US$5,000,000. We consider this a phenomenal example of effective best practice implementation and energy efficiency.
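
The economics can be roughly reconstructed from the rounded figures above. Only the cost, rebate, and kilowatt-hour savings come from the text; the electricity price, discount rate, and analysis horizon below are our assumptions for illustration.

```python
# Rough reconstruction of the CH2 project economics under stated assumptions.
ANNUAL_KWH_SAVED = 10_000_000
PRICE_PER_KWH = 0.06        # assumed blended $/kWh
PROJECT_COST = 1_000_000
UTILITY_REBATE = 500_000
DISCOUNT_RATE = 0.06        # assumed
YEARS = 15                  # assumed analysis horizon

annual_savings = ANNUAL_KWH_SAVED * PRICE_PER_KWH
cash_flows = [-(PROJECT_COST - UTILITY_REBATE)] + [annual_savings] * YEARS

def npv(rate, flows):
    """Net present value of a cash-flow list indexed by year."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(flows))

def irr(flows, lo=0.0, hi=10.0):
    """Bisection IRR solver (NPV falls monotonically as the rate rises here)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if npv(mid, flows) > 0:
            lo = mid
        else:
            hi = mid
    return mid

print(f"NPV ~ ${npv(DISCOUNT_RATE, cash_flows):,.0f}")  # about $5.3 million with these assumptions
print(f"IRR ~ {irr(cash_flows):.0%}")                   # about 120%, the same ballpark as the reported ~100%
```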


 

Alan Lachapelle

Alan Lachapelle is a mechanical engineer at CenturyLink Technology Solutions. He has 6 years of experience in Naval Nuclear Propulsion on submarines and 4 years of experience in mission critical IT infrastructure. Merging rigorous financial analysis with engineering expertise, Mr. Lachapelle has helped ensure the success of engineering as a business strategy.

Mr. Lachapelle’s responsibilities include energy-efficiency initiatives, data center equipment end of life, operational policies and procedures, peer review of maintenance and operations procedures, utilization of existing equipment, and financial justifications for engineering projects throughout the company.

Operational Upgrade Helps Fuel Oil Exploration Surveyor

Petroleum Geo-Services increases its capabilities using innovative data center design
By Rob Elder and Mike Turff

Petroleum Geo-Services (PGS) is a leading oil exploration surveyor that helps oil companies find offshore oil and gas reserves. Its range of seismic and electromagnetic services, data acquisition, processing, reservoir analysis/interpretation, and multi-client library data all require PGS to collect and process vast amounts of data in a secure and cost-efficient manner. This all demands large quantities of compute capacity and deployment of a very high-density configuration. PGS operates 21 data centers globally, with three main data center hubs located in Houston, Texas; Kuala Lumpur, Malaysia; and Weybridge, Surrey (see Figure 1).

Figure 1. PGS global computing centers 

Weybridge Data Center

Keysource Ltd designed and built the Weybridge Data Center for PGS in 2008. The high-density IT facility won a number of awards and saved PGS 6.2 million kilowatt-hours (kWh) annually compared to the company’s previous UK data center. The Weybridge Data Center is located in an office building, which poses a number of challenges to designers and builders of a high-performance data center. The initial project phase in 2008 was designed as the first phase of a three-phase deployment (see Figure 2).

Figure 2. Phase 1 Data Center (2008)

Phase one was designed for 600 kW of IT load, which was scalable up to 1.8 megawatts (MW) across two future phases if required. Within the facility, power rack densities of 20 kW were easily supported, exceeding the 15-kW target originally specified by the IT team at PGS.

The data center houses select mission-critical applications supporting business systems, but it primarily operates the data mining and analytics associated with the core business of oil exploration. This IT is deployed in full-height racks and requires up to 20 kW per rack anywhere in the facility and at any time (see Figure 3).

Figure 3. The full PGS site layout

In 2008, PGS selected Keysource’s ecofris solution for use at its Weybridge Data Center (see Figure 4), which became the first facility to use the technology. Ecofris recirculates air within a data center without using fresh air. Instead, air is supplied to the data center through the full height of a wall between the raised floor and suspended ceiling. Hot air from the IT racks is ducted into the suspended ceiling and then drawn back to the cooling coils of air handling units (AHUs) located at the perimeter walls. The system makes use of adiabatic technology for external heat rejection when external temperatures and humidity do not allow 100% free cooling.

Figure 4. Ecofris units are part of the phase 1 (2008) cooling system to support PGS’s high-density IT. 

Keysource integrated a water-cooled chiller into the ecofris design to provide mechanical cooling when needed to supplement the free cooling system (see Figure 5). As a result PGS ended up with two systems, each having a 400-kW chiller, which run for only 50 hours a year on average when external ambient conditions are at their highest.

Figure 5. Phase 2 ecofris cooling

As a result of this original design, the Weybridge Data Center used outside air for heat rejection without allowing that air into the building. Airflow design, a comprehensive control system, and total separation of hot and cold air mean that the facility can accommodate 30 kW in any rack and deliver a PUE L2,YC (Level 2, Continuous Measurement) of 1.15 while maintaining a consistent server inlet temperature of 72°F (22°C) ±1° across the entire space. Adopting an indirect free cooling design rather than direct fresh air eliminated the need for major filtration or mechanical backup (see the sidebar).

Surpassing the Original Design Goals
When PGS needed additional compute capacity, the Weybridge Data Center was a prime candidate for expansion because it had the flexibility to deploy high-density IT anywhere within the facility and a low operating cost. However, while the original design anticipated two future 600-kW phases, PGS wanted even more capacity because of the growth of its business and its need for the latest IT technology. In addition, PGS wanted to make a huge drive to reduce operating costs through efficient design of cooling systems and to maximize power capacity at the site.

When the recent project was completed at the end of 2013, the Weybridge Data Center housed the additional high-density IT within the footprint of the existing data hall. The latest ecofris solution, which uses a chillerless design, was deployed, limiting the increase in power demand.

Keysource undertook the design by looking at ways to maximize the use of white space for IT as well as to remove the overhead cost of power to run mechanical cooling, even for a very limited number of hours a year. This would ensure maximum availability of power capacity for the IT equipment. While operating efficiency (annualized PUE) improved only marginally, the biggest design change was the reduced peak PUE. This change enabled an increase in IT design load from 1.8 MW to 2.7 MW within the same footprint. At just over 5 kW/square meter (m2), PGS can deploy 30 kW in any cabinet up to the maximum total installed IT capacity (see Figure 6).

Figure 6.  More compute power within the same overall data center footprint 

Disruptive Cooling Design
Developments in technology and the wider allowable range of temperatures per ASHRAE TC9.9 enabled PGS to adopt higher server inlet temperatures when ambient temperatures are higher. This change allows PGS to operate at the optimum temperature for the equipment most of the time (normally 72°F/22°C), lowering the IT part of the PUE metric (see Figure 7).

Figure 7. Using computational fluid dynamics to model heat and airflow 

In this facility, supply (server inlet) temperatures are elevated only when the ambient outside air is too warm to maintain 72°F (22°C). Running at higher temperatures at other times actually increases server fan power across different equipment, which also increases UPS loads. Running the facility at optimal efficiency all of the time reduces the overall facility load, even though PUE may rise as a result of the decrease in server fan power. With site facilities management (FM) teams trained in operating the mechanical systems, this strategy is fine-tuned through operation and as additional IT equipment is commissioned within the facility, ensuring performance is maintained at all times.
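
The control intent can be sketched as follows; the approach temperature and the upper cap are assumptions made for illustration, not Keysource’s actual setpoints.

```python
# Simplified sketch of the supply-temperature strategy described above.
NOMINAL_SUPPLY_C = 22.0   # normal server inlet target (72F)
MAX_ALLOWABLE_C = 27.0    # assumed cap at the top of the ASHRAE recommended envelope
APPROACH_C = 4.0          # assumed indirect-economizer approach temperature

def supply_setpoint_c(outdoor_wet_bulb_c: float) -> float:
    """Hold 22C whenever the economizer can achieve it; otherwise let the
    setpoint drift upward, but never beyond the allowable cap."""
    achievable = outdoor_wet_bulb_c + APPROACH_C
    return min(max(NOMINAL_SUPPLY_C, achievable), MAX_ALLOWABLE_C)

print(supply_setpoint_c(14.0))  # 22.0 -> cool day, hold the nominal inlet temperature
print(supply_setpoint_c(21.0))  # 25.0 -> warm day, allow the inlet temperature to rise
```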

Innovation was central to the improved performance of the data center: in addition to the cooling, Keysource delivered modular, highly efficient UPS systems providing 96% efficiency from >25% facility load, plus facility controls that provide automated optimization.

A Live Environment
Working in a live data center environment within an office building was never going to be risk free. Keysource built a temporary wall within the existing data center to divide the live operational equipment from the project area (see Figure 8). Cooling, power, and data for the live equipment aren’t on a raised floor and are delivered from the same end of the data center. Therefore, the dividing screen had limited impact on the live environment, with only some minor modifications needed to the fire detection and suppression systems.

Figure 8. The temporary protective wall built for phase 2

Keysource also manages the data center facility for PGS, which meant that the FM and projects teams could work closely together in planning the upgrade. As a result, facilities management considerations were included in all design and construction planning to minimize risk to the operational data center as well as to reduce the impact on other business operations at the site.

Upon completing the project, a full integrated-system test of the new equipment was undertaken ahead of removing the dividing partition. This test not only covered the function of electrical and mechanical systems but also tested the capability of the cooling to deliver the 30 kW/rack and the target design efficiency. Using rack heaters to simulate load allowed detailed testing to be carried out ahead of the deployment of the new IT technology (see Figure 9).

Figure 9. Testing the 30 kW per rack full load 

Results
Phase two was completed in April 2014, and as a result the facility’s power density improved by approximately 50%, with the total IT capacity now scalable up to 2.7 MW. This has been achieved within the same internal footprint. The facility now has the capability to accommodate up to 188 rack positions, supporting up to 30 kW per rack. In addition, the PUE L2,YC of 1.15 was maintained (see Figure 10).
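
A quick check of those numbers (ours, for illustration) shows that the 2.7-MW ceiling, not the per-rack limit, is the binding constraint:

```python
# Quick consistency check on the phase 2 figures quoted above.
racks, per_rack_kw, total_it_kw = 188, 30, 2700

print(racks * per_rack_kw)            # 5640 kW if every rack ran at 30 kW simultaneously...
print(round(total_it_kw / racks, 1))  # ...but 14.4 kW average at the 2.7-MW ceiling, so 30 kW
                                      # is supported anywhere, not everywhere at once
print(total_it_kw / 1800 - 1)         # 0.5 -> the ~50% capacity improvement over phase 1's 1.8 MW
```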

Figure 10. A before and after comparison

The data center upgrade has been hailed as a resounding success, earning PGS and Keysource a Brill Award for Efficient IT from Uptime Institute. PGS is absolutely delighted to have the quality of its facility recognized by a judging panel of industry leaders and to receive a Brill Award.

Direct and Indirect Cooling Systems
Keysource hosted an industry expert roundtable that provided additional insight and debate on two pertinent cooling topics highlighted by the PGS story. Copies of the resulting whitepapers can be obtained at http://www.keysource.co.uk/data-centre-white-papers.aspx

An organization requiring high availability is unlikely to install a direct fresh air system without 100% backup on the mechanical cooling. This is because the risks associated with the unknowns of what could happen outside, however infrequent, are generally out of the operator’s control.

The density of the IT equipment does not in itself favor direct or indirect designs. It is the control of air and the method of air delivery within the space that dictates capacity and air volume requirements. There may be additional considerations for how backup systems and the control strategy for switching between cooling methods work in high-density environments, due to the risk of rapid thermal rise in very short periods, but this comes down to each individual design.

Given the roundtable’s agreement that direct fresh air will require some sort of backup system to meet availability and customer risk requirements, it is worth considering the benefits of opting for either a direct or an indirect design.

Partly because of the different solutions in these two areas and partly because there are other variables on a site-specific basis, there are not many clear benefits either way, but a few considerations include:

• Indirect systems pose less or no risk from external pollutants and contaminants.

• Indirect systems do not require integration into the building fabric, where a direct system often needs large ducts or modifications to the shell. This can increase complexity and cost if, due to space or building height, it is even achievable.

• Direct systems often require more humidity control, depending on which ranges are to be met.
Most efficient systems include some form of adiabatic cooling. With direct systems there is often a reliance on water to provide capacity rather than simply to improve efficiency. In this case there is a much greater reliance on water for normal operation and to maintain availability, which can lead to the need for water storage or other measures. The metric of water usage effectiveness (WUE) also needs to be considered.
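
For reference, WUE is commonly defined (by The Green Grid) as annual site water usage divided by IT equipment energy:

\[
\mathrm{WUE} = \frac{\text{annual site water usage (liters)}}{\text{IT equipment energy (kWh)}}
\]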

Many data center facilities are already built with very inefficient cooling solutions. In such cases direct fresh air solutions provide an excellent opportunity to retrofit and run as the primary method of cooling, with the existing inefficient systems as back up. As the backup system is already in place this is often a very affordable option with a clear ROI.

One of the biggest advantages of an indirect system is the potential for zero refrigeration. Half of the U.S. could take this route, and even places people would never consider, such as Madrid or even Dubai, could benefit. This approach inevitably requires a heavy reliance on water, as well as the acceptance of rising server inlet temperatures during warmer periods.


Mike Turff

Mike Turff is global compute resources manager for the Data Processing division of Petroleum Geo-Services (PGS), a Norwegian-based leader in oil exploration and production services. Mr. Turff has responsibility for building and managing the PGS supercomputer centers in Houston, TX; London, England; Kuala Lumpur, Malaysia; and Rio de Janeiro, Brazil, as well as the smaller satellite data centers across the world. He has worked for over 25 years in high-performance computing, building and running supercomputer centers in places as diverse as Nigeria and Kazakhstan and for Baker Hughes, where he built the Eastern Hemisphere IT Services organization with IT Solutions Centers in Aberdeen, Scotland; Dubai, UAE; and Perth, Australia.

 

Rob Elder

As Sales and Marketing director, Rob Elder is responsible for setting and implementing the strategy for Keysource. Based in Sussex in the United Kingdom, Keysource is a data center design, build, and optimization specialist. During his 10 years at Keysource, Mr. Elder has also held marketing and sales management positions and management roles in the Facilities Management and Data Centre Management Solutions business units.

Digital Realty Deploys Comprehensive DCIM Solution

Examining the scope of the challenge
By David Schirmacher

Digital Realty’s 127 properties cover around 24 million square feet of mission-critical data center space in over 30 markets across North America, Europe, Asia and Australia, and it continues to grow and expand its data center footprint. As senior VP of operations, it’s my job to ensure that all of these data centers perform consistently—that they’re operating reliably and at peak efficiency and delivering best-in-class performance to our 600-plus customers.

At its core, this challenge is one of managing information. Managing any one of these data centers requires access to large amounts of operational data.

If Digital Realty could collect all the operational data from every data center in its entire portfolio and analyze it properly, the company would have access to a tremendous amount of information that it could use to improve operations across its portfolio. And that is exactly what we have set out to do by rolling out what may be the largest-ever data center infrastructure management (DCIM) project.

Earlier this year, Digital Realty launched a custom DCIM platform that collects data from all the company’s properties, aggregates it into a data warehouse for analysis, and then reports the data to our data center operations team and customers using an intuitive browser-based user interface. Once the DCIM platform is fully operational, we believe we will have the ability to build the largest statistically meaningful operational data set in the data center industry.

Business Needs and Challenges
The list of systems that data center operators report using to manage their data center infrastructure often includes a building management system, an assortment of equipment-specific monitoring and control systems, possibly an IT asset management program and quite likely a series of homegrown spreadsheets and reports. But they also report that they don’t have access to the information they need. All too often, the data required to effectively manage a data center operation is captured by multiple isolated systems, or worse, not collected at all. Accessing the data necessary to effectively manage a data center operation continues to be a significant challenge in the industry.

At every level, data and access to data are necessary to measure data center performance, and DCIM is intrinsically about data management. In 451 Research’s DCIM: Market Monitor Forecast, 2010-2015, analyst Greg Zwakman writes that a DCIM platform, “…collects and manages information about a data center’s assets, resource use and operational status.” But 451 Research’s definition does not end there. The collected information “…is then distributed, integrated, analyzed and applied in ways that help managers meet business and service-oriented goals and optimize their data center’s performance.” In other words, a DCIM platform must be an information management system that, in the end, provides access to the data necessary to drive business decisions.

Over the years, Digital Realty successfully deployed both commercially available and custom software tools to gather operational data at its data center facilities. Some of these systems provide continuous measurement of energy consumption and give our operators and customers a variety of dashboards that show energy performance. Additional systems deliver automated condition and alarm escalation, as well as work order generation. In early 2012 Digital Realty recognized that the wealth of data that could be mined across its vast data center portfolio was far greater than current systems allowed.

In response to this realization, Digital Realty assembled a dedicated and cross-functional operations and technology team to conduct an extensive evaluation of the firm’s monitoring capabilities. The company also wanted to leverage the value of meaningful data mined from its entire global operations.

The team realized that the breadth of the company’s operations would make the project challenging even as it began designing a framework for developing and executing its solution. Neither Digital Realty nor its internal operations and technology teams were aware of any similar development and implementation project at this scale—and certainly not one done by an owner/operator.

As the team analyzed data points across the company portfolio, it found additional challenges. Those challenges included how to interlace the different varieties and vintages of infrastructure across the company’s portfolio, taking into consideration the broad deployment of Digital Realty’s Turn-Key Flex data center product, the design diversity of its custom solutions and acquired data center locations, the geographic diversity of the sites and the overall financial implications of the undertaking as well as its complexity.

Drilling Down
Many data center operators are tempted to first explore what DCIM vendors have to offer when starting a project, but taking the time to gain internal consensus on requirements is a better approach. Since no two commercially available systems offer the same features, assessing whether a particular product is right for an application is almost impossible without a clearly defined set of requirements. All too often, members of due diligence teams are drawn to what I refer to as “eye candy” user interfaces. While such interfaces might look appealing, the 3-D renderings and colorful “spinning visual elements” are rarely useful and can often be distracting to a user whose true goal is managing operational performance.

When we started our DCIM project, we took a highly disciplined approach to understanding our requirements and those of our customers. Harnessing all the in-house expertise that supports our portfolio to define the project requirements was itself a daunting task but essential to defining the larger project. Once we thought we had a firm handle on our requirements, we engaged a number of key customers and asked them what they needed. It turned out that our customers’ requirements aligned well with those our internal team had identified. We took this alignment as validation that we were on the right track. In the end, the project team defined the following requirements:

• The first of our primary business requirements was global access to consolidated data. We required every single one of Digital Realty’s data centers have access to the data, and we needed the capability to aggregate data from every facility into a consolidated view, which would allow us to compare performance of various data centers across the portfolio in real time.

• Second, the data access system had to be highly secure and give us the ability to limit views based on user type and credentials. More than 1,000 people in Digital Realty’s operations department alone would need some level of data access. Plus, we have a broad range of customers who would also need some level of access, which highlights the importance of data security.

• The user interface also had to be extremely user-friendly. If we didn’t get that right, Digital Realty’s help desk would be flooded with requests on how to use the system. We required a clean navigational platform that is intuitive enough for people to access the data they need quickly and easily, with minimal training.

• Data scalability and mining capability were other key requirements. The amount of information Digital Realty has across its many data centers is massive, and we needed a database that could handle all of it. We also had to ensure that Digital Realty would get that information into the database. Digital Realty has a good idea of what it wants from its dashboard and reporting systems today, but in five years the company will want access to additional kinds of data. We don’t want to run into a new requirement for reporting and not have the historical data available to meet it.

Other business requirements included:

• Open bidirectional access to data that would allow the DCIM system to exchange information with other systems, including computerized maintenance management systems (CMMS), event management, procurement and invoicing systems

• Real-time condition assessment that allows authorized users to instantly see and assess operational performance and reliability at each local data center as well as at our central command center

• Asset tracking and capacity management

• Cost allocation and financial analysis to show not only how much energy is being consumed but also how that translates to dollars spent and saved

• The ability to pull information from individual data centers back to a central location using minimal resources at each facility

Each of these features was crucial to Digital Realty. While other owners and operators may share similar requirements, the point is that a successful project is always contingent on how much discipline is exercised in defining requirements in the early stages of the project—before users become enamored by the “eye candy” screens many of these products employ.

To Buy or Build?
With 451 Research’s DCIM definition—as well as Digital Realty’s business requirements—in mind, the project team could focus on delivering an information management system that would meet the needs of a broad range of user types, from operators to C-suite executives. The team wanted DCIM to bridge the gap between facilities and IT systems, thus providing data center operators with a consolidated view of the data that would meet the requirements of each user type.

The team discussed whether to buy an off-the-shelf solution or to develop one on its own. A number of solutions on the market appeared to address some of the identified business requirements, but the team was unable to find a single solution that had the flexibility and scalability required to support all of Digital Realty’s operational requirements. The team concluded it would be necessary to develop a custom solution.

Avoiding Unnecessary Risk
There is significant debate in the industry about whether DCIM systems should have control functionality—i.e., the ability to change the state of IT, electrical and mechanical infrastructure systems. Digital Realty strongly disagrees with the idea of incorporating this capability into a DCIM platform. By its very definition, DCIM is an information management system. To be effective, this system needs to be accessible to a broad array of users. In our view, granting broad access to a platform that could alter the state of mission-critical systems would be careless, despite security provisions that would be incorporated into the platform.

While Digital Realty and the project team excluded direct-control functionality from its DCIM requirements, they saw that real-time data collection and analytics could be beneficial to various control-system schemas within the data center environment. Because of this potential benefit, the project team took great care to allow for seamless data exchange between the core database platform and other systems. This feature will enable the DCIM platform to exchange data with discrete control subsystems in situations where the function would be beneficial. Further, making the DCIM a true browser-based application would allow authorized users to call up any web-accessible control system or device from within the application. These users could then key in the additional security credentials of that system and have full access to it from within the DCIM platform. Digital Realty believes this strategy fully leverages the data without compromising security.

The Challenge of Data Scale
Managing the volume of data generated by a DCIM is among the most misunderstood areas of DCIM development and application. A DCIM platform collects, analyzes and stores a truly immense volume of data. Even a relatively small data center generates staggering amounts of information—billions of annual data transactions—that few systems can adequately support. By contrast, most building management systems (BMS) have very limited capability to manage significant amounts of historical data for the purposes of defining ongoing operational performance and trends.

Consider a data center with a 10,000-ft2 data hall and a traditional BMS that monitors a few thousand data points associated mainly with the mechanical and electrical infrastructure. This system communicates in near real time with devices in the data center to provide control- and alarm-monitoring functions. However, the information streams are rarely collected. Instead they are discarded after being acted on. Most of the time, in fact, the information never leaves the various controllers distributed throughout the facility. Data are collected and stored at the server for a period of time only when an operator chooses to manually initiate a trend routine.

If the facility operators were to add an effective DCIM to the facility, it would be able to collect much more data. In addition to the mechanical and electrical data, the DCIM could collect power and cooling data at the IT rack level and for each power circuit supporting the IT devices. The DCIM could also include detailed information about the IT devices installed in the racks. Depending on the type and amount desired, data collection could easily require 10,000 points.

But the challenge facing this facility operator is even more complex. In order to evaluate performance trends, all the data would need to be collected, analyzed and stored for future reference. If the DCIM were to collect and store a value for each data point for each minute of operation, it would have more than five billion transactions per year. And this would be just the data coming in. Once collected, the five billion transactions would have to be sorted, combined and analyzed to produce meaningful output. Few, if any, of the existing technologies installed in a typical data center have the ability to manage this volume of information. In the real world, Digital Realty is trying to accomplish this same goal across its entire global portfolio.
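
The arithmetic behind that five billion figure is straightforward, assuming one-minute sampling as described:

```python
# Minute-by-minute sampling of 10,000 points over a year.
points = 10_000
samples_per_point = 60 * 24 * 365         # one reading per point per minute
print(points * samples_per_point)         # 5,256,000,000 -> more than five billion per year
```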

The Three Silos of DCIM
As Digital Realty’s project team examined the process of developing a DCIM platform, it found that the challenge included three distinct silos of data functionality: the engine for collection, the logical structures for analysis and the reporting interface.

Figure 1. Digital Realty’s view of the DCIM stack.

The engine of Digital Realty’s DCIM must reach out and collect vast quantities of data from the company’s entire portfolio (see Figure 1). The platform will need to connect to all the sites and all the systems within these sites to gather information. This challenge requires a great deal of expertise in the communication protocols of these systems. In some instances, accomplishing this goal will require “cracking” data formats that have historically stranded data within local systems. Once collected, the data must be checked for integrity and packaged for reliable transmission to the central data store.

The project team also faced the challenge of creating the logical data structures needed to process, analyze and archive the data once the DCIM has successfully accessed and transmitted the raw data from each location to the data store. Dealing with 100-plus data centers, often with hundreds of thousands of square feet of white space each, dramatically increases the scale of the challenge. The project team overcame a major hurdle in addressing this challenge when it was able to define relationships between various data categories that allowed the database developers to prebuild and then volume-test data structures to ensure they were up to the challenge.

These data structures, or “data hierarchies” as Digital Realty’s internal team refers to them, are the “secret sauce” of the solution (see Figure 2). Many of the traditional monitoring and control systems in the marketplace require a massive amount of site-level point mapping that is often field-determined by local installation technicians. These points are then manually associated with the formulas necessary to process the data. This manual work is why these projects often take much longer to deploy and can be difficult to commission as mistakes are flushed out.

Figure 2. Digital Realty mapped all the information sources and their characteristics as a step toward developing its DCIM.

In this solution, these data relationships have been predefined and are built into the core database from the start. Since this solution is targeted specifically to a data center operation, the project team was able to identify a series of data relationships, or hierarchies, that can be applied to any data center topology and still hold true.

For example, an IT application such as an email platform will always be installed on some type of IT device or devices. These devices will always be installed in some type of rack or footprint in a data room. The data room will always be located on a floor, the floor will always be located in a building, the building in a campus or region, and so on, up to the global view. The type of architectural or infrastructure design doesn’t matter; the relationship will always be fixed.

The challenge is defining a series of these hierarchies that always test true, regardless of the design type. Once defined, the hierarchies can be prebuilt, validated, and optimized to handle scale. There are many opportunities for these kinds of hierarchies, and this is exactly what we have done.

Having these structures in place facilitates rapid deployment and minimizes data errors. It also streamlines the dashboard analytics and reporting capabilities, as the project team was able to define specific data requirements and relationships and then point the dashboard or report at the layer of the hierarchy to be analyzed. For example, a single report template designed to look at IT assets can be developed and optimized and would then rapidly return accurate values based on where the report was pointed. If pointed at the rack level, the report would show all the IT assets in the rack; if pointed at the room level, the report would show all the assets in the room, and so on. Since all the locations are brought into a common predefined database, the query will always yield an apples-to-apples comparison regardless of any unique topologies existing at specific sites.
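
A minimal sketch of the idea follows, with illustrative field names and sample data rather than EnVision’s actual schema:

```python
from collections import namedtuple

# Every asset record carries the same fixed location chain, so one query
# template works at any layer of the hierarchy.
Asset = namedtuple("Asset", "name power_kw rack room floor building region")

assets = [
    Asset("mail-01", 0.4, "R101", "DH1", "2", "BLDG-A", "NA"),
    Asset("mail-02", 0.4, "R101", "DH1", "2", "BLDG-A", "NA"),
    Asset("db-07",   1.1, "R214", "DH2", "2", "BLDG-A", "NA"),
]

def report(level: str, value: str):
    """The same report template 'pointed at' any layer of the hierarchy."""
    return [a for a in assets if getattr(a, level) == value]

print(len(report("rack", "R101")))        # 2 assets in the rack
print(len(report("building", "BLDG-A")))  # 3 assets in the building
```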

Figure 3. Structure and analysis as well as web-based access were important functions.

Last comes the challenge of creating the user interface, or front end, for the system. There is no point in collecting and processing the data if operators and customers can’t easily access it. A core requirement was that the front end needed to be a true browser-based application. Terms like “web-based” or “web-enabled” are often used in the control industry to disguise the user interface limitations of existing systems. Often, to achieve some of the latest visual and 3-D effects, vendors will require the user’s workstation to be configured with a variety of thin-client applications. In some cases, full-blown applications have to be installed. For Digital Realty, installing add-ins on workstations would be impractical given the number of potential users of the platform. In addition, in many cases, customers would reject these installs due to security concerns. A true browser-based application requires only a standard computer configuration, a browser and the correct security credentials (see Figure 3).

Intuitive navigation is another key user interface requirement. A user should need very little training to get to the information they need. Further, the information should be displayed in a way that ensures quick and accurate assessment of the data.

Digital Realty’s DCIM Solution
Digital Realty set out to build and deploy a custom DCIM platform to meet all these requirements. Rollout commenced in May 2013, and as of August, the core team was ahead of schedule in terms of implementing the DCIM solution across the company’s global portfolio of data centers.

Digital Realty named the platform EnVision, reflecting its ability to look at data from different user perspectives. Digital Realty developed EnVision to allow its operators and customers insight into their operating environments and also to offer unique features specifically targeted to colocation customers. EnVision provides Digital Realty with vastly increased visibility into its data center operations as well as the ability to analyze information so it is digestible and actionable. It has a user interface with data displays and reports that are tailored to operators. Finally, it has access to historical and predictive data.

In addition, EnVision provides a global perspective allowing high-level and granular views across sites and regions. It solves the stranded data issue by reaching across all relevant data stores on the facilities and IT sides to provide a comprehensive and consolidated view of data center operations. EnVision is built on an enterprise-class database platform that allows for unlimited data scaling and analysis and provides intuitive visuals and data representations, comprehensive analytics, dashboard and reporting capabilities from an operator’s perspective.

Trillions of data points will be collected and processed by true browser-based software that is deployed on high-availability network architecture. The data collection engine offers real-time, high-speed and high-volume data collection and analytics across multiple systems and protocols. Furthermore, reporting and dashboard capabilities offer visualization of the interaction between systems and equipment.

Executing the Rollout
A project of this scale requires a broad range of skill sets to execute successfully. IT specialists must build and operate the high-availability compute infrastructure that the core platform sits on. Network specialists define the data transport mechanisms from each location.

Control specialists create the data integration for the various systems and data sources. Others assess the available data at each facility, determine where gaps exist and define the best methods and systems to fill those gaps.

The project team’s approach was to create and install the core, head-end compute architecture using a high-availability model and then to target several representative facilities for proof of concept. This allowed the team of specialists to work out the installation and configuration challenges and then to build a template so that Digital Realty could repeat the process successfully at other facilities. With the process validated, the program moved on to the full rollout phase, with multiple teams executing across the company’s portfolio.

Even as Digital Realty deploys version 1.0 of the platform, a separate development team continues to refine the user interface with the addition of reports, dashboards and other functions and features. Version 2.0 of the platform is expected in early 2014, and will feature an entirely new user interface, with even more powerful dashboard and reporting capabilities, dynamically configurable views and enhanced IT asset management capabilities.

The project has been daunting, but the team at Digital Realty believes the rollout of the EnVision DCIM platform will set a new standard of operational transparency, further bridging the gap between facilities and IT systems and allowing operators to drive performance into every aspect of a data center operation.


David Schirmacher

David Schirmacher is senior vice president of Portfolio Operations at Digital Realty, where he is responsible for overseeing the company’s global property operations as well as technical operations, customer service and security functions. He joined Digital Realty in January 2012. His more than 30 years of relevant experience includes turns as principal and Chief Strategy Officer for FieldView Solutions, where he focused on driving data center operational performance; and vice president, global head of Engineering for Goldman Sachs, where he focused on developing data center strategy and IT infrastructure for the company’s headquarters, trading floor, branch offices and data center facilities around the world. Mr. Schirmacher also held senior executive and technical positions at Compass Management and Leasing and Jones Lang LaSalle. Considered a thought leader within the data center industry, Mr. Schirmacher is president of 7×24 Exchange International, and he has served on the technical advisory board of Mission Critical.

 

Thinking Ahead Can Prevent the Mid-Life Energy Crisis in Data Centers

Turning back the clock at Barclays Americas’ data centers

By Jan van Halem and Frances Cabrera

Barclays is a major global financial services provider engaged in personal banking, credit cards, corporate and investment banking, and wealth and investment management with an extensive international presence in Europe, the Americas, Africa, and Asia. Barclays has two major data centers in the northeastern United States with production environments that support the Americas region’s operations (see Figure 1). Barclays Corporate Real Estate Solutions (CRES) Engineering team manages the data centers, in close partnership with the Global Technology and Information Services (GTIS) team.

Both of Barclays Americas’ data centers are approaching 7 years old, but they are not showing their age, at least not energy-wise. For the last 3 years, Barclays Americas’ engineering and IT teams have pursued an energy-efficiency program to ensure that the company’s data center portfolio continues operating even more efficiently than when it was originally commissioned.

Figure 1. Comparison of DC 1 and DC 2

By 2013, Barclays Americas’ two data centers had reduced their energy consumption by 8,000 megawatt-hours (MWh), or 8%, which equates to 3,700 tons of carbon emissions avoided. In addition, the power usage effectiveness (PUE) of the largest data center dropped from an annual average of 1.63 to 1.54, earning it an Energy Star certification for 2013.

The Barclays Americas team pinpointed the following strategies for a three-pronged attack on energy inefficiency:

  • Airflow management
  • Enhancement of cooling operations
  • Variable frequency drive (VFD) installations on computer room air
    conditioning (CRAC) units and pumps

The goals were to:

  • Reduce total cooling
  • Enhance heat transfer efficiencies
  • Reduce cooling losses (i.e., short cycling of air)

Figure 2. Summary of initiatives implemented across the two data centers: 2010-present.

The team found considerable savings without having to make large capital investments or use complex solutions that would threaten the live data center environment (see Figure 2). The savings opportunities were all identified, tested, implemented, and validated using in-house resources with collaboration between the engineering and IT teams.

Background

In 2010, Barclays launched a Climate Action Programme, which has since been expanded into its 2015 Citizenship Plan. The program included a carbon reduction target of 4% by the end of 2013 from 2010 levels. The target spurred action among the regions to identify and implement energy conservation measures and created a favorable culture for the funding of initiatives. Due to energy initiatives like the data center programs put in place in the Americas, Barclays reduced its carbon emissions globally by 12% in 2012, meeting the target two years early. Barclays is now setting its sights on a further carbon reduction of 10% by 2015.

From the beginning, it was understood that any new data center energy initiatives in the Americas must take into account both IT and building operation considerations. CRES and GTIS worked together on research, testing, and mocking up these initiatives. Their findings were shared with the central teams at headquarters so that case studies could be developed to ensure that the successes can be used at other sites around the world.

Blowing Back the Years

Initially, the team focused on airflow management. The team successfully reduced CRAC unit fan speeds at DC 1 from 70% to 40% by raising temperature set points as recommended by ASHRAE TC9.9, replacing unneeded perforated tiles with solid tiles, and installing Cold Aisle containment. All these changes were implemented with no impact to equipment operation.

The first step, and the easiest, was to evaluate the initial fan speed of the CRAC units at DC 1 and reduce it from 70% to 58%, with no effect on performance. When the DCs were commissioned, they briefly operated at 55°F (13°C), which was quickly increased to 65°F (18°C). By 2010, the temperature set points in Cold Aisles had been raised from 65°F (18°C) to 75°F (24°C) in both DC 1 and DC 2, as recommended by the updated ASHRAE standard for data center cooling.

Next, Barclays’ teams turned to the perforated tiles. Unneeded perforated tiles in DC 1 were replaced with solid tiles, which reduced CRAC fan speeds from 58% to 50% with no impact on Cold Aisle temperature conditions or equipment operation.

Finally, in 2011, the PDU aisles were retrofitted with containment doors, which further improved airflow efficiency. After the site teams researched various options and recommended a commercial product, an in-house crew completed the installation. The team opted to use in-house personnel to avoid having external contractors working in a live environment, and the IT operations team felt comfortable having trusted engineering staff in the server rooms. The engineering staff used its expertise and common sense to install the doors with no disturbance to the end users.

Figure 3. Energy savings attributed to cooling enhancements

The team chose not to extend the containment above the aisles. Using data from wireless temperature and humidity sensors located throughout the aisles, the team found that it could still achieve ~80% of the projected energy savings without extending the containment to the ceiling, avoiding the additional cost and potential issues with local fire codes.

With the doors installed, the teams continue monitoring temperatures through the sensors and can control airflow by adjusting CRAC fan speeds, which regulates the amount of supply air in the Cold Aisle to minimize bypass airflow and ensure efficient air distribution. As a result, return air temperature increased, driving efficiency and allowing further fan speed reductions from 50% to 40%. Again, these reductions were achieved with no impact to operating conditions. Based on these successes, Cold Aisle containment was installed at DC 2 in 2014.
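The logic behind these adjustments can be illustrated with a simplified sketch; the thresholds, step size, and function names below are hypothetical, not Barclays’ actual control sequence:

    # Simplified sketch of sensor-driven fan-speed adjustment: trim CRAC fan
    # speed toward a minimum while Cold Aisle temperatures stay within the
    # target band. Thresholds and step size are illustrative placeholders.

    TARGET_MAX_F = 75.0   # Cold Aisle set point (ASHRAE-recommended upper range)
    MIN_SPEED = 40        # % - floor reached at DC 1
    MAX_SPEED = 70        # % - original commissioned speed
    STEP = 2              # % change per adjustment cycle

    def next_fan_speed(current_speed, cold_aisle_temps_f):
        """Return the new fan speed (%) given current Cold Aisle readings."""
        hottest = max(cold_aisle_temps_f)
        if hottest > TARGET_MAX_F:
            # Too warm: add supply air.
            return min(MAX_SPEED, current_speed + STEP)
        if hottest < TARGET_MAX_F - 2.0:
            # Comfortable margin: reduce bypass air and fan power.
            return max(MIN_SPEED, current_speed - STEP)
        return current_speed

    print(next_fan_speed(50, [71.5, 72.0, 70.8]))  # -> 48, stepping toward 40%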

Figure 4. Available free cooling at DC 1 and DC 2

A curve based on manufacturer’s data and direct power readings for the CRAC units enabled the teams to calculate the power reductions associated with reductions in volume flow rate. They calculated that the airflow initiatives save 3.6 gigawatt-hours (GWh) of energy annually across DC 1 and DC 2 (see Figure 3).
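As an illustration of the calculation method, the sketch below interpolates a power-versus-fan-speed curve from a handful of measured points and scales the per-unit reduction up to a fleet; the sample readings, fleet size, and resulting figure are placeholders, not the actual manufacturer’s data or Barclays’ results:

    # Sketch of estimating fan power savings from a measured power-vs-speed
    # curve. The sample points and fleet size are illustrative placeholders.

    measured = [           # (fan speed %, measured kW per CRAC unit)
        (40, 1.1),
        (50, 2.1),
        (58, 3.2),
        (70, 5.6),
    ]

    def power_at(speed):
        """Linearly interpolate unit power (kW) at a given fan speed (%)."""
        pts = sorted(measured)
        for (s0, p0), (s1, p1) in zip(pts, pts[1:]):
            if s0 <= speed <= s1:
                return p0 + (p1 - p0) * (speed - s0) / (s1 - s0)
        raise ValueError("speed outside measured range")

    units, hours_per_year = 30, 8760          # illustrative fleet size
    saved_kw = power_at(70) - power_at(40)    # per-unit reduction
    print(f"~{saved_kw * units * hours_per_year / 1e6:.1f} GWh/yr")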

Staying Cool

DC 1’s primary cooling system is a water-cooled plant consisting of two 1,250-ton chillers and five cooling towers. DC 1 was designed to utilize free cooling technology that feeds condenser water from cooling towers into a heat exchanger. The heat exchanger then cools the building’s chilled water. In 2011, the CRES Engineering team embarked on an initiative to maximize the use of free cooling.

After reviewing operational parameters such as temperature, water flow, and cooling tower fan speeds, the Barclays team made the following adjustments:

  • Adding heat exchanger units in series to slow down the water pumped through them. Passing
    the water through two units instead of one increased the efficiency of the heat exchange.
  • Running two cooling tower fans at half the speed of one fan to reduce power demand, based on
    analysis of data from an electrical power monitoring system (EPMS). As a result, the volume of
    condenser water was divided among multiple cooling tower units instead of all going into one.
  • Increasing the chilled water temperature by 4°F (2°C) from 2011 to today, which expanded the period
    of time that free cooling is possible (see Figure 4).

Of the three strategies, it is hardest to directly measure and attribute energy savings to enhancing cooling operations, as the changes impact several parts of the cooling plant. Barclays used the EPMS to track power readings throughout the systems, particularly the cooling tower units. The EPMS enables PUE trending and shows the reduction of DC 1’s PUE over time. Since 2011, it has dropped 5% to an annual average of 1.54.

Driving Back Inefficiency

Figure 5. Savings attributed to VFDs

In 2013 the teams began an intensive review of VFD technology. They found that considerable energy savings could be obtained by installing VFDs on several pieces of equipment, such as air handlers in plant rooms and condenser water pumps, in both DC 1 and DC 2. The VFDs control the speed of an existing AC induction motor. By reducing the speed on air handlers, the unit load can be adjusted to the existing heat load (see Figure 5).

The team focused on 36 30-ton units throughout DC 1 that would yield positive energy and cost savings. Utility rebates further enhanced the business case. The VFDs were installed toward the end of 2013, and the business case was applied to DC 2 for further installations in 2014.

Figure 6. Savings attributed to frequency reductions on AC motors

To calculate savings, direct power readings were taken at the CRAC units at intervals of 2 hertz (Hz) from 60 Hz to 30 Hz. As shown in Figure 6, reducing CRAC frequency from 60 Hz to 56 Hz reduced power demand by 19%. In addition, the fan motor releases less heat to the air, further reducing the cooling load.
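That figure is consistent with the fan affinity laws, under which fan power scales roughly with the cube of speed (here, drive frequency); a quick check, sketched below:

    # Quick check of the reported figure against the fan affinity laws:
    # fan power scales roughly with the cube of speed (drive frequency).

    def power_fraction(new_hz, base_hz=60.0):
        """Approximate fraction of baseline fan power at a reduced frequency."""
        return (new_hz / base_hz) ** 3

    reduction = 1 - power_fraction(56)
    print(f"{reduction:.0%}")   # ~19%, in line with the measured readings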

Additional maintenance cost savings are achieved through the extension of the filter replacement cycle. Because the VFDs move less air through the system, the life span of the system filters increases. Once fully implemented, the VFD installations will save over 3.9 GWh of energy.

Data as the Fountain of Youth

Comprehensive monitoring systems in place across the two data centers provided data accessible by both the GTIS and CRES teams, enabling them to make the best, data-driven decisions. The sites’ EPMS and branch circuit monitoring system (BCMS) enable the teams to pinpoint the areas with the greatest energy-saving potential and to work together to trial and implement initiatives.

Barclays uses the EPMS as a tool to monitor, measure, and optimize the performance of the electrical loads. It monitors critical systems such as HVAC equipment, electrical switchgear, etc. A dashboard displays trend data. For example, the team can trend the total UPS load and the total building load, which yields PUE, on a continuous, real-time basis.
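A minimal sketch of the calculation behind that trend, assuming the EPMS exposes total building load and total UPS (IT) load (the readings below are placeholders, not actual EPMS data):

    # Minimal sketch of the PUE trend the dashboard displays:
    # PUE = total facility (building) power / IT (UPS output) power.
    # The readings below are placeholders, not actual EPMS data.

    readings = [                 # (total building kW, total UPS load kW)
        (6160, 4000),
        (6130, 4010),
        (6090, 3995),
    ]

    pue_trend = [building / it for building, it in readings]
    print([round(p, 2) for p in pue_trend])   # e.g., [1.54, 1.53, 1.52]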

In addition to the EPMS, the CRES and GTIS teams also use the BCMS to track energy use by server cabinet, remote power panel, and power distribution unit. This system is used for capacity planning and load balancing. In addition, the monitored cabinet loads are used to optimize the airflow in the Cold Aisles.

Conclusion

With the right level of collaboration between CRES and GTIS, the right data, and the right corporate environmental targets, the Barclays Americas team was able to find energy and cost savings in its data centers. By executing airflow management, enhanced cooling, and VFD strategies in the existing data centers, the team applied the latest standards and best practices to keep energy consumption at levels typical of new data centers. At 8 GWh lighter, with an Energy Star certification and a PUE that keeps dropping, these data centers are not showing their energy age.


 

Jan van Halem is vice president, Data Center Engineering at Barclays. He joined Barclays’ engineering team in 2004. Mr. van Halem has more than 20 years of experience and a strong knowledge of critical data center systems and mechanical operations. At Barclays, he is responsible for the mechanical and electrical systems of the company’s major data centers in the Americas. In addition, he has provided engineering design for new construction and expansion projects in the region.

Before joining Barclays, Mr. van Halem was with real estate organizations and held positions in facility management, construction management, and project management. Mr. van Halem has a BS degree in Marine Engineering from the Maritime Institute De Ruyter in Flushing, the Netherlands. He served as a marine engineer for 8 years in the Dutch Merchant Marine.

Frances Cabrera, LEED AP, is vice president, Environmental Management at Barclays. She joined Barclays’ environmental management team in 2011. Ms. Cabrera oversees Barclays Americas’ environmental programs, both resource saving and compliance, to support the region’s ISO-certified management system. With the collaboration of the corporate real estate team, the region has achieved multiple LEED and Energy Star certifications. She is also part of the firm’s global center of excellence for environment, where she works across regions to measure and support the firm’s green IT achievements.

Before joining Barclays, Ms. Cabrera ran the ISO 14001 systems in North and South America for Canon USA and worked at various manufacturing companies in Rochester, NY, integrating environmental considerations into their operations and meeting regulations. Ms. Cabrera has a BS degree in Environmental Technology and an MS degree in Environmental, Health, and Safety Management from the Rochester Institute of Technology.

Close Coupled Cooling and Reliability

Achieving Concurrently Maintainable and Fault Tolerant cooling using various close coupled cooling technologies

By Matt Mescall

Early mainframe computers were cooled by water at the chip level. Then, as computing moved to the distributed server model, air replaced water. Typically, data centers included perimeter computer room air conditioning (CRAC) units to supply cold air to a raised floor plenum and perforated floor tiles to deliver it to IT equipment. These CRAC units were either direct-expansion (DX) or chilled-water units (for simplicity, CRAC will be used to refer to either kind of unit). This arrangement worked for the past few decades while data centers were primarily occupied by low-density IT equipment (< 2-4 kilowatts [kW] per rack). However, as high-density racks become more common, CRAC units and a raised floor may not provide adequate cooling.

To address these situations, data center cooling vendors developed close coupled cooling (CCC). CCC technology includes in-row, in-rack, above-rack, and rear-door heat exchanger (RDHx) systems. Manufacturers typically recommend the use of a Cold Aisle/Hot Aisle arrangement for greater efficiency, which is a best practice for all data center operations. As rack density increased due to IT consolidation and virtualization, CCC moved from being a solution to an unusual cooling situation to being the preferred cooling method. Implemented properly, a CCC solution can meet the Concurrently Maintainable and Fault Tolerant requirements of a data center.

This paper assumes that, while an air handler may provide humidity control, the close coupled cooling solution provides the only cooling for the IT equipment in the data center. Additionally, it is assumed that the reader understands how to design a direct-expansion or chilled-water CRAC-based cooling system to meet Concurrent Maintainability and Fault Tolerant requirements. This paper does not address Concurrent Maintainability and Fault Tolerant requirements for a central cooling plant, only the CCC system in the data hall.

Meeting Concurrently Maintainable and Fault Tolerant Requirements

First, let’s clarify what is required for a Concurrently Maintainable (Tier III) and a Fault Tolerant (Tier IV) cooling system. This discussion is not a comprehensive description of all Concurrently Maintainable and Fault Tolerant requirements, but it provides the basis for the rest of the discussion in this paper.

A Concurrently Maintainable system must have redundant capacity components and independent distribution paths, which means that each and every capacity component and distribution path element can be taken out of service for maintenance, repair, or replacement without impacting the critical environment.

To meet this requirement, the system must have dry pipes (no flowing or pressurized liquid) to prevent liquid spills when maintaining pipes, joints, and valves. Draining a pipe while it is disassembled is allowed, but hot tapping and pipe freezing are not. A Fault Tolerant cooling system may look like a Concurrently Maintainable system, but it must also autonomously respond to failures, including Continuous Cooling, and compartmentalize the chilled-water and/or refrigerant pipes outside the room of use (typically the computer room).

There are several different types and configurations of CCC. For simplicity, this paper will break them into two groups: in-row and above-row units, and RDHx units. While there are other CCC solutions available, the same concepts can be used to provide a Concurrently Maintainable or Fault Tolerant design.

In-row and above-row CCC

When data center owners have a business requirement for a high density data center to be Concurrently Maintainable or Fault Tolerant, a CCC design poses special circumstances that do not exist with room-based cooling. First, airflow must be considered. A CRAC-unit-based cooling design that is Concurrently Maintainable or Fault Tolerant has N+R cooling units that provide cooling to the whole room. When a redundant unit is off for maintenance or suffers a fault, the IT equipment still receives cooling from the remaining CRAC units via the perforated tiles in the Cold Aisle. The cooling in any Cold Aisle is not affected when the redundant unit is offline. This arrangement allows for one or two redundant CRAC units in an entire room (see Figure 1).

Figure 1. CCC airflow considerations

CCC provides cooling to the specific Cold Aisle where the unit is located. In other words, CCC units cannot provide cooling to different Cold Aisles the way CRAC units can. Accordingly, the redundant CCC unit must be located in the aisle where the cooling is required. In addition to having sufficient redundant cooling in every Cold Aisle, the distance from the cooling unit to the IT equipment must also be considered. In-row and above-row cooling units typically can provide cold air for only a limited distance. The design must take into account the worst-case scenario during maintenance or a failure event.

After considering the number of units and their location in the Cold Aisle, the design team must consider the method of cooling, which may be air-cooled direct expansion (DX), chilled water, or pumped refrigerant. Air-cooled DX units are typically matched with their own condenser units. Other than proper routing, piping for air-cooled DX units requires no special considerations.

Piping to chilled-water units is either traditional chilled-water piping or a cooling distribution unit (CDU). In the former method chilled water is piped directly to CCC units, similar to CRAC units. In this case, chilled-water piping systems are designed to be Concurrently Maintainable or Fault Tolerant in the same way as single-coil, room-based CRAC units.

The latter method, which uses CDUs, poses a number of special considerations. Again, chilled-water piping to a CDU and to single-coil, room-based CRAC units is designed to be Concurrently Maintainable or Fault Tolerant in the same way. However, designers must consider the impact to each Cold Aisle when a CDU is removed from service or suffers a fault.

If any single CDU provides cooling to more than the redundant number of cooling units in any aisle, the design is not Concurrently Maintainable or Fault Tolerant. When CDUs are located outside of the server room or data hall in a Fault Tolerant design, they must be properly compartmentalized so that a single event does not remove more than the redundant number of cooling units from service. A Fault Tolerant system also requires Continuous Cooling, the ability to detect, isolate, and contain a fault, and sustain operations. In a CCC system that rejects heat to a chilled-water system, the mechanical part of Continuous Cooling can be met with an appropriate thermal storage tank system that is part of a central plant.
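One way to express that rule is as a simple design check: for each Cold Aisle, losing any single CDU must not remove more than the aisle’s redundant number of cooling units. The sketch below uses a hypothetical topology, not a specific Tier-certified design:

    # Hypothetical check: a design fails Concurrent Maintainability / Fault
    # Tolerance if losing any one CDU removes more than the redundant number
    # of CCC units from any Cold Aisle. Topology below is illustrative only.

    from collections import Counter

    # Cold Aisle -> redundant units available in that aisle (the "R" in N+R)
    redundancy = {"aisle-1": 1, "aisle-2": 1}

    # (aisle, CCC unit, CDU feeding it)
    units = [
        ("aisle-1", "ccc-1", "cdu-A"),
        ("aisle-1", "ccc-2", "cdu-B"),
        ("aisle-1", "ccc-3", "cdu-A"),
        ("aisle-2", "ccc-4", "cdu-A"),
        ("aisle-2", "ccc-5", "cdu-B"),
    ]

    def check_cdu_redundancy():
        failures = []
        per_aisle_cdu = Counter((aisle, cdu) for aisle, _, cdu in units)
        for (aisle, cdu), count in per_aisle_cdu.items():
            if count > redundancy[aisle]:
                failures.append(f"losing {cdu} drops {count} units in {aisle}")
        return failures

    print(check_cdu_redundancy())
    # ['losing cdu-A drops 2 units in aisle-1'] -> not Concurrently Maintainable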

A CCC system that rejects heat to outside air via refrigerant and a condenser will likely rely on uninterrupted power to provide Continuous Cooling, which will be discussed in the following paragraphs.

Some CCC systems use pumped refrigerant. These systems transfer heat from pumped refrigerant to a building’s chilled-water system, a glycol system, or an external condenser unit.

Due to the similarities between chilled-water and glycol systems with respect to the piping headers, glycol and chilled-water systems will be treated the same for the purposes of this paper. The heat transfer occurs at an in-room chiller or heat exchanger that, for the purposes of this discussion, is similar to a CDU. The Concurrently Maintainable and Fault Tolerant design considerations for a pumped refrigerant system are the same as for a chilled-water system that uses a CDU.

The system that powers all CCC components must be designed to ensure that the electrical system does not defeat the Concurrent Maintainability or Fault Tolerance of the mechanical system. In the electrical design for a Concurrently Maintainable mechanical system, no more than the redundant number of cooling units may be removed from service when any part of the electrical system is removed from service in a planned manner. This requirement includes the cooling within any aisle, not just the room as a whole. Designing the CCC units and the associated CDUs, in-room chillers, or heat exchangers in a 2N configuration greatly simplifies the electrical distribution.

Providing an A feed to half of the units and a B feed to the other half, while paying attention to the distribution of the CCC units, will typically yield a Concurrently Maintainable electrical design.

If the cooling system is in an N+R configuration, the distribution of the power sources will require special coordination. Typically, the units will be dual fed, which can be accomplished by utilizing an internal transfer switch in the units, an external manual transfer switch, or an external automatic transfer switch. This requirement applies to all components of the CCC system that require power to cool the critical space, including the in-row and above-row units, the in-room chillers, heat exchangers, and any power that is required for CDUs (see Figure 2).
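The same kind of check applies to the electrical side: losing either feed must leave at least N cooling units running in every aisle, with dual-fed units riding through on their transfer switches. A hypothetical sketch, with an illustrative topology rather than a specific design:

    # Hypothetical sketch: verify that losing either electrical feed leaves at
    # least N cooling units running in every Cold Aisle. Dual-fed units (via an
    # internal or external transfer switch) survive the loss of a single feed.

    # (aisle, unit, feeds that can power it)
    units = [
        ("aisle-1", "ccc-1", {"A"}),
        ("aisle-1", "ccc-2", {"B"}),
        ("aisle-1", "ccc-3", {"A", "B"}),   # dual fed via transfer switch
    ]

    required_n = {"aisle-1": 2}             # units needed to cool the aisle

    def surviving_units(lost_feed):
        """Count units per aisle still powered after losing one feed."""
        counts = {}
        for aisle, _, feeds in units:
            if feeds - {lost_feed}:          # any remaining feed keeps it running
                counts[aisle] = counts.get(aisle, 0) + 1
        return counts

    for feed in ("A", "B"):
        for aisle, n in required_n.items():
            ok = surviving_units(feed).get(aisle, 0) >= n
            print(f"lose feed {feed}: {aisle} {'OK' if ok else 'FAILS'}")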

Figure 2. CCC valve scenario

When any part of a Fault Tolerant electrical design for a mechanical system experiences a fault, no more than the redundant number of cooling units may be removed from service. The same Concurrently Maintainable concepts apply to a Fault Tolerant electrical system; however, all of the transfer switches must be automatic and cannot rely on human intervention to respond to a fault. Additionally, in order to provide Continuous Cooling, uninterruptible power must be provided for cooling fans, in-room chillers and heat exchangers, pumps, and CDUs. A CCC system that uses DX and condensers to reject heat to outside air will require uninterrupted power to all system components to achieve Continuous Cooling.

The controls for these systems must also be considered in the design and meet the appropriate Concurrent Maintainability and Fault Tolerant requirements.

RDHx

The requirements for a Concurrently Maintainable or Fault Tolerant RDHx cooling solution are similar to those for in-row cooling. The RDHx units typically use chilled water or a pumped refrigerant and CDUs, in-room chillers, or heat exchangers. These units need to meet all of the Concurrent Maintainability or Fault Tolerant requirements of in-row CCC units. Airflow when a door is removed from service, either for a planned event or due to a failure, is a major consideration. When an RDHx solution cools an entire data center, it may be configured in a front-to-back rack configuration. When one or more doors are removed from service, the affected racks will blow hot exhaust air into the racks behind them, which may cause them to overheat, depending on the heat load.

This configuration does not meet Concurrent Maintainability or Fault Tolerant requirements, which require that the cooling system provide N cooling to all critical equipment during a planned maintenance event or a failure. Placing the racks in a Cold Aisle/Hot Aisle configuration may not meet this requirement as exhaust air from the affected rack may circulate over its top from the Hot Aisle and overheat the servers at the top of the rack and possibly adjacent racks. The same airflow issue is possible for racks placed at the end of rows when their RDHx is not working.

Summary

Using CCC as the only form of cooling in the data center is becoming more common. CCC poses additional challenges in meeting Concurrent Maintainability and Fault Tolerant requirements beyond those typically experienced with a CRAC-based cooling system. The challenges of different airflow compared to room-based CRAC units, and of ensuring that the consequential impact of maintenance and failures on the additional capacity components and distribution systems does not remove more than the redundant number of units from service, can be met with careful consideration when designing all parts of the CCC system.


 

Matt Mescall

Matthew Mescall, PE, is a senior consultant for Uptime Institute Professional Services and Tier Certification Authority, where he performs audits and provides strategic-level consulting and Tier Certification reviews. Mr. Mescall’s career in critical facilities spans 12 years and includes responsibilities in planning, engineering, design, construction, and operation. Before joining Uptime Institute, Mr. Mescall was with IBM, where he operated its Boulder, CO, data center and led a worldwide team analyzing best practices across IBM data centers to ensure consistent, cost-effective reliability. Mr. Mescall holds a BS degree in Civil Engineering from the University of Southern California, an MS in Civil Engineering from the Georgia Institute of Technology, and a Masters Certificate in Project Management from George Washington University.

 

Annual Data Center Industry Survey 2014

The fourth annual Uptime Institute Data Center Industry Survey provides an overview of global industry trends by surveying 1,000 data center operators and IT practitioners. Uptime Institute collected responses via email February through April 2014 and presented preliminary results in May 2014 at the 9th Uptime Institute Symposium: Empowering the Data Center Professional.
