Improving Performance in Ever-Changing Mission-Critical IT Infrastructures

CenturyLink incorporates lessons learned and best practices for high reliability and energy efficiency.

By Alan Lachapelle

CenturyLink Technology Solutions and its antecedents (Exodus, Cable and Wireless, Qwest, and Savvis) have a long tradition of building mission critical data centers. With the advent of its Internet Data Centers in the mid-1990s, Exodus broke new ground by building facilities at unprecedented scale. Even today, legacy Exodus data centers are among the largest, highest capacity, and most robust data centers in CenturyLink’s portfolio, which the company uses to deliver innovative managed services for global businesses on virtual, dedicated, and colocation platforms (see Executive Perspectives on the Colocation and Wholesale Markets, p.51).

Through the years CenturyLink has seen significant advances not only in IT technology, but in mission-critical IT infrastructures as well; adapting to and capitalizing on those advances have been critical to the company’s success.

Applying new technologies and honing best-practice facility design standards is an ongoing process. But the best technology and design alone will not deliver the efficient, high-quality data center that CenturyLink’s customers demand. It takes experienced, well-trained staff with a commitment to rigorous adherence to standards and methods to deliver on the promise of a well-designed and well-constructed facility. Specifically, that promise is to always be up and running, come what may, to be the “perfect data center.”

The Quest
As its build process matured, CenturyLink infrastructure began to take on a phased approach, pushing the envelope and leading the industry in effective deployment of capital for mission critical infrastructures. As new technologies developed, CenturyLink introduced them to the design. As the potential densities of customers’ IT infrastructure environments increased, so too did the densities planned into new data center builds. And as the customer base embraced new environmental guidelines, designs changed to more efficiently accommodate these emerging best practices.

Not many can claim a pedigree of 56 (and counting) unique data center builds, with the continuous innovation necessary to stay on top in an industry in which constant change is the norm. The demand for continuous innovation has inspired CenturyLink’s multi-decade quest for the perfect data center design model and process. We’re currently on our fourth generation of the perfect data center—and, of course, it certainly won’t be the last.

The design focus of the perfect data center has shifted many times.

Dramatically increasing the efficiency of white space in the data centers is likely the biggest such shift. Under the model in use in 2000, a 10-megawatt (MW) IT load may have required 150,000 square feet (ft²) of white space. Today, the same capacity requires only a third of the space. Better still, we have deployments of 1 MW in 2,500 ft², six times denser than the year-2000 design. Figure 1 shows the average densities in four recent customer installations.

Figure 1. The average densities for four recent customer installations show how significantly IT infrastructure density has risen in recent years.
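The white-space figures quoted above can be checked with simple arithmetic. The short Python sketch below uses only the numbers given in the text; the function and variable names are ours, introduced for illustration.

```python
# Power density implied by the white-space figures quoted in the text.
W_PER_MW = 1_000_000


def density_w_per_ft2(it_load_mw: float, white_space_ft2: float) -> float:
    """Return IT power density in watts per square foot."""
    return it_load_mw * W_PER_MW / white_space_ft2


year_2000 = density_w_per_ft2(10, 150_000)   # 10 MW in 150,000 ft2
today = density_w_per_ft2(10, 150_000 / 3)   # same load in a third of the space
densest = density_w_per_ft2(1, 2_500)        # 1 MW in 2,500 ft2

print(f"Year-2000 design: {year_2000:.0f} W/ft2")
print(f"Today:            {today:.0f} W/ft2")
print(f"Densest:          {densest:.0f} W/ft2 "
      f"({densest / year_2000:.1f}x the year-2000 design)")
```

The year-2000 model works out to roughly 67 W/ft2, today's typical design to 200 W/ft2, and the densest deployments to 400 W/ft2, which is the sixfold increase cited.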


Our data centers are rarely homogenous, so the designs need to be flexible enough to support multiple densities in the same footprint. A high-volume trading firm might sit next to a sophisticated 24/7 e-retailer, next to a disaster recovery site for a health-care provider with a tape robot. Building in flexibility is a hurdle all successful colocation providers must overcome to effectively address their clients’ varied needs.

Concurrent with differences in power density are different cooling needs. Being able to accommodate a wide range of densities efficiently, from the lowest (storage and backup) to the highest (high-frequency trading, bitcoin mining, etc.), is a chief concern. By harnessing the latest technologies (e.g., pumped-refrigerant economizers, water-cooled chillers, high-efficiency rooftop units), we match an efficient, flexible cooling solution to the climate, helping ensure our ability to deliver value while maximizing capital efficiency.

Mechanical systems are not alone in seeing significant technological development. Electrical infrastructures have changed at nearly the same pace. All iterations of our design have safely and reliably supplied customer loads, and we have led the way in developing many best practices. Today, we continue to minimize component count, increase the mean time between failures, and pursue high operating efficiency infrastructures. To this end, we employ the latest technologies, such as delta conversion UPS systems for high availability and Eco Mode UPS systems that actually have a higher availability than double-conversion UPS systems. We consistently re-evaluate existing technologies and test new ones, including Bloom Energy’s Bloom Box solid oxide fuel cell, which we are testing in our OC2 facility in Irvine, CA. Only once a new technology is proven and has shown a compelling advantage will we implement it more broadly.

All the improvements in electrical and mechanical efficiencies could scarcely be realized in real-world data centers if controls were overlooked. Each iteration of our control scheme is more robust than the last, thanks to a multi-disciplinary team of controls experts who have built fault tolerance into the control systems. The current design, honed through much iteration, allows components to function independently, if necessary, but generates significant benefit by networking them together, so that they can be controlled collaboratively to achieve optimal overall efficiency. To be clear, each piece of equipment is able to function solely on its own if it loses communication with the network, but by allowing components with part-load efficiencies to communicate with each other effectively, the system intelligently selects ideal operating points to ensure maximum overall efficiency.
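The fallback behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not CenturyLink's control code: the `NetworkCommand` type, the 30-second staleness limit, and the setpoint values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NetworkCommand:
    setpoint_c: float  # setpoint suggested by the plant-wide optimizer (hypothetical)
    age_s: float       # seconds since the command was received


def effective_setpoint(cmd: Optional[NetworkCommand],
                       local_default_c: float,
                       max_age_s: float = 30.0) -> float:
    """Follow the networked setpoint while it is fresh; otherwise run standalone."""
    if cmd is None or cmd.age_s > max_age_s:
        # Communication lost or stale: the unit controls itself locally.
        return local_default_c
    # Network healthy: follow the coordinated, efficiency-optimized setpoint.
    return cmd.setpoint_c
```

The point of the design is that the networked path only ever improves efficiency; losing it never prevents a unit from holding its required conditions.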

For example, the chilled-water pumping and staging software analyzes current chilled-water conditions (supply temperature, return temperature, and system flow) and chooses the appropriate number of chilled-water pumps and chillers to operate to minimize chiller plant energy consumption. To do this, the software evaluates current chiller performance against ambient temperature, load, and pumping efficiency. The entire system is simple enough to allow for effective troubleshooting and for each component to maintain required parameters under any circumstance, including failure of other components.
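A simplified version of this staging decision might look like the following sketch. The part-load efficiency curve, chiller capacity, and pump power are invented for illustration, and the sketch ignores ambient temperature, which the actual software takes into account.

```python
def chiller_kw_per_ton(load_fraction: float) -> float:
    """Hypothetical part-load curve: most efficient near 70% load."""
    return 0.50 + 0.60 * (load_fraction - 0.70) ** 2


def plant_kw(load_tons: float, n_chillers: int,
             capacity_tons: float = 500.0, pump_kw_each: float = 15.0) -> float:
    """Estimated plant power with the load shared evenly across n chillers,
    each paired with one chilled-water pump (assumed values)."""
    load_fraction = load_tons / (n_chillers * capacity_tons)
    return load_tons * chiller_kw_per_ton(load_fraction) + n_chillers * pump_kw_each


def best_stage(load_tons: float, installed: int = 4,
               capacity_tons: float = 500.0) -> int:
    """Pick the number of running chillers that minimizes estimated plant power."""
    candidates = [n for n in range(1, installed + 1)
                  if load_tons <= n * capacity_tons]
    if not candidates:
        raise ValueError("load exceeds installed chiller capacity")
    return min(candidates, key=lambda n: plant_kw(load_tons, n, capacity_tons))
```

With these assumed curves, a 700-ton load is best served by two chillers: one cannot carry the load, and a third would push every machine away from its efficient operating range while adding pump energy.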

Finally, our commissioning process has grown and matured. Learning lessons from past commissioning procedures, as well as from the industry as a whole, has made the current process increasingly rigorous. Today, simulations used to test new data centers before they come on-line closely represent actual conditions in our other facilities. A thorough commissioning process has helped us ensure our buildings are turned over to operators as efficient, reliable, and easy to operate as our designs intended.

Design and Construction Certification
As the data center industry has matured, among the things that became clear to CenturyLink was the value of Tier Certification. The Uptime Institute’s Tier Standard: Topology makes the benchmark for performance clear and attainable. While our facilities have always been resilient and maintainable, CenturyLink’s partnership with the Uptime Institute to certify our designs to its well-known and recognizable standards creates customer certainty.

CenturyLink currently has five Uptime Institute Tier III Certified Facilities in Minneapolis, MN; Chicago, IL; Toronto, ON; Orange County, CA; and Hong Kong, with a sixth underway. By having our facilities Tier Certified, we do more than simply show commitment to transparency in design. Customers who can’t view and participate in commissioning of facilities can rest assured knowing the Uptime Institute has Certified these facilities as Concurrently Maintainable. We invite comparison to other providers and know that our commitments will provide value for our customers in the long run.

Application of Design to Existing Data Centers
Our build team uses data center expansions to improve the capacity, efficiency, and reliability of existing data centers. This includes (but is not limited to) optimizing power distribution, aligning cooling infrastructure to utilize ASHRAE guidance, or upgrading controls to increase reliability and efficiency.

Meanwhile, our operations engineers continuously implement best practices and leading-edge technologies to improve energy efficiency, capacity, and reliability of data center facilities. Portfolio wide, the engineering team has enhanced control sequences for cooling systems, implemented electronically commutated (EC) and variable frequency drive (VFD) fans, and deployed Cold Aisle/Hot Aisle containment. These best practices serve to increase total cooling capacity and efficiency, ensuring customer server inlet conditions are homogenous and within tolerance. Figure 2 shows the total impact of all such design improvements on our company’s aggregate Power Usage Effectiveness (PUE). Working hand-in-hand with the build group, CenturyLink’s operations engineers ensure continuous improvement in perfect data center design, enhancing some areas while eliminating unneeded and unused features and functions, often based on feedback from customers.

Figure 2. As designs improve over time, CenturyLink implements best practices and lessons learned into its existing portfolio to continuously improve its aggregate PUE. It’s worth noting that the PUEs shown here include considerable office space as well as extra resiliency built into unutilized capacity.
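For readers less familiar with the metric, PUE is total facility energy divided by IT equipment energy. One reasonable way to compute a portfolio-level figure is as the ratio of summed energies rather than an average of site PUEs. The sketch below assumes that approach; the article does not state how the aggregate in Figure 2 is calculated, and the site figures are invented.

```python
from typing import Iterable, Tuple


def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    return total_facility_kwh / it_kwh


def aggregate_pue(sites: Iterable[Tuple[float, float]]) -> float:
    """Energy-weighted PUE over (total_facility_kwh, it_kwh) pairs."""
    sites = list(sites)
    total = sum(t for t, _ in sites)
    it = sum(i for _, i in sites)
    return total / it


# Hypothetical monthly energy for two sites: (total facility kWh, IT kWh).
portfolio = [(1_500_000, 1_000_000), (2_600_000, 2_000_000)]
print(f"Aggregate PUE: {aggregate_pue(portfolio):.2f}")
```

Weighting by energy keeps a small, inefficient site from distorting the portfolio figure as much as a simple average of site PUEs would.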


Staffing Models
CenturyLink relies on expert data center facilities and operations teams to respond swiftly and intelligently to incidents. Systems are designed to withstand failures, but it is the facilities team that promptly corrects failures, maintaining equipment at a high level of availability that continually assures Fault Tolerance.

Each facility hosts its own Facility and Operations teams. The Facility team consists of a facility manager, a lead mechanical engineer, a lead electrical engineer, and a team of facility technicians. They maintain the building and its electrical and mechanical systems. They are experts on their locations. They ensure equipment is maintained in accordance with CenturyLink’s maintenance standards, respond to incidents, and create detailed Methods of Procedure (MOPs) for all actions and activities. They also are responsible for provisioning new customers and maintaining facility capacity.

The Operations team consists of the operations manager, operations lead, and several operations technicians. This group staffs the center 24/7, providing our colocation customers with CenturyLink’s “Gold Support” (remote hands) for their environment so that they don’t need to dispatch someone to the data center. This team also handles structured cabling and interconnections.

Regional directors and regional engineers supplement the location teams. The regional directors and engineers serve as subject matter experts (SMEs) but, when required, can also marshal the resources of CenturyLink’s entire organization to rapidly and effectively resolve issues and ensure potential problems are addressed on a portfolio-wide basis.

The regional teams work as peers, providing each individual team member’s expertise when and where appropriate, including to operations teams outside their region when needed. Collaborating on projects and objectives, this team ensures the highest standards of excellence are consistently maintained across a wide portfolio. Thus, regional engineers and directors engage in trusted and familiar relationships with site teams, while ensuring the effective exchange of information and learning across the global footprint.

Global Standard Process Model
A well-designed, -constructed, and -staffed data center is not enough to ensure a superior level of availability. The facility is only as good as the methods and procedures that are used to operate it. A culture that embraces process is also essential in operating a data center efficiently and delivering the reliability necessary to support the world’s most demanding businesses.

Uptime and latency are primary concerns for CenturyLink. The CenturyLink brand depends upon a sustained track record of excellence. Maintaining consistent reliability and availability across a varied and changing footprint requires an intensive and dynamic facilities management program encompassing an uncompromisingly rigid adherence to well-planned standards. These standards have been modeled in the IT Infrastructure Library spirit and are the result of years of planning, consideration, and trial and error. Adherence further requires close monitoring of many critical metrics, which is facilitated by the dashboard shown in Figure 3.

Figure 3. CenturyLink developed a dynamic dashboard that tracks and trends important data: site capacity, PUE, available raised floor space, operational costs, abnormal Incidents, uptime metrics, and much more to provide a central source of up-to-date information for all levels of the organization.


Early in the development of Savvis as a company, management established many of the organizational structures that exist today in CenturyLink Technology Solutions. CenturyLink experienced growth in many avenues; as it serviced increasingly demanding customers (in increasing volume), these structures continued to evolve to suit the ever-changing needs of the company and its customers.

First, the management team developed a staff capable of administering the many programs that would be required to maintain the standard of excellence demanded by the industry. Savvis developed models analyzing labor and maintenance requirements across the company and determined the most appropriate places to invest in personnel. Training was emphasized, and teams of SMEs were developed to implement the more detailed aspects of facilities operations initiatives. The management team is centralized, in a sense, because it is one global organization; this enhances the objective of global standardization. Yet the team is geographically diverse, subdivided into teams dedicated to each site and regional teams working with multiple sites, ensuring that all standards are applied globally throughout the company. And all teams contribute to the ongoing evolution of those standards and practices—for example, participating in two global conference calls per week.

Next, it was important to set up protocols to handle and resolve issues as they developed, inform customers of any impact, and help customers respond to and manage situations. No process guarantees that situations will play out exactly as anticipated, so a protocol to handle unexpected events was crucial. This process relied on an escalation schedule that brought decisions through the SMEs for guidance and gave decision makers the proper tools for decision making and risk mitigation. In parallel, a process was developed to ensure that any incident affecting customers triggered notifications to those customers so they could prepare for or mitigate the impact of an event.

A tracking system accomplished many things. For example, it ensured follow up on items that might create problems in the future, identified similar scenarios or locations where a common problem might recur, established a review and training process to prevent future incidents through operator education, justified necessary improvements in systems creating problems, and tracked performance over longer periods to analyze success in implementation and evaluate need for plan improvement. The tracking system is inclusive of all types of problems, including those related to internal equipment, employees, and vendors.

Data centers, being dynamic, require frequent change. Yet unmanaged change can present a significant threat to business continuity. Congruent with the other programs, CenturyLink set up a Change Management program. This program tracked changes, their impacts, and their completion. It ensured that risks were understood and planned for and that unnecessary risks were not taken.

Any request for change, either internal or from a customer, must go through the Change Management process and be evaluated on metrics for risk. These metrics determine the level of controls associated with that work and what approvals are required. The key risk factors considered in this analysis include the possible number of customers impacted, likelihood of impact, and level of impact. Even more importantly, the process evaluates the risk of not completing a task and balances these factors. The Change Management program and standardization of risk analysis necessitated standardizing maintenance procedures and protocols as well.
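To make the balancing of these factors concrete, the sketch below scores a change request on the three risk factors named above and weighs the result against the risk of not doing the work. The 1-to-5 scales, the multiplicative score, the thresholds, and the approval levels are all hypothetical; the article does not describe CenturyLink's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    customers_impacted: int  # hypothetical 1-5 scale (1 = one customer, 5 = whole site)
    likelihood: int          # hypothetical 1-5 scale for likelihood of impact
    impact_level: int        # hypothetical 1-5 scale for severity of impact
    deferral_risk: int       # risk of NOT doing the work, on the same 1-125 scale as the score


def risk_score(req: ChangeRequest) -> int:
    """Combine the three risk factors into one score (1-125)."""
    return req.customers_impacted * req.likelihood * req.impact_level


def required_approval(req: ChangeRequest) -> str:
    """Map the score to a level of control (illustrative thresholds and roles)."""
    score = risk_score(req)
    if score >= 60:
        return "regional director"
    if score >= 20:
        return "facility manager"
    return "peer review"


def net_risk(req: ChangeRequest) -> int:
    """Positive: the change adds risk. Negative: deferring it is riskier."""
    return risk_score(req) - req.deferral_risk
```

For example, `ChangeRequest(4, 2, 3, 40)` scores 24, which would need facility manager approval under these thresholds, while its negative net risk indicates that postponing the work carries more risk than performing it.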

Standards, policies, and best practices were established, documented, and reviewed by management. These create the operating frameworks for implementing IT Infrastructure Library (ITIL) methodology, enabling CenturyLink to incorporate industry best practices and standards, as well as develop internal operating best practices, all of which maximize uptime and resource utilization.

A rigid document-control program was established utilizing peer review, and all activities or actions performed were scripted, reviewed, and approved. Peer review also contributed to personnel training, ensuring that as documentation was developed, peers collaborated and maintained expertise on the affected systems. Document standardization was extended to casualty response as well. Even responding to failures requires use of approved procedures, and the response to every alarm or failure is scripted so the team can respond in a manner minimizing risk. In other words, there is a scripted procedure even for dealing with things we’ve never before encountered. This document control program and standardization has enabled personnel to easily support other facilities during periods of heightened risk, without requiring significant training for staff to become familiar with the facilities receiving the additional support.

Conclusion
All the factors described in this paper combine to allow CenturyLink to operate a mission-critical business on a grand scale, with uniform operational excellence. Without this framework in place, CenturyLink would not be able to maintain the high availability on which its reputation is built. Managing these factors while continuing to grow has obvious challenges. However, as CenturyLink grows, these practices are increasingly improved and refined. CenturyLink strives for continuous improvement and views reliability as a competitive advantage. The protocols CenturyLink follows are second-to-none, and help ensure the long-term viability of not only data center operations but also the company as a whole. The scalability and flexibility of these processes can be seen from the efficiency with which CenturyLink has integrated them into its new data center builds as well as data centers it acquired. As CenturyLink continues to grow, these programs will continue to be scaled to meet the needs of demanding enterprise businesses.

Site Spotlight: CH2
In 2013, we undertook a major energy efficiency initiative at our CH2 data center in Chicago, IL. More than 2 years of planning went into the project, and the energy savings were considerable.

Projects included:
• Occupancy sensors for lighting
• VFDs on direct expansion computer room air conditioning units
• Hot Aisle containment
• Implementing advanced economization controls
• Replacing cooling tower fill with high-efficiency evaporative material
• Installing high-efficiency cooling tower fan blades

These programs combined to reduce our winter PUE by over 17%, and our summer PUE by 20%. Additionally, the winter-time period has grown as we have greatly expanded the full free cooling window and added a large partial free cooling window, using the evaporative impact of the cooling towers to great advantage.

Working with Commonwealth Edison, we obtained energy efficiency rebates that contributed over US$500,000 to the project’s return, and the project saves nearly 10,000,000 kilowatt-hours annually. With costs of approximately US$1,000,000, this project proved to be an incredible investment, with an internal rate of return of ≈100% and a Net Present Value of over US$5,000,000. We consider this a phenomenal example of effective best practice implementation and energy efficiency.
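For readers who want to run this kind of analysis on their own projects, the sketch below shows how net present value and internal rate of return are computed from a cash-flow series. The cost, rebate, and kilowatt-hour savings are rounded from the figures above. The electricity price, discount rate, analysis horizon, and the timing of the rebate are assumptions of ours, not given in the article, so the output is illustrative and should not be expected to reproduce the stated results.

```python
from typing import Sequence


def npv(rate: float, cash_flows: Sequence[float]) -> float:
    """Net present value; cash_flows[0] occurs today, then one per year."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def irr(cash_flows: Sequence[float], lo: float = 0.0, hi: float = 10.0,
        tol: float = 1e-9) -> float:
    """Internal rate of return by bisection.
    Assumes NPV is positive at `lo` and negative at `hi`."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if npv(mid, cash_flows) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2


# Rounded from the article.
project_cost = 1_000_000       # US$
rebate = 500_000               # US$
annual_kwh_saved = 10_000_000  # kWh per year

# Assumptions, NOT from the article.
price_per_kwh = 0.05           # US$/kWh (hypothetical)
discount_rate = 0.08           # hypothetical
years = 15                     # hypothetical analysis horizon

annual_savings = annual_kwh_saved * price_per_kwh
# The rebate is assumed to be received up front, offsetting the initial cost.
cash_flows = [-project_cost + rebate] + [annual_savings] * years

print(f"NPV: US${npv(discount_rate, cash_flows):,.0f}")
print(f"IRR: {irr(cash_flows):.0%}")
```

Both results are sensitive to the assumed energy price and horizon, which is why a financial justification should state those inputs explicitly.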


 

Alan Lachapelle


Alan Lachapelle is a mechanical engineer at CenturyLink Technology Solutions. He has 6 years of experience in Naval Nuclear Propulsion on submarines and 4 years of experience in mission critical IT infrastructure. Merging rigorous financial analysis with engineering expertise, Mr. Lachapelle has helped ensure the success of engineering as a business strategy.

Mr. Lachapelle’s responsibilities include energy-efficiency initiatives, data center equipment end of life, operational policies and procedures, peer review of maintenance and operations procedures, utilization of existing equipment, and financial justifications for engineering projects throughout the company.
