Sun Life Plots Data Center Facilities and Operations Roadmap to Meet Application Demands

An imminent return of service causes a top-to-bottom examination of data center facilities and operations
By Rocco Alonzi and Paolo Piro

When Sun Life Financial completed a return of service for its data center operations in 2011, the Enterprise Infrastructure (EI) and Corporate Real Estate (CRE) teams immediately saw an opportunity to improve service stability, back-up capacity, and efficiency.

Focusing initially on the physical facility itself, Sun Life considered what it needed to do to improve its current situation. The options included upgrading its existing primary facility in Waterloo, ON, Canada, purchasing a new facility, and partnering with an existing facility in either a colocation or partnership scenario.

Sun Life scored the options on four main criteria: cost, time to market, interruption, and involvement. In the end, Sun Life decided that upgrading its existing Waterloo-King facility was the best option because the upgrade was the most cost effective, the least interruptive, and the most suitable fit for the organization. The plan resulted in design implementations and organizational improvements that ultimately led to Sun Life’s first Uptime Institute Tier III Certified Constructed Facility.

Achieving this milestone was no small feat. The Waterloo-King facility was fully operational during the upgrades and improvements. The facility hosted the majority of Sun Life’s primary applications, and the site already had many connections and linkages to other facilities, with international implications for the company. In the end, the Waterloo-King facility was transformed into a Concurrently Maintainable Tier III facility with all the redundancy that comes with the designation. The transformation was completed over a relatively short period with zero major outages.

The Decision
The decision to return service from an application outsourcing arrangement back to Sun Life prompted the organization to review its capabilities. Once the decision was communicated, the Enterprise Infrastructure branch (responsible for supporting the services) quickly began to analyze the requirements of the return of service and potential gaps that might impact service.

The Enterprise Infrastructure leadership team led by the Data Center Operations (DCO) assistant vice president (AVP) shouldered the responsibility of ensuring the sufficiency of the data center critical facility and the organization. The DCO reviewed current capabilities, outlined options, and developed an improvement plan. The decision was to upgrade the facility and create an environment that supported a Concurrently Maintainable and fully redundant operation.

To facilitate this transformation, Sun Life assembled a team of stakeholders to lay groundwork, manage responsibility, and execute the pieces to conclusion. The team led by the DCO AVP primarily comprised personnel from Corporate Real Estate (CRE) Facilities and Data Center Operations. Within the Sun Life Financial organization, these two teams have the greatest vested interest in the data center, and both are directly involved in the care and feeding of the facility.

Once the team was assembled, it began working on a mandate that would eventually describe the desired end goal. The goal would be to create a facility that was mechanically and electrically redundant and an organizational support structure that was operationally viable. The organization described in the mandate would be able to support the many functions required to run a critical facility and impose governance standards and strategies to keep it running optimally for years to come.

Data Center Due Diligence Assessment
To help guide the organization through this process, the DCO AVP contracted the services of the Uptime Institute to provide a Data Center Due Diligence Assessment analysis report. The report ultimately formed the basis of Sun Life’s roadmap for this journey.

Once the Data Center Due Diligence Assessment was complete, Uptime Institute presented its findings to the DCO AVP, who then reviewed the report with the CRE AVP and quickly identified opportunities for improvement. Using the Data Center Due Diligence Assessment and a structural assessment from another vendor, Sun Life’s team quickly isolated the critical areas and developed a comprehensive master plan.

These opportunities for improvement would help the team generate individual activities and project plans. The team focused on electrical, mechanical, and structural concerns. The tasks the team developed included creating electrical redundancy, establishing dual-path service feeds, adding a second generator path to create a completely separate emergency generation system, hardening the structural fabric, and replacing the roof waterproofing membrane located above the raised floor.

With the infrastructure concerns identified, the team shifted its focus to organizational effectiveness and accountabilities. Sun Life reviewed its operational processes to close organizational gaps and strengthen accountabilities, responsibilities, and relationships. Changes were necessary not only during the transformation process but also post implementation, when the environment became fully operational and would require optimal and efficient support and maintenance.

The team needed to establish a clear organizational delineation of responsibilities and strong communication links between the DCO and CRE so the data center support structure would function as a single unit. Under the leadership of the DCO AVP, with support from CRE, Sun Life established a Data Center Governance branch to help meet this requirement. Every aspect of the day-to-day care and feeding of the facility was discussed, reviewed, and then approved for implementation, with a clear demarcation established between CRE and DCO support areas based on the Responsible, Accountable, Consulted and Informed (RACI) model. Figure 1 is a graphical example of Sun Life’s final delineation model.
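The RACI delineation in Figure 1 can be read as a simple mapping from each operational activity to the role each group plays. The short Python sketch below is purely illustrative; the activities and role assignments are hypothetical placeholders, not Sun Life’s actual model.

```python
# Hypothetical illustration of a RACI delineation between CRE and DCO.
# The activities and role letters below are placeholders, not Sun Life's actual model.
RACI = {
    # activity: {group: roles}, where roles use the R/A/C/I letters
    "Generator and fuel-system maintenance": {"CRE": "RA", "DCO": "C"},
    "Raised-floor rack and cabling changes": {"CRE": "I",  "DCO": "RA"},
    "Building UPS changes":                  {"CRE": "R",  "DCO": "A"},
    "Shutdown scheduling and communication": {"CRE": "C",  "DCO": "RA"},
}

def roles(activity: str, group: str) -> str:
    """Return the RACI letters a group holds for a given activity."""
    return RACI[activity][group]

print(roles("Building UPS changes", "DCO"))   # -> "A": DCO remains accountable
```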

Figure 1. Overview of Sun Life’s Final RACI model (responsibilities assigned to the CRE and DCO groups)

IT Technology
For the last step, the spotlight moved to IT Technology. The Data Center Governance team (by direction of the DCO AVP) reviewed the existing standards and policies. The team wrote and communicated new policies as required. Adherence to these policies would be strictly enforced, with full redundancy of the mechanical and electrical environment right down to the load level being the overarching goal. Establishment and enforcement of these rules follows the demarcation between CRE and DCO.

Roadmap action items were analyzed to determine grouping and scheduling. Each smaller project was initiated and approved using the same process: stakeholder approval (i.e., the DCO and CRE AVPs) had to be obtained before any project was allowed to proceed through the organization’s change and approval process. The team would first assess the change for risk and then for involvement and impact before allowing it to move forward for organizational assessment and approval. The criteria for approving these mechanical and electrical plans were based on “other” involvement and “other” commitment; the requirements of impacted areas of the organization (other than the DCO and CRE areas) would drive the level of analysis a particular change would undergo. Each project and activity was reviewed and scrutinized for individual merit and overall value-add. Internal IT Infrastructure Library (ITIL) change management processes were then followed. Representatives from all areas of the organization were given the opportunity to assess items for involvement and impact, and the change teams would assess the items for change window conflicts. Only after all involved areas were satisfied would the project be submitted for final ITIL approval and officially scheduled.
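Conceptually, this approval flow is a series of gates that a change must clear in order. The following Python sketch illustrates that idea under stated assumptions; the class, field names, and rules are invented for illustration and do not represent Sun Life’s actual tooling.

```python
# A minimal sketch of staged change-approval gating. The class, fields, and rules
# are illustrative assumptions, not Sun Life's actual tooling.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Change:
    title: str
    risk: str                                                 # e.g., "low", "medium", "high"
    impacted_areas: List[str] = field(default_factory=list)   # areas beyond DCO/CRE
    window_conflict: bool = False                             # clashes with an existing change window?

def area_signoff(change: Change) -> bool:
    # Placeholder: each impacted area assesses the item for involvement and impact.
    return True

def approve(change: Change, stakeholder_signoff: bool) -> bool:
    """Walk a change through the gates in order; stop at the first failure."""
    if not stakeholder_signoff:          # DCO and CRE AVP approval comes first
        return False
    if change.risk == "high":            # high-risk items go back for rework
        return False
    if change.window_conflict:           # change teams check for window conflicts
        return False
    if change.impacted_areas:            # broader involvement drives deeper analysis
        return area_signoff(change)      # only then does it go for final ITIL approval
    return True

print(approve(Change("Replace SP1 switchboard", "medium", ["LAN support"]), True))
```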

The following list provides a high-level summary of the changes that were completed in support of Sun Life’s transformation. Many were done in parallel and others in isolation; select changes required full organizational involvement or a data center shutdown.

• Added a 13.8-kilovolt (kV) high-voltage hydro feed from a local utility

• Added a second electrical service in the parking garage basement

• Completed the construction of a separate generator room and diesel storage tank room in the parking garage basement to accommodate the addition of a 2-megawatt diesel generator, fuel storage tanks, and fuel pumps

• Introduced an underground 600-volt (V) electrical duct bank to data center

• Reconfigured the data center electrical room into two distinct sides

• Replaced old SP1 switchboard in data center electrical room

• Added a backup feed from the new electrical service for main building

• Replaced existing UPS devices

• Installed an additional switch between the new generator and the switchgear to connect the load bank

• Installed an additional switch on each electrical feed providing power to the UPS system for LAN rooms

• Upgraded existing generators to prime-rated engine generators

• Replaced roof slab waterproofing membrane above the data center (see Figure 2)

• Created strategies to mitigate electrical outages

Figure 2. (Above and Below) Waterproof membrane construction above raised floor

 

 

Teamwork
Teamwork was essential to the success of these changes. Each of the changes required strong collaboration, which was only possible because of the strong communication links between CRE and DCO. The team responsible for building the roadmap that effectively guided the organization from where it was to where it needed to be had a full understanding of accountabilities and responsibilities. It was (and still is) a partnership based on a willingness to change and a desire to move in the right direction. The decision to add controls to a building UPS is a good example of this process. Since one of the critical facility UPS units at the Waterloo facility supports part of the general building as well as the critical facility, a control needed to be put in place to ensure compliance, agreement, and communication. Although the responsibility to execute the general building portion falls solely on CRE, a change to this environment could have an impact on the data center and therefore governance was required. Figure 3 shows a process that ensures collaboration, participation, and approval across responsibilities.

Figure 3. Building UPS process

To achieve this level of collaboration, the focus needed to switch to the organizational commitments and support that were fostered during this process. Without this shift in organizational behavior, Sun Life would not have been able to achieve the level of success that it has—at least not as easily or as quickly. This change in mindset helped to change the way things are planned and executed: CRE and DCO work together to plan and then execute. The teamwork ensured knowledge and understanding, and the collaboration removed barriers so the teams were able to develop a much broader line of sight (a bird’s-eye view) when considering the data center.

Delineation of responsibilities was clearly outlined. DCO assumed full accountability for all changes relating to the raised floor space, while CRE Critical Facilities managed all electrical and mechanical components in support of the data center. The DCO team reported through the technology branch of the organization, while the Critical Facilities team reported up through the CRE branch. Overall accountability for the data center rested with DCO, with final approval and ultimate ownership coming from the DCO AVP.

During the planning phase of this transition, both sides (CRE and DCO) quickly realized that processing changes in isolation was not an effective or efficient approach and immediately established a strong collaborative tie. This tie proved to be critical to the organization’s success, as both teams and their respective leaders were able to provide greater visibility, deliver a consistent message, and obtain a greater line of sight into potential issues, all of which helped pave the way for easier acceptance, greater success, and fewer impacts organization wide. As preparations were being made to schedule activities, the team was able to work together and define the criteria for go/no-go decisions.

Documenting the Process
Once individual projects were assessed and approved, the attention turned to planning and execution. In cases where the activity involved only the stakeholder groups (CRE and DCO), the two groups managed the change/implementation in isolation. Using methods of procedure (MOPs) provided by the vendor performing the activity kept the team fully aware of the tasks to be completed and the duration of each task. On the day of the change, communication was managed within the small group, and executives were always kept informed. Activity runbooks were used in cases where the activity was larger and involvement was much broader. These runbooks contained a consolidation of all tasks (including MOPs), responsibilities assigned to areas and individuals, and estimated and tracked durations per step. The MOPs portion of the runbook would be tagged to CRE, and although the specific steps were not itemized, as they were only relevant to CRE and DCO, the time required for the MOP was allotted in the runbook for all to see and understand (see Figure 4). In these larger, more involved cases, the runbooks helped to ensure linkages of roles and responsibilities, especially across Facilities and IT, to plan the day, and to ensure that all requirements and prerequisites were aligned and clearly understood.

 

Figure 4. Example of an electrical enhancement shutdown schedule

Compiling these runbooks required a great deal of coordination. Once the date for the activity was scheduled, the DCO team assumed the lead in developing the runbook. At its inception, the team engaged the impacted areas and began to document a step-by-step MOP that would be used on the day of the change: who was required, who from each team would be responsible, and how much time each task would take. The sum of these durations provided the overall estimate for how long the proposed activity would take. Several weeks prior to the actual change, dry runs of the runbook were scheduled to verify completeness of the approach. Final signoff was always required before any change was processed for execution. Failure to obtain signoff resulted in postponement or cancellation of the activity.
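A runbook of this kind is essentially a list of owned, timed tasks whose durations sum to the overall estimate, with the CRE MOP exposed only as a single timed entry. The sketch below illustrates that structure; the task names, owners, and times are hypothetical, not an actual Sun Life runbook.

```python
# A minimal sketch of a runbook: tasks with owners and estimated durations, where
# the CRE MOP appears only as a single timed entry. Names and times are illustrative.
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    owner: str        # team or individual responsible
    minutes: int      # estimated duration

runbook = [
    Task("Shut down affected applications",             "IT application teams",  60),
    Task("Electrical MOP (detailed steps held by CRE)",  "CRE",                  240),
    Task("Verify power restored to raised floor",        "DCO",                   30),
    Task("Restart and validate applications",            "IT application teams",  90),
]

total_minutes = sum(task.minutes for task in runbook)   # the sum gives the overall estimate
print(f"Estimated activity duration: {total_minutes / 60:.1f} hours")
```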

On the day of the activities, tasks were followed as outlined. Smaller activities (activities that only required DCO and Facilities involvement) were managed within the Facilities area with DCO participation. Larger activities requiring expanded IT coordination were managed using a command room. The room was set up to help facilitate the completion of the tasks in the order outlined in the runbook, and it provided a single point of coordination and collaboration. The facilitators (who were always members of the DCO team) used the forum to document issues that arose, assess impact, and document remediation. Post implementation, this information was used to investigate and resolve issues for future runbook creation. The command room served as a focal point for information and status updates for the entire organization. Status updates were provided at predetermined intervals, and issues were managed centrally to ensure that coordination and information were consistent and complete. The process was repeated for each of the major activities, and in the end, as far as Sun Life’s transformation goes, all changes were executed as planned, with complete cooperation and acceptance by all involved.

System cutovers were completed without major issues and without outages. Interruptions of applications were, for the most part, expected or known. Outages were typically caused by technical limitations at the load level, such as single-corded IT hardware or system limitations. Outages of single-corded equipment were minimized, as systems were restored once power was fed from a live source. For outages caused by system limitations, arrangements had been made with the business client to shut down the service for the duration of the change; service was restored when the change was complete. In the rare circumstance when a minor outage did occur, the support group, which was on site, investigated immediately and determined the root cause to be localized to the IT equipment, typically faulty power supplies or IT hardware configuration errors. These issues, although not related to the overall progress or impact of the activity itself, were documented, resolved, and then added to future runbooks as pre-implementation steps to be completed.

DCO’s governance and strategy framework (see Figure 5) served as the fundamental component defining authority while employing controls. These controls ensured clarity around impact, risk, and execution during the planning and execution phases and have continued to evolve well into the support phase.

Figure 5. Overview of governance model

A new RACI model was developed to help outline the delineation between CRE and DCO in the data center environment. The information, which was built by DCO in collaboration with CRE, was developed in parallel to the changes being implemented. Once the RACI model was approved, the model became the foundation for building a clear understanding of the responsibilities within the data center.

During the planning phase, the collaboration between these two areas facilitated the awareness needed to properly assess impact. As a result, the level of communication and amount of detail provided to the change teams was much more complete. The partnership fostered easier identification of potential application/infrastructure impacts. During the execution phase, management of consolidated implementation plans, validation and remediation, and the use of runbooks (with documented infrastructure/application shutdown/startup procedures) provided the transparency required across responsibilities to effectively manage cutovers and building shutdowns with no major impact or outage.

Results
Several milestones had to be achieved to reach all these goals. The entire facility upgrade process took approximately 18 months to complete from the point when funding was approved. Along this journey, a number of key milestones needed to be negotiated and completed. To help show how Sun Life was able to complete each task, durations and lead times are listed below.

• Description of task (duration) – Lead time

• Contract approvals (2 months) – 18 months

• Construction of two new electrical rooms, installation of one new UPS and installation of generator and fuel system (2 months) – 16 months

• Validation, testing and verification (1 month) – 14 months

• Assemble internal organizational team to define application assessment (1 month) – 9 months

• Initial communication regarding planned cutover (1 month) – 9 months

• Validate recommended cutover option with application and infrastructure teams (1 month) – 8 months

• Remediate application and infrastructure in advance of cutover (2 months) – 7 months

• Define and build cutover weekend governance model (3 months) – 7 months

• Define and build cutover sequence runbook (3 months) – 7 months

• Data center electrical upgrade complete – Tier III Certification of Constructed Facility

In the end, after all the planning, setups, and implementations, all that remained was validation that all the changes were executed according to design. For the Facility Certification, Uptime Institute provided a list of 29 demonstrations covering activities from all aspects of the mechanical and electrical facility. The same team of representatives from CRE and DCO reviewed and analyzed each demonstration for involvement and impact. The Sun Life team created individual MOPs and grouped these for execution based on the duration of the involvement required. These activities took place across 3 days. Runbooks were created and used throughout each of the groupings, and the required areas were engaged. On the demonstration weekend, CRE and DCO resources worked together to process each demonstration, one by one, ensuring and validating the success of the implementation and the design. The end result was Tier III Certification of Constructed Facility (see Figure 6).

Figure 6. Example of Tier III Constructed Facility testing schedule (per demonstration code)

Sun Life Financial received its Tier III Design Documents Certification in May 2013, and then successfully demonstrated all items required over the first weekend in November to receive Tier III Certification of Constructed Facility on November 8, 2013. The journey was not an easy one.

Figure 7. Power distribution overview (before and after)

In summary, Sun Life Financial transformed its primary operational data center facility (see Figure 7) within 18 months at a cost of approximately US$7 million (US$3.4 million allocated to electrical contractor work and materials, US$1.2 million for waterproof roof membrane work, US$1.5 million for environmental upgrades and the addition of a new generator, US$900,000 for other costs, and the remainder for project management and other minor improvements). The success of this transformation was possible in large part due to the collaboration of an entire organization and the leadership of a select few. The facility is now a Tier III Constructed Facility that is Concurrently Maintainable and optimally supported. Through Certification, Sun Life is now in a much stronger position to manage the ever-increasing demands of critical application processing.

Figure 8. Before and after summary


Rocco Alonzi

Rocco Alonzi has worked in the data center environment for the past 10 years, most recently as the AVP of Data Center Operations at Sun Life. He helped develop and implement the strategies that led to Sun Life’s Tier III Certification of Constructed Facility. Prior to joining Sun Life, Mr. Alonzi worked for a large Canadian bank. During his 15 years there, he held many positions, including manager of Data Center Governance, where he built a team responsible for securing, managing, and maintaining the bank’s raised-floor environment. As a member of the Uptime Institute Network, Mr. Alonzi has strongly advocated the idea that IT and M&E must be considered as one in data center spaces.

 

Paolo Piro

Paolo Piro joined Sun Life in May 2013 as a senior data center governance analyst to help establish a governance framework and optimize organizational processes relating to Sun Life’s data centers. Prior to joining Sun Life, Mr. Piro worked 25 years at a large Canadian bank. He became involved in data centers in 2004, when, as a team lead, he became responsible for establishing governance controls, implementing best practices, and optimizing the care and feeding of the data center raised floor. In 2011, he increased his exposure and knowledge in this space by taking on the role of data center manager, where for the next 2 years he managed a team of resources and a consolidated budget allocated for maintaining and caring for the raised-floor environment.

RagingWire’s Jason Weckworth Discusses the Execution of IT Strategy

In this series, Uptime Institute asked three of the industry’s most well-recognized and innovative leaders to describe the problems facing enterprise IT organizations. Jason Weckworth examined the often-overlooked issue of server hugging; Mark Thiele suggested that service offerings often did not fit the customer’s long-term needs; and Fred Dickerman found customers and providers at fault.


Q: What are the most prevalent misconceptions hindering data center owners/operators trying to execute the organization’s IT strategy, and how do they resolve these problems?

Jason Weckworth: As a colocation provider, we sell infrastructure services to end users located throughout the country. The majority of our customers reside within a 200-mile radius. Most IT end users say that they need to be close to their servers. Yet remote customers, once deployed, tend to experience the same level of autonomy and feedback from their operations staffs as those who are close by. Why does this misconception exist?

We believe that the answer lies in legacy data center services vs. the technology of today’s data centers with the emergence of DCIM platforms.

The Misconception: “We need to be near our data center.”
The Reality: “We need real-time knowledge of our environment with details, accessibility, and transparent communication.”

As a pure colocation provider (IaaS), we are not in the business of managed services, hosting, or server applications. Our customers’ core business is IT services, and our core business is infrastructure. Yet they are so interconnected. We understand that our business is the backbone for our customers. They must have complete reliance and confidence in everything we touch. Any problem we have with infrastructure has the potential to take them off-line. This risk can have a crippling effect on an organization.

The answer to remote access is remote transparency.

Data Center Infrastructure Management (DCIM) solutions have been the darlings of the industry for two years running. The key offering, from our perspective, is real-time monitoring with detailed customization. When customers can see their individual racks, circuits, power utilization, temperature, and humidity, all with real-time alarming and visibility, they can pinpoint their risk at any given moment in time. In our industry, seconds and minutes count. Solutions always start with knowing that there is a problem, and then with knowing exactly the location and scope of that problem. Historically, customers wanted to be close to their servers so that they could quickly diagnose their physical environment without having to wait for someone to answer the phone or perform the diagnosis for them. Today, DCIM offers the best accessibility.
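The real-time alarming described here amounts to comparing each rack-level reading against a customer-defined threshold and raising an alert the moment a reading falls outside it. The following sketch shows the idea; the metric names and limits are assumptions, not RagingWire’s actual DCIM configuration.

```python
# A minimal sketch of threshold-based, per-rack alarming of the kind a DCIM platform
# surfaces to customers. Metric names and limits are illustrative assumptions.
THRESHOLDS = {
    "rack_power_kw":   (0.0, 8.0),     # (min, max) allowed values
    "inlet_temp_c":    (18.0, 27.0),   # roughly the ASHRAE recommended envelope
    "humidity_pct_rh": (20.0, 80.0),
}

def check_reading(rack: str, metric: str, value: float):
    """Return an alarm message if a reading falls outside its threshold, else None."""
    low, high = THRESHOLDS[metric]
    if value < low or value > high:
        return f"ALARM {rack}: {metric}={value} outside [{low}, {high}]"
    return None

print(check_reading("rack-42", "inlet_temp_c", 29.5))   # alarm raised in real time
```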

Remote Hands and Eyes (RHE) is a physical, hands-on service related to IT infrastructure server assets. Whether the need is a server reboot, asset verification, cable connectivity, or tape change, physical labor is always necessary in a data center environment. Labor costs are an important consideration of IT management. While many companies offer an outsourced billing rate that discourages the use of RHE as much as possible, we took an insurance policy approach by offering unlimited RHE for a flat monthly fee based on capacity. With 650,000 square feet (ft2) of data center space, we benefit greatly from scaling the environment. While some customers need a lot of services one month, others need hardly any at all. But when they need it, it’s always available. The overall savings of shared resources across all customers ends up benefitting everyone.

Customers want to be close to their servers because they want to know what’s really going on. And they want to know now. “Don’t sugarcoat issues, don’t spin the information so that the risk appears to be less than reality, and don’t delay information pending a report review and approval from management. If you can’t tell me everything that is happening in real time, you’re hiding something. And if you’re hiding something, then my servers are at risk. My whole company is at risk.” As the data center infrastructure industry has matured over the past 10 years, we have found that customers have become much more technical and sophisticated when it comes to electrical and mechanical infrastructure. Our solution to this issue of proximity has been to open our communication lines with immediate and global transparency. Technology today allows information to flow within minutes of an incident. But only culture dictates transparency and excellence in communication.

As a senior executive of massive infrastructure separated across the country on both coasts, I try to place myself in the minds of our customers. Their concerns are not unlike our own. IT professionals live and breathe uptime, risk management, and IT capacity/resource management. Historically, this meant the need to be close to the center of the infrastructure. But today, it means the need to be accessible to the information contained at the center. Server hugging may soon become legacy for all IT organizations.


Jason Weckworth

Jason Weckworth is senior vice president and COO, RagingWire Data Centers. He has executive responsibility for critical facilities design and development, critical facilities operations, construction, quality assurance, client services, infrastructure service delivery, and physical security. Mr. Weckworth brings 25 years of data center operations and construction expertise to the data center industry. Prior to joining RagingWire, he was owner and CEO of Weckworth Construction Company, which focused on the design and construction of highly reliable data center infrastructure, self-performing all electrical work in line with operational best practices. Mr. Weckworth holds a bachelor’s degree in Business Administration from California State University, Sacramento.

 

Mark Thiele from Switch examines the options in today’s data center industry

In this series, three of the industry’s most well-recognized and innovative leaders describe the problems facing enterprise IT organizations today. In this part, Switch’s Mark Thiele suggests that service offerings often don’t fit customers’ long-term needs.

Customers in the data center market have a wide range of options. They can choose to do something internally, lease retail colocation space, get wholesale colocation, move to the cloud, or all of the above. What are some of the more prevalent issues with current market choices relative to data center selection?

Mark Thiele: Most of the data center industry tries to fit customers into status-quo solutions and strategies, doing so in many cases simply because, “Those are the products we have and that’s the way we’ve always done it.” Little consideration seems to be given to the longer-term business and risk impacts of continuing to go with the flow in today’s rapidly changing innovation economy.

The correct solution can be a tremendous catalyst, enabling all kinds of communication and commerce, and the wrong solution can be a great burden for 5-15 years.

In addition, many data center suppliers and builders think of the data center as a discrete and detached component whose placement, location, and ownership strategy have little to do with IT and business strategies. The following six conclusions are drawn from conversations Switch is having on a daily basis with technology leaders from every industry.

Data centers should be purpose-built buildings. A converted warehouse with skylight penetrations and a wooden roof deck isn’t a data center; it’s a warehouse to which someone has added extra power and HVAC. These remodeled wooden-roof warehouses present a real risk for the industry because thousands of customers have billions of dollars’ worth of critical IT gear sitting in converted buildings where they expect their provider to protect them at elite mission critical levels. A data center is by its very nature part of your critical infrastructure; as such, it should be designed from scratch to be a data center that can actually offer the highest levels of protection from dangers like fire and weather.

A container data center is not a foundational solution for most businesses but can be a good solution for specific niche opportunities (disaster support, military, extreme-scale homogeneous environments, etc.). Containers strand HVAC resources. If you need more HVAC in one container than another you cannot just share it. If a container loses HVAC, all the IT gear is at risk even though there may be millions of dollars of healthy HVAC elsewhere.

The data center isn’t a discrete component. Data centers are a critical part of your larger IT and enterprise strategies, yet many are still building and/or selling data centers as if they were just a real estate component.

One of the reasons that owning a data center is a poor fit for many businesses is that it is hard to make the tight link needed between a company’s data center strategy and its business strategy. It’s hard to link the two when one has a 1- to 3-year life (business strategy) and the other has a 15- to 25-year life (data center).

The modern data center is the center of the universe for business enablement and IT readiness. Without a strong ecosystem of co-located partners and suppliers, a business can’t hope to compete in the world of the agile enterprise. We hear from customers every day that they need access to a wide range of independently offered technology solutions and services that are on premises. Building your own data center and occupying it alone for the sake of control isolates your company on an island away from all of the partners and suppliers that might otherwise easily assist in delivering successful future projects. The possibilities and capabilities of locating in an ultra-scale multi-company technology ecosystem cannot be ignored in the innovation economy.

Data centers should be managed like manufacturing capacity. Like a traditional manufacturing plant, the modern data center is a large investment. How effectively and efficiently it’s operated can have a major impact on corporate costs and risks. More importantly, the most effective data center design, location, and ecosystem strategies can offer significant flexibility and independence for IT to expand or contract at various speeds and to go in different directions entirely as new ideas are born.

More enterprises are getting out of the data center business. Fewer than 5% of businesses and enterprises have the appropriate business drivers and staffing models that would cause them to own and operate their own facilities in the most efficient manner. Even among some of the largest and most technologically savvy businesses there is a significant change in views on how data center capacity should be acquired.


 

Mark Thiele

Mark Thiele is EVP, Data Center Tech at SUPERNAP, where his responsibilities include evaluating new data center technologies, developing partners, and providing industry thought leadership. Mr. Thiele’s insights into the next generation of technological innovations and how these technologies speak to client needs and solutions are invaluable. He shares his enthusiasm and passion for technology and how it impacts daily life and business on local, national, and world stages.

Mr. Thiele has a long history of IT leadership specifically in the areas of team development, infrastructure, and data centers.  Over a career of more than 20 years, he’s demonstrated that IT infrastructure can be improved to drive innovation, increase efficiency, and reduce cost and complexity.  He is an advisor to venture firms and start-ups, and is a globally recognized speaker at premier industry events.

Improving Performance in Ever-Changing Mission-Critical IT Infrastructures

CenturyLink incorporates lessons learned and best practices for high reliability and energy efficiency.

By Alan Lachapelle

CenturyLink Technology Solutions and its antecedents (Exodus, Cable and Wireless, Qwest, and Savvis) have a long tradition of building mission critical data centers. With the advent of its Internet Data Centers in the mid-1990s, Exodus broke new ground by building facilities at unprecedented scale. Even today, legacy Exodus data centers are among the largest, highest capacity, and most robust data centers in CenturyLink’s portfolio, which the company uses to deliver innovative managed services for global businesses on virtual, dedicated, and colocation platforms (see Executive Perspectives on the Colocation and Wholesale Markets, p.51).

Through the years CenturyLink has seen significant advances not only in IT technology, but in mission-critical IT infrastructures as well; adapting to and capitalizing on those advances have been critical to the company’s success.

Applying new technologies and honing best-practice facility design standards is an ongoing process. But the best technology and design alone will not deliver the efficient, high-quality data center that CenturyLink’s customers demand. It takes experienced, well-trained staff with a commitment to rigorous adherence to standards and methods to deliver on the promise of a well-designed and well-constructed facility. Specifically, that promise is to always be up and running, come what may, to be the “perfect data center.”

The Quest
As its build process matured, CenturyLink took a phased approach to infrastructure, pushing the envelope and leading the industry in effective deployment of capital for mission critical infrastructures. As new technologies developed, CenturyLink introduced them to the design. As the potential densities of customers’ IT infrastructure environments increased, so too did the densities planned into new data center builds. And as the customer base embraced new environmental guidelines, designs changed to more efficiently accommodate these emerging best practices.

Not many can claim a pedigree of 56 (and counting) unique data center builds, with the continuous innovation necessary to stay on top in an industry in which constant change is the norm. The demand for continuous innovation has inspired CenturyLink’s multi-decade quest for the perfect data center design model and process. We’re currently on our fourth generation of the perfect data center—and, of course, it certainly won’t be the last.

The design focus of the perfect data center has shifted many times.

Dramatically increasing the efficiency of white space in the data centers is likely the biggest such shift. Under the model in use in 2000, a 10-megawatt (MW) IT load may have required 150,000 square feet (ft2) of white space. Today, the same capacity requires only a third of the space. Better still, we have deployments of 1 MW in 2,500 ft2, six times denser than the year-2000 design. Figure 1 shows the average densities in four recent customer installations.
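A quick calculation, using only the figures quoted above, shows how the white-space density has changed:

```python
# Quick check of the density figures quoted above (all inputs come from the text).
year_2000_w_per_ft2 = 10_000_000 / 150_000   # 10 MW in 150,000 ft2 -> ~67 W/ft2
today_w_per_ft2     = 10_000_000 / 50_000    # a third of the space -> ~200 W/ft2
densest_w_per_ft2   = 1_000_000 / 2_500      # 1 MW in 2,500 ft2    -> 400 W/ft2

print(round(year_2000_w_per_ft2), round(today_w_per_ft2), round(densest_w_per_ft2))
print(round(densest_w_per_ft2 / year_2000_w_per_ft2, 1))   # ~6x the year-2000 density
```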

Figure 1. The average densities for four recent customer installations show how significantly IT infrastructure density has risen in recent years.

Our data centers are rarely homogenous, so the designs need to be flexible enough to support multiple densities in the same footprint. A high-volume trading firm might sit next to a sophisticated 24/7 e-retailer, next to a disaster recovery site for a health-care provider with a tape robot. Building in flexibility is a hurdle all successful colocation providers must overcome to effectively address their clients’ varied needs.

Concurrent with differences in power density are different cooling needs. Being able to accommodate a wide range of densities efficiently, from the lowest (storage and backup) to the highest (high-frequency trading, bitcoin mining, etc.), is a chief concern. By harnessing the latest technologies (e.g., pumped-refrigerant economizers, water-cooled chillers, high-efficiency rooftop units), we match an efficient, flexible cooling solution to the climate, helping ensure our ability to deliver value while maximizing capital efficiency.

Mechanical systems are not alone in seeing significant technological development. Electrical infrastructures have changed at nearly the same pace. All iterations of our design have safely and reliably supplied customer loads, and we have led the way in developing many best practices. Today, we continue to minimize component count, increase the mean time between failures, and pursue high operating efficiency infrastructures. To this end, we employ the latest technologies, such as delta conversion UPS systems for high availability and Eco Mode UPS systems that actually have a higher availability than double-conversion UPS systems. We consistently re-evaluate existing technologies and test new ones, including Bloom Energy’s Bloom Box solid oxide fuel cell, which we are testing in our OC2 facility in Irvine, CA. Only once a new technology is proven and has shown a compelling advantage will we implement it more broadly.

All the improvements in electrical and mechanical efficiencies could scarcely be realized in real-world data centers if controls were overlooked. Each iteration of our control scheme is more robust than the last, thanks to a multi-disciplinary team of controls experts who have built fault tolerance into the control systems. The current design, honed through much iteration, allows components to function independently, if necessary, but generates significant benefit by networking them together, so that they can be controlled collaboratively to achieve optimal overall efficiency. To be clear, each piece of equipment is able to function solely on its own if it loses communication with the network, but by allowing components with part-load efficiencies to communicate with each other effectively, the system intelligently selects ideal operating points to ensure maximum overall efficiency.

For example, the chilled-water pumping and staging software analyzes current chilled-water conditions (supply temperature, return temperature, and system flow) and chooses the appropriate number of chilled-water pumps and chillers to operate to minimize chiller plant energy consumption. To do this, the software evaluates current chiller performance against ambient temperature, load, and pumping efficiency. The entire system is simple enough to allow for effective troubleshooting and for each component to maintain required parameters under any circumstance, including failure of other components.
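In other words, the controls evaluate each feasible staging option against a part-load model and run the combination with the lowest predicted plant power. The sketch below illustrates that selection logic under an assumed efficiency curve and unit capacity; it is not CenturyLink’s control software.

```python
# A minimal sketch of the staging decision: pick the number of running chiller/pump
# pairs that meets the current load at the lowest estimated plant power. The part-load
# model and capacities are illustrative assumptions.
def plant_kw(units: int, load_tons: float) -> float:
    """Crude part-load model: efficiency is best near ~80% loading per unit."""
    capacity_per_unit = 500.0                          # tons, assumed
    if load_tons > units * capacity_per_unit:
        return float("inf")                            # this staging cannot meet the load
    part_load = load_tons / (units * capacity_per_unit)
    kw_per_ton = 0.45 + 0.35 * abs(part_load - 0.8)    # assumed efficiency curve
    return load_tons * kw_per_ton + units * 15.0       # plus fixed pump/auxiliary power

def best_staging(load_tons: float, max_units: int = 4) -> int:
    return min(range(1, max_units + 1), key=lambda n: plant_kw(n, load_tons))

print(best_staging(600.0))   # -> 2 units under the assumed curve
```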

Finally, our commissioning process has grown and matured. Learning lessons from past commissioning procedures, as well as from the industry as a whole, has made the current process increasingly rigorous. Today, simulations used to test new data centers before they come on-line closely represent actual conditions in our other facilities. A thorough commissioning process has helped us ensure our buildings are turned over to operators as efficient, reliable, and easy to operate as our designs intended.

Design and Construction Certification
As the data center industry has matured, among the things that became clear to CenturyLink was the value of Tier Certification. The Uptime Institute’s Tier Standard: Topology makes the benchmark for performance clear and attainable. While our facilities have always been resilient and maintainable, CenturyLink’s partnership with the Uptime Institute to certify our designs to its well-known and recognizable standards creates customer certainty.

CenturyLink currently has five Uptime Institute Tier III Certified Facilities in Minneapolis, MN; Chicago, IL; Toronto, ON; Orange County, CA; and Hong Kong, with a sixth underway. By having our facilities Tier Certified, we do more than simply show commitment to transparency in design. Customers who can’t view and participate in commissioning of facilities can rest assured knowing the Uptime Institute has Certified these facilities as Concurrently Maintainable. We invite comparison to other providers and know that our commitments will provide value for our customers in the long run.

Application of Design to Existing Data Centers
Our build team uses data center expansions to improve the capacity, efficiency, and reliability of existing data centers. This includes (but is not limited to) optimizing power distribution, aligning cooling infrastructure with ASHRAE guidance, and upgrading controls to increase reliability and efficiency.

Meanwhile, our operations engineers continuously implement best practices and leading-edge technologies to improve the energy efficiency, capacity, and reliability of data center facilities. Portfolio wide, the engineering team has enhanced control sequences for cooling systems, implemented electronically commutated (EC) and variable frequency drive (VFD) fans, and deployed Cold Aisle/Hot Aisle containment. These best practices serve to increase total cooling capacity and efficiency, ensuring customer server inlet conditions are homogenous and within tolerance. Figure 2 shows the total impact of all such design improvements on our company’s aggregate Power Usage Effectiveness (PUE). Working hand-in-hand with the build group, CenturyLink’s operations engineers ensure continuous improvement in perfect data center design, enhancing some areas while eliminating unneeded and unused features and functions—often based on feedback from customers.
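For reference, PUE compares the energy drawn by the whole facility to the energy delivered to the IT equipment, so values closer to 1.0 indicate less mechanical and electrical overhead:

```latex
\mathrm{PUE} \;=\; \frac{E_{\text{total facility}}}{E_{\text{IT equipment}}} \;\ge\; 1
```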

Figure 2. As designs improve over time, CenturyLink implements best practices and lessons learned into its existing portfolio to continuously improve its aggregate PUE. It’s worth noting that the PUEs shown here include considerable office space as well as extra resiliency built into unutilized capacity.

Staffing Models
CenturyLink relies on expert data center facilities and operations teams to respond swiftly and intelligently to incidents. Systems are designed to withstand failures, but it is the facilities team that promptly corrects failures, maintaining equipment at a high level of availability that continually assures Fault Tolerance.

Each facility hosts its own Facility and Operations teams. The Facility team consists of a facility manager, a lead mechanical engineer, a lead electrical engineer, and a team of facility technicians. They maintain the building and its electrical and mechanical systems. They are experts on their locations. They ensure equipment is maintained in accordance with CenturyLink’s maintenance standards, respond to incidents, and create detailed Methods of Procedure (MOPs) for all actions and activities. They also are responsible for provisioning new customers and maintaining facility capacity.

The Operations team consists of the operations manager, operations lead, and several operations technicians. This group staffs the center 24/7, providing our colocation customers with CenturyLink’s “Gold Support” (remote hands) for their environment so that they don’t need to dispatch someone to the data center. This team also handles structured cabling and interconnections.

Regional directors and regional engineers supplement the location teams. The regional directors and engineers serve as subject matter experts (SMEs) but, when required, can also marshal the resources of CenturyLink’s entire organization to rapidly and effectively resolve issues and ensure potential problems are addressed on a portfolio-wide basis.

The regional teams work as peers, providing each individual team member’s expertise when and where appropriate, including to operations teams outside their region when needed. Collaborating on projects and objectives, this team ensures the highest standards of excellence are consistently maintained across a wide portfolio. Thus, regional engineers and directors engage in trusted and familiar relationships with site teams, while ensuring the effective exchange of information and learning across the global footprint.

Global Standard Process Model
A well-designed, -constructed, and -staffed data center is not enough to ensure a superior level of availability. The facility is only as good as the methods and procedures that are used to operate it. A culture that embraces process is also essential in operating a data center efficiently and delivering the reliability necessary to support the world’s most demanding businesses.

Uptime and latency are primary concerns for CenturyLink. The CenturyLink brand depends upon a sustained track record of excellence. Maintaining consistent reliability and availability across a varied and changing footprint requires an intensive and dynamic facilities management program encompassing an uncompromisingly rigid adherence to well-planned standards. These standards have been modeled in the IT Infrastructure Library spirit and are the result of years of planning, consideration, and trial and error. Adherence further requires close monitoring of many critical metrics, which is facilitated by the dashboard shown in Figure 3.

Figure 3. CenturyLink developed a dynamic dashboard that tracks and trends important data: site capacity, PUE, available raised floor space, operational costs, abnormal incidents, uptime metrics, and much more to provide a central source of up-to-date information for all levels of the organization.

Early in the development of Savvis as a company, management established many of the organizational structures that exist today in CenturyLink Technology Solutions. CenturyLink experienced growth in many avenues; as it serviced increasingly demanding customers (in increasing volume), these structures continued to evolve to suit the ever-changing needs of the company and its customers.

First, the management team developed a staff capable of administering the many programs that would be required to maintain the standard of excellence demanded by the industry. Savvis developed models analyzing labor and maintenance requirements across the company and determined the most appropriate places to invest in personnel. Training was emphasized, and teams of SMEs were developed to implement the more detailed aspects of facilities operations initiatives. The management team is centralized, in a sense, because it is one global organization; this enhances the objective of global standardization. Yet the team is geographically diverse, subdivided into teams dedicated to each site and regional teams working with multiple sites, ensuring that all standards are applied globally throughout the company. And all teams contribute to the ongoing evolution of those standards and practices—for example, participating in two global conference calls per week.

Next, it was important to set up protocols to handle and resolve issues as they developed, inform customers of any impact, and help customers respond to and manage situations. No process guarantees that situations will play out exactly as anticipated, so a protocol to handle unexpected events was crucial. This process relied on an escalation schedule that brought decisions through the SMEs for guidance and gave decision makers the proper tools for decision making and risk mitigation. Parallel to this, a process was developed to ensure any incident with impact to customers caused notifications to those customers so they could prepare for or mitigate the impact of an event.

A tracking system accomplished many things. For example, it ensured follow up on items that might create problems in the future, identified similar scenarios or locations where a common problem might recur, established a review and training process to prevent future incidents through operator education, justified necessary improvements in systems creating problems, and tracked performance over longer periods to analyze success in implementation and evaluate need for plan improvement. The tracking system is inclusive of all types of problems, including those related to internal equipment, employees, and vendors.

Data centers, being dynamic, require frequent change. Yet unmanaged change can present a significant threat to business continuity. Congruent with the other programs, CenturyLink set up a Change Management program. This program tracked changes, their impacts, and their completion. It ensured that risks were understood and planned for and that unnecessary risks were not taken.

Any request for change, either internal or from a customer, must go through the Change Management process and be evaluated on metrics for risk. These metrics determine the level of controls associated with that work and what approvals are required. The key risk factors considered in this analysis include the possible number of customers impacted, likelihood of impact, and level of impact. Even more importantly, the process evaluates the risk of not completing a task and balances these factors. The Change Management program and standardization of risk analysis necessitated standardizing maintenance procedures and protocols as well.
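One way to picture this balancing act is a simple score for each change weighed against a score for deferring it, with the score determining the level of controls applied. The sketch below is illustrative only; the scales, threshold, and tiers are assumptions rather than CenturyLink’s actual scoring model.

```python
# A minimal sketch of weighing the risk factors named above. Scales, weights, and
# control tiers are illustrative assumptions.
def risk_score(customers_impacted: int, likelihood: float, severity: int) -> float:
    """likelihood in [0, 1]; severity on a 1-5 scale."""
    return customers_impacted * likelihood * severity

def decide(change_score: float, deferral_score: float) -> str:
    """Balance the risk of doing the work against the risk of not doing it."""
    if deferral_score > change_score:
        return "proceed: deferring the work is the greater risk"
    if change_score > 100:
        return "escalate: senior approval, full MOP, and rollback plan required"
    return "standard change controls apply"

work = risk_score(customers_impacted=40, likelihood=0.1, severity=3)   # 12.0
print(decide(change_score=work, deferral_score=25.0))                  # -> proceed
```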

Standards, policies, and best practices were established, documented, and reviewed by management. These create the operating frameworks for implementing IT Infrastructure Library methodology, enabling CenturyLink to incorporate industry best practices and standards, as well as develop internal operating best practices, all of which maximize uptime and resource utilization.

A rigid document-control program was established utilizing peer review, and all activities or actions performed were scripted, reviewed, and approved. Peer review also contributed to personnel training, ensuring that as documentation was developed, peers collaborated and maintained expertise on the affected systems. Document standardization was extended to casualty response as well. Even responding to failures requires use of approved procedures, and the response to every alarm or failure is scripted so the team can respond in a manner minimizing risk. In other words, there is a scripted procedure even for dealing with things we’ve never before encountered. This document control program and standardization has enabled personnel to easily support other facilities during periods of heightened risk, without requiring significant training for staff to become familiar with the facilities receiving the additional support.

Conclusion
All the factors described in this paper combine to allow CenturyLink to operate a mission-critical business on a grand scale, with uniform operational excellence. Without this framework in place, CenturyLink would not be able to maintain the high availability on which its reputation is built. Managing these factors while continuing to grow has obvious challenges. However, as CenturyLink grows, these practices are increasingly improved and refined. CenturyLink strives for continuous improvement and views reliability as a competitive advantage. The protocols CenturyLink follows are second-to-none, and help ensure the long-term viability of not only data center operations but also the company as a whole. The scalability and flexibility of these processes can be seen from the efficiency with which CenturyLink has integrated them into its new data center builds as well as data centers it acquired. As CenturyLink continues to grow, these programs will continue to be scaled to meet the needs of demanding enterprise businesses.

Site Spotlight: CH2
In 2013, we undertook a massive energy efficiency initiative at our CH2 data center in Chicago, IL. More than 2 years of planning went into the project, and the energy savings were considerable.

Projects included:
• Occupancy sensors for lighting
• VFDs on direct expansion computer room air conditioning units
• Hot Aisle containment
• Implementing advanced economization controls
• Replacing cooling tower fill with high-efficiency evaporative material
• Installing high-efficiency cooling tower fan blades

These programs combined to reduce our winter PUE by over 17% and our summer PUE by 20%. Additionally, the effective winter operating period has grown as we have greatly expanded the full free cooling window and added a large partial free cooling window, using the evaporative effect of the cooling towers to great advantage.

Working with Commonwealth Edison, we secured energy efficiency rebates of over US$500,000, which contributed to the project’s return, along with annual savings of nearly 10,000,000 kilowatt-hours. With costs of approximately US$1,000,000, this project proved to be an incredible investment, with an internal rate of return of ≈100% and a Net Present Value of over US$5,000,000. We consider this a phenomenal example of effective best practice implementation and energy efficiency.
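The economics behind these headline figures can be sanity-checked with a simple cash-flow model. The sketch below is illustrative only: the electricity rate, analysis horizon, and discount rate are assumptions (the article does not state them), and the rebate is treated as a year-0 offset to project cost, so the output will not match the reported IRR and NPV exactly.

# Back-of-the-envelope NPV/IRR check for the CH2 efficiency project.
# Assumptions (not from the article): ~US$0.07/kWh blended electricity rate,
# a 10-year horizon, an 8% discount rate, and the US$500,000 rebate treated
# as an offset against the ~US$1,000,000 project cost.

COST = 1_000_000
REBATE = 500_000
KWH_SAVED_PER_YEAR = 10_000_000
RATE_PER_KWH = 0.07          # assumed
YEARS = 10                   # assumed
DISCOUNT_RATE = 0.08         # assumed

annual_savings = KWH_SAVED_PER_YEAR * RATE_PER_KWH
cash_flows = [-(COST - REBATE)] + [annual_savings] * YEARS

def npv(rate, flows):
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(flows))

def irr(flows, lo=0.0, hi=10.0, tol=1e-6):
    # Simple bisection: NPV is positive at lo and negative at hi for these flows.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, flows) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(f"NPV @ {DISCOUNT_RATE:.0%}: ${npv(DISCOUNT_RATE, cash_flows):,.0f}")
print(f"IRR: {irr(cash_flows):.0%}")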


 

Alan Lachapelle

Alan Lachapelle is a mechanical engineer at CenturyLink Technology Solutions. He has 6 years of experience in Naval Nuclear Propulsion on submarines and 4 years of experience in mission critical IT infrastructure. Merging rigorous financial analysis with engineering expertise, Mr. Lachapelle has helped ensure the success of engineering as a business strategy.

Mr. Lachapelle’s responsibilities include energy-efficiency initiatives, data center equipment end of life, operational policies and procedures, peer review of maintenance and operations procedures, utilization of existing equipment, and financial justifications for engineering projects throughout the company.

Operational Upgrade Helps Fuel Oil Exploration Surveyor

Petroleum Geo-Services increases its capabilities using innovative data center design
By Rob Elder and Mike Turff

Petroleum Geo-Services (PGS) is a leading oil exploration surveyor that helps oil companies find offshore oil and gas reserves. Its range of seismic and electromagnetic services, data acquisition, processing, reservoir analysis/interpretation, and multi-client library data all require PGS to collect and process vast amounts of data in a secure and cost-efficient manner. This demands large amounts of compute capacity deployed in very high-density configurations. PGS operates 21 data centers globally, with three main data center hubs located in Houston, Texas; Kuala Lumpur, Malaysia; and Weybridge, Surrey, UK (see Figure 1).

Figure 1. PGS global computing centers

Weybridge Data Center

Keysource Ltd designed and built the Weybridge Data Center for PGS in 2008. The high-density IT facility won a number of awards and saved PGS 6.2 million kilowatt-hours (kWh) annually compared to the company’s previous UK data center. The Weybridge Data Center is located in an office building, which posed a number of challenges for the designers and builders of a high-performance data center. The 2008 project was designed as the first phase of a three-phase deployment (see Figure 2).

Figure 2. Phase 1 Data Center (2008)

Phase one was designed for 600 kW of IT load, scalable up to 1.8 megawatts (MW) across two future phases if required. Within the facility, rack power densities of 20 kW were easily supported, exceeding the 15-kW target originally specified by the IT team at PGS.

The data center houses select mission-critical applications supporting business systems, but it primarily operates the data mining and analytics associated with the core business of oil exploration. This IT is deployed in full-height racks and requires up to 20 kilowatts (kW) per rack anywhere in the facility and at any time (see Figure 3).

Figure 3. The full PGS site layout

In 2008, PGS selected Keysource’s ecofris solution for use at its Weybridge Data Center (see Figure 4), which became the first facility to use the technology. Ecofris recirculates air within a data center without using fresh air. Instead air is provided to the data center through the full height of a wall between the raised floor and suspended ceiling. Hot air from the IT racks is ducted into the suspended ceiling and then drawn back to the cooling coils of air handling units (AHU) located at the perimeter walls. The system makes use of adiabatic technology for external heat rejection when external temperatures and humidity do not allow 100% free cooling.

Figure 4. Ecofris units are part of the phase 1 (2008) cooling system to support PGS’s high-density IT.

Keysource integrated a water-cooled chiller into the ecofris design to provide mechanical cooling when needed to supplement the free cooling system (see Figure 5). As a result PGS ended up with two systems, each having a 400-kW chiller, which run for only 50 hours a year on average when external ambient conditions are at their highest.

Figure 5. Phase 2 ecofris cooling

As a result of this original design, the Weybridge Data Center used outside air for heat rejection without allowing that air into the building. The airflow design, a comprehensive control system, and total separation of hot and cold air meant that the facility could accommodate 30 kW in any rack and deliver a PUE L2,YC (Level 2, Continuous Measurement) of 1.15 while maintaining a consistent server inlet temperature of 72°F (22°C) ±1° across the entire space. Adopting an indirect free cooling design rather than direct fresh air eliminated the need for major filtration or mechanical backup (see the sidebar).

Surpassing the Original Design Goals
When PGS needed additional compute capacity, the Weybridge Data Center was a prime candidate for expansion because it had the flexibility to deploy high-density IT anywhere within the facility and a low operating cost. However, while the original design anticipated two future 600-kW phases, PGS wanted even more capacity because of the growth of its business and its need for the latest IT technology. In addition, PGS wanted to drive down operating costs through efficient cooling design and to maximize the power capacity available at the site.

When the recent project was completed at the end of 2013, the Weybridge Data Center housed the additional high-density IT within the footprint of the existing data hall. The latest ecofris solution was deployed, this time with a chillerless design, which limited the increase in power demand.

Keysource approached the design by looking for ways to maximize the use of white space for IT and to eliminate the power overhead of running mechanical cooling, even for a very limited number of hours a year. This would ensure that the maximum amount of power capacity remained available for the IT equipment. Although annualized operating efficiency (PUE) improved only marginally, the biggest design change was the reduced peak PUE. This change enabled an increase in IT design load from 1.8 MW to 2.7 MW within the same footprint. At just over 5 kW/square meter (m2), PGS can deploy 30 kW in any cabinet up to the maximum total installed IT capacity (see Figure 6).
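A rough cross-check of these capacity figures, assuming the white-space area implied by the stated density (the area itself is not given in the article):

# Rough cross-check of the phase 2 capacity figures. The white-space area is
# inferred from the stated ~5 kW/m2 density; it is not a figure from the article.

it_design_load_kw = 2_700          # stated IT design load after the upgrade
density_kw_per_m2 = 5.0            # "just over 5 kW/m2"
rack_positions = 188               # stated maximum rack positions
max_per_rack_kw = 30               # stated per-rack ceiling

approx_white_space_m2 = it_design_load_kw / density_kw_per_m2
print(f"Implied white space: ~{approx_white_space_m2:.0f} m2")      # ~540 m2

# Every rack may individually draw up to 30 kW, but the room as a whole is
# capped by the installed IT capacity, so the average across all racks is lower.
avg_per_rack_kw = it_design_load_kw / rack_positions
print(f"Average per rack at full load: ~{avg_per_rack_kw:.1f} kW")  # ~14.4 kW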

Figure 6.  More compute power within the same overall data center footprint

Disruptive Cooling Design
Developments in technology and the wider allowable temperature range per ASHRAE TC9.9 enabled PGS to adopt higher server inlet temperatures when ambient temperatures are higher. This change allows PGS to operate at the optimum temperature for the equipment, normally 72°F (22°C), most of the time, keeping server fan energy, the IT side of the PUE calculation, as low as possible (see Figure 7).

Figure 7. Using computational fluid dynamics to model heat and airflow

In this facility, the server inlet (supply) temperature is elevated only when the ambient outside air is too warm to maintain 72°F (22°C). Running at higher temperatures at other times would actually increase server fan power across the different equipment types, which in turn increases UPS loads. Operating the facility this way reduces the overall facility load, even though the reported PUE may rise as a result of the lower server fan power. With site facilities management (FM) teams trained in operating the mechanical systems, this is fine-tuned through operation and as additional IT equipment is commissioned within the facility, ensuring performance is maintained at all times.
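A simple worked example, using illustrative numbers rather than PGS measurements, shows why a lower PUE does not always mean lower total energy once server fan power enters the picture:

# Illustrative numbers only, not PGS measurements. PUE = total facility power / IT power.

def pue(it_kw, overhead_kw):
    return (it_kw + overhead_kw) / it_kw

# Case A: inlet held at 22 degrees C; server fans run slow.
it_a, overhead_a = 1_000.0, 150.0
# Case B: inlet raised when it is not necessary; fans spin up (+4% IT power),
# while cooling overhead drops slightly.
it_b, overhead_b = 1_040.0, 140.0

print(f"Case A: PUE {pue(it_a, overhead_a):.3f}, total {it_a + overhead_a:.0f} kW")
print(f"Case B: PUE {pue(it_b, overhead_b):.3f}, total {it_b + overhead_b:.0f} kW")
# Case B reports the "better" PUE (1.135 vs 1.150) yet draws 30 kW more in total,
# which is why PGS optimizes for overall facility load rather than for PUE alone.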

Innovation was central to the improved performance of the data center: in addition to the cooling, Keysource delivered modular, highly efficient UPS systems that provide 96% efficiency at loads above 25%, plus facility controls that provide automated optimization.

A Live Environment
Working in a live data center environment within an office building was never going to be risk free. Keysource built a temporary wall within the existing data center to divide the live operational equipment from the project area (see Figure 8). Cooling, power, and data for the live equipment are not delivered via a raised floor and all enter from the same end of the data center. Therefore, the dividing screen had limited impact on the live environment, with only some minor modifications needed to the fire detection and suppression systems.

Figure 8. The temporary protective wall built for phase 2

Keysource also manages the data center facility for PGS, which meant that the FM and project teams could work closely together in planning the upgrade. As a result, facilities management considerations were included in all design and construction planning, minimizing risk to the operational data center and helping to reduce the impact on other business operations at the site.

Upon completing the project, a full integrated-system test of the new equipment was undertaken ahead of removing the dividing partition. This test not only covered the function of electrical and mechanical systems but also tested the capability of the cooling to deliver the 30 kW/rack and the target design efficiency. Using rack heaters to simulate load allowed detailed testing to be carried out ahead of the deployment of the new IT technology (see Figure 9).

Figure 9. Testing the 30 kW per rack full load

Results


Phase two was completed in April 2014, and as a result the facility’s power density improved by approximately 50%, with total IT capacity now scalable up to 2.7 MW. This was achieved within the same internal footprint. The facility can now accommodate up to 188 rack positions, supporting up to 30 kW per rack. In addition, the PUE L2,YC of 1.15* was maintained (see Figure 10).

Figure 10. A before and after comparison

The data center upgrade has been hailed as a resounding success, earning PGS and Keysource a Brill Award for Efficient IT from Uptime Institute. PGS is delighted to have the quality of its facility recognized by a judging panel of industry leaders.

Direct and Indirect Cooling Systems
Keysource hosted an industry expert roundtable that provided additional insight and debate on two pertinent cooling topics highlighted by the PGS story. Copies of the resulting whitepapers can be obtained at http://www.keysource.co.uk/data-centre-white-papers.aspx

An organization requiring high availability is unlikely to install a direct fresh air system without 100% backup on the mechanical cooling. This is because the risks associated with the unknowns of what could happen outside, however infrequent, are generally out of the operator’s control.

The density of the IT equipment does not, by itself, favor direct or indirect designs. It is the control of air and the method of air delivery within the space that dictate capacity and air volume requirements. There may be additional considerations in high-density environments for how backup systems and the control strategy for switching between cooling methods work, because of the risk of rapid thermal rise over very short periods, but this comes down to each individual design.

Given the roundtable’s agreement that direct fresh air will require some form of backup system to meet availability and customer risk requirements, it is worth considering the benefits of opting for either a direct or an indirect design.

Partly because of the different solutions in these two areas, and partly because other variables are site specific, there are few clear benefits either way, but several considerations stand out:

• Indirect systems pose little or no risk from external pollutants and contaminants.

• Indirect systems do not require integration into the building fabric, whereas a direct system often needs large ducts or modifications to the shell. This can increase complexity and cost, if, given space or building height, it is achievable at all.

• Direct systems often require more humidity control, depending on which ranges are to be met.

Most efficient systems incorporate some form of adiabatic cooling. With direct systems, there is often a reliance on water to provide capacity rather than simply to improve efficiency. In that case there is a much greater reliance on water for normal operation and for maintaining availability, which can lead to the need for water storage or other measures. The metric of water usage effectiveness (WUE) needs to be considered.
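For reference, The Green Grid defines WUE as annual site water use in liters divided by annual IT energy in kilowatt-hours. A minimal sketch with purely illustrative numbers, not figures from the PGS facility:

# WUE (water usage effectiveness) = annual site water use (L) / annual IT energy (kWh).
# The figures below are assumptions chosen only to show the calculation.

annual_water_liters = 6_000_000                    # assumed adiabatic/evaporative water use
annual_it_energy_kwh = 2_700 * 24 * 365 * 0.6      # assumed 60% average utilization of 2.7 MW

wue = annual_water_liters / annual_it_energy_kwh
print(f"WUE: {wue:.2f} L/kWh")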

Many data center facilities have been built with very inefficient cooling solutions. In such cases, direct fresh air solutions provide an excellent opportunity to retrofit and run as the primary method of cooling, with the existing inefficient systems as backup. Because the backup system is already in place, this is often a very affordable option with a clear ROI.

One of the biggest advantages of an indirect system is the potential for zero refrigeration. Half of the U.S. could take this route, and even places people would never consider, such as Madrid or even Dubai, could benefit. This inevitably requires the use of, and reliance on, large amounts of water, as well as acceptance of elevated server inlet temperatures during warmer periods.


Mike Turff

Mike Turff is global compute resources manager for the Data Processing division of Petroleum Geo-Services (PGS), a Norway-based leader in oil exploration and production services. Mr. Turff is responsible for building and managing the PGS supercomputer centers in Houston, TX; London, England; Kuala Lumpur, Malaysia; and Rio de Janeiro, Brazil, as well as the smaller satellite data centers across the world. He has worked for over 25 years in high-performance computing, building and running supercomputer centers in places as diverse as Nigeria and Kazakhstan, and for Baker Hughes, where he built the Eastern Hemisphere IT Services organization with IT Solutions Centers in Aberdeen, Scotland; Dubai, UAE; and Perth, Australia.

 

Rob Elder

As Sales and Marketing director, Rob Elder is responsible for setting and implementing the strategy for Keysource. Based in Sussex in the United Kingdom, Keysource is a data center design, build, and optimization specialist. During his 10 years at Keysource, Mr. Elder has also held marketing and sales management positions, as well as management roles in the Facilities Management and Data Centre Management Solutions business units.

Digital Realty Deploys Comprehensive DCIM Solution

Examining the scope of the challenge
By David Schirmacher

Digital Realty’s 127 properties cover around 24 million square feet of mission-critical data center space in over 30 markets across North America, Europe, Asia and Australia, and the company continues to grow and expand its data center footprint. As senior VP of operations, it’s my job to ensure that all of these data centers perform consistently: operating reliably, at peak efficiency, and delivering best-in-class performance to our 600-plus customers.

At its core, this challenge is one of managing information. Managing any one of these data centers requires access to large amounts of operational data.

If Digital Realty could collect all the operational data from every data center in its entire portfolio and analyze it properly, the company would have access to a tremendous amount of information that it could use to improve operations across its portfolio. And that is exactly what we have set out to do by rolling out what may be the largest-ever data center infrastructure management (DCIM) project.

Earlier this year, Digital Realty launched a custom DCIM platform that collects data from all the company’s properties, aggregates it into a data warehouse for analysis, and then reports the data to our data center operations team and customers using an intuitive browser-based user interface. Once the DCIM platform is fully operational, we believe we will have the ability to build the largest statistically meaningful operational data set in the data center industry.

Business Needs and Challenges
The list of systems that data center operators report using to manage their data center infrastructure often includes a building management system, an assortment of equipment-specific monitoring and control systems, possibly an IT asset management program and quite likely a series of homegrown spreadsheets and reports. But they also report that they don’t have access to the information they need. All too often, the data required to effectively manage a data center operation is captured by multiple isolated systems, or worse, not collected at all. Accessing the data necessary to effectively manage a data center operation continues to be a significant challenge in the industry.

At every level, data and access to data are necessary to measure data center performance, and DCIM is intrinsically about data management. In 451 Research’s DCIM: Market Monitor Forecast, 2010-2015, analyst Greg Zwakman writes that a DCIM platform, “…collects and manages information about a data center’s assets, resource use and operational status.” But 451 Research’s definition does not end there. The collected information “…is then distributed, integrated, analyzed and applied in ways that help managers meet business and service-oriented goals and optimize their data center’s performance.” In other words, a DCIM platform must be an information management system that, in the end, provides access to the data necessary to drive business decisions.

Over the years, Digital Realty successfully deployed both commercially available and custom software tools to gather operational data at its data center facilities. Some of these systems provide continuous measurement of energy consumption and give our operators and customers a variety of dashboards that show energy performance. Additional systems deliver automated condition and alarm escalation, as well as work order generation. In early 2012 Digital Realty recognized that the wealth of data that could be mined across its vast data center portfolio was far greater than current systems allowed.

In response to this realization, Digital Realty assembled a dedicated and cross-functional operations and technology team to conduct an extensive evaluation of the firm’s monitoring capabilities. The company also wanted to leverage the value of meaningful data mined from its entire global operations.

The team realized that the breadth of the company’s operations would make the project challenging even as it began designing a framework for developing and executing its solution. Neither Digital Realty nor its internal operations and technology teams were aware of any similar development and implementation project at this scale—and certainly not one done by an owner/operator.

As the team analyzed data points across the company portfolio, it found additional challenges. Those challenges included how to interlace the different varieties and vintages of infrastructure across the company’s portfolio, taking into consideration the broad deployment of Digital Realty’s Turn-Key Flex data center product, the design diversity of its custom solutions and acquired data center locations, the geographic diversity of the sites and the overall financial implications of the undertaking as well as its complexity.

Drilling Down
Many data center operators are tempted to first explore what DCIM vendors have to offer when starting a project, but taking the time to gain internal consensus on requirements is a better approach. Since no two commercially available systems offer the same features, assessing whether a particular product is right for an application is almost impossible without a clearly defined set of requirements. All too often, members of due diligence teams are drawn to what I refer to as “eye candy” user interfaces. While such interfaces might look appealing, the 3-D renderings and colorful “spinning visual elements” are rarely useful and can often be distracting to a user whose true goal is managing operational performance.

When we started our DCIM project, we took a highly disciplined approach to understanding our requirements and those of our customers. Harnessing all the in-house expertise that supports our portfolio to define the project requirements was itself a daunting task but essential to defining the larger project. Once we thought we had a firm handle on our requirements, we engaged a number of key customers and asked them what they needed. It turned out that our customers’ requirements aligned well with those our internal team had identified. We took this alignment as validation that we were on the right track. In the end, the project team defined the following requirements:

• The first of our primary business requirements was global access to consolidated data. We required that every single one of Digital Realty’s data centers have access to the data, and we needed the capability to aggregate data from every facility into a consolidated view, which would allow us to compare the performance of various data centers across the portfolio in real time.

• Second, the data access system had to be highly secure, with the ability to limit views based on user type and credentials. More than 1,000 people in Digital Realty’s operations department alone would need some level of data access. Plus, we have a broad range of customers who would also need some level of access, which highlights the importance of data security.

• The user interface also had to be extremely user-friendly. If we didn’t get that right, Digital Realty’s help desk would be flooded with requests on how to use the system. We required a clean navigational platform that is intuitive enough for people to access the data they need quickly and easily, with minimal training.

• Data scalability and mining capability were other key requirements. The amount of information Digital Realty has across its many data centers is massive, and we needed a database that could handle all of it. We also had to ensure that Digital Realty would get that information into the database. Digital Realty has a good idea of what it wants from its dashboard and reporting systems today, but in five years the company will want access to additional kinds of data. We don’t want to run into a new requirement for reporting and not have the historical data available to meet it.

Other business requirements included:

• Open bidirectional access to data that would allow the DCIM system to exchange information with other systems, including computerized maintenance management systems (CMMS), event management, procurement and invoicing systems

• Real-time condition assessment that allows authorized users to instantly see and assess operational performance and reliability at each local data center as well as at our central command center

• Asset tracking and capacity management

• Cost allocation and financial analysis to show not only how much energy is being consumed but also how that translates to dollars spent and saved

• The ability to pull information from individual data centers back to a central location using minimal resources at each facility

Each of these features was crucial to Digital Realty. While other owners and operators may share similar requirements, the point is that a successful project is always contingent on how much discipline is exercised in defining requirements in the early stages of the project—before users become enamored by the “eye candy” screens many of these products employ.

To Buy or Build?
With 451 Research’s DCIM definition—as well as Digital Realty’s business requirements—in mind, the project team could focus on delivering an information management system that would meet the needs of a broad range of user types, from operators to C-suite executives. The team wanted DCIM to bridge the gap between facilities and IT systems, thus providing data center operators with a consolidated view of the data that would meet the requirements of each user type.

The team discussed whether to buy an off-the-shelf solution or to develop one on its own. A number of solutions on the market appeared to address some of the identified business requirements, but the team was unable to find a single solution that had the flexibility and scalability required to support all of Digital Realty’s operational requirements. The team concluded it would be necessary to develop a custom solution.

Avoiding Unnecessary Risk
There is significant debate in the industry about whether DCIM systems should have control functionality—i.e., the ability to change the state of IT, electrical and mechanical infrastructure systems. Digital Realty strongly disagrees with the idea of incorporating this capability into a DCIM platform. By its very definition, DCIM is an information management system. To be effective, this system needs to be accessible to a broad array of users. In our view, granting broad access to a platform that could alter the state of mission-critical systems would be careless, despite security provisions that would be incorporated into the platform.

While Digital Realty and the project team excluded direct-control functionality from its DCIM requirements, they saw that real-time data collection and analytics could be beneficial to various control-system schemas within the data center environment. Because of this potential benefit, the project team took great care to allow for seamless data exchange between the core database platform and other systems. This feature will enable the DCIM platform to exchange data with discrete control subsystems in situations where the function would be beneficial. Further, making the DCIM a true browser-based application would allow authorized users to call up any web-accessible control system or device from within the application. These users could then key in the additional security credentials of that system and have full access to it from within the DCIM platform. Digital Realty believes this strategy fully leverages the data without compromising security.

The Challenge of Data Scale
Managing the volume of data generated by a DCIM is among the most misunderstood areas of DCIM development and application. A DCIM platform collects, analyzes and stores a truly immense volume of data. Even a relatively small data center generates staggering amounts of information—billions of annual data transactions—that few systems can adequately support. By contrast, most building management systems (BMS) have very limited capability to manage significant amounts of historical data for the purposes of defining ongoing operational performance and trends.

Consider a data center with a 10,000-ft2 data hall and a traditional BMS that monitors a few thousand data points associated mainly with the mechanical and electrical infrastructure. This system communicates in near real time with devices in the data center to provide control- and alarm-monitoring functions. However, the information streams are rarely collected. Instead they are discarded after being acted on. Most of the time, in fact, the information never leaves the various controllers distributed throughout the facility. Data are collected and stored at the server for a period of time only when an operator chooses to manually initiate a trend routine.

If the facility operators were to add an effective DCIM to the facility, it would be able to collect much more data. In addition to the mechanical and electrical data, the DCIM could collect power and cooling data at the IT rack level and for each power circuit supporting the IT devices. The DCIM could also include detailed information about the IT devices installed in the racks. Depending on the type and amount desired, data collection could easily require 10,000 points.

But the challenge facing this facility operator is even more complex. In order to evaluate performance trends, all the data would need to be collected, analyzed and stored for future reference. If the DCIM were to collect and store a value for each data point for each minute of operation, it would have more than five billion transactions per year. And this would be just the data coming in. Once collected, the five billion transactions would have to be sorted, combined and analyzed to produce meaningful output. Few, if any, of the existing technologies installed in a typical data center have the ability to manage this volume of information. In the real world, Digital Realty is trying to accomplish this same goal across its entire global portfolio.
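The arithmetic behind the five-billion figure is straightforward, and scaling the same assumption across the portfolio shows why few existing systems can cope. The per-site number below follows directly from the example above; the 100-site extrapolation is an order-of-magnitude assumption, not a Digital Realty figure.

# Per-site data volume from the example above: ~10,000 monitored points sampled
# once per minute, stored for a year.
points = 10_000
samples_per_year = 60 * 24 * 365            # one reading per point per minute
per_site = points * samples_per_year
print(f"Per site: {per_site:,} transactions/year")          # 5,256,000,000

# Scaling the same assumption across a 100+ site portfolio (an assumption used
# only to show the order of magnitude, not a Digital Realty figure) pushes the
# total well past half a trillion transactions per year.
print(f"Across 100 sites: {per_site * 100:,} transactions/year")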

The Three Silos of DCIM
As Digital Realty’s project team examined the process of developing a DCIM platform, it found that the challenge included three distinct silos of data functionality: the engine for collection, the logical structures for analysis and the reporting interface.

Figure 1. Digital Realty’s view of the DCIM stack.

The engine of Digital Realty’s DCIM must reach out and collect vast quantities of data from the company’s entire portfolio (see Figure 1). The platform needs to connect to all the sites and all the systems within those sites to gather information. This challenge requires a great deal of expertise in the communication protocols of these systems. In some instances, accomplishing this goal will require “cracking” data formats that have historically stranded data within local systems. Once collected, the data must be checked for integrity and packaged for reliable transmission to the central data store.
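As a generic illustration of that final check-and-package step (this is not Digital Realty's implementation; the field names and the choice of a SHA-256 digest are assumptions made for the sketch):

# Generic illustration of packaging collected readings for transmission to a
# central data store; not Digital Realty's implementation. Field names and the
# SHA-256 integrity digest are assumptions chosen for the sketch.
import hashlib
import json
import time

def package_readings(site_id: str, readings: dict) -> dict:
    """Bundle a batch of point readings with a timestamp and an integrity digest."""
    body = {
        "site_id": site_id,
        "collected_at": int(time.time()),
        "readings": readings,            # e.g. {"AHU-01/supply_temp_C": 22.1, ...}
    }
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    body["sha256"] = hashlib.sha256(payload).hexdigest()
    return body

def verify(packet: dict) -> bool:
    """Recompute the digest at the central store to confirm nothing was corrupted."""
    received = dict(packet)
    digest = received.pop("sha256")
    payload = json.dumps(received, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest() == digest

packet = package_readings("LHR-01", {"AHU-01/supply_temp_C": 22.1,
                                     "UPS-02/load_kW": 412.7})
assert verify(packet)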

The project team also faced the challenge of creating the logical data structures needed to process, analyze and archive the data once the DCIM has successfully accessed and transmitted the raw data from each location to the data store. Dealing with 100-plus data centers, often with hundreds of thousands of square feet of white space each, increases the scale of the challenge dramatically. The project team overcame a major hurdle in addressing this challenge when it defined relationships between various data categories that allowed the database developers to prebuild and then volume-test data structures to ensure they were up to the challenge.

These data structures, or “data hierarchies” as Digital Realty’s internal team refers to them, are the “secret sauce” of the solution (see Figure 2). Many of the traditional monitoring and control systems in the marketplace require a massive amount of site-level point mapping that is often field-determined by local installation technicians. These points are then manually associated with the formulas necessary to process the data. This manual work is why these projects often take much longer to deploy and can be difficult to commission as mistakes are flushed out.

Figure 2. Digital Realty mapped all the information sources and their characteristics as a step in developing its DCIM.

In this solution, these data relationships have been predefined and are built into the core database from the start. Since this solution is targeted specifically to a data center operation, the project team was able to identify a series of data relationships, or hierarchies, that can be applied to any data center topology and still hold true.

For example, an IT application such as an email platform will always be installed on some type of IT device or devices. These devices will always be installed in some type of rack or footprint in a data room. The data room will always be located on a floor, the floor will always be located in a building, the building in a campus or region, and so on, up to the global view. The type of architectural or infrastructure design doesn’t matter; the relationship will always be fixed.

The challenge is defining a series of these hierarchies that always test true, regardless of the design type. Once defined, the hierarchies can be prebuilt, validated and optimized to handle scale. There are many opportunities for these kinds of hierarchies, and this is exactly what we have done.

Having these structures in place facilitates rapid deployment and minimizes data errors. It also streamlines the dashboard analytics and reporting capabilities, as the project team was able to define specific data requirements and relationships and then point the dashboard or report at the layer of the hierarchy to be analyzed. For example, a single report template designed to look at IT assets can be developed and optimized and would then rapidly return accurate values based on where the report was pointed. If pointed at the rack level, the report would show all the IT assets in the rack; if pointed at the room level, the report would show all the assets in the room, and so on. Since all the locations are brought into a common predefined database, the query will always yield an apples-to-apples comparison regardless of any unique topologies existing at specific sites.
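A minimal sketch of such a hierarchy, and of a report that can be "pointed" at any level of it, might look like the following. The node and asset names are hypothetical, and this is not EnVision's actual data model.

# Minimal sketch of a fixed data hierarchy (asset -> rack -> room -> floor ->
# building -> region -> global) and a report that can be "pointed" at any level.
# Node and asset names are hypothetical; this is not EnVision's actual data model.
from collections import defaultdict

parent = {}                   # child node -> parent node
assets = defaultdict(list)    # rack node -> IT assets installed in it

def add(child, parent_node):
    parent[child] = parent_node

# Build a tiny portfolio.
add("building-NJ1", "region-NA"); add("floor-2", "building-NJ1")
add("room-210", "floor-2"); add("rack-210-A01", "room-210"); add("rack-210-A02", "room-210")
assets["rack-210-A01"] = ["mail-server-01", "db-node-07"]
assets["rack-210-A02"] = ["web-frontend-03"]

def ancestors(node):
    while node in parent:
        node = parent[node]
        yield node

def assets_under(scope):
    """Return every IT asset whose rack sits at or below the given node."""
    found = []
    for rack, items in assets.items():
        if rack == scope or scope in ancestors(rack):
            found.extend(items)
    return found

print(assets_under("rack-210-A01"))   # just that rack's assets
print(assets_under("room-210"))       # everything in the room
print(assets_under("region-NA"))      # rolls up the whole region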

Figure 3. Structure and analysis as well as web-based access were important functions.

Last comes the challenge of creating the user interface, or front end, for the system. There is no point in collecting and processing the data if operators and customers can’t easily access it. A core requirement was that the front end be a true browser-based application. Terms like “web-based” or “web-enabled” are often used in the control industry to disguise the user interface limitations of existing systems. Often, to achieve some of the latest visual and 3-D effects, vendors require the user’s workstation to be configured with a variety of thin-client applications. In some cases, full-blown applications have to be installed. For Digital Realty, installing add-ins on workstations would be impractical given the number of potential users of the platform. In addition, in many cases, customers would reject these installs due to security concerns. A true browser-based application requires only a standard computer configuration, a browser and the correct security credentials (see Figure 3).

Intuitive navigation is another key user interface requirement. A user should need very little training to get to the information they need. Further, the information should be displayed in a way that ensures quick and accurate assessment of the data.

Digital Realty’s DCIM Solution
Digital Realty set out to build and deploy a custom DCIM platform to meet all these requirements. Rollout commenced in May 2013, and as of August, the core team was ahead of schedule in terms of implementing the DCIM solution across the company’s global portfolio of data centers.

The name EnVision reflects the platform’s ability to look at data from different user perspectives. Digital Realty developed EnVision to allow its operators and customers insight into their operating environments and also to offer unique features specifically targeted to colocation customers. EnVision provides Digital Realty with vastly increased visibility into its data center operations as well as the ability to analyze information so it is digestible and actionable. It has a user interface with data displays and reports that are tailored to operators. Finally, it has access to historical and predictive data.

In addition, EnVision provides a global perspective allowing high-level and granular views across sites and regions. It solves the stranded data issue by reaching across all relevant data stores on the facilities and IT sides to provide a comprehensive and consolidated view of data center operations. EnVision is built on an enterprise-class database platform that allows for unlimited data scaling and analysis and provides intuitive visuals and data representations, comprehensive analytics, dashboard and reporting capabilities from an operator’s perspective.

Trillions of data points will be collected and processed by true browser-based software that is deployed on high-availability network architecture. The data collection engine offers real-time, high-speed and high-volume data collection and analytics across multiple systems and protocols. Furthermore, reporting and dashboard capabilities offer visualization of the interaction between systems and equipment.

Executing the Rollout
A project of this scale requires a broad range of skill sets to execute successfully. IT specialists must build and operate the high-availability compute infrastructure that the core platform sits on. Network specialists define the data transport mechanisms from each location.

Control specialists create the data integration for the various systems and data sources. Others assess the available data at each facility, determine where gaps exist and define the best methods and systems to fill those gaps.

The project team’s approach was to create and install the core, head-end compute architecture using a high-availability model and then to target several representative facilities for proof-of-concept. This allowed the team of specialists to work out the installation and configuration challenges and then to build a template so that Digital Realty could repeat the process successfully at other facilities. With the process validated, the program moved on to the full rollout phase, with multiple teams executing across the company’s portfolio.

Even as Digital Realty deploys version 1.0 of the platform, a separate development team continues to refine the user interface with the addition of reports, dashboards and other functions and features. Version 2.0 of the platform is expected in early 2014, and will feature an entirely new user interface, with even more powerful dashboard and reporting capabilities, dynamically configurable views and enhanced IT asset management capabilities.

The project has been daunting, but the team at Digital Realty believes the rollout of the EnVision DCIM platform will set a new standard of operational transparency, further bridging the gap between facilities and IT systems and allowing operators to drive performance into every aspect of a data center operation.


David Schirmacher

David Schirmacher is senior vice president of Portfolio Operations at Digital Realty, where he is responsible for overseeing the company’s global property operations as well as technical operations, customer service and security functions. He joined Digital Realty in January 2012. His more than 30 years of relevant experience include serving as principal and Chief Strategy Officer for FieldView Solutions, where he focused on driving data center operational performance, and as vice president, global head of Engineering for Goldman Sachs, where he focused on developing data center strategy and IT infrastructure for the company’s headquarters, trading floor, branch offices and data center facilities around the world. Mr. Schirmacher also held senior executive and technical positions at Compass Management and Leasing and Jones Lang LaSalle. Considered a thought leader within the data center industry, Mr. Schirmacher is president of 7×24 Exchange International, and he has served on the technical advisory board of Mission Critical.