Empowering the Data Center Professional

February 10, 2015/in Executive/by Kevin Heslin

Five questions to ask yourself, and five to ask your team
By Fred Dickerman

When Uptime Institute invited me to present at Symposium this year, I noticed that the event theme was Empowering the Data Center Professional. Furthermore, I noted that Uptime Institute described the Symposium as “providing attendees with the information to make better decisions for their organizations, and their careers.” The word empowering really caught my attention, and I decided to use the session to discuss empowerment with a peer group of data center professionals. The results were interesting.

If you manage people, chances are you have spent some time asking yourself about empowerment. You may not have used that word, but you would have certainly asked yourself how to make your people more productive, boost morale, develop your people, or address some other issue, which can be related back to empowering people.

If you don’t manage people but there is someone who manages you, you’ve probably thought about some of these issues in relationship to your job. You may have read books on the subject; there are a lot of books out there. You may have gone to the internet; I recently got 5,000,000 results from a Google search on the word empowerment. And many of us have attended at least one seminar on the subject of empowerment, leadership, motivation, or coaching. Empowerment Fire Walk is my favorite such title.

Even so, the act of empowering people is a hit-or-miss activity.

In 1985, the author and management consultant Edward deBono, perhaps best know for originating the concept of lateral thinking, wrote the book Six Thinking Hats (see the sidebar). In that book, DeBono presented an approach to discussion, problem solving, and group interactions, in which he describes six imaginary hats to represent the six possible ways of approaching a question.

The green hat, for instance, focuses on creativity, possibilities, alternatives, and new ideas. It’s an opportunity to express new concepts and new perceptions.

Let’s put our green hats firmly in place to examine the issue of empowerment. We can start by creating a distinction between two different types of questions. The first questions are the everyday questions, such as “Where should we go for lunch?, which UPS system provides the best value for investment?, and should we build our own data center or lease?” We value the answers to these questions. Everyday questions can be trivial or very important, with big consequences sometimes resting on the answers.

We derive value from answers to the other question type but also from the question itself. We can call these questions inquiries.

Focus is the difference between everyday questions and inquiries. The answer is the focus of the everyday question. Once we have an answer to an everyday question, we tend to forget the question altogether and base our next actions on the answer. After the question is forgotten, other possible answers are also forgotten.

On the other hand, if you are engaged in an inquiry, you continue to focus on the question—even after you have an answer. The inquiry leads to other answers and tests the answers you have developed. The inquiry continues to focus on the question. As Isaac Newton said, “I keep the subject of my inquiry constantly before me, and wait till the first dawning opens gradually, by little and little, into a full and clear light.”

In addition to having more than one answer (and usually no single right answer), other characteristics of the inquiry are:

• The answers rarely come easily or quickly

• The questions tend to be uncomfortable

• Answers to the initial question often lead to other questions
Thorstein Veblen said, “The outcome of any serious research can only be to make two questions grow where only one grew before.”

Our question, “How do you empower people,” pretty clearly falls into the category of inquiry. Before we start to propose some of the answers to our empowerment inquiry, let’s introduce one more idea.

An Example
What is the difference between knowing how to ride a bicycle and being able to ride a bicycle? If you were like most children, you probably first decided to learn to ride a bicycle. Then you probably gathered some information, perhaps you read an article on how to ride a bicycle watched other people ride bicycles, or had a parent who helped.

After gathering information, you may have felt ready to ride a bicycle, but if you were speaking more precisely, you could really only claim to know how to ride a bicycle in theory. Then, you got on the bicycle for the first time and fell off. And, if your focus during that first attempt was on the instructions you obtained from your parents or other expert source, you were almost certain to fall off the first few times. Eventually, with patience and commitment, you finally learned how to ride the bicycle without falling off.

First, once you have ridden the bicycle, you will probably always be able to ride a bicycle. We use the phrase “like riding a bicycle” to describe any skill or ability which, once learned, seems to never be forgotten.

Second, if you were trying to teach someone else how to ride a bicycle, you probably could not find the words to convey what happens when you learn how to balance on the bicycle and ride it. In other words, even though you have successfully been able to ride the bicycle, it is very unlikely that you will be able to add anything useful to the instructions you obtained.

Third, making that shift from knowing how to being able to is likely to be unique to the basic skill of riding the bicycle. That is, learning to ride backward, on one wheel, or without holding the handlebars will require repeating the same process of trying and failing.

The hundreds or thousands of books, seminars, and experts providing knowledge on how to empower people cannot teach anyone the skill of empowering others. When you have empowered someone, you probably will not be able to put into words what you did to produce that result, at least not so that a person reading or hearing those words will gain the ability to empower others.

The ability to empower is probably not transferable. Empowering someone in different circumstances will require repeating the learning process.

You can probably see that this article is not going to teach you how to empower people. But if you are committed to empowering people and are willing to fall off the bicycle a few times, you will have new questions to ask.

Giving Power
The best way to empower someone is to give them power, which is easy if you are royalty. In medieval times, when monarchs had almost unlimited power, the empowerment was also very real and required only a prescribed ceremony to transfer power to the person.

Not surprisingly, the first step to empowering others still requires transferring official authority or power.

Managers today wrestle with how much authority and autonomy to give to others. In an article entitled “Management Time, Who’s Got the Monkey?” (Harvard Business Review, Nov.-Dec. 1974, reprint #99609), William Oncken and Donald Wass describe a new manager who realizes that the way he manages his employees does not empower them.

Every time one of the employees would bring a problem to the manager’s attention, the way the employee communicated the problem (in Oncken’s analogy a “monkey”) and the manager’s response to the problem would cause the problem to become the manager’s responsibility. Recognizing the pattern enabled the manager to change the way he interacted with his employees. He stopped taking ownership of the problems, instead supporting their efforts to resolve the problems by themselves. While Oncken and Wass focus mainly on time management for managers, the authors also deal with the eternal question for managers of delegation of authority, with five levels of delegation from subordinate waits to be told what to do (minimal delegation) to subordinate acts on his own initiative and reports at set intervals (maximum delegation)

They write, “Your employees can exercise five levels of initiative in handling on-the-job problems. From lowest to highest, the levels are:

•Wait until told what to do.

• Ask what to do.

• Recommend an action, and then with your approval, implement it.

• Take independent action, but advise you at once.

•Take independent action and update you through routine procedure.”

Symposium literature suggests another way to empower people: provide information. Everyone has heard the expression “Give a man a fish, he can eat for a day; teach a man to fish, he can eat for a lifetime.” To be more precise, we should say that if you teach a person to fish (the information transfer) and that person realizes the potential of this new ability, that person can eat for a lifetime.

The realization that fish are edible might be the person’s eureka moment. The distinction between teaching the mechanics of fishing and transferring the concept of fishing is subtle, but critical. The person who drops a hook in the water might catch fish occasionally, just as a person who gets on bicycle occasionally might cover some distance; however, the first person is not a fisherman and the second is not a bicyclist.

Besides delegating authority or providing information, there are other answers to our empowerment inquiry, such as providing the resources necessary to accomplish some objective or creating a safe environment to work, and let’s not forget constructive feedback and criticism.

An engineer who designs safety features into a Formula 1 car empowers the driver to go fast, without being overly concerned about safety. The driver is empowered by making the connection between the safety features and the opportunity to push the car faster.

A coach who analyzes a play so that the quarterback begins to see all of the elements of an unfolding play as a whole rather than as a set of separate actions and activities empowers the quarterback to see and seize opportunities while another quarterback is still thinking about what to do (and getting sacked).

These answers have some things in common:

Whether the actions or information is empowering is not inherent in those actions or the information. We know that because exactly the same actions, the same information, and the same methodologies can result in empowerment or no empowerment. If empowerment were inherent in some action or information, empowerment would result every time the action was taken.

Empowerment may be a result of the action or process, but it is not a step to be taken.

While it is almost certainly true that you cannot empower someone who does not want to be empowered, it is equally true that the willingness and desire to be empowered does not guarantee that empowerment will result from an interaction.

Now let’s talk about one more approach to empowerment. I had a mentor early in my career who liked to start coaching sessions by telling a joke. The setup for the joke was: “What does an American do when you ask him a question?” The punch line, “He answers it!”

Our teams reacted with dead silence. Our boss would assure us that the joke was considered very funny in Europe and that it was only our American culture that caused us to miss the punch line. Since he was the boss, we never argued. I’ve never tried the joke on my Russian friends, so I can’t confirm that they see the humor either.

The point, I think, was that Americans are culturally attuned to answers, not questions. Our game shows are about getting one right answer, first. Only Jeopardy, the television game show, pretends to care about the questions, but only as a gimmick.

When an employee experiences success and the boss asks about the approaches that worked when acknowledging that success, the boss is trying to empower that employee. The positive reinforcement and encouragement can cause the employee to go to look for the behaviors that contributed to that success and allow the employee to be successful in other areas. Empowerment comes when the employee takes a step back and finds an underlying behavior or way of looking at an issue or opportunity in the future.

Questions are especially empowering because they:

• Are non-threatening as opposed to direct statements

• Generate thought versus defense

• May lead to answers which neither the manager or the employee considered

•May lead to another question that empowers both you and your employee

• My presentation at the Symposium evolved as I prepared it. It was originally titled “Five Questions
Every Data Center Manager Should Ask Every Year.” As I have been working on using questions to
empower others (and myself) for a long time, the session was a great way for me to further that inquiry.

The five questions that I am currently living with are:

• Who am I preparing to take my job?

• How reliable does my data center need to be?

• What are the top 100 risks to my data center?

• What should my data center look like in 10 years?

• How do I know I am getting my job done?

And here are five questions that I have found to empower the people who work for me:

• What worked? In your recent success or accomplishment, what contributed to your success?

•What can I do as your boss to support your being more productive?

• What parts of your job are you ready to train others to do?

• What parts of my job would you like to take on?

• What training can we offer to make you more productive or prepare you to move up in our company?

These are examples of the kinds of inquiries that can be empowering, questions that you can ask yourself or someone else over and over again, possibly coming up with a different answer each time you ask. There may not be one right answer, and the inquiries maybe uncomfortable because they challenge what we already know.

G. Spencer Brown said, “To teach pride in knowledge is to put up an effective barrier against any advance upon what is already known, since it makes one ashamed to look beyond the bonds imposed by one’s ignorance.”

Whether your approach to empowering people goes through delegation, teaching, asking questions, or any of the other possible routes, keep in mind that the real aha moment comes from your commitment to the people who you want to empower. Johann Wolfgang Von Goethe said, “Until one is committed, there is hesitancy, the chance to draw back, always ineffectiveness. Concerning all acts of initiative and creation, there is one elementary truth the ignorance of which kills countless ideas and splendid plans: that the moment one definitely commits oneself, then providence moves too. All sorts of things occur to help one that would never otherwise have occurred. A whole stream of events issues from the decision, raising in one’s favor all manner of unforeseen incidents, meetings, and material assistance which no man could have dreamed would have come his way. Whatever you can do or dream you can, begin it. Boldness has genius, power, and magic in it. Begin it now.”

Six Thinking Hats

Six Thinking Hats, (From the web site of the de Bono Group, LLC.)

Well-known management consultant Edward de Bono wrote a book called Six Thinking Hats (Little, Brown and Company, 1985). His purpose was to empower people to resolve problems by giving them a mechanism to examine those problems in different ways—from different points of view. De Bono suggested that people imagine that they have six hats of different colors, with each hat representing a different approach or point of view to the problem.

The white hat calls for information known or needed. “The facts, just the facts.”

The yellow hat symbolizes brightness and optimism. Under this hat you explore the positives and probe for value and benefit.

The black hat is judgment, the devil’s advocate or why something may not work. Spot the difficulties and dangers; where things might go wrong. The black hat is probably the most powerful and useful of the hats but can be overused.

The red hat signifies feelings, hunches, and intuition. When using this hat you can express emotions and feelings and share fears, likes, dislikes, loves, and hates. The green hat focuses on creativity: the possibilities, alternatives, and new ideas. It’s an opportunity to express new concepts and new perceptions.

The blue hat is used to manage the thinking process. It’s the control mechanism that ensures the Six Thinking Hats guidelines are observed.

Fred Dickerman

Fred Dickerman is vice president, Data Center Operations for DataSpace. In this role, Mr. Dickerman oversees all data center facility operations for DataSpace, a colocation data center owner/operator in Moscow, Russian Federation. He has more than 30 years experience in data center and mission-critical facility design, construction, and operation. His project resume includes owner representation and construction management for over US$1 billion in facilities development, including 500,000 square feet (ft2) of data center and 5 million ft2 of commercial properties. Prior to joining DataSpace, Mr. Dickerman was the VP of Engineering and Operations for a Colocation Data Center Development company in Silicon Valley.

Sun Life Plots Data Center Facilities and Operations Roadmap to Meet Application Demands

February 3, 2015/in Design, Operations/by Kevin Heslin

An imminent return of service causes a top-to-bottom examination of data center facilities and operations
By Rocco Alonzi and Paolo Piro

When Sun Life Financial completed a return of service for its data center operations in 2011, the Enterprise Infrastructure (IE) and Corporate Real Estate (CRE) teams immediately saw an opportunity to improve service stability, back-up capacity, and efficiency.

Focusing initially on the physical facility itself, Sun Life considered what it needed to do to improve its current situation. The options included upgrading its existing primary facility in Waterloo, ON, Canada, purchasing a new facility, and partnering with an existing facility in either a colocation or partnership scenario.

Sun Life scored the options on four main criteria: cost, time to market, interruption, and involvement. In the end, Sun Life decided that upgrading its existing Waterloo-King facility was the best option because the upgrade was the most cost effective, the least interruptive, and the most suitable fit for the organization. The plan resulted in design implementations and organizational improvements that ultimately led to Sun Life’s first Uptime Institute Tier III Certified Constructed Facility.

Achieving this milestone was no small feat. The Waterloo-King facility was fully operational during the upgrades and improvements. The facility hosted the majority of Sun Life’s primary applications, and the site already had many connections and linkages to other facilities, with international implications for the company. In the end, the Waterloo-King facility was transformed into a Concurrently Maintainable Tier III facility with all the redundancy that comes with the designation. The transformation was completed over a relatively short period with zero major outages.

The Decision
The decision to return service from an application outsourcing arrangement back to Sun Life prompted the organization to review its capabilities. Once the decision was communicated, the Enterprise Infrastructure branch (responsible for supporting the services) quickly began to analyze the requirements of the return of service and potential gaps that might impact service.

The Enterprise Infrastructure leadership team led by the Data Center Operations (DCO) assistant vice president (AVP) shouldered the responsibility of ensuring the sufficiency of the data center critical facility and the organization. The DCO reviewed current capabilities, outlined options, and developed an improvement plan. The decision was to upgrade the facility and create an environment that supported a Concurrently Maintainable and fully redundant operation.

To facilitate this transformation, Sun Life assembled a team of stakeholders to lay groundwork, manage responsibility, and execute the pieces to conclusion. The team led by the DCO AVP primarily comprised personnel from Corporate Real Estate (CRE) Facilities and Data Center Operations. Within the Sun Life Financial organization, these two teams have the greatest vested interest in the data center, and both are directly involved in the care and feeding of the facility.

Once the team was assembled, it began working on a mandate that would eventually describe the desired end goal. The goal would be to create a facility that was mechanically and electrically redundant and an organizational support structure that was operationally viable. The organization described in the mandate would be able to support the many functions required to run a critical facility and impose governance standards and strategies to keep it running optimally for years to come.

Data Center Due Diligence Assessment
To help guide the organization through this process, the DCO AVP contracted the services of the Uptime Institute to provide a Data Center Due Diligence Assessment analysis report. The report ultimately formed the basis of Sun Life’s roadmap for this journey.

Once the Data Center Due Diligence Assessment was complete, Uptime Institute presented its findings to the DCO AVP, who then reviewed the report with the CRE AVP and quickly identified opportunities for improvement. Using the Data Center Due Diligence Assessment and a structural assessment from another vendor, Sun Life’s team quickly isolated the critical areas and developed a comprehensive master plan.

These opportunities for improvement would help the team generate individual activities and project plans. The team focused on electrical, mechanical, and structural returns. The tasks the team developed included creating electrical redundancy, establishing dual-path service feeds, adding a second generator path to create a completely separate emergency generation system, hardening the structural fabric, and replacing the roof waterproofing membrane located above the raised floor.

With the team having identified infrastructure concerns, it then shifted its focus to organizational effectiveness and accountabilities. Sun Life used a review of operational processes to close organizational gaps and meet challenges to strengthen accountabilities, responsibilities, and relationships. Changes were necessary, not only during the transformation process but also post implementation, when the environment became fully operational and would require optimal and efficient support and maintenance.

The team needed to establish clear organizational delineation of responsibilities, and establish strong communication links between the DCO and CRE so the data center support structure would function as a single unit. Under the leadership of the DCO AVP with support from CRE, Sun Life established a Data Center Governance branch to help meet this requirement. Every aspect of the day-to-day care and feeding of the facility was discussed, reviewed, and then approved for implementation, with the establishment of a clear demarcation between CRE and DCO support areas, based on the Responsible, Accountable, Consulted and Informed (RACI) model. Figure 1 is a graphical example of Sun Life’s final delineation model.

Figure 1. Overview of Sun Life’s Final RACI model
(responsibilities assigned to the CRE and DCO groups)

IT Technology
For the last step, the spotlight moved to IT Technology. The Data Center Governance team (by direction of the DCO AVP) reviewed the existing standards and policies. The team wrote and communicated new policies as required. Adherence to these policies would be strictly enforced, with full redundancy of the mechanical and electrical environment right down to the load level being the overarching goal. Establishment and enforcement of these rules follows the demarcation between CRE and DCO.

Roadmap action items were analyzed to determine grouping and scheduling. Each smaller project was initiated and approved using the same process–stakeholder approval (i.e., DCO and CRE AVP) had to be obtained before any project was allowed to proceed through the organization’s change and approval process. The team would first assess the change for risk and then for involvement and impact before allowing it to move forward for organizational assessment and approval. The criteria for approving these mechanical and electrical plans were based on “other” involvement and “other” commitment. The requirements of impacted areas of the organization (other than the DCO and CRE areas) would drive the level of analysis that a particular change would endure. Each project and activity was reviewed and scrutinized for individual merit and overall value-add. Internal IT Information Library (ITIL) change management processes were then followed. Representatives from all areas of the organization were given the opportunity to assess items for involvement and impact, and the change teams would assess the items for change window conflicts. Only after all involved areas were satisfied would the project be submitted for final ITIL approval and officially scheduled.

The following list provides a high-level summary of the changes that were completed in support of Sun Life’s transformation. Many were done in parallel and others in isolation; select changes required full organizational involvement or a data center shutdown.

• Added a 13.8-kilovolt-ampere (kVA) high-voltage hydro feed from a local utility

• Added a second electrical service in the parking garage basement

• Completed the construction of a separate generator room and diesel storage tank room in the parking garage basement t0 accommodate the addition of a 2-megawatt diesel generator, fuel storage tanks, and fuel pumps

• Introduced an underground 600-volt (V) electrical duct bank to data center

•Reconfigured the data center electrical room into two distinct sides

• Replaced old SP1 switchboard in data center electrical room

• Added a backup feed from the new electrical service for main building

• Replaced existing UPS devices

• Installed an additional switch between the new generator and the switchgear to connect the load bank

• Installed an additional switch on each electrical feed providing power to the UPS system for LAN rooms

• Upgraded existing generators to prime-rated engine generators

• Replaced roof slab waterproofing membrane above the data center (see Figure 2)

• Created strategies to mitigate electrical outages

Figure 2. (Above and Below) Waterproof membrane
construction above raised floor

Teamwork
Teamwork was essential to the success of these changes. Each of the changes required strong collaboration, which was only possible because of the strong communication links between CRE and DCO. The team responsible for building the roadmap that effectively guided the organization from where it was to where it needed to be had a full understanding of accountabilities and responsibilities. It was (and still is) a partnership based on a willingness to change and a desire to move in the right direction. The decision to add controls to a building UPS is a good example of this process. Since one of the critical facility UPS units at the Waterloo facility supports part of the general building as well as the critical facility, a control needed to be put in place to ensure compliance, agreement, and communication. Although the responsibility to execute the general building portion falls solely on CRE, a change to this environment could have an impact on the data center and therefore governance was required. Figure 3 shows a process that ensures collaboration, participation, and approval across responsibilities.

Figure 3. Building UPS process

To achieve this level of collaboration, the focus needed to switch to the organizational commitments and support that was fostered during this process. Without this shift in organizational behavior, Sun Life would not have been able to achieve the level of success that it has—at least not as easily or as quickly. This change in mind set helped to change the way things are planned and executed. CRE and DCO work together to plan and then execute. The teamwork ensured knowledge and understanding. The collaboration removed barriers so the teams were able to develop a much broader line of sight (bird’s-eye view) when considering the data center.

Delineation of responsibilities was clearly outlined. DCO assumed full accountability of all changes relating to the raised floor space while CRE Critical Facilities managed all electrical and mechanical components in support of the data center. The DCO team reported through the technology branch of the organization while the Critical Facilities team reported up through the CRE branch of the organization. Overall accountability for the data center rested on DCO with final approval and ultimate ownership coming from DCO AVP.

During the planning phase of this transition, both sides (CRE and DCO) quickly realized that processing changes in isolation was not an effective or efficient approach and immediately established a strong collaborative tie. This tie proved to be critical to the organization’s success as both teams and their respective leaders were able to provide greater visibility, deliver a consistent message, and obtain a greater line of sight into potential issues. All of which helped to pave the way for easier acceptance, greater success, and fewer impacts organization wide. As preparations were being made to schedule activities, the team was able to work together and define the criteria for go/no-go decisions.

Documenting the Process
Once individual projects were assessed and approved, the attention turned to planning and execution. In cases where the activity involved only the stakeholder groups (CRE and DCO), the two groups managed the change/implementation in isolation. Using method of procedures (MOPs) provided by the vendor performing the activity kept the team fully aware of the tasks to be completed and the duration of each task. On the day of the change, communication was managed within the small group and executives were always kept informed. Activity runbooks were used in cases where the activity was larger and involvement was much broader. These runbooks contained a consolidation of all tasks (including MOPs), responsibilities assigned to areas and individuals, and estimated and tracked duration per step. The MOPs portion of the runbook would be tagged to CRE, and although the specific steps were not itemized, as they were only relevant to CRE and DCO, the time duration required for the MOP was allotted in the runbook for all to see and understand (See Figure 4). In these larger, more involved cases, the runbooks helped to ensure linkages of roles and responsibilities, especially across Facilities and IT, to plan the day, and to ensure that all requirements and pre-requisites were aligned and clearly understood.

Figure 4. Example of an electrical enhancement shutdown schedule

Compiling these runbooks required a great deal of coordination. Once the date for the activity was scheduled, the DCO team assumed the lead in developing the runbook. At its inception, the team engaged areas of impact and began to document a step-by-step MOP that would be used on the day of the change. Who was required? Who from that team would be responsible? And how much time each task would take? The sum of which provided the overall estimate for how long the proposed activity would take. Several weeks prior to the actual change, dry runs of the runbook were scheduled to verify completeness of the approach. Final signoff was always required before any change was processed for execution. Failure to obtain signoff resulted in postponement or cancellation of the activity.

On the day of the activities, tasks were followed as outlined. Smaller activities (activities that only required DCO and Facilities involvement) were managed within the Facilities area with DCO participation. Larger activities requiring expanded IT coordination were managed using a command room. The room was set up to help facilitate the completion of the tasks in the order outlined in the runbook. The command room offered the coordination and collaboration. The facilitators (who were always members of the DCO team) were able to use the forum to document issues that would arise, assess impact, and document remediation. The information, post implementation, was then used to investigate and resolve for future runbook creations. The command room served as a focal point for information and status updates for the entire organization. Status updates would be provided at predetermined intervals. Issues were managed centrally to ensure that coordination and information was consistent and complete. The process was repeated for each of the major activities, and in the end, as far as Sun Life’s transformation goes, all changes were executed as planned, with complete cooperation and acceptance by all involved.

System cutovers were completed without major issues and without outages. Interruptions of applications were, for the most part expected or known. Outages were typically caused by technical limitations at the load level such as single-corded IT hardware or system limitations. Outages of single-corded equipment were minimized as systems were restored once power was fed from a live source. For outages caused by system limitations, arrangements had been made with the business client to shut down the service for the duration of the change. System was restored when the change was complete. In the rare circumstance when a minor outage did occur, the support group, which was on site, investigated immediately and determined the root cause to be localized to the IT equipment. The issues were faulty power supplies or IT hardware configuration errors. These issues, although not related to the overall progress or impact of the activity itself, were documented, resolved, and then added to future runbooks as pre-implementation steps to be completed.

DCO’s governance and strategy framework (see Figure 5) served as the fundamental component that would define authority, while employing controls. These controls ensured clarity, impact, risk, and execution during each of the planning and execution phases and have continued to evolve well into the support phase.

Figure 5. Overview of governance model

A new RACI model was developed to help outline the delineation between CRE and DCO in the data center environment. The information, which was built by DCO in collaboration with CRE, was developed in parallel to the changes being implemented. Once the RACI model was approved, the model became the foundation for building a clear understanding of the responsibilities within the data center.

During the planning phase, the collaboration between these two areas facilitated the awareness needed for establishing proper assessment of impact. As a result, the level of communication and amount of detail provided to the change teams was much more complete. The partnership fostered easier identification of potential application/infrastructure impacts. During the execution phase, management of consolidated implementation plans, validation and remediation, as well as the use of runbooks (with documented infrastructure/application shutdown/startup procedures), provided the necessary transparency that was required across responsibilities to effectively manage cutovers and building shutdowns with no major impact or outage.

Results
Several milestones had to be achieved to reach all these goals. The entire facility upgrade process from the point when funding was approved took approximately 18 months to complete. Along this journey, there were a number of key milestones that needed to be negotiated and completed. To help understand how Sun Life was able to complete each task, lead times are shown following.

• Description of task (duration) – Lead time

• Contract approvals (2 months) – 18 months

• Construction of two new electrical rooms, installation of one new UPS and installation of generator and fuel system (2 months) – 16 months

• Validation, testing and verification (1 month) – 14 months

• Assemble internal organizational team to define application assessment (1 months) – 9 months

• Initial communication regarding planned cutover (1 month) – 9 months

• Validate recommended cutover option with application and infrastructure teams (1 months) – 8 months

• Remediate application and infrastructure in advance of cutover (2 months) – 7 months

• Define and build cutover weekend governance model (3 months) – 7 months

• Define and build cutover sequence runbook (3 months) – 7 months

• Data center electrical upgrade complete – Tier III Certification of Constructed Facility

In the end, after all the planning, setups, and implementations, all that remained was validation that all the changes were executed according to design. For the Facility Certification, Uptime Institute provided a list of 29 demonstrations covering activities from all aspects of the mechanical and electrical facility. The same team of representatives from CRE and DCO reviewed each demonstration and analyzed them for involvement and impact. The Sun Life team created individual MOPs, and grouped these for execution based on the duration of the involvement required. These activities took place across 3 days. Runbooks were created and used throughout each of the groupings. Areas required were engaged. On the demonstration weekend, CRE and DCO resources worked together to process each demonstration, one by one, ensuring and validating the success of the implementation and the design. The end result was Tier III Certification of Constructed Facility (See Figure 6).

Figure 6. Example of Tier III Constructed Facility testing schedule (per demonstration code)

Sun Life Financial received its Tier III Design Documents Certification in May 2013, and then successfully demonstrated all items required over the first weekend in November to receive Tier III Certification of Constructed Facility on November 8, 2013. The journey was not an easy one.

Figure 7. Power distribution overview (before and after)

In summary, Sun Life Financial transformed its primary operational data center facility (See Figure 7) within 18 months at a cost of approximately US$7 million (US$3.4 million) allocated to electrical contractor work and materials, US$1.2 million for waterproof roof membrane work, US$1.5 million for environmental upgrades and the addition of a new generator, US$900,000 for other costs and the remainder for project management and other minor improvements). The success of this transformation was possible in large part due to the collaboration of an entire organization and the leadership of a select few. The facility is now a Tier III Constructed Facility that is Concurrently Maintainable and optimally supported. Through Certification, Sun Life now has a much more positive position to manage the ever increasing demands of critical application processing.

Figure 8. Before and after summary

Rocco Alonzi

Rocco Alonzi has worked in the data center environment for the past 10 years, most recently as the AVP of Data Center Operations at Sun Life. He helped develop and implement the strategies that helped Sun Life achieve Tier III Certification of Constructed Facility. Prior to joining Sun Life, Mr. Alonzi worked for a large Canadian bank. During his 15 years there, he held many positions, including manager of Data Center Governance, where he was responsible for developing a team responsible for securing, managing, and maintaining the bank’s raised-floor environment. As a member of the Uptime Institute Network, Mr. Alonzi has strongly advocated the idea that IT and M&E must be considered as one in data center spaces.

Paolo Piro

Paolo Piro joined Sun Life in May of 2013 as a senior data center governance analyst, to help establish a governance framework and optimize organizational processes relating to Sun Life’s data centers. Prior to joining Sun Life, Mr. Piro worked 25 years at a large Canadian bank. In 2004, he became involved in data centers, when he became responsible, as a team lead, for establishing governance controls, implementing best practices, and optimizing the care and feeding of the data center raised floor. In 2011, he was able to increase his exposure and knowledge in this space, by taking on the role of data center manager, where for the next 2 years, he managed a team of resources and a consolidated budget allocated for maintaining and caring for the raised floor environment.

RagingWire’s Jason Weckworth Discusses the Execution of IT Strategy

January 22, 2015/in Executive/by Kevin Heslin

In this series, Uptime Institute asked three of the industry’s most well recognized and innovative leaders to describe the problems facing enterprise IT organizations. Jason Weckworth examined the often-overlooked issue of server hugging; Mark Thiele suggested that service offerings often did not fit the customer’s long-term needs; and Fred Dickerman found customers and providers at fault.

Q: What are the most prevalent misconceptions hindering data center owners/operators trying to execute the organization’s IT strategy, and how do they resolve these problems?

Jason Weckworth: As a colocation provider, we sell infrastructure services to end users located throughout the country. The majority of our customers reside within a 200-mile radius. Most IT end users say that they need to be close to their servers. Yet remote customers, once deployed, tend to experience the same level of autonomy and feedback from their operations staffs as those who are close by. Why does this misconception exist?

We believe that the answer lies in legacy data center services vs. the technology of today’s data centers with the emergence of DCIM platforms.

The Misconception: “We need to be near our data center.”
The Reality: “We need real-time knowledge of our environment with details, accessibility, and transparent communication.”

As a pure colocation provider (IaaS), we are not in the business of managed services, hosting, or server applications. Our customers’ core business is IT services, and our core business is infrastructure. Yet they are so interconnected. We understand that our business is the backbone for our customers. They must have complete reliance and confidence in everything we touch. Any problem we have with infrastructure has the potential to take them off-line. This risk can have a crippling effect on an organization.

The answer to remote access is remote transparency.

Data Center Infrastructure Management (DCIM) solutions have been the darlings of the industry for two years running. The key offering, from our perspective, is real-time monitoring with detailed customization. When customers can see their individual racks, circuits, power utilization, temperature, and humidity, all with real-time alarming and visibility, they can pinpoint their risk at any given moment in time. In our industry, seconds and minutes count. Solutions always start with first knowing if there is a problem, and then, by knowing exactly the location and scope of that problem. Historically, customers wanted to be close to their servers so that they could quickly diagnose their physical environment without having to wait for someone to answer the phone or perform the diagnosis for them. Today, DCIM offers the best accessibility.

Remote Hands and Eyes (RHE) is a physical, hands-on service related to IT infrastructure server assets. Whether the need is a server reboot, asset verification, cable connectivity, or tape change, physical labor is always necessary in a data center environment. Labor costs are an important consideration of IT management. While many companies offer an outsourced billing rate that discourages the use of RHE as much as possible, we took an insurance policy approach by offering unlimited RHE for a flat monthly fee based on capacity. With 650,000 square feet (ft2) of data center space, we benefit greatly from scaling the environment. While some customers need a lot of services one month, others need hardly any at all. But when they need it, it’s always available. The overall savings of shared resources across all customers ends up benefitting everyone.

Customers want to be close to their servers because they want to know what’s really going on. And they want to know now. “Don’t sugarcoat issues, don’t spin the information so that the risk appears to be less than reality, and don’t delay information pending a report review and approval from management. If you can’t tell me everything that is happening in real time, you’re hiding something. And if you’re hiding something, then my servers are at risk. My whole company is at risk.” As the data center infrastructure industry has matured over the past 10 years, we have found that customers have become much more technical and sophisticated when it comes to electrical and mechanical infrastructure. Our solution to this issue of proximity has been to open our communication lines with immediate and global transparency. Technology today allows information to flow within minutes of an incident. But only culture dictates transparency and excellence in communication.

As a senior executive of massive infrastructure separated across the country on both coasts, I try to place myself in the minds of our customer. Their concerns are not unlike our own. IT professionals live and breathe uptime, risk management, and IT capacity/resource management. Historically, this meant the need to be close to the center of the infrastructure. But today, it means the need to be accessible to the information contained at the center. Server hugging may soon become legacy for all IT organizations.

Jason Weckworth

Jason Weckworth is senior vice president and COO, RagingWire Data Centers. He has executive responsibility for critical facilities design and development, critical facilities operations, construction, quality assurance, client services, infrastructure service delivery, and physical security. Mr. Weckworth brings 25 years of data center operations and construction expertise to the data center industry. Previous to joining RagingWire, he was owner and CEO of Weckworth Construction Company, which focused on the design and construction of highly reliable data center infrastructure by self-performing all electrical work for operational best practices. Mr. Weckworth holds a bachelor’s degree in Business Administration from the California State University, Sacramento.

Mark Thiele from Switch examines the options in today’s data center industry

January 20, 2015/in Executive/by Kevin Heslin

In this series, three of the industry’s most well-recognized and innovative leaders describe the problems facing enterprise IT organizations today. In this part, Switch’s Mark Thiele suggests that service offerings often don’t fit customer’s long-term needs.

Customers in the data center market have a wide range of options. They can choose to do something internally, lease retail colocation space, get wholesale colocation, move to the cloud, or all of the above. What are some of the more prevalent issues with current market choices relative to data center selection?

Mark Thiele: Most of the data center industry tries to ﬁt customers into status-quo solutions and strategies, doing so in many cases simply because, “Those are the products we have and that’s the way we’ve always done it.” Little consideration seems to be given to the longer-term business and risk impacts of continuing to go with the ﬂow in today’s rapidly changing innovation economy.

The correct solution can be a tremendous catalyst, enabling all kinds of communication and commerce, and the wrong solution can be a great burden for 5-15 years.

In addition, many data center suppliers and builders think of the data center as a discrete and detached component. Its placement, location, and ownership strategy have little to do with IT and business strategies. The following six conclusions are drawn from conversations Switch is having on a daily basis with technology leaders from every industry.

Data centers should be purpose built buildings. A converted warehouse with sky light penetrations and a wooden roof deck isn’t a data center. It’s a warehouse to which someone has added extra power and HVAC. These remodeled wooden-roof warehouses present a real risk for the industry because thousands of customers have billions of dollars’ worth of critical IT gear sitting in these converted buildings where they are expecting their provider to be protecting them at elite mission critical levels. A data center is by its very nature part of your critical infrastructure; as such it should be designed from scratch to be a data center that can actually offer the highest levels of protection from dangers like ﬁre and weather.

A container data center is not a foundational solution for most businesses but can be a good solution for speciﬁc niche opportunities (disaster support, military, extreme-scale homogeneous environments, etc.). Containers strand HVAC resources. If you need more HVAC in one container than another you cannot just share it. If a container loses HVAC, all the IT gear is at risk even though there may be millions of dollars of healthy HVAC elsewhere.

The data center isn’t a discrete component. Data centers are a critical part of your larger IT and enterprise strategies, yet many are still building and/or selling data centers as if they were just a real estate component.

One of the reasons that owning a data center is a poor fit for many businesses is that it is hard to make the tight link needed between a company’s data center strategy and its business strategy. It’s hard to link the two when one has a 1- to 3-year life (business strategy) and the other has a 15- to 25-year life (data center).

The modern data center is the center of the universe for business enablement and IT readiness. Without a strong ecosystem of co-located partners and suppliers, a business can’t hope to compete in the world of the agile enterprise. We hear from customers every day that they need access to a wide range of independently offered technology solutions and services that are on premises. Building your own data center and occupying it alone for the sake of control isolates your company on an island away from all of the partners and suppliers that might otherwise easily assist in delivering successful future projects. The possibilities and capabilities of locating in an ultra-scale multi-company technology ecosystem cannot be ignored in the innovation economy.

Data centers should be managed like manufacturing capacity. Like a traditional manufacturing plant, the modern data center is a large investment. How eﬀectively and eﬃciently it’s operated can have a major impact on corporate costs and risks. More importantly, the most eﬀective data center design, location, and ecosystem strategies can oﬀer signiﬁcant ﬂexibility and independence for IT to expand or contract at various speeds and to go in diﬀerent directions entirely as new ideas are born.

More enterprises are getting out of the data center business. Fewer than 5% of businesses and enterprises have the appropriate business drivers and staﬃng models that would cause them to own and operate their own facilities in the most eﬃcient manner. Even among some of the largest and most technologically savvy businesses there is a signiﬁcant change in views on how data center capacity should be acquired.

Mark Thiele is EVP, Data Center Tech at SUPERNAP, where his responsibilities include evaluating new data center technologies, developing partners, and providing industry thought leadership. Mr. Thiele’s insights in to the next generation of technological innovations and how these technologies speak to client needs and solutions are invaluable. He shares his enthusiasm and passion for technology and how it impacts daily life and business on local, national, and world stages.

Mr. Thiele has a long history of IT leadership specifically in the areas of team development, infrastructure, and data centers. Over a career of more than 20 years, he’s demonstrated that IT infrastructure can be improved to drive innovation, increase efficiency, and reduce cost and complexity. He is an advisor to venture firms and start-ups, and is a globally recognized speaker at premier industry events.

Improving Performance in Ever-Changing Mission-Critical IT Infrastructures

January 12, 2015/in Design/by Kevin Heslin

CenturyLink incorporates lessons learned and best practices for high reliability and energy efficiency.

By Alan Lachapelle

CenturyLink Technology Solutions and its antecedents (Exodus, Cable and Wireless, Qwest, and Savvis) have a long tradition of building mission critical data centers. With the advent of its Internet Data Centers in the mid-1990s, Exodus broke new ground by building facilities at unprecedented scale. Even today, legacy Exodus data centers are among the largest, highest capacity, and most robust data centers in CenturyLink’s portfolio, which the company uses to deliver innovative managed services for global businesses on virtual, dedicated, and colocation platforms (see Executive Perspectives on the Colocation and Wholesale Markets, p.51).

Through the years CenturyLink has seen significant advances not only in IT technology, but in mission-critical IT infrastructures as well; adapting to and capitalizing on those advances have been critical to the company’s success.

Applying new technologies and honing best-practice facility design standards is an ongoing process. But the best technology and design alone will not deliver the efficient, high-quality data center that CenturyLink’s customers demand. It takes experienced, well-trained staff with a commitment to rigorousadherence to standards and methods to deliver on the promise of a well-designed and well-constructed facility. Specifically, that promise is to always be up and running, come what may, to be the “perfect data center.”

The Quest
As its build process matured, CenturyLink infrastructure began to take on a phased approach, pushing the envelope and leading the industry in effective deployment of capital for mission critical infrastructures. As new technologies developed, CenturyLink introduced them to the design. As the potential densities of customers’ IT infrastructure environments increased, so too did the densities planned into new data center builds. And as the customer base embraced new environmental guidelines, designs changed to more efficiently accommodate these emerging best practices.

Not many can claim a pedigree of 56 (and counting) unique data center builds, with the continuous innovation necessary to stay on top in an industry in which constant change is the norm. The demand for continuous innovation has inspired CenturyLink’s multi-decade quest for the perfect data center design model and process. We’re currently on our fourth generation of the perfect data center—and, of course, it certainly won’t be the last.

The design focus of the perfect data center has shifted many times.

Dramatically increasing the efficiency of white space in the data centers is likely the biggest such shift. Under the model in use in 2000, a 10-megawatt (MW) IT load may have required 150,000 square feet (ft2) of white space. Today, the same capacity requires only a third the space. Better still, we have deployments of 1 MW in 2,500 ft2—six times denser than the year-2000 design. Figure 1 shows the average densities in four recent customer installations.

Figure 1. The average densities for four recent customer installations show how significantly IT infrastructure density has risen in recent years.

Our data centers are rarely homogenous, so the designs need to be flexible enough to support multiple densities in the same footprint. A high-volume trading firm might sit next to a sophisticated 24/7 e-retailer, next to a disaster recovery site for a health-care provider with a tape robot. Building in flexibility is a hurdle all successful colocation providers must overcome to effectively address their clients’ varied needs.

Concurrent with differences in power density are different cooling needs. Being able to accommodate a wide range of densities efficiently, from the lowest (storage and backup) to the highest (high-frequency trading, bitcoin mining, etc.), is a chief concern. By harnessing the latest technologies (e.g., pumped-refrigerant economizers, water-cooled chillers, high-efficiency rooftop units), we match an efficient, flexible cooling solution to the climate, helping ensure our ability to deliver value while maximizing capital efficiency.

Mechanical systems are not alone in seeing significant technological development. Electrical infrastructures have changed at nearly the same pace. All iterations of our design have safely and reliably supplied customer loads, and we have led the way in developing many best practices. Today, we continue to minimize component count, increase the mean time between failures, and pursue high operating efficiency infrastructures. To this end, we employ the latest technologies, such as delta conversion UPS systems for high availability and Eco Mode UPS systems that actually have a higher availability than double-conversion UPS systems. We consistently re-evaluate existing technologies and test new ones, including Bloom Energy’s Bloom Box solid oxide fuel cell, which we are testing in our OC2 facility in Irvine, CA. Only once a new technology is proven and has shown a compelling advantage will we implement it more broadly.

All the improvements in electrical and mechanical efficiencies could scarcely be realized in real-world data centers if controls were overlooked. Each iteration of our control scheme is more robust than the last, thanks to a multi-disciplinary team of controls experts who have built fault tolerance into the control systems. The current design, honed through much iteration, allows components to function independently, if necessary, but generates significant benefit by networking them together, so that they can be controlled collaboratively to achieve optimal overall efficiency. To be clear, each piece of equipment is able to function solely on its own if it loses communication with the network, but by allowing components with part-load efficiencies to communicate with each other effectively, the system intelligently selects ideal operating points to ensure maximum overall efficiency.

For example, the chilled-water pumping and staging software analyzes current chilled-water conditions (supply temperature, return temperature, and system flow) and chooses the appropriate number of chilled-water pumps and chillers to operate to minimize chiller plant energy consumption. To do this, the software evaluates current chiller performance against ambient temperature, load, and pumping efficiency. The entire system is simple enough to allow for effective troubleshooting and for each component to maintain required parameters under any circumstance, including failure of other components.

Finally, our commissioning process has grown and matured. Learning lessons from past commissioning procedures, as well as from the industry as a whole, has made the current process increasingly rigorous. Today, simulations used to test new data centers before they come on-line closely represent actual conditions in our other facilities. A thorough commissioning process has helped us ensure our buildings are turned over to operators as efficient, reliable, and easy to operate as our designs intended.

Design and Construction Certification
As the data center industry has matured, among the things that became clear to CenturyLink was the value of Tier Certification. The Uptime Institute’s Tier Standard: Topology makes the benchmark for performance clear and attainable. While our facilities have always been resilient and maintainable, CenturyLink’s partnership with the Uptime Institute to certify our designs to its well-known and recognizable standards creates customer certainty.

CenturyLink currently has five Uptime Institute Tier III Certified Facilities in Minneapolis, MN; Chicago, IL; Toronto, ON; Orange County, CA; and Hong Kong, with a sixth underway. By having our facilities Tier Certified, we do more than simply show commitment to transparency in design. Customers who can’t view and participate in commissioning of facilities can rest assured knowing the Uptime Institute has Certified these facilities as Concurrently Maintainable. We invite comparison to other providers and know that our commitments will provide value for our customers in the long run.

Application of Design to Existing Data Centers
Our build team uses data center expansions to improve the capacity, efficiency, and reliability of existing data centers. This includes (but is not limited to) optimizing power distribution, aligning cooling infrastructure to utilize ASHRAE guidance, or upgrading controls to increase reliability and efficiency.

Meanwhile, our operations engineers continuously implement best practices and leading-edge technologies to improve energy efficiency, capacity, and reliability of data center facilities. Portfolio wide, the engineering team has enhanced control sequences for cooling systems, implementation of electronically commutated (EC) and variable frequency drive (VFD) fans, and Cold Aisle/Hot Aisle containment. These best practices serve to increase total cooling capacity and efficiency, ensuring customer server inlet conditions are homogenous and within tolerance. Figure 2 shows the total impact of all such design improvements on our company’s aggregate Power Usage Effectiveness (PUE). Working hand-in-hand with the build group, CenturyLink’s operations engineers ensure continuous improvement in perfect data center design, enhancing some areas while eliminating unneeded and unused features and functions—often based on feedback from customers.

Figure 2. As designs improve over time, CenturyLink implements best practices and lessons learned into its existing portfolio to continuously improve its aggregate PUE. It’s worth noting that the PUEs shown here include considerable office space as well as extra resiliency built into unutilized capacity.

Staffing Models
CenturyLink relies on expert data center facilities and operations teams to respond swiftly and intelligently to incidents. Systems are designed to withstand failures, but it is the facilities team that promptly corrects failures, maintaining equipment at a high level of availability that continually assures Fault Tolerance.

Each facility hosts its own Facility and Operations teams. The Facility team consists of a facility manager, a lead mechanical engineer, a lead electrical engineer, and a team of facility technicians. They maintain the building and its electrical and mechanical systems. They are experts on their locations. They ensure equipment is maintained in concurrence with CenturyLink’s maintenance standards, respond to incidents, and create detailed Methods of Procedure (MOPs) for all actions and activities. They also are responsible for provisioning new customers and maintaining facility capacity.

The Operations team consists of the operations manager, operations lead, and several operations technicians. This group staffs the center 24/7, providing our colocation customers with CenturyLink’s “Gold Support” (remote hands) for their environment so that they don’t need to dispatch someone to the data center. This team also handles structured cabling and interconnections.

Regional directors and regional engineers supplemented the location teams. The regional directors and engineers serve as subject matter experts (SMEs) but, when required, can also marshal the resources of CenturyLink’s entire organization to rapidly and effectively resolve issues and ensure potential problems are addressed on a portfolio-wide basis.

The regional teams work as peers, providing each individual team member’s expertise when and where appropriate, including to operations teams outside their region when needed. Collaborating on projects and objectives, this team ensures the highest standards of excellence are consistently maintained across a wide portfolio. Thus, regional engineers and directors engage in trusted and familiar relationships with site teams, while ensuring the effective exchange of information and learning across the global footprint.

Global Standard Process Model
A well-designed, -constructed, and -staffed data center is not enough to ensure a superior level of availability. The facility is only as good as the methods and procedures that are used to operate it. A culture that embraces process is also essential in operating a data center efficiently and delivering the reliability necessary to support the world’s most demanding businesses.

Uptime and latency are primary concerns for CenturyLink. The CenturyLink brand depends upon a sustained track record of excellence. Maintaining consistent reliability and availability across a varied and changing footprint requires an intensive and dynamic facilities management program encompassing an uncompromisingly rigid adherence to well-planned standards. These standards have been modeled in the IT Infrastructure Library spirit and are the result of years of planning, consideration, and trial and error. Adherence further requires close monitoring of many critical metrics, which is facilitated by the dashboard shown in Figure 3.

Figure 3. CenturyLink developed a dynamic dashboard that tracks and trends important data: site capacity, PUE, available raised floor space, operational costs, abnormal Incidents, uptime metrics, and much more to provide a central source of up-to-date information for all levels of the organization.

Early in the development of Savvis as a company, management established many of the organizational structures that exist today in CenturyLink Technology Solutions. CenturyLink experienced growth in many avenues; as it serviced increasingly demanding customers (in increasing volume), these structures continued to evolve to suit the ever-changing needs of the company and its customers.

First, the management team developed a staff capable of administering the many programs that would be required to maintain the standard of excellence demanded by the industry. Savvis developed models analyzing labor and maintenance requirements across the company and determined the most appropriate places to invest in personnel. Training was emphasized, and teams of SMEs were developed to implement the more detailed aspects of facilities operations initiatives. The management team is centralized, in a sense, because it is one global organization; this enhances the objective of global standardization. Yet the team is geographically diverse, subdivided into teams dedicated to each site and regional teams working with multiple sites, ensuring that all standards are applied globally throughout the company. And all teams contribute to the ongoing evolution of those standards and practices—for example, participating in two global conference calls per week.

Next, it was important to set up protocols to handle and resolve issues as they developed, inform customers of any impact, and help customers respond to and manage situations. No process guarantees that situations will play out exactly as anticipated, so a protocol to handle unexpected events was crucial. This process relied on an escalation schedule that brought decisions through the SMEs for guidance and gave decision makers the proper tools for decision making and risk mitigation. Parallel to this, a process was developed to ensure any incident with impact to customers caused notifications to those customers so they could prepare for or mitigate the impact of an event.

A tracking system accomplished many things. For example, it ensured follow up on items that might create problems in the future, identified similar scenarios or locations where a common problem might recur, established a review and training process to prevent future incidents through operator education, justified necessary improvements in systems creating problems, and tracked performance over longer periods to analyze success in implementation and evaluate need for plan improvement. The tracking system is inclusive of all types of problems, including those related to internal equipment, employees, and vendors.

Data centers, being dynamic, require frequent change. Yet unmanaged change can present a significant threat to business continuity. Congruent with the other programs, CenturyLink set up a Change Management program. This program tracked changes, their impacts, and their completion. It ensured that risks were understood and planned for and that unnecessary risks were not taken.

Any request for change, either internal or from a customer, must go through the Change Management process and be evaluated on metrics for risk. These metrics determine the level of controls associated with that work and what approvals are required. The key risk factors considered in this analysis include the possible number of customers impacted, likelihood of impact, and level of impact. Even more importantly, the process evaluates the risk of not completing a task and balances these factors. The Change Management program and standardization of risk analysis necessitated standardizing maintenance procedures and protocols as well.

Standards, policies, and best practices were established, documented, and reviewed by management. These create the operating frameworks for implementing IT Information Library methodology, enabling CenturyLink to incorporate industry best practices and standards, as well as develop internal operating best practices, all of which maximize uptime and resource utilization.

A rigid document-control program was established utilizing peer review, and all activities or actions performed were scripted, reviewed, and approved. Peer review also contributed to personnel training, ensuring that as documentation was developed, peers collaborated and maintained expertise on the affected systems. Document standardization was extended to casualty response as well. Even responding to failures requires use of approved procedures, and the response to every alarm or failure is scripted so the team can respond in a manner minimizing risk. In other words, there is a scripted procedure even for dealing with things we’ve never before encountered. This document control program and standardization has enabled personnel to easily support other facilities during periods of heightened risk, without requiring significant training for staff to become familiar with the facilities receiving the additional support.

Conclusion
All the factors described in this paper combine to allow CenturyLink to operate a mission-critical business on a grand scale, with uniform operational excellence. Without this framework in place, CenturyLink would not be able to maintain the high availability on which its reputation is built. Managing these factors while continuing to grow has obvious challenges. However, as CenturyLink grows, these practices are increasingly improved and refined. CenturyLink strives for continuous improvement and views reliability as a competitive advantage. The protocols CenturyLink follows are second-to-none, and help ensure the long-term viability of not only data center operations but also the company as a whole. The scalability and flexibility of these processes can be seen from the efficiency with which CenturyLink has integrated them into its new data center builds as well as data centers it acquired. As CenturyLink continues to grow, these programs will continue to be scaled to meet the needs of demanding enterprise businesses.

Site Spotlight: CH2
In 2013, we undertook a massive energy efficiency initiative at our CH2 data center in Chicago, IL. More than 2 years of planning went into this massive project, and the energy savings were considerable.

Projects included:
• Occupancy sensors for lighting
• VFDs on direct expansion computer room air conditioning units
• Hot Aisle containment
• Implementing advanced economization controls
• Replacing cooling tower fill with high-efficiency
evaporative material
• Installing high-efficiency cooling tower fan blades

These programs combined to reduce our winter PUE by over 17%, and our summer PUE by 20%. Additionally, the winter-time period has grown as we have greatly expanded the full free cooling window and added a large partial free cooling window, using the evaporative impact of the cooling towers to great advantage.

Working with Commonwealth Edison, energy efficient rebates contributed over US$500,000 to the project’s return, along with annual savings of nearly 10,000,000 kilowatt-hours. With costs of approximately US$1,000,000, this project proved to be an incredible investment, with an internal rate of return of ≈100% and a Net Present Value of over US$5,000,000. We consider this a phenomenal example of effective best practice implementation and energy efficiency.

Alan Lachapelle

Alan Lachapelle is a mechanical engineer at CenturyLink Technology Solutions. He has 6 years of experience in Naval Nuclear Propulsion on submarines and 4 years of experience in mission critical IT infrastructure. Merging rigorous financial analysis with engineering expertise, Mr. Lachapelle has helped ensure the success of engineering as a business strategy.

Mr. Lachapelle’s responsibilities include energy-efficiency initiatives, data center equipment end of life, operational policies and procedures, peer review of maintenance and operations procedures, utilization of existing equipment, and financial justifications for engineering projects throughout the company.

Operational Upgrade Helps Fuel Oil Exploration Surveyor

January 5, 2015/in Design/by Kevin Heslin

Petroleum Geo-Services increases its capabilities using innovative data center design
By Rob Elder and Mike Turff

Petroleum Geo-Services (PGS) is a leading oil exploration surveyor that helps oil companies find offshore oil and gas reserves. Its range of seismic and electromagnetic services, data acquisition, processing, reservoir analysis/interpretation, and multi-client library data all require PGS to collect and process vast amounts of data in a secure and cost-efficient manner. This all demands large quantities of compute capacity and deployment of a very high-density configuration. PGS operates 21 data centers globally, with three main data center hubs located in Houston, Texas; Kuala Lumpur, Malaysia; and Weybridge, Surrey (see Figure 1).

Figure 1. PGS global computing centers

Weybridge Data Center

Keysource Ltd designed and built the Weybridge Data Center for PGS in 2008. The high-density IT facility, won a number of awards and saved PGS 6.2 million kilowatt-hours (kWh) annually compared to the company’s previous UK data center. The Weybridge Data Center is located in an office building, which poses a number of challenges to designers and builders of a high-performance data center. The initial project phase in 2008 was designed as the first phase of a three-phase deployment (see Figure 2).

Figure 2. Phase 1 Data Center (2008)

Phase one was designed for 600 kW of IT load, which was scalable up to 1.8 megawatts (MW) across two future phases if required. Within the facility, power rack densities of 20 kW were easily supported, exceeding the 15-kW target originally specified by the IT team at PGS.

The data center houses select mission-critical applications supporting business systems, but it primarily operates the data mining and analytics associated with the core business of oil exploration. This IT is deployed in full-height racks and requires up to 20 kilowatts (kW) per rack anywhere in the facility and at any time (see Figure 3).

Figure 3. The full PGS site layout

In 2008, PGS selected Keysource’s ecofris solution for use at its Weybridge Data Center (see Figure 4), which became the first facility to use the technology. Ecofris recirculates air within a data center without using fresh air. Instead air is provided to the data center through the full height of a wall between the raised floor and suspended ceiling. Hot air from the IT racks is ducted into the suspended ceiling and then drawn back to the cooling coils of air handling units (AHU) located at the perimeter walls. The system makes use of adiabatic technology for external heat rejection when external temperatures and humidity do not allow 100% free cooling.

Figure 4. Ecofris units are part of the phase 1 (2008) cooling system to support PGS’s high-density IT.

Keysource integrated a water-cooled chiller into the ecofris design to provide mechanical cooling when needed to supplement the free cooling system (see Figure 5). As a result PGS ended up with two systems, each having a 400-kW chiller, which run for only 50 hours a year on average when external ambient conditions are at their highest.

Figure 5. Phase 2 ecofris cooling

As a result of this original design the Weybridge data center used outside air for heat rejection but without allowing that air into the building. Airflow design, a comprehensive control system, and total separation of hot and cold air means that the facility could accommodate 30 kW in any rack and deliver a PUE L2,YC (Level 2, Continuous Measurement) of 1.15 while maintaining a server inlet temperature consistent across the entire space of 72°F (22°C) +/-1°. Adopting an indirect free cooling design rather than direct fresh air eliminated the need for major filtration or mechanical back up (see the sidebar).

Surpassing the Original Design Goals
When PGS needed additional compute capacity, the Weybridge Data Center was a prime candidate for expansion because it had the flexibility to deploy high-density IT anywhere within the facility and a low operating cost. However, while the original design anticipated two future 600-kW phases, PGS wanted even more capacity because of the growth of its business and its need for the latest IT technology. In addition, PGS wanted to make a huge drive to reduce operating costs through efficient design of cooling systems and to maximize power capacity at the site.

When the recent project was completed at the end of 2013, the Weybridge Data Center housed the additional high-density IT within the footprint of the existing data hall. The latest ecofris solution was deployed which utilized a chillerless design, which limited the increased power demand.

Keysource undertook the design by looking at ways to maximize the use of white space for IT as well as to remove the overhead cost of power to run mechanical cooling, even for a very limited number of hours a year. This would ensure maximum availability of capacity of power for the IT equipment. With a marginal improvement in operating efficiency (annualized PUE) the biggest design change was the reduced peak PUE. This change enabled an increase in IT design load from 1.8 MW to 2.7 MW within the same footprint. With just over 5 kW/square meter (m²), PGS can deploy 30 kW in any cabinet up to the maximum total installed IT capacity (see Figure 6).

Figure 6. More compute power within the same overall data center footprint

Disruptive Cooling Design
Developments in technology and the wider allowable range of temperatures per ASHRAE TC9.9 enabled PGS to adopt higher server inlet temperatures when ambient temperatures are higher. This change allows PGS to operate at the optimum temperature for the equipment most of the time (normally 72°F (22°C) lowering the IT part of the PUE metric (see Figure 7).

Figure 7. Using computational fluid dynamics to model heat and airflow

In this facility, elevating server inlet temperatures increases the supply inlet temperatures only when ambient outside air is too warm to maintain 72°F (22°C). Running at higher temperatures at other times actually increases server fan power across different equipment, which also increases UPS loads. Running the facility at optimal efficiency all of the time reduces the overall facility load, even though PUE may rise as a result of the decrease of server fan power. With site facilities management (FM) teams trained in operating the mechanical systems, this is fine-tuned through operation and as additional IT equipment is commissioned within the facility, ensuring performance is maintained at all times.

With innovation central to the improved performance of the data center, in addition to the cooling, Keysource also delivered modular, highly efficient UPS systems providing 96% efficiency from >25% facility load, plus facility controls, which provide automated optimization.

A Live Environment
Working in a live data center environment within an office building was never going to be risk free. Keysource built a temporary wall within the existing data center to divide the live operational equipment from the live project area (see Figure 8). Cooling, power, and data for the live equipment isn’t on a raised floor and is delivered from the same end of the data center. Therefore, the dividing screen had limited impact to the live environment, with only some minor modifications needed to the fire detection and suppression systems.

Figure 8. The temporary protective wall built for phase 2

Keysource also manages the data center facility for PGS, which meant that the FM and projects teams could work closely together in the planning the upgrade. As a result facilities management considerations were included in all design and construction planning to minimize risk to the operational data center as well as helping to reduce the impact to other business operations at the site.

Upon completing the project, a full integrated-system test of the new equipment was undertaken ahead of removing the dividing partition. This test not only covered the function of electrical and mechanical systems but also tested the capability of the cooling to deliver the 30 kW/rack and the target design efficiency. Using rack heaters to simulate load allowed detailed testing to be carried out ahead of the deployment of the new IT technology (see Figure 9).

Figure 9. Testing the 30 kW per rack full load

Results 

Phase two was completed in April 2014, and as a result the facility’s power density improved by approximately 50%, with the total IT capacity now scalable up to 2.7 MW. This has been achieved within the same internal footprint. The facility now has the capability to accommodate up to 188 rack positions, supporting up to 30 kW per rack. In addition, the PUEL2,YC of 1.15* was maintained (see the Figure 10).

Figure 10. A before and after comparison

The data center upgrade has been hailed as a resounding success, earning PGS and Keysource a Brill Award for Efficient IT from Uptime Institute. PGS is absolutely delighted to have the quality of its facility recognized by a judging panel of industry leaders and to receive a Brill Award.

Direct and Indirect Cooling Systems
Keysource hosted an industry expert roundtable that provides additional insights and debate on two pertinent cooling topics highlighted by the PGS story. Copies of these whitepapers can be obtained at http://www.keysource.co.uk/data-centre-white-papers.aspx

An organization requiring high availability is unlikely to install a direct fresh air system without 100% backup on the mechanical cooling. This is because the risks associated with the unknowns of what could happen outside, however infrequent, are generally out of the operator’s control.

Density of IT equipment does not make any impact to direct or indirect designs. It is the control of air and the method of air delivery within the space that dictates capacity and air volume requirements. There may be additional considerations for how backup systems and the control strategy between switching cooling methods works in high-density environments due to the risk of thermal increase in very short periods, but this is down to each individual design.

Following the agreement of the roundtable that direct fresh air is going to require some sort of a backup system in order to meet availability and customer risk requirements, it is worth considering what benefits might exist for opting for either this or indirect design.

Partly due to the different solutions in these two areas and partly because there are other variables on a site-specific basis, there are not many clear benefits either way, but there are a few considerations included:

• Indirect systems pose less or no risk from external pollutants and
contaminants.

• Indirect systems do not require integration into the building
fabric, where a direct system often needs large ducts or
modifications to the shell. This can increase complexity and cost
if, due to space or building height, it is even achievable.

• Direct systems often require more humidity control, depending on
which ranges are to be met.

With most efficient systems, there is some form of adiabatic cooling. With direct systems there is often a reliance on water to provide capacity rather than simply improve efficiency. In this case there is a much greater reliance on water for normal operation and to maintain availability, which can lead to the need for water storage or other measures. The metric of water usage effectiveness (WUE) needs to be considered.

Many data center facilities are already built with very inefficient cooling solutions. In such cases direct fresh air solutions provide an excellent opportunity to retrofit and run as the primary method of cooling, with the existing inefficient systems as back up. As the backup system is already in place this is often a very affordable option with a clear ROI.

One of the biggest advantages for an indirect system is the potential for zero refrigeration. Half of the U.S. could take this route, and even places people would never consider such as Madrid or even Dubai could benefit. This inevitably requires the use of and reliance on lots of water, as well as the acceptance of increasing server inlet temperatures during warmer periods.

Mike Turff

Mike Turff is global compute resources manager for the Data Processing division of Petroleum Geo-Services (PGS), a Norwegian-based leader in oil exploration and production services. Mr. Turff has responsibility for building and managing the PGS supercomputer centers in Houston, TX; London, England; Kuala Lumpur, Malaysia and Rio De Janeiro, Brasil as well as the smaller satellite data centers across the world. He has worked for over 25 years in high performance compute, building and running supercomputer centers in places as diverse as Nigeria and Kazakhstan and for Baker Hughes, where he built the Eastern Hemisphere IT Services organization with IT Solutions Centers in Aberdeen, Scotland; Dubai, UAE; and Perth, Australia.

Rob Elder

As Sales and Marketing director, Rob Elder is responsible for setting and implementing the strategy for Keysource. Based in Sussex in the United Kingdom, Keysource is a data center design, build, and optimization specialist. During his 10 years at Keysource, Mr. Elder has also held marketing and sales management positions and management roles in Facilities Management and Data Centre Management Solutions Business Units.

Empowering the Data Center Professional

Six Thinking Hats

Sun Life Plots Data Center Facilities and Operations Roadmap to Meet Application Demands

RagingWire’s Jason Weckworth Discusses the Execution of IT Strategy

Mark Thiele from Switch examines the options in today’s data center industry

Improving Performance in Ever-Changing Mission-Critical IT Infrastructures

Operational Upgrade Helps Fuel Oil Exploration Surveyor

Explaining the Uptime Institute’s Tier Classification System (April 2021 Update)

The Making of a Good Method of Procedure

A Look at Data Center Cooling Technologies

Data Center Cooling: CRAC/CRAH redundancy, capacity, and selection metrics

Implementing Data Center Cooling Best Practices