An imminent return of service causes a top-to-bottom examination of data center facilities and operations
By Rocco Alonzi and Paolo Piro
When Sun Life Financial completed a return of service for its data center operations in 2011, the Enterprise Infrastructure (EI) and Corporate Real Estate (CRE) teams immediately saw an opportunity to improve service stability, backup capacity, and efficiency.
Focusing initially on the physical facility itself, Sun Life considered what it needed to do to improve its current situation. The options included upgrading its existing primary facility in Waterloo, ON, Canada, purchasing a new facility, and partnering with an existing facility in either a colocation or partnership scenario.
Sun Life scored the options on four main criteria: cost, time to market, interruption, and involvement. In the end, Sun Life decided that upgrading its existing Waterloo-King facility was the best option because the upgrade was the most cost-effective, the least interruptive, and the most suitable fit for the organization. The plan resulted in design implementations and organizational improvements that ultimately led to Sun Life’s first Uptime Institute Tier III Certified Constructed Facility.
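The scoring exercise described above can be pictured as a simple weighted decision matrix. The sketch below is illustrative only: the article does not publish the weights or the scores, so the equal weights, the 1-to-5 scale, and every number in `SCORES` are hypothetical placeholders chosen so the example runs.

```python
# Weighted decision matrix sketch. All weights and scores are hypothetical
# placeholders; the article does not publish the actual values.
CRITERIA = ("cost", "time_to_market", "interruption", "involvement")

# Equal weighting of the four criteria is an assumption.
WEIGHTS = {criterion: 0.25 for criterion in CRITERIA}

# Assumed 1-5 scale where 5 is the most favorable score.
SCORES = {
    "upgrade_existing": {"cost": 5, "time_to_market": 3, "interruption": 4, "involvement": 4},
    "purchase_new": {"cost": 2, "time_to_market": 2, "interruption": 3, "involvement": 2},
    "colocation_or_partnership": {"cost": 3, "time_to_market": 4, "interruption": 2, "involvement": 3},
}


def weighted_score(scores, weights):
    """Return the weighted sum of one option's criterion scores."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank_options(all_scores, weights):
    """Return (option, total) pairs sorted from best to worst."""
    totals = {name: weighted_score(s, weights) for name, s in all_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_options(SCORES, WEIGHTS)
for option, total in ranking:
    print(f"{option}: {total:.2f}")
```

Changing the weights is the quickest way to test how sensitive a ranking like this is to the relative importance of each criterion.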
Achieving this milestone was no small feat. The Waterloo-King facility was fully operational during the upgrades and improvements. The facility hosted the majority of Sun Life’s primary applications, and the site already had many connections and linkages to other facilities, with international implications for the company. In the end, the Waterloo-King facility was transformed into a Concurrently Maintainable Tier III facility with all the redundancy that comes with the designation. The transformation was completed over a relatively short period with zero major outages.
The decision to return service from an application outsourcing arrangement back to Sun Life prompted the organization to review its capabilities. Once the decision was communicated, the Enterprise Infrastructure branch (responsible for supporting the services) quickly began to analyze the requirements of the return of service and potential gaps that might impact service.
The Enterprise Infrastructure leadership team led by the Data Center Operations (DCO) assistant vice president (AVP) shouldered the responsibility of ensuring the sufficiency of the data center critical facility and the organization. The DCO reviewed current capabilities, outlined options, and developed an improvement plan. The decision was to upgrade the facility and create an environment that supported a Concurrently Maintainable and fully redundant operation.
To facilitate this transformation, Sun Life assembled a team of stakeholders to lay groundwork, manage responsibility, and execute the pieces to conclusion. The team led by the DCO AVP primarily comprised personnel from Corporate Real Estate (CRE) Facilities and Data Center Operations. Within the Sun Life Financial organization, these two teams have the greatest vested interest in the data center, and both are directly involved in the care and feeding of the facility.
Once the team was assembled, it began working on a mandate that would eventually describe the desired end goal. The goal would be to create a facility that was mechanically and electrically redundant and an organizational support structure that was operationally viable. The organization described in the mandate would be able to support the many functions required to run a critical facility and impose governance standards and strategies to keep it running optimally for years to come.
Data Center Due Diligence Assessment
To help guide the organization through this process, the DCO AVP contracted the services of the Uptime Institute to provide a Data Center Due Diligence Assessment analysis report. The report ultimately formed the basis of Sun Life’s roadmap for this journey.
Once the Data Center Due Diligence Assessment was complete, Uptime Institute presented its findings to the DCO AVP, who then reviewed the report with the CRE AVP and quickly identified opportunities for improvement. Using the Data Center Due Diligence Assessment and a structural assessment from another vendor, Sun Life’s team quickly isolated the critical areas and developed a comprehensive master plan.
These opportunities for improvement would help the team generate individual activities and project plans. The team focused on electrical, mechanical, and structural returns. The tasks the team developed included creating electrical redundancy, establishing dual-path service feeds, adding a second generator path to create a completely separate emergency generation system, hardening the structural fabric, and replacing the roof waterproofing membrane located above the raised floor.
Having identified the infrastructure concerns, the team shifted its focus to organizational effectiveness and accountabilities. Sun Life used a review of operational processes to close organizational gaps and to strengthen accountabilities, responsibilities, and relationships. Changes were necessary not only during the transformation process but also after implementation, when the fully operational environment would require efficient support and maintenance.
The team needed to establish clear organizational delineation of responsibilities, and establish strong communication links between the DCO and CRE so the data center support structure would function as a single unit. Under the leadership of the DCO AVP with support from CRE, Sun Life established a Data Center Governance branch to help meet this requirement. Every aspect of the day-to-day care and feeding of the facility was discussed, reviewed, and then approved for implementation, with the establishment of a clear demarcation between CRE and DCO support areas, based on the Responsible, Accountable, Consulted and Informed (RACI) model. Figure 1 is a graphical example of Sun Life’s final delineation model.
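A RACI delineation of this kind can be expressed as a small lookup table with a consistency check. The sketch below is a hypothetical illustration, not a reproduction of Figure 1: the activity names and role assignments are assumptions, and the rule that each activity has exactly one Accountable party and at least one Responsible party is a common RACI convention rather than something the article states.

```python
# RACI matrix sketch. Activities and assignments are hypothetical examples,
# not Sun Life's actual delineation model.
VALID_ROLES = {"R", "A", "C", "I"}

RACI = {
    "raised_floor_change": {"DCO": {"A", "R"}, "CRE": {"C"}},
    "electrical_maintenance": {"DCO": {"A"}, "CRE": {"R"}},
    "mechanical_maintenance": {"DCO": {"A"}, "CRE": {"R"}},
}


def validate_raci(matrix):
    """Return a list of problems; an empty list means the matrix is consistent."""
    problems = []
    for activity, assignments in matrix.items():
        # Flatten the roles held by every team for this activity.
        roles = [role for team_roles in assignments.values() for role in team_roles]
        unknown = set(roles) - VALID_ROLES
        if unknown:
            problems.append(f"{activity}: unknown roles {sorted(unknown)}")
        if roles.count("A") != 1:
            problems.append(f"{activity}: needs exactly one Accountable party")
        if roles.count("R") < 1:
            problems.append(f"{activity}: needs at least one Responsible party")
    return problems


print(validate_raci(RACI))
```

Running a check like this whenever the matrix changes helps catch activities with no clear owner before they turn into a gap between teams.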
For the last step, the spotlight moved to IT Technology. The Data Center Governance team (by direction of the DCO AVP) reviewed the existing standards and policies. The team wrote and communicated new policies as required. Adherence to these policies would be strictly enforced, with full redundancy of the mechanical and electrical environment right down to the load level being the overarching goal. Establishment and enforcement of these rules follow the demarcation between CRE and DCO.
Roadmap action items were analyzed to determine grouping and scheduling. Each smaller project was initiated and approved using the same process: stakeholder approval (i.e., from the DCO and CRE AVPs) had to be obtained before any project was allowed to proceed through the organization’s change and approval process. The team would first assess the change for risk and then for involvement and impact before allowing it to move forward for organizational assessment and approval. The criteria for approving these mechanical and electrical plans were based on “other” involvement and “other” commitment: the requirements of impacted areas of the organization (other than the DCO and CRE areas) would drive the level of analysis that a particular change would undergo. Each project and activity was reviewed and scrutinized for individual merit and overall value-add. Internal Information Technology Infrastructure Library (ITIL) change management processes were then followed. Representatives from all areas of the organization were given the opportunity to assess items for involvement and impact, and the change teams would assess the items for change window conflicts. Only after all involved areas were satisfied would the project be submitted for final ITIL approval and officially scheduled.
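The approval sequence described above behaves like an ordered series of gates, where a change cannot be scheduled until every gate has been passed in turn. The following is a minimal sketch under that reading; the gate names and the strictly linear ordering are assumptions made for illustration, not Sun Life's actual workflow definition.

```python
# Ordered approval gates, paraphrased from the process described above.
# Gate names and the strict ordering are assumptions for illustration.
GATES = (
    "stakeholder_approval",    # DCO and CRE AVPs
    "risk_assessment",
    "involvement_and_impact",
    "impacted_area_review",    # areas other than DCO and CRE
    "change_window_check",
    "final_itil_approval",
)


def next_gate(passed):
    """Return the first gate not yet passed, or None if all gates are passed."""
    for gate in GATES:
        if gate not in passed:
            return gate
    return None


def can_schedule(passed):
    """A change may be scheduled only when no gate remains outstanding."""
    return next_gate(passed) is None


passed_so_far = {"stakeholder_approval", "risk_assessment"}
print(next_gate(passed_so_far))
print(can_schedule(passed_so_far))
```

Because `next_gate` always returns the earliest outstanding gate, a later approval recorded out of order never lets a change skip an earlier step.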
The following list provides a high-level summary of the changes that were completed in support of Sun Life’s transformation. Many were done in parallel and others in isolation; select changes required full organizational involvement or a data center shutdown.
• Added a 13.8-kilovolt (kV) high-voltage hydro feed from a local utility
• Added a second electrical service in the parking garage basement
• Completed the construction of a separate generator room and diesel storage tank room in the parking garage basement to accommodate the addition of a 2-megawatt diesel generator, fuel storage tanks, and fuel pumps
• Introduced an underground 600-volt (V) electrical duct bank to data center
• Reconfigured the data center electrical room into two distinct sides
• Replaced old SP1 switchboard in data center electrical room
• Added a backup feed from the new electrical service for main building
• Replaced existing UPS devices
• Installed an additional switch between the new generator and the switchgear to connect the load bank
• Installed an additional switch on each electrical feed providing power to the UPS system for LAN rooms
• Upgraded existing generators to prime-rated engine generators
• Replaced roof slab waterproofing membrane above the data center (see Figure 2)
• Created strategies to mitigate electrical outages
Teamwork was essential to the success of these changes. Each of the changes required strong collaboration, which was only possible because of the strong communication links between CRE and DCO. The team responsible for building the roadmap that effectively guided the organization from where it was to where it needed to be had a full understanding of accountabilities and responsibilities. It was (and still is) a partnership based on a willingness to change and a desire to move in the right direction. The decision to add controls to a building UPS is a good example of this process. Since one of the critical facility UPS units at the Waterloo facility supports part of the general building as well as the critical facility, a control needed to be put in place to ensure compliance, agreement, and communication. Although the responsibility to execute the general building portion falls solely on CRE, a change to this environment could have an impact on the data center and therefore governance was required. Figure 3 shows a process that ensures collaboration, participation, and approval across responsibilities.
To achieve this level of collaboration, the focus needed to switch to the organizational commitments and support that were fostered during this process. Without this shift in organizational behavior, Sun Life would not have been able to achieve the level of success that it has, at least not as easily or as quickly. This change in mindset helped to change the way things are planned and executed. CRE and DCO work together to plan and then execute. The teamwork ensured knowledge and understanding. The collaboration removed barriers so the teams were able to develop a much broader line of sight (bird’s-eye view) when considering the data center.
Delineation of responsibilities was clearly outlined. DCO assumed full accountability of all changes relating to the raised floor space while CRE Critical Facilities managed all electrical and mechanical components in support of the data center. The DCO team reported through the technology branch of the organization while the Critical Facilities team reported up through the CRE branch of the organization. Overall accountability for the data center rested on DCO with final approval and ultimate ownership coming from DCO AVP.
During the planning phase of this transition, both sides (CRE and DCO) quickly realized that processing changes in isolation was not an effective or efficient approach and immediately established a strong collaborative tie. This tie proved to be critical to the organization’s success as both teams and their respective leaders were able to provide greater visibility, deliver a consistent message, and obtain a greater line of sight into potential issues, all of which helped to pave the way for easier acceptance, greater success, and fewer impacts organization-wide. As preparations were being made to schedule activities, the team was able to work together and define the criteria for go/no-go decisions.
Documenting the Process
Once individual projects were assessed and approved, the attention turned to planning and execution. In cases where the activity involved only the stakeholder groups (CRE and DCO), the two groups managed the change/implementation in isolation. Using methods of procedure (MOPs) provided by the vendor performing the activity kept the team fully aware of the tasks to be completed and the duration of each task. On the day of the change, communication was managed within the small group and executives were always kept informed. Activity runbooks were used in cases where the activity was larger and involvement was much broader. These runbooks contained a consolidation of all tasks (including MOPs), responsibilities assigned to areas and individuals, and estimated and tracked duration per step. The MOP portion of the runbook would be tagged to CRE. Although the specific MOP steps were not itemized, as they were only relevant to CRE and DCO, the time required for the MOP was allotted in the runbook for all to see and understand (see Figure 4). In these larger, more involved cases, the runbooks helped to ensure linkages of roles and responsibilities, especially across Facilities and IT, to plan the day, and to ensure that all requirements and prerequisites were aligned and clearly understood.
Compiling these runbooks required a great deal of coordination. Once the date for the activity was scheduled, the DCO team assumed the lead in developing the runbook. At its inception, the team engaged areas of impact and began to document a step-by-step MOP that would be used on the day of the change: who was required, who from each team would be responsible, and how much time each task would take. The sum of the task durations provided the overall estimate for how long the proposed activity would take. Several weeks prior to the actual change, dry runs of the runbook were scheduled to verify completeness of the approach. Final signoff was always required before any change was processed for execution. Failure to obtain signoff resulted in postponement or cancellation of the activity.
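The runbook structure described above (steps with an owning team, a responsible individual, and an estimated duration, plus a mandatory signoff) maps naturally onto a small data model. The sketch below is a hypothetical illustration: the class names, field names, and the sample steps and durations are assumptions, not Sun Life's actual runbook format.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookStep:
    description: str
    team: str                # area responsible for the step
    responsible: str         # individual from that team
    estimated_minutes: int


@dataclass
class Runbook:
    activity: str
    steps: list = field(default_factory=list)
    signed_off: bool = False

    def total_minutes(self):
        """Overall estimate: the sum of all step durations."""
        return sum(step.estimated_minutes for step in self.steps)

    def can_execute(self):
        """No signoff means the activity is postponed or cancelled."""
        return self.signed_off and bool(self.steps)


# Hypothetical example. The vendor MOP appears as one time block tagged to CRE,
# without its individual steps being itemized.
runbook = Runbook(
    activity="Example electrical cutover",
    steps=[
        RunbookStep("Shut down affected applications", "Application support", "Analyst A", 45),
        RunbookStep("Vendor MOP (single time block)", "CRE", "Facilities lead", 120),
        RunbookStep("Restart and validate applications", "Application support", "Analyst A", 60),
    ],
)
print(runbook.total_minutes())
print(runbook.can_execute())
```

Keeping the estimate as a computed total rather than a separately maintained number means that a dry run that changes one step's duration automatically updates the overall window.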
On the day of the activities, tasks were followed as outlined. Smaller activities (activities that only required DCO and Facilities involvement) were managed within the Facilities area with DCO participation. Larger activities requiring expanded IT coordination were managed using a command room. The room was set up to help facilitate the completion of the tasks in the order outlined in the runbook, and it provided a single place for coordination and collaboration. The facilitators (who were always members of the DCO team) were able to use the forum to document issues that arose, assess impact, and document remediation. After implementation, this information was used to investigate and resolve the issues and to improve future runbooks. The command room served as a focal point for information and status updates for the entire organization. Status updates would be provided at predetermined intervals. Issues were managed centrally to ensure that coordination and information were consistent and complete. The process was repeated for each of the major activities, and in the end, all changes in Sun Life’s transformation were executed as planned, with complete cooperation and acceptance by all involved.
System cutovers were completed without major issues or major outages. Interruptions of applications were, for the most part, expected or known. Outages were typically caused by technical limitations at the load level such as single-corded IT hardware or system limitations. Outages of single-corded equipment were minimized as systems were restored once power was fed from a live source. For outages caused by system limitations, arrangements had been made with the business client to shut down the service for the duration of the change. The system was restored when the change was complete. In the rare circumstance when an unexpected minor outage did occur, the support group, which was on site, investigated immediately and determined the root cause to be localized to the IT equipment. The issues were faulty power supplies or IT hardware configuration errors. These issues, although not related to the overall progress or impact of the activity itself, were documented, resolved, and then added to future runbooks as pre-implementation steps to be completed.
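A pre-implementation check for the load-level limitations described above could look like the following sketch. It flags equipment that would lose power when one electrical side is taken down, which covers both single-corded devices and dual-corded devices with both cords on the same side. The inventory format, device names, and the "A"/"B" side labels are all hypothetical.

```python
# Pre-implementation check sketch. The inventory format, device names, and
# the "A"/"B" side labels are hypothetical.
INVENTORY = {
    "server-01": ["A", "B"],   # dual-corded across both sides
    "server-02": ["A"],        # single-corded
    "switch-01": ["B", "B"],   # dual-corded, but both cords on the same side
}


def at_risk(inventory, side_going_down):
    """Return devices that have no power cord on a side that stays live."""
    return sorted(
        name
        for name, feeds in inventory.items()
        if not any(feed != side_going_down for feed in feeds)
    )


print(at_risk(INVENTORY, "A"))
print(at_risk(INVENTORY, "B"))
```

The resulting list identifies the equipment that needs a planned shutdown agreed with the business client, or re-cabling, before the change window.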
DCO’s governance and strategy framework (see Figure 5) served as the fundamental component that defined authority while employing controls. These controls ensured clarity about impact, risk, and execution during each of the planning and execution phases and have continued to evolve well into the support phase.
A new RACI model was developed to help outline the delineation between CRE and DCO in the data center environment. The information, which was built by DCO in collaboration with CRE, was developed in parallel to the changes being implemented. Once the RACI model was approved, the model became the foundation for building a clear understanding of the responsibilities within the data center.
During the planning phase, the collaboration between these two areas facilitated the awareness needed for establishing proper assessment of impact. As a result, the level of communication and amount of detail provided to the change teams was much more complete. The partnership fostered easier identification of potential application/infrastructure impacts. During the execution phase, management of consolidated implementation plans, validation and remediation, as well as the use of runbooks (with documented infrastructure/application shutdown/startup procedures), provided the necessary transparency that was required across responsibilities to effectively manage cutovers and building shutdowns with no major impact or outage.
Several milestones had to be achieved to reach all these goals. The entire facility upgrade process, from the point when funding was approved, took approximately 18 months to complete. Along this journey, there were a number of key milestones that needed to be negotiated and completed. To help explain how Sun Life was able to complete each task, the duration and lead time of each are listed below.
• Description of task (duration) – Lead time
• Contract approvals (2 months) – 18 months
• Construction of two new electrical rooms, installation of one new UPS and installation of generator and fuel system (2 months) – 16 months
• Validation, testing and verification (1 month) – 14 months
• Assemble internal organizational team to define application assessment (1 month) – 9 months
• Initial communication regarding planned cutover (1 month) – 9 months
• Validate recommended cutover option with application and infrastructure teams (1 month) – 8 months
• Remediate application and infrastructure in advance of cutover (2 months) – 7 months
• Define and build cutover weekend governance model (3 months) – 7 months
• Define and build cutover sequence runbook (3 months) – 7 months
• Data center electrical upgrade complete – Tier III Certification of Constructed Facility
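The durations and lead times in the list above can be turned into a rough timeline. The sketch below assumes that "lead time" means the number of months before completion at which a task begins, and that the project spans the 18 months mentioned earlier. That interpretation is an assumption, not something the article states, and the task names are abbreviated.

```python
# (task, duration in months, lead time in months), taken from the list above.
# Task names are abbreviated for readability.
TASKS = [
    ("Contract approvals", 2, 18),
    ("Construction: electrical rooms, UPS, generator and fuel system", 2, 16),
    ("Validation, testing and verification", 1, 14),
    ("Assemble team to define application assessment", 1, 9),
    ("Initial communication regarding planned cutover", 1, 9),
    ("Validate recommended cutover option", 1, 8),
    ("Remediate application and infrastructure", 2, 7),
    ("Define and build cutover weekend governance model", 3, 7),
    ("Define and build cutover sequence runbook", 3, 7),
]
PROJECT_MONTHS = 18


def schedule(tasks, project_months):
    """Return (task, start_month, end_month), counted from project start.

    Assumes the lead time is measured back from project completion.
    """
    rows = []
    for name, duration, lead in tasks:
        start = project_months - lead
        rows.append((name, start, start + duration))
    return rows


for name, start, end in schedule(TASKS, PROJECT_MONTHS):
    print(f"months {start:>2}-{end:<2}  {name}")
```

Under this reading, the first three tasks run back to back, and the three cutover preparation tasks with a 7-month lead time overlap with one another.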
In the end, after all the planning, setups, and implementations, all that remained was validation that all the changes were executed according to design. For the Facility Certification, Uptime Institute provided a list of 29 demonstrations covering activities from all aspects of the mechanical and electrical facility. The same team of representatives from CRE and DCO reviewed each demonstration and analyzed them for involvement and impact. The Sun Life team created individual MOPs, and grouped these for execution based on the duration of the involvement required. These activities took place across 3 days. Runbooks were created and used throughout each of the groupings. Areas required were engaged. On the demonstration weekend, CRE and DCO resources worked together to process each demonstration, one by one, ensuring and validating the success of the implementation and the design. The end result was Tier III Certification of Constructed Facility (See Figure 6).
Sun Life Financial received its Tier III Design Documents Certification in May 2013, and then successfully demonstrated all items required over the first weekend in November to receive Tier III Certification of Constructed Facility on November 8, 2013. The journey was not an easy one.
In summary, Sun Life Financial transformed its primary operational data center facility (See Figure 7) within 18 months at a cost of approximately US$7 million (US$3.4 million allocated to electrical contractor work and materials, US$1.2 million for waterproof roof membrane work, US$1.5 million for environmental upgrades and the addition of a new generator, US$900,000 for other costs, and the remainder for project management and other minor improvements). The success of this transformation was possible in large part due to the collaboration of an entire organization and the leadership of a select few. The facility is now a Tier III Constructed Facility that is Concurrently Maintainable and optimally supported. Through Certification, Sun Life is now much better positioned to manage the ever-increasing demands of critical application processing.
Rocco Alonzi has worked in the data center environment for the past 10 years, most recently as the AVP of Data Center Operations at Sun Life. He helped develop and implement the strategies that helped Sun Life achieve Tier III Certification of Constructed Facility. Prior to joining Sun Life, Mr. Alonzi worked for a large Canadian bank. During his 15 years there, he held many positions, including manager of Data Center Governance, where he was responsible for developing a team responsible for securing, managing, and maintaining the bank’s raised-floor environment. As a member of the Uptime Institute Network, Mr. Alonzi has strongly advocated the idea that IT and M&E must be considered as one in data center spaces.
Paolo Piro joined Sun Life in May of 2013 as a senior data center governance analyst to help establish a governance framework and optimize organizational processes relating to Sun Life’s data centers. Prior to joining Sun Life, Mr. Piro worked 25 years at a large Canadian bank. In 2004, he became involved in data centers when, as a team lead, he became responsible for establishing governance controls, implementing best practices, and optimizing the care and feeding of the data center raised floor. In 2011, he increased his exposure and knowledge in this space by taking on the role of data center manager, where for the next 2 years he managed a team of resources and a consolidated budget allocated for maintaining and caring for the raised floor environment.