When an Australian Government Department Required Operational Sustainability, Metronode Delivered

Senior facility manager calls achieving Tier Certification of Operational Sustainability “a dream”

By Kevin Heslin

The New South Wales (NSW) Department of Finance, Services and Innovation (DFSI) is a government service provider and regulator for the southeastern Australian state. DFSI supports many government functions, including sustainable government finances, major public works and maintenance programs, government procurement, and information and communications technology.

Josh Griggs, managing director of Metronode; Glenn Aspland, Metronode’s senior facility manager; and Derek Paterson, director–GovDC & Marketplace Services at DFSI, discuss how Metronode responded when the NSW government decided to consolidate its IT operations.

Josh, can you tell me more about Metronode and its facilities?

Griggs, Metronode managing director: Metronode was established in 2002. Today we operate 10 facilities across Australia. Having designed and constructed our data centers, we have significant construction and engineering expertise. We also offer services and operational management. Our Melbourne 2 facility was the first Uptime Institute Tier III Certified Constructed Facility in Australia in 2012, and our Illawarra and Silverwater facilities were the first in the Asia Pacific region to be Tier III Gold Certified for Operational Sustainability by Uptime Institute. We have the most energy-efficient facilities in Australia, with a 4.5 National Australian Built Environment Rating System (NABERS) rating for data centers.*

We have two styles of data center. Generation 1 facilities are typically closer to the cities and have very high connectivity. If you were looking to connect to the cloud or for backup, the Gen 1s fit that purpose. As a result, we have a broad range of clients, including multinationals, local companies, and a lot of government.

And then, we have Generation 2 Bladeroom facilities across five sites including facilities in the Illawarra and at Silverwater, which host NSW’s services. At present, we’ve got 3 megawatts (MW) of IT load in Silverwater and 760 kilowatts (kW) in the Illawarra. With additional phasing, Silverwater could host 15 MW and Illawarra 8 MW.

We are engineered for predictability. Our customers rely on us for critical environments. This is one of the key reasons we went through Uptime Institute Certification. It is not enough for us to say we designed, built, and operate to the Tier III Standards; it is also important that we verify it. It means we have a Concurrently Maintainable site and it is Operationally Sustainable.

Tell me about the relationship between the NSW government and Metronode.

Griggs, Metronode managing director: We entered a partnership with the NSW government in 2012, when we were selected to design and construct their facilities. They had very high standards and groundbreaking requirements that we’ve been able to meet and exceed. They were the first to specify Uptime Institute Tier III Certification of Constructed Facility, the first to specify NABERS, and the first to specify Tier Certification of Operational Sustainability as contract requirements.

Paterson, DFSI: A big part of my role at DFSI is to consolidate agencies’ legacy data center infrastructure into two strategic data centers owned and operated by Metronode. Our initial objective was to consolidate more than 130 data centers into two. We’ve since broadened the program to include the needs of local government agencies and state-owned companies.

When you look across the evolving landscape of requirements, Metronode was best equipped to support agencies’ needs: meeting energy-efficiency targets and providing highly secure physical environments, while meeting service level commitments in terms of overall uptime.

Are these dedicated or colocation facilities?

Paterson, DFSI: When Metronode sells services to the private sector, it can host these clients in the Silverwater facility. At this point, though, Silverwater is 80% government, and Illawarra is a dedicated government facility.

What drove the decision to earn the Tier III Certification of Operational Sustainability?

Paterson, DFSI: We wrote the spec for what we wanted with regard to security, uptime, and service level agreements (SLA). Our contract required Metronode’s facilities to be Tier III Certified for design and build and Tier III Gold for operations. We benefited, of course; however, Metronode is also reaping rewards from both the Tier III Certification of Constructed Facility and Tier III Gold Certification of Operational Sustainability.

Griggs, Metronode managing director: Obviously the contractual requirement was one driver, but that’s not the fundamental driver. We needed to ensure that our mission critical environments are always operating. So Operational Sustainability ensured that we have a reliable, consistent operation to match our Concurrently Maintainable baseline, and our customers can rely on that.

Our operations have been Certified and tested in a highly rigorous process that ensured we had clear and effective processes documented, along with the flow charts to enable systems maintenance, systems monitoring, fault identification, etc. The Tier III Gold Certification also evaluates the people side of the operation, including skill assessment, acquisition of the right people, training, rostering, contracting, and the management in place, as well as continuous improvement.

In that process, we had to ask ourselves if what we were actually doing was documented clearly and followed by everybody. It reached across everything you can think of in terms of operating a data center to ensure that we had all of that in place.

There are only 23 facilities in the world to have this Certification and only two in the Asia Pacific, which demonstrates how hard it is to get. And we wanted to be the best.

Glenn, how did you respond when you learned about DFSI’s requirement for Tier III Gold Certification of Operational Sustainability?

Aspland, Metronode senior facility manager: I thought, this is fantastic. Someone has actually identified a specific and comprehensive set of behaviors that can minimize risk to the operations of the data center without being prescriptive.

When I was handed the Operational Sustainability assignment, the general manager said, “This is your goal.” Operations has been my career, so from my point of view it was a fantastic opportunity to turn years of knowledge into something real: get all the information out, analyze and benchmark it, find the highs and lows, and learn why they exist. For me it was a passion.

I was new to the company at that point and the process started that night. I just couldn’t put the Tier materials down. I was quite impressed. After that, we spent 2-3 months organizing meetings and drilling into what I felt we needed to do to run our data centers in an Operationally Sustainable manner.

And then we began truly analyzing and assessing many of the simple statements in the briefing. For example, what does “fully scripted maintenance” mean?

What benefits did Metronode experience?

Paterson, DFSI: Metronode’s early deployments didn’t use Bladeroom, so this was a new implementation and there was a lot to learn. For them to design and build these new facilities at Tier III and then get the Tier III Gold Certification for Operational Sustainability showed us the rigor and discipline the team has.

Did the business understand the level of effort?

Aspland, Metronode senior facility manager: Definitely not. I had to work hard to convince the business. Operational Sustainability has put operations front and center. That’s our core business.

One of the biggest challenges was getting stakeholder engagement. Selling the dream.

I had the dream firmly in my mind from reading the document. But we had to get busy people to do things in a way that’s structured and document what they do. Not everyone rushes to that.

Practically, what did “selling the dream” require?

Aspland, Metronode senior facility manager: The approach is practical. We have to do what we do, and work through it with all the team. For instance, we did training on a one-on-one basis, but it wasn’t documented. So we had to ask ourselves: what do we teach? And then we have to produce training documents. After 12 months, we should know what we do and what our vendors do. But how do we know that they are doing what they are supposed to do? How do we validate all this? We have to make it a culture. That was probably the biggest change. We made the documents reflect what we actually do. Not just policy words about uptime and reliability.

Are there “aha” moments?

Aspland, Metronode senior facility manager: Continually. I am still presenting what we do and what we’ve done. Usually the “aha” happens when someone comes on site to perform a major project and follows detailed instructions that tell them exactly what to do, what button to punch, and in what switchboard and what room. And 24 hours later, when they’ve relied on the document and nothing goes wrong, there’s the “aha” moment. This supported me. It made it easy.

How do you monitor the results?

Aspland, Metronode senior facility manager: We monitor our uptime continually. In terms of KPIs, we track maintenance fulfillment rates, open jobs every week, and how long they stay open. And every two months, we run financial benchmarks to compare against budget.

Our people are aware that we are tracking and producing reports. They are also aware of the audit. We have all the evidence of what we’ve done because of the five-day Operational Sustainability assessment.

For ISO 27001 (editor’s note: the ISO information security management standard), what they are really doing is checking our docs. We are demonstrating that all our maintenance is complete and document based, so we have all that evidence, and that’s now the culture.

What do you view as your competitive advantage?

Griggs, Metronode managing director: You can describe the Australian market as a mature market. There are quite a few providers. At Metronode, we design, build and operate the most secure and energy-efficient, low PUE facilities in the country, providing our customers with high-density data centers to meet their needs, both now and in the future.

Recognized by NABERS for data center energy efficiency with the highest rating of 4.5, we are also the only Australian provider to have achieved Uptime Institute Tier III Gold Certification for Operational Sustainability. There is one other national provider that looks like us. Then you have local companies that operate in local markets and some international providers that have one or two facilities and serve international customers with a presence in Australia.

In this environment, our advantages include geolocation, energy efficiency, security, reliability, Tier III Gold Certification, and flexibility.

The Tier III Gold Certification and NABERS rating are important, because there are some outlandish claims about PUE and Uptime Institute Certification—people who claim to have Tier III facilities without going through the process of having that actually verified. Similarly, we believe the NABERS rating is becoming more important because of people making claims that they cannot achieve in terms of PUE.

Finally, we are finding that people are struggling to forecast demand. Because we are able to go from 1 kW/rack up to 30 kW, our customers are able to grow within their current footprint. Metronode has engineered Australia’s most adaptive data center designs, with an enviable speed of construction. That ability to scale rapidly means we are able to meet growing requirements and often unforeseen customer demand.

* NABERS is an Australian rating system, managed nationally by the NSW Office of Environment and Heritage on behalf of federal, state, and territory governments. It measures the environmental performance of Australian buildings.

LinkedIn’s Oregon Data Center Goes Live, Gains EIT Stamp of Approval

Uptime Institute recently awarded its Efficient IT (EIT) Stamp of Approval to LinkedIn for its new data center in Infomart Portland, signaling that the new facility had exceeded extremely high standards for enterprise leadership, operations, and computing infrastructure. These standards are designed to help organizations lower costs, increase efficiency, and leverage technology for good stewardship of corporate and environmental resources. Uptime Institute congratulates LinkedIn on this significant accomplishment.

Sonu Nayyar, VP of Production Operations & IT for LinkedIn, said, “Our newest data center is a big step forward, allowing us to adopt a hyperscale architecture while being even more efficient about how we consume resources. Uptime Institute’s Efficient IT Stamp of Approval is a strong confirmation that our management functions of IT, data center engineering, finance, and sustainability are aligned. We look forward to improving how we source energy and ultimately reaching our goal of 100 percent renewable energy.”

John Sheputis, President of Infomart Data Centers, said, “We knew this project would be special from the moment we began collaborating on design. Efficiency and sustainability are achieved through superior control of resources, not sacrificing performance or availability. We view this award as further evidence of Infomart’s long-term commitment to working with customers to create the most sustainable IT operations in the world.”

LinkedIn expressed its enthusiasm about both this new project and the Efficient IT Stamp of Approval in the LinkedIn Engineering blog post below.

LinkedIn’s Oregon Data Center Goes Live

By Shawn Zandi and Mike Yamaguchi

Earlier this year we announced Project Altair, our massively scalable, next-generation data center design. We also announced our plans to build a new data center in Oregon, in order to more reliably deliver our services to our members and customers. Today, we’d like to announce that our Oregon data center, featuring the design innovations of Project Altair, is fully live and ramped. The primary criteria when selecting the Oregon location were: procuring a direct access contract for 100% renewable energy, network diversity, expansion capabilities, and talent opportunities.

LinkedIn’s Oregon data center, hosted by Infomart, is the most technologically advanced and highly efficient data center in our global portfolio, and includes a sustainable mechanical and electrical system that is now the benchmark for our future builds. We chose to utilize the ChilledDoor from MotivAir, a rear-door heat exchanger that neutralizes heat close to the source. The advanced water-side economizer cooling system communicates with outside air sensors to utilize Oregon’s naturally cool temperatures, instead of using energy to create cool air. Incorporating efficient technologies such as these enables our operations to run at a PUE (power usage effectiveness) of 1.06 during full economization mode.
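
For context, PUE is simply the ratio of total facility power to IT equipment power, so a value of 1.06 means the site adds only about 6% of overhead on top of the IT load. A minimal worked example, using assumed round numbers rather than LinkedIn’s published figures:

```latex
% Illustrative only: the 5,000 kW IT load and 300 kW overhead are assumed figures.
\mathrm{PUE} = \frac{P_{\mathrm{total\ facility}}}{P_{\mathrm{IT}}}
             = \frac{P_{\mathrm{IT}} + P_{\mathrm{cooling}} + P_{\mathrm{distribution\ losses}}}{P_{\mathrm{IT}}}

% Example: 5,000 kW of IT load with 300 kW of cooling and distribution overhead
\mathrm{PUE} = \frac{5000 + 300}{5000} = 1.06
```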

Implementing the Project Altair next-generation data center design enabled us to move to a widely distributed non-blocking fabric with uniform chipset, bandwidth, and buffering characteristics in a minimalistic and elegant design. We encourage the most minimalistic and simple approaches in infrastructure engineering because they are easy to understand and hence easier to scale. Moving to a unified software architecture for the whole data center allows us to run the same set of tools on both end systems (servers) and intermediate systems (network).

The shift to simplification in order to own and control our architecture motivated us to also use our own 100G Ethernet open switch platform, called Project Falco. The advantages of using our own software stack are numerous: faster time to market, uniformity, simplification in feature requirements as well as deployment, and controlling our architecture and scale, to name a few.

In addition to the infrastructure innovation mentioned earlier, our Oregon data center has been designed and deployed to use IPv6 (next generation internet protocol) from day one. This is part of our larger vision to move our entire stack to IPv6 globally in order to retire IPv4 in our existing data centers. The move to IPv6 enabled us to run our application stack and our private cloud, LPS (LinkedIn Platform as a Service), without the limitations of traditional stacks.

As we moved to a distributed security system by creating a distributed firewall and removing network load balancers, the process of preparing and migrating our site to the new data center became a complicated task. It required significant automation and code development, as well as changes to procedures and software configurations, but ultimately reduced our infrastructure costs and gave us additional operational flexibility.

Our site reliability engineers and systems engineering teams introduced a number of innovations in deployment and provisioning, which allowed us to streamline the software buildout process. This, combined with zero touch deployment, resulted in a shorter timeline and smoother process for bringing a data center live than we’ve ever achieved before.

Acknowledgements

We’re delighted to participate in the growth of Oregon as a next-generation sustainable technology center. The Uptime Institute has recognized LinkedIn with an Efficient IT award for our efforts! This award evaluates the management processes and organizational behaviors needed to achieve lasting reductions in cost, utilities, staff time, and carbon emissions for data centers.

The deployment of our new data center was a collective effort of many talented individuals across LinkedIn in several organizations. Special thanks to the LinkedIn Data Center team, Infrastructure Engineering, Production Engineering and Operations (PEO), Global Ops, AAAA team, House Security, SRE organization, and LinkedIn Engineering.

Airline Outages FAQ: How to Keep Your Company Out of the Headlines

Uptime Institute has prepared this brief airline outages FAQ to help the industry, media, and general public understand the reasons that data centers fail.

The failure of a power control module on Monday, August 8, 2016, at a Delta Airlines data center caused hundreds of flight cancellations, inconvenienced thousands of customers, and cost the airline millions of dollars. And while equipment failure is the proximate cause of the data center failure, Uptime Institute’s long experience evaluating the design, construction, and operations of facilities suggests that many enterprise data centers are similarly vulnerable because of construction and design flaws or poor operations practices.

What happened to Delta Airlines?

While software is blamed for many well-publicized IT problems, Delta Airlines is blaming a piece of infrastructure hardware. According to the airline, “…a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power. The universal power was stabilized and power was restored quickly. But when this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.” We take Delta at its word.

Some of the first reports blamed switchgear failure or a generator fire for the outage. Later reports suggested that critical services were housed on single-corded servers or that both cords of dual-corded servers were plugged into the same feed, which would explain why backup power failed to keep some critical services on line.

However, pointing to the failure of a single piece of equipment can be very misleading. Delta’s facility was designed to have redundant systems, so it should have remained operational had it performed as designed. In short, a design flaw, construction error or change, or poor operations procedures set the stage for the catastrophic failure.

What does this mean?

Equipment like Delta’s power control module (more often called a UPS) should be deployed in a redundant configuration to allow for maintenance or to support IT operation in the event of a fault. However, IT demand can grow over time so that the initial redundancy is compromised and each piece of equipment is overloaded when one fails or is taken off line. At this point, only Delta really knows what happened.
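
To make the arithmetic of eroding redundancy concrete, here is a minimal sketch; the module rating and load figures are invented for illustration and are not Delta’s actual configuration:

```python
# Hypothetical illustration of redundancy eroding as IT load grows.
# The ratings and loads below are invented for the example; they are not Delta's figures.

MODULE_RATING_KW = 750   # capacity of each power module (UPS)
MODULES = 2              # redundant design: either module alone should carry the load


def survives_single_failure(it_load_kw: float) -> bool:
    """Return True if the remaining module can carry the full IT load
    after one module fails or is taken offline for maintenance."""
    remaining_capacity_kw = (MODULES - 1) * MODULE_RATING_KW
    return it_load_kw <= remaining_capacity_kw


# Day one: the design intent holds.
print(survives_single_failure(600))   # True  -> a single failure is survivable

# Years later, after unmanaged load growth:
print(survives_single_failure(900))   # False -> losing one module now drops the load
```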

Organizations can compromise redundancy in this way by failing to track load growth, lacking a process to manage load growth, or making a poor business decision because of unanticipated or uncontrolled load growth.

Delta Airlines is like many other organizations. Its IT operations are complex and expensive, change is constant, and business demands are constantly growing. As a result, IT must squeeze every cent out of its budgets, while maintaining 24x7x365 operations. Inevitably these demands will expose any vulnerability in the physical infrastructure and operations.

Failures like the one at Delta can have many causes, including a single point of failure, lack of totally redundant and independent systems, or poorly documented maintenance and operations procedures.

 

Delta’s early report ignores key issues: why a single equipment failure could cause such damage, what could have been done to prevent it, and how Delta will respond in the future.

Delta and the media are rightly focused on the human cost of the failure. Uptime Institute is sure that Delta’s IT personnel will be trying to reconstruct the incident once operations have been restored and customer needs met. This incident will cost millions of dollars and hurt Delta’s brand; Delta will not want a recurrence. They have only started to calculate what changes will be required to prevent another incident and how much it will cost.

Why is Uptime Institute publishing this FAQ?

Uptime Institute has spent many years helping organizations avoid this exact scenario, and we want to share our experience. In addition, we have seen many published reports that just miss the point of how to achieve and maintain redundancy. Equipment fails all the time. Facilities that are well designed, built, and operated have systems in place to minimize data center failure. Furthermore, these systems are tested and validated regularly, with special attention to upgrades, load growth, and modernization plans.

How can organizations avoid catastrophic failures?

Organizations spend millions of dollars to build highly reliable data centers and keep them running. They don’t always achieve the desired outcome because of a failure to understand data center infrastructure.

The temptation to reduce costs in data centers is great because data centers require a great deal of energy to power and cool servers and they require experienced and qualified staff to operate and maintain.

Value engineering in the design process and mistakes and changes in the building process can result in vulnerabilities even in new data centers. Poor change management processes and incomplete procedures in older data centers are another cause for concern.

Uptime Institute believes that independent third-party verifications are the best way to identify the vulnerabilities in IT infrastructure. For instance, Uptime Institute consultants have Certified that more than 900 facilities worldwide meet the requirements of the Tier Standards. Certification is the only way to ensure that a new data center has been built and verified to meet business standards for reliability. In almost all these facilities, Uptime Institute consultants made recommendations to reduce risks that were unknown to the data center owner and operators.

Over time, however, even new data centers can become vulnerable. Change management procedures must help organizations control IT growth, and maintenance procedures must be updated to account for equipment and configuration changes. Third-party verifications such as Uptime Institute’s Tier Certification for Operational Sustainability and Management & Operations Stamp of Approval ensure that an organization’s procedures and processes are complete, accurate, and up-to-date. Maintaining up-to-date procedures is next to impossible without management processes that recognize that data centers change over time as demand grows, equipment changes, and business requirements change.

I’m in IT, what can I do to keep my company out of the headlines?

It all depends on your company’s appetite for risk. If IT failures would not materially affect your operations, then you can probably relax. But if your business relies on IT for mission-critical services, whether for external or internal customers, then you should consider having Uptime Institute evaluate your data center’s management and operations procedures and then implement the recommendations that result.

If you are building a new facility or renovating an existing facility, Tier Certification will verify that your data center will meet your business requirements.

These assessments and certifications require a significant organizational commitment in time and money, but these costs are only a fraction of the time and money that Delta will be spending as the result of this one issue. In addition, insurance companies increasingly have begun to reduce premiums for companies that have their operational preparedness verified by a third party.

Published reports suggest that Delta thought it was safe from this type of outage, and that other airlines may be similarly vulnerable because of skimpy IT budgets, poor prioritization, and bad processes and procedures. All companies should undertake a top-to-bottom review of their facilities and operations before spending millions of dollars on new equipment.

Bulky Binders and Operations “Experts” Put Your Data Center at Risk

Wearable technology and digitized operating procedures ensure compliance with standardized practices and provide quick access to critical information

By Jose Ruiz

No one in the data center industry questions that data center outages are extremely costly. Numerous vendors report that the average data center outage costs hundreds of thousands of dollars, with losses in some circumstances totaling much more. Uptime Institute Network Abnormal Incident Reports data show that most downtime instances are caused by human error, a key vulnerability in mission-critical environments. In addition, Compass Datacenters believes that many failures that enterprises attribute to equipment failure could more accurately be considered human error. However, most companies prefer to blame the hardware rather than their processes.

Figure 1. American Electric Power is one of the largest utilities in the U.S., with a service territory covering parts of 11 states.

Whether one accepts higher or lower outage costs, it is clear that reducing human error as a cause of downtime has a tremendous upside for the enterprise. As a result, data center owners dedicate a great deal of time to developing maintenance and operation procedures that are clear, effective, and easy to follow. Compass Datacenters is incorporating wearable technology in its efforts to make the procedures convenient for technicians to access and execute. During the process, Compass learned that the true value of introducing wearables into data center operations comes from standardizing the procedures. Incorporating wearable technology enhances the ability of users to follow the procedures. In addition, allowing users to choose the device most useful for the task at hand improves their comfort with the new process.

THE PROBLEM WITH TRADITIONAL OPS DOCUMENTATION

There are two major problems with the current methods of managing and documenting data center procedures.

  • Many data center operations teams document their procedures and methodologies in large three-ring binders. Although their presence is comforting, Operations staffs rarely consult them.
  • Also, organizations often rely on highly detailed written documentation presented in such depth that Operations staff wastes a great deal of time trying to locate the appropriate information.

In both instances, Operations personnel deem the written procedural guidelines inconvenient. As a result, actual data center operation tends to be more experience-based. Bluntly put, too many organizations rely too much on the most experienced or the most insistent Operations person on site at the time of an incident.

Figure 2. Compass Datacenters’ use of electronics and checklists clarifies the responsibilities of Operations personnel doing daily rounds.

On the whole, experience-based operations are not all bad. Subject matter experts (SME) emerge and are valued, with operations running smoothly for long periods at a time. However, as Julian Kudritzki wrote in “Examining and Learning from Complex Systems Failures” (The Uptime Institute Journal, vol. 5, p. 12), “‘Human error’ is an insufficient and misleading term. The front-line operator’s presence at the site of the incident ascribes responsibility to the operator for failure to rescue the situation. But this masks the underlying causes of an incident. It is more helpful to consider the site of the incident as a spectacle of mismanagement.” What’s more, subject matter experts may leave or get sick or simply have a bad day. In addition, most of their peers assume the SME is always right.

Figures 3-5. The new system provides reminders of monthly tasks, prompts personnel to take required steps, and then requires verification that the steps have been taken.

Of course, no one goes to work planning to hurt a coworker or cause an outage, but errors are made. Compass Datacenters believes its efforts to develop professional quality procedures and use wearable technology reduce the likelihood of human error. Compass’s intent is to make it easy for operators to handle both routine and foreseeable abnormal operations by standardizing where possible and improvising only when absolutely necessary and by making it simple and painless for technicians to comply with well-defined procedures.

AMERICAN ELECTRIC POWER

American Electric Power (AEP), a utility based in Columbus, OH, that provides electricity to more than 5.4 million customers in 11 states, inspired Compass to bring wearable technology to the data center to help the utility reduce human error. The initiative was a first for Compass, and AEP’s collaboration was essential to the product development effort. Compass and AEP began by adopting the checklist approach used in the airline business, among many others. This approach to avoiding human error is predicated on the use of carefully crafted operational procedures. Foreseeable critical actions by the pilot and copilot are strictly prescribed to ensure that all processes are followed in the same manner, every time, no matter who is in the cockpit. They standardize where possible and improvise only when required. Airline procedural techniques also put extra focus on critical steps to eliminate error. While this professional approach to operations has been widely adopted by a host of industries, it has not yet become a standard in the data center business.

Compass and AEP partnered with ICARUS Ops to develop the electronic checklist and wearables programs. ICARUS Ops was founded by two former U.S. Navy carrier pilots, who use their extensive experience in developing standardized aviation processes to develop wearable programs across a broad spectrum of high-reliability and mission-critical industries. During the process, ICARUS Ops brought on board several highly experienced consultants, including military and commercial pilots and a former Space Shuttle mission controller. At the start of the project, Compass, ICARUS Ops, and AEP worked together to identify the various maintenance and operations tasks that would be accessible via wearables. Team members reviewed all existing emergency operating procedures (EOP). The three organizations held detailed discussions with AEP’s data center personnel to help identify the procedures best suited to run electronically. Including operators in the process ensured that their perspectives would be part of the final procedures, making it easier for them to execute correctly. Also, this early involvement led to a high level of buy-in, which would be needed for this project to succeed.

ELECTRONIC PROCEDURES

The next step was converting the procedures to electronic checklists. The transition was accomplished using the ICARUS Ops application and web-based Mission Control. The project team wanted each checklist to

  • Be succinct and written with the operator in mind
  • Identify the key milestone in the process and use digital technology to verify critical steps
  • Use condition, purpose, or desired effect statements to assure the proper checklist is utilized
  • Identify common areas of confusion and provide links to tools such as animations and videos to clarify how and why the procedure is to be performed
  • Make use of Job Safety Analysis (JSA) to identify areas of significant risk prior to the required step/action. These included:
    o Warnings to indicate where serious injury is possible if all procedures are not followed specifically
    o Cautions indicating where significant data loss or damage is possible
    o Notes to add any additional information that may be required or helpful
  • Condense standard operating procedure (SOP) and EOP items to prevent the user from being overwhelmed during critical and emergency scenarios.

AEP SMEs walked through each and every checklist to provide an additional quality control step.
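
As a rough illustration of how such a checklist might be represented once digitized, the sketch below models a single step with its condition statement, verification, warnings, cautions, and notes. The field names and the example procedure are hypothetical assumptions, not the ICARUS Ops schema:

```python
# Hypothetical sketch of an electronic checklist; not the ICARUS Ops data model.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChecklistStep:
    action: str                                           # succinct instruction written for the operator
    verification: str                                     # how the critical step is digitally verified
    warnings: List[str] = field(default_factory=list)     # risk of serious injury
    cautions: List[str] = field(default_factory=list)     # risk of data loss or damage
    notes: List[str] = field(default_factory=list)        # additional helpful information
    media_links: List[str] = field(default_factory=list)  # animations/videos that clarify the step


@dataclass
class Checklist:
    title: str
    condition: str               # condition/purpose statement that selects this checklist
    steps: List[ChecklistStep]


# Invented example of an emergency operating procedure entry:
eop = Checklist(
    title="Loss of utility power - EOP",
    condition="Use when the facility transfers to generator unexpectedly",
    steps=[
        ChecklistStep(
            action="Confirm generator breaker G1 has closed",
            verification="Photograph breaker status and log the timestamp",
            warnings=["Arc-flash PPE required at the switchboard"],
            notes=["Expected transfer time is under 10 seconds"],
        )
    ],
)
print(eop.title, "-", len(eop.steps), "step(s)")
```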

Figure 6. Individual steps include illustrations and diagrams to eliminate any confusion.

Having SOPs and EOPs in a wearable digital format allows users to access them whenever an abnormal or emergency situation occurs, so no time is lost looking for the three-ring binder and proper procedure. In critical situations and emergencies, time is crucial and delay can be costly. Hard-copy information manuals and operating handbooks are bulky and can potentially overwhelm technicians looking for specific information. The bulk and formatting can cause users to lose their place on a page and skip over important items. By contrast, the digital interface provides quick access to users and allows for quicker navigation to desired materials and information via presorted emergency and abnormal procedures.

At the conclusion of the development phase of the process (about a month), 150 of AEP’s operational procedures had been converted into digital procedures that were conveniently accessible on mobile devices, on- or offline. Most importantly, a digital continuous improvement process was put in place to assure that lessons learned would be easy to incorporate into the system.

The process of entering the procedures took about 2 weeks. First, a data entry team put them in as written. Then Compass worked with AEP and ICARUS Ops checklist experts to tighten up the language and add decision points, warnings, cautions, and notes. This group also embedded pictures and video where appropriate and decided what data needed to be captured during the process. Updates are easy: each time a checklist or procedure is used, improvements can be approved and digitally published right away. This is critical. AEP believes that when employees know they are being heard, job satisfaction improves and they take more ownership and pride in the facility. The modifications and improvements will be continuous; the system is designed for constant improvement. Any time a technician has an idea to improve a checklist, he or she simply records the idea in the application, which instantly emails it to the SME for approval and publishing into the system. Other devices on the system sync automatically when they update.

WEARABLE TECHNOLOGY

Using wearable technology made the electronic procedures even more accessible to the end users. At the outset, AEP, ICARUS Ops, and Compass determined that AEP technicians would be able to choose between a wrist-mounted Android device and a tablet. The majority of users selected the wrist-mounted device. This preference was not unexpected, as a good portion of their work requires them to use their hands. The wrist units allow users to have both hands available to perform actual work at all times.

Figure 7. With all maintenance activities logged, a dashboard enables management to retrieve and examine reports.

Compass chose the Android platform for several reasons. First, from the very outset, Compass envisioned making Google Glass one of the wearable options. In addition, Android allows programs maximum freedom to use the hardware and software as required to meet customer needs; it is simply less constraining than working with iOS. Android devices are also very easy to procure, and manufacturers make Class 1, Division 1 Android tablets that are usable in hazardous environments.

The software-based technology enhances AEP’s ability to track all maintenance activity in its data center. Once online, each device communicates the recorded operational information (actions performed, components replaced, etc.) to the site’s maintenance system. This ensures that all routine maintenance is performed on schedule, and at the same time provides a detailed history of the activities performed on each major component within the facility. Web-based leadership dashboards provide oversight into what is complete, due, or overdue. The system is completely secure, as it does not reside on the same server or system as any other system in the data center. It can be encrypted or resident in the data center for increased security. Typically ICARUS Ops spends at least 3 days on site training the technicians and users, depending on the number of devices. Working together in that time normally leaves the team very comfortable with all components.

THE NEXT PHASE: GLASS

Although innovations such as Google Glass have increased awareness of visual displays, the concept dates back to the Vietnam War, when American fighter pilots began to use augmented reality on heads-up displays (HUDs). Compass Datacenters plans to harness the same augmented reality to enable maintenance technicians to complete a task entirely hands free using Google Glass and other advanced wearables.

It is certainly safe to assume that costs related to data center outages will continue to rise in the future. As a result, it will become more important to increase operator reliability and reduce the risk of human error. We strongly believe that professional quality procedures and wearables will make a tremendous contribution to eliminating the human error element as a major cause of interruption and lost revenue. Our vision is that other data center providers will decide to make this innovative checklist system a part of their standard offerings over the course of the next 12 to 24 months.


Jose Ruiz

Jose Ruiz serves as Compass Datacenters’ Vice President of Operations. He provides interdisciplinary leadership in support of Compass’ strategic goals. Mr. Ruiz began his tenure at Compass as the company’s director of Engineering. Prior to joining Compass he spent three years in various engineering positions and was responsible for a global range of projects at Digital Realty Trust.

Tier III Certified Facilities Prove Critical to Norwegian Colo’s Client Appeal

Green Mountain melds sustainability, reliability, competitive pricing, and independent certifications to attract international colo customers

By Kevin Heslin

Green Mountain operates two unique colo facilities in Norway, with a total potential capacity of several hundred megawatts. Though each facility has its own strengths, both embody the company’s commitment to providing secure, high-quality service in an energy-efficient and sustainable manner. CEO Knut Molaug and Chief Sales Officer Petter Tømmeraas recently took time to explain how Green Mountain views the relationship between cost, quality, and sustainability.

 

Tell our readers about Green Mountain.

KM: Green Mountain focuses on the high-end data center market, including banking/finance, oil and gas, and other industries requiring high availability and high quality services.

PT: IT and cloud are also very big customer segments. We think the US and European markets are the biggest for us, but we also see some Asian companies moving into Europe that are really keen on having high-quality data centers in the area.

KM: Green Mountain Data Centers operates two data centers in Norway. Data Center 1 in Stavanger began operation in 2013 and is located in a former underground NATO ammunition storage facility inside a mountain on the west coast. Data Center 2, a more traditional facility, is located in Telemark, which is in the middle of Norway.

Today DC1-Stavanger is a high-security colocation data center housing 13,600 square meters (m2) of customer space. The infrastructure can support up to 26 megawatts of IT load today. The main data center comprises three two-story concrete buildings built inside the mountain, with power densities ranging from 2-6 kW/m2, but the facility can support up to 20 kW/m2. NATO put a lot of money into creating their facilities inside the mountain, which probably saved us 1 billion kroner (US$150 million).

DC2-Telemark is located in a historic region of Norway and was built on a brownfield site with a 10-MW supply initially available. The first phase is a fully operational 10-MW Tier III Certified Facility, with four new buildings and up to 25 MW total capacity planned. This site could support even larger facilities if the need arises.

Green Mountain focuses a lot on being green and environmentally friendly, so we use 100% renewable energy in both data centers.

How do the unique features of the data centers affect their performance?

KM: Besides being located in a mountain, DC1 has a unique cooling system. We use the fjords for cooling year round, which gives us 8°C (46 °F) water for cooling. The cooling solution (including cooling station, chilled water pipework and pumps) is fully duplicated, providing an N+N solution. Because there are few moving parts (circulating pumps) the solution is extremely robust and reliable. In-row cooling is installed to client specification using Hot Aisle technology.

We use only 1 kilowatt of power to produce 100 kilowatts of cooling. So the data center is extremely energy efficient. In addition, we are connected to three independent power supplies, so DC1 has extremely robust power.
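
Expressed as a worked equation, the cooling figure quoted above implies a coefficient of performance (COP) of roughly 100, far higher than a conventional chiller plant, and a correspondingly small cooling contribution to PUE (a back-of-envelope estimate that assumes the heat removed roughly equals the IT load):

```latex
% Coefficient of performance implied by the quoted figures
\mathrm{COP} = \frac{\text{cooling delivered}}{\text{electrical input}}
             = \frac{100\ \mathrm{kW}}{1\ \mathrm{kW}} = 100

% Rough cooling contribution to PUE, assuming heat removed ~ IT load
\Delta\mathrm{PUE}_{\mathrm{cooling}} \approx \frac{1}{\mathrm{COP}} = 0.01
```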

DC2 probably has the most robust power supply in Europe. We have five independent hydropower plants within a few kilometers of the site, and the two closest are just a few hundred meters away.

How do you define high quality?

PT: High quality means Uptime Institute Tier Certification. We are not only saying we have very good data centers. We’ve gone through a lot of testing so we are able to back it up, and the Uptime Institute Tier Standard is the only standard worldwide that certifies data center infrastructure to a certain quality. We’re really strong on certifications because we don’t only want to tell our customers that we have good quality, we want to prove it. Plus we want the kinds of customers who demand proof. As a result, both our facilities are Tier III Certified.

Please talk about the factors that went into deciding to obtain Tier Certification.

KM: We have focused on high-end clients that require 100% uptime and are running high-availability solutions. Operations for this type of company generally require documented infrastructure.

The Tier III term is used a lot, but most companies can’t back it up. Having been through testing ourselves, we know that most companies that haven’t been certified don’t have a Tier III facility, no matter what they claim. When we talk to important clients, they see that as well.

What was the on-site experience like?

PT: When the Uptime Institute team was on site, we could tell that Certification was a quality process with quality people who knew what they were doing. Certification also helped us document our processes because of all the testing routines and scenarios. As a result, we know we have processes and procedures for all the thinkable and unthinkable scenarios, which would have been hard to do without this process.

Why do you call these data centers green?

KM: First of all, we use only renewable energy. Of course that is easy in Norway because all the power is renewable. In addition, we use very little of it, with the fjords as a cooling medium. We also built the data centers using the most efficient equipment, even though we often paid more for it.

PT: Green Mountain is committed to operating in a sustainable way, and this is reflected in everything we do. The good thing about operating in such a way is that our customers benefit financially. Because we bill power based on consumption, the more energy efficiently we operate, the smaller the bill to our customer. When we tell these companies that they can even save money by going for our sustainable solutions, their decision becomes easier.

More and more customers require that their new data center solutions be sustainable, but we still see that price is a key driver for most major customers. The combination of having very sustainable solutions and being very competitive on price is the best way of driving sustainability further into the minds of our customers.

All our clients reduce their carbon footprint when they move into our data centers and stop using their old and inefficient data centers.

We have a few major financial customers that have put forward very strict targets with regard to sustainability and that have found us to be the supplier that best meets these requirements.

KM: And, of course, making use of an already built facility was also part of the green strategy.

How does your cost structure help you win clients?

PT: It’s important, but it’s not the only important factor. Security and the quality we can offer are just as important, and being able to offer them with competitive pricing matters a great deal.

Were there clients who were attracted to your green strategy?

PT: Several of them, but the decisive factor for customers is rarely just one thing. We offer a combination of a really competitive offering and a high quality level, and a really sustainable, green solution. Being able to offer that at a competitive price is quite unique, because people often think they have to pay more for a sustainable green solution.

Are any of your targeted segments more attracted to sustainability solutions?

PT: Several of the international system integrators really like the combination. They want a sustainable solution, but they want the competitive offering. When they get both, it’s a no-brainer for them.

How does your sustainability/energy efficiency program affect your reliability? Do potential clients have any concerns about this? Do any require sustainability?

PT: Our programs do not affect our reliability in any way. We have chosen only to implement solutions that do not harm our ability to deliver the quality we promise to our customers. We have never experienced one second of SLA breach for any customer in any of our data centers. In fact, some of our most sustainable solutions, like the cooling system based on cold sea water, increase our reliability, as they considerably reduce the risk of failure compared to regular cooling systems. We have not experienced any concerns about these solutions.

Has Tier Certification proven critical in any of your client’s decisions?

PT: Tier Certification has proved critical in many of our clients’ decisions to move to Green Mountain. We see a shift in the market toward requiring Tier Certification, whereas it used to be more in the form of asking for Tier compliance, which anyone could claim without having to prove it. We think the future of quality data center providers will be to certify all their data centers.

Any customer with mission-critical data should require their suppliers to be Tier Certified. At the moment this is the only way for a customer to ensure that their data center is built and operated the way it should be in order to deliver the quality that the customer needs.

Are there other factors that set you apart?

PT: Operational excellence. We have an operational team that excels every time. They deliver a lot more than the customers expect, and we have customers that are extremely happy with their deliveries from us. I hear that from customers all the time, and that’s mainly because our operations team does a phenomenal job.

Uptime Institute testing criteria were very comprehensive and helped us develop our operational procedures to an even higher level, as some of the scenarios created during the Certification testing were used as a basis for new operational procedures and new tests that we now perform as part of our normal operating procedures.

Green Mountain definitely benefitted from the Tier process in a number of other ways, including training that gave us useful input to improve our own management and operational procedures.

What did you do to develop this team?

KM: When we decided to focus on high-end clients, we knew that we needed high-end experience, expertise, and knowledge on the operations side, so we focused on that when recruiting, as well as on building a culture inside the company focused on delivering high quality the first time, every time.

We recruited people with knowledge of how to operate critical environments, and we tasked them with developing those procedures and operational elements as a part of their efforts, and they have successfully done so.

PT: And the owners made the resources available, both financial and staff hours, to create the quality we wanted. We also have a very good management system, so management has good knowledge of what’s happening; if we have an issue, it will be very visible.

KM: We also have high-end equipment and tools to measure and monitor everything inside the data center as well as operational tools to make sure we can handle any issue and deliver on our promises.


Kevin Heslin

Kevin Heslin is Chief Editor and Director of Ancillary Projects at Uptime Institute. In these roles, he supports Uptime Institute communications and education efforts. Previously, he served as an editor at BNP Media, where he founded Mission Critical, a commercial publication dedicated to data center and backup power professionals. He also served as editor at New York Construction News and CEE and was the editor of LD+A and JIES at the IESNA. In addition, Heslin served as communications manager at the Lighting Research Center of Rensselaer Polytechnic Institute. He earned a B.A. in Journalism from Fordham University in 1981 and a B.S. in Technical Communications from Rensselaer Polytechnic Institute in 2000.

If You Can’t Buy Effective DCIM, Build It

After commercial DCIM offerings failed to meet RELX Group’s requirements, the team built its own DCIM tool based on the existing IT Services Management Suite

By Stephanie Singer

What is DCIM? Most people might respond “data center infrastructure management.” However, simply defining the acronym is not enough. The meaning of DCIM is greatly different for each person and each enterprise.

At RELX Group and the data centers managed by the Reed Elsevier Technology Services (RETS) team, DCIM is built on top of our configuration management database (CMDB) and provides seamless, automated interaction between the IT and facility assets in our data centers, server rooms, and telco rooms. These assets are mapped to the floor plans and electrical/mechanical support systems within these locations. This arrangement gives the organization the ability to manage its data centers and internal customers in one central location. Additionally, DCIM components are fully integrated with the company’s global change and incident management processes for holistic management and accuracy.

A LITTLE HISTORY

So what does this all mean? Like many of our colleagues, RETS has consulted with several DCIM providers over the years. Many of them enthusiastically promised a solution for our every need, creating excitement at the prospect of a state-of-the art system for a reasonable cost. But, as we all know, the devil is in the details.

I grew up in this space and have personally been on this road since the early 1980s. Those of you who have been on the same journey will likely remember using the Hitachi Tiger tablet with AutoCAD to create equipment templates. We called it “hardware space planning” at the time, and it was a huge improvement over the template cutouts from IBM, which we had to move around manually on an E-size drawing.

Many things have changed over the last 30 years, including the role of vendors in our operation. Utilizing DCIM vendors to help achieve infrastructure goals has become a common practice in many companies, with varying degrees of success. This has most certainly been a topic at every Uptime Institute Symposium as peers shared what they were doing and sought better go-forward approaches and planning.

OUR FIRST RUN AT PROCURING DCIM

Oh how I coveted the DCIM system my friend was using for his data center portfolio at a large North America-based retail organization. He and his team helpfully provided real world demos and resource estimates to help us construct the business case that was approved, including the contract resources for a complete DCIM installation. We were able to implement the power linking for our assets in all our locations. Unfortunately, the resource dollars did not enable us to achieve full implementation with the cable management. I suspect many of you reading this had similar experiences.

Nevertheless, we had the core functionality and a DCIM system that met our basic needs. We ignored inefficiencies, even though it took too many clicks to accomplish the tasks at hand and various groups still tracked the same information in different spreadsheets and formats across the organization.

About 4-1/2 years after implementation, our vendor reported the system we were using would soon be at end of life. Of course, they said, “We have a new, bigger, better system. Look at all the additional features and integration you can have to expand and truly have everything you need to run your data centers, manage your capacities, and drive costs out.”

Digging deeper, we found that driving cost out was not truly obtainable when we balanced the costs of fully collating the data, building the integration with other systems and processes, and maintaining the data and scripting that was required. With competing priorities and the drive for cost efficiencies, we went back to the drawing board and opened the DCIM search to other providers once again.

STARTING OVER WITH DCIM MARKET RESEARCH

The DCIM providers we looked at had all the physical attributes tied to the floor and rack space, cable management, asset life cycle, capacity planning, and various levels for the electrical/mechanical infrastructure. These tools all integrated with power quality and building management and automation systems, each varying slightly in their approach and data output. And many vendors offered bi-directional data movement from the service management tool suite and CMDB.

But our findings revealed potential problems such as duplicate and out-of-sync data. This was unacceptable. We wanted all the features our DCIM providers promised without suffering poor data quality. We also wanted the DCIM to fully integrate with our change and incident management systems so we could look holistically at potential root causes of errors. We wanted to see where the servers were located across the data center, if they were in alarm, when maintenance was occurring, and whether a problem was resolved. We wanted the configuration item attributes for maintenance, end of life, contract renewals, procedures, troubleshooting guidelines, equipment histories, business ownerships, and relationships to be 100% mapped globally.

RETS (Reed Elsevier–Technology Services) has always categorized data center facilities as part of IT, separate from Corporate Real Estate. Therefore, all mechanical and electrical equipment within our data centers and server rooms is treated as configuration items (CIs) as well. This includes generators, switchgear, uninterruptible power systems (UPS), power distribution units (PDU), remote power panels (RPP), breakers, computer room air conditioners (CRAC), and chillers. Breaking away from the multiple sources we had been using for different facility purposes greatly improved our facility team's ability to manage our data centers and server rooms.
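To make that concrete, here is a minimal sketch, in Python, of what a facility CI record can look like when mechanical and electrical equipment is tracked alongside IT assets. The field names and example values are illustrative assumptions, not the schema of any particular CMDB product.

from dataclasses import dataclass, field
from typing import Optional

# Hypothetical CI record for facility equipment; fields are illustrative only.
@dataclass
class FacilityCI:
    ci_id: str                                  # unique identifier in the CMDB
    ci_class: str                               # "generator", "ups", "pdu", "rpp", "breaker", "crac", "chiller", ...
    site: str                                   # data center or server room
    grid_location: str                          # coordinate on the floor plan
    business_owner: str
    end_of_life: Optional[str] = None           # ISO date, drives life-cycle reporting
    maintenance_contract: Optional[str] = None  # contract renewal tracking
    procedures: list[str] = field(default_factory=list)  # linked knowledge articles
    feeds: list[str] = field(default_factory=list)        # ci_ids of downstream equipment

# Example: a UPS that feeds two PDUs, with its maintenance contract attached.
ups_a = FacilityCI("CI-1001", "ups", "DC-East", "B4", "Facilities",
                   end_of_life="2027-06-30", maintenance_contract="MAINT-778",
                   feeds=["CI-2001", "CI-2002"])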

NEW PARADIGM

This realization caused us to ask ourselves: What if we flipped the way DCIM is constructed? What if it is built on top of the Service Management Suite, so it is truly a full system that integrates floor plans, racks, and power distribution with the assets within the CMDB? Starting with this thought, we aggressively moved forward to completely map the DCIM system we currently had in place and customize our existing Service Management Suite.
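As a rough illustration of the flipped model, and assuming nothing about any vendor's tables, the floor plan and rack grid become lightweight records that simply point at CI identifiers in the CMDB, so placement, power, and ownership all resolve to the same source of truth.

# Illustrative structures only: racks on the floor plan reference CI ids,
# and every attribute of those CIs lives in the CMDB, not in a parallel tool.
cmdb = {
    "CI-3001": {"ci_class": "server", "owner": "Billing", "rack": "R12", "ru": 20},
    "CI-2001": {"ci_class": "pdu",    "owner": "Facilities"},
}

floor_plan = {
    "R12": {"grid": "D7", "ru_capacity": 42, "cis": ["CI-3001"]},
    "R13": {"grid": "D8", "ru_capacity": 42, "cis": []},
}

def devices_at_grid(grid: str) -> list[str]:
    """Drill down from a floor-plan grid square to the CIs installed there."""
    return [ci for rack in floor_plan.values() if rack["grid"] == grid
            for ci in rack["cis"]]

print(devices_at_grid("D7"))   # ['CI-3001']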

We started this journey in October 2014 and followed a Scrum software development process. Scrum is an iterative and incremental agile software development methodology. The 2-week sprints and constant feedback delivered useful functionality early and were key to this approach, making it easy to adapt quickly as our understanding changed.

Another important part of Scrum is to have a complete team set up, with a subject matter expert (SME) serving as the product owner to manage the backlog of features. Other team members included the Service Management Suite tool expert to design forms and tables, a user interface (UI) expert to design the visualization, and a Scrum master to manage the process. A web service expert joined the team to port the data from the existing DCIM system into the CMDB. All these steps were critical; however, co-locating the team with a knowledgeable product owner, who could give immediate answers and direction, really got us off and running!

We created our wish list of requirements, prioritizing those that enabled us to move away from our existing DCIM system.

Interestingly enough, we soon learned from our vendor that the end of life for our current 4-1/2 year-old DCIM system would be extended because of the number of customers that remained on that system. Sound familiar? The key was to paint the vision of what we wanted and needed it to be, while pulling a creative and innovative group of people together to build a roadmap for how we were going to get there.

It was easy to stay focused on our goals. We avoided scope creep by aligning our requirements with the capabilities that our existing tool provided. The requirements and capabilities that aligned were in scope. Those that didn’t were put on a list for future enhancements. Clear and concise direction!

The most amazing part was how many advantages using our Service Management Suite provided. We were getting all the configuration data linked, with confidence in its accuracy. This created excitement across the team, and the “wish” list of feature requests grew immensely! In addition to working on documented requests, our creative and agile team came back with several ideas for features we had not initially contemplated but that made great business sense. Many of these came together so easily that, by the time we went live on the new system, we had a significantly more advanced tool with automation features already built in.

FEATURES

Today we can pull information by drilling down on a floor plan to a device, which enables us to track the business owner, equipment relationships, application-to-infrastructure mapping, and application dependencies. This information allows us to really understand the impacts of adding, moving, modifying, or decommissioning equipment within seconds. It provides real-time views for our business partners when power changes occur, when maintenance is scheduled, and when a device is in alarm (see Figure 1).

Figure 1. DCIM visualization

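A sketch of the impact analysis described above, using a made-up relationship table rather than any product's real schema: walking the CI relationships downstream shows what an add, move, or decommission would touch.

# Hypothetical parent -> children relationships between CIs and applications.
relationships = {
    "CI-2001": ["CI-3001", "CI-3002"],     # PDU feeds two servers
    "CI-3001": ["APP-BILLING"],            # server hosts an application
    "CI-3002": ["APP-BILLING", "APP-CRM"],
}

def impacted(ci_id, seen=None):
    """Every downstream CI or application affected if ci_id is changed."""
    seen = set() if seen is None else seen
    for child in relationships.get(ci_id, []):
        if child not in seen:
            seen.add(child)
            impacted(child, seen)
    return seen

print(sorted(impacted("CI-2001")))
# ['APP-BILLING', 'APP-CRM', 'CI-3001', 'CI-3002']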

The ability to tie in CIs to all work scheduled through our Service Management Suite change requests and incidents provides a global outlook on what is being performed within each data center, server room, and telco room, and guarantees accuracy and currency. We turned all electrical and mechanical devices into CIs and assigned them physical locations on our floor plans (see Figure 2).

Figure 2. Incident reporting affected CI

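The change and incident linkage works along the same lines. Here is a minimal sketch, with hypothetical work records standing in for the Service Management Suite data, of how scheduled and open work rolls up per CI.

# Illustrative change/incident records keyed to CI ids.
work_items = [
    {"type": "change",   "ci": "CI-2001", "state": "scheduled"},
    {"type": "incident", "ci": "CI-3001", "state": "open"},
    {"type": "change",   "ci": "CI-3001", "state": "closed"},
]

def open_work_for(ci_id):
    """Everything scheduled or still open against a CI, for the global work outlook."""
    return [w for w in work_items if w["ci"] == ci_id and w["state"] != "closed"]

print(open_work_for("CI-3001"))   # only the open incident is returned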

Facility work requests are incorporated into the server install and decommissioning workflow. Furthermore, auto-discovery alerts us to devices that are new to the facility, so we are aware if something was installed outside of process.
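The out-of-process check itself can be thought of as simple set arithmetic; the serial numbers below are invented for illustration.

# Anything discovery sees that has no CI and no pending install request gets flagged.
discovered_serials = {"SN-100", "SN-101", "SN-102"}
cmdb_serials       = {"SN-100", "SN-101"}
pending_installs   = {"SN-101"}

installed_outside_process = discovered_serials - cmdb_serials - pending_installs
print(installed_outside_process)   # {'SN-102'} -> alert the facility team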

If an employee needs to add a new model to the floor plan, we have a facility management form for that process, where users can specify new device attributes and create the device within seconds.

Facility groups can modify floor plans directly from the visualization, providing dynamic updates to all users; Operations can monitor alarming and notifications 24×7 for any CI; and IT teams can view rack elevations for any server rack or storage array (see Figure 3). Power management, warranty tracking, maintenance, hardware dependencies, procedures, equipment relationships, contract management, and equipment histories can all be actively maintained within one system.

Figure 3. Front and rear rack elevation


Speaking of power management, our agile team was able to create an exact replica of our electrical panel schedules inside the DCIM without losing any of the functionality we had in Excel. This included the ability to update the current power draw for each individual breaker, redundant failover calculations, alarming, and relationship creation from PDU to breaker to floor-mount device.
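To give a feel for the failover math, here is a simplified sketch of the kind of check a panel schedule supports in an A/B design: either PDU must be able to carry both sides' load if its partner fails. The breaker rating and the 80% continuous-load limit are assumptions for illustration, not our actual values.

# Measured draw per breaker (amps) on the A and B panels of a redundant pair.
panel_a = {"BR-01": 4.2, "BR-02": 3.1}
panel_b = {"BR-01": 3.8, "BR-02": 2.9}
BREAKER_RATING_AMPS = 20.0   # assumed breaker rating
DERATE = 0.8                 # assumed continuous-load limit

def failover_alarms(primary, partner):
    """Breakers that would exceed the derated rating if the partner PDU failed."""
    alarms = []
    for breaker, draw in primary.items():
        combined = draw + partner.get(breaker, 0.0)
        if combined > BREAKER_RATING_AMPS * DERATE:
            alarms.append((breaker, round(combined, 1)))
    return alarms

print(failover_alarms(panel_a, panel_b))   # [] -> every breaker survives a failover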

Oh, and by the way, iPad capability is here. Technicians can update information as work is being performed on the floor, allowing Operations to know when a change request is actually in process and what progress has been made. And 100% automation is in full effect here! Technicians can also bring up equipment procedures to follow along the way, as these are tied to CIs through Service Management Suite knowledge articles.

Our Service Management Suite is fully integrated with Active Directory, so we can associate personnel with the individual site locations they manage. Self-service forms are also in place where users can add, modify, or delete vendor information for specific sites.

The Service Management Suite has already been integrated with our Real Estate management tool to bring in remote floor plans, site contacts, and resource usage for each individual location. Pulling power consumption per device at remote sites has been standardized as well, based on a determined estimate, to assist with consolidation efforts.

The self-service items include automated life cycle forms that allow users to actively track equipment adds, modifications, and removals, while also providing the functionality to link CIs together to form relationships from generator to UPS to PDU to breaker to rack to power strip to rack-mount device (see Figure 4).

Figure 4. Self-service facility CI management form 

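That relationship chain lends itself to a straightforward traversal. A sketch with invented CI names, tracing a rack-mount device back to its generator:

# Hypothetical child -> parent links captured through the self-service relationship forms.
parent_of = {
    "UPS-1":    "GEN-1",
    "PDU-1A":   "UPS-1",
    "BR-01":    "PDU-1A",
    "RACK-R12": "BR-01",
    "STRIP-1":  "RACK-R12",
    "SRV-3001": "STRIP-1",
}

def power_path(ci_id):
    """Trace a device back to its generator through every CI in between."""
    path = [ci_id]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path

print(" <- ".join(power_path("SRV-3001")))
# SRV-3001 <- STRIP-1 <- RACK-R12 <- BR-01 <- PDU-1A <- UPS-1 <- GEN-1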

Report creation on any CI and its associated relationships has been implemented for all users. Need to determine where to place a server? There’s a report that can be run for that as well!
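A placement report reduces to a filter over rack capacity data. The sketch below uses invented racks and an assumed pair of constraints (free rack units and power headroom); the real report naturally considers more dimensions.

# Illustrative rack capacity data.
racks = [
    {"rack": "R12", "free_ru": 6,  "headroom_kw": 1.2},
    {"rack": "R13", "free_ru": 30, "headroom_kw": 4.5},
    {"rack": "R14", "free_ru": 12, "headroom_kw": 0.3},
]

def placement_report(needed_ru, needed_kw):
    """Racks that can physically and electrically accept the new server, best fit first."""
    fits = [r for r in racks
            if r["free_ru"] >= needed_ru and r["headroom_kw"] >= needed_kw]
    return sorted(fits, key=lambda r: (r["free_ru"], r["headroom_kw"]), reverse=True)

print(placement_report(needed_ru=2, needed_kw=0.5))   # R13 first, then R12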

The streamlined processes allow users to maintain facility and hardware CIs with ease, and they truly provide a complete grasp of the activity occurring within our data centers on a daily basis.

I am quite proud of the small, but powerful, team that went on this adventure with us. As the leader of this initiative, I found it refreshing to see the idea start with a manager, who worked with a key developer to build workflows and, from there, turned DCIM on its head.

We pulled the hardware, network, and facility teams together with five amazing part-time developers for the “what if” brainstorm session, and the enthusiasm continued to build. It was truly amazing to observe this team. Within 30 days, we had a prototype that we shared with senior management and stakeholders, who fully supported the effort, and the rest is history!

It’s important to note that, for RETS, we now have a far superior tool set for a fraction of the cost of other DCIM tools. Because it is embedded in our Service Management Suite, we avoid additional maintenance, software, and vendor services costs… win, win, win!

Our DCIM is forever evolving: we have so far surpassed the requirements we originally set that we thought, “Why stop now?” Continuing the journey will bring service impact reports and alarming, and will incorporate our existing power monitoring application and building automation system, enhancing our ability to include the remote-location CIs that we manage. With all the advances we can make using our own system, I am looking forward to more productivity than ever before, and more than we can imagine right now!


Stephanie Singer joined Reed Elsevier, now known as RELX Group, in 1980 and has worked for Mead Data Central, LexisNexis, and Reed Elsevier–Technology Services during her career. She is currently the vice president of Global Data Center Services. In this role, she is responsible for global data center and server room facilities, networks, and cable plant infrastructure for the products and applications within these locations. She leads major infrastructure transformation efforts. Ms. Singer has led the data center facilities team since 1990, maintaining an excellent data center availability record throughout day-to-day operations and numerous lifecycle upgrades to the mechanical and electrical systems. She also led the construction and implementation of a strategic backup facility to provide in-house disaster recovery capability.