Tier Certification of Operational Sustainability enhances ENTEL’s services to customers
By Kevin Heslin
ENTEL began operations in 1964 as a provider of national and international long distance telephone services to companies in Chile. Today it is a consolidated provider of integrated telecommunications and information technology services, meeting the needs of corporations and large companies through tailored solutions that provide value, experience, and quality of service.
The company began operations in the aftermath of an earthquake that severely damaged the Chilean telecommunications network, and it has steadily added services since. Towards the end of 1995, ENTEL started to provide Internet connection services, and in 1997 it introduced the first commercial ATM network in Latin America, which evolved into the current Multiservice IP network to offer broadband solutions, guarantee quality of service, and add value to its clients. Today ENTEL offers Information Technology (IT) services, which vary according to each client’s industry segment and business model, delivering efficiencies and competitive advantages that differentiate ENTEL from traditional telecommunications providers.
As part of its portfolio, ENTEL also provides a range of data center services to more than 3,000 clients. It owns and operates a network of five data centers that are certified under the ISO 9001:2000 and ISO 27001 standards. In addition, two of them are Tier III Gold Certified Constructed Facilities and one has the M&O Stamp of Approval.
Data Center Infrastructure
ENTEL has five data centers located in Amunátegui, Pedro de Valdivia, Ñuñoa, Longovilo, and Ciudad de los Valles that are linked through high-availability, high-capacity IP/MPLS/DWDM networks. These data centers have in excess of 7,500 square meters (m2) of data hall space, with plans to expand to 11,675 m2. From these data centers, ENTEL offers IT outsourcing services, from server hosting to more complex services involving the operation and management of the platforms that support clients’ business processes. These data center services meet the needs of companies’ most vital applications, improving the security and protection of critical data and considerably reducing their infrastructure investments. ENTEL’s IT strategy is based on traditional data center services, growth in cloud services, and continuous innovation.
ENTEL first offered data center services in 2003 and is now Chile’s largest provider. To this end, ENTEL has staffed its facilities with more than 120 professionals and managers devoted to the implementation and operation of data center infrastructure projects.
Thanks to its team of specialists with extensive experience, ENTEL has met its IT challenges entirely in-house. Of particular note are the Ciudad de los Valles 1 and 2 facilities, each offering 2,000 m2 of white space. Ciudad de los Valles 1 (CDLV1) entered production in May 2010, and Ciudad de los Valles 2 (CDLV2) entered production in March 2013 (see The Uptime Institute Journal, vol. 2, p. 64). Both are Tier III Gold Certified Constructed Facilities (see Figure 1), and both received Tier Certification of Operational Sustainability in October 2015 (see Table 1). Josué Ramírez, Uptime Institute director of Business Development LATAM, said, “With these Certifications, ENTEL shows its commitment to seeking excellence in its operations to provide better services to its clients and to contribute to the development of knowledge and professionalism in the region of Latin America.”
Figure 1. CDLV 1 and 2 are both Tier III Gold Certified Constructed Facilities.
Rich Van Loo, Uptime Institute VP of Facility Management Services, said, “The Tier Certification and Operational assessments have had a broad impact on not only the data center management divisions, but the company environment as a whole. Operational procedures not only reduce risk but help improve consistency and efficiency in those operations. ENTEL is looking to expand this philosophy to all their data centers.”
Table 1
ENTEL began construction of a third facility in 2015 at Ciudad de los Valles. This will add an additional 2,000 m2 of white space. The new facility incorporates free-cooling technologies that achieve better energy efficiency and lower total costs of ownership for clients.
Tier Certification of Operational Sustainability
To differentiate itself from its competitors, ENTEL was the first to Tier Certify its facilities. It then decided to further differentiate itself by earning Uptime Institute’s Tier Certification of Operational Sustainability for CDLV 1 and 2. This decision was based on client demand for excellence of service, risk mitigation, and assurance of operational continuity.
ENTEL believes that
Risk management should be approached as a team, leaving no room for assumptions or improvisation
Planning and later exhaustive review of activities should have an integral perspective, which allows risk to be controlled, mitigated, and contained
Planning activities and teamwork generate a virtuous circle of shared knowledge and learning
Empirical verification shows that continuous improvement is key to mitigating human error in operations
Figure 2. The Tier Certification of Operational Sustainability verifies that ENTEL maximizes the potential of its facilities and differentiates it from its competitors.
The Tier Certification of Operational Sustainability (TCOS) was an excellent way to reach these goals, which is why ENTEL dedicated a new internal organization to defining data center operations activities and tasked a second team with leading the effort to earn the Certification.
In addition, ENTEL created an Infrastructure Change Control Board (CCI), specific to Datacenter Infrastructure Management, with the purpose of establishing planning and control of activities in each data center. This Board is a consulting body; it meets periodically and manages, reviews, and approves infrastructure management projects at the data centers.
Scope of the Infrastructure Change Control Board (CCI)
Although ENTEL had previously adopted the ITIL model and established a Change Advisory Board (CAB) that validates and approves all activities in the IT and data center infrastructure environment, it decided to create the CCI because of the degree of specialization of infrastructure work and the risk associated with it. The CCI’s focus is to raise risk points associated with high-impact activities and to identify control points (e.g., cells, transformers, generators, UPS, towers, and chillers) where system redundancy could be lost.
A summary of the type of work evaluated and documentation to be reviewed follows (see Table 2):
Table 2.
The CCI is intended to ensure that the good practices characteristic of Tier Certification of Operational Sustainability are respected and that all documentation and instructions are kept fully valid, applied consistently, and subject to continuous improvement.
The Process
In order to increase ENTEL’s familiarity with the Tier Certification of Operational Sustainability process, ENTEL and Uptime Institute agreed to carry out the process in four phases (see Figure 3).
Figure 3. ENTEL views operational sustainability as a four-phase continuous process.
Important Decisions
Eventually, Uptime Institute and ENTEL adjusted the process in light of the initial progress, the maturity level of the team, and the imperative work of assuring operational continuity:
The Operations team was split in two groups, both reporting to a single manager in charge of coordinating activities
Operations, maintenance, and testing teams were responsible for day-to-day activities
The Operations team was also entrusted with adjusting existing procedures/instructions, training the specialist staff, and leading the internal change within the data center.
ENTEL also
Retained an external consultant to structure its methodology and develop a map for sustainable training.
Implemented a maintenance management system
Defined a support team to manage potential setbacks through weekly follow-up, progress control, and resolution
Ensured that the Operational Sustainability processes, procedures/instructions, and methodologies would be adopted and applied
ENTEL believes that it received many benefits from earning the Tier Certification of Operational Sustainability, including
Improved “standard operation” processes
Continuous improvement of processes and procedures
Formal training, sustainable over time
Shared lessons learned at all five data centers
The knowledge of the Certification methodology and good practices stayed in house, and ENTEL can replicate it at all its data centers and eventually in other areas of the organization (see Figure 4). ENTEL has worked to share this information company wide, with the additional notable achievement of Uptime Institute’s M&O Stamp of Approval at its Amunátegui facility in December 2016.
Figure 4. The staff at ENTEL is charged with promoting Operational Sustainability across the whole organization.
When an Australian Government Department Required Operational Sustainability, Metronode Delivered
The New South Wales (NSW) Department of Finance, Services and Innovation (DFSI) is a government service provider and regulator for the southeastern Australian state. DFSI supports many government functions, including sustainable government finances, major public works and maintenance programs, government procurement, and information and communications technology.
Josh Griggs, managing director of Metronode; Glenn Aspland, Metronode’s senior facility manager; and Derek Paterson, director–GovDC & Marketplace Services at the NSW Department of Finance, Services and Innovation (DFSI) discuss how Metronode responded when the NSW government decided to consolidate its IT operations.
Josh, can you tell me more about Metronode and its facilities?
Griggs, Metronode managing director: Metronode was established in 2002. Today we operate 10 facilities across Australia. Having designed and constructed our data centers, we have significant construction and engineering expertise. We also offer services and operational management. Our Melbourne 2 facility was the first Uptime Institute Tier III Certified Constructed Facility in Australia in 2012, and our Illawarra and Silverwater facilities were the first in the Asia Pacific region to be Tier III Gold Certified for Operational Sustainability by Uptime Institute. We have the most energy-efficient facilities in Australia, with a 4.5 National Australian Built Environment Rating System (NABERS) rating for data centers.*
We have two styles of data center. Generation 1 facilities are typically closer to the cities and have very high connectivity. If you were looking to connect to the cloud or for backup, the Gen 1s fit that purpose. As a result, we have a broad range of clients, including multinationals, local companies, and a lot of government.
And then, we have Generation 2 Bladeroom facilities across five sites including facilities in the Illawarra and at Silverwater, which host NSW’s services. At present, we’ve got 3 megawatts (MW) of IT load in Silverwater and 760 kilowatts (kW) in the Illawarra. With additional phasing, Silverwater could host 15 MW and Illawarra 8 MW.
We are engineered for predictability. Our customers rely on us for critical environments. This is one of the key reasons we went through Uptime Institute Certification. It is not enough for us to say we designed, built, and operate to the Tier III Standards; it is also important that we verify it. It means we have a Concurrently Maintainable site and it is Operationally Sustainable.
Tell me about the relationship between the NSW government and Metronode.
Griggs, Metronode managing director: We entered a partnership with the NSW government in 2012, when we were selected to build their facilities. They had very high standards and groundbreaking requirements that we’ve been able to meet and exceed. They were the first to specify Uptime Institute Tier III Certification of Constructed Facility, the first to specify NABERS, and the first to specify Tier Certification of Operational Sustainability as contract requirements.
Paterson, DFSI: A big part of my role at DFSI is to consolidate agencies’ legacy data center infrastructure into two strategic data centers owned and operated by Metronode. Our initial objective was to consolidate more than 130 data centers into two. We’ve since broadened the program to include the needs of local government agencies and state-owned companies.
When you look across the evolving landscape of requirements, Metronode was best equipped to support agency needs of meeting energy-efficiency targets and providing highly secure physical environments, while meeting service level commitments in terms of overall uptime.
Are these dedicated or colocation facilities?
Paterson, DFSI: When Metronode sells services to the private sector, it can host these clients in the Silverwater facility. At this point, though, Silverwater is 80% government, and Illawarra is a dedicated government facility.
What drove the decision to earn the Tier III Certification of Operational Sustainability?
Paterson, DFSI: We wrote the spec for what we wanted in regards to security, uptime, and service level agreements (SLA). Our contract required Metronode’s facilities to be Tier III Certified for design and build and Tier III Gold for operations. We benefited, of course; however, Metronode is also reaping rewards from both the Tier III Certification of Constructed Facility and Tier III Gold Certification of Operational Sustainability.
Griggs, Metronode managing director: Obviously the contractual requirement was one driver, but that’s not the fundamental driver. We needed to ensure that our mission critical environments are always operating. So Operational Sustainability ensured that we have a reliable, consistent operation to match our Concurrently Maintainable baseline, and our customers can rely on that.
Our operations have been Certified and tested in a highly rigorous process that ensured we had clear and effective processes documented, along with the flow charts to enable systems maintenance, systems monitoring, fault identification, etc. The Tier III Gold Certification also evaluates the people side of the operation, including skill assessment, acquisition of the right people, training, rostering, contracting, and the management in place, as well as continuous improvement.
In that process, we had to ask ourselves if what we were actually doing was documented clearly and followed by everybody. It reached across everything you can think of in terms of operating a data center to ensure that we had all of that in place.
There are only 23 facilities in the world to have this Certification and only two in the Asia Pacific, which demonstrates how hard it is to get. And we wanted to be the best.
Glenn, how did you respond when you learned about DFSI’s requirement for Tier III Gold Certification of Operational Sustainability?
Aspland, Metronode senior facility manager: I thought this is fantastic. Someone has actually identified a specific and comprehensive set of behaviors that can minimize risk to the operations of the data center without being prescriptive.
When I was handed the Operational Sustainability assignment, the general manager said, “This is your goal.” From my point of view, operations has been my career, and this was a fantastic opportunity to turn years of knowledge into something real. Get all the information out, analyze and benchmark it, find the highs and lows, and learn why they exist. For me it was a passion.
I was new to the company at that point and the process started that night. I just couldn’t put the Tier materials down. I was quite impressed. After that, we spent 2-3 months organizing meetings and drilling into what I felt we needed to do to run our data centers in an Operationally Sustainable manner.
And then we began truly analyzing and assessing many of the simple statements in the briefing. For example, what does “fully scripted maintenance” mean?
What benefits did Metronode experience?
Paterson, DFSI: Metronode’s early deployments didn’t use Bladeroom, so this was a new implementation and there was a lot to learn. For them to design and build these new facilities at Tier III and then get the Tier III Gold Certification for Operational Sustainability showed us the rigor and discipline the team has.
Did the business understand the level of effort?
Aspland, Metronode senior facility manager: Definitely not. I had to work hard to convince the business. Operational Sustainability has put operations front and center. That’s our core business.
One of the biggest challenges was getting stakeholder engagement. Selling the dream.
I had the dream firmly in my mind from reading the document. But we had to get busy people to do things in a way that’s structured and document what they do. Not everyone rushes to that.
Practically, what did “selling the dream” require?
Aspland, Metronode senior facility manager: The approach is practical. We have to do what we do, and work through it with all the team. For instance, we did training on a one-on-one basis, but it wasn’t documented. So we had to ask ourselves: what do we teach? And then we have to produce training documents. After 12 months, we should know what we do and what our vendors do. But how do we know that they are doing what they are supposed to do? How do we validate all this? We have to make it a culture. That was probably the biggest change. We made the documents reflect what we actually do. Not just policy words about uptime and reliability.
Are there “aha” moments?
Aspland, Metronode senior facility manager: Continually. I am still presenting what we do and what we’ve done. Usually the “aha” happens when someone comes on site and performs major projects and follows detailed instructions that tell them exactly what to do, what button to punch, and what switchboard and what room. And 24 hours later, when they’ve relied on the document and nothing goes wrong, there’s the “aha” moment. This supported me. It made it easy.
How do you monitor the results?
Aspland, Metronode senior facility manager: We monitor our uptime continually. In terms of KPIs, we track maintenance fulfillment rates, how many jobs are open each week, and how long they stay open. And every 2 months, we run financial benchmarks to compare against budget.
Our people are aware that we are tracking and producing reports. They are also aware of the audit. We have all the evidence of what we’ve done because of the five-day Operational Sustainability assessment.
For ISO 27001 (editor’s note: the ISO information security management standard), what they are really doing is checking our docs. We are demonstrating that all our maintenance is complete and document based, so we have all that evidence, and that’s now the culture.
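(Editor’s note: as a hedged illustration of how the KPIs mentioned above might be computed, the sketch below uses made-up numbers rather than Metronode’s actual figures.)

```python
from datetime import date

def fulfillment_rate(completed_jobs: int, scheduled_jobs: int) -> float:
    """Share of scheduled maintenance completed in the period."""
    return completed_jobs / scheduled_jobs if scheduled_jobs else 1.0

def days_open(opened: date, as_of: date) -> int:
    """How long a job has remained open, in days."""
    return (as_of - opened).days

# Illustrative weekly snapshot (hypothetical numbers, not Metronode data):
print(f"{fulfillment_rate(47, 50):.0%}")               # 94%
print(days_open(date(2017, 1, 2), date(2017, 1, 9)))   # 7
```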
What do you view as your competitive advantage?
Griggs, Metronode managing director: You can describe the Australian market as a mature market. There are quite a few providers. At Metronode, we design, build and operate the most secure and energy-efficient, low PUE facilities in the country, providing our customers with high-density data centers to meet their needs, both now and in the future.
Recognized by NABERS for data center energy efficiency with the highest rating of 4.5, we are also the only Australian provider to have achieved Uptime Institute Tier III Gold Certification for Operational Sustainability. There is one other national provider that looks like us. Then you have local companies that operate in local markets, and some international providers with one or two facilities that serve international customers with a presence in Australia.
In this environment, our advantages include geolocation, energy efficiency, security, reliability, Tier III Gold Certification, and flexibility.
The Tier III Gold Certification and NABERS rating are important, because there are some outlandish claims about PUE and Uptime Institute Certification—people who claim to have Tier III facilities without going through the process of having that actually verified. Similarly, we believe the NABERS rating is becoming more important because of people making claims that they cannot achieve in terms of PUE.
Finally, we are finding that people are struggling to forecast demand. Because we are able to go from 1 kW/rack up to 30 kW, our customers are able to grow within their current footprint. Metronode has engineered Australia’s most adaptive data center designs, with an enviable speed to build in terms of construction. That ability to grow out in a rapid manner means that we are able to meet the growing requirements and often unforeseen customer demand.
* NABERS is an Australian rating system, managed nationally by the NSW Office of Environment and Heritage on behalf of federal, state, and territory governments. It measures the environmental performance of Australian buildings.
LinkedIn’s Oregon Data Center Goes Live, Gains EIT Stamp of Approval
Uptime Institute recently awarded its Efficient IT (EIT) Stamp of Approval to LinkedIn for its new data center in Infomart Portland, signaling that the modern new facility had exceeded extremely high standards for enterprise leadership, operations, and computing infrastructure. These standards are designed to help organizations lower costs and increase efficiency, and leverage technology for good stewardship of corporate and environmental resources. Uptime Institute congratulates LinkedIn on this significant accomplishment.
Sonu Nayyar, VP of Production Operations & IT for LinkedIn, said, “Our newest data center is a big step forward, allowing us to adopt a hyperscale architecture while being even more efficient about how we consume resources. Uptime Institute’s Efficient IT Stamp of Approval is a strong confirmation that our management functions of IT, data center engineering, finance, and sustainability are aligned. We look forward to improving how we source energy and ultimately reaching our goal of 100 percent renewable energy.”
John Sheputis, President of Infomart Data Centers, said, “We knew this project would be special from the moment we began collaborating on design. Efficiency and sustainability are achieved through superior control of resources, not sacrificing performance or availability. We view this award as further evidence of Infomart’s long-term commitment to working with customers to create the most sustainable IT operations in the world.”
LinkedIn expressed its enthusiasm about both this new project and the Efficient IT Stamp of Approval in the LinkedIn Engineering blog post below.
Earlier this year we announced Project Altair, our massively scalable, next-generation data center design. We also announced our plans to build a new data center in Oregon, in order to be able to more reliably deliver our services to our members and customers. Today, we’d like to announce that our Oregon data center, featuring the design innovations of Project Altair, is fully live and ramped. The primary criteria when selecting the Oregon location were: procuring a direct access contract for 100% renewable energy, network diversity, expansion capabilities, and talent opportunities.
LinkedIn’s Oregon data center, hosted by Infomart, is the most technologically advanced and highly efficient data center in our global portfolio, and includes a sustainable mechanical and electrical system that is now the benchmark for our future builds. We chose to utilize the ChilledDoor from MotivAir, a rear-door heat exchanger that neutralizes heat closer to the source. The advanced water-side economizer cooling system communicates with outside air sensors to utilize Oregon’s naturally cool temperatures, instead of using energy to create cool air. Incorporating efficient technologies such as these enables our operations to run at a PUE (Power Usage Effectiveness) of 1.06 during full economization mode.
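(Editor’s note: PUE is the ratio of total facility power to IT equipment power. The minimal sketch below walks through the arithmetic with illustrative numbers, not LinkedIn’s measured values.)

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    return total_facility_kw / it_load_kw

# Illustrative only: at a PUE of 1.06, every 1,000 kW of IT load implies roughly
# 60 kW of overhead for cooling, power distribution losses, and lighting.
print(pue(total_facility_kw=1060.0, it_load_kw=1000.0))  # -> 1.06
```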
Implementing the Project Altair next-generation data center design enabled us to move to a widely distributed non-blocking fabric with uniform chipset, bandwidth, and buffering characteristics in a minimalistic and elegant design. We encourage the most minimalistic and simplistic approaches in infrastructure engineering because they are easy to understand and hence easier to scale. Moving to a unified software architecture for the whole data center allows us to run the same set of tools on both end systems (servers) and intermediate systems (network).
The shift to simplification in order to own and control our architecture motivated us to also use our own 100G Ethernet open switch platform, called Project Falco. The advantages of using our own software stack are numerous: faster time to market, uniformity, simplification in feature requirements as well as deployment, and controlling our architecture and scale, to name a few.
In addition to the infrastructure innovation mentioned earlier, our Oregon data center has been designed and deployed to use IPv6 (next generation internet protocol) from day one. This is part of our larger vision to move our entire stack to IPv6 globally in order to retire IPv4 in our existing data centers. The move to IPv6 enabled us to run our application stack and our private cloud, LPS (LinkedIn Platform as a Service), without the limitations of traditional stacks.
As we moved to a distributed security system by creating a distributed firewall and removing network load balancers, the process of preparing and migrating our site to the new data center became a complicated task. It required significant automation and code development, as well as changes to procedures and software configurations, but ultimately reduced our infrastructure costs and gave us additional operational flexibility.
Our site reliability engineers and systems engineering teams introduced a number of innovations in deployment and provisioning, which allowed us to streamline the software buildout process. This, combined with zero touch deployment, resulted in a shorter timeline and smoother process for bringing a data center live than we’ve ever achieved before.
We’re delighted to participate in the growth of Oregon as a next-generation sustainable technology center. The Uptime Institute has recognized LinkedIn with an Efficient IT award for our efforts! This award evaluates the management processes and organizational behaviors to achieve lasting reductions in cost, utilities, staff time, and carbon emissions for data centers.
The deployment of our new data center was a collective effort of many talented individuals across LinkedIn in several organizations. Special thanks to the LinkedIn Data Center team, Infrastructure Engineering, Production Engineering and Operations (PEO), Global Ops, AAAA team, House Security, SRE organization, and LinkedIn Engineering.
Airline Outages FAQ: How to Keep Your Company Out of the Headlines
Uptime Institute has prepared this brief airline outages FAQ to help the industry, media, and general public understand the reasons that data centers fail.
The failure of a power control module on Monday, August 8, 2016, at a Delta Airlines data center caused hundreds of flight cancellations, inconvenienced thousands of customers, and cost the airline millions of dollars. And while equipment failure is the proximate cause of the data center failure, Uptime Institute’s long experience evaluating the design, construction, and operations of facilities suggests that many enterprise data centers are similarly vulnerable because of construction and design flaws or poor operations practices.
What happened to Delta Airlines?
While software is blamed for many well-publicized IT problems, Delta Airlines is blaming a piece of infrastructure hardware. According to the airline, “…a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power. The universal power was stabilized and power was restored quickly. But when this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.” We take Delta at its word.
Some of the first reports blamed switchgear failure or a generator fire for the outage. Later reports suggested that critical services were housed on single-corded servers or that both cords of dual-corded servers were plugged into the same feed, which would explain why backup power failed to keep some critical services on line.
However, pointing to the failure of a single piece of equipment can be very misleading. Delta’s facility was designed to have redundant systems, so it should have remained operational had it performed as designed. In short, a design flaw, a construction error or change, or poor operations procedures set the stage for the catastrophic failure.
What does this mean?
Equipment like Delta’s power control module (more often called a UPS) should be deployed in a redundant configuration to allow for maintenance or support IT operation in the event of a fault. However, IT demand can grow over time so that the initial redundancy is compromised and each piece of equipment is overloaded when one fails or is taken off line. At this point, only Delta really knows what happened.
Organizations can compromise redundancy in this way by failing to track load growth, lacking a process to manage load growth, or making a poor business decision because of unanticipated or uncontrolled load growth.
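As an illustration of how redundancy quietly erodes, the hedged sketch below (hypothetical module sizes and loads, not Delta’s actual configuration) checks whether an N+1 UPS lineup can still carry the IT load after a single module fails.

```python
def survives_single_failure(module_kw: float, modules: int, it_load_kw: float) -> bool:
    """True if the remaining modules can carry the load after one module is lost."""
    return (modules - 1) * module_kw >= it_load_kw

# Hypothetical numbers: three 500-kW modules installed as N+1 for a 900-kW design load.
print(survives_single_failure(500, 3, 900))    # True  -> redundancy intact
# Untracked growth pushes the load to 1,100 kW; one failure now overloads the remaining modules.
print(survives_single_failure(500, 3, 1100))   # False -> redundancy silently lost
```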
Delta Airlines is like many other organizations. Its IT operations are complex and expensive, change is constant, and business demands are constantly growing. As a result, IT must squeeze every cent out of its budgets, while maintaining 24x7x365 operations. Inevitably these demands will expose any vulnerability in the physical infrastructure and operations.
Failures like the one at Delta can have many causes, including a single point of failure, lack of totally redundant and independent systems, or poorly documented maintenance and operations procedures.
Delta’s early report ignores key issues: why a single equipment failure could cause such damage, what could have been done to prevent it, and how Delta will respond in the future.
Delta and the media are rightly focused on the human cost of the failure. Uptime Institute is sure that Delta’s IT personnel will be trying to reconstruct the incident once operations have been restored and customer needs met. This incident will cost millions of dollars and hurt Delta’s brand; Delta will not want a recurrence. They have only started to calculate what changes will be required to prevent another incident and how much it will cost.
Why is Uptime Institute publishing this FAQ?
Uptime Institute has spent many years helping organizations avoid this exact scenario, and we want to share our experience. In addition, we have seen many published reports that just miss the point of how to achieve and maintain redundancy. Equipment fails all the time. Facilities that are well designed, built, and operated have systems in place to minimize data center failure. Furthermore, these systems are tested and validated regularly, with special attention to upgrades, load growth, and modernization plans.
How can organizations avoid catastrophic failures?
Organizations spend millions of dollars to build highly reliable data centers and keep them running. They don’t always achieve the desired outcome because of a failure to understand data center infrastructure.
The temptation to reduce costs in data centers is great because data centers require a great deal of energy to power and cool servers and they require experienced and qualified staff to operate and maintain.
Value engineering in the design process and mistakes and changes in the building process can result in vulnerabilities even in new data centers. Poor change management processes and incomplete procedures in older data centers are another cause for concern.
Uptime Institute believes that independent third-party verifications are the best way to identify the vulnerabilities in IT infrastructure. For instance, Uptime Institute consultants have Certified that more than 900 facilities worldwide meet the requirements of the Tier Standards. Certification is the only way to ensure that a new data center has been built and verified to meet business standards for reliability. In almost all these facilities, Uptime Institute consultants made recommendations to reduce risks that were unknown to the data center owner and operators.
Over time, however, even new data centers can become vulnerable. Change management procedures must help organizations control IT growth, and maintenance procedures must be updated to account for equipment and configuration changes. Third-party verifications such as Uptime Institute’s Tier Certification for Operational Sustainability and Management & Operations Stamp of Approval ensure that an organization’s procedures and processes are complete, accurate, and up-to-date. Maintaining up-to-date procedures is next to impossible without management processes that recognize that data centers change over time as demand grows, equipment changes, and business requirements change.
I’m in IT, what can I do to keep my company out of the headlines?
It all depends on your company’s appetite for risk. If IT failures would not materially affect your operations, then you can probably relax. But if your business relies on IT for mission-critical services for customer-facing or internal customers, then you should consider having the Uptime Institute evaluate your data center’s management and operations procedures and then implement the recommendations that result.
If you are building a new facility or renovating an existing facility, Tier Certification will verify that your data center will meet your business requirements.
These assessments and certifications require a significant organizational commitment in time and money, but these costs are only a fraction of the time and money that Delta will be spending as the result of this one issue. In addition, insurance companies increasingly have begun to reduce premiums for companies that have their operational preparedness verified by a third party.
Published reports suggest that Delta thought it was safe from this type of outage, and that other airlines may be similarly vulnerable because of skimpy IT budgets, poor prioritization, and bad processes and procedures. All companies should undertake a top-to-bottom review of their facilities and operations before spending millions of dollars on new equipment.
Bulky Binders and Operations “Experts” Put Your Data Center at Risk
Wearable technology and digitized operating procedures ensure compliance with standardized practices and provide quick access to critical information
By Jose Ruiz
No one in the data center industry questions that data center outages are extremely costly. Numerous vendors report that the average data center outage costs hundreds of thousands of dollars, with losses in some circumstances totaling much more. Uptime Institute Network Abnormal Incident Reports data show that most downtime instances are caused by human error, a key vulnerability in mission-critical environments. In addition, Compass Datacenters believes that many failures that enterprises attribute to equipment failure could more accurately be considered human error. However, most companies prefer to blame the hardware rather than their processes.
Figure 1. American Electric Power is one of the largest utilities in the U.S., with a service territory covering parts of 11 states.
Whether one accepts higher or lower outage costs, it is clear that reducing human error as a cause of downtime has a tremendous upside for the enterprise. As a result, data center owners dedicate a great deal of time to developing maintenance and operation procedures that are clear, effective, and easy to follow. Compass Datacenters is incorporating wearable technology in its efforts to make the procedures convenient for technicians to access and execute. During the process, Compass learned that the true value of introducing wearables into data center operations comes from standardizing the procedures. Incorporating wearable technology enhances the ability of users to follow the procedures. In addition, allowing users to choose the device most useful for the task at hand improves their comfort with the new process.
THE PROBLEM WITH TRADITIONAL OPS DOCUMENTATION
There are two major problems with the current methods of managing and documenting data center procedures.
Many data center operations teams document their procedures and methodologies in large three-ring binders. Although their presence is comforting, Operations staffs rarely consult them.
Also, organizations often rely on highly detailed written documentation presented in such depth that Operations staff wastes a great deal of time trying to locate the appropriate information.
In both instances, Operations personnel deem the written procedural guidelines inconvenient. As a result, actual data center operation tends to be more experience-based. Bluntly put, too many organizations rely too much on the most experienced or the most insistent Operations person on site at the time of an incident.
Figure 2. Compass Datacenters’ use of electronics and checklists clarifies the responsibilities of Operations personnel doing daily rounds.
On the whole, experience-based operations are not all bad. Subject matter experts (SME) emerge and are valued, with operations running smoothly for long periods at a time. However, as Julian Kudritzki wrote in “Examining and Learning from Complex Systems Failures” (The Uptime Institute Journal, vol. 5, p. 12), “‘Human error’ is an insufficient and misleading term. The front-line operator’s presence at the site of the incident ascribes responsibility to the operator for failure to rescue the situation. But this masks the underlying causes of an incident. It is more helpful to consider the site of the incident as a spectacle of mismanagement.” What’s more, subject matter experts may leave or get sick or simply have a bad day. In addition, most of their peers assume the SME is always right.
Figure 3.
Figure 4.
Figures 3-5. The new system provides reminders of monthly tasks, prompts personnel to take required steps, and then requires verification that the steps have been taken.
Of course, no one goes to work planning to hurt a coworker or cause an outage, but errors are made. Compass Datacenters believes its efforts to develop professional-quality procedures and use wearable technology reduce the likelihood of human error. Compass’s intent is to make it easy for operators to handle both routine and foreseeable abnormal operations by standardizing where possible, improvising only when absolutely necessary, and making it simple and painless for technicians to comply with well-defined procedures.
AMERICAN ELECTRIC POWER
American Electric Power (AEP), a utility based in Columbus, OH that provides electricity to more than 5.4 million customers in 11 states, inspired Compass to bring wearable technology to the data center to help the utility reduce human error. The initiative was a first for Compass, and AEP’s collaboration was essential to the product development effort.

Compass and AEP began by adopting the checklist approach used in the airline business, among many others. This approach to avoiding human error is predicated on the use of carefully crafted operational procedures. Foreseeable critical actions by the pilot and copilot are strictly prescribed to ensure that all processes are followed in the same manner, every time, no matter who is in the cockpit. They standardize where possible and improvise only when required. Airline procedural techniques also put extra focus on critical steps to eliminate error. While this professional approach to operations has been widely adopted by a host of industries, it has not yet become a standard in the data center business.
Compass and AEP partnered with ICARUS Ops to develop the electronic checklist and wearables programs. ICARUS Ops was founded by two former U.S. Navy carrier pilots, who use their extensive experience in developing standardized aviation processes to develop wearable programs across a broad spectrum of high-reliability and mission-critical industries. During the process, ICARUS Ops brought on board several highly experienced consultants, including military and commercial pilots and a former Space Shuttle mission controller.

At the start of the project, Compass, ICARUS Ops, and AEP worked together to identify the various maintenance and operations tasks that would be wearable accessible. Team members reviewed all existing emergency operating procedures (EOP). The three organizations held detailed discussions with AEP’s data center personnel to help identify the procedures best suited to run electronically. Including operators in the process ensured that their perspectives would be part of the final procedures to make it easier for them to execute correctly. Also, this early involvement led to a high level of buy-in, which would be needed for this project to succeed.
ELECTRONIC PROCEDURES
The next step was converting the procedures to electronic checklists. The transition was accomplished using the ICARUS Ops Application and web-based Mission Control. The project team wanted each checklist to do the following (a hedged sketch of one possible checklist structure appears after this list):
Be succinct and written with the operator in mind
Identify the key milestone in the process and use digital technology to verify critical steps
Use condition, purpose, or desired effect statements to assure the proper checklist is utilized
Identify common areas of confusion and provide links to tools such as animations and videos to clarify how and why the procedure is to be performed
Make use of Job Safety Analysis (JSA) to identify areas of significant risk prior to the required step/action. These included:
o Warnings to indicate where serious injury is possible if all procedures are not followed specifically
o Cautions indicating where significant data loss or damage is possible
o Notes to add any additional information that may be required or helpful
Condense SOP and EOP procedure items to prevent the user from being overwhelmed during critical and emergency scenarios.
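As a hedged illustration only, and not the actual ICARUS Ops data model, a digital checklist with these characteristics might be represented along the following lines.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    """One checklist step, carrying the JSA-style callouts described above."""
    action: str
    warnings: List[str] = field(default_factory=list)   # risk of serious injury
    cautions: List[str] = field(default_factory=list)   # risk of data loss or damage
    notes: List[str] = field(default_factory=list)      # additional helpful information
    verified: bool = False                               # digital verification of critical steps

@dataclass
class Checklist:
    title: str
    purpose: str            # condition/purpose statement confirming the right checklist is in use
    steps: List[Step]

    def incomplete_steps(self) -> List[Step]:
        """Steps still awaiting verification."""
        return [s for s in self.steps if not s.verified]
```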
AEP SMEs walked through each and every checklist to provide an additional quality control step.
Figure 6. Individual steps include illustrations and diagrams to eliminate any confusion.
Having SOPs and EOPs in a wearable digital format allows users to access them whenever an abnormal or emergency situation occurs, so no time is lost looking for the three-ring binder and proper procedure. In critical situations and emergencies, time is crucial and delay can be costly. Hard-copy information manuals and operating handbooks are bulky and can potentially overwhelm technicians looking for specific information. The bulk and formatting can cause users to lose their place on a page and skip over important items. By contrast, the digital interface provides quick access to users and allows for quicker navigation to desired materials and information via presorted emergency and abnormal procedures.
At the conclusion of the development phase of the process (about a month), 150 of AEP’s operational procedures had been converted into digital procedures that were conveniently accessible on mobile devices— on or offline. Most importantly, a digital continuous improvement process was put in place to assure that lessons learned would be easy to incorporate into the system as part of the process.
The process of entering the procedures took about 2 weeks. First, a data entry team put them in as written. Then Compass worked with AEP and ICARUS Ops checklist experts to tighten up the language and add decision points, warnings, cautions, and notes. This group also embedded pictures and video where appropriate and decided what data needed to be captured during the process.

Updates are easy. Each time a checklist or procedure is used, improvements can be approved and digitally published right away. This is critical. AEP believes that when employees know they are being heard, job satisfaction improves and they take more ownership and pride in the facility. The modifications and improvements will be continuous; the system is designed for constant improvement. Any time a technician has an idea to improve a checklist, he or she simply records it in the application, which instantly emails it to the SME for approval and publishing into the system. Other devices on the system sync automatically when they update.
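A hypothetical sketch of that approve-and-publish loop, illustrative only and not the actual ICARUS Ops implementation, might look like this.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    """A technician's idea for improving a checklist, routed to the SME for review."""
    checklist_id: str
    submitted_by: str
    change: str
    status: str = "pending_sme_review"

def sme_review(s: Suggestion, approved: bool, current_revision: int) -> int:
    """SME approval publishes a new revision; rejection leaves the current one in place."""
    if approved:
        s.status = "published"
        return current_revision + 1
    s.status = "rejected"
    return current_revision

def needs_sync(device_revision: int, published_revision: int) -> bool:
    """Devices sync automatically when their local revision falls behind the published one."""
    return device_revision < published_revision
```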
WEARABLE TECHNOLOGY
Using wearable technology made the electronic procedures even more accessible to the end users. At the outset, AEP, ICARUS Ops, and Compass determined that AEP technicians would be able to choose between a wrist-mounted Android device and a tablet. The majority of users selected the wrist-mounted device. This preference was not unexpected, as a good portion of their work requires them to use their hands. The wrist units allow users to have both hands available to perform actual work at all times.
Figure 7. With all maintenance activities logged, a dashboard enables management to retrieve and examine reports.
Compass chose the Android technology system for several reasons. First, from the very outset, Compass envisioned making Google Glass one of the wearable options. In addition, Android allows programs maximum freedom to use the hardware and software as required to meet customer needs; it is simply less constraining than working with iOS. Android devices are also very easy to procure, and manufacturers offer Class 1, Division 1 Android tablets that are usable in hazardous environments.
The software-based technology enhances AEP’s ability to track all maintenance activity in its data center. Once online, each device communicates the recorded operational information (actions performed, components replaced, etc.) to the site’s maintenance system. This ensures that all routine maintenance is performed on schedule, and at the same time provides a detailed history of the activities performed on each major component within the facility. Web-based leadership dashboards provide oversight into what is complete, due, or overdue. The system is completely secure, as it does not reside on the same server or system as any other system in the data center. It can be encrypted or resident in the data center for increased security. Typically ICARUS Ops spends at least 3 days on site training the technicians and users, depending on the number of devices. Working together in that time normally leaves the team very comfortable with all components.
THE NEXT PHASE: GLASS
Although innovations such as Google Glass have increased awareness of visual displays, the concept dates back to the Vietnam War, when American fighter pilots began to use augmented reality on heads-up displays (HUDs). Compass Datacenters plans to harness the same augmented reality to enable maintenance technicians to complete a task completely hands free using Google Glass and other advanced wearables.
It is certainly safe to assume that costs related to data center outages will continue to rise in the future. As a result, it will become more important to increase operator reliability and reduce the risk of human error. We strongly believe that professional quality procedures and wearables will make a tremendous contribution to eliminating the human error element as a major cause of interruption and lost revenue. Our vision is that other data center providers will decide to make this innovative checklist system a part of their standard offerings over the course of the next 12 to 24 months.
Jose Ruiz
Jose Ruiz serves as Compass Datacenters’ Vice President of Operations. He provides interdisciplinary leadership in support of Compass’ strategic goals. Mr. Ruiz began his tenure at Compass as the company’s director of Engineering. Prior to joining Compass he spent three years in various engineering positions and was responsible for a global range of projects at Digital Realty Trust.
Green Mountain melds sustainability, reliability, competitive pricing, and independent certifications to attract international colo customers
By Kevin Heslin
Green Mountain operates two unique colo facilities in Norway, with a total potential capacity of several hundred megawatts. Though each facility has its own strengths, both embody the company’s commitment to providing secure, high-quality service in an energy-efficient and sustainable manner. CEO Knut Molaug and Chief Sales Officer Petter Tømmeraas recently took time to explain how Green Mountain views the relationship between cost, quality, and sustainability.
Tell our readers about Green Mountain.
KM: Green Mountain focuses on the high-end data center market, including banking/finance, oil and gas, and other industries requiring high availability and high quality services.
PT: IT and cloud are also very big customer segments. We think the US and European markets are the biggest for us, but we also see some Asian companies moving into Europe that are really keen on having high-quality data centers in the area.
KM: Green Mountain Data Centers operates two data centers in Norway. Data Center 1 in Stavanger began operation in 2013 and is located in a former underground NATO ammunition storage facility inside a mountain on the west coast. Data Center 2, a more traditional facility, is located in Telemark, which is in the middle of Norway.
Today DC1-Stavanger is a high-security colocation data center housing 13,600 square meters (m2) of customer space. The infrastructure can support up to 26 megawatts of IT load today. The main data center comprises three two-story concrete buildings built inside the mountain, with power densities ranging from 2-6 kW/m2, though the facility can support up to 20 kW/m2. NATO put a lot of money into creating their facilities inside the mountain, which probably saved us 1 billion kroner (US$150 million).
DC2-Telemark is located in a historic region of Norway and was built on a brownfield site with a 10-MW supply initially available. The first phase is a fully operational 10-MW Tier III Certified Facility, with four new buildings and up to 25 MW total capacity planned. This site could support even larger facilities if the need arises.
Green Mountain focuses a lot on being green and environmentally friendly, so we use 100% renewable energy in both data centers.
How do the unique features of the data centers affect their performance?
KM: Besides being located in a mountain, DC1 has a unique cooling system. We use the fjords for cooling year round, which gives us 8°C (46 °F) water for cooling. The cooling solution (including cooling station, chilled water pipework and pumps) is fully duplicated, providing an N+N solution. Because there are few moving parts (circulating pumps) the solution is extremely robust and reliable. In-row cooling is installed to client specification using Hot Aisle technology.
We use only 1 kilowatt of power to produce 100 kilowatts of cooling. So the data center is extremely energy efficient. In addition, we are connected to three independent power supplies, so DC1 has extremely robust power.
DC2 probably has the most robust power supply in Europe. We have five independent hydropower plants within a few kilometers of the site, and the two closest are just a few hundred meters away.
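(Editor’s note: a simplified calculation of what the stated 1 kW of power per 100 kW of cooling implies for the cooling plant’s contribution to PUE follows, using an illustrative IT load rather than Green Mountain’s actual figures.)

```python
# Simplified arithmetic based on the stated ratio of 1 kW of cooling-plant power
# per 100 kW of cooling delivered; the IT load is illustrative only.
it_load_kw = 1000.0
cooling_plant_kw = it_load_kw / 100.0              # 10 kW to reject 1,000 kW of heat
pue_contribution = cooling_plant_kw / it_load_kw
print(cooling_plant_kw, pue_contribution)          # 10.0 kW, roughly 0.01 added to PUE by cooling
```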
How do you define high quality?
PT: High quality means Uptime Institute Tier Certification. We are not only saying we have very good data centers. We’ve gone through a lot of testing so we are able to back it up, and the Uptime Institute Tier Standard is the only standard worldwide that certifies data center infrastructure to a certain quality. We’re really strong on certifications because we don’t only want to tell our customers that we have good quality, we want to prove it. Plus we want the kinds of customers who demand proof. As a result, both our facilities are Tier III Certified.
Please talk about the factors that went into deciding to obtain Tier Certification.
KM: We have focused on high-end clients that require 100% uptime and are running high-availability solutions. Operations for this type of company generally require documented infrastructure.
The Tier III term is used a lot, but most companies can’t back it up. Having been through testing ourselves, we know that most companies that haven’t been certified don’t have a Tier III facility, no matter what they claim. When we talk to important clients, they see that as well.
What was the on-site experience like?
PT: When the Uptime Institute team was on site, we could tell that Certification was a quality process with quality people who knew what they were doing. Certification also helped us document our processes because of all the testing routines and scenarios. As a result, we know we have processes and procedures for all the thinkable and unthinkable scenarios and that would have been hard to do without this process.
Why do you call these data centers green?
KM: First of all we use only renewable energy. Of course that is easy in Norway because all the power is renewable. In addition we use very little of it, with the fjords as a cooling media. We also built the data centers using the most efficient equipment, even though we often paid more for it.
PT: Green Mountain is committed to operating in a sustainable way, and this is reflected in everything we do. The good thing about operating in such a way is that our customers benefit financially. Because we bill power based on consumption, the more energy efficiently we operate, the smaller the bill to our customers. When we tell these companies that they can even save money by going with our sustainable solutions, their decision becomes easier.
More and more customers require that their new data center solutions are sustainable, but we still see that price is a key driver for most major customers. The combination of having very sustainable solutions and being very competitive on price is the best way of driving sustainability further into the mind of our customers.
All our clients reduce their carbon footprint when they move into our data centers and stop using their old and inefficient data centers.
We have a few major financial customers that have put forward very strict targets with regards to sustainability and that have found us to be the supplier that best meets these requirements.
KM: And, of course, making use of an already built facility was also part of the green strategy.
How does your cost structure help you win clients?
PT: It’s important, but it’s not the only important factor. Security and the quality we can offer are just as important, and that we can offer them with competitive pricing is very important.
Were there clients who were attracted to your green strategy?
PT: Several of them, but the decisive factor for customers is rarely only one factor. We offer a combination between a really, really competitive offering and a high quality level. We are a really, really sustainable and green solution. To be able to offer that at competitive price is quite unique because often people think they have to pay more to get a sustainable green solution.
Are any of your targeted segments more attracted to sustainability solutions?
PT: Several of the international system integrators really like the combination. They want a sustainable solution, but they want the competitive offering. When they get both, it’s a no-brainer for them.
How does your sustainability/energy efficiency program affect your reliability? Do potential clients have any concerns about this? Do any require sustainability?
PT: Our programs do not affect our reliability in any way. We have chosen to implement only solutions that do not harm our ability to deliver the quality we promise to our customers. We have never experienced one second of SLA breach for any customer in any of our data centers. In fact, some of our most sustainable solutions, like the cooling system based on cold sea water, increase our reliability because they considerably reduce the risk of failure compared to conventional cooling systems. We have not experienced any concerns about these solutions.
Has Tier Certification proven critical in any of your clients’ decisions?
PT: Tier Certification has proved critical in many of our clients’ decisions to move to Green Mountain. We see a shift in the market toward requiring Tier Certification, whereas it used to be more a matter of asking for Tier compliance, which anyone could claim without having to prove it. We think the future for quality data center providers will be to certify all their data centers.
Any customer with mission critical data should require their supplier(s) to be Tier Certified. At the moment this is the only way for a customer to ensure that their data center is built and operated the way it should be in order to deliver the quality the customer needs.
Are there other factors that set you apart?
PT: Operational excellence. We have an operational team that excels every time. They deliver to the customers a lot more than expected, every time, and we have customers that are extremely happy with what we deliver. I hear that from customers all the time, and that’s mainly because our operations team does a phenomenal job.
Uptime Institute testing criteria were very comprehensive and helped us develop our operational procedures to an even higher level, as some of the scenarios created during the Certification testing were used as a basis for new operational procedures and new tests that we now perform as part of our normal operations.
Green Mountain definitely benefitted from the Tier process in a number of other ways; for example, the training gave us useful input to improve our own management and operational procedures.
What did you do to develop this team?
KM: When we decided to focus on high-end clients, we knew that we needed high-end experience, expertise, and knowledge on the operations side, so we focused on that when recruiting, as well as on building a culture inside the company focused on delivering high quality the first time, every time.
We recruited people with knowledge of how to operate critical environments, and we tasked them with developing those procedures and operational elements as a part of their efforts, and they have successfully done so.
PT: And the owners made the resources available, both financial and staff hours, to create the quality we wanted. We also have a very good management system, so management has good knowledge of what’s happening; if we have an issue, it is very visible.
KM: We also have high-end equipment and tools to measure and monitor everything inside the data center as well as operational tools to make sure we can handle any issue and deliver on our promises.
Kevin Heslin
Kevin Heslin is Chief Editor and Director of Ancillary Projects at Uptime Institute. In these roles, he supports Uptime Institute communications and education efforts. Previously, he served as an editor at BNP Media, where he founded Mission Critical, a commercial publication dedicated to data center and backup power professionals. He also served as editor at New York Construction News and CEE and was the editor of LD+A and JIES at the IESNA. In addition, Heslin served as communications manager at the Lighting Research Center of Rensselaer Polytechnic Institute. He earned a B.A. in Journalism from Fordham University in 1981 and a B.S. in Technical Communications from Rensselaer Polytechnic Institute in 2000.
ENTEL Achieves Uptime Institute Tier Certification of Operational Sustainability
Tier Certification of Operational Sustainability enhances ENTEL’s services to customers
By Kevin Heslin
Figure 1. CDLV 1 and 2 are both Tier III Gold Certified Constructed Facilities.
Rich Van Loo, Uptime Institute, VP Facility Management Services, said, “The Tier Certification and Operational assessments have had a broad impact on not only the data center management divisions, but the company environment as a whole. Operational procedures not only reduce risk but help improve consistency and efficiency in those operations. ENTEL is looking to expand this philosophy to all their data centers.”
Table 1
ENTEL began construction of a third facility at Ciudad de los Valles in 2015, which will add another 2,000 m2 of white space. The new facility incorporates free-cooling technologies that achieve better energy efficiency and a lower total cost of ownership for clients.
Tier Certification of Operational Sustainability
To differentiate itself from its competitors, ENTEL was the first to Tier Certify its facilities. It then decided to differentiate itself further by earning Uptime Institute’s Tier Certification of Operational Sustainability for CDLV 1 and 2. This decision was based on client demand for excellence of service, risk mitigation, and assurance of operational continuity.
ENTEL believes that the Tier Certification of Operational Sustainability (TCOS) was an excellent way to reach these goals (see Figure 2), which is why one internal organization was dedicated to defining data center operations activities and a second team was tasked with leading the effort to earn the Certification.
Figure 2. The Tier Certification of Operational Sustainability verifies that ENTEL maximizes the potential of its facilities and differentiates it from its competitors.
In addition, ENTEL created an Infrastructure Change Control Board (CCI), specific to data center infrastructure management, with the purpose of establishing planning and control of activities in each data center. This Board is a consulting body; it meets periodically and manages, reviews, and approves infrastructure management projects at the data centers.
Scope of the Infrastructure Change Control Board (CCI)
Although ENTEL had previously adopted the ITIL model and established a Change Advisory Board (CAB) that validates and approves all activities in the IT and data center infrastructure environment, it decided to create the CCI because of the degree of specialization of infrastructure work and the risk associated with it. The CCI’s focus is to raise risk points associated with high-impact activities and to identify control points, e.g., cells, transformers, generators, UPS units, towers, and chillers, where system redundancy could be lost.
A summary of the types of work evaluated and the documentation to be reviewed follows (see Table 2):
Table 2.
The CCI is intended to ensure that the good practices characteristic of Tier Certification of Operational Sustainability are respected and that all documentation and instructions remain valid, are applied, and are subject to continuous improvement.
The Process
In order to increase ENTEL’s familiarity with the Tier Certification of Operational Sustainability process, ENTEL and Uptime Institute agreed to carry out the process in four phases (see Figure 3).
Figure 3. ENTEL views operational sustainability as a four-phase continuous process.
Important Decisions
Eventually, Uptime Institute and ENTEL adjusted the process in light of the initial progress, the maturity level of the team, and the imperative work of assuring operational continuity:
ENTEL also
ENTEL believes that it received many benefits from earning the Tier Certification of Operational Sustainability, including keeping the knowledge of the Certification methodology and good practices in house, which ENTEL can replicate at all its data centers and eventually in other areas of the organization (see Figure 4). ENTEL has worked to share this information company wide, with the additional notable achievement of Uptime Institute’s M&O Stamp of Approval at its Amunátegui facility in December 2016.
Figure 4. The staff at ENTEL is charged with promoting Operational Sustainability across the whole organization.
When an Australian Government Department Required Operational Sustainability, Metronode Delivered
Senior facility manager calls achieving Tier Certification of Operational Sustainability “a dream”
By Kevin Heslin
The New South Wales (NSW) Department of Finance, Services and Innovation (DFSI) is a government service provider and regulator for the southeastern Australian state. DFSI supports many government functions, including sustainable government finances, major public works and maintenance programs, government procurement, and information and communications technology.
Josh Griggs, managing director of Metronode; Glenn Aspland, Metronode’s senior facility manager; and Derek Paterson, director of GovDC & Marketplace Services at DFSI, discuss how Metronode responded when the NSW government decided to consolidate its IT operations.
Josh, can you tell me more about Metronode and its facilities?
Griggs, Metronode managing director: Metronode was established in 2002. Today we operate 10 facilities across Australia. Having designed and constructed our data centers, we have significant construction and engineering expertise. We also offer services and operational management. Our Melbourne 2 facility was the first Uptime Institute Tier III Certified Constructed Facility in Australia in 2012, and our Illawarra and Silverwater facilities were the first in the Asia Pacific region to be Tier III Gold Certified for Operational Sustainability by Uptime Institute. We have the most energy-efficient facilities in Australia, with a 4.5 National Australian Built Environment Rating System (NABERS) rating for data centers.*
We have two styles of data center. Generation 1 facilities are typically closer to the cities and have very high connectivity. If you were looking to connect to the cloud or for backup, the Gen 1s fit that purpose. As a result, we have a broad range of clients, including multinationals, local companies, and a lot of government.
And then, we have Generation 2 Bladeroom facilities across five sites including facilities in the Illawarra and at Silverwater, which host NSW’s services. At present, we’ve got 3 megawatts (MW) of IT load in Silverwater and 760 kilowatts (kW) in the Illawarra. With additional phasing, Silverwater could host 15 MW and Illawarra 8 MW.
We are engineered for predictability. Our customers rely on us for critical environments. This is one of the key reasons we went through Uptime Institute Certification. It is not enough for us to say we designed, built, and operate to the Tier III Standards; it is also important that we verify it. It means we have a Concurrently Maintainable site and it is Operationally Sustainable.
Tell me about the relationship between the NSW government and Metronode.
Griggs, Metronode managing director: We entered a partnership with the NSW government in 2012, when we were selected to build their facilities. They had very high standards and groundbreaking requirements that we’ve been able to meet and exceed. They were the first to specify Uptime Institute Tier III Certification of Constructed Facility, the first to specify NABERS, and the first to specify Tier Certification of Operational Sustainability as contract requirements.
Paterson, DFSI: A big part of my role at DFSI is to consolidate agencies’ legacy data center infrastructure into two strategic data centers owned and operated by Metronode. Our initial objective was to consolidate more than 130 data centers into two. We’ve since broadened the program to include the needs of local government agencies and state-owned companies.
When you look across the evolving landscape of requirements, Metronode was best equipped to support agency needs of meeting energy-efficiency targets and providing highly secure physical environments, while meeting service level commitments in terms of overall uptime.
Are these dedicated or colocation facilities?
Paterson, DFSI: When Metronode sells services to the private sector, it can host these clients in the Silverwater facility. At this point, though, Silverwater is 80% government, and Illawarra is a dedicated government facility.
What drove the decision to earn the Tier III Certification of Operational Sustainability?
Paterson, DFSI: We wrote the spec for what we wanted in regards to security, uptime, and service level agreements (SLA). Our contract required Metronode’s facilities to be Tier III Certified for design and build and Tier III Gold for operations. We benefited, of course; however, Metronode is also reaping rewards from both the Tier III Certification of Constructed Facility and Tier III Gold Certification of Operational Sustainability.
Griggs, Metronode managing director: Obviously the contractual requirement was one driver, but that’s not the fundamental driver. We needed to ensure that our mission critical environments are always operating. So Operational Sustainability ensured that we have a reliable, consistent operation to match our Concurrently Maintainable baseline, and our customers can rely on that.
Our operations have been Certified and tested in a highly rigorous process that ensured we had clear and effective processes documented, along with the flow charts to enable systems maintenance, systems monitoring, fault identification, etc. The Tier III Gold Certification also evaluates the people side of the operation, including skill assessment, acquisition of the right people, training, rostering, contracting, and the management in place, as well as continuous improvement.
In that process, we had to ask ourselves if what we were actually doing was documented clearly and followed by everybody. It reached across everything you can think of in terms of operating a data center to ensure that we had all of that in place.
There are only 23 facilities in the world to have this Certification and only two in the Asia Pacific, which demonstrates how hard it is to get. And we wanted to be the best.
Glenn, how did you respond when you learned about DFSI’s requirement for Tier III Gold Certification of Operational Sustainability?
Aspland, Metronode senior facility manager: I thought this is fantastic. Someone has actually identified a specific and comprehensive set of behaviors that can minimize risk to the operations of the data center without being prescriptive.
When I was handed the Operational Sustainability assignment, the general manager said, “This is your goal.” From my point of view, operations has been my career, so it was a fantastic opportunity to turn years of knowledge into something real. Get all the information out, analyze and benchmark it, find the highs and lows, and learn why they exist. For me it was a passion.
I was new to the company at that point, and the process started that night. I just couldn’t put the Tier materials down. I was quite impressed. After that, we spent 2-3 months organizing meetings and drilling into what I felt we needed to do to run our data centers in an Operationally Sustainable manner.
And then we began truly analyzing and assessing many of the simple statements in the briefing. For example, what does “fully scripted maintenance” mean?
What benefits did Metronode experience?
Paterson, DFSI: Metronode’s early deployments didn’t use Bladeroom, so this was a new implementation and there was a lot to learn. For them to design and build these new facilities at Tier III and then get the Tier III Gold Certification for Operational Sustainability showed us the rigor and discipline the team has.
Did the business understand the level of effort?
Aspland, Metronode senior facility manager: Definitely not. I had to work hard to convince the business. Operational Sustainability has put operations front and center. That’s our core business.
One of the biggest challenges was getting stakeholder engagement. Selling the dream.
I had the dream firmly in my mind from reading the document. But we had to get busy people to do things in a way that’s structured and document what they do. Not everyone rushes to that.
Practically, what did “selling the dream” require?
Aspland, Metronode senior facility manager: The approach is practical. We have to do what we do, and work through it with all the team. For instance, we did training on a one-on-one basis, but it wasn’t documented. So we had to ask ourselves: what do we teach? And then we have to produce training documents. After 12 months, we should know what we do and what our vendors do. But how do we know that they are doing what they are supposed to do? How do we validate all this? We have to make it a culture. That was probably the biggest change. We made the documents reflect what we actually do. Not just policy words about uptime and reliability.
Are there “aha” moments?
Aspland, Metronode senior facility manager: Continually. I am still presenting what we do and what we’ve done. Usually the “aha” happens when someone comes on site and performs major projects and follows detailed instructions that tell them exactly what to do, what button to punch, and what switchboard and what room. And 24 hours later, when they’ve relied on the document and nothing goes wrong, there’s the “aha” moment. This supported me. It made it easy.
How do you monitor the results?
Aspland, Metronode senior facility manager: We monitor our uptime continually. In terms of KPIs, we track maintenance fulfillment rates, how many jobs are open each week, and how long they stay open. And every 2 months, we run financial benchmarks to compare against budget.
Our people are aware that we are tracking and producing reports. They are also aware of the audit. We have all the evidence of what we’ve done because of the five-day Operational Sustainability assessment.
For ISO 27001 (editor’s note: the ISO information security management standard), what they are really doing is checking our docs. We are demonstrating that all our maintenance is complete and document based, so we have all that evidence, and that’s now the culture.
What do you view as your competitive advantage?
Griggs, Metronode managing director: You can describe the Australian market as a mature market. There are quite a few providers. At Metronode, we design, build and operate the most secure and energy-efficient, low PUE facilities in the country, providing our customers with high-density data centers to meet their needs, both now and in the future.
Recognized by NABERS for data center energy efficiency with the highest rating of 4.5, we are also the only Australian provider to have achieved Uptime Institute Tier III Gold Certification for Operational Sustainability. There is one other national provider that looks like us. Then you have local companies that operate in local markets and some international providers have one or two facilities and operate with international customers that have a presence in Australia.
In this environment, our advantages include geolocation, energy efficiency, security, reliability, Tier III Gold Certification, and flexibility.
The Tier III Gold Certification and NABERS rating are important, because there are some outlandish claims about PUE and Uptime Institute Certification—people who claim to have Tier III facilities without going through the process of having that actually verified. Similarly, we believe the NABERS rating is becoming more important because of people making claims that they cannot achieve in terms of PUE.
Finally, we are finding that people are struggling to forecast demand. Because we are able to go from 1 kW/rack up to 30 kW, our customers are able to grow within their current footprint. Metronode has engineered Australia’s most adaptive data center designs, with an enviable speed to build in terms of construction. That ability to grow out in a rapid manner means that we are able to meet the growing requirements and often unforeseen customer demand.
* NABERS is an Australian rating system, managed nationally by the NSW Office of Environment and Heritage on behalf of federal, state, and territory governments. It measures the environmental performance of Australian buildings.
LinkedIn’s Oregon Data Center Goes Live, Gains EIT Stamp of Approval
Uptime Institute recently awarded its Efficient IT (EIT) Stamp of Approval to LinkedIn for its new data center in Infomart Portland, signaling that the modern new facility had exceeded extremely high standards for enterprise leadership, operations, and computing infrastructure. These standards are designed to help organizations lower costs and increase efficiency, and leverage technology for good stewardship of corporate and environmental resources. Uptime Institute congratulates LinkedIn on this significant accomplishment.
Sonu Nayyar, VP of Production Operations & IT for LinkedIn, said, “Our newest data center is a big step forward, allowing us to adopt a hyperscale architecture while being even more efficient about how we consume resources. Uptime Institute’s Efficient IT Stamp of Approval is a strong confirmation that our management functions of IT, data center engineering, finance, and sustainability are aligned. We look forward to improving how we source energy and ultimately reaching our goal of 100 percent renewable energy.”
John Sheputis, President of Infomart Data Centers, said, “We knew this project would be special from the moment we began collaborating on design. Efficiency and sustainability are achieved through superior control of resources, not sacrificing performance or availability. We view this award as further evidence of Infomart’s long-term commitment to working with customers to create the most sustainable IT operations in the world.”
LinkedIn expressed its enthusiasm about both this new project and the Efficient IT Stamp of Approval in the LinkedIn Engineering blog post below.
LinkedIn’s Oregon Data Center Goes Live
By Shawn Zandi and Mike Yamaguchi
Earlier this year we announced Project Altair, our massively scalable, next-generation data center design. We also announced our plans to build a new data center in Oregon, in order to be able to more reliably deliver our services to our members and customers. Today, we’d like to announce that our Oregon data center, featuring the design innovations of Project Altair, is fully live and ramped. The primary criteria when selecting the Oregon location were: procuring a direct access contract for 100% renewable energy, network diversity, expansion capabilities, and talent opportunities.
LinkedIn’s Oregon data center, hosted by Infomart, is the most technologically advanced and highly efficient data center in our global portfolio, and includes a sustainable mechanical and electrical system that is now the benchmark for our future builds. We chose to utilize the ChilledDoor from MotivAir, a rear door heat exchanger neutralizing the heat closer to the source. The advanced water side economizer cooling system communicates with outside air sensors to utilize Oregon’s naturally cool temperatures, instead of using energy to create cool air. Incorporating efficient technologies such as these enables our operations to run a PUE (Power Usage Effectiveness) of 1.06 during full economization mode.
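For context, PUE is total facility energy divided by the energy delivered to the IT equipment, so a PUE of 1.06 corresponds to roughly 6% overhead. As an illustrative calculation (the 1,000-kW figure is hypothetical, not LinkedIn’s actual load), 1,000 kW of IT load at a PUE of 1.06 draws about 1,000 x 1.06 = 1,060 kW at the utility feed, leaving only about 60 kW for cooling, power conversion, and lighting; the same load at a more typical PUE of 1.5 would carry roughly 500 kW of overhead.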
Implementing the Project Altair next-generation data center design enabled us to move to a widely distributed non-blocking fabric with uniform chipset, bandwidth, and buffering characteristics in a minimalistic and elegant design. We encourage the most minimalistic and simplistic approaches in infrastructure engineering because they are easy to understand and hence easier to scale. Moving to a unified software architecture for the whole data center allows us to run the same set of tools on both end systems (servers) and intermediate systems (network).
The shift to simplification in order to own and control our architecture motivated us to also use our own 100G Ethernet open switch platform, called Project Falco. The advantages of using our own software stack are numerous: faster time to market, uniformity, simplification in feature requirements as well as deployment, and controlling our architecture and scale, to name a few.
In addition to the infrastructure innovation mentioned earlier, our Oregon data center has been designed and deployed to use IPv6 (next generation internet protocol) from day one. This is part of our larger vision to move our entire stack to IPv6 globally in order to retire IPv4 in our existing data centers. The move to IPv6 enabled us to run our application stack and our private cloud, LPS (LinkedIn Platform as a Service), without the limitations of traditional stacks.
As we moved to a distributed security system by creating a distributed firewall and removing network load balancers, the process of preparing and migrating our site to the new data center became a complicated task. It required significant automation and code development, as well as changes to procedures and software configurations, but ultimately reduced our infrastructure costs and gave us additional operational flexibility.
Our site reliability engineers and systems engineering teams introduced a number of innovations in deployment and provisioning, which allowed us to streamline the software buildout process. This, combined with zero touch deployment, resulted in a shorter timeline and smoother process for bringing a data center live than we’ve ever achieved before.
Acknowledgements
We’re delighted to participate in the growth of Oregon as a next-generation sustainable technology center. The Uptime Institute has recognized LinkedIn with an Efficient IT award for our efforts! This award evaluates the management processes and organizational behaviors to achieve lasting reductions in cost, utilities, staff time, and carbon emissions for data centers.
The deployment of our new data center was a collective effort of many talented individuals across LinkedIn in several organizations. Special thanks to the LinkedIn Data Center team, Infrastructure Engineering, Production Engineering and Operations (PEO), Global Ops, AAAA team, House Security, SRE organization, and LinkedIn Engineering.
Airline Outages FAQ: How to Keep Your Company Out of the Headlines
Uptime Institute has prepared this brief airline outages FAQ to help the industry, media, and general public understand the reasons that data centers fail.
The failure of a power control module on Monday, August 8, 2016, at a Delta Airlines data center caused hundreds of flight cancellations, inconvenienced thousands of customers, and cost the airline millions of dollars. And while equipment failure is the proximate cause of the data center failure, Uptime Institute’s long experience evaluating the design, construction, and operations of facilities suggests that many enterprise data centers are similarly vulnerable because of construction and design flaws or poor operations practices.
What happened to Delta Airlines?
While software is blamed for many well publicized IT problems, Delta Airlines is blaming a piece of infrastructure hardware. According to the airline, “…a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power. The universal power was stabilized and power was restored quickly. But when this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.” We take Delta at its word.
Some of the first reports blamed switchgear failure or a generator fire for the outage. Later reports suggested that critical services were housed on single-corded servers or that both cords of dual-corded servers were plugged into the same feed, which would explain why backup power failed to keep some critical services on line.
However, pointing to the failure of a single piece of equipment can be very misleading. Delta’s facility was designed to have redundant systems, so the facility should have remained operational had the facility performed as designed. In short, a design flaw, construction error or change, or poor operations procedures set the stage for the catastrophic failure.
What does this mean?
Equipment like Delta’s power control module (more often called a UPS) should be deployed in a redundant configuration to allow for maintenance or to support IT operations in the event of a fault. However, IT demand can grow over time so that the initial redundancy is compromised, and each piece of equipment is overloaded when one fails or is taken off line. At this point, only Delta really knows what happened.
Organizations can compromise redundancy in this way by failing to track load growth, lacking a process to manage load growth, or making a poor business decision because of unanticipated or uncontrolled load growth.
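To make this failure mode concrete, the short Python sketch below checks whether an N+1 UPS group can still carry its IT load after a single module fails. All capacities and loads are hypothetical illustrations, not Delta’s actual configuration.

    # Hypothetical sketch: does an N+1 UPS group still hold the IT load if one module fails?
    MODULE_CAPACITY_KW = 500      # assumed rating of each UPS module
    MODULES_INSTALLED = 4         # three needed plus one redundant (N+1) at design time
    DESIGN_IT_LOAD_KW = 1200      # load the group was originally sized for
    CURRENT_IT_LOAD_KW = 1650     # load after several years of unmanaged growth

    def redundant_after_single_failure(load_kw, modules, module_kw):
        """Return True if the remaining modules can carry the load after one module fails."""
        remaining_capacity_kw = (modules - 1) * module_kw
        return load_kw <= remaining_capacity_kw

    for label, load_kw in (("design", DESIGN_IT_LOAD_KW), ("current", CURRENT_IT_LOAD_KW)):
        ok = redundant_after_single_failure(load_kw, MODULES_INSTALLED, MODULE_CAPACITY_KW)
        print(f"{label} load of {load_kw} kW: {'still redundant' if ok else 'redundancy lost'}")

With these assumed numbers, the original design load still fits on the three remaining modules (1,200 kW against 1,500 kW of remaining capacity), but the grown load does not (1,650 kW against 1,500 kW), which is exactly the silent loss of redundancy described above.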
Delta Airlines is like many other organizations. Its IT operations are complex and expensive, change is constant, and business demands are constantly growing. As a result, IT must squeeze every cent out of its budgets, while maintaining 24x7x365 operations. Inevitably these demands will expose any vulnerability in the physical infrastructure and operations.
Failures like the one at Delta can have many causes, including a single point of failure, lack of totally redundant and independent systems, or poorly documented maintenance and operations procedures.
Delta’s early report ignores key issues: why a single equipment failure could cause such damage, what could have been done to prevent it, and how Delta will respond in the future.
Delta and the media are rightly focused on the human cost of the failure. Uptime Institute is sure that Delta’s IT personnel will be trying to reconstruct the incident once operations have been restored and customer needs met. This incident will cost millions of dollars and hurt Delta’s brand; Delta will not want a recurrence. They have only started to calculate what changes will be required to prevent another incident and how much it will cost.
Why is Uptime Institute publishing this FAQ?
Uptime Institute has spent many years helping organizations avoid this exact scenario, and we want to share our experience. In addition, we have seen many published reports that simply miss the point of how to achieve and maintain redundancy. Equipment fails all the time. Facilities that are well designed, built, and operated have systems in place to minimize data center failure. Furthermore, these systems are tested and validated regularly, with special attention to upgrades, load growth, and modernization plans.
How can organizations avoid catastrophic failures?
Organizations spend millions of dollars to build highly reliable data centers and keep them running. They don’t always achieve the desired outcome because of a failure to understand data center infrastructure.
The temptation to reduce costs in data centers is great because data centers require a great deal of energy to power and cool servers and they require experienced and qualified staff to operate and maintain.
Value engineering in the design process and mistakes and changes in the building process can result in vulnerabilities even in new data centers. Poor change management processes and incomplete procedures in older data centers are another cause for concern.
Uptime Institute believes that independent third-party verifications are the best way to identify the vulnerabilities in IT infrastructure. For instance, Uptime Institute consultants have Certified that more than 900 facilities worldwide meet the requirements of the Tier Standards. Certification is the only way to ensure that a new data center has been built and verified to meet business standards for reliability. In almost all these facilities, Uptime Institute consultants made recommendations to reduce risks that were unknown to the data center owner and operators.
Over time, however, even new data centers can become vulnerable. Change management procedures must help organizations control IT growth, and maintenance procedures must be updated to account for equipment and configuration changes. Third-party verifications such as Uptime Institute’s Tier Certification for Operational Sustainability and Management & Operations Stamp of Approval ensure that an organization’s procedures and processes are complete, accurate, and up-to-date. Maintaining up-to-date procedures is next to impossible without management processes that recognize that data centers change over time as demand grows, equipment changes, and business requirements change.
I’m in IT, what can I do to keep my company out of the headlines?
It all depends on your company’s appetite for risk. If IT failures would not materially affect your operations, then you can probably relax. But if your business relies on IT for mission-critical services, whether customer facing or internal, then you should consider having Uptime Institute evaluate your data center’s management and operations procedures and then implement the recommendations that result.
If you are building a new facility or renovating an existing facility, Tier Certification will verify that your data center will meet your business requirements.
These assessments and certifications require a significant organizational commitment in time and money, but these costs are only a fraction of the time and money that Delta will be spending as the result of this one issue. In addition, insurance companies increasingly have begun to reduce premiums for companies that have their operational preparedness verified by a third party.
Published reports suggest that Delta thought it was safe from this type of outage, and that other airlines may be similarly vulnerable because of skimpy IT budgets, poor prioritization, and bad processes and procedures. All companies should undertake a top-to-bottom review of their facilities and operations before spending millions of dollars on new equipment.
Bulky Binders and Operations “Experts” Put Your Data Center at Risk
Wearable technology and digitized operating procedures ensure compliance with standardized practices and provide quick access to critical information
By Jose Ruiz
No one in the data center industry questions that data center outages are extremely costly. Numerous vendors report that the average data center outage costs hundreds of thousands of dollars, with losses in some circumstances totaling much more. Uptime Institute Network Abnormal Incident Reports data show that most downtime instances are caused by human error, a key vulnerability in mission-critical environments. In addition, Compass Datacenters believes that many failures that enterprises attribute to equipment failure could more accurately be considered human error. However, most companies prefer to blame the hardware rather than their processes.
Figure 1. American Electric Power is one of the largest utilities in the U.S., with a service territory covering parts of 11 states.
Whether one accepts higher or lower outage costs, it is clear that reducing human error as a cause of downtime has a tremendous upside for the enterprise. As a result, data center owners dedicate a great deal of time to developing maintenance and operation procedures that are clear, effective, and easy to follow. Compass Datacenters is incorporating wearable technology in its efforts to make the procedures convenient for technicians to access and execute. During the process, Compass learned that the true value of introducing wearables into data center operations comes from standardizing the procedures. Incorporating wearable technology enhances the ability of users to follow the procedures. In addition, allowing users to choose the device most useful for the task at hand improves their comfort with the new process.
THE PROBLEM WITH TRADITIONAL OPS DOCUMENTATION
There are two major problems with the current methods of managing and documenting data center procedures.
In both instances, Operations personnel deem the written procedural guidelines inconvenient. As a result, actual data center operation tends to be more experience-based. Bluntly put, too many organizations rely too much on the most experienced or the most insistent Operations person on site at the time of an incident.
Figure 2. Compass Datacenters’ use of electronics and checklists clarifies the responsibilities of Operations personnel doing daily rounds.
On the whole, experience-based operations are not all bad. Subject matter experts (SMEs) emerge and are valued, with operations running smoothly for long periods at a time. However, as Julian Kudritzki wrote in “Examining and Learning from Complex Systems Failures” (The Uptime Institute Journal, vol. 5, p. 12), “‘Human error’ is an insufficient and misleading term. The front-line operator’s presence at the site of the incident ascribes responsibility to the operator for failure to rescue the situation. But this masks the underlying causes of an incident. It is more helpful to consider the site of the incident as a spectacle of mismanagement.” What’s more, subject matter experts may leave or get sick or simply have a bad day. In addition, most of their peers assume the SME is always right.
Figures 3-5. The new system provides reminders of monthly tasks, prompts personnel to take required steps, and then requires verification that the steps have been taken.
Of course, no one goes to work planning to hurt a coworker or cause an outage, but errors are made. Compass Datacenters believes its efforts to develop professional quality procedures and use wearable technology reduce the likelihood of human error. Compass’s intent is to make it easy for operators to handle both routine and foreseeable abnormal operations by standardizing where possible and improvising only when absolutely necessary and by making it simple and painless for technicians to comply with well-defined procedures.
AMERICAN ELECTRIC POWER
American Electric Power (AEP), a utility based in Columbus, OH that provides electricity to more than 5.4 million customers in 11 states, inspired Compass to bring wearable technology to the data center to help the utility reduce human error. The initiative was a first for Compass, and AEP’s collaboration was essential to the product development effort. Compass and AEP began by adopting the checklist approach used in the airline business, among many others. This approach to avoiding human error is predicated on the use of carefully crafted operational procedures. Foreseeable critical actions by the pilot and copilot are strictly prescribed to ensure that all processes are followed in the same manner, every time, no matter who is in the cockpit. They standardize where possible and improvise only when required. Airline procedural techniques also put extra focus on critical steps to eliminate error. While this professional approach to operations has been widely adopted by a host of industries, it has not yet become a standard in the data center business.
Compass and AEP partnered with ICARUS Ops to develop the electronic checklist and wearables programs. ICARUS Ops was founded by two former U.S. Navy carrier pilots, who use their extensive experience in developing standardized aviation processes to develop wearable programs across a broad spectrum of high-reliability and mission-critical industries. During the process, ICARUS Ops brought on board several highly experienced consultants, including military and commercial pilots and a former Space Shuttle mission controller. At the start of the project, Compass, ICARUS Ops, and AEP worked together to identify the various maintenance and operations tasks that would be wearable accessible. Team members reviewed all existing emergency operating procedures (EOP). The three organizations held detailed discussions with AEP’s data center personnel to help identify the procedures best suited to run electronically.
Including operators in the process ensured that their perspectives would be part of the final procedures to make it easier for them to execute correctly. Also, this early involvement led to a high level of buy-in, which would be needed for this project to succeed.
ELECTRONIC PROCEDURES
The next step was converting the procedures to electronic checklists. The transition to electronics was accomplished using the ICARUS Ops application and web-based Mission Control. The project team wanted each checklist to include the following elements (a brief sketch of how one such step might be structured digitally follows the list):
o Warnings to indicate where serious injury is possible if all procedures are not followed specifically
o Cautions indicating where significant data loss or damage is possible
o Notes to add any additional information that may be required or helpful
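As a minimal illustration, the Python sketch below shows one way a digitized checklist step could carry these annotations. The field names and example content are assumptions for this article, not the actual ICARUS Ops data model.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class ChecklistStep:
        """One step of a digitized procedure, carrying the annotations listed above."""
        instruction: str                                  # what the technician must do
        warning: Optional[str] = None                     # risk of serious injury if not followed
        caution: Optional[str] = None                     # risk of data loss or equipment damage
        notes: List[str] = field(default_factory=list)    # any additional helpful information
        requires_verification: bool = False               # technician must confirm completion

    # Example step (hypothetical content)
    step = ChecklistStep(
        instruction="Open breaker CB-2A on switchboard SWB-2 in Electrical Room B.",
        warning="Arc-flash hazard: wear rated PPE before operating the breaker.",
        caution="Confirm the redundant feed is carrying load before opening CB-2A.",
        notes=["Photograph the breaker position after operation."],
        requires_verification=True,
    )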
AEP SMEs walked through each and every checklist to provide an additional quality control step.
Figure 6. Individual steps include illustrations and diagrams to eliminate any confusion.
Having SOPs and EOPs in a wearable digital format allows users to access them whenever an abnormal or emergency situation occurs, so no time is lost looking for the three-ring binder and proper procedure. In critical situations and emergencies, time is crucial and delay can be costly. Hard-copy information manuals and operating handbooks are bulky and can potentially overwhelm technicians looking for specific information. The bulk and formatting can cause users to lose their place on a page and skip over important items. By contrast, the digital interface provides quick access to users and allows for quicker navigation to desired materials and information via presorted emergency and abnormal procedures.
At the conclusion of the development phase of the process (about a month), 150 of AEP’s operational procedures had been converted into digital procedures that were conveniently accessible on mobile devices, online or offline. Most importantly, a digital continuous improvement process was put in place to ensure that lessons learned would be easy to incorporate into the system.
The process of entering the procedures took about 2 weeks. First, a data entry team put them in as written. Then Compass worked with AEP and ICARUS Ops checklist experts to tighten up the language and add decision points, warnings, cautions, and notes. This group also embedded pictures and video where appropriate and decided what data needed to be captured during the process. Updates are easy: each time a checklist or procedure is used, improvements can be approved and digitally published right away. This is critical. AEP believes that when employees know they are being heard, job satisfaction improves and they take more ownership and pride in the facility. The modifications and improvements will be continuous; the system is designed for constant improvement. Any time a technician has an idea to improve a checklist, he just records the idea in the application, which instantly emails it to the SME for approval and publishing into the system. Other devices on the system sync automatically when they update.
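The continuous-improvement loop described above (technician suggestion, SME approval, version bump, automatic device sync) can be pictured with a small Python sketch. The function names and data layout are hypothetical illustrations and are not the ICARUS Ops application’s actual interface.

    # Hypothetical sketch of the improvement loop: a technician suggests a change,
    # an SME approves it, the checklist version is bumped, and devices re-sync.
    checklist = {"id": "EOP-014", "version": 3, "steps": ["Isolate the affected UPS module."]}
    pending_suggestions = []

    def suggest_improvement(checklist_id, technician, change):
        """Record a technician's idea; in practice this would also notify the SME by email."""
        pending_suggestions.append({"checklist": checklist_id, "by": technician, "change": change})

    def approve_and_publish(suggestion, target):
        """SME approval adds the step and bumps the version so devices pull the update on next sync."""
        target["steps"].append(suggestion["change"])
        target["version"] += 1
        return target["version"]

    suggest_improvement("EOP-014", "J. Smith", "Confirm the BMS alarm has cleared before closing out.")
    new_version = approve_and_publish(pending_suggestions.pop(0), checklist)
    print(f"Published version {new_version}; field devices will sync automatically.")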
WEARABLE TECHNOLOGY
Using wearable technology made the electronic procedures even more accessible to the end users. At the outset, AEP, ICARUS Ops, and Compass determined that AEP technicians would be able to choose between a wrist-mounted Android device and a tablet. The majority of users selected the wrist-mounted device. This preference was not unexpected, as a good portion of their work requires them to use their hands. The wrist units allow users to have both hands available to perform actual work at all times.
Figure 7. With all maintenance activities logged, a dashboard enables management to retrieve and examine reports.
Compass chose the Android technology system for several reasons. First, from the very outset, Compass envisioned making Google Glass one of the wearable options. In addition, Android allows programs maximum freedom to use the hardware and software as required to meet customer needs; it is simply less constraining than working with iOS. Android devices are also easy to procure, and manufacturers make Class 1, Division 1 Android tablets that are usable in hazardous environments.
The software-based technology enhances AEP’s ability to track all maintenance activity in its data center. Once online, each device communicates the recorded operational information (actions performed, components replaced, etc.) to the site’s maintenance system. This ensures that all routine maintenance is performed on schedule, and at the same time provides a detailed history of the activities performed on each major component within the facility. Web-based leadership dashboards provide oversight into what is complete, due, or overdue. The system is completely secure, as it does not reside on the same server or system as any other system in the data center. It can be encrypted or resident in the data center for increased security. Typically ICARUS Ops spends at least 3 days on site training the technicians and users, depending on the number of devices. Working together in that time normally leaves the team very comfortable with all components.
THE NEXT PHASE: GLASS
Although innovations such as Google Glass have increased awareness of visual displays, the concept dates back to the Vietnam War, when American fighter pilots began to use augmented reality on heads-up displays (HUDs). Compass Datacenters plans to harness the same augmented reality to enable maintenance technicians to complete a task completely hands free using Google Glass and other advanced wearables.
It is certainly safe to assume that costs related to data center outages will continue to rise in the future. As a result, it will become more important to increase operator reliability and reduce the risk of human error. We strongly believe that professional quality procedures and wearables will make a tremendous contribution to eliminating the human error element as a major cause of interruption and lost revenue. Our vision is that other data center providers will decide to make this innovative checklist system a part of their standard offerings over the course of the next 12 to 24 months.
Jose Ruiz
Jose Ruiz serves as Compass Datacenters’ Vice President of Operations. He provides interdisciplinary leadership in support of Compass’ strategic goals. Mr. Ruiz began his tenure at Compass as the company’s director of Engineering. Prior to joining Compass he spent three years in various engineering positions and was responsible for a global range of projects at Digital Realty Trust.
Tier III Certified Facilities Prove Critical to Norwegian Colo’s Client Appeal
/in Executive/by Kevin HeslinGreen Mountain melds sustainability, reliability, competitive pricing, and independent certifications to attract international colo customers
By Kevin Heslin
Green Mountain operates two unique colo facilities in Norway, having a total potential capacity of several hundred megawatts. Though each facility has its own strengths, both embody the company’s commitment to providing secure, high-quality service in an energy efficient and sustainable manner. CEO Knut Molaug and Chief Sales Officer Petter Tømmeraas recently took time to explain how Green Mountain views the relationship between cost, quality, and sustainability.
Tell our readers about Green Mountain.
KM: Green Mountain focuses on the high-end data center market, including banking/finance, oil and gas, and other industries requiring high availability and high quality services.
PT: IT and cloud are also very big customer segments. We think the US and European markets are the biggest for us, but we also see some Asian companies moving into Europe that are really keen on having high-quality data centers in the area.
KM: Green Mountain Data Centers operates two data centers in Norway. Data Center 1 in Stavanger began operation in 2013 and is located in a former underground NATO ammunition storage facility inside a mountain on the west coast. Data Center 2, a more traditional facility, is located in Telemark, which is in the middle of Norway.
Today DC1-Stavanger is a high security colocation data center housing 13,600 square meter facility (m2) of customer space. The infrastructure can support up to 26 megawatts of IT load today. The main data center comprises three two-story concrete buildings built inside the mountain, with power densities ranging from 2-6 kW/m2, but the facility can support up to 20 kW/m2. NATO put a lot of money into creating their facilities inside the mountain, which probably saved us 1 billion Kroners ($US 150 million).
DC2-Telemark is located in a historic region of Norway and was built on a brownfield site with a 10-MW supply initially available. The first phase is a fully operationa1 10-MW Tier lll Certified Facility, with four new buildings and up to 25 MW total capacity planned. This site could support even larger facilities if the need arises.
Green Mountain focuses a lot on being green and environmentally friendly, so we use 100% renewable energy in both data centers.
How do the unique features of the data centers affect their performance?
KM: Besides being located in a mountain, DC1 has a unique cooling system. We use the fjords for cooling year round, which gives us 8°C (46 °F) water for cooling. The cooling solution (including cooling station, chilled water pipework and pumps) is fully duplicated, providing an N+N solution. Because there are few moving parts (circulating pumps) the solution is extremely robust and reliable. In-row cooling is installed to client specification using Hot Aisle technology.
We use only 1 kilowatt of power to produce 100 kilowatts of cooling. So the data center is extremely energy efficient. In addition, we are connected to three independent power supplies, so DC1 has extremely robust power.
DC2 probably has the most robust power supply in Europe. We have five independent hydropower plants within a few kilometers of the site, and the two closest are just a few hundred meters away.
How do you define high quality?
PT: High quality means Uptime Institute Tier Certification. We are not only saying we have very good data centers. We’ve gone through a lot of testing so we are able to back it up, and the Uptime Institute Tier Standard is the only standard worldwide that certifies data center infrastructure to a certain quality. We’re really strong on certifications because we don’t only want to tell our customers that we have good quality, we want to prove it. Plus we want the kinds of customers who demand proof. As a result, both our facilities are Tier III Certified.
Please talk about the factors that went into deciding to obtain Tier Certification.
KM: We have focused on high-end clients that require 100% uptime and are running high-availability solutions. Operations for this type of company generally require documented infrastructure.
The Tier III term is used a lot, but most companies can’t back it up. Having been through testing ourselves, we know that most companies that haven’t been certified don’t have a Tier III facility, no matter what they claim. When we talk to important clients, they see that as well.
What was the on-site experience like?
PT: When the Uptime Institute team was on site, we could tell that Certification was a quality process with quality people who knew what they were doing. Certification also helped us document our processes because of all the testing routines and scenarios. As a result, we know we have processes and procedures for all the thinkable and unthinkable scenarios and that would have been hard to do without this process.
Why do you call these data centers green?
KM: First of all we use only renewable energy. Of course that is easy in Norway because all the power is renewable. In addition we use very little of it, with the fjords as a cooling media. We also built the data centers using the most efficient equipment, even though we often paid more for it.
PT: Green Mountain is committed to operate in a sustainable way and this reflects in everything we do. The good thing about operating in such a way is that our customers benefit from this financially. As we bill power cost based on consumption of power, the more energy efficient we operate the smaller the bill to our customer. When we tell these companies that they can even save money going for our sustainable solutions this makes their decision easier.
More and more customers require that their new data center solutions be sustainable, but we still see that price is a key driver for most major customers. The combination of very sustainable solutions and very competitive pricing is the best way of driving sustainability further into the minds of our customers.
All our clients reduce their carbon footprint when they move into our data centers and stop using their old, inefficient facilities.
We have a few major financial customers that have put forward very strict sustainability targets and have found us to be the supplier that best meets those requirements.
KM: And, of course, making use of an already built facility was also part of the green strategy.
How does your cost structure help you win clients?
PT: It’s important, but it’s not the only factor. The security and quality we can offer are just as important, and the fact that we can deliver them at competitive pricing is very important.
Were there clients who were attracted to your green strategy?
PT: Several of them, but the decisive factor for customers is rarely just one thing. We offer the combination of a really competitive offering and a high level of quality, and we are a really sustainable, green solution. Being able to offer that at a competitive price is quite unique, because people often think they have to pay more to get a sustainable, green solution.
Are any of your targeted segments more attracted to sustainability solutions?
PT: Several of the international system integrators really like the combination. They want a sustainable solution, but they also want a competitive offering. When they get both, it’s a no-brainer for them.
How does your sustainability/energy efficiency program affect your reliability? Do potential clients have any concerns about this? Do any require sustainability?
PT: Our programs do not affect our reliability in any way. We have chosen to implement only solutions that do not harm our ability to deliver the quality we promise to our customers. We have never experienced a single second of SLA breach for any customer in any of our data centers. In fact, some of our most sustainable solutions, like the cooling system based on cold seawater, increase our reliability because they considerably reduce the risk of failure compared to conventional cooling systems. We have not experienced any concerns about these solutions.
Has Tier Certification proven critical in any of your client’s decisions?
PT: Tier Certification has proved critical in many of our clients’ decisions to move to Green Mountain. We see a shift in the market toward requiring Tier Certification, whereas it used to be more a matter of asking for Tier compliance, which anyone could claim without having to prove it. We think the future for quality data center providers will be to certify all their data centers.
Any customer with mission-critical data should require its suppliers to be Tier Certified. At the moment, this is the only way for a customer to ensure that its data center is built and operated as it should be in order to deliver the quality the customer needs.
Are there other factors that set you apart?
PT: Operational excellence. We have an operations team that excels every time. They consistently deliver a lot more than customers expect, and we have customers that are extremely happy with what they get from us. I hear that from customers all the time, and that’s mainly because our operations team does a phenomenal job.
The Uptime Institute testing criteria were very comprehensive and helped us develop our operational procedures to an even higher level; some of the scenarios created during Certification testing became the basis for new operational procedures and new tests that we now perform as part of normal operations.
Green Mountain definitely benefitted from the Tier process in a number of other ways, including training that gave us useful input for improving our own management and operational procedures.
What did you do to develop this team?
KM: When we decided to focus on high-end clients, we knew we needed high-end experience, expertise, and knowledge on the operations side, so we focused on that in recruiting, as well as on building a culture inside the company focused on delivering high quality the first time, every time.
We recruited people who knew how to operate critical environments and tasked them with developing those procedures and operational elements as part of their work, and they have done so successfully.
PT: And the owners made the resources available, both financial and staff hours, to create the quality we wanted. We also have a very good management system, so management has good knowledge of what’s happening; if we have an issue, it is very visible.
KM: We also have high-end equipment and tools to measure and monitor everything inside the data center, as well as operational tools to make sure we can handle any issue and deliver on our promises.
Kevin Heslin
Kevin Heslin is Chief Editor and Director of Ancillary Projects at Uptime Institute. In these roles, he supports Uptime Institute communications and education efforts. Previously, he served as an editor at BNP Media, where he founded Mission Critical, a commercial publication dedicated to data center and backup power professionals. He also served as editor at New York Construction News and CEE and was the editor of LD+A and JIES at the IESNA. In addition, Heslin served as communications manager at the Lighting Research Center of Rensselaer Polytechnic Institute. He earned a B.A. in Journalism from Fordham University in 1981 and a B.S. in Technical Communications from Rensselaer Polytechnic Institute in 2000.