
Luxembourg Colo Provides Multi-Tier Options

February 24, 2017/in Design, Executive/by Kevin Heslin

LuxConnect obtains different Tier Certifications to meet changing market demands

By Christine De Ridder

LuxConnect, a multi-tenant, multi-tier data center and dark fiber network operator based in Luxembourg, developed an innovative strategy using Uptime Institute Tiers to differentiate the services and pricing it offers to key customer groups. In doing so, LuxConnect became the first—and so far only—facility to offer multiple Tier levels in the same data center.

The exterior of DC1.3, LuxConnect’s green data center.

“LuxConnect’s multi-tier approach to Tier Certification enables it to offer customers different rooms with different Tier Objectives in the same building,” said Phil Collerton, managing director, Uptime Institute EMEA. “This gives the client the opportunity to use different rooms to house different types of applications, both mission critical and noncritical, while also benefiting from the flexible SLAs and pricing models that this strategy allows LuxConnect to offer.”

A Tier IV Certified Constructed Facility server room ready to host servers.

LuxConnect is a private limited company owned by the Luxembourg State. It was founded in 2006 with several missions, including strengthening Luxembourg’s position on the European Internet map and providing a state-of-the-art IT environment.

Roger Lampach, CEO, said, “Among our goals was to promote and facilitate high-end ICT investments into Luxembourg and offer white space in high-quality data centers. And that is what we achieved.”

Today LuxConnect has two sites and four buildings with a total server space of 14,700 square meters. LuxConnect has three independent data centers, DC1.1, DC1.2, and DC1.3, on its Bettembourg ICT Campus in the south of the country. The DC2 facility is located in Bissen. The two sites are interconnected via multiple routes using dark fiber to guarantee customers data center business continuity.

LuxConnect’s low-voltage distribution room.

LuxConnect designed, planned, and built these data centers in less than 10 years. The newest facility, DC1.3, launched in August 2015. Three of the four buildings have been Certified by Uptime Institute, including Tier IV Certification of Design Documents for DC1.1 and Tier II Certification of Design Documents for DC2. DC1.3 has Tier II and Tier IV Certifications of Design Documents and Constructed Facility in the same facility, which is a unique configuration.

Cold water distribution

Lampach said, “We know that we have to convince our visitors and potential customers that we are professional and that we know what to do and how to do it. Our visitors are usually very impressed by the quality and high level of our five-star data centers.”

LuxConnect’s mix of Tier Certifications is an integral part of meeting its customers’ demands. Having multiple Tier II facilities enables LuxConnect to attract gamers who require high availability and low latency but are very cost sensitive. Other verticals, such as banks, require the Fault Tolerant Tier IV Certification of Constructed Facility.

LuxConnect received Tier II and Tier IV Certifications for spaces in the same facility, making it unique.

With support from the government, LuxConnect is able to expand as soon as market demand appears. Lampach said, “We start building when we see that there is not enough capacity in the market and without having any contracts. It’s not like a profit-driven company would do, but as our primary shareholder wants outside companies to come at any moment, we have to anticipate.”

The cornerstone of this approach is DC1.3, which uniquely has both Tier II and Tier IV Certifications of Constructed Facility on the same floor of the same building.

The Tier II spaces are mostly for the gamers, according to Lampach: “The gamers have two installations, one called front end where the gaming company connects and the other one where our client analyzes customer behavior and determines which new products to present their customers. These companies pushed us to consider offering Tier II spaces. And we began the idea by achieving Tier II Certification of Design Documents in DC2.”

Uptime Institute presents Tier Certifications to LuxConnect.

From the beginning, LuxConnect designed its data centers following Uptime Institute’s Tier standards and specifications. The company’s two project managers are Uptime Institute Accredited Tier Designers. Lampach said: “Having the Tier Certifications has been a real benefit for us. It’s also a concern for us because we go to the market through our business partners. At the beginning, they don’t really understand the differences between Tier II, III, and IV and they think downtime will never happen or they decide to take the risk.”

Lampach feels that the Tier Certification of Constructed Facility process provided the project team with an opportunity to better understand the behavior of the different systems, which helps them to assure business continuity. The Operations staff also benefited from the Tier Certification process with a very intensive workweek that was instructive on the technical and human level.

“At the end of the week, Operations could really see how the electrical and mechanical systems reacted during the multiple demonstrations. They feel more comfortable working with the systems because they know exactly how to proceed for maintenance without impacting business continuity.” The staff was also able to witness how the systems react during different failure scenarios.

Rich Van Loo, Uptime Institute’s senior vice president, Facility Management Services, said, “The operations assessments helped to ensure LuxConnect gets the most availability out of each data center regardless of Tier while still allowing flexibility to its customers. It positions them to be a true leader for IT services in the region.”

LuxConnect is justifiably proud of its performance during the Certification process, as it had just two minor issues, both of which the company was able to resolve during the testing week. The minor nature of these issues testifies to the quality of LuxConnect’s design, construction, and commissioning processes.

Christine De Ridder

Christine De Ridder worked for several companies, including Schneider Electric and Siemens, before joining LuxConnect in 2012 as manager, Data Centre Projects. She became an Accredited Tier Designer in 2013. In addition, she speaks five languages and has an engineering degree in electromechanics.

ENTEL Achieves Uptime Institute Tier Certification of Operational Sustainability

February 17, 2017/in Operations/by Kevin Heslin

Tier Certification of Operational Sustainability enhances ENTEL’s services to customers

By Kevin Heslin

ENTEL began operations in 1964 as a provider of national and international long distance telephone services to companies in Chile. Today it is a consolidated provider of integrated telecommunication services and information technologies services, meeting the needs of corporations and large companies through tailored solutions, providing value, experience, and quality of service.

The company began operations in the aftermath of an earthquake that severely damaged the Chilean telecommunications network. More recently it added more services. Towards the end of 1995, ENTEL started to provide Internet connection services and in 1997 it introduced the first commercial network with ATM technology in Latin America, which evolved into the current Multiservice IP network in order to offer broadband solutions, guarantee quality of service, and add value to its clients. Today ENTEL offers Information Technology (IT) services, which vary according to the industrial segment and business model of each client, allowing competitive efficiencies and advantages that differentiate ENTEL from traditional telecommunications services.

As part of its offerings, ENTEL also provides a number of data center services to more than 3,000 clients. It owns and operates a network of five data centers that are certified under the ISO 9001:2000 and ISO 27001 standards. In addition, two of them are Tier III Gold Certified Constructed Facilities and one has the M&O Stamp of Approval.

Data Center Infrastructure

ENTEL has five data centers located in Amunátegui, Pedro de Valdivia, Ñuñoa, Longovilo, and Ciudad de los Valles that are linked through high-availability, high-capacity IP/MPLS/DWDM networks. These data centers have in excess of 7,500 square meters (m²) of data hall space with plans to expand to 11,675 m². From these data centers, ENTEL offers IT outsourcing services, from server hosting to more complex services involving operation and exploitation of the platforms that support clients’ business processes. These data center services meet the needs of companies’ most vital applications, improving security and protection of critical data and considerably reducing their infrastructure investments. ENTEL’s IT strategy is based on traditional data center services, growth in cloud services, and permanent innovation.

ENTEL first offered data center services in 2003 and is now Chile’s largest provider. To this end, ENTEL has staffed its facilities with more than 120 professionals and managers devoted to the implementation and operation of data center infrastructure projects.

Thanks to its team of specialists with extensive experience, the IT challenges faced by ENTEL have been met entirely in-house. Of particular note are the Ciudad de los Valles 1 and 2 facilities, each offering 2,000 m² of white space. Ciudad de los Valles 1 (CDLV1) entered production in May 2010, and Ciudad de los Valles 2 (CDLV2) entered production in March 2013 (see The Uptime Institute Journal, vol. 2, p. 64). Both are Tier III Gold Certified Constructed Facilities (see Figure 1). Both received Tier Certification of Operational Sustainability in October 2015 (see Table 1). Josué Ramírez, Uptime Institute director of Business Development LATAM, said, “With these Certifications, ENTEL shows its commitment to seeking excellence in its operations to provide better services to its clients and to contribute to the development of knowledge and professionalism in the region of Latin America.”

Figure 1. CDLV 1 and 2 are both Tier III Gold Certified Constructed Facilities.

Rich Van Loo, Uptime Institute, VP Facility Management Services, said, “The Tier Certification and Operational assessments have had a broad impact on not only the data center management divisions, but the company environment as a whole. Operational procedures not only reduce risk but help improve consistency and efficiency in those operations. ENTEL is looking to expand this philosophy to all their data centers.”

Table 1

ENTEL began construction of a third facility in 2015 at Ciudad de los Valles. This will add an additional 2,000 m² of white space. The new facility incorporates free-cooling technologies that achieve better energy efficiency and lower total costs of ownership for clients.

Tier Certification of Operational Sustainability

To differentiate itself from its competitors, ENTEL became the first to Tier Certify its facilities. ENTEL then decided to further differentiate itself by earning Uptime Institute’s Tier Certification of Operational Sustainability for CDLV 1 and 2. This decision was based on client demand for excellence of service, risk mitigation, and assurance of operational continuity.

ENTEL believes that:

Risk management should be approached as a team, leaving no room for assumptions or improvisation
Planning and later exhaustive review of activities should have an integral perspective, which allows risk to be controlled, mitigated, and contained
Planning activities and teamwork generate a virtuous circle of shared knowledge and learning
Continuous improvement, verified empirically, is key to mitigating human error in Operations

Figure 2. The Tier Certification of Operational Sustainability verifies that ENTEL maximizes the potential of its facilities and differentiates it from its competitors.

The Tier Certification of Operational Sustainability (TCOS) was an excellent way to reach these goals. For that reason, ENTEL dedicated a new internal organization to defining data center operations activities and tasked a second team with leading the effort to earn the Certification.

In addition, ENTEL created an Infrastructure Change Control Board (CCI), specific to data center infrastructure management, with the purpose of establishing planning and control of activities in each data center. This Board is a consulting body. It meets periodically and manages, reviews, and approves infrastructure management projects for the data centers.

Scope of the Infrastructure Change Control Board (CCI)

Although ENTEL had previously adopted the ITIL model and established a Change Advisory Board (CAB) that validates and approves all activities in the IT and data center infrastructure environment, it decided to create the CCI because of the degree of specialization of infrastructure work and the risk associated with it. The CCI’s focus is to raise risk points associated with high-impact activities and to ascertain control points (e.g., cells, transformers, generators, UPS, towers, and chillers) where system redundancy could be lost.

A summary of the type of work evaluated and documentation to be reviewed follows (see Table 2):

Table 2.

The CCI is intended to ensure that the good practices characteristic of Tier Certification of Operational Sustainability are respected and that all documentation and instructions are kept valid, are applied, and are subject to continuous improvement.

The Process

To increase ENTEL’s familiarity with the Tier Certification of Operational Sustainability process, ENTEL and Uptime Institute agreed to carry out the process in four phases (see Figure 3).

Figure 3. ENTEL views operational sustainability as a four-phase continuous process.

Important Decisions

Eventually, Uptime Institute and ENTEL adjusted the process in light of the initial progress, the maturity level of the team, and the imperative work of assuring operational continuity:

The Operations team was split in two groups, both reporting to a single manager in charge of coordinating activities
Operation, maintenance and tests teams were responsible for day-to-day activities
The Operations team was also entrusted with adjusting existing procedures/instructions, training the specialist staff, and leading the internal change within the data center

ENTEL also:

Retained an external consultant to structure its methodology and develop a map for sustainable training
Implemented a maintenance management system
Defined a support team to manage any setbacks through weekly follow-up, progress control, and resolution
Assured that the processes, procedures/instructions, and methodologies of Operational Sustainability would be adopted and applied

ENTEL believes that it received many benefits from earning the Tier Certification of Operational Sustainability, including:

Improved “standard operation” processes
Continuous improvement of processes and procedures
Formal training, sustainable over time
Shared lessons learned at all five data centers

The knowledge of the Certification methodology and good practices stayed in house, so ENTEL can replicate it at all its data centers and eventually in other areas of the organization (see Figure 4). ENTEL has worked to share this information company wide, with the additional notable achievement of Uptime Institute’s M&O Stamp of Approval at its Amunátegui facility in December 2016.

Figure 4. The staff at ENTEL is charged with promoting Operational Sustainability across the whole organization.

When an Australian Government Department Required Operational Sustainability, Metronode Delivered

January 3, 2017/in Executive, Operations/by Kevin Heslin

Senior facility manager calls achieving Tier Certification of Operational Sustainability “a dream”

By Kevin Heslin

The New South Wales (NSW) Department of Finance, Services and Innovation (DFSI) is a government service provider and regulator for the southeastern Australian state. DFSI supports many government functions, including sustainable government finances, major public works and maintenance programs, government procurement, and information and communications technology.

Josh Griggs, managing director of Metronode; Glenn Aspland, Metronode’s senior facility manager; and Derek Paterson, director–GovDC & Marketplace Services at the NSW Department of Finance, Services and Innovation (DFSI) discuss how Metronode responded when the NSW government decided to consolidate its IT operations.

Josh, can you tell me more about Metronode and its facilities?

Griggs, Metronode managing director: Metronode was established in 2002. Today we operate 10 facilities across Australia. Having designed and constructed our data centers, we have significant construction and engineering expertise. We also offer services and operational management. Our Melbourne 2 facility was the first Uptime Institute Tier III Certified Constructed Facility in Australia in 2012, and our Illawarra and Silverwater facilities were the first in the Asia Pacific region to be Tier III Gold Certified for Operational Sustainability by Uptime Institute. We have the most energy-efficient facilities in Australia, with a 4.5 National Australian Built Environment Rating System (NABERS) rating for data centers.*

We have two styles of data center. Generation 1 facilities are typically closer to the cities and have very high connectivity. If you were looking to connect to the cloud or for backup, the Gen 1’s fit that purpose. As a result, we have a broad range of clients, including multinationals, local companies, and a lot of government.

And then, we have Generation 2 Bladeroom facilities across five sites including facilities in the Illawarra and at Silverwater, which host NSW’s services. At present, we’ve got 3 megawatts (MW) of IT load in Silverwater and 760 kilowatts (kW) in the Illawarra. With additional phasing, Silverwater could host 15 MW and Illawarra 8 MW.

We are engineered for predictability. Our customers rely on us for critical environments. This is one of the key reasons we went through Uptime Institute Certification. It is not enough for us to say we designed, built, and operate to the Tier III Standards; it is also important that we verify it. It means we have a Concurrently Maintainable site and it is Operationally Sustainable.

Tell me about the relationship between the NSW government and Metronode.

Griggs, Metronode managing director: We entered a partnership with the NSW government in 2012, when we were selected to build and construct their facilities. They had very high standards and groundbreaking requirements that we’ve been able to meet and exceed. They were the first to specify Uptime Institute Tier III Certification of Constructed Facility, the first to specify NABERS, and the first to specify Tier Certification of Operational Sustainability as contract requirements.

Paterson, DFSI: A big part of my role at DFSI is to consolidate agencies’ legacy data center infrastructure into two strategic data centers owned and operated by Metronode. Our initial objective was to consolidate more than 130 data centers into two. We’ve since broadened the program to include the needs of local government agencies and state-owned companies.

When you look across the evolving landscape of requirements, Metronode was best equipped to support agency needs of meeting energy-efficiency targets and providing highly secure physical environments, while meeting service level commitments in terms of overall uptime.

Are these dedicated or colocation facilities?

Paterson, DFSI: When Metronode sells services to the private sector, it can host these clients in the Silverwater facility. At this point, though, Silverwater is 80% government, and Illawarra is a dedicated government facility.

What drove the decision to earn the Tier III Certification of Operational Sustainability?

Paterson, DFSI: We wrote the spec for what we wanted in regards to security, uptime, and service level agreements (SLA). Our contract required Metronode’s facilities to be Tier III Certified for design and build and Tier III Gold for operations. We benefited, of course; however, Metronode is also reaping rewards from both the Tier III Certification of Constructed Facility and Tier III Gold Certification of Operational Sustainability.

Griggs, Metronode managing director: Obviously the contractual requirement was one driver, but that’s not the fundamental driver. We needed to ensure that our mission critical environments are always operating. So Operational Sustainability ensured that we have a reliable, consistent operation to match our Concurrently Maintainable baseline, and our customers can rely on that.

Our operations have been Certified and tested in a highly rigorous process that ensured we had clear and effective processes documented, along with the flow charts to enable systems maintenance, systems monitoring, fault identification, etc. The Tier III Gold Certification also evaluates the people side of the operation, including skill assessment, acquisition of the right people, training, rostering, contracting, and the management in place as well as continuous improvement.

In that process, we had to ask ourselves if what we were actually doing was documented clearly and followed by everybody. It reached across everything you can think of in terms of operating a data center to ensure that we had all of that in place.

There are only 23 facilities in the world to have this Certification and only two in the Asia Pacific, which demonstrates how hard it is to get. And we wanted to be the best.

Glenn, how did you respond when you learned about DFSI’s requirement for Tier III Gold Certification of Operational Sustainability?

Aspland, Metronode senior facility manager: I thought this is fantastic. Someone has actually identified a specific and comprehensive set of behaviors that can minimize risk to the operations of the data center without being prescriptive.

When I was handed the Operational Sustainability assignment, the general manager said, “This is your goal.” Operations has been my career, so from my point of view it was a fantastic opportunity to turn years of knowledge into something real. Get all the information out, analyze and benchmark it, find the highs and lows, and learn why they exist. For me it was a passion.

I was new to the company at that point and the process started that night. I just couldn’t put the Tier materials down. I was quite impressed. After that, we spent 2-3 months organizing meetings and drilling into what I felt we needed to do to run our data centers in an Operationally Sustainable manner.

And then we began truly analyzing and assessing many of the simple statements in the briefing. For example, what does “fully scripted maintenance” mean?

What benefits did Metronode experience?

Paterson, DFSI: Metronode’s early deployments didn’t use Bladeroom, so this was a new implementation and there was a lot to learn. For them to design and build these new facilities at Tier III and then get the Tier III Gold Certification for Operational Sustainability showed us the rigor and discipline the team has.

Did the business understand the level of effort?

Aspland, Metronode senior facility manager: Definitely not. I had to work hard to convince the business. Operational Sustainability has put operations front and center. That’s our core business.

One of the biggest challenges was getting stakeholder engagement. Selling the dream.

I had the dream firmly in my mind from reading the document. But we had to get busy people to do things in a way that’s structured and document what they do. Not everyone rushes to that.

Practically, what did “selling the dream” require?

Aspland, Metronode senior facility manager: The approach is practical. We have to do what we do, and work through it with all the team. For instance, we did training on a one-on-one basis, but it wasn’t documented. So we had to ask ourselves: what do we teach? And then we have to produce training documents. After 12 months, we should know what we do and what our vendors do. But how do we know that they are doing what they are supposed to do? How do we validate all this? We have to make it a culture. That was probably the biggest change. We made the documents reflect what we actually do. Not just policy words about uptime and reliability.

Are there “aha” moments?

Aspland, Metronode senior facility manager: Continually. I am still presenting what we do and what we’ve done. Usually the “aha” happens when someone comes on site and performs major projects and follows detailed instructions that tell them exactly what to do, what button to punch, and what switchboard and what room. And 24 hours later, when they’ve relied on the document and nothing goes wrong, there’s the “aha” moment. This supported me. It made it easy.

How do you monitor the results?

Aspland, Metronode senior facility manager: We monitor our uptime continually. In terms of KPIs, we track maintenance fulfillment rates, open jobs every week, and how long they are open. And every 2 months, we run financial benchmarks to compare to budget.

Our people are aware that we are tracking and producing reports. They are also aware of the audit. We have all the evidence of what we’ve done because of the five-day Operational Sustainability assessment.

For ISO 27001 (editor’s note: ISO information security management standard), what they are really doing is checking our docs. We are demonstrating that all our maintenance is complete and it’s document based, so we have all that evidence and that’s now the culture.
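The KPIs Aspland mentions (maintenance fulfillment rate, open jobs, and how long jobs stay open) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not Metronode's actual system: the `Job` record layout, the sample dates, and the definition of fulfillment rate as the share of due jobs closed on or before their due date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Job:
    """One maintenance work order (hypothetical record layout)."""
    opened: date
    due: date
    closed: Optional[date] = None  # None means the job is still open


def fulfillment_rate(jobs: list[Job], as_of: date) -> float:
    """Share of jobs due by `as_of` that were closed on or before their due date.

    This definition is an assumption; the article does not define the KPI.
    """
    due_jobs = [j for j in jobs if j.due <= as_of]
    if not due_jobs:
        return 1.0
    on_time = sum(1 for j in due_jobs if j.closed is not None and j.closed <= j.due)
    return on_time / len(due_jobs)


def open_job_ages(jobs: list[Job], as_of: date) -> list[int]:
    """Age in days of every job that is open on `as_of`."""
    return [
        (as_of - j.opened).days
        for j in jobs
        if j.opened <= as_of and (j.closed is None or j.closed > as_of)
    ]


# Hypothetical sample data for a weekly report.
jobs = [
    Job(date(2017, 1, 2), date(2017, 1, 9), date(2017, 1, 6)),    # closed on time
    Job(date(2017, 1, 3), date(2017, 1, 10), date(2017, 1, 12)),  # closed late
    Job(date(2017, 1, 5), date(2017, 1, 12)),                     # overdue, still open
    Job(date(2017, 1, 10), date(2017, 1, 24)),                    # open, not yet due
]
today = date(2017, 1, 16)

print(f"fulfillment rate: {fulfillment_rate(jobs, today):.0%}")
print(f"open job ages (days): {open_job_ages(jobs, today)}")
```

Tracking the age of open jobs alongside the fulfillment rate shows whether overdue work is being cleared or simply accumulating.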

What do you view as your competitive advantage?

Griggs, Metronode managing director: You can describe the Australian market as a mature market. There are quite a few providers. At Metronode, we design, build and operate the most secure and energy-efficient, low PUE facilities in the country, providing our customers with high-density data centers to meet their needs, both now and in the future.

Recognized by NABERS for data center energy efficiency with the highest rating of 4.5, we are also the only Australian provider to have achieved Uptime Institute Tier III Gold Certification for Operational Sustainability. There is one other national provider that looks like us. Then you have local companies that operate in local markets and some international providers that have one or two facilities and operate with international customers that have a presence in Australia.

In this environment, our advantages include geolocation, energy efficiency, security, reliability, Tier III Gold Certification, and flexibility.

The Tier III Gold Certification and NABERS rating are important, because there are some outlandish claims about PUE and Uptime Institute Certification—people who claim to have Tier III facilities without going through the process of having that actually verified. Similarly, we believe the NABERS rating is becoming more important because of people making claims that they cannot achieve in terms of PUE.

Finally, we are finding that people are struggling to forecast demand. Because we are able to go from 1 kW/rack up to 30 kW, our customers are able to grow within their current footprint. Metronode has engineered Australia’s most adaptive data center designs, with an enviable speed to build in terms of construction. That ability to grow out in a rapid manner means that we are able to meet the growing requirements and often unforeseen customer demand.

* NABERS is an Australian rating system, managed nationally by the NSW Office of Environment and Heritage on behalf of federal, state, and territory governments. It measures the environmental performance of Australian buildings.

LinkedIn’s Oregon Data Center Goes Live, Gains EIT Stamp of Approval

December 8, 2016/in Executive, Operations/by Kevin Heslin

Uptime Institute recently awarded its Efficient IT (EIT) Stamp of Approval to LinkedIn for its new data center in Infomart Portland, signaling that the modern new facility had exceeded extremely high standards for enterprise leadership, operations, and computing infrastructure. These standards are designed to help organizations lower costs and increase efficiency, and leverage technology for good stewardship of corporate and environmental resources. Uptime Institute congratulates LinkedIn on this significant accomplishment.

Sonu Nayyar, VP of Production Operations & IT for LinkedIn, said, “Our newest data center is a big step forward, allowing us to adopt a hyperscale architecture while being even more efficient about how we consume resources. Uptime Institute’s Efficient IT Stamp of Approval is a strong confirmation that our management functions of IT, data center engineering, finance, and sustainability are aligned. We look forward to improving how we source energy and ultimately reaching our goal of 100 percent renewable energy.”

John Sheputis, President of Infomart Data Centers, said, “We knew this project would be special from the moment we began collaborating on design. Efficiency and sustainability are achieved through superior control of resources, not sacrificing performance or availability. We view this award as further evidence of Infomart’s long-term commitment to working with customers to create the most sustainable IT operations in the world.”

LinkedIn expressed its enthusiasm about both this new project and the Efficient IT Stamp of Approval in the LinkedIn Engineering blog post below.

LinkedIn’s Oregon Data Center Goes Live

By Shawn Zandi and Mike Yamaguchi

Earlier this year we announced Project Altair, our massively scalable, next-generation data center design. We also announced our plans to build a new data center in Oregon, in order to be able to more reliably deliver our services to our members and customers. Today, we’d like to announce that our Oregon data center, featuring the design innovations of Project Altair, is fully live and ramped. The primary criteria when selecting the Oregon location were: procuring a direct access contract for 100% renewable energy, network diversity, expansion capabilities, and talent opportunities.

LinkedIn’s Oregon data center, hosted by Infomart, is the most technologically advanced and highly efficient data center in our global portfolio, and includes a sustainable mechanical and electrical system that is now the benchmark for our future builds. We chose to utilize the ChilledDoor from MotivAir, a rear door heat exchanger neutralizing the heat closer to the source. The advanced water side economizer cooling system communicates with outside air sensors to utilize Oregon’s naturally cool temperatures, instead of using energy to create cool air. Incorporating efficient technologies such as these enables our operations to run a PUE (Power Usage Effectiveness) of 1.06 during full economization mode.
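To put the PUE figure above in context: PUE is the ratio of total facility power to IT equipment power, so a PUE of 1.06 means roughly 6% of overhead (cooling, power distribution losses, and so on) on top of the IT load. The sketch below shows the arithmetic. The 1,000 kW IT load is a hypothetical value chosen for illustration, and only the 1.06 PUE comes from the post.

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness: total facility power divided by IT power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


def overhead_kw(it_load_kw: float, pue_value: float) -> float:
    """Non-IT power (cooling, distribution losses, etc.) implied by a given PUE."""
    return it_load_kw * (pue_value - 1.0)


# Hypothetical IT load; not a figure from the post.
it_kw = 1000.0

overhead = overhead_kw(it_kw, 1.06)
print(f"overhead at PUE 1.06: {overhead:.1f} kW")
print(f"PUE check: {pue(it_kw + overhead, it_kw):.2f}")
```

Note that the post quotes 1.06 for full economization mode only, so it should not be read as an annual average.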

Implementing the Project Altair next-generation data center design enabled us to move to a widely distributed non-blocking fabric with uniform chipset, bandwidth, and buffering characteristics in a minimalistic and elegant design. We encourage the most minimalistic and simplistic approaches in infrastructure engineering because they are easy to understand and hence easier to scale. Moving to a unified software architecture for the whole data center allows us to run the same set of tools on both end systems (servers) and intermediate systems (network).

The shift to simplification in order to own and control our architecture motivated us to also use our own 100G Ethernet open switch platform, called Project Falco. The advantages of using our own software stack are numerous: faster time to market, uniformity, simplification in feature requirements as well as deployment, and controlling our architecture and scale, to name a few.

In addition to the infrastructure innovation mentioned earlier, our Oregon data center has been designed and deployed to use IPv6 (next generation internet protocol) from day one. This is part of our larger vision to move our entire stack to IPv6 globally in order to retire IPv4 in our existing data centers. The move to IPv6 enabled us to run our application stack and our private cloud, LPS (LinkedIn Platform as a Service), without the limitations of traditional stacks.

As we moved to a distributed security system by creating a distributed firewall and removing network load balancers, the process of preparing and migrating our site to the new data center became a complicated task. It required significant automation and code development, as well as changes to procedures and software configurations, but ultimately reduced our infrastructure costs and gave us additional operational flexibility.

Our site reliability engineers and systems engineering teams introduced a number of innovations in deployment and provisioning, which allowed us to streamline the software buildout process. This, combined with zero touch deployment, resulted in a shorter timeline and smoother process for bringing a data center live than we’ve ever achieved before.

Acknowledgements

We’re delighted to participate in the growth of Oregon as a next-generation sustainable technology center. The Uptime Institute has recognized LinkedIn with an Efficient IT award for our efforts! This award evaluates the management processes and organizational behaviors to achieve lasting reductions in cost, utilities, staff time, and carbon emissions for data centers.

The deployment of our new data center was a collective effort of many talented individuals across LinkedIn in several organizations. Special thanks to the LinkedIn Data Center team, Infrastructure Engineering, Production Engineering and Operations (PEO), Global Ops, AAAA team, House Security, SRE organization, and LinkedIn Engineering.

Airline Outages FAQ: How to Keep Your Company Out of the Headlines

November 28, 2016/in Executive, Operations/by Kevin Heslin

Uptime Institute has prepared this brief airline outages FAQ to help the industry, media, and general public understand the reasons that data centers fail.

The failure of a power control module on Monday, August 8, 2016, at a Delta Airlines data center caused hundreds of flight cancellations, inconvenienced thousands of customers, and cost the airline millions of dollars. And while equipment failure is the proximate cause of the data center failure, Uptime Institute’s long experience evaluating the design, construction, and operations of facilities suggests that many enterprise data centers are similarly vulnerable because of construction and design flaws or poor operations practices.

What happened to Delta Airlines?

While software is blamed for many well-publicized IT problems, Delta Airlines is blaming a piece of infrastructure hardware. According to the airline, “…a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power. The universal power was stabilized and power was restored quickly. But when this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.” We take Delta at its word.

Some of the first reports blamed switchgear failure or a generator fire for the outage. Later reports suggested that critical services were housed on single-corded servers or that both cords of dual-corded servers were plugged into the same feed, which would explain why backup power failed to keep some critical services on line.

However, pointing to the failure of a single piece of equipment can be very misleading. Delta’s facility was designed to have redundant systems, so the facility should have remained operational had the facility performed as designed. In short, a design flaw, construction error or change, or poor operations procedures set the stage for the catastrophic failure.

What does this mean?

Equipment like Delta’s power control module (more often called a UPS) should be deployed in a redundant configuration to allow for maintenance or support IT operation in the event of a fault. However, IT demand can grow over time so that the initial redundancy is compromised and each piece of equipment is overloaded when one fails or is taken off line. At this point, only Delta really knows what happened.

Organizations can compromise redundancy in this way by failing to track load growth, lacking a process to manage load growth, or making a poor business decision because of unanticipated or uncontrolled load growth.
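The way load growth erodes redundancy can be shown with a simple capacity check. The sketch below assumes a set of identical UPS modules and asks whether the remaining modules could still carry the IT load if one failed or were taken off line. The module sizes and loads are hypothetical and say nothing about Delta's actual configuration.

```python
def survives_single_failure(module_kw: float, modules: int, load_kw: float) -> bool:
    """True if the remaining modules can carry the load after one module is lost."""
    if modules < 1 or module_kw <= 0:
        raise ValueError("need at least one module with positive capacity")
    remaining_capacity_kw = module_kw * (modules - 1)
    return load_kw <= remaining_capacity_kw


# Hypothetical plant: three 500 kW modules, i.e. N+1 for a 1,000 kW design load.
for load_kw in (900.0, 1000.0, 1150.0):
    ok = survives_single_failure(module_kw=500.0, modules=3, load_kw=load_kw)
    status = "redundant" if ok else "redundancy lost"
    print(f"{load_kw:.0f} kW: {status}")
```

Running a check like this whenever load is added is one simple way to track load growth against the original design margin.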

Delta Airlines is like many other organizations. Its IT operations are complex and expensive, change is constant, and business demands are constantly growing. As a result, IT must squeeze every cent out of its budgets, while maintaining 24x7x365 operations. Inevitably these demands will expose any vulnerability in the physical infrastructure and operations.

Failures like the one at Delta can have many causes, including a single point of failure, lack of totally redundant and independent systems, or poorly documented maintenance and operations procedures.

Delta’s early report ignores key issues: why a single equipment failure could cause such damage, what could have been done to prevent it, and how Delta will respond in the future.

Delta and the media are rightly focused on the human cost of the failure. Uptime Institute is sure that Delta’s IT personnel will be trying to reconstruct the incident once operations have been restored and customer needs met. This incident will cost millions of dollars and hurt Delta’s brand; Delta will not want a recurrence. They have only started to calculate what changes will be required to prevent another incident and how much it will cost.

Why is Uptime Institute publishing this FAQ?

Uptime Institute has spent many years helping organizations avoid this exact scenario, and we want to share our experience. In addition, we have seen many published reports that just miss the point of how to achieve and maintain redundancy. Equipment fails all the time. Facilities that are well designed, built, and operated have systems in place to minimize data center failure. Furthermore, these systems are tested and validated regularly, with special attention to upgrades, load growth, and modernization plans.

How can organizations avoid catastrophic failures?

Organizations spend millions of dollars to build highly reliable data centers and keep them running. They don’t always achieve the desired outcome because of a failure to understand data center infrastructure.

The temptation to reduce costs in data centers is great because data centers require a great deal of energy to power and cool servers and they require experienced and qualified staff to operate and maintain.

Value engineering in the design process and mistakes and changes in the building process can result in vulnerabilities even in new data centers. Poor change management processes and incomplete procedures in older data centers are another cause for concern.

Uptime Institute believes that independent third-party verifications are the best way to identify the vulnerabilities in IT infrastructure. For instance, Uptime Institute consultants have Certified that more than 900 facilities worldwide meet the requirements of the Tier Standards. Certification is the only way to ensure that a new data center has been built and verified to meet business standards for reliability. In almost all these facilities, Uptime Institute consultants made recommendations to reduce risks that were unknown to the data center owner and operators.

Over time, however, even new data centers can become vulnerable. Change management procedures must help organizations control IT growth, and maintenance procedures must be updated to account for equipment and configuration changes. Third-party verifications such as Uptime Institute’s Tier Certification for Operational Sustainability and Management & Operations Stamp of Approval ensure that an organization’s procedures and processes are complete, accurate, and up-to-date. Maintaining up-to-date procedures is next to impossible without management processes that recognize that data centers change over time as demand grows, equipment changes, and business requirements change.

I’m in IT, what can I do to keep my company out of the headlines?

It all depends on your company’s appetite for risk. If IT failures would not materially affect your operations, then you can probably relax. But if your business relies on IT to deliver mission-critical services to external or internal customers, then you should consider having the Uptime Institute evaluate your data center’s management and operations procedures and then implement the recommendations that result.

If you are building a new facility or renovating an existing facility, Tier Certification will verify that your data center will meet your business requirements.

These assessments and certifications require a significant organizational commitment in time and money, but these costs are only a fraction of the time and money that Delta will be spending as the result of this one issue. In addition, insurance companies increasingly have begun to reduce premiums for companies that have their operational preparedness verified by a third party.

Published reports suggest that Delta thought it was safe from this type of outage, and that other airlines may be similarly vulnerable because of skimpy IT budgets, poor prioritization, and bad processes and procedures. All companies should undertake a top-to-bottom review of their facilities and operations before spending millions of dollars on new equipment.

Bulky Binders and Operations “Experts” Put Your Data Center at Risk

October 24, 2016/in Operations/by Kevin Heslin

Wearable technology and digitized operating procedures ensure compliance with standardized practices and provide quick access to critical information

By Jose Ruiz

No one in the data center industry questions that data center outages are extremely costly. Numerous vendors report that the average data center outage costs hundreds of thousands of dollars, with losses in some circumstances totaling much more. Uptime Institute Network Abnormal Incident Reports data show that most downtime instances are caused by human error–a key vulnerability in mission-critical environments. In addition, Compass Datacenters believes that many failures that enterprises attribute to equipment failure could accurately be considered human error. However, most companies prefer to blame the hardware rather than their processes.

Figure 1. American Electric Power is one of the largest utilities in the U.S., with a service territory covering parts of 11 states.

Whether one accepts higher or lower outage costs, it is clear that reducing human error as a cause of downtime has a tremendous upside for the enterprise. As a result, data center owners dedicate a great deal of time to developing maintenance and operation procedures that are clear, effective, and easy to follow. Compass Datacenters is incorporating wearable technology in its efforts to make the procedures convenient for technicians to access and execute. During the process, Compass learned that the true value of introducing wearables into data center operations comes from standardizing the procedures. Incorporating wearable technology enhances the ability of users to follow the procedures. In addition, allowing users to choose the device most useful for the task at hand improves their comfort with the new process.

THE PROBLEM WITH TRADITIONAL OPs DOCUMENTATION

There are two major problems with the current methods of managing and documenting data center procedures.

Many data center operations teams document their procedures and methodologies in large three-ring binders. Although their presence is comforting, Operations staffs rarely consult them.
Also, organizations often rely on highly detailed written documentation presented in such depth that Operations staff wastes a great deal of time trying to locate the appropriate information.

In both instances, Operations personnel deem the written procedural guidelines inconvenient. As a result, actual data center operation tends to be more experience-based. Bluntly put, too many organizations rely too much on the most experienced or the most insistent Operations person on site at the time of an incident.

Figure 2. Compass Datacenters use of electronics and checklists clarifies the responsibilities of Operations personnel doing daily rounds.

On the whole, experience-based operations are not all bad. Subject matter experts (SMEs) emerge and are valued, with operations running smoothly for long periods at a time. However, as Julian Kudritzki wrote in “Examining and Learning from Complex Systems Failures” (The Uptime Institute Journal, vol. 5, p. 12), “‘Human error’ is an insufficient and misleading term. The front-line operator’s presence at the site of the incident ascribes responsibility to the operator for failure to rescue the situation. But this masks the underlying causes of an incident. It is more helpful to consider the site of the incident as a spectacle of mismanagement.” What’s more, subject matter experts may leave or get sick or simply have a bad day. In addition, most of their peers assume the SME is always right.

Figure 3.

Figure 4.

Figures 3-5. The new system provides reminders of monthly tasks, prompts personnel to take required steps, and then requires verification that the steps have been taken.

Of course, no one goes to work planning to hurt a coworker or cause an outage, but errors are made. Compass Datacenters believes its efforts to develop professional quality procedures and use wearable technology reduce the likelihood of human error. Compass’s intent is to make it easy for operators to handle both routine and foreseeable abnormal operations by standardizing where possible and improvising only when absolutely necessary and by making it simple and painless for technicians to comply with well-defined procedures.

AMERICAN ELECTRIC POWER

American Electric Power (AEP), a utility based in Columbus, OH, that provides electricity to more than 5.4 million customers in 11 states, inspired Compass to bring wearable technology to the data center to help the utility reduce human error. The initiative was a first for Compass, and AEP’s collaboration was essential to the product development effort. Compass and AEP began by adopting the checklist approach used in the airline business, among many others. This approach to avoiding human error is predicated on the use of carefully crafted operational procedures. Foreseeable critical actions by the pilot and copilot are strictly prescribed to ensure that all processes are followed in the same manner, every time, no matter who is in the cockpit. They standardize where possible and improvise only when required. Airline procedural techniques also put extra focus on critical steps to eliminate error. While this professional approach to operations has been widely adopted by a host of industries, it has not yet become a standard in the data center business.

Compass and AEP partnered with ICARUS Ops to develop the electronic checklist and wearables programs. ICARUS Ops was founded by two former U.S. Navy carrier pilots, who use their extensive experience in developing standardized aviation processes to develop wearable programs across a broad spectrum of high-reliability and mission-critical industries. During the process, ICARUS Ops brought on board several highly experienced consultants, including military and commercial pilots and a former Space Shuttle mission controller. At the start of the project, Compass, ICARUS Ops, and AEP worked together to identify the various maintenance and operations tasks that would be wearable accessible. Team members reviewed all existing emergency operating procedures (EOP). The three organizations held detailed discussions with AEP’s data center personnel to help identify the procedures best suited to run electronically.
Including operators in the process ensured that their perspectives would be part of the final procedures to make it easier for them to execute correctly. Also, this early involvement led to a high level of buy-in, which would be needed for this project to succeed.

ELECTRONIC PROCEDURES

The next step was converting the procedures to electronic checklists. The transition to electronics was accomplished using ICARUS Ops Application and web-based Mission Control. The project team wanted each checklist to:

Be succinct and written with the operator in mind
Identify the key milestone in the process and use digital technology to verify critical steps
Use condition, purpose, or desired effect statements to assure the proper checklist is utilized
Identify common areas of confusion and provide links to tools such as animations and videos to clarify how and why the procedure is to be performed
Make use of Job Safety Analysis (JSA) to identify areas of significant risk prior to the required step/action. These included:
o Warnings to indicate where serious injury is possible if all procedures are not followed specifically
o Cautions indicating where significant data loss or damage is possible
o Notes to add any additional information that may be required or helpful
Condense SOP and EOP procedure items to prevent the user from being overwhelmed during critical and emergency scenarios.
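To make these requirements concrete, the sketch below models a checklist as a small data structure. It has a condition statement, steps that can carry warnings, cautions, and notes, and a verification flag on critical steps. This is an illustrative assumption about how such a checklist might be represented. It is not the ICARUS Ops data model, and every class, field, and step name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    text: str
    critical: bool = False                              # critical steps need explicit verification
    warnings: List[str] = field(default_factory=list)   # risk of serious injury
    cautions: List[str] = field(default_factory=list)   # risk of data loss or damage
    notes: List[str] = field(default_factory=list)      # additional helpful information
    verified: bool = False


@dataclass
class Checklist:
    title: str
    condition: str                                      # when this checklist applies
    steps: List[Step] = field(default_factory=list)

    def verify(self, index: int) -> None:
        """Record that the operator confirmed the step at this position."""
        self.steps[index].verified = True

    def unverified_critical_steps(self) -> List[str]:
        """Critical steps that still await verification."""
        return [s.text for s in self.steps if s.critical and not s.verified]

    def is_complete(self) -> bool:
        """Complete once every critical step has been verified."""
        return not self.unverified_critical_steps()


# Hypothetical emergency procedure with invented step text.
eop = Checklist(
    title="Loss of utility power",
    condition="Use when the utility feed is lost and generators have started",
    steps=[
        Step("Confirm generator is carrying load", critical=True,
             warnings=["Do not open live switchgear panels"]),
        Step("Log the event time", notes=["Use the site clock"]),
    ],
)
eop.verify(1)
print(eop.is_complete())   # the critical step has not been verified yet
eop.verify(0)
print(eop.is_complete())
```

Keeping warnings and cautions attached to the individual step, rather than in a separate appendix, mirrors the goal of presenting risk information immediately before the action it concerns.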

AEP SMEs walked through each and every checklist to provide an additional quality control step.

Figure 6. Individual steps include illustrations and diagrams to eliminate any confusion.

Having SOPs and EOPs in a wearable digital format allows users to access them whenever an abnormal or emergency situation occurs, so no time is lost looking for the three-ring binder and proper procedure. In critical situations and emergencies, time is crucial and delay can be costly. Hard-copy information manuals and operating handbooks are bulky and can potentially overwhelm technicians looking for specific information. The bulk and formatting can cause users to lose their place on a page and skip over important items. By contrast, the digital interface provides quick access to users and allows for quicker navigation to desired materials and information via presorted emergency and abnormal procedures.

At the conclusion of the development phase of the process (about a month), 150 of AEP’s operational procedures had been converted into digital procedures that were conveniently accessible on mobile devices, on or offline. Most importantly, a digital continuous improvement process was put in place to assure that lessons learned would be easy to incorporate into the system as part of the process.

The process of entering the procedures took about 2 weeks. First, a data entry team put them in as written. Then Compass worked with AEP and ICARUS Ops checklist experts to tighten up the language and add decision points, warnings, cautions, and notes. This group also embedded pictures and video where appropriate and decided what data needed to be captured during the process. Updates are easy. Each time a checklist or procedure is used, improvements can be approved and digitally published right away. This is critical. AEP believes that when employees know they are being heard, job satisfaction improves and they take more ownership and pride in the facility. The modifications and improvements will be continuous. The system is designed for constant improvement. Any time a technician has an idea to improve a checklist, he or she simply records the idea in the application, which instantly emails it to the SME for approval and publishing into the system. Other devices on the system sync automatically when they update.

WEARABLE TECHNOLOGY

Using wearable technology made the electronic procedures even more accessible to the end users. At the outset, AEP, ICARUS Ops, and Compass determined that AEP technicians would be able to choose between a wrist-mounted Android device and a tablet. The majority of users selected the wrist-mounted device. This preference was not unexpected, as a good portion of their work requires them to use their hands. The wrist units allow users to have both hands available to perform actual work at all times.

Figure 7. With all maintenance activities logged, a dashboard enables management to retrieve and examine reports.

Compass chose the Android technology system for several reasons. First, from the very outset, Compass envisioned making Google Glass one of the wearable options. In addition, Android allows programs maximum freedom to use the hardware and software as required to meet customer needs; it is simply less constraining than working with iOS. Android devices are also very easy to procure, and manufacturers make Class 1, Division 1 Android tablets that are usable in hazardous environments.

The software-based technology enhances AEP’s ability to track all maintenance activity in its data center. Once online, each device communicates the recorded operational information (actions performed, components replaced, etc.) to the site’s maintenance system. This ensures that all routine maintenance is performed on schedule, and at the same time provides a detailed history of the activities performed on each major component within the facility. Web-based leadership dashboards provide oversight into what is complete, due, or overdue. The system is completely secure, as it does not reside on the same server or system as any other system in the data center. It can be encrypted or resident in the data center for increased security. Typically ICARUS Ops spends at least 3 days on site training the technicians and users, depending on the number of devices. Working together in that time normally leaves the team very comfortable with all components.
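The dashboard view of what is complete, due, or overdue amounts to a simple classification of each maintenance task by its due date and completion record. The sketch below is a hypothetical illustration of that logic, not the actual ICARUS Ops dashboard. The function name, the seven-day "due" window, the extra "scheduled" state for tasks further out, and the sample tasks are all assumptions.

```python
from datetime import date, timedelta
from typing import Optional


def task_status(due: date, completed: Optional[date], today: date,
                due_window_days: int = 7) -> str:
    """Classify a maintenance task for a dashboard view."""
    if completed is not None:
        return "complete"
    if due < today:
        return "overdue"
    if due <= today + timedelta(days=due_window_days):
        return "due"
    return "scheduled"   # assumed fourth state for tasks outside the due window


# Hypothetical tasks, evaluated against a fixed date for repeatability.
today = date(2016, 10, 24)
tasks = {
    "UPS battery inspection": (date(2016, 10, 20), None),
    "Generator test run": (date(2016, 10, 28), None),
    "CRAH filter change": (date(2016, 10, 15), date(2016, 10, 14)),
}
for name, (due, completed) in tasks.items():
    print(f"{name}: {task_status(due, completed, today)}")
```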

THE NEXT PHASE: GLASS

Although innovations such as Google Glass have increased awareness of visual displays, the concept dates back to the Vietnam War, when American fighter pilots began to use augmented reality on heads-up displays (HUDs). Compass Datacenters plans to harness the same augmented reality to enable maintenance technicians to complete tasks entirely hands free using Google Glass and other advanced wearables.

It is certainly safe to assume that costs related to data center outages will continue to rise in the future. As a result, it will become more important to increase operator reliability and reduce the risk of human error. We strongly believe that professional quality procedures and wearables will make a tremendous contribution to eliminating the human error element as a major cause of interruption and lost revenue. Our vision is that other data center providers will decide to make this innovative checklist system a part of their standard offerings over the course of the next 12 to 24 months.

Jose Ruiz

Jose Ruiz serves as Compass Datacenters’ Vice President of Operations. He provides interdisciplinary leadership in support of Compass’ strategic goals. Mr. Ruiz began his tenure at Compass as the company’s director of Engineering. Prior to joining Compass he spent three years in various engineering positions and was responsible for a global range of projects at Digital Realty Trust.
