Airline Outages FAQ: How to Keep Your Company Out of the Headlines

Uptime Institute has prepared this brief airline outages FAQ to help the industry, media, and general public understand the reasons that data centers fail.

The failure of a power control module on Monday, August 8, 2016, at a Delta Air Lines data center caused hundreds of flight cancellations, inconvenienced thousands of customers, and cost the airline millions of dollars. And while equipment failure was the proximate cause of the data center failure, Uptime Institute’s long experience evaluating the design, construction, and operations of facilities suggests that many enterprise data centers are similarly vulnerable because of construction and design flaws or poor operations practices.

What happened to Delta Air Lines?

While software is blamed for many well-publicized IT problems, Delta Air Lines blames a piece of infrastructure hardware. According to the airline, “…a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power. The universal power was stabilized and power was restored quickly. But when this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.” We take Delta at its word.

Some of the first reports blamed switchgear failure or a generator fire for the outage. Later reports suggested that critical services were housed on single-corded servers or that both cords of dual-corded servers were plugged into the same feed, which would explain why backup power failed to keep some critical services online.

However, pointing to the failure of a single piece of equipment can be very misleading. Delta’s facility was designed to have redundant systems, so it should have remained operational had those systems performed as designed. In short, a design flaw, a construction error or change, or poor operations procedures set the stage for the catastrophic failure.

What does this mean?

Equipment like Delta’s power control module (more often called a UPS) should be deployed in a redundant configuration to allow for maintenance or to support IT operations in the event of a fault. However, IT demand can grow over time so that the initial redundancy is compromised and each remaining piece of equipment is overloaded when one fails or is taken offline. At this point, only Delta really knows what happened.

Organizations can compromise redundancy in this way by failing to track load growth, lacking a process to manage load growth, or making a poor business decision because of unanticipated or uncontrolled load growth.
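
To illustrate how redundancy quietly erodes (with assumed numbers, not details of Delta’s actual configuration), consider an N+1 pair of UPS modules:

    # Illustrative sketch with assumed ratings; not Delta's actual configuration.
    def redundancy_holds(unit_rating_kw, units, it_load_kw):
        """Can the remaining units carry the full IT load after one unit
        fails or is taken offline for maintenance?"""
        remaining_capacity_kw = unit_rating_kw * (units - 1)
        return it_load_kw <= remaining_capacity_kw

    # Day one: 600 kW of IT load on two 800-kW modules. Lose one module
    # and the survivor still carries the load.
    print(redundancy_holds(800, 2, 600))   # True

    # Years of untracked growth push the load to 1,000 kW. Both modules
    # together still work, but the redundancy is silently gone: lose one
    # and the survivor overloads.
    print(redundancy_holds(800, 2, 1000))  # False

Nothing fails on the day redundancy is lost, which is why the condition goes unnoticed until a fault forces the issue.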

Delta Air Lines is like many other organizations. Its IT operations are complex and expensive, change is constant, and business demands are constantly growing. As a result, IT must squeeze every cent out of its budgets while maintaining 24x7x365 operations. Inevitably these demands will expose any vulnerability in the physical infrastructure and operations.

Failures like the one at Delta can have many causes, including a single point of failure, lack of totally redundant and independent systems, or poorly documented maintenance and operations procedures.

Delta’s early report ignores key issues: why a single equipment failure could cause such damage, what could have been done to prevent it, and how Delta will respond in the future.

Delta and the media are rightly focused on the human cost of the failure. Uptime Institute is sure that Delta’s IT personnel will be trying to reconstruct the incident once operations have been restored and customer needs met. This incident will cost millions of dollars and hurt Delta’s brand; Delta will not want a recurrence. The airline has only started to calculate what changes will be required to prevent another incident and how much they will cost.

Why is Uptime Institute publishing this FAQ?

Uptime Institute has spent many years helping organizations avoid this exact scenario, and we want to share our experience. In addition, we have seen many published reports that just miss the point of how to achieve and maintain redundancy. Equipment fails all the time. Facilities that are well designed, built, and operated have systems in place to minimize data center failure. Furthermore, these systems are tested and validated regularly, with special attention to upgrades, load growth, and modernization plans.

How can organizations avoid catastrophic failures?

Organizations spend millions of dollars to build highly reliable data centers and keep them running. They don’t always achieve the desired outcome because of a failure to understand data center infrastructure.

The temptation to reduce costs in data centers is great because data centers require a great deal of energy to power and cool servers, and they require experienced and qualified staff to operate and maintain them.

Value engineering in the design process and mistakes and changes in the building process can result in vulnerabilities even in new data centers. Poor change management processes and incomplete procedures in older data centers are another cause for concern.

Uptime Institute believes that independent third-party verifications are the best way to identify the vulnerabilities in IT infrastructure. For instance, Uptime Institute consultants have Certified that more than 900 facilities worldwide meet the requirements of the Tier Standards. Certification is the only way to ensure that a new data center has been built and verified to meet business standards for reliability. In almost all these facilities, Uptime Institute consultants made recommendations to reduce risks that were unknown to the data center owner and operators.

Over time, however, even new data centers can become vulnerable. Change management procedures must help organizations control IT growth, and maintenance procedures must be updated to account for equipment and configuration changes. Third-party verifications such as Uptime Institute’s Tier Certification for Operational Sustainability and Management & Operations Stamp of Approval ensure that an organization’s procedures and processes are complete, accurate, and up-to-date. Maintaining up-to-date procedures is next to impossible without management processes that recognize that data centers change over time as demand grows, equipment changes, and business requirements change.

I’m in IT. What can I do to keep my company out of the headlines?

It all depends on your company’s appetite for risk. If IT failures would not materially affect your operations, then you can probably relax. But if your business relies on IT for mission-critical services for external or internal customers, then you should consider having Uptime Institute evaluate your data center’s management and operations procedures and then implement the recommendations that result.

If you are building a new facility or renovating an existing facility, Tier Certification will verify that your data center will meet your business requirements.

These assessments and certifications require a significant organizational commitment in time and money, but these costs are only a fraction of the time and money that Delta will be spending as the result of this one issue. In addition, insurance companies increasingly have begun to reduce premiums for companies that have their operational preparedness verified by a third party.

Published reports suggest that Delta thought it was safe from this type of outage, and that other airlines may be similarly vulnerable because of skimpy IT budgets, poor prioritization, and bad processes and procedures. All companies should undertake a top-to-bottom review of their facilities and operations before spending millions of dollars on new equipment.

Bulky Binders and Operations “Experts” Put Your Data Center at Risk

Wearable technology and digitized operating procedures ensure compliance with standardized practices and provide quick access to critical information

By Jose Ruiz

No one in the data center industry questions that data center outages are extremely costly. Numerous vendors report that the average data center outage costs hundreds of thousands of dollars, with losses in some circumstances totaling much more. Uptime Institute Network Abnormal Incident Reports data show that most downtime incidents are caused by human error, a key vulnerability in mission-critical environments. In addition, Compass Datacenters believes that many failures that enterprises attribute to equipment failure could more accurately be considered human error. However, most companies prefer to blame the hardware rather than their processes.

Figure 1. American Electric Power is one of the largest utilities in the U.S., with a service territory covering parts of 11 states.

Whether one accepts higher or lower outage costs, it is clear that reducing human error as a cause of downtime has a tremendous upside for the enterprise. As a result, data center owners dedicate a great deal of time to developing maintenance and operation procedures that are clear, effective, and easy to follow. Compass Datacenters is incorporating wearable technology in its efforts to make the procedures convenient for technicians to access and execute. During the process, Compass learned that the true value of introducing wearables into data center operations comes from standardizing the procedures. Incorporating wearable technology enhances the ability of users to follow the procedures. In addition, allowing users to choose the device most useful for the task at hand improves their comfort with the new process.

THE PROBLEM WITH TRADITIONAL OPS DOCUMENTATION

There are two major problems with the current methods of managing and documenting data center procedures.

  • Many data center operations teams document their procedures and methodologies in large three-ring binders. Although their presence is comforting, Operations staffs rarely consult them.
  • Also, organizations often rely on highly detailed written documentation presented in such depth that Operations staff wastes a great deal of time trying to locate the appropriate information.

In both instances, Operations personnel deem the written procedural guidelines inconvenient. As a result, actual data center operation tends to be more experience-based. Bluntly put, too many organizations rely too much on the most experienced or the most insistent Operations person on site at the time of an incident.

Figure 2. Compass Datacenters’ use of electronics and checklists clarifies the responsibilities of Operations personnel doing daily rounds.

On the whole, experience-based operations are not all bad. Subject matter experts (SMEs) emerge and are valued, with operations running smoothly for long periods at a time. However, as Julian Kudritzki wrote in “Examining and Learning from Complex Systems Failures” (The Uptime Institute Journal, vol. 5, p. 12), “‘Human error’ is an insufficient and misleading term. The front-line operator’s presence at the site of the incident ascribes responsibility to the operator for failure to rescue the situation. But this masks the underlying causes of an incident. It is more helpful to consider the site of the incident as a spectacle of mismanagement.” What’s more, subject matter experts may leave or get sick or simply have a bad day. In addition, most of their peers assume the SME is always right.

Figure 3.

Figure 4.

Figures 3-5. The new system provides reminders of monthly tasks, prompts personnel to take required steps, and then requires verification that the steps have been taken.

Of course, no one goes to work planning to hurt a coworker or cause an outage, but errors are made. Compass Datacenters believes its efforts to develop professional quality procedures and use wearable technology reduce the likelihood of human error. Compass’s intent is to make it easy for operators to handle both routine and foreseeable abnormal operations by standardizing where possible, improvising only when absolutely necessary, and making it simple and painless for technicians to comply with well-defined procedures.

AMERICAN ELECTRIC POWER

American Electric Power (AEP), a utility based in Columbus, OH that provides electricity to more than 5.4 million customers in 11 states, inspired Compass to bring wearable technology to the data center to help the utility reduce human error. The initiative was a first for Compass, and AEP’s collaboration was essential to the product development effort. Compass and AEP began by adopting the checklist approach used in the airline business, among many others. This approach to avoiding human error is predicated on the use of carefully crafted operational procedures. Foreseeable critical actions by the pilot and copilot are strictly prescribed to ensure that all processes are followed in the same manner, every time, no matter who is in the cockpit. They standardize where possible and improvise only when required. Airline procedural techniques also put extra focus on critical steps to eliminate error. While this professional approach to operations has been widely adopted by a host of industries, it has not yet become a standard in the data center business.

Compass and AEP partnered with ICARUS Ops to develop the electronic checklist and wearables programs. ICARUS Ops was founded by two former U.S. Navy carrier pilots, who draw on their extensive experience developing standardized aviation processes to build wearable programs across a broad spectrum of high-reliability and mission-critical industries. During the process, ICARUS Ops brought on board several highly experienced consultants, including military and commercial pilots and a former Space Shuttle mission controller. At the start of the project, Compass, ICARUS Ops, and AEP worked together to identify the various maintenance and operations tasks that would be accessible via wearables. Team members reviewed all existing emergency operating procedures (EOP). The three organizations held detailed discussions with AEP’s data center personnel to help identify the procedures best suited to run electronically. Including operators in the process ensured that their perspectives would be part of the final procedures, making it easier for them to execute correctly. Also, this early involvement led to a high level of buy-in, which would be needed for this project to succeed.

ELECTRONIC PROCEDURES

The next step was converting the procedures to electronic checklists. The transition to electronics was accomplished using ICARUS Ops Application and web-based Mission Control. The project team wanted each checklist to

  • Be succinct and written with the operator in mind
  • Identify the key milestone in the process and use digital technology to verify critical steps
  • Use condition, purpose, or desired effect statements to assure the proper checklist is utilized
  • Identify common areas of confusion and provide links to tools such as animations and videos to clarify how and why the procedure is to be performed
  • Make use of Job Safety Analysis (JSA) to identify areas of significant risk prior to the required step/action. These included:
    o Warnings to indicate where serious injury is possible if all procedures are not followed specifically
    o Cautions indicating where significant data loss or damage is possible
    o Notes to add any additional information that may be required or helpful
  • Condense SOP and EOP procedure items to prevent the user from being overwhelmed during critical and emergency scenarios.

AEP SMEs walked through each and every checklist to provide an additional quality control step.
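
To make the checklist structure concrete, here is a minimal sketch of how such a checklist might be modeled in software. The field names and the example procedure are invented for illustration; they are not taken from the ICARUS Ops product.

    # Hypothetical sketch of an electronic checklist record; field names
    # and the example are invented, not taken from the ICARUS Ops product.
    from dataclasses import dataclass, field
    from enum import Enum

    class Annotation(Enum):
        WARNING = "warning"  # serious injury possible if not followed exactly
        CAUTION = "caution"  # significant data loss or damage possible
        NOTE = "note"        # additional helpful information

    @dataclass
    class Step:
        text: str               # succinct, written for the operator
        critical: bool = False  # key milestone; verification required
        annotations: dict = field(default_factory=dict)  # Annotation -> text
        media: list = field(default_factory=list)        # animation/video links

    @dataclass
    class Checklist:
        title: str
        condition: str          # condition/purpose/desired-effect statement
        steps: list = field(default_factory=list)

    eop = Checklist(
        title="Loss of utility power",
        condition="Use on loss of utility power when generators should start",
        steps=[
            Step("Confirm generator start on the monitoring display",
                 critical=True),
            Step("Verify UPS is carrying load on battery", critical=True,
                 annotations={Annotation.CAUTION:
                              "Do not transfer load manually unless "
                              "automatic transfer has failed."}),
        ],
    )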

Figure 6. Individual steps include illustrations and diagrams to eliminate any confusion.

Having SOPs and EOPs in a wearable digital format allows users to access them whenever an abnormal or emergency situation occurs, so no time is lost looking for the three-ring binder and proper procedure. In critical situations and emergencies, time is crucial and delay can be costly. Hard-copy information manuals and operating handbooks are bulky and can potentially overwhelm technicians looking for specific information. The bulk and formatting can cause users to lose their place on a page and skip over important items. By contrast, the digital interface provides quick access to users and allows for quicker navigation to desired materials and information via presorted emergency and abnormal procedures.

At the conclusion of the development phase of the process (about a month), 150 of AEP’s operational procedures had been converted into digital procedures that were conveniently accessible on mobile devices, online or off. Most importantly, a digital continuous improvement process was put in place to assure that lessons learned would be easy to incorporate into the system.

The process of entering the procedures took about 2 weeks. First, a data entry team put them in as written. Then Compass worked with AEP and ICARUS Ops checklist experts to tighten up the language and add decision points, warnings, cautions, and notes. This group also embedded pictures and video where appropriate and decided what data needed to be captured during the process. Updates are easy: each time a checklist or procedure is used, improvements can be approved and digitally published right away. This is critical. AEP believes that when employees know they are being heard, job satisfaction improves and they take more ownership and pride in the facility. The system is designed for continuous improvement. Any time a technician has an idea to improve a checklist, he simply records the idea in the application, which instantly emails it to the SME for approval and publication into the system. Other devices on the system sync automatically when checklists are updated.
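
A minimal sketch of that record-approve-publish loop (hypothetical code; the actual ICARUS Ops implementation is not public):

    # Hypothetical sketch of the continuous-improvement loop described
    # above; names and flow are illustrative, not the ICARUS Ops product.
    from dataclasses import dataclass

    @dataclass
    class Suggestion:
        checklist_id: str
        submitted_by: str
        proposed_change: str

    class ChecklistService:
        def __init__(self):
            self.pending = []    # suggestions awaiting SME review
            self.versions = {}   # checklist_id -> published version number

        def submit(self, suggestion):
            """Technician records an idea; the SME is notified for review."""
            self.pending.append(suggestion)
            print(f"SME review requested for {suggestion.checklist_id}")

        def approve_and_publish(self, suggestion):
            """On SME approval, publish a new version; devices pick it up
            the next time they sync."""
            self.pending.remove(suggestion)
            version = self.versions.get(suggestion.checklist_id, 1) + 1
            self.versions[suggestion.checklist_id] = version
            return version

    svc = ChecklistService()
    idea = Suggestion("EOP-07", "tech-42", "Add photo of the transfer switch")
    svc.submit(idea)
    svc.approve_and_publish(idea)  # devices sync to version 2 automatically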

WEARABLE TECHNOLOGY

Using wearable technology made the electronic procedures even more accessible to the end users. At the outset, AEP, ICARUS Ops, and Compass determined that AEP technicians would be able to choose between a wrist-mounted Android device and a tablet. The majority of users selected the wrist-mounted device. This preference was not unexpected, as a good portion of their work requires them to use their hands. The wrist units allow users to have both hands available to perform actual work at all times.

Figure 7. With all maintenance activities logged, a dashboard enables management to retrieve and examine reports.

Compass chose the Android technology system for several reasons. First, from the very outset, Compass envisioned making Google Glass one of the wearable options. In addition, Android allows programs maximum freedom to use the hardware and software as required to meet customer needs; it is simply less constraining than working with iOS. Android devices are also very easy to procure, and manufacturers make Class 1, Division 1 Android tablets that are usable in hazardous environments.

The software-based technology enhances AEP’s ability to track all maintenance activity in its data center. Once online, each device communicates the recorded operational information (actions performed, components replaced, etc.) to the site’s maintenance system. This ensures that all routine maintenance is performed on schedule, and at the same time provides a detailed history of the activities performed on each major component within the facility. Web-based leadership dashboards provide oversight into what is complete, due, or overdue. The system is completely secure, as it does not reside on the same server or system as any other system in the data center. It can be encrypted or resident in the data center for increased security. Typically ICARUS Ops spends at least 3 days on site training the technicians and users, depending on the number of devices. Working together in that time normally leaves the team very comfortable with all components.

THE NEXT PHASE: GLASS

Although innovations such as Google Glass have increased awareness of visual displays, the concept dates back to the Vietnam War, when American fighter pilots began to use augmented reality on heads-up displays (HUDs). Compass Datacenters plans to harness the same augmented reality to enable maintenance technicians to complete a task entirely hands free using Google Glass and other advanced wearables.

It is certainly safe to assume that costs related to data center outages will continue to rise in the future. As a result, it will become more important to increase operator reliability and reduce the risk of human error. We strongly believe that professional quality procedures and wearables will make a tremendous contribution to eliminating the human error element as a major cause of interruption and lost revenue. Our vision is that other data center providers will decide to make this innovative checklist system a part of their standard offerings over the course of the next 12 to 24 months.


Jose Ruiz

Jose Ruiz serves as Compass Datacenters’ Vice President of Operations. He provides interdisciplinary leadership in support of Compass’ strategic goals. Mr. Ruiz began his tenure at Compass as the company’s director of Engineering. Prior to joining Compass he spent three years in various engineering positions and was responsible for a global range of projects at Digital Realty Trust.

Tier III Certified Facilities Prove Critical to Norwegian Colo’s Client Appeal

Green Mountain melds sustainability, reliability, competitive pricing, and independent certifications to attract international colo customers

By Kevin Heslin

Green Mountain operates two unique colo facilities in Norway, with a total potential capacity of several hundred megawatts. Though each facility has its own strengths, both embody the company’s commitment to providing secure, high-quality service in an energy efficient and sustainable manner. CEO Knut Molaug and Chief Sales Officer Petter Tømmeraas recently took time to explain how Green Mountain views the relationship between cost, quality, and sustainability.


Tell our readers about Green Mountain.

KM: Green Mountain focuses on the high-end data center market, including banking/finance, oil and gas, and other industries requiring high availability and high quality services.

PT: IT and cloud are also very big customer segments. We think the US and European markets are the biggest for us, but we also see some Asian companies moving into Europe that are really keen on having high-quality data centers in the area.

KM: Green Mountain Data Centers operates two data centers in Norway. Data Center 1 in Stavanger began operation in 2013 and is located in a former underground NATO ammunition storage facility inside a mountain on the west coast. Data Center 2, a more traditional facility, is located in Telemark, which is in the middle of Norway.

Today DC1-Stavanger is a high security colocation data center housing 13,600 square meters (m²) of customer space. The infrastructure can support up to 26 megawatts of IT load today. The main data center comprises three two-story concrete buildings built inside the mountain, with power densities ranging from 2 to 6 kW/m², but the facility can support up to 20 kW/m². NATO put a lot of money into creating their facilities inside the mountain, which probably saved us 1 billion kroner (US$150 million).

DC2-Telemark is located in a historic region of Norway and was built on a brownfield site with a 10-MW supply initially available. The first phase is a fully operational 10-MW Tier III Certified Facility, with four new buildings and up to 25 MW total capacity planned. This site could support even larger facilities if the need arises.

Green Mountain focuses a lot on being green and environmentally friendly, so we use 100% renewable energy in both data centers.

How do the unique features of the data centers affect their performance?

KM: Besides being located in a mountain, DC1 has a unique cooling system. We use the fjords for cooling year round, which gives us 8°C (46°F) water for cooling. The cooling solution (including the cooling station, chilled water pipework, and pumps) is fully duplicated, providing an N+N solution. Because there are few moving parts (circulating pumps), the solution is extremely robust and reliable. In-row cooling is installed to client specification using hot aisle technology.

We use only 1 kilowatt of power to produce 100 kilowatts of cooling. So the data center is extremely energy efficient. In addition, we are connected to three independent power supplies, so DC1 has extremely robust power.

DC2 probably has the most robust power supply in Europe. We have five independent hydropower plants within a few kilometers of the site, and the two closest are just a few hundred meters away.

How do you define high quality?

PT: High quality means Uptime Institute Tier Certification. We are not only saying we have very good data centers. We’ve gone through a lot of testing so we are able to back it up, and the Uptime Institute Tier Standard is the only standard worldwide that certifies data center infrastructure to a certain quality. We’re really strong on certifications because we don’t only want to tell our customers that we have good quality, we want to prove it. Plus we want the kinds of customers who demand proof. As a result, both our facilities are Tier III Certified.

Please talk about the factors that went into deciding to obtain Tier Certification.

KM: We have focused on high-end clients that require 100% uptime and are running high-availability solutions. Operations for this type of company generally require documented infrastructure.

The Tier III term is used a lot, but most companies can’t back it up. Having been through testing ourselves, we know that most companies that haven’t been certified don’t have a Tier III facility, no matter what they claim. When we talk to important clients, they see that as well.

What was the on-site experience like?

PT: When the Uptime Institute team was on site, we could tell that Certification was a quality process with quality people who knew what they were doing. Certification also helped us document our processes because of all the testing routines and scenarios. As a result, we know we have processes and procedures for all the thinkable and unthinkable scenarios, which would have been hard to achieve without this process.

Why do you call these data centers green?

KM: First of all, we use only renewable energy. Of course that is easy in Norway because all the power is renewable. In addition, we use very little of it, with the fjords as a cooling medium. We also built the data centers using the most efficient equipment, even though we often paid more for it.

PT: Green Mountain is committed to operating in a sustainable way, and this is reflected in everything we do. The good thing about operating this way is that our customers benefit financially. Because we bill power cost based on consumption, the more energy efficiently we operate, the smaller the bill to our customers. When we tell these companies that they can even save money by going with our sustainable solutions, their decision becomes easier.

More and more customers require that their new data center solutions are sustainable, but we still see that price is a key driver for most major customers. The combination of having very sustainable solutions and being very competitive on price is the best way of driving sustainability further into the mind of our customers.

All our clients reduce their carbon footprint when they move into our data centers and stop using their old and inefficient data centers.

We have a few major financial customers that have put forward very strict targets with regard to sustainability and that have found us to be the supplier that best meets these requirements.

KM: And, of course, making use of an already built facility was also part of the green strategy.

How does your cost structure help you win clients?

PT: It’s important, but it’s not the only factor. Security and the quality we can offer matter just as much, and being able to offer them at competitive pricing is what counts.

Were there clients who were attracted to your green strategy?

PT: Several of them, but the decision is rarely down to a single factor. We offer a combination of a really, really competitive offering and a high quality level. We are a really, really sustainable and green solution. To be able to offer that at a competitive price is quite unique, because people often think they have to pay more to get a sustainable, green solution.

Are any of your targeted segments more attracted to sustainability solutions?

PT: Several of the international system integrators really like the combination. They want a sustainable solution, but they want the competitive offering. When they get both, it’s a no-brainer for them.

How does your sustainability/energy efficiency program affect your reliability? Do potential clients have any concerns about this? Do any require sustainability?

PT: Our programs do not affect our reliability in any way. We have chosen only to implement solutions that do not harm our ability to deliver the quality we promise to our customers. We have never experienced one second of SLA breakage on any customer in any of our data centers. In fact, some of our most sustainable solutions, like the cooling system based on cold sea water, actually increase our reliability, since they considerably reduce the risk of failure compared to regular cooling systems. We have not experienced any concerns about these solutions.

Has Tier Certification proven critical in any of your client’s decisions?

PT: Tier Certification has proved critical in many of our clients’ decisions to move to Green Mountain. We see a shift in the market toward requiring Tier Certification, whereas it used to be more in the form of asking for Tier compliance, which anyone could claim without having to prove it. We think the future of quality data center providers will be to certify all their data centers.

Any customer with mission critical data should require their supplier(s) to be Tier Certified. At the moment this is the only way for a customer to ensure that their data center is built and operated in the way it should be in order to secure the quality that the customer needs.

Are there other factors that set you apart?

PT: Operational excellence. We have an operational team that excels every time. They consistently deliver a lot more than customers expect, and we have customers that are extremely happy with their deliveries from us. I hear that from customers all the time, and that’s mainly because our operations team does a phenomenal job.

Uptime Institute testing criteria were very comprehensive and helped us develop our operational procedures to an even higher level, as some of the scenarios created during the Certification testing were used as a basis for new operational procedures and new tests that we now perform as part of our normal operating procedures.

Green Mountain definitely benefitted from the Tier process in a number of other ways, including training that gave us useful input to improve our own management and operational procedures.

What did you do to develop this team?

KM: When we decided to focus on high-end clients, we knew that we needed high-end experience, expertise, and knowledge on the ops side, so we focused on that when recruiting, as well as on building a culture inside the company that focused on delivering high quality the first time, every time.

We recruited people with knowledge of how to operate critical environments, and we tasked them with developing those procedures and operational elements as a part of their efforts, and they have successfully done so.

PT: And the owners made the resources available, both financial and staff hours, to create the quality we wanted. We also have a very good management system, so management has good knowledge of what’s happening; if we have an issue, it will be very visible.

KM: We also have high-end equipment and tools to measure and monitor everything inside the data center as well as operational tools to make sure we can handle any issue and deliver on our promises.


Kevin Heslin

Kevin Heslin is Chief Editor and Director of Ancillary Projects at Uptime Institute. In these roles, he supports Uptime Institute communications and education efforts. Previously, he served as an editor at BNP Media, where he founded Mission Critical, a commercial publication dedicated to data center and backup power professionals. He also served as editor at New York Construction News and CEE and was the editor of LD+A and JIES at the IESNA. In addition, Heslin served as communications manager at the Lighting Research Center of Rensselaer Polytechnic Institute. He earned a B.A. in Journalism from Fordham University in 1981 and a B.S. in Technical Communications from Rensselaer Polytechnic Institute in 2000.

If You Can’t Buy Effective DCIM, Build It

After commercial DCIM offerings failed to meet RELX Group’s requirements, the team built its own DCIM tool based on the existing IT Services Management Suite

By Stephanie Singer

What is DCIM? Most people might respond “data center infrastructure management.” However, simply defining the acronym is not enough. The meaning of DCIM is greatly different for each person and each enterprise.

At RELX Group and the data centers managed by the Reed Elsevier Technology Services (RETS) team, DCIM is built on top of our configuration management database (CMDB) and provides seamless, automated interaction between the IT and facility assets in our data centers, server rooms, and telco rooms. These assets are mapped to the floor plans and electrical/mechanical support systems within these locations. This arrangement gives the organization the ability to manage its data centers and internal customers in one central location. Additionally, DCIM components are fully integrated with the company’s global change and incident management processes for holistic management and accuracy.
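
A rough sketch of that arrangement (illustrative Python with invented field names, not RETS’s actual schema): each IT or facility asset becomes a configuration item that carries its floor-plan location, its support relationships, and its links into change and incident management.

    # Illustrative sketch only; field names are invented, not RETS's schema.
    from dataclasses import dataclass, field

    @dataclass
    class ConfigurationItem:
        ci_id: str
        ci_type: str    # "server", "UPS", "PDU", "CRAC", ...
        location: tuple # (site, room, floor-plan grid square)
        fed_by: list = field(default_factory=list)          # upstream power/cooling CIs
        open_changes: list = field(default_factory=list)    # change-request IDs
        open_incidents: list = field(default_factory=list)  # incident IDs

    server = ConfigurationItem(
        ci_id="SRV-0042",
        ci_type="server",
        location=("DC-East", "Data Hall 1", "AJ-14"),
        fed_by=["PDU-2A", "PDU-2B"],  # dual-corded to two PDUs
    )

Because facility equipment is modeled the same way as IT gear, a single query can answer questions that previously spanned several spreadsheets.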

A LITTLE HISTORY

So what does this all mean? Like many of our colleagues, RETS has consulted with several DCIM providers over the years. Many of them enthusiastically promised a solution for our every need, creating excitement at the prospect of a state-of-the-art system for a reasonable cost. But, as we all know, the devil is in the details.

I grew up in this space and have personally been on this road since the early 1980s. Those of you who have been on the same journey will likely remember using the Hitachi Tiger tablet with AutoCAD to create equipment templates. We called it “hardware space planning” at the time, and it was a huge improvement over the template cutouts from IBM, which we had to move around manually on an E-size drawing.

Many things have changed over the last 30 years, including the role of vendors in our operation. Utilizing DCIM vendors to help achieve infrastructure goals has become a common practice in many companies, with varying degrees of success. This has most certainly been a topic at every Uptime Institute Symposium as peers shared what they were doing and sought better go-forward approaches and planning.

OUR FIRST RUN AT PROCURING DCIM

Oh how I coveted the DCIM system my friend was using for his data center portfolio at a large North America-based retail organization. He and his team helpfully provided real world demos and resource estimates to help us construct the business case that was approved, including the contract resources for a complete DCIM installation. We were able to implement the power linking for our assets in all our locations. Unfortunately, the resource dollars did not enable us to achieve full implementation with the cable management. I suspect many of you reading this had similar experiences.

Nevertheless, we had the basic core functionality and a DCIM system that met our basic needs. We ignored inefficiencies, even though it took too many clicks to accomplish the tasks at hand and we still had various groups tracking the same information in different spreadsheets and formats across the organization.

About 4-1/2 years after implementation, our vendor reported the system we were using would soon be at end of life. Of course, they said, “We have a new, bigger, better system. Look at all the additional features and integration you can have to expand and truly have everything you need to run your data centers, manage your capacities, and drive costs out.”

Digging deeper, we found that driving cost out was not truly obtainable when we balanced the costs of fully collating the data, building the integration with other systems and processes, and maintaining the data and scripting that was required. With competing priorities and the drive for cost efficiencies, we went back to the drawing board and opened the DCIM search to other providers once again.

STARTING OVER WITH DCIM MARKET RESEARCH

The DCIM providers we looked at had all the physical attributes tied to the floor and rack space, cable management, asset life cycle, capacity planning, and various levels for the electrical/mechanical infrastructure. These tools all integrated with power quality and building management and automation systems, each varying slightly in their approach and data output. And many vendors offered bi-directional data movement from the service management tool suite and CMDB.

But our findings revealed potential problems such as duplicate and out-of-sync data. This was unacceptable. We wanted all the features our DCIM providers promised without suffering poor data quality. We also wanted the DCIM to fully integrate with our change and incident management systems so we could look holistically at potential root causes of errors. We wanted to see where the servers were located across the data center, if they were in alarm, when maintenance was occurring, and whether a problem was resolved. We wanted the configuration item attributes for maintenance, end of life, contract renewals, procedures, troubleshooting guidelines, equipment histories, business ownerships, and relationships to be 100% mapped globally.

RETS has always categorized data center facilities as part of IT, separate from Corporate Real Estate. Therefore, all mechanical and electrical equipment within our data centers and server rooms is tracked as configuration items (CI) as well. This includes generators, switchgear, uninterruptible power systems (UPS), power distribution units (PDU), remote power panels (RPP), breakers, computer room air conditioners (CRAC), and chillers. Breaking away from the multiple sources we had been using for different facility purposes greatly improved our overall grasp on how our facility team could better manage our data centers and server rooms.

NEW PARADIGM

This realization caused us to ask ourselves: What if we flipped the way DCIM is constructed? What if it is built on top of the Service Management Suite, so it is truly a full system that integrates floor plans, racks, and power distribution with the assets within the CMDB? Starting with this thought, we aggressively moved forward to completely map the DCIM system we currently had in place and customize our existing Service Management Suite.

We started this journey in October 2014 and followed a Scrum software development process. Scrum is an iterative and incremental agile software development methodology. The 2-week sprints, each delivering useful functionality, and the constant feedback were keys to this approach, making it easy to adapt quickly as our understanding changed.

Another important part of Scrum is to have a complete team set up with a subject matter expert (SME) to serve as the product owner to manage the backlog of features. Other team members included the Service Management Suite tool expert to design forms and tables, a user interface (UI) expert to design the visualization, and a Scrum master to manage the process. A web service expert joined the team to port the data from the existing DCIM system into the CMDB. All these steps were critical; however, co-locating the team with a knowledgeable product owner to ensure immediate answers and direction to questions really got us off and running!

We created our wish list of requirements, prioritizing those that enabled us to move away from our existing DCIM system.

Interestingly enough, we soon learned from our vendor that the end of life for our current 4-1/2 year-old DCIM system would be extended because of the number of customers that remained on that system. Sound familiar? The key was to paint the vision of what we wanted and needed it to be, while pulling a creative and innovative group of people together to build a roadmap for how we were going to get there.

It was easy to stay focused on our goals. We avoided scope creep by aligning our requirements with the capabilities that our existing tool provided. The requirements and capabilities that aligned were in scope. Those that didn’t were put on a list for future enhancements. Clear and concise direction!

The most amazing part was that using our Service Management Suite provided many advantages. We linked all the configuration data and gained confidence in its accuracy. This created excitement across the team, and the “wish” list of feature requests grew immensely! In addition to working on documented requests, our creative and agile team came back with several ideas for features we had not initially contemplated, but that made great business sense. Interestingly enough, many of these items were achieved so easily that by the time we went live, we had a significantly advanced tool built on the automation features we leveraged.

FEATURES

Today we can pull information by drilling down on a floor plan to a device, which enables us to track the business owner, equipment relationships, application-to-infrastructure mapping, and application dependencies. This information allows us to really understand the impacts of adding, moving, modifying, or decommissioning within seconds. It provides real-time views for our business partners when power changes occur, when maintenance is scheduled, and when a device alarm is in effect (see Figure 1).

Figure 1. DCIM visualization

The ability to tie in CIs to all work scheduled through our Service Management Suite change requests and incidents provides a global outlook on what is being performed within each data center, server room, and telco room, and guarantees accuracy and currency. We turned all electrical and mechanical devices into CIs and assigned them physical locations on our floor plans (see Figure 2).

Figure 2. Incident reporting affected CI

Facility work requests are incorporated into the server install and decommissioning workflow. Furthermore, auto discovery alerts us to devices that are new to the facility, so we are aware if something was installed outside of process.

If an employee needs to add a new model to the floor plan, we have a facility management form for that process, where new device attributes can be specified and created by users within seconds.

Facility groups can modify floor plans directly from the visualization, providing dynamic updates to all users; Operations can monitor alarming and notifications 24×7 for any CI; and IT teams can view rack elevations for any server rack or storage array (see Figure 3). Power management, warranty tracking, maintenance, hardware dependencies, procedures, equipment relationships, contract management, and equipment histories can all be actively maintained within one system.

Figure 3. Front and rear rack elevation

Speaking of power management, our agile team was able to create an exact replica of our electrical panel schedules inside the DCIM without losing any of the functionality we had in Excel. This included the ability to update current power draw for each individual breaker, redundant failover calculation, alarming, and relationship creation from PDU to breaker to floor-mount device.
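
As a toy version of the redundant-failover arithmetic such a panel schedule performs (illustrative code, not the RETS implementation), the check for an A/B breaker pair feeding dual-corded equipment is whether one breaker could safely absorb both loads:

    # Illustrative sketch, not the RETS panel-schedule implementation.
    def failover_ok(load_a_amps, load_b_amps, breaker_rating_amps,
                    max_continuous_fraction=0.8):
        """For an A/B pair feeding dual-corded gear: if one side fails,
        the surviving breaker carries both loads. Common NEC-style practice
        (assumed here) keeps continuous load at or below 80% of rating."""
        combined = load_a_amps + load_b_amps
        return combined <= breaker_rating_amps * max_continuous_fraction

    # 20-A branch breakers, each side currently drawing 7 A:
    print(failover_ok(7, 7, 20))   # True  -- 14 A is within the 16-A limit
    # Load creep pushes each side to 9 A:
    print(failover_ok(9, 9, 20))   # False -- 18 A would overload on failover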

Oh, and by the way, iPad capability is here… Technicians can update information as work is being performed on the floor, allowing Operations to know when a change request is actually in process and what progress has been made. And 100% automation is in full effect here! Technicians can also bring up equipment procedures to follow along the way, as these are tied to CIs using the Service Management Suite knowledge articles.

Our Service Management Suite is fully integrated with Active Directory, in that we can associate personnel with the individual site locations they manage. Self-service forms are also in place where users can add, modify, or delete any vendor information for specific sites.

The Service Management Suite has already been integrated with our Real Estate management tool to incorporate remote floor plans, site contacts, and resource usage for each individual location. The ability to pull power consumption per device at remote sites is also standardized, based on actual or estimated values, to assist with consolidation efforts.

The self-service items include automated life cycle forms that allow users to actively track equipment adds, modifications, and removals, while also providing the functionality to correlate CIs together in order to form relationships from generator to UPS to PDU to breaker to rack to power strip to rack-mount device (see Figure 4).

Figure 4. Self-service facility CI management form
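
To make that relationship chain concrete, here is a hypothetical sketch (not RETS code) of walking power-path relationships to answer a question like: what loses power if this PDU is de-energized?

    # Hypothetical sketch of power-path impact analysis; not RETS code.
    from collections import deque

    # Each CI maps to the CIs it feeds, mirroring the generator -> UPS ->
    # PDU -> breaker -> rack -> power strip -> device chain in the CMDB.
    feeds = {
        "GEN-1":      ["UPS-1"],
        "UPS-1":      ["PDU-2A"],
        "PDU-2A":     ["BRKR-2A-12"],
        "BRKR-2A-12": ["RACK-14"],
        "RACK-14":    ["STRIP-14L"],
        "STRIP-14L":  ["SRV-0042"],
    }

    def downstream_impact(ci, feeds):
        """Breadth-first walk of everything fed (directly or indirectly)
        by the given CI -- the set affected if it is de-energized."""
        impacted, queue = set(), deque([ci])
        while queue:
            for child in feeds.get(queue.popleft(), []):
                if child not in impacted:
                    impacted.add(child)
                    queue.append(child)
        return impacted

    print(downstream_impact("PDU-2A", feeds))
    # {'BRKR-2A-12', 'RACK-14', 'STRIP-14L', 'SRV-0042'} (order may vary)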

Functionality for report creation on any CI and associated relationships has been implemented for all users. Need to determine where to place a server? There’s a report that can be run for that as well!

The streamlined processes allow users to maintain facility and hardware CIs with ease, and truly provide a 100% grasp on the activity occurring within our data centers on a daily basis.

I am quite proud of the small, but powerful, team that went on this adventure with us. As the leader of this initiative, it was refreshing to see the idea start with a manager. He worked with a key developer to build workflows and from there turned DCIM on its head.

We pulled the hardware, network, and facility teams together with five amazing part-time developers for the “what if” brainstorm session and the enthusiasm continued to explode. It was truly amazing to observe this team. Within 30 days, we had a prototype that was shared with senior management and stakeholders who fully supported the effort and the rest is history!

It’s important to note that, for RETS, we have a far superior tool set for a fraction of the cost of other DCIM tools. With it embedded in our Service Management Suite, we avoid additional maintenance, software, and vendor services costs… win, win, win!

Our DCIM is forever evolving: we have so far surpassed the requirements we originally set that we thought, “Why stop now?” Continuing our journey will bring service impact reports and alarming, incorporate our existing power monitoring application, and build an automation system, which will enhance our ability to include remote location CIs managed by us. With all the advances we are able to make using our own system, I am looking forward to more productivity than ever before, and more than we can imagine right now!


Stephanie Singer joined Reed Elsevier, now known as RELX Group, in 1980 and has worked for Mead Data Central, LexisNexis, and Reed Elsevier–Technology Services during her career. She is currently the vice president of Global Data Center Services. In this role, she is responsible for global data center and server room facilities, networks, and cable plant infrastructure for the products and applications within these locations. She leads major infrastructure transformation efforts. Ms. Singer has led the data center facilities team since 1990, maintaining an excellent data center availability record throughout day-to-day operations and numerous lifecycle upgrades to the mechanical and electrical systems. She also led the construction and implementation of a strategic backup facility to provide in-house disaster recovery capability.

Top 10 Considerations for Enterprises Progressing to Cloud

Industry data from Uptime Institute and 451 Research show a rapid rate of cloud computing adoption by enterprise IT departments. Organizations weigh cloud benefits and risks, and also evaluate how cloud will impact their existing and future data center infrastructure investment. In this video, Uptime Institute COO Julian Kudritzki and Andrew Reichman, Research Director at 451 Research, discuss the various aspects of how much risk, and how much reward, is on the table for companies considering a cloud transition.

While some organizations are making a “Tear down this data center” wholesale move to cloud computing, the vast majority of cloud adopters are getting there on a workload-by-workload basis–carefully evaluating their portfolio of workloads and applications and identifying the best cloud or non-cloud venue to host each.

The decision process is based on multiple considerations, including performance, integration issues, economics, competitive differentiation, solution maturity, risk tolerance, regulatory compliance considerations, skills availability, and partner landscape.

Some of the most important of these considerations when deciding whether to put a workload or application in the cloud include:

1. Know which applications impact competitive advantage, and which don’t.
You might be able to increase competitive advantage by operating critical differentiating applications better than peer companies do, but most organizations will agree that back office functions like email or payroll, while important, don’t really set a company apart. As mature SaaS (software as a service) options for these mundane, but important, business functions have emerged, many companies have decided that a cloud application that delivers a credible solution at a fair cost is good enough. Offloading the effort for these workloads can free up valuable time and effort for customization, optimization, and innovation around applications that drive real competitive differentiation.

2. Workloads with highly variable demand see the biggest benefit from cloud.
Public cloud was born to address big swings in demand seen in the retail world. If you need thousands of servers around the Christmas shopping spree, public cloud IaaS (infrastructure as a service) makes them available when you need them and lets you return them after the New Year when demand settles down. Any workload with highly variable demand can see obvious benefits from running in cloud, so long as the architecture supports load balancing and changing the number of servers working on the job, known as scale-out design. Knowing which applications fit this bill, and what changes could be made to cloud-enable other applications, will help to identify those that would see the biggest economic benefit from a move to cloud.
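
A stylized version of that economic comparison, using invented demand figures and pricing purely to show the shape of the calculation:

    # Stylized comparison with invented numbers; real pricing varies widely.
    hourly_demand = [10] * 20 + [40] * 2 + [10] * 2  # servers needed per hour,
                                                     # one day with a 2-hour spike
    price_per_server_hour = 0.10                     # assumed rate, USD

    # Owned gear must be sized for the peak, and you pay for it every hour:
    fixed_fleet = max(hourly_demand)                 # 40 servers
    fixed_cost = fixed_fleet * len(hourly_demand) * price_per_server_hour

    # Scale-out in cloud pays only for the servers needed each hour:
    elastic_cost = sum(hourly_demand) * price_per_server_hour

    print(f"provisioned for peak: ${fixed_cost:.2f}")    # 40 * 24 * 0.10 = $96.00
    print(f"elastic scale-out:    ${elastic_cost:.2f}")  # 300 * 0.10    = $30.00

The flatter the demand curve, the smaller that gap becomes, which is why steady workloads often pencil out better on owned infrastructure.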

3. Cloud supports trial and error without penalty.
Cloud gives users an off switch for IT resources, allowing easy changes to application architecture. Find out that a different server type or chipset or feature is needed after the initial build-out? No problem, just move it to something else and see how that works. The flexibility of cloud resources lends itself very well to experimentation in finding the perfect fit. If you’re looking for a home for an application that’s been running well for years, you might find that keeping it on premises will be cheaper and less disruptive than moving it to the cloud.

4. Big looming investments can present opportunities for change.
Cloud can function as an effective investment avoidance strategy when organizations face big bills for activities like data center build-out or expansion, hardware refresh, software upgrade, staff expansion, or outsourcing contract renewal. When looking at big upcoming expenditures, it’s a great time to look at how offloading selected applications to cloud might reduce or eliminate the need for that spend. Once the investments are made, they become sunk costs and will likely make the business case for cloud transition much less attractive.

5. Consider whether customization is required or if good enough is good enough.
Is this an application that isn’t too picky in terms of server architecture and advanced features, or is it an app that requires specific custom hardware architectures or configurations to run well? If you have clearly understood requirements that you must keep in place, a cloud provider might not give you exactly what you need. On the other hand, if your internal IT organization struggles to keep pace with the latest and greatest, or if your team is still determining the optimal configuration, cloud could give you more flexibility to experiment with a wider range of options than you have access to internally, given that you operate at a smaller scale than the mega-scale cloud providers.

6. Conversion to cloud native architectures can be difficult but rewarding in the long run.
One benefit of cloud is renting instead of buying, with advantages in terms of scaling up and down at will and letting a service provider do the work. A separate benefit comes from the use of cloud native architectures. Taking advantage of things like API-controlled infrastructure, object storage, micro-services, and serverless computing requires switching to cloud-friendly applications or modifying legacy apps to use cloud design principles. If you have plans to switch or modify applications anyway, think about whether you would be better served running these applications in house or if it would make sense to use that inflection point to move to something hosted by a third party. If your organization runs mostly traditional apps and has no intention of taking on major projects to cloud-enable them, you should know that the options and benefits of forklifting them unchanged to cloud will be limited.

7. Be honest about what your company is good at and what it is not.
If cloud promises organizations the ability to get out of the business of mundane activities such as racking and stacking gear, refreshing and updating systems, and managing facilities, it’s important to start the journey with a clear and honest assessment of what your company does well and what it does not do well. If you have long-standing processes to manage efficiency, reliability, and security, have relatively new facilities and equipment, and an IT staff that is good at keeping it all running, then cloud might not offer much benefit. If there are areas where things don’t go so smoothly or you struggle to get everything done with existing headcount, cloud could be a good way to get better results without taking on more effort or learning major new skills. On the other hand, managing a cloud environment requires its own specialized skills, which can make it hard for an unfamiliar organization to jump in and realize benefits.

8. Regulatory issues have a significant bearing on the cloud decision.
Designing infrastructure solutions that meet regulations can be tremendously complicated. It’s good to know upfront if you face regulations that explicitly prevent public cloud usage for certain activities, or if your legal department interprets those regulations as such, before wasting time evaluating non-starter solutions. That said, regulations generally impact some workloads and not others, and in many cases, are specific to certain aspects of workloads such as payment or customer or patient identity information. A hybrid architecture might allow sensitive data to be kept in private venues, while less sensitive information might be fine in public cloud. Consider that after a public cloud solution has seen wide acceptance for a regulated workload, there may be more certainty that that solution is compliant.

9. Geography can be a limiting factor or a driver for cloud usage.
If you face regulations around data sovereignty and your data has to be physically stored in Poland, Portugal, or Panama (or anywhere else on the globe), the footprint of a given cloud provider could be a non-starter. On the flip side, the big cloud providers likely already operate in far more geographies than your enterprise does. This means that if you need multiple sites for redundancy and reliability, require content delivery network (CDN) capabilities to reach customers in a wide variety of locations, or want to expand into new regions, the major cloud providers can extend your geographic reach without major capital investments.
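
Where residency matters, location has to be pinned explicitly rather than left to provider defaults. As a hedged example, assuming AWS S3 via boto3, a bucket’s region is fixed at creation time; the region and bucket name here are illustrative only.

    import boto3

    REGION = "eu-central-1"  # illustrative region chosen for data residency

    s3 = boto3.client("s3", region_name=REGION)

    # Outside of us-east-1, the bucket's location must be stated explicitly,
    # which pins where the objects physically live.
    s3.create_bucket(
        Bucket="example-sovereign-data",  # hypothetical bucket name
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

Verifying that replication, backups, and support tooling also stay in-region is a separate exercise; a call like this pins only the primary copy.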

10. The unfamiliar may only seem less secure.
Public cloud can go either way in terms of being more or less secure than what’s provisioned in an enterprise data center. Internal infrastructure has more mature tools and processes associated with it and enjoys wider familiarity among enterprise administrators; the transparency that comes with owning and operating facilities and gear lets organizations get predictable results and take a certain comfort in sheer proximity. That said, cloud service providers operate at far greater scale than individual enterprises, and therefore employ more security experts and gain more experience from addressing more threats than any single company does. Also, building for shared tenancy can drive service providers to lock down data across the board, whereas enterprises may have carried over vulnerabilities from older configurations that become issues as the user base or feature set of a workload changes. Either way, a thorough assessment of vulnerabilities in enterprise and service provider facilities, infrastructure, and applications is critical to determine whether cloud is a good or bad option for you.


Andrew Reichman is a Research Director for cloud data within the 451 Research Voice of the Enterprise team. In this role, he designs and interprets quarterly surveys that explore cloud adoption in the overall enterprise technology market.

Prior to this role, he worked at Amazon Web Services, leading their efforts in marketing infrastructure as a service (IaaS) to enterprise firms worldwide. Before this, he spent six years as a Principal Analyst at Forrester Research, covering storage, cloud, and data center economics. Prior to his time at Forrester, Andrew was a consultant with Accenture, optimizing data center environments on behalf of EMC. He holds an MBA in finance from the Foster School of Business at the University of Washington and a BA in History from Wesleyan University.

Open19 Is About Improving Hardware Choices, Standardizing Deployment

In July 2016, Yuval Bachar, principal engineer, Global Infrastructure Architecture and Strategy at LinkedIn, announced Open19, a project spearheaded by the social networking service to develop a new specification for server hardware based on a common form factor. The project aims to standardize the physical characteristics of IT equipment, with the goal of cutting costs and installation headaches without restricting choice or innovation in the IT equipment itself.

Uptime Institute’s Matt Stansberry discussed the new project with Mr. Bachar following the announcement. The following is an excerpt of that conversation:

Please tell us why LinkedIn launched the Open19 Project.
We started Open19 with the goal that any IT equipment we deploy would be able to be installed in any location, such as a colocation facility, one of our owned sites, or a point of presence (POP), in a standardized 19-inch rack environment. Standardizing on this form factor significantly reduces the cost of integration.

We aren’t the kind of organization that has one type of workload or one type of server. Data centers are dynamic, with apps evolving on a weekly basis. New demands for apps and solutions require different technologies. Open19 provides the opportunity to mix and match server hardware very easily, without the need to change any of the mechanical or installation aspects.

Different technologies evolve at a different pace. We’re always trying a variety of different servers, from low-power multicore machines to very high-end, high-performance hardware. In an Open19 configuration, you can mix and match this equipment in any configuration you want.

In the past, if you had a chassis full of blade servers, they were super-expensive, heavy, and difficult to handle. With Open19, you would build a standardized chassis into your cage and fit servers from various suppliers into the same slot. This provides a huge advantage from a procurement perspective. If I want to replace one blade today, my only option is to buy from a single server supplier. With Open19, I can buy a blade from anybody that complies with the specification. I can have five or six proposals for a procurement instead of just one.

Also, by being able to blend high-performance, high-power equipment with low-power equipment in two adjacent slots of the brick cage in the same rack, you can create a balanced environment in the data center from a cooling and power perspective. It helps avoid hot spots.
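
As a back-of-the-envelope illustration of that balancing effect (the wattage figures are invented), compare the per-zone load when similar bricks are clustered together versus interleaved:

    HIGH, LOW = 500, 150  # watts per brick, illustrative only

    clustered = [HIGH, HIGH, HIGH, HIGH, LOW, LOW, LOW, LOW]
    interleaved = [HIGH, LOW, HIGH, LOW, HIGH, LOW, HIGH, LOW]

    def zone_loads(bricks, zone_size=4):
        # Sum power per cooling zone (groups of adjacent slots).
        return [sum(bricks[i:i + zone_size])
                for i in range(0, len(bricks), zone_size)]

    print(zone_loads(clustered))    # [2000, 600]: a hot spot beside an idle zone
    print(zone_loads(interleaved))  # [1300, 1300]: evenly spread draw

The same total power is drawn either way; interleaving simply spreads it so that no single zone exceeds its cooling budget.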

I recall you mentioned that you may consider folding Open19 into the Facebook-led Open Compute Project (OCP), but right now that’s not going to happen. Why launch Open19 as a standalone project?
The reason we don’t want to fold Open19 under OCP at this time is that there are strong restrictions around how innovation and contributions are routed back into the OCP community.

IT equipment partners aren’t willing to contribute IP and innovation. The OCP model wasn’t enabling the industry: when organizations create things that are special, OCP requires them to expose everything outside of the company. That’s why some of our large server partners couldn’t join OCP. Open19 defines the form factor, and each server provider can compete in the market and innovate in their own dimension.

What are your next steps and what is the timeline?
We will have an Open19-based system up and running in the middle of September 2016 in our labs. We are targeting late Q1 of 2017 to have a variety of servers installed from three to four suppliers. We are considering Open19 as our primary deployment model, and if all of the aspects are completed and tested, we could see Open19 in production environments after Q1 2017.

The challenge is that this effort has multiple layers. We are working with partners to do the engineering development. And that is achievable. A secondary challenge is the legal side. How do you create an environment that the providers are willing to join? How do we do this right this time?

But most importantly, for us to be successful we have to have significant adoption from suppliers and operators.

It seems like something that would be valuable for the whole industry—not just the hyperscale organizations.
Enterprise groups will see an opportunity to participate without having to be 100,000-server data center companies. Many enterprise IT groups have expressed reluctance about OCP because the white box servers come with limited support and warranty levels.

It will also lower their costs, increase the speed of installations and changes, and improve their negotiating position on procurement.

If we come together, we will generate enough demand to make Open19 interesting to all of the suppliers and operators. You will get the option to choose.


Yuval Bachar is a principal engineer on the global infrastructure and strategy team at LinkedIn, where he is responsible for the company’s data center architecture strategy and the implementation of its future mega-scale data centers. In this capacity, he drives and supports new technology development, architecture, and collaboration to support the tremendous growth in LinkedIn’s user base, data centers, and services. Prior to LinkedIn, Mr. Bachar held IT leadership positions at Facebook, Cisco, and Juniper Networks.