Bulky Binders and Operations “Experts” Put Your Data Center at Risk

Wearable technology and digitized operating procedures ensure compliance with standardized practices and provide quick access to critical information

By Jose Ruiz

No one in the data center industry questions that data center outages are extremely costly. Numerous vendors report that the average data center outage costs hundreds of thousands of dollars, with losses in some circumstances totaling much more. Uptime Institute Network Abnormal Incident Reports data show that most downtime instances are caused by human error, a key vulnerability in mission-critical environments. In addition, Compass Datacenters believes that many failures that enterprises attribute to equipment could more accurately be considered human error. However, most companies prefer to blame the hardware rather than their processes.

Figure 1. American Electric Power is one of the largest utilities in the U.S., with a service territory covering parts of 11 states.

Whether one accepts higher or lower outage costs, it is clear that reducing human error as a cause of downtime has a tremendous upside for the enterprise. As a result, data center owners dedicate a great deal of time to developing maintenance and operation procedures that are clear, effective, and easy to follow. Compass Datacenters is incorporating wearable technology in its efforts to make the procedures convenient for technicians to access and execute. During the process, Compass learned that the true value of introducing wearables into data center operations comes from standardizing the procedures. Incorporating wearable technology enhances the ability of users to follow the procedures. In addition, allowing users to choose the device most useful for the task at hand improves their comfort with the new process.

THE PROBLEM WITH TRADITIONAL OPs DOCUMENTATION

There are two major problems with the current methods of managing and documenting data center procedures.

  • Many data center operations teams document their procedures and methodologies in large three-ring binders. Although their presence is comforting, Operations staffs rarely consult them.
  • Also, organizations often rely on highly detailed written documentation presented in such depth that Operations staff wastes a great deal of time trying to locate the appropriate information.

In both instances, Operations personnel deem the written procedural guidelines inconvenient. As a result, actual data center operation tends to be more experience-based. Bluntly put, too many organizations rely too much on the most experienced or the most insistent Operations person on site at the time of an incident.

Figure 2. Compass Datacenters’ use of electronics and checklists clarifies the responsibilities of Operations personnel doing daily rounds.

On the whole, experience-based operations are not all bad. Subject matter experts (SMEs) emerge and are valued, with operations running smoothly for long periods at a time. However, as Julian Kudritzki wrote in “Examining and Learning from Complex Systems Failures” (The Uptime Institute Journal, vol. 5, p. 12), “‘Human error’ is an insufficient and misleading term. The front-line operator’s presence at the site of the incident ascribes responsibility to the operator for failure to rescue the situation. But this masks the underlying causes of an incident. It is more helpful to consider the site of the incident as a spectacle of mismanagement.” What’s more, subject matter experts may leave, get sick, or simply have a bad day. In addition, most of their peers assume the SME is always right.

Figures 3-5. The new system provides reminders of monthly tasks, prompts personnel to take required steps, and then requires verification that the steps have been taken.

Of course, no one goes to work planning to hurt a coworker or cause an outage, but errors are made. Compass Datacenters believes its efforts to develop professional-quality procedures and use wearable technology reduce the likelihood of human error. Compass’s intent is to make it easy for operators to handle both routine and foreseeable abnormal operations in two ways: by standardizing where possible and improvising only when absolutely necessary, and by making it simple and painless for technicians to comply with well-defined procedures.

AMERICAN ELECTRIC POWER

American Electric Power (AEP), a utility based in Columbus, OH, that provides electricity to more than 5.4 million customers in 11 states, inspired Compass to bring wearable technology to the data center to help the utility reduce human error. The initiative was a first for Compass, and AEP’s collaboration was essential to the product development effort. Compass and AEP began by adopting the checklist approach used in the airline business, among many others. This approach to avoiding human error is predicated on the use of carefully crafted operational procedures. Foreseeable critical actions by the pilot and copilot are strictly prescribed to ensure that all processes are followed in the same manner, every time, no matter who is in the cockpit. They standardize where possible and improvise only when required. Airline procedural techniques also put extra focus on critical steps to eliminate error. While this professional approach to operations has been widely adopted by a host of industries, it has not yet become a standard in the data center business.

Compass and AEP partnered with ICARUS Ops to develop the electronic checklist and wearables programs. ICARUS Ops was founded by two former U.S. Navy carrier pilots, who use their extensive experience in developing standardized aviation processes to develop wearable programs across a broad spectrum of high-reliability and mission-critical industries. During the process, ICARUS Ops brought on board several highly experienced consultants, including military and commercial pilots and a former Space Shuttle mission controller.

At the start of the project, Compass, ICARUS Ops, and AEP worked together to identify the various maintenance and operations tasks that would be wearable-accessible. Team members reviewed all existing emergency operating procedures (EOP). The three organizations held detailed discussions with AEP’s data center personnel to help identify the procedures best suited to run electronically. Including operators in the process ensured that their perspectives would be part of the final procedures to make it easier for them to execute correctly. Also, this early involvement led to a high level of buy-in, which would be needed for this project to succeed.

ELECTRONIC PROCEDURES

The next step was converting the procedures to electronic checklists. The transition to electronics was accomplished using ICARUS Ops Application and web-based Mission Control. The project team wanted each checklist to:

  • Be succinct and written with the operator in mind
  • Identify the key milestone in the process and use digital technology to verify critical steps
  • Use condition, purpose, or desired effect statements to assure the proper checklist is utilized
  • Identify common areas of confusion and provide links to tools such as animations and videos to clarify how and why the procedure is to be performed
  • Make use of Job Safety Analysis (JSA) to identify areas of significant risk prior to the required step/action. These included:
    o Warnings to indicate where serious injury is possible if all procedures are not followed specifically
    o Cautions indicating where significant data loss or damage is possible
    o Notes to add any additional information that may be required or helpful
  • Condense SOP and EOP procedure items to prevent the user from being overwhelmed during critical and emergency scenarios.

AEP SMEs walked through each and every checklist to provide an additional quality control step.
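
The checklist goals listed above can be pictured as a small data model. The following is only a minimal Python sketch under stated assumptions, not the ICARUS Ops application: the class names (`Flag`, `Step`, `Checklist`) and the rule that steps are completed in order are hypothetical choices made for illustration. It shows a condition statement per checklist, warning/caution/note annotations on steps, and critical steps that cannot be marked done without explicit verification.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Flag(Enum):
    """Annotation levels named in the checklist goals (names are hypothetical)."""
    WARNING = "warning"   # serious injury is possible
    CAUTION = "caution"   # significant data loss or damage is possible
    NOTE = "note"         # additional helpful information


@dataclass
class Step:
    text: str
    critical: bool = False                       # critical steps need verification
    flags: list[tuple[Flag, str]] = field(default_factory=list)
    done: bool = False
    verified: bool = False


@dataclass
class Checklist:
    title: str
    condition: str                               # when this checklist applies
    steps: list[Step]

    def complete(self, index: int, verified: bool = False) -> None:
        """Mark one step done; in this sketch, steps must be done in order."""
        if any(not s.done for s in self.steps[:index]):
            raise ValueError("earlier steps are not complete")
        step = self.steps[index]
        if step.critical and not verified:
            raise ValueError("critical step requires verification")
        step.done = True
        step.verified = verified

    def is_complete(self) -> bool:
        return all(s.done for s in self.steps)


# Usage with placeholder content (not a real AEP procedure).
example = Checklist(
    title="Example procedure",
    condition="Example condition under which this checklist is used",
    steps=[
        Step("Example routine step"),
        Step("Example critical step", critical=True,
             flags=[(Flag.CAUTION, "Example caution text")]),
    ],
)
example.complete(0)
example.complete(1, verified=True)   # omitting verified=True would raise
```

Putting the verification requirement in the data model, rather than relying on the operator to remember it, mirrors the article's point that critical steps get extra focus no matter who is running the checklist.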

Figure 6. Individual steps include illustrations and diagrams to eliminate any confusion.

Having SOPs and EOPs in a wearable digital format allows users to access them whenever an abnormal or emergency situation occurs, so no time is lost looking for the three-ring binder and proper procedure. In critical situations and emergencies, time is crucial and delay can be costly. Hard-copy information manuals and operating handbooks are bulky and can potentially overwhelm technicians looking for specific information. The bulk and formatting can cause users to lose their place on a page and skip over important items. By contrast, the digital interface gives users quick access and faster navigation to the desired materials and information via presorted emergency and abnormal procedures.

At the conclusion of the development phase of the process (about a month), 150 of AEP’s operational procedures had been converted into digital procedures that were conveniently accessible on mobile devices, online or offline. Most importantly, a digital continuous improvement process was put in place so that lessons learned could be easily incorporated into the system.

The process of entering the procedures took about 2 weeks. First, a data entry team put them in as written. Then Compass worked with AEP and ICARUS Ops checklist experts to tighten up the language and add decision points, warnings, cautions, and notes. This group also embedded pictures and video where appropriate and decided what data needed to be captured during the process.

Updates are easy. Each time a checklist or procedure is used, improvements can be approved and digitally published right away. This is critical. AEP believes that when employees know they are being heard, job satisfaction improves and they take more ownership and pride in the facility. The system is designed for continuous improvement. Any time a technician has an idea to improve a checklist, the technician just records the idea in the application, which instantly emails it to the SME for approval and publishing into the system. Other devices on the system sync automatically when they update.
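
The improvement loop described above (a technician records an idea, the SME approves and publishes it, and devices pick up the new version) can be sketched in a few lines of Python. This is an illustrative model only: the names `Suggestion`, `PublishedChecklist`, and `Device`, and the use of a simple version counter to decide when a device is stale, are assumptions and are not taken from the actual system, which the article says notifies the SME by email.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Suggestion:
    author: str
    text: str
    approved: bool = False


@dataclass
class PublishedChecklist:
    title: str
    steps: list[str]
    version: int = 1
    pending: list[Suggestion] = field(default_factory=list)

    def suggest(self, author: str, text: str) -> Suggestion:
        """A technician records an idea; it waits for SME review."""
        suggestion = Suggestion(author, text)
        self.pending.append(suggestion)
        return suggestion

    def approve(self, suggestion: Suggestion, new_steps: list[str]) -> None:
        """The SME approves the idea and publishes the revised steps."""
        self.pending.remove(suggestion)
        suggestion.approved = True
        self.steps = list(new_steps)
        self.version += 1            # a new version signals devices to update


@dataclass
class Device:
    name: str
    version: int = 0                 # 0 means nothing has been downloaded yet
    steps: list[str] = field(default_factory=list)

    def sync(self, source: PublishedChecklist) -> bool:
        """Copy the published checklist if the local copy is out of date."""
        if self.version == source.version:
            return False
        self.version = source.version
        self.steps = list(source.steps)
        return True
```

In this sketch a device keeps its own copy of the steps, which is one simple way to model procedures that remain usable offline and refresh the next time the device connects.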

WEARABLE TECHNOLOGY

Using wearable technology made the electronic procedures even more accessible to the end users. At the outset, AEP, ICARUS Ops, and Compass determined that AEP technicians would be able to choose between a wrist-mounted Android device and a tablet. The majority of users selected the wrist-mounted device. This preference was not unexpected, as a good portion of their work requires them to use their hands. The wrist units allow users to have both hands available to perform actual work at all times.

Figure 7. With all maintenance activities logged, a dashboard enables management to retrieve and examine reports.

Compass chose the Android technology system for several reasons. First, from the very outset, Compass envisioned making Google Glass one of the wearable options. In addition, Android allows programs maximum freedom to use the hardware and software as required to meet customer needs; it is simply less constraining than working with iOS. Android devices are also very easy to procure, and some manufacturers make Class 1, Division 1 Android tablets that are usable in hazardous environments.

The software-based technology enhances AEP’s ability to track all maintenance activity in its data center. Once online, each device communicates the recorded operational information (actions performed, components replaced, etc.) to the site’s maintenance system. This ensures that all routine maintenance is performed on schedule, and at the same time provides a detailed history of the activities performed on each major component within the facility. Web-based leadership dashboards provide oversight into what is complete, due, or overdue. The system is completely secure, as it does not reside on the same server or system as any other system in the data center. It can be encrypted or resident in the data center for increased security.

Typically ICARUS Ops spends at least 3 days on site training the technicians and users, depending on the number of devices. Working together in that time normally leaves the team very comfortable with all components.

THE NEXT PHASE: GLASS

Although innovations such as Google Glass have increased awareness of visual displays, the concept dates back to the Vietnam War, when American fighter pilots began to use augmented reality on heads-up displays (HUDs). Compass Datacenters plans to harness the same augmented reality to enable maintenance technicians to complete a task entirely hands-free using Google Glass and other advanced wearables.

It is certainly safe to assume that costs related to data center outages will continue to rise in the future. As a result, it will become more important to increase operator reliability and reduce the risk of human error. We strongly believe that professional quality procedures and wearables will make a tremendous contribution to eliminating the human error element as a major cause of interruption and lost revenue. Our vision is that other data center providers will decide to make this innovative checklist system a part of their standard offerings over the course of the next 12 to 24 months.


Jose Ruiz

Jose Ruiz serves as Compass Datacenters’ Vice President of Operations. He provides interdisciplinary leadership in support of Compass’ strategic goals. Mr. Ruiz began his tenure at Compass as the company’s director of Engineering. Prior to joining Compass he spent three years in various engineering positions and was responsible for a global range of projects at Digital Realty Trust.

