Identifying Lurking Vulnerabilities in the World’s Best-Run Data Centers

Peer-based critiques drive continuous improvement, identify lurking data center vulnerabilities

By Kevin Heslin

Shared information is one of the distinctive features of the Uptime Institute Network and its activities. Under non-disclosure agreements, Network members not only share information, but they also collaborate on projects of mutual interest. Uptime Institute facilitates the information sharing and helps draw conclusions and identify trends from the raw data and information submitted by members representing industries such as banking and finance, telecommunications, manufacturing, retail, transportation, government, and colocation. In fact, information sharing is required of all members.

As a result of their Network activities, longtime Network members report reduced frequency and duration of unplanned downtime in their data centers. They also say that they’ve experienced enhanced facilities and IT operations because of ideas, proven solutions, and best practices that they’ve gleaned from other members. In that way, the Network is more than the sum of its parts. Obvious examples of exclusive Network benefits include the Abnormal Incident Reports (AIRs) database , real-time Flash Reports, email inquiries, and peer-to-peer interactions at twice-annual Network conferences. No single enterprise or organization would be able to replicate the benefits created by the collective wisdom of the Network membership.

Perhaps the best examples of shared learning are the data center site tours (140 tours through the fall of 2015) held in conjunction with Network conferences. During these in-depth tours of live, technologically advanced data centers, Network members share their experiences, hear about vendor experiences, and gather new ideas—often within the facility in which they were first road tested. Ideas and observations generated during the site tours are collated during detailed follow-up discussions with the site team. Uptime Institute notes that both site hosts and guests express high satisfaction with these tours. Hosts remark that visitors raised interesting and useful observations about their data centers, and participants witness new ideas in action.

Rob Costa, Uptime Institute North America Network Director, has probably attended more Network data center tours than anyone else, first as a senior manager for The Boeing Co., and then as Network Director. In addition, Boeing and Costa hosted two tours of the company’s facilities since joining the Network in 1997. As a result, Costa is very knowledgeable about what happens during site tours and how Network members benefit.

“One of the many reasons Boeing joined the Uptime Institute Network was the opportunity to visit world-class, mission-critical data centers. We learned a lot from the tours and after returning from the conferences, we would meet with our facility partners and review the best practices and areas of improvement that we noted during the tour,” said Costa.

“We also hosted two Network tours at our Boeing data centers. The value of hosting a tour was the honest and thoughtful feedback from our peer Network members. We focused on the areas of improvement noted from the feedback sessions,” he said.

Recently, Costa noted the improvement in the quality of the data centers hosting tours, as Network members have had the opportunity to make improvements based on the lessons learned from participating in the Network. He said, “It is all about continuous improvement and the drive for zero downtime.” In addition, he has noted an increased emphasis on safety and physical security.

Fred Dickerman, Uptime Institute, Senior Vice President, Management Services, who also hosted a tour of DataSpace facilities, said, “In preparing to host a tour you tend to look at your own data center from a different point of view, which helps you to see things which get overlooked in the day to day. Basically you walk around the data center asking yourself, ‘How will others see my data center?’

Prior to the tour, Network members study design and engineering documents for the facility to get an understanding of the site’s topology.

Prior to the tour, Network members study design and engineering documents for the facility to get an understanding of the site’s topology.

“A manager who has had problems at a data center with smoke from nearby industrial sites entering the makeup air intakes will look at the filters on the host’s data center and suggest improvement. Managers from active seismic zones will look at your structure. Managers who have experienced a recent safety incident will look at your safety procedures, etc.” Dickerman’s perspective summarizes why normally risk-averse organizations are happy to have groups of Network members in their data centers.

Though tour participants have generated literally thousands of comments since 1994  when the first tour was held, recent data suggest that more work remains to improve facilities. In 2015, Network staff collated all the areas of improvement suggested by participants in the data center tours since 2012. In doing so, Uptime Institute counted more than 300 unique comments made in 15 loosely defined categories, with airflow management, energy efficiency, labeling, operations, risk management, and safety meriting the most attention. The power/backup power categories also received a lot of comments. Although categories such as testing, raised floor issues, and natural disasters did not receive many comments, some participants had special interest in these areas, which made the results of the tours all the more comprehensive.

Uptime Institute works with tour hosts to address all security requirements. All participants have signed non-disclosure agreements through the Network.

Uptime Institute works with tour hosts to address all security requirements. All participants have signed non-disclosure agreements through the Network.

Dickerman suggested that the comments highlighted the fact that even the best-managed data centers have vulnerabilities. He highlighted comments related to human action (including negligence due to employee incompetence, lack of training, and procedural or management error) as significant. He also pointed out that some facilities are vulnerable because they lack contingency and emergency response plans.

Site tours last 2 hours and review the raised floor, site’s command center, switchgear, UPS systems, batteries, generators, electrical distribution, and cooling systems.

Site tours last 2 hours and review the raised floor, site’s command center, switchgear, UPS systems, batteries, generators, electrical distribution, and cooling systems.

He notes that events become failures when operators respond too slowly, respond incorrectly, or don’t respond at all, whether the cause of the event is human action or natural disaster. “In almost every major technological disaster, subsequent analysis shows that timely, correct response by the responsible operators would have prevented or minimized the failure. The same is true for every serious data center failure I’ve looked at,” Dickerman said.

It should be emphasized that a lot of the recommendations deal with operations and processes, things that can be corrected in any data center. It is always nice to talk about the next data center “I would build,” but the reality is that not many people will have that opportunity. Everyone, however, can improve how they operate the site.

For example, tour participants continue to find vulnerabilities in electrical systems, perhaps because any electrical system problem may appear instantaneously, leaving little or no time to react. In addition, tour participants also continue to focus on safety, training, physical security, change management, and the presence of procedures.

In recent years, energy efficiency has become more of an issue. Related to this is the nagging sense that Operations is not making full use of management systems to provide early warning about potential problems. In addition, most companies are not using an interdisciplinary approach to improving IT efficiency.

Dickerman notes that changes in industry practices and regulations explain why comments tend to cluster in the same categories year after year. Tour hosts are very safety conscious and tend to be very proud of their safety records, but new U.S. Occupational Safety and Health Administration (OSHA) regulations limiting hot work, for example, increase pressure on facility operators to implement redundant systems that allow for the shutdown of electrical systems to enable maintenance to be performed safely. Tour participants can share experiences about how to effectively and efficiently develop appropriate procedures to track every piece of IT gear and ensure that the connectivity of the gear is known and that it is installed and plugged in correctly.

Of course, merely creating a list of comments after a walk-through is not necessarily helpful to tour hosts. Each network tour concludes with discussion where comments are compiled and discussed. Most importantly Uptime Institute staff moderate discussions, as comments are evaluated and rationales for construction and operations decisions explained. These discussions ensure that all ideas are vetted for accuracy, and that the expertise of the full group is tapped before a comment gets recorded.

Finally, Uptime Institute moderators prepare a final report for use by the tour host, so that the most valid ideas can be implemented.

Pitt Turner, Uptime Institute Executive Director Emeritus, notes that attending or hosting tours is not sufficient by itself, “There has to be motivation to improve. And people with a bias toward action do especially well. They have the opportunity to access technical ideas without worrying about cost justifications or gaining buy in. Then, when they get back to work, they can implement the easy and low-cost ideas and begin to do cost justifications on those with budget impacts.”

TESTIMONIALS

One long-time Network member illustrates that site tours are a two-way street of continuous improvement by telling two stories separated by several years and from different perspectives. “In 1999, I learned two vitally important things during a facility tour at Company A. During that tour my team saw that Company A color coded both its electrical feeds and CEVAC (command, echo, validate, acknowledge, control), which is a process that minimizes the chance for errors when executing procedures. We still color code in this way to this day.

Years later the same Network member hosted a data center tour and learned an important lesson during the site tour. “We had power distribution unit (PDU) breakers installed in the critical distribution switchgear,” he said. “We had breaker locks on the breaker handles that are used to open and close the breakers to prevent an accidental trip. I thought we had protected our breakers well, but I hadn’t noticed a very small red button at the bottom of the breaker that read ‘push to trip’ under it. A Network member brought it to my attention during a tour. I was shocked when I saw it. We now have removable plastic covers over those buttons.”


Kevin Heslin

Kevin Heslin

Kevin Heslin is Chief Editor and Director of Ancillary Projects at Uptime Institute. In these roles, he supports Uptime Institute communications and education efforts. Previously, he served as an editor at BNP Media, where he founded Mission Critical, a commercial publication dedicated to data center and backup power professionals. He also served as editor at New York Construction News and CEE and was the editor of LD+A and JIES at the IESNA. In addition, Heslin served as communications manager at the Lighting Research Center of Rensselaer Polytechnic Institute. He earned the B.A. in Journalism from Fordham University in 1981 and a B.S. in Technical Communications from Rensselaer Polytechnic Institute in 2000.


1


2

Share this