Top 10 Considerations for Enterprises Progressing to Cloud

Industry data from Uptime Institute and 451 Research show a rapid rate of cloud computing adoption among enterprise IT departments. Organizations weigh cloud benefits and risks, and also evaluate how cloud will impact their existing and future data center infrastructure investments. In this video, Uptime Institute COO Julian Kudritzki and Andrew Reichman, Research Director at 451 Research, discuss how much risk, and how much reward, is on the table for companies considering a cloud transition.

While some organizations are making a “Tear down this data center” wholesale move to cloud computing, the vast majority of cloud adopters are getting there on a workload-by-workload basis, carefully evaluating their portfolio of workloads and applications and identifying the best cloud or non-cloud venue to host each.

The decision process is based on multiple considerations, including performance, integration issues, economics, competitive differentiation, solution maturity, risk tolerance, regulatory compliance considerations, skills availability, and partner landscape.

Some of the most important of these considerations when deciding whether to put a workload or application in the cloud include:

1. Know which applications impact competitive advantage, and which don’t.
You might be able to increase competitive advantage by operating critical differentiating applications better than peer companies do, but most organizations will agree that back office functions like email or payroll, while important, don’t really set a company apart. As mature SaaS (software as a service) options for these mundane, but important, business functions have emerged, many companies have decided that a cloud application that delivers a credible solution at a fair cost is good enough. Offloading the effort for these workloads can free up valuable time and effort for customization, optimization, and innovation around applications that drive real competitive differentiation.

2. Workloads with highly variable demand see the biggest benefit from cloud.
Public cloud was born to address the big swings in demand seen in the retail world. If you need thousands of servers around the Christmas shopping spree, public cloud IaaS (infrastructure as a service) makes them available when you need them and lets you return them after the New Year when demand settles down. Any workload with highly variable demand can see obvious benefits from running in cloud, so long as the architecture supports load balancing and changing the number of servers working on the job, known as scale-out design. Knowing which applications fit this bill, and what changes could be made to cloud-enable other applications, will help identify those that would see the biggest economic benefit from a move to cloud.
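As a purely illustrative sketch (not drawn from the article), the snippet below shows the kind of sizing logic a scale-out design implies: the server count follows observed demand instead of being provisioned for the annual peak. The capacity, headroom, and demand figures are all hypothetical.

# Hypothetical sketch: sizing a scale-out fleet from observed demand.
# Capacities, thresholds, and demand figures are illustrative only.
import math

SERVER_CAPACITY_RPS = 500      # assumed requests/second one server can handle
TARGET_UTILIZATION = 0.6       # keep headroom so spikes don't saturate the fleet
MIN_SERVERS = 2                # floor kept for redundancy

def servers_needed(observed_rps: float) -> int:
    """Return how many servers to run for the current demand level."""
    raw = observed_rps / (SERVER_CAPACITY_RPS * TARGET_UTILIZATION)
    return max(MIN_SERVERS, math.ceil(raw))

# Holiday peak vs. quiet January: the fleet scales out, then back in.
for label, rps in [("normal week", 3_000), ("holiday peak", 45_000), ("post-holiday", 1_200)]:
    print(f"{label:>14}: {servers_needed(rps)} servers")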

3. Cloud supports trial and error without penalty.
Cloud gives users an off switch for IT resources, allowing easy changes to application architecture. Find out that a different server type or chipset or feature is needed after the initial build-out? No problem, just move it to something else and see how that works. The flexibility of cloud resources lends itself very well to experimentation in finding the perfect fit. If you’re looking for a home for an application that’s been running well for years, however, you might find that keeping it on-premises will be cheaper and less disruptive than moving it to the cloud.

4. Big looming investments can present opportunities for change.
Cloud can function as an effective investment avoidance strategy when organizations face big bills for activities like data center build-out or expansion, hardware refresh, software upgrade, staff expansion, or outsourcing contract renewal. When looking at big upcoming expenditures, it’s a great time to look at how offloading selected applications to cloud might reduce or eliminate the need for that spend. Once the investments are made, they become sunk costs and will likely make the business case for cloud transition much less attractive.

5. Consider whether customization is required or if good enough is good enough.
Is this an application that isn’t too picky in terms of server architecture and advanced features, or is it an app that requires specific custom hardware architectures or configurations to run well? If you have clearly understood requirements that you must keep in place, a cloud provider might not give you exactly what you need. On the other hand, if your internal IT organization struggles to keep pace with the latest and greatest, or if your team is still determining the optimal configuration, cloud could give you the flexibility to experiment with a wider range of options than an internal operation, running at a far smaller scale than the mega-scale cloud providers, can offer.

6. Conversion to cloud native architectures can be difficult but rewarding in the long run.
One benefit of cloud is renting instead of buying, with advantages in terms of scaling up and down at will and letting a service provider do the work. A separate benefit comes from the use of cloud native architectures. Taking advantage of things like API-controlled infrastructure, object storage, microservices, and serverless computing requires switching to cloud-friendly applications or modifying legacy apps to use cloud design principles. If you have plans to switch or modify applications anyway, think about whether you would be better served running these applications in house or whether that inflection point is the time to move to something hosted by a third party. If your organization runs mostly traditional apps and has no intention of taking on major projects to cloud-enable them, you should know that the options and benefits of forklifting them unchanged to cloud will be limited.

7. Be honest about what your company is good at and what it is not.
Cloud promises organizations the ability to get out of the business of mundane activities such as racking and stacking gear, refreshing and updating systems, and managing facilities, so it’s important to start the journey with a clear and honest assessment of what your company does well and what it does not. If you have long-standing processes to manage efficiency, reliability, and security, have relatively new facilities and equipment, and have IT staff who are good at keeping it all running, then cloud might not offer much benefit. If there are areas where things don’t go so smoothly, or you struggle to get everything done with existing headcount, cloud could be a good way to get better results without taking on more effort or learning major new skills. On the other hand, managing a cloud environment requires its own specialized skills, which can make it hard for unfamiliar organizations to jump in and realize benefits.

8. Regulatory issues have a significant bearing on the cloud decision.
Designing infrastructure solutions that meet regulations can be tremendously complicated. It’s good to know upfront if you face regulations that explicitly prevent public cloud usage for certain activities, or if your legal department interprets those regulations that way, before wasting time evaluating non-starter solutions. That said, regulations generally impact some workloads and not others, and in many cases are specific to certain aspects of workloads, such as payment data or customer or patient identity information. A hybrid architecture might allow sensitive data to be kept in private venues, while less sensitive information might be fine in public cloud. Consider too that once a public cloud solution has seen wide acceptance for a regulated workload, there may be more certainty that the solution is compliant.

9. Geography can be a limiting factor or a driver for cloud usage.
If you face regulations around data sovereignty and your data has to be physically stored in Poland, Portugal, or Panama (or anywhere else on the globe), the footprint of a cloud provider could be a non-starter. On the flip side, big cloud providers are likely already operating in far more geographies than your enterprise. This means that if you need multiple sites for redundancy and reliability, require content delivery network (CDN) capabilities to reach customers in a wide variety of locations, or want to expand into new regions, the major cloud providers can extend your geographic reach without major capital investments.

10. The unfamiliar may only seem less secure.
Public cloud can go either way in terms of being more or less secure than what’s provisioned in an enterprise data center. Internal infrastructure has more mature tools and processes associated with it, enjoys wider familiarity among enterprise administrators, and offers the transparency that comes with owning and operating facilities and gear, which lets organizations get predictable results and take a certain comfort from sheer proximity. That said, cloud service providers operate at far greater scale than individual enterprises and therefore employ more security experts and gain more experience from addressing more threats than any single company does. Also, building for shared tenancy can drive service providers to lock down data across the board, whereas enterprises may have carried over vulnerabilities from older configurations that become issues as the user base or feature set of workloads changes. Either way, a thorough assessment of vulnerabilities in enterprise and service provider facilities, infrastructure, and applications is critical to determine whether cloud is a good or bad option for you.


Andrew Reichman is a Research Director for cloud data within the 451 Research Voice of the Enterprise team. In this role, he designs and interprets quarterly surveys that explore cloud adoption in the overall enterprise technology market.

Prior to this role, he worked at Amazon Web Services, leading their efforts in marketing infrastructure as a service (IaaS) to enterprise firms worldwide. Before this, he spent six years as a Principal Analyst at Forrester Research, covering storage, cloud and datacenter economics. Prior to his time at Forrester, Andrew was a consultant with Accenture, optimizing datacenter environments on behalf of EMC. He holds an MBA in finance from the Foster School of Business at the University of Washington and a BA in History from Wesleyan University.

Open19 is About Improving Hardware Choices, Standardizing Deployment

In July 2016, Yuval Bachar, principal engineer, Global Infrastructure Architecture and Strategy at LinkedIn, announced Open19, a project spearheaded by the social networking service to develop a new specification for server hardware based on a common form factor. The project aims to standardize the physical characteristics of IT equipment, with the goal of cutting costs and installation headaches without restricting choice or innovation in the IT equipment itself.

Uptime Institute’s Matt Stansberry discussed the new project with Mr. Bachar following the announcement. The following is an excerpt of that conversation:

Please tell us why LinkedIn launched the Open19 Project.
We started Open19 with the goal that any IT equipment we deploy would be able to be installed in any location, such as a colocation facility, one of our owned sites, or a point of presence (POP), in a standardized 19-inch rack environment. Standardizing on this form factor significantly reduces the cost of integration.

We aren’t the kind of organization that has one type of workload or one type of server. Data centers are dynamic, with apps evolving on a weekly basis. New demands for apps and solutions require different technologies. Open19 provides the opportunity to mix and match server hardware very easily without the need to change any of the mechanical or installation aspects.

Different technologies evolve at a different pace. We’re always trying a variety of different servers—from low-power multi core machines to very high end, high performance hardware. In an Open19 configuration, you can mix and match this equipment in any configuration you want.

In the past, if you had a chassis full of blade servers, they were super-expensive, heavy, and difficult to handle. With Open19, you would build a standardized chassis into your cage and fit servers from various suppliers into the same slot. This provides a huge advantage from a procurement perspective. If I want to replace one blade today, the only opportunity I have is to buy from a single server supplier. With Open19, I can buy a blade from anybody that complies. I can have five or six proposals for a procurement instead of just one.

Also, by being able to blend high performance and high power with low-power equipment in two adjacent slots in the brick cage in the same rack, you can create a balanced environment in the data center from a cooling and power perspective. It helps avoid hot spots.

I recall you mentioned that you may consider folding Open19 into the Facebook-led Open Compute Project (OCP), but right now that’s not going to happen. Why launch Open19 as a standalone project?
The reason we don’t want to fold Open19 under OCP at this time is that there are strong restrictions around how innovation and contributions are routed back into the OCP community.

IT equipment partners aren’t willing to contribute IP and innovation. The OCP solution wasn’t enabling the industry—when organizations create things that are special, OCP requires you to expose everything outside of your company. That’s why some of our large server partners couldn’t join OCP. Open19 defines the form factor, and each server provider can compete in the market and innovate in their own dimension.

What are your next steps and what is the timeline?
We will have an Open19-based system up and running in the middle of September 2016 in our labs. We are targeting late Q1 of 2017 to have a variety of servers installed from three to four suppliers. We are considering Open19 as our primary deployment model, and if all of the aspects are completed and tested, we could see Open19 in production environments after Q1 2017.

The challenge is that this effort has multiple layers. We are working with partners to do the engineering development. And that is achievable. A secondary challenge is the legal side. How do you create an environment that the providers are willing to join? How do we do this right this time?

But most importantly, for us to be successful we have to have significant adoption from suppliers and operators.

It seems like something that would be valuable for the whole industry—not just the hyperscale organizations.
Enterprise groups will see an opportunity to participate without having to be 100,000-server data center companies. Many enterprise IT groups have expressed reluctance about OCP because the white box servers come with limited support and warranty levels.

It will also lower their costs, increase the speed of installations and changes, and improve their negotiating position on procurement.

If we come together, we will generate enough demand to make Open19 interesting to all of the suppliers and operators. You will get the option to choose.


Yuval Bachar is a principal engineer on the global infrastructure and strategy team at LinkedIn, where he is responsible for the company’s strategy for data center architecture and implementation of its mega-scale future data centers. In this capacity, he drives and supports new technology development, architecture, and collaboration to support the tremendous growth in LinkedIn’s future user base, data centers, and services. Prior to LinkedIn, Mr. Bachar held IT leadership positions at Facebook, Cisco, and Juniper Networks.

FORTRUST Gains Competitive Advantage from Management and Operations

FORTRUST regards management and operations as a core competency that helps it win new clients and control capital and operating expenses

Shortly after receiving word that FORTRUST had earned Uptime Institute’s Tier Certification for Operational Sustainability (Gold) for Phase 7 of its Denver data center, Rob McClary, the company’s executive vice president and general manager, discussed the importance of management and operations and how FORTRUST utilizes Uptime Institute’s certifications to maintain and improve its operations and gain new clients.

Q: Please tell me about FORTRUST and its facilities.

FORTRUST is a multi-tenant data center (MTDC) colocation services provider. Our main data center is a 300,000-square-foot facility in Denver, and we have been operating as a privately owned company since 2001, providing services for Fortune 100, 500, and 1000 companies.

Q: You recently completed a new phase. How is it different than earlier phases?

The big change is that we started to take a prefabricated approach to construction for Phase 6. Instead of traditional raised floor, we went to data modules, effectively encapsulating the customers’ IT environments, which increased our per-rack densities and subsequent efficiencies from both a cooling and a capital standpoint.

One of the biggest trends in the industry that needs a course correction is that data centers are not being allowed to evolve as they need to. We keep trying to build the same data centers over and over. The engineers and general contractors and even the vendors just want to do what they have always done. And I think data centers are going to have to become capital efficient. When we made the change to a modular approach, we reduced our capital outlay and started getting an almost instantaneous return on the capital.

Q: How did you ensure that these changes would not reduce your availability?

For Phase 6a, we were trying to get used to the whole modular approach in a colocation environment. We had Uptime Institute review our designs. As a result, we earned the Tier Certification of Design Documents.

We planned to go further and do the Tier Certification of Constructed Facility as well, but Uptime Institute helped us determine that it would be better to do the Tier Certification of Constructed Facility in an upcoming phase (Phase 7) because we already had live customers in Phase 6A. And, as you know, about half the customers in a colo facility have at least one piece of single-corded equipment in their IT environment. To address this, we worked with Uptime Institute consultants to adapt what Uptime Institute normally does during a Tier Certification of Constructed Facility, where there are usually no live customers. And we pursued Uptime Institute’s M&O Stamp of Approval.

This helped us understand the whole modular approach and how we would approach these circumstances going forward. At the same time we became one of the earlier Uptime Institute M&O Stamp of Approval sites.

Q: Why has FORTRUST adopted Uptime Institute’s Operations assessments and certifications?

We are big believers that design is only a small part of uptime and reliability. We focus 90% of our effort on management and operations, risk mitigation, and process discipline, doing things in a manner that prevents human error. That’s really what’s allowed us to achieve continuous uptime and have high customer satisfaction for such a long time.

I’d say we’re known for our operations and a level of excellence, so the Tier Certification of Operational Sustainability validated things that we have been doing for many years. It also allowed us to take a look at what we do from a strategy and tactical standpoint by essentially giving us a third-party look into what we think is important and what someone else might think is important.

We’ve got a lot of former military folks here, a lot of former Navy. That background may influence our approach. The military conducts a lot of inspections and audits. You get used to it, and it becomes your chance to shine. So the Tier Certification of Operational Sustainability allows our people to show what they do, how they do it, and the pride they take in doing it. It gives them validation of how well they are doing, and it emphasizes why it is important.

The process re-emphasizes why it is so important to have your operational strategies and your tactics aligned in a harmonious fashion. A lot of people in the data center industry get bogged down in checklists and best practices and in trying to use them to compare data centers, and about 50 percent of it is noise, which means tactics without strategy. If you have your strategies in place and your tactics are aligned with your strategies, that’s much more powerful than trying to incorporate 100 best practices in your day-to-day ops. Doing 50 things very well is better than doing 100 things halfway.

Q: Did previously preparing for the M&O Stamp of Approval help you prepare for the Tier Certification of Operational Sustainability?

Absolutely. One reason we scored so well on the Tier Certification of Operational Sustainability was that we looked at our M&O three years ago and implemented the suggested improvements right away, so we were comfortable because we’ve had those things in place for years.

Q: What challenges did you face? You make it sound easy.

The biggest challenge for us, during both the Tier Certification of Constructed Facility and the Tier Certification of Operational Sustainability, was coordinating with Uptime Institute in a live colo environment with shared systems that weren’t specific to one phase. It was pretty much like doing surgery on a live body.

We were confident in what we were doing. Obviously the Tier Certification of Operational Sustainability centers on documentation, and we’ve been known for process discipline and procedural compliance for over 14 years of operations. It’s our cornerstone; it’s what we do very well. We literally don’t do anything without a procedure in hand.

We think the design is the design. Once you build the data center and infrastructure, it is all about management and operations; management, operations, and risk mitigation are what will give you a long track record of success.

At the end of the day, if the biggest cause of outages in data centers is human error, why wouldn’t we put more emphasis on understanding why that happens and how to avoid it? To me, that’s what the M&O and Tier Certification of Operational Sustainability are all about.

Q: It’s obvious you consider procedures a point of differentiation in the market. How do you capitalize on this difference with new customers?

We show it to them. Part of our sales cycle includes taking the customer through the due diligence that they want to do and what we think they need to do. We make available hundreds of our documented procedures. We show them how we manage them. When we take a potential customer through our data center, it’s a historical story that we put together that starts with reliability, risk mitigation, business value and customer service.

Customers can not only hear it and see the differences, but they can also feel it. If you have been in a lot of data centers, you can walk through the MEP or colo areas and, in maybe 10-15 minutes, tell if there’s a difference in the management and operations philosophy. It’s quite apparent.

That’s always been our call to action. It’s really about educating the customer on why management and ops are so important. We put a lot of emphasis and resources into mitigating and eliminating human error. We devote probably 80-90% of our time to training, process discipline, and procedural compliance. That’s how we operate day to day.

Q: What are the added costs?

Actually, this approach costs less. I would challenge anyone who is outsourcing most of their maintenance and operations, and even management, because we’re doing it cheaper and we’re doing more aggressive predictive and preventive maintenance than almost any data center. It’s really what we call an operational mindset, and that can rarely be outsourced. Your personnel have to own it.

We don’t have people coming in to clean our data centers. Our folks do it. We do the majority of the maintenance in the facility, and the staff owns it.

We don’t do a lot of corrective maintenance. Corrective maintenance costs on the order of 10 times the cost of a comprehensive preventive and predictive maintenance program.

I can show proof because we have been operating for 15 years now. I would dare anyone to tell me what part of that data center, or which of our substations, switchgear, or other equipment components, are 15 years old and which are the new ones. It would be hard to tell.

I think there are too many engineering firms and GCs that try to influence the build in a manner that isn’t necessary. Like I said, they try to design around human error instead of spending time preventing it.


Rob McClary

Rob McClary is executive vice president and general manager at FORTRUST Data Centers. Since joining FORTRUST in 2001, he has played a critical role in building the company into a premier data center services provider and colocation facility. Mr. McClary is responsible for the overall supervision of business operations, high-profile construction, and strategic technical direction. He developed and implemented the process controls and procedures that support the continuous uptime and reliability that FORTRUST Denver has delivered for more than 14 years.


Saudi Aramco’s Cold Aisle Containment Saves Energy

Oil exploration and drilling require HPC

By Issa A. Riyani and Nacianceno L. Mendoza

Saudi Aramco’s Exploration and Petroleum Engineering Computer Center (ECC) is a three-story data center built in 1982. It is located in Dhahran, Kingdom of Saudi Arabia. It provides computing capability to the company’s geologists, geophysicists, and petroleum engineers to enable them to explore, develop, and manage Saudi Arabia’s oil and gas reserves. Transitioning the facility from mainframe to rack-mounted servers was just the first of several transitions that challenged the IT organization over the last three decades. More recently, Saudi Aramco reconfigured the legacy data center to a Cold Aisle/Hot Aisle configuration, increasing rack densities to 8 kilowatts per rack (kW/rack) from 3 kW/rack in 2003 and nearly doubling capacity. Further increasing efficiency, Saudi Aramco also sealed openings around and under the computer racks, cooling units, and the computer power distribution panel in addition to blanking unused rack space.

The use of computational fluid dynamics (CFD) simulation software to manage the hardware deployment process enabled Saudi Aramco to increase the total number of racks and rack density in each data hall. Saudi Aramco used the software to analyze various proposed configurations prior to deployment, eliminating the risk of trial and error.

In 2015 one of the ECC’s five data halls was modified to accommodate a Cold Aisle Containment System. This installation supports the biggest single deployment so far in the data center, 124 racks of high performance computers (HPC) with a total power demand of 994 kW. As a result, the data hall now hosts 219 racks on a 10,113-square-foot (940-square-meter) raised floor. To date, the data center hall has not experienced any temperature problems.

Business Drivers

Increasing demand from ECC customers, which required the deployment of IT hardware and software technology advances, necessitated a major reconfiguration of the data center. Each new configuration increased the heat that needed to be dissipated from the ECC. At each step, several measures were employed to mitigate potential impact to the hardware, ensuring safety and reliability during each deployment and project implementation.

For instance, Saudi Aramco developed a hardware deployment master plan, based on a projected life cycle and refresh rate of 3–5 years, to transition to the Cold Aisle/Hot Aisle configuration. The plan allows for advance planning of space and power source allocation, as well as fund allocation and material procurement, with no compromise to existing operations (see Figures 1 and 2).

Figure 1. Data center configuration (current day)

Figure 2. Data center plan view (master plan)

Because of the age of the building and its construction methodology, the company’s engineering and consulting department was asked to evaluate the building structure based on the initial master plan. This department determined the maximum weight capacity of the building structure, which was used to establish the maximum rack weight to avoid compromising structural stability.

In addition, the engineering and consulting department evaluated the chilled water pipe network and determined the maximum number of cooling units that could be deployed in each data hall, based on the maximum allowable chilled water flow. Similarly, the department determined the total heat to be dissipated per Hot Aisle to optimize the heat rejection capability of the cooling units, as well as the amount of heat to be dissipated per rack to ensure sufficient cooling per the manufacturer’s recommendations.

Subsequently, facility requirements based on these limiting factors were shared with the technology planning team and IT system support. The resulting checklist includes maximum weight, rack dimensions, and the requirement for blanking panels and sealing technologies to prevent air mixing.

Other features of the data center include:

  • A 1.5-foot (ft) [0.45 meter (m)] raised floor containing chilled water supply and return pipes for the CRAH units, cable trays for network connectivity, sweet water line for the humidifier, liquid-tight flexible conduits for power, and computer power system (CPS) junction boxes
  • A 9-ft (2.8 m) ceiling height
  • False ceilings
  • Down-flow chilled water computer room air handling (CRAH) units
  • CRAH units located at the end of each Hot Aisle
  • Perforated floor tiles (56% open with manually controlled dampers)
  • No overhead obstructions
  • Total data center heat load of 1,200 kW
  • Total IT load of 1,084 kW, which is constant for all three models
  • Sealed cable penetrations (modeled at 20% leakage)

The 42U cabinets in the ECC have solid sides and tops, with 64% perforated front and rear doors on each cabinet. Each is 6.5 ft high by 2 ft wide by 3.61 ft deep (2 m by 0.6 m by 1.10 m) and weighs 1,874 pounds (850 kilograms). Rack density ranges from 6.0–8.0 kW. The total nominal cooling capacity is 1,582 kW from 25 18-ton computer room air conditioning (CRAC) units.

Modeling Software

In 2007, Saudi Aramco commissioned the CFD modeling software company to prepare baseline models for all the data halls. The software is capable of performing the transient analysis that suits the company’s requirements. The company uses the modeling software to simulate proposed hardware deployments, investigate deployment scenarios, and identify any stranded capacity. The modeling company developed several simulations based on different hardware iterations of the master plan to help establish the final hardware master plan, with each 16-rack Hot Aisle not exceeding a 125-kW heat load and no rack exceeding 8 kW. After the modeling software company completed the initial iterations, Saudi Aramco acquired a perpetual license and support contract for the CFD simulation software in January 2010.
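To make those master-plan limits concrete, here is a minimal, hedged sketch of the kind of check they imply. The 125-kW aisle limit, 16-rack aisle size, and 8-kW rack limit come from the figures above; the proposed rack loads and the function itself are hypothetical.

# Sketch: checking a proposed Hot Aisle layout against the master-plan limits
# cited above (125 kW per 16-rack Hot Aisle, no more than 8 kW per rack).
# The per-rack loads below are hypothetical.
MAX_AISLE_KW = 125.0
MAX_RACKS_PER_AISLE = 16
MAX_RACK_KW = 8.0

def check_hot_aisle(rack_loads_kw):
    """Return human-readable violations for one proposed Hot Aisle."""
    problems = []
    if len(rack_loads_kw) > MAX_RACKS_PER_AISLE:
        problems.append(f"{len(rack_loads_kw)} racks exceeds the {MAX_RACKS_PER_AISLE}-rack limit")
    for i, kw in enumerate(rack_loads_kw, start=1):
        if kw > MAX_RACK_KW:
            problems.append(f"rack {i} at {kw} kW exceeds the {MAX_RACK_KW} kW per-rack limit")
    total = sum(rack_loads_kw)
    if total > MAX_AISLE_KW:
        problems.append(f"aisle total {total:.1f} kW exceeds the {MAX_AISLE_KW} kW limit")
    return problems

proposed_aisle = [7.5, 8.0, 6.0, 6.5, 8.2, 7.0, 7.5, 6.8, 7.2, 7.9, 6.1, 7.4, 7.7, 6.9, 7.3, 7.6]
for issue in check_hot_aisle(proposed_aisle) or ["aisle fits within the master-plan limits"]:
    print(issue)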

Saudi Aramco finds that the CFD simulation software makes it easier to identify and address heat stratification, recirculation, and even short-circuiting of cool air. By identifying the issues in this way, Saudi Aramco was able to take several precautionary measures and improve its capacity management procedures, including increasing cooling efficiency and optimizing load distribution.

Temperature and Humidity Monitoring System

With the CFD software simulation results at hand, the facilities management team looked for other means to gather data for use in future cooling optimization simulations while validating the results of CFD simulations. As a result, the facilities management group decided to install a temperature and humidity monitoring system. The initial deployment was carried out in 2008, with the monitoring of subfloor air supply temperature and hardware entering temperature.

At that time, three sensors were installed in each Cold Aisle for a total of six sensors. The sensors were positioned at each end of the row and in the middle, at the highest point of each rack. Saudi Aramco chose these points to get a better understanding of the temperature variance (∆T) between the subfloor and the highest rack inlet temperature. Additionally, Saudi Aramco uses this data to monitor and ensure that all inlet temperatures are within the recommended ranges of ASHRAE and the manufacturer.
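As an illustration only, the sketch below shows how such readings might be reduced to a ∆T figure and checked against an inlet envelope. The sensor values and function are hypothetical, and the 18–27°C band is ASHRAE’s commonly cited recommended inlet range rather than a figure taken from this article.

# Sketch: computing the subfloor-to-rack-inlet temperature variance and flagging
# inlets outside a recommended envelope. Sensor readings are hypothetical.
ASHRAE_RECOMMENDED_C = (18.0, 27.0)   # commonly cited recommended inlet range

def review_aisle(subfloor_supply_c, inlet_readings_c):
    """Report the variance and any out-of-range inlet temperatures for one Cold Aisle."""
    low, high = ASHRAE_RECOMMENDED_C
    for sensor, inlet_c in inlet_readings_c.items():
        delta_t = inlet_c - subfloor_supply_c
        status = "OK" if low <= inlet_c <= high else "OUT OF RANGE"
        print(f"{sensor}: inlet {inlet_c:.1f} C, delta-T {delta_t:+.1f} C, {status}")

# Three sensors per Cold Aisle, one at each end of the row and one in the middle,
# at the highest rack inlet point (as described above).
review_aisle(
    subfloor_supply_c=16.5,
    inlet_readings_c={"end-A (top)": 18.2, "middle (top)": 19.4, "end-B (top)": 23.8},
)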

The real-time temperature and humidity monitoring system enabled the operations and facility management teams to monitor and document unusual and sudden temperature variances, allowing proactive responses and early resolution of potential cooling issues. The monitoring system gathers data that can be used to validate the CFD simulations and for further evaluation and iteration.

The Prototype

The simulation models identified stratification, short circuiting, and recirculation issues in the data halls, which prompted the facilities management team to develop more optimization projects, including a containment system. In December 2008, a prototype was installed in one of the Cold Aisles (see Figure 3) using ordinary plastic sheets, like refrigerated-room doors, and Plexiglas sheets on an aluminum frame. Saudi Aramco monitored the resulting inlet and core temperatures, using the temperature and humidity monitoring system and internal system monitors, prior to, during, and upon completion of installation to ensure no adverse effect on the hardware. The prototype was observed over the course of three months with no reported hardware issues.

Figure 3. Prototype containment system

Following the successful installation of the prototype, various simulation studies were further conducted to ensure the proposed deployment’s benefit and savings. In parallel, Saudi Aramco looked for the most suitable materials to comply with all applicable standards, giving prime consideration to the safety of assets and personnel and minimizing risk to IT operations.

Table 1. Installation dates

When the Cold Aisle was contained, Saudi Aramco noticed considerable improvement in the overall environment. Containment improved cold air distribution by preventing hot air from mixing with the supply air from the subfloor, so that air temperature at the front of the servers was close to the subfloor supply temperature. With cooler air entering the hardware, core temperatures improved greatly, resulting in lower exhaust and return air temperatures at the cooling units. As a result, the data hall was able to support more hardware.

Material Selection and Cold Aisle Containment System Installation

From 2009 to 2012, the facility management team evaluated and screened several products. It secured and reviewed the material data sheets and submitted them to the Authority Having Jurisdiction (AHJ) for evaluation and concurrence. Each of the solutions would have required some modifications to the facility before being implemented. The facility management team evaluated and weighed the impact of these modifications as part of the procurement process.

Of all the products, one stood out from the rest; its easy-to-install, transparent material not only addressed safety but also eliminated the need for modifications to the existing infrastructure, which translated into considerable savings in project execution time and money.

Movement in and out of the aisle is easy and safe because people can see through the doors and walls. Additionally, the data hall lighting did not need to be modified, since it was not obstructed. Even the fire suppression system was not affected, since the containment system has a fusible link and lanyard connector. The only requirement from the AHJ prior to deployment was additional smoke detectors in the Cold Aisle itself.

To comply with this requirement, an engineering work order was raised to prepare the necessary design package for the modification of the smoke detection system. After the design package was completed, including certification from a chartered fire protection engineer as mandated by the National Fire Protection Association (NFPA), it was established that four smoke detectors were to be relocated and an additional seven smoke detectors installed in the data hall.

Implementation and Challenges

Optimizations and improvements always come with challenges; the reconfiguration process necessitated close coordination among the technology planning team, IT system support, ECC customers, the network management group, Operations, and facility management. These teams had to identify hardware that could be decommissioned without impacting operations, prepare temporary spaces for interim operations, and then take the decommissioned hardware out of the data hall, allowing the immediate deployment of new hardware in the Cold Aisle/Hot Aisle configuration. Succeeding deployments followed the master plan, allowing the complete realignment process to be completed in five years.

Installation of the Cold Aisle Containment System did not come without challenges; all optimization activities, including relocating luminaires that were in the way of the required smoke detectors, had to be completed with no impact to system operations. To meet this requirement, ECC followed a strict no work permit–no work procedure; work permits are countersigned by operations management staff on duty at issuance and prior to close out. This enabled close monitoring of all activities within the data halls, ensuring safety and no impact to daily operations or hardware reliability. Additionally, a strict change management documentation process was adhered to by the facility management team and monitored by operations management staff; all activities within the data halls have to undergo a change request approval process.

Operations management and facility management worked hand in hand to overcome these challenges. Operations management, working in three shifts, closely monitored the implementation process, especially after regular working hours. Continuous coordination among contractors, vendors, operations staff, and the facility management team enabled smooth transitions and project implementation, eliminating showstoppers along the way.

Summary

The simulation comparison in Figure 4 clearly shows the benefits of the Cold Aisle Containment System. Figure 4a shows hot air recirculating around the ends of the rows and mixing with the cold air supplied to the Cold Aisles. In Figure 4b, mixing of hot and cold air is considerably reduced with the installation of the 14 Cold Aisle Containment Systems. The Cold Aisles are better defined and clearly visible, with less hot air recirculation, but the three rows without containment still suffer from recirculation. In Figure 4c, the Cold Aisles are far better defined, and hot air recirculation and short circuiting are further reduced. Additionally, the exhaust air temperature from the hardware has dropped considerably.

Figure 4a. Without Cold Aisle Containment

Figure 4b. With current Cold Aisle Containment (14 of 17 aisles)

Figure 4c. With full Cold Aisle Containment

Figures 5–11 show that the actual power and temperature readings taken from the sensors installed in the racks validated the simulation results. As shown in Figures 5 and 6, the power draw of the racks in Aisles 1 and 2 fluctuates while the corresponding entering and leaving temperatures hold steady. In week 40, the temperature even dropped slightly despite a slight increase in the power draw. The same can also be observed in Figures 7 and 8. All of these aisles are fitted with a Cold Aisle Containment System.

Figure 5. Actual power utilization, entering temperature, and leaving temperature, Aisle 1 (installed on July 30, 2015 – week 31)

Figure 6. Aisle 2 (installed on July 28, 2015 – week 31)

Figure 7. Aisle 3 (installed on March 7, 2015 – week 10)

Figure 8. Aisle 6 (installed on April 9, 2015 – week 15)

Figure 9. Aisle 7a (a and b) and Aisle 7b (c and d)

Figure 10. Aisle 8 (installed on February 28, 2015 – week 9)

Figure 11. Aisle 17 (no Cold Aisle Containment installed)

Additionally, Figure 11 clearly shows slightly higher entering and leaving temperatures, as well as fluctuations in the temperature readings that coincide with the power draw fluctuations of the racks within the aisle. This aisle has no containment.

The installation of the Cold Aisle Containment System greatly improved the overall cooling environment of the data hall (see Figure 12). Eliminating hot and cold air mixing and short circuiting allowed for more efficient cooling unit performance and cooler supply and leaving air. Return air temperatures at the CRAH units were also monitored and are sampled in Figure 12, which shows the actual return air temperature variance resulting from the improved overall data hall room temperature.

Figure 12. Computer room air handling unit actual return air temperature graphs

Figure 13. Cold Aisle floor layout

The installation of the Cold Aisle Containment System allows the same data hall to host the company’s MAKMAN and MAKMAN-2 supercomputers (see Figure 14). Both MAKMAN and MAKMAN-2 appear on the June 2015 Top500 Supercomputers list.

Figure 14. Installed Cold Aisle Containment System


Issa Riyani

Issa A. Riyani joined the Saudi Aramco Exploration Computer Center (ECC) in January 1993. He graduated from King Fahad University of Petroleum and Minerals (KFUPM) in Dhahran, Kingdom of Saudi Arabia, with a bachelor’s degree in electrical engineering. Mr. Riyani currently leads the ECC Facility Planning & Management Group and has more than 23 years of experience managing ECC facilities.

Nacianceno L. Mendoza

Nacianceno L. Mendoza joined the Saudi Aramco Exploration Computer Center (ECC) in March 2002. He holds a bachelor of science in civil engineering and has more than 25 years of diverse experience in project design, review, construction management, supervision, coordination and implementation. Mr. Mendoza spearheaded the design and implementation of the temperature and humidity monitoring system and deployment of Cold Aisle Containment System in the ECC.

 

 

Data-Driven Approach to Reduce Failures

Operations teams use the Uptime Institute Network’s Abnormal Incident Reports (AIRs) database to enrich site knowledge, enhance preventive maintenance, and improve preparedness

By Ron Davis

The AIRs system is one of the most valuable resources available to Uptime Institute Network members. It comprises more than 5,000 data center incidents and errors spanning two decades of site operations. Using AIRs to leverage the collective learning experiences of many of the world’s leading data center organizations helps Network members improve their operating effectiveness and risk management.

A quick search of the database, using various parameters or keywords, turns up invaluable documentation on a broad range of facility and equipment topics. The results can support evidence-based decision making and operational planning to guide process improvement, identify gaps in documentation and procedures, refine training and drills, benchmark against successful organizations, inform purchasing decisions, fine-tune preventive maintenance (PM) programs to minimize failure risk, help maximize uptime, and support financial planning.

THE VALUE OF AIRs OPERATIONS
The philosopher, essayist, poet, and novelist George Santayana wrote, “Those who cannot remember the past are doomed to repeat it.” Using records of past data center incidents, errors, and outages can help inform operational practices to help prevent future incidents.

All Network member organizations participate in the AIRs program, ensuring a broad sample of incident information from data center organizations of diverse sizes, business sectors, and geographies. The database contains data from data center facility infrastructure incidents and outages beginning in 1994 and continuing through the present. This volume of incident data allows for meaningful and extremely valuable analysis of trends and emerging patterns. Annually, Uptime Institute presents aggregated results and analysis of the AIRs database, spotlighting issues from the year and both current and historical trends.

Going beyond the annual aggregate trend reporting, there is also significant insight to be gained from looking at individual incidents. Detailed incident information is particularly relevant to front-line operators, helping to inform key facility activities including:

• Operational documentation creation or improvement

• Planned maintenance process development or improvement

• Training

• PM

• Purchasing

• Effective practices, failure analysis, lessons learned

AIRs reporting is confidential and subject to a non-disclosure agreement (NDA), but the following hypothetical case study illustrates how AIRs information can be applied to improve an organization’s operations and effectiveness.

USING THE AIRs DATA IN OPERATIONS: CASE STUDY
A hypothetical “Site X” is installing a popular model of uninterruptible power supply (UPS) modules.

The facility engineer decides to research equipment incident reports for any useful information to help the site prepare for a smooth installation and operation of these critical units.

Figure 1. The page where members regularly go to submit an AIR. The circled link takes you directly to the “AIR Search” page.

Figure 2. The AIR Search page is the starting point for Abnormal Incident research. The page is structured to facilitate broad searches but includes user-friendly filters that permit efficient and effective narrowing of the desired search results.

The facility engineer searches the AIRs database using specific filter criteria (see Figures 1 and 2), looking for any incidents within the last 10 years (2005-2015) involving the specific manufacturer and model where there was a critical load loss. The database search returns seven incidents meeting those criteria (see Figure 3).
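The AIRs search itself is a web form, but purely to illustrate the filtering logic described here (manufacturer/model, a 2005-2015 window, critical load loss), the sketch below applies the same criteria to a hypothetical local export of incident records. The field names and records are invented and are not the AIRs schema.

# Hypothetical sketch: applying the same filter criteria used on the AIR Search
# page to a local export of incident records. Fields and records are invented.
from datetime import date

incidents = [
    {"id": 101, "manufacturer": "VendorA", "model": "UPS-9000", "date": date(2008, 3, 14), "critical_load_loss": True},
    {"id": 102, "manufacturer": "VendorA", "model": "UPS-9000", "date": date(2003, 7, 2), "critical_load_loss": True},
    {"id": 103, "manufacturer": "VendorB", "model": "UPS-550", "date": date(2011, 1, 20), "critical_load_loss": False},
]

def search(records, manufacturer, model, start, end, load_loss_only=True):
    """Return records matching manufacturer/model, the date window, and the load-loss flag."""
    return [
        r for r in records
        if r["manufacturer"] == manufacturer
        and r["model"] == model
        and start <= r["date"] <= end
        and (r["critical_load_loss"] or not load_loss_only)
    ]

hits = search(incidents, "VendorA", "UPS-9000", date(2005, 1, 1), date(2015, 12, 31))
print(f"{len(hits)} incident(s) match")   # only incident 101 falls inside the window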

Figure 3. The results page of our search for incidents within the last 10 years (2005-2015) involving a specific manufacturer/model where there was a critical load loss. We selected the first result returned for further analysis.

Figure 4. The overview page of the abnormal incident report selected for detailed analysis.

The first incident report on the list (see Figure 4) reveals that the unit involved was built in 2008. A ventilation fan failed in the unit (a common occurrence for UPS modules of any manufacturer/model). Replacing the fan required technicians to implement a UPS maintenance bypass, which qualifies as a site configuration change. At this site, vendor personnel were permitted to perform site configuration changes. The UPS vendor technician was working in concert with one of the facility’s own operations engineers but was not being directly supervised (observed) at the time the incident occurred; he was out of the line of sight.

Social scientist Brené Brown said, “Maybe stories are just data with a soul.” If so, the narrative portion of each report is where we find the soul of the AIR (see Figure 5). Drilling down into the story (Description, Action Taken, Final Resolution, and Synopsis) reveals what really happened, how the incident played out, and what the site did to address any issues. The detailed information found in these reports offers the richest value that can be mined for current and future operations. Reviewing this information yields insights and cautions and points toward prevention and solution steps that can be taken to avoid (or respond to) a similar problem at other sites.

Figure 5. The detail page of the abnormal incident report selected for analysis. This is where the “story” of our incident is told.

In this incident, the event occurred when the UPS vendor technician opened the output breaker before bringing the module to bypass, causing a loss of power and dropping the load. This seemingly small but crucial error in communication and timing interrupted critical production operations—a downtime event.

Back-up systems and safeguards, training, procedures, and precautions, detailed documentation, and investments in redundant equipment—all are in vain the moment there is a critical load loss. The very rationale for the site having a UPS was negated by one error. However, this site’s hard lesson can be of use if other data center operators learn from the mistake and use this example to shore up their own processes, procedures, and incident training. Data center operators do not have to witness an incident to learn from it; the AIRs database opens up this history so that others may benefit.

As the incident unfolded, the vendor quickly reset the breaker to restore power, as instructed by the facility technician. Subsequently, to prevent this type of incident from happening in the future, the site:

• Created a more detailed method of procedure (MOP) for UPS maintenance

• Placed warning signs near the output breaker

• Placed switch tags at breakers

• Instituted a process improvement that now requires the presence of two technicians, an MOP supervisor and an MOP performer, with both technicians required to verify each step

These four steps are Uptime Institute-recommended practices for data center operations. However, this narrative raises the question of how many sites have actually followed through on each of these elements, checked and double-checked them, and drilled their teams to respond to an incident like this. Today’s operators can use this incident as a cautionary tale to shore up efforts in all relevant areas: operational documentation creation/improvement, planned maintenance process development/improvement, training, and PM program improvement.

Operational Documentation Creation/Improvement
In this hypothetical, the site added content and detail to its MOP for UPS maintenance. This can inspire other sites to review their UPS bypass procedures to determine if there is sufficient content and detail.

The consequences of having too little detail are obvious. Having too much content can also be a problem if it causes a technician to focus more on the document than on the task.

The AIR in this hypothetical did not say whether facility staff followed an emergency operating procedure (EOP), so there is not enough information to say whether they handled the event correctly. This exact event may never happen in this manner again, but anyone who has been around data centers long enough knows that UPS output breakers can and will fail in a variety of unexpected ways. All sites should examine their EOP for an unexpected failure or trip of a UPS output breaker.

In this incident, the technician reclosed the breaker immediately, which is an understandable human reaction in the heat of the moment. However, this was probably not the best course of action. Full system start-up and shutdown should be orderly affairs, with IT personnel fully informed, if not present as active participants. A prudent EOP might require recording the time of the incident, following an escalation tree, gathering white space data, and confirming redundant equipment status, along with additional steps before undertaking a controlled, fully scripted restart.

Another response to failure was the addition of warning signs and improved equipment labeling as improvements to the facility’s site configuration procedures (SCPs). This change can motivate other sites to review their nomenclature and signage. Some sites include a document that gives expected or steady state system/equipment information. Other sites rely on labeling and warning signs or tools like stickers or magnets located beside equipment to indicate proper position. If a site has none of these safeguards in place, then assessment of this incident should prompt the site team to implement them.

Examining AIRs can provide specific examples of potential failure points, which can be used by other sites as a checklist of where to improve policies. The AIRs data can also be a spur to evaluate whether site policies match practices and ensure that documented procedures are being followed.

Planned Maintenance Process Improvement
After this incident, the site that submitted the AIR incident report changed its entire methodology for performing procedures. Now two technicians must be present, each with strictly defined roles: one technician reads the MOP and supervises the process, and the second technician verifies, performs, and confirms. Both technicians must sign off on the proper and correct completion of the task. It is unclear whether there was a change in vendor privileges.

When reviewing AIRs as a learning and improvement tool, facilities teams can benefit by implementing measures that are not already in place, or any improvements they determine they would make if a similar incident occurred at their site. For example, a site may conclude that configuration changes should be reserved only for those individuals who:

• Have a comprehensive understanding of site policy

• Have completed necessary site training

• Have more at stake for site performance and business outcomes

Training
The primary objective of site training is to increase adherence to site policy and knowledge of effective mission critical facility operations. Incorporating information gleaned from AIRs analysis helps maximize these benefits. Training materials should be geared to ensure that technicians are fully qualified to apply their skills and abilities to operate the installed infrastructure within a mission critical environment, not to certify electricians or mechanics. In addition, training provides an opportunity for professional development and interdisciplinary education among the operations team, which can help enterprises retain key personnel.

The basic components of an effective site-training program are an instructor, scheduled class times that can be tracked by student and instructor, on-the-job training (OJT), reference material, and a metric(s) for success.

With these essentials in place, the documentation and maintenance process improvement steps derived from the AIR incident report can be applied immediately for training. Newly optimized SOPs/MOPs/EOPs can be incorporated into the training, as well as process improvements such as the two-person rule. Improved documentation can be a training reference and study material, and improved SCPs will reduce confusion during OJT and daily rounds. Training drills can be created directly from real-world incidents, with outcomes not just predicted but also chronicled from actual events. Trainer development is enhanced by the involvement of an experienced technician in the AIR review process and creation of any resulting documentation/process improvement.

Combining AIRs data with existing resources enables sites to take a systematic approach to personnel training, for example:

1. John Doe is an experienced construction electrician who was recently hired. He needs UPS bypass training.

2. Jane Smith is a facility operations tech/operating engineer with 10 years of experience as a UPS technician. She was instrumental in the analysis of the AIRs incident and consequent improvements in the UPS bypass procedures and processes; she is the site’s SME in this area.

3. Using a learning management system (LMS) or a simple spreadsheet, John Doe’s training is scheduled (a minimal tracking sketch follows this list):

• Scheduled Class: UPS bypass procedure discussion and walk-through

• Instructor: Jane Smith

• Student: John Doe

• Reference material: the new and improved UPS BYPASS SOP XXXX_20150630, along with the EOP and SCP

• Metrics might include:

o Successful simulation of procedure as a performer

o Successful simulation of procedure as a supervisor

o Both of the above

o Successful completion of procedure during a PM activity

o Success at providing training to another technician
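For sites that would rather track such assignments in code than in an LMS or spreadsheet, the minimal Python sketch below shows one way to record the example above. The class date, field names, and helper class are hypothetical illustrations, not part of any Uptime Institute tooling; the entries simply mirror the basic program components of instructor, student, schedule, reference material, and success metrics.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class TrainingAssignment:
    course: str
    instructor: str
    student: str
    scheduled: date
    references: list[str] = field(default_factory=list)
    metrics: dict[str, bool] = field(default_factory=dict)  # metric name -> satisfied?

    def complete(self) -> bool:
        # Complete only when at least one metric is defined and all are satisfied.
        return bool(self.metrics) and all(self.metrics.values())

ups_bypass = TrainingAssignment(
    course="UPS bypass procedure discussion and walk-through",
    instructor="Jane Smith",
    student="John Doe",
    scheduled=date(2015, 7, 15),  # hypothetical class date
    references=["UPS BYPASS SOP XXXX_20150630", "EOP", "SCP"],
    metrics={
        "simulation as performer": False,
        "simulation as supervisor": False,
        "completion during a PM activity": False,
    },
)

ups_bypass.metrics["simulation as performer"] = True
print(ups_bypass.complete())  # False until every defined metric is satisfied

The complete() check echoes the sign-off idea behind the two-person rule: the assignment is not closed until every defined success metric has been met and recorded.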

Drills for both new trainees and seasoned personnel are important. Because an AIRs-based training exercise is drawn from an actual event, not an imaginary scenario, it carries greater credibility and validates the real risks. Anyone who has led a team drill has probably encountered that one participant who questions the premise of a procedure or suggests a different approach. Far from being a roadblock to effective drills, such a participant is actively engaged and can contribute to program improvement by helping to create drills and assess AIRs scenarios.

PM Program Improvement
The goal of any PM program is to prevent the failure of equipment. The incident detailed in the AIR incident report was triggered by a planned maintenance event, a UPS fan replacement. Typically, a fan replacement requires that the system be put on bypass, as do annual PM procedures. Since any change of equipment configuration, such as changing a fan, introduces risk, it is worth asking whether predictive/proactive fan replacement performed during PM makes more sense than awaiting fan failure. The risk of configuration change must be weighed against the risk of inaction.

Figure 6. Our incident was not caused by UPS Fan Failure, but occurred as a result of human error during its replacement. So how many AIRs involving fan failures for our manufacturer/model of UPS exist within our database? Figure 6 shows the filters we chose to obtain this information.

Examining this and similar incidents in the AIRs database yields information about UPS fan life expectancy that can be used to develop an “evidence-based” replacement strategy. Start by searching the AIRs database for the keyword “fan” using the same dates, manufacturer, and model criteria, with no filter for critical load loss (see Figure 6). This search returns eight reports with fan failure (see Figure 7). The data show that the average life span of the units with fan failure was 5.5 years. Given the limited sample size, this result should not be relied on in isolation, but the experience at other sites can help guide planning. Relaxing the search criteria can return additional data.

Figure 7. A sample of the results (showing 10 of 25 results reports) returned from the search described in Figure 6. Analysis of these incidents may help us to determine and develop the best strategy for cooling fan replacement.
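As a rough illustration of this kind of “evidence-based” analysis, the Python sketch below averages unit age at fan failure across a handful of incident records. The records and individual ages are invented stand-ins for AIRs search results (chosen so that they average to the 5.5-year figure quoted above), and the 0.8 planning margin is an arbitrary assumption; a real decision should also weigh vendor guidance and site history.

from statistics import mean

# Each record is a hypothetical stand-in for one AIRs search result.
fan_incidents = [
    {"id": "AIR-0101", "unit_age_years": 4.5},
    {"id": "AIR-0102", "unit_age_years": 6.0},
    {"id": "AIR-0103", "unit_age_years": 5.5},
    {"id": "AIR-0104", "unit_age_years": 6.0},
]

average_life = mean(rec["unit_age_years"] for rec in fan_incidents)
print(f"Average unit age at fan failure: {average_life:.1f} years")

# One possible planning heuristic: schedule proactive replacement with a
# margin below the observed average, pending review against vendor guidance.
replacement_interval = round(average_life * 0.8, 1)
print(f"Candidate proactive replacement interval: {replacement_interval} years")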

Additional Incidents Yield Additional Insight
The initial database search at the start of this hypothetical case returned a result of seven AIRs total. What can we learn from the other six? Three of the remaining six reports involved capacitor failures. At one site, the capacitor was 12 years old, and the report noted, “No notification provided of the 7-year life cycle by the vendor.” Another incident occurred in 2009 and involved a capacitor with a 2002 manufacture date, which would match (perhaps coincidentally) a 7-year life cycle. The third capacitor failure was in a 13-year-old piece of equipment, and the AIR notes that it was “outside of 4–5-year life cycle.” These results highlight the importance of having an equipment/component life-cycle replacement strategy. The AIRs database is a great starting point.
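A component life-cycle check of the kind these capacitor incidents argue for might look like the Python sketch below; the component names, in-service dates, and life-cycle figures are hypothetical placeholders rather than values taken from the AIRs reports.

from datetime import date
from typing import Optional

# component -> (date placed in service, vendor-stated life cycle in years)
components = {
    "UPS capacitor bank A": (date(2002, 6, 1), 7),
    "UPS capacitor bank B": (date(2011, 3, 15), 7),
}

def overdue(in_service: date, life_years: int, today: Optional[date] = None) -> bool:
    """Return True if the component has exceeded its stated life cycle."""
    today = today or date.today()
    age_years = (today - in_service).days / 365.25
    return age_years > life_years

for name, (in_service, life_years) in components.items():
    status = "REPLACE" if overdue(in_service, life_years) else "OK"
    print(f"{name}: in service since {in_service}, {life_years}-year life cycle -> {status}")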

A fourth AIR describes a driver board failure in a 13-year-old UPS. Driver board failure could fall into any of the AIR root-cause types. Examples of insufficient maintenance might include a case where the maintenance performed was limited in scope and did not consider end of life. Perhaps there was no procedure to diagnose equipment for a condition or measurement indicative of component deterioration, or maybe maintenance frequency was insufficient. Without further data in the report, it is hard to draw an actionable insight, but the analysis does raise several important topics for discussion regarding the status of a site’s preventive and predictive maintenance approach. A fifth AIR involves an overload condition resulting from flawed operational documentation. The lesson there is obvious.

The last of the remaining six reports resulted from a lightning strike that made it through the UPS and interrupted the critical load. Other sites might want to check transient voltage surge suppressor (TVSS) integrity during rounds. With approximately 138,000,000 lightning strikes per year worldwide, any data center can be hit. A site can implement an EOP that dictates checking summary alarms, ensuring redundant equipment integrity, performing a facility walk-through by space priority, and providing an escalation tree with contact information.

Each of the AIRs casts light on the types of shortfalls and gaps that can be found in even the most capably run facilities. With data centers made up of vast numbers of components and systems operating in a complex interplay, it can be difficult to anticipate and prevent every single eventuality. AIRs may not provide the most definitive information on equipment specifications, but assessing these incidents provides an opportunity for other sites to identify potential risks and plan how to avoid them.

PURCHASING/EQUIPMENT PROCUREMENT DECISIONS
In addition to the operational uses described above, AIRs information can also support effective procurement. However, as with using almost any type of aggregated statistics, one should be cautious about making broad assumptions based on the limited sample size of the AIRs database.

For example, a search using simply the keywords ‘fan failure’ and ‘UPS’ could return 50 incidents involving Vendor A products and five involving Vendor B’s products (or vice versa). This does not necessarily mean that Vendor A has a UPS fan failure problem; the higher incident count could simply reflect a significant market share advantage and correspondingly larger installed base for Vendor A.
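A quick worked example, sketched in Python below, shows why raw counts can mislead. The incident counts follow the example above, while the installed-base figures are hypothetical assumptions added for illustration; once counts are normalized, the vendor with more reported incidents can turn out to have the lower per-unit failure rate.

# Incident counts from the example above; installed-base figures are assumed.
reported_incidents = {"Vendor A": 50, "Vendor B": 5}
installed_units = {"Vendor A": 10_000, "Vendor B": 500}  # hypothetical footprints

for vendor, count in reported_incidents.items():
    rate = count / installed_units[vendor]
    print(f"{vendor}: {count} incidents, {rate:.2%} of installed units affected")

# Under these assumptions, Vendor A reports ten times as many incidents
# yet has half the per-unit failure rate (0.50% vs. 1.00%).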

Further, one must be wary of jumping to conclusions regarding manufacturing defects. For example, the first AIR incident report made no mention of how the UPS may (or may not) have been designed to help mitigate the risk of the incident. Some UPS modules have HMI (human machine interface) menu-driven bypass/shutdown procedures that dictate action and provide an expected outcome indication. These equipment options can help mitigate the risk of such an event but may also increase the unit cost. Incorporating AIRs information as just one element in a more detailed performance evaluation and cost-benefit analysis will help operators accurately decide which unit will be the best fit for a specific environment and budget.

LEARNING FROM FAILURES
If adversity is the best teacher, then every failure in life is an opportunity to learn, and that certainly applies in the data center environment and other mission critical settings. The value of undertaking failure analysis and applying lessons learned to continually develop and refine procedures is what makes an organization resilient and successful over the long term.

To use an example from my own experience, I was working one night at a site when the operations team was transferring the UPS to maintenance bypass during PM. Both the UPS output breaker and the UPS bypass breaker were in the CLOSED position, and they were sharing the connected load. The MOP directed personnel to visually confirm that the bypass breaker was closed and then directed them to open the UPS output breaker. The team followed these steps as written, but the critical load was dropped.

Immediately, the team followed the EOP steps to stabilize the system. Failure analysis revealed that the breaker had suffered an internal failure; although the handle was in the CLOSED position, the internal contacts were not closed. Further analysis yielded a more detailed picture of events. For instance, the MOP did not require verification of the status of the equipment. Maintenance records also revealed that the failed breaker had passed primary injection testing within the previous year, well within the site-required 3-year period; there was even a dated TEST PASSED sticker on the breaker. Although meticulous compliance with the site’s maintenance standards had eliminated negligence as a root cause, the operational documentation could have required verification of critical component test status as a preliminary step.

Indeed, eliminating key gaps in the procedures would have prevented the incident. As stated, the breaker appeared to be in the closed position as directed, but the team had not followed the load during switching activities (i.e., had not confirmed the transfer of the amperage to the bypass breaker). If we had done so, we would have seen a problem and initiated a back-out of the procedure. Subsequently, these improvements were added to the MOP.

FLASH REPORTS
Flash reports are a particularly useful AIRs service because they provide early warning about incidents identified as immediate risks, with root causes and fixes to help Network members prevent a service interruption. These reports are an important source of timely front-line risk information.

For example, searching the AIRs database for any FLASH AIR since 2005 involving one popular UPS model returns two results. Both reports detailed a rectifier shutdown as a result of faulty trap filter components; the vendor consequently performed a redesign and recommended a replacement strategy. The FLASH report mechanism became a crucial channel for communicating the manufacturer’s recommendation to equipment owners. Receiving a FLASH notification can spur a team to check maintenance records and consult with trusted vendors to ensure that manufacturer bulletins or suggested modifications have been addressed.

When FLASH incidents are reported, Uptime Institute’s AIRs program team contacts the manufacturer as part of its validation and reporting process. Uptime Institute strives for and considers its relationships with OEMs (original equipment manufacturers) to be cooperative, rather than confrontational. All parties understand that no piece of complex equipment is perfect, so the common goal is to identify and resolve issues as quickly and smoothly as possible.

CONCLUSION
It is virtually impossible for an organization’s site culture, procedures, and processes to be so refined that there are no details left unaddressed and no improvements that can be made. There is also a need to beware of hidden disparities between site policy and actual practice. Will a team be ready when something unexpected does go wrong? Just because an incident has not happened yet does not mean it will not happen. In fact, if a site has not experienced an issue, complacency can set in; steady state can get boring. Operators with foresight will use AIRs as opportunities to create drills and get the team engaged with troubleshooting and implementing new, improved procedures.

Instead of trying to ferret out gaps or imagine every possible failure, the AIRs database provides a ready source of real-world incidents to draw from. Using this information can help hone team function and fine tune operating practices. Technicians do not have to guess at what could happen to equipment but can benefit from the lessons learned by other sites. Team leaders do not have to just hope that personnel are ready to face a crisis; they can use AIRs information to prepare for operating eventualities and to help keep personnel responses sharp.

AIRs is much more than a database; it is a valuable tool for raising awareness of what can happen, mitigating the risk that it will happen, and for preparing an operations team for when/if it does happen. With uses that extend to purchasing, training, and maintenance activities, the AIRs database truly is Uptime Institute Network members’ secret weapon for operational success.


Ron Davis

Ron Davis is a Consultant for Uptime Institute, specializing in Operational Sustainability. Mr. Davis brings more than 20 years of experience in mission critical facility operations supporting data center portfolios, in roles including facility manager, management and operations consultant, and central engineering subject matter expert. Mr. Davis manages the Uptime Institute Network’s Abnormal Incident Reports (AIRs) database, performing root-cause and trending analysis of data center outages and near outages to improve industry performance and vigilance.

Reduce Data Center Insurance Premiums

Uptime Institute President Lee Kirby and Stephen Douglas, Risk Control Director for CNA, an insurance and risk control provider for the software and IT services industry, recently coauthored an article for Data Center Knowledge: Lowering Your Data Center’s Exposure to Insurance Claims. In this follow-on, Kirby discusses how companies can reduce insurance premiums by providing insurance providers with an Uptime Institute Tier Certification of Operational Sustainability or Management & Operations Stamp of Approval.

Uptime Institute has provided data center expertise for more than 20 years to mission-critical and high-reliability data centers. It has identified a comprehensive set of evidence-based methods, processes, and procedures at both the management and operations level that have been proven to dramatically reduce data center risk, as outlined in the Tier Standard: Operational Sustainability.

Organizations that apply and maintain the Standard are taking the most effective actions available to protect their investment in infrastructure and systems and reduce the risk of costly incidents and downtime. The elements outlined in the Standard have been developed based on the industry’s most comprehensive database of information about real-world data center incidents, errors, and failures: Uptime Institute’s Abnormal Incident Reporting System (AIRS). Many of the key Standards elements are based on analysis of 20 years of AIRS data collected on thousands of data center incidents, pinpointing causes and contributing factors. The Standards focus on specific behaviors and criteria that have been proven to decrease the risk of downtime.

To assess and validate whether a data center organization is meeting this operating Standard, Uptime Institute administers the industry’s leading operations certifications. These independent, third-party credentials signify that a data center is managed and operated in a manner that will reduce risk and support availability. There are two types of operations credentials:

Tier Certification of Operational Sustainability (TCOS) is for data centers that have been designed and built to meet Tier Topology criteria. Earning a TCOS credential signifies that a data center upholds the most stringent criteria for quality, consistency, and risk prevention in its facility and operations.

The Management & Operations (M&O) Stamp of Approval is for any existing data center that does not have Tier Certification for Design and Construction. The M&O assessment evaluates management, staffing, and procedures independent of topology, and ensures that the facility is being operated to maximize the uptime potential and minimize the risks to the existing infrastructure.

Both credentials are based on the same rigorous Standards for data center operations management, with detailed behaviors and factors that have been shown to impact availability and performance. The Standards encompass all aspects of data center planning, policies and procedures, staffing and organization, training, maintenance, operating conditions, and disaster preparedness. Earning one of these credentials demonstrates to all stakeholders that a data center is following the principles of effective operations and is being managed with transparency following industry best practices.

The process for a data center to receive either TCOS or the M&O Stamp of Approval includes review of each facility’s policies and documentation, but also includes on-site inspections and live demonstrations to verify that critical systems, backups, and procedures are effective—not just on paper but in daily practice. It’s analogous to putting a vehicle operator through a live driving test before issuing a license. These credentials offer the only comprehensive risk assessment in the data center industry, zeroing in on the risk factors that are the most critical.

The data center environment is never static; continuous review of performance metrics and vigilant attention to changing operating conditions are vital. The environment is so dynamic that if policies, procedures, and practices are not revisited on a regular basis, they can quickly become obsolete. Even the best procedures implemented by solid teams are subject to erosion: staff may become complacent, or bad habits may begin to creep in.

Just as ‘good driver’ discounts use an individual’s track record as a reliable indicator of good ongoing behaviors (such as effective maintenance and safe driving habits), periodic data center recertification (biannually at a minimum) provides a key indicator of ongoing effective facility management and operational best practices. Uptime Institute’s data center credentials have built-in expiry periods, with reassessment required at regular intervals.

There is tremendous value for organizations that hold themselves to a consistent set of standards over time, evaluating, fine tuning, and retraining on a routine basis. This discipline creates resiliency, ensuring that maintenance and operations procedures are appropriate and effective, and that teams are prepared to respond to contingencies, prevent errors, and keep small issues from becoming large problems.

Insurance is priced competitively based on the insurer’s assessment of the exposure presented. Data center operations credentials provide the consistent benchmarking of an unbiased third-party review that can be used by service providers at all levels of the data supply chain to demonstrate the quality of the organization’s risk management efforts. This demonstration of risk quality allows infrastructure and service providers to obtain more competitive terms and pricing across their insurance programs.

When data centers obtain the relevant Uptime Institute credential, it results in a level of expert scrutiny unmatched in the industry, giving insurance companies the risk management proof they need. Insurers can validate risk level against a consistent set of reliable Standards. As a result, facilities with good operations, as validated by TCOS or the M&O Stamp of Approval, can benefit from reduced insurance costs. When a data center has a current certification, underwriters can be assured that it has withstood the rigorous evaluation of an unbiased third party, meets globally recognized Standards, and that its management has taken effective steps to maintain uninterrupted performance and mitigate the risk of loss.