After commercial DCIM offerings failed to meet RELX Group’s requirements, the team built its own DCIM tool based on the existing IT Services Management Suite
By Stephanie Singer
What is DCIM? Most people might respond “data center infrastructure management.” However, simply defining the acronym is not enough. The meaning of DCIM differs greatly from one person, and one enterprise, to the next.
At RELX Group and the data centers managed by the Reed Elsevier Technology Services (RETS) team, DCIM is built on top of our configuration management database (CMDB) and provides seamless, automated interaction between the IT and facility assets in our data centers, server rooms, and telco rooms. These assets are mapped to the floor plans and electrical/mechanical support systems within these locations. This arrangement gives the organization the ability to manage its data centers and internal customers in one central location. Additionally, DCIM components are fully integrated with the company’s global change and incident management processes for holistic management and accuracy.
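The article describes this architecture conceptually rather than as a schema. As a rough illustration only (the field names below are hypothetical and not RELX’s actual data model), a configuration item record in a CMDB-backed DCIM might carry both its physical placement and its links to change and incident management:

```python
# Hypothetical sketch of a CMDB configuration item (CI) record in a DCIM built
# on top of a service management suite. Field names are illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ConfigurationItem:
    ci_id: str                       # unique CMDB identifier
    ci_type: str                     # e.g., "server", "PDU", "CRAC", "breaker"
    location: str                    # data center, server room, or telco room
    grid_position: Optional[str]     # floor-plan coordinate, e.g., "AJ-14"
    rack_unit: Optional[int]         # position in the rack elevation, if racked
    business_owner: str              # internal customer responsible for the asset
    fed_from: List[str] = field(default_factory=list)       # upstream power CIs
    open_changes: List[str] = field(default_factory=list)    # change request IDs
    open_incidents: List[str] = field(default_factory=list)  # incident ticket IDs

# Because facility and IT assets share one CMDB, a single lookup can answer:
# what is in this location, who owns it, and is any work or alarm active on it?
```

The point of the single-source approach is that the floor plan, the asset record, and the change/incident tickets all reference the same CI, so nothing has to be reconciled between separate tools.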
A LITTLE HISTORY
So what does this all mean? Like many of our colleagues, RETS has consulted with several DCIM providers over the years. Many of them enthusiastically promised a solution for our every need, creating excitement at the prospect of a state-of-the-art system for a reasonable cost. But, as we all know, the devil is in the details.
I grew up in this space and have personally been on this road since the early 1980s. Those of you who have been on the same journey will likely remember using the Hitachi Tiger tablet with AutoCAD to create equipment templates. We called it “hardware space planning” at the time, and it was a huge improvement over the template cutouts from IBM, which we had to move around manually on an E-size drawing.
Many things have changed over the last 30 years, including the role of vendors in our operation. Utilizing DCIM vendors to help achieve infrastructure goals has become a common practice in many companies, with varying degrees of success. This has most certainly been a topic at every Uptime Institute Symposium as peers shared what they were doing and sought better go-forward approaches and planning.
OUR FIRST RUN AT PROCURING DCIM
Oh how I coveted the DCIM system my friend was using for his data center portfolio at a large North America-based retail organization. He and his team helpfully provided real-world demos and resource estimates to help us construct the business case, which was approved, including the contract resources for a complete DCIM installation. We were able to implement the power linking for our assets in all our locations. Unfortunately, the resource dollars did not enable us to achieve full implementation of the cable management. I suspect many of you reading this had similar experiences.
Nevertheless, we had the basic core functionality and a DCIM system that met our basic needs. We ignored inefficiencies, even though it took too many clicks to accomplish the tasks at hand and we still had various groups tracking the same information in different spreadsheets and formats across the organization.
About 4-1/2 years after implementation, our vendor reported the system we were using would soon be at end of life. Of course, they said, “We have a new, bigger, better system. Look at all the additional features and integration you can have to expand and truly have everything you need to run your data centers, manage your capacities, and drive costs out.”
Digging deeper, we found that driving cost out was not truly attainable when we balanced the costs of fully collating the data, building the integration with other systems and processes, and maintaining the data and scripting that was required. With competing priorities and the drive for cost efficiencies, we went back to the drawing board and opened the DCIM search to other providers once again.
STARTING OVER WITH DCIM MARKET RESEARCH
The DCIM providers we looked at had all the physical attributes tied to the floor and rack space, cable management, asset life cycle, capacity planning, and various levels for the electrical/mechanical infrastructure. These tools all integrated with power quality and building management and automation systems, each varying slightly in their approach and data output. And many vendors offered bi-directional data movement from the service management tool suite and CMDB.
But our findings revealed potential problems such as duplicate and out-of-sync data. This was unacceptable. We wanted all the features our DCIM providers promised without suffering poor data quality. We also wanted the DCIM to fully integrate with our change and incident management systems so we could look holistically at potential root causes of errors. We wanted to see where the servers were located across the data center, if they were in alarm, when maintenance was occurring, and whether a problem was resolved. We wanted the configuration item attributes for maintenance, end of life, contract renewals, procedures, troubleshooting guidelines, equipment histories, business ownerships, and relationships to be 100% mapped globally.
RETS has always categorized data center facilities as part of IT, separate from Corporate Real Estate. Therefore, all mechanical and electrical equipment within our data center and server rooms are configuration items (CI) as well. This includes generators, switchgear, uninterruptible power systems (UPS), power distribution units (PDU), remote power panels (RPP), breakers, computer room air conditioners (CRAC), and chillers. Breaking away from the multiple sources we had been using for different facility purposes greatly improved our overall grasp on how our facility team could better manage our data centers and server rooms.
NEW PARADIGM
This realization caused us to ask ourselves: What if we flipped the way DCIM is constructed? What if it is built on top of the Service Management Suite, so it is truly a full system that integrates floor plans, racks, and power distribution with the assets within the CMDB? Starting with this thought, we aggressively moved forward to completely map the DCIM system we currently had in place and customize our existing Service Management Suite.
We started this journey in October 2014 and followed a Scrum software development process. Scrum is an iterative and incremental agile software development methodology. The two-week sprints and the constant feedback on delivered, useful functionality were keys to this approach, making it easy to adapt quickly to changes in understanding.
Another important part of Scrum is to have a complete team set up with a subject matter expert (SME) to serve as the product owner to manage the backlog of features. Other team members included the Service Management Suite tool expert to design forms and tables, a user interface (UI) expert to design the visualization and a Scrum master to manage the process. A web service expert joined the team to port the data from the existing DCIM system into the CMDB. All these steps were critical; however, co-locating the team with a knowledgeable product owner to ensure immediate answers and direction to questions really got us off and running!
We created our wish list of requirements, prioritizing those that enabled us to move away from our existing DCIM system.
Interestingly enough, we soon learned from our vendor that the end of life for our current 4-1/2 year-old DCIM system would be extended because of the number of customers that remained on that system. Sound familiar? The key was to paint the vision of what we wanted and needed it to be, while pulling a creative and innovative group of people together to build a roadmap for how we were going to get there.
It was easy to stay focused on our goals. We avoided scope creep by aligning our requirements with the capabilities that our existing tool provided. The requirements and capabilities that aligned were in scope. Those that didn’t were put on a list for future enhancements. Clear and concise direction!
The most amazing part was how many advantages using our Service Management Suite provided. We were getting all the configuration data linked, with confidence in its accuracy. This created excitement across the team, and the “wish” list of feature requests grew immensely! In addition to working on documented requests, our creative and agile team came back with several ideas for features we had not initially contemplated but that made great business sense. Interestingly enough, many of these items came so easily that, by the time we went live on our new system, we had a significantly more advanced tool with the automation features we had leveraged.
FEATURES
Today we can pull information by drilling down on a floor plan to a device that enables us to track the business owner, equipment relationships, application-to-infrastructure mapping, and application dependencies. This information allows us to really understand the impacts of adding, moving, modifying, or decommissioning within seconds. It provides real-time views for our business partners when power changes occur, when maintenance is scheduled, and when a device alarm is in effect (see Figure 1).
Figure 1. DCIM visualization
The ability to tie in CIs to all work scheduled through our Service Management Suite change requests and incidents provides a global outlook on what is being performed within each data center, server room, and telco room, and guarantees accuracy and currency. We turned all electrical and mechanical devices into CIs and assigned them physical locations on our floor plans (see Figure 2).
Figure 2. Incident reporting affected CI
Facility work requests are incorporated into the server install and decommissioning workflow. Furthermore, auto discovery alerts us to devices that are new to the facility, so we are aware if something was installed outside of process.
If an employee needs to add a new model to the floor plan we have a facility management form for that process, where new device attributes can be specified and created by users within seconds.
Facility groups can modify floor plans directly from the visualization providing dynamic updates to all users, Operations can monitor alarming and notifications 24×7 for any CI, and IT teams can view rack elevations for any server rack or storage array (see Figure 3).
Power management, warranty tracking, maintenance, hardware dependencies, procedures, equipment relationships, contract management, and equipment histories can all be actively maintained within one system.
Figure 3. Front and rear rack elevation
Speaking of power management, our agile team was able to create an exact replica of our electrical panel schedules inside of the DCIM without losing any of the functionality we had in Excel. This included the ability to update current power draw for each individual breaker, redundant failover calculation, alarming, and relationship creation from PDU to breaker to floor-mount device.
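As a rough sketch of the kind of logic such a panel-schedule replica performs, the snippet below shows per-breaker load tracking, a redundant-failover check, and a simple alarm threshold. The function names, the 80% derating margin, and the alarm threshold are illustrative assumptions, not RELX’s implementation:

```python
# Illustrative panel-schedule logic: breaker load, A/B failover check, alarming.
# All names and thresholds are assumptions for the sake of example.

def breaker_utilization(amps_drawn: float, breaker_rating_amps: float) -> float:
    """Return the breaker load as a fraction of its rating."""
    return amps_drawn / breaker_rating_amps

def redundant_failover_ok(load_a_kw: float, load_b_kw: float,
                          pdu_capacity_kw: float, derate: float = 0.8) -> bool:
    """In an A/B feed design, either PDU must be able to carry both loads
    (within an assumed 80% derating margin) if its partner fails."""
    return (load_a_kw + load_b_kw) <= pdu_capacity_kw * derate

def breaker_alarm(amps_drawn: float, breaker_rating_amps: float,
                  threshold: float = 0.8) -> bool:
    """Flag a breaker whose measured draw exceeds the alarm threshold."""
    return breaker_utilization(amps_drawn, breaker_rating_amps) > threshold

# Example: two feeds carrying 35 kW each fail over safely only if one
# 100 kW PDU can absorb the full 70 kW within its derated capacity.
print(redundant_failover_ok(load_a_kw=35, load_b_kw=35, pdu_capacity_kw=100))
```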
Oh and by the way, iPad capability is here…. Technicians can update information as work is being performed on the floor, allowing Operations to know when a change request is actually in process and what progress has been made. And, 100% automation is in full effect here! Technicians can also bring up equipment procedures to follow along the way as these are tied to CIs using the Service Management Suite knowledge articles.
Our Service Management Suite is fully integrated with Active Directory, so we can associate personnel with the individual site locations they manage. Self-service forms are also in place where users can add, modify, or delete vendor information for specific sites.
The Service Management Suite has also been integrated with our Real Estate management tool to bring in remote floor plans, site contacts, and resource usage for each individual location. Pulling power consumption per device at remote sites is also standardized, based on a determined estimate, to assist with consolidation efforts.
The self-service items include automated life cycle forms that allow users to actively track equipment adds, modifications, and removals, while also providing the functionality to correlate CIs together in order to form relationships from generator to UPS to PDU to breaker to rack to power strip to rack-mount device (see Figure 4).
Figure 4. Self-service facility CI management form
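The power-path relationships created through these forms can be pictured as a simple chain from each device back to its power source. The sketch below, with made-up CI names, shows how such a chain can be traced once the relationships exist; it is an illustration of the idea, not the actual tool:

```python
# Minimal sketch of the generator-to-device relationship chain, modeled as a
# child -> parent ("fed from") map. CI names are invented for illustration.

FED_FROM = {
    "server-0142": "powerstrip-A12",
    "powerstrip-A12": "rack-A12",
    "rack-A12": "breaker-PDU3-07",
    "breaker-PDU3-07": "pdu-3",
    "pdu-3": "ups-B",
    "ups-B": "generator-2",
}

def power_path(ci: str) -> list:
    """Walk the relationship chain from a device up to its source of power."""
    path = [ci]
    while ci in FED_FROM:
        ci = FED_FROM[ci]
        path.append(ci)
    return path

# Tracing one rack-mount device back to its generator:
print(" -> ".join(power_path("server-0142")))
# server-0142 -> powerstrip-A12 -> rack-A12 -> breaker-PDU3-07 -> pdu-3 -> ups-B -> generator-2
```

Once these links are in the CMDB, impact analysis for a change request on, say, a PDU becomes a simple downstream traversal of the same map.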
Functionality for report creation on any CI and associated relationships has been implemented for all users. Need to determine where to place a server? There’s a report that can be run for that as well!
The streamlined processes allow users to maintain facility and hardware CIs with ease and truly provide a 100% grasp on the activity occurring within our data centers on a daily basis.
I am quite proud of the small, but powerful, team that went on this adventure with us. As the leader of this initiative, I found it refreshing to see the idea start with a manager, who worked with a key developer to build workflows and, from there, turned DCIM on its head.
We pulled the hardware, network, and facility teams together with five amazing part-time developers for the “what if” brainstorm session and the enthusiasm continued to explode. It was truly amazing to observe this team. Within 30 days, we had a prototype that was shared with senior management and stakeholders who fully supported the effort and the rest is history!
It’s important to note that, for RETS, we have a far superior tool set for a fraction of the cost of other DCIM tools. Because it is embedded in our Service Management Suite, we avoid additional maintenance, software, and vendor services costs… win, win, win!
Our DCIM is forever evolving: we have so far surpassed the requirements we originally set that we thought, “Why stop now?” Continuing our journey will bring service impact reports and alarming, incorporating our existing power monitoring application and building automation system, which will enhance our ability to include the remote-location CIs we manage. With all the advances we are able to make using our own system, I am looking forward to more productivity than ever before, and more than we can imagine right now!
Stephanie Singer joined Reed Elsevier, now known as RELX Group, in 1980 and has worked for Mead Data Central, LexisNexis, and Reed Elsevier–Technology Services during her career. She is currently the vice president of Global Data Center Services. In this role, she is responsible for global data center and server room facilities, networks, and cable plant infrastructure for the products and applications within these locations. She leads major infrastructure transformation efforts. Ms. Singer has led the data center facilities team since 1990, maintaining an excellent data center availability record throughout day-to-day operations and numerous lifecycle upgrades to the mechanical and electrical systems. She also led the construction and implementation of a strategic backup facility to provide in-house disaster recovery capability.
Industry data from Uptime Institute and 451 Research show a rapid rate of cloud computing adoption among enterprise IT departments. Organizations weigh cloud benefits and risks and evaluate how cloud will impact their existing and future data center infrastructure investments. In this video, Uptime Institute COO Julian Kudritzki and Andrew Reichman, Research Director at 451 Research, discuss how much risk, and how much reward, is on the table for companies considering a cloud transition.
While some organizations are making a “Tear down this data center” wholesale move to cloud computing, the vast majority of cloud adopters are getting there on a workload-by-workload basis–carefully evaluating their portfolio of workloads and applications and identifying the best cloud or non-cloud venue to host each.
The decision process is based on multiple considerations, including performance, integration issues, economics, competitive differentiation, solution maturity, risk tolerance, regulatory compliance considerations, skills availability, and partner landscape.
Some of the most important of these considerations when deciding whether to put a workload or application in the cloud include:
1. Know which applications impact competitive advantage, and which don’t.
You might be able to increase competitive advantage by operating critical differentiating applications better than peer companies do, but most organizations will agree that back office functions like email or payroll, while important, don’t really set a company apart. As mature SaaS (software as a service) options for these mundane, but important, business functions have emerged, many companies have decided that a cloud application that delivers a credible solution at a fair cost is good enough. Offloading the effort for these workloads can free up valuable time and effort for customization, optimization, and innovation around applications that drive real competitive differentiation.
2. Workloads with highly variable demand see the biggest benefit from cloud.
Public cloud was born to address the big swings in demand seen in the retail world. If you need thousands of servers around the Christmas shopping spree, public cloud IaaS (infrastructure as a service) makes them available when you need them and lets you return them after the New Year when demand settles down. Any workload with highly variable demand can see obvious benefits from running in the cloud, so long as the architecture supports load balancing and changing the number of servers working on the job, known as scale-out design. Knowing which applications fit this bill and what changes could be made to cloud-enable other applications will help to identify those that would see the biggest economic benefit from a move to cloud.
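As a rough illustration of the scale-out economics described above, the sketch below sizes an instance fleet to current demand rather than to peak. All numbers (per-server capacity, headroom, request rates) are hypothetical:

```python
# Minimal scale-out sizing sketch: the fleet grows for peak demand and shrinks
# when demand settles, which is where the cloud economics come from.
import math

def servers_needed(requests_per_sec: float, capacity_per_server: float = 500.0,
                   headroom: float = 1.2, minimum: int = 2) -> int:
    """Return the instance count for the current load plus a safety margin."""
    return max(minimum, math.ceil(requests_per_sec * headroom / capacity_per_server))

print(servers_needed(120_000))   # holiday peak: 288 instances
print(servers_needed(8_000))     # off-peak: 20 instances
```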
3. Cloud supports trial and error without penalty.
Cloud gives users an off switch for IT resources, allowing easy changes to application architecture. Find out that a different server type or chipset or feature is needed after the initial build-out? No problem, just move it to something else and see how that works. The flexibility of cloud resources lends itself very well to experimentation in finding the perfect fit. If you’re looking for a home for an application that’s been running well for years, you might find that keeping it on-premises will be cheaper and less disruptive than moving it to the cloud.
4. Big looming investments can present opportunities for change.
Cloud can function as an effective investment avoidance strategy when organizations face big bills for activities like data center build-out or expansion, hardware refresh, software upgrade, staff expansion, or outsourcing contract renewal. When looking at big upcoming expenditures, it’s a great time to look at how offloading selected applications to cloud might reduce or eliminate the need for that spend. Once the investments are made, they become sunk costs and will likely make the business case for cloud transition much less attractive.
5. Consider whether customization is required or if good enough is good enough.
Is this an application that isn’t too picky in terms of server architecture and advanced features, or is it an app that requires specific custom hardware architectures or configurations to run well? If you have clearly understood requirements that you must keep in place, a cloud provider might not give you exactly what you need. On the other hand, if your internal IT organization struggles to keep pace with the latest and greatest, or if your team is still determining the optimal configuration, cloud could give you more flexibility to experiment with a wider range of options than you have access to internally, given that your scale of operations is smaller than that of the mega-scale cloud providers.
6. Conversion to cloud native architectures can be difficult but rewarding in the long run.
One benefit of cloud is renting instead of buying, with advantages in terms of scaling up and down at will and letting a service provider do the work. A separate benefit comes from the use of cloud native architectures. Taking advantage of things like API-controlled infrastructure, object storage, micro-services, and server-less computing requires switching to cloud-friendly applications or modifying legacy apps to use cloud design principles. If you have plans to switch or modify applications anyway, think about whether you would be better served running these applications in house or whether it would make sense to use that inflection point to move to something hosted by a third party. If your organization runs mostly traditional apps and has no intention of taking on major projects to cloud-enable them, you should know that the options and benefits of forklifting them unchanged to the cloud will be limited.
7. Be honest about what your company is good at and what it is not.
If cloud promises organizations the ability to get out of the business of mundane activities such as racking and stacking gear, refreshing and updating systems, and managing facilities, it’s important to start the journey with a clear and honest assessment of what your company does well and what it does not do well. If you have long-standing processes to manage efficiency, reliability, and security, have relatively new facilities and equipment, and an IT staff that is good at keeping it all running, then cloud might not offer much benefit. If there are areas where things don’t go so smoothly or you struggle to get everything done with existing headcount, cloud could be a good way to get better results without taking on more effort or learning major new skills. On the other hand, managing a cloud environment requires its own specialized skills, which can make it hard for an unfamiliar organization to jump in and realize benefits.
8. Regulatory issues have a significant bearing on the cloud decision.
Designing infrastructure solutions that meet regulations can be tremendously complicated. It’s good to know upfront if you face regulations that explicitly prevent public cloud usage for certain activities, or if your legal department interprets those regulations as such, before wasting time evaluating non-starter solutions. That said, regulations generally impact some workloads and not others, and in many cases, are specific to certain aspects of workloads such as payment or customer or patient identity information. A hybrid architecture might allow sensitive data to be kept in private venues, while less sensitive information might be fine in public cloud. Consider that after a public cloud solution has seen wide acceptance for a regulated workload, there may be more certainty that that solution is compliant.
9. Geography can be a limiting factor or a driver for cloud usage.
If you face regulations around data sovereignty and your data has to be physically stored in Poland, Portugal, or Panama (or anywhere else on the globe), the footprint of a cloud provider could be a non-starter. On the flip side, big cloud providers are likely already operating in far more geographies than your enterprise. This means that if you need multiple sites for redundancy and reliability, require content delivery network (CDN) capabilities to reach customers in a wide variety of locations, or want to expand into new regions, the major cloud providers can extend your geographic reach without major capital investments.
10. The unfamiliar may only seem less secure.
Public cloud can go either way in terms of being more or less secure than what’s provisioned in an enterprise data center. Internal infrastructure has more mature tools and processes associated with it and enjoys wider familiarity among enterprise administrators, and the higher level of transparency associated with owning and operating facilities and gear allows organizations to get predictable results and enjoy a certain level of comfort from sheer proximity. That said, cloud service providers operate at far greater scale than individual enterprises and therefore employ more security experts and gain more experience from addressing more threats than do single companies. Also, building for shared tenancy can drive service providers to lock down data across the board, compared to enterprises that may have carried over vulnerabilities from older configurations, which may become issues as the user base or feature set of workloads changes. Either way, a thorough assessment of vulnerabilities in enterprise and service provider facilities, infrastructure, and applications is critical to determine whether cloud is a good or bad option for you.
Andrew Reichman is a Research Director for cloud data within the 451 Research Voice of the Enterprise team. In this role, he designs and interprets quarterly surveys that explore cloud adoption in the overall enterprise technology market.
Prior to this role, he worked at Amazon Web Services, leading their efforts in marketing infrastructure as a service (IaaS) to enterprise firms worldwide. Before this, he spent six years as a Principal Analyst at Forrester Research, covering storage, cloud and datacenter economics. Prior to his time at Forrester, Andrew was a consultant with Accenture, optimizing datacenter environments on behalf of EMC. He holds an MBA in finance from the Foster School of Business at the University of Washington and a BA in History from Wesleyan University.
In July 2016, Yuval Bachar, principal engineer, Global Infrastructure Architecture and Strategy at LinkedIn announced Open19, a project spearheaded by the social networking service to develop a new specification for server hardware based on a common form factor. The project aims to standardize the physical characteristics of IT equipment, with the goal of cutting costs and installation headaches without restricting choice or innovation in the IT equipment itself.
Uptime Institute’s Matt Stansberry discussed the new project with Mr. Bachar following the announcement. The following is an excerpt of that conversation:
Please tell us why LinkedIn launched the Open19 Project.
We started Open19 with the goal that any IT equipment we deploy would be able to be installed in any location, such as a colocation facility, one of our owned sites, or sitting in a POP, in a standardized 19-inch rack environment. Standardizing on this form factor significantly reduces the cost of integration.
We aren’t the kind of organization that has one type of workload or one type of server. Data centers are dynamic, with apps evolving on a weekly basis. New demands for apps and solutions require different technologies. Open19 provides the opportunity to mix and match server hardware very easily without the need to change any of the mechanical or installation aspects.
Different technologies evolve at a different pace. We’re always trying a variety of different servers—from low-power multi core machines to very high end, high performance hardware. In an Open19 configuration, you can mix and match this equipment in any configuration you want.
In the past, if you had a chassis full of blade servers, they were super-expensive, heavy, and difficult to handle. With Open19, you would build a standardized chassis into your cage and fit servers from various suppliers into the same slot. This provides a huge advantage from a procurement perspective. If I want to replace one blade today, the only opportunity I have is to buy from a single server supplier. With Open19, I can buy a blade from anybody that complies. I can have five or six proposals for a procurement instead of just one.
Also, by being able to blend high performance and high power with low-power equipment in two adjacent slots in the brick cage in the same rack, you can create a balanced environment in the data center from a cooling and power perspective. It helps avoid hot spots.
I recall you mentioned that you may consider folding Open19 into the Facebook-led Open Compute Project (OCP), but right now that’s not going to happen. Why launch Open19 as a standalone project?
The reason we don’t want to fold Open19 under OCP at this time is that there are strong restrictions around how innovation and contributions are routed back into the OCP community.
IT equipment partners aren’t willing to contribute IP and innovation. The OCP solution wasn’t enabling the industry—when organizations create things that are special, OCP requires you to expose everything outside of your company. That’s why some of our large server partners couldn’t join OCP. Open19 defines the form factor, and each server provider can compete in the market and innovate in their own dimension.
What are your next steps and what is the timeline?
We will have an Open19-based system up and running in the middle of September 2016 in our labs. We are targeting late Q1 of 2017 to have a variety of servers installed from three to four suppliers. We are considering Open19 as our primary deployment model, and if all of the aspects are completed and tested, we could see Open19 in production environments after Q1 2017.
The challenge is that this effort has multiple layers. We are working with partners to do the engineering development. And that is achievable. A secondary challenge is the legal side. How do you create an environment that the providers are willing to join? How do we do this right this time?
But most importantly, for us to be successful we have to have significant adoption from suppliers and operators.
It seems like something that would be valuable for the whole industry—not just the hyperscale organizations.
Enterprise groups will see an opportunity to participate without having to be 100,000-server data center companies. Many enterprise IT groups have expressed reluctance about OCP because the white box servers come with limited support and warranty levels.
It will also lower their costs, increase the speed of installations and changes, and improve the positions of business negotiations on procurement.
If we come together, we will generate enough demand to make Open19 interesting to all of the suppliers and operators. You will get the option to choose.
Yuval Bachar is a principal engineer on the global infrastructure and strategy team at LinkedIn, where he is responsible for the company’s strategy for data center architecture and the implementation of its mega-scale future data centers. In this capacity, he drives and supports new technology development, architecture, and collaboration to support the tremendous growth in LinkedIn’s future user base, data centers, and services. Prior to LinkedIn, Mr. Bachar held IT leadership positions at Facebook, Cisco, and Juniper Networks.
FORTRUST regards management and operations as a core competency that helps it win new clients and control capital and operating expenses
Shortly after receiving word that FORTRUST had earned Uptime Institute’s Tier Certification for Operational Sustainability (Gold) for Phase 7 of its Denver data center, Rob McClary, the company’s executive vice president and general manager, discussed the importance of management and operations and how FORTRUST utilizes Uptime Institute’s certifications to maintain and improve its operations and gain new clients.
Q: Please tell me about FORTRUST and its facilities.
FORTRUST is a multi-tenant data center (MTDC) colocation services provider. Our main data center is a 300,000-square-foot facility in Denver, and we have been operating as a privately owned company since 2001, providing services for Fortune 100, 500, and 1000 companies.
Q: You recently completed a new phase. How is it different than earlier phases?
The big change is that we started to take a prefabricated approach to construction for phase 6, and instead of traditional raised floor we went to data modules, effectively encapsulating the customers’ IT environments, which increased our per rack densities and subsequent efficiencies both from a cooling and a capital standpoint.
One of the biggest trends in the industry that needs a course correction is that data centers are not being allowed to evolve as they need to. We keep trying to build the same data centers over and over. The engineers and general contractors and even the vendors just want to do what they have always done. And I think the data centers are going to have to become capital efficient. When we made that change to a modular approach, we reduced our capital outlay and started getting an almost instantaneous return on the capital.
Q: How did you ensure that these changes would not reduce your availability?
For Phase 6a, we were trying to get used to the whole modular approach in a colocation environment. We had Uptime Institute review our designs. As a result, we earned the Tier Certification of Design Documents.
We planned to go further and do the Tier Certification of Constructed Facility as well, but Uptime Institute helped us determine that it would be better to do the Tier Certification of Constructed Facility in an upcoming phase (7) because we already had live customers in phase 6A. And, as you know, about half the customers in a colo facility have at least one piece of single-corded equipment in their IT environment. To address this, we worked with Uptime Institute consultants to adapt what Uptime Institute normally does during Tier Certification of Constructed Facility when there are usually no live customers. And we pursued Uptime Institute’s M&O Stamp of Approval.
This helped us understand the whole modular approach and how we would approach these circumstances going forward. At the same time we became one of the earlier Uptime Institute M&O Stamp of Approval sites.
Q: Why has FORTRUST adopted Uptime Institute’s Operations assessments and certifications?
We are big believers that design is only a small part of uptime and reliability. We focus 90% of our effort on management and operations, risk mitigation, and process discipline, doing things in a manner that prevents human error. That’s really what’s allowed us to achieve continuous uptime and have high customer satisfaction for such a long time.
I’d say we’re known for our operations and a level of excellence, so the Tier Certification of Operational Sustainability validated things that we have been doing for many years. It also allowed us to take a look at what we do from a strategy and tactical standpoint by essentially giving us a third-party look into what we think is important and what someone else might think is important.
We’ve got a lot of former military folks here, a lot of former Navy, and that background may influence our approach. The military conducts a lot of inspections and audits. You get used to it, and it becomes your chance to shine. So the Tier Certification of Operational Sustainability allows our people to show what they do, how they do it, and the pride they take in doing it. It gives them validation of how well they are doing, and it emphasizes why it is important.
The process re-emphasizes why it is so important to have your operational strategies and your tactics aligned in a harmonious fashion. A lot of people in the data center industry get bogged down in checklists and best practices and in trying to use them to compare data centers, and about 50 percent of it is noise, which means tactics without strategy. If you have your strategies in place and your tactics are aligned with your strategies, that’s much more powerful than trying to incorporate 100 best practices in your day-to-day ops. Doing 50 things very well is better than doing 100 things halfway.
Q: Did previously preparing for the M&O Stamp of Approval help you prepare for the Tier Certification of Operational Sustainability?
Absolutely. One reason we scored so well on the Tier Certification of Operational Sustainability was that we looked at our M&O 3 years ago and implemented the suggested improvements right away, and we were comfortable because we’ve had those things in place for years.
Q: What challenges did you face? You make it sound easy.
The biggest challenge for us during both the Tier Certification Constructed Facility and Tier Certification of Operational Sustainability was coordinating with Uptime Institute in a live colo environment with shared systems that weren’t specific to one phase. It was pretty much like doing surgery on a live body.
We were confident in what we were doing. Obviously the Tier Certification of Operational Sustainability is centric around documentation, and we’ve been known for process discipline and our procedural compliance for over 14 years of operations. It’s our cornerstone; it’s what we do very well. We literally don’t do anything without a procedure in hand.
We think the design is the design. Once you build the data center and infrastructure, after that it is all about management and operations, so management, operations, and risk mitigation are what will give you a long track record of success.
At the end of the day, if the biggest cause of outages in data centers is human error, why wouldn’t we put more emphasis on understanding why that happens and how to avoid it? To me, that’s what the M&O and Tier Certification of Operational Sustainability are all about.
Q: It’s obvious you consider procedures a point of differentiation in the market. How do you capitalize on this difference to new customers?
We show it to them. Part of our sales cycle includes taking the customer through the due diligence that they want to do and what we think they need to do. We make available hundreds of our documented procedures. We show them how we manage them. When we take a potential customer through our data center, it’s a historical story that we put together that starts with reliability, risk mitigation, business value and customer service.
Customers can not only hear it and see the differences, but they can also feel it. If you have been in a lot of data centers, you can walk through the MEP or colo areas and maybe in 10-15 minutes, you can tell if there’s a difference in the management and operations philosophy. It’s quite apparent.
That’s always been our call to action. It’s really educating the customer on why the management and ops are so important. We put a lot of emphasis and resource in mitigating and eliminating human error. We devote probably 80-90% of our time in training, process discipline, and procedural compliance. That’s how we operate day to day.
Q: What are the added costs?
Actually this approach has less cost. I would challenge anyone who is outsourcing most of their maintenance and operations and even management because we’re doing it cheaper and we’re doing more aggressive predictive and preventive maintenance than most any data center. It’s really what we call an operational mindset and that can rarely be outsourced. Your personnel have to own it.
We don’t have people coming in to clean our data centers. Our folks do it. We do the majority of the maintenance in the facility, and the staff owns it.
We don’t do a lot of corrective maintenance. Corrective maintenance costs on the order of 10 times the cost of a comprehensive preventive and predictive maintenance program.
I can show proof because we have been operating for 15 years now. I would dare anyone to tell me which parts of that data center, or which of our substations, switchgear, or other equipment components, are 15 years old and which are the new ones. It would be hard to tell.
I think there are too many engineering firms and GCs that try to influence the build in a manner that isn’t necessary. Like I said, they try to design around human error instead of spending time preventing it.
Rob McClary
Rob McClary is executive vice president and general manager at FORTRUST Data Centers. Since joining FORTRUST in 2001, he has held the critical role of building the company into a premier data center services provider and colocation facility. Mr. McClary is responsible for the overall supervision of business operations, high-profile construction, and strategic technical direction. He developed and implemented the process controls and procedures that support the continuous uptime and reliability that FORTRUST Denver has delivered for more than 14 years.
Saudi Aramco’s Exploration and Petroleum Engineering Computer Center (ECC) is a three-story data center built in 1982. It is located in Dhahran, Kingdom of Saudi Arabia. It provides computing capability to the company’s geologists, geophysicists, and petroleum engineers to enable them to explore, develop, and manage Saudi Arabia’s oil and gas reserves. Transitioning the facility from mainframe to rack-mounted servers was just the first of several transitions that challenged the IT organization over the last three decades. More recently, Saudi Aramco reconfigured the legacy data center to a Cold Aisle/Hot Aisle configuration, increasing rack densities to 8 kilowatts per rack (kW/rack) from 3 kW/rack in 2003 and nearly doubling capacity. Further increasing efficiency, Saudi Aramco also sealed openings around and under the computer racks, cooling units, and the computer power distribution panel in addition to blanking unused rack space.
The use of computational fluid dynamics (CFD) simulation software to manage the hardware deployment process enabled Saudi Aramco to increase the total number of racks and rack density in each data hall. Saudi Aramco used the software to analyze various proposed configurations prior to deployment, eliminating the risk of trial and error.
In 2015 one of the ECC’s five data halls was modified to accommodate a Cold Aisle Containment System. This installation supports the biggest single deployment so far in the data center, 124 racks of high performance computers (HPC) with a total power demand of 994 kW. As a result, the data hall now hosts 219 racks on a 10,113-square-foot (940-square-meter) raised floor. To date, the data center hall has not experienced any temperature problems.
Business Drivers
Increasing demand by ECC customers requiring the deployment of IT hardware and software technology advances necessitated a major reconfiguration in the data center. Each new configuration increased the heat that needed to be dissipated from the ECC. At each step, several measures were employed to mitigate potential impact to the hardware, ensuring safety and reliability during each deployment and project implementation.
For instance, Saudi Aramco developed a hardware deployment master plan based on a projected life cycle and refresh rate of 3–5 years to transition to the Cold Aisle/Hot Aisle configuration. This plan allows for advance planning of space and power source allocation with no compromise to existing operation as well as fund allocation and material procurement (see Figures 1 and 2).
Figure 1. Data center configuration (current day and master plan)
Figure 2. Data center plan view
Because of the age of the building and its construction methodology, the company’s engineering and consulting department was asked to evaluate the building structure based on the initial master plan. This department determined the maximum weight capacity of the building structure, which was used to establish the maximum rack weight to avoid compromising structural stability.
In addition, the engineering and consulting department evaluated the chilled water pipe network and determined the maximum number of cooling units to be deployed in each data hall, based on maximum allowable chilled water flow. Similarly, the department determined the total heat to be dissipated per Hot Aisle to optimize the heat rejection capability of the cooling units. The department also determined the amount of heat to be dissipated per rack to ensure sufficient cooling as per the manufacturer’s recommendation.
Subsequently, facility requirements based on these limiting factors were shared with the technology planning team and IT system support. The checklist includes maximum weight, rack dimensions, and the requirement for blanking panels and sealing technologies to prevent air mixing.
Other features of the data center include:
A 1.5-foot (ft) [0.45 meter (m)] raised floor containing chilled water supply and return pipes for the CRAH units, cable trays for network connectivity, sweet water line for the humidifier, liquid-tight flexible conduits for power, and computer power system (CPS) junction boxes
A 9-ft (2.8 m) ceiling height
False ceilings
Down-flow chilled water computer room air handling (CRAH) units
CRAH units located at the end of each Hot Aisle
Perforated floor tiles (56% open with manually controlled dampers)
No overhead obstructions
Total data center heat load of 1,200 kW
Total IT load of 1,084 kW, which is constant for all three models
Sealed cable penetrations (modeled at 20% leakage)
The 42U cabinets in the ECC have solid sides and tops, with 64% perforated front and rear doors on each cabinet. Each is 6.5 ft high by 2 ft wide by 3.61 ft deep (2 m by 0.6 m by 1.10 m) and weighs 1,874 pounds (850 kilograms). Rack density ranges from 6.0 to 8.0 kW. The total nominal cooling capacity is 1,582 kW from 25 18-ton computer room air conditioning (CRAC) units.
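As a quick cross-check of the stated cooling capacity, using the common conversion of 1 ton of refrigeration ≈ 3.517 kW (an assumption; the article does not state the conversion it used):

```python
# Cross-check: 25 CRAC units x 18 tons each, at ~3.517 kW per ton.
TON_TO_KW = 3.517
crac_units = 25
tons_per_unit = 18

total_cooling_kw = crac_units * tons_per_unit * TON_TO_KW
print(round(total_cooling_kw))   # ~1,583 kW, consistent with the 1,582 kW cited
```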
Modeling Software
In 2007, Saudi Aramco commissioned the CFD modeling software company to prepare baseline models for all the data halls. The software is capable of performing transient analysis, which suits the company’s requirement. The company uses the modeling software to simulate proposed hardware deployments, investigate deployment scenarios, and identify any stranded capacity. The modeling company developed several simulations based on different hardware iterations of the master plan to help establish the final hardware master plan, with each 16-rack Hot Aisle not exceeding a 125-kW heat load and no rack exceeding 8 kW. After the modeling software company completed the initial iterations, Saudi Aramco acquired a perpetual license and support contract for the CFD simulation software in January 2010.
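A deployment-planning check against those limits might look like the following sketch. The limits (125 kW per 16-rack Hot Aisle, 8 kW per rack) come from the article; the rack loads and function name are invented for illustration:

```python
# Illustrative validation of a proposed Hot Aisle loading against the limits
# derived from the CFD iterations. Rack values below are made-up examples.
AISLE_LIMIT_KW = 125.0
RACK_LIMIT_KW = 8.0
MAX_RACKS_PER_AISLE = 16

def aisle_within_limits(rack_loads_kw: list) -> bool:
    """Check one Hot Aisle's proposed loading against all three constraints."""
    return (len(rack_loads_kw) <= MAX_RACKS_PER_AISLE
            and sum(rack_loads_kw) <= AISLE_LIMIT_KW
            and all(load <= RACK_LIMIT_KW for load in rack_loads_kw))

proposed_aisle = [7.5, 8.0, 6.0, 7.0, 8.0, 7.5, 6.5, 7.0,
                  8.0, 7.5, 6.0, 7.0, 8.0, 7.5, 6.5, 7.0]
print(sum(proposed_aisle), aisle_within_limits(proposed_aisle))  # 115.0 True
```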
Saudi Aramco finds that the CFD simulation software makes it easier to identify and address heat stratification, recirculation, and even short-circuiting of cool air. By identifying the issues in this way, Saudi Aramco was able to take several precautionary measures and improve its capacity management procedures, including increasing cooling efficiency and optimizing load distribution.
Temperature and Humidity Monitoring System
With the CFD software simulation results at hand, the facilities management team looked for other means to gather data for use in future cooling optimization simulations while validating the results of CFD simulations. As a result, the facilities management group decided to install a temperature and humidity monitoring system. The initial deployment was carried out in 2008, with the monitoring of subfloor air supply temperature and hardware entering temperature.
At that time, three sensors were installed in each Cold Aisle for a total of six sensors. The sensors were positioned at each end of the row and in the middle, at the highest point of the racks. Saudi Aramco chose these points to get a better understanding of the temperature variance (∆T) between the subfloor and the highest rack inlet temperature. Additionally, Saudi Aramco uses this data to monitor and ensure that all inlet temperatures are within the recommended ranges of ASHRAE and the manufacturer.
The real-time temperature and humidity monitoring system enabled the operation and facility management team to monitor and document unusual and sudden temperature variances allowing proactive responses and early resolution of potential cooling issues. The monitoring system gathers data that can be used to validate the CFD simulations and for further evaluation and iteration.
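A simplified sketch of the kind of check this monitoring enables is shown below. The 18–27°C band reflects ASHRAE’s published recommended inlet envelope, while the 5°C ∆T alert threshold, sensor naming, and function are assumptions for illustration rather than details from the ECC deployment:

```python
# Illustrative inlet-temperature check: compare a rack inlet reading against an
# ASHRAE-style recommended band and against the subfloor supply temperature.
ASHRAE_RECOMMENDED_C = (18.0, 27.0)   # recommended inlet envelope
MAX_DELTA_T_C = 5.0                   # assumed alert threshold, subfloor to inlet

def check_inlet(sensor_id: str, subfloor_c: float, inlet_c: float) -> list:
    """Return a list of alert strings for one sensor reading."""
    alerts = []
    low, high = ASHRAE_RECOMMENDED_C
    if not (low <= inlet_c <= high):
        alerts.append(f"{sensor_id}: inlet {inlet_c:.1f} C outside {low}-{high} C")
    if inlet_c - subfloor_c > MAX_DELTA_T_C:
        alerts.append(f"{sensor_id}: delta-T {inlet_c - subfloor_c:.1f} C exceeds {MAX_DELTA_T_C} C")
    return alerts

# A reading within the recommended band can still flag excessive stratification:
print(check_inlet("CA-03-mid", subfloor_c=16.0, inlet_c=24.5))
```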
The Prototyping
The simulation models identified stratification, short circuiting, and recirculation issues in the data halls, which prompted the facilities management team to develop more optimization projects, including a containment system. In December 2008, a prototype was installed in one of the Cold Aisles (see Figure 3) using ordinary plastic sheets, like refrigerated-room doors, and Plexiglass sheets on an aluminum frame. Saudi Aramco monitored the resulting inlet and core temperatures using the temperature and humidity monitoring system and internal system monitors prior to, during, and upon completion of installation to ensure no adverse effect on the hardware. The prototype was observed over the course of three months with no reported hardware issues.
Figure 3. Prototype containment system
Following the successful installation of the prototype, various simulation studies were further conducted to ensure the proposed deployment’s benefit and savings. In parallel, Saudi Aramco looked for the most suitable materials to comply with all applicable standards, giving prime consideration to the safety of assets and personnel and minimizing risk to IT operations.
Table 1. Installation dates
When the Cold Aisle was contained, Saudi Aramco noticed considerable improvement in the overall environment. Containment improved cold air distribution by eliminating hot air mixing with the supply air from the subfloor, so that air temperature at the front of the servers was close to the subfloor supply temperature. With cooler air entering the hardware, the core temperature was vastly improved, resulting in lower exhaust and return air temperatures to the cooling units. As a result, the data hall was able to support more hardware.
Material Selection and Cold Aisle Containment System installation
From 2009 to 2012, the facility management team evaluated and screened several products. It secured and reviewed the material data sheets and submitted them to the Authority Having Jurisdiction (AHJ) for evaluation and concurrence. Each of the solutions would require some modifications to the facility before being implemented. The facility management team evaluated and weighed the impact of these modifications as part of the procurement process.
Of all the products, one stood out from the rest; its easy-to-install, transparent material not only addressed safety but also eliminated the need to modify the existing infrastructure, which translated to considerable savings in project execution time and money.
Movement in and out of the aisle is easy and safe as people can see through the doors and walls. Additionally, the data hall lighting did not need to be modified since it was not obstructed. Even the fire suppression system was not affected since it has a fusible link and lanyard connector. The only requirement by AHJ prior to deployment was additional smoke detectors in the Cold Aisle itself.
To comply with this requirement, an engineering work order was raised for the preparation of the necessary design package for the modification of the smoke detection system. After completing the required design package including certification from a chartered fire protection engineer as mandated by the National Fire Protection Association (NFPA), it was established that four smoke detectors were to be relocated and an additional seven smoke detectors installed in the data hall.
Implementation and challenges
Optimizations and improvements always come with challenges; the reconfiguration process necessitated close coordination between the technology planning team, IT system support, ECC customers, the network management group, Operations, and facility management. These teams had to identify hardware that could be decommissioned without impacting operations, prepare temporary spaces for interim operations, and then take the decommissioned hardware out of the data hall, allowing the immediate deployment of new hardware in Cold Aisle/Hot Aisle. Succeeding deployments follow the master plan, allowing the complete realignment process to be completed in five years.
Installation of the Cold Aisle Containment System did not come without challenges; all optimization activities, including relocating luminaires that were in the way of the required smoke detectors, had to be completed with no impact to system operations. To meet this requirement, ECC followed a strict no work permit–no work procedure; work permits are countersigned by operation management staff on duty during issuance and prior to close out. This enabled close monitoring of all activities within the data halls, ensuring safety and no impact to daily operation and hardware reliability. Additionally, a strict change management documentation process was utilized and adhered to by the facility management team and monitored by operation management staff; all activities within the data halls have to undergo a change request approval process.
Operations management and facility management worked hand in hand to overcome these challenges. Operations management, working in three shifts, closely monitored the implementation process, especially after regular working hours. Continuous coordination between contractors, vendors, operation staff, and facility management team enabled smooth transition and project implementations eliminating any showstoppers along the way.
Summary
The simulation comparison in Figure 4 clearly shows the benefits of the Cold Aisle Containment System. Figure 4a shows hot air recirculating around the end of the rows and mixing with the cold air supply to the Cold Aisles. In Figure 4b, mixing of hot and cold air is considerably reduced with the installation of the 14 Cold Aisle containment systems. The Cold Aisles are better defined and clearly visible in the figures, with less hot air recirculation, but the three rows without containment still suffer from recirculation. In Figure 4c, the Cold Aisles are far better defined, and hot air recirculation and short circuiting are reduced. Additionally, the exhaust air temperature from the hardware has dropped considerably.
Figure 4a. Without Cold Aisle Containment
Figure 4b. With current Cold Aisle Containment (14 of 17 aisles)
Figure 4c. With full Cold Aisle Containment
Figures 5–11 show that the actual power and temperature readings taken from the sensors installed in the racks validated the simulation results. As shown in Figures 5 and 6, the power draw of the racks in Aisles 1 and 2 fluctuated while the corresponding entering and leaving temperatures were maintained. In Week 40, the temperature even dropped slightly despite a slight increase in the power draw. The same can also be observed in Figure 7. All these aisles are fitted with a Cold Aisle Containment System.
Figure 5. Actual Power utilization, entering temperature, and leaving temperature Aisle 01 (installed on July 30, 2015 – week 31)
Figure 6. Aisle 2 (installed on July 28, 2015 – week 31)
Figure 7. Aisle 3 (installed on March 7, 2015 – week 10)
Figure 8. Aisle 6 (installed on April 09, 2015 – week 15)
Figure 9. Aisle 7a (a and b) and Aisle 7b (c and d)
Figure 10. Aisle 8 (installed on February 28, 2015 – week 09)
Figure 11. Aisle 17 (no Cold Aisle installed)
Additionally, Figure 11 shows slightly higher entering and leaving temperatures, as well as fluctuations in the temperature readings that coincided with the power draw fluctuations of the racks within the aisle. This aisle has no containment.
The installation of the Cold Aisle Containment System greatly improved the overall cooling environment of the data hall (see Figure 12). Eliminating hot and cold air mixing and short circuiting allowed for more efficient cooling unit performance and cooler supply and leaving air. Return air temperature readings in the CRAH units were also monitored and sampled in Figure 12, which shows the actual return air temperature variance as a result of the improved overall data hall room temperature.
Figure 12. Computer room air handling (CRAH) unit actual return air temperature graphs
Figure 13. Cold Aisle floor layout
The installation of the Cold Aisle Containment System allows the same data hall to host the company’s MAKMAN and MAKMAN-2 supercomputers (see Figure 14). Both MAKMAN and MAKMAN-2 appear on the June 2015 Top500 Supercomputers list.
Figure 14. Installed Cold Aisle Containment System
Issa Riyani
Issa A. Riyani joined the Saudi Aramco Exploration Computer Center (ECC) in January 1993. He graduated from King Fahad University of Petroleum and Minerals (KFUPM) in Dhahran, Kingdom of Saudi Arabia, with a bachelor’s degree in electrical engineering. Mr. Riyani currently leads the ECC Facility Planning & Management Group and has more than 23 years of experience managing ECC facilities.
Nacianceno L. Mendoza
Nacianceno L. Mendoza joined the Saudi Aramco Exploration Computer Center (ECC) in March 2002. He holds a bachelor of science in civil engineering and has more than 25 years of diverse experience in project design, review, construction management, supervision, coordination, and implementation. Mr. Mendoza spearheaded the design and implementation of the temperature and humidity monitoring system and the deployment of the Cold Aisle Containment System in the ECC.
Data-Driven Approach to Reduce Failures
Operations teams use the Uptime Institute Network’s Abnormal Incident Reports (AIRs) database to enrich site knowledge, enhance preventive maintenance, and improve preparedness
By Ron Davis
The AIRs system is one of the most valuable resources available to Uptime Institute Network members. It comprises more than 5,000 data center incidents and errors spanning two decades of site operations. Using AIRs to leverage the collective learning experiences of many of the world’s leading data center organizations helps Network members improve their operating effectiveness and risk management.
A quick search of the database, using various parameters or keywords, turns up invaluable documentation on a broad range of facility and equipment topics. The results can support evidence-based decision making and operational planning to guide process improvement, identify gaps in documentation and procedures, refine training and drills, benchmark against successful organizations, inform purchasing decisions, fine-tune preventive maintenance (PM) programs to minimize failure risk, help maximize uptime, and support financial planning.
THE VALUE OF AIRs TO OPERATIONS
The philosopher, essayist, poet, and novelist George Santayana wrote, “Those who cannot remember the past are condemned to repeat it.” Records of past data center incidents, errors, and outages can inform operational practices that help prevent future incidents.
All Network member organizations participate in the AIRs program, ensuring a broad sample of incident information from data center organizations of diverse sizes, business sectors, and geographies. The database contains records of data center facility infrastructure incidents and outages from 1994 through the present. This volume of incident data allows for meaningful and extremely valuable analysis of trends and emerging patterns. Annually, Uptime Institute presents aggregated results and analysis of the AIRs database, spotlighting issues from the year as well as current and historical trends.
Going beyond the annual aggregate trend reporting, there is also significant insight to be gained from looking at individual incidents. Detailed incident information is particularly relevant to front-line operators, helping to inform key facility activities including:
• Operational documentation creation or improvement
• Planned maintenance process development or improvement
• Training
• PM program improvement
AIRs reporting is confidential and subject to a non-disclosure agreement (NDA), but the following hypothetical case study illustrates how AIRs information can be applied to improve an organization’s operations and effectiveness.
USING THE AIRs DATA IN OPERATIONS: CASE STUDY
A hypothetical “Site X” is installing a popular model of uninterruptible power supply (UPS) modules.
The facility engineer decides to research equipment incident reports for any useful information to help the site prepare for a smooth installation and operation of these critical units.
Figure 1. The page where members regularly go to submit an AIR. The circled link takes you directly to the “AIR Search” page.
Figure 2. The AIR Search page is the starting point for Abnormal Incident research. The page is structured to facilitate broad searches but includes user friendly filters that permit efficient and effective narrowing of the desired search results.
The facility engineer searches the AIRs database using specific filter criteria (see Figures 1 and 2), looking for any incidents within the last 10 years (2005-2015) involving the specific manufacturer and model where there was a critical load loss. The database search returns seven incidents meeting those criteria (see Figure 3).
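For teams that export incident data into their own tracking tools, the same kind of filtering is easy to reproduce outside the AIRs interface. The following is a minimal sketch only, not part of the AIRs service: the record fields and sample values are invented for illustration, and the real search is performed through the web pages shown in Figures 1 and 2.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    year: int                # year the incident occurred
    manufacturer: str        # UPS manufacturer
    model: str               # UPS model
    critical_load_loss: bool # did the incident drop the critical load?

# Hypothetical sample records standing in for an exported incident log
incidents = [
    Incident(2009, "VendorA", "UPS-500", True),
    Incident(2012, "VendorA", "UPS-500", False),
    Incident(2014, "VendorB", "UPS-900", True),
]

def search(records, manufacturer, model, start_year, end_year, load_loss_only=True):
    """Return incidents matching manufacturer/model within a date range."""
    return [
        r for r in records
        if r.manufacturer == manufacturer
        and r.model == model
        and start_year <= r.year <= end_year
        and (r.critical_load_loss or not load_loss_only)
    ]

matches = search(incidents, "VendorA", "UPS-500", 2005, 2015)
print(f"{len(matches)} incident(s) with critical load loss")
```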
Figure 3. The results page of our search for incidents within the last 10 years (2005-2015) involving a specific manufacturer/model where there was a critical load loss. We selected the first result returned for further analysis.
Figure 4. The overview page of the abnormal incident report selected for detailed analysis.
The first incident report on the list (see Figure 4) reveals that the unit involved was built in 2008. A ventilation fan failed in the unit (a common occurrence for UPS modules of any manufacturer/model). Replacing the fan required technicians to implement a UPS maintenance bypass, which qualifies as a site configuration change. At this site, vendor personnel were permitted to perform site configuration changes. The UPS vendor technician was working in concert with one of the facility’s own operations engineers but was not being directly supervised (observed) at the time the incident occurred; he was out of the line of sight.
Social scientist Brené Brown said, “Maybe stories are just data with a soul.” If so, the narrative portion of each report is where we find the soul of the AIR (Figure 5). Drilling down into the story (Description, Action Taken, Final Resolution, and Synopsis) reveals what really happened, how the incident played out, and what the site did to address any issues. The detailed information found in these reports offers the richest value that can be mined for current and future operations. Reviewing this information yields insights and cautions and points toward steps other sites can take to prevent, or respond to, a similar problem.
Figure 5. The detail page of the abnormal incident report selected for analysis. This is where the “story” of our incident is told.
In this incident, the UPS vendor technician opened the output breaker before bringing the module to bypass, causing a loss of power and dropping the load. This seemingly small but crucial error in communication and timing interrupted critical production operations—a downtime event.
Backup systems and safeguards, training, procedures and precautions, detailed documentation, and investments in redundant equipment—all are in vain the moment there is a critical load loss. The very rationale for the site having a UPS was negated by one error. However, this site’s hard lesson can be of use if other data center operators learn from the mistake and use this example to shore up their own processes, procedures, and incident training. Data center operators do not have to witness an incident to learn from it; the AIRs database opens up this history so that others may benefit.
As the incident unfolded, the vendor quickly reset the breaker to restore power, as instructed by the facility technician. Subsequently, to prevent this type of incident from happening in the future, the site:
• Created a more detailed method of procedure (MOP) for UPS maintenance
• Placed warning signs near the output breaker
• Placed switch tags at breakers
• Instituted a process improvement that now requires the presence of two technicians, an MOP supervisor and an MOP performer, with both technicians required to verify each step
These four steps are Uptime Institute-recommended practices for data center operations. However, this narrative raises the question of how many sites have actually made the effort to follow through on each of these elements, checked and double-checked, and drilled their teams to respond to an incident like this. Today’s operators can use this incident as a cautionary tale to shore up efforts in all relevant areas: operational documentation creation/improvement, planned maintenance process development/improvement, training, and PM program improvement.
Operational Documentation Creation/Improvement
In this hypothetical, the site added content and detail to its MOP for UPS maintenance. This can inspire other sites to review their UPS bypass procedures to determine if there is sufficient content and detail.
The consequences of having too little detail are obvious. Having too much content can also be a problem if it causes a technician to focus more on the document than on the task.
The AIR in this hypothetical did not say whether facility staff followed an emergency operating procedure (EOP), so there is not enough information to say whether they handled it correctly. This event may never happen in this exact manner again, but anyone who has been around data centers long enough knows that UPS output breakers can and will fail in a variety of unexpected ways. All sites should examine their EOP for an unexpected failure or trip of the UPS output breaker.
In this incident, the technician reclosed the breaker immediately, which is an understandable human reaction in the heat of the moment. However, this was probably not the best course of action. Full system start-up and shutdown should be orderly affairs, with IT personnel fully informed, if not present as active participants. A prudent EOP might require recording the time of the incident, following an escalation tree, gathering white space data, and confirming redundant equipment status, along with additional steps before undertaking a controlled, fully scripted restart.
Another response to failure was the addition of warning signs and improved equipment labeling as improvements to the facility’s site configuration procedures (SCPs). This change can motivate other sites to review their nomenclature and signage. Some sites include a document that gives expected or steady state system/equipment information. Other sites rely on labeling and warning signs or tools like stickers or magnets located beside equipment to indicate proper position. If a site has none of these safeguards in place, then assessment of this incident should prompt the site team to implement them.
Examining AIRs can provide specific examples of potential failure points, which can be used by other sites as a checklist of where to improve policies. The AIRs data can also be a spur to evaluate whether site policies match practices and ensure that documented procedures are being followed.
Planned Maintenance Process Improvement
After this incident, the site that submitted the AIR incident report changed its entire methodology for performing procedures. Now two technicians must be present, each with strictly defined roles: one technician reads the MOP and supervises the process, and the second technician verifies, performs, and confirms. Both technicians must sign off on the proper and correct completion of the task. It is unclear whether there was a change in vendor privileges.
When reviewing AIRs as a learning and improvement tool, facilities teams can benefit by implementing measures that are not already in place, or any improvements they determine they would have made had a similar incident occurred at their own site. For example, a site may conclude that configuration changes should be reserved only for those individuals who:
• Have a comprehensive understanding of site policy
• Have completed necessary site training
• Have more at stake for site performance and business outcomes
Training
The primary objective of site training is to increase adherence to site policy and knowledge of effective mission critical facility operations. Incorporating information gleaned from AIRs analysis helps maximize these benefits. Training materials should be geared to ensure that technicians are fully qualified to operate the installed infrastructure within a mission critical environment, not to certify electricians or mechanics. In addition, training provides an opportunity for professional development and interdisciplinary education among the operations team, which can help enterprises retain key personnel.
The basic components of an effective site-training program are an instructor, scheduled class times that can be tracked by student and instructor, on-the-job training (OJT), reference material, and a metric(s) for success.
With these essentials in place, the documentation and maintenance process improvement steps derived from the AIR incident report can be applied immediately for training. Newly optimized SOPs/MOPs/EOPs can be incorporated into the training, as well as process improvements such as the two-person rule. Improved documentation can be a training reference and study material, and improved SCPs will reduce confusion during OJT and daily rounds. Training drills can be created directly from real-world incidents, with outcomes not just predicted but also chronicled from actual events. Trainer development is enhanced by the involvement of an experienced technician in the AIR review process and creation of any resulting documentation/process improvement.
Combining AIRs data with existing resources enables sites to take a systematic approach to personnel training, for example:
1. John Doe is an experienced construction electrician who was recently hired. He needs UPS bypass training.
2. Jane Smith is a facility operations tech/operating engineer with 10 years of experience as a UPS technician. She was instrumental in the analysis of the AIRs incident and consequent improvements in the UPS bypass procedures and processes; she is the site’s SME in this area.
3. Using a learning management system (LMS) or a simple spreadsheet, John Doe’s training is scheduled (a minimal record-keeping sketch follows this list).
• Scheduled Class: UPS bypass procedure discussion and walk-through
• Instructor: Jane Smith
• Student: John Doe
• Reference material: the new and improved UPS BYPASS SOP XXXX_20150630, along with the EOP and SCP
• Metrics might include:
o Successful simulation of procedure as a performer
o Successful simulation of procedure as a supervisor
o Both of the above
o Successful completion of procedure during a PM activity
o Success at providing training to another technician
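Whether the tracking lives in an LMS or a simple spreadsheet, the underlying record is small. The sketch below shows one hypothetical way to capture the class described above in code; the field names and the pass/fail metrics are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingRecord:
    course: str
    instructor: str
    student: str
    references: list                               # SOP/EOP/SCP documents used as study material
    metrics: dict = field(default_factory=dict)    # metric name -> pass/fail

ups_bypass_training = TrainingRecord(
    course="UPS bypass procedure discussion and walk-through",
    instructor="Jane Smith",
    student="John Doe",
    references=["UPS BYPASS SOP XXXX_20150630", "EOP", "SCP"],
)

# Record outcomes as the drills are completed
ups_bypass_training.metrics["simulation as performer"] = True
ups_bypass_training.metrics["simulation as supervisor"] = True
print(all(ups_bypass_training.metrics.values()))  # True once every tracked metric passes
```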
Drills for both new trainees and seasoned personnel are important. Because an AIRs-based training exercise is drawn from an actual event, not an imaginary scenario, it lends greater credibility to the exercise and validates the real risks. Anyone who has led a team drill has probably encountered that one participant who questions the premise of a procedure or suggests a different one. Far from being a roadblock to effective drills, such a participant is actively engaged and can assist in program improvement by helping to create drills and assess AIRs scenarios.
PM Program Improvement
The goal of any PM program is to prevent the failure of equipment. The incident detailed in the AIR was triggered by a planned maintenance event, a UPS fan replacement. Typically, a fan replacement requires the system to be put on bypass, as do annual PM procedures. Since any change of equipment configuration, such as changing a fan, introduces risk, it is worth asking whether predictive/proactive fan replacement performed during PM makes more sense than waiting for a fan to fail. The risk of configuration change must be weighed against the risk of inaction.
Figure 6. Our incident was not caused by UPS fan failure but occurred as a result of human error during the fan’s replacement. So how many AIRs involving fan failures for our manufacturer/model of UPS exist within the database? Figure 6 shows the filters we chose to obtain this information.
Examining this and similar incidents in the AIRs database yields information about UPS fan life expectancy that can be used to develop an “evidence-based” replacement strategy. Start by searching the AIRs database for the keyword “fan” using the same dates, manufacturer, and model criteria, with no filter for critical load loss (see Figure 6). This search returns eight reports with fan failure (see Figure 7). The data show that the average life span of the units with fan failure was 5.5 years. Given the limited sample size, this result should not be treated as definitive, but the experience at other sites can help guide planning. Broader search criteria would return even more data to work with.
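For sites that keep their own component failure logs, the arithmetic behind that kind of average is trivial to reproduce. The sketch below uses invented (incident year, build year) pairs chosen to average 5.5 years; it is an illustration of the calculation only, not actual AIRs data.

```python
# Hypothetical (incident_year, unit_build_year) pairs for reports returned
# by a "fan" keyword search; real AIRs data is confidential.
fan_failures = [(2010, 2004), (2012, 2007), (2013, 2008), (2015, 2009)]

ages_at_failure = [incident - built for incident, built in fan_failures]
average_life = sum(ages_at_failure) / len(ages_at_failure)
print(f"Average unit age at fan failure: {average_life:.1f} years")  # 5.5 years
```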
Figure 7. A sample of the results (showing 10 of 25 results reports) returned from the search described in Figure 6. Analysis of these incidents may help us to determine and develop the best strategy for cooling fan replacement.
Additional Incidents Yield Additional Insight
The initial database search at the start of this hypothetical case returned seven AIRs in total. What can we learn from the other six? Three of the remaining six reports involved capacitor failures. At one site, the capacitor was 12 years old, and the report noted, “No notification provided of the 7-year life cycle by the vendor.” Another incident occurred in 2009 and involved a capacitor with a 2002 manufacture date, which would match (perhaps coincidentally) a 7-year life cycle. The third capacitor failure was in a 13-year-old piece of equipment, and the AIR notes that it was “outside of 4–5-year life cycle.” These results highlight the importance of having an equipment/component life-cycle replacement strategy. The AIRs database is a great starting point.
A fourth AIR describes a driver board failure in a 13-year-old UPS. Driver board failure could fall into any of the AIR root-cause types. Examples of insufficient maintenance might include a case where the maintenance performed was limited in scope and did not consider end of life. Perhaps there was no procedure to check the equipment for a condition or measurement indicative of component deterioration, or maybe the maintenance frequency was insufficient. Without further data in the report it is hard to draw an actionable insight, but the analysis does raise several important topics for discussion regarding the status of a site’s preventive and predictive maintenance approach. A fifth AIR involves an overload condition resulting from flawed operational documentation. The lesson there is obvious.
The last of the remaining six reports resulted from a lightning strike that made it through the UPS and interrupted the critical load. Other sites might want to check transient voltage surge suppressor (TVSS) integrity during rounds. With approximately 138,000,000 lightning strikes per year worldwide, any data center can be hit. A site can implement an EOP that dictates checking summary alarms, ensuring redundant equipment integrity, performing a facility walk-through by space priority, and providing an escalation tree with contact information.
Each of the AIRs casts light on the types of shortfalls and gaps that can be found in even the most capably run facilities. With data centers made up of vast numbers of components and systems operating in a complex interplay, it can be difficult to anticipate and prevent every single eventuality. AIRs may not provide the most definitive information on equipment specifications, but assessing these incidents provides an opportunity for other sites to identify potential risks and plan how to avoid them.
PURCHASING/EQUIPMENT PROCUREMENT DECISIONS
In addition to the operational uses described above, AIRs information can also support effective procurement. However, as with using almost any type of aggregated statistics, one should be cautious about making broad assumptions based on the limited sample size of the AIRs database.
For example, a search using simply the keywords ‘fan failure’ and ‘UPS’ could return 50 incidents involving Vendor A products and five involving Vendor B’s products (or vice versa). This does not necessarily mean that Vendor A has a UPS fan failure problem. The higher number of incidents reported could simply mean that Vendor A has a significant market share advantage and therefore a much larger installed base.
Further, one must be careful of jumping to conclusions regarding manufacturing defects. For example, the first AIR incident report made no mention of how the UPS may (or may not) have been designed to help mitigate the risk of the incident. Some UPS modules have HMI (human machine interface) menu-driven bypass/shutdown procedures that dictate action and provide an expected outcome indication. These equipment options can help mitigate the risk of such an event but may also increase the unit cost. Incorporating AIRs information as just one element in a more detailed performance evaluation and cost-benefit analysis will help operators accurately decide which unit will be the best fit for a specific environment and budget.
LEARNING FROM FAILURES
If adversity is the best teacher, then every failure in life is an opportunity to learn, and that certainly applies in the data center environment and other mission critical settings. The value of undertaking failure analysis and applying lessons learned to continually develop and refine procedures is what makes an organization resilient and successful over the long term.
To use an example from my own experience, I was working one night at a site when the operations team was transferring the UPS to maintenance bypass during PM. Both the UPS output breaker and the UPS bypass breaker were in the CLOSED position, and they were sharing the connected load. The MOP directed personnel to visually confirm that the bypass breaker was closed and then directed them to open the UPS output breaker. The team followed these steps as written, but the critical load was dropped.
Immediately, the team followed EOP steps to stabilization. Failure analysis revealed that the breaker had suffered internal failure; although the handle was in the CLOSED position, the internal contacts were not closed. Further analysis yielded a more detailed picture of events. For instance, the MOP did not require verification of the status of the equipment. Maintenance records also revealed that the failed breaker had passed primary injection testing within the previous year, well within the site-required 3-year period. Although meticulous compliance with the site’s maintenance standards had eliminated negligence as a root cause, the operational documentation could have required verification of critical component test status as a preliminary step. There was even a dated TEST PASSED sticker on the breaker.
Indeed, eliminating key gaps in the procedures would have prevented the incident. As stated, the breaker appeared to be in the closed position as directed, but the team had not followed the load during switching activities (i.e., had not confirmed the transfer of amperage to the bypass breaker). Had we done so, we would have seen the problem and initiated a back-out of the procedure. These verification steps were subsequently added to the MOP.
FLASH REPORTS
Flash reports are a particularly useful AIRs service because they provide early warning about incidents identified as immediate risks, with root causes and fixes to help Network members prevent a service interruption. These reports are an important source of timely front-line risk information.
For example, searching the AIRs database for any FLASH AIR since 2005 involving one popular UPS model returns two results. Both reports detailed a rectifier shutdown as a result of faulty trap filter components; the vendor consequently performed a redesign and recommended a replacement strategy. The FLASH report mechanism became a crucial channel for communicating the manufacturer’s recommendation to equipment owners. Receiving a FLASH notification can spur a team to check maintenance records and consult with trusted vendors to ensure that manufacturer bulletins or suggested modifications have been addressed.
When FLASH incidents are reported, Uptime Institute’s AIRs program team contacts the manufacturer as part of its validation and reporting process. Uptime Institute strives for and considers its relationships with OEMs (original equipment manufacturers) to be cooperative, rather than confrontational. All parties understand that no piece of complex equipment is perfect, so the common goal is to identify and resolve issues as quickly and smoothly as possible.
CONCLUSION
It is virtually impossible for an organization’s site culture, procedures, and processes to be so refined that there are no details left unaddressed and no improvements that can be made. There is also a need to beware of hidden disparities between site policy and actual practice. Will a team be ready when something unexpected does go wrong? Just because an incident has not happened yet does not mean it will not happen. In fact, if a site has not experienced an issue, complacency can set in; steady state can get boring. Operators with foresight will use AIRs as opportunities to create drills and get the team engaged with troubleshooting and implementing new, improved procedures.
Instead of trying to ferret out gaps or imagine every possible failure, the AIRs database provides a ready source of real-world incidents to draw from. Using this information can help hone team function and fine tune operating practices. Technicians do not have to guess at what could happen to equipment but can benefit from the lessons learned by other sites. Team leaders do not have to just hope that personnel are ready to face a crisis; they can use AIRs information to prepare for operating eventualities and to help keep personnel responses sharp.
AIRs is much more than a database; it is a valuable tool for raising awareness of what can happen, for mitigating the risk that it will happen, and for preparing an operations team for when it does happen. With uses that extend to purchasing, training, and maintenance activities, the AIRs database truly is Uptime Institute Network members’ secret weapon for operational success.
Ron Davis
Ron Davis is a Consultant for Uptime Institute, specializing in Operational Sustainability. Mr. Davis brings more than 20 years of experience in mission critical facility operations in various roles supporting data center portfolios, including facility management, management and operations consultant, and central engineering subject matter expert. Mr. Davis manages the Uptime Institute Network’s Abnormal Incident Reports (AIRs) database, performing root-cause and trending analysis of data center outages and near outages to improve industry performance and vigilance.
If You Can’t Buy Effective DCIM, Build It
/in Operations/by Kevin HeslinAfter commercial DCIM offerings failed to meet RELX Group’s requirements, the team built its own DCIM tool based on the existing IT Services Management Suite
By Stephanie Singer
What is DCIM? Most people might respond “data center infrastructure management.” However, simply defining the acronym is not enough. The meaning of DCIM is greatly different for each person and each enterprise.
At RELX Group and the data centers managed by the Reed Elsevier Technology Services (RETS) team, DCIM is built on top of our configuration management database (CMDB) and provides seamless, automated interaction between the IT and facility assets in our data centers, server rooms, and telco rooms. These assets are mapped to the floor plans and electrical/mechanical support systems within these locations. This arrangement gives the organization the ability to manage its data centers and internal customers in one central location. Additionally, DCIM components are fully integrated with the company’s global change and incident management processes for holistic management and accuracy.
A LITTLE HISTORY
So what does this all mean? Like many of our colleagues, RETS has consulted with several DCIM providers over the years. Many of them enthusiastically promised a solution for our every need, creating excitement at the prospect of a state-of-the art system for a reasonable cost. But, as we all know, the devil is in the details.
I grew up in this space and have personally been on this road since the early 1980s. Those of you who have been on the same journey will likely remember using the Hitachi Tiger tablet with AutoCAD to create equipment templates. We called it “hardware space planning” at the time, and it was a huge improvement over the template cutouts from IBM, which we had to move around manually on an E-size drawing.
Many things have changed over the last 30 years, including the role of vendors in our operation. Utilizing DCIM vendors to help achieve infrastructure goals has become a common practice in many companies, with varying degrees of success. This has most certainly been a topic at every Uptime Institute Symposium as peers shared what they were doing and sought better go-forward approaches and planning.
OUR FIRST RUN AT PROCURING DCIM
Oh how I coveted the DCIM system my friend was using for his data center portfolio at a large North America-based retail organization. He and his team helpfully provided real world demos and resource estimates to help us construct the business case that was approved, including the contract resources for a complete DCIM installation. We were able to implement the power linking for our assets in all our locations. Unfortunately, the resource dollars did not enable us to achieve full implementation with the cable management. I suspect many of you reading this had similar experiences.
Nevertheless, we had the basic core functionality and a DCIM system that met our basic needs. We ignored inefficiencies, even though it took too many clicks to accomplish the tasks at hand and we still had various groups tracking the same information in different spreadsheets and formats across the organization.
About 4-1/2 years after implementation, our vendor reported the system we were using would soon be at end of life. Of course, they said, “We have a new, bigger, better system. Look at all the additional features and integration you can have to expand and truly have everything you need to run your data centers, manage your capacities, and drive costs out.”
Digging deeper, we found that driving cost out was not truly obtainable when we balanced the costs of fully collating the data, building the integration with other systems and processes, and maintaining the data and scripting that was required. With competing priorities and the drive for cost efficiencies, we went back to the drawing board and opened the DCIM search to other providers once again.
STARTING OVER WITH DCIM MARKET RESEARCH
The DCIM providers we looked at all tied the physical attributes to floor and rack space, cable management, asset life cycle, capacity planning, and various levels of the electrical/mechanical infrastructure. These tools all integrated with power quality and building management and automation systems, each varying slightly in approach and data output. And many vendors offered bi-directional data movement to and from the service management tool suite and CMDB.
But our findings revealed potential problems such as duplicate and out-of-sync data. This was unacceptable. We wanted all the features our DCIM providers promised without suffering poor data quality. We also wanted the DCIM to fully integrate with our change and incident management systems so we could look holistically at potential root causes of errors. We wanted to see where the servers were located across the data center, if they were in alarm, when maintenance was occurring, and whether a problem was resolved. We wanted the configuration item attributes for maintenance, end of life, contract renewals, procedures, troubleshooting guidelines, equipment histories, business ownerships, and relationships to be 100% mapped globally.
RETS has always categorized data center facilities as part of IT, separate from Corporate Real Estate. Therefore, all mechanical and electrical equipment within our data centers and server rooms are configuration items (CIs) as well. This includes generators, switchgear, uninterruptible power systems (UPS), power distribution units (PDU), remote power panels (RPP), breakers, computer room air conditioners (CRAC), and chillers. Breaking away from the multiple sources we had been using for different facility purposes greatly improved our grasp on how our facility team could better manage our data centers and server rooms.
NEW PARADIGM
This realization caused us to ask ourselves: What if we flipped the way DCIM is constructed? What if it is built on top of the Service Management Suite, so it is truly a full system that integrates floor plans, racks, and power distribution with the assets within the CMDB? Starting with this thought, we aggressively moved forward to completely map the DCIM system we currently had in place and customize our existing Service Management Suite.
We started this journey in October 2014 and followed a Scrum software development process. Scrum is an iterative and incremental agile software development methodology. The 2-week sprints, each delivering useful functionality, and the constant feedback were keys to this approach; it was easy to adapt quickly as our understanding changed.
Another important part of Scrum is to have a complete team, with a subject matter expert (SME) serving as the product owner to manage the backlog of features. Other team members included a Service Management Suite tool expert to design forms and tables, a user interface (UI) expert to design the visualization, and a Scrum master to manage the process. A web service expert joined the team to port the data from the existing DCIM system into the CMDB. All these roles were critical; however, co-locating the team with a knowledgeable product owner to ensure immediate answers and direction really got us off and running!
We created our wish list of requirements, prioritizing those that enabled us to move away from our existing DCIM system.
Interestingly enough, we soon learned from our vendor that the end of life for our current 4-1/2 year-old DCIM system would be extended because of the number of customers that remained on that system. Sound familiar? The key was to paint the vision of what we wanted and needed it to be, while pulling a creative and innovative group of people together to build a roadmap for how we were going to get there.
It was easy to stay focused on our goals. We avoided scope creep by aligning our requirements with the capabilities that our existing tool provided. The requirements and capabilities that aligned were in scope. Those that didn’t were put on a list for future enhancements. Clear and concise direction!
The most amazing part was that using our Service Management Suite provided many advantages. We were linking all the configuration data and gaining confidence in its accuracy. This created excitement across the team, and the “wish” list of feature requests grew immensely! In addition to working on documented requests, our creative and agile team came back with several ideas for features we had not initially contemplated but that made great business sense. Interestingly enough, many of these items came so easily that by the time we went live we had a significantly advanced tool with the automation features we had leveraged.
FEATURES
Today we can pull information by drilling down on a floor plan to a device, which enables us to track the business owner, equipment relationships, application-to-infrastructure mapping, and application dependencies. This information allows us to really understand the impacts of adding, moving, modifying, or decommissioning equipment within seconds. It provides real-time views for our business partners when power changes occur, when maintenance is scheduled, and when a device alarm is in effect (see Figure 1).
Figure 1. DCIM visualization
The ability to tie CIs to all work scheduled through our Service Management Suite change requests and incidents provides a global outlook on what is being performed within each data center, server room, and telco room, and guarantees accuracy and currency. We turned all electrical and mechanical devices into CIs and assigned them physical locations on our floor plans (see Figure 2).
Figure 2. Incident reporting affected CI
Facility work requests are incorporated into the server install and decommissioning workflow. Furthermore, auto discovery alerts us to devices that are new to the facility, so we are aware if something was installed outside of process.
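The article does not describe how the out-of-process alerting is implemented; conceptually, it reduces to a set difference between what auto discovery finds on the network and what the CMDB says should be there. A minimal sketch of that idea, with invented device identifiers:

```python
# Device IDs known to the CMDB vs. found by auto discovery (hypothetical values)
cmdb_devices = {"SRV-0012", "SRV-0013", "PDU-07"}
discovered_devices = {"SRV-0012", "SRV-0013", "PDU-07", "SRV-0099"}

installed_outside_process = discovered_devices - cmdb_devices
if installed_outside_process:
    print(f"Alert: devices on the floor with no CMDB record: {sorted(installed_outside_process)}")
```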
If an employee needs to add a new model to the floor plan, we have a facility management form for that process, where users can specify new device attributes and create the device within seconds.
Facility groups can modify floor plans directly from the visualization, providing dynamic updates to all users; Operations can monitor alarms and notifications 24×7 for any CI; and IT teams can view rack elevations for any server rack or storage array (see Figure 3).
Power management, warranty tracking, maintenance, hardware dependencies, procedures, equipment relationships, contract management, and equipment histories can all be actively maintained within one system.
Figure 3. Front and rear rack elevation
Speaking of power management, our agile team was able to create an exact replica of our electrical panel schedules inside the DCIM without losing any of the functionality we had in Excel. This included the ability to update the current power draw for each individual breaker, the redundant failover calculation, alarming, and relationship creation from PDU to breaker to floor-mount device.
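The failover math behind such a panel schedule is simple to reason about: each breaker has a rating, a measured draw, and a redundant partner, and the check is whether either breaker of the pair could carry the combined load. The sketch below illustrates that logic only; it is not RETS’s actual implementation, and the 80% derating threshold is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Breaker:
    name: str
    rating_amps: float   # breaker rating
    draw_amps: float     # latest measured draw

def failover_ok(a: Breaker, b: Breaker, derate: float = 0.8) -> bool:
    """Could either breaker of a redundant A/B pair carry the combined load?"""
    combined = a.draw_amps + b.draw_amps
    return combined <= derate * min(a.rating_amps, b.rating_amps)

a_side = Breaker("PDU-1A / CB-12", rating_amps=30, draw_amps=11.0)
b_side = Breaker("PDU-1B / CB-12", rating_amps=30, draw_amps=10.5)
print(failover_ok(a_side, b_side))  # True: 21.5 A fits within 24 A (80% of 30 A)
```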
Oh, and by the way, iPad capability is here. Technicians can update information as work is being performed on the floor, allowing Operations to know when a change request is actually in process and what progress has been made. And 100% automation is in full effect! Technicians can also bring up equipment procedures to follow along the way, as these are tied to CIs using Service Management Suite knowledge articles.
Our Service Management Suite is fully integrated with Active Directory, so we can associate personnel with the individual site locations they manage. Self-service forms are also in place where users can add, modify, or delete vendor information for specific sites.
The Service Management Suite has also been integrated with our Real Estate management tool, bringing in remote floor plans, site contacts, and resource usage for each individual location. Pulling power consumption per device at remote sites has also been standardized, based on determined estimates of actual use, to assist with consolidation efforts.
The self-service items include automated life cycle forms that allow users to actively track equipment adds, modifications, and removals, while also providing the functionality to correlate CIs and form relationships from generator to UPS to PDU to breaker to rack to power strip to rack-mount device (see Figure 4).
Figure 4. Self-service facility CI management form
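Once the generator-to-device chain is captured as CI relationships, impact questions such as “what loses power if this UPS is taken down for maintenance?” reduce to walking that chain. The following hypothetical sketch shows one way such a traversal could look; the CI names and the parent-to-child structure are invented for illustration.

```python
# parent CI -> list of child CIs it feeds (a tiny slice of a hypothetical chain)
feeds = {
    "GEN-01": ["UPS-A"],
    "UPS-A": ["PDU-1A"],
    "PDU-1A": ["CB-12"],
    "CB-12": ["RACK-12"],
    "RACK-12": ["STRIP-12A"],
    "STRIP-12A": ["server-web-01", "server-db-02"],
}

def downstream(ci: str) -> list:
    """Return every CI fed (directly or indirectly) by the given CI."""
    impacted = []
    for child in feeds.get(ci, []):
        impacted.append(child)
        impacted.extend(downstream(child))
    return impacted

print(downstream("UPS-A"))
# ['PDU-1A', 'CB-12', 'RACK-12', 'STRIP-12A', 'server-web-01', 'server-db-02']
```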
Functionality for report creation on any CI and associated relationships has been implemented for all users. Need to determine where to place a server? There’s a report that can be run for that as well!
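A placement report of this kind essentially ranks racks by available space and remaining power headroom. A minimal, hypothetical sketch of that query follows; the fields, sample values, and ranking rule are illustrative assumptions, not the actual report logic.

```python
# Hypothetical rack capacity data pulled from the CMDB
racks = [
    {"rack": "A01", "free_u": 4,  "power_headroom_kw": 1.2},
    {"rack": "A02", "free_u": 12, "power_headroom_kw": 3.5},
    {"rack": "B07", "free_u": 20, "power_headroom_kw": 0.8},
]

def candidate_racks(racks, needed_u, needed_kw):
    """Racks that can physically and electrically accept the new server."""
    fits = [r for r in racks if r["free_u"] >= needed_u and r["power_headroom_kw"] >= needed_kw]
    return sorted(fits, key=lambda r: r["power_headroom_kw"], reverse=True)

print(candidate_racks(racks, needed_u=2, needed_kw=0.6))  # A02 first (most power headroom)
```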
The streamlined processes allow users to maintain facility and hardware CIs with ease and truly provide a 100% grasp on the activity occurring within our data centers on a daily basis.
I am quite proud of the small, but powerful, team that went on this adventure with us. As the leader of this initiative, I found it refreshing to see the idea start with a manager, who worked with a key developer to build workflows and, from there, turned DCIM on its head.
We pulled the hardware, network, and facility teams together with five amazing part-time developers for the “what if” brainstorm session, and the enthusiasm continued to explode. It was truly amazing to observe this team. Within 30 days, we had a prototype that was shared with senior management and stakeholders, who fully supported the effort, and the rest is history!
It’s important to note that RETS now has a far superior tool set for a fraction of the cost of other DCIM tools. Because it is embedded in our Service Management Suite, we avoid additional maintenance, software, and vendor services costs… win, win, win!
Our DCIM is forever evolving; we have so far surpassed the requirements we originally set that we thought, “Why stop now?” Continuing our journey will bring service impact reporting and alarming, incorporating our existing power monitoring application and building automation system, which will enhance our ability to include the remote location CIs we manage. With all the advances we are able to make using our own system, I am looking forward to more productivity than ever before, and more than we can imagine right now!
Stephanie Singer joined Reed Elsevier, now known as RELX Group, in 1980 and has worked for Mead Data Central, LexisNexis, and Reed Elsevier–Technology Services during her career. She is currently the vice president of Global Data Center Services. In this role, she is responsible for global data center and server room facilities, networks, and cable plant infrastructure for the products and applications within these locations. She leads major infrastructure transformation efforts. Ms. Singer has led the data center facilities team since 1990, maintaining an excellent data center availability record throughout day-to-day operations and numerous lifecycle upgrades to the mechanical and electrical systems. She also led the construction and implementation of a strategic backup facility to provide in-house disaster recovery capability.
Top 10 Considerations for Enterprises Progressing to Cloud
Industry data from Uptime Institute and 451 Research show a rapid rate of cloud computing adoption among enterprise IT departments. Organizations weigh cloud benefits and risks, and also evaluate how cloud will impact their existing and future data center infrastructure investment. In this video, Uptime Institute COO Julian Kudritzki and Andrew Reichman, Research Director at 451 Research, discuss how much risk, and how much reward, is on the table for companies considering a cloud transition.
While some organizations are making a “Tear down this data center” wholesale move to cloud computing, the vast majority of cloud adopters are getting there on a workload-by-workload basis–carefully evaluating their portfolio of workloads and applications and identifying the best cloud or non-cloud venue to host each.
The decision process is based on multiple considerations, including performance, integration issues, economics, competitive differentiation, solution maturity, risk tolerance, regulatory compliance considerations, skills availability, and partner landscape.
Some of the most important of these considerations when deciding whether to put a workload or application in the cloud include:
1. Know which applications impact competitive advantage, and which don’t.
You might be able to increase competitive advantage by operating critical differentiating applications better than peer companies do, but most organizations will agree that back office functions like email or payroll, while important, don’t really set a company apart. As mature SaaS (software as a service) options for these mundane, but important, business functions have emerged, many companies have decided that a cloud application that delivers a credible solution at a fair cost is good enough. Offloading the effort for these workloads can free up valuable time and effort for customization, optimization, and innovation around applications that drive real competitive differentiation.
2. Workloads with highly variable demand see the biggest benefit from cloud.
Public cloud was born to address the big swings in demand seen in the retail world. If you need thousands of servers around the Christmas shopping spree, public cloud IaaS (infrastructure as a service) makes them available when you need them and lets you return them after the New Year when demand settles down. Any workload with highly variable demand can see obvious benefits from running in cloud, so long as the architecture supports load balancing and changing the number of servers working on the job, known as scale-out design. Knowing which applications fit this bill and what changes could be made to cloud-enable other applications will help to identify those that would see the biggest economic benefit from a move to cloud.
3. Cloud supports trial and error without penalty.
Cloud gives users an off switch for IT resources, allowing easy changes to application architecture. Find out that a different server type or chipset or feature is needed after the initial build-out? No problem, just move it to something else and see how that works. The flexibility of cloud resources lends itself very well to experimentation in finding the perfect fit. If you’re looking for a home for an application that’s been running well for years, you might find that keeping it on-premises will be cheaper and less disruptive than moving it to the cloud.
4. Big looming investments can present opportunities for change.
Cloud can function as an effective investment avoidance strategy when organizations face big bills for activities like data center build-out or expansion, hardware refresh, software upgrade, staff expansion, or outsourcing contract renewal. When looking at big upcoming expenditures, it’s a great time to look at how offloading selected applications to cloud might reduce or eliminate the need for that spend. Once the investments are made, they become sunk costs and will likely make the business case for cloud transition much less attractive.
5. Consider whether customization is required or if good enough is good enough.
Is this an application that isn’t too picky in terms of server architecture and advanced features, or is it an app that requires specific custom hardware architectures or configurations to run well? If you have clearly understood requirements that you must keep in place, a cloud provider might not give you exactly what you need. On the other hand, if your internal IT organization struggles to keep pace with the latest and greatest, or if your team is still determining the optimal configuration, cloud could give you the flexibility to experiment with a wider range of options than you have access to internally, given that you operate at a smaller scale than the mega-scale cloud providers.
6. Conversion to cloud native architectures can be difficult but rewarding in the long run.
One benefit of cloud is renting instead of buying, with advantages in terms of scaling up and down at will and letting a service provider do the work. A separate benefit comes from the use of cloud native architectures. Taking advantage of things like API-controlled infrastructure, object storage, micro-services, and server-less computing requires switching to cloud-friendly applications or modifying legacy apps to use cloud design principles. If you have plans to switch or modify applications anyway, think about whether you would be better served running these applications in house or whether it would make sense to use that inflection point to move to something hosted by a third party. If your organization runs mostly traditional apps and has no intention of taking on major projects to cloud-enable them, you should know that the options and benefits of forklifting them unchanged to the cloud will be limited.
7. Be honest about what your company is good at and what it is not.
If cloud promises organizations the ability to get out of the business of mundane activities such as racking and stacking gear, refreshing and updating systems, and managing facilities, it’s important to start the journey with a clear and honest assessment of what your company does well and what it does not do well. If you have long-standing processes to manage efficiency, reliability, and security, have relatively new facilities and equipment, and have IT staff who are good at keeping it all running, then cloud might not offer much benefit. If there are areas where things don’t go so smoothly or you struggle to get everything done with existing headcount, cloud could be a good way to get better results without taking on more effort or learning major new skills. On the other hand, managing a cloud environment requires its own specialized skills, which can make it hard for an unfamiliar organization to jump in and realize benefits.
8. Regulatory issues have a significant bearing on the cloud decision.
Designing infrastructure solutions that meet regulations can be tremendously complicated. It’s good to know upfront if you face regulations that explicitly prevent public cloud usage for certain activities, or if your legal department interprets those regulations as such, before wasting time evaluating non-starter solutions. That said, regulations generally impact some workloads and not others, and in many cases, are specific to certain aspects of workloads such as payment or customer or patient identity information. A hybrid architecture might allow sensitive data to be kept in private venues, while less sensitive information might be fine in public cloud. Consider that after a public cloud solution has seen wide acceptance for a regulated workload, there may be more certainty that that solution is compliant.
9. Geography can be a limiting factor or a driver for cloud usage.
If you face regulations around data sovereignty and your data has to be physically stored in Poland, Portugal, or Panama (or anywhere else on the globe), the footprint of a cloud provider could be a non-starter. On the flip side, big cloud providers are likely already operating in far more geographies than your enterprise. This means that if you need multiple sites for redundancy and reliability, require content delivery network (CDN) capabilities to reach customers in a wide variety of locations, or want to expand into new regions, the major cloud providers can extend your geographic reach without major capital investments.
10. The unfamiliar may only seem less secure.
Public cloud can go either way in terms of being more or less secure than what’s provisioned in an enterprise data center. Internal infrastructure has more mature tools and processes associated with it, enjoys wider familiarity among enterprise administrators, and the higher level of transparency associated with owning and operating facilities and gear allows organizations to get predictable results and enjoy a certain level of comfort from sheer proximity. That said, cloud service providers operate at far greater scale than individual enterprises and therefore employ more security experts and gain more experience from addressing more threats than do single companies. Also, building for shared tenancy can drive service providers to lock down data across the board, compared to enterprises that may have carried over vulnerabilities from older configurations that may become issues as the user base or feature set of workloads change. Either way, a thorough assessment of vulnerabilities in enterprise and service provider facilities, infrastructure and applications is critical to determine whether cloud is a good or bad option for you.
Andrew Reichman is a Research Director for cloud data within the 451 Research Voice of the Enterprise team. In this role, he designs and interprets quarterly surveys that explore cloud adoption in the overall enterprise technology market.
Prior to this role, he worked at Amazon Web Services, leading their efforts in marketing infrastructure as a service (IaaS) to enterprise firms worldwide. Before this, he spent six years as a Principal Analyst at Forrester Research, covering storage, cloud and datacenter economics. Prior to his time at Forrester, Andrew was a consultant with Accenture, optimizing datacenter environments on behalf of EMC. He holds an MBA in finance from the Foster School of Business at the University of Washington and a BA in History from Wesleyan University.
Open19 is About Improving Hardware Choices, Standardizing Deployment
In July 2016, Yuval Bachar, principal engineer, Global Infrastructure Architecture and Strategy at LinkedIn, announced Open19, a project spearheaded by the social networking service to develop a new specification for server hardware based on a common form factor. The project aims to standardize the physical characteristics of IT equipment, with the goal of cutting costs and installation headaches without restricting choice or innovation in the IT equipment itself.
Uptime Institute’s Matt Stansberry discussed the new project with Mr. Bachar following the announcement. The following is an excerpt of that conversation:
Please tell us why LinkedIn launched the Open19 Project.
We started Open19 with the goal that any IT equipment we deploy would be able to be installed in any location, such as a colocation facility, one of our owned sites, or sitting in a POP, in a standardized 19-inch rack environment. Standardizing on this form factor significantly reduces the cost of integration.
We aren’t the kind of organization that has one type of workload or one type of server. Data centers are dynamic, with apps evolving on a weekly basis. New demands for apps and solutions require different technologies. Open19 provides the opportunity to mix and match server hardware very easily without the need to change any of the mechanical or installation aspects.
Different technologies evolve at a different pace. We’re always trying a variety of different servers—from low-power multi core machines to very high end, high performance hardware. In an Open19 configuration, you can mix and match this equipment in any configuration you want.
In the past, if you had a chassis full of blade servers, they were super-expensive, heavy, and difficult to handle. With Open19, you would build a standardized chassis into your cage and fit servers from various suppliers into the same slot. This provides a huge advantage from a procurement perspective. If I want to replace one blade today, the only opportunity I have is to buy from a single server supplier. With Open19, I can buy a blade from anybody that complies. I can have five or six proposals for a procurement instead of just one.
Also, by being able to blend high performance and high power with low-power equipment in two adjacent slots in the brick cage in the same rack, you can create a balanced environment in the data center from a cooling and power perspective. It helps avoid hot spots.
I recall you mentioned that you may consider folding Open19 into the Facebook-led Open Compute Project (OCP), but right now that’s not going to happen. Why launch Open19 as a standalone project?
The reason we don’t want to fold Open19 under OCP at this time is that there are strong restrictions around how innovation and contributions are routed back into the OCP community.
IT equipment partners aren’t willing to contribute IP and innovation. The OCP solution wasn’t enabling the industry—when organizations create things that are special, OCP requires you to expose everything outside of your company. That’s why some of our large server partners couldn’t join OCP. Open19 defines the form factor, and each server provider can compete in the market and innovate in their own dimension.
What are your next steps and what is the timeline?
We will have an Open19-based system up and running in the middle of September 2016 in our labs. We are targeting late Q1 of 2017 to have a variety of servers installed from three to four suppliers. We are considering Open19 as our primary deployment model, and if all of the aspects are completed and tested, we could see Open19 in production environments after Q1 2017.
The challenge is that this effort has multiple layers. We are working with partners to do the engineering development. And that is achievable. A secondary challenge is the legal side. How do you create an environment that the providers are willing to join? How do we do this right this time?
But most importantly, for us to be successful we have to have significant adoption from suppliers and operators.
It seems like something that would be valuable for the whole industry—not just the hyperscale organizations.
Enterprise groups will see an opportunity to participate without having to be the 100,000-server data center companies. So many enterprise IT groups had expressed reluctance about OCP because the white box servers come with limited support and warranty levels.
It will also lower their costs, increase the speed of installations and changes, and strengthen their negotiating position on procurement.
If we come together, we will generate enough demand to make Open19 interesting to all of the suppliers and operators. You will get the option to choose.
Yuval Bachar is a principal engineer on the global infrastructure and strategy team at LinkedIn, where he is responsible for the company’s data center architecture strategy and the implementation of its mega-scale future data centers. In this capacity, he drives and supports new technology development, architecture, and collaboration to support the tremendous growth in LinkedIn’s future user base, data centers, and services. Prior to LinkedIn, Mr. Bachar held IT leadership positions at Facebook, Cisco, and Juniper Networks.
FORTRUST Gains Competitive Advantage from Management and Operations
FORTRUST regards management and operations as a core competency that helps it win new clients and control capital and operating expenses
By Kevin Heslin
Shortly after receiving word that FORTRUST had earned Uptime Institute’s Tier Certification of Operational Sustainability (Gold) for Phase 7 of its Denver data center, Rob McClary, the company’s executive vice president and general manager, discussed the importance of management and operations and how FORTRUST utilizes Uptime Institute’s certifications to maintain and improve its operations and gain new clients.
Q: Please tell me about FORTRUST and its facilities.
FORTRUST is a multi-tenant data center (MTDC) colocation services provider. Our main data center is a 300,000-square-foot facility in Denver, and we have been operating as a privately owned company since 2001, providing services for Fortune 100, 500, and 1000 companies.
Q: You recently completed a new phase. How is it different from earlier phases?
The big change is that we started to take a prefabricated approach to construction for Phase 6. Instead of traditional raised floor, we went to data modules, effectively encapsulating the customers’ IT environments, which increased our per-rack densities and subsequent efficiencies from both a cooling and a capital standpoint.
One of the biggest trends in the industry that needs a course correction is that data centers are not being allowed to evolve as they need to. We keep trying to build the same data centers over and over. The engineers and general contractors and even the vendors just want to do what they have always done. And I think the data centers are going to have to become capital efficient. When we made that change to a modular approach, we reduced our capital outlay and started getting almost instantaneous return on the capital.
Q: How did you ensure that these changes would not reduce your availability?
For Phase 6A, we were trying to get used to the whole modular approach in a colocation environment. We had Uptime Institute review our designs. As a result, we earned the Tier Certification of Design Documents.
We planned to go further and do the Tier Certification of Constructed Facility as well, but Uptime Institute helped us determine that it would be better to do the Tier Certification of Constructed Facility in an upcoming phase (7) because we already had live customers in Phase 6A. And, as you know, about half the customers in a colo facility have at least one piece of single-corded equipment in their IT environment. To address this, we worked with Uptime Institute consultants to adapt the Tier Certification of Constructed Facility process, which normally takes place when there are no live customers. And we pursued Uptime Institute’s M&O Stamp of Approval.
This helped us understand the whole modular approach and how we would approach these circumstances going forward. At the same time we became one of the earlier Uptime Institute M&O Stamp of Approval sites.
Q: Why has FORTRUST adopted Uptime Institute’s Operations assessments and certifications?
We are big believers that design is only a small part of uptime and reliability. We focus 90% of our effort on management and operations, risk mitigation, and process discipline, doing things in a manner that prevents human error. That’s really what’s allowed us to achieve continuous uptime and have high customer satisfaction for such a long time.
I’d say we’re known for our operations and a level of excellence, so the Tier Certification of Operational Sustainability validated things that we have been doing for many years. It also allowed us to take a look at what we do from a strategy and tactical standpoint by essentially giving us a third-party look into what we think is important and what someone else might think is important.
We’ve got a lot of former military folks here, a lot of former Navy. That background probably influences our approach. The military conducts a lot of inspections and audits. You get used to it, and it becomes your chance to shine. So the Tier Certification of Operational Sustainability allows our people to show what they do, how they do it, and the pride they take in doing it. It gives them validation of how well they are doing, and it emphasizes why it is important.
The process re-emphasizes why it is so important to have your operational strategies and your tactics aligned in a harmonious fashion. A lot of people in the data center industry get bogged down in checklists and best practices, trying to use them to compare data centers, and about 50 percent of it is noise, which means tactics without strategy. If you have your strategies in place and your tactics are aligned with your strategies, that’s much more powerful than trying to incorporate 100 best practices in your day-to-day ops. So doing 50 things very well is better than doing 100 things halfway.
Q: Did previously preparing for the M&O Stamp of Approval help you prepare for the Tier Certification of Operational Sustainability?
Absolutely. One reason we scored so well on the Tier Certification of Operational Sustainability was that we looked at our M&O 3 years ago and implemented the suggested improvements right away, and we were comfortable because we’ve had those things in place for years.
Q: What challenges did you face? You make it sound easy.
The biggest challenge for us during both the Tier Certification of Constructed Facility and the Tier Certification of Operational Sustainability was coordinating with Uptime Institute in a live colo environment with shared systems that weren’t specific to one phase. It was pretty much like doing surgery on a live body.
We were confident in what we were doing. Obviously the Tier Certification of Operational Sustainability centers on documentation, and we’ve been known for our process discipline and procedural compliance over 14 years of operations. It’s our cornerstone; it’s what we do very well. We literally don’t do anything without a procedure in hand.
We think design is the design. Once you build the data center and the infrastructure, it is all about management and operations. Management, operations, and risk mitigation are what will give you a long track record of success.
At the end of the day, if the biggest cause of outages in data centers is human error, why wouldn’t we put more emphasis on understanding why that happens and how to avoid it? To me, that’s what the M&O and Tier Certification of Operational Sustainability are all about.
Q: It’s obvious you consider procedures a point of differentiation in the market. How do you capitalize on this difference with prospective customers?
We show it to them. Part of our sales cycle includes taking the customer through the due diligence that they want to do and what we think they need to do. We make available hundreds of our documented procedures. We show them how we manage them. When we take a potential customer through our data center, we tell a story that starts with reliability, risk mitigation, business value, and customer service.
Customers can not only hear it and see the differences, but they can also feel it. If you have been in a lot of data centers, you can walk through the MEP or colo areas and, maybe in 10-15 minutes, tell whether there’s a difference in the management and operations philosophy. It’s quite apparent.
That’s always been our call to action. It’s really about educating the customer on why management and ops are so important. We put a lot of emphasis and resources into mitigating and eliminating human error. We devote probably 80-90% of our time to training, process discipline, and procedural compliance. That’s how we operate day to day.
Q: What are the added costs?
Actually, this approach costs less. I would challenge anyone who is outsourcing most of their maintenance, operations, and even management, because we’re doing it cheaper and we’re doing more aggressive predictive and preventive maintenance than almost any data center. It’s really what we call an operational mindset, and that can rarely be outsourced. Your personnel have to own it.
We don’t have people coming in to clean our data centers. Our folks do it. We do the majority of the maintenance in the facility, and the staff owns it.
We don’t do a lot of corrective maintenance. Corrective maintenance currently costs on the order of 10 times the cost of a comprehensive preventive and predictive maintenance program.
I can show proof because we have been operating for 15 years now. I would dare anyone to tell me what part of that data center or which of our substations, switchgear, or other equipment components are 15 years old and which are the new ones. It would be hard to tell.
I think there are too many engineering firms and GCs that try to influence the build in a manner that isn’t necessary. Like I said, they try to design around human error instead of spending time preventing it.
Rob McClary
Rob McClary is executive vice president and general manager at FORTRUST Data Centers. Since joining FORTRUST in 2001, he has played a critical role in building the company into a premier data center services provider and colocation facility. Mr. McClary is responsible for the overall supervision of business operations, high-profile construction, and strategic technical direction. He developed and implemented the process controls and procedures that support the continuous uptime and reliability that FORTRUST Denver has delivered for more than 14 years.
Saudi Aramco’s Cold Aisle Containment Saves Energy
Oil exploration and drilling require HPC
By Issa A. Riyani and Nacianceno L. Mendoza
Saudi Aramco’s Exploration and Petroleum Engineering Computer Center (ECC) is a three-story data center built in 1982. It is located in Dhahran, Kingdom of Saudi Arabia. It provides computing capability to the company’s geologists, geophysicists, and petroleum engineers to enable them to explore, develop, and manage Saudi Arabia’s oil and gas reserves. Transitioning the facility from mainframe to rack-mounted servers was just the first of several transitions that challenged the IT organization over the last three decades. More recently, Saudi Aramco reconfigured the legacy data center to a Cold Aisle/Hot Aisle configuration, increasing rack densities to 8 kilowatts per rack (kW/rack) from 3 kW/rack in 2003 and nearly doubling capacity. Further increasing efficiency, Saudi Aramco also sealed openings around and under the computer racks, cooling units, and the computer power distribution panel in addition to blanking unused rack space.
The use of computational fluid dynamics (CFD) simulation software to manage the hardware deployment process enabled Saudi Aramco to increase the total number of racks and rack density in each data hall. Saudi Aramco used the software to analyze various proposed configurations prior to deployment, eliminating the risk of trial and error.
In 2015 one of the ECC’s five data halls was modified to accommodate a Cold Aisle Containment System. This installation supports the biggest single deployment so far in the data center, 124 racks of high performance computers (HPC) with a total power demand of 994 kW. As a result, the data hall now hosts 219 racks on a 10,113-square-foot (940-square-meter) raised floor. To date, the data center hall has not experienced any temperature problems.
Business Drivers
Increasing demand from ECC customers, which required the deployment of new IT hardware and software technologies, necessitated a major reconfiguration of the data center. Each new configuration increased the heat that needed to be dissipated from the ECC. At each step, several measures were employed to mitigate potential impact to the hardware, ensuring safety and reliability during each deployment and project implementation.
For instance, Saudi Aramco developed a hardware deployment master plan, based on a projected life cycle and refresh rate of 3–5 years, to transition to the Cold Aisle/Hot Aisle configuration. This plan allows for advance planning of space and power allocation, as well as fund allocation and material procurement, with no compromise to existing operations (see Figures 1 and 2).
Figure 1. Data center configuration (current day and master plan)
Figure 2. Data center plan view
Because of the age of the building and its construction methodology, the company’s engineering and consulting department was asked to evaluate the building structure based on the initial master plan. This department determined the maximum weight capacity of the building structure, which was used to establish the maximum rack weight to avoid compromising structural stability.
In addition, the engineering and consulting department evaluated the chilled water pipe network and determined the maximum number of cooling units to be deployed in each data hall, based on maximum allowable chilled water flow. Similarly, the department determined the total heat to be dissipated per Hot Aisle to optimize the heat rejection capability of the cooling units. The department also determined the amount of heat to be dissipated per rack to ensure sufficient cooling as per the manufacturer’s recommendation.
Subsequently, facility requirements based on these limiting factors were shared with the technology planning team and IT system support. These requirements form a checklist that includes maximum weight, rack dimensions, and the requirement for blanking panels and sealing technologies to prevent air mixing.
Other features of the data center include:
The 42U cabinets in the ECC have solid sides and tops, with 64% perforated front and rear doors on each cabinet. Each is 6.5-ft. high by 2-ft. wide by 3.61-ft. deep (2 m by 0.6 m by 1.10 m) and weighs 1,874 pounds (850 kilograms). Rack density ranges from 6.0–8.0 kW. The total nominal cooling capacity is 1,582 kW from 25 18-ton computer room air conditioning (CRAC) units.
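As a quick sanity check of those figures, the nominal capacity can be reproduced from the unit count and tonnage using the standard conversion of roughly 3.517 kW per ton of refrigeration. The short Python sketch below is illustrative only; the conversion factor is general engineering practice rather than a number from the article.

# Rough check of the ECC data hall's nominal cooling capacity.
# Figures from the article: 25 CRAC units rated at 18 tons each.
TON_TO_KW = 3.517          # standard conversion: 1 ton of refrigeration ~ 3.517 kW

crac_units = 25
tons_per_unit = 18

nominal_cooling_kw = crac_units * tons_per_unit * TON_TO_KW
print(f"Nominal cooling capacity: {nominal_cooling_kw:,.1f} kW")   # ~1,582 kW

# Compare with the 994-kW HPC deployment cited for the same hall.
hpc_load_kw = 994
print(f"Headroom over the HPC load: {nominal_cooling_kw - hpc_load_kw:,.1f} kW")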
Modeling Software
In 2007, Saudi Aramco commissioned the CFD modeling software company to prepare baseline models for all the data halls. The software is capable of performing transient analysis that suits the company’s requirements. The company uses the modeling software to simulate proposed hardware deployments, investigate deployment scenarios, and identify any stranded capacity. The modeling company developed several simulations based on different hardware iterations of the master plan to help establish the final hardware master plan, with each 16-rack Hot Aisle not exceeding a 125-kW heat load and no rack exceeding 8 kW. After the modeling software company completed the initial iterations, Saudi Aramco acquired a perpetual license and support contract for the CFD simulation software in January 2010.
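The CFD package itself is a commercial product, but the master-plan limits quoted above (no more than 125 kW across a 16-rack Hot Aisle and no more than 8 kW per rack) lend themselves to a simple pre-check before a proposed layout is handed off for full simulation. The sketch below is a minimal illustration of that idea; the rack labels and loads are hypothetical, not ECC data.

# Illustrative pre-check of a proposed Hot Aisle layout against the
# master-plan limits described above: <= 8 kW per rack and <= 125 kW
# across a 16-rack Hot Aisle. Rack labels and loads are hypothetical.
MAX_KW_PER_RACK = 8.0
MAX_KW_PER_AISLE = 125.0
MAX_RACKS_PER_AISLE = 16

proposed_aisle = {   # rack label -> planned IT load in kW
    "R01": 7.5, "R02": 6.8, "R03": 8.0, "R04": 5.9,
    "R05": 7.2, "R06": 6.0, "R07": 7.9, "R08": 6.4,
}

def check_aisle(racks):
    problems = []
    if len(racks) > MAX_RACKS_PER_AISLE:
        problems.append(f"{len(racks)} racks exceeds the {MAX_RACKS_PER_AISLE}-rack limit")
    for label, kw in racks.items():
        if kw > MAX_KW_PER_RACK:
            problems.append(f"{label}: {kw} kW exceeds {MAX_KW_PER_RACK} kW per rack")
    total_kw = sum(racks.values())
    if total_kw > MAX_KW_PER_AISLE:
        problems.append(f"Aisle total {total_kw:.1f} kW exceeds {MAX_KW_PER_AISLE} kW")
    return problems

for line in check_aisle(proposed_aisle) or ["Aisle is within master-plan limits"]:
    print(line)

A check like this does not replace the CFD run; it simply screens out layouts that would violate the master plan before simulation time is spent.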
Saudi Aramco finds that the CFD simulation software makes it easier to identify and address heat stratification, recirculation, and even short-circuiting of cool air. By identifying the issues in this way, Saudi Aramco was able to take several precautionary measures and improve its capacity management procedures, including increasing cooling efficiency and optimizing load distribution.
Temperature and Humidity Monitoring System
With the CFD software simulation results at hand, the facilities management team looked for other means to gather data for use in future cooling optimization simulations while validating the results of CFD simulations. As a result, the facilities management group decided to install a temperature and humidity monitoring system. The initial deployment was carried out in 2008, with the monitoring of subfloor air supply temperature and hardware entering temperature.
At that time, three sensors were installed in each Cold Aisle for a total of six sensors. The sensors were positioned at each end of the row and in the middle, at the highest point of each rack. Saudi Aramco chose these points to get a better understanding of the temperature variance (∆T) between the subfloor and the highest rack inlet temperature. Additionally, Saudi Aramco uses this data to monitor and ensure that all inlet temperatures are within the recommended ranges of ASHRAE and the manufacturer.
The real-time temperature and humidity monitoring system enabled the operations and facility management teams to monitor and document unusual and sudden temperature variances, allowing proactive responses and early resolution of potential cooling issues. The monitoring system gathers data that can be used to validate the CFD simulations and for further evaluation and iteration.
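A minimal sketch of the kind of check such a monitoring system enables is shown below, assuming the ASHRAE recommended inlet range of 18–27°C; the sensor names, subfloor supply temperature, and readings are hypothetical, and the ECC’s actual commercial system and data format are not described in the article.

# Illustrative inlet-temperature check. Sensor names and readings are hypothetical.
ASHRAE_MIN_C, ASHRAE_MAX_C = 18.0, 27.0   # ASHRAE recommended inlet range

subfloor_supply_c = 16.5                  # assumed subfloor supply temperature
inlet_readings_c = {                      # top-of-rack inlet sensors
    "Aisle01-EndA": 19.2,
    "Aisle01-Middle": 20.1,
    "Aisle01-EndB": 23.8,
}

for sensor, inlet_c in inlet_readings_c.items():
    delta_t = inlet_c - subfloor_supply_c   # a large dT hints at recirculation or stratification
    status = "OK" if ASHRAE_MIN_C <= inlet_c <= ASHRAE_MAX_C else "OUT OF RANGE"
    print(f"{sensor}: inlet {inlet_c:.1f} C, dT vs. subfloor {delta_t:.1f} C [{status}]")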
The Prototype
The simulation models identified stratification, short circuiting, and recirculation issues in the data halls, which prompted the facilities management team to develop more optimization projects, including a containment system. In December 2008, a prototype was installed in one of the Cold Aisles (see Figure 3) using ordinary plastic sheets as refrigerated doors and Plexiglass sheets on an aluminum frame. Saudi Aramco monitored the resulting inlet and core temperatures using the temperature and humidity monitoring system and internal system monitors prior to, during, and upon completion of installation to ensure no adverse effect on the hardware. The prototype was observed over the course of three months with no reported hardware issues.
Figure 3. Prototype containment system
Following the successful installation of the prototype, various simulation studies were further conducted to ensure the proposed deployment’s benefit and savings. In parallel, Saudi Aramco looked for the most suitable materials to comply with all applicable standards, giving prime consideration to the safety of assets and personnel and minimizing risk to IT operations.
Table 1. Installation dates
When the Cold Aisle was contained, Saudi Aramco noticed considerable improvement in the overall environment. Containment improved cold air distribution by eliminating hot air mixing with the supply air from the subfloor, so that air temperature at the front of the servers was close to the subfloor supply temperature. With cooler air entering the hardware, the core temperature was vastly improved, resulting in lower exhaust and return air temperatures to the cooling units. As a result, the data hall was able to support more hardware.
Material Selection and Cold Aisle Containment System Installation
From 2009 to 2012, the facility management team evaluated and screened several products. It secured and reviewed the material data sheets and submitted them to the Authority Having Jurisdiction (AHJ) for evaluation and concurrence. Each of the solutions would require some modifications to the facility before being implemented. The facility management team evaluated and weighed the impact of these modifications as part of the procurement process.
Of all the products, one stood out from the rest: its easy-to-install, transparent material not only addressed safety but also eliminated the need to modify the existing infrastructure, which translated to considerable savings in project execution time and money.
Movement in and out of the aisle is easy and safe because people can see through the doors and walls. Additionally, the data hall lighting did not need to be modified since it was not obstructed. Even the fire suppression system was not affected, since it has a fusible link and lanyard connector. The only requirement from the AHJ prior to deployment was additional smoke detectors in the Cold Aisle itself.
To comply with this requirement, an engineering work order was raised for the preparation of the necessary design package for the modification of the smoke detection system. After completing the required design package including certification from a chartered fire protection engineer as mandated by the National Fire Protection Association (NFPA), it was established that four smoke detectors were to be relocated and an additional seven smoke detectors installed in the data hall.
Implementation and Challenges
Optimizations and improvements always come with challenges; the reconfiguration process necessitated close coordination between the technology planning team, IT system support, ECC customers, the network management group, Operations, and facility management. These teams had to identify hardware that could be decommissioned without impacting operations, prepare temporary spaces for interim operations, and then take the decommissioned hardware out of the data hall, allowing the immediate deployment of new hardware in Cold Aisle/Hot Aisle. Succeeding deployments follow the master plan, allowing the complete realignment process to be completed in five years.
Installation of the Cold Aisle Containment System did not come without challenges; all optimization activities, including relocating luminaires that were in the way of the required smoke detectors, had to be completed with no impact to system operations. To meet this requirement, ECC followed a strict no work permit–no work procedure; work permits are countersigned by operations management staff on duty at issuance and prior to closeout. This enabled close monitoring of all activities within the data halls, ensuring safety and no impact to daily operations and hardware reliability. Additionally, a strict change management documentation process was utilized and adhered to by the facility management team and monitored by operations management staff; all activities within the data halls have to undergo a change request approval process.
Operations management and facility management worked hand in hand to overcome these challenges. Operations management, working in three shifts, closely monitored the implementation process, especially after regular working hours. Continuous coordination among contractors, vendors, operations staff, and the facility management team enabled a smooth transition and project implementation, eliminating any showstoppers along the way.
Summary
The simulation comparison in Figure 4 clearly shows the benefits of the Cold Aisle Containment System. Figure 4a shows hot air recirculating around the end of the rows and mixing with the cold air supply to the Cold Aisles. In Figure 4b, mixing of hot and cold air is considerably reduced with the installation of the 14 Cold Aisle containment systems. The Cold Aisles are better defined and clearly visible in the figures, with less hot air recirculation, but the three rows without containment still suffer from recirculation. In Figure 4c, the Cold Aisles are far better defined, and hot air recirculation and short circuiting are reduced. Additionally, the exhaust air temperature from the hardware has dropped considerably.
Figure 4a. Without Cold Aisle Containment
Figure 4b. With current Cold Aisle Containment (14 of 17 aisles)
Figure 4c. With full Cold Aisle Containment
Figures 5–11 show that the actual power and temperature readings taken from the sensors installed in the racks validated the simulation results. As shown in Figures 4a and 5a, the power draw of the racks in Aisles 1 and 2 fluctuates while the corresponding entering and leaving temperatures were maintained. In Week 40, the temperature even dropped slightly despite a slight increase in the power draw. The same can also be observed in Figures 6 and 7. All these aisles are fitted with a Cold Aisle Containment System.
Figure 5. Actual Power utilization, entering temperature, and leaving temperature
Aisle 01 (installed on July 30, 2015 – week 31)
Figure 6. Aisle 2 (installed on July 28, 2015 – week 31)
Figure 7. Aisle 3 (installed on March 7, 2015 – week 10)
Figure 8. Aisle 6 (installed on April 09, 2015 – week 15)
Figure 9. Aisle 7a (a and b) and Aisle 7b (c and d)
Figure 10. Aisle 8 (installed on February 28, 2015 – week 09)
Figure 11. Aisle 17 (no Cold Aisle installed)
Additionally, Figure 11 clearly shows slightly higher entering and leaving temperatures, as well as fluctuations in the temperature readings that coincided with the power draw fluctuations of the racks within the aisle. This aisle has no containment.
The installation of the Cold Aisle Containment System greatly improved the overall cooling environment of the data hall (see Figure 12). Eliminating hot and cold air mixing and short circuiting allowed for more efficient cooling unit performance and cooler supply and leaving air. Return air temperature readings in the CRAH units were also monitored and sampled in Figure 12, which shows the actual return air temperature variance as a result of the improved overall data hall room temperature.
Figure 12. Computer room air handling unit actual return air temperature graphs
Figure 13. Cold Aisle floor layout
The installation of the Cold Aisle Containment System allows the same data hall to host the company’s MAKMAN and MAKMAN-2 supercomputers (see Figure 14). Both MAKMAN and MAKMAN-2 appear on the June 2015 Top500 Supercomputers list.
Figure 14. Installed Cold Aisle Containment System
Issa Riyani
Issa A. Riyani joined the Saudi Aramco Exploration Computer Center (ECC) in January 1993. He graduated from King Fahad University of Petroleum and Minerals (KFUPM) in Dhahran, Kingdom of Saudi Arabia, with a bachelor’s degree in electrical engineering. Mr. Riyani currently leads the ECC Facility Planning & Management Group and has more than 23 years of experience managing ECC facilities.
Nacianceno L. Mendoza
Nacianceno L. Mendoza joined the Saudi Aramco Exploration Computer Center (ECC) in March 2002. He holds a bachelor of science in civil engineering and has more than 25 years of diverse experience in project design, review, construction management, supervision, coordination, and implementation. Mr. Mendoza spearheaded the design and implementation of the temperature and humidity monitoring system and the deployment of the Cold Aisle Containment System in the ECC.
Data-Driven Approach to Reduce Failures
Operations teams use the Uptime Institute Network’s Abnormal Incident Reports (AIRs) database to enrich site knowledge, enhance preventative maintenance, and improve preparedness
By Ron Davis
The AIRs system is one of the most valuable resources available to Uptime Institute Network members. It comprises more than 5,000 data center incidents and errors spanning two decades of site operations. Using AIRs to leverage the collective learning experiences of many of the world’s leading data center organizations helps Network members improve their operating effectiveness and risk management.
A quick search of the database, using various parameters or keywords, turns up invaluable documentation on a broad range of facility and equipment topics. The results can support evidence-based decision making and operational planning to guide process improvement, identify gaps in documentation and procedures, refine training and drills, benchmark against successful organizations, inform purchasing decisions, fine-tune preventive maintenance (PM) programs to minimize failure risk, help maximize uptime, and support financial planning.
THE VALUE OF AIRs IN OPERATIONS
The philosopher, essayist, poet, and novelist George Santayana wrote, “Those who cannot remember the past are doomed to repeat it.” Using records of past data center incidents, errors, and outages can help inform operational practices to help prevent future incidents.
All Network member organizations participate in the AIRs program, ensuring a broad sample of incident information from data center organizations of diverse sizes, business sectors, and geographies. The database contains records of data center facility infrastructure incidents and outages from 1994 through the present. This volume of incident data allows for meaningful and extremely valuable analysis of trends and emerging patterns. Annually, Uptime Institute presents aggregated results and analysis of the AIRs database, spotlighting issues from the year and both current and historical trends.
Going beyond the annual aggregate trend reporting, there is also significant insight to be gained from looking at individual incidents. Detailed incident information is particularly relevant to front-line operators, helping to inform key facility activities, including:
• Operational documentation creation or improvement
• Planned maintenance process development or improvement
• Training
• PM
• Purchasing
• Effective practices, failure analysis, lessons learned
AIRs reporting is confidential and subject to a non-disclosure agreement (NDA), but the following hypothetical case study illustrates how AIRs information can be applied to improve an organization’s operations and effectiveness.
USING THE AIRs DATA IN OPERATIONS: CASE STUDY
A hypothetical “Site X” is installing a popular model of uninterruptible power supply (UPS) modules.
The facility engineer decides to research equipment incident reports for any useful information to help the site prepare for a smooth installation and operation of these critical units.
Figure 1. The page where members regularly go to submit an AIR. The circled link takes you directly to the “AIR Search” page.
Figure 2. The AIR Search page is the starting point for Abnormal Incident research. The page is structured to facilitate broad searches but includes user friendly filters that permit efficient and effective narrowing of the desired search results.
The facility engineer searches the AIRs database using specific filter criteria (see Figures 1 and 2), looking for any incidents within the last 10 years (2005-2015) involving the specific manufacturer and model where there was a critical load loss. The database search returns seven incidents meeting those criteria (see Figure 3).
Figure 3. The results page of our search for incidents within the last 10 years (2005-2015) involving a specific manufacturer/model where there was a critical load loss. We selected the first result returned for further analysis.
Figure 4. The overview page of the abnormal incident report selected for detailed analysis.
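The search itself is done through the Network member portal, but the filtering logic is straightforward to picture. The sketch below applies the same criteria (manufacturer, model, a 2005-2015 window, and a critical-load-loss flag) to a small list of made-up incident records; the field names are assumptions for illustration, not the portal’s actual schema.

# Illustrative filter mirroring the AIRs search described above.
# Record structure and values are hypothetical, not the real AIRs schema.
incidents = [
    {"year": 2009, "manufacturer": "VendorX", "model": "UPS-9000", "critical_load_loss": True},
    {"year": 2013, "manufacturer": "VendorX", "model": "UPS-9000", "critical_load_loss": False},
    {"year": 2003, "manufacturer": "VendorX", "model": "UPS-9000", "critical_load_loss": True},
]

def search_airs(records, manufacturer, model, start_year, end_year, critical_load_loss=None):
    hits = []
    for record in records:
        if record["manufacturer"] != manufacturer or record["model"] != model:
            continue
        if not start_year <= record["year"] <= end_year:
            continue
        if critical_load_loss is not None and record["critical_load_loss"] != critical_load_loss:
            continue
        hits.append(record)
    return hits

matches = search_airs(incidents, "VendorX", "UPS-9000", 2005, 2015, critical_load_loss=True)
print(f"{len(matches)} matching incident(s)")   # only the 2009 record qualifies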
The first incident report on the list (see Figure 4) reveals that the unit involved was built in 2008. A ventilation fan failed in the unit (a common occurrence for UPS modules of any manufacturer/model). Replacing the fan required technicians to implement a UPS maintenance bypass, which qualifies as a site configuration change. At this site, vendor personnel were permitted to perform site configuration changes. The UPS vendor technician was working in concert with one of the facility’s own operations engineers but was not being directly supervised (observed) at the time the incident occurred; he was out of the line of sight.
Social scientist Brené Brown said, “Maybe stories are just data with a soul.” If so, the narrative portion of each report is where we find the soul of the AIR (Figure 5). Drilling down into the story (Description, Action Taken, Final Resolution, and Synopsis) reveals what really happened, how the incident played out, and what the site did to address any issues. The detailed information found in these reports offers the richest value that can be mined for current and future operations. Reviewing this information yields insights and cautions and points toward prevention and solution steps to take to avoid (or respond to) a similar problem at other sites.
Figure 5. The detail page of the abnormal incident report selected for analysis. This is where the “story” of our incident is told.
In this incident, the event occurred when the UPS vendor technician opened the output breaker before bringing the module to bypass, causing a loss of power and dropping the load. This seemingly small but crucial error in communication and timing interrupted critical production operations—a downtime event.
Back-up systems and safeguards, training, procedures, and precautions, detailed documentation, and investments in redundant equipment—all are in vain the moment there is a critical load loss. The very rationale for the site having a UPS was negated by one error. However, this site’s hard lesson can be of use if other data center operators learn from the mistake and use this example to shore up their own processes, procedures, and incident training. Data center operators do not have to witness an incident to learn from it; the AIRs database opens up this history so that others may benefit.
As the incident unfolded, the vendor quickly reset the breaker to restore power, as instructed by the facility technician. Subsequently, to prevent this type of incident from happening in the future, the site:
• Created a more detailed method of procedure (MOP) for UPS maintenance
• Placed warning signs near the output breaker
• Placed switch tags at breakers
• Instituted a process improvement that now requires the presence of two technicians, an MOP supervisor and an MOP performer, with both technicians required to verify each step
These four steps are Uptime Institute-recommended practices for data center operations. However, this narrative raises the question of how many sites have actually made the effort to follow through on each of these elements, checked and double-checked, and drilled their teams to respond to an incident like this. Today’s operators can use this incident as a cautionary tale to shore up efforts in all relevant areas: operational documentation creation/improvement, planned maintenance process development/improvement, training, and PM program improvement.
Operational Documentation Creation/Improvement
In this hypothetical, the site added content and detail to its MOP for UPS maintenance. This can inspire other sites to review their UPS bypass procedures to determine if there is sufficient content and detail.
The consequences of having too little detail are obvious. Having too much content can also be a problem if it causes a technician to focus more on the document than on the task.
The AIR in this hypothetical did not say whether facility staff followed an emergency operating procedure (EOP), so there is not enough information to say whether they handled it correctly. This event may never happen in this exact manner again, but anyone who has been around data centers long enough knows that UPS output breakers can and will fail in a variety of unexpected ways. All sites should examine their EOP for an unexpected failure or trip of a UPS output breaker.
In this incident, the technician reclosed the breaker immediately, which is an understandable human reaction in the heat of the moment. However, this was probably not the best course of action. Full system start-up and shutdown should be orderly affairs, with IT personnel fully informed, if not present as active participants. A prudent EOP might require recording the time of the incident, following an escalation tree, gathering white space data, and confirming redundant equipment status, along with additional steps before undertaking a controlled, fully scripted restart.
Another response to failure was the addition of warning signs and improved equipment labeling as improvements to the facility’s site configuration procedures (SCPs). This change can motivate other sites to review their nomenclature and signage. Some sites include a document that gives expected or steady state system/equipment information. Other sites rely on labeling and warning signs or tools like stickers or magnets located beside equipment to indicate proper position. If a site has none of these safeguards in place, then assessment of this incident should prompt the site team to implement them.
Examining AIRs can provide specific examples of potential failure points, which can be used by other sites as a checklist of where to improve policies. The AIRs data can also be a spur to evaluate whether site policies match practices and ensure that documented procedures are being followed.
Planned Maintenance Process Improvement
After this incident, the site that submitted the AIR incident report changed its entire methodology for performing procedures. Now two technicians must be present, each with strictly defined roles: one technician reads the MOP and supervises the process, and the second technician verifies, performs, and confirms. Both technicians must sign off on the proper and correct completion of the task. It is unclear whether there was a change in vendor privileges.
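One way to picture the two-person rule is as a dual sign-off recorded against every step of the MOP. The sketch below is a hypothetical illustration of that record-keeping, not the submitting site’s actual system; the step wording and names are invented.

# Illustrative record of a two-person MOP execution: every step needs both a
# performer and a supervisor sign-off. Step text and names are invented.
mop_steps = [
    {"step": "Verify UPS is carrying load on inverter", "performer": None, "supervisor": None},
    {"step": "Transfer UPS to maintenance bypass",      "performer": None, "supervisor": None},
    {"step": "Confirm load has transferred to bypass",  "performer": None, "supervisor": None},
]

def sign_off(step, performer, supervisor):
    # Both roles must sign before the step counts as complete.
    step["performer"], step["supervisor"] = performer, supervisor

sign_off(mop_steps[0], performer="J. Doe", supervisor="J. Smith")

pending = [s["step"] for s in mop_steps if not (s["performer"] and s["supervisor"])]
print("Steps still awaiting dual sign-off:", pending)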
When reviewing AIRs as a learning and improvement tool, facilities teams can benefit by implementing measures that are not already in place, or any improvements they determine they would have made had a similar incident occurred at their site. For example, a site may conclude that configuration changes should be reserved only for those individuals who:
• Have a comprehensive understanding of site policy
• Have completed necessary site training
• Have more at stake for site performance and business outcomes
Training
The primary objective of site training is to increase adherence to site policy and knowledge of effective mission critical facility operations. Incorporating information gleaned from AIRs analysis helps maximize these benefits. Training materials should be geared to ensure that technicians are fully qualified to utilize their skills and abilities to operate the installed infrastructure within a mission critical environment, not to certify electricians or mechanics. In addition, training provides an opportunity for professional development and interdisciplinary education among the operations team, which can help enterprises retain key personnel.
The basic components of an effective site-training program are an instructor, scheduled class times that can be tracked by student and instructor, on-the-job training (OJT), reference material, and one or more metrics for success.
With these essentials in place, the documentation and maintenance process improvement steps derived from the AIR incident report can be applied immediately for training. Newly optimized SOPs/MOPs/EOPs can be incorporated into the training, as well as process improvements such as the two-person rule. Improved documentation can be a training reference and study material, and improved SCPs will reduce confusion during OJT and daily rounds. Training drills can be created directly from real-world incidents, with outcomes not just predicted but also chronicled from actual events. Trainer development is enhanced by the involvement of an experienced technician in the AIR review process and creation of any resulting documentation/process improvement.
Combining AIRs data with existing resources enables sites to take a systematic approach to personnel training, for example:
1. John Doe is an experienced construction electrician who was recently hired. He needs UPS bypass training.
2. Jane Smith is a facility operations tech/operating engineer with 10 years of experience as a UPS technician. She was instrumental in the analysis of the AIRs incident and consequent improvements in the UPS bypass procedures and processes; she is the site’s SME in this area.
3. Using a learning management system (LMS) or a simple spreadsheet, John Doe’s training is scheduled.
• Scheduled Class: UPS bypass procedure discussion and walk-through
• Instructor: Jane Smith
• Student: John Doe
• Reference material: the new and improved UPS BYPASS SOP XXXX_20150630, along with the EOP and SCP
• Metric might include:
o Successful simulation of procedure as a performer
o Successful simulation of procedure as a supervisor
o Both of the above
o Successful completion of procedure during a PM activity
o Success at providing training to another technician
Drills for both new trainees and seasoned personnel are important. Because an AIRs-based training exercise is drawn from an actual event, not an imaginary scenario, it lends greater credibility to the exercise and validates the real risks. Anyone who has led a team drill has probably encountered that one participant who questions the premise of a procedure or suggests a different one. Far from being a roadblock to effective drills, that participant is actively engaged and can contribute to program improvement by helping create drills and assess AIRs scenarios.
PM Program Improvement
The goal of any PM program is to prevent the failure of equipment. The incident detailed in the AIR report was triggered by a planned maintenance event, a UPS fan replacement. Typically, a fan replacement requires that the system be put on bypass, as do annual PM procedures. Since any change of equipment configuration, such as changing a fan, introduces risk, it is worth asking whether predictive/proactive fan replacement performed during PM makes more sense than awaiting fan failure. The risk of configuration change must be weighed against the risk of inaction.
Figure 6. Our incident was not caused by a UPS fan failure but occurred as a result of human error during the fan’s replacement. So how many AIRs involving fan failures for our manufacturer/model of UPS exist within our database? The filters shown here were chosen to obtain this information.
Examining this and similar incidents in the AIRs database yields information about UPS fan life expectancy that can be used to develop an “evidence-based” replacement strategy. Start by searching the AIRs database for the keyword “fan” using the same dates, manufacturer, and model criteria, with no filter for critical load loss (see Figure 6). This search returns eight reports with fan failure (see Figure 7). The data show that the average life span of the units with fan failure was 5.5 years. The limited sample size means that this result should not be relied on, but this experience at other sites can help guide planning. Less restrictive search criteria can return even more specific data.
Figure 7. A sample of the results (showing 10 of 25 reports returned) from the search described in Figure 6. Analysis of these incidents may help us determine and develop the best strategy for cooling fan replacement.
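As a sketch of that “evidence-based” arithmetic, assume each returned report yields the unit’s age at the time of the fan failure (the exact fields in a real AIR may differ). The eight ages below are hypothetical values chosen so the mean matches the 5.5-year figure in this case study.

# Illustrative average of UPS fan service life from AIRs search results.
# Ages (years from unit build to fan failure) are hypothetical.
fan_failure_ages_years = [4, 5, 5, 6, 6, 6, 5, 7]

average_life = sum(fan_failure_ages_years) / len(fan_failure_ages_years)
print(f"Average fan life at failure: {average_life:.1f} years "
      f"(n={len(fan_failure_ages_years)}; small sample, use as a guide only)")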
Additional Incidents Yield Additional Insight
The initial database search at the start of this hypothetical case returned seven AIRs in total. What can we learn from the other six? Three of the remaining six reports involved capacitor failures. At one site, the capacitor was 12 years old, and the report noted, “No notification provided of the 7-year life cycle by the vendor.” Another incident occurred in 2009, involving a capacitor with a 2002 manufacture date, which would match (perhaps coincidentally) a 7-year life cycle. The third capacitor failure was in a 13-year-old piece of equipment, and the AIR notes that it was “outside of 4–5-year life cycle.” These results highlight the importance of having an equipment/component life-cycle replacement strategy. The AIRs database is a great starting point.
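A component life-cycle register makes that strategy concrete: compare each component’s age against its stated life cycle and flag anything overdue. The sketch below is purely illustrative; the review year, component names, and install dates are invented, and the life-cycle figures simply echo those mentioned in the AIRs above.

# Illustrative life-cycle replacement check. All records are hypothetical;
# the life-cycle figures echo those cited in the capacitor AIRs above.
REVIEW_YEAR = 2016

components = [
    {"id": "UPS-A capacitor bank", "installed": 2004, "life_cycle_years": 7},
    {"id": "UPS-B capacitor bank", "installed": 2012, "life_cycle_years": 7},
    {"id": "UPS-A cooling fans",   "installed": 2013, "life_cycle_years": 5},
]

for part in components:
    age = REVIEW_YEAR - part["installed"]
    remaining = part["life_cycle_years"] - age
    if remaining <= 0:
        print(f"{part['id']}: {age} years old, {-remaining} year(s) past life cycle -- plan replacement")
    else:
        print(f"{part['id']}: {age} years old, {remaining} year(s) of life cycle remaining")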
A fourth AIR describes a driver board failure in a 13-year-old UPS. Driver board failure could fall into any of the AIR root-cause types. Examples of insufficient maintenance might include a case where the maintenance performed was limited in scope and did not consider end of life. Perhaps there was no procedure to diagnose equipment for a condition or measurement indicative of component deterioration, or maybe maintenance frequency was insufficient. Without further data in the report it is hard to draw an actionable insight, but the analysis does raise several important topics for discussion regarding the status of a site’s preventive and predictive maintenance approach. A fifth AIR involves an overload condition resulting from flawed operational documentation. The lesson there is obvious.
The last of the remaining six reports resulted from a lightning strike that made it through the UPS to interrupt the critical load. Other sites might want to check transient voltage surge suppressor (TVSS) integrity during rounds. With approximately 138,000,000 lightning strikes per year worldwide, any data center can be hit. A site can implement an EOP that dictates checking summary alarms, ensuring redundant equipment integrity, performing a facility walk-through by space priority, and providing an escalation tree with contact information.
Each of the AIRs casts light on the types of shortfalls and gaps that can be found in even the most capably run facilities. With data centers made up of vast numbers of components and systems operating in a complex interplay, it can be difficult to anticipate and prevent every single eventuality. AIRs may not provide the most definitive information on equipment specifications, but assessing these incidents provides an opportunity for other sites to identify potential risks and plan how to avoid them.
PURCHASING/EQUIPMENT PROCUREMENT DECISIONS
In addition to the operational uses described above, AIRs information can also support effective procurement. However, as with using almost any type of aggregated statistics, one should be cautious about making broad assumptions based on the limited sample size of the AIRs database.
For example, a search using simply the keywords ‘fan failure’ and ‘UPS’ could return 50 incidents involving Vendor A products and five involving Vendor B’s products (or vice versa). This does not necessarily mean that Vendor A has a UPS fan failure problem. The higher number of reported incidents could simply reflect Vendor A’s larger installed base and market share.
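A quick way to see why raw counts mislead is to normalize them by an estimate of each vendor’s installed base. The figures below are invented for illustration only; with them, the vendor with fewer reports actually shows the higher failure rate.

# Illustrative normalization of raw AIRs counts by estimated installed base.
# All figures are invented for illustration.
vendors = {
    "Vendor A": {"fan_failure_reports": 50, "estimated_installed_units": 40000},
    "Vendor B": {"fan_failure_reports": 5,  "estimated_installed_units": 1500},
}

for name, stats in vendors.items():
    rate_per_1000 = 1000 * stats["fan_failure_reports"] / stats["estimated_installed_units"]
    print(f"{name}: {stats['fan_failure_reports']} reports, "
          f"~{rate_per_1000:.1f} per 1,000 installed units")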
Further, one must be careful of jumping to conclusions regarding manufacturing defects. For example, the first AIR incident report made no mention of how the UPS may (or may not) have been designed to help mitigate the risk of the incident. Some UPS modules have HMI (human machine interface) menu-driven bypass/shutdown procedures that dictate action and provide an expected outcome indication. These equipment options can help mitigate the risk of such an event but may also increase the unit cost. Incorporating AIRs information as just one element in a more detailed performance evaluation and cost-benefit analysis will help operators accurately decide which unit will be the best fit for a specific environment and budget.
LEARNING FROM FAILURES
If adversity is the best teacher, then every failure in life is an opportunity to learn, and that certainly applies in the data center environment and other mission critical settings. The value of undertaking failure analysis and applying lessons learned to continually develop and refine procedures is what makes an organization resilient and successful over the long term.
To use an example from my own experience, I was working one night at a site when the operations team was transferring the UPS to maintenance bypass during PM. Both the UPS output breaker and the UPS bypass breaker were in the CLOSED position, and they were sharing the connected load. The MOP directed personnel to visually confirm that the bypass breaker was closed and then directed them to open the UPS output breaker. The team followed these steps as written, but the critical load was dropped.
Immediately, the team followed EOP steps to stabilization. Failure analysis revealed that the breaker had suffered internal failure; although the handle was in the CLOSED position, the internal contacts were not closed. Further analysis yielded a more detailed picture of events. For instance, the MOP did not require verification of the status of the equipment. Maintenance records also revealed that the failed breaker had passed primary injection testing within the previous year, well within the site-required 3-year period. Although meticulous compliance with the site’s maintenance standards had eliminated negligence as a root cause, the operational documentation could have required verification of critical component test status as a preliminary step. There was even a dated TEST PASSED sticker on the breaker.
Indeed, eliminating key gaps in the procedures would have prevented the incident. As stated, the breaker appeared to be in the closed position as directed, but the team had not followed the load during switching activities (i.e., had not confirmed the transfer of the amperage to the bypass breaker). If we had done so, we would have seen the problem and initiated a back-out of the procedure. Subsequently, these improvements were added to the MOP.
FLASH REPORTS
Flash reports are a particularly useful AIRs service because they provide early warning about incidents identified as immediate risks, with root causes and fixes to help Network members prevent a service interruption. These reports are an important source of timely front-line risk information.
For example, searching the AIRs database for any FLASH AIR since 2005 involving one popular UPS model returns two results. Both reports detailed a rectifier shutdown as a result of faulty trap filter components; the vendor consequently performed a redesign and recommended a replacement strategy. The FLASH report mechanism became a crucial channel for communicating the manufacturer’s recommendation to equipment owners. Receiving a FLASH notification can spur a team to check maintenance records and consult with trusted vendors to ensure that manufacturer bulletins or suggested modifications have been addressed.
When FLASH incidents are reported, Uptime Institute’s AIRs program team contacts the manufacturer as part of its validation and reporting process. Uptime Institute strives for and considers its relationships with OEMs (original equipment manufacturers) to be cooperative, rather than confrontational. All parties understand that no piece of complex equipment is perfect, so the common goal is to identify and resolve issues as quickly and smoothly as possible.
CONCLUSION
It is virtually impossible for an organization’s site culture, procedures, and processes to be so refined that there are no details left unaddressed and no improvements that can be made. There is also a need to beware of hidden disparities between site policy and actual practice. Will a team be ready when something unexpected does go wrong? Just because an incident has not happened yet does not mean it will not happen. In fact, if a site has not experienced an issue, complacency can set in; steady state can get boring. Operators with foresight will use AIRs as opportunities to create drills and get the team engaged with troubleshooting and implementing new, improved procedures.
Instead of trying to ferret out gaps or imagine every possible failure, the AIRs database provides a ready source of real-world incidents to draw from. Using this information can help hone team function and fine tune operating practices. Technicians do not have to guess at what could happen to equipment but can benefit from the lessons learned by other sites. Team leaders do not have to just hope that personnel are ready to face a crisis; they can use AIRs information to prepare for operating eventualities and to help keep personnel responses sharp.
AIRs is much more than a database; it is a valuable tool for raising awareness of what can happen, for mitigating the risk that it will happen, and for preparing an operations team for when it does. With uses that extend to purchasing, training, and maintenance activities, the AIRs database truly is Uptime Institute Network members’ secret weapon for operational success.
Ron Davis
Ron Davis is a consultant for Uptime Institute, specializing in Operational Sustainability. Mr. Davis brings more than 20 years of experience in mission critical facility operations, having supported data center portfolios in roles including facility manager, management and operations consultant, and central engineering subject matter expert. Mr. Davis manages the Uptime Institute Network’s Abnormal Incident Reports (AIRs) database, performing root-cause and trend analysis of data center outages and near outages to improve industry performance and vigilance.