Certifying the Tier performance of a data center… Remotely? Yes!

Mitigating Risk Now with Remote Delivery of Tier Certifications

COVID-19 has changed the way we go about our lives and how we approach our work. We found ourselves—almost overnight—not traveling, not going into an office, figuring out how to stay home most of the time while continuing to work, and wondering why a respiratory infection requires so much toilet paper. We all faced, and still face, significant impact to all aspects of our lives. And companies learned quickly that they are not immune. Organizations and their approach to their business also changed practically overnight. Of course, Uptime Institute faced the same situation. We wondered how we were going to continue to support and positively impact the data center industry without the opportunity to meet face-to-face with clients to help them work through their challenges.

Remote Capabilities
Uptime Institute has spent decades helping the data center industry understand how best to design, build, and operate data centers to provide the availability required to meet unique business objectives. We reacted to the COVID-19 challenge by figuring out how to deliver value to our clients remotely. We have always been able to remotely review a design package for a new data center build or upgrade and provide insight into how that design meets Tier requirements, how easy it would be to operate, and whether it has the support spaces and features to provide for effective operations. Our Tier Certification of Design Documents has always been performed remotely, so continuing that was a no-brainer. However, we often sit down with clients and work through design challenges in Working Sessions. Now, we use video conferencing in numerous areas to meaningfully connect with clients to discuss those challenges. We realized early on that focusing on a video screen for hours on end was challenging and tiresome, so we redesigned our Working Sessions to work more efficiently, with two half days of intensive discussion instead of one full day. We have all learned more about the need for creativity and flexibility during this pandemic.

It was obvious to us that design packages could be effectively reviewed remotely, but then we asked ourselves, “What’s the best way to witness demonstrations to confirm that a final completed facility was built according to the design documents and performs to Tier requirements?” We knew a remote delivery of a Tier Certification of Constructed Facility would involve video conferencing systems, but the challenge was to figure out precisely how to capture the proper perspective of what is going on in a data center to demonstrate that it is performing in a correct and meaningful way.

As you might realize, watching a video feed has limitations, since the peripheral view is narrower than normal and is limited by the operator of the camera. To solve that issue, we worked closely with our clients to fine-tune an approach that involves multiple cameras, audio channels, and access to the Building Management System or DCIM, along with some important preplanning. Another tricky part is that it is far easier and more effective to talk things through when working with someone in person. For some reason, it does seem a bit less effective over the phone or a video conferencing system.

It’s in the Angle
That tricky part is just one of the reasons the preplanning of the demonstrations is so critical. This requires the client to have documented and thought through the procedures for performing all demonstrations. This upfront work pays dividends for operations, because these same procedures will later be used to configure the systems for maintenance, and the demonstrations put them to the test. This gives operations personnel hands-on experience with the procedures during the demonstrations. An additional preplanning step is a dry run of the video and audio communications channels. This way everyone has experience with the setup, since it is so very different in look and feel for both the Uptime Institute team and our clients.

The demonstrations are then performed according to the plan and scripts. Uptime Institute consultants witness remotely, and the audio channel allows us to ask questions, indicate where we need a better camera angle or view, and request a correction or a closer look if a step has been missed.

Site Visits Still Needed
Even though the demonstrations can be witnessed remotely, the Tier Certification of Constructed Facility still requires a site visit. We need to put eyes on the systems. We understand and appreciate that the industry places its trust in us, and when we say something is Tier III compliant, we mean it is Tier III compliant. With that always in mind, we want to verify in person what we saw live online. This includes verifying that the facility we saw on-camera is the correct one, and that nothing off-camera impacts compliance. This is something that simply cannot be fully verified remotely.

The other site component we now can begin remotely is data center operations, which we have traditionally delivered by a site visit to review the processes and procedures for a site, interview personnel to ensure they are aware of the processes, know what they are and where to find them, as well as regularly and completely follow through with the processes. These are delivered via Tier Certification of Operational Sustainability and the Management and Operations Stamp of Approval. Many aspects of these reviews are now delivered remotely through a document review of the processes and procedures along with video conference interviews of key personnel. This is followed up by a site visit to ensure the plans are being effectively put into practice, and that the results are evident. The remote delivery of a TCOS or M&O provides additional value as our clients receive initial feedback on their operational plans. Then when we do arrive onsite (after travel is possible again) there is an opportunity to review any updates to help identify and solve any problems or issues that have occurred in the meantime.

Why Now?
I know most data centers are under increased demand and scrutiny during the uncertain times we are all now facing. Most have changed operations to help ensure that staff are social distancing and to protect relief personnel from exposure to COVID-19, allowing them to be ready to step in as needed. Some ask us, why should they take the time out now to work with Uptime Institute to evaluate their data centers and/or operations?

Why? If you think about it, there are two basic responses to the situation we find ourselves in today. The first is to keep doing what you are doing and to refrain from introducing any more distractions than necessary. The other is to evaluate your situation, identify the risks, and start to mitigate them, or at minimum plan for that risk mitigation. Taking the time to engage with us gives data center owners and operators the opportunity to learn where they have gaps in their design and/or operations, and lets them better plan which corrective measures to take as soon as possible. This way they are ready and prepared to execute on that plan to improve their situation quickly. And, just maybe, the next global, or local, dilemma is not quite so disruptive as this one!

Uptime Institute’s 10th Annual Data Center Survey is here!

It’s HERE!!! Every year, Uptime Institute reaches out to thousands of industry leaders, enterprises and suppliers to ask them about their view of where the data center and digital infrastructure industry is going and what kinds of challenges they are dealing with today and expect in the future. We ask about trends and migrations, tactics and strategies, challenges and successes. This survey is the most comprehensive and longest-running of its kind and is used by thousands of companies worldwide to influence their own IT choices and directions. The results may also make you think about some of your long-held ideas, which may be becoming widely accepted or may no longer be accurate or strategic.

This year, more than 1300 people responded. The 10th annual survey was conducted in the spring of 2020 and the results provide an overview of the practices, experiences and underlying trends in the mission-critical digital infrastructure industry, today and in the future.

I encourage you to listen to the entire narrative by Andy Lawrence, head of our Uptime Intelligence research group. Andy will describe a sector that is grappling with a number of difficult issues, including the increasing scope and frequency of outages, the actual rate at which migration to cloud is occurring, the increasing complexity of hybrid infrastructures and their resilience, and the resulting performance expectations.

The survey also confirms that it is an industry that is growing in absolute size and one that is adapting to rapid change on multiple levels. In almost every area under discussion (outages, resiliency, staffing, placement of workloads, deployment of innovation and the use of cloud) there is considerable diversity in the strategies being employed by the professionals chartered to deliver business-critical results. And perhaps most importantly, respondents overwhelmingly agree that they have the ability to address each of these challenges IF THEY CHOOSE TO take the required steps and actions! (For example, outages are preventable if more attention is paid to operational planning, staffing shortages can be addressed if operators widen their searches, and data centers can withstand more stress if they are designed and verified properly.)

Key findings:

  • The enterprise data center is neither dead nor dying. The migration of critical loads to a public cloud is happening slowly, with more than half of workloads expected to remain in on-premises data centers in 2022.
  • Transparent clouds are good for business. Cloud operators would win more mission-critical business if they were more open. Enterprises want greater visibility into facilities and how resiliency is achieved.
  • Edge is still on the edge. Most organizations expect their edge computing requirements to increase somewhat in 2020, but fewer than 20% expect a significant increase.
  • Average site energy efficiency has flatlined. Power usage effectiveness values have not improved much across the industry since 2013. But because more work is now done in big, efficient facilities, the overall energy efficiency of IT has improved.
  • Rack densities are rising, but facilities are not stretched. The mean average density for 2020 was 8.4 kilowatts per rack. Densities are rising, but not enough to drive wholesale site-level changes in power distribution or cooling technologies.
  • Bigger outages are becoming more painful. Outages generally continue to occur with disturbing frequency, and the bigger outages are becoming more damaging and expensive — a fact supported by Uptime Institute survey findings for three years running.
  • Operators admit most outages were their fault. Three-quarters of respondents admit that, in hindsight, their most recent major outage was preventable. With more attention and investment, outage frequency would almost certainly fall significantly.
  • Power problems are still the biggest cause of major outages. Systems/software and networks may be catching up, but power failures — which impact everything on-site and can cause knock-on effects — are the most likely cause of major outages.
  • Hardware refreshes are less frequent. Operators are upgrading or replacing their servers less frequently. However, the slowdown in Moore’s law means the potential energy savings from frequent refreshes are no longer very significant.
  • The data center staffing crisis is getting worse. The portion of managers saying they have difficulty finding qualified candidates for open jobs has risen steadily over the past several years.
  • Artificial intelligence won’t take over … yet. Artificial intelligence and automation will not reduce data center operations staffing requirements in the next five years, according to the majority of respondents. After that, however, most think it will.
  • Water use is unmetered by many. Despite the growing threat of water scarcity, only half of respondents say their organization collects water usage data for their IT/data center operations.
  • More work is needed to address the workforce gender imbalance. The proportion of women in the data center industry remains very low. Despite pressure and good intent, relatively few operators have a plan or initiative in place to boost the hiring of women.
  • Use of availability zones is now mainstream. The use of multi-data center availability zones is now common beyond hyperscale operators, with half of respondents saying they use this approach.


The pandemic, outages and the internet giants

In a recent Uptime Institute Intelligence analysis we considered a question that Uptime Institute has been asked many times since COVID-19 lockdowns began: Has the pandemic caused any increase in outages? The question arose because the pandemic has caused staff shortages, extended shifts, delays to maintenance, and a shortage of parts for at least some operators. In theory, any of these factors could contribute to more outages.

We also noted considerable speculation in the media about the internet giants, which have seen some dramatic changes to traffic and workload patterns during lockdowns in various regions.

In April, based on a survey and other evidence, we concluded that there may indeed have been a small increase in outages, although it is not always possible to definitively ascribe the cause to the pandemic. There were roughly eight outages, about 4% of the sample, that were related to COVID-19.

In mid-July, we repeated our research. Again, the number of those with an outage said to be related to COVID-19 was in the 3-4% range. In this survey, a similar percentage said COVID-19 contributed to an IT service slowdown. Not dramatic, but significant. For context, there were two to three times as many outages over the period that were not COVID-19 related (per survey findings). However, as we noted before, an outage caused by human error can’t necessarily be ascribed to the pandemic (e.g., to tiredness or unfamiliar duties).

And what of the internet giants/cloud companies, which deploy architectures based on sharing loads across multiple data centers (within and between regional availability zones)? These companies make use of the natural, distributed resiliency of the internet, but at the same time experience great changes in traffic flows as worker (and machine) behaviors change.

Now that the first half of 2020 has ended, we have compared the prevalence and impact of publicly reported outages in the cloud/internet giant and digital service provider groups against their 2019 performance.

As shown in the figure above, the patterns are consistent with our past reporting: the number of publicly recorded outages by cloud/internet giants is holding steady or increasing, and most outages are minor. Although the category of “serious/severe” outages is likely to jump this year (the half year for 2020 equals the full year for 2019), there has been a strong increase in all outages every year since we began tracking public outages in 2016.

As the pandemic forced changes on businesses — notably, a shift to remote working and greater online service delivery — many increased their dependence on cloud and digital service providers. There have been a few well-publicized outages (e.g., Google Cloud, Zoom, IBM Cloud) and/or instances of capacity constraints (e.g., Microsoft Azure), but overall, these cloud and digital service providers appear to have responded well to the stresses of the pandemic.

Best-in-class data center provisioning: Simplify, Standardize, Repeat

Demand for IT capacity continues to grow rapidly across the globe, which has driven the need for more industrialized approaches to data center provisioning including construction and component assembly. Large operators and their partners have scrambled to apply new processes and disciplines, expand and re-organize supply chains, deploy prefabricated components, and, where possible, reduce cost overheads, variation and complexity.

These approaches have led to dramatically shortened provisioning times in recent years. Globally, the average time to provision a new large data center (20 megawatts or more), following best practices, is just nine to 10 months, according to research by Uptime Institute. Some are provisioned in as little as six months, an incredible achievement given the multiyear timelines of a decade or two ago.

These approaches have also led to lower capital costs (on average). The money required to build a new large data center (20 MW or more), following best practices, has fallen to $7-8 million per megawatt (global average). Some are even able to achieve this for less, as low as $3.6 million per megawatt. This is also a significant improvement compared with a decade ago, when $12 million per megawatt was common for data centers typically in the 3-15 MW range. Today, the cost of building a medium-sized data center (5-19.9 MW) can be similar to that of a large data center if best practices are followed.
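
To put the per-megawatt figures above in concrete terms, the following minimal Python sketch multiplies capacity by cost per megawatt. The function name `capex_musd`, the dictionary keys and the 20 MW example facility are illustrative assumptions, and the linear model ignores the site-specific factors discussed later in this article; only the per-megawatt rates come from the text.

```python
# Per-megawatt capital costs quoted in the text, in millions of US dollars.
# The dictionary keys are hypothetical labels chosen for this sketch.
COST_PER_MW_MUSD = {
    "best_practice_low": 7.0,   # global average, lower bound
    "best_practice_high": 8.0,  # global average, upper bound
    "best_case": 3.6,           # lowest figure cited
    "decade_ago": 12.0,         # common a decade earlier (3-15 MW sites)
}


def capex_musd(capacity_mw: float, cost_per_mw_musd: float) -> float:
    """Return an estimated build cost in millions of USD (simple linear model)."""
    if capacity_mw <= 0 or cost_per_mw_musd <= 0:
        raise ValueError("capacity and cost per MW must be positive")
    return capacity_mw * cost_per_mw_musd


if __name__ == "__main__":
    capacity = 20.0  # hypothetical 20 MW facility
    for label, rate in COST_PER_MW_MUSD.items():
        print(f"{label}: ${capex_musd(capacity, rate):.0f} million")
```

Under these assumptions, a 20 MW build at the global average falls between $140 million and $160 million, compared with $240 million at the older $12 million per megawatt rate.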

Factors that affect speed

Achieving the best-case provisioning speed relies on the use of prefabricated systems across all areas and on having access to experienced builders doing repeat builds of standardized construction approaches, supported by a strong local supply chain.

The need to quickly mobilize many construction workers for a brief project at a reasonable price can be a management and logistical challenge — one that requires the availability of considerable local presence and expertise. The workforce will need experience doing repeat builds of standardized construction approaches and, ideally, the specific building system to be installed (e.g., precast concrete, steel).

Most builders have adopted a standard power increment, typically between 1.5 MW and 10 MW, that is repeated in multiples to achieve the total power delivered in a project. Repeating standard configurations, enabling parallel installs and using the same team of experienced workers simplifies material supply, leads to process improvements and reduces provisioning time.
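
The idea of repeating a standard power increment can be sketched as a simple calculation: how many identical blocks are needed to reach a project's total power. The function name `increments_needed` and the example sizes are assumptions for illustration; only the 1.5 MW to 10 MW range comes from the text.

```python
import math


def increments_needed(total_mw: float, increment_mw: float) -> int:
    """Return how many identical power increments are needed to reach total_mw."""
    if total_mw <= 0 or increment_mw <= 0:
        raise ValueError("total and increment must be positive")
    # Round first so floating-point noise does not add a spurious extra block.
    return math.ceil(round(total_mw / increment_mw, 9))


if __name__ == "__main__":
    # Hypothetical 20 MW project built from increments within the cited range.
    for increment in (1.5, 2.5, 5.0, 10.0):
        count = increments_needed(20.0, increment)
        print(f"{increment} MW increment -> {count} blocks")
```

When the total is not an exact multiple of the increment, the last block is only partly used, which is one reason a builder might match the increment size to typical project sizes.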

Factors that affect cost

Data center provisioning capital expenses will vary among projects, even when using best practices. This is due to differences in site conditions and location, access to specialist workforces, material and worker transportation, and a range of other variables related to the region (e.g., regulations, seismic activity, supply chains, climate) and facility specifications.

One important aspect that drives cost is power density of the IT racks. Higher density can drive down overall costs. While there could be some trade-offs in electrical equipment and cooling, the main impact of higher density is reduced floor space/building shell, resulting in overall lower building costs.
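
As a rough illustration of why density reduces floor space, the sketch below counts the racks needed to host a given IT load at different densities. The 8.4 kW per rack value is the 2020 survey mean cited earlier in this article; the 1,000 kW load, the doubled density and the function name `racks_required` are hypothetical. Floor area per rack is not given in the text, so the sketch stops at rack counts.

```python
import math


def racks_required(it_load_kw: float, density_kw_per_rack: float) -> int:
    """Return the number of racks needed to host it_load_kw at a given density."""
    if it_load_kw <= 0 or density_kw_per_rack <= 0:
        raise ValueError("load and density must be positive")
    # Round first so floating-point noise does not add a spurious extra rack.
    return math.ceil(round(it_load_kw / density_kw_per_rack, 9))


if __name__ == "__main__":
    load_kw = 1000.0  # hypothetical IT load
    for density in (8.4, 16.8):  # survey mean and a hypothetical doubling
        print(f"{density} kW/rack -> {racks_required(load_kw, density)} racks")
```

Doubling the density halves the rack count for the same load, and fewer racks mean less floor space and building shell.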

Custom requirements can also affect both capex and opex. For example, large cloud providers that design their own server hardware can deviate from ASHRAE rules, allowing a wider thermal envelope (usually hotter) and lower opex due to reduced cooling (and thus power) use. This modification may mean lower capex in terms of equipment but may increase some build costs.

The key to building data centers — regardless of size — for the lowest possible cost is similar to the key to building them at the fastest possible speed: repeat proven, standard approaches. Hire the same experienced crews. Consistently use the same engineering, techniques and technologies. Larger projects can reach slightly lower price points by purchasing equipment in larger volumes and by splitting fixed costs over additional megawatts.

There is no single technology or practice that can be credited — these faster, cheaper builds are achieved through a combination of technologies and practices. Data center capacity providers, such as colocation and wholesale leasing companies, that are building large data centers very quickly are competitively positioned to attract large-scale cloud customers — and are likely to continue to refine their processes and approaches to achieve shorter provisioning times, at even lower cost, in the future.

Cleaning a data center: Contractors vs. DIY

Modern data centers are rarely dirty places, but even so, most are a lot cleaner now than they were before COVID-19 became a concern. A recent Uptime Institute survey, conducted in response to the pandemic, found that about two-thirds (68%) of data center owners/operators recently deep cleaned their facilities, and more than 80% recently sanitized them.

But what exactly does it mean to clean a data center, and who is actually skilled to do it? In a data center, deep cleaning is the removal of particles, static and residue from all vertical and horizontal surfaces, as well as from plenum and subfloor spaces. This requires vacuums with high-efficiency particulate air (HEPA) filters to prevent particles as small as 0.5 microns from spreading and damaging servers and other sensitive gear. Sanitization (disinfection) is intended to kill 99.9999% of biological matter in the space, except spores. To eliminate the coronavirus that causes COVID-19 from the data center environment, both processes must be performed.

Much of the above-mentioned work was performed by specialty cleaners, contracted by data center owners and operators. Surprisingly, Uptime Institute has found that, despite high levels of cleaning and sanitization activity due to coronavirus precautions, specialist companies report that they still have availability, reducing the need for data center owners and operators to take a do-it-yourself (DIY) approach. Even data center cleaners located in the New York metropolitan area (a data center and COVID-19 hotspot) say they would be able to provide services for a new client in 2-3 days, or even faster if an urgent situation developed.

Specialists say an initial treatment of the entire facility as protection against the spread of COVID-19 is a reasonable approach in many data centers, but it is not cheap. The cost of having a specialized contractor clean and sanitize a data center can become prohibitive if repeated deep cleanings and sanitizations are required. The bill — as much as $100,000 per cleaning for a 100,000 square-foot facility (add another 20% for sanitization) — can rise quickly, especially if repeated work is required (as may be common in high-traffic areas).
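
The figures above work out to an upper bound of about $1.00 per square foot for cleaning, or $1.20 with sanitization. The sketch below shows how that bill grows with repeated treatments. It assumes cost scales linearly with floor area and that every visit costs the same, neither of which the text confirms; the function name `max_cleaning_cost_usd` is hypothetical.

```python
# Upper-bound rate from the text: "as much as" $100,000 per 100,000 sq ft.
MAX_CLEANING_USD_PER_SQFT = 100_000 / 100_000  # $1.00 per square foot
SANITIZATION_SURCHARGE = 0.20  # "add another 20% for sanitization"


def max_cleaning_cost_usd(area_sqft: float, cleanings: int = 1,
                          sanitize: bool = True) -> float:
    """Return an upper-bound contractor bill, assuming linear scaling by area."""
    if area_sqft <= 0 or cleanings < 1:
        raise ValueError("area must be positive and cleanings at least 1")
    per_visit = area_sqft * MAX_CLEANING_USD_PER_SQFT
    if sanitize:
        per_visit *= 1 + SANITIZATION_SURCHARGE
    return per_visit * cleanings


if __name__ == "__main__":
    # A 100,000 sq ft facility cleaned and sanitized once, then four times.
    for visits in (1, 4):
        cost = max_cleaning_cost_usd(100_000, cleanings=visits)
        print(f"{visits} visit(s): up to ${cost:,.0f}")
```

In practice, repeat work may be confined to high-traffic areas, so the area passed in for follow-up visits could be much smaller than the whole facility.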

Some data center owners/operators take a DIY approach in an attempt to reduce these costs, using facility staff. Even specialized cleaning contractors agree that trained and certified (ISO 14644) personnel can successfully clean and sanitize a facility, but the task is not as straightforward as one might think: Staff must have the proper cleaning equipment; correctly use relevant personal protective equipment (PPE), such as disposable gloves and masks; and have sufficient quantities of appropriate materials for cleaning and sanitization. Some of the most common and easiest-to-use products have been in great demand, so DIYers may struggle to obtain supplies. Cleaning specialists, however, should have sufficient inventory. In addition, personnel must be aware of specific treatments for the coronavirus, as they may vary in some ways from regular cleaning procedures.

Untrained staff may make matters worse if they fail to wear the proper PPE or do not follow procedures exactly. They could overlook server air intakes or disperse particulates into the air, and even trained personnel could accidentally damage or disconnect IT equipment.

Companies that wish to conduct their own cleaning should be aware that the US Environmental Protection Agency (EPA) has published and regularly updates a list of chemicals effective against COVID-19, called List N: Disinfectants for Use Against SARS-CoV-2 (COVID-19). This appears to be the most comprehensive such list in the world.

Many regions and municipalities also maintain similar lists; for example, the UK publishes an informative webpage called the Regulatory status of equipment being used to help prevent coronavirus (COVID-19).

Instructions on the use of chemicals and materials can be confusing and even conflicting. For example, in some cases — but not all — diluted bleach may be an appropriate disinfectant, yet Dell warns enterprises against its use, as well as against the use of peroxides, solvents, ammonia, and ethyl alcohol. Instead, Dell recommends use of a microfiber fabric moistened with 70% isopropyl alcohol by volume.

If, after all these red flags, a data center owner/operator still plans to clean their own facility, the following may help:

  • Remember to power down equipment where possible, according to manufacturer instructions and Methods of Procedure.
  • If equipment must remain operational while external surfaces are cleaned, use extreme caution in exposing powered equipment to any moisture. Take all proper and necessary precautions when handling powered equipment that has been exposed to moisture.
  • Cleaning must be limited to external surfaces such as handles and other common points of contact. Do not open cabinet and chassis doors or attempt to clean any internal components.
  • Fiber optics should not be removed for general purpose cleaning due to increased risk of debris contamination.
  • Never spray any liquids directly onto or into any equipment.
  • When cleaning a related display screen, carefully wipe in one direction, moving from the top of the display to the bottom.
  • If the equipment was powered down, all surfaces must be completely air-dried before powering up the equipment after cleaning. No moisture should be visible on the surfaces of the equipment before it is powered on or plugged in.
  • Use appropriate PPE and discard disposable items appropriately after use. Clean your hands immediately afterward.

COVID-19: Critical impact and legacy

What will be the long-lasting impact of the COVID-19 pandemic on the digital critical infrastructure industry? It may be too soon to ask the question given that, at the time of writing, the virus is taking its toll at scale across the world. But Uptime Institute has been asked this question many times, and it’s a discussion point at several upcoming (virtual) conferences.

In a recent Financial Times column, the economist Tim Harford speaks of “hysteresis,” a concept borrowed originally from electromagnetics. Hysteresis describes how some events have a long-lasting, lagging impact, even after the original cause of the change has long gone. Some of the effects of the coronavirus pandemic are elastic: industries, behaviors and business models, for example, are pulled around, but they will spring back to their original shape as the risk passes. But others behave more like plastic: once stretched or broken, they stay that way.

While it may be too soon to tell, many businesses and investors have already begun making bets that certain things have changed forever. Some companies are preparing for the sale of their headquarters; several have told their staff that, from now on, they are at-home workers. One conference company has moved its entire business online. And in the critical infrastructure world, we know of companies that have already begun to introduce permanent changes in the data center, including introducing large-scale automation programs (to reduce the need for on-site staff).

Uptime Institute has classed the evolution of the digital critical infrastructure industry’s response into three phases.

Phase 1

The first phase might be termed “reactive.” At this point, operators are on high alert, in firefighting or emergency response mode. Their biggest concerns are to understand the threat and to identify and implement best practices to decrease the risk to staff and to availability; high on the list are the need to source appropriate personal protective equipment and decontaminants, to clean facilities, and to operate with reduced staff and possible supply chain disruption. For most, this phase lasted a few weeks or more and has already passed. While the vast majority came through this well, about one in 20 data centers told Uptime Institute in a survey that they had a COVID-19 related outage. Others said they experienced IT service slowdowns, likely due to changing demand patterns or server/network maintenance problems.

Phase 2

The second phase is an interim normal. Most data center operators are in this period (which may last up to a year). In this phase, the virus is still widespread, but the threat is reduced as governments regain control of the spread. Processes to manage down risk that were introduced in the reactive phase have become established and now form the “new normal”; these include more remote working, blue/red operations teams, reduced maintenance, and greater redundancy. Most of these are process-based and do not involve long-term investments or strategic changes. Some examples of these activities are presented in the chart below.

During this phase, delayed projects are likely to be gradually restarted, but only as a managed risk; investment is curtailed unless clearly pulled by strong demand (which continues to drive new builds and investments). At this time, the management also prepares investment projects for the third phase — the next permanent normal.

Phase 3

The third stage of this cycle is the next normal. At this point, it is likely a vaccine has been found, treatments have improved, or the virus has been contained by social measures to the point of routine manageability. However, the world has been alerted to the possibility (likelihood?) of another pandemic, so long-term changes are likely, alongside the acceptance of other changes that proved effective, even though the danger has passed.

What will these be? Here, the answers are less clear. We think the following are highly likely:

  • Operators will incorporate pandemic planning (and drilling) into their business continuity/disaster recovery plans.
  • Governments will seek to have more insight and oversight of critical and near-critical infrastructure, including establishing key worker principles, etc. This may involve more certification.
  • Automation/remote management tools will receive a surge of investment, as operators seek to operate with fewer staff/no staff at times, or to monitor systems from safer or remote locations.
  • Greater management of the supply chain for key parts or services. This is likely to drive budgets up, especially if emergency cover agreements need to be put in place with services companies.

Here are some other possible developments, where the outcome is less clear:

  • A greater/faster move to cloud. This may happen, but this is a trend two decades old, and existing critical loads are being moved steadily and slowly, not rapidly. In addition, cloud may require increased, not decreased, funding.
  • A shift to edge. More home working, viewed by many as a trend that won’t be reversed, will lead to more work being done away from corporate offices and at edge locations.
  • A need for more staff. In spite of planned automation, data centers will, in the near term (to 2023?), need more staff to fulfill the requirement of maintaining separate teams.

One thing all analysts agree on: The pandemic has not slowed, and has probably accelerated, the adoption of digital services. This has raised both the demand for more data center services and the dependency on these services yet further.