The pandemic, outages and the internet giants

In a recent Uptime Institute Intelligence analysis we considered a question that Uptime Institute has been asked many times since COVID-19 lockdowns began: Has the pandemic caused any increase in outages? The question arose because the pandemic has caused staff shortages, extended shifts, delays to maintenance, and a shortage of parts for at least some operators. In theory, any of these factors could contribute to more outages.

We also noted considerable speculation in the media about the internet giants, which have seen some dramatic changes to traffic and workload patterns during lockdowns in various regions.

In April, based on a survey and other evidence, we concluded that there may indeed have been a small increase in outages, although it is not always possible to definitively ascribe the cause to the pandemic. Roughly eight outages, about 4% of the sample, were related to COVID-19.

In mid-July, we repeated our research. Again, the proportion of respondents reporting an outage related to COVID-19 was in the 3-4% range. In this survey, a similar percentage said COVID-19 contributed to an IT service slowdown. Not dramatic, but significant. For context, the survey found two to three times as many outages over the period that were not related to COVID-19. However, as we noted before, an outage caused by human error cannot always be ascribed to the pandemic (e.g., to tiredness or unfamiliar duties).

And what of the internet giants/cloud companies, which deploy architectures based on sharing loads across multiple data centers (within and between regional availability zones)? These companies make use of the natural, distributed resiliency of the internet, but at the same time experience great changes in traffic flows as worker (and machine) behaviors change.

Now that the first half of 2020 has ended, we have compared the prevalence and impact of publicly reported outages in the cloud/internet giant and digital service provider groups against their 2019 performance.

[Figure: Publicly reported outages among cloud/internet giants and digital service providers, 2016 to first-half 2020]

As shown in the figure above, the patterns are consistent with our past reporting: the number of publicly recorded outages at cloud/internet giants is holding steady or increasing, and most outages are minor. The "serious/severe" category is likely to jump this year (the first-half 2020 count already equals the full-year 2019 count), and there has been a strong increase in outages overall every year since we began tracking public outages in 2016.

As the pandemic forced changes on businesses — notably, a shift to remote working and greater online service delivery — many increased their dependence on cloud and digital service providers. There have been a few well-publicized outages (e.g., Google Cloud, Zoom, IBM Cloud) and/or instances of capacity constraints (e.g., Microsoft Azure), but overall, these cloud and digital service providers appear to have responded well to the stresses of the pandemic.

Best-in-class data center provisioning: Simplify, Standardize, Repeat

Demand for IT capacity continues to grow rapidly across the globe, driving the need for more industrialized approaches to data center provisioning, including construction and component assembly. Large operators and their partners have scrambled to apply new processes and disciplines, expand and reorganize supply chains, deploy prefabricated components, and, where possible, reduce cost overheads, variation and complexity.

These approaches have led to dramatically shortened provisioning times in recent years. Globally, the average time to provision a new large data center (20 megawatts or more), following best practices, is just nine to 10 months, according to Uptime Institute research. Some are provisioned in as little as six months — an incredible achievement given the multiyear timelines of a decade or two ago.

These approaches have also led to lower capital costs (on average). The money required to build a new large data center (20 MW or more), following best practices, has fallen to $7-8 million per megawatt (global average). Some are able to achieve even less — as low as $3.6 million per megawatt. This is also a significant improvement compared with a decade ago, when $12 million per megawatt was common for data centers typically in the 3-15 MW range. Today, the cost of building a medium-sized data center (5-19.9 MW) can be similar to that of a large data center if best practices are followed.
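The cost arithmetic above is easy to sanity-check. The sketch below uses the averages quoted in the text; the helper name and the 20 MW example size are ours:

```python
def build_capex(capacity_mw: float, cost_per_mw_musd: float) -> float:
    """Total build cost, in millions of USD, at a given unit cost per megawatt."""
    return capacity_mw * cost_per_mw_musd

# A 20 MW best-practice build today, at the midpoint of $7-8M per MW
today = build_capex(20, 7.5)   # $150M
# The same capacity at the $12M per MW common a decade ago
then = build_capex(20, 12)     # $240M

savings_pct = (then - today) / then * 100
print(f"${today:.0f}M today vs ${then:.0f}M a decade ago ({savings_pct:.1f}% lower)")
```

At the reported $3.6 million-per-megawatt floor, the same 20 MW build would come in at $72 million.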

Factors that affect speed

Achieving the best-case provisioning speed relies on the use of prefabricated systems across all areas and on having access to experienced builders doing repeat builds of standardized construction approaches, supported by a strong local supply chain.

The need to quickly mobilize many construction workers for a brief project at a reasonable price can be a management and logistical challenge — one that requires considerable local presence and expertise. The workforce needs experience doing repeat builds of standardized designs and, ideally, of the specific building system to be installed (e.g., precast concrete, steel).

Most builders have adopted a standard power increment, typically between 1.5 MW and 10 MW, that is repeated in multiples to achieve the total power delivered in a project. Repeating standard configurations, enabling parallel installs and using the same team of experienced workers simplifies material supply, leads to process improvements and reduces provisioning time.
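The increment scheme above amounts to simple block arithmetic. In this sketch, the 2.5 MW block size is an illustrative value from within the 1.5 MW to 10 MW range mentioned, not a recommendation:

```python
import math

def blocks_needed(total_mw: float, increment_mw: float) -> int:
    """Number of standard power blocks required to deliver the target capacity."""
    return math.ceil(total_mw / increment_mw)

# A 20 MW project delivered in 2.5 MW standard increments
print(blocks_needed(20, 2.5))  # 8 identical blocks, installable in parallel
```

Because every block is identical, material orders, commissioning scripts and crew schedules repeat eight times rather than being designed once per project.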

Factors that affect cost

Data center provisioning capital expenses will vary among projects, even when using best practices. This is due to differences in site conditions and location, access to specialist workforces, material and worker transportation, and a range of other variables related to the region (e.g., regulations, seismic activity, supply chains, climate) and facility specifications.

One important cost driver is the power density of the IT racks. Higher density can drive down overall costs. While there may be some trade-offs in electrical equipment and cooling, the main impact of higher density is a reduced floor space/building shell, resulting in lower overall building costs.
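A rough model shows why density shrinks the building shell. The 25 sq ft per-rack footprint below (rack plus aisle allocation) is our illustrative assumption, not an Uptime Institute figure:

```python
def whitespace_sqft(it_load_kw: float, kw_per_rack: float,
                    sqft_per_rack: float = 25.0) -> float:
    """Approximate white space: racks required times an assumed per-rack footprint."""
    racks = it_load_kw / kw_per_rack
    return racks * sqft_per_rack

# The same 10 MW IT load at two rack densities
print(whitespace_sqft(10_000, kw_per_rack=5))   # 50000.0 sq ft
print(whitespace_sqft(10_000, kw_per_rack=10))  # 25000.0 sq ft -- half the shell
```

Doubling rack density halves the racks, and with them the white space, the building shell around it and much of the associated construction cost.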

Custom requirements can also affect both capex and opex. For example, large cloud providers that design their own server hardware can deviate from ASHRAE thermal guidelines, allowing a wider thermal envelope (usually hotter) and lower opex due to reduced cooling (and thus power) use. This modification may mean lower capex in terms of equipment but may increase some build costs.

The key to building data centers — regardless of size — for the lowest possible cost is similar to the key to building them at the fastest possible speed: repeat proven, standard approaches. Hire the same experienced crews. Consistently use the same engineering, techniques and technologies. Larger projects can reach slightly lower price points by purchasing equipment in larger volumes and by splitting fixed costs over additional megawatts.

There is no single technology or practice that can be credited — these faster, cheaper builds are achieved through a combination of technologies and practices. Data center capacity providers, such as colocation and wholesale leasing companies, that are building large data centers very quickly are competitively positioned to attract large-scale cloud customers — and are likely to continue to refine their processes and approaches to achieve shorter provisioning times, at even lower cost, in the future.

Cleaning a data center: Contractors vs. DIY

Modern data centers are rarely dirty places, but even so, most are a lot cleaner now than they were before COVID-19 became a concern. A recent Uptime Institute survey, conducted in response to the pandemic, found that about two-thirds (68%) of data center owners/operators recently deep cleaned their facilities, and more than 80% recently sanitized them.

But what exactly does it mean to clean a data center, and who is qualified to do it? In a data center, deep cleaning is the removal of particles, static and residue from all vertical and horizontal surfaces, as well as from plenum and subfloor spaces. This requires vacuums with high efficiency particulate air (HEPA) filters, to prevent particles as small as 0.5 microns from spreading and damaging servers and other sensitive gear. Sanitization (disinfection) is intended to kill 99.9999% of biological matter in the space, except spores. To eliminate the coronavirus that causes COVID-19 from the data center environment, both processes must be performed.

Much of the above-mentioned work was performed by specialty cleaners, contracted by data center owners and operators. Surprisingly, Uptime Institute has found that, despite high levels of cleaning and sanitization activity driven by coronavirus precautions, specialist companies report ready availability, reducing the need for data center owners and operators to take a do-it-yourself (DIY) approach. Even data center cleaners in the New York metropolitan area — a data center and COVID-19 hotspot — say they could provide services for a new client within 2-3 days, or faster if an urgent situation developed.

Specialists say an initial treatment of the entire facility as protection against the spread of COVID-19 is a reasonable approach in many data centers, but it is not cheap. The bill — as much as $100,000 per cleaning for a 100,000 square-foot facility (add another 20% for sanitization) — can quickly become prohibitive if repeated deep cleanings and sanitizations are required, as may be common in high-traffic areas.
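Using the figures above, the recurring bill is easy to model. The $1 per square foot rate simply restates $100,000 per 100,000 sq ft; the function name and the quarterly example are ours:

```python
def cleaning_cost_usd(sqft: float, cleanings_per_year: int,
                      rate_per_sqft: float = 1.0,
                      sanitize: bool = True) -> float:
    """Annual cost: deep cleans at the quoted rate, plus ~20% when sanitizing."""
    per_visit = sqft * rate_per_sqft
    if sanitize:
        per_visit *= 1.20  # sanitization adds roughly 20% to each visit
    return per_visit * cleanings_per_year

# A 100,000 sq ft facility, deep cleaned and sanitized quarterly
print(round(cleaning_cost_usd(100_000, cleanings_per_year=4)))  # 480000
```

At nearly half a million dollars a year for a quarterly schedule, it is easy to see why operators weigh the DIY option.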

Some data center owners/operators take a DIY approach in an attempt to reduce these costs, using facility staff. Even specialized cleaning contractors agree that trained and certified (ISO 14644) personnel can successfully clean and sanitize a facility, but the task is not as straightforward as one might think: Staff must have the proper cleaning equipment; correctly use relevant personal protective equipment (PPE), such as disposable gloves and masks; and have sufficient quantities of appropriate materials for cleaning and sanitization. Some of the most common and easiest-to-use products have been in great demand, so DIYers may struggle to obtain supplies. Cleaning specialists, however, should have sufficient inventory. In addition, personnel must be aware of specific treatments for the coronavirus, as they may vary in some ways from regular cleaning procedures.

Untrained staff may make matters worse if they fail to wear the proper PPE or do not follow procedures exactly. They could overlook server air intakes or disperse particulates into the air, and even trained personnel could accidentally damage or disconnect IT equipment.

Companies that wish to conduct their own cleaning should be aware that the US Environmental Protection Agency (EPA) has published and regularly updates a list of chemicals effective against COVID-19, called List N: Disinfectants for Use Against SARS-CoV-2 (COVID-19). This appears to be the most comprehensive such list in the world.

Many regions and municipalities also maintain similar lists; for example, the UK publishes an informative webpage called the Regulatory status of equipment being used to help prevent coronavirus (COVID-19).

Instructions on the use of chemicals and materials can be confusing and even conflicting. For example, in some cases — but not all — diluted bleach may be an appropriate disinfectant, yet Dell warns enterprises against its use, as well as against the use of peroxides, solvents, ammonia, and ethyl alcohol. Instead, Dell recommends use of a microfiber fabric moistened with 70% isopropyl alcohol by volume.

If, after all these red flags, a data center owner/operator still plans to clean their own facility, the following may help:

  • Remember to power down equipment where possible, according to manufacturer instructions and Methods of Procedure.
  • If equipment must remain operational while external surfaces are cleaned, use extreme caution in exposing powered equipment to any moisture. Take all proper and necessary precautions when handling powered equipment that has been exposed to moisture.
  • Cleaning must be limited to external surfaces such as handles and other common points of contact. Do not open cabinet and chassis doors or attempt to clean any internal components.
  • Fiber optics should not be removed for general purpose cleaning due to increased risk of debris contamination.
  • Never spray any liquids directly onto or into any equipment.
  • When cleaning a related display screen, carefully wipe in one direction, moving from the top of the display to the bottom.
  • If the equipment was powered down, all surfaces must be completely air-dried before powering up the equipment after cleaning. No moisture should be visible on the surfaces of the equipment before it is powered on or plugged in.
  • Use appropriate PPE and discard disposable items appropriately after use. Clean your hands immediately afterward.

COVID-19: Critical impact and legacy

What will be the long-lasting impact of the COVID-19 pandemic on the digital critical infrastructure industry? It may be too soon to ask the question given that, at the time of writing, the virus is taking its toll at scale across the world. But Uptime Institute has been asked this question many times, and it is a discussion point at several upcoming (virtual) conferences.

In a recent Financial Times column, the economist Tim Harford speaks of "hysteresis" — a concept borrowed originally from electromagnetics. Hysteresis describes how some events have a long-lasting, lagging impact, even after the original cause of the change has long gone. Some of the effects of the coronavirus pandemic are elastic — industries, behaviors and business models, for example, are pulled around, but they will spring back to their original shape as the risk passes. Others behave more like plastic: once stretched or broken, they stay that way.

While it may be too soon to tell, many businesses and investors have already begun making bets that certain things have changed forever. Some companies are preparing to sell their headquarters; several have told their staff that, from now on, they are at-home workers. One conference company has moved its entire business online. And in the critical infrastructure world, we know of companies that have already begun to introduce permanent changes in the data center, including large-scale automation programs (to reduce the need for on-site staff).

Uptime Institute has classed the evolution of the digital critical infrastructure response into three phases.

Phase 1

The first phase might be termed “reactive.” At this point, operators are on high alert, in firefighting or emergency response mode. Their biggest concerns are to understand the threat and to identify and implement best practices to decrease the risk to staff and to availability; high on the list are the need to source appropriate personal protective equipment and decontaminants, to clean facilities, and to operate with reduced staff and possible supply chain disruption. For most, this phase lasted a few weeks or more and has already passed. While the vast majority came through this well, about one in 20 data centers told Uptime Institute in a survey that they had a COVID-19 related outage. Others said they experienced IT service slowdowns, likely due to changing demand patterns or server/network maintenance problems.

Phase 2

The second phase is an interim normal — most data center operators are in this period (which may last up to a year). In this phase, the virus is still widespread, but the threat is reduced as governments regain control of the spread. Processes to manage down risk that were introduced in the reactive phase have become established — more remote working, blue/red operations teams, reduced maintenance, greater redundancy — and now form the "new normal." Most of these are process-based and do not involve long-term investments or strategic changes. Some examples of these activities are presented in the chart below.

[Chart: Examples of risk-management activities adopted during the interim normal phase]

During this phase, delayed projects are likely to be gradually restarted, but only as a managed risk; investment is curtailed unless clearly pulled by strong demand (which continues to drive new builds and investments). At this time, management also prepares investment projects for the third phase — the next permanent normal.

Phase 3

The third stage of this cycle is the next normal. At this point, it is likely that a vaccine has been found, treatments have improved, or the virus has been contained by social measures to the point of routine manageability. However, the world has been alerted to the possibility (likelihood?) of another pandemic — so long-term changes are likely, alongside the acceptance of other changes that proved effective and will outlast the danger itself.

What will these be? Here, the answers are less clear. We think the following are highly likely:

  • Operators will incorporate pandemic planning (and drilling) into their business continuity/disaster recovery plans.
  • Governments will seek to have more insight and oversight of critical and near-critical infrastructure, including establishing key worker principles, etc. This may involve more certification.
  • Automation/remote management tools will receive a surge of investment, as operators seek to operate with fewer staff/no staff at times, or to monitor systems from safer or remote locations.
  • Operators will manage the supply chain for key parts and services more closely. This is likely to drive budgets up, especially if emergency cover agreements need to be put in place with services companies.

Here are some other possible developments, where the outcome is less clear:

  • A greater/faster move to cloud. This may happen, but this is a trend two decades old, and existing critical loads are being moved steadily and slowly, not rapidly. In addition, cloud may require increased, not decreased, funding.
  • A shift to edge. More home working, viewed by many as a trend that won’t be reversed, will lead to more work being done away from corporate offices and at edge locations.
  • A need for more staff. In spite of planned automation, data centers will, in the near term (to 2023?), need more staff to fulfill the requirement of maintaining separate teams.

One thing all analysts agree on: The pandemic has not slowed digital transformation, and has probably accelerated it. This has raised both the demand for more data center services and the dependency on these services yet further.

Enterprises’ need for control and visibility still slows cloud adoption

During the current COVID-19 crisis, enterprise dependency on cloud platform providers (Amazon Web Services, Microsoft Azure, Google Cloud Platform) and on software as a service (Salesforce, Zoom, Teams) has increased. Operators report booming demand, with Microsoft’s Chief Executive Officer Satya Nadella saying, “We have seen two years’ worth of digital transformation in two months.”

This success brings with it new challenges and some growing responsibilities. Corporate operators tend to classify their applications and services into levels or categories, which define what they might require for each one in terms of service levels, access, transparency and accountability. Test and development applications, shadow IT, and smaller and newer online businesses were the initial and main drivers of cloud demand in the first decade, and these have relatively light availability needs. But for critical production applications, now the biggest target for cloud companies, enterprises have expectations and compliance requirements that are far more stringent.

During the pandemic, a lot of applications (such as video conferencing, but a lot more besides) have become more critical than they were before. This is a trend that was already underway, but whereas it was previously happening almost imperceptibly, it is now much more obvious. Uptime Institute has described this trend as “creeping criticality” — and it can lead to a circumstance we call “asymmetric resiliency,” which occurs when the level of resiliency required by the application is not matched by the infrastructure supporting it.

The big public cloud operators do not think this applies to them, because all have addressed the issue of availability at the outset. The dominant cloud architecture, with multisite replication using availability zones, all built on fairly robust data centers (if not always uniformly or demonstrably so), has proved largely effective. Cloud services do suffer some infrastructure-related outages, according to Uptime Institute data, but no more than most enterprises.

Unfortunately for the cloud companies, assurances of availability or of adherence to best practices are not enough. In both the 2019 and the latest 2020 global annual survey of IT and critical infrastructure operators, Uptime Institute asked respondents whether they put mission-critical workloads into the cloud, and whether resiliency and visibility are an issue. The answers in each year are almost identical: Over 70% did not put any critical applications in the cloud, with a third of this group (21% of the total sample) saying the reason was a lack of visibility/accountability about resiliency. And one in six of those who do place applications in the public cloud also say they do not have enough visibility (see chart below).

[Chart: Use of public cloud for mission-critical workloads, and visibility concerns, 2019 and 2020 surveys]

The pressure to be more open and accountable is growing. During the COVID-19 pandemic, governments have tried to better understand the vulnerability of their national critical infrastructure, and it hasn’t always been easy. In the UK, the government has expressed concern that the cloud players were unforthcoming with good information.

The rules being developed and applied in the financial services sector, especially in Europe, may provide a model for other sectors to follow. The EBA (European Banking Authority) says that banks may not use a cloud, hosting or colocation company for critical applications unless the provider is open to full, on-site inspection. That presents a challenge for the banks, which must verify that dozens or hundreds of data centers are well built and well run. These rules have proved important: without them, banks seeking access did not always get a welcoming response.

The challenge for financial services regulators and operators is that the banking system is highly interdependent, with many end-to-end processes spanning multiple businesses and multiple data centers and data services. Failure of one can mean a systemic failure; minimization of risk, and full accountability, is essential in the sector — assurances of 99.99% availability are not enough.

During the pandemic, the cloud providers have performed well, but failures or slowdowns have occurred. Microsoft's success, for example, has led to constraints in the availability of Azure services in many regions. For test, dev or even some smaller businesses, this is not too serious; for critical enterprise IT, it might be.

Uptime Institute data suggests that less than 10% of mission critical workloads are running in the public cloud, and that this number is not rising as fast as overall cloud growth (even if new applications like Microsoft Teams have taken off globally). With more accountability and visibility, this number might increase faster.


More information on this topic is available to members and guest members of the Uptime Institute Network community. Details on joining can be found here.

Data center cleaning and sanitization: detail, cost and effectiveness

Fear of the coronavirus or confirmed exposure has caused about half (49%) of data center owners to increase the frequency of regular cleanings, according to a recent Uptime Institute flash survey. Even more (66%) have increased the frequency of sanitization (disinfection) to eliminate (however transitorily) any possible presence of the novel coronavirus in their facilities.

In times of normal operation, most facilities conduct a thorough cleaning (the removal of particles, static and residue) on a regular basis, with frequency determined by need. These cleanings help data centers meet ISO 14644-1, as required by ASHRAE.

Many use air quality indicators to determine when further cleaning is needed. Facility-specific factors such as air exchange rate/volume and raised floors can greatly affect the number of particulates in the air.

Other data center owners/operators clean the entire facility — including plenum spaces and underfloor areas — on an annual or biannual basis, with more frequent cleanings of all horizontal and vertical surfaces and even more frequent cleanings of handrails, doors and high-traffic areas. Schedules vary by facility. Some facility owners also regularly conduct facility-wide disinfection, which is the process that eliminates many or all pathogenic microorganisms, except bacterial spores, on inanimate objects. This is similar to sterilization, which eliminates all forms of microbial life.

Cleaning and sanitization are not necessarily risk-free processes, and they can be expensive. Moreover, the protection they provide may not persist, as the virus can easily be introduced — or reintroduced — minutes after a facility has been thoroughly cleaned and disinfected. However, the threat to operations is real: 10% of our survey respondents have experienced an IT service slowdown related to COVID-19, and 4% reported a pandemic-related outage or service interruption.

Cleaning considerations

Cleaners must exercise a great deal of care and follow rigorous procedures, as they often work in close proximity to sensitive electronic gear. In addition, cleaners must use vacuums with high efficiency particulate air filters, so that no dust or dirt particles become airborne and get pulled into the supply air. Deep cleaning is a necessary step to make sanitization and disinfection effective.

Sanitization or disinfection requires even greater caution, as most cleaning materials come with precise care, handling and use instructions, for both worker and equipment safety. Properly done, sanitization can remove up to 99.9999% (a 6-log kill) of enveloped viruses, such as SARS-CoV-2, the virus that causes COVID-19. Higher levels of disinfection are possible using stronger chemicals (see the EPA's list of disinfectants that are effective against the coronavirus and even more durable biological contaminants, such as fungi and bacterial spores). The list also includes many products effective against coronaviruses that are safe and easy to use.
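The "log kill" shorthand maps directly to these percentages: an n-log reduction leaves 10^-n of the original population behind. A quick check of the figures cited:

```python
def log_kill_percent(n_log: int) -> float:
    """Percentage of organisms removed by an n-log reduction."""
    return (1 - 10 ** -n_log) * 100

print(round(log_kill_percent(3), 1))  # 99.9
print(round(log_kill_percent(6), 4))  # 99.9999 -- the 6-log kill cited above
```

Each additional log step removes 90% of whatever survived the previous one, which is why the jump from 99.9% to 99.9999% represents a thousandfold difference in remaining contamination.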

Data center cleaning services interviewed by Uptime Institute do not necessarily consider repeated deep cleaning or sanitization of entire facilities a necessary COVID-19 precaution. Clients, they say, must consider the cost of cleaning, which could be as much as $100,000 for each cleaning of a 100,000-square-foot facility, and sanitization, which can add another 20% or more to the cost of deep cleaning.

The cleaners note that many data centers are relatively clean “low-touch” facilities. In normal operation, data centers have features that work to protect against the entry of viruses and other particles. These include controlled access, static mats at many entries, and air filtration and pressurization in white spaces, all of which reduce the ability of the SARS-CoV-2 virus to enter sensitive areas of the facility. Many data centers also have strict rules against bringing unnecessary gear or personal items into the data center.

These characteristics suggest that an initial deep cleaning and sanitization of the entire facility as protection against the spread of COVID-19 is a reasonable approach in many data centers. Frequent or regular disinfections can be reserved for highly trafficked spaces and high-touch areas, such as handrails, handles, light switches and doorknobs. According to ASHRAE, computational fluid dynamics can be used to identify likely landing places for airborne viruses, making these treatments even more effective.

Some data center owners may want more frequent cleanings for "peace of mind" or legal reasons. Higher levels of attention may be warranted if an infected visitor has accessed the facility. To date, Uptime Institute has investigated three such reports worldwide, but anonymous responses to our survey indicate that the number of incidents is much higher (15% of respondents reported COVID-19-positive or symptomatic staff or visitors).

Uptime Institute clients have expressed interest in a range of cleaning and disinfection processes, including ozone treatment and ultraviolet germicidal irradiation (UVGI), as well as relevant products.

Uptime Institute is aware of at least one new product — a biostatic antimicrobial coating — that has been introduced to the market. The US-based vendor describes it as “an invisible and durable coating that helps provide for the physical kill, or more precisely the disruption, of microorganisms (i.e., bacteria, viruses, mold, mildew and fungi) and can last up to 90 days or possibly longer.”

Even if proven effective against COVID-19, these treatments may have other use limitations and prohibitive downsides for either the equipment in data centers or the humans who staff them. For instance, UVGI is a disinfection method that uses short-wavelength ultraviolet (ultraviolet C or UVC) light to kill or inactivate microorganisms. But UVC light can only disinfect where it is directly applied, so it cannot be used to treat surfaces where targeted exposure is not possible (such as under raised floors, in overhead plenums, and behind cabinets and cables). In addition, exposure to the wavelength typically used in UVGI (253.7 nm) can damage the skin and eyes.

Based on currently available information, rigorous adherence to a program that includes hand hygiene, surface cleaning and social distancing will reduce the likelihood of transmission without introducing additional risks.