Cleaning a data center: Contractors vs. DIY

Modern data centers are rarely dirty places, but even so, most are a lot cleaner now than they were before COVID-19 became a concern. A recent Uptime Institute survey, conducted in response to the pandemic, found that about two-thirds (68%) of data center owners/operators recently deep cleaned their facilities, and more than 80% recently sanitized them.

But what exactly does it mean to clean a data center, and who is actually qualified to do it? In a data center, deep cleaning is the removal of particles, static and residue from all vertical and horizontal surfaces, as well as from plenum and subfloor spaces. This requires vacuums with high efficiency particulate air (HEPA) filters, which capture particles as small as 0.5 microns and prevent them from spreading and damaging servers and other sensitive gear. Sanitization (disinfection) is intended to kill 99.9999% of biological matter in the space, except spores. To eliminate the coronavirus that causes COVID-19 from the data center environment, both processes must be performed.

Much of the above-mentioned work was performed by specialty cleaners, contracted by data center owners and operators. Surprisingly, Uptime Institute has found that, despite high levels of cleaning and sanitization activity due to coronavirus precautions, specialist companies report that they still have availability, reducing the need for data center owners and operators to take a do-it-yourself (DIY) approach. Even data center cleaners located in the New York metropolitan area — a data center and COVID-19 hotspot — say they would be able to provide services for a new client within 2-3 days, or even faster if an urgent situation developed.

Specialists say an initial treatment of the entire facility as protection against the spread of COVID-19 is a reasonable approach in many data centers, but it is not cheap. The bill for a specialized contractor — as much as $100,000 per cleaning for a 100,000 square-foot facility, plus another 20% or so for sanitization — rises quickly, and can become prohibitive if repeated deep cleanings and sanitizations are required (as may be common in high-traffic areas).
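
For budgeting purposes, the arithmetic is simple to sketch. The snippet below (Python) turns the figures cited above (roughly $1 per square foot per deep cleaning, with sanitization adding about 20%) into a rough annual estimate; the function name, default rates and quarterly frequency in the example are illustrative assumptions, not quoted prices.

    # Rough annual cost estimate for contracted deep cleaning and sanitization.
    # Assumptions (illustrative, taken from the figures cited above, not quotes):
    #   - deep cleaning: ~$1.00 per square foot per treatment
    #     ($100,000 for a 100,000 square-foot facility)
    #   - sanitization adds ~20% to each deep cleaning
    def annual_cleaning_cost(square_feet: float,
                             cleanings_per_year: int,
                             cost_per_sqft: float = 1.00,
                             sanitization_uplift: float = 0.20) -> float:
        """Estimated annual cost of deep cleaning plus sanitization."""
        per_treatment = square_feet * cost_per_sqft * (1 + sanitization_uplift)
        return per_treatment * cleanings_per_year

    # Example: a 100,000 square-foot facility treated quarterly.
    print(f"${annual_cleaning_cost(100_000, 4):,.0f} per year")  # -> $480,000 per year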

Some data center owners/operators take a DIY approach in an attempt to reduce these costs, using facility staff. Even specialized cleaning contractors agree that trained and certified (ISO 14644) personnel can successfully clean and sanitize a facility, but the task is not as straightforward as one might think: Staff must have the proper cleaning equipment; correctly use relevant personal protective equipment (PPE), such as disposable gloves and masks; and have sufficient quantities of appropriate materials for cleaning and sanitization. Some of the most common and easiest-to-use products have been in great demand, so DIYers may struggle to obtain supplies. Cleaning specialists, however, should have sufficient inventory. In addition, personnel must be aware of specific treatments for the coronavirus, as they may vary in some ways from regular cleaning procedures.

Untrained staff may make matters worse if they fail to wear the proper PPE or do not follow procedures exactly. They could overlook server air intakes or disperse particulates into the air, and even trained personnel could accidentally damage or disconnect IT equipment.

Companies that wish to conduct their own cleaning should be aware that the US Environmental Protection Agency (EPA) has published and regularly updates a list of disinfectants effective against SARS-CoV-2, called List N: Disinfectants for Use Against SARS-CoV-2 (COVID-19). This appears to be the most comprehensive such list in the world.

Many regions and municipalities also maintain similar lists; for example, the UK publishes an informative webpage called Regulatory status of equipment being used to help prevent coronavirus (COVID-19).

Instructions on the use of chemicals and materials can be confusing and even conflicting. For example, in some cases — but not all — diluted bleach may be an appropriate disinfectant, yet Dell warns enterprises against its use, as well as against the use of peroxides, solvents, ammonia, and ethyl alcohol. Instead, Dell recommends use of a microfiber fabric moistened with 70% isopropyl alcohol by volume.

If, after all these red flags, a data center owner/operator still plans to clean their own facility, the following may help:

  • Remember to power down equipment where possible, according to manufacturer instructions and Methods of Procedure.
  • If equipment must remain operational while external surfaces are cleaned, use extreme caution in exposing powered equipment to any moisture. Take all proper and necessary precautions when handling powered equipment that has been exposed to moisture.
  • Cleaning must be limited to external surfaces such as handles and other common points of contact. Do not open cabinet and chassis doors or attempt to clean any internal components.
  • Fiber optics should not be removed for general purpose cleaning due to increased risk of debris contamination.
  • Never spray any liquids directly onto or into any equipment.
  • When cleaning an associated display screen, carefully wipe in one direction, moving from the top of the display to the bottom.
  • If the equipment was powered down, all surfaces must be completely air-dried before powering up the equipment after cleaning. No moisture should be visible on the surfaces of the equipment before it is powered on or plugged in.
  • Use appropriate PPE and discard disposable items appropriately after use. Clean your hands immediately afterward.

COVID-19: Critical impact and legacy

What will be the long-lasting impact of the COVID-19 pandemic on the digital critical infrastructure industry? It may be too soon to ask the question given that, at the time of writing, the virus is taking its toll at scale across the world. But Uptime Institute has been asked this question many times, and it’s a discussion point at several upcoming (virtual) conferences.

In a recent Financial Times column, the economist Tim Harford speaks of “hysteresis” — a concept originally borrowed from electromagnetics. Hysteresis describes how some events have a long-lasting, lagging impact, even after the original cause of the change has long gone. Some of the effects of the coronavirus pandemic are elastic — industries, behaviors and business models, for example, are pulled around, but they will spring back to their original shape as the risk passes. But others behave more like plastic: once stretched or broken, they stay that way.

While it may be too soon to tell, many businesses and investors have already begun making bets that certain things have changed forever. Some companies are preparing to sell their headquarters; several have told their staff that, from now on, they are at-home workers. One conference company has moved its entire business online. And in the critical infrastructure world, we know of companies that have already begun to introduce permanent changes in the data center, including large-scale automation programs (to reduce the need for on-site staff).

Uptime Institute has divided the evolution of the digital critical infrastructure response into three phases.

Phase 1

The first phase might be termed “reactive.” At this point, operators are on high alert, in firefighting or emergency response mode. Their biggest concerns are to understand the threat and to identify and implement best practices to decrease the risk to staff and to availability; high on the list are the need to source appropriate personal protective equipment and decontaminants, to clean facilities, and to operate with reduced staff and possible supply chain disruption. For most, this phase lasted a few weeks or more and has already passed. While the vast majority came through this well, about one in 20 data centers told Uptime Institute in a survey that they had a COVID-19-related outage. Others said they experienced IT service slowdowns, likely due to changing demand patterns or server/network maintenance problems.

Phase 2

The second phase is an interim normal – most data center operators are in this period (which may last up to a year). In this phase, the virus is still widespread, but the threat is reduced as governments regain control of the spread. Processes to manage down risk that were put in place during the reactive phase, such as more remote working, blue/red operations teams, reduced maintenance and greater redundancy, have become established and now form the “new normal.” Most of these are process-based and do not involve long-term investments or strategic changes. Some examples of these activities are presented in the chart below.

[Chart: examples of interim-normal operational activities]

During this phase, delayed projects are likely to be gradually restarted, but only as a managed risk; investment is curtailed unless clearly pulled by strong demand (which continues to drive new builds and investments). At this time, management also prepares investment projects for the third phase — the next permanent normal.

Phase 3

The third stage of this cycle is the next normal. At this point, it is likely a vaccine has been found, treatments have improved, or the virus has been contained by social measures to the point of routine manageability. However, the world has been alerted to the possibility (likelihood?) of another pandemic — so long-term changes are likely, alongside the retention of other changes that proved effective, even once the immediate danger has passed.

What will these be? Here, the answers are less clear. We think the following are highly likely:

  • Operators will incorporate pandemic planning (and drilling) into their business continuity/disaster recovery plans.
  • Governments will seek to have more insight and oversight of critical and near-critical infrastructure, including establishing key worker principles, etc. This may involve more certification.
  • Automation/remote management tools will receive a surge of investment, as operators seek to operate with fewer staff/no staff at times, or to monitor systems from safer or remote locations.
  • Supply chains for key parts and services will be managed more closely. This is likely to drive budgets up, especially if emergency cover agreements need to be put in place with services companies.

Here are some other possible developments, where the outcome is less clear:

  • A greater/faster move to cloud. This may happen, but this is a trend two decades old, and existing critical loads are being moved steadily and slowly, not rapidly. In addition, cloud may require increased, not decreased, funding.
  • A shift to edge. More home working, viewed by many as a trend that won’t be reversed, will lead to more work being done away from corporate offices and at edge locations.
  • A need for more staff. In spite of planned automation, data centers will, in the near term (to 2023?), need more staff to fulfill the requirement of maintaining separate teams.

One thing all analysts agree on: The pandemic has not slowed the shift toward digital services, and has probably accelerated it. This has raised both the demand for data center services and the dependency on these services yet further.

Enterprises’ need for control and visibility still slows cloud adoption

During the current COVID-19 crisis, enterprise dependency on cloud platform providers (Amazon Web Services, Microsoft Azure, Google Cloud Platform) and on software as a service (Salesforce, Zoom, Teams) has increased. Operators report booming demand, with Microsoft’s Chief Executive Officer Satya Nadella saying, “We have seen two years’ worth of digital transformation in two months.”

This success brings with it new challenges and some growing responsibilities. Corporate operators tend to classify their applications and services into levels or categories, which define what they might require for each one in terms of service levels, access, transparency and accountability. Test and development applications, shadow IT, and smaller and newer online businesses were the initial and main drivers of cloud demand in the first decade, and these have relatively light availability needs. But for critical production applications, now the biggest target for cloud companies, enterprises have expectations and compliance requirements that are far more stringent.

During the pandemic, many applications (video conferencing among them, but plenty more besides) have become more critical than they were before. This is a trend that was already underway, but whereas it was previously happening almost imperceptibly, it is now much more obvious. Uptime Institute has described this trend as “creeping criticality” — and it can lead to a circumstance we call “asymmetric resiliency,” which occurs when the level of resiliency required by the application is not matched by the infrastructure supporting it.

The big public cloud operators do not think this applies to them, because all addressed the issue of availability from the outset. The dominant cloud architecture, with multisite replication using availability zones, all built on fairly robust data centers (if not always uniformly or demonstrably so), has proved largely effective. Cloud services do suffer some infrastructure-related outages, according to Uptime Institute data, but no more than most enterprises.

Unfortunately for the cloud companies, assurances of availability or of adherence to best practices are not enough. In both the 2019 and the latest 2020 global annual survey of IT and critical infrastructure operators, Uptime Institute asked respondents whether they put mission-critical workloads into the cloud, and whether resiliency and visibility are an issue. The answers in each year are almost identical: Over 70% did not put any critical applications in the cloud, with a third of this group (21% of the total sample) saying the reason was a lack of visibility/accountability about resiliency. And one in six of those who do place applications in the public cloud also say they do not have enough visibility (see chart below).

[Chart: use of public cloud for mission-critical workloads and concerns about visibility]

The pressure to be more open and accountable is growing. During the COVID-19 pandemic, governments have tried to better understand the vulnerability of their national critical infrastructure, and it hasn’t always been easy. In the UK, the government has expressed concern that the cloud players were unforthcoming with good information.

The rules being developed and applied in the financial services sector, especially in Europe, may provide a model for other sectors to follow. The EBA (European Banking Authority) says that banks may not use a cloud, hosting or colocation company for critical applications unless they are open to full, on-site inspection. That presents a challenge for the banks, who must verify that dozens or hundreds of data centers are well built and well run. Those rules have proved important, because without them, banks seeking access did not always get a welcoming response.

The challenge for financial services regulators and operators is that the banking system is highly interdependent, with many end-to-end processes spanning multiple businesses and multiple data centers and data services. Failure of one can mean a systemic failure; minimization of risk, and full accountability, are essential in the sector — assurances of 99.99% availability are not enough.

During the pandemic, the cloud providers have performed well, but failures or slowdowns have occurred. Microsoft’s success, for example, has led to constraints in the availability of Azure services in many regions. For test, dev or even some smaller businesses, this is not too serious; for critical enterprise IT, it might be.

Uptime Institute data suggests that less than 10% of mission-critical workloads are running in the public cloud, and that this number is not rising as fast as overall cloud growth (even if new applications like Microsoft Teams have taken off globally). With more accountability and visibility, this number might increase faster.



Data center cleaning and sanitization: detail, cost and effectiveness

Fear of the coronavirus or confirmed exposure has caused about half (49%) of data center owners to increase the frequency of regular cleanings, according to a recent Uptime Institute flash survey. Even more (66%) have increased the frequency of sanitization (disinfection) to eliminate (however transitorily) any possible presence of the novel coronavirus in their facilities.

In times of normal operation, most facilities conduct a thorough cleaning (the removal of particles, static and residue) on a regular basis, with frequency determined by need. These cleanings help data centers meet ISO 14644-1, as required by ASHRAE.

Many use air quality indicators to determine when further cleaning is needed. Facility-specific factors such as air exchange rate/volume and raised floors can greatly affect the number of particulates in the air.
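
For reference, ISO 14644-1 defines each cleanliness class by a maximum allowable airborne particle concentration, calculated as 10^N × (0.1/D)^2.08 particles per cubic meter, where N is the ISO class number and D is the particle size in micrometers. The short Python sketch below applies that formula; the choice of ISO Class 8 (the level commonly cited for data center white space) and the particle sizes shown are illustrative.

    # ISO 14644-1 class limit: maximum airborne particle concentration
    #   limit = 10**N * (0.1 / D)**2.08   [particles per cubic meter]
    # where N is the ISO class number and D is the particle size in micrometers.
    def iso_14644_limit(iso_class: float, particle_size_um: float) -> float:
        """Maximum allowable concentration (particles/m^3) for a given ISO class."""
        return 10 ** iso_class * (0.1 / particle_size_um) ** 2.08

    # ISO Class 8 limits at a few particle sizes:
    for size_um in (0.5, 1.0, 5.0):
        print(f"Class 8, >= {size_um} um: {iso_14644_limit(8, size_um):,.0f} particles/m^3")
    # -> approx. 3.52e6, 8.32e5 and 2.93e4 particles/m^3, matching the ISO Class 8 table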

Other data center owners/operators clean the entire facility — including plenum spaces and underfloor areas — on an annual or biannual basis, with more frequent cleanings of all horizontal and vertical surfaces and even more frequent cleanings of handrails, doors and high-traffic areas. Schedules vary by facility. Some facility owners also regularly conduct facility-wide disinfection, which is the process that eliminates many or all pathogenic microorganisms, except bacterial spores, on inanimate objects. This is similar to sterilization, which eliminates all forms of microbial life.

Cleaning and sanitization are not necessarily risk-free processes, and they can be expensive. Moreover, the protection they provide may not persist, as the virus can easily be introduced — or reintroduced — minutes after a facility has been thoroughly cleaned and disinfected. However, the threat to operations is real: 10% of our survey respondents have experienced an IT service slowdown related to COVID-19, and 4% reported a pandemic-related outage or service interruption.

Cleaning considerations

Cleaners must exercise a great deal of care and follow rigorous procedures, as they often work in close proximity to sensitive electronic gear. In addition, cleaners must use vacuums with high efficiency particulate air filters, so that no dust or dirt particles become airborne and get pulled into the supply air. Deep cleaning is a necessary step to make sanitization and disinfection effective.

Sanitization or disinfection requires even greater caution, as most cleaning materials come with precise care, handling and use instructions, for both worker and equipment safety. Properly done, sanitization can remove up to 99.9999% (a 6-log kill) of enveloped viruses, such as SARS-CoV-2, the virus that causes COVID-19. Higher levels of disinfection are possible using stronger chemicals (see the EPA’s list of disinfectants that are effective against the coronavirus and even more durable biological contaminants, such as fungi and bacterial spores). The list also includes many products that are effective against coronaviruses and are safe and easy to use.
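
The “log kill” shorthand used by disinfection vendors maps directly to a percentage: each log of reduction removes another factor of ten, so a 6-log kill leaves at most one organism in a million. A minimal Python illustration of the conversion (the values shown are simply worked arithmetic, not product claims):

    # Convert an "n-log kill" to the percentage of organisms removed:
    # surviving fraction = 10**(-n), so removed fraction = 1 - 10**(-n).
    def kill_percentage(log_reduction: float) -> float:
        """Percentage of organisms removed for a given log reduction."""
        return (1 - 10 ** (-log_reduction)) * 100

    for n in (3, 4, 6):
        print(f"{n}-log kill = {kill_percentage(n):.4f}% removed")
    # 3-log kill = 99.9000% removed
    # 4-log kill = 99.9900% removed
    # 6-log kill = 99.9999% removed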

Data center cleaning services interviewed by Uptime Institute do not necessarily consider repeated deep cleaning or sanitization of entire facilities a necessary COVID-19 precaution. Clients, they say, must consider the cost of cleaning, which could be as much as $100,000 for each cleaning of a 100,000-square-foot facility, and sanitization, which can add another 20% or more to the cost of deep cleaning.

The cleaners note that many data centers are relatively clean “low-touch” facilities. In normal operation, data centers have features that work to protect against the entry of viruses and other particles. These include controlled access, static mats at many entries, and air filtration and pressurization in white spaces, all of which reduce the ability of the SARS-CoV-2 virus to enter sensitive areas of the facility. Many data centers also have strict rules against bringing unnecessary gear or personal items into the data center.

These characteristics suggest that an initial deep cleaning and sanitization of the entire facility as protection against the spread of COVID-19 is a reasonable approach in many data centers. Frequent or regular disinfections can be reserved for highly trafficked spaces and high-touch areas, such as handrails, handles, light switches and doorknobs. According to ASHRAE, computational fluid dynamics can be used to identify likely landing places for airborne viruses, making these treatments even more effective.

Some data center owners may want more frequent cleanings for “peace of mind” or legal reasons. Higher levels of attention may be warranted if an infected visitor has accessed the facility. To date, Uptime Institute has investigated three such reports worldwide, but anonymous responses to our survey indicate that the number of incidents is much higher (15% of respondents reported COVID-19-positive or symptomatic staff or visitors).

Uptime Institute clients have expressed interest in a range of cleaning and disinfection processes, including ozone treatment and ultraviolet germicidal irradiation (UVGI), as well as relevant products.

Uptime Institute is aware of at least one new product — a biostatic antimicrobial coating — that has been introduced to the market. The US-based vendor describes it as “an invisible and durable coating that helps provide for the physical kill, or more precisely the disruption, of microorganisms (i.e., bacteria, viruses, mold, mildew and fungi) and can last up to 90 days or possibly longer.”

Even if proven effective against COVID-19, these treatments may have other use limitations and prohibitive downsides for either the equipment in data centers or the humans who staff them. For instance, UVGI is a disinfection method that uses short-wavelength ultraviolet (ultraviolet C, or UVC) light to kill or inactivate microorganisms. But UVC light can only disinfect where it is directly applied, so it cannot be used to treat surfaces where targeted exposure is not possible (such as under raised floors, in overhead plenums, and behind cabinets and cables). In addition, exposure to the wavelength typically used in UVGI (253.7 nm) can damage the skin and eyes.

Based on currently available information, rigorous adherence to a program that includes hand hygiene, surface cleaning and social distancing will reduce the likelihood of transmission without introducing additional risks.

COVID-19: What worries data center management the most?

To date, the critical infrastructure industry has mostly managed effectively with reduced staff, deferred maintenance, social distancing and new patterns of demand. While there have been some serious incidents and outages directly related to the pandemic, these have been few.

But what worries operators for the months ahead? Clearly, the situation is changing all the time, with different facilities in different countries or states facing very different situations. Many countries are easing their lockdowns, while others have still to reach the (first?) peak of the pandemic.

In an Uptime Institute survey of over 200 critical IT/facility infrastructure operators around the world, a third of operators said a reduced level of IT infrastructure operations staff poses the single biggest risk to operations (see Figure 1).

 

This is perhaps not surprising — earlier Uptime Institute research has shown that insufficient staffing can be associated with increased levels of failures and incidents; and even before the pandemic, 60% of data center owners and operators reported difficulty finding or retaining skilled staff. Taken together, reduced maintenance (15%) and shortages of critical components (6%) suggest that continued, problem-free operation of equipment is the second biggest concern.

When asked to cite their top three concerns, 45% of respondents cited data center project or construction delays (see Figure 2), the same proportion as cited the staffing issue. This is perhaps not surprising; when asked later in the survey to select all relevant impacts from among a range of potential repercussions, 60% reported pandemic-related project or construction delays.


Pandemic is causing some outages and slowdowns

To date, media coverage of the impact of COVID-19 and the lockdowns has been largely laudatory. There have been few high-profile or serious outages (perhaps fewer than normal) and, for the most part, internet traffic flow analysis shows that a sudden jump in demand, along with a shift toward the residential edge and busy daytime hours, has had little impact on performance. The military-grade resiliency of the ARPANET (Advanced Research Projects Agency Network) and the Internet Protocol, the foundation of the modern internet, is given much credit.

But beneath the calm surface, data center operators have been paddling furiously to maintain services, especially given alarming staff shortages at some sites.

In our most recent survey, 84% of respondents had not experienced a service slowdown or outage that could be attributed to COVID-19. However, 4% (eight operators) said they had a COVID-19-related outage and 10% (20 operators) had experienced a service slowdown that was COVID-19 related (see figure below).

[Chart: COVID-19-related outages and slowdowns reported by operators]

Establishing the causes of these slowdowns or outages will probably not be easy. Research does show that staff shortages and tiredness can lead to more incidents and outages, and sustained staff shortages (due to illness, separation of shifts and self-isolation) are widespread across the sector. Some recent data center outages that Uptime Institute has tracked were clearly the result of operator or management error — but that is not unusual.

Slowdowns, meanwhile, are most likely to be the result of sudden changes in demand and overload, or an external third-party network problem. Two examples are the UK online grocer that mistook high demand for a denial-of-service attack, and Sharp Electronics, which offered consumer PPE (personal protective equipment) in Japan through the same systems that run its online appliance management services. Both crashed. Zoom, the suddenly popular conferencing service, has also experienced some maintenance-related issues. In the US, a CenturyLink cable cut, and in Europe a network issue at Tata Communications, caused outage numbers to surge above average.

As the impact of the virus continues, data center operators could come under more strain. Most operators have deferred some scheduled maintenance, which, in spite of monitoring and close management, will likely increase the risk of failures. In addition, many if not most sites are now operating with a reduced level of on-site staff, with many engineers on call rather than on site. The industry’s traditional conservatism has so far given it a good protective buffer — but this will come under pressure over time, unless restrictions and practices are eased or evolve to ensure risk is again reduced.

