What will be the long-lasting impact of the COVID-19 pandemic on the digital critical infrastructure industry? It may be too soon to ask the question given that, at the time of writing, the virus is taking its toll at scale across the world. But Uptime Institute has been asked this question many times, and it is a discussion point at several upcoming (virtual) conferences.
In a recent Financial Times column, the economist Tim Harford speaks of “hysteresis” — a concept borrowed originally from electromagnetics. Hysteresis describes how some events have a long-lasting, lagging impact, even after the original cause of the change has long gone. Some of the effects of the coronavirus pandemic are elastic — industries, behaviors and business models, for example, are pulled around, but they will spring back to their original shape as the risk passes. But others behave more like plastic: once stretched or broken, they stay that way.
While it may be too soon to tell, many businesses and investors have already begun making bets that certain things have changed forever. Some companies are preparing to sell their headquarters; several have told their staff that, from now on, they are at-home workers. One conference company has moved its entire business online. And in the critical infrastructure world, we know of companies that have already begun to introduce permanent changes in the data center, including large-scale automation programs (to reduce the need for on-site staff).
Uptime Institute has classed the evolution of the digital critical response into three phases.
Phase 1
The first phase might be termed “reactive.” At this point, operators are on high alert, in firefighting or emergency response mode. Their biggest concerns are to understand the threat and to identify and implement best practices to decrease the risk to staff and to availability; high on the list are the need to source appropriate personal protective equipment and decontaminants, to clean facilities, and to operate with reduced staff and possible supply chain disruption. For most, this phase lasted a few weeks or more and has already passed. While the vast majority came through this well, about one in 20 data centers told Uptime Institute in a survey that they had a COVID-19-related outage. Others said they experienced IT service slowdowns, likely due to changing demand patterns or server/network maintenance problems.
Phase 2
The second phase is an interim normal; most data center operators are in this period (which may last up to a year). In this phase, the virus is still widespread, but the threat is reduced as governments regain control of the spread. Processes to manage down risk, introduced during the reactive phase, have become routine: more remote working, blue/red operations teams, reduced maintenance and greater redundancy now form the “new normal.” Most of these are process-based and do not involve long-term investments or strategic changes. Some examples of these activities are presented in the chart below.
During this phase, delayed projects are likely to be gradually restarted, but only as a managed risk; investment is curtailed unless clearly pulled by strong demand (which continues to drive new builds and investments). At this time, management also prepares investment projects for the third phase — the next permanent normal.
Phase 3
The third stage of this cycle is the next normal. At this point, it is likely a vaccine has been found, treatments have improved, or the virus has been contained by social measures to the point of routine manageability. However, the world has been alerted to the possibility (likelihood?) of another pandemic, so long-term changes are likely, alongside the retention of other changes that proved effective even after the danger has passed.
What will these be? Here, the answers are less clear. We think the following are highly likely:
Operators will incorporate pandemic planning (and drilling) into their business continuity/disaster recovery plans.
Governments will seek to have more insight and oversight of critical and near-critical infrastructure, including establishing key worker principles, etc. This may involve more certification.
Automation/remote management tools will receive a surge of investment, as operators seek to operate with fewer staff/no staff at times, or to monitor systems from safer or remote locations.
Supply chains for key parts and services will be more closely managed. This is likely to drive budgets up, especially if emergency cover agreements need to be put in place with service companies.
Here are some other possible developments, where the outcome is less clear:
A greater/faster move to cloud. This may happen, but this is a trend two decades old, and existing critical loads are being moved steadily and slowly, not rapidly. In addition, cloud may require increased, not decreased, funding.
A shift to edge. More home working, viewed by many as a trend that won’t be reversed, will lead to more work being done away from corporate offices and at edge locations.
A need for more staff. In spite of planned automation, data centers will, in the near term (to 2023?), need more staff to fulfill the requirement of maintaining separate teams.
One thing all analysts agree on: The pandemic has not slowed digitalization, and has probably accelerated it. This has raised both the demand for data center services and the dependency on those services yet further.
“COVID-19: Critical impact and legacy” | Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected] | Published June 22, 2020
During the current COVID-19 crisis, enterprise dependency on cloud platform providers (Amazon Web Services, Microsoft Azure, Google Cloud Platform) and on software as a service (Salesforce, Zoom, Teams) has increased. Operators report booming demand, with Microsoft’s Chief Executive Officer Satya Nadella saying, “We have seen two years’ worth of digital transformation in two months.”
This success brings with it new challenges and some growing responsibilities. Corporate operators tend to classify their applications and services into levels or categories, which define what they require for each one in terms of service levels, access, transparency and accountability. Test and development applications, shadow IT, and smaller and newer online businesses were the initial and main drivers of cloud demand in the first decade, and these have relatively light availability needs. But for critical production applications, now the biggest target for cloud companies, enterprises have expectations and compliance requirements that are far more stringent.
During the pandemic, a lot of applications (such as video conferencing, but a lot more besides) have become more critical than they were before. This is a trend that was already underway, but whereas it was previously happening almost imperceptibly, it is now much more obvious. Uptime Institute has described this trend as “creeping criticality” — and it can lead to a circumstance we call “asymmetric resiliency,” which occurs when the level of resiliency required by the application is not matched by the infrastructure supporting it.
The big public cloud operators do not think this applies to them, because all have addressed the issue of availability at the outset. The dominant cloud architecture, with multisite replication using availability zones, all built on fairly robust data centers (if not always uniformly or demonstrably so), has proved largely effective. Cloud services do suffer some infrastructure-related outages, according to Uptime Institute data, but no more than most enterprises.
Unfortunately for the cloud companies, assurances of availability or of adherence to best practices are not enough. In both the 2019 and the latest 2020 global annual survey of IT and critical infrastructure operators, Uptime Institute asked respondents if they put mission-critical workloads into the cloud, and whether resiliency and visibility are issues. The answers in each year are almost identical: Over 70% did not put any critical applications in the cloud, with a third of this group (21% of the total sample) saying the reason was a lack of visibility/accountability about resiliency. And one in six of those who do place applications in the public cloud also say they do not have enough visibility (see chart below).
The pressure to be more open and accountable is growing. During the COVID-19 pandemic, governments have tried to better understand the vulnerability of their national critical infrastructure, and it hasn’t always been easy. In the UK, the government has expressed concern that the cloud players were unforthcoming with good information.
The rules being developed and applied in the financial services sector, especially in Europe, may provide a model for other sectors to follow. The EBA (European Banking Authority) says that banks may not use a cloud, hosting or colocation company for critical applications unless they are open to full, on-site inspection. That presents a challenge for the banks, who must verify that dozens or hundreds of data centers are well built and well run. Those rules have proved important, because without them, banks seeking access did not always get a welcoming response.
The challenge for financial services regulators and operators is that the banking system is highly interdependent, with many end-to-end processes spanning multiple businesses and multiple data centers and data services. Failure of one can mean a systemic failure; minimization of risk and full accountability are essential in the sector — assurances of 99.99% availability are not enough.
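As an aside, it is worth spelling out what an availability figure such as 99.99% actually permits. The arithmetic below is a simple illustration sketched in Python (the figures are derived from the percentage alone, not from Uptime Institute data):

```python
# Convert an availability percentage into the downtime it permits per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year implied by a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

# 99.99% ("four nines") still allows roughly 52.6 minutes of downtime a year.
for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% availability allows "
          f"{downtime_minutes_per_year(pct):.1f} min/year of downtime")
```

For an end-to-end process spanning several providers, these allowances compound, which is one reason regulators press for accountability rather than headline percentages.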
During the pandemic, the cloud providers have performed well, but failures or slowdowns have occurred. Microsoft’s success, for example, has led to constraints in the availability of Azure services in many regions. For test, dev or even some smaller businesses, this is not too serious: for critical, enterprise IT, it might be.
Uptime Institute data suggests that less than 10% of mission critical workloads are running in the public cloud, and that this number is not rising as fast as overall cloud growth (even if new applications like Microsoft Teams have taken off globally). With more accountability and visibility, this number might increase faster.
More information on this topic is available to members and guest members of the Uptime Institute Network community. Details on joining can be found here.
“Enterprises’ need for control and visibility still slows cloud adoption” | Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected] | Published June 8, 2020
Fear of the coronavirus or confirmed exposure has caused about half (49%) of data center owners to increase the frequency of regular cleanings, according to a recent Uptime Institute flash survey. Even more (66%) have increased the frequency of sanitization (disinfection) to eliminate (however transitorily) any possible presence of the novel coronavirus in their facilities.
In times of normal operation, most facilities conduct a thorough cleaning (the removal of particles, static and residue) on a regular basis, with frequency determined by need. These cleanings help data centers meet ISO 14644-1, as recommended by ASHRAE.
Many use air quality indicators to determine when further cleaning is needed. Facility-specific factors such as air exchange rate/volume and raised floors can greatly affect the number of particulates in the air.
Other data center owners/operators clean the entire facility — including plenum spaces and underfloor areas — on an annual or biannual basis, with more frequent cleanings of all horizontal and vertical surfaces and even more frequent cleanings of handrails, doors and high-traffic areas. Schedules vary by facility. Some facility owners also regularly conduct facility-wide disinfection, which is the process that eliminates many or all pathogenic microorganisms, except bacterial spores, on inanimate objects. This is similar to sterilization, which eliminates all forms of microbial life.
Cleaning and sanitization are not necessarily risk-free processes, and they can be expensive. Moreover, the protection they provide may not persist, as the virus can easily be introduced — or reintroduced — minutes after a facility has been thoroughly cleaned and disinfected. However, the threat to operations is real: 10% of our survey respondents have experienced an IT service slowdown related to COVID-19, and 4% reported a pandemic-related outage or service interruption.
Cleaning considerations
Cleaners must exercise a great deal of care and follow rigorous procedures, as they often work in close proximity to sensitive electronic gear. In addition, cleaners must use vacuums with high efficiency particulate air filters, so that no dust or dirt particles become airborne and get pulled into the supply air. Deep cleaning is a necessary step to make sanitization and disinfection effective.
Sanitization or disinfection requires even greater caution, as most cleaning materials come with precise care, handling and use instructions, for both worker and equipment safety. Properly done, sanitization can remove up to 99.9999% (6-log kill) of enveloped viruses, such as SARS-CoV-2, the virus that causes COVID-19. Higher levels of disinfection are possible using stronger chemicals (see the EPA’s list of disinfectants that are effective against the coronavirus and even more durable biological contaminants such as fungi and bacterial spores). The list also includes many products that are effective against coronaviruses and are safe and easy to use.
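The “log kill” shorthand used above is simple arithmetic: an n-log kill reduces the surviving microbial population by a factor of 10^n. A short illustrative sketch (the conversion is standard arithmetic, not vendor data):

```python
# An "n-log kill" reduces the surviving population by a factor of 10**n.
def log_kill_to_percent(n: int) -> float:
    """Percentage of organisms removed by an n-log reduction."""
    return 100 * (1 - 10 ** (-n))

# A 6-log kill leaves one organism per million: 99.9999% removed.
for n in (3, 6):
    print(f"{n}-log kill removes {log_kill_to_percent(n):.4f}% of organisms")
```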
Data center cleaning services interviewed by Uptime Institute do not necessarily consider repeated deep cleaning or sanitization of entire facilities a necessary COVID-19 precaution. Clients, they say, must consider the cost of cleaning, which could be as much as $100,000 for each cleaning of a 100,000-square-foot facility, and sanitization, which can add another 20% or more to the cost of deep cleaning.
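Those figures work out to roughly $1 per square foot for a deep clean, with sanitization adding 20% or more on top. A hypothetical budgeting sketch using only the rates quoted above (not a price list):

```python
# Rough cleaning-budget estimate using the per-square-foot rate quoted above.
DEEP_CLEAN_PER_SQFT = 1.00   # ~$100,000 per cleaning of a 100,000 sq ft facility
SANITIZATION_UPLIFT = 0.20   # sanitization adds ~20% (or more) on top

def cleaning_cost(sq_ft: float, sanitize: bool = True) -> float:
    """Estimated cost in dollars of one deep clean, optionally with sanitization."""
    cost = sq_ft * DEEP_CLEAN_PER_SQFT
    if sanitize:
        cost *= 1 + SANITIZATION_UPLIFT
    return cost

print(f"${cleaning_cost(100_000):,.0f} for one deep clean plus sanitization")
```

At that rate, frequent whole-facility treatments add up quickly, which is one argument for reserving repeated treatment for high-touch areas.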
The cleaners note that many data centers are relatively clean “low-touch” facilities. In normal operation, data centers have features that work to protect against the entry of viruses and other particles. These include controlled access, static mats at many entries, and air filtration and pressurization in white spaces, all of which reduce the ability of the SARS-CoV-2 virus to enter sensitive areas of the facility. Many data centers also have strict rules against bringing unnecessary gear or personal items into the data center.
These characteristics suggest that an initial deep cleaning and sanitization of the entire facility as protection against the spread of COVID-19 is a reasonable approach in many data centers. Frequent or regular disinfections can be reserved for highly trafficked spaces and high-touch areas, such as handrails, handles, light switches and doorknobs. According to ASHRAE, computational fluid dynamics can be used to identify likely landing places for airborne viruses, making these treatments even more effective.
Some data center owners may want more frequent cleanings for “peace of mind” or legal reasons. Higher levels of attention may be warranted if an infected visitor has accessed the facility. To date, Uptime Institute has investigated three such reports worldwide, but anonymous responses to our survey indicate that the number of incidents is much higher (15% of respondents reported COVID-19-positive or symptomatic staff or visitors).
Uptime Institute clients have expressed interest in a range of cleaning and disinfection processes, including ozone treatment and ultraviolet germicidal irradiation (UVGI), as well as relevant products.
Uptime Institute is aware of at least one new product — a biostatic antimicrobial coating — that has been introduced to the market. The US-based vendor describes it as “an invisible and durable coating that helps provide for the physical kill, or more precisely the disruption, of microorganisms (i.e., bacteria, viruses, mold, mildew and fungi) and can last up to 90 days or possibly longer.”
Even if proven effective against COVID-19, these treatments may have other use limitations and prohibitive downsides for either the equipment in data centers or the humans who staff them. For instance, UVGI is a disinfection method that uses short-wavelength ultraviolet (ultraviolet C or UVC) light to kill or inactivate microorganisms. But UVC light can only disinfect where it is directly applied, so it cannot be used to treat surfaces where targeted exposure is not possible (such as under raised floors, in overhead plenums, and behind cabinets/cables). In addition, exposure to the wavelength typically used in UVGI (253.7 nm) can damage the skin and eyes.
Based on currently available information, rigorous adherence to a program that includes hand hygiene, surface cleaning and social distancing will reduce the likelihood of transmission without introducing additional risks.
“Data center cleaning and sanitization: detail, cost and effectiveness” | Kevin Heslin | Published June 1, 2020
To date, the critical infrastructure industry has mostly managed effectively with reduced staff, deferred maintenance, social distancing and new patterns of demand. While there have been some serious incidents and outages directly related to the pandemic, these have been few.
But what worries operators for the months ahead? Clearly, the situation is changing all the time, with different facilities in different countries or states facing very different situations. Many countries are easing their lockdowns, while others have still to reach the (first?) peak of the pandemic.
In an Uptime Institute survey of over 200 critical IT/facility infrastructure operators around the world, a third of operators said a reduced level of IT infrastructure operations staff poses the single biggest risk to operations (see Figure 1).
This is perhaps not surprising — earlier Uptime Institute research has shown that insufficient staffing can be associated with increased levels of failures and incidents; and even before the pandemic, 60% of data center owners and operators reported difficulty finding or retaining skilled staff. Taken together, reduced maintenance (15%) and shortages of critical components (6%) suggest that continued, problem-free operation of equipment is the second biggest concern.
When asked to cite the top three concerns, data center project or construction delays (see Figure 2) were cited by 45% of the respondents (as was the staffing issue). This is perhaps not surprising; when asked later in the survey to select all relevant impacts from among a range of potential repercussions, 60% reported pandemic-related project or construction delays.
More information on this topic is available to members of the Uptime Institute Network here.
“COVID-19: What worries data center management the most?” | Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected] | Published May 11, 2020
To date, media coverage of the impact of COVID-19 and the lockdowns has been largely laudatory. There have been few high-profile or serious outages (perhaps fewer than normal) and for the most part, internet traffic flow analysis shows that a sudden jump in demand, along with a shift toward the residential edge and busy daytime hours, has had little impact on performance. The military-grade resiliency of the ARPANET (Advanced Research Projects Agency Network) and the Internet Protocol, the foundation of the modern internet, is given much credit.
But beneath the calm surface water, data center operators have been paddling furiously to maintain services, especially with some alarming staff shortages at some sites.
In our most recent survey, 84% of respondents had not experienced a service slowdown or outage that could be attributed to COVID-19. However, 4% (eight operators) said they had a COVID-19-related outage and 10% (20 operators) had experienced a service slowdown that was COVID-19 related (see figure below).
Establishing the causes of these slowdowns or outages will probably not be easy. Research does show that staff shortage and tiredness can lead to more incidents and outages, and sustained staff shortages (due to illness, separation of shifts and self-isolation) are widespread across the sector. Some recent data center outages that Uptime Institute has tracked were clearly the result of operator or management error — but this is a common occurrence even in normal times.
Slowdowns, meanwhile, are most likely to be the result of sudden changes in demand and overload, or an external third-party network problem. Two examples are the UK online grocer that mistook high demand for a denial of service attack; and Sharp Electronics, which offered consumer PPE (personal protection equipment) using the same systems as its online appliance management systems in Japan. Both crashed. Zoom, the suddenly popular conferencing service, has also experienced some maintenance-related issues. In the US, a CenturyLink cable cut, and another network issue at Tata Communications in Europe, caused outage numbers to surge above averages.
As the impact of the virus continues, data center operators could come under more strain. Most operators have deferred some scheduled maintenance, which, in spite of monitoring and close management, will likely lead to an increased risk of more failures. In addition, many if not most sites are now operating with a reduced level of on-site staff, with many engineers on call rather than on site. The industry’s traditional conservatism has so far given it a good protective buffer — but this will come under pressure over time, unless restrictions are eased or practices evolve to ensure risk is again reduced.
More information on this topic is available to members of the Uptime Institute Network here.
“Pandemic is causing some outages and slowdowns” | Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected] | Published May 4, 2020
The average power usage effectiveness (PUE) ratio for a data center in 2020 is 1.58, only marginally better than 7 years ago, according to the latest annual Uptime Institute survey (findings to be published shortly).
PUE, an international standard first developed by The Green Grid and others in 2007, is the most widely accepted way of measuring the energy efficiency of a data center. It is the ratio of the total energy used by the entire data center to the energy delivered to the IT equipment, so a value of 1.0 would mean all power reaches the IT load.
All operators strive to get their PUE ratio down to as near 1.0 as possible. Using the latest technology and practices, most new builds fall between 1.2 and 1.4. But there are still thousands of older data centers that cannot be economically or safely upgraded to become that efficient, especially if high availability is required. In 2019, the PUE value increased slightly (see previous article), with a number of possible explanations.
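The ratio itself is simple arithmetic; here is a minimal sketch in Python, using illustrative figures rather than survey data:

```python
# PUE = total facility energy / energy delivered to IT equipment (always >= 1.0).
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power usage effectiveness; 1.0 would mean zero cooling/power overhead."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return total_facility_kwh / it_kwh

# A site drawing 1,580 kWh to deliver 1,000 kWh to IT matches the 2020
# survey average of 1.58; a modern build might be nearer 1.2.
print(round(pue(1580, 1000), 2))  # -> 1.58
print(round(pue(1200, 1000), 2))  # -> 1.2
```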
The new data (shown in the figure below) conforms to a consistent pattern: Big improvements in energy efficiency were made from 2007 to 2013, mostly using inexpensive or easy methods such as simple air containment, after which improvements became more difficult or expensive. The Uptime Institute figures are based on surveys of global data centers ranging in size from 1 megawatt (MW) to over 60 MW, of varying ages.
As ever, the data does not tell a complete story. This data is based on the average PUE per site, regardless of size or age. Newer data centers, usually built by hyperscale or colocation companies, tend to be much more efficient, and larger. A growing amount of work is therefore done in larger, more efficient data centers (Uptime Institute data in 2019 shows data centers above 20 MW to have lower PUEs). Data released by Google shows almost exactly the same curve shape — but at much lower values.
Operators who cannot improve their site PUE can still do a lot to reduce energy and/or decarbonize operations. First, they can improve the utilization of their IT and refresh their servers to ensure IT energy optimization. Second, they can re-use the heat generated by the data center; and third, they can buy renewable energy or invest in renewable energy generation.
More information on this topic is available to members of the Uptime Institute Network here.
“Data center PUEs flat since 2013” | Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected] | Published April 27, 2020
COVID-19: Critical impact and legacy
/in Executive, News, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected]What will be the long-lasting impact of the COVID-19 pandemic on the digital critical infrastructure industry? It may be too soon to ask the question given that, at the time of writing, the virus is taking its toll at scale across the world. But Uptime Institute has been asked this question many times, and it’s a discussion point on several upcoming (virtual) conferences.
In a recent Financial Times column, the economist Tim Harford speaks of “hysteresis” — a concept borrowed originally from electromagnetics. Hysteresis describes how some events have a long-lasting, lagging impact, even after the original cause of the change has long gone. Some of the effects of the coronavirus pandemic are elastic — industries, behaviors and business models, for example, are pulled around, but they will spring back their original shape as the risk passes. But others behave more like plastic: once stretched or broken, they stay that way.
While it may be soon to tell, many businesses and investors have already begun making bets that certain things have changed forever. Some companies are preparing for the sale of the headquarters; several have told their staff that, from now on, they are at-home workers. One conference company has moved its entire business online. And in the critical infrastructure world, we know of companies that have already begun to introduce permanent changes in the data center, including introducing large-scale automation programs (to reduce the need for on-site staff).
Uptime Institute has classed the evolution of the digital critical response into three phases.
Phase 1
The first phase might be termed “reactive.” At this point, operators are on high alert, in firefighting or emergency response mode. Their biggest concerns are to understand the threat and to identify and implement best practices to decrease the risk to staff and to availability; high on the list are the need to source appropriate personal protective equipment and decontaminants, to clean facilities, and to operate with reduced staff and possible supply chain disruption. For most, this phase lasted a few weeks or more and has already passed. While the vast majority came through this well, about one in 20 data centers told Uptime Institute in a survey that they had a COVID-19 related outage. Others said they experienced IT service slowdowns, likely due to changing demand patterns or server/network maintenance problems.
Phase 2
The second phase is an interim normal – most data center operators are in this period (which may last up to a year). In this phase, the virus is still widespread, but threat is reduced as governments regain control of the spread. Processes to manage-down risk that were established in the reactive phase have become established — these include more remote working, blue/red operations teams, reduced maintenance, greater redundancy, and now form the “new normal.” Most of these are process-based and do not involve long-term investments or strategic changes. Some examples of these activities are presented in the chart below.
During this phase, delayed projects are likely to be gradually restarted, but only as a managed risk; investment is curtailed unless clearly pulled by strong demand (which continues to drive new builds and investments). At this time, the management also prepares investment projects for the third phase — the next permanent normal.
Phase 3
The third stage of this cycle is the next normal. At this point, it is likely a vaccine has been found, treatments have improved, or the virus has been contained by social measures to the point of routine manageability. However, the world has been alerted to the possibility (likelihood?) of another pandemic — so long-term changes are likely, alongside the acceptance of other changes that were found to have proved effective, in spite of the danger’s having passed.
What will these be? Here, the answers are less clear. We think the following are highly likely:
Here are some other possible developments, where the outcome is less clear:
One thing all analysts agree on: The pandemic has not slowed, and has probably accelerated. This has raised both the demand for more data center services and the dependency on these services yet further.
Enterprises’ need for control and visibility still slows cloud adoption
/in Executive/by Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected]During the current COVID-19 crisis, enterprise dependency on cloud platform providers (Amazon Web Services, Microsoft Azure, Google Cloud Platform) and on software as a service (Salesforce, Zoom, Teams) has increased. Operators report booming demand, with Microsoft’s Chief Executive Officer Satya Nadella saying, “We have seen two years’ worth of digital transformation in two months.”
This success brings it with new challenges and some growing responsibilities. Corporate operators tend to classify their applications and services into levels or categories, which defines what they might require for each one in terms of service levels, access, transparency and accountability. Test and development applications, shadow IT, and smaller and newer online businesses were the initial and main drivers of cloud demand in the first decade, and these have relatively light availability needs. But for critical production applications, now the biggest target for cloud companies, enterprises have expectations and compliance requirements that are far more stringent.
During the pandemic, a lot of applications (such as video conferencing, but a lot more besides) have become more critical than they were before. This is a trend that was already underway, but whereas it was previously happening almost imperceptibly, it is now much more obvious. Uptime Institute has described this trend as “creeping criticality” — and it can lead to a circumstance we call “asymmetric resiliency,” which occurs when the level of resiliency required by the application is not matched by the infrastructure supporting it.
The big public cloud operators do not think this applies to them, because all addressed the issue of availability from the outset. The dominant cloud architecture, with multisite replication using availability zones, all built on fairly robust data centers (if not always uniformly or demonstrably so), has proved largely effective. Cloud services do suffer some infrastructure-related outages, according to Uptime Institute data, but no more than most enterprises.
Unfortunately for the cloud companies, assurances of availability or of adherence to best practices are not enough. In both the 2019 and the latest 2020 global annual survey of IT and critical infrastructure operators, Uptime Institute asked respondents if they put mission-critical workloads into the cloud, and whether resiliency and visibility are an issue. The answers in each year are almost identical: Over 70% did not put any critical applications in the cloud, with a third of this group (21% of the total sample) saying the reason was a lack of visibility/accountability about resiliency. And one in six of those who do place applications in the public cloud also say they do not have enough visibility (see chart below).
The pressure to be more open and accountable is growing. During the COVID-19 pandemic, governments have tried to better understand the vulnerability of their national critical infrastructure, and it hasn’t always been easy. In the UK, the government has expressed concern that the cloud players were unforthcoming with good information.
The rules being developed and applied in the financial services sector, especially in Europe, may provide a model for other sectors to follow. The EBA (European Banking Authority) says that banks may not use a cloud, hosting or colocation company for critical applications unless they are open to full, on-site inspection. That presents a challenge for the banks, who must verify that dozens or hundreds of data centers are well built and well run. Those rules have proved important, because without them, banks seeking access did not always get a welcoming response.
The challenge for financial services regulators and operators is that the banking system is highly interdependent, with many end-to-end processes spanning multiple businesses and multiple data centers and data services. Failure of one can mean a systemic failure; minimization of risk, and full accountability, is essential in the sector — assurances of 99.99% availability are not enough.
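A headline availability figure translates directly into permitted downtime, which helps explain why 99.99% is not enough for interdependent financial processes. The quick calculation below is illustrative only (the figures are not from the survey; real service level agreements define measurement windows differently):

```python
# Convert a headline availability percentage into allowed downtime per year.
# Illustrative arithmetic only, not a statement of any provider's SLA terms.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% availability permits {minutes:.1f} minutes of downtime per year")
```

At 99.99%, roughly 53 minutes of downtime a year is still within the service level, more than enough to disrupt an end-to-end settlement or payment chain.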
During the pandemic, the cloud providers have performed well, but failures or slowdowns have occurred. Microsoft’s success, for example, has led to constraints in the availability of Azure services in many regions. For test, dev or even some smaller businesses, this is not too serious; for critical enterprise IT, it might be.
Uptime Institute data suggests that less than 10% of mission critical workloads are running in the public cloud, and that this number is not rising as fast as overall cloud growth (even if new applications like Microsoft Teams have taken off globally). With more accountability and visibility, this number might increase faster.
More information on this topic is available to members and guest members of the Uptime Institute Network community. Details on joining can be found here.
Data Center cleaning and sanitization: detail, cost and effectiveness
/in Executive, Operations/by Kevin Heslin

Fear of the coronavirus or confirmed exposure has caused about half (49%) of data center owners to increase the frequency of regular cleanings, according to a recent Uptime Institute flash survey. Even more (66%) have increased the frequency of sanitization (disinfection) to eliminate (however transitorily) any possible presence of the novel coronavirus in their facilities.
In times of normal operation, most facilities conduct a thorough cleaning (the removal of particles, static and residue) on a regular basis, with frequency determined by need. These cleanings help data centers meet ISO 14644-1, as required by ASHRAE.
Many use air quality indicators to determine when further cleaning is needed. Facility-specific factors such as air exchange rate/volume and raised floors can greatly affect the number of particulates in the air.
Other data center owners/operators clean the entire facility — including plenum spaces and underfloor areas — on an annual or biannual basis, with more frequent cleanings of all horizontal and vertical surfaces and even more frequent cleanings of handrails, doors and high-traffic areas. Schedules vary by facility. Some facility owners also regularly conduct facility-wide disinfection, which is the process that eliminates many or all pathogenic microorganisms, except bacterial spores, on inanimate objects. This is similar to sterilization, which eliminates all forms of microbial life.
Cleaning and sanitization are not necessarily risk-free processes, and they can be expensive. Moreover, the protection they provide may not persist, as the virus can easily be introduced — or reintroduced — minutes after a facility has been thoroughly cleaned and disinfected. However, the threat to operations is real: 10% of our survey respondents have experienced an IT service slowdown related to COVID-19, and 4% reported a pandemic-related outage or service interruption.
Cleaning considerations
Cleaners must exercise a great deal of care and follow rigorous procedures, as they often work in close proximity to sensitive electronic gear. In addition, cleaners must use vacuums with high-efficiency particulate air (HEPA) filters, so that no dust or dirt particles become airborne and get pulled into the supply air. Deep cleaning is a necessary step to make sanitization and disinfection effective.
Sanitization or disinfection requires even greater caution, as most cleaning materials come with precise care, handling and use instructions, for both worker and equipment safety. Properly done, sanitization can remove up to 99.9999% (a 6-log kill) of enveloped viruses, such as SARS-CoV-2, the virus that causes COVID-19. Higher levels of disinfection are possible using stronger chemicals (see the EPA’s list of disinfectants that are effective against the coronavirus and even more durable biological contaminants such as fungi and bacterial spores). The list also includes many products that are effective against coronaviruses and are safe and easy to use.
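The “log kill” shorthand is simply the base-10 logarithm of the surviving fraction of microorganisms. A minimal sketch of the conversion (the function name is ours, for illustration):

```python
import math

def log_reduction(kill_pct: float) -> float:
    """Log10 reduction for a given kill percentage (99.9999% is a ~6-log kill)."""
    surviving_fraction = 1 - kill_pct / 100
    return -math.log10(surviving_fraction)

print(round(log_reduction(99.9999)))  # 6-log kill: one organism in a million survives
print(round(log_reduction(99.9)))     # 3-log kill: one in a thousand survives
```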
Data center cleaning services interviewed by Uptime Institute do not necessarily consider repeated deep cleaning or sanitization of entire facilities a necessary COVID-19 precaution. Clients, they say, must consider the cost of cleaning, which could be as much as $100,000 for each cleaning of a 100,000-square-foot facility, and sanitization, which can add another 20% or more to the cost of deep cleaning.
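Those figures imply a rate of roughly $1 per square foot for a deep clean, with sanitization adding about 20% on top. A simple illustrative estimate (the rate and uplift are derived from the quotes above and treated here as upper bounds):

```python
def cleaning_cost_estimate(square_feet: float,
                           rate_per_sq_ft: float = 1.00,      # "as much as" ~$1/sq ft, per the quotes above
                           sanitization_uplift: float = 0.20  # sanitization adds ~20% to deep cleaning
                           ) -> tuple[float, float]:
    """Return (deep-clean cost, deep-clean plus sanitization cost) in dollars."""
    deep_clean = square_feet * rate_per_sq_ft
    with_sanitization = deep_clean * (1 + sanitization_uplift)
    return deep_clean, with_sanitization

deep, full = cleaning_cost_estimate(100_000)
print(f"Deep clean: ${deep:,.0f}; with sanitization: ${full:,.0f}")
```

For a 100,000-square-foot facility this reproduces the $100,000 deep-clean figure cited by the cleaning services, rising to around $120,000 with sanitization included.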
The cleaners note that many data centers are relatively clean “low-touch” facilities. In normal operation, data centers have features that work to protect against the entry of viruses and other particles. These include controlled access, static mats at many entries, and air filtration and pressurization in white spaces, all of which reduce the ability of the SARS-CoV-2 virus to enter sensitive areas of the facility. Many data centers also have strict rules against bringing unnecessary gear or personal items into the data center.
These characteristics suggest that an initial deep cleaning and sanitization of the entire facility as protection against the spread of COVID-19 is a reasonable approach in many data centers. Frequent or regular disinfections can be reserved for highly trafficked spaces and high-touch areas, such as handrails, handles, light switches and doorknobs. According to ASHRAE, computational fluid dynamics can be used to identify likely landing places for airborne viruses, making these treatments even more effective.
Some data center owners may want more frequent cleanings for “peace of mind” or legal reasons. Higher levels of attention may be warranted if an infected visitor has accessed the facility. To date, Uptime Institute has investigated three such reports worldwide, but anonymous responses to our survey indicate that the number of incidents is much higher (15% of respondents reported COVID-19-positive or symptomatic staff or visitors).
Uptime Institute clients have expressed interest in a range of cleaning and disinfection processes, including ozone treatment and ultraviolet germicidal irradiation (UVGI), as well as relevant products.
Uptime Institute is aware of at least one new product — a biostatic antimicrobial coating — that has been introduced to the market. The US-based vendor describes it as “an invisible and durable coating that helps provide for the physical kill, or more precisely the disruption, of microorganisms (i.e., bacteria, viruses, mold, mildew and fungi) and can last up to 90 days or possibly longer.”
Even if proven effective against COVID-19, these treatments may have other use limitations and prohibitive downsides for either the equipment in data centers or the humans who staff them. For instance, UVGI is a disinfection method that uses short-wavelength ultraviolet (ultraviolet C or UVC) light to kill or inactivate microorganisms. But UVC light can only disinfect where it is directly applied, so it cannot be used to treat surfaces where targeted exposure is not possible (such as under raised floors, in overhead plenums, and behind cabinets and cables). In addition, exposure to the wavelength typically used in UVGI (253.7 nm) can damage the skin and eyes.
Based on currently available information, rigorous adherence to a program that includes hand hygiene, surface cleaning and social distancing will reduce the likelihood of transmission without introducing additional risks.
COVID-19: What worries data center management the most?
/in Executive, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected]

To date, the critical infrastructure industry has mostly managed effectively with reduced staff, deferred maintenance, social distancing and new patterns of demand. While there have been some serious incidents and outages directly related to the pandemic, these have been few.
But what worries operators for the months ahead? Clearly, the situation is changing all the time, with different facilities in different countries or states facing very different situations. Many countries are easing their lockdowns, while others have still to reach the (first?) peak of the pandemic.
In an Uptime Institute survey of over 200 critical IT/facility infrastructure operators around the world, a third of operators said a reduced level of IT infrastructure operations staff poses the single biggest risk to operations (see Figure 1).
This is perhaps not surprising — earlier Uptime Institute research has shown that insufficient staffing can be associated with increased levels of failures and incidents; and even before the pandemic, 60% of data center owners and operators reported difficulty finding or retaining skilled staff. Taken together, reduced maintenance (15%) and shortages of critical components (6%) suggest that continued, problem-free operation of equipment is the second biggest concern.
When asked to cite their top three concerns, 45% of respondents cited data center project or construction delays (see Figure 2), the same proportion as cited the staffing issue. This is perhaps not surprising; when asked later in the survey to select all relevant impacts from among a range of potential repercussions, 60% reported pandemic-related project or construction delays.
More information on this topic is available to members of the Uptime Institute Network here.
Pandemic is causing some outages and slowdowns
/in Executive, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected]

To date, media coverage of how digital infrastructure has coped with COVID-19 and the lockdowns has been largely laudatory. There have been few high-profile or serious outages (perhaps fewer than normal), and for the most part, internet traffic flow analysis shows that a sudden jump in demand, along with a shift toward the residential edge and busy daytime hours, has had little impact on performance. The military-grade resiliency of the ARPANET (Advanced Research Projects Agency Network) and the Internet Protocol, the foundations of the modern internet, is given much credit.
But beneath the calm surface water, data center operators have been paddling furiously to maintain services, especially with some alarming staff shortages at some sites.
In our most recent survey, 84% of respondents had not experienced a service slowdown or outage that could be attributed to COVID-19. However, 4% (eight operators) said they had a COVID-19-related outage and 10% (20 operators) had experienced a service slowdown that was COVID-19 related (see figure below).
Establishing the causes of these slowdowns or outages will probably not be easy. Research does show that staff shortage and tiredness can lead to more incidents and outages, and sustained staff shortages (due to illness, separation of shifts and self-isolation) are widespread across the sector. Some recent data center outages that Uptime Institute has tracked were clearly the result of operator or management error — but such errors are a common cause of outages in normal times too.
Slowdowns, meanwhile, are most likely to be the result of sudden changes in demand and overload, or an external third-party network problem. Two examples are the UK online grocer that mistook high demand for a denial of service attack; and Sharp Electronics, which offered consumer PPE (personal protective equipment) using the same systems as its online appliance business in Japan. Both crashed. Zoom, the suddenly popular conferencing service, has also experienced some maintenance-related issues. In the US, a CenturyLink cable cut, and another network issue at Tata Communications in Europe, caused outage numbers to surge above average.
As the impact of the virus continues, data center operators could come under more strain. Most operators have deferred some scheduled maintenance, which in spite of monitoring and close management, will likely lead to an increased risk of more failures. In addition, many if not most sites are now operating with a reduced level of on-site staff, with many engineers on call, rather than on site. The industry’s traditional conservatism has so far given it a good protective buffer — but this will come under pressure over time, unless restrictions and practices are eased or evolve to ensure risk is again reduced.
More information on this topic is available to members of the Uptime Institute Network here.
Data center PUEs flat since 2013
/in Executive, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected]

The average power usage effectiveness (PUE) ratio for a data center in 2020 is 1.58, only marginally better than seven years ago, according to the latest annual Uptime Institute survey (findings to be published shortly).
PUE, an international standard first developed by The Green Grid and others in 2007, is the most widely accepted way of measuring the energy efficiency of a data center. It measures the ratio of the energy used by the entire data center to the energy used by the IT equipment alone.
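PUE is total facility energy divided by IT equipment energy, so a value of 1.58 means 0.58 units of overhead (cooling, power distribution, lighting) for every unit delivered to IT. A minimal sketch of the calculation (the example figures are invented for illustration):

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy over IT equipment energy."""
    return total_facility_kwh / it_equipment_kwh

# A hypothetical facility consuming 1,580 MWh in total to support 1,000 MWh of IT load
print(pue(1580, 1000))  # 1.58, matching the 2020 survey average
```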
All operators strive to get their PUE ratio down to as near 1.0 as possible. Using the latest technology and practices, most new builds fall between 1.2 and 1.4. But there are still thousands of older data centers that cannot be economically or safely upgraded to become that efficient, especially if high availability is required. In 2019, the PUE value increased slightly (see previous article), with a number of possible explanations.
The new data (shown in the figure below) conforms to a consistent pattern: Big improvements in energy efficiency were made from 2007 to 2013, mostly using inexpensive or easy methods such as simple air containment, after which improvements became more difficult or expensive. The Uptime Institute figures are based on surveys of global data centers ranging in size from 1 megawatt (MW) to over 60 MW, of varying ages.
As ever, the data does not tell a complete story. This data is based on the average PUE per site, regardless of size or age. Newer data centers, usually built by hyperscale or colocation companies, tend to be much more efficient, and larger. A growing amount of work is therefore done in larger, more efficient data centers (Uptime Institute data in 2019 shows data centers above 20 MW to have lower PUEs). Data released by Google shows almost exactly the same curve shape — but at much lower values.
Operators who cannot improve their site PUE can still do a lot to reduce energy and/or decarbonize operations. First, they can improve the utilization of their IT and refresh their servers to ensure IT energy optimization. Second, they can re-use the heat generated by the data center; and third, they can buy renewable energy or invest in renewable energy generation.
More information on this topic is available to members of the Uptime Institute Network here.