During the current COVID-19 crisis, enterprise dependency on cloud platform providers (Amazon Web Services, Microsoft Azure, Google Cloud Platform) and on software as a service (Salesforce, Zoom, Teams) has increased. Operators report booming demand, with Microsoft’s Chief Executive Officer Satya Nadella saying, “We have seen two years’ worth of digital transformation in two months.”
This success brings with it new challenges and some growing responsibilities. Corporate operators tend to classify their applications and services into levels or categories, which define what they require of each one in terms of service levels, access, transparency and accountability. Test and development applications, shadow IT, and smaller and newer online businesses were the initial and main drivers of cloud demand in its first decade, and these have relatively light availability needs. But for critical production applications, now the biggest target for cloud companies, enterprises have expectations and compliance requirements that are far more stringent.
During the pandemic, many applications (video conferencing among them, but many more besides) have become more critical than they were before. This trend was already underway, but whereas it was previously happening almost imperceptibly, it is now much more obvious. Uptime Institute has described this trend as “creeping criticality” — and it can lead to a circumstance we call “asymmetric resiliency,” which occurs when the level of resiliency required by an application is not matched by the infrastructure supporting it.
The big public cloud operators do not think this applies to them, because all have addressed the issue of availability at the outset. The dominant cloud architecture, with multisite replication using availability zones, all built on fairly robust data centers (if not always uniformly or demonstrably so), has proved largely effective. Cloud services do suffer some infrastructure-related outages, according to Uptime Institute data, but no more than most enterprises.
Unfortunately for the cloud companies, assurances of availability or of adherence to best practices are not enough. In both the 2019 and the latest 2020 global annual surveys of IT and critical infrastructure operators, Uptime Institute asked respondents whether they put mission-critical workloads into the cloud, and whether resiliency and visibility are an issue. The answers in each year are almost identical: Over 70% did not put any critical applications in the cloud, with a third of this group (21% of the total sample) saying the reason was a lack of visibility/accountability about resiliency. And one in six of those who do place applications in the public cloud also say they do not have enough visibility (see chart below).
The pressure to be more open and accountable is growing. During the COVID-19 pandemic, governments have tried to better understand the vulnerability of their national critical infrastructure, and it hasn’t always been easy. In the UK, the government has expressed concern that cloud providers have not been forthcoming with good information.
The rules being developed and applied in the financial services sector, especially in Europe, may provide a model for other sectors to follow. The EBA (European Banking Authority) says that banks may not use a cloud, hosting or colocation company for critical applications unless they are open to full, on-site inspection. That presents a challenge for the banks, who must verify that dozens or hundreds of data centers are well built and well run. Those rules have proved important, because without them, banks seeking access did not always get a welcoming response.
The challenge for financial services regulators and operators is that the banking system is highly interdependent, with many end-to-end processes spanning multiple businesses and multiple data centers and data services. Failure of one can mean a systemic failure; minimization of risk, and full accountability, is essential in the sector — assurances of 99.99% availability are not enough.
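To put that last point in perspective, here is a minimal sketch (plain availability arithmetic, not drawn from the survey data) of how little downtime common availability figures actually permit:

```python
# Back-of-the-envelope arithmetic: how much annual downtime a given
# availability figure actually permits.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_budget_minutes(availability_pct: float) -> float:
    """Maximum downtime per year, in minutes, consistent with an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% availability allows about {downtime_budget_minutes(pct):.0f} minutes of downtime per year")
# 99.99% still leaves roughly 53 minutes a year of permitted downtime,
# more than enough for a single serious incident in an interdependent banking process.
```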
During the pandemic, the cloud providers have performed well, but failures or slowdowns have occurred. Microsoft’s success, for example, has led to constraints in the availability of Azure services in many regions. For test, dev or even some smaller businesses, this is not too serious; for critical enterprise IT, it might be.
Uptime Institute data suggests that less than 10% of mission-critical workloads are running in the public cloud, and that this number is not rising as fast as overall cloud growth (even if new applications like Microsoft Teams have taken off globally). With more accountability and visibility, this number might increase faster.
More information on this topic is available to members and guest members of the Uptime Institute Network community. Details on joining can be found here.
Enterprises’ need for control and visibility still slows cloud adoption. By Andy Lawrence, Executive Director of Research, Uptime Institute. June 8, 2020.
Fear of the coronavirus or confirmed exposure has caused about half (49%) of data center owners to increase the frequency of regular cleanings, according to a recent Uptime Institute flash survey. Even more (66%) have increased the frequency of sanitization (disinfection) to eliminate (however transitorily) any possible presence of the novel coronavirus in their facilities.
In times of normal operation, most facilities conduct a thorough cleaning (the removal of particles, static and residue) on a regular basis, with frequency determined by need. These cleanings help data centers meet ISO 14644-1, as required by ASHRAE.
Many use air quality indicators to determine when further cleaning is needed. Facility-specific factors such as air exchange rate/volume and raised floors can greatly affect the number of particulates in the air.
Other data center owners/operators clean the entire facility — including plenum spaces and underfloor areas — on an annual or biannual basis, with more frequent cleanings of all horizontal and vertical surfaces and even more frequent cleanings of handrails, doors and high-traffic areas. Schedules vary by facility. Some facility owners also regularly conduct facility-wide disinfection, which is the process that eliminates many or all pathogenic microorganisms, except bacterial spores, on inanimate objects. This is similar to sterilization, which eliminates all forms of microbial life.
Cleaning and sanitization are not necessarily risk-free processes, and they can be expensive. Moreover, the protection they provide may not persist, as the virus can easily be introduced — or reintroduced — minutes after a facility has been thoroughly cleaned and disinfected. However, the threat to operations is real: 10% of our survey respondents have experienced an IT service slowdown related to COVID-19, and 4% reported a pandemic-related outage or service interruption.
Cleaning considerations
Cleaners must exercise a great deal of care and follow rigorous procedures, as they often work in close proximity to sensitive electronic gear. In addition, cleaners must use vacuums with high-efficiency particulate air (HEPA) filters, so that no dust or dirt particles become airborne and get pulled into the supply air. Deep cleaning is a necessary step to make sanitization and disinfection effective.
Sanitization or disinfection requires even greater caution, as most cleaning materials come with precise care, handling and use instructions, for both worker and equipment safety. Properly done, sanitization can remove up to 99.9999% (a 6-log kill) of enveloped viruses, such as SARS-CoV-2, the virus that causes COVID-19. Higher levels of disinfection are possible using stronger chemicals (see the EPA’s list of disinfectants that are effective against the coronavirus and even more durable biological contaminants, such as fungi and bacterial spores). The list also includes many products that are effective against coronaviruses and are safe and easy to use.
Data center cleaning services interviewed by Uptime Institute do not necessarily consider repeated deep cleaning or sanitization of entire facilities a necessary COVID-19 precaution. Clients, they say, must consider the cost of cleaning, which could be as much as $100,000 for each cleaning of a 100,000-square-foot facility, and sanitization, which can add another 20% or more to the cost of deep cleaning.
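Assuming the ballpark figures quoted above hold (roughly $1 per square foot per deep clean, with sanitization adding about 20%), a simple budgeting sketch might look like the following; the cleaning frequency in the example is an illustrative assumption, not a recommendation:

```python
# Rough annual cost sketch for repeated facility-wide deep cleaning and sanitization,
# using the ballpark rates cited by the cleaning services above. Figures are estimates.
DEEP_CLEAN_RATE_PER_SQFT = 1.00   # ~$100,000 per 100,000 sq ft
SANITIZATION_UPLIFT = 0.20        # sanitization adds ~20% to the cost of a deep clean

def annual_cleaning_cost(floor_area_sqft: float, cleans_per_year: int,
                         include_sanitization: bool = True) -> float:
    """Estimate the yearly spend on facility-wide deep cleans (optionally with sanitization)."""
    per_clean = floor_area_sqft * DEEP_CLEAN_RATE_PER_SQFT
    if include_sanitization:
        per_clean *= 1 + SANITIZATION_UPLIFT
    return per_clean * cleans_per_year

# Example: quarterly deep clean plus sanitization of a 100,000 sq ft facility.
print(f"${annual_cleaning_cost(100_000, 4):,.0f} per year")  # $480,000
```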
The cleaners note that many data centers are relatively clean “low-touch” facilities. In normal operation, data centers have features that work to protect against the entry of viruses and other particles. These include controlled access, static mats at many entries, and air filtration and pressurization in white spaces, all of which reduce the ability of the SARS-CoV-2 virus to enter sensitive areas of the facility. Many data centers also have strict rules against bringing unnecessary gear or personal items into the data center.
These characteristics suggest that an initial deep cleaning and sanitization of the entire facility as protection against the spread of COVID-19 is a reasonable approach in many data centers. Frequent or regular disinfections can be reserved for highly trafficked spaces and high-touch areas, such as handrails, handles, light switches and doorknobs. According to ASHRAE, computational fluid dynamics can be used to identify likely landing places for airborne viruses, making these treatments even more effective.
Some data center owners may want more frequent cleanings for “peace of mind” or legal reasons. Higher levels of attention may be warranted if an infected visitor has accessed the facility. To date, Uptime Institute has investigated three such reports worldwide, but anonymous responses to our survey indicate that the number of incidents is much higher (15% of respondents reported COVID-19-positive or symptomatic staff or visitors).
Uptime Institute clients have expressed interest in a range of cleaning and disinfection processes, including ozone treatment and ultraviolet germicidal irradiation (UVGI), as well as relevant products.
Uptime Institute is aware of at least one new product — a biostatic antimicrobial coating — that has been introduced to the market. The US-based vendor describes it as “an invisible and durable coating that helps provide for the physical kill, or more precisely the disruption, of microorganisms (i.e., bacteria, viruses, mold, mildew and fungi) and can last up to 90 days or possibly longer.”
Even if proven effective against COVID-19, these treatments may have other use limitations and prohibitive downsides for either the equipment in data centers or the humans who staff them. For instance, UVGI is a disinfection method that uses short-wavelength ultraviolet (ultraviolet C or UVC) light to kill or inactivate microorganisms. But UVC light can only disinfect where it is directly applied, so it cannot be used to treat surfaces where targeted exposure is not possible (such as under raised floors, in overhead plenums, and behind cabinets and cables). In addition, exposure to the wavelength typically used in UVGI (253.7 nm) can damage the skin and eyes.
Based on currently available information, rigorous adherence to a program that includes hand hygiene, surface cleaning and social distancing will reduce the likelihood of transmission without introducing additional risks.
Data center cleaning and sanitization: detail, cost and effectiveness. By Kevin Heslin. June 1, 2020.
To date, the critical infrastructure industry has mostly managed effectively with reduced staff, deferred maintenance, social distancing and new patterns of demand. While there have been some serious incidents and outages directly related to the pandemic, these have been few.
But what worries operators for the months ahead? Clearly, the situation is changing all the time, with different facilities in different countries or states facing very different situations. Many countries are easing their lockdowns, while others have still to reach the (first?) peak of the pandemic.
In an Uptime Institute survey of over 200 critical IT/facility infrastructure operators around the world, a third of operators said a reduced level of IT infrastructure operations staff poses the single biggest risk to operations (see Figure 1).
This is perhaps not surprising — earlier Uptime Institute research has shown that insufficient staffing can be associated with increased levels of failures and incidents; and even before the pandemic, 60% of data center owners and operators reported difficulty finding or retaining skilled staff. Taken together, reduced maintenance (15%) and shortages of critical components (6%) suggest that continued, problem-free operation of equipment is the second biggest concern.
When asked to cite their top three concerns, 45% of respondents cited data center project or construction delays (see Figure 2), the same proportion as cited the staffing issue. This is perhaps not surprising; when asked later in the survey to select all relevant impacts from among a range of potential repercussions, 60% reported pandemic-related project or construction delays.
More information on this topic is available to members of the Uptime Institute Network here.
COVID-19: What worries data center management the most? By Andy Lawrence, Executive Director of Research, Uptime Institute. May 11, 2020.
To date, media coverage of how digital infrastructure has coped with COVID-19 and the lockdowns has been largely laudatory. There have been few high-profile or serious outages (perhaps fewer than normal) and, for the most part, internet traffic flow analysis shows that a sudden jump in demand, along with a shift toward the residential edge and busy daytime hours, has had little impact on performance. The military-grade resiliency of the ARPANET (Advanced Research Projects Agency Network) and the Internet Protocol, the foundation of the modern internet, is given much credit.
But beneath the calm surface water, data center operators have been paddling furiously to maintain services, especially with some alarming staff shortages at some sites.
In our most recent survey, 84% of respondents had not experienced a service slowdown or outage that could be attributed to COVID-19. However, 4% (eight operators) said they had a COVID-19-related outage, and 10% (20 operators) had experienced a service slowdown that was COVID-19 related (see figure below).
Establishing the causes of these slowdowns or outages will probably not be easy. Research does show that staff shortages and tiredness can lead to more incidents and outages, and sustained staff shortages (due to illness, separation of shifts and self-isolation) are widespread across the sector. Some recent data center outages that Uptime Institute has tracked were clearly the result of operator or management error — but such errors are a common occurrence even in normal times.
Slowdowns, meanwhile, are most likely to be the result of sudden changes in demand and overload, or of an external third-party network problem. Two examples are the UK online grocer that mistook high demand for a denial of service attack, and Sharp Electronics, which offered consumer PPE (personal protective equipment) in Japan using the same systems as its online appliance management services. Both crashed. Zoom, the suddenly popular conferencing service, has also experienced some maintenance-related issues. In the US, a CenturyLink cable cut, and another network issue at Tata Communications in Europe, caused outage numbers to surge above average.
As the impact of the virus continues, data center operators could come under more strain. Most operators have deferred some scheduled maintenance, which, in spite of monitoring and close management, will likely increase the risk of failures. In addition, many if not most sites are now operating with a reduced level of on-site staff, with many engineers on call rather than on site. The industry’s traditional conservatism has so far given it a good protective buffer — but this will come under pressure over time, unless restrictions and practices are eased or evolve to ensure risk is again reduced.
More information on this topic is available to members of the Uptime Institute Network here.
Pandemic is causing some outages and slowdowns. By Andy Lawrence, Executive Director of Research, Uptime Institute. May 4, 2020.
The average power usage effectiveness (PUE) ratio for a data center in 2020 is 1.58, only marginally better than 7 years ago, according to the latest annual Uptime Institute survey (findings to be published shortly).
PUE, an international standard first developed by The Green Grid and others in 2007, is the most widely accepted way of measuring the energy efficiency of a data center. It is the ratio of the total energy used by the entire data center to the energy used by the IT equipment alone, so a lower value indicates less overhead.
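As a simple illustration of the calculation (the meter readings below are invented for the example):

```python
# PUE = total facility energy / IT equipment energy; an ideal facility scores 1.0.
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness from energy metered over the same period."""
    return total_facility_kwh / it_equipment_kwh

# Illustrative annual readings: 10.0 GWh drawn by the facility, 6.33 GWh of it reaching the IT load.
print(round(pue(10_000_000, 6_330_000), 2))  # ~1.58, matching the 2020 survey average
```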
All operators strive to get their PUE ratio down to as near 1.0 as possible. Using the latest technology and practices, most new builds fall between 1.2 and 1.4. But there are still thousands of older data centers that cannot be economically or safely upgraded to become that efficient, especially if high availability is required. In 2019, the PUE value increased slightly (see previous article), with a number of possible explanations.
The new data (shown in the figure below) conforms to a consistent pattern: Big improvements in energy efficiency were made from 2007 to 2013, mostly using inexpensive or easy methods such as simple air containment, after which improvements became more difficult or expensive. The Uptime Institute figures are based on surveys of global data centers ranging in size from 1 megawatt (MW) to over 60 MW, of varying ages.
As ever, the data does not tell a complete story. This data is based on the average PUE per site, regardless of size or age. Newer data centers, usually built by hyperscale or colocation companies, tend to be much more efficient, and larger. A growing amount of work is therefore done in larger, more efficient data centers (Uptime Institute data in 2019 shows data centers above 20 MW to have lower PUEs). Data released by Google shows almost exactly the same curve shape — but at much lower values.
Operators who cannot improve their site PUE can still do a lot to reduce energy use and/or decarbonize operations. First, they can improve the utilization of their IT equipment and refresh their servers to optimize IT energy use. Second, they can reuse the heat generated by the data center. Third, they can buy renewable energy or invest in renewable energy generation.
More information on this topic is available to members of the Uptime Institute Network here.
Data center PUEs flat since 2013. By Andy Lawrence, Executive Director of Research, Uptime Institute. April 27, 2020.
The final installment of our Q&A regarding digital infrastructure considerations during the COVID-19 crisis covers Deferred Maintenance, Remote Work, Supply Chain, the Tier Standard and the Long-Term Outlook.
Below is Part 3 of the Q&A responses to questions raised during our recent series of webinars about managing operational risk during the COVID-19 crisis. (Part 1 dealt with Staffing and Part 2 focused on Site Sanitation.)
DEFERRED MAINTENANCE
Q: One of our clients does not want to do maintenance to avoid entering the data center. What do you recommend for this? Is it necessary to defer preventative maintenance on data center components?
A: Maintenance activities should be prioritized. At the very least, try to perform the most critical activities. If that is not possible, try to rotate run hours between redundant components as much as possible, and contact the manufacturers of components/equipment to better identify the impact of not performing maintenance on specific equipment. Deferred maintenance brings higher risk; for some equipment this risk is more serious than for others, so maintenance activities should be prioritized in order of criticality. For more information, please see our report COVID-19: Minimizing critical facility risk.
UPTIME INSTITUTE
Q: Will the learning from this pandemic be reflected in adjustments in the certification levels of each of the Tiers of the Uptime Institute? Will the Uptime Institute’s standard for operations be updated due to COVID-19, to incorporate lessons learned from this situation? How relevant will DCIM (data center infrastructure management) systems become from this pandemic? Will there be an emphasis on DCIM at the Uptime Institute certification levels?
A: Yes, Uptime Institute is currently evaluating potential adjustments to the criteria of the Uptime Institute Tier Certification of Operational Sustainability to take into consideration the pandemic and potential endemic issues that can affect the normal operation and sustainability of the data center. If the data center has implemented a DCIM system and a BMS (building management system), these systems should be used during a pandemic or other similar emergency events to continually monitor, measure and manage both IT and supporting infrastructure equipment such as power and cooling systems. There should be an emphasis on all virtual private network (VPN) connections, which should be tested to ensure reliable access for remote data center monitoring.
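As a minimal sketch of the kind of routine check described above, the snippet below simply verifies that monitoring endpoints remain reachable over the VPN; the host names and ports are hypothetical placeholders, since DCIM and BMS interfaces vary by vendor:

```python
# Minimal reachability check for remote monitoring endpoints (hypothetical hosts/ports;
# real DCIM/BMS interfaces and their APIs vary by vendor).
import socket

ENDPOINTS = {
    "bms-console.example.internal": 443,    # building management system web interface
    "dcim-api.example.internal": 8443,      # DCIM monitoring endpoint
}

def reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in ENDPOINTS.items():
    status = "OK" if reachable(host, port) else "UNREACHABLE - check the VPN tunnel"
    print(f"{host}:{port} -> {status}")
```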
Q: Will future operational [sustainability] or management and operations awards contemplate additional procedures associated with pandemic risks?
A: Yes, Uptime Institute is currently evaluating and planning modifications of the Uptime Institute Tier Certification of Operational Sustainability, which will result in a change in the evaluation of data centers’ ability to mitigate various risks, including pandemics.
Q: What would be the “Tier IV measures” in a data center regarding COVID-19?
A: Uptime Institute Tier IV refers largely to data center topology design and installation. COVID-19 mostly affects data center operations. Therefore, COVID-19 would not affect a facility’s Tier IV compliance.
REMOTE WORKING
Q: How can we minimize the issue of network saturation because we are all working from home?
A: We recommend that all remote workers have established security policies set up by their IT departments, and that IT departments explore potential bottlenecks and recommend mitigation efforts.
Many employees working from a home office will use their ISP (internet service provider) to access the cloud and the office LAN (local area network) over VPN, where the read/write profile is totally different from, say, a Netflix streaming movie (which is practically all downloading and so has a “read” profile). Add the consumption of others in the home (such as family members also working or doing remote schooling), and the “read” traffic increases and often becomes the great villain of bandwidth consumption. This can cause bandwidth limits to be reached, leading to packet loss or timeouts and a slow internet experience.
With regard to VPNs: as part of their cybersecurity policies, many organizations and governments use strict access policies to control LAN users working from the office. Remote workers use VPNs to build a tunnel through their ISP, linked to their office’s secure access. However, this model is rigid and was not designed to accommodate the number of employees currently working remotely. This can create additional bottlenecks, giving the same slow internet experience as a congested home connection. Sometimes, once connected to the LAN, a remote worker’s internet traffic will also pass through their company’s firewall to reach a cloud-based service. This can cause a serious degradation of service, because the remote worker’s cloud services traffic is compounded by activity such as their video-conferencing traffic and by their family’s consumption.
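To illustrate the saturation effect in the simplest terms, the sketch below adds up hypothetical concurrent demands in a household and compares them with the link capacity; every rate shown is an illustrative assumption rather than a measured figure:

```python
# Back-of-the-envelope check of concurrent household demand against link capacity.
# All per-application rates and the link speed below are illustrative assumptions.
downstream_demand_mbps = {
    "video conferencing (two remote workers)": 2 * 2.5,
    "remote schooling stream": 3.0,
    "HD video streaming": 5.0,
    "cloud and VPN office traffic": 4.0,
}
LINK_DOWNSTREAM_MBPS = 15.0  # a modest residential connection (assumption)

total = sum(downstream_demand_mbps.values())
print(f"Aggregate demand: {total:.1f} Mbps against {LINK_DOWNSTREAM_MBPS:.1f} Mbps available")
if total > LINK_DOWNSTREAM_MBPS:
    print("Link saturated: expect packet loss, timeouts and a slow internet experience.")
```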
Q: Any recommendations for the security of data center information? The question relates to the importance of remote monitoring of mission-critical systems and whether this would be done through cloud-hosted applications.
A: If the data center has implemented remote monitoring and BMS, these systems should be used during a pandemic or other similar emergency events to monitor, measure and manage both IT equipment and supporting infrastructure such as power and cooling systems. There should be emphasis on all VPN connections to ensure they are tested and enable reliable access for remote data center monitoring.
SUPPLY CHAIN
Q: With suppliers working at reduced capacity, do you recommend purchasing stocks of spare parts for operating equipment, taking into account budget constraints and recommendations on how to prioritize spending?
A: The potential for long-term disruption to the supply chain for critical spares and consumables should be considered. If service level agreements include spare-parts supplies, communication should be established to ensure key equipment parts are available and/or to establish additional time for arrival in case of failure or emergency.
LONG-TERM EFFECTS
Q: Once this pandemic is overcome, it should accelerate the transfer of data center owners’ IT platforms to large data centers or to the cloud. What is your vision with regard to this issue?
A: There are several dynamics at play here, and for this reason, it is premature to give any definitive guidance until the situation clarifies. But some observations:
Many enterprises are likely to conclude that they want to reduce risk and complexity in the future, and they will not welcome the extra costs and processes associated with reducing the impact of future pandemics. For many, the obvious solution will be to move to the cloud or to colocation companies. But the former, while strategic for many, will be the more disruptive option, perhaps more expensive, and may make the risks less visible.
Our research already shows that the biggest single impact of the lockdown has been to delay data center and IT projects. This is likely to slow down any major cloud/colocation moves, as a backlog builds and new priorities come into play. Overall, there is likely to be a bias toward strengthening the status quo.
As we move out of the pandemic, many enterprises will have cost-reduction programs in place resulting from loss of business. Cloud has many advantages, but few large businesses have found it to be cheaper, especially where data centers are already depreciated. And almost all find that even where the costs are not higher, there are temporary transition costs.
Of course, every application, every service and every company is different. Although it is speculative, we think it likely that using colocation will prove a less-disruptive and more cost-effective path than full-on cloud transformation. While the long-term trend toward cloud will continue, there may be more pressure on cloud operators to take active steps to attract enterprise workloads.
Q: What do you see as the future focus, having now had a pandemic as a precedent?
A: In situations like this, data centers face particular challenges due to the unavailability of key personnel, who may be unable to fill their roles due to illness or quarantine. We recommend that organizations develop a specific pandemic preparedness plan, similar to plans for civic emergencies, that focuses on performance, efficiency and reliability and includes contingency plans that can be adapted to the challenges of the current pandemic or the potential for recurrent endemic events. Each organization’s response will vary based on individual site environments and local government/mandatory restrictions. Plans should consider situations in which staff may be unable to access or leave the site on short notice. Please refer to our report COVID-19: Minimizing critical facility risk, which addresses this topic in detail.
Q: What time projection does Uptime Institute have for the COVID-19 crisis?
A: Unfortunately, there is insufficient information at this point in time to answer questions regarding the duration of the COVID-19 crisis.
COVID-19: Q&A (Part 3): Deferred Maintenance, Remote Work, Supply Chain, Long-Term Outlook. By Rhonda Ascierto, Vice President, Research, Uptime Institute.