Here is Part 2 of our Q&A regarding digital infrastructure considerations during the COVID-19 crisis. Keep in mind that we are all handling this crisis in varied ways, and learning from each other along the way. In that process, we really ARE all finding our own “New Normal” and ultimately we will come out of this crisis stronger due to the renewed focus on operational effectiveness, risk avoidance and contingency planning.
As part of our infrastructure community leadership, we are regularly adding new materials, guidance and recommendations to help digital infrastructure owners and operators in these times.
We’ve conducted a number of webinars attended by thousands of listeners where we shared first-hand COVID-19 experience. Below is Part 2 of the Q&A responses brought up during these webinars. These questions deal with Site Sanitation and Security. (Part 1 dealt with Staffing, and future published info will focus on Deferred Maintenance, Tier Standard, Remote Working, Supply Chain and Long-Term expectations.)
SITE SANITIZATION
Q: If an average of 50 people per day enter the data center, how often is filter change recommended? What parameters do I use to make that change?
A: Under normal operating scenarios, filter changes are typically triggered by an increased pressure differential measured across the filter. As the filter clogs with debris and particulate, the pressure drop increases. For operations during COVID-19, there does not appear to be any reason to change the outlook on filter changes, including what should trigger a filter change, based on information currently available on how the virus spreads (although that information is changing over time).
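To make the pressure-differential trigger concrete, here is a minimal monitoring sketch. It is only illustrative: the pressure-drop limit and the way readings are obtained are assumptions for the example; actual end-of-life limits come from the filter manufacturer and your building management system.

```python
# Illustrative sketch of a differential-pressure filter-change check.
# The limit below is an assumed example value; real end-of-life limits
# come from the filter manufacturer and site monitoring data.

FILTER_DP_LIMIT_PA = 250.0  # assumed end-of-life pressure drop, in pascals

def filter_needs_change(dp_readings_pa: list[float], limit_pa: float = FILTER_DP_LIMIT_PA) -> bool:
    """Flag a filter change when the recent average pressure drop exceeds the limit."""
    if not dp_readings_pa:
        return False
    recent_avg = sum(dp_readings_pa) / len(dp_readings_pa)
    return recent_avg >= limit_pa

# Example: readings trending upward as the filter loads with particulate.
readings = [230.0, 245.0, 260.0, 275.0]
if filter_needs_change(readings):
    print("Schedule filter replacement: pressure drop above limit.")
```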
Q: What types of access controls or filters are recommended to implement? Are there air conditioning filters on the market that limit the circulation of viruses in the data center?
A: Currently, it is not believed that filters will play a large part in mitigating spread of the virus in data centers. Some research suggests that high-end filters (for example, high efficiency particulate air filters) are capable of filtering particles the size of COVID-19. However, based on current information on how the virus spreads, it is not generally believed that it spreads in a true airborne manner to the point where it gets in ventilation systems; it spreads via proximity to infected people sneezing, coughing, etc. Data centers are unique in that they have multiple air changes per minute, which is different from other facility types. It is important to note that while filtering could theoretically reduce spread, the National Air Filtration Association does not believe this will happen from a practical perspective.
Q: My concern is that, in a closed-loop environment, if COVID-19 enters the data center, the virus will live and could infect more people.
A: That is an accurate statement. That is also why it is important to take any reasonable steps possible to keep it out of the data center via strong site-access control requirements and checks. It is also why many data center operators are implementing regular disinfecting so that if it is in the data center, spread is mitigated. Uptime Institute is aware that many operators have found specialized data center cleaning companies that are capable of disinfecting sites in accordance with guidelines for this pandemic from the WHO (World Health Organization) and/or the CDC (US Centers for Disease Control and Prevention).
Q: How long can airborne virus particles last in a data center, given the cold air?
A: According to the WHO: “It is not certain how long the virus that causes COVID-19 survives on surfaces, but it seems to behave like other coronaviruses. Studies suggest that coronaviruses (including preliminary information on the COVID-19 virus) may persist on surfaces for a few hours or up to several days. This may vary under different conditions (e.g., type of surface, temperature or humidity of the environment).” We believe that this is an area that is still being studied. There does not appear to be any consensus, other than there are a number of factors, including temperature and humidity, that can impact this.
Q: Considering that the virus has an incubation period, would it be possible to bring contamination into the data center?
A: Based on the information presented by the WHO and CDC, it is likely that the virus can be introduced and that there can be contamination in the data center. The most likely vector for transmission is infected individuals who do not know they are infected. Please refer to the WHO, the CDC and/or your local authority for more information.
Q: Are there any types of clothing, masks or gloves recommended for customers or suppliers accessing the data center, so as not to expose our staff? Is it more feasible for staff to carry this type of PPE or to require it from customers or suppliers?
A: The CDC is recommending the N95 mask; other masks do not seal tightly enough around the nose and mouth to provide proper protection. To be fully effective, an N95 must be fitted properly. Specialists receive training annually on how to fit these respirators around the nose, cheeks and chin, ensuring that wearers don’t breathe around the edges of the respirator. When properly fitted, the thick filter material makes breathing noticeably harder; wearers have to work to breathe in and out. All personnel accessing the data center should be wearing PPE in accordance with the current policies related to the COVID-19 pandemic and future similar events.
It is important to note that we are seeing companies implement their pandemic plans, which to a large extent includes limiting site access to customers and employees. The pandemic plans vary between companies, but we are hearing of restrictions being implemented to include the use of masks, gloves, etc., primarily to follow CDC guidelines. Our suggestion is to follow CDC guidelines, as well as follow your approved pandemic plan. Please note that sanitizing methods may be a better thing to focus on than the use of masks. Also, while masks would provide an additional layer of protection, until production of masks ramps up, Uptime Institute is now somewhat cautious about recommending data center owners and operators stock up on masks. The “good” masks are in short supply and should be allocated to healthcare professionals until there is sufficient supply to go around.
Q: If we decide to sanitize [our data center], the fire system detectors can go off. What do you recommend?
A: It is common during various housekeeping operations to put the fire system into bypass. This is especially important if there is a VESDA (very early smoke detection apparatus) system present, which can be triggered by disturbances of even very small particulate (they are specifically designed to be highly sensitive). Our recommendation is to put the fire system in bypass while maintaining compliance with local jurisdictional requirements. This may require fire watch or similar measures be taken while the system is in bypass.
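To illustrate pairing a bypass window with compensating measures, a hypothetical sketch follows. The function and workflow are invented for illustration and do not correspond to any real fire-panel API; local jurisdictional requirements always govern.

```python
# Hypothetical sketch: pair a fire-detection bypass with a fire watch
# and guarantee the system is restored afterward. Not a real fire-panel API.
from contextlib import contextmanager

@contextmanager
def fire_system_bypass(zone: str, log=print):
    """Post a fire watch, place a detection zone in bypass, and always restore it."""
    log(f"Fire watch posted for zone {zone} (per local jurisdictional requirements).")
    log(f"Zone {zone} placed in bypass for housekeeping/sanitization.")
    try:
        yield
    finally:
        log(f"Zone {zone} restored to normal detection.")
        log(f"Fire watch stood down for zone {zone}.")

# Example usage during a disinfection window.
with fire_system_bypass("Data Hall 1"):
    print("Disinfection work under way...")
```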
SITE SECURITY
Q: Insecurity is likely to increase in different regions; do you recommend increasing security?
A: Certainly; facilities operating in severely affected areas could face an elevated level of security risk. In these areas, management must adopt enhanced security policies, including prescreening all scheduled visitors before arrival on-site; prohibiting all unscheduled visitors; and if possible and applicable, creating a separate, secure entrance for all parties involved in essential on-site construction projects and establishing a policy that they (or any other visitors) are not allowed to interact with duty operations personnel.
We are all in this together. We are all finding our own “New Normal”. We are all learning from each other and will come out of this crisis stronger due to the renewed focus on operational effectiveness, risk avoidance and contingency planning.
To this end, Uptime Institute has been creating many types of guidance and recommendations to help digital infrastructure owners and operators ride through the COVID-19 crisis. We want to help the entire community deal with their present and prepare for their future, where the New Normal will be a way of life.
We’ve created COVID-19 operational guideline reports based on our 25 years of data center risk management experience, published associated real-time updates and bulletins, and conducted a number of webinars attended by thousands of listeners where we shared first-hand COVID-19 experience. As a community, it’s clear we are all looking for each other’s support, guidance and experience.
As part of these webinars, we have published a series of digital infrastructure Q&A documents, focused on various categories of COVID-19-specific topics: Staff Management, Site Sanitation, Site Security, Deferred Maintenance, Tier Standard, Remote Working, Supply Chain and Long-Term expectations. Presented below is PART 1 of these Q&A topics, covering Staff Management during COVID-19.
STAFF MANAGEMENT
Q: What is your view on continuous, long shifts for four or seven days in the data center?
A: It is not best practice to move operations to continuous four- or seven-day shifts, as fatigue and stress increase the human risk factor that can cause abnormal incidents. Instead, we recommend assessing an extension of shift time from 8 hours to 12 hours, limited to a maximum of two or three consecutive days. Any extended continuous shifts should include long, regular breaks each shift to avoid fatigue. There needs to be a careful balance between the increased risk of human fatigue and the mitigated risk of virus spread. Managers should also ensure that the total hours worked per person do not increase, that overtime does not exceed 10%, and that shifts are arranged so that staff can rest adequately between them.
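The constraints above (12-hour maximum shifts, no more than two or three consecutive days, total hours held flat and overtime capped at 10%) can be expressed as a simple roster check. The sketch below assumes a 40-hour baseline week purely for the example; it is not Uptime Institute tooling.

```python
# Illustrative roster check against the shift guidance above.
# The 40-hour baseline week is an assumption made for this example.

def roster_ok(shift_hours: list[float], baseline_weekly_hours: float = 40.0,
              max_shift_hours: float = 12.0, max_consecutive_days: int = 3,
              overtime_cap: float = 0.10) -> bool:
    """Return True if a week's shifts respect length, consecutive-day and overtime limits."""
    consecutive = 0
    for hours in shift_hours:
        consecutive = consecutive + 1 if hours > 0 else 0
        if hours > max_shift_hours or consecutive > max_consecutive_days:
            return False
    return sum(shift_hours) <= baseline_weekly_hours * (1 + overtime_cap)

# Example: three 12-hour days, a rest day, then one 8-hour day (44 hours total).
print(roster_ok([12, 12, 12, 0, 8, 0, 0]))  # True: within the 10% overtime cap
```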
Q: What procedures should we follow to identify an infected staff member?
A: As described in our report COVID-19: Minimizing critical facility risk, we recommend contact tracing systems. Register the health information and location of your organization’s personnel, suppliers’ personnel and other related personnel every day to monitor possible exposure to the virus and/or any symptoms (including those of the common cold). We recommend prescreening all scheduled visitors before they arrive on-site, including sending a questionnaire via email 48 hours prior to their visit. Require completion of the questionnaire before the appointment is confirmed. Verify that all answers remain unchanged upon arrival and institute temperature checks using noncontact thermometers before entry to the facility.
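As a simple illustration of that prescreening sequence (questionnaire completed in advance, answers re-verified on arrival, noncontact temperature check at the door), a hypothetical sketch follows. The 38.0 °C fever threshold is an assumption for the example, not official guidance.

```python
# Hypothetical sketch of the visitor prescreening flow described above.
# The fever threshold is an assumed example value, not official guidance.
from dataclasses import dataclass

@dataclass
class Visitor:
    name: str
    questionnaire_complete: bool       # returned before the appointment was confirmed
    answers_unchanged_on_arrival: bool
    temperature_c: float               # noncontact reading at the entrance

def may_enter(v: Visitor, fever_threshold_c: float = 38.0) -> bool:
    """Admit only visitors who completed prescreening, re-confirmed their answers,
    and pass the temperature check before entering the facility."""
    return (v.questionnaire_complete
            and v.answers_unchanged_on_arrival
            and v.temperature_c < fever_threshold_c)

print(may_enter(Visitor("Scheduled vendor", True, True, 36.8)))     # True
print(may_enter(Visitor("Unscreened visitor", False, True, 36.8)))  # False
```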
For a confirmed COVID-19 case at the site, we recommend that cleaning personnel use bio-hazard suits, gloves, shoe coverings, etc. and that all personal protective equipment (PPE) is bagged and removed from the site once cleaning is complete. With or without a confirmed case at the site, ensure the availability of PPE, including masks, gloves and hazardous materials or hazmat suits. Depending on the appropriate medical or management advice, workers should use masks during shift turnover. Training pairs (e.g., senior engineer and trainee) must wear masks at all times.
Q: How feasible is it to move families to the data center?
A: Although housing staff on-site should be considered only as a last resort, regions could go into lockdown mid-shift, so you may need to prepare for that eventuality. There are disaster recovery plans that include providing accommodation for several family members for up to 2 weeks to avoid traveling to and from the data center. While the data center is perceived as a controlled-access space, it is not a safe space. Therefore, any organization considering this option should also consider offering a specialized training program for family members that includes awareness of the hazards and the associated risks, emergency evacuation procedures, etc.
Q: Is it always recommended to keep personnel 24×7 for Tier III and Tier IV data centers?
A: Yes. It is a required criterion of the Uptime Institute Tier Certification of Operational Sustainability to have a minimum of one qualified full-time staff member on-site per shift, 24 hours a day, 7 days a week, for Tier III data centers, and a minimum of two per shift, 24 hours a day, 7 days a week, for Tier IV data centers.
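Those per-shift minimums can be captured in a trivial lookup. The sketch below encodes only the two figures stated above and is, of course, no substitute for the Tier Certification requirements themselves.

```python
# Minimal sketch of the per-shift, 24x7 qualified-staff minimums mentioned above.
MIN_QUALIFIED_STAFF_PER_SHIFT = {"Tier III": 1, "Tier IV": 2}

def shift_staffing_ok(tier: str, qualified_staff_on_shift: int) -> bool:
    """Check a shift roster against the minimum on-site qualified-staff presence."""
    required = MIN_QUALIFIED_STAFF_PER_SHIFT.get(tier)
    if required is None:
        raise ValueError(f"No staffing minimum recorded for {tier}")
    return qualified_staff_on_shift >= required

print(shift_staffing_ok("Tier IV", 1))  # False: Tier IV requires at least two per shift
```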
Q: We must not forget the following considerations for staff that may need to stay at the data center for 24 hours or more: the need to prepare food; a supply of canned food for more than 40 days, as well as alkaline water and the ability to purify it by reverse osmosis in case of water contamination; and cardiopulmonary resuscitation equipment for emergencies.
A: Correct, all these initiatives are proactive and preventative. Uptime Institute’s COVID-19: Minimizing critical facility risk report provides additional information related to what measures data center management should consider for the health and safety of staff and the protection of the site.
Q: Do you recommend interviewing all internal staff to determine their personal situation, and whether this should be done by a psychologist, particularly if staff are in the data center for a long time?
A: Organizations should maintain open and continuous communication with staff, customers and relevant third parties on a daily basis or even twice daily. Briefings may be appropriate as the conditions change. We also recommend sharing news updates and links to public resources to keep staff informed of the current status of the pandemic and the best practices for maintaining a safe and healthy work environment. As appropriate for each case, emotional support should be provided to reduce stress. Special attention should be given to any changes to continuous, long shifts that could increase the risk of human error, which may cause abnormal incidents.
One of the findings of Uptime Institute’s recently published report Annual outage analysis 2020 is that the most serious categories of outages — those that cause a significant disruption in services — are becoming more severe and more costly. This isn’t entirely surprising: individuals and businesses alike are becoming ever more dependent on IT, and it is becoming harder to replicate or replace an IT service with a manual system.
But one of the findings raises both red flags and questions: Setting aside those that were partial and incidental (which had minimal impact), publicly reported outages over the past three years appear to be getting longer. And this, in turn, is likely to be one of the reasons why the costs and severity of outages have been rising.
The data Uptime Institute collected on publicly reported outages in the years 2017-2019 (excluding those that did not have a financial or customer impact and those for which there was no known cause) shows outages are on the rise. This is due to a number of factors, including greater deployment of IT services and better reporting. But the data also shows a tilt towards longer outages — especially those that lasted more than 48 hours. (This is true even though one of the biggest causes of lengthy outages — ransomware — was excluded from our sample.)
The outage times reported are until full IT service recovery, not business recovery — it may take longer, for example, to move aircraft back to where they are supposed to be, or to deal with backlogs in insurance claims. This trend is not dramatic, but it is real and it is concerning, because a 48-hour interruption can be lethal for many organizations.
Why is it happening? Complexity and interdependency of IT systems and greater dependency on software and data are very likely to be big reasons. For example, Uptime Institute’s research shows that fewer big outages are caused by power failure in the data center and more by IT systems configuration now than in the past. While resolving facility engineering issues may not be easy, it is usually a relatively predictable matter: failures are often binary, and very often recovery processes have been drilled into the operators and spare parts are kept at hand. Software, data integrity and damaged/interrupted cross-organizational business processes, however, can be much more difficult issues to resolve or sometimes even to diagnose — and these types of failure are becoming much more common (and yes, sometimes they are triggered by a power failure). Furthermore, because failures can be partial, files may become out of sync or even be corrupted.
There are lessons to be drawn. The biggest is that the resiliency regimes that facilities staff have lived by for three decades or more need to be extended and integrated into IT and DevOps and fully supported and invested in by management. Another is that while disaster recovery may be slowly disappearing as a type of commercial backup service, the principles of vigilance, recovery and failover – especially when under stress – are more important than ever.
The full report Annual outage analysis 2020 is available to members of the Uptime Institute Network; membership can be requested here.
Over the past few weeks, Uptime Institute held multiple customer roundtables to discuss the impact of the COVID-19 virus on data center operations and potential operational responses to its spread. We gathered our community’s insights and best practices, combined them with our own 25 years’ worth of infrastructure operational management knowledge, and we are now making this information freely available to the data center industry HERE.
A little background to get you started right away….
Dozens of organizations were represented at these roundtables, which were open to a global audience of Uptime Institute Network membership. What we learned is that while most organizations have a plan for foreseen emergency situations, few have one specific to a global pandemic. As a result, many have been hurriedly modifying existing plans based on gut feel and good intentions: creating tiered response levels, identifying events that would trigger the next level of response, and researching concerns specific to a pandemic (e.g., what does “deep cleaning” mean in a white space, and what are the implications for different data center environments — raised floors, multi-tenant data centers, mixed-use facilities, etc.?).
But this is clearly uncharted territory for ALL of us. For many organizations, the Human Resources and/or Environmental Health & Safety department(s) take the lead in generating an organization-wide response plan, and specific business units, such as data center operations, incorporate that guidance into a plan tailored to their business mission and setting. Because many organizations have data centers in multiple regions, responses may vary by location or facility characteristics. A sample but very broad Emergency Response Plan by the US government’s FDA (with portions pertaining to the delivery of IT services contained within) can be seen here.
But immediately actionable, tangible advice goes a long way in times like these. Several participants mentioned that their facilities now screen all potential visitors with a questionnaire. They do not admit anyone who reports symptoms (personally or in family members) or who has traveled recently to areas with high levels of COVID-19 cases. Some respondents advised that an additional security measure involves prescreening all scheduled visitors: Send the visitor the questionnaire via email 4-8 hours prior to their visit and require completion before the appointment is confirmed. Only permit entry if the questionnaire indicates a low probability of infection (confirm all answers remain unchanged upon arrival) and prohibit unscheduled visitors altogether.
Some facilities – for example, multi-tenant data centers or mixed-use facilities – have a higher volume of visitors, and thus greater potential for COVID-19 spread. To avoid inconvenience and potential client dissatisfaction, be proactive: Inform all affected parties of the COVID-19 preparedness plan in place and its impact on their access to the facility in advance.
Sanitization is a particular challenge, with several participants reporting disinfectant/hand sanitizer shortages. Many had questions specific to deep cleaning the white space environment, given its high rate/volume of air exchange, highly specialized electronic equipment and possible raised floor configuration. Spray techniques are more effective than simply wiping surfaces with disinfectant solutions, as the antiseptic mist coats surfaces for a longer period. Many organizations are hiring specialist cleaning firms and/or following CDC recommendations for disinfection.
As COVID-19 spreads, more organizations are shifting their energy from academically tweaking their written response plans to implementing them. In many companies, that decision is made by a business unit, based on site environment, number of COVID-19 cases in the area and government-mandated restrictions. Mission-critical facilities have a particular remit, though, so they need to create and implement plans specific to their business mission.
Good preparation simplifies decision-making. Roundtable participants suggest the following:
Categorizing essential versus nonessential tasks and calendaring them in advance (makes it easier to identify maintenance items you can postpone, and for how long; see the sketch after this list).
Cross-training personnel and maintaining up-to-date skill inventories/certifications (helps ensure core capabilities are always available).
Having contingency plans in place (means you’re prepared to manage supply chain disruption and staff shortages).
Stress-testing technologies and procedures in advance (gives you confidence that you can accommodate a move to remote work: shifting procedures that are usually performed manually to an automated process, monitoring remotely, interacting virtually with other team members, etc.).
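As referenced in the first item above, a minimal sketch of how essential versus deferrable tasks might be calendared is shown below. The tasks and deferral windows are invented examples, not recommendations.

```python
# Illustrative sketch: flag which calendared maintenance tasks can be postponed, and until when.
# The example tasks and deferral windows are invented for illustration only.
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class MaintenanceTask:
    name: str
    essential: bool
    due: date
    max_deferral_days: int  # 0 means the task cannot slip

def deferrable_until(task: MaintenanceTask) -> Optional[date]:
    """Return the latest acceptable date for a deferrable task, or None if it must run as planned."""
    if task.essential or task.max_deferral_days == 0:
        return None
    return task.due + timedelta(days=task.max_deferral_days)

tasks = [
    MaintenanceTask("Generator load bank test", essential=True, due=date(2020, 4, 15), max_deferral_days=0),
    MaintenanceTask("Non-critical lighting PM", essential=False, due=date(2020, 4, 20), max_deferral_days=90),
]
for task in tasks:
    print(f"{task.name}: {deferrable_until(task) or 'cannot be postponed'}")
```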
It’s no longer a question of if a plan like this will be needed; we know it is! Most facility operators need to quickly craft and then implement their response plan, and learn from this incident for the future.
Uptime Institute has created a number of resources and will be providing specific guidance regarding the COVID-19 situation here.
As enterprises continue to move from a focus on capital expenditures to operating expenditures, more data center components will also be consumed on a pay-as-you-go, “as a service” basis.
“-aaS” goes mainstream
The trend toward everything “as a service” (XaaS) is now mainstream in IT, ranging from cloud (infrastructure-aaS) and software-aaS (SaaS) to newer offerings, such as bare metal-aaS, container-aaS, and artificial intelligence-aaS (AI-aaS). At the IT level, service providers are winning over more clients to the service-based approach by reducing capital expenditures (capex) in favor of operational expenditures (opex), by offering better products, and by investing heavily to improve security and compliance. More organizations are now willing to trust them.
But this change is not confined to IT: a similar trend is underway in data centers.
Why buy and not build?
While the cost to build new data centers is generally falling, driven partly by the availability of more prefabricated components, enterprise operators have been increasingly competing against lower-cost options to host their IT — notably colocation, cloud and SaaS.
Cost is rarely the biggest motivation for moving to cloud, but it is a factor. Large cloud providers continue to build and operate data centers at scale and enjoy the proportional cost savings as well as the fruits of intense value engineering. They also spread costs among customers and tend to have much higher utilization rates compared with other data centers. And, of course, they invest in innovative, leading-edge IT tools that can be rolled out almost instantly. This all adds up to ever-improving IT and infrastructure services from cloud providers that are cheaper (and often better) than using or developing equivalent services based in a smaller-scale enterprise data center.
Many organizations have now come to view data center ownership as a big capital risk — one that only some want to take. Even when it’s cheaper to deliver IT from their own “on-premises” data center, the risks of data center early obsolescence, under-utilization, technical noncompliance or unexpected technological or local problems are all factors. And, of course, most businesses want to avoid a big capital outlay: Our research shows that, in 2017, the total cost of ownership of an “average” concurrently maintainable 3 megawatt (MW) enterprise data center amortized over 15 years was about $90 million, and that roughly half of the cost is invested in three installments over the first six years, assuming a typical phased build and bricks-and-mortar construction.
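To put those figures in perspective, a back-of-the-envelope breakdown using only the numbers already quoted (roughly $90 million over 15 years for a 3 MW concurrently maintainable facility, with about half invested in three installments over the first six years) looks like this:

```python
# Back-of-the-envelope view of the enterprise data center cost figures quoted above.
tco_usd = 90_000_000       # total cost of ownership over the amortization period
amortization_years = 15
capacity_mw = 3.0

annualized = tco_usd / amortization_years    # ~$6 million per year
per_mw_year = annualized / capacity_mw       # ~$2 million per MW per year
early_capital = tco_usd * 0.5                # roughly half invested in the first six years
per_installment = early_capital / 3          # spread over three installments

print(f"Annualized TCO:      ${annualized:,.0f}")
print(f"Per MW per year:     ${per_mw_year:,.0f}")
print(f"Early-phase capital: ${early_capital:,.0f} (three installments of ~${per_installment:,.0f})")
```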
This represents a significant amount of risk. To be economically viable, the enterprise must typically operate a facility at a high level of utilization — yet forecasting future data center capacity remains enterprises’ top challenge, according to our research.
Demand for enterprise data centers remains sizable, in spite of the alternatives. Many enterprises with smaller data centers are closing them and consolidating into premium, often larger, centralized data centers and outsourcing as much else as possible.
The appeal of the cloud will continue to convince executives and drive strategy. Increasingly, public cloud is an alternative way to deliver workloads faster and cheaper without having to build additional on-premises capacity. Scalability, portability, reduced risk, better tools, high levels of resiliency, infrastructure avoidance and fewer staff requirements are other key drivers for cloud adoption. Innovation and access to leading-edge IT will likely be bigger factors in the future, as will more cloud-first remits from upper management.
Colocation, including sale leasebacks
Although rarely thought of in this way, colocation is the most widely used “data center-aaS” offering today. Sale with leaseback of the data center by enterprise to colos is also becoming more common, a trend that will continue to build (see UII Note 38: Capital inflow boosts the data center market).
Colo interconnection services will attract even more businesses. More will likely seek to lease space in the same facility as their cloud or other third-party service provider, enabling lower latency, lower costs and more security for third-party services, such as storage-aaS and disaster recovery-aaS.
While more enterprise IT is moving to colos and managed services (whether or not it is cloud), enterprise data centers will not disappear. More than 600 IT and data center managers told Uptime Institute that, in 2021, about half of all workloads will still be in enterprise data centers, and only 18% of workloads in public cloud/SaaS.
Other “as a service” trends in data centers
Data center monitoring and analysis is another relatively new example of a pay-as-you-go service. Introduced in late 2016, data center management as a service is a big data-driven cloud service that provides customized analysis and is paid for on a recurring basis. The move to a pay-as-you-go service has helped unlock the data center infrastructure management market, which was struggling for growth because of costs and complexity.
Energy backup and generation is another area to watch. Suppliers have introduced various pay-as-you-go models for their equipment. These include leased fuel cells owned by the supplier (notably Bloom Energy), which charges customers only for the energy produced. By eliminating the client’s risk and capital outlay, it can make the supplier’s sale easier (although they have to wait to be paid). Some suppliers have ventured into UPS-aaS, but with limited success to date.
More alternatives to ownership are likely for data center electrical assets, such as batteries. Given the high and fast rate of innovation in the technology, leasing large-scale battery installations delivers the capacity and innovation benefits without the risks.
It’s also likely that more large data centers will use energy service companies (ESCOs) to produce, manage and deliver energy from renewable microgrids. Demand for green energy, for energy security (that is, energy produced off-grid) and energy-price stability is growing; ESCOs can deliver all this for dedicated customers that sign long-term energy-purchase agreements but don’t have the capital required to build or the expertise necessary to run a green microgrid.
Demand for enterprise data centers will continue but alongside the use of more cloud and more colo. More will be consumed “as a service,” ranging from data center monitoring to renewable energy from nearby dedicated microgrids.
The full report Ten data center industry trends in 2020 is available to members of the Uptime Institute Network. Membership information can be found here.
Despite years of discussion, warnings and strict regulations in some countries, data center hot work remains a contentious issue in the data center industry. Hot work is the practice of working on energized electrical circuits (voltage limits differ regionally) — and it is usually done, in spite of the risks, to reduce the possibility of a downtime incident during maintenance.
Uptime Institute advises against hot work in almost all instances. The safety concerns are just too great, and data suggests work on energized circuits may — at best — only reduce the number of manageable incidents, while increasing the risk of arc flash and other events that damage expensive equipment and may lead to an outage or injury. In addition, concurrently maintainable or fault tolerant designs as described in Uptime Institute’s Tier Standard make hot work unnecessary.
The pressure against hot work continues to mount. In the US, electrical contractors have begun to decline some work that involves working on energized circuits, even if an energized work permit has been created and signed by appropriate management, as required by National Fire Protection Association (NFPA) 70E (Standard for Electrical Safety in the Workplace). In addition, the US Department of Labor’s Occupational Safety and Health Administration (OSHA) has repeatedly rejected business continuity as an exception to hot work restrictions, making it harder for management to justify hot work and to find executives willing to sign the energized work permit.
OSHA statistics make clear that work on energized systems is a dangerous practice, especially for construction trades workers; installation, maintenance, and repair occupations; and grounds maintenance workers. For this reason, NFPA 70E sharply limits the situations in which organizations are allowed to work on energized equipment. Personnel safety is not the only issue; personal protective equipment (PPE) protects only workers, not equipment, so an arc flash can destroy many thousands of dollars of IT gear.
Ignoring local and national standards can be costly, too. OSHA reported 2,923 lockout/tagout and 1,528 PPE violations in 2017, among the many safety concerns it addressed that year. New minimum penalties for a single violation exceed $13,000, with top total fines for numerous, willful and repeated violations running into the millions of dollars. Wrongful death and injury suits add to the cost, and violations can lead to higher insurance premiums, too.
Participants in a recent Uptime Institute discussion roundtable agreed that the remaining firms performing work on live loads should begin preparing to end the practice. They said that senior management is often the biggest impediment to ending hot work, at least at some organizations, despite the well-known and documented risks. Executive resistance can be tied to concerns about power supplies or failure to maintain independent A/B feeds. In some cases, service level agreements contain restrictions against powering down equipment.
Despite executive resistance at some companies, the trend is clearly against hot work. By 2015, more than two-thirds of facilities operators had already eliminated the practice, according to Uptime Institute data. A tighter regulatory environment, heightened safety concerns, increased financial risk and improved equipment should combine to all but eliminate hot work in the near future. But there are still holdouts, and the practice is far more acceptable in some countries — China is an example — than in others, such as the US, where NFPA 70E severely limits the practice in all industries.
Also, hot work does not eliminate IT failure risk. Uptime Institute has been tracking data center abnormal incidents for more than 20 years, and the data shows that at least 71 failures occurred during hot work. While these failures are generally attributed to poor procedures or maintenance, a recent, more careful analysis concluded that better procedures or maintenance (or both) would have made it possible to perform the work safely — and without any failures — on de-energized systems.
The Uptime Institute abnormal incident database includes only four injury reports; all occurred during work on energized systems. In addition, the database includes 16 reports of arc flash. One occurred during normal preventive maintenance and one during an infrared scan. Neither caused injury, but the potential risk to personnel is apparent, as is the potential for equipment damage (and legal exposure).
Undoubtedly, eliminating hot work is a difficult process. One large retailer that has just begun the process expects the transition to take several years. And not all organizations succeed: Uptime Institute is aware of at least one organization in which incidents involving failed power supplies caused senior management to cancel their plan to disallow work on energized equipment.
According to several Uptime Institute Network community members, building a culture of safety is the most time-consuming part of the transition from hot work, as data centers are goal-oriented organizations, well-practiced at developing and following programs to identify and eliminate risk.
It is not necessary or even prudent to eliminate all hot work at once. The IT team can help slowly retire the practice by eliminating the most dangerous hot work first, building experience on less critical loads, or reducing the number of circuits affected at any one time. To prevent common failures when de-energizing servers, the Operations team can increase scrutiny on power supplies and ensure that dual-corded servers are properly fed.
In early data centers, the practice of hot work was understandable — necessary, even. However, Uptime Institute has long advocated against hot work. Modern equipment and higher resiliency architectures based on dual-corded servers make it possible to switch power feeds in the case of an electrical equipment failure. These advances not only improve data center availability, they also make it possible to isolate equipment for maintenance purposes.
COVID-19: Q&A (Part 2): Site Sanitation and Security
/in Executive, Operations/by Rhonda Ascierto, Vice President, Research, Uptime InstituteHere is Part 2 of our Q&A regarding digital infrastructure considerations during the COVID19 crisis. Keep in mind that we are all handling this crisis in varied ways, and learning from each other along the way. In that process, we really ARE all finding our own “New Normal” and ultimately we will come out of this crisis stronger due to the renewed focus on operational effectiveness, risk avoidance and contingency planning.
As part of our infrastructure community leadership, we are regularly adding new materials, guidance and recommendations to help digital infrastructure owners and operators in these times.
We’ve conducted a number of webinars attended by thousands of listeners where we shared first-hand COVID19 experience. Below is Part 2 of the Q&A responses brought up during these webinars. These questions deal with Site Sanitation and Security. (Part 1 dealt with Staffing, and future published info will focus on Deferred Maintenance, Tier Standard, Remote Working, Supply Chain and Long-Term expectations.
SITE SANITIZATION
Q: If an average of 50 people per day enter the data center, how often is filter change recommended? What parameters do I use to make that change?
A: Under normal operating scenarios, filter changes are typically triggered by an increased pressure differential measured across the filter. As the filter clogs with debris and particulate, pressure drop will increase. As far as operations during COVID-19, there does not appear to be any reasons for a change to the outlook on filter changes, including what should trigger a filter change, based on information currently available on how the virus spreads (although information seems to be changing over time.).
Q: What types of access controls or filters are recommended to implement? Are there air conditioning filters on the market that limit the circulation of viruses in the data center?
A: Currently, it is not believed that filters will play a large part in mitigating spread of the virus in data centers. Some research suggests that high-end filters (for example, high efficiency particulate air filters) are capable of filtering particles the size of COVID-19. However, based on current information on how the virus spreads, it is not generally believed that it spreads in a true airborne manner to the point where it gets in ventilation systems; it spreads via proximity to infected people sneezing, coughing, etc. Data centers are unique in that they have multiple air changes per minute, which is different from other facility types. It is important to note that while filtering could theoretically reduce spread, the National Air Filtration Association does not believe this will happen from a practical perspective.
Q: My concern is that, in a closed-loop environment, if COVID-19 enters the data center, the virus will live and could infect more people.
A: That is an accurate statement. That is also why it is important to take any reasonable steps possible to keep it out of the data center via strong site-access control requirements and checks. It is also why many data center operators are implementing regular disinfecting so that if it is in the data center, spread is mitigated. Uptime Institute is aware that many operators have found specialized data center cleaning companies that are capable of disinfecting sites in accordance with guidelines for this pandemic from the WHO (World Health Organization) and/or the CDC (US Centers for Disease Control and Prevention).
Q: How long can the airborne virus particle last in a data center because it is cold air?
A: According to the WHO: “It is not certain how long the virus that causes COVID-19 survives on surfaces, but it seems to behave like other coronaviruses. Studies suggest that coronaviruses (including preliminary information on the COVID-19 virus) may persist on surfaces for a few hours or up to several days. This may vary under different conditions (e.g., type of surface, temperature or humidity of the environment).” We believe that this is an area that is still being studied. There does not appear to be any consensus, other than there are a number of factors, including temperature and humidity, that can impact this.
Q: Taking into consideration that the virus lasts for an incubation period, would it be possible to bring contamination into the data center?
A: Based on the information presented by the WHO and CDC, it is likely that the virus can be introduced and that there can be contamination in the data center. The most likely vector for transmission is infected individuals who do not know they are infected. Please refer to the WHO, the CDC and/or your local authority for more information.
Q: Is there any type of clothing, masks or gloves that is recommended for access to the data center by customers or suppliers so as not to expose our staff? Is it more feasible for staff to carry this type of PPE or to demand it from the customers or suppliers?
A: The CDC is recommending the N95 mask. Other masks do not seal tightly around the nose and mouth to provide proper protection. To be fully effective it must be fitted properly. Specialists receive training annually on how to properly fit these respirators around the nose, cheeks and chin, ensuring that wearers don’t breathe around the edges of the respirator. When you do that, it turns out that the work of breathing, since you’re going through a very thick material, is harder. You have to work to breathe in and out. All personnel accessing the data center should be wearing PPE in accordance with the current policies related to the COVID-19 pandemic and future similar events.
It is important to note that we are seeing companies implement their pandemic plans, which to a large extent includes limiting site access to customers and employees. The pandemic plans vary between companies, but we are hearing of restrictions being implemented to include the use of masks, gloves, etc., primarily to follow CDC guidelines. Our suggestion is to follow CDC guidelines, as well as follow your approved pandemic plan. Please note that sanitizing methods may be a better thing to focus on than the use of masks. Also, while masks would provide an additional layer of protection, until production of masks ramps up, Uptime Institute is now somewhat cautious about recommending data center owners and operators stock up on masks. The “good” masks are in short supply and should be allocated to healthcare professionals until there is sufficient supply to go around.
Q: If we decide to sanitize [our data center], the fire system detectors can go off. What do you recommend?
A: It is common during various housekeeping operations to put the fire system into bypass. This is especially important if there is a VESDA (very early smoke detection apparatus) system present, which can be triggered by disturbances of even very small particulate (they are specifically designed to be highly sensitive). Our recommendation is to put the fire system in bypass while maintaining compliance with local jurisdictional requirements. This may require fire watch or similar measures be taken while the system is in bypass.
SITE SECURITY
Q: Insecurity is likely to increase in different regions, do you recommend increasing security?
A: Certainly, in facilities operating in severely affected areas the level of security risk could be affected. In these areas, management must adopt enhanced security policies, including prescreening all scheduled visitors before arrival on-site; prohibiting all unscheduled visitors; and if possible and applicable, creating a separate, secure entrance for all parties involved in essential on-site construction projects and establishing a policy that they (or any other visitors) are not allowed to interact with duty operations personnel.
COVID-19: Q&A (Part 1): Staff Management
/in Executive, Operations/by Rhonda Ascierto, Vice President, Research, Uptime InstituteWe are all in this together. We are all finding our own “New Normal”. We are all learning from each other and will come out of this crisis stronger due to the renewed focus on operational effectiveness, risk avoidance and contingency planning.
To this end, Uptime Institute has been creating many types of guidance and recommendations to help digital infrastructure owners and operators ride through the COVID19 crisis. We want to help the entire community deal with their present and prepare for their future, where the New Normal will be a way of life.
We’ve created COVID19 operational guideline reports based on our 25 years of data center risk management experience, published associated real-time updates and bulletins, and conducted a number of webinars attended by thousands of listeners where we shared first-hand COVID19 experience. As a community, it’s clear we are all looking for each other’s support, guidance and experience.
As part of these webinars, we have published a series of digital infrastructure Q&A documents, focused on various categories of COVID19 specific topics: Staff Management, Site Sanitation, Site Security, Deferred Maintenance, Tier Standard, Remote Working, Supply Chain and Long-Term expectations. Presented below is PART 1 of these Q&A topics, covering Staff Management during COVID19.
STAFF MANAGEMENT
Q: What is your view on continuous, long shifts for four or seven days in the data center?
A: It is not best practice for operations to change to continuous shifts from four days to seven days, as fatigue and stress will increase the human risk factor that can cause abnormal incidents. Instead, we recommend assessing extending the shift time from 8 hours to 12 hours and limiting this to a maximum of two or three consecutive days. Any extended continuous shifts should include long regular breaks each shift to avoid fatigue. There needs to be a careful balance between the increased risk of human fatigue and the mitigated risk of virus spread. Managers should also consider that the total hours worked per person does not increase, that overtime will not be over 10%, and that the shifts are arranged so that staff can rest adequately between shifts.
Q: What procedures should we follow to identify an infected staff member?
A: As described in our report COVID-19: Minimizing critical facility risk, we recommend contact tracing systems. Register the health information and location of your organization’s personnel, suppliers’ personnel and other related personnel every day to monitor possible exposure to the virus and/or any symptoms (including those of the common cold). We recommend prescreening all scheduled visitors before they arrive on-site, including sending a questionnaire via email 48 hours prior to their visit. Require completion of the questionnaire before the appointment is confirmed. Verify that all answers remain unchanged upon arrival and institute temperature checks using noncontact thermometers before entry to the facility.
For a confirmed COVID-19 case at the site, we recommend that cleaning personnel use bio-hazard suits, gloves, shoe coverings, etc. and that all personal protective equipment (PPE) is bagged and removed from the site once cleaning is complete. With or without a confirmed case at the side, ensure the availability of PPE, including masks, gloves and hazardous materials or hazmat suits. Depending on the appropriate medical or management advice, workers should use masks during shift turnover. Training pairs (e.g., senior engineer and trainee) must wear masks at all times.
For further information, please refer to our report COVID-19: Minimizing critical facility risk.
Q: How feasible is it to move families to the data center?
A: Although housing staff on-site should be considered only as a last resort, regions could go into lockdown mid-shift, so you may need to prepare for that eventuality. There are disaster recovery plans that include providing accommodation for several family members for up to 2 weeks to avoid traveling to and from the data center. While the data center is perceived as a controlled-access space, it is not a safe space. Therefore, any organization considering this option should also consider offering a specialized training program for family members that includes awareness of the hazards and the associated risks, emergency evacuation procedures, etc.
Q: Is it always recommended to keep personnel 24×7 for Tier III and Tier IV data centers?
A: Yes, it is a required criteria of the Uptime Institute Tier Certification of Operational Sustainability to have a minimum of one 24-hour, 7-day-a-week qualified staff presence (full-time employee) for Tier III data centers per shift and a minimum of two 24-hr/7-day-a-week staff presence (full-time employee) per shift for Tier IV data centers.
Q: We must not forget the following considerations for staff that may need to stay at the data center for 24 hours or more: the need to prepare food; a supply of canned food for more than 40 days, as well as alkaline water and the ability to purify it by reverse osmosis in case of water contamination; and cardiopulmonary resuscitation equipment for emergencies.
A: Correct, all these initiatives are proactive and preventative. Uptime Institute’s COVID-19: Minimizing critical facility risk report provides additional information related to what measures data center management should consider for the health and safety of staff and the protection of the site.
Q: Do you recommend interviewing all internal staff to determine their personal situation, and whether this should be done by a psychologist, particularly if staff are in the data center for a long time?
A: Organizations should maintain open and continuous communication with staff, customers and relevant third parties on a daily basis or even twice daily. Briefings may be appropriate as the conditions change. We also recommend sharing news updates and links to public resources to keep staff informed of the current status of the pandemic and the best practices for maintaining a safe and healthy work environment. As appropriate for each case, emotional support should be provided to reduce stress. Special attention should be given to any changes to continuous, long shifts that could increase the risk of human error, which may cause abnormal incidents.
Are IT Infrastructure Outages Getting Longer?
/in Executive, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected]One of the findings of Uptime Institute’s recently published report Annual outage analysis 2020 is that the most serious categories of outages — those that cause a significant disruption in services — are becoming more severe and more costly. This isn’t entirely surprising: individuals and businesses alike are becoming ever more dependent on IT, and it is becoming harder to replicate or replace an IT service with a manual system.
But one of the findings raises both red flags and questions: Setting aside those that were partial and incidental (which had minimal impact), publicly reported outages over the past three years appear to be getting longer. And this, in turn, is likely to be one of the reasons why the costs and severity of outages have been rising.
The table above shows the numbers of publicly reported outages collected by Uptime Institute in the years 2017-2019, except for those that did not have a financial or customer impact or those for which there was no known cause. The figures show outages are on the rise. This is due to a number of factors, including greater deployment of IT services and better reporting. But they also show a tilt in the data towards longer outages — especially those that lasted more than 48 hours. (This is true even though one of the biggest causes of lengthy outages — ransomware — was excluded from our sample.)
The outage times reported are until full IT service recovery, not business recovery — it may take longer, for example, to move aircraft back to where they are supposed to be, or to deal with backlogs in insurance claims. This trend is not dramatic, but it is real and it is concerning, because a 48-hour interruption can be lethal for many organizations.
Why is it happening? Complexity and interdependency of IT systems and greater dependency on software and data are very likely to be big reasons. For example, Uptime’s Institute’s research shows that fewer big outages are caused by power failure in the data center and more by IT systems configuration now than in the past. While resolving facility engineering issues may not be easy, it is usually a relatively predictable matter: failures are often binary, and very often recovery processes have been drilled into the operators and spare parts are kept at hand. Software, data integrity and damaged/interrupted cross-organizational business processes, however, can be much more difficult issues to resolve or sometimes even to diagnose — and these types of failure are becoming much more common (and yes, sometimes they are triggered by a power failure). Furthermore, because failures can be partial, files may become out of sync or even be corrupted.
There are lessons to be drawn. The biggest is that the resiliency regimes that facilities staff have lived by for three decades or more need to be extended and integrated into IT and DevOps and fully supported and invested in by management. Another is that while disaster recovery may be slowly disappearing as a type of commercial backup service, the principles of vigilance, recovery, and fail over – especially when under stress – are more important than ever.
The full report Annual outage analysis 2020 is available to members of the Uptime Institute Network which can be requested here.
COVID-19: IT organizations move from planning to implementation
/in Executive, News, Operations/by Sandra VailOver the past few weeks, Uptime Institute held multiple customer roundtables to discuss the impact of the COVID-19 virus on data center operations and potential operational responses to its spread. We gathered our communities insights and best practices, we combined with our own 25 years worth of infrastructure operational management knowledge and we are now making this information available freely to the data center industry. HERE.
A little background to get you started right away….
Dozens of organizations were represented at these roundtables, which was open to a global audience of Uptime Institute Network membership. What we learned is that while most organizations have a plan for foreseen emergency situations, few have one specific to a global pandemic. As a result, many have been hurrily modifying existing plans based on gut feel and good intentions: creating tiered response levels, identifying events that would trigger the next level of response, researching concerns specific to a pandemic (e.g., what does “deep cleaning” mean in a white space, and what are the implications for different data center environments — raised floors, multi-tenant data centers, mixed-use facilities, etc.?).
But this is clearly unchartered territory for ALL of us. For many organizations, the Human Resources and/or Environmental Health & Safety department(s) take the lead in generating an organization-wide response plan, and specific business units, such as data center operations, incorporate that guidance into a plan tailored to their business mission and setting. Because many organizations have data centers in multiple regions, responses may vary by location or facility characteristics. A sample but very broad Emergency Response Plan by the US government’s FDA (with portions pertaining to the delivery of IT services contained within) can be seen here.
But immediately actionable tangible advice goes a long way in times like these. Several participants mentioned that their facilities now screen all potential visitors with a questionnaire. They do not admit anyone who reports symptoms (personally or in family members) or who has traveled recently to areas with high levels of COVID-19 cases. Some repondants advised that nn additional measure of their security involves prescreening all scheduled visitors: Send the visitor the questionnaire via email 4-8 hours prior to their visit and require completion before the appointment is confirmed. Only permit entry if the questionnaire indicates a low probability of infection (confirm all answers remain unchanged upon arrival) and prohibit unscheduled visitors altogether.
Some facilities – for example, multi-tenant data centers or mixed-use facilities – have a higher volume of visitors, and thus greater potential for COVID-19 spread. To avoid inconvenience and potential client dissatisfaction, be proactive: Inform all affected parties of the COVID-19 preparedness plan in place and its impact on their access to the facility in advance.
Sanitization is a particular challenge, with several participants reporting disinfectant/hand sanitizer shortages. Many had questions specific to deep cleaning the white space environment, given its high rate/volume of air exchange, highly specialized electronic equipment and possible raised floor configuration. Spray techniques are more effective than simply wiping surfaces with disinfectant solutions, as the antiseptic mist coats surfaces for a longer period. Many organizations are hiring specialist cleaning firms and/or following CDC recommendations for disinfection.
As COVID-19 spreads, more organizations are shifting their energy from academically tweaking their written response plans to implementing them. In many companies, that decision is made by a business unit, based on the site environment, the number of COVID-19 cases in the area and government-mandated restrictions. Mission-critical facilities have a particular remit, though, and so need to create and implement plans specific to their business mission.
Good preparation simplifies decision-making. Roundtable participants suggest the following:
It's no longer a question of whether a plan like this will be needed; we know it is. Most facility operators need to quickly craft and then implement their response plan, and learn from this incident for the future.
Uptime Institute has created a number of resources and will be providing specific guidance regarding the COVID-19 situation here.
Pay-as-you-go model spreads to critical components
By Rhonda Ascierto, Vice President, Research, Uptime Institute

As enterprises continue to move from a focus on capital expenditures to operating expenditures, more data center components will also be consumed on a pay-as-you-go, "as a service" basis.
“-aaS” goes mainstream
The trend toward everything “as a service” (XaaS) is now mainstream in IT, ranging from cloud (infrastructure-aaS) and software-aaS (SaaS) to newer offerings, such as bare metal-aaS, container-aaS, and artificial intelligence-aaS (AI-aaS). At the IT level, service providers are winning over more clients to the service-based approach by reducing capital expenditures (capex) in favor of operational expenditures (opex), by offering better products, and by investing heavily to improve security and compliance. More organizations are now willing to trust them.
But this change is not confined to IT: a similar trend is underway in data centers.
Why buy and not build?
While the cost to build new data centers is generally falling, driven partly by the availability of more prefabricated components, enterprise operators have been increasingly competing against lower-cost options to host their IT — notably colocation, cloud and SaaS.
Cost is rarely the biggest motivation for moving to cloud, but it is a factor. Large cloud providers continue to build and operate data centers at scale and enjoy the proportional cost savings as well as the fruits of intense value engineering. They also spread costs among customers and tend to have much higher utilization rates compared with other data centers. And, of course, they invest in innovative, leading-edge IT tools that can be rolled out almost instantly. This all adds up to ever-improving IT and infrastructure services from cloud providers that are cheaper (and often better) than using or developing equivalent services based in a smaller-scale enterprise data center.
Many organizations have now come to view data center ownership as a big capital risk — one that only some want to take. Even when it's cheaper to deliver IT from their own "on-premises" data center, the risks of early data center obsolescence, under-utilization, technical noncompliance or unexpected technological or local problems are all factors. And, of course, most businesses want to avoid a big capital outlay: Our research shows that, in 2017, the total cost of ownership of an "average" concurrently maintainable 3 megawatt (MW) enterprise data center amortized over 15 years was about $90 million, and that roughly half of the cost is invested in three installments over the first six years, assuming a typical phased build and bricks-and-mortar construction.
This represents a significant amount of risk. To be economically viable, the enterprise must typically operate a facility at a high level of utilization — yet forecasting future data center capacity remains enterprises’ top challenge, according to our research.
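The sketch below works through the arithmetic using the figures quoted above (roughly $90 million over 15 years for a 3 MW site) to show why utilization is the lever that matters. The utilization scenarios are illustrative assumptions, not survey data.

```python
# Back-of-the-envelope sketch: how utilization drives the effective cost of an
# owned facility, using the approximate figures cited above.
TOTAL_TCO_USD = 90_000_000      # ~15-year total cost of ownership
LIFETIME_YEARS = 15
DESIGN_CAPACITY_KW = 3_000      # 3 MW concurrently maintainable facility

annual_cost = TOTAL_TCO_USD / LIFETIME_YEARS            # ~$6.0M per year
cost_per_design_kw = annual_cost / DESIGN_CAPACITY_KW   # ~$2,000 per kW-year

for utilization in (0.9, 0.6, 0.3):   # illustrative utilization scenarios
    used_kw = DESIGN_CAPACITY_KW * utilization
    print(f"{utilization:.0%} utilized: "
          f"${annual_cost / used_kw:,.0f} per deployed kW per year")

# At 30% utilization the same facility costs roughly three times as much per
# deployed kW as it does at 90%, which is why capacity forecasting matters.
```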
Demand for enterprise data centers remains sizable, in spite of the alternatives. Many enterprises with smaller data centers are closing them and consolidating into premium, often larger, centralized data centers and outsourcing as much else as possible.
The appeal of the cloud will continue to convince executives and drive strategy. Increasingly, public cloud is an alternative way to deliver workloads faster and cheaper without having to build additional on-premise capacity. Scalability, portability, reduced risk, better tools, high levels of resiliency, infrastructure avoidance and fewer staff requirements are other key drivers for cloud adoption. Innovation and access to leading-edge IT will likely be bigger factors in the future, as will more cloud-first remits from upper management.
Colocation, including sale leasebacks
Although rarely thought of in this way, colocation is the most widely used "data center-aaS" offering today. Sale of enterprise data centers to colos, with leaseback to the enterprise, is also becoming more common, a trend that will continue to build (see UII Note 38: Capital inflow boosts the data center market).
Colo interconnection services will attract even more businesses. More will likely seek to lease space in the same facility as their cloud or other third-party service provider, enabling lower latency, lower costs and better security for third-party services, such as storage-aaS and disaster recovery-aaS.
While more enterprise IT is moving to colos and managed services (whether or not it is cloud), enterprise data centers will not disappear. More than 600 IT and data center managers told Uptime Institute that, in 2021, about half of all workloads will still be in enterprise data centers, and only 18% of workloads in public cloud/SaaS.
Other “as a service” trends in data centers
Data center monitoring and analysis is another relatively new example of a pay-as-you-go service. Introduced in late 2016, data center management as a service is a big data-driven cloud service that provides customized analysis and is paid for on a recurring basis. The move to a pay-as-you-go service has helped unlock the data center infrastructure management market, which was struggling for growth because of costs and complexity.
Energy backup and generation is another area to watch. Suppliers have introduced various pay-as-you-go models for their equipment. These include leased fuel cells owned by the supplier (notably Bloom Energy), which charges customers only for the energy produced. By eliminating the client's risk and capital outlay, this model can make the supplier's sale easier (although the supplier has to wait to be paid). Some suppliers have ventured into UPS-aaS, but with limited success to date.
More alternatives to ownership are likely for data center electrical assets, such as batteries. Given the high and fast rate of innovation in the technology, leasing large-scale battery installations delivers the capacity and innovation benefits without the risks.
It’s also likely that more large data centers will use energy service companies (ESCOs) to produce, manage and deliver energy from renewable microgrids. Demand for green energy, for energy security (that is, energy produced off-grid) and energy-price stability is growing; ESCOs can deliver all this for dedicated customers that sign long-term energy-purchase agreements but don’t have the capital required to build or the expertise necessary to run a green microgrid.
Demand for enterprise data centers will continue but alongside the use of more cloud and more colo. More will be consumed “as a service,” ranging from data center monitoring to renewable energy from nearby dedicated microgrids.
The full report Ten data center industry trends in 2020 is available to members of the Uptime Institute Network. Membership information can be found here.
Phasing Out Data Center Hot Work
By Kevin Heslin

Despite years of discussion, warnings and strict regulations in some countries, hot work remains a contentious issue in the data center industry. Hot work is the practice of working on energized electrical circuits (voltage limits differ regionally), and it is usually done, in spite of the risks, to reduce the possibility of a downtime incident during maintenance.
Uptime Institute advises against hot work in almost all instances. The safety concerns are just too great, and data suggests work on energized circuits may — at best — only reduce the number of manageable incidents, while increasing the risk of arc flash and other events that damage expensive equipment and may lead to an outage or injury. In addition, concurrently maintainable or fault tolerant designs as described in Uptime Institute’s Tier Standard make hot work unnecessary.
The pressure against hot work continues to mount. In the US, electrical contractors have begun to decline some work that involves working on energized circuits, even if an energized work permit has been created and signed by appropriate management, as required by National Fire Protection Association (NFPA) 70E (Standard for Electrical Safety in the Workplace). In addition, the US Department of Labor's Occupational Safety and Health Administration (OSHA) has repeatedly rejected business continuity as an exception to hot work restrictions, making it harder for management to justify hot work and to find executives willing to sign the energized work permit.
OSHA statistics make clear that work on energized systems is a dangerous practice, especially for construction trades workers; installation, maintenance, and repair occupations; and grounds maintenance workers. For this reason, NFPA 70E sharply limits the situations in which organizations are allowed to work on energized equipment. Personnel safety is not the only issue; personal protective equipment (PPE) protects only workers, not equipment, so an arc flash can destroy many thousands of dollars of IT gear.
Ignoring local and national standards can be costly, too. OSHA reported 2,923 lockout/tagout and 1,528 PPE violations in 2017, among the many safety concerns it addressed that year. New minimum penalties for a single violation exceed $13,000, with top total fines for numerous, willful and repeated violations running into the millions of dollars. Wrongful death and injury suits add to the cost, and violations can lead to higher insurance premiums, too.
Participants in a recent Uptime Institute discussion roundtable agreed that the remaining firms performing work on live loads should begin preparing to end the practice. They said that senior management is often the biggest impediment to ending hot work, at least at some organizations, despite the well-known and documented risks. Executive resistance can be tied to concerns about power supplies or failure to maintain independent A/B feeds. In some cases, service level agreements contain restrictions against powering down equipment.
Despite executive resistance at some companies, the trend is clearly against hot work. By 2015, more than two-thirds of facilities operators had already eliminated the practice, according to Uptime Institute data. A tighter regulatory environment, heightened safety concerns, increased financial risk and improved equipment should combine to all but eliminate hot work in the near future. But there are still holdouts, and the practice is far more acceptable in some countries — China is an example — than in others, such as the US, where NFPA 70E severely limits the practice in all industries.
Also, hot work does not eliminate IT failure risk. Uptime Institute has been tracking abnormal data center incidents for more than 20 years, and the data shows that at least 71 failures occurred during hot work. While these failures are generally attributed to poor procedures or maintenance, a recent, more careful analysis concluded that better procedures or maintenance (or both) would have made it possible to perform the work safely, and without any failures, on de-energized systems.
The Uptime Institute abnormal incident database includes only four injury reports; all occurred during work on energized systems. In addition, the database includes 16 reports of arc flash. One occurred during normal preventive maintenance and one during an infrared scan. Neither caused injury, but the potential risk to personnel is apparent, as is the potential for equipment damage (and legal exposure).
Undoubtedly, eliminating hot work is a difficult process. One large retailer that has just begun the process expects the transition to take several years. And not all organizations succeed: Uptime Institute is aware of at least one organization in which incidents involving failed power supplies caused senior management to cancel their plan to disallow work on energized equipment.
According to several Uptime Institute Network community members, building a culture of safety is the most time-consuming part of the transition from hot work, as data centers are goal-oriented organizations, well-practiced at developing and following programs to identify and eliminate risk.
It is not necessary or even prudent to eliminate all hot work at once. The IT team can help slowly retire the practice by eliminating the most dangerous hot work first, building experience on less critical loads, or reducing the number of circuits affected at any one time. To prevent common failures when de-energizing servers, the Operations team can increase scrutiny on power supplies and ensure that dual-corded servers are properly fed.
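As one way of picturing that pre-work scrutiny, the sketch below checks a simple inventory to confirm that every server on an affected circuit is dual-corded and drawing from the other feed before one feed is de-energized. The inventory structure and field names are hypothetical, purely for illustration, and are not an Uptime Institute procedure.

```python
# Minimal sketch of a pre-work check before de-energizing one power feed:
# confirm every server on the affected circuit can ride through on the other feed.
from typing import NamedTuple

class Server(NamedTuple):
    name: str
    dual_corded: bool
    feed_a_ok: bool   # power supply on the A feed present and healthy
    feed_b_ok: bool   # power supply on the B feed present and healthy

def at_risk_if_deenergized(servers: list[Server], feed: str) -> list[str]:
    """Return names of servers that would lose power if 'feed' ('A' or 'B') is
    de-energized; an empty list means the work can proceed on dead circuits."""
    at_risk = []
    for s in servers:
        other_feed_ok = s.feed_b_ok if feed == "A" else s.feed_a_ok
        if not (s.dual_corded and other_feed_ok):
            at_risk.append(s.name)
    return at_risk

inventory = [
    Server("db-01", dual_corded=True, feed_a_ok=True, feed_b_ok=True),
    Server("app-07", dual_corded=True, feed_a_ok=True, feed_b_ok=False),   # B supply failed
    Server("legacy-02", dual_corded=False, feed_a_ok=True, feed_b_ok=False),
]
print(at_risk_if_deenergized(inventory, feed="A"))  # ['app-07', 'legacy-02']
```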
In early data centers, the practice of hot work was understandable — necessary, even. However, Uptime Institute has long advocated against hot work. Modern equipment and higher resiliency architectures based on dual-corded servers make it possible to switch power feeds in the case of an electrical equipment failure. These advances not only improve data center availability, they also make it possible to isolate equipment for maintenance purposes.