The average power usage effectiveness (PUE) ratio for a data center in 2020 is 1.58, only marginally better than 7 years ago, according to the latest annual Uptime Institute survey (findings to be published shortly).
PUE, an international standard first developed by The Green Grid and others in 2007, is the most widely accepted way of measuring the energy efficiency of a data center. It is the ratio of the energy used by the entire data center to the energy used by the IT equipment.
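As a rough illustration of the metric, the sketch below computes PUE as total facility energy divided by IT equipment energy. The function name and the annual energy figures are hypothetical, chosen only so that the result matches the 1.58 survey average quoted above.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Return PUE: total facility energy divided by IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    if total_facility_kwh < it_equipment_kwh:
        raise ValueError("total facility energy cannot be less than IT energy")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical annual figures for one site (illustrative only).
it_kwh = 10_000_000      # energy delivered to IT equipment
total_kwh = 15_800_000   # IT plus cooling, power distribution losses, lighting, etc.

site_pue = pue(total_kwh, it_kwh)

# Share of the site's total energy that does not reach the IT equipment.
overhead_share = 1 - 1 / site_pue
```

With these assumed figures, a PUE of 1.58 means a little over a third of the site's energy goes to overhead rather than to IT. A PUE of exactly 1.0 would mean no overhead at all.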
All operators strive to get their PUE ratio down to as near 1.0 as possible. Using the latest technology and practices, most new builds fall between 1.2 and 1.4. But there are still thousands of older data centers that cannot be economically or safely upgraded to become that efficient, especially if high availability is required. In 2019, the PUE value increased slightly (see previous article), with a number of possible explanations.
The new data (shown in the figure below) conforms to a consistent pattern: Big improvements in energy efficiency were made from 2007 to 2013, mostly using inexpensive or easy methods such as simple air containment, after which improvements became more difficult or expensive. The Uptime Institute figures are based on surveys of global data centers ranging in size from 1 megawatt (MW) to over 60 MW, of varying ages.
As ever, the data does not tell a complete story. This data is based on the average PUE per site, regardless of size or age. Newer data centers, usually built by hyperscale or colocation companies, tend to be much more efficient, and larger. A growing amount of work is therefore done in larger, more efficient data centers (Uptime Institute data in 2019 shows data centers above 20 MW to have lower PUEs). Data released by Google shows almost exactly the same curve shape — but at much lower values.
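The gap between a per-site average and the efficiency of the work actually being done can be sketched with a small calculation. The three sites below are invented for illustration, and the sketch assumes each site runs at a constant load, so that ratios of power equal ratios of energy.

```python
# Hypothetical sites: (name, IT load in MW, PUE). Values are illustrative only.
sites = [
    ("legacy-a", 1.0, 2.0),
    ("legacy-b", 2.0, 1.8),
    ("hyperscale", 30.0, 1.2),
]

# Simple per-site mean: every site counts equally, regardless of size.
site_average = sum(p for _, _, p in sites) / len(sites)

# IT-load-weighted figure: total facility power divided by total IT power.
total_it = sum(it for _, it, _ in sites)
total_facility = sum(it * p for _, it, p in sites)
weighted = total_facility / total_it
```

In this made-up fleet the per-site average is about 1.67, while the load-weighted figure is about 1.26, because most of the IT load sits in the one large, efficient site.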
Operators who cannot improve their site PUE can still do a lot to reduce energy and/or decarbonize operations. First, they can improve the utilization of their IT and refresh their servers to ensure IT energy optimization. Second, they can re-use the heat generated by the data center; and third, they can buy renewable energy or invest in renewable energy generation.
More information on this topic is available to members of the Uptime Institute Network here.
The final installment of our Q&A regarding digital infrastructure considerations during the COVID19 crisis includes Deferred Maintenance, Remote Work, Supply Chain, Tier Standard and Long-Term Outlook.
Below is Part 3 of the Q&A responses brought up during our recent series of webinars about managing operational risk during the COVID19 crisis. (Part 1 dealt with Staffing and Part 2 focused on Site Sanitation.)
Q: One of our clients does not want to do maintenance to avoid entering the data center. What do you recommend for this? Is it necessary to defer preventative maintenance on data center components?
A: Maintenance activities should be prioritized. At the very least, try to perform the most critical activities. If unable to do this, try to rotate hours as much as possible between redundant components and also contact the manufacturers of components/equipment to better identify the impacts of not performing maintenance on specific equipment. Deferred maintenance brings higher risk; in some equipment this risk is more serious than in others, so maintenance activities should be prioritized in order of criticality. For more information, please see our report COVID-19: Minimizing critical facility risk.
Q: Will the learning from this pandemic be reflected in adjustments in the certification levels of each of the Tiers of the Uptime Institute? Will the Uptime Institute’s standard for operations be updated due to COVID-19, to incorporate lessons learned from this situation? How relevant will DCIM (data center infrastructure management) systems become from this pandemic? Will there be an emphasis on DCIM at the Uptime Institute certification levels?
A: Yes, Uptime Institute is currently evaluating potential adjustments in the criteria of the Uptime Institute Tier Certification of Operational Sustainability to take into consideration the pandemic and potential endemic issues that can affect the normal operation and sustainability of the data center. If the data center has implemented a DCIM system and BMS (building management system), during a pandemic or other similar emergency events these systems should be used to continually monitor, measure and manage both IT and supporting infrastructure equipment such as power and cooling systems. There should be an emphasis on all virtual private network (VPN) connections, which should be tested to ensure reliable access for remote data center monitoring.
Q: Will future operational [sustainability] or management and operations awards contemplate additional procedures associated with pandemic risks?
A: Yes, Uptime Institute is currently evaluating and planning modifications of the Uptime Institute Tier Certification of Operational Sustainability, which will result in a change in the evaluation of data centers’ ability to mitigate various risks, including pandemics.
Q: What would be the “Tier IV measures” in a data center regarding COVID-19?
A: Uptime Institute Tier IV is a reference largely to data center topology design and installation. COVID-19 is mostly impacting data center operations. Therefore, COVID-19 would not impact the Tier IV compliance of a facility.
Q: How can we minimize the issue of network saturation because we are all working from home?
A: We recommend that all remote workers have established security policies set up by their IT departments, and that IT departments explore potential bottlenecks and recommend mitigation efforts.
Many employees working from a home office will use their ISP (internet service provider) to access the cloud and office LAN (local area network) over VPN, where the read/write profile is totally different compared with, say, streaming a Netflix movie (which is practically all downloading and has a “read” profile). With added consumption from others in the home (such as family members also working or doing remote schooling), the “read” load increases and often becomes the great villain in bandwidth consumption. This can cause bandwidth limits to be reached, leading to packet loss or timeouts and a slow internet experience.
With regard to VPNs, as part of their cybersecurity policies, many organizations and governments use strict access policies to control LAN users when working from the office. Remote workers use VPNs to build a tunnel through their ISP, linked with their office’s secure access. However, this model is rigid and wasn’t developed to accommodate the number of employees currently working remotely. This can lead to additional bottlenecks, giving the same slow internet experience as from home. Occasionally, when connected to the LAN, a remote worker’s internet access will pass through their company’s firewall to locate a cloud-based service. This can cause a serious degradation of service because the remote worker’s cloud services traffic is compounded by activity such as their video-conferencing traffic and by their family’s consumption.
Q: Any recommendations for the security of data center information? The question relates to the importance of remote monitoring of mission-critical systems and whether this would be done through cloud-hosted applications.
A: If the data center has implemented remote monitoring and BMS, these systems should be used during a pandemic or other similar emergency events to monitor, measure and manage both IT equipment and supporting infrastructure such as power and cooling systems. There should be emphasis on all VPN connections to ensure they are tested and enable reliable access for remote data center monitoring.
Q: In the case of suppliers working at reduced capacity, do you recommend purchasing spare parts stock for operating equipment, taking into account the budget constraints/budget recommendations of prioritization?
A: The potential for long-term disruption to the supply chain for critical spares and consumables should be considered. If service level agreements include spare-parts supplies, communication should be established to ensure key equipment parts are available and/or to establish additional time for arrival in case of failure or emergency.
Q: Once this pandemic is overcome, it should accelerate the transfer of data center owners’ IT platforms to large data centers or to the cloud. What is your vision with regard to this issue?
A: There are several dynamics at play here, and for this reason, it is premature to give any definitive guidance until the situation clarifies. But some observations:
- Many enterprises are likely to conclude that they want to reduce risk and complexity in the future, and they will not welcome the extra costs and processes associated with reducing the impact of future pandemics. For many, the obvious solution will be to go to the cloud or colocation companies. But the former, while strategic for many, will be most disruptive, perhaps more expensive, and may make the risks less visible.
- Our research already shows that the biggest single impact of the lockdown has been to delay data center and IT projects. This is likely to slow down any major cloud/colocation moves, as a backlog builds and new priorities come into play. Overall, there is likely to be bias toward strengthening the status quo.
- As we move out of the pandemic, many enterprises will have cost-reduction programs in place resulting from loss of business. Cloud has many advantages, but few large businesses have found it to be cheaper, especially where data centers are already depreciated. And almost all find that even where the costs are not higher, there are temporary transition costs.
Of course, every application, every service and every company is different. Although it is speculative, we think it likely that using colocation will prove a less-disruptive and more cost-effective path than full-on cloud transformation. While the long-term trend toward cloud will continue, there may be more pressure on cloud operators to take active steps to attract enterprise workloads.
Q: What do you see as the future focus, having now had a pandemic as a precedent?
A: In situations like this, data centers face particular challenges because key personnel may be unavailable due to illness or quarantine. We recommend that organizations develop a specific pandemic preparedness plan, similar to those for civic emergencies, that focuses on performance, efficiency and reliability and includes contingency plans that can be adapted to the challenges of the current pandemic or to potential recurrent endemic events. Each organization’s response will vary based on individual site environments and local government/mandatory restrictions. Plans should consider situations in which staff may be unable to access or leave the site on short notice. Please refer to our report COVID-19: Minimizing critical facility risk, which addresses this topic in detail.
Q: What time projection does Uptime Institute have for the COVID-19 crisis?
A: Unfortunately, there is insufficient information at this point in time to answer questions regarding the duration of the COVID-19 crisis.
Here is Part 2 of our Q&A regarding digital infrastructure considerations during the COVID19 crisis. Keep in mind that we are all handling this crisis in varied ways, and learning from each other along the way. In that process, we really ARE all finding our own “New Normal” and ultimately we will come out of this crisis stronger due to the renewed focus on operational effectiveness, risk avoidance and contingency planning.
As part of our infrastructure community leadership, we are regularly adding new materials, guidance and recommendations to help digital infrastructure owners and operators in these times.
We’ve conducted a number of webinars attended by thousands of listeners where we shared first-hand COVID19 experience. Below is Part 2 of the Q&A responses brought up during these webinars. These questions deal with Site Sanitation and Security. (Part 1 dealt with Staffing, and future published info will focus on Deferred Maintenance, Tier Standard, Remote Working, Supply Chain and Long-Term expectations.)
Q: If an average of 50 people per day enter the data center, how often is filter change recommended? What parameters do I use to make that change?
A: Under normal operating scenarios, filter changes are typically triggered by an increased pressure differential measured across the filter. As the filter clogs with debris and particulate, pressure drop will increase. As far as operations during COVID-19 are concerned, there does not appear to be any reason to change the outlook on filter changes, including what should trigger a filter change, based on information currently available on how the virus spreads (although information seems to be changing over time).
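The pressure-differential trigger described above can be sketched as a simple threshold check. This is only an illustration: the unit names, the readings and the 250 Pa change threshold are assumptions, and a real site would take its threshold from its own equipment documentation and maintenance procedures.

```python
from dataclasses import dataclass


@dataclass
class FilterReading:
    unit: str               # hypothetical air-handling unit identifier
    differential_pa: float  # measured pressure drop across the filter, in pascals


def filters_due(readings, change_threshold_pa):
    """Return the units whose pressure drop has reached the change threshold."""
    return [r.unit for r in readings if r.differential_pa >= change_threshold_pa]


# Illustrative readings; the threshold value is an assumption, not a recommendation.
readings = [
    FilterReading("CRAH-1", 120.0),
    FilterReading("CRAH-2", 260.0),
    FilterReading("CRAH-3", 250.0),
]
due = filters_due(readings, change_threshold_pa=250.0)
```

Here `due` lists the two units at or above the assumed threshold. The number of people entering the site does not appear in the check at all, which mirrors the point that the trigger is the measured pressure drop.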
Q: What types of access controls or filters are recommended to implement? Are there air conditioning filters on the market that limit the circulation of viruses in the data center?
A: Currently, it is not believed that filters will play a large part in mitigating spread of the virus in data centers. Some research suggests that high-end filters (for example, high efficiency particulate air filters) are capable of filtering particles the size of the virus that causes COVID-19. However, based on current information on how the virus spreads, it is not generally believed that it spreads in a true airborne manner to the point where it gets in ventilation systems; it spreads via proximity to infected people sneezing, coughing, etc. Data centers are unique in that they have multiple air changes per minute, which is different from other facility types. It is important to note that while filtering could theoretically reduce spread, the National Air Filtration Association does not believe this will happen from a practical perspective.
Q: My concern is that, in a closed-loop environment, if COVID-19 enters the data center, the virus will live and could infect more people.
A: That is an accurate statement. That is also why it is important to take any reasonable steps possible to keep it out of the data center via strong site-access control requirements and checks. It is also why many data center operators are implementing regular disinfecting so that if it is in the data center, spread is mitigated. Uptime Institute is aware that many operators have found specialized data center cleaning companies that are capable of disinfecting sites in accordance with guidelines for this pandemic from the WHO (World Health Organization) and/or the CDC (US Centers for Disease Control and Prevention).
Q: How long can the airborne virus particle last in a data center because it is cold air?
A: According to the WHO: “It is not certain how long the virus that causes COVID-19 survives on surfaces, but it seems to behave like other coronaviruses. Studies suggest that coronaviruses (including preliminary information on the COVID-19 virus) may persist on surfaces for a few hours or up to several days. This may vary under different conditions (e.g., type of surface, temperature or humidity of the environment).” We believe that this is an area that is still being studied. There does not appear to be any consensus, other than there are a number of factors, including temperature and humidity, that can impact this.
Q: Taking into consideration that the virus lasts for an incubation period, would it be possible to bring contamination into the data center?
A: Based on the information presented by the WHO and CDC, it is likely that the virus can be introduced and that there can be contamination in the data center. The most likely vector for transmission is infected individuals who do not know they are infected. Please refer to the WHO, the CDC and/or your local authority for more information.
Q: Is there any type of clothing, masks or gloves that is recommended for access to the data center by customers or suppliers so as not to expose our staff? Is it more feasible for staff to carry this type of PPE or to demand it from the customers or suppliers?
A: The CDC is recommending the N95 mask. Other masks do not seal tightly around the nose and mouth to provide proper protection. To be fully effective it must be fitted properly. Specialists receive training annually on how to properly fit these respirators around the nose, cheeks and chin, ensuring that wearers don’t breathe around the edges of the respirator. When you do that, it turns out that the work of breathing, since you’re going through a very thick material, is harder. You have to work to breathe in and out. All personnel accessing the data center should be wearing PPE in accordance with the current policies related to the COVID-19 pandemic and future similar events.
It is important to note that we are seeing companies implement their pandemic plans, which to a large extent include limiting site access for customers and employees. The pandemic plans vary between companies, but we are hearing of restrictions being implemented to include the use of masks, gloves, etc., primarily to follow CDC guidelines. Our suggestion is to follow CDC guidelines, as well as your approved pandemic plan. Please note that sanitizing methods may be a better thing to focus on than the use of masks. Also, while masks would provide an additional layer of protection, until production of masks ramps up, Uptime Institute is now somewhat cautious about recommending data center owners and operators stock up on masks. The “good” masks are in short supply and should be allocated to healthcare professionals until there is sufficient supply to go around.
Q: If we decide to sanitize [our data center], the fire system detectors can go off. What do you recommend?
A: It is common during various housekeeping operations to put the fire system into bypass. This is especially important if there is a VESDA (very early smoke detection apparatus) system present, which can be triggered by disturbances of even very small particulate (they are specifically designed to be highly sensitive). Our recommendation is to put the fire system in bypass while maintaining compliance with local jurisdictional requirements. This may require fire watch or similar measures be taken while the system is in bypass.
Q: Insecurity is likely to increase in different regions. Do you recommend increasing security?
A: Certainly, in facilities operating in severely affected areas the level of security risk could be affected. In these areas, management must adopt enhanced security policies, including prescreening all scheduled visitors before arrival on-site; prohibiting all unscheduled visitors; and if possible and applicable, creating a separate, secure entrance for all parties involved in essential on-site construction projects and establishing a policy that they (or any other visitors) are not allowed to interact with duty operations personnel.
We are all in this together. We are all finding our own “New Normal”. We are all learning from each other and will come out of this crisis stronger due to the renewed focus on operational effectiveness, risk avoidance and contingency planning.
To this end, Uptime Institute has been creating many types of guidance and recommendations to help digital infrastructure owners and operators ride through the COVID19 crisis. We want to help the entire community deal with their present and prepare for their future, where the New Normal will be a way of life.
We’ve created COVID19 operational guideline reports based on our 25 years of data center risk management experience, published associated real-time updates and bulletins, and conducted a number of webinars attended by thousands of listeners where we shared first-hand COVID19 experience. As a community, it’s clear we are all looking for each other’s support, guidance and experience.
As part of these webinars, we have published a series of digital infrastructure Q&A documents, focused on various categories of COVID19 specific topics: Staff Management, Site Sanitation, Site Security, Deferred Maintenance, Tier Standard, Remote Working, Supply Chain and Long-Term expectations. Presented below is PART 1 of these Q&A topics, covering Staff Management during COVID19.
Q: What is your view on continuous, long shifts for four or seven days in the data center?
A: It is not best practice for operations to change to continuous shifts of four to seven days, as fatigue and stress will increase the human risk factor that can cause abnormal incidents. Instead, we recommend considering extending the shift time from 8 hours to 12 hours and limiting this to a maximum of two or three consecutive days. Any extended continuous shifts should include long, regular breaks during each shift to avoid fatigue. There needs to be a careful balance between the increased risk of human fatigue and the mitigated risk of virus spread. Managers should also ensure that the total hours worked per person do not increase, that overtime does not exceed 10%, and that the shifts are arranged so that staff can rest adequately between shifts.
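The limits mentioned in this answer (12-hour shifts, no more than two or three consecutive extended days, overtime of no more than 10%) can be sketched as a simple roster check. The class, function and parameter names are hypothetical, and the sketch assumes one shift per person per day and treats any shift over 8 hours as extended.

```python
from dataclasses import dataclass


@dataclass
class Shift:
    day: int      # day index within the roster period (one shift per day assumed)
    hours: float  # scheduled length of the shift


def check_roster(shifts, contracted_hours, max_shift_hours=12.0,
                 max_consecutive_extended=3, max_overtime_ratio=0.10):
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    run = 0          # current run of extended (>8 h) shifts on consecutive days
    prev_day = None
    for s in sorted(shifts, key=lambda x: x.day):
        if s.hours > max_shift_hours:
            problems.append(f"day {s.day}: shift exceeds {max_shift_hours} h")
        if s.hours > 8:
            consecutive = prev_day is not None and s.day == prev_day + 1
            run = run + 1 if (consecutive and run > 0) else 1
        else:
            run = 0
        if run > max_consecutive_extended:
            problems.append(
                f"day {s.day}: more than {max_consecutive_extended} "
                "consecutive extended shifts"
            )
        prev_day = s.day
    total = sum(s.hours for s in shifts)
    if total > contracted_hours * (1 + max_overtime_ratio):
        problems.append(
            f"total hours exceed contracted hours by more than {max_overtime_ratio:.0%}"
        )
    return problems


# Illustrative rosters for one person with 40 contracted hours in the period.
ok = check_roster([Shift(1, 12), Shift(2, 12), Shift(4, 8), Shift(5, 8)], 40)
bad = check_roster([Shift(1, 12), Shift(2, 12), Shift(3, 12), Shift(4, 12)], 40)
```

The first roster passes. The second is flagged twice: once for a fourth consecutive extended shift and once for exceeding the overtime allowance. Rest time between shifts is not modeled here and would need its own check.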
Q: What procedures should we follow to identify an infected staff member?
A: As described in our report COVID-19: Minimizing critical facility risk, we recommend contact tracing systems. Register the health information and location of your organization’s personnel, suppliers’ personnel and other related personnel every day to monitor possible exposure to the virus and/or any symptoms (including those of the common cold). We recommend prescreening all scheduled visitors before they arrive on-site, including sending a questionnaire via email 48 hours prior to their visit. Require completion of the questionnaire before the appointment is confirmed. Verify that all answers remain unchanged upon arrival and institute temperature checks using noncontact thermometers before entry to the facility.
For a confirmed COVID-19 case at the site, we recommend that cleaning personnel use bio-hazard suits, gloves, shoe coverings, etc. and that all personal protective equipment (PPE) is bagged and removed from the site once cleaning is complete. With or without a confirmed case at the site, ensure the availability of PPE, including masks, gloves and hazardous materials or hazmat suits. Depending on the appropriate medical or management advice, workers should use masks during shift turnover. Training pairs (e.g., senior engineer and trainee) must wear masks at all times.
For further information, please refer to our report COVID-19: Minimizing critical facility risk.
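The visitor prescreening steps in this answer can be sketched as a short decision function. All names are hypothetical, and the 37.5 °C temperature limit is an assumption made for illustration (no threshold is stated here). The rule on unscheduled visitors reflects the stricter policy some operators apply.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    scheduled: bool                     # appointment was booked in advance
    questionnaire_complete: bool        # prescreening questionnaire returned
    reports_symptoms: bool              # any symptoms declared on the questionnaire
    answers_unchanged_on_arrival: bool  # answers re-verified at the door
    temperature_c: float                # noncontact thermometer reading


def admit(visit, max_temp_c=37.5):
    """Return (admitted, reason). The temperature limit is an assumed value."""
    if not visit.scheduled:
        return False, "unscheduled visitors are not admitted"
    if not visit.questionnaire_complete:
        return False, "questionnaire not completed"
    if visit.reports_symptoms:
        return False, "symptoms reported"
    if not visit.answers_unchanged_on_arrival:
        return False, "answers changed since prescreening"
    if visit.temperature_c > max_temp_c:
        return False, "temperature check failed"
    return True, "admitted"


# Illustrative visits.
cleared = admit(Visit(True, True, False, True, 36.6))
feverish = admit(Visit(True, True, False, True, 38.2))
walk_in = admit(Visit(False, False, False, True, 36.6))
```

Ordering the checks from least to most intrusive means most refusals happen before the visitor reaches the site.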
Q: How feasible is it to move families to the data center?
A: Although housing staff on-site should be considered only as a last resort, regions could go into lockdown mid-shift, so you may need to prepare for that eventuality. There are disaster recovery plans that include providing accommodation for several family members for up to 2 weeks to avoid traveling to and from the data center. While the data center is perceived as a controlled-access space, it is not a safe space. Therefore, any organization considering this option should also consider offering a specialized training program for family members that includes awareness of the hazards and the associated risks, emergency evacuation procedures, etc.
Q: Is it always recommended to keep personnel 24×7 for Tier III and Tier IV data centers?
A: Yes, it is a required criterion of the Uptime Institute Tier Certification of Operational Sustainability to have a minimum of one 24-hour, 7-day-a-week qualified staff presence (full-time employee) per shift for Tier III data centers and a minimum of two 24-hour, 7-day-a-week qualified staff presence (full-time employees) per shift for Tier IV data centers.
Q: We must not forget the following considerations for staff that may need to stay at the data center for 24 hours or more: the need to prepare food; a supply of canned food for more than 40 days, as well as alkaline water and the ability to purify it by reverse osmosis in case of water contamination; and cardiopulmonary resuscitation equipment for emergencies.
A: Correct, all these initiatives are proactive and preventative. Uptime Institute’s COVID-19: Minimizing critical facility risk report provides additional information related to what measures data center management should consider for the health and safety of staff and the protection of the site.
Q: Do you recommend interviewing all internal staff to determine their personal situation, and whether this should be done by a psychologist, particularly if staff are in the data center for a long time?
A: Organizations should maintain open and continuous communication with staff, customers and relevant third parties on a daily basis or even twice daily. Briefings may be appropriate as the conditions change. We also recommend sharing news updates and links to public resources to keep staff informed of the current status of the pandemic and the best practices for maintaining a safe and healthy work environment. As appropriate for each case, emotional support should be provided to reduce stress. Special attention should be given to any changes to continuous, long shifts that could increase the risk of human error, which may cause abnormal incidents.
One of the findings of Uptime Institute’s recently published report Annual outage analysis 2020 is that the most serious categories of outages — those that cause a significant disruption in services — are becoming more severe and more costly. This isn’t entirely surprising: individuals and businesses alike are becoming ever more dependent on IT, and it is becoming harder to replicate or replace an IT service with a manual system.
But one of the findings raises both red flags and questions: Setting aside those that were partial and incidental (which had minimal impact), publicly reported outages over the past three years appear to be getting longer. And this, in turn, is likely to be one of the reasons why the costs and severity of outages have been rising.
The table above shows the numbers of publicly reported outages collected by Uptime Institute in the years 2017-2019, except for those that did not have a financial or customer impact or those for which there was no known cause. The figures show outages are on the rise. This is due to a number of factors, including greater deployment of IT services and better reporting. But they also show a tilt in the data towards longer outages — especially those that lasted more than 48 hours. (This is true even though one of the biggest causes of lengthy outages — ransomware — was excluded from our sample.)
The outage times reported are until full IT service recovery, not business recovery — it may take longer, for example, to move aircraft back to where they are supposed to be, or to deal with backlogs in insurance claims. This trend is not dramatic, but it is real and it is concerning, because a 48-hour interruption can be lethal for many organizations.
Why is it happening? Complexity and interdependency of IT systems and greater dependency on software and data are very likely to be big reasons. For example, Uptime Institute’s research shows that fewer big outages are caused by power failure in the data center and more by IT systems configuration now than in the past. While resolving facility engineering issues may not be easy, it is usually a relatively predictable matter: failures are often binary, and very often recovery processes have been drilled into the operators and spare parts are kept at hand. Software, data integrity and damaged/interrupted cross-organizational business processes, however, can be much more difficult issues to resolve or sometimes even to diagnose, and these types of failure are becoming much more common (and yes, sometimes they are triggered by a power failure). Furthermore, because failures can be partial, files may become out of sync or even be corrupted.
There are lessons to be drawn. The biggest is that the resiliency regimes that facilities staff have lived by for three decades or more need to be extended and integrated into IT and DevOps and fully supported and invested in by management. Another is that while disaster recovery may be slowly disappearing as a type of commercial backup service, the principles of vigilance, recovery, and failover – especially when under stress – are more important than ever.
The full report Annual outage analysis 2020 is available to members of the Uptime Institute Network; access can be requested here.
Over the past few weeks, Uptime Institute held multiple customer roundtables to discuss the impact of the COVID-19 virus on data center operations and potential operational responses to its spread. We gathered our community’s insights and best practices, combined them with our own 25 years’ worth of infrastructure operational management knowledge, and are now making this information freely available to the data center industry here.
A little background to get you started right away….
Dozens of organizations were represented at these roundtables, which were open to a global audience of Uptime Institute Network members. What we learned is that while most organizations have a plan for foreseen emergency situations, few have one specific to a global pandemic. As a result, many have been hurriedly modifying existing plans based on gut feel and good intentions: creating tiered response levels, identifying events that would trigger the next level of response, researching concerns specific to a pandemic (e.g., what does “deep cleaning” mean in a white space, and what are the implications for different data center environments such as raised floors, multi-tenant data centers and mixed-use facilities?).
But this is clearly uncharted territory for ALL of us. For many organizations, the Human Resources and/or Environmental Health & Safety department(s) take the lead in generating an organization-wide response plan, and specific business units, such as data center operations, incorporate that guidance into a plan tailored to their business mission and setting. Because many organizations have data centers in multiple regions, responses may vary by location or facility characteristics. A sample but very broad Emergency Response Plan by the US government’s FDA (with portions pertaining to the delivery of IT services contained within) can be seen here.
But immediately actionable, tangible advice goes a long way in times like these. Several participants mentioned that their facilities now screen all potential visitors with a questionnaire. They do not admit anyone who reports symptoms (personally or in family members) or who has traveled recently to areas with high levels of COVID-19 cases. Some respondents advised that an additional security measure involves prescreening all scheduled visitors: Send the visitor the questionnaire via email 4-8 hours prior to their visit and require completion before the appointment is confirmed. Only permit entry if the questionnaire indicates a low probability of infection (confirm all answers remain unchanged upon arrival) and prohibit unscheduled visitors altogether.
Some facilities – for example, multi-tenant data centers or mixed-use facilities – have a higher volume of visitors, and thus greater potential for COVID-19 spread. To avoid inconvenience and potential client dissatisfaction, be proactive: Inform all affected parties of the COVID-19 preparedness plan in place and its impact on their access to the facility in advance.
Sanitization is a particular challenge, with several participants reporting disinfectant/hand sanitizer shortages. Many had questions specific to deep cleaning the white space environment, given its high rate/volume of air exchange, highly specialized electronic equipment and possible raised floor configuration. Spray techniques are more effective than simply wiping surfaces with disinfectant solutions, as the antiseptic mist coats surfaces for a longer period. Many organizations are hiring specialist cleaning firms and/or following CDC recommendations for disinfection.
As COVID-19 spreads, more organizations are moving their energy from academically tweaking their written response plans to implementing them. In many companies, that decision is made by a business unit, based on site environment, number of COVID-19 cases in the area and government-mandated restrictions. Mission-critical facilities have a particular remit, though, so they need to create and implement plans specific to their business mission.
Good preparation simplifies decision-making. Roundtable participants suggest the following:
- Categorizing essential versus nonessential tasks and calendaring them in advance (makes it easier to identify maintenance items you can postpone, and for how long).
- Cross-training personnel and maintaining up-to-date skill inventories/certifications (helps ensure core capabilities are always available).
- Having contingency plans in place (means you’re prepared to manage supply chain disruption and staff shortages).
- Stress-testing technologies and procedures in advance (gives you confidence that you can accommodate a move to remote work: shifting procedures that are usually performed manually to an automated process, monitoring remotely, interacting virtually with other team members, etc.).
It’s no longer a question of whether a plan like this will be needed; we know it is! Most facility operators need to quickly craft and then implement their response plan, and learn from this incident for the future.
Uptime Institute has created a number of resources and will be providing specific guidance regarding the COVID-19 situation here.