Extreme weather affects nearly half of data centers
Recent extreme weather-related events in the US (the big freeze in Texas, fires on the west coast) have once again highlighted the need for data center operators to reassess their risks in the face of climate change. The topic is discussed in depth in the Uptime Institute report (available to Uptime Institute members) entitled The gathering storm: Climate change and data center resiliency.
Data centers are designed and built to withstand bad weather. But extreme weather is becoming more common, and it can trigger all kinds of unforeseen problems — especially for utilities and support services.
In a recent Uptime Intelligence survey, almost half (45%) of respondents said they had experienced an extreme weather event that threatened their continuous operation — a surprisingly large number. While most said operations continued without problems, nearly one in 10 respondents (8.8%) did suffer an outage or significant service disruption as a result. This makes extreme weather one of the top causes of outages or disruption.
More events, higher costs
The industry — and that means investors, designers, insurers, operators and other contractors — is now braced for more challenging conditions and higher costs in the years ahead. Three in five respondents (59%) think there will be more IT service outages as a direct result of the impact of climate change. Nearly nine in 10 (86%) think that climate change and weather-related events will drive up the cost of data center infrastructure and operations over the next 10 years.
While most operators are well aware of the risks and costs of climate change, many do not appear to consider their own sites to be facing any immediate challenges. Over a third (36%) report their management has yet to formally assess the vulnerability of their data centers to climate change. Almost a third (31%) believe they already have adequate protection in place — but it is not clear whether this belief is backed by recent analysis.
Perhaps most strikingly, only one in 20 managers sees a dramatic increase in risks due to climate change and is taking steps to improve resiliency as a result. Such steps can range from simple changes to processes and maintenance to expensive investments in flood barriers, changes to cooling systems or even re-siting and closure.
The need for assessments
Any investments in resiliency need, of course, to be based on sound risk analysis. Uptime Institute strongly recommends that operators conduct regular reviews of climate change-related risks to their data centers. The risk profile for a data center may be far less rosy in 2021 than it was when the facility was built, even if that was only a few years ago. Four in five data center operators (81%) agree that data center resiliency assessments will need to be regularly updated due to the impact of climate change.
As recent events show, such reviews may need to consider water and power grid resilience, potential impacts to roads and staff access, and even the economics of operating for long periods without free cooling.
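A review of this kind lends itself to a simple, repeatable checklist. The sketch below is purely illustrative: the risk areas are taken from the examples above, but the scoring scale, threshold and field names are assumptions rather than an Uptime Institute methodology.

# Illustrative sketch only: recording a periodic climate-related resiliency
# review. The risk areas come from the examples above; the 1-5 scales and
# the reporting threshold are assumptions, not an Uptime Institute method.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RiskItem:
    area: str          # e.g., "utility power resilience"
    likelihood: int    # 1 (rare) to 5 (frequent)
    impact: int        # 1 (minor) to 5 (site-threatening)
    mitigation: str = ""

    @property
    def score(self) -> int:
        return self.likelihood * self.impact

@dataclass
class ClimateRiskReview:
    site: str
    review_date: date
    items: list[RiskItem] = field(default_factory=list)

    def top_risks(self, threshold: int = 9) -> list[RiskItem]:
        return sorted((i for i in self.items if i.score >= threshold),
                      key=lambda i: i.score, reverse=True)

review = ClimateRiskReview("example-site", date(2021, 3, 1), [
    RiskItem("utility power grid resilience", 3, 5, "48 h of on-site fuel"),
    RiskItem("water supply for cooling", 2, 4, "on-site water storage"),
    RiskItem("road and staff access in storms", 3, 3, "on-call housing plan"),
    RiskItem("extended loss of free cooling", 2, 3, "review chiller capacity"),
])
for item in review.top_risks():
    print(f"{item.area}: score {item.score} ({item.mitigation})")

A structure like this makes it easier to compare reviews over time and to show management which risks have changed since the site was built.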
The figure below shows the top areas typically reviewed by organizations conducting climate change/weather-related data center resiliency assessments.
Data center managers do appear to have a good understanding of what to assess. But the findings also highlight the Achilles heel of data center resiliency: the difficulty of mitigating against (or even accurately analyzing) the risks of failures by outside suppliers. Extreme weather events can hit power, water, fuel supplies, maintenance services and staff availability all at once. Good planning, however, can dramatically reduce the impact of such challenges.
Data center staff shortages don’t need to be a crisis
In every region of the world, data center capacity is being dramatically expanded. Across the board, the scale of capacity growth is stretching the critical infrastructure sector’s talent supply. The availability (or lack) of specialist staff will be an increasing concern for all types of data centers, from mega-growth hyperscales to smaller private enterprise facilities.
Uptime Institute forecasts that global data center staff requirements will grow from about 2.0 million full-time employee equivalents in 2019 to nearly 2.3 million in 2025 (see Figure 1). This estimate, the sector’s first, covers more than 230 specialist job roles for different types and sizes of data centers, with varying criticality requirements, from design through operation.
Figure 1. Global data center staff projections
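For context, a quick back-of-the-envelope calculation using only the rounded figures cited above shows what the forecast implies in growth terms:

# Implied growth from the rounded forecast figures cited above.
start, end = 2.0e6, 2.3e6          # full-time employee equivalents, 2019 and 2025
years = 2025 - 2019

added = end - start
cagr = (end / start) ** (1 / years) - 1

print(f"Additional staff needed: about {added:,.0f} over {years} years")
print(f"Implied compound annual growth: about {cagr:.1%} per year")
# -> about 300,000 additional staff; roughly 2.4% per year

The percentage growth is modest, but the absolute numbers are large relative to the pool of people with relevant experience, which is why recruitment and training are the focus below.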
Our research shows that the proportion of data center owners or operators globally that are having difficulty finding qualified candidates for open jobs rose to 50% in 2020. While there is hope that new technologies to manage and operate facilities will reduce staff burdens over time, their effect is expected to be limited, at least until 2025.
There is also a concern that many employees in some mature data center markets, such as the US and Western Europe, are due to retire around the same time, causing an additional surge in demand, particularly for senior roles.
However, the growth in demand does not need to represent a crisis. Individual employers can take steps to address the issue, and the sector can act together to raise the profile of opportunities and to improve recruitment and training. Globally, the biggest employers are investing in more training and education, by not only developing internal programs but also working with universities/colleges and technical schools. Of course, additional training requires additional resources, but more operators of all sizes and types are beginning to view this type of investment as a necessity.
Education and background requirements for many job roles may also need to be revisited. In reality, most of these jobs do not require a high level of formal education, even where the employer may initially have required it. Relevant experience, an internship/traineeship, or on-the-job training can often more than compensate for the lack of a formal qualification.
The growing, long-term requirement for more trained people has also caught the attention of private equity and other investors. More are backing data center facilities management suppliers, which offer services that can help overcome skills shortages. Raising awareness of the opportunities and offering training can be part of the investment. The data center sector faces staffing challenges, but with focused investment, industry initiatives and more data center-specific education programs, it can rise to them.
Several resources related to this topic are available to members of the Uptime Institute community, including “The people challenge: Global data center staffing forecast 2021-2025” and “Critical facility management: Guidance on using third parties.” Click here to find out more about joining our community.
Extreme cold — a neglected threat to availability?
In Uptime Institute’s recent report on preparing for the extreme effects of climate change, there were over a dozen references to the dangers of extremely hot weather, which can overwhelm cooling systems and trigger regional fires that disrupt power, connectivity and staff access.
But the effects of extreme cold were discussed only in passing. The main thermal challenge for a data center, after all, is keeping temperatures down, and most data centers subject to extreme cold (e.g., Facebook’s data center in Lulea, Sweden, just outside the Arctic Circle) have usually been designed with adequate protective measures in place.
But climate change (if that is the cause) is known to trigger wild swings in the weather, and that may include, for some, a period of unexpected extreme cold. This occurred in Texas in February 2021, when the state experienced record cold temperatures — breaking the previous record from 1909 by some margin. Temperatures in Austin (TX) on Monday, February 15, fell to 4 degrees Fahrenheit (-16 degrees Celsius), with a wind chill effect taking it down to -16 degrees Fahrenheit (-27 degrees Celsius).
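For readers who want to check the conversions, the arithmetic is simply C = (F - 32) × 5/9:

# Fahrenheit-to-Celsius conversion for the Austin temperatures cited above.
def f_to_c(temp_f: float) -> float:
    return (temp_f - 32) * 5 / 9

for temp_f in (4, -16):
    print(f"{temp_f} F = {f_to_c(temp_f):.1f} C")
# 4 F = -15.6 C (about -16 C); -16 F = -26.7 C (about -27 C)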
The impact on digital infrastructure was dramatic. First, the grid shut down for more than seven hours, affecting two million homes and forcing data centers to use generators. This, reportedly, was due to multiple failures in the power grid, from the shutdown of gas wells and power plants (due mainly to frozen components and loss of power for pumping gas) to low power generation from renewable sources (due to low wind/solar availability and frozen wind turbines). The failures have triggered further discussions about the way the Texas grid is managed and the amount of capacity it can call on at critical times. In addition, AT&T and T-Mobile reported some issues with connectivity services.
Data center managers struggled with multiple issues. Those successful in moving to generator power faced fuel delivery issues due to road conditions, while anyone buying power on the spot markets saw a surge in power prices (although most data center operators buy at a fixed price). The city of Austin’s own data center was one of those that suffered a lengthy outage.
All this raises the question: What can data center staff do to reduce the likelihood of an outage or service degradation due to low temperatures (and possible snowy/icy conditions)? Below we provide advice from Uptime’s Chief Technical Officer, Chris Brown.
For backup power systems (usually diesel generators):
• Check start battery condition.
• Check diesel additive to ensure it is protective below the anticipated temperatures.
• Ensure block heaters and jacket water heaters are operational.
• Check filters, as they are more likely to clog at low temperatures.
For cooling systems:
• Ensure freeze protection is in place or de-icing procedures are followed on cooling towers. Consider reversing fans to remove built-up ice.
• Ensure all externally mounted equipment is rated for the anticipated temperatures. (Direct expansion compressors and air-cooled chillers will not operate in extreme temperatures.)
• Ensure freeze protection on all external piping is operational.
• Evaluate the use of free cooling where available.
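Checks like these are easier to act on when they are tracked systematically against the forecast low rather than run from memory. The sketch below is a hypothetical illustration: the sensor names, thresholds and readings are invented, and it is not drawn from Uptime Institute guidance.

# Hypothetical illustration: encoding cold-weather readiness checks so they
# can be evaluated against a forecast low. All names, thresholds and readings
# are invented examples, not vendor or Uptime Institute guidance.
FORECAST_LOW_C = -16.0

checks = [
    ("Generator start batteries above minimum voltage",
     lambda s: s["start_battery_v"] >= 24.5),
    ("Diesel additive rated below the forecast low",
     lambda s: s["fuel_additive_rating_c"] <= FORECAST_LOW_C),
    ("Block and jacket water heaters operational",
     lambda s: s["block_heaters_ok"]),
    ("Cooling tower freeze protection / de-icing enabled",
     lambda s: s["tower_freeze_protection_on"]),
    ("Externally mounted equipment rated for the forecast low",
     lambda s: s["external_equipment_rating_c"] <= FORECAST_LOW_C),
    ("Heat trace on external piping operational",
     lambda s: s["pipe_heat_trace_ok"]),
]

site_readings = {                     # example readings only
    "start_battery_v": 25.1,
    "fuel_additive_rating_c": -20.0,
    "block_heaters_ok": True,
    "tower_freeze_protection_on": True,
    "external_equipment_rating_c": -10.0,   # fails: not rated low enough
    "pipe_heat_trace_ok": True,
}

for description, passes in checks:
    status = "OK" if passes(site_readings) else "ACTION NEEDED"
    print(f"[{status}] {description}")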
And of course, remember to ensure critical staff are housed near the data center in case transportation becomes an issue. Also, consider reducing some IT loads and turning off a generator or two — it is better to do this than to run generators for long periods at a low load.
Given the weather-related February 2021 failure of the Texas grid at a critical time, it may also be advisable for all data center operators to review the resiliency and capacity of their local energy utilities, especially with regard to planning for extreme weather events, including heat, cold, rain and wind. Increasing use of renewable energy may require that greater reserve capacity is available.
For more information on climate change and weather risks, along with the litany of new challenges facing today’s infrastructure owners and operators, consider becoming a member of Uptime Institute. Members enjoy an entire portfolio of experiential knowledge and hands-on understanding from more than 100 of the world’s most respected companies. Members can access our report The gathering storm: Climate change and data center resiliency.
Network problems causing ever more outages
Power failures have always been one of the top causes of serious IT service outages. The loss of power to a data center can be devastating, and its consequences have fully justified the huge expense and effort that go into preventing such events.
But in recent years, other causes are catching up, with networking issues now emerging as one of the more common — if not the most common — causes of downtime. In our most recent survey of nearly 300 data center and IT service operators, network issues were cited as the most common reason for any IT service outage — more common even than power problems (see Figure 1).
Figure 1. Among survey respondents, networking issues were the most common cause of IT service outages in the past three years.
The reasons are clear enough: modern applications and data are increasingly spread across and among data centers, with the network ever more critical. To add to the mix, software-defined networks have added great flexibility and programmability, but they have also introduced failure-prone complexity.
Delving a little deeper confirms the complexity diagnosis. Configuration errors, firmware errors, and corrupted routing tables all play a big role, while the more traditional worries of weather and cable breaks are a relatively minor concern. Congestion and capacity issues also cause failures, but these are often themselves the result of programming/configuration issues.
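Many of these configuration-related failures are, in principle, catchable before a change is pushed to production. The fragment below is only a minimal illustration of that idea: the required patterns and the sample configuration are invented, and real network teams would rely on vendor validation tools, lab testing or intent-based checks rather than a hand-rolled lint.

# Minimal illustration of a pre-change configuration lint. The rules and the
# sample configuration are hypothetical; real deployments would rely on
# vendor validation tools, lab testing or intent-based checks.
import re

REQUIRED_PATTERNS = {
    "loopback defined": r"^interface Loopback0",
    "BGP process present": r"^router bgp \d+",
    "bogon/default filter applied": r"prefix-list DENY-DEFAULT",
}

def lint_config(config_text: str) -> list[str]:
    """Return a list of human-readable problems found in the candidate config."""
    problems = []
    for description, pattern in REQUIRED_PATTERNS.items():
        if not re.search(pattern, config_text, flags=re.MULTILINE):
            problems.append(f"missing: {description}")
    return problems

candidate = """
interface Loopback0
 ip address 192.0.2.1 255.255.255.255
router bgp 64500
 neighbor 198.51.100.2 remote-as 64501
"""

issues = lint_config(candidate)
print("\n".join(issues) if issues else "no obvious omissions found")
# -> missing: bogon/default filter applied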
Networks are complex not only technically, but also operationally. While enterprise data centers may be served by only one or two providers, multi-carrier colocation hubs can be served by many telecommunications providers. Some of these links may, further down the line, share cables or facilities — adding possible overlapping points of failure or capacity pinch points. Ownership, visibility and accountability can also be complicated. These factors combined help account for the fact that 39% of survey respondents said they had experienced an outage caused by a third-party networking issue — something over which they had little control (see Figure 2).
Figure 2. Configuration/change management failures caused almost half of all network-related outages reported by survey respondents.
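The point above about shared cables and facilities can also be checked in a lightweight way, provided physical path data is available from the carriers. The sketch below is hypothetical: the circuit names and segment identifiers are invented, and obtaining accurate path data is often the hard part in practice.

# Hypothetical sketch: detecting whether nominally diverse carrier circuits
# share physical segments (ducts, vaults, buildings, cable systems).
# Circuit names and segment identifiers are invented for illustration.
from itertools import combinations

circuits = {
    "carrier_A_primary":  {"duct-101", "street-vault-7", "landing-station-E"},
    "carrier_B_backup":   {"duct-205", "street-vault-7", "landing-station-W"},
    "carrier_C_tertiary": {"duct-310", "street-vault-9", "landing-station-W"},
}

for (name_a, segs_a), (name_b, segs_b) in combinations(circuits.items(), 2):
    shared = segs_a & segs_b
    if shared:
        print(f"{name_a} and {name_b} share: {', '.join(sorted(shared))}")
# -> carrier_A_primary and carrier_B_backup share: street-vault-7
# -> carrier_B_backup and carrier_C_tertiary share: landing-station-W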
Among those who avoided any downtime from networking-related outages, what was the most important reason?
A few of those organizations that avoided any network-related downtime put this down to luck (!). (We know of one operator who suffered two separate, unrelated critical cable breaks at the same time.) But the majority of those who avoided downtime credited a factor that is more controllable: investment in systems and training (see Figure 3).
Figure 3. Over three-quarters of survey respondents that avoided network-related downtime attributed it to their investments in resiliency and training.
The Bottom Line: As with the prevention of power issues, money spent on expertise, redundancy, monitoring, diagnostics and recovery — along with staff training and processes — will be paid back with more hours of uptime.
Data center staff on-site: engineering specialists or generalists?
The pandemic has led to renewed interest among data center managers in remote monitoring, management and automation. Uptime Institute has fielded dozens of inquiries about these approaches in recent months, but one in particular stands out: What will operational automation and machine learning mean for on-site staff requirements?
With greater automation, the expectation is a move toward light staffing models, with just one or a handful of technicians on-site. These technicians will need to be able to respond to a range of potential situations: electrical or mechanical issues; software administration problems; break/fix needs on the computer-room floor; the ability to configure equipment, including servers, switches and routers; and so on. Do people with these competencies exist? How long does it take to train them?
Our experts agree: the requirements for on-site staff are shifting — from electrical and mechanical specialists to more generalist technicians whose primary role is to monitor and control data center activities to prevent incidents (especially outages).
Even before the pandemic, most on-site data center technicians did not carry out major preventative maintenance activities (although some do conduct low-level preventative maintenance); instead, they support and escort the vendors who do this work. The pandemic has accelerated this trend. On-site technicians today are typically trained as operational coordinators: switching and isolating equipment when necessary, ensuring adequate monitoring, and reacting to unexpected and emergency conditions with the goal of getting a situation under control and returning the data center to a stable state.
Staffing costs have always been a major item in data center operations budgets. With better remote monitoring and greater data center resiliency in recent years, the perceived need for large numbers of on-site staff has diminished, particularly during off hours when activity is low. This trend is unlikely to reverse once the pandemic has passed.
One of our members, for example, runs an extremely large data center site that can be described as being built with a Tier III (concurrently maintainable) intent. It is mostly leased by hyperscale internet/cloud companies. On-site technicians are trained as generalists for operations and emergency coverage, and they work 12-hour shifts. A separate 8-hour day shift is staffed more heavily with engineers to handle customer projects and to assist with other operator activities as needed. All preventative maintenance is conducted by third-party vendors, who are escorted on-site by the staff technicians. Management anticipates moving to an automated, condition-based maintenance approach in the future, with the aim of lowering the number of on-site technical staff over time. The expectation is that on-site 24/7 staff will always be required to meet client service level agreements, but that lowering their numbers will yield meaningful operational savings.
However, this will not be a swift change (for this or any other data center). Implementing automated, software-driven systems is an iterative — and human-driven — process that takes time, ongoing investment and, critically, organizational and process change.
Technologies and services for remote data center monitoring and management are available and continue to develop. As they are (slowly and carefully) implemented, managers will feel more comfortable not having personnel on-site 24/7. In time, management focus will likely shift from ensuring round-the-clock staffing to developing more of an on-call approach. Already, more data centers are employing technicians and engineers who support multiple sites rather than having a fully staffed, dedicated team for each individual data center. These technicians have general knowledge of electrical and mechanical systems, and they coordinate the preventive and corrective maintenance activities, which are mostly performed by vendors.
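Part of what makes an on-call model workable is that routine watching can be automated. The snippet below is a deliberately simplified sketch of a threshold-based remote-monitoring rule; the metric names, thresholds and the notify() hook are assumptions for illustration, not a reference to any particular DCIM or BMS product.

# Simplified sketch of a remote-monitoring rule that pages on-call staff.
# Metric names, thresholds and the notify() hook are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Rule:
    metric: str
    max_value: float     # alert when the reading exceeds this
    severity: str

RULES = [
    Rule("supply_air_temp_c", 27.0, "warning"),
    Rule("hot_aisle_temp_c", 38.0, "critical"),
    Rule("ups_load_pct", 90.0, "critical"),
]

def evaluate(readings: dict) -> list:
    alerts = []
    for rule in RULES:
        value = readings.get(rule.metric)
        if value is not None and value > rule.max_value:
            alerts.append(f"{rule.severity.upper()}: {rule.metric} = {value}")
    return alerts

def notify(alerts):
    # In practice this would call a paging or ticketing integration.
    for alert in alerts:
        print("Paging on-call technician:", alert)

notify(evaluate({"supply_air_temp_c": 29.5, "hot_aisle_temp_c": 35.0,
                 "ups_load_pct": 62.0}))
# -> Paging on-call technician: WARNING: supply_air_temp_c = 29.5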
Today, however, because of the pandemic, there is generally a greater reliance on on-site staffing, with technicians in the data center providing managers with security/comfort and insurance in case there is an incident or outage. This is likely a short-term reaction.
In the medium term — say, in the next three to five years or so — we expect there will be increased use of plug-and-play data center and IT components and systems, so generalist site staffers can readily remove and replace modules as needed, without extensive training.
In the long term, more managers will seek to enhance and ensure data center and application resiliency and automation. This will involve the technical development of more self-healing systems/networks and redundancies (driven by software) that allow for reduced levels of on-site staff and reduced expertise of those personnel. If business functions can continue in the face of a failure without human intervention, then mean-time-to-repair becomes far less critical — a technician or vendor can be dispatched in due course with a replacement component to restore full functionality of the site. This type of self-healing approach has been discussed in earnest for at least the past decade but has not yet been realized — in no small part because of the operational change and new operational processes needed. A self-healing, autonomous operational approach would be an overhaul of today’s decades-long, industry-wide practices. Change is not always easy, and rarely is it inexpensive.
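The point above about mean-time-to-repair can be made concrete with standard textbook reliability arithmetic (availability = MTBF / (MTBF + MTTR)); the MTBF and MTTR figures below are arbitrary illustrations, not measured data.

# Standard availability arithmetic: A = MTBF / (MTBF + MTTR).
# The MTBF and MTTR figures are arbitrary illustrations, not measured data.
def availability(mtbf_h: float, mttr_h: float) -> float:
    return mtbf_h / (mtbf_h + mttr_h)

mtbf = 50_000.0                       # hours between failures (illustrative)
for mttr in (4.0, 48.0):              # rapid on-site fix vs. deferred repair
    single = availability(mtbf, mttr)
    # Two independent redundant paths: service is lost only if both are down.
    redundant = 1 - (1 - single) ** 2
    print(f"MTTR {mttr:>4.0f} h: single path {single:.4%}, "
          f"redundant pair {redundant:.6%}")
# Going from a 4-hour to a 48-hour repair barely moves the redundant-pair
# figure, which is the sense in which self-healing designs make MTTR less
# critical for the service (though not for restoring full redundancy).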
What is likely to (finally) propel the development of and move to self-healing technologies is the expected demand for large numbers of lights-out edge data centers. These small facilities will increasingly be designed to be plug-and-play and to be serviced by people with little in the way of specialized skills or training. On-site staff will be trained primarily to reliably follow directions from remote technical experts. These remote experts will be responsible for analyzing monitored data and providing specific instructions for the staffer at the site. It is possible, if not likely, that most people dispatched on-site to edge facilities will be vendors swapping out components. And increasingly, the specialist mechanical and electrical staff will not only be remote, but also trained experts in real-time monitoring and management software and software-driven systems.
Ensuring physical security in uncertain times
Recent events have heightened concerns around physical security for many data center operators, and with good reason: the pandemic means many data centers may still be short-staffed, less time may have been available for review of and training on routine procedures, and vendor substitutes may be more common than under non-pandemic conditions. Add the usual “unusuals” that affect operations (e.g., severe storms causing staff absences and increasing the likelihood of utility failures), and normal precautions may fall by the wayside.
For most data centers, much of physical security starts at site selection and design. The typical layered (“box inside a box”) security strategy adopted by most facilities handles many concerns. If a data center has vulnerabilities (e.g., dark fiber wells beyond the perimeter), they’re generally known and provisions have been made to monitor them. Routine security standards are in place, emergency procedures are established, and all employees are trained.
But what keeps data center operators up at night is the unexpected. The recent bombing in Nashville, Tennessee (US), which disrupted internet and wireless services, and new threats to Amazon Web Services facilities following its decision to suspend hosting the social media platform Parler are stark reminders that extreme events can occur.
A December 2019 report from Uptime Institute summed it up best: IT security is one of the big issues of the information age. Billions of dollars are spent protecting the integrity and availability of data against the actions of malign agents. But while cybersecurity is a high-profile issue, all information lives in a physical data center somewhere, and much of it needs the highest order of protection. Data center owners/operators employ a wide range of tactics to maintain a perimeter against intruders and to regulate the activities of clients and visitors inside the data center. The full report assesses operator security spending, concerns, management and best practices.
Key findings from the December 2019 report:
• Spending on physical security is commonly around 5% of the operations budget but in extreme cases can be as high as 30%.
• Data centers employ a range of common technologies and techniques to control access to the facility, but there is no “one size fits all” solution to physical security: each organization must tailor its approach to fit its circumstances.
• Neither cloud-based data replication nor cybersecurity threats to both IT systems and facilities equipment has significantly diminished the need for physical security.
• Most data center owners and operators consider unauthorized activity in the data center to be the greatest physical threat to IT.
• Access to the data center property is governed by policies that reflect the business requirements of the organization and establish the techniques and technologies used to ensure the physical security of the facility. These policies should be reviewed regularly and benchmarked against those of similar organizations.
• Data centers commonly employ third-party security services to enforce physical security policies.
• Attempts at unwarranted entry do occur. In a recent study, about one in five data centers experienced some form of attempted access in a five-year period.
• Drones, infrared cameras, thermal scanners and video analytics are promising new technologies.
• Biometric recognition is still viewed skeptically by many operators.
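One of the findings above, that unauthorized activity inside the facility is seen as the greatest physical threat, is partly a monitoring problem. The sketch below is a hypothetical illustration of flagging badge events that fall outside an access policy; the log format, zones and authorized hours are invented for illustration and do not reflect any specific access-control product.

# Hypothetical illustration: flagging badge events outside an access policy.
# The log format, zones and authorized hours are invented examples.
from datetime import datetime

# person -> (authorized zones, authorized hours as (start, end) local time)
ACCESS_POLICY = {
    "tech-042":   ({"lobby", "data-hall-1"}, (7, 19)),
    "vendor-117": ({"lobby", "loading-dock"}, (9, 17)),
}

badge_events = [
    ("tech-042",   "data-hall-1", datetime(2021, 1, 20, 14, 5)),
    ("vendor-117", "data-hall-1", datetime(2021, 1, 20, 15, 30)),  # wrong zone
    ("tech-042",   "data-hall-1", datetime(2021, 1, 21, 2, 10)),   # off hours
]

for person, zone, when in badge_events:
    policy = ACCESS_POLICY.get(person)
    if policy is None:
        print(f"REVIEW: unknown badge {person} at {zone} ({when})")
        continue
    zones, (start_h, end_h) = policy
    if zone not in zones or not (start_h <= when.hour < end_h):
        print(f"REVIEW: {person} at {zone} ({when}) outside policy")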
Now is the time to review your security plans and emergency operations procedures and to brief staff. Ensure they know the organization’s strategies and expectations. If your facility is in an area where many data centers are clustered together, consider collaborating with them to develop a regional plan.