Pandemic is causing some outages and slowdowns

To date, media coverage of the impact of COVID-19 and the lockdowns has been largely laudatory. There have been few high profile or serious outages (perhaps fewer than normal) and for the most part, internet traffic flow analysis shows that a sudden jump in demand, along with a shift toward the residential edge and busy daytime hours, has had little impact on performance. The military-grade resiliency of the ARPANET (Advanced Research Projects Agency Network) and the Internet Protocol, the foundation of the modern internet, is given much credit.

But beneath the calm surface, data center operators have been paddling furiously to maintain services, especially given alarming staff shortages at some sites.

In our most recent survey, 84% of respondents had not experienced a service slowdown or outage that could be attributed to COVID-19. However, 4% (eight operators) said they had experienced a COVID-19-related outage and 10% (20 operators) had experienced a COVID-19-related service slowdown (see figure below).

Establishing the causes of these slowdowns or outages will probably not be easy. Research does show that staff shortages and fatigue can lead to more incidents and outages, and sustained staff shortages (due to illness, separation of shifts and self-isolation) are widespread across the sector. Some recent data center outages that Uptime Institute has tracked were clearly the result of operator or management error, although this is a usual occurrence.

Slowdowns, meanwhile, are most likely to be the result of sudden changes in demand and overload, or an external third-party network problem. Two examples are the UK online grocer that mistook high demand for a denial-of-service attack, and Sharp Electronics, which offered consumer PPE (personal protective equipment) using the same systems as its online appliance management systems in Japan. Both crashed. Zoom, the suddenly popular conferencing service, has also experienced some maintenance-related issues. In the US, a CenturyLink cable cut, and another network issue at Tata Communications in Europe, caused outage numbers to surge above average.

As the impact of the virus continues, data center operators could come under more strain. Most operators have deferred some scheduled maintenance, which, in spite of monitoring and close management, will likely increase the risk of failures. In addition, many if not most sites are now operating with a reduced level of on-site staff, with many engineers on call rather than on site. The industry’s traditional conservatism has so far given it a good protective buffer, but this will come under pressure over time unless restrictions and practices are eased or evolve to ensure risk is again reduced.

More information on this topic is available to members of the Uptime Institute Network here.
