Data center staff on-site: engineering specialists or generalists?
The pandemic has renewed data center managers' interest in remote monitoring, management and automation. Uptime Institute has fielded dozens of inquiries about these approaches in recent months, but one in particular stands out: What will operational automation and machine learning mean for on-site staff requirements?
With greater automation, the expectation is a move toward light staffing models, with just one or a handful of technicians on-site. These technicians will need to respond to a range of situations: electrical or mechanical issues; software administration problems; break/fix needs on the computer-room floor; configuration of equipment, including servers, switches and routers; and so on. Do people with these competencies exist? How long does it take to train them?
Our experts agree: the requirements for on-site staff are shifting from electrical and mechanical specialists to more generalist technicians whose primary role is to monitor and control data center activities to prevent incidents (especially outages).
Even before the pandemic, most on-site data center technicians did not carry out major preventative maintenance activities (although some conducted low-level preventative maintenance); instead, they supported and escorted the vendors who did this work. The pandemic has accelerated this trend. On-site technicians today are typically trained as operational coordinators: they switch and isolate equipment when necessary, ensure adequate monitoring, and react to unexpected and emergency conditions, with the goal of getting a situation under control and returning the data center to a stable state.
Staffing costs have always been a major item in data center operations budgets. With better remote monitoring and greater data center resiliency in recent years, the perceived need for large numbers of on-site staff has diminished, particularly during off-hours when activity is low. This trend is unlikely to reverse even after the pandemic.
One of our members, for example, runs an extremely large data center site that can be described as being built with a Tier III (concurrently maintainable) intent. It is mostly leased by hyperscale internet/cloud companies. On-site technicians are trained as generalists for operations and emergency coverage, and they work 12-hour shifts. A separate 8-hour day shift is staffed more heavily with engineers to handle customer projects and to assist with other operator activities as needed. All preventative maintenance is conducted by third-party vendors, who are escorted on-site by the staff technicians. Management anticipates moving to an automated, condition-based maintenance approach in the future, with the aim of lowering the number of on-site technical staff over time. The expectation is that on-site 24/7 staff will always be required to meet client service level agreements, but that lowering their numbers will yield meaningful operational savings.
However, this will not be a swift change (for this or any other data center). Implementing automated, software-driven systems is an iterative — and human-driven — process that takes time, ongoing investment and, critically, organizational and process change.
Technologies and services for remote data center monitoring and management are available and continue to develop. As they are (slowly and carefully) implemented, managers will feel more comfortable not having personnel on-site 24/7. In time, management focus will likely shift from ensuring round-the-clock staffing to developing more of an on-call approach. Already, more data centers are employing technicians and engineers who support multiple sites rather than having a fully staffed, dedicated team for each individual data center. These technicians have general knowledge of electrical and mechanical systems, and they coordinate the preventive and corrective maintenance activities, which are mostly performed by vendors.
Today, however, because of the pandemic, there is generally a greater reliance on on-site staffing: technicians in the data center give managers a sense of security and serve as insurance in case of an incident or outage. This is likely a short-term reaction.
In the medium term — say, in the next three to five years or so — we expect there will be increased use of plug-and-play data center and IT components and systems, so generalist site staffers can readily remove and replace modules as needed, without extensive training.
In the long term, more managers will seek to enhance and ensure data center and application resiliency and automation. This will involve the technical development of more self-healing systems/networks and redundancies (driven by software) that allow for fewer on-site staff and less specialized expertise among those personnel. If business functions can continue in the face of a failure without human intervention, then mean time to repair becomes far less critical: a technician or vendor can be dispatched in due course with a replacement component to restore full functionality of the site. This type of self-healing approach has been discussed in earnest for at least the past decade but has not yet been realized, in no small part because of the operational change and new operational processes needed. A self-healing, autonomous operational approach would be an overhaul of practices that have been in place across the industry for decades. Change is not always easy, and rarely is it inexpensive.
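The point about mean time to repair can be illustrated with a rough calculation. The sketch below is not drawn from any Uptime Institute model; it uses the textbook steady-state availability formula, MTBF / (MTBF + MTTR), and assumes two identical units that fail independently, either of which can carry the load. The 10,000-hour MTBF and the 4-hour and 48-hour repair times are hypothetical figures chosen only for illustration.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single repairable unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(unit_availability: float, units: int) -> float:
    """Availability of `units` independent units in parallel,
    assuming any one of them can carry the full load."""
    return 1.0 - (1.0 - unit_availability) ** units


def downtime_hours_per_year(avail: float) -> float:
    """Expected unavailable hours per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    mtbf = 10_000.0  # hypothetical mean time between failures, in hours
    for mttr in (4.0, 48.0):  # hypothetical fast vs. slow repair times
        single = availability(mtbf, mttr)
        pair = redundant_availability(single, 2)
        print(
            f"MTTR {mttr:>4.0f} h: "
            f"single unit {downtime_hours_per_year(single):.3f} h/yr down, "
            f"redundant pair {downtime_hours_per_year(pair):.4f} h/yr down"
        )
```

Under these assumptions, the redundant pair with a 48-hour repair time still shows less expected downtime than a single unit repaired within 4 hours, which is why dispatching a technician "in due course" becomes tolerable once failover is automatic. Real facilities have correlated failures and shared dependencies, so the independence assumption makes these figures optimistic.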
What is likely to (finally) propel the development of, and the move to, self-healing technologies is the expected demand for large numbers of lights-out edge data centers. These small facilities will increasingly be designed to be plug-and-play and to be serviced by people with little specialized skill or training. On-site staff will be trained primarily to reliably follow directions from remote technical experts. These remote experts will be responsible for analyzing monitored data and providing specific instructions for the staffer at the site. It is possible, if not likely, that most people dispatched on-site to edge facilities will be vendors swapping out components. And increasingly, the specialist mechanical and electrical staff will not only be remote but will also be trained experts in real-time monitoring and management software and software-driven systems.