Regulations drive investments in cybersecurity and efficiency

Legislative requirements for data center resiliency, operational transparency and energy performance are tightening worldwide — putting data centers under greater regulatory scrutiny. In response, organizations are either starting or stepping up their efforts to achieve compliance in these areas, and findings from the Uptime Institute Global Data Center Survey 2023 reveal that most are prioritizing cybersecurity (see Figure 1).

Figure 1. Regulations drive security, hardware and efficiency investments

Since 2020, several countries have introduced laws with strict cybersecurity demands for data center operators to combat the rise in cyber threats (see Table 1) — especially if they host or manage critical national infrastructure (CNI) workloads. As CNI entities become more reliant on digital services, they are increasingly exposed to cyber risks that could result in severe consequences. For example, a compromised facility managing applications for a utility risks widespread power and communications outages, threatening the physical safety of citizens.

Table 1. Regulations that mandate enhanced cybersecurity measures

Cyberattacks are becoming increasingly sophisticated as the digital infrastructure becomes more interconnected. For example, operational technology systems for power and cooling optimization are routinely connected to the internet (either directly or indirectly), which creates a broader “attack surface,” giving more access points for cyberattacks. Operators are also increasingly deploying Internet of Things devices and applications. These are used for asset tracking, predictive maintenance and capacity planning, but they require network connectivity and can lack robust cybersecurity features.

Measures aimed at improving energy efficiency rank as the second and third most popular responses to new regulations (see Figure 1). To evaluate their progress, data center operators may add new energy management systems and network connections to the power infrastructure, potentially complicating existing cybersecurity programs.

Alongside the risks to CNI, cyberattacks could lead to significant financial losses for organizations through data breaches, reputational damage, customer lawsuits, ransom payments and regulatory fines. Governments are particularly concerned about systemic risks: the knock-on or “domino” effect when parts of the digital infrastructure supply chain go offline, causing others to fail or putting new traffic loads on entirely separate systems.

Privacy is also a major issue beginning to affect infrastructure operators — although this is mostly an issue at the application / data storage level. For example, the US Health Insurance Portability and Accountability Act (HIPAA) mandates that data center operators meet specific security standards if their facilities process private healthcare information — and noncompliance can cost $50,000 per violation. Such financial risks often fuel the business case for cybersecurity investments.

What do these investments look like? Many organizations start by conducting cybersecurity risk assessments, which often show that traditional, partial measures such as firewalls and basic security controls are not enough. They may also hire new or additional cybersecurity staff, patch vulnerable systems and applications, deploy network segmentation, set up protection against distributed denial-of-service attacks and roll out multifactor authentication for users. Once established, these measures need to be checked against specific regulatory requirements, which may call for specialized software or compliance audits.

The cost of compliance can be significant and recurring because of frequent regulatory and technological changes. Furthermore, the cybersecurity field is currently facing a labor shortage. According to the International Information System Security Certification Consortium (ISC2), there are more than 700,000 unfilled cybersecurity positions in the US alone, which is likely driving the costs higher.

While these investments can be significant for some organizations, there are many potential benefits that extend beyond regulatory compliance. Combined with other investments prompted by regulations, including energy performance improvements, these may pay dividends in preventing potential outages and play a role in elevating the overall resiliency and efficiency of all the systems involved.


The Uptime Intelligence View

Regulatory concerns over resiliency and energy use have led to a wave of new and updated requirements for data centers. Organizations are starting efforts to achieve compliance — and most are prioritizing cybersecurity. While investments in cybersecurity can carry significant costs, threats by malicious actors and financial penalties from noncompliance with regulatory requirements have bolstered the business case for these efforts.

Are utility companies needed for pull-the-plug testing?

The testing of backup power systems is crucial for ensuring that data center operations remain available through power interruptions. By cutting all power to the facility and replicating a real-world electrical grid failure, pull-the-plug testing provides the most comprehensive assessment of these systems. However, there are some differing opinions on the best way to perform the test and whether the electrical utility company needs to be involved.

The Uptime Institute Data Center Resiliency Survey 2023 found that more than 70% of organizations perform pull-the-plug tests (Figure 1), and of this group, roughly 95% do so at least annually. At the same time, less than half of operators involve their utility company in the process — raising questions about best practices and the value of some approaches to the test.

Operators are not required to notify the utility company of these tests in most cases. This is because it is unlikely that a sudden drop in demand, even from larger data centers, would impact an average-sized grid.

Successful pull-the-plug tests assess a range of operations, including power-loss detection, switchgear, backup generation and the controls needed to connect to on-site power production systems. Depending on the facility design, it may not be possible to fully test all these functions without coordinating with the electrical utility company.

Therefore, organizations that interrupt their power supply independently, without the involvement of the utility, are at risk of performing an incomplete test. And this may give a false sense of security about the facility’s ability to ride through a power outage.

Figure 1. Most data center operators perform pull-the-plug tests

Below are three of the most common approaches for performing a pull-the-plug test and the key considerations for operators when determining which type of test is best suited to their facility.

Coordinating with the electrical utility provider

For this test, the grid provider cuts all incoming power to the data center, prompting the backup power controls to start.

Although this approach guarantees an interruption to the power supply, it can create challenges with costs and scheduling. Because this is a full test of all backup functions, it carries some risk, so it is crucial to have staff with the necessary skills on-site to monitor each step of the procedure and ensure it runs smoothly. This can create scheduling challenges since the test may be constrained by staff availability, including those from suppliers. And because utility providers typically charge fees for their technicians, costs can increase if unforeseen events, such as severe weather, require the test to be rescheduled.

Typically, operators have to use this approach when they lack an isolation device — but isolation devices carry their own set of challenges.

Using an isolation device to interrupt power

A pull-the-plug test may also be carried out using an isolation device. These are circuit breakers or switches that are deployed upstream of the power transformers. Opening the isolation device cuts the power from the grid to the facility without requiring coordination with the electrical utility company. This approach can cut costs and remove some of the scheduling challenges listed in the previous section, but may not be feasible for some facility designs.

For example, if the opened hardware is monitored by a programmable logic controller (PLC), the generators may start automatically without using (and therefore testing) the controls linked to the power transformer. In this case, the testing of the power-loss detection, the processes for switching devices to on-site power use, and the controls used to coordinate these steps can be bypassed, leading to an incomplete test.

The use of an isolation device can also create new risks. Human error or hardware malfunctions of the device can result in unintended power interruptions or failures to interrupt the power when necessary. Other factors can add to these risks, such as installing the device outside a building and exposing it to extreme weather.

Data center operators that have deployed isolation devices in the initial facility’s design are the most likely to use them to conduct pull-the-plug tests. Those operators that do not have the devices already installed may not want to have them retrofitted due to new concerns, such as spatial challenges — some standards, such as the National Electrical Code in the US, require additional open space around such deployments. Any new installations would also require testing, which would carry all the risks and costs associated with pull-the-plug tests.

Pulling the power transformer fuses

Pulling the power transformer fuses tests all the PLC and backup system hardware required for responding to utility power failures and does not require coordination with the grid provider. However, the power loss to the facility is only simulated and not experienced. The PLC reacts as if an interruption to the power has happened, but a true loss of grid power only occurs once the generator power is active and backup systems are online.

In this case, the uninterruptible power supply (UPS) batteries only discharge for a fraction of the time that they would normally in an actual power outage and are therefore not fully tested. Depending on the PLC design, other ancillary processes may also be skipped and not tested.

However, this approach has many advantages that offset these limitations. It is widely used, particularly by operators of facilities that are the most sensitive to risk. Because the grid power is not interrupted, it can be restored quickly if the equipment malfunctions or human error occurs during the test. And because the UPS batteries are discharged only for a short time, there is less stress and impact on their overall life expectancy.

Facilities that have difficulties with interrupting the power, such as difficulty coordinating with the utility or designs that place staff at risk when opening breakers and switches, also benefit from this approach.

While data center operators have options for pulling the plug, many are unwilling or unable to perform the test. For example, colocation providers and operators of facilities that process critical infrastructure workloads may be restricted over how and when they can pull the plug due to customer contracts.

The Uptime Intelligence View

Uptime Intelligence data has consistently shown that power is the most common cause of the most significant data center outages, with the failure to switch from the electrical grid to on-site power a recurrent problem. At the same time, electrical grids are set to become less reliable. As a result, all operators can benefit from reviewing their pull-the-plug testing procedures with their clients, regardless of whether they involve the energy provider, to help ensure resilient backup power systems.


For more details on data center resiliency and outage prevention, Uptime Institute’s Annual Outages Analysis 2023 is available here.

AI will have a limited role in data centers — for now

The topic of artificial intelligence (AI) has captured the public’s imagination, and barely a week now goes by without reports of another breakthrough. Among the many, sometimes dramatic predictions made by experts and non-experts alike is the potential elimination of some, or even many, jobs.

These expectations are partly — but only partly — mirrored in the data center industry: a quarter of respondents to the 2023 Uptime Institute annual data center survey believe that AI will reduce their data center operations staffing levels within the next five years. A much larger group, however, are more cautious, with nearly half believing that jobs will only be displaced over a longer period of time.

These views are inevitably speculative, but a measure of skepticism in the sector is understandable. Despite the hype surrounding large language models, such as ChatGPT, and other generative AI applications, the use cases for these AI tools in data center operations currently appear limited. There are, however, other forms of AI that are already in use in the data center — and have proved valuable — but have not affected any jobs.

AI-based technologies have been the subject of several hype cycles in the past, with their immediate impact always smaller than predicted. This supports the view that the transition to a new generation of software tools is unlikely to be quick, or as far-reaching in the near term, as some AI enthusiasts think.

There are two factors that will likely slow the impact of AI on data center jobs:

  • The risk profile of most AI-based technologies is currently unacceptable to data center operators.
  • Those AI-based technologies that have made their way into the data center appear to augment, rather than replace, employees.

AI in context

AI is an inconveniently broad umbrella term used to describe computer software that is capable of exhibiting what humans perceive as intelligent behavior. The term includes disciplines such as machine learning (ML), which is concerned with developing complex mathematical models that can learn from data to improve model performance over time.

ML is important in the automation of certain tasks. Once trained on data center operational data, such models can react to events much faster and with more granularity than human employees. This attribute is the foundation for most of the current-generation AI-based data center applications, such as dynamic cooling optimization and equipment health monitoring.
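
As a simplified illustration of the principle — a crude statistical stand-in for the trained ML models these tools actually use — the sketch below flags equipment sensor readings that deviate sharply from a recent baseline, which is the essence of anomaly detection for equipment health monitoring. The sensor type, window size and threshold are illustrative assumptions.

```python
from statistics import mean, stdev

def detect_anomalies(readings, window=48, k=3.0):
    """Flag sensor readings that deviate sharply from the recent baseline.

    readings: list of floats (e.g., supply-air temperature in degrees C)
    window:   number of recent samples used as the baseline
    k:        number of standard deviations treated as anomalous
    """
    alerts = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(readings[i] - mu) > k * sigma:
            alerts.append((i, readings[i]))  # sample index and anomalous value
    return alerts

# Illustrative use: a temperature spike an operator would want surfaced as an advisory
temps = [22.0 + 0.1 * (i % 5) for i in range(200)]
temps[150] = 31.5  # simulated cooling fault
print(detect_anomalies(temps))
```

A production tool would learn baselines per sensor and correlate multiple signals, but the advisory role is the same: surface deviations faster and at finer granularity than a human watching dashboards.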

But the term AI also includes plenty of other concepts that span a wide range of applications. The most hotly pursued approaches fashion deep neural networks into complex logic using a training process. Such systems address computational problems that cannot be explicitly expressed in manual programming. High-profile examples are natural language processing, computer vision, search and recommendation systems, and, more recently, generative content systems, such as ChatGPT (text) and Stable Diffusion (text-to-image).

While the current wave of interest, investment and application is unprecedented, there are reasons to look at the latest resurgence of AI with a degree of skepticism. AI is one of the few technologies to have gone through several hype cycles since its origins as an academic discipline in 1956. These periods are often referred to as “AI summers” at the height of excitement and investment in the technology and “AI winters” during the lows.

With faster and more affordable computers, new sources of data for model training, and sensors that enable machines to better understand the physical world, new and innovative AI applications emerge. When researchers reach the technological limits of the day, the funding and interest dry up.

AI applications that prove to be useful are integrated into mainstream software and often stop being considered AI, as part of a phenomenon called “the AI effect.” In the past, this has happened to computers playing chess, optical character recognition, machine translation, email spam filters, satellite navigation systems, and personal digital assistants, such as Siri and Alexa. Applications of AI with bad product-market fit are abandoned.

Data center operators, like other managers across industry, tend to react to the hype cycles with waves of inflated or dampened expectations. In 2019, 29% of respondents to Uptime’s annual survey said they believed that AI would reduce the need for data center staff within the next five years (Figure 1). Nearly five years later, we don’t see any evidence of this reduction taking place.

Figure 1: More operators expect AI to reduce staffing requirements in the near term 

AI in the data center

Some AI-based applications have made it into the data center. AI is currently used for dynamic power and cooling optimization, in anomaly detection, predictive maintenance, and other types of predictive analytics.

AI is rarely integrated into data center management tools as a control mechanism. Instead, it is used to advise facility operators. Ceding control of the facility to algorithms or models might make the infrastructure more efficient, but it would also expose the data center to new types of risk — and arguably new single points of failure in the AI mechanism itself. Any mistakes in model design or operation could result in prolonged outages, which could cost millions of dollars. This is not a gamble that operators are currently willing to take.

Increased media coverage of AI has also created more awareness of the faults and limitations that exist within the current generation of AI-based tools, which drives further caution. One such fault that has gained prominence in 2023 is the concept of “artificial hallucinations,” which describes the tendency of generative AI models to occasionally produce confident but inaccurate responses on factual matters. Other issues include the lack of decision-making transparency and accountability (often described as the “black box” problem), and concerns over the security of the data that is provided to train the models.

Nothing is new

It is worth noting that AI has had plenty of time to make inroads into the data center: US company Vigilent — an AI-based tool developer focused on digital infrastructure — has been applying ML in its cooling equipment optimization system since 2008. Vendors that have integrated this technology into their data center management tools include Schneider Electric, Siemens and Hitachi Vantara.

Vigilent is not alone in offering this kind of service. Recent entries in the cooling optimization product category include Phaidra in the US (established in 2019) and Coolgradient in Europe (founded in 2021). The former was founded by some members of the DeepMind team, which built an ML model for a Google data center that reportedly cut down the power consumption of cooling equipment by 40%.

What these tools, which represent some of the most successful implementations of AI in data center operations, have in common is their intention to augment humans rather than replace them — they drive cooling systems with a level of detail that the human brain alone would find difficult, if not impossible, to achieve.

The impact on jobs

According to Data Center Career Pathfinder — the tool developed in collaboration between Uptime Institute and some of the world’s largest data center operators — there are at least 25 distinct career options in data center operations and another 25 in operations engineering. These roles include engineers, mechanics, electricians, HVAC technicians, supplier quality managers, environmental health and safety coordinators, and cleaning specialists.

Most operations jobs require a physical presence at the site and interaction with physical equipment within the data center. No matter how intelligent, a software system cannot install a server or fix an ailing generator set.

There are, however, a few data center positions that may be at immediate risk from AI tools. The need for preventative maintenance planners might be reduced since the current generation of AI-based tools can predict failure rates and suggest optimal, condition-based maintenance schedules through advanced statistical methods. There may also be less need for physical security staff: CCTV systems equipped with detection, recognition and tracking features are able to alert the front desk if someone is in the facility without the right credentials. In the future, such systems will get smarter and cover a growing number of threat types through complex pattern recognition in and around the data center.

At the same time, the data center industry is suffering from severe staffing shortages. Half of the respondents to Uptime’s annual survey said they are experiencing difficulties in finding qualified candidates for open jobs. Even if AI-based tools become reliable enough to take over some of the duties of human employees, the likely impact would be to reduce the need for additional hires, offsetting the deficit in staff recruitment, rather than replace those already employed in data centers.

On balance, AI is not going to devour many, if any, jobs in data centers. Equally, it is premature to look to AI as a short-term fix to the industry’s staffing issues. Instead, the data center industry needs to draw in more staff by advertising the benefits of working in the sector to qualified job seekers and, in particular, by attracting younger cohorts with a clear path for training and career progression.

Perhaps this time around AI will really change the fabric of society and the nature of work. In the meantime, developing and deploying smarter AI systems will require a great deal more infrastructure capacity, which will generate new data center jobs before the technology displaces any.


The Uptime Intelligence View

Large language models and generative AI applications are making the headlines but are unlikely to find many uses in data center management and operation. Instead, the current hype cycle might make operators more amenable to better established and understood types of AI — those that have been deployed in data centers over the past 15 years but have not reached mainstream adoption. There is little doubt that, eventually, some jobs will be automated out of existence through AI-based software. However, data centers will continue to provide secure employment, and operational staff in particular will continue to be in high demand.

The strong case for power management

ANALYST OPINION

In a recent report on server energy efficiency, Uptime Intelligence’s Dr. Tomas Rahkonen analyzed data from 429 servers and identified five key insights (see Server energy efficiency: five key insights). All were valuable observations for better managing (and reducing) IT power consumption, but one area of his analysis stood out: the efficiency benefits of IT power management.

IT power management holds a strange position in modern IT. The technology is mature, well understood, clearly explained by vendors, and is known to reduce IT energy consumption effectively at certain points of load. Many guidelines (including the 2022 Best Practice Guidelines for the EU Code of Conduct on Data Centre Energy Efficiency and Uptime Institute’s sustainability series Digital infrastructure sustainability — a manager’s guide) strongly advocate for its use. And yet very few operators use it.

The reason for this is also widely understood: the decision to use IT power management rests with operational IT managers, whose main task is to ensure IT processing performance is optimized at all times and never becomes a problem. Power management, however, is a technology that trades processing power for lower energy consumption, and this compromise will almost always affect performance. When it comes to power consumption versus compute power, IT managers will almost always favor compute in order to protect application performance.

This is where Dr. Rahkonen’s study becomes important. His analysis (see below for the key findings) details these trade-offs and shows how performance might be affected. Such analysis, Uptime Intelligence believes, should be part of the discussions between facilities and IT managers at the point of procurement and as part of regular efficiency or sustainability reviews.

Power management — the context

The goal of power management is simple: reduce server energy consumption by applying voltage or frequency controls in various ways. The trick is finding the right approach so that IT performance is only minimally affected. That requires some thought about processor types, likely utilization and application needs.
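
For context, most server power management ultimately comes down to the operating system’s frequency and voltage scaling policy. The sketch below, assuming a Linux host that exposes the cpufreq subsystem through sysfs, simply reads the active scaling governor and current clock speed — a first step an operator might take when auditing whether power management is enabled.

```python
from pathlib import Path

CPUFREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq")

def read_attr(name):
    """Return a cpufreq sysfs attribute for CPU 0, or None if unavailable."""
    path = CPUFREQ / name
    return path.read_text().strip() if path.exists() else None

# The active scaling governor (e.g., "performance" or "powersave"),
# the governors the platform offers, and the current core clock in kHz.
print("governor: ", read_attr("scaling_governor"))
print("available:", read_attr("scaling_available_governors"))
print("cur_freq: ", read_attr("scaling_cur_freq"))
```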

Analysis shows that power management lengthens processing times and increases latency, but most IT managers have little or no idea by how much. And because most IT managers are striving to consolidate their IT loads and work their machines harder, power management seems like an unnecessary risk.

In his report, Dr. Rahkonen analyzed the data on server performance and energy use from The Green Grid’s publicly available Server Efficiency Rating Tool (SERT) database and drew out two key findings.

First, that server power management can improve server efficiency — which is based on the SERT server-side Java (SSJ) worklet and defined in terms of the SERT energy efficiency metric (SSJ transactions per second per watt) — by up to 19% at the most effective point in the 25% to 60% utilization range. This is a useful finding, not least because it shows the biggest efficiency improvements occur in the utilization range that most operators are striving for.

Despite this finding’s importance, many IT managers won’t initially pay too much attention to the efficiency metric. They care more about absolute performance. This is where Dr. Rahkonen’s second key finding (see Figure 1) is important: even at the worst points of utilization, power management only reduced work (in terms of SSJ transactions per second) by up to 6.4%. Power use reductions were more likely to be in the range of 15% to 21%.
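
A rough arithmetic sketch helps show why a modest loss of work can still produce a double-digit efficiency gain. The baseline throughput and power figures below are invented for illustration; only the percentage changes are taken from the ranges cited above.

```python
# Illustrative only: the baseline throughput and power are invented; the
# percentage changes (-6.4% work, -20% power) come from the ranges in the report.
baseline_tps   = 100_000   # SSJ transactions per second (hypothetical)
baseline_watts = 500       # server input power in watts (hypothetical)

managed_tps   = baseline_tps   * (1 - 0.064)   # worst-case work reduction
managed_watts = baseline_watts * (1 - 0.20)    # typical power reduction (15% to 21%)

eff_before = baseline_tps / baseline_watts     # SSJ transactions per second per watt
eff_after  = managed_tps / managed_watts

print(f"efficiency before: {eff_before:.1f} tps/W")
print(f"efficiency after:  {eff_after:.1f} tps/W")
print(f"improvement:       {100 * (eff_after / eff_before - 1):.1f}%")
```

With these inputs, the efficiency gain works out to roughly 17% — within the up-to-19% improvement found in the SERT data.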

Figure 1. Power management reduces server power and work capacity

What should IT managers make of this information? The main payoff is clear: power management’s impact on performance is demonstrably low, and in most cases, customers will probably not notice that it is turned on. Even at higher points of utilization, the impact on performance is minimal, which suggests that there are likely to be opportunities to both consolidate servers and utilize power management.

It is, of course, not that simple. Conservative IT managers may argue that they still cannot take the risk, especially if certain applications might be affected at key times. This soon becomes a more complex discussion that spans IT architectures, capacity use, types of performance measurement and economics. And latency, not just processor performance, is certainly a worry — more so for some applications and businesses than others.

Such concerns are valid and should be taken into consideration. However, seen through the lens of sustainability and efficiency, there is a clear case for IT operators to evaluate the impact of power management and deploy it where it is technically practical — which will likely be in many situations.

The economic case is possibly even stronger, especially given the recent rises in energy prices. Even at the most efficient facilities, aggregated savings will be considerable, easily rewarding the time and effort spent deploying power management (future Uptime Intelligence reports will have more analysis on the cost impacts of better IT power management).

IT power management has long been overlooked as a means of improving data center efficiency. Uptime Intelligence’s data shows that in most cases, concerns about IT performance are far outweighed by the reduction in energy use. Managers from both IT and facilities will benefit from analyzing the data, applying it to their use cases and, unless there are significant technical and performance issues, using power management as a default.

The Energy Efficiency Directive: requirements come into focus

The European Commission (EC) continues to grapple with the challenges of implementing the Energy Efficiency Directive (EED) reporting and metrics mandates. The publication of the Task B report Labelling and minimum performance standards schemes for data centres and the Task C report EU repository for the reporting obligation of data centres on June 7, 2023 represents the next step on the implementation journey.

Data center operators need to monitor the evolution of the EC’s plans, recognizing that the final version will not be published until the end of 2023. The good news is that about 95% of the requirements have already been set out and the EC’s focus is now on collecting data to inform the future development and selection of efficiency metrics, minimum performance standards and a rating system.

The Task B and C reports clarify most of the data reporting requirements and set out the preferred policy options for assessing data center energy performance. The reports also indicate the direction and scope of the effort to establish a work per energy metric, supporting metrics — such as power usage effectiveness (PUE) and renewable energy factor (REF) — and the appropriate minimum performance thresholds.

Data reporting update

Operators will need to periodically update their data collection processes to keep pace with adjustments to the EED reporting requirements. The Task C report introduces a refined and expanded list of data reporting elements (see Tables 1 and 2).

The EC intends for IT operators to report maximum work, as measured by the server efficiency rating tool (SERT), and the storage capacity of server and storage equipment, respectively, as well as the estimated CPU utilization of the server equipment with an assessment of its confidence level. The EC will use these data elements to assess the readiness of operators to report a work per energy metric and how data center types and equipment redundancy levels should differentiate thresholds.  

Table 1. The EED’s indicator values

Table 2 describes the new data indicators that have been added to the reporting requirements — water use and renewable energy consumption were listed previously. The Task A report and earlier versions of the EED recast had identified the other elements but had not designated them for reporting.

Table 2. The indicator data to be reported to EU member states

The EC will use these data elements to develop an understanding of data center operating characteristics:

  • How backup generators are used to support data center and electrical grid operations.
  • The percentage of annual water use that comes from potable water sources. In its comments on the Task B and C reports, Uptime Institute recommended that the EC also collect data on facilities’ cooling systems so that water use and water usage effectiveness (WUE) can be correlated to cooling system type.
  • The number of data centers that are capturing heat for reuse and the quality of heat that their systems generate.
  • The quantity of renewable energy consumed to run the data center and the quantity of guarantees of origin (GOs) used to offset electricity purchases from the grid. In its comments to the EC, Uptime Institute recommended using megawatt-hours (MWh) of renewable or carbon-free energy consumed in the operation, not the combination of MWh of consumption and offsets, to calculate the REF metric.

Section 4 of the Task C report details the full scope of the data elements that need to be reported under the EED. Five data elements specified in Annex VIa of the final EED draft are missing: temperature set points; installed power; annual incoming and outgoing data traffic; amount of data stored and processed; and power utilization. The final data reporting process needs to include these, as instructed by the directive. Table 3 lists the remaining data elements that are not covered in Tables 1, 2 and 4.

Table 3. Other reporting items mandated in the Task A report and the EED

Data center operators need to set up and exercise their data collection process in good time to ensure quality data for their first EED report on May 15, 2024. They should be able to easily collect and report most of the required data, with one exception.

The source of the servers’ maximum work capacity, utilization data, and the methodologies to estimate these values are currently under development. It is likely that these will not be available until the final Task A to D reports are published at the end of 2023. Operators are advised to track the development of these methodologies and be prepared to incorporate them into their reporting processes upon publication.

Colocation operators need to move quickly to establish their processes for collecting the IT equipment data from their tenants. The ideal solution for this challenge would be an industry-standard template that lets IT operators autoload their data to their colocation provider with embedded quality checks. The template would then aggregate the data for each data center location and autoload it to the EU-wide database, as proposed in the Task C document. It is possibly too late to prepare this solution for the May 2024 report; however, the data center industry should seriously consider creating, testing and deploying such a template for the 2025 report.
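
A minimal sketch of what such a template-driven pipeline could look like is shown below. The column names, quality checks and aggregation rules are hypothetical — no official schema exists yet — but they illustrate the flow described above: tenants submit a standard file with embedded quality checks, and the colocation provider rolls the data up per location.

```python
import csv
from collections import defaultdict

# Hypothetical column names for a tenant-submitted template; not an official schema.
REQUIRED = ["site_id", "tenant", "it_energy_kwh", "sert_max_work", "avg_cpu_util_pct"]

def load_tenant_rows(path):
    """Read one tenant's submission and apply basic embedded quality checks."""
    rows = []
    with open(path, newline="") as f:
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            missing = [c for c in REQUIRED if not row.get(c)]
            if missing:
                raise ValueError(f"{path} line {line_no}: missing {missing}")
            if not 0 <= float(row["avg_cpu_util_pct"]) <= 100:
                raise ValueError(f"{path} line {line_no}: utilization out of range")
            rows.append(row)
    return rows

def aggregate_by_site(rows):
    """Roll tenant submissions up to per-location totals for the EU-wide repository."""
    totals = defaultdict(lambda: {"it_energy_kwh": 0.0, "sert_max_work": 0.0})
    for r in rows:
        site = totals[r["site_id"]]
        site["it_energy_kwh"] += float(r["it_energy_kwh"])
        site["sert_max_work"] += float(r["sert_max_work"])
    return dict(totals)
```

In practice, the quality checks would also cover units, reporting periods and plausibility ranges for each indicator.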

The reporting of metrics

The EC intends to assess data center locations for at least four of the eight EN 50600-4 metrics (see Table 4). The metrics will be calculated from the submitted indicator data. Task C designates the public reporting of PUE, WUE, REF and energy reuse factor (ERF) by data center location. Two ICT metrics, IT equipment energy efficiency for servers (ITEEsv) and IT equipment utilization for servers (ITEUsv), will be calculated from indicator data but not publicly reported.

The cooling efficiency ratio (CER) and the carbon usage effectiveness (CUE) are not designated for indicator data collection or calculation. Uptime Institute recommended that the EC collect the energy used and produced from the cooling system as well as the cooling system type to enable the EC to understand the relationship between CER, WUE and the cooling system type.
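
For reference, the sketch below computes the four publicly reported metrics from annual indicator data using their commonly cited EN 50600-4 definitions. The boundary rules in the standard are more detailed than this, so treat it as a simplified illustration with invented annual figures.

```python
def eed_metrics(total_kwh, it_kwh, renewable_kwh, reused_kwh, water_litres):
    """Simplified annual calculations for the four publicly reported metrics.

    The EN 50600-4 boundary and measurement rules are more detailed; this is a sketch.
    """
    return {
        "PUE": total_kwh / it_kwh,        # power usage effectiveness
        "WUE": water_litres / it_kwh,     # litres of water per IT kWh
        "REF": renewable_kwh / total_kwh, # renewable energy factor (0 to 1)
        "ERF": reused_kwh / total_kwh,    # energy reuse factor (0 to 1)
    }

# Illustrative annual figures for a single site
print(eed_metrics(total_kwh=10_000_000, it_kwh=7_400_000,
                  renewable_kwh=6_000_000, reused_kwh=500_000,
                  water_litres=13_000_000))
```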

Table 4. The use of EN 50600-4 data center metrics for EED reporting

Public reporting of location-specific information

Table 5 lists the location-specific information that the EC recommends being made available to the public.

Table 5. The public reporting of indicator data and metrics

These data elements will reveal a significant level of detail about individual data center operations and will focus scrutiny on operators that are perceived as inefficient or as using excessive quantities of energy or water. Operators are advised to look at their data from 2020 to 2022 for these elements. They will need to consider how the public will perceive the data, determine whether an improvement plan is appropriate and develop a communication strategy to engage with stakeholders on the company’s management of its data center operations.

EU member states and EU-level reporting will provide aggregated data to detail the overall scope of data center operations in the individual jurisdictions. 

Data center efficiency rating systems and thresholds

The Task B report looks at the variety of sources on which the EC has built its data center efficiency rating systems and thresholds. In particular, the report evaluates:

  • A total of 25 national and regional laws, such as North Holland regional regulation.
  • Voluntary initiatives, such as the European Code of Conduct for Energy Efficiency in Data Centres.
  • Voluntary certification schemes, such as Germany’s Blauer Engel Data Centers ecolabel.
  • Building-focused data center certification schemes, such as the Leadership in Energy and Environmental Design (LEED) and Building Research Establishment Environmental Assessment Method (BREEAM).
  • Self-regulation, such as the Climate Neutral Data Center Pact.
  • A maturity model for energy management and environmental sustainability (CLC/TS 50600-5-1).

The report distills this information into four of the most promising options.

  1. Minimum performance thresholds for PUE and REF. These could be based on current commitments, such as those by the Climate Neutral Data Center Pact, and government targets for minimum renewable energy consumption percentages.
  2. Information requirements in the form of a rating-based labeling system. This would likely be differentiated by data center type and redundancy level, and built on metric performance thresholds and best practices:
    • Mandatory system. A multi-level rating system based on efficiency metric thresholds with a minimum number of energy management best practices.
    • Voluntary system. Pass / fail performance criteria, likely set for a mix of efficiency metrics and energy management best practices.
  3. Information requirements in the form of metrics. Indicator data and metric performance results would be published annually, enabling stakeholders to assess year-to-year performance improvements.

Integrating the Task A through C reports, the EC appears poised to publish the performance indicator data and metrics calculations detailed in option 3 in the above list for the May 15, 2024 public report. The EC’s intent seems to be to collect sufficient data to assess the current values of the key metrics (see Table 4) at operating data centers and the performance variations resulting from different data center types and redundancy levels. This information will help the EC to select the best set of metrics with performance threshold values to compel operators to improve their environmental performance and the quantity of work delivered for each unit of energy and water consumption.

Unfortunately, the EC will have only the 2024 data reports available for analysis ahead of the second assessment’s deadline of May 15, 2025, when it will recommend further measures. The quality of the 2024 data is likely to be suspect due to the short time that operators will have had to collect, aggregate, and report this data. It would benefit the European Parliament and the EC to delay the second assessment report until March 2026.

This extension would enable an analysis of two years’ worth of data reports, give operators time to establish and stabilize their reporting processes, and the EC time to observe trends in the data to improve their recommendations.

Conclusion

The details of the EED data reporting requirements are slowly coming into focus with the publication of the draft Task B and Task C reports. There is the potential for some changes to both the final, approved EED and the final Task A to D reports that will be delivered to the European Parliament by the end of 2023, but they are likely to be minimal. The EED was approved by the European Parliament on July 11, 2023 and is scheduled for Council approval on July 27, 2023, with formal publication two to three weeks later.

The broad outline and most of the specifics of the required data reports are now clearly defined. Operators need to move quickly to ensure that they are collecting and validating the necessary data for the May 15, 2024 report.

A significant exception is the lack of clarity regarding the measurements and methodologies that IT operators can use to estimate their server work capacity and utilization. The data center industry and the EC need to develop, approve and make available the data sets and methodologies that can facilitate the calculation of these indicators.


The Uptime Intelligence View

The EC is quickly converging on the data and metrics that will be collected and reported for the May 15, 2024 report. The publicly reported, location-specific data will give the public an intimate view of data center resource consumption and its basic efficiency levels. More importantly, it will enable the EC to gather the data it needs to evaluate a work per energy metric and develop minimum performance thresholds that have the potential to alter an IT manager’s business goals and objectives toward a greater focus on environmental performance.

Lifting and shifting apps to the cloud: a source of risk creep?

Public cloud infrastructures have come a long way over the past 16 years to slowly earn the trust of enterprises in running their most important applications and storing sensitive data. In the Uptime Institute Global Data Center Survey 2022, more than a third of enterprises that operate their own IT infrastructure said they also placed some of their mission-critical workloads in a public cloud.

This gradual change in enterprises’ posture, however, can only be partially attributed to improved or more visible cloud resiliency. An equal, or arguably even bigger, component in this shift in attitude is enterprises’ willingness to make compromises when using the cloud, which includes sometimes accepting less resilient cloud data center facilities. However, a more glaring downgrade lies in the loss of the ability to configure IT hardware specifically for sensitive business applications.

In more traditional, monolithic applications, both the data center and IT hardware play a central role in their reliability and availability. Most critical applications that predate the cloud era depend heavily on hardware features because they run on a single or a small number of servers. By design, more application performance meant bigger, more powerful servers (scaling up as opposed to scaling out), and more reliability and availability meant picking servers engineered for mission-critical use.

In contrast, cloud-native applications should be designed to scale across tens or hundreds of servers, with the assumption that the hardware cannot be relied upon. Cloud providers are upfront that customers are expected to build in resiliency and reliability using software and services.

But such architectures are complex, may require specialist skills and come with high software management overheads. Legacy mission-critical applications, such as databases, are not always set up to look after their reliability on their own without depending on hardware and operating system / hypervisor mechanisms. To move such applications to the cloud and maintain their reliability, organizations may need to substantially refactor the code.
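
As one small illustration of what "building in resiliency using software" means in practice, the sketch below retries a transient failure with exponential backoff and jitter — a basic pattern cloud providers expect customers to apply around network and storage calls. The operation and failure mode are hypothetical.

```python
import random
import time

def call_with_retries(operation, attempts=5, base_delay=0.2):
    """Retry a flaky operation with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)

def flaky_read():
    """Stand-in for a cloud API or storage call that fails transiently."""
    if random.random() < 0.3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_retries(flaky_read))
```

Retries only mask transient faults; refactoring a legacy monolith so that state, failover and data integrity are all handled this way in software is the far larger task the paragraph above describes.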

This Uptime Update discusses why organizations that are migrating critical workloads from their own IT infrastructure to the cloud will need to change their attitudes towards reliability to avoid creating risks.

Much more than availability

The language that surrounds infrastructure resiliency is often ambiguous and masks several interrelated but distinct aspects of engineering. Historically, the industry has largely discussed availability considerations around the public cloud, which most stakeholders understand as not experiencing outages to their cloud services.

In common public cloud parlance, availability is almost always used interchangeably with reliability. When offering advice on their reliability features or on how to architect cloud applications for reliability, cloud providers tend to discuss almost exclusively what falls under the high-availability engineering discipline (e.g., data replication, clustering and recovery schemes). In the software domain, physical and IT infrastructure reliability may be conflated with site reliability engineering, which is a software development and deployment framework.

These cross over in two significant ways. First, availability objectives, such as the likelihood that the system is ready to operate at any given time, are only a part of reliability engineering — or rather, one of its outcomes. Reliability engineering is primarily concerned with the system’s ability to perform its function free of errors. It also aims to suppress the likelihood that failures will affect the system’s health. Crucially, this includes the detection and containment of abnormal operations, such as a device making mistakes. In short, reliability is the likelihood of producing correct outputs.

For facilities, this typically translates to the ability to deliver conditioned power and air — even during times of maintenance and failures. For IT systems, reliability is about the capacity to perform compute and storage jobs without errors in calculations or data.

Second, the reliability of any system builds on the robustness of its constituent parts, which include the smallest components. In the cloud, however, the atomic unit of reliability that is visible to customers is a consumable cloud resource, such as a virtual machine or container, and more complex cloud services, such as data storage, network and an array of application interfaces.

Today, enterprises not only have limited information on the cloud data centers’ physical infrastructure resiliency (either topology or maintenance and operations practices), but also have little visibility of, or choice in, the reliability features of IT hardware and infrastructure software that underpin cloud services.

Engineering for reliability: a lost art?

This abstraction of hardware resources is a major departure from the classical infrastructure practices for IT systems that run mission-critical business and industrial applications. Server reliability greatly depends on the architectural features that detect and recover from errors occurring in processors and memory chips, often with the added help of the operating system.

Typical examples include soft bit flips (transient bit errors typically caused by an anomaly) and hard bit flips (permanent faults) in memory cell arrays. Bit errors can occur both in the processor and in external memory banks; operational errors and design bugs in processor logic can also produce incorrect outputs or result in a software crash.
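
To make the error-correction idea concrete, the toy sketch below encodes four data bits with a Hamming(7,4) code, flips one bit to simulate a soft error, and shows the decoder locating and correcting it. Real ECC memory uses wider SECDED codes implemented in hardware; this is only an illustration of the single-error-correction principle.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit Hamming codeword (even parity)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]   # codeword positions 1..7

def hamming74_correct(code):
    """Locate and fix a single flipped bit; return corrected data bits and error position."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4       # 0 = clean, otherwise the 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome

data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[4] ^= 1                              # simulate a soft bit flip in "memory"
recovered, pos = hamming74_correct(word)
print(recovered == data, "error at position", pos)
```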

For much of its history, the IT industry has gone to great and costly lengths to design mission-critical servers (and storage systems) that can be trusted to manage data and perform operations as intended. The engineering discipline addressing server robustness is generally known as reliability, availability and serviceability (RAS, a term coined by IBM five decades ago), with the serviceability aspect referring to maintenance and upgrades, including software, that do not cause any downtime.

Traditional examples of these servers include mainframes, UNIX-based and other proprietary software and hardware systems. However, in the past couple of decades x86-based mission-critical systems, which are distinct from volume servers in their RAS features, have also taken hold in the market.

What sets mission-critical hardware design apart is its extensive error detection, correction and recovery capabilities, which go beyond those found in mainstream hardware. While perfect resistance to errors is not possible, such features greatly reduce the chances of errors and software crashes.

Mission-critical systems tend to be able to isolate a range of faulty hardware components without resulting in any disruption. These components include memory chips (the most common source of data integrity and system stability issues), processor units or entire processors, or even an entire physical partition of a mission-critical server. Often, critical memory contents are mirrored within the system across different banks of memory to safeguard against hardware failures.

Server reliability doesn’t end with design, however. Before volume manufacturing begins, vendors of mission-critical servers and storage systems test the final version of any new server platform for many months to ensure it performs correctly, a process known as validation.

Entire sectors, such as financial services, e-commerce, manufacturing, transport and more, have come to depend on and trust such hardware for the correctness of their critical applications and data.

Someone else’s server, my reliability

Maintaining a mission-critical level of infrastructure reliability in the cloud (or even just establishing underlying infrastructure reliability in general), as opposed to “simple” availability, is not straightforward. Major cloud providers don’t address the topic of reliability in much depth to begin with.

It is difficult to know what techniques, if any, cloud operators deploy to safeguard customer applications against data corruption and application failures beyond basic error-correcting code memory, which can only handle random single-bit errors. Currently, there are no hyperscale cloud instances that offer enhanced RAS features comparable to mission-critical IT systems.

While IBM and Microsoft both offer migration paths directly for some mission-critical architectures, such as IBM Power and older s390x mainframes, their focus is on the modernization of legacy applications rather than maintaining reliability and availability levels that are comparable to on-premises systems. The reliability on offer is even less clear when it comes to more abstracted cloud services, such as software as a service and database as a service offerings or serverless computing.

Arguably, the future of reliability lies with software mechanisms. In particular, the software stack needs to adapt by getting rid of its dependency on hardware RAS features, whether this is achieved through verifying computations, memory coherency or the ability to remove and add hardware resources.
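
One crude way software can take on that role is to run a critical computation redundantly and compare or vote on the results, as in the sketch below. This is only a conceptual illustration — production systems use far more sophisticated verification schemes, and re-running on the same faulty hardware may reproduce the same error.

```python
def run_redundantly(fn, *args, copies=3):
    """Run a critical computation several times and majority-vote the result.

    A crude software stand-in for hardware error detection: if a transient
    fault corrupts one run, the disagreement is detected and out-voted.
    """
    results = [fn(*args) for _ in range(copies)]
    winner = max(set(results), key=results.count)
    if results.count(winner) < copies:
        # In a real system this would trigger logging, a retry or failover.
        print("warning: disagreement detected across redundant runs", results)
    return winner

print(run_redundantly(sum, [1, 2, 3, 4]))
```

The cost is obvious: every verified computation consumes a multiple of the compute and data movement, which is exactly the overhead discussed below.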

This puts the onus of RAS engineering almost solely on the cloud user. For new critical applications, purely software-based RAS by design is a must. However, the time and costs of refactoring or rearchitecting an existing mission-critical software stack to verify results and handle hardware-originating errors are not trivial — and are likely to be prohibitive in many cases, if it is possible at all.

Without the assistance of advanced RAS features in mission-critical IT systems, performance, particularly response times, will also likely take a hit if the same depth of reliability is required. At best, this means the need for more server resources to handle the same workload because the software mechanisms for extensive system reliability features will carry a substantial computational and data overhead.

These considerations should temper the pace at which mission-critical monolithic applications migrate to the cloud. Yet, these arguments are almost academic. The benefits of high reliability are difficult to quantify and compare (even more so than availability), in part because it is counterfactual — it is hard to measure what is being prevented.

Over time, cloud operators might invest more in generic infrastructure reliability and even offer products with enhanced RAS for legacy applications. But software-based RAS is the way forward in a world where hardware has become generic and abstracted.

Enterprise decision-makers should at least be mindful of the reliability (and availability) trade-offs involved with the migration of existing mission-critical applications to the cloud, and budget for investing in the necessary architectural and software changes if they expect the same level of service that an enterprise IT infrastructure can provide.