Outages: understanding the human factor

Analyzing human error — with a view to preventing it — has always been challenging for data center operators. The cause of a failure can lie in how well a process was taught, in how tired, well trained or well resourced the staff are, or in whether the equipment itself was unnecessarily difficult to operate.

There are also definitional questions: If a machine fails due to a software error at the factory, is that human error? In addition, human error can play a role in outages that are attributed to other causes. For these reasons, Uptime Institute tends to research human error differently from other causes — we view it as one causal factor in outages, and rarely a single or root cause. Using this methodology, Uptime estimates (based on 25 years of data) that human error plays some role in about two-thirds of all outages.

In Uptime’s recent surveys on resiliency, we have tried to understand the makeup of some of these human error-related failures. As the figure below shows, human error-related outages are most commonly caused either by staff failing to follow procedures (even where they have been agreed and codified) or because the procedures themselves are faulty.

Figure 1 Most common causes of major human error-related outages

This underscores a key tenet of Uptime: good training and processes both play a major role in outage reduction and can be implemented without great expense or capital expenditure.

Watch our 2022 Outage Report Webinar for more research on the causes, frequency and costs of digital infrastructure outages. The complete 2022 Annual Outage Analysis report is available exclusively to Uptime Institute members.

The weakest link dictates cloud outage compensation

Cloud providers offer services that are assembled by users into applications. An outage of any single cloud service can render an application unavailable. Importantly, cloud providers guarantee the availability of individual services, not of entire applications. Even if a whole application becomes unresponsive due to a provider outage, compensation is only due for the individual services that failed.

A small outage in a tiny part of an application may wreak havoc with the whole application (and even an entire business), but the provider will only award compensation for the weakest link that caused the outage.

A service level agreement (SLA) sets out the likely or promised availability of a specific cloud service, plus compensation due to the user if the provider fails to meet this availability. There is a different SLA for each cloud service.

Consider the cloud architecture in Figure 1, which shows traffic flows between virtual machines and a virtual load balancer. The load balancer distributes traffic across nine virtual machines. The virtual machines use the infrastructure as a service (IaaS) model, meaning the user is responsible for architecting virtual machines to be resilient. The load balancer, however, is a platform as a service (PaaS), which should mean the provider has architected it for resiliency. (More information on cloud models can be found in Uptime Institute’s recent report Cloud scalability and resiliency from first principles.)

Figure 1 Simple load-balanced cloud architecture

If the load balancer becomes unresponsive, the entire application is unusable as traffic can’t be routed to the virtual machines. The load balancer in this architecture is the weakest link.

The provider is responsible for the resiliency of the load balancer. Is the load balancer a single point of failure? It depends on the perception of the user. It may not be regarded as a single point of failure if the user is fully confident the provider has architected resiliency correctly through, for example, redundant servers or automatic failover. If the user is less convinced, they may consider the load balancer as a single point of failure because it is controlled by a single entity over which the user has limited visibility and control.

The virtual machines are still “available” because they are still in operation and can be accessed by the user for administration or maintenance — they just can’t be accessed by the end user via the application. The load balancer is the problem, not the virtual machines. In this scenario, no compensation is payable for the virtual machines even though the application has gone down.

To understand the financial impact, we’ll assume each virtual machine and the load balancer each cost $3 per month. As an example of a cloud SLA, Microsoft Azure offers compensation of 10% of the monthly cost of the load balancer if its monthly availability falls below 99.99% but remains at or above 99.9%. Similar terms also apply to virtual machines.

If the load balancer is down for 43 minutes, then Microsoft Azure is obliged to pay 10% of the monthly fee of the load balancer, so $0.30 in this case. It is not obliged to pay for the virtual machines as these continued in operation, despite the application becoming unresponsive. The total monthly cost of the application is $30, and the compensation is $0.30, which means the payment for the outage is 1% of the monthly fee paid to Microsoft Azure — a paltry sum compared with the likely impact of the outage. Table 1 provides an overview of the refunds due in this scenario.

Table 1 SLA refunds due in example scenario
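
To make the arithmetic concrete, the sketch below reproduces the credit calculation for this scenario in code. The tier boundaries and credit percentages are illustrative, loosely modeled on Azure-style SLA terms; real SLAs vary by service and change over time.

```python
# Minimal sketch of the SLA credit math in this example. The tiers are
# illustrative (loosely modeled on Azure-style SLAs); real SLAs differ by
# service and change over time.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

# (minimum availability %, credit as a fraction of that service's monthly fee)
CREDIT_TIERS = [
    (99.99, 0.00),  # SLA met: no credit
    (99.9, 0.10),   # below 99.99% but at least 99.9%: 10% credit
    (95.0, 0.25),   # below 99.9% but at least 95%: 25% credit
    (0.0, 1.00),    # below 95%: full refund of that service's fee
]

def sla_credit(monthly_fee: float, downtime_minutes: float) -> float:
    """Credit due for a single service, given its downtime in the month."""
    availability = 100.0 * (1 - downtime_minutes / MINUTES_PER_MONTH)
    for threshold, credit in CREDIT_TIERS:
        if availability >= threshold:
            return monthly_fee * credit
    return monthly_fee

# One load balancer down for 43 minutes; nine virtual machines with no downtime.
credit = sla_credit(3.00, 43) + 9 * sla_credit(3.00, 0)
total_fee = 10 * 3.00
print(f"Credit: ${credit:.2f} of a ${total_fee:.2f} monthly bill "
      f"({100 * credit / total_fee:.0f}%)")
# Credit: $0.30 of a $30.00 monthly bill (1%)
```

However the tiers are tuned, the structure is the same: compensation is computed per service against that service's own fee, never against the business impact of the application being down.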

This example demonstrates two points. First, cloud providers supply services that users assemble into applications. Their responsibility ends with those services. The analogy of cloud services as toy bricks may seem trite, but it conveys some fundamental aspects of the cloud model effectively. A toy company may guarantee the quality of its bricks, but it would not guarantee the quality of a model completed by an enthusiastic builder, no matter how well built.

In the architecture in Figure 1, for example, the user could have designed a more scalable application. They could have reduced the cost of virtual machines by implementing autoscaling, which would automatically terminate unutilized virtual machines (for example, while the load balancer was down). The user is ultimately responsible for building a resilient and scalable application, just as the quality of the toy model is the builder’s responsibility.

Second, this example also demonstrates that an SLA is not an insurance policy. It does not mitigate the business impact of downtime. In practice, compensation for cloud outages is likely to be less than originally assumed by users due to nuances in contract terms, which reduce the provider’s liability. Cloud customers must identify single points of failure in their applications and assess the risk and impact of an outage of these services. An outage of these services can render an application unavailable, but SLA compensation is unlikely to even cover the cost of the cloud services, let alone business losses. Ultimately, users are responsible for architecting greater resiliency in their applications to reduce the impact of outages.

The shock waves from Ukraine

How is the Ukraine conflict affecting digital infrastructure and the data sector? In response to multiple queries, Uptime Institute Intelligence has identified six main areas where operators and customers of digital infrastructure are experiencing effects from the conflict, or will do so soon, not just in Europe but globally. (Russia and Ukraine together host over 150 data centers, but these are not included in the analysis.)

Energy prices

Electricity prices had already risen dramatically in 2021, partly because of the post-COVID boom in economic activity and partly because of tight energy markets caused by a combination of weak renewable energy production and a Russian squeeze on gas supplies. Electricity can account for around a quarter to two-thirds of a data center’s operational costs. Prices are now near record highs, peaking at €400 per megawatt-hour in some European markets shortly after the invasion in February. Because electricity prices globally are linked to liquefied natural gas, forward prices across the world will remain high for at least the rest of 2022, especially in Europe.

To contain the effects of price volatility, most large data center operators attempt to reduce their exposure by buying in advance, usually in half-year periods. But some operators were caught out by the Russian invasion of Ukraine, having switched to the spot markets in anticipation of falling prices.

Exposure to high energy prices among data center operators varies according to business type and how well businesses have managed their risks. Colocation companies are the least exposed: they typically buy in advance and usually have the contractual right to pass power prices on to their customers. Some providers, however, offer forward pricing and price locks, and may be more exposed if they also offer commodity hosting and cloud-like services.

Enterprises cannot pass on energy costs, except by charging more for their own products and services. Many enterprise operators do not lock in their own electricity costs in advance, while their colocation partners usually do. Some enterprises are benefiting from the diligent purchasing of their colocation partners.

The biggest cloud data center operators can only pass on power price increases by raising cloud prices. Recently, they have been buying more energy on the spot market for their own facilities and haven’t locked in lower prices — exposing them to increases at scale. However, those who have invested in renewable power purchase agreements — which is common among large cloud providers — are benefiting because they have locked in their green power at lower prices for several years ahead.

Economic slowdown?

Prior to late February 2022, world economies were recovering from COVID, but the Ukraine crisis has created a darker and more complex picture. The United Nations Conference on Trade and Development (UNCTAD) cut its global growth forecast from 3.6% to 2.6% in March. In a recent report, it highlighted inflation, fuel and food prices, disruption of trade flows and supply chains, and rising debt levels as areas of concern.

Not all economies and industries will be affected equally. But while the COVID pandemic clearly boosted demand for data centers and digital services, a different kind of widespread slowdown may dampen demand for digital services and data centers, partly through a tightening of capital availability and partly due to runaway energy costs, which make computing more expensive.

Cybersecurity

Cyberattacks on business have increased steadily in number and severity since the early 1980s. Organizations in countries that are considered critical of Russia — as well as those in Russia itself — have been warned of a sharp increase in attacks.

To date, the most serious attacks have been directly linked to the conflict. For example, hundreds of thousands of people across eastern Europe suffered a loss of communications in the early days of the war, due to a cyberattack on a satellite network. There have also been deepfakes inserted into media services on both sides, and thousands of Russian businesses have suffered distributed denial-of-service attacks.

Security analysts are warning that a perceived fall in cyberattacks on the West from Russian hackers is likely to be temporary. Governments, data centers and IT services companies, energy companies, financial services and media are all regarded as critical infrastructure and therefore likely targets.

Sustainability paradox

In Europe, high energy prices in 2021 and concerns over climate change caused both political and economic difficulties. These were not sufficient to trigger major structural changes in policy or investment, but that may now change.

For the data center sector, two trends are unfolding with diametrically opposite effects on carbon emissions. First, to ensure energy availability, utilities in Europe — with the support of governments — are planning to use more high-carbon coal to make up for a reduced supply of Russian gas. This will at times increase the carbon content of the grid and could increase the number of renewable energy certificates or offsets that data center operators must buy to achieve zero-carbon emissions status.

Second, there is also a counter trend: high prices reduce demand. To date, there has not been a dramatic increase in coal use — although coal prices have risen steeply. In the longer term, energy security concerns will drive up utility and government investment in renewable energy and in nuclear. The UK government is considering a plan to invest up to $140 billion in nuclear energy.

In the data center sector, energy-saving projects may now be more financially viable. These range from upgrading inefficient cooling systems, recycling waste heat and raising operating temperatures to reducing unnecessary redundancy and addressing low IT utilization.

But operators be warned: a switch from fossil fuels to renewable energy, with grid or localized storage, means new dependencies and new global supply chains.

Separately, those European and US legislators who have campaigned for Bitcoin’s energy-intensive proof-of-work protocol to be outlawed will now find new political support. Calls are growing for the energy devoted to Bitcoin to be used more productively (see Are proof-of-work blockchains a corporate sustainability issue?).

Supply chain and component shortages

The digital infrastructure sector (like many other sectors) has been dogged by supply chain problems, shortages and price spikes since the early days of the COVID pandemic. Data center projects are regularly delayed because of equipment supply problems.

The Ukraine conflict is exacerbating this. Semiconductor and power electronics fabrication, which require various rare elements produced in Ukraine and Russia (e.g., neon, vanadium, palladium, nickel, titanium), may suffer shortages and delays if Russia drags out the war.

The delivery of machines that use microelectronic components (which means nearly all of them) could also suffer delays. At the height of the pandemic, cooling equipment, uninterruptible power supplies and switchgear were the most delayed items — and this is likely to be the case again.

Following the pandemic, most of the larger operators took steps to alleviate operational component shortages, such as building inventory and arranging second sources, which may cushion the effects this time.

Border zones

Even before the pandemic and the war in Ukraine, there were concerns about geopolitical instability and the vulnerabilities of globalization. These concerns have intensified. Several large European and US corporations, for example, outsource some IT functions to Ukraine and Russia, particularly software development. Thousands of Ukrainian developers were forced to flee the war, while Russian IT professionals have suffered from the loss of internet applications and cloud services — and the ability to be paid.

Organizations will react in different ways, such as by shortening and localizing supply chains and insourcing within national borders. Global cloud providers will likely need to introduce more regions to address country-specific requirements — as was seen in the lead-up to Brexit — and this is expected to require (yet) more data center capacity.


Uptime Institute will continue to monitor the Ukraine conflict and its impact on the global digital infrastructure sector.

Cloud generations drive down prices

Cloud providers need to deliver the newest capability to stay relevant. Few enterprises will accept working with outdated technology just because it’s consumable as a cloud service. However, existing cloud instances don’t migrate automatically. As with on-premises server infrastructure, users need to refresh their cloud services regularly.

Typically, cloud operators prefer product continuity between generations, often creating nearly identical instances. A virtual instance has a “family”, which dictates the physical server’s profile, such as more computing power or faster memory. A “size” dictates the amount of memory, virtual processors, disks and other attributes assigned to the virtual instance. The launch of a new generation usually consists of a range of virtual instances with similar definitions of family and size as the previous generation. The major difference is the underlying server hardware’s technology.
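
As an illustration of how family, generation and size are typically encoded, the sketch below parses an AWS-style instance type name into its parts. The naming convention assumed here (family letters, generation number, optional attribute suffix, then size) follows AWS's public scheme; other providers use different conventions.

```python
import re

# Hypothetical parser for AWS-style instance type names such as "m6i.xlarge".
# Pattern assumed here: family letter(s), generation number, optional
# attribute suffix (e.g., "i" Intel, "a" AMD, "g" Graviton), then ".size".
def parse_instance_type(name: str) -> dict:
    base, size = name.split(".", 1)
    match = re.fullmatch(r"([a-z]+)(\d+)([a-z-]*)", base)
    if not match:
        raise ValueError(f"unrecognized instance type: {name}")
    family, generation, attributes = match.groups()
    return {
        "family": family,            # e.g., "m" = general purpose
        "generation": int(generation),
        "attributes": attributes,    # processor/storage variant suffix
        "size": size,                # e.g., "xlarge" sets vCPU count and memory
    }

print(parse_instance_type("m5.large"))
print(parse_instance_type("m6i.xlarge"))
# A move from m5 to m6i keeps the family and size semantics but lands on a
# newer underlying server platform.
```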

A new generation doesn’t replace an older version. The older generation is still available to purchase. The user can migrate their workloads to the newer generation if they wish, but it is their responsibility to do so. By supporting older generations, the cloud provider is seen to be allowing the user to upgrade at their own pace. The provider doesn’t want to appear to be forcing the user into migrating applications that might not be compatible with the newer server platforms.

More generations create more complexity for users through greater choice and more virtual instance generations to manage. More recently, cloud operators have started to offer different processor architectures in the same generation. Users can now pick between Intel, Advanced Micro Devices (AMD) or, in Amazon Web Services’ (AWS’s) case, servers using ARM-based processors. The variety of cloud processor architectures is likely to expand over the coming years.

Cloud operators provide price incentives so that users gravitate towards newer generations (and between server architectures). Figure 1 shows lines of best fit for the average cost per virtual central processing unit (vCPU, essentially a physical processor thread as most processor cores run two threads simultaneously) of a range of AWS virtual instances over time. Data is obtained from AWS’s Price List API. For clarity, we only show pricing for AWS’s US-East-1 region, but the observations are similar across all regions. The analysis only considers x86 processors from AMD and Intel.
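
For readers who want to reproduce this kind of analysis, the sketch below shows one way to pull on-demand EC2 pricing through boto3 and compute an average price per vCPU by instance family and generation. The filter fields and attribute names reflect the publicly documented Price List API but should be treated as assumptions and verified against current responses.

```python
import json
import boto3

# Sketch: average on-demand $/vCPU-hour for Linux, shared-tenancy EC2
# instances in us-east-1, grouped by family/generation (e.g., m5, m6i).
# Filter fields and attribute names follow the public Price List API
# documentation and may need adjusting over time.
pricing = boto3.client("pricing", region_name="us-east-1")

pages = pricing.get_paginator("get_products").paginate(
    ServiceCode="AmazonEC2",
    Filters=[
        {"Type": "TERM_MATCH", "Field": "regionCode", "Value": "us-east-1"},
        {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
        {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
        {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
        {"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
    ],
)

per_vcpu: dict[str, list[float]] = {}
for page in pages:
    for item in page["PriceList"]:
        product = json.loads(item)
        attrs = product["product"]["attributes"]
        instance_type = attrs.get("instanceType", "")
        vcpu = attrs.get("vcpu", "")
        if "." not in instance_type or not vcpu.isdigit():
            continue
        if "graviton" in attrs.get("physicalProcessor", "").lower():
            continue  # match the article's x86-only (Intel/AMD) comparison
        base = instance_type.split(".")[0]  # e.g., "m5", "m6i"
        for term in product["terms"].get("OnDemand", {}).values():
            for dim in term["priceDimensions"].values():
                usd = float(dim["pricePerUnit"]["USD"])
                if usd > 0:
                    per_vcpu.setdefault(base, []).append(usd / int(vcpu))

for base, prices in sorted(per_vcpu.items()):
    print(f"{base}: {sum(prices) / len(prices):.4f} USD per vCPU-hour")
```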

The trend for most virtual instances is downward, with the average cost of the m family general-purpose virtual instances dropping 50% from its first generation to the present time. Each family has different configurations of memory, network and other attributes that aren’t accounted for in the price of an individual vCPU, which explains the price differences between families.

Figure 1 Average cost per AWS vCPU generation over time

One hidden factor is that compute power per vCPU also increases over generations — often incrementally. This is because more advanced manufacturing technologies tend to improve both clock speeds (frequency) and the ability of processor cores to execute code faster. Users can expect greater processing speed with newer generations compared with older versions, while paying less. The cost-efficiency gap is more substantial than the pricing alone suggests.

AWS (and other cloud operators) are reaping the economic benefits of Moore’s law — a steep downward trajectory in the cost of performance — and passing some of these savings on to customers. Giving customers lower prices works in AWS’s favor by incentivizing them to move to newer server platforms that are often more energy efficient and can carry more customer workloads — generating greater revenue and gross margin. However, how much of the cost saving AWS passes on to its customers, versus adding to its gross margin, remains hidden from view. In terms of demand, cloud customers prioritize cost over performance for most of their applications and, partly because of this price pressure, cloud virtual instances are coming down in price.

The trend of lower costs and higher clock speeds does not hold for one type of instance: graphics processing units (GPUs). GPU instances in the g and p families have higher prices per vCPU over time, while g instances also have a lower CPU clock speed. This is not directly comparable with the non-GPU instances, because GPUs are typically not broken down into standard units of capacity such as a vCPU. Instead, customers tend to have (and want) access to the full resources of a GPU instance for their accelerated applications. Here, the rapid growth in total performance, and the high value of the customer applications that use them (for example, training deep neural networks or massively parallel computational problems), have allowed cloud operators (and their chip suppliers, chiefly NVIDIA) to raise prices. In other words, customers are willing to pay more for newer GPU instances if they deliver value by solving complex problems faster.

On average, virtual instances (at AWS at least) are coming down in price with every new generation, while clock speed is increasing. However, users need to migrate their workloads from older generations to newer ones to take advantage of lower costs and better performance. Cloud users must keep track of new virtual instances and plan how and when to migrate. The migration of workloads from older to newer generations is a business risk that requires a balanced approach. There may be unexpected issues of interoperability or downtime while the migration takes place — maintaining an ability to revert to the original configuration is key. Just as users plan server refreshes, they need to make virtual instance refreshes part of their ongoing maintenance.
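
For EBS-backed virtual machines that are not tied to instance-store data or special drivers, the mechanics of such a generation change can be as simple as stop, resize, restart. The boto3 sketch below illustrates that flow; the instance ID and target type are placeholders, and the same steps run with the original type provide the rollback path mentioned above.

```python
import boto3

# Illustrative flow for moving an EBS-backed instance to a newer generation.
# The instance ID and target type are placeholders; instances with
# instance-store volumes, licensing ties or missing drivers need more care.
ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # placeholder
new_type = "m6i.large"               # placeholder newer-generation type

# 1. Stop the instance (the type of a running instance cannot be changed).
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# 2. Change the instance type attribute.
ec2.modify_instance_attribute(
    InstanceId=instance_id,
    InstanceType={"Value": new_type},
)

# 3. Start it again and wait until it reports as running.
ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
print(f"{instance_id} is now running as {new_type}")

# Rollback is the same three steps with the original instance type, which is
# why recording that type is part of the migration plan.
```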

Cloud providers will continue to automate, negotiate and innovate to drive costs lower across their entire operations, of which processors constitute a small but vital part. They will continue to offer new generations, families and sizes so buyers have access to the latest technology at a competitive price. The likelihood is that new generations will continue the trend of being cheaper than the last — by just enough to attract increasing numbers of applications to the cloud, while maintaining (or even improving) the operator’s future gross margins.

Industry consensus on sustainability looks fragile

Pressed by a sense of urgency among scientists and the wider public, and by governments and investors who must fulfil promises made at COP (Conference of the Parties) summits, major businesses are facing ever more stringent sustainability reporting requirements. Big energy users, such as data centers, are in the firing line.

Many of the reporting requirements, and proposed methods of reducing carbon emissions, are proving to be complicated and may appear contradictory and counterproductive. Many managers will be bewildered and frustrated.

To date, most of the commitments on climate change made by the digital infrastructure sector have been voluntary. This has allowed a certain laxity in the definitions, targets and terminology used — and in the level of scrutiny applied. But these are all set to be tested: reporting requirements will increasingly become mandatory, either by law or because of commercial pressures. Failure to publish data or meet targets will carry penalties or have other negative consequences.

The European Union (EU) is the flag bearer in what is likely to be a wave of legislation spreading around the world. Its much-strengthened Energy Efficiency Directive, part of its “Fit for 55” initiative (a legislative package to help meet the target of a 55% reduction in carbon emissions by 2030), is but one example. This legislation will require much more granular and open reporting, with even smaller data centers (around 300–400 kilowatts total load) likely to face public audits for energy efficiency.

For operators in each part of the critical digital infrastructure sector, there may be some difficult decisions and trade-offs to make. Cloud companies, enterprises and colocation companies all want to halt climate change, but each has its own perspective and interests to protect.

Cloud suppliers and some of the bigger colocation providers, for example, are lobbying against some of the EU’s proposed reporting rules. Most of these organizations are already highly energy efficient and, by using matching and offsets, claim a very high degree of renewable use. Almost all also publish power usage effectiveness (PUE) data and some produce high-level carbon calculators for clients. Significant, step-change improvements would be complex and costly. Additionally, they argue, a bigger part of the sector’s energy waste takes place in smaller data centers, which may not have to fully report their energy use or carbon emissions — and may not be audited.

Colocation companies have a particular conundrum. Their energy consumption is high profile and huge — and clients now expect their colocation companies to use electricity from low-carbon or renewable sources. But this requires the purchase of ever more expensive RECs (renewable energy certificates), also known as Guarantees of Origin, and / or expensive, risky PPAs (power purchase agreements).

Purchasing carbon offsets or sourcing renewable power alone, however, is not likely to be enough in the years ahead. Regulators and investors will want to see annual improvements in energy efficiency or in reductions in energy use and carbon emissions.

For a colocation provider, achieving significant energy efficiency gains every year may not be possible. More than 70% of their energy use is tied to (and controlled by) their IT customers — many of whom are also pushing for more resiliency, which usually uses more energy. This can also apply to bare metal cloud customers.

In most data centers, the IT systems consume the most power and are operated wastefully. To encourage more energy efficiency at colocation sites, it makes sense for enterprises to take direct, Scope 2 responsibility for the carbon associated with the purchased electricity powering their systems. At present, most enterprises in a colocation site categorize the carbon associated with their IT as embedded Scope 3, which has weaker oversight and is not usually covered by expensive carbon offsets.

While many (including Uptime Institute) advocate that IT owners and operators take Scope 2 responsibility, it is clearly problematic. The owners and operators of the IT would have to be accountable for the carbon emissions resulting from the energy purchases made by their colocation or cloud companies — something many will not yet be ready to do. And, if they are responsible for the carbon emissions, they may have to also take on more responsibility for the expensive RECs and PPAs. This may be onerous – although the change might, at least, encourage IT owners to take on the considerable task of improving IT efficiency.

IT energy waste is a challenge for most in the digital critical infrastructure sector. After a decade of trying, the industry has yet to settle on metrics for measuring IT efficiency, although there are good measurements available for utilization and server efficiency (see Figure 1). In 2022, this challenge will rise up the agenda as stakeholders once again seek to define and apply the elusive metric of “useful work per watt” of IT. There won’t be any early resolution, though: these metrics are specific to each application, limiting their usefulness to regulators or overseers — and executives may fear the results will be alarmingly revealing.

Figure 1 Power consumption and PUE top sustainability metrics

The full report, Five data center predictions for 2022, is available to Uptime Institute members.

Why cloud is a kludge of complexity

The cloud model was designed to be simple and nimble. Simple and nimble doesn’t necessarily mean fit for purpose. Over the past decade, new layers of capability have been added to cloud to address its shortcomings. While this has created more options and greater functionality, it has also meant greater complexity in its management.

Today, it is possible to create a virtual server on a public cloud and deploy an IT application within minutes. This simplicity is a significant value driver of cloud uptake. But building applications that are resilient, performant and compliant requires far greater consideration by cloud users.

Public cloud providers make few guarantees regarding the performance and resiliency of their services. They state that users should design their applications across availability zones, which are networked groups of data centers within a region, so they are resilient to an outage in a single zone. The onus is on the cloud user to build an IT application that works across multiple availability zones. This can be a complex task, especially for existing applications that were not designed for multi-availability zone cloud architecture. In other words, availability zones were introduced to make cloud more resilient, but they are just an enabler — the user must architect their use.
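
As a simple illustration of what architecting across zones involves, the sketch below spreads a small fleet of virtual machines round-robin across the availability zones of one AWS region using boto3. The AMI ID is a placeholder, and a real design would also distribute load balancing, data replication and failover logic across zones, not just the instances.

```python
import boto3

# Sketch: spread a small fleet of virtual machines round-robin across the
# availability zones of one region, so a single-zone outage affects only a
# fraction of the fleet. The AMI ID below is a placeholder.
REGION = "us-east-1"
AMI_ID = "ami-0123456789abcdef0"  # placeholder machine image
FLEET_SIZE = 6

ec2 = boto3.client("ec2", region_name=REGION)

zones = [
    zone["ZoneName"]
    for zone in ec2.describe_availability_zones(
        Filters=[{"Name": "state", "Values": ["available"]}]
    )["AvailabilityZones"]
]

for i in range(FLEET_SIZE):
    zone = zones[i % len(zones)]  # round-robin placement across zones
    ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    print(f"instance {i + 1} of {FLEET_SIZE} requested in {zone}")
```

In production, the same spreading logic usually sits behind an autoscaling group and a zone-redundant load balancer rather than ad hoc instance launches.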

One of the original core tenets of cloud was the centralization of computing for convenience and outsourced management. The reality is many cloud buyers aren’t comfortable giving full control of all their workloads to a third party. Many are also bound by regulations requiring them to keep data in certain jurisdictions or under their own management. Private clouds were created to provide governance and control where public cloud failed, albeit with less scalability and flexibility than a public cloud.

Hybrid cloud makes public cloud more scalable by making it more distributed, which also means it is more flexible in terms of compliance and control. But this means cloud buyers must wrestle with designing and managing IT applications to work across different venues, where each venue has different capabilities and characteristics.

Public cloud providers now offer appliances or software that provide the same services found on public cloud but located in an on-premises environment. These appliances and software are designed to work “out of the box” with the public cloud, thereby allowing hybrid cloud to be implemented quicker than through a bespoke design. The hardware manufacturers, seeing the cloud providers enter their traditional territory of the on-premises data center, have responded with pay-as-you-go cloud servers that are billed according to usage.

Cloud management platforms provide a common interface to manage hybrid cloud, another consideration for cloud buyers. To manage applications effectively across venues, new application architectures are required. Software containers (an abstraction of code from operating systems) provide the basis of microservices, where applications are broken down into small, independent pieces of code that can scale independently — across venues if needed.

Applications that can scale effectively on the cloud are referred to as “cloud native.” Containers, microservices and cloud-native architectures were all introduced to make cloud scale effectively, but they all introduce new complexity. The Cloud Native Computing Foundation (CNCF) tracks over 1,200 projects, products and companies associated with cloud-native practices. The CNCF aims to reduce technical complexity in cloud-native practices, but these practices are all nascent and there is no clear standard approach to implementing cloud-native concepts.

To the uninitiated, cloud might appear a simple and nimble means to access capacity and cloud-enabling technologies (such as cloud software tools, libraries of application programming interfaces for integrations, etc.). This can still be the case for simple use cases, such as non-mission critical websites. However, users face complex and often onerous requirements for many of their workloads to run in a public cloud according to their business needs (such as resiliency and cost). The original cloud promised much, but the additional capabilities that have made cloud arguably more scalable and resilient have come at the cost of simplicity.

Today, there is no standard architecture for a particular application, no “best” approach or “right” combination of tools, venues, providers or services. Cloud users face a wall of options to consider. Amazon Web Services, the largest cloud provider, alone has over 200 products, with over five million variations. Most cloud deployments today are kludged — improvised or put together from an ill-assorted collection of parts: different venues, different management interfaces and different frameworks, working together as best they can. Functional, but not integrated.

The big threat of complexity is that more things can go wrong. When they do, the cause can be challenging to trace. The cloud sector has exploded with new capabilities to address mission-critical requirements — but choosing and assembling these capabilities to satisfactorily support a mission-critical application is a work in progress.