Blog Multi Author - Uptime Institute Blog

Is concern over cloud and third-party outages increasing?

June 8, 2022/in Executive, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute

As a result of some high-profile outages and the growing interest in running critical services in a public cloud, the reliability — and transparency — of public cloud services has come under scrutiny.

Cloud services are designed to operate with low failure rates. Large (at-scale) cloud and IT service providers, such as Amazon Web Services, Microsoft Azure and Google Cloud, incorporate layers of software and middleware, balance capacity across systems, networks and data centers, and reroute workloads and traffic away from failures and troubled sites.

On the whole, these architectures provide high levels of service availability at scale. Despite this, no architecture is fail-safe and many failures have been recorded that can be attributed to the difficulties of managing such complex, at-scale software, data and networks.

Our recent resiliency survey of data center and IT managers shows that enterprise managers are reasonably concerned about the resiliency of public cloud services (see Figure 1). Only one in seven (14%) respondents say public cloud services are resilient enough to run all their workloads. The same proportion say the cloud is not resilient enough to run any of their workloads; and 32% say the cloud is only resilient enough to run some of their workloads. The increase in the number of “don’t know” responses since our 2021 survey also shows that confidence in the resiliency of the cloud has become shrouded in some uncertainty and skepticism.

**Figure 1 Most say cloud is only resilient enough for some workloads**

Concerns over cloud resiliency may be partly due to several recent outages being attributed to third-party service providers. In our resiliency survey, 39% of organizations suffered an outage that was caused by a problem with a third-party supplier.

As more workloads are outsourced to external providers, these operators account for a growing share of high-profile, public outages. Third-party, commercial operators of IT and / or data centers (cloud, hosting, colocation, digital services, telecommunications, etc.) combined accounted for almost 63% of all public outages since 2016, when Uptime Institute started tracking them. This percentage has crept up year by year: in 2021 the combined proportion of outages caused by these commercial operators was 70%.

When third-party IT and data center service providers do have an outage, customers are affected immediately. These customers may seek compensation — and a full explanation. Many regulators and enterprises now want increased visibility, accountability and improved service level agreements — especially for cloud providers.

Watch our 2022 Outage Report Webinar for more research on the causes, frequency and costs of digital infrastructure outages. The complete 2022 Annual Outage Analysis report is available exclusively to Uptime Institute members.

Outages: understanding the human factor

June 8, 2022/in Executive, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute

Analyzing human error — with a view to preventing it — has always been challenging for data center operators. The cause of a failure can lie in how well a process was taught, how tired, well trained or resourced the staff are, or whether the equipment itself was unnecessarily difficult to operate.

There are also definitional questions: If a machine fails due to a software error at the factory, is that human error? In addition, human error can play a role in outages that are attributed to other causes. For these reasons, Uptime Institute tends to research human error in a different way to other causes — we view it as one causal factor in outages, and rarely a single or root cause. Using this methodology, Uptime estimates (based on 25 years of data) that human error plays some role in about two-thirds of all outages.

In Uptime’s recent surveys on resiliency, we have tried to understand the makeup of some of these human error-related failures. As the figure below shows, human error-related outages are most commonly caused either by staff failing to follow procedures (even where they have been agreed and codified) or because the procedures themselves are faulty.

**Figure 1 Most common causes of major human error-related outages**

This underscores a key tenet of Uptime: good training and processes both play a major role in outage reduction and can be implemented without great expense or capital expenditure.

The weakest link dictates cloud outage compensation

June 1, 2022/in Executive, Operations/by Dr. Owen Rogers, Research Director for Cloud Computing, Uptime Institute

Cloud providers offer services that are assembled by users into applications. An outage of any single cloud service can render an application unavailable. Importantly, cloud providers guarantee the availability of individual services, not of entire applications. Even if a whole application becomes unresponsive due to a provider outage, compensation is only due for the individual services that failed.

A small outage in a tiny part of an application may wreak havoc on the whole application (and even an entire business), but the provider will only award compensation for the weakest link that caused the outage.

A service level agreement (SLA) sets out the likely or promised availability of a specific cloud service, plus compensation due to the user if the provider fails to meet this availability. There is a different SLA for each cloud service.

Consider the cloud architecture in Figure 1, which shows traffic flows between virtual machines and a virtual load balancer. The load balancer distributes traffic across nine virtual machines. The virtual machines use the infrastructure as a service (IaaS) model, meaning the user is responsible for architecting virtual machines to be resilient. The load balancer, however, is a platform as a service (PaaS), which should mean the provider has architected it for resiliency. (More information on cloud models can be found in Uptime Institute’s recent report Cloud scalability and resiliency from first principles.)

**Figure 1 Simple load-balanced cloud architecture**

If the load balancer becomes unresponsive, the entire application is unusable as traffic can’t be routed to the virtual machines. The load balancer in this architecture is the weakest link.
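The weakest-link effect can be sketched with basic series and parallel availability arithmetic. This is only an illustration: the availability figures are hypothetical (not taken from any provider's SLA), failures are assumed to be independent, and the application is assumed to work whenever the load balancer and at least one virtual machine are up.

```python
def parallel_availability(unit_availability: float, count: int) -> float:
    """Availability of `count` redundant units where any single one suffices."""
    return 1.0 - (1.0 - unit_availability) ** count


def series_availability(*availabilities: float) -> float:
    """Availability of components that must all be up at the same time."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical figures chosen for illustration only.
LB_AVAILABILITY = 0.9999
VM_AVAILABILITY = 0.999
VM_COUNT = 9  # the nine virtual machines behind the load balancer

vm_pool = parallel_availability(VM_AVAILABILITY, VM_COUNT)
application = series_availability(LB_AVAILABILITY, vm_pool)

print(f"VM pool availability:     {vm_pool:.10f}")
print(f"Application availability: {application:.10f}")
```

Under these assumptions the pool of virtual machines is almost never entirely down, so the application's availability is effectively capped by the load balancer alone.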

The provider is responsible for the resiliency of the load balancer. Is the load balancer a single point of failure? It depends on the perception of the user. It may not be regarded as a single point of failure if the user is fully confident the provider has architected resiliency correctly through, for example, redundant servers or automatic failover. If the user is less convinced, they may consider the load balancer as a single point of failure because it is controlled by a single entity over which the user has limited visibility and control.

The virtual machines are still “available” because they are still in operation and can be accessed by the user for administration or maintenance — they just can’t be accessed by the end user via the application. The load balancer is the problem, not the virtual machines. In this scenario, no compensation is payable for the virtual machines even though the application has gone down.

To understand the financial impact, we’ll assume each virtual machine and load balancer costs $3 per month. As an example of a cloud SLA, Microsoft Azure offers compensation of 10% of the monthly cost of the load balancer if its availability in a month falls below 99.99% but not below 99.9%. Similar terms also apply to virtual machines.

If the load balancer is down for 43 minutes, then Microsoft Azure is obliged to pay 10% of the monthly fee of the load balancer, so $0.30 in this case. It is not obliged to pay for the virtual machines as these continued in operation, despite the application becoming unresponsive. The total monthly cost of the application is $30, and the compensation is $0.30, which means the payment for the outage is 1% of the monthly fee paid to Microsoft Azure — a paltry sum compared with the likely impact of the outage. Table 1 provides an overview of the refunds due in this scenario.

**Table 1 SLA refunds due in example scenario**
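The arithmetic in this scenario can be reproduced with a short sketch. It assumes a 30-day month and models only the single credit tier described above (10% when availability is below 99.99% but not below 99.9%). Any other tier is deliberately left unmodeled, and the function and constant names are illustrative.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def availability(downtime_minutes: float) -> float:
    """Fraction of the month a service was available."""
    return 1.0 - downtime_minutes / MINUTES_PER_MONTH


def credit_rate(avail: float) -> float:
    """Credit as a fraction of the service's monthly fee.

    Only the tier quoted in the article is modeled; lower tiers raise an error.
    """
    if avail >= 0.9999:
        return 0.0
    if avail >= 0.999:
        return 0.10
    raise ValueError("availability below 99.9%: tier not covered in this sketch")


UNIT_PRICE = 3.00   # monthly price of each VM and of the load balancer
VM_COUNT = 9

lb_availability = availability(downtime_minutes=43)
refund = UNIT_PRICE * credit_rate(lb_availability)  # only the load balancer failed
total_bill = UNIT_PRICE * (VM_COUNT + 1)
refund_share = refund / total_bill

print(f"Load balancer availability: {lb_availability:.4%}")
print(f"Refund: ${refund:.2f} of a ${total_bill:.2f} bill ({refund_share:.1%})")
```

A 43-minute outage leaves the load balancer just above 99.9% availability for the month, so it lands in the 10% tier, and the virtual machines contribute nothing to the refund.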

This example demonstrates two points. First, cloud providers provide services that users use to build applications. Their responsibility ends with these services. The analogy of cloud services as toy bricks may seem trite, but it conveys effectively some fundamental aspects of the cloud model. A toy company may guarantee the quality of its bricks, but it would not guarantee the quality of a model completed by an enthusiastic builder, no matter how well built.

In the architecture in Figure 1, for example, the user could have designed a more scalable application. They could have reduced the cost of virtual machines by implementing autoscaling, which would automatically terminate the unutilized virtual machines when not in use (such as when the load balancer went down). The user is ultimately responsible for building a resilient and scalable application, just as the quality of the toy model is the builder’s responsibility.

Second, this example also demonstrates that an SLA is not an insurance policy. It does not mitigate the business impact of downtime. In practice, compensation for cloud outages is likely to be less than originally assumed by users due to nuances in contract terms, which reduce the provider’s liability. Cloud customers must identify single points of failure in their applications and assess the risk and impact of an outage of these services. An outage of these services can render an application unavailable, but SLA compensation is unlikely to even cover the cost of the cloud services, let alone business losses. Ultimately, users are responsible for architecting greater resiliency in their applications to reduce the impact of outages.

The shock waves from Ukraine

May 18, 2022/in Executive, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute

How is the Ukraine conflict affecting digital infrastructure and the data sector? In response to multiple queries, Uptime Institute Intelligence identified six main areas where operators and customers of digital infrastructure are experiencing effects from the conflict or will do so soon, not just in Europe but globally. (In Russia and Ukraine there are over 150 data centers, but these are not included in the analysis.)

Energy prices

In 2021, electricity prices had already risen dramatically, partly because of the post-COVID boom in economic activity, and partly because of tight energy markets caused by a combination of weak renewable energy production and a Russian squeeze on gas supplies. Electricity can account for around a quarter to two-thirds of the operational costs of a data center. Prices are now near record highs, peaking at 400 euros per megawatt-hour in some European markets shortly after the invasion in February. Because electricity prices globally are linked to liquefied natural gas, forward prices across the world will remain high for at least the rest of 2022, especially in Europe.

To contain the effects of price volatility, most large data center operators attempt to reduce their exposure by buying in advance, usually in half-year periods. But some operators were caught out by the Russian invasion of Ukraine, having switched to the spot markets in anticipation of falling prices.

Exposure to high energy prices among data center operators varies according to business type and how well businesses have managed their risks. Colocation companies are the least exposed, because they buy in advance and it is usual to have the contractual right to pass on power prices to their customers. Some providers, however, offer forward pricing and price locks, and may be more exposed if they offer commodity hosting and cloud-like services too.

Enterprises cannot pass on energy costs, except by charging more for their own products and services. Many enterprise operators do not lock in their own electricity costs in advance, while their colocation partners usually do. Some enterprises are benefiting from the diligent purchasing of their colocation partners.

The biggest cloud data center operators can only pass on power price increases by raising cloud prices. Recently, they have been buying more energy on the spot market for their own facilities and haven’t locked in lower prices — exposing them to increases at scale. However, those who have invested in renewable power purchase agreements — which is common among large cloud providers — are benefiting because they have locked in their green power at lower prices for several years ahead.

Economic slowdown?

Prior to late February 2022, world economies were recovering from COVID, but the Ukraine crisis has created a darker and more complex picture. The United Nations Conference on Trade and Development (UNCTAD) cut its global growth forecast from 3.6% to 2.6% in March. In a recent report, it highlighted inflation, fuel and food prices, disruption of trade flows and supply chains, and rising debt levels as areas of concern.

Not all economies and industries will be affected equally. But while the COVID pandemic clearly boosted demand for data centers and digital services, the effects of another, different widespread slowdown may dampen demand for digital services and data centers, partly through a tightening of capital availability, and partly also due to runaway energy costs which make computing more expensive.

Cybersecurity

Cyberattacks on business have increased steadily in number and severity since the early 1980s. Organizations in countries that are considered critical of Russia — as well as those in Russia itself — have been warned of a sharp increase in attacks.

To date, the most serious attacks have been directly linked to the conflict. For example, hundreds of thousands of people across eastern Europe suffered a loss of communications due to a cyber assault on a satellite in the early days of the war. There have also been deepfakes inserted in media services on both sides, and thousands of Russian businesses have suffered distributed denial-of-service attacks.

Security analysts are warning that a perceived fall in cyberattacks on the West from Russian hackers is likely to be temporary. Governments, data centers and IT services companies, energy companies, financial services and media are all regarded as critical infrastructure and therefore likely targets.

Sustainability paradox

In Europe, high energy prices in 2021 and concerns over climate change caused both political and economic difficulties. This was not sufficient to trigger major structural changes in policy or investment, but this may now change.

For the data center sector, two trends are unfolding with diametrically opposite effects on carbon emissions. First, to ensure energy availability, utilities in Europe — with the support of governments — are planning to use more high-carbon coal to make up for a reduced supply of Russian gas. This will at times increase the carbon content of the grid and could increase the number of renewable energy certificates or offsets that data center operators must buy to achieve zero-carbon emissions status.

Second, there is also a counter trend: high prices reduce demand. To date, there has not been a dramatic increase in coal use — although coal prices have risen steeply. In the longer term, energy security concerns will drive up utility and government investment in renewable energy and in nuclear. The UK government is considering a plan to invest up to $140 billion in nuclear energy.

In the data center sector, energy-saving projects may now be more financially viable. These include upgrading inefficient cooling systems, recycling heat, raising temperatures, reducing unnecessary redundancy and addressing low IT utilization.

But operators be warned: a switch from fossil fuels to renewable energy, with grid or localized storage, means new dependencies and new global supply chains.

Separately, those European and US legislators who have campaigned for Bitcoin’s energy-intensive proof-of-work protocol to be outlawed will now find new political support. Calls are growing for energy devoted to Bitcoin to be used more productively (see Are proof-of-work blockchains a corporate sustainability issue?).

Supply chain and component shortages

The digital infrastructure sector (like many other sectors) has been dogged by supply chain problems, shortages and price spikes since the early days of the COVID pandemic. Data center projects are regularly delayed because of equipment supply problems.

The Ukraine conflict is exacerbating this. Semiconductor and power electronics fabrication, which require various rare elements produced in Ukraine and Russia (e.g., neon, vanadium, palladium, nickel, titanium), may suffer shortages and delays if Russia drags out the war.

The delivery of machines that use microelectronic components (and that means nearly all machines) could also suffer delays. At the height of the pandemic, cooling equipment, uninterruptible power supplies and switchgear equipment were the most delayed items, and this is likely to be the case again.

Following the pandemic, most of the larger operators took steps to alleviate the effects of operational component shortages, such as building inventory and arranging for second sources, which may cushion the effect of these shortages.

Border zones

Even before the pandemic and the war in Ukraine, there were concerns with geopolitical instability and the vulnerabilities of globalization. These concerns have intensified. Several large European and US corporations, for example, outsource some IT functions to Ukraine and Russia, particularly software development. Thousands of Ukrainian developers were forced to flee from the war, while Russian IT professionals have suffered from the loss of internet applications and cloud services — and the ability to be paid.

Organizations will react in different ways, such as shorter, localized supply chains and insourcing within national borders. Global cloud providers will likely need to introduce more regions to address country-specific requirements — as was seen in the lead up to Brexit — and this is expected to require (yet) more data center capacity.

Uptime Institute will continue to monitor the Ukraine conflict and its impact on the global digital infrastructure sector.

Cloud generations drive down prices

May 3, 2022/in Executive, Operations/by Dr. Owen Rogers, Research Director for Cloud Computing, Uptime Institute

Cloud providers need to deliver the newest capability to stay relevant. Few enterprises will accept working with outdated technology just because it’s consumable as a cloud service. However, existing cloud instances don’t migrate automatically. As with on-premises server infrastructure, users need to refresh their cloud services regularly.

Typically, cloud operators prefer product continuity between generations, often creating nearly identical instances. A virtual instance has a “family”, which dictates the physical server’s profile, such as more computing power or faster memory. A “size” dictates the amount of memory, virtual processors, disks and other attributes assigned to the virtual instance. The launch of a new generation usually consists of a range of virtual instances with similar definitions of family and size as the previous generation. The major difference is the underlying server hardware’s technology.

A new generation doesn’t replace an older version. The older generation is still available to purchase. The user can migrate their workloads to the newer generation if they wish, but it is their responsibility to do so. By supporting older generations, the cloud provider is seen to be allowing the user to upgrade at their own pace. The provider doesn’t want to appear to be forcing the user into migrating applications that might not be compatible with the newer server platforms.

More generations create more complexity for users through greater choice and different virtual instance generations to manage. More recently, cloud operators have started to offer different processor architectures in the same generation. Users can now pick between Intel, Advanced Micro Devices (AMD) or, in Amazon Web Services’ (AWS’s) case, servers using ARM-based processors. The variety of cloud processor architectures is likely to expand over the coming years.

Cloud operators provide price incentives so that users gravitate towards newer generations (and between server architectures). Figure 1 shows lines of best fit for the average cost per virtual central processing unit (vCPU, essentially a physical processor thread as most processor cores run two threads simultaneously) of a range of AWS virtual instances over time. Data is obtained from AWS’s Price List API. For clarity, we only show pricing for AWS’s US-East-1 region, but the observations are similar across all regions. The analysis only considers x86 processors from AMD and Intel.

The trend for most virtual instances is downward, with the average cost of the m family general-purpose virtual instances dropping 50% from its first generation to the present time. Each family has different configurations of memory, network and other attributes that aren’t accounted for in the price of an individual vCPU, which explains the price differences between families.

**Figure 1 Average cost per AWS vCPU generation over time**
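The kind of analysis behind Figure 1 can be sketched as follows: divide each instance's hourly price by its vCPU count, then fit a least-squares line through cost per vCPU against launch year. The sample records below are invented for a hypothetical instance family and are not real AWS prices; in practice the inputs would come from a pricing source such as the Price List API mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    generation: int
    launch_year: int
    hourly_price_usd: float
    vcpus: int


def cost_per_vcpu(instance: Instance) -> float:
    """Hourly price divided by the number of vCPUs."""
    return instance.hourly_price_usd / instance.vcpus


def best_fit(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented sample data for a hypothetical family (not real AWS pricing).
SAMPLE = [
    Instance(generation=1, launch_year=2010, hourly_price_usd=0.40, vcpus=4),
    Instance(generation=2, launch_year=2013, hourly_price_usd=0.32, vcpus=4),
    Instance(generation=3, launch_year=2016, hourly_price_usd=0.24, vcpus=4),
    Instance(generation=4, launch_year=2019, hourly_price_usd=0.20, vcpus=4),
]

years = [float(i.launch_year) for i in SAMPLE]
costs = [cost_per_vcpu(i) for i in SAMPLE]
slope, intercept = best_fit(years, costs)

print(f"Trend: {slope:+.4f} USD per vCPU-hour per year")
```

A negative slope corresponds to the downward trend described for most instance families. This sketch does not adjust for the memory, network or other attributes that differ between families.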

One hidden factor is that compute power per vCPU also increases over generations, often incrementally. This is because more advanced manufacturing technologies tend to help with both clock speeds (frequency) and the “smartness” of processor cores in executing code faster. Users can expect greater processing speed with newer generations compared with older versions while paying less. The cost efficiency gap is more substantial than simple pricing suggests.

AWS (and other cloud operators) are reaping the economic benefits of Moore’s law in a steep downward trajectory for cost of performance and passing some of this saving onto customers. Giving customers lower prices works in AWS’s favor by incentivizing customers to move to newer server platforms that are often more energy efficient and can carry more customer workloads — generating greater revenue and gross margin. However, how much of the cost savings AWS is passing on to its customers versus adding to its gross margin remains hidden from view. In terms of demand, cloud customers prioritize cost over performance for most of their applications and, partly because of this price pressure, cloud virtual instances are coming down in price.

The trend of lower costs and higher clock speed fails for one type of instance: graphics processing units (GPUs). GPU instances of families g and p have higher prices per vCPU over time, while g instances also have a lower CPU clock speed. This is not comparable with the non-GPU instances because GPUs are typically not broken down into standard units of capacity, such as a vCPU. Instead, customers tend to have (and want) access to the full resources of a GPU instance for their accelerated applications. Here, the rapid growth in total performance and the high value of the customer applications (for example, training of deep neural networks or massively parallel large computational problems) that use them allowed cloud operators (and their chip suppliers, chiefly NVIDIA) to raise prices. In other words, customers are willing to pay more for newer GPU instances if they deliver value in being able to solve complex problems quicker.

On average, virtual instances (at AWS at least) are coming down in price with every new generation, while clock speed is increasing. However, users need to migrate their workloads from older generations to newer ones to take advantage of lower costs and better performance. Cloud users must keep track of new virtual instances and plan how and when to migrate. The migration of workloads from older to newer generations is a business risk that requires a balanced approach. There may be unexpected issues of interoperability or downtime while the migration takes place — maintaining an ability to revert to the original configuration is key. Just as users plan server refreshes, they need to make virtual instance refreshes part of their ongoing maintenance.

Cloud providers will continue to automate, negotiate and innovate to drive costs lower across their entire operations, of which processors constitute a small but vital part. They will continue to offer new generations, families and sizes so buyers have access to the latest technology at a competitive price. The likelihood is that new generations will continue the trend of being cheaper than the last — by just enough to attract increasing numbers of applications to the cloud, while maintaining (or even improving) the operator’s future gross margins.

Industry consensus on sustainability looks fragile

April 19, 2022/in Executive, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute

Pressed by a sense of urgency among scientists and the wider public, and by governments and investors who must fulfil promises made at COP (Conference of the Parties) summits, major businesses are facing ever more stringent sustainability reporting requirements. Big energy users, such as data centers, are in the firing line.

Many of the reporting requirements, and proposed methods of reducing carbon emissions, are proving to be complicated and may appear contradictory and counterproductive. Many managers will be bewildered and frustrated.

To date, most of the commitments on climate change made by the digital infrastructure sector have been voluntary. This has allowed a certain laxity in the definitions, targets and terminology used — and in the level of scrutiny applied. But these are all set to be tested: reporting requirements will increasingly become mandatory, either by law or because of commercial pressures. Failure to publish data or meet targets will carry penalties or have other negative consequences.

The European Union (EU) is the flag bearer in what is likely to be a wave of legislation spreading around the world. Its much-strengthened Energy Efficiency Directive, part of its “Fit for 55” initiative (a legislative package to help meet the target of a 55% reduction in carbon emissions by 2030), is but one example. This legislation will require much more granular and open reporting, with even smaller-sized data centers (around 300–400 kilowatt total load) likely to face public audits for energy efficiency.

For operators in each part of the critical digital infrastructure sector, there may be some difficult decisions and trade-offs to make. Cloud companies, enterprises and colocation companies all want to halt climate change, but each has its own perspective and interests to protect.

Cloud suppliers and some of the bigger colocation providers, for example, are lobbying against some of the EU’s proposed reporting rules. Most of these organizations are already highly energy efficient and, by using matching and offsets, claim a very high degree of renewable use. Almost all also publish power usage effectiveness (PUE) data and some produce high-level carbon calculators for clients. Significant, step-change improvements would be complex and costly. Additionally, they argue, a bigger part of the sector’s energy waste takes place in smaller data centers, which may not have to fully report their energy use or carbon emissions — and may not be audited.

Colocation companies have a particular conundrum. Their energy consumption is high profile and huge — and clients now expect their colocation companies to use electricity from low-carbon or renewable sources. But this requires the purchase of ever more expensive RECs (renewable energy certificates), also known as Guarantees of Origin, and / or expensive, risky PPAs (power purchase agreements).

Purchasing carbon offsets or sourcing renewable power alone, however, is not likely to be enough in the years ahead. Regulators and investors will want to see annual improvements in energy efficiency or in reductions in energy use and carbon emissions.

For a colocation provider, achieving significant energy efficiency gains every year may not be possible. More than 70% of their energy use is tied to (and controlled by) their IT customers — many of whom are also pushing for more resiliency, which usually uses more energy. This can also apply to bare metal cloud customers.

In most data centers, the IT systems consume the most power and are operated wastefully. To encourage more energy efficiency at colocation sites, it makes sense for enterprises to take direct, Scope 2 responsibility for the carbon associated with the purchased electricity powering their systems. At present, most enterprises in a colocation site categorize the carbon associated with their IT as embedded Scope 3, which has weaker oversight and is not usually covered by expensive carbon offsets.

While many (including Uptime Institute) advocate that IT owners and operators take Scope 2 responsibility, it is clearly problematic. The owners and operators of the IT would have to be accountable for the carbon emissions resulting from the energy purchases made by their colocation or cloud companies, something many will not yet be ready to do. And, if they are responsible for the carbon emissions, they may have to also take on more responsibility for the expensive RECs and PPAs. This may be onerous, although the change might, at least, encourage IT owners to take on the considerable task of improving IT efficiency.

IT energy waste is a challenge for most in the digital critical infrastructure sector. After a decade of trying, the industry has yet to settle on metrics for measuring IT efficiency, although there are good measurements available for utilization and server efficiency (see Figure 1). In 2022, this challenge will rise up the agenda as stakeholders once again seek to define and apply the elusive metric of “useful work per watt” of IT. There won’t be any early resolution, though: these metrics are specific to each application, limiting their usefulness to regulators or overseers — and executives may fear the results will be alarmingly revealing.

**Figure 1 Power consumption and PUE top sustainability metrics**

The full report Five data center predictions for 2022 is available to Uptime Institute members here.
