Cloud SLAs punish, not compensate
A service level agreement (SLA) is a contract between a cloud provider and a user. The SLA describes the provider’s minimum level of service, specified by performance metrics, and the compensation due to the user should the provider fail to deliver this service.
In practice, however, this compensation is punitive. It seems designed to punish the provider for its failure, not to compensate the user for their loss.
A key metric used in cloud SLAs is availability. Cloud availability is generally expressed as the percentage of a defined period (usually a month) during which a resource has external connectivity. If a user can’t connect over the internet to the resource due to a cloud provider issue, the resource is deemed unavailable or ‘down’.
For example, Amazon Web Services (AWS) offers a 99.5% SLA on a virtual machine. There are 730 hours on average in a month, so the virtual machine should be available (or ‘up’) for around 726.4 of those 730 hours. Compensation is due if the virtual machine fails to meet this minimum availability during a month.
Put another way, AWS is not liable to pay compensation unless the virtual machine experiences total downtime of more than 0.5% of a month, or around 3 hours, 39 minutes on average (month lengths differ). Google Cloud, Microsoft Azure and other cloud providers offer similar terms. Table 1 shows common availability SLA performance metrics.
Table 1. Common availability SLA performance metrics
If the virtual machine is distributed across two availability zones (logical data centers, see Cloud scalability and resiliency from first principles), the typical SLA for availability increases to 99.99% (equivalent to an average of around 4 minutes of downtime per month).
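The arithmetic behind these thresholds is simple. The sketch below converts an availability percentage into a monthly downtime allowance, assuming the 730-hour average month used above; the SLA levels chosen are illustrative, not taken from any particular contract.

```python
# Convert an availability SLA percentage into a monthly downtime allowance,
# assuming an average month of 730 hours (illustrative SLA levels only).
HOURS_PER_MONTH = 730

def downtime_allowance_minutes(availability_pct: float) -> float:
    """Minutes of downtime permitted per month before the SLA is breached."""
    return HOURS_PER_MONTH * 60 * (1 - availability_pct / 100)

for sla in (99.5, 99.9, 99.95, 99.99):
    minutes = downtime_allowance_minutes(sla)
    print(f"{sla}% availability -> {minutes / 60:.2f} hours ({minutes:.1f} minutes) per month")
# 99.5% allows ~3.65 hours of downtime per month; 99.99% allows ~4.4 minutes.
```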
The business impact of downtime to users depends on the situation. If short-lived outages occur overnight, the downtime may have little effect on revenue or productivity. However, a long outage in the middle of a major retail event, for example, is likely to hit income and reputation hard. In the 2021 Uptime Institute data center survey, the average cost of respondents’ most significant recent downtime incident was $973,000. This average does not include the 2% of respondents who estimate they lost more than $40M for their most recent worst downtime incident.
SLA compensation doesn’t even scratch the surface of these losses. If a single virtual machine is down for between roughly 3 hours, 39 minutes and 7 hours, 18 minutes in a month (monthly availability between 99% and 99.5%), AWS will pay 10% of the monthly cost of that virtual machine. Considering the price of a small instance (a ‘t4g.nano’) in the large US-East-1 region (in Northern Virginia, US) is around $3 per month, total compensation for this outage would be 30 cents.
If a virtual machine is down for between 7 hours, 18 minutes and 36 hours, 31 minutes (availability between 95% and 99%), the compensation is just 30% — just under a dollar. The user only receives a full refund for the month if the resource is down for more than one day, 12 hours and 31 minutes in total.
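To see how little such a claim is worth in practice, the sketch below applies the credit tiers described above (no credit at or above 99.5% availability, then 10%, 30% and a full refund) to the $3-per-month instance; the tier boundaries are those quoted in this article, not a restatement of AWS’s contract terms.

```python
# Illustrative service-credit calculation for a single virtual machine,
# using the credit tiers quoted in the text (not a restatement of AWS's SLA).
HOURS_PER_MONTH = 730

def monthly_availability(downtime_hours: float) -> float:
    return 100 * (1 - downtime_hours / HOURS_PER_MONTH)

def credit_percentage(availability_pct: float) -> int:
    if availability_pct >= 99.5:
        return 0     # SLA met: no credit
    if availability_pct >= 99.0:
        return 10
    if availability_pct >= 95.0:
        return 30
    return 100       # full refund of the month's fee

monthly_fee = 3.00   # approximate monthly cost of a small instance, as above
for downtime_hours in (2, 7, 30, 40):
    avail = monthly_availability(downtime_hours)
    credit = monthly_fee * credit_percentage(avail) / 100
    print(f"{downtime_hours} h down -> {avail:.2f}% available -> credit ${credit:.2f}")
# Seven hours of downtime earns a credit of just $0.30 on a $3 instance.
```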
When a failure occurs, the user is responsible for measuring downtime and requesting compensation – this is not provided automatically. Users usually need to raise a support request, supplying service logs as proof of the outage. If the cloud provider approves the request, compensation is offered in service credits, not cash. Together, these terms mean that users must detect an outage and apply for a service credit, which can only be redeemed through the offending cloud provider, and which is unlikely to cover the cost of the outage. Is the benefit of the compensation worth the effort?
Cloud providers are upfront about these compensation terms — they can be found easily on their websites. They are not being unreasonable by limiting their liabilities, but cloud users need to be realistic about the value of SLAs.
An SLA incentivizes the cloud provider to meet performance requirements with a threat of punitive losses. An SLA is also a marketing tool designed to convince the buyer that the cloud provider can be trusted to host business-critical applications.
Customers should not consider SLAs (or 99.9x% availability figures) as reliable predictors of future availability — or even a record of past levels of availability. The frequency of cloud outages (and their duration) suggests that actual performance of many service providers falls short of published SLAs. (For a more detailed discussion on outages, see our Annual outages analysis 2021: The causes and impacts of data center outages.)
In short, users shouldn’t put too much faith in SLAs. They are likely to indicate a cloud provider’s minimum standard of service: few cloud providers would guarantee compensation on metrics they didn’t feel they could realistically achieve, even if the compensation is minimal. But an SLA is not an insurance policy; it does not mitigate the business impact of downtime.
Is concern over cloud and third-party outages increasing?
As a result of some high-profile outages and the growing interest in running critical services in a public cloud, the reliability — and transparency — of public cloud services has come under scrutiny.
Cloud services are designed to operate with low failure rates. Large (at-scale) cloud and IT service providers, such as Amazon Web Services, Microsoft Azure and Google Cloud, incorporate layers of software and middleware, balance capacity across systems, networks and data centers, and reroute workloads and traffic away from failures and troubled sites.
On the whole, these architectures provide high levels of service availability at scale. Despite this, no architecture is fail-safe and many failures have been recorded that can be attributed to the difficulties of managing such complex, at-scale software, data and networks.
Our recent resiliency survey of data center and IT managers shows that enterprise managers are reasonably concerned about the resiliency of public cloud services (see Figure 1). Only one in seven (14%) respondents say public cloud services are resilient enough to run all their workloads. The same proportion say the cloud is not resilient enough to run any of their workloads; and 32% say the cloud is only resilient enough to run some of their workloads. The increase in the number of “don’t know” responses since our 2021 survey also shows that confidence in the resiliency of the cloud has become shrouded in some uncertainty and skepticism.
Concerns over cloud resiliency may be partly due to several recent outages being attributed to third-party service providers. In our resiliency survey, 39% of organizations suffered an outage that was caused by a problem with a third-party supplier.
As more workloads are outsourced to external providers, these operators account for a growing share of high-profile, public outages. Third-party, commercial operators of IT and/or data centers (cloud, hosting, colocation, digital services, telecommunications, etc.) have together accounted for almost 63% of all public outages since 2016, when Uptime Institute started tracking them. This share has crept up year by year: in 2021, these commercial operators caused 70% of public outages.
When third-party IT and data center service providers do have an outage, customers are affected immediately. These customers may seek compensation — and a full explanation. Many regulators and enterprises now want increased visibility, accountability and improved service level agreements — especially for cloud providers.
Watch our 2022 Outage Report Webinar for more research on the causes, frequency and costs of digital infrastructure outages. The complete 2022 Annual Outage Analysis report is available exclusively to Uptime Institute members.
Outages: understanding the human factor
Analyzing human error — with a view to preventing it — has always been challenging for data center operators. The cause of a failure can lie in how well a process was taught, how tired, well trained or resourced the staff are, or whether the equipment itself was unnecessarily difficult to operate.
There are also definitional questions: If a machine fails due to a software error at the factory, is that human error? In addition, human error can play a role in outages that are attributed to other causes. For these reasons, Uptime Institute tends to research human error in a different way to other causes — we view it as one causal factor in outages, and rarely a single or root cause. Using this methodology, Uptime estimates (based on 25 years of data) that human error plays some role in about two-thirds of all outages.
In Uptime’s recent surveys on resiliency, we have tried to understand the makeup of some of these human error-related failures. As the figure below shows, human error-related outages are most commonly caused either by staff failing to follow procedures (even where they have been agreed and codified) or because the procedures themselves are faulty.
This underscores a key tenet of Uptime: good training and processes both play a major role in outage reduction and can be implemented without great expense or capital expenditure.
Watch our 2022 Outage Report Webinar for more research on the causes, frequency and costs of digital infrastructure outages. The complete 2022 Annual Outage Analysis report is available exclusively to Uptime Institute members.
The weakest link dictates cloud outage compensation
Cloud providers offer services that are assembled by users into applications. An outage of any single cloud service can render an application unavailable. Importantly, cloud providers guarantee the availability of individual services, not of entire applications. Even if a whole application becomes unresponsive due to a provider outage, compensation is only due for the individual services that failed.
An outage in a tiny part of an application may wreak havoc with the whole application (and even an entire business), but the provider will only award compensation for the weakest link that caused the outage.
A service level agreement (SLA) sets out the likely or promised availability of a specific cloud service, plus compensation due to the user if the provider fails to meet this availability. There is a different SLA for each cloud service.
Consider the cloud architecture in Figure 1, which shows traffic flows between virtual machines and a virtual load balancer. The load balancer distributes traffic across nine virtual machines. The virtual machines use the infrastructure as a service (IaaS) model, meaning the user is responsible for architecting virtual machines to be resilient. The load balancer, however, is a platform as a service (PaaS), which should mean the provider has architected it for resiliency. (More information on cloud models can be found in Uptime Institute’s recent report Cloud scalability and resiliency from first principles.)
If the load balancer becomes unresponsive, the entire application is unusable as traffic can’t be routed to the virtual machines. The load balancer in this architecture is the weakest link.
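A rough availability model makes the point: because the virtual machine pool is redundant, its combined unavailability is negligible, and the single load balancer sets the ceiling for the whole application. The figures below are assumed purely for illustration.

```python
# End-to-end availability of the Figure 1 architecture, using assumed
# (illustrative) availability figures for each component.
lb_availability = 0.9999   # single PaaS load balancer (assumed)
vm_availability = 0.995    # each of the nine IaaS virtual machines (assumed)
vm_count = 9

# The VM pool serves traffic as long as at least one VM is up (capacity aside),
# but all traffic must still pass through the single load balancer.
vm_pool_availability = 1 - (1 - vm_availability) ** vm_count
application_availability = lb_availability * vm_pool_availability

print(f"VM pool availability:     {vm_pool_availability:.12f}")
print(f"Application availability: {application_availability:.6f}")
# The pool's unavailability is negligible, so the application's availability
# is effectively capped by the load balancer: the weakest link.
```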
The provider is responsible for the resiliency of the load balancer. Is the load balancer a single point of failure? It depends on the perception of the user. It may not be regarded as a single point of failure if the user is fully confident the provider has architected resiliency correctly through, for example, redundant servers or automatic failover. If the user is less convinced, they may consider the load balancer as a single point of failure because it is controlled by a single entity over which the user has limited visibility and control.
The virtual machines are still “available” because they are still in operation and can be accessed by the user for administration or maintenance — they just can’t be accessed by the end user via the application. The load balancer is the problem, not the virtual machines. In this scenario, no compensation is payable for the virtual machines even though the application has gone down.
To understand the financial impact, we’ll assume each virtual machine and the load balancer costs $3 per month. As an example of a cloud SLA, Microsoft Azure offers compensation of 10% of the monthly cost of the load balancer if its monthly availability falls below 99.99% but remains at or above 99.9%. Similar terms also apply to virtual machines.
If the load balancer is down for 43 minutes, then Microsoft Azure is obliged to pay 10% of the monthly fee of the load balancer, so $0.30 in this case. It is not obliged to pay for the virtual machines as these continued in operation, despite the application becoming unresponsive. The total monthly cost of the application is $30, and the compensation is $0.30, which means the payment for the outage is 1% of the monthly fee paid to Microsoft Azure — a paltry sum compared with the likely impact of the outage. Table 1 provides an overview of the refunds due in this scenario.
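The refund arithmetic for this scenario can be sketched as follows, using the assumptions stated above: ten services at $3 per month each, a 43-minute load balancer outage, and a 10% credit when availability falls below 99.99% but stays at or above 99.9%.

```python
# Illustrative refund calculation for the Figure 1 outage scenario
# (assumptions as stated in the text, not Microsoft Azure's actual terms).
HOURS_PER_MONTH = 730
service_cost = 3.00        # monthly cost per service (assumed)
service_count = 10         # nine virtual machines plus one load balancer
lb_downtime_minutes = 43

lb_availability = 100 * (1 - lb_downtime_minutes / (HOURS_PER_MONTH * 60))

# Only the load balancer breached its SLA; the virtual machines stayed "available".
credit = service_cost * 0.10 if 99.9 <= lb_availability < 99.99 else 0.0
monthly_bill = service_cost * service_count

print(f"Load balancer availability: {lb_availability:.3f}%")
print(f"Service credit:             ${credit:.2f}")
print(f"Share of the monthly bill:  {100 * credit / monthly_bill:.1f}%")
# Roughly 99.902% availability gives a $0.30 credit on a $30 bill: 1% of spend.
```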
This example demonstrates two points. First, cloud providers supply services that users assemble into applications, and the provider’s responsibility ends with those services. The analogy of cloud services as toy bricks may seem trite, but it effectively conveys some fundamental aspects of the cloud model. A toy company may guarantee the quality of its bricks, but it would not guarantee the quality of a model completed by an enthusiastic builder, no matter how well built.
In the architecture in Figure 1, for example, the user could have designed a more scalable application. They could have reduced the cost of the virtual machines by implementing autoscaling, which would automatically terminate idle virtual machines (such as when the load balancer went down). The user is ultimately responsible for building a resilient and scalable application, just as the quality of the toy model is the builder’s responsibility.
Second, this example also demonstrates that an SLA is not an insurance policy. It does not mitigate the business impact of downtime. In practice, compensation for cloud outages is likely to be less than originally assumed by users due to nuances in contract terms, which reduce the provider’s liability. Cloud customers must identify single points of failure in their applications and assess the risk and impact of an outage of those services. Such an outage can render an application unavailable, but SLA compensation is unlikely even to cover the cost of the cloud services, let alone business losses. Ultimately, users are responsible for architecting greater resiliency in their applications to reduce the impact of outages.
The shock waves from Ukraine
How is the Ukraine conflict affecting digital infrastructure and the data sector? In response to multiple queries, Uptime Institute Intelligence identified six main areas where operators and customers of digital infrastructure are experiencing effects from the conflict, or will do so soon, not just in Europe but globally. (There are over 150 data centers in Russia and Ukraine, but these are not included in this analysis.)
Energy prices
Electricity prices had already risen dramatically in 2021, partly because of the post-COVID boom in economic activity, and partly because of tight energy markets caused by a combination of weak renewable energy production and a Russian squeeze on gas supplies. Electricity prices can account for around a quarter to two-thirds of the operational costs of a data center. Prices are now near record highs, peaking at €400 per megawatt-hour in some European markets shortly after the invasion in February. Because electricity prices globally are linked to liquefied natural gas, forward prices across the world will remain high for at least the rest of 2022, especially in Europe.
To contain the effects of price volatility, most large data center operators attempt to reduce their exposure by buying in advance, usually in half-year periods. But some operators were caught out by the Russian invasion of Ukraine, having switched to the spot markets in anticipation of falling prices.
Exposure to high energy prices among data center operators varies according to business type and how well businesses have managed their risks. Colocation companies are the least exposed, because they buy in advance and it is usual to have the contractual right to pass on power prices to their customers. Some providers, however, offer forward pricing and price locks, and may be more exposed if they offer commodity hosting and cloud-like services too.
Enterprises cannot pass on energy costs, except by charging more for their own products and services. Many enterprise operators do not lock in their own electricity costs in advance, while their colocation partners usually do. Some enterprises are benefiting from the diligent purchasing of their colocation partners.
The biggest cloud data center operators can only pass on power price increases by raising cloud prices. Recently, they have been buying more energy on the spot market for their own facilities and haven’t locked in lower prices — exposing them to increases at scale. However, those that have invested in renewable power purchase agreements — as is common among large cloud providers — are benefiting because they have locked in their green power at lower prices for several years ahead.
Economic slowdown?
Prior to late February 2022, world economies were recovering from COVID, but the Ukraine crisis has created a darker and more complex picture. The United Nations Conference on Trade and Development (UNCTAD) cut its global growth forecast from 3.6% to 2.6% in March. In a recent report, it highlighted inflation, fuel and food prices, disruption of trade flows and supply chains, and rising debt levels as areas of concern.
Not all economies and industries will be affected equally. But while the COVID pandemic clearly boosted demand for data centers and digital services, the effects of another, different widespread slowdown may dampen demand for digital services and data centers, partly through a tightening of capital availability, and partly also due to runaway energy costs which make computing more expensive.
Cybersecurity
Cyberattacks on business have increased steadily in number and severity since the early 1980s. Organizations in countries that are considered critical of Russia — as well as those in Russia itself — have been warned of a sharp increase in attacks.
To date, the most serious attacks have been directly linked to the conflict. For example, hundreds of thousands of people across eastern Europe suffered a loss of communications in the early days of the war, due to a cyberattack on a satellite communications network. There have also been deepfakes inserted into media services on both sides; and thousands of Russian businesses have suffered distributed denial-of-service attacks.
Security analysts are warning that a perceived fall in cyberattacks on the West from Russian hackers is likely to be temporary. Governments, data centers and IT services companies, energy companies, financial services and media are all regarded as critical infrastructure and therefore likely targets.
Sustainability paradox
In Europe, high energy prices in 2021 and concerns over climate change caused both political and economic difficulties. This was not sufficient to trigger major structural changes in policy or investment, but this may now change.
For the data center sector, two trends are unfolding with diametrically opposite effects on carbon emissions. First, to ensure energy availability, utilities in Europe — with the support of governments — are planning to use more high-carbon coal to make up for a reduced supply of Russian gas. This will at times increase the carbon content of the grid and could increase the number of renewable energy certificates or offsets that data center operators must buy to achieve zero-carbon emissions status.
Second, there is also a counter trend: high prices reduce demand. To date, there has not been a dramatic increase in coal use — although coal prices have risen steeply. In the longer term, energy security concerns will drive up utility and government investment in renewable energy and in nuclear. The UK government is considering a plan to invest up to $140 billion in nuclear energy.
In the data center sector, energy-saving projects may now be more financially viable. These range from upgrading inefficient cooling systems and recycling heat to raising temperatures, reducing unnecessary redundancy and addressing low IT utilization.
But operators be warned: a switch from fossil fuels to renewable energy, with grid or localized storage, means new dependencies and new global supply chains.
Separately, those European and US legislators who have campaigned for Bitcoin’s energy intensive proof-of-work protocol to be outlawed will now find new political support. Calls are growing for energy devoted to Bitcoin to be used more productively (see Are proof-of-work blockchains a corporate sustainability issue?)
Supply chain and component shortages
The digital infrastructure sector (like many other sectors) has been dogged by supply chain problems, shortages and price spikes since the early days of the COVID pandemic. Data center projects are regularly delayed because of equipment supply problems.
The Ukraine conflict is exacerbating this. Semiconductor and power electronics fabrication, which requires various rare elements produced in Ukraine and Russia (e.g., neon, vanadium, palladium, nickel, titanium), may suffer shortages and delays if Russia drags out the war.
The delivery of machines that use microelectronic components (which means nearly all of them) could also suffer delays. At the height of the pandemic, cooling equipment, uninterruptible power supplies and switchgear were the most delayed items — and this is likely to be the case again.
Following the pandemic, most of the larger operators took steps to alleviate component shortages, such as building inventory and arranging for second sources, which may cushion the effect this time.
Border zones
Even before the pandemic and the war in Ukraine, there were concerns about geopolitical instability and the vulnerabilities of globalization. These concerns have intensified. Several large European and US corporations, for example, outsource some IT functions to Ukraine and Russia, particularly software development. Thousands of Ukrainian developers were forced to flee the war, while Russian IT professionals have suffered from the loss of internet applications and cloud services — and the ability to be paid.
Organizations will react in different ways, such as by shortening and localizing supply chains and insourcing within national borders. Global cloud providers will likely need to introduce more regions to address country-specific requirements — as was seen in the lead-up to Brexit — and this is expected to require (yet) more data center capacity.
Uptime Institute will continue to monitor the Ukraine conflict and its impact on the global digital infrastructure sector.
Cloud generations drive down prices
Cloud providers need to deliver the newest capability to stay relevant. Few enterprises will accept working with outdated technology just because it’s consumable as a cloud service. However, existing cloud instances don’t migrate to newer generations automatically. As with on-premises server infrastructure, users need to refresh their cloud services regularly.
Typically, cloud operators prefer product continuity between generations, often creating nearly identical instances. A virtual instance has a “family”, which dictates the physical server’s profile, such as more computing power or faster memory. A “size” dictates the amount of memory, virtual processors, disks and other attributes assigned to the virtual instance. The launch of a new generation usually consists of a range of virtual instances with similar definitions of family and size as the previous generation. The major difference is the underlying server hardware’s technology.
A new generation doesn’t replace an older version. The older generation is still available to purchase. The user can migrate their workloads to the newer generation if they wish, but it is their responsibility to do so. By supporting older generations, the cloud provider is seen to be allowing the user to upgrade at their own pace. The provider doesn’t want to appear to be forcing the user into migrating applications that might not be compatible with the newer server platforms.
More generations create more complexity for users: greater choice, and more virtual instance generations to manage. More recently, cloud operators have started to offer different processor architectures in the same generation. Users can now pick between Intel, Advanced Micro Devices (AMD) or, in Amazon Web Services’ (AWS’s) case, servers using ARM-based processors. The variety of cloud processor architectures is likely to expand over the coming years.
Cloud operators provide price incentives so that users gravitate towards newer generations (and between server architectures). Figure 1 shows lines of best fit for the average cost per virtual central processing unit (vCPU, essentially a physical processor thread, as most processor cores run two threads simultaneously) of a range of AWS virtual instances over time. Data is obtained from AWS’s Price List API. For clarity, we only show pricing for AWS’s US-East-1 region, but the observations are similar across all regions. The analysis only considers x86 processors from AMD and Intel.
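As an illustration of the kind of trend-line fit behind Figure 1, the sketch below fits a line of best fit to a made-up series of per-generation vCPU prices; the years and prices are placeholders, not AWS list prices.

```python
# Fit a line of best fit to average cost per vCPU across instance generations.
# The years and prices below are placeholders for illustration, not AWS list prices.
import numpy as np

generation_year = np.array([2010, 2013, 2015, 2017, 2019, 2021])
usd_per_vcpu_hour = np.array([0.060, 0.052, 0.047, 0.043, 0.040, 0.038])

slope, intercept = np.polyfit(generation_year, usd_per_vcpu_hour, 1)
decline = 1 - usd_per_vcpu_hour[-1] / usd_per_vcpu_hour[0]

print(f"Fitted trend: {slope:+.4f} $/vCPU-hour per year")
print(f"Decline from first to latest generation: {100 * decline:.0f}%")
# With these placeholder figures the per-vCPU price falls by more than a third,
# the same downward shape described for the real AWS pricing data.
```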
The trend for most virtual instances is downward, with the average cost of the m family of general-purpose virtual instances dropping 50% from the first generation to the present. Each family has different configurations of memory, network and other attributes that aren’t accounted for in the price of an individual vCPU, which explains the price differences between families.
One hidden factor is that compute power per vCPU also increases over generations — often incrementally. This is because more advanced manufacturing technologies tend to improve both clock speeds (frequency) and the ability of processor cores to execute code faster. Users can expect greater processing speed with newer generations compared with older versions while paying less. The cost-efficiency gap is therefore more substantial than the pricing alone suggests.
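A hypothetical generation-to-generation comparison shows why the cost-efficiency gain exceeds the headline price drop: if the price per vCPU falls while per-vCPU performance rises, work done per dollar improves by more than either figure alone. The numbers below are assumptions for illustration only.

```python
# Illustrative performance-per-dollar comparison between two instance
# generations (hypothetical figures, not measured AWS data).
old_price_per_vcpu_hour = 0.050   # $/vCPU-hour, previous generation (assumed)
new_price_per_vcpu_hour = 0.040   # $/vCPU-hour, newer generation (assumed)
performance_uplift = 1.15         # newer vCPU does ~15% more work per hour (assumed)

old_perf_per_dollar = 1.0 / old_price_per_vcpu_hour
new_perf_per_dollar = performance_uplift / new_price_per_vcpu_hour
gain = new_perf_per_dollar / old_perf_per_dollar - 1

print(f"Headline price drop:    {100 * (1 - new_price_per_vcpu_hour / old_price_per_vcpu_hour):.0f}%")
print(f"Performance per dollar: +{100 * gain:.0f}%")
# A 20% price cut combined with a 15% speed-up yields about 44% more work per dollar.
```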
AWS (and other cloud operators) are reaping the economic benefits of Moore’s law, a steep downward trajectory in the cost of performance, and passing some of this saving on to customers. Lower prices work in AWS’s favor by incentivizing customers to move to newer server platforms that are often more energy efficient and can carry more customer workloads — generating greater revenue and gross margin. However, how much of the cost savings AWS is passing on to its customers versus adding to its gross margin remains hidden from view. In terms of demand, cloud customers prioritize cost over performance for most of their applications and, partly because of this price pressure, cloud virtual instances are coming down in price.
The trend of lower costs and higher clock speed fails for one type of instance: graphics processing units (GPUs). GPU instances of families g and p have higher prices per vCPU over time, while g instances also have a lower CPU clock speed. This is not comparable with the non-GPU instances because GPUs are typically not broken down into standard units of capacity, such as a vCPU. Instead, customers tend to have (and want) access to the full resources of a GPU instance for their accelerated applications. Here, the rapid growth in total performance and the high value of the customer applications (for example, training of deep neural networks or massively parallel large computational problems) that use them allowed cloud operators (and their chip suppliers, chiefly NVIDIA) to raise prices. In other words, customers are willing to pay more for newer GPU instances if they deliver value in being able to solve complex problems quicker.
On average, virtual instances (at AWS at least) are coming down in price with every new generation, while clock speed is increasing. However, users need to migrate their workloads from older generations to newer ones to take advantage of lower costs and better performance. Cloud users must keep track of new virtual instances and plan how and when to migrate. The migration of workloads from older to newer generations is a business risk that requires a balanced approach. There may be unexpected issues of interoperability or downtime while the migration takes place — maintaining an ability to revert to the original configuration is key. Just as users plan server refreshes, they need to make virtual instance refreshes part of their ongoing maintenance.
Cloud providers will continue to automate, negotiate and innovate to drive costs lower across their entire operations, of which processors constitute a small but vital part. They will continue to offer new generations, families and sizes so buyers have access to the latest technology at a competitive price. The likelihood is that new generations will continue the trend of being cheaper than the last — by just enough to attract increasing numbers of applications to the cloud, while maintaining (or even improving) the operator’s future gross margins.