A service level agreement (SLA) is a contract between a cloud provider and a user. The SLA describes the provider’s minimum level of service, specified by performance metrics, and the compensation due to the user should the provider fail to deliver this service.
In practice, however, this compensation is punitive. It seems designed to punish the provider for its failure, not to compensate the user for their loss.
A key metric used in cloud SLAs is availability. Cloud availability is generally expressed as the percentage of a defined period (usually a month) during which a resource has external connectivity. If a user can’t connect over the internet to the resource due to a cloud provider issue, the resource is deemed unavailable or ‘down’.
For example, Amazon Web Services (AWS) offers a 99.5% SLA on a virtual machine. There are 730 hours on average in a month, so the virtual machine should be available (or ‘up’) for around 726.5 hours of those 730 hours. Compensation is due if the virtual machine fails to meet this minimum availability during a month.
Alternatively, we could say AWS is not liable to pay compensation unless the virtual machine experiences total downtime of more than 0.5% of a month, or 3.5 hours on average (given that the length of months differ). Google Cloud, Microsoft Azure and other cloud providers offer similar terms. Table 1 shows common availability SLA performance metrics.
If the virtual machine is distributed across two availability zones (logical data centers, see Cloud scalability and resiliency from first principles), the typical SLA for availability increases to 99.99% (equivalent to an average of around 4 minutes of downtime per month).
The business impact of downtime to users depends on the situation. If short-lived outages occur overnight, the downtime may have little effect on revenue or productivity. However, a long outage in the middle of a major retail event, for example, is likely to hit income and reputation hard. In the 2021 Uptime Institute data center survey, the average cost of respondents’ most significant recent downtime incident was $973,000. This average does not include the 2% of respondents who estimate they lost more than $40M for their most recent worst downtime incident.
SLA compensation doesn’t even scratch the surface of these losses. If a single virtual machine goes down for less than 7 hours, 18 minutes (99% monthly availability), AWS will pay 10% of the monthly cost of that virtual machine. Considering the price of a small instance (a ‘t4g.nano’) in the large US-East-1 region (in Northern Virginia, US) is around $3 per month, total compensation for this outage would be 30 cents.
If a virtual machine goes down for less than 36 hours (95% availability in a month), the compensation is just 30% — just under a dollar. The user only receives a full refund for the month if the resource is down for more than one day, 12 hours and 31 minutes in total.
When a failure occurs, the user is responsible for measuring downtime and requesting compensation – this is not provided automatically. Users usually need to raise a report request with service logs to show proof of the outage. If the cloud provider approves the request, compensation is offered in service credits, not cash. Together, these approaches mean that users must detect an outage and apply for a service credit, which can only be redeemed through the offending cloud provider, and which is unlikely to cover the cost of the outage. Is the benefit of the compensation worth the effort?
Cloud providers are upfront about these compensation terms — they can be found easily on their websites. They are not being unreasonable by limiting their liabilities, but cloud users need to be realistic about the value of SLAs.
An SLA incentivizes the cloud provider to meet performance requirements with a threat of punitive losses. An SLA is also a marketing tool designed to convince the buyer that the cloud provider can be trusted to host business-critical applications.
Customers should not consider SLAs (or 99.9x% availability figures) as reliable predictors of future availability — or even a record of past levels of availability. The frequency of cloud outages (and their duration) suggests that actual performance of many service providers falls short of published SLAs. (For a more detailed discussion on outages, see our Annual outages analysis 2021: The causes and impacts of data center outages.)
In short, users shouldn’t put too much faith in SLAs. They are likely to indicate a cloud provider’s minimum standard of service: few cloud providers would guarantee compensation on metrics it didn’t feel it could realistically achieve, even if the compensation is minimal. But an SLA is not an insurance policy; it does not mitigate the business impact of downtime.