Making sense of the outage numbers

In recent years, Uptime Institute has published regular reports examining both the rate and causes of data center and IT service outages. The reports, which have been widely read and reported in the media, paint a picture of an industry that is struggling with resiliency and reliability — and one where operators regularly suffer downtime, disruption, and reputational and financial damage.

Is this a fair picture? Rather like the authors of a scientific paper whose findings from a small experiment are hailed as a major breakthrough, Uptime Institute Intelligence has often felt a certain unease when these complex findings, drawn from an ever-changing and complicated environment, are distilled into sound bites and headlines.

In May this year, Uptime Intelligence published its Annual outage analysis 2022. The key findings were worded cautiously: outage rates are not falling; many outages have serious / severe consequences; and the cost and impact of outages are increasing. This year, the media largely reported the findings accurately, picking different angles on the data — but this has not always been the case.

What does Uptime Institute think about the overall outage rate? Is reliability good or bad? Are outages rising or falling? If there were straightforward answers, there would be no need for this discussion. The reality, however, is that outages are both worsening and improving, depending on how they are measured.

In our recent outage analysis report, four in five organizations surveyed said they had experienced an outage in the past three years (Figure 1). This is in line with previous years’ results and is consistent with various other Uptime studies.

Figure 1. Most organizations experienced an outage in the past three years

A smaller proportion, about one in five, have had a “serious” or “severe” outage, meaning one with serious or severe financial and reputational consequences (Uptime classes outages on a severity scale of one to five; these are levels four and five). This is consistent with our previous studies, and our data also shows that the cost of outages is increasing.

By combining this information, we can see that the rate of outages, and their severity and impact, are not improving — in some ways they are worsening. But hold on, say many providers of data center services and IT: we know our equipment is much better than it was, and we know that IT and data center technicians have better tools and skills — so why aren’t things improving? The fact is, they are.

Our data and findings are based on multiple sources: some reliable, others less so. The primary tools we use are large, multinational surveys of IT and data center operators. Respondents report on outages of the IT services delivered from their data center site(s) or IT operation. Therefore, the outage rate is “per site” or “per organization”.

This is important because the number of organizations with IT and data centers has increased significantly. Even more notable is the amount of IT (data, compute, IT services) per site / organization, which is rising dramatically every year.

What do we conclude? First, the rate of outages per site / company / IT operation is steady on average and is neither rising nor falling. Second, the total number of outages is rising steadily, but not substantially, even though the number of organizations either using or offering IT is increasing. Lastly, the number of outages as a percentage of all IT delivered is falling steadily, if not dramatically.
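
To make the distinction between these three measures concrete, the short Python sketch below works through the arithmetic with purely hypothetical numbers (illustrative only, not Uptime survey data): a constant per-site outage rate, a growing number of sites, and IT delivered per site growing faster still.

```python
# Illustrative only: hypothetical figures, not Uptime Institute data.
# Shows how a steady per-site outage rate can coexist with rising total
# outages and a falling outage rate per unit of IT delivered.

sites = {2016: 1_000, 2021: 1_500}            # number of sites/organizations grows
it_units_per_site = {2016: 100, 2021: 400}    # IT delivered per site grows much faster
outages_per_site = 0.8                        # per-site outage rate held constant

for year in (2016, 2021):
    total_outages = outages_per_site * sites[year]
    total_it = sites[year] * it_units_per_site[year]
    print(f"{year}: total outages = {total_outages:,.0f}, "
          f"outages per unit of IT = {total_outages / total_it:.4f}")
```

The first figure rises while the second falls, which is exactly the pattern described above.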

This analysis is not easy for the media to summarize in one headline. But let’s make one more observation as a comparison. In 1970, there were 298 air crashes, which resulted in 2,226 deaths; in 2021, there were 84 air crashes, which resulted in 359 deaths. This is an enormous improvement, particularly allowing for the huge increase in passenger miles flown. If the airline safety record were similar to the IT industry’s outage rate, there would still be many hundreds of crashes per year and thousands of deaths.

This is perhaps not a like-for-like comparison — flight safety is, after all, always a matter of life and death. It does, however, demonstrate the power of collective commitment, transparency of reporting, regulation and investment. As IT becomes more critical, the question for the sector, for regulators and for every IT service and data center operator is (as it always has been): what level of outage risk is acceptable?

Watch our 2022 Outage Report Webinar for more from Uptime Institute Intelligence on the causes, frequency and costs of digital infrastructure outages. The complete 2022 Annual Outage Analysis report is available exclusively to Uptime Institute members. Learn more about Uptime Institute Membership and request a free guest trial here.

Cloud price increases damage trust

In general, the prices of cloud services either remain level or decrease. There are occasional price increases, but these are typically restricted to specific features; blanket price increases across product families are rare.

Price cuts are often the result of improved operating efficiencies. Through automation, slicker processes, Moore’s law improvements in hardware and economies of scale, cloud providers can squeeze their costs and pass savings on to their customers (see Cloud generations drive down prices).

Cost base isn’t the only factor impacting price. Cloud providers need to demonstrate ongoing value. Value is the relationship between function and cost. Although central processing units are becoming increasingly powerful, most users don’t see the functionality of a slightly faster clock speed as valuable — they would rather pay less. As a result, the price of virtual machines tends to decrease.

The price of other cloud services remains flat or decreases more slowly than that of virtual machines. With these services, the cloud provider pockets the benefits of improved operational efficiencies or invests them in new functionality to increase the perceived value of the service. Occasionally, cloud providers will cut the price of these services to garner attention and drive sales volume. Unlike with virtual machines, users of these services seek improved capability (at least for the time being) rather than lower prices.

Price increases are rare for two reasons. First, improvements in operational efficiencies mean the provider doesn’t have to increase prices to maintain or grow margin. Second, the cloud provider doesn’t want to be perceived as taking advantage of customers that are already locked into its services.

Cloud buyers place significant trust in their cloud providers. Only a decade ago, cloud computing was viewed as being unsuitable for enterprise deployments by many buyers. Trusting a third party to host business-impacting applications takes a leap of faith: for example, there is no service level agreement that adequately covers the business cost of an outage. Cloud adoption has grown significantly over the past decade, and this reflects the increased trust in both the cloud model and individual cloud providers.

One of the major concerns of using cloud computing is vendor lock-in: users can’t easily migrate applications hosted on one cloud provider to another. (See Cloud scalability and resiliency from first principles.) If the price of the cloud services increases, the user has no choice but to accept the price increase or else plan a costly migration.

Despite this financial anxiety, price increases have not materialized. Most cloud providers have realized that increasing prices would damage the customer trust they’ve spent so long cultivating. Cloud providers want to maintain good relationships with their customers, so that they are the de facto provider of choice for new projects and developments.

However, cloud providers face new and ongoing challenges. The COVID-19 pandemic and the current Russia-Ukraine conflict have disrupted supply chains. Cloud providers may also face internal cost pressures and spend more on energy, hardware components and people. But raising prices could be perceived as price-gouging, especially as their customers are operating under similar economic pressures.

In light of these challenges, it’s surprising that Google Cloud has announced that some services will increase in price from October this year. These include a 50% increase to multiregion nearline storage and the doubling of some operations fees. Load balancers will also be subject to an outbound bandwidth charge. Google Cloud has focused on convincing users that it is a relationship-led, enterprise-focused company (not just a consumer business). Such sweeping price increases would appear to damage its credibility in this regard.

How will these changes affect Google Cloud’s existing customers? It all depends on the customer’s application architecture. Google Cloud maintains it is raising prices to fall in line with other cloud providers. It is worth noting, however, that a price increase doesn’t necessarily mean Google Cloud will be more expensive than other cloud providers.

In Q3 2021, Google Cloud revenue increased by 45% to $4.99 billion, up from $3.44 billion in Q3 2020. Despite this growth, the division reported an operating loss of $644 million. Google Cloud’s revenue trails Amazon Web Services and Microsoft Azure by a wide margin, so Google Cloud may be implementing these price increases with a view to building a more profitable and sustainable business.

Will current and prospective customers consider the price increases reasonable or will they feel their trust in Google Cloud has been misplaced? Vendor lock-in is a business risk that needs managing — what’s not clear today is how big a risk it is.

Direct liquid cooling (DLC): pressure is rising but constraints remain

Direct liquid cooling (DLC) is a collection of techniques that removes heat by circulating a coolant to IT electronics. Even though the process is far more efficient than using air, the move to liquid cooling has been largely confined to select applications in high-performance computing to cool extreme high-density IT systems. There are a few examples of operators using DLC at scale, such as OVHcloud in France, but generally DLC continues to be an exception to the air cooling norm. In a survey of enterprise data center operators in the first quarter of 2022, Uptime Institute found that approximately one in six currently uses DLC (see Figure 1).

Figure 1. Many would consider adopting DLC

Uptime Institute Intelligence believes this balance will soon start shifting toward DLC. Renewed market activity has built up around DLC in anticipation of demand, offering a broader set of DLC systems than ever before. Applications that require high-density infrastructure, such as high-performance technical computing, big data analytics and the rapidly emerging field of deep neural networks, are becoming larger and more common. In addition, the pressure on operators to further improve efficiency and sustainability is building, as practical efficiency gains from air cooling have run their course. DLC offers the potential for a step change in these areas.

Yet, it is the evolution of data center IT silicon that will likely start pushing operators toward DLC en masse. High-density racks can be handled tactically and sustainability credentials can be fudged, but escalating processor and server power, together with the tighter temperature limits that come with them, will gradually render air cooling impractical in the second half of this decade.

Air cooling will be unable to handle future server developments without major compromises — such as larger heat sinks, higher fan power or (worse still) the need to lower air temperature. Uptime believes that organizations that cannot handle these next-generation systems because of their cooling requirements will compromise their IT performance — and likely their efficiency — compared with organizations that embrace them (see Moore’s law resumes — but not for all).

While the future need for scaled-up DLC deployments seems clear, there are technical and business complexities. At the root of the challenges involved is the lack of standards within the DLC market. Unlike air cooling, where air acts both as the standard cooling medium and as the interface between facilities and IT, there is no comparable standard coolant or line of demarcation for DLC. Efforts to create some mechanical interface standards, notably within the Open Compute Project and more recently by Intel, will take years to materialize in product form.

However, the DLC systems that are currently available have evolved significantly in recent years and have become easier to adopt, install and operate. A growing number of DLC production installations in live environments are providing vendors with more data to inform product development in terms of reliability, ease of use and material compatibility. Crucially, a patchwork of partnerships between IT vendors, DLC technology providers and system integrators is growing to make liquid-cooled servers more readily available.

Uptime has identified six commercially available categories of DLC systems (three cold plates and three immersion) but expects additional categories to develop in the near future:

  • Water cold plates.
  • Single-phase dielectric cold plates.
  • Two-phase dielectric cold plates.
  • Chassis immersion.
  • Single-phase immersion.
  • Two-phase immersion.

There are more than a dozen specialist vendors that actively develop DLC products, typically focusing on one of the above system categories, and numerous others have the capability to do so. Each category has a distinct profile of technical and business trade-offs. We exclude other systems that bring liquid to the rack, such as rear-door heat exchangers, due to their reliance on air as the medium to transfer heat from IT electronics to the facility’s cooling infrastructure. This is not an arbitrary distinction: there are major differences in the technical and business characteristics of these systems.

Uptime’s research indicates that most enterprises are open to making the change to DLC, and indeed project considerable adoption of DLC in the coming years. While the exact profile of the uptake of DLC cannot be realistically modeled, Uptime considers there to be strong evidence for a general shift to DLC. The pace of adoption, however, will likely be constrained by a fractured market of suppliers and vendors, organizational inertia — and a lack of formal standards and components.

Check out this DLC webinar for more in-depth insights on the state of direct liquid cooling in the digital infrastructure sector. The full Intelligence report “The coming era of direct liquid cooling: it’s when, not if” is available exclusively to Uptime Institute members. Learn more about Uptime Institute Membership and request a free guest trial here.

The ultimate liquid cooling: heat rejection into water

Uptime Institute’s data on power usage effectiveness (PUE) is a testament to the progress the data center industry has made in energy efficiency over the past 10 years. However, global average PUEs have largely stalled at close to 1.6 since 2018, with only marginal gains. This makes sense: for the average figure to show substantial improvement, most facilities would require financially unviable overhauls of their cooling systems to achieve notably better efficiencies, while modern builds already operate near the physical limits of air cooling.
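
As a reminder of what that 1.6 figure means, here is a minimal sketch of the standard PUE calculation (total facility energy divided by IT equipment energy), using hypothetical annual energy figures:

```python
# Standard PUE definition with hypothetical annual figures (illustrative only).
total_facility_energy_mwh = 16_000   # everything the site consumes, including cooling and losses
it_equipment_energy_mwh = 10_000     # energy delivered to the IT equipment itself

pue = total_facility_energy_mwh / it_equipment_energy_mwh
print(f"PUE = {pue:.2f}")  # 1.60: 0.6 kWh of overhead for every 1 kWh of IT load
```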

A growing number of operators are looking to direct liquid cooling (DLC) for the next leap in infrastructure efficiency. But a switch to liquid cooling at scale involves operational and supply chain complexities that challenge even the most resourceful technical organizations. Uptime is aware of only one major operator that runs DLC as standard: French hosting and cloud provider OVHcloud, which is an outlier with a vertically integrated infrastructure using custom in-house water cold plate and server designs.

When it comes to the use of liquid cooling, an often-overlooked part of the cooling infrastructure is heat rejection. Rejecting heat into the atmosphere is a major source of inefficiencies, manifesting not only in energy use, but also in capital costs and in large reserves of power for worst case (design day) cooling needs.

A small number of data centers have been using water features as heat sinks successfully for some years. Instead of rejecting heat through cooling towers, air-cooled chillers or other means that rely on ambient air, some facilities use a closed chilled-water loop that rejects heat through a heat exchanger cooled by an open loop of water. These designs extend the benefits of water’s thermal properties from heat transport inside the data center to the rejection of heat outside the facility.
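
As a rough illustration of what such an open loop has to move, the sketch below applies the standard sensible-heat relation (Q = ṁ·c_p·ΔT) to a hypothetical heat load; the figures are ours, not drawn from the facilities Uptime studied.

```python
# Illustrative sizing estimate for an open once-through loop (hypothetical numbers).
# Q = m_dot * c_p * dT  ->  m_dot = Q / (c_p * dT)

heat_load_kw = 5_000      # 5 MW of heat to reject
cp_water = 4.186          # kJ/(kg·K), specific heat of water
delta_t_k = 6.0           # allowed temperature rise of the open water loop

mass_flow_kg_s = heat_load_kw / (cp_water * delta_t_k)
volume_flow_m3_h = mass_flow_kg_s / 1_000 * 3_600   # water density ~1,000 kg/m^3

print(f"Required flow: ~{mass_flow_kg_s:.0f} kg/s (~{volume_flow_m3_h:.0f} m^3/h)")
```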

Figure 1. Schematic of a data center once-through system

The idea of using water for heat rejection, of course, is not new. Known as once-through cooling, these systems are used extensively in thermoelectric power generation and manufacturing industries for their efficiency and reliability in handling large heat loads. Because IT infrastructures are relatively small compared with these industrial heat loads, and tend to cluster around population centers, which in turn tend to be situated near water, Uptime considers the approach to have wide geographical applicability in future data center construction projects.

Uptime’s research has identified more than a dozen data center sites, some operated by global brands, that use a water feature as a heat sink. All once-through cooling designs use some custom equipment — there are currently no off-the-shelf designs that are commercially available for data centers. While the facilities we studied vary in size, location and some engineering choices, there are some commonalities between the projects.

Operators we’ve interviewed for the research (all of them colocation providers) considered their once-through cooling projects to be both a technical and a business success, achieving strong operational performance and attracting customers. The energy price crisis that started in 2021, combined with a corporate rush to claim strong sustainability credentials, reportedly boosted the level of return on these investments past even optimistic scenarios.

Rejecting heat into bodies of water allows for stable PUEs year-round, meaning that colocation providers can serve a larger IT load from the same site power envelope. Another benefit is the ability to lower computer room temperatures (for example, to 64°F to 68°F, or 18°C to 20°C) for “free”: this does not come with a PUE penalty. Low-temperature air supply helps operators minimize IT component failures and accommodate future high-performance servers with sufficient cooling. If the water feature is naturally flowing or replenished, operators also eliminate the need for chillers or other large cooling systems from their infrastructure, which would otherwise be required as backup.
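
A simple back-of-the-envelope calculation (hypothetical figures, not from the studied sites) shows why a lower, stable PUE translates into more sellable IT capacity within a fixed site power envelope:

```python
# Hypothetical example: IT load that fits within a fixed site power envelope
# at different PUE levels (site power / PUE = power left for IT).

site_power_mw = 10.0   # fixed utility capacity at the site

for pue in (1.6, 1.2, 1.1):
    print(f"PUE {pue:.1f}: ~{site_power_mw / pue:.2f} MW available for IT")
```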

Still, it was far from certain during design and construction that these favorable outcomes would outweigh the required investments, as every undertaking involved nontrivial engineering effort and associated costs. Committed sponsorship from senior management was critical for these projects to be given the green light and to overcome any unexpected difficulties.

Encouraged by the positive experience of the facilities we studied, Uptime expects once-through cooling to attract more interest in the future. A more mature market for these designs will factor into siting decisions, as well as into jurisdictional permitting requirements, as a proxy for efficiency and sustainability. Once-through systems will also help to maximize the energy efficiency benefits of future DLC rollouts through “free” low-temperature operations, creating an end-to-end liquid cooling infrastructure.

By: Jacqueline Davis, Research Analyst, Uptime Institute and Daniel Bizo, Research Director, Uptime Institute

Cloud SLAs punish, not compensate

A service level agreement (SLA) is a contract between a cloud provider and a user. The SLA describes the provider’s minimum level of service, specified by performance metrics, and the compensation due to the user should the provider fail to deliver this service.

In practice, however, this compensation is punitive. It seems designed to punish the provider for its failure, not to compensate the user for their loss.

A key metric used in cloud SLAs is availability. Cloud availability is generally expressed as the percentage of a defined period (usually a month) during which a resource has external connectivity. If a user can’t connect over the internet to the resource due to a cloud provider issue, the resource is deemed unavailable or ‘down’.

For example, Amazon Web Services (AWS) offers a 99.5% SLA on a virtual machine. There are 730 hours on average in a month, so the virtual machine should be available (or ‘up’) for around 726.4 of those 730 hours. Compensation is due if the virtual machine fails to meet this minimum availability during a month.

Alternatively, we could say AWS is not liable to pay compensation unless the virtual machine experiences total downtime of more than 0.5% of a month, or about 3 hours and 39 minutes on average (given that the lengths of months differ). Google Cloud, Microsoft Azure and other cloud providers offer similar terms. Table 1 shows common availability SLA performance metrics.

Table 1. Common availability SLA performance metrics

If the virtual machine is distributed across two availability zones (logical data centers, see Cloud scalability and resiliency from first principles), the typical SLA for availability increases to 99.99% (equivalent to an average of around 4 minutes of downtime per month).
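
The downtime allowances implied by these percentages are easy to compute; the short sketch below uses the 730-hour average month cited above (figures are approximate, since calendar months vary in length):

```python
# Approximate monthly downtime allowed under common availability SLAs,
# based on a 730-hour average month.

HOURS_PER_MONTH = 730

for availability in (0.995, 0.999, 0.9999):
    allowed_h = HOURS_PER_MONTH * (1 - availability)
    print(f"{availability:.2%} availability -> up to {allowed_h:.2f} hours "
          f"({allowed_h * 60:.0f} minutes) of downtime per month")
```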

The business impact of downtime to users depends on the situation. If short-lived outages occur overnight, the downtime may have little effect on revenue or productivity. However, a long outage in the middle of a major retail event, for example, is likely to hit income and reputation hard. In the 2021 Uptime Institute data center survey, the average cost of respondents’ most significant recent downtime incident was $973,000. This average does not include the 2% of respondents who estimate they lost more than $40M for their most recent worst downtime incident.

SLA compensation doesn’t even scratch the surface of these losses. If a single virtual machine goes down for more than 3 hours, 39 minutes but less than 7 hours, 18 minutes (99% monthly availability), AWS will pay 10% of the monthly cost of that virtual machine. Considering the price of a small instance (a ‘t4g.nano’) in the large US-East-1 region (in Northern Virginia, US) is around $3 per month, total compensation for this outage would be 30 cents.

If the virtual machine goes down for between 7 hours, 18 minutes and 36 hours, 31 minutes (95% availability in a month), the compensation is just 30% — just under a dollar. The user only receives a full refund for the month if the resource is down for more than one day, 12 hours and 31 minutes in total.
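
To show how little these tiers pay out in practice, here is a sketch that applies the AWS-style credit thresholds described above (10% below 99.5% monthly availability, 30% below 99%, 100% below 95%) to the roughly $3-per-month instance used in the example; it is illustrative only and simplifies the actual claim process.

```python
# Illustrative service-credit calculation using the tiers described in the text.
HOURS_PER_MONTH = 730

def service_credit(downtime_hours: float, monthly_cost: float) -> float:
    """Credit owed (in service credits, not cash) for a month's downtime."""
    availability = 1 - downtime_hours / HOURS_PER_MONTH
    if availability >= 0.995:
        rate = 0.0    # within the SLA: no credit
    elif availability >= 0.99:
        rate = 0.10
    elif availability >= 0.95:
        rate = 0.30
    else:
        rate = 1.00   # full refund of the month's fee
    return monthly_cost * rate

for downtime in (2, 5, 20, 40):   # hours of downtime in the month
    print(f"{downtime:>2} h down -> credit of ${service_credit(downtime, 3.0):.2f}")
```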

When a failure occurs, the user is responsible for measuring downtime and requesting compensation – this is not provided automatically. Users usually need to raise a report request with service logs to show proof of the outage. If the cloud provider approves the request, compensation is offered in service credits, not cash. Together, these approaches mean that users must detect an outage and apply for a service credit, which can only be redeemed through the offending cloud provider, and which is unlikely to cover the cost of the outage. Is the benefit of the compensation worth the effort?

Cloud providers are upfront about these compensation terms — they can be found easily on their websites. They are not being unreasonable by limiting their liabilities, but cloud users need to be realistic about the value of SLAs.

An SLA incentivizes the cloud provider to meet performance requirements with a threat of punitive losses. An SLA is also a marketing tool designed to convince the buyer that the cloud provider can be trusted to host business-critical applications.

Customers should not consider SLAs (or 99.9x% availability figures) as reliable predictors of future availability — or even a record of past levels of availability. The frequency of cloud outages (and their duration) suggests that the actual performance of many service providers falls short of published SLAs. (For a more detailed discussion on outages, see our Annual outage analysis 2021: The causes and impacts of data center outages.)

In short, users shouldn’t put too much faith in SLAs. They are likely to indicate a cloud provider’s minimum standard of service: few cloud providers would guarantee compensation on metrics they didn’t feel they could realistically achieve, even if the compensation is minimal. But an SLA is not an insurance policy; it does not mitigate the business impact of downtime.

Is concern over cloud and third-party outages increasing?

As a result of some high-profile outages and the growing interest in running critical services in a public cloud, the reliability — and transparency — of public cloud services has come under scrutiny.

Cloud services are designed to operate with low failure rates. Large (at-scale) cloud and IT service providers, such as Amazon Web Services, Microsoft Azure and Google Cloud, incorporate layers of software and middleware, balance capacity across systems, networks and data centers, and reroute workloads and traffic away from failures and troubled sites.

On the whole, these architectures provide high levels of service availability at scale. Despite this, no architecture is fail-safe and many failures have been recorded that can be attributed to the difficulties of managing such complex, at-scale software, data and networks.

Our recent resiliency survey of data center and IT managers shows that enterprise managers have reasonable concerns about the resiliency of public cloud services (see Figure 1). Only one in seven (14%) respondents say public cloud services are resilient enough to run all their workloads. The same proportion say the cloud is not resilient enough to run any of their workloads, and 32% say the cloud is only resilient enough to run some of their workloads. The increase in the number of “don’t know” responses since our 2021 survey also suggests that confidence in the resiliency of the cloud has become shrouded in some uncertainty and skepticism.

Figure 1. Most say cloud is only resilient enough for some workloads

Concerns over cloud resiliency may be partly due to several recent outages being attributed to third-party service providers. In our resiliency survey, 39% of organizations suffered an outage that was caused by a problem with a third-party supplier.

As more workloads are outsourced to external providers, these operators account for a growing share of high-profile public outages. Since Uptime Institute started tracking public outages in 2016, third-party commercial operators of IT and / or data centers (cloud, hosting, colocation, digital services, telecommunications, etc.) have together accounted for almost 63% of them. This percentage has crept up year by year: in 2021 the combined proportion of outages caused by these commercial operators was 70%.

When third-party IT and data center service providers do have an outage, customers are affected immediately. These customers may seek compensation — and a full explanation. Many regulators and enterprises now want increased visibility, accountability and improved service level agreements — especially for cloud providers.

Watch our 2022 Outage Report Webinar for more research on the causes, frequency and costs of digital infrastructure outages. The complete 2022 Annual Outage Analysis report is available exclusively to Uptime Institute members.