Blog Single Author Small - Uptime Institute Blog

Rapid interconnectivity growth will add complexity and risk

January 25, 2023/in Executive, Operations/by Lenny Simon, Senior Research Associate, Uptime Institute

Recent geopolitical concerns, predictions of a looming recession, and continued supply chain difficulties are unlikely to dampen growth in digital bandwidth on private networks according to Equinix’s 2022 Global Interconnection Index (GXI). Global interconnection bandwidth (the volume of data exchanged between companies directly, bypassing the public internet) is a barometer for digital infrastructure and sheds light on the difference in dynamics between verticals. High growth in private interconnection is a boon for Equinix as the world’s largest colocation provider by market share but makes resiliency more challenging for its customers: all these interconnects are also potential points of failure.

The Equinix GXI projects strong growth across the industry in 2023, with global interconnection bandwidth projected to increase by 41% compared to 2022. Overall, global interconnection bandwidth is projected to grow by a compound annual growth rate (CAGR) of 40% into 2025, when it is expected to reach nearly 28,000 terabits per second (tbps). These numbers include direct connections between enterprises and their digital business partners (such as telecommunications, cloud, edge, and software as a service (SaaS) providers).

The Equinix study projects faster growth in private interconnection for enterprises than for networks operated by telecommunications companies or cloud providers. This growth in private interconnection is driven by high demand for digital services and products — many of which also require a presence with multiple cloud providers as well as integration with major SaaS companies.

The energy and utility sector is likely to see the greatest growth in private network interconnection through 2025, with a CAGR of 53%, as energy data becomes increasingly important for managing intermittent renewable energy and decarbonizing the grid. Digital services supporting sustainability efforts such as carbon accounting are likely to require additional private interconnection with SaaS providers to accurately track operational sustainability metrics.

The banking and insurance and manufacturing sectors are expected to see CAGRs of 49% and 45%, respectively, over the same period. These industries are particularly sensitive to errors and outages, however, and appropriate planning will be necessary.

There is a reason Equinix has been drawing attention to the benefits of interconnection for the past six years: as at Q2 2022 the company operates 435,800 cross-connects throughout its own data centers. Its closest competitor, Digital Realty, reported just 185,000 cross-connects at its facilities in the same quarter. Equinix defines a cross-connect as a point-to-point cable link between two customers in the same retail colocation data center. For colocation companies, cross-connects not only represent core recurring revenue streams but also make their network-rich facilities more valuable as integration hubs between organizations.

As private interconnection increases, so too does the interdependency of digital infrastructure. Strong growth in interconnection may be responsible for the increasing proportion of networking and third-party-related outages in recent years. Uptime’s 2022 resiliency survey sheds light on the two most common causes of connectivity-related outages: misconfiguration and change management failure (reported by 43% of survey respondents); and third-party network-provider failure (43%). Asked specifically if their organization had suffered an outage caused by a problem with a third-party supplier, 39% of respondents confirmed this to be the case (see Figure 1).

diagram: Most common causes of major third-party outages — **Figure 1. The most common causes of major third-party outages**

When third-party IT and data center service providers do have an outage, customers are immediately affected — and may seek compensation. Enterprise end-users will need additional transparency and stronger service-level agreements from providers to better manage additional points of failure, as well as the outsourcing of their architecture resiliency. Importantly, managing the added complexity of an enterprise IT architecture spanning on-premises, colocation and cloud facilities demands more organizational resources in terms of skilled staff, time and budget.

Failing that, businesses might encounter unexpected availability and reliability issues rather than any anticipated improvement. According to Uptime’s 2021 annual survey of IT and data center managers, one in eight (of those who had a view) reported that using a mix of IT venues had resulted in their organization experiencing a deterioration in service resiliency, rather than the reverse.

By: Lenny Simon, Senior Research Associate and Max Smolaks, Analyst

Reports of cloud decline have been greatly exaggerated

January 18, 2023/in Executive, Operations/by Dr. Owen Rogers, Senior Research Director for Cloud Computing, Uptime Institute, orogers@uptimeinstitute.com

Cloud providers have experienced unprecedented growth over the past few years. CIOs the world over, often prompted by CFOs and CEOs, have been favoring the cloud over on-premises IT for new and major projects — with the result that the largest cloud provider, Amazon Web Services (AWS), has seen revenue increase by 30% to 40% every year since 2014 (when it recorded an 80% jump in turnover). Microsoft Azure and Google have reported similar numbers in recent times.

But there are signs of a slowdown:

While AWS reported a quarter-on-quarter revenue increase of 27.5% for Q3 2022, this is down from 33% in Q2 — the slowest growth in its history.
Microsoft’s CFO has also commented that Azure could see revenue growth decline in their next quarter, following disappointing 35% growth in the three months to September 2022.

Why this slowdown in cloud growth?

The global macroeconomic environment — specifically, high energy costs together with inflation — is making organizations more cautious about spending money. Cloud development projects are no different from many others and are likely to be postponed or deprioritized due to rising costs, skill shortages and global uncertainty.

Some moves to the cloud may have been indefinitely deferred. Public cloud is not always cheaper than on-premises implementations, and many organizations may have concluded that migration is just not worthwhile in light of other financial pressures.

For those organizations that have already built cloud-based applications it is neither feasible nor wise to turn off applications or resources to save money: these organizations are, instead, spending more time examining and optimizing their costs.

Cutting cloud costs, not consumption

Cloud providers’ top-line revenue figures suggest customers are successfully reducing their cloud costs. How are they doing this?

Optimizing cloud expenditure involves two key activities: first, eliminating waste (such as orphaned resources and poorly sized virtual machines); and second, more cost-effective procurement, through alternative pricing models such as consistent-usage commitments or spot instances — both of which, crucially, reduce expenditure without impacting application performance.

Hyperscaler cloud providers, which are more interested in building longer-term relationships than in deriving higher gross margins in the short term, offer tools to help users reduce expenditure. These tools have improved significantly over the past few years.

Many organizations have now crossed a threshold in terms of cloud use, where the savings to be made mean it is to their benefit to invest in optimization (using these tools). One factor driving optimization here is higher cloud expenditure — in part an ongoing consequence of the pandemic, which saw businesses retooling IT to survive, rather than focusing on cutting IT costs.

It should, perhaps, have been anticipated that customers would, at some point, start using these tools to their own advantage — current pressures on other costs having made cutting IT expenditure more critical than before.

Will cloud prices rise?

Cloud providers’ overriding objective of winning and keeping customers over the long term explains why hyperscalers are likely to try and avoid increasing their prices for the foreseeable future. Providers want to maintain good relationships with their customers so that they are the de facto provider of choice for new projects and developments: price hikes would damage the customer trust they’ve spent so long cultivating.

AWS’s Q3 2022 gross margin was 26%, some 3% down on Q2. This drop in margin could be attributed to rising energy costs, which AWS states almost doubled over the same period (hedging and long-term purchase agreements notwithstanding). Microsoft has reported it will face additional energy costs of $800 million this financial year. While AWS and Microsoft could have increased prices to offset rising energy costs and maintain their profit margins they have, so far, chosen not to do so rather than risk damaging customers’ trust.

How will this play out, going forward? Financial pressures may make organizations more careful about cloud spending. Projects may be subject to more stringent justification and approval, and some migrations are likely to be delayed (or even cancelled) for now. As revenue increases in absolute terms, achieving high-percentage revenue gains becomes increasingly difficult. Nonetheless, while the days of 40% revenue jumps may be over, this recent downturn is unlikely to be the start of a rapid downward spiral. AWS’s Q3 2022 revenue growth may have shrunk in percentage terms: but it was still in excess of $4 billion.

Applications architected for the cloud should be automatically scalable, and capable of meeting customers’ requirements without their having to spend more than necessary. Cloud applications allow organizations to adapt their business models and / or drive innovation — which may be one of the reasons many have been able to survive (and, in some cases, thrive) during challenging times. In a sense, the decline in growth that the cloud companies have suffered recently demonstrates that the cloud model is working exactly as intended.

The hyperscaler cloud providers are likely to continue to expand globally and create new products and services. Enterprise customers, in turn, are likely to continue to find cloud services competitive in comparison with colocation-based or on-premises alternatives. Much of the cloud’s value comes from a perception of it offering “unlimited” resources. If providers don’t increase capacity, they risk failing to meet customers’ expectations when required — damaging credibility, and relationships. AWS, Google and Microsoft continue to compete for market share, worldwide. Reducing investment now could risk future profitability.

AWS currently has 13,000 vacancies advertised on its website — a sign that the cloud sector is certainly not in retreat. This fact, rather, suggests future growth will be strong.

Major data center fire highlights criticality of IT services

January 11, 2023/in Design, Executive, Operations/by Daniel Bizo, Research Director, Uptime Institute Intelligence, dbizo@uptimeinstitute.com

Uptime Institute’s outages database suggests data center fires are infrequent, and rarely have a significant impact on operations. Uptime has identified 14 publicly reported, high-profile data center outages caused by fire or fire suppression systems since 2020. The frequency of fires is not increasing relative to the IT load or number of data centers but, uncontained, they are potentially disastrous to facilities, and subsequent outages can be ruinous for the business.

SK Group, South Korea’s second largest conglomerate, is the latest high-profile organization to suffer a major data center fire, following a breakout at a multistory colocation facility operated by its SK Inc. C&C subsidiary in Pangyo (just south of Seoul) on October 15. According to police reports, the fire started in a battery room before spreading quickly to the rest of the building. It took firefighters around eight hours to bring the blaze under control.

While there were no reported injuries, this incident could prove to be the largest data center outage caused by fire to date. It is a textbook example of how seemingly minor incidents can escalate to wreak havoc through cascading interdependencies in IT services.

The incident took tens of thousands of servers offline, including not only SK Group’s own systems but also the IT infrastructure running South Korea’s most popular messaging and single sign-on platform, KakaoTalk. The outage disrupted its integrated mobile payment system, transport app, gaming platform and music service — all of which are used by millions. The outage also affected domestic cloud giant Naver (the “Google of South Korea”) which reported disruption to its online search, shopping, media and blogging services.

While SK Group has yet to disclose the root cause of the fire, Kakao, the company behind KakaoTalk, has pointed to the Lithium-ion (Li-ion) batteries deployed at the facility — manufactured by SK on, another SK Group subsidiary. In response, SK Group has released what it claims are records from its battery management system (BMS) showing no deviation from normal operations prior to the incident. Some local media reports contradict this, however, claiming multiple warnings were, in fact, generated by the BMS. Only a thorough investigation will settle these claims. In the meantime, both sides are reported to be “lawyering up.”

The fallout from the outage is not limited to service disruptions or lost revenue, and has prompted a statement from the country’s president, Yoon Suk-yeol, who has promised a thorough investigation into the causes of, and the extent of the damages arising from, the fire. The incident has, so far, led to a police raid on SK Inc. C&C headquarters; the resignation of Kakao co-CEO Whon Namkoong; and the establishment of a national task force for disaster prevention involving military officials and the national intelligence agency. Multiple class-action lawsuits against Kakao are in progress, mainly based on claims that the company has prioritized short-term profits over investment in more resilient IT infrastructure.

The South Korean government has announced a raft of measures aimed at preventing large-scale digital service failures. All large data centers will now be subject to disaster management procedures defined by the government, including regular inspections and safety drills. Longer-term, the country’s Ministry of Science and ICT will be pushing for the development of battery technologies posing a lower fire risk — a matter of national interest for South Korea, home to some of the world’s largest Li-ion cell manufacturers including Samsung SDI and LG Chem, in addition to SK on.

The fire in South Korea will inevitably draw comparisons with the data center fire that brought down the OVHcloud Strasbourg facility in 2021. Impacting some 65,000 customers, many of whom lost their data in the blaze (see Learning from the OVHcloud data center fire), this fire, as in Pangyo, was thought to have involved uninterruptible power supply (UPS) systems. According to the French Bureau of Investigation and Analysis on Industrial Risks (BEA-RI), the lack of an automatic fire extinguisher system, delayed electrical cutoff and building design all contributed to the spread of the blaze.

A further issue arising from this outage, and one that remains to be determined, is the financial cost to SK Group, Kakao and Naver. The fire at the OVHcloud Strasbourg facility was estimated to cost the operator more than €105 million — with less than half of this being covered by insurance. The cost of the fire in Pangyo is likely to run into tens (if not hundreds) of millions of dollars. This should serve as a timely reminder of the importance of fire suppression, particularly in battery rooms.

Li-ion batteries in mission-critical applications — risk creep?

Li-ion batteries present a greater fire risk than valve-regulated lead-acid batteries, regardless of their specific chemistries and construction – a position endorsed by the US’ National Fire Protection Association and others. Since the breakdown of cells in Li-ion batteries produces combustible gases (including oxygen) which can result in a major thermal-runaway event (in which the fire spreads uncontrollably between cells, across battery packs and, potentially, even cabinets if these are inappropriately distanced), the fires they cause are notoriously difficult to suppress.

Many operators have, hitherto, found the risk-reward profile of Li-ion batteries (in terms of their lower footprint and longer lifespan) to be acceptable. Uptime surveys show major UPS vendors reporting strong uptake of Li-ion batteries in data center and industrial applications: some vendors report shipping more than half their major three-phase UPS systems with Li-ion battery strings. According to the Uptime Institute Global Data Center Survey 2021, nearly half of operators have adopted this technology for their centralized UPS plants, up from about a quarter three years ago. The Uptime Institute Global Data Center Survey 2022 found Li-ion adoption levels to be increasing still further (see Figure 1).

diagram of Data centers are embracing Li-ion batteries — **Figure 1. Data centers are embracing Li-ion batteries**

The incident at the SK Inc. C&C facility highlights the importance of selecting appropriate fire suppression systems, and the importance of fire containment as part of resiliency. Most local regulation governing fire prevention and mitigation concentrates (rightly) on securing people’s safety, rather than on protecting assets. Data center operators, however, have other critically important issues to consider — including equipment protection, operational continuity, disaster recovery and mean time to recovery.

While gaseous (or clean agent) suppression is effective at slowing down the spread of a fire in the early stages of Li-ion cell failure (when coupled with early detection), it is arguably less suitable for handling a major thermal-runaway event. The cooling effects of water and foam mean these are likely to perform better; double-interlock pre-action sprinklers also limit the spread. Placing battery cabinets farther apart can help prevent or limit the spread of a major fire. Dividing battery rooms into fire-resistant compartments (a measure mandated by Uptime Institute’s Tier IV resiliency requirements) can further decrease the risk of facility-wide outages.

Such extensive fire prevention measures could, however, compromise the benefits of Li-ion batteries in terms of their higher volumetric energy density, lower cooling needs and overall advantage in lifespan costs (particularly where space is at a premium).

Advances in Li-ion chemistries and cell assembly will address operational safety concerns — lithium iron phosphate, with its higher ignition point and no release of oxygen during decomposition – being a case in point. Longer term, inherently safer, innovative chemistries — such as sodium-ion and nickel-zinc — will probably offer a more lasting solution to the safety (and sustainability) conundrum around Li-ion. Until then, the growing prevalence of vast amounts of Li-ion batteries in data centers means the propensity of violent fires can only grow — with potentially dire financial consequences.

By: Max Smolaks, Analyst, Uptime Institute Intelligence and Daniel Bizo, Research Director, Uptime Institute Intelligence

Tweak to AWS Outposts reflects demand for greater cloud autonomy

December 7, 2022/in Executive, Operations/by Dr. Owen Rogers, Senior Research Director for Cloud Computing, Uptime Institute, orogers@uptimeinstitute.com

Amazon Web Services (AWS) has made a minor change to its private-cloud appliance, AWS Outposts, that could significantly impact resiliency. The cloud provider has enabled local access to cloud administration, removing the appliance’s reliance on the public cloud. In the event of a network failure between the public cloud and the user’s data center, the private-cloud container platform can still be configured and maintained.

Many public-cloud providers have extended their offerings to now make their services accessible through the user’s own choice of data center. Services are typically billed in the same way as they are via the public cloud, and accessed through the same portal and software interfaces, but are now delivered from hardware and software hosted in the user’s own facility. Such services are in demand from customers seeking to meet compliance or data protection requirements, or to improve the end-user experience through lower latency.

In one business model, the cloud provider ships a server-storage private-cloud appliance to an organization’s data center. The organization manages the data center. The public-cloud provider is responsible for the hardware and middleware that delivers the cloud functionality.

The term “private cloud” describes a cloud platform where the user has access to elements of the platform not usually accessible in the public cloud (such as the data center facility, hardware and middleware). These appliances are a particular type of private cloud, not designed to be operated independently of the public cloud. They are best thought of as extensions of the public cloud to the on-premises data center (or colocation facility) since administration and software maintenance is performed via the public cloud.

As the public and private cloud use the same platform and application programming interfaces (APIs), applications can be built across the organization’s and the cloud provider’s data centers, and the platform can be managed as one. For more information on private-cloud appliances (see the Uptime Institute Intelligence report Cloud scalability and resiliency from first principles).

The resilience of this architecture has not, hitherto, been assured because the application still relies on the cloud provider’s ability to manage some services, such as the management interface. The public-cloud provider controls the interface for interacting with the user’s on-premises cloud (the “control plane”); if that interface goes down, so too does the ability to administrate the on-premises cloud.

Ironically, it is precisely during an outage that an administrator is most likely to want to make such changes to configuration — to reserve capacity for mission-critical workloads or to reprioritize applications to handle the loss of public-cloud capacity, for example. If an AWS Outpost appliance were being used in a factory to support manufacturing machinery, for instance, the inability to configure local capabilities during a network failure could significantly affect production.

It is for this reason that AWS’s announcement that its Elastic Kubernetes Service product (Amazon EKS) can be managed locally on AWS Outposts is important. Kubernetes is a platform used to manage containers. This new capability allows users to configure API endpoints on the AWS Outposts appliance, meaning the container configuration can be changed via the local network without connecting to the public cloud.

In practical terms, this addition makes AWS Outposts more resilient to outages because it can function in the event of a connectivity failure between the cloud provider and the data center. AWS Outposts is now far more feasible as a disaster-recovery or failover location, and more appropriate for edge locations, where connectivity might be less assured.

The most important aspect of this development, however, is that it indicates AWS — the largest cloud provider — is perhaps acknowledging that users don’t just want an extension of the public cloud to their own facilities. Although many organizations are pursuing a hybrid-cloud approach, where public and private cloud platforms can work together, they don’t want to sacrifice the autonomy of each of those environments.

Organizations want venues to work independently of each other if required, avoiding single points of failure. To address this desire, other AWS Outposts services may be made locally configurable over time as users demand autonomy and greater control over their cloud applications.

Why are governments investigating cloud competitiveness?

November 30, 2022/in Executive, Operations/by Dr. Owen Rogers, Senior Research Director for Cloud Computing, Uptime Institute, orogers@uptimeinstitute.com

In any market, fewer sellers or providers typically results in less choice for buyers. Where the number of sellers is very low this could, theoretically, lead to exploitation, through higher prices or lower-quality goods and services — with buyers having no choice but to accept such terms.

Three hyperscale cloud providers — Amazon Web Services, Google Cloud and Microsoft Azure — have become dominant throughout most of the world. This has triggered investigations by some governments to check that limited competition is not impacting customers.

The UK government’s Office of Communications’ (Ofcom’s) Cloud services market study is intended to investigate the role played by these “cloud provider hyperscalers” in the country’s £15 billion public cloud services market. Ofcom’s objective, specifically, is to understand the strength of competition in the market and to investigate whether the dominance of these hyperscalers is limiting growth and innovation.

Although there is a debate about the cost and strategic implications of moving core workloads to the cloud, competition among cloud provider hyperscalers, so far, seems to be good for users: recent inflation-driven increases notwithstanding, prices have generally decreased (across all providers) over the past few years. Apart from the hyperscalers, users can procure cloud services from local providers (and established brands), colocation providers and private cloud vendors. The cloud provider hyperscalers continue to develop innovative products, sold for pennies per hour through the pay-as-you-go pricing model and accessible to anyone with a credit card.

However, Ofcom is concerned. It cites research from Synergy Research Group showing that the combined market share of the hyperscalers is growing at the expense of smaller providers (at a rate of 3% per year) with the hyperscalers’ UK market share now standing at over 80%. As discussed in Uptime Institute Intelligence’s Cloud scalability and resiliency from first principles report, vendor lock-in can make it harder for users to change cloud providers to find a better deal.

The Herfindahl-Hirschman Index (HHI) is commonly used to assess market competitiveness on the basis of market share. A market with an HHI of over 2,500 suggests a limited number of companies have significant power to control market prices — a “high concentration.” The UK cloud services market is estimated to have an HHI of over 2,900. Given the global HHI of 1,600 for this sector, the UK’s high value validates the need for the Ofcom investigation.

Such a high market concentration isn’t necessarily a problem, however, if competing companies keep prices low while offering innovative products and services to a large population. A high concentration is only problematic if the cloud providers are in a stalemate (or worse, in collusion) — not cutting prices, not releasing new products, and not fighting to win each other’s customers. UK law prevents cloud providers from colluding to fix prices or restrict competition. But with so few competitors, such anti-competitive behavior might emerge accidentally (although there are few — if any — signs of such a stalemate so far).

The most intriguing part of Ofcom’s study will be its recommendations on how to make the market more competitive. Unless Ofcom can find evidence of anti-competitive behavior, there may be very little it can do to help smaller players compete, apart from limiting the hyperscalers’ ambitions, through regulation or divestiture. Outward signs are that cloud providers have come to dominate the market by providing users with the services they expect, at a price they’re willing to pay, rather than through any nefarious means.

Hyperscale cloud providers require colossal capital, substantial and cutting-edge expertise, and global-scale efficiency investments — all of which means they can cut prices, over time, while expanding into new markets and releasing new products. The hyperscalers themselves have not created the significant barrier to entry faced by smaller players in attempting to compete here: that barrier exists because of the sheer scale of operations fundamental to cloud computing’s raison d’etre.

In most countries, competition authorities — or governments generally — have limited ability to help smaller providers overcome this barrier, whether through investment or support. In the case of the UK, Ofcom’s only option is to restrict the dominance of the hyperscalers.

One option open to competition authorities would be regulating cloud prices by setting price caps, or by forcing providers to pass on cost savings. But price regulation only makes sense if prices are going up, and if users have no other alternatives. Many users of cloud services have seen prices come down: and they are, in any case, at liberty to use noncloud infrastructure if providers are not delivering good value.

Ofcom (and other regulators) could, alternatively, enforce the divestment of hyperscalers’ assets. But breaking up a cloud provider on the basis of the products and services offered would penalize those users looking for integrated services from a single source. It would also be an extremely bold and highly controversial step that the UK government would be unlikely to undertake without wider political consensus. In the US, there is a bipartisan support for an investigation into tech giant market power, which could provide that impetus.

Regulators could also legislate to force suppliers to offer greater support in migrating services between cloud providers: but this could stifle innovation, with providers unable to develop differentiated features that might not work elsewhere. Theoretically, a government could even nationalize a major cloud provider (although this is highly unlikely).

Given the high concentration of this market, Ofcom’s interest in conducting an investigation is understandable: while there is limited evidence to date, there could, be anti-competitive factors at play that are not immediately obvious to customers. Ofcom’s study may well not uncover many competitive concerns at the moment but it might, equally, focus attention on the nation’s over-reliance on a limited number of cloud providers in the years ahead.

In this Note, we have focused purely on AWS’, Google’s and Microsoft’s cloud infrastructure businesses (Amazon Web Services, Google Cloud and Microsoft Azure). But these tech giants also provide many other products and services in many markets, each of which has different levels of competitiveness.

Microsoft, for example, has recently been pressured into making changes to its software licensing terms following complaints from EU regulators and European cloud providers (including Aruba, NextCloud and OVHcloud). These regulators and cloud providers argue that Microsoft has an unfair advantage in delivering cloud services (via its Azure cloud), given it owns the underlying operating system. Microsoft, they claim, could potentially price its cloud competitors out of the market by increasing its software licensing fees.

As their market power continues to increase, these tech giants will continue to face anti-competitive regulation and lawsuits in some, or many, of these markets. In the UK, how far Ofcom will investigate the hyperscalers’ impact in particular subsectors, such as retail, mobile, operating systems and internet search is yet to be seen.

Users unprepared for inevitable cloud outages

November 23, 2022/in Executive, Operations/by Dr. Owen Rogers, Senior Research Director for Cloud Computing, Uptime Institute, orogers@uptimeinstitute.com

Organizations are becoming more confident in using the cloud for mission-critical workloads — partly due to a perception of improved visibility into operational resiliency. But many users aren’t taking basic steps to ensure their mission-critical applications can endure relatively frequent availability zone outages.

Data from the 2022 Uptime Institute annual survey reflects this growing confidence in public cloud. The proportion of respondents not placing mission-critical workloads into a public cloud has now dropped from 74% (2019) to 63% (2022), while those saying they have adequate visibility into the resiliency of public-cloud services has risen from 14% to 21%.

However, other survey data suggests cloud users’ confidence may be misplaced. Cloud providers recommend that users distribute their workloads across multiple availability zones. An availability zone is a logical data center, often understood to have redundant and separate power and networking. Cloud providers make explicitly clear that zones will suffer outages occasionally — their position being that users must architect their applications to handle the loss of an availability zone.

Zone outages are relatively common: 35% of respondents said the loss of an availability zone would result in significant performance issues. Only 16% of those surveyed said that the loss of an availability zone would have no impact on their cloud applications (see Figure 1).

diagram: Many cloud applications vulnerable to availability zone outages — **Figure 1. Many cloud applications vulnerable to availability zone outages**

This presents a clear contradiction. Users appear to be more confident that the public cloud can handle mission-critical workloads, yet over a third of users are architecting applications vulnerable to relatively common availability zone outages. This contradiction is due to a lack of clarity on the respective roles and responsibilities of provider and user.

Who is at fault if an application goes down as a result of a single availability zone outage? This data point would appear to reflect the lack of clarity on roles and responsibilities: half of respondents to Uptime’s annual survey believe this to be primarily the cloud provider’s fault, while the other half believe responsibility lies with the user in having failed to architect the application to avoid such downtime.

The provider is, of course, responsible for the operational resiliency of its data centers. But cloud providers neither state nor guarantee that availability zones will be highly available. On which basis, why do users assume that a single availability zone will provide the resiliency their application requires?

This misunderstanding might, at least in part, be due to the simplistic view that the cloud is just someone else’s computer, in someone else’s data center: but this is not the case. A cloud service is a complex combination of data center, hardware, software and people. Services will fail from time to time due to unexpected behavior arising from the complexity of interacting systems, and people.

Accordingly, organizations that want to achieve high availability in the cloud must architect their applications to endure frequent outages of single availability zones. Lifting and shifting an application from an on-premises server to a cloud virtual machine might reduce resiliency if the application is not rearchitected to work across cloud zones.

As cloud adoption increases, the impact of outages is likely to grow as a significantly higher number of organizations rely on cloud computing for their applications. While many will architect their applications to weather occasional outages, many are not yet fully prepared for inevitable cloud service failures and the subsequent impact on their applications.

Major data center fire highlights criticality of IT services

Li-ion batteries in mission-critical applications — risk creep?

Explaining the Uptime Institute’s Tier Classification System (April 2021 Update)

The Making of a Good Method of Procedure

A Look at Data Center Cooling Technologies

Data Center Cooling: CRAC/CRAH redundancy, capacity, and selection metrics

Implementing Data Center Cooling Best Practices