Public cloud infrastructures have come a long way over the past 16 years to slowly earn the trust of enterprises in running their most important applications and storing sensitive data. In the Uptime Institute Global Data Center Survey 2022, more than a third of enterprises that operate their own IT infrastructure said they also placed some of their mission-critical workloads in a public cloud.
This gradual change in enterprises’ posture, however, can only be partially attributed to improved or more visible cloud resiliency. An equal, or arguably even bigger, component in this shift in attitude is enterprises’ willingness to make compromises when using the cloud, which includes sometimes accepting less resilient cloud data center facilities. However, a more glaring downgrade lies in the loss of the ability to configure IT hardware specifically for sensitive business applications.
In more traditional, monolithic applications, both the data center and IT hardware play a central role in their reliability and availability. Most critical applications that predate the cloud era depend heavily on hardware features because they run on a single or a small number of servers. By design, more application performance meant bigger, more powerful servers (scaling up as opposed to scaling out), and more reliability and availability meant picking servers engineered for mission-critical use.
In contrast, cloud-native applications should be designed to scale across tens or hundreds of servers, with the assumption that the hardware cannot be relied upon. Cloud providers are upfront that customers are expected to build in resiliency and reliability using software and services.
But such architectures are complex, may require specialist skills and come with high software management overheads. Legacy mission-critical applications, such as databases, are not always set up to look after their reliability on their own without depending on hardware and operating system / hypervisor mechanisms. To move such applications to the cloud and maintain their reliability, organizations may need to substantially refactor the code.
This Uptime Update discusses why organizations that are migrating critical workloads from their own IT infrastructure to the cloud will need to change their attitudes towards reliability to avoid creating risks.
Much more than availability
The language that surrounds infrastructure resiliency is often ambiguous and masks several interrelated but distinct aspects of engineering. Historically, the industry has largely discussed availability considerations around the public cloud, which most stakeholders understand as not experiencing outages to their cloud services.
In common public cloud parlance, availability is almost always used interchangeably with reliability. When offering advice on their reliability features or on how to architect cloud applications for reliability, cloud providers tend to discuss almost exclusively what falls under the high-availability engineering discipline (e.g., data replication, clustering and recovery schemes). In the software domain, physical and IT infrastructure reliability may be conflated with site reliability engineering, which is a software development and deployment framework.
These crossover in two significant ways. First, availability objectives, such as the likelihood that the system is ready to operate at any given time, are only a part of reliability engineering — or rather, one of its outcomes. Reliability engineering is primarily concerned with the system’s ability to perform its function free of errors. It also aims to suppress the likelihood that failures will affect the system’s health. Crucially, this includes the detection and containment of abnormal operations, such as a device making mistakes. In short, reliability is the likelihood of producing correct outputs.
For facilities, this typically translates to the ability to deliver conditioned power and air — even during times of maintenance and failures. For IT systems, reliability is about the capacity to perform compute and storage jobs without errors in calculations or data.
Second, the reliability of any system builds on the robustness of its constituent parts, which include the smallest components. In the cloud, however, the atomic unit of reliability that is visible to customers is a consumable cloud resource, such as a virtual machine or container, and more complex cloud services, such as data storage, network and an array of application interfaces.
Today, enterprises have not only limited information on the cloud data centers’ physical infrastructure resiliency (either topology or maintenance and operations practices), but also low visibility of, or any choice in, the reliability features of IT hardware and infrastructure software that underpin cloud services.
Engineering for reliability: a lost art?
This abstraction of hardware resources is a major departure from the classical infrastructure practices for IT systems that run mission-critical business and industrial applications. Server reliability greatly depends on the architectural features that detect and recover from errors occurring in processors and memory chips, often with the added help of the operating system.
Typical examples of errors include soft bit flips (transient bit errors typically caused by an anomaly) and hard bit flips (permanent faults) in memory cell arrays. Bit errors can be found both in the processor and in external memory banks, as well as operational errors and design bugs in processor logic that could produce incorrect outputs or result in a software crash.
For much of its history, the IT industry has gone to great and costly lengths to design mission-critical servers (and storage systems) that can be trusted to manage data and perform operations as intended. The engineering discipline addressing server robustness is generally known as reliability, availability and serviceability (RAS, which was originally coined by IBM five decades ago), with the serviceability aspect referring to maintenance and upgrades, including software, without causing any downtime.
Traditional examples of these servers include mainframes, UNIX-based and other proprietary software and hardware systems. However, in the past couple of decades x86-based mission-critical systems, which are distinct from volume servers in their RAS features, have also taken hold in the market.
What sets mission-critical hardware design apart is its extensive reliability mechanisms in its error detection, correction and recovery capabilities that go beyond those found in mainstream hardware. While perfect resistance of errors is not possible, such features greatly reduce the chances of errors and software crashes.
Mission-critical systems tend to be able to isolate a range of faulty hardware components without resulting in any disruption. These components include memory chips (the most common source of data integrity and system stability issues), processor units or entire processors, or even an entire physical partition of a mission-critical server. Often, critical memory contents are mirrored within the system across different banks of memory to safeguard against hardware failures.
Server reliability doesn’t end with design, however. Vendors of mission-critical servers and storage systems test the final version of any new server platform for many months to ensure they perform correctly, known as the validation process, before volume manufacturing begins.
Entire sectors, such as financial services, e-commerce, manufacturing, transport and more, have come to depend on and trust such hardware for the correctness of their critical applications and data.
Someone else’s server, my reliability
Maintaining a mission-critical level of infrastructure reliability in the cloud (or even just establishing underlying infrastructure reliability in general), as opposed to “simple” availability, is not straightforward. Major cloud providers don’t address the topic of reliability in much depth to begin with.
What techniques, if any, cloud operators could deploy to safeguard customer applications against data corruption and application failures beyond the use of basic error correction code in memory, which is only able to handle random, single-bit errors, is difficult to know. Currently, there are no hyperscale cloud instances that offer enhanced RAS features comparable to mission-critical IT systems.
While IBM and Microsoft both offer migration paths directly for some mission-critical architectures, such as IBM Power and older s390x mainframes, their focus is on the modernization of legacy applications rather than maintaining reliability and availability levels that are comparable to on-premises systems. The reliability on offer is even less clear when it comes to more abstracted cloud services, such as software as a service and database as a service offerings or serverless computing.
Arguably, the future of reliability lies with software mechanisms. In particular, the software stack needs to adapt by getting rid of its dependency on hardware RAS features, whether this is achieved through verifying computations, memory coherency or the ability to remove and add hardware resources.
This puts the onus of RAS engineering almost solely on the cloud user. For new critical applications, purely software-based RAS by design is a must. However, the time and costs of refactoring or rearchitecting an existing mission-critical software stack to verify results and handle hardware-originating errors are not trivial — and are likely to be prohibitive in many cases, if possible.
Without the assistance of advanced RAS features in mission-critical IT systems, performance, particularly response times, will also likely take a hit if the same depth of reliability is required. At best, this means the need for more server resources to handle the same workload because the software mechanisms for extensive system reliability features will carry a substantial computational and data overhead.
These considerations should temper the pace at which mission-critical monolithic applications migrate to the cloud. Yet, these arguments are almost academic. The benefits of high reliability are difficult to quantify and compare (even more so than availability), in part because it is counterfactual — it is hard to measure what is being prevented.
Over time, cloud operators might invest more in generic infrastructure reliability and even offer products with enhanced RAS for legacy applications. But software-based RAS is the way forward in a world where hardware has become generic and abstracted.
Enterprise decision-makers should at least be mindful of the reliability (and availability) trade-offs involved with the migration of existing mission-critical applications to the cloud, and budget for investing in the necessary architectural and software changes if they expect the same level of service that an enterprise IT infrastructure can provide.