Uptime Institute Blog
Lifting and shifting apps to the cloud: a source of risk creep?


July 26, 2023 / in Executive, Operations / by Daniel Bizo, Research Director, Uptime Institute Intelligence, dbizo@uptimeinstitute.com

Public cloud infrastructures have come a long way over the past 16 years to slowly earn the trust of enterprises in running their most important applications and storing sensitive data. In the Uptime Institute Global Data Center Survey 2022, more than a third of enterprises that operate their own IT infrastructure said they also placed some of their mission-critical workloads in a public cloud.

This gradual change in enterprises’ posture, however, can only be partially attributed to improved or more visible cloud resiliency. An equal, or arguably even bigger, component in this shift in attitude is enterprises’ willingness to make compromises when using the cloud, which includes sometimes accepting less resilient cloud data center facilities. A more glaring downgrade, though, lies in the loss of the ability to configure IT hardware specifically for sensitive business applications.

In more traditional, monolithic applications, both the data center and IT hardware play a central role in reliability and availability. Most critical applications that predate the cloud era depend heavily on hardware features because they run on a single server or a small number of servers. By design, more application performance meant bigger, more powerful servers (scaling up as opposed to scaling out), and more reliability and availability meant picking servers engineered for mission-critical use.

In contrast, cloud-native applications should be designed to scale across tens or hundreds of servers, with the assumption that the hardware cannot be relied upon. Cloud providers are upfront that customers are expected to build in resiliency and reliability using software and services.
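
As a rough illustration of why scaling out can compensate for unreliable individual servers, the sketch below computes the chance that at least one of several replicas is up. It assumes replicas fail independently, which is a simplification (a shared zone, network or software bug can take several down together), and the 99% per-instance figure is purely illustrative rather than a number from any cloud provider.

```python
def combined_availability(instance_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent instances is up."""
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All replicas must be down at the same time for the service to be down.
    return 1.0 - (1.0 - instance_availability) ** replicas


if __name__ == "__main__":
    # Hypothetical per-instance availability of 99%.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

The calculation only covers whether something is running. It says nothing about whether the running replicas return correct results, which is the distinction the rest of this article draws.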

But such architectures are complex, may require specialist skills and come with high software management overheads. Legacy mission-critical applications, such as databases, are not always set up to look after their reliability on their own without depending on hardware and operating system / hypervisor mechanisms. To move such applications to the cloud and maintain their reliability, organizations may need to substantially refactor the code.

This Uptime Update discusses why organizations that are migrating critical workloads from their own IT infrastructure to the cloud will need to change their attitudes towards reliability to avoid creating risks.

Much more than availability

The language that surrounds infrastructure resiliency is often ambiguous and masks several interrelated but distinct aspects of engineering. Historically, the industry has largely discussed availability considerations around the public cloud, which most stakeholders understand as not experiencing outages to their cloud services.

In common public cloud parlance, availability is almost always used interchangeably with reliability. When offering advice on their reliability features or on how to architect cloud applications for reliability, cloud providers tend to discuss almost exclusively what falls under the high-availability engineering discipline (e.g., data replication, clustering and recovery schemes). In the software domain, physical and IT infrastructure reliability may be conflated with site reliability engineering, which is a software development and deployment framework.

These concepts cross over in two significant ways. First, availability objectives, such as the likelihood that the system is ready to operate at any given time, are only a part of reliability engineering, or rather, one of its outcomes. Reliability engineering is primarily concerned with the system’s ability to perform its function free of errors. It also aims to suppress the likelihood that failures will affect the system’s health. Crucially, this includes the detection and containment of abnormal operations, such as a device making mistakes. In short, reliability is the likelihood of producing correct outputs.
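
The difference can be made concrete with a small sketch. It uses a request log as a stand-in for system behavior, which is an assumption made here for illustration: the `Request` record and both metric functions are hypothetical, and a real measurement of either property would be far more involved.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    answered: bool  # did the service respond at all?
    correct: bool   # was the response right? (only meaningful if answered)


def availability(log: Sequence[Request]) -> float:
    """Share of requests that received any response."""
    if not log:
        raise ValueError("log is empty")
    return sum(r.answered for r in log) / len(log)


def correctness(log: Sequence[Request]) -> float:
    """Share of answered requests whose response was correct."""
    answered = [r for r in log if r.answered]
    if not answered:
        raise ValueError("no answered requests in log")
    return sum(r.correct for r in answered) / len(answered)


if __name__ == "__main__":
    # Illustrative log: every request is answered, but five answers are wrong.
    log = [Request(True, True)] * 995 + [Request(True, False)] * 5
    print(availability(log))  # 1.0
    print(correctness(log))   # 0.995
```

A service like this one would report no outage at all, yet it silently returns wrong results for a fraction of requests. An availability target alone would not surface that.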

For facilities, this typically translates to the ability to deliver conditioned power and air — even during times of maintenance and failures. For IT systems, reliability is about the capacity to perform compute and storage jobs without errors in calculations or data.

Second, the reliability of any system builds on the robustness of its constituent parts, which include the smallest components. In the cloud, however, the atomic unit of reliability that is visible to customers is a consumable cloud resource, such as a virtual machine or container, and more complex cloud services, such as data storage, network and an array of application interfaces.

Today, enterprises not only have limited information on the physical infrastructure resiliency of cloud data centers (either topology or maintenance and operations practices), but also have low visibility of, and little or no choice in, the reliability features of the IT hardware and infrastructure software that underpin cloud services.

Engineering for reliability: a lost art?

This abstraction of hardware resources is a major departure from the classical infrastructure practices for IT systems that run mission-critical business and industrial applications. Server reliability greatly depends on the architectural features that detect and recover from errors occurring in processors and memory chips, often with the added help of the operating system.

Typical examples of errors include soft bit flips (transient bit errors typically caused by an anomaly) and hard bit flips (permanent faults) in memory cell arrays. Bit errors can occur both in the processor and in external memory banks. Operational errors and design bugs in processor logic can also produce incorrect outputs or result in a software crash.

For much of its history, the IT industry has gone to great and costly lengths to design mission-critical servers (and storage systems) that can be trusted to manage data and perform operations as intended. The engineering discipline addressing server robustness is generally known as reliability, availability and serviceability (RAS, a term coined by IBM five decades ago), with the serviceability aspect referring to maintenance and upgrades, including software, without causing any downtime.

Traditional examples of these servers include mainframes, UNIX-based and other proprietary software and hardware systems. However, in the past couple of decades x86-based mission-critical systems, which are distinct from volume servers in their RAS features, have also taken hold in the market.

What sets mission-critical hardware design apart is its extensive reliability mechanisms: error detection, correction and recovery capabilities that go beyond those found in mainstream hardware. While perfect resistance to errors is not possible, such features greatly reduce the chances of errors and software crashes.

Mission-critical systems tend to be able to isolate a range of faulty hardware components without resulting in any disruption. These components include memory chips (the most common source of data integrity and system stability issues), processor units or entire processors, or even an entire physical partition of a mission-critical server. Often, critical memory contents are mirrored within the system across different banks of memory to safeguard against hardware failures.

Server reliability doesn’t end with design, however. Before volume manufacturing begins, vendors of mission-critical servers and storage systems test the final version of any new server platform for many months to ensure it performs correctly, a process known as validation.

Entire sectors, such as financial services, e-commerce, manufacturing, transport and more, have come to depend on and trust such hardware for the correctness of their critical applications and data.

Someone else’s server, my reliability

Maintaining a mission-critical level of infrastructure reliability in the cloud (or even just establishing underlying infrastructure reliability in general), as opposed to “simple” availability, is not straightforward. Major cloud providers don’t address the topic of reliability in much depth to begin with.

It is difficult to know what techniques, if any, cloud operators may deploy to safeguard customer applications against data corruption and application failures beyond basic error correction code in memory, which can only handle random, single-bit errors. Currently, there are no hyperscale cloud instances that offer enhanced RAS features comparable to mission-critical IT systems.
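
To show what handling only single-bit errors means in practice, here is a toy sketch using the textbook Hamming(7,4) code. It is not the scheme used by any particular memory module or cloud provider: server memory typically uses wider codes that can also detect, though not correct, double-bit errors. The function names are chosen here for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the suspected bad bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def flip(codeword, *indexes):
    """Return a copy of the codeword with the given 0-based bits inverted."""
    c = list(codeword)
    for i in indexes:
        c[i] ^= 1
    return c


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    word = hamming74_encode(data)
    print(hamming74_decode(flip(word, 4)) == data)     # True: single flip corrected
    print(hamming74_decode(flip(word, 0, 4)) == data)  # False: double flip miscorrected
```

With two flipped bits, this toy decoder "corrects" the wrong position and returns bad data without raising any error. That kind of silent corruption is what the more extensive RAS features described above are designed to contain.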

While IBM and Microsoft both offer migration paths directly for some mission-critical architectures, such as IBM Power and older s390x mainframes, their focus is on the modernization of legacy applications rather than maintaining reliability and availability levels that are comparable to on-premises systems. The reliability on offer is even less clear when it comes to more abstracted cloud services, such as software as a service and database as a service offerings or serverless computing.

Arguably, the future of reliability lies with software mechanisms. In particular, the software stack needs to adapt by getting rid of its dependency on hardware RAS features, whether this is achieved through verifying computations, memory coherency or the ability to remove and add hardware resources.
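
One way to picture "verifying computations" in software is redundant execution with a majority vote, sketched below. This is a generic technique named here for illustration, not a method the article attributes to any vendor. The helper names (`run_with_voting`, `make_flaky_sum`, `NoMajorityError`) are hypothetical, results must be hashable, and the bit flip is simulated.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, TypeVar

T = TypeVar("T", bound=Hashable)


class NoMajorityError(RuntimeError):
    """Raised when no single result is returned by more than half of the runs."""


def run_with_voting(compute: Callable[[], T], runs: int = 3) -> T:
    """Run `compute` several times and return the strict-majority result."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    results = [compute() for _ in range(runs)]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= runs:
        raise NoMajorityError(f"no majority among results: {results}")
    return value


def make_flaky_sum(values: Sequence[int], corrupt_on_call: int) -> Callable[[], int]:
    """Build a sum function whose result has one bit flipped on a chosen call."""
    calls = {"n": 0}

    def compute() -> int:
        calls["n"] += 1
        total = sum(values)
        if calls["n"] == corrupt_on_call:
            total ^= 1 << 3  # simulate a transient bit flip in the result
        return total

    return compute


if __name__ == "__main__":
    flaky = make_flaky_sum([10, 20, 30], corrupt_on_call=2)
    print(run_with_voting(flaky))  # 60
```

In this sketch all runs happen in one process. To guard against a permanent hardware fault, the runs would have to execute on separate instances, and either way the workload costs roughly three times the compute.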

This puts the onus of RAS engineering almost solely on the cloud user. For new critical applications, purely software-based RAS by design is a must. However, the time and costs of refactoring or rearchitecting an existing mission-critical software stack to verify results and handle hardware-originating errors are not trivial, and are likely to be prohibitive in many cases, if the work is possible at all.

Without the assistance of advanced RAS features in mission-critical IT systems, performance, particularly response times, will also likely take a hit if the same depth of reliability is required. At best, this means the need for more server resources to handle the same workload because the software mechanisms for extensive system reliability features will carry a substantial computational and data overhead.

These considerations should temper the pace at which mission-critical monolithic applications migrate to the cloud. Yet these arguments are almost academic. The benefits of high reliability are difficult to quantify and compare (even more so than those of availability), in part because they are counterfactual: it is hard to measure what is being prevented.

Over time, cloud operators might invest more in generic infrastructure reliability and even offer products with enhanced RAS for legacy applications. But software-based RAS is the way forward in a world where hardware has become generic and abstracted.

Enterprise decision-makers should at least be mindful of the reliability (and availability) trade-offs involved with the migration of existing mission-critical applications to the cloud, and budget for investing in the necessary architectural and software changes if they expect the same level of service that an enterprise IT infrastructure can provide.

Tags: Cloud, Cloud Costs, Cloud Infrastructure, Cloud Migration, Data Center, Data Center Risk, digital Infrastructure
© 2014-2025 Uptime Institute, LLC All rights reserved.