Separation of Production vs. Non-Production IT

Non-production IT can hinder mission-critical operations

Separating production and non-production assets should be an operational requirement for most organizations. By definition, production assets support high-priority IT loads — servers that are critical to a business or business unit. In most organizations, IT will have sufficient discretion to place these assets where they have redundant power supplies, sufficient cooling and high levels of security. Other assets can be placed elsewhere, preserving the most important infrastructure for the most important loads. However, business requirements sometimes require IT organizations to operate both production (mission critical) and non-production environments in the same facility.

In these instances, facility managers must be careful to prevent the spread of non-production IT, such as email, human resources, telephone and building controls, into expensive mission-critical spaces. While non-production IT generally does not increase risk to mission-critical IT during normal operations, mixing production and production environments will reduce mission-critical capacity and can make it more difficult to shed load.

Keeping production and non-production IT separate:

  • Reduces the chance of human error in operations.
  • Preserves power, cooling and space capacity.
  • Simplifies the process of shedding load, if necessary.

Our report Planning for mission-critical IT in mixed-use facilities (available to Uptime Institute Network community members) discusses how operating a data center in a mixed-use facility can be advantageous to the organization and even to the IT department, but can also introduce significant risk. Establishing and enforcing budget and access policies is critical in these circumstances; the entire organization must understand and follow policies limiting access to the white space.

Organizations do not need to maintain separate budgets and facilities staff for production and non-production operations — they’re accustomed to managing both; both are clearly IT functions. But the similarities between production and non-production IT does not mean that these assets should share circuits — or even racks. The presence of non-production IT gear in a mission-critical white space increases operational risk, and the less critical gear reduces the availability of mission-critical resources, such as space, cooling or power. The infrastructure required to meet the demands of mission-critical IT is expensive to build and operate and should not be used for less critical loads.

Separating the two classes of assets makes it easier for IT to manage assets and space, as well as reduce demands on generator and uninterruptible power supply capacity, especially in the event of an incident. Similarly, keeping the assets separate makes it easier for operations to shed load, if necessary.

Limiting the use of mission-critical infrastructure to production workloads can help organizations defer expansion plans. In addition, it makes it easier to limit access to mission-critical spaces to qualified personnel, while still allowing owners of non-production gear to retain access to their equipment.

However, not all companies can completely separate production and non-production loads. Other solutions include designing certain areas within the data center strictly for noncritical loads and treating those spaces differently from the rest of the facility. This arrangement takes a lot of rigor to manage and maintain, especially when the two types of spaces are in close proximity. For example, non-production IT can utilize single-corded devices, but these should be fed by dedicated power distribution units (PDUs), with dual-corded loads also served by dedicated PDUs. But mixing those servers and PDUs in a shared space creates opportunities for human error when adding or moving servers.

For this reason and more, the greater the separation between production and non-production assets, the easier it becomes for IT to differentiate between them, allocate efforts and resources appropriately, and reduce operational risk.

The full report Planning for mission-critical IT in mixed-use facilities is available to members of the Uptime Institute Network. Guest Registration is easy and can be started here.

99 Red Flags: The Availability Nines

99 Red Flags

One of the most widely cited metrics in the IT industry is for availability, expressed in the form of a number of nines: three nines for 99.9% availability (minutes of downtime per year), extending to six nines — 99.9999% — or even, very rarely, seven nines. What this should mean in practice is show in the table below:

The metric is very widely used (more so in the context of IT equipment than facilities) and is commonly, if not routinely, cited in cloud services and colocation contracts. Its use is rarely questioned. Contrast this with the power usage effectiveness (PUE) metric, which has been the subject of a decade of industry-wide debate and is treated with such suspicion that many dismiss it out of hand.

So, let us start (or re-open) the debate: To what degree should customers of IT services, and particularly cloud services, pay attention to the availability metric — the availability promise — that almost all providers embed in their service level agreement (SLA) – the 99.99% or 99.999% number? (The SLA creates a baseline below which compensation will be paid or credits applied.)

In speaking to multiple colleagues at Uptime Institute, there seems to be a consensus: Treat this number, and any SLAs that use this number, with extreme caution. And the reasons why are not so dissimilar from the reasons why PUE is so maligned: the metric is very useful for certain specific purposes but it is used casually, in many different ways, and without scientific rigor. It is routinely misapplied as well (sometimes with a clear intention to distort or mislead).

Here are some of the things that these experts are concerned about. First, the “nines” number, unless clearly qualified, is neither a forward-looking metric (it doesn’t predict availability), nor a backward-looking number (it doesn’t say how a service has performed); usually a time period is not stated. Rather, it is an engineering calculation based on the likely reliability of each individual component in the system, based on earlier tests or manufacturer promises. (The method is rooted in longstanding methodologies for measuring the reliability of dedicated and well-understood hardware equipment, such as aircraft components or machine parts.)

This is where the trouble starts. Complex systems based on multiple parts working together or in which use cases and conditions can change frequently are not so easily modeled in this way. This is especially true of software, where changes are frequent. To look backward with accuracy requires genuine, measured and stable data of all the parts working together; to look forward requires an understanding of what changes will be made, by whom, what the conditions will be when the changes are made, and with what impact. Add to this the fact that many failures are due to unpredictable configuration and/or operation and management failures, and the value of the “nines” metric becomes further diluted.

But it gets worse: the role of downtime due to maintenance is often not covered or is not clearly separated out. More importantly, the definition of downtime is either not made clear or it varies according to the service and the provider. There are often — and we stress the word “often” — situations in modern, hybrid services where a service has slowed to a nearly non-functional crawl, but the individual applications are all still considered to be “up.” Intermittent errors are even worse — the service can theoretically stop working for a few seconds a day at crucial times yet be considered well within published availability numbers.

The providers do, of course, measure historical service availability — they need to show performance against existing SLAs. But the published or promised 99.9x availability figures in the SLAs of providers are only loosely based on underlying measurements or engineering calculations. In these contracts, the figure is set to maximize profit for the operator: it needs to be high enough to attract (or not scare away) customers, but low enough to ensure minimum compensation is paid. Since the contracts are in any case weighted to ensure that the amounts paid out are only ever tiny, the incentive is to cite a high number.

To be fair, it is not always done this way. Some operators cite clear, performance-based availability over a period of time. But most don’t. Most availability promises in an SLA are market-driven and may change according to market conditions.

How does all this relate to the Uptime Institute Tier rating systems for the data center? Uptime Institute’s Chief Technology Officer Chris Brown explains that there is not a direct relationship between any number of “nines” and a Tier level. Early on, Uptime did publish a paper with some “expected” availability numbers for each Tier level to use as a discussion point, but this is no longer considered relevant. One reason is that it is possible to create a mathematical model to show a data center has a good level of availability (99.999%, for example), while still having multiple single points of failure in its design. Unless this is understood, a big failure is lying in wait. A secondary point is that measuring predicted outages using one aggregated figure might hide the impact of multiple small failures.

Brown (along with members of the Uptime Institute Resiliency Assessment team) believes it can be useful to use a recognized failure methodology, even creating a 99.9x figure. “I like availability number analysis to help determine which of two design choices to use. But I would not stake my career on them,” Brown says. “There is a difference between theory and the real world.” The Tier model assumes that in the real world, failures will occur, and maintenance will be needed.

Where does this leave cloud customers? In our research, we do find that the 99.9x figures give a good first-pass guide to the expected availability of a service. Spanner, Google’s highly resilient distributed database, for example, has an SLA based on 99.999% availability — assuming it is configured correctly. This compares to most database reliability figures of 99.95% or 99.99%. And some SLAs have a higher availability promise if the right level of replication and independent network pathways are deployed.

It is very clear that the industry needs to have a debate about establishing — or re-establishing — a reliable way of reporting true, engineering-based availability numbers. This may come, but not very soon. In the meantime, customers should be cautious — skeptical, even. They should expect failures and model their own likely availability using applicable tools and services to monitor cloud services.

Big Tech Regulation

Big Tech Regulations: an UPSIDE for Cloud customers?

Regulation of Internet giants has focused so far mostly on data privacy, where concerns are relatively well understood by lawmakers and the general public. At the same time, the threat of antitrust action is growing. Congressional hearings in the US with Amazon, Apple, Facebook and Google have begun, and governments in Europe, India and elsewhere are undertaking or considering similar probes. Regulatory discourse has centered on social media, search, digital advertising and e-commerce.

Yet there is another area for which increased pressure is likely: cloud computing services, which play an increasingly critical role in many industries. Two big players, Amazon and Google, are already being scrutinized by lawmakers. Greater regulatory attention seems inevitable.

Deregulating big cloud

Given its dominance, some governments may initially focus on Amazon Web Services (AWS). With almost $26 billion in revenue last year — a 47% annual increase — AWS’ market share is larger than at least two of its biggest competitors combined: Microsoft Azure, its closest rival, and Google Cloud Platform, which is trailing but growing fast. At least one Wall Street analyst has publicly called for Amazon to spin off AWS as a separate business to help avoid regulatory pressure in other areas. As a stand-alone company, AWS could become more focused and business-friendly.

Breaking up big cloud would, however, be both politically controversial and technically difficult. If cloud is a vast experiment on a global scale, then breaking it up would be too.

If applications and services were separated from suppliers’ cloud platforms, would the underlying infrastructure they have built, such as data centers and networks, also be separated? What would happen to customers’ agreements, and how would third-party service providers whose businesses sit on top of cloud platforms (and infrastructure) function?

For the data center sector, there is also the question of what happens to the thousands of long-term agreements that colocation and wholesale providers have with the cloud giants (and which many rely on). These agreements would likely stay in place, but with whom?

Big providers’ valuation would also be problematic; part of their value is their tightly integrated ‘one-stop-shop’ services. And what about their non-cloud service businesses that run in their cloud data centers and over the wide area networks they have built?

Consumer harm?

Any antitrust action, at least in some countries such as the US, would assume a burden of consumer harm. On the face of it, consumer harm could appear to be lacking to regulators that — as with other innovations — have just a rudimentary understanding of the market. After all, cloud is characterized by low customer prices, which continue to fall, and competitors such as Oracle are promising services that would cut AWS customers’ bills “in half.”

However, given the scale and reach of cloud computing, some regulators will look more closely and beyond the data privacy and security laws that have already been enacted. The ability of lawmakers to create additional oversight will be complicated by the vast number of multi-faceted services available. The number of cloud services offered by AWS, Google and Microsoft has almost tripled in the past three years to nearly 300. Understanding the capabilities and requirements of AWS’ services (and those of other cloud providers) is so complex it is now a specialized career, including for third-party consultants.

Providers’ pricing structures and billing documentation are also highly complex. There is a standard metric for virtual machines (fixed per hour) and for storage (gigabytes per month) but multiple and varied metrics are billed for server-less, database, load balancer and other services. Providers have online calculators to help with this, but they can leave out critical (and potentially expensive) components such as data transfer, which obfuscates pricing for the buyer. Regulators may require simplified pricing and billing and greater visibility into how services and products are tied together, potentially leading to the decoupling of some service bundles.

This raises another area for potential oversight: broader and deeper inter-operability among services. Multi-cloud and hybrid IT environments are common, but organizations are often forced to choose a primary cloud platform to manage their development and IT environments effectively. Greater inter-operability among platforms, development tools and services would give customers greater practical choice and lower their risk of vendor lock-in. This could limit the ability of some providers to retain or gain market share.

It could also crimp their pricing strategies for non-cloud products. For example, beginning in October 2019, Microsoft will end its Bring Your Own License (BYOL) model that has enabled users of Microsoft software in their own data centers to migrate those workloads to a dedicated (single-tenant) host in a public cloud using the same on-premises license, without additional cost. Soon, however, Microsoft licenses will be transferable only to dedicated hosts in Microsoft’s Azure cloud. Customers that wish to use a dedicated host in AWS, Google or Alibaba, for example, will have to pay a fee in addition to the standard license cost. Regulators may seek to curb these types of approaches that effectively “tax” users of competitor’s services.

Visibility, please

The most likely scenario, at least in the short-term, may not be antitrust but enforced openness — specifically, greater transparency of cloud infrastructure, with regulators insisting on more oversight of the resiliency of critical systems running in the cloud, especially in the financial services. As a 2018 US Treasury report noted, bank regulations have not “sufficiently modernized” to accommodate cloud.

Regulators are beginning to look into this issue. In the UK, the Bank of England is investigating large banks’ reliance on cloud as part of a broader risk-reduction initiative for financial digital services. The European Banking Authority specifically states that an outsourcer/cloud operator must allow site inspections of data centers. And in the US, the Federal Reserve has conducted a formal examination of at least one AWS data center, in Virginia, with a focus on its infrastructure resiliency and backup systems. More site visits are expected.

Cloud providers may have to share more granular information about the operational redundancy and fail-over mechanisms of their infrastructure, enabling customers to better assess their risk. Some providers may invest in highly resilient infrastructure for certain services and either accept lower operating profit margins or charge a premium for those services.

The creation of any new laws would take years to play out. Meanwhile, market forces could compel providers to make their services simpler to buy and more inter-operable and the associated risks, more transparent. Some are already moving in this direction but the threat of regulatory impact, real or perceived, could speed up the development of these types of improvements.

IT Outages in the Financial Community

Outage Reporting in Financial Services

In the movie “Mary Poppins,” Mr. Banks sings that a British bank must be run with precision, and that “Tradition, discipline and rules must be the tools.” Otherwise, he warns, “Disorder! Chaos!” will ensue.

One rule, introduced by the UK Financial Conduct Authority (FCA) in 2018, suggests that disorder and chaos might be quite common in the IT departments of many UK banks. The rule introduced the mandatory reporting of online service disruptions caused by IT problems for retail banks. The first year’s figures, published this summer, show that banks, on average, suffered an IT outage or security issue nearly every month.

The numbers (shown in the table below) are far higher than any published elsewhere or previously. One bank, Barclays, averaged an incident nearly every ten days. But all the big banks suffered frequent problems. Given this, it is no surprise that the UK Treasury Select Committee has called for action, and the Bank of England is planning on introducing new operational resilience rules for all financial institutions operating in the UK.

Source: UK Financial Conduct Authority

There may be certain mitigating factors relating to the retail banking figures: many of the banks are large and have multiple services, and some of the banks have common ownership or shared systems, so there may some “double counting” of underlying problems. And the data has grouped security incidents with IT availability problems, which has boosted the overall numbers.

Even so, there is clearly an issue in retail banking/financial services that is similar to the one affecting airlines, which we discussed in our “Creeping Criticality Note” (available to Uptime Institute Network members). The growing demand for 24-hour availability — at ever increasing scale — and the need to support ever more services is running ahead of some companies’ ability to deliver it. This seems to be affecting many financial services organizations. The FCA, which has responsibility for all UK financial services (such as insurance, pensions, etc.), recently said the number of “operational resilience” breaks reported increased to 916 for the year 2018-19 from 229 the year before – a 300% increase.

The FCA’s Director of Supervision Meghan Butler made two important observations when the numbers were published: First, a substantial number of new incidents were caused by IT and software issues (including, we assume, data center power/networking) and not so much by cyber-attacks. She noted the need for better management and processes. And second, the increase, while real, is also to do with the number of incidents being reported.

The last point is critical: Dozens of financial services giants and hundreds of innovative fintech startups call the UK home. Is this a uniquely UK problem? Is IT in this sector is particularly bad? There may be — indeed, there are known to be — problems with modernizing huge, legacy infrastructure while in flight. But many US banks, for example, have suffered similar issues.

Undoubtedly, a big proportion of the rise is due to the fact that the FCA now insists that incidents are reported. Outage reporting across almost all industries is at best ad hoc — notwithstanding confidential reporting systems run by the Uptime Institute Network (the Abnormal Incident Report, or AIR, database) and the Data Center Incident Reporting Network (DCIRN), an independent body set up in 2017 by veteran data center designer Ed Ansett. Even allowing for media reports, many also tracked by Uptime, the vast majority of outages, even at major firms, are never reported to anyone, so the lessons cannot be learned. While outages at consumer-facing internet services can be spotted by web-based outage reporting services using software tools, often there no real details other than data that points to a loss or slowdown in service.

It has often been suggested that for lessons to be shared, mandatory reporting of outages — in detail — is needed, at least by large or important organizations or services. The FCA’s data supports the view that many outages are hidden. Now they have surfaced, the next question is, How can the lessons of the failures can be more openly and widely shared?


More information on this topic is available to members of the Uptime Institute Network which can be found here.

Data Center Outages - Part Deux

How to Avoid Outages – Part Deux

A previous Uptime Intelligence Note suggested that avoiding data center outages might be as simple as trying harder. The Note suggested that management failures are the main reason that enterprises continue to experience downtime incidents, even in fault tolerant facilities. “The Intelligence Trap,” a new book by David Robson, sheds light on why management mistakes continue to plague data center operations. The book suggests that management must be aware of its own limitations if it is going to properly staff operations, develop training programs and evaluate risk. Robson also notes that modern education practices, especially for children in the US and Europe, have created notions about intelligence that eventually undermine management efforts and also inform opposing and intractable political arguments about wider societal issues such as climate change and the anti-vaccination movement.

Citing dozens of anecdotes and psychological studies, Robson carefully examines how the intelligence trap ensnares almost everyone, including notables such as Arthur Conan Doyle, Albert Einstein and other famous, high IQ individuals — even Nobel laureates. Brilliant as they are, many fall prey to a variety of hoaxes. Some believe a wide variety of conspiracy theories and junk science, in many cases persuading others to join them.

The foundations of the trap, Robson notes, begin with the early and widespread acceptance of the IQ test, its narrow definition of intelligence and expectations that “smart” people will perform better at all tasks than others. Psychologist Keith Stanovich says intelligence tests “work quite well for what they do,” but are a poor metric of overall success. He calls the link between intelligence and rationality weak.

Studies show that geniuses experience deference in fields far beyond their areas of expertise. But according to studies, they are at least as apt to use their intelligence to craft arguments that justify their existing positions than to open themselves to new information that may cause them to change their opinions.

In some cases, ingrained patterns of thinking may further undermine these geniuses, with Robson noting that experts often develop mental shortcuts, based on experience, that enable them to work quickly on routine cases. In the medical field, these patterns contribute to a higher rate of misdiagnosis: in the US, one in six initial diagnoses proves to be incorrect, leading to one in 10 (40,000-80,000) patient deaths per year.

The intelligence trap can catch organizations, too. The US FBI held firmly to a mistaken fingerprint reading that implicated an innocent man in the 2004 terrorist bombing in Madrid, even after non-forensic evidence emerged that showed the suspect had played no role in the attack. FBI experts were not even persuaded by Spanish authorities who found discrepancies between the print found at the bombing and the suspect’s fingerprint.

Escaping the intelligence trap is difficult, but possible. However, ingrained patterns of thinking lead successful individuals to surround themselves with too many experts, forming non-diverse teams that do not function well together. High levels of intelligence and experience and lack of diversity can lead to overconfidence that increases their appetite for risk. Competition among accomplished team members can also lead to poor team function.

Data center owners and operators should note that even highly functional teams sometimes process close calls as successes, which leads them to discount the likelihood of a serious incident. NASA suffered two well-known space shuttle disasters after its engineers began to downplay known risks due to a string of successful missions. Prior to the loss of the Challenger, many shuttles had survived the faulty seals that caused the 1986 disaster, and every shuttle had survived damage to foam insulation until the Columbia disaster in 2003. Prior experience had reduced the perceived risk of these dangerous conditions, a condition called “outcome bias.”

The result, says Catherine Tinsley, a professor of management at Georgetown specializing in corporate catastrophe, is a subtle but distinct increase in risk appetite. She says, “I’m not here to tell you what your risk tolerance should be. I’m here to say that when you experience a near miss, your risk tolerance will increase and you won’t be aware of it.”

The real question here is not whether mistakes and disasters are inevitable. Rather, it’s how to become conscious of the subconscious. Staff should be trained to recognize the limitations that cause bad decisions. And perhaps more succinctly, data center operators and owners should recognize that procedures must not only be foolproof — they must be “expert-proof” as well.

More information on this topic is available to members of the Uptime Institute Network. Membership information can be found here.

Business Outages

Why do some industries and organizations suffer more serious, high profile outages than others?

In a recent Uptime Institute Intelligence note, we considered a June 2019 report issued by the US General Accounting Office (GAO) on the IT resiliency of US airlines. The GAO wanted to better understand if the all-too-frequent IT outages and resultant chaos passengers face have any common causes and if so, how they could be addressed. (Since we published that Note, the UK-owned carrier British Airways suffered its second big outage in two years, once again stranding tens of thousands of passengers and facing heavy costs.)

The GAO report didn’t really uncover much new: there was, in some cases, a need for better testing, a little more redundancy needed here and there, certainly some improved processes. But despite suspicions of under-investment, there was nothing systemic. The causes of the outages were varied and, although often avoidable when looked at in hindsight, not necessarily predictable.

There is, however, still an undeniable pattern. Uptime Institute’s own analysis of three years of public, media-reported outages shows that two industries, airlines and retail financial services, do appear to suffer from significantly more, highly disruptive (category 4 and 5), high profile outages than other industries.

To be clear: these businesses do not necessarily have more outages, but rather they suffer a higher number of highly disruptive outages, and as a result, get more negative publicity when there is a problem. Cloud providers are not far behind.

Why is this? The reasons may vary, but these businesses very often offer services on which large numbers of people depend, for which almost any interruption causes immediate losses and negative publicity, and in which it may not be easy to get back to the status quo.

Another trait that seems to set these businesses apart is that their almost complete dependency on IT is relatively recent (or they may be a new IT service or industry), so they may not yet have invested to the same levels as, for example, an investment bank, stock exchange or a power utility. In these last examples, the mission-critical nature of the business has long been clear, they are probably regulated, and so have investments and processes fully in place.

Organizations have long conducted business impact analyses, and there are various methodologies and tools available to help carry these out. Uptime Institute has been researching this area, particularly to see how organizations might specifically address business/impacts of failures in digital infrastructure. One simple approach is to create a “vulnerability” rating for each application/service, with scores attributed across a number of factors. Some of our thinking — and this is not comprehensive — is outlined below:

  • Profile. Certain industries are consumer facing, large scale or have a very public brand. A high score in this area means even small failures — Facebook’s outages are a good example — will have a big public impact.
  • Failure sensitivity. Sensitive industries are those for which an outage has immediate and high impact. If an investment bank can’t trade, planes can’t take off or clients can’t access their money, the sensitivity is high.
  • Recover-ability. Organizations that can take a lengthy time to restore normal service will suffer more seriously from IT failures. The costs of an outage may be multiplied many times over if the recovery time is lengthy. For example, airlines may find it takes days to get all planes and crews in the right location to restore normal operations.
  • Regulatory/compliance. Failures in the certain industries either must be reported or will attract attention from regulators. Emergency services (e.g., 911, 999, 112), power companies and hospitals are good examples … and this list is growing.
  • Platform dependents. Organizations whose customers include service providers — such as software as a service; infrastructure as a service; and co-location, hosting and cloud-based service providers — will not only breach service level agreements but also lose paying clients. (There are many examples of this.)

One of the challenges of carrying out assessments is that the impact of any particular service’s or application’s failing is changing, in two ways. First, in most cases, it is increasing, along with the dependency of all businesses and consumers on IT. And second, it is becoming more complicated and harder to determine accurately, largely because of the inter-dependency of many different systems and applications, intertwined to support different processes and services. There may even a logarithmic hockey stick curve, with the impact of failures growing rapidly as more systems, people and businesses are involved.

Looked at like this, it is clear that certain organizations have become more vulnerable to high impact outages than they were a year or two previously, because while the immediate impact on sales/revenue may not have the changed, the scale, profile or recover-ability may have. It may be that airlines, which only two years ago could board passengers manually, can no longer do so without IT; similarly, retail banking customers used to carry sufficient cash or checks to get themselves a meal and get home. Not anymore. These organizations now have a very tricky problem: How do they upgrade their infrastructure, and their processes, to a level of mission critically for which they were not designed?

All this raises a further tricky question that Uptime Institute is researching: Which industries, businesses or services have become (or will become) critical to the national infrastructure — even if a few years ago they certainly were not (or they are not currently)? And how should these services be regulated (if they should …)? We are seeking partners to help with this research. Organizations are not the only ones struggling with these questions — governments are as well.

More information on this topic is available to members of the Uptime Institute Network here.