Regulation of Internet giants has focused so far mostly on data privacy, where concerns are relatively well understood by lawmakers and the general public. At the same time, the threat of antitrust action is growing. Congressional hearings in the US with Amazon, Apple, Facebook and Google have begun, and governments in Europe, India and elsewhere are undertaking or considering similar probes. Regulatory discourse has centered on social media, search, digital advertising and e-commerce.
Yet there is another area for which increased pressure is likely: cloud computing services, which play an increasingly critical role in many industries. Two big players, Amazon and Google, are already being scrutinized by lawmakers. Greater regulatory attention seems inevitable.
Regulating big cloud
Given its dominance, some governments may initially focus on Amazon Web Services (AWS). With almost $26 billion in revenue last year — a 47% annual increase — AWS’ market share is larger than those of its two biggest competitors combined: Microsoft Azure, its closest rival, and Google Cloud Platform, which is trailing but growing fast. At least one Wall Street analyst has publicly called for Amazon to spin off AWS as a separate business to help avoid regulatory pressure in other areas. As a stand-alone company, AWS could become more focused and business-friendly.
Breaking up big cloud would, however, be both politically controversial and technically difficult. If cloud is a vast experiment on a global scale, then breaking it up would be too.
If applications and services were separated from suppliers’ cloud platforms, would the underlying infrastructure they have built, such as data centers and networks, also be separated? What would happen to customers’ agreements, and how would third-party service providers whose businesses sit on top of cloud platforms (and infrastructure) function?
For the data center sector, there is also the question of what happens to the thousands of long-term agreements that colocation and wholesale providers have with the cloud giants (and which many rely on). These agreements would likely stay in place, but with whom?
Big providers’ valuation would also be problematic; part of their value is their tightly integrated ‘one-stop-shop’ services. And what about their non-cloud service businesses that run in their cloud data centers and over the wide area networks they have built?
Consumer harm?
Any antitrust action, at least in some countries such as the US, would carry a burden of proving consumer harm. On the face of it, consumer harm could appear lacking to regulators that — as with other innovations — have just a rudimentary understanding of the market. After all, cloud is characterized by low customer prices, which continue to fall, and competitors such as Oracle are promising services that would cut AWS customers’ bills “in half.”
However, given the scale and reach of cloud computing, some regulators will look more closely and beyond the data privacy and security laws that have already been enacted. The ability of lawmakers to create additional oversight will be complicated by the vast number of multi-faceted services available. The number of cloud services offered by AWS, Google and Microsoft has almost tripled in the past three years to nearly 300. Understanding the capabilities and requirements of AWS’ services (and those of other cloud providers) is so complex it is now a specialized career, including for third-party consultants.
Providers’ pricing structures and billing documentation are also highly complex. There is a standard metric for virtual machines (fixed per hour) and for storage (gigabytes per month) but multiple and varied metrics are billed for server-less, database, load balancer and other services. Providers have online calculators to help with this, but they can leave out critical (and potentially expensive) components such as data transfer, which obfuscates pricing for the buyer. Regulators may require simplified pricing and billing and greater visibility into how services and products are tied together, potentially leading to the decoupling of some service bundles.
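The effect of an omitted line item is easy to see with a toy bill calculation. The sketch below uses purely hypothetical rates and figures, not any provider's actual pricing:

```python
# Illustrative only: all rates below are hypothetical placeholders,
# not any cloud provider's actual pricing.

def estimate_monthly_bill(vm_hours, vm_rate_per_hour,
                          storage_gb, storage_rate_per_gb_month,
                          egress_gb=0.0, egress_rate_per_gb=0.0):
    """Sum the line items a simple pricing calculator might show."""
    compute = vm_hours * vm_rate_per_hour
    storage = storage_gb * storage_rate_per_gb_month
    transfer = egress_gb * egress_rate_per_gb   # often left out of calculators
    return compute + storage + transfer

# One VM running all month plus 500 GB of storage:
base = estimate_monthly_bill(730, 0.10, 500, 0.023)
# The same workload once a data-transfer line item is included:
full = estimate_monthly_bill(730, 0.10, 500, 0.023,
                             egress_gb=2000, egress_rate_per_gb=0.09)
print(f"calculator estimate: ${base:.2f}")   # $84.50
print(f"with data transfer:  ${full:.2f}")   # $264.50
```

In this sketch, adding a plausible data-transfer line item roughly triples the estimate, which is the kind of surprise a buyer only discovers on the first invoice and that regulators may want calculators to surface.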
This raises another area for potential oversight: broader and deeper inter-operability among services. Multi-cloud and hybrid IT environments are common, but organizations are often forced to choose a primary cloud platform to manage their development and IT environments effectively. Greater inter-operability among platforms, development tools and services would give customers greater practical choice and lower their risk of vendor lock-in. This could limit the ability of some providers to retain or gain market share.
It could also crimp their pricing strategies for non-cloud products. For example, beginning in October 2019, Microsoft will end its Bring Your Own License (BYOL) model that has enabled users of Microsoft software in their own data centers to migrate those workloads to a dedicated (single-tenant) host in a public cloud using the same on-premises license, without additional cost. Soon, however, Microsoft licenses will be transferable only to dedicated hosts in Microsoft’s Azure cloud. Customers that wish to use a dedicated host in AWS, Google or Alibaba, for example, will have to pay a fee in addition to the standard license cost. Regulators may seek to curb these types of approaches that effectively “tax” users of competitors’ services.
Visibility, please
The most likely scenario, at least in the short term, may not be antitrust but enforced openness — specifically, greater transparency of cloud infrastructure, with regulators insisting on more oversight of the resiliency of critical systems running in the cloud, especially in financial services. As a 2018 US Treasury report noted, bank regulations have not “sufficiently modernized” to accommodate cloud.
Regulators are beginning to look into this issue. In the UK, the Bank of England is investigating large banks’ reliance on cloud as part of a broader risk-reduction initiative for financial digital services. The European Banking Authority specifically states that an outsourcer/cloud operator must allow site inspections of data centers. And in the US, the Federal Reserve has conducted a formal examination of at least one AWS data center, in Virginia, with a focus on its infrastructure resiliency and backup systems. More site visits are expected.
Cloud providers may have to share more granular information about the operational redundancy and fail-over mechanisms of their infrastructure, enabling customers to better assess their risk. Some providers may invest in highly resilient infrastructure for certain services and either accept lower operating profit margins or charge a premium for those services.
The creation of any new laws would take years to play out. Meanwhile, market forces could compel providers to make their services simpler to buy and more inter-operable, and the associated risks more transparent. Some are already moving in this direction, but the threat of regulatory impact, real or perceived, could speed up the development of these improvements.
In the movie “Mary Poppins,” Mr. Banks sings that a British bank must be run with precision, and that “Tradition, discipline and rules must be the tools.” Otherwise, he warns, “Disorder! Chaos!” will ensue.
One rule, introduced by the UK Financial Conduct Authority (FCA) in 2018, suggests that disorder and chaos might be quite common in the IT departments of many UK banks. The rule made it mandatory for retail banks to report online service disruptions caused by IT problems. The first year’s figures, published this summer, show that banks, on average, suffered an IT outage or security issue nearly every month.
The numbers (shown in the table below) are far higher than any published elsewhere or previously. One bank, Barclays, averaged an incident nearly every ten days. But all the big banks suffered frequent problems. Given this, it is no surprise that the UK Treasury Select Committee has called for action, and the Bank of England is planning on introducing new operational resilience rules for all financial institutions operating in the UK.
There may be certain mitigating factors relating to the retail banking figures: many of the banks are large and have multiple services, and some of the banks have common ownership or shared systems, so there may be some “double counting” of underlying problems. And the data has grouped security incidents with IT availability problems, which has boosted the overall numbers.
Even so, there is clearly an issue in retail banking/financial services that is similar to the one affecting airlines, which we discussed in our “Creeping Criticality Note” (available to Uptime Institute Network members). The growing demand for 24-hour availability — at ever increasing scale — and the need to support ever more services is running ahead of some companies’ ability to deliver it. This seems to be affecting many financial services organizations. The FCA, which has responsibility for all UK financial services (such as insurance, pensions, etc.), recently said the number of “operational resilience” breaks reported increased to 916 for the year 2018-19 from 229 the year before – a 300% increase.
The FCA’s Director of Supervision Meghan Butler made two important observations when the numbers were published: First, a substantial number of new incidents were caused by IT and software issues (including, we assume, data center power/networking) and not so much by cyber-attacks. She noted the need for better management and processes. And second, the increase, while real, also reflects the fact that more incidents are now being reported.
The last point is critical: Dozens of financial services giants and hundreds of innovative fintech startups call the UK home. Is this a uniquely UK problem? Is IT in this sector particularly bad? There may be — indeed, there are known to be — problems with modernizing huge, legacy infrastructure while in flight. But many US banks, for example, have suffered similar issues.
Undoubtedly, a big proportion of the rise is due to the fact that the FCA now insists that incidents are reported. Outage reporting across almost all industries is at best ad hoc — notwithstanding confidential reporting systems run by the Uptime Institute Network (the Abnormal Incident Report, or AIR, database) and the Data Center Incident Reporting Network (DCIRN), an independent body set up in 2017 by veteran data center designer Ed Ansett. Even allowing for media reports, many also tracked by Uptime, the vast majority of outages, even at major firms, are never reported to anyone, so the lessons cannot be learned. While outages at consumer-facing internet services can be spotted by web-based outage reporting services using software tools, often there are no real details other than data that points to a loss or slowdown in service.
It has often been suggested that for lessons to be shared, mandatory reporting of outages — in detail — is needed, at least by large or important organizations or services. The FCA’s data supports the view that many outages are hidden. Now that they have surfaced, the next question is: How can the lessons of these failures be more openly and widely shared?
More information on this topic is available to members of the Uptime Institute Network.
Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected] | 2019-10-14 | Outage Reporting in Financial Services
A previous Uptime Intelligence Note suggested that avoiding data center outages might be as simple as trying harder. The Note suggested that management failures are the main reason that enterprises continue to experience downtime incidents, even in fault tolerant facilities. “The Intelligence Trap,” a new book by David Robson, sheds light on why management mistakes continue to plague data center operations. The book suggests that management must be aware of its own limitations if it is going to properly staff operations, develop training programs and evaluate risk. Robson also notes that modern education practices, especially for children in the US and Europe, have created notions about intelligence that eventually undermine management efforts and also inform opposing and intractable political arguments about wider societal issues such as climate change and the anti-vaccination movement.
Citing dozens of anecdotes and psychological studies, Robson carefully examines how the intelligence trap ensnares almost everyone, including notables such as Arthur Conan Doyle, Albert Einstein and other famous, high IQ individuals — even Nobel laureates. Brilliant as they are, many fall prey to a variety of hoaxes. Some believe a wide variety of conspiracy theories and junk science, in many cases persuading others to join them.
The foundations of the trap, Robson notes, begin with the early and widespread acceptance of the IQ test, its narrow definition of intelligence and expectations that “smart” people will perform better at all tasks than others. Psychologist Keith Stanovich says intelligence tests “work quite well for what they do,” but are a poor metric of overall success. He calls the link between intelligence and rationality weak.
Studies show that geniuses experience deference in fields far beyond their areas of expertise. Yet they are at least as apt to use their intelligence to craft arguments that justify their existing positions as they are to open themselves to new information that may cause them to change their opinions.
In some cases, ingrained patterns of thinking may further undermine these geniuses, with Robson noting that experts often develop mental shortcuts, based on experience, that enable them to work quickly on routine cases. In the medical field, these patterns contribute to a higher rate of misdiagnosis: in the US, one in six initial diagnoses proves to be incorrect, contributing to an estimated 40,000-80,000 patient deaths per year.
The intelligence trap can catch organizations, too. The US FBI held firmly to a mistaken fingerprint reading that implicated an innocent man in the 2004 terrorist bombing in Madrid, even after non-forensic evidence emerged that showed the suspect had played no role in the attack. FBI experts were not even persuaded by Spanish authorities who found discrepancies between the print found at the bombing and the suspect’s fingerprint.
Escaping the intelligence trap is difficult, but possible. However, ingrained patterns of thinking lead successful individuals to surround themselves with too many experts, forming non-diverse teams that do not function well together. High levels of intelligence and experience and lack of diversity can lead to overconfidence that increases their appetite for risk. Competition among accomplished team members can also lead to poor team function.
Data center owners and operators should note that even highly functional teams sometimes process close calls as successes, which leads them to discount the likelihood of a serious incident. NASA suffered two well-known space shuttle disasters after its engineers began to downplay known risks due to a string of successful missions. Prior to the loss of the Challenger, many shuttles had survived the faulty seals that caused the 1986 disaster, and every shuttle had survived damage to foam insulation until the Columbia disaster in 2003. Prior experience had reduced the perceived risk of these dangerous conditions, a phenomenon called “outcome bias.”
The result, says Catherine Tinsley, a professor of management at Georgetown specializing in corporate catastrophe, is a subtle but distinct increase in risk appetite. She says, “I’m not here to tell you what your risk tolerance should be. I’m here to say that when you experience a near miss, your risk tolerance will increase and you won’t be aware of it.”
The real question here is not whether mistakes and disasters are inevitable. Rather, it’s how to become conscious of the subconscious. Staff should be trained to recognize the limitations that cause bad decisions. And perhaps more succinctly, data center operators and owners should recognize that procedures must not only be foolproof — they must be “expert-proof” as well.
More information on this topic is available to members of the Uptime Institute Network.
Kevin Heslin | 2019-10-07 | How to Avoid Outages – Part Deux
In a recent Uptime Institute Intelligence note, we considered a June 2019 report issued by the US Government Accountability Office (GAO) on the IT resiliency of US airlines. The GAO wanted to better understand if the all-too-frequent IT outages and resultant chaos passengers face have any common causes and if so, how they could be addressed. (Since we published that Note, the UK-owned carrier British Airways suffered its second big outage in two years, once again stranding tens of thousands of passengers and facing heavy costs.)
The GAO report didn’t really uncover much new: there was, in some cases, a need for better testing, a little more redundancy needed here and there, certainly some improved processes. But despite suspicions of under-investment, there was nothing systemic. The causes of the outages were varied and, although often avoidable when looked at in hindsight, not necessarily predictable.
There is, however, still an undeniable pattern. Uptime Institute’s own analysis of three years of public, media-reported outages shows that two industries, airlines and retail financial services, do appear to suffer significantly more highly disruptive (category 4 and 5), high profile outages than other industries.
To be clear: these businesses do not necessarily have more outages, but rather they suffer a higher number of highly disruptive outages, and as a result, get more negative publicity when there is a problem. Cloud providers are not far behind.
Why is this? The reasons may vary, but these businesses very often offer services on which large numbers of people depend, for which almost any interruption causes immediate losses and negative publicity, and in which it may not be easy to get back to the status quo.
Another trait that seems to set these businesses apart is that their almost complete dependency on IT is relatively recent (or they may be a new IT service or industry), so they may not yet have invested to the same levels as, for example, an investment bank, stock exchange or a power utility. In these last examples, the mission-critical nature of the business has long been clear, they are probably regulated, and so have investments and processes fully in place.
Organizations have long conducted business impact analyses, and there are various methodologies and tools available to help carry these out. Uptime Institute has been researching this area, particularly to see how organizations might specifically address the business impacts of failures in digital infrastructure. One simple approach is to create a “vulnerability” rating for each application/service, with scores attributed across a number of factors. Some of our thinking — and this is not comprehensive — is outlined below:
Profile. Certain industries are consumer facing, large scale or have a very public brand. A high score in this area means even small failures — Facebook’s outages are a good example — will have a big public impact.
Failure sensitivity. Sensitive industries are those for which an outage has immediate and high impact. If an investment bank can’t trade, planes can’t take off or clients can’t access their money, the sensitivity is high.
Recover-ability. Organizations that take a lengthy time to restore normal service will suffer more seriously from IT failures. The costs of an outage may be multiplied many times over if the recovery time is lengthy. For example, airlines may find it takes days to get all planes and crews in the right location to restore normal operations.
Regulatory/compliance. Failures in certain industries either must be reported or will attract attention from regulators. Emergency services (e.g., 911, 999, 112), power companies and hospitals are good examples … and this list is growing.
Platform dependents. Organizations whose customers include service providers — such as software as a service; infrastructure as a service; and co-location, hosting and cloud-based service providers — will, in the event of an outage, not only breach service level agreements but also lose paying clients. (There are many examples of this.)
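As a sketch of how such a rating might be computed, the snippet below combines per-factor scores into a single number. The factor names, weights and example scores are illustrative assumptions, not an actual Uptime Institute methodology:

```python
# A sketch of a per-application "vulnerability" rating. Factor names,
# weights and scores are illustrative assumptions only.

FACTORS = ("profile", "failure_sensitivity", "recoverability",
           "regulatory", "platform_dependents")

def vulnerability_rating(scores, weights=None):
    """Combine per-factor scores (1-5) into a single 0-100 rating."""
    if weights is None:
        weights = {f: 1.0 for f in FACTORS}   # equal weighting by default
    total = sum(scores[f] * weights[f] for f in FACTORS)
    max_total = sum(5 * weights[f] for f in FACTORS)
    return 100.0 * total / max_total

# Hypothetical airline booking system: high profile, very failure
# sensitive, slow to recover, some regulatory exposure:
airline_booking = {"profile": 5, "failure_sensitivity": 5,
                   "recoverability": 4, "regulatory": 3,
                   "platform_dependents": 1}
print(vulnerability_rating(airline_booking))  # 72.0
```

An organization could tune the weights to its own business (a platform provider, for instance, might weight "platform_dependents" heavily) and re-score applications periodically as dependencies change.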
One of the challenges of carrying out assessments is that the impact of any particular service’s or application’s failing is changing, in two ways. First, in most cases, it is increasing, along with the dependency of all businesses and consumers on IT. And second, it is becoming more complicated and harder to determine accurately, largely because of the inter-dependency of many different systems and applications, intertwined to support different processes and services. There may even be a hockey stick curve, with the impact of failures growing rapidly as more systems, people and businesses are involved.
Looked at like this, it is clear that certain organizations have become more vulnerable to high impact outages than they were a year or two previously, because while the immediate impact on sales/revenue may not have changed, the scale, profile or recover-ability may have. It may be that airlines, which only two years ago could board passengers manually, can no longer do so without IT; similarly, retail banking customers used to carry sufficient cash or checks to get themselves a meal and get home. Not anymore. These organizations now have a very tricky problem: How do they upgrade their infrastructure, and their processes, to a level of mission criticality for which they were not designed?
All this raises a further tricky question that Uptime Institute is researching: Which industries, businesses or services have become (or will become) critical to the national infrastructure — even if a few years ago they certainly were not (or they are not currently)? And how should these services be regulated (if they should …)? We are seeking partners to help with this research. Organizations are not the only ones struggling with these questions — governments are as well.
More information on this topic is available to members of the Uptime Institute Network.
Andy Lawrence, Executive Director of Research, Uptime Institute, [email protected] | 2019-09-30 | Why do some industries and organizations suffer more serious, high profile outages than others?
Uptime Institute has spent years analyzing the root causes of data center and service outages, surveying thousands of IT professionals on this topic throughout the year. According to the data, the vast majority of data center failures are caused by human error. Some industry experts report numbers as high as 75%, but Uptime Institute generally reports about 70%, based on the wealth of data we gather continuously. That finding immediately raises an important question: Just how preventable are most outages?
Certainly, the number of outages remains persistently high, and the associated costs of these outages are also high. Uptime Institute data from the past two years demonstrates that almost one-third of data center owners and operators experienced a downtime incident or severe degradation of service in the past year, and half in the previous three years. Many of these incidents had severe financial consequences, with 10% of the 2019 respondents reporting that their most recent incident cost more than $1 million.
These findings, and others related to the causes of outages, are perhaps not unexpected. But more surprisingly, in Uptime Institute’s April 2019 survey, 60% of respondents believed that their most recent significant downtime incident could have been prevented with better management/processes or configuration. For outages that cost greater than $1 million, this figure jumped to 74%, and then leveled out around 50% as the outage costs increased to more than $40 million. These numbers remain persistently high, given the existing knowledge available on the causes and sources of downtime incidents and the costs of many downtime incidents.
Data center owners and operators know that on-premises power failures continue to cause the most outages (33%), with network and connectivity issues close behind (31%). Additional failures attributed to colocation providers could also have been prevented by the provider.
These findings should be alarming to everyone in the digital infrastructure business. After years of building data centers, and adding complex layers of features and functionality, not to mention dynamic workload migration and orchestration, the industry’s report card on actual service delivery performance is less than stellar. And while these sorts of failures should be very rare in concurrently maintainable and fault tolerant facilities when appropriate and complete procedures are in place, what we are finding is that the operational part of the story falls flat. Simply put, if humans worked harder to MANAGE well-designed and constructed facilities better, we would have fewer outages.
Uptime Institute consultants have underscored the critically important role procedures play in data center operations. They remind listeners that having and maintaining appropriate and complete procedures is essential to achieving performance and service availability goals. These same procedures can also help data centers meet efficiency goals, even in conditions that exceed planned design days. Among other benefits, well conceived procedures and the extreme discipline to follow these procedures helps operators cope with strong storms, properly perform maintenance and upgrades, manage costs and, perhaps most relevant, restore operations quickly after an outage.
So why, then, does the industry continue to experience downtime incidents, given that the causes have been so well pinpointed, the costs are so well-known and the solution to reducing their frequency (better processes and procedures) is so obvious? We just don’t try hard enough.
When asking our constituents about the causes of their outages, there are perhaps as many explanations as there are respondents. Here are just a few questions to consider when looking internally at your own risks and processes:
Does the complexity of your infrastructure, especially the distributed nature of it, increase the risk that simple errors will cascade into a service outage?
Is your organization expanding critical IT capacity faster than it can attract and apply the resources to manage that infrastructure?
Has your organization started to see any staffing and skills shortage, which may be starting to impair mission-critical operations?
Do your concerns about cyber vulnerability and data security outweigh concerns about physical infrastructure?
Does your organization shortchange training and education programs when budgeting?
Does your organization under-invest in IT operations, management and other business management functions?
Does your organization truly understand change management, especially when many of your workloads may already be shared across multiple servers, in multiple facilities or in entirely different types of IT environments including co-location and the cloud?
Does your organization consider the needs at the application level when designing new facilities or cloud adoptions?
Perhaps there is simply a limit to what can be achieved in an industry that still relies heavily on people to perform many of the most basic and critical tasks and thus is subject to human error, which can never be completely eliminated. However, a quick survey of the issues suggests that management failure — not human error — is the main reason that outages persist. By under-investing in training, failing to enforce policies, allowing procedures to grow outdated, and underestimating the importance of qualified staff, management sets the stage for a cascade of circumstances that leads to downtime. If we try harder, we can make progress. If we leverage the investments in physical infrastructure by applying the right level of operational expertise and business management, outages will decline.
We just need to try harder.
More information on this and similar topics is available to members of the Uptime Institute Network.
Kevin Heslin | 2019-09-23 | How to avoid outages: Try harder!
In the recently published 2019 Uptime Institute supplier survey, participants told us they are witnessing higher than normal data center spending patterns. This is in line with general market trends, driven by the demand for data and digital services. It is also a welcome sign for those suppliers who witnessed a downturn two to three years ago, as public cloud began to take a bite.
The increase in spending is not only by hyperscalers known to be designing for 100x scalability and building for 10x growth. Smaller facilities (under 20 MW) are also seeing continued investment, including in higher levels of redundancy at primary sites (a trend that may have surprised some).
However, this growth continues to raise concerns. In this year’s survey, the top challenge operators face, as identified by suppliers, is forecasting future data center capacity requirements. This is followed by the need to maintain competitive and cost-efficient operations compared with cloud/colocation. Managing different data center environments dropped to fourth place, after coming second in last year’s supplier survey. This finding agrees with the results of our 2019 operator survey (of around 1,000 data center operators around the world). In that survey, our analysis attributed the change to the advancement in tools and market maturity.
The figure below shows the top challenges operators faced in 2018 and 2019, as reported by their suppliers:
Forecasting data center capacity is a long-standing issue. Rapid changes in technology and the difficulty of anticipating future workload growth at a time when there are so many choices complicate matters. Over-provisioning capacity, the most commonly adopted strategy, leads to inefficiencies in operations (and unnecessary upfront investment). Against this, under-provisioning capacity is an operational risk and could also mean facilities reach their limit before their planned investment life-cycle.
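The under-provisioning risk can be illustrated with a toy model: count how many years within a planning horizon projected demand would outrun installed capacity. All figures below are hypothetical assumptions:

```python
# A toy model of under-provisioning risk: count the years in which
# projected demand exceeds installed capacity. All figures are
# hypothetical assumptions, not forecasts.

def capacity_shortfall_years(installed_mw, demand_mw, growth_rate, years):
    """Years within the horizon where demand outruns installed capacity."""
    shortfall = 0
    for _ in range(years):
        if demand_mw > installed_mw:
            shortfall += 1
        demand_mw *= 1 + growth_rate   # compound annual demand growth
    return shortfall

# Demand starting at 8 MW and growing 15% a year, over a 10-year horizon:
print(capacity_shortfall_years(20, 8.0, 0.15, 10))  # 3 years short
print(capacity_shortfall_years(10, 8.0, 0.15, 10))  # 8 years short
```

Even this crude sketch shows why the forecast assumptions matter so much: a facility sized at 10 MW spends most of its planned life over its limit, while sizing at 20 MW buys years of headroom at the cost of carrying idle, inefficient capacity early on.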
Depending on the sector and type of workload, many organizations have adopted modular data center designs, which can be an effective way to alleviate the expense of over-provisioning. Where appropriate, some operators also move highly unpredictable or the most easily/economically transported workloads to public cloud environments. These strategies, plus various other factors driving the uptake of mixed IT infrastructures, mean more organizations are accumulating expertise in managing hybrid environments. This may explain why the challenge of managing different data center environments dropped to fourth place in our survey this year. Additionally, cloud computing suppliers are offering more effective tools to help customers better manage their costs when running cloud services.
The adoption of cloud-first policies by many operators means managers are having to demonstrate cost-effectiveness more than ever. This means that understanding the true cost of maintaining in-house facilities versus the cost of cloud/colocation venues is becoming more important, as the survey results above show.
The 2019 Uptime Institute operator survey also reflects this. Forty percent of participants indicated that they are not confident in their organization’s ability to compare costs between in-house versus cloud/colocation environments. Indeed, this is not a straightforward exercise. On the one hand, the structure of some enterprises (e.g., how budgets are split) makes calculating the cost of running owned sites tricky. On the other hand, calculating the true cost of moving to the cloud is also not straightforward. There may be costs inherent in the transition related to application re-engineering, potential repatriation or network upgrades for example (and there is now a vast choice of cloud offerings that require careful costing and management). Among other issues, such as vendor lock-in, this complexity is now driving many to change their policies to be more about cloud appropriateness, rather than cloud-first.
Want to know more details? The full report Uptime Institute data center supply-side survey 2019 is available to members of the Uptime Institute Network which can be found here.
Big Tech Regulations: An Upside for Cloud Customers?
By Rhonda Ascierto, Vice President, Research, Uptime Institute

Regulation of Internet giants has focused so far mostly on data privacy, where concerns are relatively well understood by lawmakers and the general public. At the same time, the threat of antitrust action is growing. Congressional hearings in the US with Amazon, Apple, Facebook and Google have begun, and governments in Europe, India and elsewhere are undertaking or considering similar probes. Regulatory discourse has centered on social media, search, digital advertising and e-commerce.
Yet there is another area for which increased pressure is likely: cloud computing services, which play an increasingly critical role in many industries. Two big players, Amazon and Google, are already being scrutinized by lawmakers. Greater regulatory attention seems inevitable.
Deregulating big cloud
Given its dominance, some governments may initially focus on Amazon Web Services (AWS). With almost $26 billion in revenue last year — a 47% annual increase — AWS’ market share is larger than those of its two biggest competitors combined: Microsoft Azure, its closest rival, and Google Cloud Platform, which trails but is growing fast. At least one Wall Street analyst has publicly called for Amazon to spin off AWS as a separate business to help avoid regulatory pressure in other areas. As a stand-alone company, AWS could become more focused and business-friendly.
Breaking up big cloud would, however, be both politically controversial and technically difficult. If cloud is a vast experiment on a global scale, then breaking it up would be too.
If applications and services were separated from suppliers’ cloud platforms, would the underlying infrastructure they have built, such as data centers and networks, also be separated? What would happen to customers’ agreements, and how would third-party service providers whose businesses sit on top of cloud platforms (and infrastructure) function?
For the data center sector, there is also the question of what happens to the thousands of long-term agreements that colocation and wholesale providers have with the cloud giants (and which many rely on). These agreements would likely stay in place, but with whom?
Valuing the big providers would also be problematic; part of their value is their tightly integrated ‘one-stop-shop’ services. And what about their non-cloud service businesses that run in their cloud data centers and over the wide area networks they have built?
Consumer harm?
Any antitrust action, at least in some countries such as the US, would carry the burden of demonstrating consumer harm. On the face of it, consumer harm could appear to be lacking to regulators that — as with other innovations — have only a rudimentary understanding of the market. After all, cloud is characterized by low customer prices, which continue to fall, and competitors such as Oracle are promising services that would cut AWS customers’ bills “in half.”
However, given the scale and reach of cloud computing, some regulators will look more closely and beyond the data privacy and security laws that have already been enacted. The ability of lawmakers to create additional oversight will be complicated by the vast number of multi-faceted services available. The number of cloud services offered by AWS, Google and Microsoft has almost tripled in the past three years to nearly 300. Understanding the capabilities and requirements of AWS’ services (and those of other cloud providers) is so complex it is now a specialized career, including for third-party consultants.
Providers’ pricing structures and billing documentation are also highly complex. There is a standard metric for virtual machines (fixed per hour) and for storage (gigabytes per month) but multiple and varied metrics are billed for server-less, database, load balancer and other services. Providers have online calculators to help with this, but they can leave out critical (and potentially expensive) components such as data transfer, which obfuscates pricing for the buyer. Regulators may require simplified pricing and billing and greater visibility into how services and products are tied together, potentially leading to the decoupling of some service bundles.
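To make the obfuscation concrete, here is a minimal Python sketch of the line-item arithmetic involved. All unit prices below are invented placeholders, not any provider's actual rates; the point is only that a calculator which omits data transfer materially understates the estimate.

```python
# Illustrative cloud bill arithmetic. All unit prices below are invented
# placeholders for the sake of the example, not any provider's real rates.

def monthly_estimate(vm_hours, storage_gb, egress_gb,
                     vm_rate=0.10,        # $ per VM-hour (hypothetical)
                     storage_rate=0.023,  # $ per GB-month (hypothetical)
                     egress_rate=0.09,    # $ per GB transferred out (hypothetical)
                     include_egress=True):
    """Sum the line items a simple online calculator might show."""
    total = vm_hours * vm_rate + storage_gb * storage_rate
    if include_egress:
        total += egress_gb * egress_rate
    return round(total, 2)

# One VM running all month (730 hours), 500 GB stored, 2 TB transferred out.
with_egress = monthly_estimate(730, 500, 2000)
without_egress = monthly_estimate(730, 500, 2000, include_egress=False)
print(with_egress, without_egress)  # the omitted line item dominates the bill
```

With these invented rates, leaving out data transfer drops the estimate from $264.50 to $84.50, the kind of gap a buyer may only discover on the first invoice.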
This raises another area for potential oversight: broader and deeper inter-operability among services. Multi-cloud and hybrid IT environments are common, but organizations are often forced to choose a primary cloud platform to manage their development and IT environments effectively. Greater inter-operability among platforms, development tools and services would give customers greater practical choice and lower their risk of vendor lock-in. This could limit the ability of some providers to retain or gain market share.
It could also crimp their pricing strategies for non-cloud products. For example, beginning in October 2019, Microsoft will end its Bring Your Own License (BYOL) model that has enabled users of Microsoft software in their own data centers to migrate those workloads to a dedicated (single-tenant) host in a public cloud using the same on-premises license, without additional cost. Soon, however, Microsoft licenses will be transferable only to dedicated hosts in Microsoft’s Azure cloud. Customers that wish to use a dedicated host in AWS, Google or Alibaba, for example, will have to pay a fee in addition to the standard license cost. Regulators may seek to curb these types of approaches that effectively “tax” users of competitor’s services.
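A hedged sketch of the economics described above, using entirely invented figures: the same dedicated host becomes more expensive once an additional license fee applies because the existing on-premises license no longer transfers.

```python
# Invented figures illustrating the licensing change described above: the
# same dedicated host costs more once an additional license fee applies,
# because the existing on-premises license no longer transfers.

def dedicated_host_cost(host_rate, hours, license_fee=0.0):
    """Host time plus any additional license cost (all figures hypothetical)."""
    return host_rate * hours + license_fee

home_cloud = dedicated_host_cost(2.00, 730)  # license transfers at no cost
rival_cloud = dedicated_host_cost(2.00, 730, license_fee=500.0)  # extra fee
print(rival_cloud - home_cloud)  # the effective "tax" on using a competitor
```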
Visibility, please
The most likely scenario, at least in the short term, may not be antitrust but enforced openness — specifically, greater transparency of cloud infrastructure, with regulators insisting on more oversight of the resiliency of critical systems running in the cloud, especially in financial services. As a 2018 US Treasury report noted, bank regulations have not “sufficiently modernized” to accommodate cloud.
Regulators are beginning to look into this issue. In the UK, the Bank of England is investigating large banks’ reliance on cloud as part of a broader risk-reduction initiative for financial digital services. The European Banking Authority specifically states that an outsourcer/cloud operator must allow site inspections of data centers. And in the US, the Federal Reserve has conducted a formal examination of at least one AWS data center, in Virginia, with a focus on its infrastructure resiliency and backup systems. More site visits are expected.
Cloud providers may have to share more granular information about the operational redundancy and fail-over mechanisms of their infrastructure, enabling customers to better assess their risk. Some providers may invest in highly resilient infrastructure for certain services and either accept lower operating profit margins or charge a premium for those services.
The creation of any new laws would take years to play out. Meanwhile, market forces could compel providers to make their services simpler to buy, more inter-operable, and the associated risks more transparent. Some are already moving in this direction, but the threat of regulatory impact, real or perceived, could speed up these types of improvements.
Outage Reporting in Financial Services
By Andy Lawrence, Executive Director of Research, Uptime Institute

In the movie “Mary Poppins,” Mr. Banks sings that a British bank must be run with precision, and that “Tradition, discipline and rules must be the tools.” Otherwise, he warns, “Disorder! Chaos!” will ensue.
One rule, introduced by the UK Financial Conduct Authority (FCA) in 2018, suggests that disorder and chaos might be quite common in the IT departments of many UK banks. The rule requires retail banks to report online service disruptions caused by IT problems. The first year’s figures, published this summer, show that banks, on average, suffered an IT outage or security issue nearly every month.
The numbers (shown in the table below) are far higher than any published elsewhere or previously. One bank, Barclays, averaged an incident nearly every ten days. But all the big banks suffered frequent problems. Given this, it is no surprise that the UK Treasury Select Committee has called for action, and the Bank of England plans to introduce new operational resilience rules for all financial institutions operating in the UK.
There may be certain mitigating factors relating to the retail banking figures: many of the banks are large and have multiple services, and some of the banks have common ownership or shared systems, so there may be some “double counting” of underlying problems. And the data has grouped security incidents with IT availability problems, which has boosted the overall numbers.
Even so, there is clearly an issue in retail banking/financial services that is similar to the one affecting airlines, which we discussed in our “Creeping Criticality” Note (available to Uptime Institute Network members). The growing demand for 24-hour availability — at ever increasing scale — and the need to support ever more services are running ahead of some companies’ ability to deliver. This seems to be affecting many financial services organizations. The FCA, which has responsibility for all UK financial services (such as insurance, pensions, etc.), recently said the number of “operational resilience” breaks reported increased to 916 for the year 2018-19 from 229 the year before – a 300% increase.
The FCA’s Director of Supervision, Megan Butler, made two important observations when the numbers were published. First, a substantial number of new incidents were caused by IT and software issues (including, we assume, data center power/networking) and not so much by cyber-attacks; she noted the need for better management and processes. And second, the increase, while real, also reflects the greater number of incidents now being reported.
The last point is critical: Dozens of financial services giants and hundreds of innovative fintech startups call the UK home. Is this a uniquely UK problem? Is IT in this sector particularly bad? There may be — indeed, there are known to be — problems with modernizing huge, legacy infrastructure while in flight. But many US banks, for example, have suffered similar issues.
Undoubtedly, a big proportion of the rise is due to the fact that the FCA now insists that incidents are reported. Outage reporting across almost all industries is at best ad hoc — notwithstanding confidential reporting systems run by the Uptime Institute Network (the Abnormal Incident Report, or AIR, database) and the Data Center Incident Reporting Network (DCIRN), an independent body set up in 2017 by veteran data center designer Ed Ansett. Even allowing for media reports, many also tracked by Uptime, the vast majority of outages, even at major firms, are never reported to anyone, so the lessons cannot be learned. While outages at consumer-facing internet services can be spotted by web-based outage reporting services using software tools, often there are no real details other than data that points to a loss or slowdown in service.
It has often been suggested that for lessons to be shared, mandatory reporting of outages — in detail — is needed, at least by large or important organizations or services. The FCA’s data supports the view that many outages are hidden. Now that they have surfaced, the next question is: How can the lessons of these failures be more openly and widely shared?
More information on this topic is available to members of the Uptime Institute Network which can be found here.
How to Avoid Outages – Part Deux
By Kevin Heslin

A previous Uptime Intelligence Note suggested that avoiding data center outages might be as simple as trying harder. The Note argued that management failures are the main reason that enterprises continue to experience downtime incidents, even in fault tolerant facilities. “The Intelligence Trap,” a new book by David Robson, sheds light on why management mistakes continue to plague data center operations. The book suggests that management must be aware of its own limitations if it is going to properly staff operations, develop training programs and evaluate risk. Robson also notes that modern education practices, especially for children in the US and Europe, have created notions about intelligence that eventually undermine management efforts and also inform opposing and intractable political arguments about wider societal issues such as climate change and the anti-vaccination movement.
Citing dozens of anecdotes and psychological studies, Robson carefully examines how the intelligence trap ensnares almost everyone, including notables such as Arthur Conan Doyle, Albert Einstein and other famous, high IQ individuals — even Nobel laureates. Brilliant as they are, many fall prey to a variety of hoaxes. Some believe a wide variety of conspiracy theories and junk science, in many cases persuading others to join them.
The foundations of the trap, Robson notes, begin with the early and widespread acceptance of the IQ test, its narrow definition of intelligence and expectations that “smart” people will perform better at all tasks than others. Psychologist Keith Stanovich says intelligence tests “work quite well for what they do,” but are a poor metric of overall success. He calls the link between intelligence and rationality weak.
Studies show that geniuses experience deference in fields far beyond their areas of expertise. Yet they are at least as apt to use their intelligence to craft arguments that justify their existing positions as to open themselves to new information that may cause them to change their opinions.
In some cases, ingrained patterns of thinking may further undermine these geniuses, with Robson noting that experts often develop mental shortcuts, based on experience, that enable them to work quickly on routine cases. In the medical field, these patterns contribute to a higher rate of misdiagnosis: in the US, one in six initial diagnoses proves to be incorrect, errors estimated to contribute to 40,000-80,000 patient deaths per year.
The intelligence trap can catch organizations, too. The US FBI held firmly to a mistaken fingerprint reading that implicated an innocent man in the 2004 terrorist bombing in Madrid, even after non-forensic evidence emerged that showed the suspect had played no role in the attack. FBI experts were not even persuaded by Spanish authorities who found discrepancies between the print found at the bombing and the suspect’s fingerprint.
Escaping the intelligence trap is difficult, but possible. However, ingrained patterns of thinking lead successful individuals to surround themselves with too many experts, forming non-diverse teams that do not function well together. High levels of intelligence and experience, combined with a lack of diversity, can lead to overconfidence that increases the appetite for risk. Competition among accomplished team members can also lead to poor team function.
Data center owners and operators should note that even highly functional teams sometimes process close calls as successes, which leads them to discount the likelihood of a serious incident. NASA suffered two well-known space shuttle disasters after its engineers began to downplay known risks due to a string of successful missions. Prior to the loss of the Challenger, many shuttles had survived the faulty seals that caused the 1986 disaster, and every shuttle had survived damage to foam insulation until the Columbia disaster in 2003. Prior experience had reduced the perceived risk of these dangerous conditions, a phenomenon called “outcome bias.”
The result, says Catherine Tinsley, a professor of management at Georgetown specializing in corporate catastrophe, is a subtle but distinct increase in risk appetite. She says, “I’m not here to tell you what your risk tolerance should be. I’m here to say that when you experience a near miss, your risk tolerance will increase and you won’t be aware of it.”
The real question here is not whether mistakes and disasters are inevitable. Rather, it’s how to become conscious of the subconscious. Staff should be trained to recognize the limitations that cause bad decisions. Perhaps most importantly, data center operators and owners should recognize that procedures must not only be foolproof — they must be “expert-proof” as well.
More information on this topic is available to members of the Uptime Institute Network. Membership information can be found here.
Why do some industries and organizations suffer more serious, high profile outages than others?
By Andy Lawrence, Executive Director of Research, Uptime Institute

In a recent Uptime Institute Intelligence note, we considered a June 2019 report issued by the US Government Accountability Office (GAO) on the IT resiliency of US airlines. The GAO wanted to better understand whether the all-too-frequent IT outages, and the resultant chaos passengers face, have any common causes and, if so, how they could be addressed. (Since we published that Note, the UK-owned carrier British Airways suffered its second big outage in two years, once again stranding tens of thousands of passengers and facing heavy costs.)
The GAO report didn’t really uncover much new: there was, in some cases, a need for better testing, a little more redundancy needed here and there, certainly some improved processes. But despite suspicions of under-investment, there was nothing systemic. The causes of the outages were varied and, although often avoidable when looked at in hindsight, not necessarily predictable.
There is, however, still an undeniable pattern. Uptime Institute’s own analysis of three years of public, media-reported outages shows that two industries, airlines and retail financial services, do appear to suffer from significantly more, highly disruptive (category 4 and 5), high profile outages than other industries.
To be clear: these businesses do not necessarily have more outages, but rather they suffer a higher number of highly disruptive outages, and as a result, get more negative publicity when there is a problem. Cloud providers are not far behind.
Why is this? The reasons may vary, but these businesses very often offer services on which large numbers of people depend, for which almost any interruption causes immediate losses and negative publicity, and in which it may not be easy to get back to the status quo.
Another trait that seems to set these businesses apart is that their almost complete dependency on IT is relatively recent (or they may be a new IT service or industry), so they may not yet have invested to the same levels as, for example, an investment bank, stock exchange or a power utility. In these last examples, the mission-critical nature of the business has long been clear, they are probably regulated, and so have investments and processes fully in place.
Organizations have long conducted business impact analyses, and there are various methodologies and tools available to help carry these out. Uptime Institute has been researching this area, particularly to see how organizations might specifically address the business impacts of failures in digital infrastructure. One simple approach is to create a “vulnerability” rating for each application/service, with scores attributed across a number of factors. Some of our thinking — and this is not comprehensive — is outlined below:
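One possible way to sketch such a rating in code is a weighted sum across factors. The factor names, weights and 1-5 scoring below are our own illustrative assumptions, not Uptime Institute's actual methodology.

```python
# A minimal sketch of a per-application "vulnerability" rating. The factor
# names, weights and 1-5 scoring are illustrative assumptions only, not
# Uptime Institute's actual methodology.

FACTORS = {
    "revenue_impact": 3,       # weight: direct loss while the service is down
    "user_reach": 2,           # how many people and processes depend on it
    "recovery_difficulty": 2,  # how hard it is to restore the status quo
    "interdependency": 1,      # how entangled it is with other systems
}

def vulnerability_rating(scores):
    """Weighted sum of per-factor scores, each scored 1 (low) to 5 (high)."""
    if set(scores) != set(FACTORS):
        raise ValueError("score every factor exactly once")
    return sum(FACTORS[name] * score for name, score in scores.items())

# A hypothetical airline booking system: high impact, hard to work around.
print(vulnerability_rating({
    "revenue_impact": 5,
    "user_reach": 5,
    "recovery_difficulty": 4,
    "interdependency": 3,
}))
```

Ranking applications by such a score, however crude, gives a starting point for deciding where resiliency investment matters most.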
One of the challenges of carrying out assessments is that the impact of any particular service’s or application’s failing is changing, in two ways. First, in most cases, it is increasing, along with the dependency of all businesses and consumers on IT. And second, it is becoming more complicated and harder to determine accurately, largely because of the inter-dependency of many different systems and applications, intertwined to support different processes and services. There may even be a “hockey stick” curve, with the impact of failures growing exponentially as more systems, people and businesses are involved.
Looked at like this, it is clear that certain organizations have become more vulnerable to high impact outages than they were a year or two previously, because while the immediate impact on sales/revenue may not have changed, the scale, profile or recover-ability may have. It may be that airlines, which only two years ago could board passengers manually, can no longer do so without IT; similarly, retail banking customers used to carry sufficient cash or checks to get themselves a meal and get home. Not anymore. These organizations now have a very tricky problem: How do they upgrade their infrastructure, and their processes, to a level of mission criticality for which they were not designed?
All this raises a further tricky question that Uptime Institute is researching: Which industries, businesses or services have become (or will become) critical to the national infrastructure — even if a few years ago they certainly were not (or they are not currently)? And how should these services be regulated (if they should …)? We are seeking partners to help with this research. Organizations are not the only ones struggling with these questions — governments are as well.
More information on this topic is available to members of the Uptime Institute Network here.
How to avoid outages: Try harder!
By Kevin Heslin

Uptime Institute has spent years analyzing the root causes of data center and service outages, surveying thousands of IT professionals throughout the year on this topic. According to the data, the vast majority of data center failures are caused by human error. Some industry experts report numbers as high as 75%, but Uptime Institute generally reports about 70%, based on the wealth of data we gather continuously. That finding immediately raises an important question: Just how preventable are most outages?
Certainly, the number of outages remains persistently high, and the associated costs of these outages are also high. Uptime Institute data from the past two years demonstrates that almost one-third of data center owners and operators experienced a downtime incident or severe degradation of service in the past year, and half in the previous three years. Many of these incidents had severe financial consequences, with 10% of the 2019 respondents reporting that their most recent incident cost more than $1 million.
These findings, and others related to the causes of outages, are perhaps not unexpected. But more surprisingly, in Uptime Institute’s April 2019 survey, 60% of respondents believed that their most recent significant downtime incident could have been prevented with better management/processes or configuration. For outages that cost greater than $1 million, this figure jumped to 74%, and then leveled out around 50% as the outage costs increased to more than $40 million. These numbers remain persistently high, given the existing knowledge available on the causes and sources of downtime incidents and the costs of many downtime incidents.
Data center owners and operators know that on-premises power failures continue to cause the most outages (33%), with network and connectivity issues close behind (31%). Additional failures attributed to colocation providers could also have been prevented by the provider.
These findings should be alarming to everyone in the digital infrastructure business. After years of building data centers, and adding complex layers of features and functionality, not to mention dynamic workload migration and orchestration, the industry’s report card on actual service delivery performance is less than stellar. And while these sorts of failures should be very rare in concurrently maintainable and fault tolerant facilities when appropriate and complete procedures are in place, what we are finding is that the operational part of the story falls flat. Simply put, if humans worked harder to MANAGE the well-designed and constructed facilities better, we would have fewer outages.
Uptime Institute consultants have underscored the critically important role procedures play in data center operations. They remind listeners that having and maintaining appropriate and complete procedures is essential to achieving performance and service availability goals. These same procedures can also help data centers meet efficiency goals, even in conditions that exceed planned design days. Among other benefits, well conceived procedures and the extreme discipline to follow these procedures helps operators cope with strong storms, properly perform maintenance and upgrades, manage costs and, perhaps most relevant, restore operations quickly after an outage.
So why, then, does the industry continue to experience downtime incidents, given that the causes have been so well pinpointed, the costs are so well-known and the solution to reducing their frequency (better processes and procedures) is so obvious? We just don’t try hard enough.
When asking our constituents about the causes of their outages, there are perhaps as many explanations as there are respondents. Here are just a few questions to consider when looking internally at your own risks and processes:
Perhaps there is simply a limit to what can be achieved in an industry that still relies heavily on people to perform many of the most basic and critical tasks and thus is subject to human error, which can never be completely eliminated. However, a quick survey of the issues suggests that management failure — not human error — is the main reason that outages persist. By under-investing in training, failing to enforce policies, allowing procedures to grow outdated, and underestimating the importance of qualified staff, management sets the stage for a cascade of circumstances that leads to downtime. If we try harder, we can make progress. If we leverage the investments in physical infrastructure by applying the right level of operational expertise and business management, outages will decline.
We just need to try harder.
More information on this and similar topics is available to members of the Uptime Institute Network; membership can be initiated here.
Troubling for operators: Capacity forecasting and maintaining cost competitiveness
By Rabih Bashroush

In the recently published 2019 Uptime Institute supplier survey, participants told us they are witnessing higher than normal data center spending patterns. This is in line with general market trends, driven by the demand for data and digital services. It is also a welcome sign for those suppliers who witnessed a downturn two to three years ago, as public cloud began to take a bite.
The increase in spending is not only by hyperscalers known to be designing for 100x scalability and building for 10x growth. Smaller facilities (under 20 MW) are also seeing continued investment, including in higher levels of redundancy at primary sites (a trend that may have surprised some).
However, this growth continues to raise concerns. In this year’s survey, the top challenge operators face, as identified by suppliers, is forecasting future data center capacity requirements. This is followed by the need to maintain competitive and cost-efficient operations compared with cloud/colocation. Managing different data center environments dropped to fourth place, after coming second in last year’s supplier survey. This finding agrees with the results of our 2019 operator survey (of around 1,000 data center operators around the world). In that survey, our analysis attributed the change to the advancement in tools and market maturity.
The figure below shows the top challenges operators faced in 2018 and 2019, as reported by their suppliers:
Forecasting data center capacity is a long-standing issue. Rapid changes in technology and the difficulty of anticipating future workload growth at a time when there are so many choices complicate matters. Over-provisioning capacity, the most commonly adopted strategy, leads to inefficiencies in operations (and unnecessary upfront investment). Against this, under-provisioning capacity is an operational risk and could also mean facilities reach their limit before their planned investment life-cycle.
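The trade-off can be made concrete with a toy model: capacity is fixed at build time, while demand compounds at an uncertain annual rate. All figures below are invented for illustration.

```python
# A toy model of the forecasting trade-off: capacity is fixed at build time,
# while demand compounds at an uncertain annual growth rate. Invented figures.

def years_until_full(built_mw, initial_mw, annual_growth):
    """Years until compounding demand exceeds the built capacity."""
    demand, years = initial_mw, 0
    while demand <= built_mw:
        demand *= 1 + annual_growth
        years += 1
    return years

# Build 10 MW against 4 MW of initial demand; planned life-cycle of 15 years.
for growth in (0.05, 0.15, 0.25):
    print(f"{growth:.0%} growth: full in {years_until_full(10, 4, growth)} years")
```

Under these assumed numbers, 5% growth leaves ample headroom (19 years of runway), while 25% growth fills the site in 5 years, well inside a 15-year planned life-cycle. The same build decision is over-provisioned in one scenario and under-provisioned in the other, which is the forecasting dilemma in miniature.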
Depending on the sector and type of workload, many organizations have adopted modular data center designs, which can be an effective way to alleviate the expense of over-provisioning. Where appropriate, some operators also move highly unpredictable or the most easily/economically transported workloads to public cloud environments. These strategies, plus various other factors driving the uptake of mixed IT infrastructures, mean more organizations are accumulating expertise in managing hybrid environments. This may explain why the challenge of managing different data center environments dropped to fourth place in our survey this year. Additionally, cloud computing suppliers are offering more effective tools to help customers better manage their costs when running cloud services.
The adoption of cloud-first policies by many operators means managers are having to demonstrate cost-effectiveness more than ever. This means that understanding the true cost of maintaining in-house facilities versus the cost of cloud/colocation venues is becoming more important, as the survey results above show.
The 2019 Uptime Institute operator survey also reflects this. Forty percent of participants indicated that they are not confident in their organization’s ability to compare costs between in-house versus cloud/colocation environments. Indeed, this is not a straightforward exercise. On the one hand, the structure of some enterprises (e.g., how budgets are split) makes calculating the cost of running owned sites tricky. On the other hand, calculating the true cost of moving to the cloud is also not straightforward. There may be costs inherent in the transition related to application re-engineering, potential repatriation or network upgrades for example (and there is now a vast choice of cloud offerings that require careful costing and management). Among other issues, such as vendor lock-in, this complexity is now driving many to change their policies to be more about cloud appropriateness, rather than cloud-first.
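A minimal sketch of why the comparison is tricky, with invented figures: a naive run-rate comparison of cloud versus in-house omits the one-off transition costs (application re-engineering, network upgrades) mentioned above.

```python
# A hedged sketch, with invented figures, of why run-rate comparisons
# mislead: the cloud side carries one-off transition costs (application
# re-engineering, network upgrades) that a naive comparison omits.

def in_house_tco(annual_run_cost, years):
    """Total cost of continuing to run an owned site."""
    return annual_run_cost * years

def cloud_tco(annual_service_cost, years, reengineering=0.0, network_upgrade=0.0):
    """Cloud run-rate plus the one-off migration costs mentioned in the text."""
    return annual_service_cost * years + reengineering + network_upgrade

print(in_house_tco(1_000_000, 5))             # owned site over 5 years
print(cloud_tco(800_000, 5))                  # naive cloud comparison
print(cloud_tco(800_000, 5, reengineering=600_000,
                network_upgrade=150_000))     # with transition costs included
```

With these made-up numbers the cloud option still costs less over five years, but by far less than the naive comparison suggests; under different assumptions the ordering can flip, which helps explain the survey respondents' lack of confidence.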
Want to know more details? The full report, Uptime Institute data center supply-side survey 2019, is available to members of the Uptime Institute Network, which can be found here.