How to Avoid Outages – Part Deux

A previous Uptime Intelligence Note suggested that avoiding data center outages might be as simple as trying harder. That Note argued that management failures are the main reason enterprises continue to experience downtime incidents, even in fault tolerant facilities. “The Intelligence Trap,” a new book by David Robson, sheds light on why management mistakes continue to plague data center operations. The book argues that management must be aware of its own limitations if it is to properly staff operations, develop training programs and evaluate risk. Robson also notes that modern education practices, especially for children in the US and Europe, have created notions about intelligence that ultimately undermine management efforts, and that also fuel opposing and intractable political arguments about wider societal issues such as climate change and the anti-vaccination movement.
Citing dozens of anecdotes and psychological studies, Robson carefully examines how the intelligence trap ensnares almost everyone, including notables such as Arthur Conan Doyle, Albert Einstein and other famous, high IQ individuals — even Nobel laureates. Brilliant as they are, many fall prey to hoaxes, and some embrace conspiracy theories and junk science, in many cases persuading others to join them.
The foundations of the trap, Robson notes, begin with the early and widespread acceptance of the IQ test, its narrow definition of intelligence and expectations that “smart” people will perform better at all tasks than others. Psychologist Keith Stanovich says intelligence tests “work quite well for what they do,” but are a poor metric of overall success. He calls the link between intelligence and rationality weak.
Studies show that geniuses are deferred to in fields far beyond their areas of expertise. Yet those same studies suggest they are at least as apt to use their intelligence to craft arguments that justify their existing positions as they are to open themselves to new information that might change their opinions.
In some cases, ingrained patterns of thinking may further undermine these geniuses. Robson notes that experts often develop mental shortcuts, based on experience, that enable them to work quickly on routine cases. In the medical field, these shortcuts contribute to a high rate of misdiagnosis: in the US, one in six initial diagnoses proves to be incorrect, contributing to an estimated one in 10 patient deaths (40,000 to 80,000 per year).
The intelligence trap can catch organizations, too. The US FBI held firmly to a mistaken fingerprint reading that implicated an innocent man in the 2004 terrorist bombing in Madrid, even after non-forensic evidence emerged that showed the suspect had played no role in the attack. FBI experts were not even persuaded by Spanish authorities who found discrepancies between the print found at the bombing and the suspect’s fingerprint.
Escaping the intelligence trap is difficult, but possible. One obstacle is that ingrained patterns of thinking lead successful individuals to surround themselves with too many experts, forming non-diverse teams that do not function well together. High levels of intelligence and experience, combined with a lack of diversity, can produce overconfidence that increases a team’s appetite for risk. Competition among accomplished team members can also undermine how well a team functions.
Data center owners and operators should note that even highly functional teams sometimes process close calls as successes, which leads them to discount the likelihood of a serious incident. NASA suffered two well-known space shuttle disasters after its engineers began to downplay known risks following a string of successful missions. Prior to the loss of Challenger, many shuttle missions had survived the faulty seals that caused the 1986 disaster, and every mission had survived damage to foam insulation until the Columbia disaster in 2003. Prior experience had reduced the perceived risk of these dangerous conditions, a phenomenon known as “outcome bias.”
The result, says Catherine Tinsley, a professor of management at Georgetown specializing in corporate catastrophe, is a subtle but distinct increase in risk appetite. She says, “I’m not here to tell you what your risk tolerance should be. I’m here to say that when you experience a near miss, your risk tolerance will increase and you won’t be aware of it.”
The real question here is not whether mistakes and disasters are inevitable. Rather, it is how to become conscious of the subconscious. Staff should be trained to recognize the limitations that cause bad decisions. And, perhaps most important, data center operators and owners should recognize that procedures must not only be foolproof, they must be “expert-proof” as well.
More information on this topic is available to members of the Uptime Institute Network. Membership information can be found here.
Why do some industries and organizations suffer more serious, high profile outages than others?

In a recent Uptime Institute Intelligence Note, we considered a June 2019 report issued by the US Government Accountability Office (GAO) on the IT resiliency of US airlines. The GAO wanted to better understand whether the all-too-frequent IT outages, and the resultant chaos passengers face, have any common causes and, if so, how they could be addressed. (Since we published that Note, the UK carrier British Airways suffered its second big outage in two years, once again stranding tens of thousands of passengers and facing heavy costs.)
The GAO report didn’t really uncover much that was new: in some cases there was a need for better testing, a little more redundancy was needed here and there, and certainly some processes could be improved. But despite suspicions of under-investment, there was nothing systemic. The causes of the outages were varied and, although often avoidable in hindsight, not necessarily predictable.
There is, however, still an undeniable pattern. Uptime Institute’s own analysis of three years of public, media-reported outages shows that two industries, airlines and retail financial services, do appear to suffer significantly more highly disruptive (category 4 and 5), high profile outages than other industries.
To be clear: these businesses do not necessarily have more outages, but rather they suffer a higher number of highly disruptive outages, and as a result, get more negative publicity when there is a problem. Cloud providers are not far behind.
Why is this? The reasons may vary, but these businesses very often offer services on which large numbers of people depend, for which almost any interruption causes immediate losses and negative publicity, and in which it may not be easy to get back to the status quo.
Another trait that seems to set these businesses apart is that their almost complete dependency on IT is relatively recent (or they may be a new IT service or industry), so they may not yet have invested to the same levels as, for example, an investment bank, stock exchange or a power utility. In these last examples, the mission-critical nature of the business has long been clear, they are probably regulated, and so have investments and processes fully in place.
Organizations have long conducted business impact analyses, and there are various methodologies and tools available to help carry these out. Uptime Institute has been researching this area, particularly to see how organizations might specifically address the business impact of failures in digital infrastructure. One simple approach is to create a “vulnerability” rating for each application or service, with scores attributed across a number of factors. Some of our thinking — and this is not comprehensive — is outlined below, followed by a simple scoring sketch:
Profile. Certain industries are consumer facing, large scale or have a very public brand. A high score in this area means even small failures — Facebook’s outages are a good example — will have a big public impact.
Failure sensitivity. Sensitive industries are those for which an outage has immediate and high impact. If an investment bank can’t trade, planes can’t take off or clients can’t access their money, the sensitivity is high.
Recover-ability. Organizations that take a long time to restore normal service will suffer more seriously from IT failures; the costs of an outage may be multiplied many times over if recovery is slow. For example, airlines may find it takes days to get all planes and crews in the right locations to restore normal operations.
Regulatory/compliance. Failures in certain industries either must be reported or will attract attention from regulators. Emergency services (e.g., 911, 999, 112), power companies and hospitals are good examples … and this list is growing.
Platform dependents. Organizations whose customers include service providers — such as software as a service; infrastructure as a service; and colocation, hosting and cloud-based service providers — will not only breach service level agreements when they suffer an outage but may also lose paying clients. (There are many examples of this.)
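As a minimal illustration of how such a rating might be pulled together, the sketch below scores a hypothetical service across the five factors above and averages them into a single vulnerability figure. The 1-to-5 scale, the equal weighting and the example scores are all assumptions made for the sketch, not an Uptime Institute methodology.

```python
from dataclasses import dataclass

# Factor names follow the list above; the scale and weighting are illustrative.
FACTORS = ("profile", "failure_sensitivity", "recoverability",
           "regulatory", "platform_dependents")

@dataclass
class ServiceAssessment:
    name: str
    scores: dict  # factor name -> score from 1 (low) to 5 (high)

    def vulnerability(self) -> float:
        """Average the 1-5 factor scores into a single vulnerability rating."""
        return sum(self.scores[f] for f in FACTORS) / len(FACTORS)

# Example: a consumer-facing check-in service scores high on most factors.
checkin = ServiceAssessment(
    name="passenger check-in",
    scores={"profile": 5, "failure_sensitivity": 5, "recoverability": 4,
            "regulatory": 3, "platform_dependents": 2},
)
print(f"{checkin.name}: vulnerability {checkin.vulnerability():.1f} / 5")
```

In practice an organization would weight the factors to reflect its own business priorities, and revisit the scores as dependencies change.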
One of the challenges of carrying out such assessments is that the impact of any particular service or application failing is changing, in two ways. First, in most cases, it is increasing, along with the dependency of all businesses and consumers on IT. And second, it is becoming more complicated and harder to determine accurately, largely because of the inter-dependency of many different systems and applications, intertwined to support different processes and services. There may even be a hockey stick curve, with the impact of failures growing rapidly as more systems, people and businesses are involved.
Looked at like this, it is clear that certain organizations have become more vulnerable to high impact outages than they were only a year or two previously: while the immediate impact on sales/revenue may not have changed, the scale, profile or recover-ability may have. Airlines, which only two years ago could board passengers manually, may no longer be able to do so without IT; similarly, retail banking customers used to carry enough cash or checks to get themselves a meal and get home. Not anymore. These organizations now have a very tricky problem: How do they upgrade their infrastructure, and their processes, to a level of mission criticality for which they were not designed?
All this raises a further tricky question that Uptime Institute is researching: Which industries, businesses or services have become (or will become) critical to the national infrastructure — even if a few years ago they certainly were not (or they are not currently)? And how should these services be regulated (if they should …)? We are seeking partners to help with this research. Organizations are not the only ones struggling with these questions — governments are as well.
More information on this topic is available to members of the Uptime Institute Network here.
How to avoid outages: Try harder!

Uptime Institute has spent years analyzing the root causes of data center and service outages, surveying thousands of IT professionals throughout the year on this topic. According to the data, the vast majority of data center failures are caused by human error. Some industry experts report figures as high as 75%; Uptime Institute generally reports about 70%, based on the wealth of data we gather continuously. That finding immediately raises an important question: Just how preventable are most outages?
Certainly, the number of outages remains persistently high, and the associated costs of these outages are also high. Uptime Institute data from the past two years demonstrates that almost one-third of data center owners and operators experienced a downtime incident or severe degradation of service in the past year, and half in the previous three years. Many of these incidents had severe financial consequences, with 10% of the 2019 respondents reporting that their most recent incident cost more than $1 million.
These findings, and others related to the causes of outages, are perhaps not unexpected. More surprisingly, in Uptime Institute’s April 2019 survey, 60% of respondents believed that their most recent significant downtime incident could have been prevented with better management/processes or configuration. For outages that cost more than $1 million, this figure jumped to 74%, before settling at around 50% for the costliest incidents (more than $40 million). These figures are strikingly high, given how much is already known about the causes and sources of downtime incidents and what those incidents cost.
Data center owners and operators know that on-premises power failures continue to cause the most outages (33%), with network and connectivity issues close behind (31%). Additional failures attributed to colocation providers could also have been prevented by the provider.
These findings should be alarming to everyone in the digital infrastructure business. After years of building data centers, and adding complex layers of features and functionality, not to mention dynamic workload migration and orchestration, the industry’s report card on actual service delivery is less than stellar. These sorts of failures should be very rare in concurrently maintainable and fault tolerant facilities when appropriate and complete procedures are in place, yet the operational part of the story keeps falling flat. Simply put, if humans worked harder to MANAGE well-designed and well-constructed facilities, we would have fewer outages.
Uptime Institute consultants have long underscored the critically important role procedures play in data center operations. They stress that having and maintaining appropriate and complete procedures is essential to achieving performance and service availability goals. The same procedures can also help data centers meet efficiency goals, even in conditions that exceed planned design days. Among other benefits, well-conceived procedures, and the discipline to follow them, help operators cope with strong storms, properly perform maintenance and upgrades, manage costs and, perhaps most relevant, restore operations quickly after an outage.
So why, then, does the industry continue to experience downtime incidents, given that the causes have been so well pinpointed, the costs are so well-known and the solution to reducing their frequency (better processes and procedures) is so obvious? We just don’t try hard enough.
When we ask our constituents about the causes of their outages, there are perhaps as many explanations as there are respondents. Here are just a few questions to consider when looking internally at your own risks and processes:
Does the complexity of your infrastructure, especially the distributed nature of it, increase the risk that simple errors will cascade into a service outage?
Is your organization expanding critical IT capacity faster than it can attract and apply the resources to manage that infrastructure?
Has your organization started to see any staffing and skills shortage, which may be starting to impair mission-critical operations?
Do your concerns about cyber vulnerability and data security outweigh concerns about physical infrastructure?
Does your organization shortchange training and education programs when budgeting?
Does your organization under-invest in IT operations, management, and other business management functions?
Does your organization truly understand change management, especially when many of your workloads may already be shared across multiple servers, in multiple facilities or in entirely different types of IT environments including co-location and the cloud?
Does your organization consider the needs at the application level when designing new facilities or cloud adoptions?
Perhaps there is simply a limit to what can be achieved in an industry that still relies heavily on people to perform many of the most basic and critical tasks and thus is subject to human error, which can never be completely eliminated. However, a quick survey of the issues suggests that management failure — not human error — is the main reason that outages persist. By under-investing in training, failing to enforce policies, allowing procedures to grow outdated, and underestimating the importance of qualified staff, management sets the stage for a cascade of circumstances that leads to downtime. If we try harder, we can make progress. If we leverage the investments in physical infrastructure by applying the right level of operational expertise and business management, outages will decline.
We just need to try harder.
More information on this and similar topics is available to members of the Uptime Institute Network; membership can be initiated here.
Troubling for operators: Capacity forecasting and maintaining cost competitiveness

In the recently published 2019 Uptime Institute supplier survey, participants told us they are witnessing higher than normal data center spending. This is in line with general market trends, driven by the demand for data and digital services. It is also a welcome sign for those suppliers who witnessed a downturn two to three years ago, as public cloud began to take a bite.
The increase in spending is not only by hyperscalers known to be designing for 100x scalability and building for 10x growth. Smaller facilities (under 20 MW) are also seeing continued investment, including in higher levels of redundancy at primary sites (a trend that may have surprised some).
However, this growth continues to raise concerns. In this year’s survey, the top challenge operators face, as identified by suppliers, is forecasting future data center capacity requirements. This is followed by the need to maintain competitive and cost-efficient operations compared with cloud/colocation. Managing different data center environments dropped to fourth place, after coming second in last year’s supplier survey. This finding agrees with the results of our 2019 operator survey (of around 1,000 data center operators around the world); in that survey, our analysis attributed the change to advances in tools and market maturity.
The figure below shows the top challenges operators faced in 2018 and 2019, as reported by their suppliers:
Forecasting data center capacity is a long-standing issue. Rapid changes in technology, and the difficulty of anticipating future workload growth at a time when there are so many choices, complicate matters. Over-provisioning capacity, the most commonly adopted strategy, leads to inefficient operations (and unnecessary upfront investment). Under-provisioning capacity, by contrast, is an operational risk and could mean facilities reach their limit before the end of their planned investment life cycle.
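As a rough illustration of why the forecasting problem is so awkward, the short sketch below projects how long a hypothetical facility’s built capacity would last under different load-growth assumptions. All of the figures (6 MW built, 3 MW of current IT load, the growth rates) are invented for the example.

```python
import math

def years_until_full(built_mw: float, current_mw: float, annual_growth: float) -> float:
    """Years until IT load reaches built capacity at a constant compound growth rate."""
    if current_mw >= built_mw:
        return 0.0
    return math.log(built_mw / current_mw) / math.log(1 + annual_growth)

# Hypothetical facility: 6 MW built, 3 MW drawn today, 10-15 year planned life.
for label, growth in (("low", 0.05), ("expected", 0.10), ("high", 0.20)):
    years = years_until_full(built_mw=6.0, current_mw=3.0, annual_growth=growth)
    print(f"{label:>8} growth ({growth:.0%}/yr): capacity exhausted in {years:.1f} years")
```

Even this toy model shows the dilemma: the same building is comfortably over-provisioned at 5% annual growth, and runs out of capacity well inside a typical investment life cycle at 20%.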
Depending on the sector and type of workload, many organizations have adopted modular data center designs, which can be an effective way to alleviate the expense of over-provisioning. Where appropriate, some operators also move highly unpredictable or the most easily/economically transported workloads to public cloud environments. These strategies, plus various other factors driving the uptake of mixed IT infrastructures, mean more organizations are accumulating expertise in managing hybrid environments. This may explain why the challenge of managing different data center environments dropped to fourth place in our survey this year. Additionally, cloud computing suppliers are offering more effective tools to help customers better manage their costs when running cloud services.
The adoption of cloud-first policies by many operators means managers are having to demonstrate cost-effectiveness more than ever. This means that understanding the true cost of maintaining in-house facilities versus the cost of cloud/colocation venues is becoming more important, as the survey results above show.
The 2019 Uptime Institute operator survey also reflects this. Forty percent of participants indicated that they are not confident in their organization’s ability to compare costs between in-house and cloud/colocation environments. Indeed, this is not a straightforward exercise. On the one hand, the structure of some enterprises (e.g., how budgets are split) makes calculating the cost of running owned sites tricky. On the other hand, calculating the true cost of moving to the cloud is also not straightforward: there may be transition costs related to application re-engineering, potential repatriation or network upgrades, for example (and there is now a vast choice of cloud offerings that require careful costing and management). Among other issues, such as vendor lock-in, this complexity is now driving many to change their policies to be about cloud appropriateness, rather than cloud-first.
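A bare-bones way to frame that comparison is to put the one-off transition costs alongside the recurring costs over a fixed horizon, as in the sketch below. Every figure and line item here is hypothetical, and a real analysis would include many more (staff, licenses, egress and bandwidth charges, possible repatriation, and so on).

```python
def in_house_cost(annual_facility_cost: float, years: int) -> float:
    """Recurring cost of keeping the workload in an owned facility (simplified)."""
    return annual_facility_cost * years

def cloud_cost(annual_cloud_spend: float, years: int,
               re_engineering: float, network_upgrade: float) -> float:
    """One-off transition costs plus recurring cloud spend (simplified)."""
    return re_engineering + network_upgrade + annual_cloud_spend * years

# Hypothetical figures, in $ thousands, over a five-year horizon.
horizon = 5
stay = in_house_cost(annual_facility_cost=800, years=horizon)
move = cloud_cost(annual_cloud_spend=650, years=horizon,
                  re_engineering=400, network_upgrade=150)
print(f"Five-year cost to stay in-house: ${stay:,.0f}k; to move to cloud: ${move:,.0f}k")
```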
Want to know more details? The full report, Uptime Institute data center supply-side survey 2019, is available to members of the Uptime Institute Network and can be found here.
Data Center Free Air Cooling Trends

With the recent expansion of the American Society of Heating, Refrigerating and Air-Conditioning Engineers’ (ASHRAE’s) acceptable data center operating temperature and humidity ranges — taken as an industry-standard best practice by many operators — the case for free air cooling has become much stronger. Free air cooling is an economical method of using low external air temperature to cool server rooms.
In the 2019 Uptime Institute Supply-side Survey (available to members of the Uptime Institute Network), we asked over 500 data center vendors, consultants and engineers about their customers’ adoption of free air economizer cooling (the use of outside air, or a combination of water and air, to supplement mechanical cooling) using the following approaches:
Indirect air: Outside air passes through a heat exchanger that separates the air inside the data center from the cooler outside air. This approach prevents particulates from entering the white space and helps control humidity levels.
Direct air: Outside air passes through an evaporative cooler and is then directed via filters to the data center cold aisle. When the temperature outside is too cold, the system mixes the outside air with exhaust air to achieve the correct inlet temperature for the facility (see the sketch after this list).
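The mixing step for direct air economization is, at heart, a simple energy balance. The sketch below (which assumes equal air densities and heat capacities, ignores humidity control and fan heat, and uses made-up temperatures) estimates what fraction of outside air to blend with recirculated exhaust air to hit a target cold-aisle inlet temperature.

```python
def outside_air_fraction(outside_c: float, exhaust_c: float, target_inlet_c: float) -> float:
    """Fraction of outside air to mix with warm exhaust air so the blended
    stream reaches the target inlet temperature (simple energy balance)."""
    if outside_c >= target_inlet_c:
        return 1.0  # outside air alone is at or above target; no recirculation needed
    # target = f * outside + (1 - f) * exhaust  ->  solve for f
    return (exhaust_c - target_inlet_c) / (exhaust_c - outside_c)

# Example: -5 C outside, 35 C hot-aisle exhaust, 20 C target inlet.
f = outside_air_fraction(outside_c=-5.0, exhaust_c=35.0, target_inlet_c=20.0)
print(f"Outside-air fraction: {f:.0%}")  # roughly 38% outside air, 62% recirculated
```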
Findings from the survey show that free air cooling economization projects continue to gain traction, with indirect free air cooling being slightly more popular than direct air. In our survey, 84% said that at least some of their customers are deploying indirect air cooling (74% for direct air). Only 16% of participants said that none of their customers are deploying indirect free air cooling (26% for direct air), as shown in the figure below.
The data suggests that there is more momentum behind direct free air cooling in North America than in other parts of the world. Among North American respondents, 70% indicated that some of their customers are deploying direct air cooling (compared with 63% indirect air). As shown in the figure below, this was not the case in Europe or Asia-Pacific, where suppliers reported that more customers were deploying indirect air. This perhaps could be linked to the fact that internet giants represent a bigger data center market share in North America than in other parts of the world — internet giants are known to favor direct free air cooling when deploying at scale.
The continued pressure to increase cost-efficiency, as well as the rising awareness and interest in environmental impact, is likely to continue driving uptake of free air cooling. Compared with traditional compressor-based cooling systems, free air cooling requires less upfront capital investment and involves lower operational expenses, while having a lower environmental impact (e.g., no refrigerants, low embedded carbon and a higher proportion of recyclable components).
Yet, some issues hampering free air cooling uptake will likely continue in the short term. These include the upfront retrofit investment required for existing facilities; humidity and air quality constraints (which are less of a problem for indirect air cooling); lack of reliable weather models in some areas (and the potential impact of climate change); and restrictive service level agreements, particularly in the colocation sector.
Moreover, a lack of understanding of the ASHRAE standards, and a lack of clarity around IT equipment needs, is driving some operators to design for the most restrictive equipment they host, particularly in legacy or mixed IT environments. The opportunity to take advantage of free air cooling is then missed, because of the perceived need to adopt lower operating temperatures.
Going forward, at least in Europe, this problem might be partially addressed by the introduction of the new European EcoDesign legislation for servers and online storage devices, which will take effect from March 2020. The new legislation will require IT manufacturers to declare the operating condition classes and thermal performance of their equipment. This, in turn, will help enterprise data centers better optimize their operations by segregating IT equipment based on ambient operating requirements.
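As a small illustration of what that segregation enables, the sketch below groups equipment by its declared ASHRAE class and takes the most restrictive allowable inlet temperature as the ceiling for a room. The class ceilings are indicative values drawn from the ASHRAE allowable ranges and should be confirmed against the current thermal guidelines; the room contents are invented.

```python
# Indicative allowable inlet-temperature ceilings (deg C) per ASHRAE class.
# Treat these as illustrative; always confirm against the current ASHRAE
# thermal guidelines and the manufacturer's declared operating classes.
ALLOWABLE_MAX_C = {"A1": 32, "A2": 35, "A3": 40, "A4": 45}

def room_ceiling(declared_classes: list) -> int:
    """The least tolerant piece of equipment sets the room's inlet ceiling."""
    return min(ALLOWABLE_MAX_C[c] for c in declared_classes)

# Mixing one A1 device into a room of A3/A4 kit drags the whole room down to 32 C,
# shrinking the free air cooling window; segregating it lifts the ceiling to 40 C.
print("mixed room ceiling:     ", room_ceiling(["A1", "A3", "A4"]), "C")
print("segregated room ceiling:", room_ceiling(["A3", "A4"]), "C")
```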
The full report Uptime Institute data center supply-side survey 2019 is available to members of the Uptime Institute Network. You can become a member or request guest access by looking here or contacting any member of the Uptime Institute team.
The Evolving Data Center Management Maturity Model, A Quick Update

Uptime Institute has long argued that, although it may take many years, the long-term trend is toward a high level of automation in the data center, covering many functions that most managers currently would not trust to machines or outside programmers. Recent advances in artificial intelligence (AI) have made this seem more likely. (For more on data center AI, see our report Very smart data centers: How artificial intelligence will power operational decisions.)
Our data center management maturity model shows this long-term evolution.
In our model, we have mapped different levels of operating efficiency to different stages of deployment of data center infrastructure management (DCIM) software. We encourage any manager who is looking to buy DCIM, or who has already implemented the software and seeks expanded features or functions, to consider their short- and long-term automation goals.
Today, most DCIM deployments fall into Level 2 or Level 3 of the model. A growing number of organizations are targeting Level 3 by integrating DCIM data with IT, cloud service and other non-facility data, as discussed in the report Data center management software and services: Effective selection and deployment (co-authored with Andy Lawrence).
The advent of AI-driven, cloud-based services will, we believe, drive greater efficiencies and, when deployed in combination with on-premises DCIM software, enable more data centers to reach Level 4 (and, over time, Level 5).
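For readers who want the progression spelled out, here is a very rough sketch of the levels as discussed in this post. The one-line summaries are paraphrases, and the Level 2 and Level 5 wordings are shorthand for this sketch rather than the model’s formal definitions; the full model, in the report cited above, defines each level in much more detail.

```python
# Rough paraphrase of the maturity levels discussed in this post; the full
# Uptime Institute model defines each level (including Level 1) in more detail.
MATURITY_LEVELS = {
    2: "DCIM deployed, largely standalone",
    3: "DCIM data integrated with IT, cloud-service and other non-facility data",
    4: "AI-driven, cloud-based services combined with on-premises DCIM",
    5: "Broadly automated operations built on the Level 4 toolchain",
}

def next_target(current_level: int) -> str:
    """Return the capability to aim for next, following the progression above."""
    if current_level >= max(MATURITY_LEVELS):
        return "Already at the top of the model."
    return MATURITY_LEVELS[current_level + 1]

print(next_target(2))  # most deployments today sit at Level 2 or 3
```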
Although procurement decisions today may be only minimally affected by current automation needs, a later move toward greater automation should be considered, especially in terms of vendor choice/lock-in and integration.
Integration capabilities, as well as the use and integration of AI (including AI-driven cloud services), are important factors in both the overall strategic decision to deploy DCIM and the choice of a particular supplier/platform.
The full report Data center management software and services: Effective selection and deployment is available to members of the Uptime Institute Network here.