Avoiding IT service outages is a big concern for any operator or service provider, especially one providing a business-critical service. But when an outage does occur, the business impact can vary from “barely noticeable” to “huge and expensive.” Anticipating and modeling the impact of a service interruption should be a part of incident planning and is key to determining the level of investment that should be made to reduce incidents and their impact.
In recent years, Uptime Institute has been collecting data about service outages, including the costs, the consequences and, most notably, the most common causes. One of our findings is that organizations often don’t collect full financial data about the impact of outages, or if they do, it might take months for these to become apparent. Many of the costs are hidden, even if the outcry from managers and even non-paying customers is most certainly not. But cost is not a proxy for impact: even a relatively short and inexpensive outage at a big, consumer-facing service provider can attract negative, national headlines.
Another clear trend, now that so many applications are distributed and interlinked, is that “outages” can often be partial, affecting users in different ways. This has, in some cases, enabled some major operators to claim very impressive availability figures in spite of poor customer experience. Their argument: Just because a service is slow or can’t perform some functions doesn’t mean it is “down.”
To give managers a shorthand way to talk about the impact of a service outage, Uptime Institute developed the Outage Severity Rating (below). The rating is not scientific and might be compared to the internationally used Beaufort Scale, which describes how various wind speeds are experienced on land and sea.
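To make the idea concrete, here is a minimal sketch, in Python, of how such a five-category rating might be encoded and applied. Only the Category 4 ("Serious") and Category 5 ("Severe") labels come from this article; the remaining labels and the numeric thresholds are hypothetical placeholders, not Uptime Institute's published definitions.

```python
from enum import IntEnum

class OutageSeverity(IntEnum):
    """Hypothetical encoding of a five-category outage severity rating.
    Only categories 4 ("Serious") and 5 ("Severe") are named in the text;
    the lower-category labels are illustrative placeholders."""
    CATEGORY_1 = 1  # placeholder: negligible, little external impact
    CATEGORY_2 = 2  # placeholder: minimal, limited service disruption
    CATEGORY_3 = 3  # placeholder: noticeable customer-facing disruption
    SERIOUS = 4     # "Serious", per the article
    SEVERE = 5      # "Severe", per the article

def classify(financial_loss_usd: float, customers_affected: int) -> OutageSeverity:
    """Illustrative heuristic only; the real rating is assigned by analysts
    using criteria not reproduced here, and these thresholds are invented."""
    if financial_loss_usd > 1_000_000 or customers_affected > 1_000_000:
        return OutageSeverity.SEVERE
    if financial_loss_usd > 100_000 or customers_affected > 100_000:
        return OutageSeverity.SERIOUS
    if customers_affected > 10_000:
        return OutageSeverity.CATEGORY_3
    if customers_affected > 0:
        return OutageSeverity.CATEGORY_2
    return OutageSeverity.CATEGORY_1

print(classify(5_000_000, 250_000).name)  # SEVERE
```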
By applying this scale to widely reported outages from 2016-2018, Uptime Institute tracked 11 “Severe” Category 5 outages and 46 “Serious” Category 4 outages. Of these 11 severe outages, no fewer than five occurred at airlines. In each case, multi-million-dollar losses occurred, as flights were cancelled and travelers stranded. Compensation was paid, and negative headlines ensued.
Analysis suggests both obvious and less obvious reasons why airlines were hit so hard. The obvious one is that airlines are not only highly dependent on IT for almost all elements of their operations, but also that the impact of disruption is immediate and expensive. Less obviously, many airlines have been disrupted by low-cost competition and forced to “do more with less” in the field of IT. This leads to errors and over-thrifty outsourcing, and it makes incidents more likely.
If we consider Categories 4 and 5 together, the banking and financial services sector is the most over-weighted. For this sector, outage causes varied widely, and in some cases, cost cutting was a factor. More commonly, the real challenge was simply managing complexity and recovering from failures fast enough to reduce the impact.
——————————————————————————–
Members of Uptime Institute Network experience HALF of the incidents that cause these types of service disruptions. Members share a wealth of experiences with their peers from some of the largest companies in the world. Membership instills a keen awareness of operational efficiency and best practices that can be put into action every day. For membership information click here.
Artificial intelligence (AI) is being used in data centers to drive up efficiencies and drive down risks and costs. But it also creates new types of risks. This is one of the findings from a recent Uptime Intelligence research report #25, “Very smart data centers: How artificial intelligence will power operational decisions”, published in April 2019 (and available to Uptime Institute Network members).
Some of these risks are not clear-cut. Take, for example, new AI-driven cloud services, such as data center management as a service (DMaaS), that pool anonymized data from hundreds or thousands of other customers’ data centers. They apply AI to this vast store of information and then deliver individualized insight to customers via a wide area network, usually the internet. But that raises a big question: Who owns the data, the supplier or the customer? The answer is usually both: customers can keep their own data but the supplier typically also retains a copy (even if the paid service stops, the data becomes an anonymous part of their data lake.)
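As a rough illustration of what that pooling might involve, here is a minimal sketch, assuming hypothetical field names and a salted-hash pseudonymization scheme, of how a telemetry record could be stripped of direct identifiers before entering a shared data lake. It is not any vendor's actual DMaaS pipeline.

```python
import hashlib

def anonymize_reading(reading: dict, salt: str) -> dict:
    """Strip direct identifiers from a telemetry record and replace the
    site identity with a salted one-way hash before it is pooled.
    Field names are hypothetical; real DMaaS pipelines differ."""
    pseudonym = hashlib.sha256((salt + reading["site_id"]).encode()).hexdigest()[:12]
    return {
        "site": pseudonym,               # no reversible link back to the customer
        "timestamp": reading["timestamp"],
        "metric": reading["metric"],
        "value": reading["value"],
        # customer name, city and rack labels are deliberately dropped
    }

pooled_lake = []  # stands in for the supplier's shared, multi-customer data lake
raw = {"site_id": "acme-nyc-dc1", "timestamp": "2019-04-01T12:00:00Z",
       "metric": "supply_air_temp_c", "value": 24.1,
       "customer": "ACME Corp", "city": "New York"}
pooled_lake.append(anonymize_reading(raw, salt="supplier-secret"))
print(pooled_lake[0])
```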
Whether lack of certainty or clarity over data ownership actually constitutes a risk to data centers is vigorously debated. Some say that if hackers accessed data, it would be of little use as the data is anonymized and, for example, does not include specific location details. Others say hackers could apply techniques, including their own AI analysis, to piece together sensitive information to build up a fairly complete picture.
This is just one example of the risks that should at least be considered when deploying AI. In our new report, we describe four areas of risk with AI offerings:
Commercial risk: AI models and data are (often) stored in the public cloud and outside of immediate control (if using a supplier model) or may be on-site but not understood.
Commercial machine learning products and services raise the risk of lock-in because processes and systems may be built on top of models using data that cannot be replicated.
Pricing may rise as adoption grows; at present, prices are kept low to attract new data (which builds up the effectiveness of AI models) or to drive equipment services or sales.
A high reliance on AI could change skills requirements or “deskill” staff positions, creating a longer-term staffing risk.
Legal and service level agreement risk: Again, AI models and data are stored outside of immediate control (if using a supplier model) or may be on-site but not understood.
This may be unacceptable for some, such as service providers or organizations operating within strict regulatory environments.
In theory, it could also shift liability back to a data center AI service supplier — a particular concern for any automated actions provided by the service.
Technical risk: While we usually understand what types of data are being used for human actions and recommendations, it is not always possible to understand why and exactly how a machine reached a decision.
It may not be possible to easily change or override decisions.
As machines guide more decisions, core skills may become outsourced, leaving organizations vulnerable.
Interoperability risk and other “unknown unknowns”: The risk of “2001”-style HAL scenarios (i.e., the singularity) is overplayed, but there is an unknown, long-term risk.
One example is that data center AI is likely to be embedded in most cases (i.e., inside individual equipment and management systems). This could lead to situations where two, three or five systems all have some ability to take action according to their own models, producing a potential runaway situation or conflicts between systems. For example, a building management system may turn up the cooling while an IT system moves workload to another location, which turns up cooling elsewhere (as sketched below).
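The sketch below is a toy simulation of that kind of conflict: a building management loop keeps adding cooling while an independent IT loop keeps migrating workload away, and because neither model knows about the other, the room ends up over-cooled and half-empty. All setpoints, gains and the thermal model are invented for illustration.

```python
# Toy simulation: two independent control loops acting on the same room.
# All numbers (setpoints, gains, thermal response) are made up for illustration.

room_temp = 27.0        # degrees C
cooling_kw = 150.0      # cooling output managed by the building management system
it_load_kw = 200.0      # IT load currently placed in this room

for step in range(6):
    # BMS loop: sees the room is warm, turns cooling up.
    if room_temp > 24.0:
        cooling_kw += 40.0
    # Independent IT loop: also sees the room is warm, migrates workload away.
    if room_temp > 25.0:
        it_load_kw -= 80.0
    # Crude thermal response: temperature follows heat load minus cooling.
    room_temp += 0.01 * (it_load_kw - cooling_kw)
    print(f"step {step}: temp={room_temp:5.1f}C "
          f"cooling={cooling_kw:5.0f}kW load={it_load_kw:5.0f}kW")

# Each loop "succeeds" on its own terms, but together they overshoot:
# the room is driven well below setpoint and the workload is pushed out.
```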
——————————————————————————–
For more on Artificial Intelligence in the data center, read our latest report, Very smart data centers: How artificial intelligence will power operational decisions, available to members of the Uptime Institute Network. Members share a wealth of experiences with their peers from some of the largest companies in the world. Membership instills a keen awareness of operational efficiency and best practices that can be put into action every day. For membership information click here.
The 1970s-era environmental phrase “Think globally, act locally” is an apt way for data center operators to consider the best approach to understanding and addressing the effects of climate change on their facilities and IT operations today.
Globally, we all hear that climate change threatens to bring warmer temperatures, stronger storms, rising sea levels, and more rain. It is not much of a stretch to agree that at the local level this can mean increased flooding and droughts, higher ambient temperatures and winds, stronger lightning, and more humidity. So it is not hard to understand that these changes already have significant implications for public infrastructure and real estate ventures, as well as for their digital constituents, including the delivery of IT services.
Uptime Institute advises data center operators and owners to evaluate their facilities and procedures in light of this growing threat and, where possible, to act to mitigate the risks these environmental changes pose to their physical infrastructure and to the operational best practices currently in place.
Survey Results Show the Data Center Industry is Taking Action
In our 2018 survey of hundreds of data center owners and operators, we asked about climate change and what was being done to address it in their infrastructure and planning processes.
Owners & Operators Should Refresh Plans Based on Latest Weather and Storm Models
To get started, data center operators should strive to create a baseline of service resiliency with climate change in mind. Many facilities have been designed to meet the level of flooding, storm, drought, or fire events as they were understood years ago. These facilities also may have adequate capacity to meet some amount of increased cooling demands from higher ambient temperatures, but the specific additional capacity may not be well understood relative to the needs brought about through climate change.
Resiliency assessments could reveal greater-than-expected vulnerabilities. For example, a 20-year-old data center originally built to withstand a 100-year storm as defined in the year 2000 may no longer withstand the greater flooding predicted to occur as part of a 100-year storm defined in 2020. In Houston, TX, 19.5 inches of rain defined a 100-year storm as recently as 2000. Today, that same amount of rainfall is a 17.5-year event. By 2100, it is expected to be a 5.5-year event. As a corollary, 100-year events in the near future will involve far greater rainfall and flooding than we expect today. This evolving threat means that data centers slowly become less resilient and then are suddenly vulnerable if planning and mitigation are not an ongoing process.
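The arithmetic behind those return periods is worth spelling out: an “N-year” event has roughly a 1-in-N chance of being equalled or exceeded in any given year, so the odds of seeing at least one over a facility's planning horizon compound quickly. The short sketch below applies that formula to the Houston figures quoted above; the 20-year horizon is simply an assumed facility life.

```python
# Annual exceedance probability of an "N-year" event is roughly 1/N, and the
# chance of at least one exceedance over `horizon_years` is 1 - (1 - 1/N)**years.

def prob_at_least_one(return_period_years: float, horizon_years: int) -> float:
    annual_p = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_p) ** horizon_years

for label, rp in [("as defined in 2000", 100), ("today", 17.5), ("by 2100", 5.5)]:
    p20 = prob_at_least_one(rp, 20)
    print(f"19.5 in of rain, {label}: {rp}-year event -> "
          f"{100 / rp:.1f}% per year, {p20:.0%} chance over a 20-year facility life")
```

Even on these rough numbers, a flood level treated as a remote design case in 2000 becomes better than a coin flip over the life of the facility at today's return period.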
Hosting Providers Being Proactive, but Still Vulnerable to Underlying Risks
There are no shortcuts to these investigations. Moving to a hosting facility does not necessarily lower an organization’s risk. Hosting facilities can be just as vulnerable to a wide range of climate change impacts, and these providers must be subject to due diligence before hosting contracts are signed, just as in any other site evaluation process. The questions asked as part of due diligence should also be more forward-looking, with climate change stress provisions explicitly identified.
According to Uptime Institute’s 2018 survey, many hosting providers are taking climate change very seriously. The survey found that hosting providers are far more likely than any other sector to have considered or taken a variety of climate change precautions. As the table below indicates, hosting providers (81%) were far more likely to say they were preparing for climate change than the industry average (45%). They were also far more willing to re-evaluate technology selection.
Sector                     Preparing for Climate Change (%)    Willing to Re-Evaluate Technology Selection (%)
Colocation                 81                                  54
Telecommunications         60                                  33
Financial                  61                                  33
Software/Cloud Services    57                                  32
Industry Average           45                                  33
Data center operators have many other operational issues to consider as well. Hardening the physical facility infrastructure is a good first step, but data center operators must also re-examine their MOPs, SOPs, and EOPs (methods of procedure, standard operating procedures, and emergency operating procedures), as well as their supporting vendors’ SLAs.
One good example of proper planning: during Superstorm Sandy, a major brokerage in New York City remained 100% operational because it had developed and followed suitable procedures and best practices for addressing conditions proactively. Uptime Institute found that many firms like this one, which switched to generator power in advance of an expected utility outage, remained operational through the storm and continued to operate successfully, provided supply procedures were in place to guarantee ongoing fuel availability. In addition, these firms made provisions for local operational staff whose homes and families were threatened by the storm.
Preparation Requires Careful Evaluation of Partners as well as Local Infrastructure
But data centers do not work as islands. A data center encircled by water, even if operational, cannot support the business if it loses local network connectivity. Likewise, flooded roadways and tunnels can interrupt fuel oil deliveries and block staff access. The extent of these threats cannot be fully evaluated without including an extensive list of local conditions ranging from town and village highway closure reports to statewide emergency services response plans as well as those plans from all telecommunications providers and other utilities. These services each have their own disaster recovery plans.
The incremental nature of climate change means that resiliency assessments cannot be one-time events. Data center resiliency in the face of climate change must be re-assessed often enough to keep pace with changing weather patterns, updated local and state disaster planning, and telecommunications and local infrastructure.
Costs Expected to Rise as Resources Become More Scarce and Risks Increase
Finally, data center operators must keep abreast of a laundry list of costs expected to rise as a result of climate change. For example, insurance companies have already begun to increase premiums because of unexpected losses due to unprecedented and anomalous events. In the same way, resources such as water are expected to be increasingly expensive because of shortages and local restrictions. Utilities and suppliers could also be affected by these rising costs, which are likely to cause the prices of diesel fuel and electricity to increase.
Now is the time to evaluate climate change threats on a local basis, facility by facility. Data center operators have an obligation to fully understand how climate change affects their facilities and their customers. They may find that a mix of solutions is adequate to address their needs today, while other, more radical solutions are needed for tomorrow. Climate change, or whatever you wish to call it, is here… now.
This is a guest post written by Brett Ridley, Head of Central Operations and Facility Management for NEXTDC. NEXTDC is Australia’s leading independent data centre operator with a nationwide network of Uptime Institute certified Tier III and Tier IV facilities. NEXTDC provides enterprise-class colocation services to local and international organisations.
If you are interested in participating in our guest post program, please contact [email protected].
Data centres provide much of the connectivity and raw computing power that drives the connected economy. Government departments, financial institutions and large corporates usually run at least some of their own IT infrastructure in-house but it’s becoming more common for them to outsource their mission critical infrastructure to a certified high-availability colocation data centre.
It wasn’t always like this. In the past, if a large organisation wanted to ensure maximum uptime, they would hire specialist engineers to design a server room in their corporate headquarters and invest capital to strengthen floors, secure doors and ensure sufficient supplies of connectivity and electricity.
So what changed?
For a start, the reliance on technology and connectivity has never been greater. More and more applications are mission critical and organisations are less tolerant of outages even for their secondary systems. In addition, advances in processor technology have resulted in much faster, smaller and denser servers. As servers got smaller and demand for computing and storage increased, organisations would pack more and more computing power into their server rooms. Server rooms also started growing, taking over whole floors or even entire buildings.
As the density of computing power inside the data centre increased, the power and cooling requirements became more specialised. A room with a hundred regular servers might warm up a little; a portable A/C unit could keep staff comfortable, and the servers themselves wouldn’t need any additional cooling.
However, if that same room held a hundred of the latest multi-core blade server cabinets, not only would its power requirements increase dramatically, but the room would also need specialist cooling and ventilation systems to deal with the sheer amount of heat generated by the servers and avoid a complete hardware meltdown.
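Some back-of-envelope arithmetic shows why. The per-device power figures below are assumptions chosen only to illustrate the order-of-magnitude gap; essentially all of the electrical power drawn by IT equipment ends up as heat that the room must reject.

```python
# Rough heat-load comparison. The wattages are illustrative assumptions,
# not measurements; the point is the order-of-magnitude difference.

N = 100
legacy_server_w = 300        # assumed draw of an older rack server
blade_cabinet_w = 15_000     # assumed draw of a densely packed blade cabinet

legacy_heat_kw = N * legacy_server_w / 1000   # ~30 kW of heat
blade_heat_kw = N * blade_cabinet_w / 1000    # ~1,500 kW (1.5 MW) of heat

print(f"100 regular servers: about {legacy_heat_kw:,.0f} kW of heat")
print(f"100 blade cabinets:  about {blade_heat_kw:,.0f} kW of heat")
# The first figure is at the edge of what ad hoc room cooling can absorb;
# the second demands purpose-built cooling plant, ventilation and power.
```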
At this point, relatively few organisations find it desirable or cost effective to run their own data centre facilities.
Ensuring the infrastructure of a dense computing data centre is designed and maintained to a level where it is completely reliable is an ongoing, time consuming, tedious and extremely expensive process. It requires specialist, dedicated staff backed up by a committed management with deep pockets.
In the data centre world this is known as “operational sustainability”, and it’s the primary goal of all large data centre managers.
A list of requirements describing the best practices needed to ensure the operational sustainability of data centres has been developed by Uptime Institute, which was established with a mission to offer the most authoritative, evidence-based, and unbiased guidance to help global companies improve the performance, efficiency, and reliability of their business critical infrastructure.
More than 1,500 data centres have been certified by Uptime Institute, which meticulously examines every component of the data centre including its design, construction, management and procedures. Once an assessment is positively completed, the data centre is then certified with the appropriate Tier rating.
Uptime Institute Tier Ratings
Tier I: Basic site infrastructure
Tier II: Redundant capacity components site infrastructure
Tier III: Concurrently maintainable site infrastructure
Tier IV: Fault tolerant site infrastructure
Bronze, Silver & Gold Ratings
Tier Certification of Operational Sustainability awards receive a Bronze, Silver or Gold rating. These ratings signify the extent to which a data center is optimizing its infrastructure performance and exceeding the baseline Tier standards.
We’re extremely passionate about the data centres we build and operate and we are totally obsessed with ensuring they are staffed and maintained in an environment that minimises human error.
NEXTDC has become the first data centre operator in the Southern Hemisphere to achieve Tier IV Gold Certification of Operational Sustainability from Uptime Institute. NEXTDC’s B2 data centre in Brisbane received the Tier IV Gold Certification, highlighting the company’s excellence in managing long-term operational risks and behaviours, and showcasing its commitment to customers to be robustly reliable, highly efficient and to deliver 100% uptime.
The Gold Operational Sustainability standard recognises the human factors in running a data centre to fault tolerant standards. It covers climate-change preparedness, the growing need for edge computing, outage risk mitigation, energy efficiency, increasing rack density, and staffing trends. Achieving Gold certification requires a score of greater than 90% in all areas; Silver is 80%-89% and Bronze is 70%-79%.
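Because Gold requires the threshold to be met in all areas, the weakest-scoring area effectively determines the award. The sketch below expresses that logic; the area names are hypothetical and the handling of boundary scores (for example, exactly 90%) is a guess, since only the ranges are quoted above.

```python
def operational_sustainability_award(area_scores: dict) -> str:
    """Map per-area scores (percent) to an award using the ranges quoted above.
    The weakest area drives the result; exact boundary handling is assumed."""
    weakest = min(area_scores.values())
    if weakest > 90:
        return "Gold"
    if weakest >= 80:
        return "Silver"
    if weakest >= 70:
        return "Bronze"
    return "Not certified"

# Hypothetical assessment areas and scores, for illustration only.
scores = {"management and operations": 94,
          "building characteristics": 92,
          "site location": 89}
print(operational_sustainability_award(scores))  # Silver: one area below the Gold bar
```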
The physical design and construction of a data centre can be solid, but that’s only two-thirds of the story. Human error is the biggest challenge we face when it comes to outages, with around 80% of issues being cited as accidental.
If staff are not properly trained and the correct processes are not in place, it doesn’t matter if the building and hardware are perfect, it’s only a matter of time before an outage will strike. This is why NEXTDC invests hundreds of thousands of dollars every year to educate its staff, partners and vendors in an effort to maximise operational sustainability.
Other data centres may claim they have procedures and trained staff but unless they’re regularly assessed by an independent third party and benchmarked against the best data centres on the planet, their claims are worthless.
To qualify for even the lowest Bronze certificate, data centres need to establish training programs for all of their staff. However, Uptime Institute also examines any risks posed by other users of our facility – the clients and partners.
We have quarterly training sessions for our operations team – our staff are tested and trained like no one else in the industry, and Uptime Institute requires evidence of that. They want to know about all the testing and training carried out, they want to know if we have hired any new staff, and they will check to ensure new hires have completed the necessary training and a competency-based assessment.
We need to know that our national partners, for example someone like Nilsen Networks, know what they’re doing. We hold regular training days for them on our MOPs (methods of procedure) and are required to show evidence of this training to Uptime Institute. They also want to examine our maintenance records to show that the procedures are being followed to the letter.
We have a procedure for everything — it’s all written down and laid out. We’ve colour-coded our folders, we’ve got the command centre set up, and we make our staff and partners practice over and over and over again to ensure that, during an emergency, when stress levels are high, they are far less likely to make costly mistakes.
This dedication to detail across the whole process, from design and construction through to staffing and maintenance of our facilities, is what sets NEXTDC apart from alternative data centre operators. We pay attention to all the details to ensure that your business remains connected and is available 100% of the time.
It’s the extra sleep-at-night factor that you don’t get with anybody else. The training and skillset of NEXTDC staff matches the design and engineering excellence of our buildings.
Software-enabled application resiliency is now playing a significant and increasing role in bolstering application availability and reliability across the enterprise, reducing risk to the business. No longer are clients solely reliant upon the stability provided by the electrical and mechanical systems in their data center. By utilizing new software techniques, enterprises can now deploy applications that span multiple instances, from enterprise to colocation to cloud, bolstering the availability and reliability of their critical applications.
Hybrid Deployments Create a Patchwork of Dependencies on Software Agility, Network Infrastructure and Cloud Partners
While we’ve not yet seen any enterprises that rely solely upon software techniques to provide reliability, we have noted that many leaders are now complementing their data center reliability efforts with some form of software agility capabilities. And as more clients move some of their workloads to the cloud, they are beginning to utilize built-in features such as high-availability and reliability zones to augment their applications’ reliability beyond the capabilities of the individual sites themselves.
Additionally, these same clients are starting to replace or augment their requirement for risk management through disaster recovery with the high resiliency provided by availability zones from cloud providers. And bleeding-edge adopters are beginning to utilize advanced new work distribution technologies such as Google’s Cloud Spanner, which allows clients to scale globally with low latency while still preserving transactional consistency, by automatically replicating data across several data centers to ensure integrity.
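As a flavour of what these built-in features let application teams do, here is a minimal zone-aware failover sketch. The zone names and health-check URLs are placeholders, not any provider's real API; production deployments would normally rely on the provider's own load balancing and health checking rather than hand-rolled logic like this.

```python
import urllib.request

# Hypothetical health-check endpoints for an application deployed in three
# availability zones. Names and URLs are placeholders for illustration.
ZONE_ENDPOINTS = {
    "zone-a": "https://app.zone-a.example.com/healthz",
    "zone-b": "https://app.zone-b.example.com/healthz",
    "zone-c": "https://app.zone-c.example.com/healthz",
}

def healthy(url: str, timeout: float = 2.0) -> bool:
    """Treat a zone as healthy if its health endpoint answers 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.getcode() == 200
    except OSError:
        return False

def pick_zone(preferred: str = "zone-a") -> str:
    """Prefer the primary zone, fail over to any other healthy zone."""
    if healthy(ZONE_ENDPOINTS[preferred]):
        return preferred
    for zone, url in ZONE_ENDPOINTS.items():
        if zone != preferred and healthy(url):
            return zone
    raise RuntimeError("no healthy zone available")

# print(pick_zone())  # uncomment to run against real endpoints
```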
Keep in mind that as clients move applications and data to edge locations, they can now purchase newly developed cloud services that use micro data centers connected in a grid formation to create a virtualized data center spanning an entire city, with impressive overall performance and availability. This type of implementation requires highly specialized orchestration software, as well as a dedicated low-latency fiber network that is very carefully designed, implemented and automated to provide the high level of service required.
Cloud Providers Become the New Owners of Infrastructure Resiliency
Given all these advances in software related agility, it must be noted that all these cloud and edge providers themselves still continue to maintain highly reliable underlying electrical and mechanical foundations for their data center facilities. And since the connecting network now plays a much bigger and more critical role in overall application resiliency, it too requires the same level of focus on designed-in redundancy, reliability and fault tolerance, as that traditionally given to data center infrastructure design. Risk management comes in so many forms to be sure.
So overall, there is a big new world of options and approaches when it comes to application resiliency design, with most enterprises still using a belt-and-suspenders approach of software and hardware to reduce risk and ensure resiliency and reliability. But with new cloud services providing ever more self-service capabilities, it is becoming critically important for customers to clearly evaluate their modern digital business requirements and use them to map out a strategy that provides the highest level of availability and resiliency at a cost aligned with the business itself.
And with so much at stake, the risk management aspects of hybrid infrastructures should not be ignored just because they are hard to quantify; ignoring them puts the business itself at risk. Remember, the measure of a great leader is one who is not afraid to ask for help. Ask for help if you need it!
We’ve entered an era where our IT infrastructures are becoming a compilation of capacity that is spread out and running on a wide range of platforms: some we completely control, some we control partially and some we don’t control at all. No longer should our IT services discussions start with ‘And in the data center we have…’; instead they need to center around mission critical business applications and transactions that are provided by ‘the fabric’.
Fabric Computing
Who would have thought that all of us long-time ‘data center professionals’ would now be on the hook to deliver IT services using a platform or set of platforms that we have little or no control over? Who would have thought we’d be building our infrastructures like fabric, weaving various pieces together like a finely crafted quilt? Yet here we are, and between the data centers we own, the colocation space we fill and the clouds we rent, we are putting a lot of faith in a lot of people’s prowess to create these computing quilts or fabrics.
We all know that the executive committee will ask us regularly, “We have now transformed to be digital everything. How prepared are we to deliver these essential business critical services?”, and we in turn know that we must respond with a rehearsed confirmation of readiness. The reality is that we are crossing our fingers and hoping that the colos we’ve chosen and the cloud instances we’ve spun up won’t show up on the 6 o’clock news each night. We simply have less and less control as we outsource more and more.
A big challenge, to be sure. What we need to do is focus on the total capacity needed, identify the risk tolerance for each application, and then look at our hybrid infrastructure as a compilation of sub-assemblies, each with its own characteristics for risk and cost. While it’s not simple math to figure out our risk and cost, it *IS* math that needs to be done, application by application. Remember, I can now run nearly any application in my in-house data centers, spin it up in a colocation site, or even burst up to the cloud on demand. The user of that application would not likely know the difference in platform, yet the cost and risk to process that transaction would vary widely. A rough sketch of that calculation follows.
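Here is a minimal sketch of that application-by-application math, using invented placeholder figures: combine each platform's annual run cost with its expected outage cost (expected outage hours times the application's cost of downtime per hour) and compare.

```python
# For each application, compare platforms on expected annual cost:
# run cost plus expected outage cost (annual outage hours x cost per hour).
# All figures below are invented placeholders for illustration.

platforms = {
    #                (annual run cost $, expected outage hours/year)
    "in-house":      (250_000, 2.0),
    "colocation":    (180_000, 1.0),
    "public cloud":  (150_000, 4.0),
}

applications = {
    #                 cost of downtime ($/hour)
    "payments":        200_000,
    "internal wiki":       500,
}

for app, hourly_impact in applications.items():
    costs = {p: run + hours * hourly_impact
             for p, (run, hours) in platforms.items()}
    best = min(costs, key=costs.get)
    print(f"{app:13s} -> {best:12s} | "
          + ", ".join(f"{p}: ${c:,.0f}" for p, c in costs.items()))
```

Run with these placeholder numbers, the payments application lands in colocation while the wiki lands in the cloud, which is exactly the point: the platform choice has to be justified application by application, not data center by data center.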
But we have SLAs to manage all of this third-party risk, right? Nope. SLAs are the dirty little secret of the industry: they essentially describe what happens when a third party fails to keep things running. Most SLAs spend most of their prose explaining what the penalties will be WHEN the service fails. SLAs do not prevent failure; they just articulate what happens when failures occur.
Data Center Tools
So this now becomes a pure business discussion about supporting a mission critical ‘fabric’. This fabric is the hybrid infrastructure we are all already creating. What needs to be added to the mix are the business attributes of cost and risk: for each platform choice, a cost calculation and a risk justification for why we made it. Remember, we can run nearly ANY application on any one of the platform choices described above, so there must be a clear reason WHY we have done what we have done, and we need to be able to articulate and defend those reasons. We also need to think about service delivery when it spans multiple platforms and can actually traverse from one to another over the course of any given hour, day or week. It’s all a set of calculations!
Put your screwdrivers away and fire up your risk management tools, your financial modelling tools, or even your trusty copy of Excel! This is the time to work through the business metrics, rather than the technical details.
Welcome to the era of Mission Critical Computing Fabric!