Top-10 Digital Infrastructure Trends for 2020
The digital infrastructure industry continues to grow and change at a striking pace. Across the world, a thriving community of investors, designers, owners and operators are grappling with many of the same issues: resiliency and risk, the impact of cloud, the move to the edge, rapid innovation, and unpredictable (although mostly upward) demand. What should stakeholders in this industry expect in 2020? Which innovations will make a difference — and which have been exaggerated? Is the challenge of distributed resiliency being solved, or is it getting worse? What are regulators likely to do?
Ten data center industry trends in 2020
The Top-10 trends the Uptime Institute Intelligence team has identified show an industry that is confidently expanding toward the edge, that is attractive to many new investors, and that is embracing new technologies and architectures — but one that is also running against some headwinds. Resiliency concerns, premature expectations about the impact of 5G, climate change, environmental impact and increasingly watchful regulators are among the hazards that must be successfully navigated.
So without further ado, here are the Top-10 Trends for 2020…
#1: Outages drive authorities and businesses to act
Big IT outages are occurring with growing regularity, many with severe consequences. Executives, industry authorities and governments alike are responding with more rules, calls for more transparency and a more formal approach to end-to-end, holistic resiliency.
#2: The internet tilts toward the edge
In the coming years, significant data will be generated by many more things and much more will be processed away from the core, especially in regional data centers. Many different types of data centers and networking approaches will be needed.
#3: Data center energy use goes up and up
Energy use by data centers and IT will continue to rise, putting pressure on energy infrastructure and raising questions about carbon emissions. The drivers for more energy use are simply too great to be offset by efficiency gains.
#4: Capital inflow boosts the data center market
Data centers are no longer a niche or exotic investment among mainstream institutional buyers, which are swarming to the sector. New types of capital investors, with deep pockets and long return timelines, could boost the sector overall.
#5: More data, more automated data centers
Many managers are wary of handing key decisions and operations to machines or outside programmers. But recent advances, including the broad adoption of data center infrastructure management systems and the introduction of artificial intelligence-driven cloud services, have made this much more likely. The case for more automation will become increasingly compelling.
#6: Data centers without generators: More pilots, more deployments
Most big data centers cannot contemplate operating without generators, but there is a strong drive to do so. Technological alternatives are improving, and the number of good use cases is proliferating. The next 24 months are likely to see more pilots and deployments.
#7: Pay-as-you-go model spreads to critical components
As enterprises continue to move from a focus on capital expenditures to operating expenditures, more critical infrastructure services and components — from backup energy and software to data center capacity — will be consumed on a pay-as-you-go, “as a service” basis.
#8: Micro data centers: An explosion in demand, in slow motion
The surge in demand for micro data centers will be real, and it will be strong — but it will take time to arrive in force. Many of the economic and technical drivers are not yet mature; and 5G, one of the key underlying catalysts, is in its infancy. Demand will grow faster from 2022.
#9: Staffing shortages are systemic and worsening
The data center sector’s staffing problem is systemic and long term, and employers will continue to struggle with talent shortages and growing recruitment costs. To solve the crisis, more investment will be needed from industry and educators.
#10: Climate change spurs data center regulations
Climate change awareness is growing, and attitudes are hardening. Although major industry players are acting, legislators, lobbyists and the public are pressing for more. More regulations are on the way, addressing energy efficiency, renewable energy and waste reduction.
So what should YOU do as part of your 2020 planning process? Think about your own needs in terms of business requirements and embrace the undeniable fact that the digital infrastructure world around you *IS* changing. In other words, start by crafting a TOP-DOWN understanding of what the business needs from IT, and then chart a path to embrace the trends that are taking hold across the planet. As a general rule, if you are building and/or operating your computing function the same way you did 10 years ago, then you are probably sub-optimized, inefficient and incurring significantly higher costs and risks compared to those who are proactively embracing new ideas and approaches. As always, challenge yourself to compare your existing structures to what a ‘clean-slate’ approach might yield, and then strive to move forward.
Want the WHOLE report with all the DETAIL? You can get it here.
Lithium Ion Batteries for the data center. Are they ready for production yet?
We are often asked for our thoughts about the use of lithium-ion (Li-ion) batteries in data center uninterruptible power supply (UPS) systems. This is a relatively new, evolving technology, and there are still a lot of unknowns. In our business, the unknown makes us nervous and uneasy — as it should, because we are tasked with providing uninterrupted power and cooling to IT assets for the benefit of our clients. They trust us to help them understand risk and innovation and help them balance the two. That makes us less likely to embrace newer technologies until we know what the implications would be in a production mission-critical environment. As a general rule, our experience shows that the least business risk is associated with proven technologies that are tried and tested and have a demonstrable track record and performance profile.
It’s true that not all the failure modes are fully understood where Li-ion is concerned; they will only be known once there is a larger installed base and more actual operational performance data. Tales of thermal runaway in Li-ion installations give justifiable concern, but any technology will fail if stressed beyond its limits. It’s worth considering the real-world conditions under which UPS systems are used.
The charge/discharge cycle on most UPS systems is not very demanding. UPS systems are not often required to transition to batteries, and even when they do, the time is usually short — worst case, 15 minutes — before power is restored either by the utility or the engine generator system. Under normal circumstances the batteries are on a float charge and when called upon to provide power, the amount of power they source is a fraction of the total design capacity. Therefore, as a general rule, UPS batteries are not stressed: it’s typically one discharge, then a recharge. In my experience, batteries handle that just fine — it’s the repeated discharge then recharge that causes issues.
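To put the point about low stress in rough numbers, the sketch below compares an assumed annual cycle count for a UPS battery against typical rated cycle-life figures. Every input (event counts, discharge depth, cycle-life ratings) is an illustrative assumption, not measured or vendor data.

```python
# Rough comparison of a UPS battery's annual duty against rated cycle life.
# All figures below are illustrative assumptions, not vendor or site data.

utility_events_per_year = 4      # assumed transfers to battery per year
test_discharges_per_year = 12    # assumed brief discharges during routine testing
depth_of_discharge = 0.3         # assumed average depth of each discharge (30%)

# Approximate equivalent full cycles accumulated per year
equivalent_cycles = (utility_events_per_year + test_discharges_per_year) * depth_of_discharge

# Placeholder cycle-life ratings (orders of magnitude only)
rated_cycle_life = {"VRLA (assumed)": 300, "Li-ion (assumed)": 3000}

for chemistry, cycles in rated_cycle_life.items():
    print(f"{chemistry}: ~{equivalent_cycles:.1f} equivalent cycles/year; "
          f"cycle life alone would cover ~{cycles / equivalent_cycles:.0f} years")
```

On these assumptions, cycling is nowhere near the limiting factor for either chemistry; float service and calendar aging dominate, which is consistent with the point above.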
Li-ion batteries monitor the cell condition in the battery itself, which helps users avoid problems. If you look at battery design, thermal runaway is usually caused by a charging system that malfunctions and does not reduce the charging current appropriately. Either that, or the battery itself is physically damaged.
Although thermal runaways are possible with any battery, Li-ion has a shorter track record in data centers than vented lead-acid (VLA) or valve-regulated lead-acid (VRLA) batteries. For that reason, I would not be excited about putting Li-Ion batteries in the data hall but would instead keep them in purpose-built battery rooms until we fully understand the failure modes. (See Uptime Intelligence’s Note 8, which discusses the National Fire Protection Association’s proposed standard on space requirements and siting of energy storage technology.)
With that said, because UPS batteries are usually not stressed, we don’t anticipate seeing the Li-ion failures that have occurred in other industries, as long as the batteries and recharging system are functioning properly. While I don’t think there is enough data to know for certain how long the batteries will last in relation to VRLA batteries, I think there is enough history for data center owners and operators to start to consider the technology, as long as the advantages of Li-ion are weighed against the installation, maintenance and operations costs (or savings) to see if it makes sense in a given data center.
So what are the advantages of Li-ion (as compared to VLA or VRLA) batteries? First, the power density of Li-ion technology exceeds that of VLA or VRLA, so Li-ion batteries deliver more power from a smaller footprint. Second, Li-ion technology allows for more charge/discharge cycles without degrading the battery. All batteries degrade with repeated cycling, but VLA and VRLA batteries need to be replaced when they reach about 80% of original capacity because after that point, the remaining capacity falls off very quickly. In comparison, Li-ion batteries lose capacity gradually and predictably. Finally, suppliers claim that, despite their higher initial cost, Li-ion batteries have a lower total cost of ownership than VLA or VRLA.
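To frame the total-cost-of-ownership claim, here is a minimal annualized-cost sketch. The purchase prices, service lives and maintenance figures are placeholder assumptions for illustration; real quotes, cooling savings and floor-space value would all change the outcome.

```python
# Illustrative annualized cost comparison for UPS battery strings.
# All prices and lifetimes are placeholder assumptions, not vendor data.

def annualized_cost(initial_cost, service_life_years, annual_maintenance):
    """Purchase cost spread over the service life, plus yearly maintenance."""
    return initial_cost / service_life_years + annual_maintenance

vrla = annualized_cost(initial_cost=60_000, service_life_years=5, annual_maintenance=4_000)
liion = annualized_cost(initial_cost=120_000, service_life_years=12, annual_maintenance=1_500)

print(f"VRLA   (assumed figures): ~${vrla:,.0f} per year")
print(f"Li-ion (assumed figures): ~${liion:,.0f} per year")
```

Under these particular assumptions Li-ion comes out ahead, but the result flips easily with different prices or a shorter achievable service life, which is exactly why the cost vs. benefit analysis described next matters.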
Data center owners/operators who are considering replacing the existing battery string with Li-ion should first verify whether the installed UPS system will operate properly with Li-ion batteries — the charging circuit and controls for Li-ion differ from VLA or VRLA. If the UPS system is compatible with Li-ion technology, the next step is to look at other factors (performance, siting, costs, etc.). Perform a cost vs. benefit analysis; if there’s a good case to use Li-ion, consider a small test installation to see if the technology performs as expected. This process should not only confirm whether the business case is supported but also help address the (very human and completely appropriate) skepticism of new technology.
In my opinion, the current information is promising. Li-ion batteries are used in many industries more demanding than data centers, which is sufficient to indicate that the technology is not a passing fad. And manufacturers are working with different battery chemistries to improve performance and stability, so the technology is improving over time. But all factors must be weighed fully: the cost of Li-ion batteries is significant, and not all of the claims can be substantiated with long-term data. The applicability of any technology must be evaluated on a case-by-case basis — what makes sense (cost and risk) for one data center may not for another.
Non-production IT can hinder mission-critical operations
Separating production and non-production assets should be an operational requirement for most organizations. By definition, production assets support high-priority IT loads — servers that are critical to a business or business unit. In most organizations, IT will have sufficient discretion to place these assets where they have redundant power supplies, sufficient cooling and high levels of security. Other assets can be placed elsewhere, preserving the most important infrastructure for the most important loads. However, business needs sometimes require IT organizations to operate both production (mission critical) and non-production environments in the same facility.
In these instances, facility managers must be careful to prevent the spread of non-production IT, such as email, human resources, telephone and building controls, into expensive mission-critical spaces. While non-production IT generally does not increase risk to mission-critical IT during normal operations, mixing production and non-production environments will reduce mission-critical capacity and can make it more difficult to shed load.
Keeping production and non-production IT separate:
Reduces the chance of human error in operations.
Preserves power, cooling and space capacity.
Simplifies the process of shedding load, if necessary.
Our report Planning for mission-critical IT in mixed-use facilities (available to Uptime Institute Network community members) discusses how operating a data center in a mixed-use facility can be advantageous to the organization and even to the IT department, but can also introduce significant risk. Establishing and enforcing budget and access policies is critical in these circumstances; the entire organization must understand and follow policies limiting access to the white space.
Organizations do not need to maintain separate budgets and facilities staff for production and non-production operations — they’re accustomed to managing both; both are clearly IT functions. But the similarities between production and non-production IT do not mean that these assets should share circuits — or even racks. The presence of non-production IT gear in a mission-critical white space increases operational risk, and the less critical gear reduces the availability of mission-critical resources, such as space, cooling or power. The infrastructure required to meet the demands of mission-critical IT is expensive to build and operate and should not be used for less critical loads.
Separating the two classes of assets makes it easier for IT to manage assets and space, as well as reduce demands on generator and uninterruptible power supply capacity, especially in the event of an incident. Similarly, keeping the assets separate makes it easier for operations to shed load, if necessary.
Limiting the use of mission-critical infrastructure to production workloads can help organizations defer expansion plans. In addition, it makes it easier to limit access to mission-critical spaces to qualified personnel, while still allowing owners of non-production gear to retain access to their equipment.
However, not all companies can completely separate production and non-production loads. Other solutions include designing certain areas within the data center strictly for noncritical loads and treating those spaces differently from the rest of the facility. This arrangement takes a lot of rigor to manage and maintain, especially when the two types of spaces are in close proximity. For example, non-production IT can utilize single-corded devices, but these should be fed by dedicated power distribution units (PDUs), with dual-corded loads also served by dedicated PDUs. But mixing those servers and PDUs in a shared space creates opportunities for human error when adding or moving servers.
For this reason and more, the greater the separation between production and non-production assets, the easier it becomes for IT to differentiate between them, allocate efforts and resources appropriately, and reduce operational risk.
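One practical way to enforce that kind of separation is a simple audit against the asset inventory: no PDU (and ideally no rack) should feed both production and non-production loads. The inventory format and names below are hypothetical; the check itself is the point.

```python
# Minimal audit sketch: flag PDUs feeding a mix of production and
# non-production loads. Inventory structure and names are hypothetical.

from collections import defaultdict

inventory = [
    # (device, criticality, pdu)
    ("db-prod-01",    "production",     "PDU-A1"),
    ("db-prod-02",    "production",     "PDU-B1"),
    ("hr-app-01",     "non-production", "PDU-A1"),  # violates the separation policy
    ("mail-relay-01", "non-production", "PDU-C3"),
]

classes_by_pdu = defaultdict(set)
for device, criticality, pdu in inventory:
    classes_by_pdu[pdu].add(criticality)

for pdu, classes in sorted(classes_by_pdu.items()):
    if len(classes) > 1:
        print(f"{pdu}: feeds mixed production and non-production loads; review placement")
```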
The full report Planning for mission-critical IT in mixed-use facilities is available to members of the Uptime Institute Network. Guest Registration is easy and can be started here.
99 Red Flags
One of the most widely cited metrics in the IT industry is for availability, expressed in the form of a number of nines: three nines for 99.9% availability (minutes of downtime per year), extending to six nines — 99.9999% — or even, very rarely, seven nines. What this should mean in practice is shown below:
99.9% (three nines): roughly 8.8 hours of downtime per year
99.99% (four nines): roughly 53 minutes per year
99.999% (five nines): roughly 5.3 minutes per year
99.9999% (six nines): roughly 32 seconds per year
99.99999% (seven nines): roughly 3 seconds per year
The metric is very widely used (more so in the context of IT equipment than facilities) and is commonly, if not routinely, cited in cloud services and colocation contracts. Its use is rarely questioned. Contrast this with the power usage effectiveness (PUE) metric, which has been the subject of a decade of industry-wide debate and is treated with such suspicion that many dismiss it out of hand.
So, let us start (or re-open) the debate: To what degree should customers of IT services, and particularly cloud services, pay attention to the availability metric — the availability promise — that almost all providers embed in their service level agreement (SLA) – the 99.99% or 99.999% number? (The SLA creates a baseline below which compensation will be paid or credits applied.)
In speaking to multiple colleagues at Uptime Institute, there seems to be a consensus: Treat this number, and any SLAs that use this number, with extreme caution. And the reasons why are not so dissimilar from the reasons why PUE is so maligned: the metric is very useful for certain specific purposes but it is used casually, in many different ways, and without scientific rigor. It is routinely misapplied as well (sometimes with a clear intention to distort or mislead).
Here are some of the things that these experts are concerned about. First, the “nines” number, unless clearly qualified, is neither a forward-looking metric (it doesn’t predict availability), nor a backward-looking number (it doesn’t say how a service has performed); usually a time period is not stated. Rather, it is an engineering calculation based on the likely reliability of each individual component in the system, based on earlier tests or manufacturer promises. (The method is rooted in longstanding methodologies for measuring the reliability of dedicated and well-understood hardware equipment, such as aircraft components or machine parts.)
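A minimal sketch of that kind of engineering calculation: multiply the assumed availability of every component in the delivery chain (a pure series model). The component list and figures below are hypothetical, chosen only to show the mechanics.

```python
# Series-availability model built from assumed per-component figures.
# The components and numbers are hypothetical, for illustration only.

components = {
    "power path (utility + UPS)": 0.99995,
    "cooling":                    0.99995,
    "network":                    0.9999,
    "server hardware":            0.9999,
    "application software":       0.999,
}

availability = 1.0
for name, a in components.items():
    availability *= a  # series model: the service is up only if every component is up

downtime_minutes = (1 - availability) * 365 * 24 * 60
print(f"Modelled availability: {availability:.4%} "
      f"(~{downtime_minutes:.0f} minutes of downtime per year)")
```

The output looks reassuringly precise, which is part of the problem the next paragraphs describe.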
This is where the trouble starts. Complex systems based on multiple parts working together or in which use cases and conditions can change frequently are not so easily modeled in this way. This is especially true of software, where changes are frequent. To look backward with accuracy requires genuine, measured and stable data of all the parts working together; to look forward requires an understanding of what changes will be made, by whom, what the conditions will be when the changes are made, and with what impact. Add to this the fact that many failures are due to unpredictable configuration and/or operation and management failures, and the value of the “nines” metric becomes further diluted.
But it gets worse: the role of downtime due to maintenance is often not covered or is not clearly separated out. More importantly, the definition of downtime is either not made clear or it varies according to the service and the provider. There are often — and we stress the word “often” — situations in modern, hybrid services where a service has slowed to a nearly non-functional crawl, but the individual applications are all still considered to be “up.” Intermittent errors are even worse — the service can theoretically stop working for a few seconds a day at crucial times yet be considered well within published availability numbers.
The providers do, of course, measure historical service availability — they need to show performance against existing SLAs. But the published or promised 99.9x availability figures in the SLAs of providers are only loosely based on underlying measurements or engineering calculations. In these contracts, the figure is set to maximize profit for the operator: it needs to be high enough to attract (or not scare away) customers, but low enough to ensure minimum compensation is paid. Since the contracts are in any case weighted to ensure that the amounts paid out are only ever tiny, the incentive is to cite a high number.
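To see why the amounts paid out are so small, it helps to work through a hypothetical credit schedule. The tiers, fee and outage length below are invented for illustration, but the shape is typical: credits are a fraction of the monthly fee for the affected service, not of the business impact.

```python
# Hypothetical SLA credit calculation; tiers, fee and outage are invented.

def sla_credit(monthly_uptime_pct, monthly_fee):
    """Service credit under an assumed tiered schedule."""
    if monthly_uptime_pct >= 99.9:
        return 0.0
    if monthly_uptime_pct >= 99.0:
        return 0.10 * monthly_fee
    return 0.25 * monthly_fee

minutes_in_month = 30 * 24 * 60
outage_minutes = 240                    # a four-hour outage
uptime_pct = 100 * (1 - outage_minutes / minutes_in_month)

monthly_fee = 20_000                    # assumed spend on the affected service
credit = sla_credit(uptime_pct, monthly_fee)
print(f"Uptime {uptime_pct:.2f}% -> credit ${credit:,.0f} against a ${monthly_fee:,} bill")
```

A four-hour outage that may cost the business far more than its entire cloud bill yields, under these assumptions, a credit of a few thousand dollars.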
To be fair, it is not always done this way. Some operators cite clear, performance-based availability over a period of time. But most don’t. Most availability promises in an SLA are market-driven and may change according to market conditions.
How does all this relate to the Uptime Institute Tier rating systems for the data center? Uptime Institute’s Chief Technology Officer Chris Brown explains that there is not a direct relationship between any number of “nines” and a Tier level. Early on, Uptime did publish a paper with some “expected” availability numbers for each Tier level to use as a discussion point, but this is no longer considered relevant. One reason is that it is possible to create a mathematical model to show a data center has a good level of availability (99.999%, for example), while still having multiple single points of failure in its design. Unless this is understood, a big failure is lying in wait. A secondary point is that measuring predicted outages using one aggregated figure might hide the impact of multiple small failures.
Brown (along with members of the Uptime Institute Resiliency Assessment team) believes it can be useful to use a recognized failure methodology, even creating a 99.9x figure. “I like availability number analysis to help determine which of two design choices to use. But I would not stake my career on them,” Brown says. “There is a difference between theory and the real world.” The Tier model assumes that in the real world, failures will occur, and maintenance will be needed.
Where does this leave cloud customers? In our research, we do find that the 99.9x figures give a good first-pass guide to the expected availability of a service. Spanner, Google’s highly resilient distributed database, for example, has an SLA based on 99.999% availability — assuming it is configured correctly. This compares to most database reliability figures of 99.95% or 99.99%. And some SLAs have a higher availability promise if the right level of replication and independent network pathways are deployed.
It is very clear that the industry needs to have a debate about establishing — or re-establishing — a reliable way of reporting true, engineering-based availability numbers. This may come, but not very soon. In the meantime, customers should be cautious — skeptical, even. They should expect failures and model their own likely availability using applicable tools and services to monitor cloud services.
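One way to act on that advice is simply to measure availability yourself: probe the service at a fixed interval and compute observed availability and the longest outage from the results. A minimal sketch, using a synthetic list of probe outcomes in place of real monitoring data:

```python
# Compute observed availability from periodic health-check results.
# The probe results are synthetic; real data would come from a monitoring tool.

probe_interval_s = 60
results = [True] * 1436 + [False] * 4   # one day of minute-level probes, 4 failures

availability = sum(results) / len(results)

# Longest consecutive run of failed probes
longest = current = 0
for ok in results:
    current = 0 if ok else current + 1
    longest = max(longest, current)

print(f"Observed availability: {availability:.4%}; "
      f"longest outage ~{longest * probe_interval_s / 60:.0f} minutes")
```

Run over months rather than a single day, and with probes placed where the users are, this gives a far more honest picture than the figure printed in the SLA.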
Big Tech Regulations: an UPSIDE for Cloud customers?
Regulation of Internet giants has focused so far mostly on data privacy, where concerns are relatively well understood by lawmakers and the general public. At the same time, the threat of antitrust action is growing. Congressional hearings in the US with Amazon, Apple, Facebook and Google have begun, and governments in Europe, India and elsewhere are undertaking or considering similar probes. Regulatory discourse has centered on social media, search, digital advertising and e-commerce.
Yet there is another area for which increased pressure is likely: cloud computing services, which play an increasingly critical role in many industries. Two big players, Amazon and Google, are already being scrutinized by lawmakers. Greater regulatory attention seems inevitable.
Regulating big cloud
Given its dominance, some governments may initially focus on Amazon Web Services (AWS). With almost $26 billion in revenue last year — a 47% annual increase — AWS’ market share is larger than at least two of its biggest competitors combined: Microsoft Azure, its closest rival, and Google Cloud Platform, which is trailing but growing fast. At least one Wall Street analyst has publicly called for Amazon to spin off AWS as a separate business to help avoid regulatory pressure in other areas. As a stand-alone company, AWS could become more focused and business-friendly.
Breaking up big cloud would, however, be both politically controversial and technically difficult. If cloud is a vast experiment on a global scale, then breaking it up would be too.
If applications and services were separated from suppliers’ cloud platforms, would the underlying infrastructure they have built, such as data centers and networks, also be separated? What would happen to customers’ agreements, and how would third-party service providers whose businesses sit on top of cloud platforms (and infrastructure) function?
For the data center sector, there is also the question of what happens to the thousands of long-term agreements that colocation and wholesale providers have with the cloud giants (and which many rely on). These agreements would likely stay in place, but with whom?
Big providers’ valuation would also be problematic; part of their value is their tightly integrated ‘one-stop-shop’ services. And what about their non-cloud service businesses that run in their cloud data centers and over the wide area networks they have built?
Consumer harm?
Any antitrust action, at least in some countries such as the US, would carry a burden of proving consumer harm. On the face of it, consumer harm could appear to be lacking to regulators that — as with other innovations — have just a rudimentary understanding of the market. After all, cloud is characterized by low customer prices, which continue to fall, and competitors such as Oracle are promising services that would cut AWS customers’ bills “in half.”
However, given the scale and reach of cloud computing, some regulators will look more closely and beyond the data privacy and security laws that have already been enacted. The ability of lawmakers to create additional oversight will be complicated by the vast number of multi-faceted services available. The number of cloud services offered by AWS, Google and Microsoft has almost tripled in the past three years to nearly 300. Understanding the capabilities and requirements of AWS’ services (and those of other cloud providers) is so complex it is now a specialized career, including for third-party consultants.
Providers’ pricing structures and billing documentation are also highly complex. There is a standard metric for virtual machines (fixed per hour) and for storage (gigabytes per month) but multiple and varied metrics are billed for server-less, database, load balancer and other services. Providers have online calculators to help with this, but they can leave out critical (and potentially expensive) components such as data transfer, which obfuscates pricing for the buyer. Regulators may require simplified pricing and billing and greater visibility into how services and products are tied together, potentially leading to the decoupling of some service bundles.
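The hidden-cost point is easy to demonstrate with a rough bill estimate. Every unit price below is a placeholder, not any provider's actual rate; the structure (compute per hour, storage per GB-month, egress per GB) is what matters.

```python
# Rough monthly bill sketch showing how data egress can rival compute cost.
# All unit prices are placeholder assumptions, not real provider rates.

hours_per_month = 730

compute = 10 * hours_per_month * 0.10   # 10 VMs at an assumed $0.10/hour
storage = 5_000 * 0.02                  # 5 TB at an assumed $0.02 per GB-month
egress  = 20_000 * 0.09                 # 20 TB transferred out at an assumed $0.09/GB

total = compute + storage + egress
for item, cost in (("compute", compute), ("storage", storage), ("egress", egress)):
    print(f"{item:8s} ${cost:8,.0f}  ({cost / total:.0%} of the bill)")
print(f"{'total':8s} ${total:8,.0f}")
```

In this toy example the egress line, the one most calculators quietly omit, is the largest item on the bill.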
This raises another area for potential oversight: broader and deeper inter-operability among services. Multi-cloud and hybrid IT environments are common, but organizations are often forced to choose a primary cloud platform to manage their development and IT environments effectively. Greater inter-operability among platforms, development tools and services would give customers greater practical choice and lower their risk of vendor lock-in. This could limit the ability of some providers to retain or gain market share.
It could also crimp their pricing strategies for non-cloud products. For example, beginning in October 2019, Microsoft will end its Bring Your Own License (BYOL) model that has enabled users of Microsoft software in their own data centers to migrate those workloads to a dedicated (single-tenant) host in a public cloud using the same on-premises license, without additional cost. Soon, however, Microsoft licenses will be transferable only to dedicated hosts in Microsoft’s Azure cloud. Customers that wish to use a dedicated host in AWS, Google or Alibaba, for example, will have to pay a fee in addition to the standard license cost. Regulators may seek to curb these types of approaches that effectively “tax” users of competitor’s services.
Visibility, please
The most likely scenario, at least in the short term, may not be antitrust but enforced openness — specifically, greater transparency of cloud infrastructure, with regulators insisting on more oversight of the resiliency of critical systems running in the cloud, especially in financial services. As a 2018 US Treasury report noted, bank regulations have not “sufficiently modernized” to accommodate cloud.
Regulators are beginning to look into this issue. In the UK, the Bank of England is investigating large banks’ reliance on cloud as part of a broader risk-reduction initiative for financial digital services. The European Banking Authority specifically states that an outsourcer/cloud operator must allow site inspections of data centers. And in the US, the Federal Reserve has conducted a formal examination of at least one AWS data center, in Virginia, with a focus on its infrastructure resiliency and backup systems. More site visits are expected.
Cloud providers may have to share more granular information about the operational redundancy and fail-over mechanisms of their infrastructure, enabling customers to better assess their risk. Some providers may invest in highly resilient infrastructure for certain services and either accept lower operating profit margins or charge a premium for those services.
The creation of any new laws would take years to play out. Meanwhile, market forces could compel providers to make their services simpler to buy and more inter-operable, and the associated risks more transparent. Some are already moving in this direction, but the threat of regulation, real or perceived, could speed up the development of these types of improvements.
Outage Reporting in Financial Services
In the movie “Mary Poppins,” Mr. Banks sings that a British bank must be run with precision, and that “Tradition, discipline and rules must be the tools.” Otherwise, he warns, “Disorder! Chaos!” will ensue.
One rule, introduced by the UK Financial Conduct Authority (FCA) in 2018, suggests that disorder and chaos might be quite common in the IT departments of many UK banks. The rule made it mandatory for retail banks to report online service disruptions caused by IT problems. The first year’s figures, published this summer, show that banks, on average, suffered an IT outage or security issue nearly every month.
The numbers (shown in the table below) are far higher than any published elsewhere or previously. One bank, Barclays, averaged an incident nearly every ten days. But all the big banks suffered frequent problems. Given this, it is no surprise that the UK Treasury Select Committee has called for action, and the Bank of England is planning on introducing new operational resilience rules for all financial institutions operating in the UK.
Source: UK Financial Conduct Authority
There may be certain mitigating factors relating to the retail banking figures: many of the banks are large and have multiple services, and some of the banks have common ownership or shared systems, so there may be some “double counting” of underlying problems. And the data has grouped security incidents with IT availability problems, which has boosted the overall numbers.
Even so, there is clearly an issue in retail banking/financial services that is similar to the one affecting airlines, which we discussed in our “Creeping Criticality Note” (available to Uptime Institute Network members). The growing demand for 24-hour availability — at ever increasing scale — and the need to support ever more services is running ahead of some companies’ ability to deliver it. This seems to be affecting many financial services organizations. The FCA, which has responsibility for all UK financial services (such as insurance, pensions, etc.), recently said the number of “operational resilience” breaks reported increased to 916 for the year 2018-19 from 229 the year before – a 300% increase.
The FCA’s Director of Supervision Meghan Butler made two important observations when the numbers were published: First, a substantial number of new incidents were caused by IT and software issues (including, we assume, data center power/networking) and not so much by cyber-attacks. She noted the need for better management and processes. And second, the increase, while real, is also to do with the number of incidents being reported.
The last point is critical: Dozens of financial services giants and hundreds of innovative fintech startups call the UK home. Is this a uniquely UK problem? Is IT in this sector particularly bad? There may be — indeed, there are known to be — problems with modernizing huge, legacy infrastructure while in flight. But many US banks, for example, have suffered similar issues.
Undoubtedly, a big proportion of the rise is due to the fact that the FCA now insists that incidents are reported. Outage reporting across almost all industries is at best ad hoc — notwithstanding confidential reporting systems run by the Uptime Institute Network (the Abnormal Incident Report, or AIR, database) and the Data Center Incident Reporting Network (DCIRN), an independent body set up in 2017 by veteran data center designer Ed Ansett. Even allowing for media reports, many also tracked by Uptime, the vast majority of outages, even at major firms, are never reported to anyone, so the lessons cannot be learned. While outages at consumer-facing internet services can be spotted by web-based outage reporting services using software tools, often there are no real details other than data that points to a loss or slowdown in service.
It has often been suggested that for lessons to be shared, mandatory reporting of outages — in detail — is needed, at least by large or important organizations or services. The FCA’s data supports the view that many outages are hidden. Now that they have surfaced, the next question is: how can the lessons of these failures be more openly and widely shared?
More information on this topic is available to members of the Uptime Institute Network, which can be found here.
Top-10 Digital Infrastructure Trends for 2020
/in Executive, News/by Andy Lawrence, Executive Director of Research, Uptime Institute, alawrence@uptimeinstitute.comThe digital infrastructure industry continues to grow and change at a striking price. Across the world, a thriving community of investors, designers, owners and operators are grappling with many of the same issues: resiliency and risk, the impact of cloud, the move to the edge, rapid innovation, and unpredictable (although mostly upward) demand. What should stakeholders in this industry expect in 2020? Which innovations will make a difference — and which have been exaggerated? Is the challenge of distributed resiliency being solved, or is it getting worse? What are regulators likely to do?
Ten data center industry trends in 2020
The Top-10 trends the Uptime Institute Intelligence team has identified show an industry that is confidently expanding toward the edge, that is attractive to many new investors, and that is embracing new technologies and architectures — but one that is also running against some headwinds. Resiliency concerns, premature expectations about the impact of 5G, climate change, environmental impact and increasingly watchful regulators are among the hazards that must be successfully navigated.
So without further ado, here are the Top-10 Trends for 2020…
#1: Outages drive authorities and businesses to act
Big IT outages are occurring with growing regularity, many with severe consequences. Executives, industry authorities and governments alike are responding with more rules, calls for more transparency and a more formal approach to end-to-end, holistic resiliency.
#2: The internet tilts toward the edge
In the coming years, significant data will be generated by many more things and much more will be processed away from the core, especially in regional data centers. Many different types of data centers and networking approaches will be needed.
#3: Data center energy use goes up and up
Energy use by data centers and IT will continue to rise, putting pressure on energy infrastructure and raising questions about carbon emissions. The drivers for more energy use are simply too great to be offset by efficiency gains.
#4: Capital inflow boosts the data center market
Data centers are no longer a niche or exotic investment among mainstream institutional buyers, which are swarming to the sector. New types of capital investors, with deep pockets and long return timelines, could boost the sector overall.
#5: More data, more automated data centers
Many managers are wary of handing key decisions and operations to machines or outside programmers. But recent advances, including the broad adoption of data center infrastructure management systems and the introduction of artificial intelligence-driven cloud services, have made this much more likely. The case for more automation will become increasingly compelling.
#6: Data centers without generators: More pilots, more deployments
Most big data centers cannot contemplate operating without generators, but there is a strong drive to do so. Technological alternatives are improving, and the number of good use cases is proliferating. The next 24 months are likely to see more pilots and deployments.
#7: Pay-as-you-go model spreads to critical components
As enterprises continue to move from a focus on capital expenditures to operating expenditures, more critical infrastructure services and components — from backup energy and software to data center capacity — will be consumed on a pay-as-you-go, “as a service” basis.
#8: Micro data centers: An explosion in demand, in slow motion
The surge in demand for micro data centers will be real, and it will be strong — but it will take time to arrive in force. Many of the economic and technical drivers are not yet mature; and 5G, one of the key underlying catalysts, is in its infancy. Demand will grow faster from 2022.
#9: Staffing shortages are systemic and worsening
The data center sector’s staffing problem is systemic and long term, and employers will continue to struggle with talent shortages and growing recruitment costs. To solve the crisis, more investment will be needed from industry and educators.
#10: Climate change spurs data center regulations
Climate change awareness is growing, and attitudes are hardening. Although major industry players are acting, legislators, lobbyists and the public are pressing for more. More regulations are on the way, addressing energy efficiency, renewable energy and waste reduction.
So what should YOU do as part of your 2020 planning process? Think about your own needs in the terms of business requirements and embrace the undeniable fact that the Digital Infrastructure world around you *IS* changing. In other words, start with crafting a TOP-DOWN understanding of what the business needs from IT, and then chart yourself a path to embrace the trends that are taking hold across the planet. As a general rule, if you are building and/or operating your computing function the Same way you did 10 years ago, then you are probably sub-optimized, inefficient and incurring significant higher costs and risks compared to those that are proactively embracing new ideas and approaches. As always, challenge yourself to compare your existing structures to what a ‘clean-slate’ approachg might yield, and then strive to more forward.
Want the WHOLE report with all the DETAIL? You can get it here.
Lithium Ion Batteries for the data center. Are they ready for production yet?
/in Design/by Chris Brown, Chief Technical Officer, Uptime InstituteWe are often asked for our thoughts about the use of lithium-ion (Li-ion) batteries in data center uninterruptible power supply (UPS) systems. This is a relatively new, evolving technology, and there are still a lot of unknowns. In our business, the unknown makes us nervous and uneasy — as it should, because we are tasked with providing uninterrupted power and cooling to IT assets for the benefit of our clients. They trust us to help them understand risk and innovation and help them balance the two. That makes us less likely to embrace newer technologies until we know what the implications would be in a production mission critical style environment. As a general rule, our experience shows that the least business risk is associated with proven technologies that are tried and tested and have a demonstrable track record and performance profile.
It’s true, all the failure modes are not fully understood where Li-ion is concerned; they’ll only be known when we see a larger installation base and actual operational performance. Tales of thermal runaway in Li-ion installations give justifiable concern, but any technology will fail if stressed beyond its limits. It’s worth considering the real-world conditions under which UPS systems are used.
The charge/discharge cycle on most UPS systems is not very demanding. UPS systems are not often required to transition to batteries, and even when they do, the time is usually short — worst case, 15 minutes — before power is restored either by the utility or the engine generator system. Under normal circumstances the batteries are on a float charge and when called upon to provide power, the amount of power they source is a fraction of the total design capacity. Therefore, as a general rule, UPS batteries are not stressed: it’s typically one discharge, then a recharge. In my experience, batteries handle that just fine — it’s the repeated discharge then recharge that causes issues.
Li-ion batteries monitor the cell condition in the battery itself, which helps users avoid problems. If you look at battery design, thermal runaway is usually caused by a charging system that malfunctions and does not reduce the charging current appropriately. Either that, or the battery itself is physically damaged.
Although thermal runaways are possible with any battery, Li-ion has a shorter track record in data centers than vented lead-acid (VLA) or valve-regulated lead-acid (VRLA) batteries. For that reason, I would not be excited about putting Li-Ion batteries in the data hall but would instead keep them in purpose-built battery rooms until we fully understand the failure modes. (See Uptime Intelligence’s Note 8, which discusses the National Fire Protection Association’s proposed standard on space requirements and siting of energy storage technology.)
With that said, because UPS batteries are usually not stressed and as long as the batteries and recharging system are functioning properly, we don’t anticipate seeing the Li-ion failures that have been seen in other industries. While I don’t think there is enough data to know for certain how long the batteries will last in relation to VRLA batteries, I think there is enough history for data center owners and operators to start to consider the technology, as long as the advantages of Li-ion are weighed against the installation, maintenance and operations costs (or savings) to see if it makes sense in a given data center.
So what are the advantages of Li-ion (as compared to VLA or VRLA) batteries? First, the power density of Li-ion technology exceeds that of VLA or VRLA, so Li-ion batteries deliver more power from a smaller footprint. Second, Li-ion technology allows for more charge/discharge cycles without degrading the battery. All batteries degrade with repeated cycling, but VLA and VRLA batteries need to be replaced when they reach about 80% of original capacity because after that point, the remaining capacity falls off very quickly. In comparison, Li-ion batteries lose capacity gradually and predictably. Finally, suppliers claim that, despite their higher initial cost, Li-ion batteries have a lower total cost of ownership than VLA or VRLA.
Data center owners/operators who are considering replacing the existing battery string with Li-ion should first verify if the installed UPS system will operate property with Li-ion batteries — the charging circuit and controls for Li-ion differ from VLA or VRLA. If the UPS system is compatible with Li-ion technology, the next step is to look at other factors (performance, siting, costs, etc.). Perform a cost vs. benefit analysis; if there’s a good case to use Li-ion, consider a small test installation to see if the technology performs as expected. This process should not only confirm whether the business case is supported but also help address the (very human and completely appropriate) skepticism of new technology.
In my opinion the current information is promising. These Lithium-Ion batteries are used in many industries more demanding than data centers, sufficient to indicate that Li-ion technology is not a passing fad. And manufacturers are working with different compositions of batteries to improve their performance and stability, so the technology is improving over time. But all factors must be weighed fully, as the cost of Li-ion batteries is significant, and all of the claims cannot be completely substantiated with long-term data. The applicability of any technology must be evaluated on a case-by-case basis — what makes sense (cost and risk) for one data center may not for another.
Non-production IT can hinder mission-critical operations
/in Executive, Operations/by Kevin HeslinSeparating production and non-production assets should be an operational requirement for most organizations. By definition, production assets support high-priority IT loads — servers that are critical to a business or business unit. In most organizations, IT will have sufficient discretion to place these assets where they have redundant power supplies, sufficient cooling and high levels of security. Other assets can be placed elsewhere, preserving the most important infrastructure for the most important loads. However, business requirements sometimes require IT organizations to operate both production (mission critical) and non-production environments in the same facility.
In these instances, facility managers must be careful to prevent the spread of non-production IT, such as email, human resources, telephone and building controls, into expensive mission-critical spaces. While non-production IT generally does not increase risk to mission-critical IT during normal operations, mixing production and production environments will reduce mission-critical capacity and can make it more difficult to shed load.
Keeping production and non-production IT separate:
Our report Planning for mission-critical IT in mixed-use facilities (available to Uptime Institute Network community members) discusses how operating a data center in a mixed-use facility can be advantageous to the organization and even to the IT department, but can also introduce significant risk. Establishing and enforcing budget and access policies is critical in these circumstances; the entire organization must understand and follow policies limiting access to the white space.
Organizations do not need to maintain separate budgets and facilities staff for production and non-production operations — they’re accustomed to managing both; both are clearly IT functions. But the similarities between production and non-production IT does not mean that these assets should share circuits — or even racks. The presence of non-production IT gear in a mission-critical white space increases operational risk, and the less critical gear reduces the availability of mission-critical resources, such as space, cooling or power. The infrastructure required to meet the demands of mission-critical IT is expensive to build and operate and should not be used for less critical loads.
Separating the two classes of assets makes it easier for IT to manage assets and space, as well as reduce demands on generator and uninterruptible power supply capacity, especially in the event of an incident. Similarly, keeping the assets separate makes it easier for operations to shed load, if necessary.
Limiting the use of mission-critical infrastructure to production workloads can help organizations defer expansion plans. In addition, it makes it easier to limit access to mission-critical spaces to qualified personnel, while still allowing owners of non-production gear to retain access to their equipment.
However, not all companies can completely separate production and non-production loads. Other solutions include designing certain areas within the data center strictly for noncritical loads and treating those spaces differently from the rest of the facility. This arrangement takes a lot of rigor to manage and maintain, especially when the two types of spaces are in close proximity. For example, non-production IT can utilize single-corded devices, but these should be fed by dedicated power distribution units (PDUs), with dual-corded loads also served by dedicated PDUs. But mixing those servers and PDUs in a shared space creates opportunities for human error when adding or moving servers.
For this reason and more, the greater the separation between production and non-production assets, the easier it becomes for IT to differentiate between them, allocate efforts and resources appropriately, and reduce operational risk.
The full report Planning for mission-critical IT in mixed-use facilities is available to members of the Uptime Institute Network. Guest Registration is easy and can be started here.
99 Red Flags
/in Executive, Operations/by Andy Lawrence, Executive Director of Research, Uptime Institute, alawrence@uptimeinstitute.comOne of the most widely cited metrics in the IT industry is for availability, expressed in the form of a number of nines: three nines for 99.9% availability (minutes of downtime per year), extending to six nines — 99.9999% — or even, very rarely, seven nines. What this should mean in practice is show in the table below:
The metric is very widely used (more so in the context of IT equipment than facilities) and is commonly, if not routinely, cited in cloud services and colocation contracts. Its use is rarely questioned. Contrast this with the power usage effectiveness (PUE) metric, which has been the subject of a decade of industry-wide debate and is treated with such suspicion that many dismiss it out of hand.
So, let us start (or re-open) the debate: To what degree should customers of IT services, and particularly cloud services, pay attention to the availability metric — the availability promise — that almost all providers embed in their service level agreement (SLA) – the 99.99% or 99.999% number? (The SLA creates a baseline below which compensation will be paid or credits applied.)
In speaking to multiple colleagues at Uptime Institute, there seems to be a consensus: Treat this number, and any SLAs that use this number, with extreme caution. And the reasons why are not so dissimilar from the reasons why PUE is so maligned: the metric is very useful for certain specific purposes but it is used casually, in many different ways, and without scientific rigor. It is routinely misapplied as well (sometimes with a clear intention to distort or mislead).
Here are some of the things that these experts are concerned about. First, the “nines” number, unless clearly qualified, is neither a forward-looking metric (it doesn’t predict availability), nor a backward-looking number (it doesn’t say how a service has performed); usually a time period is not stated. Rather, it is an engineering calculation based on the likely reliability of each individual component in the system, based on earlier tests or manufacturer promises. (The method is rooted in longstanding methodologies for measuring the reliability of dedicated and well-understood hardware equipment, such as aircraft components or machine parts.)
This is where the trouble starts. Complex systems based on multiple parts working together or in which use cases and conditions can change frequently are not so easily modeled in this way. This is especially true of software, where changes are frequent. To look backward with accuracy requires genuine, measured and stable data of all the parts working together; to look forward requires an understanding of what changes will be made, by whom, what the conditions will be when the changes are made, and with what impact. Add to this the fact that many failures are due to unpredictable configuration and/or operation and management failures, and the value of the “nines” metric becomes further diluted.
But it gets worse: the role of downtime due to maintenance is often not covered or is not clearly separated out. More importantly, the definition of downtime is either not made clear or it varies according to the service and the provider. There are often — and we stress the word “often” — situations in modern, hybrid services where a service has slowed to a nearly non-functional crawl, but the individual applications are all still considered to be “up.” Intermittent errors are even worse — the service can theoretically stop working for a few seconds a day at crucial times yet be considered well within published availability numbers.
The providers do, of course, measure historical service availability — they need to show performance against existing SLAs. But the published or promised 99.9x availability figures in the SLAs of providers are only loosely based on underlying measurements or engineering calculations. In these contracts, the figure is set to maximize profit for the operator: it needs to be high enough to attract (or not scare away) customers, but low enough to ensure minimum compensation is paid. Since the contracts are in any case weighted to ensure that the amounts paid out are only ever tiny, the incentive is to cite a high number.
To be fair, it is not always done this way. Some operators cite clear, performance-based availability over a period of time. But most don’t. Most availability promises in an SLA are market-driven and may change according to market conditions.
How does all this relate to the Uptime Institute Tier rating systems for the data center? Uptime Institute’s Chief Technology Officer Chris Brown explains that there is not a direct relationship between any number of “nines” and a Tier level. Early on, Uptime did publish a paper with some “expected” availability numbers for each Tier level to use as a discussion point, but this is no longer considered relevant. One reason is that it is possible to create a mathematical model to show a data center has a good level of availability (99.999%, for example), while still having multiple single points of failure in its design. Unless this is understood, a big failure is lying in wait. A secondary point is that measuring predicted outages using one aggregated figure might hide the impact of multiple small failures.
Brown (along with members of the Uptime Institute Resiliency Assessment team) believes it can be useful to use a recognized failure methodology, even creating a 99.9x figure. “I like availability number analysis to help determine which of two design choices to use. But I would not stake my career on them,” Brown says. “There is a difference between theory and the real world.” The Tier model assumes that in the real world, failures will occur, and maintenance will be needed.
Where does this leave cloud customers? In our research, we do find that the 99.9x figures give a good first-pass guide to the expected availability of a service. Spanner, Google’s highly resilient distributed database, for example, has an SLA based on 99.999% availability — assuming it is configured correctly. This compares to most database reliability figures of 99.95% or 99.99%. And some SLAs have a higher availability promise if the right level of replication and independent network pathways are deployed.
It is very clear that the industry needs to have a debate about establishing — or re-establishing — a reliable way of reporting true, engineering-based availability numbers. This may come, but not very soon. In the meantime, customers should be cautious — skeptical, even. They should expect failures and model their own likely availability using applicable tools and services to monitor cloud services.
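As a rough illustration of that kind of first-pass modeling, the sketch below (Python, using hypothetical component availabilities rather than measured figures) combines published “nines” for individual components in series and in parallel. It also illustrates the caveat above: redundancy in one component adds little if a single shared dependency remains in the chain.

```python
# Minimal sketch of first-pass availability modeling for a composite
# cloud service. All component figures and the topology are hypothetical;
# published SLA numbers are a starting point, not measured reality.

def series(*components):
    """All components must be up for the service to be up."""
    availability = 1.0
    for c in components:
        availability *= c
    return availability

def parallel(*replicas):
    """The service is up if at least one independent replica is up."""
    unavailability = 1.0
    for r in replicas:
        unavailability *= (1 - r)
    return 1 - unavailability

front_end = 0.9995
database = parallel(0.9995, 0.9995)  # two independent replicas
network = 0.9999                     # a single, shared network path

# Redundancy lifts the database well above five nines, but the shared
# network path still caps the whole chain at roughly its own availability.
print(f"Modeled service availability: {series(front_end, database, network):.5f}")
```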
Big Tech Regulations: An Upside for Cloud Customers?
By Rhonda Ascierto, Vice President, Research, Uptime Institute

Regulation of Internet giants has focused so far mostly on data privacy, where concerns are relatively well understood by lawmakers and the general public. At the same time, the threat of antitrust action is growing. Congressional hearings in the US with Amazon, Apple, Facebook and Google have begun, and governments in Europe, India and elsewhere are undertaking or considering similar probes. Regulatory discourse has centered on social media, search, digital advertising and e-commerce.
Yet there is another area for which increased pressure is likely: cloud computing services, which play an increasingly critical role in many industries. Two big players, Amazon and Google, are already being scrutinized by lawmakers. Greater regulatory attention seems inevitable.
Regulating big cloud
Given its dominance, some governments may initially focus on Amazon Web Services (AWS). With almost $26 billion in revenue last year — a 47% annual increase — AWS’ market share is larger than the combined share of its two biggest competitors: Microsoft Azure, its closest rival, and Google Cloud Platform, which trails but is growing fast. At least one Wall Street analyst has publicly called for Amazon to spin off AWS as a separate business to help avoid regulatory pressure in other areas. As a stand-alone company, AWS could become more focused and business-friendly.
Breaking up big cloud would, however, be both politically controversial and technically difficult. If cloud is a vast experiment on a global scale, then breaking it up would be too.
If applications and services were separated from suppliers’ cloud platforms, would the underlying infrastructure they have built, such as data centers and networks, also be separated? What would happen to customers’ agreements, and how would third-party service providers whose businesses sit on top of cloud platforms (and infrastructure) function?
For the data center sector, there is also the question of what happens to the thousands of long-term agreements that colocation and wholesale providers have with the cloud giants (and which many rely on). These agreements would likely stay in place, but with whom?
Big providers’ valuation would also be problematic; part of their value is their tightly integrated ‘one-stop-shop’ services. And what about their non-cloud service businesses that run in their cloud data centers and over the wide area networks they have built?
Consumer harm?
Any antitrust action, at least in some countries such as the US, would carry the burden of demonstrating consumer harm. On the face of it, consumer harm could appear to be lacking to regulators that — as with other innovations — have just a rudimentary understanding of the market. After all, cloud is characterized by low customer prices, which continue to fall, and competitors such as Oracle are promising services that would cut AWS customers’ bills “in half.”
However, given the scale and reach of cloud computing, some regulators will look more closely and beyond the data privacy and security laws that have already been enacted. The ability of lawmakers to create additional oversight will be complicated by the vast number of multi-faceted services available. The number of cloud services offered by AWS, Google and Microsoft has almost tripled in the past three years to nearly 300. Understanding the capabilities and requirements of AWS’ services (and those of other cloud providers) is so complex it is now a specialized career, including for third-party consultants.
Providers’ pricing structures and billing documentation are also highly complex. There are standard metrics for virtual machines (a price per hour) and for storage (gigabytes per month), but serverless, database, load balancer and other services are billed using multiple and varied metrics. Providers offer online calculators to help with this, but the calculators can leave out critical (and potentially expensive) components such as data transfer, which obfuscates pricing for the buyer. Regulators may require simplified pricing and billing and greater visibility into how services and products are tied together, potentially leading to the decoupling of some service bundles.
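A simple worked example illustrates the problem. The sketch below uses purely hypothetical unit prices, not any provider’s published rates, to show how an estimate that omits data transfer (egress) can materially understate a monthly bill.

```python
# Hypothetical unit prices for illustration only; real provider pricing
# varies by region, service tier and commitment level.

HOURS_PER_MONTH = 730

vm_rate_per_hour = 0.10        # $ per virtual machine hour (hypothetical)
storage_rate_gb_month = 0.023  # $ per GB-month of storage (hypothetical)
egress_rate_gb = 0.09          # $ per GB transferred out (hypothetical)

vms, storage_gb, egress_gb = 4, 500, 2_000

compute = vms * vm_rate_per_hour * HOURS_PER_MONTH
storage = storage_gb * storage_rate_gb_month
egress = egress_gb * egress_rate_gb

print(f"Estimate without data transfer: ${compute + storage:,.2f}/month")
print(f"Estimate with data transfer:    ${compute + storage + egress:,.2f}/month")
```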
This raises another area for potential oversight: broader and deeper inter-operability among services. Multi-cloud and hybrid IT environments are common, but organizations are often forced to choose a primary cloud platform to manage their development and IT environments effectively. Greater inter-operability among platforms, development tools and services would give customers greater practical choice and lower their risk of vendor lock-in. This could limit the ability of some providers to retain or gain market share.
It could also crimp their pricing strategies for non-cloud products. For example, beginning in October 2019, Microsoft will end its Bring Your Own License (BYOL) model, which has enabled users of Microsoft software in their own data centers to migrate those workloads to a dedicated (single-tenant) host in a public cloud using the same on-premises license, without additional cost. Soon, however, Microsoft licenses will be transferable only to dedicated hosts in Microsoft’s Azure cloud. Customers that wish to use a dedicated host in AWS, Google or Alibaba, for example, will have to pay a fee in addition to the standard license cost. Regulators may seek to curb these types of approaches, which effectively “tax” users of competitors’ services.
Visibility, please
The most likely scenario, at least in the short term, may not be antitrust but enforced openness — specifically, greater transparency of cloud infrastructure, with regulators insisting on more oversight of the resiliency of critical systems running in the cloud, especially in financial services. As a 2018 US Treasury report noted, bank regulations have not “sufficiently modernized” to accommodate cloud.
Regulators are beginning to look into this issue. In the UK, the Bank of England is investigating large banks’ reliance on cloud as part of a broader risk-reduction initiative for financial digital services. The European Banking Authority specifically states that an outsourcer/cloud operator must allow site inspections of data centers. And in the US, the Federal Reserve has conducted a formal examination of at least one AWS data center, in Virginia, with a focus on its infrastructure resiliency and backup systems. More site visits are expected.
Cloud providers may have to share more granular information about the operational redundancy and fail-over mechanisms of their infrastructure, enabling customers to better assess their risk. Some providers may invest in highly resilient infrastructure for certain services and either accept lower operating profit margins or charge a premium for those services.
The creation of any new laws would take years to play out. Meanwhile, market forces could compel providers to make their services simpler to buy and more inter-operable, and the associated risks more transparent. Some are already moving in this direction, but the threat of regulation, real or perceived, could speed up the development of these types of improvements.
Outage Reporting in Financial Services
By Andy Lawrence, Executive Director of Research, Uptime Institute (alawrence@uptimeinstitute.com)

In the movie “Mary Poppins,” Mr. Banks sings that a British bank must be run with precision, and that “Tradition, discipline and rules must be the tools.” Otherwise, he warns, “Disorder! Chaos!” will ensue.
One rule, introduced by the UK Financial Conduct Authority (FCA) in 2018, suggests that disorder and chaos might be quite common in the IT departments of many UK banks. The rule requires retail banks to report online service disruptions caused by IT problems. The first year’s figures, published this summer, show that banks, on average, suffered an IT outage or security issue nearly every month.
The numbers (shown in the table below) are far higher than any published elsewhere or previously. One bank, Barclays, averaged an incident nearly every ten days. But all the big banks suffered frequent problems. Given this, it is no surprise that the UK Treasury Select Committee has called for action, and the Bank of England is planning on introducing new operational resilience rules for all financial institutions operating in the UK.
There may be certain mitigating factors relating to the retail banking figures: many of the banks are large and have multiple services, and some of the banks have common ownership or shared systems, so there may be some “double counting” of underlying problems. And the data has grouped security incidents with IT availability problems, which has boosted the overall numbers.
Even so, there is clearly an issue in retail banking/financial services that is similar to the one affecting airlines, which we discussed in our “Creeping Criticality Note” (available to Uptime Institute Network members). The growing demand for 24-hour availability — at ever increasing scale — and the need to support ever more services is running ahead of some companies’ ability to deliver it. This seems to be affecting many financial services organizations. The FCA, which has responsibility for all UK financial services (such as insurance, pensions, etc.), recently said the number of “operational resilience” breaks reported increased to 916 for the year 2018-19 from 229 the year before – a 300% increase.
The FCA’s Director of Supervision, Megan Butler, made two important observations when the numbers were published. First, a substantial number of new incidents were caused by IT and software issues (including, we assume, data center power/networking) and not so much by cyber-attacks; she noted the need for better management and processes. Second, the increase, while real, also reflects the fact that more incidents are now being reported.
The last point is critical. Dozens of financial services giants and hundreds of innovative fintech startups call the UK home, so is this a uniquely UK problem? Is IT in this sector particularly bad? There may be — indeed, there are known to be — problems with modernizing huge legacy infrastructure while in flight. But many US banks, for example, have suffered similar issues.
Undoubtedly, a big proportion of the rise is due to the fact that the FCA now insists that incidents are reported. Outage reporting across almost all industries is at best ad hoc — notwithstanding confidential reporting systems run by the Uptime Institute Network (the Abnormal Incident Report, or AIR, database) and the Data Center Incident Reporting Network (DCIRN), an independent body set up in 2017 by veteran data center designer Ed Ansett. Even allowing for media reports, many of which are also tracked by Uptime, the vast majority of outages, even at major firms, are never reported to anyone, so the lessons cannot be learned. And while outages at consumer-facing internet services can be spotted by web-based outage reporting services using software tools, there are often no real details other than data that points to a loss or slowdown in service.
It has often been suggested that for lessons to be shared, mandatory reporting of outages — in detail — is needed, at least by large or important organizations or services. The FCA’s data supports the view that many outages are hidden. Now that they have surfaced, the next question is: How can the lessons of these failures be shared more openly and widely?
———————————————————————————————————————————–
More information on this topic is available to members of the Uptime Institute Network, which can be found here.