As enterprises continue to move from a focus on capital expenditures to operating expenditures, more data center components will also be consumed on a pay-as-you-go, “as a service” basis.
“-aaS” goes mainstream
The trend toward everything “as a service” (XaaS) is now mainstream in IT, ranging from cloud (infrastructure-aaS) and software-aaS (SaaS) to newer offerings, such as bare metal-aaS, container-aaS, and artificial intelligence-aaS (AI-aaS). At the IT level, service providers are winning over more clients to the service-based approach by reducing capital expenditures (capex) in favor of operational expenditures (opex), by offering better products, and by investing heavily to improve security and compliance. More organizations are now willing to trust them.
But this change is not confined to IT: a similar trend is underway in data centers.
Why buy and not build?
While the cost to build new data centers is generally falling, driven partly by the availability of more prefabricated components, enterprise operators have been increasingly competing against lower-cost options to host their IT — notably colocation, cloud and SaaS.
Cost is rarely the biggest motivation for moving to cloud, but it is a factor. Large cloud providers continue to build and operate data centers at scale and enjoy the proportional cost savings as well as the fruits of intense value engineering. They also spread costs among customers and tend to have much higher utilization rates compared with other data centers. And, of course, they invest in innovative, leading-edge IT tools that can be rolled out almost instantly. This all adds up to ever-improving IT and infrastructure services from cloud providers that are cheaper (and often better) than using or developing equivalent services based in a smaller-scale enterprise data center.
Many organizations have now come to view data center ownership as a big capital risk, one that only some want to take. Even when it’s cheaper to deliver IT from their own “on-premises” data center, the risks of early data center obsolescence, under-utilization, technical noncompliance or unexpected technological or local problems are all factors. And, of course, most businesses want to avoid a big capital outlay: Our research shows that, in 2017, the total cost of ownership of an “average” concurrently maintainable 3 megawatt (MW) enterprise data center amortized over 15 years was about $90 million, and that roughly half of the cost is invested in three installments over the first six years, assuming a typical phased build and bricks-and-mortar construction.
This represents a significant amount of risk. To be economically viable, the enterprise must typically operate a facility at a high level of utilization — yet forecasting future data center capacity remains enterprises’ top challenge, according to our research.
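The arithmetic behind that risk can be sketched in a few lines. This is an illustrative calculation only: the $90 million, 3 MW and 15-year figures come from the text, while the equal split of the early installments and the `cost_per_used_kw_year` helper are assumptions introduced here.

```python
# Figures quoted in the text above.
TOTAL_TCO_USD = 90_000_000   # 15-year total cost of ownership
CAPACITY_KW = 3_000          # 3 MW facility
AMORTIZATION_YEARS = 15
EARLY_SHARE = 0.5            # "roughly half" is invested early
EARLY_INSTALLMENTS = 3       # spread over the first six years

# Assumption: the three early installments are equal in size.
early_capital = TOTAL_TCO_USD * EARLY_SHARE
per_installment = early_capital / EARLY_INSTALLMENTS

cost_per_kw = TOTAL_TCO_USD / CAPACITY_KW
cost_per_kw_year = cost_per_kw / AMORTIZATION_YEARS


def cost_per_used_kw_year(utilization: float) -> float:
    """Annualized cost per kW actually used, for a utilization in (0, 1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return cost_per_kw_year / utilization


print(f"Each early installment: ${per_installment:,.0f}")
print(f"Cost per kW-year at full use: ${cost_per_kw_year:,.0f}")
print(f"Cost per kW-year at 50% use: ${cost_per_used_kw_year(0.5):,.0f}")
```

Under these assumptions, halving utilization doubles the effective cost of every kilowatt in use, which is why capacity forecasting matters so much to the business case.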
Demand for enterprise data centers remains sizable, in spite of the alternatives. Many enterprises with smaller data centers are closing them and consolidating into premium, often larger, centralized data centers and outsourcing as much else as possible.
The appeal of the cloud will continue to convince executives and drive strategy. Increasingly, public cloud is an alternative way to deliver workloads faster and cheaper without having to build additional on-premises capacity. Scalability, portability, reduced risk, better tools, high levels of resiliency, infrastructure avoidance and fewer staff requirements are other key drivers for cloud adoption. Innovation and access to leading-edge IT will likely be bigger factors in the future, as will more cloud-first remits from upper management.
Colocation, including sale leasebacks
Although rarely thought of in this way, colocation is the most widely used “data center-aaS” offering today. Sale and leaseback of data centers by enterprises to colos is also becoming more common, a trend that will continue to build (see UII Note 38: Capital inflow boosts the data center market).
Colo interconnection services will attract even more businesses. More will likely seek to lease space in the same facility as their cloud or other third-party service provider, enabling lower latency, lower costs and more security for third-party services, such as storage-aaS and disaster recovery-aaS.
While more enterprise IT is moving to colos and managed services (whether or not it is cloud), enterprise data centers will not disappear. More than 600 IT and data center managers told Uptime Institute that, in 2021, about half of all workloads will still be in enterprise data centers, and only 18% of workloads in public cloud/SaaS.
Other “as a service” trends in data centers
Data center monitoring and analysis is another relatively new example of a pay-as-you-go service. Introduced in late 2016, data center management as a service is a big data-driven cloud service that provides customized analysis and is paid for on a recurring basis. The move to a pay-as-you-go service has helped unlock the data center infrastructure management market, which was struggling for growth because of costs and complexity.
Energy backup and generation is another area to watch. Suppliers have introduced various pay-as-you-go models for their equipment. These include leased fuel cells owned by the supplier (notably Bloom Energy), which charges customers only for the energy produced. By eliminating the client’s risk and capital outlay, this model can make the supplier’s sale easier (although the supplier has to wait to be paid). Some suppliers have ventured into UPS-aaS, but with limited success to date.
More alternatives to ownership are likely for data center electrical assets, such as batteries. Given the high and fast rate of innovation in the technology, leasing large-scale battery installations delivers the capacity and innovation benefits without the risks.
It’s also likely that more large data centers will use energy service companies (ESCOs) to produce, manage and deliver energy from renewable microgrids. Demand for green energy, for energy security (that is, energy produced off-grid) and energy-price stability is growing; ESCOs can deliver all this for dedicated customers that sign long-term energy-purchase agreements but don’t have the capital required to build or the expertise necessary to run a green microgrid.
Demand for enterprise data centers will continue but alongside the use of more cloud and more colo. More will be consumed “as a service,” ranging from data center monitoring to renewable energy from nearby dedicated microgrids.
The full report Ten data center industry trends in 2020 is available to members of the Uptime Institute Network. Membership information can be found here.
Pay-as-you-go model spreads to critical components. Rhonda Ascierto, Vice President, Research, Uptime Institute. 2020-03-02.
Despite years of discussion, warnings and strict regulations in some countries, hot work remains a contentious issue in the data center industry. Hot work is the practice of working on energized electrical circuits (voltage limits differ regionally). It is usually done, in spite of the risks, to reduce the possibility of a downtime incident during maintenance.
Uptime Institute advises against hot work in almost all instances. The safety concerns are just too great, and data suggests work on energized circuits may — at best — only reduce the number of manageable incidents, while increasing the risk of arc flash and other events that damage expensive equipment and may lead to an outage or injury. In addition, concurrently maintainable or fault tolerant designs as described in Uptime Institute’s Tier Standard make hot work unnecessary.
The pressure against hot work continues to mount. In the US, electrical contractors have begun to decline some work that involves working on energized circuits, even if an energized work permit has been created and signed by appropriate management, as required by National Fire Protection Association (NFPA) 70E (Standard for Electrical Safety in the Workplace). In addition, the US Department of Labor’s Occupational Safety and Health Administration (OSHA) has repeatedly rejected business continuity as an exception to hot work restrictions, making it harder for management to justify hot work and to find executives willing to sign the energized work permit.
OSHA statistics make clear that work on energized systems is a dangerous practice, especially for construction trades workers; installation, maintenance, and repair occupations; and grounds maintenance workers. For this reason, NFPA 70E sharply limits the situations in which organizations are allowed to work on energized equipment. Personnel safety is not the only issue; personal protective equipment (PPE) protects only workers, not equipment, so an arc flash can destroy many thousands of dollars of IT gear.
Ignoring local and national standards can be costly, too. OSHA reported 2,923 lockout/tagout and 1,528 PPE violations in 2017, among the many safety concerns it addressed that year. New minimum penalties for a single violation exceed $13,000, with top total fines for numerous, willful and repeated violations running into the millions of dollars. Wrongful death and injury suits add to the cost, and violations can lead to higher insurance premiums, too.
Participants in a recent Uptime Institute discussion roundtable agreed that the remaining firms performing work on live loads should begin preparing to end the practice. They said that senior management is often the biggest impediment to ending hot work, at least at some organizations, despite the well-known and documented risks. Executive resistance can be tied to concerns about power supplies or failure to maintain independent A/B feeds. In some cases, service level agreements contain restrictions against powering down equipment.
Despite executive resistance at some companies, the trend is clearly against hot work. By 2015, more than two-thirds of facilities operators had already eliminated the practice, according to Uptime Institute data. A tighter regulatory environment, heightened safety concerns, increased financial risk and improved equipment should combine to all but eliminate hot work in the near future. But there are still holdouts, and the practice is far more acceptable in some countries — China is an example — than in others, such as the US, where NFPA 70E severely limits the practice in all industries.
Also, hot work does not eliminate IT failure risk. Uptime Institute has been tracking data center abnormal incidents for more than 20 years, and the data shows that at least 71 failures occurred during hot work. While these failures are generally attributed to poor procedures or maintenance, a recent, more careful analysis concluded that better procedures or maintenance (or both) would have made it possible to perform the work safely, and without any failures, on de-energized systems.
The Uptime Institute abnormal incident database includes only four injury reports; all occurred during work on energized systems. In addition, the database includes 16 reports of arc flash. One occurred during normal preventive maintenance and one during an infrared scan. Neither caused injury, but the potential risk to personnel is apparent, as is the potential for equipment damage (and legal exposure).
Undoubtedly, eliminating hot work is a difficult process. One large retailer that has just begun the process expects the transition to take several years. And not all organizations succeed: Uptime Institute is aware of at least one organization in which incidents involving failed power supplies caused senior management to cancel their plan to disallow work on energized equipment.
According to several Uptime Institute Network community members, building a culture of safety is the most time-consuming part of the transition from hot work, as data centers are goal-oriented organizations, well-practiced at developing and following programs to identify and eliminate risk.
It is not necessary or even prudent to eliminate all hot work at once. The IT team can help slowly retire the practice by eliminating the most dangerous hot work first, building experience on less critical loads, or reducing the number of circuits affected at any one time. To prevent common failures when de-energizing servers, the Operations team can increase scrutiny on power supplies and ensure that dual-corded servers are properly fed.
In early data centers, the practice of hot work was understandable — necessary, even. However, Uptime Institute has long advocated against hot work. Modern equipment and higher resiliency architectures based on dual-corded servers make it possible to switch power feeds in the case of an electrical equipment failure. These advances not only improve data center availability, they also make it possible to isolate equipment for maintenance purposes.
Phasing Out Data Center Hot Work. Kevin Heslin. 2020-02-24.
Uptime Institute Intelligence plans to release its 2019/2020 outages report shortly. This report will examine the types, causes and impacts of public outages, as well as further analyze the results of a recent Uptime survey on outages and impacts. The data will once again show that serious IT service interruptions are common and costly, with the impacts often causing serious disruption.
We have excluded one type of outage from the report: those caused by cyberattacks. Data integrity and cybersecurity are, of course, major issues that require vigilant attention and investment, but they are not currently areas on which Uptime Institute researches and advises. Most security issues are data breaches; although they have serious consequences, they do not usually lead to a service interruption.
However, two forms of malicious attack can and often do lead to outages or at least a severe service degradation. The first is a Distributed Denial of Service (DDoS) attack, where a coordinated attempt is made to overwhelm a site with traffic. Uptime has tracked a number of these each year for many years, and security specialists say they are increasingly common. Even so, most organizations that are DDoS targets have developed effective countermeasures that minimize the threat. These measures include such techniques as packet filtering, load balancing and blocking suspect internet protocol addresses. As a result, DDoS attacks are showing up less frequently in our lists of outages.
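As a rough illustration of the last of those countermeasures, blocking suspect addresses, the sketch below counts requests per address in a sliding time window and blocks any address that exceeds a threshold. It is a toy, application-level example: the `SuspectIpBlocker` class, its thresholds and the sample addresses are all hypothetical, and production DDoS mitigation typically happens further upstream in network equipment or specialist services.

```python
from collections import defaultdict, deque


class SuspectIpBlocker:
    """Blocks an address once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked = set()

    def allow(self, ip: str, now: float) -> bool:
        """Return True if the request should be served, False if dropped."""
        if ip in self.blocked:
            return False
        timestamps = self._recent[ip]
        # Discard requests that have fallen out of the sliding window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        timestamps.append(now)
        if len(timestamps) > self.max_requests:
            # Too many requests in the window: treat the address as suspect.
            self.blocked.add(ip)
            del self._recent[ip]
            return False
        return True


blocker = SuspectIpBlocker(max_requests=3, window_seconds=1.0)
burst = [blocker.allow("198.51.100.7", t) for t in (0.0, 0.1, 0.2, 0.3)]
print(burst)                                 # fourth request in the burst is dropped
print(blocker.allow("198.51.100.7", 10.0))   # address stays blocked afterwards
print(blocker.allow("203.0.113.5", 0.3))     # other addresses are unaffected
```

A real deployment would also need an expiry policy for blocked addresses and protection against spoofed sources, both of which this sketch leaves out.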
The second type, ransomware, is emerging as a major problem and cause of outages. Ransomware attackers deny authorized users access to their own data; the hackers use malware to encrypt the user’s files and refuse to unlock them unless a ransom is paid. Often, operators have no choice but to take down all involved IT services in an attempt to recover access, restore from the last clean backup copy, and purge the systems of viruses. Outages can last days or weeks.
In the past two years, ransomware attacks have increased dramatically. The FBI investigated over 1,400 ransomware attacks in 2018. Government offices are a particular target. Kaspersky Research Labs, operated by security software supplier Kaspersky, identified 147 attacks on municipalities in 2019 (up 60%), in which the criminals demanded ransoms of $5.3 million. The IT Governance blog, based in the UK, recorded 19 major ransomware attacks globally in December 2019 alone.
Most US cities have now signed a charter never to pay a ransom to the criminals — but more importantly, most are now also upgrading their infrastructure and practices to prevent attacks. Some that have been targeted, however, have paid the ransom.
Perhaps the two most serious recent attacks on US cities were those on the City of Baltimore in 2019, which refused to pay the ransom and budgeted $18 million to fix its problem; and the City of Atlanta in 2018, which also refused to pay the ransom and paid over $7 million to fully restore operations. The WannaCry attack in 2017 reportedly cost the UK National Health Service over $120 million (£92 million). And on New Year’s Eve 2019, Travelex’s currency trading went offline for two weeks due to a ransomware attack, costing it millions.
Preventing a ransomware attack has become, or should become, a very high priority for those concerned with resiliency. Addressing the risk may involve some stringent, expensive and inconvenient processes, such as multifactor security, since attackers will likely try to copy all passwords as well as encrypt files. In terms of the Uptime Institute Outage Severity Rating, many attacks quickly escalate to the most serious Category 4 or 5 levels, severe enough to cost millions and threaten the survival of the organization. Indeed, one North American health provider has struggled to recover after receiving a $14 million ransom demand.
All of this points to the obvious imperative: The availability and integrity of digital infrastructure, data and services are critical (in the fullest sense of the word) to almost all organizations today, and assessments of vulnerability need to span security, software, systems, power, networks and facilities. Weaknesses are likely to be exploited; sufficient investment and diligence in this area have become essential and must never waver. In hindsight, we discover that almost all outages could have been prevented with better management, processes and technology.
Members of the Uptime Institute Network can read more on this topic here.
The spectre of ransomware. Andy Lawrence, Executive Director of Research, Uptime Institute. 2020-02-10.
A wave of new technologies, from 5G to the internet of things (IoT) to artificial intelligence (AI), means much more computing and much more data will be needed near the point of use. That means many more small data centers will be required. But there will be no sudden mass deployment, no single standout use case, no single design dominating. Demand is likely to grow faster from 2022.
Small package, big impact
Suppliers in the data center industry are excited. Big vendors such as Schneider, Vertiv and Huawei have been rapidly adding to their product lines and redrawing their financial forecasts; startups — companies such as Vapor IO, EdgeMicro, EdgeInfra and MetroEDGE — are pioneering new designs; and established telco specialists, such as Ericsson, along with telco operators, are working on new technologies and partnerships. Builders and operators of colocation data centers, such as EdgeConneX, Equinix and Compass, are assessing where the opportunity lies.
The opportunity is to supply, build or operate local edge data centers: micro data centers that are designed to operate near the point of use, supporting applications that are not suited to run in big, remote data centers or even in mid-sized regional colocation data centers. Unlike most larger data centers, micro data centers will mostly be built, configured and tested in a factory and delivered on a truck. Typical sizes will be 50 kW to 400 kW, and there are expected to be a lot of them.
But with the anticipation comes consternation — it is possible to commit too early. Some analysts had predicted that the explosion in edge demand would be in full swing by now, fueled by the growing maturity of the IoT and the 2020 launch schedules for 5G services. Suppliers, however, mostly report only a trickle — not a flood — of orders.
Privately, some suppliers admit they have been caught off guard. There is a deep discussion about the extent of data center capacity needed at the local edge; about just how many applications and services really need local edge processing; and about the type and size of IT equipment needed — maybe a small box on the wall will be enough?
While the technical answers to most of these questions are largely understood, questions remain about the economics, the ownership, and the scale and pace of deployment of new technologies and services. These are critical matters affecting deployment.
Edge demand and 5G
In the past decade, data and processing have shifted to a cloudy core, with hundreds of hyperscale data centers built or planned. This will continue. But a rebalancing is underway (see Uptime Institute Intelligence report: The internet tilts toward the edge), with more processing being done not just at the regional edge, in nearby colocation (and other regional) data centers, but locally, in a micro data center that is tens or hundreds of meters away.
This new small facility may be needed to support services that have a lot of data, such as MRI scanners, augmented reality and real-time streaming; it may be needed to provide very low latency, instantly responsive services for both humans and machines — factory machines are one example, driverless cars another; and it may be needed to quickly crunch AI calculations for immediate, real-time responses. There is also a more mundane application: to provide on-site services, such as in a hospital, factory or retail establishment, should the network fail.
With all these use cases, why is there any doubt about the micro data center opportunity?
First, in terms of demand drivers, no new technology has created so much interest and excitement as 5G. The next generation telecom wireless network standard promises speeds of up to 10 gigabits per second (Gbps), latency of below five milliseconds (ms), support for one million devices per square kilometer, and five-nines availability. It will ultimately support a vast array of new always-on, low latency and immersive applications that will require unimaginable amounts of data and compute power, too much to realistically or economically send back to the internet’s hyperscale core. Much of this will require low-latency communications and rapid processing of a few milliseconds or less, which, the speed of light dictates, must take place within a few kilometers.
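The link between latency and distance can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes signals travel through optical fiber at roughly two-thirds of the speed of light in a vacuum, and the function name and budget values are introduced here for the example.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 2 / 3  # assumption: approximate value for optical fiber


def max_one_way_distance_km(round_trip_budget_ms: float,
                            velocity_factor: float = FIBER_VELOCITY_FACTOR) -> float:
    """Upper bound on distance if the whole budget were spent on propagation."""
    one_way_seconds = (round_trip_budget_ms / 1000) / 2
    return SPEED_OF_LIGHT_KM_PER_S * velocity_factor * one_way_seconds


for budget_ms in (5.0, 1.0, 0.1):
    distance = max_one_way_distance_km(budget_ms)
    print(f"{budget_ms} ms round trip -> at most {distance:.1f} km away")
```

These figures are ceilings, not targets. Radio access, switching, queuing and the processing itself consume most of a budget of a few milliseconds, so the share left for propagation, and with it the practical distance, is far smaller than the ceiling suggests.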
Few doubt that 5G will create (or satisfy) huge demand and play a pivotal role in IoT. But the rollout of 5G, already underway, is not going to be quick, sudden or dramatic. In fact, full rollout may take 15 years. This is because the infrastructure required to support 5G is too expensive, too complex, and involves too many parties to do all at once. Estimates vary, with at least one analyst firm predicting that telecom companies will need to spend $1 trillion upgrading their networks.
A second issue that is creating uncertainty about demand is that many edge applications — whether supported by 5G or some other networking technology (such as WiFi 6) — may not require a local micro data center. For example, high-bandwidth applications may be best served from a content distribution network at the regional edge, in a colo, or by the colo itself, while many sensors and IoT devices produce very little data and so can be served by small gateway devices. Among 5G’s unique properties is the ability to support data-heavy, low-latency services at scale — but this is exactly the kind of service that will mostly be deployed in 2021 or later.
Suppliers and telcos alike, then, are unsure about the number, type and size of data centers at the local edge. Steve Carlini, a Schneider Electric executive, told Uptime Institute that he expects most demand for micro data centers supporting 5G will be in the cities, where mobile edge-computing clusters would likely each need one micro data center. But the number of clusters in each city, far fewer than the number of new masts, would depend on demand, applications and other factors.
A third big issue that will slow demand for micro data centers is economic and organizational. These issues include licensing, location and ownership of sites; support and maintenance; security and resiliency concerns; and management sentiment. Most enterprises expect to own their own edge micro data centers, according to Uptime Intelligence research, but many others will likely prefer to outsource this altogether, in spite of potentially higher operational costs and a loss of control.
Suppliers are bullish, even if they know demand will grow slowly at first. Among the first-line targets are those simply looking to upgrade server rooms, where the work cannot be turned over to a colo or the cloud; factories with local automation needs; retailers and others that need more resiliency in distributed locations; and telcos, whose small central offices need the security, availability and cost base of small data centers.
This wide range of applications has also led to an explosion of innovation. Expect micro data centers to vary in density, size, shape, cooling types (including liquid), power sources (including lithium ion batteries and fuel cells) and levels of resiliency.
The surge in demand for micro data centers will be real, but it will take time. Many of the economic and technical drivers are not yet mature; 5G, one of the key underlying catalysts, is in its infancy. In the near term, much of the impetus behind the use of micro data centers will lie in their ability to ensure local availability in the event of network or other remote outages.
The full report Ten data center industry trends in 2020 is available to members of the Uptime Institute Network here.
Micro data centers: An explosion in demand, in slow motion. Andy Lawrence, Executive Director of Research, Uptime Institute. 2020-02-03.
Hardware refresh is the process of replacing older, less efficient servers with newer, more efficient ones with more compute capacity. However, there is a complication to the refresh cycle that is relatively recent: the slowing down of Moore’s law. There is still a very strong case for savings in energy when replacing servers that are up to nine years old. However, the case for refreshing more recent servers — say, up to three years old — may be far less clear, due to the stagnation witnessed in Moore’s law over the past few years.
Moore’s law refers to the observation made by Gordon Moore (co-founder of Intel) that the transistor count on microchips would double every two years. This implied that transistors would become smaller and faster, while drawing less energy. Over time, the doubling in performance per watt was observed to happen around every year and a half.
It is this doubling in performance per watt that underpins the major opportunity for increasing compute capacity while increasing efficiency through hardware refresh. But in the past five years, it has been harder for Intel (and immediate rival AMD) to maintain the pace of improvement. This raises the question: Are we still seeing these gains from recent and forthcoming generations of central processing units (CPUs)? If not, the hardware refresh case will be undermined … and suppliers are unlikely to be making that point too loudly.
To answer this question, Uptime Institute Intelligence analyzed performance data from the Standard Performance Evaluation Corporation (SPEC; https://www.spec.org/). The SPECpower dataset used contains energy performance results from hundreds of servers, based on the SPECpower server energy performance benchmark. To be able to track trends and eliminate potential outlier bias in reported servers (e.g., high-end servers versus volume servers), only dual-socket servers were considered in our analysis, for trend consistency. The dataset was then broken down into 18-month intervals (based on the published date of release of servers in SPECpower) and the performance averaged for each period. The results (server performance per watt) are shown in Figure 1, along with the trend line (polynomial, order 3).
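The filtering and averaging steps described above can be sketched as follows. This is a minimal illustration, not the analysis code used for the report: the record layout, the `average_by_interval` helper and the sample values are all hypothetical, and the cubic trend-line fit is left out.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def average_by_interval(results, origin: date, interval_months: int = 18):
    """Average performance per watt of dual-socket servers per interval.

    results: iterable of (release_date, socket_count, performance_per_watt).
    Returns {interval_index: mean performance per watt}.
    """
    buckets = defaultdict(list)
    for released, sockets, perf_per_watt in results:
        if sockets != 2:
            continue  # keep only dual-socket servers, as in the analysis
        index = months_between(origin, released) // interval_months
        buckets[index].append(perf_per_watt)
    return {index: mean(values) for index, values in sorted(buckets.items())}


# Hypothetical sample records, not real SPECpower results.
sample = [
    (date(2008, 1, 1), 2, 1000.0),
    (date(2009, 3, 1), 2, 1400.0),
    (date(2009, 6, 1), 4, 9000.0),   # four-socket server, excluded
    (date(2009, 9, 1), 2, 2500.0),
    (date(2010, 11, 1), 2, 3100.0),
    (date(2011, 2, 1), 2, 4000.0),
]
averages = average_by_interval(sample, origin=date(2008, 1, 1))
print(averages)
```

Grouping by fixed 18-month windows smooths out differences between individual submissions, at the cost of hiding variation within each window.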
Figure 1 shows how performance increases have started to plateau, particularly over the past two periods. The data suggests upgrading a 2015 server in 2019 might provide only a 20% boost in processing power for the same number of watts. In contrast, upgrading a 2008/2009 server in 2012 might have given a boost of 200% to 300%.
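To put the 20% figure in context, it helps to compare it with what a steady 18-month doubling would have delivered. The sketch below is a simple worked calculation; the function names are introduced here, and treating 2015 to 2019 as exactly four years is an approximation.

```python
import math


def expected_gain(years: float, doubling_period_years: float = 1.5) -> float:
    """Multiplicative performance-per-watt gain if doubling held steady."""
    return 2 ** (years / doubling_period_years)


def implied_doubling_period(years: float, gain: float) -> float:
    """Doubling period in years implied by an observed multiplicative gain."""
    return years * math.log(2) / math.log(gain)


# 2015 -> 2019, approximated as four years; a 20% boost is a gain of 1.2x.
print(f"Expected gain at 18-month doubling: {expected_gain(4):.2f}x")
print(f"Observed gain: 1.20x")
print(f"Implied doubling period: {implied_doubling_period(4, 1.2):.1f} years")
```

On these assumptions, a 20% gain over four years corresponds to a doubling period of roughly 15 years, about ten times slower than the historical 18-month pace.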
To further understand the reason behind this, we charted the way CPU technology (lithography) has evolved over time, along with performance and idle power consumption (see Figure 2).
Figure 2 reveals some interesting insights. At the beginning of the decade, the move from one CPU lithography to another, e.g., 65 nanometers (nm) to 45 nm, 45 nm to 32 nm, etc., presented major performance per watt gains (orange line), as well as a substantial reduction in idle power consumption (blue line), thanks to the reduction in transistor size and voltage.
However, it is also interesting to see that the introduction of a larger number of cores to maintain performance gains produced a negative impact on idle power consumption. This can be seen briefly during the 45 nm lithography and very clearly in recent years with 14 nm.
Over the past few years, while lithography stagnated at 14 nm, the increase in performance per watt (when working with a full load) has been accompanied by a steady increase in idle power consumption (perhaps due to the increase in core count to achieve performance gains). This is one reason why the case for hardware refresh for more recent kit has become weaker: Servers in real-life deployments tend to spend a substantial part of their time in idle mode — 75% of the time, on average. As such, the increase in idle power may offset energy gains from performance.
This is an important point that will likely have escaped many buyers and operators: If a server spends a disproportionate amount of time in active idle mode — as is the case for most — the focus should be on active idle efficiency (e.g., choosing servers with lower core count) rather than just on higher server performance efficiency, while satisfying overall compute capacity requirements.
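A small worked example shows how higher idle power can cancel out a gain in full-load efficiency. The wattages below are hypothetical, and the model is deliberately simplified: it assumes a server is either idle or fully loaded, and that both servers must complete the same amount of work.

```python
def average_power_w(idle_w: float, active_w: float, active_fraction: float) -> float:
    """Time-weighted average draw for a server that is either idle or at full load."""
    return idle_w * (1 - active_fraction) + active_w * active_fraction


# Older server: idle 75% of the time, as cited in the text.
OLD_ACTIVE_FRACTION = 0.25
old_avg = average_power_w(idle_w=100.0, active_w=300.0,
                          active_fraction=OLD_ACTIVE_FRACTION)

# Newer server (hypothetical): 20% more work per watt at full load, so it
# finishes the same work in less active time, but it draws more when idle.
new_active_fraction = OLD_ACTIVE_FRACTION / 1.2
new_avg = average_power_w(idle_w=130.0, active_w=300.0,
                          active_fraction=new_active_fraction)

print(f"Old server average draw: {old_avg:.1f} W")
print(f"New server average draw: {new_avg:.1f} W")
```

With these illustrative numbers, the newer server draws more power on average despite being more efficient at full load, because the extra idle consumption outweighs the time saved on active work.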
It is, of course, a constantly moving picture. The more recent introduction of the 7 nm lithography by AMD (Intel’s main competitor) should give Moore’s law a new lease of life for the next couple of years. However, it has become clear that we are starting to reach the limits of the existing approach to CPU design. Innovation and efficiency improvements will need to be based on new architectures, entirely new technologies and more energy-aware software design practices.
The full report Beyond PUE: Tackling IT’s wasted terawatts is available to members of the Uptime Institute Network here.
Optimizing server refresh cycles with an aging Moore’s law. Rabih Bashroush. 2020-01-27.
Big IT outages are occurring with growing regularity, many with severe consequences. Executives, industry authorities and governments alike are responding with more rules, calls for more transparency and a more formal approach to end-to-end, holistic resiliency.
Creeping criticality
IT outages and data center downtime can cause huge disruption. That is hardly news: veterans with long memories can remember severe IT problems caused by power outages, for example, back in the early 1990s.
Three decades on, the situation is vastly different. Almost every component and process in the entire IT supply chain has been engineered, re-engineered and architected for the better, with availability a prime design criterion. Failure avoidance and management, business continuity and data center resiliency have become a discipline, informed by proven approaches and supported by real-time data and a vast array of tools and systems.
But there is a paradox: The very success of IT, and of remotely delivered services, has created a critical dependency on IT in almost every business and for almost every business process. This dependency has radically increased in recent years. Outages are more numerous, and many now have a more immediate, wider and bigger impact than in the past.
A particular issue that has affected many high-profile organizations, especially in industries such as air transport, finance and retail, is “asymmetric criticality” or “creeping criticality.” This refers to a situation in which the infrastructure and processes have not been upgraded or updated to reflect the growing criticality of the applications or business processes they support. Some of the infrastructure has a 15-year life cycle, a timeframe out of sync with the far faster pace of innovation and change in the IT market.
While the level of dependency on IT is growing, another big set of changes is still only partway through: the move to cloud and distributed IT architectures (which may or may not involve the public cloud). Cloud and distributed applications enable the move, in part or whole, to a more distributed approach to resiliency. This approach involves replicating data across availability zones (regional clusters of three or more data centers) and using a variety of software tools and approaches, such as distributed databases, decentralized traffic and workload management, data replication and disaster recovery as a service.
These approaches can be highly effective but bring two challenges. First are complexity and cost — these architectures are difficult to set up, manage and configure, even for a customer with no direct responsibility for the infrastructure (Uptime Institute data suggests that difficulties with IT and software contribute to ever more outages). And second, for most customers, is a loss of control, visibility and accountability. This loss of visibility is now troubling regulators, especially in the financial services sector, which now plan to exercise more oversight in the United States (US), Europe, the United Kingdom (UK) and elsewhere.
Will outages get worse?
Are outages becoming more common or more damaging? The answer depends on the exact phrasing of the question: neither the number nor the severity of outages is increasing as a proportion of the level of IT services being deployed. In fact, reliability and availability are probably increasing, albeit perhaps not significantly.
But the absolute number of outages is clearly increasing. In both our 2018 and 2019 global annual surveys, half of respondents (almost exactly 50%) said their organization had a serious data center or IT outage in the past three years, and it is known that the number of data centers has risen significantly during this time. Our data also shows the impact of these outages is serious or severe in almost 20% of cases, with many industry sectors, including public cloud and colocation, suffering problems.
What next?
The industry is now at an inflection point; whatever the overall rate of outages, the impact of outages at all levels has become more public, has more consequential effects, and is therefore more costly. This trend will continue for several years, as networks, IT and cloud services take time to mature and evolve to meet the heavy availability demands put upon them. More high-profile outages can be expected, and more sectors and governments will start examining the nature of critical infrastructure.
This has already started in earnest: In the UK, the Bank of England is investigating large banks’ reliance on cloud as part of a broader risk-reduction initiative for financial digital services. The European Banking Authority specifically states that an outsourcer/cloud operator must allow site inspections of data centers. And in the US, the Federal Reserve has conducted a formal examination of at least one Amazon Web Services (AWS) data center, in Virginia, with a focus on its infrastructure resiliency and backup systems. More site visits are expected.
Authorities in the Netherlands, Sweden and the US have also been examining the resiliency of 911 services after a series of failures. And in the US, the Government Accountability Office published an analysis to determine what could be done about the impact and frequency of IT outages at airlines. Meanwhile, data centers themselves will continue to be the most resilient and mature component (and with Uptime Institute certification, can be shown to be designed and operated for resiliency). There are very few signs that any sector of the market (enterprise, colocation or cloud) plans on downgrading physical infrastructure redundancy.
As a result of the high impact of outages, a much greater focus on resiliency can be expected, with best practices and management, investment, technical architectures, transparency and reporting, and legal responsibility all under discussion.
The full report Ten data center industry trends in 2020 is available to members of the Uptime Institute Network.
Outages drive authorities and businesses to act, by Andy Lawrence, Executive Director of Research, Uptime Institute (2020-01-13)
Pay-as-you-go model spreads to critical components
By Rhonda Ascierto, Vice President, Research, Uptime Institute
As enterprises continue to move from a focus on capital expenditures to operating expenditures, more data center components will also be consumed on a pay-as-you-go, “as a service” basis.
“-aaS” goes mainstream
The trend toward everything “as a service” (XaaS) is now mainstream in IT, ranging from cloud (infrastructure-aaS) and software-aaS (SaaS) to newer offerings, such as bare metal-aaS, container-aaS, and artificial intelligence-aaS (AI-aaS). At the IT level, service providers are winning over more clients to the service-based approach by reducing capital expenditures (capex) in favor of operational expenditures (opex), by offering better products, and by investing heavily to improve security and compliance. More organizations are now willing to trust them.
But this change is not confined to IT: a similar trend is underway in data centers.
Why buy and not build?
While the cost to build new data centers is generally falling, driven partly by the availability of more prefabricated components, enterprise operators have been increasingly competing against lower-cost options to host their IT — notably colocation, cloud and SaaS.
Cost is rarely the biggest motivation for moving to cloud, but it is a factor. Large cloud providers continue to build and operate data centers at scale and enjoy the proportional cost savings as well as the fruits of intense value engineering. They also spread costs among customers and tend to have much higher utilization rates compared with other data centers. And, of course, they invest in innovative, leading-edge IT tools that can be rolled out almost instantly. This all adds up to ever-improving IT and infrastructure services from cloud providers that are cheaper (and often better) than using or developing equivalent services based in a smaller-scale enterprise data center.
Many organizations have now come to view data center ownership as a big capital risk, one that only some want to take. Even when it’s cheaper to deliver IT from their own “on-premises” data center, the risks of early data center obsolescence, under-utilization, technical noncompliance or unexpected technological or local problems are all factors. And, of course, most businesses want to avoid a big capital outlay: Our research shows that, in 2017, the total cost of ownership of an “average” concurrently maintainable 3 megawatt (MW) enterprise data center amortized over 15 years was about $90 million, and that roughly half of the cost is invested in three installments over the first six years, assuming a typical phased build and bricks-and-mortar construction.
This represents a significant amount of risk. To be economically viable, the enterprise must typically operate a facility at a high level of utilization — yet forecasting future data center capacity remains enterprises’ top challenge, according to our research.
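The link between utilization and economic viability can be made concrete with a rough calculation. The sketch below uses the $90 million, 3 MW and 15-year figures cited above; the utilization levels are hypothetical, and the even spread of cost across years is a simplifying assumption that ignores the front-loaded spending and the time value of money.

```python
# From the text: about $90 million over 15 years for a 3 MW facility.
TCO_USD = 90_000_000
CAPACITY_KW = 3_000
YEARS = 15


def cost_per_used_kw_year(utilization: float) -> float:
    """Annualized cost per kW of capacity actually in use.

    Assumes cost is spread evenly over the amortization period.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    annual_cost = TCO_USD / YEARS
    return annual_cost / (CAPACITY_KW * utilization)


# Hypothetical utilization levels.
for utilization in (1.0, 0.75, 0.5, 0.25):
    cost = cost_per_used_kw_year(utilization)
    print(f"{utilization:.0%} utilized: ${cost:,.0f} per kW-year")
```

On these assumptions, halving utilization doubles the effective cost of each kilowatt in use, which is why a capacity forecast that proves too optimistic is so expensive.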
Demand for enterprise data centers remains sizable, in spite of the alternatives. Many enterprises with smaller data centers are closing them and consolidating into premium, often larger, centralized data centers and outsourcing as much else as possible.
The appeal of the cloud will continue to convince executives and drive strategy. Increasingly, public cloud is an alternative way to deliver workloads faster and cheaper without having to build additional on-premises capacity. Scalability, portability, reduced risk, better tools, high levels of resiliency, infrastructure avoidance and fewer staff requirements are other key drivers for cloud adoption. Innovation and access to leading-edge IT will likely be bigger factors in the future, as will more cloud-first remits from upper management.
Colocation, including sale leasebacks
Although rarely thought of in this way, colocation is the most widely used “data center-aaS” offering today. Sale with leaseback of the data center by enterprises to colos is also becoming more common, a trend that will continue to build (see UII Note 38: Capital inflow boosts the data center market).
Colo interconnection services will attract even more businesses. More will likely seek to lease space in the same facility as their cloud or other third-party service provider, enabling lower latency, lower costs and more security for third-party services, such as storage-aaS and disaster recovery-aaS.
While more enterprise IT is moving to colos and managed services (whether or not it is cloud), enterprise data centers will not disappear. More than 600 IT and data center managers told Uptime Institute that, in 2021, about half of all workloads will still be in enterprise data centers, and only 18% of workloads in public cloud/SaaS.
Other “as a service” trends in data centers
Data center monitoring and analysis is another relatively new example of a pay-as-you-go service. Introduced in late 2016, data center management as a service is a big data-driven cloud service that provides customized analysis and is paid for on a recurring basis. The move to a pay-as-you-go service has helped unlock the data center infrastructure management market, which was struggling for growth because of costs and complexity.
Energy backup and generation is another area to watch. Suppliers have introduced various pay-as-you-go models for their equipment. These include leased fuel cells owned by the supplier (notably Bloom Energy), which charges customers only for the energy produced. By eliminating the client’s risk and capital outlay, this model can make the supplier’s sale easier (although the supplier has to wait to be paid). Some suppliers have ventured into UPS-aaS, but with limited success to date.
More alternatives to ownership are likely for data center electrical assets, such as batteries. Given the high and fast rate of innovation in the technology, leasing large-scale battery installations delivers the capacity and innovation benefits without the risks.
It’s also likely that more large data centers will use energy service companies (ESCOs) to produce, manage and deliver energy from renewable microgrids. Demand for green energy, for energy security (that is, energy produced off-grid) and energy-price stability is growing; ESCOs can deliver all this for dedicated customers that sign long-term energy-purchase agreements but don’t have the capital required to build or the expertise necessary to run a green microgrid.
Demand for enterprise data centers will continue but alongside the use of more cloud and more colo. More will be consumed “as a service,” ranging from data center monitoring to renewable energy from nearby dedicated microgrids.
The full report Ten data center industry trends in 2020 is available to members of the Uptime Institute Network. Membership information can be found here.
Phasing Out Data Center Hot Work
By Kevin Heslin
Despite years of discussion, warnings and strict regulations in some countries, data center hot work remains a contentious issue in the data center industry. Hot work is the practice of working on energized electrical circuits (voltage limits differ regionally). It is usually done, in spite of the risks, to reduce the possibility of a downtime incident during maintenance.
Uptime Institute advises against hot work in almost all instances. The safety concerns are just too great, and data suggests work on energized circuits may — at best — only reduce the number of manageable incidents, while increasing the risk of arc flash and other events that damage expensive equipment and may lead to an outage or injury. In addition, concurrently maintainable or fault tolerant designs as described in Uptime Institute’s Tier Standard make hot work unnecessary.
The pressure against hot work continues to mount. In the US, electrical contractors have begun to decline some work that involves working on energized circuits, even if an energized work permit has been created and signed by appropriate management, as required by National Fire Protection Association (NFPA) 70E (Standard for Electrical Safety in the Workplace). In addition, the US Department of Labor’s Occupational Safety and Health Administration (OSHA) has repeatedly rejected business continuity as an exception to hot work restrictions, making it harder for management to justify hot work and to find executives willing to sign the energized work permit.
OSHA statistics make clear that work on energized systems is a dangerous practice, especially for construction trades workers; installation, maintenance, and repair occupations; and grounds maintenance workers. For this reason, NFPA 70E sharply limits the situations in which organizations are allowed to work on energized equipment. Personnel safety is not the only issue; personal protective equipment (PPE) protects only workers, not equipment, so an arc flash can destroy many thousands of dollars of IT gear.
Ignoring local and national standards can be costly, too. OSHA reported 2,923 lockout/tagout and 1,528 PPE violations in 2017, among the many safety concerns it addressed that year. New minimum penalties for a single violation exceed $13,000, with top total fines for numerous, willful and repeated violations running into the millions of dollars. Wrongful death and injury suits add to the cost, and violations can lead to higher insurance premiums, too.
Participants in a recent Uptime Institute discussion roundtable agreed that the remaining firms performing work on live loads should begin preparing to end the practice. They said that senior management is often the biggest impediment to ending hot work, at least at some organizations, despite the well-known and documented risks. Executive resistance can be tied to concerns about power supplies or failure to maintain independent A/B feeds. In some cases, service level agreements contain restrictions against powering down equipment.
Despite executive resistance at some companies, the trend is clearly against hot work. By 2015, more than two-thirds of facilities operators had already eliminated the practice, according to Uptime Institute data. A tighter regulatory environment, heightened safety concerns, increased financial risk and improved equipment should combine to all but eliminate hot work in the near future. But there are still holdouts, and the practice is far more acceptable in some countries — China is an example — than in others, such as the US, where NFPA 70E severely limits the practice in all industries.
Also, hot work does not eliminate IT failure risk. Uptime Institute has been tracking data center abnormal incidents for more than 20 years, and the data includes at least 71 failures that occurred during hot work. While these failures are generally attributed to poor procedures or maintenance, a recent, more careful analysis concluded that better procedures or maintenance (or both) would have made it possible to perform the work safely, and without any failures, on de-energized systems.
The Uptime Institute abnormal incident database includes only four injury reports; all occurred during work on energized systems. In addition, the database includes 16 reports of arc flash. One occurred during normal preventive maintenance and one during an infrared scan. Neither caused injury, but the potential risk to personnel is apparent, as is the potential for equipment damage (and legal exposure).
Undoubtedly, eliminating hot work is a difficult process. One large retailer that has just begun the process expects the transition to take several years. And not all organizations succeed: Uptime Institute is aware of at least one organization in which incidents involving failed power supplies caused senior management to cancel their plan to disallow work on energized equipment.
According to several Uptime Institute Network community members, building a culture of safety is the most time-consuming part of the transition from hot work, as data centers are goal-oriented organizations, well-practiced at developing and following programs to identify and eliminate risk.
It is not necessary or even prudent to eliminate all hot work at once. The IT team can help slowly retire the practice by eliminating the most dangerous hot work first, building experience on less critical loads, or reducing the number of circuits affected at any one time. To prevent common failures when de-energizing servers, the Operations team can increase scrutiny on power supplies and ensure that dual-corded servers are properly fed.
In early data centers, the practice of hot work was understandable — necessary, even. However, Uptime Institute has long advocated against hot work. Modern equipment and higher resiliency architectures based on dual-corded servers make it possible to switch power feeds in the case of an electrical equipment failure. These advances not only improve data center availability, they also make it possible to isolate equipment for maintenance purposes.
The spectre of ransomware
By Andy Lawrence, Executive Director of Research, Uptime Institute
Uptime Institute Intelligence plans to release its 2019/2020 outages report shortly. This report will examine the types, causes and impacts of public outages, as well as further analyze the results of a recent Uptime survey on outages and impacts. The data will once again show that serious IT service interruptions are common and costly, with the impacts often causing serious disruption.
We have excluded one type of outage from the report: those caused by cyberattacks. Data integrity and cybersecurity are, of course, major issues that require vigilant attention and investment, but they are not currently areas on which Uptime Institute researches and advises. Most security issues are data breaches; although they have serious consequences, they do not usually lead to a service interruption.
However, two forms of malicious attack can and often do lead to outages or at least a severe service degradation. The first is a Distributed Denial of Service (DDoS) attack, where a coordinated attempt is made to overwhelm a site with traffic. Uptime has tracked a number of these each year for many years, and security specialists say they are increasingly common. Even so, most organizations that are DDoS targets have developed effective countermeasures that minimize the threat. These measures include such techniques as packet filtering, load balancing and blocking suspect internet protocol addresses. As a result, DDoS attacks are showing up less frequently in our lists of outages.
The second type, ransomware, is emerging as a major problem and cause of outages. Ransomware attackers deny authorized users access to their own data; the hackers use malware to encrypt the user’s files and refuse to unlock them unless a ransom is paid. Often, operators have no choice but to take down all involved IT services in an attempt to recover access, restore from the last clean backup copy, and purge the systems of viruses. Outages can last days or weeks.
In the past two years, ransomware attacks have increased dramatically. The FBI investigated over 1,400 ransomware attacks in 2018. Government offices are a particular target. Kaspersky Research Labs, operated by security software supplier Kaspersky, identified 147 attacks on municipalities in 2019 (up 60%), in which the criminals demanded ransoms of $5.3 million. The IT Governance blog, based in the UK, recorded 19 major ransomware attacks globally in December 2019 alone.
Most US cities have now signed a charter never to pay a ransom to the criminals — but more importantly, most are now also upgrading their infrastructure and practices to prevent attacks. Some that have been targeted, however, have paid the ransom.
Perhaps the two most serious recent attacks on US cities were those on the City of Baltimore in 2019, which refused to pay the ransom and budgeted $18 million to fix its problem; and the City of Atlanta in 2018, which also refused to pay the ransom and paid over $7 million to fully restore operations. The WannaCry virus attack in 2017 reportedly cost the UK National Health Service over $120 million (£92 million). And on New Year’s Eve 2019, Travelex’s currency trading went offline for two weeks due to a ransomware attack, costing it millions.
Preventing a ransomware attack has become (or should become) a very high priority for those concerned with resiliency. Addressing the risk may involve some stringent, expensive and inconvenient processes, such as multifactor security, since attackers will likely try to copy all passwords as well as encrypt files. In terms of the Uptime Institute Outage Severity Rating, many attacks quickly escalate to the most serious Category 4 or 5 levels, severe enough to cost millions and threaten the survival of the organization. Indeed, one North American health provider has struggled to recover after receiving a $14 million ransom demand.
All of this points to the obvious imperative: The availability and integrity of digital infrastructure, data and services are critical, in the fullest sense of the word, to almost all organizations today, and assessments of vulnerability need to span security, software, systems, power, networks and facilities. Weaknesses are likely to be exploited; sufficient investment and diligence in this area has become essential and must never waver. In hindsight, we discover that almost all outages could have been prevented with better management, processes and technology.
Members of the Uptime Institute Network can read more on this topic here.
Micro data centers: An explosion in demand, in slow motion
By Andy Lawrence, Executive Director of Research, Uptime Institute
A wave of new technologies, from 5G to the internet of things (IoT) to artificial intelligence (AI), means much more computing and much more data will be needed near the point of use. That means many more small data centers will be required. But there will be no sudden mass deployment, no single standout use case, no single design dominating. Demand is likely to grow faster from 2022.
Small package, big impact
Suppliers in the data center industry are excited. Big vendors such as Schneider, Vertiv and Huawei have been rapidly adding to their product lines and redrawing their financial forecasts; startups — companies such as Vapor IO, EdgeMicro, EdgeInfra and MetroEDGE — are pioneering new designs; and established telco specialists, such as Ericsson, along with telco operators, are working on new technologies and partnerships. Builders and operators of colocation data centers, such as EdgeConneX, Equinix and Compass, are assessing where the opportunity lies.
The opportunity is to supply, build or operate local edge data centers: small micro data centers that are designed to operate near the point of use, supporting applications that are not suited to run in big, remote data centers, or even in mid-sized regional colocation data centers. Unlike most larger data centers, micro data centers will mostly be built, configured and tested in a factory and delivered on a truck. Typical sizes will be 50 kW to 400 kW, and there are expected to be a lot of them.
But with the anticipation comes consternation — it is possible to commit too early. Some analysts had predicted that the explosion in edge demand would be in full swing by now, fueled by the growing maturity of the IoT and the 2020 launch schedules for 5G services. Suppliers, however, mostly report only a trickle — not a flood — of orders.
Privately, some suppliers admit they have been caught off guard. There is a deep discussion about the extent of data center capacity needed at the local edge; about just how many applications and services really need local edge processing; and about the type and size of IT equipment needed — maybe a small box on the wall will be enough?
While the technical answers to most of these questions are largely understood, questions remain about the economics, the ownership, and the scale and pace of deployment of new technologies and services. These are critical matters affecting deployment.
Edge demand and 5G
In the past decade, data and processing has shifted to a cloudy core, with hundreds of hyperscale data centers built or planned. This will continue. But a rebalancing is underway (see Uptime Institute Intelligence report: The internet tilts toward the edge), with more processing being done not just at the regional edge, in nearby colocation (and other regional) data centers, but locally, in a micro data center that is tens or hundreds of meters away.
This new small facility may be needed to support services that have a lot of data, such as MRI scanners, augmented reality and real-time streaming; it may be needed to provide very low latency, instantly responsive services for both humans and machines — factory machines are one example, driverless cars another; and it may be needed to quickly crunch AI calculations for immediate, real-time responses. There is also a more mundane application: to provide on-site services, such as in a hospital, factory or retail establishment, should the network fail.
With all these use cases, why is there any doubt about the micro data center opportunity?
First, in terms of demand drivers, no new technology has created so much interest and excitement as 5G. The next generation telecom wireless network standard promises speeds of up to 10 gigabits per second (Gbps), latency below five milliseconds (ms), support for one million devices per square kilometer, and five-nines availability. It will ultimately support a vast array of new always-on, low latency and immersive applications that will require unimaginable amounts of data and compute power, too much to realistically or economically send back to the internet’s hyperscale core. Much of this will require low-latency communications and rapid processing of a few milliseconds or less, which, the speed of light dictates, must take place within a few kilometers.
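How far away the processing can sit depends on how much of the latency budget is left for signal propagation. The following is a rough sketch, not a network model: it assumes light in optical fiber travels at about two-thirds of its speed in a vacuum, and the overhead figures (radio access, switching, processing) are hypothetical.

```python
SPEED_OF_LIGHT_KM_PER_MS = 299.792458  # speed of light in a vacuum
FIBER_FACTOR = 2.0 / 3.0  # assumed: roughly 2/3 of c in optical fiber


def max_one_way_distance_km(latency_budget_ms: float, overhead_ms: float) -> float:
    """Farthest a server can be if the round trip must fit within the budget.

    overhead_ms covers everything except propagation (radio access,
    switching, processing) and is a hypothetical input.
    """
    propagation_ms = latency_budget_ms - overhead_ms
    if propagation_ms <= 0:
        return 0.0
    one_way_ms = propagation_ms / 2.0
    return one_way_ms * SPEED_OF_LIGHT_KM_PER_MS * FIBER_FACTOR


# Hypothetical budgets: the less time left for propagation, the closer the site.
for budget_ms, overhead_ms in ((5.0, 1.0), (1.0, 0.5), (1.0, 0.9)):
    distance = max_one_way_distance_km(budget_ms, overhead_ms)
    print(f"budget {budget_ms} ms, overhead {overhead_ms} ms: about {distance:.0f} km")
```

Propagation alone allows for considerable distance; it is when most of a tight budget is consumed by overhead that the permissible distance shrinks toward a few kilometers.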
Few doubt that 5G will create (or satisfy) huge demand and play a pivotal role in IoT. But the rollout of 5G, already underway, is not going to be quick, sudden or dramatic. In fact, full rollout may take 15 years. This is because the infrastructure required to support 5G is too expensive, too complex, and involves too many parties to do all at once. Estimates vary, with at least one analyst firm predicting that telecom companies will need to spend $1 trillion upgrading their networks.
A second issue that is creating uncertainty about demand is that many edge applications — whether supported by 5G or some other networking technology (such as WiFi 6) — may not require a local micro data center. For example, high-bandwidth applications may be best served from a content distribution network at the regional edge, in a colo, or by the colo itself, while many sensors and IoT devices produce very little data and so can be served by small gateway devices. Among 5G’s unique properties is the ability to support data-heavy, low-latency services at scale — but this is exactly the kind of service that will mostly be deployed in 2021 or later.
Suppliers and telcos alike, then, are unsure about the number, type and size of data centers at the local edge. Steve Carlini, a Schneider Electric executive, told Uptime Institute that he expects most demand for micro data centers supporting 5G will be in the cities, where mobile edge-computing clusters would likely each need one micro data center. But the number of clusters in each city, far fewer than the number of new masts, would depend on demand, applications and other factors.
A third big issue that will slow demand for micro data centers is economic and organizational. These issues include licensing, location and ownership of sites; support and maintenance; security and resiliency concerns; and management sentiment. Most enterprises expect to own their own edge micro data centers, according to Uptime Intelligence research, but many others will likely prefer to outsource this altogether, in spite of potentially higher operational costs and a loss of control.
Suppliers are bullish, even if they know demand will grow slowly at first. Among the first-line targets are those simply looking to upgrade server rooms, where the work cannot be turned over to a colo or the cloud; factories with local automation needs; retailers and others that need more resiliency in distributed locations; and telcos, whose small central offices need the security, availability and cost base of small data centers.
This wide range of applications has also led to an explosion of innovation. Expect micro data centers to vary in density, size, shape, cooling types (including liquid), power sources (including lithium ion batteries and fuel cells) and levels of resiliency.
The surge in demand for micro data centers will be real, but it will take time. Many of the economic and technical drivers are not yet mature; 5G, one of the key underlying catalysts, is in its infancy. In the near term, much of the impetus behind the use of micro data centers will lie in their ability to ensure local availability in the event of network or other remote outages.
The full report Ten data center industry trends in 2020 is available to members of the Uptime Institute Network.
Optimizing server refresh cycles with an aging Moore’s law
By Rabih Bashroush
Hardware refresh is the process of replacing older, less efficient servers with newer, more efficient ones with more compute capacity. However, there is a complication to the refresh cycle that is relatively recent: the slowing down of Moore’s law. There is still a very strong case for savings in energy when replacing servers that are up to nine years old. However, the case for refreshing more recent servers (say, up to three years old) may be far less clear, due to the stagnation witnessed in Moore’s law over the past few years.
Moore’s law refers to the observation made by Gordon Moore (co-founder of Intel) that the transistor count on microchips would double every two years. This implied that transistors would become smaller and faster, while drawing less energy. Over time, the doubling in performance per watt was observed to happen around every year and a half.
It is this doubling in performance per watt that underpins the major opportunity for increasing compute capacity while increasing efficiency through hardware refresh. But in the past five years, it has been harder for Intel (and immediate rival AMD) to maintain the pace of improvement. This raises the question: Are we still seeing these gains from recent and forthcoming generations of central processing units (CPUs)? If not, the hardware refresh case will be undermined, and suppliers are unlikely to be making that point too loudly.
To answer this question, Uptime Institute Intelligence analyzed performance data from the Standard Performance Evaluation Corporation (SPEC; https://www.spec.org/). The SPECpower dataset used contains energy performance results from hundreds of servers, based on the SPECpower server energy performance benchmark. To keep the trend consistent and limit outlier bias in the reported servers (e.g., high-end servers versus volume servers), only dual-socket servers were considered in our analysis. The dataset was then broken down into 18-month intervals (based on the published release date of each server in SPECpower) and the performance averaged for each period. The results (server performance per watt) are shown in Figure 1, along with the trend line (polynomial, order 3).
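The bucketing and averaging steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the function names, the tuple layout and the sample rows are all assumptions, and the sample values are placeholders rather than real SPECpower results.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def period_index(published, origin, months_per_period=18):
    """Return the 0-based interval (18 months by default) a date falls into."""
    months = (published.year - origin.year) * 12 + (published.month - origin.month)
    return months // months_per_period


def average_by_period(results, origin, months_per_period=18):
    """Average performance per watt within each interval.

    `results` is an iterable of (published_date, sockets, perf_per_watt).
    Only dual-socket servers are kept, mirroring the filter in the text.
    """
    buckets = defaultdict(list)
    for published, sockets, perf_per_watt in results:
        if sockets != 2:
            continue  # skip anything that is not a dual-socket server
        index = period_index(published, origin, months_per_period)
        buckets[index].append(perf_per_watt)
    return {index: mean(values) for index, values in sorted(buckets.items())}


# Placeholder rows for illustration only; NOT real SPECpower data.
SAMPLE = [
    (date(2008, 1, 15), 2, 1000.0),
    (date(2008, 9, 1), 2, 1400.0),
    (date(2009, 3, 1), 4, 9999.0),  # dropped: four sockets
    (date(2009, 8, 1), 2, 2000.0),
    (date(2010, 12, 1), 2, 3000.0),
]

averages = average_by_period(SAMPLE, origin=date(2008, 1, 1))
print(averages)
```

The order-3 trend line is not reproduced here; one option, if NumPy is available, would be to pass the period indexes and averages to `numpy.polyfit` with `deg=3`.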
Figure 1 shows how performance increases have started to plateau, particularly over the past two periods. The data suggests upgrading a 2015 server in 2019 might provide only a 20% boost in processing power for the same number of watts. In contrast, upgrading a 2008/2009 server in 2012 might have given a boost of 200% to 300%.
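To put these figures in context, the short sketch below computes the gain that would be expected if performance per watt had kept doubling roughly every 18 months, the pace cited earlier. The function names are illustrative, and the calculation is a simple idealization rather than part of the original analysis.

```python
def expected_gain(years, doubling_period_years=1.5):
    """Multiplier if performance per watt doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


def percent_boost(multiplier):
    """Convert a multiplier (1.2x) into a percentage boost (20%)."""
    return (multiplier - 1.0) * 100.0


# Three-year refresh (e.g., 2009 to 2012): 2 ** 2 = 4x, a 300% boost,
# in line with the upper end of the 200% to 300% range quoted above.
ideal_three_years = expected_gain(3)

# Four-year refresh (e.g., 2015 to 2019): about 6.35x under ideal doubling,
# against the roughly 1.2x (20% boost) suggested by the data.
ideal_four_years = expected_gain(4)

print(round(percent_boost(ideal_three_years)), round(percent_boost(ideal_four_years)))
```

Under this idealized pace, a four-year-old server would be replaced by one delivering more than six times the work per watt, which shows how far the observed 20% falls short.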
To further understand the reason behind this, we charted the way CPU technology (lithography) has evolved over time, along with performance and idle power consumption (see Figure 2).
Figure 2 reveals some interesting insights. During the beginning of the decade, the move from one CPU lithography to another, e.g., 65 nanometers (nm) to 45 nm, 45 nm to 32 nm, etc., presented major performance per watt gains (orange line), as well as substantial reduction in idle power consumption (blue line), thanks to the reduction in transistor size and voltage.
However, it is also interesting to see that the introduction of a larger number of cores to maintain performance gains produced a negative impact on idle power consumption. This can be seen briefly during the 45 nm lithography and very clearly in recent years with 14 nm.
Over the past few years, while lithography stagnated at 14 nm, the increase in performance per watt (when working with a full load) has been accompanied by a steady increase in idle power consumption (perhaps due to the increase in core count to achieve performance gains). This is one reason why the case for hardware refresh for more recent kit has become weaker: Servers in real-life deployments tend to spend a substantial part of their time in idle mode — 75% of the time, on average. As such, the increase in idle power may offset energy gains from performance.
This is an important point that will likely have escaped many buyers and operators: If a server spends a disproportionate amount of time in active idle mode — as is the case for most — the focus should be on active idle efficiency (e.g., choosing servers with lower core count) rather than just on higher server performance efficiency, while satisfying overall compute capacity requirements.
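A rough way to see how higher idle power can cancel out a performance-per-watt gain is to weight each server's power draw by the time it spends idle and active. The sketch below is a simplified model under stated assumptions: the wattages are hypothetical, both servers draw the same power at full load, and the newer server completes the same work 1.2 times faster (the 20% boost mentioned earlier). Only the 75% idle share comes from the text.

```python
HOURS_PER_YEAR = 8760


def average_power_watts(idle_watts, active_watts, active_fraction):
    """Time-weighted average power for a server that is either idle or at full load."""
    return (1.0 - active_fraction) * idle_watts + active_fraction * active_watts


def annual_energy_kwh(average_watts):
    """Convert an average power draw into energy used over one year."""
    return average_watts * HOURS_PER_YEAR / 1000.0


# Older server: idle 75% of the time (figure from the text); wattages are hypothetical.
older_avg = average_power_watts(idle_watts=50.0, active_watts=300.0, active_fraction=0.25)

# Newer server: same workload finished 1.2x faster, so it is active for less time,
# but it draws more power while idle (hypothetical 80 W instead of 50 W).
newer_active_fraction = 0.25 / 1.2
newer_avg = average_power_watts(idle_watts=80.0, active_watts=300.0,
                                active_fraction=newer_active_fraction)

print(annual_energy_kwh(older_avg), annual_energy_kwh(newer_avg))
```

With these illustrative numbers the newer server uses more energy over the year despite being more efficient at full load, because the extra idle power outweighs the shorter active time. Different wattages would shift the balance, so the result should be read as an illustration of the mechanism rather than a measurement.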
It is, of course, a constantly moving picture. The more recent introduction of the 7 nm lithography by AMD (Intel’s main competitor) should give Moore’s law a new lease of life for the next couple of years. However, it has become clear that we are starting to reach the limits of the existing approach to CPU design. Innovation and efficiency improvements will need to be based on new architectures, entirely new technologies and more energy-aware software design practices.
The full report Beyond PUE: Tackling IT’s wasted terawatts is available to members of the Uptime Institute Network here.
Outages drive authorities and businesses to act
By Andy Lawrence, Executive Director of Research, Uptime Institute
Big IT outages are occurring with growing regularity, many with severe consequences. Executives, industry authorities and governments alike are responding with more rules, calls for more transparency and a more formal approach to end-to-end, holistic resiliency.
Creeping criticality
IT outages and data center downtime can cause huge disruption. That is hardly news: veterans with long memories can remember severe IT problems caused by power outages, for example, back in the early 1990s.
Three decades on, the situation is vastly different. Almost every component and process in the entire IT supply chain has been engineered, re-engineered and architected for the better, with availability a prime design criterion. Failure avoidance and management, business continuity and data center resiliency have become a discipline, informed by proven approaches and supported by real-time data and a vast array of tools and systems.
But there is a paradox: The very success of IT, and of remotely delivered services, has created a critical dependency on IT in almost every business and for almost every business process. This dependency has radically increased in recent years. There are now more outages, and many of them have a more immediate, wider and bigger impact than in the past.
A particular issue that has affected many high-profile organizations, especially in industries such as air transport, finance and retail, is “asymmetric criticality” or “creeping criticality.” This refers to a situation in which the infrastructure and processes have not been upgraded or updated to reflect the growing criticality of the applications or business processes they support. Some of the infrastructure has a 15-year life cycle, a timeframe out of sync with the far faster pace of innovation and change in the IT market.
While the level of dependency on IT is growing, another big set of changes is still only partway through: the move to cloud and distributed IT architectures (which may or may not involve the public cloud). Cloud and distributed applications enable the move, in part or whole, to a more distributed approach to resiliency. This approach involves replicating data across availability zones (regional clusters of three or more data centers) and using a variety of software tools and approaches, such as distributed databases, decentralized traffic and workload management, data replication and disaster recovery as a service.
These approaches can be highly effective but bring two challenges. First are complexity and cost — these architectures are difficult to set up, manage and configure, even for a customer with no direct responsibility for the infrastructure (Uptime Institute data suggests that difficulties with IT and software contribute to ever more outages). And second, for most customers, is a loss of control, visibility and accountability. This loss of visibility is now troubling regulators, especially in the financial services sector, which now plan to exercise more oversight in the United States (US), Europe, the United Kingdom (UK) and elsewhere.
Will outages get worse?
Are outages becoming more common or more damaging? The answer depends on the exact phrasing of the question: neither the number nor the severity of outages is increasing as a proportion of the level of IT services being deployed. In fact, reliability and availability are probably increasing, albeit perhaps not significantly.
But the absolute number of outages is clearly increasing. In both our 2018 and 2019 global annual surveys, half of respondents (almost exactly 50%) said their organization had a serious data center or IT outage in the past three years, and it is known that the number of data centers has risen significantly during this time. Our data also shows the impact of these outages is serious or severe in almost 20% of cases, with many industry sectors, including public cloud and colocation, suffering problems.
What next?
The industry is now at an inflection point; whatever the overall rate of outages, the impact of outages at all levels has become more public, has more consequential effects, and is therefore more costly. This trend will continue for several years, as networks, IT and cloud services take time to mature and evolve to meet the heavy availability demands put upon them. More high-profile outages can be expected, and more sectors and governments will start examining the nature of critical infrastructure.
This has already started in earnest: In the UK, the Bank of England is investigating large banks’ reliance on cloud as part of a broader risk-reduction initiative for financial digital services. The European Banking Authority specifically states that an outsourcer/cloud operator must allow site inspections of data centers. And in the US, the Federal Reserve has conducted a formal examination of at least one Amazon Web Services (AWS) data center, in Virginia, with a focus on its infrastructure resiliency and backup systems. More site visits are expected.
Authorities in the Netherlands, Sweden and the US have also been examining the resiliency of 911 services after a series of failures. And in the US, the Government Accountability Office published an analysis to determine what could be done about the impact and frequency of IT outages at airlines. Meanwhile, data centers themselves will continue to be the most resilient and mature component (and with Uptime Institute certification, can be shown to be designed and operated for resiliency). There are very few signs that any sector of the market (enterprise, colocation or cloud) plans on downgrading physical infrastructure redundancy.
As a result of the high impact of outages, a much greater focus on resiliency can be expected, with best practices and management, investment, technical architectures, transparency and reporting, and legal responsibility all under discussion.
The full report Ten data center industry trends in 2020 is available to members of the Uptime Institute Network here.