Surveillance Capitalism and DCIM

In her book “The Age of Surveillance Capitalism,” the Harvard scholar Shoshana Zuboff describes how some software and service providers have been collecting vast amounts of data, with the goal of tracking, anticipating, shaping and even controlling the behavior of individuals. She sees it as a threat to individual freedom, to business and to democracy.

Zuboff outlines the actions, strategies and excesses of Facebook, Google and Microsoft in some detail. Much of this is well-known, and many legislators have been grappling with how they might limit the activities of some of these powerful companies. But the intense level of this surveillance extends far beyond these giants to many other suppliers, serving many markets. The emergence of the internet of things (IoT) accelerates the process dramatically.

Zuboff describes, for example, how a mattress supplier uses sensors and apps to collect data on sleepers’ habits, even after they have opted out; how a doll listens to and analyzes snippets of children’s conversations; and, nearer to home for many businesses, how Google’s home automation system Nest is able to harvest and exploit data about users’ behavior (and anticipated behavior) from their use of power, heating and cooling. Laws such as Europe’s General Data Protection Regulation offer theoretical protection but are mostly worked around by fine print: Privacy policies, Zuboff says, should be renamed surveillance policies.

All this is made possible, of course, because of ubiquitous connected devices; automation; large-scale, low-cost compute and storage; and data centers. And there is an irony — because many data center operators are themselves concerned at how software, service and equipment suppliers are collecting (or hope to collect) vast amounts of data about the operation of their equipment and their facilities. At one recent Uptime Institute customer roundtable with heavy financial services representation, some attendees strongly expressed the view that suppliers should (and would) not be allowed to collect and keep data regarding their data center’s performance. Others, meanwhile, see the suppliers’ request to collect data, and leverage the insights from that data, as benign and valuable, leading to better availability. If the supplier benefits in the process, they say, it’s a fair trade.

Of all the data center technology suppliers, Schneider Electric has moved furthest and fastest on this front. Its EcoStruxure for IT service is a cloud-based data center infrastructure management (DCIM) product (known as data center management as a service, or DMaaS) that pulls data from its customers’ many thousands of devices, sensors and monitoring systems and pools it into data lakes for analysis. (By using the service, customers effectively agree to share their anonymized data.) With the benefit of artificial intelligence (AI) and other big-data techniques, it is able to use this anonymized data to build performance models, reveal hitherto unseen patterns, make better products and identify optimal operational practices. Some of the insights are shared back with customers.

Schneider acknowledges that some potential customers have proven to be resistant and suspicious, primarily for security reasons (some prefer an air gap, with no direct internet connections for equipment). But it also says that take-up of its DCIM/DMaaS products has risen sharply since it began offering the low-cost, cloud-based monitoring services. Privacy concerns are not so great that they deter operators from taking advantage of a service they like.

Competitors are also wary. Some worry about competitive advantage, that a big DMaaS company will have the ability to see into a data center as surely as if its staff were standing in the aisles — indeed, somewhat better. And it is true: a supplier with good data and models could determine, with a fairly high degree of certainty, what will likely happen in a data center tomorrow and probably next year — when it might reach full capacity, when it might need more cooling, when equipment might fail, even when more staff are needed. That kind of insight is hugely valuable to the customer — and to any forewarned supplier.
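
To make the point concrete, the simplest version of such a prediction is a straight-line trend fitted to past peak load and extended until it meets the site's capacity. The sketch below is illustrative only: the function name, the monthly readings and the 800 kW capacity are all hypothetical, and a real supplier model would draw on far richer data than this.

```python
from typing import Optional, Sequence


def months_until_capacity(monthly_peak_kw: Sequence[float],
                          capacity_kw: float) -> Optional[float]:
    """Estimate months from the latest reading until load reaches capacity.

    Fits an ordinary least-squares line to the readings (one per month)
    and extrapolates it. Returns None if the trend is flat or falling.
    """
    n = len(monthly_peak_kw)
    if n < 2:
        raise ValueError("need at least two monthly readings")

    mean_x = (n - 1) / 2
    mean_y = sum(monthly_peak_kw) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(monthly_peak_kw))

    slope = sxy / sxx                      # kW of growth per month
    intercept = mean_y - slope * mean_x    # fitted load at month 0
    if slope <= 0:
        return None

    # Month index where the fitted line crosses capacity, measured
    # relative to the most recent reading (index n - 1).
    crossing = (capacity_kw - intercept) / slope
    return max(0.0, crossing - (n - 1))


if __name__ == "__main__":
    readings = [600, 620, 640, 660, 680]   # hypothetical peak load, kW
    print(months_until_capacity(readings, capacity_kw=800))
```

Even a model this crude gives the party that holds the data an estimate of when the customer will need to buy capacity, which is the competitive concern described above.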

To be fair, these competitive concerns aren’t exactly new: services companies have always had early access to equipment needs, for example, and remote monitoring and software as a service are now common in all industries. But the ability to pool data, divine new patterns, and predict and even shape decisions, almost without competition … this is a newer trend and arguably could stifle competition and create vendor lock-ins in a way not seen before. With the benefit of AI, a supplier may know when cooling capacity will need to be increased even before the customer has thought about it.

Uptime has discussed the privacy (surveillance?) issues with executives at several large suppliers. Unsurprisingly, those who are competitively at most risk are most concerned. For others, their biggest concern is simply that they don’t have enough data to do this effectively themselves.

Schneider, which has a big market share but is not, arguably, in a position of dominance, says that it addressed both privacy and security fears when it designed and launched EcoStruxure. It says that the way data is collected and used is fully under customer control. The (encrypted) machine data that is collected by the EcoStruxure DMaaS is seen only by a select number of trusted developers, all of whom are under nondisclosure agreements. Data is tagged to a particular customer via a unique identifier to ensure proper matching, but it is fully segregated from other customers’ data and anonymized when used to inform analytics. These insights from the anonymized models may be shared with all customers, but neither Schneider nor anyone else can identify particular sites.

Using Schneider’s services, customers can see their own data, and see it in context — they see how their data center or equipment compares with the aggregated pool of data, providing valuable insights. But still, outside of the small number of trusted developers, no one but the customer sees it — unless, that is, the customer appoints a reseller, advisor or Schneider to look at the data and give advice. At that point, the supplier does have an advantage, but the duty is on the customer to decide how to take that advice.

None of this seems to raise any major privacy flags, but it is not clear that Zuboff would be entirely satisfied. For example, it might be argued that the agreement between data center operators and their suppliers is similar to the common practice of the “surveillance capitalists.” These giant consumer-facing companies offer a superior service/product in exchange for a higher level of data access, which they can use as they like; anyone who denies access to the supplier is simply denied use of the product. Very few people ever deny Google or Apple access to their location, for example, because doing so will prevent many applications from working.

While DCIM is unusually wide in its scope and monitoring capability, this is not just about software tools. Increasingly, a lot of data center hardware, such as uninterruptible power supplies, power distribution units or blade servers, requires access to the host for updates and effective monitoring. Denying this permission reduces the functionality, probably to the point where it becomes impractical.

And this raises a secondary issue that is not well covered by most privacy laws: Who owns what data? It is clear that the value of a supplier’s AI services grows significantly with more customer data. The relationship is symbiotic, but some in the data center industry are questioning the balance. Who, they ask, benefits the most? Who should be paying whom?

The issue of permissions and anonymity can also get muddy. In theory, an accurate picture of a customer (a data center operator) could be built (probably by an AI-based system) using data from utilities, cooling equipment, network traffic and on-site monitoring systems — without the customer ever actually owning or controlling any of that data.

Speaking on regulation and technology at a recent Uptime meeting held in the United States, Alex Howard, a Washington-based technology regulation specialist, advised that customers could not be expected to track all this, and that more regulation is required. In the meantime, he advised businesses to take a vigilant stance.

Uptime’s advice, reflecting clients’ concerns, is that DMaaS provides a powerful and trusted service, but operators should always consider worst cases and how a dependency might be reversed. Big data provides many tempting opportunities, while anonymized data in theory can be breached and sometimes de-anonymized. Businesses, like governments, can change over time, and armed with powerful tools and data, they may cut corners or breach walls if the situation (or a new owner or government) demands it. This is now a reality in all business.

For specific recommendations on using DMaaS and related data center cloud services, see our report Very smart data centers: How artificial intelligence will power operational decisions.

Data centers without diesel generators: The groundwork is being laid…

In 2012, Microsoft announced that it planned to eliminate engine generators at its big data center campus in Quincy, Washington. Six years later the same group, with much the same aspirations, filed for permission to install 72 diesel generators, which have an expected life of at least a decade. This example illustrates clearly just how essential engine generators are to the operation of medium and large data centers. Few, very few, can even contemplate operating production environments without diesel generators.

Almost every operator and owner would like to eliminate generators and replace them with a more modern, cleaner technology. Diesel generators are dirty — they emit both carbon dioxide and particulates, which means regulation and operating restrictions; they are expensive to buy; they are idle most of the time; and they have an operational overhead in terms of testing, regulatory conformity and fuel management (i.e., quality, supply and storage logistics).

But to date, no other technology so effectively combines low operating costs, energy density, reliability, local control and, as long as fuel can be delivered, open-ended continuous power.

Is this about to change? Not wholly, immediately or dramatically — but yes, significantly. The motivation to eliminate generators is becoming ever stronger, especially at the largest operators (most have eliminated reportable carbon emissions from their grid supply, leaving generators to account for most of the rest). And the combination of newer technologies, such as fuel cells, lithium-ion (Li-ion) batteries and a lot of management software, is beginning to look much more effective. Even where generators are not eliminated entirely, we expect more projects from 2020 onward will involve less generator cover.

Four Areas of Activity

There are four areas of activity in terms of technology, research and deployment that could mean that in the future, in some situations, generators will play a reduced role — or no role at all.

Fuel cells and on-site continuous renewables

The opportunity for replacing generators with fuel cells has been intensively explored (and to a lesser extent, tried) for a decade. At least three suppliers — Bloom Energy (US), Doosan (South Korea) and SOLIDPower (Germany) — have some data center installations. Of these, Bloom’s success with Equinix is best known. Fuel cells are arguably the only technology, after generators, that can provide reliable, on-site, continuous power at scale.

The use of fuel cells for data centers is controversial and hotly debated. Some, including the city of Santa Clara in California, maintain that fuel cells, like generators, are not clean and green, because most use fossil fuel-based gas (or hydrogen, which usually requires fossil fuel-based energy to isolate). Others say that using grid-supplied or local storage of gas introduces risks to availability and safety.

These arguments can probably be overcome easily, given the reliability of gas and the fact that very few safety issues ever occur. But fuel cells have two other disadvantages: first, they cost more than generators on a dollar-per-kilowatt-hour ($/kWh) basis and have mostly proven economic only when supported by grants; and second, they require a continuous, steady load (depending on the fuel cell architecture). This causes design and cost complications.

The debate will continue but even so, fuel cells are being deployed: a planned data center campus in Connecticut (owner/operator currently confidential) will have 20 MW of Doosan fuel cells, Equinix is committing to more installations, and Uptime Institute is hearing of new plans elsewhere. The overriding reason is not cost or availability, but rather the ability to achieve a dramatic reduction in carbon dioxide and other emissions and to build architectures in which the equipment is not sitting idle.

The idea of on-site renewables as a primary source of at-scale energy has gained little traction. But Uptime Institute is seeing one trend gathering pace: the colocation of data centers with local energy sources such as hydropower (or, in theory, biogas). At least two projects are being considered in Europe. Such data centers would draw from two separate but local sources, providing a theoretical level of concurrent maintainability should one fail. Local energy storage using batteries, pumped storage and other technologies would provide additional security.

Edge data centers

Medium and large data centers have large power requirements and, in most cases, need a high level of availability. But this is not always the case with smaller data centers, perhaps below 500 kilowatts (kW), of which there are expected to be many, many more in the decade ahead. Such data centers may more easily duplicate their loads and data to similar data centers nearby, may participate in distributed recovery systems, and may, in any case, cause fewer problems if they suffer an outage.

But above all, these data centers can deploy batteries (or small fuel cells) to achieve a sufficient ride-through time while the network redeploys traffic and workloads. For example, a small shipping container-sized 500 kWh Li-ion battery could provide all uninterruptible power supply (UPS) functions, feed power back to the grid and provide several hours of power to a small data center (say, 250 kW) in the event of a grid outage. As the technology improves and prices drop, such deployments will become commonplace. Furthermore, when used alongside a small generator, these systems could provide power for extended periods.
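
The ride-through in this example is simple arithmetic: stored energy divided by load. The sketch below is a back-of-envelope check, not a sizing method. The function name and the usable_fraction derating are assumptions added for illustration. At the full 250 kW, the 500 kWh nameplate figure works out to about two hours, and a lighter load stretches that proportionally.

```python
def ride_through_hours(battery_kwh: float, load_kw: float,
                       usable_fraction: float = 1.0) -> float:
    """Hours a battery can carry a constant load.

    usable_fraction is a hypothetical derating for depth-of-discharge
    limits and conversion losses; 1.0 means full nameplate energy.
    """
    if load_kw <= 0:
        raise ValueError("load_kw must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return battery_kwh * usable_fraction / load_kw


if __name__ == "__main__":
    # The example from the text: 500 kWh battery, 250 kW data center.
    print(ride_through_hours(500, 250))        # full nameplate energy
    print(ride_through_hours(500, 250, 0.8))   # assumed 80% usable
    print(ride_through_hours(500, 125))        # same battery, half the load
```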

Cloud-based resiliency

When Microsoft, Equinix and others speak of reducing their reliance on generators, they are mostly referring to the extensive use of alternative power sources. But the holy grail for the hyperscale operators, and even smaller clusters of data centers, is to use availability zones, traffic switching, replication, load management and management software to rapidly re-configure if a data center loses power.

Such architectures are proving effective to a point, but they are expensive, complex and far from fail-safe. Even with full replication, the loss of an entire data center cannot but cause performance problems. For this reason, all the major operators continue to build data centers with concurrent maintainability and on-site power at the data center level.

But as software improves and Moore’s law continues to advance, will this change? Based on the state of the art in 2019 and the plans for new builds, the answer is categorically “not yet.” But in 2019, at least one major operator conducted tests to determine its resiliency using these technologies. The likely goal would not be to eliminate generators altogether, but to reduce the portion of the workload that would need generator cover.

Li-ion and smart energy

For the data center designer, one of the most significant advances of the past several years is the maturing, both technically and economically, of the Li-ion battery. From 2010 to 2018, the cost of Li-ion batteries (in $ per kWh) fell 85%, according to BloombergNEF (New Energy Finance). Most analysts expect prices to continue to fall steadily for the next five years, with large-scale manufacturing being the major reason. While this is no Moore’s law, it is creating an opportunity to introduce a new form of energy storage in new ways, including the replacement of some generators.
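
An 85% fall over the period can be restated as an average yearly rate, which makes the comparison with Moore’s law easier to judge. The sketch below assumes eight compounding years (2010 to 2018) and a constant rate of decline, which is a simplification; the function name is hypothetical.

```python
def implied_annual_decline(total_decline: float, years: int) -> float:
    """Constant yearly rate of decline that compounds to total_decline."""
    if not 0 <= total_decline < 1:
        raise ValueError("total_decline must be in [0, 1)")
    if years <= 0:
        raise ValueError("years must be positive")
    remaining = 1 - total_decline          # fraction of the price left
    return 1 - remaining ** (1 / years)


if __name__ == "__main__":
    rate = implied_annual_decline(0.85, 8)
    print(f"Implied average decline: {rate:.1%} per year")
```

Under these assumptions the result is a little over 21% a year: steep, but well short of a halving every two years or less.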

It is early days, but major operators, manufacturers and startups alike are all looking at how they can use Li-ion storage, combined with multiple forms of energy generation, to reduce their reliance on generators. Perhaps this should not be seen as the direct replacement of generators with Li-ion storage, since this is not likely to be economic for some time, but rather the use of Li-ion storage not just as a standard UPS, but more creatively and more actively. For example, combined with load shifting and closing down applications according to their criticality, UPS ride-throughs can be dramatically extended and generators will be turned on much later (or not at all). Some may even be eliminated. Trials and pilots in this area are likely to be initiated or publicized in 2020 or soon after.
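
The effect of closing down applications by criticality can be illustrated with a simple energy budget. The sketch below is a toy model under stated assumptions: constant loads, an ideal battery, and a hypothetical shedding schedule. The Workload class, the tier numbers and all figures are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Workload:
    name: str
    load_kw: float
    criticality: int  # 1 = most critical; higher tiers are shed earlier


def ride_through_with_shedding(battery_kwh: float,
                               workloads: List[Workload],
                               shed_at_hours: Dict[int, float]) -> float:
    """Hours until the battery is empty when tiers are shut down on schedule.

    shed_at_hours maps a criticality tier to the time (hours after the
    outage starts) at which that tier is switched off. Tiers not listed
    keep running until the battery is exhausted.
    """
    energy = battery_kwh
    now = 0.0
    active = list(workloads)

    # Walk through the shedding events in chronological order.
    for tier, when in sorted(shed_at_hours.items(), key=lambda kv: kv[1]):
        load = sum(w.load_kw for w in active)
        if load <= 0:
            return float("inf")
        needed = load * (when - now)
        if needed >= energy:               # battery empties before this event
            return now + energy / load
        energy -= needed
        now = when
        active = [w for w in active if w.criticality != tier]

    load = sum(w.load_kw for w in active)
    return float("inf") if load <= 0 else now + energy / load


if __name__ == "__main__":
    site = [
        Workload("payments", 100, 1),
        Workload("analytics", 100, 2),
        Workload("test and dev", 50, 3),
    ]
    print(ride_through_with_shedding(500, site, {}))               # no shedding
    print(ride_through_with_shedding(500, site, {3: 0.25, 2: 1.0}))
```

In this invented case, shedding the two lower tiers within the first hour nearly doubles the ride-through for the critical load, which is the kind of margin that lets a generator start be deferred.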

(Alternative technologies that could compete with lithium-ion batteries in the data center include sodium-ion batteries based on Prussian blue electrodes.)

The full report Ten data center industry trends in 2020 is available to members of the Uptime Institute Network here.

Top-10 Digital Infrastructure Trends for 2020

The digital infrastructure industry continues to grow and change at a striking pace. Across the world, a thriving community of investors, designers, owners and operators is grappling with many of the same issues: resiliency and risk, the impact of cloud, the move to the edge, rapid innovation, and unpredictable (although mostly upward) demand. What should stakeholders in this industry expect in 2020? Which innovations will make a difference, and which have been exaggerated? Is the challenge of distributed resiliency being solved, or is it getting worse? What are regulators likely to do?

Ten data center industry trends in 2020

The Top-10 trends the Uptime Institute Intelligence team has identified show an industry that is confidently expanding toward the edge, that is attractive to many new investors, and that is embracing new technologies and architectures — but one that is also running against some headwinds. Resiliency concerns, premature expectations about the impact of 5G, climate change, environmental impact and increasingly watchful regulators are among the hazards that must be successfully navigated.

So without further ado, here are the Top-10 Trends for 2020…

#1: Outages drive authorities and businesses to act

Big IT outages are occurring with growing regularity, many with severe consequences. Executives, industry authorities and governments alike are responding with more rules, calls for more transparency and a more formal approach to end-to-end, holistic resiliency.

#2: The internet tilts toward the edge

In the coming years, significant data will be generated by many more things and much more will be processed away from the core, especially in regional data centers. Many different types of data centers and networking approaches will be needed.

#3: Data center energy use goes up and up

Energy use by data centers and IT will continue to rise, putting pressure on energy infrastructure and raising questions about carbon emissions. The drivers for more energy use are simply too great to be offset by efficiency gains.

#4: Capital inflow boosts the data center market

Data centers are no longer a niche or exotic investment among mainstream institutional buyers, which are swarming to the sector. New types of capital investors, with deep pockets and long return timelines, could boost the sector overall.

#5: More data, more automated data centers

Many managers are wary of handing key decisions and operations to machines or outside programmers. But recent advances, including the broad adoption of data center infrastructure management systems and the introduction of artificial intelligence-driven cloud services, have made this much more likely. The case for more automation will become increasingly compelling.

#6: Data centers without generators: More pilots, more deployments

Most big data centers cannot contemplate operating without generators, but there is a strong drive to do so. Technological alternatives are improving, and the number of good use cases is proliferating. The next 24 months are likely to see more pilots and deployments.

#7: Pay-as-you-go model spreads to critical components

As enterprises continue to move from a focus on capital expenditures to operating expenditures, more critical infrastructure services and components — from backup energy and software to data center capacity — will be consumed on a pay-as-you-go, “as a service” basis.

#8: Micro data centers: An explosion in demand, in slow motion

The surge in demand for micro data centers will be real, and it will be strong — but it will take time to arrive in force. Many of the economic and technical drivers are not yet mature; and 5G, one of the key underlying catalysts, is in its infancy. Demand will grow faster from 2022.

#9: Staffing shortages are systemic and worsening

The data center sector’s staffing problem is systemic and long term, and employers will continue to struggle with talent shortages and growing recruitment costs. To solve the crisis, more investment will be needed from industry and educators.

#10: Climate change spurs data center regulations

Climate change awareness is growing, and attitudes are hardening. Although major industry players are acting, legislators, lobbyists and the public are pressing for more. More regulations are on the way, addressing energy efficiency, renewable energy and waste reduction.

So what should YOU do as part of your 2020 planning process? Think about your own needs in terms of business requirements and embrace the undeniable fact that the Digital Infrastructure world around you *IS* changing. In other words, start by crafting a TOP-DOWN understanding of what the business needs from IT, and then chart yourself a path to embrace the trends that are taking hold across the planet. As a general rule, if you are building and/or operating your computing function the same way you did 10 years ago, then you are probably sub-optimized, inefficient and incurring significantly higher costs and risks compared to those that are proactively embracing new ideas and approaches. As always, challenge yourself to compare your existing structures to what a ‘clean-slate’ approach might yield, and then strive to move forward.

Want the WHOLE report with all the DETAIL? You can get it here.

Lithium Ion Batteries for the data center. Are they ready for production yet?

We are often asked for our thoughts about the use of lithium-ion (Li-ion) batteries in data center uninterruptible power supply (UPS) systems. This is a relatively new, evolving technology, and there are still a lot of unknowns. In our business, the unknown makes us nervous and uneasy, as it should, because we are tasked with providing uninterrupted power and cooling to IT assets for the benefit of our clients. They trust us to help them understand risk and innovation and help them balance the two. That makes us less likely to embrace newer technologies until we know what the implications would be in a production, mission-critical environment. As a general rule, our experience shows that the least business risk is associated with proven technologies that are tried and tested and have a demonstrable track record and performance profile.

It’s true that not all the failure modes are fully understood where Li-ion is concerned; they’ll only be known when we see a larger installation base and actual operational performance. Tales of thermal runaway in Li-ion installations give justifiable concern, but any technology will fail if stressed beyond its limits. It’s worth considering the real-world conditions under which UPS systems are used.

The charge/discharge cycle on most UPS systems is not very demanding. UPS systems are not often required to transition to batteries, and even when they do, the time is usually short — worst case, 15 minutes — before power is restored either by the utility or the engine generator system. Under normal circumstances the batteries are on a float charge and when called upon to provide power, the amount of power they source is a fraction of the total design capacity. Therefore, as a general rule, UPS batteries are not stressed: it’s typically one discharge, then a recharge. In my experience, batteries handle that just fine — it’s the repeated discharge then recharge that causes issues.

Li-ion batteries monitor the cell condition in the battery itself, which helps users avoid problems. If you look at battery design, thermal runaway is usually caused by a charging system that malfunctions and does not reduce the charging current appropriately. Either that, or the battery itself is physically damaged.

Although thermal runaways are possible with any battery, Li-ion has a shorter track record in data centers than vented lead-acid (VLA) or valve-regulated lead-acid (VRLA) batteries. For that reason, I would not be excited about putting Li-ion batteries in the data hall but would instead keep them in purpose-built battery rooms until we fully understand the failure modes. (See Uptime Intelligence’s Note 8, which discusses the National Fire Protection Association’s proposed standard on space requirements and siting of energy storage technology.)

With that said, because UPS batteries are usually not stressed and as long as the batteries and recharging system are functioning properly, we don’t anticipate seeing the Li-ion failures that have been seen in other industries. While I don’t think there is enough data to know for certain how long the batteries will last in relation to VRLA batteries, I think there is enough history for data center owners and operators to start to consider the technology, as long as the advantages of Li-ion are weighed against the installation, maintenance and operations costs (or savings) to see if it makes sense in a given data center.

So what are the advantages of Li-ion (as compared to VLA or VRLA) batteries? First, the power density of Li-ion technology exceeds that of VLA or VRLA, so Li-ion batteries deliver more power from a smaller footprint. Second, Li-ion technology allows for more charge/discharge cycles without degrading the battery. All batteries degrade with repeated cycling, but VLA and VRLA batteries need to be replaced when they reach about 80% of original capacity because after that point, the remaining capacity falls off very quickly. In comparison, Li-ion batteries lose capacity gradually and predictably. Finally, suppliers claim that, despite their higher initial cost, Li-ion batteries have a lower total cost of ownership than VLA or VRLA.

Data center owners/operators who are considering replacing the existing battery string with Li-ion should first verify whether the installed UPS system will operate properly with Li-ion batteries, because the charging circuit and controls for Li-ion differ from those for VLA or VRLA. If the UPS system is compatible with Li-ion technology, the next step is to look at other factors (performance, siting, costs, etc.). Perform a cost vs. benefit analysis; if there’s a good case to use Li-ion, consider a small test installation to see if the technology performs as expected. This process should not only confirm whether the business case is supported but also help address the (very human and completely appropriate) skepticism of new technology.

In my opinion the current information is promising. Li-ion batteries are used in many industries more demanding than data centers, which is enough to indicate that Li-ion technology is not a passing fad. And manufacturers are working with different compositions of batteries to improve their performance and stability, so the technology is improving over time. But all factors must be weighed fully, as the cost of Li-ion batteries is significant, and not all of the claims can be substantiated with long-term data. The applicability of any technology must be evaluated on a case-by-case basis: what makes sense (cost and risk) for one data center may not for another.

Separation of Production vs. Non-Production IT

Non-production IT can hinder mission-critical operations

Separating production and non-production assets should be an operational requirement for most organizations. By definition, production assets support high-priority IT loads — servers that are critical to a business or business unit. In most organizations, IT will have sufficient discretion to place these assets where they have redundant power supplies, sufficient cooling and high levels of security. Other assets can be placed elsewhere, preserving the most important infrastructure for the most important loads. However, business requirements sometimes require IT organizations to operate both production (mission critical) and non-production environments in the same facility.

In these instances, facility managers must be careful to prevent the spread of non-production IT, such as email, human resources, telephone and building controls, into expensive mission-critical spaces. While non-production IT generally does not increase risk to mission-critical IT during normal operations, mixing production and non-production environments will reduce mission-critical capacity and can make it more difficult to shed load.

Keeping production and non-production IT separate:

  • Reduces the chance of human error in operations.
  • Preserves power, cooling and space capacity.
  • Simplifies the process of shedding load, if necessary.

Our report Planning for mission-critical IT in mixed-use facilities (available to Uptime Institute Network community members) discusses how operating a data center in a mixed-use facility can be advantageous to the organization and even to the IT department, but can also introduce significant risk. Establishing and enforcing budget and access policies is critical in these circumstances; the entire organization must understand and follow policies limiting access to the white space.

Organizations do not need to maintain separate budgets and facilities staff for production and non-production operations; they’re accustomed to managing both, and both are clearly IT functions. But the similarities between production and non-production IT do not mean that these assets should share circuits, or even racks. The presence of non-production IT gear in a mission-critical white space increases operational risk, and the less critical gear reduces the availability of mission-critical resources, such as space, cooling or power. The infrastructure required to meet the demands of mission-critical IT is expensive to build and operate and should not be used for less critical loads.

Separating the two classes of assets makes it easier for IT to manage assets and space and reduces demands on generator and uninterruptible power supply capacity, especially in the event of an incident. Similarly, keeping the assets separate makes it easier for operations to shed load, if necessary.

Limiting the use of mission-critical infrastructure to production workloads can help organizations defer expansion plans. In addition, it makes it easier to limit access to mission-critical spaces to qualified personnel, while still allowing owners of non-production gear to retain access to their equipment.

However, not all companies can completely separate production and non-production loads. Other solutions include designing certain areas within the data center strictly for noncritical loads and treating those spaces differently from the rest of the facility. This arrangement takes a lot of rigor to manage and maintain, especially when the two types of spaces are in close proximity. For example, non-production IT can utilize single-corded devices, but these should be fed by dedicated power distribution units (PDUs), with dual-corded loads also served by dedicated PDUs. But mixing those servers and PDUs in a shared space creates opportunities for human error when adding or moving servers.

For this reason and more, the greater the separation between production and non-production assets, the easier it becomes for IT to differentiate between them, allocate efforts and resources appropriately, and reduce operational risk.

The full report Planning for mission-critical IT in mixed-use facilities is available to members of the Uptime Institute Network.

99 Red Flags: The Availability Nines

One of the most widely cited metrics in the IT industry is availability, expressed as a number of nines: three nines for 99.9% availability (about 526 minutes of downtime per year), extending to six nines (99.9999%) or even, very rarely, seven nines. What this should mean in practice is shown in the table below:
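
The downtime implied by each level is simple arithmetic: the unavailable fraction multiplied by the length of the year. The following is a minimal sketch of that calculation, assuming a 365-day year; the function names are illustrative and not part of any Uptime Institute tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a 365-day year


def availability_label(nines: int) -> str:
    """Format a count of nines as a percentage string, e.g. 3 -> '99.9%'."""
    if nines < 3:
        raise ValueError("this sketch only handles three or more nines")
    return "99." + "9" * (nines - 2) + "%"


def downtime_minutes_per_year(nines: int) -> float:
    """Yearly downtime implied by an availability with the given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** -nines


if __name__ == "__main__":
    # Print one row per availability level, from three nines to seven nines.
    for nines in range(3, 8):
        minutes = downtime_minutes_per_year(nines)
        print(f"{availability_label(nines):>10}  {minutes:10.4f} min/yr  {minutes * 60:10.2f} s/yr")
```

On these assumptions, three nines allows roughly 525.6 minutes of downtime per year, five nines roughly 5.3 minutes, and six nines roughly 32 seconds.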

The metric is very widely used (more so in the context of IT equipment than facilities) and is commonly, if not routinely, cited in cloud services and colocation contracts. Its use is rarely questioned. Contrast this with the power usage effectiveness (PUE) metric, which has been the subject of a decade of industry-wide debate and is treated with such suspicion that many dismiss it out of hand.

So, let us start (or re-open) the debate: To what degree should customers of IT services, and particularly cloud services, pay attention to the availability metric (the availability promise, typically 99.99% or 99.999%) that almost all providers embed in their service level agreement (SLA)? (The SLA creates a baseline below which compensation will be paid or credits applied.)

Multiple colleagues at Uptime Institute seem to share a consensus: Treat this number, and any SLA that uses it, with extreme caution. The reasons are similar to the reasons why PUE is so maligned: the metric is very useful for certain specific purposes, but it is used casually, in many different ways, and without scientific rigor. It is also routinely misapplied (sometimes with a clear intention to distort or mislead).

Here are some of the things that these experts are concerned about. First, the “nines” number, unless clearly qualified, is neither a forward-looking metric (it doesn’t predict availability) nor a backward-looking one (it doesn’t say how a service has performed); usually, no time period is stated. Rather, it is an engineering calculation that combines the likely reliability of each individual component in the system, as established by earlier tests or manufacturer promises. (The method is rooted in longstanding methodologies for measuring the reliability of dedicated and well-understood hardware equipment, such as aircraft components or machine parts.)
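
A component-based calculation of this kind typically multiplies availabilities for parts that must all work (series) and combines failure probabilities for redundant parts (parallel). The sketch below illustrates that textbook approach; it assumes failures are independent, which real systems often violate, and the component values are invented for illustration rather than taken from any provider.

```python
from math import prod


def series_availability(availabilities):
    """All components must work: multiply the individual availabilities."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Any one component suffices: one minus the chance that all fail together."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Illustrative values only, not measurements of any real product.
    chain = [0.9999, 0.9999, 0.9999]
    print(f"three 99.99% parts in series: {series_availability(chain):.6f}")
    print(f"two 99% parts in parallel:    {parallel_availability([0.99, 0.99]):.6f}")
```

Even in this simple model, a chain of three 99.99% components falls below 99.99% overall, which is one reason a single headline figure needs to be read with its underlying assumptions.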

This is where the trouble starts. Complex systems based on multiple parts working together or in which use cases and conditions can change frequently are not so easily modeled in this way. This is especially true of software, where changes are frequent. To look backward with accuracy requires genuine, measured and stable data of all the parts working together; to look forward requires an understanding of what changes will be made, by whom, what the conditions will be when the changes are made, and with what impact. Add to this the fact that many failures are due to unpredictable configuration and/or operation and management failures, and the value of the “nines” metric becomes further diluted.

But it gets worse: the role of downtime due to maintenance is often not covered or is not clearly separated out. More importantly, the definition of downtime is either not made clear or it varies according to the service and the provider. There are often — and we stress the word “often” — situations in modern, hybrid services where a service has slowed to a nearly non-functional crawl, but the individual applications are all still considered to be “up.” Intermittent errors are even worse — the service can theoretically stop working for a few seconds a day at crucial times yet be considered well within published availability numbers.
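
The point about intermittent errors can be checked with a short calculation. The sketch below uses a hypothetical service that fails for five seconds every day (the figure is an assumption chosen for illustration) and computes the availability that would be reported.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def availability_with_daily_outage(outage_seconds_per_day: float) -> float:
    """Fraction of time a service is up if it fails for a fixed time every day."""
    if not 0 <= outage_seconds_per_day <= SECONDS_PER_DAY:
        raise ValueError("outage must be between 0 and 86,400 seconds per day")
    return 1 - outage_seconds_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical case: five seconds of failure per day, every day.
    availability = availability_with_daily_outage(5)
    print(f"availability: {availability:.5%}")
    print("meets a 99.99% promise:", availability >= 0.9999)
```

Under these assumptions the service still works out to about 99.994% available, comfortably inside a four-nines promise, even though it fails every single day.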

The providers do, of course, measure historical service availability: they need to show performance against existing SLAs. But the published or promised 99.9x availability figures in the SLAs of providers are only loosely based on underlying measurements or engineering calculations. In these contracts, the figure is set to maximize profit for the operator: it needs to be high enough to attract (or not scare away) customers, but low enough to ensure that minimal compensation is paid. Since the contracts are in any case weighted to ensure that the amounts paid out are only ever tiny, the incentive is to cite a high number.

To be fair, it is not always done this way. Some operators cite clear, performance-based availability over a period of time. But most don’t. Most availability promises in an SLA are market-driven and may change according to market conditions.

How does all this relate to the Uptime Institute Tier rating systems for the data center? Uptime Institute’s Chief Technology Officer Chris Brown explains that there is not a direct relationship between any number of “nines” and a Tier level. Early on, Uptime did publish a paper with some “expected” availability numbers for each Tier level to use as a discussion point, but this is no longer considered relevant. One reason is that it is possible to create a mathematical model to show a data center has a good level of availability (99.999%, for example), while still having multiple single points of failure in its design. Unless this is understood, a big failure is lying in wait. A secondary point is that measuring predicted outages using one aggregated figure might hide the impact of multiple small failures.
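
To illustrate how a model can report five nines while a single point of failure remains, the sketch below describes a hypothetical design as a set of stages, each with one or more redundant units. All stage names and availability values are invented for illustration, failures are assumed to be independent, and this is not the Tier methodology.

```python
from math import prod

# Hypothetical design: stage name -> availabilities of its redundant units.
DESIGN = {
    "utility feed and generator": [0.999, 0.999],
    "UPS": [0.9995, 0.9995],
    "shared switchboard": [0.999995],  # only one unit: a single point of failure
}


def stage_availability(units):
    """A stage is up if at least one of its redundant units is up."""
    return 1 - prod(1 - a for a in units)


def design_availability(design):
    """Every stage must be up, so multiply the stage availabilities."""
    return prod(stage_availability(units) for units in design.values())


def single_points_of_failure(design):
    """Names of the stages that have no redundancy at all."""
    return [name for name, units in design.items() if len(units) == 1]


if __name__ == "__main__":
    print(f"modeled availability: {design_availability(DESIGN):.7f}")
    print("single points of failure:", single_points_of_failure(DESIGN))
```

With these invented numbers the modeled availability is above 99.999%, yet the design still contains a stage whose failure would take down the whole facility. The aggregate figure alone does not reveal that.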

Brown (along with members of the Uptime Institute Resiliency Assessment team) believes it can be useful to use a recognized failure methodology, even creating a 99.9x figure. “I like availability number analysis to help determine which of two design choices to use. But I would not stake my career on them,” Brown says. “There is a difference between theory and the real world.” The Tier model assumes that in the real world, failures will occur, and maintenance will be needed.

Where does this leave cloud customers? In our research, we do find that the 99.9x figures give a good first-pass guide to the expected availability of a service. Spanner, Google’s highly resilient distributed database, for example, has an SLA based on 99.999% availability, assuming it is configured correctly. By comparison, most database services cite figures of 99.95% or 99.99%. And some SLAs carry a higher availability promise if the right level of replication and independent network pathways is deployed.

It is very clear that the industry needs to have a debate about establishing — or re-establishing — a reliable way of reporting true, engineering-based availability numbers. This may come, but not very soon. In the meantime, customers should be cautious — skeptical, even. They should expect failures and model their own likely availability using applicable tools and services to monitor cloud services.
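
One way a customer might begin to measure availability for themselves is to probe a service at regular intervals and apply their own definition of “up.” The sketch below is a minimal illustration: the Probe record, the latency threshold and the sample data are all assumptions, and a real deployment would rely on an established monitoring tool rather than this code.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One synthetic check against a service (hypothetical record format)."""
    ok: bool            # did the request succeed?
    latency_ms: float   # how long the response took


def observed_availability(probes, max_latency_ms=2000.0):
    """Share of probes that succeeded AND answered within the latency limit.

    Counting slow responses as failures reflects the customer's own
    definition of downtime, which may be stricter than the provider's.
    """
    if not probes:
        raise ValueError("at least one probe is required")
    good = sum(1 for p in probes if p.ok and p.latency_ms <= max_latency_ms)
    return good / len(probes)


if __name__ == "__main__":
    # Invented sample: eight healthy probes, one very slow, one failed.
    sample = [Probe(True, 120.0)] * 8 + [Probe(True, 5000.0), Probe(False, 0.0)]
    print(f"observed availability: {observed_availability(sample):.1%}")
```

The threshold is the key design choice here: setting max_latency_ms lets the customer treat a service that has slowed to a crawl as unavailable, even if the provider still counts it as up.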