Mixed resiliency at the edge

Many analysts have forecast an explosion in demand for edge data centers. After a long, slow start, demand is beginning to build, with small, prefabricated and mostly remotely operated data centers ready to be deployed to support a wide array of applications.

There are still many uncertainties surrounding the edge market, ranging from business models to ownership, and from numbers to types of deployment. One open question is how much resiliency will be needed, and how it will be achieved.

While on-site infrastructure redundancy (site-level resiliency) remains the most common approach to achieving edge data center resiliency, Uptime Institute’s research shows increased interest in software- and network-based distributed resiliency. Nine of 10 edge data center owners and operators believe it will be very or somewhat commonly used in two to three years.

Distributed resiliency, which involves synchronous or asynchronous replication of data across multiple sites, has, until recently, mainly been used by large cloud and internet service providers. It is commonly deployed in cloud availability zones and combined with site-level resiliency at three or more connected physical data centers.

While site-level redundancy is primarily a defense against equipment faults at a single site, distributed resiliency can harden against major local events or sabotage that take out a full site. It can also reduce infrastructure costs (by reducing the need for site-level redundancy) and increase business agility through flexible placement and shifting of IT workloads. Edge data centers making use of distributed resiliency are connected and operated in a coordinated manner, as illustrated in Figure 1. The redundant element in this case is at least one full edge data center, not a component or system. When a disruption occurs, when capacity limits are reached, or when planned maintenance is required, some (or all) of the IT workloads in an edge data center are shifted to one or more other edge data centers.

diagram: Different resiliency approaches are used for edge data centers
Figure 1. Different resiliency approaches are used for edge data centers

Site-level resiliency relies on redundant capacity components (also including major equipment) for critical power, cooling, and network connectivity — the approach widely adopted by almost all data centers of any size. Edge data centers using only site-level resiliency tend to run their own IT workloads independently from other edge data centers.

Edge data centers making use of distributed resiliency are connected and operated in a coordinated manner, commonly using either a hierarchical topology or a mesh topology to deliver multisite resiliency.

None of these approaches or topologies are mutually exclusive, although distributed resiliency creates opportunities to reduce component redundancy at individual edge sites without risking service continuity.
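The coordinated workload shifting described above can be sketched in a few lines. This is a minimal illustration with hypothetical site names and capacity figures, not a description of any vendor's orchestration logic:

```python
# Hypothetical edge sites with remaining power capacity in kilowatts:
sites = {"edge-a": {"healthy": True, "free_kw": 5.0},
         "edge-b": {"healthy": True, "free_kw": 20.0},
         "edge-c": {"healthy": True, "free_kw": 8.0}}

def relocate(workloads_kw: dict, failed: str) -> dict:
    """Shift each workload off the failed site to the healthy site with the
    most spare capacity, mimicking coordinated distributed resiliency."""
    sites[failed]["healthy"] = False
    placement = {}
    for name, kw in workloads_kw.items():
        target = max((s for s, v in sites.items()
                      if v["healthy"] and v["free_kw"] >= kw),
                     key=lambda s: sites[s]["free_kw"])
        sites[target]["free_kw"] -= kw
        placement[name] = target
    return placement

# Simulate "edge-a" going dark with two workloads running on it:
placement = relocate({"cache": 6.0, "analytics": 4.0}, "edge-a")
print(placement)  # both workloads land on edge-b, the site with most headroom
```

Real implementations must also handle data replication, network reconvergence, and the case where no site has enough spare capacity, which is why distributed resiliency adds software and network complexity.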

Uptime Institute’s research suggests that organizations deploying edge data centers can benefit from the combined use of site-level resiliency and distributed resiliency.

Organizations deploying distributed resiliency should expect some challenges before the system works flawlessly, due to the increased software and network complexity. Because edge data centers are typically unstaffed, resilient remote monitoring and good network management/IT monitoring are essential for early detection of disruption and capacity limitations, regardless of the resiliency approach used.

Does the spread of direct liquid cooling make PUE less relevant?

The power usage effectiveness (PUE) metric is predominant thanks to its universal applicability and its simplicity: energy used by the entire data center, divided by energy used by the IT equipment. However, its simplicity could limit its future relevance, as techniques such as direct liquid cooling (DLC) profoundly change the profile of data center energy consumption.
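The metric's simplicity can be made concrete in a few lines; the energy figures below are invented for illustration:

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power usage effectiveness: total facility energy over IT energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return total_facility_kwh / it_kwh

# A hypothetical facility: 13.2 GWh/year overall, 11 GWh consumed by IT.
print(round(pue(13_200_000, 11_000_000), 2))  # 1.2
```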

PUE has been used beyond its original intention ever since it was developed by The Green Grid in 2007, including as a single defining efficiency metric and as a comparative benchmark between different data centers. Annualized PUE has become the global de facto standard for data center energy efficiency, even though it can hide many sins: PUE doesn’t account for important trade-offs in, for example, resiliency, water consumption and, perhaps most crucially, the efficiency of IT.

However, looming technical changes to facility infrastructure could, if extensively implemented, render PUE unsuitable for historical or contemporary benchmarking. One such change is the possibility of DLC entering mainstream adoption. While DLC technology has been an established yet niche technology for decades, some in the data center sector think it’s on the verge of being more widely used.

Among the drivers for DLC is the ongoing escalation of server processor power, which could mean new servers will increasingly be offered in both traditional and DLC configurations.

diagram: Few think air cooling will remain dominant beyond 10 years
Figure 1. Few think air cooling will remain dominant beyond 10 years

According to a recent Uptime survey, only one in four respondents think air cooling will remain dominant beyond the next decade in data centers larger than 1 megawatt (MW; see Figure 1).

Regardless of its form (full or partial immersion, or direct-to-chip cold plates), DLC reshapes the energy consumption profile of the facility and IT infrastructure beyond simply lowering the calculated PUE to near the absolute limit. Most DLC implementations achieve a partial PUE of 1.02 to 1.03, outperforming the most efficient air-cooling systems by low single-digit percentages. But PUE does not capture most of DLC’s energy gains, because DLC also lowers the power consumption of the IT equipment itself, raising questions about how to account for infrastructure efficiency.

In other words, DLC changes enough variables outside the scope of PUE that its application as an energy efficiency metric becomes unsuitable.

There are two major reasons why DLC PUEs are qualitatively different from PUEs of air-cooled infrastructure. One is that DLC systems do not require most of the IT system fans that move air through the chassis (cold-plate systems still need some fans in power supplies, and for low-power electronics). Because server fans are powered by the server power supply, their consumption counts as IT power. Suppliers have modeled fan power consumption extensively, and it is a non-trivial amount. Estimates typically range between 5% and 10% of total IT power depending on fan efficiency, size and speeds (supply air temperature can also be a factor).
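A worked example with assumed (not measured) figures shows why PUE understates the gain: the fan savings land on the IT side of the ratio, so total energy falls by more than the PUE change suggests:

```python
# Assumed figures: 100 kW of air-cooled IT load, of which ~8 kW is server fans.
it_air, overhead_air = 100.0, 30.0
pue_air = (it_air + overhead_air) / it_air   # 1.30

it_dlc = it_air - 8.0    # DLC removes most chassis fans, so IT power drops
overhead_dlc = 3.0       # an efficient liquid plant needs little overhead
pue_dlc = (it_dlc + overhead_dlc) / it_dlc   # ~1.03

# Total energy falls by 35 kW, yet only part of that shows up in PUE,
# because the 8 kW of fan savings moved the denominator, not the numerator:
print(round(pue_air, 2), round(pue_dlc, 2))
```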

The other, less-explored component of IT energy is semiconductor power loss due to temperature. Modern high-performance processors are prone to relatively high leakage currents, which flow even when the chip is not cycling (sleeping circuits with no clock signal). This is known as static power, as opposed to the dynamic (active) power consumed when a gate switches state to perform work. As the scale of integration grows with more advanced chip manufacturing technologies, so does the challenge of leakage. Despite chipmakers’ efforts to contain it without giving up too much performance or transistor density, static power remains a significant share of the total power equation for large compute chips tuned for performance, such as server processors.

Static power, unlike dynamic power, correlates strongly with temperature. Because DLC systems can maintain chip operating temperatures far below those of air-cooled ones (say, at 48 degrees Celsius/118.4 degrees Fahrenheit, as opposed to 72 degrees Celsius/161.6 degrees Fahrenheit for air-cooled systems), they can dramatically reduce static power. In a 2010 study on a supercomputer in Japan, Fujitsu estimated that water cooling lowered processor power by a little over 10% when chips were cooled from 85 degrees Celsius/185 degrees Fahrenheit to 30 degrees Celsius/86 degrees Fahrenheit. Static power has likely become a bigger problem since this study was conducted, suggesting that cooler chip operation has the potential to curb total IT power by several percentage points.
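A common rule of thumb (an assumption here, not vendor data) is that leakage power roughly doubles for every 10 degrees Celsius rise in silicon temperature; a toy model makes the scale of the effect tangible:

```python
def static_power_w(p_ref_w: float, t_ref_c: float, t_c: float,
                   doubling_step_c: float = 10.0) -> float:
    """Rule-of-thumb leakage model: static power roughly doubles every
    `doubling_step_c` degrees Celsius. An assumption, not vendor data."""
    return p_ref_w * 2 ** ((t_c - t_ref_c) / doubling_step_c)

# 30 W of hypothetical leakage at 72 C shrinks sharply at a DLC-cooled 48 C:
print(round(static_power_w(30, 72, 48), 1))  # ~5.7 W
```

Actual leakage behavior varies by process node and chip design, which is one reason the benefit is so hard to generalize.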

Without guidance from chipmakers on the static power profile of their processors, the only way to quantify this energy benefit is through experimentation. Worse still, the impact on total power will vary across servers using different chips, for multiple reasons (e.g., processor utilization, workload intensity, and semiconductor technology and manufacturing variations between chipmakers or chip generations). All this complicates the case for including static power in a new efficiency metric, or in the business case for DLC. In other words, the effect is known, but its magnitude is not.

There are other developments in infrastructure design that could undermine the relevance of PUE. For example, distributed, rack-integrated uninterruptible power supplies with small battery packs can become part of the IT infrastructure, rather than the purview of facilities management. If the promise of widespread DLC adoption materializes, PUE, in its current form, may be heading toward the end of its usefulness. The absence of a useful PUE metric would represent a discontinuity in historical trending. Moreover, it would hollow out competitive benchmarking: all DLC data centers will be very efficient, with immaterial differences in energy performance. If liquid-cooled servers gain more of a foothold (as many, though not all, in the data center sector expect), operators will likely need a new metric for energy efficiency, if not as a replacement for PUE then as a supplement. Tracking IT utilization, and a generally more granular approach to monitoring the power consumption of workloads, could quantify efficiency gains much better than any future version of PUE.

New ASHRAE guidelines challenge efficiency drive

Earlier in 2021, ASHRAE’s Technical Committee 9.9 published an update (the fifth edition) of its Thermal Guidelines for Data Processing Environments. The update recommends important changes to data center thermal operating envelopes: the presence of pollutants is now a factor, and there is a new class of IT equipment for high-density computing. The new advice may, in some cases, lead operators not only to alter operational practices but also to shift set points, a change that could affect both energy efficiency and contractual service level agreements (SLAs) with data center service providers.

Since the original release in 2004, ASHRAE’s Thermal Guidelines have been instrumental in setting cooling standards for data centers globally. The 9.9 committee collects input from a wide cross-section of the IT and data center industry to promote an evidence-based approach to climatic controls, one which helps operators better understand both risks and optimization opportunities. Historically, most changes to the guidelines pointed data center operators toward further relaxations of climatic set points (e.g., temperature, relative humidity, dew point), which also stimulated equipment makers to develop more efficient air economizer systems.

In the fifth edition, ASHRAE adds some major caveats to its thermal guidance. While the recommendations for relative humidity (RH) extend the range up to 70% (the previous cutoff was 60%), this is conditional on the data hall having low concentrations of pollutant gases. If the presence of corrosive gases is above the set thresholds, ASHRAE now recommends operators keep RH under 50% — below its previous recommended limit. To monitor, operators should place metal strips, known as “reactivity coupons,” in the data hall and measure corroded layer formation; the limit for silver is 200 ångström per month and for copper, 300 ångström per month.
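The decision logic described above can be sketched as a simple check; this is a simplified reading of the guidance, not official ASHRAE logic:

```python
# Coupon thresholds from the guidance: silver 200 angstrom/month, copper 300.
def recommended_rh_limit(silver_a_month: float, copper_a_month: float) -> int:
    """Return an upper relative-humidity limit (%) from coupon readings.
    A simplified reading of the fifth edition, not official ASHRAE logic."""
    low_pollutants = silver_a_month < 200 and copper_a_month < 300
    return 70 if low_pollutants else 50

print(recommended_rh_limit(120, 180))  # 70 (clean hall: extended envelope)
print(recommended_rh_limit(250, 180))  # 50 (corrosive gases present)
```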

ASHRAE bases its enhanced guidance on an experimental study on the effects of gaseous pollutants and humidity on electronics, performed between 2015 and 2018 with researchers from Syracuse University (US). The experiments found that the presence of chlorine and hydrogen sulfide accelerates copper corrosion under higher humidity conditions. Without chlorine, hydrogen sulfide or similarly strong catalysts, there was no significant corrosion up to 70% RH, even when other, less aggressive gaseous pollutants (such as ozone, nitrogen dioxide and sulfur dioxide) were present.

Because corrosion from chlorine and hydrogen sulfide at 50% RH is still above acceptable levels, ASHRAE suggests operators consider chemical filtration to decontaminate.

While the data ASHRAE uses is relatively new, its conclusions echo previous standards. Those acquainted with the environmental requirements of data storage systems may find the guidance familiar — data storage vendors have been following specifications set out in ANSI/ISA-71.04 since 1985 (last updated in 2013). Long after the era of tapes, storage drives (hard disks and solid state alike) remain the foremost victims of corrosion, as their low-temperature operational requirements mean increased moisture absorption and adsorption.

However, many data center operators do not routinely measure gaseous contaminant levels, and so do not monitor for corrosion. If strong catalysts are present but undetected, this might lead to higher than expected failure rates even if temperature and RH are within target ranges. Worse still, lowering supply air temperature in an attempt to counter failures might make them more likely. ASHRAE recommends operators consider a 50% RH limit if they don’t perform reactivity coupon measurements. Somewhat confusingly, it also makes an allowance for following specifications set out in its previous update (the fourth edition), which recommends a 60% RH limit.

Restricted envelope for high-density IT systems

Another major change in the latest update is the addition of a new class of IT equipment, separate from the pre-existing classes A1 through A4. The new class, H1, includes systems that tightly integrate a number of high-powered components (server processors, accelerators, memory chips and networking controllers). ASHRAE says these high-density systems need narrower air temperature bands to meet their cooling requirements: it recommends 18°C/64.4°F to 22°C/71.6°F (as opposed to 18°C/64.4°F to 27°C/80.6°F). The allowable envelope has become tighter as well, with an upper limit of 25°C/77°F for class H1, instead of 32°C/89.6°F (see Figure 1).

diagram: Thermal Guidelines for Data Processing Environments
Figure 1. 2021 recommended and allowable envelopes for ASHRAE class H1. The recommended envelope is for low levels of pollutants, verified by coupon measurements.
Source: Thermal Guidelines for Data Processing Environments, 5th Edition, ASHRAE

This is because, according to ASHRAE, there is simply not enough room in some dense systems for the higher performance heat sinks and fans that could keep components below temperature limits across the generic (classes A1 through A4) recommended envelope. ASHRAE does not stipulate what makes a system class H1, leaving it to the IT vendor to specify its products as such.
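Using the temperature bands quoted above, a simple check illustrates how a supply temperature acceptable for classes A1 through A4 can fall outside the H1 recommendation (humidity and other limits are ignored in this sketch):

```python
# Recommended supply-air bands quoted in the text (degrees Celsius);
# allowable upper limits differ too (25 C for H1 versus 32 C for A1-A4).
RECOMMENDED_C = {"A1-A4": (18.0, 27.0), "H1": (18.0, 22.0)}

def within_recommended(equipment_class: str, supply_temp_c: float) -> bool:
    """True if the supply-air temperature sits inside the recommended band."""
    low, high = RECOMMENDED_C[equipment_class]
    return low <= supply_temp_c <= high

print(within_recommended("A1-A4", 26.0))  # True
print(within_recommended("H1", 26.0))     # False
```

This is the crux of the operational problem: a 26°C economized data hall that is fully compliant for general-purpose IT is out of band the moment H1-class equipment is installed.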

There are some potentially far-reaching implications of these new envelopes. Operators have over the past decade built and equipped a large number of facilities based on ASHRAE’s previous guidance. Many of these relatively new data centers take advantage of the recommended temperature bands by using less mechanical refrigeration and more economization. In several locations — Dublin, London and Seattle, for example — it is even possible for operators to completely eliminate mechanical cooling yet stay within ASHRAE guidelines by marrying the use of evaporative and adiabatic air handlers with advanced air-flow design and operational discipline. The result is a major leap in energy efficiency and the ability to support more IT load from a substation.

Such optimized facilities will not typically lend themselves well to the new envelopes. That most of these data centers can support 15- to 20-kilowatt IT racks doesn’t help either, since H1 is a new equipment class requiring a lower maximum for temperature — regardless of the rack’s density. To maintain the energy efficiency of highly optimized new data center designs, dense IT may need to have its own dedicated area with independent cooling. Indeed, ASHRAE says that operators should separate H1 and other more restricted equipment into areas with their own controls and cooling equipment.

Uptime will be watching with interest how colocation providers, in particular, will handle this challenge, as their typical SLAs depend heavily on the ASHRAE thermal guidelines. What may be considered an oddity today may soon become common, given that semiconductor power keeps escalating with every generation. Facility operators may deploy direct liquid cooling for high-density IT as a way out of this bind.

Too big to fail? Facebook’s global outage

The bigger the outage, the greater the need for explanations and, most importantly, for taking steps to avoid a repeat.

By any standards, the outage that affected Facebook on Monday, October 4th, was big. For more than six hours, Facebook and its other businesses, including WhatsApp, Instagram and Oculus VR, disappeared from the internet, not just in a few regions or countries but globally. So many users and machines kept retrying these websites that they caused a slowdown of the internet and problems with some cellular networks.

While Facebook is large enough to ride through the immediate financial impact, that impact should not be dismissed. Market watchers estimate that the outage cost Facebook roughly $60 million in revenue over its more than six-hour duration. The company’s shares fell 4.9% on the day, which translated into more than $47 billion in lost market capitalization.

Facebook may recover those losses, but the bigger ramifications may be reputational and legal. Uptime Institute research shows that the level of outages from hyperscale operators is similar to that experienced by colocation companies and enterprises — despite their huge investments in distributed availability zones and global load and traffic management. In 2020, Uptime Institute recorded 21 cloud/internet giant outages, with associated financial and reputational damage. With antitrust, data privacy and, most recently, children’s mental health concerns swirling about Facebook, the company is unlikely to welcome further reputational and legal scrutiny.

What was the cause of Facebook’s outage? The company said an errant command was issued during planned network maintenance. An automated auditing tool would ordinarily catch such a command, but a bug in the tool allowed it through. The command led to configuration changes on the backbone routers that coordinate network traffic among Facebook’s data centers. This had a cascading effect that halted Facebook’s services.

Setting aside theories of deliberate sabotage, there is evidence that Facebook’s internet routes (Border Gateway Protocol, or BGP) were withdrawn by mistake as part of these configuration changes.

BGP is a mechanism for large internet routers to constantly exchange information about the possible routes for them to deliver network packets. BGP effectively provides very long lists of potential routing paths that are constantly updated. When Facebook stopped broadcasting its presence — something observed by sites that monitor and manage internet traffic — other networks could not find it.

One factor that exacerbated the outage is that Facebook has an atypical internet infrastructure design, specifically related to BGP and another three-letter acronym: DNS, the domain name system. While BGP functions as the internet’s routing map, the DNS serves as its address book. (The DNS translates human-friendly names for online resources into machine-friendly internet protocol addresses.)

Facebook has its own DNS registrar, which manages and broadcasts its domain names. Because of Facebook’s architecture, designed to improve flexibility and control, the registrar went offline when the BGP configuration error happened. (As an aside, this caused some domain tools to erroneously show that the Facebook.com domain was available for sale.) As a result, internet service providers and other networks simply could not find Facebook’s network.

How did this then cause a slowdown of the internet? Billions of systems, including mobile devices running a Facebook-owned application in the background, were constantly requesting new “coordinates” for these sites. These requests are ordinarily cached in servers located at the edge, but when the BGP routes disappeared, so did those caches. Requests were routed upstream to large internet servers in core data centers.

The situation was compounded by a self-reinforcing feedback loop, caused partly by application logic and partly by user behavior. Web applications will not accept a BGP routing error as an answer to a request, and so they retry, often aggressively. Users and their mobile devices running these applications in the background also won’t accept an error and will repeatedly reload the website or restart the application. The result was an increase of up to 40% in DNS request traffic, which slowed down other networks (and, therefore, increased latency and timeouts for other web applications). The increased traffic also reportedly led to issues with some cellular networks, including users being unable to make voice-over-IP phone calls.
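A toy model with assumed parameters (not Facebook's actual figures) shows how modest retry behavior can produce a surge of the reported magnitude:

```python
# Toy model of retry amplification; every parameter here is an assumption:
baseline_qps = 1_000_000   # normal DNS queries per second, hypothetical
affected_share = 0.10      # share of queries aimed at the vanished domains
retries_per_failure = 4    # aggressive client/app retry behavior

# Each failed query spawns several retries on top of the baseline load:
extra_qps = baseline_qps * affected_share * retries_per_failure
print(f"traffic up {extra_qps / baseline_qps:.0%}")  # traffic up 40%
```

The point of the model is that even a small affected share, multiplied by aggressive retries, quickly swamps upstream resolvers.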

Facebook’s outage was initially caused by routine network maintenance gone wrong, but the error was missed by an auditing tool and propagated via an automated system, which were likely both built by Facebook. The command error reportedly blocked remote administrators from reverting the configuration change. What’s more, the people with access to Facebook’s physical routers (in Facebook’s data centers) did not have access to the network/logical system. This suggests two things: the network maintenance auditing tool and process were inadequately tested, and there was a lack of specialized staff with network-system access physically inside Facebook’s data centers.

When the only people who can remedy a potential network maintenance problem rely on the network that is being worked on, it seems obvious that a contingency plan needs to be in place.

Facebook, which like other cloud/internet giants has rigorous processes for applying lessons learned, should be better protected next time. But Uptime Institute’s research shows there are no guarantees — cloud/internet giants are particularly vulnerable to network and software configuration errors, a function of their complexity and the interdependency of many data centers, zones, systems and separately managed networks. Ten of the 21 outages in 2020 that affected cloud/internet giants were caused by software/network errors. That these errors can cause traffic pileups that can then snarl completely unrelated applications globally will further concern all those who depend on publicly shared digital infrastructure – including the internet.

More detailed information about the causes and costs of outages is available in our report Annual outage analysis 2021: The causes and impacts of data center outages.

Vertiv’s DCIM ambitions wither on Trellis’ demise

Operators often say that data center infrastructure management (DCIM) software is a necessary evil. Modern facilities need centralized, real-time management and analytics, but DCIM is notoriously difficult to deploy. The software is also not easy to rip and replace — yet some operators unexpectedly will have to. Vertiv, a pioneer and leader in equipment monitoring and management, recently told customers it has discontinued its flagship DCIM platform Trellis. Support for existing contracts will end in 2023.

In addition to dismaying these customers, Vertiv’s decision to drop Trellis raises questions about the company’s overall strategy — and value proposition — for data center management software.

Vertiv’s software business has fallen far in the past decade. Its Aperture asset management software was the most widely used DCIM product as recently as 2015 (when the company was Emerson Network Power). Trellis launched with fanfare in 2012 and became one of the most expansive DCIM platforms, promising to integrate data center facility and IT equipment monitoring.

For years, Trellis was the centerpiece of Vertiv’s strategy toward more intelligent, more automated data centers. It was once positioned as a rival to today’s leading DCIM products, notably from Schneider Electric and Nlyte Software. At one point, IBM and HP were Trellis resellers.

Ultimately, however, Trellis was both overengineered and underwhelming. It was built on the Oracle Fusion application development system so that new functionality and software could be layered on top of Trellis. This was a benefit for customers active in the Oracle environment, but not for most others; the architecture was too sprawling and heavy. Vertiv says this was a primary reason for discontinuing the product: it was simply too big and complex.

Other factors were at play. Over several years, operators reported various problems and concerns with Trellis. For example, while Trellis’ asset management functionality was eventually strong, for years it lagged Aperture, the product it was designed to replace. (Vertiv discontinued Aperture in 2017, with support of existing contracts to end by 2023.) This left customers to choose between maintaining a discontinued product or migrating to a newer one with less functionality. Many ended up migrating to a competitor’s platform.

Vertiv’s overall share of the DCIM market has steadily shrunk. Trellis, Aperture, Data Center Planner, SiteScan: all are well-known Vertiv software products that are now discontinued. (There is no date yet for when support for existing SiteScan contracts will end.) Trellis, moreover, was the proof point for Vertiv’s software ambitions.

So where is Vertiv heading? The company is focusing on a few remaining software products that are solid but narrowly focused. It is investing in Environet, which it acquired in 2018 when it bought Geist Global, a relatively small vendor of rack power distribution and related sensors. At the time, Environet had a small, loyal customer base. Vertiv has since extended its power and thermal monitoring, as well as its facilities space management features. Yet the software lacks the more sophisticated predictive maintenance and, notably, the IT device management capabilities of Trellis and other DCIM suites.

Separately, Vertiv continues to develop its IT out-of-band management software, as well as its iCOM-S thermal management tools. Vertiv’s recent deal to acquire the electrical supplier E+I Engineering (for $1.2 billion) means it could potentially add more power monitoring features. 

Taken as a portfolio, Vertiv appears underinvested in software. This is not just when compared with some of its equipment rivals, but also in the context of the overall evolution of data center management.  

The future of data center management will increasingly be predictive, automated and remote, thanks to sophisticated DCIM software driven by artificial intelligence (AI) and other big-data approaches. New developments such as AI-driven DCIM as a service are helping move smaller facilities, including at the edge, toward this vision. At the same time, DCIM deployed on-premises in larger, mission-critical facilities is becoming increasingly AI-driven, predictive, and able to assimilate a variety of data to extend real-time alarming and closed-loop automation. Schneider and Eaton, for example, both offer cloud-delivered predictive DCIM. Several DCIM vendors are investing in more AI to boost the accuracy and remit of their analysis. And earlier this year, Schneider said it was expanding development of its long-standing and comprehensive on-premises DCIM platform, EcoStruxure for Data Centers. (Schneider has also quietly standardized its tools and processes to make it easier to migrate other vendors’ DCIM data into its software, among other things.)

While Vertiv has a long-held strategy to embed more software in its facilities equipment for monitoring and management, it has neither executed on nor clearly articulated broader software ambitions. As one of the largest equipment and services suppliers to data centers, Vertiv, like its peers, does not need to be a primary supplier of software to succeed. However, DCIM is increasingly one of the most critical management levers available to data center operators. Major infrastructure product and service suppliers often view DCIM as a strategic product line, in part because it should increase the “stickiness” of their core business. More importantly, the software’s analytics and control features can show how a supplier’s broader offerings fit into an intelligent infrastructure.

Startups brew new chemistries for fresh battery types

Through their public commitments to net-zero carbon emission targets, cloud providers have re-energized talks of a major redesign of critical power systems within the data center sector. The elimination of diesel fuel use is chief among the goals, but the desire for a leaner facility power infrastructure that closely tracks IT capacity needs is also attracting interest. New types of batteries, among other tools such as clean fuels, will likely prove instrumental in this endeavor.

Operators have already started displacing standard valve-regulated lead-acid (VRLA) banks with lithium-ion (Li-ion) batteries to refashion data center backup power. According to the upcoming 2021 Uptime Institute Global Data Center Survey report, nearly half of operators have adopted the technology for their centralized uninterruptible power supply (UPS) plant, up from about a quarter three years ago. Compared with VRLA, Li-ion batteries are smaller, easier to monitor and maintain, and require less stringent climatic controls. (See our post: Lithium-ion batteries for the data center: Are they ready for production yet? for more technical details.)

In addition, Li-ion batteries have a much longer service life, which offsets most of the higher initial cost of Li-ion packs. The change in chemistry also makes possible use cases that were not previously feasible with VRLA — for example, demand-response schemes in which operators can use their Li-ion UPS systems for peak load shaving when there is a price incentive from the electric utility.
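A back-of-envelope calculation, with every price and capacity figure assumed for illustration, shows how such a peak-shaving scheme could pay:

```python
# Back-of-envelope peak-shaving value; all figures here are assumptions:
battery_kwh = 500        # usable UPS battery energy reserved for shaving
peak_price = 0.30        # $/kWh during the utility's peak window
offpeak_price = 0.10     # $/kWh paid to recharge later
round_trip_eff = 0.9     # Li-ion round-trip efficiency

# Energy avoided at peak prices, minus the cost of recharging (with losses):
daily_saving = (battery_kwh * peak_price
                - (battery_kwh / round_trip_eff) * offpeak_price)
print(round(daily_saving, 2))  # ~94.44 dollars per shaving cycle
```

In practice, operators must also weigh battery cycle wear and the risk of a utility outage arriving with a partially depleted UPS.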

But these advantages are not without significant trade-offs. According to recent research by the US-based National Fire Protection Association, Li-ion batteries (regardless of their specific chemistries and construction) represent a higher fire risk than VRLA. Worse still, Li-ion fires are notoriously difficult to suppress, as the breakdown of cells produces combustible gases — a situation that creates the conditions for life-threatening explosions. Even though some chemistries, such as lithium iron phosphate, have a comparatively lower risk of fire (due to a higher ignition point and no release of oxygen), the possibility of thermal runaway persists.

Then there is the matter of sustainability. VRLA batteries are effectively fully recycled to produce new batteries, at no cost to the owner (owing to regulation around lead toxicity). That is not the case for large Li-ion battery packs, for which an industrial recycling base has yet to develop, since most Li-ion-powered products haven’t reached the end of their life. Repackaging and repurposing packs for less demanding applications seem a more likely route than breaking them down into materials, but this remains challenging. Furthermore, the supply chains for Li-ion production, along with their social and environmental impacts, remain opaque to buyers.

These issues have already given some data center operators and their contractors pause. But they have also created an opening in the market for other branches of battery chemistry. One promising family for high-power applications relies on sodium (Na) ions instead of lithium as the main charge carrier, while developments in iron-air batteries raise hopes of building economical and safe energy storage for very long backup times.

Na-ion cells pose no thermal runaway risk and offer much higher power-to-capacity ratios than Li-ion. Even better, they can repeatedly discharge their full energy at very high rates, in one to two minutes, then fully recharge in 10 to 20 minutes without overheating. Durability is also better than that of Li-ion cells.

A significant disadvantage, however, is the size of Na-ion batteries compared with state-of-the-art Li-ion (although they are still more compact than VRLA). Taken together, these characteristics make Na-ion batteries interesting for UPS applications with shorter runtime targets (e.g., up to 10 minutes). Perhaps more importantly, Na-ion appears safe for installation in data halls, where operators can use it in distributed UPS topologies that closely match power service levels to IT capacity requirements. The social and environmental impacts of sodium battery manufacturing also appear less concerning, as these batteries tend to use no rare earth elements or minerals tainted by the conditions of their extraction (although practical recyclability remains an open question).

Multiple startups have already attracted funding to commercialize sodium batteries for various use cases, both stationary and mobile (including industrial vehicles); examples are Natron Energy (US), Faradion (UK) and CATL (China). The fabrication process for Na-ion batteries shares many commonalities with Li-ion, making industrialization relatively straightforward: producers can tap into a large pool of toolmakers and manufacturing know-how. All the same, it will take another few years (and hundreds of millions of dollars) for Na-ion battery production to ramp up.

Meanwhile, US startup Form Energy has recently raised $200 million (bringing total life-to-date funding past $325 million) to commercialize an entirely different type of battery, one that aims to solve not the power but the energy capacity problem. The company says it has figured out how to make effective iron-air batteries that can store energy for long periods and discharge it slowly over many hours (the design target is 100 hours) when needed. Because of the inexpensive ingredients, Form claims that iron-air cells cost a small fraction (15%) of Li-ion per kilowatt-hour.

Form Energy is now targeting 2025 for the first battery deployments. There seems to be a catch, however: space. According to the firm, a Form Energy system that can hold a 3-megawatt load for four days occupies an acre of land. Even if this can be reduced by lowering runtime to, say, 24 hours, it will still require several times the footprint of a diesel generator set and associated fuel storage. Nonetheless, if product development goes to plan, Form Energy may provide the data center sector with a new piece of its power technology puzzle.
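The arithmetic behind the footprint claim is easy to check using the figures in the text:

```python
# Checking the footprint claim with the figures quoted in the text:
load_mw = 3.0                      # sustained load
days = 4                           # claimed runtime on roughly one acre
energy_mwh = load_mw * days * 24   # energy stored on that acre

# Shrinking runtime to 24 hours scales the footprint roughly linearly:
acres_for_24h = 24 / (days * 24)

print(energy_mwh, acres_for_24h)   # 288.0 MWh; ~0.25 acre for 24 hours
```

Even a quarter-acre remains far larger than the footprint of a generator set with equivalent runtime, which is the trade-off the text describes.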

In upcoming reports, Uptime Institute Intelligence will explore developments in energy storage, power infrastructure, and operational methods that may make the stretch goal of a diesel-free data center achievable.