Fastly outage underscores slow creep of digital services risk

A recent outage at content delivery network Fastly took down thousands of websites in several countries, including big names such as Amazon, Twitter and Spotify, for nearly an hour. It is the latest large outage to highlight the downside of a key trend in digital infrastructure: growing dependency on digital service providers can undermine infrastructure resilience and business continuity.

The conundrum here is a double standard. While many companies expend substantial money and time on hardening and testing their own data center and IT infrastructure against disruptions, they increasingly rely on cloud service providers whose architectural details and operational practices are largely hidden from them.

Some confusion stems from high expectations and a high level of trust in the resiliency of the highly distributed topologies of cloud and network technologies — often amounting to a leap of faith. Large providers routinely quote five nines or more of availability, or imply that system-wide failure is nearly impossible — but this is clearly not the case. While many deliver on these promises most of the time, outages still happen. When they do, the impact is often broad and severe.
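To put such claims in perspective, availability percentages translate directly into a downtime budget. The short Python sketch below (an illustration, not any provider’s published method) converts “nines” into the minutes of downtime permitted per year; at five nines, a single hour-long outage consumes more than a decade’s allowance.

```python
# Illustrative only: convert an availability figure into the downtime it permits.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    print(f"{label}: {downtime_budget_minutes(availability):7.1f} min/year")
# five nines allows about 5.3 minutes a year; a one-hour outage
# uses up more than 11 years' worth of that budget.
```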

Fastly is far from being a lone culprit. In our report Annual outage analysis 2021: The causes and impacts of data center outages, Uptime Institute counts over 80 major outages at cloud providers in the last five years (not including other digital services and telecommunication providers). The Fastly outage serves as a reminder that cloud- and network-based services are not invulnerable to service interruptions — in fact, they sometimes prove to be surprisingly fragile, susceptible to tiny errors.

Fastly’s own initial postmortem notes that the global outage was triggered by a legitimate, valid configuration change made by a single customer. That change activated a latent software bug, with an unexpected effect on the wider system, ultimately taking down dozens of other customers’ services. Fastly said it did not know how the bug made it into production. In some ways, the Fastly outage is similar to an Amazon Web Services (AWS) incident a few years ago, when a single storage configuration error made by an administrator brought some AWS services to a near halt on the US East Coast, causing major disruption.

Major outages at large digital service providers point to a need for greater transparency of their operational and architectural resiliency, so customers can better assess their own risk exposure. Research from Uptime Institute Intelligence shows that only a small proportion of businesses — fewer than one in five in our study — thought they had sufficient understanding of the resilience of their cloud providers to trust them to run their mission-critical workloads (see Figure 1). Nearly a quarter said they would likely use public cloud for such workloads if they gained greater visibility into the cloud providers’ resiliency design and procedures.

Figure 1. More would use public cloud if providers gave better visibility into options and operations

Furthermore, Uptime Institute’s surveys have shown that most businesses manage a rich mix of digital infrastructures and services, spanning their own private (on-premises) data centers, leased facilities, hosted capacity, and various public cloud-powered services. This comes at the cost of added complexity, which generates risks — both known and unknown — if not managed with full visibility. In a recent survey, a strong majority of those with a view on how their services’ resilience has changed as a result of adopting hybrid environments reported an improvement. Still, one in seven reported a decrease in overall resilience.

The bottom line is that without the ability to verify availability and reliability claims from cloud and other IT service providers, and without proactively mitigating the challenges that hybrid architectures pose, enterprises may be taking on added risks to their business continuity and data security. Fastly’s customers have received a stark reminder of this growing reality at their own expense.

Green tariff renewable energy purchases

Until recently, power purchase agreements (PPAs) and unbundled renewable energy certificates (RECs) were the primary means for data center operators or managers to procure renewable electricity and RECs for their operations. Many companies are not comfortable with the eight- to 20-year term and financial risks of a PPA. Unbundled RECs are an ongoing expense and do not necessarily help to increase the amount of renewable power generated.

As the supply of renewably generated electricity has grown across markets, utilities and energy retailers now offer sophisticated retail renewable energy supply contracts, referred to as green tariff contracts, which physically deliver firm (24/7 reliable) renewable power to data center facilities. The retailer creates a firm power product by combining renewable power, brown power (fossil fuel and nuclear) and RECs from a defined set of generation projects within the grid region serving the facility.

The base of the contract is committed renewable generation capacity from one or more renewable energy projects, which delivers the megawatt-hours (MWh) of electricity and the associated RECs. Brown power is purchased from the spot market to fill supply gaps, ensuring 24/7 supply at the meter. All the power purchased under the contract is physically supplied to the data center operator’s facility and matched with RECs.
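The mechanics can be illustrated with a minimal sketch (all hourly figures below are hypothetical; a real contract settles over many more terms): committed renewable output covers the load where it can, spot purchases fill the hourly gaps, and RECs are retired to match the brown MWh.

```python
# Hypothetical one-day illustration of a firm green tariff supply:
# committed renewable output + spot-market ("brown") fill + matching RECs.
hourly_demand_mwh = [10.0] * 24                       # flat 10 MW data center load
hourly_renewable_mwh = [12, 11, 9, 7, 5, 4, 6, 8,     # committed wind/solar output,
                        10, 12, 13, 14, 14, 13, 12,   # varying hour to hour
                        10, 8, 6, 5, 4, 5, 7, 9, 11]

brown_fill = [max(d - r, 0.0) for d, r in zip(hourly_demand_mwh, hourly_renewable_mwh)]
surplus = [max(r - d, 0.0) for d, r in zip(hourly_demand_mwh, hourly_renewable_mwh)]

brown_total = sum(brown_fill)        # MWh bought on the spot market
recs_to_retire = brown_total         # RECs retired to match the brown MWh

print(f"Demand: {sum(hourly_demand_mwh):.0f} MWh | "
      f"renewable delivered: {sum(hourly_demand_mwh) - brown_total:.0f} MWh | "
      f"spot fill: {brown_total:.0f} MWh | "
      f"surplus renewable: {sum(surplus):.0f} MWh | "
      f"RECs to retire: {recs_to_retire:.0f}")
```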



In contrast to PPAs, these contracts have typical terms of four to eight years; charge a retail price inclusive of transmission and distribution costs and fees; and physically deliver 24/7 reliable power to the consuming facility. Perhaps most importantly, the financial and supply management risks inherent in a PPA are carried by the retailer, relieving the data center operator or manager of the responsibility of managing a long-term energy supply contract. In exchange for carrying these risks, the retailer receives a 1% to 5% premium over a standard grid power supply contract. Paying a fixed retail price for the life of the contract brings budget certainty to the data center operator, and a four- to eight-year term better fits the business planning horizon of most data center operators. It eliminates the uncertainty of estimating the location, size and energy demand of a data center operations portfolio over the full term of a 20- to 25-year PPA.

The use of green tariff contracts often does not enable the purchaser to claim additionality for the renewable electricity purchased, which may be important to some. Many green tariff contracts procure energy from older hydro facilities (in the US and France), from generation facilities coming off subsidy (in the US and Germany), or from operating generators that sell into the spot market and want a firm, reliable price for their output over a defined period (in the US, EU and China). These situations do not satisfy the additionality criteria. But where the contract procures renewably generated electricity through a new PPA or a repowered renewable energy facility, additionality may be claimed.

But most importantly, green tariff contracts do allow data center operators to claim that they are powering their facilities with a quantified number of MWh of renewably generated power, and that they are matching their brown power purchases with RECs from renewable generation in their grid region. A green tariff purchase enables the data center operator, the energy retailer and the grid region authority to develop experience in deploying renewable generation assets to directly power a company’s operations.

This approach is arguably more valuable and responsible than buying renewably generated power under a virtual power purchase agreement (VPPA), where there is no connection between energy generation and consumption. In the case of a VPPA, the dispatch of the generation into the market can increase price volatility and has the potential to destabilize grid reliability over time. This has been demonstrated by recent market events in Germany and in the CAISO (California Independent System Operator), ERCOT (Electric Reliability Council of Texas) and SPP (Southwest Power Pool) grid regions in the US.

Green tariffs encourage the addition of more renewable power in the grid region. Requests for green tariff contracts signal to the energy market that data center operators want more renewable energy procurement options for the power they consume — and the market has responded, as demonstrated by the increase in these offerings in many electricity markets over the past five years.

Uptime Institute believes that a green tariff contract offers many data center operators a business-friendly way to procure renewably generated electricity. But like PPAs, green tariff electricity contracts are complex vehicles. Data center operators need to develop a basic understanding of the market to appreciate the many issues that must be settled to finalize a contract. It may be advisable to work with a third party with trading and market experience to identify and minimize any price and supply risks in the contract.

Pandemic has driven up data center costs

As the pandemic began to make an impact in early 2020, it became clear that data center operators were going to have to invest more if they were to provide the services on which their customers were increasingly reliant. Short-term needs included protective equipment, deep cleaning and, it seemed likely, more spending to support extended shifts and more support staff.

It has been less clear whether the pandemic also triggered a more substantial wave of investment in automation and monitoring in the data center. In an Uptime Institute survey in July 2020, 90% of operators said they would increase their use of remote monitoring as a result of the pandemic, and 73% said they would increase their use of automation.

Intentions, however, do not always translate into action and investment. Suppliers of remote monitoring tools, software and automation have not, on the whole, reported a dramatic surge in adoption (at least, not yet).

The 2020 research also revealed that many operators expected to spend more on infrastructure and resiliency as a direct result of the pandemic. This was not wholly expected, given that media attention had focused more on increased cloud spending.

In Uptime Institute’s latest of several surveys on the impact of COVID-19, there is, however, some evidence that spending on infrastructure, monitoring and staff has increased as a result of the pandemic. Four in 10 (40%) operators said spending has risen as a result of the pandemic, and only around one in 20 (6%) said it has fallen. Most said spending rose by less than 20%, although a few outliers saw much bigger increases.

The chart below shows the top four areas that have contributed to increased spending during the pandemic.



The data suggests that while the pandemic may subside during 2021 and 2022, the spending increase is likely to be sustained. Spending on protective equipment and extra staff may fall back, but capital technology investments, whether in increased automation/monitoring or in site resiliency, may take years to peak, and would then require ongoing operational support. As a result, data centers should be more resilient in the years ahead, and a little less susceptible to problems with a critical component — humans. But operational costs are unlikely to fall.

Renewable energy and data centers: Buyer, be aware

Across the world, data center owners and managers are striving to buy more renewable energy and reduce their dependence on fossil fuels. The global internet giants and the largest colocation companies have led the way with huge green energy purchases.

But the impact of renewable energy on the grid operator’s economics, and on the critical issue of grid reliability, is not well understood by many observers and energy buyers. In some cases, the purchase and use of renewable energy can expose the buyer to financial risks and can threaten to destabilize the economics of the power supply chain. For this reason, Uptime Institute is advising operators to conduct a thorough analysis of the availability and economics of renewable energy as part of their energy buying strategy.

The problem stems from the volatile balance of supply and demand. The intermittent nature of wind and solar generation moves the electricity markets in unexpected ways, driving down wholesale prices as renewable generation capacity increases. These price movements can stress the business model of reliable generation, which is needed to ensure continuous operation of the grid.

An analysis of market conditions in grid regions with high penetration of renewably generated power offers important guideposts for data center operators evaluating renewable energy purchases and the reliability of their electricity supplies.

Generally, average wholesale power prices decrease as generation capacity — particularly intermittent renewable generation capacity — increases. Consider ERCOT (the Electric Reliability Council of Texas), the Texas grid operator recently in the news for its struggles with extreme weather. In that region, wind generation capacity more than doubled from 2013 to 2020. Against this increase in capacity, the average spot market price of electricity dropped from approximately $25/MWh (megawatt-hour) in 2016 to $15/MWh in 2020. (Solar generation exhibits the same negative correlation between installed capacity and wholesale prices.)

This reduction in wholesale prices in turn reduces the revenue stream of power purchase agreements (PPAs) — an instrument used by many data center operators. In a PPA, the purchaser takes on the financial risk of power generation by guaranteeing the generator a fixed price for each MWh generated over the contract term. If the spot market revenue for the generated power is less than the PPA revenue (based on the agreed contract price), the purchaser pays the difference and takes a loss.
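The settlement arithmetic behind such a loss is straightforward. A minimal sketch, using the average spot prices cited above; the strike price and project output are invented for illustration:

```python
# Hypothetical PPA settlement: the purchaser guarantees the generator
# a fixed strike price per MWh and settles against the spot market.
strike_price = 25.0       # $/MWh, assumed strike near the 2016 average spot price
spot_price = 15.0         # $/MWh, approximate 2020 average spot price (see text)
generated_mwh = 100_000   # hypothetical annual output of the contracted project

settlement = (spot_price - strike_price) * generated_mwh
print(f"Settlement to purchaser: ${settlement:,.0f}")
# -$1,000,000: with spot below strike, the purchaser pays the
# difference and takes the loss described above.
```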

This does happen: Currently, PPAs signed in the Texas ERCOT region in 2016 are generating losses for their purchasers. Figure 1 illustrates how higher average available wind generation capacity (bars) reduced the daily average spot market prices at two settlement hubs (lines) during January 2021.

Figure 1: Changes in ERCOT electricity spot market prices with change in wind output

The loss of revenue resulting from falling wholesale prices also challenges the business models of fossil fuel and nuclear generators, making it difficult for them to meet operation, maintenance and financing costs while achieving profitability. Unless the grid has sufficient renewable energy storage and transmission, this threatens the reliable supply of power.

The impact of renewable energy on the reliability and economics of the grid

The impact of renewable energy on the reliability and economics of the grid does not stop with wholesale prices. Three further issues are curtailment, congestion and reliability.

Curtailment. As intermittent renewable capacity increases, output can at times exceed grid demand. To maintain grid stability, power must be sold to other grid regions or some portion of the generation supply must be taken offline (curtailed). In the California grid region CAISO (the California Independent System Operator), solar generation curtailments have grown from 400 GWh (gigawatt-hours) in 2017 to over 1,500 GWh in 2020, resulting in lost revenue for the curtailed facilities. Often, the need to keep reliable generators available to the grid requires curtailing the intermittent renewable generators.

Congestion. Congestion occurs as transmission capacity approaches or exceeds full utilization. To stay within capacity constraints, generation must be rebalanced to ensure supply and demand are matched in all local regions of the grid; this often results in increased curtailment of renewable generation. Transmission charges may also be raised on lines that are near capacity, sending an economic signal to balance capacity and demand.

Reliability. As intermittent renewable capacity increases, it reduces the percentage of time fossil fuel and nuclear generation sources are dispatched to the grid, reducing their revenues. And while the delivery of renewably generated energy has increased, it still varies significantly from hour to hour. In ERCOT in 2019, there were 63 hours when wind satisfied more than 50% of demand and 602 hours when it supplied less than 5%. Figure 2 provides an example of wind generation variability over two days in January 2019 in the ERCOT area. On January 8, the available wind generation capacity averaged roughly 8 gigawatts (GW), satisfying approximately 25% of demand. On January 14, output fell to around 2 GW, satisfying approximately 10% of demand.

Figure 2: Day-to-day and hour-to-hour variation in wind generation

To deal with this volatility, the system operator needs a capacity plan, using economic signals or other strategies to ensure there is sufficient capacity to meet demand under all possible mixes of renewable and conventional generation.
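The kind of screening behind the ERCOT figures cited above can be sketched in a few lines. Here synthetic hourly series stand in for real grid data; only the thresholds (more than 50% of demand, less than 5%) come from the text:

```python
# Synthetic illustration of wind-share-of-demand screening (not real ERCOT data).
import random

random.seed(1)
HOURS = 8760
demand_gw = [55 + 15 * random.random() for _ in range(HOURS)]    # hypothetical load
wind_gw = [max(0.0, random.gauss(8, 6)) for _ in range(HOURS)]   # volatile wind output

shares = [w / d for w, d in zip(wind_gw, demand_gw)]
print(f"Hours with wind > 50% of demand: {sum(s > 0.50 for s in shares)}")
print(f"Hours with wind < 5% of demand:  {sum(s < 0.05 for s in shares)}")
```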

As the proportion of renewably generated electricity in their supply grid grows, data center operators will need to carefully evaluate their renewable energy procurement decisions, the overall generation capacity available under different scenarios, and the robustness of the transmission system capacity. There is no guarantee that a grid that has proved reliable in the past will prove equally reliable in the years ahead.

While purchases of renewable energy are important, data center operators must also advocate for increased transmission capacity; improved, automated grid management; and robust, actionable capacity and reliability planning to ensure a reliable electricity supply with increased renewable content at their facility meter.

The insider threat: Social engineering is raising security risks

Uptime Institute Members say one of their most vexing security concerns is the insider threat — authorized staff, vendors or visitors acting with malicious intent.

In extreme examples, trusted individuals could power down servers and other equipment, damage network equipment, cut fiber paths, or steal data from servers or wipe the associated storage. Unfortunately, data centers cannot simply screen trusted individuals for bad intent.

Most data center operators conduct background checks. Most have policies for different levels of access. Some may insist that all visitors have security escorts, and many have policies that prevent tailgating (physically following an authorized person through a door to gain access). Many have policies to limit the use of portable memory devices in computer rooms to only authorized work; some destroy them once the work is complete, and some insist that only specific computers assigned to specific worktables can be used.

Yet vulnerabilities remain. The use of single-factor ID authentication, for example, can lead to the sharing of access cards and other unintended consequences. While some ID cards and badges have measures, such as encryption, to prevent them being copied, they can still be cloned using specialist devices. In some data centers, multifactor authentication is used to significantly harden ingress and egress controls.
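As an example of what a second factor can look like, time-based one-time passwords (TOTP, per RFC 6238) are one widely used option. A minimal sketch using only Python’s standard library (the shared secret shown is a placeholder):

```python
# Minimal RFC 6238-style TOTP: a short-lived code derived from a shared
# secret and the current time, verified by the access control system.
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, period: int = 30, digits: int = 6) -> str:
    key = base64.b32decode(secret_b32)
    counter = struct.pack(">Q", int(time.time()) // period)    # 30-second time step
    mac = hmac.new(key, counter, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                                    # dynamic truncation
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % 10 ** digits
    return f"{code:0{digits}d}"

print(totp("JBSWY3DPEHPK3PXP"))  # placeholder secret provisioned to a token or app
```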

The COVID-19 pandemic increased the risk for many data centers, at least temporarily. Some of the usual on-site staff were replaced by others, and routines changed. When this happens, security and vetting procedures can be more easily evaded.

Even before the pandemic, however, the insider threat had been growing — and changing. Trusted individuals are now more likely to unwittingly act in ways that lead to malicious outcomes (or to fail to prevent such outcomes), because psychological manipulation is increasingly used to trick authorized people into providing sensitive information. Social engineering — the use of deception to obtain unauthorized data or access — is now prolific and increasingly sophisticated.

Tactics can include a mix of digital and physical reconnaissance. The simplest approaches are often the most effective, such as manipulating people by phone or email and exploiting information available to the public (on the internet, for example).

Social engineering is a concern for all businesses, but particularly those with mission-critical infrastructure. A growing number of data center operators use automated security systems to detect anomalies in communications, such as email phishing campaigns targeting staff and visitors.

However, even routine communication can be exploited by hackers. For example, the host names in an email’s headers may reveal the internet protocol (IP) address of the computer that sent it, and with it information such as its geographic location. Further information about, say, a data center employee can be gathered online (typically from social media) and then used for social manipulation — such as posing as a trusted source (by spoofing caller IDs or creating unauthorized security certificates for a web domain, for example) to trick the employee into providing sensitive information. By surveilling employees, either physically or online, hackers can also obtain useful information at the places they visit, such as credit card details used at a restaurant (by exploiting a vulnerability in the restaurant’s payment system, for example). Hackers often gain trust by combining information gleaned from these digital trails with social engineering tactics.
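A small sketch of the header reconnaissance described above, using Python’s standard email module on a fabricated message (all hosts and addresses here are illustrative, drawn from reserved example ranges):

```python
# Fabricated example: extracting "Received" hops from an email's headers,
# which can leak host names and IP addresses of the sender's infrastructure.
from email import message_from_string

raw = """\
Received: from mail.example-dc.com ([203.0.113.42]) by mx.example.net
Received: from ws7.corp.example-dc.com ([198.51.100.9]) by mail.example-dc.com
From: ops@example-dc.com
Subject: Shift handover

(body)
"""

msg = message_from_string(raw)
for hop in msg.get_all("Received", []):
    print(hop)
# An attacker can geolocate the IP addresses and learn internal host-naming
# conventions: raw material for convincing social engineering.
```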

Reviews of policies and procedures, including separation of duties, are recommended, and numerous cybersecurity software and training tools can minimize the scope for social engineering and unauthorized access. Some data center operations use automated open-source intelligence (OSInt) software to scan social media and the wider internet for mentions of keywords, such as their organization’s name, associated with terror-related language. Some use automated cybersecurity tools to conduct open-source reconnaissance of exposed critical equipment and digital assets.
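A toy version of such keyword monitoring is sketched below; the organization name and watchlist are invented, and production OSInt platforms add feeds, entity resolution and alerting:

```python
# Toy OSInt-style scan: flag text where an organization's name appears
# near watchlist terms. All names and terms are hypothetical.
import re

ORG = "ExampleDC"
WATCHLIST = {"attack", "breach", "sabotage", "credentials"}

def flag_mentions(text: str, window: int = 80) -> list:
    """Return snippets where ORG appears within `window` chars of a watchlist term."""
    hits = []
    for match in re.finditer(re.escape(ORG), text, re.IGNORECASE):
        start = max(match.start() - window, 0)
        snippet = text[start:match.end() + window]
        if any(term in snippet.lower() for term in WATCHLIST):
            hits.append(snippet)
    return hits

post = "anyone selling leaked credentials for ExampleDC badge readers?"
print(flag_mentions(post))
```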

The insider threat is impossible to eliminate entirely — but it can be mitigated by adding layers of security.


The full report Data center security: Reassessing physical, human and digital risks is available to members of Uptime Institute.


Data center insecurity: Online exposure threatens critical systems

In early March 2021, a hacker group publicly exposed the username and password of an administrative account of a security camera vendor. The credentials enabled them to access 150,000 commercial security systems and, potentially, set up subsequent attacks on other critical equipment. A few weeks earlier, leaked credentials for the collaboration software TeamViewer gave hackers a way into a system controlling a city water plant in Florida (US). They remotely adjusted the sodium hydroxide levels to a dangerous level (the attack was detected, and harm avoided).

These are just some of the most recent examples of exploits where critical infrastructure was disrupted by remote access to IT systems, including some high-profile attacks at power plants.

The threat of cybersecurity breaches also applies to physical data centers, and it is growing: cloud computing, increased automation and remote monitoring have all broadened the attack surface. (See our recent report Data center security: Reassessing physical, human and digital risks.)

So, how widespread is the problem of insecure facility assets? Our research of vulnerable systems on the open internet suggests it is not uncommon.

For close to a decade, the website Shodan has been used by hackers, benevolent and malevolent alike, to search for targets. Rather than indexing webpages, Shodan crawls the internet for connected devices and industrial control systems (ICSs) that are exposed online.

Shodan and similar search engines (BinaryEdge, Censys and others) provide a compendium of port-scan data across the internet, locating open ports, which are a path to attack. Expert users identify interesting characteristics of certain systems and set out to gain as much access as they can. Automation tools make the process more efficient, speeding up and expanding what is possible in an exploit (by defeating login safeguards, for example).
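For illustration, the official shodan Python library (pip install shodan) exposes this kind of query programmatically. The API key and query below are placeholders, and such searches should only be run to defend assets you are authorized to assess:

```python
# Placeholder sketch of a Shodan API query (requires a valid API key).
import shodan

api = shodan.Shodan("YOUR_API_KEY")  # placeholder key

try:
    results = api.search('port:3389 city:"Ashburn"')  # exposed remote desktops
    print(f"Total matches: {results['total']}")
    for host in results["matches"][:5]:
        print(host["ip_str"], host.get("org", "n/a"))
except shodan.APIError as exc:
    print(f"Shodan API error: {exc}")
```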

In a recent demonstration of Shodan for Uptime Institute, the cybersecurity firm Phobos Group showed more than 98,000 ICSs exposed globally, including data center equipment and devices. Phobos quickly found the login screens of control systems from most major data center equipment providers. In Figure 1 (as in all figures), screenshots of aggregate search results are shown with specific details hidden to protect privacy.

The login process itself can be highly problematic. Installers or users sometimes fail to change the default credentials supplied by manufacturers, which can often be found online. During our demonstration, for example, Phobos used a default login to gain access to the control system for cooling units supplied by a widely used data center equipment vendor. A genuine intruder carrying out this exercise would have been able to change setpoint temperatures and alarms.

Users’ customized login credentials can sometimes be obtained from a data breach of one service and then used by a hacker to try to log into another service, a type of cyberattack known as credential stuffing. Lists of breached credentials have proliferated, and automated credential-stuffing tools have become more sophisticated, using bots to thwart traditional login protections. (Data breaches can happen without leaving any trace in corporate systems and can go undetected.)
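One practical countermeasure is to reject passwords that already appear in breach corpora. The sketch below queries the public Pwned Passwords range API, which uses k-anonymity (only the first five characters of the password’s SHA-1 hash leave the client):

```python
# Check a candidate password against known breach data via the
# Pwned Passwords k-anonymity range API (standard library only).
import hashlib
import urllib.request

def breach_count(password: str) -> int:
    """Return how many times the password appears in known breaches."""
    sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = sha1[:5], sha1[5:]
    url = f"https://api.pwnedpasswords.com/range/{prefix}"
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode()
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate == suffix:
            return int(count)
    return 0

if breach_count("Password123!") > 0:
    print("Reject: this password appears in known breach data")
```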

As cybersecurity exploits of critical infrastructure in recent years have shown, control system interfaces may be the primary targets, but access to them often comes through another system. Using Shodan, Phobos searched for exposed remote desktops, which can in turn provide access to multiple systems. This route is particularly troubling if a control system is accessible through a remote desktop and the user employs the same or similar passwords across systems.

Many remote desktops are exposed online. As Figure 2 shows, a recent Shodan search found over 86,700 remote desktops exposed in Ashburn, Virginia (US) alone (a place known as the world’s data center capital). The list includes a set of addresses for a global data center capacity provider (not shown).

Password reuse is one of the biggest security vulnerabilities humans introduce, but it can be minimized with training and tools, and with multifactor authentication where practicable. Installers and users should also be prevented from removing password protection controls (another vulnerability that Phobos demonstrated). Cybersecurity tools are available to continuously scan for assets exposed online and to run attack simulations; services used at some facilities include threat intelligence and penetration testing of IP addresses and infrastructure. Low-tech approaches, such as locked workstations and clean-desk policies, also help protect sensitive information.

Cybersecurity for data center control systems and other internet protocol (IP)-enabled assets is multilayered and requires a combination of ongoing strategies. The threat is real, and the likelihood of physical breaches, unauthorized access to information, and destruction of or tampering with data and services is higher than ever.


The full report Data center security: Reassessing physical, human and digital risks is available to members of the Uptime Institute community.