Disaster Recovery Lessons Learned from Superstorm Sandy

An Uptime Institute survey reveals that best practices and preparation pay dividends

In late October 2012, Superstorm Sandy tore through the Caribbean and up the east coast of the U.S., killing over a hundred, leaving millions without power and causing billions of dollars in damage. In the aftermath of the storm, Uptime Institute surveyed data center operators to gather information on how Sandy affected data center and IT operations.

The acute damage caused by Superstorm Sandy in the Northeast was thoroughly documented. But it is important to note that regional events have larger implications. The survey found that Superstorm Sandy caused damage and problems far from its path. For example, one site in the Midwest experienced construction delays as equipment and material deliveries were diverted to sites affected by the hurricane. Another business incurred similar delays at a leased facility that was being remodeled for its use. The timelines for both of these projects had to be adjusted.

The survey focuses primarily on the northeast corridor of the U.S. as the Greater New York City area took the brunt of the storm and suffered the most devastating losses. This area has a longstanding and rigorous data center operations culture. A total of 131 end users completed at least part of the survey, with 115 responding to all questions that applied. Many of the survey respondents were also Uptime Institute Network members, who could be expected to be highly familiar with industry best practices. These factors may have led survey respondents to be better prepared than a more general survey population.

The survey examined preparations, how facilities fared during the storm and the actions data center owners and operators took afterward to ensure availability and safeguard critical equipment. Of all respondents spread through North America, approximately one-third said they were affected by the storm in some way. The survey results show that natural disasters can bring unexpected risks, but also reveal planning and preemptive measures that can be applied in anticipation of a catastrophic event. The path traveled by Superstorm Sandy can be considered susceptible to hurricanes, yet any data center is vulnerable to man-made or natural disasters, no matter its location. And natural disasters are trending upward in frequency and magnitude.

In the aftermath of Sandy, the Office of the President of the United States and Department of Energy jointly asserted, “In 2012, the United States suffered eleven billion-dollar weather disasters–the second-most for any year on record, behind only 2011,” according to Economic Benefits Of Increasing Electric Grid Resilience to Weather Outages, a report prepared by the President’s Council of Economic Advisers and the U.S. Department of Energy’s Office of Electricity Delivery and Energy Reliability with assistance from the White House Office of Science and Technology.

In this context, the role of the data center owner, operator and designer is even more important: to identify the risks and mitigate them. Mitigation can often be achieved in many ways.

Questions

The survey focused on four main topics:

The impact of loss of a data center
Steps taken to prepare for the storm
The organization’s disaster recovery plan
Lessons learned

Select questions asked for a yes or no answer; many required respondents to answer in their own words. These narrative responses make up the bulk of the survey and provided key insight into the study results.

Impact on Organizational Computing

The majority of responses to the first question, “Describe the nature of how your organization’s computing needs were impacted,” centered on loss of utility power or limited engine generator runtime, usually because of lack of fuel or fuel supply problems.

Almost all the respondents in the path of the storm went on engine generators during the course of the storm, with a couple following industry best practice by going on engine generators before losing utility power. About three-quarters of the respondents who turned to engine generators successfully rode out the storm and the days after. The remainder turned to disaster recovery sites or underwent an orderly shutdown. For all who turned to engine-generator power, maintaining sufficient on-site fuel storage was a key to remaining operational.

Utility power outages were widespread due to high winds and flooding. Notably, two sites that had two separate commercial power feeds were not immune. One site lost one utility feed completely, and large voltage swings rendered the other unstable. Thus, the additional infrastructure investment was unusable in these circumstances.

The Tier Standard: Topology does not specify a number of utility feeds or their configuration. In fact, there is intentionally no utility feed requirement at all. The Uptime Institute recommends, instead, that data center owners invest in a robust on-site power generation system (typically engine generators), as this is the only source of reliable power for the data center.

Survey respondents reported engine-generator runtimes ranging from one hour to eight days, with flooding affecting fuel storage, fuel pumps or fuel piping distribution, limiting the runtime of about a quarter of the engine generator systems. In one case, building management warned a data center tenant that it would need to shut down the building’s engine generators if flooding worsened. As a result, the tenant proactively shut down operations. Several operators remained on engine-generator power after utility power was restored due to the “unreliable grid.” Additionally, for some respondents, timely delivery of fuel to fill tanks was not available as fuel shortages caused fuel vendors to prioritize hospitals and other life-safety facilities. In short, fuel delivery SLAs were unenforceable.

Uptime Institute Tier Certified data centers have a minimum requirement of 12 hours of on-site fuel storage to support N engine generators. However, this storm showed that even a backup plan for fuel delivery is no guarantee that fuel will be available to replenish this stock. In some cases fuel supplier power outages, fuel shortages or closed roadways prevented deliveries. In other cases, however, companies that thought they were on a priority list learned that hospitals, fire stations and police had even higher priority.

Other effects could be traced to design issues, rather than operations. One site experienced an unanticipated outage when the engine generator overheated. Wind pressure and direction prevented hot exhaust air from exhausting properly.

Facilities with engine generators, fuel storage or fuel pumps in underground basements experienced problems. One respondent reported that a facility had a sufficient volume of fuel; however, the fuel was underwater. The respondent remarked, “Couldn’t get those [engine generators] to run very well on sea water.” Although building owners may not have expected Sandy to cause flooding, localized flooding from broken mains or other sources often accompanies major events and should be mitigated accordingly.

One respondent noted that site communications were interrupted, thereby rendering the facility’s engine generators unusable. And several buildings were declared uninhabitable because the lobby and elevators were underwater.

Figure 2. Flooding in the Hoboken PATH station. Photos in this article depict the widespread flooding resulting from Hurricane Sandy and were contributed by Sabey and other respondents to the survey.

One respondent noted that water infiltrated spaces that would normally be watertight in a typical rainstorm because the severe winds (100+ MPH) pushed water uphill and along unusual paths. One response stated that high winds pushed water past roof flashing and vents: “Our roofing system is only waterproof when water follows gravity.” Although traditionally outside of design parameters and worst-case considerations, Superstorm Sandy provides a potent justification for appropriate amendments to building characteristics. For a greenfield site or potential upgrades at an existing site, business decisions should weigh the design loads and wind speeds to be designed for against the use of the building. This may also impact decisions about where to locate a data center.

Storm Preparation

The second question, “Describe any preparations you made prior to the storm,” brought a variety of responses. Almost all the respondents reported that they topped off fuel or arranged additional fuel, and one-third made sleeping and food provisions for operators or vendors expected to be on extended shift. About one-quarter of respondents reported checking that business continuity and maintenance actions were up-to-date. Others ensured that roof and other drainage structures were clear and working. And, a handful obtained sandbags.

All respondents to the question stated they had reviewed operational procedures with their teams to ensure a thorough understanding of standard and emergency procedures. Several reported that they brought in key vendors to work with their crews on site during the event, which they said proved helpful.

Downtown New York City immediately after the storm.

Some firms also had remote Operations Emergency Response Teams to relieve the local staff, with one reporting that blocked roadways and flight cancellations prevented arrival for lengthy periods of time.

A few respondents said that personnel and vendors were asked to vacate sites for their own safety. Three respondents stated that they tested their Business Continuity Plans and provided emergency bridge lines so that key stakeholders could call in and get updates.

Active construction sites provided a difficult challenge as construction materials needed to be secured so they would not become airborne and cause damage or loss.

Effectiveness of Preparations

Multiple respondents said that conducting an in-depth review of emergency procedures in preparation for the storm resulted in the staff being better aware of how to respond to events. Preparations enabled all the operators to continue IT operations during the storm. For example, unexpected water leaks materialized, but precautions such as sandbags and tarps successfully safeguarded the IT gear.

Status communication to IT departments continued during the storm. This allowed Facilities teams to provide ongoing updates to IT teams with reassurances that power, cooling and other critical infrastructure remained operational or that planned load shedding was occurring as needed.

Several owners/operators attributed their ability to endure the storm to preparations, calling them “extremely effective” and “very effective.” Preparation pays in an emergency, as it enables personnel to anticipate and prevent problems rather than responding in a reactive way.

Disaster Recovery or Business Continuity Plan

Of the survey respondents, only one did not have a disaster recovery or business continuity plan in place.

However, when asked, “Did you use your Disaster Recovery or Business Continuity Plan?” almost half the respondents said no, with almost all indicating that planning allowed business functions to remain in operation.

The storm did cause construction delays at one site as suppliers delayed shipments to meet the needs of operational sites impacted by the storm. While the site was not operational and a business continuity plan was not deployed, the shipping issues may have delayed outfitting, a move-in date and consolidation or expansion of other existing data centers.

Nonetheless, more than half the respondents to this question employed their business continuity plans, with some shifting mission-critical applications to other states and experiencing seamless failover. Common responses regarding successful application of the business continuity plans were “it works,” “we used it and we were successful,” and “worked well.”

Two respondents stated that commercial power was restored before they needed to implement their business continuity plans. At one site, a last-minute fuel delivery and the unexpected restoration of utility power averted the planned implementation of the business continuity plan.

Some operations implemented business continuity plans to allow employees to remain at home and off the roadways. These employees were not able to come on site because state and local governments restricted roads to emergency personnel only. However, as a result of implementing the business continuity plan, these employees were able to help keep these sites up and running.

Lessons Learned

Three survey questions captured lessons learned, including what worked and what required improvement. The most frequent answer reflected successes in preparation and pre-planning. On the flip side, fuel supply was identified as an area for improvement. It should be noted that the Uptime Institute position is to rely only on on-site fuel storage rather than off-site utility or fuel suppliers.

To the question “What is the one thing you would do the same?” the two most frequent responses were:

Planning with accurate, rehearsed procedures
Regular load transfer testing to switch the electrical load from utility power to engine generators

Staffing, full fuel tanks and communication with Facilities and IT personnel (both staff and management) were also mentioned.

Overwhelmingly, the most significant lesson was that all preparations were valuable. Additional positive lessons included involving staff when planning for disasters and having a sound electrical infrastructure in place. The staff knows its equipment and site. Therefore, if it is involved before a disaster occurs, it can bring ideas to the table that should be included in disaster recovery planning. Such lessons included fuel-related issues, site location, proper protection from severe weather and a range of other issues.

Remote Sites

A number of respondents were able to shift IT loads to remote sites, including sharing IT workload or relying on the uptime of cloud service providers. Load shedding of IT services proved effective for some businesses. Having a remote facility able to pick up the IT services can be a costly solution, but the expense may be worth it if the remote facility helps mitigate potential losses due to severe weather or security issues. Companies that could transfer workloads to alternate facilities in non-affected locations experienced this advantage. These cost issues are, of course, business decisions.

One owner listed maintaining an up-to-date IT load shed plan as its most important lesson. Although important, an IT load shed plan can be difficult to prioritize in multi-tenant facilities or when users think their use is the main priority. They don’t always see the bigger picture of a potential loss to the power source.
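
One way to picture an IT load shed plan is as a ranked list of loads that is trimmed until the remaining demand fits the power still available. The sketch below is illustrative only and is not drawn from the survey: the `Load` record, the `plan_load_shed` function, the load names, the kW figures and the priority numbers are all hypothetical, and a real plan would also account for application dependencies and tenant agreements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Load:
    name: str
    kw: float      # electrical demand in kW (assumed figure)
    priority: int  # 1 = most critical; larger numbers are shed first


def plan_load_shed(loads, available_kw):
    """Return (kept, shed) so that the kept loads fit within available_kw.

    Loads are shed strictly from least critical to most critical
    until the remaining total fits the available capacity.
    """
    kept = sorted(loads, key=lambda load: load.priority)
    shed = []
    while kept and sum(load.kw for load in kept) > available_kw:
        # The last element carries the largest priority number,
        # so it is the least critical load still running.
        shed.append(kept.pop())
    return kept, shed


# Hypothetical loads; every name and number here is an assumption.
loads = [
    Load("payments", 300.0, 1),
    Load("email", 150.0, 2),
    Load("dev-test", 200.0, 3),
    Load("analytics", 250.0, 4),
]

kept, shed = plan_load_shed(loads, available_kw=500.0)
print("keep:", [load.name for load in kept])
print("shed:", [load.name for load in shed])
```

The sketch sheds in strict priority order rather than searching for the best fit, which keeps the outcome predictable for stakeholders who have agreed to the ranking in advance. Agreeing on that ranking is the hard part the respondent describes.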

Staff

At least two respondents explicitly thanked their staffs. Having staff on site or nearby in hotels was critical to maintaining operations, according to almost half of the respondents to that question. It should be understood that riding through a storm requires that the site have appropriate infrastructure and staffing to start with, followed by good preventative maintenance, up-to-date standard operating procedures (SOPs) and disaster recovery plans.

Some businesses had staff from other locations on standby. Once the roadways opened or commercial flights resumed, depending on the location of the standby personnel, remote staff could mobilize on site. One respondent indicated the storm “stretch(ed) our resources” after the third day but nonetheless did not have to bring staff from another site.

One respondent stated that staff was the most important element and knew that local staff would be distracted if their homes and families were without power. Consequently, that company provided staff members with generators to provide essential power to their homes. This consideration allowed staff to either work remotely or be more available for on-site needs. Many such responses show that companies were thinking beyond the obvious or imminent to ensure ongoing operations during the event.

Other Facility Issues

Respondents reported no problems with space cooling equipment (chillers, computer room air handling units, etc.).

Internal Communication

One business reported that it provided a phone number so that occupants could call for status. A Facilities group at another site indicated that IT staff was concerned prior to the storm, but became more comfortable upon seeing the competency of the Facilities staff and its knowledge of the power distribution system, including the levels of backup power, during pre-communication meetings.

What Needed Improvement

“What is the one thing you would do differently?” The majority of respondents stated that they were looking at developing a more robust infrastructure, redundant engine-generator topology, moving infrastructure higher and regularly testing the switchover to engine generators prior to another emergency event. One respondent said that it was reviewing moving a data center farther from downtown Manhattan.

About half the respondents said that they would not do anything differently; their planning was so effective that their sites did not see any impacts. Even though these sites fared well, all sites should continue regular testing and update business continuity plans to ensure contact and other information is current. Improving infrastructure, continuing with proper preventative maintenance and updating contingency plans must continue to achieve the positive results characterized in this survey.

Flooding at the World Trade Center construction site.

Not surprisingly, given the number and significance of fuel supply problems, some respondents indicated that they plan to procure an additional supplier, others plan to bring a fuel truck to stay on site in advance of an expected event, and some plan to increase on-site storage. Increasing storage includes modifying and improving the existing fuel oil infrastructure. The Institute advocates that the only reliable fuel supply is what can be controlled at an owner’s site. The owner does not have ultimate control over what a supplier can provide. As seen during Superstorm Sandy, a site is not guaranteed to receive fuel delivery.

Steps taken related to fuel delivery included communicating with fuel suppliers and being on a priority list for delivery. However, in the aftermath of this storm, fire, paramedics, hospitals, medical support, etc., became higher priority. In addition, fuel suppliers had their own issues remaining operational. Some refineries shut down, forcing one facility owner to procure fuel from another state. Some fuel suppliers had to obtain their own engine-generator systems, and one supplier successfully created a gravity system for dispensing fuel. There were also problems with phone systems, making it difficult for suppliers to run their businesses and communicate with customers. One supplier deployed a small engine generator to run the phone system.

Due to lack of control over off-site vendors, the Uptime Institute’s Network Owners Advisory Committee recommendation and the Tier Standard: Topology requirement is a minimum of 12 hours of Concurrently Maintainable fuel to supply N engine generators. Sites susceptible to severe storms (hurricanes, snow, tornados, etc.) are strongly encouraged to store even more fuel. Many sites anticipated the move to engine-generator power and fuel procurement but did not anticipate problems with fuel transport.
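
To make the 12-hour minimum concrete, the following is a minimal sketch of the underlying arithmetic. It is not an Uptime Institute formula. The function names, the generator count, the per-generator burn rate and the tank volume are all hypothetical values chosen for illustration; actual consumption varies with engine model and load and should come from manufacturer data.

```python
def required_fuel_liters(n_generators, burn_rate_lph, hours=12.0):
    """Fuel needed to run n_generators for `hours` at a per-unit burn rate (L/h)."""
    if n_generators <= 0 or burn_rate_lph <= 0:
        raise ValueError("generator count and burn rate must be positive")
    return n_generators * burn_rate_lph * hours


def runtime_hours(usable_fuel_liters, n_generators, burn_rate_lph):
    """Hours of runtime that the usable on-site fuel supports for n_generators."""
    if n_generators <= 0 or burn_rate_lph <= 0:
        raise ValueError("generator count and burn rate must be positive")
    return usable_fuel_liters / (n_generators * burn_rate_lph)


# Hypothetical site: every figure below is an assumption.
N = 3                     # engine generators needed to carry the load
BURN_RATE_LPH = 400.0     # liters per hour per generator at the expected load
USABLE_LITERS = 60_000.0  # on-site fuel that can actually be drawn

minimum = required_fuel_liters(N, BURN_RATE_LPH)
runtime = runtime_hours(USABLE_LITERS, N, BURN_RATE_LPH)
print(f"12-hour minimum: {minimum:.0f} L")
print(f"Runtime on stored fuel: {runtime:.1f} h")
```

With these assumed figures, the 12-hour minimum works out to 14,400 liters and the stored volume supports about 50 hours. Running the same arithmetic against a multi-day outage with no deliveries shows how much additional storage a storm-prone site might want.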

In addition, respondents reported lessons about the location of fuel tanks. Fuel storage tanks or fuel pumps should not be located below grade. As indicated, many areas below grade flooded due to either storm water or pipe damage. At least four engine-generator systems were unable to operate due to this issue. Respondents experiencing this problem plan to move pumps or other engine-generator support equipment to higher locations.

One Facilities staff reported finding sediment in fuel supplies, though it is unknown whether this was due to existing sediment in the bottom of the owner’s tanks or whether the contaminants were introduced from the bottom of a fuel supplier’s tank. This facility plans to add fuel filtration systems as a result.

Infrastructure

Many comments centered on the need to increase site infrastructure resiliency with plant additions or by moving equipment to higher ground. Such construction projects would likely be easier to implement outside of dense cities, in which space is at a premium and where construction can be costly. Regardless, these improvements should be analyzed for technical robustness, feasibility and cost implications. One site stated they would be “investigating design of engine-generator plant for sustained runs.”

A few respondents saw a need to move critical facilities away from an area susceptible to a hurricane or flood. While some respondents plan to increase the resiliency of their site infrastructure, they are also evaluating extending the use of existing facilities in other locations to pick up the computing needs during the emergency response period.

No one expected that water infiltration would have such an impact. Wind speeds were so extreme that building envelopes were not watertight, with water entering buildings through roofing and entryways. And unusual wind directions caused unexpected problems; one respondent cautioned:

“Water can get in the building when the wind blows it uphill. Be ready for it.”

Water especially flooded below-grade facilities of all types.

The communication backbone, consisting of network fiber and cable, only had one report of failure within the survey. Regardless, voice system terminal devices did go down due to lack of power or limited battery duration. Sites should consider placing phone systems on engine generator and UPS power.

Load Transfer

Even if sufficient power supply infrastructure exists, equipment must be maintained and periodically tested. Most reports emphasized that regular testing proved valuable. One respondent waited until utility power failed to switch to engine-generator power to preserve fuel.

However, there was risk in waiting if issues arose during load transfer. Load transfer includes engine generators starting, ramping up, and all related controls working properly. One site that did not perform regular testing experienced problems transferring the load to its engine generators. That respondent planned to begin regular testing. The failed switchover illustrates the need to perform routine testing. Both complete commissioning after construction and ongoing preventative maintenance and testing are critical.

One comment specifically stated the owner would implement “more thorough testing of the generators and the electrical system overall”; the Uptime Institute considers thorough testing like this a best practice. In addition, fuel supply, fuel distribution piping and other infrastructure should be designed to minimize the possibility of total damage with redundant systems installed in locations far from each other. Therefore, damage in an area might interrupt a single source, rather than both primary and alternate sources.

Planning

Most of the responses identified pre-planning and “tabletop scenario walkthroughs” as keys to providing uptime. Some respondents reported that they needed to update their disaster recovery plans, as IT contacts or roles and responsibilities had changed since the document was published or last revised.

Respondents noted areas of improvement for planning and preparation:

Shortfalls with their business continuity plans
Out-of-date staff and contacts lists
Establishing closer working relationships with cloud service providers to make sure their sites have redundancy
Re-analyzing or accelerating data center consolidations or relocations. One respondent had recently relocated their critical functions away from a facility located near the harbor, which paid off. The facility did flood; it housed only non-critical functions, and there was not a major impact.

A few respondents stated they would work with their utility company to get better communication plans in place. And, one commenter learned to start preparations earlier with storm warnings.

Some respondents said they would plan to add additional staff to the schedule prior to the event, including deploying the Emergency Response Team from another site, after realizing the impact from transportation (ground and flight) issues.

Although most plans anticipated the need to house staff in local hotels, they did not anticipate that some hotels would close. In addition, local staff was limited by road closures and gas stations that closed due to loss of power or depleted supply.

Summary

Full preparedness is not a simple proposition. However, being prepared with a robust infrastructure system, sufficient on-site fuel, available staff and knowledge of past events has proven beneficial in terms of ongoing operations.

When sites lost their IT computing services, it was due largely either to critical infrastructure components being located below grade or to dependence on external resources for re-supply of engine-generator fuel, both preventable outcomes.

Solutions include the following:

Thoroughly reviewing site location, even for a third-party service
Locating critical components, including fuel storage and fuel pumps, at higher elevations
Performing testing and maintenance of the infrastructure systems, in particular switching power from utility to engine generator
Ensuring sufficient duration of engine-generator fuel stored on site
Employing proper staffing
Relocating from a storm-prone area
Maintaining up-to-date Disaster Recovery, Business Continuity and IT load shedding plans
Briefing stakeholders on these plans regularly to ensure confidence and common understanding
Providing a remote site to take over critical computing
Implementing and testing robust power, cooling, network and voice infrastructures

Though unforeseen issues can arise, the goal is to reduce potential impacts as much as possible. The intent of survey analysis is to share information to assist with that goal.

This video features the author, Debbie Seidman, presenting the summary at the Uptime Institute Symposium in 2013.

About the Authors

Debbie Seidman, PE, has 25+ years of experience delivering high-value (resilient, robust), energy-efficient and cost-effective projects for mission-critical facilities. As a senior consultant for the Uptime Institute, Ms. Seidman performs audits and Tier Certifications and provides customer engagements, advising clients throughout the design and construction phases of data center facility projects–with a focus on power and cooling infrastructure. Most recently, she was with Xcel Energy where she worked on demand-side energy-efficiency projects including data centers. Previous to that, she was with HP on the team that developed the HP POD. Ms. Seidman also served more than 20 years on HP’s Real Estate team as a project manager, facilities engineer, and operations engineer where she managed design and construction teams for projects at multiple sites totaling 2.5 million square feet, including data centers, clean rooms, research facilities, manufacturing environments, and corporate centers. Ms. Seidman holds a Bachelor of Science degree in Architectural Engineering (Building Systems Engineering) from the University of Colorado and an MBA from Colorado State University.

Vince Renaud was formerly CTO and Senior Tier Certification Authority for Uptime Institute Professional Services. Mr. Renaud has over 28 years of experience ranging from planning, engineering, design, and construction to start-up and operation. He received his Bachelor of Science degree in Civil Engineering from the Air Force Academy and a Master of Science in Engineering Management from Air Force Institute of Technology. Mr. Renaud is co-author of both Tier Standards and an instructor for the Accredited Tier Designer curriculum.