Dual-Corded Power and Fault Tolerance: Past, Present, and Future

Details of dual-corded power change, but the theme remains the same.

Uptime Institute has worked with owners and operators of data centers since the early 1990s. At the time, data center owners used single-corded IT devices for even their most critical IT assets. Figure 1 shows a selection of the many potential sources of outage in the single path.

Early on, Site Uptime Network (now the Uptime Institute Network) founder Ken Brill recognized that outages due to faults or maintenance in the critical distribution system were a major problem in high availability computing. The Uptime Institute considers the critical distribution to include the power supply to IT devices from the UPS output to any PDU (power distribution unit), panel, or remote power panel (RPP), and the power distribution down to the rack via whip or bus duct.

Ahead of their time, Ken Brill and the Network created the Fault Tolerant Power Compliance Specification in 2000 to address the sources of outages, and updated it in 2002. Then, in 2004 Uptime Institute produced the paper Fault Tolerant Power Certification is Essential When Buying Products for High-Availability to directly address the issue. When this paper was written, four years after the Fault Tolerant Power Compliance Specification was first issued, critical distribution failures continued to cause the majority of data center outages.

Fault-Tolerant Power Compliance Specification, Version 2.0 lists the required functionality of Fault Tolerant dual-corded IT devices as defined by the Uptime Institute.

In the mid-1990s, the Uptime Institute led the data center industry in establishing Tiers as a way to define the performance characteristics of data centers. Each Tier builds upon the previous Tier, adding maintenance opportunity and Fault Tolerance. This progress culminated in the 2009 publication of the Tier Standard: Topology, which established Tiers as progressive maintenance opportunities and fault tolerance. The Tier Standard also included the requirement for dual-corded devices in Tier III and Tier IV objective data centers. Tier III data centers have dual power paths to provide Concurrent Maintainability of each and every component and path. Tier IV data centers require the same dual power paths for Concurrent Maintainability and add the ability to autonomously respond to failures.


Figure 1. Single-corded IT equipment

Present
The Fault Tolerant Power Compliance Specification, Version 2.0 is clearly relevant 12 years later. The devices it originally called Fault Tolerant IT devices are today commonly known as dual corded, and they have become the basis of high availability; the terms Fault Tolerant IT device and dual-corded IT device are used interchangeably.

Tier III and Tier IV data center designs continue to be based upon the use of dual-corded architecture and require an active-active, dual-path distribution. The dual-corded concept is cemented into high-availability architecture in enterprise data centers, hyper-scale internet providers, and third-party data center spaces. Even the innovative Open Compute Project, sponsored by Facebook, which uses cutting-edge electrical architecture, utilizes dual-corded, Fault Tolerant IT devices.

Confoundingly, though, more than half of the more than 5,000 reported incidents in the Uptime Institute Network’s Abnormal Incident Reports (AIRs) database relate to the critical distribution system.

Dual-corded assets have increased maintenance opportunities for data center facilities management. Operations teams no longer need to wait for inconveniently timed maintenance windows; instead they can maintain their facilities without IT impact during safe, regular hours. If an anomaly occurs, the facilities and IT staff are on hand to address it.


Figure 2. Fault Tolerant and dual-corded IT equipment

Uptime Institute Network members today recognize the benefits of dual-corded devices. COO Jason Weckworth of RagingWire recently said, “Dual-corded IT devices allow RagingWire the maintenance and operations flexibility that are consistent with our Concurrently Maintainable objective and provide that extra level of availability assurance below the UPS system where any problem may have consequential impacts to availability.”

Uptime Institute Network members’ adoption of dual-corded devices has clearly improved availability, as indicated by the declining number of outages attributed to critical distribution. Properly applied, dual-corded devices are unaffected by the loss of a single source. Analysis of the AIRs database from 2007 to 2012 showed a reduction of more than 90% in critical distribution failures impacting the IT load.

Some data center owners or IT teams try to achieve dual power paths to IT equipment using large static transfer switches (STS) or STS power distribution units (PDUs) (see Figure 3). However, the maintenance, replacement, or a fault of an STS threatens the critical load from that device onward. One data center suffered a fault on an STS-PDU that affected one-third of its IT equipment; the loss of those systems rendered the entire data center unavailable. As noted in Figure 3, the single large STS solution does not meet Tier III or Tier IV criteria.


Figure 3. Static transfer switches

Uptime Institute recognizes that some heritage devices or legacy systems may end up in data centers, due to system migration challenges, mergers and acquisitions, consolidations, or historical clients. Data center infrastructure professionals need to question the justifications that lead to these conditions: If the system is so important, why is it not migrated to a high-availability, dual-corded IT asset?

The Tier Standard: Topology does include an accommodation for single-corded equipment as shown in Figure 4, depicting a local, rack-mounted transfer switch. The rack-mounted or point-of-use transfer switch allows for distribution of risk as low as possible in the critical distribution.

Still, many in IT have not yet gotten the message and bring in more than the occasional one-off device. Single-corded devices are found in a larger percentage of installations than should be expected. Rob McClary, SVP and GM of FORTRUST, said, “FORTRUST utilizes infrastructure with dual-power paths, yet we estimate that greater than 50% of our clients continue to deploy at least one or more single-corded devices that do not utilize our power infrastructure and could impact their own availability. FORTRUST strongly supports continued education to our end user community to utilize all dual-corded IT assets for a true high-availability solution.” The loss of even one single-corded asset in a deployment can render that deployment’s platforms or applications unavailable. The disconnect between data center infrastructure and IT hardware continues.


Figure 4. Point-of-use transfer switch

Uptime Institute teams still find the following configurations plaguing data center operators:

  • Single-corded network devices
  • Mainframes that degrade or are lost on loss of a single source of power
  • IT devices with an odd number of cords

The Future: A Call to Action
Complex systems such as data center infrastructure and the IT equipment and systems within them require a comprehensive team approach to management. That means breaking down the barriers between the organizations: integrating Facilities and IT staff, allowing the integrated organization to manage the data center, and educating end users who don’t understand power infrastructure. If we can’t integrate, then educate.

If a merger of IT and facilities just won’t work in an enterprise data center, a regular meeting will at least enable teams to share knowledge and review change management and facilities maintenance actions. In addition, codifying change management and maintenance window procedures in terms IT can understand using an ITIL-based system will enable IT counterparts to start to understand the criticality of power distribution as they see the how and why of data center facility operations firsthand.

Colocation and third-party data centers understand that many client IT organizations have limited in-house staff, expertise, and familiarity with high-availability data centers. The need to educate these clients is clear. Several ways to educate include:

  • Compile incident reports involving single-corded devices and share them with new tenants and deployment teams
  • Create a one-page fact sheet on dual-corded infrastructure with a schematic and benefits summary that end users can understand
  • Create a policy that requires rack-mounted or point-of-use transfer switches for all single-corded devices
  • Require all devices that support a high-availability application or IT deployment to be dual corded

These actions will pay dividends with increased ease of maintenance and reduced client coordination.

Facilities teams also need to look within themselves. Improved monitoring and data center infrastructure management (DCIM) solutions provide windows into the infrastructure but do not replace good management. Anecdotal evidence has shown that 1-10% of servers in a data center may be improperly corded, i.e., with both cords plugged into the A distribution.

Management can address these challenges by

  • Clearly and consistently labeling A and B power
  • Training all staff working in critical areas about data center policies, including the dual-corded policy
  • Performing quality control to verify A/B cording, phase balancing, and installation documentation
  • Capturing the configuration of the data center
  • Regularly tracking single-corded installations to pressure owners of those systems to modernize
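The A/B cording quality control described above lends itself to a simple automated audit. The sketch below is a minimal illustration, not part of any DCIM product; the inventory record format and field names are assumptions invented for the example. It flags any device whose power cords do not land on two different distributions:

```python
# Hypothetical inventory: each device maps to a list of its power cords,
# each cord recording which distribution (A or B) feeds it.
# This record format is an assumption for illustration only.
def audit_cording(inventory):
    """Return names of devices whose cords do not span both A and B."""
    flagged = []
    for device, cords in inventory.items():
        sources = {cord["source"] for cord in cords}
        # Single-corded devices and devices with both cords on the same
        # distribution both violate the dual-corded policy.
        if sources != {"A", "B"}:
            flagged.append(device)
    return sorted(flagged)

racks = {
    "web-01": [{"source": "A"}, {"source": "B"}],  # correctly corded
    "db-02":  [{"source": "A"}, {"source": "A"}],  # both cords on A
    "sw-03":  [{"source": "B"}],                   # single-corded
}
print(audit_cording(racks))  # ['db-02', 'sw-03']
```

Run against an asset database during routine rack audits, a check like this turns the "1-10% improperly corded" anecdote into a measurable, trackable number.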

Summary
Millions of dollars are regularly invested in dual-power path infrastructure in data centers because business needs demand high availability. This need is clearly reflected in the rising cost of downtime, from lost business to ruined reputations and goodwill. It is essential that Facilities and IT, including the procurement and installation teams, work together to safeguard the investment, making sure dual-power path technology is utilized for business-critical applications. In addition, owners and operators of data centers must continue to educate customers who lack knowledge of or familiarity with data center practices and manage the data center to ensure high-availability principles such as dual-corded architecture are fully utilized.


Fault-Tolerant Power Compliance Specification Version 2.0

Fault-tolerant power equipment refers to computer or communication hardware that is capable of receiving AC input from two different AC power sources. The objective is to maintain full equipment functionality when operating from A and B power sources or from A alone or from B alone. Equipment with an odd number of external power inputs (line cords) generally will not meet this requirement. It is desirable for equipment to have the least number of external power inputs while still meeting the requirement for receiving AC input from two different AC power sources. Products requiring more than two external power inputs risk being rejected by some sites. For equipment to qualify as truly fault-tolerant power compliant, it must meet all of the following criteria as initially installed, at ultimate capacity, and under any configuration or combination of options. (The designation of A and B power sources is used for clarity in the following descriptions.)

  • If either one of two AC power sources fails or is out-of-tolerance, the equipment must still be able to start up or continue uninterrupted operation with no loss of data, reduction in hardware functionality, performance, capacity, or cooling.
  • After the return of either AC power source from a failed or out-of-tolerance condition, during which acceptable power was continuously available from the other AC power source, the equipment will not require a power-down, IPL, or human intervention to restore data, hardware functionality, performance, or capacity.
  • The first or second AC power source may then subsequently fail no later than 10 seconds after the return of the first or second AC power source from a failed or out-of-tolerance condition with no loss of data, reduction in hardware functionality, performance, capacity, or cooling.
  • The two AC power sources can be out of synchronization with each having a different voltage, frequency, phase rotation, and phase angle as long as the power characteristics for each separate AC source remain within the range of the manufacturer’s published specifications and tolerances.
  • Both external AC power inputs must terminate within the manufacturer’s fault-tolerant power compliant computer equipment. In the event that the external AC power input is a detachable power cord, the equipment must provide for positive retention of the female plug so the plug cannot be pulled loose accidentally. Within the equipment, the AC power train (down to and including the AC to DC power supplies) must be compartmentalized such that any power train component on either side can be safely serviced without affecting computer equipment availability or performance and without putting the AC power train of the other side at risk.
  • For single- or three-phase power sources, the neutral conductor in the AC power input shall not be bonded to the chassis ground inside the equipment. This will prevent circulating ground currents between the two external power sources.
  • Internal or external active AC input switching devices (e.g., mechanical or electronic transfer switches) are not acceptable.
  • A fault inside the manufacturer’s equipment that results in the failure of one AC power source shall not be transferred to the second AC power source causing it to also fail.
  • For single- or three-phase power sources, with both AC power inputs available and with both inputs operating at approximately the same voltage, the normal load on each power source will be shared within 10% of the average.
  • For three-phase power source configurations, the normal load on each phase will be within 10% of the average.
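The last two criteria, load sharing between the A and B inputs and phase balance within a three-phase input, are straightforward to verify from measured input power. A minimal sketch of that check (the function name and example readings are assumptions for illustration, not part of the specification):

```python
def within_10_percent_of_average(readings):
    """True if every reading falls within 10% of the group average.

    Applies both to load sharing between the A and B power inputs
    and to phase balance within a three-phase input, per the
    Fault-Tolerant Power Compliance Specification criteria above.
    """
    avg = sum(readings) / len(readings)
    return all(abs(r - avg) <= 0.10 * avg for r in readings)

# Load sharing between A and B inputs (watts): average is 500 W and
# each input is within 50 W of it, so this device is compliant.
print(within_10_percent_of_average([520, 480]))       # True

# Phase currents on a three-phase input (amps): the 400 A phase is 20%
# below the 500 A average, so this installation is non-compliant.
print(within_10_percent_of_average([600, 500, 400]))  # False
```

The same tolerance test, fed from metered PDU or branch-circuit data, can be folded into routine quality-control checks of the critical distribution.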

Keith Klesner’s career in critical facilities spans 14 years and includes responsibilities ranging from planning, engineering, design and construction to start-up and ongoing operation of data centers and mission-critical facilities. In the role of Uptime Institute vice president of Engineering, Mr. Klesner has provided leadership and strategic direction to maintain the highest levels of availability for leading organizations around the world. Mr. Klesner performs strategic-level consulting engagements, Tier Certifications and industry outreach—in addition to instructing premier professional accreditation courses. Prior to joining the Uptime Institute, Mr. Klesner was responsible for the planning, design, construction, operation and maintenance of critical facilities for the U.S. government worldwide. His early career includes six years as a U.S. Air Force officer. He has a Bachelor of Science degree in Civil Engineering from the University of Colorado-Boulder and a Master of Business Administration from the University of LaVerne. He maintains status as a professional engineer (PE) in Colorado and is a LEED-accredited professional.


Resolving the Data Center Staffing Shortage

The availability of qualified candidates is just part of the problem with data center staffing; the industry also lacks training and clear career paths attractive to recruits.

The data center industry is experiencing a shortage of personnel. Uptime Institute Founder Ken Brill, as always, was among the first to note a trend, mentioning it more than 10 years ago. This trend reflects, in part, an aging global demographic but also increasing demand for data center personnel, which Uptime Institute Network members have described as chronic. The aging population threatens many industries but few more so than the data center industry, where the Network Abnormal Incident Reports (AIRs) database supports the relationship between downtime and inexperienced personnel.

As a result, at recent meetings North American Network members discussed how enterprises could successfully attract, train and retain staff. Network members blame the shortage on increasing market demand for data centers, inflexible organizational structures and an aging workforce retiring in greater numbers. This shortfall has already caused interruptions in service and reduced availability to mission-critical business applications. Some say that the shortage of skilled personnel has already created conditions that could lead to downtime. If not addressed in the near term, the problem could affect sections of the economy and company valuations.

Prior to 2000, data center infrastructure was fairly static; power and cooling demand generally grew following a linear curve. Enterprises could manage the growth in demand during this period fairly easily. Then technology advances and increased market adoption rates changed the demand for data center capacity so that it no longer followed a linear growth model. This trend continues, with one recent survey from TheInfoPro, a service of 451 Research, finding that 37% of data center operators had added data center space from July 2012 to June 2013. Similarly, the 2013 Uptime Institute Data Center Industry Survey found that 70% of 1000 respondents had built new data center space or added space in the last five years. The survey reported even more data center construction projects in 2012 and 2011 (see Figures 1-3).


Figure 1. New data center space reported in the last 12 months.

Drivers of the data center market are similar to those that drive overall internet growth and include increasing broadband penetration, e-commerce, video delivery, gaming, social networking, VOIP, cloud computing and web applications that make the internet and data networking a key enabler of business and consumer activity. More qualified personnel are required to respond to this accelerated growth.

The organization models of many companies kept IT and Facilities or real-estate groups totally separate. IT did IT work while Facilities maintained the building, striped the parking lot, and—oh, by the way—supported the UPS systems. The groups did not share goals, schedules, meetings or ideas. This organizational structure worked well until technology accentuated the importance of, and lack of actual, middle ground between the two groups.


Figure 2. Demand has significantly outpaced supply since 2010.

Efforts to bridge the gap between the two groups foundered because of conflicting work processes and multiple definitions for shared terms (such as mission critical maintenance). Simply put, the two groups spoke different languages, followed different leaders and pursued unreconciled goals. Companies that recognized the implications of the situation immediately reorganized. Some of these companies established mission critical teams and others moved Facilities and IT into the same department. This organizational challenge is by no means worked out and will continue well into the next decade.

Though no government agency or private enterprise tracks employment trends in data centers, U.S. Social Security Administration (SSA) statistics for the general population support anecdotes shared by Network members. According to the SSA, which supervises the federal retirement benefits program in the U.S., 10,000 people per day apply for social security benefits, a rate expected to continue through 2025 as the baby boomers retire, a phenomenon apparently first dubbed the “silver tsunami” by the Alliance for Aging Research in 2006. The populations of Europe and wide parts of Asia, including China and Japan, are also aging.

The direct experiences shared by Uptime Institute Network members suggest that the data center industry is highly vulnerable to, if not already diminished by, this larger societal trend. Network members estimate that 40% of the facilities engineering community is older than 50. One member of the Network expects that 50% of its staff will retire in the next two years. Network members remain concerned that many qualified candidates—science, technology, engineering and mathematics (STEM) students—are unaware of the employment opportunities offered by the industry and may not be attracted to the 24 x 7 nature of the work.

Tony Ulichnie, who presided over many of these discussions as Network Director, North America (before retiring in July of this year), described the wisdom and experience lost with the retirement of this generation as “the price of silver,” referring to the loss to the organization when a longstanding and silver-haired data center operations specialist retires.

Military and civilian nuclear programs have proven to be a source of excellent candidates for data center facilities but yield only so many graduates. These “Navy Nukes” and seasoned facilities engineers command very competitive salaries and find themselves being courted by the industry.

Industry leaders say that the pipeline for replacement engineers has slowed to a dribble. Tactics such as poaching and counteroffers have become commonplace in the field.

Potential employers are also reluctant to risk hiring green (inexperienced) recruits. The practices of mission-critical maintenance require much discipline and patience, especially when dealing with IT schedules and meeting application availability requirements. Deliberate processes along with clear communications skills become necessary elements of an effective facilities organization. Identifying individuals with these capabilities is the trick: one Uptime Institute Network member found a key recruit working at a bakery. Another member puts HVAC students through an 18-month training program after hiring them from a local vocational school, with a 70% success rate.


Figure 3. Growth in new whitespace by size: respondents reporting new space in the last five years in the Uptime Institute Survey (see p. 142 for the full 2013 Uptime Institute Data Center Survey) built a wide variety of spaces.

The hunt for unexplored candidate pools will increase in intensity as the demand for talent escalates in the next decade, and availability and reliability will also suffer unless the industry addresses the problem in a comprehensive manner. To mitigate the silver tsunami, some combination of industry, individual enterprises and academia must create effective training, development and apprenticeship programs to prepare replacements for retirees at all levels of responsibility. In particular, data center operators must develop ways to identify and recruit talented young individuals who possess the key attributes needed to succeed in a pressure-packed environment.

A Resource Pool

Veterans form an often overlooked and/or misunderstood talent pool for technical and high-precision jobs in many fields, including data centers. Statistics suggest that unemployment among veterans exceeds the national rate, which is counterintuitive to those who have served. With more than one million service members projected to leave the military between now and 2016 due to the draw down in combat operations, according to U.S. Department of Defense estimates, unemployment among veterans could be a growing national problem in the U.S.

From the industry perspective, however, the national problem of unemployed veterans could prove an opportunity to “do well by doing good.” While experienced nuclear professionals represent a small pool of high-end and experienced talent, the pool of unemployed but trainable veterans represents a nearly inexhaustible source of talent suitable, with appropriate preparation, for all kinds of data center challenges.

Data centers compete with other industries for personnel, so now is the time to seize the opportunity because other industries are already moving to capitalize on this pool of talent. For example, Walmart has committed to hiring any veteran who was honorably discharged in the past 12 months, JP Morgan Chase has teamed with the U.S. Chamber of Commerce to hire over 100,000 veterans, and iHeartRadio’s Show Your Stripes program features many large and small enterprises, including some that own or operate data centers, committed to meeting the employment needs of veterans. For its own good, the data center industry must broadly participate in these efforts and drive to acquire and train candidates from this talent pool.

In North America, some data center staffs already include veterans who received technical training in the military and were able to land a job because they could quickly apply those skills to data centers. These technicians have proven the value of hiring veterans for data center work, not only for their relevant skills but also for their personal attributes of discipline and performance excellence.

The data center industry can take further advantage of the talent pool of veterans by establishing effective training and onboarding programs (mechanisms that enable new employees to acquire the necessary knowledge, skills and behaviors to become effective organizational members and insiders) for veterans who do not have the technical training (e.g., infantry, armor) that translates easily to the data center industry but have all the other important characteristics, including a proven ability to learn. Providing clear pathways for veterans of all backgrounds to enter the industry will ensure that it benefits from the growing talent pool and will be able to compete effectively with the other industries.

While technically trained veterans can enter the data center industry needing only mentoring and experience to become near-term replacements for retiring mid-level personnel, reaching out to a broader pool that requires technical training will create a generation of junior staff who can grow into mid-level positions and beyond with time and experience. The leadership, discipline and drive that veterans have will enable them to more quickly grasp and master the technical requirements of the job and adapt with ease to the rigor of data center operations.

Veterans’ Value to Industries

Military training and experience is unequaled in the breadth and depth of skills that it develops and the conditions in which these skills are vetted. Service members are trained to be intellectually, mentally and emotionally strong. They are then continuously tested in the most extreme conditions. Combat veterans have made life and death decisions, 24 hours a day for months without a break. They perform complex tasks, knowing that the consequences of failure could result in harm or even the death of themselves and others. This resilience and strength can be relied on in the civilian marketplace.

Basic training teaches the men and women of the military that the needs of the team are greater than their individual needs. They are taught to lead and to follow. They are taught to communicate up, down and across. They learn that they can achieve things they never thought possible because of these skills, and with a humble confidence can do the same in any work environment.

The military is in a constant state of learning, producing individuals with uncanny adaptive thinking and a capacity and passion for continuing to learn. This learning environment focuses not only on personal development but also on training and developing subordinates and peers. This experience acts as a force multiplier when a veteran who is used to knowing his job plus that of the entire team is added to the staff. The veteran is used to making sure that the team as a whole is performing well rather than focusing on the individual. This unwavering commitment to a greater cause becomes an ingrained ethos that can improve the work habits of the entire team.

The public commonly stereotypes military personnel as unable to think outside of a chain of command, but following a chain of command is only a facet of understanding how to perform in a team. Service members are also trained to be problem solvers. In this author’s experience, Iraq and Afghanistan were highly complex operations where overlooking the smallest detail could change outcomes. The military could not succeed at any mission if everyone waited for specific orders/instructions from their superiors before reacting to a situation. The mindset of a veteran is to focus on the mission: mission leaders impart a thorough understanding of the intent of a plan to troops, who then apply problem-solving skills to each situation in order to get the most positive outcome. They are trained to be consummate planners, engaging in a continuous process of assessment, planning, execution, feedback and fine tuning to ensure mission success.

Reliability is another key attribute that comes from military service. Veterans know that a mission that starts a minute late can be fatal. This precision translates to little things like punctuality and big things like driving projects to meet due dates and budgets. This level of dependability is a cornerstone of being a good teammate and leader.

Finally, an often overlooked value of military service is the scope of responsibility that veterans have had, which is often much larger than their non-veteran peers. It is not uncommon for servicemen and women in their twenties to have managed multi-million dollar budgets and hundreds of people. Their planning and management experience is gained in situations where bad decisions can cause troops to drive into an ambush that might also prevent supplies or reinforcements from reaching an under-provisioned unit.

Because military experience produces individuals who demonstrate strong leadership skills, reliability, dependability, integrity, problem-solving ability, proven ability to learn and a team-first attitude, veterans are the best source of talent available. Salute Inc. is an example of a company that helps bring veterans into the data center industry, and in less than six months has proven the value proposition.

Challenges

Recent Uptime Institute Network discussions identified the need for standard curriculum and job descriptions to help establish a pathway for veterans to more easily enter the industry, and Network members are forming a subcommittee to examine the issue. The subcommittee’s first priority is establishing a foundation of training for veterans whose military specialty did not include technical training. Training programs should allow each veteran to enter the data center industry at an appropriate level.

At the same time, the subcommittee will assess and recommend human resource (HR) policies to address a myriad of systemic issues that should be expected. For example, once trainees become qualified, how should companies adjust their salaries? Pay adjustments might exceed normal increases; however, the market value of these trainees has changed, and, unlike other entry-level trainees, veterans have proven high retention rates. The subcommittee has already defined several entry-level positions:

  • Network operations center trainee
  • Data center operations trainee
  • Security administration trainee
  • IT equipment installation trainee
  • Asset management administrator trainee

Resources for Veterans

The Center for New American Security (CNAS) conducted in-depth interviews with 69 companies and found that more than 80% named one or two negative perceptions about veterans. The two most common are skill translation and concerns about post-traumatic stress (PTS).

Many organizations have looked at the issue of skill translation. Some of them have developed online resources to help veterans translate their experiences into meaningful and descriptive civilian terms (www.careerinfonet.org/moc/). They also provide tools to help veterans articulate their value in terms that civilian organizations will understand.

Organizations that access these resources will also gain a better understanding of how a veteran’s training and skills suit the data center environment. In addition, the military has established comprehensive transition programs that all service members go through when re-entering the civilian job market, including resume preparation and interview planning. The combination of government-sponsored programs and resources, a veteran’s own initiative and a civilian organization’s desire to understand can offset concerns about skill translation.

PTS is an issue that cannot be ignored. It is one of the more misunderstood problems in America, even among some in the medical community. It is important to understand more about PTS before assuming this problem affects only veterans. It is estimated that 8% of all Americans suffer from PTS, which is about 25 million people. The number of returning military who have been diagnosed with PTS is 300,000, which is about 30% of Iraq/Afghanistan combat veterans, yet only a very small proportion of the total PTS sufferers in the U.S. The mass media—where most people learn about PTS—often describes PTS as a military issue because the military approach to PTS is very visible: there is a formal process for identifying it and also enormous resources focused on helping veterans cope with it. Given that there are 80 times more non-veterans suffering from PTS, the focus for any HR organization should be ensuring that a company’s practices (from the interview to employee assistance programs and retention) are effectively addressing the issue in general.

Conclusion
The data center industry needs the discipline, leadership and flexibility of veterans to serve as a foundation on which it can build the next generation of data center operators. The Uptime Institute Network is establishing a subcommittee and calling for volunteers to help define the fundamentals required for an effective onboarding, training and development program in the industry. This group will address everything from job descriptions to clearly defined career paths, both for entry-level trainees and for experienced technicians transitioning from the military. For further information, or if you are interested in contributing to this effort, please contact Rob Costa, Network Director, North America ([email protected]).

Resources

The following list provides a good starting point for understanding the many resources available for veterans and employers to connect.


Lee Kirby is Uptime Institute senior vice president and CTO. In this role, he is responsible for serving Uptime Institute clients throughout the life cycle of the data center, from design through operations. Mr. Kirby’s experience includes senior executive positions at Skanska, Lee Technologies and Exodus Communications. Prior to joining the Uptime Institute, he was CEO and founder of Salute Inc. He has more than 30 years of experience in all aspects of information systems, strategic business development, finance, planning, human resources and administration in both the private and public sectors. Mr. Kirby has successfully led several technology startups and turnarounds as well as built and run world-class global operations. In addition to an MBA from the University of Washington and further studies at Henley School of Business in London and Stanford University, Mr. Kirby holds professional certifications in management and security (ITIL v3 Expert, Lean Six Sigma, CCO). In addition to his many years as a successful technology industry leader, he balanced a successful 36-year military career (Ret. Colonel) and continues to serve as an advisor to many veteran support organizations.

Mr. Kirby also has extensive experience working cooperatively with leading organizations across many Industries, including Morgan Stanley, Citibank, Digital Realty, Microsoft, Cisco and BP.


Resolving Conflicts between Data Center Owners and Designers

Improving communication between the enterprise and design engineers during a capital project

For over 10 years, Uptime Institute has sought to improve the relationship between data center design engineers and data center owners. Yet, it is clear that issues remain.

Uptime Institute’s uniquely unbiased position—it does not design, construct, commission, operate, or provision equipment to data centers—affords direct insight into data center capital projects throughout the world. Uptime Institute develops this insight through relationships with Network members in North America, Latin America, EMEA, and Asia Pacific; the Accredited Tier Designer (ATD) community; and the owner/operators of 392 Tier Certified, high-performance data centers in 56 countries.

Despite increasingly sophisticated analyses and tools available to the industry, Uptime Institute continues to find that when an enterprise data center owner’s underlying assumptions at the outset of a capital project are not attuned to its business needs for performance and capacity, problematic operations issues can plague the data center for its entire life.

The most extreme cases can result in disrupted service life of the new data center. Disrupted service life may be classified in three broad categories.

1. Limited flexibility

  • The resulting facility does not meet the performance requirements of an IT deployment that could have been reasonably forecast
  • The resulting facility is difficult to operate, and staff may avoid using performance or efficiency features in the design

2. Insufficient capacity

  • Another deployment (either new build, expansion, or colocation) must be launched earlier than expected
  • The Enterprise must again fund and resource a definition, justification, and implementation phase with all the associated business disruptions
  • The project is cancelled and capacity sought elsewhere

3. Excess capacity

  • Stranded assets in terms of space, power, and/or cooling represent a poor use of the Enterprise’s capital resources
  • Efficiency is diminished over the long term due to low utilization of equipment and space
  • Capital and operating cost per piece of IT or network equipment is untenable

Any data center capital project is subject to complex challenges. Some causes of schedule and budget overruns, such as inclement weather, delayed equipment delivery, overwhelmed local resources, slow-moving permitting and approval bureaucracies, lack of available public utilities (power, water, gas), or a merger, acquisition, or other shift in corporate strategy, may be outside the direct control of the enterprise.

But other causes of schedule and budget overruns are avoidable and can be dealt with effectively during the pre-design phase. Unfortunately, many of these issues become clear to the Enterprise only after the project management, design, construction, and commissioning teams have completed their obligations.

Planning and justifying major data center projects has been a longstanding topic of research and education for Uptime Institute. Nevertheless, the global scale of planning shortfalls and project communication issues only became clear due to insight gained through the rapid expansion of Tier Certifications.

Even before a Tier Certification contract is signed, Uptime Institute requests a project profile, composed of key characteristics including size, capacity, density, phasing, and Tier objective(s). This information helps Uptime Institute determine the level of effort required for Tier Certification, based on similar projects. Additionally, this allows Uptime Institute to provide upfront counsel on common shortfalls and items of concern based upon our experience of similar projects.

Furthermore, a project team may update or amend the project profile to maintain cost controls. Yet, Uptime Institute noted significant variances in these updated profiles in terms of density, capacity, and Tier. It is acknowledged that an owner may decide to amend the size of a data center, or to adjust phasing, to limit initial capital costs or otherwise better respond to business needs. But a project that moves up and down the Tier levels or varies dramatically in density from one profile to another indicates management and communication issues.

These issues result in project delays, work stoppages, or cancellations. And if the project is completed, it can be expected to fall short in capacity (either too much or too little), in meeting performance requirements (in the design or the facility), and in flexibility.

Typically, a Tier Certification inquiry occurs after a business need has been established for a data center project and a data center design engineer has been contracted. Unstable Certification profiles show that a project may have been moved into the design phase prematurely, with cost, schedule, and credibility consequences for a number of parties—notably, the owner and the designer.

Addressing the Communications Gap

Beginning in May 2013, Uptime Institute engaged the industry to address this management and communication issue on a broader basis. Anecdotally, both sides, via the Network or ATD courses, had voiced concerns that one had insufficient insight into the scope of responsibility, or unrealistic expectations, of the other. For example, a design engineer would typically be contracted to produce an executable design but soon find that the owner was not ready to make the decisions that would allow the design process to begin. On the other hand, owners found that design engineers lacked commitment to innovation and would deliver a solution that resembled a previous project rather than one vetted against established performance and operations requirements. This initiative was entitled Owners vs Designers (OvD) to call attention to the tension evident between these two responsibilities.

The Uptime Institute’s approach was to meet with the designers and owners separately to gather feedback and recommendations and to then reconcile the feedback and recommendations in a publication.

OvD began with the ATD community during a special session at Uptime Institute Symposium in May 2013. Participants were predominantly senior design engineers with experience in the U.S., Canada, Brasil, Mexico, Kenya, Australia, Saudi Arabia, Lebanon, Germany, Oman, and Russia. This initial session verified the need for more attention to this issue.

The design engineers’ overwhelming guidance to owners could be summarized as “know what you want.” The following issues were raised specifically and repeatedly:

1. Lack of credible IT forecasting

  • Without a credible IT requirements definition, it is difficult to establish the basic project profile (size, density, capacity, phasing, and Tier). As this information is discovered, the profile changes, requiring significant delays and rework.
  • In the absence of an IT forecast, designers have to make assumptions about IT equipment. The designers felt that this task is outside their design contract and that they are hired to be data center design experts, not IT planning experts.

2. Lack of detailed Facilities Technical Requirements

  • The absence of detailed Facilities Technical Requirements forces designers to complete a project definitions document themselves because a formal design exercise cannot be launched in its absence
  • Some designers offer, or are forced, to complete the Facilities Technical Requirements, although it is out of scope
  • Others hesitated to do so as this is an extensive effort that requires knowledge and input from a variety of stakeholders
  • Others acknowledged that this process is outside their core competency and the result could be compromised by schedule pressures or limited experience

3. Misalignment of available budget and performance expectations

  • Owners wanted low capital expense, operating expense, and cost of ownership over the life of the project.
  • Most solutions cannot satisfy all three. The owners should establish the highest priority (Capex, Opex, TCO).
  • Designers felt unduly criticized for not prioritizing energy efficiencies in data center designs, although the owners did not understand the correlation between Capex and efficiency. “Saving money takes money; a cheap data center design is rarely efficient.”

Data Center Owners Respond
Following the initial meeting with the data center design community, Uptime Institute brought the discussion to data center owners and operators in the Uptime Institute Network throughout 2013, at the North America Network Meeting in Seattle, WA, APAC Network Meeting in Shenzhen, China, and at the Fall Network Meeting in Scottsdale, AZ.

Uptime Institute solicited input from the owners and also presented the designers’ perspective to the Network members. The problems the engineering community identified resonated with the Operations professionals. However, the owners also identified multiple problems encountered on the design side of a capital project.

In the owner’s words, “designers, do your job.”

According to the owners, the design community is responsible for drawing out the owners’ requirements, providing multiple options, and identifying and explaining potential costs. Common problems in the owners’ experience include:

  • Conflicts often arise between the design team and outside consultants hired by owners
  • Various stakeholders in the owner’s organization have different agendas that confuse priorities
  • Isolated IT and Facilities teams result in capacity planning problems
  • Design teams are reluctant to stray from their preferred designs

The data center owner community agreed with the designers’ perspective and took responsibility for those shortcomings. But the owners pointed out that many design firms promote cookie-cutter solutions and are reluctant to stray from their preferred topologies and equipment-based solutions. One participant shared that he received data center design documents for a project with the name of the design firm’s previous customer still on the paperwork.

Recommendations

Throughout this process, Uptime Institute worked to collect and synthesize the feedback and potential solutions to chronic communications problems between these two constituencies. The following best practices will improve management and communication throughout the project planning and development, with lasting positive effect on the operations lifecycle.

Pre-Design Phase
All communities that participated in OvD discussions understood the need to unite stakeholders throughout the project and the importance of reviewing documentation and tracking changes throughout. Owners and designers also agreed on the need to invest time and budget for pre-design, specifically including documenting the IT Capacity Plan with near-term, mid-term, and long-term scenarios.

The owners and designers also agreed on the importance of building Facilities Technical Requirements that are responsive to the IT Capacity Plan and include the essential project parameters:

  • Capacity (initial and ultimate)
  • Tier(s)
  • Redundancy
  • Density
  • Phased implementation strategy
  • Configuration preferences
  • Technology preferences
  • Operations requirements
  • Level of innovation
  • Energy efficiency objectives
  • Computer Room Master Plan
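
For teams that track these parameters across profile revisions, the list above can be captured in a simple structure. The sketch below is a hypothetical mapping of the essential parameters into a Python dataclass; the field names and example values are illustrative, not a standard Uptime Institute schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical structure for the essential Facilities Technical
# Requirements parameters; field names and values are illustrative.
@dataclass
class FacilitiesTechnicalRequirements:
    initial_capacity_kw: float             # capacity (initial)
    ultimate_capacity_kw: float            # capacity (ultimate)
    tier_objective: str                    # e.g., "Tier III"
    redundancy: str                        # e.g., "N+1", "2N"
    density_kw_per_rack: float
    phases: List[str] = field(default_factory=list)
    configuration_preferences: List[str] = field(default_factory=list)
    technology_preferences: List[str] = field(default_factory=list)
    operations_requirements: List[str] = field(default_factory=list)
    innovation_level: str = "proven"       # appetite for innovation
    pue_objective: Optional[float] = None  # energy efficiency objective
    computer_room_master_plan: Optional[str] = None  # document reference

# Example profile for a phased Tier III build (values invented).
ftr = FacilitiesTechnicalRequirements(
    initial_capacity_kw=1200,
    ultimate_capacity_kw=3600,
    tier_objective="Tier III",
    redundancy="N+1",
    density_kw_per_rack=6.0,
    phases=["Phase 1: 1.2 MW", "Phase 2: add 2.4 MW"],
)
```

Ratifying and versioning a document built from a structure like this makes profile variances, such as the density, capacity, and Tier swings noted earlier, visible and auditable.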

Workshop Computer Room Master Plans with IT, Facilities, Corporate Real Estate, Security, and other stakeholders, and then incorporate them into the Facilities Technical Requirements. After preparing the Facilities Technical Requirements, invite key stakeholders to ratify the document. This recommendation does not prohibit changes later but provides a basis of understanding and a launch point for the project. Following ratification, brief the executive team (or board). This and subsequent briefings provide the appropriate forum for communicating the costs associated with various design alternatives as well as how they deliver business value.

RFP and Hiring
Provide as much detail about project requirements as possible in the RFP, including an excerpt of the Facilities Technical Requirements and technology and operations preferences and requirements. This allows respondents to begin to understand the project and respond with their most relevant experience. Also, given that many RFPs compel some level of at-risk design work, a detailed RFP will best guide this qualification period and facilitate the choice of the right design firm. Including details in the RFP does not prohibit the design from changing during its development and implementation.

Negotiate in person as much as possible. Owners regretted not spending more time with the design firm(s) before a formal engagement as misalignments only became evident once it was too late. Also, multiple owners remarked with pride that they walked out of a negotiation at least once. This demonstrated their own commitment to their projects and set a tone of consequences and accountability for poor or insufficient communication.

Assess and score the culture of the design firms for alignment with the owner’s preferred mode and tone of operations. Some owners commented that they preferred a small, local design firm, which might require additional investment in training but which, they were confident, would give the project closer and more careful attention in return.

Notify the design engineer from the outset of specific requirements and indicators of success to pre-empt receiving a generic or reconstituted design.

Should the owner engage an outside consultant, avoid setting an aggressive tone for the engagement. Owners may want to augment their internal team with a trusted advisor, yet this role can inadvertently result in the consultant assuming the role of guard dog rather than focusing on collaboration and facilitation.

Design and Subsequent Phases
Owners and designers agreed that a design effort is a management challenge rather than a technical one. An active and engaged owner yields a more responsive and operable design; owners who viewed design as outsourcing the production/fabrication of a data center struggled with the resulting solution. The following recommendations will reduce surprises during or after the project.

  • Success was defined not as a discrete number of meetings or reports, but as contingent upon establishing and managing a communication system.
    Key components of this system include the following:
  • Glossary of terms: Stakeholders will have varying experience and expertise, and some terms may be unfamiliar or misconceived. A glossary establishes a consistent vocabulary, encourages questions, and builds common understanding.
  • List of stakeholders: Stakeholders may vary, but identifying the ‘clients’ of the data center helps to establish and maintain accountability.
  • Document all changes: The owner must be able to evidence the circumstances and reasons behind any changes. Changes are a natural aspect of a complex data center project, but knowing what decision was made, and why, is key to setting expectations and to successful operation of the data center.
  • Notify stakeholders of changes to IT Capacity Plans, Facilities Technical Requirements, and design documents: This helps executive and non-technical stakeholders feel engaged without disrupting the project flow and allows project stakeholders to provide accurate and timely answers when decisions are questioned during or after the project.

As the recommendations were compiled from the OvD initiative, many of the recommendations resonated with Uptime Institute guidance of years past. Over 10 years ago, Ken Brill and Pitt Turner held seminars on project governance that touched upon a number of the items herein. It is an old problem, but just as relevant.


Key Quotes from the Design Community

Owners want to design to Tier III, but they want to pay for Tier II and get Tier IV performance.

Owners want technologies or designs that don’t work in their region or budget.

The IT people are not at the table, and engineers don’t have adequate opportunity to understand their requirements. Designers are often trying to meet the demands of an absent, remote, or shielded IT client who lives in a state of constant crisis.

Once the project is defined, it’s in the hands of the general contractor and commercial real estate group. Intermediaries may not have data center experience, and engineers aren’t in direct contract with the end user anymore.


Industry Perspectives

Chris Crosby, CEO, Compass Datacenters

There are some days when I’d like to throw architects and engineers off the roof. They don’t read their own documents, for example, putting in boilerplate that has nothing to do with the current project in a spec. They can also believe that they know better than the owner—making assumptions and changes independent of what you have clearly told them on paper that you want. It drives me nuts because as an owner you may not catch it until it has cost you a hundred grand, since it just gets slipsheeted into some detail or RFI response with no communication back to you.

 

Dennis R. Julian, PE, ATD, Principal, Integrated Design Group, Inc.

Data center designs are detail oriented. Missing a relatively minor item (e.g., a control circuit) could result in shutting down the IT equipment. When schedules are compressed, it is more difficult, and requires more experienced design personnel, to flesh out the details, do the analysis, and provide the options and recommendations required for a successful solution.

There are pressures to stick with proven designs when:

  • Fees are low. Standard designs are used so less experienced personnel may be used to meet budgets.
  • Schedules are compressed. Reuse of existing designs and minimizing options and analysis speeds up completion of the design.

Good design saves capital and operating costs over the life of the facility, and those savings dwarf any savings in design fees. Selecting designers based on qualifications rather than fees, similar to the Brooks Act regulating the selection of engineers by the U.S. Federal government (Public Law 92-582, 92nd Congress, H.R. 12807, October 27, 1972), and allowing reasonable schedules makes time for discussion of the client’s goals and needs and for review of alternatives to find the most cost-effective solution based on total cost of ownership.




Julian Kudritzki joined the Uptime Institute in 2004 and currently serves as Chief Operating Officer. He is responsible for the global proliferation of Uptime Institute Standards. He has supported the founding of Uptime Institute offices in numerous regions, including Brasil, Russia, and North Asia. He has collaborated on the development of numerous Uptime Institute publications, education programs, and unique initiatives such as Server Roundup and FORCSS. He is based in Seattle, WA.

 

 

Matt Stansberry is director of Content and Publications for the Uptime Institute and also serves as program director for the Uptime Institute Symposium, an annual spring event that brings together 1,500 stakeholders in enterprise IT, data center facilities, and corporate real estate to deal with the critical issues surrounding enterprise computing. He was formerly Editorial Director for TechTarget’s Data Center and Virtualization media group, and was managing editor of Today’s Facility Manager magazine. He has reported on the convergence of IT and Facilities for more than a decade.

 


Decommissioning as a Discipline: Server Roundup Winners Share Success

How did these six enterprises find and eliminate so much waste?

Comatose IT equipment, servers long abandoned by application owners and users but still racked and running, are hiding in plain sight within even the most sophisticated IT organizations. Obsolete or unused servers represent a double threat in terms of energy waste—squandering power at the plug, but also wasting data center facility power and capacity.

Uptime Institute research circa 2009 states that decommissioning one rack unit (1U) of servers can result in savings of US$500 per year in energy costs, an additional US$500 in operating system licenses and US$1,500 in hardware maintenance costs. But reaping those rewards is no easy task.
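
Those per-1U figures make a quick business-case estimate easy. The following sketch is a hypothetical back-of-envelope calculator using the circa-2009 figures quoted above; the 20% comatose rate applied in the example comes from Uptime Institute's industry estimate, and the assumption that each removed server occupies one rack unit is ours.

```python
# Back-of-envelope annual savings from decommissioning, using
# Uptime Institute's circa-2009 per-1U figures (US$/year).
ENERGY = 500        # energy costs per 1U
LICENSES = 500      # operating system licenses per 1U
MAINTENANCE = 1500  # hardware maintenance per 1U

def annual_savings(servers_removed: int, units_per_server: int = 1) -> int:
    """Gross annual savings in US$; assumes every removed server is 1U."""
    return servers_removed * units_per_server * (ENERGY + LICENSES + MAINTENANCE)

# Illustrative example: a 10,000-server estate with ~20% comatose servers.
comatose = int(10_000 * 0.20)
print(f"{comatose} servers removed -> about US${annual_savings(comatose):,}/year")
```

At US$2,500 per 1U per year, even a modest cull pays back the audit effort quickly.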

According to Uptime Institute’s estimates based on industry experience, around 20% of servers in data centers today are obsolete, outdated or unused. That percentage may in fact be conservative.

According to one media report, LexisNexis found 50% of its servers were comatose in one of its audit samples. When the insurance firm Sun Life took back management of an outsourced data center engagement in 2011, it found 40% of its servers were doing absolutely nothing.

As early as 2006, Uptime Institute Founder Ken Brill identified comatose servers as one of the biggest opportunities for companies to improve overall IT energy efficiency. While Mr. Brill advocated for industry action on this issue, he often cautioned, “Nobody gets promoted for going around in the data center and unplugging servers.” Mr. Brill meant that data center professionals had no incentive to remove comatose machines and that IT executives lacked insight into the impact idle IT equipment was having on the cost structures of their organizations, as their departments do not pay the data center power bill.

The corporate disconnect between IT and Facilities Operations continues to challenge the data center industry. Data center managers need to overcome that organizational barrier and get executive level buy-in in order to implement an effective server decommissioning program.


Winners of Server Roundup at the Uptime Institute Symposium 2013

This is why Uptime Institute invited companies around the globe to help address and solve the problem of comatose servers by participating in the Server Roundup, an initiative to promote IT and Facilities integration and improve data center energy efficiency.

The annual Uptime Institute Server Roundup contest was launched in October 2011 to raise awareness about the removal and recycling of comatose and obsolete IT equipment in an effort to reduce data center energy use. In 2012, Uptime Institute named AOL and NBC Universal inaugural Server Roundup champions. AOL had removed nearly 10,000 obsolete servers, and NBC Universal culled 1,090 comatose machines, representing 29% of its overall IT footprint. The following year’s results were even more impressive.

2013 Winners and Finalists

WINNER: AOL won in back-to-back years for its overall tally of servers removed. The global Web services company decommissioned 8,253 servers in calendar year 2012. This produced (gross) total savings of almost US$3 million from reduced utility and maintenance costs and asset resale/scrap. Environmental benefits included reducing carbon emissions by more than 16,000 tons, according to AOL.

WINNER: Barclays, a global financial organization, removed 5,515 obsolete servers in 2012, gaining power savings of around 3 megawatts, US$3.4 million annualized savings for power, and a further US$800K savings in hardware maintenance.
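
Figures like these can be sanity-checked with simple arithmetic. The sketch below is our own illustration, not Barclays' accounting: it derives the electricity rate implied by roughly 3 MW of avoided load producing US$3.4 million in annualized power savings, assuming the load would have run 24x7.

```python
# What blended electricity rate is implied by the reported savings?
HOURS_PER_YEAR = 8760  # 24x7 operation

def implied_rate(kw_saved: float, annual_dollars: float) -> float:
    """US$ per kWh implied by a continuously avoided load."""
    return annual_dollars / (kw_saved * HOURS_PER_YEAR)

rate = implied_rate(3000, 3_400_000)
print(f"Implied blended rate: US${rate:.3f}/kWh")  # prints US$0.129/kWh
```

A result of roughly US$0.13/kWh is plausible once cooling and distribution overhead are folded in, which suggests the reported power and dollar figures are internally consistent.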

FINALIST: TD Bank removed 513 servers in 2012. The team from this Canadian financial firm removed 2,941 units in the 5 years they’ve been working to remove obsolete machines from the raised floor. Although the TD Bank annual server count does not approach the impressive numbers put up by AOL, the organization makes up for it in volume of waste that it diverts from local and municipal waste sites. All the equipment sent through the E-Waste recycler is salvaged within a 110-mile radius from TD Bank’s primary data centers. Nothing is shipped overseas for processing.


Server Roundup trophy belt buckle

FINALIST: McKesson pulled 586 servers in 2012, reducing data center power usage by 931.7 kilowatts and saving US$734,550.

FINALIST: Sun Life Financial removed 387 servers in 2012, which resulted in 32 kilowatts of power savings across three data centers and financial savings of US$8,800 per month.

Since the contest’s launch two years ago, Server Roundup participants have decommissioned and recycled 30,000 units of obsolete IT equipment.

In the sidebars, Server Roundup winner Paul Nally and finalist Rocco Alonzi discuss the challenges and benefits of a server-decommissioning program and detail their strategies for success.

Takeaways From Last Year’s Winners

During the 2013 Uptime Institute Symposium, last year’s winners provided the following advice:

  • Get senior management to buy into the program. “There is risk involved, but we need to get senior management buy-off on the risk,” Nally said. “There’s short-term risk and long-term risk. If you flip the wrong switch, and we have, you’ll cause an outage. But if you leave it on the wire to stagnate for five to six years, when it eventually dies, we will not be able to recover it.”
  • When you pitch server decommissioning to execs, discuss business impacts. “The easiest way to find yourself alone in an empty room is to call a meeting about server retirement,” Nally said. “People don’t understand the challenge. When we have the conversations with the C-level suite, we tell them what 5,000 servers means. We don’t talk in terms of kilowatts. We talk in terms of dollars.”
  • The biggest roadblock will be cultural. “Executives have other things on their roadmaps that are more interesting, like developing revenue. Getting buy-in requires getting people to commit to doing stuff they don’t like doing. People would rather move on to the next great thing, rather than dealing with the management problem they have,” said Scott Killian, Senior Technical Director of Data Center Services at AOL.
  • Get some help. “We brought in a couple of university students to do a book-to-floor audit of all the servers over three months under supervision of my group,” Alonzi said. “We took that information and started to cross-reference based on applications. All these data were about 80% accurate. Once we gathered all the information, we found question marks around a lot of hardware. There was work we had to do with our service providers, network people and storage guys. We literally had to drag people onto the raised floor and point to a cabinet or a bank of servers and say, ‘What are these doing?’”
  • Don’t be afraid to perform the “scream test.” “This is where you have a server that you know is not live, but you cannot find or establish the server owner. You pull the network cable from the back of the server and see who calls you to report the server being down and then investigate from there,” said Guy Pattison, Technical Solutions Officer, Data Center Management, TD Bank.
  • Document as much as possible. “Having a good DCIM is key. We have a backend system polling the servers to understand how machines are being used and who’s using them,” Nally said.
  • Keep up with incoming servers. “Any new hardware purchased comes through the data center operations group,” Alonzi said. “We don’t make a decision on what they’re buying, but we make sure it’s assigned to a project, and it’s not landing on the dock because the vendor was having a fire sale. Unless there’s a net new project or growth, we challenge more now.”

Paul Nally, Director at Barclays

“It has been said that the greenest data center is the one that’s never built. That is the main reason we have our server decommissioning program at Barclays. We are looking to shrink our data center footprint and benefit from the savings that this affords us, while allowing ourselves to massively expand our overall compute capability. When obsolete servers are removed in the thousands, it creates the capacity that we need to bring the next generation of systems in.

We save in space; we save in power. It helps us meet our carbon targets. When we eliminate or virtualize a server, we also save on network, SAN, and software costs. A server that may have cost US$100,000 seven years ago, took up half a rack of space, and required a couple of kW to run is absolutely crushed in compute performance by a modern blade costing US$5,000. But the benefits extend throughout the overall organization. A focus on removing these obsolete systems simplifies the environment from a network and systems administration perspective. Applications teams benefit from a more stable system that is easily maintained and integrated into contemporary management frameworks. We end up in a cleaner, safer, cheaper place with the capacity in hand that we need to continue to grow our business. There is real work, and some risk, in getting this job done, but the benefits are simply too many to ignore.”

Rocco Alonzi, AVP Data Center Governance at Sun Life Financial

“The removal of an under-utilized server sounds much easier than it really is. The thought of turning off a server and removing it from the raised floor can be overwhelming even if you are 100% certain that it is no longer required. Think about the process for a moment. As the server connections (electrical power, network, SAN storage) are removed and the server physically pulled out of a production cabinet, the hard drive data must be permanently destroyed and finally the server needs to be returned to the vendor or disposed of properly. The logical decommissioning involves an entirely separate process, so in the end it is much easier on everyone to simply leave the server powered on.

This is the message I communicated to the Leadership team followed by a solution and a promise. The solution included a dedicated resource (Contractor), asset database, and cooperation from the Server, Storage, and Network support teams. The contractor walked the raised floor performing an asset database book to raised-floor audit. And, yes, this did take some time, three months to be exact. This rich information was used to identify the servers that were not in the database but physically on the raised floor. We also challenged the support groups to associate their service offering with corresponding hardware infrastructure. These two exercises led to approximately 400 servers being switched off.

The promise was that the Data Centre Operations team would do all the work after the hardware device was switched off. This included working with the support groups to reclaim IP addresses, SAN storage ports, and electrical power cords. We also provided the Finance department with detailed hardware information so the reclaimed cost savings could be passed on to the business unit. Finally, a process was put into place to remove the physical server from the raised floor, destroy the data, and properly dispose of the hardware.

The message: raise awareness of the issue with the Leadership team and take a dedicated approach to decommissioning hardware infrastructure. It is well worth the effort.”



Matt Stansberry is director of Content and Publications for the Uptime Institute and also serves as program director for the Uptime Institute Symposium, an annual spring event that brings together 1,500 stakeholders in enterprise IT, data center facilities, and corporate real estate to deal with the critical issues surrounding enterprise computing. He was formerly Editorial Director for TechTarget’s Data Center and Virtualization media group, and was managing editor of Today’s Facility Manager magazine. He has reported on the convergence of IT and Facilities for over a decade.

 

Data center cost

Data Center Cost Myths: SCALE

What happens when economies of scale are a false promise?

By Chris Crosby

 

Chris Crosby

Chris Crosby is a recognized visionary and leader in the data center space, Founder and CEO of Compass Datacenters. Mr. Crosby has more than 20 years of technology experience and 10 years of real estate and investment experience. Previously, he served as a senior executive and founding member of Digital Realty Trust.  Mr. Crosby was Senior Vice President of Corporate Development for Digital Realty Trust, responsible for growth initiatives including establishing the company’s presence in Asia. Mr. Crosby received a B.S. degree in Computer Sciences from the University of Texas at Austin.

For many of us, Economics 101 was not a highlight of our academic experience. However, most of us picked up enough jargon to have an air of competence when conversing with our business compadres. Does “supply and demand” ring a bell?

Another favorite term that we don’t hesitate to use is “economies of scale.” It sounds professorial and is easy for everyone, even those of us who slept our way through the course, to understand. Technically the term means: The cost advantages that enterprises obtain due to size, throughput, or scale of operation, with cost per unit of output generally decreasing with increasing scale as fixed costs are spread out over more units of output.
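That textbook definition reduces to simple arithmetic: average cost per unit is the fixed cost spread over the volume, plus the variable cost per unit. A minimal sketch, using made-up numbers rather than any figures from this article:

```python
# Illustrative only: hypothetical fixed and variable costs, not data from the article.
def unit_cost(fixed_cost, variable_cost_per_unit, units):
    """Average cost per unit: fixed costs are spread over more units as scale grows."""
    return fixed_cost / units + variable_cost_per_unit

# With US$20M of fixed cost and US$5,000 of variable cost per unit:
small = unit_cost(20_000_000, 5_000, 1_000)   # 25,000.0 per unit
large = unit_cost(20_000_000, 5_000, 10_000)  # 7,000.0 per unit
```

Ten times the volume cuts the average unit cost from 25,000 to 7,000 here, which is exactly the intuition the rest of this article pushes back against.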

The metrics used in our world are usually expressed as cost/kilowatt (kW) of IT capacity and cost/square foot (ft2) of real estate. Some folks note all costs as cost/kW. Others simply talk about the data center fit out in cost/kW and leave the land and building (cost/ft2) out of the equation entirely. In both cases, however, economy of scale is the assumed catalyst that drives cost/ft2 and/or cost/kW ever lower. Hence the birth of data centers so large that they have their own atmospheric fields.

This model is used by providers of multi-tenant data centers (MTDCs), vendors of pre-fabricated modular units, and many enterprises building their own facilities. Although the belief that building at scale is the most cost-efficient data center development method appears logical on the surface, it does, in fact, rely on a fundamental requirement: boatloads of cash to burn.

It’s First Cost, Not Just TCO

In data center economics, no concept has garnered more attention, and less understanding, than Total Cost of Ownership (TCO). Entering the term “data center total cost of ownership” into Google returns more than 11.5 million results, so obviously people have given this a lot of thought. Fortunately for folks who write white papers, nobody has broken the code. To a large degree, the problem is the nature of the components that comprise the TCO calculus. Because of the longitudinal elements that are part of the equation, energy costs over time for example, the perceived benefits of design decisions sometimes hide the fact that they are not worth the cost of the initial investment (first cost) required to produce them. For example, we commonly find this to be the case in the quest of many operators and providers to achieve lower PUE. While certainly admirable, incomplete economic analysis can mask the impact of a poor investment. In other words, this is like my wife bragging about the money she saved by buying the new dining room set because it was on sale, even though we really liked the one we already had.

In a paper posted on the Compass website, Trading TCO for PUE?, Romonet, a leading provider of data center analytical software, illustrated the effect of failing to properly examine the impact of first cost on a long-term investment. Due to New Mexico’s favorable atmospheric conditions, Compass chose it as the hypothetical location for examining the value of using an adiabatic cooling system in addition to airside economization. This is a fairly common industry approach to free cooling. New Mexico’s climate is hot and dry and offers substantial free-cooling benefits in the summer and winter, as demonstrated by Figure 1.

Figure 1

Figure 1. Free cooling means that the compressors are off, which, on the surface, means “free” as they are not drawing electricity.

In fact, through the use of an adiabatic system, the site would benefit from over four times as many free-cooling hours as a site without one. Naturally, the initial reaction to this cursory data would be “get that cooling guy in here and give him a PO so I have a really cool case study to present at the next Uptime Institute Symposium.” And, if we looked at the perceived cost savings over a 10-year period, we’d be feeling even better about our US$500,000 investment in that adiabatic system, since it appears that it saved us over US$440,000 in operating expenses.

Unfortunately, appearances can be deceiving, and any analysis of this type needs to include a few things: discounting of future savings (otherwise known as net present value, or NPV) and the cost not only of system maintenance but also of the water used, and its treatment, over the 10-year period. When these factors are taken into account, it turns out that our US$500,000 investment in an adiabatic cooling system actually resulted in a negative payback of US$430,000! That’s a high price to pay for a tenth of a point in your PUE.
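The shape of the analysis Romonet performed can be sketched as a simple NPV calculation. The US$44,000/year of nominal savings below is just the article’s US$440,000 spread evenly over ten years; the discount rate and the water/maintenance figure are illustrative assumptions, not numbers from the study:

```python
# Sketch of the kind of first-cost vs. long-term-savings analysis described above.
# The discount rate and operating-cost inputs are illustrative placeholders,
# not the figures from the Romonet study.
def npv(cash_flows, discount_rate):
    """Net present value of a list of annual cash flows (years 1..n)."""
    return sum(cf / (1 + discount_rate) ** year
               for year, cf in enumerate(cash_flows, start=1))

first_cost = 500_000                   # upfront price of the adiabatic system
annual_energy_savings = 44_000         # nominal free-cooling savings per year
annual_water_and_maintenance = 30_000  # water, treatment, and upkeep (assumed)

net_annual = annual_energy_savings - annual_water_and_maintenance
payback = -first_cost + npv([net_annual] * 10, discount_rate=0.08)
# payback comes out well below zero: discounting plus operating costs
# can turn an apparent saving into a loss.
```

With these placeholder inputs the ten years of discounted net savings recover less than a fifth of the first cost, which is the same qualitative result the study reports.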

The point is that the failure to account for the long-term impact of an initial decision can permanently preclude companies from exercising alternative business, not just data center, strategies.

The Myth of Scale (a.k.a., The First Cost Trap)
Perhaps there is no better example of how the myth of scale morphs into the first cost trap than when a company elects to build out the entire shell of its data center upfront, even though its initial space requirements are only a fraction of the ultimate capacity. This is typically done using the justification that the company will eventually “grow into it,” and that it is necessary to build a big building because of the benefit of economy of scale. It’s important to note that this is also a strategy used by providers of MTDCs, and it doesn’t work any better for them.

The average powered core and shell (defined here as the land, four walls, and roof along with a transformer and common areas for security, loading dock, restrooms, corridors, etc.) of a data center facility typically ranges from US$20 million to upwards of US$100 million. The standard rationale for this “upfront” mode of data center construction is that this is not the “expensive” portion of the build and will be necessary in the long run. In other words, the belief is that it is logical to build the facility in its entirety because construction is cheap on a per-square-foot basis. Under this scenario, the cost savings are gained through the purchase and use of materials in a high enough volume that price reductions can be extracted from providers. The problem is that when the first data center pod or module is added the costs go up in additional US$10 million increments. In other words, in the best case it costs US$30 million minimum just to turn on the first server! Even modular options require that first generator and that first heat rejection and piping. First cost per kW is two to four times the supposed “end” point cost per kilowatt. Enterprises can pay two or three times more.

Option Value
This volume mode of cost efficiency has long been viewed as an irrefutable truth within the industry. Fortunately, or unfortunately, depending on how you look at things, irrefutable truths oftentimes prove very refutable. In this method of data center construction, what is gained is often less important than what has been lost.

Option value is the associated monetary value of the prospective economic alternatives (options) that a company has in making decisions. In the example, the company gained a shell facility that it believes, based on its current analysis, will satisfy both its existing and future data center requirements. However, the inexpensive (as compared to the fit out of the data center) cost of US$100-US$300/ft2 is still real money (US$20-US$100 million depending on the size and hardening of the building). The building and the land it sits on are now dedicated to the purpose of housing the company’s data center, which means that it will employ today’s architecture for the data center of the future. If the grand plan does not unfold as expected, this is kind of like going bust after you’ve gone all in during a poker game.

Free cooling in New Mexico

Figure 2. Estimated hours of free cooling at a hypothetical site in New Mexico.

Now that we have established what the business has gained through its decision to build out the data center shell, we should examine what it has lost. In making the decision to build in this way, the business has chosen to forgo any other use. By building the entire shell first, it has lost any future value of an appreciating asset: the land used for the facility. It cannot be used to support any other corporate endeavors, such as disaster recovery offices, and it cannot be sold for its appreciated value. While maybe not foreseeable, this decision can become doubly problematic if the site never reaches capacity and some portion of the building/land is permanently rendered useless. It will be a 39-year depreciating rent payment that delivers zero return on assets. Suddenly, the economy of scale is never realized, so the initial cost per kilowatt is the end-point cost.

For example, let’s assume a US$3-million piece of land and US$17 million to build a building of 125,000 ft2 that supports six pods at 1,100 kW each. At US$9,000 per kW for the first data center pod, we have an all-in of US$30 million for 1,100 kW, or over US$27,000 per kW. It’s not until we build all six pods that we get to the economy of scale that produces an all-in of US$12,000/kW. In other words, there is no economy of scale unless you commit to invest almost US$80 million! This is the best case, assuming the builder is an MTDC.
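The arithmetic in this example can be laid out directly, using the same figures as in the text:

```python
# Build-out arithmetic from the example above (US$3M land + US$17M building,
# six 1,100-kW pods at US$9,000/kW of fit-out cost).
shell_cost = 20_000_000   # land plus core-and-shell, paid upfront
pod_cost_per_kw = 9_000   # fit-out cost of each data center pod
pod_kw = 1_100            # capacity of one pod
total_pods = 6

def all_in_cost_per_kw(pods_built):
    """All-in cost per kW: the full shell cost plus only the pods actually built."""
    total = shell_cost + pods_built * pod_cost_per_kw * pod_kw
    return total / (pods_built * pod_kw)

first_pod = all_in_cost_per_kw(1)           # ~US$27,000/kW to turn on the first server
full_build = all_in_cost_per_kw(total_pods) # ~US$12,000/kW, after ~US$80M committed
```

The promised economy of scale only appears at the last pod; at the first pod, the shell dominates and the all-in cost per kW is more than double the end-point figure.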

It is logical for corporate financial executives to ask whether this is the most efficient way to allocate capital. The company has also forfeited any alternative uses for the incremental capital that was invested to manifest this all at once approach. Obviously once invested, this capital cannot be repurposed and remains tied to an underutilized depreciating asset.

10-year payback

Figure 3. Savings assume an energy cost of US$0.058/kWh, a 0.25 cooling overhead, and 1,250 kW of IT load

An Incremental Approach
The best way to address the shortcomings associated with the myth of scale is to construct data center capacity incrementally. This approach entails building a facility in discrete units that, as part of the base architecture, enable additional capacity to be added when it is required. For a number of reasons, until recently, this approach has not been a practical reality for businesses desiring this type of solution.

For organizations that elect to build their own data centers, the incremental approach described above is difficult to implement due to resource limitations. Lacking a viable prototype design (the essential element for incremental implementation), each project effectively begins from scratch and is typically focused on near-term requirements. Thus, the ultimate design methodology reflects the build it all at once approach as it is perceived to limit the drain on corporate resources to a one-time-only requirement. The typical end result of these projects is an extended design and construction period (18-36 months on average), which sacrifices the efficiency of capital allocation and option value for a flawed definition of expediency.

For purveyors of MTDC facilities, incremental expansion via standardized discrete units is precluded due to their business models. Exemplifying the definition of economies of scale found in our old Economics 101 textbooks, these organizations reduce their cost metrics by leveraging their size to procure discounted volume purchase agreements with their suppliers. These economies then translate into the need to build large facilities designed to support multiple customers. Thus, the cost efficiencies of MTDC providers drive a business model that requires large first-cost investments in data center facilities, with the core and shell built all at once and data center pods completed based on customer demand. Since MTDC efficiencies can only be achieved by reducing high first-cost investments by leasing capacity to multiple tenants or multiple pods to a tenant, they are forced to locate these sites in market areas that include a high population of their target customers. Thus, the majority of MTDC facilities are predominantly found within a handful of markets (e.g., Northern Virginia, New York/New Jersey, and the San Francisco Bay area) where a critical mass of prospective customers can be found. This is the predominant reason why they have not been able to respond to customers requiring data centers in other locations. As a result, this MTDC model requires a high degree of sacrifice to be made by the customers. Not only must they relinquish their ability to locate their new data center wherever they need it, they must pre-lease additional space to ensure that it will be available if they grow over time, as even the largest MTDC facilities have finite data center capacity.

Many industry experts view prefabricated data centers as a solution to this incremental requirement. In a sense, they are correct. These offerings are designed to make the addition of capacity a function of adding one or more additional units. Unfortunately, many users of prefabricated data centers experience problems from how these products are incorporated in designs. Unless the customer is using them in a parking lot, more permanent configurations require the construction of a physical building to house them. The end result of this need is the construction of an oversized facility that will be grown into, but that also suffers from the same first cost and absence of option value as the typical customer-constructed or MTDC facility. In other words, if I have to spend US$20 million on day one for the shell and core, how am I saving by only building in 300-kW increments instead of 1-MW increments like the traditional guys?

The Purpose-Built Facility
In order to effectively implement a data center strategy that avoids exorbitant first costs and the loss of option value, the facility itself must be designed for just such a purpose. Rather than relying on size as the method for cost reduction, the data center would achieve this requirement through the use of a prototype, replicable design. In effect, the data center becomes a product, with cost focus at the system level, not parts and pieces.

To many, the term “standard” is viewed as a pejorative that denotes a less than optimal configuration. However, as “productization” has shown with the likes of the Boeing 737, the Honda Accord, or the Dell PC, when you include the most commonly desired features at a price below non-standard offerings, you eliminate or minimize the concern. For example, features like Uptime Institute Tier III Design and Construction Certification, LEED certification, a hardened shell, and ergonomic features like a move/add/change optimized design would be included in the standard offering. This limits the scope of customer personalization to the data hall, branding experience, security and management systems, and jurisdictional requirements. This is analogous to car models that incorporate the most commonly desired features as standard, while enabling the customer to customize their selection in areas such as car color, wheels, and interior finish.

The resulting solution then provides the customer with a dedicated facility, including the most essential features, that can be delivered within a short timeframe (under six months from initial ground breaking) without requiring them to spend US$20-US$100 million on a shell while simultaneously relinquishing the option value of the remaining land. Each unit would also be designed to easily allow additional units to be built and conjoined, enabling expansion to be based on the customer’s timeframe and financial considerations rather than having it imposed on them by the facility itself or a provider.

Summary
Due to their historically limited alternatives, many businesses have been forced to justify the inefficiency of their data center implementations based on the myth of scale. Although solutions like pre-fabricated facilities have attempted to offer prospective users an incremental approach that negates the impact of high first costs and the elimination of alternatives (option value), ultimately they require the same upfront physical and financial commitments as MTDC alternatives. The alternative to these approaches is the productization of the data center, in which a standard offering that includes all of the most commonly requested customer features provides end users with a cost-effective option that can be grown incrementally in response to their individual corporate needs.

Industrialization, a la Henry Ford, ensures that each component is purchased at scale to reduce the cost per component. Productization shatters this theory by focusing on the system level, not the part/component level. It is through productization that the paradox of high-quality, low-cost, quickly delivered data centers becomes a reality.

Lessons Learned from the Tier Certification of an Operational Data Center

Telecom company Entel achieves first two Tier III Certification of Constructed Facility Awards in Chile

Empresa Nacional de Telecomunicaciones S.A. (Entel) is the largest telecommunications company in Chile. The company reported US$3.03 billion in annual revenue in December 2012, with an EBITDA margin of around 40%. In Chile, Entel holds a leading position in traditional data services and is growing through the integration of IT services.

The company also offers mobile and wireline services (including data center and IT services, Internet, local telephony, long distance and related services) and call center operations in Peru. Its standardized service offering to all companies is based on cloud computing.

To deliver these services, Entel has developed the largest mobile and fixed network in Chile, with 70 gigabits per second (Gbps) of capacity. The company serves 9.3 million mobile customers and offers an MPLS-IP network with wide national coverage and quality of service (QoS), carrying 8.8 Gbps of peak traffic in 2011. The company also provides international network connectivity (internet peak traffic of 17.7 Gbps in 2011).

Entel is also a large data center infrastructure provider, with more than 13,800 square meters (m2) in seven interconnected data centers (see Figure 1):

  • Ciudad de Los Valles
  • Longovilo (three facilities)
  • Amunátegui
  • Ñuñoa
  • Pedro de Valdivia

These facilities host more than 7,000 servers and 2,000 managed databases. Entel currently plans to add 4,000 m2 in two steps at its Ciudad de los Valles facilities. As part of its IT services, Entel also provides an operational continuity service to more than 80,000 PCs, POS devices, and terminals countrywide. Its service platform and processes are modeled under the ITIL framework, and Entel has SAS-70/II and COPC certifications.

Entel also provides processing services, virtualization, on-demand computing, SAP outsourcing, and other services. Entel has seen rapid growth in demand for IaaS platforms, which it meets with robust offerings across multiple platforms (iSeries, pSeries, Sun Solaris, and x86) and different tiers of storage. Finally, the company has a 38% market share of managed services to large financial service institutions.

Entel has defined its corporate strategy to enable it to continue to lead the traditional telco market as well as increase coverage and capacity by deploying a fiber optic access network (GPON). The company also wants to increase its market share in O/S, data center, and virtual and on-demand services and expand to other countries in the region, leveraging its experience in Chile.

Entel chose to invest in its own data centers for three reasons, two of which related directly to competitive advantage:

  • Investing in its own infrastructure would help Entel develop its commercial offering to corporate clients in Chile. Entel believed that having control over its basic infrastructure would enable it to guarantee the operational continuity of customers.
  • Entel also wanted to be able to add white space whenever it felt more capacity was needed. It feels this flexibility helped it win additional large deals.
  • The country does not have enough personnel with the experience to manage and coordinate facilities like Entel’s. The company finds it a challenge to coordinate and supervise subcontractors responsible for various systems.

Figure 1. General Scope of Entel’s Operations

As part of this process, Entel determined that it would work with the Uptime Institute to earn Tier Certification for Constructed Facility at its Ciudad de Los Valles Data Center. As a result, Entel learned several important lessons about its operations and also decided to obtain the Uptime Institute’s Tier Certification of Operational Sustainability.

Project Master Plan

The initial phase of the project included the construction of 2,000 m2 of floor space for servers. A second phase was recently inaugurated that added another 2,000 m2 of floor space for additional equipment. The complete project is intended to reach a total of 8,000 m2 of server floor space distributed in four buildings, each one with two floors of 1,000 m2, and with a total of 26,000 m2 of building space. (The general layout of the data centers is shown in Figure 2.)


Figure 2. General layout of the Entel facilities

The data centers had to meet four key requirements:

1. It had to house the most demanding technical and IT equipment: 1,500 kilograms (kg)/m2 in IT spaces and more than 3,000 kg/m2 in battery spaces, with special support for heavier equipment such as transformers, diesel generators, and chillers.

2. It had to achieve Uptime Institute Tier III requirements, meaning that the design includes two independent and separate power and cooling distribution paths, each able to support 100% of the capacity, allowing Concurrent Maintainability so that all the critical elements can be replaced or maintained without impacting service.

3. The building had to have sufficient capacities to meet high electrical and cooling demand (see Figure 3).

4. Service areas such as UPS and meet-me rooms had to be outside of the server rooms.

The structural design of the facility also had to address the threat of earthquake.

Entel decided to certify its data centers to meet its commercial commitments. Due to contractual agreements, Entel had to certify the infrastructure design and its facilities with an internationally recognized institution. In addition Entel wanted to validate its design and construction for audit processes and as a way to differentiate itself from competitors.


Figure 3. Comparison of two of Entel’s facilities.

The first step was to choose a standard to follow. In the end, Entel decided to certify Ciudad de Los Valles according to Tier Standards by the Uptime Institute because it is the most recognized body in the Chilean market and customers were already requesting it. As a result Entel’s facility was the first in Chile to earn Tier Certification of Constructed Facility.

Preparation

Ciudad de Los Valles Data Center is a multi-tenant data center that Entel uses to provide housing and hosting services, so its customers also had to be directly involved in project planning, and their authorization and approvals were important. At the time the Tier Certification of Design Documents began, the facility was in production, with important parts of its server rooms at 100% capacity. Modifications to the infrastructure had to be done without service disruptions or incident.

Tier III Certification of Constructed Facility testing had to be done at 100% of the electrical and air conditioning approved design load, so coordination was extremely challenging.

As part of the design review, the Uptime Institute consultants recommended some important improvements:

  • Separate common electrical feeds from the electrical generators
  • Additional breakers to isolate bus bars serving critical loads
  • Separate water tanks (chilled water tanks from building tanks)
  • Redundant electrical feeders for air handling units

Additionally, it was essential that all the servers and communication equipment have redundant electrical feeders to avoid incidents during performance tests.

Risks and Mitigation

Recognizing that Tier Certification testing would be challenging, Entel developed an approval process to help it meet what it considered four main challenges:

  • Meeting the Tier Certification timeline with a range of stakeholders to inform and coordinate
  • Preventing incidents due to activities necessary to test the infrastructure
  • Avoiding delays caused by coordination problems and obtaining approvals and permissions to modify the infrastructure
  • The possible existence of unidentified/undocumented single-corded servers and telecom equipment

The approval process included a high-level committee, including commercial and technical executives to ensure communications. In addition every technical activity had to be approved in advance, including a detailed work plan, timing, responsibilities, checkpoints, mitigation measures and backtracking procedures. And for the most important activities, the plan was presented to the most important customers.

The Certification process proved to be an excellent opportunity to improve procedures and test them. The process also made it possible to do on-the-job training and check the data center facilities at full load.

In addition, Entel completed the Tier Certification process of both design and facility without experiencing any incidents. However, changes made to meet Uptime Institute Tier III requirements led Entel to exceed its project budget. In addition, unidentified/undocumented single-corded equipment delayed work and the eventual completion of the project. In order to proceed, Entel had to visually check each rack in the eight data halls to identify the single-corded loads. Once the loads had been identified, Entel installed static transfer switches (STS) to protect the affected racks, which required coordinating the disconnection of the affected equipment and its reconnection to the associated STS.

Lessons Learned

Entel learned a number of important lessons as a result of Tier Certification of two data centers.

The most important conclusion is that it is absolutely possible to Tier Certify an in-service data center with good planning and validation, approval, and execution procedures. At the Ciudad de Los Valles 1 Data Center, Entel learned the importance of having a good understanding of the Tier Certification process scope, timing, and testing, and the necessity of having good server room equipment inventories.

As a result of the Tier Certification of Ciudad de Los Valles 2 Data Center, Entel can attest that it is easier to certify a new data center, with construction, commissioning, and certification all taking place in order before the facility goes live.

Next Steps

Entel’s next goal is to obtain the Uptime Institute’s Operational Sustainability Certification. For a service provider like Entel, robust infrastructure is not enough:

  • A good operation regime is as important as the design, construction and testing
  • It’s critical to have detailed maintenance plans and testing of the infrastructure equipment
  • Entel wants its own people to control operations
  • Entel needs well trained people

Entel is preparing for this Certification by adapting its staffing plan, maintenance procedures, training and other programs to meet Uptime Institute requirements.

In addition to its other ‘firsts’ in terms of Tier III Certification of Constructed Facility—twice over—Entel seeks to be the first data center to achieve Tier III Gold Certification.


Mr. Juan Miguel Durán, an electrical engineer, joined Entel Chile in September 2011. He has more than 20 years of experience, with strong knowledge of data centers, switching, and telecommunication services.

He is responsible for facilities operation and maintenance at the Ciudad de los Valles Tier III Certified Data Centers. Mr. Durán is also responsible for Entel’s Longovilo Data Centers.

He participated on the certification process team and has strong experience in data center operations, including the negotiation and implementation of long-term housing contracts according to international standards, as well as in planning and designing mission-critical data center infrastructure projects, including tender formulation, design, project evaluation, and management for different companies.