Blog Single Author Small - Uptime Institute Blog

Data Center AI (Artificial Intelligence) Creates New Risks

April 22, 2019/in Executive, Operations/by Rhonda Ascierto, Vice President, Research, Uptime Institute

Artificial intelligence (AI) is being used in data centers to drive up efficiencies and drive down risks and costs. But it also creates new types of risks. This is one of the findings from a recent Uptime Intelligence research report #25, “Very smart data centers: How artificial intelligence will power operational decisions”, published in April 2019 (and available to Uptime Institute Network members).

Some of these risks are not clear-cut. Take, for example, new AI-driven cloud services, such as data center management as a service (DMaaS), that pool anonymized data from hundreds or thousands of other customers’ data centers. They apply AI to this vast store of information and then deliver individualized insight to customers via a wide area network, usually the internet. But that raises a big question: Who owns the data, the supplier or the customer? The answer is usually both: customers can keep their own data but the supplier typically also retains a copy (even if the paid service stops, the data becomes an anonymous part of their data lake.)

Whether lack of certainty or clarity over data ownership actually constitutes a risk to data centers is vigorously debated. Some say that if hackers accessed data, it would be of little use as the data is anonymized and, for example, does not include specific location details. Others say hackers could apply techniques, including their own AI analysis, to piece together sensitive information to build up a fairly complete picture.

This is just one example of the risks that should at least be considered when deploying AI. In our new report, we describe four areas of risk with AI offerings:

Commercial risk: AI models and data are (often) stored in the public cloud and outside of immediate control (if using a supplier model) or may be on-site but not understood.

Commercial machine learning products and services raise the risk of lock-in because processes and systems may be built on top of models using data that cannot be replicated.
Pricing may be increasing as adoption grows — at present, prices are low to attract new data (to build up the effectiveness of AI models) or to attract equipment services or sales.
A high reliance on AI could change skills requirements or “deskill” staff positions, which could potentially be an issue.

Legal and service level agreement risk: Again, AI models and data are stored outside of immediate control (if using a supplier model) or may be on-site but not understood.

This may be unacceptable for some, such as service providers or organizations operating within strict regulatory environments.
In theory, it could also shift liability back to an Data Center AI service supplier — a particular concern for any automated actions provided by the service.

Technical risk: While we usually understand what types of data are being used for human actions and recommendations, it is not always possible to understand why and exactly how a machine reached a decision.

It may not be possible to easily change or override decisions.
As machines guide more decisions, core skills may become outsourced, leaving organizations vulnerable.

Interoperability risk and other “unknown unknowns”: The risk from the development of “2001” HAL scenarios (i.e., singularity) are over-played but there is an unknown, long-term risk.

One example is that Data Center AI is likely to be embedded in most cases (i.e., inside an individual equipment and management system). This could lead to situations where two or three or five systems all have some ability to take action according to their own models, leading to a potential runaway situation — or conflict with each other. For example, a building management system may turn up the cooling, while an IT system moves workload to another location, which turns up cooling elsewhere.

——————————————————————————–

For more on Artificial Intelligence in the data center read our latest report, Very smart data centers: How artificial intelligence will power operational decisions, available to members of the Uptime Institute Network. Members share a wealth of experiences with their peers from some of the largest companies in the world. Membership instills a primary consciousness about operational efficiency and best practices which can be put into action everyday. For membership information click here.

“Think Globally, Act Locally”: Re-Evaluating Data Center Resiliency in the Face of Climate Change

January 16, 2019/in Design, Executive, Operations/by Kevin Heslin

The 1970s-era environmental phrase “Think globally, act locally,” is an apt way for data center operators to consider the best approach to understand and address the effects of climate change on their facilities and IT operations TODAY.

Globally, we all hear that climate change threatens to bring warmer temperatures, stronger storms, rising sea levels, and more rain. It’s not too much of a stretch to agree that at the local level, this can mean increased flooding and droughts associated with higher ambient temperatures and winds, stronger lightning, and more humidity. So it’s not hard to understand that these changes are already posing significant implications for public infrastructures and real estate ventures, as well as their digital constituents including the delivery of IT services.

Uptime Institute advises data center operators and owners to evaluate their facilities and procedures considering the growing threat, and where possible, act to mitigate the risk from these environmental changes which will potentially impact their physical infrastructure along with their operational best practices currently in place.

Survey Results Show the Data Center Industry is Taking Action

In our 2018 survey of hundreds of data center owners and operators, we asked about climate change and what was being done to address it in their infrastructure and planning processes. Here’s what we saw:

Owners & Operators Should Refresh Plans Based on Latest Weather and Storm Models

To get started, data center operators should strive to create a baseline of service resiliency with climate change in mind. Many facilities have been designed to meet the level of flooding, storm, drought, or fire events as they were understood years ago. These facilities also may have adequate capacity to meet some amount of increased cooling demands from higher ambient temperatures, but the specific additional capacity may not be well understood relative to the needs brought about through climate change.

Resiliency assessments could reveal greater-than-expected vulnerabilities. For example, the operator of a 20-year-old data center originally built to withstand a 100-year storm as defined in the year 2000 may no longer withstand the greater flooding predicted to occur as part of a 100-year storm defined in 2020. For example, in Houston, TX, 19.5 inches of rain defined a 100-year storm as recently as 2000. Today, that same amount of rainfall is a 17.5-year event. By 2100, it is expected to be a 5.5-year event. As a corollary, 100-year events in the near future will involve far greater rainfall and flooding than we expect today. This evolving threat means that data centers slowly become less resilient and then suddenly are vulnerable, if planning and mitigation are not part of an ongoing planned process.

Hosting Providers Being Proactive, but Still Vulnerable to Underlying Risks

There are no shortcuts to these investigations. Moving to a hosting facility, does not necessarily lower an organization’s risk. Hosting facilities can also be just as vulnerable to a wide range of climate change impacts, and these providers must be subject to due diligence before hosting contracts are signed, just as in any other site evaluation process. And the questions which are part of due diligence should be more forward looking with climate change level stress provisions identified.

According to Uptime Institute’s 2018 survey, many hosting providers are taking climate change very seriously. The survey found that hosting providers are far more likely than any other vertical to have considered or taken a variety of climate change precautions than any other sector. As the chart below indicates, hosting providers (81%) were far more likely to say they were preparing for climate change than any other sector (industry average, 45%). They were also far more willing to re-evaluate technology selection.

	Colocation (%)	Telecommunications (%)	Financial (%)	Software/Cloud Services (%)	Industry Average (%)
Preparing for Climate Change	81	60	61	57	45
Willingness to Re-Evaluate Technology Selection	54	33	33	32	33

Data center operators have many other operational issues to consider as well. Hardening the physical facility infrastructure is a good first step, but data center operators must also re-examine their MOPs, SOPs, and EOPs, as well as their supporting vendors’ SLAs.

One good example of proper planning is when a major brokerage in New York City, during Superstorm Sandy, remained 100% operational because it had developed and followed suitable procedures and best practices for addressing conditions proactively. Uptime Institute found that many firms like this that switched to generator power in advance of an expected utility outage remained operational through the storm and continued to successfully operate if supply procedures were in place to guarantee on-going fuel availability. In addition, these firms provided provisions for local operational staff whose homes and families were threatened by the storm.

Preparation Requires Careful Evaluation of Partners as well as Local Infrastructure

But data centers do not work as islands. A data center encircled by water, even if operational, cannot support the business if it loses local network connectivity. Likewise, flooded roadways and tunnels can interrupt fuel oil deliveries and block staff access. The extent of these threats cannot be fully evaluated without including an extensive list of local conditions ranging from town and village highway closure reports to statewide emergency services response plans as well as those plans from all telecommunications providers and other utilities. These services each have their own disaster recovery plans.

The incremental nature of climate change means that resiliency assessments cannot be one-time events. Data center resiliency in the face of climate change must be re-assessed often enough to keep pace with changing weather patterns, updated local and state disaster planning, and telecommunications and local infrastructure.

Costs Expected to Rise as Resources Become More Scare and Risks Increase

Finally, data center operators must keep abreast of a laundry list of costs expected to rise as a result of climate change. For example, insurance companies have already begun to increase premiums because of unexpected losses due to unprecedented and anomalous events. In the same way, resources such as water are expected to be increasingly expensive because of shortages and local restrictions. Utilities and suppliers could also be affected by these rising costs, which are likely to cause the prices of diesel fuel and electricity to increase.

Now is the time to evaluate climate change threats on a local basis, facility by facility. Data center operators have this obligation to fully understand how climate change affects their facilities and their customers. They may find that a mix of solutions will be adequate to address their needs today, while other more radical solutions are needed for tomorrow. Climate change, or whatever you wish to call it, is here… now.

Related Resources from Uptime Institute:

NEXTDC: Obsessed with the details so our customers’ business is always available

September 27, 2018/in Design, Executive, Operations/by Brett Ridley

This is a guest post written by Brett Ridley, Head of Central Operations and Facility Management for NEXTDC.NEXTDC is Australia’s leading independent data centre operator with a nationwide network of Uptime Institute certified Tier III and Tier IV facilities. NEXTDC provides enterprise-class colocation services to local and international organisations.

If you are interested in participating in our guest post program, please contact [email protected].

Data centres provide much of the connectivity and raw computing power that drives the connected economy. Government departments, financial institutions and large corporates usually run at least some of their own IT infrastructure in-house but it’s becoming more common for them to outsource their mission critical infrastructure to a certified high-availability colocation data centre.

It wasn’t always like this. In the past, if a large organisation wanted to ensure maximum uptime, they would hire specialist engineers to design a server room in their corporate headquarters and invest capital to strengthen floors, secure doors and ensure sufficient supplies of connectivity and electricity.

So what changed?

For a start, the reliance on technology and connectivity has never been greater. More and more applications are mission critical and organisations are less tolerant of outages even for their secondary systems. In addition, advances in processor technology have resulted in much faster, smaller and denser servers. As servers got smaller and demand for computing and storage increased, organisations would pack more and more computing power into their server rooms. Server rooms also started growing, taking over whole floors or even entire buildings.

As the density of computing power inside the data centre increased, the power and cooling requirements become more specialised. If you have a room with a hundred regular servers and it affects the room temperature, maybe a portable A/C would keep staff happy because the servers wouldn’t need any additional cooling.

However, if that same room had a hundred of the latest multi-core blade server cabinets, not only would its power requirements have increased exponentially, in order to deal with the sheer amount of heat generated by the servers, the room would need to be fitted with specialist cooling and ventilation systems in order to avoid a complete hardware meltdown.

At this point, relatively few organisations find it desirable or cost effective to run their own data centre facilities.

Ensuring the infrastructure of a dense computing data centre is designed and maintained to a level where it is completely reliable is an ongoing, time consuming, tedious and extremely expensive process. It requires specialist, dedicated staff backed up by a committed management with deep pockets.

In the data centre world this is known as “operational sustainability”, and it’s the primary goal of all large data centre managers.

A list of requirements describing the best practices to ensure the operational sustainability of data centres have been developed by Uptime Institute, which was established with a mission to offer the most authoritative, evidence-based, and unbiased guidance to help global companies improve the performance, efficiency, and reliability of their business critical infrastructure.

More than 1,500 data centres have been certified by Uptime Institute, which meticulously examines every component of the data centre including its design, construction, management and procedures. Once an assessment is positively completed, the data centre is then certified with the appropriate Tier rating.

Uptime Institute Tier Ratings

Tier I
Basic site infrastructure

Tier II
Redundant capacity components site infrastructure

Tier III
Concurrently maintainable site infrastructure

Tier IV
Fault tolerant site infrastructure

Bronze, Silver & Gold Ratings
Tier Certification of Operational Sustainability awards receive a Bronze, Silver or Gold rating. These ratings signify the extent to which a data center is optimizing its infrastructure performance and exceeding the baseline Tier standards.

We’re extremely passionate about the data centres we build and operate and we are totally obsessed with ensuring they are staffed and maintained in an environment that minimises human error.

NEXTDC has become the first data centre operator in the Southern Hemisphere to achieve Tier IV Gold Certification of Operational Sustainability by Uptime Institute -NEXTDC’s B2 data centre in Brisbane received the Tier IV Gold Certification, highlighting the company’s excellence in managing long-term operational risks and behaviours, and showcasing its commitment to customers to be robustly reliable, highly efficient and ensuring 100% uptime.

The Gold Operational Sustainability standard recognises the human factors in running a data centre to meet fault tolerant standards. It includes climate-change preparedness and the growing need for edge computing, outage risk mitigation, energy efficiency, increasing rack density, and staffing trends. Achieving Gold certification requires a score of greater than 90% in all areas, Silver is 80%-89% and Bronze is 70%-79%.

The physical design and construction of a data centre can be solid but that’s only two-thirds of the story. Human error is the biggest challenge we face when it comes to outages, with around 80% of issues being sighted as accidental.

If staff are not properly trained and the correct processes are not in place, it doesn’t matter if the building and hardware are perfect, it’s only a matter of time before an outage will strike. This is why NEXTDC invests hundreds of thousands of dollars every year to educate its staff, partners and vendors in an effort to maximise operational sustainability.

Other data centres may claim they have procedures and trained staff but unless they’re regularly assessed by an independent third party and benchmarked against the best data centres on the planet, their claims are worthless.

To qualify for even the lowest Bronze certificate, data centres need to establish training programs for all of their staff, however, Uptime Institute also examines any risks posed by other users of our facility – the clients and partners.

We have quarterly training sessions for our operations team – our staff are tested and trained like no one in the industry and Uptime Institute requires evidence of that. They want to know about all the testing and training carried out, they want to know if we have hired any new staff and they will check to ensure new hires have completed the necessary training and a competency-based assessment.

We need to know that our national partners, for example someone like Nilsen Networks, know what they’re doing. We hold regular training days for them on our MOPs (Method of Operations) and are required to show evidence of this training to the Uptime Institute. They also want to examine our maintenance records to show that the procedures are being followed to the letter.

We have a procedure for everything — it’s all written down and laid out. We’ve colour-code our folders, we’ve got the command centre set up and we make our staff and partners practice over and over and over again to ensure that, during an emergency, when stress levels are high, they are far less likely to make costly mistakes.

This dedication to details in the whole process, from design, construction, staffing and maintenance of our facilities is what sets NEXTDC apart from alternative data centre operators. We pay attention to all the details to ensure that your business remains connected and is available 100% of the time.

It’s the extra sleep-at-night factor that you don’t get with anybody else. The training and skillset of NEXTDC staff matches the design and engineering excellence of our buildings.

Application Resiliency vs. Infrastructure Resiliency

August 14, 2018/in Design, Executive/by Todd Traver

Software enabled application resiliency is now playing a significant and increasing role in bolstering applications availability and reliability across the enterprise, reducing risk to the business. No longer are clients solely reliant upon the stability provided by the electrical and mechanical systems in their data center. By utilizing new software techniques, enterprises are now able to deploy applications that span multiple instances from enterprise to co-location to cloud, that bolster the availability and reliability of their critical applications.

Hybrid Deployments Create Patchwork of Dependencies on Software Agility, Network Infrastructure and Cloud Partners

While we’ve not yet seen any enterprises that rely solely upon software techniques to provide reliability, we have noted that many leaders are now complementing their data center reliability efforts with some form of software agility capabilities. And as more clients move some of their workloads to the cloud, they are beginning to utilize built-in features such as high-availability and reliability zones to augment their applications’ reliability beyond the capabilities of the individual sites themselves.

Additionally, these same clients are starting to replace or augment their requirement for risk management through disaster recovery with the high resiliency provided by availability zones from cloud providers. And bleeding-edge adopters are beginning to utilize advanced new work distribution technologies such as Google’s Cloud Spanner, which allows clients to scale globally with low latency while still preserving transactional consistency, by utilizing data that is automatically shared amongst several data centers to ensure integrity.

Keep in mind that as clients move applications and data to edge locations they can now purchase newly developed cloud services that have recently come on the market which utilize micro data centers connected in a grid formation, to create a virtualized data center that can span an entire city geography, with impressive overall performance and availability! This type of implementation requires highly specialized orchestration software, as well as a dedicated low latency fiber network that is very carefully designed, implemented and automated, to provide the high level of service required.

Cloud Providers Become the New Owners of Infrastructure Resiliency

Given all these advances in software related agility, it must be noted that all these cloud and edge providers themselves still continue to maintain highly reliable underlying electrical and mechanical foundations for their data center facilities. And since the connecting network now plays a much bigger and more critical role in overall application resiliency, it too requires the same level of focus on designed-in redundancy, reliability and fault tolerance, as that traditionally given to data center infrastructure design. Risk management comes in so many forms to be sure.

So overall, there’s a big new world of options and approaches when it comes to applications resiliency design, with most enterprises still using a belt and suspenders approach of software and hardware to reduce risk and ensure resiliency and reliability. But with new cloud services providing increasingly more self-service capabilities, it’s becoming critically important for customers to clearly evaluate their modern digital business requirements which can then be used to map out a strategy that provides the highest level of availability and resiliency at a cost which is aligned with the business itself.

And with so much at stake, the risk management aspects of hybrid infrastructures should not be ignored just because they are hard to quantify. Your very business is at risk if you don’t. Remember, the measure of a great leader is one that is not afraid to ask for help. Ask for help if you need it!

Mission Critical Computing Fabric

July 30, 2018/in Executive, Operations/by Mark Harris

We’ve entered an era where our IT infrastructures are now becoming a compilation of capacity that is spread out and running upon a wide range of platforms; some we completely control, some we control partially and some we don’t control at all. No longer should our IT services discussions start with ‘And in the data center we have…’, but instead they need to center around mission critical business applications and/or transactions that are provided by ‘the fabric’.

Fabric Computing

Who would have thought that all of us long-time ‘data center professionals’ would now be on the hook to deliver IT services using a platform or set of platforms that we had little or no control over? Who would have thought we’d be building our infrastructures like fabric, weaving various pieces together like a finely crafted quilt? But yet here we are, and between the data centers we own, the co-locations we fill and the clouds we rent, we are putting a lot of faith in a lot of people’s prowess to create these computing quilts or fabrics.

We all know that the executive committee will ask us regularly, “We have now transformed to be digital everything. How prepared we are to deliver these essential business critical services?”, and we in turn know that we must respond with a rehearsed confirmation of readiness. The reality is we are really crossing our fingers and hoping that the colo’s we’ve chosen and our instances in the Cloud we’ve spun up won’t show up on the 6 o’clock news each night. We simply have less and less control as we outsource more and more.

A big challenge to be sure. What we need to do is to focus on the total capacity needed and identify the risk tolerance for each application, and then look at our hybrid infrastructure as a compilation of sub-assemblies which each have their own characteristics for risk and cost. While it’s not simple math to figure out our risk and cost, it *IS* math that needs to be done, application by application. Remember I can now throw nearly any application into my in-house data centers, or spin them up in a co-location site, and even burst up to the cloud on demand. The user of that application would NOT likely know the difference in platform, yet the cost and risk to process that transaction would vary widely.

But we have SLAs to manage all of this 3rd party risk, right? Nope. SLAs are part of the dirty little secret of the industry which essentially says what happens when a third-party fails to keep things running. Most SLA agreements spend most of the prose explaining what the penalties will be WHEN the service fails. SLAs do not prevent failure, they just articulate what happens when failures occur.

Data Center Tools

So this now becomes a pure business discussion about supporting a Mission Critical ‘Fabric’. This fabric is the hybrid infrastructures we are all already creating. What needs to be added to the mix are the business attributes of cost and risk and for each, a cost calculation and risk justification for why we have made certain platform choices. Remember, we can run nearly ANY application in any one of the platform choices described above, so there must be a clear reason WHY we have done what we have done, and we need to be able to articulate and defend those reasons. And we need to think about service delivery, when it spans multiple platforms and can actually traverse from one to another over the course of any given hour, day or week. Its all a set of calculations!

Put your screwdrivers away and fire up your risk management tools, your financial modelling tools, or even your trusty copy of Excel! This is the time to work through the business metrics, rather than the technical details.

Welcome to the era of Mission Critical Computing Fabric!

“Ask the Expert”: Data Center Management Q&A with Uptime Institute CTO, Chris Brown

May 11, 2018/in Operations/by Travis Pearl

On April 24th, Uptime Institute CTO, Chris Brown, participated in an “Ask the Expert” session on Data Center Infrastructure Management with BrightTALK senior content manager Kelly Harris.

The 45-minute session covered topics ranging from best practices of a well-run data center to a list of the most common data center issues Uptime Institute’s global consultant team is seeing in the field every day. Toward the end of the session, Harris asked Brown for his views on the challenges and changes facing the data center industry. His answers sparked a flurry of questions from the audience focusing on operations, edge computing, hybrid-resiliency strategies and new cyber security risks facing data center owners and operators.

You can view a recording of the full session on our website here: https://uptimeinstitute.com/webinars/webinar-ask-the-expert-data-center-infrastructure-management

We have captured the top twelve audience questions and provided answers below:

If you could only focus on one thing to improve data center performance, what would that be?

Focus on improving operations discipline and protocols through training, procedures and support from management. A well run data center can overcome shortcomings of the design, but a poorly run data center can easily jeopardize a great design. Many people have considered the challenges they see in their infrastructure to be technology driven, but we are seeing a huge swing towards a focus on operational excellence, and the ability to defend the operational choices made.

What is Uptime seeing in the way of IT density growth in the industry?

We are seeing a big delta in what people are anticipating in the future vs. what they are practicing today. New facilities are being designed to support high density, but not a lot of density growth is showing up yet. Most of the density growth in the industry is coming from the hyperscalers right now (Facebook, Google, etc). One of the factors that is being much more heavily leveraged is the ability for virtualization and software-defined technology to create virtual infrastructure without the need for additional physical resources. So the growth in tangible space, power and cooling may appear to be slowing down, but in fact the growth in processing capacity and work processing is actually rapidly increasing.

Is IoT affecting density at this point?

Right now, we are seeing IoT drive the amount of compute that is needed but, at this moment, we don’t see it driving a lot of increases in density in existing sites. IoT has a main premise that the first layer of computing can and should be moved closer to the origination point of data itself, so we are seeing various edge computing strategies coming forward to support this goal. The density of those edges may not be a factor, but the existence of the edge is. In developing regions, we’re seeing more data centers being built to house regional demand vs. a dramatic increase in density. At some point in the future, we’ll run out of physical space to build new data centers and, at that time, I’d expect to see density increase dramatically but it isn’t happening yet.

Do you know if anyone is working on writing a spec/standard for “Edge”?

There are a number of entities out there trying to put together a standard for edge, but there hasn’t been a lot of traction thus far. That said, the focus on Edge computing is critically important to get right.

At Uptime Institute, we’re working with the edge vendors to keep the focus of Edge where it needs to be – delivering required business services. Historically, many in the industry would think Edge data centers are small and, therefore, less important. Uptime Institute takes the position that Edge data centers are already becoming a major piece of the data center strategy of any company, so the design, construction, equipment and implementation is just as important in an edge facility as it is in a centralized facility.

Are these edge data centers remotely managed?

Yes, most Edge data centers are remotely managed. Edge data centers are typically small, with perhaps 1 to 22 racks and physically placed near the demand point, so it is usually cost prohibitive to man those facilities. Various layers of hardware and software move workloads in and out of physical locations, so the amount of on-site management needed has been greatly reduced.

How is data center infrastructure management and security changing?

On the IT side, the devices that are facing the end user and providing compute to the end user have been focusing on cyber security for many years – lots of effort being put on securing and making these systems robust. Literally millions of dollars of cyber protection devices are now installed in most data centers to protect the data and applications from intrusion.
But, one of the things we are seeing is that the management of the building control systems is becoming more IP-based. MEP equipment and their controllers are connected to building networking systems. Some data centers are placing these management systems on the same company internet/intranet as their other production systems and using the same protocols to communicate. This creates a situation where building management systems are also at risk because they can be accessed from the outside, but are not as protected because they are not considered mainstream data. (Even air-gapped facilities are not safe because someone can easily bring malware in on a technician’s laptop and hook it up to the IT equipment, then that malware can replicate itself across the facility through the building management system and infrastructure.)

Consequently, we are starting to see more data center facilities apply the same security standards to their internal systems as they have been applying to their customer-facing systems for the last several years at address this new risk.

Will the Uptime Institute Tier levels be updated to account for the geographic redundancies that newer network technologies allow owners to utilize?

Uptime Institute has been looking into distributed compute and resiliency and how that applies to standard tier levels. The Uptime Institute Tier levels apply to a specific data center and focus to ensure that specific data center meets the level of resilience needed at the component level. Tier levels do not need to be updated to focus on hybrid resiliency. The measure of hybrid resiliency is based on achieving resiliency and redundancy across the individual components within a portfolio, as viewed from the business service delivery itself. We liken this metric to that of the calculation of MTBF for complex systems which is a calculation of the individual components when viewed at the complete system level.

What is the most important metric of a data center (PUE, RCI, RTI, etc)?

If you are looking at a single metric to measure performance, then you are probably approaching the data center incorrectly. PUE looks at how efficiently a data center can deliver a KW to the IT equipment, not how effective the IT equipment is actually being utilized. For example, UPS systems increase in efficiency with load so more total load in the data center can improve your PUE, but if you are only at 25% utilization of your servers, then you are not running an efficient facility despite having a favorable PUE. This is an example of why single metrics are rarely an effective way to run a facility. If you’re relying on a single metric to measure efficiency and performance, you are missing out on a lot of opportunity to drive improvement in your facility.

How are enterprises evaluating risk regarding their cloud workloads? For example, if their on-premises data center is Tier IV – how do they assess the equivalent SLA for cloud instances?

There are two primary ways that Enterprises can reduce risk and create a “Tier IV-like” cloud environment. The first and increasingly popular way is by purchasing High Availability (HA) as-a-service from the cloud provider such as Rackspace or Google. The second way is by the enterprise architecting a bespoke redundancy solution themselves using a combination of two or more public or private cloud computing instances.

The underlying approach to creating a high availability cloud-based service is fundamentally the same, with the combination having redundancy characteristics similar to that of a “Tier IV” data center. In practice, servers are clustered with a load balancer in front of them, with the load balancer distributing requests to all the servers located behind it with in that zone, so that if an individual server fails, the workload will be picked up and executed non-disruptively by the remaining servers. This implementation will often have an additional server node (N+1) installed to what is actually required, so that if a single node fails, the client won’t experience increased latency and the remaining systems are at a lower risk of being over-taxed. And this same concept can be applied across dispersed regions to account for much larger geographic outages.

This approach ensures that data will always continue to be available to clients, in the event that a server or an entire region or site fails. Enterprises can further strengthen their high availability capabilities by utilizing multiple cloud providers across multiple locations, which greatly reduces the chances of single provider failure, and where even the chance of planned maintenance windows overlapping between providers is very small.

What is the best range of PUE, based on Uptime Institute’s experience?

There is no right or wrong answer in regard to what is the best range for PUE. PUE is a basic metric to gauge mechanical and electrical (M&E) infrastructure energy efficiency, intended to be just that, a gauge so that operators can better understand where their infrastructure energy efficiencies lie in order to establish a plan for continuous improvement. We have, however, seen data centers that utilize very efficient M&E equipment, but still have PUEs higher than 1.6, because of inefficiencies in IT hardware management. In addition, the PUE metric is not tied whatsoever to the actual work being done, so although a data center may exhibit an attractive PUE, it may not be doing any work at all!

The value of the metric is not as important as the acknowledgement and the continuous improvement plan. Uptime Institute recommends a business results approach to improving overall IT efficiency, where PUE is a very small part of that overall assessment. Efficient IT is all about following a sustainable approach led from the top down, addressing IT operations and IT hardware utilization, with a small percentage addressing the data center and PUE, that follows a documented continuous improvement approach.

I would like to know if there are general guidelines for decommissioning a data center.

There are no general guidelines for when to decommission a data center because business circumstances are different for practically every data center. The determination for decommissioning a data center can depend on a number of facility factors that can lead or assist the business in making a decision, infrastructure age typically being a leading factor. However, the impact of cloud and edge computing, and hybrid IT, have caused IT strategy changes that have recently caused many data centers to be decommissioned.

We have also seen a lot very well-maintained data centers that are older than 20+ years, where the companies presently have no intent of decommissioning the sites. These companies typically have an IT data center strategy plan in place and are tracking against plan. IT and data center investment decisions are made with this plan in mind. The key is to make sure that the IT data center strategy plan is not developed in a vacuum by just IT, Finance, or by just Facilities/Real Estate. The best IT data center strategy is developed with input from all of these groups, creating a partnership. The IT data center strategy should also be a living plan, reviewed and adjusted as necessary on a regular basis.

Can you speak to where you see multi data center DCIM tools going over the next year or two?

Multi-tenant DCIM is likely to evolve from basic isolated power and environmental monitoring features (including alarming and reporting) to also include facilities asset management features including change and configuration management. Multi-tenant data center providers that offer remote hands services will, in particular, make use of DCIM asset management – to enable customers to remotely track the on-site work being done, including with auditable workflows.

Looking forward, service delivery will be measured with qualitative metrics, which identify not just IF a service is available, but at what cost, and at what capacity. Hence, DCIM will begin to include full-stack analytics to understand how work is hosted and keep track of it as it migrates. And to get there, Multi-tenant DCIM will likely also start to include ‘out-of-the-box’ pre-built connectors to other management software tools, such as ITSM and VM management, so that customers can match specific workloads to physical data center assets, enabling end-to-end costing, ‘right-sized’ asset/workload provisioning, etc.

You can watch the full “Ask the Expert” session with Uptime Institute CTO, Chris Brown, by visiting the recorded session page on our website at:
https://uptimeinstitute.com/webinars/webinar-ask-the-expert-data-center-infrastructure-management