“Ask the Expert”: Data Center Management Q&A with Uptime Institute CTO, Chris Brown

On April 24th, Uptime Institute CTO, Chris Brown, participated in an “Ask the Expert” session on Data Center Infrastructure Management with BrightTALK senior content manager Kelly Harris.

The 45-minute session covered topics ranging from best practices of a well-run data center to a list of the most common data center issues Uptime Institute’s global consultant team is seeing in the field every day. Toward the end of the session, Harris asked Brown for his views on the challenges and changes facing the data center industry. His answers sparked a flurry of questions from the audience focusing on operations, edge computing, hybrid-resiliency strategies and new cyber security risks facing data center owners and operators.

You can view a recording of the full session on our website here: https://uptimeinstitute.com/webinars/webinar-ask-the-expert-data-center-infrastructure-management

We have captured the top twelve audience questions and provided answers below:

If you could only focus on one thing to improve data center performance, what would that be?

Focus on improving operations discipline and protocols through training, procedures and support from management. A well-run data center can overcome shortcomings in the design, but a poorly run data center can easily jeopardize a great design. Many people have considered the challenges they see in their infrastructure to be technology-driven, but we are seeing a huge swing toward a focus on operational excellence and the ability to defend the operational choices made.

What is Uptime seeing in the way of IT density growth in the industry?

We are seeing a big delta between what people are anticipating for the future and what they are practicing today. New facilities are being designed to support high density, but not a lot of density growth is showing up yet. Most of the density growth in the industry is coming from the hyperscalers right now (Facebook, Google, etc.). One factor being leveraged much more heavily is the ability of virtualization and software-defined technology to create virtual infrastructure without the need for additional physical resources. So while growth in tangible space, power and cooling may appear to be slowing down, growth in processing capacity and the amount of work being processed is actually increasing rapidly.

Is IoT affecting density at this point?

Right now, we are seeing IoT drive the amount of compute that is needed but, at this moment, we don't see it driving a lot of increases in density in existing sites. A main premise of IoT is that the first layer of computing can and should be moved closer to the point where the data originates, so we are seeing various edge computing strategies coming forward to support this goal. The density of those edge sites may not be a factor, but their existence is. In developing regions, we're seeing more data centers being built to house regional demand rather than a dramatic increase in density. At some point in the future, we'll run out of physical space to build new data centers and, at that time, I'd expect to see density increase dramatically, but it isn't happening yet.

Do you know if anyone is working on writing a spec/standard for “Edge”?

There are a number of entities out there trying to put together a standard for edge, but there hasn’t been a lot of traction thus far. That said, the focus on Edge computing is critically important to get right.

At Uptime Institute, we’re working with the edge vendors to keep the focus of Edge where it needs to be – delivering required business services. Historically, many in the industry would think Edge data centers are small and, therefore, less important. Uptime Institute takes the position that Edge data centers are already becoming a major piece of the data center strategy of any company, so the design, construction, equipment and implementation are just as important in an edge facility as they are in a centralized facility.

Are these edge data centers remotely managed?

Yes, most Edge data centers are remotely managed. Edge data centers are typically small, with perhaps 1 to 22 racks and physically placed near the demand point, so it is usually cost prohibitive to man those facilities. Various layers of hardware and software move workloads in and out of physical locations, so the amount of on-site management needed has been greatly reduced.

How is data center infrastructure management and security changing?

On the IT side, the devices that face the end user and provide compute to the end user have been a cyber security focus for many years, with a lot of effort put into securing these systems and making them robust. Literally millions of dollars of cyber protection devices are now installed in most data centers to protect the data and applications from intrusion.
But one of the things we are seeing is that the management of building control systems is becoming more IP-based. MEP equipment and its controllers are connected to building networking systems. Some data centers place these management systems on the same company internet/intranet as their other production systems and use the same protocols to communicate. This creates a situation where building management systems are also at risk because they can be accessed from the outside, yet they are less protected because they are not considered mainstream data. (Even air-gapped facilities are not safe, because someone can easily bring malware in on a technician’s laptop, hook it up to the IT equipment, and that malware can then replicate itself across the facility through the building management system and infrastructure.)

Consequently, to address this new risk, we are starting to see more data center facilities apply the same security standards to their internal systems as they have been applying to their customer-facing systems for the last several years.

Will the Uptime Institute Tier levels be updated to account for the geographic redundancies that newer network technologies allow owners to utilize?

Uptime Institute has been looking into distributed compute and resiliency and how they apply to the standard Tier levels. The Uptime Institute Tier levels apply to a specific data center and focus on ensuring that the specific data center meets the level of resilience needed at the component level. The Tier levels do not need to be updated to address hybrid resiliency. Hybrid resiliency is measured by the resiliency and redundancy achieved across the individual components within a portfolio, as viewed from the delivery of the business service itself. We liken this to calculating MTBF for a complex system, which combines the figures for the individual components when viewed at the complete system level.
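To make that analogy concrete, here is a minimal sketch of the series-MTBF arithmetic, assuming independent components with constant failure rates; the component names and hour figures are hypothetical illustrations, not Uptime Institute data:

```python
# Minimal sketch: system MTBF from component MTBFs, assuming independent
# components with constant failure rates arranged in series (any single
# component failure takes the service down). Numbers are illustrative only.

component_mtbf_hours = {
    "utility_feed_and_switchgear": 200_000,
    "ups_system": 250_000,
    "cooling_plant": 150_000,
    "network_path": 300_000,
}

# For a series system, failure rates (1/MTBF) add.
system_failure_rate = sum(1.0 / mtbf for mtbf in component_mtbf_hours.values())
system_mtbf = 1.0 / system_failure_rate

print(f"System failure rate: {system_failure_rate:.2e} failures/hour")
print(f"System MTBF: {system_mtbf:,.0f} hours")
# Adding redundancy (e.g., a second UPS path) changes the math; the portfolio
# view described above asks the same question across sites and cloud venues,
# not just within one facility.
```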

What is the most important metric of a data center (PUE, RCI, RTI, etc.)?

If you are looking at a single metric to measure performance, then you are probably approaching the data center incorrectly. PUE looks at how efficiently a data center can deliver a kW to the IT equipment, not how effectively the IT equipment is actually being utilized. For example, UPS systems increase in efficiency with load, so more total load in the data center can improve your PUE; but if your servers are only at 25% utilization, then you are not running an efficient facility despite having a favorable PUE. This is an example of why single metrics are rarely an effective way to run a facility. If you’re relying on a single metric to measure efficiency and performance, you are missing a lot of opportunity to drive improvement in your facility.
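As a rough illustration of that point, the sketch below compares two hypothetical sites; the power figures and the "useful work" proxy are illustrative assumptions, not a prescribed Uptime Institute calculation:

```python
# Minimal sketch: why PUE alone can mislead. All figures are hypothetical.

def pue(total_facility_kw, it_load_kw):
    """PUE = total facility power / IT equipment power."""
    return total_facility_kw / it_load_kw

# Site A: heavily loaded facility, favorable PUE, but servers only 25% utilized.
site_a = {"total_kw": 1300, "it_kw": 1000, "avg_server_utilization": 0.25}
# Site B: slightly worse PUE, but servers doing far more useful work per kW.
site_b = {"total_kw": 700, "it_kw": 500, "avg_server_utilization": 0.70}

for name, site in (("A", site_a), ("B", site_b)):
    p = pue(site["total_kw"], site["it_kw"])
    # Rough proxy for useful work delivered per facility kW drawn.
    work_per_facility_kw = (site["it_kw"] * site["avg_server_utilization"]) / site["total_kw"]
    print(f"Site {name}: PUE={p:.2f}, useful IT work per facility kW={work_per_facility_kw:.2f}")

# Site A reports the better PUE, yet Site B extracts more useful work from
# every kW it draws, which is the point made above about single metrics.
```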

How are enterprises evaluating risk regarding their cloud workloads? For example, if their on-premises data center is Tier IV – how do they assess the equivalent SLA for cloud instances?

There are two primary ways that enterprises can reduce risk and create a “Tier IV-like” cloud environment. The first, and increasingly popular, way is to purchase High Availability (HA) as a service from a cloud provider such as Rackspace or Google. The second is for the enterprise to architect a bespoke redundancy solution itself, using a combination of two or more public or private cloud computing instances.

The underlying approach to creating a high availability cloud-based service is fundamentally the same, with the combination having redundancy characteristics similar to those of a “Tier IV” data center. In practice, servers are clustered behind a load balancer, which distributes requests to all the servers within that zone, so that if an individual server fails, the workload is picked up and executed non-disruptively by the remaining servers. This implementation will often have one more server node (N+1) installed than is actually required, so that if a single node fails, the client won’t experience increased latency and the remaining systems are at lower risk of being over-taxed. The same concept can be applied across dispersed regions to account for much larger geographic outages.
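A toy sketch of that pattern follows; the request rates, node capacities and round-robin logic are illustrative assumptions, not any particular provider’s implementation:

```python
# Minimal sketch of the pattern described above: requests are spread across a
# cluster sized N+1, so losing one node leaves enough capacity to carry the
# full workload. Toy illustration only.

from itertools import cycle

REQUIRED_CAPACITY_RPS = 900      # hypothetical demand, requests per second
NODE_CAPACITY_RPS = 300          # hypothetical per-node capacity

# N nodes cover the demand; provision one extra (N+1).
n_required = -(-REQUIRED_CAPACITY_RPS // NODE_CAPACITY_RPS)  # ceiling division
nodes = [f"node-{i}" for i in range(n_required + 1)]
healthy = {node: True for node in nodes}

def dispatch(ring=cycle(nodes)):
    """Round-robin to the next healthy node, skipping failed ones."""
    for _ in range(len(nodes)):
        node = next(ring)
        if healthy[node]:
            return node
    raise RuntimeError("no healthy nodes available")

# Simulate a single-node failure: the remaining N nodes still meet demand.
healthy["node-0"] = False
surviving_capacity = sum(NODE_CAPACITY_RPS for n in nodes if healthy[n])
assert surviving_capacity >= REQUIRED_CAPACITY_RPS
print([dispatch() for _ in range(6)])   # node-0 never appears
```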

This approach ensures that data will continue to be available to clients in the event that a server, or an entire region or site, fails. Enterprises can further strengthen their high availability capabilities by utilizing multiple cloud providers across multiple locations, which greatly reduces the chance of a single provider failure, and where even the chance of planned maintenance windows overlapping between providers is very small.
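As a back-of-the-envelope illustration of the multi-provider point, assuming the providers fail independently and using hypothetical availability figures:

```python
# Minimal sketch: with independent providers, the chance that both are down at
# once is the product of their individual unavailabilities. Figures are
# hypothetical.

provider_availability = {"provider_a": 0.999, "provider_b": 0.9995}

unavailability = 1.0
for a in provider_availability.values():
    unavailability *= (1.0 - a)          # both must fail simultaneously

combined_availability = 1.0 - unavailability
print(f"Combined availability: {combined_availability:.7f}")
# ~0.9999995, i.e. roughly 16 seconds of overlapping downtime per year,
# assuming the providers fail independently (the key assumption to validate).
```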

What is the best range of PUE, based on Uptime Institute’s experience?

There is no right or wrong answer regarding the best range for PUE. PUE is a basic metric for gauging mechanical and electrical (M&E) infrastructure energy efficiency, intended to be just that: a gauge that helps operators understand where their infrastructure energy efficiencies lie so they can establish a plan for continuous improvement. We have, however, seen data centers that utilize very efficient M&E equipment but still have PUEs higher than 1.6 because of inefficiencies in IT hardware management. In addition, the PUE metric is not tied in any way to the actual work being done, so although a data center may exhibit an attractive PUE, it may not be doing any work at all!

The value of the metric is not as important as acknowledging it and having a continuous improvement plan. Uptime Institute recommends a business-results approach to improving overall IT efficiency, in which PUE is a very small part of the overall assessment. Efficient IT is about following a sustainable, top-down approach that addresses IT operations and IT hardware utilization, with a small percentage addressing the data center and PUE, and that follows a documented continuous improvement process.

I would like to know if there are general guidelines for decommissioning a data center.

There are no general guidelines for when to decommission a data center, because business circumstances are different for practically every data center. The decision to decommission can depend on a number of facility factors that can lead or assist the business in making a decision, with infrastructure age typically the leading one. However, the impact of cloud, edge computing and hybrid IT has driven IT strategy changes that have recently led many data centers to be decommissioned.

We have also seen a lot of very well-maintained data centers that are more than 20 years old, where the companies presently have no intent of decommissioning the sites. These companies typically have an IT data center strategy plan in place and are tracking against it. IT and data center investment decisions are made with this plan in mind. The key is to make sure the IT data center strategy plan is not developed in a vacuum by just IT, just Finance, or just Facilities/Real Estate. The best IT data center strategy is developed with input from all of these groups, creating a partnership. It should also be a living plan, reviewed and adjusted as necessary on a regular basis.

Can you speak to where you see multi data center DCIM tools going over the next year or two?

Multi-tenant DCIM is likely to evolve from basic, isolated power and environmental monitoring features (including alarming and reporting) to also include facilities asset management capabilities such as change and configuration management. Multi-tenant data center providers that offer remote hands services will, in particular, make use of DCIM asset management to enable customers to remotely track the on-site work being done, including with auditable workflows.

Looking forward, service delivery will be measured with qualitative metrics that identify not just whether a service is available, but at what cost and at what capacity. Hence, DCIM will begin to include full-stack analytics to understand how work is hosted and to keep track of it as it migrates. To get there, multi-tenant DCIM will likely also start to include ‘out-of-the-box’ pre-built connectors to other management software tools, such as ITSM and VM management, so that customers can match specific workloads to physical data center assets, enabling end-to-end costing, ‘right-sized’ asset/workload provisioning, and so on.

You can watch the full “Ask the Expert” session with Uptime Institute CTO, Chris Brown, by visiting the recorded session page on our website at:
https://uptimeinstitute.com/webinars/webinar-ask-the-expert-data-center-infrastructure-management
