Data Centers & Mission Critical Fabric

Mission Critical Computing Fabric

We’ve entered an era where our IT infrastructures are becoming a compilation of capacity that is spread out across a wide range of platforms: some we control completely, some we control partially and some we don’t control at all. Our IT services discussions should no longer start with ‘And in the data center we have…’; instead they need to center on the mission critical business applications and transactions provided by ‘the fabric’.

Fabric Computing

Who would have thought that all of us long-time ‘data center professionals’ would now be on the hook to deliver IT services using a platform, or set of platforms, that we have little or no control over? Who would have thought we’d be building our infrastructures like fabric, weaving various pieces together like a finely crafted quilt? Yet here we are, and between the data centers we own, the co-locations we fill and the clouds we rent, we are putting a lot of faith in a lot of people’s prowess to create these computing quilts or fabrics.

We all know that the executive committee will ask us regularly, “We have now transformed to be digital everything. How prepared are we to deliver these essential business-critical services?”, and we in turn know that we must respond with a rehearsed confirmation of readiness. The reality is that we are crossing our fingers and hoping that the colos we’ve chosen and the cloud instances we’ve spun up won’t show up on the 6 o’clock news each night. We simply have less and less control as we outsource more and more.

A big challenge, to be sure. What we need to do is focus on the total capacity needed, identify the risk tolerance for each application, and then look at our hybrid infrastructure as a compilation of sub-assemblies, each with its own characteristics for risk and cost. While it’s not simple math to figure out our risk and cost, it *IS* math that needs to be done, application by application. Remember, I can now throw nearly any application into my in-house data centers, spin it up in a co-location site, or even burst it to the cloud on demand. The user of that application would NOT likely know the difference in platform, yet the cost and risk of processing that transaction would vary widely.
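To make that math concrete, here is a minimal sketch in Python of the application-by-application arithmetic described above. Every platform name, unit cost and availability figure is an invented assumption for illustration only; substitute your own numbers (or build the same table in Excel).

```python
# Illustrative per-application cost/risk comparison across platforms.
# All platform names, unit costs and availability figures are invented
# assumptions for this example, not benchmarks.

platforms = {
    "in_house":     {"cost_per_txn": 0.012, "availability": 0.9995},
    "colo":         {"cost_per_txn": 0.009, "availability": 0.9990},
    "public_cloud": {"cost_per_txn": 0.015, "availability": 0.9980},
}

applications = [
    # name, monthly transactions, cost of one hour of downtime
    {"name": "order_entry", "txns": 2_000_000, "downtime_cost_per_hr": 50_000},
    {"name": "hr_portal",   "txns":   150_000, "downtime_cost_per_hr":  1_000},
]

HOURS_PER_MONTH = 730

for app in applications:
    print(app["name"])
    downtime_cost = app["downtime_cost_per_hr"]
    for name, p in platforms.items():
        run_cost = app["txns"] * p["cost_per_txn"]
        expected_downtime_hrs = (1 - p["availability"]) * HOURS_PER_MONTH
        risk_cost = expected_downtime_hrs * downtime_cost
        print(f"  {name:12s} run=${run_cost:>9,.0f}  risk=${risk_cost:>8,.0f}"
              f"  total=${run_cost + risk_cost:>9,.0f}")
```

The specific figures matter far less than the habit: every application gets its own run-cost-plus-risk-cost total for every platform it could land on, and the platform decision is defended with that math.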

But we have SLAs to manage all of this third-party risk, right? Nope. SLAs are part of the dirty little secret of the industry: they essentially describe what happens when a third party fails to keep things running. Most SLAs spend most of their prose explaining what the penalties will be WHEN the service fails. SLAs do not prevent failure; they just articulate what happens when failures occur.

Data Center Tools

So this now becomes a pure business discussion about supporting a Mission Critical ‘Fabric’. This fabric is the hybrid infrastructure we are all already creating. What needs to be added to the mix are the business attributes of cost and risk and, for each, a cost calculation and a risk justification for why we have made certain platform choices. Remember, we can run nearly ANY application on any one of the platform choices described above, so there must be a clear reason WHY we have done what we have done, and we need to be able to articulate and defend those reasons. And we need to think about service delivery when it spans multiple platforms and can actually traverse from one to another over the course of any given hour, day or week. It’s all a set of calculations!

Put your screwdrivers away and fire up your risk management tools, your financial modelling tools, or even your trusty copy of Excel! This is the time to work through the business metrics, rather than the technical details.

Welcome to the era of Mission Critical Computing Fabric!

“Ask the Expert”: Data Center Management Q&A with Uptime Institute CTO, Chris Brown

On April 24th, Uptime Institute CTO, Chris Brown, participated in an “Ask the Expert” session on Data Center Infrastructure Management with BrightTALK senior content manager Kelly Harris.

The 45-minute session covered topics ranging from best practices of a well-run data center to a list of the most common data center issues Uptime Institute’s global consultant team is seeing in the field every day. Toward the end of the session, Harris asked Brown for his views on the challenges and changes facing the data center industry. His answers sparked a flurry of questions from the audience focusing on operations, edge computing, hybrid-resiliency strategies and new cyber security risks facing data center owners and operators.

You can view a recording of the full session on our website here: https://uptimeinstitute.com/webinars/webinar-ask-the-expert-data-center-infrastructure-management

We have captured the top twelve audience questions and provided answers below:

If you could only focus on one thing to improve data center performance, what would that be?

Focus on improving operations discipline and protocols through training, procedures and support from management. A well-run data center can overcome shortcomings of the design, but a poorly run data center can easily jeopardize a great design. Many people have considered the challenges they see in their infrastructure to be technology-driven, but we are seeing a huge swing toward a focus on operational excellence, and the ability to defend the operational choices made.

What is Uptime seeing in the way of IT density growth in the industry?

We are seeing a big delta between what people are anticipating for the future and what they are practicing today. New facilities are being designed to support high density, but not a lot of density growth is showing up yet. Most of the density growth in the industry is coming from the hyperscalers right now (Facebook, Google, etc.). One factor being leveraged much more heavily is the ability of virtualization and software-defined technology to create virtual infrastructure without the need for additional physical resources. So while the growth in tangible space, power and cooling may appear to be slowing down, the growth in processing capacity and work processed is actually increasing rapidly.

Is IoT affecting density at this point?

Right now, we are seeing IoT drive the amount of compute that is needed but, at this moment, we don’t see it driving a lot of increases in density at existing sites. A main premise of IoT is that the first layer of computing can and should be moved closer to the point where the data originates, so we are seeing various edge computing strategies coming forward to support this goal. The density of those edge sites may not be a factor, but their existence is. In developing regions, we’re seeing more data centers being built to house regional demand rather than a dramatic increase in density. At some point in the future we’ll run out of physical space to build new data centers and, at that time, I’d expect to see density increase dramatically, but it isn’t happening yet.

Do you know if anyone is working on writing a spec/standard for “Edge”?

There are a number of entities out there trying to put together a standard for edge, but there hasn’t been a lot of traction thus far. That said, the focus on Edge computing is critically important to get right.

At Uptime Institute, we’re working with the edge vendors to keep the focus of Edge where it needs to be: delivering required business services. Historically, many in the industry would think Edge data centers are small and, therefore, less important. Uptime Institute takes the position that Edge data centers are already becoming a major piece of any company’s data center strategy, so the design, construction, equipment and implementation are just as important in an edge facility as they are in a centralized facility.

Are these edge data centers remotely managed?

Yes, most Edge data centers are remotely managed. Edge data centers are typically small, with perhaps 1 to 22 racks and physically placed near the demand point, so it is usually cost prohibitive to man those facilities. Various layers of hardware and software move workloads in and out of physical locations, so the amount of on-site management needed has been greatly reduced.

How is data center infrastructure management and security changing?

On the IT side, the devices that face the end user and provide compute to the end user have had a cyber security focus for many years, with a lot of effort put into securing these systems and making them robust. Literally millions of dollars of cyber protection devices are now installed in most data centers to protect the data and applications from intrusion.
But one of the things we are seeing is that the management of building control systems is becoming more IP-based. MEP equipment and its controllers are connected to building networking systems. Some data centers are placing these management systems on the same company internet/intranet as their other production systems and using the same protocols to communicate. This creates a situation where building management systems are also at risk because they can be accessed from the outside, yet they are not as well protected because they are not considered mainstream data. (Even air-gapped facilities are not safe, because someone can easily bring malware in on a technician’s laptop, hook it up to the IT equipment, and that malware can then replicate itself across the facility through the building management system and infrastructure.)

Consequently, to address this new risk, we are starting to see more data center facilities apply the same security standards to their internal systems as they have been applying to their customer-facing systems for the last several years.

Will the Uptime Institute Tier levels be updated to account for the geographic redundancies that newer network technologies allow owners to utilize?

Uptime Institute has been looking into distributed compute and resiliency and how they apply to the standard Tier levels. The Uptime Institute Tier levels apply to a specific data center and focus on ensuring that the data center meets the level of resilience needed at the component level. Tier levels do not need to be updated to address hybrid resiliency. The measure of hybrid resiliency is based on achieving resiliency and redundancy across the individual components within a portfolio, as viewed from the business service delivery itself. We liken this metric to the calculation of MTBF for a complex system, which rolls the individual components up into a view of the complete system.
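As a rough illustration of that analogy (a sketch only, with invented hours and availability figures rather than Uptime Institute data), component-level numbers can be rolled up to a system-level view of a delivery chain, and then to a portfolio of redundant sites as seen from the business service:

```python
# Illustrative roll-up of component figures to a system-level view.
# All MTBF hours and availability values are invented for the example.

def series_mtbf(mtbfs_hours):
    """Series chain: any one component failing takes down the service.
    Failure rates (1/MTBF) add, so system MTBF = 1 / sum(1/MTBF_i)."""
    return 1.0 / sum(1.0 / m for m in mtbfs_hours)

def parallel_availability(availabilities):
    """Redundant sites/instances: the service is down only if all are down."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down

# One delivery chain: site power + cooling + network path (illustrative MTBFs)
chain = series_mtbf([50_000, 80_000, 30_000])
print(f"single-chain MTBF = {chain:,.0f} hours")

# The same service delivered from two independent sites, each 99.9% available
print(f"two-site service availability = {parallel_availability([0.999, 0.999]):.5%}")
```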

What is the most important metric of a data center (PUE, RCI, RTI, etc)?

If you are looking at a single metric to measure performance, then you are probably approaching the data center incorrectly. PUE looks at how efficiently a data center can deliver a kW to the IT equipment, not how effectively the IT equipment is actually being utilized. For example, UPS systems increase in efficiency with load, so more total load in the data center can improve your PUE; but if you are only at 25% utilization of your servers, then you are not running an efficient facility despite having a favorable PUE. This is an example of why single metrics are rarely an effective way to run a facility. If you’re relying on a single metric to measure efficiency and performance, you are missing out on a lot of opportunity to drive improvement in your facility.
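As a small worked example (with invented numbers), the same facility can show a respectable PUE while delivering very little useful work if server utilization is low:

```python
# Illustrative sketch of why PUE alone can mislead. All figures are invented.

def pue(total_facility_kw, it_kw):
    """PUE = total facility power / IT equipment power (1.0 is the ideal)."""
    return total_facility_kw / it_kw

it_load_kw = 800           # power drawn by servers, storage and network gear
overhead_kw = 280          # cooling, UPS losses, lighting, etc.
server_utilization = 0.25  # fraction of compute capacity doing useful work

print(f"PUE = {pue(it_load_kw + overhead_kw, it_load_kw):.2f}")

# Facility power consumed per kW of compute actually doing useful work:
useful_kw = it_load_kw * server_utilization
print(f"facility kW per useful compute kW = {(it_load_kw + overhead_kw) / useful_kw:.2f}")
```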

How are enterprises evaluating risk regarding their cloud workloads? For example, if their on-premises data center is Tier IV – how do they assess the equivalent SLA for cloud instances?

There are two primary ways that enterprises can reduce risk and create a “Tier IV-like” cloud environment. The first, and increasingly popular, way is to purchase High Availability (HA) as a service from a cloud provider such as Rackspace or Google. The second is for the enterprise to architect a bespoke redundancy solution itself, using a combination of two or more public or private cloud computing instances.

The underlying approach to creating a high availability cloud-based service is fundamentally the same, with the combination having redundancy characteristics similar to those of a “Tier IV” data center. In practice, servers are clustered with a load balancer in front of them, and the load balancer distributes requests to all of the servers behind it within that zone, so that if an individual server fails, the workload is picked up and executed non-disruptively by the remaining servers. This implementation will often have an additional server node (N+1) installed beyond what is actually required, so that if a single node fails, the client won’t experience increased latency and the remaining systems are at a lower risk of being over-taxed. The same concept can be applied across dispersed regions to account for much larger geographic outages.

This approach ensures that data will continue to be available to clients in the event that a server, an entire region or a site fails. Enterprises can further strengthen their high availability capabilities by utilizing multiple cloud providers across multiple locations, which greatly reduces the chance of a single provider failure, and where even the chance of planned maintenance windows overlapping between providers is very small.
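For illustration only, here is a minimal sketch of the N+1 reasoning described above; the node count and per-node availability are assumptions chosen for the example, not provider figures:

```python
# Illustrative N vs. N+1 comparison for a load-balanced cluster.
# Node count and per-node availability are invented assumptions.
from math import comb

def p_capacity_met(total_nodes, required_nodes, p_node_up):
    """Probability that at least `required_nodes` of `total_nodes` are healthy."""
    return sum(
        comb(total_nodes, k) * p_node_up**k * (1 - p_node_up)**(total_nodes - k)
        for k in range(required_nodes, total_nodes + 1)
    )

required = 4   # nodes needed to carry the full workload without added latency
p_up = 0.995   # chance any one node is healthy at a given moment

print(f"N   (4 nodes): {p_capacity_met(required,     required, p_up):.4%}")
print(f"N+1 (5 nodes): {p_capacity_met(required + 1, required, p_up):.4%}")
# Running two such clusters in independent regions behind a global load
# balancer compounds the effect: both must drop below capacity at once.
```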

What is the best range of PUE, based on Uptime Institute’s experience?

There is no right or wrong answer regarding the best range for PUE. PUE is a basic metric to gauge mechanical and electrical (M&E) infrastructure energy efficiency, intended to be just that: a gauge, so that operators can better understand where their infrastructure energy efficiencies lie in order to establish a plan for continuous improvement. We have, however, seen data centers that utilize very efficient M&E equipment but still have PUEs higher than 1.6 because of inefficiencies in IT hardware management. In addition, the PUE metric is not tied in any way to the actual work being done, so although a data center may exhibit an attractive PUE, it may not be doing any work at all!

The value of the metric is not as important as the acknowledgement and the continuous improvement plan. Uptime Institute recommends a business results approach to improving overall IT efficiency, where PUE is a very small part of the overall assessment. Efficient IT is all about following a sustainable approach led from the top down, one that addresses IT operations and IT hardware utilization, with only a small percentage addressing the data center and PUE, and that follows a documented continuous improvement process.

I would like to know if there are general guidelines for decommissioning a data center.

There are no general guidelines for when to decommission a data center because business circumstances are different for practically every data center. The decision to decommission can depend on a number of facility factors that can lead or assist the business in making the call, with infrastructure age typically a leading factor. However, the impact of cloud, edge computing and hybrid IT has driven IT strategy changes that have recently led many data centers to be decommissioned.

We have also seen a lot of very well-maintained data centers that are more than 20 years old, where the companies presently have no intent of decommissioning the sites. These companies typically have an IT data center strategy plan in place and are tracking against that plan. IT and data center investment decisions are made with this plan in mind. The key is to make sure that the IT data center strategy plan is not developed in a vacuum by IT, Finance, or Facilities/Real Estate alone. The best IT data center strategy is developed with input from all of these groups, creating a partnership. The IT data center strategy should also be a living plan, reviewed and adjusted as necessary on a regular basis.

Can you speak to where you see multi data center DCIM tools going over the next year or two?

Multi-tenant DCIM is likely to evolve from basic isolated power and environmental monitoring features (including alarming and reporting) to also include facilities asset management features, including change and configuration management. Multi-tenant data center providers that offer remote hands services will, in particular, make use of DCIM asset management to enable customers to remotely track the on-site work being done, including with auditable workflows.

Looking forward, service delivery will be measured with qualitative metrics which identify not just IF a service is available, but at what cost and at what capacity. Hence, DCIM will begin to include full-stack analytics to understand how work is hosted and to keep track of it as it migrates. To get there, multi-tenant DCIM will likely also start to include ‘out-of-the-box’ pre-built connectors to other management software tools, such as ITSM and VM management, so that customers can match specific workloads to physical data center assets, enabling end-to-end costing, ‘right-sized’ asset/workload provisioning, and more.
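As a hypothetical illustration of that workload-to-asset matching, the sketch below joins a DCIM-style rack inventory with a VM-manager-style workload list to apportion monthly facility cost per business service; every name and figure is made up for the example:

```python
# Hypothetical join of DCIM asset data with a workload inventory to
# apportion facility cost per business service. All values are invented.

racks = {
    "RACK-A01": {"monthly_facility_cost": 1_800.0, "kw_capacity": 8.0},
    "RACK-A02": {"monthly_facility_cost": 1_800.0, "kw_capacity": 8.0},
}

# Workloads as an ITSM or VM management tool might report them
workloads = [
    {"service": "billing",   "rack": "RACK-A01", "kw_drawn": 3.2},
    {"service": "analytics", "rack": "RACK-A01", "kw_drawn": 2.4},
    {"service": "web_front", "rack": "RACK-A02", "kw_drawn": 1.6},
]

for w in workloads:
    rack = racks[w["rack"]]
    share = w["kw_drawn"] / rack["kw_capacity"]
    cost = share * rack["monthly_facility_cost"]
    print(f"{w['service']:10s} uses {share:.0%} of {w['rack']} -> ${cost:,.0f}/month")
```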

You can watch the full “Ask the Expert” session with Uptime Institute CTO, Chris Brown, by visiting the recorded session page on our website at:
https://uptimeinstitute.com/webinars/webinar-ask-the-expert-data-center-infrastructure-management

How Edge Computing Is Transforming IT Infrastructure

New technologies such as IoT and cloud architectures are driving computing to the edge. Companies must embrace this trend in order to survive.

The definition of computing infrastructure is changing. While large traditional data centers have been the mainstay of information technology for the past 60 years, we’re seeing a perfect storm today where mobility, economics, and technology are all converging to fundamentally redefine both the IT challenge and the IT solution. In a nutshell, most everything we learned as IT professionals about leveraging economies of scale in delivering corporate IT is being turned on its side, and is instead being viewed from the users’ perspective. Read the Full Story Here.

Data Center Security

Hacking the Physical Data Center – Not just for Hollywood Movies

We have all seen big headline stories for years about overseas network hackers who are able to extract millions of account details and social security numbers from the retail, financial and a litany of other industries. And Hollywood has done a great job of painting the picture of bad guys physically breaking into data centers to steal information, making them rich men overnight. But a less considered situation is the one that fits somewhere in the middle: online hacking of the physical data center itself. The reality of what this hacking entails lies somewhere between what you see in the Hollywood movies and the stories your IT staffers share around the water-cooler. Data center hacking is real, occurs nearly every day, and goes far beyond the downloading of customer data that shows up on the evening news. And in many cases, it may not even be noticed for days, weeks or years!

Every organization needs to be aware of all of the ways in which its business can be compromised. In fact, now more than ever, these threats need to be taken seriously as the company’s core business transforms to be digital. Every process and asset is represented digitally. Workflows and actions are defined digitally. Building systems and energy usage are managed digitally too. And all of this access is connected to systems in the data center. Get into the data center, logically or physically, and a person can wreak havoc throughout an entire organization.

Security Starts in the Physical World

Security Starts with Policies

But let’s focus on the here and now: the ground-level reality of what this all means and how it happens…

It actually started quite innocently. Over the years, manufacturers realized that hardwired RS232/RS485 and other proprietary control wiring interfaces were a dead end, so they turned to open systems approaches and added IP access to their infrastructure control devices. This open, connected approach simplified the systems design and integration needed to achieve the desired results. As such, nearly every company has now placed the vast majority of its infrastructure control systems on company intranets: the air conditioning, the power distribution systems, the security cameras, the ID readers and door access. Yup, it’s now all digital and all IP-controlled on the network. It’s easy to see the huge upside for efficiency and intelligence, but the downside is that essentially anyone with access to the systems or the network can turn all of these systems on and off, wipe security footage, and unlock doors.

So why has this gone unnoticed? We publicly hear about the security breaches that result in the bad guys gathering sensitive customer data because those affect us all. But all of the other breaches are mostly invisible to the public. Who really cares about a breach that simply shuts off the air-conditioning in the data center? Or the breach that unlocks all of the perimeter doors and allows criminals to wander around? Or the one that turns off the security cameras where your diesel generator fuel is stored? And these types of hacking aren’t just invisible to the public, but in many cases to the executive team as well.

So how did we get here? This situation is very different than in decades past, since most of the people responsible for these control systems are not accustomed to worrying about such ‘invisible’ security risks. Many of these folks still think about bigger padlocks rather than firewalls. They think about taller fences and night-vision cameras. The very idea that all of those physical/mechanical systems can be useless when a hacker “rides the wire” is hard to imagine for many of them.

And it all starts with policy and procedure. A huge percentage of the hacking we hear about today actually originates from good people who HAVE AUTHORIZED ACCESS into the data center to do their regular jobs. What? Good people hack? Well, not intentionally. Regularly, data center operators plug in flash drives that they used at home and that have become infected. And as bad as that sounds with operators, higher-level technicians who have access to deeper or more strategic systems can do the same thing, unintentionally releasing a storm! They walk into a data center to work on some type of control system and the first thing they do is connect THEIR laptop to the systems. That first hop is essential to any bad-guy hack, so policies that prevent these types of common ‘innocent’ practices can offer huge reductions in risk and incidents.

It’s hard to comprehend the true cost of physical systems that have been hacked. Security cameras that protect millions of dollars of capital and other assets can be compromised, enabling theft. Cooling systems can be switched off for a mere few minutes, taking down entire data centers holding hundreds of millions of dollars of equipment, with all of the lost business and valuation that goes with massive downtime. And this hacking isn’t even focused on the customer data breaches we all hear about every month.

So for 2018, commit to a top-to-bottom assessment of how you fare in this security framework. You likely already have lots of people focused on the network and server portions of security, but I suspect that you have far fewer people working at the very top (policy and physical) or the very bottom (application and device) of the security framework we are discussing.

Get on it!

Shrinking Data Center

As Data Center Footprint shrinks, the Importance of the Data Center Grows!

Today, your data center is more important and essential than it has ever been!

Let me explain. We are at a crossroads in thinking about IT infrastructures. For nearly 40 years (about as long as connected computing systems have existed), bigger has always been better. Bigger processors, bigger disk drives, bigger servers, and bigger data centers to house it all. Most seasoned IT professionals have taken great pride in the amount of stuff they could acquire and then cobble together to create bespoke IT solutions that could meld and adjust to changing business conditions. And for many years, bigger worked. Every imaginable application was thrown at the IT organization, and more stuff would be added to the infrastructure to allow for these new services and expanded capacity. Bigger really was better!

Balancing Public and Private Clouds is essential for 2018

But that all changed in the last ten years, as applications and infrastructures were virtualized and all of the essential active devices increased in density while shrinking in physical size. And all of this happened at the very same time that the public cloud was hitting its stride and offering new ways of running applications without consuming any owned-and-operated gear at all! In-house data centers began to shrink in size and take on many of the attributes of the Public Cloud, forming Private Clouds.

Now, some of our colleagues are already equating this shrinking trend with the elimination of data centers altogether. They see the physical size trending downward and extrapolate this to a zero footprint at some point in time. Not so fast, partner! This isn’t a statistical and theoretical discussion. It’s about identifying all of the business services that are needed to run a business, and then determining the best platform on which to deliver those services. And the platform choice is not one-size-fits-all… it is a balancing act based upon cost, capacity, timing, security, support and regulatory needs!

Perhaps counterintuitively, the shrinking footprints of the data centers that you and your team are operating and optimizing into private clouds today are actually becoming MORE important to the business itself! Think about it. What are the applications and services being kept in these owned and/or operated data centers? Typically, it’s the most critical ones. It’s the secret sauce that you hold close to the vest. It’s the stuff that runs the very essence of your business. It’s your customer detail and your confidential data. It’s the stuff that can’t leave your four walls for jurisdictional, regulatory and/or security reasons.

Make no mistake, the public cloud will help us all offload capacity and give us access to new applications that cannot be brought in-house easily or cost-effectively. The Public Cloud provides a great platform for some applications, but as we have all learned in our IT careers, this is NEVER a one-size-fits-all game, and to focus on a zero-footprint end-game is irrational and contrary to good business and technology understanding. Each IT business service needs to be delivered on the most suitable platform. And it is the definition of “most suitable” that will get us all in trouble if we are not careful. In cases of security, jurisdiction or regulatory compliance, that definition will be more stringent than for many other applications, and your ability to reliably deliver those ‘bet the business’ core applications has never been more critical.

In 2018, those IT professionals who really get their business’s needs, and who can defend WHY they have satisfied those needs with the various technology platforms available to them today, will be handsomely rewarded….

Delivering IT in 2018 is All About Accountability!

It has always been incredibly difficult to align an organization’s business needs and strategies with the underlying information technologies required to get there. For many years, the long-tenured IT organization was given a tremendous amount of leeway due to the complex nature of delivering IT. IT was viewed as an almost magical discipline, something that had to exist at the genetic level of the people who called themselves IT professionals. In fact, the executive team knew that IT was a domain that few humans could comprehend, so they rarely asked any deep questions at all! They didn’t really challenge budgets, timing or strategy for that matter. Instead, they hired IT experts with impressive resumes and trusted them to deliver. In doing so, they simply let those experts TELL THEM what was possible in IT, how soon they could bring it online, and how much it would cost. And since these were IT experts, everyone listening just wrote down what they heard and considered it fact. There was not a lot of discussion or challenge to their statements. The tail really was wagging the dog!

Last summer I conducted a survey of IT professionals during the VMworld trade show held in Las Vegas. Not surprisingly, almost 80% of the people I surveyed indicated that they had been personally focused on delivering IT services for at least 10 years. A bit more surprising was that half of those surveyed indicated it was almost twice that long! The IT world is full of long-term practitioners. During such long stints, many individuals find an approach or ‘rhythm’ that works for them. From their perspective, this rhythm is tried and true and will always meet the needs of their constituents. And given the lack of questions from the executive team and other stakeholders, and the apparently free-wheeling nature of IT deployment and direction, times were pretty good for all those years.

In 2018… this is a problem. In an era where all businesses are fast becoming digitally-centric, this model no longer works. There must be a plan that is well understood and very defensible, in ALL contexts. There is massive accountability for every action taken and every dollar spent, and just being able to deliver raw capacity and make systems talk to each other is expected as table stakes.

And the strategies must be well-conceived and sound. Think back about five years, when nearly every company on the planet was blindly jumping on the Public Cloud bandwagon like drunken sailors on holiday leave. That was their ‘strategy’. It sounded really good on paper and appeared to be very high tech. Consequently, this Public Cloud cut-over plan was presented as the savior for all of the IT woes that existed….

Then reality set in, and these same drunken sailors sobered up and realized that their actual business needs were not always being met by simply abandoning what they already had and foolishly running en masse to the Public Cloud. Whether it was cost or access or predictability, there were one or more aspects of their shiny new “Cloud” strategy that simply didn’t pass real-world muster. Oops.

So companies are now (re-)thinking their IT strategy plans. They realize that, due to economics, applications or audience, they need to create a defendable plan that blends existing resources (like their billion dollars invested in data center infrastructure) with new resources that are now available as a service. They need operational plans that account for a long list of “what happens if” scenarios. They need to understand the hardware, software and management costs of each component, and then apply those costs to each business service on a per-unit basis to determine if the business can afford ‘their plan’.

So the point is, true innovation is available everywhere you look in IT, but there is a white-collar wrapper needed for everything we do. New data centers can be built if the business model makes sense. Existing data centers can be run better and more efficiently. Co-location space can be added based on cost or capacity build-outs, and Public Cloud services can be adopted for dynamic capacity when the business needs it. This is not a one-size-fits-all game.

Accountability is not a FOUR letter word (it’s actually 14); it’s your way of life if you are going to be a successful IT professional in 2018.


By Mark Harris, Senior Vice President of Marketing, Uptime Institute