The Making of a Good Method of Procedure

Good MOPs (methods of procedure) help humans manage the complexity inherent in data centers
By Alfonso Aranda with Lee Kirby

Data centers are complex techno–human systems. Their complexity arises from the number of interrelated and interdependent elements (including the human element) involved in normal operation and from the large number of interactions among them. These interactions include any installation or decommissioning of IT equipment; expansion or reconfiguration of the electrical, mechanical, and control infrastructure and ancillary installations (fuel, fire, and water treatment systems, etc.); and any maintenance action. Increased automation adds more working and failure modes, which adds further complexity to the system.

Failure Types
Complexity generates a high risk of data center failure, which can originate in the infrastructure (electrical, mechanical, IT, communications) or in the interaction between the people who manage the data center and the infrastructure and systems they manage. Uptime Institute considers these interactions to be the leading cause of data center outages.

Failures of infrastructure can be contained and mitigated by redundancy in the data center, depending on the topology (Tier). Building a more robust and resilient infrastructure, however, does not minimize failures caused by human activities in the data center; additional mechanisms, organizational and operational in nature, must be deployed to prevent them.

One such mechanism, the method of procedure, or MOP, is a step-by-step sequence of actions to be executed by maintenance/operations technicians performing an operation or action that implies a change of state in any critical component of an installation. Such actions include switching breakers on or off, opening or closing sectioning valves, and other actions that could pose a risk to the normal operation of the data center. The purpose of an MOP is to control actions to ensure the desired outcome.

MOPs, SOPs, and EOPs
MOPs can be stand-alone documents or part of higher-level standard operating procedures (SOPs). In the latter case, the SOP is the overarching document that controls how changes are to be made during normal operations. The SOP begins and ends the overall procedure and often comprises several MOPs that spell out the specific steps for portions of the SOP. SOPs are not as detailed as individual MOPs.

Building an SOP from individual MOPs makes it easier to revise procedures, because the same MOP can be used as part of multiple SOPs. For example, a site would have a separate SOP describing a filter change for each unit (CRAC1, CRAC2, etc.), but every one of those SOPs would include the same MOP containing the how-to steps for changing out a filter. In that way, updating the procedure for changing the filter requires changes only to the MOP, not to every SOP.
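
The reuse described above can be sketched in a few lines of Python. This is only an illustration: the class names `Mop` and `Sop`, the identifier `MOP-FILTER-01`, and the filter-change steps are hypothetical and are not taken from any actual site policy.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Mop:
    """A reusable block of how-to steps (hypothetical structure)."""
    mop_id: str
    version: int
    steps: List[str]


@dataclass
class Sop:
    """An overarching procedure that references one or more MOPs."""
    title: str
    mops: List[Mop] = field(default_factory=list)

    def all_steps(self) -> List[str]:
        # Flatten the steps of every referenced MOP, in order.
        return [step for mop in self.mops for step in mop.steps]


# One shared MOP, referenced by a separate SOP for each CRAC unit.
filter_change = Mop("MOP-FILTER-01", 1,
                    ["Stop unit fan", "Replace filter", "Restart unit fan"])
sop_crac1 = Sop("Filter change, CRAC1", [filter_change])
sop_crac2 = Sop("Filter change, CRAC2", [filter_change])

# Revising the MOP once updates every SOP that references it.
filter_change.steps.insert(1, "Inspect filter housing")
filter_change.version += 1

print(sop_crac1.all_steps() == sop_crac2.all_steps())
```

Because both SOPs hold a reference to the same MOP object rather than a copy, the revision appears in each of them without any per-SOP editing.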

MOPs, SOPs, emergency operating procedures (EOPs), and site configuration procedures (SCPs) form the core of the data center site policies. EOPs are detailed written instructions that must be carried out sequentially when an abnormal event triggers the procedure.

Many of the characteristics of MOPs are also found in SOPs and EOPs, and all of these procedures can be, and are, applied in other fields (military, research, medical, etc.).

SCPs comprise those design and commissioning documents, drawings, schemes, tables, and studies that describe and document the normal configuration of the site, which sets up the initial conditions for the execution of any SOP or MOP.

Different sites may use different nomenclature in their site policies; however, to ensure their effectiveness, all site policies must cover the following areas:

• Description and administration of the normal operating configuration of the
site (including setpoints, discrimination study, and normally open/normally
closed breakers or valves, etc.)

• General overarching description of standard operations

• Detailed written instructions for executing changes of state or high/medium-risk interventions

• Detailed written instructions for executing emergency operations for responding to abnormal events

The site policies and actual changes of state of the infrastructure operate within a change management environment.

Components of an MOP
MOPs may contain different elements, fields, and details depending on the complexity of the activity to be carried out and the probability and impact of a failure in its execution. For instance, the field “expected result of the action” could be added to every step in the procedure.

In order to be effective, an MOP needs to be followed at the point and time of use by the appropriate person, supervised or not, depending on the company’s policies.

Every MOP should include a title, description of the procedure, author, approval authority/signature, date, unique identifier, and version control (see Figures 1-3).

Figure 1. Basic information to be included on all MOPs

Figure 2. Detailed step-by-step instructions form the basis of a good MOP. Note the columns for time and initials

Figure 3. The procedures continue, with a final step to inform the BMS operator that the procedure has finished, indicating cross-team communication.

MOPs should also include other information, including prerequisites, safety requirements, special tools and parts, procedure sequencing, and a back-out plan.
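
A draft MOP can be checked for completeness against the items listed above. The following is a minimal sketch, assuming the MOP is held as a simple dictionary; the key names, the function `missing_items`, and the sample draft are illustrative labels chosen here, not a standard schema.

```python
from typing import List

# Keys are illustrative names for the items named in the text.
BASIC_FIELDS = ("title", "description", "author", "approval",
                "date", "unique_id", "version")
ADDITIONAL_SECTIONS = ("prerequisites", "safety_requirements",
                       "special_tools_and_parts", "procedure_sequencing",
                       "back_out_plan")


def missing_items(mop: dict) -> List[str]:
    """Return the names of required items that are absent or left blank."""
    missing = []
    for name in BASIC_FIELDS + ADDITIONAL_SECTIONS:
        value = mop.get(name)
        if not value or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


# Hypothetical draft: no approval, blank version, no back-out plan.
draft = {
    "title": "Replace CRAC filter",
    "description": "Swap the return-air filter on one CRAC unit",
    "author": "A. Technician",
    "date": "2024-01-15",
    "unique_id": "MOP-0001",
    "version": "",
    "prerequisites": "Change approval obtained",
    "safety_requirements": "LOTO on unit fan",
    "special_tools_and_parts": "Replacement filter",
    "procedure_sequencing": "Steps 1-3",
}
print(missing_items(draft))
```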

Prerequisites include any actions that must be completed prior to performing the procedure, including verifying that all appropriate approvals (e.g., change approval, permit to work, access approval) have been obtained, that any required notifications have been issued, and that any required reconfiguration of the infrastructure (SCPs) has been performed.

The MOP should list any special tools and parts, in addition to the basic toolbox and inventory of parts that technicians usually carry, needed to do the job. Interruptions to retrieve missing tools or parts will usually extend the length of maintenance windows and increase risk to the data center. Extensions of maintenance windows or deviations from the schedule agreed upon during the change approval can lead to the maintenance window being aborted. Aborted changes (maintenance windows are one type of change) can lead to deferred maintenance, which also increases risk to a data center.

Safety requirements include lockout/tagout (LOTO) procedures and the verifications associated with them, the required presence of safety representatives, and the required use of personal protective equipment (PPE).

The most important parts of an MOP are the step-by-step instructions. Each and every step needs to be described in detail to indicate exactly what needs to be done and what the expected result is (e.g., alarms going off or indicator lights changing state).

MOPs must incorporate a mechanism that allows technicians to mark each step as completed, once it has been executed. This is normally achieved by adding a field for the technician to initial after carrying out the action indicated on that line. Additionally, incorporating a time stamp of the moment at which the action was completed provides a log of events in case a later analysis is needed.
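
The sign-off mechanism can be illustrated with a short Python sketch. The names `Step` and `sign_off` are hypothetical, and the rule that a step cannot be initialed before the preceding steps is an assumption made here to reflect sequential execution, not a requirement stated in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Step:
    number: int
    action: str
    expected_result: str
    initials: Optional[str] = None           # filled in by the technician
    completed_at: Optional[datetime] = None  # time stamp for later analysis


def sign_off(steps: List[Step], number: int, initials: str,
             when: Optional[datetime] = None) -> Step:
    """Initial and time-stamp one step, refusing to skip earlier steps."""
    for step in steps:
        if step.number == number:
            step.initials = initials
            step.completed_at = when or datetime.now(timezone.utc)
            return step
        if step.initials is None:
            # Assumption: steps must be completed in order.
            raise ValueError(f"step {step.number} has not been signed off yet")
    raise KeyError(f"no step numbered {number}")


# Hypothetical two-step procedure.
steps = [
    Step(1, "Notify BMS operator", "Operator acknowledges"),
    Step(2, "Stop unit fan", "Fan indicator light goes off"),
]
done = sign_off(steps, 1, "AA")
print(done.number, done.initials, done.completed_at.isoformat())
```

The list of completed steps, each with initials and a time stamp, doubles as the event log mentioned above.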

Some MOPs incorporate pictures or diagrams to communicate certain steps and their outcomes.

Returning to State
Back-out plans are step-by-step instructions for returning a system to its initial state or to another predefined stable and safe state. From that state, the system can be taken at a later time to its initial state or to any other intended state using the appropriate procedures (which could be EOPs). Back-out plans can be written as alternative action steps inserted within the MOP after a verification step or Go/No-Go checkpoint. These alternative steps must be taken if the outcome of the verification does not conform to the expected condition or if the result of the checkpoint is No-Go. A back-out plan is executed only under certain predefined circumstances. These will usually be an unexpected outcome after the execution of one of the steps in the MOP, a No-Go condition at one of the verification steps, or any other abnormal event that might interfere with the normal execution of the MOP, such as an unexpected interruption of the utility grid during the procedure.
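
The control flow of a Go/No-Go checkpoint with a back-out path can be sketched as follows. This is a simplified model under stated assumptions: the names `Step` and `run_mop` are hypothetical, each checkpoint is reduced to a function returning True (Go) or False (No-Go), and the electrical actions are invented placeholders, not real operating instructions.

```python
from typing import Callable, List, NamedTuple, Tuple


class Step(NamedTuple):
    action: str
    go: Callable[[], bool]  # verification after the action: True means Go
    back_out: List[str]     # alternative steps to take on No-Go


def run_mop(steps: List[Step]) -> Tuple[str, List[str]]:
    """Execute steps in order; on the first No-Go, follow that step's back-out plan."""
    log: List[str] = []
    for step in steps:
        log.append(step.action)
        if not step.go():
            log.extend(step.back_out)
            return "backed out", log
    return "completed", log


# Hypothetical procedure in which the second checkpoint returns No-Go.
steps = [
    Step("Transfer load to UPS B", lambda: True,
         ["Transfer load back to UPS A"]),
    Step("Open UPS A output breaker", lambda: False,
         ["Close UPS A output breaker", "Transfer load back to UPS A"]),
    Step("Begin maintenance on UPS A", lambda: True, []),
]
status, log = run_mop(steps)
print(status)
for entry in log:
    print(" -", entry)
```

Attaching the back-out steps to each checkpoint keeps the return path next to the point where the decision is made, which mirrors the idea of alternative steps inserted within the MOP.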

All MOPs must be subject to certain processes to ensure that they are properly integrated into the operating routines of the data center. These include a review and approval process and a training and preparation process.

Review and Approval Process
MOPs must be integrated into the change management system to ensure that they are properly reviewed and approved. The duration and depth of the change approval reviews will depend on the criticality or risk of the MOP. A risk assessment must be performed to see if any of the actions or changes of state included in the procedure entail a high risk (high probability of occurrence of the problem or high impact if the problem happens).
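
The risk rule stated above (high probability or high impact) can be written down directly. In this sketch the rating labels, the function names, and the mapping from risk to an "extended" or "standard" review are assumptions made for illustration; a real site would define its own scales and review levels.

```python
from typing import List, Tuple

# Each action: (description, probability of a problem, impact if it happens)
Action = Tuple[str, str, str]


def is_high_risk(probability: str, impact: str) -> bool:
    """High risk: high probability of the problem OR high impact if it happens."""
    return probability == "high" or impact == "high"


def review_level(actions: List[Action]) -> str:
    """Assumed policy: one high-risk action is enough to require an extended review."""
    if any(is_high_risk(prob, imp) for _, prob, imp in actions):
        return "extended"
    return "standard"


# Hypothetical ratings for two actions in one MOP.
actions = [
    ("Open breaker feeding UPS A", "low", "high"),
    ("Record panel readings", "low", "low"),
]
print(review_level(actions))
```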

Version control is critical for MOPs. The technician must have the latest official version and not an outdated or inaccurate one. This is particularly important in those sites where hard copies of MOPs (and EOPs) are located in the rooms of use.

MOPs should be reviewed at least annually to ensure that they are up to date and relevant. They should also be reviewed after any changes to the data center infrastructure, including setpoints, to reflect the new configuration.
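
The two review triggers can be expressed as a simple check. This is a sketch only: the function name `review_due` is hypothetical, and reading "at least annually" as 365 days is an assumption.

```python
from datetime import date, timedelta
from typing import Optional

REVIEW_INTERVAL = timedelta(days=365)  # assumed reading of "at least annually"


def review_due(last_review: date, today: date,
               last_infrastructure_change: Optional[date] = None) -> bool:
    """True if the MOP needs review: infrastructure changed since the last
    review, or the annual interval has elapsed."""
    if (last_infrastructure_change is not None
            and last_infrastructure_change > last_review):
        return True
    return today - last_review >= REVIEW_INTERVAL


# Hypothetical dates: reviewed in March, setpoints changed in June.
print(review_due(date(2024, 3, 1), date(2024, 7, 1), date(2024, 6, 10)))
```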

Training and Preparation Process
Employees must have the MOPs they need, understand them, and have practiced them prior to executing them. That is why employee training and qualification are so key to effectively executing an MOP.

Once an MOP is written and approved, it must be rehearsed, simulated via dry runs, and prepared with walk-throughs prior to executing it. The duration of the preparation stage, the number of checks and verifications, and the number of people involved will vary depending on the criticality of the change to be performed.

Corrections and adjustments must sometimes be made after executing a procedure. A mechanism for recording the successful execution of MOPs, along with any discrepancies, issues, lessons learned, or proposals for improvement, is an integral part of the MOP process.

Commissioning, when done completely and properly, is the single best training opportunity for the data center operations team and the reality check of all the processes and procedures in controlled, no-risk conditions. This will never be the case again once the data center is in production with IT loads. After that moment, any change of configuration in the data center will be subject to change control and the operators will not have a comparable level of freedom to exercise all components and elements of the installation.

The processes and contents of MOPs provide the framework for the successful daily operation of a data center.

Uptime Institute has found that successful MOPs share a number of common attributes. Successful MOPs

• define, detail, and delimit the interactions between the human and the machine and so effectively
eliminate or minimize human error.

• describe actions that, when performed in the same sequence and starting from the same original
configuration, lead to the same outcome.

• describe actions that leave no room for interpretation. Two different individuals following the same
MOP will perform exactly the same actions in exactly the same order regardless of location, company,
language, and cultural bias.

• describe actions in simple steps; the procedures never add complexity.

Achieving successful MOPs requires a careful balancing act. Overly long procedures and excessive detail may motivate technicians to cut corners, accelerate processes, or lose focus. Technicians may become bored or fatigued, or they may attempt to simplify the procedure. Overly complex layouts may frustrate a technician who must search for the right piece of information or subsequent steps in the document, which can lead to errors.

But the trait that stands out most: effective MOPs are those in which the technicians genuinely understand the need for the procedure and its objective.

On many occasions, a first version of the site policies, including MOPs, has been created by the design consultant or commissioning agent, with more or less participation or input from the technicians who will ultimately execute those procedures. In these cases, it is very important to ensure that ownership of reviewing, maintaining, and updating the procedures is transferred as soon as possible to the Operations team and that the technicians themselves originate the second and subsequent versions of the procedures, with as many changes and adaptations as they consider necessary. As covered in the best practice model Start With the End in Mind, developing, training on, and constantly updating procedures is a key component of any successful operation. This results in a committed team that buys into the concept and takes ownership of the accuracy and practicality of the procedures.

In summary, comprehensive MOPs are a key component of reducing data center outages caused by human error. However, they are effective only when they are complete and accurate and when they are properly followed.

Data center technicians must be trained, not only on the technologies and systems that configure the infrastructures but also on the different processes, procedures, and policies used on site. For MOPs in particular, technicians must know why it is important to use them, where to find them, and how they must be used. If the MOPs are complete and accurate and the technicians follow a disciplined approach to using them, the risk of an outage caused by human error will be minimized.


Alfonso Aranda

Alfonso Aranda Arias is a consultant with Uptime Institute. He performs Uptime Institute Professional Services audits and Operational Sustainability Certifications. He also serves as an instructor for the Accredited Tier Specialist course.

Mr. Aranda’s work in critical facilities has included design and engineering of data center facilities, project management for the design and construction of large data centers, program management of major data center facility infrastructure developments, portfolio management of critical corporate real estate, and management of data center operations in EMEA. Mr. Aranda’s career has always been connected to data centers and to project and energy management, through his technical roles and his management and senior leadership roles in corporations operating in the IT and telecommunications sectors.

R. Lee Kirby

R. Lee Kirby is Chief Technology Officer, Uptime Institute, providing thought leadership, new product development, and strategic marketing. He facilitates prospective engagements and delivers consistent quality services to contracted clients on a global basis. Additionally, Mr. Kirby serves as a spokesperson and evangelist for various initiatives that are focused on solving systemic issues in the industry.
