Capability: ITIL/COBIT-Based Management Process
Best practice processes must be defined for all tasks highlighted in the Infrastructure Optimization Model in order to receive maximum benefit and performance. The following table lists the high-level challenges, applicable solutions, and benefits of moving to the Rationalized level in ITIL/COBIT-based Management Process.
Challenges
SLAs are informal or implied
Informal configuration management consists of basic build checklists and spreadsheets
Informal release management
Solutions
Implement service level management across IT operations
Implement best practice release management
Optimize network and system administration processes
Implement best practice job scheduling
Benefits
Proactive IT operations resolve problems earlier to avoid reducing user productivity
Automated services and tools free up resources to implement new services or optimize existing services
Formal SLAs connect IT to the business by improving IT’s credibility
The Rationalized level of optimization requires that your organization has defined procedures for incident management, problem management, user support, configuration management, and change management.
Requirement: Operating, Optimizing, and Change Processes
You should read this section if you do not have processes for service level management, release management, systems administration, network administration, and job scheduling.
Infrastructure optimization goes beyond products and technologies. People and processes compose a large portion of an organization’s IT service maturity. A number of standards and best practice organizations address the areas of people and process in IT service management. Microsoft Operations Framework (MOF) applies much of the knowledge contained in the IT Infrastructure Library (ITIL) and Control Objectives for Information and related Technology (COBIT) standards and makes them actionable and achievable for Microsoft customers.
Phase 1: Assess
The goal of the Assess phase in operations management is to evaluate the organization’s current capabilities and challenges. To support the operations assessment process, Microsoft has developed the Microsoft Operations Framework (MOF) Service Management Assessment (SMA) as part of the MOF Continuous Improvement Roadmap, and a lighter online version called the MOF Self-Assessment Tool.
Figure 16. MOF life cycle of continuous improvement
The MOF Service Management Assessment is focused on enhancing the performance of people and IT service management processes, as well as enabling technologies that improve business value. All recommendations generated as a result of the SMA are detailed and justified in business value, and a detailed service improvement roadmap is provided, supported by specific key performance indicators to monitor progress as improvements are implemented.
Phase 2: Identify
The results of the MOF Service Management Assessment form the basis of the Identify phase. The Assessment will often expose several areas of potential improvement in IT operations. During the Identify phase, we consider these results and prioritize improvement projects based on the business need. Tools and job aids are included in the MOF Continuous Improvement Roadmap to assist with this prioritization.
Phase 3: Evaluate and Plan
The Evaluate and Plan phase for operational improvement relies on the identified and prioritized areas for improvement. The MOF Service Improvement Program (SIP) guidance enables this phase. SIP is split into two major areas of focus: specific MOF/ITIL process improvement and service improvement guidance. This guidance is delivered through a tool that assists users in identifying their specific trouble points, provides focused guidance for remediation, and is supported by key performance indicators to measure process improvement.
Recommended Processes for Moving to the Rationalized Level
The recommendations in this section are based on common issues found at the Standardized level and areas of improvement sought by the Rationalized level. These are only recommendations and may be different for your specific organization or industry.
Although the Standardized level brings an increased use of tools for managing and monitoring IT operations and infrastructure, plus an environment in which such processes as change management, configuration management, and release management are standardized and predictable, there is room for improvement in key areas. Service level management is rudimentary with service level agreements (SLAs) that are informal or only implied. Configuration management is informal and typically consists of basic build checklists and spreadsheets, and release management is not well defined and lacks rigor.
The Rationalized infrastructure is where the costs involved in managing desktops and servers are at their lowest and processes and policies have been optimized to begin playing a large role in supporting and expanding the business. The use of zero touch deployment helps minimize cost, the time to deploy, and technical challenges. The number of images is minimal, and the process for managing desktops is very low touch. Organizations at this level have a clear inventory of hardware and software and only purchase the licenses and computers they need. Security is extremely proactive, with strict policies and control from the desktop to server to firewall to extranet, and responses to threats and challenges are rapid and controlled.
Microsoft provides Microsoft Operations Framework (MOF) as an iterative model for defining and improving IT operations. MOF defines service management functions (SMFs) as logical operational functions within an IT organization. The SMFs are grouped together into four broad areas, or quadrants: Changing, Operating, Supporting, and Optimizing. This guide highlights areas to improve that are typically found in organizations at the Standardized level of optimization:
Service Level Management
Release Management
System Administration
Network Administration
Job Scheduling
Depending on the organization, improvements to these service management functions might or might not have the greatest impact on operational effectiveness and improvement. We recommend that your organization at a minimum completes the online self-assessment, and preferably a full Service Management Assessment, to identify the most important areas requiring process or service improvements.
Service Level Management
Service Level Management (SLM) is a critical process that aligns business needs with the delivery of IT services. It provides the interface with the business that allows the other SMFs to deliver IT solutions that are in line with the requirements of the business and at an acceptable cost. Its primary goal is to successfully deliver, maintain, and improve IT services.
SLM is used to align and manage IT services through a process of definition, agreement, operation measurement, and review. The scope of SLM includes defining the IT services for the organization and establishing SLAs for them. Fulfilling SLAs is ensured by using underpinning contracts (UCs) and operating level agreements (OLAs) for internal or external delivery of the services. Introducing Service Level Management into a business will not give an immediate improvement in the levels of service delivered. It is a long-term commitment. Initially, the service is likely to change very little; but over time, it will improve as targets are met and then exceeded.
If an organization wants to implement Service Level Management, it must first assess what services IT provides to the organization’s customers and determine what existing service contracts are currently in place for these services. This assessment can make the IT service department aware, often for the first time, of the full range of services it is expected to deliver. With the information gained through this exercise, the organization can then develop and implement the full benefits of the Service Level Management process.
Service Level Management requires that the IT organization fully understand the services it offers. Implementing Service Level Management follows these steps:
Creating a service catalog.
Developing service level agreements (SLAs).
Monitoring and reporting.
Performing regular service level reviews.
SLAs are developed in line with the requirements and priorities of the services documented in the service catalog and with the requirements specified during SLA negotiation. Each service is then monitored against the agreed criteria, and the results are reported and reviewed to highlight and remove failures in the service's level of performance.
Phase 4: Deploy (Service Level Management)
Setup activities are a series of appraisal steps carried out at the beginning of a Service Level Management project. These preliminary steps help the business determine if there is a need for Service Level Management and if it has the resources to implement it. As part of this process, the IT department establishes a baseline for the business by taking a snapshot of the existing services and management activities. The final step is to analyze the information collected in the previous steps and use the results to plan the implementation of Service Level Management for maximum benefit to the business.
Creating a Service Catalog
The service catalog lists all of the services currently being provided, summarizes service characteristics, describes the users of the service, and details those responsible for ongoing maintenance.
A service is defined by the business organization's perception. For example, e-mail may be a service and printing may be a service, regardless of the number of service components required to deliver the service to the end user.
Formalizing a service catalog is an important step in that it creates an officially recognized record. Making the service catalog an official record within the organization places it under change control. This is important since the record is valuable only if it is maintained and accurate.
There are many ways to formalize a service catalog. When determining which method is most suitable, consider how you want to view, report against, and use the service catalog. A service catalog can be stored as part of the configuration management database (CMDB), either as a single component (the service catalog as a whole) or as a set of individual service records. Microsoft applications, such as Microsoft Excel or Microsoft Access, can be used to record the services and such details as their components, effects, priorities, SLAs, and service level objectives (SLOs). If the selected tool allows the service catalog to be part of the CMDB, the information in the service catalog can be integrated with the configuration items (CIs) in the CMDB. This integration benefits the Change Management SMF, the Incident Management SMF, and all other SMFs that use the CMDB.
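As an illustration of how a catalog record might be linked to configuration items, the following Python sketch models a service entry and looks up which services depend on a given CI. The field names, the CI identifiers, and the helper function are hypothetical and are not part of MOF; a real catalog would normally live in the CMDB or another managed store.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """One service catalog record (field names are illustrative only)."""
    name: str                 # the service as the business perceives it
    description: str          # summary of service characteristics
    users: list[str]          # who consumes the service
    maintainer: str           # who is responsible for ongoing maintenance
    components: list[str] = field(default_factory=list)  # related CI identifiers
    priority: int = 3         # 1 = highest business priority (assumed scale)


def services_using(catalog: list[ServiceEntry], ci_id: str) -> list[str]:
    """Return the names of services that depend on the given configuration item."""
    return [entry.name for entry in catalog if ci_id in entry.components]


catalog = [
    ServiceEntry("E-mail", "Corporate messaging", ["All staff"], "Messaging team",
                 components=["mail-server-01", "dns-01"], priority=1),
    ServiceEntry("Printing", "Office print service", ["Head office"], "Desktop team",
                 components=["print-server-01", "dns-01"]),
]

# Which services would a change to the hypothetical CI "dns-01" affect?
print(services_using(catalog, "dns-01"))  # ['E-mail', 'Printing']
```

A lookup of this kind is one way the catalog can support change and incident management: a proposed change to a CI can be traced to the business services it touches.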
Developing Service Level Agreements (SLAs)
An SLA is an agreement between the IT service provider and the customer/user community. The SLA formalizes customer/user requirements for service levels and defines responsibilities of all participating parties.
The steps for creating an SLA are:
Define the type of SLA. For example, is it an internal, external, operating level, or multi-service level agreement?
Define the SLAs. For example, what levels of service will be delivered, including such measurable things as availability, responsiveness and performance, integrity and accuracy, and security.
Negotiate and agree on SLAs. For example, determine whether what has been agreed to can be delivered at a reasonable cost to the business and to the IT department.
Document the SLA. For example, record in writing what has been agreed to and who is involved.
Aligning SLA, OLA, and UC Commitments
Underpinning contracts (UCs) and operating level agreements (OLAs) must have service metrics that are aligned with the SLA commitment. A UC is a legally binding contract with a third-party service provider on which service deliverables for the SLA have been built. An OLA is an internal agreement supporting the SLA requirements.
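The alignment requirement can be checked mechanically once targets are recorded. The sketch below, which uses invented agreement names and availability figures, flags any supporting agreement whose availability commitment is lower than the SLA it underpins. It is a simplification: passing this check is necessary but may not be sufficient, because a service that depends on several components in series can deliver lower availability than any single component.

```python
def misaligned(sla_target: float, supporting: dict[str, float]) -> list[str]:
    """Return the names of OLAs/UCs that commit to less availability than the SLA."""
    return sorted(name for name, target in supporting.items() if target < sla_target)


# Hypothetical figures, in percent availability per month.
sla_availability = 99.5
supporting_agreements = {
    "OLA: network team": 99.9,
    "OLA: server team": 99.5,
    "UC: hosting provider": 99.0,
}

print(misaligned(sla_availability, supporting_agreements))  # ['UC: hosting provider']
```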
Service Level Monitoring
Service level management requires an ongoing cycle of agreeing, monitoring, and reporting on IT service achievements and taking appropriate actions to balance service levels with business needs and costs.
When the SLAs are agreed on and in place, the next stage in effective Service Level Management is to monitor the performance of the services against criteria specified in the service level objectives (SLOs). There are various methods of monitoring Service Level Management, but the main concern is whether the performance of any of the criteria breaches, or comes close to breaching, the SLA.
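A minimal sketch of such a check follows. It computes monthly availability from recorded downtime and classifies the result as a breach, a near-breach warning, or acceptable. The 0.1 percentage point warning margin, the 30-day month, and the function names are assumptions chosen for illustration; real thresholds would come from the SLA itself.

```python
def availability_pct(total_minutes: int, downtime_minutes: int) -> float:
    """Percentage of the period during which the service was available."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def slo_status(measured: float, target: float, warning_margin: float = 0.1) -> str:
    """Classify a measurement as 'breach', 'warning' (close to breach), or 'ok'."""
    if measured < target:
        return "breach"
    if measured < target + warning_margin:
        return "warning"
    return "ok"


month_minutes = 30 * 24 * 60  # 43,200 minutes in a 30-day month
measured = availability_pct(month_minutes, downtime_minutes=200)

print(round(measured, 3), slo_status(measured, target=99.5))  # 99.537 warning
```

Reporting the warning state separately gives the service level manager time to act before the agreement is actually broken.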
Service Level Agreement Review
The SLA Review is one of the four MOF operations management reviews (OMRs). It is a key management checkpoint and occurs at specified intervals (as documented in the SLA). This review is meant to ensure that the business and IT have an opportunity to assess performance against SLA objectives and to review the operation of the SLA. The SLA Review is designed to involve high-level management in the review process, ensuring that involvement and communication are present from both IT and the business in all future decisions regarding the delivery of the service.
Release Management
The Release Management service management function (SMF) is responsible for deploying changes into an information technology (IT) environment. After one or more changes are developed, tested, and packaged into releases for deployment, release management is responsible for introducing these changes and managing their release. Release management also contributes to the efficient introduction of changes by combining them into one release and deploying them together.
The goal of the release management process is to ensure that all changes are deployed successfully into the production IT environment in the least disruptive manner. Therefore, release management is responsible for:
Driving the release strategy, which is the overarching design, plan, and approach for deployment of a change into production in collaboration with the change advisory board (CAB).
Determining the readiness of each release based on release criteria (quality of release, release package and production environment readiness, training and support plans, rollout and backout plans, and risk management plan).
Release management also works closely with change management and configuration management. In that role, release management:
Provides a packaged release for all changes deployed into production and only deploys changes approved by change management.
Needs change management to approve changes and track them throughout the release process.
Ensures that, as changes are made, those changes are reported to configuration management for entry in the CMDB.
Needs configuration management information to build and verify valid test environments in the development phase of the new release.
Needs configuration management to assess the impact of changes to the IT environment and to provide a definitive store for the release package.
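The release readiness criteria named earlier in this section lend themselves to a simple checklist. The sketch below is one possible representation, not a MOF artifact: it splits "release package and production environment readiness" into two items for tracking purposes, and the function name is hypothetical.

```python
# Readiness criteria taken from the list above; the split of package and
# production environment readiness into two items is an assumption.
READINESS_CRITERIA = [
    "quality of release",
    "release package readiness",
    "production environment readiness",
    "training and support plans",
    "rollout and backout plans",
    "risk management plan",
]


def outstanding_criteria(signed_off: set[str]) -> list[str]:
    """Return the criteria that have not yet been signed off, in checklist order."""
    return [item for item in READINESS_CRITERIA if item not in signed_off]


signed_off = {
    "quality of release",
    "release package readiness",
    "training and support plans",
    "risk management plan",
}

remaining = outstanding_criteria(signed_off)
print("Ready for review" if not remaining else f"Outstanding: {remaining}")
```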
Phase 4: Deploy (Release Management)
The first step in the release process is the creation of a plan identifying the activities and the resources required to successfully deploy a release into the production environment. That plan involves determining what tasks need to be done, when they need to be complete (timescale), and what their priority is in relation to other tasks. Once these issues are fully understood, the release manager can draw up a detailed plan of activities and assign appropriate resources to the project. In Release Management, the release manager role is responsible for building a release (project) plan for each request for change (RFC) approved by change management.
Once the release plan is agreed on, members of the release team identify and develop the processes, tools, and technologies required to deploy the release into production. Although most (if not all) releases could be deployed into production manually, a number of tools and technologies can be used to perform the same task. To make best use of time and resources, the release team should identify the tools and technologies that will enable it to automate as much of the deployment process as possible.
Once the release mechanism has been selected, the release team creates a release package that contains the processes, tools, and technologies required to deploy the release into production by using the selected mechanism and to remove it from production should that become necessary.
For some releases, the release package may simply consist of a set of documented installation and removal procedures.
The completed release package should be tested in a lab environment to give the release team a degree of confidence that it will work correctly when used in production. Assuming that testing completes successfully, the release and the contents of the release package are then placed under the control of change management.
Up to this point, the emphasis of testing has been to confirm that the release and release package work correctly within a development environment. Acceptance testing allows developers and business representatives to see how the release and release package perform together in an environment that closely mirrors production. In some cases, pilot testing is required to build the confidence necessary to proceed to an organization-wide deployment of the change.
Although testing in a simulated production environment provides the release team with a degree of confidence in the release, it does not guarantee that the release will perform well in production, where conditions may be significantly different. In this respect, it may be necessary to perform a number of controlled tests in the production environment to confirm that the release meets expectations. Piloting a release in a production environment carries a number of risks to that environment and should only be performed if the recovery procedures contained in the release package have been proven in the test environment.
After pilot and acceptance testing has been completed, the next step is to prepare the production environment for the release, move through the preparation process, and agree on the action to be taken—either to move to the Release Readiness Review or to return the release to the change owner or release manager for additional work.
The release manager, change manager, and change owner are the primary participants in the Release Readiness Review discussion, but it also may include representatives of other interested groups, such as the test teams, service desk, and user community (depending on the nature and size of the release).
Although a release may have failed a number of tests, both in the lab and in the production environment, the failures may not be significant enough to prevent deployment. Even if there are implications for the production environment, there may be a number of compelling business reasons why the release must be deployed.
For example, in a business-to-business commerce site, one feature—such as automated sign-up—may not work. It is easy to remove this feature and use a manual workaround. Therefore, the team might choose to proceed without this feature.
The testing experiences and lessons learned (in addition to any workarounds developed) are captured and documented. If issues were picked up during testing that affect the user community or service levels, it is necessary to discuss workarounds and expected problems with the service desk representatives and to ensure that the workarounds will be available to the service desk prior to deployment. Additional RFCs might need to be initiated in order to make the release work as planned. In either case, the change log needs to be updated with the decision and any other supporting information.
The process of deploying the release into the production environment depends on the type and nature of the release and on the selected release mechanism. In all cases, however, the release manager should be provided with status reports and, where appropriate, tools and technologies that will enable tracking and monitoring of deployment progress. As changes are made to IT components during deployment, corresponding changes must be made to the configuration items and relationships modeling them in the CMDB.
Once the release is deployed, the release manager confirms that it is working correctly before proceeding with further deployments. For some releases, technical support staff can obtain confirmation by using a number of tools and technologies; for others, the release manager may need to ask the service desk to contact individual users for their feedback and comments.
If the release fails to meet expectations or if serious problems are encountered during deployment, problem management may be needed to help identify and diagnose the root cause of the problem. If a suitable fix or workaround can be found, this should be documented and a request for change created to deploy it into the production environment. If not, it may be appropriate to use the backout procedures to remove the release from that environment.
Once the release deployment phase is complete, the release process should ensure that any results and information about any workarounds or RFCs raised in support of the release are recorded. If the release needs to be backed out, this should also be recorded, including any information that supports this decision.
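The deploy, confirm, and back out sequence described above can be sketched as follows. This is a deliberately reduced illustration: the three callables stand in for whatever installation, verification, and removal procedures the release package actually contains, the log strings are invented, and the step in which problem management looks for a fix or workaround before backing out is omitted.

```python
from typing import Callable


def deploy_with_backout(install: Callable[[], None],
                        verify: Callable[[], bool],
                        backout: Callable[[], None],
                        change_log: list[str]) -> bool:
    """Install a release, confirm it works, and back it out if it does not.

    Every outcome is appended to change_log so the decision is recorded.
    """
    install()
    change_log.append("installed")
    if verify():
        change_log.append("verified")
        return True
    backout()
    change_log.append("backed out: verification failed")
    return False


# Hypothetical release whose post-deployment check fails.
log: list[str] = []
succeeded = deploy_with_backout(
    install=lambda: None,
    verify=lambda: False,
    backout=lambda: None,
    change_log=log,
)
print(succeeded, log)  # False ['installed', 'backed out: verification failed']
```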
System Administration
The System Administration function performs Security Administration, Service Monitoring and Control, Job Scheduling, Network Administration, Directory Services Administration, Print and Output Administration, and Storage Management. The way in which one designs, develops, and implements this function will be determined by the size and architecture of the organization. Large organizations will have clearly defined models while smaller organizations will be likely to consolidate functions in order to maintain the health and operational capabilities of the systems.
The objective of the System Administration SMF is to administer a computing environment on a day-to-day basis. This entails managing and providing operational support for various elements within the production environment.
The System Administration function is responsible for providing administrative services in support of computing environments that contain both centralized and distributed hardware.
The System Administration function may also provide assistance with the functional administration of other SMFs that they are not directly responsible for implementing and managing. This assistance may include:
First-level performance and capacity monitoring for the Service Monitoring and Control function.
The day-to-day functional administration of account management (adding, deleting, or moving accounts) and the handling of inquiries about resources such as printers and security access privileges, for Directory Services Administration and Security Administration.
Management of resources used to produce printed reports and output for Print and Output Management.
The administrative tasks required to back up and restore critical data.
Enforcing a security policy for protecting data and shared network resources including files, folders, and printers.
Phase 4: Deploy (System Administration)
Implementing the Centralized Administration Model
In the centralized administration model, all or most of the operations and support functions are located at one or more central sites. With the maturation of local area, wide area, distributed, and client/server computing and their supporting networks, more and more organizations have made great strides toward centralizing support for installed resources, applications, and solutions.
Bandwidth to remote sites and branch offices is more widely available and affordable. Basic technologies that support branch office computing (transmission protocols, remote access tools, headless servers, and so on) have improved to the point where each branch office no longer needs its own separate support staff. Companies are thus increasingly able to centralize the fundamental support functions required to maintain the availability, reliability, day-to-day support, and management of systems that are distributed to the remote sites or branch offices.
Centralized administration typically assumes that all or most of the computing systems and resources being managed (administered) are centrally located. Although this is increasingly true, there continue to be cases where specific solutions (that is, custom applications, specialized databases, and so on) are not centralized in the corporate data center, but instead are distributed to the remote branch or site. This distribution of some applications and databases does not prevent taking a centralized approach to the administrative model.
Implementing the Centralized/Remote Administration Model
The centralized/remote administration model achieves most of the benefits of the completely centralized model. Most administration continues to be performed at the central location (for example, central data center), thereby retaining the greatest control and consolidation of resources necessary to execute the administrative function.
Some control and resource consolidation is given up, however, due to the requirement of maintaining a remote data center environment with at least a minimal localized administrative presence. Remedial system maintenance requirements on the distributed system may include system updates that require a reboot of the computer, as well as tape-backup and storage duties. There may be additional local-site administrative requirements, depending on the application or specific system being managed; you'll have to decide what specific responsibilities are necessary based upon your technology application.
The centralized/remote administration model describes systems that are distributed to remote locations with all major administrative control remaining at the central location. As stated above, there now needs to be a data center presence in the remote or regional location to house the servers or storage units. This implies that you now incur the cost of the data center infrastructure, which includes the physical plant, floor space, power, wiring, HVAC, and security components.
If the technology application evolves to the point where this model no longer remains viable (that is, no longer meets the service level agreements) or is no longer cost-effective, you may need to move to a distributed administrative model. In a distributed administrative model, the computing resources as well as the people resources are physically located at the remote location. This model is described later, under Implementing the Distributed Administration Model.
Implementing the Centralized/Delegated Administration Model
This model attempts to embrace the best of the centralized and remote administration models with all of their inherent features and benefits, yet also realizes some of the benefits of the distributed administration model. These benefits are achieved by pushing a relatively small and specialized subset of administrative tasks and responsibilities to the local branch offices and remote sites.
As with the centralized model, the primary administrative function and administrative workforce reside at the corporate (central) data center—all administrative direction and control originate from this location. The centralized resources continue to manage the centralized, data center-based network servers and services; these centralized resources also continue to remotely administer services across the network where possible, reasonable, and applicable.
Certain circumstances dictate the need to distribute specific services, servers, and resources; in these cases it may also be prudent and/or more efficient to allow some of the administrative tasks to be performed at the regional or remote locations. This is done by delegating very specific authority to the remote location resources. "Very specific authority" refers to a small subset of administrative rights and access that allow the remote administrators to perform specific, discrete tasks.
Implementing the Distributed Administration Model
Unlike the other models, distributed administration relies on full-support resources located in remote sites or branch offices. Resources at remotely located sites perform the fundamental (although critical) support functions necessary to maintain the health, availability, and reliability of systems distributed to those sites.
There may continue to be business drivers for maintaining systems that are distributed to remote locations. Some of these drivers may be related to performance, scalability, a specific type of application, or the cost or availability of network bandwidth that would support a centralized solution.
Computing and people resources are completely distributed to the remote offices and regional sites. As a result, the organization may realize much better local site performance for specific technology applications.
Implementing the Distributed Administration of Centralized Data Centers Model
The fifth system administration model, referred to here as the "follow-the-sun" model, could also be described as distributed administration of more than one centralized data center.
"Following the sun" in this context means providing support globally 7 days a week, 24 hours a day by transferring the responsibility for this support to different regions around the world as some offices close for the day and others open.
This model is unusual and is not as widely implemented as the four basic models previously described. Some companies have nevertheless tried, or are currently trying, to make this model work in their organizations.
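The hand-off of support responsibility between regions can be pictured as a lookup on the current time. In the sketch below, the three regions and their eight-hour UTC windows are invented for illustration; an actual schedule would reflect the organization's own sites and business hours.

```python
from datetime import datetime, timezone

# Hypothetical hand-off schedule: (start hour UTC, end hour UTC, region).
SHIFTS = [
    (0, 8, "Asia-Pacific"),
    (8, 16, "Europe"),
    (16, 24, "Americas"),
]


def region_on_duty(moment: datetime) -> str:
    """Return the region responsible for support at a timezone-aware moment."""
    hour = moment.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError(f"shift table does not cover hour {hour}")


print(region_on_duty(datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)))  # Europe
```

Because the windows cover all 24 hours without gaps, some region is always responsible, which is the property the model depends on.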
Network Administration
The Network Administration service management function focuses on operating networks, which are the infrastructure components through which computer systems and shared peripherals communicate with each other. It is the most basic level of an IT infrastructure; without network facilities, there is no infrastructure, just a collection of individual computers.
The goal of the Network Administration SMF is to provide and reference a solid foundation of processes for administering a network environment on a day-to-day basis. This entails managing and providing operational support for various elements within the production environment. The SMF’s objectives include providing planning and deployment services to expand existing network facilities, as well as support services to troubleshoot and repair faults in the network environment. Through effective implementation of the Network Administration SMF, IT organizations can expect to:
Improve their deployment of network infrastructure.
Improve troubleshooting processes and associated incident-management processes.
Increase network reliability.
Enhance availability of IT solutions and services.
A typical network consists of hardware (including cabling, routers, switches, hubs, physical servers, and other components) and the software or firmware that controls how the hardware components are used. In the networking model described by Open Systems Interconnection (OSI), the typical IT infrastructure is constructed in layers, from basic components used by all services at the bottom of the stack to specialized applications at the top.
The layers making up the OSI stack are (from the top, down):
Application
Presentation
Session
Transport
Network
Link (Data Link)
Physical
Network administration is typically involved with the bottom three layers of the stack (Physical, Data Link, and Network), which mostly consist of hardware. There is some overlap between network and system administration at the transport level, which includes the linking and networking protocols that enable the transfer of data from one point to another. From the MOF perspective, services such as DNS, WINS, and DHCP provide the basic name resolution and address assignment services required by fully featured IT services. Depending upon the organization, these core services may also be included as network service functions. Since DNS, WINS, and DHCP run on servers, network servers are sometimes included among the hardware components managed by the Network Administration SMF.
Phase 4: Deploy (Network Administration)
Maintaining a Network
Operating the network infrastructure is largely a matter of monitoring its performance, evaluating it against expected norms, and generating work items for troubleshooting if performance drops off. Most hardware components within a network should operate without hands-on maintenance or intervention, provided they stay within the manufacturer’s specifications for mean time between failures and other performance criteria. The MOF Capacity Management SMF provides details for capacity planning that will help the network design team optimize network performance.
The server-based components of the network do require periodic attention, however. These components require regular backups, where applicable, and evaluations of storage or capacity requirements, in accordance with the Storage Management SMF.
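The monitor, evaluate, and generate-work-item cycle described in this section can be sketched as follows. This is a minimal illustration, not part of MOF: the metric names, threshold values, device names, and the WorkItem structure are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    device: str
    metric: str
    observed: float
    limit: float


def find_work_items(samples, norms):
    """Compare observed metrics with expected norms and return work items.

    samples: {device: {metric: observed value}}
    norms:   {metric: maximum acceptable value} (hypothetical thresholds)
    """
    items = []
    for device, metrics in sorted(samples.items()):
        for metric, observed in sorted(metrics.items()):
            limit = norms.get(metric)
            # Only metrics that have a defined norm are evaluated.
            if limit is not None and observed > limit:
                items.append(WorkItem(device, metric, observed, limit))
    return items


# Hypothetical norms and one polling cycle of measurements.
NORMS = {"latency_ms": 50.0, "packet_loss_pct": 1.0}
SAMPLES = {
    "core-switch-1": {"latency_ms": 12.0, "packet_loss_pct": 0.1},
    "edge-router-2": {"latency_ms": 85.0, "packet_loss_pct": 0.4},
}

work_items = find_work_items(SAMPLES, NORMS)
for item in work_items:
    print(f"{item.device}: {item.metric}={item.observed} exceeds {item.limit}")
```

In practice the samples would come from whatever monitoring tool the organization uses, and the resulting work items would feed the incident management process.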
Supporting a Network
Network support is closely aligned with activities in the Supporting Quadrant, particularly the Incident Management SMF and Problem Management SMF. Through the incident resolution process described in the Incident Management SMF, IT networking specialists correct network errors, develop workarounds, and prevent or mitigate impending network issues. Although the generic process for resolving incidents is described in the Incident Management SMF guidance document, network-specific processes for troubleshooting are provided in the following sections.
The Job Scheduling Service Management Function (SMF) is concerned with ensuring the efficient processing of data at a predetermined time and in a prescribed sequence to maximize the use of system resources and minimize the impact on online users. A batch process is a system interaction with a database that runs in the background and in a sequential manner without interaction from an end user. The execution of batch processes may be automated or manually initiated. Batches are usually executed after business hours when user interaction with the system is low.
Batch runs typically require their own architecture because they tend to be resource-intensive, long-running, repetitive processes. The process usually involves reading large amounts of data from a database, processing the data, and writing the results back to a database. This process is accomplished through the execution of scripts.
Types of batch jobs that organizations execute include:
Financial management reports
Supply chain management reports
Customer account processing (monthly account billing, and so on)
Automated backups of system and application data
System processing summaries and capacity planning reports
Phase 4: Deploy (Job Scheduling)
A batch architecture consists of the processes and components used to effectively manage batch processing. The purpose of the batch architecture is to optimize processing (improve response time and utilization of system resources) by executing batch runs during off-peak periods. The architecture should provide the capacity manager with an easy-to-use interface and permit a standard and centralized approach to batch scheduling, monitoring, control, and error correction. The architecture should be highly scalable in order to meet the future needs of the organization. It should also be highly available, with minimal downtime, and should minimize the impact on online operations, which usually run concurrently with the batch operations. Some organizations may decide to have backup components, such as the event server, to ensure the completion of all mission-critical batch jobs.
The basic components of the batch architecture include the management server, capacity database (CDB), monitor, printer, application servers, and databases. In addition to the monitor attached to the management server, each application server should have a monitor to permit viewing of local batch-processing activity; this also facilitates error analysis at the local level.
Before discussing scheduled batch runs, it is useful to first understand the hierarchy of the batch process and the contents of a batch script. A batch run consists of multiple independent batch jobs that are scheduled for execution on a recurring basis. A large organization may execute multiple batch runs throughout the day, depending on the resources required to process them. Each batch job consists of multiple batch job steps that control specific activities of job execution.
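The run, job, and job step hierarchy described above can be modeled with a few small data structures. The sketch below is illustrative only: the class names, the recurrence label, and the sample jobs are assumptions, not names taken from any scheduling product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobStep:
    name: str            # one activity within a job, e.g. "read" or "write"


@dataclass
class BatchJob:
    name: str
    steps: List[JobStep] = field(default_factory=list)


@dataclass
class BatchRun:
    name: str
    recurrence: str      # free-form label such as "nightly"; not a real scheduler syntax
    jobs: List[BatchJob] = field(default_factory=list)

    def outline(self) -> List[str]:
        """Flatten the hierarchy into indented lines for review."""
        lines = [f"{self.name} ({self.recurrence})"]
        for job in self.jobs:
            lines.append(f"  {job.name}")
            lines.extend(f"    {step.name}" for step in job.steps)
        return lines


# Hypothetical run containing two independent jobs.
nightly = BatchRun("nightly_run", "nightly", [
    BatchJob("account_billing", [JobStep("read"), JobStep("process"), JobStep("write")]),
    BatchJob("system_backup", [JobStep("snapshot"), JobStep("verify")]),
])
print("\n".join(nightly.outline()))
```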
An organization typically processes numerous batch jobs. To ensure a consistent approach to the execution of each job, a batch job skeleton should be devised that contains the standardized code required for each job; job-specific information should be coded into a designated area within the skeleton. The skeleton also simplifies development and maintenance by standardizing the content and structure of each job script. For example, standardized actions, such as notification of a successful or unsuccessful batch job execution and transaction data archiving and deletion, should be included in the code of every script that is executed.
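One way to picture a batch job skeleton is as a wrapper that supplies the standardized actions while the job-specific code is passed in. The following is a minimal sketch under that assumption; the function name run_with_skeleton, the notify and archive callbacks, and the sample jobs are hypothetical.

```python
def run_with_skeleton(job_name, job_body, notify, archive):
    """Run job-specific code inside standardized success/failure handling."""
    try:
        job_body()                      # designated area for job-specific logic
    except Exception as exc:
        # Standardized action: report an unsuccessful execution.
        notify(f"{job_name}: FAILED ({exc})")
        return False
    # Standardized actions: archive transaction data, then report success.
    archive(job_name)
    notify(f"{job_name}: SUCCEEDED")
    return True


# Hypothetical jobs; lists stand in for a notification channel and an archive.
messages = []
archived = []


def billing_body():
    pass  # job-specific processing would go here


def failing_body():
    raise RuntimeError("input file missing")


ok = run_with_skeleton("monthly_billing", billing_body, messages.append, archived.append)
failed = run_with_skeleton("supply_report", failing_body, messages.append, archived.append)
print(messages)
```

Because every job passes through the same wrapper, a change to the notification or archiving behavior is made once rather than in each script.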
Scheduled batch runs are initiated by a scheduling tool during a predefined batch window, typically when user activity on the system is at a minimum. After the scheduling tool has been programmed, capacity manager interaction should not be necessary during batch runs unless an error occurs from which the tool cannot recover.
The scheduling tool initiates all batch runs. If the run does not begin as scheduled, the tool halts the run and records an error in the error log; some scheduling tools may have the ability to attempt to restart the batch run. If the initialization is successful, the first batch job is executed. The scheduler manages the execution of jobs on each application server targeted to perform the batch processing. If no errors are encountered during the execution of the job run, then the tool records the successful completion of the job in the batch log and the next job is executed, and so on.
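The control flow described above can be sketched as a simple loop. This is an illustration rather than the behavior of any particular scheduling tool: the function name, the log formats, and the choice to continue with the remaining jobs after one job fails are assumptions.

```python
def execute_run(jobs, run_started=True):
    """Execute jobs in order and return (batch_log, error_log).

    jobs: list of (name, callable) pairs.
    run_started: False simulates a run that did not begin as scheduled.
    """
    batch_log, error_log = [], []
    if not run_started:
        # The run is halted and the failure is recorded.
        error_log.append("run did not begin as scheduled")
        return batch_log, error_log
    for name, job in jobs:
        try:
            job()
        except Exception as exc:
            # Stop processing this job and record the error.
            error_log.append(f"{name}: {exc}")
            continue  # assumption: independent jobs still get their turn
        batch_log.append(f"{name}: completed")
    return batch_log, error_log


def failing_job():
    raise RuntimeError("database unavailable")


# Hypothetical run with one failing job.
batch_log, error_log = execute_run([
    ("financial_report", lambda: None),
    ("billing", failing_job),
    ("backup", lambda: None),
])
print(batch_log)
print(error_log)
```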
If an error occurs during the execution of a batch job, the scheduling tool stops processing that job and an error is sent to the error log. Whether a batch job actually executes depends on a number of inputs, and a missing input may be the reason that a job does not run. For example, the following inputs may be required:
Availability of production data
Completion of a job on which the queued job is dependent
Availability of resources to execute the job
Priority of the job and the batch window
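The inputs listed above can be expressed as a readiness check that a scheduler evaluates before starting a queued job. The sketch below assumes a simple model: the QueuedJob fields, the slot-based resource count, the window times, and the convention that a lower priority number is more urgent are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import List, Set


@dataclass
class QueuedJob:
    name: str
    priority: int = 5                  # lower number = more urgent (assumed convention)
    depends_on: List[str] = field(default_factory=list)
    data_ready: bool = True


def in_window(now: time, start: time, end: time) -> bool:
    """True if now falls in the batch window, including windows that span midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def blocking_reasons(job: QueuedJob, completed: Set[str], free_slots: int,
                     now: time, start: time, end: time) -> List[str]:
    """Return the unmet inputs for a job; an empty list means it may run."""
    reasons = []
    if not job.data_ready:
        reasons.append("production data not available")
    missing = [dep for dep in job.depends_on if dep not in completed]
    if missing:
        reasons.append("waiting on " + ", ".join(missing))
    if free_slots < 1:
        reasons.append("no resources available")
    if not in_window(now, start, end):
        reasons.append("outside batch window")
    return reasons


# Hypothetical queue evaluated at 23:30 with a 22:00 to 05:00 window.
WINDOW = (time(22, 0), time(5, 0))
queue = [
    QueuedJob("billing", priority=1, depends_on=["extract"]),
    QueuedJob("backup", priority=3),
    QueuedJob("report", priority=2, data_ready=False),
]
completed_jobs = {"load"}

ready = [j for j in queue
         if not blocking_reasons(j, completed_jobs, 2, time(23, 30), *WINDOW)]
next_job = min(ready, key=lambda j: j.priority) if ready else None
print(next_job.name if next_job else "nothing ready")
```

Returning the list of reasons, rather than a single yes or no, gives the operator something concrete to record when a job is held back.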
Warnings may also be generated during job processing if, for example, CPU or disk space utilization exceeds the threshold value for an application server processing the batch job. Warnings should not stop a batch job from being completed; however, the capacity manager should address all warnings as soon as possible because they may lead to future job processing failures. If an error causes a job to be stopped, some scheduling tools attempt to automatically recover from the problem and resume processing from the beginning of the job step that was executing when processing was stopped. Recovery is useful when processing very long batch jobs because jobs that are interrupted do not have to be restarted from the beginning. If recovery is not possible, the job needs to be restarted.
If the scheduling tool is not capable of restarting or recovering from a failed job (or is unable to restart or recover in a specific instance), the capacity manager has to manually initiate restart or recovery procedures.
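Step-level recovery of the kind described here can be sketched with a checkpoint that records the next step to run. This is an illustrative model only: the run_job function, the dictionary checkpoint, and the max_failures limit are assumptions, and a real tool would persist the checkpoint outside the process.

```python
def run_job(steps, checkpoint, max_failures=2):
    """Run steps in order, resuming from checkpoint["next_step"] after a failure.

    Returns True when every step has completed, or False when the number of
    failures reaches max_failures and manual restart or recovery is needed.
    """
    failures = 0
    while checkpoint["next_step"] < len(steps):
        index = checkpoint["next_step"]
        try:
            steps[index]()
        except Exception:
            failures += 1
            if failures >= max_failures:
                return False
            continue  # recover: rerun the interrupted step from its beginning
        checkpoint["next_step"] = index + 1  # step finished; advance checkpoint
    return True


# Hypothetical three-step job whose second step fails once.
calls = []
state = {"transform_failures_left": 1}


def extract():
    calls.append("extract")


def transform():
    calls.append("transform")
    if state["transform_failures_left"] > 0:
        state["transform_failures_left"] -= 1
        raise RuntimeError("simulated disk error")


def load():
    calls.append("load")


checkpoint = {"next_step": 0}
completed = run_job([extract, transform, load], checkpoint)
print(completed, calls)
```

In this run the extract step executes only once, which is the benefit the text describes: an interrupted job resumes at the failed step instead of starting over.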
Job Scheduling Activities
Ideally, the batch architecture should be configured in such a manner that capacity manager interaction is kept to a minimum. However, there are still many daily tasks that the capacity manager must perform, including:
As-needed request handling
Capacity manager log entry
Each of these activities is an integral part of the job scheduling process. The monitoring, analysis, tuning, and implementation activities form an iterative process. Inputs to the process include resource utilization and operating level agreement (OLA) thresholds against which the batch architecture is monitored. These are ongoing activities whose goal is to optimize the performance and utilization of the architecture. The remaining activities are performed in response to an event, request, or requirement.
Documentation and Training
All policies and procedures should be clearly documented so that the capacity manager has a reference for daily operational guidance. Documentation should include information on:
Procedures for starting and stopping batch runs.
Procedures for changing job priority.
Procedures for handling alerts and errors.
Procedures for handling common errors.
Procedures for analyzing the cause of errors.
Procedures for escalating errors.
Procedures for submitting RFCs.
Procedures for documenting tasks in the capacity manager's log.
Procedures for what should be monitored and when.
Procedures for handling as-needed requests including reviewing, testing, and running these requests.
Without proper training, the capacity manager cannot perform the activities discussed in this document. It is important that the capacity manager be properly trained so that processing errors can be responded to and corrected in a timely manner. If vendor training on the organization's scheduling tool is available, the capacity manager should attend it.
For more information, visit the Microsoft Operations Framework Web site.
To see how Microsoft IT uses Microsoft Operations Framework and best practice IT Service Management, go to http://www.microsoft.com/technet/itshowcase/content/mofmmppt.mspx.
Checkpoint: Operating, Optimizing, and Change Processes
Implemented service level management across IT operations.
Implemented best practice release management.
Optimized network and system administration processes.
Implemented best practice job scheduling.
If you have completed the steps listed above, your organization has met the minimum requirement of the Rationalized level for Operating, Optimizing, and Change Processes of the Core Infrastructure Optimization Model. We recommend that you follow the guidance of additional best practice resources for operations management in Microsoft Operations Framework.