Creating UNIX Diagnostic & Recovery Tasks in SCOM

 

Hello All!

Recently, I got an opportunity to work with J Powell from Nine Virtual Technologies. We worked on implementing a diagnostic & recovery solution for some of the critical UNIX workloads in their environment. The challenging part of the problem is that the SCOM 2012 R2 console does not provide a UI for creating UNIX recoveries & diagnostics, so we had to fall back to writing the tasks using MP authoring techniques.

We researched a lot and spent quite some time understanding the various nuances. When our tasks finally worked, it was a eureka moment for both of us! J and I thought that it would be a great idea to share the steps with you all, and in no time, J wrote this article to be published on my blog.

Thank you, J, for sharing your expertise with the System Center community.

Hope this post helps you save time in creating your UNIX recovery & diagnostic tasks. If you would like to contact J, you can email him at j.powell@ninevirtualtech.com.

Sanjeev

----

Creating Recovery and Diagnostic tasks for Linux is not as easy as it is for Windows. I knew from creating Recovery and Diagnostic tasks for Windows servers that I could run a command or a script. It turns out the same is true for Linux: I can run a script (bash, python, perl, etc.) or a single command.

From researching the topic, I could tell that I had to create a custom Management Pack and edit its XML. However, I could not determine how to put the parts together, and I could not find anything specific to SCOM 2012 R2. I wanted to provide a write-up on how to manage a Linux process that highlights the pitfalls and how to avoid them.

As an example, let’s monitor the sleep process in Linux. If the sleep process stops, I want it to restart automatically and the health of the monitor to reset.

First, log in to a Linux shell and issue the command ‘sleep 5000’. This will keep a sleep process running for 5,000 seconds (about 83 minutes).

Now, let’s log in to the SCOM Admin Console and go to Authoring > Management Pack Templates > UNIX/Linux Process Monitoring.

Follow the steps below to create the monitor for the sleep process:

1. Right-click UNIX/Linux Process Monitoring and select Add Monitoring Wizard.

2. Select UNIX/Linux Process Monitoring and click Next.

3. Under Name, type a name. For this example, we will use Sleep.

4. Under Management Pack, select New. This step is very important. Do not add this monitor to any existing Management Packs, as we will need to export and edit this one in later steps.

5. Once you click New, a new wizard will appear. Name the new Management Pack Sleep and click Next.

6. Under Knowledge, you can leave this part blank. Click Create, and you will be returned to the previous UNIX/Linux Process Monitoring window. Click Next to continue.

7. Under Process name, click the Select a process button. You will be prompted to browse for a computer. Clicking the Browse button will display all of the UNIX/Linux computers with the SCOM agent installed. Select the one on which you executed the sleep 5000 command, and hit Select.

8. In a few moments, SCOM will display all of the active processes on the Linux server. Select sleep from the list and hit the Select button.

9. You will be returned to the Process Monitor window. Notice that the Process Name and Computer to Target fields are now filled in. Also note that the full command appears in the Expression Matching Results field. Select it and click Next.

[Screenshot: Process Monitor window with the Process Name and Computer to Target fields filled in]

10. On the Configure process monitoring settings page, take the defaults. This will alert us if the number of sleep processes drops to zero. Hit Next.

11. On the Summary screen, hit Create.

After a moment, you will see the process named Sleep appear under the UNIX/Linux Process Monitoring section. Now that we have created a monitor to keep tabs on the sleep process, we need to export the Management Pack. Go to Administration > Management Packs and search for the one you created, named Sleep.

Click on Sleep, then click on Export Management Pack in the upper right-hand corner. You will be prompted to save the file. Save it to your desktop.

Now, open the file in a text editor (I use Notepad++), and let’s go over the parts we need to edit manually:

Notice the version highlighted below:

[Screenshot: version number near the top of the exported XML, highlighted]

You will have to increase this number in order to import the Management Pack again. For example, if it reads 1.0.0.0, change it to 1.0.0.1.

Scrolling down a bit further, you will see the References section. You will need to add the portion highlighted below:

[Screenshot: References section, with the reference to add highlighted]

You likely do not have the UNIX/Linux Authoring Management Pack installed. If you don't, download it from this link and install it: https://gallery.technet.microsoft.com/UNIXLinux-Authoring-b16fd2e4

You also need to view the UNIX/Linux Authoring MP in a text editor and make certain that its Public Key Token and version number match the section highlighted in the previous screenshot.
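Since the screenshot is not reproduced here, the following is only a rough sketch of the shape of such a reference. The alias must be UNIXAuthoring, because the tasks below call the library's modules as UNIXAuthoring!...; the ID is my assumption of the library's Management Pack ID, and the version and key token are placeholders that you must replace with the values from the Authoring MP itself.

```xml
<References>
  <!-- References exported with the MP stay as they are. -->
  <!-- The tasks also use the Unix! prefix for Microsoft.Unix.Library types.
       If your export references that MP under a different alias, change
       the Unix! prefix in the task XML to match. -->
  <Reference Alias="UNIXAuthoring">
    <!-- ASSUMED ID and PLACEHOLDER version/token: copy the real values
         from the UNIX/Linux Authoring MP. -->
    <ID>Unix.Authoring.Library</ID>
    <Version>1.0.0.0</Version>
    <PublicKeyToken>0000000000000000</PublicKeyToken>
  </Reference>
</References>
```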

Before we get to the Diagnostic and Recovery task edits, let’s take note of one more item: the interval at which the monitor evaluates. For this particular monitor, the default is every 5 minutes (300 seconds). For testing, you may want to change this value to 60, which makes the monitor run every minute. Look for the section to edit based on the screenshot below:

[Screenshot: monitor interval setting in the exported XML]

Now, on to the edits that will make a Diagnostic or Recovery Task possible.

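To help simplify things, the following sections provide four examples: two Diagnostic tasks and two Recovery tasks. Within each pair, the difference is whether you run a single command or a shell script. You can copy and paste these examples into your Management Pack. Be sure not to change the order of the elements or remove any of them, unless otherwise stated; doing so will cause errors when you try to import the Management Pack. The two examples in each pair are alternatives that share the same ID, so if you use both, give each a unique ID and place them inside a single <Diagnostics> or <Recoveries> element. If you add both diagnostics and recoveries, <Diagnostics> must come before <Recoveries>.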

Diagnostic Task for Executing a Command:

Insert the following between the </Monitors> and <Overrides> tags. Notice that the GUID following ProcessTemplate_ in the Target attribute is the same as the one at the end of the Monitor attribute. It must match the GUID in the <UnitMonitor ID= element of your Management Pack, which will be different from the one below:

<Diagnostics>
  <Diagnostic
      Accessibility="Public"
      ID="RunSleep.Diagnostic"
      Remotable="true"
      Target="ProcessTemplate_15f945f92c2e467d94742c41989abfa0"
      Enabled="true"
      Timeout="300"
      ExecuteOnState="Error"
      Monitor="Microsoft.Unix.Generic.ProcessCount.Monitor.ProcessTemplate_15f945f92c2e467d94742c41989abfa0"
      Comment="In response">
    <Category>Maintenance</Category>
    <!-- nohup and the trailing & run sleep in the background, so the command
         returns before the 30-second module Timeout. &amp; is the XML escape for &. -->
    <ProbeAction ID="ExecuteRecoveryCommandWA" TypeID="UNIXAuthoring!Unix.Authoring.ShellCommand.ProbeAction">
      <TargetSystem>$Target/Host/Property[Type="Unix!Microsoft.Unix.Computer"]/PrincipalName$</TargetSystem>
      <ShellCommand>export TERM=vt100; nohup sleep 5000 >/dev/null 2>&amp;1 &amp;</ShellCommand>
      <Timeout>30</Timeout>
      <UserName>$RunAs[Name="Unix!Microsoft.Unix.ActionAccount"]/UserName$</UserName>
      <Password>$RunAs[Name="Unix!Microsoft.Unix.ActionAccount"]/Password$</Password>
    </ProbeAction>
  </Diagnostic>
</Diagnostics>

At the bottom of the Management Pack, in the <DisplayStrings> section, add the following:

 <DisplayString ElementID="RunSleep.Diagnostic">
    <Name>Run the sleep command</Name>
 </DisplayString>
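For orientation, the <DisplayStrings> section sits inside <LanguagePacks> near the end of the exported file. A minimal sketch, assuming the export's default language pack is English (ENU); the optional <Description> text is my own illustrative wording:

```xml
<LanguagePacks>
  <LanguagePack ID="ENU" IsDefault="true">
    <DisplayStrings>
      <!-- Display strings exported with the MP remain here. -->
      <DisplayString ElementID="RunSleep.Diagnostic">
        <Name>Run the sleep command</Name>
        <!-- Optional; illustrative wording only. -->
        <Description>Starts the sleep process when the monitor is in the Error state.</Description>
      </DisplayString>
    </DisplayStrings>
  </LanguagePack>
</LanguagePacks>
```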

This is how it will look once you reimport the Management Pack:

[Screenshot: the new diagnostic task in the SCOM console]

 

Diagnostic Task for Executing a Script:

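There are only two differences between executing a script and executing a command: the word Command is replaced with Script (in the ProbeAction ID and TypeID, and in the <ShellScript> element), and a <ScriptArguments> element is added. Note that ./sleep.sh is a relative path, so it depends on the working directory the agent uses when it runs the task; an absolute path to the script is safer.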

 <Diagnostics> 
  <Diagnostic Accessibility="Public" ID="RunSleep.Diagnostic" Remotable="true" Target="ProcessTemplate_15f945f92c2e467d94742c41989abfa0" Enabled="true" Timeout="300" ExecuteOnState="Error" Monitor="Microsoft.Unix.Generic.ProcessCount.Monitor.ProcessTemplate_15f945f92c2e467d94742c41989abfa0" Comment="In response">
    <Category>Maintenance</Category>
    <ProbeAction ID="ExecuteRecoveryScriptWA" TypeID="UNIXAuthoring!Unix.Authoring.ShellScript.ProbeAction">                          
       <TargetSystem>$Target/Host/Property[Type="Unix!Microsoft.Unix.Computer"]/PrincipalName$</TargetSystem>
       <ShellScript>export TERM=vt100; ./sleep.sh</ShellScript>
       <ScriptArguments> </ScriptArguments>
       <Timeout>30</Timeout>
       <UserName>$RunAs[Name="Unix!Microsoft.Unix.ActionAccount"]/UserName$</UserName>
       <Password>$RunAs[Name="Unix!Microsoft.Unix.ActionAccount"]/Password$</Password>
    </ProbeAction>
  </Diagnostic>
</Diagnostics>

As in the previous example, add the following to the <DisplayStrings> section:

 <DisplayString ElementID="RunSleep.Diagnostic">
    <Name>Run the sleep command</Name>
 </DisplayString>
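If you would rather not depend on a script file already existing on the server, one alternative is to embed the script text directly. This sketch rests on an assumption: that the Authoring library's ShellScript modules hand the contents of <ShellScript> to the agent as the script body (which the separate <ScriptArguments> element suggests), rather than running it as a one-line command. The script body is hypothetical; the ProbeAction below would replace the one in the script Diagnostic above.

```xml
<ProbeAction ID="ExecuteRecoveryScriptWA" TypeID="UNIXAuthoring!Unix.Authoring.ShellScript.ProbeAction">
  <TargetSystem>$Target/Host/Property[Type="Unix!Microsoft.Unix.Computer"]/PrincipalName$</TargetSystem>
  <!-- ASSUMPTION: the text inside ShellScript is executed as the script itself. -->
  <ShellScript>#!/bin/sh
# Hypothetical script body: start sleep in the background and return at once.
nohup sleep 5000 >/dev/null 2>&amp;1 &amp;
</ShellScript>
  <ScriptArguments> </ScriptArguments>
  <Timeout>30</Timeout>
  <UserName>$RunAs[Name="Unix!Microsoft.Unix.ActionAccount"]/UserName$</UserName>
  <Password>$RunAs[Name="Unix!Microsoft.Unix.ActionAccount"]/Password$</Password>
</ProbeAction>
```

Because the script is XML text, any & in it must still be written as &amp;.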

Recovery Task for Executing a Command:

Insert the following between the </Monitors> and <Overrides> tags (after </Diagnostics>, if you added diagnostics). Notice that the GUID following ProcessTemplate_ in the Target attribute is the same as the one at the end of the Monitor attribute. It must match the GUID in the <UnitMonitor ID= element of your Management Pack, which will be different from the one below:

<Recoveries>
  <Recovery ID="RunSleep.Recovery" Accessibility="Internal" Enabled="true" Target="ProcessTemplate_15f945f92c2e467d94742c41989abfa0" Monitor="Microsoft.Unix.Generic.ProcessCount.Monitor.ProcessTemplate_15f945f92c2e467d94742c41989abfa0" ResetMonitor="true" ExecuteOnState="Error" Timeout="300">
      <Category>Maintenance</Category>
      <!-- nohup and the trailing & run sleep in the background, so the command
           returns before the 30-second module Timeout. &amp; is the XML escape for &. -->
      <WriteAction ID="ExecuteRecoveryCommandWA" TypeID="UNIXAuthoring!Unix.Authoring.ShellCommand.WriteAction">
         <TargetSystem>$Target/Host/Property[Type="Unix!Microsoft.Unix.Computer"]/PrincipalName$</TargetSystem>
         <ShellCommand>export TERM=vt100; nohup sleep 5000 >/dev/null 2>&amp;1 &amp;</ShellCommand>
         <Timeout>30</Timeout>
         <UserName>$RunAs[Name="Unix!Microsoft.Unix.ActionAccount"]/UserName$</UserName>
         <Password>$RunAs[Name="Unix!Microsoft.Unix.ActionAccount"]/Password$</Password>
      </WriteAction>
  </Recovery>
</Recoveries>
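The recovery example above does not come with a display string. By analogy with the diagnostic examples, a sketch of one for RunSleep.Recovery goes in the same <DisplayStrings> section; the Name text is my own choice. Without it, the console will likely show the raw ID.

```xml
<DisplayString ElementID="RunSleep.Recovery">
  <!-- Illustrative name; any descriptive text works. -->
  <Name>Restart the sleep process</Name>
</DisplayString>
```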

Recovery Task for Executing a Script:

As with the diagnostic, there are only two differences between executing a script and executing a command: the word Command is replaced with Script (in the WriteAction ID and TypeID, and in the <ShellScript> element), and a <ScriptArguments> element is added:

 

<Recoveries>
  <Recovery ID="RunSleep.Recovery" Accessibility="Internal" Enabled="true" Target="ProcessTemplate_15f945f92c2e467d94742c41989abfa0" Monitor="Microsoft.Unix.Generic.ProcessCount.Monitor.ProcessTemplate_15f945f92c2e467d94742c41989abfa0" ResetMonitor="true" ExecuteOnState="Error" Timeout="300">
      <Category>Maintenance</Category>
      <WriteAction ID="ExecuteRecoveryScriptWA" TypeID="UNIXAuthoring!Unix.Authoring.ShellScript.WriteAction">
         <TargetSystem>$Target/Host/Property[Type="Unix!Microsoft.Unix.Computer"]/PrincipalName$</TargetSystem>
         <ShellScript>export TERM=vt100; ./sleep.sh</ShellScript>
         <ScriptArguments> </ScriptArguments>
         <Timeout>30</Timeout>
         <UserName>$RunAs[Name="Unix!Microsoft.Unix.ActionAccount"]/UserName$</UserName>
         <Password>$RunAs[Name="Unix!Microsoft.Unix.ActionAccount"]/Password$</Password>
      </WriteAction>
  </Recovery>
</Recoveries>
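Putting the pieces together, here is an outline of the <Monitoring> section; comments stand in for the real content, so it is not importable as-is. It shows the order the Management Pack schema expects: <Diagnostics> after </Monitors>, then <Recoveries>, then <Overrides>. If you combine the command and script variants, each task needs its own ID; RunSleepScript.Recovery is a hypothetical name.

```xml
<Monitoring>
  <!-- Discoveries, Rules, and Tasks, if present, come before Monitors. -->
  <Monitors>
    <!-- Exported UnitMonitor for ProcessTemplate_<GUID>, unchanged. -->
  </Monitors>
  <Diagnostics>
    <!-- Diagnostic ID="RunSleep.Diagnostic" from the examples above. -->
  </Diagnostics>
  <Recoveries>
    <!-- Recovery ID="RunSleep.Recovery" (command variant). -->
    <!-- Recovery ID="RunSleepScript.Recovery" (script variant; hypothetical ID). -->
  </Recoveries>
  <Overrides>
    <!-- Overrides exported with the MP, unchanged. -->
  </Overrides>
</Monitoring>
```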

Now that the edits are complete, save the file and let’s import the Management Pack. Be sure to increase the version number at the top of the XML, or it will not import. If it imports successfully, you can kill the sleep process from the Linux shell, and you will see a critical error logged in the SCOM console. Use Health Explorer to view the results of your Diagnostic or Recovery task.