Deploying and Troubleshooting SCOM on Unix/Linux machines
I’m not going to rehash all the how-to articles written on deploying SCOM agents to cross platform machines, but I do think there would be some benefit in having a consolidated location for tips and tricks as well as the issues that pop up along the way. To be clear, I’m not a Unix/Linux expert at all. I know a few commands that are worthwhile, but I’m a Windows guy through and through. In my two years at Microsoft, I have found that most SCOM environments I work in are populated in the same manner. The SCOM admins/engineers usually don’t have much, if any, Unix experience, and working with cross platform engineers can be like working with someone who speaks a different language. That isn’t meant as a slap at the cross platform community, just an acknowledgement that the terminology is very different, and just as I don’t quite understand their world, they don’t understand mine. As such, I’m using this as a nice little place to compile the various issues I’ve run across. Feel free to add some in the comments, especially if you know how to reproduce the issue or have links that I’m not listing here. I’d note that some of these links point back to some of these same pieces. I just figured I’d start making a more comprehensive list.
I don’t have any links here, though I’m sure there’s stuff out there. The key thing to understand is that in the Windows world, the agent runs most of SCOM’s workflows. In the cross platform world, the management server does most of the work: it makes a WS-Man call over HTTPS to the Unix/Linux server on port 1270, issues commands, and compiles the results (more description here). That is why you cannot manage nearly as many cross platform agents as Windows agents in your SCOM environment. If you have a lot of Unix/Linux machines, you might want to check the sizing guide and make sure you don’t need to add a few more management servers. Also of note, the ports you need are 22 (for SSH) and 1270 (for the WS-Man calls). If you have firewalls between the management server and the Unix/Linux machines, you need to make sure those ports are open. In both cases, the management server initiates the call to the Unix/Linux machine on those ports. It is my understanding that the agent communication returns to the management server on upper TCP ports in the 50000 range. The cross platform agent will not initiate calls to the management server.
Make a point of getting the latest and greatest management pack (MP). The newest Unix/Linux MP also needs some changes found in SCOM 2012 R2 UR9 to correct some known issues, so make sure you’re at the latest version of SCOM to go with that new MP; both were released this year (2016). Also, Kris Bash has written a very nice extended monitoring MP that adds a few more counters for monitoring. Note that this is not an officially supported MP as far as I know.
Useful troubleshooting software:
Download a copy of WinSCP. This is a Windows-based tool that can be used to transfer files between Unix/Linux and Windows systems. It has a fairly easy to use UI, and you can install it on a management server. PuTTY (free) or Reflections can be used to SSH, and having a copy on a management server can be useful. The Windows telnet client (a feature that can be added to Windows Server) can also help test connections to ports 22 and 1270. Also of note, Windows 10 with the Anniversary Update has a Bash application built in, making it easy to SSH into a server from the command line without the need for PuTTY. Cygwin may also be of use.
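If you would rather script the port test than use telnet, here is a minimal Python sketch. It assumes Python is installed on a machine that sits on the same network path as the management server. The function names and the host name in the usage line are made up for illustration and are not part of SCOM.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the same TCP handshake a telnet test would.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_scom_ports(host, ports=(22, 1270), timeout=5.0):
    """Check the two ports the management server needs: 22 (SSH) and 1270."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

Running `check_scom_ports("linuxserver01.example.com")` (a hypothetical host name) returns a dictionary keyed by port number, with True for each port that accepted the connection. A False only tells you the connection did not succeed; it does not tell you whether a firewall or the server itself was the cause.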
Kevin has a great setup guide here. He walks you through it step by step. For basic setup, follow this guide, as it will get you started in the right direction.
Think of sudoers as kind of like User Account Control in Windows. Most cross platform administrators are not too keen on giving you root access to their servers, and rightfully so. It is not good practice. Instead, you can configure your monitoring account with sudo rights. This allows the monitoring user account to elevate itself and run the specific tasks designated in the sudoers file. The natural inclination is to cheat and simply give rights to everything. You may as well give root access at that point, and I don’t recommend it. Microsoft has published the commands necessary to configure sudo elevation on a Linux/Unix system. You can simply find your flavor in this article and copy the entries into a Notepad document. Replace “monuser” with the name of the service account that the Unix/Linux admin created for you. That is what they will need to put in their sudoers file. That said, these are security settings and should be reviewed with your Unix/Linux administrator. Tyson Paul was kind enough to link to a sudoers generator. Not a bad option either.
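If you have many flavors to prepare, the “monuser” replacement can be scripted instead of done by hand in Notepad. The following is a rough Python sketch. The function name, the account name `svc_scom`, and the two sample lines are hypothetical placeholders of my own, not the entries Microsoft publishes; paste in the real text for your flavor.

```python
import re


def personalize_sudoers(template, account):
    """Replace the placeholder user 'monuser' with the real monitoring account."""
    if not account or any(ch.isspace() for ch in account):
        raise ValueError("account name must be non-empty and contain no whitespace")
    # \b keeps the match to the whole word, so a name like 'monuser2' is left alone.
    return re.sub(r"\bmonuser\b", account, template)


# Hypothetical placeholder entries, NOT the published Microsoft configuration.
SAMPLE_TEMPLATE = """\
monuser ALL=(root) NOPASSWD: /path/to/allowed/command
Defaults:monuser !requiretty
"""

if __name__ == "__main__":
    print(personalize_sudoers(SAMPLE_TEMPLATE, "svc_scom"))
```

The output is still just text to hand to your Unix/Linux administrator for review. It should not be written straight into a sudoers file.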
Note that you can test sudo by making an SSH connection to the Unix/Linux system with the monitoring account and executing the commands manually from a terminal emulator. If those commands fail, the problem could be as silly as a typo in the sudoers file.
This kind of dovetails off of the sudoers file, but Unix/Linux admins can move sudo to a non-standard location. If that is the case, they need to define a symbolic link for us SCOM admins so that when our agent calls sudo, it will run. This article mentions it, and also goes through the process of creating your monitoring account and making use of an SSH key if that is preferred.
Enabling Detailed Logging:
This little article on logging can be very useful in tracking down issues if they arise. It’s a bit unconventional. You can find some events in the OperationsManager log on the SCOM server, and during initial deployments I highly recommend you remove all but one management server from the resource pool so that you have only one to work with. This little step, though, can help you better see what is going on from a tracing perspective. The first thing to note is that this log is a text file dropped in the C:\windows\temp folder on the management server. Actually, it’s not one file but (I think) five of them. Step number one is to clear that directory before you do this. Once you run that command file, you can do a deployment of the Unix/Linux agent, see the verbose log of what the management server is attempting, and hopefully better determine where it fails.
Manual Deployment of Agent:
Just in case you want to deploy the agent manually… This is how you do it.
This is not a universal list of things that can go wrong, but these articles do provide a nice starting point for troubleshooting deployment. I’ve already referenced manual deployment. This article is a nice reference to all the error codes and what is most likely wrong. It should give you an idea where to look. Marnix Wolf has done something similar here. There is some overlap between what I’ve referenced and his own work, and he has some items that I haven’t referenced, so it’s worth noting.
Other Miscellaneous Issues:
Shell type matters. Our agent only supports certain shell types.
FIPS: If you use it, note that as of this post (Sept 2016) there are some known issues with the SCOM agent and FIPS in certain scenarios. There will likely be an update at some point.
Software compatibility can be an issue too. I’m sure there’s more than one example, but Axway/Tumbleweed is a 3rd party security add-on that, if installed on a management server, will cause Unix/Linux installs to fail. I have run into this in a couple of high-security environments. If you want to use an internal CA, follow these instructions.
Another one I ran into prior to working for Microsoft was a 3rd party firewall dropping reset packets. As a part of the deployment, SCOM does a telnet on port 1270 to the Unix/Linux server it is attempting to manage. If it receives a reset (RST) packet, it assumes that there is no agent on the server and will attempt to deploy the agent. If it does not receive the reset, the management server assumes that you have installed the agent manually and will attempt to sign the certificate. The problem in this case was our firewall blocking those reset packets: certain types of Denial of Service attacks use RST packets, so our firewall was filtering them. I’m not a network guy, so I don’t know if this is smart practice or not, but this was a high-security environment where additional security protocols were in play. The telltale sign is that when you run the wizard, you’ll see the option to “Sign Certificate and Manage” instead of “Install Agent” on a machine that has no agent. This is the thread I used to document it. The original poster never really responded, but my environment had the same issue as described there, and with the help of CSS, we were able to reproduce it and fix it.
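To see which of these cases you are in, you can probe port 1270 yourself and look at how the connection fails. This is a minimal Python sketch of that idea, not what SCOM itself runs. The function name and the result labels are mine, and it assumes you run it from the management server (or a machine behind the same firewall).

```python
import socket


def probe_agent_port(host, port=1270, timeout=5.0):
    """Classify how a TCP connection attempt to the agent port ends."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"          # something is listening on the port
    except ConnectionRefusedError:
        return "refused"           # a reset (RST) came back: nothing is listening
    except socket.timeout:
        return "no-response"       # nothing came back at all within the timeout
    except OSError:
        return "error"             # anything else, e.g. name resolution failure


# How each result lines up with the behavior described above (my reading).
INTERPRETATION = {
    "open": "An agent (or something else) is listening on the port.",
    "refused": "RST received; the wizard should offer to install the agent.",
    "no-response": "No RST came back; a firewall may be dropping packets.",
    "error": "Could not attempt the connection; check the name and network.",
}
```

If a machine with no agent returns "no-response" rather than "refused", that is consistent with something in the path filtering the reset packets, which is the situation that produced the misleading “Sign Certificate and Manage” option for me.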
Last but not least, I ran into a familiar error this week at another secure environment: Failed during SSH discovery (Exit code: -1073479118). The online documentation assumed it was an issue with SSH, and in that it was correct, though there wasn’t anything wrong with SSH itself. We could SSH directly with the monitoring user account, but as we eventually figured out, we couldn’t do so from the management server. The server in question had a host-based firewall that only allowed SSH from certain endpoints, and ours wasn’t one of them. iptables could also (in theory) come into play, as it can be configured to allow or block specific ports.