Building Scalable Workflows in Opalis
Building on my first blog post, which covered how we established “reach” in our environment for our Opalis infrastructure, I thought it would be fitting to share some lessons learned from my first workflow development effort. I believe this post is an important one: I have discussed this exact topic more than once, and it resonates with the people I share it with as a critical way of thinking about how workflows should be designed and laid out.
Let’s look at version 1.0 of the Reach workflow.
Looks simple enough, right? We start by creating a status table to store our results. Next, we gather a list of machines from an input file. Then we ping each machine to measure the round trip and see whether it is in fact online, and if it is, we check the status of a service. To wrap up, we store our results in the status table we initialized at the very beginning. This is a very real-world scenario that I have handled in the past with batch files or a VBScript, to determine whether a server or workstation is online to patch or to run some automated task from the command line. We’re just pulling this directly into a workflow to bring it to life a bit more.
Version 1.0 of Reach Workflow (sequential processing)
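To make the sequential pattern concrete, here is a minimal Python sketch of the same steps. Opalis workflows are assembled from activities in a designer rather than written as code, so this is only an analogy. The names `run_sequential`, `ping`, `service_status`, and the `status` table layout are assumptions made for illustration, not anything from Opalis.

```python
import sqlite3
from typing import Callable, Iterable, Optional

# Hypothetical callables standing in for the ping and service-check activities.
# ping returns the round-trip time in ms, or None when the machine is offline.
Ping = Callable[[str], Optional[float]]
ServiceStatus = Callable[[str], str]


def run_sequential(
    machines: Iterable[str],
    ping: Ping,
    service_status: ServiceStatus,
    db: sqlite3.Connection,
) -> int:
    """Check every machine one after another inside a single long-running loop."""
    # Step 1: create the status table that will hold the results.
    db.execute(
        "CREATE TABLE IF NOT EXISTS status "
        "(machine TEXT, online INTEGER, latency_ms REAL, service TEXT)"
    )
    checked = 0
    # Step 2: walk the machine list; nothing else starts until this one finishes.
    for name in machines:
        latency = ping(name)
        # Step 3: only query the service when the machine answered the ping.
        service = service_status(name) if latency is not None else None
        # Step 4: store the result in the status table.
        db.execute(
            "INSERT INTO status VALUES (?, ?, ?, ?)",
            (name, int(latency is not None), latency, service),
        )
        checked += 1
    db.commit()
    return checked
```

Every machine waits for the one before it, so the total run time grows with the length of the list, and all of the work stays inside one long-lived execution.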
So what’s wrong with the above approach?
Absolutely nothing on the surface. In fact, for 30 machines this scales perfectly well, and even 50 machines is not a big deal. However, as you add more and more machines, this workflow falls flat on its face. When I ran it against 250 machines as a test, it started out just fine. Then, as I watched each result insert into the table one by one, I could see the gaps between timestamps getting longer and longer. This workflow (as architected above) ran for approximately 1 hour and 15 minutes, consumed all the memory on the server (~4GB), crashed the client, and was terminated before it completed. Essentially, this configuration kept every process in the workflow inside one long-running execution that did not release its memory until it was complete. Each iteration consumed a little more memory, and consumption inched upward until the server ran out of memory and the process crashed.
Let’s look at version 2.0 of the Reach workflow (redone for scalability)
With Version 2.0 I took a slightly different approach. Rather than running everything from a single workflow, I thought it might be better to trigger subtasks that execute, do what they need to do, and then exit and return to the main workflow. I like to call those triggered subtasks “workers”. The concept is that we fire up the main (long-running) orchestration workflow and then trigger multiple worker workflows that do the job much faster, with little or no impact on memory or CPU (at least for the task I was doing).
Version 2.0 of Reach Workflow (concurrent processing)
So looking at the above, I added a schedule to let it run every 4 hours (instead of just once) so we could establish trends over time. Then we create the table, pull in the machines we want to analyze, and then trigger worker(s).
From the custom start, we gather the computer names. I still do the ping and service status checks and log the results into the status table. One other thing I added was the variance from the last time the machine was analyzed, to see whether the latency is going up or down. Finally, I’m publishing the computer name from the workflow (ironically, that is not necessary; I know that now). In fact, this workflow is completely self-contained, so there is no need to publish anything back to the main workflow in this case. One final thing I did to speed the whole process up was to allow this workflow to run multiple times at once (concurrent parallel executions). For my initial testing I set the limit to 50, up from the default of 1. This approach lets you run multiple concurrent executions to reduce the time it takes to gather your information.
Note: For Opalis, the default maximum number of policies running at one time is 50. I wouldn’t recommend setting your worker's limit to 50 unless you follow the recommendations and process described at http://support.microsoft.com/kb/2102398.
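The orchestrator-and-workers layout can be sketched in Python as follows. This is a loose analogy under stated assumptions, not how Opalis is implemented: `StatusTable`, `worker`, and `run_concurrent` are hypothetical names, an in-memory list stands in for the status table, and a thread pool with `max_workers` stands in for the concurrent-execution limit on the worker policy. Threads in one Python process also do not reproduce the memory behavior I saw on the Action Server; the sketch only shows the fan-out structure and the latency variance.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional


class StatusTable:
    """In-memory stand-in for the status table; a lock keeps writes thread-safe."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last_latency: Dict[str, float] = {}
        self.rows: List[dict] = []

    def record(self, machine: str, latency: Optional[float], service: Optional[str]) -> None:
        with self._lock:
            previous = self._last_latency.get(machine)
            # Variance = change in latency since the last time this machine answered.
            variance = None
            if latency is not None and previous is not None:
                variance = latency - previous
            if latency is not None:
                self._last_latency[machine] = latency
            self.rows.append({
                "machine": machine,
                "online": latency is not None,
                "latency_ms": latency,
                "variance_ms": variance,
                "service": service,
            })


def worker(machine: str, ping: Callable[[str], Optional[float]],
           service_status: Callable[[str], str], table: StatusTable) -> None:
    """Self-contained worker: check one machine, log the result, and exit."""
    latency = ping(machine)
    service = service_status(machine) if latency is not None else None
    table.record(machine, latency, service)  # nothing is returned to the caller


def run_concurrent(machines: Iterable[str], ping: Callable[[str], Optional[float]],
                   service_status: Callable[[str], str], table: StatusTable,
                   max_workers: int = 50) -> None:
    """Main orchestration: trigger one worker per machine, capped at max_workers."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, m, ping, service_status, table) for m in machines]
        for future in futures:
            future.result()  # surface any exception raised inside a worker
```

The orchestrator only hands out machine names. Each worker writes its own result and finishes, which mirrors the point that the worker does not need to publish anything back.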
So what were the results?
The new total execution time for my policy was 4.5 minutes (instead of 1.25 hours), which is more than 16 times faster. Monitoring the Action Server, I saw memory stay happy and constant without ramping up and maxing out. Not even a blip on the radar!
It is better to build out your workflows in a way that provides for scalability rather than simplicity. Simply put, avoid placing everything that you need to do into a single policy. Instead, break your work up into task-oriented policies and trigger them from a main orchestration workflow. This lets you manage memory, performance, and scalability, while the workers serve as function-like executions that fire and return.
Thanks for reading. Happy Automating!