Systems Management

Measure Twice, Roll Out Once With The SMS Capacity Planner

Craig Morris

 

At a Glance:

  • Customer requirements and capacity planning
  • Hypothetical examples
  • Using SMS Capacity Planner to meet service level agreements

You know the story, or one like it. You’ve just deployed Systems Management Server (SMS) and now your CIO wants you to upgrade Microsoft Office on all your company’s computers before the weekend. You might wonder just what impact this will have on your network. Or you might be wondering how to create an SMS design that provides the service your business managers require to meet this service level agreement (SLA) when you have so many small branch offices connected over slow network links.

The SMS 2003 Capacity Planner is an infrastructure design tool that lets you try various scenarios so you can plan in advance and better evaluate your SMS design options. (It is available for download as the SMS 2003 Capacity Planner.) More than just a tool for designing SMS hierarchies, the Capacity Planner also lets you simulate any number of pre- and post-deployment scenarios to illustrate the network impact of each. To demonstrate how customer requirements and best practices can be used to design a custom SMS environment for any situation, I’ll walk you through two key scenarios: a branch office design planning exercise and an exercise in complying with company SLAs for software distribution (including patch deployment and a Microsoft® Office deployment).

Let’s begin with a little history. The Capacity Planner came about as the SMS product group looked at allowing the customer to decide how to design and deploy an SMS environment themselves. What information would we need from them? And what information would we need to provide to enable this process to occur while taking advantage of known best practices?

To answer these two simple questions, the SMS team developed the SMS Capacity Planner spreadsheet. This takes the information about a company’s locations, number of machines, bandwidth connections, business requirements (SLAs and such), and administrative personnel staffing, and combines this with SMS traffic and utilization features to provide an outline of what effects the various SMS design options would have on the company’s network infrastructure. So let’s proceed to a hypothetical case.

The Hypothetical Company

Company X has 8,500 employees, each of whom has a single computer on their desktop. The company has a headquarters (called HQ) with 3,000 employees and two regional offices (RegA and RegB) with 500 and 1,500 employees, respectively. The rest of the employees are scattered among the 50 branch offices, 30 of which have 100 employees and 20 of which have 25 employees. The network infrastructure of Company X consists of T3 (45 Mbps) connections between the headquarters and the regional offices. The regional offices have connections to the branch offices: 64Kbps to the smaller offices and 128Kbps to the larger offices.
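As a quick sanity check, the headcount above can be tallied site by site. The sketch below uses only the figures in this section; the dictionary layout and function names are illustrative and are not anything the Capacity Planner itself exposes.

```python
# Company X locations: (number of sites, employees per site).
# Figures come from the description above; the structure is illustrative only.
SITES = {
    "HQ": (1, 3000),
    "RegA": (1, 500),
    "RegB": (1, 1500),
    "large branch": (30, 100),
    "small branch": (20, 25),
}

def total_machines(sites):
    """One computer per employee, so machines equal employees."""
    return sum(count * size for count, size in sites.values())

def branch_machines(sites):
    """Machines that sit behind the slow branch office links."""
    return sum(count * size for name, (count, size) in sites.items()
               if name.endswith("branch"))

print(total_machines(SITES))   # 8500
print(branch_machines(SITES))  # 3500
```

Roughly 40 percent of the machines sit behind 64Kbps or 128Kbps links, which is why the branch office design gets so much attention in the rest of this article.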

The company will have three SMS administrators, one at HQ and one each at RegA and RegB. Additional staff will be used to create and test software distribution packages. Looking at the Capacity Planner you’ll see the details shown in Figure 1.

Figure 1 Capacity Planner

The Branch Office Scenario

The branch office scenario is probably the most common one facing SMS architects. SMS 2003 introduced a completely new client and infrastructure that made old SMS 2.0 design practices obsolete. In addition, SMS 2003 SP1 lifted some of the limitations related to Distribution Point (DP) locations. As a result, SMS 2003 provides more flexibility, but with that flexibility comes more complexity.

The two key factors in branch office design are typically to minimize both network utilization and the round-trip latency of server-to-client-to-server communications. In other words, distribute software as quickly as possible without consuming too many network resources.

If you run the existing configuration through the Capacity Planner process using all the SMS features at their default settings, you are offered suggestions for the branch offices like those you see in the Original panels in Figure 2. I’ll use LBranchB to illustrate.

Figure 2 Various Distribution Scenarios for LBranchB

As you can see, the Capacity Planner does not allow the larger branch office to have the option "No SMS Site Server" since saturation of the network would occur based on the company’s management requirement that SMS processes not use more than 70 percent of total bandwidth. It is also interesting to note that deploying a DP alone at this location cuts network traffic by over 68 percent compared to having no SMS server. This also offers better network utilization than a setup that locates a secondary site with a DP. Here you’re saving the overhead inherently associated with site-to-site communication traffic.

From these results you can also see that it’s tough to decide between a standalone DP (option 2), a secondary site with a DP (option 3), a secondary site with a DP and a Proxy Management Point (Proxy MP) (option 4), and a secondary site with a DP, a Proxy MP, and a replicated SQL database (option 5). In this case you are likely to come down to a cost/benefit decision, not a technical one. (Note this traffic is based on averages, which, as you’ll see in the software distribution scenario, is not always the best method.)

Now let’s take a moment to look at the pros and cons of each solution. First, with the release of SMS 2003 SP1 came the ability to support up to 100 DPs per site server due to better multithreading within the site server distribution manager replication process. The disadvantage of DPs is that packages are replicated uncompressed. Utilizing secondary sites with DPs allows packages to be replicated in a compressed fashion via the site server sender. Both these solutions result in all client-initiated transactions going directly to the parent primary site (over the network); in both of these options this amounts to approximately 43 percent of all network traffic.

Deploying a secondary site at this location also comes at the cost of regular communication with the parent server. In the example I’m discussing, this equates to 12.1 percent of total network traffic for options 3, 4, and 5. Selecting option 4, implementing a Proxy MP in conjunction with a secondary site and a DP, will allow client inventory, metering, and status data to be received by the Proxy MP and replicated via the secondary site using sender compression.

The limitation of this solution is that requests for advertisements (policy assignment) and DP locations (DP location requests) must still be retrieved from the primary site database by the Proxy MP. These requests are unique to the machine (and its IP address/Active Directory® site, in the case of DP location requests) and therefore cannot be cached by the Proxy MP (unlike policy body requests, which are unique only to each advertisement). Deploying a replicated SQL database (option 5) reduces this traffic.

However, replication from the primary site results in all data within the required tables being replicated. There are numerous SQL replication options and parameters to control how and when this information is replicated; however, regardless of which replication design is implemented, all data from the associated tables must be replicated (for all resources at the primary site). For an environment where an individual branch location’s client count is tiny compared to the total number of clients assigned to the parent site, this volume can be considered excessive.

A good example of this would be a primary site with 30,000 assigned clients compared to an individual branch office with 100 machines. To identify the tables that require replication, execute the following SQL query against the primary site’s SMS database:

SELECT * FROM replicatedobjects
WHERE sitesystemrole = 'MP'

Data for all 30,000 clients must be replicated. In all secondary site-based options a primary site can support more secondary sites than DPs, approximately 250–500 per site (depending on network performance and availability, as well as primary site server hardware) provided sender threads are changed appropriately.

Other Planning Considerations

There are a number of factors for which the tool cannot account. First, the level of administration a particular solution requires can play a significant role in the decision to use it. The more complex the solution, the more potential it has to be affected by outages. The administration effort required to maintain a DP is certainly going to be less than that needed to monitor a secondary site. Hardware costs can also influence this decision. Hardware is fairly inexpensive now, but when you multiply $100 here or there by 100-200 locations, it starts adding up. Generally speaking, the options are illustrated in the tool in descending order and are based on increasing levels of administration and hardware.

Fault tolerance is another key factor in an SMS hierarchy and one that cannot be accurately captured by a capacity planning tool. As I move down the options in the screenshot, I am also adding more fault tolerance to the location. For example, deploying just a secondary site and a DP at this location will allow clients to execute preexisting advertisements provided they have received the policy and location requests prior to the network outage. If I deploy a secondary site with a DP and a Proxy MP, the clients can also send inventory/metering data, as this will be stored on the secondary site until the link is available. Deploying a replicated SQL Server™ database supports all of the previous activities and also allows the clients to request and execute any existing advertisements that were replicated from the primary site prior to the network outage. For a summary of the best practices gleaned from this example, read the sidebar "Branch Office Summary."

Company SLAs for Software Distribution

Company SLAs can change the way you design and deploy SMS. The more aggressive the SLA, the less latency can exist in data replication and therefore the more network utilization an SMS system will require. For this reason it is important that SLAs are defined early in the planning cycle and in as detailed a manner as possible. This is especially true of software distribution SLAs.

Most companies are likely to have at least two separate SLAs for software distribution—one for regular application distribution and one for emergency/security application distribution. Emergency/security applications are usually less than 10MB in size but are required to be deployed and reported to all company systems within a short period of time (24 to 48 hours). Most successful SMS environments have even more granularity, having one SLA for small or regular applications (less than 20 to 50MB) and one for larger and less frequent applications (greater than 50MB).

Figure 2 shows how usage varies depending upon patch size. This example is for Company X, which has an SLA to deploy a patch of 10MB within 24 hours.

In this case, software distribution network traffic (the swd value in green) increased from 16.6 percent to 25.5 percent and total network traffic increased by 7MB (slightly less than the 10MB since the original scenario already modeled the deployment of a smaller emergency patch/application on a daily basis). This was achieved by changing the default Patch or LOB (line of business) Updates from 1MB to 10MB. This exercise probably won’t affect the solution I’ll select for this environment. However, let’s consider the situation in which I distribute a larger application such as Office.

From the values offered in the Assumptions tab of the tool I will use the 750MB value and change the frequency from the current 182.5 days to one day, reflecting the need to deploy this to all locations within one day (see Figure 3). I chose one day since this is how quickly I want the package replicated to all locations, not necessarily installed by all machines. This results in a network load of 581MB and network utilization for a secondary site/DP/Proxy MP of 42 percent. Note that the DP-only option starts to become considerably more expensive (see Figure 2).
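The 42 percent figure can be cross-checked by hand. The sketch below assumes that the 581MB is a per-day load carried over LBranchB's 128Kbps link and that decimal units apply (1MB = 8,000 kilobits). How the tool computes utilization internally is not stated here, so treat this as a plausibility check rather than a description of the spreadsheet.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_capacity_mb(link_kbps):
    """Megabytes a link can carry in 24 hours (assumes 1 MB = 8,000 kilobits)."""
    return link_kbps * SECONDS_PER_DAY / 8 / 1000

def utilization(load_mb, link_kbps):
    """Fraction of one day's link capacity consumed by load_mb."""
    return load_mb / daily_capacity_mb(link_kbps)

capacity = daily_capacity_mb(128)   # about 1382 MB per day
share = utilization(581, 128)       # about 0.42
print(f"{capacity:.0f} MB/day, {share:.0%} utilized")
print(share <= 0.70)                # stays under the 70 percent ceiling
```

Under these assumptions the daily average stays well below the company's 70 percent ceiling, but an average says nothing about how the load is spread across the day.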

Figure 3 Patch LOB Updates

Let’s analyze these numbers a little more closely. The tool shows that the average network utilization SMS would consume under this scenario over a 24-hour period (for option 4) would equal 42 percent of total available bandwidth. If I apply no restrictions on the sender at the parent site the sender will send the application package to the child sites as fast as possible. Assuming 100 percent available bandwidth, you can calculate the time used to execute the package as:

(Total Package Size) × (Percentage of Software Distribution) / (Bandwidth).

If your package is 581MB and you’re running at 92.5 percent software distribution, the formula gives just under 72 minutes at a transfer rate of 128 kilobytes per second. Over a 128Kbps (kilobit) link, the same formula gives more than nine hours. Either way, this case illustrates that I may want to consider placing limits on the bandwidth the sender consumes. This will result in the application taking longer to distribute to all branch offices, but it may be preferable to consuming all of the available bandwidth for that long.
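The formula is sensitive to whether the link rate is read in kilobits or kilobytes per second. Here is a minimal sketch that evaluates it both ways, assuming 1MB = 1024KB; the function name and its parameters are illustrative.

```python
def transfer_seconds(package_mb, swd_fraction, rate, rate_in_bytes):
    """(package size) x (software distribution share) / (bandwidth).

    rate is in kilobytes per second when rate_in_bytes is True,
    otherwise in kilobits per second. Assumes 1 MB = 1024 KB.
    """
    payload_kb = package_mb * 1024 * swd_fraction
    kb_per_second = rate if rate_in_bytes else rate / 8
    return payload_kb / kb_per_second

as_kilobytes = transfer_seconds(581, 0.925, 128, rate_in_bytes=True)
as_kilobits = transfer_seconds(581, 0.925, 128, rate_in_bytes=False)

print(f"{as_kilobytes / 60:.1f} minutes")  # reading the rate as 128 KB/s
print(f"{as_kilobits / 3600:.1f} hours")   # reading the rate as 128 Kbps
```

The first reading lands just under 72 minutes and the second at more than nine hours, so it is worth confirming the units of the bandwidth value before relying on the result.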

There are two options available to achieve this. I can limit the percentage of the network the sender uses. This results in the sender consuming 100 percent of the network for a percentage of time and then backing off. For example, if I set the bandwidth to 50 percent, then the SMS sender will use 100 percent of the network bandwidth 50 percent of the time.

The other option, introduced with SMS 2003 SP1, is to use Pulse Mode. This mode restricts the sender to sending X KB and then waiting Y seconds. The administrator can, for example, configure the sender to send 16KB and then wait 1 second (effectively 16KB per second) or to send 64KB and wait 3 seconds. The net result is that the sender uses only a portion of the available network all the time and over a 24-hour period can send approximately 1.3GB total traffic to each location (based on 128Kbps).
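The effect of a pulse setting can be estimated with a few lines of arithmetic. This sketch ignores the time spent transmitting each block and assumes decimal units (1GB = 1,000,000KB); the function names are illustrative and do not correspond to any SMS setting names.

```python
def pulse_rate_kb_per_s(block_kb, delay_s):
    """Average throughput when sending block_kb, then pausing delay_s seconds.

    Ignores the time needed to transmit the block itself.
    """
    return block_kb / delay_s

def daily_volume_gb(kb_per_s):
    """Data moved in 24 hours at a steady rate (assumes 1 GB = 1,000,000 KB)."""
    return kb_per_s * 24 * 60 * 60 / 1_000_000

slow = pulse_rate_kb_per_s(16, 1)        # 16.0 KB/s, which is 128 kilobits/s
fast = pulse_rate_kb_per_s(64, 3)        # about 21.3 KB/s
print(slow * 8, round(fast, 1))          # 128.0 21.3
print(round(daily_volume_gb(slow), 2))   # 1.38
```

A steady 16KB per second works out to between 1.3GB and 1.4GB per day depending on the unit convention, which is in line with the approximate daily figure quoted above.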

Branch Office Summary

There are a number of best practices in capacity planning that can contribute to your success.

  • Do not deploy more than 100 DPs per site, and use considerably fewer over slow, unreliable networks (slower than 256Kbps) or in high software distribution environments (which will require frequent distributions of numerous large applications).

  • Never deploy more than 500 secondary sites per primary site and generally try to keep this ratio to approximately 250 or 300 to 1. Remember that the primary site has to replicate all software distribution packages to all child sites. This best practice is especially critical over slow or unreliable links.

  • If the company regularly deploys a number of large applications, you will find that a secondary site-based solution will generally prove more efficient than just a DP alone.

  • Secondary site overhead increases in proportion to the number of collections, packages, and advertisements at the parent primary site.

  • A secondary site can support up to 5,000 clients, depending upon connection performance to the parent site, company SLA requirements, and server hardware.

  • A primary site can support 100,000 directly assigned advanced clients (although this number will be considerably less with legacy clients), depending upon connection performance to parent and child servers, company SLA requirements, and server hardware.

  • A central site server can support 200,000 indirectly reporting clients (not utilizing the site server database for advertisement requests), depending upon connection performance to child servers, company SLA requirements, and server hardware.
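These ceilings lend themselves to a simple design check. The sketch below encodes the numbers from the list above; the `PrimarySitePlan` class and its fields are hypothetical, and the real limits vary with link quality, SLAs, and hardware as the bullets note.

```python
from dataclasses import dataclass

# Ceilings quoted in the list above; real limits depend on links, SLAs, and hardware.
MAX_DPS_PER_SITE = 100
MAX_SECONDARIES_PER_PRIMARY = 500
PREFERRED_SECONDARIES_PER_PRIMARY = 300
MAX_CLIENTS_PER_SECONDARY = 5_000
MAX_CLIENTS_PER_PRIMARY = 100_000

@dataclass
class PrimarySitePlan:
    """Hypothetical summary of one primary site in a proposed design."""
    dps: int
    secondary_sites: int
    largest_secondary_clients: int
    direct_clients: int

def check_plan(plan):
    """Return a list of warnings; an empty list means no ceiling is exceeded."""
    warnings = []
    if plan.dps > MAX_DPS_PER_SITE:
        warnings.append("too many DPs for one site")
    if plan.secondary_sites > MAX_SECONDARIES_PER_PRIMARY:
        warnings.append("too many secondary sites")
    elif plan.secondary_sites > PREFERRED_SECONDARIES_PER_PRIMARY:
        warnings.append("secondary site count above the preferred ratio")
    if plan.largest_secondary_clients > MAX_CLIENTS_PER_SECONDARY:
        warnings.append("a secondary site has too many clients")
    if plan.direct_clients > MAX_CLIENTS_PER_PRIMARY:
        warnings.append("too many directly assigned clients")
    return warnings

# Hypothetical design: all 50 Company X branches as secondary sites under one
# primary, with the 5,000 HQ and regional machines assigned directly.
plan = PrimarySitePlan(dps=50, secondary_sites=50,
                       largest_secondary_clients=100, direct_clients=5_000)
print(check_plan(plan))  # []
```

A check like this only flags hard ceilings. It says nothing about whether a design meets an SLA over a slow link.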

One of the great things about the Capacity Planner tool is that it allows you to change column values (see Figure A) and then reevaluate all locations or just one. This allows you to perform what-if analysis by modifying such options as the bandwidth, the number of client machines at a location, and whether to locate an administrator at a particular location (thus enabling an SMS primary site to be located there).

Figure A Columns To Change

The only caution here is that this design will result in one thread being used all the time from the parent primary site to each secondary site, so you may want to configure the sender threads at the primary site to 100 total sender threads and one per secondary site. Note this combination effectively limits the number of secondary sites a primary site can manage within a 24-hour period to 100, so plan appropriately. The parent site connection to the WAN must also support 100 times the rate per second of each of the 100 secondary sites.
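To put a number on that last requirement, multiply the per-site rate by the number of simultaneously active sender threads. The sketch assumes each secondary site is pulsed at 16KB per second and uses decimal units; the function name is illustrative.

```python
def parent_wan_kbps(secondary_sites, per_site_kb_per_s):
    """Kilobits per second the parent needs if every sender thread runs at once."""
    return secondary_sites * per_site_kb_per_s * 8

needed = parent_wan_kbps(100, 16)   # 100 sites, each pulsed at 16 KB/s
print(needed / 1000)                # 12.8 (Mbps)
print(needed / 1000 <= 45)          # True: fits within a 45 Mbps T3
```

Under these assumptions the parent site needs roughly 12.8 Mbps of sustained outbound capacity for software distribution alone.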

SLAs for policy requests (advertisement polling) can also play a part in system design. By default the client checks the Management Points, or MPs, for new advertisements every 60 minutes. Customers often want to change this for a particular set of machines (such as critical servers) and thus will deploy a separate primary site server environment to manage these machines. In Company X, suppose RegB and its associated branch offices house machines that require extremely fast software deployment. If you change the policy polling interval to 15 minutes and reanalyze the branch office scenario using 100 machines, the benefits of replicated SQL become apparent (see Figure 2). Be careful with this: an MP can support 25,000 clients polling every 60 minutes, so if you change the polling interval to 15 minutes the support statement effectively changes to 25,000 / 4, or 6,250 clients per MP.

Network traffic without replicated SQL is 98MB per day versus 62MB per day with replicated SQL. If each branch office contained 500 machines this delta would be even larger. Without replicated SQL, network traffic would be 446MB versus 198MB with replicated SQL. Placing a primary site at each of these branch offices (in the 500 machine scenario) reduces the network traffic even further, to 105MB.
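Expressed as percentages, the savings are easier to compare. This sketch simply restates the daily traffic figures quoted above; the scenario labels are mine.

```python
def reduction(before_mb, after_mb):
    """Fractional drop in daily traffic going from before_mb to after_mb."""
    return (before_mb - after_mb) / before_mb

scenarios = {
    "100 machines, replicated SQL": (98, 62),
    "500 machines, replicated SQL": (446, 198),
    "500 machines, local primary site": (446, 105),
}
for name, (before, after) in scenarios.items():
    print(f"{name}: {reduction(before, after):.0%} less traffic")
```

The three cases work out to about 37, 56, and 76 percent less traffic, respectively.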

More to Consider

In the examples given here, if the customer has a site with 30,000 machines, changing the polling cycle to 15 minutes would result in a need to change the site configuration from requiring 2 MPs, utilizing a Network Load Balancing (NLB) solution, to requiring 5 MPs. With the increased volume of requests per hour I would also recommend implementing SQL replicas of the MP tables on the MPs (all MPs could share a single replica or each MP could have its own).
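The MP counts follow from scaling the support statement with the polling interval. Here is a minimal sketch, assuming the scaling is linear as the 25,000 / 4 calculation implies; the constant and function names are illustrative.

```python
import math

BASE_CLIENTS_PER_MP = 25_000   # supported at the default 60-minute polling interval
BASE_INTERVAL_MIN = 60

def clients_per_mp(polling_interval_min):
    """Scale the support statement linearly with the polling interval."""
    return BASE_CLIENTS_PER_MP * polling_interval_min / BASE_INTERVAL_MIN

def mps_needed(clients, polling_interval_min):
    """Smallest whole number of MPs that covers the client count."""
    return math.ceil(clients / clients_per_mp(polling_interval_min))

print(clients_per_mp(15))        # 6250.0
print(mps_needed(30_000, 60))    # 2
print(mps_needed(30_000, 15))    # 5
```

These counts cover capacity only. They include no spare MP for fault tolerance, which is a separate decision.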

With regard to fault tolerance, for sites with more than 5,000 or 10,000 direct clients, I suggest at least 2 MPs (in an NLB). This allows one of these to become unavailable and still support the client base. For larger SMS sites, or to create more fault tolerance in your SMS site design, deploy SQL Server on the Management Points and replicate the MP-related tables from the SQL Server database on the SMS primary site. This allows the site server and all but one of the MPs to become unavailable, yet clients can still send inventory/metering/status messages and install existing advertisements (excluding new ones created on the primary site that have not yet been replicated).

Software Distribution Summary

SLAs play an important role in defining a company’s priorities. By discovering priorities early in the planning cycle, you can ensure that all of your customers’ key deliverables are addressed. SLAs should always be analyzed against the whole customer environment. While it’s important to consider specific challenges related to branch offices, it is the entire enterprise that matters most to the customer. For the SLAs discussed here, shorter SLAs for software distribution have the following effects:

  • The shorter the SLA, the closer the administrative site server (the one used to initiate the software distribution) should be to the receiving clients.
  • The shorter the SLA, the fewer clients each DP should serve. DPs can typically support up to 4,000 clients per hour (in optimal network and performance conditions) for emergency/patch applications. For larger packages, though, this limit can be a good deal lower. It’s always best to err on the side of caution.
  • The shorter the SLA, the shorter the policy request polling interval should be. Remember that this affects the MP scalability support statement.
  • The shorter the SLA, the flatter the SMS hierarchy should be. Fewer tiers through which to replicate advertisements and packages result in better SLA performance.
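The DP guideline in the list above can be turned into a rough sizing estimate. The sketch assumes that a DP serving up to 4,000 clients per hour can serve proportionally more over a longer window, which is my reading of the guideline rather than a published rule. It also ignores where the DPs sit relative to slow links.

```python
import math

CLIENTS_PER_DP_PER_HOUR = 4_000   # upper bound quoted for small emergency/patch packages

def dps_needed(clients, window_hours, per_dp_per_hour=CLIENTS_PER_DP_PER_HOUR):
    """DPs required so every client can pull the package inside the window."""
    return math.ceil(clients / (per_dp_per_hour * window_hours))

print(dps_needed(8_500, 1))   # 3
print(dps_needed(8_500, 4))   # 1
```

For larger packages, pass a lower `per_dp_per_hour` value, since the quoted figure applies only to small emergency or patch packages under optimal conditions.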

Remember these and other rules presented in this article and you will be sure to have a better outcome when doing your capacity planning.

TechNet Online Resources

This article offers just a brief look at overall capacity and SLA planning for your management infrastructure. For more in-depth information on infrastructure planning, visit these TechNet Online resources:

Craig Morris is a Program Manager on the Microsoft SMS team, responsible for performance, scalability, and manageability of SMS. He has over 12 years of industry experience architecting Microsoft enterprise solutions, the last 5 with Microsoft. He can be contacted at Craig.Morris@microsoft.com or at his blog.

© 2008 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.