Capping CPU using Resource Governor – The Concurrency Mathematics
Here is what you need to know: A = πr2
Okay, not really as that is the formula for the area of a circle but it does set the stage for this discussion. I have been working with the CAP_CPU_PERCENT (RESOURCE POOL) setting as it relates to concurrency. This turned into a mathematical exercise I was not planning on.
You all have had that one user who keeps running their ‘special report’, they just have to have, in the middle of the day. No matter how many times you have asked them to stop doing this until after hours they continue to ignore you. So one day you decide that you will put them in their own resource pool and workload group and cap their CPU to a small percentage and MAX DOP = 1. This way they quit impacting the overall production server. Shortly after you do this you discover you may have made matters worse. How could that have happened?
Simply put, capping the CPU means you forced a wait and when you force a wait where shared state is involved you impact all entities attempting to use the shared resource. I created a pathological example but it shows the behavior very well.
User 1: Default Pool 100% CPU and Full DOP
User 2: Limited to 1% CPU CAP and MAX DOP = 1
I have them both do a select with UPDLOCK on the same table. I created an execution pattern of User 1, User 2, User 1, User 2, … acquiring and releasing the lock to each other.
select count(*) from dbRG..tblRG with (UPDLOCK) -- Only 1 row in the table
I then used OStress to run the batch in a tight loop in various configurations. The diagram below shows the basics of the CPU activity.
1. User 1 has ownership of the lock and owns the scheduler, using the CPU while User 2 is blocked waiting on the lock. Standard blocking scenario where User 2 is not taking any CPU, it is just placed on a waiter list of the lock.
2. User 1 completes the transaction and releases the lock. During release User 2 is granted ownership and placed on the runnable list of the scheduler. User 1 then yields to the scheduler because it has used its quantum. Placing User 1 at the tail of the runnable queue.
3. User 2 yields and User 1 attempts to access the lock. If User 2 has cleared the lock the lock can be granted. If User 2 still owns the lock User 1 will be blocked and added to the lock’s wait list.
The ‘Delay’ is the interesting part of this scenario. When User 1 grants User 2 ownership of the lock, User 2 becomes the next runnable worker on the scheduler. However, the resource pool for User 2 is set to cap the CPU percentage. This means SQL Server will delay User 2’s execution to keep it at the configured cap. Even if during the delay User 1 is allowed to execute User 1 simply becomes blocked on the shared resource and does not make meaningful forward process.
What just happened is that by limiting the CPU cap the shared resource (lock in this example for the same row) it results in limiting the overall resource acquire and release frequency.
Now let’s get to some of the math behind this I promised you. It won’t be all that difficult, you’ll see, I will use some nice, round numbers.
Assume each transaction takes 1ms or 1000 transactions per second if utilizing the CPU at 100%. If you cap the CPU at 10% the math is 1000 * 0.10 = 100 transactions/sec. Meaning the user in the 10% CPU CAPPED pool should only be able to execute the transaction 100 times to the 1000 times the uncapped user can execute.
When I combine the two users the 10% CAP introduces the delay and causes the conflict, lowering the transaction rate near the 100 mark for both users combined.
Here are some actual numbers from my laptop running the pathological, tight loop on the same row.
|Transactions/Sec||User Stress Connections|
|192||1 – 1% CPU CAPPED User|
|435||1 – 100% CPU User|
|240||1 – 1% CPU CAPPED User 1 – 100% CPU User|
|920||2 – 100% CPU Users|
|1125||1 – 1% CPU CAPPED User 2 – 100% CPU Users|
Most of the time you won’t even notice the impact I am describing in this post. As you can see the 1125 transactions/sec level is achieved by 2 – 100% users and 1 – 1% user. Back to the math the 1% user is only 1/3 of the workers on the scheduler. Each of the 2 – 100% users get full quantum so the more workers the more this scenario becomes a standard blocking issue as if you had a slow client, for example. You just have to figure out how to reduce the shared state interaction behavior(s) and things will run smoothly, as you expect them to.
I was looking at this in more detail and I noticed I was not accumulating large numbers for LCK_M_U waits, wait time or signal time. What I found is that the execution quantum is generally such that the lock can be acquired and released before the next user is allowed to execute. In my attempt to create the pathological case I tuned away some of the blocking aspects that would make it worse. If I add more rows to my table I can get into the ultimate pathological I go, you go, I go, you go, … scenario.
Instead it was enlightening that the delay that was necessary to control the 1% user introduces overall delay at the CPU resource. The CPU resource became my shared entity and when the 1% user exceeded its CPU target the scheduler may need to force the delay and in doing so other users on the same scheduler can become impacted. The more users I added to the testing the less, forced delay required.
While I believe this will be an edge case scenario and unlikely that you will encounter this I wanted to share this so you could put it in your toolbox. We are always looking for the connection that is holding a resource (usually a lock) and not responding on the client fast enough. Instead, you could introduce some of the delay attempting to isolate a user like this.
Bob Dorr - Principal SQL Server Escalation Engineer