We are currently running the WAF with a minimum of 10 instances. That is far more than we need, but we set it that high in an attempt to avoid this issue, and it keeps happening anyway.
Everything runs fine, and then suddenly one of the 10 WAF instances we normally run goes to 100% CPU while none of the others exceed 25%. We cannot see what the individual WAF instances are doing in the Azure portal. Microsoft looked deeper into the issue and told us this is what was happening, but they do not know why.
This makes our entire website very slow, and it stays slow until we manually add or remove a WAF instance, which seems to reset something. A Rackspace tech told us about this trick to make the problem stop.
Rackspace has engaged Microsoft on this issue multiple times, but no one can tell us why it happens or what can be done to prevent it. Honestly, it sounds to me like a WAF software issue.
We have redeployed the WAF several times in an attempt to land on better hardware, but sooner or later the same issue comes back. Each time, our customers scream for several minutes until we notice the problem and add or remove a WAF instance.
This happened again today. We really need a solid answer and a solution to this problem. We are a telemedicine company, so when this causes issues for us, it is much more serious than it would be for an average website.