Cloud Elasticity - A Real-World Example
Windows Azure is the cloud-based development platform that lets you build and run applications in the cloud, launch them in minutes instead of months, and code in multiple languages or technologies, including .NET, PHP, and Java.
One of the promises of Windows Azure, and of cloud computing in general, is elasticity: the ability to quickly and easily expand and contract computing resources based on demand. A common example is that of retailers during the holiday season. There is a huge spike of activity beginning on “Black Friday” and, traditionally, a lot of money is spent preparing for the peak load.
Social eXperience Platform (SXP) is a multi-tenant web service that powers community and conversations for many sites on microsoft.com. A great example of one of our tenants is the Cloud Power web site. SXP is responsible for providing the conversation content on the site while the remainder of the site is served from the standard microsoft.com infrastructure. When the Cloud Power site sees an increase in traffic, SXP also sees an increase in traffic, which is exactly what happened in April.
In this case, the web traffic spikes were due to ads, which typically ran for a day or two. Compared to March’s average daily traffic (represented by the 100% bar), SXP’s traffic spiked to over 700%. Here’s a graph that shows 72 hours of traffic while an ad campaign was active. The blue area is normal traffic and the red area is the additional traffic generated by the ads. Since the ad was targeted to the US, the traffic shows a heavy US business-hour bias during both the soft launch and the full run.
There are lots of examples of traffic spikes like this taking web sites down or making them respond so slowly that they appear to be down. Traditionally, the only way to handle such spikes was to over-purchase capacity. The majority of the time, the capacity isn't needed, so it sits mostly idle, consuming electricity and generating cooling costs. Windows Azure has a better approach.
The engineering team had a discussion with our business partners and learned that due to a couple of ad buys, SXP would see increased traffic on several days in April. In advance of the ads running, we worked with the Microsoft.com operations team and decided to double our Windows Azure compute capacity to ensure that we could handle the load.
This is where Windows Azure really made things easy. We went from 3 servers to 6 servers on our web tier within an hour of making the decision. The total human time to accomplish this was a couple of minutes. Our Ops lead changed one value in an XML file and Windows Azure took care of the rest. Within half an hour, we validated via the logs that we had doubled our capacity and all web servers were taking traffic.
We didn’t have to allocate servers or VMs. We didn’t have to install and patch an OS. We didn’t have to install our application. We didn’t have to run penetration tests. We didn’t have to reconfigure the firewall or the load balancer. All we did was modify a value in an XML file. Since Windows Azure provides a REST API as well as PowerShell scripts, automating this process is straightforward.
We monitored SXP closely during our first spike and determined that we didn’t need the additional capacity, so we turned the additional instances off. Again, modifying the XML file took a couple of minutes, and the change took about half an hour to take effect.
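To give a feel for how small that change is, here is a minimal sketch in Python. It assumes the value in question is the instance count in a Windows Azure service configuration (.cscfg) file; the post does not show the actual file, so the service name, role name, sample XML, and the set_instance_count helper are all hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Windows Azure service configuration (.cscfg) files.
NS = "http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration"

# Hypothetical configuration: the service and role names are made up.
SAMPLE_CSCFG = f"""<ServiceConfiguration serviceName="ExampleService" xmlns="{NS}">
  <Role name="WebRole">
    <Instances count="3" />
  </Role>
</ServiceConfiguration>"""


def set_instance_count(cscfg_xml: str, role_name: str, count: int) -> str:
    """Return a copy of the configuration with the role's instance count changed."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # Keep the default namespace unprefixed when serializing.
    ET.register_namespace("", NS)
    root = ET.fromstring(cscfg_xml)
    for role in root.findall(f"{{{NS}}}Role"):
        if role.get("name") == role_name:
            instances = role.find(f"{{{NS}}}Instances")
            if instances is None:
                raise ValueError(f"role {role_name!r} has no Instances element")
            instances.set("count", str(count))
            return ET.tostring(root, encoding="unicode")
    raise ValueError(f"role {role_name!r} not found")


# Double the web tier from 3 to 6 instances, as in the story above.
updated = set_instance_count(SAMPLE_CSCFG, "WebRole", 6)
print(updated)
```

The sketch only edits the XML text. Applying the new configuration to a running deployment is a separate step (through the portal, the REST API, or scripts) and is not shown here.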
Our total, full-retail cost for the burst capacity was $70 plus about 5 minutes of operations time.
Compared to the traditional models, Windows Azure made the process very fast and very simple. Windows Azure compute has a minimum time commitment of one hour, so complex systems can burst their capacity multiple times per day if needed. This is very different from traditional charge-back models for servers or VMs, where you are often billed for a month or longer even though you don’t need the capacity.
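The difference between the two billing models can be made concrete with a little arithmetic. The sketch below is illustrative only: the hourly rate, the 48-hour burst, and the 720-hour month are placeholder assumptions, not actual Windows Azure prices and not the figures behind the $70 mentioned above. Amounts are kept in whole cents to avoid floating-point rounding.

```python
import math


def hourly_billed_cost_cents(extra_instances, hours_used, rate_cents_per_hour):
    """Cost when capacity is billed per hour, with a one-hour minimum."""
    billable_hours = max(1, math.ceil(hours_used))
    return extra_instances * billable_hours * rate_cents_per_hour


def monthly_billed_cost_cents(extra_instances, months, rate_cents_per_hour,
                              hours_per_month=720):
    """Cost when the same capacity must be committed for whole months."""
    return extra_instances * months * hours_per_month * rate_cents_per_hour


# Placeholder numbers: 3 extra instances, a 48-hour burst, 10 cents per hour.
burst = hourly_billed_cost_cents(3, 48, 10)
committed = monthly_billed_cost_cents(3, 1, 10)
print(f"hourly billing:  ${burst / 100:.2f}")
print(f"monthly billing: ${committed / 100:.2f}")
```

Under these assumed numbers, the hourly model charges only for the two days the extra instances ran, while the monthly model charges for the full month regardless of how briefly the capacity was needed.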
If you have the need for elasticity, Windows Azure offers a compelling solution at a very reasonable price point.