High Availability for IaaS, PaaS and SaaS in the Cloud

Outages, natural disasters and unforeseen events have proved that even in a distributed architecture, you need to plan for High Availability (HA). In this entry I'll explain a few considerations for HA within Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). In a separate post I'll talk more about Disaster Recovery (DR), since each paradigm has a different way to handle that.

Planning for HA in IaaS

IaaS involves Virtual Machines - so in effect, an HA strategy here takes on many of the same characteristics as it would on-premises. The primary difference is that the vendor controls the hardware, so you need to verify what they do for things like local redundancy and so on from the hardware perspective.

As far as what you can control and plan for, the primary factors fall into three areas: multiple instances, geographical dispersion and task-switching.

In almost every cloud vendor I've studied, to ensure your application will be protected by any level of HA, you need to have at least two of the Instances (VM's) running. This makes sense, but you might assume that the vendor just takes care of that for you - they don't. If a single VM goes down (for whatever reason) then the access to it is lost. Depending on multiple factors, you might be able to recover the data, but you should assume that you can't. You should keep a sync to another location (perhaps the vendor's storage system in another geographic datacenter or to a local location) to ensure you can continue to serve your clients.

You'll also need to host the same VM's in another geographical location. Everything from a vendor outage to a network path problem could prevent your users from reaching the system, so you need to have multiple locations to handle this.

This means that you'll have to figure out how to manage state between the geo's. If the system goes down in the middle of a transaction, you need to figure out what part of the process the system was in, and then re-create or transfer that state to the second set of systems. If you didn't write the software yourself, this is non-trivial.

You'll also need a manual or automatic process to detect the failure and re-route the traffic to your secondary location. You could flip a DNS entry (if your application can tolerate that) or invoke another process to alias the first system to the second, such as load-balancing and so on. There are many options, but all of them involve coding the state into the application layer. If you've simply moved a state-ful application to VM's, you may not be able to easily implement an HA solution.

Planning for HA in PaaS

Implementing HA in PaaS is a bit simpler, since it's built on the concept of stateless applications deployment. Once again, you need at least two copies of each element in the solution (web roles, worker roles, etc.) to remain available in a single datacenter. Also, you need to deploy the application again in a separate geo, but the advantage here is that you could work out a "shared storage" model such that state is auto-balanced across the world. In fact, you don't have to maintain a "DR" site, the alternate location can be live and serving clients, and only take on extra load if the other site is not available. In Windows Azure, you can use the Traffic Manager service top route the requests as a type of auto balancer.

Even with these benefits, I recommend a second backup of storage in another geographic location. Storage is inexpensive; and that second copy can be used for not only HA but DR.

Planning for HA in SaaS

In Software-as-a-Service (such as Office 365, or Hadoop in Windows Azure) You have far less control over the HA solution, although you still maintain the responsibility to ensure you have it. Since each SaaS is different, check with the vendor on the solution for HA - and make sure you understand what they do and what you are responsible for. They may have no HA for that solution, or pin it to a particular geo, or perhaps they have a massive HA built in with automatic load balancing (which is often the case).


All of these options (with the exception of SaaS) involve higher costs for the design. Do not sacrifice reliability for cost - that will always cost you more in the end. Build in the redundancy and HA at the very outset of the project - if you try to tack it on later in the process the business will push back and potentially not implement HA.

References: http://www.bing.com/search?q=windows+azure+High+Availability  (each type of implementation is different, so I'm routing you to a search on the topic - look for the "Patterns and Practices" results for the area in Azure you're interested in)