Romeo and Juliette and Windows Azure

1. Juliette sends a message "I'll take a drug which makes me look dead but I'm not really"
2. Romeo receives the message
3. Romeo finds Juliette looking dead, but knows she's not really dead
4. They live happily ever after

vs.

1. Juliette sends a message "I'll take a drug which makes me look dead but I'm not really"
[the message is lost in a plague-related network outage]
2. Romeo never receives the message
3. Romeo finds Juliette looking dead, thinks she's dead, and kills himself
4. Juliette wakes to find Romeo dead and kills herself too.

Cloud computing is about distributed protocols with message-passing usually in XML. As VB developers, you'll be the ones responsible for saving Romeo and Juliette. How? The quick answer is "make your operations idempotent". For the long answer, read on...

Every distributed protocol has a window of vulnerability, where one party can't be sure that the other party has received the message. This is a fundamental law of computer science. I've been learning about Windows Azure. As a distributed architect, my first question is always going to be: "where are the windows of vulnerability?"

The Azure infrastructure

 Azure Roles

This diagram shows the basic Azure infrastructure. Everything is done by message-passing, and all parties are potentially distributed. When an end-user makes an HTTP request, it arrives at an instance of the WebRole, written in ASP. The WebRole might make a synchronous request to fetch from a Table in the StorageService to help construct the page it gives back. I'm think I'm going to write my WebRoles entirely in VB, and have them return just XML literals. The WebRole might also synchronously deposit messages to the Queue in the StorageService in case further work needs to be done, e.g. generating thumbnails or submitting an expense claim. The message can be in any format.

Messages in the queue will be picked up by an instance of the WorkerRole. The WorkerRole is a DLL written by you in VB or C# or other .NET languages. Azure loads this DLL in a VM and invokes its StartReceiving method. You implement this method, usually by creating a thread which (in a loop) synchronously gets messages from the Queue and handles them appropriately, maybe by calling third-party web-services, or maybe by fetching data from blobs and writing them to other blobs.

The Azure framework might create multiple instances of WebRole or of WorkerRole to cope with load, each instance indistinguishable. It might also "freeze" a role and unfreeze it again on a new machine. It might also terminate a Role, by calling its Stop method. A WorkerRole can create and use temporary local storage on the machine it's running on, using the ILocalResource interface, but this local storage is discarded each time the WorkerRole is stopped or frozen.

NOTE: the current Azure community preview runs all of these things in just a single datacenter. The chance of message-loss between them is therefore vanishingly small. Things will only start to get interesting once Azure becomes geo-located. At that time, presumably you'll be able to chose in which datacenter each part of your service runs.

Each of the interactions sketched out above has a window of vulnerability. I'm going to focus just on the WorkerRole's message-loop. In the samples the message-loop is written like this:

1. WorkerRole sends a GET request to the StorageService queue.
[Message A: GET message in transit from WorkerRole to Queue]
2. Queue receives the GET request
3. Queue figures which message to return, and marks it "invisible" for the next 120 seconds.
4. Queue sends back a "200 OK" status response along with the content of that message
[Message B: OK message in transit from Queue to WorkerRole]
5.Worker receives the "200 OK" message
6. Worker processes the message
7. Worker sends a DELETE to the StorageService queue asking it to delete the message
[Message C: DELETE message in transit from WorkerRole to Queue]
8. Queue receives the DELETE request
9. Queue deletes the message
10. Queue sends back a "200 OK" status response
[Message D: OK message in transit from Queue to WorkerRole]
11. Worker receives the response.

Correctness

Now we have to do an exhaustive case analysis of every case where one of the four messages is lost, and also an exhaustive case analysis of every case where one of the two machines crashes in between steps. I'm going to focus on the case where the third message, Message "C", is lost in transit.

If message "C" is lost, then (step 8) the Queue will never receive a DELETE request, and (step 11) the Worker will never receive an "OK". After a delay the worker will time out waiting for the OK. At this point it could decide to retry, but the damage has already been done. (The decision of whether to retry is governed by a user-settable "RetryPolicy" property which defaults to RetryPolicies.NoRetry. The timout is governed by a user-settable "Timeout" property which defaults to 30 seconds).

Probably by now the Queue's timeout of 120 seconds will have expired, and the Queue will mark the message in question as "Visible" again. Now this instance of the WorkerRole, or another instance, will pick up the work item again and process it. The end result is that the message will have been processed TWICE.

For sake of correctness, you have to structure your program so that whatever work you do in your WorkerRole (step 6), it doesn't matter if that work accidentally happens twice. Such work is called idempotent.

Some easy operations are already idempotent, e.g. generating a thumbnail of an image. It doesn't matter if the same thumbnail is generated twice and stored in a blob.

Other operations are harder to make idempotent, e.g. adding "+1" to a number stored in a blob. The storage APIs provide atomicity through HTTP1.1 conditional headers, e.g. "Put this blob only if it has not been modified since <datestamp>", or "Put this blob only if its contents match <etag>". This is similar to Interlocked.CompareExchange in .NET. Out of this atomicity you will have to build idempotency.

Performance

For performance tuning, what values should you use for the two timeouts? And what retry strategy should you use? At this stage I have no idea. I have an idea that the WorkerRole would store metrics on its historical performance in its local storage, via ILocalResource.

I'm especially eager to experiment with Azure programming using VB's XML literals to send data and construct web-pages. I have to stress that I'm just at the start of learning about Azure. I might have misunderstood parts of Azure, and parts of Azure might still change. If you have any corrections, questions or comments, then please post!