April 2009

Volume 24 Number 04

Lessons Learned - Optimizing A Large Scale Software + Services Application

By Udi Dahan | April 2009

 

This article discusses:

  • WCF and 3-tier service models
  • Dealing with occasional connectivity
  • Threading and usability
  • Synchronization domains
This article uses the following technologies:
WCF, Software + Services

Contents

Always Up To Date
Trouble with 2-Tier
WCF and 3-Tier
Scalability Problems
Trouble in Paradise
Occasional Connectivity Affects Servers
Threading and Usability
Client-side WCF
Object-Graph Dependencies
Synchronization Domains
View-Controller Interaction
Thread-Safe Data Binding
Two-way DataBinding
Client-side Repositories
Performance
Lessons Learned

Designing rich, interactive desktop client software has never been easier. With Visual Studio, Windows Presentation Foundation (WPF), and the Composite Application Guidance for WPF (Prism), developers are better equipped than ever.

Unfortunately, it is not enough.

When working on a large-scale, rich-client software + services application, my team found out the hard way that developing occasionally connected, multithreaded smart clients that don't deadlock or corrupt data through data races is far from trivial. It's not a failing in the toolset, though. We found that these tools need to be employed in very specific patterns to achieve systems that are both responsive and robust.

In this article I highlight some problems we ran into, and explain how we overcame them. Hopefully you can harness this information to avoid our mistakes and make the best use of these tools when building your own software + services applications.

Always Up To Date

One of the first challenges we faced was the need to keep users up to date with what all other users were doing, as well as with events occurring in other systems. This real-time behavior allowed users to act on events, as they happened, across large quantities of data. The previous version of the system let them search, sort, and filter, but users just couldn't act on that information in time.

During the user experience design process, many of the grids in the previous version of the system were replaced by Messenger-like pop-up messages ("toast") that informed the user of important events. Other changes enabled users to set various thresholds on properties of entities that, when exceeded, popped up toast messages. The toast often contained links that, when clicked, opened a form showing the user the affected entity—some with a before and after view.

While thinking through ways to scale the regular request/response patterns that we had experience with, our key stakeholder came in and reminded us: "This ability to surface events from the world to the user in real time is critical for our knowledge workers to support the collaborative, real-time enterprise we're building."

In short, it was clear that we needed some kind of publish/subscribe capability. Trying to support two-second user visibility on events with hundreds of users polling a central database just wasn't going to work (it broke down at around 60 users).

On top of it all, our user population was mobile—some of them disconnecting and reconnecting every few seconds as they moved in and out of Wi-Fi zones, others working offline for an hour or two while telecommuting, and some performing intensive offline work for up to a week.

We had to juggle occasional connectivity, data synchronization, and publish/subscribe all at the same time. We learned that we couldn't solve all these problems on the client side or the server side alone; an integrated approach was needed, since any change on one side required corresponding changes on the other. In this article, I'll describe this process, starting from the server side and working forward to the client.

Trouble with 2-Tier

Our original deployment model was the traditional 2-tier model found in many rich client applications. The difficulty we had with this model was pushing out notifications from the database to the clients.

One option that we tried was to have all interaction with the database done via stored procedures that, after committing their transactions, would call out to a Web service hosted on the database server. It was the responsibility of that Web service to notify clients of what happened by passing them the name of the stored procedure that was invoked, as well as the arguments that were passed to it.

That didn't work out so well. There were two critical problems with this approach—one logical, the other physical.

The logical problem was that a single stored procedure could be invoked by all sorts of paths in the code, meaning different things at different times. When a client received notification of the activation of a stored procedure, it wasn't always clear what that meant or which toast to pop up for the user.

The physical problem was that we needed each client to expose its own Web service, which the Web service on the database server would call. This was quickly vetoed by the security professionals as a gaping security hole, as well as by the regulatory auditors who were (rightly) concerned about users making decisions in real time on data whose origins couldn't necessarily be traced.

WCF and 3-Tier

To solve the logical problems with the 2-tier solution, we realized that we needed to be more explicit about our service contract—stating the name of the stored procedure that ran and its parameters was explicit about what happened, but not why it happened.

As we began designing our Windows Communication Foundation (WCF) service contracts, it became clear that it wasn't enough to just focus on the appropriate naming and scoping of our service methods. We needed to be explicit about the contract of notifications—the way the server called back to the clients. WCF callback contracts enabled us to take the same amount of care on server-to-client messages as the regular service contract. (To read more about WCF callback contracts, see Juval Lowy's article from the October 2006 issue, "What You Need To Know About One-Way Calls, Callbacks, And Events.")

Each command/event pair was modeled as follows:

[ServiceContract(CallbackContract = typeof(ISomethingContractCallback))]
interface ISomethingContract
{
    [OperationContract(IsOneWay = true)]
    void DoSomething();
}

interface ISomethingContractCallback
{
    [OperationContract(IsOneWay = true)]
    void SomethingWasDone();
}

To enable our WCF service to call back to clients other than the one that invoked the current method, we had to host the service in a process separate from that of the client and deploy that process to its own server, in essence moving to a 3-tier deployment.
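
To make the pattern concrete, here is a minimal sketch—the service name and bookkeeping are illustrative, not our production code—of a singleton service that captures each caller's callback channel and uses the saved channels to notify every connected client, not just the caller:

using System.Collections.Generic;
using System.ServiceModel;

[ServiceBehavior(InstanceContextMode = InstanceContextMode.Single)]
class SomethingService : ISomethingContract
{
    // Callback channels of every client that has called in.
    static readonly List<ISomethingContractCallback> subscribers =
        new List<ISomethingContractCallback>();

    public void DoSomething()
    {
        // Remember this client's callback channel.
        ISomethingContractCallback callback = OperationContext.Current
            .GetCallbackChannel<ISomethingContractCallback>();
        lock (subscribers)
            if (!subscribers.Contains(callback))
                subscribers.Add(callback);

        // ...execute the command, then notify all clients, not just the caller.
        lock (subscribers)
            foreach (ISomethingContractCallback subscriber in subscribers)
                subscriber.SomethingWasDone();
    }
}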

While it was suggested we put the service on the same box as the database, our database admins were concerned that the service might take away precious resources, thereby hurting the performance of the database. The added benefit of having a separate tier for our stateless WCF service was that we could easily scale it out to more machines without requiring any tinkering with the database server.

Unfortunately, it wasn't as easy as we thought.

Scalability Problems

We ran into two kinds of scalability problems—logical and physical, as before.

The logical scalability problem manifested itself as the system gained more functionality. When we first started development, we had only a single server contract containing all methods on it, and a single callback contract with all the corresponding methods on it. We quickly ran into the brick wall of too many developer chefs stirring the same pot. After moving each command/event pair to its own service/callback contract pair, that problem was solved. Yet the solution introduced another, more insidious problem.

When we decided to have multiple service/callback contract pairs, we thought the main issue would be an unmanageable number of contracts. That actually didn't turn out to be much of a problem—the modularity between business domains in the app mitigated it very well. The problem was the "pair."

One command could result in multiple kinds of notifications, and one kind of notification could be caused by many types of commands. The clean one-to-one mapping we had before was overly simplistic and needed to evolve into a many-to-many mapping. Not only were we scaling the number of commands and notifications in the system, we needed to scale the number of relationships between them. It looked like the one-to-one representation supported by WCF callback contracts wasn't going to be able to take us forward.

We ran into a physical scalability problem when scaling out our WCF service tier. We needed some way to share the list of client callback contract subscribers. As we worked through the guidance provided in Juval Lowy's October 2006 MSDN Magazine article, we created the relevant infrastructure pieces, including the pub/sub service. We found that a single pub/sub service wasn't able to handle the volume of notifications it needed to push from all publishers to all subscribers, in essence constituting a bottleneck in publisher-to-subscriber communication. Also, different notifications had different priorities—something that was difficult to bend the infrastructure to support.

After going back and forth on the topic several times, the best structure of pub/sub services we came up with was to have a single piece of pub/sub infrastructure per logical business domain. Since each domain also had sets of commands and notifications clumped around it, it served as a good clean boundary. This partitioning solved many of the priority issues we previously had while automatically creating the first level of pub/sub infrastructure scale out.
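
For illustration only, a per-domain subscription contract might look something like the following sketch—the contract and method names are assumptions, not our actual trading contracts. The point is that subscriptions and notifications cluster around the business domain rather than around individual commands:

using System;
using System.ServiceModel;

// One pub/sub contract per logical business domain—here, trading.
// Many commands can raise these notifications, and each notification
// can be raised by many commands; the many-to-many mapping lives
// behind this boundary rather than in one contract per command.
[ServiceContract(CallbackContract = typeof(ITradingNotifications))]
interface ITradingSubscriptionService
{
    [OperationContract(IsOneWay = true)]
    void Subscribe();

    [OperationContract(IsOneWay = true)]
    void Unsubscribe();
}

interface ITradingNotifications
{
    [OperationContract(IsOneWay = true)]
    void TradeUpdated(Guid tradeId);

    [OperationContract(IsOneWay = true)]
    void ThresholdExceeded(Guid entityId, string propertyName);
}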

Trouble in Paradise

One of the advantages of WCF is its support for multiple bindings built on various technologies like HTTP and TCP. We originally chose to work with WS HTTP since it was the most feature rich, without considering all the implications of that decision.

As we rolled out the system to our users, everything seemed to be going fine, but that feeling was short lived. After about an hour, we started getting calls from our help desk folks that "servers were refusing connections."

We quickly checked and saw that all our servers were up and running, but they didn't seem to be doing much. CPU usage was low, same with memory, IO, and everything else. We checked the database for deadlocks and other nasties, but everything was clear there, too.

So we started up another server process in debug mode and began watching it to see what it did. It didn't take long for the problem to show itself.

Every 10 to 20 seconds, a thread that was publishing an event to a client would hang for about 30 seconds. You can imagine what this did to our scalability. Because we were fairly close to the recommended thread pool size for the number of clients we were supporting, we were running out of threads. At that point, WCF started refusing to accept new connections from clients.

We thought maybe we had a bug in our client app that caused it to block the server.

We went hunting for the specific client machines that caused the problems (using the IP addresses we picked out from the server logs). We talked to the users of those machines to see if they had experienced any odd behaviors in the application at around the time of the server problem. They didn't report anything out of the ordinary. But one of them made an interesting observation: "The app tends to get stuck the way Outlook does, right around the time my Wi-Fi connection cuts out."

Occasional Connectivity Affects Servers

We started collecting logs from several client and server machines and saw the same behavior over and over again. Any time a client went offline, calls to it would block for 30 seconds.

This ran counter to our understanding of the IsOneWay attribute on the OperationContract. All our event notification methods returned void—there was no reason for the server threads to get blocked.

As we drilled deeper into the behavior of WsHttpBinding, we began to understand how it handles one-way operation contracts. When the server goes to notify a client by invoking a one-way method on its proxy, if that proxy was previously connected and has all the underlying channel objects in its cache, it uses them and tries to call the client over HTTP. Even though the call is one-way, the underlying channel waits until an HTTP connection can be established and the HTTP request can be sent; it does not wait for the HTTP response. Unfortunately, if the client is offline for whatever reason, the HTTP connection waits the default HTTP timeout (30 seconds) before giving up.

Realizing that we couldn't fix this behavior, we needed to find an alternative binding that would be robust in the face of clients disconnecting. We found the answer in the Microsoft Message Queuing (MSMQ) bindings—specifically NetMsmqBinding. Luckily, with WCF, swapping out one binding for another isn't a big deal.
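
In code, the swap can be as small as changing the binding passed to the endpoint. The following is a sketch with assumed names and addresses (the queue must already exist, and MSMQ must be installed on both ends):

using System.ServiceModel;

var host = new ServiceHost(typeof(SomethingService));

// Before: calls to offline clients blocked server threads.
// host.AddServiceEndpoint(typeof(ISomethingContract),
//     new WSHttpBinding(), "http://localhost:8000/something");

// After: messages are queued and delivered when the client reconnects,
// so a dead connection never ties up a server thread.
host.AddServiceEndpoint(typeof(ISomethingContract),
    new NetMsmqBinding(NetMsmqSecurityMode.None),
    "net.msmq://localhost/private/something");

host.Open();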

After verifying that our server threads were not getting stuck any more when client connectivity went in and out, we turned our sights back to the client.

Threading and Usability

When designing rich clients that interact with servers in a request/response manner, thread safety had never been much of a concern for us. As the servers we developed gained more features and became more powerful, the time it took them to respond grew as well. This is known to cause usability problems, as the application hangs until the server responds. The common solution is to change the client so that it calls the server asynchronously. At that point, developers discover that when callbacks from the server are handled on a background thread, UI controls tend to throw exceptions at them. Finally, the code is changed so that the server callbacks switch to the foreground thread, and everything works again.

The thing is, when the client receives an endless stream of notifications from multiple servers, those patterns don't serve us as well.

Since every notification arriving at the client was being processed on the foreground thread, and all UI interaction was also done on the foreground thread, these two tasks ended up continuously fighting for control. You could actually see the mouse erratically jump on the screen as you tried to move it. When typing into textboxes, the characters would fill in just as erratically.

I noticed this when spending some time in our load testing lab. I remember looking at the client showing all the various notifications popping up in real time, very pleased with how our architecture was working out. I took the mouse to click on a link in one of the toasts and it kept getting stuck as I moved it. It reminded me of the way wireless mice behave as their batteries die. But the mouse was a regular wired USB mouse.

I winced. We were only two weeks away from a release. All our functional tests were passing, and the load tests were showing that the system could handle everything we threw at it. The only problem was that the system wasn't exactly functional under load—"usability nightmare" would have been a better description.

Client-side WCF

Before going back to the drawing board on our client, we looked for something—anything—that could prevent the imminent rewrite. We pored over guidance in MSDN on how to use WCF in smart client environments (see "Writing Smart Clients by Using Windows Communication Foundation") and wrote more proof of concept code than ever before, but somehow nothing covered all the bases. It seemed we were stuck between the deadlock rock and the unusable hard place.

One strategy that looked promising was based on safe controls that encapsulate the interaction with the Windows Forms synchronization context in such a way that even if they are accessed by a background thread, they marshal the call back onto the UI thread.

Since not every notification that arrives at the client involves updating the UI, we could gain a great deal in usability by performing most client-side notification processing on a background thread. Occasionally, when the notification involved updating the UI, the safe controls would ensure that the actual UI rendering was done on the correct thread. Technologically, it looked like a solid solution.
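
A safe control in this sense is simply a control that re-marshals cross-thread calls itself. A minimal sketch (SafeLabel and SetTextSafely are illustrative names, not part of Windows Forms):

using System;
using System.Windows.Forms;

// A label that can be updated safely from any thread.
class SafeLabel : Label
{
    public void SetTextSafely(string text)
    {
        if (InvokeRequired)
            // Called from a background thread: marshal back to the UI thread.
            BeginInvoke(new Action(() => Text = text));
        else
            Text = text;
    }
}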

After going forward and implementing the needed safe controls and forms, we spent a good amount of time in the test lab performing functional tests while the load tests were running in the background. We had taken all the logging on the client and made it asynchronous as well so that we could keep log levels high without slowing down the UI.

In this financial application, multiple traders would collaborate on a single investment portfolio to achieve various goals in terms of risk, commission, and return—sometimes collaborating on a single trade. When we had several testers simulating this behavior, in one of the runs they had a terribly negative return. While this isn't necessarily surprising in and of itself (let's face it, if a tester could really do the work of a trader, they wouldn't be a tester), the fact that it was so different from the outcomes of the other runs made us take notice. Our previous experiences developing this system taught us that when we witnessed something unusual, it meant that some assumption we made in our programming turned out to be false.

Sifting through the piles of logs, we were looking for the moment in that one run when the financial return took a dip. Surprisingly, that part was all too clear—there was one obvious point where the return went into the red. As we worked our way back and forth through the log entries around that point in time, nothing stood out. Everything looked pretty much like it did when the return was positive.

One of our DBAs, an old-time UNIX hacker, bet me that with some regex (regular expressions), he could find the core difference in an hour. Since we were already 3 hours into it, I quickly conceded and, 45 minutes later, he resurfaced with a grin.

It was a context switch at the worst possible time, and it ended up causing a data race. One thread was setting the trade to (roughly) 90 million yen, and the other was setting it to 1 million US dollars, one property at a time. Unfortunately, the trade ended up in the state of 1 million yen, or roughly $11,000. That explained the dip in the financial return.

Preventing multiple threads from working with the same object at the same time isn't rocket surgery. Each thread just needs to lock the object before working with it (see Figure 1). This required a thorough pass through a lot of the client code to make sure we had locked everything we needed. It then required a fair amount of testing to shake out the deadlocks that resulted from us locking more than we needed.

Figure 1 Locking Objects

//Thread 1
void UserWantsToDoSomethingToTrade(
    Guid tradeId, double value, Currency c)
{
    Trade t = InMemoryStore.Get(tradeId);
    lock (t)
    {
        t.Value = value;
        t.Currency = c;
    }
}

//Thread 2
void ReceivedNotificationAboutTrade(
    Guid tradeId, TradeNotification tn)
{
    Trade t = InMemoryStore.Get(tradeId);
    lock (t)
    {
        t.Value = tn.Value;
        t.Currency = tn.Currency;
    }
}

One of the things we (very reluctantly) gave up on was data binding our in-memory objects to user-editable views. We couldn't have the user locking the object for the duration of the form being open, since that could block a background thread waiting for that object and prevent it from processing all the other notifications.

With a fair amount of trepidation, we put the system through the previous battery of tests, waiting for the inevitable "something else" that we didn't know about that would knock the project off its feet. Again.

As our tester-traders worked the system, it looked like the previous issues had been solved. As deeper and more involved scenarios were run through, more users and more types of users interacted with the same investment portfolio—changing risk profiles, performing what-if projections and historical comparisons. The testers really banged on the system from every direction and it held. Half-disbelieving that this might actually be it, we came back to the business with good news.

Not that they believed us. I mean, our track record wasn't particularly impressive at that point. 120% late delivery tends to erode confidence that way. But they were in dire need of the new system so that Tuesday we rolled into beta.

And on Thursday, we rolled back out.

Object-Graph Dependencies

When the profitability of a single test run is grossly different from that of other test runs, even financial novices like developers and testers take notice. When more delicate investment and trading rules are violated, but the violations don't have an immediate or large-scale impact, novices don't see it. Experts do.

I won't get into the nitty gritty financial rules here, but technically the problem was that our single-object locking strategy was too simplistic for rules involving multiple objects.

It was tricky.

You see, when our calling code locked and updated one object, that object could raise an event that would be handled by multiple other objects. Each one of those objects could choose to update itself and raise events of its own, with this process percolating out affecting many objects. Of course, our calling code couldn't know how far those ripples would go so those other objects would not have been locked. Once again, we had multiple threads updating the same objects without locks.
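
To make the ripple concrete, here's a simplified sketch in the style of Figure 1. The caller dutifully locks the trade, but the portfolio reacting to the trade's event is never locked (the Portfolio type and its handler are illustrative):

// As in Figure 1: the trade itself is locked...
void ReceivedNotificationAboutTrade(Guid tradeId, TradeNotification tn)
{
    Trade t = InMemoryStore.Get(tradeId);
    lock (t)
    {
        t.Value = tn.Value;   // raises t.Updated while we hold t's lock
        t.Currency = tn.Currency;
    }
}

// ...but the portfolio handling t.Updated is not.
class Portfolio
{
    double totalExposure;

    // Subscribed to trade.Updated elsewhere; runs on whichever thread
    // raised the event. Nobody locked `this`, so a UI-thread call into
    // the same portfolio can interleave here and corrupt its state.
    void OnTradeUpdated(Trade t)
    {
        totalExposure += t.Value;
    }
}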

There was talk of giving up on events and other loosely coupled communication mechanisms so that we could get some thread safety, but the thought of having to re-implement all our complex rules (that took us so long to get right) without the strengths of .NET was too much to bear.

And it wasn't just domain objects either. Controller objects could get called back as well, changing their state, too, disabling and enabling toolbar buttons and menus, popping up toast, you name it.

We needed a single global lock, yet memories of an unusable UI made us wary of such a solution. There was also the concern that even if such a lock existed, we'd have to review every single line of code in the system to ensure that the lock was taken and released appropriately—not only for this version of the software, but for every single maintenance release and patch thereafter.

It was a case of damned if you do, damned if you don't.

And that's when we discovered synchronization domains.

Synchronization Domains

A synchronization domain provides automatic, declarative synchronization of thread access to objects. The supporting SynchronizationAttribute class was introduced as part of the infrastructure for .NET remoting. Developers wishing to have access to a class's objects synchronized must have the class inherit from ContextBoundObject and mark it with the SynchronizationAttribute, like so:

[Synchronization]
public class MyController : ContextBoundObject
{
    /// All access to objects of this type will be intercepted
    /// and a check will be performed that no other threads
    /// are currently in this object's synchronization domain.
}

As expected, all this magic comes with some performance overhead, both in object creation and access. Another annoying caveat is that classes involved in a synchronization domain cannot have generic methods or properties, although they can call and otherwise make use of generics from other classes.

At this point in the project, when we'd practically exhausted all other options, we gave it a shot.

We planned to have a single synchronization domain in which all logic would run. This meant controller objects, domain objects, and client-side WCF objects would need to be in the synchronization domain. In fact, the only objects that would have to be outside the synchronization domain were the Windows Forms themselves, their controls, and any other visual GUI elements.

The interesting thing we discovered was that not all objects in the synchronization domain had to inherit from ContextBoundObject or have the SynchronizationAttribute applied to them. Rather, we only needed to do so for objects on the boundaries of the synchronization domain. This meant all our domain objects could remain as before—a big performance boost.
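
As a sketch of what that looked like (types and names are illustrative), only the boundary object carries the attribute; the domain object it guards stays a plain class. Note that, by default, [Synchronization] joins the creator's existing synchronization domain where one exists, which is how our boundary objects could all share a single domain:

using System.Runtime.Remoting.Contexts;

// Boundary object: every call into it is synchronized.
[Synchronization]
public class TradeNotificationHandler : ContextBoundObject
{
    // Plain domain object: no attribute, no ContextBoundObject,
    // and therefore no interception overhead.
    private readonly Trade trade;

    public TradeNotificationHandler(Trade trade) { this.trade = trade; }

    public void OnServerNotification(TradeNotification tn)
    {
        // Safe without an explicit lock: no other thread can be inside
        // this synchronization domain at the same time.
        trade.Value = tn.Value;
        trade.Currency = tn.Currency;
    }
}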

Controller classes required just a bit more care.

View-Controller Interaction

In our use of the Model-View-Controller (MVC) pattern, controllers interacted with views only on the UI thread. This was a different approach than the safe controls previously described, where the controller could call views on the background thread as well.

We also knew that not every notification received from the server required updating the UI, and it was the responsibility of the controller objects to make that decision. As such, controllers would need to handle notifications on a background thread, and if updating the UI was required, would be responsible for switching threads.

One small thing to keep in mind, though, is to always switch threads asynchronously—otherwise you might deadlock your system or cause severe performance problems (speaking from experience).

A bit later we created a controller base class that encapsulated the threading and invoking to keep application code simple (see Figure 2).

Figure 2 Controller Base Class

[Synchronization]
public class MyController : ContextBoundObject
{
    // this method runs on the background thread
    public void HandleServerNotificationCorrectly()
    {
        // RIGHT: switching threads asynchronously
        Invoker.BeginInvoke(new Action(() => CustomerView.Refresh()), null);

        // other code can continue to run here in the background
        // when this method completes, and the thread exits the
        // synchronization domain, the UI thread in the Refresh
        // method will be able to enter this or any other synchronized object.
    }

    // this method runs on the background thread
    public void HandleServerNotificationIncorrectly()
    {
        // WRONG: switching threads synchronously
        Invoker.Invoke(new Action(() => CustomerView.Refresh()), null);

        // code here will NOT be run until Refresh is complete
        // DANGER! If Refresh tries to call into a controller or any other
        // synchronized object, we will have a deadlock.
    }

    // have the main form injected into this property
    public ISynchronizeInvoke Invoker { get; set; }

    // the view we want to refresh on server notification
    public ICustomerView CustomerView { get; set; }
}
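
The snippets that follow call a CallOnUiThread helper on the controller base class. The article doesn't show its implementation, but a minimal sketch consistent with the RIGHT pattern of Figure 2 might look like this (assuming the delegate-returning form used when subscribing to events; a one-off call would simply invoke the returned delegate immediately):

using System;
using System.ComponentModel;
using System.Runtime.Remoting.Contexts;

[Synchronization]
public abstract class BaseController : ContextBoundObject
{
    // injected with the main form, which implements ISynchronizeInvoke
    public ISynchronizeInvoke Invoker { get; set; }

    // Wraps the action in a delegate that, each time it is invoked,
    // marshals the action to the UI thread asynchronously. The result
    // can be subscribed to events and later unsubscribed.
    protected Action CallOnUiThread(Action action)
    {
        return () => Invoker.BeginInvoke(action, null);
    }
}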

One thing we really wanted back, though, was data binding. You can hardly say you're doing MVC if your model is a collection of strings, doubles, and ints.

Thread-Safe Data Binding

The problem that we had before with data binding was that our view objects held references to model objects, allowing the UI thread to update the same objects that were being updated by the background thread. As we began putting the ViewModel pattern in place, many things became much simpler—just having independent control of the structure of these objects made all the difference.

Our next step in bringing back data binding then, was to have controller objects pass clones to their views like so:

public class CustomerController : BaseController
{
    // this method runs on the background thread
    public void CustomerOverdrawn(Customer c)
    {
        ICustomerOverdrawnView v = this.CreateCustomerOverdrawnView();
        v.Customer = c.Clone(); // always remember to clone
        this.CallOnUiThread( () => v.Show() );
    }
}

Although this technically worked, there were a couple of problems with it. The first problem was maintainability—how could we ensure that all developers remembered to clone their domain objects before passing them to views? The second problem was more technical—domain objects referenced each other so just cloning one didn't mean the others were cloned.

In short, we needed domain objects to be deep cloned as they were passed to views. It was also important that we didn't move any of this complexity into the views themselves. What we needed was to create a generic proxy of the view in CreateCustomerOverdrawnView, and in that proxy inspect all method calls and property setters for parameters that were domain objects, perform a deep clone, and then pass that clone into the view itself.

There are many technologies that enable you to perform this proxying, and each one does things differently. Some use aspect-oriented programming techniques, others more straightforward hooks, but the tactics themselves aren't important. Just know that you need to create a proxy. In the proxy, include a method for the deep cloning and an accompanying dictionary for holding the working set of clones. Figure 3 shows our solution.

Figure 3 Cloning Objects for the View

// dictionary of references from source objects to their clones
// so that we always return the same clone for the same source object.
private IDictionary<object, object> sourceToClone =
    new Dictionary<object, object>();

// performs a deep clone of the given entity
public object Clone(object entity)
{
    if (entity.GetType().IsValueType)
        return entity;

    if (entity is string)
        return (entity as string).Clone();

    if (entity is IEnumerable)
    {
        object list = Activator.CreateInstance(entity.GetType());
        MethodInfo addMethod = entity.GetType().GetMethod("Add");
        foreach (object o in (entity as IEnumerable))
            addMethod.Invoke(list, new object[] { Clone(o) });
        return list;
    }

    if (sourceToClone.ContainsKey(entity))
        return sourceToClone[entity];

    object result = Activator.CreateInstance(entity.GetType());
    sourceToClone[entity] = result;

    foreach (FieldInfo field in entity.GetType().GetFields(
        BindingFlags.Instance | BindingFlags.FlattenHierarchy |
        BindingFlags.NonPublic | BindingFlags.Public))
    {
        field.SetValue(result, Clone(field.GetValue(entity)));
    }

    return result;
}

This method goes through all properties and fields of the object, cloning them as necessary. If it finds a reference to an object it has previously cloned in the dictionary, it doesn't clone again, but returns the first clone. In this manner, the method creates a mirror image of a given graph of objects, preventing the view from gaining access to an object that may be used on a background thread.
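
As one concrete tactic—an assumption here, since any proxying technology will do—an interceptor in the style of Castle DynamicProxy can centralize the cloning. GraphCloner is a hypothetical wrapper around the Clone method of Figure 3:

using Castle.DynamicProxy;

// Intercepts every call to a view and deep-clones any domain-object
// arguments before they reach it.
class CloningViewInterceptor : IInterceptor
{
    readonly GraphCloner cloner = new GraphCloner(); // wraps Figure 3's Clone

    public void Intercept(IInvocation invocation)
    {
        for (int i = 0; i < invocation.Arguments.Length; i++)
        {
            object arg = invocation.GetArgumentValue(i);
            if (IsDomainObject(arg))
                invocation.SetArgumentValue(i, cloner.Clone(arg));
        }
        invocation.Proceed(); // call the real view with clones only
    }

    static bool IsDomainObject(object o)
    {
        // illustrative test; ours checked the domain model's assembly
        return o != null && o.GetType().Namespace == "MyApp.Domain";
    }
}

// Wiring it up when creating a view (inside CreateCustomerOverdrawnView):
// ICustomerOverdrawnView view = new ProxyGenerator()
//     .CreateInterfaceProxyWithTarget<ICustomerOverdrawnView>(
//         realView, new CloningViewInterceptor());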

After all this was done, our views could now freely bind to the given domain objects. What we could no longer count on was two-way data binding of domain objects automatically updating bound views.

Two-way DataBinding

In regular data binding scenarios, when the bound object is changed, it raises events letting the view know that it should be refreshed. This really helps decouple a system as the parts that handle communication with the server don't need to update any views when they update a domain object. In our new thread-safe architecture, since the client-side WCF objects don't update the same instances of domain objects as those that are bound to the views, we lose some of the benefits of two-way data binding.

It is important to understand that in a multithreaded environment we have to give up on two-way data binding. If a background thread were to update a domain object bound to the UI, the INotifyPropertyChanged behavior would cause the background thread to update the UI directly and the application to crash.

Controllers now have to take on the responsibility of knowing which view needs to be updated when. For example, when a user has a form open showing the details of a pending trade, and the client receives notification about changes to the trade, the relevant controller should update the form. Here's how we did it originally:

public class TradeController : BaseController
{
    public void Init()
    {
        Commands.Open<Trade>.Activated += (args =>
        {
            TradeForm f = OpenTradeForm(args.Trade);
            args.Trade.Updated += this.CallOnUiThread(
                () => f.TradeUpdated(args.Trade) );
        });
    }
}

This code handles the generic open command for a trade by opening a form and showing the requested trade. It also specifies that when the trade is updated, the form is passed the updated trade.

We were well-pleased with this clean, simple, and straightforward code.

That is, until we realized it had a bug in it.

When the user opened a second trade (or any trade after that), and one of the previously opened trades was updated, the form would show that updated trade—even though the user didn't care about it anymore. We needed to be more careful in the disposal of our callbacks:

public class TradeController : BaseController
{
    public void Init()
    {
        Commands.Open<Trade>.Activated += (args =>
        {
            TradeForm f = OpenTradeForm(args.Trade);
            Delegate tradeUpdated = this.CallOnUiThread(
                () => f.TradeUpdated(args.Trade) );
            args.Trade.Updated += tradeUpdated;
            f.Closed += () => args.Trade.Updated -= tradeUpdated;
        });
    }
}

The difference here is that we're holding on to the reference of the delegate so that we can unsubscribe from the trade update when the form closes. Bug fixed.

Except for one little thing.

This code would be correct in a single-threaded system, but in the harsh clone-war environment of our multithreaded client, references aren't always what you think.

The trade object in the event args came from somewhere—the activation of a command, to be specific. Commands are activated by users, on the UI thread, as the result of the user doing something—in this case, double-clicking a trade in a grid. But if the trade is shown in a grid, that means a controller had to give it to that view, and in the process the trade would have been cloned.

In short, since any notifications from the server pertaining to that trade would not work directly with the cloned object in the grid, the Updated event the controller subscribes to above would never get raised. As such, the form showing the details of the trade would not get refreshed.

Something's missing from the solution.

Client-side Repositories

Our controllers need some way to find out when a specific instance of an entity gets updated, but without relying on object references.

If every entity had an identifier, and there was an in-memory repository of these entities on the client that could be queried, controllers could use it to get the authoritative reference for an entity based on identifiers coming from the UI.

Here is what our trade controller looks like when using the repository:

public class TradeController : BaseController
{
    public void Init()
    {
        Commands.Open<Trade>.Activated += (args =>
        {
            TradeForm f = OpenTradeForm(args.Trade);
            Delegate tradeUpdated = this.CallOnUiThread(
                (trade) => f.TradeUpdated(trade) );
            this.Repository<Trade>.When(t => t.Id == args.Trade.Id)
                .Subscribe(tradeUpdated);
            f.Closed += () =>
                this.Repository<Trade>.Unsubscribe(tradeUpdated);
        });
    }
}

Our controller is now using the repository to subscribe to changes on the specific trade instance with the identifier provided by the UI. The last piece of the puzzle is simply to have our client-side WCF objects use that same repository to update the client-side domain objects.
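
A minimal sketch of such a repository follows. The IEntity interface, the fluent When/Subscribe shape, and the Update entry point are assumptions made to match the snippet above, not our production framework:

using System;
using System.Collections.Generic;

public interface IEntity { Guid Id { get; } }

// In-memory, identity-based store. Controllers subscribe by predicate;
// the client-side WCF callback objects call Update when notifications
// arrive, which refreshes the authoritative instance and fires any
// matching subscriptions.
public class InMemoryRepository<T> where T : class, IEntity
{
    readonly Dictionary<Guid, T> entities = new Dictionary<Guid, T>();
    readonly Dictionary<Delegate, Predicate<T>> subscriptions =
        new Dictionary<Delegate, Predicate<T>>();

    // Authoritative instance lookup by identity.
    public T Get(Guid id)
    {
        T entity;
        entities.TryGetValue(id, out entity);
        return entity;
    }

    public SubscriptionBuilder When(Predicate<T> filter)
    {
        return new SubscriptionBuilder(this, filter);
    }

    public class SubscriptionBuilder
    {
        readonly InMemoryRepository<T> repo;
        readonly Predicate<T> filter;

        internal SubscriptionBuilder(InMemoryRepository<T> repo, Predicate<T> filter)
        {
            this.repo = repo;
            this.filter = filter;
        }

        public void Subscribe(Delegate handler)
        {
            repo.subscriptions[handler] = filter;
        }
    }

    public void Unsubscribe(Delegate handler)
    {
        subscriptions.Remove(handler);
    }

    // Called by the client-side WCF objects on the background thread.
    public void Update(T changed)
    {
        entities[changed.Id] = changed;
        foreach (KeyValuePair<Delegate, Predicate<T>> pair in subscriptions)
            if (pair.Value(changed))
                pair.Key.DynamicInvoke(changed);
    }
}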

Performance

After putting all the pieces in place, rewriting significant parts of our client application, introducing supporting frameworks, and going through our performance testing, we discovered that there were still problems.

Under heavy load, when many objects had been updated at around the same time, large notifications being processed on the client side caused the UI to become sluggish. Our instrumentation showed that the larger the notification, the longer the synchronization domain was held by the background thread—during which time the user could not perform any actions requiring controller logic.

We tried optimizing the client-side code, refactoring domain objects so that object graphs would be smaller and take less time to clone, and everything else we could think of. Nothing we did on the client helped.

And that's when one of the junior server developers spoke up (a little hesitantly), suggesting we could change the server code to publish more than one notification message. Rather than putting all changed entities in one message, we could send one entity per message, or use some batch size in between, possibly even making it configurable.

And it made all the difference in the world.

Because the background thread on the client left the synchronization domain between processing one notification and the next, the UI thread was able to get in and do some work on behalf of the user. The fact that it took a bit longer for large notifications to be processed one message at a time on the client was perfectly acceptable.
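
On the server, the change was conceptually tiny. In sketch form (Publish and the message types are placeholders for our messaging infrastructure):

// Before: one message carried every changed entity, and the client's
// background thread held the synchronization domain for the whole batch.
// Publish(new EntitiesChanged(allChangedEntities));

// After: one message per entity. Between messages the client's
// background thread leaves the synchronization domain, letting the
// UI thread interleave its own work.
foreach (Entity changed in allChangedEntities)
    Publish(new EntityChanged(changed));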

Lessons Learned

When we started out on the development of this project, we assumed it would be just like any other rich-client/database system. The challenges we'd faced, both architecturally and technically, were greater than anything else we'd seen. Dealing with low-level threading issues on both server and client, understanding how occasional connectivity affects scalability, and maintaining an acceptable level of user interactivity while avoiding deadlocks and race conditions—there were so many pieces that had to click together just right for the whole thing to work. All the supporting technologies were there; it was all in how we put them together.

In the end, though, as we watched our users collaborating in real time across the globe, we understood that it couldn't be any other way. Companies whose systems relied on users pulling, sorting, and grouping information just wouldn't be able to compete. Software + services, occasionally connected clients, and the multicore revolution would mean ever greater productivity and profitability for an organization able to make the leap.

Our stakeholder was right. The ability to surface events from the world to the user in real time really is critical for knowledge workers to support collaborative, real-time enterprises.

After finishing the project, I was a little concerned that all the patterns and techniques I'd picked up wouldn't serve me at all as I transitioned into more line-of-business style application development. I've got to tell you that I was pleasantly surprised.

On a more personal note, I can tell you that nothing does better for a developer's career than line-of-business managers raving about your app, and the development itself is much more interesting. Give it a try.

Udi Dahan is The Software Simplist, an MVP and Connected Technologies Advisor working on WCF, WF, and "Oslo." He provides training, mentoring, and high-end architecture consulting services, specializing in service-oriented, scalable, and secure .NET architecture design. Udi can be contacted via his blog: www.UdiDahan.com.