ICorDebug re-architecture in CLR 4.0

In my previous post I mentioned that CLR 4.0 will support managed dump debugging through ICorDebug, and that to do this we had to re-architect the debugging support in the CLR. I want to give you a little more detail about what we've been doing here. I'm sure I (and others) will be discussing more details in the time leading up to CLR 4.0 RTM, so stay tuned and also feel free to ask questions on this in our forum.

History and background
First, if you haven't already, read Mike's blog entry describing the difference between "hard-mode" and "soft-mode" debugging, and his related entry describing the implication of relying on a helper thread. Mike has clearly been thinking about this for a long time, and the successes we announced at PDC are due in large part to Mike's dedication, vision, and hard work on this project for the past several years. In fact, after reading this post, if you go back and read other ICorDebug-related entries on Mike’s blog you’ll find a lot of great details on the thinking that led to the design points I discuss below. Drew Bliss (while he was on the WinDbg team) was also instrumental to the early vision, design and implementation of our re-architecture.

As you know from Mike's blog, there are clearly a number of advantages to being able to debug from out-of-process (not the least of which is support for dump debugging). The big challenge is how to implement this in a maintainable way for a complex system with lots of fancy data structures and algorithms for accessing them. We've actually been experimenting with prototypes for doing this for many years (since nearly the start of the CLR), and this is exactly what SOS relies on. But it's really only been in the past few years that we've developed the confidence to move ALL managed debugging towards an out-of-process model. A power-user tool like SOS has somewhat different requirements from a production-grade API used by millions of developers every day in their Visual Studio debugging sessions.

The way we attempt to address the maintenance issue is to build both the normal in-process CLR code (mscorwks.dll) and the out-of-process code (mscordacwks.dll) from the same code base. We use some fancy C++ templates to instrument all pointer dereferences so that we can read memory from the target process instead of the process in which the code is executing ("host process"). We call this, and its related infrastructure, our "DAC" (data access component) technology. There are still a lot of challenges associated with DAC (e.g. trying to help CLR developers reason about their code when there are two separate address spaces involved), but we've taken it far enough that we're willing to trust the approach. If people are interested, I'd be happy to go into more detail on this in the future, but let's move on now to what we're actually delivering in CLR v4.
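
To make the pointer-instrumentation idea a bit more concrete, here is a minimal sketch of how a template can turn a "dereference" into a read from another address space. This is not the actual DAC code: `TargetMemory`, `TargetPtr`, `Node`, `SumList` and `MakeSampleTarget` are all hypothetical names invented for illustration, and the "target" is just a byte buffer standing in for another process or a dump.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the target: captured bytes plus the address they start at.
struct TargetMemory {
    std::uint64_t base;
    std::vector<unsigned char> bytes;

    // Copies `size` bytes at target address `address` into host memory.
    void Read(std::uint64_t address, void* out, std::size_t size) const {
        assert(out != nullptr);
        if (address < base) throw std::out_of_range("address below captured range");
        const std::uint64_t offset = address - base;
        if (offset > bytes.size() || size > bytes.size() - offset)
            throw std::out_of_range("read past captured range");
        std::memcpy(out, bytes.data() + offset, size);
    }
};

// Hypothetical pointer wrapper: it stores a target address, never a host pointer.
template <typename T>
class TargetPtr {
    static_assert(std::is_trivially_copyable<T>::value, "T must be copyable byte-wise");
public:
    TargetPtr(const TargetMemory& mem, std::uint64_t address) : mem_(&mem), address_(address) {}

    // "Dereferencing" marshals a copy of the object out of the target.
    T operator*() const {
        T value;
        mem_->Read(address_, &value, sizeof(T));
        return value;
    }
private:
    const TargetMemory* mem_;
    std::uint64_t address_;
};

// A data structure as it is laid out in the target; `next` is a target address.
struct Node {
    std::uint64_t next;
    std::uint32_t value;
    std::uint32_t padding;
};

// The same list-walking logic you would write in-process, reading through TargetPtr.
std::uint32_t SumList(const TargetMemory& mem, std::uint64_t head) {
    std::uint32_t sum = 0;
    for (std::uint64_t addr = head; addr != 0;) {
        const Node node = *TargetPtr<Node>(mem, addr);
        sum += node.value;
        addr = node.next;
    }
    return sum;
}

// Builds a three-node list (values 1, 2, 3) as it would appear in target memory.
TargetMemory MakeSampleTarget() {
    const std::uint64_t base = 0x1000;
    const Node nodes[3] = {
        {base + sizeof(Node), 1, 0},
        {base + 2 * sizeof(Node), 2, 0},
        {0, 3, 0},
    };
    TargetMemory mem{base, std::vector<unsigned char>(sizeof(nodes))};
    std::memcpy(mem.bytes.data(), nodes, sizeof(nodes));
    return mem;
}
```

The point of the sketch is that `SumList` reads like ordinary list-walking code even though none of the addresses it follows are valid in the host process. That is the property that lets one code base serve both builds.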

Design points for ICorDebug v4.0
Here's a list of principles we followed in the design for debugging in CLR v4.0. There’s a lot of overlap here since these design points are all part of a single coherent vision. Just skimming this list should give you a pretty good idea of what our new architecture is all about. I’ll save going into the concrete details of the new API here (although you can grab the new cordebug.idl from the CTP VPC image and explore it yourself if you're really curious).

  1. Debugging a 4.0 managed app requires a 4.0-aware debugger (i.e. we are making breaking changes)
    It has always been our policy that a debugger needs to be designed to understand the major features of the run-time, and so with a major new release of the runtime (like 4.0) we will require updates to the debugger.  This means that VS 2008 will not be able to debug code running in the V4 CLR (exactly like VS 2003 couldn't debug code running in CLR v2) - but see below.
  2. Migrating a 2.0 debugger to support 4.0 should be very easy
    Although we want to hold a hard line on our compat policy above, we are working very hard to make it easy for debuggers to update their code to support CLR v4.  In almost all cases, we're not removing / disabling or drastically changing the semantics of any existing APIs.   We are adding a bunch of new APIs and re-implementing the existing APIs on top of them, thus giving you the choice of which to use (the new APIs tend to have a lower level of abstraction).  In fact, you could install VS 2008 on the .NET 4.0 CTP image and debug code running in the 4.0 CLR.  In addition to wanting to make it easy for debugger writers to migrate, we also didn't want to force everyone inside Microsoft using early CLR v4 builds to also use VS 2010 as their debugger until it reaches beta quality.  By the time we ship the beta, we'll explicitly break this to ensure we're not setting any unrealistic expectations.
  3. Re-architecture is incremental instead of a complete rewrite
    We (mainly, to his credit, Mike) decided early on that we should transition to this new architecture incrementally while keeping our system working throughout.  This was done partly for pragmatic reasons, for example that we wanted to always have a working Visual Studio without having to wait for them to update to our new architecture in lock-step with us.  This also enabled us to react to changes in the business plan.  For example, when we decided to ship Silverlight 2 based on the latest CLR code base (i.e. the V4 branch) rather than based on the CLR v2 codebase, we were pretty much prepared to ship the current state of our re-architecture even though we weren't yet "done" (we did have to spend a month or so adding Mac support to our new architecture - but that was a relatively isolated new feature, as opposed to a race to productize an unfinished codebase).
  4. Maintain a single, self-consistent debugging API based on ICorDebug
    Despite the huge fundamental changes we needed to make, we decided we needed to maintain consistency with the existing ICorDebug APIs as much as possible.  Again, this helps ensure that existing managed debuggers can take advantage of our new functionality as easily as possible.  In fact, the VS debugger team spent only a few weeks adding the initial dump debugging support to VS 2010 because we made all the inspection operations behave exactly the same as for live debugging (we even had a hacked-up demo of dump-debugging working with VS 2008 without ANY VS changes at all!).
  5. All access to the target is abstracted through a callback interface
    In V2, ICorDebug used a number of different Win32 APIs to interact with the target process (e.g. using ReadProcessMemory, GetThreadContext, shared-memory blocks, and named events).  In V4, the core of ICorDebug uses a simple abstraction we call the "data target" (ICorDebugDataTarget) that a debugger implements.  This basically just has methods like 'ReadVirtual' (read some memory) and 'GetThreadContext' (get CPU state).
  6. ICorDebug can operate in two different modes - "pure out-of-process v4" and "v2 compatibility"
    Again, this is related to everything we've been discussing above.  In v2 compat mode, you create an ICorDebug instance in a similar way to V2, and use that to, for example, attach to a process given a PID to get an ICorDebugProcess that can do everything the one in V2 could do.  In this case, we implement the ICorDebugDataTarget object ourselves.  In pure v4 mode, you use an alternate "open virtual process" creation API that basically just takes an ICorDebugDataTarget and returns an ICorDebugProcess whose connection to the target is completely abstracted.  Today, a pure v4 ICorDebugProcess only has the ability to do inspection operations (i.e. mainly just dump debugging), but we'll be enabling more out-of-process functionality (like stepping) in future releases.
  7. In v4-mode, ICorDebug acts just as a utility library for existing native debuggers
    In V2 and before, ICorDebug really behaves as if it owns the target process.  This has a lot of negative implications, most significantly that it makes mixed-mode debugging extremely difficult and brittle.  It also means that when debuggers want to add features, they often need to come to us (this is especially problematic for us given that the Visual Studio debugger team has a lot more people than the CLR debugger team <grin>).  Really we want to enable debuggers to implement their own policy and control of the target process rather than imposing a lot of policy ourselves.  This means we're lowering the level of abstraction of ICorDebug in some places.  SOS, as an extension that runs on top of a native debugger like WinDbg,  clearly already follows this principle.  An example benefit is that native debuggers like Visual Studio and WinDbg already have some policy and mechanism for skipping over breakpoints, and now in v4-mode we no longer have to have a completely different policy and mechanism for managed breakpoints.  This principle leads directly to the following design decision.
  8. Under the hood we're built on the native debugging pipeline
    In v2-compat mode, ICorDebug continues to own the pipeline to the target process (since that was the V2 model), but that pipeline is no longer a collection of shared IPC objects with the target process, but is instead the same pipeline a native debugger uses.  Specifically, we attach to a process by calling kernel32!DebugActiveProcess, and get our managed events (things that result in calls to ICorDebugManagedCallback) using kernel32!WaitForDebugEvent.  This also means that kernel32!IsDebuggerPresent now returns true when doing managed-only debugging.  This also has the nice side-effect of avoiding the problem with doing managed-only debugging when a kernel debugger is enabled (the OS assumes any breakpoint instructions that occur when a debugger isn't attached should cause a break in the kernel debugger).
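
To illustrate the "data target" idea from point 5, here is a rough sketch of what such a callback abstraction might look like when it is backed by a dump-like snapshot instead of a live process. It is deliberately simplified and is not the real interface: the actual ICorDebugDataTarget is a COM interface declared in cordebug.idl, whereas `DataTarget`, `SnapshotDataTarget`, `SimpleContext` and `ReadStackTop` below are hypothetical plain C++ names, and the return types are my own simplification. Only the method names 'ReadVirtual' and 'GetThreadContext' come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified CPU state; a real context is platform-specific.
struct SimpleContext {
    std::uint64_t instructionPointer;
    std::uint64_t stackPointer;
};

// Hypothetical plain C++ analogue of the data target: everything the
// inspection code knows about the target goes through these two calls.
class DataTarget {
public:
    virtual ~DataTarget() = default;
    // Copies up to `size` bytes from target address `address`; returns bytes copied.
    virtual std::size_t ReadVirtual(std::uint64_t address, void* out, std::size_t size) const = 0;
    // Fills `out` with the thread's CPU state; returns false for an unknown thread.
    virtual bool GetThreadContext(std::uint32_t threadId, SimpleContext* out) const = 0;
};

// One possible implementation: a snapshot of memory regions and thread
// contexts, roughly what a debugger would have after loading a dump file.
class SnapshotDataTarget : public DataTarget {
public:
    void AddRegion(std::uint64_t base, std::vector<unsigned char> bytes) {
        regions_[base] = std::move(bytes);
    }
    void AddThread(std::uint32_t threadId, SimpleContext context) {
        threads_[threadId] = context;
    }

    std::size_t ReadVirtual(std::uint64_t address, void* out, std::size_t size) const override {
        assert(out != nullptr);
        // Find the region with the greatest base address <= `address`.
        auto it = regions_.upper_bound(address);
        if (it == regions_.begin()) return 0;
        --it;
        const std::uint64_t offset = address - it->first;
        if (offset >= it->second.size()) return 0;
        const std::size_t available = it->second.size() - static_cast<std::size_t>(offset);
        const std::size_t count = size < available ? size : available;
        std::memcpy(out, it->second.data() + offset, count);
        return count;
    }

    bool GetThreadContext(std::uint32_t threadId, SimpleContext* out) const override {
        assert(out != nullptr);
        const auto it = threads_.find(threadId);
        if (it == threads_.end()) return false;
        *out = it->second;
        return true;
    }

private:
    std::map<std::uint64_t, std::vector<unsigned char>> regions_;
    std::map<std::uint32_t, SimpleContext> threads_;
};

// Inspection logic written only against the abstraction: it cannot tell
// whether the target is a live process or a dump.
bool ReadStackTop(const DataTarget& target, std::uint32_t threadId, std::uint64_t* value) {
    assert(value != nullptr);
    SimpleContext context{};
    if (!target.GetThreadContext(threadId, &context)) return false;
    return target.ReadVirtual(context.stackPointer, value, sizeof(*value)) == sizeof(*value);
}
```

A live-process implementation of the same two methods could forward to ReadProcessMemory and the Win32 GetThreadContext, which is the sense in which one inspection code path can serve both live and dump debugging.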

 Ok, I think those are the most important design points.  Sometime later I'll start going into more of the concrete details.   Post a comment if you have any specific things you'd like to hear about.