RPC over IT/Pro
Hi folks, Ned here again to talk about one of the most commonly used – and least understood – network protocols in Windows: Remote Procedure Call. Understanding RPC is a foundation for any successful IT Professional. It’s integral to distributed systems like Active Directory, Exchange, SQL, and System Center. The administrator who has never run into RPC configuration issues is either very new or very lucky.
Today I attempt to explain the protocol in practical terms. As always, the best way to troubleshoot is with an understanding of how things are supposed to work, so that when it fails the reasons are obvious. If you have a metered or capped Internet connection, read this off hours – it’s a biggee.
The RPC concept has roots in ARPANET, but got its first business computing use – like so many others – at Xerox PARC as “Courier”. The Microsoft implementation is an extension of The Open Group’s DCE/RPC, sometimes called MSRPC. We further extended that into the Distributed Component Object Model (DCOM), which is RPC and COM. The Exchange folks heavily invested in RPC over HTTP. Microsoft also retains the legacy "RPC over SMB" system, often referred to as Named Pipes. That ends the brochure.
As I began to learn RPC, the first problem I ran into was the documentation. It seemed to come in two forms:
- Explanations by developers for developers, which contain very little architecture and troubleshooting info
- Explanations by alien hybrids for robot lawyers, which contain no understandable information at all
If you actually read the docs, you're let down in the details. They come in two arrangements, both of which completely miss the IT boat:
1. The “it’s all processes and libraries, get to coding” form:
2. The “Jedi network magic” form:
I find developers are often like Rain Man: specialist geniuses, bewildered by real life. This isn’t bad documentation, but IT pros aren’t the audience. The developers of RPC are providing a framework and since they live in a perfect world of design where nothing breaks, how it works is not important – they just want you to use the right APIs. The problem is I don’t care about the specifics of MIDL, stubs, or marshaling unless I’m at the point of debugging; I just want to know how it all works in practical networking terms. Then when it breaks, I have somewhere to start, and when I’m designing a distributed system, I’m not setting my customer up for headaches.
Today I focus on MSRPC, as that’s the main RPC protocol of AD components. I may return someday to discuss the others, if you’re interested. And bribe me.
The MSRPC details
Let's start with an analogy: you meet a nice girl and really hit it off. Like an idiot, you manage to lose her phone number. You know that she works for Microsoft though, so you start by looking up the Charlotte office. You call and get a switchboard, so you ask for her by name. The operator tells you her number and then offers to transfer you – naturally, you say yes. Someone answers and you make sure it’s the nice girl by introducing yourself. You both exchange pleasantries, then make plans for dinner and a movie, with directions to the restaurant and a chat about the Flixster reviews. You hang up and think about what you’re going to say to keep her interested until the appetizers arrive. You called her on your mobile phone so you have the outgoing number saved in case you need to call back.
There, now you understand MSRPC. No really, you do…
- A client application knows about a server application and wants to communicate with it.
- The client computer uses name resolution to locate the computer where that server application runs.
- The client app connects to an endpoint locator and requests access to the server application.
- The endpoint locator provides that info and the client connects to the server with an initial conversation.
- The client and server apps exchange instructions and data.
- The client and server apps disconnect.
- The client computer has a cache of name resolution and the connection that can save time reconnecting later.
RPC allows a client application to let other computers work on its behalf, offloading processing to more powerful centralized servers. Instead of sending real functions over the network, the client tells the server what functions to run, and then the server sends the data back. Note that "client" and "server" here describe the roles the applications play, not the OS they run on: some of these applications can be both client and server – for instance, Active Directory multi-master replication. That RPC application is LSASS.EXE. I’m going to use it as our sample app.
There are a few important terms to understand:
- Endpoint mapper – a service listening on the server, which guides client apps to server apps by port and UUID
- Tower – describes the RPC protocol, to allow the client and server to negotiate a connection
- Floor – the contents of a tower with specific data like ports, IP addresses, and identifiers
- UUID – a well-known GUID that identifies the RPC application. The UUID is what you use to see a specific kind of RPC application conversation, as there are likely to be many
- Opnum – the identifier of a function that the client wants the server to execute. It’s just a hexadecimal number, but a good network analyzer will translate the function for you. MSDN can too. If neither knows, your application vendor must tell you
- Port – the communication endpoints for the client and server applications
- Stub data – the information given to functions and data exchanged between the client and server. This is the payload; the important part
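If it helps to see the terms as data, here is a toy, in-memory model of an endpoint mapper handing out towers. It is a sketch only: the class names, the floor labels, and port 49156 are my own illustrative assumptions, and a real tower is a binary structure with more floors than this.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Directory Replication interface UUID, as quoted later in this post.
DRSR_UUID = "e3514235-4b06-11d1-ab04-00c04fc2dcd2"


@dataclass(frozen=True)
class Floor:
    """One layer of a tower: what it identifies and the value it carries."""
    kind: str
    value: str


@dataclass(frozen=True)
class Tower:
    """An ordered stack of floors describing one way to reach an interface."""
    floors: Tuple[Floor, ...]

    def get(self, kind: str) -> Optional[str]:
        for floor in self.floors:
            if floor.kind == kind:
                return floor.value
        return None


class ToyEndpointMapper:
    """Hypothetical stand-in for the switchboard listening on TCP 135."""

    def __init__(self) -> None:
        self._registry: Dict[str, List[Tower]] = {}

    def register(self, uuid: str, ip: str, port: int) -> Tower:
        # Simplified floors: interface, RPC protocol, TCP port, IP address.
        tower = Tower((
            Floor("interface", uuid.lower()),
            Floor("protocol", "ncacn_ip_tcp"),
            Floor("tcp_port", str(port)),
            Floor("ip", ip),
        ))
        self._registry.setdefault(uuid.lower(), []).append(tower)
        return tower

    def lookup(self, uuid: str) -> List[Tower]:
        # An unknown UUID simply yields no towers.
        return list(self._registry.get(uuid.lower(), []))


epm = ToyEndpointMapper()
epm.register(DRSR_UUID, "10.0.0.101", 49156)  # port is an assumed example
towers = epm.lookup(DRSR_UUID.upper())
```

The client asks by UUID, gets back one or more towers, and reads the port and address off the floors. That is the "operator gives you her number" step from the analogy.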
There’s a lot more but we’re getting into developer country. I know it sounds like jabber, so let’s dissect this with a real-world example using our old friend NetMon and the latest open source parsers.
Back to reality
Here I have two DCs in the same AD site, named WIN2008R2-01 and WIN2008R2-02, with respective IP addresses of 10.0.0.101 and 10.0.0.102. I reboot DC2 and have a network capture running on DC1. I create a brand new test user and let it replicate, then I stop the capture. It’s critical to have a network capture see the whole conversation or it will be a mess to analyze; if possible, the captures should always be running on both client and server, but in this case, that’s not possible due to the reboot.
When you first examine AD replication traffic in NetMon (like above) it looks like Greek. What the heck is a stub parser? DRSR?
Open the Options menu and select Parser Profiles. The reason you see the “Windows stub parser” messages is that by default, NetMon uses a balanced set of parsers designed for limited analysis without packet loss.
When analyzing captures on your desktop, set the active parser to “Windows” and you get the most detail.
While you’re in the Options, I also recommend configuring color filters. Since I am examining AD replication, I want visual cues for DRSR (Directory Replication Service Remote protocol), EPM (RPC Endpoint Mapper), MSRPC, and DNS. This makes skimming a capture easier.
Now I add a simple filter of: msrpc. Better. Let’s start deciphering:
Right away, we see the endpoint mapper request above. The tower for Directory Replication is in that request, using the UUID E3514235-4B06-11D1-AB04-00C04FC2DCD2 (that's how NetMon knows to parse it, by the way). The client is connecting to TCP port 135, where the endpoint mapper listens. This happens shortly after LSASS.EXE starts, as domain controllers are nearly always talking about replication.
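If your analyzer does not parse the UUID for you, you can still hunt for it in the raw bytes. With the little-endian data representation that Windows normally uses, the first three UUID fields are byte-swapped on the wire, so the hex dump will not match the printed form. A small sketch using Python's standard uuid module (the function names are mine):

```python
import uuid

DRSR_UUID = uuid.UUID("E3514235-4B06-11D1-AB04-00C04FC2DCD2")


def uuid_wire_hex(u: uuid.UUID) -> str:
    """Hex of the UUID with its first three fields little-endian."""
    return u.bytes_le.hex()


def payload_mentions_uuid(payload: bytes, u: uuid.UUID) -> bool:
    """True if the raw payload contains the UUID in little-endian wire form."""
    return u.bytes_le in payload


wire = uuid_wire_hex(DRSR_UUID)  # 354251e3064bd111ab0400c04fc2dcd2
```

Searching a hex view for that string is a crude but workable way to find the endpoint mapper request and the bind for a given RPC application.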
Naturally, there is a response, and it contains several key ingredients:
You can see the towers - there may be more than one - and the floors in each tower with their ports. Importantly, you also see the status of the attempted connection. And a specific server port is listed. That port may be dynamic or static, it depends on the application’s configuration.
Now the client application opens a local client port (again, maybe dynamic, maybe static) and binds to that new application port, using security; the original connection, by default, did not require special permissions - EPM is a switchboard, remember. Because this is MSRPC and domain controllers, this means Kerberos and packet privacy are required. This bind phase below is negotiation.
The server responds with the (hopefully) successful negotiation, providing details about which security protocols were selected for further encryption of the traffic. The NegState field shows how this is not yet complete, but things are proceeding as planned.
This bind was the negotiation. What follows is the completion of the authentication and encapsulation phase, called an ALTER_CONTEXT operation. If all goes well, the authentication is accepted and RPC application communication proceeds with some nice secure packet payloads.
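For the curious, the bind, bind acknowledgment, and ALTER_CONTEXT packets are all told apart by one byte in the connection-oriented DCE/RPC header. The sketch below reads the 16-byte common header and, for request packets, the opnum that follows it. It is a minimal illustration under my own naming, it ignores fragmentation and the authentication trailer, and it is no substitute for a real parser.

```python
import struct
from typing import NamedTuple, Optional

# Packet type values from the connection-oriented DCE/RPC header.
PTYPE_NAMES = {
    0: "request", 2: "response", 3: "fault",
    11: "bind", 12: "bind_ack", 13: "bind_nak",
    14: "alter_context", 15: "alter_context_resp",
}


class RpcHeader(NamedTuple):
    version: int
    ptype: str
    frag_length: int
    auth_length: int
    call_id: int
    opnum: Optional[int]  # only present on request packets


def parse_rpc_header(pdu: bytes) -> RpcHeader:
    if len(pdu) < 16:
        raise ValueError("need at least the 16-byte common header")
    version, _minor, ptype, _flags = struct.unpack_from("BBBB", pdu, 0)
    # High nibble of the first data-representation byte: 1 = little-endian.
    order = "<" if pdu[4] & 0x10 else ">"
    frag_length, auth_length, call_id = struct.unpack_from(order + "HHI", pdu, 8)
    opnum = None
    if ptype == 0 and len(pdu) >= 24:
        # Request body starts with alloc_hint (4), context id (2), opnum (2).
        (opnum,) = struct.unpack_from(order + "H", pdu, 22)
    name = PTYPE_NAMES.get(ptype, f"unknown({ptype})")
    return RpcHeader(version, name, frag_length, auth_length, call_id, opnum)


# Hand-built sample packets (not taken from the capture in this post).
sample_request = struct.pack("<BBBB4sHHIIHH", 5, 0, 0, 3, b"\x10\x00\x00\x00",
                             24, 0, 7, 0, 0, 3)
sample_bind = struct.pack("<BBBB4sHHI", 5, 0, 11, 3, b"\x10\x00\x00\x00",
                          16, 0, 1)
request_header = parse_rpc_header(sample_request)
bind_header = parse_rpc_header(sample_bind)
```

Once the payload is encrypted, this header is about all you can read, which is exactly why the packet type and opnum matter so much when troubleshooting.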
Everything after this point is application… stuff. RPC connected from a client port to a server port and then communicates along that "channel" for the rest of the conversation. The two halves of the application send each other requests and responses, with stub data used by the application's functions.
Every application is different, but once you know each one's rules, it will work in a (relatively) predictable fashion. Since this is the well-documented Directory Replication Services application, what happens next is the DC creates a context handle, called a DRSBIND. It then does some work. Let's take a look at one example of the work by switching the NetMon filter to just DRSR, then apply it to our scenario.
NetMon is politely translating all of these RPC functions above into semi-intelligible words, like DRSBind, DRSReplicaSync, and DRSGetNCChanges. When it sees an opnum it understands for a given protocol, it knows which RPC function the client is telling the server to run remotely on the client's behalf.
If you examine one of those packets, you see that the data itself is encrypted (good!), but with knowledge of the opnum's purpose and that RPC reached this stage, you have a decent idea what it is doing or how to look it up based on the UUID and Opnum information, even if your network parsers are terrible. In this case:
Function, opnum, and explanation:
- IDL_DRSBind (opnum 0) – Creates a context handle necessary to call any other method in this interface.
- IDL_DRSReplicaSync (opnum 2) – Triggers replication from another DC.
- IDL_DRSGetNCChanges (opnum 3) – Replicates updates from an NC replica on the server.
- IDL_DRSCrackNames (opnum 12) – Looks up each of a set of objects in the directory and returns it to the caller in the requested format.
- IDL_DRSUnbind (opnum 1) – Destroys a context handle previously created by the IDL_DRSBind method.
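If your parsers really are terrible, the lookup they would have done for you is nothing more than a table keyed on UUID and opnum. Here is a sketch holding only the five functions named in this post; the dictionary layout and the function name describe_call are my own.

```python
# Directory Replication interface UUID from the endpoint mapper request.
DRSR_UUID = "e3514235-4b06-11d1-ab04-00c04fc2dcd2"

# (interface UUID, opnum) -> function name, limited to the calls listed here.
KNOWN_CALLS = {
    (DRSR_UUID, 0): "IDL_DRSBind",
    (DRSR_UUID, 1): "IDL_DRSUnbind",
    (DRSR_UUID, 2): "IDL_DRSReplicaSync",
    (DRSR_UUID, 3): "IDL_DRSGetNCChanges",
    (DRSR_UUID, 12): "IDL_DRSCrackNames",
}


def describe_call(interface_uuid: str, opnum: int) -> str:
    """Translate a UUID and opnum pair into a name, or admit ignorance."""
    key = (interface_uuid.lower(), opnum)
    return KNOWN_CALLS.get(key, f"unknown opnum {opnum} on {interface_uuid.lower()}")


first_call = describe_call("E3514235-4B06-11D1-AB04-00C04FC2DCD2", 0)
```

For a third-party application, the vendor's documentation is what fills in a table like this.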
Importantly, you know that RPC and the network appear to be functioning correctly, so any application problems are likely inside the application itself. If the application has internal logging, you can use these network captures to correlate each opnum request/response to real work, and perhaps see where things are failing internally. If the application doesn’t have good security, you can see exactly what it's doing - but so can anyone else. Probably something to bring to the third party vendor's attention, as it will not be Microsoft.
A polite application will tear down the connection with noticeable "unbind" traffic, and perhaps even send a network reset, but many simply abandon the conversation and let Windows deal with it later.
A final note: a domain controller has a great many RPC conversations going with multiple partners; always ensure you are looking at the same conversation by filtering on IP addresses and ports, as well as your network analysis tool's conversation ID system. NetMon makes this pretty easy.
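If you export packets to some other tool, the same idea is easy to reproduce: treat both directions of an address and port pair as one conversation. The sketch below assumes a made-up Packet record and invented ports; only the two DC addresses come from the capture above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

Endpoint = Tuple[str, int]


class Packet(NamedTuple):
    """Hypothetical, already-decoded packet summary."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    summary: str


def conversation_key(pkt: Packet) -> Tuple[Endpoint, Endpoint]:
    """Same key for both directions of one client port / server port pair."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    return (a, b) if a <= b else (b, a)


def group_conversations(packets: Iterable[Packet]) -> Dict[Tuple[Endpoint, Endpoint], List[str]]:
    conversations = defaultdict(list)
    for pkt in packets:
        conversations[conversation_key(pkt)].append(pkt.summary)
    return dict(conversations)


# Ports 49156, 50001, and 50002 are invented for illustration.
packets = [
    Packet("10.0.0.102", 50001, "10.0.0.101", 135, "EPM request"),
    Packet("10.0.0.101", 135, "10.0.0.102", 50001, "EPM response"),
    Packet("10.0.0.102", 50002, "10.0.0.101", 49156, "bind"),
    Packet("10.0.0.101", 49156, "10.0.0.102", 50002, "bind_ack"),
]
grouped = group_conversations(packets)
```

Here the endpoint mapper exchange and the application bind land in two separate groups, so you never mix one partner's replication traffic with another's.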
And we're done. See? It’s just a phone call with a nice girl from Microsoft. Don’t be intimidated when she knows more about computers than you do, bub.
Until next time.
Ned "really pedantic chatter" Pyle