Taming the CLR: How to Write Real-Time Managed Code
I've actually been meaning to write about real time applications for ages so when I was asked to give a talk at MS Gamefest (http://microsoftgamefest.com) I jumped at the opportunity to give myself a hard reason to do the homework. Last Tuesday I gave that talk and below are the slide contents plus my speaker notes. The actual audio was recorded an will probably be available soonish, when that happens I'll post a link.
I hope you find this somewhat interesting :)
Managed Code for Real Time?
- It may seem like a crazy notion to some people but gamers probably aren’t nearly as surprised
- Managed Runtimes in the game world are common
- My first AI code was written in LPC for LPMud
- You can get great results out of a managed runtime
Good for What Ails You
- Garbage Collected memory model is the centerpiece
- Amortized cost of allocations in this model can be excellent!
- Long term fragmentation is less of a problem
- Who hasn’t played a game that degrades over 2 hours of play?
- Locality can be good due to compaction – we need all the cache hits we can get
- Easy to use model, immune to classic “leaks” and wild pointers
Garbage collected memory models tend not to have long term degradations. Many of the most deadly problems simply can’t happen in this kind of model. When people talk about a “GC” leak what they mean is that they are holding on to a pointer that they should have nulled out. This is much easier to track down than memory that was “lost” in the classic leak model – nothing points to it or it isn’t freed. Wild pointers are totally impossible.
All of this is great news for game developers.
Great, I’ll take a dozen!
- Wait, not so fast, you have to read the fine print…
- There are important practices to get these benefits
- The GC is like a BFG9000
- Don’t shoot yourself in the foot with it!
People learned best practices for classic memory allocators. In a word – they have awful performance so you have to wrap them. And that’s exactly what everyone does. You get assorted different custom allocators for different purposes. Ones that allocate and free a big chuck, or carve out lots of little objects, or some other important specialized requirement. The main thing is you are careful to get to know your allocator(s) and use them accordingly.
Likewise, you have to get to know the GC. It’s useable directly – without wrapping – in a variety of cases, a great step forward. But you might shoot yourself in the foot if you do unwise things.
Things you need to know
- In .NET CF the GC is a compacting mark and sweep collector
- Contrast with the regular .NET Collector which is generational
- What does this mean?
- You probably should understand both models because you can find yourself on the PC platform too
(You can find the picture that I used in this article here, this discussion is taken directly from that article)
Collectors come in many flavors and today we’ll be talking about the flavor of a couple of different collectors that you might run into. The one in the .NET CF is quite a bit different than the one in the desktop. There are different rules for both – although if you follow the .NET CF rules you should get excellent performance out of the Desktop collector as well.
“Compacting” refers to the fact that our collectors will squeeze out the free memory, kind of like a disk defrag, when they think it is wise to do so.
“Generational” refers to the fact that the desktop collector can collect just some of the objects rather than all of them – notably it can collect “just the new stuff”
Simplified GC Model
- This shows generations which are not always present
- Some of the details in the following discussion are not exactly correct
- The idea is to have a mental model that helps you understand costs, you need this more than you need the exact details
In the simplified model (only):
- All garbage-collectable objects are allocated from one contiguous range of address space
- The heap can be divided into generations (more on this later)
- Objects within a generation are all roughly the same age.
- The oldest objects are at the lowest addresses, while new objects are created at increasing addresses. (In the drawing, addresses are increasing going down)
- The allocation pointer for new objects marks the boundary between the used (allocated) and unused (free) areas of memory.
- Periodically the heap is compacted by removing dead objects and sliding the live objects up toward the low-address end of the heap.
- The order of objects in memory remains the order in which they were created, for good locality.
- There are never any gaps between objects in the heap.
- Write Barriers are used to note changes in pointers from old objects to new objects allowing the newer objects to be collected seperately.
- See http://blogs.msdn.com/ricom, http://blogs.msdn.com/maoni for more articles and a longer discussion of this topic here http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/dotnetgcbasics.asp
Why do you care?
- Really there are just a few things that you need to keep top of mind
- Allocations are very cheap, basically you increment a pointer
- Objects that are allocated close together in time tend to be close together in space
- Full collections can be very expensive because the heap could be arbitrarily big
You need to know these things so that you get good locality of reference in your data structures. Better locality means fewer cache misses which means fewer clocks per instruction and therefore more frames per second.
You need to know how to prevent very expensive collection costs. Why do these costs arise and what can you do to limit the cost.
Some good things to keep in mind
- The garbage collector concerns itself with the living – dead objects cost nearly nothing. Even if we don’t compact and choose to thread a free list through the dead objects we still don’t have to visit every dead object to do that. Costs of collection tend to be driven by live objects.
- The overall amortized cost of memory allocation in a garbage collected world tends to be linear in the number of bytes allocated. i.e. allocation volume drives the total cost, which includes zeroing memory, actually doing the allocations (the cheapest part) and the eventual collections. It’s not always the same linear factor because that depends on how much is living and how rich in pointers everything is but, generally, allocation cost is a “per byte” thing for any given application.
Stay in Control
- You can get great performance in this kind of world if you stay in control
- In the regular CLR, for real-time or high throughput applications, I advise people to keep their eye on mid-life objects, make sure they have few of them
- Mid life objects cause you to have to do full collections, and a lot more memory is implicated
- But what if we don’t even have generations?
The classic thing you have to worry about in the desktop CLR is if you start leaking a lot of objects into generation 2 – the oldest generation. So basically object-lifetime patterns drive performance – the allocation rate and the death rate in each generation.
This is true because in the desktop world the GC heap could be very large (e.g. 1GB+) and full collects could therefore be very costly. So the trick in this world is to make sure you’re only doing partial collections.
There are many real-time applications in the desktop/server world that are successful, e.g. assorted financial companies have stock streaming services that do things with quotes and then facilitate order processing. These are all done on a deadline. How do they do it? Strict control of allocation volume and promotions to the elder generations.
.NET CF Considerations
- A good way to think about .NET CF is that it’s like you only have generation 0
- Your total heap size should be roughly what your generation 0 size would have been – that means comparable to the CPU cache
- Collection times in that world will be excellent
- If your heap gets too big collect times will kill you
If you want to get the best performance your total heap sizes in .NET CF should be roughly what your generation 0 size would have been in a generational collector.
If you keep accumulating old objects thereby letting your heap grow, the fixed cost of marking those objects during collections will start to hurt you.
If you have large numbers of objects that you need to pre-create that are going to survive you can minimize the cost of managing these object by keeping them devoid of pointers. If no tracing needs to happen huge swaths of memory can be marked as in-use (e.g. arrays) without even having to look at them.
So a good tactic is to keep your very long lived data low in pointers (e.g. use handles) and things you churn rich in pointers so that they are easy to manage. Controlling this blend is up to you.
Remember that when you start using handles you’re back to managing your own object lifetimes and free lists and so forth but that doesn’t matter at all if you’re talking about objects that basically stay around for a whole level. Use this wisely.
The total heap should be about the size of the CPU cache or a small multiple of it so that you are going to get lots of hits when collecting and processing generally.
Also, .NET CF needs no write barriers as its not generational, so one less cost right there.
- In addition to typical heap size you also need to keep your eye on the allocation rate
- Freshly allocated memory has to be zeroed, and then “constructed” (by you)
- Basically that’s a lot of memory traffic
- At 20-40 MB/s you are going to be pleased with the memory management overhead
- At 60 frames per second that’s around 500k per frame, or say 20k objects of modest size
Volume can kill you as well. Zeroing out all the memory can be expensive – and of course volume drives collections.
Collections are typically triggered when a certain number of bytes has been allocated – at that point the GC deems it wise to do a collection because there is likely to be enough junk that it’s worth bothering with.
The GC gets its efficiency by being lazy – collect to aggressively and all those savings go away. On the flip side, collect not enough and memory usage skyrockets and locality is destroyed. The GC keeps these two competing things in balance with allocation budgets.
At the suggested volume of allocations and heap size you should be able to achieve garbage collection overheads in the low to mid single digits of percent. That’s a great result for general memory management.
- Jitted code optimization is not as good as a full optimizing compiler
- So this is no place to do intense floating point math, that’s what Shaders are for
- BUT! You’ll write simpler code
- The fastest code is the code that isn’t there!
- Simplifications can easily dwarf code-gen taxes, and give you coding time elsewhere
- Cheap memory model can be ideal for AI logic and physics
Jitting isn’t going to be your death… it’s like a fixed overhead.
The idea is that you can more than win this back with simplified logic. Lack of destructors. No cleanup code to run (and who wants to write code to (e.g.) visit partly created trees and release them?)
Less code to write means more time to focus on algorithmic gains in your code – that’s where the real money is. But, full disclosure, you should know you are starting at a modest penalty.
Use the coding simplifications to your advantage, build great algorithms that would have been too hard or too expensive in man-hours to code otherwise and win.
Don’t try to write a Phong Shader in managed code for your game. That’s what the GPU is for.
Interop with Native
Just a few words here, again you can shoot yourself in the foot
- Keep your interop under control
- Use primitive types in the API whenever possible
- Marshalling costs are what will kill you
- Simple calls can be very fast
- Desktop CLR can get many millions of calls per second on simple signatures
- Native Interop is not possible on the Xbox 360
Not much to say here other than, if you must, on the desktop, then keep it as simple as possible.
On 360 it isn’t even an option, therefore you can’t get it wrong :)