Managed Code Performance on Xbox 360 for XNA: Part 1 - Intro and CPU
Now that XNA Game Studio Express 1.0 is out, it’s time to start writing managed code for the Xbox 360. Performance has been a popular topic on this blog, and the Xbox is no exception. The bottom line for gaming performance is framerate, and the de facto framerate standard these days seems to be 60fps. I emphasize that bottom line because it’s tempting to get into technical minutiae when talking about performance, for example CPU cache or pipeline efficiency, but that’s not really the spirit of XNA Game Studio Express (with an emphasis on Express here). The goal of this blog posting is to provide some insight into the NetCF team’s performance expectations for this release, along with guidance on where you’ll get the most “bang” for your perf optimizing “buck”.
XNA gives you pretty much full access to the powerful Xbox 360 graphics hardware, which supports a superset of Shader Model 3.0. Efficient use of the rendering pipeline is critical to game performance, and it’s often the right area to focus on first… assuming you’re doing anything graphically interesting. Besides… isn’t the visceral appeal of graphics the reason everyone wants to program games? Though AI might fit in there somewhere too.
That said, an in-depth look at graphics is outside the scope of this posting. A detailed discussion of how to maximize XNA graphics performance is best left to the XNA team that built it. We just did the runtime!
So on to the CPU… since the Xbox 360 CLR is based on the .Net Compact Framework, much of the existing knowledge sitting around on the internet for NetCF applies just as well to the Xbox. Here are some examples:
.Net Compact Framework version 2.0 Performance and Working Set FAQ:
Desktop perf advice often applies to NetCF as well:
Improving Managed Code Performance (J.D. Meier, Srinath Vasireddy, Ashish Babbar, Rico Mariani, and Alex Mackman):
Taming the CLR: How to Write Real-Time Managed Code (via Rico Mariani’s blog)
Measuring managed code quickly and easily: CodeTimers (via Vance Morrison’s blog)
If you’re new to managed code performance, I highly recommend the “Improving Managed Code Performance” link. It covers a lot of fundamentals in a comprehensible way: efficient class design, boxing, exceptions, collections, etc. Pretty much everything.
Now, that all said, both the Xbox 360 hardware and gaming scenarios are very different from what NetCF has traditionally been optimized for. For example, the NetCF 2.0 JIT compiler for Windows CE does not support hardware floating point; in fact, hardware floating point isn’t even supported by the ARMv4 architecture that current Windows Mobile devices target.
Thankfully, the Xbox 360 version of NetCF does support hardware floating point. That work resulted in over a 10-fold improvement in our micro-benchmarks, and we’re satisfied with floating point perf for now. Nevertheless, it hasn’t been maximized in this release, and that’s an area of focus for the future. For example, the Xbox 360 also supports VMX128 (often referred to as AltiVec), but NetCF doesn’t take advantage of that hardware today.
Expect to get less floating point throughput than you would from identical managed code running on the desktop CLR on a good off-the-shelf PC. How much less depends on your code, so it’s hard to provide a meaningful ratio. In most cases, other performance “gotchas” will probably hit you before floating point becomes a dominating factor, for example excessive virtual function calls or over-zealous object allocation.
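To illustrate the allocation gotcha, here’s a minimal sketch of reusing a preallocated buffer across frames instead of allocating inside the update loop. The class, counts, and workload are made up for illustration:

```csharp
using System;

class AllocationDemo
{
    const int Count = 1024;

    // Allocated once and reused every frame. A per-frame "new float[Count]"
    // inside Update() would create garbage on every tick and eventually
    // trigger collections mid-game.
    static float[] scratch = new float[Count];

    static void Update(float dt)
    {
        for (int i = 0; i < Count; i++)
            scratch[i] += dt; // hypothetical per-frame work, zero allocations
    }

    static void Main()
    {
        for (int frame = 0; frame < 60; frame++)
            Update(1f / 60f);

        // After 60 ticks of 1/60 each, the value is approximately 1.
        Console.WriteLine(Math.Round(scratch[0]));
    }
}
```

The same idea applies to any per-frame temporary: hoist it out of the loop and reuse it, so the collector has nothing to chase during gameplay.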
If you do suspect that floating point is keeping your game from the magic 60fps number, make sure to try out the static methods the XNA math APIs provide that take ref and out parameters, such as Vector4.Add(ref Vector4, ref Vector4, out Vector4). The more convenient binary operators (like "+" in this case) pass their arguments by value. Using ref and out is a useful trade-off that can yield significant perf improvements by avoiding the overhead of passing large structs by value.
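Here’s a sketch of the two calling patterns side by side. The Vector4 below is a stand-in defined locally so the example is self-contained; the real type lives in Microsoft.Xna.Framework:

```csharp
using System;

// Stand-in for XNA's Vector4, just to illustrate by-value vs by-ref calls.
struct Vector4
{
    public float X, Y, Z, W;
    public Vector4(float x, float y, float z, float w) { X = x; Y = y; Z = z; W = w; }

    // Convenient operator form: both 16-byte structs are copied by value.
    public static Vector4 operator +(Vector4 a, Vector4 b)
    {
        return new Vector4(a.X + b.X, a.Y + b.Y, a.Z + b.Z, a.W + b.W);
    }

    // By-ref form: no argument copies, result written straight into 'result'.
    public static void Add(ref Vector4 a, ref Vector4 b, out Vector4 result)
    {
        result.X = a.X + b.X;
        result.Y = a.Y + b.Y;
        result.Z = a.Z + b.Z;
        result.W = a.W + b.W;
    }
}

class Program
{
    static void Main()
    {
        Vector4 a = new Vector4(1, 2, 3, 4);
        Vector4 b = new Vector4(5, 6, 7, 8);

        Vector4 viaOperator = a + b;           // copies a and b on the call
        Vector4 viaRef;
        Vector4.Add(ref a, ref b, out viaRef); // no struct copies

        Console.WriteLine(viaOperator.X + " " + viaRef.W);
    }
}
```

In a tight inner loop doing thousands of vector operations per frame, those avoided copies add up.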
If you’re implementing some sort of mathematically intensive model, perhaps as part of a fancy particle system, make sure to think about what floating point operations can be offloaded to the GPU via shaders, in case you haven’t looked into that already. Shaders get the full throughput of the Xbox GPU that the pro game developers get; you’ll just need to trade in some C# for some HLSL.
Manual inlining is a useful way to improve perf by removing unnecessary function call overhead. NetCF will handle some method inlining on your behalf, but it’s limited to very simple methods like basic getters/setters, so it’s worthwhile for developers to keep inlining in mind too. Methods that get called a lot (e.g. from inside big loops) are good candidates to manually inline, assuming said method is reasonably sized. Check out the “Method Inlining” section of this NetCF post for more context: http://blogs.msdn.com/netcfteam/archive/2005/05/04/414820.aspx. Note that virtual methods never get inlined automatically.
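A quick sketch of what that looks like in practice. The Particle class and virtual method below are hypothetical; the point is that the second loop does the same math with no per-element call:

```csharp
using System;

class InliningDemo
{
    // Virtual methods never get inlined by the NetCF JIT, so a hot loop
    // pays full call overhead on every iteration.
    class Particle
    {
        public float X, Y;
        public virtual float LengthSquared() { return X * X + Y * Y; }
    }

    static void Main()
    {
        Particle[] particles = new Particle[1000];
        for (int i = 0; i < particles.Length; i++)
            particles[i] = new Particle { X = 1, Y = 2 };

        // Calling version: one virtual call per element.
        float total1 = 0;
        for (int i = 0; i < particles.Length; i++)
            total1 += particles[i].LengthSquared();

        // Manually inlined version: same math, no call overhead.
        float total2 = 0;
        for (int i = 0; i < particles.Length; i++)
        {
            Particle p = particles[i];
            total2 += p.X * p.X + p.Y * p.Y;
        }

        Console.WriteLine(total1 + " " + total2);
    }
}
```

As always, only do this where a profiler (or a stopwatch) tells you the call overhead actually matters; manual inlining costs readability.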
The Xbox 360 has 6 hardware threads on 3 cores. Support for multiple hardware threads was also added to System.Threading, so XNA developers can access 4 of the 6 hardware threads (2 are reserved) via Thread.SetProcessorAffinity(). You will have to assign your threads to hardware threads manually, as the Xbox kernel doesn’t do it automatically for you. Note that this API isn’t available on the desktop.
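A minimal sketch of the pattern, guarded so the same file compiles on the desktop. The workload method and the hardware thread index 3 are assumptions for illustration; pick any of the 4 hardware threads available to your game, and note that the affinity call must be made from the thread being pinned:

```csharp
using System;
using System.Threading;

class AffinityDemo
{
    static void DoParticleUpdates() // hypothetical workload
    {
        Console.WriteLine("worker running");
    }

    static void Main()
    {
        Thread worker = new Thread(delegate ()
        {
#if XBOX
            // Xbox-only API: pin this thread to a specific hardware thread.
            // The index here (3) is an assumption for the example.
            Thread.CurrentThread.SetProcessorAffinity(3);
#endif
            DoParticleUpdates();
        });
        worker.Start();
        worker.Join();
    }
}
```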
Of course, the general concepts behind efficient multithreaded coding and synchronization still apply.
Here’s a place to get started on the .Net threading APIs:
This article is more of an editorial and admittedly a more interesting read; it also happens to address multi-CPU and hyperthreaded systems like the Xbox (by Joe Duffy):
Once you’ve got that stuff down, keep in mind that hardware threads are different from cores. Each core runs two hardware threads that share the same cache, so putting two threads that access totally separate data on the same core can lead to unnecessary cache thrashing. XNA GSE doesn’t provide tools to really see what’s going on at the CPU level, so your best bet is to time your code and use some trial and error to determine what works best.
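For that trial-and-error timing, a Stopwatch harness is all you need. The workload, iteration count, and "fits in a frame" threshold below are illustrative; the 16.7ms figure is simply the per-frame budget at 60fps:

```csharp
using System;
using System.Diagnostics;

class TimingDemo
{
    static float[] data = new float[10000];

    static void UpdateOnce()
    {
        for (int i = 0; i < data.Length; i++)
            data[i] = data[i] * 0.99f + 0.01f; // hypothetical workload
    }

    static void Main()
    {
        const int iterations = 100;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            UpdateOnce();
        sw.Stop();

        // Average over many iterations to smooth out timer noise.
        double msPerUpdate = sw.Elapsed.TotalMilliseconds / iterations;

        // At 60fps you have roughly 16.7ms per frame to spend on everything.
        Console.WriteLine(msPerUpdate < 16.7 ? "fits in a frame" : "too slow");
    }
}
```

Run the same harness with your threads pinned to different hardware thread combinations and let the numbers tell you which placement wins.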
To be continued... the next part will cover GC and Memory Management, and Tools.
Updated... Part 2 is here.