Does the JIT take advantage of my CPU?
The short answer is yes. One of the advantages of generating native code at runtime is that we know what processor you are running on, and we can tune the code accordingly. Why would we do that for x86? Every generation of x86 processors has its own personality, which usually shows up in 2 ways:
- New instructions: for example, the SSE and SSE2 instruction sets.
- New ‘moods’: for example, the Pentium 1 wanted the programmer to schedule instructions by hand (in order to fill its 2 execution pipes). Other examples are changes in branch prediction logic, the P4’s trace cache, and even ‘regressions’, such as the P4 preferring ADD REG, 1 over the INC/DEC instructions (which were very frequent in tight loop code).
AMD processors also have their own personality, although in my experience they are much more predictable and ‘well behaved’, and thus need less work.
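To make "we know what processor you are running on" a bit more concrete, here is a minimal sketch in Python of how a set of feature flags can be decoded. It is not how the CLR does it. On x86 the features are reported by the CPUID instruction, and the bit positions below are the ones documented for the EDX register of CPUID leaf 1. Python cannot execute CPUID itself, so the sketch assumes the EDX value has already been obtained; the name decode_features and the example value are made up for illustration.

```python
# Feature bits reported in EDX by CPUID leaf 1 (from the processor manuals).
CPUID_EDX_BITS = {
    "fpu": 0,    # x87 floating point unit
    "cmov": 15,  # conditional move instructions
    "sse": 25,
    "sse2": 26,
}


def decode_features(edx):
    """Return the set of feature names whose bit is set in `edx`."""
    return {name for name, bit in CPUID_EDX_BITS.items() if (edx >> bit) & 1}


# Hypothetical EDX value for a processor with x87, CMOV, SSE and SSE2.
example_edx = (1 << 0) | (1 << 15) | (1 << 25) | (1 << 26)
print(sorted(decode_features(example_edx)))  # ['cmov', 'fpu', 'sse', 'sse2']
```

A code generator would query this once at startup and consult the resulting set whenever it has a choice between instruction sequences.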
Note that we don’t just jump in and implement functionality in the JIT to take advantage of every processor difference. The process usually starts with identifying something that is hurting one of the benchmarks or user scenarios we track. We then evaluate the cost of the fix and its risk (every time we make a processor specific optimization, we make the life of our test team a bit harder, and it becomes easier to introduce bugs that only show up on one processor, which are a bit harder to track down). Once we have all that data, we make a decision.
Here are some of the processor specific optimizations we do in x86. None of these will be a big surprise to developers who do machine level programming; all of them are called out in big fonts in the processor optimization manuals:
- Use of the CMOV instruction when available. It enables conditional moves, which are very useful for branches that are taken in a random (i.e., not predictable by the processor) fashion.
- Use of FCOMIx family of instructions (makes floating point comparisons much cheaper)
- Use of SSE2 for memory copies. Memory copies are fun: you would expect something that simple to always be done the same way, but I’ve witnessed 4 ‘recommended’ ways of doing it during the time I’ve worked with x86: string instructions (REP/MOVS/STOS), floating point registers for the move, scalar instructions (to get better pairing/parallelization) and now SSE2. Note that we don’t use SSE2 for floating point code, for three reasons: we don’t vectorize code (which is the real win with SSE2); SSE2 for scalar floating point is not always a win compared to x87 (the two instruction sets have different latencies for adds and multiplications, so each one beats the other depending on the scenario); and some things, like converting doubles to floats, were really slow on SSE2. So we decided to invest in making our x87 code better (which we were going to have to support anyway).
- Use of SSE2 for floating point to int conversion.
- Other minor instruction selection differences (such as avoiding INC and DEC instructions in hot code for P4, or avoiding store forwarding problems on P4 and Centrino processors).
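Two of the items in the list above (CMOV when available, and avoiding INC/DEC on P4) can be illustrated with a toy instruction selector. This is only a sketch in Python that produces assembly text; the function names, the flag parameters and the label name are hypothetical, and it says nothing about how the real JIT is structured.

```python
def select_increment(reg, avoid_inc_dec):
    """Pick the instruction that adds 1 to `reg`.

    `avoid_inc_dec` would be set for processors, such as the P4,
    that prefer ADD REG, 1 over INC.
    """
    return f"add {reg}, 1" if avoid_inc_dec else f"inc {reg}"


def select_max(dst, src, has_cmov):
    """Emit dst = max(dst, src) for signed integers as a list of instructions."""
    if has_cmov:
        # Branch-free: compare, then move src into dst only if dst < src.
        return [f"cmp {dst}, {src}", f"cmovl {dst}, {src}"]
    # Fallback without CMOV: jump over the move when dst >= src.
    return [f"cmp {dst}, {src}", "jge done", f"mov {dst}, {src}", "done:"]


print(select_increment("eax", avoid_inc_dec=True))   # add eax, 1
print(select_max("eax", "ecx", has_cmov=True))       # ['cmp eax, ecx', 'cmovl eax, ecx']
```

The CMOV version has no conditional jump at all, which is why it helps when the processor cannot predict which way the branch would go.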
We don’t take advantage of other things, such as knowing code cache sizes. One of the reasons is that we don’t want different code on every single machine out there. As usual, there is a trade-off: we might get some extra speed in some situations, but realistically we would be more likely to produce bugs that only repro on machines that meet n conditions. So introducing more processor specific optimizations has to be done carefully.
What about NGEN?
NGEN is an interesting case. In previous versions of the CLR (1.0 and 1.1), we treated NGEN compilation the same way as JIT compilation, i.e., given that we are compiling on the target machine, we should take advantage of it. In Whidbey, however, this is not the case: we assume a PPro instruction set and generate code as if we were targeting a P4. Why is this? There were a couple of reasons:
- Increase predictability of .NET redist assemblies (makes our support life easier).
- OEMs and other big customers wanted a single image per platform (for servicing, building and managing reasons).
We could have added a command line option to generate platform specific code in NGEN, but given that NGEN is mainly there to provide better working set behavior, and that this extra option would complicate some other scenarios, we decided not to go for it.
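The difference between the two modes can be summed up in a small sketch. It is written in Python purely for illustration; Target, NGEN_TARGET and choose_target are hypothetical names, and the CPU names are just strings.

```python
from collections import namedtuple

# What instructions may be used, and which processor the code is tuned for.
Target = namedtuple("Target", ["instruction_set", "tuned_for"])

# The fixed target described above for Whidbey NGEN images.
NGEN_TARGET = Target(instruction_set="PPro", tuned_for="P4")


def choose_target(mode, host_cpu):
    """Return the target the code generator should assume."""
    if mode == "jit":
        # JIT: compile for the machine we are actually running on.
        return Target(instruction_set=host_cpu, tuned_for=host_cpu)
    if mode == "ngen":
        # NGEN: ignore the host so every machine gets the same image.
        return NGEN_TARGET
    raise ValueError(f"unknown mode: {mode}")


print(choose_target("jit", "Pentium M"))
print(choose_target("ngen", "Pentium M") == choose_target("ngen", "Athlon 64"))  # True
```

Because the NGEN branch never looks at host_cpu, the resulting image is the same no matter where it is generated, which is the predictability and single-image property the two reasons above ask for.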