Code Optimization with the DirectXMath Library

This topic describes optimization considerations and strategies with the DirectXMath Library.

Use accessors sparingly

Vector-based operations use the SIMD instruction sets, which operate on dedicated registers. Accessing an individual component requires moving data from the SIMD registers to the scalar registers and back again.

When possible, it is more efficient to initialize all of the components of an XMVECTOR at one time, instead of using a series of individual vector accessors.

Use correct compilation settings

For Windows x86 targets, enable /arch:SSE2. For all Windows targets, enable /fp:fast.

By default, compilation against the DirectXMath Library for Windows x86 targets is done with _XM_SSE_INTRINSICS_ defined. This means that all DirectXMath functionality will make use of SSE2 instructions. However, the same is not true for other code.

Code outside of DirectXMath is handled using compiler defaults. Without the /arch:SSE2 switch, the generated code may often use the less efficient x87 instructions.

We highly recommend that you always use the latest available version of the compiler.

Use Est functions when appropriate

Many functions have an equivalent estimation function ending in Est. These functions trade some accuracy for improved performance. Est functions are appropriate for non-critical calculations where accuracy can be sacrificed for speed. The exact loss of accuracy and gain in speed are platform-dependent.

For example, the XMVector3AngleBetweenNormalsEst function could be used in place of the XMVector3AngleBetweenNormals function.

Use Aligned Data Types and Operations

The SIMD instruction sets on versions of Windows supporting SSE2 typically have aligned and unaligned versions of memory operations. The aligned operations are faster and should be preferred wherever possible.

The DirectXMath Library provides access to aligned and unaligned functionality through variant vector types, structures, and functions. The aligned variants are indicated by an "A" at the end of the name.

For example, there is an unaligned XMFLOAT4X4 structure and an aligned XMFLOAT4X4A structure, which are used by the XMStoreFloat4x4 and XMStoreFloat4x4A functions respectively.

Properly Align Allocations

The aligned versions of the SSE intrinsics underlying the DirectXMath Library are faster than the unaligned versions.

For this reason, DirectXMath operations using XMVECTOR and XMMATRIX objects assume those objects are 16-byte aligned. This is automatic for stack-based allocations if code is compiled against the DirectXMath Library using the recommended Windows compiler settings (see Use correct compilation settings). However, it is important to ensure that heap allocations containing XMVECTOR and XMMATRIX objects, or casts to these types, meet these alignment requirements.

While 64-bit Windows memory allocations are 16-byte aligned, memory allocated on 32-bit versions of Windows is only 8-byte aligned by default. For information on controlling memory alignment, see _aligned_malloc.

When using aligned DirectXMath types with the Standard Template Library (STL), you will need to provide a custom allocator that ensures the 16-byte alignment. See the Visual C++ Team blog for an example of writing a custom allocator (instead of malloc/free you'll want to use _aligned_malloc and _aligned_free in your implementation).
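The following is a minimal sketch of such an allocator, not the one from the blog post. AlignedAllocator, Float4A, and IsAligned16 are hypothetical names; Float4A is a stand-in for a 16-byte aligned DirectXMath type so that the example has no library dependency. As an assumption for portability, the sketch falls back to std::aligned_alloc (C++17) on non-Windows platforms.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>
#include <vector>
#if defined(_WIN32)
#include <malloc.h>   // _aligned_malloc, _aligned_free
#endif

// Hypothetical minimal allocator returning storage aligned to Alignment bytes.
template <typename T, std::size_t Alignment = 16>
struct AlignedAllocator
{
    using value_type = T;

    // Explicit rebind is needed because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n)
    {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        std::size_t bytes = n * sizeof(T);
        if (bytes == 0) bytes = Alignment;
#if defined(_WIN32)
        void* p = _aligned_malloc(bytes, Alignment);
#else
        // std::aligned_alloc requires a size that is a multiple of the alignment.
        bytes = (bytes + Alignment - 1) / Alignment * Alignment;
        void* p = std::aligned_alloc(Alignment, bytes);
#endif
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept
    {
#if defined(_WIN32)
        _aligned_free(p);
#else
        std::free(p);
#endif
    }
};

// All instances are interchangeable because the allocator holds no state.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Stand-in for a 16-byte aligned DirectXMath type such as XMFLOAT4A.
struct alignas(16) Float4A { float x, y, z, w; };

// Returns true when the address is a multiple of 16.
inline bool IsAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

A container would then be declared as, for example, std::vector<Float4A, AlignedAllocator<Float4A>>, and its data() pointer can be checked with IsAligned16.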


Some STL templates modify the provided type's alignment. For example, make_shared<> adds some internal tracking information which may or may not respect the alignment of the provided user type, resulting in unaligned data members. In this case, you need to use unaligned types instead of aligned types. Deriving from existing classes, including many Windows Runtime objects, can also change the alignment of a class or structure.


Avoid Operator Overloads When Possible

As a convenience feature, a number of types such as XMVECTOR and XMMATRIX have operator overloads for common arithmetic operations. Such operator overloads tend to create numerous temporary objects. We recommend that you avoid these operator overloads in performance sensitive code.


Denormals

To support computations close to 0, the IEEE 754 floating-point standard includes support for gradual underflow. Gradual underflow is implemented through the use of denormalized values, and many hardware implementations are slow when handling denormals. An optimization to consider is to disable the handling of denormals for the vector operations used by DirectXMath.

Changing the handling of denormals is done by using the _controlfp_s routine on a per-thread basis, and can result in performance improvements. Use this code to change the handling of denormals:

    #include <float.h>

    unsigned int control_word;
    // Flush denormal values to zero for the calling thread.
    _controlfp_s( &control_word, _DN_FLUSH, _MCW_DN );


On 64-bit versions of Windows, SSE instructions are used for all computations, not just the vector operations. Changing the denormal handling affects all floating-point computations in your program, not just the vector operations used by DirectXMath.


Take Advantage of the Integer Floating Point Duality

DirectXMath supports vectors of four single-precision floating-point values or four 32-bit (signed or unsigned) integer values.

Because the instruction sets used to implement the DirectXMath Library can treat the same data as multiple different types (for example, treat the same vector as floating-point and integer data), certain optimizations can be achieved. You can get these optimizations by using the integer vector initialization routines and bit-wise operators to manipulate floating-point values.

The binary format of single-precision floating-point numbers used by the DirectXMath Library completely conforms to the IEEE 754 standard:

     Sign     Exponent   Mantissa
     1 bit    8 bits     23 bits

When working with IEEE 754 single-precision floating-point numbers, keep in mind that some representations have special meaning (that is, they do not conform to the preceding description). Examples include:

  • Positive zero is 0
  • Negative zero is 0x80000000
  • Q_NAN is 0x7FC00000
  • +INF is 0x7F800000
  • -INF is 0xFF800000
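The bit layout and the special values above can be explored with a small scalar sketch that uses only the C++ standard library. All function names here are hypothetical. DirectXMath applies the same idea to all four components at once through functions such as XMVectorAndInt and XMVectorXorInt.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's storage as a 32-bit unsigned integer.
inline std::uint32_t FloatBits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Reinterpret a 32-bit pattern as a float.
inline float BitsToFloat(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Clearing the sign bit (bit 31) yields the absolute value.
inline float AbsViaMask(float f)
{
    return BitsToFloat(FloatBits(f) & 0x7FFFFFFFu);
}

// Toggling the sign bit negates the value.
inline float NegateViaXor(float f)
{
    return BitsToFloat(FloatBits(f) ^ 0x80000000u);
}
```

For instance, toggling the sign bit of positive zero produces the negative zero pattern 0x80000000 listed above, with no floating-point arithmetic involved.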

Prefer Template Forms

Template forms exist for XMVectorSwizzle, XMVectorPermute, XMVectorInsert, XMVectorShiftLeft, XMVectorRotateLeft, and XMVectorRotateRight. Using these instead of the general function form allows the compiler to create much more efficient implementations. For SSE, this often collapses down to one or two _mm_shuffle_ps operations. For ARM-NEON, the XMVectorSwizzle template can utilize a number of special cases rather than the more general VTBL swizzle/permute.

Using DirectXMath with Direct3D

A common use for DirectXMath is to perform graphics computations for use with Direct3D. With Direct3D 10.x and Direct3D 11.x, you can use the DirectXMath library in these direct ways:
