Library Internals

Article
09/21/2021

This topic describes the internal design of the DirectXMath library.

Calling Conventions
Graphics Library Type Equivalence
Global Constants in the DirectXMath Library
Windows SSE versus SSE2
Routine Variants
Platform Inconsistencies
Platform-specific Extensions
Related topics

Calling Conventions

To enhance portability and optimize data layout, you need to use the appropriate calling conventions for each platform supported by the DirectXMath Library. Specifically, when you pass XMVECTOR objects as parameters, which are defined as aligned on a 16-byte boundary, there are different sets of calling requirements, depending on the target platform:

For 32-bit Windows

For 32-bit Windows, there are two calling conventions available for efficient passing of __m128 values (which implements XMVECTOR on that platform). The standard is __fastcall, which can pass the first three __m128 values (XMVECTOR instances) as arguments to a function in a SSE/SSE2 register. __fastcall passes remaining arguments via the stack.

Newer Microsoft Visual Studio compilers support a new calling convention, __vectorcall, which can pass up to six __m128 values (XMVECTOR instances) as arguments to a function in a SSE/SSE2 register. It can also pass heterogeneous vector aggregates (also known as XMMATRIX) via SSE/SSE2 registers if there is sufficient room.

For 64-bit editions of Windows

For 64-bit Windows, there are two calling conventions available for efficient passing of __m128 values. The standard is __fastcall, which passes all __m128 values on the stack.

Newer Visual Studio compilers support the __vectorcall calling convention, which can pass up to six __m128 values (XMVECTOR instances) as arguments to a function in a SSE/SSE2 register. It can also pass heterogeneous vector aggregates (also known as XMMATRIX) via SSE/SSE2 registers if there is sufficient room.

For Windows on ARM

The Windows on ARM & ARM64 supports passing the first four __n128 values (XMVECTOR instances) in-register.

DirectXMath solution

The FXMVECTOR, GXMVECTOR, HXMVECTOR, and CXMVECTOR aliases support these conventions:

Use the FXMVECTOR alias to pass up to the first three instances of XMVECTOR used as arguments to a function.
Use the GXMVECTOR alias to pass the 4th instance of an XMVECTOR used as an argument to a function.
Use the HXMVECTOR alias to pass the 5th and 6th instances of an XMVECTOR used as an argument to a function. For info about additional considerations, see the __vectorcall documentation.
Use the CXMVECTOR alias to pass any further instances of XMVECTOR used as arguments.

Note

For output parameters, always use XMVECTOR* or XMVECTOR& and ignore them with respect to the preceding rules for input parameters.

Because of limitations with __vectorcall, we recommend that you not use GXMVECTOR or HXMVECTOR for C++ constructors. Just use FXMVECTOR for the first three XMVECTOR values, then use CXMVECTOR for the rest.

The FXMMATRIX and CXMMATRIX aliases help support taking advantage of the HVA argument passing with __vectorcall.

Use the FXMMATRIX alias to pass the first XMMATRIX as an argument to the function. This assumes you don't have more than two FXMVECTOR arguments or more than two float, double, or FXMVECTOR arguments to the ‘right’ of the matrix. For info about additional considerations, see the __vectorcall documentation.
Use the CXMMATRIX alias otherwise.

Because of limitations with __vectorcall, we recommend that you never use FXMMATRIX for C++ constructors. Just use CXMMATRIX.

In addition to the type aliases, you must also use the XM_CALLCONV annotation to make sure the function uses the appropriate calling convention (__fastcall versus __vectorcall) based on your compiler and architecture. Because of limitations with __vectorcall, we recommend that you not use XM_CALLCONV for C++ constructors.

The following are example declarations that illustrate this convention:

XMMATRIX XM_CALLCONV XMMatrixLookAtLH(FXMVECTOR EyePosition, FXMVECTOR FocusPosition, FXMVECTOR UpDirection);

XMMATRIX XM_CALLCONV XMMatrixTransformation2D(FXMVECTOR ScalingOrigin,  float ScalingOrientation, FXMVECTOR Scaling, FXMVECTOR RotationOrigin, float Rotation, GXMVECTOR Translation);

void XM_CALLCONV XMVectorSinCos(XMVECTOR* pSin, XMVECTOR* pCos, FXMVECTOR V);

XMVECTOR XM_CALLCONV XMVectorHermiteV(FXMVECTOR Position0, FXMVECTOR Tangent0, FXMVECTOR Position1, GXMVECTOR Tangent1, HXMVECTOR T);

XMMATRIX(FXMVECTOR R0, FXMVECTOR R1, FXMVECTOR R2, CXMVECTOR R3)

XMVECTOR XM_CALLCONV XMVector2Transform(FXMVECTOR V, FXMMATRIX M);

XMMATRIX XM_CALLCONV XMMatrixMultiplyTranspose(FXMMATRIX M1, CXMMATRIX M2);

To support these calling conventions, these type aliases are defined as follows (parameters must be passed by value for the compiler to consider them for in-register passing):

For 32-bit Windows apps

When you use __fastcall:

typedef const XMVECTOR  FXMVECTOR;
typedef const XMVECTOR& GXMVECTOR;
typedef const XMVECTOR& HXMVECTOR;
typedef const XMVECTOR& CXMVECTOR;
typedef const XMMATRIX& FXMMATRIX;
typedef const XMMATRIX& CXMMATRIX;

When you use __vectorcall:

typedef const XMVECTOR  FXMVECTOR;
typedef const XMVECTOR  GXMVECTOR;
typedef const XMVECTOR  HXMVECTOR;
typedef const XMVECTOR& CXMVECTOR;
typedef const XMMATRIX  FXMMATRIX;
typedef const XMMATRIX& CXMMATRIX;

For 64-bit native Windows apps

When you use __fastcall:

typedef const XMVECTOR& FXMVECTOR;
typedef const XMVECTOR& GXMVECTOR;
typedef const XMVECTOR& HXMVECTOR;
typedef const XMVECTOR& CXMVECTOR;
typedef const XMMATRIX& FXMMATRIX;
typedef const XMMATRIX& CXMMATRIX;

When you use __vectorcall:

typedef const XMVECTOR  FXMVECTOR;
typedef const XMVECTOR  GXMVECTOR;
typedef const XMVECTOR  HXMVECTOR;
typedef const XMVECTOR& CXMVECTOR;
typedef const XMMATRIX  FXMMATRIX;
typedef const XMMATRIX& CXMMATRIX;

Windows on ARM

typedef const XMVECTOR  FXMVECTOR;
typedef const XMVECTOR  GXMVECTOR;
typedef const XMVECTOR& CXMVECTOR;
typedef const XMMATRIX& FXMMATRIX;
typedef const XMMATRIX& CXMMATRIX;

Note

While all the functions are declared inline and in many cases the compiler won't need to use calling conventions for these functions, there are cases where the compiler may decide it's more efficient to not inline the function and in these cases we want the best calling convention possible for each platform.

Graphics Library Type Equivalence

To support the use of the DirectXMath Library, many DirectXMath Library types and structures are equivalent to the Windows implementations of the D3DDECLTYPE and D3DFORMAT types, as well as the DXGI_FORMAT types.

DirectXMath	D3DDECLTYPE	D3DFORMAT	DXGI_FORMAT
XMBYTE2			DXGI_FORMAT_R8G8_SINT
XMBYTE4	D3DDECLTYPE_BYTE4 (Xbox Only)	D3DFMT_x8x8x8x8	DXGI_FORMAT_x8x8x8x8_SINT
XMBYTEN2		D3DFMT_V8U8	DXGI_FORMAT_R8G8_SNORM
XMBYTEN4	D3DDECLTYPE_BYTE4N (Xbox Only)	D3DFMT_x8x8x8x8	DXGI_FORMAT_x8x8x8x8_SNORM
XMCOLOR	D3DDECLTYPE_D3DCOLOR	D3DFMT_A8R8G8B8	DXGI_FORMAT_B8G8R8A8_UNORM (DXGI 1.1+)
XMDEC4	D3DDECLTYPE_DEC4 (Xbox Only)	D3DDECLTYPE_DEC3 (Xbox Only)
XMDECN4	D3DDECLTYPE_DEC4N (Xbox Only)	D3DDECLTYPE_DEC3N (Xbox Only)
XMFLOAT2	D3DDECLTYPE_FLOAT2	D3DFMT_G32R32F	DXGI_FORMAT_R32G32_FLOAT
XMFLOAT2A	D3DDECLTYPE_FLOAT2	D3DFMT_G32R32F	DXGI_FORMAT_R32G32_FLOAT
XMFLOAT3	D3DDECLTYPE_FLOAT3		DXGI_FORMAT_R32G32B32_FLOAT
XMFLOAT3A	D3DDECLTYPE_FLOAT3		DXGI_FORMAT_R32G32B32_FLOAT
XMFLOAT3PK			DXGI_FORMAT_R11G11B10_FLOAT
XMFLOAT3SE			DXGI_FORMAT_R9G9B9E5_SHAREDEXP
XMFLOAT4	D3DDECLTYPE_FLOAT4	D3DFMT_A32B32G32R32F	DXGI_FORMAT_R32G32B32A32_FLOAT
XMFLOAT4A	D3DDECLTYPE_FLOAT4	D3DFMT_A32B32G32R32F	DXGI_FORMAT_R32G32B32A32_FLOAT
XMHALF2	D3DDECLTYPE_FLOAT16_2	D3DFMT_G16R16F	DXGI_FORMAT_R16G16_FLOAT
XMHALF4	D3DDECLTYPE_FLOAT16_4	D3DFMT_A16B16G16R16F	DXGI_FORMAT_R16G16B16A16_FLOAT
XMINT2			DXGI_FORMAT_R32G32_SINT
XMINT3			DXGI_FORMAT_R32G32B32_SINT
XMINT4			DXGI_FORMAT_R32G32B32A32_SINT
XMSHORT2	D3DDECLTYPE_SHORT2	D3DFMT_V16U16	DXGI_FORMAT_R16G16_SINT
XMSHORTN2	D3DDECLTYPE_SHORT2N	D3DFMT_V16U16	DXGI_FORMAT_R16G16_SNORM
XMSHORT4	D3DDECLTYPE_SHORT4	D3DFMT_x16x16x16x16	DXGI_FORMAT_R16G16B16A16_SINT
XMSHORTN4	D3DDECLTYPE_SHORT4N	D3DFMT_x16x16x16x16	DXGI_FORMAT_R16G16B16A16_SNORM
XMUBYTE2			DXGI_FORMAT_R8G8_UINT
XMUBYTEN2		D3DFMT_A8P8, D3DFMT_A8L8	DXGI_FORMAT_R8G8_UNORM
XMUINT2			DXGI_FORMAT_R32G32_UINT
XMUINT3			DXGI_FORMAT_R32G32B32_UINT
XMUINT4			DXGI_FORMAT_R32G32B32A32_UINT
XMU555		D3DFMT_X1R5G5B5, D3DFMT_A1R5G5B5	DXGI_FORMAT_B5G5R5A1_UNORM
XMU565		D3DFMT_R5G6B5	DXGI_FORMAT_B5G6R5_UNORM
XMUBYTE4	D3DDECLTYPE_UBYTE4	D3DFMT_x8x8x8x8	DXGI_FORMAT_x8x8x8x8_UINT
XMUBYTEN4	D3DDECLTYPE_UBYTE4N	D3DFMT_x8x8x8x8	DXGI_FORMAT_x8x8x8x8_UNORM DXGI_FORMAT_R10G10B10_XR_BIAS_A2_UNORM (Use XMLoadUDecN4_XR and XMStoreUDecN4_XR.)
XMUDEC4	D3DDECLTYPE_UDEC4 (Xbox Only) D3DDECLTYPE_UDEC3 (Xbox Only)	D3DFMT_A2R10G10B10 D3DFMT_A2B10G10R10	DXGI_FORMAT_R10G10B10A2_UINT
XMUDECN4	D3DDECLTYPE_UDEC4N (Xbox Only) D3DDECLTYPE_UDEC3N (Xbox Only)	D3DFMT_A2R10G10B10 D3DFMT_A2B10G10R10	DXGI_FORMAT_R10G10B10A2_UNORM
XMUNIBBLE4		D3DFMT_A4R4G4B4, D3DFMT_X4R4G4B4	DXGI_FORMAT_B4G4R4A4_UNORM (DXGI 1.2+)
XMUSHORT2	D3DDECLTYPE_USHORT2	D3DFMT_G16R16	DXGI_FORMAT_R16G16_UINT
XMUSHORTN2	D3DDECLTYPE_USHORT2N	D3DFMT_G16R16	DXGI_FORMAT_R16G16_UNORM
XMUSHORT4	D3DDECLTYPE_USHORT4 (Xbox Only)	D3DFMT_x16x16x16x16	DXGI_FORMAT_R16G16B16A16_UINT
XMUSHORTN4	D3DDECLTYPE_USHORT4N	D3DFMT_x16x16x16x16	DXGI_FORMAT_R16G16B16A16_UNORM

Global Constants in the DirectXMath Library

To reduce the size of the data segment, the DirectXMath Library uses the XMGLOBALCONST macro to make use of a number of global internal constants in its implementation. By convention, such internal global constants are prefixed by g_XM. Typically, they are one of the following types: XMVECTORU32, XMVECTORF32, or XMVECTORI32.

These internal global constants are subject to change in future revisions of the DirectXMath Library. Use public functions that encapsulate the constants when possible rather than direct use of g_XM global values. You can also declare your own global constants using XMGLOBALCONST.

Windows SSE versus SSE2

The SSE instruction set provides support only for single-precision floating-point vectors. DirectXMath must make use of the SSE2 instruction set to provide integer vector support. SSE2 is supported by all Intel processors since the introduction of the Pentium 4, all AMD K8 and later processors, and all x64-capable processors.

Note

Windows 8 for x86 or later requires support for SSE2. All versions of Windows x64 require support for SSE2. Windows on ARM / ARM64 requires ARM_NEON.

Routine Variants

There are several variants of DirectXMath functions that make it easier to do your work:

Comparison functions to create complicated conditional branching based on a smaller number of vector comparison operations. The name of these functions end in "R" such as XMVector3InBoundsR. The functions return a comparison record as a UINT return value, or as a UINT out parameter. You can use the XMComparision* macros to test the value.
Batch functions for performing batch-style operations on larger vector arrays. The name of these functions end in "Stream" such as XMVector3TransformStream. The functions operate on an array of inputs, and they generate an array of outputs. Typically, they take an input and output stride.
Estimation functions that implement a faster estimation instead of a slower, more accurate result. The name of these functions end in "Est" such as XMVector3NormalizeEst. The quality and performance impact of using estimation varies from platform to platform, but we recommend that you use estimation variants for performance-sensitive code.

Platform Inconsistencies

The DirectXMath library is intended for use in performance-sensitive graphics applications and games. Therefore, the implementation is designed for optimal speed doing normal processing on all supported platforms. Results at boundary-conditions, particularly those that generate floating-point specials, are likely to vary from target to target. This behavior will also depend on other run-time settings, such as the x87 control word for the Windows 32-bit no-intrinsics target or the SSE control word for both Windows 32-bit and 64-bit. Furthermore, there will be differences in boundary-conditions between various CPU vendors.

Don't use DirectXMath in scientific or other applications where numerical accuracy is paramount. Also, this limitation is reflected in the lack of support for double or other extended precision computations.

Note

The _XM_NO_INTRINSICS_ scalar code paths generally are written for compliance, not performance. Their boundary-condition results also will vary.

Platform-specific Extensions

The DirectXMath library is intended to simplify C++ SIMD programming providing excellent support for x86, x64, and Windows RT platforms using broadly supported intrinsics instructions (SSE2 and ARM-NEON).

There are times, however, when platform-specific instructions may prove beneficial. Due to the way DirectXMath is implemented, in many cases it is trivial to use DirectXMath types directly in standard compiler-supported intrinsics statements, and to use DirectXMath as the fallback path for platforms that don't support the extended instruction.

For example, here is a simplified example of leveraging the SSE 4.1 dot-product instruction. Note that you must explicitly guard the code-path to avoid generating invalid instruction exceptions at run time. Ensure the code paths do significant enough work to justify the additional cost of branching, complexity of maintaining multiple code-paths, and so on.

#include <Windows.h>
#include <stdio.h>

#include <DirectXMath.h>

#include <intrin.h>
#include <smmintrin.h>

using namespace DirectX;

bool g_bSSE41 = false;

void DetectCPUFeatures()
{
#ifndef _M_ARM
   // See __cpuid documentation on MSDN for more information

   int CPUInfo[4] = {-1};
#if defined(__clang__) || defined(__GNUC__)
   __cpuid(0, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
   __cpuid(CPUInfo, 0);
#endif

   if ( CPUInfo[0] >= 1 )
   {
#if defined(__clang__) || defined(__GNUC__)
        __cpuid(1, CPUInfo[0], CPUInfo[1], CPUInfo[2], CPUInfo[3]);
#else
        __cpuid(CPUInfo, 1);
#endif

       if ( CPUInfo[2] & 0x80000 )
           g_bSSE41 = true;
   }
#endif
}

int main()
{
   if ( !XMVerifyCPUSupport() )
       return -1;

   DetectCPUFeatures();

   ...

   XMVECTORF32 v1 = { 1.f, 2.f, 3.f, 4.f };
   XMVECTORF32 v2 = { 5.f, 6.f, 7.f, 8.f };

   XMVECTOR r2, r3, r4;

   if ( g_bSSE41 )
   {
#ifndef _M_ARM
       r2 = _mm_dp_ps( v1, v2, 0x3f );
       r3 = _mm_dp_ps( v1, v2, 0x7f );
       r4 = _mm_dp_ps( v1, v2, 0xff );
#endif
   }
   else
   {
       r2 = XMVector2Dot( v1, v2 );
       r3 = XMVector3Dot( v1, v2 );
       r4 = XMVector4Dot( v1, v2 );
   }

   ...

   return 0;
}

For more info about platform-specific extensions, see:

DirectXMath: SSE, SSE2, and ARM-NEON
DirectXMath: SSE3 and SSSE3
DirectXMath: SSE4.1 and SSE4.2
DirectXMath: AVX
DirectXMath: F16C and FMA
DirectXMath: AVX2
DirectXMath: ARM64

DirectXMath Programming Guide

Library Internals

Calling Conventions

Graphics Library Type Equivalence

Global Constants in the DirectXMath Library

Windows SSE versus SSE2

Routine Variants

Platform Inconsistencies

Platform-specific Extensions

Feedback

Feedback

Additional resources

Library Internals

Calling Conventions

Graphics Library Type Equivalence

Global Constants in the DirectXMath Library

Windows SSE versus SSE2

Routine Variants

Platform Inconsistencies

Platform-specific Extensions

Related topics

Feedback

Feedback

Additional resources