Direct3D 11 supports several floating-point representations. All floating-point computations operate under a defined subset of the IEEE 754 32-bit single precision floating-point rules.
There are two sets of rules: those that conform to IEEE-754, and those that deviate from the standard.
Some of these rules select a single option where IEEE-754 offers choices.
IEEE-754 requires floating-point operations to produce a result that is the nearest representable value to an infinitely-precise result, known as round-to-nearest-even. Direct3D 11 defines the same requirement: 32-bit floating-point operations produce a result that is within 0.5 unit-last-place (ULP) of the infinitely-precise result. This means that, for example, hardware is allowed to truncate results to 32-bit rather than perform round-to-nearest-even, as that would result in error of at most 0.5 ULP. This rule applies only to addition, subtraction, and multiplication.
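As a rough illustration of what an error measured in ULPs means, the following CPU-side sketch measures the error of a 32-bit multiply against a double-precision reference. The helper names (`UlpSize`, `UlpError`, `MulUlpError`) are hypothetical and not part of Direct3D, and the sketch assumes finite inputs whose product neither overflows nor underflows.

```cpp
#include <cmath>
#include <limits>

// Size of one unit-last-place at the magnitude of x (hypothetical helper).
inline double UlpSize(float x) {
    const float ax = std::fabs(x);
    const float next = std::nextafter(ax, std::numeric_limits<float>::infinity());
    return static_cast<double>(next) - static_cast<double>(ax);
}

// Error of a 32-bit result, in ULPs, relative to a higher-precision reference.
inline double UlpError(float result, double reference) {
    return std::fabs(static_cast<double>(result) - reference) / UlpSize(result);
}

// The double product of two floats is exact (24 + 24 significand bits fit
// in the 53 bits of a double), so it can stand in for the infinitely-precise
// result of a 32-bit multiply.
inline double MulUlpError(float a, float b) {
    const float result = a * b;  // rounded to 32-bit
    const double reference = static_cast<double>(a) * static_cast<double>(b);
    return UlpError(result, reference);
}
```

On a CPU that rounds to nearest even, `MulUlpError` should never report more than 0.5; the same measurement could be applied to values read back from a shader.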
There is no support for floating-point exceptions, status bits or traps.
Denorms are flushed to sign-preserved zero on input and output of any floating-point mathematical operation. Exceptions are made for any I/O or data movement operation that doesn't manipulate the data.
States that contain floating-point values, such as Viewport MinDepth/MaxDepth or BorderColor values, may be provided as denorm values and may or may not be flushed before the hardware uses them.
Min or max operations flush denorms for comparison, but the result may or may not be denorm flushed.
NaN input to an operation always produces NaN on output, but the exact bit pattern of the NaN is not required to stay the same (unless the operation is a raw move instruction, which doesn't alter data).
Min or max operations for which only one operand is NaN return the other operand as the result (contrary to ordinary comparison rules, under which an ordered comparison involving NaN is false). This is an IEEE 754R rule.
The arithmetic rules in Direct3D 10 and later don't make any distinctions between "quiet" and "signaling" NaN values (QNaN vs SNaN). All "NaN" values are handled the same way.
If both inputs to min() or max() are NaN, then any NaN is returned.
An IEEE 754R rule is that min(-0,+0) == min(+0,-0) == -0, and max(-0,+0) == max(+0,-0) == +0, which honors the sign. That's in contrast to the comparison rules for signed zero, which ignore the sign. Direct3D 11 recommends the IEEE 754R behavior here, but doesn't enforce it; it's permissible for the result of comparing zeros to be dependent on the order of parameters, using a comparison that ignores the signs.
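The min/max rules for NaN and signed zero can be sketched as a CPU-side emulation. `MinD3D` and `MaxD3D` are hypothetical names; the sketch implements the recommended (not enforced) signed-zero behavior and, for brevity, leaves out the denorm flush that is applied for the comparison.

```cpp
#include <cmath>

// Emulated min: if exactly one operand is NaN, the other operand is returned;
// if both are NaN, a NaN is returned. For zeros of opposite sign this follows
// the recommended IEEE 754R behavior and returns -0.
inline float MinD3D(float a, float b) {
    if (std::isnan(a)) return b;  // also covers the both-NaN case (b is NaN)
    if (std::isnan(b)) return a;
    if (a == 0.0f && b == 0.0f) {
        return std::signbit(a) ? a : b;  // prefer -0
    }
    return (b < a) ? b : a;
}

// Emulated max with the same NaN handling; prefers +0 over -0.
inline float MaxD3D(float a, float b) {
    if (std::isnan(a)) return b;
    if (std::isnan(b)) return a;
    if (a == 0.0f && b == 0.0f) {
        return std::signbit(a) ? b : a;  // prefer +0
    }
    return (a < b) ? b : a;
}
```

Hardware that ignores the sign when comparing zeros may return either zero depending on operand order, so code that depends on the sign of a min or max of zeros should not assume the behavior shown here.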
x*1.0f always results in x (except denorm flushed).
x/1.0f always results in x (except denorm flushed).
x +/- 0.0f always results in x (except denorm flushed). But -0 + 0 = +0.
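These identities can be checked bit for bit. The sketch below runs on the CPU with ordinary IEEE arithmetic, so it only illustrates the expected results rather than exercising GPU hardware; `BitsOf`, `IdentitiesHold`, and `NegZeroPlusZeroIsPositive` are hypothetical helper names, and the inputs are assumed to be neither NaN nor denormal.

```cpp
#include <cstdint>
#include <cstring>

// Raw bit pattern of a float, so that +0 and -0 can be told apart.
inline std::uint32_t BitsOf(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return bits;
}

// True when x*1, x/1, x+0 and x-0 all reproduce x bit for bit.
// Intended for non-NaN, non-denormal inputs (hypothetical helper).
inline bool IdentitiesHold(float x) {
    const std::uint32_t want = BitsOf(x);
    return BitsOf(x * 1.0f) == want && BitsOf(x / 1.0f) == want &&
           BitsOf(x + 0.0f) == want && BitsOf(x - 0.0f) == want;
}

// The signed-zero case called out above: -0 + 0 gives +0.
inline bool NegZeroPlusZeroIsPositive() {
    const float sum = -0.0f + 0.0f;
    return BitsOf(sum) == BitsOf(0.0f);
}
```

Because of the signed-zero case, `IdentitiesHold(-0.0f)` reports false: the `x + 0.0f` step turns -0 into +0.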
Fused operations (such as mad, dp3) produce results that are no less accurate than the worst possible serial ordering of evaluation of the unfused expansion of the operation. The definition of the worst possible ordering, for the purpose of tolerance, is not a fixed definition for a given fused operation; it depends on the particular values of the inputs. The individual steps in the unfused expansion are each allowed 1 ULP tolerance (or for any instructions Direct3D calls out with a more lax tolerance than 1 ULP, the more lax tolerance is allowed).
Fused operations adhere to the same NaN rules as non-fused operations.
sqrt and rcp have 1 ULP tolerance. The shader reciprocal and reciprocal square-root instructions, rcp and rsq, have their own separate relaxed precision requirement.
Multiply and divide each operate at the 32-bit floating-point precision level (accuracy to 0.5 ULP for multiply, 1.0 ULP for reciprocal). If x/y is implemented directly, results must be of greater or equal accuracy than a two-step method.
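To make the comparison between direct division and the two-step method (reciprocal, then multiply) concrete, the following CPU-side sketch measures both against a double-precision reference. The names `UlpErrorOf`, `DivideErrors`, and `CompareDivide` are hypothetical, the double quotient is only an approximation of the infinitely-precise result, and the sketch assumes finite, nonzero operands.

```cpp
#include <cmath>
#include <limits>

// Error of a 32-bit result in ULPs, where one ULP is taken at the magnitude
// of the float nearest to the reference (hypothetical helper).
inline double UlpErrorOf(float result, double reference) {
    const float mag = std::fabs(static_cast<float>(reference));
    const float next = std::nextafter(mag, std::numeric_limits<float>::infinity());
    const double ulp = static_cast<double>(next) - static_cast<double>(mag);
    return std::fabs(static_cast<double>(result) - reference) / ulp;
}

struct DivideErrors {
    double direct;    // error of x / y
    double two_step;  // error of x * (1 / y)
};

inline DivideErrors CompareDivide(float x, float y) {
    const double reference = static_cast<double>(x) / static_cast<double>(y);
    const float direct = x / y;      // single correctly rounded divide
    const float rcp = 1.0f / y;      // step 1: reciprocal, rounded to 32-bit
    const float two_step = x * rcp;  // step 2: multiply, rounded again
    return {UlpErrorOf(direct, reference), UlpErrorOf(two_step, reference)};
}
```

The two-step result is rounded twice, so its error can exceed that of the direct divide; the rule above requires a direct implementation to be at least as accurate as that two-step result.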
Hardware and display drivers optionally support double-precision floating-point. To indicate support, when you call ID3D11Device::CheckFeatureSupport with D3D11_FEATURE_DOUBLES, the driver sets DoublePrecisionFloatShaderOps of D3D11_FEATURE_DATA_DOUBLES to TRUE. The driver and hardware must then support all double-precision floating-point instructions.
Double-precision instructions follow IEEE 754R behavior requirements.
Support for generation of denormalized values is required for double-precision data (no flush-to-zero behavior). Likewise, instructions don't read denormalized data as a signed zero; they honor the denorm value.
Direct3D 11 also supports 16-bit representations of floating-point numbers.
Format:
A float16 value (v) follows these rules:
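As a sketch of how such a value can be decoded, the function below assumes the standard IEEE 754 half-precision layout: 1 sign bit (s), 5 biased exponent bits (e, bias 15), and 10 fraction bits (f). That layout and the name `DecodeFloat16` are assumptions made for this example rather than text taken from this page.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode a 16-bit pattern, assuming 1 sign bit, 5 exponent bits (bias 15)
// and 10 fraction bits (hypothetical helper).
inline float DecodeFloat16(std::uint16_t h) {
    const int s = (h >> 15) & 0x1;
    const int e = (h >> 10) & 0x1F;
    const int f = h & 0x3FF;
    const float sign = s ? -1.0f : 1.0f;
    if (e == 31) {
        if (f != 0) return std::numeric_limits<float>::quiet_NaN();  // NaN
        return sign * std::numeric_limits<float>::infinity();        // signed infinity
    }
    if (e == 0) {
        // Signed zero or denorm: (-1)^s * 2^-14 * (0.f) == f * 2^-24
        return sign * std::ldexp(static_cast<float>(f), -24);
    }
    // Normalized: (-1)^s * 2^(e-15) * (1.f) == (1024 + f) * 2^(e-25)
    return sign * std::ldexp(static_cast<float>(1024 + f), e - 25);
}
```

Every float16 value, including denorms, is exactly representable as a 32-bit float, so this decode is lossless.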
32-bit floating-point rules also hold for 16-bit floating-point numbers, adjusted for the bit layout described earlier. Exceptions to this include:
Direct3D 11 also supports 11-bit and 10-bit floating-point formats.
Format:
A float11/float10 value (v) follows these rules:
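A decode for these small formats can be sketched in the same spirit. The sketch assumes the layout used by packed formats such as DXGI_FORMAT_R11G11B10_FLOAT: no sign bit, 5 biased exponent bits (bias 15), and 6 fraction bits for float11 or 5 for float10. The function names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode an unsigned small float with a 5-bit exponent (bias 15) and the
// given number of fraction bits (hypothetical helper; layout is an assumption).
inline float DecodeSmallFloat(std::uint32_t bits, int fraction_bits) {
    const int hidden = 1 << fraction_bits;  // value of the implicit leading 1
    const int e = static_cast<int>(bits >> fraction_bits) & 0x1F;
    const int f = static_cast<int>(bits) & (hidden - 1);
    if (e == 31) {
        return f != 0 ? std::numeric_limits<float>::quiet_NaN()
                      : std::numeric_limits<float>::infinity();
    }
    if (e == 0) {
        // Zero or denorm: 2^-14 * (0.f)
        return std::ldexp(static_cast<float>(f), -14 - fraction_bits);
    }
    // Normalized: 2^(e-15) * (1.f)
    return std::ldexp(static_cast<float>(hidden + f), e - 15 - fraction_bits);
}

inline float DecodeFloat11(std::uint32_t bits) { return DecodeSmallFloat(bits & 0x7FFu, 6); }
inline float DecodeFloat10(std::uint32_t bits) { return DecodeSmallFloat(bits & 0x3FFu, 5); }
```

With no sign bit, neither format can represent a negative value, so the decoded result is always non-negative or NaN.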
32-bit floating-point rules also hold for 11-bit and 10-bit floating-point numbers, adjusted for the bit layout described earlier. Exceptions include: