For earlier shader models, HLSL programming exposes only a single thread of execution. New wave-level operations are provided, starting with model 6.0, to explicitly take advantage of the parallelism of current GPUs - many threads can be executing in lockstep on the same core simultaneously. For example, the model 6.0 intrinsics enable the elimination of barrier constructs when the scope of synchronization is within the width of the SIMD processor, or some other set of threads that are known to be atomic relative to each other.
Potential use cases include: stream compaction, reductions, block transpose, bitonic sort or Fast Fourier Transforms (FFT), binning, stream de-duplication, and similar scenarios.
Most of the intrinsics appear in pixel shaders and compute shaders, though there are some exceptions (noted for each function). The functions have been added to the requirements for DirectX Feature Level 12.0, under API level 12.
The <type> parameter and return value for these functions implies the type of the expression, the supported types are those from the following list that are also present in the target shader model for your app:
half, half2, half3, half4
float, float2, float3, float4
double, double2, double3, double4
int, int2, int3, int4
uint, uint2, uint3, uint4
short, short2, short3, short4
ushort, ushort2, ushort3, ushort4
uint64_t, uint64_t2, uint64_t3, uint64_t4
Some operations (such as the bitwise operators) only support the integer types.
Terminology
Term
Definition
Lane
A single thread of execution. The shader models before version 6.0 expose only one of these at the language level, leaving expansion to parallel SIMD processing entirely up to the implementation.
Wave
A set of lanes (threads) executed simultaneously in the processor. No explicit barriers are required to guarantee that they execute in parallel. Similar concepts include "warp" and "wavefront."
Inactive Lane
A lane which is not being executed, for example due to the flow of control, or insufficient work to fill the minimum size of the wave.
Active Lane
A lane for which execution is being performed. In pixel shaders, it may include any helper pixel lanes.
Quad
A set of 4 adjacent lanes corresponding to pixels arranged in a 2x2 square. They are used to estimate gradients by differencing in either x or y. A wave may be comprised of multiple quads. All pixels in an active quad are executed (and may be "Active Lanes"), but those that do not produce visible results are termed "Helper Lanes".
Helper Lane
A lane which is executed solely for the purpose of gradients in pixel shader quads. The output of such a lane will be discarded, and so not render to the destination surface.
Shading language intrinsics
All the operations of this shader model have been added in a range of intrinsic functions.
Returns a 64-bit unsigned integer bitmask of the evaluation of the Boolean expression for all active lanes in the specified wave.
*
*
Wave Broadcast
These intrinsics enable all active lanes in the current wave to receive the value from the specified lane, effectively broadcasting it. The return value from an invalid lane is undefined.
Returns the value of the expression for the active lane of the current wave with the smallest index.
*
*
Wave Reduction
These intrinsics compute the specified operation across all active lanes in the wave and broadcast the final result to all active lanes. Therefore, the final output is guaranteed uniform across the wave.
Returns the bitwise AND of all the values of the expression across all active lanes in the current wave, and replicates the result to all lanes in the wave.
Returns the bitwise OR of all the values of the expression across all active lanes in the current wave, and replicates the result to all lanes in the wave.
Returns the bitwise Exclusive OR of all the values of the expression across all active lanes in the current wave, and replicates the result to all lanes in the wave.
Counts the number of boolean variables which evaluate to true across all active lanes in the current wave, and replicates the result to all lanes in the wave.
Sums up the value of the expression across all active lanes in the current wave and replicates it to all lanes in the current wave, and replicates the result to all lanes in the wave.
*
*
Wave Scan and Prefix
These intrinsics apply the operation to each lane and leave each partial result of the computation in the corresponding lane.
Returns the product of all of the values in the lanes before this one of the specified wave.
*
*
Quad-wide Shuffle operations
These intrinsics perform swap operations on the values across a wave known to contain pixel shader quads as defined here. The indices of the pixels in the quad are defined in scan-line or reading order - where the coordinates within a quad are:
+---------> X
| [0] [1]
| [2] [3]
v
Y
These routines work in either compute shaders or pixel shaders. In compute shaders they operate in quads defined as evenly divided groups of 4 within an SIMD wave. In pixel shaders they should be used on waves captured by WaveQuadLanes, otherwise results are undefined.
Azure HPC is a purpose-built cloud capability for HPC & AI workload, using leading-edge processors and HPC-class InfiniBand interconnect, to deliver the best application performance, scalability, and value. Azure HPC enables users to unlock innovation, productivity, and business agility, through a highly available range of HPC & AI technologies that can be dynamically allocated as your business and technical needs change. This learning path is a series of modules that help you get started on Azure HPC - you