Intrinsics Overview
Microsoft Specific
3DNow! technology provides up to 26 additional instructions to support high-performance 3D graphics and audio processing. 3DNow! instructions are vector instructions that operate on 64-bit registers. 3DNow! instructions are SIMD; each instruction operates on pairs of 32-bit values. See 3DNow! Intrinsics for the reference documentation for the AMD intrinsics.
Vector instructions operate in parallel on two sets of 32-bit single-precision, floating-point words. Scalar instructions operate on a single set of 32-bit operands (from the low halves of the two 64-bit operands).
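The difference between the vector and scalar forms can be illustrated with a small portable C model. This is only a sketch: the `packed2f` type and both function names are hypothetical and are not part of the 3DNow! intrinsics; they stand in for a 64-bit register that holds two 32-bit floats.

```c
/* Hypothetical model of a 64-bit 3DNow! register: two packed 32-bit floats. */
typedef struct {
    float lo;  /* bits 31..0  */
    float hi;  /* bits 63..32 */
} packed2f;

/* Vector form: one operation processes both 32-bit lanes. */
static packed2f packed_add(packed2f a, packed2f b)
{
    packed2f r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* Scalar form: only the low halves of the two 64-bit operands take part. */
static float scalar_add_low(packed2f a, packed2f b)
{
    return a.lo + b.lo;
}
```

With `a = {1, 2}` and `b = {10, 20}`, the vector form yields `{11, 22}`, while the scalar form yields only the low-lane sum, 11.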
The 3DNow! single-precision, floating-point format is compatible with the IEEE 754 single-precision format. This format comprises a 1-bit sign, an 8-bit biased exponent, and a 23-bit significand with one hidden integer bit for a total of 24 bits in the significand. The bias of the exponent is 127, consistent with the IEEE single-precision standard. The significands are normalized to be within the range of [1,2).
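The field layout described above can be inspected from C. The following is a minimal sketch, assuming the host `float` is an IEEE 754 single-precision value; the `sp_fields` type and the function names are hypothetical helpers introduced for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a single-precision value. */
typedef struct {
    uint32_t sign;        /* 1 bit */
    uint32_t biased_exp;  /* 8 bits, bias 127 */
    uint32_t fraction;    /* 23 stored bits; the hidden integer bit is not stored */
} sp_fields;

/* Split a float into sign, biased exponent, and stored fraction bits. */
static sp_fields sp_decode(float f)
{
    uint32_t bits;
    sp_fields out;
    memcpy(&bits, &f, sizeof bits);   /* well-defined way to read the bit pattern */
    out.sign = bits >> 31;
    out.biased_exp = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}

/* Remove the bias of 127 to obtain the true power of two. */
static int sp_unbiased_exp(sp_fields s)
{
    return (int)s.biased_exp - 127;
}
```

For example, -6.5 is -1.625 × 2^2, so it decodes to sign 1, biased exponent 129, and fraction 500000h (the bits of 0.625).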
In contrast to the IEEE standard, which dictates four rounding modes, 3DNow! technology supports a single rounding mode, either round-to-nearest or round-to-zero (truncation). The hardware implementation of 3DNow! technology determines the rounding mode. The AMD processors implement round-to-nearest mode. Regardless of the rounding mode used, the floating-point-to-integer and integer-to-floating-point conversion instructions, PF2ID and PI2FD, always use the round-to-zero (truncation) mode.
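The truncating conversions can be modeled in portable C. This sketch covers only the round-to-zero rule stated above; it does not model what PF2ID does with out-of-range inputs, and both function names are hypothetical. A C cast from `float` to an integer type already truncates toward zero, whereas a plain `(float)` cast from an integer normally rounds to nearest, so the integer-to-float direction drops the excess low bits by hand.

```c
#include <stdint.h>

/* Float to integer, round-to-zero. Valid only when f fits in int32_t. */
static int32_t float_to_int_trunc(float f)
{
    return (int32_t)f;  /* C casts discard the fractional part */
}

/* Integer to float, round-to-zero: keep the top 24 significant bits. */
static float int_to_float_trunc(int32_t x)
{
    uint32_t mag = x < 0 ? 0u - (uint32_t)x : (uint32_t)x;
    int shift = 0;
    float f;

    /* Drop low bits until the magnitude fits the 24-bit significand. */
    while ((mag >> shift) >= (1u << 24))
        shift++;
    mag = (mag >> shift) << shift;

    f = (float)mag;     /* exact: at most 24 significant bits remain */
    return x < 0 ? -f : f;
}
```

The integer 16777219 needs 25 significant bits. Truncation gives 16777218, whereas round-to-nearest would give 16777220.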
The largest representable normal number in magnitude for this precision in hexadecimal has an exponent of FEh and a significand of 7FFFFFh, with a numerical value of 2^127 × (2 - 2^-23). All results that overflow above the maximum representable positive value are saturated to either this maximum representable normal number or to positive infinity. Similarly, all results that overflow below the minimum representable negative value are saturated to either this minimum representable normal number or to negative infinity.
The implementation of 3DNow! technology determines how arithmetic overflow is handled, as either properly signed maximum or minimum representable normal numbers or properly signed infinities. The processor generates properly signed maximum or minimum representable normal numbers.
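The saturating overflow behavior can be sketched in portable C. The function names here are hypothetical, and the sketch assumes the host `float` is IEEE 754 single precision, so that `FLT_MAX` is the value with exponent FEh and significand 7FFFFFh. The product is computed in `double`, where it is exact, and then clamped to the largest-magnitude normal number instead of becoming an infinity.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Build a float from its sign, biased exponent, and stored fraction bits. */
static float sp_from_fields(uint32_t sign, uint32_t biased_exp, uint32_t fraction)
{
    uint32_t bits = (sign << 31) | ((biased_exp & 0xFFu) << 23) | (fraction & 0x7FFFFFu);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Clamp an exact result to the properly signed maximum normal number. */
static float saturate_to_normal(double exact)
{
    if (exact > (double)FLT_MAX)
        return FLT_MAX;
    if (exact < -(double)FLT_MAX)
        return -FLT_MAX;
    return (float)exact;
}

/* Multiply with saturation in place of overflow to infinity. */
static float saturating_mul(float a, float b)
{
    return saturate_to_normal((double)a * (double)b);
}
```

Under these assumptions, `sp_from_fields(0, 0xFE, 0x7FFFFF)` equals `FLT_MAX`, and `saturating_mul(FLT_MAX, 2.0f)` returns `FLT_MAX` where ordinary IEEE arithmetic would produce positive infinity.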
Infinities and NaNs are not supported as operands to 3DNow! instructions.
The smallest representable normal number in magnitude for this precision in hexadecimal has an exponent of 01h and a significand of 000000h, with a numerical value of 2^-126. Accordingly, all results below this minimum representable value in magnitude are held to zero. The following table shows the exponent ranges supported by the 3DNow! technology.
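The underflow rule can be modeled in a few lines of C. This is a sketch under two assumptions: the host `float` is IEEE 754 single precision, so `FLT_MIN` is 2^-126, and a held result is returned as positive zero (the text does not say which sign the zero carries). The function name is hypothetical.

```c
#include <float.h>

/* Hold any result whose magnitude is below 2^-126 (FLT_MIN) to zero. */
static float hold_tiny_to_zero(double exact)
{
    if (exact > -(double)FLT_MIN && exact < (double)FLT_MIN)
        return 0.0f;      /* assumption: positive zero */
    return (float)exact;  /* normal-range results pass through */
}
```

A result of exactly 2^-126 is kept, while half that value is replaced by zero rather than being stored as a denormal.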
3DNow! Technology Exponent Ranges
Biased exponent  Description
FFh              Unsupported. Unsupported numbers can be used as operands. The results of operations with unsupported numbers are undefined.
00h              Zero.
00h<x<FFh        Normal.
01h              2^(1-127), lowest possible exponent.
FEh              2^(254-127), largest possible exponent.
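Because the instructions raise no exceptions, a caller may want to screen data before handing it to 3DNow! code. The following is a minimal sketch of such a check, built directly on the exponent ranges in the table. The type and function names are hypothetical. Treating every encoding with exponent 00h as zero (including denormal bit patterns) is an assumption drawn from the table's single "Zero" row.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { TD_ZERO, TD_NORMAL, TD_UNSUPPORTED } td_class;

/* Classify a value by its biased exponent, following the table rows. */
static td_class td_classify(float f)
{
    uint32_t bits;
    uint32_t e;
    memcpy(&bits, &f, sizeof bits);
    e = (bits >> 23) & 0xFFu;
    if (e == 0xFFu)
        return TD_UNSUPPORTED;  /* FFh: infinities and NaNs */
    if (e == 0x00u)
        return TD_ZERO;         /* 00h: treated as zero (assumption for denormals) */
    return TD_NORMAL;           /* 01h..FEh */
}

/* Return 1 if no element of v[0..n-1] has the unsupported exponent FFh. */
static int td_all_supported(const float *v, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (td_classify(v[i]) == TD_UNSUPPORTED)
            return 0;
    }
    return 1;
}
```

A buffer can be passed through `td_all_supported` once before a batch of packed operations, which keeps the check out of the inner loop.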
Like MMX instructions, 3DNow! instructions do not generate numeric exceptions or set any status flags. It is the user's responsibility to ensure that in-range data is provided to 3DNow! instructions and that all computations remain within valid ranges (or are held as expected).
The register operations of all 3DNow! floating-point instructions are executed by either the register X unit or the register Y unit. One operation can be issued to each register unit at each clock cycle for a maximum issue and execution rate of two 3DNow! operations per cycle.
Normally, in high-performance 3DNow! code, all 3DNow! instructions are properly scheduled apart from each other to avoid delays caused by execution resource contentions (as well as taking into account dependencies and execution latencies).
For further information regarding code optimization on the AMD-K6 processor, see the AMD-K6 Processor Code Optimization Application Note, order number 21924. This document provides in-depth discussions of code optimization techniques for the processor.
For execution resources information on the AMD Athlon processor, refer to the AMD Athlon Processor x86 Code Optimization Guide, order number 22007.
The 3DNow! performance enhancement instructions for AMD processors are summarized in the following tables.
AMD 3DNow! Floating-Point Instructions
Operation Function                                                                       Opcode
PAVGUSB   Packed 8-bit unsigned integer averaging                                        BFh
PFADD     Packed floating-point addition                                                 9Eh
PFSUB     Packed floating-point subtraction                                              9Ah
PFSUBR    Packed floating-point reverse subtraction                                      AAh
PFACC     Packed floating-point accumulate                                               AEh
PFCMPGE   Packed floating-point comparison, greater or equal                             90h
PFCMPGT   Packed floating-point comparison, greater                                      A0h
PFCMPEQ   Packed floating-point comparison, equal                                        B0h
PFMIN     Packed floating-point minimum                                                  94h
PFMAX     Packed floating-point maximum                                                  A4h
PI2FD     Packed 32-bit integer to floating-point conversion                             0Dh
PF2ID     Packed floating-point to 32-bit integer                                        1Dh
PFRCP     Packed floating-point reciprocal approximation                                 96h
PFRSQRT   Packed floating-point reciprocal square root approximation                     97h
PFMUL     Packed floating-point multiplication                                           B4h
PFRCPIT1  Packed floating-point reciprocal first iteration step                          A6h
PFRSQIT1  Packed floating-point reciprocal square root first iteration step              A7h
PFRCPIT2  Packed floating-point reciprocal/reciprocal square root second iteration step  B6h
PMULHRW   Packed 16-bit integer multiply with rounding                                   B7h
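The table pairs approximation instructions (PFRCP, PFRSQRT) with iteration-step instructions (PFRCPIT1, PFRSQIT1, PFRCPIT2), which suggests a seed-then-refine scheme. This overview does not describe the exact arithmetic of those steps, so the following sketch shows generic Newton-Raphson refinement in portable C as an assumption about the general technique, not as a model of the hardware. All function names are hypothetical.

```c
/* One Newton-Raphson step toward 1/b: x1 = x0 * (2 - b * x0). */
static float recip_refine(float b, float x0)
{
    return x0 * (2.0f - b * x0);
}

/* One Newton-Raphson step toward 1/sqrt(b): x1 = x0 * (3 - b * x0 * x0) / 2. */
static float rsqrt_refine(float b, float x0)
{
    return x0 * (3.0f - b * x0 * x0) * 0.5f;
}

/* Absolute error helper used to compare an estimate with a reference value. */
static float abs_err(float approx, float exact)
{
    float d = approx - exact;
    return d < 0.0f ? -d : d;
}
```

Starting from the rough estimate 0.3 for 1/3, one call to `recip_refine(3.0f, 0.3f)` gives about 0.33, cutting the error by roughly a factor of ten. Each further step approximately squares the relative error, which is why a low-precision seed plus a small number of steps can reach full single precision.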
AMD 3DNow! Performance Enhancement Instructions
Operation           Function                                               Opcode second byte
FEMMS               Faster entry/exit of the MMX or floating-point state.  0Eh
PREFETCH/PREFETCHW                                                         0Dh
AMD Athlon Processor 3DNow! Technology DSP Extensions
Operation Function                                                           Opcode / imm8
PF2IW     Packed floating-point to integer word conversion with sign extend  0Fh 0Fh / 1Ch
PFNACC    Packed floating-point negative accumulate                          0Fh 0Fh / 8Ah
PFPNACC   Packed floating-point mixed positive-negative accumulate           0Fh 0Fh / 8Eh
PI2FW     Packed integer word to floating-point conversion                   0Fh 0Fh / 0Ch
PSWAPD    Packed swap doubleword                                             0Fh 0Fh / BBh
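The "0Fh 0Fh / imm8" notation in the table above means that the instruction starts with the two-byte escape 0Fh 0Fh and that a trailing immediate byte selects the operation. The sketch below assembles the register-to-register form, assuming the standard x86 layout in which a ModR/M byte sits between the escape bytes and the trailing byte; the function name is hypothetical, and the output should be checked against the AMD manuals before being relied on.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode "op mmDst, mmSrc" as: 0Fh 0Fh, ModR/M, suffix.
 * ModR/M uses mod = 11b (register operand), reg = destination, rm = source.
 * Returns the number of bytes written (always 4).
 */
static size_t encode_3dnow_rr(uint8_t out[4], unsigned dst_mm, unsigned src_mm,
                              uint8_t suffix)
{
    out[0] = 0x0F;
    out[1] = 0x0F;
    out[2] = (uint8_t)(0xC0u | ((dst_mm & 7u) << 3) | (src_mm & 7u));
    out[3] = suffix;  /* for example 9Eh for PFADD or BBh for PSWAPD */
    return 4;
}
```

Under these assumptions, `pfadd mm0, mm1` encodes as 0F 0F C1 9E and `pswapd mm2, mm3` as 0F 0F D3 BB.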
MMX Instruction set extensions starting with AMD Athlon Processor
Operation    Function                                                   Opcode / imm8
MASKMOVQ     Streaming (cache bypass) store using byte mask             0Fh F7h
MOVNTQ       Streaming (cache bypass) store                             0Fh E7h
PAVGB        Packed average of unsigned byte                            0Fh E0h
PAVGW        Packed average of unsigned word                            0Fh E3h
PEXTRW       Extract word into integer register                         0Fh C5h
PINSRW       Insert word from integer register                          0Fh C4h
PMAXSW       Packed maximum signed word                                 0Fh EEh
PMAXUB       Packed maximum unsigned byte                               0Fh DEh
PMINSW       Packed minimum signed word                                 0Fh EAh
PMINUB       Packed minimum unsigned byte                               0Fh DAh
PMOVMSKB     Move byte mask to integer register                         0Fh D7h
PMULHUW      Packed multiply high unsigned word                         0Fh E4h
PREFETCHNTA  Move data closer to the processor using the NTA reference  0Fh 18h 0*
PREFETCHT0   Move data closer to the processor using the T0 reference   0Fh 18h 1*
PREFETCHT1   Move data closer to the processor using the T1 reference   0Fh 18h 2*
PREFETCHT2   Move data closer to the processor using the T2 reference   0Fh 18h 3*
PSADBW       Packed sum of absolute byte differences                    0Fh F6h
PSHUFW       Packed shuffle word                                        0Fh 70h
SFENCE       Store fence                                                0Fh AEh / 7h
*The number after the opcode indicates the prefetch mode, which is encoded in the ModR/M byte.
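The footnote can be made concrete with a short encoder sketch. It assumes the standard x86 ModR/M layout (mod in bits 7..6, reg in bits 5..3, rm in bits 2..0) with the prefetch mode carried in the reg field, and it handles only the simplest addressing form, a bare base register with no displacement. The enum and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Prefetch modes as numbered in the table (0 to 3). */
enum prefetch_hint { HINT_NTA = 0, HINT_T0 = 1, HINT_T1 = 2, HINT_T2 = 3 };

/* Pack the three ModR/M fields into one byte. */
static uint8_t modrm_byte(unsigned mod, unsigned reg, unsigned rm)
{
    return (uint8_t)(((mod & 3u) << 6) | ((reg & 7u) << 3) | (rm & 7u));
}

/*
 * Encode "prefetch<hint> [base]" as 0Fh 18h ModR/M with mod = 00b.
 * Returns the byte count (3), or 0 for base registers this sketch skips:
 * rm = 4 requires a SIB byte and rm = 5 requires a 32-bit displacement.
 */
static size_t encode_prefetch(uint8_t out[3], enum prefetch_hint hint,
                              unsigned base_reg)
{
    if (base_reg > 7u || base_reg == 4u || base_reg == 5u)
        return 0;
    out[0] = 0x0F;
    out[1] = 0x18;
    out[2] = modrm_byte(0u, (unsigned)hint, base_reg);
    return 3;
}
```

With register number 0 for EAX and 3 for EBX, `prefetchnta [eax]` comes out as 0F 18 00, `prefetcht0 [eax]` as 0F 18 08, and `prefetcht2 [ebx]` as 0F 18 1B. The same helper reproduces the "/ 7h" in the SFENCE row: `modrm_byte(3, 7, 0)` is F8h, giving 0F AE F8.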
For further information regarding code optimization on the AMD-K6-2 processor, see the AMD-K6-2 Processor Code Optimization Application Note, order number 21924. This document provides in-depth discussions of code optimization techniques for the AMD-K6 family processor.
For execution resources information on the AMD Athlon processor, refer to the AMD Athlon Processor x86 Code Optimization Guide, order number 22007. This document provides in-depth discussions of code optimization techniques for the AMD Athlon processor.
See http://go.microsoft.com/fwlink/?LinkID=95131 for the online versions of these documents.