ARM64 exception handling

Windows on ARM64 uses the same structured exception handling mechanism for asynchronous hardware-generated exceptions and synchronous software-generated exceptions. Language-specific exception handlers are built on top of Windows structured exception handling by using language helper functions. This document describes exception handling in Windows on ARM64, and the language helpers used by code that's generated by the Microsoft ARM assembler and the MSVC compiler.

Goals and motivation

The exception unwinding data conventions, and this description, are intended to:

  1. Provide enough description to allow unwinding without code probing in all cases.

    • Analyzing the code requires the code to be paged in. This prevents unwinding in some circumstances where it is useful (tracing, sampling, debugging).

    • Analyzing the code is complex; the compiler must be careful to only generate instructions that the unwinder is capable of decoding.

    • If unwinding cannot be fully described through the use of unwind codes, then in some cases it must fall back to instruction decoding. This increases the overall complexity, and ideally would be avoided.

  2. Support unwinding in mid-prolog and mid-epilog.

    • Unwinding is used in Windows for more than exception handling, so it is critical that we be able to perform an accurate unwind even when in the middle of a prolog or epilog code sequence.
  3. Take up a minimal amount of space.

    • The unwind codes must not aggregate to significantly increase the binary size.

    • Since the unwind codes are likely to be locked in memory, a small footprint ensures a minimal overhead for each loaded binary.

Assumptions

These are the assumptions made in the exception handling description:

  1. Prologs and epilogs tend to mirror either other. By taking advantage of this common trait, the size of the metadata needed to describe unwinding can be greatly reduced. Within the body of the function, it does not matter whether the prolog's operations are undone, or the epilog's operations are performed in a forward manner. Both should produce identical results.

  2. Functions tend on the whole to be relatively small. Several optimizations for space rely on this in order to achieve the most efficient packing of data.

  3. There is no conditional code in epilogs.

  4. Dedicated frame pointer register: If the sp is saved in another register (x29) in the prolog, that register remains untouched throughout the function, so that the original sp may be recovered at any time.

  5. Unless the sp is saved in another register, all manipulation of the stack pointer occurs strictly within the prolog and epilog.

  6. The stack frame layout is organized as described in next section.

ARM64 stack frame layout

stack frame layout

For frame chained functions, the fp and lr pair can be saved at any position in the local variable area depending on optimization considerations. The goal is to maximize the number of locals that can be reached by one single instruction based on frame pointer (x29) or stack pointer (sp). However for alloca functions, it must be chained and x29 must point to the bottom of stack. To allow for better register-pair-addressing-mode coverage, nonvolatile register save areas are positioned at the top of the Local area stack. Here are examples that illustrate several of the most efficient prolog sequences. For the sake of clarity and better cache locality, the order of storing callee-saved registers in all canonical prologs is in "growing up" order. #framesz below represents the size of entire stack (excluding alloca area). #localsz and #outsz denote local area size (including the save area for the <x29, lr> pair) and outgoing parameter size, respectively.

  1. Chained, #localsz <= 512

        stp    x19,x20,[sp,#-96]!        // pre-indexed, save in 1st FP/INT pair
        stp    d8,d9,[sp,#16]            // save in FP regs (optional)
        stp    x0,x1,[sp,#32]            // home params (optional)
        stp    x2,x3,[sp,#48]
        stp    x4,x5,[sp,#64]
        stp    x6,x7,[sp,#72]
        stp    x29,lr,[sp,#-localsz]!   // save <x29,lr> at bottom of local area
        mov    x29,sp                   // x29 points to bottom of local
        sub    sp,sp,#outsz             // (optional for #outsz != 0)
    
  2. Chained, #localsz > 512

        stp    x19,x20,[sp,#-96]!        // pre-indexed, save in 1st FP/INT pair
        stp    d8,d9,[sp,#16]            // save in FP regs (optional)
        stp    x0,x1,[sp,#32]            // home params (optional)
        stp    x2,x3,[sp,#48]
        stp    x4,x5,[sp,#64]
        stp    x6,x7,[sp,#72]
        sub    sp,sp,#(localsz+outsz)   // allocate remaining frame
        stp    x29,lr,[sp,#outsz]       // save <x29,lr> at bottom of local area
        add    x29,sp,#outsz            // setup x29 points to bottom of local area
    
  3. Unchained, leaf functions (lr unsaved)

        stp    x19,x20,[sp,#-80]!       // pre-indexed, save in 1st FP/INT reg-pair
        stp    x21,x22,[sp,#16]
        str    x23,[sp,#32]
        stp    d8,d9,[sp,#40]           // save FP regs (optional)
        stp    d10,d11,[sp,#56]
        sub    sp,sp,#(framesz-80)      // allocate the remaining local area
    

    All locals are accessed based on SP. <x29,lr> points to the previous frame. For frame size <= 512, the "sub sp, ..." can be optimized away if the regs saved area is moved to the bottom of stack. The downside of that is that it is not consistent with other layouts above, and saved regs take part of the range for pair-regs and pre- and post-indexed offset addressing mode.

  4. Unchained, non-leaf functions (lr is saved in Int saved area)

        stp    x19,x20,[sp,#-80]!       // pre-indexed, save in 1st FP/INT reg-pair
        stp    x21,x22,[sp,#16]         // ...
        stp    x23,lr,[sp,#32]          // save last Int reg and lr
        stp    d8,d9,[sp,#48]           // save FP reg-pair (optional)
        stp    d10,d11,[sp,#64]         // ...
        sub    sp,sp,#(framesz-80)      // allocate the remaining local area
    

    Or, with even number saved Int registers,

        stp    x19,x20,[sp,#-80]!       // pre-indexed, save in 1st FP/INT reg-pair
        stp    x21,x22,[sp,#16]         // ...
        str    lr,[sp,#32]              // save lr
        stp    d8,d9,[sp,#40]           // save FP reg-pair (optional)
        stp    d10,d11,[sp,#56]         // ...
        sub    sp,sp,#(framesz-80)      // allocate the remaining local area
    

    Only x19 saved:

        sub    sp,sp,#16                // reg save area allocation*
        stp    x19,lr,[sp]              // save x19, lr
        sub    sp,sp,#(framesz-16)      // allocate the remaining local area
    

    * The reg save area allocation is not folded into the stp because a pre-indexed reg-lr stp cannot be represented with the unwind codes.

    All locals are accessed based on SP. <x29> points to the previous frame.

  5. Chained, #framesz <= 512, #outsz = 0

        stp    x29,lr,[sp,#-framesz]!       // pre-indexed, save <x29,lr>
        mov    x29,sp                       // x29 points to bottom of stack
        stp    x19,x20,[sp,#(framesz-32)]   // save INT pair
        stp    d8,d9,[sp,#(framesz-16)]     // save FP pair
    

    Comparing to #1 prolog above, the advantage is that all register save instructions are ready to be executed right after the only one stack allocating instruction. Thus, there is no anti-dependence on sp that prevents from instruction level parallelism.

  6. Chained, frame size > 512 (optional for functions without alloca)

        stp    x29,lr,[sp,#-80]!            // pre-indexed, save <x29,lr>
        stp    x19,x20,[sp,#16]             // save in INT regs
        stp    x21,x22,[sp,#32]             // ...
        stp    d8,d9,[sp,#48]               // save in FP regs
        stp    d10,d11,[sp,#64]
        mov    x29,sp                       // x29 points to top of local area
        sub    sp,sp,#(framesz-80)          // allocate the remaining local area
    

    For optimization purpose, x29 can be put at any position in local area to provide a better coverage for "reg-pair" and pre-/post-indexed offset addressing mode. Locals below frame pointers can be accessed based on SP.

  7. Chained, frame size > 4K, with or without alloca(),

        stp    x29,lr,[sp,#-80]!            // pre-indexed, save <x29,lr>
        stp    x19,x20,[sp,#16]             // save in INT regs
        stp    x21,x22,[sp,#32]             // ...
        stp    d8,d9,[sp,#48]               // save in FP regs
        stp    d10,d11,[sp,#64]
        mov    x29,sp                       // x29 points to top of local area
        mov    x15,#(framesz/16)
        bl     __chkstk
        sub    sp,sp,x15,lsl#4              // allocate remaining frame
                                            // end of prolog
        ...
        sub    sp,sp,#alloca                // more alloca() in body
        ...
                                            // beginning of epilog
        mov    sp,x29                       // sp points to top of local area
        ldp    d10,d11,[sp,#64]
        ...
        ldp    x29,lr,[sp],#80              // post-indexed, reload <x29,lr>
    

ARM64 exception handling information

.pdata records

The .pdata records are an ordered array of fixed-length items which describe every stack-manipulating function in a PE binary. Note carefully the phrase "stack-manipulating": leaf functions which do not require any local storage and which do not need to save/restore non-volatile registers do not require a .pdata record; these should be explicitly omitted to save space. A unwind from one of these functions can simply get the return address from LR to move up to the caller.

Each .pdata record for ARM64 is 8 bytes in length. The general format of the each record places the 32-bit RVA of the function start in the first word, followed by a second with that contains either a pointer to a variable-length .xdata block, or a packed word describing a canonical function unwinding sequence.

.pdata record layout

The fields are as follows:

  • Function Start RVA is the 32-bit RVA of the start of the function.

  • Flag is a 2-bit field that indicates how to interpret the remaining 30 bits of the second .pdata word. If Flag is 0, then the remaining bits form an Exception Information RVA (with the low two bits implicitly 0). If Flag is non-zero, then the remaining bits form a Packed Unwind Data structure.

  • Exception Information RVA is the address of the variable-length exception information structure, stored in the .xdata section. This data must be 4-byte aligned.

  • Packed Unwind Data is a compressed description of the operations needed to unwind from a function, assuming a canonical form. In this case, no .xdata record is required.

.xdata records

When the packed unwind format is insufficient to describe the unwinding of a function, a variable-length .xdata record must be created. The address of this record is stored in the second word of the .pdata record. The format of the .xdata is a packed variable-length set of words:

.xdata record layout

This data is broken into four sections:

  1. A 1 or 2-word header describing the overall size of the structure and providing key function data. The second word is only present if both the Epilog Count and Code Words fields are set to 0. These are the bit fields in the header:

    a. Function Length is an 18-bit field indicating the total length of the function in bytes, divided by 4. If a function is larger than 1M, then multiple pdata and xdata records must be used to describe the function. See the Large functions section for more details.

    b. Vers is a 2-bit field describing the version of the remaining xdata. As of this writing, only version 0 is defined, and thus values of 1-3 are not permitted.

    c. X is a 1-bit field indicating the presence (1) or absence (0) of exception data.

    d. E is one bit field indicates that information describing a single epilog is packed into the header (1) rather than requiring additional scope words later (0).

    e. Epilog Count is a 5-bit field that has two meanings, depending on the state of E bit:

    1. If E is set to 0: it specifies the count of the total number of epilog scopes described in section 2. If more than 31 scopes exist in the function, then the Code Words field must be set to 0 to indicate that an extension word is required.

    2. If E is set to 1, then this field specifies the index of the first unwind code that describes the one and only epilog.

    f. Code Words is a 5-bit field that specifies the number of 32-bit words needed to contain all of the unwind codes in section 3. If more than 31 words are required (i.e., more than 124 unwind code bytes), then this field must be set to 0 to indicate that an extension word is required.

    g. Extended Epilog Count and Extended Code Words are 16-bit and 8-bit fields, respectively, that provide more space for encoding an unusually large number of epilogs or an unusually large number of unwind code words. The extension word containing these fields is only present if both the Epilog Count and Code Words fields in the first header word are set to 0.

  2. After the header and optional extended header described above, if Epilog Count is not zero, is a list of information about epilog scopes, packed one to a word, and stored in order of increasing starting offset. Each scope contains the following bits:

    a. Epilog Start Offset is an 18-bit field describing the offset in bytes, divided by 4, of the epilog relative to the start of the function

    b. Res is a 4-bit field reserved for future expansion. Its value must be 0.

    c. Epilog Start Index is a 10-bit (2 more bits than Extended Code Words) field indicating the byte index of the first unwind code that describes this epilog.

  3. After the list of epilog scopes comes an array of bytes that contain unwind codes, described in detail in a later section. This array is padded at the end to the nearest full word boundary. Unwind codes are written to this array, starting with the one closest to the body of the function, moving towards the edges of the function. The bytes for each unwind code are stored in big-endian order so they can be fetched directly, starting with the most significant byte first, which identifies the operation and the length of the rest of the code.

  4. Finally, after the unwind code bytes, if the X bit in the header was set to 1, comes the exception handler information. This consists of a single Exception Handler RVA providing the address of the exception handler itself, followed immediately by a variable-length amount of data required by the exception handler.

The .xdata record above is designed such that it is possible to fetch the first 8 bytes and from that compute the full size of the record (minus the length of the variable-sized exception data that follows). The following code snippet computes the record size:

ULONG ComputeXdataSize(PULONG *Xdata)
{
    ULONG EpilogScopes;
    ULONG Size;
    ULONG UnwindWords;

    if ((Xdata[0] >> 27) != 0) {
        Size = 4;
        EpilogScopes = (Xdata[0] >> 22) & 0x1f;
        UnwindWords = (Xdata[0] >> 27) & 0x0f;
    } else {
        Size = 8;
        EpilogScopes = Xdata[1] & 0xffff;
        UnwindWords = (Xdata[1] >> 16) & 0xff;
    }

    Size += 4 * EpilogScopes;
    Size += 4 * UnwindWords;
    if (Xdata[0] & (1 << 20)) {
        Size += 4;        // exception handler RVA
    }
    return Size;
}

It should be noted that although the prolog and each epilog has its own index into the unwind codes, the table is shared between them, and it is entirely possible (and not altogether uncommon) that they can all share the same codes (see Example 2 in the Examples section below). Compiler writers should optimize for this case, in particular because the largest index that can be specified is 255, thus limiting the total number of unwind codes for a particular function.

Unwind codes

The array of unwind codes is pool of sequences that describe exactly how to undo the effects of the prolog, in the order in which the operations need to be undone. The unwind codes can be thought of as a mini instruction set, encoded as a string of bytes. When execution is complete, the return address to the calling function is in the lr register, and all non-volatile registers are restored to their values at the time the function was called.

If exceptions were guaranteed to only ever occur within a function body (and never with a prolog or any epilog), then only a single sequence would be necessary. However, Windows unwinding model requires that we be able to unwind from within a partially executed prolog or epilog. In order to accommodate this requirement, the unwind codes have been carefully designed such that they unambiguously map 1:1 to each relevant opcode in the prolog and epilog. This has several implications:

  1. By counting the number of unwind codes, it is possible to compute the length of the prolog and epilog.

  2. By counting the number of instructions past the start of an epilog scope, it is possible to skip the equivalent number of unwind codes, and execute the rest of a sequence to complete the partially-executed unwind that the epilog was performing.

  3. By counting the number of instructions before the end of the prolog, it is possible to skip the equivalent number of unwind codes, and execute the rest of the sequence to undo only those parts of the prolog that have completed execution.

The unwind codes are encoded according to the table below. All unwind codes are a single/double byte, except the one that allocates a huge stack. Totally there are 21 unwind code. Each unwind code maps exactly one instruction in the prolog/epilog in order to allow for unwinding of partially executed prologs and epilogs.

Unwind code Bits and interpretation
alloc_s 000xxxxx: allocate small stack with size < 512 (2^5 * 16).
save_r19r20_x 001zzzzz: save <x19,x20> pair at [sp-#Z*8]!, pre-indexed offset >= -248
save_fplr 01zzzzzz: save <x29,lr> pair at [sp+#Z*8], offset <= 504.
save_fplr_x 10zzzzzz: save <x29,lr> pair at [sp-(#Z+1)*8]!, pre-indexed offset >= -512
alloc_m 11000xxx'xxxxxxxx: allocate large stack with size < 16k (2^11 * 16).
save_regp 110010xx'xxzzzzzz: save x(19+#X) pair at [sp+#Z*8], offset <= 504
save_regp_x 110011xx'xxzzzzzz: save pair x(19+#X) at [sp-(#Z+1)*8]!, pre-indexed offset >= -512
save_reg 110100xx'xxzzzzzz: save reg x(19+#X) at [sp+#Z*8], offset <= 504
save_reg_x 1101010x'xxxzzzzz: save reg x(19+#X) at [sp-(#Z+1)*8]!, pre-indexed offset >= -256
save_lrpair 1101011x'xxzzzzzz: save pair <x(19+2*#X),lr> at [sp+#Z*8], offset <= 504
save_fregp 1101100x'xxzzzzzz: save pair d(8+#X) at [sp+#Z*8], offset <= 504
save_fregp_x 1101101x'xxzzzzzz: save pair d(8+#X), at [sp-(#Z+1)*8]!, pre-indexed offset >= -512
save_freg 1101110x'xxzzzzzz: save reg d(8+#X) at [sp+#Z*8], offset <= 504
save_freg_x 11011110'xxxzzzzz: save reg d(8+#X) at [sp-(#Z+1)*8]!, pre-indexed offset >= -256
alloc_l 11100000'xxxxxxxx'xxxxxxxx'xxxxxxxx: allocate large stack with size < 256M (2^24 *16)
set_fp 11100001: set up x29: with: mov x29,sp
add_fp 11100010'xxxxxxxx: set up x29 with: add x29,sp,#x*8
nop 11100011: no unwind operation is required.
end 11100100: end of unwind code. Implies ret in epilog.
end_c 11100101: end of unwind code in current chained scope.
save_next 11100110: save next non-volatile Int or FP register pair.
arithmetic(add) 11100111'000zxxxx: add cookie reg(z) to lr (0=x28, 1=sp); add lr, lr, reg(z)
arithmetic(sub) 11100111'001zxxxx: sub cookie reg(z) from lr (0=x28, 1=sp); sub lr, lr, reg(z)
arithmetic(eor) 11100111'010zxxxx: eor lr with cookie reg(z) (0=x28, 1=sp); eor lr, lr, reg(z)
arithmetic(rol) 11100111'0110xxxx: simulated rol of lr with cookie reg (x28); xip0 = neg x28; ror lr, xip0
arithmetic(ror) 11100111'100zxxxx: ror lr with cookie reg(z) (0=x28, 1=sp); ror lr, lr, reg(z)
11100111: xxxz----: ---- reserved
11101xxx: reserved for custom stack cases below only generated for asm routines
11101001: Custom stack for MSFT_OP_TRAP_FRAME
11101010: Custom stack for MSFT_OP_MACHINE_FRAME
11101011: Custom stack for MSFT_OP_CONTEXT
1111xxxx: reserved

In instructions with large values covering multiple bytes, the most significant bits are stored first. The unwind codes above are designed such that by simply looking up the first byte of the code, it is possible to know the total size in bytes of the unwind code. Given that each unwind code is exactly mapped to an instruction in prolog/epilog, to compute the size of the prolog or epilog, all that needs to be done is to walk from the start of the sequence to the end, using a lookup table or similar device to determine how long the corresponding opcode is.

Note that post-indexed offset addressing is not allowed in prolog. All offset ranges (#Z) match the encoding of STP/STR addressing except save_r19r20_x in which 248 is sufficient for all save areas (10 Int registers + 8 FP registers + 8 input registers).

save_next must follow a save for Int or FP volatile register pair: save_regp, save_regp_x, save_fregp, save_fregp_x, save_r19r20_x, or another save_next. It saves next register pair at the next 16-byte slot in "growing up" order. save-next following a save_next that denotes the last Int register pair refers to first FP register pair.

Since the size of regular return and jump instructions are the same, there is no need of a separated end unwind code for tail-call scenarios.

end_c is designed to handle noncontiguous function fragments for optimization purposes. A end_c which indicates the end of unwind codes in the current scope must be followed by another series of unwind code ended with a real end. The unwind codes between end_c and end represent the prolog operations in parent region ("phantom" prolog). More details and examples are described in section below.

Packed unwind data

For functions whose prologs and epilogs follow the canonical form described below, packed unwind data can be used, eliminating the need for an .xdata record entirely and significantly reducing the cost of providing unwind data. The canonical prologs and epilogs are designed to meet the common requirements of a simple function that does not require an exception handler, and which performs its setup and teardown operations in a standard order.

The format of a .pdata record with packed unwind data looks like this:

.pdata record with packed unwind data

The fields are as follows:

  • Function Start RVA is the 32-bit RVA of the start of the function.
  • Flag is a 2-bit field as described above, with the following meanings:
    • 00 = packed unwind data not used; remaining bits point to an .xdata record
    • 01 = packed unwind data used as described below with single prolog and epilog at the beginning and end of the scope
    • 10 = packed unwind data used as described below for code without any prolog and epilog; this is useful for describing separated function segments.
    • 11 = reserved;
  • Function Length is an 11-bit field providing the length of the entire function in bytes, divided by 4. If the function is larger than 8k, a full .xdata record must be used instead.
  • Frame Size is a 9-bit field indicating the number of bytes of stack that is allocated for this function, divided by 16. Functions that allocate greater than (8k-16) bytes of stack must use a full .xdata record. This includes local variable area, outgoing parameter area, callee-saved Int and FP area, and home parameter area, but excluding dynamic allocation area.
  • CR is a 2-bit flag indicating whether the function includes extra instructions to set up a frame chain and return link:
    • 00 = unchained function, <x29,lr> pair is not saved in stack.
    • 01 = unchained function, <lr> is saved in stack
    • 10 = reserved;
    • 11 = chained function, a store/load pair instruction is used in prolog/epilog <x29,lr>
  • H is a 1-bit flag indicating whether the function homes the integer parameter registers (x0-x7) by storing them at the very start of the function. (0=does not home registers, 1=homes registers).
  • RegI is a 4-bit field indicating the number of non-volatile INT registers (x19-x28) saved in the canonical stack location.
  • RegF is a 3-bit field indicating the number of non-volatile FP registers (d8-d15) saved in the canonical stack location. (RegF=0: no FP register is saved; RegF>0: RegF+1 FP registers are saved). Packed unwind data cannot be used for function that save only one FP register.

Canonical prologs that fall into categories 1, 2 (without outgoing parameter area), 3 and 4 in section above can be represented by packed unwind format. The epilogs for canonical functions follow a very similar form, except H has no effect, the set_fp instruction is omitted, and the order of steps as well as instructions in each step are reversed in epilog. The algorithm for packed xdata follows these steps, detailed in the following table:

Step 0: Perform the pre-computation of the size of each area.

Step 1: Save Int callee-saved registers.

Step 2: This step is specific for type 4 in early sections. lr is saved at the end of Int area.

Step 3: Save FP callee-saved registers.

Step 4: Save input arguments in the home parameter area.

Step 5: Allocate remaining stack, including local area, <x29,lr> pair, and outgoing parameter area. 5a corresponds to canonical type 1. 5b and 5c are for canonical type 2. 5d and 5e are for both type 3 and type 4.

Step # Flag values # of instructions Opcode Unwind Code
0 #intsz = RegI * 8;
if (CR==01) #intsz += 8; // lr
#fpsz = RegF * 8;
if(RegF) #fpsz += 8;
#savsz=((#intsz+#fpsz+8*8*H)+0xf)&~0xf)
#locsz = #famsz - #savsz
1 0 < RegI <= 10 RegI / 2 + RegI % 2 stp x19,x20,[sp,#savsz]!
stp x21,x22,[sp,#16]
...
save_regp_x
save_regp
...
2 CR==01* 1 str lr,[sp,#(intsz-8)]* save_reg
3 0 < RegF <=7 (RegF + 1) / 2 +
(RegF + 1) % 2)
stp d8,d9,[sp,#intsz]**
stp d10,d11,[sp,#(intsz+16)]
...
str d(8+RegF),[sp,#(intsz+fpsz-8)]
save_fregp
...
save_freg
4 H == 1 4 stp x0,x1,[sp,#(intsz+fpsz)]
stp x2,x3,[sp,#(intsz+fpsz+16)]
stp x4,x5,[sp,#(intsz+fpsz+32)]
stp x6,x7,[sp,#(intsz+fpsz+48)]
nop
nop
nop
nop
5a CR == 11 && #locsz
<= 512
2 stp x29,lr,[sp,#-locsz]!
mov x29,sp***
save_fplr_x
set_fp
5b CR == 11 &&
512 < #locsz <= 4080
3 sub sp,sp,#locsz
stp x29,lr,[sp,0]
add x29,sp,0
alloc_m
save_fplr
set_fp
5c CR == 11 && #locsz > 4080 4 sub sp,sp,4080
sub sp,sp,#(locsz-4080)
stp x29,lr,[sp,0]
add x29,sp,0
alloc_m
alloc_s/alloc_m
save_fplr
set_fp
5d (CR == 00 || CR==01) &&
#locsz <= 4080
1 sub sp,sp,#locsz alloc_s/alloc_m
5e (CR == 00 || CR==01) &&
#locsz > 4080
2 sub sp,sp,4080
sub sp,sp,#(locsz-4080)
alloc_m
alloc_s/alloc_m

* If CR == 01 and RegI is an odd number, Step 2 and last save_rep in step 1 are merged into one save_regp.

** If RegI == CR == 0, and RegF != 0, the first stp for the floating-point does the predecrement.

*** No instruction corresponding to mov x29,sp is present in the epilog. Packed unwind data cannot be used if a function requires restoration of sp from x29.

Unwinding partial prologs and epilogs

The most common unwinding situation is one where the exception or call occurred in the body of the function, away from the prolog and all epilogs. In this situation, unwinding is straightforward: the unwinder simply begins executing the codes in the unwind array beginning at index 0 and continuing until an end opcode is detected.

It is more difficult to correctly unwind in the case where an exception or interrupt occurs while executing a prolog or epilog. In these situations, the stack frame is only partially constructed, and the trick is to determine exactly what has been done in order to correctly undo it.

For example, take this prolog and epilog sequence:

0000:    stp    x29,lr,[sp,#-256]!          // save_fplr_x  256 (pre-indexed store)
0004:    stp    d8,d9,[sp,#224]             // save_fregp 0, 224
0008:    stp    x19,x20,[sp,#240]           // save_regp 0, 240
000c:    mov    x29,sp                      // set_fp
         ...
0100:    mov    sp,x29                      // set_fp
0104:    ldp    x19,x20,[sp,#240]           // save_regp 0, 240
0108:    ldp    d8,d9,[sp,224]              // save_fregp 0, 224
010c:    ldp    x29,lr,[sp],#256            // save_fplr_x  256 (post-indexed load)
0110:    ret    lr                          // end

Next to each opcode is the appropriate unwind code describing this operation. The first thing to note is that the series of unwind codes for the prolog is an exact mirror image of the unwind codes for the epilog (not counting the final instruction of the epilog). This is a common situation, and for this reason the unwind codes for the prolog are always assumed to be stored in reverse order from the prolog's execution order.

Thus, for both the prolog and epilog, we are left with a common set of unwind codes:

set_fp, save_regp 0,240, save_fregp,0,224, save_fplr_x_256, end

Starting with the epilog case (more straightforward as it is in normal order), at offset 0 within the epilog (which starts at offset 0x100 in the function), we would expect to execute the full unwind sequence, as no cleanup has yet been done. If we find ourselves one instruction in (at offset 2 in the epilog), we can successfully unwind by skipping the first unwind code. Generalizing this situation, assuming a 1:1 mapping between opcodes and unwind codes, we can state that if we are unwinding from instruction n in the epilog, we should skip the first n unwind codes and begin executing from there.

It turns out that a similar logic works for the prolog, except in reverse. If we are unwinding from offset 0 in the prolog, we would want to execute nothing. If we unwound from offset 2, which is one instruction in, then we would want to start executing the unwind sequence one unwind code from the end (remember, the codes are stored in reverse order). And here too we can generalize that if we are unwinding from instruction n in the prolog, we should start executing n unwind codes from the end of the list of codes.

Now, it is not always the case that the prolog and epilog codes match exactly. For this reason, the unwind array may need to contain several sequences of codes. To determine the offset of where to begin processing codes, use the following logic:

  1. If unwinding from within the body of the function, simply begin executing unwind codes at index 0 and continue until hitting an "end" opcode.

  2. If unwinding from within an epilog, use the epilog-specific starting index provided with the epilog scope as a starting point. Compute how many bytes the PC in question is from the start of the epilog. Then advance forward through the unwind codes, skipping unwind codes until all of the already-executed instructions are accounted for. Then execute starting at that point.

  3. If unwinding from within the prolog, use index 0 as your starting point. Compute the length of the prolog code from the sequence, and then compute how many bytes the PC in question is from the end of the prolog. Then advance forward through the unwind codes, skipping unwind codes until all of the not-yet-executed instructions are accounted for. Then execute starting at that point.

As a result of these rules, the unwind codes for the prolog must always be the first in the array, and they are also the codes used to unwind in the general case of unwinding from within the body. Any epilog-specific code sequences should follow immediately after.

Function fragments

For code optimization purposes and other reasons, it may be preferable to split a function into separated fragments (also called regions). When this is done, each resulting function fragment requires its own separate .pdata (and possibly .xdata) record.

For separated secondary fragment that has its own prolog, it is expected that no stack adjustment is done in its prolog. All stack space required by the secondary regions must be pre-allocated by its parent region (or called host region). This keeps stack pointer manipulation strictly in the function's original prolog.

A typical case of function fragments is "code separation" with that compiler may move a region of code out of its host function. There are three unusual cases that could be resulted by code separation.

Example:

  • (region 1 : begin)

        stp     x29,lr,[sp,#-256]!      // save_fplr_x  256 (pre-indexed store)
        stp     x19,x20,[sp,#240]       // save_regp 0, 240
        mov     x29,sp                  // set_fp
        ...
    
  • (region 1 : end)

  • (region 3 : begin)

        ...
    
  • (region 3 : end)

  • (region 2: begin)

    ...
        mov     sp,x29                  // set_fp
        ldp     x19,x20,[sp,#240]       // save_regp 0, 240
        ldp     x29,lr,[sp],#256        // save_fplr_x  256 (post-indexed load)
        ret     lr                      // end
    
  • (region 2 : end)

  1. Prolog only (region 1: all epilogs are in separated regions):

    Only the prolog needs to be described. This cannot be represented by compact .pdata format. In the full .xdata case, this can be represented by setting Epilog Count = 0. See region 1 in the example above.

    Unwind codes: set_fp, save_regp 0,240, save_fplr_x_256, end.

  2. Epilogs only (region 2: prolog is in host region)

    It's assumed that by the time control jumping into this region, all prolog codes have been executed. Partial unwind can happen in epilogs the same way as in a normal function. This type of region cannot be represented by compact .pdata. In full xdata record, it can be encoded with a "phantom" prolog, bracketed by an end_c and end unwind code pair. The leading end_c indicates the size of prolog is zero. Epilog start index of the single epilog points to set_fp.

    Unwind code for region 2: end_c, set_fp, save_regp 0,240, save_fplr_x_256, end.

  3. No prologs or epilogs (region 3: prologs and all epilogs are in other fragments):

    Compact .pdata format can be applied via setting Flag = 10. With full .xdata record, Epilog Count = 1. Unwind code is the same as those for region 2 above, but Epilog Start Index also points to end_c. Partial unwind will never happen in this region of code.

Another more complicated case of function fragments is "shrink wrapping" with that compiler may choose to delay saving some callee-saved registers until outside of the function entry prolog.

  • (region 1 : begin)

        stp     x29,lr,[sp,#-256]!      // save_fplr_x  256 (pre-indexed store)
        stp     x19,x20,[sp,#240]       // save_regp 0, 240
        mov     x29,sp                  // set_fp
        ...
    
  • (region 2 : begin)

        stp     x21,x22,[sp,#224]       // save_regp 2, 224
        ...
        ldp     x21,x22,[sp,#224]       // save_regp 2, 224
    
  • (region 2 : end)

        ...
        mov     sp,x29                  // set_fp
        ldp     x19,x20,[sp,#240]       // save_regp 0, 240
        ldp     x29,lr,[sp],#256        // save_fplr_x  256 (post-indexed load)
        ret     lr                      // end
    
  • (region 1 : end)

In the prolog of region 1, stack space is pre-allocated. Note that region 2 will have the same unwind code even it's moved out of its host function.

Region 1: set_fp, save_regp 0,240, save_fplr_x_256, end with Epilog Start Index points to set_fp as usual.

Region 2: save_regp 2, 224, end_c, set_fp, save_regp 0,240, save_fplr_x_256, end. Epilog Start Index points to first unwind code save_regp 2, 224.

Large functions

Fragments can be leveraged to describe functions larger than the 1M limit imposed by the bit fields in the .xdata header. To describe a very large function like this, it simply needs to be broken into fragments smaller than 1M. Each fragment should be adjusted so that it does not split an epilog into multiple pieces.

Only the first fragment of the function will contain a prolog; all other fragments are marked as having no prolog. Depending on the number of epilogs present, each fragment may contain zero or more epilogs. Keep in mind that each epilog scope in a fragment specifies its starting offset relative to the start of the fragment, not the start of the function.

If a fragment has no prolog and no epilog, it still requires its own .pdata (and possibly .xdata) record, to describe how to unwind from within the body of the function.

Examples

Example 1: Frame-chained, compact-form

|Foo|     PROC
|$LN19|
    str     x19,[sp,#-0x10]!        // save_reg_x
    sub     sp,sp,#0x810            // alloc_m
    stp     fp,lr,[sp]              // save_fplr
    mov     fp,sp                   // set_fp
                                    //  end of prolog
    ...

|$pdata$Foo|
    DCD     imagerel     |$LN19|
    DCD     0x416101ed
    ;Flags[SingleProEpi] functionLength[492] RegF[0] RegI[1] H[0] frameChainReturn[Chained] frameSize[2080]

Example 2: Frame-chained, full-form with mirror Prolog & Epilog

|Bar|     PROC
|$LN19|
    stp     x19,x20,[sp,#-0x10]!    // save_regp_x
    stp     fp,lr,[sp,#-0x90]!      // save_fplr_x
    mov     fp,sp                   // set_fp
                                    // end of prolog
    ...
                                    // begin of epilog, a mirror sequence of Prolog
    mov     sp,fp
    ldp     fp,lr,[sp],#0x90
    ldp     x19,x20,[sp],#0x10
    ret     lr

|$pdata$Bar|
    DCD     imagerel     |$LN19|
    DCD     imagerel     |$unwind$cse2|
|$unwind$Bar|
    DCD     0x1040003d
    DCD     0x1000038
    DCD     0xe42291e1
    DCD     0xe42291e1
    ;Code Words[2], Epilog Count[1], E[0], X[0], Function Length[6660]
    ;Epilog Start Index[0], Epilog Start Offset[56]
    ;set_fp
    ;save_fplr_x
    ;save_r19r20_x
    ;end

Note that EpilogStart Index [0] points to the same sequence of Prolog unwind code.

Example 3: Variadic unchained Function

|Delegate| PROC
|$LN4|
    sub     sp,sp,#0x50
    stp     x19,lr,[sp]
    stp     x0,x1,[sp,#0x10]        // save incoming register to home area
    stp     x2,x3,[sp,#0x20]        // ...
    stp     x4,x5,[sp,#0x30]
    stp     x6,x7,[sp,#0x40]        // end of prolog
    ...
    ldp     x19,lr,[sp]             // beginning of epilog
    add     sp,sp,#0x50
    ret     lr

    AREA    |.pdata|, PDATA
|$pdata$Delegate|
    DCD     imagerel |$LN4|
    DCD     imagerel |$unwind$Delegate|

    AREA    |.xdata|, DATA
|$unwind$Delegate|
    DCD     0x18400012
    DCD     0x200000f
    DCD     0xe3e3e3e3
    DCD     0xe40500d6
    DCD     0xe40500d6
    ;Code Words[3], Epilog Count[1], E[0], X[0], Function Length[18]
    ;Epilog Start Index[4], Epilog Start Offset[15]
    ;nop        // nop for saving in home area
    ;nop        // ditto
    ;nop        // ditto
    ;nop        // ditto
    ;save_lrpair
    ;alloc_s
    ;end

Note: EpilogStart Index [4] points to the middle of Prolog unwind code (partially reuse unwind array).

See also

Overview of ARM64 ABI conventions
ARM Exception Handling