Understanding ARM Assembly Part 2

My name is Marion Cole, and I am a Sr. Escalation Engineer in the Microsoft Platforms Serviceability group.  This is Part 2 of my series of articles about ARM assembly.  In Part 1 we talked about the processor that is supported.  Here we are going to talk about how Windows utilizes that ARM processor.

 

As we discussed in Part 1, Windows runs on ARMv7-A with NEON.  We also discussed the CPSR register in Part 1.  A few bits in the CPSR are important.  The first one is the Endian State bit:

Bits:   31  30  29  28  27  26-25  24  23-20     19-16  15-10  9  8  7  6  5  4-0
Field:  N   Z   C   V   Q   IT     J   Reserved  GE     IT     E  A  I  F  T  M

 

Bit 9 (the E bit) indicates the Endian State.  This bit should always be 0 because Windows only runs in the little-endian state.  If you get a dump and see that CPSR bit 9 is set, then you have a problem.  Here is an example from the debugger:

1: kd> r
r0=00000001  r1=00000001  r2=00000000  r3=00000000  r4=e1074044  r5=c555b580
r6=00000001  r7=e104ca39  r8=00000001  r9=00000000 r10=e9bf06c7 r11=d5f1ea08
r12=e16b213c  sp=d5f1e9b0  lr=e0f0fe2f  pc=e0fdebd0 psr=00000133 ----- Thumb
nt!DbgBreakPointWithStatus:
e0fdebd0 defe     __debugbreak

 

1: kd> .formats 00000133
Evaluate expression:
  Hex:     00000133
  Decimal: 307
  Octal:   00000000463
  Binary:  00000000 00000000 00000001 00110011  <-- Bit 9 is 0.  Note that the rightmost bit is Bit 0.
  Chars:   ...3
  Time:    Wed Dec 31 18:05:07 1969
  Float:   low 4.30199e-043 high 0
  Double:  1.51678e-321
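The same check can be scripted instead of counting bits by eye.  The following is a minimal Python sketch, not a debugger feature; the helper name cpsr_bit is made up for illustration, and the psr value is the one from the debugger output above.

```python
def cpsr_bit(psr: int, n: int) -> int:
    """Return bit n of a 32-bit status register value (bit 0 is the rightmost bit)."""
    return (psr >> n) & 1


PSR = 0x00000133   # psr value taken from the debugger output above
E_BIT = 9          # Endian State bit

# Print the full 32-bit pattern, then the single bit we care about.
print(f"{PSR:032b}")
print("E bit:", cpsr_bit(PSR, E_BIT))
```

A result of 0 for the E bit matches what .formats shows: the processor is in the little-endian state.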

 

So how could Bit 9 ever be a 1?  The SETEND instruction in the ARM ISA allows even user mode code to change the current endianness.  Doing so is dangerous for an application and is discouraged.  If an exception is generated while in big-endian mode the behavior is unpredictable, and may lead to an application fault (user mode) or a bugcheck (kernel mode).

 

The next bit we are going to discuss is bit 5, the Thumb bit (the T bit).  This should be a 1 if executing Thumb instructions.  So let’s discuss the different instruction sets the ARM processor has.

 

ARMv7 has five different instruction sets (ISAs) for programming.

  • ARM - basic ARM instruction set including conditional execution.
  • Thumb - This mode uses a 16 bit instruction encoding to reduce code footprint.  It has limitations with respect to register access and some system instructions aren't implemented for Thumb.
  • Thumb2 - This extension of the Thumb instruction set adds 32 bit opcode encodings and adds enough facilities to author an entire OS.  Support for Thumb2 is guaranteed in the ARMv7 architecture revision.
  • Jazelle - Java code interpretation.
  • ThumbEE - a limited version of Thumb2 intended as a code generation target for JIT scenarios.

 

Windows requires Thumb2 support.  The advantage of using Thumb2 is that the combination of 16 and 32 bit opcodes along with some other ISA improvements allows for saving 20-30% code footprint at a 1-2% performance loss.  In addition the cache hit rate is improved due to increased density of the code.

 

CPSR Bit 5 should always be 1 as Windows only runs in Thumb2 mode.  Also note that this bit is combined with bit 24, the Java state bit (the J bit).  Bit 24 should always be 0 when running Windows.

 

The next bits to discuss are the CPU Mode bits 4-0 (M).  Windows only runs in two modes: User Mode (10000) and Supervisor Mode (10011).  If bits 4-0 hold any other value, an exception will be raised.  The kernel runs in Supervisor Mode, and applications run in User Mode.
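To tie the E, T, J, and M bits together, here is a small Python sketch that decodes them from a psr value.  The function and table names are hypothetical.  The J/T table reflects the architectural encoding of the instruction set state (T=1 with J=0 is reported as "Thumb", which covers the Thumb2 encodings), and the mode table lists only the two modes described above.

```python
# The J bit (24) and T bit (5) together select the instruction set state.
ISA_STATE = {(0, 0): "ARM", (0, 1): "Thumb", (1, 0): "Jazelle", (1, 1): "ThumbEE"}

# The two processor modes that Windows uses, keyed by CPSR bits 4-0.
WINDOWS_MODES = {0b10000: "User", 0b10011: "Supervisor"}


def decode_psr(psr: int) -> dict:
    """Decode the CPSR fields discussed in this article."""
    j = (psr >> 24) & 1
    t = (psr >> 5) & 1
    e = (psr >> 9) & 1
    mode = psr & 0x1F
    return {
        "endian": "big" if e else "little",
        "isa": ISA_STATE[(j, t)],
        "mode": WINDOWS_MODES.get(mode, f"unexpected ({mode:05b})"),
    }


# psr value from the debugger output earlier in the article.
print(decode_psr(0x00000133))
```

For the psr value 00000133 this reports little-endian, Thumb, and Supervisor, which is what we expect for a kernel debugger break.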

 

That brings up another point.  How does the processor switch between User Mode and Supervisor Mode?  On x86 processors this was done via SYSENTER/SYSEXIT, and on x64 processors via SYSCALL/SYSRET.  On ARM it is done via the SVC (Supervisor Call) instruction, which is used to have the kernel provide a service.  When SVC is invoked in ntdll.dll, the service number is in r12.  Here is an example:

1: kd> u ntdll!ZwQueryVolumeInformationFile

771e8674    f04f0c8d    mov   r12,#0x8D
771e8678    df01        svc   #1
771e867a    4770        bx    lr
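The opcode bytes in the listing above can be decoded by hand.  As a rough Python sketch (helper names are hypothetical): a 16-bit Thumb SVC has 0xDF in its upper byte and the immediate in its lower byte, and the 32-bit Thumb2 "mov Rd, #imm" is decoded here only for the simple case where the constant is a plain zero-extended 8-bit value (i = 0 and imm3 = 0).  Other modified-immediate forms are deliberately not handled.

```python
def decode_thumb_svc(hw: int):
    """Return the 8-bit immediate of a 16-bit Thumb SVC, or None if hw is not an SVC."""
    if (hw & 0xFF00) == 0xDF00:
        return hw & 0xFF
    return None


def decode_thumb2_mov_imm8(hw1: int, hw2: int):
    """Decode 'mov Rd, #imm8' (32-bit Thumb2 encoding), simple case only.

    Handles only i == 0 and imm3 == 0, where the constant is imm8 zero-extended.
    Returns (rd, imm) or None if the halfwords do not match that form.
    """
    # First halfword: 11110 i 0 0010 S 1111, with i required to be 0 (S is ignored).
    if (hw1 & 0xFFEF) != 0xF04F:
        return None
    # Second halfword: 0 imm3 Rd imm8, with imm3 required to be 0.
    if (hw2 & 0xF000) != 0:
        return None
    return (hw2 >> 8) & 0xF, hw2 & 0xFF


# Halfwords taken from the disassembly above: f04f0c8d and df01.
rd, service_number = decode_thumb2_mov_imm8(0xF04F, 0x0C8D)
print(f"mov r{rd},#0x{service_number:X}")
print(f"svc #{decode_thumb_svc(0xDF01)}")
```

This reproduces the disassembly: the service number 0x8D goes into r12, and the SVC immediate is 1.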

 

When SVC is executed, the previous CPSR register is saved in the SPSR register (the Saved Program Status Register), and the pc register is saved in the lr register (the Link Register).  The processor then changes to kernel mode (Supervisor Mode, 0x13) with interrupts disabled.  The lr and SPSR values are used to return from the SVC call.  More generally, when an exception is taken the stack is untouched and the previous mode's SP and LR are left alone; the new mode's SP becomes active, the exception address is stored in the new mode's LR, and the previous CPSR is copied into the new mode's SPSR.  When returning from the exception, the SPSR is copied back into the CPSR and execution returns to the address in LR.
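The bookkeeping described above can be modeled in a few lines.  This is a deliberately simplified Python sketch, not how the hardware or Windows implements it: the class and function names are made up, only the User and Supervisor modes are modeled, and other CPSR changes that happen on exception entry are ignored.  The handler address passed in is a placeholder.

```python
from dataclasses import dataclass, field

MODE_USR = 0b10000
MODE_SVC = 0b10011
I_BIT = 1 << 7   # IRQ disable bit in the CPSR


@dataclass
class Cpu:
    pc: int = 0
    cpsr: int = MODE_USR
    # Banked registers: each mode has its own SP and LR.
    sp: dict = field(default_factory=lambda: {MODE_USR: 0, MODE_SVC: 0})
    lr: dict = field(default_factory=lambda: {MODE_USR: 0, MODE_SVC: 0})
    spsr_svc: int = 0   # Supervisor mode's saved copy of the CPSR

    @property
    def mode(self) -> int:
        return self.cpsr & 0x1F


def take_svc(cpu: Cpu, return_address: int, handler: int) -> None:
    """Model exception entry for SVC: save state, switch mode, mask interrupts."""
    cpu.spsr_svc = cpu.cpsr                  # previous CPSR -> SPSR of the new mode
    cpu.lr[MODE_SVC] = return_address        # return address -> LR of the new mode
    cpu.cpsr = (cpu.cpsr & ~0x1F) | MODE_SVC | I_BIT
    cpu.pc = handler


def return_from_svc(cpu: Cpu) -> None:
    """Model exception return: restore the CPSR from the SPSR and branch to LR."""
    cpu.pc = cpu.lr[MODE_SVC]
    cpu.cpsr = cpu.spsr_svc


cpu = Cpu(pc=0x771E8678, cpsr=0x30)          # User mode, T bit set
cpu.sp[MODE_USR] = 0x0012F000                # placeholder user stack pointer
take_svc(cpu, return_address=0x771E867A, handler=0x80001000)  # placeholder handler
print(f"in handler: mode={cpu.mode:05b} user sp={cpu.sp[MODE_USR]:08x}")
return_from_svc(cpu)
print(f"after return: pc={cpu.pc:08x} cpsr={cpu.cpsr:08x}")
```

Note that the User mode SP and LR are never written during entry or return, which is the "previous mode's SP and LR are left alone" behavior.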

 

Data Types

ARMv7 processors support four data types from 8 bits to 64 bits, but the definitions are different from the ones in Windows.  In Windows a word is defined as 16 bits; on ARM a word is 32 bits.

Byte          8 bits
HalfWord      16 bits
Word          32 bits
DoubleWord    64 bits

 

These can be signed or unsigned.  A 32 bit register can hold any of the following:

  • Unsigned 32 bit integer
  • Signed 32 bit integer
  • Unsigned 16 bit integer (zero extended)
  • Signed 16 bit integer (sign extended)
  • Unsigned 8 bit integer (zero extended)
  • Signed 8 bit integer (sign extended)
  • Two 16 bit integers
  • Four 8 bit integers
  • The upper or lower 32 bits of a 64 bit signed value whose other half is in another register
  • The upper or lower 32 bits of a 64 bit unsigned value whose other half is in another register
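To make "zero extended", "sign extended", and the split 64 bit value concrete, here is a short Python sketch.  The helper names are hypothetical; each function returns the 32 bit pattern that would end up in a register.

```python
def zero_extend(value: int, bits: int) -> int:
    """Keep the low `bits` bits; the upper bits of the 32-bit register are 0."""
    return value & ((1 << bits) - 1)


def sign_extend(value: int, bits: int) -> int:
    """Copy the sign bit of a `bits`-wide value into the upper bits of a 32-bit register."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):      # sign bit set: value is negative
        value -= 1 << bits
    return value & 0xFFFFFFFF


def split64(value: int):
    """Split a 64-bit value into (low 32 bits, high 32 bits), one per register."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value & 0xFFFFFFFF, value >> 32


print(f"{zero_extend(0x80, 8):08x}")     # unsigned byte 0x80
print(f"{sign_extend(0x80, 8):08x}")     # signed byte 0x80 (-128)
print(f"{sign_extend(0x8000, 16):08x}")  # signed halfword 0x8000 (-32768)
low, high = split64(0x1122334455667788)
print(f"low={low:08x} high={high:08x}")
```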

 

Memory Model

The ARM memory model is much like other architectures that we have supported.  ARM has a "weak ordering" memory model.  This means that two memory operations that occur in program order may be observed from another processor or DMA controller in any order.  When an instruction stalls because it is waiting for the result of a preceding instruction, the core can continue executing subsequent instructions that do not depend on that result.  There are three instructions that allow you to insert memory barriers:

  • ISB - Instruction Synchronization Barrier
  • DMB - Data Memory Barrier
  • DSB - Data Synchronization Barrier
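To see why barriers matter, consider a writer that stores some data and then sets a flag, and an observer that reads the flag and then the data.  The following is a toy Python model, not a simulation of real hardware: it simply enumerates the orders in which the writer's two stores may become visible, and assumes the observer performs its two reads in order.  Passing barrier=True stands for placing a DMB between the writer's two stores.

```python
from itertools import permutations

# The writer's stores in program order: publish the data, then set the flag.
WRITER_STORES = [("data", 1), ("flag", 1)]


def observable_outcomes(barrier: bool) -> set:
    """Return every (flag, data) pair the observer could read in this toy model."""
    if barrier:
        orders = [tuple(WRITER_STORES)]             # stores become visible in program order
    else:
        orders = list(permutations(WRITER_STORES))  # weak ordering: any visibility order
    outcomes = set()
    for order in orders:
        # snapshots[k] is memory as the observer sees it once k stores are visible.
        snapshots = [{"data": 0, "flag": 0}]
        for name, value in order:
            snapshots.append({**snapshots[-1], name: value})
        # The observer reads flag first, then data at the same or a later point.
        for t_flag in range(len(snapshots)):
            for t_data in range(t_flag, len(snapshots)):
                outcomes.add((snapshots[t_flag]["flag"], snapshots[t_data]["data"]))
    return outcomes


print("without barrier:", sorted(observable_outcomes(barrier=False)))
print("with barrier:   ", sorted(observable_outcomes(barrier=True)))
```

The outcome to watch for is (1, 0): the observer sees the flag set but reads stale data.  In this model it is possible only when there is no barrier between the two stores.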

 

An excellent blog article on this topic with an explanation of these three instructions is available at:

https://blogs.arm.com/software-enablement/594-memory-access-ordering-part-3-memory-access-ordering-in-the-arm-architecture/

 

Alignment and Atomicity

Windows enables the ARM hardware to handle misaligned integer accesses transparently; however, there are still several situations where alignment faults may be generated on misaligned accesses. Follow the rules below:

  • Halfword and word-sized integer loads and stores do NOT need to be aligned (hardware will handle them efficiently and transparently)
  • Floating-point loads and stores SHOULD be aligned (the kernel will handle them transparently, but with significant overhead)
  • Load/store double (LDRD/STRD) and multiple (LDM/STM) operations SHOULD be aligned (the kernel will handle most of them transparently, but with significant overhead)
  • All uncached memory accesses MUST be aligned, even for integer accesses (you will get an alignment fault)
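The rules above can be summarized as a small decision function.  This Python sketch only restates the list; the function names, the category strings, and the idea of passing the required alignment as a parameter are assumptions made for illustration.

```python
def is_aligned(address: int, alignment: int) -> bool:
    """True if address is a multiple of the required alignment in bytes."""
    return address % alignment == 0


def misaligned_access_result(kind: str, address: int, alignment: int, cached: bool = True) -> str:
    """Classify an access as 'integer', 'float', or 'double_or_multiple' per the rules above."""
    if kind not in ("integer", "float", "double_or_multiple"):
        raise ValueError(f"unknown access kind: {kind}")
    if is_aligned(address, alignment):
        return "ok"
    if not cached:
        return "alignment fault"                 # uncached accesses MUST be aligned
    if kind == "integer":
        return "ok (handled by hardware)"        # halfword and word integer accesses
    return "slow (handled by the kernel)"        # float, LDRD/STRD, LDM/STM


print(misaligned_access_result("integer", 0x1001, 4))
print(misaligned_access_result("float", 0x1002, 4))
print(misaligned_access_result("integer", 0x1001, 4, cached=False))
```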

 

Note that the memcpy() implementation provided by the Windows CRT presumes the copies are to/from cached memory, and thus leverages the hardware’s support for transparently handling misaligned integer reads and writes with little penalty. This means that memcpy() CANNOT be used when the source or destination is uncached memory. Instead, use the separate function _memcpy_strict_align(), which only performs aligned accesses.

 

Two types of atomicity are supported: single-copy and multi-copy.

 

Single-copy atomicity

There are rules around atomicity that specify the cases where memory access behavior in relation to program order can be guaranteed.  Certain accesses (aligned word accesses) are guaranteed by the architecture to return sensible results even if other threads are accessing the same memory.  These rules are necessary so that the programmer (and compiler) can rely on correct behavior of memory in the majority of cases.

 

Multi-copy atomicity

These rules are similar, but relate specifically to multi-processing environments in which several observers may be using a particular item in memory.  To be able to guarantee correct behavior you need to be able to assume that memory behaves in a consistent way.

 

More on single-copy and multi-copy atomicity can be found in the ARM Architecture Reference Manual, available from https://infocenter.arm.com/help/index.jsp.

 

Common Assembly Instructions

We are going to cover some common Thumb2 instructions.

  • ldr           r0, [r4]                  (ldrex, ldrh, ldrb, ldrd, ldrexd, etc.)

    This is the Load Register instruction.  In the above example r0 is the destination register, and r4 is the base register.  This will take the address that is in r4, go to that memory location and copy the contents of that memory location into r0.

  • str           r2, [r4, #0x08]                    (strex, strh, strexh, strd, etc.)

    This is the Store Register instruction.  In the above example r2 is the source register, and r4 is the base register.  This will take the address in r4 and add 8 to that address.  It will take the value that is in r2, and store it at the address pointed to by r4 plus 8.

  • mov       r1, r4                                      (movs – sets the condition codes)

    This is the Move instruction.  In the above example r1 is the destination register, and r4 is the source register.  It does the same thing as on x86 in that it just copies what is in r4 to r1.  It can optionally update the condition flags based on the value.

  • adds      r1, r5, #0                              (add)

    This is the Add instruction.  In the above example r1 is the destination register.  This will take the value that is in r5 and add 0 to it.  It will store the result in r1.  Because this has an (s) at the end of add it will update the flags.

  • sub         sp, sp, #0x14                      (subs)

    This is the Subtract instruction.  In the above example sp is the destination.  This will take the value that is in sp, subtract 14h from it, and store the result in sp. Because this does not have an (s) at the end it will not update the flags.

  • push      {r4-r9, r11, lr}

    This is the Push instruction.  It can push multiple registers to the stack in one instruction.  You can specify a contiguous range of registers by giving the first register, "-", and the last register, as seen above.  You can also list registers individually, separated by ",".  This operates the same as an x86 processor in that it subtracts 4 from the stack pointer for each register pushed.

  • pop        {r4-r9, r11, lr}

    This is the Pop instruction.  It pulls values from the stack back into the registers you list.  The register list works just like it does for the push instruction.  This operates the same as an x86 processor in that it adds 4 to the stack pointer for each register popped.

  • b??         |MyApp!main+0x60 (00b81348)|

    This is the Branch instruction.  This is equivalent to the jmp instruction in x86.  It also has several conditional variants such as beq, bge, etc.

  • bx           r3

    This is the Branch and Exchange instruction.  This causes a branch to an address and instruction set specified by a register (r3 here).  This can do a long branch anywhere in the 32-bit address range.

  • bl            |MyApp!Function (00b815c4)|

    This is the Branch with Link instruction.  This calls a subroutine at a PC-relative address.  This will update the lr register.

  • blx          r3

    This is the Branch with Link and Exchange.  This calls a subroutine at an address and instruction set specified by a register (r3 here).  This will do a long branch anywhere in the 32-bit address range, and update the lr register.

  • dmb      

    This is the Data Memory Barrier instruction.  It is a memory barrier that ensures the ordering of observations of memory accesses.

  • cmp       r3, #0

    This is the Compare instruction.  It will subtract 0 from the value in r3, and set the flags accordingly. 
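A few of these instructions are easy to model.  The following Python sketch (all function names are made up for illustration) shows how adds and subs produce the N, Z, C, and V flags, how cmp is a subtract that keeps only the flags, and how push and pop move the stack pointer by 4 bytes per register.  It assumes the register list is given in ascending register order, with the lowest-numbered register stored at the lowest address.

```python
MASK32 = 0xFFFFFFFF


def adds(a: int, b: int):
    """32-bit add that also returns the updated N, Z, C, V flags."""
    a, b = a & MASK32, b & MASK32
    full = a + b
    result = full & MASK32
    flags = {
        "N": result >> 31,                              # result is negative
        "Z": int(result == 0),                          # result is zero
        "C": int(full > MASK32),                        # unsigned carry out
        "V": (((a ^ result) & (b ^ result)) >> 31) & 1, # signed overflow
    }
    return result, flags


def subs(a: int, b: int):
    """32-bit subtract that also returns the updated N, Z, C, V flags."""
    a, b = a & MASK32, b & MASK32
    result = (a - b) & MASK32
    flags = {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(a >= b),                               # C set means no borrow
        "V": (((a ^ b) & (a ^ result)) >> 31) & 1,
    }
    return result, flags


def cmp_flags(a: int, b: int) -> dict:
    """cmp is a subtract that discards the result and keeps only the flags."""
    return subs(a, b)[1]


def push(regs: dict, memory: dict, names: list) -> None:
    regs["sp"] -= 4 * len(names)                        # 4 bytes per register
    for i, name in enumerate(names):
        memory[regs["sp"] + 4 * i] = regs[name]


def pop(regs: dict, memory: dict, names: list) -> None:
    for i, name in enumerate(names):
        regs[name] = memory[regs["sp"] + 4 * i]
    regs["sp"] += 4 * len(names)


names = ["r4", "r5", "r6", "r7", "r8", "r9", "r11", "lr"]
regs = {name: index for index, name in enumerate(names, start=1)}
regs["sp"] = 0x1000
memory = {}
push(regs, memory, names)                               # push {r4-r9, r11, lr}
print(hex(regs["sp"]))                                  # 8 registers: sp moved down by 32
print(cmp_flags(0, 0))                                  # cmp r3, #0 with r3 == 0
```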

 

In ARM addressing the base register points to the memory being referenced.  The offset can be an immediate or an index register.  The memory at the base register's address plus the offset is accessed.  The base register remains unchanged.  Example:

ldr r5,[r9,#0x1c]

 

This will take the value that is in r9 and add 0x1C to it, go to that memory location, and retrieve the value there and store it in r5.  R9 will remain the same value.

 

ARM also has some interesting options for indexing.  There are three forms: Pre-Indexed addressing, Offset addressing, and Post-Indexed addressing.

 

With Pre-Indexed addressing, the value of the base register is first modified by the offset, and then the memory pointed to by the modified base register is accessed.  Example:

str r2,[r4,#0x4]!

 

The "!" at the end of the instruction is not a mistake.  This is how you tell it is a Pre-Indexed address. 

 

With Offset addressing, the offset is added to the value of the base register, and the sum is used as the address for the memory access.  The base register itself is not changed.  Without the "!", the previous example is just Offset addressing.  Example:

str r2,[r4,#0x4]

 

With Post-Indexed addressing, the memory at the address in the base register is accessed first, and afterwards the base register is modified by the offset value.  Example:

ldr pc,[sp],#0x1c

 

Notice the "!" is missing here.  Also notice the offset is outside the "[ ]".  That is how you can recognize Post-Indexed addressing.
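The difference between the three forms comes down to which address is used and whether the base register is written back.  Here is a minimal Python sketch of that behavior; the function names are hypothetical and the register values are made up for the example.

```python
MASK32 = 0xFFFFFFFF


def offset_address(regs: dict, base: str, offset: int) -> int:
    """[rN, #off]: access base + offset; the base register is unchanged."""
    return (regs[base] + offset) & MASK32


def pre_indexed_address(regs: dict, base: str, offset: int) -> int:
    """[rN, #off]!: update the base register first, then access the new address."""
    regs[base] = (regs[base] + offset) & MASK32
    return regs[base]


def post_indexed_address(regs: dict, base: str, offset: int) -> int:
    """[rN], #off: access the current base address, then update the base register."""
    address = regs[base]
    regs[base] = (regs[base] + offset) & MASK32
    return address


regs = {"r4": 0x2000, "sp": 0x3000}

addr = offset_address(regs, "r4", 0x4)          # str r2,[r4,#0x4]
print(f"offset:       access {addr:#x}, r4 = {regs['r4']:#x}")

addr = pre_indexed_address(regs, "r4", 0x4)     # str r2,[r4,#0x4]!
print(f"pre-indexed:  access {addr:#x}, r4 = {regs['r4']:#x}")

addr = post_indexed_address(regs, "sp", 0x1C)   # ldr pc,[sp],#0x1c
print(f"post-indexed: access {addr:#x}, sp = {regs['sp']:#x}")
```

In the post-indexed example the value is loaded from the original sp, and sp then moves up by 0x1c, which is why this form is handy for popping a value and releasing stack space in a single instruction.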

 

Part 3 of this series will cover Calling Conventions, Prolog/Epilog, and Rebuilding the stack.