September 2009

Volume 24 Number 09

Going Places - An Introduction to IPsec VPNs on Mobile Phones

By Ramon Arjona | September 2009

This afternoon, while I was away from the office, I got an e-mail on my phone. The message had a link to a document I was supposed to read -- a document on a SharePoint site available only through my company's intranet. This was a bummer because I had to wait until I could fire up my laptop, put my smart card in the smart card reader, get a Wi-Fi connection at a coffee shop, connect my laptop to the corporate VPN, and log on before I could read the document.

Life would be much easier if I could just use the phone to access the SharePoint site. Of course, my phone would need some magic way of connecting securely to the corporate network and authenticating me. In other words, like my laptop, my phone would need the capability to start a VPN connection to the network. Many commercial phone models, including Windows phones, come with a VPN client built in. But there are shortcomings in the most widely available VPN clients because they are based on version 1 of a specification called Internet Key Exchange (IKEv1). IKEv1 is a stable part of the IPsec framework and is great for wired devices or devices, like laptops, with relatively large batteries that don't move around too much. However, it is not ideal for mobile phones.

Sure, I use a wireless connection on my laptop to go from my office to a conference room and back again. I might switch from a wireless connection to a wired one and expect no loss of connectivity. But we don't move around nearly as much with a laptop as we do with a phone. A phone travels with you through traffic and in and out of buildings, transitioning into and out of the roaming state. Unless you use a cellular modem, when was the last time your laptop told you it was roaming? Generally, a phone changes its point of network attachment with a frequency and complexity that a laptop never has to worry about. The guys who thought about the IKEv1 spec didn't have to worry about mobile phone scenarios either, because in the late 1990s, when RFC 2409 was being written, smart phones just weren't prevalent in the marketplace. Since then, the use of mobile phones has skyrocketed, and the importance of the mobile phone has begun to approach the importance of other, larger computing devices, like the workstation or the laptop.

IKEv1 isn't well suited to a highly mobile style of computing because IKEv1 doesn't have a good way to cope with a host that might change its point of network attachment several times in a few seconds. When IKEv2 was drafted, a set of extensions called the Mobility and Multihoming (MOBIKE) protocol was also drafted to accommodate the mobile phone scenario. Using these extensions, a mobile phone VPN becomes far more practical. Several products on the market support IKEv2 and MOBIKE, including Microsoft System Center Mobile Device Manager (SCMDM).

In this article, I'll cover some of the basics of the technology behind IKEv2 and MOBIKE. I assume you have a working knowledge of IPv4 and networking, some familiarity with mobile phones, and a basic understanding of cryptography. This article is not going to cover IPsec in detail and is not going to discuss other sorts of VPN technologies, such as SSL-based VPNs. I also won't discuss IPv6. IPsec and IKE are extensions to IPv4, but they are baked into IPv6, so some of the things we'll touch on here will still be applicable on an IPv6 network. However, IPv6 introduces enough complexity and new terminology that it wouldn't be possible to cover it in sufficient detail.

Say ‘AH’

Three protocols make up the core of IPsec: Authentication Header (AH), Encapsulating Security Payload (ESP), and IKE. To address IKE, I first need to discuss AH and ESP.

In a nutshell, AH makes sure that the packets we send aren't tampered with. It protects the integrity of our packets. AH also ensures that the packets are sent from whoever claims to have sent them. It ensures the authenticity of our packets. AH does not, however, provide any measure of privacy or encryption. An attacker who gains access to the network could still sniff packets that are guarded by AH and extract their content. He would not, however, be able to masquerade as one of the authenticated parties nor be able to change our packets in transit. AH is defined in RFC 4302.

For instance, Alice and Bob have established a VPN connection between themselves and have chosen to protect their packets with AH. Charlie inserts himself onto the network and starts intercepting packets. Charlie is able to reconstruct Alice and Bob's conversation because he can see the content of their packets and discern what kind of traffic is passing between them. However, he can't forge a message from Alice to Bob without getting caught, and he can't alter Bob's messages to Alice, either. Both of these things are prevented by AH.

AH can be used in transport mode or tunnel mode. In transport mode, the payload of a packet is protected, and packets are routed directly from one host to another. In tunnel mode, the entire packet is protected by AH, and packets are routed from one end of an IPsec "tunnel" to another. Tunneling is accomplished through encapsulation of the original packet. At either end of the tunnel is a security gateway. The gateway that is sending a packet is responsible for encapsulating that packet by adding an "outer" IP header and address. This outer address routes the packet to the security gateway at the other end of the tunnel. The gateway that receives the packet is responsible for processing the AH to determine that the packet is from a valid sender and hasn't been tampered with, and then it routes the packet to its final destination.

In transport mode, an AH gets added to the packet right after the IP header. The AH comes before the next layer protocol in the packet (such as UDP or TCP) and also before any other IPsec headers in the packet, such as the ESP header. The IP header that precedes the AH must have the value 51 in its Protocol field. This is the magic number assigned by the IANA to AH, and it tells the application processing the packet that the next thing it will see is an AH. Figure 1 shows the shape of a packet after having an AH added to it.

In tunnel mode, the AH gets added after the new IP header. The encapsulated IP header is treated as part of the payload, along with everything else. This is shown in Figure 2. The first 8 bits of the AH specify the protocol ID of the payload that comes after the AH. This tells the receiver what to expect after the AH. For instance, if the next layer protocol is TCP, the protocol ID will be set to 6. This field is called the Next Header. (You might wonder why it's not called Protocol in a manner consistent with other headers in IPv4. The reason is consistency and compatibility with IPv6. Fortunately, you don't really need to know much about IPv6 to understand how AH works on your IPv4 network, but if you were wondering why it's called Next Header, that's why.)

Next comes the 8-bit Payload Len field, which gives the length of the AH in 32-bit words, minus 2. Because the field is only 8 bits wide, the AH can be at most 257 words, or 1,028 bytes, long. Then comes a long string of zeroes -- 2 bytes of them, in fact. These 2 bytes are reserved for future use according to the RFC, so until that future use is defined, we've got 2 empty bytes that are just along for the ride.

Following these 2 bytes is the Security Parameter Index (SPI). This is a 32-bit number that is used to determine which IPsec Security Association (SA) the AH is associated with. I'll talk in detail about SAs when I discuss IKEv2. This number is followed by a 32-bit sequence number, which is incremented with each packet sent and is used to prevent replay attacks.
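The fixed fields described so far can be sketched in a few lines of Python. This is an illustration of the layout in RFC 4302 (8-bit Next Header, 8-bit Payload Len, 16 reserved bits, 32-bit SPI, 32-bit sequence number), not code from any IPsec stack; the function names `pack_ah_fixed` and `parse_ah_fixed` are made up for this example.

```python
import struct

# Fixed part of the AH in network byte order:
# Next Header (8 bits), Payload Len (8 bits), Reserved (16 bits),
# SPI (32 bits), Sequence Number (32 bits).
AH_FIXED = struct.Struct("!BBHII")


def pack_ah_fixed(next_header, icv_len, spi, seq):
    """Build the 12 fixed bytes of an AH; icv_len is the ICV size in bytes."""
    if icv_len % 4 != 0:
        raise ValueError("this sketch assumes the ICV is a multiple of 4 bytes")
    total_words = (AH_FIXED.size + icv_len) // 4
    payload_len = total_words - 2  # length of the AH in 32-bit words, minus 2
    return AH_FIXED.pack(next_header, payload_len, 0, spi, seq)


def parse_ah_fixed(data):
    """Split the first 12 bytes of an AH back into named fields."""
    next_header, payload_len, reserved, spi, seq = AH_FIXED.unpack_from(data)
    return {
        "next_header": next_header,
        "payload_len": payload_len,
        "reserved": reserved,
        "spi": spi,
        "seq": seq,
    }


# A TCP payload (protocol ID 6) protected with a 12-byte ICV.
header = pack_ah_fixed(next_header=6, icv_len=12, spi=0x1234, seq=1)
fields = parse_ah_fixed(header)
```

A receiver would use `fields["spi"]` to look up the SA and compare `fields["seq"]` against the sequence numbers it has already seen to reject replays.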

After the sequence number comes the Integrity Check Value (ICV). The ICV is calculated at the sender by applying a keyed hashing function, such as HMAC with SHA-2, to the IP header, the AH, and the payload. (Header fields that legitimately change in transit, such as the TTL, are treated as zero for this calculation.) The receiver, who shares the key, checks that the packet hasn't been tampered with by applying the same function and confirming that the same value is produced.
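The ICV check can be sketched with Python's standard `hmac` module. The sketch assumes HMAC-SHA-256 truncated to 16 bytes and treats the protected bytes as one opaque blob; a real implementation would first zero the mutable IP header fields and the ICV field itself, which is skipped here. The names `compute_icv` and `verify_icv` are illustrative only.

```python
import hashlib
import hmac


def compute_icv(key, protected_bytes, icv_len=16):
    """Keyed hash over the protected bytes, truncated to icv_len bytes."""
    digest = hmac.new(key, protected_bytes, hashlib.sha256).digest()
    return digest[:icv_len]


def verify_icv(key, protected_bytes, received_icv):
    """Recompute the ICV and compare it in constant time."""
    expected = compute_icv(key, protected_bytes, len(received_icv))
    return hmac.compare_digest(expected, received_icv)


key = b"k" * 32  # assumed shared secret taken from the SA
packet = b"IP header + AH + payload"
icv = compute_icv(key, packet)

untouched_ok = verify_icv(key, packet, icv)
tampered_ok = verify_icv(key, packet + b"!", icv)
```

Because the hash is keyed, Charlie can't recompute a matching ICV after altering a packet unless he also knows the key.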

I Have ESP

Like AH, Encapsulating Security Payload (ESP) can provide integrity and authentication. Unlike AH, however, ESP provides authentication and integrity only for the payload of the packet, not for the packet header. ESP can also be used to provide confidentiality by encrypting the packet payload. In theory, these features of ESP can be enabled independently, so it is possible to have encryption without authentication and integrity, or to have integrity and authentication without encryption. In practice, however, doing one without the other doesn't make a great deal of sense. For instance, knowing that a message has been sent to me confidentially doesn't do me any good if I can't also be completely sure who sent it. ESP is defined in RFC 4303.

ESP, like AH, can be enabled in tunnel mode or transport mode. When both are used, the ESP header must be inserted after the AH. In transport mode, ESP encrypts the payload of the IP packet. In tunnel mode, ESP treats the entire encapsulated packet as the payload and encrypts it. This is illustrated in Figure 3.
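The packet shapes for the two protocols and two modes can be summarized as a small Python sketch. It only returns labels in wire order and does no encryption; the function name `ipsec_layout` and the labels are chosen for this example, and the ESP ICV is present only when ESP integrity is enabled.

```python
def ipsec_layout(protocol, mode, next_layer="TCP"):
    """Return the on-the-wire order of headers as a list of labels."""
    if protocol not in ("AH", "ESP") or mode not in ("transport", "tunnel"):
        raise ValueError("protocol must be AH or ESP; mode must be transport or tunnel")

    inner = [f"{next_layer} header", "data"]
    if mode == "tunnel":
        # The whole original packet, IP header included, becomes the payload.
        inner = ["original IP header"] + inner
        outer = "new IP header"
    else:
        outer = "IP header"

    if protocol == "AH":
        return [outer, "AH"] + inner
    # ESP wraps its payload: a header in front, a trailer and optional ICV behind.
    return [outer, "ESP header"] + inner + ["ESP trailer", "ESP ICV"]


ah_transport = ipsec_layout("AH", "transport")
esp_tunnel = ipsec_layout("ESP", "tunnel")
```

In the ESP case, everything between the ESP header and the ICV is what gets encrypted.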

The encryption algorithm is chosen through a process of negotiation between the peers that set up the IPsec SA. Advanced Encryption Standard (AES) is a common choice of encryption algorithm for modern implementations. Of course, to use AES, you must first have a shared secret for the two hosts that are about to start a secure communication. Manually giving the shared key to the host is an impractical approach because it doesn't scale, but you can use Diffie-Hellman (DH) key exchange to provision the secret.

Diffie-Hellman Groups

DH is a protocol that allows two parties to share a secret over an insecure channel, and it is integral to the negotiation that takes place in IKE. The shared secret that is communicated via DH can be used to create a communication channel that is securely encrypted. The math that goes into DH is complex, and I will not go into it in detail here. If you're interested in learning more about it, check the resources that cover modular exponential (MODP) groups and their application to DH. For our purposes, it suffices to say that a DH group is a specific collection of numbers with a mathematical relationship and a unique group ID.
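For readers who want a feel for the exchange without the full math, here is a toy run in Python. The prime 23 and generator 5 are textbook-sized numbers picked for this illustration and are far too small to be secure; real DH groups use primes that are hundreds or thousands of bits long.

```python
# Public values that both peers (and any eavesdropper) know.
p = 23  # toy prime modulus
g = 5   # toy generator

# Each peer picks a private exponent and never sends it.
alice_private = 4
bob_private = 3

# Each peer sends g^private mod p over the insecure channel.
alice_public = pow(g, alice_private, p)
bob_public = pow(g, bob_private, p)

# Each peer combines its own private value with the other's public value.
alice_secret = pow(bob_public, alice_private, p)
bob_secret = pow(alice_public, bob_private, p)
```

Both peers arrive at the same secret even though only the public values crossed the wire.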

A DH group is specified when a secure connection is being set up with IPsec, during the IKE negotiation. In this negotiation, the two peers trying to establish a secure connection need to find a DH group that they both support. DH groups with higher IDs have higher cryptographic strength. For instance, the first DH groups that are called out in the original IKE specification had about the same cryptographic strength as a symmetric key of between 70 and 80 bits. With the advent of stronger encryption algorithms such as AES, more strength was required from DH groups to prevent the DH groups from becoming a weak link in the cryptographic chain. Therefore, newer DH groups specified in RFC 3526 provide estimated strength between 90 and 190 bits.

The downside to these newer DH groups is that their greater strength comes at a cost: with the stronger groups, more processing time is required. This is one of the reasons why peers need to negotiate a mutually acceptable DH group. For instance, my phone might not have enough processing power to deal with DH group 15, so it wants to support only DH group 2. While trying to establish an IPsec connection with a server, my phone will propose DH group 2, and if the server supports DH group 2, that group will be used -- even if the server could potentially have used a stronger DH group. Of course, if the two peers can't agree on a common DH group, they won't be able to communicate.
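The group selection described above amounts to finding the first proposed group that the other side also supports. A minimal sketch, assuming the initiator lists its groups in order of preference; `choose_dh_group` is a made-up helper, not part of any IKE library.

```python
def choose_dh_group(proposed, supported):
    """Return the first proposed group the responder supports, or None."""
    for group in proposed:
        if group in supported:
            return group
    return None


server_groups = {2, 14, 15}

# The phone only offers group 2, so group 2 is used even though
# the server could have handled a stronger group.
phone_choice = choose_dh_group([2], server_groups)

# No group in common: the peers can't communicate.
no_match = choose_dh_group([15, 14], {2, 5})
```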

That's enough background. Let's talk about IKE.

I Like IKE (and So Should You)

IKE is used to establish an IPsec connection between peers. This connection is called a Security Association (SA). There are two kinds of SAs, the IKE_SA and the CHILD_SA. The IKE_SA is set up first. It is where the shared secret is negotiated over DH and where encryption and hashing algorithms are also negotiated. The CHILD_SA is where network traffic is sent, protected by AH, ESP, or both.

Every request in IKE requires a response, which makes it convenient to think in terms of pairs of messages. The first message pair is called IKE_SA_INIT and is used to decide what cryptographic algorithm and DH group the peers should use. Since cryptography is still being determined during this exchange, this message pair is not encrypted. The next pair of messages is called IKE_AUTH. This exchange authenticates the messages sent during IKE_SA_INIT and proves the identity of both the initiator and the responder. This step is necessary because the first message pair was sent in the clear -- having established a secure channel, the peers now need to prove to each other that they really are who they say they are and that they really meant to start this conversation. The IKE_AUTH message exchange also sets up the first CHILD_SA, which is frequently the only CHILD_SA created between the peers.

A CHILD_SA is a simplex -- or one-way -- connection, so CHILD_SAs are always set up in pairs. If one SA in a pair is deleted, the other should also be deleted. This is handled in IKE through INFORMATIONAL messages. Per RFC 4306, an INFORMATIONAL message contains zero or more Notification, Delete, or Configuration payloads. Let's say we have an initiator that's a PC and a responder that's a server. The PC decides to close the CHILD_SA connection and terminate the VPN. It sends an INFORMATIONAL message with a Delete payload to the server, which identifies the SA to delete by its SPI. The server then deletes this SA and sends a response to the PC to delete its half of the SA pair. The PC receives this message and deletes its half of the SA pair, and everything is great.
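The paired deletion can be modeled with a small Python sketch. It is a simplification for illustration: the `Peer` class and its method names are invented here, there is no networking, and it follows the convention in RFC 4306 that a Delete payload lists SPIs as the sender expects to see them on inbound packets.

```python
class Peer:
    """Tracks CHILD_SA pairs by SPI for one end of the VPN."""

    def __init__(self):
        self.inbound = set()   # SPIs expected on packets this peer receives
        self.outbound = set()  # SPIs this peer puts on packets it sends
        self.partner = {}      # outbound SPI -> inbound SPI of the same pair

    def add_pair(self, inbound_spi, outbound_spi):
        self.inbound.add(inbound_spi)
        self.outbound.add(outbound_spi)
        self.partner[outbound_spi] = inbound_spi

    def request_delete(self, inbound_spi):
        """Close our inbound SA and return the SPIs for the Delete payload."""
        self.inbound.discard(inbound_spi)
        return [inbound_spi]

    def handle_delete_request(self, spis):
        """Close the named SAs and reply with the SPIs of their partners."""
        reply = []
        for spi in spis:
            if spi in self.outbound:
                self.outbound.discard(spi)
                inbound_spi = self.partner.pop(spi)
                self.inbound.discard(inbound_spi)
                reply.append(inbound_spi)
        return reply

    def handle_delete_response(self, spis):
        """Close the other half of each pair named in the response."""
        for spi in spis:
            self.outbound.discard(spi)
            self.partner.pop(spi, None)


pc, server = Peer(), Peer()
pc.add_pair(inbound_spi=0x1001, outbound_spi=0x2002)
server.add_pair(inbound_spi=0x2002, outbound_spi=0x1001)

request = pc.request_delete(0x1001)
response = server.handle_delete_request(request)
pc.handle_delete_response(response)
```

If the response never reaches the PC, its outbound SA stays open while the server's half is gone, leaving the pair only partly deleted.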

Of course, things might not always work this way. Either an initiator or a responder might end up with an SA in a "half-closed" state, where one member of the SA pair is closed but the other is still open. The RFC specifies that this is an anomalous condition, but it doesn't allow for a peer to close these half-closed connections by itself. Instead, the peer is supposed to delete the IKE_SA if the connection state becomes sufficiently unstable -- but deleting the IKE_SA deletes all the CHILD_SAs that were created beneath it. Either case would be painful on a phone because keeping the CHILD_SA open would consume scarce system resources, as would having to tear down and rebuild the IKE_SA.

Also, it is possible for one peer at the end of an SA to disappear completely, without telling the system on the other side of the SA that it's gone. This is a circumstance that is particularly prone to occur with mobile phones. For instance, consider a scenario in which a mobile phone user has established an IPsec VPN connection with a server. The mobile phone user goes into a basement and loses her radio signal. The server has no way of knowing that the phone has disappeared into a black hole, so it continues to send messages on the CHILD_SA but receives no response. It would likewise be possible for the phone to start sending messages into a black hole because of routing issues on the cellular network. Anything that could cause one peer to lose track of another can lead to this condition, but the cost on the phone is greater than the cost on the server because the resources on the phone are more scarce.

Why Is It Bad to Tear Down and Rebuild the SA?

The short answer is that tearing down and rebuilding the SA uses expensive resources that the phone can't afford to waste. Specifically, the CPU is consumed by cryptographic calculations, and the radio is kept busy sending and receiving large amounts of data. Both cost battery life. The large amount of data being transferred also consumes bandwidth, which costs money -- especially in places where data plans charge by the kilobyte.

Because sending a message on the radio costs power, and power on the phone is limited, the phone needs to detect these black-hole situations and deal with them to conserve system resources. This is usually handled through a process called Dead Peer Detection (DPD), in which a peer that suspects that it might be talking to a black hole sends a message demanding proof of liveness. If the target of this request does not respond in an appropriate amount of time, the sender can take appropriate action to delete the IKE_SA and reclaim the resources being spent on it. In general, it's preferable to send DPD messages only when there's no other traffic traveling through the SA and the peer has reason to suspect that its partner on the SA is no longer there. While there's no requirement to implement DPD this way, it doesn't make much sense to confirm the liveness of a peer that's currently sending you other sorts of network traffic.
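The DPD policy described above (probe only when the SA has gone quiet, and give up if the probe goes unanswered) can be sketched as a small Python class. The class name, method names, and timeout values are assumptions made for this example; the RFC does not mandate any of them. Time is passed in explicitly so the logic is easy to follow.

```python
class DeadPeerDetector:
    """Decides when to probe a quiet peer and when to declare it dead."""

    def __init__(self, idle_timeout=30.0, probe_timeout=10.0):
        self.idle_timeout = idle_timeout    # quiet seconds before probing
        self.probe_timeout = probe_timeout  # seconds to wait for an answer
        self.last_heard = 0.0
        self.probe_sent_at = None

    def on_traffic(self, now):
        """Any packet from the peer proves liveness and cancels a probe."""
        self.last_heard = now
        self.probe_sent_at = None

    def should_probe(self, now):
        """Probe only if nothing has arrived lately and no probe is pending."""
        idle = now - self.last_heard >= self.idle_timeout
        return self.probe_sent_at is None and idle

    def on_probe_sent(self, now):
        self.probe_sent_at = now

    def peer_is_dead(self, now):
        """True once a pending probe has gone unanswered for too long."""
        if self.probe_sent_at is None:
            return False
        return now - self.probe_sent_at >= self.probe_timeout


dpd = DeadPeerDetector(idle_timeout=30.0, probe_timeout=10.0)
dpd.on_traffic(now=0.0)
probe_while_busy = dpd.should_probe(now=10.0)
probe_when_idle = dpd.should_probe(now=30.0)
dpd.on_probe_sent(now=30.0)
dead_too_early = dpd.peer_is_dead(now=35.0)
dead_after_wait = dpd.peer_is_dead(now=40.0)
```

Once `peer_is_dead` returns true, the phone can delete the IKE_SA and stop spending radio time on it.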

Another situation that can cause trouble on a VPN connection is a host changing its IP address. The IP address of a host is used along with the 32-bit SPI to identify a particular host and associate it with an SA. When a host loses its IP address, this association is also lost, and the SA needs to be torn down and re-created with the new IP address.

As we've said before, this is not much of a problem with desktops or laptops. A PC might lose a DHCP lease and get a new IP address, but most DHCP implementations make it quite likely that the PC will be assigned the same IP address it had before. In other words, desktop PCs don't change IP addresses very often. Laptops, because they are mobile, can change their point of network attachment and therefore get a new IP address, forcing any SA they have open to be destroyed and re-created. However, the rate at which laptops switch IP addresses is still relatively infrequent when compared with the rate at which a phone does. For instance, a phone that has the ability to transmit data via Wi-Fi and cellular channels might switch networks every time a user walks in or out of her office building as the phone changes from a Wi-Fi to a GPRS connection and back again. Unlike a laptop moving around an office building, which might remain on the same network link and therefore continue to have a topologically correct network address without a change, the phone has switched between two fundamentally different networks, so it is virtually guaranteed to change IP addresses. This results in an interrupted connection on the phone whenever a handoff between networks occurs. The same thing can happen while the phone is on the cellular network alone. The phone might enter roaming mode and switch from its home network to a foreign network owned by a different mobile operator. The phone might move from one area of coverage to another and become attached to a completely different part of the mobile operator's network.

There are any number of other reasons controlled by the mobile operator that could cause an interrupted connection. This frequent tearing down and rebuilding of the SA would make the mobile VPN intractable were it not for the extensions to IKEv2 known as MOBIKE.


The IKEv2 MOBIKE protocol is defined in RFC 4555. It allows peers in an IPsec VPN to advertise that they have multiple IP addresses. One of these addresses is associated with the SA for that peer. If the peer is forced to switch its IP address because of a change in network attachment, one of the IP addresses previously identified, or a newly assigned address, can be swapped in without having to tear down and rebuild the SA.

To indicate its ability to use MOBIKE, a peer includes a MOBIKE_SUPPORTED notification in the IKE_AUTH exchange. The IKE_AUTH exchange also includes the additional addresses for the initiator and the responder. The initiator is the device that started the setup of the VPN by sending the first IKE message, and is responsible for making decisions about which of the IP addresses to use from among those it has available and those offered to it by the responder.

As the RFC points out, the initiator is generally the mobile device because the mobile device has more awareness of its position on the network, and as a result it is better suited to make decisions about which addresses to use. However, the RFC does not specify how these decisions should be made. Generally, one end of the IPsec VPN will be a mobile device, and the other end will be a stationary security gateway server. The specification doesn't require this implementation, and does allow both ends of the connection to move -- but it doesn't provide a way for the two peers to find each other again if they move at the same time. That is, if one peer updates its address and the other peer does the same thing at the same time, there is no opportunity to communicate this change to either peer, and the VPN connection will be lost.

The initiator uses the responder's address list to figure out the best address pair to use for the SA. The responder doesn't use the initiator's addresses, except as a means of communicating to the initiator that the responder's address has changed.

For instance, when the initiator sees that its address has changed, it notifies the responder of this fact with an INFORMATIONAL message that contains an UPDATE_SA_ADDRESSES notification. This message uses the new address, which also starts being used in the peer's ESP messages. The receiver of the update notification records the new address and optionally checks for return routability to be sure that the address belongs to the other mobile node as is claimed. Following this, the responder starts using the new address for its outgoing ESP traffic.
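The key point of the address update is that only the address recorded for the SA changes, while the SPI and keys stay as they were. A minimal Python sketch, with invented names (`SecurityAssociation`, `handle_update_sa_addresses`) and a placeholder callback standing in for the optional return routability check:

```python
from dataclasses import dataclass


@dataclass
class SecurityAssociation:
    spi: int
    peer_address: str
    keys: bytes  # negotiated key material, untouched by an address update


def handle_update_sa_addresses(sa, packet_source_address, is_routable=lambda addr: True):
    """Responder side: adopt the address the update notification arrived from.

    is_routable stands in for the optional return routability check.
    """
    if not is_routable(packet_source_address):
        return False
    sa.peer_address = packet_source_address
    return True


sa = SecurityAssociation(spi=0x1001, peer_address="192.0.2.10", keys=b"negotiated-keys")

# The phone walks out of the building and moves from Wi-Fi to the cellular network.
accepted = handle_update_sa_addresses(sa, "198.51.100.7")

# An update from an address that fails the check is ignored.
rejected = handle_update_sa_addresses(sa, "203.0.113.9", is_routable=lambda addr: False)
```

Compare this with the IKEv1 behavior, where the same move would have forced a full teardown and renegotiation.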

Of course, the initiator or responder might not know all the IP addresses it will ever have for the lifetime of the SA. A peer can advertise a change in the list of addresses it supports with an INFORMATIONAL message. If the peer has only one address, this address is present in the header, and the message contains the NO_ADDITIONAL_ADDRESSES notification. Otherwise, if the peer has multiple addresses, one of these addresses is put in the header of the INFORMATIONAL message, and each of the others is included in its own ADDITIONAL_IP4_ADDRESS notification.

This list is not an update -- it is the whole list of addresses that the peer wants to advertise at that time. In other words, the whole list is sent every time, but this cost is still lower than the cost of tearing down and rebuilding the SA every time the phone changes its point of network attachment.
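The "whole list every time" rule can be sketched in Python as follows. RFC 4555 names the notification ADDITIONAL_IP4_ADDRESS and carries one address per notification; the function names and the tuple representation are assumptions made for this illustration, not a wire format.

```python
def build_address_notifications(addresses):
    """Return (header_address, notifications) for an INFORMATIONAL message."""
    if not addresses:
        raise ValueError("a peer needs at least one address")
    header_address, *others = addresses
    if not others:
        return header_address, [("NO_ADDITIONAL_ADDRESSES", None)]
    return header_address, [("ADDITIONAL_IP4_ADDRESS", addr) for addr in others]


def apply_address_notifications(header_address, notifications):
    """Receiver side: the new list replaces the old one; nothing is merged."""
    extra = [addr for name, addr in notifications if name == "ADDITIONAL_IP4_ADDRESS"]
    return [header_address] + extra


# The phone has a Wi-Fi address and a cellular address.
header, notes = build_address_notifications(["192.0.2.10", "198.51.100.7"])
known = apply_address_notifications(header, notes)

# Later it drops off Wi-Fi and advertises only the cellular address.
header, notes = build_address_notifications(["198.51.100.7"])
known_after = apply_address_notifications(header, notes)
```

Because each message replaces the previous list, the receiver never has to reconcile partial updates that arrive late or out of order.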

Wrapping Up

You should now have a basic idea of how an IPsec VPN with MOBIKE would function on a mobile phone.

The growing prevalence of smart phones in the home and workplace is going to make these solutions more important as users begin to demand an experience that matches that on more resource-rich computing devices. For the time being, people are willing to accept phones that can't connect to the corporate network, that have only a day of battery life, and that suffer from the other shortcomings of the smart phone we are all familiar with. This won't last. Competition and the emergence of better hardware will force the adoption of more complete, end-to-end solutions that enable experiences for the phone that are on par with the laptop and desktop.

And then I, along with everyone else, will be able to browse corporate SharePoint sites just by clicking a link on my phone.

Special thanks to Melissa Johnson for her suggestions and technical review of this article.

Ramon Arjona is an SDET lead at Microsoft.