Office 365 Connectivity Guidance: Part 3
3. Optimize Route Length & Avoid Network Hairpins
The third principle is to optimize route length and avoid network hairpins.
In the majority of cases, the shortest path to the Microsoft global network will offer the highest levels of performance to your users. As we discovered in part two, this is because Microsoft's global network is available worldwide to transport your data, and local service front door elements for Office 365 are available close to your users.
A network hairpin occurs when we use the WAN to route traffic a long distance back to a centralized location and egress to the internet there, only to route out to an Office 365 endpoint which is likely available near the user. Similarly, a hairpin occurs when remote users' traffic is routed over a VPN back into the corporate network to egress to the internet via the security stack there.
Take the diagram below as an example. Hairpins occur in two places: the branch office routes its traffic through the Head Office egress, and the remote users are forced to tunnel through a VPN concentrator into the corporate network, then out through the standard egress, to reach Office 365 services. In both cases, users in these locations are likely to experience suboptimal performance with the service, and Contoso incurs considerable cost to deliver this model.
Even routing Office 365 traffic into Azure (via a proxy hosted there, for example) would be considered a hairpin. There is a good chance the endpoint we're talking to does not live within the datacentre but out on the network edge, so traffic will hairpin into Azure and back out to the network edge before having to be routed back into the datacentre again. There are also numerous Office 365 endpoints which do not live in Microsoft's infrastructure, such as CDN, CRL and DNS endpoints, which will also end up hairpinning.
Incorrect DNS configuration could also cause a hairpin. As we learnt in the previous part, DNS needs to be resolved locally so as to find a local service front door. Performing DNS lookups at a different location to the egress risks traffic hairpinning to a service front door near where the DNS call was completed. As an example, if I egressed in Sydney, Australia but DNS was resolved in NYC, my Outlook client would be directed to a CAFE server in the North America region rather than one in Australia.
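One rough way to spot this kind of DNS hairpin is to run the same lookup from a client and from a machine at the egress location, then compare the answers. The following is a minimal sketch only, using Python's standard socket.gethostbyname_ex (IPv4 only). The function name resolve_front_door and the injectable lookup parameter are hypothetical names introduced for illustration, and differing answers are only a hint to investigate further, not proof of a hairpin.

```python
import socket


def resolve_front_door(hostname, lookup=socket.gethostbyname_ex):
    """Resolve hostname and return what the local resolver handed back.

    `lookup` is injectable (an assumption of this sketch) so the function
    can be exercised without network access.
    """
    # gethostbyname_ex returns (canonical name, alias list, IPv4 address list)
    canonical, aliases, addresses = lookup(hostname)
    return {
        "queried": hostname,
        "canonical": canonical,
        "aliases": list(aliases),
        "addresses": sorted(addresses),
    }


# Example (needs DNS access, so it is left commented out):
# print(resolve_front_door("outlook.office365.com"))
```

Running this on a user's machine and again at the internet egress point gives two results to compare. If the canonical names or addresses point at different regions, it is worth checking where DNS is actually being resolved.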
Whilst Microsoft's strongly recommended best practice is to send at least the Optimize marked endpoints direct to the service, care should also be taken when choosing a vendor in the cloud-based security/network access field, if applicable. If this model is in use, take great care to ensure your chosen vendor is able to provide infrastructure close to your users. For example, performance is likely to be suboptimal if my cloud proxy vendor's nearest nodes are in Singapore but my users are in Sydney. Also check how well they peer onto Microsoft's infrastructure, i.e. that this is done locally to their resources. In essence, it's essential to check how well the provider conforms to the connectivity principles outlined in this article series, for example, how well they allow you to differentiate Office 365 traffic from normal web traffic (see part 1).
Moving on to optimized route length: if we've delivered direct, local egress for our users' Office 365 traffic, how do we test that the route length is optimal and that our ISP is peering well with Microsoft's network?
Using a simple tracert to known endpoints, we can look at both the latency and the routing path to Microsoft's network. Tracert will only work when there is a direct path to the endpoint, so on a corporate network where a proxy is used to egress the traffic (or ICMP is blocked), we'd have to run the trace at the edge of the corporate network, in front of or on the proxy.
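If you want to collect traces from several machines in a repeatable way, the command can be wrapped in a small script. This is a sketch under assumptions: the helper names build_trace_command and run_trace are hypothetical, and it assumes tracert is present on Windows and traceroute on other systems. The same caveat applies as when running the command by hand: it needs a direct path to the endpoint.

```python
import platform
import subprocess


def build_trace_command(host, system=None, max_hops=30):
    """Return the argument list for a route trace on the given OS."""
    system = system or platform.system()
    if system == "Windows":
        # tracert: -h sets the maximum number of hops
        return ["tracert", "-h", str(max_hops), host]
    # traceroute (Linux/macOS): -m sets the maximum number of hops
    return ["traceroute", "-m", str(max_hops), host]


def run_trace(host, timeout_s=180):
    """Run the trace and return its text output.

    Only useful where there is a direct path to the endpoint
    (no proxy in between, ICMP not blocked).
    """
    result = subprocess.run(
        build_trace_command(host),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout


# Example (runs a real trace, so it is left commented out):
# print(run_trace("outlook.office365.com"))
```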
Test endpoints can be as per the following:
Service: Outlook/Exchange Online
- Endpoint to test: outlook.office365.com
- tracert outlook.office365.com
1 1 ms 1 ms 1 ms internal router 1 [10.16.9.2]
2 1 ms 1 ms 1 ms internal router 2 [10.3.7.17]
3 3 ms 2 ms 2 ms internal router 3 [10.3.7.7]
4 2 ms 2 ms 2 ms internal router 4 [10.3.7.12]
7 3 ms 3 ms 3 ms ISP router 1 [194.x.x.x]
8 4 ms 3 ms 5 ms ISP router 2 [194.x.x.x]
10 4 ms 6 ms 4 ms ae26-0.icr01.lon24.ntwk.msn.net [188.8.131.52]
11 13 ms 13 ms 14 ms be-120-0.ibr02.lon24.ntwk.msn.net [184.108.40.206]
12 13 ms 13 ms 13 ms be-8-0.ibr02.dub08.ntwk.msn.net [220.127.116.11]
13 13 ms 13 ms 13 ms ae120-0.icr01.dub08.ntwk.msn.net [18.104.22.168]
14 12 ms 12 ms 14 ms ae22-0.db3-96c-3a.ntwk.msn.net [22.214.171.124]
16 17 ms 13 ms 14 ms 126.96.36.199
In the above tracert, taken in the UK on a corporate network, you can see the connectivity hit Microsoft's backbone (msn.net) in LON24 (London) in 4 ms, at hop 10. So peering here is excellent and in an expected place. Further, we can see the CAFE server we are using at this point in time is in Dublin (db3), at hop 14.
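Reading the entry point into Microsoft's network off a trace can also be done programmatically. The sketch below parses tracert-style hop lines and returns the first hop whose name ends in ntwk.msn.net, together with its best round-trip time. The function names and the regular expressions are assumptions made for illustration; the sample lines follow the trace above, plus one timed-out hop added to show how those are handled.

```python
import re

# hop number, three RTT columns ("4 ms", "<1 ms" or "*"), then the target
HOP_RE = re.compile(r"^\s*(\d+)\s+((?:(?:<?\d+\s+ms|\*)\s+){3})(.+)$")
RTT_RE = re.compile(r"<?(\d+)\s+ms")

SAMPLE_TRACE = """\
  8     4 ms     3 ms     5 ms  ISP router 2 [194.x.x.x]
 10     4 ms     6 ms     4 ms  ae26-0.icr01.lon24.ntwk.msn.net [188.8.131.52]
 11    13 ms    13 ms    14 ms  be-120-0.ibr02.lon24.ntwk.msn.net [184.108.40.206]
 13     *        *        *     Request timed out.
 16    17 ms    13 ms    14 ms  126.96.36.199
"""


def parse_hops(trace_text):
    """Turn tracert output into a list of hop dictionaries."""
    hops = []
    for line in trace_text.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # header lines, blank lines, etc.
        rtts = [int(value) for value in RTT_RE.findall(match.group(2))]
        hops.append({
            "hop": int(match.group(1)),
            "best_ms": min(rtts) if rtts else None,  # None when all probes timed out
            "target": match.group(3).strip(),
        })
    return hops


def first_microsoft_hop(hops, suffix="ntwk.msn.net"):
    """Return the first hop whose host name is on Microsoft's backbone."""
    for hop in hops:
        parts = hop["target"].split()
        if parts and parts[0].lower().endswith(suffix):
            return hop
    return None
```

Calling first_microsoft_hop(parse_hops(SAMPLE_TRACE)) picks out hop 10, matching the manual reading of the trace. Hops where every probe timed out are kept with a best_ms of None rather than dropped.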
Service: SharePoint/OneDrive for Business
- Endpoint to test: tenantname.sharepoint.com (e.g. contoso.sharepoint.com)
- tracert contoso.sharepoint.com (changing contoso to your tenant name)
1 <1 ms <1 ms <1 ms – Local gateway 192.168.1.1
4 6 ms 6 ms 6 ms ISP Router 1
5 8 ms 6 ms 6 ms ISP Router 2
7 12 ms 9 ms 10 ms ISP Router 3
8 16 ms 15 ms 9 ms msft-decix-02-dxb30.ntwk.msn.net [188.8.131.52]
9 12 ms 11 ms 11 ms ae26-0.icr01.dxb20.ntwk.msn.net [184.108.40.206]
10 40 ms 40 ms 40 ms be-120-0.ibr02.dxb20.ntwk.msn.net [220.127.116.11]
11 40 ms 39 ms 39 ms be-7-0.ibr02.bom30.ntwk.msn.net [18.104.22.168]
12 39 ms 44 ms 39 ms ae21-0.bom01-96cbe-1b.ntwk.msn.net [22.214.171.124]
13 * * * Request timed out.
14 * * * Request timed out.
15 39 ms 40 ms 39 ms 126.96.36.199
In this example, taken in Dubai, UAE, we see the peering occur at Microsoft's new peering location in Dubai (DXB30) at hop 8. As we don't have a full set of Office 365 services live in the Dubai region at the time of writing, the AFD endpoint being used is bom01 at hop 12 (Mumbai). Again, this peering is in an expected location (see the table in the previous part to see where peering is available).
If we trace to the same endpoint from a machine in the UK, we see the following:
Tracing route to spo-0004.spo-msedge.net [188.8.131.52]
1 1 ms 1 ms 1 ms network gateway - [192.168.1.1]
3 16 ms 23 ms 13 ms winn-core-2b-xe-030-0.network.virginmedia.net [184.108.40.206]
6 25 ms 20 ms 20 ms tcl5-ic-5-ae0-0.network.virginmedia.net [220.127.116.11]
7 21 ms 19 ms 17 ms ae23-0.lts-96cbe-1b.ntwk.msn.net [18.104.22.168]
8 22 ms 21 ms 25 ms ae25-0.icr02.lon22.ntwk.msn.net [22.214.171.124]
9 22 ms 18 ms 20 ms ae28-0.lon04-96cbe-1b.ntwk.msn.net [126.96.36.199]
10 21 ms 21 ms 21 ms ae20-0.lon21-96cbe-1b.ntwk.msn.net [188.8.131.52]
13 21 ms 26 ms 23 ms 184.108.40.206
Above, you can see the ISP peer with Microsoft in London (LTS/LON22), which is also where the AFD endpoint resides. If we trace to the same endpoint from the same location via a different ISP, we also see this ISP peering in London:
Tracing route to spo-0004.spo-msedge.net [220.127.116.11]
1 1 ms 4 ms 3 ms Network Gateway [192.168.1.1]
4 7 ms 7 ms 7 ms 18.104.22.168
5 12 ms 8 ms 8 ms 22.214.171.124
6 11 ms 9 ms 8 ms peer2-et0-0-0.slough.ukcore.bt.net [126.96.36.199]
7 11 ms 9 ms 9 ms 188.8.131.52
8 11 ms 10 ms 15 ms ae3-0.lon21-96cbe-1a.ntwk.msn.net [184.108.40.206]
13 9 ms 9 ms 10 ms 220.127.116.11
An important point to note: you shouldn't expect the perfect peering location to be used in every instance. The important thing is that it's in a sensible location. For example, a connection in Berlin, Germany may peer with Microsoft in Frankfurt (despite peering being available in Berlin), as the exact location is down to the ISP's preference and capability. However, an ISP in Berlin should not be peering with Microsoft in NYC, for example, as this will undoubtedly cause performance issues.
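To put rough numbers on "sensible", we can compare the great-circle distance between the user and the peering location using the haversine formula. This is a sketch only. The city coordinates are approximate, and the 1,000 km threshold is an arbitrary assumption chosen for illustration, not a Microsoft recommendation.

```python
import math

# Approximate (latitude, longitude) in degrees; illustrative values only
CITIES = {
    "Berlin": (52.52, 13.405),
    "Frankfurt": (50.11, 8.68),
    "New York": (40.71, -74.01),
}


def great_circle_km(a, b, radius_km=6371.0):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


def peering_is_sensible(user_city, peering_city, max_km=1000.0):
    """True when the peering point is within max_km (an assumed threshold)."""
    distance = great_circle_km(CITIES[user_city], CITIES[peering_city])
    return distance <= max_km
```

With these inputs, Berlin to Frankfurt comes out at a little over 400 km and passes the check, while Berlin to New York is well over 6,000 km and fails it. Geographic distance is only a proxy, as the real path depends on how the ISP routes, but it is a quick sanity check.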
You'll notice most router names have a location code in them to help us identify where they are. These are usually the IATA city or airport codes for the area, BOM being the code for Mumbai airport (Bombay being the old official name for Mumbai).
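As an illustration of reading these codes, the sketch below pulls a three-letter code out of an msn.net router name and looks it up in a small table. The naming pattern (three letters followed by two digits in the label just before ntwk.msn.net) is inferred from the traces in this article and is an assumption, not a documented convention. Names that don't fit it, such as the lts and db3 routers, simply come back as unknown. The lookup table only covers the codes mentioned here.

```python
import re

# Assumed pattern: three-letter location code followed by a two-digit site number
SITE_TOKEN = re.compile(r"^([a-z]{3})\d{2}$")

# Codes seen in this article only; not a complete list
CITY_BY_CODE = {"lon": "London", "dub": "Dublin", "bom": "Mumbai", "dxb": "Dubai"}


def location_code(hostname, suffix=".ntwk.msn.net"):
    """Return the three-letter location code from a router name, or None."""
    host = hostname.lower().rstrip(".")
    if not host.endswith(suffix):
        return None  # not a Microsoft backbone router name
    # The site label is the last DNS label before the suffix, e.g. "lon24"
    # or "bom01-96cbe-1b"
    site_label = host[: -len(suffix)].split(".")[-1]
    for token in site_label.split("-"):
        match = SITE_TOKEN.match(token)
        if match:
            return match.group(1)
    return None


def describe(hostname):
    """Map a router name to a city name where the code is known."""
    code = location_code(hostname)
    if code is None:
        return "unknown"
    return CITY_BY_CODE.get(code, code.upper())
```

For example, describe("ae21-0.bom01-96cbe-1b.ntwk.msn.net") gives Mumbai, and describe("msft-decix-02-dxb30.ntwk.msn.net") gives Dubai.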
Service: Teams
- Endpoint to test: world.tr.teams.microsoft.com
- tracert world.tr.teams.microsoft.com
Again in this trace, as it's taken in the UK, we can see the peering occurring in London (as you'd expect) and the traffic hitting a Teams relay server in London (lon12).
Tracing route to world.tr.teams.microsoft.com [18.104.22.168]
1 12 ms 18 ms 5 ms Local Network Gateway [192.168.1.1]
3 9 ms 16 ms 16 ms winn-core-2a-xe-133-0.network.virginmedia.net [22.214.171.124]
7 33 ms 23 ms 22 ms tcl5-ic-5-ae0-0.network.virginmedia.net [126.96.36.199]
8 24 ms 18 ms 22 ms ae23-0.lts-96cbe-1b.ntwk.msn.net [188.8.131.52]
9 19 ms 21 ms 19 ms ae25-0.icr02.lon22.ntwk.msn.net [184.108.40.206]
10 21 ms 20 ms 21 ms ae28-0.lon04-96cbe-1b.ntwk.msn.net [220.127.116.11]
11 22 ms 23 ms 19 ms ae3-0.lon21-96cbe-1a.ntwk.msn.net [18.104.22.168]
14 27 ms 32 ms 25 ms world.tr.teams.microsoft.com [22.214.171.124]