A single Direct Connect circuit is a private 1/10/100 Gbps pipe with no SLA worth printing. The resiliency lives entirely in how you pair connections across locations and devices, how BGP converges when one fails, and whether you have an encrypted path to fall back to. This guide builds the maximum resiliency model end to end: four connections across two Direct Connect locations, transit VIFs landing on a Transit Gateway through a Direct Connect Gateway, BGP tuned for fast failover, and a Site-to-Site VPN backup riding the public internet.
1. Resiliency models and what the SLA actually covers
AWS publishes three resiliency models, and the SLA you can claim is a direct function of which one you build. The Direct Connect SLA (99.99% for the maximum model) is only honored if your topology matches the requirement.
| Model | Topology | Survives | SLA |
|---|---|---|---|
| Maximum | Two DX locations, each with redundant devices | Device failure and full location failure | 99.99% |
| High | One connection at each of two DX locations | Full location failure, single connection failure | 99.9% |
| Single (dev/test) | Two connections at one location | Single connection or device failure only | None |
The trap is the dev/test model: two connections at the same location share a building, so a facility outage or fiber cut takes both down. For production hybrid, the maximum model is the only one that earns the 99.99% number.
The SLA measures availability of the service. If your BGP config blackholes traffic during a failover, the circuit was “up” and you still had an outage. Resiliency is a property of your routing, not just your cabling.
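To put the SLA percentages in concrete terms, here is a minimal sketch that converts an availability figure into a downtime budget. It assumes a 30-day month and treats the percentage as a plain uptime fraction; the function name is illustrative, and the SLA terms themselves define how downtime is actually measured.

```python
def downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return (100 - sla_percent) / 100 * total_minutes


# Percentages taken from the resiliency model table above.
for model, sla in [("maximum", 99.99), ("high", 99.9)]:
    print(f"{model}: {downtime_minutes(sla):.1f} min per 30 days")
```

The gap is an order of magnitude: roughly four minutes a month for the maximum model against roughly forty-three for the high model.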
For the maximum model you order two connections at DX Location A (on separate AWS devices) and two more at DX Location B — four connections, four cross-connects, ideally on diverse fiber to each facility.
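A quick way to sanity-check an order before it goes to the colo is to classify the planned connections against the table above. The sketch below is an illustration only: the location and device labels are hypothetical, and the rules are a simplified reading of the three models, not an AWS API.

```python
from collections import defaultdict


def classify(connections):
    """Classify a design from (location, device) pairs, one per connection."""
    devices_by_location = defaultdict(set)
    for location, device in connections:
        devices_by_location[location].add(device)
    device_counts = [len(devs) for devs in devices_by_location.values()]

    # Maximum: at least two locations, each with at least two distinct devices.
    if len(device_counts) >= 2 and all(n >= 2 for n in device_counts):
        return "maximum"
    # High: at least two locations, but not device-redundant at each.
    if len(device_counts) >= 2:
        return "high"
    # Dev/test: one location with at least two distinct devices.
    if device_counts and device_counts[0] >= 2:
        return "dev/test"
    return "none"


design = [("locA", "dev1"), ("locA", "dev2"), ("locB", "dev1"), ("locB", "dev2")]
print(classify(design))  # maximum
```

Two connections on the same device at one location classify as "none", which is the same point made about LAGs in the next section.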
2. Connections, LAGs, and virtual interfaces decoded
Three layers stack on each physical port; conflating them is the most common design mistake.
- Connection — the physical port (1/10/100 Gbps) at a DX location, terminated by a cross-connect to your router or provider.
- LAG (Link Aggregation Group) — bundles 1-4 connections on the same AWS device into one logical link via LACP. It raises bandwidth and survives a single port failure, but every member lands on one device at one location, so a LAG is not a cross-location resiliency boundary. Do not confuse it with the maximum model.
- Virtual Interface (VIF) — the logical Layer 3 attachment carrying a BGP session. Three types:
| VIF type | Reaches | Use with |
|---|---|---|
| Private VIF | A single VPC via a VGW (or DX Gateway) | One-VPC hybrid |
| Public VIF | AWS public endpoints (S3, public APIs) over private fiber | Avoiding the internet for public endpoints |
| Transit VIF | A Transit Gateway via a Direct Connect Gateway | Many VPCs / many regions |
For a Transit Gateway design you want transit VIFs. A transit VIF attaches to a Direct Connect Gateway (DXGW), which in turn associates to one or more Transit Gateways. With four connections you get four transit VIFs and four BGP sessions, and the DXGW load-balances and fails over across them.
A limit to plan around: transit VIFs per dedicated connection are capped by a small quota (check the current Direct Connect quotas for your port speed), and a hosted connection carries only a single VIF of any type. For a pure TGW design, make all four transit.
3. Step 1 - Order connections and stand up the Direct Connect Gateway
Order connections from the console or CLI; the cross-connect and Letter of Authorization (LOA-CFA) steps are manual. Request the connections first, on separate devices per location:
# Location A, device 1
aws directconnect create-connection \
--location "EqDC2" \
--bandwidth "10Gbps" \
--connection-name "dx-locA-dev1" \
--request-macsec-capable
# Location B, device 1 (diverse location)
aws directconnect create-connection \
--location "CSSEA1" \
--bandwidth "10Gbps" \
--connection-name "dx-locB-dev1" \
--request-macsec-capable
--request-macsec-capable only succeeds on MACsec-supported ports (dedicated 10/100 Gbps at supported locations); request it now, since you cannot retrofit a non-capable port. Repeat for the second device at each location, then download each LOA-CFA and hand it to the colocation provider for the cross-connect:
aws directconnect describe-loa \
--connection-id "dxcon-aaaa1111" \
--output text --query loaContent | base64 --decode > loa-locA-dev1.pdf
Once cross-connects are live and the ports show available, create the Direct Connect Gateway — a global, region-agnostic object and the anchor for the whole design. Its Amazon-side ASN is what AWS uses on the BGP sessions toward your router.
aws directconnect create-direct-connect-gateway \
--direct-connect-gateway-name "dxgw-prod-global" \
--amazon-side-asn 64512
Choose the Amazon-side ASN deliberately. It must differ from your on-prem ASN, and if you ever attach this DXGW to a Transit Gateway, the DXGW ASN and the TGW ASN must also be distinct. Picking from the private ASN range (64512-65534, or the 32-bit private range) and documenting it now avoids a painful renumber later, since the DXGW ASN is immutable after creation.
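Because the ASN cannot be changed later, it is worth checking the plan mechanically before creating anything. The following sketch encodes only the constraints stated above; the function names are made up for illustration, and the 32-bit private range used is 4200000000-4294967294.

```python
PRIVATE_16 = range(64512, 65535)            # 64512-65534 inclusive
PRIVATE_32 = range(4200000000, 4294967295)  # 4200000000-4294967294 inclusive


def is_private_asn(asn: int) -> bool:
    return asn in PRIVATE_16 or asn in PRIVATE_32


def check_asn_plan(on_prem: int, dxgw: int, tgw: int) -> list:
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    if dxgw == on_prem:
        problems.append("DXGW ASN must differ from the on-prem ASN")
    if dxgw == tgw:
        problems.append("DXGW ASN must differ from the TGW ASN")
    for name, asn in (("DXGW", dxgw), ("TGW", tgw)):
        if not is_private_asn(asn):
            problems.append(f"{name} ASN {asn} is outside the private ranges")
    return problems


# Values used in this guide: on-prem 65000, DXGW 64512, TGW 64513.
print(check_asn_plan(65000, 64512, 64513))  # []
```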
4. Step 2 - Associate the Transit Gateway and configure transit VIFs
Create the Transit Gateway (or reuse an existing one) and associate it to the DXGW. The association declares which CIDRs the TGW advertises out to on-prem via the allowed prefixes list — the single most important field in the whole build.
# Create the TGW with its own distinct ASN
aws ec2 create-transit-gateway \
--description "tgw-prod" \
--options "AmazonSideAsn=64513,DefaultRouteTableAssociation=enable,DefaultRouteTablePropagation=enable"
# Associate the TGW to the DXGW, declaring the prefixes the TGW will advertise to on-prem
aws directconnect create-direct-connect-gateway-association \
--direct-connect-gateway-id "dxgw-1234567890abcdef" \
--gateway-id "tgw-0a1b2c3d4e5f6a7b8" \
--add-allowed-prefixes-to-direct-connect-gateway "cidr=10.0.0.0/8"
The allowed-prefixes on a transit association are not a filter on inbound routes — they are the summaries the DXGW advertises from AWS to on-prem over every transit VIF. Advertise one clean summary (e.g. 10.0.0.0/8 covering all VPC space) rather than dozens of specifics, since the DXGW caps advertised prefixes.
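Since anything outside the allowed prefixes is never advertised to on-prem, a small coverage check catches a VPC that falls outside the summary. This is a sketch using Python's standard ipaddress module; the VPC CIDRs are hypothetical examples, not values from this build.

```python
import ipaddress


def uncovered(vpc_cidrs, allowed_prefixes):
    """Return the VPC CIDRs that no allowed prefix covers."""
    allowed = [ipaddress.ip_network(p) for p in allowed_prefixes]
    missing = []
    for cidr in vpc_cidrs:
        net = ipaddress.ip_network(cidr)
        # subnet_of() raises on mixed IPv4/IPv6, so compare versions first.
        if not any(net.version == a.version and net.subnet_of(a) for a in allowed):
            missing.append(cidr)
    return missing


vpcs = ["10.10.0.0/16", "10.20.0.0/16", "172.16.5.0/24"]  # hypothetical VPC space
print(uncovered(vpcs, ["10.0.0.0/8"]))  # ['172.16.5.0/24']
```

A non-empty result means either the summary needs widening or that VPC will be unreachable from on-prem over Direct Connect.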
Create one transit VIF per connection — each with its own VLAN, /30 peering subnet, and BGP session to the DXGW.
aws directconnect create-transit-virtual-interface \
--connection-id "dxcon-aaaa1111" \
--new-transit-virtual-interface '{
"virtualInterfaceName": "tvif-locA-dev1",
"vlan": 101,
"asn": 65000,
"mtu": 8500,
"directConnectGatewayId": "dxgw-1234567890abcdef",
"addressFamily": "ipv4",
"amazonAddress": "169.254.100.1/30",
"customerAddress": "169.254.100.2/30",
"authKey": "your-bgp-md5-secret"
}'
Details that matter:
- asn here is your on-prem/router ASN (customer side); the DXGW answers with its Amazon-side ASN from Step 1.
- mtu: 8500 enables jumbo frames, avoiding fragmentation for the encapsulation overhead TGW adds. Confirm the on-prem path supports it end to end first.
- authKey sets the BGP MD5 password. Always set one.
- Repeat with distinct VLANs, peering /30s, and names for all four connections.
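Repeating the VIF four times by hand is where VLAN and addressing typos creep in. One option is to generate the JSON bodies from a small plan, as in this sketch. The VLAN numbers and the second, third, and fourth /30 blocks are assumptions chosen for illustration; only the first block and the 169.254.110.1 peer appear elsewhere in this guide. The authKey is deliberately left out so it can be injected from a secret store when the command is run.

```python
import ipaddress
import json


def transit_vif_spec(name, vlan, peering_cidr, customer_asn, dxgw_id):
    """Build the JSON body for --new-transit-virtual-interface from a /30."""
    net = ipaddress.ip_network(peering_cidr)
    # A /30 has exactly two usable hosts; any other size fails the unpacking.
    amazon, customer = net.hosts()
    return {
        "virtualInterfaceName": name,
        "vlan": vlan,
        "asn": customer_asn,
        "mtu": 8500,
        "directConnectGatewayId": dxgw_id,
        "addressFamily": "ipv4",
        "amazonAddress": f"{amazon}/{net.prefixlen}",
        "customerAddress": f"{customer}/{net.prefixlen}",
    }


# Hypothetical plan: one row per connection.
PLAN = [
    ("tvif-locA-dev1", 101, "169.254.100.0/30"),
    ("tvif-locA-dev2", 102, "169.254.100.4/30"),
    ("tvif-locB-dev1", 103, "169.254.110.0/30"),
    ("tvif-locB-dev2", 104, "169.254.110.4/30"),
]

specs = [transit_vif_spec(n, v, c, 65000, "dxgw-1234567890abcdef") for n, v, c in PLAN]
print(json.dumps(specs[0], indent=2))
```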
5. Step 3 - Route propagation, allowed prefixes, and asymmetric routing
You now have BGP in both directions and need to make it deterministic.
Outbound from AWS (TGW to on-prem) is controlled by the allowed-prefixes on the TGW-DXGW association (Step 2); the DXGW advertises those summaries equally over all four transit VIFs. With DefaultRouteTablePropagation=enable, learned routes propagate to the TGW route table automatically.
Inbound to AWS (on-prem to TGW) is driven by what your routers advertise. To prefer Location A normally and fail to Location B, shape it with BGP attributes:
! Cisco IOS-XE: prefer Location A, prepend Location B
router bgp 65000
address-family ipv4 unicast
! Location A: higher local-pref preferred for AWS-bound traffic
neighbor 169.254.100.1 route-map LOCA-PRIMARY in
! Location B: prepend our ASN outbound so AWS prefers A
neighbor 169.254.110.1 route-map LOCB-BACKUP out
!
route-map LOCA-PRIMARY permit 10
set local-preference 200
!
route-map LOCB-BACKUP permit 10
set as-path prepend 65000 65000
AWS path selection over Direct Connect starts with longest prefix match, then compares AS_PATH length among routes of equal prefix length. Local-preference is your side's attribute and is never sent across the eBGP session, so AWS cannot see it, and MED is not a reliable lever across the DXGW. The durable lever for influencing which connection AWS prefers is AS_PATH prepending on the secondary connections, plus a more-specific prefix on the primary for a harder preference.
Asymmetric routing is the classic Direct Connect outage. If AWS returns traffic out Location B while you send out Location A, stateful firewalls drop the mismatched flows. The fix: make both directions agree on the same primary — higher local-pref inbound on A, prepend on B outbound — and keep prefix lengths symmetric per location rather than summarizing one and de-aggregating the other.
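The selection order described above can be modeled in a few lines to predict which connection AWS will use before touching a router. This is a simplified sketch, not AWS's implementation: it considers only prefix length and AS_PATH length, and the 192.168.0.0/16 on-prem prefix and the "via" labels are hypothetical.

```python
import ipaddress


def aws_best_paths(routes, destination):
    """Return the 'via' labels that win for a destination IP."""
    dest = ipaddress.ip_address(destination)
    parsed = [(ipaddress.ip_network(r["prefix"]), r) for r in routes]
    matching = [(net, r) for net, r in parsed if dest in net]
    if not matching:
        return []
    # Step 1: longest prefix match.
    longest = max(net.prefixlen for net, _ in matching)
    candidates = [r for net, r in matching if net.prefixlen == longest]
    # Step 2: shortest AS_PATH among the remaining candidates.
    shortest = min(len(r["as_path"]) for r in candidates)
    return sorted(r["via"] for r in candidates if len(r["as_path"]) == shortest)


routes = [
    {"prefix": "192.168.0.0/16", "as_path": [65000], "via": "locA"},
    # Location B prepends 65000 twice, as in the LOCB-BACKUP route-map.
    {"prefix": "192.168.0.0/16", "as_path": [65000, 65000, 65000], "via": "locB"},
]
print(aws_best_paths(routes, "192.168.10.5"))  # ['locA']
```

When both entries come back, the paths are tied and AWS may use either, which is exactly the condition that produces asymmetric flows.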
6. Step 4 - Encrypting the link: MACsec on the port vs IPsec over the VIF
Direct Connect is private but not encrypted by default. Two options operate at different layers:
| Approach | Layer | Scope | Requirements |
|---|---|---|---|
| MACsec (802.1AE) | L2, on the port | Entire connection, all VIFs | MACsec-capable dedicated port; supported location; CKN/CAK keys |
| IPsec over the VIF | L3, in a VPN tunnel | A Site-to-Site VPN over the DX path | Public VIF + VPN, or the VPN backup itself |
MACsec is line-rate, point-to-point on the cross-connect, and the cleaner answer when both router and port support it (why we passed --request-macsec-capable in Step 1). Associate a MACsec secret — a Connection Key Name (CKN) and Connectivity Association Key (CAK) — to the connection:
aws directconnect associate-mac-sec-key \
--connection-id "dxcon-aaaa1111" \
--ckn "0011...your-ckn..." \
--cak "1122...your-cak..."
# Require encryption: unencrypted frames are dropped, not allowed through
aws directconnect update-connection \
--connection-id "dxcon-aaaa1111" \
--encryption-mode "must_encrypt"
Set encryption-mode to must_encrypt only after the key is confirmed on both ends; setting it before the peer is keyed drops the link. Use should_encrypt during cutover.
If MACsec is unavailable (a hosted connection, or a 1 Gbps port), encrypt at L3 by running a Site-to-Site VPN over a public VIF, or rely on the IPsec VPN backup below. A public-IP VPN cannot run over a transit VIF; the only VPN a transit VIF carries is a private IP VPN terminated on the Transit Gateway.
7. The Site-to-Site VPN backup and BGP timer tuning
The encrypted, internet-based backup attaches to the same Transit Gateway, so failover is a routing decision, not a topology change. Create a Customer Gateway, then a VPN attachment to the TGW with dynamic BGP routing.
aws ec2 create-customer-gateway \
--type ipsec.1 \
--public-ip 203.0.113.10 \
--bgp-asn 65000
aws ec2 create-vpn-connection \
--type ipsec.1 \
--customer-gateway-id "cgw-0abc123" \
--transit-gateway-id "tgw-0a1b2c3d4e5f6a7b8" \
--options '{"StaticRoutesOnly":false,"TunnelOptions":[{},{}]}'
The point of the backup is that it stays quiet until Direct Connect fails:
- On the VPN tunnels, AS-path prepend your on-prem prefixes heavily (3-4 times) so AWS prefers DX for traffic toward on-prem. (For the same prefix the TGW prefers DX over VPN by default, but verify it in the route table.)
- Keep both tunnels per connection up so a single tunnel failure does not drop the backup.
For failover speed, the constraint is BGP convergence. The default 90-second hold time is far too slow for production. Two levers:
- BFD (Bidirectional Forwarding Detection) on the DX VIFs. AWS supports BFD; with the AWS-side defaults (300 ms interval, multiplier 3), enabling it per VIF neighbor gives sub-second detection versus tens of seconds for BGP timers alone. This is the recommended approach.
- Tuned BGP timers where BFD is not available (the VPN). Lower keepalive/hold on your side; AWS negotiates the lower of the two.
! Enable BFD on the Direct Connect VIF neighbors for sub-second failover
router bgp 65000
neighbor 169.254.100.1 fall-over bfd
neighbor 169.254.110.1 fall-over bfd
!
interface ...
bfd interval 300 min_rx 300 multiplier 3
With BFD on DX and a heavily-prepended VPN on the same TGW, a connection or location failure is detected in under a second and traffic reconverges to the surviving DX path shortly after, and a total DX failure falls to the VPN automatically.
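The detection numbers are simple arithmetic, shown here as a sketch: BFD declares a neighbor down after the configured number of missed intervals, while plain BGP waits for the hold timer to expire. The function names are illustrative.

```python
def bfd_detection_ms(interval_ms: int, multiplier: int) -> int:
    """Worst-case BFD failure detection time in milliseconds."""
    return interval_ms * multiplier


def speedup(hold_time_s: int, interval_ms: int, multiplier: int) -> float:
    """How many times faster BFD detects a failure than the BGP hold timer."""
    return hold_time_s * 1000 / bfd_detection_ms(interval_ms, multiplier)


print(bfd_detection_ms(300, 3))  # 900
print(speedup(90, 300, 3))       # 100.0
```

Detection is only the first step; route withdrawal and forwarding updates on both sides add to the total, so measure end-to-end failover in a real shutdown test.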
Enterprise scenario
A payments platform ran the maximum model across two DX locations into a TGW, plus an IPsec VPN backup on the same TGW. During a planned AWS maintenance on one Location A device, both Location A VIFs went down as expected and traffic moved to Location B — but a chunk of flows from on-prem to a PCI VPC started timing out. The circuits were “up”; this was asymmetric routing. On-prem still sent AWS-bound traffic toward Location A’s surviving-but-draining path because of stale local-preference, while AWS, having lost the Location A BGP sessions, returned everything via Location B. The stateful firewalls in front of the PCI VPC saw SYN out one location and SYN-ACK in via the other, and silently dropped the half-open flows.
Root cause: they steered inbound (AWS-to-on-prem) with local-preference, which AWS ignores across the DXGW, and never made outbound (on-prem-to-AWS) agree. The durable fix was AS_PATH prepending on the Location B VIFs so AWS consistently preferred Location A, matched by higher local-pref inbound on A — both directions pinned to the same primary.
! Location B VIFs: prepend outbound so AWS prefers Location A, symmetric with inbound local-pref
route-map LOCB-BACKUP-OUT permit 10
set as-path prepend 65000 65000 65000
!
router bgp 65000
neighbor 169.254.110.1 route-map LOCB-BACKUP-OUT out
neighbor 169.254.100.1 fall-over bfd
neighbor 169.254.110.1 fall-over bfd
They also enabled BFD on every DX VIF, cutting failover detection from tens of seconds to sub-second. The lesson: a “healthy” circuit count proves nothing when both directions disagree on the primary path.
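The failure in this scenario can be expressed as a check that both sides pick the same location. The sketch below is a deliberately simplified model, not a BGP implementation: on-prem picks the highest local-preference among live paths, AWS picks the fewest prepends, and the field names and values are hypothetical.

```python
def _best(paths, key, pick):
    """Locations whose `key` equals the best value among paths that are up."""
    up = [p for p in paths if p["up"]]
    if not up:
        return set()
    target = pick(p[key] for p in up)
    return {p["location"] for p in up if p[key] == target}


def on_prem_choice(paths):
    return _best(paths, "local_pref", max)  # highest local-pref wins


def aws_choice(paths):
    return _best(paths, "prepends", min)    # shortest AS_PATH wins


def is_symmetric(paths):
    return on_prem_choice(paths) == aws_choice(paths)


before = [
    {"location": "A", "up": True, "local_pref": 200, "prepends": 0},
    {"location": "B", "up": True, "local_pref": 100, "prepends": 0},
]
after = [dict(before[0]), dict(before[1], prepends=3)]

print(is_symmetric(before))  # False: AWS sees A and B as equal
print(is_symmetric(after))   # True: both sides prefer A
```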
Verify
Confirm BGP, prefixes, and failover behavior before declaring the design done.
# All four transit VIFs should be 'available' with 'up' BGP
aws directconnect describe-virtual-interfaces \
--query "virtualInterfaces[].{name:virtualInterfaceName,state:virtualInterfaceState,bgp:bgpPeers[0].bgpStatus}" \
--output table
# DXGW associations: confirm allowed prefixes and 'associated' state
aws directconnect describe-direct-connect-gateway-associations \
--direct-connect-gateway-id "dxgw-1234567890abcdef" \
--query "directConnectGatewayAssociations[].{tgw:associatedGateway.id,state:associationState,prefixes:allowedPrefixesToDirectConnectGateway}"
# TGW route table: on-prem prefixes should resolve to the DXGW attachment, not the VPN
aws ec2 search-transit-gateway-routes \
--transit-gateway-route-table-id "tgw-rtb-0abc123" \
--filters "Name=type,Values=propagated"
On the router side, confirm the routes you advertise to the secondary location carry the prepended AS_PATH, that the AWS summary is received on every session, and that BFD sessions are up: show ip bgp neighbors <peer> advertised-routes, show ip bgp neighbors <peer> received-routes, and show bfd neighbors. Then test the failure modes deliberately:
- Single connection down: shut one VIF’s interface; traffic rides the second connection at the same location, no asymmetry.
- Full location down: shut both Location A VIFs; traffic moves to Location B. Watch for return-path asymmetry especially here.
- Total DX down: shut all four; traffic falls to the IPsec VPN within seconds, and the TGW route table now resolves on-prem prefixes via the VPN attachment.
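Writing down the expected outcome of each test before running it makes the results easy to judge. A minimal sketch of that expectation table follows; the path labels are hypothetical, and the tier order reflects the preference this guide builds (Location A, then Location B, then VPN).

```python
TIERS = [
    ("Location A", {"locA-dev1", "locA-dev2"}),
    ("Location B", {"locB-dev1", "locB-dev2"}),
    ("VPN", {"vpn"}),
]
ALL_PATHS = set().union(*(members for _, members in TIERS))


def expected_tier(up):
    """Return the tier that should carry traffic given the set of live paths."""
    for name, members in TIERS:
        if members & up:
            return name
    return None


tests = {
    "single connection down": ALL_PATHS - {"locA-dev1"},
    "full location down": ALL_PATHS - {"locA-dev1", "locA-dev2"},
    "total DX down": {"vpn"},
}
for label, up in tests.items():
    print(f"{label}: expect {expected_tier(up)}")
```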
Monitoring and pitfalls
Direct Connect publishes metrics to CloudWatch under the AWS/DX namespace per connection: ConnectionState, ConnectionBpsEgress/Ingress, ConnectionPpsEgress/Ingress, light levels (ConnectionLightLevelTx/Rx), and CRC/error counters. Alarm on ConnectionState dropping below 1 and on light levels drifting — a degrading optic shows in light levels before the link fully fails.
aws cloudwatch put-metric-alarm \
--alarm-name "dx-locA-dev1-down" \
--namespace "AWS/DX" \
--metric-name "ConnectionState" \
--dimensions Name=ConnectionId,Value=dxcon-aaaa1111 \
--statistic Minimum --period 60 --evaluation-periods 1 \
--threshold 1 --comparison-operator LessThanThreshold
The pitfalls that bite in production:
- Mistaking a LAG for resiliency. A LAG is one device at one location — bandwidth and port-failure protection, not the maximum model. Keep your four connections diverse across locations.
- Forgetting the VPN can become primary. If DX fails and the VPN becomes your only path, capacity-plan for it. A 10 Gbps DX backed by a ~1.25 Gbps VPN aggregate is a brownout waiting to happen during a long outage.
- must_encrypt set too early. Confirm MACsec keys with should_encrypt first, or you drop the link the moment you tighten the policy.
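The VPN capacity pitfall is easy to quantify. The sketch below assumes roughly 1.25 Gbps per tunnel (the figure quoted above) and that traffic can be spread evenly across tunnels with ECMP on the Transit Gateway; both are assumptions, and real throughput depends on packet size and flow mix.

```python
import math


def tunnels_needed(peak_gbps: float, per_tunnel_gbps: float = 1.25) -> int:
    """Tunnels required to carry the peak load, assuming even ECMP spread."""
    return math.ceil(peak_gbps / per_tunnel_gbps)


def coverage(peak_gbps: float, tunnels: int, per_tunnel_gbps: float = 1.25) -> float:
    """Fraction of peak load the backup can carry, capped at 1.0."""
    return min(1.0, tunnels * per_tunnel_gbps / peak_gbps)


print(tunnels_needed(10))  # 8
print(coverage(10, 1))     # 0.125
```

If the backup covers only a fraction of peak, decide in advance which traffic classes get the VPN during an outage.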
Build to the maximum model, prove every failure mode with a real shutdown test, and keep the VPN quiet but ready. Resiliency you have not tested is just a diagram.