4.2.2. Troubleshooting VPN/Direct Connect
Troubleshooting VPN and Direct Connect (DX) involves systematically verifying configuration, routing, and tunnel/VIF status on both AWS and on-premises sides to restore hybrid cloud connectivity.
Scenario: Your on-premises data center has lost connectivity to your AWS VPC over a Site-to-Site VPN connection. You've confirmed your internet connection is up.
Connectivity issues in hybrid cloud environments can be complex due to interactions between on-premises devices and AWS.
Key Troubleshooting Steps for VPN/Direct Connect:
- Check AWS Side:
- Site-to-Site VPN:
- Verify the status of both VPN tunnels in the AWS Management Console or through the TunnelState CloudWatch metric. Both should be UP.
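- Check that routes to on-premises prefixes exist: static or BGP-learned routes on the VPN connection for a virtual private gateway (VGW), or the Transit Gateway (TGW) route table associated with the VPN attachment.
- Verify VPC route tables have routes (static or propagated) for on-premises prefixes that point to the VGW or TGW.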
- AWS Direct Connect:
- Verify the state of the DX connection and Virtual Interfaces (VIFs) in the Direct Connect console. Both should be in the available state, not down.
- Check BGP session status on DX VIFs.
- Verify route tables (VPC, Direct Connect Gateway, TGW) have correct routes for on-premises prefixes.
- Check On-premises Side:
- Verify the status of your customer gateway device (VPN router/firewall) or Direct Connect router.
- Check VPN tunnel status or DX physical connection.
- Verify on-premises route tables have routes to AWS VPC CIDRs.
- Ensure firewall rules on-premises are not blocking traffic.
- Check DNS Resolution: Verify DNS settings are correct on both sides for resolving hostnames.
- Use VPC Flow Logs: Analyze logs from the VPC side to see whether traffic from on-premises reaches the target network interfaces and whether it is accepted or rejected by security groups or network ACLs.
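Several of the AWS-side checks above can be run from the AWS CLI. The following is a minimal sketch, not a complete diagnostic script: the function names are hypothetical, the IDs passed in are placeholders, and the flow log helper assumes the flow logs are published to a CloudWatch Logs log group (they may instead go to Amazon S3).

```shell
# Hypothetical helpers wrapping AWS CLI calls; all IDs and names are placeholders.

# List each Direct Connect virtual interface with its state and the
# BGP status of its first peer.
dx_vif_status() {
  aws directconnect describe-virtual-interfaces \
    --query "virtualInterfaces[].[virtualInterfaceId,virtualInterfaceState,bgpPeers[0].bgpStatus]" \
    --output text
}

# Print destination CIDRs in a VPC's route tables that target a given VGW.
# $1 = VPC ID, $2 = virtual private gateway ID
vgw_routes() {
  aws ec2 describe-route-tables \
    --filters "Name=vpc-id,Values=$1" "Name=route.gateway-id,Values=$2" \
    --query "RouteTables[].Routes[] | [?GatewayId=='$2'].DestinationCidrBlock" \
    --output text
}

# Show recent flow log records containing REJECT.
# $1 = CloudWatch Logs log group that receives the VPC flow logs (assumption)
flow_log_rejects() {
  aws logs filter-log-events \
    --log-group-name "$1" \
    --filter-pattern "REJECT" \
    --limit 20 \
    --query "events[].message" \
    --output text
}
```

For example, `vgw_routes vpc-0123456789abcdef0 vgw-0123456789abcdef0` is meant to show which on-premises prefixes, if any, the VPC routes toward the gateway. An empty result would suggest a missing static route or disabled route propagation.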
Practical Implementation: Checking VPN Tunnel Status (CLI)
# Describe the VPN connection to get telemetry for both tunnels
aws ec2 describe-vpn-connections --vpn-connection-ids vpn-0abcdef1234567890 --query "VpnConnections[0].VgwTelemetry"
# Expected output shows, per tunnel: OutsideIpAddress, Status (UP/DOWN), LastStatusChange, StatusMessage, AcceptedRouteCount
# Example:
# [
# {
# "OutsideIpAddress": "198.51.100.1",
# "Status": "UP",
# "LastStatusChange": "2023-10-27T10:00:00.000Z",
# "StatusMessage": "Tunnel is up.",
# "AcceptedRouteCount": 10
# },
# {
# "OutsideIpAddress": "203.0.113.1",
# "Status": "DOWN",
# "LastStatusChange": "2023-10-27T09:50:00.000Z",
# "StatusMessage": "IKE negotiation failed.",
# "AcceptedRouteCount": 0
# }
# ]
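When only the failing tunnels matter, the same call can be narrowed with a JMESPath filter. This is a small sketch under the assumption that the telemetry fields match the example output; `down_tunnels` is a hypothetical helper name and the VPN connection ID is a placeholder.

```shell
# Hypothetical helper: print the outside IP and status message of every
# tunnel whose Status is not UP.
# $1 = VPN connection ID
down_tunnels() {
  aws ec2 describe-vpn-connections \
    --vpn-connection-ids "$1" \
    --query "VpnConnections[0].VgwTelemetry[?Status!='UP'].[OutsideIpAddress,StatusMessage]" \
    --output text
}
```

Calling `down_tunnels vpn-0abcdef1234567890` is intended to print nothing when both tunnels are UP, which makes it easy to use in a scheduled check.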
⚠️ Common Pitfall: Focusing only on the AWS side. Hybrid connectivity issues often stem from misconfigurations or failures on the on-premises network devices.
Key Trade-Offs:
- Manual Inspection vs. Automated Monitoring: Manual checks remain necessary for root-cause analysis, but automated monitoring (e.g., CloudWatch alarms on the VPN TunnelState metric) alerts you as soon as a tunnel goes down.
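As an illustration of the automated side, a CloudWatch alarm on TunnelState might be created as follows. This is a sketch under stated assumptions: the helper name, alarm name, 5-minute period, and single evaluation period are illustrative choices, and the SNS topic ARN is a placeholder you would supply.

```shell
# Hypothetical helper: alarm when any tunnel of a VPN connection is not UP.
# TunnelState is 1 when a tunnel is up and 0 when it is down, so a minimum
# below 1 for the VPN connection indicates at least one tunnel is down.
# $1 = VPN connection ID, $2 = SNS topic ARN to notify
create_tunnel_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "vpn-tunnel-down-$1" \
    --namespace AWS/VPN \
    --metric-name TunnelState \
    --dimensions "Name=VpnId,Value=$1" \
    --statistic Minimum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$2"
}
```

Alarming on the VpnId dimension covers both tunnels with one alarm. Using the TunnelIpAddress dimension instead would give one alarm per tunnel, at the cost of more alarms to manage.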
Reflection Question: How does systematically verifying configuration, routing, and tunnel/VIF status on both the AWS side (e.g., VPN tunnel state in CloudWatch) and the on-premises side fundamentally help you troubleshoot VPN and Direct Connect (DX) issues and restore hybrid cloud connectivity?