4.2.1. Troubleshooting Connectivity (General)
Systematic troubleshooting of network connectivity issues involves validating configurations across all layers of the network stack, from application to physical, to rapidly identify the root cause.
Scenario: Your application's web servers can't connect to the database servers in a different private subnet. You've confirmed both EC2 instances are running and the database is healthy.
When connectivity issues arise, work through the checks below methodically so that each one rules out a single layer.
Key Troubleshooting Steps (General):
- Verify Resource Health & Application State: Check EC2 instance status checks, Load Balancer Target Group health checks, and application logs for network-related errors.
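- Check Instance-Level Firewalls (Security Groups): Verify inbound and outbound rules on both the source and destination Security Groups. Misconfigured Security Groups are among the most common causes of blocked traffic. Ensure the correct ports and protocols are allowed and that each rule references the right source/destination IP ranges or Security Groups. Security Groups are stateful, so response traffic for an allowed connection is permitted automatically.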
- Check Subnet-Level Firewalls (Network ACLs): Verify inbound and outbound rules on the Network ACLs associated with both the source and destination subnets. Remember that NACLs are stateless, so return traffic (typically on ephemeral ports) must be allowed explicitly, and that rules are evaluated in ascending rule-number order with the first matching rule applied.
- Check Network Routing (Route Tables): Ensure the route tables for both the source and destination subnets contain a route toward the other end (e.g., the local route for traffic within the VPC, an Internet Gateway for internet traffic, a NAT Gateway for outbound traffic from private subnets, a VPC peering connection, or a Transit Gateway attachment).
- Check Name Resolution (DNS): Ensure instances can resolve hostnames correctly by checking VPC DHCP options sets and Route 53 Resolver configurations.
- Analyze Traffic Flow (VPC Flow Logs): Use VPC Flow Logs to inspect IP traffic records and see whether traffic is reaching its destination and whether it is being accepted or rejected.
- Perform Automated Path Analysis (Reachability Analyzer): Use Reachability Analyzer to simulate the network path between two resources and identify the specific blocking component.
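The flow log step above can be sketched as a small filter. This is a minimal sketch that assumes the default (version 2) flow log record format, in which field 4 is the source address, field 5 the destination address, field 7 the destination port, and field 13 the action. The function name `rejected_flows`, the sample IP addresses, and port 3306 (a MySQL-style database port) are illustrative assumptions, not values from the scenario.

```shell
# Print "srcaddr dstaddr dstport" for every REJECT record aimed at the port in $1.
# Assumes the default (version 2) flow log format read from stdin.
rejected_flows() {
  awk -v port="$1" '$13 == "REJECT" && $7 == port { print $4, $5, $7 }'
}

# Hypothetical sample records: web server 10.0.1.25, database 10.0.2.40.
cat <<'EOF' | rejected_flows 3306
2 123456789012 eni-0aaa1111bbbb2222c 10.0.1.25 10.0.2.40 49152 3306 6 3 180 1700000000 1700000060 REJECT OK
2 123456789012 eni-0aaa1111bbbb2222c 10.0.1.25 10.0.2.40 49153 443 6 5 300 1700000000 1700000060 ACCEPT OK
EOF
# prints: 10.0.1.25 10.0.2.40 3306
```

A REJECT record on the database's network interface points at a Security Group or NACL rule. If no record for the connection appears there at all, the traffic was probably dropped earlier in the path or never routed to that interface.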
Practical Implementation: Checking Security Group Rules (CLI)
# 1. Describe inbound and outbound rules for the web server's SG
aws ec2 describe-security-groups --group-ids sg-0abcdef1234567890 --query "SecurityGroups[0].{Inbound:IpPermissions,Outbound:IpPermissionsEgress}"
# 2. Describe inbound and outbound rules for the database server's SG
aws ec2 describe-security-groups --group-ids sg-0fedcba9876543210 --query "SecurityGroups[0].{Inbound:IpPermissions,Outbound:IpPermissionsEgress}"
# 3. Check NACL rules for the web server's subnet (repeat with the database subnet's ID)
aws ec2 describe-network-acls --filters "Name=association.subnet-id,Values=subnet-0a1b2c3d" --query "NetworkAcls[0].Entries"
# 4. Check the route table for the web server's subnet (repeat with the database subnet's ID)
# An empty (null) result means the subnet has no explicit association and uses the VPC's main route table.
aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-0a1b2c3d" --query "RouteTables[0].Routes"
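When reading a route table, keep in mind that the most specific matching route (longest prefix) is the one applied. The sketch below illustrates that selection for IPv4 only. The helper names `ip_to_int` and `best_route`, the simplified "destination-cidr target" input format, and the sample routes are assumptions made for illustration, not AWS CLI output.

```shell
# Convert a dotted-quad IPv4 address to an integer (runs in a subshell so the
# IFS change stays local).
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Read "destination-cidr target" lines on stdin and print the target of the
# most specific route (longest prefix) that contains the address in $1.
best_route() (
  addr=$(ip_to_int "$1")
  best_len=-1
  best_target="no-route"
  while read -r cidr target; do
    net=$(ip_to_int "${cidr%/*}")
    len=${cidr#*/}
    # Netmask for this prefix length; a /0 route yields a mask of 0.
    mask=$(( (0xFFFFFFFF << (32 - $len)) & 0xFFFFFFFF ))
    if [ $(( $addr & $mask )) -eq $(( $net & $mask )) ] && [ "$len" -gt "$best_len" ]; then
      best_len=$len
      best_target=$target
    fi
  done
  echo "$best_target"
)

# Hypothetical private-subnet route table; the database sits at 10.0.2.40.
cat <<'EOF' | best_route 10.0.2.40
10.0.0.0/16 local
0.0.0.0/0 nat-0123456789abcdef0
EOF
# prints: local
```

For the scenario, traffic between two private subnets in the same VPC should resolve to the `local` route. Any other result for the database's address suggests the route table deserves a closer look.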
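⚠️ Common Pitfall: Overlooking outbound rules. Security Groups are stateful, so replies to an allowed connection pass automatically, but the source's Security Group must still allow the outbound connection in the first place. NACLs are stateless, so traffic allowed inbound can still be blocked on its way out (or vice versa) unless both directions are allowed explicitly.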
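To make the NACL side of this pitfall concrete, here is a simplified sketch of how a NACL reaches its decision: rules are checked in ascending rule-number order, the first match wins, and unmatched traffic is denied. The function name `nacl_decision` and the "rule-number action from-port to-port" input format are hypothetical. The model matches on port only and ignores protocol and CIDR, which a real NACL also evaluates.

```shell
# Read "rule-number action from-port to-port" lines on stdin and print the
# action of the lowest-numbered rule whose port range covers the port in $1.
# Falls back to "deny", mirroring the final catch-all deny rule of a NACL.
nacl_decision() {
  sort -n | awk -v port="$1" '
    port >= $3 && port <= $4 { print $2; found = 1; exit }
    END { if (!found) print "deny" }
  '
}

# Hypothetical outbound NACL on the database subnet: only HTTPS is allowed out,
# so the reply to a web server's ephemeral port 49152 is dropped.
printf '%s\n' '100 allow 443 443' | nacl_decision 49152
# prints: deny

# Adding an ephemeral-range rule lets the reply through.
printf '%s\n' '100 allow 443 443' '110 allow 1024 65535' | nacl_decision 49152
# prints: allow
```

The reply from the database leaves its subnet with the client's ephemeral port as the destination port, which is why an outbound rule covering only well-known ports is not enough.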
Key Trade-Offs:
- Manual Inspection vs. Automated Tools: Manual inspection is quick for simple issues with only a few components in the path. Automated tools like Reachability Analyzer are better suited to complex, multi-layered problems where many components must be checked.
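For the automated route, a Reachability Analyzer run can be scripted with the AWS CLI. The following is a sketch under assumptions: the wrapper function `analyze_path`, the placeholder instance IDs, and port 3306 are illustrative, and the commands need credentials with permission to create and run network insights analyses.

```shell
# Sketch: run a Reachability Analyzer check between two resources.
# $1 = source resource ID, $2 = destination resource ID, $3 = destination TCP port.
analyze_path() {
  path_id=$(aws ec2 create-network-insights-path \
    --source "$1" --destination "$2" \
    --protocol tcp --destination-port "$3" \
    --query "NetworkInsightsPath.NetworkInsightsPathId" --output text) || return 1

  analysis_id=$(aws ec2 start-network-insights-analysis \
    --network-insights-path-id "$path_id" \
    --query "NetworkInsightsAnalysis.NetworkInsightsAnalysisId" --output text) || return 1

  # The analysis runs asynchronously, so poll until it leaves the "running" state.
  while :; do
    status=$(aws ec2 describe-network-insights-analyses \
      --network-insights-analysis-ids "$analysis_id" \
      --query "NetworkInsightsAnalyses[0].Status" --output text) || return 1
    [ "$status" != "running" ] && break
    sleep 5
  done

  # NetworkPathFound is true when the destination is reachable; Explanations
  # lists the blocking components when it is not.
  aws ec2 describe-network-insights-analyses \
    --network-insights-analysis-ids "$analysis_id" \
    --query "NetworkInsightsAnalyses[0].{Reachable:NetworkPathFound,Explanations:Explanations}"
}

# Example call with placeholder instance IDs (not from the scenario):
# analyze_path i-0123456789abcdef0 i-0fedcba9876543210 3306
```

Reachability Analyzer evaluates configuration and does not send packets, so it complements VPC Flow Logs, which record what actually happened to real traffic.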
Reflection Question: How does systematically validating configurations across all layers of the network stack (e.g., checking Security Groups, Network ACLs, route tables, and using VPC Flow Logs) fundamentally help you rapidly identify the root cause of network connectivity issues in AWS?