Copyright (c) 2025 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.4.4. Troubleshooting Network Connectivity

💡 First Principle: Systematic troubleshooting of network connectivity issues, leveraging diagnostic tools and understanding network flow, is crucial for quickly identifying root causes and restoring operational normalcy.

Scenario: Your web application's EC2 instances in a private subnet can no longer access external APIs on the internet. You suspect a network configuration issue, but you're unsure where to start.

For SysOps Administrators, troubleshooting network connectivity problems is a common and critical task. It requires understanding how traffic flows through various VPC components and using the right diagnostic tools.

Common Network Connectivity Issues:
  • Incorrect Security Group Rules: A frequent cause of blocked traffic at the instance level.
  • Incorrect Network ACL Rules: Traffic blocked at the subnet level.
  • Incorrect Route Table Entries: Traffic not routed to the correct destination.
  • NAT Gateway Issues: Instances in private subnets cannot reach the internet or public AWS service endpoints.
  • VPN / Direct Connect Problems: Connectivity issues for hybrid cloud environments.
  • DNS Resolution Issues: Instances cannot resolve hostnames.
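
Route table problems in particular are easier to reason about with a small model. The sketch below is a minimal illustration, not an AWS API: it assumes a hypothetical private-subnet route table (the `PRIVATE_ROUTES` entries and target names are placeholders) and picks the most specific matching route, which is how a VPC route table chooses between overlapping destinations.

```python
import ipaddress

# Hypothetical route table for a private subnet: (destination CIDR, target).
# The target names are illustrative placeholders, not real resource IDs.
PRIVATE_ROUTES = [
    ("10.0.0.0/16", "local"),
    ("0.0.0.0/0", "nat-gateway"),
]

# The same table with the default route missing, as in the scenario above.
BROKEN_ROUTES = [
    ("10.0.0.0/16", "local"),
]


def select_route(routes, destination_ip):
    """Return the target of the most specific matching route, or None."""
    address = ipaddress.ip_address(destination_ip)
    best = None
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if address in network:
            # Longest prefix wins: a /16 beats a /0 for the same address.
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, target)
    return best[1] if best else None
```

With these assumed tables, `select_route(PRIVATE_ROUTES, "203.0.113.10")` returns `"nat-gateway"`, while the same call against `BROKEN_ROUTES` returns `None`: traffic to the external API has no route out of the subnet.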

Key Troubleshooting Steps & Tools:
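  1. Check Security Groups: Verify inbound/outbound rules on both source and destination instances. Security groups are stateful, so return traffic for an allowed connection is permitted automatically.
  2. Check Network ACLs: Verify inbound/outbound rules on the subnets involved. Remember that NACLs are stateless, so return traffic must be allowed explicitly, and that rules are evaluated in ascending rule-number order, with the first matching rule applied.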
  3. Check Route Tables: Ensure subnets have routes to the correct gateways (Internet Gateway, NAT Gateway, VPC Peering connection, Transit Gateway).
  4. Use VPC Flow Logs: Analyze detailed IP traffic information to see if traffic is reaching its destination or being rejected.
  5. Use Reachability Analyzer: A VPC feature that analyzes the network path between two resources, determines whether the destination is reachable, and helps identify the component that blocks the path.
  6. Check DNS Resolution: Ensure DNS settings are correct in the VPC's DHCP options set and that DNS resolution (the enableDnsSupport attribute) is enabled for the VPC.
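  7. Ping/Traceroute: Run basic network diagnostic commands from an EC2 instance. Ping relies on ICMP, so the relevant security groups and NACLs must allow it for the test to be meaningful.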
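
The ordered, stateless behavior of Network ACLs in step 2 can be sketched in a few lines. This is a simplified model under stated assumptions, not the AWS implementation: the `AclRule` class, the `evaluate` function, and the sample rule sets are all hypothetical, and the ephemeral port range 1024-65535 is used only as an illustrative example of a return-traffic rule.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class AclRule:
    """Hypothetical, simplified model of one NACL entry."""
    number: int      # lower numbers are evaluated first
    protocol: str    # "tcp", "udp", or "all"
    port_from: int
    port_to: int
    cidr: str        # remote CIDR (source for inbound, destination for outbound)
    allow: bool


def evaluate(rules, protocol, port, remote_ip):
    """Apply the first matching rule in rule-number order; deny if none match."""
    address = ipaddress.ip_address(remote_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.protocol not in ("all", protocol):
            continue
        if not rule.port_from <= port <= rule.port_to:
            continue
        if address not in ipaddress.ip_network(rule.cidr):
            continue
        return "ALLOW" if rule.allow else "DENY"
    return "DENY"  # models the implicit catch-all deny at the end of every NACL


# Outbound HTTPS to anywhere is allowed.
OUTBOUND = [AclRule(100, "tcp", 443, 443, "0.0.0.0/0", True)]

# Inbound rules that forget the return traffic on ephemeral ports.
INBOUND_BROKEN = [AclRule(100, "tcp", 443, 443, "0.0.0.0/0", True)]

# Inbound rules with an added (assumed) ephemeral-port rule for responses.
INBOUND_FIXED = INBOUND_BROKEN + [AclRule(110, "tcp", 1024, 65535, "0.0.0.0/0", True)]
```

In this model the request to port 443 passes `OUTBOUND`, but the response arriving on an ephemeral port such as 50000 is denied by `INBOUND_BROKEN` and allowed by `INBOUND_FIXED`. A security group would not need the extra rule, because it tracks the connection state.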

āš ļø Common Pitfall: Jumping to conclusions or randomly changing configurations without a systematic approach, which can worsen the problem.

Key Trade-Offs: Speed of diagnosis (using automated tools like Reachability Analyzer) versus manual, in-depth investigation (for complex, unique issues).

Reflection Question: How does a systematic troubleshooting approach, combining checks on Security Groups and Network ACLs, verifying route table entries (especially for the NAT Gateway), and utilizing VPC Flow Logs, fundamentally help you as a SysOps Administrator quickly identify the root cause of network connectivity issues?


Reflection Checkpoint: Phase 3

Summary Scenario: You've successfully deployed an application to AWS using Infrastructure as Code. Now, you need to ensure its ongoing operational health by automating patching, maintaining consistent configurations, managing application updates with minimal downtime, and troubleshooting any network issues that arise.

Key Reflection Question: How do the combined capabilities of AWS Systems Manager, CloudFormation, CodeDeploy, and VPC networking components enable a SysOps Administrator to achieve comprehensive operational management and automation, transforming manual tasks into consistent, scalable, and reliable processes?

Self-Assessment Prompts:
  1. Can I explain the primary use cases for at least three different AWS Systems Manager capabilities (e.g., Run Command, Patch Manager, Session Manager)?
  2. Do I understand how CloudFormation helps ensure operational consistency and enables rollbacks for infrastructure changes?
  3. Can I differentiate between In-place, Rolling, and Blue/Green deployment strategies and when to use each?
  4. Do I know the purpose of public and private subnets, Internet Gateways, and NAT Gateways in a VPC?
  5. Can I list the key differences between Security Groups and Network ACLs and describe a scenario where each would be used?
  6. What are the first three steps I would take to troubleshoot an EC2 instance that cannot connect to an external API?