AWS-ANS-C01 & AWS CERTIFICATION | Troubleshooting Load Balancers - AWS Certified Advanced Networking

4.2.4. Troubleshooting Load Balancers

Troubleshooting load balancers involves systematically checking target group health, listener rules, and backend instance configurations to identify traffic routing issues and restore application availability.

Scenario: Your web application, accessed via an ALB, is returning 503 errors to users. In the ALB console, you see that all registered EC2 instances in the Target Group are marked as "unhealthy."

Elastic Load Balancing (ELB) is crucial for distributing traffic and ensuring high availability, but troubleshooting issues can involve multiple layers.

Key Troubleshooting Steps for Load Balancers:

Check Load Balancer Status: In the ELB console, verify the load balancer is "Active" and has enough healthy targets.
Check Target Group Health:
- Primary Focus: The most common cause of load balancer issues. Ensure targets are "Healthy."
- Verify Health Check Configuration: Protocol, port, path, thresholds (HealthyThresholdCount, UnhealthyThresholdCount, TimeoutSeconds, IntervalSeconds).
- Check Backend Instance Logs: If targets are unhealthy, check application logs and system logs on the EC2 instance for errors.
Check Listener Rules:
- Purpose: Define how incoming requests are routed to target groups.
- Verify: Ensure listener rules (e.g., host-based, path-based) are correctly configured to forward traffic to the intended target group.
Check Security Groups:
- Load Balancer SG: Ensure it allows inbound traffic on listener ports and outbound traffic to target instances on their health check and application ports.
- Target Instance SG: Ensure it allows inbound traffic from the load balancer's SG on application ports.
Check Network ACLs (NACLs): Verify NACLs associated with load balancer and target subnets are not blocking traffic.
Check Route Tables: Ensure subnets have correct route table entries.
CloudWatch Metrics: Monitor metrics like HealthyHostCount, HTTPCode_Target_5XX_Count, TargetConnectionErrorCount, Latency to gain insights.

Practical Implementation: Checking ALB Target Group Health (CLI)

# Assuming TARGET_GROUP_ARN is defined
# 1. Describe target health for a specific target group
aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abcdef1234567890

# Example output for unhealthy targets:
# {
#     "TargetHealthDescriptions": [
#         {
#             "Target": {
#                 "Id": "i-0abcdef1234567890",
#                 "Port": 80
#             },
#             "HealthCheckPort": "traffic-port",
#             "TargetHealth": {
#                 "State": "unhealthy",
#                 "Reason": "Target.FailedHealthChecks",
#                 "Description": "Health checks failed with these codes: [500]"
#             }
#         }
#     ]
# }

⚠️ Common Pitfall: Misconfiguring health checks. An incorrect health check path, port, or response code can cause healthy instances to be marked unhealthy, leading to application downtime.

Key Trade-Offs:

Health Check Sensitivity vs. Stability: A very sensitive health check detects issues faster but might mark instances unhealthy during transient network glitches or brief load spikes.

Reflection Question: How does systematically checking Target Group health (including health check configuration and backend logs), listener rules, and Security Groups fundamentally help you troubleshoot load balancer issues and restore application availability?