3.3.3.3. Root Cause Analysis
Finding the root cause, not just the symptom, prevents the same incident from recurring. AWS provides tools to trace systematically from symptom to cause.
The Five Whys applied to AWS:
- Why are users seeing errors? → ALB returning 502s
- Why is the ALB returning 502s? → All targets are unhealthy
- Why are targets unhealthy? → Application process crashed on all instances
- Why did the application crash? → Out-of-memory error from a memory leak
- Why is there a memory leak? → New code deployed without memory profiling
RCA tools and techniques:
| Tool | Use For | Example |
|---|---|---|
| CloudWatch Metrics | Identify when the problem started | Error rate spiked at 14:23 |
| CloudWatch Logs Insights | Find what errors occurred | Query for exceptions at 14:23 |
| X-Ray traces | Find where in the request path the problem occurs | Payment service latency spiked |
| CloudTrail | Find who changed what | New deployment at 14:20 |
| AWS Config timeline | Find what configuration changed | Security group rule added at 14:15 |
Correlation technique: Overlay CloudTrail events (deployments, config changes) on CloudWatch metric graphs. If errors spike right after a deployment finishes, that deployment is the leading suspect; confirm it against the application logs before calling it the root cause.
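The overlay can also be done programmatically. The following is a minimal sketch, not an AWS API call: it assumes the metric datapoints and change events have already been exported as (timestamp, value) and (timestamp, description) pairs. The function names, the 5.0 threshold, the 15-minute window, the date, and the "IAM policy updated" entry are illustrative assumptions; the other times come from the table above.

```python
from datetime import datetime, timedelta


def find_spike_start(datapoints, threshold):
    """Return the timestamp of the first datapoint above threshold, or None."""
    for ts, value in sorted(datapoints):
        if value > threshold:
            return ts
    return None


def changes_before(spike_start, change_events, window=timedelta(minutes=15)):
    """Return change events inside the window before the spike, most recent first."""
    earliest = spike_start - window
    candidates = [
        (ts, description)
        for ts, description in change_events
        if earliest <= ts <= spike_start
    ]
    return sorted(candidates, reverse=True)


def at(hour, minute):
    """Build a timestamp on an arbitrary illustrative day."""
    return datetime(2024, 1, 1, hour, minute)


# Illustrative data: error rate in percent, one datapoint per minute.
error_rate = [
    (at(14, 21), 0.2),
    (at(14, 22), 0.3),
    (at(14, 23), 12.5),
    (at(14, 24), 14.0),
]

# Illustrative change events, as they might be gathered from CloudTrail and AWS Config.
changes = [
    (at(13, 5), "IAM policy updated"),
    (at(14, 15), "Security group rule added"),
    (at(14, 20), "New deployment"),
]

spike = find_spike_start(error_rate, threshold=5.0)
suspects = changes_before(spike, changes)

for ts, description in suspects:
    print(ts.strftime("%H:%M"), description)
```

The change closest to the spike is listed first, but every change in the window remains a suspect until the logs rule it out.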
# CloudWatch Logs Insights: find the earliest errors in the query time range (start the range at the deployment time)
fields @timestamp, @message
| filter @message like /Exception|Error|FATAL/
| sort @timestamp asc
| limit 10
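The query itself has no time filter; the time range is supplied separately when the query is run. As a rough sketch, the same query could be run from Python with the time range starting at the deployment time. This assumes a boto3 CloudWatch Logs client is passed in; the function name, the 30-minute window, and the log group name in the usage comment are hypothetical.

```python
import time
from datetime import datetime, timedelta, timezone

QUERY = """fields @timestamp, @message
| filter @message like /Exception|Error|FATAL/
| sort @timestamp asc
| limit 10"""


def first_errors_after(logs_client, log_group, deployed_at,
                       window=timedelta(minutes=30), poll_seconds=1.0):
    """Run QUERY over [deployed_at, deployed_at + window] and return rows as dicts."""
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(deployed_at.timestamp()),  # epoch seconds
        endTime=int((deployed_at + window).timestamp()),
        queryString=QUERY,
    )
    query_id = started["queryId"]

    # Logs Insights queries are asynchronous, so poll until the query leaves
    # the Scheduled/Running states.
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError(f"Query ended with status {response['status']}")

    # Each result row is a list of {"field": ..., "value": ...} entries.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]


# Usage (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   logs = boto3.client("logs")
#   deployed = datetime(2024, 1, 1, 14, 20, tzinfo=timezone.utc)
#   for row in first_errors_after(logs, "/hypothetical/app/log-group", deployed):
#       print(row["@timestamp"], row["@message"])
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.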
Exam Trap: A common exam scenario presents an issue that started after a deployment. The correct RCA approach is: check CloudTrail for the deployment event → check CodeDeploy deployment status → review application logs for errors → correlate with the code changes. The answer is usually to roll back the deployment first (restore service), then investigate root cause. Don't spend time debugging while users are affected.
