Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.
3.3.3.5. Analyzing Incidents Regarding Failed Processes (Auto Scaling, Amazon ECS, Amazon EKS)
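Auto-healing services (EC2 Auto Scaling groups, Amazon ECS, Amazon EKS) automatically replace failed components, but the replacement hides the symptom rather than the cause. Understanding why a component failed is essential to preventing the same failure from recurring.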
ASG instance termination analysis:
# Check why instances were terminated
aws autoscaling describe-scaling-activities \
--auto-scaling-group-name prod-web-asg \
--max-items 10
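# Look for: the Cause field of each activity gives the reason (health check failure, scale-in, etc.)
# For failed launches, StatusCode and StatusMessage carry the error text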
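When many activities are returned, it can help to bucket each Cause string into a coarse category before reading the details. The following is a minimal sketch, not an official mapping: the function name classify_cause is hypothetical, and the match patterns are assumptions based on typical Cause wording, so adjust them to the text your own output actually contains.

```bash
# classify_cause: map the free-text Cause of a scaling activity to a coarse category.
# The patterns below are assumptions about typical wording, not a documented contract.
classify_cause() {
  case "$1" in
    *"ELB system health check failure"*) echo "elb-health-check" ;;
    *"EC2 health check"*)                echo "ec2-health-check" ;;
    *"monitor alarm"*)                   echo "scaling-policy" ;;
    *"user request"*)                    echo "manual-change" ;;
    *)                                   echo "unknown" ;;
  esac
}

# Example with a hand-written sample string (not captured from a real account):
classify_cause "an instance was taken out of service in response to an ELB system health check failure."
```

A run of elb-health-check results right after launch points toward the grace period or the health check path, while scaling-policy results indicate ordinary scale-in rather than a failure.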
Common ASG failure patterns:
| Symptom | Root Cause | Diagnostic / Fix |
|---|---|---|
| Instances launch and are terminated almost immediately | Instance fails the ELB health check before the application finishes starting | Increase HealthCheckGracePeriod |
| Instances launch, then get replaced after a few minutes | Application crashes after launch | Check the instance system log (aws ec2 get-console-output) |
| Instances cannot launch (InsufficientInstanceCapacity) | Capacity for that instance type is exhausted in the AZ | Use multiple instance types or AZs |
| Group scales out endlessly | Scaling metric never reaches the target | Check the metric and the target value |
ECS task failure analysis:
# Check stopped task reason
aws ecs describe-tasks --cluster prod --tasks "task-arn"
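# stoppedReason: e.g. "Essential container in task exited"
# containers[].reason: an OOM kill is reported as "OutOfMemoryError: Container killed due to memory usage"
# containers[].exitCode: 137 = SIGKILL (128 + 9), typically an OOM kill; 1 = generic application error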
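Exit codes above 128 follow the common shell convention of 128 plus the signal number, which is why 137 maps to SIGKILL (signal 9). The sketch below decodes a code on that basis. The function name explain_exit_code is hypothetical, and the wording of each explanation is illustrative rather than anything ECS or Kubernetes prints.

```bash
# explain_exit_code: describe a container exit code using the 128 + signal convention.
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ] && [ "$code" -lt 256 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "killed by signal 9 (SIGKILL): often the OOM killer or a forced stop" ;;
      11) echo "killed by signal 11 (SIGSEGV): segmentation fault" ;;
      15) echo "killed by signal 15 (SIGTERM): normal stop request" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  else
    echo "application exited with status $code: check the container logs"
  fi
}

explain_exit_code 137
explain_exit_code 1
```

Because SIGKILL can also come from a forced stop after a stop timeout, treat 137 as a strong hint of an OOM kill and confirm it against the container's reason field.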
EKS pod failure analysis:
- kubectl describe pod <name>: shows events, restart count, exit codes
- kubectl logs <pod> --previous: shows logs from the crashed container
- Common CrashLoopBackOff causes: missing config, failed health probe, OOM
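To pull just the previous termination reason and exit code without reading the full describe output, a jsonpath query against the pod status can be used. This is a minimal sketch: the helper name last_termination is hypothetical, and it assumes the container of interest is the first entry in containerStatuses.

```bash
# last_termination: print "<reason> <exitCode>" from the previous run of the
# pod's first container. Extra arguments (for example -n <namespace>) are
# passed through to kubectl. Output may be empty if the container never restarted.
last_termination() {
  kubectl get pod "$@" -o \
    jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Usage (pod and namespace names are placeholders):
#   last_termination my-pod -n prod
```

A reason of OOMKilled points at the container memory limit, while Error with exit code 1 points back to the application logs from kubectl logs <pod> --previous.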
Exam Trap: Exit code 137 means the container received SIGKILL (128 + 9). In ECS this most often means the OOM killer stopped the container because it tried to use more memory than the task definition allocated. The fix is either increasing the task memory allocation or fixing the application's memory usage. Exit code 1 is a generic application error; check the container logs. The exam tests whether you can interpret container exit codes.

Written by Alvin Varughese • Founder • 15 professional certifications