Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.3.3.5. Analyzing Incidents Regarding Failed Processes (Auto Scaling, Amazon ECS, Amazon EKS)

Auto-healing services (Auto Scaling groups, Amazon ECS, Amazon EKS) replace failed components automatically, but the replacement only masks the fault. Understanding why a component failed is essential to prevent the failure from recurring.

ASG instance termination analysis:
# Check why instances were terminated
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name prod-web-asg \
  --max-items 10
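# Look for: the Cause field of each activity states why it ran (health check failure, scale-in, etc.)
# For failed launches, StatusCode and StatusMessage carry the error (e.g. InsufficientInstanceCapacity)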
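When the activity history is long, it can help to group activities by their likely cause. The sketch below assumes the command's JSON output has been saved and loaded as a Python dict. The `Activities` and `Cause` field names come from the response shape; the function names, the keyword list, and the sample records are illustrative assumptions, not output from a real account.

```python
from collections import Counter

# Keyword -> label pairs are assumptions for illustration; adjust them to the
# Cause strings that actually appear in your own activity history.
CAUSE_KEYWORDS = [
    ("health check", "health-check replacement"),
    ("alarm", "alarm-driven scaling"),
    ("user request", "manual change"),
]


def classify_cause(cause):
    """Return a coarse label for one activity's Cause text (first match wins)."""
    lowered = cause.lower()
    for keyword, label in CAUSE_KEYWORDS:
        if keyword in lowered:
            return label
    return "other"


def summarize_activities(response):
    """Count activities per label in a describe-scaling-activities response."""
    activities = response.get("Activities", [])
    return Counter(classify_cause(a.get("Cause", "")) for a in activities)


# Hypothetical sample data shaped like the CLI's JSON output.
sample_response = {
    "Activities": [
        {"Cause": "At 2026-01-01T00:00:00Z an instance was taken out of service "
                  "in response to an ELB system health check failure."},
        {"Cause": "At 2026-01-01T00:05:00Z an instance was taken out of service "
                  "in response to an ELB system health check failure."},
        {"Cause": "At 2026-01-01T00:10:00Z a monitor alarm example-alarm in state "
                  "ALARM triggered policy example-policy changing the desired "
                  "capacity from 2 to 3."},
    ]
}

summary = summarize_activities(sample_response)
for label, count in summary.most_common():
    print(f"{count:3d}  {label}")
```

A cluster of health-check replacements shortly after each launch points at the startup-related patterns, while alarm-driven activity points at the scaling policy itself.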
Common ASG failure patterns:
| Symptom | Root Cause | Diagnostic |
| --- | --- | --- |
| Instances launch and immediately terminate | Fails ELB health check during startup | Increase HealthCheckGracePeriod |
| Instances launch then get replaced after minutes | Application crash after launch | Check instance system log (get-console-output) |
| Can't launch instances (InsufficientInstanceCapacity) | AZ capacity exhausted | Use multiple instance types or AZs |
| Instances scale out endlessly | Scaling metric never satisfies target | Check metric and target value |
ECS task failure analysis:
# Check stopped task reason
aws ecs describe-tasks --cluster prod --tasks "task-arn"
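# stoppedReason (task level), e.g. "Essential container in task exited"
# containers[].reason, e.g. "OutOfMemoryError: Container killed due to memory usage"
# containers[].exitCode: 137 = killed by SIGKILL (most often the OOM killer), 1 = generic application error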
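The fields worth reading in that response can be pulled out with a few lines of Python. This is a minimal sketch that assumes the `describe-tasks` JSON has been saved and loaded as a dict; `stoppedReason`, `containers`, `exitCode`, and `reason` are fields of the response, while the function name and the sample task are hypothetical.

```python
def failed_containers(task):
    """Return the containers in one task that did not exit cleanly.

    A container counts as failed when it has a non-zero exit code, or when it
    has no exit code at all but ECS recorded a reason (it never started).
    """
    failed = []
    for container in task.get("containers", []):
        exit_code = container.get("exitCode")
        reason = container.get("reason")
        nonzero_exit = exit_code is not None and exit_code != 0
        never_started = exit_code is None and bool(reason)
        if nonzero_exit or never_started:
            failed.append(
                {"name": container.get("name"), "exitCode": exit_code, "reason": reason}
            )
    return failed


# Hypothetical sample shaped like one entry of the "tasks" list.
sample_task = {
    "stoppedReason": "Essential container in task exited",
    "containers": [
        {
            "name": "web",
            "exitCode": 137,
            "reason": "OutOfMemoryError: Container killed due to memory usage",
        },
        {"name": "log-router", "exitCode": 0},
    ],
}

print("Task stopped because:", sample_task.get("stoppedReason"))
for item in failed_containers(sample_task):
    print(f"  {item['name']}: exit code {item['exitCode']}, reason: {item['reason']}")
```

Reading the task-level `stoppedReason` together with the per-container fields avoids blaming the wrong container when a task runs sidecars.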
EKS pod failure analysis:
  • kubectl describe pod <name>: shows events, restart count, and the last exit code of each container
  • kubectl logs <pod> --previous: shows logs from the previous (crashed) instance of the container
  • Common CrashLoopBackOff causes: missing config, failing liveness probe, OOM kill
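The same details that `kubectl describe pod` prints are available in structured form from `kubectl get pod <name> -o json`. The sketch below reads that JSON, assumed to be already loaded as a dict, and reports crash-looping containers. The `status.containerStatuses` fields are part of the Kubernetes Pod API; the function name and the sample pod are illustrative assumptions.

```python
from typing import List


def diagnose_pod(pod: dict) -> List[str]:
    """Summarize restart and termination details for each container in a pod."""
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        name = status.get("name", "?")
        waiting = status.get("state", {}).get("waiting") or {}
        last_exit = status.get("lastState", {}).get("terminated") or {}

        if waiting.get("reason") == "CrashLoopBackOff":
            restarts = status.get("restartCount", 0)
            findings.append(f"{name}: CrashLoopBackOff after {restarts} restarts")
        if last_exit:
            code = last_exit.get("exitCode")
            reason = last_exit.get("reason", "unknown")
            findings.append(f"{name}: last exit code {code} ({reason})")
    return findings


# Hypothetical sample shaped like `kubectl get pod <name> -o json` output.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "restartCount": 5,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
            }
        ]
    }
}

for line in diagnose_pod(sample_pod):
    print(line)
```

A last termination reason of `OOMKilled` points at the container's memory limit, whereas an `Error` reason with exit code 1 points at the application, and the `--previous` logs are the next place to look.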

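Exam Trap: ECS exit code 137 means the container was killed with SIGKILL (128 + 9). The usual cause is the OOM killer: the container tried to use more memory than the task definition allocated. Confirm this with the container's reason field, then either increase the task memory allocation or fix the application's memory usage. A container that ignores SIGTERM and is force-killed when its stop timeout expires also exits with 137. Exit code 1 is a generic application error; check the container logs. The exam tests whether you can interpret container exit codes.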
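Exit codes above 128 follow the common convention of 128 plus the number of the signal that ended the process, which is why 137 corresponds to signal 9 (SIGKILL). A small sketch of that arithmetic follows; the function name and the wording of the returned messages are illustrative, and the signal table lists only a few standard Linux signal numbers.

```python
# A few standard Linux signal numbers; extend as needed.
SIGNAL_NAMES = {2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Give a short interpretation of a container exit code."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        # Convention: exit code = 128 + signal number.
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    return "application error (check the container logs)"


for code in (0, 1, 137, 143):
    print(code, "->", explain_exit_code(code))
```

Note that the code alone says a SIGKILL was delivered, not who sent it, so the container's reason field is still needed to confirm an out-of-memory kill.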

Written by Alvin Varughese • Founder • 15 professional certifications