3.1.1.6. Identifying & Remediating Single Points of Failure

A single point of failure (SPOF) is any component whose failure, on its own, makes the system or a user-facing part of it unavailable. Finding SPOFs requires tracing every request path and asking, at each component, "What if this fails?"

Common SPOFs and remediations:

SPOF	Remediation
Single EC2 instance (no ASG)	Run in an ASG with min=2 spanning multiple AZs, behind a load balancer
Single-AZ RDS	Enable Multi-AZ deployment
Single NAT Gateway	Deploy one NAT GW per AZ
Hardcoded IP addresses	Use DNS names, ELB, or Elastic IPs with failover
Single region deployment	Add standby region with Route 53 failover
Application storing state locally	Move state to DynamoDB, ElastiCache, or EFS
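
As an illustration, two of the remediations in the table could be applied with the AWS CLI roughly as follows. This is a sketch, not a prescribed procedure: the names my-db and my-asg and the subnet IDs are hypothetical placeholders, and the small run helper is an addition for this example that only prints each command unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch: two remediations from the table as AWS CLI calls.
# "my-db", "my-asg", and the subnet IDs are hypothetical placeholders.
# Commands are only printed unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Single-AZ RDS -> Multi-AZ deployment
run aws rds modify-db-instance \
  --db-instance-identifier my-db --multi-az --apply-immediately

# Single instance -> ASG with at least 2 instances, in subnets from two AZs
run aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg --min-size 2 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222"
```

The ASG's maximum size must already be at least 2 for the second command to succeed, and the two subnets are assumed to sit in different AZs.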

SPOF detection methods:

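AWS Well-Architected Tool: Guided, question-based review of a workload against Reliability Pillar best practices. It flags risks based on your answers rather than by scanning your resources.
Chaos engineering: Inject failures (terminate instances, block network traffic) and observe how the system behaves. AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) provides managed chaos experiments.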
Architecture reviews: Trace every request through every component. If any single component's failure causes user impact, it's a SPOF.

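# AWS FIS: Terminate one randomly selected running EC2 instance in an ASG to test self-healing
# Placeholders (assumptions): the ASG name "my-asg" and the IAM role ARN; substitute your own.
# resourceTags limits targets to the ASG's instances (ASGs tag them with aws:autoscaling:groupName).
aws fis create-experiment-template \
  --description "Test ASG self-healing" \
  --targets '{"instances":{"resourceType":"aws:ec2:instance","resourceTags":{"aws:autoscaling:groupName":"my-asg"},"selectionMode":"COUNT(1)","filters":[{"path":"State.Name","values":["running"]}]}}' \
  --actions '{"terminateInstance":{"actionId":"aws:ec2:terminate-instances","targets":{"Instances":"instances"}}}' \
  --stop-conditions '[{"source":"none"}]' \
  --role-arn "arn:aws:iam::111122223333:role/FisExperimentRole"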
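
The architecture-review method can be approximated mechanically once the request path is written down. The sketch below assumes a hand-maintained inventory with one line per component in the form "<component> <instance_count> <az_count>" (a format invented for this example) and flags anything that has fewer than two instances or spans fewer than two AZs. The component names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: flag likely SPOFs from a hand-written inventory of the request path.
# Input format (an assumption): "<component> <instance_count> <az_count>" per line.
find_spofs() {
  # Print the component name when it lacks redundancy in count or in AZ spread.
  awk 'NF >= 3 && ($2 < 2 || $3 < 2) { print $1 }'
}

# Hypothetical request path: load balancer -> web tier -> NAT -> database
inventory='alb 2 2
web-asg 4 2
nat-gw 1 1
rds-primary 1 1'

printf '%s\n' "$inventory" | find_spofs
```

With this inventory the function prints nat-gw and rds-primary. A count-based check like this only catches missing redundancy; it does not replace failure injection, which shows whether the redundant copies actually take over.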

Exam Trap: A NAT Gateway is a managed service, but each gateway is created in one subnet and is redundant only within that subnet's AZ. If you have private subnets in 3 AZs routing through a single NAT GW in AZ-A, and AZ-A fails, the private subnets in all three AZs lose outbound internet access. Deploy one NAT GW per AZ and point each private subnet's route table at the NAT GW in its own AZ.
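
The trap above can be checked for by comparing each private subnet's AZ with the AZ of the NAT GW its default route uses. The sketch below assumes that mapping has already been collected (for example from aws ec2 describe-subnets, describe-route-tables, and describe-nat-gateways) into lines of the form "<subnet_id> <subnet_az> <nat_gw_id> <nat_gw_az>". That format and the IDs shown are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: find private subnets whose default route uses a NAT GW in another AZ.
# Input format (an assumption): "<subnet_id> <subnet_az> <nat_gw_id> <nat_gw_az>" per line.
cross_az_nat_routes() {
  awk 'NF >= 4 && $2 != $4 { printf "%s (%s) routes via %s in %s\n", $1, $2, $3, $4 }'
}

# Hypothetical layout matching the trap: three AZs, one NAT GW in us-east-1a.
routes='subnet-a us-east-1a nat-1 us-east-1a
subnet-b us-east-1b nat-1 us-east-1a
subnet-c us-east-1c nat-1 us-east-1a'

printf '%s\n' "$routes" | cross_az_nat_routes
```

Here subnet-b and subnet-c are reported, because both depend on a NAT GW outside their own AZ. After deploying one NAT GW per AZ and updating the route tables, the function should print nothing.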

Written by Alvin Varughese • Founder • 15 professional certifications