3.1.1.2. Mitigating Single Points of Failure
💡 First Principle: A resilient architecture is achieved by systematically identifying and eliminating any single component whose failure would cause the entire system or a critical function to fail.
Scenario: An architect is reviewing an existing application that runs on a single "EC2 instance" and uses an "Amazon RDS" database deployed in a single "Availability Zone (AZ)". The architect identifies both as single points of failure.
A robust architecture actively seeks out and mitigates Single Points of Failure ("SPOFs") at every layer.
- Compute Layer:
  - "SPOF": Single "EC2 instance", single "ECS task", single "Lambda" function deployment.
  - Mitigation: Deploy across multiple "AZs" using "Auto Scaling Groups" ("EC2"/"ECS"), or use managed services like "Lambda"/"Fargate" that distribute resources across "AZs" for you (for "Fargate", when the service is given subnets in more than one "AZ"). Use "Placement Groups" for specific isolation needs.
- Data Layer:
  - "SPOF": Single-"AZ" database (e.g., "RDS"), unreplicated data.
  - Mitigation: Use "RDS Multi-AZ", "Aurora Global Database", "DynamoDB Global Tables", "S3 Cross-Region Replication", and "EBS Snapshots". Ensure data is always replicated and backed up.
- Networking Layer:
  - "SPOF": Single "Internet Gateway", single "NAT Gateway" in an "AZ", reliance on a single "Direct Connect" link.
  - Mitigation: "IGWs" are inherently redundant. Deploy "NAT Gateways" in each "AZ" where private subnets need outbound internet access. For "Direct Connect", implement multiple links over diverse paths and multiple locations, or use "VPN" as backup.
- Application Layer:
  - "SPOF": Hardcoded endpoints, shared libraries on a single server, tightly coupled services.
  - Mitigation: Use load balancers ("ALB", "NLB") for traffic distribution, implement service discovery (e.g., "Cloud Map"), design for loose coupling with message queues ("SQS") or event buses ("EventBridge"), and use immutable infrastructure.
Visual: Mitigating Single Points of Failure (SPOFs)
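The layer-by-layer review above can be sketched as a small audit over a resource inventory. This is a minimal illustration, not an AWS API: the `Resource` type, the `find_spofs` helper, the role names, and the AZ names are all hypothetical, and the only rule applied is "a role that lives in fewer than two AZs is a candidate SPOF".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One deployed component in a hypothetical inventory (not an AWS API type)."""
    name: str
    role: str  # the function it serves, e.g. "web" or "database"
    az: str    # the Availability Zone it runs in


def find_spofs(resources, min_azs=2):
    """Return {role: sorted AZ list} for every role present in fewer than min_azs AZs."""
    azs_by_role = defaultdict(set)
    for resource in resources:
        azs_by_role[resource.role].add(resource.az)
    return {
        role: sorted(azs)
        for role, azs in azs_by_role.items()
        if len(azs) < min_azs
    }


# Inventory modelled on the scenario: one EC2 instance and a single-AZ RDS database.
before = [
    Resource("web-1", "web", "us-east-1a"),
    Resource("db-primary", "database", "us-east-1a"),
]

# After mitigation: an Auto Scaling Group spans two AZs and RDS Multi-AZ adds a standby.
after = [
    Resource("web-1", "web", "us-east-1a"),
    Resource("web-2", "web", "us-east-1b"),
    Resource("db-primary", "database", "us-east-1a"),
    Resource("db-standby", "database", "us-east-1b"),
]

print(find_spofs(before))  # {'web': ['us-east-1a'], 'database': ['us-east-1a']}
print(find_spofs(after))   # {}
```

A real review would also cover dependencies that this inventory model does not capture, such as hardcoded endpoints or a shared "NAT Gateway".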
⚠️ Common Pitfall: Overlooking dependencies on services in a single "AZ". For example, if all your private-subnet "EC2 instances" route through a single "NAT Gateway" in one "AZ" for outbound internet access, the failure of that "AZ" cuts off outbound internet connectivity for all of them, even those in other healthy "AZs".
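This kind of hidden cross-"AZ" dependency can be checked mechanically. The sketch below assumes three hypothetical dictionaries (subnet to AZ, NAT Gateway to AZ, and subnet to the NAT Gateway its default route points at); the names and the `cross_az_nat_dependencies` helper are illustrative and not part of any AWS SDK.

```python
def cross_az_nat_dependencies(subnet_az, nat_az, default_route):
    """List (subnet, subnet AZ, NAT gateway, NAT AZ) where egress leaves the subnet's AZ."""
    findings = []
    for subnet, nat in default_route.items():
        if subnet_az[subnet] != nat_az[nat]:
            findings.append((subnet, subnet_az[subnet], nat, nat_az[nat]))
    return findings


# Two private subnets, one per AZ (hypothetical names).
subnet_az = {"private-a": "us-east-1a", "private-b": "us-east-1b"}

# Pitfall: a single NAT Gateway in us-east-1a serves both subnets.
shared_nat_az = {"nat-a": "us-east-1a"}
shared_routes = {"private-a": "nat-a", "private-b": "nat-a"}

# Mitigation: one NAT Gateway per AZ, each subnet routing to its local gateway.
per_az_nat_az = {"nat-a": "us-east-1a", "nat-b": "us-east-1b"}
per_az_routes = {"private-a": "nat-a", "private-b": "nat-b"}

print(cross_az_nat_dependencies(subnet_az, shared_nat_az, shared_routes))
# [('private-b', 'us-east-1b', 'nat-a', 'us-east-1a')]
print(cross_az_nat_dependencies(subnet_az, per_az_nat_az, per_az_routes))
# []
```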
Key Trade-Offs:
- Redundancy vs. Cost: Eliminating "SPOFs" often involves duplicating resources or infrastructure (e.g., "Multi-AZ" deployments), which increases the overall cost of the solution.
Reflection Question: How would you mitigate the single points of failure identified in this existing application (single "EC2 instance" and single-"AZ RDS" database) using AWS services like "Auto Scaling Groups" and "RDS Multi-AZ", to enhance the application's overall resilience without dramatically increasing cost?
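The redundancy-versus-cost trade-off can be made concrete with a rough back-of-the-envelope model. The sketch assumes that redundant copies fail independently (a simplification, since correlated failures lower the real figure) and uses illustrative numbers, 99% availability and 100 currency units per copy per month, which are not AWS SLA or pricing values.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures."""
    return 1 - (1 - single) ** copies


def monthly_cost(unit_cost: float, copies: int) -> float:
    """Total cost when every redundant copy costs the same amount."""
    return unit_cost * copies


for copies in (1, 2, 3):
    availability = combined_availability(0.99, copies)
    cost = monthly_cost(100.0, copies)
    print(f"{copies} copies: availability {availability:.4%}, cost {cost:.2f}/month")
```

Under these assumptions cost grows linearly with each copy while the remaining downtime shrinks geometrically, which is why the second copy usually buys far more resilience per unit of cost than the third.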
