3.1.1.2. Mitigating Single Points of Failure
First Principle: A resilient architecture is achieved by systematically identifying and eliminating any single component whose failure would cause the entire system or a critical function to fail.
Scenario: An architect is reviewing an existing application that runs on a single "EC2 instance" and uses an "Amazon RDS" database deployed in a single "Availability Zone (AZ)". The architect identifies both as single points of failure.
A robust architecture actively seeks out and mitigates Single Points of Failure ("SPOFs") at every layer.
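- Compute Layer:
  - "SPOF": Single "EC2 instance", single "ECS task", single "Lambda" function deployment.
  - Mitigation: Deploy across multiple "AZs" using "Auto Scaling Groups" ("EC2"/"ECS"), or use managed services like "Lambda"/"Fargate" that distribute resources across "AZs" for you (for "Fargate", when the service is given subnets in more than one "AZ"). Use "Placement Groups" for specific isolation needs, for example the spread strategy, which places instances on distinct hardware.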
- Data Layer:
  - "SPOF": Single-"AZ" database (e.g., "RDS"), unreplicated data.
  - Mitigation: "RDS Multi-AZ", "Aurora Global Database", "DynamoDB Global Tables", "S3 Cross-Region Replication", "EBS Snapshots". Ensure data is always replicated and backed up.
- Networking Layer:
  - "SPOF": Single "Internet Gateway", single "NAT Gateway" in an "AZ", reliance on a single "Direct Connect" link.
  - Mitigation: "IGWs" are inherently redundant. Deploy "NAT Gateways" in each "AZ" where private subnets need outbound internet access. For "Direct Connect", implement multiple links over diverse paths and multiple locations, or use "VPN" as backup.
- Application Layer:
  - "SPOF": Hardcoded endpoints, shared libraries on a single server, tightly coupled services.
  - Mitigation: Use load balancers ("ALB", "NLB") for traffic distribution, implement service discovery (e.g., "Cloud Map"), design for loose coupling with message queues ("SQS") or event buses ("EventBridge"), and use immutable infrastructure.
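The layer-by-layer review above can be approximated in code. The following is a minimal sketch, not an AWS API: the `Component` type, the `find_spofs` helper, and the inventory entries are all hypothetical names invented for illustration. It assumes an architecture can be summarized as a list of components with the "AZs" each one runs in, and it flags anything that spans fewer than two "AZs" unless the service is redundant by design (such as an "IGW").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical inventory record; not an AWS data structure."""
    layer: str                       # e.g. "compute", "data", "network"
    name: str
    azs: frozenset                   # Availability Zones the component runs in
    managed_redundant: bool = False  # True for services redundant by design (e.g. an IGW)


def find_spofs(components, min_azs=2):
    """Return names of components that span fewer than `min_azs` AZs."""
    return [
        c.name
        for c in components
        if not c.managed_redundant and len(c.azs) < min_azs
    ]


# Hypothetical inventory modeled on the scenario: one EC2 instance and a
# single-AZ RDS database, next to components that are already redundant.
inventory = [
    Component("compute", "web-ec2", frozenset({"us-east-1a"})),
    Component("data", "orders-rds", frozenset({"us-east-1a"})),
    Component("network", "igw", frozenset(), managed_redundant=True),
    Component("compute", "api-asg", frozenset({"us-east-1a", "us-east-1b"})),
]

print(find_spofs(inventory))  # ['web-ec2', 'orders-rds']
```

A check like this only covers "AZ" placement. Application-layer issues such as hardcoded endpoints or tight coupling still need a design review.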
Visual: Mitigating Single Points of Failure (SPOFs)
⚠️ Common Pitfall: Overlooking dependencies on services in a single "AZ". For example, if all your "EC2 instances" rely on a single "NAT Gateway" in one "AZ" for internet access, the failure of that "AZ" will cut off outbound internet connectivity for all of those instances, even the ones running in other healthy "AZs".
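This pitfall can be checked mechanically. The sketch below is illustrative only: the function name, the dictionary layout, and the subnet and gateway identifiers are assumptions, not output from any AWS tool. Given which "AZ" each private subnet and each "NAT Gateway" lives in, and which "NAT Gateway" each subnet's default route targets, it lists the subnets that depend on a gateway in a different "AZ".

```python
def cross_az_nat_routes(subnet_az, subnet_nat, nat_az):
    """Return private subnets whose default route targets a NAT gateway in another AZ."""
    return sorted(
        subnet
        for subnet, nat in subnet_nat.items()
        if nat_az[nat] != subnet_az[subnet]
    )


# Hypothetical layout: two private subnets share one NAT gateway in us-east-1a.
subnet_az = {"private-a": "us-east-1a", "private-b": "us-east-1b"}
nat_az = {"nat-a": "us-east-1a"}
subnet_nat = {"private-a": "nat-a", "private-b": "nat-a"}

print(cross_az_nat_routes(subnet_az, subnet_nat, nat_az))  # ['private-b']

# After adding a NAT gateway in us-east-1b and repointing the route,
# no subnet depends on another AZ.
nat_az_fixed = {"nat-a": "us-east-1a", "nat-b": "us-east-1b"}
subnet_nat_fixed = {"private-a": "nat-a", "private-b": "nat-b"}

print(cross_az_nat_routes(subnet_az, subnet_nat_fixed, nat_az_fixed))  # []
```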
Key Trade-Offs:
- Redundancy vs. Cost: Eliminating "SPOFs" often involves duplicating resources or infrastructure (e.g., "Multi-AZ" deployments), which increases the overall cost of the solution.
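The trade-off can be made concrete with some rough arithmetic. The sketch below uses the standard formula for components in parallel and assumes that copies fail independently, which is optimistic because real failures are often correlated. The 99% availability figure and the unit cost of 100 are hypothetical inputs, not AWS numbers, and `redundancy_tradeoff` is a name invented for this example.

```python
def redundancy_tradeoff(single_availability, unit_cost, copies):
    """Estimate availability and cost for `copies` redundant components.

    Assumes independent failures: the system is down only when every copy
    is down, so availability = 1 - (1 - a) ** n. Cost scales linearly.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    combined = 1 - (1 - single_availability) ** copies
    return combined, unit_cost * copies


# Hypothetical inputs: each copy is 99% available and costs 100 per month.
for copies in (1, 2, 3):
    availability, cost = redundancy_tradeoff(0.99, 100.0, copies)
    print(f"{copies} copies: availability={availability:.6f}, monthly cost={cost:.2f}")
```

Under these assumptions, a second copy doubles the cost and raises estimated availability from 99% to roughly 99.99%. A third copy adds the same cost again for a much smaller gain, which is why redundancy is usually sized to the availability target rather than maximized.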
Reflection Question: How would you mitigate the single points of failure identified in this existing application (single "EC2 instance" and single-"AZ RDS" database) using AWS services like "Auto Scaling Groups" and "RDS Multi-AZ", to enhance the application's overall resilience without dramatically increasing cost?