2.1.1.2. Designing for High Availability (Multi-AZ, Placement Groups)
💡 First Principle: Eliminating single points of failure by distributing resources across independent failure domains ensures continuous application availability and resilience against localized outages.
Scenario: A critical financial trading application requires extremely low network latency between its compute instances, but also high availability. The architect needs to design how these instances are placed on underlying hardware to meet both requirements.
High Availability ("HA") is central to reliable cloud architectures. It focuses on ensuring that an application remains operational even if components or infrastructure fail.
- "Multi-Availability Zone (Multi-AZ) Deployments": A strategy that distributes compute resources and stateful services across physically isolated
"Availability Zones"
within a single"AWS Region"
. This protects against failures in a single data center or"AZ"
.- Implementation Example:
"EC2 Auto Scaling Groups"
spread instances across"AZs"
,"Elastic Load Balancing (ELB)"
distributes traffic,"Amazon RDS Multi-AZ"
for synchronous database replication,"Amazon EFS"
for shared file systems spanning"AZs"
.
- Implementation Example:
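A minimal sketch of such a Multi-AZ deployment, assuming boto3; the resource names, subnet IDs, launch template, and database settings below are illustrative placeholders, not values from this guide:

```python
# A minimal Multi-AZ deployment sketch using boto3. All resource names,
# subnet IDs, and sizing values are illustrative placeholders.
import boto3

autoscaling = boto3.client("autoscaling")
rds = boto3.client("rds")

# Auto Scaling Group spanning subnets in two different AZs: if one AZ
# fails, the group launches replacement capacity in the surviving AZ.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="trading-app-asg",                  # placeholder
    LaunchTemplate={"LaunchTemplateName": "trading-app-lt",  # placeholder
                    "Version": "$Latest"},
    MinSize=2,
    MaxSize=6,
    # Comma-separated subnet IDs, one subnet per AZ (placeholder IDs).
    VPCZoneIdentifier="subnet-0aaa1111,subnet-0bbb2222",
)

# RDS Multi-AZ: AWS maintains a synchronously replicated standby in a
# second AZ and fails over to it automatically.
rds.create_db_instance(
    DBInstanceIdentifier="trading-app-db",  # placeholder
    Engine="postgres",
    DBInstanceClass="db.m6g.large",
    AllocatedStorage=100,
    MasterUsername="dbadmin",
    MasterUserPassword="change-me",  # placeholder; prefer Secrets Manager
    MultiAZ=True,  # the single flag that enables synchronous standby replication
)
```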
- "Placement Groups": Specific configurations for
"EC2 instances"
that control how instances are placed on underlying hardware.- Cluster Placement Group: Packs instances close together in a single
"AZ"
for low-latency network performance. High risk of correlated failures within the group. - Spread Placement Group: Spreads instances across distinct underlying hardware (or
"AZs"
) to minimize correlated failures. Ideal for critical applications with a small number of instances (up to 7 per"AZ"
). - Partition Placement Group: Spreads instances across different racks (partitions) within an
"AZ"
. Reduces correlated failures and improves availability for larger, distributed workloads.
- Cluster Placement Group: Packs instances close together in a single
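The three strategies map directly onto the EC2 CreatePlacementGroup API. A minimal boto3 sketch; the group names and partition count are illustrative assumptions:

```python
# A minimal sketch of creating each placement group strategy with boto3.
import boto3

ec2 = boto3.client("ec2")

# Cluster: lowest latency, single AZ, highest correlated-failure risk.
ec2.create_placement_group(GroupName="low-latency-cluster", Strategy="cluster")

# Spread: each instance lands on distinct hardware; strongest isolation,
# limited to 7 running instances per AZ per group.
ec2.create_placement_group(GroupName="critical-spread", Strategy="spread")

# Partition: instances are divided across rack-level partitions in an AZ.
ec2.create_placement_group(
    GroupName="distributed-partition",
    Strategy="partition",
    PartitionCount=3,  # up to 7 partitions per AZ
)
```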
Visual: High Availability with Multi-AZ & Placement Groups
⚠️ Common Pitfall: Using a "Cluster Placement Group" for an application that requires high availability. While it provides excellent network performance, concentrating all instances on closely packed hardware within a single "AZ" creates a significant correlated failure risk: one rack or AZ failure can take down the entire group.
Key Trade-Offs:
- Performance (Low Latency) vs. Availability (Fault Isolation): A "Cluster Placement Group" optimizes for performance at the cost of availability. A "Spread" or "Partition Placement Group" optimizes for availability at the cost of slightly higher network latency.
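For the trading scenario in this section, a "Partition Placement Group" is one possible middle ground: latency-sensitive instance pairs can share a rack partition while remaining isolated from other partitions. A sketch reusing the hypothetical group from the previous example; the AMI ID and instance type are placeholders:

```python
# A sketch of launching instances into the hypothetical partition group
# created above. The AMI ID and instance type are assumptions.
import boto3

ec2 = boto3.client("ec2")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="c6in.8xlarge",      # network-optimized type (assumption)
    MinCount=2,
    MaxCount=2,
    Placement={
        "GroupName": "distributed-partition",
        "PartitionNumber": 0,  # pin this pair to one rack partition for low latency
    },
)
```

Pinning a latency-critical pair to one partition keeps it on shared racks, while placing other replicas in different partitions limits the blast radius of a single rack failure.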
Reflection Question: How would you balance the trade-off between ultra-low latency (e.g., using a "Cluster Placement Group") and minimizing correlated failures (e.g., using a "Spread Placement Group") when designing a highly available financial trading application with stringent latency requirements?