3.1.1.4. Techniques to Achieve High Availability (Multi-AZ, Multi-Region)
First Principle: Systems designed to withstand failures keep delivering service and meet their uptime objectives, whether the disruption is a single component or an entire data center.
High Availability (HA) puts this resilience into practice: redundant resources remove single points of failure, and automated failover moves traffic to whatever remains healthy.
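To make "uptime objectives" concrete, the sketch below estimates how redundancy changes availability. It is only illustrative arithmetic: the 99% figure is a hypothetical input, not an AWS number, and the formula assumes the redundant copies fail independently, which real deployments only approximate.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments.

    Assumes failures are independent: the system is down only when
    every copy is down at the same time.
    """
    return 1 - (1 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime in a 365-day year."""
    return (1 - availability) * 365 * 24 * 60


one_az = 0.99                               # hypothetical single-AZ availability
two_az = parallel_availability(one_az, 2)   # same workload in two AZs

print(f"one AZ:  {downtime_minutes_per_year(one_az):.0f} min/year")
print(f"two AZs: {downtime_minutes_per_year(two_az):.0f} min/year")
```

Under these assumptions, a second independent copy cuts expected downtime by roughly a factor of one hundred, which is the intuition behind spreading a workload across AZs.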
Key techniques and AWS services for HA:
- Multi-AZ Deployments: Distribute components across Availability Zones within a Region for fault tolerance against localized failures. Traffic routes to healthy resources if one AZ fails.
- Multi-Region Architectures: Extend HA beyond a single Region to provide disaster recovery. Workloads fail over to a standby Region during a regional disaster, so the service stays available globally.
- AWS Services for HA:
  - Auto Scaling: Adjusts capacity to demand and replaces unhealthy instances, providing elasticity and self-healing.
  - Elastic Load Balancing (ELB): Distributes traffic across targets in multiple AZs and uses health checks to route only to healthy endpoints.
  - Amazon Route 53: A scalable DNS service that supports DNS failover, directing users to healthy endpoints across AZs or Regions for global HA.
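The way health checks and self-healing work together in a Multi-AZ deployment can be sketched with a small in-memory simulation. This is not the ELB or Auto Scaling API: `Instance`, `LoadBalancer`, `heal`, and the AZ names are hypothetical stand-ins, and the replacement policy is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str
    az: str
    healthy: bool = True


class LoadBalancer:
    """Round-robin over the targets that currently pass their health check."""

    def __init__(self, targets):
        self.targets = targets
        self._next = 0

    def route(self) -> Instance:
        healthy = [t for t in self.targets if t.healthy]
        if not healthy:
            raise RuntimeError("no healthy targets in any AZ")
        target = healthy[self._next % len(healthy)]
        self._next += 1
        return target


def heal(fleet, available_azs):
    """Self-healing sketch: swap each unhealthy instance for a fresh one.

    The replacement stays in its original AZ when that AZ is available,
    otherwise it is placed in one of the remaining AZs.
    """
    healed = []
    for n, inst in enumerate(fleet):
        if inst.healthy:
            healed.append(inst)
            continue
        az = inst.az if inst.az in available_azs else available_azs[n % len(available_azs)]
        healed.append(Instance(f"{inst.instance_id}-replacement", az))
    return healed


fleet = [Instance("i-a1", "az-a"), Instance("i-b1", "az-b"), Instance("i-c1", "az-c")]

# Simulate the loss of one AZ: its instances start failing health checks.
for inst in fleet:
    if inst.az == "az-a":
        inst.healthy = False

lb = LoadBalancer(fleet)
served = [lb.route().az for _ in range(4)]   # traffic only reaches az-b and az-c
healed = heal(fleet, ["az-b", "az-c"])       # capacity is restored outside az-a
```

The load balancer keeps the application reachable the moment an AZ fails, while the healing step restores the lost capacity so the surviving AZs are not left overloaded.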
Key HA Techniques:
- Multi-AZ Deployments: Within-region fault tolerance.
- Multi-Region Architectures: Cross-region disaster recovery.
- Auto Scaling: Self-healing, elastic capacity.
- ELB: Traffic distribution, health checks.
- Route 53: DNS failover for global HA.
Scenario: A DevOps team manages a critical web application. They've already deployed it across multiple Availability Zones within a single region. Now, they need to enhance its resilience to withstand an entire regional outage, ensuring global continuity.
Reflection Question: How do Multi-Region architectures (e.g., using Route 53 DNS failover) extend high availability beyond a single region, providing robust disaster recovery and ensuring service delivery even during widespread events?
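One way to reason about the scenario is to model active-passive DNS failover as a lookup that prefers a primary record and falls back to a secondary one. The sketch below is a simulation, not the Route 53 API: `FailoverRecord`, `resolve`, and the endpoint names are hypothetical, and returning the primary when nothing is healthy is a choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class FailoverRecord:
    name: str
    role: str          # "PRIMARY" or "SECONDARY"
    endpoint: str
    healthy: bool = True


def resolve(records, name):
    """Return the endpoint a failover policy would answer with for `name`."""
    candidates = [r for r in records if r.name == name]
    primary = next((r for r in candidates if r.role == "PRIMARY"), None)
    secondary = next((r for r in candidates if r.role == "SECONDARY"), None)

    if primary and primary.healthy:
        return primary.endpoint
    if secondary and secondary.healthy:
        return secondary.endpoint
    # Nothing is healthy: answer with the primary as a last resort
    # (an assumption of this sketch).
    return primary.endpoint if primary else None


records = [
    FailoverRecord("app.example.com", "PRIMARY", "app.us-east-1.example.com"),
    FailoverRecord("app.example.com", "SECONDARY", "app.us-west-2.example.com"),
]

before = resolve(records, "app.example.com")   # primary Region answers

records[0].healthy = False                     # simulate a regional outage
after = resolve(records, "app.example.com")    # standby Region takes over
```

Because the switch happens at the DNS layer, clients keep using the same hostname; how quickly they reach the standby Region depends on health-check intervals and record TTLs.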
Together, these techniques and services create resilient architectures that keep applications operational through failures.
Tip: Differentiate fault tolerance (handling failures within a system, often Multi-AZ) from disaster recovery (recovering from widespread events, often Multi-Region).