3.1.1.1. Disaster Recovery Strategies (RTO, RPO, Pilot Light, Warm Standby, Multi-Site Active/Active)
💡 First Principle: A structured disaster recovery strategy, driven by business-defined objectives for recovery time (RTO) and data loss (RPO), is essential for minimizing the impact of a major disruption and ensuring business continuity.
Scenario: A gaming company's popular online game relies on a backend that experiences frequent updates and needs to be available globally with minimal latency. While extremely high availability is important, a full regional outage is a rare event, and the company can tolerate a few minutes of downtime for critical data.
Disaster Recovery (DR) is a comprehensive plan to recover from large-scale outages, often involving a separate AWS Region. Two key metrics define the strategy:
- Recovery Time Objective (RTO): The maximum tolerable downtime after a disaster.
- Recovery Point Objective (RPO): The maximum tolerable data loss after a disaster, measured as a window of time.
DR strategies (from highest RTO/RPO and lowest cost to lowest RTO/RPO and highest cost):
- Backup and Restore:
  - Concept: Back up data to S3 (potentially cross-Region) and restore to a new environment upon disaster.
  - RTO/RPO: Hours to days / Hours.
  - AWS Services: AWS Backup, S3, EC2 AMIs, RDS Snapshots.
  - Practical Relevance: Suitable for non-critical applications or data archives.
- Pilot Light:
  - Concept: A minimal core infrastructure is kept running in the DR Region, ready for quick scale-up. Data is replicated.
  - RTO/RPO: Minutes to hours / Minutes.
  - AWS Services: Cross-Region RDS Read Replicas, S3 CRR, pre-built AMIs, Auto Scaling Groups scaled to zero or minimal instances.
  - Practical Relevance: Cost-effective for applications with some tolerance for downtime.
- Warm Standby:
  - Concept: A scaled-down, fully functional production replica is continuously running and updated in the DR Region.
  - RTO/RPO: Minutes / Seconds to minutes.
  - AWS Services: Active-passive load balancers, small Auto Scaling Groups, constantly synchronized databases (e.g., RDS Cross-Region Read Replicas, DynamoDB Global Tables).
  - Practical Relevance: For business-critical applications requiring rapid recovery.
- Multi-Site Active/Active:
  - Concept: Application is fully deployed and actively serving traffic in multiple Regions simultaneously.
  - RTO/RPO: Near zero / Near zero.
  - AWS Services: DynamoDB Global Tables, Aurora Global Database, Route 53 (latency or geoproximity routing), ALB, CloudFront.
  - Practical Relevance: Highest cost, highest complexity. For mission-critical, global applications with no tolerance for downtime or data loss.
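The spectrum above can be read as a lookup: pick the cheapest strategy whose typical recovery figures still fit the business targets. The Python sketch below illustrates that reasoning. The numeric bounds are illustrative assumptions chosen to fall inside the qualitative ranges listed above (for example, "hours to days" is taken as two days); they are not AWS guarantees, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta

# Strategies ordered from lowest to highest cost, as in the list above.
# Each entry: (name, assumed worst-case RTO, assumed worst-case RPO).
# The numbers are illustrative assumptions, not AWS-published figures.
STRATEGIES = [
    ("Backup and Restore", timedelta(days=2), timedelta(hours=24)),
    ("Pilot Light", timedelta(hours=2), timedelta(minutes=30)),
    ("Warm Standby", timedelta(minutes=15), timedelta(minutes=5)),
    ("Multi-Site Active/Active", timedelta(seconds=30), timedelta(seconds=5)),
]


def cheapest_strategy(rto_target, rpo_target):
    """Return the first (cheapest) strategy whose assumed RTO and RPO
    both fit within the targets, or None if none does."""
    for name, rto, rpo in STRATEGIES:
        if rto <= rto_target and rpo <= rpo_target:
            return name
    return None


# Hypothetical targets: recover within 4 hours, lose at most 1 hour of data.
print(cheapest_strategy(timedelta(hours=4), timedelta(hours=1)))  # prints: Pilot Light
```

Because the list is ordered by cost, the first match is also the least expensive option under these assumptions; real targets should come from the business, and real bounds from measured drills.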
Visual: Disaster Recovery (DR) Strategy Spectrum
⚠️ Common Pitfall: Implementing a DR strategy without regularly testing it. An untested DR plan is likely to fail during a real disaster due to configuration drift, outdated procedures, or unforeseen dependencies.
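One way to make testing routine is to time each DR drill and compare the result against the objectives. The following is a minimal Python sketch, assuming a drill records three timestamps; the function and parameter names (`evaluate_drill`, `last_replicated`, and so on) and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_drill(last_replicated, outage_start, service_restored, rto, rpo):
    """Compare one DR drill against its objectives.

    last_replicated:  time of the newest data that reached the DR Region
    outage_start:     time the primary Region was (simulated as) lost
    service_restored: time the DR Region began serving traffic
    """
    downtime = service_restored - outage_start   # achieved recovery time
    data_loss = outage_start - last_replicated   # achieved recovery point
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }


# Hypothetical drill: data last replicated 40 seconds before the outage,
# service restored 12 minutes after it.
result = evaluate_drill(
    last_replicated=datetime(2024, 1, 1, 11, 59, 20),
    outage_start=datetime(2024, 1, 1, 12, 0, 0),
    service_restored=datetime(2024, 1, 1, 12, 12, 0),
    rto=timedelta(minutes=10),
    rpo=timedelta(minutes=1),
)
print(result["rto_met"], result["rpo_met"])  # prints: False True
```

In this made-up drill the RPO is met but the RTO is missed by two minutes, which is the kind of gap that only surfaces when the plan is actually exercised.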
Key Trade-Offs:
- RTO/RPO vs. Cost: The lower your RTO and RPO (i.e., the faster you need to recover with less data loss), the more expensive and complex your DR strategy will be. A near-zero RTO/RPO (Multi-Site Active/Active) is the most expensive option.
Reflection Question: Given the gaming company's global nature, its need for rapid recovery, and its tolerance for a few minutes of downtime for critical data, which disaster recovery strategy (from the list above) would you recommend, and which AWS services would be central to its implementation to balance RTO/RPO with cost?