3.3.5. Disaster Recovery Strategies: Pilot Light to Multi-Site
💡 First Principle: Disaster recovery is an economic decision masquerading as a technical one. Every dollar you spend reducing RTO and RPO is a dollar not spent on other things. The right DR strategy is the cheapest one that still meets your actual business requirements, not the most comprehensive one you can architect.
The four AWS DR strategies form a spectrum from cheapest-and-slowest to most-expensive-and-fastest:
Strategy 1: Backup and Restore
Keep only backups in the DR region. When disaster strikes, restore from scratch.
| Characteristic | Detail |
|---|---|
| RTO | Hours |
| RPO | Hours (depends on backup frequency) |
| Cost | Lowest: pay only for backup storage |
| When to use | Non-critical workloads; data archiving; long RTO acceptable |
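The "depends on backup frequency" note in the table can be made concrete with a little arithmetic. The sketch below is a minimal illustration, not an AWS API: the `BackupPlan` class, its field names, and the half-hour detection delay are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPlan:
    """Hypothetical description of a Backup and Restore setup."""
    backup_interval_hours: float  # time between successive backups
    restore_hours: float          # time to rebuild infrastructure and restore data


def worst_case_rpo_hours(plan: BackupPlan) -> float:
    # A failure just before the next backup loses a full interval of data.
    return plan.backup_interval_hours


def estimated_rto_hours(plan: BackupPlan, detection_hours: float = 0.5) -> float:
    # Recovery time is the time to notice the outage plus the time to restore.
    # The 0.5 hour default is an assumed figure, not a measured one.
    return detection_hours + plan.restore_hours


nightly = BackupPlan(backup_interval_hours=24, restore_hours=4)
print(worst_case_rpo_hours(nightly))  # 24
print(estimated_rto_hours(nightly))   # 4.5
```

Under these assumed numbers, moving from nightly to hourly backups cuts the worst-case RPO from 24 hours to 1 hour, but the RTO does not change, because restore time is independent of backup frequency.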
Strategy 2: Pilot Light
Keep a minimal core of your architecture running in the DR region (e.g., a replicated database, key configuration). Application servers are defined (for example, as AMIs) but are not running until failover.
| Characteristic | Detail |
|---|---|
| RTO | 10–60 minutes (spin up servers from AMIs) |
| RPO | Minutes (near-real-time DB replication) |
| Cost | Low: pay for DB replication + minimal compute |
| When to use | Core business systems; some downtime acceptable |
Strategy 3: Warm Standby
A scaled-down but fully functional version of your production environment runs continuously in the DR region. Scale it up to full capacity when needed.
| Characteristic | Detail |
|---|---|
| RTO | Minutes (scale up existing infrastructure) |
| RPO | Seconds to minutes (continuous replication) |
| Cost | Medium: pay for a scaled-down copy of the production environment |
| When to use | Business-critical systems; limited downtime acceptable |
Strategy 4: Multi-Site Active-Active
Full production capacity running simultaneously in two or more regions. Traffic is split between regions in normal operation.
| Characteristic | Detail |
|---|---|
| RTO | Near-zero (traffic shifts to the healthy region as soon as health checks detect the failure) |
| RPO | Near-zero (data is replicated continuously between regions; asynchronous replication can still lag slightly) |
| Cost | Highest: full production capacity in 2+ regions |
| When to use | Mission-critical; regulatory requirements; global user base |
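The First Principle (pick the cheapest strategy that still meets the requirement) can be sketched as a small lookup over the four strategies above. This is an illustration under stated assumptions: the minute values are rough upper bounds chosen to match the qualitative ranges in the tables ("hours", "10–60 minutes", "minutes", "near-zero"), and `relative_cost` is only a ranking, not a price.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float   # assumed upper bound on recovery time
    rpo_minutes: float   # assumed upper bound on data loss
    relative_cost: int   # 1 = cheapest, 4 = most expensive (ranking only)


# Illustrative figures, not AWS guarantees; adjust them to your own measurements.
STRATEGIES = [
    Strategy("Backup and Restore", rto_minutes=24 * 60, rpo_minutes=24 * 60, relative_cost=1),
    Strategy("Pilot Light", rto_minutes=60, rpo_minutes=10, relative_cost=2),
    Strategy("Warm Standby", rto_minutes=10, rpo_minutes=5, relative_cost=3),
    Strategy("Multi-Site Active-Active", rto_minutes=1, rpo_minutes=1, relative_cost=4),
]


def cheapest_strategy(required_rto_minutes: float,
                      required_rpo_minutes: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy that meets both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= required_rto_minutes
        and s.rpo_minutes <= required_rpo_minutes
    ]
    return min(candidates, key=lambda s: s.relative_cost, default=None)


choice = cheapest_strategy(required_rto_minutes=240, required_rpo_minutes=60)
print(choice.name if choice else "No listed strategy meets these targets")
# prints "Pilot Light"
```

With a 4-hour RTO and 1-hour RPO, Backup and Restore is excluded by its assumed bounds and Pilot Light is the cheapest remaining option; the more expensive strategies also qualify but would overspend on the requirement.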
AWS Elastic Disaster Recovery (DRS): A managed service for server replication and recovery. DRS continuously replicates on-premises servers or EC2 instances to a staging area in the target region. When you need to fail over, it launches full-size recovery instances within minutes. Because replication is continuous, DRS can achieve an RPO measured in seconds without the complexity of building a custom replication pipeline.
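A DR plan is only as good as its last test, so a recovery drill is a natural first use of DRS. The sketch below assumes the boto3 `drs` client and its `start_recovery` operation with the `isDrill` and `sourceServers` parameters; verify the call shape against the current boto3 documentation before relying on it. The helper name `start_recovery_drill` and the example server ID are hypothetical.

```python
from typing import Any, Dict, List


def start_recovery_drill(drs_client: Any, source_server_ids: List[str]) -> Dict[str, Any]:
    """Launch drill (non-production) recovery instances for the given source servers.

    drs_client is assumed to behave like boto3.client("drs"); it is passed in
    so the function can be exercised with a stub and without AWS credentials.
    """
    if not source_server_ids:
        raise ValueError("At least one source server ID is required")
    return drs_client.start_recovery(
        isDrill=True,  # drill mode: launches instances without declaring a real failover
        sourceServers=[{"sourceServerID": sid} for sid in source_server_ids],
    )


# Example usage against a real account (assumed region and server ID):
#   import boto3
#   drs = boto3.client("drs", region_name="us-west-2")
#   response = start_recovery_drill(drs, ["s-1234567890abcdef0"])
```

Injecting the client keeps the helper independent of credentials and region configuration, which also makes it easy to unit test.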
⚠️ Exam Trap: The exam distinguishes between DR strategies by their cost and RTO/RPO characteristics. When a question says "a company wants the lowest RTO and can afford the cost" → Multi-Site Active-Active. "Lowest cost but can accept hours of downtime" → Backup and Restore. The trap is selecting Pilot Light when the question specifies near-zero RTO: Pilot Light still requires provisioning servers during failover, which takes 10–60 minutes.
Reflection Question: A healthcare company has a regulatory requirement: their patient record system must be recoverable in under 15 minutes (RTO) with no more than 5 minutes of data loss (RPO) in the event of a complete regional failure. Which DR strategy do you recommend, and what specific AWS services implement it?