3.1.3.1. Disaster Recovery Concepts (RTO, RPO)
Every DR strategy is a trade-off between cost and recovery speed. The exam tests whether you can match the right strategy to given RTO/RPO constraints.
Recovery Time Objective (RTO): Maximum acceptable time from disaster to full restoration. A 4-hour RTO means the business can tolerate 4 hours of downtime.
Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. A 1-hour RPO means you can lose at most 1 hour of data.
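
To make the two definitions concrete, here is a minimal sketch that checks a single incident against both objectives. The function names `meets_rpo` and `meets_rto` and the timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

def meets_rpo(last_recovery_point: datetime, disaster_time: datetime,
              rpo: timedelta) -> bool:
    """Data loss is the time between the last recovery point and the disaster."""
    data_loss = disaster_time - last_recovery_point
    return data_loss <= rpo

def meets_rto(disaster_time: datetime, restored_time: datetime,
              rto: timedelta) -> bool:
    """Downtime is the time between the disaster and full restoration."""
    downtime = restored_time - disaster_time
    return downtime <= rto

# Hypothetical incident: last backup 02:00, disaster 02:45, restored 05:30.
last_backup = datetime(2024, 1, 1, 2, 0)
disaster = datetime(2024, 1, 1, 2, 45)
restored = datetime(2024, 1, 1, 5, 30)

rpo_ok = meets_rpo(last_backup, disaster, timedelta(hours=1))  # 45 min lost
rto_ok = meets_rto(disaster, restored, timedelta(hours=4))     # 2 h 45 min down
```

Note that the worst case for data loss is a disaster just before the next backup, so a backup taken every 6 hours cannot guarantee a 1-hour RPO even if a particular incident happens to lose less.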
DR strategy spectrum (cost ↔ recovery speed):
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Backups in S3; restore when needed |
| Pilot Light | 10s of minutes | Minutes | $$ | Data layer running continuously (DB replica); compute launched on failover |
| Warm Standby | Minutes | Seconds-Minutes | $$$ | Scaled-down duplicate running in DR region |
| Active-Active | Near-zero | Near-zero | $$$$ | Full infrastructure in both regions serving traffic |
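
The matching exercise the exam asks for can be sketched as a lookup: walk the strategies from cheapest to most expensive and pick the first one whose achievable RTO and RPO fit the requirement. The numeric bounds below are illustrative assumptions standing in for the table's qualitative ranges ("Hours", "Minutes", "Near-zero"), not official figures, and `Strategy` and `cheapest_strategy` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class Strategy:
    name: str
    achievable_rto: timedelta  # assumed bound, not an official figure
    achievable_rpo: timedelta  # assumed bound, not an official figure

# Ordered cheapest first. All durations are illustrative assumptions.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    Strategy("Pilot Light", timedelta(minutes=30), timedelta(minutes=10)),
    Strategy("Warm Standby", timedelta(minutes=10), timedelta(minutes=1)),
    Strategy("Active-Active", timedelta(seconds=30), timedelta(seconds=5)),
]

def cheapest_strategy(required_rto: timedelta,
                      required_rpo: timedelta) -> Optional[Strategy]:
    """Return the cheapest strategy that satisfies both objectives, or None."""
    for strategy in STRATEGIES:
        if (strategy.achievable_rto <= required_rto
                and strategy.achievable_rpo <= required_rpo):
            return strategy
    return None

# A 4-hour RTO with a 1-hour RPO rules out Backup & Restore under these
# assumed bounds, so the cheapest fit is Pilot Light.
choice = cheapest_strategy(timedelta(hours=4), timedelta(hours=1))
```

The ordering matters: because the list is sorted by cost, the first match is also the cheapest, which mirrors how exam questions usually ask for the "most cost-effective" option that still meets the constraints.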
Key distinction — Pilot Light vs. Warm Standby:
- Pilot Light: Only the data layer runs continuously (RDS replica, S3 replication). Compute and application layers are off. On disaster, you launch compute, update DNS, and promote the replica.
- Warm Standby: A fully functional but scaled-down copy of production runs continuously. On disaster, scale up the DR environment and redirect traffic. Faster than Pilot Light because compute is already running.
Exam Trap: If the question specifies RTO < 5 minutes, treat Active-Active as the only strategy that reliably meets the requirement. Warm Standby is already running but must still scale up and take over traffic, which takes minutes. Pilot Light must launch new instances, which takes 10+ minutes. The exam often tests whether you understand these time boundaries: don't choose Warm Standby when the required RTO is only a few minutes or less.
