Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.3.5. Disaster Recovery Strategies: Pilot Light to Multi-Site

💡 First Principle: Disaster recovery is an economic decision masquerading as a technical one. Every dollar you spend reducing your recovery time objective (RTO) and recovery point objective (RPO) is a dollar not spent elsewhere. The right DR strategy is the cheapest one that still meets your actual business requirements, not the most comprehensive one you can architect.

The four AWS DR strategies form a spectrum from cheapest-and-slowest to most-expensive-and-fastest:

Strategy 1: Backup and Restore

Keep only backups in the DR region. When disaster strikes, restore from scratch.
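
Because only backups exist in the DR region, the data you can lose is bounded by the backup schedule. The sketch below works through that arithmetic. The function names and the example figures (a 6-hour interval and a 30-minute cross-region copy) are illustrative assumptions, not AWS numbers. The worst case is a failure just before the next backup becomes usable in the DR region.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta,
                   copy_duration: timedelta = timedelta(0)) -> timedelta:
    """Upper bound on data loss for a periodic backup schedule.

    Assumes each backup captures the state at the moment it starts and only
    becomes usable in the DR region after copy_duration has elapsed.
    """
    return backup_interval + copy_duration


def meets_rpo(required_rpo: timedelta, backup_interval: timedelta,
              copy_duration: timedelta = timedelta(0)) -> bool:
    """True if the schedule keeps worst-case data loss within required_rpo."""
    return worst_case_rpo(backup_interval, copy_duration) <= required_rpo


# Illustrative numbers: backups every 6 hours, 30 minutes to copy cross-region.
example = worst_case_rpo(timedelta(hours=6), timedelta(minutes=30))
print(example)  # 6:30:00
```

Shortening the backup interval is the main lever for improving RPO in this strategy. RTO is governed separately by how long the restore and re-provisioning take.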

| Characteristic | Detail |
| --- | --- |
| RTO | Hours |
| RPO | Hours (depends on backup frequency) |
| Cost | Lowest (pay only for storage) |
| When to use | Non-critical workloads; data archiving; long RTO acceptable |

Strategy 2: Pilot Light

Keep a minimal core of your architecture running in the DR region (e.g., a replicated database, key configuration). Application servers are not running; they are launched from prebuilt images (AMIs) only at failover.
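
A minimal sketch of a Pilot Light failover follows. It assumes the replicated database is an Amazon RDS cross-region read replica and that the application tier is baked into an AMI. The function name and its parameters are hypothetical. The two client calls are the boto3 `promote_read_replica` and `run_instances` operations, and the clients are passed in so the sketch can be exercised without AWS credentials.

```python
from typing import Any, List


def activate_pilot_light(rds_client: Any, ec2_client: Any, replica_id: str,
                         ami_id: str, instance_type: str, count: int) -> List[str]:
    """Promote the DR database and launch application servers from an AMI.

    rds_client and ec2_client are expected to behave like
    boto3.client("rds") and boto3.client("ec2") created for the DR region.
    """
    # Step 1: make the replicated database writable in the DR region.
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Step 2: start the application servers that were not running before.
    response = ec2_client.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
    )
    return [instance["InstanceId"] for instance in response["Instances"]]
```

A real failover would also wait for the promotion to complete and repoint DNS or load balancers. Those steps, together with instance boot time, are why this strategy cannot deliver a near-zero RTO.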

| Characteristic | Detail |
| --- | --- |
| RTO | 10–60 minutes (spin up servers from AMIs) |
| RPO | Minutes (near-real-time DB replication) |
| Cost | Low (pay for DB replication plus minimal compute) |
| When to use | Core business systems; some downtime acceptable |

Strategy 3: Warm Standby

A scaled-down but fully functional version of your production environment runs continuously in the DR region. Scale it up to full capacity when needed.
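
The scale-up step could look roughly like the sketch below. It assumes the standby application tier runs in an EC2 Auto Scaling group. The function name and parameters are hypothetical. The single client call is the boto3 Auto Scaling `update_auto_scaling_group` operation, and the client is injected so the logic can be tested without an AWS account.

```python
from typing import Any


def scale_up_warm_standby(autoscaling_client: Any, group_name: str,
                          production_capacity: int, max_capacity: int) -> None:
    """Raise a scaled-down standby Auto Scaling group to production size.

    autoscaling_client is expected to behave like
    boto3.client("autoscaling") created for the DR region.
    """
    if max_capacity < production_capacity:
        raise ValueError("max_capacity must be at least production_capacity")

    # The group is already running at reduced size, so failover only
    # changes its size limits instead of building servers from scratch.
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=production_capacity,
        MaxSize=max_capacity,
        DesiredCapacity=production_capacity,
    )
```

Because the environment is already serving (or able to serve) requests, failover is a resize rather than a build, which is what separates Warm Standby from Pilot Light.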

| Characteristic | Detail |
| --- | --- |
| RTO | Minutes (scale up existing infrastructure) |
| RPO | Seconds to minutes (continuous replication) |
| Cost | Medium (pay for a reduced-size production environment) |
| When to use | Business-critical systems; limited downtime acceptable |

Strategy 4: Multi-Site Active-Active

Full production capacity running simultaneously in two or more regions. Traffic is split between regions in normal operation.

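| Characteristic | Detail |
| --- | --- |
| RTO | Near-zero (traffic shifts to the healthy region as soon as health checks detect the failure) |
| RPO | Near-zero (data is replicated continuously between regions; asynchronous replication can still lag slightly) |
| Cost | Highest (full production capacity in 2+ regions) |
| When to use | Mission-critical; regulatory requirements; global user base |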

AWS Elastic Disaster Recovery (DRS): A managed service for server replication and recovery. DRS continuously replicates on-premises servers or EC2 instances to a staging area in the target region. When you need to fail over, it launches full-size recovery instances within minutes. DRS provides an RPO measured in seconds without the complexity of building a custom replication pipeline.

⚠️ Exam Trap: The exam distinguishes between DR strategies by their cost and RTO/RPO characteristics. When a question says "a company wants the lowest RTO and can afford the cost," the answer is Multi-Site Active-Active. "Lowest cost but can accept hours of downtime" points to Backup and Restore. The trap is selecting Pilot Light when the question specifies near-zero RTO: Pilot Light still requires provisioning servers during failover, which takes 10–60 minutes.
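
The selection rule (pick the cheapest strategy that still meets the stated RTO and RPO) can be sketched as a small lookup. The minute thresholds below are one illustrative reading of the tables above, not official AWS figures, and the `Strategy` class and `cheapest_strategy` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_rto_min: float  # assumed upper bound on recovery time, in minutes
    worst_rpo_min: float  # assumed upper bound on data loss, in minutes
    cost_rank: int        # 1 = cheapest, 4 = most expensive


# Assumed thresholds: "hours" is read as up to 24 hours, "minutes" as up to
# 10, and "near-zero" as up to 1 minute.
STRATEGIES: List[Strategy] = [
    Strategy("Backup and Restore", 24 * 60, 24 * 60, 1),
    Strategy("Pilot Light", 60, 10, 2),
    Strategy("Warm Standby", 10, 5, 3),
    Strategy("Multi-Site Active-Active", 1, 1, 4),
]


def cheapest_strategy(required_rto_min: float, required_rpo_min: float) -> Strategy:
    """Return the lowest-cost strategy whose worst case meets both targets."""
    candidates = [
        s for s in STRATEGIES
        if s.worst_rto_min <= required_rto_min
        and s.worst_rpo_min <= required_rpo_min
    ]
    if not candidates:
        raise ValueError("no listed strategy meets these requirements")
    return min(candidates, key=lambda s: s.cost_rank)


# "Can accept hours of downtime" -> the cheapest option qualifies.
print(cheapest_strategy(24 * 60, 24 * 60).name)  # Backup and Restore
# "Near-zero RTO" -> Pilot Light is filtered out by its 60-minute worst case.
print(cheapest_strategy(1, 1).name)              # Multi-Site Active-Active
```

The filter runs before the cost comparison, which mirrors the exam logic: a strategy that misses the RTO or RPO target is never a candidate, however cheap it is.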

Reflection Question: A healthcare company has a regulatory requirement: their patient record system must be recoverable in under 15 minutes (RTO) with no more than 5 minutes of data loss (RPO) in the event of a complete regional failure. Which DR strategy do you recommend, and what specific AWS services implement it?

Written by Alvin Varughese
Founder • 15 professional certifications