3.1.3.2. Backup & Recovery Strategies (Pilot Light, Warm Standby)
First Principle: Business continuity depends on restoring critical data and applications while keeping data loss within the Recovery Point Objective (RPO) and downtime within the Recovery Time Objective (RTO).
Disaster recovery (DR) is the practice of restoring critical data and applications after a disruptive event. This principle guides strategy selection: tighter recovery objectives require more infrastructure kept running in the recovery Region, which raises cost.
Key DR Strategies (ordered from highest RTO/RPO and lowest cost to lowest RTO/RPO and highest cost):
- Backup and Restore:
  - Concept: Back up data to S3 (optionally copied to another Region) and restore it into a newly provisioned environment after a disaster.
  - RTO/RPO: Hours to days / Hours.
  - AWS Services: AWS Backup, S3, EC2 AMIs, RDS snapshots.
  - Use Case: Non-critical applications or data archives.
- Pilot Light:
  - Concept: Only the core infrastructure (typically the data layer) is kept running in the recovery Region, with data replicated continuously. Application servers are provisioned or scaled up on failover.
  - RTO/RPO: Minutes to hours / Minutes.
  - AWS Services: Cross-Region RDS read replicas, S3 Cross-Region Replication (CRR), pre-built AMIs, Auto Scaling groups scaled to zero or a minimal instance count.
  - Use Case: Cost-effective for applications with some tolerance for downtime.
- Warm Standby:
  - Concept: A scaled-down but fully functional copy of production runs continuously in the DR Region and is kept up to date.
  - RTO/RPO: Minutes / Seconds to minutes.
  - AWS Services: Load balancers in an active-passive configuration, small Auto Scaling groups, continuously synchronized databases (e.g., RDS cross-Region read replicas, DynamoDB global tables).
  - Use Case: Business-critical applications requiring rapid recovery.
- Multi-Site Active/Active:
  - Concept: The application is fully deployed and actively serving traffic in multiple Regions simultaneously.
  - RTO/RPO: Near zero / Near zero.
  - AWS Services: DynamoDB global tables, Aurora Global Database, Route 53, ALB, CloudFront.
  - Use Case: Mission-critical, global applications with no tolerance for downtime or data loss. This is the highest-cost, highest-complexity option.
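The spectrum above can be read as a lookup: pick the cheapest strategy whose typical RTO and RPO still fit the targets. The sketch below is a minimal illustration of that reasoning. The `Strategy` class, the `cheapest_strategy` function, and every numeric bound are assumptions chosen to match the qualitative ranges in the list ("hours to days", "minutes", and so on); they are not AWS-published figures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    cost_rank: int      # 1 = cheapest; relative ordering only
    rto_minutes: float  # assumed worst-case recovery time
    rpo_minutes: float  # assumed worst-case data-loss window


# Illustrative upper bounds only (assumptions, not AWS figures).
STRATEGIES = [
    Strategy("Backup and Restore", 1, rto_minutes=2 * 24 * 60, rpo_minutes=24 * 60),
    Strategy("Pilot Light", 2, rto_minutes=120, rpo_minutes=15),
    Strategy("Warm Standby", 3, rto_minutes=10, rpo_minutes=5),
    Strategy("Multi-Site Active/Active", 4, rto_minutes=1, rpo_minutes=1),
]


def cheapest_strategy(rto_target_minutes: float,
                      rpo_target_minutes: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy meeting both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= rto_target_minutes
        and s.rpo_minutes <= rpo_target_minutes
    ]
    return min(candidates, key=lambda s: s.cost_rank, default=None)


if __name__ == "__main__":
    # Example targets: 10 minutes of downtime, 5 minutes of data loss.
    choice = cheapest_strategy(rto_target_minutes=10, rpo_target_minutes=5)
    print(choice.name if choice else "No strategy meets these targets")
```

In practice the bounds would come from measured recovery drills rather than fixed constants, so the table is the part to replace first.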
Scenario: A DevOps team needs to design a disaster recovery (DR) strategy for a critical application. The application can tolerate several minutes of downtime (RTO) but needs minimal data loss (RPO in minutes). They want a solution that balances cost and recovery objectives.
Reflection Question: Compare and contrast the Pilot Light and Warm Standby DR strategies. How do they differ in terms of resource utilization in the recovery Region and their impact on RTO and cost?
Strategy selection hinges on application criticality, RTO/RPO needs, and budget. Pilot Light offers cost savings for less critical systems; Warm Standby provides faster recovery for essential services.
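The difference between the two strategies shows up most clearly at failover time. The following is a hedged sketch of a failover step, assuming `rds` and `autoscaling` are boto3 clients created for the DR Region (for example `boto3.client("rds", region_name=...)`), and that the replica identifier, group name, and production capacity are hypothetical values supplied by the caller. It also assumes a Pilot Light group is kept at zero or near-zero capacity, while a Warm Standby group already runs a few instances.

```python
def fail_over(rds, autoscaling, replica_id, asg_name, production_capacity):
    """Promote the DR replica and scale the DR fleet to production size.

    Returns how many instances were already running and how many must
    still be launched, which is where Pilot Light and Warm Standby differ.
    """
    # Promote the cross-Region read replica to a standalone, writable DB.
    # Promotion is asynchronous; a real runbook would wait for it to finish.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Read the group's current size before changing it.
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )
    group = response["AutoScalingGroups"][0]
    running = group["DesiredCapacity"]

    # Scale the DR fleet up to full production capacity.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=production_capacity,
        MaxSize=max(production_capacity, group["MaxSize"]),
        DesiredCapacity=production_capacity,
    )

    return {
        "already_running": running,
        "instances_to_launch": max(production_capacity - running, 0),
    }
```

With a Pilot Light group at zero, every instance has to launch before traffic can be served, which lengthens RTO. With a Warm Standby group already running, the DR site can accept traffic immediately at reduced capacity while the remaining instances launch. DNS or load balancer cutover is omitted from this sketch.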
Tip: Compare these with "Backup and Restore" (highest RTO/RPO, lowest cost) and "Multi-Site Active/Active" (lowest RTO/RPO, highest cost) for a complete view of the DR spectrum.