3.3.3. Point-in-Time Restore and RTO/RPO Objectives
💡 First Principle: The value of a recovery mechanism is defined by what it promises: how much data you might lose (RPO) and how long recovery takes (RTO). Selecting the right recovery mechanism isn't about features; it's about matching the service's actual capabilities to the business's actual requirements.
RDS Point-in-Time Restore (PITR):
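RDS automated backups enable restore to any second within the retention period (up to 35 days), up to the latest restorable time, which typically trails the present by about five minutes. The process:
- RDS restores the most recent automated backup taken before the target time
- RDS then replays transaction logs forward to reach the target time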
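The two steps above can be modeled in a few lines. This is a toy sketch, not how RDS is implemented internally: the `LogRecord` type, the dictionary-based "database", and the integer timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    timestamp: int  # hypothetical clock, in seconds
    key: str
    value: str


def restore_to_point_in_time(backups, log, target_time):
    """Rebuild state as of target_time from a backup plus log replay.

    backups: dict mapping backup timestamp -> snapshot of the data (a dict)
    log:     list of LogRecord, covering writes made after the backups
    """
    eligible = [t for t in backups if t <= target_time]
    if not eligible:
        raise ValueError("target time precedes the earliest backup")
    base_time = max(eligible)          # step 1: latest backup before the target
    state = dict(backups[base_time])   # copy, so the source is never modified
    for record in sorted(log, key=lambda r: r.timestamp):
        # step 2: replay only the writes between the backup and the target
        if base_time < record.timestamp <= target_time:
            state[record.key] = record.value
    return state


backups = {
    100: {"order-1": "placed"},
    200: {"order-1": "placed", "order-2": "placed"},
}
log = [
    LogRecord(150, "order-2", "placed"),
    LogRecord(230, "order-3", "placed"),
    LogRecord(260, "order-1", "DELETED"),  # the mistake we want to undo
]
restored = restore_to_point_in_time(backups, log, target_time=250)
```

Restoring to time 250 starts from the backup taken at 200, replays the write at 230, and stops before the bad write at 260, so `restored` contains all three orders in the `placed` state.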
PITR creates a new DB instance; it does not overwrite the existing one. This means:
- Original database continues running while restore is in progress
- You can test the restored database before redirecting traffic
- The new instance gets its own endpoint, so a DNS or connection-string update is required to point your application at it
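As a rough illustration of the "new instance" behavior, the sketch below builds the arguments for boto3's `restore_db_instance_to_point_in_time` call. The helper name `build_pitr_request` and the instance identifiers `orders-db` and `orders-db-restored` are hypothetical; the actual AWS call is left as a comment because it needs credentials and a real source instance.

```python
from datetime import datetime, timezone


def build_pitr_request(source_id, target_id, restore_time=None):
    """Build keyword arguments for the RDS point-in-time restore API."""
    if source_id == target_id:
        # PITR never overwrites the source, so the target needs a new name.
        raise ValueError("target identifier must differ from the source")
    params = {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
    }
    if restore_time is None:
        # No explicit time: restore to the latest restorable time instead.
        params["UseLatestRestorableTime"] = True
    else:
        if restore_time.tzinfo is None:
            raise ValueError("restore_time must be timezone-aware")
        params["RestoreTime"] = restore_time.astimezone(timezone.utc)
    return params


request = build_pitr_request(
    "orders-db",            # hypothetical source instance
    "orders-db-restored",   # hypothetical name for the new instance
    datetime(2024, 5, 1, 13, 45, 0, tzinfo=timezone.utc),
)

# With AWS credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("rds").restore_db_instance_to_point_in_time(**request)
```

Once the new instance reports as available, the cutover is a separate step: the application (or a DNS record it resolves) has to be repointed at the restored instance's endpoint.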
RTO implications: RDS PITR typically takes minutes to hours, depending on database size and the volume of transaction logs to replay. It is not suitable for RTO requirements measured in seconds.
DynamoDB PITR: Works similarly. It restores to a new table at any point in the past 35 days. Restore time generally grows with table size.
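A minimal sketch of the same idea for DynamoDB, assuming the 35-day window stated above. It validates a requested restore time and builds arguments for boto3's `restore_table_to_point_in_time`; the helper name and the table names `orders` and `orders-restored` are hypothetical, and the current time is passed in explicitly so the example stays deterministic.

```python
from datetime import datetime, timedelta, timezone

MAX_PITR_WINDOW = timedelta(days=35)  # maximum window, per the text above


def build_table_restore_request(source_table, target_table, restore_time, now):
    """Check the restore time against the PITR window and build API arguments."""
    if source_table == target_table:
        raise ValueError("PITR restores into a new table, not over the source")
    if restore_time > now:
        raise ValueError("restore time is in the future")
    if now - restore_time > MAX_PITR_WINDOW:
        raise ValueError("restore time is outside the 35-day PITR window")
    return {
        "SourceTableName": source_table,
        "TargetTableName": target_table,
        "RestoreDateTime": restore_time,
    }


now = datetime(2024, 5, 31, 12, 0, 0, tzinfo=timezone.utc)
table_request = build_table_restore_request(
    "orders",                 # hypothetical source table
    "orders-restored",        # hypothetical new table
    now - timedelta(days=3),  # three days back, well inside the window
    now,
)

# With AWS credentials configured:
#   import boto3
#   boto3.client("dynamodb").restore_table_to_point_in_time(**table_request)
```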
RTO/RPO Selection Guide:
| Recovery Scenario | RTO | RPO | Recommended Mechanism |
|---|---|---|---|
| "We lost the last 30 minutes of orders" | Hours acceptable | 30 min | RDS/DynamoDB PITR |
| "Last night's batch job corrupted all records" | Hours acceptable | Previous day | RDS snapshot restore |
| "Primary RDS failed, need it back" | Minutes | ~0 (no data loss) | RDS Multi-AZ automatic failover |
| "Entire region failed" | Hours | Hours | Restore from cross-region snapshot |
| "Entire region failed, need recovery in <1 min" | Seconds | ~0 | Aurora Global Database |
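The table can be read as a small decision procedure. The sketch below is one possible encoding of it: the `Failure` categories and the one-hour and one-day thresholds are assumptions chosen to reproduce the five rows, not official AWS guidance.

```python
from enum import Enum

HOUR = 3600  # seconds


class Failure(Enum):
    LOGICAL = "bad write, deleted or corrupted data"
    INSTANCE = "primary instance or Availability Zone failure"
    REGION = "entire region unavailable"


def recommend(failure, rto_seconds, rpo_seconds):
    """Map a failure type plus RTO/RPO targets to a row of the table."""
    if failure is Failure.REGION:
        # Restoring a snapshot in another region takes hours on both axes.
        if rto_seconds < HOUR or rpo_seconds < HOUR:
            return "Aurora Global Database"
        return "Restore from cross-region snapshot"
    if failure is Failure.INSTANCE:
        return "RDS Multi-AZ automatic failover"
    # Logical errors are replicated to standbys, so only backups help.
    if rpo_seconds < 24 * HOUR:
        return "RDS/DynamoDB PITR"
    return "RDS snapshot restore"


choices = [
    recommend(Failure.LOGICAL, 4 * HOUR, 30 * 60),
    recommend(Failure.LOGICAL, 4 * HOUR, 24 * HOUR),
    recommend(Failure.INSTANCE, 120, 0),
    recommend(Failure.REGION, 4 * HOUR, 4 * HOUR),
    recommend(Failure.REGION, 30, 1),
]
```

The `choices` list walks the table top to bottom. Note the comment on logical errors: Multi-AZ replicates a bad write to the standby just as faithfully as a good one, which is why the first two rows point at backup-based mechanisms rather than failover.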
⚠️ Exam Trap: PITR restores are not zero-downtime. During an RDS PITR, the application typically stays pointed at the original database while the restore runs in the background on a new instance. The application keeps serving from the original (potentially with lost or corrupted data still in place) until the restore completes and you cut over. Don't confuse "restore to the point before the mistake" with "resume immediately."
Reflection Question: A company's RPO is 1 hour and RTO is 4 hours for their RDS PostgreSQL database. Their current backup configuration is: automated backups with 7-day retention, Multi-AZ enabled. Does this configuration meet both objectives? What would change their RTO to under 60 seconds?