8.6.2. RTO, RPO, and High Availability Architecture
💡 First Principle: RTO and RPO are not aspirational targets; they are maximum acceptable limits derived from the BIA's analysis of business impact over time. The maximum tolerable downtime (MTD) is the absolute ceiling: once an outage runs past it, the consequences escalate from "significant financial loss" to "existential threat to the organization." Missing the RTO consumes the margin built in beneath that ceiling, so the RTO must be set below the MTD with room for error.
Recovery metric relationships:
MTD (Maximum Tolerable Downtime)
└── RTO (Recovery Time Objective) — must be < MTD
└── WRT (Work Recovery Time) — time to verify and catch up after restore
└── RTO + WRT ≤ MTD
RPO (Recovery Point Objective) — independent of RTO
└── Drives backup frequency and replication architecture
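The relationships above can be checked mechanically. The following is a minimal sketch, assuming that worst-case data loss is roughly equal to the interval between backups; the class name, function name, and all figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTargets:
    mtd_hours: float  # maximum tolerable downtime from the BIA
    rto_hours: float  # time allowed to restore the system
    wrt_hours: float  # time to verify data and catch up after restore
    rpo_hours: float  # maximum acceptable data loss, expressed as time


def check_targets(t: RecoveryTargets, backup_interval_hours: float) -> list[str]:
    """Return a list of violations; an empty list means the targets are consistent."""
    problems = []
    # Restoration plus verification must fit under the MTD ceiling.
    if t.rto_hours + t.wrt_hours > t.mtd_hours:
        problems.append("RTO + WRT exceeds MTD")
    # Assumption: data written since the last backup is lost on failure.
    if backup_interval_hours > t.rpo_hours:
        problems.append("backup interval exceeds RPO")
    return problems


# Hypothetical figures, not taken from any real BIA.
targets = RecoveryTargets(mtd_hours=8, rto_hours=4, wrt_hours=2, rpo_hours=1)
print(check_targets(targets, backup_interval_hours=24))
# ['backup interval exceeds RPO']
```

Note that the two checks are independent: a design can satisfy the time constraint while failing the data-loss constraint, which is why RPO is listed separately from the MTD hierarchy.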
High availability vs. fault tolerance:
| Concept | Goal | Implementation | Downtime |
|---|---|---|---|
| High availability (HA) | Minimize downtime through redundancy | Active-passive clusters, load balancers, failover | Seconds to minutes while failover completes |
| Fault tolerance | Eliminate downtime entirely; survive component failure with zero interruption | Active-active, redundant hardware, RAID, dual power | Zero — service continues without interruption |
HA is more common and cost-effective than fault tolerance. Fault tolerance requires complete redundancy at every layer (dual power supplies, dual network paths, mirrored storage, redundant compute) and is typically reserved for the most critical systems where any downtime is unacceptable.
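To see why redundancy is the main lever for availability, it can help to put rough numbers on it. The sketch below uses the standard formula for redundant components in parallel and assumes that failures are independent and that failover itself takes no time; both assumptions are simplifications, and the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def annual_downtime_minutes(availability: float) -> float:
    """Expected downtime per year for a given availability fraction (e.g. 0.99)."""
    return (1 - availability) * MINUTES_PER_YEAR


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of redundant copies where any single copy is sufficient.

    Assumes independent failures, which real systems only approximate.
    """
    return 1 - (1 - component_availability) ** copies


single = 0.99  # hypothetical availability of one server
print(round(annual_downtime_minutes(single)))                           # 5256
paired = parallel_availability(single, copies=2)
print(round(paired, 6))                                                 # 0.9999
print(round(annual_downtime_minutes(paired), 1))                        # 52.6
```

Under these assumptions, adding one redundant copy cuts expected downtime by two orders of magnitude, which illustrates why each further step toward full fault tolerance costs more for a smaller gain.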
RAID levels for availability:
| RAID | Method | Fault Tolerance | Performance |
|---|---|---|---|
| RAID 0 | Striping (no redundancy) | None — single drive failure = total data loss | Highest read/write |
| RAID 1 | Mirroring | Survives single drive failure | Read improved; write roughly equal to a single drive |
| RAID 5 | Striping with distributed parity | Survives single drive failure | Good read; write penalty for parity |
| RAID 6 | Striping with double parity | Survives two simultaneous drive failures | Good read; higher write penalty |
| RAID 10 | Mirror + stripe | Survives one failure per mirror pair | High read and write |
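The fault-tolerance column comes with a capacity cost that the table does not show. The following sketch computes usable capacity and the number of drive failures each level is guaranteed to survive, assuming equal-size drives. The function name is hypothetical, RAID 1 is modeled as an n-way mirror (the two-drive case matches the table), and RAID 10 reports only the worst case, in which a second failure lands in the same mirror pair.

```python
def raid_summary(level: str, drives: int, drive_tb: float) -> dict:
    """Usable capacity and guaranteed failure tolerance for equal-size drives."""
    rules = {
        # level: (minimum drives, usable drive count, failures always survived)
        "0": (2, drives, 0),
        "1": (2, 1, drives - 1),      # n-way mirror keeps one drive's worth of data
        "5": (3, drives - 1, 1),      # one drive's worth of parity
        "6": (4, drives - 2, 2),      # two drives' worth of parity
        "10": (4, drives // 2, 1),    # worst case: both failures hit one pair
    }
    if level not in rules:
        raise ValueError(f"unsupported RAID level: {level}")
    min_drives, usable, tolerated = rules[level]
    if drives < min_drives:
        raise ValueError(f"RAID {level} needs at least {min_drives} drives")
    if level == "10" and drives % 2:
        raise ValueError("RAID 10 needs an even number of drives")
    return {
        "usable_tb": usable * drive_tb,
        "guaranteed_failures_survived": tolerated,
    }


# Hypothetical array: four 2 TB drives at each level.
for lvl in ("0", "5", "6", "10"):
    print(f"RAID {lvl}:", raid_summary(lvl, drives=4, drive_tb=2.0))
```

With four drives, RAID 6 and RAID 10 give the same usable capacity, but RAID 6 guarantees survival of any two failures while RAID 10 guarantees only one.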
Clustering and failover:
- Active-passive: One node handles all traffic; standby node takes over on failure. Simple but wastes standby resources during normal operations.
- Active-active: Both nodes handle traffic simultaneously; if one fails, the other absorbs the full load. More efficient but requires application support for session sharing or stateless design.
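- Geographic clustering: Nodes in different physical locations protect against site-level disasters. Requires data replication between sites: synchronous replication keeps the RPO near zero but adds write latency that grows with distance, while asynchronous replication tolerates distance at the cost of losing the most recent writes on failover.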
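The active-active description carries a sizing consequence: if the surviving node must absorb the full load, each node has to run with enough headroom during normal operation. A minimal sketch of that check follows, assuming identical nodes and a load measured in requests per second; the function name and figures are hypothetical.

```python
def survives_node_loss(
    node_capacity_rps: float,
    nodes: int,
    peak_load_rps: float,
    failures: int = 1,
) -> bool:
    """True if the remaining nodes can carry peak load after `failures` nodes are lost."""
    remaining = nodes - failures
    if remaining <= 0:
        return False  # nothing left to serve traffic
    return remaining * node_capacity_rps >= peak_load_rps


# Two nodes rated at 1,000 requests/second each (hypothetical figures).
print(survives_node_loss(1000, nodes=2, peak_load_rps=1500))  # False
print(survives_node_loss(1000, nodes=2, peak_load_rps=900))   # True
```

In the first case the pair handles 1,500 requests per second comfortably while both nodes are healthy, yet a single failure overloads the survivor. An active-active pair sized this way offers more throughput but no real failover protection.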
⚠️ Exam Trap: RAID is not a backup. RAID provides availability (system continues operating after a drive failure) but does not protect against data corruption, ransomware, accidental deletion, or site-level disaster. A ransomware infection that encrypts data on a RAID array encrypts the data on all drives simultaneously. RAID and backups serve fundamentally different purposes and both are required.
Reflection Question: An e-commerce company's BIA determines that the order processing system has an MTD of 6 hours and an RPO of 30 minutes. The current architecture uses a single server with nightly full backups to a local NAS device. Identify every gap between the current architecture and the BIA requirements, then choose an RTO that fits within the MTD and design a recovery architecture that meets both that RTO and the RPO.