4.4.2. Continuity of Operations and Capacity Planning
💡 First Principle: Business continuity planning ensures the organization can continue critical functions during and after a disruption. It bridges the gap between the incident and full recovery by defining what's essential, what's acceptable degradation, and how long each system can be unavailable.
Recovery Time Objective (RTO): the maximum acceptable time a system can be down. A payment processing system might have an RTO of 15 minutes; an internal wiki might have an RTO of 48 hours.
Recovery Point Objective (RPO): the maximum acceptable data loss, measured in time. An RPO of 1 hour means you can afford to lose up to 1 hour of data. This drives backup frequency: an RPO of 1 hour requires at least hourly backups.
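The RTO and RPO definitions above can be turned into simple pass/fail checks. The following is a minimal sketch, not a definitive implementation: the `RecoveryObjectives` class, the function names, and the 1-hour RPO assigned to the payment system are assumptions made for illustration (only the 15-minute RTO comes from the text).

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for one system's recovery targets."""
    name: str
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, measured in time


def meets_rpo(objectives: RecoveryObjectives, backup_interval: timedelta) -> bool:
    # Worst case: the failure hits just before the next backup runs,
    # so the data lost equals one full backup interval.
    return backup_interval <= objectives.rpo


def meets_rto(objectives: RecoveryObjectives, restore_time: timedelta) -> bool:
    # The measured (or rehearsed) restore time must fit within the RTO.
    return restore_time <= objectives.rto


# Assumed values: RTO of 15 minutes (from the text), RPO of 1 hour (illustrative).
payments = RecoveryObjectives(
    "payments", rto=timedelta(minutes=15), rpo=timedelta(hours=1)
)

print(meets_rpo(payments, backup_interval=timedelta(hours=1)))   # True
print(meets_rpo(payments, backup_interval=timedelta(hours=4)))   # False
print(meets_rto(payments, restore_time=timedelta(minutes=40)))   # False
```

Keeping the two checks separate mirrors the distinction in the definitions: one compares backup frequency against data-loss tolerance, the other compares restore duration against downtime tolerance.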
Mean Time to Repair (MTTR): the average time to fix a failed component and restore service. Lower MTTR comes from preparation: documented procedures, spare parts on hand, trained staff, and automated failover. A server with hot-swappable drives and a documented replacement procedure has lower MTTR than one requiring a vendor service call.
Mean Time Between Failures (MTBF): the average time between component failures. A higher MTBF means more reliable hardware. MTBF helps predict when components will need replacement and how many spares to stock. Together, MTTR and MTBF determine system availability: Availability = MTBF / (MTBF + MTTR). A system with an MTBF of 10,000 hours and an MTTR of 2 hours has approximately 99.98% availability (10,000 / 10,002).
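The availability formula above is easy to check numerically. This short Python sketch reproduces the 10,000-hour / 2-hour example and converts the result into expected downtime per year; the function names and the 8,760-hour (non-leap) year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction between 0 and 1: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year implied by the availability."""
    return (1 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


a = availability(10_000, 2)
print(f"{a:.2%}")                                                      # 99.98%
print(f"{expected_annual_downtime_hours(10_000, 2):.2f} hours/year")  # 1.75 hours/year
```

Because MTTR appears only in the denominator, reducing repair time raises availability even when the hardware's MTBF stays the same.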
Capacity planning ensures resources meet demand during both normal operations and incidents. Considerations: people (enough trained staff for incident response), technology (sufficient compute/storage/bandwidth for failover), and infrastructure (power, cooling, physical space for recovery operations). Scalability (the ability to grow) and elasticity (the ability to scale up and down dynamically) are capacity concepts most closely associated with cloud environments, and both are tested on the exam.
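One way to reason about the technology side of capacity planning is to ask whether the remaining sites can carry peak demand if any single site is lost. The sketch below is one simplified check under stated assumptions: the function name, the site names, and the capacity figures (in requests per second) are all hypothetical.

```python
def survives_single_site_loss(
    site_capacities: dict[str, float], peak_demand: float
) -> bool:
    """True if the remaining sites can carry peak demand after losing any one site."""
    if len(site_capacities) < 2:
        return False  # no failover target exists
    total = sum(site_capacities.values())
    # Losing the largest site is the worst case for remaining capacity.
    worst_case_remaining = total - max(site_capacities.values())
    return worst_case_remaining >= peak_demand


# Hypothetical capacities in requests per second.
sites = {"primary": 600.0, "secondary": 600.0, "dr": 400.0}

print(survives_single_site_loss(sites, peak_demand=900.0))   # True: 1000 remain
print(survives_single_site_loss(sites, peak_demand=1100.0))  # False
```

The same pattern applies to storage or bandwidth: size the check against the worst-case loss, not the average case.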
⚠️ Exam Trap: RTO is about time to restore. RPO is about data loss tolerance. If a question asks "how much data can you afford to lose?", that's RPO. If it asks "how quickly must systems be back online?", that's RTO.
