8.2.1. SIEM Architecture and Log Management
💡 First Principle: Every BCP/DR metric traces back to a business decision: how much downtime and data loss can the business tolerate before the financial, regulatory, or reputational consequences become unacceptable? These tolerances — not IT capabilities — drive the recovery objectives. IT must then design and fund solutions that meet the business's stated tolerances.
Core BCP/DR metrics:
| Metric | Definition | Who Determines It | Relationship |
|---|---|---|---|
| MTD (Maximum Tolerable Downtime) | The longest the business can survive without a critical function | Business owners | MTD is the ceiling — all recovery plans must complete within it |
| RTO (Recovery Time Objective) | Target time to restore function after disruption | IT + Business | RTO < MTD (RTO is the IT target; MTD is the business limit) |
| RPO (Recovery Point Objective) | Maximum acceptable data loss, measured in time | Business owners | RPO drives backup frequency (RPO = 4 hours → back up at least every 4 hours) |
| MTTR (Mean Time To Repair) | Average time to repair a failed component | IT Operations | Operational metric; an input to RTO planning |
| MTBF (Mean Time Between Failures) | Average time between component failures | Vendor / IT Operations | Reliability metric; informs redundancy decisions |
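The RPO row can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes backups run at a fixed interval and that backup duration is negligible, so the worst-case data loss is roughly one full interval.

```python
import math


def worst_case_data_loss_hours(backup_interval_hours: float) -> float:
    """Worst case for periodic backups: the failure hits just before the
    next backup, so about one full interval of data is lost.
    Assumption: backup duration itself is ignored."""
    if backup_interval_hours <= 0:
        raise ValueError("backup interval must be positive")
    return backup_interval_hours


def meets_rpo(rpo_hours: float, backup_interval_hours: float) -> bool:
    """True when the worst-case data loss stays within the RPO."""
    return worst_case_data_loss_hours(backup_interval_hours) <= rpo_hours


def min_backups_per_day(rpo_hours: float) -> int:
    """Fewest evenly spaced backups per day that keep loss within the RPO."""
    if rpo_hours <= 0:
        raise ValueError("RPO must be positive")
    return math.ceil(24 / rpo_hours)


# RPO of 4 hours: a 4-hour interval is just enough, a daily backup is not.
print(meets_rpo(rpo_hours=4, backup_interval_hours=4))
print(meets_rpo(rpo_hours=4, backup_interval_hours=24))
print(min_backups_per_day(rpo_hours=4))
```

Very small RPOs quickly push the required backup count past what scheduled backups can reasonably deliver, which is where replication usually enters the design discussion.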
The critical inequality: RTO < MTD. If a business's MTD for online ordering is 4 hours (after 4 hours, customers go to competitors permanently) and the current RTO is 6 hours (the current DR plan takes 6 hours to restore service), the organization has a 2-hour gap. Either the DR plan must be improved or the business must formally accept the higher risk; this is a governance decision, not a technical one.
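The online-ordering example can be expressed as a simple check. This is a minimal sketch; `RtoCheck` and `check_rto` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RtoCheck:
    meets_mtd: bool          # True only when RTO is strictly below MTD
    shortfall_hours: float   # hours the RTO must drop just to reach the MTD


def check_rto(mtd_hours: float, rto_hours: float) -> RtoCheck:
    """Compare the IT recovery target (RTO) with the business limit (MTD)."""
    if mtd_hours <= 0 or rto_hours <= 0:
        raise ValueError("MTD and RTO must be positive")
    shortfall = max(0.0, float(rto_hours) - float(mtd_hours))
    return RtoCheck(meets_mtd=rto_hours < mtd_hours, shortfall_hours=shortfall)


# Online ordering from the text: MTD is 4 hours, current RTO is 6 hours.
online_ordering = check_rto(mtd_hours=4, rto_hours=6)
print(online_ordering.meets_mtd, online_ordering.shortfall_hours)
```

Note that an RTO exactly equal to the MTD also fails the check: the inequality is strict, so a plan with zero margin is treated as a gap.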
Business Impact Analysis (BIA): The BIA is the foundational input to BCP/DR planning. It identifies:
- Critical business functions (which processes are essential to survival?)
- Dependencies between functions (which IT systems support which processes?)
- Financial impact over time (what is the cost per hour of downtime for each function?)
- MTD, RTO, and RPO for each critical function (business owners define these)
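One way to picture a BIA's output is as a record per critical function. The sketch below is an assumption-laden illustration: the `BiaEntry` fields, the sample functions, and all dollar figures are hypothetical, and downtime cost is modeled as linear in time even though real impact often escalates as an outage continues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiaEntry:
    function: str               # critical business function
    supporting_systems: tuple   # IT systems the function depends on
    cost_per_hour: float        # estimated financial impact per hour of downtime
    mtd_hours: float            # defined by the business owner
    rto_hours: float
    rpo_hours: float


def downtime_cost(entry: BiaEntry, outage_hours: float) -> float:
    """Estimated cost of an outage (simplifying assumption: linear in time)."""
    return entry.cost_per_hour * outage_hours


def recovery_priority(entries):
    """Order functions by tightest MTD first, then by highest hourly cost."""
    return sorted(entries, key=lambda e: (e.mtd_hours, -e.cost_per_hour))


# Hypothetical sample data, for illustration only.
entries = [
    BiaEntry("Payroll", ("HR database",), 500.0, 72, 48, 24),
    BiaEntry("Online ordering", ("Web storefront", "Payment gateway"),
             20000.0, 4, 2, 0.25),
]

for entry in recovery_priority(entries):
    print(entry.function, downtime_cost(entry, entry.mtd_hours))
```

Sorting by MTD first reflects the idea that business tolerances, not IT convenience, set the recovery order.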
Recovery site strategies:
| Site Type | Description | RTO | Cost | Appropriate For |
|---|---|---|---|---|
| Hot site | Fully operational mirror; data replicated in real-time or near-real-time | Minutes to hours | Very high | Critical financial systems; healthcare; safety |
| Warm site | Infrastructure ready; software installed; data must be restored from backup | Hours to days | Medium | Most business-critical applications |
| Cold site | Space and power available; no equipment or data | Days to weeks | Low | Non-critical functions; cost-constrained orgs |
| Cloud DR | Cloud infrastructure spun up from templates on demand | Minutes to hours | Medium (pay-per-use) | Flexible; increasingly common |
| Reciprocal agreement | Two organizations agree to host each other in disaster | Varies | Low | Small orgs; last resort (often unreliable) |
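The table can be read as a filter: given the RTO a function requires, which site types can plausibly meet it? The sketch below is illustrative only. The table gives qualitative ranges, so the numeric worst-case RTO values and cost ranks here are assumptions chosen for the example, not authoritative figures. Reciprocal agreements are left out because their RTO is listed as "Varies".

```python
# (site type, ASSUMED worst-case RTO in hours, ASSUMED cost rank: 1 = cheapest)
# The numbers are illustrative stand-ins for the qualitative ranges in the table.
SITE_OPTIONS = [
    ("Cold site", 14 * 24, 1),   # "days to weeks"
    ("Warm site", 3 * 24, 2),    # "hours to days"
    ("Cloud DR", 8, 2),          # "minutes to hours"
    ("Hot site", 4, 3),          # "minutes to hours"
]


def candidate_sites(required_rto_hours: float) -> list:
    """Return site types whose assumed worst-case RTO is below the
    required RTO, cheapest first (ties broken by faster recovery)."""
    if required_rto_hours <= 0:
        raise ValueError("required RTO must be positive")
    fits = [s for s in SITE_OPTIONS if s[1] < required_rto_hours]
    fits.sort(key=lambda s: (s[2], s[1]))
    return [name for name, _, _ in fits]


print(candidate_sites(6))
print(candidate_sites(24))
print(candidate_sites(1))
```

An empty result signals that none of the assumed options meets the requirement, which sends the decision back to the business: fund a faster solution or revisit the stated tolerance.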
BCP/DR testing types — test from least to most disruptive:
| Test Type | Description | Disruption | Value |
|---|---|---|---|
| Checklist review | Review the plan document for completeness | None | Low: finds only documentation gaps |
| Tabletop exercise | Walk through a scenario verbally | None | Medium: finds procedural gaps |
| Parallel test | Bring up DR systems while production stays online | Low | High: confirms DR systems work |
| Simulation | Realistic scenario exercise with actual team actions | Low-Medium | High: finds coordination gaps |
| Full interruption test | Switch completely to DR; production offline | High | Highest: most realistic, but carries significant risk |
Organizations should graduate from checklist reviews to tabletop exercises to parallel tests as the plan matures. Full interruption tests should be rare given the operational risk.
⚠️ Exam Trap: MTD > RTO is required — but this is frequently stated backwards in exam distractors. Remember: MTD is the business limit (the ceiling); RTO is the IT target (must be lower than the ceiling). An RTO that exceeds MTD means IT cannot recover fast enough to prevent unacceptable business impact — this is a gap that must be addressed.
Reflection Question: An organization's BIA identifies that its e-commerce platform has an MTD of 2 hours and an RPO of 15 minutes. The current DR solution is a warm site with daily backups. Without additional detail, what two gaps in the current DR solution does this BIA reveal, and what technical changes would address each gap?