8.6.3. DR Plan Testing Methods
💡 First Principle: A DR plan that has never been tested is a document, not a capability. Testing converts assumptions into evidence: "we assume we can recover the database in 2 hours" becomes "we demonstrated recovery in 3 hours and 15 minutes — which exceeds our RTO." Without testing, the first time the DR plan is executed under real conditions is during an actual disaster, when stakes are highest and improvisation is most dangerous.
DR test types — progressive complexity:
| Test Type | Description | What It Validates | Disruption |
|---|---|---|---|
| Checklist (desk check) | Team reviews plan documents against current environment | Plan completeness; accuracy of contact lists, system inventory, procedure documentation | None |
| Tabletop exercise | Team walks through a disaster scenario verbally; no systems involved | Decision-making under stress; role clarity; communication flow; gap identification | None |
| Walkthrough/simulation | Team performs recovery procedures in a test environment | Technical procedure accuracy; tool availability; staff competency | Minimal |
| Parallel test | Recovery environment activated alongside production | Full end-to-end recovery capability without risking production | Low — production unaffected |
| Full interruption test | Production shut down; all operations shift to recovery systems | Actual recovery time and data loss under real conditions, measured against RTO and RPO; true failover capability | High; production at risk if recovery fails |
Testing progression strategy:
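Organizations should work up through the test types progressively, moving to a more disruptive test only after the less disruptive ones pass cleanly. Reasonable baselines: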
- Annual minimum: Tabletop exercise for all critical systems; parallel test for highest-criticality systems.
- After major changes: Any significant infrastructure change (cloud migration, data center move, application upgrade) should trigger at minimum a walkthrough test of affected DR procedures.
- Full interruption: Performed only by mature organizations with high confidence in their DR capability and executive willingness to accept the risk of production impact.
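These baselines can be read as a small decision rule. The following is a minimal sketch rather than a prescribed policy: the `DrTestType` enum, the `minimum_test` function, and its two boolean inputs are hypothetical names introduced for illustration. It treats the test types as ordered by disruption and returns the least disruptive type that satisfies every applicable baseline.

```python
from enum import IntEnum


class DrTestType(IntEnum):
    """DR test types, ordered from least to most disruptive."""
    CHECKLIST = 1
    TABLETOP = 2
    WALKTHROUGH = 3
    PARALLEL = 4
    FULL_INTERRUPTION = 5


def minimum_test(highest_criticality: bool, major_change: bool) -> DrTestType:
    """Return the least disruptive test type that meets the baselines."""
    # Annual baseline: every critical system gets at least a tabletop exercise.
    required = DrTestType.TABLETOP
    # A major infrastructure change calls for at least a walkthrough test.
    if major_change:
        required = max(required, DrTestType.WALKTHROUGH)
    # Highest-criticality systems get a parallel test.
    if highest_criticality:
        required = max(required, DrTestType.PARALLEL)
    # FULL_INTERRUPTION is never returned automatically: it requires
    # explicit executive acceptance of the production risk.
    return required


print(minimum_test(highest_criticality=False, major_change=True).name)
# prints "WALKTHROUGH"
```

The function returns a floor, not a ceiling. An organization can always choose a more rigorous test than the one returned.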
Test output analysis — the metrics that matter:
| Metric | Compare Against | Action If Gap Found |
|---|---|---|
| Actual recovery time | Documented RTO | If actual > RTO, redesign recovery architecture or request RTO adjustment |
| Data currency at recovery | Documented RPO | If data loss > RPO, increase backup frequency or implement replication |
| Procedure failures | Total procedure steps executed | Update documentation; retrain staff; add automation |
| Undocumented dependencies | Dependency map in DR plan | Add to plan; verify recovery procedure for each dependency |
| Communication gaps | Notification SLAs | Update contact lists; test out-of-band communication channels |
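The first two rows of the table reduce to simple comparisons of measured values against documented objectives. The sketch below illustrates that check under stated assumptions: `DrTestResult` and `find_gaps` are hypothetical names, the 2-hour RTO is assumed to match the recovery estimate in the opening example, and the RPO and data-loss figures are invented for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class DrTestResult:
    """Measurements from one DR test alongside the documented objectives."""
    system: str
    rto: timedelta            # documented recovery time objective
    rpo: timedelta            # documented recovery point objective
    recovery_time: timedelta  # measured time to restore service
    data_loss: timedelta      # measured age of the newest recovered data


def find_gaps(result: DrTestResult) -> List[str]:
    """Return one finding for each objective the test failed to meet."""
    gaps = []
    if result.recovery_time > result.rto:
        overrun = result.recovery_time - result.rto
        gaps.append(f"{result.system}: recovery time exceeded RTO by {overrun}")
    if result.data_loss > result.rpo:
        overrun = result.data_loss - result.rpo
        gaps.append(f"{result.system}: data loss exceeded RPO by {overrun}")
    return gaps


example = DrTestResult(
    system="database",
    rto=timedelta(hours=2),                        # assumed objective
    rpo=timedelta(minutes=15),                     # hypothetical objective
    recovery_time=timedelta(hours=3, minutes=15),  # from the opening example
    data_loss=timedelta(minutes=10),               # hypothetical measurement
)
for gap in find_gaps(example):
    print(gap)
# prints "database: recovery time exceeded RTO by 1:15:00"
```

Each finding maps to the action column above: an RTO overrun points to architecture redesign or an RTO adjustment, and an RPO overrun points to more frequent backups or replication. The remaining rows (procedure failures, undocumented dependencies, communication gaps) are qualitative and are better captured as written observations than as numeric comparisons.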
After every test: Document results, compare metrics against BIA requirements, update the DR plan to reflect findings, assign remediation items for gaps, and schedule the next test. The DR plan is a living document — every test should produce updates.
⚠️ Exam Trap: An organization that only performs checklist reviews of its DR plan has never tested whether recovery actually works. The exam distinguishes between plan review (checklist), decision testing (tabletop), procedure testing (walkthrough), capability testing (parallel), and actual failover testing (full interruption). A checklist review, while better than nothing, provides the lowest assurance level.
Reflection Question: A hospital's DR plan was last tested two years ago using a tabletop exercise. Since then, the organization migrated its EHR system to a cloud provider, replaced its on-premises SAN with cloud storage, and implemented a new backup solution. The tabletop results from two years ago showed an estimated recovery time of 3 hours for the EHR system. What is the current validity of that 3-hour estimate, what test type should be conducted now and why, and what specific metrics should the test capture?