3.1.1. Response Plans and Runbooks
First Principle: A runbook transforms incident response from an improvised reaction into a repeatable, auditable process — ensuring that every responder follows the same steps regardless of experience level or stress.
What is a Runbook vs. a Playbook?
- Runbook: Step-by-step procedures for a specific incident type (e.g., "Compromised EC2 Instance Response"). Detailed enough that a junior engineer can execute it at 2 AM.
- Playbook: A higher-level strategy document that describes the overall approach to an incident category (e.g., "Data Breach Response Strategy"). It guides decision-making rather than listing individual commands.
AWS Tools for IR Plans:
Systems Manager OpsCenter centralizes operational issues:
- Aggregates findings from multiple sources into OpsItems
- Tracks remediation progress and assigns owners
- Integrates with runbooks (SSM Automation documents) for guided response
- Provides operational dashboards for incident management
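As a rough illustration of how a finding could become a tracked OpsItem, the sketch below wraps the SSM `create_ops_item` call. It assumes `ssm` behaves like `boto3.client("ssm")`. The function name, the `security-ir` source label, and the `findingId` and `runbook` keys are hypothetical choices for this example, not names AWS requires.

```python
def open_security_opsitem(ssm, title, description, finding_id, runbook_name):
    """Create an OpsItem for a security finding and return its ID.

    `ssm` is assumed to behave like boto3.client("ssm").
    """
    response = ssm.create_ops_item(
        Title=title,
        Description=description,
        Source="security-ir",  # hypothetical source label
        Category="Security",
        Severity="2",
        OperationalData={
            # SearchableString lets responders look the OpsItem up by finding ID.
            "findingId": {"Value": finding_id, "Type": "SearchableString"},
            # Hypothetical key recording which runbook the responder should run.
            "runbook": {"Value": runbook_name, "Type": "String"},
        },
    )
    return response["OpsItemId"]
```

In real use the caller would pass `boto3.client("ssm")`. Taking the client as a parameter keeps the function easy to exercise with a stub before it touches a live account.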
Amazon SageMaker AI Notebooks (new in C03) support forensic analysis:
- Pre-configured Jupyter notebooks for analyzing CloudTrail, VPC Flow Logs, and Security Lake data
- Enable data science approaches to incident investigation (pattern analysis, anomaly detection)
- Shareable analysis workflows that document investigation methodology
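The kind of pattern analysis a notebook cell might perform can be sketched with the standard library alone. The example assumes CloudTrail records have already been loaded as dictionaries with the standard `eventName`, `sourceIPAddress`, and `errorCode` fields. The function name and the sample records are invented for illustration.

```python
from collections import Counter

def summarize_cloudtrail(records):
    """Count calls per (source IP, event name) and collect denied calls."""
    calls = Counter()
    denied = []
    for rec in records:
        calls[(rec.get("sourceIPAddress"), rec.get("eventName"))] += 1
        # Repeated authorization failures often indicate probing.
        if rec.get("errorCode") in ("AccessDenied", "UnauthorizedOperation"):
            denied.append(rec)
    return calls, denied

# Made-up records using documentation-reserved IP ranges.
SAMPLE_RECORDS = [
    {"eventName": "ListBuckets", "sourceIPAddress": "198.51.100.7"},
    {"eventName": "ListBuckets", "sourceIPAddress": "198.51.100.7"},
    {"eventName": "CreateUser", "sourceIPAddress": "198.51.100.7",
     "errorCode": "AccessDenied"},
    {"eventName": "DescribeInstances", "sourceIPAddress": "203.0.113.10"},
]

calls, denied = summarize_cloudtrail(SAMPLE_RECORDS)
print(calls.most_common(1))
print(len(denied), "denied call(s)")
```

Saving a cell like this with the notebook records both the result and the method used to reach it, which is what makes the workflow reviewable afterwards.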
Runbook Structure for Common Incident Types:
| Incident Type | Key Runbook Steps |
|---|---|
| Compromised credentials | Disable key → Check CloudTrail for usage → Revoke sessions → Assess blast radius → Rotate → Monitor |
| Compromised EC2 | Isolate (SG swap) → Snapshot EBS → Capture memory → Investigate → Terminate → Harden |
| S3 data exposure | Block public access → Check access logs → Identify data scope → Notify stakeholders → Remediate policy |
| Unauthorized IAM user | Disable user (deactivate access keys, remove console password) → Review CloudTrail → Revoke sessions → Delete user → Audit creation path |
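One way to keep step order explicit and checkable is to store each runbook from the table as an ordered list. This is only a sketch: the step identifiers and the `comes_before` helper are hypothetical names chosen for this example.

```python
# Each runbook from the table as an ordered list of step identifiers.
RUNBOOKS = {
    "compromised_credentials": [
        "disable_key", "check_cloudtrail", "revoke_sessions",
        "assess_blast_radius", "rotate", "monitor",
    ],
    "compromised_ec2": [
        "isolate", "snapshot_ebs", "capture_memory",
        "investigate", "terminate", "harden",
    ],
    "s3_data_exposure": [
        "block_public_access", "check_access_logs", "identify_data_scope",
        "notify_stakeholders", "remediate_policy",
    ],
    "unauthorized_iam_user": [
        "disable_user", "review_cloudtrail", "revoke_sessions",
        "delete_user", "audit_creation_path",
    ],
}

def comes_before(runbook, first, second):
    """Return True if step `first` precedes step `second` in the runbook."""
    steps = RUNBOOKS[runbook]
    return steps.index(first) < steps.index(second)

# Evidence capture must precede termination for a compromised instance.
print(comes_before("compromised_ec2", "snapshot_ebs", "terminate"))
```

With the order stored as data, an automated check can reject a runbook edit that moves a destructive step ahead of evidence capture.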
⚠️ Exam Trap: The exam expects you to know the correct ORDER of response steps. For compromised EC2: isolate FIRST, then capture evidence, THEN investigate. Terminating before capturing evidence destroys forensic data.
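The ordering can be sketched in code: isolate first, then preserve disk evidence, and never terminate in the same step. The sketch assumes `ec2` behaves like `boto3.client("ec2")` and that an isolation security group (one with no inbound or outbound rules) already exists. The function name and description text are hypothetical. Memory capture is omitted because it relies on tooling running on the host.

```python
def contain_and_preserve(ec2, instance_id, isolation_sg_id):
    """Isolate an instance, then snapshot its EBS volumes.

    `ec2` is assumed to behave like boto3.client("ec2").
    Deliberately does NOT terminate the instance.
    """
    # Step 1: isolate by replacing all security groups with the isolation group.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[isolation_sg_id])

    # Step 2: preserve disk evidence before any investigation or cleanup.
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    snapshot_ids = []
    for mapping in instance.get("BlockDeviceMappings", []):
        ebs = mapping.get("Ebs")
        if not ebs:
            continue  # skip mappings that are not EBS-backed
        snapshot = ec2.create_snapshot(
            VolumeId=ebs["VolumeId"],
            Description=f"IR evidence for {instance_id}",
        )
        snapshot_ids.append(snapshot["SnapshotId"])
    return snapshot_ids
```

Termination is left to a separate, later step so that it cannot run before the snapshots exist.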
Scenario: You're designing a runbook for compromised IAM access keys. The first automated step disables the key, the second queries CloudTrail for all API calls made with the key, the third revokes all active sessions for the associated IAM user, and the fourth generates a blast radius report.
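The four steps in this scenario might look like the following sketch. It assumes `iam` and `cloudtrail` behave like the corresponding boto3 clients. Session revocation is implemented here as an inline deny policy conditioned on `aws:TokenIssueTime`, which is one common approach and an assumption of this example. The function name, the policy name, and the report shape are hypothetical.

```python
import json
from collections import Counter
from datetime import datetime, timezone

def respond_to_compromised_key(iam, cloudtrail, user_name, access_key_id):
    """Run the four runbook steps in order and return a simple report."""
    # Step 1: disable the key so it cannot be used again.
    iam.update_access_key(
        UserName=user_name, AccessKeyId=access_key_id, Status="Inactive"
    )

    # Step 2: find API calls made with the key (first page of results only).
    events = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "AccessKeyId", "AttributeValue": access_key_id}
        ]
    )["Events"]

    # Step 3: deny requests made with temporary credentials issued before now.
    cutoff = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    deny_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {"DateLessThan": {"aws:TokenIssueTime": cutoff}},
        }],
    }
    iam.put_user_policy(
        UserName=user_name,
        PolicyName="RevokeOlderSessions",  # hypothetical policy name
        PolicyDocument=json.dumps(deny_policy),
    )

    # Step 4: blast radius report, here just a count of calls per API action.
    return dict(Counter(event["EventName"] for event in events))
```

A production version would also follow `NextToken` to page through all results, and would query a trail's log storage if activity beyond the 90-day event history lookup window matters.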
Reflection Question: Why must credential revocation happen before investigation begins, and what information might you lose if you investigate before revoking?