2.3.3. Lambda-Based Remediation Patterns
š” First Principle: When remediation logic is too complex for a predefined runbook but too specialized to justify building a full service, Lambda fills the gap. Lambda lets you write custom remediation code in minutes without managing infrastructure, triggered automatically by CloudWatch, EventBridge, Config, or SNS.
Lambda's role in operational automation is custom logic execution. It's not a replacement for Systems Manager Automation (which has richer workflow capabilities) or for EventBridge (which handles routing). Lambda is the "do arbitrary code" step in a remediation pipeline.
Common Lambda Remediation Patterns:
| Pattern | Trigger | Lambda Action |
|---|---|---|
| Tag enforcement | Config rule violation | Add missing required tags to resources |
| Security group remediation | EventBridge (Config change) | Remove overly permissive inbound rules |
| Unused resource cleanup | EventBridge (scheduled) | Stop idle EC2 instances, delete unattached EBS volumes |
| SNS message processing | SNS subscription | Parse alert, create JIRA ticket, send Slack message |
| CloudWatch alarm response | CloudWatch alarm ā SNS ā Lambda | Restart application, flush cache, scale resource |
Lambda Execution Role: The Lambda function needs an IAM execution role with exactly the permissions required for its remediation task. Follow least privilege:
- If Lambda remediates EC2, grant only
ec2:StopInstances,ec2:StartInstances - Don't grant
AdministratorAccessto a Lambda function ā that's a security risk waiting to become an incident
Idempotency: Good remediation functions are idempotent ā running them twice has the same result as running them once. This matters because event-driven systems can deliver duplicate events. If your Lambda is "delete unattached volumes," make sure it doesn't fail if the volume was already deleted.
ā ļø Exam Trap: Lambda has a maximum execution timeout of 15 minutes. If a remediation task takes longer (e.g., a large data migration), Lambda is the wrong choice. Use Step Functions to orchestrate long-running workflows, or Systems Manager Automation with a long timeout.
Reflection Question: Config detects that an EC2 security group has inbound port 22 (SSH) open to 0.0.0.0/0. Design a remediation that: (1) removes the rule, (2) creates a CloudWatch event, and (3) notifies the security team. Which services does each step use?