3.1.3.3. Recovery Procedures

A DR plan without documented, tested procedures is a hope, not a strategy. Recovery procedures must be automated, versioned, and rehearsed.

Automated recovery playbook (SSM Automation):

schemaVersion: '0.3'
description: 'DR Failover to us-west-2'
mainSteps:
  # Promote the cross-region read replica to a standalone, writable instance
  - name: PromoteRDSReplica
    action: aws:executeAwsApi
    inputs:
      Service: rds
      Api: PromoteReadReplica
      DBInstanceIdentifier: dr-replica
  # Block until the promoted instance reports 'available'
  - name: WaitForDBAvailable
    action: aws:waitForAwsResourceProperty
    inputs:
      Service: rds
      Api: DescribeDBInstances
      DBInstanceIdentifier: dr-replica
      PropertySelector: '$.DBInstances[0].DBInstanceStatus'
      DesiredValues: ['available']
  # Scale the DR web tier up to production capacity
  - name: ScaleUpASG
    action: aws:executeAwsApi
    inputs:
      Service: autoscaling
      Api: UpdateAutoScalingGroup
      AutoScalingGroupName: dr-web-asg
      MinSize: 4
      DesiredCapacity: 8
  # Point DNS at the DR region
  - name: UpdateDNS
    action: aws:executeAwsApi
    inputs:
      Service: route53
      Api: ChangeResourceRecordSets
      # Switch to DR region endpoint (this API also requires HostedZoneId and ChangeBatch inputs)

Recovery procedure elements:

Detection: CloudWatch alarms, Route 53 health checks, or AWS Health events trigger the procedure
Decision: Automated (Route 53 failover) or manual approval (SSM Automation approval step)
Execution: Promote DB, scale compute, update routing
Validation: Synthetic tests confirm the DR environment is serving traffic correctly
Communication: SNS notifications to the operations team at each step
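
The Decision and Communication elements can be sketched as extra steps for the playbook above. This is a minimal illustration rather than a complete runbook: the step names, the account ID, the SNS topic ARNs, and the approver role ARN are hypothetical placeholders, and the approval step is assumed to run before PromoteRDSReplica.

```yaml
# Hypothetical steps; every ARN below is a placeholder, not a value from the playbook.
mainSteps:
  # Decision: pause the automation until an approver signs off
  - name: ApproveFailover
    action: aws:approve
    timeoutSeconds: 900            # fail the step if nobody approves within 15 minutes
    onFailure: Abort
    inputs:
      NotificationArn: arn:aws:sns:us-west-2:111122223333:dr-approvals
      Message: 'Approve DR failover to us-west-2?'
      MinRequiredApprovals: 1
      Approvers:
        - arn:aws:iam::111122223333:role/DROperator
  # Communication: tell the operations team that execution is starting
  - name: NotifyFailoverStarted
    action: aws:executeAwsApi
    inputs:
      Service: sns
      Api: Publish
      TopicArn: arn:aws:sns:us-west-2:111122223333:dr-notifications
      Message: 'DR failover approved; promoting dr-replica.'
```

A similar Publish step could follow each execution step so the operations team sees progress as the failover proceeds.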

Exam Trap: Automated failover can cause "split-brain" if the primary region recovers while the DR region is active: both regions may accept writes, causing data conflicts. To reduce the risk, configure the Route 53 failover health check with a sufficient failure threshold (3 or more consecutive failed checks) so that a transient blip does not trigger failover, and always have a clear failback procedure that resolves any data divergence before the primary region accepts writes again.
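
One way to express the health check threshold and failover routing is a CloudFormation template fragment. This is a sketch under assumptions: CloudFormation is chosen here only for illustration, and the hosted zone, domain names, health check path, and resource names are all hypothetical.

```yaml
# Hypothetical names throughout (example.com, primary.example.com, dr.example.com, /health).
Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: primary.example.com
        Port: 443
        ResourcePath: /health
        RequestInterval: 30        # seconds between checks
        FailureThreshold: 3        # consecutive results needed to change health status
  PrimaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: app.example.com.
      Type: CNAME
      TTL: '60'
      SetIdentifier: primary
      Failover: PRIMARY
      HealthCheckId: !Ref PrimaryHealthCheck
      ResourceRecords:
        - primary.example.com
  SecondaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: app.example.com.
      Type: CNAME
      TTL: '60'
      SetIdentifier: secondary
      Failover: SECONDARY
      ResourceRecords:
        - dr.example.com
```

The same threshold governs recovery: once the primary endpoint passes three consecutive checks, Route 53 resumes answering with the primary record. If failback has to wait for data reconciliation, the failback procedure needs to account for that automatic switch.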

Written by Alvin Varughese • Founder • 15 professional certifications