AWS-DOP-C02 & AWS CERTIFICATION | Replication & Failover Methods for Stateful Services - AWS Certified DevOps Engineer

3.1.1.3. Replication & Failover Methods for Stateful Services

First Principle: Replication and failover are fundamental strategies to prevent data loss and minimize downtime when infrastructure components fail, protecting critical data in stateful services.

Stateful services hold data that cannot be recreated by launching a fresh instance, so a highly available architecture must keep that data both safe and reachable when a component fails. This applies the principle of resilience.

Replication: (Creating and maintaining multiple data copies.)

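Synchronous: A write is acknowledged only after every replica has persisted it; strong consistency and no loss of acknowledged writes (RPO = 0), at the cost of higher write latency.
Asynchronous: The primary acknowledges a write first, then propagates it to replicas; lower write latency, but writes not yet replicated when the primary fails can be lost (RPO > 0, bounded by replication lag).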
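The difference between the two modes can be sketched with a small in-memory model. This is an illustration only, not how any AWS service is implemented; the classes `Replica`, `SyncPrimary`, `AsyncPrimary` and the helper `lost_on_failover` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """In-memory stand-in for a database node (hypothetical)."""
    log: list = field(default_factory=list)


class SyncPrimary:
    """Acknowledges a write only after every replica has stored it."""

    def __init__(self, replicas):
        self.log = []
        self.replicas = replicas

    def write(self, record):
        self.log.append(record)
        for replica in self.replicas:
            replica.log.append(record)  # caller waits here: extra latency
        return "ack"


class AsyncPrimary:
    """Acknowledges immediately and ships records to replicas later."""

    def __init__(self, replicas):
        self.log = []
        self.replicas = replicas
        self.pending = []  # acknowledged but not yet replicated

    def write(self, record):
        self.log.append(record)
        self.pending.append(record)
        return "ack"

    def replicate(self):
        """Drain the backlog, as a background shipper would."""
        for record in self.pending:
            for replica in self.replicas:
                replica.log.append(record)
        self.pending.clear()


def lost_on_failover(primary, replica):
    """Records the primary acknowledged that the replica never received."""
    return [r for r in primary.log if r not in replica.log]


# Demo: three writes; the primary "fails" before the async backlog drains.
sync_replica, async_replica = Replica(), Replica()
sync_primary = SyncPrimary([sync_replica])
async_primary = AsyncPrimary([async_replica])

for order in ["order-1", "order-2", "order-3"]:
    sync_primary.write(order)
    async_primary.write(order)
    if order == "order-2":
        async_primary.replicate()  # shipper catches up after the second write

sync_lost = lost_on_failover(sync_primary, sync_replica)
async_lost = lost_on_failover(async_primary, async_replica)
```

In this run the synchronous replica is missing nothing, while the asynchronous replica is missing the one write that was acknowledged after the last replication pass. That gap is what RPO > 0 means in practice.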

Failover: (Switching to a standby system upon primary failure.)

Automatic: System detects failure, promotes replica without manual intervention, minimizing RTO.
Manual: Requires human action, often for planned maintenance or complex scenarios.
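A minimal sketch of both failover styles, assuming failure is detected by counting consecutive failed health checks. The class `FailoverMonitor`, its `failure_threshold` parameter, and the node names are hypothetical; managed services such as Amazon RDS run their own detection logic, which this does not reproduce.

```python
class FailoverMonitor:
    """Tracks health checks of the active node and promotes the standby."""

    def __init__(self, primary, standby, failure_threshold=3):
        self.active = primary
        self.standby = standby
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def observe(self, healthy):
        """Record one health check; return the node now serving traffic."""
        if healthy:
            self.consecutive_failures = 0  # a success resets the count
            return self.active
        self.consecutive_failures += 1
        if (self.consecutive_failures >= self.failure_threshold
                and self.standby is not None):
            self.promote_standby()  # automatic failover: no human involved
        return self.active

    def promote_standby(self):
        """Make the standby active. An operator calling this directly
        corresponds to a manual failover."""
        self.active, self.standby = self.standby, None
        self.consecutive_failures = 0


# Demo: one transient blip, then a sustained outage.
monitor = FailoverMonitor("db-primary", "db-standby")
checks = [True, False, False, True, False, False, False]
serving = [monitor.observe(result) for result in checks]
```

The two early failures do not trigger a promotion because a successful check resets the counter; only the three consecutive failures at the end do. A higher threshold avoids failing over on transient blips but lengthens detection time, and therefore RTO.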

Key Replication & Failover Methods:

Amazon RDS Multi-AZ: Synchronous replication to a standby in a different Availability Zone of the same Region, with automatic failover (RPO = 0).
Amazon DynamoDB Global Tables: Multi-Region, active-active asynchronous replication.
Amazon S3 Cross-Region Replication (CRR): Asynchronous replication of objects between S3 buckets.
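As a rough sketch, the three methods above map onto request parameters like the following, which could be passed to the boto3 calls `create_db_instance` (RDS), `update_table` (DynamoDB) and `put_bucket_replication` (S3). The functions only build dictionaries, so nothing here calls AWS. All resource names, the instance class, the storage size and the IAM role ARN are placeholder assumptions; check the current API references before relying on the exact field set.

```python
def rds_multi_az_request(db_id):
    """Parameters for rds.create_db_instance(**request)."""
    return {
        "DBInstanceIdentifier": db_id,
        "Engine": "postgres",              # placeholder engine
        "DBInstanceClass": "db.m5.large",  # placeholder size
        "AllocatedStorage": 100,           # GiB, placeholder
        "MasterUsername": "dbadmin",       # placeholder
        "ManageMasterUserPassword": True,  # let RDS store the secret
        "MultiAZ": True,                   # synchronous standby in another AZ
    }


def add_replica_request(table_name, region):
    """Parameters for dynamodb.update_table(**request).

    Adds one replica Region to an existing table, turning it into
    (or extending) a global table.
    """
    return {
        "TableName": table_name,
        "ReplicaUpdates": [{"Create": {"RegionName": region}}],
    }


def crr_request(source_bucket, dest_bucket, role_arn):
    """Parameters for s3.put_bucket_replication(**request).

    Versioning must already be enabled on both buckets.
    """
    return {
        "Bucket": source_bucket,
        "ReplicationConfiguration": {
            "Role": role_arn,
            "Rules": [{
                "ID": "replicate-everything",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
            }],
        },
    }


# Example requests with placeholder names.
rds_request = rds_multi_az_request("orders-db")
ddb_request = add_replica_request("sessions", "eu-west-1")
s3_request = crr_request(
    "primary-bucket",
    "backup-bucket",
    "arn:aws:iam::123456789012:role/example-replication-role",
)
```

With real credentials, each dictionary would be unpacked into the matching client call, for example `boto3.client("rds").create_db_instance(**rds_request)`.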

Scenario: A DevOps team manages a critical relational database on Amazon RDS that needs strong consistency and automatic failover within a single Region. They also manage a DynamoDB table that must be globally available, with active-active replication across multiple Regions.

Reflection Question: How do Amazon RDS Multi-AZ deployments (synchronous) and Amazon DynamoDB Global Tables (asynchronous) use different replication and failover methods to achieve high availability and disaster recovery, and why does each fit its own consistency and global access requirements?

Together, these methods protect data durability and keep services running through failures that range from a single instance or Availability Zone to an entire Region.

💡 Tip: When designing your architecture, consider the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) implications of synchronous vs. asynchronous replication strategies.
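One way to make the tip concrete is a back-of-the-envelope estimate. The sketch below assumes worst-case RPO equals the replication lag (zero for synchronous replication) and RTO is detection time plus promotion time plus the time for clients to be redirected. The function names and every number are illustrative assumptions, not measured AWS figures.

```python
def worst_case_rpo_seconds(replication_lag_seconds, synchronous):
    """Acknowledged writes at risk, expressed as seconds of data."""
    return 0.0 if synchronous else replication_lag_seconds


def estimated_rto_seconds(check_interval_seconds, failure_threshold,
                          promotion_seconds, redirect_seconds=0.0):
    """Detection + promotion + client redirection (e.g. DNS update)."""
    detection = check_interval_seconds * failure_threshold
    return detection + promotion_seconds + redirect_seconds


def meets_objectives(rpo, rto, rpo_target, rto_target):
    """True when both estimates fall within the stated targets."""
    return rpo <= rpo_target and rto <= rto_target


# Illustrative numbers only.
sync_rpo = worst_case_rpo_seconds(2.5, synchronous=True)
async_rpo = worst_case_rpo_seconds(2.5, synchronous=False)
rto = estimated_rto_seconds(check_interval_seconds=10, failure_threshold=3,
                            promotion_seconds=60, redirect_seconds=30)

sync_ok = meets_objectives(sync_rpo, rto, rpo_target=0, rto_target=180)
async_ok = meets_objectives(async_rpo, rto, rpo_target=0, rto_target=180)
```

With these inputs the estimated RTO is 120 seconds for either mode, but only the synchronous option satisfies an RPO target of zero.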