3.1.1.3. Replication & Failover Methods for Stateful Services
Stateless services (Lambda, containerized APIs) are easy to make resilient — just run more copies. Stateful services (databases, caches, file systems) are hard because data must be replicated without loss or inconsistency.
Replication modes:
- Synchronous: A write completes only after the data is durably replicated. Zero data loss (RPO = 0), at the cost of added write latency. Used by: RDS Multi-AZ (single standby), Aurora cluster volumes (quorum writes across six copies in three AZs).
- Asynchronous: A write completes immediately and replication happens in the background. Writes not yet replicated can be lost during failover (RPO > 0). Used by: RDS read replicas, DynamoDB Global Tables, S3 CRR, Aurora Global Database (cross-Region lag is typically under a second).
- Semi-synchronous: A write completes after at least one replica acknowledges it. A balance of safety and performance. Used by: RDS Multi-AZ DB clusters (two readable standbys).
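The difference between the three modes comes down to when the client receives its acknowledgment. The following is a minimal sketch, not how any AWS service is implemented: `Replica`, `Primary`, and `lost_writes_after_crash` are hypothetical names, and the "crash" is simulated by never running the background shipping step.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica that stores the writes it has received."""
    log: list = field(default_factory=list)


class Primary:
    """Toy primary illustrating when a write is acknowledged to the client."""

    def __init__(self, mode, replicas):
        self.mode = mode              # "sync", "semi-sync", or "async"
        self.replicas = replicas
        self.log = []
        self.pending = []             # writes not yet shipped to every replica

    def write(self, record):
        self.log.append(record)
        if self.mode == "sync":
            # Every replica stores the record before the client gets an ack.
            for r in self.replicas:
                r.log.append(record)
        elif self.mode == "semi-sync":
            # One replica acknowledges; the rest catch up in the background.
            self.replicas[0].log.append(record)
            self.pending.append(record)
        else:
            # async: ack immediately, ship everything later.
            self.pending.append(record)
        return "ack"

    def ship_pending(self):
        """Background replication step; it never runs if the primary dies first."""
        for record in self.pending:
            for r in self.replicas:
                if record not in r.log:
                    r.log.append(record)
        self.pending.clear()


def lost_writes_after_crash(mode):
    """Writes acknowledged to the client but missing from the best surviving replica."""
    replicas = [Replica(), Replica()]
    primary = Primary(mode, replicas)
    for i in range(3):
        primary.write(f"txn-{i}")
    # The primary crashes here, before ship_pending() runs.
    best = max(replicas, key=lambda r: len(r.log))
    return [rec for rec in primary.log if rec not in best.log]
```

In this toy model, `lost_writes_after_crash("async")` returns all three acknowledged writes, while the synchronous and semi-synchronous modes return an empty list. The semi-synchronous result holds only because the replica that acknowledged the writes is the one promoted.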
Failover patterns:
| Service | Failover Mechanism | Typical Time |
|---|---|---|
| RDS Multi-AZ | DNS CNAME flip to standby | 60-120 seconds |
| Aurora | Promote read replica, update cluster endpoint | ~30 seconds |
| ElastiCache Redis | Promote replica, update endpoint | Seconds |
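| DynamoDB Global Tables | No database failover (active-active); the application routes requests to another Region | N/A |
| EFS | Automatic for Regional file systems (Multi-AZ); One Zone file systems have no cross-AZ redundancy | Transparent |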
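Application considerations: After a database failover, connections to the old primary are dropped and the application must re-establish them. Use connection pooling with health checks so dead connections are evicted, retry failed connection attempts, and make sure clients pick up the new IP quickly. AWS manages the DNS records for database endpoints and keeps their TTL short, so the part you control is the client: it must honor that TTL rather than cache the answer for longer.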
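Reconnecting after a failover can be sketched as a retry loop with exponential backoff. This is a minimal illustration under stated assumptions: `connect` is a hypothetical zero-argument callable that opens a new connection and raises `ConnectionError` while the endpoint is unavailable, and the attempt count and delays are arbitrary.

```python
import time


def connect_with_retry(connect, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially between tries.

    `connect` is a hypothetical zero-argument callable that opens a new
    connection to the database endpoint and raises ConnectionError on failure.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                     # out of attempts: surface the error
            # Wait 0.5 s, 1 s, 2 s, ... before trying again.
            sleep(base_delay * (2 ** attempt))
```

With the defaults, the loop waits up to 7.5 seconds in total (0.5 + 1 + 2 + 4) before giving up. Against the failover times in the table above, a real deployment would likely need more attempts or a longer base delay.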
Exam Trap: RDS failover repoints the endpoint's DNS CNAME at the standby, so the IP address behind the name changes. If your application caches DNS lookups (common in Java), it keeps connecting to the old IP until the cache entry expires. Set the JVM security property networkaddress.cache.ttl to a low value (e.g., 60 seconds), or use RDS Proxy, which keeps client connections open and routes them to the new primary during failover.
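The effect of client-side DNS caching can be illustrated with a small simulation. This is a sketch only: `CachingResolver` and `seconds_until_new_primary` are hypothetical names, the endpoint name and IP addresses are made up, and time advances in whole seconds with one lookup per second.

```python
class CachingResolver:
    """Toy client-side DNS cache; `lookup` is a hypothetical name -> IP callable."""

    def __init__(self, lookup, ttl_seconds):
        self.lookup = lookup
        self.ttl = ttl_seconds
        self.cache = {}  # name -> (ip, expiry time)

    def resolve(self, name, now):
        entry = self.cache.get(name)
        if entry and now < entry[1]:
            return entry[0]            # served from cache, possibly stale
        ip = self.lookup(name)
        self.cache[name] = (ip, now + self.ttl)
        return ip


def seconds_until_new_primary(ttl_seconds, failover_at, horizon=3600):
    """Seconds after failover until the client first resolves the new IP."""
    records = {"db.example.internal": "10.0.1.10"}   # made-up endpoint and IPs
    resolver = CachingResolver(records.get, ttl_seconds)
    for now in range(horizon):
        if now == failover_at:
            records["db.example.internal"] = "10.0.2.20"   # standby promoted
        if resolver.resolve("db.example.internal", now) == "10.0.2.20":
            return now - failover_at
    return None   # the client never saw the new IP within the horizon
```

With a 60-second cache and a failover at second 10, the client keeps using the old IP for another 50 seconds. With a cache that never expires (`ttl_seconds=float("inf")`), the function returns `None`, which is the situation the trap describes.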
