Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.1.1.3. Replication & Failover Methods for Stateful Services

Stateless services (Lambda, containerized APIs) are easy to make resilient — just run more copies. Stateful services (databases, caches, file systems) are hard because data must be replicated without loss or inconsistency.

Replication modes:
  • Synchronous: Write completes only after data is replicated. Zero data loss (RPO=0) but adds latency. Used by: RDS Multi-AZ, Aurora cluster volumes.
  • Asynchronous: Write completes immediately; replication happens in background. Possible data loss during failover (RPO > 0). Used by: RDS read replicas, DynamoDB Global Tables, S3 CRR.
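  • Semi-synchronous: Write completes after at least one replica acknowledges. Balance of safety and performance. Used by: RDS Multi-AZ DB clusters (two readable standbys). Note that Aurora Global Database replicates asynchronously across Regions, typically with sub-second lag, so it belongs with the asynchronous group.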
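
The difference between the three modes comes down to how many replicas must hold a record before the write is acknowledged. The following is a minimal toy model of that idea, not how any AWS service is implemented; the names `Replica`, `write`, `records_at_risk`, and `demo` are hypothetical, and background replication is deliberately left out so the worst-case gap is visible.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A node that stores an ordered log of records."""
    log: list = field(default_factory=list)


def write(primary, replicas, record, mode):
    """Commit a record and return how many replicas acknowledged before commit.

    Toy assumption: replicas outside the required set have not yet
    received the record at the moment the write is acknowledged.
    """
    primary.log.append(record)
    if mode == "sync":
        required = len(replicas)   # every replica must acknowledge
    elif mode == "semi-sync":
        required = 1               # at least one replica must acknowledge
    elif mode == "async":
        required = 0               # acknowledge immediately
    else:
        raise ValueError(f"unknown mode: {mode}")
    for replica in replicas[:required]:
        replica.log.append(record)
    return required


def records_at_risk(primary, replica):
    """Committed records that would be lost if this replica were promoted now."""
    return len(primary.log) - len(replica.log)


def demo(mode):
    """Write three records, then report the worst-case loss per replica."""
    primary, replicas = Replica(), [Replica(), Replica()]
    for i in range(3):
        write(primary, replicas, f"txn-{i}", mode)
    return [records_at_risk(primary, r) for r in replicas]


if __name__ == "__main__":
    for mode in ("sync", "semi-sync", "async"):
        print(mode, demo(mode))
```

In this model a synchronous write leaves no replica behind (RPO = 0), a semi-synchronous write guarantees that at least one replica is safe to promote, and an asynchronous write can leave every replica behind at the moment of failure.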

Failover patterns:

Service                  Failover Mechanism                              Typical Time
-----------------------  ----------------------------------------------  --------------
RDS Multi-AZ             DNS CNAME flip to standby                       60-120 seconds
Aurora                   Promote read replica, update cluster endpoint   ~30 seconds
ElastiCache Redis        Promote replica, update endpoint                Seconds
DynamoDB Global Tables   No failover needed (active-active)              N/A
EFS                      Automatic (Multi-AZ by default)                 Transparent

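Application considerations: After a database failover, existing connections are broken and the application must re-establish them. Use connection pooling with health checks so dead connections are detected and replaced, and keep DNS TTLs low (both on any custom CNAME you place in front of the database endpoint and in client-side resolver caches) so clients pick up the new IP quickly.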
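
A minimal sketch of the reconnect step, assuming the application supplies a `connect` callable that opens a fresh connection (and therefore resolves the endpoint hostname again) on every call. The function name `connect_with_retry` and its parameters are hypothetical; real connection pools usually provide their own retry and health-check settings, which are preferable where available.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially between tries.

    connect: zero-argument callable that opens a NEW connection each time,
             so the endpoint hostname is looked up again after a failover.
    sleep:   injectable for testing; defaults to time.sleep.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError:
            # Covers refused/reset connections, timeouts, and DNS lookup errors.
            if attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

For example, `connect_with_retry(lambda: socket.create_connection(("mydb.example.internal", 5432), timeout=3))` would retry a plain TCP connection to a placeholder hostname; with a database driver, the lambda would wrap the driver's own connect call instead.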

Exam Trap: RDS failover changes the IP address behind the endpoint's DNS CNAME, but an application that caches DNS lookups (common in Java) will not discover the new primary until its cache entry expires. Set the JVM security property networkaddress.cache.ttl to a low value (e.g., 60 seconds), or use RDS Proxy, which preserves client connections and routes them to the new primary without depending on client-side DNS caches.
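
The effect of a client-side DNS cache can be illustrated with a small simulation. This is a toy model under stated assumptions, not the JVM's actual caching code: `CachingResolver`, the hostname, and the IP addresses are all hypothetical, and the clock is injected so the passage of time can be controlled.

```python
class CachingResolver:
    """Caches hostname lookups for ttl_seconds, like a client-side DNS cache."""

    def __init__(self, lookup, ttl_seconds, clock):
        self._lookup = lookup      # callable: hostname -> IP (the "real" DNS)
        self._ttl = ttl_seconds
        self._clock = clock        # callable returning the current time in seconds
        self._cache = {}           # hostname -> (ip, expires_at)

    def resolve(self, host):
        now = self._clock()
        hit = self._cache.get(host)
        if hit is not None and now < hit[1]:
            return hit[0]          # cached answer, possibly stale
        ip = self._lookup(host)
        self._cache[host] = (ip, now + self._ttl)
        return ip


# Simulated DNS records and clock (placeholder values).
records = {"mydb.example.internal": "10.0.1.10"}
now = [0]
resolver = CachingResolver(lambda h: records[h], ttl_seconds=60,
                           clock=lambda: now[0])

before = resolver.resolve("mydb.example.internal")        # old primary, now cached

records["mydb.example.internal"] = "10.0.2.20"            # failover flips the record
now[0] = 30
during_ttl = resolver.resolve("mydb.example.internal")    # still the old primary

now[0] = 61
after_ttl = resolver.resolve("mydb.example.internal")     # cache expired, new primary
```

Here `during_ttl` still points at the failed primary 30 seconds after the failover, and only `after_ttl` reaches the new one. A cache that never expires would never recover without a restart, which is why the TTL setting matters.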

Written by Alvin Varughese • Founder • 15 professional certifications