4.4.1. High Availability and Site Considerations

💡 First Principle: High availability eliminates single points of failure by providing redundant components that take over when primary components fail. Site diversity ensures that location-specific disasters don't take down all operations.

Load balancing distributes workloads across multiple servers. If one server fails, the others absorb the traffic. Active-active configurations run all servers simultaneously, so each one handles a share of the load; active-passive configurations keep standby servers idle but ready to take over when an active server fails.
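The difference between the two modes can be sketched in a few lines of Python. This is a minimal illustration, not a production load balancer: the class name, the function name, and the server names are hypothetical, and health status is set by hand rather than by real health checks.

```python
class RoundRobinBalancer:
    """Active-active: every healthy server takes a share of the requests."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Walk the ring at most once, skipping servers marked unhealthy.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


def pick_active_passive(servers, healthy):
    """Active-passive: use the first healthy server in priority order."""
    for server in servers:
        if server in healthy:
            return server
    raise RuntimeError("no healthy servers available")
```

In the active-active case the surviving servers simply receive more requests after a failure; in the active-passive case the standby receives traffic only once the primary is marked unhealthy.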

Clustering groups multiple servers that share workloads and fail over automatically. Cluster members monitor each other's health; if one fails, another assumes its role, typically within seconds.
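Health monitoring in a cluster is often based on heartbeats: each member reports in periodically, and a member that stays silent for too long is treated as failed. The sketch below assumes a simple priority order and a hypothetical 3-second timeout; the `Cluster` class and node names are illustrative and do not represent any specific clustering product.

```python
class Cluster:
    """Toy cluster: the first member with a recent heartbeat holds the active role."""

    def __init__(self, members, timeout_seconds=3.0):
        self.members = list(members)      # priority order, highest first
        self.timeout = timeout_seconds    # assumed failure-detection threshold
        self.last_seen = {m: None for m in self.members}

    def heartbeat(self, member, now):
        # Record the time at which a member last reported in.
        self.last_seen[member] = now

    def is_alive(self, member, now):
        seen = self.last_seen[member]
        return seen is not None and (now - seen) <= self.timeout

    def active(self, now):
        # The highest-priority live member assumes the role.
        for member in self.members:
            if self.is_alive(member, now):
                return member
        return None
```

Timestamps are passed in explicitly so the failover logic is easy to follow; a real cluster would read a clock and exchange heartbeats over the network.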

Redundancy types:

Type	Description	Example
Server	Multiple servers for same function	Web server cluster
Network	Redundant paths and switches	Dual ISPs, link aggregation
Storage	RAID, replication	RAID 5, geo-replicated storage
Power	UPS, generators, dual feeds	Dual power supplies per server
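A quick calculation shows why redundancy helps. If any one of several redundant components is enough to keep the service running, and their failures are independent (an assumption that shared power, network, or location can violate), the combined availability is 1 - (1 - a)^n. The function name below is hypothetical, and the figures in the note are purely illustrative.

```python
def combined_availability(single_availability, copies):
    """Availability of `copies` redundant components, any one of which suffices.

    Assumes failures are independent of each other.
    """
    if not 0.0 <= single_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single_availability) ** copies
```

Under these assumptions, two components that are each available 99% of the time give roughly 99.99% combined availability.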

Site considerations:

Site Type	Description	Recovery Time	Cost
Hot site	Fully operational duplicate	Minutes to hours	$$$ Highest
Warm site	Partial infrastructure, needs data	Hours to days	$$ Medium
Cold site	Empty facility, needs everything	Days to weeks	$ Lowest

Platform diversity: using different vendors and technologies reduces the risk that a single vulnerability affects all systems. If all servers run the same OS, one exploit can compromise all of them.

Multi-cloud systems: distributing workloads across multiple cloud providers reduces vendor lock-in and limits the impact of a single provider outage.

Geographic dispersion: placing redundant systems in different physical locations protects against regional disasters. If primary and backup systems are in the same building, or even the same city, a single earthquake, flood, or power grid failure can take both down at once. Best practice is to place recovery sites in a different geographic region from the primary; separate availability zones within one region protect against a data center failure but may still share regional risks.
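The three ideas above share one test: do all systems have the same value for some attribute that could fail or be exploited? The following sketch checks an inventory for such shared failure domains. The function name, the dictionary keys (`os`, `provider`, `region`), and the sample values are assumptions made for illustration.

```python
def shared_failure_domains(systems, dimensions=("os", "provider", "region")):
    """Return the dimensions on which every system has the same value."""
    shared = {}
    for dim in dimensions:
        values = {system[dim] for system in systems}
        # Exactly one distinct value means no diversity on this dimension.
        if len(values) == 1:
            shared[dim] = values.pop()
    return shared


inventory = [
    {"name": "primary", "os": "linux", "provider": "cloud-a", "region": "us-east"},
    {"name": "backup", "os": "linux", "provider": "cloud-b", "region": "us-west"},
]
print(shared_failure_domains(inventory))
```

In this sample inventory the providers and regions differ, but both systems run the same OS, so a single OS-level exploit could still affect both.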

⚠️ Exam Trap: Hot site = fastest recovery, highest cost. Cold site = slowest recovery, lowest cost. The exam tests whether you can match recovery requirements (RTO) to the appropriate site type.
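Matching an RTO to a site type amounts to picking the cheapest site whose recovery time still fits within the RTO. The sketch below assumes specific worst-case recovery times in hours; those numbers are illustrative assumptions, since the table above gives only ranges, and the function name is hypothetical.

```python
# Assumed worst-case recovery times in hours, listed cheapest site first.
# These figures are illustrative, not taken from any standard.
SITES = [
    ("cold", 14 * 24),  # "days to weeks"
    ("warm", 3 * 24),   # "hours to days"
    ("hot", 4),         # "minutes to hours"
]


def cheapest_site_for(rto_hours):
    """Return the cheapest site type that can recover within the RTO, or None."""
    for name, worst_case_hours in SITES:
        if worst_case_hours <= rto_hours:
            return name
    return None  # no site type meets this RTO under the assumed figures
```

With these assumed figures, an RTO of 8 hours calls for a hot site, an RTO of 4 days allows a warm site, and an RTO of a month allows a cold site.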

Written by Alvin Varughese

Founder • 15 professional certifications