2.4.1. Compute Optimization: EC2, Placement Groups, and Compute Optimizer
💡 First Principle: The right instance type for the wrong workload wastes money and degrades performance simultaneously. Compute optimization is about alignment: matching instance characteristics (CPU architecture, memory-to-vCPU ratio, network bandwidth) to workload requirements.
EC2 Placement Groups solve a different problem: physical placement within a data center affects inter-node latency. For workloads that need high-speed, low-latency communication between instances, physical proximity matters.
| Placement Strategy | Characteristic | Best For |
|---|---|---|
| Cluster | All instances in the same rack in one AZ | HPC, distributed computing, Hadoop, ML training (needs low latency and high throughput) |
| Partition | Instances spread across logical partitions (each on separate racks); up to 7 partitions per AZ | Large distributed systems (Cassandra, Kafka, HDFS) that must limit correlated hardware failures |
| Spread | Each instance on a distinct rack; max 7 instances per AZ | Small critical workloads that must avoid simultaneous failure (e.g., 3-node clusters) |
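The constraints in the table can be captured in a small, hypothetical helper (the function and names are mine, not an AWS API); the commented-out boto3 calls show what an actual launch would look like, but they require AWS credentials and are not executed here:

```python
# Per-AZ placement-group limits from the table above.
SPREAD_MAX_PER_AZ = 7      # spread groups: max 7 running instances per AZ
PARTITIONS_MAX_PER_AZ = 7  # partition groups: up to 7 partitions per AZ

def launch_allowed(strategy: str, instance_count: int, az_count: int = 1) -> bool:
    """Return True if a planned launch fits the placement-group limits."""
    if strategy == "cluster":
        return az_count == 1  # cluster groups cannot span AZs
    if strategy == "spread":
        return instance_count <= SPREAD_MAX_PER_AZ * az_count
    if strategy == "partition":
        return True           # the cap is on partitions, not instances
    raise ValueError(f"unknown strategy: {strategy}")

# With boto3 (sketch, not executed -- needs credentials; names illustrative):
# import boto3
# ec2 = boto3.client("ec2")
# ec2.create_placement_group(GroupName="hpc-cluster", Strategy="cluster")
# ec2.run_instances(ImageId="ami-...", InstanceType="c5n.18xlarge",
#                   MinCount=8, MaxCount=8,
#                   Placement={"GroupName": "hpc-cluster"})
```

Checking limits before launch matters most for spread groups, where an eighth instance in the same AZ is rejected outright.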
Burstable Instances (T-family): T3, T3a, T4g instances earn CPU credits during idle periods and spend them during bursts. This is cost-effective for workloads with low average CPU but occasional spikes (dev environments, web servers with variable traffic).
- T3 unlimited mode: the instance can burst beyond its credit balance at an extra per-vCPU cost; no throttling, but watch the bill
- T3 standard mode: CPU throttles to the baseline when credits are exhausted
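The credit mechanics in standard mode can be sketched with a back-of-envelope model. The figures below are for t3.micro (2 vCPUs, 12 credits earned per hour, maximum balance of 288 credits); one credit equals one vCPU running at 100% for one minute. This is a simplified sketch, not AWS's exact bookkeeping:

```python
# t3.micro credit parameters (assumed for this sketch).
EARN_PER_HOUR = 12.0   # credits earned per hour
MAX_BALANCE = 288.0    # cap: 24 hours of accrual
VCPUS = 2

def step_hour(balance: float, cpu_util: float) -> tuple[float, bool]:
    """Advance one hour at an average CPU utilization in [0, 1].
    Returns (new_balance, throttled). In standard mode, once the
    balance hits zero the CPU is throttled to the baseline."""
    spend = cpu_util * VCPUS * 60          # vCPU-minutes at 100%
    balance = balance + EARN_PER_HOUR - spend
    throttled = balance < 0                # demand exceeded the credit balance
    return min(max(balance, 0.0), MAX_BALANCE), throttled

# An idle hour accrues credits; a fully busy hour drains them fast.
balance, _ = step_hour(100.0, 0.0)         # accrues to 112.0 credits
balance, throttled = step_hour(10.0, 1.0)  # drained: throttled is True
```

Note that the breakeven point (earn rate equals spend) falls at 10% average utilization, which is exactly the t3.micro baseline.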
AWS Compute Optimizer analyzes CloudWatch metrics for your EC2 instances, Lambda functions, ECS services on Fargate, and Auto Scaling groups, then makes machine learning-based recommendations:
| Recommendation Type | Example |
|---|---|
| Over-provisioned | Your m5.xlarge runs at 8% CPU → right-size to m5.large |
| Under-provisioned | Your c5.large is consistently at 95% CPU → upgrade to c5.xlarge |
| Instance family change | Switch from M-series to C-series for compute-intensive workloads |
Compute Optimizer uses CloudWatch metrics as its data source: the last 14 days by default (up to 3 months with Enhanced Infrastructure Metrics). If an instance is brand new, Compute Optimizer has no recommendations yet.
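To make the finding categories concrete, here is a toy classifier in the spirit of the table above. The thresholds are invented for illustration; Compute Optimizer's actual model considers many metrics (memory where available, network, burst behavior), not just average CPU:

```python
def classify(avg_cpu_pct: float) -> str:
    """Toy finding classifier based only on average CPU utilization.
    Thresholds are made up for this sketch, not Compute Optimizer's."""
    if avg_cpu_pct < 20:
        return "over-provisioned"   # e.g. m5.xlarge at 8% -> right-size down
    if avg_cpu_pct > 90:
        return "under-provisioned"  # e.g. c5.large at 95% -> scale up
    return "optimized"
```

The real service exposes its findings via the API (e.g. boto3's `compute-optimizer` client), returning a finding such as `OVER_PROVISIONED` plus ranked instance-type recommendations per instance.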
⚠️ Exam Trap: Placement groups have constraints. Cluster placement groups are confined to a single AZ; you can't span AZs. If an EC2 launch fails because no capacity exists in that AZ, the whole cluster placement group launch fails together (insufficient capacity errors are common for cluster groups on larger instance types). Spread placement groups are limited to 7 instances per AZ.
Reflection Question: A distributed machine learning training job needs 64 GPU instances to communicate with each other at maximum network throughput. Which placement group type do you use, and what potential operational risk should you mitigate?