2.1.2. AWS Storage Options: S3, EFS, FSx, and When to Use Each
💡 First Principle: Storage choice for ML is driven by three factors: throughput (how fast data flows to training instances), access pattern (random vs. sequential), and sharing (single instance vs. many instances reading simultaneously). The right storage option is the one that matches all three factors—not the one that's cheapest per GB.
Most ML workloads start and end with Amazon S3—it's the default data lake for AWS ML. But S3 is object storage: excellent for large sequential reads but poor for random access or POSIX-style file operations. When your training framework needs to read data like a filesystem (random seeks, small reads, file locking), S3 alone isn't enough.
| Storage | Type | Throughput | Access Pattern | Sharing | Best ML Use Case |
|---|---|---|---|---|---|
| Amazon S3 | Object | High (parallel reads) | Sequential, large objects | Unlimited concurrent | Training data lake, model artifacts, batch data |
| Amazon EFS | File (NFS) | Moderate | Random + sequential | Multi-instance POSIX | Shared datasets across training instances, notebook data |
| Amazon FSx for Lustre | File (Lustre) | Very high | Random + sequential | Multi-instance, S3-linked | High-performance distributed training, large model checkpoints |
| Amazon FSx for NetApp ONTAP | File (NFS/SMB) | High | Random + sequential | Multi-protocol | Hybrid workloads, multi-protocol access |
| Amazon EBS | Block | Very high (single instance) | Random | Single instance only | Local scratch, boot volumes, single-instance training |
S3 + SageMaker Integration: SageMaker training jobs can read from S3 in one of three input modes. File mode (the default) downloads the entire dataset to the training instance before training starts—simple, but startup time grows with dataset size. Pipe mode streams data from S3 directly to the training algorithm—faster startup and lower disk requirements, but it generally needs data in a streamable record format such as RecordIO/Protobuf. Fast File mode provides the best of both: it streams from S3 on demand while exposing the data through POSIX-like file access, with no format restrictions.
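As a sketch, the three input modes differ only in the `InputMode` field of each channel in the `InputDataConfig` list passed to the SageMaker `create_training_job` API. The bucket and prefix names below are hypothetical; the channel structure follows the API's documented shape.

```python
# Sketch: building SageMaker training input channels for each S3 input mode.
# These dicts match the InputDataConfig shape accepted by the boto3
# sagemaker client's create_training_job. Bucket/prefixes are hypothetical.

def make_channel(name: str, s3_uri: str, input_mode: str) -> dict:
    """Build one training channel. InputMode is 'File' (default: full
    download before training), 'Pipe' (stream, record-formatted data),
    or 'FastFile' (stream with POSIX-like file access)."""
    assert input_mode in ("File", "Pipe", "FastFile")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }

channels = [
    make_channel("train", "s3://example-ml-bucket/train/", "FastFile"),
    make_channel("validation", "s3://example-ml-bucket/val/", "File"),
]
```

Switching a job between modes is usually just this one-field change, which makes it easy to benchmark File vs. Fast File startup time on your own dataset.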
FSx for Lustre + S3 Pattern: For distributed training across many GPU instances, create an FSx for Lustre filesystem linked to your S3 bucket. FSx caches data locally with sub-millisecond latency while S3 remains the durable source of truth. This pattern is common for large-scale training where I/O is the bottleneck.
⚠️ Exam Trap: Don't confuse storage cost with total cost. EFS costs more per GB than S3, but if using S3 requires a 30-minute download phase before each training job (file mode), the compute time you waste waiting may cost more than EFS would. The exam tests whether you optimize for total workflow cost, not just storage cost.
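The total-cost reasoning above is simple arithmetic, and it helps to work it through once. The figures below (instance rate, job frequency, per-GB prices) are hypothetical, chosen only to show the shape of the comparison:

```python
# Back-of-the-envelope total-cost comparison: is the compute wasted on a
# per-job S3 File-mode download bigger than EFS's per-GB storage premium?
# All prices and workload figures here are hypothetical examples.

def idle_download_cost(num_instances: int, hourly_rate: float,
                       download_minutes: float, jobs_per_month: int) -> float:
    """Compute dollars burned while every instance waits for the download."""
    return num_instances * hourly_rate * (download_minutes / 60) * jobs_per_month

def efs_storage_premium(dataset_gb: float, efs_per_gb: float,
                        s3_per_gb: float) -> float:
    """Extra monthly cost of keeping the dataset on EFS instead of S3."""
    return dataset_gb * (efs_per_gb - s3_per_gb)

# Hypothetical workload: 8 GPU instances at $12/hr, a 25-minute download
# phase, 40 jobs/month, a 200 GB dataset, EFS at $0.30/GB-mo vs S3 at
# $0.023/GB-mo.
wasted = idle_download_cost(8, 12.0, 25, 40)      # roughly $1,600/month idle
premium = efs_storage_premium(200, 0.30, 0.023)   # roughly $55/month extra
```

With these (made-up) numbers, the idle compute dwarfs the storage premium by more than an order of magnitude, which is exactly the trade-off the exam trap describes.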
Reflection Question: A team runs distributed training across 8 GPU instances. Each instance needs to read the same 200 GB dataset. Training with S3 file mode takes 25 minutes just for data download. What storage architecture would reduce this overhead?