3.1.1. Amazon S3: The Data Lake Foundation
💡 First Principle: S3 is the gravitational center of AWS data engineering because it's the only storage service that combines virtually unlimited capacity, extreme durability (99.999999999%), sub-cent-per-GB pricing, and direct integration with every analytics service. When you don't know what questions you'll ask of your data in the future, store it in S3; you'll always be able to query it later.
S3 stores data as objects in buckets. For data engineering, the key architectural decisions are:
Prefix design (folder structure). S3 doesn't have real folders: "folders" are key prefixes. Design prefixes to match query patterns: s3://datalake/raw/orders/year=2025/month=03/day=15/ enables Athena partition pruning on date. Poor prefix design (e.g., a flat structure with no partitioning) forces full scans.
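The Hive-style year=/month=/day= convention above can be sketched as a small helper. This is a minimal illustration, not an AWS API; the prefix, table name, and filename are hypothetical placeholders:

```python
from datetime import date

def partitioned_key(prefix: str, table: str, dt: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (year=/month=/day=) so that
    Athena can prune partitions when queries filter on date columns."""
    return (f"{prefix}/{table}/"
            f"year={dt.year}/month={dt.month:02d}/day={dt.day:02d}/{filename}")

key = partitioned_key("raw", "orders", date(2025, 3, 15), "orders-0001.json.gz")
# → "raw/orders/year=2025/month=03/day=15/orders-0001.json.gz"
```

Zero-padding the month and day keeps keys lexicographically sortable, which matters for both partition naming consistency and prefix listing.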
Versioning. Enables recovery from accidental overwrites or deletes. Every object write creates a new version. Combined with lifecycle policies, old versions can be archived to Glacier. Versioning is also a prerequisite for S3 Object Lock (WORM compliance) and cross-region replication.
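A sketch of how versioning plus a lifecycle rule for noncurrent versions might look with boto3. The bucket name and the 30-day window are placeholders, and the boto3 calls are commented out so the snippet runs without AWS credentials:

```python
# Request payloads for enabling versioning and archiving superseded
# (noncurrent) object versions to Glacier after 30 days.
versioning_config = {"Status": "Enabled"}

lifecycle_config = {
    "Rules": [{
        "ID": "archive-noncurrent-versions",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},  # empty prefix = the whole bucket
        "NoncurrentVersionTransitions": [
            {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
        ],
    }]
}

# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_versioning(Bucket="datalake",
#                          VersioningConfiguration=versioning_config)
# s3.put_bucket_lifecycle_configuration(Bucket="datalake",
#                                       LifecycleConfiguration=lifecycle_config)
```

Note that the lifecycle rule targets *noncurrent* versions only, so the latest copy of each object stays in Standard storage while superseded versions age out to Glacier.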
Encryption. SSE-S3 (AWS-managed keys), SSE-KMS (customer-managed keys in KMS), or SSE-C (customer-provided keys). SSE-S3 is applied by default to new objects and is sufficient for most data lake use cases. SSE-KMS adds auditability (CloudTrail logs every key usage) and cross-account sharing control.
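Requesting SSE-KMS on a write comes down to two extra parameters on the PutObject call. A minimal sketch; the bucket name, object key, and KMS key alias are hypothetical, and the boto3 call is commented out so the snippet runs without credentials:

```python
# PutObject parameters requesting SSE-KMS encryption with a
# customer-managed key (referenced here by a placeholder alias).
put_kwargs = {
    "Bucket": "datalake",
    "Key": "raw/orders/year=2025/month=03/day=15/orders-0001.json.gz",
    "Body": b"...",
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "alias/datalake-key",  # hypothetical key alias
}

# import boto3
# boto3.client("s3").put_object(**put_kwargs)
```

Because the object is encrypted under a customer-managed KMS key, every subsequent decrypt (i.e., every GetObject) generates a KMS event in CloudTrail, which is where the auditability benefit comes from.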
S3 Select and Glacier Select. Query subsets of data from within individual objects using SQL: filter rows and select columns without downloading the entire object. Useful for processing specific records from large CSV or JSON files without running a full Glue job.
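A sketch of an S3 Select request against a gzipped JSON Lines object. The bucket, key, and column names are hypothetical, and the boto3 call is commented out so the snippet runs without AWS credentials:

```python
# select_object_content parameters: run SQL against one gzipped
# JSON Lines object, returning only matching rows and columns.
select_kwargs = {
    "Bucket": "datalake",                    # placeholder bucket
    "Key": "raw/orders/2025-03-15.json.gz",  # hypothetical object
    "ExpressionType": "SQL",
    "Expression": ("SELECT s.order_id, s.total FROM s3object s "
                   "WHERE s.status = 'FAILED'"),
    "InputSerialization": {"CompressionType": "GZIP",
                           "JSON": {"Type": "LINES"}},
    "OutputSerialization": {"JSON": {}},
}

# import boto3
# resp = boto3.client("s3").select_object_content(**select_kwargs)
# for event in resp["Payload"]:          # results arrive as an event stream
#     if "Records" in event:
#         print(event["Records"]["Payload"].decode())
```

The key contrast with Athena: S3 Select operates on a single object per request, so it fits targeted record extraction, not cross-object analytics.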
⚠️ Exam Trap: S3 is not a database: it doesn't support transactions, updates to individual records, or row-level locking. If a question describes needing to update individual records by primary key, S3 alone won't work. You need either a database (DynamoDB, RDS) or an open table format (Apache Iceberg) layered on top of S3 to get update/delete capabilities.
Reflection Question: A company stores 50 TB of log data in S3 as gzipped JSON. Athena queries are slow and expensive. What three changes would you make to the storage format and organization?