Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.1.1. Amazon S3: The Data Lake Foundation

šŸ’” First Principle: S3 is the gravitational center of AWS data engineering because it's the only storage service that combines virtually unlimited capacity, extreme durability (99.999999999%), sub-cent-per-GB pricing, and direct integration with every analytics service. When you don't know what questions you'll ask of your data in the future, store it in S3 — you'll always be able to query it later.

S3 stores data as objects in buckets. For data engineering, the key architectural decisions are:

Prefix design (folder structure). S3 doesn't have real folders — "folders" are key prefixes. Design prefixes to match query patterns: s3://datalake/raw/orders/year=2025/month=03/day=15/ enables Athena partition pruning on date. Poor prefix design (e.g., flat structure with no partitioning) forces full scans.
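The date-partitioned layout above can be sketched as a small key-builder. This is a minimal illustration, not an AWS API; the `partitioned_key` helper and the `raw/` prefix are assumptions chosen to match the example path in the text.

```python
from datetime import date

def partitioned_key(table: str, event_date: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (year=/month=/day=) so that
    Athena can prune partitions instead of scanning the whole prefix."""
    return (
        f"raw/{table}/"
        f"year={event_date:%Y}/month={event_date:%m}/day={event_date:%d}/"
        f"{filename}"
    )

key = partitioned_key("orders", date(2025, 3, 15), "part-0000.json.gz")
# -> "raw/orders/year=2025/month=03/day=15/part-0000.json.gz"
```

Because the `key=value` segments follow Hive partitioning conventions, Glue crawlers and Athena `MSCK REPAIR TABLE` can discover the partitions automatically.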

Versioning. Enables recovery from accidental overwrites or deletes. Every object write creates a new version. Combined with lifecycle policies, old versions can be archived to Glacier. Versioning is also a prerequisite for S3 Object Lock (WORM compliance) and cross-region replication.
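A lifecycle rule that archives noncurrent versions to Glacier might look like the sketch below. The dict follows the shape boto3's `put_bucket_lifecycle_configuration` expects; the rule ID, prefix, day counts, and bucket name are placeholder assumptions.

```python
# Hypothetical rule: move noncurrent (superseded) versions to Glacier after
# 30 days, and delete them entirely after 365 days.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-versions",        # placeholder rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},        # placeholder prefix
            "NoncurrentVersionTransitions": [
                {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
            ],
            "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
        }
    ]
}

# Applying it (requires boto3 and credentials, so shown as a comment):
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="datalake", LifecycleConfiguration=lifecycle_config)
```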

Encryption. SSE-S3 (AWS-managed keys), SSE-KMS (customer-managed keys in KMS), or SSE-C (customer-provided keys). For most data lake use cases, SSE-S3 is the default. SSE-KMS adds auditability (CloudTrail logs every key usage) and cross-account sharing control.
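The practical difference shows up in the upload parameters. Below is a sketch of the extra arguments for an SSE-KMS `put_object` call; the key ARN is a placeholder, and the bucket/key names are assumptions.

```python
# SSE-KMS: encrypt with a customer-managed KMS key (every use is logged
# in CloudTrail). The ARN here is a placeholder, not a real key.
kms_put_args = {
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",
}

# SSE-S3 by contrast needs no key ID at all:
sse_s3_put_args = {"ServerSideEncryption": "AES256"}

# s3.put_object(Bucket="datalake", Key="raw/orders/part-0000.json.gz",
#               Body=payload, **kms_put_args)
```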

S3 Select and Glacier Select. Query subsets of data from within individual objects using SQL — filter rows and select columns without downloading the entire object. Useful for processing specific records from large CSV or JSON files without running a full Glue job.
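An S3 Select request is expressed as SQL over a single object. The sketch below builds the parameter dict for boto3's `select_object_content` against a gzipped JSON-lines object; the bucket, key, and column names are assumptions for illustration.

```python
# Filter rows and project columns server-side from one gzipped JSON-lines
# object, instead of downloading and parsing the whole file.
select_params = {
    "Bucket": "datalake",                                    # placeholder
    "Key": "raw/orders/year=2025/month=03/day=15/part-0000.json.gz",
    "ExpressionType": "SQL",
    "Expression": (
        "SELECT s.order_id, s.total FROM s3object s WHERE s.total > 100"
    ),
    "InputSerialization": {"JSON": {"Type": "LINES"},
                           "CompressionType": "GZIP"},
    "OutputSerialization": {"JSON": {}},
}

# response = s3.select_object_content(**select_params)
# Matching records then arrive as an event stream in response["Payload"].
```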

āš ļø Exam Trap: S3 is not a database — it doesn't support transactions, updates to individual records, or row-level locking. If a question describes needing to update individual records by primary key, S3 alone won't work. You need either a database (DynamoDB, RDS) or an open table format (Apache Iceberg) layered on top of S3 to get update/delete capabilities.

Reflection Question: A company stores 50 TB of log data in S3 as gzipped JSON. Athena queries are slow and expensive. What three changes would you make to the storage format and organization?

Written by Alvin Varughese, Founder • 15 professional certifications