3.2.2. Open Table Formats: Apache Iceberg and S3 Tables
💡 First Principle: Open table formats solve the fundamental limitation of data lakes: you can't update or delete individual records in plain S3 files. Apache Iceberg adds database-like capabilities (ACID transactions, row-level updates, schema evolution, time travel) on top of S3, while Amazon S3 Tables provides a managed Iceberg experience integrated with S3. Together, they bring the best of databases to data lakes, an approach known as the "lakehouse" architecture.
Traditional data lake files are immutable: to "update" a record, you rewrite the entire file. This makes corrections expensive, GDPR deletion requests painful, and schema changes risky. Open table formats solve this by maintaining metadata that tracks which data files make up each snapshot of the table. Writers commit updates and deletes by touching only the affected files, and readers always query a consistent snapshot.
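To make the metadata idea concrete, here is a minimal Python sketch of a toy table. It is not Iceberg's implementation or API: the `ToyTable` class, its method names, and the file names are all hypothetical, and it models only a copy-on-write style of delete. The point it illustrates is that data files are never modified; each commit just records a new snapshot listing which files are live.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model (hypothetical): immutable data files plus snapshot metadata."""
    files: dict = field(default_factory=dict)      # file name -> tuple of rows, never mutated
    snapshots: list = field(default_factory=list)  # snapshot id -> frozenset of live file names
    _counter: int = 0

    def _write_file(self, rows):
        self._counter += 1
        name = f"data-{self._counter:04d}.parquet"
        self.files[name] = tuple(rows)
        return name

    def append(self, rows):
        """Commit a new snapshot that adds one data file; return its snapshot id."""
        current = self.snapshots[-1] if self.snapshots else frozenset()
        self.snapshots.append(current | {self._write_file(rows)})
        return len(self.snapshots) - 1

    def delete_where(self, predicate):
        """Rewrite only files holding matching rows; return how many files were touched."""
        current = self.snapshots[-1] if self.snapshots else frozenset()
        live = set(current)
        touched = 0
        for name in sorted(current):
            rows = self.files[name]
            kept = [r for r in rows if not predicate(r)]
            if len(kept) == len(rows):
                continue  # no matching rows: file is carried into the new snapshot as-is
            live.discard(name)
            if kept:
                live.add(self._write_file(kept))
            touched += 1
        self.snapshots.append(frozenset(live))
        return touched

    def scan(self, snapshot_id=None):
        """Read all rows visible in a snapshot (latest by default)."""
        sid = len(self.snapshots) - 1 if snapshot_id is None else snapshot_id
        return [row for name in sorted(self.snapshots[sid]) for row in self.files[name]]


table = ToyTable()
table.append([{"customer_id": 1}, {"customer_id": 2}])
table.append([{"customer_id": 3}])
before = table.append([{"customer_id": 2}])

files_touched = table.delete_where(lambda r: r["customer_id"] == 2)
remaining_ids = sorted(r["customer_id"] for r in table.scan())
rows_before_delete = len(table.scan(before))
```

In this run only the two files containing customer 2 are touched, and the third is carried forward unchanged. Reading the earlier snapshot still returns all four rows, which is the same mechanism that makes time travel possible. The superseded files also remain on storage in this model; in Iceberg, data referenced only by old snapshots is physically removed once those snapshots are expired.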
Apache Iceberg is the leading open table format on AWS. Key capabilities:
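ACID transactions: concurrent reads and writes without corrupting the table. Readers always see a complete snapshot, and multiple writers can safely commit to the same table because Iceberg uses optimistic concurrency: a commit that conflicts with another is retried or rejected rather than silently overwriting it.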
Time travel: query the table as it existed at a specific timestamp or snapshot ID. Essential for debugging ("what did the data look like yesterday?") and compliance ("prove what was in the table on audit date X").
Schema evolution: add, drop, rename, or reorder columns without rewriting data files. Existing files stay readable because Iceberg tracks columns by ID rather than by name or position.
Partition evolution: change partitioning strategy without rewriting historical data. If you started partitioning by month and need daily partitions, Iceberg handles the transition transparently.
Hidden partitioning: users query with WHERE event_date = '2025-03-01' and Iceberg automatically applies partition filtering without users knowing the physical partitioning strategy.
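Hidden partitioning and partition evolution can be sketched together in a few lines of Python. This is a simplified illustration, not Iceberg's planner: the `TRANSFORMS` table, the `plan_scan` function, and the file names are hypothetical. Each data file remembers the partition transform it was written under, so a filter on event_date can prune files written under either the old monthly layout or the new daily one.

```python
from datetime import date

# Partition transforms: derive a partition value from the event_date column.
TRANSFORMS = {
    "month": lambda d: (d.year, d.month),
    "day": lambda d: (d.year, d.month, d.day),
}

# Each file records the transform (partition spec) in effect when it was written.
# Hypothetical files: f1 and f2 predate the switch from monthly to daily partitions.
DATA_FILES = [
    {"path": "f1.parquet", "spec": "month", "partition": (2025, 2)},
    {"path": "f2.parquet", "spec": "month", "partition": (2025, 3)},
    {"path": "f3.parquet", "spec": "day", "partition": (2025, 3, 1)},
    {"path": "f4.parquet", "spec": "day", "partition": (2025, 3, 2)},
]


def plan_scan(event_date):
    """Return paths of files that may contain rows with the given event_date."""
    selected = []
    for f in DATA_FILES:
        # Apply the file's own transform to the query value, then compare.
        wanted = TRANSFORMS[f["spec"]](event_date)
        if f["partition"] == wanted:
            selected.append(f["path"])
    return selected


# Equivalent in spirit to: WHERE event_date = '2025-03-01'
files_to_read = plan_scan(date(2025, 3, 1))
print(files_to_read)
```

The caller filters only on event_date and never names a partition column. The monthly file for March 2025 and the daily file for 2025-03-01 are both selected, while the February file and the 2025-03-02 file are skipped, and none of the older monthly files had to be rewritten.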
Amazon S3 Tables (new in v1.1) provides managed Apache Iceberg tables natively within S3. Instead of managing Iceberg metadata files yourself, S3 Tables handles table creation, compaction, and snapshot management. This reduces the operational overhead of running Iceberg on S3.
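To see what compaction means in practice, the sketch below groups many small data files into fewer, larger ones. It is only an illustration of the general idea and makes no claim about how S3 Tables performs compaction internally: the `plan_compaction` function, the `target_mb` parameter, and the file sizes are assumptions made for this example.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Group files, smallest first, so each group totals at most target_mb where possible.

    Each returned group stands for one larger file that would replace its members.
    A single file bigger than target_mb ends up in a group of its own.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [8, 16, 200, 300, 64, 4]  # hypothetical file sizes in MB
groups = plan_compaction(small_files)
print(len(small_files), "files ->", len(groups), "files")
```

Fewer, larger files mean fewer objects to open per query. In a snapshot-based table, the compacted files are committed as a new snapshot, which is why compaction and snapshot management are usually handled together.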
⚠️ Exam Trap: Apache Iceberg and Delta Lake are both open table formats, but the exam scope covers Iceberg (not Delta Lake). If a question describes "open table format" capabilities on S3, the answer is Iceberg (or S3 Tables), not Delta Lake. Also, Iceberg still stores data files in S3: it adds a metadata layer, not a separate storage engine.
Reflection Question: A data lake stores customer records in S3 as Parquet. A GDPR request requires deleting a specific customer's data across 3 years of historical files. How does Apache Iceberg simplify this compared to plain S3?