2.1.1. Data Formats for ML Workloads
💡 First Principle: Data formats control two things that matter enormously at scale: how much you store (compression) and how fast you read (columnar vs. row access). Choosing the right format before ingestion prevents expensive reformatting later—and the exam tests your ability to match formats to access patterns.
Not all data formats are created equal for ML. A format that works well for transactional databases (row-oriented, like CSV) performs terribly for analytical ML workloads where you typically need a few columns across millions of rows. Understanding this distinction is fundamental.
Row-oriented formats store complete records together. When you read one field, you read the entire row. These are efficient when you need all columns for a given record—like serving a customer profile. CSV and JSON are the most common row-oriented formats.
Columnar formats store each column separately. When you read one feature, you skip all other columns entirely. These are efficient when you need a few features across millions of records—which is exactly what model training does. Apache Parquet and ORC are the standard columnar formats. (Avro, though often mentioned alongside them, is row-oriented—see the table below.)
| Format | Orientation | Compression | Best For | Exam Signal |
|---|---|---|---|---|
| Parquet | Columnar | Excellent (Snappy, gzip) | ML training, analytics, large datasets | "Efficient analytics," "columnar," "cost-effective storage" |
| ORC | Columnar | Excellent | Hive/Hadoop ecosystems, EMR workloads | "Hive," "EMR," "optimized row columnar" |
| CSV | Row | None (unless zipped) | Small datasets, simple interchange | "Human-readable," "simple," "legacy systems" |
| JSON | Row | None (unless compressed) | Semi-structured data, APIs, config | "Nested data," "API responses," "flexible schema" |
| Avro | Row | Good | Streaming, schema evolution needs | "Schema changes," "Kafka," "streaming" |
| RecordIO | Record-based | Good | SageMaker built-in algorithms | "SageMaker built-in," "training optimization," "protobuf" |
| Protobuf | Binary serialized | Excellent | SageMaker pipe mode, high-performance | "Pipe mode," "streaming training data" |
⚠️ Exam Trap: RecordIO and Protobuf are specific to SageMaker's built-in algorithms and pipe mode. If a question mentions SageMaker built-in algorithms and asks about the most efficient training format, the answer is RecordIO—not Parquet. But if the question is about general-purpose storage for analytics and ML, Parquet is almost always the answer. Watch for which context the question describes.
Reflection Question: A team stores 500 GB of training data as JSON files in S3. Training jobs take 6 hours. Without changing the model, what single change to the data format would most reduce training time, and why?