Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.5.1. Columnar vs Row-Based Formats

šŸ’” First Principle: Row-based formats (CSV, JSON) store all fields of a record together — efficient for writing and reading entire records. Columnar formats (Parquet, ORC) store all values of a single column together — efficient for analytical queries that only need a few columns from many rows. The access pattern determines which format wins.

In a row-based file, reading "SELECT avg(price) FROM sales" requires scanning every byte of every row — customer names, addresses, product descriptions — to extract just the price column. In a columnar file, the price values are stored contiguously, so the query engine reads only the price column's data block. For wide tables with many columns, this reduces I/O by orders of magnitude.
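The layouts above can be sketched in plain Python. This is an illustrative model, not a real file format: the same four sales records stored row-wise and column-wise, counting how many bytes an `avg(price)` query must touch in each layout (all names and values here are invented for the example).

```python
# Row-based layout: all fields of a record stored together.
rows = [
    {"customer": "Ada Lovelace", "address": "12 Analytical Way", "product": "Widget", "price": 19.99},
    {"customer": "Alan Turing", "address": "7 Enigma Rd", "product": "Gadget", "price": 42.50},
    {"customer": "Grace Hopper", "address": "1 Compiler Ct", "product": "Widget", "price": 19.99},
    {"customer": "Edsger Dijkstra", "address": "3 Shortest Path", "product": "Gizmo", "price": 7.25},
]

# Columnar layout: all values of each column stored contiguously.
columns = {
    "customer": [r["customer"] for r in rows],
    "address":  [r["address"] for r in rows],
    "product":  [r["product"] for r in rows],
    "price":    [r["price"] for r in rows],
}

# avg(price) over the row layout: every field of every record is visited.
row_bytes_scanned = sum(len(str(v)) for r in rows for v in r.values())

# avg(price) over the columnar layout: only the price column is visited.
col_bytes_scanned = sum(len(str(v)) for v in columns["price"])

avg_price = sum(columns["price"]) / len(columns["price"])
print(f"avg(price) = {avg_price:.2f}")
print(f"row layout scanned {row_bytes_scanned} bytes, columnar scanned {col_bytes_scanned}")
```

With only four columns the gap is modest; on a 50-column table the row layout scans roughly 50x more data than the query needs, which is where the orders-of-magnitude I/O savings come from.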

| Characteristic | Row-Based (CSV, JSON) | Columnar (Parquet, ORC) |
| --- | --- | --- |
| Write speed | Fast (append records) | Slower (organize by column) |
| Read entire row | Fast | Slower (reassemble from columns) |
| Read few columns | Slow (scan everything) | Fast (read only needed columns) |
| Compression | Moderate | Excellent (similar values compress well) |
| Schema | Implicit (CSV) or embedded (JSON) | Embedded with rich types |
| Splittable | CSV: yes; JSON: depends | Yes (block-level) |
| Best for | Ingestion, exchange, simple ETL | Analytics, data lake queries, long-term storage |
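The "similar values compress well" row deserves a concrete illustration. Below is a minimal stdlib sketch of dictionary plus run-length encoding, the kind of technique columnar formats apply to a low-cardinality column; the byte-cost figures (1 byte per code, 4 bytes per run length) are rough assumptions for the estimate, not a real codec.

```python
# A low-cardinality status column, stored contiguously as columnar formats do.
statuses = ["SHIPPED"] * 6000 + ["PENDING"] * 3000 + ["DELIVERED"] * 1000

raw_size = sum(len(s) for s in statuses)  # bytes if every value is stored as text

dictionary = sorted(set(statuses))               # distinct values, stored once
codes = [dictionary.index(s) for s in statuses]  # each value becomes a small int

runs = []  # run-length encode the codes: [code, repeat_count] pairs
for code in codes:
    if runs and runs[-1][0] == code:
        runs[-1][1] += 1
    else:
        runs.append([code, 1])

# Rough cost model: dictionary text + ~5 bytes per run (1 code + 4 length).
encoded_size = sum(len(s) for s in dictionary) + len(runs) * 5
print(f"raw: {raw_size} bytes, dictionary+RLE: {encoded_size} bytes, runs: {len(runs)}")
```

In a row-based file the same statuses are interleaved with every other field, so the long runs never form and this encoding cannot apply, which is why the table rates row-based compression as merely moderate.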

Parquet is the dominant columnar format in the AWS ecosystem. Athena, Redshift Spectrum, Glue, EMR, and Lake Formation all optimize for Parquet. It supports nested data structures, multiple compression codecs (Snappy for speed, GZIP/ZSTD for size), and predicate pushdown (the query engine skips row groups that can't match the WHERE clause based on column statistics).
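Predicate pushdown is worth seeing mechanically. The sketch below is not the Parquet API; it models the idea with invented dictionaries, where each "row group" carries min/max statistics for a price column and a `WHERE price > 100` scan skips any group whose statistics prove no row can match.

```python
# Hypothetical row groups, each with column statistics alongside the data.
row_groups = [
    {"stats": {"min": 5.0,  "max": 40.0},  "price": [5.0, 12.5, 40.0]},
    {"stats": {"min": 80.0, "max": 95.0},  "price": [80.0, 95.0]},
    {"stats": {"min": 60.0, "max": 250.0}, "price": [60.0, 150.0, 250.0]},
]

def scan_where_price_gt(groups, threshold):
    """Return matching prices, skipping groups whose stats rule them out."""
    matches, groups_read = [], 0
    for g in groups:
        if g["stats"]["max"] <= threshold:  # statistics prove no row can match
            continue                        # -> skip the group without reading it
        groups_read += 1
        matches.extend(p for p in g["price"] if p > threshold)
    return matches, groups_read

matches, groups_read = scan_where_price_gt(row_groups, 100.0)
print(matches, f"(read {groups_read} of {len(row_groups)} row groups)")
```

Only the third group is read; the first two are eliminated from their statistics alone. In real Parquet files the statistics live in the footer metadata, so the engine can make this decision before fetching any column data.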

ORC (Optimized Row Columnar) is similar to Parquet but historically tied to the Hive ecosystem. It supports Hive ACID transactions and offers good compression. AWS services support both Parquet and ORC, but Parquet appears more frequently on the exam.

Avro is a row-based format with a twist: it embeds the schema in the file and excels at schema evolution. It's commonly used for streaming data serialization (Kafka messages) and as an intermediate format in ETL pipelines. If a question mentions schema evolution in streaming data or Kafka integration, Avro may be the answer.
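The core of Avro's schema evolution is reader/writer schema resolution: a record written under an old schema is read under a newer reader schema, with missing fields filled from the reader's defaults. The sketch below models that resolution step in plain Python (it is not the Avro library, and the `Click` schema and field names are invented for the example).

```python
# A hypothetical "v2" reader schema: the referrer field was added later,
# with a default so records written under v1 still resolve.
reader_schema = {
    "name": "Click",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
}

def resolve(record, schema):
    """Project a written record onto the reader schema, applying defaults."""
    out = {}
    for field in schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for field {field['name']!r}")
    return out

old_record = {"user_id": "u-123", "url": "/pricing"}  # written before referrer existed
print(resolve(old_record, reader_schema))
```

This is why old Kafka messages keep deserializing cleanly after a producer adds a field: as long as new fields carry defaults, old and new records resolve against the same reader schema.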

āš ļø Exam Trap: "Convert CSV to Parquet" is one of the most common optimizations on the exam — but don't forget to also mention compression and partitioning. A question asking for "the most effective way to reduce Athena query costs" usually expects all three: columnar format + compression + partitioning by the common filter column.

Reflection Question: A data lake stores 10 TB of web clickstream data as gzipped JSON. Analytics queries typically filter by date and select 3 of 50 columns. What format and storage optimizations would you recommend?

Written by Alvin Varughese, Founder • 15 professional certifications