Understanding the Apache Parquet File Format

I was building a DSPM (data security posture management) product. The first plan was simple: connect to a customer's RDS instance, run classification queries, report findings.

Customers hated that.

So we changed the approach. The customer takes an RDS snapshot, exports it to S3 as Parquet, and we pick it up from there. No database credentials required.

What Parquet actually is

Apache Parquet is a columnar file format. Not a database, not a query engine. Just a way to lay out tabular data on disk column by column instead of row by row. That layout determines how well the data compresses and how much a scan has to read, and it is why tools like Spark, Athena, DuckDB, and Pandas treat it as a default interchange format.

Row vs column storage

Here is the difference on a tiny four-user table:

Row-oriented storage

Name   Age  City
Alice  29   Berlin
Brian  34   London
Chloe  27   Bangalore
Derek  41   Tokyo

Each row lives together on disk. Fast for one record, wasteful when you only need one column across millions of rows.

Column-oriented storage

Name: Alice, Brian, Chloe, Derek
Age:  29, 34, 27, 41
City: Berlin, London, Bangalore, Tokyo

Each column is stored contiguously. Scanning all ages reads only the age block and skips name and city entirely.

Row storage is great when you want one full record. That is why OLTP databases like Postgres optimize for it. Column storage is better when you want one field across millions of rows, which is the shape of most analytical scans.
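To make the difference concrete, here is a minimal in-memory sketch of the same four-user table in both layouts. The function names are made up for illustration, and plain Python lists stand in for bytes on disk.

```python
# The same four-user table, laid out two ways.

# Row-oriented: each record is kept together.
rows = [
    ("Alice", 29, "Berlin"),
    ("Brian", 34, "London"),
    ("Chloe", 27, "Bangalore"),
    ("Derek", 41, "Tokyo"),
]

# Column-oriented: one contiguous list per field.
columns = {
    "name": [r[0] for r in rows],
    "age": [r[1] for r in rows],
    "city": [r[2] for r in rows],
}


def fetch_record_row_store(index):
    """One full record is a single lookup in the row layout."""
    return rows[index]


def average_age_row_store():
    """Walks every record, including the name and city it never uses."""
    return sum(age for _name, age, _city in rows) / len(rows)


def average_age_column_store():
    """Touches only the age column."""
    ages = columns["age"]
    return sum(ages) / len(ages)


print(fetch_record_row_store(2))
print(average_age_row_store(), average_age_column_store())
```

Both averages agree. The difference is how much data each version has to walk to get there, which is what dominates at millions of rows.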

Anatomy of a Parquet file

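I expected a blob of compressed bytes. It is more structured than that.

File layout

Parquet file: the top-level container. One file holds the data and self-describing metadata, so it stays portable across Spark, Athena, DuckDB, and Pandas. Inside it, rows are split into row groups, each row group holds one column chunk per column, each column chunk is a sequence of pages, and a footer at the end of the file carries the schema and statistics.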

Row groups make parallel reads possible. Column chunks make pruning possible. Pages are where encoding and compression actually happen. The footer is what makes the file self-describing.
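The footer is easy to find because of how a Parquet file is framed: it starts and ends with the 4-byte magic PAR1, and the 4 bytes before the trailing magic hold the footer length as a little-endian integer. A minimal sketch of locating it follows. The function name is hypothetical, and the byte string is a synthetic stand-in, not a real Parquet file; real footer bytes are Thrift-encoded metadata that this sketch does not parse.

```python
import struct

MAGIC = b"PAR1"


def locate_footer(data: bytes):
    """Return (start, length) of the footer metadata in a Parquet byte string."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: missing PAR1 magic")
    # The footer length sits just before the trailing magic, little-endian.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len


# Synthetic stand-in: placeholder bytes where pages and metadata would be.
fake_pages = b"row-group-bytes"
fake_footer = b"schema+stats"
blob = MAGIC + fake_pages + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC

start, length = locate_footer(blob)
print(blob[start:start + length])
```

A reader can fetch the tail of the file, parse the footer, and then request only the byte ranges of the column chunks it needs, which is what makes the format work well on object storage.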

Why the files are so small

On one production dataset, roughly 80 million rows across a few dozen columns, the JSON export was about 1.2 GB and the CSV about 1.0 GB. The same data as Parquet with Snappy compression landed around 130 MB.

Parquet stores homogeneous columns, so compressors see repeated patterns instead of mixed types in every row. It also applies encodings such as dictionary, run-length (RLE), and delta encoding before byte-level compression runs.
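Here is a simplified sketch of those three encodings on toy columns. These are the textbook versions, written only to show the idea. Parquet's actual encodings are more involved (its RLE, for example, is a hybrid with bit-packing), and the sample values are invented.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary, index_of, indices = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def delta_encode(values):
    """Keep the first value, then store each difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


cities = ["Berlin", "Berlin", "Berlin", "London", "London", "Tokyo"]
dictionary, indices = dictionary_encode(cities)
print(dictionary)                   # ['Berlin', 'London', 'Tokyo']
print(indices)                      # [0, 0, 0, 1, 1, 2]
print(run_length_encode(indices))   # [[0, 3], [1, 2], [2, 1]]

timestamps = [1000, 1001, 1003, 1006]
print(delta_encode(timestamps))     # [1000, 1, 2, 3]
```

Six strings become a three-entry dictionary plus three short runs, and the timestamps become small integers. A byte-level compressor then has much less to do.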

Sort order matters too. One table was only about 2x smaller than CSV until we sorted it by date before writing. After that it crossed 10x. Similar values sitting together makes both encoding and compression much more effective.
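The effect is easy to reproduce on a toy column. The sketch below compresses the same invented date column twice, once in arrival order and once sorted. It uses zlib from the standard library as a stand-in for Snappy, so the absolute sizes will not match what a Parquet writer produces.

```python
import random
import zlib

random.seed(0)

# Hypothetical column: 10,000 rows drawn from 30 distinct dates, in random order.
dates = [f"2024-01-{random.randint(1, 30):02d}" for _ in range(10_000)]


def compressed_size(column):
    """Size in bytes of the newline-joined column after zlib compression."""
    return len(zlib.compress("\n".join(column).encode("utf-8")))


unsorted_size = compressed_size(dates)
sorted_size = compressed_size(sorted(dates))

print("unsorted:", unsorted_size, "bytes")
print("sorted:  ", sorted_size, "bytes")
```

The values are identical in both runs. Only their order changes, and the sorted version compresses to a fraction of the size because each date becomes one long run.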

Query performance

Compression is nice, but scan efficiency is where Parquet changed our DSPM pipeline. Reads moved off production databases and onto cold S3 storage.
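Much of that scan efficiency comes from skipping data. A rough sketch of the idea follows: each row group records the minimum and maximum of a column, and a reader skips any group whose range cannot contain the value it is looking for. The RowGroup class and function names here are hypothetical stand-ins, not a real Parquet reader API.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Stand-in for one row group plus the min/max statistics for its age column."""
    min_age: int
    max_age: int
    ages: list


def make_group(ages):
    return RowGroup(min(ages), max(ages), ages)


def scan_equal(row_groups, target):
    """Return matching values and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        if target < group.min_age or target > group.max_age:
            continue  # statistics prove there is no match, so skip the group
        groups_read += 1
        matches.extend(a for a in group.ages if a == target)
    return matches, groups_read


groups = [
    make_group([18, 21, 25]),
    make_group([27, 29, 34]),
    make_group([41, 52, 60]),
]
matches, groups_read = scan_equal(groups, 29)
print(matches, groups_read)   # [29] 1
```

Only one of the three groups is read. This works best when the data is sorted or clustered on the filter column, because then each group's range stays narrow.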

Bloom filters help on equality checks for high-cardinality columns when min/max statistics are too broad. They cost extra space and write time, so they are worth enabling on columns you filter by often, not on every field.
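A minimal sketch of why a bloom filter helps when statistics do not. This is a textbook bloom filter built on hashlib, not the layout Parquet itself uses, and the user IDs are made up.

```python
import hashlib


class BloomFilter:
    """Textbook bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive several bit positions by hashing the value with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


# One row group of a high-cardinality column.
user_ids = ["u-1003", "u-5871", "u-9420"]
bloom = BloomFilter()
for uid in user_ids:
    bloom.add(uid)

target = "u-4242"
# Min/max cannot rule the group out: the target falls inside the range.
stats_can_skip = not (min(user_ids) <= target <= max(user_ids))
print("min/max can skip:", stats_can_skip)
print("bloom says maybe present:", bloom.might_contain(target))
```

The min/max range is too wide to exclude the target. When the bloom filter answers no, the reader can skip the row group with certainty, and a yes only means the group has to be read and checked.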

Parquet vs CSV vs JSON

Aspect          CSV                 JSON                  Parquet
File size       Large               Varies                Often 5-10x smaller
Read speed      Slow, full scan     Slow, parse overhead  Fast with pruning
Write speed     Fast append         Fast to generate      Slower writes
Schema          Header row at best  Implicit              Embedded and enforced
Human-readable  Yes                 Yes                   No
Compression     Poor                Poor                  Strong on wide tables

Should you use it?

Advantages

  • Much smaller files than CSV on wide analytical tables.
  • Engines read only the columns and row groups they need.
  • Self-describing schema in the footer.
  • Broad support across Spark, Athena, DuckDB, Pandas, and more.

Disadvantages

  • Not human-readable without tooling.
  • Heavier writes than appending CSV or JSON Lines.
  • Tiny files can be a bad fit because metadata dominates.
  • Best for write-once, read-many workloads.

If the data is large, the workload is analytical, and the files live on object storage, Parquet is usually the right default. If the data is small, someone needs to eyeball it in a text editor, or you are appending constantly, CSV or JSON Lines is often simpler. If you are ingesting a stream, land the data in an append-friendly format first and compact it to Parquet later, once the workload turns read-heavy.

Back to the DSPM problem

Parquet was not the only option. Customers could have sent CSVs, or we could have used a read replica. Parquet fit because we were doing bulk column scans, file size mattered, and the RDS snapshot export path already supported it.

That is the useful way to think about Parquet. Not magic. Just a file format that is very good at analytical reads. If that is the job, it is hard to beat. If it is not, do not force it.