Understanding Apache Parquet: The File Format Behind Modern Data Engineering
It started with a customer who said no
I was building a DSPM product. We scanned customer databases for sensitive data (PII, credentials, financial records) and the original plan was simple: connect to the customer's RDS instance, run classification queries, report findings.
Customers hated this.
Think about it from their side. They're paying for a security product, and the first thing it asks for is direct access to their production database? The one with customer records, payment info, health data? Even read-only credentials didn't make them comfortable. One bad query could saturate their connection pool. One misconfigured timeout could cascade into an outage. They weren't being paranoid. They were being reasonable.
So we changed the approach. The customer takes an RDS snapshot (zero impact on prod), exports it to S3, and we pick it up as Parquet files. No database access needed.
I didn't know much about Parquet before this. I knew it was "the columnar format" and that data engineers liked it. Working with it in production taught me why.

What Parquet actually is
Apache Parquet is a file format. Not a database, not a query engine, just a way to lay out tabular data on disk. The key thing is that it stores data column by column instead of row by row. That sounds like a small difference but it affects everything: how well the data compresses, how fast you can query it, and why basically every data tool defaults to it now.
Twitter and Cloudera built it in 2013. A lot of the design comes from Google's Dremel paper and earlier research like C-Store. They gave it to Apache, and at this point Spark, Athena, BigQuery, DuckDB, Pandas... pretty much everything can read it natively.
Row vs column storage
Here's a table with four users (the names are illustrative):

| Name | Age | City |
|---|---|---|
| Alice | 34 | Austin |
| Bob | 29 | Boston |
| Carol | 41 | Chicago |
| Dave | 25 | Denver |
In CSV or any row-oriented format, each record is stored together. If you want Alice's full profile, that's fast because everything's in one place. That's why OLTP databases like Postgres and MySQL store data this way. They're optimized for "give me user #42."
But say you want the average age across all users. Now the engine has to read every single row, including Name, City, and every other column it doesn't care about. With 4 rows, who cares. With 400 million rows and 50 columns, you're reading maybe 50x more data than you need.
Parquet flips it. All the ages are stored together, all the names together, all the cities together. SELECT AVG(age) reads just the age column. Name and City stay on disk untouched. That's why OLAP systems and data lakes use columnar formats.
The tradeoff is real though. If you need to fetch a single row by ID, Parquet is terrible at it, because that one row is scattered across dozens of column chunks. It's not a general-purpose format. It's built for scans.
Anatomy of a Parquet file
When I first opened one of those RDS exports in a hex editor (okay, a Parquet viewer, I'm not that hardcore), I expected it to be a blob of compressed column data. It's actually pretty structured. There are distinct layers, and each one exists for a reason:
At the top is the file itself: a single .parquet file holds both the data and self-describing metadata, which is what makes it fully portable between tools like Spark, Athena, DuckDB, and Pandas.
The short version:
- Row groups are horizontal slices, usually around 128 MB of uncompressed data each. They're independent of each other, so engines can process them in parallel across threads or machines.
- Column chunks split each row group by column. This is what makes column pruning work: if your query only needs 3 out of 50 columns, the other 47 never get read.
- Pages are the actual compression unit, usually around 1 MB. Small enough to decompress quickly, large enough to compress well.
- The footer sits at the end of the file but gets read first. It has the full schema, the locations of every row group and column chunk, and min/max statistics per column. This is what lets engines decide what to skip before reading any actual data.
Why the files are so small
The compression ratios surprised me when I first saw them. Our production dataset (roughly 80 million rows, a couple dozen columns) was about 1.0 to 1.2 GB as CSV. As Parquet with Snappy compression, it was around 130 MB.
That's not unusual. On typical wide tables, 5-10x smaller than CSV is normal. Here's why.
Columns are homogeneous
In CSV, a single row mixes strings, integers, dates, floats, so the compressor doesn't have much to work with. In Parquet, each column chunk is a single type. A column of country names is just strings, and often the same few dozen values repeated millions of times. That's much easier to compress.
Encoding happens before compression
Parquet doesn't just throw Snappy or Zstd at the raw bytes. It first applies logical encodings that shrink the data:
- RLE - runs of the same value get collapsed. "USA, USA, USA..." a million times becomes (USA, 1000000). Sorted data makes this absurdly effective.
- Dictionary encoding - low-cardinality columns (status codes, country names) get a lookup table. The string "United States" is stored once, and each row just has a tiny integer pointing to it.
- Delta encoding - for sequential data like timestamps or auto-incrementing IDs, Parquet stores the difference between consecutive values. A column of timestamps one second apart becomes a column of 1s.
Parquet does encoding first (RLE, dictionary, delta), then byte-level compression (Snappy, Zstd, Gzip) on top. Encoding removes logical redundancy, compression removes byte-level redundancy. Stacking them is why the ratios get so good.
Sort order matters more than you'd think
This one bit me. I had a fact table that was only about 2x smaller than CSV as Parquet. Pretty disappointing. Then I sorted it by date before writing. Suddenly it was over 10x smaller. Sorting groups similar values together, which means longer runs for RLE and better patterns for the compressor. If you're writing Parquet and not thinking about sort order, you're leaving a lot on the table.
Query performance
The compression numbers are nice, but honestly, query performance is where Parquet changed how we built things. In our DSPM pipeline, scans that would've hammered a production RDS instance became lightweight reads on cold S3 storage.
Column pruning
If your table has 50 columns and your query touches 3, Parquet reads roughly 6% of the file. CSV can't do this. It's row-oriented, so you read everything. With Athena (which charges per byte scanned), this directly cuts your bill.
Predicate pushdown
This is the footer paying for itself.
Say you run SELECT * FROM users WHERE age > 50. The footer stores min and max age for each row group. If a row group has min=18 and max=35, the engine skips it entirely. Zero bytes read. On large datasets this can eliminate most of the work before any data is even decompressed.
Row group sizing
The default 128 MB row group size works fine for most cases, but it's worth understanding the tradeoff. Bigger row groups = better compression and less metadata overhead, but coarser predicate pushdown. If only one row in a 128 MB group matches your filter, you still read the whole group. For very wide tables or highly selective queries, I've seen 32 MB row groups work better. More chances to skip, more parallelism, at the cost of slightly worse compression and more footer entries. There's no universal answer here.
Bloom filters
Min/max statistics in the footer help with range predicates (age > 50), but they're useless for equality checks on high-cardinality columns. WHERE user_id = 12345 when the min is 1 and the max is 10 million? Can't skip anything.
That's what bloom filters are for. They're optional per-column metadata that can definitively say "this value is NOT in this row group." They cost extra space and write time, which is why they're not on by default in most writers. Worth enabling on columns you frequently filter by equality (IDs, keys, that kind of thing). Not worth it on every column.
Schema evolution (and its gotchas)
The happy path is simple: add a new column to your schema, and old Parquet files still work. Readers just see null for the column that didn't exist yet. This is genuinely useful and one of the reasons Parquet beats CSV for any serious pipeline.
The gotchas are less obvious. Renaming a column isn't a first-class operation. The schema stores column names as strings, so if you rename user_name to username in your application, old files still have user_name and you end up with two columns unless you rewrite everything. Type changes (int to long, or string to int) aren't safe. Some readers will coerce, others will just fail. Adding a required (non-optional) column breaks old readers that don't know about it.
In practice: always add columns as optional, avoid renames unless you're doing a one-time migration with a full rewrite, and treat type changes as a new dataset.
Parquet vs CSV vs JSON
Here's how the three compare on the stuff that actually matters when you're picking a format:
| Aspect | CSV | JSON | Parquet |
|---|---|---|---|
| File size | Large | Varies (JSON Lines can be compact, but keys still repeat) | Often 80-90% smaller (5-10x typical) |
| Read speed (analytics) | Slow, full scan | Slow, parse overhead | Fast (column pruning + pushdown) |
| Write speed | Fast, append-friendly | Fast, easy to generate | Slower (encoding + metadata) |
| Schema | None (header row at best) | Implicit (no enforcement) | Embedded and enforced |
| Human-readable | Yes | Yes | No (binary) |
| Nested data | No | Yes (native) | Yes (repeated/group types) |
| Compression | Poor (mixed types per row) | Poor (verbose syntax) | Excellent (homogeneous columns) |
| Ecosystem | Universal | Universal | Data tools (Spark, Athena, DuckDB, etc.) |
| Splittable | Line-by-line | Not easily | Yes, per row group |
Should you use it?
What you get:
- Much smaller files (often 85-90% smaller than CSV). Less storage, faster transfers.
- Query performance: engines only read the columns and row groups they need.
- Self-describing schema in the footer. No separate schema files.
- Schema evolution: add columns later, old readers see null for new ones.
- Every major tool reads Parquet. Default for data lakes, lakehouses, and ETL.
- Language agnostic: Python, Java, Go, Rust, C++, JS all have readers.
What it costs:
- Not human-readable. For quick inspection CSV wins (though parquet-cli and the DuckDB CLI help).
- Write overhead: heavier than appending a CSV line. Write-heavy workloads pay in latency.
- Small files: under a few MB, the metadata can make Parquet bigger than CSV. Avoid tiny files; aim for 128-512 MB.
- Needs a library. For one-off scripts or quick handoffs, CSV is simpler.
- Built for write-once, read-many. For streaming appends, land events as JSON Lines or Avro, then compact to Parquet (e.g. with Delta Lake or Iceberg).
The short version: if your data is big (hundreds of MB+), the workload is analytical, and you're storing it in S3 or GCS, Parquet is probably the right call. It's the default for data lakes, lakehouses, and most ETL pipelines for a reason.
If your data is small, someone needs to read it in a text editor, or you're streaming/appending constantly, stick with CSV or JSON. Parquet is write-once, read-many. For continuous append, land events as JSON Lines or Avro, then compact to Parquet periodically with something like Delta Lake or Iceberg.
Back to the DSPM problem
The customer who wouldn't let us touch their database. Parquet turned out to be a good fit for that. They snapshot their RDS, export to S3 as Parquet, we pick it up. They never give us database credentials. Prod is never at risk. And on our side, we only read the columns we actually need to classify, instead of pulling every column from a 500-column table. The files are a fraction of the CSV size, so storage and transfer costs stay reasonable.
I want to be honest though, this wasn't the only option. They could've sent us CSVs, or we could've set up a read replica. Parquet fit because we were doing bulk column scans, we cared about file size, and the export tooling already supported it. It wasn't some revelation. It was the right tool for a specific problem.
That's kind of the thing with Parquet. It's not magic. It's a file format that happens to be really well-designed for analytical workloads. If you're doing that kind of work, it's hard to beat. If you're not, don't force it.