Understanding Apache Parquet: The File Format Behind Modern Data Engineering
It started with a customer who said no
I was building a DSPM product. We scanned customer databases for sensitive data (PII, credentials, financial records) and the original plan was simple: connect to the customer's RDS instance, run classification queries, report findings.
Customers hated this.
Think about it from their side. They're paying for a security product, and the first thing it asks for is direct access to their production database? The one with customer records, payment info, health data? Even read-only credentials didn't make them comfortable. One bad query could saturate their connection pool. One misconfigured timeout could cascade into an outage. They weren't being paranoid. They were being reasonable.
So we changed the approach. The customer takes an RDS snapshot (zero impact on prod), exports it to S3, and we pick it up as Parquet files. No database access needed.
I didn't know much about Parquet before this. I knew it was "the columnar format" and that data engineers liked it. Working with it in production taught me why.

What Parquet actually is
Apache Parquet is a file format. Not a database, not a query engine, just a way to lay out tabular data on disk. The key thing is that it stores data column by column instead of row by row. That sounds like a small difference but it affects everything: how well the data compresses, how fast you can query it, and why basically every data tool defaults to it now.
Twitter and Cloudera built it in 2013. A lot of the design comes from Google's Dremel paper and earlier research like C-Store. They gave it to Apache, and at this point Spark, Athena, BigQuery, DuckDB, Pandas... pretty much everything can read it natively.
Row vs column storage
Here's a table with four users (the names are illustrative):

| Name | Age | City |
|---|---|---|
| Alice | 34 | Austin |
| Bob | 29 | Boston |
| Carol | 41 | Chicago |
| Dave | 25 | Denver |
In CSV or any row-oriented format, each record is stored together. If you want Alice's full profile, that's fast because everything's in one place. That's why OLTP databases like Postgres and MySQL store data this way. They're optimized for "give me user #42."
But say you want the average age across all users. Now the engine has to read every single row, including Name, City, and every other column it doesn't care about. With 4 rows, who cares. With 400 million rows and 50 columns, you're reading maybe 50x more data than you need.
Parquet flips it. All the ages are stored together, all the names together, all the cities together. SELECT AVG(age) reads just the age column. Name and City stay on disk untouched. That's why OLAP systems and data lakes use columnar formats.
The tradeoff is real though. If you need to fetch a single row by ID, Parquet is terrible at it, because that one row is scattered across dozens of column chunks. It's not a general-purpose format. It's built for scans.
Anatomy of a Parquet file
When I first opened one of those RDS exports in a hex editor (okay, a Parquet viewer, I'm not that hardcore), I expected it to be a blob of compressed column data. It's actually pretty structured. There are distinct layers, and each one exists for a reason:
At the top is the file itself: a single .parquet file holds both the data and self-describing metadata, which is what makes it fully portable between tools like Spark, Athena, DuckDB, and Pandas.
The short version:
- Row groups are horizontal slices, usually around 128 MB of uncompressed data each. They're independent of each other, so engines can process them in parallel across threads or machines.
- Column chunks split each row group by column. This is what makes column pruning work: if your query only needs 3 out of 50 columns, the other 47 never get read.
- Pages are the actual compression unit, usually around 1 MB. Small enough to decompress quickly, large enough to compress well.
- The footer sits at the end of the file but gets read first. It has the full schema, the locations of every row group and column chunk, and min/max statistics per column. This is what lets engines decide what to skip before reading any actual data.
Why the files are so small
The compression ratios surprised me when I first saw them. Our production dataset (roughly 80 million rows, a couple dozen columns) was about 1.0 to 1.2 GB as CSV. As Parquet with Snappy compression, it was around 130 MB.
That's not unusual. On typical wide tables, 5-10x smaller than CSV is normal. Here's why.
Columns are homogeneous
In CSV, a single row mixes strings, integers, dates, floats, so the compressor doesn't have much to work with. In Parquet, each column chunk is a single type. A column of country names is just strings, and often the same few dozen values repeated millions of times. That's much easier to compress.
Encoding happens before compression
Parquet doesn't just throw Snappy or Zstd at the raw bytes. It first applies logical encodings that shrink the data:
- RLE - runs of the same value get collapsed. "USA, USA, USA..." a million times becomes (USA, 1000000). Sorted data makes this absurdly effective.
- Dictionary encoding - low-cardinality columns (status codes, country names) get a lookup table. The string "United States" is stored once, and each row just has a tiny integer pointing to it.
- Delta encoding - for sequential data like timestamps or auto-incrementing IDs, Parquet stores the difference between consecutive values. A column of timestamps one second apart becomes a column of 1s.
Parquet does encoding first (RLE, dictionary, delta), then byte-level compression (Snappy, Zstd, Gzip) on top. Encoding removes logical redundancy, compression removes byte-level redundancy. Stacking them is why the ratios get so good.
Sort order matters more than you'd think
This one bit me. I had a fact table that was only about 2x smaller than CSV as Parquet. Pretty disappointing. Then I sorted it by date before writing. Suddenly it was over 10x smaller. Sorting groups similar values together, which means longer runs for RLE and better patterns for the compressor. If you're writing Parquet and not thinking about sort order, you're leaving a lot on the table.
Query performance
The compression numbers are nice, but honestly, query performance is where Parquet changed how we built things. In our DSPM pipeline, scans that would've hammered a production RDS instance became lightweight reads on cold S3 storage.
Column pruning
If your table has 50 columns and your query touches 3, Parquet reads roughly 6% of the file. CSV can't do this. It's row-oriented, so you read everything. With Athena (which charges per byte scanned), this directly cuts your bill.
Predicate pushdown
This is the footer paying for itself.
Say you run SELECT * FROM users WHERE age > 50. The footer stores min and max age for each row group. If a row group has min=18 and max=35, the engine skips it entirely. Zero bytes read. On large datasets this can eliminate most of the work before any data is even decompressed.
Row group sizing
The default 128 MB row group size works fine for most cases, but it's worth understanding the tradeoff. Bigger row groups = better compression and less metadata overhead, but coarser predicate pushdown. If only one row in a 128 MB group matches your filter, you still read the whole group. For very wide tables or highly selective queries, I've seen 32 MB row groups work better. More chances to skip, more parallelism, at the cost of slightly worse compression and more footer entries. There's no universal answer here.
Bloom filters
Min/max statistics in the footer help with range predicates (age > 50), but they're useless for equality checks on high-cardinality columns. WHERE user_id = 12345 when the min is 1 and the max is 10 million? Can't skip anything.
That's what bloom filters are for. They're optional per-column metadata that can definitively say "this value is NOT in this row group." They cost extra space and write time, which is why they're not on by default in most writers. Worth enabling on columns you frequently filter by equality (IDs, keys, that kind of thing). Not worth it on every column.
Schema evolution (and its gotchas)
The happy path is simple: add a new column to your schema, and old Parquet files still work. Readers just see null for the column that didn't exist yet. This is genuinely useful and one of the reasons Parquet beats CSV for any serious pipeline.
The gotchas are less obvious. Renaming a column isn't a first-class operation. The schema stores column names as strings, so if you rename user_name to username in your application, old files still have user_name and you end up with two columns unless you rewrite everything. Type changes (int to long, or string to int) aren't safe. Some readers will coerce, others will just fail. Adding a required (non-optional) column breaks old readers that don't know about it.
In practice: always add columns as optional, avoid renames unless you're doing a one-time migration with a full rewrite, and treat type changes as a new dataset.
Parquet vs CSV vs JSON
Here's how the three compare on the stuff that actually matters when you're picking a format:
| Aspect | CSV | JSON | Parquet |
|---|---|---|---|
| File size | Large | Varies (JSON Lines can be compact, but keys still repeat) | Often 80-90% smaller (5-10x typical) |
| Read speed (analytics) | Slow, full scan | Slow, parse overhead | Fast (column pruning + pushdown) |
| Write speed | Fast, append-friendly | Fast, easy to generate | Slower (encoding + metadata) |
| Schema | None (header row at best) | Implicit (no enforcement) | Embedded and enforced |
| Human-readable | Yes | Yes | No (binary) |
| Nested data | No | Yes (native) | Yes (repeated/group types) |
| Compression | Poor (mixed types per row) | Poor (verbose syntax) | Excellent (homogeneous columns) |
| Ecosystem | Universal | Universal | Data tools (Spark, Athena, DuckDB, etc.) |
| Splittable | Line-by-line | Not easily | Yes, per row group |
Should you use it?
What you get:
- Much smaller files (often 85-90% smaller than CSV). Less storage, faster transfers.
- Query performance: engines only read the columns and row groups they need.
- Self-describing schema in the footer. No separate schema files.
- Schema evolution: add columns later, old readers see null for new ones.
- Every major tool reads Parquet. Default for data lakes, lakehouses, and ETL.
- Language agnostic: Python, Java, Go, Rust, C++, JS all have readers.
What it costs:
- Not human-readable. For quick inspection CSV wins (though parquet-cli and the DuckDB CLI help).
- Write overhead: heavier than appending a CSV line. Write-heavy workloads pay in latency.
- Small files: under a few MB, the metadata can make Parquet bigger than CSV. Avoid tiny files; aim for 128-512 MB.
- Needs a library. For one-off scripts or quick handoffs, CSV is simpler.
- Built for write-once, read-many. For streaming appends, land events as JSON Lines or Avro, then compact to Parquet (e.g. with Delta Lake or Iceberg).
The short version: if your data is big (hundreds of MB+), the workload is analytical, and you're storing it in S3 or GCS, Parquet is probably the right call. It's the default for data lakes, lakehouses, and most ETL pipelines for a reason.
If your data is small, someone needs to read it in a text editor, or you're streaming/appending constantly, stick with CSV or JSON. Parquet is write-once, read-many. For continuous append, land events as JSON Lines or Avro, then compact to Parquet periodically with something like Delta Lake or Iceberg.
Back to the DSPM problem
The customer who wouldn't let us touch their database. Parquet turned out to be a good fit for that. They snapshot their RDS, export to S3 as Parquet, we pick it up. They never give us database credentials. Prod is never at risk. And on our side, we only read the columns we actually need to classify, instead of pulling every column from a 500-column table. The files are a fraction of the CSV size, so storage and transfer costs stay reasonable.
I want to be honest though, this wasn't the only option. They could've sent us CSVs, or we could've set up a read replica. Parquet fit because we were doing bulk column scans, we cared about file size, and the export tooling already supported it. It wasn't some revelation. It was the right tool for a specific problem.
That's kind of the thing with Parquet. It's not magic. It's a file format that happens to be really well-designed for analytical workloads. If you're doing that kind of work, it's hard to beat. If you're not, don't force it.