A Parquet file is an open, column-oriented, compressed, self-describing storage format for tabular data, designed for analytics. It stores data by column rather than by row, embeds its own schema and statistics, and is the de facto interchange format for analytical datasets. Because it is columnar and carries per-column statistics, an engine can read only the columns and the slices of the file a query actually needs.
| Property | Parquet |
|---|---|
| Layout | Columnar (data grouped by column, not row) |
| Schema | Embedded in the file (self-describing) |
| Compression | Per-column: snappy, zstd, gzip, lz4, brotli |
| Statistics | Per column chunk in each row group: min/max, null count, optional distinct count |
| Typical use | Analytics, data lakes, table interchange |
| Open standard | Apache Parquet; readable by most analytics tools |
A Parquet file is a nested container. From the outside in:
- File: holds one or more row groups plus a footer. The footer carries the schema and an index of where everything lives.
- Row group: a horizontal slice of the rows (e.g. the first 500,000 rows). Each row group is independently readable.
- Column chunk: within a row group, the data for one column. This is the unit that makes Parquet columnar: all values for country sit together, separate from revenue.
- Page: the smallest unit inside a column chunk. Pages are where compression and encoding are applied.
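As a rough mental model of that nesting, the sketch below slices a small in-memory table into row groups and column chunks in plain Python. The helper name `to_row_groups` and the sample data are hypothetical, and pages, encoding and the footer are left out; it illustrates the layout only, not how a real Parquet writer works.

```python
from typing import Any


# Hypothetical helper for illustration only; not part of any Parquet library.
def to_row_groups(table: dict[str, list[Any]], row_group_size: int) -> list[dict[str, list[Any]]]:
    """Split a column-oriented table into row groups of column chunks."""
    if row_group_size <= 0:
        raise ValueError("row_group_size must be positive")
    num_rows = len(next(iter(table.values()))) if table else 0
    groups = []
    for start in range(0, num_rows, row_group_size):
        # One column chunk per column: a contiguous slice of that column's values.
        groups.append({name: values[start:start + row_group_size]
                       for name, values in table.items()})
    return groups


# Made-up sample data, stored column by column.
table = {
    "country": ["GB", "AU", "IN", "US", "DE"],
    "revenue": [12.5, 80.0, 3.2, 410.9, 55.0],
}
row_groups = to_row_groups(table, row_group_size=2)

# Reading one column of one row group touches only that chunk.
print(len(row_groups))           # 3
print(row_groups[0]["country"])  # ['GB', 'AU']
```

Each dictionary in `row_groups` stands in for a row group, and each list inside it for a column chunk. A query that needs only country never has to look at the revenue lists.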
Describing that structure is one thing. Seeing it in a real file is more convincing, and Parquet makes it easy: it records its own structure in the footer. So let's write a Parquet file, then crack it open and read that footer back.
clickhouse local can do both. It's part of ClickHouse; install it with clickhousectl, the ClickHouse CLI for local and cloud:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
Now generate a 2,000,000-row demo file, events.parquet, with output_format_parquet_row_group_size set so the rows split into four row groups:
clickhouse local -q "
SELECT
number AS id,
toDateTime('2026-01-01 00:00:00') + toIntervalMinute(number) AS event_time,
['GB','AU','IN','US','DE'][(number % 5) + 1] AS country,
['mobile','desktop','tablet'][(number % 3) + 1] AS device,
['view','click','purchase'][(number % 3) + 1] AS event_type,
round(randUniform(1, 500), 2) AS revenue,
toUInt8((number % 4) + 1) AS quantity
FROM numbers(2000000)
INTO OUTFILE 'events.parquet'
FORMAT Parquet
SETTINGS output_format_parquet_row_group_size = 500000"
Now read the structure back. The ParquetMetadata format exposes the footer as a queryable table, so we can inspect the file without touching its column data. Here is the top-level view:
clickhouse local -q "
SELECT num_columns, num_rows, num_row_groups, format_version,
formatReadableSize(total_uncompressed_size) AS uncompressed,
formatReadableSize(total_compressed_size) AS compressed
FROM file('events.parquet', ParquetMetadata)
FORMAT Vertical"
num_columns: 7
num_rows: 2000000
num_row_groups: 4
format_version: 2
uncompressed: 38.02 MiB
compressed: 11.63 MiB
Seven columns, two million rows, split into four row groups, compressed from 38 MiB down to 11.6 MiB. All of that is read from the footer without touching the column data itself.
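The footer is cheap to find because the Parquet format ends every file with a 4-byte little-endian footer length followed by the magic bytes PAR1. The Python sketch below locates the footer from those last eight bytes. The helper name `footer_span` is hypothetical, it assumes an unencrypted file, and it only finds the footer bytes; decoding them requires a Thrift reader, which is out of scope here. The demo runs on a synthetic byte string, not a real Parquet file.

```python
import io
import struct
from typing import BinaryIO

MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


# Hypothetical helper for illustration; real readers go on to decode the Thrift footer.
def footer_span(f: BinaryIO) -> tuple[int, int]:
    """Return (offset, length) of the footer metadata in an open binary file."""
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    if file_size < 12:  # leading magic + footer length + trailing magic
        raise ValueError("too small to be a Parquet file")
    f.seek(file_size - 8)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    footer_len = struct.unpack("<I", tail[:4])[0]
    if footer_len > file_size - 12:
        raise ValueError("footer length exceeds file size")
    return file_size - 8 - footer_len, footer_len


# Synthetic stand-in for a file: magic, fake column data, fake footer, length, magic.
fake_footer = b"\x00" * 10
fake_file = MAGIC + b"column-data" + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC
offset, length = footer_span(io.BytesIO(fake_file))
print(offset, length)  # 15 10
```

To try it on the demo file, open events.parquet in binary mode (`open("events.parquet", "rb")`) and pass the handle to `footer_span`. A reader needs only one seek to the end of the file plus one read of the footer before it knows the schema and where every column chunk starts.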
Compression and encoding
Compression is applied per column, and because a column holds values of one type with a similar distribution, it compresses well. Parquet supports several codecs (snappy, zstd, gzip, lz4 and brotli). Before the codec runs, values are encoded: dictionary encoding replaces repeated values with small integer keys, and run-length encoding (RLE) collapses long runs of the same value.
The metadata shows the codec and the space saved for each column (reported below as compression_ratio, the percentage by which the column shrank):
clickhouse local -q "
SELECT c.name AS column, c.physical_type AS physical_type, c.compression AS compression, c.space_saved AS compression_ratio
FROM file('events.parquet', ParquetMetadata)
ARRAY JOIN columns AS c
FORMAT Pretty"
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ column ┃ physical_type ┃ compression ┃ compression_ratio ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ id │ INT64 │ ZSTD │ 87.07% │
│ event_time │ INT64 │ ZSTD │ 65.87% │
│ country │ BYTE_ARRAY │ ZSTD │ 99.9% │
│ device │ BYTE_ARRAY │ ZSTD │ 99.88% │
│ event_type │ BYTE_ARRAY │ ZSTD │ 99.88% │
│ revenue │ DOUBLE │ ZSTD │ 16.8% │
│ quantity │ INT32 │ ZSTD │ 99.89% │
└────────────┴───────────────┴─────────────┴───────────────────┘
The low-cardinality string columns (country, device, event_type) shrink by over 99%, because dictionary plus RLE encoding turn a few repeated labels into almost nothing. The high-entropy revenue floats compress far less. (The codec is whatever the writer chose; ClickHouse defaults to ZSTD here.)
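To make the two encodings concrete, here is a toy Python sketch of dictionary encoding followed by run-length encoding. It is a simplification under stated assumptions: the function names and the sample column are made up, and real Parquet writers use a hybrid RLE/bit-packing scheme on the dictionary keys and then apply the codec, none of which is modelled here.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer key into a list of distinct values."""
    dictionary = []  # distinct values, in order of first appearance
    index = {}       # value -> key
    keys = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        keys.append(index[v])
    return dictionary, keys


def run_length_encode(keys):
    """Collapse consecutive repeats into (key, run_length) pairs."""
    return [(k, sum(1 for _ in run)) for k, run in groupby(keys)]


# Made-up column with long runs of a few labels.
column = ["GB"] * 4 + ["AU"] * 3 + ["GB"] * 2
dictionary, keys = dictionary_encode(column)
runs = run_length_encode(keys)

print(dictionary)  # ['GB', 'AU']
print(keys)        # [0, 0, 0, 0, 1, 1, 1, 0, 0]
print(runs)        # [(0, 4), (1, 3), (0, 2)]
```

Nine strings become a two-entry dictionary and three (key, run length) pairs. The fewer distinct values a column has and the longer its runs, the less there is left for the codec to compress.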
Two properties make Parquet fast to query.
Column pruning. Because data is stored by column, a query that touches two columns reads only those two column chunks and skips the rest of the file. A SELECT count() reads no column data at all; the answer is in the footer. On a 20-million-row version of this file, count() returns in 0.004s, while a filter-and-aggregate over two columns takes 0.108s (best of three, warm cache, on an Apple-silicon laptop). Both are reading the same file; the difference is how much of it each query has to touch.
Predicate pushdown via row-group statistics. For every column chunk, a row group stores the min and max value (when the writer records statistics). If a query filters on a column, the engine compares the filter against those ranges and skips any row group that cannot contain a match, without decompressing it. The per-row-group ranges for the time column in our demo file:
clickhouse local -q "
SELECT rg.num_rows AS rows,
fromUnixTimestamp64Milli(toInt64(c.statistics.min)) AS event_time_min,
fromUnixTimestamp64Milli(toInt64(c.statistics.max)) AS event_time_max
FROM file('events.parquet', ParquetMetadata)
ARRAY JOIN row_groups AS rg
ARRAY JOIN arrayFilter(x -> x.name = 'event_time', rg.columns) AS c
FORMAT Pretty"
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ rows ┃ event_time_min ┃ event_time_max ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 523272 │ 2026-01-01 00:00:00.000 │ 2026-12-30 09:11:00.000 │
│ 523272 │ 2026-12-30 09:12:00.000 │ 2027-12-28 18:23:00.000 │
│ 523272 │ 2027-12-28 18:24:00.000 │ 2028-12-26 03:35:00.000 │
│ 430184 │ 2028-12-26 03:36:00.000 │ 2029-10-20 22:19:00.000 │
└────────┴─────────────────────────┴─────────────────────────┘
The ranges don't overlap. A query for event_time >= '2028-01-01' can skip the first two row groups outright, reading roughly half the file. On real data with sorted or naturally-ordered columns, this is where Parquet earns its speed.
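The skipping decision itself is a simple range comparison. The Python sketch below replays it using the row counts and min/max values from the ParquetMetadata output above; the function name `groups_to_read` is hypothetical, and a real engine performs this check internally against the footer, not with code like this.

```python
from datetime import datetime

# (rows, event_time_min, event_time_max) per row group, copied from the output above.
ROW_GROUPS = [
    (523272, datetime(2026, 1, 1, 0, 0), datetime(2026, 12, 30, 9, 11)),
    (523272, datetime(2026, 12, 30, 9, 12), datetime(2027, 12, 28, 18, 23)),
    (523272, datetime(2027, 12, 28, 18, 24), datetime(2028, 12, 26, 3, 35)),
    (430184, datetime(2028, 12, 26, 3, 36), datetime(2029, 10, 20, 22, 19)),
]


# Hypothetical helper for illustration only.
def groups_to_read(row_groups, cutoff):
    """Indexes of row groups that may contain rows with event_time >= cutoff."""
    # A group can be skipped only when even its max value is below the cutoff.
    return [i for i, (_, _, hi) in enumerate(row_groups) if hi >= cutoff]


selected = groups_to_read(ROW_GROUPS, datetime(2028, 1, 1))
rows_read = sum(ROW_GROUPS[i][0] for i in selected)
total_rows = sum(rows for rows, _, _ in ROW_GROUPS)

print(selected)               # [2, 3]
print(rows_read, total_rows)  # 953456 2000000
```

Only the max of each group is needed for a `>=` filter; a `<=` filter would compare against the min instead. The third row group is still read in full even though only part of it matches, because statistics rule row groups out but do not locate rows inside them.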
CSV stores data row by row as plain text, with no schema, no types, no statistics and no compression. To answer a query an engine must parse the entire file and infer types as it goes. Parquet stores data by column, typed, compressed, with statistics, so an engine reads only what it needs. The practical upshot is smaller files and faster analytical queries. For a fuller treatment with numbers, see Parquet vs CSV.
You don't need a server or an import step to work with Parquet. clickhouse local reads a Parquet file in place from the command line; it's a good single-binary tool for poking at files on your laptop. ClickHouse uses the same SQL whether you point it at a file on your laptop, a server, or ClickHouse Cloud, so a query you write against a local Parquet file scales up unchanged.
To query the file we inspected above:
clickhouse local -q "
SELECT country, count() AS purchases, round(sum(revenue)) AS revenue
FROM file('events.parquet')
WHERE event_type = 'purchase'
GROUP BY country
ORDER BY revenue DESC"
For the full walkthrough, covering globbing many files, projecting columns, and casting types, see how to query a Parquet file.
Run it yourself: the data generator and every command above live in local-analytics/what-is-parquet-file in the ClickHouse examples repo.
Prefer Python? → Read a Parquet file in Python.