What is a Parquet file?

A Parquet file is an open, column-oriented, compressed, self-describing storage format for tabular data, designed for analytics. It stores data by column rather than by row, embeds its own schema and statistics, and is the de facto interchange format for analytical datasets. Because it is columnar and carries per-column statistics, an engine can read only the columns and the slices of the file a query actually needs.

At a glance

Property	Parquet
Layout	Columnar (data grouped by column, not row)
Schema	Embedded in the file (self-describing)
Compression	Per-column: snappy, zstd, gzip, lz4, brotli
Statistics	Per column in each row group: min/max, null count, optionally distinct count
Typical use	Analytics, data lakes, table interchange
Open standard	Apache Parquet; readable by most analytics tools

Internal structure

A Parquet file is a nested container. From the outside in:

File: holds one or more row groups plus a footer. The footer carries the schema and an index of where everything lives.
Row group: a horizontal slice of the rows (e.g. the first 500,000 rows). Each row group is independently readable.
Column chunk: within a row group, the data for one column. This is the unit that makes Parquet columnar. All values for country sit together, separate from revenue.
Page: the smallest unit inside a column chunk. Pages are where compression and encoding are applied.
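
To make the nesting concrete, here is a minimal Python sketch of how rows could be regrouped into row groups and column chunks. It is a toy in-memory model, not the on-disk format: the function name to_row_groups and the tiny row-group size are illustrative assumptions, and pages, encodings and the footer are left out.

```python
from typing import Any

Row = dict[str, Any]
ColumnChunk = list[Any]             # all values of one column within one row group
RowGroup = dict[str, ColumnChunk]   # column name -> column chunk


def to_row_groups(rows: list[Row], row_group_size: int) -> list[RowGroup]:
    """Slice rows horizontally into row groups, then pivot each slice into columns."""
    groups: list[RowGroup] = []
    for start in range(0, len(rows), row_group_size):
        slice_rows = rows[start:start + row_group_size]
        # Pivot: one list per column, so values of the same column sit together.
        groups.append({name: [row[name] for row in slice_rows] for name in slice_rows[0]})
    return groups


rows = [
    {"country": "GB", "revenue": 12.5},
    {"country": "AU", "revenue": 80.0},
    {"country": "IN", "revenue": 3.25},
    {"country": "US", "revenue": 41.0},
    {"country": "DE", "revenue": 7.75},
]
row_groups = to_row_groups(rows, row_group_size=2)
print(len(row_groups))            # 3 row groups: 2 + 2 + 1 rows
print(row_groups[0]["country"])   # ['GB', 'AU']
print(row_groups[0]["revenue"])   # [12.5, 80.0]
```

Each inner list plays the role of a column chunk: the country values of a row group sit together, separate from revenue.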

Describing that structure is one thing. Seeing it in a real file is more convincing, and Parquet makes it easy: it records its own structure in the footer. So let's write a Parquet file, then crack it open and read that footer back.

clickhouse local can do both. It's part of ClickHouse; install it with clickhousectl, the ClickHouse CLI for local and cloud:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Now generate a 2,000,000-row demo file, events.parquet, with output_format_parquet_row_group_size set so the rows split into four row groups:

clickhouse local -q "
SELECT
    number AS id,
    toDateTime('2026-01-01 00:00:00') + toIntervalMinute(number) AS event_time,
    ['GB','AU','IN','US','DE'][(number % 5) + 1] AS country,
    ['mobile','desktop','tablet'][(number % 3) + 1] AS device,
    ['view','click','purchase'][(number % 3) + 1] AS event_type,
    round(randUniform(1, 500), 2) AS revenue,
    toUInt8((number % 4) + 1) AS quantity
FROM numbers(2000000)
INTO OUTFILE 'events.parquet'
FORMAT Parquet
SETTINGS output_format_parquet_row_group_size = 500000"

Now read the structure back. The ParquetMetadata format exposes the footer as a queryable table, so we can inspect the file without touching its column data. Here is the top-level view:

clickhouse local -q "
SELECT num_columns, num_rows, num_row_groups, format_version,
       formatReadableSize(total_uncompressed_size) AS uncompressed,
       formatReadableSize(total_compressed_size)   AS compressed
FROM file('events.parquet', ParquetMetadata)
FORMAT Vertical"

num_columns:    7
num_rows:       2000000
num_row_groups: 4
format_version: 2
uncompressed:   38.02 MiB
compressed:     11.63 MiB

Seven columns, two million rows, split into four row groups, compressed from 38 MiB down to 11.6 MiB. All of that is read from the footer without touching the column data itself.
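
Reading the footer is cheap because of how a Parquet file is framed: it starts and ends with the 4-byte marker PAR1, and the 4 bytes just before the trailing marker give the footer's length as a little-endian integer. The sketch below locates the footer using only that framing. The function name footer_location is ours, and the demo runs against a stand-in file with filler bytes rather than a real Parquet file, so treat it as an illustration of the layout, not a metadata parser (the footer itself is Thrift-encoded and is not decoded here).

```python
import os
import struct
import tempfile

MAGIC = b"PAR1"  # 4-byte marker at both ends of a Parquet file


def footer_location(path: str) -> tuple[int, int]:
    """Return (offset, length) of the footer metadata, reading only 12 bytes."""
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("missing PAR1 marker at start of file")
        # The last 8 bytes are: footer length (uint32, little-endian) + marker.
        file_size = f.seek(-8, os.SEEK_END) + 8
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing PAR1 marker at end of file")
    (footer_length,) = struct.unpack("<I", tail[:4])
    return file_size - 8 - footer_length, footer_length


# Stand-in file with the same framing; the "footer" is filler bytes, not real metadata.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fake.parquet")
    with open(path, "wb") as f:
        f.write(MAGIC + b"d" * 100 + b"m" * 20 + struct.pack("<I", 20) + MAGIC)
    offset, length = footer_location(path)

print(offset, length)  # 104 20
```

The same function can be pointed at events.parquet to see where its footer starts.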

Compression and encoding

Compression is applied per column, and because a column holds values of one type with similar distribution, it compresses well. Parquet supports several codecs (snappy, zstd, gzip, lz4 and brotli) and layers encodings on top: dictionary encoding replaces repeated values with small integer keys, and run-length encoding (RLE) collapses long runs of the same value.

The metadata shows the codec and the space saved for each column (the compression_ratio column below is the percentage by which compression shrank the data):

clickhouse local -q "
SELECT c.name AS column, c.physical_type AS physical_type, c.compression AS compression, c.space_saved AS compression_ratio
FROM file('events.parquet', ParquetMetadata)
ARRAY JOIN columns AS c
FORMAT Pretty"

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ column     ┃ physical_type ┃ compression ┃ compression_ratio ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ id         │ INT64         │ ZSTD        │ 87.07%            │
│ event_time │ INT64         │ ZSTD        │ 65.87%            │
│ country    │ BYTE_ARRAY    │ ZSTD        │ 99.9%             │
│ device     │ BYTE_ARRAY    │ ZSTD        │ 99.88%            │
│ event_type │ BYTE_ARRAY    │ ZSTD        │ 99.88%            │
│ revenue    │ DOUBLE        │ ZSTD        │ 16.8%             │
│ quantity   │ INT32         │ ZSTD        │ 99.89%            │
└────────────┴───────────────┴─────────────┴───────────────────┘

The low-cardinality string columns (country, device, event_type) shrink by over 99%, because dictionary plus RLE encoding turn a few repeated labels into almost nothing. The high-entropy revenue floats compress far less. (The codec is whatever the writer chose; ClickHouse defaults to ZSTD here.)
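
To see why a handful of repeated labels costs so little, here is a simplified Python sketch of the two encodings. It is an illustration under assumptions, not Parquet's actual byte layout: real writers store dictionary keys with a hybrid of run-length encoding and bit-packing and then hand the page to the codec, and the function names and the sample column (chosen to contain runs) are ours.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer key into a list of distinct values."""
    dictionary, index, keys = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        keys.append(index[value])
    return dictionary, keys


def run_length_encode(keys):
    """Collapse consecutive repeats into (key, run_length) pairs."""
    return [(key, sum(1 for _ in run)) for key, run in groupby(keys)]


column = ["GB"] * 4 + ["AU"] * 3 + ["GB"] * 2
dictionary, keys = dictionary_encode(column)
runs = run_length_encode(keys)
print(dictionary)  # ['GB', 'AU']
print(keys)        # [0, 0, 0, 0, 1, 1, 1, 0, 0]
print(runs)        # [(0, 4), (1, 3), (0, 2)]
```

Nine strings become a two-entry dictionary plus three short pairs, and the original column can be rebuilt exactly by expanding each run and looking up its key.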

Why analytics engines love it

Two properties make Parquet fast to query.

Column pruning. Because data is stored by column, a query that touches two columns reads only those two column chunks and skips the rest of the file. A SELECT count() reads no column data at all; the answer is in the footer. On a 20-million-row version of this file, count() returns in 0.004s, while a filter-and-aggregate over two columns takes 0.108s (best of three, warm cache, on an Apple-silicon laptop). Both are reading the same file; the difference is how much of it each query has to touch.
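
A toy Python model shows the idea. The ColumnStore class below is a hypothetical stand-in for a columnar file, not how any engine is implemented: it keeps the row count as metadata and records which columns a query actually reads.

```python
class ColumnStore:
    """Toy stand-in for a columnar file: columns are read one at a time, on demand."""

    def __init__(self, columns: dict[str, list]):
        self._columns = columns
        # Stored once as metadata, like the row count in a Parquet footer.
        self.num_rows = len(next(iter(columns.values())))
        self.columns_read: list[str] = []

    def read(self, name: str) -> list:
        self.columns_read.append(name)  # record every column actually touched
        return self._columns[name]


store = ColumnStore({
    "id": [0, 1, 2, 3],
    "country": ["GB", "AU", "GB", "AU"],
    "device": ["mobile", "desktop", "tablet", "mobile"],
    "revenue": [10.0, 20.0, 30.0, 40.0],
})

# count(): answered from metadata, no column is read.
row_count = store.num_rows

# sum(revenue) grouped by country: reads two of the four columns.
totals: dict[str, float] = {}
for country, revenue in zip(store.read("country"), store.read("revenue")):
    totals[country] = totals.get(country, 0.0) + revenue

print(row_count)           # 4
print(totals)              # {'GB': 40.0, 'AU': 60.0}
print(store.columns_read)  # ['country', 'revenue']
```

The id and device columns are never touched, which is the saving a row-oriented layout cannot offer.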

Predicate pushdown via row-group statistics. Each row group stores the min and max of every column. If a query filters on a column, the engine compares the filter against those ranges and skips any row group that cannot contain a match, without decompressing it. The per-row-group ranges for the time column in our demo file:

clickhouse local -q "
SELECT rg.num_rows AS rows,
       fromUnixTimestamp64Milli(toInt64(c.statistics.min)) AS event_time_min,
       fromUnixTimestamp64Milli(toInt64(c.statistics.max)) AS event_time_max
FROM file('events.parquet', ParquetMetadata)
ARRAY JOIN row_groups AS rg
ARRAY JOIN arrayFilter(x -> x.name = 'event_time', rg.columns) AS c
FORMAT Pretty"

┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃   rows ┃          event_time_min ┃          event_time_max ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 523272 │ 2026-01-01 00:00:00.000 │ 2026-12-30 09:11:00.000 │
│ 523272 │ 2026-12-30 09:12:00.000 │ 2027-12-28 18:23:00.000 │
│ 523272 │ 2027-12-28 18:24:00.000 │ 2028-12-26 03:35:00.000 │
│ 430184 │ 2028-12-26 03:36:00.000 │ 2029-10-20 22:19:00.000 │
└────────┴─────────────────────────┴─────────────────────────┘

The ranges don't overlap. A query for event_time >= '2028-01-01' can skip the first two row groups outright, reading roughly half the file. On real data with sorted or naturally-ordered columns, this is where Parquet earns its speed.
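
The skip decision itself is a simple range check. The Python sketch below replays it using the four min/max ranges from the table above; the function name groups_to_read is ours, and a real engine applies the same idea to many predicate shapes, not just a lower bound.

```python
from datetime import datetime

# (rows, event_time_min, event_time_max) per row group, copied from the output above.
ROW_GROUPS = [
    (523272, datetime(2026, 1, 1, 0, 0), datetime(2026, 12, 30, 9, 11)),
    (523272, datetime(2026, 12, 30, 9, 12), datetime(2027, 12, 28, 18, 23)),
    (523272, datetime(2027, 12, 28, 18, 24), datetime(2028, 12, 26, 3, 35)),
    (430184, datetime(2028, 12, 26, 3, 36), datetime(2029, 10, 20, 22, 19)),
]


def groups_to_read(row_groups, lower_bound):
    """Indexes of row groups that may contain event_time >= lower_bound."""
    # A group can be skipped only when even its max falls below the bound.
    return [i for i, (_, _, hi) in enumerate(row_groups) if hi >= lower_bound]


keep = groups_to_read(ROW_GROUPS, datetime(2028, 1, 1))
rows_scanned = sum(ROW_GROUPS[i][0] for i in keep)
total_rows = sum(rows for rows, _, _ in ROW_GROUPS)
print(keep)                            # [2, 3]
print(rows_scanned, "of", total_rows)  # 953456 of 2000000
```

Statistics only rule row groups out. The rows inside the groups that remain still have to be read and filtered, which is why the third group is scanned even though it starts before 2028.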

How it compares to CSV

CSV stores data row by row as plain text, with no schema, no types, no statistics and no compression. To answer a query an engine must parse the entire file and infer types as it goes. Parquet stores data by column, typed, compressed, with statistics, so an engine reads only what it needs. The practical upshot is smaller files and faster analytical queries. For a fuller treatment with numbers, see Parquet vs CSV.
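
A few lines of Python with the standard csv module show what "no types" means in practice. The sample data is made up for illustration.

```python
import csv
import io

CSV_TEXT = "country,revenue,quantity\nGB,12.50,2\nAU,80.00,1\nGB,3.25,4\n"

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
print(rows[0])  # {'country': 'GB', 'revenue': '12.50', 'quantity': '2'}

# Summing one column still means parsing every field of every row,
# then converting text to numbers by hand.
total_revenue = sum(float(row["revenue"]) for row in rows)
print(total_revenue)  # 95.75
```

Every value arrives as a string, and the reader had to walk past country and quantity to reach revenue. A Parquet reader would get typed revenue values and never touch the other two columns.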

Read one yourself

You don't need a server or an import step to work with Parquet. clickhouse local reads a Parquet file in place from the command line; it's a good single-binary tool for poking at files on your laptop. ClickHouse uses the same SQL whether you point it at a file on your laptop, a server, or ClickHouse Cloud, so a query you write against a local Parquet file scales up unchanged.

To query the file we inspected above:

clickhouse local -q "
SELECT country, count() AS purchases, round(sum(revenue)) AS revenue
FROM file('events.parquet')
WHERE event_type = 'purchase'
GROUP BY country
ORDER BY revenue DESC"

For the full walkthrough, covering globbing many files, projecting columns, and casting types, see how to query a Parquet file.

Run it yourself: the data generator and every command above live in local-analytics/what-is-parquet-file in the ClickHouse examples repo.

Prefer Python? → Read a Parquet file in Python.
