How to query a Parquet file from the command line

To query a Parquet file from the command line, use clickhouse local. It runs SQL directly on files, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then query the file directly:

clickhouse local -q "SELECT * FROM file('data.parquet') LIMIT 10"

   ┌──────────────event_time─┬─user_id─┬─country─┬─device──┬─event_type─┬─revenue─┬─quantity─┐
1. │ 2026-02-03 10:41:38.000 │  219394 │ IN      │ desktop │ refund     │  147.26 │        2 │
2. │ 2026-05-18 10:43:26.000 │  364848 │ IN      │ tablet  │ click      │  158.77 │        5 │
3. │ 2026-03-24 12:00:39.000 │  392522 │ GB      │ mobile  │ refund     │  274.44 │        1 │
   └─────────────────────────┴─────────┴─────────┴─────────┴────────────┴─────────┴──────────┘

The format is detected from the .parquet extension, and the schema is read from the file's embedded metadata, so there is no import step.
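When a file does not end in .parquet, extension-based detection has nothing to go on. A minimal sketch of naming the format explicitly, using the optional second argument of file(); the file name data.bin is a hypothetical stand-in for a Parquet file saved without its extension, and the guard only keeps the snippet harmless if that file is absent:

```shell
# Assumption: data.bin is a hypothetical Parquet file that lacks the .parquet extension.
# The second argument to file() names the input format explicitly.
QUERY="SELECT count() FROM file('data.bin', Parquet)"

# Run only when the clickhouse binary and the hypothetical file are both present.
if command -v clickhouse >/dev/null 2>&1 && [ -f data.bin ]; then
  clickhouse local -q "$QUERY"
fi
```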

Prefer Python? See How to read a Parquet file in Python (and query it with SQL) for the same queries against a pandas DataFrame.

See the schema without declaring one

Parquet files carry their own schema, so you never write CREATE TABLE. DESCRIBE prints the column names and the types ClickHouse inferred:

clickhouse local -q "DESCRIBE file('data.parquet') FORMAT PrettyCompact"

   ┌─name───────┬─type─────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
1. │ event_time │ DateTime64(3, 'UTC') │              │                    │         │                  │                │
2. │ user_id    │ UInt32               │              │                    │         │                  │                │
3. │ country    │ String               │              │                    │         │                  │                │
4. │ device     │ String               │              │                    │         │                  │                │
5. │ event_type │ String               │              │                    │         │                  │                │
6. │ revenue    │ Float64              │              │                    │         │                  │                │
7. │ quantity   │ UInt8                │              │                    │         │                  │                │
   └────────────┴──────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

(The extra columns are for CREATE TABLE metadata - defaults, codecs, TTLs - which a Parquet file doesn't carry, so they come back empty.)

Timestamps come back as DateTime64, integers keep their width, and floating-point values land as Float64, all read straight from the file's embedded metadata. (What is a Parquet file?)

Filter, aggregate, and group by

A file viewer lets you scroll through rows; a query engine lets you ask questions of them. Because the file is just a SQL source, you have the full ClickHouse dialect (WHERE, GROUP BY, aggregate functions, window functions, joins), not a fixed table UI:

clickhouse local -q "
SELECT country,
       count() AS purchases,
       round(sum(revenue), 2) AS revenue,
       round(avg(quantity), 3) AS avg_qty
FROM file('data.parquet')
WHERE event_type = 'purchase'
GROUP BY country
ORDER BY revenue DESC
LIMIT 5
FORMAT PrettyCompact"

   ┌─country─┬─purchases─┬───revenue─┬─avg_qty─┐
1. │ BR      │      2086 │ 526569.59 │   3.015 │
2. │ FR      │      2027 │ 512719.52 │   2.975 │
3. │ IN      │      2044 │ 508346.74 │    3.05 │
4. │ NL      │      2032 │ 505167.61 │   3.036 │
5. │ JP      │      1998 │ 502163.23 │   3.009 │
   └─────────┴───────────┴───────────┴─────────┘
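Window functions work against the file in the same way. As a rough sketch (not part of the example repository), the query below picks the single highest-revenue purchase in each country with row_number(); the column names come from the demo file, while the choice of question and the alias rn are illustrative assumptions:

```shell
# Sketch: highest-revenue purchase per country, using a window function.
# The inner query numbers purchases within each country by descending revenue;
# the outer query keeps only the top-ranked row per country.
QUERY="
SELECT country, device, revenue
FROM (
    SELECT country, device, revenue,
           row_number() OVER (PARTITION BY country ORDER BY revenue DESC) AS rn
    FROM file('data.parquet')
    WHERE event_type = 'purchase'
)
WHERE rn = 1
ORDER BY revenue DESC
LIMIT 5
FORMAT PrettyCompact"

# Run only when the clickhouse binary and the demo file are both present.
if command -v clickhouse >/dev/null 2>&1 && [ -f data.parquet ]; then
  clickhouse local -q "$QUERY"
fi
```

The subquery form is used here because it does not depend on any newer filtering syntax; only the window clause itself is needed.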

Parquet is columnar, so a query that touches three columns reads only those three off disk. The other columns are never decoded, which is why this scales (more on that below).

All of this comes in one binary that also reads CSV, JSON, ORC, Arrow and many other formats, talks to S3, MySQL, Postgres and more, and runs the same SQL unchanged when you move from a file to a server to the Cloud.

Read compressed Parquet transparently

Parquet supports per-column compression (gzip, zstd, snappy) inside the file. You do not pass a flag or decompress anything first. clickhouse local reads the codec from the file metadata:

clickhouse local -q "SELECT count(), round(sum(revenue), 2) FROM file('data.zstd.parquet') FORMAT PrettyCompact"

   ┌─count()─┬─round(sum(revenue), 2)─┐
1. │  100000 │            24957233.46 │
   └─────────┴────────────────────────┘

The same command works on a gzip-compressed Parquet file, or on a Parquet file wrapped in an outer .parquet.gz. The extension and the embedded metadata tell ClickHouse what to do.

How fast is it on a real file?

Small files are instant in anything. The difference shows up at scale. On a 977 MB Parquet file of 70 million rows (events_large.parquet, generated by the example folder below), the same filter-and-group-by query, now pointed at the large file, runs in:

clickhouse local --time -q "
SELECT country, count(), round(sum(revenue), 2), round(avg(quantity), 3)
FROM file('events_large.parquet')
WHERE event_type = 'purchase'
GROUP BY country ORDER BY 3 DESC
FORMAT Null"

0.439

0.44 seconds, best of three with the file warm in the OS page cache, measured on an Apple-silicon laptop (Apple M4 Pro, 14 cores, 24 GB RAM; clickhouse local 26.6.1.117). The query reads only the columns it needs and runs across all cores, so a near-gigabyte aggregation finishes before you can switch windows.

The same SQL scales unchanged

The query you just ran on a laptop file is the same SQL you would run on a ClickHouse server, or in ClickHouse Cloud. Nothing about SELECT ... WHERE ... GROUP BY changes. You swap file('data.parquet') for a table name (or an s3() function pointing at a bucket) and the rest stays put. There is no separate "local dialect" to unlearn: you prototype against a file on your machine and ship the identical logic to production.
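One way to picture the swap is to keep the query body fixed and change only the source expression. This is a sketch under stated assumptions: the bucket URL and the table name events are hypothetical placeholders, and the table variant would be run against a server that actually has such a table rather than with clickhouse local:

```shell
# The query body stays the same; only SOURCE changes.
SOURCE="file('data.parquet')"
# Hypothetical alternatives (placeholder bucket URL and table name, not from the article):
# SOURCE="s3('https://example-bucket.s3.amazonaws.com/events/data.parquet')"
# SOURCE="events"

QUERY="
SELECT country,
       count() AS purchases,
       round(sum(revenue), 2) AS revenue
FROM ${SOURCE}
WHERE event_type = 'purchase'
GROUP BY country
ORDER BY revenue DESC
LIMIT 5
FORMAT PrettyCompact"

# Run locally when the binary and the demo file are present; otherwise just print the query.
if command -v clickhouse >/dev/null 2>&1 && [ -f data.parquet ]; then
  clickhouse local -q "$QUERY"
else
  printf '%s\n' "$QUERY"
fi
```

Everything from WHERE onward is untouched by the change of source, which is the point the paragraph above makes.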

Run it yourself

The complete, runnable example lives here. It has generate.sh (builds the demo file, a zstd-compressed copy, and the ~1 GB perf file), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-parquet

git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-parquet
./generate.sh && ./run.sh
