How to query a compressed file (gzip, zstd) from the command line

To query a compressed file, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then point it at the compressed file exactly as you would the uncompressed version:

1clickhouse local -q "SELECT * FROM file('events.csv.gz') LIMIT 5"

1┌──────────event_time─┬─user_id─┬─country─┬─device──┬─event_type─┬─revenue─┬─quantity─┐
21. │ 2026-06-18 11:41:38 │       1 │ IN      │ mobile  │ view       │  497.89 │        2 │
32. │ 2026-12-24 10:43:26 │       2 │ CA      │ tablet  │ view       │  204.51 │        3 │
43. │ 2026-07-22 13:00:39 │       3 │ DE      │ mobile  │ purchase   │   64.19 │        5 │
54. │ 2026-08-13 01:02:57 │       4 │ FR      │ desktop │ view       │  103.71 │        2 │
65. │ 2026-01-20 00:59:20 │       5 │ JP      │ desktop │ purchase   │  345.78 │        3 │
7   └─────────────────────┴─────────┴─────────┴─────────┴────────────┴─────────┴──────────┘

There is no gunzip step and no decompression flag. A .gz, .zst, .lz4, .br, or .xz suffix is detected from the file name and the file is decompressed in place as the data streams in, with no import step first.

The schema is inferred straight through the gzip

Compression sits underneath the format. ClickHouse decompresses first, then reads the .csv (or .json, .parquet, ...) inside, so type inference works exactly as it does on a plain file. DESCRIBE proves it without a CREATE TABLE:

1clickhouse local -q "DESCRIBE file('events.csv.gz') FORMAT PrettyCompact"

1┌─name───────┬─type───────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
21. │ event_time │ Nullable(DateTime) │              │                    │         │                  │                │
32. │ user_id    │ Nullable(Int64)    │              │                    │         │                  │                │
43. │ country    │ Nullable(String)   │              │                    │         │                  │                │
54. │ device     │ Nullable(String)   │              │                    │         │                  │                │
65. │ event_type │ Nullable(String)   │              │                    │         │                  │                │
76. │ revenue    │ Nullable(Float64)  │              │                    │         │                  │                │
87. │ quantity   │ Nullable(Int64)    │              │                    │         │                  │                │
9   └────────────┴────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

The header gave the names, the data gave the types, and the gzip layer was invisible. (The extra empty columns are CREATE TABLE metadata that a CSV does not carry.)

Every codec works the same way

Swap the suffix and the command does not change. A zstd-compressed CSV reads identically:

1clickhouse local -q "
2SELECT event_type, count()
3FROM file('events.csv.zst')
4GROUP BY event_type
5ORDER BY event_type
6FORMAT PrettyCompact"

1┌─event_type─┬─count()─┐
21. │ click      │       6 │
32. │ purchase   │       5 │
43. │ refund     │       3 │
54. │ view       │       6 │
6   └────────────┴─────────┘

.lz4, .br (brotli), and .xz behave the same. The codec is chosen by the file extension, never by a command-line switch.

Compressed Parquet: the codec is inside the file

Parquet compresses each column block internally (zstd, snappy, gzip). That codec is part of the file's footer, so a zstd-compressed Parquet file is read with no hint at all. There is no .zst suffix involved; the .parquet extension is enough:

1clickhouse local -q "SELECT count(), round(sum(revenue), 2) FROM file('events.zstd.parquet') FORMAT PrettyCompact"

1┌─count()─┬─round(sum(revenue), 2)─┐
21. │      20 │                4838.95 │
3   └─────────┴────────────────────────┘

When you write Parquet yourself, pick the column codec with one setting: SELECT ... INTO OUTFILE 'out.parquet' FORMAT Parquet SETTINGS output_format_parquet_compression_method = 'zstd'. See what is a Parquet file for how the columnar layout and per-column compression fit together.

When the extension lies: force the codec

Auto-detection is the happy path, but it leans entirely on the file name. A gzipped CSV saved as events.bin has nothing for ClickHouse to go on, so inference fails:

1Code: 636. DB::Exception: The table structure cannot be extracted from a CSVWithNames format file.

Pass the compression method as the fourth argument to file(path, format, structure, compression) and it reads fine:

1clickhouse local -q "
2SELECT count() FROM file('events.bin', 'CSVWithNames',
3  'event_time DateTime, user_id Int64, country String, device String, event_type String, revenue Float64, quantity UInt8',
4  'gzip')"

The accepted values are gzip, zstd, lz4, brotli, xz, none, and auto (the default). This is the one case where you type the codec explicitly; everything else infers it.

Aggregate directly on the compressed file

Because the decompressed file is just a SQL source, the full dialect works on it (WHERE, GROUP BY, joins, window functions) with no import step:

1clickhouse local -q "
2SELECT country, count() AS events, round(sum(revenue), 2) AS revenue
3FROM file('events.csv.gz')
4WHERE event_type = 'purchase'
5GROUP BY country
6ORDER BY revenue DESC
7FORMAT PrettyCompact"

1┌─country─┬─events─┬─revenue─┐
21. │ AU      │      1 │  394.06 │
32. │ FR      │      1 │  393.05 │
43. │ JP      │      1 │  345.78 │
54. │ DE      │      1 │   64.19 │
65. │ IN      │      1 │   28.68 │
7   └─────────┴────────┴─────────┘

ClickHouse covers this with one binary that reads CSV, JSON, Parquet, ORC, Arrow and more, each transparently compressed, and runs the same SQL against a server or the Cloud when the data grows.

How fast is it on a compressed file?

Decompression costs CPU, but it is not the bottleneck. On a 3,000,000-row gzipped CSV (events_large.csv.gz, ~49 MiB on disk, ~173 MiB uncompressed), a filter-and-group-by across every row runs in:

1clickhouse local -q "
2SELECT country, count() AS events, round(sum(revenue),2) AS revenue, round(avg(quantity),3) AS avg_qty
3FROM file('events_large.csv.gz')
4WHERE event_type='purchase'
5GROUP BY country ORDER BY revenue DESC"

~0.55 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That decodes the gzip from scratch on every run and parses the CSV text underneath it; there is no cached table. Decompression overlaps with parsing and aggregation across cores, so it disappears into the work you were doing anyway.

The same SQL scales unchanged

The query you just ran on a compressed file on your laptop is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. Swap file('events.csv.gz') for a table name, or an s3() function pointing at compressed objects in a bucket, and nothing else changes. You prototype against a compressed file on disk and ship the identical logic to production.

Run it yourself

The complete, runnable example lives here. It has generate.sh (builds the gzipped CSV, the zstd CSV, the zstd Parquet, and the ~49 MiB perf file), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-compressed

1git clone https://github.com/ClickHouse/examples
2cd examples/local-analytics/clickhouse-local-compressed
3./generate.sh && ./run.sh

The schema is inferred straight through the gzip

Every codec works the same way

Compressed Parquet: the codec is inside the file

When the extension lies: force the codec

Aggregate directly on the compressed file

How fast is it on a compressed file?

The same SQL scales unchanged

Run it yourself

Subscribe to our newsletter

More like this

How to engineer cost-efficient open source observability with ClickHouse (ClickStack) - 2026 technical playbook

Build a dashboard in Python with ClickHouse and Streamlit

Instrumenting OpenAI with OpenTelemetry (OTel)

Real-time data visualization

How to query a compressed file (gzip, zstd) from the command line

The schema is inferred straight through the gzip #

Every codec works the same way #

Compressed Parquet: the codec is inside the file #

When the extension lies: force the codec #

Aggregate directly on the compressed file #

How fast is it on a compressed file? #

The same SQL scales unchanged #

Run it yourself #

Related #

Subscribe to our newsletter

More like this

How to engineer cost-efficient open source observability with ClickHouse (ClickStack) - 2026 technical playbook

Build a dashboard in Python with ClickHouse and Streamlit

Instrumenting OpenAI with OpenTelemetry (OTel)

Real-time data visualization

The schema is inferred straight through the gzip

Every codec works the same way

Compressed Parquet: the codec is inside the file

When the extension lies: force the codec

Aggregate directly on the compressed file

How fast is it on a compressed file?

The same SQL scales unchanged

Run it yourself

Related