How to query a compressed file (gzip, zstd) from the command line

Al Brown
Last updated: Jun 8, 2026

To query a compressed file, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then point it at the compressed file exactly as you would the uncompressed version:

1clickhouse local -q "SELECT * FROM file('events.csv.gz') LIMIT 5"
1   ┌──────────event_time─┬─user_id─┬─country─┬─device──┬─event_type─┬─revenue─┬─quantity─┐
21.2026-06-18 11:41:381IN      │ mobile  │ view497.89232.2026-12-24 10:43:262 │ CA      │ tablet  │ view204.51343.2026-07-22 13:00:393 │ DE      │ mobile  │ purchase   │   64.19554.2026-08-13 01:02:574 │ FR      │ desktop │ view103.71265.2026-01-20 00:59:205 │ JP      │ desktop │ purchase   │  345.7837   └─────────────────────┴─────────┴─────────┴─────────┴────────────┴─────────┴──────────┘

There is no gunzip step and no decompression flag. A .gz, .zst, .lz4, .br, or .xz suffix is detected from the file name and the file is decompressed in place as the data streams in, with no import step first.

The schema is inferred straight through the gzip #

Compression sits underneath the format. ClickHouse decompresses first, then reads the .csv (or .json, .parquet, ...) inside, so type inference works exactly as it does on a plain file. DESCRIBE proves it without a CREATE TABLE:

1clickhouse local -q "DESCRIBE file('events.csv.gz') FORMAT PrettyCompact"
1   ┌─name───────┬─type───────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
21. │ event_time │ Nullable(DateTime) │              │                    │         │                  │                │
32. │ user_id    │ Nullable(Int64)    │              │                    │         │                  │                │
43. │ country    │ Nullable(String)   │              │                    │         │                  │                │
54. │ device     │ Nullable(String)   │              │                    │         │                  │                │
65. │ event_type │ Nullable(String)   │              │                    │         │                  │                │
76. │ revenue    │ Nullable(Float64)  │              │                    │         │                  │                │
87. │ quantity   │ Nullable(Int64)    │              │                    │         │                  │                │
9   └────────────┴────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

The header gave the names, the data gave the types, and the gzip layer was invisible. (The extra empty columns are CREATE TABLE metadata that a CSV does not carry.)

Every codec works the same way #

Swap the suffix and the command does not change. A zstd-compressed CSV reads identically:

1clickhouse local -q "
2SELECT event_type, count()
3FROM file('events.csv.zst')
4GROUP BY event_type
5ORDER BY event_type
6FORMAT PrettyCompact"
1   ┌─event_type─┬─count()─┐
21. │ click      │       632. │ purchase   │       543. │ refund     │       354.view66   └────────────┴─────────┘

.lz4, .br (brotli), and .xz behave the same. The codec is chosen by the file extension, never by a command-line switch.

Compressed Parquet: the codec is inside the file #

Parquet compresses each column block internally (zstd, snappy, gzip). That codec is part of the file's footer, so a zstd-compressed Parquet file is read with no hint at all. There is no .zst suffix involved; the .parquet extension is enough:

1clickhouse local -q "SELECT count(), round(sum(revenue), 2) FROM file('events.zstd.parquet') FORMAT PrettyCompact"
1   ┌─count()─┬─round(sum(revenue), 2)─┐
21.204838.953   └─────────┴────────────────────────┘

When you write Parquet yourself, pick the column codec with one setting: SELECT ... INTO OUTFILE 'out.parquet' FORMAT Parquet SETTINGS output_format_parquet_compression_method = 'zstd'. See what is a Parquet file for how the columnar layout and per-column compression fit together.

When the extension lies: force the codec #

Auto-detection is the happy path, but it leans entirely on the file name. A gzipped CSV saved as events.bin has nothing for ClickHouse to go on, so inference fails:

1Code: 636. DB::Exception: The table structure cannot be extracted from a CSVWithNames format file.

Pass the compression method as the fourth argument to file(path, format, structure, compression) and it reads fine:

1clickhouse local -q "
2SELECT count() FROM file('events.bin', 'CSVWithNames',
3  'event_time DateTime, user_id Int64, country String, device String, event_type String, revenue Float64, quantity UInt8',
4  'gzip')"
120

The accepted values are gzip, zstd, lz4, brotli, xz, none, and auto (the default). This is the one case where you type the codec explicitly; everything else infers it.

Aggregate directly on the compressed file #

Because the decompressed file is just a SQL source, the full dialect works on it (WHERE, GROUP BY, joins, window functions) with no import step:

1clickhouse local -q "
2SELECT country, count() AS events, round(sum(revenue), 2) AS revenue
3FROM file('events.csv.gz')
4WHERE event_type = 'purchase'
5GROUP BY country
6ORDER BY revenue DESC
7FORMAT PrettyCompact"
1   ┌─country─┬─events─┬─revenue─┐
21. │ AU      │      1 │  394.06 │
32. │ FR      │      1 │  393.05 │
43. │ JP      │      1 │  345.78 │
54. │ DE      │      1 │   64.19 │
65. │ IN      │      1 │   28.68 │
7   └─────────┴────────┴─────────┘

ClickHouse covers this with one binary that reads CSV, JSON, Parquet, ORC, Arrow and more, each transparently compressed, and runs the same SQL against a server or the Cloud when the data grows.

How fast is it on a compressed file? #

Decompression costs CPU, but it is not the bottleneck. On a 3,000,000-row gzipped CSV (events_large.csv.gz, ~49 MiB on disk, ~173 MiB uncompressed), a filter-and-group-by across every row runs in:

1clickhouse local -q "
2SELECT country, count() AS events, round(sum(revenue),2) AS revenue, round(avg(quantity),3) AS avg_qty
3FROM file('events_large.csv.gz')
4WHERE event_type='purchase'
5GROUP BY country ORDER BY revenue DESC"

~0.55 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That decodes the gzip from scratch on every run and parses the CSV text underneath it; there is no cached table. Decompression overlaps with parsing and aggregation across cores, so it disappears into the work you were doing anyway.

The same SQL scales unchanged #

The query you just ran on a compressed file on your laptop is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. Swap file('events.csv.gz') for a table name, or an s3() function pointing at compressed objects in a bucket, and nothing else changes. You prototype against a compressed file on disk and ship the identical logic to production.

Run it yourself #

The complete, runnable example lives here. It has generate.sh (builds the gzipped CSV, the zstd CSV, the zstd Parquet, and the ~49 MiB perf file), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-compressed

1git clone https://github.com/ClickHouse/examples
2cd examples/local-analytics/clickhouse-local-compressed
3./generate.sh && ./run.sh
Share this resource

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...