Run SQL across multiple CSV or Parquet files

To run SQL across multiple files at once, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then pass a glob path to file() and query all the matching files as one table:

1clickhouse local -q "SELECT count() FROM file('sales/*.csv')"

The * matches every CSV in the directory and the files are read in place with no import step. ClickHouse infers a shared schema from the first file and applies it across all the rest, so an aggregate runs over every file in a single pass.

Aggregate across every file in one query

This is the point of a glob: the files behave as one table, so an aggregate runs over all of them together. Four monthly exports, one GROUP BY:

1clickhouse local -q "
2SELECT country, count() AS orders, round(sum(revenue),2) AS revenue
3FROM file('sales/*.csv')
4GROUP BY country
5ORDER BY revenue DESC"

1┌─country─┬─orders─┬─revenue─┐
21. │ US      │     36 │    1089 │
32. │ GB      │     36 │ 1052.64 │
43. │ AU      │     32 │ 1000.32 │
54. │ IN      │     32 │     968 │
65. │ FR      │     32 │  935.68 │
76. │ DE      │     32 │  903.36 │
8   └─────────┴────────┴─────────┘

The schema is inferred once and reused for every file, so the files do need matching columns. If a CSV carries a header, the CSVWithNames default reads it; see running SQL on a single CSV for the type-inference details that apply to each file here.

Attribute each row to its source file with `_file`

When you fan a query across files, you often want to break the result back out per file. ClickHouse exposes a virtual column called _file on every file() read. It holds the file name for each row, so you can GROUP BY it like any other column:

1clickhouse local -q "
2SELECT _file, count() AS rows, round(sum(revenue),2) AS revenue
3FROM file('sales/*.csv')
4GROUP BY _file
5ORDER BY _file"

1┌─_file───────┬─rows─┬─revenue─┐
21. │ 2026-01.csv │   50 │ 1487.25 │
32. │ 2026-02.csv │   50 │ 1487.25 │
43. │ 2026-03.csv │   50 │ 1487.25 │
54. │ 2026-04.csv │   50 │ 1487.25 │
6   └─────────────┴──────┴─────────┘

_file is the file name; its sibling _path is the full absolute path. Neither exists in the data on disk. ClickHouse synthesises them per row at read time, which is exactly what you need to spot the one bad export in a directory of a hundred, or to partition a result by source date encoded in the file name.

The same glob on Parquet

Nothing about the pattern is CSV-specific. Swap the extension and point at a directory of Parquet parts, the kind a writer emits as part-00.parquet, part-01.parquet, and so on:

1clickhouse local -q "
2SELECT _file, count() AS rows
3FROM file('events/*.parquet')
4GROUP BY _file
5ORDER BY _file"

1┌─_file───────────┬─rows─┐
21. │ part-00.parquet │   50 │
32. │ part-01.parquet │   50 │
43. │ part-02.parquet │   50 │
54. │ part-03.parquet │   50 │
6   └─────────────────┴──────┘

Because Parquet is columnar, a query that reads three columns across a hundred parts still only decodes those three columns from each file. See querying a single Parquet file for the column-pruning behaviour, and convert CSV to Parquet if you want to turn the directory of CSVs above into Parquet first.

clickhouse local runs the same glob syntax and the same SQL unchanged across CSV, Parquet, JSON, ORC and many other formats, and against a remote bucket with s3('.../*.parquet') when the files live in object storage.

Pick a subset with brace and range globs

* takes everything. When you want a slice, use brace expansion or a numeric range inside the path. Both are matched by ClickHouse, not the shell, so quote the path:

1clickhouse local -q "SELECT count() FROM file('sales/2026-{01,02}.csv')"   # two named months
2clickhouse local -q "SELECT count() FROM file('sales/2026-{01..03}.csv')"  # a range

1100
2150

{01,02} lists exact names; {01..03} expands a numeric range. You can also use ? for a single character and ** to recurse into subdirectories, so file('**/*.parquet') walks an entire tree.

How fast is it across a directory?

Speed holds up when the directory gets real. On 12 monthly CSVs totalling 3,000,000 rows (~123 MB, generated by the example folder below), a GROUP BY country with sum and avg over every row of every file runs in:

1clickhouse local -q "
2SELECT country, count() AS orders, round(sum(revenue),2) AS revenue, round(avg(quantity),3) AS avg_qty
3FROM file('sales_large/*.csv')
4GROUP BY country
5ORDER BY revenue DESC"

~0.22 seconds, best of three with the files warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That includes parsing the CSV text of all twelve files from scratch on every run, and the files are read in parallel across cores.

1┌─country─┬─orders─┬─────revenue─┬─avg_qty─┐
21. │ FR      │ 300902 │ 75167415.31 │   3.002 │
32. │ BR      │ 299994 │ 75103444.72 │   3.001 │
43. │ CA      │ 299920 │ 75006733.22 │   3.002 │
54. │ JP      │ 300045 │ 74978490.02 │   2.997 │
65. │ IN      │ 300285 │ 74955475.99 │   2.998 │
7   └─────────┴────────┴─────────────┴─────────┘

(The sample data uses unseeded random values, so the exact totals shift each time you regenerate; the shape stays the same.)

The same SQL scales unchanged

The query you ran across a folder of files is the same query you would run on a ClickHouse server or in ClickHouse Cloud. You swap file('sales/*.csv') for a table name or an s3('s3://bucket/sales/*.csv') function pointing at object storage, and the SELECT ... GROUP BY stays put. You prototype against a directory on your laptop and ship the identical logic to production, with no local dialect to unlearn.

Run it yourself

The complete, runnable example lives here. It has generate.sh (builds the CSV directory, the Parquet parts, and the 12-file perf set), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-multiple-files

1git clone https://github.com/ClickHouse/examples
2cd examples/local-analytics/clickhouse-local-multiple-files
3./generate.sh && ./run.sh

Prefer Python? Run the same glob queries in-process with chDB.

Aggregate across every file in one query

Attribute each row to its source file with `_file`

The same glob on Parquet

Pick a subset with brace and range globs

How fast is it across a directory?

The same SQL scales unchanged

Run it yourself

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

Run SQL across multiple CSV or Parquet files

Aggregate across every file in one query #

Attribute each row to its source file with _file #

The same glob on Parquet #

Pick a subset with brace and range globs #

How fast is it across a directory? #

The same SQL scales unchanged #

Run it yourself #

Related #

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

Aggregate across every file in one query

Attribute each row to its source file with `_file`

The same glob on Parquet

Pick a subset with brace and range globs

How fast is it across a directory?

The same SQL scales unchanged

Run it yourself

Related