Run SQL across multiple CSV or Parquet files

Al Brown
Last updated: Jun 15, 2026

To run SQL across multiple files at once, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then pass a glob path to file() and query all the matching files as one table:

clickhouse local -q "SELECT count() FROM file('sales/*.csv')"
200

The * matches every CSV in the directory and the files are read in place with no import step. ClickHouse infers a shared schema from the first file and applies it across all the rest, so an aggregate runs over every file in a single pass.

Aggregate across every file in one query

This is the point of a glob: the files behave as one table, so an aggregate runs over all of them together. Four monthly exports, one GROUP BY:

clickhouse local -q "
SELECT country, count() AS orders, round(sum(revenue),2) AS revenue
FROM file('sales/*.csv')
GROUP BY country
ORDER BY revenue DESC"
┌─country─┬─orders─┬─revenue─┐
1. │ US      │     36 │    1089 │
2. │ GB      │     36 │ 1052.64 │
3. │ AU      │     32 │ 1000.32 │
4. │ IN      │     32 │     968 │
5. │ FR      │     32 │  935.68 │
6. │ DE      │     32 │  903.36 │
   └─────────┴────────┴─────────┘

The schema is inferred once and reused for every file, so the files do need matching columns. If a CSV carries a header, the CSVWithNames default reads it; see running SQL on a single CSV for the type-inference details that apply to each file here.

Attribute each row to its source file with _file

When you fan a query across files, you often want to break the result back out per file. ClickHouse exposes a virtual column called _file on every file() read. It holds the file name for each row, so you can GROUP BY it like any other column:

clickhouse local -q "
SELECT _file, count() AS rows, round(sum(revenue),2) AS revenue
FROM file('sales/*.csv')
GROUP BY _file
ORDER BY _file"
┌─_file───────┬─rows─┬─revenue─┐
1. │ 2026-01.csv │   50 │ 1487.25 │
2. │ 2026-02.csv │   50 │ 1487.25 │
3. │ 2026-03.csv │   50 │ 1487.25 │
4. │ 2026-04.csv │   50 │ 1487.25 │
   └─────────────┴──────┴─────────┘

_file is the file name; its sibling _path is the full absolute path. Neither exists in the data on disk. ClickHouse synthesises them per row at read time, which is exactly what you need to spot the one bad export in a directory of a hundred, or to partition a result by source date encoded in the file name.

The same glob on Parquet

Nothing about the pattern is CSV-specific. Swap the extension and point at a directory of Parquet parts, the kind a writer emits as part-00.parquet, part-01.parquet, and so on:

clickhouse local -q "
SELECT _file, count() AS rows
FROM file('events/*.parquet')
GROUP BY _file
ORDER BY _file"
┌─_file───────────┬─rows─┐
1. │ part-00.parquet │   50 │
2. │ part-01.parquet │   50 │
3. │ part-02.parquet │   50 │
4. │ part-03.parquet │   50 │
   └─────────────────┴──────┘

Because Parquet is columnar, a query that reads three columns across a hundred parts still only decodes those three columns from each file. See querying a single Parquet file for the column-pruning behaviour, and convert CSV to Parquet if you want to turn the directory of CSVs above into Parquet first.

clickhouse local runs the same glob syntax and the same SQL unchanged across CSV, Parquet, JSON, ORC and many other formats, and against a remote bucket with s3('.../*.parquet') when the files live in object storage.

Pick a subset with brace and range globs

* takes everything. When you want a slice, use brace expansion or a numeric range inside the path. Both are matched by ClickHouse, not the shell, so quote the path:

clickhouse local -q "SELECT count() FROM file('sales/2026-{01,02}.csv')"   # two named months
clickhouse local -q "SELECT count() FROM file('sales/2026-{01..03}.csv')"  # a range
100
150

{01,02} lists exact names; {01..03} expands a numeric range. You can also use ? for a single character and ** to recurse into subdirectories, so file('**/*.parquet') walks an entire tree.

How fast is it across a directory?

Speed holds up when the directory gets real. On 12 monthly CSVs totalling 3,000,000 rows (~123 MB, generated by the example folder below), a GROUP BY country with sum and avg over every row of every file runs in:

clickhouse local -q "
SELECT country, count() AS orders, round(sum(revenue),2) AS revenue, round(avg(quantity),3) AS avg_qty
FROM file('sales_large/*.csv')
GROUP BY country
ORDER BY revenue DESC"

~0.22 seconds, best of three with the files warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That includes parsing the CSV text of all twelve files from scratch on every run, and the files are read in parallel across cores.

┌─country─┬─orders─┬─────revenue─┬─avg_qty─┐
1. │ FR      │ 300902 │ 75167415.31 │   3.002 │
2. │ BR      │ 299994 │ 75103444.72 │   3.001 │
3. │ CA      │ 299920 │ 75006733.22 │   3.002 │
4. │ JP      │ 300045 │ 74978490.02 │   2.997 │
5. │ IN      │ 300285 │ 74955475.99 │   2.998 │
   └─────────┴────────┴─────────────┴─────────┘

(The sample data uses unseeded random values, so the exact totals shift each time you regenerate; the shape stays the same.)

The same SQL scales unchanged

The query you ran across a folder of files is the same query you would run on a ClickHouse server or in ClickHouse Cloud. You swap file('sales/*.csv') for a table name or an s3('s3://bucket/sales/*.csv') function pointing at object storage, and the SELECT ... GROUP BY stays put. You prototype against a directory on your laptop and ship the identical logic to production, with no local dialect to unlearn.

Run it yourself

The complete, runnable example lives here. It has generate.sh (builds the CSV directory, the Parquet parts, and the 12-file perf set), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-multiple-files

git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-multiple-files
./generate.sh && ./run.sh

Share this resource

  • Y Combinator icon
  • X icon
  • Bluesky icon
  • Facebook icon
  • LinkedIn icon

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!

More like this

Scale vector search in Postgres with pgvector: avoid HNSW RAM limits, fix filtering recall, and know when to go hybrid. Read now.

Continue reading ->

How to query a REST API in Python

Al Brown • Last updated: Jun 15, 2026

Read a JSON API response into a DataFrame with chDB. Use the pandas API you already know to filter and aggregate the response, running on ClickHouse's engine with no server to start.

Continue reading ->

How to convert Parquet to ORC

Al Brown • Last updated: Jun 6, 2026

Convert a Parquet file to ORC with one clickhouse-local command. The schema is read from the Parquet footer and the types carry into ORC, with no server and no upload.

Continue reading ->