To run SQL across multiple files at once, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.
Install it with clickhousectl:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
Then pass a glob path to file() and query all the matching files as one table:
clickhouse local -q "SELECT count() FROM file('sales/*.csv')"
The * matches every CSV in the directory and the files are read in place with no import step. ClickHouse infers a shared schema from the first file and applies it across all the rest, so an aggregate runs over every file in a single pass.
This is the point of a glob: the files behave as one table, so an aggregate runs over all of them together. Four monthly exports, one GROUP BY:
clickhouse local -q "
SELECT country, count() AS orders, round(sum(revenue),2) AS revenue
FROM file('sales/*.csv')
GROUP BY country
ORDER BY revenue DESC"
┌─country─┬─orders─┬─revenue─┐
1. │ US │ 36 │ 1089 │
2. │ GB │ 36 │ 1052.64 │
3. │ AU │ 32 │ 1000.32 │
4. │ IN │ 32 │ 968 │
5. │ FR │ 32 │ 935.68 │
6. │ DE │ 32 │ 903.36 │
└─────────┴────────┴─────────┘
The schema is inferred once and reused for every file, so the files do need matching columns. If a CSV carries a header, the CSVWithNames default reads it; see running SQL on a single CSV for the type-inference details that apply to each file here.
When you fan a query across files, you often want to break the result back out per file. ClickHouse exposes a virtual column called _file on every file() read. It holds the file name for each row, so you can GROUP BY it like any other column:
clickhouse local -q "
SELECT _file, count() AS rows, round(sum(revenue),2) AS revenue
FROM file('sales/*.csv')
GROUP BY _file
ORDER BY _file"
┌─_file───────┬─rows─┬─revenue─┐
1. │ 2026-01.csv │ 50 │ 1487.25 │
2. │ 2026-02.csv │ 50 │ 1487.25 │
3. │ 2026-03.csv │ 50 │ 1487.25 │
4. │ 2026-04.csv │ 50 │ 1487.25 │
└─────────────┴──────┴─────────┘
_file is the file name; its sibling _path is the full absolute path. Neither exists in the data on disk. ClickHouse synthesises them per row at read time, which is exactly what you need to spot the one bad export in a directory of a hundred, or to partition a result by source date encoded in the file name.
Nothing about the pattern is CSV-specific. Swap the extension and point at a directory of Parquet parts, the kind a writer emits as part-00.parquet, part-01.parquet, and so on:
clickhouse local -q "
SELECT _file, count() AS rows
FROM file('events/*.parquet')
GROUP BY _file
ORDER BY _file"
┌─_file───────────┬─rows─┐
1. │ part-00.parquet │ 50 │
2. │ part-01.parquet │ 50 │
3. │ part-02.parquet │ 50 │
4. │ part-03.parquet │ 50 │
└─────────────────┴──────┘
Because Parquet is columnar, a query that reads three columns across a hundred parts still only decodes those three columns from each file. See querying a single Parquet file for the column-pruning behaviour, and convert CSV to Parquet if you want to turn the directory of CSVs above into Parquet first.
clickhouse local runs the same glob syntax and the same SQL unchanged across CSV, Parquet, JSON, ORC and many other formats, and against a remote bucket with s3('.../*.parquet') when the files live in object storage.
Pick a subset with brace and range globs
* takes everything. When you want a slice, use brace expansion or a numeric range inside the path. Both are matched by ClickHouse, not the shell, so quote the path:
clickhouse local -q "SELECT count() FROM file('sales/2026-{01,02}.csv')" # two named months
clickhouse local -q "SELECT count() FROM file('sales/2026-{01..03}.csv')" # a range
{01,02} lists exact names; {01..03} expands a numeric range. You can also use ? for a single character and ** to recurse into subdirectories, so file('**/*.parquet') walks an entire tree.
Speed holds up when the directory gets real. On 12 monthly CSVs totalling 3,000,000 rows (~123 MB, generated by the example folder below), a GROUP BY country with sum and avg over every row of every file runs in:
clickhouse local -q "
SELECT country, count() AS orders, round(sum(revenue),2) AS revenue, round(avg(quantity),3) AS avg_qty
FROM file('sales_large/*.csv')
GROUP BY country
ORDER BY revenue DESC"
~0.22 seconds, best of three with the files warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That includes parsing the CSV text of all twelve files from scratch on every run, and the files are read in parallel across cores.
┌─country─┬─orders─┬─────revenue─┬─avg_qty─┐
1. │ FR │ 300902 │ 75167415.31 │ 3.002 │
2. │ BR │ 299994 │ 75103444.72 │ 3.001 │
3. │ CA │ 299920 │ 75006733.22 │ 3.002 │
4. │ JP │ 300045 │ 74978490.02 │ 2.997 │
5. │ IN │ 300285 │ 74955475.99 │ 2.998 │
└─────────┴────────┴─────────────┴─────────┘
(The sample data uses unseeded random values, so the exact totals shift each time you regenerate; the shape stays the same.)
The query you ran across a folder of files is the same query you would run on a ClickHouse server or in ClickHouse Cloud. You swap file('sales/*.csv') for a table name or an s3('s3://bucket/sales/*.csv') function pointing at object storage, and the SELECT ... GROUP BY stays put. You prototype against a directory on your laptop and ship the identical logic to production, with no local dialect to unlearn.
The complete, runnable example lives here. It has generate.sh (builds the CSV directory, the Parquet parts, and the 12-file perf set), run.sh (every command above), and expected_output.txt:
github.com/ClickHouse/examples → local-analytics/clickhouse-local-multiple-files
git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-multiple-files
./generate.sh && ./run.sh
Prefer Python? Run the same glob queries in-process with chDB.