How to run SQL on a CSV file

Al Brown
Last updated: Jun 8, 2026

To run SQL on a CSV file, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then query the file directly:

1clickhouse local -q "SELECT * FROM file('orders.csv') LIMIT 10"
12026-01-01	1	GB	widget	5	1
22026-01-02	2	US	gadget	6.01	2
32026-01-03	3	DE	gizmo	7.02	3
42026-01-04	4	FR	doohickey	8.03	4
52026-01-05	5	IN	widget	9.04	5

file('orders.csv') reads the header row for column names and infers the types from the data, so the file is queried in place with no import step first.

Prefer Python? See How to read a CSV file in Python for the same queries against a pandas DataFrame.

Header and type detection #

By default file('orders.csv') uses the CSVWithNames format: the first row is treated as column names, and each column's type is inferred from the values. Check what was detected with DESCRIBE:

1clickhouse local -q "DESCRIBE file('orders.csv')"
1order_date	Nullable(Date)
2order_id	Nullable(Int64)
3country	Nullable(String)
4product	Nullable(String)
5revenue	Nullable(Float64)
6quantity	Nullable(Int64)

The header gave the names; the data gave the types. Columns are Nullable because inference allows for empty values.

You don't usually need to override this, but you can when inference guesses wrong, for example when a ZIP code or an order ID should stay a string rather than become an integer. Pass the format and an explicit schema as the second and third arguments to file():

1clickhouse local -q "
2SELECT * FROM file('orders.csv', 'CSVWithNames',
3  'order_date Date, order_id UInt32, country String, product String, revenue Float64, quantity UInt8')
4ORDER BY revenue DESC LIMIT 3"
12026-01-20	20	US	doohickey	24.19	5
22026-01-19	19	GB	gizmo	23.18	4
32026-01-18	18	AU	gadget	22.17	3

If your CSV has no header row, use the CSV format instead of CSVWithNames and supply the schema yourself (or refer to columns positionally as c1, c2, ...).

Aggregate directly on the CSV #

Because the file is a table, the full SQL surface works on it (WHERE, GROUP BY, JOIN, window functions) with no load step in between:

1clickhouse local -q "
2SELECT
3  country,
4  count()              AS orders,
5  round(sum(revenue), 2) AS revenue,
6  round(avg(quantity), 2) AS avg_qty
7FROM file('orders.csv')
8GROUP BY country
9ORDER BY revenue DESC"
1US	4	60.4	3.5
2GB	4	56.36	2.5
3AU	3	48.33	2
4IN	3	45.3	2.67
5FR	3	42.27	3.33
6DE	3	39.24	4

This is the everyday workflow: open a terminal, run one query against the file, read the answer. The exact same SQL runs unchanged against dozens of file formats and remote sources, and against a ClickHouse server or ClickHouse Cloud when the data outgrows your laptop, with no rewrite.

Convert CSV to Parquet in one line #

CSV is convenient but slow to scan repeatedly: every query re-parses text and re-infers types. If you'll query the data more than once, convert it to Parquet once and read that instead. It's a single command: SELECT from the CSV, INTO OUTFILE as Parquet.

1clickhouse local -q "SELECT * FROM file('orders.csv') INTO OUTFILE 'orders.parquet' TRUNCATE FORMAT Parquet"

The result is a real Parquet file with the columns and types ClickHouse inferred. Read it back with the same file() call, or query it directly. See how to query a Parquet file and the dedicated convert CSV to Parquet guide for the column-typing and compression options.

Gzipped CSVs work too #

You don't need to unzip anything first. clickhouse-local detects the .gz extension and decompresses on the fly, so a .csv.gz file is queried exactly like a .csv:

1clickhouse local -q "SELECT country, count() FROM file('orders.csv.gz') GROUP BY country ORDER BY country"
1AU	3
2DE	3
3FR	3
4GB	4
5IN	3
6US	4

The same applies to .csv.zst, .csv.xz, and other supported codecs. The compression is inferred from the file name.

How fast is it? #

Fast enough that the import step you skipped was the slow part. On an 8,000,000-row, ~338 MB CSV (orders_large.csv), a full GROUP BY country with sum and avg over every row completes in:

1clickhouse local -q "
2SELECT country, count() AS orders, round(sum(revenue), 2) AS revenue, round(avg(quantity), 3) AS avg_qty
3FROM file('orders_large.csv')
4GROUP BY country
5ORDER BY revenue DESC"

about 0.53 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). That includes parsing the CSV text from scratch on every run; there's no cached table. Convert the same data to Parquet and the scan gets faster again, because Parquet is columnar and typed.

Run it yourself #

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample CSVs (including the ~338 MB file used for the timing above), run.sh with every command on this page, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/clickhouse-local-csv

Share this resource

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...