How to run SQL on a CSV file

Al Brown
Last updated: Jun 8, 2026

To run SQL on a CSV file, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then query the file directly:

clickhouse local -q "SELECT * FROM file('orders.csv') LIMIT 10"
2026-01-01	1	GB	widget	5	1
2026-01-02	2	US	gadget	6.01	2
2026-01-03	3	DE	gizmo	7.02	3
2026-01-04	4	FR	doohickey	8.03	4
2026-01-05	5	IN	widget	9.04	5

file('orders.csv') reads the header row for column names and infers the types from the data, so the file is queried in place with no import step first.

Prefer Python? See How to read a CSV file in Python for the same queries against a pandas DataFrame.

Header and type detection

By default file('orders.csv') uses the CSVWithNames format: the first row is treated as column names, and each column's type is inferred from the values. Check what was detected with DESCRIBE:

clickhouse local -q "DESCRIBE file('orders.csv')"
order_date	Nullable(Date)
order_id	Nullable(Int64)
country	Nullable(String)
product	Nullable(String)
revenue	Nullable(Float64)
quantity	Nullable(Int64)

The header gave the names; the data gave the types. Columns are Nullable because inference allows for empty values.

You don't usually need to override this, but you can when inference guesses wrong, for example when a ZIP code or an order ID should stay a string rather than become an integer. Pass the format and an explicit schema as the second and third arguments to file():

clickhouse local -q "
SELECT * FROM file('orders.csv', 'CSVWithNames',
  'order_date Date, order_id UInt32, country String, product String, revenue Float64, quantity UInt8')
ORDER BY revenue DESC LIMIT 3"
2026-01-20	20	US	doohickey	24.19	5
2026-01-19	19	GB	gizmo	23.18	4
2026-01-18	18	AU	gadget	22.17	3

If your CSV has no header row, use the CSV format instead of CSVWithNames and supply the schema yourself (or refer to columns positionally as c1, c2, ...).

Aggregate directly on the CSV

Because the file is a table, the full SQL surface works on it (WHERE, GROUP BY, JOIN, window functions) with no load step in between:

clickhouse local -q "
SELECT
  country,
  count()              AS orders,
  round(sum(revenue), 2) AS revenue,
  round(avg(quantity), 2) AS avg_qty
FROM file('orders.csv')
GROUP BY country
ORDER BY revenue DESC"
US	4	60.4	3.5
GB	4	56.36	2.5
AU	3	48.33	2
IN	3	45.3	2.67
FR	3	42.27	3.33
DE	3	39.24	4

This is the everyday workflow: open a terminal, run one query against the file, read the answer. The exact same SQL runs unchanged against dozens of file formats and remote sources, and against a ClickHouse server or ClickHouse Cloud when the data outgrows your laptop, with no rewrite.

Convert CSV to Parquet in one line

CSV is convenient but slow to scan repeatedly: every query re-parses text and re-infers types. If you'll query the data more than once, convert it to Parquet once and read that instead. It's a single command: SELECT from the CSV, INTO OUTFILE as Parquet.

clickhouse local -q "SELECT * FROM file('orders.csv') INTO OUTFILE 'orders.parquet' TRUNCATE FORMAT Parquet"

The result is a real Parquet file with the columns and types ClickHouse inferred. Read it back with the same file() call, or query it directly. See how to query a Parquet file and the dedicated convert CSV to Parquet guide for the column-typing and compression options.

Gzipped CSVs work too

You don't need to unzip anything first. clickhouse-local detects the .gz extension and decompresses on the fly, so a .csv.gz file is queried exactly like a .csv:

clickhouse local -q "SELECT country, count() FROM file('orders.csv.gz') GROUP BY country ORDER BY country"
AU	3
DE	3
FR	3
GB	4
IN	3
US	4

The same applies to .csv.zst, .csv.xz, and other supported codecs. The compression is inferred from the file name.

How fast is it?

Fast enough that the import step you skipped was the slow part. On an 8,000,000-row, ~338 MB CSV (orders_large.csv), a full GROUP BY country with sum and avg over every row completes in:

clickhouse local -q "
SELECT country, count() AS orders, round(sum(revenue), 2) AS revenue, round(avg(quantity), 3) AS avg_qty
FROM file('orders_large.csv')
GROUP BY country
ORDER BY revenue DESC"

about 0.53 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). That includes parsing the CSV text from scratch on every run; there's no cached table. Convert the same data to Parquet and the scan gets faster again, because Parquet is columnar and typed.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample CSVs (including the ~338 MB file used for the timing above), run.sh with every command on this page, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/clickhouse-local-csv


Share this resource

  • Y Combinator icon
  • X icon
  • Bluesky icon
  • Facebook icon
  • LinkedIn icon

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!

More like this

Scale vector search in Postgres with pgvector: avoid HNSW RAM limits, fix filtering recall, and know when to go hybrid. Read now.

Continue reading ->

How to query a REST API in Python

Al Brown • Last updated: Jun 15, 2026

Read a JSON API response into a DataFrame with chDB. Use the pandas API you already know to filter and aggregate the response, running on ClickHouse's engine with no server to start.

Continue reading ->

How to convert Parquet to ORC

Al Brown • Last updated: Jun 6, 2026

Convert a Parquet file to ORC with one clickhouse-local command. The schema is read from the Parquet footer and the types carry into ORC, with no server and no upload.

Continue reading ->