How to convert CSV to Arrow

To convert a CSV file to Arrow, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then write the CSV to Arrow with a single query:

clickhouse local -q "SELECT * FROM file('events.csv') INTO OUTFILE 'events.arrow' TRUNCATE FORMAT Arrow"

2026-01-01	1	GB	click	5	1
2026-01-02	2	US	view	6.01	2
2026-01-03	3	DE	purchase	7.02	3
2026-01-04	4	FR	signup	8.03	4
2026-01-05	5	IN	click	9.04	5

The CSV is read in place with no import step. ClickHouse infers each column's type from the data and bakes that schema into the Arrow IPC file, so every downstream reader gets typed columns without re-parsing text.

Why Arrow

Arrow is a typed, columnar in-memory format. The on-disk Arrow IPC file (the format the Arrow project also calls Feather) carries the schema in its footer, so the types inferred from the CSV travel with the data. Anything that speaks Arrow (pandas, polars, pyarrow, Spark) can load it, often zero-copy, without re-parsing text or re-guessing types. That is the win over leaving data in CSV: a CSV is just text, so every reader re-parses and re-infers it from scratch.
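As a minimal sketch of the "a CSV is just text" point, the snippet below parses two rows with Python's standard csv module. The header row and comma layout are assumed for illustration (the column names are the ones DESCRIBE reports later on this page), and the values are taken from the sample output above. It shows only that a plain CSV reader hands back strings, so each consumer has to redo the typing itself.

```python
import csv
import io

# Two rows in the shape of events.csv. The header and comma layout are an
# assumption for illustration; the values match the sample output on this page.
CSV_TEXT = """event_date,event_id,country,action,amount,sessions
2026-01-01,1,GB,click,5,1
2026-01-02,2,US,view,6.01,2
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, as csv.DictReader returns them."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(CSV_TEXT)
first = rows[0]

# Every field comes back as str: the file itself carries no type information.
for name, value in first.items():
    print(f"{name}: {value!r} ({type(value).__name__})")
```

Dates, integers, and floats all arrive as str here. An Arrow file avoids this step because the reader takes the column types from the stored schema.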

The conversion, end to end

SELECT everything from the CSV and send it INTO OUTFILE with the Arrow format. Add TRUNCATE so the command is safe to re-run.

clickhouse local -q "SELECT * FROM file('events.csv') INTO OUTFILE 'events.arrow' TRUNCATE FORMAT Arrow"

Read the first rows straight back to confirm it worked:

clickhouse local -q "SELECT * FROM file('events.arrow') LIMIT 5"

2026-01-01	1	GB	click	5	1
2026-01-02	2	US	view	6.01	2
2026-01-03	3	DE	purchase	7.02	3
2026-01-04	4	FR	signup	8.03	4
2026-01-05	5	IN	click	9.04	5

Types are inferred from the CSV and embedded in the Arrow file

A CSV has no type information; clickhouse-local infers it. An Arrow file does have type information, and that schema is written into the file. Compare the two with DESCRIBE:

clickhouse local -q "DESCRIBE file('events.csv')"
clickhouse local -q "DESCRIBE file('events.arrow')"

-- CSV (text, types inferred each read):
event_date	Nullable(Date)
event_id	Nullable(Int64)
country	Nullable(String)
action	Nullable(String)
amount	Nullable(Float64)
sessions	Nullable(Int64)
-- Arrow (types stored in the file's schema):
event_date	Nullable(Date32)
event_id	Nullable(Int64)
country	Nullable(String)
action	Nullable(String)
amount	Nullable(Float64)
sessions	Nullable(Int64)

Note the one cast: Date becomes Date32 in the Arrow file. Arrow's date type is a 32-bit day count, which maps to ClickHouse Date32 on readback. The dates are identical; only the physical width changed. If you need a column to keep an exact type across the conversion, pass an explicit schema to file() as its third argument rather than relying on inference:

clickhouse local -q "
SELECT * FROM file('events.csv', 'CSVWithNames',
  'event_date Date, event_id UInt32, country String, action String, amount Float64, sessions UInt8')
INTO OUTFILE 'events.arrow' TRUNCATE FORMAT Arrow"
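To make the Date versus Date32 note concrete, here is a small standard-library sketch of the "day count" idea. It assumes only what the note states plus the documented widths: ClickHouse Date is an unsigned 16-bit count of days since 1970-01-01, and Arrow's date32 is a signed 32-bit count of the same days. The byte order used with struct is purely illustrative.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def days_since_epoch(d):
    """Day count since 1970-01-01, the value both date types store."""
    return (d - EPOCH).days


n = days_since_epoch(date(2026, 1, 1))

# ClickHouse Date: unsigned 16-bit day count (2 bytes).
as_date = struct.pack("<H", n)
# Arrow date32 / ClickHouse Date32: signed 32-bit day count (4 bytes).
as_date32 = struct.pack("<i", n)

# Same number, different physical width.
same_day = struct.unpack("<H", as_date)[0] == struct.unpack("<i", as_date32)[0]
print(n, len(as_date), len(as_date32), same_day)

# The 16-bit form cannot count past day 65535.
last_16_bit_day = EPOCH + timedelta(days=0xFFFF)
print(last_16_bit_day)
```

The stored number is the same in both cases, which is why the dates read back identically. The wider type only extends the range of days that can be represented.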

Options worth knowing

This is where a one-line command beats a drag-and-drop converter site. You stay in control of the output, and nothing leaves your machine.

Feather is Arrow. The Arrow project's .feather files are the Arrow IPC format. Write one by using the same FORMAT Arrow with a .feather name:

clickhouse local -q "SELECT * FROM file('events.csv') INTO OUTFILE 'events.feather' TRUNCATE FORMAT Arrow"

Compress the buffers. Arrow supports per-buffer compression. Turn on zstd (or lz4_frame) to shrink the file with no change to how it reads:

clickhouse local -q "
SELECT * FROM file('events.csv')
INTO OUTFILE 'events.arrow' TRUNCATE FORMAT Arrow
SETTINGS output_format_arrow_compression_method='zstd'"

Streaming variant. FORMAT Arrow writes the random-access IPC file (with a footer). For piping into another process that consumes a stream, use FORMAT ArrowStream instead.
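One way to tell the two variants apart on disk is to check the framing. The sketch below relies on the Arrow IPC file layout, in which the magic string ARROW1 appears at the very start and again at the very end after the footer, while a stream has no such framing. The helper name is_arrow_ipc_file is hypothetical, and the demo bytes are synthetic placeholders rather than real Arrow data, so this checks framing only and does not validate the contents.

```python
import os
import tempfile

ARROW_MAGIC = b"ARROW1"


def is_arrow_ipc_file(path):
    """True if the file starts and ends with the Arrow IPC file magic string."""
    if os.path.getsize(path) < 2 * len(ARROW_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(ARROW_MAGIC))
        f.seek(-len(ARROW_MAGIC), os.SEEK_END)
        tail = f.read(len(ARROW_MAGIC))
    return head == ARROW_MAGIC and tail == ARROW_MAGIC


# Demo on synthetic bytes (placeholders, not real Arrow data).
with tempfile.TemporaryDirectory() as tmp:
    file_like = os.path.join(tmp, "file_like.arrow")
    stream_like = os.path.join(tmp, "stream_like.arrows")
    with open(file_like, "wb") as f:
        f.write(ARROW_MAGIC + b"\x00\x00" + b"\x00" * 16 + ARROW_MAGIC)
    with open(stream_like, "wb") as f:
        # A stream begins with an encapsulated message (here a 0xFFFFFFFF
        # continuation marker), not with the magic string.
        f.write(b"\xff\xff\xff\xff" + b"\x00" * 16)
    file_result = is_arrow_ipc_file(file_like)
    stream_result = is_arrow_ipc_file(stream_like)

print(file_result, stream_result)
```

The trailing magic string and footer are what make random access possible. A consumer reading from a pipe cannot seek to the end, which is why the stream variant exists.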

Filter or reshape on the way out. The SELECT is a full query, so you can project columns, cast types, filter rows, or pre-aggregate before writing the Arrow file, not just copy the CSV verbatim.
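As a rough illustration of reshaping on the way out, the sketch below assembles a projected and filtered version of the conversion query as a string. build_convert_query is a hypothetical helper, not part of ClickHouse or chDB, and the output name purchases.arrow is made up. The column and action names come from the sample data on this page. The helper does no escaping, so it is suitable only for trusted inputs.

```python
def build_convert_query(src, dst, columns="*", where=None, fmt="Arrow"):
    """Assemble a SELECT ... INTO OUTFILE statement as a string.

    Hypothetical helper for illustration: it performs no escaping, so pass
    only trusted values.
    """
    parts = [f"SELECT {columns} FROM file('{src}')"]
    if where:
        parts.append(f"WHERE {where}")
    parts.append(f"INTO OUTFILE '{dst}' TRUNCATE FORMAT {fmt}")
    return " ".join(parts)


# Keep three columns and only the purchase rows.
query = build_convert_query(
    "events.csv",
    "purchases.arrow",
    columns="event_date, country, amount",
    where="action = 'purchase'",
)
print(query)
```

The resulting string can be passed to clickhouse local -q, or to chdb.query in a Python pipeline.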

Query the Arrow file directly

Once converted, the typed columnar file is queryable as a table, with no re-parsing of text:

clickhouse local -q "
SELECT country, count() AS events, round(sum(amount), 2) AS amount
FROM file('events.arrow')
GROUP BY country
ORDER BY amount DESC
LIMIT 5"

US	4	60.4
GB	4	56.36
AU	3	48.33
IN	3	45.3
FR	3	42.27

How fast is the conversion?

On a 3,000,000-row, ~129 MB CSV (events_large.csv), converting the whole file to Arrow (parse the CSV text, infer types, write the columnar Arrow IPC file) completes in:

clickhouse local -q "SELECT * FROM file('events_large.csv') INTO OUTFILE 'events_large.arrow' TRUNCATE FORMAT Arrow"

~0.38 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). The Arrow output lands at ~70 MB versus the ~129 MB CSV, because the columnar typed encoding is more compact than text. The conversion streams, so memory stays flat and the input can be larger than RAM.
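To reproduce a "best of three" measurement on your own machine, something like the following would do. This is a sketch rather than the harness behind the number above: best_of is a hypothetical helper, it measures wall-clock time including process start-up, and the demo uses a stand-in command so that it runs without ClickHouse installed.

```python
import subprocess
import sys
import time


def best_of(cmd, runs=3):
    """Run cmd `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in command so the sketch runs anywhere. To time the conversion, use:
# ["clickhouse", "local", "-q",
#  "SELECT * FROM file('events_large.csv') "
#  "INTO OUTFILE 'events_large.arrow' TRUNCATE FORMAT Arrow"]
cmd = [sys.executable, "-c", "pass"]
elapsed = best_of(cmd)
print(f"best of 3: {elapsed:.3f} s")
```

Taking the minimum rather than the mean filters out one-off noise such as a cold page cache on the first run.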

Reverse direction?

Going the other way, Arrow back to CSV, is the same idea with the formats swapped. See convert Arrow to CSV.

The same conversion in Python (chDB)

chDB is ClickHouse as an in-process Python library. The SQL is identical, so the conversion is one chdb.query call, handy when the file is already part of a Python pipeline:

import chdb

# Same SELECT ... INTO OUTFILE ... FORMAT Arrow, in-process.
chdb.query("SELECT * FROM file('events.csv') INTO OUTFILE 'events.arrow' TRUNCATE FORMAT Arrow")

# Read it straight back as a DataFrame.
df = chdb.query("SELECT * FROM file('events.arrow') LIMIT 5", "DataFrame")
print(df)

If you would rather skip the file entirely and hand an Arrow table to pandas or pyarrow in memory, see reading Arrow files in Python.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample CSVs (including the ~129 MB file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-csv-to-arrow

The same SELECT ... INTO OUTFILE runs unchanged against dozens of formats and remote sources, and against a ClickHouse server or ClickHouse Cloud when the data outgrows your laptop. If you query the data far more than you convert it, Parquet is the better at-rest target; Arrow shines when the next step is an in-memory analytics library. Start from running SQL on a CSV file for the basics.
