How to convert CSV to Arrow

Al Brown
Last updated: Jun 15, 2026

To convert a CSV file to Arrow, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then write the CSV to Arrow with a single query:

clickhouse local -q "SELECT * FROM file('events.csv') INTO OUTFILE 'events.arrow' TRUNCATE FORMAT Arrow"
2026-01-01	1	GB	click	5	1
2026-01-02	2	US	view	6.01	2
2026-01-03	3	DE	purchase	7.02	3
2026-01-04	4	FR	signup	8.03	4
2026-01-05	5	IN	click	9.04	5

The CSV is read in place with no import step. ClickHouse infers each column's type from the data and bakes that schema into the Arrow IPC file, so every downstream reader gets typed columns without re-parsing text.

Why Arrow

Arrow is a typed, columnar in-memory format. The on-disk Arrow IPC file (the format the Arrow project also calls Feather) carries the schema in its footer, so the types you inferred from the CSV travel with the data. Anything that speaks Arrow (pandas, polars, pyarrow, Spark) can read it directly, in many cases zero-copy, without re-parsing text or re-guessing types. That is the win over leaving data in CSV: a CSV is just text, so every reader re-parses and re-infers it from scratch.
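To make the "re-infers it from scratch" point concrete, here is a minimal sketch using only Python's standard library. It is not ClickHouse's inference algorithm: the infer_type helper is hypothetical and tries just four types. The two data rows are taken from the sample output above.

```python
import csv
import io
from datetime import date

# Two rows of the sample events.csv, inlined so the sketch is self-contained.
SAMPLE = (
    "event_date,event_id,country,action,amount,sessions\n"
    "2026-01-01,1,GB,click,5,1\n"
    "2026-01-02,2,US,view,6.01,2\n"
)

def infer_type(values):
    """Return the first of Int64, Float64, Date that parses every value, else String."""
    parsers = (("Int64", int), ("Float64", float), ("Date", date.fromisoformat))
    for name, parse in parsers:
        try:
            for value in values:
                parse(value)
            return name
        except ValueError:
            continue
    return "String"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# The csv module hands back every field as text; nothing is typed yet.
assert all(isinstance(v, str) for v in rows[0].values())

# A reader has to repeat this guessing step on every load of a CSV.
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)
```

With an Arrow file this step disappears, because the reader takes the column types from the file's schema instead of guessing them.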

The conversion, end to end

SELECT everything from the CSV, send it INTO OUTFILE with the Arrow format. Add TRUNCATE so the command is safe to re-run.

clickhouse local -q "SELECT * FROM file('events.csv') INTO OUTFILE 'events.arrow' TRUNCATE FORMAT Arrow"

Read the first rows straight back to confirm it worked:

clickhouse local -q "SELECT * FROM file('events.arrow') LIMIT 5"
2026-01-01	1	GB	click	5	1
2026-01-02	2	US	view	6.01	2
2026-01-03	3	DE	purchase	7.02	3
2026-01-04	4	FR	signup	8.03	4
2026-01-05	5	IN	click	9.04	5

Types are inferred from the CSV and embedded in the Arrow file

A CSV has no type information, so clickhouse local infers it on every read. An Arrow file does carry type information: the schema is written into the file. Compare the two with DESCRIBE:

clickhouse local -q "DESCRIBE file('events.csv')"
clickhouse local -q "DESCRIBE file('events.arrow')"
-- CSV (text, types inferred each read):
event_date	Nullable(Date)
event_id	Nullable(Int64)
country	Nullable(String)
action	Nullable(String)
amount	Nullable(Float64)
sessions	Nullable(Int64)
-- Arrow (types stored in the file's schema):
event_date	Nullable(Date32)
event_id	Nullable(Int64)
country	Nullable(String)
action	Nullable(String)
amount	Nullable(Float64)
sessions	Nullable(Int64)

Note the one cast: Date becomes Date32 in the Arrow file. Arrow's date type is a 32-bit day count, which maps to ClickHouse Date32 on readback. The dates are identical; only the physical width changed. If you need a column to keep an exact type across the conversion, pass an explicit schema to file() as its third argument rather than relying on inference:

clickhouse local -q "
SELECT * FROM file('events.csv', 'CSVWithNames',
  'event_date Date, event_id UInt32, country String, action String, amount Float64, sessions UInt8')
INTO OUTFILE 'events.arrow' TRUNCATE FORMAT Arrow"
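The Date to Date32 change described above is lossless because both types count whole days from 1970-01-01; only the integer width differs (ClickHouse Date uses an unsigned 16-bit count, Arrow's date type a 32-bit one). A minimal sketch with Python's standard library, which involves neither ClickHouse nor Arrow and only illustrates the arithmetic, computes the day count for the first sample row and checks that it fits either width:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def days_since_epoch(d: date) -> int:
    """Whole days between the Unix epoch and d, the value a day-count date type stores."""
    return (d - EPOCH).days

n = days_since_epoch(date(2026, 1, 1))
print(n)                           # 20454

# The same integer fits both physical widths, so no information is lost.
print(0 <= n < 2**16)              # True: within an unsigned 16-bit day count
print(-2**31 <= n < 2**31)         # True: within a signed 32-bit day count

# Decoding the day count gives back the original calendar date.
print(EPOCH + timedelta(days=n))   # 2026-01-01
```

The wider type only matters for dates the 16-bit count cannot represent, such as anything before 1970.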

Options worth knowing

This is where a one-line command beats a drag-and-drop converter site. You stay in control of the output, and nothing leaves your machine.

Feather is Arrow. The Arrow project's .feather files are the Arrow IPC format. Write one by using the same FORMAT Arrow with a .feather name:

clickhouse local -q "SELECT * FROM file('events.csv') INTO OUTFILE 'events.feather' TRUNCATE FORMAT Arrow"

Compress the buffers. Arrow supports per-buffer compression. Set the method to zstd (or lz4_frame) to shrink the file with no change to how it reads:

clickhouse local -q "
SELECT * FROM file('events.csv')
INTO OUTFILE 'events.arrow' TRUNCATE FORMAT Arrow
SETTINGS output_format_arrow_compression_method='zstd'"

Streaming variant. FORMAT Arrow writes the random-access IPC file (with a footer). For piping into another process that consumes a stream, use FORMAT ArrowStream instead.

Filter or reshape on the way out. The SELECT is a full query, so you can project columns, cast types, filter rows, or pre-aggregate before writing the Arrow file, not just copy the CSV verbatim.
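As a sketch of reshaping on the way out, the query below projects four columns, casts event_id with toUInt32, and keeps only purchase rows before writing. The output name purchases.arrow and the convert helper are hypothetical; the column names come from the sample events.csv, and the query is wrapped in Python so it can run through chDB (an assumption that chdb is installed).

```python
# Project, cast, and filter in the same statement that writes the Arrow file.
# 'purchases.arrow' is a made-up output name for this illustration.
QUERY = """
SELECT
    event_date,
    toUInt32(event_id) AS event_id,
    country,
    amount
FROM file('events.csv')
WHERE action = 'purchase'
INTO OUTFILE 'purchases.arrow' TRUNCATE FORMAT Arrow
"""

def convert(query: str = QUERY) -> None:
    """Run the conversion in-process. Requires the third-party chdb package."""
    import chdb  # imported lazily so this module loads even without chdb installed
    chdb.query(query)

# convert()  # uncomment to write purchases.arrow next to events.csv
```

The same query text works unchanged with clickhouse local -q; only the way it is launched differs.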

Query the Arrow file directly

Once converted, the typed columnar file is queryable as a table, with no re-parsing of text:

clickhouse local -q "
SELECT country, count() AS events, round(sum(amount), 2) AS amount
FROM file('events.arrow')
GROUP BY country
ORDER BY amount DESC
LIMIT 5"
US	4	60.4
GB	4	56.36
AU	3	48.33
IN	3	45.3
FR	3	42.27

How fast is the conversion?

On a 3,000,000-row, ~129 MB CSV (events_large.csv), converting the whole file to Arrow (parse the CSV text, infer types, write the columnar Arrow IPC file) completes in:

clickhouse local -q "SELECT * FROM file('events_large.csv') INTO OUTFILE 'events_large.arrow' TRUNCATE FORMAT Arrow"

~0.38 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). The Arrow output lands at ~70 MB versus the ~129 MB CSV, because the columnar typed encoding is more compact than text. The conversion streams, so memory stays flat and the input can be larger than RAM.
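To reproduce a best-of-three measurement on your own machine, a small harness along these lines may help. It is a sketch, not the script used for the figure above: best_of and convert are hypothetical helpers, the timing includes process startup, and it assumes clickhouse is on your PATH and events_large.csv is in the working directory. The last lines work out the throughput implied by the quoted numbers.

```python
import subprocess
import time

# The conversion command from above, as an argument list for subprocess.
CONVERT = [
    "clickhouse", "local", "-q",
    "SELECT * FROM file('events_large.csv') "
    "INTO OUTFILE 'events_large.arrow' TRUNCATE FORMAT Arrow",
]

def best_of(fn, runs=3):
    """Return the fastest wall-clock time, in seconds, over `runs` calls to fn."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def convert():
    """Run one conversion; raises if clickhouse exits with an error."""
    subprocess.run(CONVERT, check=True)

# seconds = best_of(convert)  # uncomment once clickhouse and the CSV are in place

# Throughput implied by the quoted figures: 3,000,000 rows, ~129 MB, ~0.38 s.
rows, csv_mb, seconds = 3_000_000, 129, 0.38
rows_per_s = rows / seconds
mb_per_s = csv_mb / seconds
print(f"{rows_per_s / 1e6:.1f}M rows/s, {mb_per_s:.0f} MB/s")  # 7.9M rows/s, 339 MB/s
```

Repeated runs after the first read the CSV from the OS page cache, which matches the warm-cache condition stated above.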

Reverse direction?

Going the other way, Arrow back to CSV, is the same idea with the formats swapped. See convert Arrow to CSV.

The same conversion in Python (chDB)

chDB is ClickHouse as an in-process Python library. The SQL is identical, so the conversion is one chdb.query call, handy when the file is already part of a Python pipeline:

import chdb

# Same SELECT ... INTO OUTFILE ... FORMAT Arrow, in-process.
chdb.query("SELECT * FROM file('events.csv') INTO OUTFILE 'events.arrow' TRUNCATE FORMAT Arrow")

# Read it straight back as a DataFrame.
df = chdb.query("SELECT * FROM file('events.arrow') LIMIT 5", "DataFrame")
print(df)

If you would rather skip the file entirely and hand an Arrow table to pandas or pyarrow in memory, see reading Arrow files in Python.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample CSVs (including the ~129 MB file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-csv-to-arrow

The same SELECT ... INTO OUTFILE runs unchanged against dozens of formats and remote sources, and against a ClickHouse server or ClickHouse Cloud when the data outgrows your laptop. If you query the data far more than you convert it, Parquet is the better at-rest target; Arrow shines when the next step is an in-memory analytics library. Start from running SQL on a CSV file for the basics.
