How to convert Parquet to Arrow

Al Brown
Last updated: Jun 15, 2026

To convert a Parquet file to Arrow, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

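Then convert the file with a single query:

clickhouse local -q "SELECT * FROM file('events.parquet') INTO OUTFILE 'events.arrow' FORMAT Arrow"

Read the converted Arrow file back to confirm the round-trip: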

clickhouse local -q "SELECT * FROM file('events.arrow') LIMIT 5"

2026-01-01	1	GB	click	5	1
2026-01-02	2	US	view	6.01	2
2026-01-03	3	DE	signup	7.02	3
2026-01-04	4	FR	purchase	8.03	4
2026-01-05	5	IN	click	9.04	5

The Parquet file is read in place with no import step: clickhouse local reads the schema from the Parquet footer and writes Arrow IPC bytes directly to disk, so a file larger than RAM converts without being loaded into memory all at once.
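
If the conversion has to run from a script rather than a shell prompt, the same INTO OUTFILE query can be wrapped in Python. This is a minimal sketch, assuming the clickhouse binary is on your PATH; sql_quote, build_convert_query, and convert are hypothetical helper names for illustration, not part of ClickHouse.

```python
import shutil
import subprocess


def sql_quote(value: str) -> str:
    """Return value as a single-quoted SQL string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_convert_query(src: str, dst: str, fmt: str = "Arrow") -> str:
    """Build the SELECT ... INTO OUTFILE query used on this page."""
    return (
        f"SELECT * FROM file({sql_quote(src)}) "
        f"INTO OUTFILE {sql_quote(dst)} TRUNCATE FORMAT {fmt}"
    )


def convert(src: str, dst: str, fmt: str = "Arrow") -> None:
    """Run the conversion; assumes the clickhouse binary is on the PATH."""
    if shutil.which("clickhouse") is None:
        raise RuntimeError("clickhouse not found on PATH")
    subprocess.run(
        ["clickhouse", "local", "-q", build_convert_query(src, dst, fmt)],
        check=True,  # raise if clickhouse exits with a non-zero status
    )


if __name__ == "__main__":
    # Print the query instead of running it, so the sketch works anywhere.
    print(build_convert_query("events.parquet", "events.arrow"))
```

Calling convert("events.parquet", "events.arrow") runs the query. TRUNCATE overwrites an existing output file, which keeps repeated runs from failing.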

Types carry across unchanged

Parquet already stores a typed schema, so there is nothing to infer from raw text and nothing to guess. The types ClickHouse reads from the Parquet footer are the types it writes into the Arrow file. Compare the two with DESCRIBE:

clickhouse local -q "DESCRIBE file('events.parquet')"
clickhouse local -q "DESCRIBE file('events.arrow')"

event_date	Date32
event_id	UInt64
country	String
action	String
amount	Float64
sessions	UInt8

Identical on both sides. Note the columns are not Nullable: Parquet recorded them as required, and Arrow keeps that. (A column that was nullable in the Parquet schema stays nullable in Arrow.) This is the difference from converting out of CSV, where everything starts as text and types are inferred. Parquet to Arrow is lossless for the common numeric, string, date, and timestamp types because both formats descend from the same columnar type system.
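
The side-by-side comparison can also be automated. This is a rough sketch, assuming the two DESCRIBE outputs have already been captured as tab-separated text (for example with subprocess); parse_describe and schema_diff are hypothetical helpers, and the sample text is the DESCRIBE output shown above.

```python
# DESCRIBE output from this page: one "name<TAB>type" pair per line.
SAMPLE = """\
event_date\tDate32
event_id\tUInt64
country\tString
action\tString
amount\tFloat64
sessions\tUInt8
"""


def parse_describe(tsv: str) -> dict[str, str]:
    """Map column name -> type, using the first two tab-separated fields."""
    schema = {}
    for line in tsv.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        schema[fields[0]] = fields[1]
    return schema


def schema_diff(left: dict[str, str], right: dict[str, str]) -> dict:
    """Return {column: (left_type, right_type)} for every mismatch."""
    diff = {}
    for name in left.keys() | right.keys():
        if left.get(name) != right.get(name):
            diff[name] = (left.get(name), right.get(name))
    return diff


if __name__ == "__main__":
    parquet_schema = parse_describe(SAMPLE)
    arrow_schema = parse_describe(SAMPLE)  # stand-in for the Arrow output
    print(schema_diff(parquet_schema, arrow_schema))
```

An empty dictionary means the schemas agree. A column that became Nullable, or that exists on only one side, shows up as a mismatch.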

Read the Arrow file straight back to confirm the values round-tripped:

clickhouse local -q "SELECT * FROM file('events.arrow') LIMIT 5"

2026-01-01	1	GB	click	5	1
2026-01-02	2	US	view	6.01	2
2026-01-03	3	DE	signup	7.02	3
2026-01-04	4	FR	purchase	8.03	4
2026-01-05	5	IN	click	9.04	5

Options worth knowing

These are the options that an upload-a-file converter site can't give you, because here you control the writer:

  • .feather is the same format. Feather (v2) is the Arrow IPC file format. If a tool wants a .feather file, write the exact same bytes and name them .feather:

    clickhouse local -q "SELECT * FROM file('events.parquet') INTO OUTFILE 'events.feather' FORMAT Arrow"
    
  • Arrow IPC file vs stream. FORMAT Arrow writes the random-access file format (with a footer). For a pipe or socket, FORMAT ArrowStream writes the streaming variant instead. Reading the stream back needs the format named explicitly: file('events.arrows', 'ArrowStream').

  • Arrow output is uncompressed by default. That is deliberate. Arrow is built for fast in-memory handoff, not for storage. On the 3,000,000-row sample, the Arrow file is larger than its Parquet source (68 MB vs 23 MB) because Parquet compresses on disk and uncompressed Arrow doesn't. If you want a smaller Arrow file, turn on block compression:

    clickhouse local -q "
    SELECT * FROM file('events_large.parquet')
    INTO OUTFILE 'events_large.arrow'
    FORMAT Arrow
    SETTINGS output_format_arrow_compression_method='zstd'"
    

    That takes the file from 68 MB to 39 MB and still reads as a normal Arrow file. lz4_frame is the other supported codec when you want decode speed over ratio.

If you do want long-term, well-compressed storage rather than an in-memory exchange buffer, keep the data as Parquet — see what is a Parquet file.
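
Because the file extension says little about the bytes inside (.arrow and .feather can hold the same content), it can help to check what a file actually is. The sketch below is a heuristic, not a validator: it relies on the magic bytes each format defines (PAR1 at both ends of a Parquet file, ARROW1 at both ends of an Arrow IPC file, and the 0xFFFFFFFF continuation marker that opens an Arrow IPC stream). The names classify and sniff_format are hypothetical.

```python
import os

PARQUET_MAGIC = b"PAR1"
ARROW_FILE_MAGIC = b"ARROW1"
ARROW_STREAM_MARKER = b"\xff\xff\xff\xff"


def classify(head: bytes, tail: bytes) -> str:
    """Guess the format from the first and last few bytes of a file."""
    if head.startswith(ARROW_FILE_MAGIC) and tail.endswith(ARROW_FILE_MAGIC):
        return "arrow-file"
    if head.startswith(PARQUET_MAGIC) and tail.endswith(PARQUET_MAGIC):
        return "parquet"
    if head.startswith(ARROW_STREAM_MARKER):
        return "arrow-stream"
    return "unknown"


def sniff_format(path: str) -> str:
    """Read up to 8 bytes from each end of the file and classify them."""
    with open(path, "rb") as f:
        head = f.read(8)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(size - 8, 0))
        tail = f.read(8)
    return classify(head, tail)
```

A file written with FORMAT Arrow should classify as arrow-file whatever its extension, and one written with FORMAT ArrowStream as arrow-stream.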

Why convert at all

Parquet is the right format on disk: compressed, columnar, great for repeated scans. Arrow is the right format in memory: it's the zero-copy layout that pandas, pyarrow, and Polars read without re-parsing. Converting Parquet to Arrow is how you hand a dataset to an in-process analytics tool with no deserialization tax. You'd reach for the Arrow file when the consumer is a Python or R process that maps Arrow directly, and keep Parquet when the consumer reads from disk or object storage.

Verify nothing was dropped

A conversion is only done when the row count and values match. Check both files agree:

clickhouse local -q "
SELECT
  (SELECT count() FROM file('events_large.parquet')) AS parquet_rows,
  (SELECT count() FROM file('events_large.arrow'))   AS arrow_rows"

3000000	3000000
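
The query above compares row counts only. One way to also compare values is an order-independent checksum per file. The sketch below builds such a query and parses its one-line result. It assumes sum(cityHash64(*)) is acceptable as a checksum and that the file paths contain no single quotes; verification_query and matches are hypothetical helper names.

```python
def verification_query(src: str, dst: str) -> str:
    """Build a query returning row count and checksum for both files."""

    def stats(path: str, prefix: str) -> str:
        # Assumption: summing per-row hashes gives a checksum that does
        # not depend on row order.
        return (
            f"(SELECT count() FROM file('{path}')) AS {prefix}_rows, "
            f"(SELECT sum(cityHash64(*)) FROM file('{path}')) AS {prefix}_hash"
        )

    return f"SELECT {stats(src, 'src')}, {stats(dst, 'dst')}"


def matches(tsv_line: str) -> bool:
    """Check one result line: src_rows, src_hash, dst_rows, dst_hash."""
    src_rows, src_hash, dst_rows, dst_hash = tsv_line.strip().split("\t")
    return src_rows == dst_rows and src_hash == dst_hash


if __name__ == "__main__":
    print(verification_query("events_large.parquet", "events_large.arrow"))
```

Pass the printed query to clickhouse local -q and feed the single output line to matches.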

How fast is it?

On the 3,000,000-row events_large.parquet (~23 MB Parquet in, ~68 MB Arrow out), the full conversion runs in:

clickhouse local -q "SELECT * FROM file('events_large.parquet') INTO OUTFILE 'events_large.arrow' TRUNCATE FORMAT Arrow"

~0.30 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). Because both sides are columnar, the work is mostly reading column chunks and re-laying them out as Arrow record batches; there is no row-by-row text parsing.
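
To reproduce a best-of-three measurement on your own machine, a small harness is enough. This is a sketch under assumptions, not the script used for the number above: it assumes clickhouse is on the PATH and events_large.parquet is in the current directory, and best_of and run_conversion are hypothetical names.

```python
import subprocess
import time

QUERY = (
    "SELECT * FROM file('events_large.parquet') "
    "INTO OUTFILE 'events_large.arrow' TRUNCATE FORMAT Arrow"
)


def best_of(run, repeats: int = 3) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return min(timings)


def run_conversion() -> None:
    """Run the conversion once; raises if clickhouse exits with an error."""
    subprocess.run(["clickhouse", "local", "-q", QUERY], check=True)


if __name__ == "__main__":
    print(f"best of 3: {best_of(run_conversion):.2f} s")
```

Taking the minimum rather than the mean means the first, cold-cache run does not skew the result. The figure includes process startup time.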

Reverse direction?

Going the other way is the same one line with the formats swapped. See how to convert Arrow to Parquet.

In Python with chDB

If you're already in a Python process, chDB runs the identical SQL in-process, with no separate binary. Use the same SELECT ... FORMAT Arrow and write the bytes to a file:

import chdb

arrow_bytes = chdb.query("SELECT * FROM file('events.parquet') FORMAT Arrow").bytes()
with open("events.arrow", "wb") as f:
    f.write(arrow_bytes)

# Or skip the file entirely and get a DataFrame, since Arrow is the handoff:
df = chdb.query("SELECT * FROM file('events.parquet')", "DataFrame")
print(df.shape)

(20, 6)

For reading Arrow back into a DataFrame, see how to read an Arrow file in Python with chDB.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample Parquet files (including the 3M-row file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB path, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-parquet-to-arrow

The same SQL that converts a local file scales unchanged to a ClickHouse server or ClickHouse Cloud when the data outgrows your laptop.
