How to convert Arrow to Parquet

Al Brown
Last updated: Jun 15, 2026

To convert Arrow to Parquet, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then read the Arrow file and write Parquet out:

clickhouse local -q "SELECT * FROM file('events.arrow') INTO OUTFILE 'events.parquet' FORMAT Parquet"
20

The Arrow file is read in place with no import step. ClickHouse carries the schema embedded in the Arrow file straight into Parquet, so types are preserved exactly and the row count comes back as confirmation.

The one thing that trips people up: file vs stream

Arrow IPC ships in two framings, and they are not interchangeable. The file format (what people often call Feather, and what ClickHouse writes with FORMAT Arrow) begins and ends with the magic bytes ARROW1 and carries a footer. The streaming format (FORMAT ArrowStream, and what pyarrow's ipc.new_stream writes) has neither. Both commonly use the .arrow extension, so you can't tell them apart by name.

clickhouse-local defaults to the file framing. Hand it a stream-framed .arrow and it fails fast:

-- this FAILS: a stream file is not an Arrow 'file' --
Code: 636. DB::Exception: The table structure cannot be extracted from a Arrow format file. Error:
Code: 1002. DB::Exception: Error while opening a table: Invalid: Not an Arrow file. (UNKNOWN_EXCEPTION)

The fix is to name the framing explicitly as the second argument to file():

clickhouse local -q "SELECT * FROM file('events_stream.arrow', 'ArrowStream') INTO OUTFILE 'events.parquet' FORMAT Parquet"

If a conversion errors with "Not an Arrow file", you have a stream, not a file. Switch to ArrowStream and it works.
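
The framing can also be checked before running the conversion. An Arrow IPC file begins and ends with the six bytes ARROW1, while a stream does not, so peeking at both ends of the file is enough to pick the format name. The sketch below is a minimal illustration, not part of ClickHouse: the helper names clickhouse_arrow_format and conversion_sql are made up here, and anything without the magic bytes is simply assumed to be a stream.

```python
import os

ARROW_MAGIC = b"ARROW1"  # marks both ends of an Arrow IPC *file*


def clickhouse_arrow_format(path):
    """Guess the ClickHouse format name for an Arrow IPC file on disk.

    Returns "Arrow" when the magic bytes are present at both ends,
    otherwise "ArrowStream". Assumption: a non-file input is a stream;
    this does not validate that the bytes are Arrow data at all.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(len(ARROW_MAGIC))
        tail = b""
        if size >= 2 * len(ARROW_MAGIC):
            f.seek(-len(ARROW_MAGIC), os.SEEK_END)
            tail = f.read(len(ARROW_MAGIC))
    if head == ARROW_MAGIC and tail == ARROW_MAGIC:
        return "Arrow"
    return "ArrowStream"


def conversion_sql(src, dst):
    """Build the conversion query with the framing named explicitly."""
    fmt = clickhouse_arrow_format(src)
    # Paths are interpolated as-is; quote-escaping is left out of this sketch.
    return (
        f"SELECT * FROM file('{src}', '{fmt}') "
        f"INTO OUTFILE '{dst}' FORMAT Parquet"
    )
```

Passing the result of conversion_sql('events_stream.arrow', 'events.parquet') to clickhouse local -q should then behave like the explicit ArrowStream command above, without having to hit the error first.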

Types carry across, no re-inference

Arrow is typed and columnar, so there's no header-sniffing or text re-parsing the way there is with CSV. The schema is embedded in the file, and ClickHouse carries those types straight into Parquet. Check both ends with DESCRIBE:

clickhouse local -q "DESCRIBE file('events.arrow')"
clickhouse local -q "DESCRIBE file('events.parquet')"
event_date	Date32
event_id	UInt32
event_type	String
country	String
amount	Float64
is_member	UInt8

Identical on both sides. No Nullable wrapper gets bolted on, integers stay integers, dates stay dates. One detail worth knowing: ClickHouse maps a DateTime to a 32-bit epoch integer when it writes Arrow, so if your Arrow file came from ClickHouse a timestamp may already be an integer column. Date columns round-trip cleanly as shown above.
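
With more than a handful of columns, comparing the two DESCRIBE outputs by eye gets tedious. One way to automate it is to parse the tab-separated output of each command and report the columns whose types differ. This is a sketch under the assumption that the first two tab-separated fields of each line are the column name and type, as in the output above; parse_describe and schema_diff are illustrative names, not ClickHouse functions.

```python
def parse_describe(tsv_text):
    """Turn tab-separated DESCRIBE output into a list of (name, type) pairs."""
    columns = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        col_type = fields[1] if len(fields) > 1 else ""
        columns.append((fields[0], col_type))
    return columns


def schema_diff(source_text, target_text):
    """Return (column, source_type, target_type) for every mismatch.

    A column missing on one side shows up with None for that side.
    An empty list means both schemas agree.
    """
    source = dict(parse_describe(source_text))
    target = dict(parse_describe(target_text))
    names = list(source) + [n for n in target if n not in source]
    return [
        (name, source.get(name), target.get(name))
        for name in names
        if source.get(name) != target.get(name)
    ]
```

The two input strings can be the captured stdout of the DESCRIBE commands, for example via subprocess.run([...], capture_output=True, text=True).stdout. For the files above, an empty result is what "identical on both sides" looks like.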

Inspect the Parquet you just wrote

You don't have to trust the conversion blind. The ParquetMetadata format reads the footer, so you can confirm the row count, row groups, and per-column compression:

clickhouse local -q "SELECT num_rows, num_columns, num_row_groups, format_version FROM file('events.parquet', ParquetMetadata) FORMAT Vertical"
Row 1:
──────
num_rows:       20
num_columns:    6
num_row_groups: 1
format_version: 2

ClickHouse compresses Parquet with ZSTD by default. You can confirm that, column by column:

clickhouse local -q "SELECT tupleElement(arrayJoin(columns), 'name') AS column, tupleElement(arrayJoin(columns), 'compression') AS compression FROM file('events.parquet', ParquetMetadata)"
event_date	ZSTD
event_id	ZSTD
event_type	ZSTD
country	ZSTD
amount	ZSTD
is_member	ZSTD

Options the upload converters don't give you

Online "Arrow to Parquet" converters make you upload your data and hand back whatever defaults they picked. Running the conversion locally means you control it, and your data never leaves the machine.

Pick the codec. ZSTD is the default; pass output_format_parquet_compression_method to choose another (snappy, lz4, gzip, brotli, or none):

clickhouse local -q "SELECT * FROM file('events.arrow') INTO OUTFILE 'events_snappy.parquet' TRUNCATE FORMAT Parquet SETTINGS output_format_parquet_compression_method='snappy'"
SNAPPY

Tune the row group size with output_format_parquet_row_group_size. And because the source is just a table, you can convert and transform in the same pass. Filter rows, drop columns, or aggregate before writing, instead of converting first and cleaning up later:

clickhouse local -q "
SELECT event_type, count() AS events, round(sum(amount), 2) AS amount
FROM file('events.arrow')
GROUP BY event_type
ORDER BY amount DESC"
logout	4	0.46
refund	4	0.42
purchase	4	0.38
login	4	0.34
signup	4	0.3
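
The codec, the row group size, and a filter can all go into one statement. When conversions are scripted, a small helper that assembles the query keeps the options in one place. The sketch below only builds the SQL string; parquet_export_sql and its parameters are hypothetical names, the codec list is the one given above, and the output file name purchases.parquet in the usage note is an example.

```python
# Codec names as listed in this section
CODECS = {"zstd", "snappy", "lz4", "gzip", "brotli", "none"}


def parquet_export_sql(src, dst, columns="*", where=None,
                       codec="zstd", row_group_size=None):
    """Build a SELECT ... INTO OUTFILE statement that writes Parquet.

    Values are interpolated as-is, so this sketch is only suitable for
    trusted, hand-written arguments (no quote-escaping is done).
    """
    if codec not in CODECS:
        raise ValueError(f"unsupported codec: {codec}")
    sql = f"SELECT {columns} FROM file('{src}')"
    if where:
        sql += f" WHERE {where}"
    sql += f" INTO OUTFILE '{dst}' TRUNCATE FORMAT Parquet"
    settings = [f"output_format_parquet_compression_method='{codec}'"]
    if row_group_size is not None:
        settings.append(
            f"output_format_parquet_row_group_size={int(row_group_size)}"
        )
    return sql + " SETTINGS " + ", ".join(settings)
```

For instance, parquet_export_sql('events.arrow', 'purchases.parquet', columns='event_id, amount', where="event_type = 'purchase'", row_group_size=100000) yields a single query that filters, drops columns, and sets both Parquet options; the string can be passed to clickhouse local -q or to chdb.query.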

How fast is it?

On a 3,000,000-row Arrow IPC file (events_large.arrow, ~46 MB), the full conversion to Parquet is this one command:

clickhouse local -q "SELECT * FROM file('events_large.arrow') INTO OUTFILE 'events_large.parquet' TRUNCATE FORMAT Parquet"

It runs in ~0.24 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). The ~46 MB Arrow input came out as a ~15 MB ZSTD Parquet file, roughly a third of the size. Arrow and Parquet are both columnar, so the conversion is mostly re-encoding column chunks rather than reshaping data, which is why it's quick.
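
For reference, 3,000,000 rows in ~0.24 seconds works out to roughly 12.5 million rows per second. To get a comparable number on your own machine, a best-of-N timing can be sketched as below. This is one possible harness, not necessarily how the figure above was measured: it times the whole clickhouse process including startup, and best_of and convert_large are illustrative names.

```python
import subprocess
import time


def best_of(run, repeats=3):
    """Call run() `repeats` times; return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return min(timings)


def convert_large():
    """Run the conversion once; raises if clickhouse is missing or exits non-zero."""
    subprocess.run(
        [
            "clickhouse", "local", "-q",
            "SELECT * FROM file('events_large.arrow') "
            "INTO OUTFILE 'events_large.parquet' TRUNCATE FORMAT Parquet",
        ],
        check=True,
    )
```

Calling best_of(convert_large) returns the fastest of three runs. Taking the minimum rather than the mean keeps the first, cold-cache run from skewing the result.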

The same thing in Python with chDB

If you're already in a Python session, chDB is the same engine in-process, with the same SQL and no separate binary. Write the Parquet with the identical SELECT ... INTO OUTFILE:

import chdb

chdb.query(
    "SELECT * FROM file('events.arrow') "
    "INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet"
)

print(chdb.query("SELECT count() FROM file('events.parquet')", "CSV"))
# 20

The ArrowStream rule is the same here: a stream-framed file needs file('events_stream.arrow', 'ArrowStream'). If you'd rather load the Arrow data into a DataFrame instead, see how to read an Arrow file in Python.

Reverse direction?

Going the other way is the same one-liner with the formats swapped. See how to convert Parquet to Arrow.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample Arrow files (including the ~46 MB file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-arrow-to-parquet

Where to go next

Arrow IPC and Feather are the same format under two names, so the same conversion reads a Feather file too. Once you have Parquet, see how to query a Parquet file, what a Parquet file is, or query Parquet on S3. clickhouse-local runs the same SQL unchanged across dozens of formats, and against a ClickHouse server or ClickHouse Cloud when the data outgrows your laptop, with no rewrite.
