How to convert Avro to Parquet

Al Brown
Last updated: Jun 15, 2026

To convert Avro to Parquet, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

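Then run the conversion (the same SELECT * FROM file('events.avro') INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet statement used in the codec and chDB examples further down) and read the converted result back to confirm it worked: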

clickhouse local -q "SELECT * FROM file('events.parquet') ORDER BY event_id LIMIT 5"

2026-01-01	1	GB	click	5	1
2026-01-02	2	US	view	6.01	2
2026-01-03	3	DE	signup	7.02	3
2026-01-04	4	FR	purchase	8.03	4
2026-01-05	5	IN	click	9.04	5

clickhouse local reads the schema embedded in the Avro file directly, so you never write a CREATE TABLE or define a structure argument. The Avro file is read in place and the Parquet file is written in one streaming pass, with no import step and no upload.

Why this conversion is more than a file swap

Avro and Parquet solve different problems. Avro is row-oriented: each record is stored together, which is good for streaming and writing one event at a time. Parquet is columnar: values for one column are stored together, which is what makes analytical scans and column pruning fast. Converting Avro to Parquet is a layout change from row to column, not just a rename.
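To make the layout change concrete, here is a minimal pure-Python sketch that pivots a few row records into per-column lists. It is an illustration only, not how either format is encoded on disk; the function name rows_to_columns is hypothetical, and the sample values mirror the first rows of events.avro shown on this page.

```python
# Illustrative only: real Avro and Parquet files use binary encodings,
# blocks, row groups, and compression that this sketch ignores.

def rows_to_columns(rows):
    """Pivot row-oriented records (a list of dicts) into per-column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

# Row-oriented, like Avro: each record keeps all of its fields together.
rows = [
    {"event_id": 1, "country": "GB", "action": "click", "amount": 5.0},
    {"event_id": 2, "country": "US", "action": "view", "amount": 6.01},
    {"event_id": 3, "country": "DE", "action": "signup", "amount": 7.02},
]

# Column-oriented, like Parquet: all values of one field sit together,
# so a query that needs only `amount` never touches the other columns.
columns = rows_to_columns(rows)
total_amount = sum(columns["amount"])
```

Summing amount touches one list in the columnar form, whereas the row form would require visiting every record. That is the access pattern Parquet is built around.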

The useful part is that Avro stores its schema inside the file. clickhouse local reads that schema directly, so you never write a CREATE TABLE or pass a structure argument. Check the types that ended up in the Parquet file:

clickhouse local -q "DESCRIBE file('events.parquet')"

event_date	Date32
event_id	Int64
country	String
action	String
amount	Float64
items	Int32

Those types came from the Avro file and were carried straight into Parquet. Read the result back to confirm the data round-tripped:

clickhouse local -q "SELECT * FROM file('events.parquet') ORDER BY event_id LIMIT 5"

2026-01-01	1	GB	click	5	1
2026-01-02	2	US	view	6.01	2
2026-01-03	3	DE	signup	7.02	3
2026-01-04	4	FR	purchase	8.03	4
2026-01-05	5	IN	click	9.04	5

Options you don't get from an upload converter

Online converters take a file, run it on someone else's machine, and hand you a single result. Running the conversion locally gives you control over the things that actually matter for Parquet, and your data never leaves the machine.

One unsigned-type caveat. Avro has no unsigned integer types. A column that started life as UInt8 in ClickHouse is stored as a regular signed integer in Avro, so it comes back as Int32 after a round trip through Avro (see items above). That is an Avro property, not a conversion bug. If you need the narrower or unsigned type back, cast it on the way out: SELECT * EXCEPT items, items::UInt8 AS items FROM file('events.avro').
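As a rough sketch of how that cast could be folded into the conversion itself, the snippet below builds the SQL in Python and hands it to chDB (the in-process engine covered later on this page). The helper names cast_column_sql and convert_with_cast are hypothetical, and the query uses the parenthesized EXCEPT (items) form rather than the bare form quoted above. As with the query in the text, the cast column ends up last in the output.

```python
def cast_column_sql(src, dst, column, target_type):
    """Build a conversion query that casts one column on the way out."""
    return (
        f"SELECT * EXCEPT ({column}), {column}::{target_type} AS {column} "
        f"FROM file('{src}') "
        f"INTO OUTFILE '{dst}' TRUNCATE FORMAT Parquet"
    )

def convert_with_cast(src, dst, column, target_type):
    """Run the conversion through chDB (assumes the chdb package is installed)."""
    import chdb  # imported lazily so the SQL builder works without chDB
    chdb.query(cast_column_sql(src, dst, column, target_type))

# Restore `items` to UInt8 while converting events.avro to Parquet.
sql = cast_column_sql("events.avro", "events.parquet", "items", "UInt8")
```

Calling convert_with_cast("events.avro", "events.parquet", "items", "UInt8") would then write the Parquet file with the narrower type, assuming every value in items fits in the 0 to 255 range.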

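Choose the compression codec. Parquet compresses per column. clickhouse local defaults to zstd; you can pick another with the output_format_parquet_compression_method setting. Running the same conversion with each codec on the 3,000,000-row sample file shows the difference in output size: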

clickhouse local -q "
SELECT * FROM file('events_large.avro')
INTO OUTFILE 'events_large.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method = 'zstd'"

none     45054899 bytes
snappy   32584563 bytes
zstd     22939538 bytes

zstd gives the smallest file here; snappy decompresses a little faster if read speed matters more than size. You can also set output_format_parquet_row_group_size to tune row-group granularity for your readers.
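One way to reproduce that comparison from Python is sketched below, assuming chDB is installed and events_large.avro is in the working directory. The helper names and the per-codec output file names are hypothetical. The last lines only redo the arithmetic on the byte counts reported above.

```python
import os

CODECS = ("none", "snappy", "zstd")

def codec_sql(src, dst, codec):
    """Conversion query with an explicit Parquet compression codec."""
    return (
        f"SELECT * FROM file('{src}') "
        f"INTO OUTFILE '{dst}' TRUNCATE FORMAT Parquet "
        f"SETTINGS output_format_parquet_compression_method = '{codec}'"
    )

def compare_codecs(src, codecs=CODECS):
    """Convert src once per codec and return {codec: output size in bytes}."""
    import chdb  # imported lazily; only needed when actually converting
    sizes = {}
    for codec in codecs:
        dst = f"{os.path.splitext(src)[0]}.{codec}.parquet"
        chdb.query(codec_sql(src, dst, codec))
        sizes[codec] = os.path.getsize(dst)
    return sizes

def savings_vs_uncompressed(sizes):
    """Percent saved by each codec relative to the 'none' output."""
    base = sizes["none"]
    return {c: round(100 * (1 - s / base), 1) for c, s in sizes.items()}

# Byte counts reported above for the 3,000,000-row sample.
reported = {"none": 45054899, "snappy": 32584563, "zstd": 22939538}
savings = savings_vs_uncompressed(reported)
```

On the reported figures, snappy saves roughly 28% and zstd roughly 49% relative to the uncompressed Parquet file. Your own data may compress quite differently.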

Inspect what you wrote. The ParquetMetadata format reads the footer without scanning the data, so you can confirm the row count, row groups, and format version:

clickhouse local -q "
SELECT num_rows, num_row_groups, format_version
FROM file('events.parquet', ParquetMetadata) FORMAT Vertical"

Row 1:
──────
num_rows:       20
num_row_groups: 1
format_version: 2

How fast is it?

On the 3,000,000-row sample (events_large.avro, 44 MB), converting the whole file to Parquet takes ~0.61 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). The 44 MB row-oriented Avro file becomes a 22 MB columnar Parquet file. The conversion reads every row, re-lays it out by column, and compresses it, in one streaming pass.
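If you want to take a comparable measurement on your own machine, a best-of-N harness might look like the sketch below. It is an assumption about method, not the script used for the figure above: best_of and convert are hypothetical names, and timing the CLI through subprocess includes process start-up. Taking the minimum of several runs means the later runs benefit from a warm page cache.

```python
import subprocess
import time

def best_of(fn, repeats=3):
    """Call fn `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

def convert():
    """Run the conversion through the clickhouse CLI (must be on PATH)."""
    sql = (
        "SELECT * FROM file('events_large.avro') "
        "INTO OUTFILE 'events_large.parquet' TRUNCATE FORMAT Parquet"
    )
    subprocess.run(["clickhouse", "local", "-q", sql], check=True)

# seconds = best_of(convert)  # run where clickhouse and the sample file exist

# Throughput implied by the figures reported above (3,000,000 rows in ~0.61 s).
rows_per_second = 3_000_000 / 0.61
```

The reported timing works out to roughly 4.9 million rows per second on that laptop.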

In Python with chDB

If you work in Python, chDB is the same ClickHouse engine in-process. The conversion is the identical SQL, no server and no subprocess:

import chdb

chdb.query("""
SELECT * FROM file('events.avro')
INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet
""")

# read it back into a pandas DataFrame
df = chdb.query("SELECT * FROM file('events.parquet') ORDER BY event_id LIMIT 5", "DataFrame")
print(df)

  event_date  event_id country    action  amount  items
0 2026-01-01         1      GB     click    5.00      1
1 2026-01-02         2      US      view    6.01      2
2 2026-01-03         3      DE    signup    7.02      3
3 2026-01-04         4      FR  purchase    8.03      4
4 2026-01-05         5      IN     click    9.04      5

This is a clean alternative to reading the Avro into pandas with a separate Avro library and then writing Parquet through pyarrow: one engine, one schema inference step, and it streams files larger than memory.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample Avro files (including the 3M-row file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-avro-to-parquet

The same SQL that converts one file on your laptop runs unchanged against a directory of files, a remote object store, a ClickHouse server, or ClickHouse Cloud when the data outgrows your machine.
