How to convert Parquet to Avro

Al Brown
Last updated: Jun 15, 2026

To convert Parquet to Avro, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then point it at the Parquet file and write the result out as Avro:

clickhouse local -q "SELECT * FROM file('events.parquet') INTO OUTFILE 'events.avro' FORMAT Avro"

clickhouse local reads the schema straight from the Parquet footer, so you name no columns and no types. The file is read in place with no import step, and it streams rather than loading into memory, so inputs larger than RAM convert fine.
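
To make "reads the schema from the Parquet footer" concrete, here is a minimal Python sketch that locates the footer using only the standard library. It relies on the Parquet file layout, in which a file ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1. The function name parquet_footer_length is hypothetical, and the sketch does not decode the Thrift-encoded schema inside the footer.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"

def parquet_footer_length(path):
    """Return the size in bytes of the Parquet footer metadata block.

    Assumed layout: ... <metadata> <4-byte little-endian length> PAR1
    """
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)   # last 8 bytes: length + magic
        tail = f.read(8)
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError(f"{path} does not end with the Parquet magic bytes")
    (length,) = struct.unpack("<I", tail[:4])
    return length
```

Calling parquet_footer_length('events.parquet') would report how many bytes of metadata sit at the end of the file, which is the part a reader needs in order to learn the column names and types.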

What changes: columnar to row-oriented

Parquet is columnar: values for one column sit together on disk, which is what makes column scans and compression so good. Avro is row-oriented: each record is written whole, one after another, with the schema stored once in the file header. The conversion is a transpose plus a re-encode, and clickhouse-local derives the Avro schema from the column types it read out of the Parquet footer. You don't write a .avsc by hand.
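
The transpose is easy to picture in plain Python. This sketch only illustrates the layout change and says nothing about how clickhouse local implements it. The sample values are borrowed from the rows shown later on this page, and the names columns and columns_to_rows are made up for the illustration.

```python
# Columnar layout: one list per column, the way Parquet groups values.
columns = {
    "event_id": [1, 2, 3],
    "country": ["GB", "US", "DE"],
    "event_type": ["click", "view", "purchase"],
}

def columns_to_rows(cols):
    """Transpose {name: [values]} into a list of whole records (row-oriented)."""
    names = list(cols)
    return [dict(zip(names, values)) for values in zip(*cols.values())]

rows = columns_to_rows(columns)
for row in rows:
    print(row)
```

Each printed dict corresponds to one Avro record: all fields of a single event written together, rather than all values of a single column.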

Check what was read from the Parquet file:

clickhouse local -q "DESCRIBE file('events.parquet')"
event_id	UInt64
ts	DateTime64(3, 'UTC')
country	String
event_type	String
amount	Float64
items	UInt8
device	Tuple(
    `1` String,
    `2` UInt8)

Now read the same columns back from the Avro file that was produced:

clickhouse local -q "DESCRIBE file('events.avro')"
event_id	Int64
ts	DateTime64(3)
country	String
event_type	String
amount	Float64
items	Int32
device	Tuple(
    `1` String,
    `2` Int32)

The integer gotcha: Avro has no unsigned types

Look closely at the two schemas. event_id went from UInt64 to Int64, and items went from UInt8 to Int32. That isn't a bug. Avro's primitive integer types are int (32-bit) and long (64-bit), both signed; there is no unsigned integer in the Avro spec. clickhouse local therefore writes each unsigned column as a signed Avro type. A UInt8 (0–255) maps to a signed 32-bit int, which holds its full range. A UInt64 maps to a signed 64-bit long, the widest integer Avro has, so only values up to 2^63 - 1 fit; if a UInt64 column can exceed that, check its maximum before converting. This is the main type change worth knowing about before you convert. Strings and floats carry across unchanged, and ts keeps its millisecond precision, although DESCRIBE on the Avro file no longer shows the explicit 'UTC' time zone.
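
The ranges are easy to check by hand. The following is plain arithmetic, not ClickHouse code: it shows that every UInt8 value fits in an Avro int, while a UInt64 column only fits in an Avro long if its values stay at or below 2^63 - 1. What happens to larger values is not covered on this page, so treat that as something to verify on your own data.

```python
# Maximum values of the integer types involved.
UINT8_MAX = 2**8 - 1        # 255
UINT64_MAX = 2**64 - 1      # 18446744073709551615
AVRO_INT_MAX = 2**31 - 1    # Avro int: signed 32-bit
AVRO_LONG_MAX = 2**63 - 1   # Avro long: signed 64-bit

def fits(unsigned_max, signed_max):
    """True if every value of the unsigned type is representable."""
    return unsigned_max <= signed_max

print(fits(UINT8_MAX, AVRO_INT_MAX))     # True
print(fits(UINT64_MAX, AVRO_LONG_MAX))   # False
```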

The nested structure survives too. The device column is a Tuple(String, UInt8) in Parquet, and it becomes a nested Avro record. Read three rows back as JSON to see it intact:

clickhouse local -q "SELECT * FROM file('events.avro') ORDER BY event_id LIMIT 3 FORMAT JSONEachRow"
{"event_id":1,"ts":"2026-01-01 00:00:00.000","country":"GB","event_type":"click","amount":5,"items":1,"device":{"1":"mobile","2":1}}
{"event_id":2,"ts":"2026-01-01 01:00:00.000","country":"US","event_type":"view","amount":6.01,"items":2,"device":{"1":"desktop","2":0}}
{"event_id":3,"ts":"2026-01-01 02:00:00.000","country":"DE","event_type":"purchase","amount":7.02,"items":3,"device":{"1":"tablet","2":1}}

The nested fields are addressable after conversion, so nothing is flattened away:

clickhouse local -q "SELECT event_id, device.1 AS device_type, device.2 AS is_even FROM file('events.avro') ORDER BY event_id LIMIT 3"
1	mobile	1
2	desktop	0
3	tablet	1

Options worth knowing

These are the things an upload-required converter site won't give you, and they are why doing it locally pays off:

  • Compress the Avro. Avro supports a block codec. Set it before the conversion to shrink the output:

    clickhouse local -q "SET output_format_avro_codec='deflate'; SELECT * FROM file('events.parquet') INTO OUTFILE 'events.avro' FORMAT Avro"
    

    snappy and null (uncompressed) are also valid. Row-oriented Avro typically lands larger than the columnar Parquet it came from, even compressed, because Parquet's per-column encoding is hard to beat for analytical data. Expect the Avro to be bigger, not smaller.

  • Convert a subset. Because the source is a SQL table, you filter and project in the same statement instead of converting then trimming: SELECT event_id, country FROM file('events.parquet') WHERE country = 'US' INTO OUTFILE 'us.avro' FORMAT Avro.

  • Pin the types. Avro only has int and long for integers, so what you can control is which of the two a column lands in. To force a specific width, cast in the SELECT (for example, CAST(items AS Int64) to write a long instead of an int), or pass an explicit structure to file(). The schema is yours to control, not whatever a black-box converter decided.

  • It's scriptable and offline. A single command in a Makefile or cron job, with no file leaving your machine. Useful when the Parquet holds data you can't upload anywhere.
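
For scripting, the options in this list can be combined in a small wrapper. This is a sketch under stated assumptions: build_convert_command is a hypothetical helper, it only assembles the same clickhouse local invocation shown on this page, and its simple string quoting assumes the file names and filter come from you rather than from untrusted input.

```python
import subprocess

def build_convert_command(src, dst, columns="*", where=None, codec=None):
    """Assemble the argv for a Parquet -> Avro conversion (hypothetical helper)."""
    statements = []
    if codec is not None:
        # Same setting as in the compression option in this list.
        statements.append(f"SET output_format_avro_codec='{codec}'")
    select = f"SELECT {columns} FROM file('{src}')"
    if where:
        select += f" WHERE {where}"
    select += f" INTO OUTFILE '{dst}' FORMAT Avro"
    statements.append(select)
    return ["clickhouse", "local", "-q", "; ".join(statements)]

cmd = build_convert_command(
    "events.parquet", "us.avro",
    columns="event_id, country", where="country = 'US'", codec="deflate",
)
print(cmd[-1])

# To actually run it (requires clickhouse on your PATH):
# subprocess.run(cmd, check=True)
```

Returning an argv list rather than a shell string keeps the query as a single argument, so no extra layer of shell quoting is needed when it is passed to subprocess.run.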

How fast is it?

On a 3,000,000-row events_large.parquet (~41 MB), the full Parquet-to-Avro conversion (read the columnar file, transpose to rows, encode Avro, write it out) completes in:

clickhouse local -q "SELECT * FROM file('events_large.parquet') INTO OUTFILE 'events_large.avro' FORMAT Avro"

~0.6 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). Measured with /usr/bin/time -p; figures may vary slightly under concurrent load. The conversion streams, so memory stays flat regardless of file size.
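
To reproduce a best-of-N figure on your own machine, a small harness is enough. This sketch times the whole subprocess with time.perf_counter instead of /usr/bin/time, so its numbers include a little Python overhead and will not match the figure quoted here exactly. best_of is a hypothetical helper name.

```python
import subprocess
import time

def best_of(cmd, runs=3):
    """Run cmd `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)

# TRUNCATE (as in the chDB example on this page) lets repeated runs
# overwrite the output file instead of failing because it already exists.
query = (
    "SELECT * FROM file('events_large.parquet') "
    "INTO OUTFILE 'events_large.avro' TRUNCATE FORMAT Avro"
)
convert_cmd = ["clickhouse", "local", "-q", query]

# Requires clickhouse on your PATH and the sample file in the current directory:
# print(f"{best_of(convert_cmd):.2f} s")
```

Taking the minimum rather than the mean matches the "best of three" convention: it reports the run least disturbed by other activity on the machine.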

Reverse direction

Need to go the other way? See how to convert Avro to Parquet — the same one-liner with the formats swapped.

The same in Python with chDB

If you're already in a notebook, chDB is the same engine in-process. The SQL is identical; only the wrapper changes:

import chdb

# Convert Parquet -> Avro: the same SELECT ... FORMAT Avro, into a file.
chdb.query("SELECT * FROM file('events.parquet') INTO OUTFILE 'events.avro' TRUNCATE FORMAT Avro")

# Read it back to confirm.
print(chdb.query("SELECT count() AS rows FROM file('events.avro')", "DataFrame"))
   rows
0    20

For reading Avro into a DataFrame afterwards, see how to read an Avro file in Python with chDB, and for querying Avro from the terminal, how to read an Avro file.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample Parquet files (including the 3M-row file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-parquet-to-avro

The same SQL that converts one file on your laptop runs unchanged against a directory of Parquet files, a remote object store, or a ClickHouse Cloud service when the data outgrows your machine. See also how to query a Parquet file.
