How to convert CSV to Avro

Al Brown
Last updated: Jun 15, 2026

To convert a CSV file to Avro, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then write the result with FORMAT Avro:

clickhouse local -q "SELECT * FROM file('events.csv') INTO OUTFILE 'events.avro' FORMAT Avro"
2026-01-20	20	US	signup	24.19	5
2026-01-19	19	GB	purchase	23.18	4
2026-01-18	18	AU	view	22.17	3
2026-01-17	17	IN	click	21.16	2
2026-01-16	16	FR	signup	20.15	1

clickhouse local reads the CSV header for column names, infers each column's type from the data, and builds the Avro record schema from those types. The CSV is read in place with no import step, and the file streams row by row so conversions larger than RAM complete without issue.
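As a rough mental model of the inference step, the sketch below guesses one type per column from a CSV with a header row, using only Python's standard library. It is not ClickHouse's algorithm, which handles many more types and edge cases. The names `infer_type` and `infer_schema` are hypothetical, the two sample rows are an assumed CSV rendering of the rows shown above, and skipping empty cells as nulls is an assumption.

```python
import csv
import io
from datetime import date


def infer_type(values):
    """Return the first of Int64, Float64, Date, String that fits every value."""
    def all_parse(parser):
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if not values:
        return "String"
    if all_parse(int):
        return "Int64"
    if all_parse(float):
        return "Float64"
    if all_parse(date.fromisoformat):
        return "Date"
    return "String"


def infer_schema(csv_text):
    """Map each header name to an inferred, Nullable-wrapped type name."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    schema = {}
    for i, name in enumerate(header):
        # Empty cells are treated as NULLs and skipped (an assumption).
        column = [row[i] for row in data if row[i] != ""]
        schema[name] = f"Nullable({infer_type(column)})"
    return schema


# Assumed CSV form of two of the sample rows.
sample = (
    "event_date,event_id,country,action,amount,items\n"
    "2026-01-20,20,US,signup,24.19,5\n"
    "2026-01-19,19,GB,purchase,23.18,4\n"
)
print(infer_schema(sample))
```

The order of the checks matters: an integer column also parses as a float, so the narrower type is tried first.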

What the schema becomes

The interesting part of CSV to Avro is the schema. CSV has none; Avro requires one. clickhouse-local fills the gap by inferring types from the CSV, then translating them into Avro field types. Look at what it inferred from the CSV:

clickhouse local -q "DESCRIBE file('events.csv')"
event_date	Nullable(Date)
event_id	Nullable(Int64)
country	Nullable(String)
action	Nullable(String)
amount	Nullable(Float64)
items	Nullable(Int64)

Avro carries its schema in the file header as JSON. The header from events.avro shows the direct mapping:

{"type":"record","name":"row","fields":[{"name":"event_date","type":["null",{"type":"int","logicalType":"date"}]},{"name":"event_id","type":["null","long"]},{"name":"country","type":["null","string"]},{"name":"action","type":["null","string"]},{"name":"amount","type":["null","double"]},{"name":"items","type":["null","long"]}]}
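One way to pull that header out yourself is to parse the start of the file by hand. The sketch below follows the Avro object container layout (a 4-byte magic `Obj\x01`, a metadata map, then a 16-byte sync marker) using only the standard library. `read_avro_header` is a hypothetical helper name, the demo bytes are hand-built for illustration, and a library such as fastavro is likely the more robust choice in real code.

```python
import io
import json


def read_long(stream):
    """Decode one Avro long: a base-128 varint holding a zigzag-encoded value."""
    shift, accum = 0, 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of Avro header")
        byte = chunk[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream):
    """Decode Avro bytes or string: a long length followed by that many bytes."""
    length = read_long(stream)
    return stream.read(length)


def read_avro_header(stream):
    """Return the file metadata map (keys as str, values as raw bytes)."""
    if stream.read(4) != b"Obj\x01":
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break
        if count < 0:
            # A negative count is followed by the block size in bytes.
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    stream.read(16)  # sync marker that separates data blocks
    return meta


# Self-contained demo: a hand-built header whose schema is just "string".
demo = (
    b"Obj\x01" + b"\x02" + b"\x16avro.schema"
    + b"\x10" + b'"string"' + b"\x00" + bytes(16)
)
meta = read_avro_header(io.BytesIO(demo))
print(json.loads(meta["avro.schema"]))
```

To inspect a real file, open it in binary mode and pass the file object to `read_avro_header`; the schema is stored under the `avro.schema` key and the codec, when present, under `avro.codec`.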

Int64 became Avro long, Float64 became double, String became string, and Date became an int with logicalType: date (Avro stores dates as a day count). Because CSV inference marks every column Nullable, each Avro field is a union of ["null", T]. That is correct but verbose, and it forces every consumer to handle nulls.
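The mapping can be written down as a small lookup. This is an illustrative sketch covering only the types that appear on this page, not ClickHouse's implementation; `AVRO_TYPES`, `to_avro_field`, and `days_since_epoch` are hypothetical names.

```python
import json
from datetime import date

# ClickHouse type name -> Avro type, limited to the types seen on this page.
AVRO_TYPES = {
    "Int64": "long",
    "Float64": "double",
    "String": "string",
    "Date": {"type": "int", "logicalType": "date"},
}


def to_avro_field(name, ch_type):
    """Build one Avro field; Nullable(T) becomes the union ["null", T]."""
    prefix, suffix = "Nullable(", ")"
    if ch_type.startswith(prefix) and ch_type.endswith(suffix):
        inner = ch_type[len(prefix):-len(suffix)]
        return {"name": name, "type": ["null", AVRO_TYPES[inner]]}
    return {"name": name, "type": AVRO_TYPES[ch_type]}


def days_since_epoch(d):
    """Avro's date logical type stores the number of days since 1970-01-01."""
    return (d - date(1970, 1, 1)).days


# Column names and types as reported by DESCRIBE.
described = [
    ("event_date", "Nullable(Date)"),
    ("event_id", "Nullable(Int64)"),
    ("country", "Nullable(String)"),
    ("action", "Nullable(String)"),
    ("amount", "Nullable(Float64)"),
    ("items", "Nullable(Int64)"),
]
schema = {
    "type": "record",
    "name": "row",
    "fields": [to_avro_field(name, ch_type) for name, ch_type in described],
}
print(json.dumps(schema, separators=(",", ":")))
print(days_since_epoch(date(2026, 1, 20)))
```

Passing a non-Nullable type name such as `Int64` to `to_avro_field` yields a plain type with no union.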

Prefer Python? See how to read an Avro file in Python with chDB. The chDB conversion block is at the end of this page.

Pin the types for a clean schema

If you know the data has no missing values, declare the types yourself. Pass the format and an explicit, non-Nullable schema as the second and third arguments to file(). The Avro fields then drop the null unions:

clickhouse local -q "
SELECT * FROM file('events.csv', 'CSVWithNames',
  'event_date Date, event_id UInt32, country String, action String, amount Float64, items UInt8')
INTO OUTFILE 'events_typed.avro' TRUNCATE FORMAT Avro"

The embedded schema is now plain scalars, no unions:

{"type":"record","name":"row","fields":[{"name":"event_date","type":{"type":"int","logicalType":"date"}},{"name":"event_id","type":"int"},{"name":"country","type":"string"},{"name":"action","type":"string"},{"name":"amount","type":"double"},{"name":"items","type":"int"}]}

This is the lever an upload-based converter never gives you: you control the exact Avro types from the source CSV, including the integer width and whether a field is nullable. UInt32 and UInt8 collapse to Avro int because Avro has no unsigned integers. Avro int is a 32-bit signed value, so UInt32 values above 2,147,483,647 do not fit it; pick a signed type that fits your range if the distinction matters downstream.
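Before pinning an integer type, it can help to check the column's actual range. A minimal sketch, assuming the CSV has a header row and the column holds integers; `column_range` and `fits_avro_int` are hypothetical helpers, the bounds are those of Avro's 32-bit signed int, and the sample rows (including the oversized value) are made up for illustration.

```python
import csv
import io

AVRO_INT_MIN, AVRO_INT_MAX = -2**31, 2**31 - 1


def column_range(csv_file, column):
    """Return (min, max) of an integer column in a CSV with a header row."""
    values = [int(row[column]) for row in csv.DictReader(csv_file)]
    return min(values), max(values)


def fits_avro_int(low, high):
    """True when every value in [low, high] fits Avro's 32-bit signed int."""
    return AVRO_INT_MIN <= low and high <= AVRO_INT_MAX


# Made-up sample: the last event_id is the largest UInt32 value.
sample = io.StringIO("event_id,items\n20,5\n19,4\n4294967295,3\n")
low, high = column_range(sample, "event_id")
print(low, high, fits_avro_int(low, high))
```

When the check fails, declaring the column as Int64 in the schema string gives an Avro long instead.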

Verify the round-trip

Read the Avro file straight back to confirm the data and types survived:

clickhouse local -q "SELECT * FROM file('events.avro') ORDER BY amount DESC LIMIT 5"
2026-01-20	20	US	signup	24.19	5
2026-01-19	19	GB	purchase	23.18	4
2026-01-18	18	AU	view	22.17	3
2026-01-17	17	IN	click	21.16	2
2026-01-16	16	FR	signup	20.15	1

Options worth knowing

These are the choices that matter when the Avro file leaves your machine and lands in another system.

  • Compression codec. Avro compresses each data block. clickhouse-local defaults to snappy (fast, widely supported). Switch to deflate for smaller files when read speed matters less:

    clickhouse local -q "SELECT * FROM file('events.csv') INTO OUTFILE 'events.avro' TRUNCATE FORMAT Avro SETTINGS output_format_avro_codec='deflate'"
    
  • Nullability. Decide it deliberately. Inferred Nullable columns produce ["null", T] unions; an explicit non-Nullable schema produces plain types, as shown above.

  • Type lossiness. A flat CSV maps cleanly to Avro. There is nothing nested to flatten here, unlike going the other direction from a deeply nested format. The only quiet cast is unsigned to signed integers.

  • Compressed input. A .csv.gz or .csv.zst source is decompressed transparently. The same command works; the extension is detected automatically.

Online CSV-to-Avro converters own a lot of these searches, and they are fine for a one-off small file. The reasons to do it locally: nothing leaves your machine, you control the exact Avro types, it scripts into a pipeline, and it handles files bigger than memory.
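For the pipeline case, a small Python wrapper is one option. The sketch below only assembles the same clickhouse local command shown on this page and hands it to `subprocess`. `build_convert_command` and `convert` are hypothetical names, and the code assumes `clickhouse` is on your PATH and that file names contain no single quotes.

```python
import subprocess


def build_convert_command(src, dst, codec=None):
    """Assemble the argv list for a CSV-to-Avro conversion with clickhouse local."""
    for path in (src, dst):
        if "'" in path:
            # Paths are embedded in a quoted SQL string, so reject quotes.
            raise ValueError(f"unsupported quote in path: {path}")
    query = f"SELECT * FROM file('{src}') INTO OUTFILE '{dst}' TRUNCATE FORMAT Avro"
    if codec is not None:
        query += f" SETTINGS output_format_avro_codec='{codec}'"
    return ["clickhouse", "local", "-q", query]


def convert(src, dst, codec=None):
    """Run the conversion; raises CalledProcessError if clickhouse fails."""
    subprocess.run(build_convert_command(src, dst, codec), check=True)


# Printing the command does not require clickhouse to be installed.
print(build_convert_command("events.csv", "events.avro", codec="deflate"))
```

Passing the query as a single argv element avoids shell quoting issues, and `check=True` makes a failed conversion stop the pipeline instead of passing silently.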

How fast is it?

On a 3,000,000-row, ~120 MB CSV (events_large.csv), the full conversion to Avro, parsing the CSV text and encoding the Avro blocks, completes in:

clickhouse local -q "SELECT * FROM file('events_large.csv') INTO OUTFILE 'events_large.avro' TRUNCATE FORMAT Avro"

~0.73 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). The number can move a little under concurrent load, but for a CSV this size the conversion is bound by parsing, not by Avro encoding. The 3,000,000 rows round-trip exactly when read back.
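A best-of-three measurement like this can be reproduced with a few lines of Python. This is a sketch of one way to time it, not the harness behind the figure above. `best_of` and `convert_large` are hypothetical names, and the commented-out call assumes `clickhouse` is on your PATH and `events_large.csv` exists in the working directory.

```python
import subprocess
import time


def best_of(fn, runs=3):
    """Call fn `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def convert_large():
    """Run the large-file conversion command shown on this page."""
    subprocess.run(
        ["clickhouse", "local", "-q",
         "SELECT * FROM file('events_large.csv') "
         "INTO OUTFILE 'events_large.avro' TRUNCATE FORMAT Avro"],
        check=True,
    )


# Uncomment once clickhouse and events_large.csv are available:
# print(f"{best_of(convert_large):.2f} s")
```

Taking the minimum rather than the mean filters out runs slowed by other processes, which is why the first, cold-cache run usually does not set the reported number.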

chDB: the same conversion in Python

If you work in Python, chDB is the same ClickHouse engine in-process. Run the same SELECT ... FORMAT Avro and write the bytes to a file:

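import chdb

# Run the query in-process and take the raw Avro output as bytes.
avro_bytes = chdb.query(
    "SELECT * FROM file('events.csv', 'CSVWithNames') FORMAT Avro"
).bytes()

# Write the bytes unchanged; the result is a complete Avro file.
with open("events.avro", "wb") as f:
    f.write(avro_bytes)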

The schema is derived from the inferred types exactly as with the CLI. This is a drop-in alternative to writing Avro by hand with pyarrow or fastavro when your input is already a CSV.

Reverse direction?

Going the other way is the mirror command: swap the formats. See how to convert Avro to CSV. To learn the format itself, see what is an Avro file, and to query Avro in place without converting, see how to read an Avro file.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample CSVs (including the ~120 MB file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-csv-to-avro

The same SQL that converts a file on your laptop runs unchanged against a ClickHouse server or ClickHouse Cloud when the data outgrows it.
