How to convert CSV to Parquet

To convert a CSV file to Parquet, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then write the CSV out as Parquet in one command:

clickhouse local -q "SELECT * FROM file('orders.csv') INTO OUTFILE 'orders.parquet' TRUNCATE FORMAT Parquet"

Then check the output to confirm the types were carried across:

clickhouse local -q "DESCRIBE file('orders.parquet')"

order_date	Nullable(Date32)
order_id	Nullable(Int64)
country	Nullable(String)
product	Nullable(String)
revenue	Nullable(Float64)
quantity	Nullable(Int64)

clickhouse local reads the CSV header for column names, infers each column's type from the data, and writes a typed Parquet file in place with no import step first. A date stayed a date, a number stayed a number.

Types are inferred from the CSV and carried into Parquet

CSV is untyped text. Parquet is columnar and typed. The conversion's real work is deciding each column's type, and clickhouse-local does that from the data. DESCRIBE the output to see what it wrote:

clickhouse local -q "DESCRIBE file('orders.parquet')"

order_date	Nullable(Date32)
order_id	Nullable(Int64)
country	Nullable(String)
product	Nullable(String)
revenue	Nullable(Float64)
quantity	Nullable(Int64)

The header gave the names, the data gave the types. Columns come out Nullable because inference allows for empty values. A date stayed a date, a number stayed a number. That typing is what you're after, because every later query against the Parquet reads typed columns instead of re-parsing text.
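To make "the header gave the names, the data gave the types" concrete, here is a toy Python sketch of schema inference over a small CSV sample. It is not ClickHouse's inference algorithm: the infer_type helper, its rule order, and the sample rows are assumptions for illustration, and the type labels are simply borrowed from the DESCRIBE output above.

```python
import csv
import io
from datetime import date

# Hypothetical sample in the shape of orders.csv; the values are made up.
SAMPLE = """order_date,order_id,country,revenue
2024-01-15,1001,DE,19.99
2024-01-16,1002,FR,
"""

def infer_type(values):
    """Return the first of Int64, Float64, Date32 that parses every non-empty value, else String."""
    present = [v for v in values if v != ""]
    if not present:
        return "Nullable(String)"  # nothing to infer from
    for name, parse in (("Int64", int), ("Float64", float), ("Date32", date.fromisoformat)):
        try:
            for v in present:
                parse(v)
            return f"Nullable({name})"
        except ValueError:
            continue
    return "Nullable(String)"

def infer_schema(text):
    """Take column names from the header row and column types from the data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return {name: infer_type(col) for name, col in zip(header, zip(*data))}

schema = infer_schema(SAMPLE)
for name, type_name in schema.items():
    print(f"{name}\t{type_name}")
```

The empty revenue value in the second row does not change the inferred type, which is the same reason the real output marks every column Nullable.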

Pick the compression codec and row-group size

This is where a one-line conversion beats a generic file converter: you control how the Parquet is written. Two settings matter most.

output_format_parquet_compression_method sets the codec. The default is zstd, which gives a strong size-to-speed ratio. lz4 is faster to decode and larger on disk; snappy and gzip are also available.

output_format_parquet_row_group_size sets how many rows go in each row group. Larger groups compress better; smaller groups let a reader skip more aggressively on filtered scans.

clickhouse local -q "
SELECT * FROM file('orders.csv')
INTO OUTFILE 'orders_zstd.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method='zstd', output_format_parquet_row_group_size=1000000"
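If you drive the conversion from a script, one way to compare codecs is to generate the same command once per codec. The Python sketch below only builds the argument lists and does not run them. The build_convert_command helper and the output naming scheme are assumptions for illustration, and it assumes file paths that contain no single quotes.

```python
from pathlib import Path

def build_convert_command(csv_path, codec="zstd", row_group_size=1_000_000):
    """Return an argv list for `clickhouse local` that writes <stem>_<codec>.parquet."""
    src = Path(csv_path)
    out = src.with_name(f"{src.stem}_{codec}.parquet")
    sql = (
        f"SELECT * FROM file('{src}') "
        f"INTO OUTFILE '{out}' TRUNCATE FORMAT Parquet "
        f"SETTINGS output_format_parquet_compression_method='{codec}', "
        f"output_format_parquet_row_group_size={row_group_size}"
    )
    # A list of arguments avoids shell quoting problems around the SQL string.
    return ["clickhouse", "local", "-q", sql]

commands = [build_convert_command("orders.csv", codec) for codec in ("zstd", "lz4")]
for argv in commands:
    print(argv[-1])
    # To execute: subprocess.run(argv, check=True), with clickhouse on your PATH.
```

Each generated query has the same shape as the command above, so the only things that vary between runs are the codec and the output file name.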

The codec is recorded per column in the Parquet footer. Read it back with the ParquetMetadata format to confirm:

clickhouse local -q "
SELECT c.1 AS column, c.2 AS physical_type, c.3 AS compression
FROM (
  SELECT arrayJoin(arrayMap(x -> (x.name, x.physical_type, x.compression), columns)) AS c
  FROM file('orders_zstd.parquet', ParquetMetadata))
FORMAT TSV"

order_date	INT32	ZSTD
order_id	INT64	ZSTD
country	BYTE_ARRAY	ZSTD
product	BYTE_ARRAY	ZSTD
revenue	DOUBLE	ZSTD
quantity	INT64	ZSTD

Override the inferred types when you want tighter ones

Inference is conservative: it picks Nullable and wide integers so nothing breaks. If you know the data, narrow it before writing. Pass the format and an explicit schema as the second and third arguments to file():

clickhouse local -q "
SELECT * FROM file('orders.csv', 'CSVWithNames',
  'order_date Date, order_id UInt32, country LowCardinality(String), product LowCardinality(String), revenue Float64, quantity UInt8')
INTO OUTFILE 'orders_typed.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method='zstd'"

order_date	Date32
order_id	UInt32
country	String
product	String
revenue	Float64
quantity	UInt8

The columns are no longer Nullable and the integers are smaller. Two casts are worth knowing about. Date is stored physically as a 32-bit value, so DESCRIBE reports Date32 on the way back. LowCardinality(String) is a ClickHouse-side encoding; Parquet stores it as a plain String (dictionary encoding lives in the Parquet column itself), so it reads back as String. Neither loses data.
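The Date round trip is easier to see with numbers. Parquet's DATE logical type is a 32-bit integer counting days since 1970-01-01, which matches the INT32 physical type reported for order_date in the ParquetMetadata output above. A minimal Python sketch of that encoding, using an arbitrary example date and hypothetical helper names:

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def to_parquet_days(d):
    """Encode a calendar date as days since the Unix epoch."""
    return (d - EPOCH).days

def from_parquet_days(n):
    """Decode a day count back into a calendar date."""
    return EPOCH + timedelta(days=n)

d = date(2024, 1, 15)
days = to_parquet_days(d)
print(days)

# The value fits in a signed 32-bit integer and decodes to the same date.
assert -2**31 <= days < 2**31
assert from_parquet_days(days) == d
```

The day count survives the trip unchanged, which is why reading the column back as Date32 loses nothing.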

How small, and how fast?

On a 3,000,000-row CSV (orders_large.csv, 132 MB), the codec choice changes the output a lot:

132270053	orders_large.csv        (source CSV)
 32618489	orders_large_lz4.parquet
 22943750	orders_large_zstd.parquet

zstd Parquet is about 5.8x smaller than the CSV here; lz4 about 4x. The exact ratio depends on your data, but typed columnar storage plus compression beats raw text comfortably. See Parquet vs CSV for why.
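The ratios follow directly from the byte counts listed above. A short Python check, using only those three sizes and the 3,000,000-row count:

```python
# Byte counts copied from the listing above.
sizes = {
    "orders_large.csv": 132_270_053,
    "orders_large_lz4.parquet": 32_618_489,
    "orders_large_zstd.parquet": 22_943_750,
}
ROWS = 3_000_000

csv_bytes = sizes["orders_large.csv"]
ratios = {
    name: csv_bytes / size
    for name, size in sizes.items()
    if name.endswith(".parquet")
}

for name, ratio in ratios.items():
    bytes_per_row = sizes[name] / ROWS
    print(f"{name}: {ratio:.1f}x smaller than the CSV, {bytes_per_row:.1f} bytes per row")
```

Dividing by the row count gives a per-row figure that is easier to compare across files of different lengths than the raw ratio.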

The conversion itself is quick. Writing the 132 MB CSV out as zstd Parquet with 1,000,000-row row groups takes ~0.38 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). That includes parsing the CSV text and compressing every column.

clickhouse local -q "
SELECT * FROM file('orders_large.csv')
INTO OUTFILE 'orders_large.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method='zstd', output_format_parquet_row_group_size=1000000"
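To take a "best of three" measurement on your own machine, a small harness along these lines is one option. This is a sketch, not the script behind the number above: the best_of helper is hypothetical, and the command it times here is a stand-in so the example runs without ClickHouse installed.

```python
import subprocess
import sys
import time

def best_of(argv, runs=3):
    """Run a command `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in command; replace it with ["clickhouse", "local", "-q", "<the query above>"]
# to time the actual conversion.
stand_in = [sys.executable, "-c", "pass"]
elapsed = best_of(stand_in)
print(f"best of 3: {elapsed:.3f} s")
```

Taking the minimum rather than the mean discards runs slowed by other activity on the machine, and the earlier runs leave the input file in the OS page cache for the later ones.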

Online "CSV to Parquet" converters can do the same shape of job, but they need you to upload the file, they cap its size, and they pick the types and codec for you. Doing it locally keeps the data on your machine, has no size limit beyond your disk, scripts cleanly in a pipeline, and lets you set the types and compression yourself. clickhouse-local runs the same SQL unchanged against dozens of formats and against a ClickHouse server or ClickHouse Cloud when the data outgrows your laptop.

Convert it in Python with chDB

If you would rather stay in Python, chDB is the same ClickHouse engine in-process. The SQL is identical:

import chdb

chdb.query("""
SELECT * FROM file('orders.csv')
INTO OUTFILE 'orders.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method='zstd'
""")

That writes the Parquet file directly. To pull the rows into a DataFrame instead, see how to read a Parquet file in Python with chDB.

Reverse direction

Going the other way is the same idea with the formats swapped: convert Parquet to CSV.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample CSVs (including the ~132 MB file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-csv-to-parquet
