To convert a CSV file to Parquet, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.
Install it with clickhousectl:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
Then write the CSV out as Parquet in one command:
clickhouse local -q "SELECT * FROM file('orders.csv') INTO OUTFILE 'orders.parquet' TRUNCATE FORMAT Parquet"
Then check the output to confirm the types were carried across:
clickhouse local -q "DESCRIBE file('orders.parquet')"
order_date Nullable(Date32)
order_id Nullable(Int64)
country Nullable(String)
product Nullable(String)
revenue Nullable(Float64)
quantity Nullable(Int64)
clickhouse local reads the CSV header for column names, infers each column's type from the data, and writes a typed Parquet file directly, with no separate import step. A date stayed a date, a number stayed a number.
Types are inferred from the CSV and carried into Parquet
CSV is untyped text. Parquet is columnar and typed. The conversion's real work is deciding each column's type, and clickhouse-local does that from the data. DESCRIBE the output to see what it wrote:
clickhouse local -q "DESCRIBE file('orders.parquet')"
order_date Nullable(Date32)
order_id Nullable(Int64)
country Nullable(String)
product Nullable(String)
revenue Nullable(Float64)
quantity Nullable(Int64)
The header gave the names and the data gave the types. Columns come out Nullable because inference allows for empty values. That typing is the point of the conversion: every later query against the Parquet file reads typed columns instead of re-parsing text.
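To make the idea concrete, here is a toy sketch of header-plus-data inference in Python. It is not ClickHouse's algorithm: the three candidate parsers, the try-in-order rule, and the two sample rows are all assumptions, chosen so the result lines up with the DESCRIBE output above.

```python
import csv
import io
from datetime import date

# Hypothetical sample rows; the real orders.csv is not reproduced here.
SAMPLE_CSV = """order_date,order_id,country,product,revenue,quantity
2024-01-15,1001,DE,Widget,19.99,2
2024-01-16,1002,US,Gadget,5.50,1
"""

# Candidate types, tried in this order. String is the fallback.
PARSERS = (("Int64", int), ("Float64", float), ("Date32", date.fromisoformat))


def infer_type(values):
    """Pick the first candidate type that parses every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if non_empty:
        for name, parse in PARSERS:
            try:
                for v in non_empty:
                    parse(v)
            except ValueError:
                continue  # this type does not fit, try the next one
            # Empty cells are tolerated, so the column is reported as Nullable.
            return f"Nullable({name})"
    return "Nullable(String)"


def infer_schema(csv_text):
    """Map each header name to a type inferred from that column's values."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    columns = [[row[i] for row in rows] for i in range(len(header))]
    return {name: infer_type(col) for name, col in zip(header, columns)}


for name, type_ in infer_schema(SAMPLE_CSV).items():
    print(name, type_)
```

The real inference handles far more cases (other date formats, booleans, nested types, sampling limits), so treat this only as a picture of the principle: names from the header, types from the values.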
Pick the compression codec and row-group size
This is where a one-line conversion beats a generic file converter: you control how the Parquet is written. Two settings matter most.
output_format_parquet_compression_method sets the codec. The default is zstd, which balances file size against speed well. lz4 decodes faster but produces larger files; snappy and gzip are also available.
output_format_parquet_row_group_size sets how many rows go in each row group. Larger groups compress better; smaller groups let a reader skip more aggressively on filtered scans.
clickhouse local -q "
SELECT * FROM file('orders.csv')
INTO OUTFILE 'orders_zstd.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method='zstd', output_format_parquet_row_group_size=1000000"
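If you generate these commands from a script, a small helper keeps the two settings explicit. The sketch below only builds the query text; the function name, its defaults, and the codec allow-list (limited to the codecs named above) are assumptions for illustration, not part of any ClickHouse API.

```python
# Only the codecs named in the text; this sketch rejects anything else.
KNOWN_CODECS = ("zstd", "lz4", "snappy", "gzip")


def build_convert_query(src, dst, codec="zstd", row_group_size=1_000_000):
    """Return the CSV-to-Parquet query with explicit writer settings."""
    if codec not in KNOWN_CODECS:
        raise ValueError(f"unexpected codec: {codec!r}")
    if row_group_size <= 0:
        raise ValueError("row_group_size must be positive")
    # Paths are interpolated as-is, so this assumes they contain no single quotes.
    return (
        f"SELECT * FROM file('{src}') "
        f"INTO OUTFILE '{dst}' TRUNCATE FORMAT Parquet "
        f"SETTINGS output_format_parquet_compression_method='{codec}', "
        f"output_format_parquet_row_group_size={row_group_size}"
    )


print(build_convert_query("orders.csv", "orders_lz4.parquet", codec="lz4"))
```

The returned string can be passed to clickhouse local -q, or to chdb.query from Python.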
The codec is recorded per column in the Parquet footer. Read it back with the ParquetMetadata format to confirm:
clickhouse local -q "
SELECT c.1 AS column, c.2 AS physical_type, c.3 AS compression
FROM (
SELECT arrayJoin(arrayMap(x -> (x.name, x.physical_type, x.compression), columns)) AS c
FROM file('orders_zstd.parquet', ParquetMetadata))
FORMAT TSV"
order_date INT32 ZSTD
order_id INT64 ZSTD
country BYTE_ARRAY ZSTD
product BYTE_ARRAY ZSTD
revenue DOUBLE ZSTD
quantity INT64 ZSTD
Inference is conservative: it picks Nullable and wide integers so nothing breaks. If you know the data, narrow it before writing. Pass the format and an explicit schema as the second and third arguments to file():
clickhouse local -q "
SELECT * FROM file('orders.csv', 'CSVWithNames',
'order_date Date, order_id UInt32, country LowCardinality(String), product LowCardinality(String), revenue Float64, quantity UInt8')
INTO OUTFILE 'orders_typed.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method='zstd'"
DESCRIBE the output to confirm the narrower types:
clickhouse local -q "DESCRIBE file('orders_typed.parquet')"
order_date Date32
order_id UInt32
country String
product String
revenue Float64
quantity UInt8
The columns are no longer Nullable and the integers are smaller. Two type changes on the round trip are worth knowing about. Parquet stores a date as a 32-bit integer, so a Date column reads back as Date32. LowCardinality(String) is a ClickHouse-side encoding; Parquet stores it as a plain String (dictionary encoding lives in the Parquet column itself), so it reads back as String. Neither loses data.
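Narrowing is only safe if every value fits the smaller type. A minimal pre-check, assuming you can load a column's integer values (or just its minimum and maximum) into Python, looks like the sketch below. The helper name and the sample values are hypothetical; the ranges are those of ClickHouse's UInt8 through UInt64 types.

```python
# Inclusive upper bounds of ClickHouse's unsigned integer types, narrowest first.
UNSIGNED_MAX = (
    ("UInt8", 2**8 - 1),
    ("UInt16", 2**16 - 1),
    ("UInt32", 2**32 - 1),
    ("UInt64", 2**64 - 1),
)


def narrowest_unsigned(values):
    """Return the smallest unsigned type that holds every value, or None."""
    values = list(values)
    if not values or min(values) < 0:
        return None  # empty input or negatives: no unsigned type fits
    largest = max(values)
    for name, upper in UNSIGNED_MAX:
        if largest <= upper:
            return name
    return None  # larger than UInt64 can hold


# Hypothetical column samples, not taken from the real orders.csv.
print(narrowest_unsigned([2, 1, 12]))
print(narrowest_unsigned([1001, 1002, 70000]))
```

Running a check like this against the source data before choosing quantity UInt8 or order_id UInt32 avoids discovering an out-of-range value at conversion time.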
How small, and how fast?
On a 3,000,000-row CSV (orders_large.csv, 132 MB), the codec choice makes a large difference to the output size (sizes in bytes):
132270053 orders_large.csv (source CSV)
32618489 orders_large_lz4.parquet
22943750 orders_large_zstd.parquet
zstd Parquet is about 5.8x smaller than the CSV here; lz4 about 4x. The exact ratio depends on your data, but typed columnar storage plus compression beats raw text comfortably. See Parquet vs CSV for why.
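The ratios follow directly from the byte counts listed above. A short Python check, using only those three numbers (the helper name is just for illustration):

```python
# Byte counts copied from the size listing above.
SIZES = {
    "orders_large.csv": 132270053,
    "orders_large_lz4.parquet": 32618489,
    "orders_large_zstd.parquet": 22943750,
}
SOURCE = "orders_large.csv"


def compression_ratio(source_bytes, output_bytes):
    """How many times smaller the output is than the source."""
    return source_bytes / output_bytes


for name, size in SIZES.items():
    if name == SOURCE:
        continue
    ratio = compression_ratio(SIZES[SOURCE], size)
    saved = 1 - size / SIZES[SOURCE]  # fraction of the original bytes removed
    print(f"{name}: {ratio:.1f}x smaller, {saved:.0%} saved")
```

The same two lines of arithmetic work on your own files: divide the CSV size by the Parquet size to get the ratio.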
The conversion itself is quick. Writing the 132 MB CSV out as zstd Parquet with 1,000,000-row row groups takes ~0.38 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). That includes parsing the CSV text and compressing every column.
clickhouse local -q "
SELECT * FROM file('orders_large.csv')
INTO OUTFILE 'orders_large.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method='zstd', output_format_parquet_row_group_size=1000000"
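To reproduce a best-of-three measurement on your own machine, a small harness is enough. This is a sketch of one way to do it, not the script used for the figure above: the function names are illustrative, it assumes clickhouse is on your PATH and orders_large.csv is in the working directory, and it times the whole process including startup.

```python
import subprocess
import time

QUERY = (
    "SELECT * FROM file('orders_large.csv') "
    "INTO OUTFILE 'orders_large.parquet' TRUNCATE FORMAT Parquet "
    "SETTINGS output_format_parquet_compression_method='zstd', "
    "output_format_parquet_row_group_size=1000000"
)


def best_of(run, repeats=3):
    """Call run() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return min(timings)


def convert():
    """Run the conversion once; raises if clickhouse exits with an error."""
    subprocess.run(["clickhouse", "local", "-q", QUERY], check=True)
```

Call best_of(convert) to get the fastest of three runs. Taking the minimum rather than the mean filters out runs slowed by a cold page cache or other activity on the machine.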
Online "CSV to Parquet" converters can do the same shape of job, but they need you to upload the file, they cap its size, and they pick the types and codec for you. Doing it locally keeps the data on your machine, has no size limit beyond your disk, scripts cleanly in a pipeline, and lets you set the types and compression yourself. clickhouse-local runs the same SQL unchanged against dozens of formats and against a ClickHouse server or ClickHouse Cloud when the data outgrows your laptop.
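As an illustration of the pipeline case, the sketch below plans one conversion per CSV file in a directory. The function names and the rule of writing each .parquet next to its source are assumptions for this example; it builds the command lines without running them, and it assumes paths contain no single quotes.

```python
from pathlib import Path


def convert_argv(csv_path):
    """Return the clickhouse local command line that converts one CSV file."""
    src = Path(csv_path)
    dst = src.with_suffix(".parquet")  # same name, next to the source
    query = (
        f"SELECT * FROM file('{src.as_posix()}') "
        f"INTO OUTFILE '{dst.as_posix()}' TRUNCATE FORMAT Parquet"
    )
    return ["clickhouse", "local", "-q", query]


def plan(directory):
    """Return one command per *.csv file in the directory, in name order."""
    return [convert_argv(p) for p in sorted(Path(directory).glob("*.csv"))]


for argv in plan("."):
    print(argv[-1])
```

Passing each planned command to subprocess.run(argv, check=True) executes the conversions and stops at the first failure.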
If you would rather stay in Python, chDB is the same ClickHouse engine in-process. The SQL is identical:
import chdb
chdb.query("""
SELECT * FROM file('orders.csv')
INTO OUTFILE 'orders.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method='zstd'
""")
That writes the Parquet file directly. To pull the rows into a DataFrame instead, see how to read a Parquet file in Python with chDB.
Going the other way is the same idea with the formats swapped: convert Parquet to CSV.
The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample CSVs (including the ~132 MB file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.
github.com/ClickHouse/examples/tree/main/local-analytics/convert-csv-to-parquet