How to convert TSV to Parquet

Al Brown
Last updated: Jun 8, 2026

To convert a TSV file to Parquet, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then write the result INTO OUTFILE to produce the Parquet:

clickhouse local -q "SELECT * FROM file('events.tsv') INTO OUTFILE 'events.parquet' FORMAT Parquet"
2026-01-01	1	GB	click	5	1
2026-01-02	2	US	view	6.01	2
2026-01-03	3	DE	purchase	7.02	3
2026-01-04	4	FR	signup	8.03	4
2026-01-05	5	IN	click	9.04	5

clickhouse local reads the TSV header for column names and infers each column's type from the data, so the file is converted directly, with no import step. Those types flow into the Parquet schema, giving readers real dates, integers, and floats rather than strings to re-parse.

What gets carried across

A TSV is just text: every value is a string until something infers a type. Parquet is columnar and typed, so the conversion has to assign a type to each column. clickhouse local does that for you from the data. Check what it inferred by running DESCRIBE on the Parquet file:

clickhouse local -q "DESCRIBE file('events.parquet')"
event_date	Nullable(Date32)
event_id	Nullable(Int64)
country	Nullable(String)
action	Nullable(String)
amount	Nullable(Float64)
sessions	Nullable(Int64)

The header gave the names; the data gave the types. Those types are written into the Parquet schema, so a reader gets real dates, integers, and floats back, not strings it has to re-parse. Read the file back to confirm the round-trip:

clickhouse local -q "SELECT * FROM file('events.parquet') ORDER BY event_id LIMIT 5"
2026-01-01	1	GB	click	5	1
2026-01-02	2	US	view	6.01	2
2026-01-03	3	DE	purchase	7.02	3
2026-01-04	4	FR	signup	8.03	4
2026-01-05	5	IN	click	9.04	5
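
To make "the header gave the names; the data gave the types" concrete, here is a deliberately simplified sketch of column type inference in plain Python. It is not ClickHouse's inference algorithm: the function names (infer_type, infer_schema) are hypothetical, only four types are considered, and nullability is ignored. The sample rows are the five shown above, and the header line is assumed from the column names in the DESCRIBE output.

```python
import csv
import io
from datetime import date

# Header assumed from the DESCRIBE output; rows copied from the sample above.
SAMPLE = (
    "event_date\tevent_id\tcountry\taction\tamount\tsessions\n"
    "2026-01-01\t1\tGB\tclick\t5\t1\n"
    "2026-01-02\t2\tUS\tview\t6.01\t2\n"
    "2026-01-03\t3\tDE\tpurchase\t7.02\t3\n"
    "2026-01-04\t4\tFR\tsignup\t8.03\t4\n"
    "2026-01-05\t5\tIN\tclick\t9.04\t5\n"
)


def infer_type(values):
    """Return the first of Int64, Float64, Date32 that parses every value, else String."""

    def all_parse(parser):
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "Int64"
    if all_parse(float):
        return "Float64"
    if all_parse(date.fromisoformat):
        return "Date32"
    return "String"


def infer_schema(tsv_text):
    """Take column names from the first row and a type from the remaining rows."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose rows into columns
    return {name: infer_type(column) for name, column in zip(header, columns)}


schema = infer_schema(SAMPLE)
for name, type_name in schema.items():
    print(f"{name}\t{type_name}")
```

Note how amount becomes a float even though its first value is the integer-looking 5: a type has to fit every value in the column, not just the first one.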

Lock the schema when inference guesses wrong

Inference is convenient, but sometimes you want a specific type: an ID that should stay a fixed-width UInt32, a code that must remain a string rather than become a number. Pass the format and an explicit schema as the second and third arguments to file(), and those types flow into the Parquet:

clickhouse local -q "
SELECT * FROM file('events.tsv', 'TSVWithNames',
  'event_date Date, event_id UInt32, country String, action String, amount Float64, sessions UInt8')
INTO OUTFILE 'events.typed.parquet' TRUNCATE FORMAT Parquet"
event_date	Date32
event_id	UInt32
country	String
action	String
amount	Float64
sessions	UInt8

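The columns are no longer Nullable, and event_id is a UInt32 and sessions a UInt8 rather than the inferred 64-bit integers. event_date was declared as Date but reads back as Date32, because Parquet stores dates as a 32-bit day count. Smaller, exact types make the Parquet smaller and faster to scan.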
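
If the schema lives in code rather than in a hand-written string, the third argument to file() can be assembled from a mapping. The following is a minimal sketch under that assumption: structure_arg and conversion_query are hypothetical helper names, not part of ClickHouse or chDB, and the sketch does no escaping, so it only suits trusted file and column names.

```python
def structure_arg(columns):
    """Join {name: type} pairs into the 'name Type, name Type' string passed to file()."""
    return ", ".join(f"{name} {type_name}" for name, type_name in columns.items())


def conversion_query(src, dst, columns, fmt="TSVWithNames"):
    """Build the SELECT ... INTO OUTFILE statement used above (no quoting or escaping)."""
    return (
        f"SELECT * FROM file('{src}', '{fmt}', '{structure_arg(columns)}') "
        f"INTO OUTFILE '{dst}' TRUNCATE FORMAT Parquet"
    )


# Same column types as the explicit schema shown above.
EVENT_COLUMNS = {
    "event_date": "Date",
    "event_id": "UInt32",
    "country": "String",
    "action": "String",
    "amount": "Float64",
    "sessions": "UInt8",
}

query = conversion_query("events.tsv", "events.typed.parquet", EVENT_COLUMNS)
print(query)
```

The printed statement can be passed to clickhouse local -q or to chdb.query unchanged.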

Options worth knowing (the part the upload sites skip)

Online TSV-to-Parquet converters make you upload the file and accept whatever defaults they bake in. Doing it locally means you control the output, and your data never leaves the machine. Two settings matter most.

The codec has the larger effect. ClickHouse writes Parquet with zstd by default, which is a good balance of size and speed. You can switch it per file. Here are the three common settings (none, lz4, zstd) run against the 3,000,000-row sample (events_large.tsv, ~111 MB of text); the command shows the zstd run:

clickhouse local -q "
SELECT * FROM file('events_large.tsv')
INTO OUTFILE 'events_large.zstd.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method = 'zstd'"
none   45304450 bytes
lz4    32697093 bytes
zstd   22939660 bytes
(source events_large.tsv: 110671413 bytes)

zstd turned 111 MB of TSV text into a 23 MB Parquet file: about 4.8x smaller than the source and roughly half the size of uncompressed Parquet. Use lz4 if you want faster reads and can accept a larger file (about 33 MB here); use none only if you will recompress downstream.
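
The ratios quoted above follow directly from the byte counts in the listing. A small worked calculation, using only those four numbers (the function name ratio_report is just for illustration):

```python
# Byte counts copied from the codec comparison above.
SOURCE_BYTES = 110_671_413
PARQUET_BYTES = {"none": 45_304_450, "lz4": 32_697_093, "zstd": 22_939_660}


def ratio_report(source_bytes, sizes):
    """Map codec -> (source/parquet ratio, size as a percentage of uncompressed Parquet)."""
    baseline = sizes["none"]
    return {
        codec: (round(source_bytes / size, 1), round(100 * size / baseline))
        for codec, size in sizes.items()
    }


report = ratio_report(SOURCE_BYTES, PARQUET_BYTES)
for codec, (vs_source, pct_of_uncompressed) in report.items():
    print(f"{codec:<5} {vs_source}x smaller than the TSV, "
          f"{pct_of_uncompressed}% of uncompressed Parquet")
```

For zstd this gives 4.8x and 51%, matching the "about 4.8x" and "roughly half" figures; lz4 comes out at 3.4x and 72%.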

There is also output_format_parquet_row_group_size, which controls how many rows go in each row group and sets the granularity readers can skip at. The default suits most files; raise it for very wide scans, lower it for selective point lookups.

You can confirm what landed in the file by reading the Parquet footer directly with the ParquetMetadata format:

clickhouse local -q "
SELECT num_rows, num_row_groups,
       arrayDistinct(arrayMap(c -> c.compression, columns)) AS codecs
FROM file('events_large.zstd.parquet', ParquetMetadata)
FORMAT Vertical"
num_rows:       3000000
num_row_groups: 3
codecs:         ['ZSTD']
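
The footer output also gives a feel for what output_format_parquet_row_group_size does. As a rough sketch, assuming row groups are cut purely by row count (other limits, such as a byte-size cap, may also apply in practice), the number of groups is a ceiling division. The 1,000,000 figure below is an assumption chosen because it is consistent with the 3 row groups reported for 3,000,000 rows; estimated_row_groups is a hypothetical name.

```python
def estimated_row_groups(num_rows, row_group_size):
    """Ceiling division: how many groups are needed to hold num_rows."""
    return -(-num_rows // row_group_size)


NUM_ROWS = 3_000_000  # num_rows from the ParquetMetadata output above

# 1_000_000 is an assumed row group size; the smaller values show the effect of lowering it.
for size in (1_000_000, 250_000, 100_000):
    groups = estimated_row_groups(NUM_ROWS, size)
    print(f"row_group_size={size:>9,} -> about {groups} row groups")
```

More, smaller row groups give readers finer units to skip, at the cost of more footer metadata and less data per compressed chunk.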

One note on small inputs: in the first example the 668-byte TSV produced a 2,875-byte Parquet. Parquet carries schema, statistics, and footer metadata, so it adds fixed overhead that dwarfs a tiny payload. Parquet pays off at scale, which is exactly where the 111 MB file lands.

How fast is it?

Converting the 3,000,000-row, ~111 MB events_large.tsv to Parquet (parsing every line, inferring types, writing the columnar file) takes ~0.30 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM).

clickhouse local -q "SELECT * FROM file('events_large.tsv') INTO OUTFILE 'events_large.parquet' TRUNCATE FORMAT Parquet"

That includes reading and parsing the TSV from scratch every run; there is no cached table.
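
To reproduce a "best of three" measurement on your own machine, a small harness like the one below is enough. This is a sketch rather than the script behind the published number: best_of is a hypothetical helper, the measured time includes process startup, and it assumes clickhouse is on your PATH and events_large.tsv is in the working directory when you time the real conversion.

```python
import subprocess
import sys
import time


def best_of(cmd, runs=3):
    """Run cmd `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


# The conversion command from above, as an argument list (no shell quoting needed).
CONVERT = [
    "clickhouse", "local", "-q",
    "SELECT * FROM file('events_large.tsv') "
    "INTO OUTFILE 'events_large.parquet' TRUNCATE FORMAT Parquet",
]

# Smoke test with a command that exists wherever Python does.
elapsed = best_of([sys.executable, "-c", "pass"])
print(f"no-op baseline: {elapsed:.3f} s")

# To time the conversion itself, call: best_of(CONVERT)
```

Taking the minimum rather than the mean filters out one-off noise from other processes; the first run also warms the OS page cache for the later ones.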

Convert TSV to Parquet in Python with chDB

If you would rather stay in Python, chDB is the same ClickHouse engine in-process. The SQL is identical: the same SELECT ... INTO OUTFILE ... FORMAT Parquet.

import chdb

chdb.query("""
SELECT * FROM file('events.tsv')
INTO OUTFILE 'events_chdb.parquet' TRUNCATE
FORMAT Parquet
""")

# Confirm the inferred schema landed in the Parquet file
print(chdb.query("DESCRIBE file('events_chdb.parquet')", "TSV"))
event_date	Nullable(Date32)
event_id	Nullable(Int64)
country	Nullable(String)
action	Nullable(String)
amount	Nullable(Float64)
sessions	Nullable(Int64)

No server, no upload, and you can wire the conversion straight into a pandas or pyarrow pipeline.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample TSVs (including the ~111 MB file used for the timing and codec comparison), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-tsv-to-parquet

Reverse direction? See how to convert Parquet to TSV.

The same SQL that converts a file here runs unchanged against a directory of files, a remote object store, and a ClickHouse server or ClickHouse Cloud when the data outgrows your laptop.
