How to convert JSONL to Parquet

Al Brown
Last updated: Jun 8, 2026

To convert JSONL to Parquet, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then read the JSONL and write it to Parquet in one pass:

clickhouse local -q "SELECT * FROM file('events.jsonl') INTO OUTFILE 'events.parquet' FORMAT Parquet"

Verify the schema of the Parquet file it created:

clickhouse local -q "DESCRIBE file('events.parquet')"

event_id	Nullable(Int64)
ts	Nullable(DateTime64(3, 'UTC'))
country	Nullable(String)
action	Nullable(String)
amount	Nullable(Float64)
is_member	Nullable(Int64)
device	Tuple(os Nullable(String), ver Nullable(String))

clickhouse local infers a type for every column from the JSON values and carries those types into the Parquet schema with no import step. The timestamp string became a real DateTime64, and the nested device object became a Parquet struct. The file is read and converted in place, with no upload or staging required.

What just happened

clickhouse local reads JSONL with the JSONEachRow format: one JSON object per line. It scans the values to infer a type per column, then carries those types into the Parquet schema. SELECT * is the data, INTO OUTFILE picks the destination, and FORMAT Parquet is the only format hint you need.
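To make the inference step concrete, here is a toy sketch in Python of scanning JSONL values and picking one type per column. It is an illustration only, not ClickHouse's actual algorithm: the function names and widening rules are assumptions, and it ignores things the real engine handles, such as Nullable wrapping, date detection in strings, and arrays.

```python
import json

def infer_type(value):
    """Map one JSON value to a ClickHouse-style type name (toy rules)."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "Bool"
    if isinstance(value, int):
        return "Int64"
    if isinstance(value, float):
        return "Float64"
    if isinstance(value, dict):   # nested object -> named tuple (a Parquet struct)
        fields = ", ".join(f"{k} {infer_type(v)}" for k, v in value.items())
        return f"Tuple({fields})"
    return "String"               # strings, and anything this sketch does not model

def infer_schema(lines):
    """Scan JSONL lines and return {column: type}, widening Int64 to Float64."""
    schema = {}
    for line in lines:
        if not line.strip():
            continue
        for key, value in json.loads(line).items():
            if value is None:
                continue          # nulls carry no type information
            new = infer_type(value)
            old = schema.get(key)
            if old is None or old == new:
                schema[key] = new
            elif {old, new} == {"Int64", "Float64"}:
                schema[key] = "Float64"
            else:
                schema[key] = "String"  # conflicting types: fall back to text
    return schema

sample = [
    '{"event_id": 1, "amount": 5, "device": {"os": "ios", "ver": "1"}}',
    '{"event_id": 2, "amount": 6.01, "device": {"os": "android", "ver": "2"}}',
]
schema = infer_schema(sample)
for name, type_name in schema.items():
    print(name, type_name)
```

The amount column starts as an integer on the first line and widens to Float64 on the second, which is the reason inference has to look at more than one row before settling on a type.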

Look at what was inferred from the JSONL:

clickhouse local -q "DESCRIBE file('events.jsonl')"

event_id	Nullable(Int64)
ts	Nullable(DateTime)
country	Nullable(String)
action	Nullable(String)
amount	Nullable(Float64)
is_member	Nullable(Int64)
device	Tuple(os Nullable(String), ver Nullable(String))

Now the Parquet file's schema:

clickhouse local -q "DESCRIBE file('events.parquet')"

event_id	Nullable(Int64)
ts	Nullable(DateTime64(3, 'UTC'))
country	Nullable(String)
action	Nullable(String)
amount	Nullable(Float64)
is_member	Nullable(Int64)
device	Tuple(os Nullable(String), ver Nullable(String))

Two things to notice. The timestamp string became a DateTime64 written as a proper Parquet timestamp, not text. And the nested device object survived: it is a Tuple on the ClickHouse side, which Parquet stores as a native struct. You do not have to flatten or stringify nested JSON to get it into Parquet.

Nested objects become Parquet structs

This is the part online converters tend to mangle. A JSONL line like {"device":{"os":"ios","ver":"1"}} keeps its shape in Parquet. Read the struct fields back with dotted paths:

clickhouse local -q "
SELECT event_id, device.os AS os, device.ver AS ver, amount
FROM file('events.parquet')
ORDER BY event_id LIMIT 5 FORMAT PrettyCompactMonoBlock"

   ┌─event_id─┬─os──────┬─ver─┬─amount─┐
1. │        1 │ ios     │ 1   │      5 │
2. │        2 │ android │ 2   │   6.01 │
3. │        3 │ web     │ 3   │   7.02 │
4. │        4 │ ios     │ 4   │   8.03 │
5. │        5 │ android │ 5   │   9.04 │
   └──────────┴─────────┴─────┴────────┘

If your objects are deeply or irregularly nested, see how to query nested JSON with SQL for ways to project them before writing Parquet.

Options worth knowing

These are the knobs you give up when you paste a file into a browser converter.

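Pick the compression codec with output_format_parquet_compression_method. snappy is fast; zstd usually produces smaller files. The default codec has changed between ClickHouse releases, so set it explicitly rather than relying on it: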

clickhouse local -q "
SELECT * FROM file('events.jsonl')
INTO OUTFILE 'events.parquet' TRUNCATE
FORMAT Parquet
SETTINGS output_format_parquet_compression_method = 'zstd'"

Tune the row group size. output_format_parquet_row_group_size (default 1,000,000) controls how many rows go in each row group, which affects how much a reader can skip.
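As a rough guide to how this setting changes the file layout, the sketch below estimates the number of row groups for a given row count. It assumes the writer splits purely by row count, which is a simplification: real writers may also cut a row group early based on its size in bytes. The helper name is hypothetical.

```python
import math

def row_group_count(num_rows, row_group_size=1_000_000):
    """Estimate row groups, assuming a split on row count only."""
    if num_rows <= 0:
        return 0
    return math.ceil(num_rows / row_group_size)

# The same 1,000,000 rows under three candidate settings.
for size in (1_000_000, 250_000, 100_000):
    print(f"row_group_size={size:>9,} -> {row_group_count(1_000_000, size)} row group(s)")
```

Smaller row groups give a reader more chances to skip data using per-group statistics, at the cost of more metadata and somewhat weaker compression.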

Pin the types. If inference guesses wrong (an ID that should stay a string, say), pass the format and an explicit schema to file() as the second and third arguments, then write that out:

clickhouse local -q "
SELECT * FROM file('events.jsonl', 'JSONEachRow',
  'event_id String, ts DateTime, country String, action String, amount Decimal(10,2), is_member Bool, device Tuple(os String, ver String)')
INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet"
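A long schema string is easy to mistype by hand. One option, sketched here in Python, is to keep the column types in a dict and generate the query from it. The helper names are hypothetical, and the sketch assumes trusted file names (it does no escaping on them); it only builds the SQL text, which you would then pass to clickhouse local -q or chdb.query.

```python
def schema_string(columns):
    """Join {name: type} into the 'name Type, ...' form that file() expects."""
    return ", ".join(f"{name} {type_name}" for name, type_name in columns.items())

def conversion_query(src, dst, columns):
    """Build the JSONL-to-Parquet query with an explicit schema."""
    # Escape single quotes so types such as DateTime64(3, 'UTC') survive
    # being embedded in a quoted SQL string literal.
    schema = schema_string(columns).replace("'", "\\'")
    return (
        f"SELECT * FROM file('{src}', 'JSONEachRow', '{schema}') "
        f"INTO OUTFILE '{dst}' TRUNCATE FORMAT Parquet"
    )

columns = {
    "event_id": "String",
    "ts": "DateTime",
    "country": "String",
    "action": "String",
    "amount": "Decimal(10,2)",
    "is_member": "Bool",
    "device": "Tuple(os String, ver String)",
}
query = conversion_query("events.jsonl", "events.parquet", columns)
print(query)
```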

Inspect the result. The ParquetMetadata format reads the footer without scanning the data, so you can confirm row groups and compression:

clickhouse local -q "
SELECT num_rows, num_row_groups, total_compressed_size, total_uncompressed_size
FROM file('events.parquet', ParquetMetadata) FORMAT Vertical"

Row 1:
──────
num_rows:                1000000
num_row_groups:          1
total_compressed_size:   9599064
total_uncompressed_size: 19915486

Size and speed

Typed columnar Parquet is dramatically smaller than line-delimited JSON, because column values compress together and field names are stored once in the schema instead of on every row. The 1,000,000-row sample goes from 137 MB of JSONL to 13.9 MB of Parquet, roughly a 10x reduction, before you even pick a heavier codec.

The conversion itself:

clickhouse local -q "SELECT * FROM file('events_large.jsonl') INTO OUTFILE 'events_large.parquet' TRUNCATE FORMAT Parquet"

The conversion takes ~0.50 seconds for the full 1,000,000-row, ~137 MB file (best of three with a warm OS page cache, on an Apple M4 Pro laptop with 14 cores and 24 GB RAM). Every JSON line is parsed from scratch; there is no cached table. clickhouse local runs the same SQL unchanged across dozens of formats and remote sources, and against a ClickHouse server or ClickHouse Cloud when the data outgrows your laptop.
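For reference, the size reduction and throughput implied by the figures above work out as follows. This is plain arithmetic on the numbers quoted in this section; expect different results on other hardware or data.

```python
# Figures quoted in this section.
jsonl_mb = 137.0      # input size
parquet_mb = 13.9     # output size
rows = 1_000_000
seconds = 0.50        # best-of-three wall time

ratio = jsonl_mb / parquet_mb
mb_per_s = jsonl_mb / seconds
rows_per_s = rows / seconds

print(f"size reduction: {ratio:.1f}x")
print(f"throughput: {mb_per_s:.0f} MB/s, {rows_per_s:,.0f} rows/s")
```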

Reverse direction? See how to convert Parquet to JSONL.

Convert with Python (chDB)

chDB is the same engine embedded in Python, so the conversion is the identical SQL:

import chdb

chdb.query(
    "SELECT * FROM file('events.jsonl') "
    "INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet"
)

# Verify the result back in-process
print(chdb.query("SELECT count() FROM file('events.parquet')", "CSV"))

20

The schema, the nested struct, and the timestamp type come out the same as the CLI. Install with pip install chdb.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample JSONL (including the ~137 MB file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-jsonl-to-parquet
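If you only want a small file to experiment with before cloning the repo, the following Python sketch writes a 20-row events.jsonl with the same column names as the schema shown earlier. It is a hypothetical stand-in, not a copy of generate.sh: the country and action values and the starting timestamp are assumptions, and only the device and amount pattern mirrors the five rows printed above.

```python
import json
from datetime import datetime, timedelta, timezone

OSES = ["ios", "android", "web"]
COUNTRIES = ["US", "DE", "JP"]            # assumed values, for illustration only
ACTIONS = ["view", "click", "purchase"]   # assumed values, for illustration only
START = datetime(2026, 1, 1, tzinfo=timezone.utc)  # assumed start time

def make_event(i):
    """Build one event dict; i starts at 1."""
    return {
        "event_id": i,
        "ts": (START + timedelta(seconds=i)).strftime("%Y-%m-%d %H:%M:%S"),
        "country": COUNTRIES[i % len(COUNTRIES)],
        "action": ACTIONS[i % len(ACTIONS)],
        "amount": round(5 + 1.01 * (i - 1), 2),
        "is_member": i % 2,
        "device": {"os": OSES[(i - 1) % len(OSES)], "ver": str(i)},
    }

def write_jsonl(path, n):
    """Write n events, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for i in range(1, n + 1):
            f.write(json.dumps(make_event(i)) + "\n")

write_jsonl("events.jsonl", 20)
```

Raise the row count passed to write_jsonl to produce a larger file for timing experiments.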
