How to convert NDJSON to Parquet

To convert an NDJSON file to Parquet, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then read the NDJSON and write it out as Parquet in one command:

clickhouse local -q "SELECT * FROM file('events.ndjson') INTO OUTFILE 'events.parquet' FORMAT Parquet"

The file is read in place with no import step. ClickHouse infers the schema from the NDJSON, writes a typed Parquet file, and streams rows so inputs larger than RAM are handled without loading everything into memory first.

NDJSON and JSONL are the same thing: one JSON object per line, sometimes with a .ndjson extension and sometimes .jsonl. ClickHouse reads both with the JSONEachRow format, and the file extension .ndjson is enough to pick it automatically. So the command above is all you need.
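To make "one JSON object per line" concrete, here is a minimal Python sketch, standard library only, that reads such input lazily. The helper name `iter_ndjson` and the two sample rows are illustrative assumptions, not part of ClickHouse; the JSONEachRow reader does this work for you during the conversion.

```python
import io
import json
from typing import Iterator, TextIO


def iter_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line (hypothetical helper)."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON: {exc}") from exc
        if not isinstance(obj, dict):
            raise ValueError(f"line {line_no} is not a JSON object")
        yield obj


# Two made-up rows; the same text is valid whether the file is named .ndjson or .jsonl.
sample = io.StringIO(
    '{"event_id":1,"user":{"country":"GB","plan":"free"},"items":[0]}\n'
    '{"event_id":2,"user":{"country":"US","plan":"pro"},"items":[1,2]}\n'
)

rows = list(iter_ndjson(sample))
print(len(rows), rows[1]["user"]["country"])
```

Because each line parses on its own, a reader never needs the whole file in memory, which is the property the streaming conversion relies on.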

The schema is inferred, including nested fields

Real NDJSON usually has nested objects and arrays, not just flat scalars. That is where a flat target like CSV loses information. Parquet keeps the structure, and so does this conversion. Take a line like:

{"event_id":1,"ts":"2026-01-01 00:00:00","event_type":"login","user":{"country":"GB","plan":"free"},"items":[0],"amount":5}

clickhouse-local infers the column types straight from the data. Check what it found with DESCRIBE:

clickhouse local -q "DESCRIBE file('events.ndjson')"

event_id	Nullable(Int64)
ts	Nullable(DateTime)
event_type	Nullable(String)
user	Tuple(country Nullable(String), plan Nullable(String))
items	Array(Nullable(Int64))
amount	Nullable(Float64)

The nested user object became a Tuple and the items array became an Array. Those are first-class Parquet types, so they survive the conversion. Describe the Parquet file you just wrote and you get the same shape back:

clickhouse local -q "DESCRIBE file('events.parquet')"

event_id	Nullable(Int64)
ts	Nullable(DateTime64(3, 'UTC'))
event_type	Nullable(String)
user	Tuple(country Nullable(String), plan Nullable(String))
items	Array(Nullable(Int64))
amount	Nullable(Float64)

Read the nested columns back out with dotted paths, no re-parsing of JSON text:

clickhouse local -q "SELECT event_id, user.country, user.plan, items FROM file('events.parquet') ORDER BY event_id LIMIT 5"

1	GB	free	[0]
2	US	pro	[1,2]
3	DE	team	[2,3,4]
4	FR	free	[3]
5	GB	pro	[4,5]

One cast to be aware of: the ts timestamp comes in from JSON as DateTime (second precision) and reads back from Parquet as DateTime64(3, 'UTC'). Parquet timestamps carry an explicit unit (milliseconds, microseconds, or nanoseconds), and this write uses milliseconds. The values are identical; only the declared type widens. If you need exact control over a column's type, pass the format name and an explicit structure as the second and third arguments to file() when you run the conversion.
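As a rough sketch of the explicit-schema route, the Python below builds the conversion query with the format and structure passed to file(). The helper names `sql_quote` and `convert_query` are hypothetical, and the column types are chosen only for illustration; adjust them to your data.

```python
def sql_quote(value: str) -> str:
    """Render a Python string as a single-quoted SQL string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def convert_query(src: str, dst: str, structure: str) -> str:
    """Build the NDJSON-to-Parquet query with an explicit structure for file()."""
    return (
        f"SELECT * FROM file({sql_quote(src)}, 'JSONEachRow', {sql_quote(structure)}) "
        f"INTO OUTFILE {sql_quote(dst)} TRUNCATE FORMAT Parquet"
    )


# Illustrative types only: non-nullable columns and a millisecond UTC timestamp.
structure = (
    "event_id UInt64, ts DateTime64(3, 'UTC'), event_type String, "
    "user Tuple(country String, plan String), items Array(UInt32), amount Float64"
)

query = convert_query("events.ndjson", "events.parquet", structure)
print(query)
```

The printed string can be passed to clickhouse local -q or to chdb.query(). Declaring the structure skips inference, so a row that does not fit the declared types fails the conversion instead of silently widening a column.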

Options worth knowing

An upload-based online converter gives you one fixed output. Doing it locally means you control the file. The two settings you will actually reach for:

Compression codec. ClickHouse compresses Parquet output with ZSTD by default. Override it per write:

clickhouse local -q "
SELECT * FROM file('events.ndjson')
INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method='snappy'"

ZSTD packs smaller; Snappy decodes a touch faster and is the safe default for older readers. On the 1,000,000-row file below, ZSTD produced a 16 MB Parquet and Snappy 21 MB, both from 140 MB of NDJSON text.

Row group size. output_format_parquet_row_group_size (default 1,000,000 rows) controls how many rows go in each row group. Smaller groups give readers finer-grained row-group pruning; larger groups compress slightly better. Leave it alone unless a downstream reader asks for a specific size.
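Both settings can go in one SETTINGS clause. The sketch below assembles such a query in Python; `settings_clause` is a hypothetical helper, and the row group size of 100000 is an arbitrary value picked for the example, not a recommendation.

```python
def settings_clause(settings: dict) -> str:
    """Render a dict of setting names to str/int values as a SETTINGS clause."""
    parts = []
    for name, value in settings.items():
        if isinstance(value, str):
            rendered = "'" + value + "'"  # string settings are quoted
        else:
            rendered = str(value)  # integer settings are written bare
        parts.append(f"{name}={rendered}")
    return "SETTINGS " + ", ".join(parts)


query = (
    "SELECT * FROM file('events.ndjson') "
    "INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet "
    + settings_clause(
        {
            "output_format_parquet_compression_method": "snappy",
            "output_format_parquet_row_group_size": 100000,  # arbitrary example value
        }
    )
)
print(query)
```

Keeping the settings in a dict makes it easy to write the same data with two codecs and compare the resulting file sizes.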

You can also inspect what was written. The ParquetMetadata format reads the footer:

clickhouse local -q "
SELECT num_columns, num_rows, num_row_groups, columns.name, columns.compression
FROM file('events.parquet', ParquetMetadata) FORMAT Vertical"

num_columns:         7
num_rows:            20
num_row_groups:      1
columns.name:        ['event_id','ts','event_type','country','plan','element','amount']
columns.compression: ['ZSTD','ZSTD','ZSTD','ZSTD','ZSTD','ZSTD','ZSTD']

Note the seven physical columns from six logical ones: the user tuple is stored as two leaf columns (country, plan) and the array as an element column. That is how Parquet lays out nested types on disk, and it is exactly what makes columnar reads of user.country cheap later.
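The six-to-seven count can be reproduced with a small, simplified model. In this Python sketch a dict stands for a Tuple, a one-element list for an Array, and a string for a scalar; that encoding and the `leaf_names` helper are assumptions made for illustration, and the short leaf names simply mirror the metadata output above rather than full Parquet column paths.

```python
# Logical schema from the DESCRIBE output, modelled as plain Python data.
schema = {
    "event_id": "Int64",
    "ts": "DateTime64(3)",
    "event_type": "String",
    "user": {"country": "String", "plan": "String"},  # Tuple -> one leaf per field
    "items": ["Int64"],                                # Array -> one "element" leaf
    "amount": "Float64",
}


def leaf_names(node, name):
    """Return the physical leaf column names under a logical node."""
    if isinstance(node, dict):
        leaves = []
        for child_name, child in node.items():
            leaves.extend(leaf_names(child, child_name))
        return leaves
    if isinstance(node, list):
        return leaf_names(node[0], "element")
    return [name]


leaves = leaf_names(schema, "root")
print(len(schema), "logical columns ->", len(leaves), "leaf columns")
print(leaves)
```

Only leaves hold data on disk, so a query that touches user.country reads one leaf column and skips the rest.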

How fast is it, and how small?

The conversion streams: rows are read from the NDJSON and written to Parquet without materialising the whole file, so it works on inputs larger than RAM.

clickhouse local -q "SELECT * FROM file('events_large.ndjson') INTO OUTFILE 'events_large.parquet' TRUNCATE FORMAT Parquet"

Converting events_large.ndjson (1,000,000 rows, ~140 MB) to Parquet takes ~0.47 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). The output is 16 MB, about a 9x reduction, because Parquet is columnar, typed, and ZSTD-compressed, whereas NDJSON repeats every key as plain text on every line. The next tool that reads it scans far less data than it would from the NDJSON.
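For reference, the ratios and throughput implied by those figures work out as follows. The inputs are the rounded numbers quoted on this page, so treat the results as approximate.

```python
# Figures quoted in this guide (rounded, so the results are approximate).
rows = 1_000_000
ndjson_mb = 140
seconds = 0.47
parquet_mb = {"zstd": 16, "snappy": 21}

# Size reduction per codec: input size divided by output size.
ratios = {codec: ndjson_mb / mb for codec, mb in parquet_mb.items()}
for codec, ratio in ratios.items():
    print(f"{codec}: {ratio:.2f}x smaller than the NDJSON")

# Throughput of the timed conversion.
rows_per_second = rows / seconds
mb_per_second = ndjson_mb / seconds
print(f"{rows_per_second:,.0f} rows/s, {mb_per_second:.0f} MB/s of NDJSON read")
```

That is 8.75x for ZSTD (the "about 9x" above), roughly 6.7x for Snappy, and a little over 2 million rows per second on the quoted hardware.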

Reverse direction? To go back the other way, see convert Parquet to NDJSON.

The same conversion in Python (chDB)

If you would rather stay in Python, chDB is the same ClickHouse engine in-process. The SQL is identical:

import chdb

chdb.query("""
SELECT * FROM file('events.ndjson')
INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet
""")

# read the nested columns straight back as a pandas DataFrame
df = chdb.query("""
SELECT event_id, user.country AS country, user.plan AS plan, items
FROM file('events.parquet') ORDER BY event_id LIMIT 5
""", "DataFrame")
print(df)

   event_id country  plan      items
0         1      GB  free        [0]
1         2      US   pro     [1, 2]
2         3      DE  team  [2, 3, 4]
3         4      FR  free        [3]
4         5      GB   pro     [4, 5]

No server, no extra dependency to write Parquet — chDB handles the read and the write. pyarrow can also write Parquet from a DataFrame, and is a fair choice when your data already lives in pandas; the advantage here is that one engine infers the NDJSON schema, preserves the nesting, and writes the Parquet in a single step.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample NDJSON (including the ~140 MB file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-ndjson-to-parquet
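If you just want a file of the same shape without cloning the repo, a rough Python stand-in is sketched below. It is not the repo's generate.sh: the cycling pattern is inferred from the five rows shown earlier on this page, and the ts values, the event types other than "login", and the amount values are invented for illustration.

```python
import json
from datetime import datetime, timedelta

COUNTRIES = ["GB", "US", "DE", "FR"]
PLANS = ["free", "pro", "team"]
EVENT_TYPES = ["login", "view", "purchase"]  # only "login" appears in the guide
START = datetime(2026, 1, 1)


def make_event(i: int) -> dict:
    """Build event number i (1-based) in the shape of the sample line."""
    k = i - 1
    return {
        "event_id": i,
        "ts": (START + timedelta(seconds=k)).strftime("%Y-%m-%d %H:%M:%S"),
        "event_type": EVENT_TYPES[k % len(EVENT_TYPES)],
        "user": {
            "country": COUNTRIES[k % len(COUNTRIES)],
            "plan": PLANS[k % len(PLANS)],
        },
        "items": list(range(k, k + k % 3 + 1)),  # 1 to 3 items per event
        "amount": 5 + k / 4,                      # invented values
    }


def write_ndjson(path: str, n: int) -> None:
    """Write n events to path, one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as out:
        for i in range(1, n + 1):
            out.write(json.dumps(make_event(i), separators=(",", ":")) + "\n")


# Preview the first five rows instead of writing a file.
for i in range(1, 6):
    print(json.dumps(make_event(i), separators=(",", ":")))
```

Calling write_ndjson("events.ndjson", 20) produces a 20-row file you can feed to the commands above; a larger n gives you something to time.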

The same SQL scales from this file on your laptop to a ClickHouse server or ClickHouse Cloud when the data outgrows it, with no rewrite. Related: how to query a Parquet file, what is Parquet, and the JSONL to Parquet version of this guide.
