How to convert JSON to Parquet

To convert JSON to Parquet, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then convert the file:

1clickhouse local -q "SELECT * FROM file('events.json') INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet"

1event_id	Nullable(Int64)
2ts	Nullable(DateTime64(3, 'UTC'))
3event_type	Nullable(String)
4geo	Tuple(city Nullable(String), country Nullable(String))
5tags	Array(Nullable(String))
6amount	Nullable(Float64)

INTO OUTFILE writes the Parquet file directly to disk and produces no further output. The JSON is read in place with no import step, schema is inferred automatically, and the resulting Parquet carries the same column names and types shown above.

How the schema is inferred

ClickHouse reads the JSON and works out a column name and type for each field. Check what it found with DESCRIBE:

1clickhouse local -q "DESCRIBE file('events.json')"

1event_id	Nullable(Int64)
2ts	Nullable(DateTime)
3event_type	Nullable(String)
4geo	Tuple(city Nullable(String), country Nullable(String))
5tags	Array(Nullable(String))
6amount	Nullable(Float64)

The numbers became Int64 and Float64, the timestamp became DateTime, and the strings stayed strings. The interesting part is geo: it was a nested JSON object, and it was inferred as a typed Tuple (a struct) rather than flattened into text. The tags field, a JSON array, became a typed Array. Those types carry straight into Parquet.

Nested JSON becomes typed Parquet columns

This is where converting with clickhouse-local differs from a naive flatten-to-CSV. Parquet has a nested type system, so the geo object does not have to be stringified or dropped. Look at the schema of the Parquet file you just wrote:

1clickhouse local -q "DESCRIBE file('events.parquet')"

1event_id	Nullable(Int64)
2ts	Nullable(DateTime64(3, 'UTC'))
3event_type	Nullable(String)
4geo	Tuple(city Nullable(String), country Nullable(String))
5tags	Array(Nullable(String))
6amount	Nullable(Float64)

The geo tuple and the tags array survived the round trip. Inside the Parquet file they are stored as proper nested columns. The ParquetMetadata format shows the physical layout. Note the dotted paths:

1clickhouse local -q "
2SELECT c.1 AS column, c.2 AS parquet_path, c.3 AS physical_type
3FROM file('events.parquet', ParquetMetadata) ARRAY JOIN columns AS c"

1event_id	event_id	1
2ts	ts	1
3event_type	event_type	1
4city	geo.city	1
5country	geo.country	1
6element	tags.list.element	2
7amount	amount	1

geo.city and geo.country are separate columns under the geo group, and tags.list.element is Parquet's standard encoding for a repeated field. Any reader that understands Parquet (pandas, pyarrow, Spark, a ClickHouse server) gets the structure back, not a wall of escaped JSON. You can query the nested fields by name straight away:

1clickhouse local -q "
2SELECT event_id, geo.country AS country, geo.city AS city, tags, amount
3FROM file('events.parquet') ORDER BY event_id LIMIT 5"

11	GB	London	[]	1
22	US	NYC	['mobile']	2.01
33	DE	Berlin	['mobile','web']	3.02
44	FR	Paris	[]	4.03
55	IN	Mumbai	['mobile']	5.04

One note on lossiness: if a JSON field has mixed or unpredictable shapes across rows (sometimes an object, sometimes a scalar), inference cannot pin a single type and may read it as a string. That is rare in machine-generated JSON. For deeply dynamic documents, the JSON data type keeps the structure addressable; for everything else, the inferred tuples and arrays above are exactly what you want.

Options worth knowing

This is the information you do not get from an upload-and-download converter site.

Pick the compression codec. Parquet defaults to zstd here, which gives the smallest files. Override it per write:

1clickhouse local -q "SELECT * FROM file('events.json')
2  INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet
3  SETTINGS output_format_parquet_compression_method = 'snappy'"

On the 1,000,000-row sample file, the codec choice is visible on disk:

1events_zstd.parquet           7,849,144 bytes
2events_snappy.parquet        13,951,649 bytes
3events_uncompressed.parquet  22,969,357 bytes

snappy is faster to decode and is the most widely compatible default in the wider ecosystem; zstd is smaller. Use none only if a downstream tool insists on it. You can also set output_format_parquet_row_group_size to tune row-group size for the readers you care about.

Pin the types when inference guesses wrong. Pass an explicit structure as the second and third arguments to file(), for example to keep an ID a string instead of an integer:

1clickhouse local -q "SELECT * FROM file('events.json', 'JSONEachRow',
2  'event_id String, ts DateTime, event_type String,
3   geo Tuple(city String, country String), tags Array(String), amount Float64')
4INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet"

Filter or reshape on the way out. It is a SELECT, so the full SQL surface is available before the data lands in Parquet — project columns, filter rows, flatten the tuple into top-level columns, cast types, or aggregate:

1clickhouse local -q "
2SELECT event_id, geo.country AS country, amount
3FROM file('events.json')
4WHERE event_type = 'purchase'
5INTO OUTFILE 'purchases.parquet' TRUNCATE FORMAT Parquet"

Why convert at all

JSON is convenient to produce but expensive to read repeatedly: every query re-parses text and re-infers types, and the files are large. Parquet is columnar, typed, and compressed, so downstream reads only touch the columns and row groups they need. On the sample data the difference is stark: the 1,000,000-row events_large.json is ~133 MB, and the zstd Parquet version is ~7.8 MB — about 17x smaller, with the schema baked in.

How fast is it?

Converting the 1,000,000-row, ~133 MB events_large.json (nested object and array per row) to Parquet:

1clickhouse local -q "SELECT * FROM file('events_large.json') INTO OUTFILE 'events_large.parquet' TRUNCATE FORMAT Parquet"

~0.57 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). That includes parsing every JSON line from scratch and re-encoding it as compressed, columnar Parquet. The number is best-of-three and can be nudged by other load on the machine, but the operation is single-pass and streaming, so it scales with file size rather than blowing up memory.

The chDB equivalent in Python

If you would rather stay in Python, chDB is the same ClickHouse engine in-process. The conversion is the identical SQL, and you can read the result straight back into a pandas DataFrame:

1import chdb
2
3# Convert JSON -> Parquet (same SELECT ... INTO OUTFILE ... FORMAT Parquet)
4chdb.query(
5    "SELECT * FROM file('events.json') "
6    "INTO OUTFILE 'events_chdb.parquet' TRUNCATE FORMAT Parquet"
7)
8
9# Read the typed columns back, nested fields included, as a DataFrame
10df = chdb.query(
11    "SELECT geo.country AS country, count() AS events, round(sum(amount), 2) AS total "
12    "FROM file('events_chdb.parquet') GROUP BY country ORDER BY total DESC",
13    "DataFrame",
14)
15print(df)

1country  events  total
20      AU       2  18.16
31      IN       2  16.14
42      FR       2  14.12
53      DE       2  12.10
64      US       2  10.08
75      GB       2   8.06

No server, no temporary table, and the Parquet file is portable to any other tool. For more on reading the result, see how to read a Parquet file in Python with chDB.

Reverse direction?

Going the other way is just as quick: see how to convert Parquet to JSON.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample JSON (including the 1,000,000-row file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-json-to-parquet

The same SQL that converts a file on your laptop runs unchanged against a ClickHouse server or ClickHouse Cloud when the data outgrows it. Related reading: what is a Parquet file, how to query a Parquet file, and run SQL on a JSON file.

How the schema is inferred

Nested JSON becomes typed Parquet columns

Options worth knowing

Why convert at all

How fast is it?

The chDB equivalent in Python

Reverse direction?

Run it yourself

Subscribe to our newsletter

More like this

How to engineer cost-efficient open source observability with ClickHouse (ClickStack) - 2026 technical playbook

Build a dashboard in Python with ClickHouse and Streamlit

Instrumenting OpenAI with OpenTelemetry (OTel)

Real-time data visualization

How to convert JSON to Parquet

How the schema is inferred #

Nested JSON becomes typed Parquet columns #

Options worth knowing #

Why convert at all #

How fast is it? #

The chDB equivalent in Python #

Reverse direction? #

Run it yourself #

Subscribe to our newsletter

More like this

How to engineer cost-efficient open source observability with ClickHouse (ClickStack) - 2026 technical playbook

Build a dashboard in Python with ClickHouse and Streamlit

Instrumenting OpenAI with OpenTelemetry (OTel)

Real-time data visualization

How the schema is inferred

Nested JSON becomes typed Parquet columns

Options worth knowing

Why convert at all

How fast is it?

The chDB equivalent in Python

Reverse direction?

Run it yourself