To convert JSON to Parquet, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.
Install it with clickhousectl:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
clickhouse local -q "SELECT * FROM file('events.json') INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet"
event_id Nullable(Int64)
ts Nullable(DateTime64(3, 'UTC'))
event_type Nullable(String)
geo Tuple(city Nullable(String), country Nullable(String))
tags Array(Nullable(String))
amount Nullable(Float64)
INTO OUTFILE writes the Parquet file directly to disk and produces no further output. The JSON is read in place with no import step, schema is inferred automatically, and the resulting Parquet carries the same column names and types shown above.
ClickHouse reads the JSON and works out a column name and type for each field. Check what it found with DESCRIBE:
clickhouse local -q "DESCRIBE file('events.json')"
event_id Nullable(Int64)
ts Nullable(DateTime)
event_type Nullable(String)
geo Tuple(city Nullable(String), country Nullable(String))
tags Array(Nullable(String))
amount Nullable(Float64)
The numbers became Int64 and Float64, the timestamp became DateTime, and the strings stayed strings. The interesting part is geo: it was a nested JSON object, and it was inferred as a typed Tuple (a struct) rather than flattened into text. The tags field, a JSON array, became a typed Array. Those types carry straight into Parquet.
Nested JSON becomes typed Parquet columns
This is where converting with clickhouse-local differs from a naive flatten-to-CSV. Parquet has a nested type system, so the geo object does not have to be stringified or dropped. Look at the schema of the Parquet file you just wrote:
clickhouse local -q "DESCRIBE file('events.parquet')"
event_id Nullable(Int64)
ts Nullable(DateTime64(3, 'UTC'))
event_type Nullable(String)
geo Tuple(city Nullable(String), country Nullable(String))
tags Array(Nullable(String))
amount Nullable(Float64)
The geo tuple and the tags array survived the round trip. Inside the Parquet file they are stored as proper nested columns. The ParquetMetadata format shows the physical layout. Note the dotted paths:
clickhouse local -q "
SELECT c.1 AS column, c.2 AS parquet_path, c.3 AS physical_type
FROM file('events.parquet', ParquetMetadata) ARRAY JOIN columns AS c"
event_id event_id 1
ts ts 1
event_type event_type 1
city geo.city 1
country geo.country 1
element tags.list.element 2
amount amount 1
geo.city and geo.country are separate columns under the geo group, and tags.list.element is Parquet's standard encoding for a repeated field. Any reader that understands Parquet (pandas, pyarrow, Spark, a ClickHouse server) gets the structure back, not a wall of escaped JSON. You can query the nested fields by name straight away:
clickhouse local -q "
SELECT event_id, geo.country AS country, geo.city AS city, tags, amount
FROM file('events.parquet') ORDER BY event_id LIMIT 5"
1 GB London [] 1
2 US NYC ['mobile'] 2.01
3 DE Berlin ['mobile','web'] 3.02
4 FR Paris [] 4.03
5 IN Mumbai ['mobile'] 5.04
One note on lossiness: if a JSON field has mixed or unpredictable shapes across rows (sometimes an object, sometimes a scalar), inference cannot pin a single type and may read it as a string. That is rare in machine-generated JSON. For deeply dynamic documents, the JSON data type keeps the structure addressable; for everything else, the inferred tuples and arrays above are exactly what you want.
This is the information you do not get from an upload-and-download converter site.
Pick the compression codec. Parquet defaults to zstd here, which gives the smallest files. Override it per write:
clickhouse local -q "SELECT * FROM file('events.json')
INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet
SETTINGS output_format_parquet_compression_method = 'snappy'"
On the 1,000,000-row sample file, the codec choice is visible on disk:
events_zstd.parquet 7,849,144 bytes
events_snappy.parquet 13,951,649 bytes
events_uncompressed.parquet 22,969,357 bytes
snappy is faster to decode and is the most widely compatible default in the wider ecosystem; zstd is smaller. Use none only if a downstream tool insists on it. You can also set output_format_parquet_row_group_size to tune row-group size for the readers you care about.
Pin the types when inference guesses wrong. Pass an explicit structure as the second and third arguments to file(), for example to keep an ID a string instead of an integer:
clickhouse local -q "SELECT * FROM file('events.json', 'JSONEachRow',
'event_id String, ts DateTime, event_type String,
geo Tuple(city String, country String), tags Array(String), amount Float64')
INTO OUTFILE 'events.parquet' TRUNCATE FORMAT Parquet"
Filter or reshape on the way out. It is a SELECT, so the full SQL surface is available before the data lands in Parquet — project columns, filter rows, flatten the tuple into top-level columns, cast types, or aggregate:
clickhouse local -q "
SELECT event_id, geo.country AS country, amount
FROM file('events.json')
WHERE event_type = 'purchase'
INTO OUTFILE 'purchases.parquet' TRUNCATE FORMAT Parquet"
JSON is convenient to produce but expensive to read repeatedly: every query re-parses text and re-infers types, and the files are large. Parquet is columnar, typed, and compressed, so downstream reads only touch the columns and row groups they need. On the sample data the difference is stark: the 1,000,000-row events_large.json is ~133 MB, and the zstd Parquet version is ~7.8 MB — about 17x smaller, with the schema baked in.
Converting the 1,000,000-row, ~133 MB events_large.json (nested object and array per row) to Parquet:
clickhouse local -q "SELECT * FROM file('events_large.json') INTO OUTFILE 'events_large.parquet' TRUNCATE FORMAT Parquet"
~0.57 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). That includes parsing every JSON line from scratch and re-encoding it as compressed, columnar Parquet. The number is best-of-three and can be nudged by other load on the machine, but the operation is single-pass and streaming, so it scales with file size rather than blowing up memory.
If you would rather stay in Python, chDB is the same ClickHouse engine in-process. The conversion is the identical SQL, and you can read the result straight back into a pandas DataFrame:
import chdb
# Convert JSON -> Parquet (same SELECT ... INTO OUTFILE ... FORMAT Parquet)
chdb.query(
"SELECT * FROM file('events.json') "
"INTO OUTFILE 'events_chdb.parquet' TRUNCATE FORMAT Parquet"
)
# Read the typed columns back, nested fields included, as a DataFrame
df = chdb.query(
"SELECT geo.country AS country, count() AS events, round(sum(amount), 2) AS total "
"FROM file('events_chdb.parquet') GROUP BY country ORDER BY total DESC",
"DataFrame",
)
print(df)
country events total
0 AU 2 18.16
1 IN 2 16.14
2 FR 2 14.12
3 DE 2 12.10
4 US 2 10.08
5 GB 2 8.06
No server, no temporary table, and the Parquet file is portable to any other tool. For more on reading the result, see how to read a Parquet file in Python with chDB.
Going the other way is just as quick: see how to convert Parquet to JSON.
The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample JSON (including the 1,000,000-row file used for the timing above), run.sh with every command on this page, run.py / run.ipynb for the chDB version, and expected_output.txt.
github.com/ClickHouse/examples/tree/main/local-analytics/convert-json-to-parquet
The same SQL that converts a file on your laptop runs unchanged against a ClickHouse server or ClickHouse Cloud when the data outgrows it. Related reading: what is a Parquet file, how to query a Parquet file, and run SQL on a JSON file.