How to query a JSON file with SQL

To query a JSON file with SQL, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then point it at your JSON file with the JSONEachRow format:

1clickhouse local -q "SELECT * FROM file('events.jsonl', JSONEachRow) LIMIT 5"

1Row 1:
2──────
3event_id:   1
4event_type: signup
5amount:     1
6geo:        {
7  "city": "London",
8  "country": "GB"
9}
10tags:       ['mobile']

JSONEachRow reads one object per line directly from disk, inferring column names and types without a schema declaration or an import step. Nested objects become Tuple columns and arrays become Array columns, so you can query their members in the same SELECT.

Prefer Python? See How to read a JSON file in Python and query it with SQL for the same queries against a pandas DataFrame.

JSONL and a top-level array read the same way

JSON on disk comes in two common shapes. Line-delimited JSON (also called JSONL or NDJSON) has one object per line:

1{"event_id":1,"event_type":"signup","amount":1,"geo":{"country":"GB","city":"London"},"tags":["mobile"]}
2{"event_id":2,"event_type":"click","amount":2.01,"geo":{"country":"US","city":"NYC"},"tags":["web","beta"]}

The other shape is a single top-level array, [{...}, {...}, ...]. Both read with the same JSONEachRow format. ClickHouse detects the leading [ and parses the array element-by-element, so you do not load the whole file into memory and you do not need a different command:

1clickhouse local -q "SELECT count() FROM file('events.json', JSONEachRow)"

JSONEachRow is the format to reach for in both cases. (ClickHouse also has a JSON format, but that one is for its own metadata-wrapped output, not for reading arbitrary array files.)

See the schema without declaring one

DESCRIBE shows you what ClickHouse inferred, including how it handled the nested object and the array:

1clickhouse local -q "DESCRIBE file('events.jsonl', JSONEachRow) FORMAT PrettyCompact"

1Row 4:
2──────
3name:  geo
4type:  Tuple(
5    city Nullable(String),
6    country Nullable(String))
7
8Row 5:
9──────
10name:  tags
11type:  Array(Nullable(String))

A nested JSON object becomes a Tuple with named fields. A JSON array becomes an Array. Scalars (event_id, event_type, amount) come back as Nullable(Int64), Nullable(String) and Nullable(Float64). No CREATE TABLE, no schema file.

Reach into nested objects with dot access

Because geo is a Tuple with named fields, you read its members with dot notation. There is no JSON path language to learn and no per-row parsing function to call; the field is a column:

1clickhouse local -q "
2SELECT event_id, geo.country AS country, geo.city AS city
3FROM file('events.jsonl', JSONEachRow)
4LIMIT 5
5FORMAT PrettyCompact"

1┌─event_id─┬─country─┬─city───┐
21. │        1 │ GB      │ London │
32. │        2 │ US      │ NYC    │
43. │        3 │ DE      │ Berlin │
54. │        4 │ FR      │ Paris  │
65. │        5 │ IN      │ Mumbai │
7   └──────────┴─────────┴────────┘

Nested fields group and aggregate like any other column:

1clickhouse local -q "
2SELECT geo.country AS country, count() AS events, round(sum(amount), 2) AS total
3FROM file('events.jsonl', JSONEachRow)
4GROUP BY country
5ORDER BY total DESC
6FORMAT PrettyCompact"

1┌─country─┬─events─┬─total─┐
21. │ IN      │      4 │ 50.46 │
32. │ FR      │      4 │ 46.42 │
43. │ DE      │      4 │ 42.38 │
54. │ US      │      4 │ 38.34 │
65. │ GB      │      4 │  34.3 │
7   └─────────┴────────┴───────┘

For deeper or irregular nesting, see how to query nested JSON with SQL.

Explode an array column with ARRAY JOIN

The tags column holds an array per row. To count over the individual elements, unfold the array with ARRAY JOIN: it produces one output row per array element, then you aggregate as usual:

1clickhouse local -q "
2SELECT tag, count() AS events
3FROM file('events.jsonl', JSONEachRow)
4ARRAY JOIN tags AS tag
5GROUP BY tag
6ORDER BY events DESC
7FORMAT PrettyCompact"

1┌─tag────┬─events─┐
21. │ beta   │     10 │
32. │ vip    │     10 │
43. │ mobile │     10 │
54. │ web    │      9 │
6   └────────┴────────┘

This is the JSON workflow that viewers and jq one-liners make awkward: a real GROUP BY over array members. ClickHouse covers the same ground with one binary that also reads CSV, Parquet, ORC, Arrow and many other formats, and runs the same SQL unchanged against a server or ClickHouse Cloud when the data outgrows your machine.

Gzipped JSON works too

You do not unzip anything first. clickhouse local detects the .gz extension and decompresses on the fly, so a .jsonl.gz file is queried exactly like a .jsonl:

1clickhouse local -q "
2SELECT event_type, count()
3FROM file('events.jsonl.gz', JSONEachRow)
4GROUP BY event_type
5ORDER BY event_type
6FORMAT PrettyCompact"

1┌─event_type─┬─count()─┐
21. │ click      │       5 │
32. │ purchase   │       5 │
43. │ refund     │       5 │
54. │ signup     │       5 │
6   └────────────┴─────────┘

The same applies to .jsonl.zst, .jsonl.xz and other supported codecs. The compression is inferred from the file name.

How fast is it?

JSON is text, so every query parses it from scratch; there is no cached table. That parse is the work, and it runs across all cores. On a 3,000,000-row, ~314 MB JSONL file (events_large.jsonl), grouping by the nested geo.country field with a sum over every row completes in:

1clickhouse local -q "
2SELECT geo.country AS country, count() AS events, round(sum(amount), 2) AS total
3FROM file('events_large.jsonl', JSONEachRow)
4GROUP BY country
5ORDER BY total DESC"

~0.66 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That includes parsing the JSON text and inferring the schema on every run. If you query the same data repeatedly, convert it once to a columnar format and read that instead: see convert JSON to Parquet.

The same SQL scales unchanged

The query you just ran against a file on your laptop is the same SQL you would run on a ClickHouse server, or in ClickHouse Cloud. Nothing about SELECT ... ARRAY JOIN ... GROUP BY changes. You swap file('events.jsonl', JSONEachRow) for a table name and the rest stays put. You prototype against a file and ship the identical logic to production.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo. It has generate.sh (builds the sample JSONL, the top-level-array copy, a gzipped copy, and the ~314 MB perf file), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-json

1git clone https://github.com/ClickHouse/examples
2cd examples/local-analytics/clickhouse-local-json
3./generate.sh && ./run.sh

JSONL and a top-level array read the same way

See the schema without declaring one

Reach into nested objects with dot access

Explode an array column with ARRAY JOIN

Gzipped JSON works too

How fast is it?

The same SQL scales unchanged

Run it yourself

Subscribe to our newsletter

More like this

How to engineer cost-efficient open source observability with ClickHouse (ClickStack) - 2026 technical playbook

Build a dashboard in Python with ClickHouse and Streamlit

Instrumenting OpenAI with OpenTelemetry (OTel)

Real-time data visualization

How to query a JSON file with SQL

JSONL and a top-level array read the same way #

See the schema without declaring one #

Reach into nested objects with dot access #

Explode an array column with ARRAY JOIN #

Gzipped JSON works too #

How fast is it? #

The same SQL scales unchanged #

Run it yourself #

Related #

Subscribe to our newsletter

More like this

How to engineer cost-efficient open source observability with ClickHouse (ClickStack) - 2026 technical playbook

Build a dashboard in Python with ClickHouse and Streamlit

Instrumenting OpenAI with OpenTelemetry (OTel)

Real-time data visualization

JSONL and a top-level array read the same way

See the schema without declaring one

Reach into nested objects with dot access

Explode an array column with ARRAY JOIN

Gzipped JSON works too

How fast is it?

The same SQL scales unchanged

Run it yourself

Related