

The world’s fastest tool for querying JSON files

Pavel Kruglov
Oct 14, 2022


Introduction

At ClickHouse, we’re passionate about benchmarks and performance. So when I first saw the Hacker News post “The fastest tool for querying large JSON files is written in Python”, my first thought was: “But clickhouse-local is not written in Python.” Let’s look at this benchmark and demonstrate that clickhouse-local is actually the fastest tool for querying large JSON files.

clickhouse-local

clickhouse-local is a single binary that lets you perform fast processing on local and remote files using SQL: effectively database features without a database. As well as supporting the full breadth of ClickHouse functions, it handles JSON as one of many supported file formats. Below we visualize the differences between a ClickHouse cluster, a single ClickHouse instance, and clickhouse-local:

clickhouse-local.png
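As a minimal sketch of what this looks like in practice (the file name reviews.json and the JSONEachRow format are assumptions for illustration, not taken from the benchmark), a single command is enough to run SQL over a newline-delimited JSON file:

    # count the rows of a newline-delimited JSON file: no server, no schema, one binary
    $ clickhouse-local --query "
        SELECT count()
        FROM file('reviews.json', JSONEachRow)"

The file() table function also accepts globs and compressed files (e.g. 'reviews*.json.gz'), which is convenient when a dataset is split across many files.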

The Benchmark

Daniel Moura, the creator of SPyQL, posted a benchmark as part of the SPyQL project. The benchmark compares several command-line tools with a focus on querying large files that fit on the disk of a standard machine but might not fit into memory. A 10GB subset of the Amazon book reviews dataset was used as the test dataset. The original list of tools used in the benchmark was SPyQL, jq, trdsql, Miller, OctoSQL, spark-sql, Pandas, and DSQ. The benchmark itself focused on 3 challenges, each sketched as a clickhouse-local query after the list:

  • Map: a new column is calculated for all rows (both input and output are large). This represents a common ETL-like task required for data cleansing and enrichment.
  • Aggregation/Reduce: the average of all rows of a column is calculated (only the input is large). Useful to users who need a quick analytical answer and want to avoid the steps of loading the data into a data store such as ClickHouse.
  • Subset/Filter: the first 100 values of a column matching filter criteria are returned (only a part of the input needs to be processed). The objective of this test is to assess the tool's ability to sample files quickly and represents a task users often perform before iterating on a query.
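
The following queries sketch each challenge. They are illustrative only, assuming a JSONEachRow file reviews.json with hypothetical column names (overall, reviewText) loosely modeled on the Amazon reviews dataset; they are not the benchmark’s exact queries.

    # Map: derive a new column for every row and write the enriched rows back out
    $ clickhouse-local --query "
        SELECT *, length(reviewText) AS review_length
        FROM file('reviews.json', JSONEachRow)
        FORMAT JSONEachRow" > enriched.json

    # Aggregation/Reduce: a single scalar answer over the whole file
    $ clickhouse-local --query "
        SELECT avg(overall)
        FROM file('reviews.json', JSONEachRow)"

    # Subset/Filter: first 100 matches; reading can stop once the LIMIT is satisfied
    $ clickhouse-local --query "
        SELECT reviewText
        FROM file('reviews.json', JSONEachRow)
        WHERE overall = 5
        LIMIT 100"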

The first set of results for this benchmark showed SPyQL as the fastest tool for querying large JSON files:

initial_results.png

But Daniel was unaware of clickhouse-local. To assist with this task, ClickHouse recently introduced two features that make it straightforward to process JSON files and reproduce this benchmark: support for semi-structured data and automatic schema inference. The latter allows ClickHouse to infer the types of columns from the data itself, so the user is not required to specify the structure of JSON files and the type of each field: simplifying the syntax and accelerating the getting-started experience.
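
Schema inference is easy to see for yourself. As a small sketch (again assuming a hypothetical reviews.json), DESCRIBE prints the column names and types ClickHouse infers from the file, with no schema supplied by the user:

    # inspect the schema ClickHouse infers from the JSON data itself
    $ clickhouse-local --query "
        DESCRIBE file('reviews.json', JSONEachRow)"

In recent versions the format argument can often be omitted entirely, with ClickHouse inferring it from the file extension.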

When we contacted Daniel, he promptly added clickhouse-local to his benchmarks and updated the results. To his surprise, clickhouse-local was faster than all the previously tested tools. One of the developers of OctoSQL (written in Go) also asked for the benchmark to be updated to reflect its latest improvements, with good results. Below we show the updated benchmark results.

Map challenge

map.png

Results for 1GB of data

map-results.png

Aggregation (reduce) challenge

reduce.png

Results for 1GB of data

reduce-results.png

Filter (subset) challenge

subset.png

Results for 1GB of data

subset-results.png

Results summary

results.png

The updated results have been posted here. As we can see, clickhouse-local is much faster than most other tools for querying large JSON files, whereas OctoSQL excels on smaller files.

This benchmark is not perfect. Each query is run only once, so fluctuations are possible, and users reproducing the results on their own hardware may see noticeable differences between runs. Results may also vary with hardware and operating system. Finally, Daniel deliberately omitted ORDER BY clauses from the filter queries. Although this means results can differ between tools, since SQL does not enforce a default order, the objective of that test is to measure how quickly a user can sample a file and whether the tool can avoid a full scan. It therefore favors tools that support early termination once the LIMIT is satisfied, and those that don’t need to load the entire file into memory.
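
To make the early-termination point concrete: the filter query sketched earlier can stop reading the file as soon as its LIMIT is satisfied, whereas adding an ORDER BY (a hypothetical variant, shown below) forces every matching row to be read and sorted before the first 100 can be returned:

    # the ORDER BY forces a full scan: all matching rows must be read and
    # sorted before the LIMIT applies
    $ clickhouse-local --query "
        SELECT reviewText
        FROM file('reviews.json', JSONEachRow)
        WHERE overall = 5
        ORDER BY reviewText
        LIMIT 100"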

So next time you need to process large JSON files, you know which tool to use!
