
Hacker News dataset

In this tutorial, you'll insert 28 million rows of Hacker News data into a ClickHouse table from both CSV and Parquet formats and run some simple queries to explore the data.

CSV

Download CSV

A CSV version of the dataset can be downloaded from our public S3 bucket, or by running this command:
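As a sketch, with placeholders for the exact location — substitute <S3_BUCKET_URL> and the file name hackernews.csv.gz with the values shown for the public bucket:

    # Replace <S3_BUCKET_URL>/hackernews.csv.gz with the dataset's public S3 location
    wget <S3_BUCKET_URL>/hackernews.csv.gz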

At 4.6 GB and 28 million rows, the compressed file should take 5-10 minutes to download.

Sample the data

clickhouse-local allows users to perform fast processing on local files without having to deploy and configure the ClickHouse server.

Before storing any data in ClickHouse, let's sample the file using clickhouse-local. From the console run:
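A minimal sketch, assuming the standalone clickhouse binary is in the current directory (use clickhouse-local directly if ClickHouse was installed from packages):

    ./clickhouse local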

Next, run the following command to explore the data:
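For example, a query along these lines reads a couple of rows from the downloaded file (the file name hackernews.csv.gz is assumed from the download step):

    -- Read the gzipped CSV directly from disk, inferring the schema,
    -- and print a couple of rows one column per line
    SELECT *
    FROM file('hackernews.csv.gz', CSVWithNames)
    LIMIT 2
    FORMAT Vertical;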

There are a lot of subtle capabilities in this command. The file table function allows you to read the file from local disk, specifying only the format CSVWithNames. Most importantly, the schema is automatically inferred for you from the file contents. Note also how clickhouse-local is able to read the compressed file, inferring the gzip compression from the extension. The Vertical format makes the data for each column easier to read.

Load the data with schema inference

The simplest and most powerful tool for data loading is the clickhouse-client: a feature-rich native command-line client. To load data, you can again exploit schema inference, relying on ClickHouse to determine the types of the columns.

Run the following command to create a table from the remote CSV file, accessing the contents via the url function; the schema is automatically inferred:
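A sketch of the statement, again with a placeholder S3 URL; LIMIT 0 is used here so the table is created from the inferred schema without inserting any rows yet:

    CREATE TABLE hackernews
    ENGINE = MergeTree
    ORDER BY tuple()
    AS SELECT *
    FROM url('<S3_BUCKET_URL>/hackernews.csv.gz', 'CSVWithNames')
    LIMIT 0;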

This creates an empty table using the schema inferred from the data. The DESCRIBE TABLE command allows us to understand these assigned types.
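For example, assuming the table is named hackernews as above:

    DESCRIBE TABLE hackernews;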

To insert the data into this table, use an INSERT INTO ... SELECT statement. Combined with the url function, the data is streamed directly from the remote file:
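A sketch, with the same placeholder URL:

    INSERT INTO hackernews
    SELECT *
    FROM url('<S3_BUCKET_URL>/hackernews.csv.gz', 'CSVWithNames');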

You've successfully inserted 28 million rows into ClickHouse with a single command!

Explore the data

Sample the Hacker News stories and specific columns by running the following query:
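For example, a query along these lines samples a few stories and a handful of columns (the column names follow the Hacker News item fields: id, title, by, score, time, url):

    SELECT id, title, by, score, time, url
    FROM hackernews
    WHERE type = 'story'
    LIMIT 3
    FORMAT Vertical;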

While schema inference is a great tool for initial data exploration, it is “best effort” and not a long-term substitute for defining an optimal schema for your data.

Define a schema

An obvious immediate optimization is to define a type for each field. In addition to declaring the time field as a DateTime type, we define an appropriate type for each of the other fields below, after dropping our existing table. In ClickHouse, the primary key (here, id) is defined via the ORDER BY clause of the table.

Selecting appropriate types and choosing which columns to include in the ORDER BY clause will help to improve query speed and compression.

Run the query below to drop the old schema and create the improved schema:
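A sketch of one possible typed schema; the specific choices below (for example, LowCardinality for the author and an Enum for the item type) are reasonable defaults rather than the only correct answer:

    DROP TABLE IF EXISTS hackernews;

    CREATE TABLE hackernews
    (
        id          UInt32,
        deleted     UInt8,
        type        Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
        by          LowCardinality(String),
        time        DateTime,
        text        String,
        dead        UInt8,
        parent      UInt32,
        poll        UInt32,
        kids        Array(UInt32),
        descendants UInt32,
        score       UInt32,
        title       String,
        parts       Array(UInt32),
        url         String
    )
    ENGINE = MergeTree
    ORDER BY id;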

With an optimized schema, you can now insert the data from the local file system. Again using clickhouse-client, insert the file using the INFILE clause with an explicit INSERT INTO.
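A sketch, assuming the file downloaded earlier sits in the directory where clickhouse-client is run:

    INSERT INTO hackernews
    FROM INFILE 'hackernews.csv.gz'
    FORMAT CSVWithNames;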

Run sample queries

Some sample queries are presented below to give you inspiration for writing your own queries.

How pervasive a topic is "ClickHouse" in Hacker News?

The score field provides a metric of popularity for stories, while the id field and || concatenation operator can be used to produce a link to the original post.
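One way to express this, as a sketch (the Hacker News item URL pattern is used to rebuild a link from the id):

    SELECT
        time,
        score,
        title,
        'https://news.ycombinator.com/item?id=' || toString(id) AS hn_url
    FROM hackernews
    WHERE type = 'story' AND title ILIKE '%ClickHouse%'
    ORDER BY score DESC
    LIMIT 5;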

Is ClickHouse generating more noise over time? Here the value of declaring the time field as a DateTime becomes clear, since a proper data type lets you use the toYYYYMM() function:
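A sketch that buckets mentions by month across story titles and comment text:

    SELECT
        toYYYYMM(time) AS month,
        count() AS mentions
    FROM hackernews
    WHERE title ILIKE '%ClickHouse%' OR text ILIKE '%ClickHouse%'
    GROUP BY month
    ORDER BY month ASC;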

It looks like "ClickHouse" is growing in popularity with time.

Who are the top commenters on ClickHouse related articles?
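A sketch, counting comments per author whose text mentions ClickHouse:

    SELECT
        by,
        count() AS comments
    FROM hackernews
    WHERE type = 'comment' AND text ILIKE '%ClickHouse%'
    GROUP BY by
    ORDER BY comments DESC
    LIMIT 5;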

Which comments generate the most interest?
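One possible reading of "interest", using the number of direct replies (the kids array) as a proxy:

    SELECT
        by,
        length(kids) AS replies,
        text
    FROM hackernews
    WHERE type = 'comment' AND text ILIKE '%ClickHouse%'
    ORDER BY replies DESC
    LIMIT 5;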

Parquet

One of the strengths of ClickHouse is its ability to handle any number of formats. CSV is a convenient starting point, but it is not the most efficient format for data exchange.

Next, you'll load the data from a Parquet file which is an efficient column-oriented format.

Parquet has a minimal set of types, which ClickHouse needs to respect, and this type information is encoded in the format itself. Type inference on a Parquet file therefore leads to a slightly different schema than the one inferred from the CSV file.

Insert the data

Run the following query to read the same data in Parquet format, again using the url function to read the remote data:
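A sketch, with a placeholder URL for the Parquet file; the allow_nullable_key table setting is included because the columns inferred from Parquet are Nullable (see the note below):

    DROP TABLE IF EXISTS hackernews;

    CREATE TABLE hackernews
    ENGINE = MergeTree
    ORDER BY id
    SETTINGS allow_nullable_key = 1
    AS SELECT *
    FROM url('<S3_BUCKET_URL>/hackernews.parquet', 'Parquet');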

Null keys with Parquet

As a condition of the Parquet format, we have to accept that keys might be NULL, even though they aren't in the data.

Run the following command to view the inferred schema:
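Assuming the table created above:

    DESCRIBE TABLE hackernews;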

As before with the CSV file, you can specify the schema manually for greater control over the chosen types and insert the data directly from s3:
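A sketch: drop the Nullable-keyed table, re-run the typed CREATE TABLE statement from the Define a schema section, then load the Parquet data via the s3 function (placeholder URL again):

    DROP TABLE IF EXISTS hackernews;

    -- Re-create the table with the explicit schema from the "Define a schema" section,
    -- then stream the Parquet data directly from S3:
    INSERT INTO hackernews
    SELECT *
    FROM s3('<S3_BUCKET_URL>/hackernews.parquet', 'Parquet');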

Add a skipping index to speed up queries

To find out how many comments mention "ClickHouse", run the following query:
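A sketch using hasToken; the comment body is assumed to live in a column named comment, as this section calls it (substitute text if that is the column name in your schema):

    SELECT count()
    FROM hackernews
    WHERE hasToken(lower(comment), 'clickhouse');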

Next, you'll create an inverted index on the "comment" column in order to speed this query up. Note that the lowercased comments will be indexed, so that terms match independently of case.

Run the following commands to create the index:
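A sketch; the index name comment_idx is arbitrary, and depending on your ClickHouse version the experimental setting and index type may be named full_text rather than inverted:

    -- Enable the experimental inverted/full-text index type (name varies by version)
    SET allow_experimental_inverted_index = 1;

    -- Index the lowercased comment text so matches are case-insensitive
    ALTER TABLE hackernews ADD INDEX comment_idx lower(comment) TYPE inverted;

    -- Build the index for data already in the table
    ALTER TABLE hackernews MATERIALIZE INDEX comment_idx;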

Materialization of the index takes a while (to check if the index was created, use the system table system.data_skipping_indices).

Run the query again once the index has been created:
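The same query as before:

    SELECT count()
    FROM hackernews
    WHERE hasToken(lower(comment), 'clickhouse');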

Notice how the query now took only 0.248 seconds with the index, down from 0.843 seconds previously without it.

The EXPLAIN clause can be used to understand why the addition of this index improved the query around 3.4x.
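For example, with indexes = 1 the plan shows which granules each index selects:

    EXPLAIN indexes = 1
    SELECT count()
    FROM hackernews
    WHERE hasToken(lower(comment), 'clickhouse');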

Notice how the index allowed skipping of a substantial number of granules to speed up the query.

It's also possible to efficiently search for one, or all, of multiple terms:
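A sketch, using illustrative search terms:

    -- Comments containing any of the terms
    SELECT count()
    FROM hackernews
    WHERE multiSearchAny(lower(comment), ['clickhouse', 'olap']);

    -- Comments containing all of the terms
    SELECT count()
    FROM hackernews
    WHERE hasToken(lower(comment), 'clickhouse') AND hasToken(lower(comment), 'olap');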