Hacker News dataset
In this tutorial, you'll insert 28 million rows of Hacker News data into a ClickHouse table from both CSV and Parquet formats and run some simple queries to explore the data.
CSV
Sample the data
`clickhouse-local` allows users to perform fast processing on local files without having to deploy and configure the ClickHouse server. Before storing any data in ClickHouse, let's sample the file using `clickhouse-local`. From the console, run:
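The command here is most likely just starting an interactive `clickhouse-local` session (this assumes the `clickhouse` binary is already installed and on your PATH):

```bash
clickhouse-local
```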
Next, run the following command to explore the data:
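A query along the following lines samples a couple of rows. The local file name is illustrative and should match the compressed CSV you downloaded:

```sql
-- sample two rows from the local compressed CSV; the file name is illustrative
SELECT *
FROM file('hackernews.csv.gz', 'CSVWithNames')
LIMIT 2
FORMAT Vertical;
```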
There are a lot of subtle capabilities in this command. The `file` operator allows you to read the file from a local disk, specifying only the format `CSVWithNames`. Most importantly, the schema is automatically inferred for you from the file contents. Note also how `clickhouse-local` is able to read the compressed file, inferring the gzip format from the extension. The `Vertical` format is used to more easily see the data for each column.
Load the data with schema inference
The simplest and most powerful tool for data loading is `clickhouse-client`: a feature-rich, native command-line client. To load data, you can again exploit schema inference, relying on ClickHouse to determine the types of the columns. Run the following command to create a table and insert the data directly from the remote CSV file, accessing the contents via the `url` function. The schema is automatically inferred:
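For example (the S3 URL below is an assumption and should point at wherever the Hacker News CSV for this tutorial lives; the table name `hackernews` is also just a choice):

```sql
-- create an empty table whose schema is inferred from the remote CSV
-- the URL is illustrative; substitute the tutorial's dataset location
CREATE TABLE hackernews
ENGINE = MergeTree
ORDER BY tuple()
EMPTY AS SELECT *
FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hackernews.csv.gz', 'CSVWithNames');
```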
This creates an empty table using the schema inferred from the data. The `DESCRIBE TABLE` command allows us to understand these assigned types.
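For instance, assuming the table created above:

```sql
-- show the column names and the types that inference assigned
DESCRIBE TABLE hackernews;
```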
To insert the data into this table, use the `INSERT INTO ... SELECT` command. Together with the `url` function, data will be streamed directly from the URL:
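A sketch, using the same (assumed) URL as above:

```sql
-- stream the remote CSV straight into the table; the URL is illustrative
INSERT INTO hackernews SELECT *
FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hackernews.csv.gz', 'CSVWithNames');
```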
You've successfully inserted 28 million rows into ClickHouse with a single command!
Explore the data
Sample the Hacker News stories and specific columns by running the following query:
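Something like the following works; the column names (`id`, `title`, `by`, `time`, `url`, `score`, `type`) follow the Hacker News API and are assumed to match what schema inference produced for your file:

```sql
-- look at a few stories and a handful of columns
SELECT id, title, `by`, time, url, score
FROM hackernews
WHERE type = 'story'
LIMIT 3
FORMAT Vertical;
```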
While schema inference is a great tool for initial data exploration, it is “best effort” and not a long-term substitute for defining an optimal schema for your data.
Define a schema
An obvious immediate optimization is to define a type for each field. In addition to declaring the `time` field as a `DateTime` type, we define an appropriate type for each of the fields below after dropping our existing dataset. In ClickHouse, the primary key `id` for the data is defined via the `ORDER BY` clause. Selecting appropriate types and choosing which columns to include in the `ORDER BY` clause will help to improve query speed and compression.
Run the query below to drop the old schema and create the improved schema:
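The sketch below shows the shape of such a statement. The column list follows the Hacker News API fields; the post body is named `comment` here so that it lines up with the skipping-index section later in this tutorial, and the exact names and types should be adjusted to match what schema inference reported for your file:

```sql
DROP TABLE IF EXISTS hackernews;

-- explicit types; column names follow the Hacker News API and are illustrative
CREATE TABLE hackernews
(
    id          UInt32,
    deleted     UInt8,
    type        Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
    `by`        LowCardinality(String),
    time        DateTime,
    comment     String,
    dead        UInt8,
    parent      UInt32,
    poll        UInt32,
    kids        Array(UInt32),
    url         String,
    score       Int32,
    title       String,
    parts       Array(UInt32),
    descendants Int32
)
ENGINE = MergeTree
ORDER BY id;
```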
With an optimized schema, you can now insert the data from the local file system. Again using `clickhouse-client`, insert the file using the `INFILE` clause with an explicit `INSERT INTO`.
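For example, from a `clickhouse-client` session (the local path is an assumption; the CSV header names must match the table's column names):

```sql
-- read the compressed CSV from the local file system; the path is illustrative
INSERT INTO hackernews
FROM INFILE '/data/hackernews.csv.gz'
COMPRESSION 'gzip'
FORMAT CSVWithNames;
```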
Run sample queries
Some sample queries are presented below to give you inspiration for writing your own queries.
How pervasive a topic is "ClickHouse" on Hacker News?
The `score` field provides a metric of popularity for stories, while the `id` field and the `||` concatenation operator can be used to produce a link to the original post.
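A query in this spirit (the column names are assumptions carried over from the schema sketch above):

```sql
-- stories whose title mentions ClickHouse, with a link back to the post
SELECT
    time,
    score,
    descendants,
    title,
    'https://news.ycombinator.com/item?id=' || toString(id) AS link
FROM hackernews
WHERE title ILIKE '%ClickHouse%'
ORDER BY time ASC;
```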
Is ClickHouse generating more noise over time? Here the usefulness of defining the `time` field as a `DateTime` is shown, as using a proper data type allows you to use the `toYYYYMM()` function:
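A sketch using a monthly bucket and a simple text bar chart (`bar()` is a built-in; the 0 to 100 scale is arbitrary, and `title` and `comment` are the assumed column names from above):

```sql
-- ClickHouse mentions per month, drawn as a bar chart
SELECT
    toYYYYMM(time) AS month,
    count() AS mentions,
    bar(mentions, 0, 100, 20) AS chart
FROM hackernews
WHERE (title ILIKE '%ClickHouse%') OR (comment ILIKE '%ClickHouse%')
GROUP BY month
ORDER BY month ASC;
```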
It looks like "ClickHouse" is growing in popularity with time.
Who are the top commenters on ClickHouse-related articles?
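For instance (`by` is the assumed author column):

```sql
-- authors with the most comments on ClickHouse-related posts
SELECT `by`, count() AS comments
FROM hackernews
WHERE type = 'comment' AND comment ILIKE '%ClickHouse%'
GROUP BY `by`
ORDER BY comments DESC
LIMIT 5;
```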
Which comments generate the most interest?
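One way to read "interest" is the number of replies a comment receives, i.e. the length of the `kids` array from the Hacker News API; both the metric and the column name are assumptions here:

```sql
-- comments that attracted the most replies
SELECT
    `by`,
    substring(comment, 1, 80) AS snippet,
    length(kids) AS replies
FROM hackernews
WHERE type = 'comment' AND comment ILIKE '%ClickHouse%'
ORDER BY replies DESC
LIMIT 5;
```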
Parquet
One of the strengths of ClickHouse is its ability to handle any number of formats. CSV represents a rather ideal use case but is not the most efficient format for data exchange.
Next, you'll load the data from a Parquet file, which is an efficient column-oriented format.
Parquet has minimal types, which ClickHouse needs to respect, and this type information is encoded in the format itself. Type inference on a Parquet file will invariably lead to a slightly different schema than the one for the CSV file.
Insert the data
Run the following query to read the same data in Parquet format, again using the url function to read the remote data:
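A sketch of this step. As before, the Parquet URL is an assumption, and `allow_nullable_key` is needed because Parquet columns are inferred as Nullable (see the note below):

```sql
DROP TABLE IF EXISTS hackernews;

-- create an empty table from the inferred Parquet schema, then fill it
-- the URL is illustrative; substitute the tutorial's Parquet location
CREATE TABLE hackernews
ENGINE = MergeTree
ORDER BY id
SETTINGS allow_nullable_key = 1
EMPTY AS SELECT *
FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hackernews.parquet', 'Parquet');

INSERT INTO hackernews SELECT *
FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hackernews.parquet', 'Parquet');
```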
As a condition of the Parquet format, we have to accept that keys might be `NULL`, even though they aren't in the data.
Run the following command to view the inferred schema:
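For example:

```sql
-- note the Nullable wrappers inferred from the Parquet type information
DESCRIBE TABLE hackernews;
```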
As before with the CSV file, you can specify the schema manually for greater control over the chosen types, and insert the data directly from S3:
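A sketch, reusing the explicit schema from the CSV section and the `s3` table function. The URL remains an assumption, the column names remain the assumed ones from the earlier sketch, and the `SELECT *` assumes the Parquet file carries the same columns in the same order:

```sql
DROP TABLE IF EXISTS hackernews;

-- recreate the table using the explicit schema shown in the CSV section
CREATE TABLE hackernews
(
    id          UInt32,
    deleted     UInt8,
    type        Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
    `by`        LowCardinality(String),
    time        DateTime,
    comment     String,
    dead        UInt8,
    parent      UInt32,
    poll        UInt32,
    kids        Array(UInt32),
    url         String,
    score       Int32,
    title       String,
    parts       Array(UInt32),
    descendants Int32
)
ENGINE = MergeTree
ORDER BY id;

-- insert directly from S3; the URL is illustrative
INSERT INTO hackernews SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hackernews.parquet', 'Parquet');
```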
Add a skipping index to speed up queries
To find out how many comments mention "ClickHouse", run the following query:
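A query along these lines counts the matches (`hasToken` is a built-in; `comment` is the assumed column name for the post body):

```sql
-- case-insensitive token match for "clickhouse" in comment bodies
SELECT count()
FROM hackernews
WHERE hasToken(lower(comment), 'clickhouse');
```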
Next, you'll create an inverted index on the comment column to speed this query up. Note that the comments are indexed in lowercase, so that terms can be found independent of casing.
Run the following commands to create the index:
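The exact DDL isn't reproduced here. As one concrete sketch, a token bloom filter skipping index over `lower(comment)` supports exactly this kind of case-insensitive token lookup (the tutorial's own index may instead use ClickHouse's experimental inverted/full-text index type); the index name and bloom filter parameters below are illustrative:

```sql
-- add a tokenbf_v1 data-skipping index over the lowercased comment text
ALTER TABLE hackernews
    ADD INDEX comment_lowercase (lower(comment)) TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 1;

-- build the index for data that is already in the table
ALTER TABLE hackernews MATERIALIZE INDEX comment_lowercase;
```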
Materialization of the index takes a while (to check if the index was created, use the system table `system.data_skipping_indices`).
Run the query again once the index has been created. Notice how the query now took only 0.248 seconds with the index, down from 0.843 seconds previously without it.
The `EXPLAIN` clause can be used to understand why the addition of this index improved the query around 3.4x.
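For example, with `indexes = 1` the plan reports which indexes were applied and how many granules each one selected:

```sql
-- show which indexes were used and how many granules were skipped
EXPLAIN indexes = 1
SELECT count()
FROM hackernews
WHERE hasToken(lower(comment), 'clickhouse');
```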
Notice how the index allowed skipping of a substantial number of granules to speed up the query.
It's now also possible to efficiently search for one, or all, of multiple terms:
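Sketches of both variants; the search terms are arbitrary examples, and `multiSearchAny` and `hasToken` are built-ins:

```sql
-- comments mentioning any of the terms
SELECT count()
FROM hackernews
WHERE multiSearchAny(lower(comment), ['oltp', 'olap']);

-- comments mentioning all of the terms
SELECT count()
FROM hackernews
WHERE hasToken(lower(comment), 'oltp') AND hasToken(lower(comment), 'olap');
```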