Skip to main content

ClickHouse Tutorial

What to Expect from This Tutorial?

In this tutorial, you will create a table and insert a large dataset (two million rows of the New York taxi data). Then you will run queries on the dataset.

note

This tutorial assumes you have access to a running ClickHouse server. If not, check out the Quick Start.

1. Create a New Table

The New York City taxi data contains the details of millions of taxi rides, with columns like pickup and drop-off times and locations, cost, tip amount, tolls, payment type and so on. Let's create a table to store this data...

  1. Either open your Play UI at http://localhost:8123/play or startup the clickhouse-client:

    clickhouse-client
  2. Create the following trips table in the default database:

    CREATE TABLE trips
    (
    `trip_id` UInt32,
    `vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
    `pickup_date` Date,
    `pickup_datetime` DateTime,
    `dropoff_date` Date,
    `dropoff_datetime` DateTime,
    `store_and_fwd_flag` UInt8,
    `rate_code_id` UInt8,
    `pickup_longitude` Float64,
    `pickup_latitude` Float64,
    `dropoff_longitude` Float64,
    `dropoff_latitude` Float64,
    `passenger_count` UInt8,
    `trip_distance` Float64,
    `fare_amount` Float32,
    `extra` Float32,
    `mta_tax` Float32,
    `tip_amount` Float32,
    `tolls_amount` Float32,
    `ehail_fee` Float32,
    `improvement_surcharge` Float32,
    `total_amount` Float32,
    `payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
    `trip_type` UInt8,
    `pickup` FixedString(25),
    `dropoff` FixedString(25),
    `cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
    `pickup_nyct2010_gid` Int8,
    `pickup_ctlabel` Float32,
    `pickup_borocode` Int8,
    `pickup_ct2010` String,
    `pickup_boroct2010` String,
    `pickup_cdeligibil` String,
    `pickup_ntacode` FixedString(4),
    `pickup_ntaname` String,
    `pickup_puma` UInt16,
    `dropoff_nyct2010_gid` UInt8,
    `dropoff_ctlabel` Float32,
    `dropoff_borocode` UInt8,
    `dropoff_ct2010` String,
    `dropoff_boroct2010` String,
    `dropoff_cdeligibil` String,
    `dropoff_ntacode` FixedString(4),
    `dropoff_ntaname` String,
    `dropoff_puma` UInt16
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(pickup_date)
    ORDER BY pickup_datetime;

2. Insert the Dataset

Now that you have a table created, let's add the NYC taxi data. It is in CSV files in S3, and you can load the data from there.

  1. The following command inserts ~2,000,000 rows into your trips table from two different files in S3: trips_1.tsv.gz and trips_2.tsv.gz:

    INSERT INTO trips
    SELECT * FROM s3(
    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
    'TabSeparatedWithNames'
    )
  2. Wait for the INSERT to finish - it might take a moment for the 150MB of data to be downloaded.

    note

    The s3 function cleverly knows how to decompress the data, and the TabSeparatedWithNames format tells ClickHouse that the data is tab-separated and also to skip the header row of each file.

  3. When the insert is finished, verify it worked:

    SELECT count() FROM trips

    You should see about 2M rows (1,999,657 rows, to be precise).

    note

    Notice how quickly and how few rows ClickHouse had to process to determine the count? You can get back the count in 0.001 seconds with only 6 rows processed. (6 just happens to be the number of parts that the trips table currently has, and parts know how many rows they have.)

  4. If you run a query that needs to hit every row, you will notice considerably more rows need to be processed, but the run time is still blazing fast:

    SELECT DISTINCT(pickup_ntaname) FROM trips

    This query has to process 2M rows and return 190 values, but notice it does this in about 0.05 seconds. The pickup_ntaname column represents the name of the neighborhood in New York City where the taxi ride originated.

3. Analyze the Data

Let's see how quickly ClickHouse can process 2M rows of data...

  1. We will start with some simple and fast calculations, like computing the average tip amount (which is right on $1)

    SELECT avg(tip_amount) FROM trips

    The response is almost immediate:

    ┌────avg(tip_amount)─┐
    │ 1.6847585806972212 │
    └────────────────────┘

    1 rows in set. Elapsed: 0.113 sec. Processed 2.00 million rows, 8.00 MB (17.67 million rows/s., 70.69 MB/s.)
  2. This query computes the average cost based on the number of passengers:

    SELECT
    passenger_count,
    ceil(avg(total_amount),2) AS average_total_amount
    FROM trips
    GROUP BY passenger_count

    The passenger_count ranges from 0 to 9:

    ┌─passenger_count─┬─average_total_amount─┐
    │ 0 │ 22.69 │
    │ 1 │ 15.97 │
    │ 2 │ 17.15 │
    │ 3 │ 16.76 │
    │ 4 │ 17.33 │
    │ 5 │ 16.35 │
    │ 6 │ 16.04 │
    │ 7 │ 59.8 │
    │ 8 │ 36.41 │
    │ 9 │ 9.81 │
    └─────────────────┴──────────────────────┘

    10 rows in set. Elapsed: 0.015 sec. Processed 2.00 million rows, 10.00 MB (129.00 million rows/s., 645.01 MB/s.)
  3. Here is a query that calculates the daily number of pickups per neighborhood:

    SELECT
    pickup_date,
    pickup_ntaname,
    SUM(1) AS number_of_trips
    FROM trips
    GROUP BY pickup_date, pickup_ntaname
    ORDER BY pickup_date ASC

    The result looks like:

    ┌─pickup_date─┬─pickup_ntaname───────────────────────────────────────────┬─number_of_trips─┐
    │ 2015-07-01 │ Brooklyn Heights-Cobble Hill │ 13 │
    │ 2015-07-01 │ Old Astoria │ 5 │
    │ 2015-07-01 │ Flushing │ 1 │
    │ 2015-07-01 │ Yorkville │ 378 │
    │ 2015-07-01 │ Gramercy │ 344 │
    │ 2015-07-01 │ Fordham South │ 2 │
    │ 2015-07-01 │ SoHo-TriBeCa-Civic Center-Little Italy │ 621 │
    │ 2015-07-01 │ Park Slope-Gowanus │ 29 │
    │ 2015-07-01 │ Bushwick South │ 5 │
  1. This query computes the length of the trip and groups the results by that value:

    SELECT
    avg(tip_amount) AS avg_tip,
    avg(fare_amount) AS avg_fare,
    avg(passenger_count) AS avg_passenger,
    count() AS count,
    truncate(date_diff('second', pickup_datetime, dropoff_datetime)/3600) as trip_minutes
    FROM trips
    WHERE trip_minutes > 0
    GROUP BY trip_minutes
    ORDER BY trip_minutes DESC

    The result looks like:

    ┌────────────avg_tip─┬───────────avg_fare─┬──────avg_passenger─┬─count─┬─trip_minutes─┐
    │ 0.9800000190734863 │ 10 │ 1.5 │ 2 │ 458 │
    │ 1.18236789075801 │ 14.493377928590297 │ 2.060200668896321 │ 1495 │ 23 │
    │ 2.1159574744549206 │ 23.22872340425532 │ 2.4680851063829787 │ 47 │ 22 │
    │ 1.1218181631781838 │ 13.681818181818182 │ 1.9090909090909092 │ 11 │ 21 │
    │ 0.3218181837688793 │ 18.045454545454547 │ 2.3636363636363638 │ 11 │ 20 │
    │ 2.1490000009536745 │ 17.55 │ 1.5 │ 10 │ 19 │
    │ 4.537058907396653 │ 37 │ 1.7647058823529411 │ 17 │ 18 │
  1. This query shows the number of pickups in each neighborhood, broken down by hour of the day:

    SELECT
    pickup_ntaname,
    toHour(pickup_datetime) as pickup_hour,
    SUM(1) AS pickups
    FROM trips
    WHERE pickup_ntaname != ''
    GROUP BY pickup_ntaname, pickup_hour
    ORDER BY pickup_ntaname, pickup_hour

    The result looks like:

    ┌─pickup_ntaname───────────────────────────────────────────┬─pickup_hour─┬─pickups─┐
    │ Airport │ 0 │ 3509 │
    │ Airport │ 1 │ 1184 │
    │ Airport │ 2 │ 401 │
    │ Airport │ 3 │ 152 │
    │ Airport │ 4 │ 213 │
    │ Airport │ 5 │ 955 │
    │ Airport │ 6 │ 2161 │
    │ Airport │ 7 │ 3013 │
    │ Airport │ 8 │ 3601 │
    │ Airport │ 9 │ 3792 │
    │ Airport │ 10 │ 4546 │
    │ Airport │ 11 │ 4659 │
    │ Airport │ 12 │ 4621 │
    │ Airport │ 13 │ 5348 │
    │ Airport │ 14 │ 5889 │
    │ Airport │ 15 │ 6505 │
    │ Airport │ 16 │ 6119 │
    │ Airport │ 17 │ 6341 │
    │ Airport │ 18 │ 6173 │
    │ Airport │ 19 │ 6329 │
    │ Airport │ 20 │ 6271 │
    │ Airport │ 21 │ 6649 │
    │ Airport │ 22 │ 6356 │
    │ Airport │ 23 │ 6016 │
    │ Allerton-Pelham Gardens │ 4 │ 1 │
    │ Allerton-Pelham Gardens │ 6 │ 1 │
    │ Allerton-Pelham Gardens │ 7 │ 1 │
    │ Allerton-Pelham Gardens │ 9 │ 5 │
    │ Allerton-Pelham Gardens │ 10 │ 3 │
    │ Allerton-Pelham Gardens │ 15 │ 1 │
    │ Allerton-Pelham Gardens │ 20 │ 2 │
    │ Allerton-Pelham Gardens │ 23 │ 1 │
    │ Annadale-Huguenot-Prince's Bay-Eltingville │ 23 │ 1 │
    │ Arden Heights │ 11 │ 1 │
  2. Let's look at rides to LaGuardia or JFK airports, which requires all 2M rows to be processed and returns in less than 0.04 seconds:

    SELECT
    pickup_datetime,
    dropoff_datetime,
    total_amount,
    pickup_nyct2010_gid,
    dropoff_nyct2010_gid,
    CASE
    WHEN dropoff_nyct2010_gid = 138 THEN 'LGA'
    WHEN dropoff_nyct2010_gid = 132 THEN 'JFK'
    END AS airport_code,
    EXTRACT(YEAR FROM pickup_datetime) AS year,
    EXTRACT(DAY FROM pickup_datetime) AS day,
    EXTRACT(HOUR FROM pickup_datetime) AS hour
    FROM trips
    WHERE dropoff_nyct2010_gid IN (132, 138)
    ORDER BY pickup_datetime

    The response is:

    ┌─────pickup_datetime─┬────dropoff_datetime─┬─total_amount─┬─pickup_nyct2010_gid─┬─dropoff_nyct2010_gid─┬─airport_code─┬─year─┬─day─┬─hour─┐
    │ 2015-07-01 00:04:14 │ 2015-07-01 00:15:29 │ 13.3 │ -34 │ 132 │ JFK │ 2015 │ 1 │ 0 │
    │ 2015-07-01 00:09:42 │ 2015-07-01 00:12:55 │ 6.8 │ 50 │ 138 │ LGA │ 2015 │ 1 │ 0 │
    │ 2015-07-01 00:23:04 │ 2015-07-01 00:24:39 │ 4.8 │ -125 │ 132 │ JFK │ 2015 │ 1 │ 0 │
    │ 2015-07-01 00:27:51 │ 2015-07-01 00:39:02 │ 14.72 │ -101 │ 138 │ LGA │ 2015 │ 1 │ 0 │
    │ 2015-07-01 00:32:03 │ 2015-07-01 00:55:39 │ 39.34 │ 48 │ 138 │ LGA │ 2015 │ 1 │ 0 │
    │ 2015-07-01 00:34:12 │ 2015-07-01 00:40:48 │ 9.95 │ -93 │ 132 │ JFK │ 2015 │ 1 │ 0 │
    │ 2015-07-01 00:38:26 │ 2015-07-01 00:49:00 │ 13.3 │ -11 │ 138 │ LGA │ 2015 │ 1 │ 0 │
    │ 2015-07-01 00:41:48 │ 2015-07-01 00:44:45 │ 6.3 │ -94 │ 132 │ JFK │ 2015 │ 1 │ 0 │
    │ 2015-07-01 01:06:18 │ 2015-07-01 01:14:43 │ 11.76 │ 37 │ 132 │ JFK │ 2015 │ 1 │ 1 │
    note

    As you can see, it doesn't seem to matter what type of grouping or calculation that is being performed, ClickHouse retrieves the results almost immediately!

Congrats!

Well done, you made it through the tutorial, and hopefully, you have a better understanding of how to use ClickHouse. Here are some options for what to do next: