Welcome to Our House!
Come on in and stay a while!! We would love to introduce ourselves and say "Hello".
What's New in ClickHouse 21.9
Duration: 10-15 minutes
Audience: If you want to see how to use the new SQL operators in 21.9, this lesson is for you
Issues/Questions: Contact us at [email protected]
Overview: The latest release of ClickHouse 21.9 includes support for the subquery operators INTERSECT, EXCEPT, ANY and ALL. The operators are a great addition - you just need to understand a few details about how they work. We will look at some Spotify data that contains the number of times a song was streamed per day.
Let’s get started!
Prerequisites: You will be running a Docker image that includes ClickHouse 21.9 and a sample dataset already indexed that contains streaming details from Spotify, so you will need Docker installed if you want to follow along.
1. Startup ClickHouse 21.9
We have built a Docker image that already has ClickHouse installed, along with a table that contains the Spotify data:
Show instructions
- Assuming you have Docker installed, run the appropriate following command for your environment to startup the Docker container:
On Linux, you will need to use the
--network host
option:docker run -it --name clickhouse-spotify --network host -p 9000:9000 -p 9009:9009 -p 8123:8123 --platform linux/amd64 --ulimit nofile=262144:262144 learnclickhouse/public-repo:clickhouse-spotify-21.10
On all other environments, use this command:
docker run -it --name clickhouse-spotify -p 9000:9000 -p 9009:9009 -p 8123:8123 --platform linux/amd64 --ulimit nofile=262144:262144 learnclickhouse/public-repo:clickhouse-spotify-21.10
- Wait about 30 seconds for the clickhouse-spotify container to startup and also for the data to get inserted into the spotify database.
2. Open the ClickHouse SQL Interface
Let’s verify you have ClickHouse up and running and the data was inserted successfully.
Show instructions
- Point your web browser to http://localhost:8123/play. You should see the embedded ClickHouse Play UI:
Let’s run a few queries to understand what the dataset looks like. Copy-and-paste the following query into the UI, then click the Run button (or press Ctrl/Cmd+Enter):
SHOW TABLES IN spotify
You should see a table named songs.
The following command shows the schema of songs:
DESCRIBE spotify.songs
View some of the data in the tables. This query displays 100 days' worth of streaming data:
SELECT * FROM spotify.songs LIMIT 100
To see the most popular songs, sort by the Streams column:
SELECT * FROM spotify.songs ORDER BY Streams DESC
Aside from one good day for Taylor Swift , it looks like Kendrick Lamar, Post Malone and Drake have plenty of days where one of their songs was the most-listened to on that day (at least for 2017).
3. The INTERSECT Operator
INTERSECT does exactly what it sounds like it does: it takes two sets of data and returns their intersection. In other words, it compares to sets of rows and returns only the rows where all the columns are equal in both sets.
Let’s take a look at how it works.
Show instructions
Suppose we want to view artists who had songs with really good days of streaming vs. not-as-good days. In other words, a song that had some great days but also some slow days. The logic could possibly feel like the following SQL, but as you can see this particular query could not possibly have any hits:
SELECT * FROM spotify.songs WHERE Streams <= 2000 AND Streams >=1000000
Let’s see how many times a song had more than 1,000,000 streams in a day:
SELECT count(*) FROM spotify.songs WHERE Streams >= 1000000
Notice you get 4,440 hits.
Now let’s see how many times a song had less than 2,000 streams in a day:
SELECT count(*) FROM spotify.songs WHERE Streams <= 2000
This happens much more frequently, with 134,666 hits.
To use INTERSECT, you run two queries and the responses they have in common are returned. For equality, the responses of both queries must have the same number and data types of columns. Run the following query, noticing that the two queries select the same columns:
SELECT TrackName,Artist FROM spotify.songs WHERE Streams <= 2000 INTERSECT SELECT TrackName,Artist FROM spotify.songs WHERE Streams >= 1000000
You get 19,014 hits, which might seem contradictory because we know the second query only has 4,440 hits. But notice we selected TrackName and Artist multiple times in both queries, so for example if Starboy by The Weeknd appeared 20 times in one search and 100 times in the other, you would get 100 rows in the result.
The result would be more interesting if we added DISTINCT to both queries. This would tell us how many songs had at least one day with more than 1,000,000 streams and at least one day with less than 2,000 streams:
SELECT DISTINCT TrackName,Artist FROM spotify.songs WHERE Streams <= 2000 INTERSECT SELECT DISTINCT TrackName,Artist FROM spotify.songs WHERE Streams >= 1000000
This particular query returns 99 rows.
If you want to sort the results, simply sort the first query. For example, the results in this query will be sorted by TrackName:
SELECT DISTINCT TrackName,Artist FROM spotify.songs WHERE Streams <= 2000 ORDER BY TrackName INTERSECT SELECT DISTINCT TrackName,Artist FROM spotify.songs WHERE Streams >= 1000000
If you think about it, the Christmas songs on the list actually make sense - they probably don’t get a lot of streams in April!
Notice that the columns selected in the two queries must have the same data types. The column names do not have to match, as long as the data types lineup (the order of columns in the SELECT clause matters). Try the following query:
SELECT DISTINCT TrackName,Artist,Date FROM spotify.songs WHERE Streams <= 2000 INTERSECT SELECT DISTINCT TrackName,Artist,URL FROM spotify.songs WHERE Streams >= 1000000
You get the following error:
Code: 386. DB::Exception: There is no supertype for types Date, String because some of them are String/FixedString and some of them are not. (NO_COMMON_TYPE) (version 21.10.1.8013 (official build))
The third column in the first query is a Date and the third column in the second query is a String, so the query fails.
4. The EXCEPT Operator
The EXCEPT operator returns the rows that match the first query but throws out the rows that match the second query. Let’s see how it works…
Show instructions
We have already seen in the INTERSECT example above that 99 songs have had good days and bad days. The following query returns songs that have topped 1,000,000 streams in a day at least once, but have never had a day with less than 2,000 streams:
SELECT DISTINCT TrackName,Artist FROM spotify.songs WHERE Streams >=1000000 EXCEPT SELECT DISTINCT TrackName,Artist FROM spotify.songs WHERE Streams <= 2000
Notice you only get 15 hits.
5. The ANY Operator
The ANY operator compares a given value in one query with a set of values in a subquery. If the given value matches any of the hits in the subquery, then that row in the first query is returned. Like most situations, it is best understood by an example:
Show instructions
Review the following query. Can you figure out which artist will get returned?
SELECT DISTINCT Artist FROM spotify.songs WHERE Streams = ANY ( SELECT max(Streams) FROM spotify.songs GROUP BY Date )
Run the query above. You should get 13 artists:
- What do the hits mean? Let’s break it down…the subquery returns a set of numbers that represent the maximum number of streams in a day. The outer query looks for the Artist who had the maximum number of streams that day. Therefore, you are seeing the artist who, at some point in time, had a day in which one of their songs was the most-streamed song on Spotify.
6. The ALL Operator
The ALL operator has the same syntax as ANY, except the Boolean logic is different: for the ALL operator, the given value must match all of the values in the set of values returned by the subquery.
Show instructions
Run the following query, which returns the average number of daily streams for each of the 16 regions:
SELECT Region, avg(Streams) FROM spotify.songs GROUP BY Region
See if you can write a query that returns the artists who have had at least one song with more streams on a day than the average of all the 16 regions.
Show answer
The following query is one solution:
SELECT DISTINCT Artist FROM spotify.songs
WHERE Streams > ALL (
SELECT avg(Streams) FROM spotify.songs GROUP BY Region
)
This event has occurred for 210 artists.
What’s next: Check out the following lessons to continue your journey:
- Learn how to visualize your data using Grafana
- Check out What’s New in ClickHouse 21.10
- View all of our lessons on the Learn ClickHouse home page