YouTube dataset of dislikes
In November of 2021, YouTube removed the public dislike count from all of its videos. While creators can still see the number of dislikes, viewers can only see how many likes a video has received.
The dataset has over 4.55 billion records, so be careful about copying and pasting the commands below unless your resources can handle that volume. The commands below were executed on a Production instance of ClickHouse Cloud.
The data is in a JSON format and can be downloaded from archive.org. We have made this same data available in S3 so that it can be downloaded more efficiently into a ClickHouse Cloud instance.
Here are the steps to create a table in ClickHouse Cloud and insert the data.
The steps below will easily work on a local install of ClickHouse too. The only change would be to use the s3 function instead of s3cluster (unless you have a cluster configured - in which case change default to the name of your cluster).
Step-by-step instructions
- Let's see what the data looks like. The s3cluster table function returns a table, so we can DESCRIBE the result:
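A minimal sketch of such a query follows. The URL is a placeholder, not the real location of the files, and the JSONLines format and .zst file extension are assumptions about how the data is stored. The table function is written s3Cluster here, which is how ClickHouse spells it.

```sql
-- Placeholder URL: substitute the actual S3 location of the dataset files.
-- 'default' is the cluster name; on a single local server, call s3(...)
-- instead and drop the cluster argument.
DESCRIBE s3Cluster(
    'default',
    'https://<bucket>.s3.amazonaws.com/<path>/*.zst',
    'JSONLines'
);
```

Because DESCRIBE only needs to infer the schema, it reads a sample of the data rather than all of the records.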
ClickHouse infers the following schema from the JSON file:
- Based on the inferred schema, we cleaned up the data types and added a primary key. Define the following table:
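A reduced sketch of such a table is shown here. Only fetch_date, upload_date, upload_date_str, uploader, title and description are named in this guide; the remaining column names and all of the types are assumptions, and the full dataset contains more columns than this. The guide states that uploader is the first column of the primary key; placing upload_date second is also an assumption.

```sql
CREATE TABLE youtube
(
    id                  String,    -- assumed name for the video identifier
    fetch_date          DateTime,
    upload_date_str     String,    -- original, unparsed upload date
    upload_date         Date,      -- parsed upload date, 1970-01-01 if parsing failed
    title               String,
    uploader            String,
    view_count          Int64,     -- assumed name
    like_count          Int64,     -- assumed name
    dislike_count       Int64,     -- assumed name
    has_subtitles       Bool,      -- assumed name
    is_comments_enabled Bool,      -- assumed name
    description         String
)
ENGINE = MergeTree
-- With no explicit PRIMARY KEY clause, the sorting key is also the primary key.
ORDER BY (uploader, upload_date);
```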
- The following command streams the records from the S3 files into the youtube table.
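A minimal sketch of such a command, under stated assumptions: the URL is a placeholder, the target table is assumed to have the twelve columns listed in the INSERT clause (names other than fetch_date, upload_date, upload_date_str, uploader, title and description are hypothetical), and the source fields are assumed to carry the same names.

```sql
-- INSERT ... SELECT maps columns by position, so the explicit column list
-- documents which expression feeds which column.
INSERT INTO youtube
    (id, fetch_date, upload_date_str, upload_date, title, uploader,
     view_count, like_count, dislike_count, has_subtitles,
     is_comments_enabled, description)
SELECT
    ifNull(id, ''),
    -- Unparseable values become the zero DateTime instead of raising an error.
    parseDateTimeBestEffortUSOrZero(toString(fetch_date)),
    -- Keep the raw string, which may hold values such as "4 hours ago".
    ifNull(upload_date, ''),
    toDate(parseDateTimeBestEffortUSOrZero(upload_date::String)),
    ifNull(title, ''),
    ifNull(uploader, ''),
    ifNull(view_count, 0),
    ifNull(like_count, 0),
    ifNull(dislike_count, 0),
    ifNull(has_subtitles, false),
    ifNull(is_comments_enabled, false),
    ifNull(description, '')
FROM s3Cluster(
    'default',                                         -- cluster name
    'https://<bucket>.s3.amazonaws.com/<path>/*.zst',  -- placeholder URL
    'JSONLines'                                        -- assumed format
);
```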
This inserts a lot of data - about 4.56 billion rows. If you do not want the entire dataset, simply add a LIMIT clause with the desired number of rows.
Some comments about our INSERT command:
- The parseDateTimeBestEffortUSOrZero function is handy when the incoming date fields may not be in the proper format. If fetch_date does not get parsed properly, it will be set to 0
- The upload_date column contains valid dates, but it also contains strings like "4 hours ago" - which is certainly not a valid date. We decided to store the original value in upload_date_str and attempt to parse it with toDate(parseDateTimeBestEffortUSOrZero(upload_date::String)). If the parsing fails we just get 0
- We used ifNull to avoid getting NULL values in our table. If an incoming value is NULL, the ifNull function sets the value to an empty string
- Open a new tab in the SQL Console of ClickHouse Cloud (or a new clickhouse-client window) and watch the count increase. It will take a while to insert 4.56B rows, depending on your server resources. (Without any tweaking of settings, it takes about 4.5 hours.)
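One way to watch the progress is to re-run a simple count while the insert is going. This is a sketch that only assumes the table is named youtube:

```sql
-- formatReadableQuantity turns the raw count into a figure such as "1.23 billion".
SELECT formatReadableQuantity(count()) AS rows_so_far
FROM youtube;
```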
- Once the data is inserted, go ahead and count the number of dislikes of your favorite videos or channels. Let's see how many videos were uploaded by ClickHouse:
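A sketch of such a query, assuming the channel's uploader value is exactly 'ClickHouse':

```sql
-- Filtering on uploader, the leading primary key column, lets ClickHouse
-- skip most of the table instead of scanning every row.
SELECT count()
FROM youtube
WHERE uploader = 'ClickHouse';
```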
The query above runs so quickly because we chose uploader as the first column of the primary key - so it only had to process 237k rows.
- Let's look at the likes and dislikes of ClickHouse videos:
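A possible form of this query is sketched here. The column names like_count and dislike_count are assumptions about the table definition, as is the uploader value 'ClickHouse'.

```sql
SELECT
    title,
    like_count,      -- assumed column name
    dislike_count    -- assumed column name
FROM youtube
WHERE uploader = 'ClickHouse'
ORDER BY dislike_count DESC;
```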
The response looks like:
- Here is a search for videos with ClickHouse in the title or description fields:
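A sketch of such a search, using case-insensitive matching. The columns view_count, like_count and dislike_count are assumed names; title and description are the fields named in the text.

```sql
SELECT
    view_count,
    like_count,
    dislike_count,
    title
FROM youtube
-- ILIKE is the case-insensitive form of LIKE; neither column is part of the
-- primary key, so this filter cannot use the index.
WHERE (title ILIKE '%ClickHouse%') OR (description ILIKE '%ClickHouse%')
ORDER BY like_count DESC, view_count DESC;
```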
This query has to process every row, and also parse through two columns of strings. Even then, we get decent performance at 4.15M rows/second:
The results look like:
Questions
If someone disables comments, does it lower the chance that someone will actually click like or dislike?
When commenting is disabled, are people more likely to like or dislike to express their feelings about a video?
Enabling comments seems to be correlated with a higher rate of engagement.
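One way to explore this question is to compare reactions per view for videos with and without comments. This is a sketch, not the query behind the statement above; is_comments_enabled, like_count, dislike_count and view_count are assumed column names.

```sql
SELECT
    is_comments_enabled,
    count() AS videos,
    -- Likes plus dislikes per view, as a rough engagement measure.
    round(sum(like_count + dislike_count) / sum(view_count), 4) AS reactions_per_view
FROM youtube
WHERE view_count > 0
GROUP BY is_comments_enabled;
```

Weighting by views keeps a handful of tiny videos from dominating the ratio; a per-video average would answer a slightly different question.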
How does the number of videos change over time - notable events?
A spike in uploaders around COVID is noticeable.
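A monthly breakdown like the following sketch can make such changes visible. It assumes the upload_date and uploader columns described earlier, and it drops rows whose upload date failed to parse by keeping only dates from 2005 onward, the year YouTube launched.

```sql
SELECT
    toStartOfMonth(upload_date) AS month,
    uniq(uploader) AS uploaders,   -- approximate number of distinct uploaders
    count() AS videos
FROM youtube
WHERE upload_date >= '2005-01-01'  -- also excludes unparsed dates stored as 0
GROUP BY month
ORDER BY month;
```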
More subtitles over time and when
With advances in speech recognition, it's easier than ever to create subtitles for videos. YouTube added auto-captioning in late 2009 - was the jump then?
The data results show a spike in 2009. Apparently at that time, YouTube was removing its community captions feature, which allowed you to upload captions for other people's videos. This prompted a very successful campaign to have creators add captions to their videos for hard-of-hearing and deaf viewers.
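The share of videos with subtitles per month can be computed along these lines. This is a sketch in which has_subtitles is an assumed Bool column name and the 2005 cutoff is used only to drop unparsed upload dates.

```sql
SELECT
    toStartOfMonth(upload_date) AS month,
    countIf(has_subtitles) AS with_subtitles,
    count() AS videos,
    -- ClickHouse allows aliases from the same SELECT list to be reused here.
    round(with_subtitles / videos, 4) AS share_with_subtitles
FROM youtube
WHERE upload_date >= '2005-01-01'
GROUP BY month
ORDER BY month;
```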