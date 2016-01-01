
This dataset contains over 150M customer reviews of Amazon products. The data is in snappy-compressed Parquet files in AWS S3 that total 49GB in size (compressed). Let's walk through the steps to insert it into ClickHouse.

The queries below were executed on a Production instance of ClickHouse Cloud.

  1. Without inserting the data into ClickHouse, we can query it in place. Let's grab some rows, so we can see what they look like:
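
A minimal sketch of such a query, assuming the bucket is publicly readable and that the s3 table function infers the Parquet format and schema from the file extension. It reuses the wildcard path from the INSERT step further down, and the LIMIT value is arbitrary:

```sql
-- Read a few rows straight from S3 without creating a table.
-- Assumption: the files are publicly readable, so no credentials are passed.
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet')
LIMIT 10;
```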

The rows look like:

  1. Let's define a new MergeTree table named amazon_reviews to store this data in ClickHouse:
  1. The following INSERT command uses the s3Cluster table function, which allows the processing of multiple S3 files in parallel using all the nodes of your cluster. We also use a wildcard to insert every file whose name matches the pattern https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet:
In ClickHouse Cloud, the name of the cluster is default. Change default to the name of your cluster, or use the s3 table function (instead of s3Cluster) if you do not have a cluster.
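
The two steps above might look like the following sketch. The column names, order, and types are assumptions based on the commonly published Amazon reviews schema, not something this page states, so it is worth checking them first with `DESCRIBE s3(...)` on the same path. The sort key is also an assumption, chosen so that product_category is part of the primary key. The `REPLACE` clause assumes that the Parquet files store vine and verified_purchase as 'Y'/'N' strings; drop it if they are already booleans.

```sql
-- Assumed schema: verify against the Parquet files before relying on it.
CREATE TABLE amazon_reviews
(
    review_date Date,
    marketplace LowCardinality(String),
    customer_id UInt64,
    review_id String,
    product_id String,
    product_parent UInt64,
    product_title String,
    product_category LowCardinality(String),
    star_rating UInt8,
    helpful_votes UInt32,
    total_votes UInt32,
    vine Bool,
    verified_purchase Bool,
    review_headline String,
    review_body String
)
ENGINE = MergeTree
ORDER BY (marketplace, review_date, product_category);

-- INSERT ... SELECT maps columns by position, so the file's column order
-- must match the table definition above (an assumption here).
INSERT INTO amazon_reviews
SELECT * REPLACE (vine = 'Y' AS vine, verified_purchase = 'Y' AS verified_purchase)
FROM s3Cluster(
    'default',
    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet'
);
```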

  1. That query doesn't take long, averaging about 300,000 rows per second. Within 5 minutes or so you should see all the rows inserted:
  1. Let's see how much space our data is using:
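
One way to check this, sketched here as an illustration, is to aggregate the active parts of the table in system.parts. The output column aliases are arbitrary choices:

```sql
-- Sum compressed and uncompressed bytes over the table's active parts.
SELECT
    disk_name,
    formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
    round(usize / size, 2) AS compr_rate,
    sum(rows) AS rows,
    count() AS part_count
FROM system.parts
WHERE (active = 1) AND (table = 'amazon_reviews')
GROUP BY disk_name
ORDER BY size DESC;
```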

The original data was about 70 GB, but compressed in ClickHouse it takes up about 30 GB:

  1. Let's run some queries. Here are the top 10 most helpful reviews in the dataset:
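
A query along these lines could produce that list. It assumes the table has helpful_votes, product_title, and review_headline columns, which is an assumption about the schema rather than something stated above:

```sql
-- Sort every review by its helpful-vote count and keep the top 10.
SELECT
    product_title,
    review_headline
FROM amazon_reviews
ORDER BY helpful_votes DESC
LIMIT 10;
```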

Notice the query has to process all 151M rows, but takes less than one second!

  1. Here are the top 10 products in Amazon with the most reviews:
  1. Here are the average review ratings per month for each product (an actual Amazon job interview question!):
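
Both of these can be sketched as plain GROUP BY queries. The column names (product_id, product_title, star_rating, review_date) are assumed from the commonly published schema, and `any()` simply picks one title per product:

```sql
-- Top 10 products by number of reviews.
SELECT
    any(product_title) AS title,
    count() AS reviews
FROM amazon_reviews
GROUP BY product_id
ORDER BY reviews DESC
LIMIT 10;

-- Average star rating per product per month; LIMIT only trims the output.
SELECT
    toStartOfMonth(review_date) AS month,
    any(product_title) AS title,
    avg(star_rating) AS avg_stars
FROM amazon_reviews
GROUP BY
    month,
    product_id
ORDER BY
    month DESC,
    product_id ASC
LIMIT 20;
```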

It calculates all the monthly averages for each product, but we only returned 20 rows:

  1. Here are the total number of votes per product category. This query is fast because product_category is in the primary key:
  1. Let's find the products with the word "awful" occurring most frequently in the review. This is a big task - over 151M strings have to be parsed looking for a single word:
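
As a rough sketch of both queries, again assuming column names (total_votes, review_body, and so on) that this page does not spell out. The second query uses `position`, a case-sensitive substring search, which is one of several ways to look for a word; the LIMIT of 50 is arbitrary:

```sql
-- Total votes per product category.
SELECT
    sum(total_votes) AS votes,
    product_category
FROM amazon_reviews
GROUP BY product_category
ORDER BY votes DESC;

-- Products whose reviews mention 'awful' most often.
SELECT
    product_id,
    any(product_title) AS title,
    avg(star_rating) AS avg_stars,
    count() AS mentions
FROM amazon_reviews
WHERE position(review_body, 'awful') > 0
GROUP BY product_id
ORDER BY mentions DESC
LIMIT 50;
```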

The query only takes 4 seconds - which is impressive - and the results are a fun read:

  1. We can run the same query again, except this time we search for "awesome" in the reviews:
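
Under the same schema assumptions as before (product_id, product_title, star_rating, and review_body are assumed column names), only the search string changes:

```sql
-- Products whose reviews mention 'awesome' most often.
SELECT
    product_id,
    any(product_title) AS title,
    avg(star_rating) AS avg_stars,
    count() AS mentions
FROM amazon_reviews
WHERE position(review_body, 'awesome') > 0
GROUP BY product_id
ORDER BY mentions DESC
LIMIT 50;
```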