How to query data in an S3 bucket
A lot of the world's data lives in Amazon S3 buckets. In this guide, we'll learn how to query that data using chDB.
Setup
Let's first create a virtual environment:
python -m venv .venv
source .venv/bin/activate
And now we'll install chDB. Make sure you have version 2.0.2 or higher:
pip install "chdb>=2.0.2"
And now we're going to install IPython:
pip install ipython
We're going to use ipython
to run the commands in the rest of the guide, which you can launch by running:
ipython
You can also use the code in a Python script or in your favorite notebook.
Listing files in an S3 bucket
Let's start by listing all the files in an S3 bucket that contains Amazon reviews.
To do this, we can use the s3
table function and pass in the path to a file or a wildcard to a set of files.
If you pass just the bucket name it will throw an exception.
We're also going to use the One
input format so that the file isn't parsed, instead a single row is returned per file and we can access the file via the _file
virtual column and the path via the _path
virtual column.
import chdb
chdb.query("""
SELECT
_file,
_path
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', One)
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
┌─_file───────────────────────────────┬─_path─────────────────────────────────────────────────────────────────────┐
│ amazon_reviews_2010.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2010.snappy.parquet │
│ amazon_reviews_1990s.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_1990s.snappy.parquet │
│ amazon_reviews_2013.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2013.snappy.parquet │
│ amazon_reviews_2015.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet │
│ amazon_reviews_2014.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2014.snappy.parquet │
│ amazon_reviews_2012.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2012.snappy.parquet │
│ amazon_reviews_2000s.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2000s.snappy.parquet │
│ amazon_reviews_2011.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2011.snappy.parquet │
└─────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────┘
This bucket contains only Parquet files.
Querying files in an S3 bucket
Next, let's learn how to query those files. If we want to count the number of rows in each of those files, we can run the following query:
chdb.query("""
SELECT
_file,
count() AS count,
formatReadableQuantity(count) AS readableCount
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
┌─_file───────────────────────────────┬────count─┬─readableCount───┐
│ amazon_reviews_2013.snappy.parquet │ 28034255 │ 28.03 million │
│ amazon_reviews_1990s.snappy.parquet │ 639532 │ 639.53 thousand │
│ amazon_reviews_2011.snappy.parquet │ 6112495 │ 6.11 million │
│ amazon_reviews_2015.snappy.parquet │ 41905631 │ 41.91 million │
│ amazon_reviews_2012.snappy.parquet │ 11541011 │ 11.54 million │
│ amazon_reviews_2000s.snappy.parquet │ 14728295 │ 14.73 million │
│ amazon_reviews_2014.snappy.parquet │ 44127569 │ 44.13 million │
│ amazon_reviews_2010.snappy.parquet │ 3868472 │ 3.87 million │
└─────────────────────────────────────┴──────────┴─────────────────┘
We can also pass in the HTTP URI for an S3 bucket and will get the same results:
chdb.query("""
SELECT
_file,
count() AS count,
formatReadableQuantity(count) AS readableCount
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/*.parquet')
GROUP BY ALL
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
Let's have a look at the schema of these Parquet files using the DESCRIBE
clause:
chdb.query("""
DESCRIBE s3('s3://datasets-documentation/amazon_reviews/*.parquet')
SETTINGS describe_compact_output=1
""", 'PrettyCompact')
┌─name──────────────┬─type─────────────┐
1. │ review_date │ Nullable(UInt16) │
2. │ marketplace │ Nullable(String) │
3. │ customer_id │ Nullable(UInt64) │
4. │ review_id │ Nullable(String) │
5. │ product_id │ Nullable(String) │
6. │ product_parent │ Nullable(UInt64) │
7. │ product_title │ Nullable(String) │
8. │ product_category │ Nullable(String) │
9. │ star_rating │ Nullable(UInt8) │
10. │ helpful_votes │ Nullable(UInt32) │
11. │ total_votes │ Nullable(UInt32) │
12. │ vine │ Nullable(Bool) │
13. │ verified_purchase │ Nullable(Bool) │
14. │ review_headline │ Nullable(String) │
15. │ review_body │ Nullable(String) │
└───────────────────┴──────────────────┘
Let's now compute the top product categories based on number of reviews, as well as computing the average star rating:
chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) as avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
LIMIT 10
""", 'PrettyCompact')
┌─product_category─┬──reviews─┬──avg─┐
1. │ Toys │ 4864056 │ 4.21 │
2. │ Apparel │ 5906085 │ 4.11 │
3. │ Luggage │ 348644 │ 4.22 │
4. │ Kitchen │ 4880297 │ 4.21 │
5. │ Books │ 19530930 │ 4.34 │
6. │ Outdoors │ 2302327 │ 4.24 │
7. │ Video │ 380596 │ 4.19 │
8. │ Grocery │ 2402365 │ 4.31 │
9. │ Shoes │ 4366757 │ 4.24 │
10. │ Jewelry │ 1767667 │ 4.14 │
└──────────────────┴──────────┴──────┘
Querying files in a private S3 bucket
If we're querying files in a private S3 bucket, we need to pass in an access key and secret.
We can pass in those credentials to the s3
table function:
chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) as avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', 'access-key', 'secret')
GROUP BY ALL
LIMIT 10
""", 'PrettyCompact')
This query won't work because it's a public bucket!
An alternative way is to used named collections, but this approach isn't yet supported by chDB.