Skip to main content
Skip to main content

How to query data in an S3 bucket

A lot of the world's data lives in Amazon S3 buckets. In this guide, we'll learn how to query that data using chDB.

Setup

Let's first create a virtual environment:

python -m venv .venv
source .venv/bin/activate

And now we'll install chDB. Make sure you have version 2.0.2 or higher:

pip install "chdb>=2.0.2"

And now we're going to install IPython:

pip install ipython

We're going to use ipython to run the commands in the rest of the guide, which you can launch by running:

ipython

You can also use the code in a Python script or in your favorite notebook.

Listing files in an S3 bucket

Let's start by listing all the files in an S3 bucket that contains Amazon reviews. To do this, we can use the s3 table function and pass in the path to a file or a wildcard to a set of files.

Tip

If you pass just the bucket name it will throw an exception.

We're also going to use the One input format so that the file isn't parsed, instead a single row is returned per file and we can access the file via the _file virtual column and the path via the _path virtual column.

import chdb

chdb.query("""
SELECT
_file,
_path
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', One)
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
┌─_file───────────────────────────────┬─_path─────────────────────────────────────────────────────────────────────┐
│ amazon_reviews_2010.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2010.snappy.parquet │
│ amazon_reviews_1990s.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_1990s.snappy.parquet │
│ amazon_reviews_2013.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2013.snappy.parquet │
│ amazon_reviews_2015.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet │
│ amazon_reviews_2014.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2014.snappy.parquet │
│ amazon_reviews_2012.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2012.snappy.parquet │
│ amazon_reviews_2000s.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2000s.snappy.parquet │
│ amazon_reviews_2011.snappy.parquet │ datasets-documentation/amazon_reviews/amazon_reviews_2011.snappy.parquet │
└─────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────┘

This bucket contains only Parquet files.

Querying files in an S3 bucket

Next, let's learn how to query those files. If we want to count the number of rows in each of those files, we can run the following query:

chdb.query("""
SELECT
_file,
count() AS count,
formatReadableQuantity(count) AS readableCount
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')
┌─_file───────────────────────────────┬────count─┬─readableCount───┐
│ amazon_reviews_2013.snappy.parquet │ 28034255 │ 28.03 million │
│ amazon_reviews_1990s.snappy.parquet │ 639532 │ 639.53 thousand │
│ amazon_reviews_2011.snappy.parquet │ 6112495 │ 6.11 million │
│ amazon_reviews_2015.snappy.parquet │ 41905631 │ 41.91 million │
│ amazon_reviews_2012.snappy.parquet │ 11541011 │ 11.54 million │
│ amazon_reviews_2000s.snappy.parquet │ 14728295 │ 14.73 million │
│ amazon_reviews_2014.snappy.parquet │ 44127569 │ 44.13 million │
│ amazon_reviews_2010.snappy.parquet │ 3868472 │ 3.87 million │
└─────────────────────────────────────┴──────────┴─────────────────┘

We can also pass in the HTTP URI for an S3 bucket and will get the same results:

chdb.query("""
SELECT
_file,
count() AS count,
formatReadableQuantity(count) AS readableCount
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/*.parquet')
GROUP BY ALL
SETTINGS output_format_pretty_row_numbers=0
""", 'PrettyCompact')

Let's have a look at the schema of these Parquet files using the DESCRIBE clause:

chdb.query("""
DESCRIBE s3('s3://datasets-documentation/amazon_reviews/*.parquet')
SETTINGS describe_compact_output=1
""", 'PrettyCompact')
    ┌─name──────────────┬─type─────────────┐
1. │ review_date │ Nullable(UInt16) │
2. │ marketplace │ Nullable(String) │
3. │ customer_id │ Nullable(UInt64) │
4. │ review_id │ Nullable(String) │
5. │ product_id │ Nullable(String) │
6. │ product_parent │ Nullable(UInt64) │
7. │ product_title │ Nullable(String) │
8. │ product_category │ Nullable(String) │
9. │ star_rating │ Nullable(UInt8) │
10. │ helpful_votes │ Nullable(UInt32) │
11. │ total_votes │ Nullable(UInt32) │
12. │ vine │ Nullable(Bool) │
13. │ verified_purchase │ Nullable(Bool) │
14. │ review_headline │ Nullable(String) │
15. │ review_body │ Nullable(String) │
└───────────────────┴──────────────────┘

Let's now compute the top product categories based on number of reviews, as well as computing the average star rating:

chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) as avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet')
GROUP BY ALL
LIMIT 10
""", 'PrettyCompact')
    ┌─product_category─┬──reviews─┬──avg─┐
1. │ Toys │ 4864056 │ 4.21 │
2. │ Apparel │ 5906085 │ 4.11 │
3. │ Luggage │ 348644 │ 4.22 │
4. │ Kitchen │ 4880297 │ 4.21 │
5. │ Books │ 19530930 │ 4.34 │
6. │ Outdoors │ 2302327 │ 4.24 │
7. │ Video │ 380596 │ 4.19 │
8. │ Grocery │ 2402365 │ 4.31 │
9. │ Shoes │ 4366757 │ 4.24 │
10. │ Jewelry │ 1767667 │ 4.14 │
└──────────────────┴──────────┴──────┘

Querying files in a private S3 bucket

If we're querying files in a private S3 bucket, we need to pass in an access key and secret. We can pass in those credentials to the s3 table function:

chdb.query("""
SELECT product_category, count() AS reviews, round(avg(star_rating), 2) as avg
FROM s3('s3://datasets-documentation/amazon_reviews/*.parquet', 'access-key', 'secret')
GROUP BY ALL
LIMIT 10
""", 'PrettyCompact')
Note

This query won't work because it's a public bucket!

An alternative way is to used named collections, but this approach isn't yet supported by chDB.