How to query data in an S3 bucket

A lot of the world's data lives in Amazon S3 buckets. In this guide, we'll learn how to query that data using chDB.

Let's first create and activate a virtual environment (for example, by running python -m venv .venv and then source .venv/bin/activate).

Next, we'll install chDB, making sure we get version 2.0.2 or higher (for example, pip install "chdb>=2.0.2").
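One way to confirm that the installed version meets that requirement is sketched below. It relies only on the standard library; parse_version and chdb_is_recent_enough are hypothetical helper names, and the parsing is deliberately simple (it looks at the first three numeric components only).

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

# Minimum chDB version required by this guide.
MIN_VERSION = (2, 0, 2)


def parse_version(text: str) -> tuple:
    """Turn '2.0.2' into (2, 0, 2), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def chdb_is_recent_enough() -> bool:
    """Return True if chDB is installed and at least MIN_VERSION."""
    try:
        installed = version("chdb")
    except PackageNotFoundError:
        return False
    return parse_version(installed) >= MIN_VERSION
```

Calling chdb_is_recent_enough() returns False if chDB is missing or too old.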

We'll also install IPython (for example, pip install ipython).

We're going to use IPython to run the commands in the rest of the guide. You can launch it by running the ipython command.

You can also use the code in a Python script or in your favorite notebook.

Let's start by listing all the files in an S3 bucket that contains Amazon reviews. To do this, we can use the s3 table function and pass in either the path to a single file or a wildcard pattern that matches a set of files.

Tip: If you pass only the bucket name, the query throws an exception.

We'll also use the One input format so that the files aren't parsed. Instead, a single row is returned per file, and we can access the file name via the _file virtual column and the path via the _path virtual column.
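A minimal sketch of that listing step follows. The bucket path is an assumption (the text above does not name the bucket), and list_files_query and list_files are hypothetical helper names. chDB is imported inside the function so the query text can be inspected without it.

```python
# Assumed location of the Amazon reviews files; replace with your own bucket.
REVIEWS_GLOB = "s3://datasets-documentation/amazon_reviews/*.parquet"


def list_files_query(path: str = REVIEWS_GLOB) -> str:
    """Build a query that returns one row per file matching `path`."""
    # The One format skips parsing; _file and _path are virtual columns.
    return f"SELECT _file, _path FROM s3('{path}', 'One') ORDER BY _file"


def list_files(path: str = REVIEWS_GLOB):
    """Run the listing query with chDB and return the result object."""
    import chdb  # needs chDB installed and network access to S3

    return chdb.query(list_files_query(path), "PrettyCompact")
```

In IPython, print(list_files()) should then show one row per matching file.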

This bucket contains only Parquet files.

Next, let's learn how to query those files. If we want to count the number of rows in each of those files, we can run the following query:
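As a sketch of the per-file count, assuming the same bucket path as an illustration (it is not given in the text) and with count_rows_query and count_rows as hypothetical helper names:

```python
# Assumed location of the Amazon reviews files; replace with your own bucket.
REVIEWS_GLOB = "s3://datasets-documentation/amazon_reviews/*.parquet"


def count_rows_query(path: str = REVIEWS_GLOB) -> str:
    """Build a query that counts rows per file, grouped on the _file virtual column."""
    return (
        "SELECT _file, count() AS row_count "
        f"FROM s3('{path}') "
        "GROUP BY _file "
        "ORDER BY _file"
    )


def count_rows(path: str = REVIEWS_GLOB):
    """Run the count query with chDB and return the result object."""
    import chdb  # needs chDB installed and network access to S3

    return chdb.query(count_rows_query(path), "PrettyCompact")
```

No format is passed to s3 here, so the format is inferred from the file extension.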

We can also pass in the HTTP URI for an S3 bucket, and we'll get the same results:
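The sketch below shows one way to derive such an HTTP URI from an s3:// path, using the virtual-hosted-style URL layout. The helper name to_https is hypothetical, and the bucket name and region in the usage comment are assumptions, since neither is stated in the text.

```python
def to_https(s3_uri: str, region: str) -> str:
    """Convert 's3://bucket/key' to a virtual-hosted-style HTTPS URL."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {s3_uri!r}")
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


def count_rows_http_query(http_uri: str) -> str:
    """Build the same per-file count query, but against an HTTPS URI."""
    return (
        "SELECT _file, count() AS row_count "
        f"FROM s3('{http_uri}') "
        "GROUP BY _file ORDER BY _file"
    )


# Example (bucket and region are assumptions):
# uri = to_https("s3://datasets-documentation/amazon_reviews/*.parquet", "eu-west-3")
# import chdb; print(chdb.query(count_rows_http_query(uri), "PrettyCompact"))
```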

Let's have a look at the schema of these Parquet files using the DESCRIBE statement:
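A minimal sketch of the schema lookup, again assuming the bucket path for illustration and using hypothetical helper names:

```python
# Assumed location of the Amazon reviews files; replace with your own bucket.
REVIEWS_GLOB = "s3://datasets-documentation/amazon_reviews/*.parquet"


def describe_query(path: str = REVIEWS_GLOB) -> str:
    """Build a DESCRIBE query that reports column names and types."""
    return f"DESCRIBE s3('{path}')"


def describe_files(path: str = REVIEWS_GLOB):
    """Run the DESCRIBE query with chDB and return the result object."""
    import chdb  # needs chDB installed and network access to S3

    return chdb.query(describe_query(path), "PrettyCompact")
```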

Let's now compute the top product categories by number of reviews, along with the average star rating for each:
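This aggregation could be sketched as follows. The column names product_category and star_rating are assumptions about the dataset's schema, the bucket path is assumed as well, and the helper names are hypothetical.

```python
# Assumed location of the Amazon reviews files; replace with your own bucket.
REVIEWS_GLOB = "s3://datasets-documentation/amazon_reviews/*.parquet"


def top_categories_query(path: str = REVIEWS_GLOB, limit: int = 10) -> str:
    """Build a query ranking categories by review count, with average rating."""
    # product_category and star_rating are assumed column names.
    return (
        "SELECT product_category, count() AS reviews, "
        "round(avg(star_rating), 2) AS avg_rating "
        f"FROM s3('{path}') "
        "GROUP BY product_category "
        "ORDER BY reviews DESC "
        f"LIMIT {int(limit)}"
    )


def top_categories(path: str = REVIEWS_GLOB, limit: int = 10):
    """Run the ranking query with chDB and return the result object."""
    import chdb  # needs chDB installed and network access to S3

    return chdb.query(top_categories_query(path, limit), "PrettyCompact")
```

The explicit ORDER BY matters here: without it, LIMIT would return an arbitrary set of categories rather than the most-reviewed ones.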

If we're querying files in a private S3 bucket, we need to pass in an access key and secret. We can pass in those credentials to the s3 table function:
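As a sketch, the credentials go in as the second and third arguments of the s3 table function. The path, key, and secret below are placeholders, and sql_literal and private_count_query are hypothetical helper names.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def private_count_query(path: str, access_key: str, secret: str) -> str:
    """Build a row-count query that passes credentials to the s3 table function."""
    args = ", ".join(sql_literal(v) for v in (path, access_key, secret))
    return f"SELECT count() FROM s3({args})"


# Example with placeholder credentials:
# import chdb
# sql = private_count_query("s3://my-private-bucket/*.parquet", "access-key", "secret")
# print(chdb.query(sql, "PrettyCompact"))
```

In practice, the key and secret would be read from the environment or a secrets store rather than written into the source.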

Note: This query won't work because it's a public bucket!

An alternative is to use named collections, but this approach isn't yet supported by chDB.