How to ingest Parquet files from an S3 bucket

Learn the basics of using the S3 table engine in ClickHouse to ingest and query Parquet files from an S3 bucket, including setup, access permissions, and data import examples.

Ingesting Parquet files from an S3 bucket

Below are some basics of using the S3 table engine to read parquet files.

create access and secret keys for an IAM service user. normal login users usually don't work since they may have been configured with an MFA policy.
set the permissions on the policy to allow the service user to access the bucket and folders.

The following is a very simple example that you can use to test the mechanics of accessing your parquet files successfully prior to applying to your actual data.

If you need an example of creating a user and bucket, you can follow the first two sections (create user and create bucket): https://clickhouse.com/docs/guides/sre/configuring-s3-for-clickhouse-use/

I used this sample file: https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet and uploaded it to my test bucket

You can set the policy something like this on the bucket: (adjust as needed, this one is fairly open for privileges but will help in testing. you can narrow your permissions as necessary)

You can run queries with this type of syntax using the S3 table engine: https://clickhouse.com/docs/sql-reference/table-functions/s3/

The data types reference for parquet format are here: https://clickhouse.com/docs/interfaces/formats/#data-format-parquet

To bring in the data into a native ClickHouse table:

create the table, something like this (just chose a couple of the columns in the parquet file):

Select the data from the S3 bucket to insert into the new table:

Verify the import:

When you are ready to import your real data, you can use some special syntax like wildcards and ranges to specify your folders, subfolders and files in your bucket. I'd recommend to filter a few directories and files to test the import, maybe a certain year, a couple months and some date range to test first.

besides the path options here, newly released is syntax ** which specifies all subdirectories recursively. https://clickhouse.com/docs/sql-reference/table-functions/s3/

For example, assuming the paths and bucket structure is something like this: https://your_s3_bucket.s3.amazonaws.com/<your_folder>/<year>/<month>/<day>/<filename>.parquet https://mars-doc-test.s3.amazonaws.com/system_logs/2022/11/01/my-app-logs-0001.parquet

This would get all files for 1st day of every month in 2021-2022 https://mars-doc-test.s3.amazonaws.com/system_logs/{2021-2022}/**/01/*.parquet

Mar 22, 2023 · 3 min read

Ingesting Parquet files from an S3 bucket​

Ingesting Parquet files from an S3 bucket