
Integrating Amazon S3 with ClickHouse Cloud

The S3 ClickPipe provides a fully-managed and resilient way to ingest data from Amazon S3 and S3-compatible object stores into ClickHouse Cloud. It supports both one-time and continuous ingestion with exactly-once semantics.

S3 ClickPipes can be deployed and managed manually using the ClickPipes UI, as well as programmatically using OpenAPI and Terraform.

Supported data sources

| Name | Details |
| --- | --- |
| Amazon S3 | Continuous ingestion requires lexicographical order by default, but can be configured to ingest files in any order. |
| Cloudflare R2 (S3-compatible) | Continuous ingestion requires lexicographical order. |
| DigitalOcean Spaces (S3-compatible) | Continuous ingestion requires lexicographical order. |
Tip

Due to differences in URL formats and API implementations across object storage service providers, not all S3-compatible services are supported out-of-the-box. If you're running into issues with a service that is not listed above, please reach out to our team.

Supported formats

Features

One-time ingestion

By default, the S3 ClickPipe will load all files matched by a pattern from the specified bucket into the ClickHouse destination table in a single batch operation. Once the ingestion task completes, the ClickPipe stops automatically. This one-time ingestion mode provides exactly-once semantics, ensuring that each file is processed reliably without duplicates.

Continuous ingestion

When continuous ingestion is enabled, ClickPipes continuously ingests data from the specified path. By default, the S3 ClickPipe relies on the implicit lexicographical order of file names to determine ingestion order. It can also be configured to ingest files in any order using an Amazon SQS queue connected to the bucket.

Lexicographical order

By default, the S3 ClickPipe assumes files are added to a bucket in lexicographical order, and relies on this implicit order to ingest files sequentially. This means that any new file must be lexically greater than the last ingested file. For example, files named file1, file2, and file3 will be ingested sequentially, but if a new file file0 is added to the bucket, it will be ignored because its name is not lexically greater than the last ingested file.
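Because the comparison is on raw file names, unpadded numeric suffixes are a common pitfall: file-10 sorts before file-2. A quick Python illustration (file names here are hypothetical):

```python
# Lexicographic (string) order is not numeric order: "file-10" sorts
# before "file-2", so a new "file-10" arriving after "file-2" would be
# skipped as not lexically greater than the last ingested file.
names = ["file-1", "file-2", "file-10"]
print(sorted(names))  # ['file-1', 'file-10', 'file-2']

# Zero-padding numeric suffixes keeps string order aligned with arrival
# order, so every new file is lexically greater than the previous one.
padded = ["file-0001", "file-0002", "file-0010"]
print(sorted(padded))  # ['file-0001', 'file-0002', 'file-0010']
print(max(padded) < "file-0011")  # True: the next file will be ingested
```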

In this mode, the S3 ClickPipe does an initial load of all files in the specified path, and then polls for new files at a configurable interval (by default, 30 seconds). It is not possible to start ingestion from a specific file or point in time — ClickPipes will always load all files in the specified path.

Any order

Note

Unordered mode is only supported for Amazon S3 and is not supported for public buckets. It requires setting up an Amazon SQS queue connected to the bucket.

It's possible to configure an S3 ClickPipe to ingest files that don't have an implicit order by setting up an Amazon SQS queue connected to the bucket. This allows ClickPipes to listen for object created events and ingest any new files regardless of the file naming convention.

In this mode, the S3 ClickPipe does an initial load of all files in the selected path, and then listens for ObjectCreated:* events in the queue that match the specified path. Any message for a previously seen file, for a file not matching the path, or for an event of a different type will be ignored.

Note

Setting a prefix/postfix filter for events is optional. If you set one, make sure it matches the path configured for the ClickPipe. Note that S3 doesn't allow multiple overlapping notification rules for the same event types.
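As a sketch, connecting a bucket to a queue with a prefix filter might look like this in boto3 (bucket name, queue ARN, and prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Send ObjectCreated events for objects under the ClickPipe's path to
# the SQS queue. The prefix filter is optional, but if set it must
# match the path configured for the ClickPipe.
s3.put_bucket_notification_configuration(
    Bucket="my-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:clickpipes-events",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "data/"}]}
                },
            }
        ]
    },
)
```

Note that the queue policy must also allow S3 to send messages to the queue (see Access control below).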

Files are ingested once the threshold configured in Max insert bytes or Max file count is reached, or after a configurable interval (by default, 30 seconds). It is not possible to start ingestion from a specific file or point in time — ClickPipes will always load all files in the selected path. If a DLQ is configured, failed messages will be re-enqueued and reprocessed up to the number of times configured in the DLQ maxReceiveCount parameter.

Tip

We strongly recommend configuring a Dead-Letter-Queue (DLQ) for the SQS queue, so it's easier to debug and retry failed messages.
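A DLQ can be attached by setting a redrive policy on the queue; a minimal boto3 sketch (queue URL and DLQ ARN are placeholders):

```python
import json
import boto3

sqs = boto3.client("sqs")

# maxReceiveCount controls how many times a failed message is
# re-enqueued and reprocessed before it lands in the DLQ.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/clickpipes-events",
    Attributes={
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:clickpipes-events-dlq",
                "maxReceiveCount": "5",
            }
        )
    },
)
```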

SNS to SQS

It is also possible to emit S3 event notifications to SQS via an SNS topic. This is useful if you run into limitations of the direct S3-to-SQS integration, such as the restriction on overlapping notification rules. In this case, you'll need to enable the raw message delivery option on the SNS subscription.
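A minimal boto3 sketch of such a subscription (ARNs are placeholders):

```python
import boto3

sns = boto3.client("sns")

# Subscribe the SQS queue to the SNS topic that receives the S3 event
# notifications. Raw message delivery is required so the queue receives
# the original S3 event JSON rather than an SNS envelope.
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:s3-events",
    Protocol="sqs",
    Endpoint="arn:aws:sqs:us-east-1:123456789012:clickpipes-events",
    Attributes={"RawMessageDelivery": "true"},
)
```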

File pattern matching

Object Storage ClickPipes follow the POSIX standard for file pattern matching. All patterns are case-sensitive and match the full path after the bucket name. For better performance, use the most specific pattern possible (e.g., data-2024-*.csv instead of *.csv).

Supported patterns

| Pattern | Description | Example | Matches |
| --- | --- | --- | --- |
| ? | Matches exactly one character (excluding /) | data-?.csv | data-1.csv, data-a.csv, data-x.csv |
| * | Matches zero or more characters (excluding /) | data-*.csv | data-1.csv, data-001.csv, data-report.csv, data-.csv |
| ** | Matches zero or more characters (including /). Enables recursive directory traversal. | logs/**/error.log | logs/error.log, logs/2024/error.log, logs/2024/01/error.log |

Examples:

  • https://bucket.s3.amazonaws.com/folder/*.csv
  • https://bucket.s3.amazonaws.com/logs/**/data.json
  • https://bucket.s3.amazonaws.com/file-?.parquet
  • https://bucket.s3.amazonaws.com/data-2024-*.csv.gz
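To sanity-check a pattern against sample object keys before creating a ClickPipe, the semantics in the table above can be approximated with a small, hypothetical helper (an illustration, not ClickPipes' actual matcher):

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate the documented glob syntax into a regex:
    ?   one character, excluding /
    *   zero or more characters, excluding /
    **  zero or more characters, including /
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i : i + 3] == "**/":
            out.append("(?:.*/)?")  # zero or more whole directories
            i += 3
        elif pattern[i : i + 2] == "**":
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

# Checks against the examples in the table above:
assert glob_to_regex("data-?.csv").match("data-1.csv")
assert not glob_to_regex("data-?.csv").match("data-10.csv")
assert glob_to_regex("logs/**/error.log").match("logs/error.log")
assert glob_to_regex("logs/**/error.log").match("logs/2024/01/error.log")
assert not glob_to_regex("data-*.csv").match("nested/data-1.csv")
```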

Unsupported patterns

| Pattern | Description | Example | Alternatives |
| --- | --- | --- | --- |
| {abc,def} | Brace expansion | {logs,data}/file.csv | Create separate ClickPipes for each path. |
| {N..M} | Numeric range expansion | file-{1..100}.csv | Use file-*.csv or file-?.csv. |

Examples:

  • https://bucket.s3.amazonaws.com/{documents-01,documents-02}.json
  • https://bucket.s3.amazonaws.com/file-{1..100}.csv
  • https://bucket.s3.amazonaws.com/{logs,metrics}/data.parquet

Exactly-once semantics

Various types of failures can occur when ingesting large datasets, which can result in partial inserts or duplicate data. Object Storage ClickPipes are resilient to insert failures and provide exactly-once semantics. This is accomplished by using temporary "staging" tables. Data is first inserted into the staging tables. If something goes wrong with this insert, the staging table can be truncated and the insert retried from a clean state. Only when an insert completes successfully are the partitions in the staging table moved to the target table. To read more about this strategy, check out this blog post.
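A conceptual sketch of this strategy in SQL, issued here via the clickhouse_connect Python client (table names and the exact mechanics are illustrative assumptions, not ClickPipes' internals):

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # hypothetical connection

# 1. Insert into a staging table with the same structure as the target.
client.command("CREATE TABLE staging AS target")
try:
    client.command(
        "INSERT INTO staging "
        "SELECT * FROM s3('https://bucket.s3.amazonaws.com/data-1.csv')"
    )
except Exception:
    # 2. On failure the target is untouched: truncate and retry from a
    #    clean state, so no partial or duplicate data reaches the target.
    client.command("TRUNCATE TABLE staging")
    raise

# 3. Only after the insert fully succeeds, move the staged partitions
#    into the target table.
for (partition_id,) in client.query(
    "SELECT DISTINCT _partition_id FROM staging"
).result_rows:
    client.command(
        f"ALTER TABLE staging MOVE PARTITION ID '{partition_id}' TO TABLE target"
    )
```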

Virtual columns

To track which files have been ingested, add the _file virtual column to the column mapping list. The _file virtual column contains the filename of the source object, which can be used to query which files have been processed.
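For example, assuming the virtual column was mapped to a destination column also named _file, the ingested files can be listed with a simple query (connection details are placeholders):

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # hypothetical connection

# List the source files that have already been ingested into the table.
for (file_name,) in client.query(
    "SELECT DISTINCT _file FROM target ORDER BY _file"
).result_rows:
    print(file_name)
```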

Access control

Permissions

The S3 ClickPipe supports public and private buckets. Requester Pays buckets are not supported.

S3 bucket

Buckets must allow the following actions in the bucket policy:
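The exact policy is shown during ClickPipe setup; as a rough sketch, a read-only bucket policy looks along these lines (principal, bucket name, and action list are assumptions to check against the UI):

```python
import json
import boto3

s3 = boto3.client("s3")

# Grant the role used by ClickPipes read access to the bucket and its
# objects. Treat the action list as an assumption; the ClickPipes UI
# displays the authoritative policy.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/clickpipes-role"},
            "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
        }
    ],
}
s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))
```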

SQS queue

When using unordered mode, the SQS queue must allow the following actions in the queue policy:
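As a rough sketch (ARNs and the action list are assumptions; the queue must allow ClickPipes to consume messages, and S3 to publish events):

```python
import json
import boto3

sqs = boto3.client("sqs")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Let the role used by ClickPipes consume messages.
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/clickpipes-role"},
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes",
            ],
            "Resource": "arn:aws:sqs:us-east-1:123456789012:clickpipes-events",
        },
        {
            # Let S3 publish ObjectCreated events to the queue.
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": "arn:aws:sqs:us-east-1:123456789012:clickpipes-events",
            "Condition": {"ArnEquals": {"aws:SourceArn": "arn:aws:s3:::my-bucket"}},
        },
    ],
}
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/clickpipes-events",
    Attributes={"Policy": json.dumps(policy)},
)
```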

Authentication

IAM credentials

To use access keys to authenticate, choose Credentials under Authentication method when setting up your ClickPipe connection. Then, provide the access key ID (e.g., AKIAIOSFODNN7EXAMPLE) and secret access key (e.g., wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY) under Access key and Secret key, respectively.

IAM role

To use role-based access to authenticate, choose IAM role under Authentication method when setting up your ClickPipe connection.

Follow this guide to create a role with the required trust policy for S3 access. Then, provide the IAM role ARN under IAM role ARN.

Advanced settings

ClickPipes provides sensible defaults that cover the requirements of most use cases. If your use case requires additional fine-tuning, you can adjust the following settings:

| Setting | Default value | Description |
| --- | --- | --- |
| Max insert bytes | 10GB | Number of bytes to process in a single insert batch. |
| Max file count | 100 | Maximum number of files to process in a single insert batch. |
| Max threads | auto(3) | Maximum number of concurrent threads for file processing. |
| Max insert threads | 1 | Maximum number of concurrent insert threads for file processing. |
| Min insert block size bytes | 1GB | Minimum size in bytes of a block that can be inserted into a table. |
| Max download threads | 4 | Maximum number of concurrent download threads. |
| Object storage polling interval | 30s | Maximum wait period before inserting data into the ClickHouse cluster. |
| Parallel distributed insert select | 2 | Value of the parallel_distributed_insert_select setting. |
| Parallel view processing | false | Whether to push to attached views concurrently instead of sequentially. |
| Use cluster function | true | Whether to process files in parallel across multiple nodes. |

Scaling

Object Storage ClickPipes are scaled based on the minimum ClickHouse service size determined by the configured vertical autoscaling settings. The size of the ClickPipe is determined when the pipe is created. Subsequent changes to the ClickHouse service settings will not affect the ClickPipe size.

To increase the throughput on large ingest jobs, we recommend scaling the ClickHouse service before creating the ClickPipe.

Known limitations

File size

ClickPipes will only attempt to ingest objects that are 10GB or smaller in size. If a file is greater than 10GB, an error will be appended to the ClickPipes dedicated error table.

Compatibility

Despite being S3-compatible, some services use a different URL structure that the S3 ClickPipe might not be able to parse (e.g., Backblaze B2), or require integration with provider-specific queue services for continuous, unordered ingestion. If you're running into issues with a service that is not listed under Supported data sources, please reach out to our team.

View support

Materialized views on the target table are also supported. ClickPipes will create staging tables not only for the target table, but also for any dependent materialized view.

We do not create staging tables for non-materialized views. This means that if you have a target table with one or more downstream materialized views, those materialized views should avoid selecting data from the target table via a non-materialized view. Otherwise, you may find that data is missing from the materialized views.