Integrating Azure Blob Storage with ClickHouse Cloud

The ABS ClickPipe provides a fully managed and resilient way to ingest data from Azure Blob Storage into ClickHouse Cloud. It supports both one-time and continuous ingestion with exactly-once semantics.

ABS ClickPipes can be deployed and managed manually using the ClickPipes UI, as well as programmatically using OpenAPI and Terraform.

Supported formats

Features

One-time ingestion

The ABS ClickPipe will load all files matched by a pattern from the specified container into the ClickHouse destination table in a single batch operation. Once the ingestion task completes, the ClickPipe stops automatically. This one-time ingestion mode provides exactly-once semantics, ensuring that each file is processed reliably without duplicates.

Continuous ingestion

When continuous ingestion is enabled, ClickPipes continuously ingests data from the specified path. To determine ingestion order, the ABS ClickPipe relies on the implicit lexicographical order of files.

Lexicographical order

The ABS ClickPipe assumes files are added to a container in lexicographical order, and relies on this implicit order to ingest files sequentially. This means that any new file must be lexically greater than the last ingested file. For example, files named file1, file2, and file3 will be ingested sequentially, but if a new file named file0 is added to the container, it will be ignored because its name isn't lexically greater than that of the last ingested file.

In this mode, the ABS ClickPipe does an initial load of all files in the specified path, and then polls for new files at a configurable interval (by default, 30 seconds). It is not possible to start ingestion from a specific file or point in time; ClickPipes will always load all files in the specified path.
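
The ordering rule above can be illustrated with plain string comparison. The sketch below is not ClickPipes code: it assumes file names are compared as plain strings, character by character, and the helper name `would_be_ingested` is hypothetical. It also shows a common pitfall, namely that numeric suffixes without zero padding don't sort the way they read.

```python
def would_be_ingested(new_file: str, last_ingested: str) -> bool:
    """Return True if new_file sorts strictly after last_ingested.

    Assumption: names are compared as plain strings (Python's default
    code-point comparison), used here as a stand-in for lexicographical order.
    """
    return new_file > last_ingested


# Unpadded names: "file10" sorts before "file3" because "1" < "3".
for name in ["file0", "file4", "file10"]:
    print(name, would_be_ingested(name, last_ingested="file3"))

# Zero-padded names keep numeric and lexicographical order aligned.
print("file010", would_be_ingested("file010", last_ingested="file003"))
```

One way to avoid skipped files is to use zero-padded counters or sortable timestamps in file names, so that every new file sorts after the previous one.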

File pattern matching

Object Storage ClickPipes follow the POSIX standard for file pattern matching. All patterns are case-sensitive and match the full path after the container name. For better performance, use the most specific pattern possible (e.g., data-2024-*.csv instead of *.csv).

Supported patterns

Pattern	Description	Example	Matches
`?`	Matches exactly one character (excluding `/`)	`data-?.csv`	`data-1.csv`, `data-a.csv`, `data-x.csv`
`*`	Matches zero or more characters (excluding `/`)	`data-*.csv`	`data-1.csv`, `data-001.csv`, `data-report.csv`, `data-.csv`
`**` Recursive	Matches zero or more characters (including `/`). Enables recursive directory traversal.	`logs/**/error.log`	`logs/error.log`, `logs/2024/error.log`, `logs/2024/01/error.log`

Examples:

https://storageaccount.blob.core.windows.net/container/folder/*.csv
https://storageaccount.blob.core.windows.net/container/logs/**/data.json
https://storageaccount.blob.core.windows.net/container/file-?.parquet
https://storageaccount.blob.core.windows.net/container/data-2024-*.csv.gz
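
To check a pattern before creating a ClickPipe, the three wildcards can be modeled locally. The sketch below is a rough approximation based only on the table above; the actual matcher used by ClickPipes may differ in edge cases, and `glob_to_regex` is a hypothetical helper. It assumes the path is matched in full and that `**/` may also match zero directories, as the `logs/error.log` example suggests.

```python
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate ?, * and ** into a regular expression (illustrative only)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")   # zero or more directory levels
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")         # any characters, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")      # zero or more characters, no "/"
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")       # exactly one character, no "/"
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern: str, path: str) -> bool:
    """Case-sensitive match of the full path after the container name."""
    return glob_to_regex(pattern).fullmatch(path) is not None


print(matches("logs/**/error.log", "logs/2024/01/error.log"))
print(matches("data-*.csv", "archive/data-1.csv"))
```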

Unsupported patterns

Pattern	Description	Example	Alternatives
`{abc,def}`	Brace expansion - alternatives	`{logs,data}/file.csv`	Create separate ClickPipes for each path.
`{N..M}`	Numeric range expansion	`file-{1..100}.csv`	Use `file-*.csv` or `file-?.csv`.

Examples:

https://storageaccount.blob.core.windows.net/container/{documents-01,documents-02}.json
https://storageaccount.blob.core.windows.net/container/file-{1..100}.csv
https://storageaccount.blob.core.windows.net/container/{logs,metrics}/data.parquet
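
Because brace alternatives aren't supported, one option is to expand them on your side and create one ClickPipe per resulting pattern. The following is a minimal sketch with a hypothetical helper, `expand_braces`. It handles comma-separated alternatives only, does not support nested groups, and leaves numeric ranges such as `{1..100}` untouched.

```python
import re

# A brace group containing at least one comma and no nested braces.
_BRACE_GROUP = re.compile(r"\{([^{}]*,[^{}]*)\}")


def expand_braces(pattern: str) -> list:
    """Expand {a,b} alternatives into separate patterns (illustrative only)."""
    match = _BRACE_GROUP.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for alternative in match.group(1).split(","):
        # Recurse so that later groups in the same pattern are expanded too.
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


for path in expand_braces("{logs,metrics}/data.parquet"):
    print(path)
```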

Exactly-once semantics

Various types of failures can occur when ingesting large datasets, which can result in partial inserts or duplicate data. Object Storage ClickPipes are resilient to insert failures and provide exactly-once semantics. This is accomplished by using temporary "staging" tables. Data is first inserted into the staging tables. If something goes wrong with this insert, the staging table can be truncated and the insert can be retried from a clean state. Only when an insert has completed successfully are the partitions in the staging table moved to the target table. To read more about this strategy, check out this blog post.
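
The staging strategy can be pictured with a toy in-memory model. This is only a sketch of the idea described above, not how ClickPipes is implemented: plain Python lists stand in for the staging and target tables, and `load_batch` and `FlakyInsert` are hypothetical names.

```python
class FlakyInsert:
    """Simulated insert that fails once, on the Nth call."""

    def __init__(self, fail_on_call: int):
        self.calls = 0
        self.fail_on_call = fail_on_call

    def __call__(self, table: list, row) -> None:
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise RuntimeError("simulated insert failure")
        table.append(row)


def load_batch(rows, target: list, insert_row, max_attempts: int = 3) -> bool:
    """Insert rows into a staging list; move them to target only on success."""
    for _ in range(max_attempts):
        staging = []                 # start from a clean (truncated) staging table
        try:
            for row in rows:
                insert_row(staging, row)
        except RuntimeError:
            continue                 # discard the partial staging data and retry
        target.extend(staging)       # the move happens only after a full insert
        return True
    return False


target_table = []
ok = load_batch([1, 2, 3], target_table, FlakyInsert(fail_on_call=2))
print(ok, target_table)
```

Because the partially filled staging list is thrown away on failure, the target never sees a row from the failed attempt, and the retry doesn't produce duplicates.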

Virtual columns

To track which files have been ingested, add the _file virtual column to the column mapping list. The _file virtual column contains the filename of the source object, which can be used to query which files have been processed.

Access control

Permissions

The ABS ClickPipe only supports private containers. Public containers are not supported.

The credentials in the connection string must grant Read and List permissions on the container and the blobs you want to ingest.

Authentication

Note

Microsoft Entra ID authentication (including Managed Identities) isn't currently supported.

Azure Blob Storage authentication uses a connection string, which supports both access keys and shared access signatures (SAS).

Access key

To authenticate using an account access key, provide a connection string in the following format:

DefaultEndpointsProtocol=https;AccountName=storage-account-name;AccountKey=account-access-key;EndpointSuffix=core.windows.net

You can find your storage account name and access key in the Azure Portal under Storage Account > Access keys.

Shared Access Signature (SAS)

To authenticate using a Shared Access Signature (SAS), provide a connection string that includes the SAS token:

BlobEndpoint=https://storage-account-name.blob.core.windows.net/;SharedAccessSignature=sas-token

Generate a SAS token in the Azure Portal under Storage Account > Shared access signature with the appropriate permissions (Read, List) for the container and blobs you want to ingest.
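
Both connection string formats are semicolon-separated `key=value` pairs, so they can be sanity-checked before being pasted into ClickPipes. The sketch below uses hypothetical helper names and placeholder credentials. It splits each pair on the first `=` only, because account keys and SAS tokens usually contain `=` characters themselves.

```python
def parse_connection_string(connection_string: str) -> dict:
    """Split a connection string into a dict of its key=value fields."""
    fields = {}
    for part in connection_string.split(";"):
        if not part.strip():
            continue                          # tolerate a trailing semicolon
        key, separator, value = part.partition("=")   # first "=" only
        if not separator:
            raise ValueError(f"malformed field: {part!r}")
        fields[key.strip()] = value
    return fields


def auth_method(fields: dict) -> str:
    """Classify the connection string as access-key or SAS based."""
    if "AccountName" in fields and "AccountKey" in fields:
        return "access_key"
    if "BlobEndpoint" in fields and "SharedAccessSignature" in fields:
        return "sas"
    raise ValueError("connection string matches neither supported format")


# Placeholder values, not real credentials.
key_based = parse_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=storage-account-name;"
    "AccountKey=abc123==;EndpointSuffix=core.windows.net"
)
print(auth_method(key_based), key_based["AccountKey"])
```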

Network access

ABS ClickPipes use two distinct network paths for metadata discovery and data ingestion: the ClickPipes service and the ClickHouse Cloud service, respectively. If you want to configure an additional layer of network security (e.g., for compliance reasons), network access must be configured for both paths.

Note

IP-based access control doesn't work if your Azure Blob Storage container is in the same Azure region as your ClickHouse Cloud service. When both services are co-located, traffic is routed through Azure's internal network, rather than the public internet.

For IP-based access control, the IP network rules for your Azure Storage firewall must allow the static IPs for the ClickPipes service region listed here, as well as the static IPs for the ClickHouse Cloud service. To obtain the static IPs for your ClickHouse Cloud region, open a terminal and run:
```
# Replace <your-region> with your ClickHouse Cloud region
curl -s https://api.clickhouse.cloud/static-ips.json | jq -r '.azure[] | select(.region == "<your-region>") | .egress_ips[]'
```
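
If `jq` isn't available, the same filter can be written in Python. This is a sketch that assumes the JSON layout implied by the `jq` expression above (an `azure` array whose entries carry `region` and `egress_ips` fields). The function names are hypothetical, and the sample document uses a made-up region and documentation-range IP addresses.

```python
import json
from urllib.request import urlopen

STATIC_IPS_URL = "https://api.clickhouse.cloud/static-ips.json"


def egress_ips(document: dict, region: str) -> list:
    """Collect the egress IPs of all Azure entries for the given region."""
    ips = []
    for entry in document.get("azure", []):
        if entry.get("region") == region:
            ips.extend(entry.get("egress_ips", []))
    return ips


def fetch_static_ips(url: str = STATIC_IPS_URL, timeout: float = 10.0) -> dict:
    """Download and parse the static IP list (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Made-up sample with the assumed layout; real use would call
# egress_ips(fetch_static_ips(), "<your-region>").
sample = {
    "azure": [
        {"region": "example-region-1", "egress_ips": ["203.0.113.10", "203.0.113.11"]},
        {"region": "example-region-2", "egress_ips": ["198.51.100.7"]},
    ]
}
print(egress_ips(sample, "example-region-1"))
```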

Advanced settings

ClickPipes provides sensible defaults that cover the requirements of most use cases. If your use case requires additional fine-tuning, you can adjust the following settings:

Setting	Default value	Description
`Max insert bytes`	10GB	Number of bytes to process in a single insert batch.
`Max file count`	100	Maximum number of files to process in a single insert batch.
`Max threads`	auto(3)	Maximum number of concurrent threads for file processing.
`Max insert threads`	1	Maximum number of concurrent insert threads for file processing.
`Min insert block size bytes`	1GB	Minimum size, in bytes, of a block that can be inserted into a table.
`Max download threads`	4	Maximum number of concurrent download threads.
`Object storage polling interval`	30s	Configures the maximum wait period before inserting data into the ClickHouse cluster.
`Parallel distributed insert select`	2	Parallel distributed insert select setting.
`Parallel view processing`	false	Whether to enable pushing to attached views concurrently instead of sequentially.
`Use cluster function`	true	Whether to process files in parallel across multiple nodes.

Scaling

Object Storage ClickPipes are scaled based on the minimum ClickHouse service size determined by the configured vertical autoscaling settings. The size of the ClickPipe is determined when the pipe is created. Subsequent changes to the ClickHouse service settings won't affect the ClickPipe size.

To increase the throughput on large ingest jobs, we recommend scaling the ClickHouse service before creating the ClickPipe.

Known limitations

File size

ClickPipes will only attempt to ingest objects that are 10GB or smaller in size. If a file is greater than 10GB, an error will be appended to the ClickPipes dedicated error table.

Latency

For containers with over 100,000 files, Azure Blob Storage LIST operations introduce extra latency when detecting new files, in addition to the default polling interval:

< 100k files: ~30 seconds (default polling interval)
100k files: ~40-45 seconds
250k files: ~55-70 seconds
500k+ files: May exceed 90 seconds

For continuous ingestion, ClickPipes must scan the container to identify new files lexically greater than the last ingested file. We recommend organizing files into smaller containers, or using hierarchical directory structures to reduce the number of files per listing operation.

View support

Materialized views on the target table are also supported. ClickPipes will create staging tables not only for the target table, but also for any dependent materialized view.

We don't create staging tables for non-materialized views. This means that if you have a target table with one or more downstream materialized views, those materialized views should avoid selecting data via a view from the target table. Otherwise, you may find that you're missing data in the materialized view.

Dependencies

Any changes to the destination table, its materialized views (including cascading materialized views), or the target tables of the materialized views while the ClickPipe is running will result in retriable errors. To make schema changes to these dependencies, you should pause the ClickPipe, apply the changes, and then resume.
