
s3Cluster Table Function

This is an extension to the s3 table function.

Allows processing files from Amazon S3 and Google Cloud Storage in parallel with many nodes in a specified cluster. On the initiator it creates a connection to all nodes in the cluster, expands the asterisks (wildcards) in the S3 file path, and dispatches each file dynamically. On a worker node it asks the initiator for the next task to process and processes it. This is repeated until all tasks are finished.

Syntax
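The general call shape, sketched from the argument list below (bracketed parts are optional):

```sql
s3Cluster(cluster_name, url[, NOSIGN | access_key_id, secret_access_key[, session_token]][, format][, structure][, compression_method][, headers][, extra_credentials])
```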

Arguments

| Argument | Description |
|----------|-------------|
| `cluster_name` | Name of a cluster that is used to build a set of addresses and connection parameters to remote and local servers. |
| `url` | Path to a file or a bunch of files. Supports the following wildcards in read-only mode: `*`, `**`, `?`, `{'abc','def'}` and `{N..M}` where `N`, `M` are numbers and `abc`, `def` are strings. For more information see Wildcards In Path. |
| `NOSIGN` | If this keyword is provided in place of credentials, all requests will not be signed. |
| `access_key_id` and `secret_access_key` | Keys that specify credentials to use with the given endpoint. Optional. |
| `session_token` | Session token to use with the given keys. Optional when passing keys. |
| `format` | The format of the file. |
| `structure` | Structure of the table. Format: `'column1_name column1_type, column2_name column2_type, ...'`. |
| `compression_method` | Optional. Supported values: `none`, `gzip` or `gz`, `brotli` or `br`, `xz` or `LZMA`, `zstd` or `zst`. By default, the compression method is autodetected from the file extension. |
| `headers` | Optional. Allows headers to be passed in the S3 request. Pass in the format `headers(key = value)`, e.g. `headers('x-amz-request-payer' = 'requester')`. See the example after this table. |
| `extra_credentials` | Optional. `roleARN` can be passed via this parameter. See the example after this table. |
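As an illustration of the last two parameters, here is a sketch; the bucket URL, cluster name, and role ARN below are placeholders, not real resources:

```sql
-- Pass an S3 request header, e.g. for a requester-pays bucket (placeholder URL):
SELECT count(*) FROM s3Cluster(
    'cluster_simple',
    'https://example-bucket.s3.amazonaws.com/data/*.csv',
    'CSVWithNames',
    headers('x-amz-request-payer' = 'requester')
);

-- Assume an IAM role by passing its ARN via extra_credentials (placeholder ARN):
SELECT count(*) FROM s3Cluster(
    'cluster_simple',
    'https://example-bucket.s3.amazonaws.com/data/*.csv',
    'CSVWithNames',
    extra_credentials(role_arn = 'arn:aws:iam::111111111111:role/ExampleRole')
);
```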

Arguments can also be passed using named collections. In this case `url`, `access_key_id`, `secret_access_key`, `format`, `structure`, and `compression_method` work in the same way, and some extra parameters are supported:

| Argument | Description |
|----------|-------------|
| `filename` | Appended to the `url` if specified. |
| `use_environment_credentials` | Enabled by default. Allows passing extra parameters using the environment variables `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI`, `AWS_CONTAINER_CREDENTIALS_FULL_URI`, `AWS_CONTAINER_AUTHORIZATION_TOKEN`, and `AWS_EC2_METADATA_DISABLED`. |
| `no_sign_request` | Disabled by default. |
| `expiration_window_seconds` | Default value is 120. |

Returned value

A table with the specified structure for reading or writing data in the specified file.

Examples

Select the data from all the files in the /root/data/clickhouse and /root/data/database/ folders, using all the nodes in the cluster_simple cluster:
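A sketch of such a query; the MinIO endpoint, credentials, and table structure are placeholders for your own setup:

```sql
SELECT * FROM s3Cluster(
    'cluster_simple',
    'http://minio1:9001/root/data/{clickhouse,database}/*',
    'minio',
    '<secret-access-key>',
    'CSV',
    'name String, value UInt32, polygon Array(Array(Tuple(Float64, Float64)))'
)
ORDER BY (name, value, polygon);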

Count the total number of rows in all the files in the cluster cluster_simple:
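Using the same placeholder endpoint and credentials:

```sql
SELECT count(*) FROM s3Cluster(
    'cluster_simple',
    'http://minio1:9001/root/data/{clickhouse,database}/*',
    'minio',
    '<secret-access-key>',
    'CSV',
    'name String, value UInt32, polygon Array(Array(Tuple(Float64, Float64)))'
);
```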

Tip

If your listing of files contains number ranges with leading zeros, use the brace construction for each digit separately, or use ?. See the sketch below.
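For instance, with illustrative file names file-001.csv through file-999.csv:

```sql
-- Instead of 'file-{001..999}.csv', brace each digit separately
-- (or use single-character wildcards: 'file-???.csv'):
SELECT count(*) FROM s3Cluster(
    'cluster_simple',
    'http://minio1:9001/root/data/file-{0..9}{0..9}{0..9}.csv',
    'CSV',
    'name String, value UInt32'
);
```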

For production use cases, it is recommended to use named collections. Here is an example:
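A sketch, assuming a named collection called creds with placeholder credentials and a placeholder object URL:

```sql
CREATE NAMED COLLECTION creds AS
    access_key_id = '<access-key-id>',
    secret_access_key = '<secret-access-key>';

SELECT count(*) FROM s3Cluster(
    'cluster_simple',
    creds,
    url = 'https://s3-object-url.csv',
    format = 'CSV',
    structure = 'name String, value UInt32, polygon Array(Array(Tuple(Float64, Float64)))'
);
```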

Accessing private and public buckets

Users can use the same approaches as documented for the s3 function.

Optimizing performance

For details on optimizing the performance of the s3 function, see our detailed guide.