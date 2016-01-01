
The iceberg table function (alias for icebergS3) reads Iceberg tables directly from object storage. Variants exist for each storage backend: icebergS3, icebergAzure, icebergHDFS, and icebergLocal.

Example syntax:

    icebergS3(url [, NOSIGN | access_key_id, secret_access_key, [session_token]] [,format] [,compression_method])

    icebergAzure(connection_string|storage_account_url, container_name, blobpath, [,account_name], [,account_key] [,format] [,compression_method])

    icebergLocal(path_to_table, [,format] [,compression_method])

GCS support: The S3 variant of the functions can be used for Google Cloud Storage (GCS).

Example:

    SELECT url, count() AS cnt
    FROM icebergS3('https://datasets-documentation.s3.amazonaws.com/lake_formats/iceberg/')
    GROUP BY url
    ORDER BY cnt DESC
    LIMIT 5

    ┌─url────────────────────────────────────────────────┬─────cnt─┐
    │ http://liver.ru/belgorod/page/1006.jки/доп_приборы │ 3288173 │ -- 3.29 million
    │ http://kinopoisk.ru                                │ 1625250 │ -- 1.63 million
    │ http://bdsm_po_yers=0&with_video                   │  791465 │
    │ http://video.yandex                                │  582400 │
    │ http://smeshariki.ru/region                        │  514984 │
    └────────────────────────────────────────────────────┴─────────┘

    5 rows in set. Elapsed: 3.375 sec. Processed 100.00 million rows, 9.98 GB (29.63 million rows/s., 2.96 GB/s.)
    Peak memory usage: 10.48 GiB.

The icebergS3Cluster function distributes reads across multiple nodes in a ClickHouse cluster. The initiator node establishes connections to all nodes and dispatches data files dynamically. Each worker node requests and processes tasks until all files have been read. icebergCluster is an alias for icebergS3Cluster. Variants also exist for Azure (icebergAzureCluster) and HDFS (icebergHDFSCluster).

Example syntax:

    icebergS3Cluster(cluster_name, url [, NOSIGN | access_key_id, secret_access_key, [session_token]] [,format] [,compression_method])
    -- icebergCluster is an alias for icebergS3Cluster

    icebergAzureCluster(cluster_name, connection_string|storage_account_url, container_name, blobpath, [,account_name], [,account_key] [,format] [,compression_method])

Example (ClickHouse Cloud):

    SELECT url, count() AS cnt
    FROM icebergS3Cluster(
        'default',
        'https://datasets-documentation.s3.amazonaws.com/lake_formats/iceberg/'
    )
    GROUP BY url
    ORDER BY cnt DESC
    LIMIT 5

As an alternative to using the table function in every query, you can create a persistent table using the Iceberg table engine. The data still resides in object storage and is read on demand; no data is copied into ClickHouse. The advantage is that the table definition is stored in ClickHouse and can be shared across users and sessions without each user needing to specify the storage path and credentials. Engine variants exist for each storage backend: IcebergS3 (or the Iceberg alias), IcebergAzure, IcebergHDFS, and IcebergLocal.

Both the table engine and the table function support data caching, using the same caching mechanism as the S3, AzureBlobStorage, and HDFS storage engines. Additionally, a metadata cache stores manifest file information in memory, reducing repeated reads of Iceberg metadata. This cache is enabled by default via the use_iceberg_metadata_files_cache setting.

The table engine Iceberg is an alias to IcebergS3.

Example syntax:

    CREATE TABLE iceberg_table ENGINE = IcebergS3(url [, NOSIGN | access_key_id, secret_access_key, [session_token]] [,format] [,compression_method])

    CREATE TABLE iceberg_table ENGINE = IcebergAzure(connection_string|storage_account_url, container_name, blobpath, [account_name, account_key, format, compression])

    CREATE TABLE iceberg_table ENGINE = IcebergLocal(path_to_table, [,format] [,compression_method])

GCS support: The S3 variant of the table engine can be used for Google Cloud Storage (GCS).

Example:

    CREATE TABLE hits_iceberg
    ENGINE = IcebergS3('https://datasets-documentation.s3.amazonaws.com/lake_formats/iceberg/')

    SELECT url, count() AS cnt
    FROM hits_iceberg
    GROUP BY url
    ORDER BY cnt DESC
    LIMIT 5

    ┌─url────────────────────────────────────────────────┬─────cnt─┐
    │ http://liver.ru/belgorod/page/1006.jки/доп_приборы │ 3288173 │
    │ http://kinopoisk.ru                                │ 1625250 │
    │ http://bdsm_po_yers=0&with_video                   │  791465 │
    │ http://video.yandex                                │  582400 │
    │ http://smeshariki.ru/region                        │  514984 │
    └────────────────────────────────────────────────────┴─────────┘

    5 rows in set. Elapsed: 2.737 sec. Processed 100.00 million rows, 9.98 GB (36.53 million rows/s., 3.64 GB/s.)
    Peak memory usage: 10.53 GiB.

For supported features including partition pruning, schema evolution, time travel, caching, and more, see the support matrix. For full reference, see the iceberg table function and Iceberg table engine documentation.
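As a minimal sketch of the points above, the following queries first inspect the schema that ClickHouse derives from the Iceberg metadata, then rerun the aggregation with the metadata cache switched off for a single query. The public dataset URL is the one used in the example. Applying use_iceberg_metadata_files_cache in a query-level SETTINGS clause is an assumption about your ClickHouse version, so confirm the setting is available before relying on it.

```sql
-- Show the column names and types inferred from the Iceberg table metadata.
DESCRIBE TABLE icebergS3('https://datasets-documentation.s3.amazonaws.com/lake_formats/iceberg/');

-- Same aggregation as in the example above, but with the in-memory
-- metadata cache disabled for this query only (assumed to be supported
-- as a query-level setting in your version).
SELECT url, count() AS cnt
FROM icebergS3('https://datasets-documentation.s3.amazonaws.com/lake_formats/iceberg/')
GROUP BY url
ORDER BY cnt DESC
LIMIT 5
SETTINGS use_iceberg_metadata_files_cache = 0;
```

Comparing elapsed times with the setting at 0 and at its default gives a rough sense of how much repeated metadata reads cost for a given table.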

The deltaLake table function (alias for deltaLakeS3) reads Delta Lake tables from object storage. Variants exist for other backends: deltaLakeAzure and deltaLakeLocal.

Example syntax:

    deltaLakeS3(url [,aws_access_key_id, aws_secret_access_key] [,format] [,structure] [,compression])

    deltaLakeAzure(connection_string|storage_account_url, container_name, blobpath, [,account_name], [,account_key] [,format] [,compression_method])

    deltaLakeLocal(path, [,format])

GCS support: The S3 variant of the functions can be used for Google Cloud Storage (GCS).

Example:

    SELECT URL, count() AS cnt
    FROM deltaLake('https://datasets-documentation.s3.amazonaws.com/lake_formats/delta_lake/')
    GROUP BY URL
    ORDER BY cnt DESC
    LIMIT 5

    ┌─URL────────────────────────────────────────────────┬─────cnt─┐
    │ http://liver.ru/belgorod/page/1006.jки/доп_приборы │ 3288173 │ -- 3.29 million
    │ http://kinopoisk.ru                                │ 1625250 │ -- 1.63 million
    │ http://bdsm_po_yers=0&with_video                   │  791465 │
    │ http://video.yandex                                │  582400 │
    │ http://smeshariki.ru/region                        │  514984 │
    └────────────────────────────────────────────────────┴─────────┘

    5 rows in set. Elapsed: 3.878 sec. Processed 100.00 million rows, 14.82 GB (25.78 million rows/s., 3.82 GB/s.)
    Peak memory usage: 9.16 GiB.

The deltaLakeCluster function distributes reads across multiple nodes in a ClickHouse cluster. The initiator node dispatches data files dynamically to worker nodes for parallel processing. deltaLakeS3Cluster is an alias for deltaLakeCluster. An Azure variant (deltaLakeAzureCluster) is also available.

Example syntax:

    deltaLakeCluster(cluster_name, url [,aws_access_key_id, aws_secret_access_key] [,format] [,structure] [,compression])
    -- deltaLakeS3Cluster is an alias for deltaLakeCluster

    deltaLakeAzureCluster(cluster_name, connection_string|storage_account_url, container_name, blobpath, [,account_name], [,account_key] [,format] [,compression_method])

GCS support: The S3 variant of the functions can be used for Google Cloud Storage (GCS).

Example (ClickHouse Cloud):

    SELECT URL, count() AS cnt
    FROM deltaLakeCluster(
        'default',
        'https://datasets-documentation.s3.amazonaws.com/lake_formats/delta_lake/'
    )
    GROUP BY URL
    ORDER BY cnt DESC
    LIMIT 5

As an alternative to using the table function in every query, you can create a persistent table using the DeltaLake table engine, provided the data is in S3-compatible storage. The data still resides in object storage and is read on demand; no data is copied into ClickHouse. The advantage is that the table definition is stored in ClickHouse and can be shared across users and sessions without each user needing to specify the storage path and credentials.

Both the table engine and the table function support data caching, using the same caching mechanism as the S3, AzureBlobStorage, and HDFS storage engines.

Example syntax:

    CREATE TABLE delta_table ENGINE = DeltaLake(url [,aws_access_key_id, aws_secret_access_key])

GCS support: This table engine can be used for Google Cloud Storage (GCS).

Example:

    CREATE TABLE hits_delta
    ENGINE = DeltaLake('https://datasets-documentation.s3.amazonaws.com/lake_formats/delta_lake/')

    SELECT URL, count() AS cnt
    FROM hits_delta
    GROUP BY URL
    ORDER BY cnt DESC
    LIMIT 5

    ┌─URL────────────────────────────────────────────────┬─────cnt─┐
    │ http://liver.ru/belgorod/page/1006.jки/доп_приборы │ 3288173 │
    │ http://kinopoisk.ru                                │ 1625250 │
    │ http://bdsm_po_yers=0&with_video                   │  791465 │
    │ http://video.yandex                                │  582400 │
    │ http://smeshariki.ru/region                        │  514984 │
    └────────────────────────────────────────────────────┴─────────┘

    5 rows in set. Elapsed: 3.608 sec. Processed 100.00 million rows, 14.82 GB (27.72 million rows/s., 4.11 GB/s.)
    Peak memory usage: 9.27 GiB.

For supported features including storage backends, caching, and more, see the support matrix. For full reference, see the deltaLake table function and DeltaLake table engine documentation.
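To illustrate the GCS note and the credential arguments together, here is a hedged sketch of a DeltaLake engine table that points at a Google Cloud Storage bucket. The bucket name example-bucket, the table path, the table name events_delta, and the key placeholders are all hypothetical. The sketch also assumes the bucket is reached through GCS's S3-compatible endpoint (storage.googleapis.com) with an HMAC key pair, which may not match your setup.

```sql
-- Hypothetical bucket, path, and credentials: replace all of them.
-- Assumes access through the GCS S3-compatible endpoint with an HMAC key pair.
CREATE TABLE events_delta
ENGINE = DeltaLake(
    'https://storage.googleapis.com/example-bucket/lake/events/',
    '<hmac_access_key>',
    '<hmac_secret>'
);

-- The definition is stored in ClickHouse, so later queries need
-- neither the storage path nor the credentials.
SELECT count() FROM events_delta;
```

Because the credentials live in the table definition, access to the table can be managed with ordinary ClickHouse grants instead of by distributing keys to each user.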

The hudi table function reads Hudi tables from S3.

Syntax:

    hudi(url [,aws_access_key_id, aws_secret_access_key] [,format] [,structure] [,compression])

The hudiCluster function distributes reads across multiple nodes in a ClickHouse cluster. The initiator node dispatches data files dynamically to worker nodes for parallel processing.

    hudiCluster(cluster_name, url [,aws_access_key_id, aws_secret_access_key] [,format] [,structure] [,compression])

As an alternative to using the table function in every query, you can create a persistent table using the Hudi table engine. The data still resides in object storage and is read on demand; no data is copied into ClickHouse. The advantage is that the table definition is stored in ClickHouse and can be shared across users and sessions without each user needing to specify the storage path and credentials.

Syntax:

    CREATE TABLE hudi_table ENGINE = Hudi(url [,aws_access_key_id, aws_secret_access_key])

For supported features including storage backends and more, see the support matrix. For full reference, see the hudi table function and Hudi table engine documentation.
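The Hudi syntax above can be exercised along the following lines. This is only a sketch: the bucket example-bucket, the table path, the table name hudi_events, and the credential placeholders are hypothetical, since no public Hudi dataset is named here.

```sql
-- Ad hoc read through the table function (hypothetical bucket and keys).
SELECT count()
FROM hudi(
    'https://example-bucket.s3.amazonaws.com/path/to/hudi_table/',
    '<aws_access_key_id>',
    '<aws_secret_access_key>'
);

-- Persistent definition using the table engine; the data stays in S3.
CREATE TABLE hudi_events
ENGINE = Hudi(
    'https://example-bucket.s3.amazonaws.com/path/to/hudi_table/',
    '<aws_access_key_id>',
    '<aws_secret_access_key>'
);

-- Later queries reference the table by name only.
SELECT * FROM hudi_events LIMIT 10;
```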