Blog / Engineering

Optimizing Analytical Workloads: Comparing Redshift vs ClickHouse

author avatar
Dale McDiarmid
Mar 23, 2023 - 59 minutes read

redshift_vs_clickhouse.png

Introduction

Amazon Redshift is a cloud data warehouse that provides reporting and analytics capabilities for structured and semi-structured data. It was designed to handle analytical workloads on big data sets using column-oriented database principles similar to ClickHouse. As part of the AWS offering, it is often the default solution AWS users turn to for their analytical data needs.

While attractive to existing AWS users due to its tight integration with the Amazon ecosystem, Redshift users that adopt it to power real-time analytics applications find themselves in need of a more optimized solution for this purpose. As a result, they increasingly turn to ClickHouse to benefit from superior query performance and data compression, either as a replacement or a “speed layer” deployed alongside existing Redshift workloads.

In this blog, we explore why users move workloads from Redshift to ClickHouse, providing evidence of increased compression and query performance and describe options for migrating data – in bulk, as well as continuously appending new data using AWS EventBridge, AWS Lambda and AWS Glue.

ClickHouse vs Redshift

For users heavily invested in the AWS ecosystem, Redshift represents a natural choice when faced with data warehousing needs. Redshift differs from ClickHouse in this important aspect – it optimizes its engine for data warehousing workloads requiring complex reporting and analytical queries. Across all deployment modes, the following two limitations make it difficult to use Redshift for real-time analytical workloads:

  • Redshift compiles code for each query execution plan, which adds significant overhead to first-time query execution (up to 2s in our testing). This overhead can be justified when query patterns are predictable and compiled execution plans can be stored in a query cache. However, this introduces challenges for interactive applications with variable queries. Even when Redshift is able to exploit this code compilation cache, ClickHouse is faster on most queries - see "Benchmarks" and "ClickHouse vs Redshift Query Comparison" below.
  • Redshift limits concurrency to 50 across all queues, which (while adequate for BI) makes it inappropriate for highly concurrent analytical applications.

Conversely, while ClickHouse can also be utilized for complex analytical queries it is optimized for real-time analytical workloads, either powering applications or acting as a warehouse acceleration later. As a result, Redshift users typically replace or augment Redshift with ClickHouse for the following reasons:

  • ClickHouse achieves lower query latencies, including for varied query patterns, under high concurrency and while subjected to streaming inserts. Even when your query misses a cache, which is inevitable in interactive user-facing analytics, ClickHouse can still process it fast.
  • ClickHouse places much higher limits on concurrent queries, which is vital for real-time application experiences. In ClickHouse, self-managed as well as cloud, you can scale up your compute allocation to achieve the concurrency your application needs for each service. The level of permitted query concurrency is configurable in ClickHouse, with ClickHouse Cloud defaulting to a value of 1000.
  • ClickHouse offers superior data compression, which allows users to reduce their total storage (and thus cost) or persist more data at the same cost and derive more real-time insights from their data. See "ClickHouse vs Redshift Storage Efficiency" below.

Users additionally appreciate ClickHouse for its wide-ranging support of real-time analytical capabilities, such as:

Redshift Deployment Options

When deploying Redshift, users are presented with several options, each with respective strengths and weaknesses for different workloads:

Redshift Serverless - A recent addition to the Redshift product lineup (GA in July 2022), this offering separates storage and compute and automatically provisions and scales the warehouse capacity based on query load. Similar to ClickHouse Cloud, this is a fully managed offering where instances are automatically upgraded. Available compute capacity is measured through a custom Redshift Processing Units (RPU) unit (approximately two virtual CPUs and 16 GB of RAM), for which the user sets a limit (defaulting to 128). Users are charged for the data stored, and the compute capacity consumed while the warehouse is active with 1-second granularity and a minimum charging period of 1 minute. While changing RPU limits requires notable downtime, this offering is suitable for ad-hoc analytics where performance is not critical, and workloads are variable with potential idle time. However, it is less appropriate for high or potentially unbounded and unpredictable query workloads, e.g., for applications where the load is based on the number of users. Furthermore, if latency is critical, users typically lean to the provisioned choices below.

Redshift Provisioned - The original Redshift offering, with recent additions and improvements, offers users several choices:

  • DC2 nodes - Designed for compute-intensive data warehouses based on local SSD storage, where query latency is critical. Redshift recommends these for datasets that are less than 1TB compressed. Users can select specific node sizes and the number of nodes to increase total capacity.
  • RA3 nodes - These nodes offer higher storage to compute ratios and offload data to S3 once the local disk is full, using a feature known as Managed Storage. This ability is comparable to ClickHouse Cloud's shared-nothing architecture, where the local disk acts as a cache with s3 offering unbounded storage capacity. Note, however, that, unlike ClickHouse Cloud, data In Redshift is still associated with a specific node, preventing storage and compute from being scaled and paid for completely independently. Storage costs are priced independently of where the data resides (i.e., local disk or s3), with users paying only for managed storage used.

In the rest of this post, we show how Redshift users can migrate data in bulk to ClickHouse, as well as keep data synchronized between the new systems in side-by-side deployments.

Setup & Dataset

Examples in this blog post utilize ClickHouse Cloud, which has a free trial that allows the completion of the scenarios we cover. We utilize a publicly available cloud environment on sql.clickhouse.com, which has a total of 720 GB of memory and 180 vCPUs, over three nodes (note our benchmarks utilize only one node). All instructions are also compatible with self-managed ClickHouse deployments running the latest version.

We use the Ethereum Cryptocurrency dataset available in BigQuery's public project, which we used in our earlier post comparing ClickHouse with BigQuery. Similar to this post, we defer exploring this dataset in detail to a later blog post but recommend reading Google's blog on how this was constructed and the subsequent post on querying this and other crypto datasets. No prior experience with crypto is required for reading this blog post, but for those interested, the Introduction to Ethereum provides a useful overview. Google has documented a number of great queries on top of this dataset, which we reference later in the blog. We have collated equivalent ClickHouse and Redshift queries here and welcome contributions.

The dataset consists of four tables:

  • Transactions - Cryptographically signed instructions from accounts, e.g. to transfer currency from one account to another
  • Blocks - Batches of transactions with a hash of the previous block in the chain
  • Traces - Internal transactions that allow querying all Ethereum addresses with their balances
  • Contracts - Programs that run on the Ethereum blockchain

These tables represent a subset of the full data and address the most common queries, while providing significant volume. Since this dataset is not offered by AWS, it can be generated using the excellent Ethereum ETL tooling for which a PR has been submitted supporting ClickHouse as a destination. Alternatively, we have made a snapshot of this data available in the s3 bucket s3://datasets-documentation/ethereum/ for our users to explore as well as in our public playground which is kept up-to-date. A full up-to-date version of the data can also be found in the gcs bucket gs://clickhouse_public_datasets/ethereum.

Loading Data into Redshift

For our examples, we assume the data has been loaded into Redshift. For those readers interested, we exported this data from BigQuery to S3 in Parquet format (where Google maintains an up-to-date copy) using BigQuery’s EXPORT capabilities. The full schemas can be found here and along with the full steps required to load the Parquet files. Parquet was chosen as it represents the most efficient format for Redshift.

Unfortunately, we were not able to use the COPY command for this task as either some columns could not be loaded (e.g., string lengths exceeding Redshift limits of 65k chars) or date conversion was required on load. Our examples thus use Redshift’s ability to query data in S3 via Spectrum with external tables used to expose the s3 files before inserting the data into the final tables via INSERT INTO SELECT (where subsets of columns can be selected and CAST).

Schemas were optimized using the ANALYZE COMPRESSION command to identify the most effective codecs. We utilized the same sorting keys as those established for queries in our earlier BigQuery post for this dataset.

ClickHouse vs Redshift Storage Efficiency

For users interested in the details of data loading, and options for keeping ClickHouse and Redshift in sync, we have provided these below. To highlight the value of moving to ClickHouse, however, we first show a summary of the respective sizes of our above dataset in both systems. As of the time of writing, it's not possible to identify an uncompressed size for the data in Redshift - so we measure the compressed size for both only. For the full table schemas and codecs used, see the below section Migrating Redshift Tables to ClickHouse.

Measuring Redshift Table Size

This information can be obtained with a simple query.

SELECT "table", size, tbl_rows, unsorted, pct_used, diststyle  FROM SVV_TABLE_INFO WHERE "table" = 'blocks'

?column?    size    tbl_rows    unsorted    pct_used
blocks      11005   16629116    0           0.0005

It is expected that the value of “unsorted” field is 0. If not, users can run a VACUUM command to sort any unsorted rows in the background and achieve more optimal compression. The value returned for size is in MB and can be compared to compressed storage in ClickHouse. The distribution style is also returned since this can impact total size. Our tables have all been configured with an AUTO value, Redshift is free to assign an optimal distribution style and adjust this based on table size. Aside from our smallest table, blocks, the EVEN distribution style is selected, which means that the data is sent round-robin across nodes. We applied the optimal compression algorithms for each column as identified by the ANALYZE COMPRESSION (see “Compression” below).

Below we capture Redshift storage statistics from the serverless instance.

Table Name Total Rows Compressed size Distribution style
blocks

16629116

10.74GB

AUTO(KEY(number))

contracts

57394746

12.51GB

AUTO(EVEN)

transactions

1874052391

187.53GB

AUTO(EVEN)

traces

6377694114

615.46GB

AUTO(EVEN)

Measuring ClickHouse Table Size

Compressed table sizes in ClickHouse can be found with a query to the system.columns table.

SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed_size
FROM system.columns
WHERE database = 'ethereum'
GROUP BY table
ORDER BY sum(data_compressed_bytes) DESC

┌─table───────────┬─compressed_size─┬
│ traces          │ 339.95 GiB      │
│ transactions    │ 139.94 GiB      │
│ blocks          │ 5.37 GiB        │
│ contracts       │ 2.73 GiB        │
└─────────────────┴──────────────────

Comparison

Below we compare the above measurements, also comparing to Parquet and computing a ClickHouse to Redshift storage ratio.

Table Parquet Size (using SNAPPY) Redshift Size (Compressed) ClickHouse Size (Compressed) ClickHouse/Redshift ratio
transactions 252.4 GiB 187.53 GiB 139.94 GB 1.3
blocks 10.9 GiB 10.74 GiB 5.37 GB 2
traces 710.1 GiB 615.46 GiB 339.95 1.8
contracts 16.0 GiB 12.51 GiB 2.73 GB 4.6
Total 989.4 GiB 826.24 GiB 487.99 GiB 2

As shown, ClickHouse compresses the data more efficiently than the optimal Redshift schema with a combined rate of 2x for this dataset.

Benchmarks

To provide a comparison of query performance, we have performed the benchmarks detailed at benchmarks.clickhouse.com on a 2 node dc2.8xlarge cluster, which provides a total of 64 cores and 488GB RAM, using the steps outlined here. AWS recommends this node type for compute-intensive workloads on top of datasets under 1TB compressed. We compare the results below to a single ClickHouse Cloud node with 60 cores and 240GB RAM. The full methodology of this benchmark, which runs 42 queries over a 100m row web analytics dataset, is detailed in the repository. We present these results below which can also be accessed from here.

detailed-comparison.png

As shown, our 60 core ClickHouse Cloud node is on average 2.5x times faster than a comparative Redshift cluster. Feel free to explore other comparisons, where Redshift cluster resources are considerably higher.

Migrating Redshift Tables to ClickHouse

Both Redshift and ClickHouse are built on top of columnar storage, so dealing with tables is similar in both systems.

Data Types

Users moving data between ClickHouse and Redshift will immediately notice that ClickHouse offers a more extensive range of types, which are also less restrictive. While Redshift requires users to specify possible string lengths, even if variable, ClickHouse removes this restriction and burden from the user by storing strings without encoding as bytes. The ClickHouse String type thus has no limits or length specification requirements.

Furthermore, users can exploit Arrays, Tuples, and Enums - absent from Redshift as first-class citizens (although Arrays/Structs can be imitated with SUPER) and a common frustration of users. ClickHouse additionally allows the persistence, either at query time or even in a table, of aggregation states. This will enable data to be pre-aggregated, typically using a materialized view, and can dramatically improve query performance for common queries.

Below we map the equivalent ClickHouse type for each Redshift type:

Redshift ClickHouse
SMALLINT Int8*
INTEGER Int32*
BIGINT Int64*
DECIMAL UInt128, UInt256, Int128, Int256, Decimal(P, S), Decimal32(S), Decimal64(S), Decimal128(S), Decimal256(S) - (high precision and ranges possible)
REAL Float32
DOUBLE PRECISION Float64
BOOLEAN Bool
CHAR String, FixedString
VARCHAR** String
DATE Date32
TIMESTAMP DateTime, DateTime64
TIMESTAMPTZ DateTime, DateTime64
GEOMETRY Geo Data Types
GEOGRAPHY Geo Data Types (less developed e.g. no coordinate systems - can be emulated with functions)
HLLSKETCH AggregateFunction(uniqHLL12, X)
SUPER Tuple, Nested, Array, JSON, Map
TIME DateTime, DateTime64
TIMETZ DateTime, DateTime64
VARBYTE** String combined with Bit and Encoding functions

* ClickHouse additionally supports unsigned integers with extended ranges i.e. UInt8, UInt32, UInt32 and UInt64.
**ClickHouse’s String type is unlimited by default but can be constrained to specific lengths using Constraints.



When presented with multiple options for ClickHouse types, consider the actual range of data and pick the lowest required.

Compression

ClickHouse and Redshift support common compression algorithms, including ZSTD. Except for applying delta encoding to integer and date sequences, we typically find ZSTD is the most widely applicable compression algorithm and delivers the best results in most cases.

Redshift allows auto-detection of the best compression algorithm for each column when copying data with the COPY command, using the COMPUPDATE ON option (with some limitations on import data type, e.g. not supported for Parquet). We typically find this also suggests ZSTD for most columns, aligning with our own findings. Alternatively, the user can request an optimal schema with an estimation of the projected space savings via the ANALYZE COMPRESSION command. We apply these recommendations to all of our table schemas.

Currently, codecs in ClickHouse must be specified when creating tables. These can, however, be combined (e.g., CODEC(Delta, ZSTD)). Furthermore, ClickHouse allows these compression algorithms to be tuned, usually sacrificing compression or decompression speed for increased space savings (e.g. ZSTD(9) offers higher reduction rates than ZSTD(3) at the cost of slower compression, but largely consistent decompression performance at query time). This increased tunability helps ClickHouse achieve higher compression rates.

Sorting Keys

Both ClickHouse and Redshift have the concept of a “sorting key”, which defines how data is sorted when being stored. Redshift defines the sorting key using the SORTKEY clause:

CREATE TABLE some_table(...) SORTKEY (column1, column2)

Comparatively, ClickHouse uses an ORDER BY clause to specify the sort order:

CREATE TABLE some_table(...) ENGINE = MergeTree ORDER BY (column1, column2)

In most cases, users can use the same sorting key columns and order in ClickHouse as Redshift, assuming you are using the default COMPOUND type. When data is added to Redshift, you should run a VACUUM and ANALYZE commands to re-sort newly added data and update the statistics for the query planner - otherwise, the unsorted space grows. No such process is required for ClickHouse.

Redshift supports a couple of convenience features for sorting keys. One is automatic sorting keys (using SORTKEY AUTO), which may be appropriate for getting started, but explicit sorting keys ensure the best performance and storage efficiency when the sorting key is optimal. The other is the INTERLEAVED sort key, which gives equal weight to a subset of columns in the sort key to improve performance when a query uses one or more secondary sort columns. ClickHouse supports explicit projections, which achieve the same end result with a slightly different setup.

Users should be aware that the “primary key” concept represents different things in ClickHouse and Redshift. In Redshift, the primary key resembles the traditional RDMS concept intended to enforce constraints. However, they are not strictly enforced in Redshift and instead act as hints for the query planner and data distribution among nodes. In ClickHouse, the primary key denotes columns used to construct the sparse primary index, used to ensure the data is ordered on disk, maximizing compression while avoiding pollution of the primary index and wasting memory.

Example Table

In this example, we assume our data is only present in Redshift, and we are unfamiliar with the equivalent ClickHouse Ethereum schemas. The current schema for any Redshift table can be retrieved with the following query:

SHOW TABLE <schema>.<table>

For example, for the blocks table:

CREATE TABLE public.blocks (
        number bigint NOT NULL ENCODE zstd distkey,
        hash character(66) ENCODE zstd,
        parent_hash character(66) ENCODE zstd,
        nonce character(18) ENCODE zstd,
        sha3_uncles character(66) ENCODE zstd,
        logs_bloom character(514) ENCODE zstd,
        transactions_root character(66) ENCODE zstd,
        state_root character(66) ENCODE zstd,
        receipts_root character(66) ENCODE zstd,
        miner character(42) ENCODE zstd,
        difficulty numeric(38, 0) ENCODE az64,
        total_difficulty numeric(38, 0) ENCODE az64,
        SIZE bigint ENCODE zstd,
        extra_data CHARACTER varying(66) ENCODE zstd,
        gas_limit bigint ENCODE zstd,
        gas_used bigint ENCODE zstd, 
        timestamp timestamp WITHOUT TIME ZONE ENCODE RAW,
        transaction_count bigint ENCODE zstd,
        base_fee_per_gas bigint ENCODE zstd,
PRIMARY KEY (number)) DISTSTYLE AUTO SORTKEY (timestamp);

The full Redshift schemas can be found here. For some tables, columns have been dropped from the original dataset before inserting into Redshift, because their length exceeds the 65k maximum for a Redshift string, e.g. input column of transactions.

The schema for blocks and statements to create the equivalent table in ClickHouse is shown below. If no codec is specified, ZSTD(1) is used as the compression algorithm, since that is the default setting in ClickHouse Cloud.

CREATE TABLE blocks
(
    `number` UInt32 CODEC(Delta(4), ZSTD(1)),
    `hash` String,
    `parent_hash` String,
    `nonce` String,
    `sha3_uncles` String,
    `logs_bloom` String,
    `transactions_root` String,
    `state_root` String,
    `receipts_root` String,
    `miner` String,
    `difficulty` Decimal(38, 0),
    `total_difficulty` Decimal(38, 0),
    `size` UInt32 CODEC(Delta(4), ZSTD(1)),
    `extra_data` String,
    `gas_limit` UInt32 CODEC(Delta(4), ZSTD(1)),
    `gas_used` UInt32 CODEC(Delta(4), ZSTD(1)),
    `timestamp` DateTime CODEC(Delta(4), ZSTD(1)),
    `transaction_count` UInt16,
    `base_fee_per_gas` UInt64
)
ENGINE = MergeTree
ORDER BY timestamp

We have made basic optimizations to these schemas with appropriate types and codecs to minimize storage. For instance, we don't make columns Nullable, despite them being so in the original schema, because for most queries, there is no need to distinguish between the default value and the Null value. By using default values, we avoid additional UInt8 column overhead associated with Nullable. Otherwise, we kept many of the defaults, including using the same ORDER BY key as Redshift.

You can run an additional query to identify the data range and cardinality, allowing you to select the most optimal ClickHouse type. The blog “Optimizing ClickHouse with Schemas and Codecs” offers a deeper look at this topic. We leave full analysis of schema optimization to a later blog dedicated to this dataset.

SELECT
 MAX(number) AS max_number,
 MIN(number) AS min_number,
 MAX(size) AS max_size,
 MIN(size) AS min_size
FROM blocks

max_number    min_number    max_size    min_size
16547585      0             1501436     514

Getting data from Redshift to ClickHouse

Getting data from Redshift to ClickHouse

Redshift supports exporting data to S3 via the UNLOAD command. Data can, in turn, be imported into ClickHouse using the s3 table function. This "pivot" step approach has a number of advantages:

  • Redshift UNLOAD functionality supports a filter for exporting a subset of data via standard SQL query.
  • Redshift supports exporting to Parquet, JSON, and CSV formats and several compression types - all supported by ClickHouse.
  • S3 supports object lifecycle management, allowing data that has been exported and imported into ClickHouse to be deleted after a specified period.
  • Exports produce multiple files automatically, allowing export to be parallelized, limiting each to a maximum of 6.2GB. This is beneficial to ClickHouse, since it allows imports to be parallelized. This parallelization is done at the slice level, with each slice generating one or more files.
  • AWS does not charge for unloading data to S3, provided RedShift and the bucket are in the same region. Users, however, will still pay for compute resources consumed by the data export query (if using Redshift Serverless) and storage costs in S3.

Exporting Data from Redshift to S3

To export data from a Redshift table to a file in an S3 bucket, make sure you have created the bucket and given Redshift the permission to access it. We can use the UNLOAD command to export data from a Redshift table. It is possible to restrict the export to a subset of columns using the SELECT statement:

UNLOAD ('SELECT * FROM some_table')
TO 's3://my_bucket_name/some_table_'
iam_role 'arn:aws:...' PARQUET
ALLOWOVERWRITE

Exporting Data from Redshift to S3

We used the column-oriented Parquet file format for export, because it is a good choice in terms of storage efficiency and export speed (2x other formats), and is optimized for reading by ClickHouse. The time taken for this operation depends on the resources (and slices) assigned to the Redshift cluster as well as region locality. We utilize the same region for both S3 and Redshift for the export to maximize throughput and costs. The export timings for each of our tables are shown below for a provisioned and a serverless (limited to 128 RPUs) Redshift cluster. We utilize the setting MAXFILESIZE to limit Parquet file size to 100MB for block data, given its smaller size. This allows exports to be parallelized by Redshift as well as assisting with ClickHouse imports. For all other tables, we rely on the default file partitioning, which creates multiple files using 6.2GB as an upper limit.

Table Number of Files Parquet Size (GB) Redshift Serverless (128 RPUs) Redshift (2xdc2.8xlarge)
Blocks 128 10.9GiB 4.9s 18.4s
Contracts 128 16.0 GiB 2m 43.9s 22.5s
Transactions 128 252.4 GiB 4m 40s 10m 14.1s
Traces 128 710.1 GiB 5m 36.1s 29m 37.8s

An observant reader will notice that we have 128 files for all types. This is due to Redshift parallelizing at the slice level (this seems to be equivalent to an RPU for UNLOAD), with each slice producing at least one file, or multiple if the file exceeds 6.2GB. The serverless instance here has significantly more resources (around 256 cores) available, which attributes to the much faster export time.

Importing Data from S3 into ClickHouse

We load data from S3 into ClickHouse with the s3 function. We pre-create the table before running the following INSERT INTO the blocks table.

SET parallel_distributed_insert_select = 1
INSERT INTO blocks
SELECT * FROM s3Cluster('default', 'https://dalem-bucket.s3.eu-west-1.amazonaws.com/export/blocks/*.parquet')

0 rows in set. Elapsed: 14.282 sec. Processed 16.63 million rows, 19.26 GB (1.16 million rows/s., 1.35 GB/s.)

We utilize the function s3Cluster, which is a distributed variant of the s3 function. This allows the full cluster resources in ClickHouse Cloud to be utilized for reading and writing. The setting parallel_distributed_insert_select=1 ensures that insertion is parallelized and data is inserted into the same node from which it is read, skipping the initiator node on writes. We don’t provide authentication keys, because the bucket we use is public, but private buckets are supported.

In some cases, you will need to map exported Parquet columns to equivalent ClickHouse data types. For example, Redshift does not support Arrays. For columns such as function_sighashes, an array in the original dataset, these have been represented in Redshift as type SUPER. This has no equivalent representation in Parquet and this column is exported as a String. Below we map this back to an Array type in ClickHouse.

INSERT INTO contracts
SELECT
    address,
    bytecode,
    replaceAll(ifNull(function_sighashes, '[]'), '"', '\'') AS function_sighashes,
    is_erc20,
    is_erc721,
    block_timestamp,
    block_number,
    block_hash
FROM s3Cluster('default', 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/ethereum/contracts/*.parquet')

0 rows in set. Elapsed: 21.239 sec. Processed 57.39 million rows, 58.11 GB (2.70 million rows/s., 2.74 GB/s.)

We repeated this exercise for each of the tables, recording the timings below. Using this method, we were able to transfer ~1TB and 8.3 billion rows from Redshift to ClickHouse in less than 35 minutes.

Table Rows Data Size (Parquet) Redshift Export ClickHouse Import
blocks 16629116 10.9GiB 4.9s 14.28 s
contracts 57394746 16.0 GiB 2m 43.9s 21.24 s
transactions 1874052391 252.4 GiB 4m 40s 5 mins, 15 s
traces 6377694114 710.1 GiB 5m 36.1s 15 mins 34 s
Total 8.32 billion 990GB 13m 5secs 21m 25s

Handling New Data

The above approach works well for bulk-loading data static datasets or the historical data of a dynamic corpus. However, it does not address cases where Redshift tables are receiving new data continuously, which needs to be exported to ClickHouse.

Assumptions

For the remainder of this blog, we assume that:

  • Data is append-only and immutable. There is no requirement to selectively update rows, though dropping older data is expected and described below.
  • Either a time dimension or an incrementing numeric identifier exists on the data that allows new rows for copying to ClickHouse to be identified.

These assumptions are consistent with requirements for real-time analytical datasets we commonly see migrating to ClickHouse. For example, when users choose to keep data in sync between Redshift and ClickHouse, they typically synchronize the most recent dataset based on a time dimension. Our example dataset inherently satisfies these properties, and we use the block timestamp for synchronization purposes.

Scheduling Exports

The simplest solution is to schedule periodic UNLOAD queries to identify any new rows and export these to S3 for insertion into ClickHouse. This approach is easy to implement and maintain. It assumes our data is immutable, with only rows added, and having a property (usually a timestamp) that can be used to identify new data. For our example, suppose we schedule an export every hour of the last 60 minutes of data. This further assumes new rows will be inserted in real time with no delays. In most cases, this is unlikely, with new rows having some delay and offset from the current time. Every time we run an export, we need to export rows from a window offset from the current time. For example, suppose we were confident our data would be available within 15 minutes. We would in turn export all rows from <scheduled_time>-75mins to <scheduled_time>-15mins.

Markdown Image

This approach relies on data being reliably inserted to Redshift within 15 minutes.

Redshift Native

Redshift supports native scheduled query functionality, but it is not a viable choice for our purposes. First, it does not support the ability to reference scheduled time needed for our offset calculation. Second, this functionality is only provided for provisioned and not serverless clusters. Using Redshift native scheduling may be sufficient for users who want to perform periodic table export independent of the scheduled time. For example, users with smaller datasets could export all rows periodically and overwrite the whole dataset, but this is not practical for the larger datasets.

Using Amazon EventBridge

Amazon EventBridge is a serverless event bus that connects apps and services in AWS. For our purposes, we are interested specifically in the Amazon EventBridge Scheduler, which allows us to create, run, and manage scheduled tasks centrally via APIs.

Using Amazon EventBridge

In order to schedule an UNLOAD query with the EventBridge scheduler, configure the appropriate execution role and ensure the role under which the schedule executes has the correct permissions for the Redshift Data API, the ability to run ExecuteStatement commands, and the permission to run an UNLOAD query to export data to S3. To assist with debugging, users will also need the permission to create a Dead-letter queue (DLQ) queue in SQS, to which messages will be sent in the event of a failure. Schedules can be created using either the console, SDKs, or the AWS CLI. Our schedule depends on the ability of Amazon EventBridge to run ExecuteStatement commands against the Redshift Data API. We show the creation of a schedule below that exports all rows from the window <scheduled_time>-75mins to <scheduled_time>-15mins.

Markdown Image

The following important components of this schedule:

  • We use a cron-based schedule to run periodically at 15 minutes past the hour using the expression 15 * * * ? * i.e. There is no flexibility in this execution.
  • The schedule utilizes the Redshift Data API and ExecuteCommand endpoint. The JSON payload for this API is shown below:
{
  "Database": "dev",
  "ClusterIdentifier": "redshift-cluster-1",
  "Sql": "UNLOAD ('SELECT * FROM blocks WHERE timestamp > \\'<aws.scheduler.scheduled-time>\\'::timestamp - interval \\'75 minutes\\' AND timestamp < \\'<aws.scheduler.scheduled-time>\\'::timestamp - interval \\'15 minutes\\'') TO 's3://datasets-documentation.s3.eu-west-3.amazonaws.com/ethereum/incremental/blocks/<aws.scheduler.execution-id>' iam_role 'arn:aws:iam::925472944448:role/RedshiftCopyUnload' PARQUET MAXFILESIZE 200MB ALLOWOVERWRITE",
  "DbUser": "awsuser"
}
  • Here we execute against our dev database in a provisioned cluster in the default workgroup. If this was a serverless cluster, we would specify the WorkgroupName instead of the ClusterIdentifier. DbUser assumes the use of temporary credentials. EventBridge also supports the use of AWS Secrets Manager for authentication.
  • The above payload uses the UNLOAD command to export all rows from the blocks table, which satisfy a specific time range, to a dedicated s3 bucket in Parquet format. We inject the scheduled time (the actual execution time may vary) via [the context attribute <aws.scheduler.scheduled-time> and perform data math in our WHERE clause to shift to the required time range i.e. WHERE timestamp > '<aws.scheduler.scheduled-time>'::timestamp - interval '75 minutes' AND timestamp < '<aws.scheduler.scheduled-time>'::timestamp - interval '15 minutes'
  • The context <aws.scheduler.execution-id> is used to provide a prefix to exported files. This will be unique for each schedule invocation, thus avoiding file collisions.
  • We select a SQS DLQ to send events in the event of failure.

To test this schedule, we can insert a row into our blocks table with an adjusted timestamp that matches the next period and wait for the export.

INSERT INTO blocks(number,hash,parent_hash,nonce,sha3_uncles,logs_bloom,transactions_root,state_root,receipts_root,miner,difficulty,total_difficulty,size,extra_data,gas_limit,gas_used,timestamp,transaction_count,base_fee_per_gas) VALUES(99999999,'','','','','value','','','','',0,58750003716598356000000,74905,'',30000000,13141664,'2023-03-13 18:00:00',152,326697119799)

Objects

Scheduling Import

Our previous export file has the execution id as a prefix. While this avoids collisions, it does not allow us to identify the time range covered by a file using its name. All these files must therefore be scanned to identify the rows for import. As the number of files grows, users should expire files to avoid ever the cost of this query growing.

At the time of writing, ClickHouse does not have a built-in way of scheduling imports (proposal is in discussion). We explore options for periodically importing these files externally below.

External Script

For an initial simple approach, and to illustrate the logic, the following bash script can be run by a cron job periodically after exports are completed. This script first grabs the current maximum date in ClickHouse, before issuing an INSERT INTO blocks SELECT * FROM s3(<bucket with export files>) WHERE timestamp > ${max_date} query. This example handles the blocks table but can easily be adapted to the other tables. This approach has the advantage that it can be run independently of export as often as required, but assumes the availability of clickhouse-client in any container or self-managed environment. We leave scheduling as an exercise for the reader.

#!/bin/bash

max_date=$(clickhouse-client --query "SELECT toInt64(toStartOfHour(toDateTime(max(block_timestamp))) + toIntervalHour(1)) AS next FROM ethereum.transactions");
Clickhouse-client –query "INSERT INTO blocks SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/ethereum/incremental/blocks/*.parquet') WHERE timestamp >= ${max_date}"

Using AWS Lambda

AWS Lambda is an event-driven, serverless computing platform provided by Amazon as a part of Amazon Web Services. It is a computing service that runs code in response to events and automatically manages the computing resources required by that code.

This service can be used to periodically execute the following simple python script, which replicates the above bash logic in Python.

import requests
import json
CLICKHOUSE_HOST = '<host inc port>'
CLICKHOUSE_PASSWORD = '<password>'
CLICKHOUSE_TABLE = blocks'
TIME_COLUMN = 'timestamp'

def lambda_handler(event, context):
   s = requests.Session()
   s.auth = ('default', CLICKHOUSE_PASSWORD)
   response = s.get(f'https://{CLICKHOUSE_HOST}',
                    params={'query': f'SELECT max({TIME_COLUMN}) as max FROM {CLICKHOUSE_TABLE} FORMAT JSONEachRow'})
   max_time = response.json()['max']
   print(max_time)
   insert_query = f"INSERT INTO {CLICKHOUSE_TABLE} SELECT * FROM " \
                  f"s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/ethereum/incremental/blocks/*.parquet') " \
                  f"WHERE timestamp > '{max_time}'"
   response = s.post(f'https://{CLICKHOUSE_HOST}/?', params={'query': ''}, data=insert_query)
   return f'done. written {json.loads(response.headers["X-ClickHouse-Summary"])["written_rows"]} rows'

This script can be packaged and uploaded to AWS Lambda, along with the packaged requests dependency. AWS EventBridge can in turn be configured to schedule the Lambda function at the required interval - we show this below:

Markdown Image

This import not only relies on the latest timestamp in ClickHouse, and can thus run independently of the earlier export but also offloads all work to ClickHouse via INSERT INTO SELECT. This script contains the username and password for ClickHouse Cluster. We would recommend any production deployment of this code be enhanced to utilize AWS Secrets Manager for secure storage and retrieval of these credentials.

Using AWS EventBridge

We didn't use AWS EventBridge to handle the import for the following reasons. Initially, we intended to achieve this using an EventBridge API destination since this capability allows connections to external services over HTTP with Basic Auth. An EventBridge rule would be triggered off the export schedule and send the following query to ClickHouse, utilizing the $.time variable exposed in the job (this is the scheduled time of the export). However, data-changing queries in ClickHouse must be sent via a POST request. Currently, rule targets in EventBridge will either send the request body (a query) in either JSON or quoted-string format. This is not supported by ClickHouse. We are exploring the possibility of supporting an official ClickHouse Event Source.

Filling Gaps in Data

If we perform the bulk import and then schedule the above import and export queries, we will invariably have a “gap” in our data for the period between the bulk load completion and the incremental load starting. To address this we can use the same technique as documented in our Using ClickHouse to Serve Real-Time Queries on Top of BigQuery Data blog post (see “Filling Gaps”).

Using AWS Glue for Streaming Data between Redshift and ClickHouse

The above approaches assume a batch update process that requires export and import steps to be orchestrated correctly. Utilizing AWS Glue, we can avoid this two-step process and encapsulate this logic in a single ETL job.

AWS Glue is a serverless data integration service that makes it easy for users to extract, transform, and load data between multiple sources. While this would allow users to move data from Redshift to ClickHouse, potentially without writing code, AWS Glue does not support a connector for ClickHouse yet. However, it does support the ability to execute Python shell scripts, so we can stream data between the systems using the boto3 library for reading rows from Redshift and the clickhouse-connect library for sending data to ClickHouse.

A tested python script implementing this concept can be found here. This identifies the current maximum date in ClickHouse before requesting rows greater than this time from Redshift. The script paginates through the results, formulating batches for efficiency before inserting them into ClickHouse. Once all rows have been consumed and inserted, the script completes.

aws-glue.png

AWS Glue requires an IAM role to be associated with the execution of the script. Beyond the standard permissions, ensure this role has access to your Redshift cluster and is able to use temporary credentials as required by the ExecuteStatement command. The example below reads from a provisioned cluster but can be modified to connect a serverless cluster if required.

We highlight the commands to deploy the provided script to AWS Glue below. You need to specify the additional-python-modules parameter to ensure the clickhouse-connect dependency (boto3 is made available by default) is installed.

aws glue create-job --name clickhouse-import  --role AWSGlueServiceRoleRedshift --command '{"Name" :  "pythonshell", "ScriptLocation" : "s3://<bucket_path_to_script>"}' --default-arguments '{"--additional-python-modules", "clickhouse-connect"}'

{
    "Name": "clickhouse-import"
}

aws glue start-job-run --job-name "clickhouse-import"
{
    "JobRunId": "jr_a1fbce07f001e760008ded1bad8ee196b5d4ef48d2c55011a161d4fd0a39666f"
}

AWS Glue natively supports the scheduling of these scripts through simple cron expressions. Users are also again recommended to store ClickHouse cluster credentials in AWS Secret Manager, which is supported in AWS Glue vs. in the script. These can be retrieved using the boto3 library provided the required IAM permissions are configured.

This same approach could also be implemented in an AWS Lambda function or using an AWS Glue spark or streaming ETL job.

Dropping Older Data in ClickHouse

For most deployments, ClickHouse’s superior data compression means that you can store data in a granular format for long periods. For our specific Ethereum dataset, this is probably not particularly beneficial since we likely need to preserve the full history of the blockchain for many queries, e.g., computing account balances.

However, there are simple and scalable approaches to dropping older data that may be applicable to other datasets should you wish to only keep a subset in ClickHouse. For instance, it is possible to use TTL features to expire older data in ClickHouse at either a row or column level. This can be made more efficient by partitioning the tables by date, allowing the efficient deleting of data at set intervals. For the purposes of example, we have modified the schema for the blocks table below to partition by month. Rows older than five years are, in turn, expired efficiently using the TTL feature. The set setting ttl_only_drop_parts ensures a part is only dropped when all rows in it are expired.

CREATE TABLE blocks
(
...
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY timestampcoindesk
TTL timestamp + INTERVAL 60 MONTH DELETE
SETTINGS ttl_only_drop_parts=1

Partitioning can both positively and negatively impact queries and should be more considered a data management feature than a tool for optimizing query performance.

ClickHouse vs Redshift Query Comparison

This dataset warrants an entire blog on possible queries. The author of the etherium-etl tool has published an excellent list of blogs focused on insights with respect to this dataset. In a later blog, we’ll cover these queries and show how they can be converted to ClickHouse syntax and how some can be significantly accelerated. Here we extract similar queries from the popular crypto visualization site dune.com.

Some important considerations:

  • Table schemas are available here. We utilize the same ordering key for both and use the Redshift-optimized types described earlier.
  • Redshift and ClickHouse both provide query result caching capabilities. We disable these explicitly for these tests to provide a measurement of performance for cache misses. Redshift also benefits from compiling and caching the query plan after the first execution. Given this cache is unlimited, we thus run each query twice as recommended by Amazon.
  • We utilize the psql client for issuing queries to Redshift. This client isn’t officially supported, although timings are consistent with the UI. Users can also use the RSQL client for issuing queries.
  • These queries are executed using the same hardware as our earlier benchmark

Ethereum Gas Used by Week

This week has been adapted from this Dune visualization. For an explanation of Gas see here. We modify the query to use receipt_gas_used instead of gas_used. In our provisioned Redshift cluster this query executes in 66.3secs.

SELECT
  date_trunc('week', block_timestamp) AS time,
  SUM(receipt_gas_used) AS total_gas_used,
  AVG(receipt_gas_used) AS avg_gas_used,
  percentile_cont(.5) within GROUP (
    ORDER BY
      receipt_gas_used
  ) AS median_gas_used
FROM
  transactions
WHERE
  block_timestamp >= '2015-10-01'
GROUP BY
  time
ORDER BY
  time ASC
LIMIT
  10;

        time         | total_gas_used | avg_gas_used | median_gas_used
---------------------+----------------+--------------+-----------------
 2015-09-28 00:00:00 |  695113747    |      27562    |      21000.0
 2015-10-05 00:00:00 |  1346460245   |      29208    |      21000.0
 2015-10-12 00:00:00 |  1845089516   |      39608    |      21000.0
 2015-10-19 00:00:00 |  1468537875   |      33573    |      21000.0
 2015-10-26 00:00:00 |  1876510203   |      37293    |      21000.0
 2015-11-02 00:00:00 |  2256326647   |      37741    |      21000.0
 2015-11-09 00:00:00 |  2229775112   |      38535    |      21000.0
 2015-11-16 00:00:00 |  1457079785   |      28520    |      21000.0
 2015-11-23 00:00:00 |  1477742844   |      29497    |      21000.0
 2015-11-30 00:00:00 |  1796228561   |      34517    |      21000.0
(10 rows)

Time: 66341.393 ms (01:06.341)

Comparatively our 60 core ClickHouse Cloud node completes this query in 17 seconds. Note both the simpler quantile syntax and how we use the toStartOfWeek function with a mode of 1 to consider Mondays as the start of the week. This delivers consistent results as dune.com and Redshift.

SELECT
    toStartOfWeek(block_timestamp, 1) AS time,
    SUM(receipt_gas_used) AS total_gas_used,
    round(AVG(receipt_gas_used)) AS avg_gas_used,
    quantileExact(0.5)(receipt_gas_used) AS median_gas_used
FROM transactions
WHERE block_timestamp >= '2015-10-01'
GROUP BY time
ORDER BY time ASC
LIMIT 10

┌───────time─┬─total_gas_used─┬─avg_gas_used─┬─median_gas_used─┐
│ 2015-09-286951137472756221000  │
│ 2015-10-0513464602452920821000  │
│ 2015-10-1218450895163960921000  │
│ 2015-10-1914685378753357321000  │
│ 2015-10-2618765102033729421000  │
│ 2015-11-0222563266473774221000  │
│ 2015-11-0922297751123853521000  │
│ 2015-11-1614570797852852021000  │
│ 2015-11-2314777428442949821000  │
│ 2015-11-3017962285613451821000  │
└────────────┴────────────────┴──────────────┴─────────────────┘

10 rows in set. Elapsed: 17.287 sec. Processed 1.87 billion rows, 14.99 GB (108.39 million rows/s., 867.15 MB/s.)

Both of these functions utilize an exact computation of Percentiles. Equivalent estimation functions in Redshift and ClickHouse (likely sufficient for visualizations), offer the possibility of improved performance. For Redshift unfortunately, this function is limited by cluster size resulting in the following error on our 128 RPU serverless instance:

SELECT
  date_trunc('week', block_timestamp) AS time,
  SUM(receipt_gas_used) AS total_gas_used,
  AVG(receipt_gas_used) AS avg_gas_used,
  APPROXIMATE percentile_disc(.5) within GROUP (
    ORDER BY
      receipt_gas_used
  ) AS median_gas_used
FROM
  transactions
WHERE
  block_timestamp >= '2015-10-01'
GROUP BY
  time
ORDER BY
  time ASC
LIMIT
  10;

ERROR:  1036
DETAIL:  GROUP BY limit for approximate percentile_disc exceeded.
The number of groups returned by the GROUP BY clause exceeds the limit for your cluster size. Consider using percentile_cont instead. (pid:13074)

For ClickHouse, this query returns in less than 1.7 seconds, a huge improvement.

SELECT
    toStartOfWeek(block_timestamp,1) AS time,
    SUM(receipt_gas_used) AS total_gas_used,
    round(AVG(receipt_gas_used)) AS avg_gas_used,
    quantile(0.5)(receipt_gas_used) AS median_gas_used
FROM transactions
WHERE block_timestamp >= '2015-10-01'
GROUP BY time
ORDER BY time ASC
LIMIT 10

┌───────time─┬─total_gas_used─┬─avg_gas_used─┬─median_gas_used─┐
│ 2015-09-286951137472756221000  │
│ 2015-10-0513464602452920821000  │
│ 2015-10-1218450895163960921000  │
│ 2015-10-1914685378753357321000  │
│ 2015-10-2618765102033729421000  │
│ 2015-11-0222563266473774221000  │
│ 2015-11-0922297751123853521000  │
│ 2015-11-1614570797852852021000  │
│ 2015-11-2314777428442949821000  │
│ 2015-11-3017962285613451821000  │
└────────────┴────────────────┴──────────────┴─────────────────┘

10 rows in set. Elapsed: 1.692 sec. Processed 1.87 billion rows, 14.99 GB (1.11 billion rows/s., 8.86 GB/s.)

Ethereum Smart Contracts Creation

We adapt this query from a dune.com visualization. We remove the now() restriction since our data has a fixed upper bound. Due to Redshift not supporting the window RANGE function, we are also forced to modify the query slightly to compute the cumulative sum. ClickHouse runs this query in 76ms vs Redshift in 250ms, despite both tables being ordered by trace_type.

SELECT
  date_trunc('week', block_timestamp) AS time,
  COUNT(*) AS created_contracts,
  sum(created_contracts) OVER (
    ORDER BY
    time rows UNBOUNDED PRECEDING
  ) AS cum_created_contracts
from
  traces
WHERE
  trace_type = 'create'
GROUP BY
  time
ORDER BY
  time ASC
LIMIT
  10;
        time        | created_contracts | cum_created_contracts
---------------------+-------------------+-----------------------
 2015-08-03 00:00:00 |              139 |                   139
 2015-08-10 00:00:00 |              204 |                   343
 2015-08-17 00:00:00 |              189 |                   532
 2015-08-24 00:00:00 |              204 |                   736
 2015-08-31 00:00:00 |              266 |                  1002
 2015-09-07 00:00:00 |              252 |                  1254
 2015-09-14 00:00:00 |              293 |                  1547
 2015-09-21 00:00:00 |              274 |                  1821
 2015-09-28 00:00:00 |              129 |                  1950
 2015-10-05 00:00:00 |              143 |                  2093
(10 rows)

Time: 236.261 ms
SELECT
    toStartOfWeek(block_timestamp, 1) AS time,
    count() AS created_contracts,
    sum(created_contracts) OVER (ORDER BY time ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_created_contracts
FROM traces
WHERE trace_type = 'create'
GROUP BY time
ORDER BY time ASC
LIMIT 10

┌───────time─┬─created_contracts─┬─cum_created_contracts─┐
│ 2015-08-03139139  │
│ 2015-08-10204343  │
│ 2015-08-17189532  │
│ 2015-08-24204736  │
│ 2015-08-312661002 │
│ 2015-09-072521254 │
│ 2015-09-142931547 │
│ 2015-09-212741821 │
│ 2015-09-281291950 │
│ 2015-10-051432093 │
└────────────┴───────────────────┴───────────────────────┘

10 rows in set. Elapsed: 0.076 sec. Processed 58.08 million rows, 290.39 MB (767.20 million rows/s., 3.84 GB/s.)

Ether supply by day

The original BigQuery query, documented as part of Awesome BigQuery views and discussed here, executes in 428ms in Redshift. The ClickHouse query runs in 87ms. Using projections, this query can be further optimized to run in under 10ms secs.

WITH ether_emitted_by_date AS (
  SELECT
    date(block_timestamp) AS date,
    SUM(value) AS value
  FROM
    traces
  WHERE
    trace_type IN ('genesis', 'reward')
  GROUP BY
    DATE(block_timestamp)
)
SELECT
  date,
  SUM(value) OVER (
    ORDER BY
      date ASC ROWS BETWEEN UNBOUNDED PRECEDING
      AND CURRENT ROW
  ) / POWER(10, 18) AS supply
FROM
  ether_emitted_by_date
LIMIT
  10;

    date    |   supply
------------+----------------
 1970-01-01 | 72009990.49948
 2015-07-30 | 72049301.59323
 2015-07-31 | 72085493.31198
 2015-08-01 | 72113195.49948
 2015-08-02 | 72141422.68698
 2015-08-03 | 72169399.40573
 2015-08-04 | 72197877.84323
 2015-08-05 | 72225406.43698
 2015-08-06 | 72252481.90573
 2015-08-07 | 72279919.56198
(10 rows)

Time: 428.202 ms

ClickHouse, with and without projections:

WITH ether_emitted_by_date AS
    (
        SELECT
            date(block_timestamp) AS date,
            SUM(value) AS value
        FROM traces
        WHERE trace_type IN ('genesis', 'reward')
        GROUP BY DATE(block_timestamp)
    )
SELECT
    date,
    SUM(value) OVER (ORDER BY date ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / POWER(10, 18) AS supply
FROM ether_emitted_by_date
LIMIT 10
┌───────date─┬────────────supply─┐
│ 1970-01-0172009990.49948001 │
│ 2015-07-3072049301.59323001 │
│ 2015-07-3172085493.31198   │
│ 2015-08-0172113195.49948   │
│ 2015-08-0272141422.6869810 rows in set. Elapsed: 0.087 sec. Processed 18.08 million rows, 379.73 MB (207.34 million rows/s., 4.35 GB/s.)

-- add projections

ALTER TABLE traces ADD PROJECTION trace_type_projection (
                 SELECT trace_type,
                 toStartOfDay(block_timestamp) as date, sum(value) as value GROUP BY trace_type, date
                 )
ALTER TABLE traces MATERIALIZE PROJECTION trace_type_projection

-- re-run query
WITH ether_emitted_by_date AS
    (
     SELECT
         date,
         sum(value) AS value
     FROM traces
     WHERE trace_type IN ('genesis', 'reward')
     GROUP BY toStartOfDay(block_timestamp) AS date
    )
SELECT
    date,
    sum(value) OVER (ORDER BY date ASC) / power(10, 18) AS supply
FROM ether_emitted_by_date

3 rows in set. Elapsed: 0.009 sec. Processed 11.43 thousand rows, 509.00 KB (1.23 million rows/s., 54.70 MB/s.)

Total Ethereum Market Capitalisation

This is a query that has been modified from a dune.com visualization that estimates the total market capitalization of Ethereum. Here we use a fixed price of 1577.88 from CoinDesk, since our data is a snapshot with a latest date of 2023-02-14 19:34:59. In our Redshift provisioned instance this query fails as shown below (also occurs in Query UI).

SELECT
  120529053 - SUM(eb.base_fee_per_gas * et.gas) / 1e18 -- missing  ETH2 rewards for now, awaiting beacon chain data, using estimated 1600 ETH staking issuance /day for now
  + COUNT(eb.number) * 1600 /(24 * 60 * 60 / 12) AS eth_supply
FROM
  transactions et
  INNER JOIN blocks eb ON eb.number = et.block_number
WHERE
  et.block_timestamp >= '2022-09-29'
)
SELECT
  (eth_supply * 1577.88) / 1e9 AS eth_mcap
FROM
  eth_supply;

ERROR:  Numeric data overflow (addition)
DETAIL:
  -----------------------------------------------
  error:  Numeric data overflow (addition)
  code:     1058
  context:
  query:    4602282
  location:  numeric_bound.cpp:180
  process:   query10_500_4602282 [pid=31250]
  -----------------------------------------------

On our 60 core ClickHouse Cloud node, this query runs in 3.2secs.

WITH eth_supply AS
    (
        SELECT (120529053 - (SUM(eb.base_fee_per_gas * et.receipt_gas_used) / 1000000000000000000.)) + ((COUNT(eb.number) * 1600) / (((24 * 60) * 60) / 12)) AS eth_supply
        FROM transactions AS et
        INNER JOIN blocks AS eb ON eb.number = et.block_number
        WHERE et.block_timestamp >= '2022-09-29'
    )
SELECT (eth_supply * 1577.88) / 1000000000. AS eth_mcap
FROM eth_supply
┌───────────eth_mcap─┐
│ 251.42266710943835 │
└────────────────────┘

1 row in set. Elapsed: 3.220 sec. Processed 191.48 million rows, 2.30 GB (59.47 million rows/s., 713.69 MB/s.)

This value is consistent with that computed by dune.com. A full set of example queries can be found here. We welcome contributions!

Conclusion

In this blog post, we have explored how data can be moved to ClickHouse from Redshift to accelerate queries for real-time analytics. We have shown a number of approaches to loading data and keeping it in sync, and how to leverage ClickHouse for real-time analytics on top of this data. In later posts, we’ll explore this Ethereum dataset in more detail.

In the meantime, we have made this dataset available in a public ClickHouse deployment for exploration (sql.clickhouse.com) and gcs bucket gs://clickhouse_public_datasets/ethereum. You are welcome to try it by downloading a free open-source version of ClickHouse and deploying it yourself or spinning up a ClickHouse Cloud free trial. ClickHouse Cloud is a fully-managed serverless offering based on ClickHouse, where you can start building real-time applications with ease without having to worry about deploying and managing infrastructure.

Resources

We recommend the following resources with respect to Ethereum and querying this dataset.

Share this post

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...
Follow us
X imageSlack imageGitHub image
Telegram imageMeetup imageRss image