Going to re:Invent this December? Come party with us and The Chainsmokers ->->

Blog / Engineering

Using Aggregate Combinators in ClickHouse

author avatar
Denys Golotiuk
Feb 8, 2023

Aggregate combinators v02.png

ClickHouse supports not only standard aggregate functions but also a lot of more advanced ones to cover most analytical use cases. Along with aggregate functions, ClickHouse provides aggregate combinators, which are a powerful extension to the querying capabilities and can address a massive number of requirements.

Combinators allow extending and mixing aggregations to address a wide range of data structures. This capability will enable us to adapt queries instead of tables to answer even the most complex questions.

In this blog post, we explore Aggregate Combinators and how the can potentially simplify your queries and avoid the need to make structural changes to your data.

How to use combinators

To use a combinator, we have to do two things. First, choose an aggregate function we want to use; let's say we want a sum() function. Second, pick a combinator needed for our case; let's say we need an If combinator. To use this in a query, we add the combinator to the function name:

SELECT sumIf(...)

An even more useful feature is that we can combine any number of combinators in a single function:

SELECT sumArrayIf(...)

Here, we've combined the sum() function with the Array and If combinators:

sumArrayIf.png

This particular example would allow us to conditionally sum the contents of an array column.

Let's explore some practical cases where combinators can be used.

Adding conditions to aggregations

Sometimes, we need to aggregate data based on specific conditions. Instead of using a WHERE clause for this, we can use If combinator and specify conditions as a last argument of the combined function:

how sumIf works

Suppose we have a table with user payments of the following structure (populated with a sample data):

CREATE TABLE payments ( `total_amount` Float, `status` ENUM('declined', 'confirmed'), `create_time` DateTime, `confirm_time` DateTime ) ENGINE = MergeTree ORDER BY (status, create_time)

Let's say we want to get the total amount spent, but only when the payment was confirmed i.e. status="confirmed":

SELECT sumIf(total_amount, status = 'confirmed') FROM payments ┌─sumIf(total_amount, equals(status, 'declined'))─┐ │ 10780.18000793457 │ └─────────────────────────────────────────────────┘

We can use the same syntax for the condition as for WHERE clauses. Let's get the total amount of confirmed payments, but when confirm_time is later than create_time by 1 minute:

SELECT sumIf(total_amount, (status = 'confirmed') AND (confirm_time > (create_time + toIntervalMinute(1)))) AS confirmed_and_checked FROM payments ┌─confirmed_and_checked─┐ │ 11195.98991394043 │ └───────────────────────┘

The principal advantage of using the conditional If, over a standard WHERE clause, is the ability to compute multiple sums for different clauses. We can also use any available aggregate function with combinators, like countIf(), avgIf() or quantileIf() - any. Combing these capabilities we can aggregate on multiple conditions and functions within a single request:

SELECT countIf((status = 'confirmed') AND (confirm_time > (create_time + toIntervalMinute(1)))) AS num_confirmed_checked, sumIf(total_amount, (status = 'confirmed') AND (confirm_time > (create_time + toIntervalMinute(1)))) AS confirmed_checked_amount, countIf(status = 'declined') AS num_declined, sumIf(total_amount, status = 'declined') AS dec_amount, avgIf(total_amount, status = 'declined') AS dec_average FROM payments ┌─num_confirmed_checked─┬─confirmed_checked_amount─┬─num_declined─┬────────dec_amount─┬───────dec_average─┐ │ 39 │ 11195.98991394043 │ 50 │ 10780.18000793457 │ 215.6036001586914 │ └───────────────────────┴──────────────────────────┴──────────────┴───────────────────┴───────────────────┘

Aggregating on unique entries only

It's a common case to calculate the number of unique entries. ClickHouse has several ways to do this using either COUNT(DISTINCT col) (the same as uniqExact) or theuniq() when an estimated (but faster) value is sufficient. Still, we might want to have unique values from a column used in different aggregate functions. The Distinct combinator can be used for this:

Distinct combinator

Once we add Distinct to the aggregate function, it will ignore repeated values:

SELECT countDistinct(toHour(create_time)) AS hours, avgDistinct(toHour(create_time)) AS avg_hour, avg(toHour(create_time)) AS avg_hour_all FROM payments ┌─hours─┬─avg_hour─┬─avg_hour_all─┐ │ 2 │ 13.5 │ 13.74 │ └───────┴──────────┴──────────────┘

Here, avg_hour will be calculated based on the two distinct values only, while avg_hour_all will be calculated based on all 100 records in the table.

Combining Distinct and If

As combinators can be combined together, we can use both previous combinators with an avgDistinctIf function to address more advanced logic:

SELECT avgDistinctIf(toHour(create_time), total_amount > 400) AS avg_hour FROM payments ┌─avg_hour─┐ │ 13 │ └──────────┘

This will calculate the average on distinct hour values for records with a total_amount value of more than 400.

Splitting data into groups before aggregating

Instead of min/max analysis, we might want to split our data into groups and calculate figures for each group separately. This can be solved using the Resample combinator.

It takes a column, range (start/stop), and a step that you want to split data on. It then returns an aggregate value for each group:

Resample combinator

Suppose we want to split our payments table data based on the total_amount from 0 (which is the minimum) to 500 (which is the maximum) with a step of 100. Then, we want to know how many entries there are in each group as well as the groups average total:

SELECT countResample(0, 500, 100)(toInt16(total_amount)) AS group_entries, avgResample(0, 500, 100)(total_amount, toInt16(total_amount)) AS group_totals FROM payments FORMAT Vertical Row 1: ────── group_entries: [21,20,24,31,4] group_totals: [50.21238123802912,157.32600135803222,246.1433334350586,356.2583834740423,415.2425003051758]

Here, the countResample() function counts the number of entries in each group, and an avgResample() function calculates an average of the total_amount for each group. Resample combinator accepts column name to split based on as a last argument to the combined function.

Note that the countResample() function has only one argument (since count() doesn't require arguments at all) and avgResample() has two arguments (the first one is the column to calculate average values for). Finally, we had to use toInt16 to convert total_amount to an integer since a Resample combinator requires this.

To get the Resample() combinators output in a table layout, we can use arrayZip() and arrayJoin() functions:

SELECT round(tp.2, 2) AS avg_total, tp.1 AS entries FROM ( SELECT arrayJoin(arrayZip(countResample(0, 500, 100)(toInt16(total_amount)), avgResample(0, 500, 100)(total_amount, toInt16(total_amount)))) AS tp FROM payments ) ┌─avg_total─┬─entries─┐ │ 50.21 │ 21 │ │ 157.33 │ 20 │ │ 246.14 │ 24 │ │ 356.26 │ 31 │ │ 415.24 │ 4 │ └───────────┴─────────┘

Here, we combine corresponding values from 2 arrays into tuples and unfold the resulting array into a table using an arrayJoin() function:

Controlling aggregate values for empty results

Aggregate functions react differently to cases when the resulting set contains no data. For example, count() will return 0 while avg() will produce a nan value.

We can control this behaviour using the OrDefault() and OrNull() combinators. Both changes a returned value of an aggregate function used in case of an empty dataset:

  • OrDefault() will return a default value of the function instead of nan,
  • OrNull() will return NULL (and will also change the return type to Nullable).

Consider the following example:

SELECT count(), countOrNull(), avg(total_amount), avgOrDefault(total_amount), sumOrNull(total_amount) FROM payments WHERE total_amount > 1000 ┌─count()─┬─countOrNull()─┬─avg(total_amount)─┬─avgOrDefault(total_amount)─┬─sumOrNull(total_amount)─┐ │ 0 │ ᴺᵁᴸᴸ │ nan │ 0 │ ᴺᵁᴸᴸ │ └─────────┴───────────────┴───────────────────┴────────────────────────────┴─────────────────────────┘

As we can see in the first column, zero rows were returned. Note how countOrNull() will return NULL instead of 0, and avgOrDefault() gives 0 instead of nan.

Using with other combinators

As well as all other combinators, orNull() and orDefault() can be used together with different combinators for a more advanced logic:

SELECT sumIfOrNull(total_amount, status = 'declined') AS declined, countIfDistinctOrNull(total_amount, status = 'confirmed') AS confirmed_distinct FROM payments WHERE total_amount > 420 ┌─declined─┬─confirmed_distinct─┐ │ ᴺᵁᴸᴸ │ 1 │ └──────────┴────────────────────┘

We've used the sumIfOrNull() combined function to calculate only declined payments and return NULL on an empty set. The countIfDistinctOrNull() function counts distinct total_amount values but only for rows meeting the specified condition.

Aggregating arrays

ClickHouse's Array type is popular among its users because it brings a lot of flexibility to table structures. To operate with Array columns efficiently, ClickHouse provides a set of array functions. To make aggregations on Array types easy, ClickHouse provides the Array() combinators. These apply a given aggregate function on all values from an array column instead of the array itself:

Array combinator

Suppose we have the following table (populated with a sample data):

CREATE TABLE article_reads ( `time` DateTime, `article_id` UInt32, `sections` Array(UInt16), `times` Array(UInt16), `user_id` UInt32 ) ENGINE = MergeTree ORDER BY (article_id, time) ┌────────────────time─┬─article_id─┬─sections─────────────────────┬─times────────────────────────────────┬─user_id─┐ │ 2023-01-18 23:44:17 │ 10 │ [16,18,7,21,23,22,11,19,9,8] │ [82,96,294,253,292,66,44,256,222,86] │ 424 │ │ 2023-01-20 22:53:00 │ 10 │ [21,8] │ [30,176] │ 271 │ │ 2023-01-21 03:05:19 │ 10 │ [24,11,23,9] │ [178,177,172,105] │ 536 │ ...

This table is used to store article reading data for each section of the article. When a user reads an article, we save the read sections to the sections array column and the associated reading times to the times column:

Let's use the uniqArray() function to calculate a number of unique sections read for each article together with avgArray() to get an average time per section:

SELECT article_id, uniqArray(sections) sections_read, round(avgArray(times)) time_per_section FROM article_reads GROUP BY article_id ┌─article_id─┬─sections_read─┬─time_per_section─┐ │ 14 │ 22 │ 175 │ │ 18 │ 25 │ 159 │ ... │ 17 │ 25 │ 170 │ └────────────┴───────────────┴──────────────────┘

We can get the min and max read time across all articles using minArray() and maxArray() functions:

SELECT minArray(times), maxArray(times) FROM article_reads ┌─minArray(times)─┬─maxArray(times)─┐ │ 30 │ 300 │ └─────────────────┴─────────────────┘

We can also get a list of read sections for each article using the groupUniqArray() function combined with an Array() combinator:

SELECT article_id, groupUniqArrayArray(sections) FROM article_reads GROUP BY article_id ┌─article_id─┬─groupUniqArrayArray(sections)───────────────────────────────────────┐ │ 14 │ [16,13,24,8,10,3,9,19,23,14,7,25,2,1,21,18,12,17,22,4,6,5] │ ... │ 17 │ [16,11,13,8,24,10,3,9,23,19,14,7,25,20,2,1,15,21,6,5,12,22,4,17,18] │ └────────────┴─────────────────────────────────────────────────────────────────────┘

Another popular function is any(), which returns any column value under aggregation, and can also be combined Array:

SELECT article_id, anyArray(sections) FROM article_reads GROUP BY article_id ┌─article_id─┬─anyArray(sections)─┐ │ 14 │ 19 │ │ 18 │ 6 │ │ 19 │ 25 │ │ 15 │ 15 │ │ 20 │ 1 │ │ 16 │ 23 │ │ 12 │ 16 │ │ 11 │ 2 │ │ 10 │ 16 │ │ 13 │ 9 │ │ 17 │ 20 │ └────────────┴────────────────────┘

Using Array with other combinators

The Array combinator can be used together with any other combinator:

SELECT article_id, sumArrayIfOrNull(times, length(sections) > 8) FROM article_reads GROUP BY article_id ┌─article_id─┬─sumArrayOrNullIf(times, greater(length(sections), 8))─┐ │ 14 │ 4779 │ │ 18 │ 3001 │ │ 19 │ NULL │ ... │ 17 │ 14424 │ └────────────┴───────────────────────────────────────────────────────┘

We have used the sumArrayIfOrNull() function to calculate the total times for articles where more than eight sections were read. Note that NULL is returned for articles with zero cases of more than eight sections read because we've also used the OrNull() combinator.

We can address even more advanced cases if we use array functions along with combinators:

SELECT article_id, countArray(arrayFilter(x -> (x > 120), times)) AS sections_engaged FROM article_reads GROUP BY article_id ┌─article_id─┬─sections_engaged─┐ │ 14 │ 26 │ │ 18 │ 44 │ ... │ 17 │ 98 │ └────────────┴──────────────────┘

Here, we first filter the times array using an arrayFilter function to remove all values under 120 seconds. Then, we use countArray to calculate filtered times (which means engaged reads in our case) for each article.

Aggregating maps

Another powerful type available in ClickHouse is the Map. Like arrays, we can use Map() combinator to apply aggregations to this type.

Suppose we have the following table with a Map column type:

CREATE TABLE page_loads ( `time` DateTime, `url` String, `params` Map(String, UInt32) ) ENGINE = MergeTree ORDER BY (url, time) ┌────────────────time─┬─url─┬─params───────────────────────────────┐ │ 2023-01-25 17:44:26 │ / │ {'load_speed':100,'scroll_depth':59} │ │ 2023-01-25 17:44:37 │ / │ {'load_speed':400,'scroll_depth':12} │ └─────────────────────┴─────┴──────────────────────────────────────┘

We can use a Map() combinator for the sum() and avg() functions to get total loading times and average scroll depth:

SELECT sumMap(params)['load_speed'] AS total_load_time, avgMap(params)['scroll_depth'] AS average_scroll FROM page_loads ┌─total_load_time─┬─average_scroll─┐ │ 500 │ 35.5 │ └─────────────────┴────────────────┘

The Map() combinator can also be used with other combinators:

SELECT sumMapIf(params, url = '/404')['scroll_depth'] AS average_scroll FROM page_loads

Aggregating respective array values

Another way to work with array columns is to aggregate corresponding values from two arrays. This results in another array. This can be used for vectorized data (like vectors or matrices) and is implemented via the ForEach() combinator:

ForEach combinator

Suppose we have the following table with vectors:

SELECT * FROM vectors ┌─title──┬─coordinates─┐ │ first │ [1,2,3] │ │ second │ [2,2,2] │ │ third │ [0,2,1] │ └────────┴─────────────┘

To calculate the average coordinates array (vector), we can use an avgForEach() combined function:

SELECT avgForEach(coordinates) FROM vectors ┌─avgForEach(coordinates)─┐ │ [1,2,2] │ └─────────────────────────┘

This will ask ClickHouse to calculate an average value for the first element of all coordinates arrays and put it into the first element of the resulting array. Then repeat the same for the second and third elements.

And, of course, use with other combinators is also supported:

SELECT avgForEachIf(coordinates, title != 'second') FROM vectors ┌─avgForEachIf(coordinates, notEquals(title, 'second'))─┐ │ [0.5,2,2] │ └───────────────────────────────────────────────────────┘

Working with aggregation states

ClickHouse allows working with intermediate aggregation states instead of resulting values. Let's say we need to count unique values in our case, but we don't want to save the values themselves (because it takes space). In this case, we can use a State() combinator for the uniq() function to save the intermediate aggregation state and then use a Merge() combinator to calculate actual value:

SELECT uniqMerge(u) FROM ( SELECT uniqState(number) AS u FROM numbers(5) UNION ALL SELECT uniqState(number + 1) AS u FROM numbers(5) ) ┌─uniqMerge(u)─┐ │ 6 │ └──────────────┘

Here, the first nested query will return the state for the unique count of 1...5 numbers. The second nested query returns the same for 2...6 numbers. The parent query then uses the uniqMerge() function to merge our states and get a count of all unique numbers we saw:

uniqState() and uniqMerge() examples

Why do we want to do this? Simply because aggregate states take much less space than the original data. This is particulartly important when we want to store this state on disk. For example, uniqState() data takes 15 times less space than 1 million integer numbers:

SELECT table, formatReadableSize(total_bytes) AS size FROM system.tables WHERE table LIKE 'numbers%' ┌─table─────────┬─size───────┐ │ numbers │ 3.82 MiB │ <- we saved 1 million ints here │ numbers_state │ 245.62 KiB │ <- we save uniqState for 1m ints here └───────────────┴────────────┘

ClickHouse provides an AggregatingMergeTree table engine for storing aggregation states and automatically merging them on the primary key. Let's create a table to store aggregated data for daily payments from our previous examples:

CREATE TABLE payments_totals ( `date` Date, `total_amount` AggregateFunction(sum, Float) ) ENGINE = AggregatingMergeTree ORDER BY date

We've used the AggregateFunction type to let ClickHouse know we're going to store aggregated total states instead of scalar values. On insert, we need to use the sumState function to insert the aggregate state:

INSERT INTO payments_totals SELECT date(create_time) AS date, sumState(total_amount) FROM payments WHERE status = 'confirmed' GROUP BY date

Finally, we need to use the sumMerge() function to fetch the resulting values:

┌─sumMerge(total_amount)─┐ │ 12033.219916582108 │ └────────────────────────┘

Note that ClickHouse provides an easy way to use aggregated table engines based on materialized views. ClickHouse also provides a SimpleState combinator as an optimized version that can be used with some aggregate functions (like 'sum' or 'min').

Summary

Aggregation function combinators bring almost limitless possibilities to analytical querying on top of any data structure in ClickHouse. We can add conditions to aggregations, apply functions to array elements or get intermediate states to store data in aggregated form but still available for querying.

Share this post

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...
Follow us
Twitter imageSlack imageGitHub image
Telegram imageMeetup imageRss image