Numeric hints when the last column is numeric and there's enough horizontal space
Since ClickHouse 24.2, when a query returns a single numeric column whose value is bigger than 1 million, a human-readable quantity is displayed as a comment alongside the value itself.
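For example, the following query produces the output shown below:
SELECT arrayJoin([277282240574747]);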
┌─arrayJoin([277282240574747])─┐
│              277282240574747 │ -- 277.28 trillion
└──────────────────────────────┘
But if multiple values were returned, the numeric hint wasn't shown:
SELECT arrayJoin([
    277282240574747,
    2772822,
    1543210
]);
┌─arrayJoin([2⋯, 1543210])─┐
│          277282240574747 │
│                  2772822 │
│                  1543210 │
└──────────────────────────┘
Now we get the numeric hint for every row, as long as there's enough horizontal space:
┌─arrayJoin([2⋯, 1543210])─┐
│          277282240574747 │ -- 277.28 trillion
│                  2772822 │ -- 2.77 million
│                  1543210 │ -- 1.54 million
└──────────────────────────┘
Faster distinctJSONPaths
The distinctJSONPaths function returns the set of unique JSON paths present in a JSON column.
Starting with ClickHouse 25.12, this function is significantly optimized. Instead of scanning and decoding the full internal JSON column representation, ClickHouse now relies on metadata that is already maintained at data-part level.
Each data part tracks the set of JSON paths it contains. When distinctJSONPaths is executed, ClickHouse simply reads and merges this metadata across parts, without accessing the JSON column data itself.
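As a quick illustration of what the function returns, here's a minimal sketch (the table name and values are hypothetical):
CREATE TABLE events
(
    json JSON
)
ENGINE = MergeTree
ORDER BY ();

INSERT INTO events VALUES ('{"a": 1, "b": {"c": 2}}'), ('{"a": 3, "d": "x"}');

-- Each path appears once, regardless of how many rows contain it
SELECT distinctJSONPaths(json) FROM events;
-- ['a','b.c','d']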
Let’s see the performance impact using the BlueSky dataset. The following query ingests 10 million rows into a table:
CREATE TABLE bluesky
ENGINE = MergeTree
ORDER BY ()
AS
SELECT *
FROM s3(
    'https://clickhouse-public-datasets.s3.amazonaws.com' ||
    '/bluesky/file_{0001..0010}.json.gz',
    'JSONAsObject'
)
SETTINGS
    input_format_allow_errors_num = 100,
    input_format_allow_errors_ratio = 1;
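Once the ingest finishes, a quick sanity check confirms the row count (the exact number can vary slightly because of the error-tolerance settings above):
SELECT count() FROM bluesky;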
We’re going to compute the distinct JSON paths in this dataset by running the following query:
SELECT length(distinctJSONPaths(json))
FROM bluesky;
If we run this query three times in ClickHouse 25.11:
┌─length(disti⋯aths(json))─┐
│ 224 │
└──────────────────────────┘
1 row in set. Elapsed: 4.978 sec. Processed 9.97 million rows, 21.21 GB (2.00 million rows/s., 4.26 GB/s.)
Peak memory usage: 1.24 GiB.
1 row in set. Elapsed: 4.563 sec. Processed 9.99 million rows, 21.26 GB (2.19 million rows/s., 4.66 GB/s.)
Peak memory usage: 1.26 GiB.
1 row in set. Elapsed: 4.926 sec. Processed 9.94 million rows, 21.14 GB (2.02 million rows/s., 4.29 GB/s.)
Peak memory usage: 1.24 GiB.
And then three times in ClickHouse 25.12:
┌─length(disti⋯aths(json))─┐
│ 224 │
└──────────────────────────┘
1 row in set. Elapsed: 0.082 sec.
1 row in set. Elapsed: 0.099 sec.
1 row in set. Elapsed: 0.100 sec.
Our fastest result goes from 4.563 seconds to 82 milliseconds, an improvement of more than 50 times.
Optimize DISTINCT transform for LowCardinality Columns
LowCardinality columns store values using dictionary encoding: each distinct value is stored once in a dictionary, and rows reference it by a small integer ID.
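For reference, here's a minimal sketch of a table with a dictionary-encoded column (the table and columns are hypothetical):
CREATE TABLE visits
(
    ts      DateTime,
    country LowCardinality(String) -- each distinct country is stored once in a dictionary
)
ENGINE = MergeTree
ORDER BY ts;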
Starting with ClickHouse 25.12, DISTINCT queries on LowCardinality columns are optimized to exploit this dictionary representation.
Instead of scanning all rows and comparing full values, ClickHouse operates directly on the dictionary IDs:
- It tracks which dictionary entries have already been seen
- Each dictionary value is returned at most once
- The scan can stop early once all dictionary entries have been encountered
As a result, the cost of DISTINCT depends primarily on the dictionary size, not the number of rows.
Let’s have a look at the effect with the UK property prices dataset. This dataset has the following columns with the LowCardinality type:
postcode1 LowCardinality(String)
postcode2 LowCardinality(String)
street LowCardinality(String)
locality LowCardinality(String)
town LowCardinality(String)
district LowCardinality(String)
county LowCardinality(String)
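The unique-value counts shown in the results table below can be reproduced with uniqExact, for example:
SELECT uniqExact(street) FROM uk.uk_price_paid;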
We’re going to run the following query three times against each column:
SELECT DISTINCT postcode1 -- repeated for each of the columns listed above
FROM uk.uk_price_paid
FORMAT Null;
For each field, we record the fastest of the three runs on 25.11 and 25.12. You can see the results in the table below:
| Field     | Unique values | 25.11     | 25.12     |
|-----------|---------------|-----------|-----------|
| postcode1 | 2,392         | 0.017 sec | 0.033 sec |
| postcode2 | 4,005         | 0.121 sec | 0.043 sec |
| street    | 334,642       | 0.186 sec | 0.084 sec |
| locality  | 24,001        | 0.149 sec | 0.070 sec |
| town      | 1,172         | 0.141 sec | 0.046 sec |
| district  | 467           | 0.144 sec | 0.046 sec |
| county    | 132           | 0.141 sec | 0.023 sec |
We can see that the optimization results in a faster query for every column except postcode1.
The sorting key for this table is (postcode1, postcode2, addr1, addr2), which explains why the optimization isn’t as effective on the postcode1 column.
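You can verify the sorting key with a query against system.tables:
SELECT sorting_key
FROM system.tables
WHERE database = 'uk' AND name = 'uk_price_paid';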
The "early exit" optimization assumes you'll discover all unique values early. But when doing DISTINCT on postcode1, the unique values are spread from beginning to end by definition. We have to scan (almost) everything.
Therefore, for postcode1, we pay the overhead of tracking dictionary indices while still scanning nearly the full column, making it slower than the previous approach of scanning all rows.
Filter pushdown for SEMI JOIN
A semi join filters one rowset based on whether its rows have matching rows in another rowset.
A LEFT SEMI JOIN returns rows from the left table where matching rows exist in the right table. A RIGHT SEMI JOIN returns rows from the right table where matching rows exist in the left table.
You can read a more detailed explanation in part 1 of our series on ClickHouse joins.
Let’s have a look at an example of each type of semi join, starting with LEFT SEMI JOIN:
-- Return users who have orders
SELECT * FROM users
LEFT SEMI JOIN orders ON users.id = orders.user_id;
And RIGHT SEMI JOIN:
-- Return orders that have matching users
SELECT * FROM users
RIGHT SEMI JOIN orders ON users.id = orders.user_id;
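Unlike a regular join, a semi join returns each qualifying row at most once, no matter how many matches exist on the other side. Here's a minimal sketch using the values table function (the data is made up):
-- Alice has two orders, but is returned only once
SELECT users.id, users.name
FROM values('id UInt32, name String', (1, 'Alice'), (2, 'Bob')) AS users
LEFT SEMI JOIN values('user_id UInt32', 1, 1) AS orders
    ON users.id = orders.user_id;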
ClickHouse 25.12 optimizes SEMI JOIN queries by pushing filters down earlier in query execution. This means that when a SEMI JOIN has filter conditions, the optimizer can now push those filters down to the table scan level using "equivalent column sets": columns that are known to be equal through the join condition.
Without filter pushdown, you might scan the entire right (or left) table before filtering. With this optimization, you filter first, then check existence against a smaller dataset.
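Conceptually, the rewrite looks something like this (a hypothetical sketch with made-up tables, not actual optimizer output):
-- Before: the filter is applied only to the left table's rows
SELECT * FROM t1
LEFT SEMI JOIN t2 ON t1.k = t2.k
WHERE t1.k = 42;

-- After: since the ON clause asserts t1.k = t2.k, the filter
-- also holds for t2, so both scans can be pre-filtered
SELECT * FROM (SELECT * FROM t1 WHERE k = 42) AS t1
LEFT SEMI JOIN (SELECT * FROM t2 WHERE k = 42) AS t2
    ON t1.k = t2.k;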
Let’s have a look at how this optimization works with help from the UK property prices dataset. The following query finds 2023 sales where the property also sold in 2022:
SELECT DISTINCT p1.street, p1.town, p1.postcode1, p1.date, p1.price
FROM uk_price_paid p1
LEFT SEMI JOIN uk_price_paid p2
    ON p1.addr1 = p2.addr1
    AND p1.street = p2.street
    AND p1.postcode1 = p2.postcode1
    AND p2.date BETWEEN '2022-01-01' AND '2022-12-31'
WHERE p1.date BETWEEN '2023-01-01' AND '2023-12-31'
ORDER BY p1.price DESC
LIMIT 10;
┌─street────────────┬─town─────────┬─postcode1─┬───────date─┬────price─┐
│ MOULSECOOMB WAY   │ BRIGHTON     │ BN2       │ 2023-05-03 │ 97483948 │
│ GLEBE PLACE       │ LONDON       │ SW3       │ 2023-10-18 │ 68250000 │
│ THREE QUEENS LANE │ BRISTOL      │ BS1       │ 2023-12-28 │ 55859292 │
│ MARGATE ROAD      │ BROADSTAIRS  │ CT10      │ 2023-04-20 │ 54077847 │
│ BATH ROAD         │ WEST DRAYTON │ UB7       │ 2023-10-20 │ 54000000 │
│ PARK STREET       │ LONDON       │ W1K       │ 2023-09-07 │ 52625000 │
│ LINKS ROAD        │ MORPETH      │ NE65      │ 2023-02-28 │ 50090677 │
│ BILTON WAY        │ ENFIELD      │ EN3       │ 2023-08-18 │ 50017255 │
│ OLYMPIC WAY       │ WEMBLEY      │ HA9       │ 2023-08-04 │ 43096020 │
│ BLACKACRE ROAD    │ IPSWICH      │ IP6       │ 2023-11-14 │ 36990000 │
└───────────────────┴──────────────┴───────────┴────────────┴──────────┘
If we run this query against 25.11:
10 rows in set. Elapsed: 1.134 sec. Processed 59.22 million rows, 1.22 GB (52.22 million rows/s., 1.08 GB/s.)
10 rows in set. Elapsed: 1.295 sec. Processed 57.69 million rows, 1.18 GB (44.56 million rows/s., 913.69 MB/s.)
10 rows in set. Elapsed: 1.320 sec. Processed 60.90 million rows, 1.26 GB (46.13 million rows/s., 952.93 MB/s.)
And against 25.12:
10 rows in set. Elapsed: 0.818 sec. Processed 60.18 million rows, 940.56 MB (73.55 million rows/s., 1.15 GB/s.)
10 rows in set. Elapsed: 0.800 sec. Processed 60.90 million rows, 953.67 MB (76.12 million rows/s., 1.19 GB/s.)
10 rows in set. Elapsed: 0.741 sec. Processed 55.61 million rows, 856.55 MB (75.02 million rows/s., 1.16 GB/s.)
If we take the quickest run for each version, the query goes from 1.134 seconds to 0.741 seconds, about 35% faster in ClickHouse 25.12. We can also see that around 30% less data is read from disk.