ClickHouse Newsletter March 2022: There’s a window function for that!

03.07.2022

Blog

Welcome to our March newsletter, a monthly roundup of the latest ClickHouse goodness. Below you will find new exciting features we released, recent meetup and webinar recordings, some neat queries using window functions and a few recommended reads.

ClickHouse v22.2

We released ClickHouse 22.2, the newest and greatest version of your favorite database. Highlights include:

  • Projections are production ready and enabled by default. They allow you to store part of a table in a subdirectory of the original table, often using a different sorting key or aggregating the original data – similar to materialized views but without all the overhead. Their usage is refreshingly simple: Queries will be automatically analyzed and redirected to the projection if less data has to be read there to produce the same result. Almost like magic!
  • Custom deduplication on insert – a new setting insert_deduplication_token allows you to specify a custom token (any string) that is used to determine if a block has already been inserted. If it has, then any subsequent blocks with the same token are discarded. Any insert statement creates a block but large inserts are divided into multiple blocks of about 1 million rows (controlled by max_insert_block_size). This makes it easier to implement exactly-once semantics in ClickHouse. Note: By default, ClickHouse will already deduplicate inserted data blocks that are identical (consist of the same rows in the same order). This new setting gives you more control over the process if you need it.
  • Default table engine – You can now set a default table engine so you no longer have to specify one when creating a table. For example:
SET default_table_engine = 'MergeTree'
CREATE TABLE table1 (x int PRIMAY KEY (x))
  • Ephemeral columns can be used at insert time to calculate other columns but are not stored themselves.
  • Text classification – Experimental functions for detecting language, character set, tonality and programming language were added. For example:
SET allow_experimental_nlp_functions = 1
SELECT detectLanguage('Ich bin ein Berliner')
// returns 'de'
dictGetOrNull(dict, attr, 1)
dictGetOrNull(dict, attr, '1')
dictGetOrNull(dict, attr, 1.0)
dictGetOrNull(dict, attr, toUInt8(1))
dictGetOrNull(dict, attr, toInt256(1))

For more details head over to the release blog post and the changelog.

Meetups & Webinars

The last few weeks have seen a number of meetups and webinars in ClickHouse land. Check out the recordings (at 2x speed if you like, we won’t judge you):

  • Virtual Meetup with Contentsquare, where we learned about how they use ClickHouse to power their experience analytics platform. It’s an impressive journey that started with moving from Elasticsearch to ClickHouse to be 11x cheaper (with 6x more data) and 10x faster! We also heard about the ClickHouse Proxy (Chproxy) project and Alexey’s favorite ClickHouse features in 2021 and 2022. Highlight: Listen in at 1:20:12 on our plans for a new JSON data type.
  • Learn about the new features we released in the v22.02 Release Webinar including upcoming performance improvements for running ClickHouse on top of S3 at 42:05 (the performance numbers are very impressive, if we may say so!).
  • The San Francisco Bay Area meetup had guest speakers from Materialize and FastNetMon.

Upcoming webinars:

Query of the Month: There’s a window function for that!

Did you know that ClickHouse has window functions? It’s a relatively recent (early 2021) feature but very powerful. Let’s go on a whirlwind tour of what’s possible that used to be much harder before. Note: We will use the UK Price Paid dataset of all property transactions in the UK in the last few decades, so feel free to reproduce the queries!

Query 1: Compare with cohort average

SELECT avg(price) OVER(PARTITION BY postcode1) AS postcode_avg, *
FROM uk_price_paid
WHERE type = 'flat'
ORDER BY price DESC

This query will show the most expensive flats / apartments ever sold in the UK but in addition will show the average prices in their postcodes as the first column. So now we can see who’s really overpaying!

Query 2: When was this property sold previously?

SELECT
lagInFrame(date) OVER (PARTITION BY postcode1, postcode2, addr1, addr2 ORDER BY date ASC) AS previous_date,
lagInFrame(price) OVER (PARTITION BY postcode1, postcode2, addr1, addr2 ORDER BY date ASC) AS previous_price,
* FROM uk_price_paid

With a time series dataset like this we will often want to see the previous record for a particular entity. We can do this with lagInFrame. If we use first_value we would get the first time the property was sold instead (the dataset only goes back to 1995).

Query 3: Sorting with window functions

You can also sort by window functions. For example, this query shows the most expensive sub-postcodes within the most expensive postcode (in the UK, postcodes have two parts):

SELECT DISTINCT
    postcode1,
    postcode2,
    avg(price) OVER (PARTITION BY postcode1, postcode2) AS avg_price
FROM uk_price_paid
WHERE type = 'flat'
ORDER BY
    avg(price) OVER (PARTITION BY postcode1) DESC,
    avg_price DESC

Note the two window functions used in the ORDER BY clause. For those of you that are familiar with London you won’t be surprised that this query comes up with the W1S postcode at the edge of Mayfair and Soho.

Reading Corner

What we’ve been reading:

  1. How we scale out our ClickHouse cluster by our friends at Contentsquare. They have a neat “ClickHouse resharding-cooker” combining clickhouse-backup and clickhouse-copier. Well worth a read if you’re running a ClickHouse cluster!
  2. Opensee: Analyzing Terabytes of Financial Data a Day With ClickHouse by our friends at financial analytics company Opensee. They’ve been using ClickHouse for years to analyze financial institutions’ risk profiles.
  3. ClickHouse 22.2 release blog post – read up on the most recent ClickHouse release and exciting new features including production-ready projections, custom deduplication and flexible memory limits.

New ClickHouse Adopters: Welcome Gigasheet. Get yourself added as well!

Thanks for reading. We’ll see you next month!
The ClickHouse Team

Photo by Fernando Venzano on Unsplash

  • newsletter

Would you like to keep in touch?

Sign up for our newsletter to stay informed on feature releases, our product roadmap, and our cloud and support offerings.

Thank you for signing up!