ClickHouse Newsletter March 2022: There’s a window function for that!
Welcome to our March newsletter, a monthly roundup of the latest ClickHouse goodness. Below you will find new exciting features we released, recent meetup and webinar recordings, some neat queries using window functions and a few recommended reads.
We released ClickHouse 22.2, the newest and greatest version of your favorite database. Highlights include:
- Projections are production ready and enabled by default. They allow you to store part of a table in a subdirectory of the original table, often using a different sorting key or aggregating the original data – similar to materialized views but without all the overhead. Their usage is refreshingly simple: Queries will be automatically analyzed and redirected to the projection if less data has to be read there to produce the same result. Almost like magic!
- Custom deduplication on insert – a new setting
insert_deduplication_tokenallows you to specify a custom token (any string) that is used to determine if a block has already been inserted. If it has, then any subsequent blocks with the same token are discarded. Any insert statement creates a block but large inserts are divided into multiple blocks of about 1 million rows (controlled by
max_insert_block_size). This makes it easier to implement exactly-once semantics in ClickHouse. Note: By default, ClickHouse will already deduplicate inserted data blocks that are identical (consist of the same rows in the same order). This new setting gives you more control over the process if you need it.
- Default table engine – You can now set a default table engine so you no longer have to specify one when creating a table. For example:
SET default_table_engine = 'MergeTree' CREATE TABLE table1 (x int PRIMAY KEY (x))
SET allow_experimental_nlp_functions = 1 SELECT detectLanguage('Ich bin ein Berliner') // returns 'de'
- Implicit casting in dictGet* functions, now you can simplify your queries! For example, when you have a dictionary with UInt32 as the primary key you can access it using any of these:
dictGetOrNull(dict, attr, 1) dictGetOrNull(dict, attr, '1') dictGetOrNull(dict, attr, 1.0) dictGetOrNull(dict, attr, toUInt8(1)) dictGetOrNull(dict, attr, toInt256(1))
Meetups & Webinars
The last few weeks have seen a number of meetups and webinars in ClickHouse land. Check out the recordings (at 2x speed if you like, we won’t judge you):
- Virtual Meetup with Contentsquare, where we learned about how they use ClickHouse to power their experience analytics platform. It’s an impressive journey that started with moving from Elasticsearch to ClickHouse to be 11x cheaper (with 6x more data) and 10x faster! We also heard about the ClickHouse Proxy (Chproxy) project and Alexey’s favorite ClickHouse features in 2021 and 2022. Highlight: Listen in at 1:20:12 on our plans for a new JSON data type.
- Learn about the new features we released in the v22.02 Release Webinar including upcoming performance improvements for running ClickHouse on top of S3 at 42:05 (the performance numbers are very impressive, if we may say so!).
- The San Francisco Bay Area meetup had guest speakers from Materialize and FastNetMon.
- ClickHouse v22.3 Release Webinar
When? Thursday, March 17 @ 8 am PDT / 4 pm GMT
Add to your calendar
Query of the Month: There’s a window function for that!
Did you know that ClickHouse has window functions? It’s a relatively recent (early 2021) feature but very powerful. Let’s go on a whirlwind tour of what’s possible that used to be much harder before. Note: We will use the UK Price Paid dataset of all property transactions in the UK in the last few decades, so feel free to reproduce the queries!
Query 1: Compare with cohort average
SELECT avg(price) OVER(PARTITION BY postcode1) AS postcode_avg, * FROM uk_price_paid WHERE type = 'flat' ORDER BY price DESC
This query will show the most expensive flats / apartments ever sold in the UK but in addition will show the average prices in their postcodes as the first column. So now we can see who’s really overpaying!
Query 2: When was this property sold previously?
SELECT lagInFrame(date) OVER (PARTITION BY postcode1, postcode2, addr1, addr2 ORDER BY date ASC) AS previous_date, lagInFrame(price) OVER (PARTITION BY postcode1, postcode2, addr1, addr2 ORDER BY date ASC) AS previous_price, * FROM uk_price_paid
With a time series dataset like this we will often want to see the previous record for a particular entity. We can do this with
lagInFrame. If we use
first_value we would get the first time the property was sold instead (the dataset only goes back to 1995).
Query 3: Sorting with window functions
You can also sort by window functions. For example, this query shows the most expensive sub-postcodes within the most expensive postcode (in the UK, postcodes have two parts):
SELECT DISTINCT postcode1, postcode2, avg(price) OVER (PARTITION BY postcode1, postcode2) AS avg_price FROM uk_price_paid WHERE type = 'flat' ORDER BY avg(price) OVER (PARTITION BY postcode1) DESC, avg_price DESC
Note the two window functions used in the
ORDER BY clause. For those of you that are familiar with London you won’t be surprised that this query comes up with the W1S postcode at the edge of Mayfair and Soho.
What we’ve been reading:
- How we scale out our ClickHouse cluster by our friends at Contentsquare. They have a neat “ClickHouse resharding-cooker” combining
clickhouse-copier. Well worth a read if you’re running a ClickHouse cluster!
- Opensee: Analyzing Terabytes of Financial Data a Day With ClickHouse by our friends at financial analytics company Opensee. They’ve been using ClickHouse for years to analyze financial institutions’ risk profiles.
- ClickHouse 22.2 release blog post – read up on the most recent ClickHouse release and exciting new features including production-ready projections, custom deduplication and flexible memory limits.
Thanks for reading. We’ll see you next month!
The ClickHouse Team
Would you like to keep in touch?
Sign up for our newsletter to stay informed on feature releases, our product roadmap, and our cloud and support offerings.