We’re all excited to share that the official managed service for ClickHouse, ClickHouse Cloud, is now generally available and production ready. Give it a spin if you’re looking to focus on using ClickHouse rather than managing it. As always, we are here to help you!
Keep reading to read about the new features in ClickHouse Cloud, upcoming events in the coming days and weeks, what is new in ClickHouse 22.11 and how projections can help you speed up queries without changing your queries.
And take a look at our holiday reading list. If you read only the most technical articles, we have you covered with this 25-page blog post about JIT compilation in ClickHouse!
By the way, if you’re reading this on our website, did you know you can receive every monthly newsletter as an email in your inbox? Sign up here.
##ClickHouse Cloud is now Generally Available
As usual, we like to deliver things quickly. So here it is, the GA release of our ClickHouse Cloud managed service, just in time for Christmas. We’ve added a number of features including SOC 2 Type II compliance, uptime status and SLAs, availability in the AWS Marketplace, a lower priced development tier, an interactive SQL console, free ClickHouse Academy training and we’ve enabled the PostgreSQL and MySQL engines, dictionaries and SQL UDFs.
If you’re just getting started with ClickHouse or need something to host your non-production development, staging and QA take a look at our new and affordable Development tier where prices can be as low as $50 per month.
Mark your calendars for the following virtual and in-person events:
6-hour ClickHouse Workshop (2 x 3 hours)
What? If you’re new to ClickHouse, this is for you. In two 3-hour sessions on two days we’ll take you through all the basics, from modeling data to query optimization
- Register here for Wednesday, Dec 14 @ 8 am PST / 5 pm CET and Thursday, Dec 15 @ 8 am PST / 5 pm CET.
- Register here for Wednesday, Jan 4 @ 9 am PST and Thursday, Jan 5 @ 9 am PST.
- Register here for Wednesday, Jan 18 @ 10 am GMT and Thursday, Jan 19 @ 10 am GMT
ClickHouse v22.12 Release Webinar
- When? Thursday, December 15 @ 9 am PST / 6 pm CET
- How do I join? Register here.
Meetup Tel Aviv
- When? Monday, January 16 @ 6 pm (register here)
- Speakers: ServiceNow, Contentsquare, CHEQ, ClickHouse
- When? Wednesday, January 18 @ 11 am (register here)
- Speakers: To be announced
Lots of new features in our regular monthly release - this month accompanied by a dedicated blog post.
- Composite time intervals You can now combine different intervals in one go, for example
INTERVAL '1 HOUR 1 MINUTE 1 SECOND'or
SELECT '2023-01-30'::Date + (INTERVAL 1 MONTH - INTERVAL 2 DAY). Can you guess the result of the second statement?
- Recursive globs in paths Use
**for recursive directory traversal in a filesystem or in S3. For example, to read all Parquet files in an S3 bucket use
- Retry keeper during inserts Long-running inserts can sometimes fail when the keeper connection to ClickHouse Keeper or ZooKeeper is temporarily lost. We have a new setting
insert_keeper_max_retriesthat enables retries for keeper operations to prevent this from failing the entire insert query.
- Hudi and DeltaLake table engines and table functions to query and ingest data from “data lakes”.
- Readonly S3 disks You can attach a MergeTree table from S3 directly in readonly mode. This is an experimental feature.
- Object (JSON) inside types The experimental Object (also known as JSON) type can now be wrapped in other types, e.g.
Query of the Month: Project all your troubles away
Today we’re going to take a look at a very useful ClickHouse feature that you’re probably not using enough: Projections.
Projections are a more convenient alternative to materialized views when all the data is from the same table. Materialized view queries can contain subqueries or dictionary lookups so the destination table will contain data from multiple sources. Like a materialized view, a projection is defined as a query, but the result of this query is saved in the same table (but hidden).
When ClickHouse receives a query for data in a table with a projection, it will first analyze whether it should use the original data or the projection data to answer the query. Whichever requires less data to be read and processed will be chosen.
As an example, consider a simple table containing metric names and values:
CREATE TABLE metrics ( timestamp DateTime64 CODEC(Delta, Default), name LowCardinality(String), value Float64 ) ENGINE = MergeTree() ORDER BY (name, timestamp)
Take a note of two optimizations to the data types:
- Instead of storing the
timestampas is we use delta encoding where only the difference between one value and the next are stored. We expect most values to be quite similar (same year, month, day, hour, etc.) so this will reduce the storage required for this column by a lot.
- Similarly, instead of storing the names of our metrics as they are, we use the
LowCardinalitydictionary encoding to store a map of names and pointers to where they occur in the data. We only expect a limited number of unique metrics so this will also reduce storage by a lot.
Notice also how we order the data by metric name first and then by timestamp. This will make queries that look for individual metrics very fast. The following query finds the latest temperature reading:
SELECT name, argMax(value, timestamp) val_max FROM metrics WHERE symbol = 'TEMP' GROUP BY symbol ┌─name───┬─val_max─┐ │ TEMP │ 179 │ └────────┴─────────┘ 1 row in set. Elapsed: 0.009 sec. Processed 73.73 thousand rows, 1.38 MB (8.01 million rows/s., 150.32 MB/s.)
Requesting the latest values for all metrics is 100x slower:
SELECT name, argMax(value, timestamp) val_max FROM metrics GROUP BY symbol 255 rows in set. Elapsed: 0.902 sec. Processed 220.01 million rows, 6.08 GB (259.38 million rows/s., 6.74 GB/s.)
To speed up this query we can add a projection that always stores the latest value for a metric
ALTER TABLE metrics ADD PROJECTION latest_values (SELECT argMax(value, timestamp) GROUP BY name)
And to make the data available immediately we can materialize the projection like this:
ALTER TABLE metrics MATERIALIZE PROJECTION latest_values SETTINGS mutations_sync = 1
Now let’s run the same query again:
SELECT name, argMax(value, timestamp) val_max FROM metrics GROUP BY symbol 255 rows in set. Elapsed: 0.007 sec. Processed 255 rows, 156.13 KB (1.42 million rows/s., 84.33 MB/s.)
Getting the latest value for all metrics is faster than getting the latest value for a single metric was! All we had to do was add a projection. No need to change our query or our table structure even one bit.
Give projections a try! We bet there are some that will be useful on your data.
Holiday Reading List
Some reading material for you:
ClickHouse Cloud is now Generally Available A very exciting time for ClickHouse users, the official ClickHouse managed service is generally available and fully production ready. Read all about it in this post.
User-defined functions in ClickHouse Cloud A refresher on UDFs from our team now that they are available in ClickHouse Cloud.
Using dictionaries to accelerate queries An in-depth explainer with examples of how to use dictionaries to speed up your queries in ClickHouse.
JIT in ClickHouse One of the most technical posts we’ve read in a while, our own Maksim Kita dives deeply into JIT compilation and how it’s implemented in ClickHouse.
Window and array functions for Git commit sequences Some really good visual explainers of window and array functions in this blog post on analyzing Git commits.
How We Use ClickHouse to Analyze Trends Across Millions of Builds BuildBuddy moved their Trends page from MySQL to ClickHouse - one query improved from taking 24 minutes (!) to 0.317s in ClickHouse!
Thanks for reading, and we’ll see you next year.
The ClickHouse Team