ClickHouse Newsletter January 2023: Better Safe Than Sorry

Happy New Year! Like many of you, we at ClickHouse took some time off and are using January to catch up and plan. To help you think about what you can do with ClickHouse in 2023, have a look at our joint webinars with dbt Labs and Grafana this month, the new features we released in version 22.12 just before the holidays, how to use encryption in ClickHouse and a reading list to get you started into the year.

By the way, if you’re reading this on our website, did you know you can receive every monthly newsletter as an email in your inbox? Sign up here.

Mark your calendars for the following virtual events:

Webinar: Using dbt to Transform Data in ClickHouse
When? Tuesday, January 24 @ 2 pm GMT / 3 pm CET
How do I join? Register here
Speakers: dbt Labs, Datricks, ClickHouse

ClickHouse v23.01 Release Webinar
When? Wednesday, January 25 @ 9 am PST / 6 pm CET
How do I join? Register here

Webinar: Real-time SQL analytics at scale: A story of open-source GitHub activity using ClickHouse and Grafana
When? Thursday, January 26 @ 6 pm CET / 9 am PST
How do I join? Register here
Speakers: Grafana, ClickHouse

Webinar: ClickHouse Cloud Onboarding
When? Thursday, January 26 @ 1 pm PST
How do I join? Register here

There was a lot of goodness in our December release. Have a look at the blog post as well:

Grace hash join A new join type that splits the to-be-joined tables into buckets and joins those buckets one by one, vastly reducing memory usage (at the cost of slower speed). Give it a try, and also take a look at all the other join algorithms ClickHouse supports here.
Async insert deduplication Asynchronous inserts are now deduplicated, so if you send the same data twice ClickHouse will only accept one.
Password complexity ClickHouse can now enforce password complexity rules for user passwords - how long a password has to be and whether it has to contain any specific character types. If you’re creating users in ClickHouse, you should probably configure this, so people don’t just set “asdf”.
GROUP BY ALL Grouping by ALL will group by every column in a SELECT statement that is not part of an aggregation. A handy shorthand, so you don’t have to specify them manually.
Numbers with underscores Sometimes it’s just easier to specify 1_000_000 rather than 1000000 (if you’re like me, you’re pausing shortly after typing the first three zeros to make sure you’re getting it right). Though in this particular case, you can also use 1e6.
Async reading from MergeTree On slower disks, especially remote stores like object storage (S3, etc.), reading data can take some time. Therefore it makes sense to send many parallel read requests to reduce the overall time it takes to gather all data for a query. New settings allow specifying how many parallel requests ClickHouse will make, allowing you to speed up some queries significantly, especially on smaller machines. This is especially useful when using ClickHouse Cloud, which is based on object storage.

Take a look at the release webinar slides and the recording, and please upgrade unless you want to stay on a Long Term Support (LTS) release. If you are using ClickHouse Cloud, you are already using the new release.

In this section, we like to explore lesser-known but potentially very useful features in ClickHouse. One such feature is built-in encryption and decryption. Let’s say you have a customer table that contains potentially sensitive information such as customer addresses:

CREATE TABLE customers
(
   id String,
   address String
)
ENGINE = MergeTree
ORDER BY id

Instead of storing the addresses in plaintext, you can encrypt them like this:

INSERT INTO customers
SELECT 1, encrypt('aes-128-cbc', '35 Highcombe, London,SE7 7HT', 'donotusethisword')

The resulting data will be unreadable:

SELECT * FROM customers FORMAT Values


('1','��kg`�\0�\0"��́Xf����ĭ\n0�Ą.\f�')

Unless you have the correct key:

SELECT id, decrypt('aes-128-cbc', address, 'donotusethisword') 
FROM customers FORMAT Values


('1','35 Highcombe, London,SE7 7HT')

Ah, but I hear you say that if certain users shouldn’t be able to see some columns, it’s better to REVOKE their access to those columns. You’re certainly right, but maybe they should have access to some values but not others, depending on whether they have access to the key. By default, if decryption fails, ClickHouse will throw an exception. If this is sometimes expected (when the user has a valid key for some values but not others), you can use tryDecrypt instead, which will return NULL for the values that cannot be decrypted and allow the query to continue.

INSERT INTO customers
SELECT 2, encrypt('aes-128-cbc', '31 Richmond Court, Ellesmere Port,CH65 9EA', 'usesomeotherword')


SELECT id, tryDecrypt('aes-128-cbc', address, 'usesomeotherword') address
FROM customers WHERE address IS NOT NULL


('2','31 Richmond Court, Ellesmere Port,CH65 9EA')

Native database encryption is a straightforward and low-cost way to improve data security. It is not as secure as using an external key management system, but those are not always available or easy to set up.

Some reading material to start the year with.

ClickHouse Release 22.12 The release blog post for our awesome pre-holiday Christmas release. If you’re using joins, you should look into the new grace hash join!
Announcing a new official ClickHouse Kafka Connector The official Kafka Connect Sink for ClickHouse with exactly-once delivery semantics is now in beta. Give it a try!
Optimizing ClickHouse with Schemas and Codecs With a bit of work on your table definitions, you can often significantly reduce storage and speed up queries at the same time. Have a look at how we did it here.
Super charging your ClickHouse queries Another way to increase query speed is to use projections and specify good sorting keys (primary indexes). Walk through an example with us here.
System Tables and a window into the internals of ClickHouse Your first port of call when monitoring ClickHouse should be the system tables. There are a lot of them, and they give you detailed information about almost everything ClickHouse is doing. Read our primer here.
Essential monitoring queries: Part 1 (INSERT) and Part 2 (SELECT). These articles will help you get started monitoring the two most important query types, inserts and selects.
Building an Observability Solution with ClickHouse in 2023 - Part 1 - Logs More and more companies are choosing ClickHouse to store their observability data. In this first part of a longer series, we show you how to use ClickHouse for logs.
Extracting, converting, and querying data in local files using clickhouse-local Don’t want to run a database but still want to use ClickHouse? Use headless ClickHouse (clickhouse-local)! It’s one of the best tools out there to query data anywhere, including in local files. We think every data engineer should have it installed.
Generating random data in ClickHouse Ever needed to generate some random data? We had to many times, so ClickHouse has a lot of functionality built in to help with that. Have a read and outsource your data generation to ClickHouse.
HIFI’s migration from BigQuery to ClickHouse HIFI is providing financial analytics to music creators worldwide. Initially using BigQuery, they were unhappy with its per-query pricing, so they switched to ClickHouse.