ClickHouse Newsletter May 2022: Explain Statement – Query Optimization

05.09.2022

Blog

A warm welcome to you all. It has been a busy few weeks at the ClickHouse office in Amsterdam – we have launched the private preview phase of our ClickHouse Cloud service! In the coming weeks and months the brave users that have joined this preview will help us find and fix all the bugs and tiny details we want to get right before we open it up to the world later this year. If you’re interested in a serverless managed service for ClickHouse, join our waitlist here.

We also released ClickHouse 22.4, new documentation pages, and the new official ClickHouse plugin for Grafana. There is a date to release ClickHouse 22.5 and some meetups are coming up so don’t forget to block your calendars!

Upcoming Events

Mark your calendars for these:

  • ClickHouse v22.5 Release Webinar
    When? Thursday, May 19 @ 9 am PDT / 5 pm GMT
    How to join? Register here.
  • ClickHouse Americas Virtual Meetup
    What?
    Gigasheet is using ClickHouse to build a simple, no-code analytics product with a web-based spreadsheet interface. Something to finally replace Excel with?! Also our own Geoff Genz will share his journey of implementing ClickHouse at scale at Comcast. A must-see!
    When? Thursday, May 12 @ 11 am PST / 2 pm EST
    How? Register in the meetup groups for New York or Seattle.
  • ClickHouse EMEA Virtual Meetup
    What? Fintech company Opensee will talk about how they use ClickHouse to provide self-service analytics across large amounts of data at financial institutions. And afterwards, Alexey – the creator of ClickHouse – will dive into the internals that make ClickHouse fast. Come prepared with your questions!
    When? Thursday, May 12 @ 1 pm BST / 2 pm CEST / 3 pm EEST
    How? Register in the meetup groups for Netherlands, France, London, Germany, Bangalore – or all of them if you like!

New Documentation

We have given the ClickHouse Documentation pages a new look! There is a new navigation structure on the left side and also lots of new content. We are sure there is at least one new thing for you to learn:

ClickHouse v22.4

What’s in our regular monthly April release:

  1. Transactions are available as a (highly experimental) preview feature. BEGIN TRANSACTIONCOMMIT and ROLLBACK statements support atomic inserts into multiple tables and materialized views as well as consistent and isolated reads from a single snapshot. More to come here, stay tuned!
  2. Keeper load balancing You can now specify how a ClickHouse instance chooses a (ClickHouse) Keeper or (Apache) ZooKeeper instance to connect to. The new config.xml setting is <zookeeper_load_balancing> with possible values random (default), in_ordernearest_hostnamefirst_or_random and round_robin. This can help ensure low latency between ClickHouse and Keeper when you have different Keeper nodes that are far apart from each other.
  3. Table metadata cache A new table setting use_metadata_cache uses the embedded RocksDB engine to cache table metadata. When ClickHouse starts up, it will try to read from this cache and fall back to the table files on disk if needed. This is useful when you have a very large number of data parts (usually caused by having a lot of databases, tables, and/or partitions, or by very high insert volume creating a lot of parts). In one extreme instance (700k parts), the metadata cache was able to reduce startup time from 75 minutes to 20 seconds.
  4. Kafka metrics The Kafka table engine is now exposing some metrics, e.g. number of processed messages, number of errors, number of messages that failed to parse. You can find these in the system.metrics and system.events tables.
  5. Gap filling You can now fill gaps in result sets using interpolation. For example, when you have minute-by-minute data but sometimes the value for a minute might be missing (completely missing, or maybe it’s stil making it’s way through the ingest pipeline and arriving late), you can have ClickHouse use the previous value instead using ORDER BY toStartOfMinute(timestamp) WITH FILL INTERPOLATE (c AS c). More complex expressions are possible.
  6. Last day of month So far, the ClickHouse date functions would round to the start of a time period ( toStartOf[Year|Month|Week|Day|Hour|Minute|etc.). Now, we also have toLastDayOfMonth if you want to round to the last day of a month (useful for financial and accounting data).
  7. H3 We’ve rounded out support for H3, the “hexagonal hierarchical geospatial indexing system” originally developed and open sourced by Uber. If you don’t know it yet, have a look at the website and the comparisons to other systems like S2 and Geohash.

Take a  look at the release webinar slides and please upgrade (unless you want to stay on an LTS release).

Query of the Month: Explain Statement – Query Optimization

One of the most important skills you can have as a ClickHouse user is query optimization. This will get you started.

EXPLAIN SYNTAX

If you’re not familiar yet with the EXPLAIN statement head over to the documentation and have a look. This is your best friend when debugging queries in ClickHouse. The first instance of this command you should know is EXPLAIN SYNTAX. You can use it like this:

EXPLAIN SYNTAX
SELECT sum(number)
FROM (SELECT * FROM numbers(10000))
WHERE number <= 10

This will show what query is actually going to be executed after any syntax optimizations. The output:

┌─explain─────────────────┐
│ SELECT sum(number)      │
│ FROM                    │
│ (                       │	
│     SELECT number       │
│     FROM numbers(10000) │
│     WHERE number <= 10  │
│ )                       │
│ WHERE number <= 10      │
└─────────────────────────┘

You can see two things:

  1. The wildcard projection in the subquery was replaced with the actual column name number.
  2. Even though the WHERE condition appears only on the outside, ClickHouse can determine that it can also be applied in the subquery to vastly reduce the amount of rows selected.

EXPLAIN PLAN

This will show us the query plan. It is often the first thing you want to look at when a query is slow because it will show how much data ClickHouse is reading from disk. For example, imagine two MergeTree tables, each with 1 billion rows, one unsorted and one sorted:

CREATE TABLE numbers (number int) ENGINE = MergeTree ORDER BY tuple()
AS SELECT number FROM numbers(1e9)

CREATE TABLE numbers_sorted (number int) ENGINE = MergeTree ORDER BY number
AS SELECT number FROM numbers(1e9)

Query speeds will be very different:

SELECT sum(number) FROM numbers WHERE number <= 10

1 rows in set. Elapsed: 0.688 sec. Processed 1.00 billion rows, 4.00 GB (1.45 billion rows/s., 5.81 GB/s.)

SELECT sum(number) FROM numbers_sorted WHERE number <= 10

1 rows in set. Elapsed: 0.007 sec. Processed 8.19 thousand rows, 32.77 KB (1.23 million rows/s., 4.90 MB/s.)

Both queries need only 10 values to calculate their results, but while the first query has to read all 1 billion rows of the entire table, the second query reads only 8192 rows (the default size of the table’s index_granularity). We can see why this is by running the queries through EXPLAIN PLAIN (the output is cropped):

EXPLAIN PLAN actions = 1, indexes = 1 SELECT [...] numbers [...]
┌─explain────────────┐
│ [...]              |
│ ReadFromMergeTree  │
│ ReadType: Default  │
│ Parts: 1           │
│ Granules: 122071   │
└────────────────────┘
 
EXPLAIN PLAN actions = 1, indexes = 1 SELECT [...] numbers_sorted [...]
 
┌─explain───────────────────────────────┐
│ [...]                                 │
│ ReadFromMergeTree                     │
│ ReadType: Default                     │
│ Parts: 1                              │
│ Granules: 1                           │
│ Indexes:                              │
│   PrimaryKey                          │
│     Keys:                             │
│       number                          │
│     Condition: (number in (-Inf, 10]) │
│     Parts: 1/1                        │
│     Granules: 1/122071                │
└───────────────────────────────────────┘

Both queries return the same result (55, the sum of numbers from 1 to 10), but the first query reads all 122071 granules (each granule contains 8192 rows so that comes out to the entire dataset of 1 billion rows) and the second query reads just the single granule that contains the ten numbers needed to compute the correct result.

Whenever you have a slow query, run it through EXPLAIN and you might quickly find out why it is taking so long and what to do about it.

Recordings to catch up on

These recordings are not as short as a TikTok video, but well worth your time if you’re a ClickHouse user. Feel free to watch at 2x speed to make it feel more like social media:

Reading Corner

What we’ve been reading:

  1. 10x improved response times, cheaper to operate, and 30% storage reduction: Instabug explains why they chose ClickHouse as their database for APM data. It wasn’t a hard decision if we’re being honest!
  2. Two sizes fit most: PostgreSQL and Clickhouse: Gitlab wrote about ClickHouse and PostgreSQL and also did some benchmarks in which ClickHouse looks very good!
  3. ClickHouse Keeper: Somebody tested ClickHouse Keeper with Kafka, Solr and Mesos – and it all worked! Keeper is not just for ClickHouse.
  4. Introducing the official ClickHouse plugin for Grafana: Read about the new plugin and see some of the beautiful dashboards you can create with it.
  5. ClickHouse Docs have a new look and feel!: Rich points to the new content in the revamped Clickhouse Documentation pages.
  6. New ClickHouse Adopters: Welcome Better StackPingCAPSynapsePiwikMarfeelDatafoldKaiko, and Improvado. Get yourself added as well!

Thanks for reading. We’ll see you next month!

The ClickHouse Team

Would you like to keep in touch?

Sign up for our newsletter to stay informed on feature releases, our product roadmap, and our cloud and support offerings.

Thank you for signing up!