This month, let’s start with some important weather information, then dive into the September release and round it out with some thread work.
By the way, if you’re reading this on our website, did you know you can receive every monthly newsletter as an email in your inbox as well? Sign up here.
ClickHouse Cloud Beta
If you’re a frequent reader of this newsletter, then you know that we’ve been hard at work building our ClickHouse Cloud managed service. On Tuesday, October 4, 2022, we made the beta version available to the world!
It’s not just a ClickHouse managed service. Today, when you run ClickHouse yourself you’re probably running it on some VMs or Kubernetes pods and configuring replication, sharding, clusters, distributed tables, etc. to achieve high availability and scalability. That’s not what we did with ClickHouse Cloud. Instead, we moved the data to shared object storage that all ClickHouse nodes can access. With that, there is no need to configure anything extra for replication and sharding - object storage is already replicated and it is practically infinitely scalable. In addition, object storage is several times cheaper than block storage.
Why does this matter? As a user, you benefit from the significantly lower price of storage and the ease of use of all data storage being taken care of for you. No more setting up replicas and shards and worrying about how to rebalance those when scaling a cluster up or down. No more worries about upgrades (we do those for you). No more worries about backups either (we do those for you, too). And of course, we monitor and support ClickHouse Cloud for you 24/7/365.
ClickHouse Cloud is the only official ClickHouse managed service built by the creators of ClickHouse and designed from the ground up for the cloud. Give it a try!
Here’s the release blog post, a Quick Start guide, and when you are ready to go head over to our website and hit the “Try it now” button. Keep in mind it’s still a beta service, so we really appreciate any feedback from you. Also make sure you sign up to our Launch Webinar later this month to learn about ClickHouse Cloud directly from our team.
Mark your calendars:
ClickHouse v22.10 Release Webinar
- When? CHANGED TO: Wednesday, October 26 @ 9 am PST / 6 pm CEST
- How do I join? Register here.
ClickHouse Cloud Launch Webinar
- What? Join us for the virtual unveiling of our managed service. We’ll walk through all the features and give you a peek behind the curtain as well. Can’t wait to tell you all about it!
- When? Thursday, October 27 @ 9 am PST / 6 pm CEST
- How do I join? Register here.
- What? A number of the ClickHouse team are going to be at re:Invent! Interested in meeting up with us, maybe grabbing a beverage, and talking about ClickHouse? Let us know!
- Where? Las Vegas, NV
- When? November 29 - December 3, 2022
Our first autumn release this year, with some neat new features:
- Vector search (ANN) ClickHouse has a new index type
annoy. It implements a search index for Approximate Nearest Neighbor (ANN) search to quickly find the nearest neighbor in multidimensional space (vector search). This is an experimental feature.
- KeeperMap table engine A new table engine
KeeperMapallows storing small amounts of key-value data in ClickHouse Keeper or ZooKeeper. This is an experimental feature.
- Automatically select quorum You can now set
insert_quorum = autoto have ClickHouse automatically select a majority of replicas to write an insert to before reporting back success to the client. Note: Using insert_quorum will make inserts take longer.
- Readonly users can now change settings It can be useful to allow read-only users to change certain settings, the number of threads or table filters, for example. The setting
changeable_in_readonlyallows you to do just that.
- INTERSECT DISTINCT and EXCEPT DISTINCT When combining results from two queries, it’s now possible to remove duplicate rows when using INTERSECT (which returns only rows present in both queries) and EXCEPT (which returns only rows present in the first but not the second query).
- Official Node.js client ClickHouse now has an official Node.js client written in Typescript. It supports both reading and writing data over HTTP(S). Give it a try!
Query of the Month: More threads are not always better
Many people, after they have used ClickHouse for a bit, will discover the magic of threads. Maybe loading data takes longer than you’d like or perhaps a query is slower than your boss wants it to be. So you do some googling and in an answer to an old StackOverflow question you read about max_threads and how it can be used to speed everything up.
And yet, as so often in life and technology, it’s not that simple!
By default, ClickHouse will try to determine the correct number of threads based on the number of available CPU cores. You can check what it is set to using
SELECT getSetting('max_threads'). For example, on my MacBook Pro it returns 10, which is the number of cores in an Apple M1 chip (yes, ClickHouse works on ARM!).
You can set it to use more threads, but it won’t always work well. For example, try this (after you create the right table structure see here):
INSERT INTO ontime SELECT *
FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/ontime/csv_by_year/*.csv.gz', CSVWithNames)
SETTINGS max_threads = 100
Unless you have a really big machine, it will not complete – and use a lot of memory along the way. Effectively, every one of the 100 threads is a separate insert sub-query and requires its own memory to retrieve and write data.
So should you set
max_threads = 1? Well, maybe - for me, the query above will insert data at about 120k rows per second when limited to a single thread. Without
max_threads it will insert 320k rows per second. However, the optimal insert speed for me is with 4 threads - 428k rows per second, some 33% faster than with thread autodetection. Sometimes it pays to experiment.
Similarly, when reading data, more threads are not always better. Yes, more threads mean there are more CPU cycles available to a query. However, it also means there are more separate states that all need to be merged back together before the result can be returned. Most importantly, many threads per query will reduce the number of queries per second that can be run.
For example, using the UK Property Price Paid dataset, we can check what the most popular street names are using:
SELECT count(), street FROM uk_price_paid
GROUP BY street ORDER BY count() DESC
The top 3: High Street, Station Road, London Road. In the next 7, there are Church Road, Church Street, and Church Lane - maybe not surprising in a monarchy where the King is both Head of State and Head of the Church.
This above query takes 0.2 seconds on my machine. If I set
max_threads = 1 it takes 0.5 seconds. So more threads are better, right?
In this case, yes - except if I want to run this query many times concurrently. I can execute 10 queries concurrently 10 times (100 queries in total) with
clickhouse-benchmark -c 10 -i 100 -q "<query>”
On my machine, it takes about 27 seconds to run all 100 queries, just under 4 queries a second (
QPS: 3.776 according to the result).
max_threads = 1 it takes about 10 seconds, or 13 queries per second (
QPS: 13.122). Three times higher query throughput! And all we had to do was change the number of threads - but lower, not higher. Generally, if you expect a lot of concurrent queries you should consider explicitly lowering the thread count per query.
What we’ve been reading:
- ClickHouse Cloud is now in Public Beta The launch blog post for the new and only official ClickHouse managed service by us, the creators of ClickHouse.
- Getting Data Into ClickHouse - Part 3 - Using S3 Part 3 of a series on how to get data into ClickHouse. Check out Parts 1 and 2 (here and here) as well.
- Visualizing Data with ClickHouse - Part 1 - Grafana First in a series on visualizing data with ClickHouse. We start with Grafana and will follow with posts on Superset and Metabase.
Thanks for reading, we’ll see you next month. And please test ClickHouse Cloud!
The ClickHouse Team