Advanced dashboard in ClickHouse Cloud

Monitoring your database system in a production environment is vital to understanding your deployment health so that you can prevent or solve outages.

The advanced dashboard is a lightweight tool designed to give you deep insights into your ClickHouse system and its environment, helping you stay ahead of performance bottlenecks, system failures, and inefficiencies.

The advanced dashboard is available in both ClickHouse OSS (Open Source Software) and Cloud. In this article we will show you how to use the advanced dashboard in Cloud.

Accessing the advanced dashboard

The advanced dashboard can be accessed by navigating to:

Left side panel
- Monitoring → Advanced dashboard

Accessing the native advanced dashboard

The native advanced dashboard can be accessed by navigating to:

Left side panel
- Monitoring → Advanced dashboard
- Clicking You can still access the native advanced dashboard.

This will open the native advanced dashboard in a new tab. You will need to authenticate to access the dashboard.

Each visualization has a SQL query associated with it that populates it. You can edit this query by clicking on the pen icon.

Out-of-box visualizations

The default charts in the Advanced Dashboard are designed to provide real-time visibility into your ClickHouse system. Below is a list with descriptions for each chart. They are grouped into three categories to help you navigate them.

ClickHouse specific

These metrics are tailored to monitor the health and performance of your ClickHouse instance.

Metric	Description
Queries Per Second	Tracks the rate of queries being processed
Selected Rows/Sec	Indicates the number of rows being read by queries
Inserted Rows/Sec	Measures the data ingestion rate
Total MergeTree Parts	Shows the number of active parts in MergeTree tables, helping identify unbatched inserts
Max Parts for Partition	Highlights the maximum number of parts in any partition
Queries Running	Displays the number of queries currently executing
Selected Bytes Per Second	Indicates the volume of data being read by queries

System health specific

Monitoring the underlying system is just as important as watching ClickHouse itself.

Metric	Description
IO Wait	Tracks I/O wait times
CPU Wait	Measures delays caused by CPU resource contention
Read From Disk	Tracks the number of bytes read from disks or block devices
Read From Filesystem	Tracks the number of bytes read from the filesystem, including page cache
Memory (tracked, bytes)	Shows memory usage for processes tracked by ClickHouse
Load Average (15 minutes)	Report the current load average 15 from the system
OS CPU Usage (Userspace)	CPU Usage running userspace code
OS CPU Usage (Kernel)	CPU Usage running kernel code

ClickHouse Cloud specific

ClickHouse Cloud stores data using object storage (S3 type). Monitoring this interface can help detect issues.

Metric	Description
S3 Read wait	Measures the latency of read requests to S3
S3 read errors per second	Tracks the read errors rate
Read From S3 (bytes/sec)	Tracks the rate data is read from S3 storage
Disk S3 write req/sec	Monitors the frequency of write operations to S3 storage
Disk S3 read req/sec	Monitors the frequency of read operations to S3 storage
Page cache hit rate	The hit rate of the page cache
Filesystem cache hit rate	Hit rate of the filesystem cache
Filesystem cache size	The current size of the filesystem cache
Network send bytes/sec	Tracks the current speed of incoming network traffic
Network receive bytes/sec	Tracks the current speed of outbound network traffic
Concurrent network connections	Tracks the number of current concurrent network connections

Identifying issues with the Advanced dashboard

Having this real-time view of the health of your ClickHouse service greatly helps mitigate issues before they impact your business or help solve them. Below are a few issues you can spot using the advanced dashboard.

Unbatched inserts

As described in the best practices documentation, it is recommended to always bulk insert data into ClickHouse if able to do so synchronously.

A bulk insert with a reasonable batch size reduces the number of parts created during ingestion, resulting in more efficient write-on disks and fewer merge operations.

The key metrics to spot sub-optimized insert are Inserted Rows/sec and Max Parts for Partition

The example above shows two spikes in Inserted Rows/sec and Max Parts for Partition between 13h and 14h. This indicates that we ingest data at a reasonable speed.

Then we see another big spike on Max Parts for Partition after 16h but a very slow Inserted Rows/sec speed. A lot of parts are being created with very little data generated, which indicates that the size of the parts is sub-optimal.

Resource intensive query

It is common to run SQL queries that consume a large amount of resources, such as CPU or memory. However, it is important to monitor these queries and understand their impact on your deployment's overall performance.

A sudden change in resource consumption without a change in query throughput can indicate more expensive queries being executed. Depending on the type of queries you are running, this can be expected, but spotting them from the advanced dashboard is good.

Below is an example of CPU usage peaking without significantly changing the number of queries per second executed.

Bad primary key design

Another issue you can spot using the advanced dashboard is a bad primary key design. As described in "A practical introduction to primary indexes in ClickHouse", choosing the primary key to fit best your use case will greatly improve performance by reducing the number of rows ClickHouse needs to read to execute your query.

One of the metrics you can follow to spot potential improvements in primary keys is Selected Rows per second. A sudden peak in the number of selected rows can indicate both a general increase in overall query throughput, and queries that select a large number of rows to execute their query.

Using the timestamp as a filter, you can find the queries executed at the time of the peak in the table system.query_log.

For example, running a query that shows all the queries executed between 11 am and 11 am on a certain day to understand what queries are reading too many rows:

In this example, we can see the same query being executed against two tables amazon_reviews_no_pk and amazon_reviews_pk. It can be concluded that someone was testing a primary key option for the table amazon_reviews.

Accessing the advanced dashboard​

Accessing the native advanced dashboard​

Out-of-box visualizations​

ClickHouse specific​

System health specific​

ClickHouse Cloud specific​

Identifying issues with the Advanced dashboard​

Unbatched inserts​

Resource intensive query​

Bad primary key design​