Monitor SLAs and scale ClickHouse Cloud with clickhousectl and agents

ClickHouse Cloud makes it trivial to automatically scale your infrastructure up and down, horizontally or vertically, in response to resource pressure. But sometimes you want to go further and monitor SLAs on specific queries. Perhaps they're the queries fired off by your frontend app, and your user experience degrades when latency exceeds 200 ms.

This guide shows you how to tag queries so you can calculate SLAs, then use clickhousectl to query and scale ClickHouse Cloud to investigate and fix breaches. You'll also see how you can pass this workflow off to an agent to investigate and remediate for you.

Try out the runnable example.

Setup

Install clickhousectl and use an API key to auth:

curl https://clickhouse.com/cli | sh

clickhousectl cloud auth login --api-key "$CLICKHOUSE_CLOUD_API_KEY" --api-secret "$CLICKHOUSE_CLOUD_API_SECRET"

Confirm you can see your services. The first column is the service ID you'll use everywhere else:

clickhousectl cloud service list

Defining and measuring your SLA

First, you need to define your SLA and know how to measure it. An SLA is only useful if it's specific: a percentile, a latency target, and the queries it applies to. For a frontend dashboard, that might be "p99 under 200 ms for the queries behind the main view". That's what we'll use for the example here.

The system.query_log table records every query a ClickHouse service runs. The trick is to tag your queries so you can easily filter to them. Set log_comment on the queries you want to track, and they become trivial to isolate later:

SELECT event_type, count(), avg(value), quantile(0.9)(value)
FROM events
WHERE event_type = 'purchase'
  AND event_time > now() - INTERVAL 1 DAY
GROUP BY event_type
SETTINGS log_comment = 'frontend-dashboard';

With the queries tagged, you can read them back from the log:

clickhousectl cloud service query --id "$SERVICE_ID" --query "
  SELECT event_time, query_duration_ms
  FROM clusterAllReplicas(default, system.query_log)
  WHERE type = 'QueryFinish'
    AND log_comment = 'frontend-dashboard'
  ORDER BY event_time DESC
  LIMIT 10"

Once you can see them, measuring the SLA is just an aggregation. Compute the p99 latency for exactly that workload over the last five minutes:

clickhousectl cloud service query --id "$SERVICE_ID" --query "
  SELECT
      toUInt64(quantile(0.99)(query_duration_ms)) AS p99_ms,
      count() AS queries
  FROM clusterAllReplicas(default, system.query_log)
  WHERE event_time > now() - INTERVAL 5 MINUTE
    AND type = 'QueryFinish'
    AND log_comment = 'frontend-dashboard'"
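A p99 is one way to read an SLA. It can also help to see what share of the tagged queries landed inside the target. The sketch below builds that SQL from a tag, a target, and a window. The helper name sla_compliance_sql is hypothetical (it is not part of clickhousectl), and it only counts queries that finished, so failed queries are not reflected.

```shell
# Hypothetical helper: prints a SQL statement that reports how many tagged
# queries finished within the latency target over the given window.
# Arguments: <log_comment tag> <target in ms> <window in minutes>
sla_compliance_sql() {
  local tag="$1" target_ms="$2" window_min="$3"
  cat <<SQL
SELECT
    count() AS queries,
    countIf(query_duration_ms <= ${target_ms}) AS within_target,
    round(100 * within_target / queries, 2) AS pct_within_target
FROM clusterAllReplicas(default, system.query_log)
WHERE event_time > now() - INTERVAL ${window_min} MINUTE
  AND type = 'QueryFinish'
  AND log_comment = '${tag}'
SQL
}

# Example usage (needs an authenticated clickhousectl and a real service ID):
#   clickhousectl cloud service query --id "$SERVICE_ID" \
#     --query "$(sla_compliance_sql frontend-dashboard 200 5)"
```

The arguments are interpolated straight into the SQL, so only pass values you control.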

Investigating a breach

A breached SLA tells you that latency went up, but not why. There are two places to look, and they answer different questions. Sometimes it's a simple case of CPU or memory being over-utilised. Other times the hardware stats look fine, and you need to dig a little deeper into what's going on inside the database.

Inside the database

The first signal lives in ClickHouse itself. system.query_log doesn't just help you with the SLA query. You can ask questions about everything else that ran alongside it, too, which helps you understand whether something about the workload is changing.

Bucketing volume and latency by minute is a good place to start:

clickhousectl cloud service query --id "$SERVICE_ID" --query "
  SELECT
      toStartOfMinute(event_time) AS minute,
      count() AS queries,
      toUInt64(quantile(0.99)(query_duration_ms)) AS p99_ms
  FROM clusterAllReplicas(default, system.query_log)
  WHERE event_time > now() - INTERVAL 30 MINUTE
    AND type = 'QueryFinish'
    AND log_comment = 'frontend-dashboard'
  GROUP BY minute
  ORDER BY minute"

A common cause is an increase in query volume or concurrency. As your application grows, more users are actively viewing their dashboards, firing off more queries at the same time.

If query volume climbed in lockstep with p99, you probably have a concurrency problem. If p99 rose while volume stayed flat, something else is competing for resources, and you can widen the same query (drop the log_comment filter, group by log_comment or query_kind) to find the heavy queries, ingestion, or merges crowding out your dashboard.
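One way the widened query could look is sketched below: the log_comment filter is dropped and the results are grouped by log_comment and query_kind, ranked by total time spent. The helper name workload_breakdown_sql, the extra columns, and the LIMIT are illustrative choices rather than anything prescribed by clickhousectl.

```shell
# Hypothetical helper: prints a SQL statement that breaks down everything in
# system.query_log by workload tag and query kind for the given window.
# Arguments: <window in minutes>
workload_breakdown_sql() {
  local window_min="$1"
  cat <<SQL
SELECT
    log_comment,
    query_kind,
    count() AS queries,
    toUInt64(quantile(0.99)(query_duration_ms)) AS p99_ms,
    sum(query_duration_ms) AS total_ms,
    formatReadableSize(max(memory_usage)) AS peak_memory
FROM clusterAllReplicas(default, system.query_log)
WHERE event_time > now() - INTERVAL ${window_min} MINUTE
  AND type = 'QueryFinish'
GROUP BY log_comment, query_kind
ORDER BY total_ms DESC
LIMIT 20
SQL
}

# Example usage (needs an authenticated clickhousectl and a real service ID):
#   clickhousectl cloud service query --id "$SERVICE_ID" \
#     --query "$(workload_breakdown_sql 30)"
```

Untagged queries show up with an empty log_comment. Background merges generally don't appear in system.query_log, so for those the system.merges table, which lists merges currently in progress, is a better place to look.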

System metrics

The second signal is resource pressure. To see whether the service is actually saturated, look at its metrics. ClickHouse Cloud exposes a Prometheus-compatible endpoint per service. clickhousectl can help you take a quick peek:

clickhousectl cloud service prometheus "$SERVICE_ID" --filtered-metrics true

The snapshot is enough to get an idea of current state. For trends over time, point a standing Prometheus scraper at the same endpoint.

Pay particular attention to these metrics:

Resource	Metric(s)	How to read it
CPU	`ClickHouseAsyncMetrics_CGroupUserTimeNormalized` + `ClickHouseAsyncMetrics_CGroupSystemTimeNormalized`, vs. `ClickHouseAsyncMetrics_CGroupMaxCPU`	Sum the two normalized values to get cores in use. ~1.0 = one core saturated; approaching `CGroupMaxCPU` = CPU maxed out.
Memory	`ClickHouseAsyncMetrics_CGroupMemoryUsed` ÷ `ClickHouseAsyncMetrics_CGroupMemoryTotal`	Fraction of the memory limit in use. Approaching 1.0 = memory pressure.
Concurrency	`ClickHouseMetrics_Query`	Queries executing right now, a quick proxy for how busy the service is.
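The arithmetic in the table can be scripted. The sketch below assumes the command prints the standard Prometheus text format (one "name{labels} value" sample per line, with # comment lines), which isn't confirmed here, and that summing series across replicas is an acceptable service-wide view. The function name summarise_metrics is hypothetical.

```shell
# Hypothetical helper: reads Prometheus text-format metrics on stdin and prints
# CPU and memory utilisation as fractions, plus the number of running queries.
summarise_metrics() {
  awk '
    /^#/ || NF == 0 { next }            # skip HELP/TYPE comments and blank lines
    {
      line = $0
      sub(/\{[^}]*\}/, "", line)        # drop the {label="..."} block, if present
      split(line, f, " ")               # f[1] = metric name, f[2] = sample value
      sum[f[1]] += f[2]                 # add up series that share a metric name
    }
    END {
      p = "ClickHouseAsyncMetrics_"
      cpu_used  = sum[p "CGroupUserTimeNormalized"] + sum[p "CGroupSystemTimeNormalized"]
      cpu_max   = sum[p "CGroupMaxCPU"]
      mem_used  = sum[p "CGroupMemoryUsed"]
      mem_total = sum[p "CGroupMemoryTotal"]
      printf "cpu_fraction=%.2f\n", (cpu_max > 0 ? cpu_used / cpu_max : 0)
      printf "mem_fraction=%.2f\n", (mem_total > 0 ? mem_used / mem_total : 0)
      printf "running_queries=%d\n", sum["ClickHouseMetrics_Query"]
    }
  '
}

# Example usage (needs an authenticated clickhousectl and a real service ID):
#   clickhousectl cloud service prometheus "$SERVICE_ID" --filtered-metrics true \
#     | summarise_metrics
```

Because the series are summed, a single hot replica can hide behind a healthy average. If that matters for your workload, group by the replica label instead.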

The state of the service helps you determine the right action to take. High concurrency with low memory use suggests adding replicas: you need more cores to spread the concurrent queries over. Memory pinned near the limit on every replica suggests you need bigger replicas.
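As a rough sketch, that rule of thumb can be written down as a tiny decision function. The name suggest_scaling and the 0.90 memory and 0.80 CPU thresholds are assumptions picked for illustration, and CPU utilisation stands in here for "high concurrency". Tune all of them to your own service.

```shell
# Hypothetical helper: maps CPU and memory utilisation (fractions between 0 and 1)
# to a suggested scaling direction. Thresholds are illustrative defaults that can
# be overridden through the MEM_HIGH and CPU_HIGH environment variables.
suggest_scaling() {
  awk -v cpu="$1" -v mem="$2" \
      -v mem_hi="${MEM_HIGH:-0.90}" -v cpu_hi="${CPU_HIGH:-0.80}" 'BEGIN {
    if (mem + 0 >= mem_hi + 0)      print "vertical"     # memory near the limit: bigger replicas
    else if (cpu + 0 >= cpu_hi + 0) print "horizontal"   # cores busy, memory fine: more replicas
    else                            print "investigate"  # no obvious saturation: look at the workload
  }'
}

# Example usage:
#   suggest_scaling 0.95 0.40
```

Memory is checked first on purpose: adding replicas does not give any single query more memory to work with.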

Scaling with clickhousectl

cloud service scale allows you to scale a ClickHouse Cloud service horizontally and vertically:

clickhousectl cloud service scale "$SERVICE_ID" \
  --min-replica-memory-gb 8 \
  --max-replica-memory-gb 16 \
  --num-replicas 3

--num-replicas is the horizontal dimension (how many replicas run in parallel). The --min-replica-memory-gb and --max-replica-memory-gb flags control vertical scaling. ClickHouse Cloud has native auto-scaling that can vertically scale replicas when it sees resource pressure. Set the two flags to different values to let Cloud scale replicas up and down automatically within that range; set them equal to fix the replica size. The example above runs 3 replicas, each free to scale between 8 and 16 GB.
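If you script scaling, a small dry-run wrapper can catch an inverted memory range before anything reaches the API. This is only a sketch: build_scale_command is a hypothetical helper that prints the command instead of running it, and it uses only the flags shown above.

```shell
# Hypothetical helper: validates the memory range and prints the scale command.
# Arguments: <service id> <replicas> <min GB> [max GB]
# Omitting the max pins the replica size (min and max are set equal).
build_scale_command() {
  local service_id="$1" replicas="$2" min_gb="$3" max_gb="${4:-$3}"
  if [ "$min_gb" -gt "$max_gb" ]; then
    echo "min replica memory (${min_gb} GB) exceeds max (${max_gb} GB)" >&2
    return 1
  fi
  printf 'clickhousectl cloud service scale %s --min-replica-memory-gb %s --max-replica-memory-gb %s --num-replicas %s\n' \
    "$service_id" "$min_gb" "$max_gb" "$replicas"
}

# Example usage: print the command, review it, then run it yourself.
#   build_scale_command "$SERVICE_ID" 3 8 16
```

Printing rather than executing keeps a human (or a log line) between the decision and the change, which is handy while you're still tuning the policy.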

A simple cron

You could put the SLA check and a scaling action inside a simple cron job:

#!/usr/bin/env bash
set -euo pipefail
SERVICE_ID="<your-service-id>"
SLA_MS=200  # latency target in milliseconds

# p99 latency of the tagged workload over the last minute
p99=$(clickhousectl cloud service query --id "$SERVICE_ID" --format TSV --query "
  SELECT toUInt64(quantile(0.99)(query_duration_ms))
  FROM clusterAllReplicas(default, system.query_log)
  WHERE event_time > now() - INTERVAL 1 MINUTE
    AND type = 'QueryFinish'
    AND log_comment = 'frontend-dashboard'")

# Stop if the query returned anything other than a whole number
if ! [[ "$p99" =~ ^[0-9]+$ ]]; then
  echo "Unexpected p99 value: '${p99}'" >&2
  exit 1
fi

if (( p99 > SLA_MS )); then
  echo "SLA breached: p99=${p99}ms > ${SLA_MS}ms. Scaling out"
  clickhousectl cloud service scale "$SERVICE_ID" --num-replicas 4
else
  echo "OK: p99=${p99}ms"
fi

Run it once a minute, and you have a very simple way to give your application some breathing room. But you'll have to think about the rest of the flow, too: scaling back down if pressure eases, scaling further when needed, deciding between horizontal and vertical scaling, and so on.
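To give a flavour of that extra logic, here is a minimal sketch of a bounded step policy: add one replica on a breach, remove one when p99 sits comfortably below target, and never leave a fixed range. The function name next_replicas and the "below half the SLA" scale-down rule are assumptions for illustration. The current replica count is something you supply yourself, for example from a state file the script maintains.

```shell
# Hypothetical helper: decides the next replica count from the current count
# and the measured p99 (all arguments are whole numbers).
# Arguments: <current replicas> <p99 ms> <SLA ms> <min replicas> <max replicas>
next_replicas() {
  local current="$1" p99="$2" sla_ms="$3" min="$4" max="$5"
  local next="$current"
  if [ "$p99" -gt "$sla_ms" ]; then
    next=$((current + 1))                    # breach: step out by one replica
  elif [ "$((p99 * 2))" -lt "$sla_ms" ]; then
    next=$((current - 1))                    # p99 under half the target: step in by one
  fi
  [ "$next" -lt "$min" ] && next="$min"      # clamp to the allowed range
  [ "$next" -gt "$max" ] && next="$max"
  echo "$next"
}

# Example usage inside the cron script:
#   target=$(next_replicas "$current" "$p99" "$SLA_MS" 2 6)
#   if [ "$target" -ne "$current" ]; then
#     clickhousectl cloud service scale "$SERVICE_ID" --num-replicas "$target"
#   fi
```

The gap between the scale-out and scale-in thresholds is what stops the service from flapping between two sizes every minute. A cooldown after each change would be a sensible next addition.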

Using agents to investigate and remediate

If you want to go beyond hard-coded heuristics, this is an interesting use case for agents.

A cron might still be the right way to run the SLA-check every minute. But if the SLA is breached, an agent can help to reason about what action to take.

The ClickHouse agent skills can help your agent to better use ClickHouse and clickhousectl. You can install them easily using clickhousectl itself:

clickhousectl skills --agent claude

Instead of a hard-coded scale --num-replicas 4, the cron can pass the failure to an LLM, giving it context about the failure, how to investigate, and what remediation options it should consider:

if (( p99 > SLA_MS )); then
  read -r -d '' PROMPT <<EOF || true
The 'frontend-dashboard' query latency SLA on ClickHouse Cloud service $SERVICE_ID
has just breached: p99 over the last minute is ${p99}ms against a ${SLA_MS}ms target.

You're the on-call agent. Work out WHY the SLA is breaching, then remediate it by
applying exactly one scaling action to the service. Let the evidence drive the choice.

What you have to work with (clickhousectl only):
  - SQL against the service's system tables. system.query_log is the richest source:
    one row per query, with its timing and memory use, each tagged with the workload
    it belongs to in the log_comment column ('frontend-dashboard' is the SLA workload):
      clickhousectl cloud service query --id $SERVICE_ID --format TSV --query "<SQL>"
  - Live resource pressure from Prometheus (CPU, memory, query concurrency, merges):
      clickhousectl cloud service prometheus $SERVICE_ID --filtered-metrics true

Your two scaling levers. Apply only ONE, whichever the root cause calls for:
  - Replica count:  clickhousectl cloud service scale $SERVICE_ID --num-replicas N
  - Replica size:   clickhousectl cloud service scale $SERVICE_ID --min-replica-memory-gb M --max-replica-memory-gb M

General advice on which scaling pattern to use:
- Prefer scaling vertically if cause is unclear.
- Scale vertically if latency is likely caused by resource contention from other queries.
- Scale horizontally if latency is caused by an increase in query concurrency or write throughput.

Apply one action, then explain the evidence you relied on and why that lever fits.
EOF

  printf '%s' "$PROMPT" | claude -p --model sonnet --allowedTools "Bash(clickhousectl:*)"
fi

Use your own scaling policy

You can take this further with your own rules and guidelines for scaling. Perhaps you want to guide the model not to scale beyond X replicas, or give it additional guidance on exactly what to look for (and how).

Creating a context file in Markdown, or encoding it inside a custom agent skill, is a great way to guide the agent towards more desirable behaviour.
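One low-tech way to wire a policy file in is to append it to the prompt before it reaches the model. The sketch below assumes a Markdown file you maintain yourself. The helper name with_policy and the file name scaling-policy.md are hypothetical.

```shell
# Hypothetical helper: prints the base prompt, followed by the contents of a
# policy file when that file exists and is readable.
# Arguments: <prompt text> <path to policy file>
with_policy() {
  local prompt="$1" policy_file="$2"
  printf '%s\n' "$prompt"
  if [ -r "$policy_file" ]; then
    printf '\nScaling policy (follow these rules strictly):\n'
    cat "$policy_file"
  fi
}

# Example usage, with a policy file such as:
#   - Never scale beyond 6 replicas.
#   - Never set replica memory above 32 GB.
#
#   with_policy "$PROMPT" ./scaling-policy.md \
#     | claude -p --model sonnet --allowedTools "Bash(clickhousectl:*)"
```

Keeping the policy in its own file means you can review and version it separately from the cron script. Remember that a prompt is guidance, not enforcement, so hard limits still belong in code or in your Cloud settings.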

Auditing

Every action performed via clickhousectl lands in the ClickHouse Cloud activity log, so you get an audit trail for free:

clickhousectl cloud activity list

Get clickhousectl

Everything in this guide uses clickhousectl, the ClickHouse CLI for local and cloud. It's the single tool for taking a project from your laptop to production: spinning up ClickHouse locally, building against it, and managing the Cloud service it eventually runs on.

Install with: curl https://clickhouse.com/cli | sh
