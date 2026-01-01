It’s 2:13 a.m.

Your AI SRE copilot has a confident answer: “Error rates in checkout increased because the Payment service is slow.”

Twenty minutes later, you discover the real issue was a bad feature flag rollout. The “copilot” just narrates your dashboards. That’s not a copilot. That’s a chat UI for your graphs.

AI SRE tools promised to transform incident response, but most implementations have been disappointing. They point an LLM at observability data and ask it to explain what broke and why. That doesn’t work.

When I led the platform and storage teams at Confluent and pushed our availability SLA from 99.9% to 99.95%, I learned something counterintuitive about incidents. The bulk of them ended with one of three crude corrective actions: roll back a bad change, restart an unhealthy component, or scale up capacity to absorb load.

Applying the fix usually took minutes. The hard part was figuring out the root cause.

Was the problem a bad configuration, a noisy neighbor, a control plane deadlock, or a subtle storage regression? Answering that question required an investigation, not just a runbook.

Many AI SRE tools fall short here. Some lean toward automated remediation or market themselves as self-healing systems, which is both risky and unnecessary in most real environments. Others focus on correlation, summarization, and alert reduction.

Across both camps, the same constraint emerges: they try to reason at scale on top of an observability substrate not designed for AI-first investigation. As a result, most AI SRE products have been underwhelming.

Let's be real: the goal isn't to build a bot that restarts your databases. An AI SRE is an investigator that analyzes data so the on-call human can make a decision.

The AI hunts. The human decides.

This Human-in-the-Loop approach solves the real bottleneck (Mean Time to Understand, or MTTU) without the risks of auto-remediation.

The ClickHouse engineering team recently tested whether frontier models could autonomously identify root causes from real observability data. The finding was both uncomfortable and useful. Even GPT-5 couldn't do it reliably, despite having access to detailed telemetry. The real constraint:

"The bottleneck is not model IQ; it is missing context, weak grounding, and no domain specialization."

The limiting factor was the data substrate, not the LLM. The models could read logs and metrics, but they were looking at short retention windows, incomplete dimensions, and fragmented context. They were reasoning over partial information.

I now think about this problem in two layers. First, build an observability foundation that actually captures the information an AI investigator needs, with the right economics and query profile. Second, use AI for what it excels at: reducing time-consuming work on correlation, pattern matching, and narrative, while engineers retain control over actions.

This article shows how to address this gap by building an AI SRE copilot for on-call engineers on a solid observability foundation.

Many AI SRE products are thin layers on top of older observability platforms. They inherit the economic and architectural constraints of those systems, and they hit the same ceiling in four predictable ways.

Problem 1: The retention problem #

Most legacy observability platforms that grew up around search-first, inverted-index architectures charge primarily based on ingestion volume. At scale, this pricing model pushes teams toward aggressively short retention. Teams typically retain 7 to 14 days of logs, with a slightly longer window for coarse metrics. While they may retain older data in “cold tiers”, these rarely deliver the query response times that agentic analysis requires.

For an AI SRE copilot, short retention removes historical memory. A model investigating a checkout failure today can't see that the same pattern occurred six weeks ago after a similar deployment, because those logs no longer exist.

Seasonal patterns, rare edge cases, and long-tail incidents become invisible.

From a reliability perspective, every incident looks like the first time. The model can't learn from the organization's own history, and no amount of prompt engineering fixes missing data.

If your logs can’t remember more than two weeks, neither can your AI SRE.

Problem 2: The cardinality problem #

To control cost and performance in search-first systems, teams routinely drop high-cardinality dimensions. User IDs, session IDs, request IDs, detailed error codes, and fine-grained labels often get removed because they increase index size and query latency in inverted index engines.

These fields are exactly what an AI SRE needs to correlate events.

Root cause analysis usually connects a symptom to a specific subset of users, regions, deployments, or feature flags. If those dimensions aren't stored, the model sees only aggregate curves and generic error messages. It can describe that the error rate increased, but it can't answer which customers, which change, or along which path.

The full stack blindspot #

At Confluent, the cardinality problem combined with stack complexity into a more painful pattern. Our architecture had a data plane, a control plane, and the underlying cloud infrastructure layer. Very few engineers, perhaps a handful in the entire organization, had a complete mental model of how a disk latency spike could ripple through to durability at the data layer.

Incident response often became a human coordination problem. We frequently pulled five different teams onto a call just to reconstruct a complete picture. Each team saw a different slice of metrics and logs in their own tools, so the real diagnosis happened in people's heads and in ad hoc conversations.

An AI SRE can only close that gap if data from all layers lives in one place.

When the control plane, data plane, cloud metrics, and application telemetry all live in ClickHouse, the copilot has no team boundaries. It can trace a request from the load balancer through the API layer and down to disk, bridging the visibility gap that humans struggle to cross during a tense outage.

Problem 3: The query speed problem #

In the ClickHouse experiment, the team quantified how an AI agent actually behaves during an incident. An AI SRE operates in a loop: it forms a hypothesis, queries the data, refines its understanding, and queries again. Each investigation involved between 6 and 27 database queries as the model iterated.

A realistic workflow looks like this:

1. Inspect recent errors for the impacted service.
2. Break down errors by version and region.
3. Cross-reference with deployments and feature flags.
4. Pull traces for the slowest endpoints.
5. Join with customer impact or business metrics.
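
The workflow above can be sketched as a small driver that runs one query per step and collects the results. This is a minimal illustration, not a real agent: the table names (`logs`, `changes`, `traces`, `customers`), their columns, and the `investigate` helper are all hypothetical, and a real copilot would choose each next query based on what the previous one returned rather than follow a fixed list.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[tuple]

# Hypothetical tables and columns, for illustration only.
WORKFLOW: List[Tuple[str, str]] = [
    ("recent_errors",
     "SELECT timestamp, message FROM logs "
     "WHERE service = 'checkout' AND level = 'error' "
     "ORDER BY timestamp DESC LIMIT 50"),
    ("errors_by_version_region",
     "SELECT version, region, count(*) AS errors FROM logs "
     "WHERE service = 'checkout' AND level = 'error' "
     "GROUP BY version, region ORDER BY errors DESC"),
    ("recent_changes",  # deployments and feature flag flips
     "SELECT changed_at, kind, name FROM changes "
     "ORDER BY changed_at DESC LIMIT 20"),
    ("slowest_traces",
     "SELECT trace_id, endpoint, duration_ms FROM traces "
     "WHERE service = 'checkout' ORDER BY duration_ms DESC LIMIT 20"),
    ("customer_impact",
     "SELECT c.plan, count(*) AS errors FROM logs AS l "
     "JOIN customers AS c ON l.customer_id = c.id "
     "WHERE l.level = 'error' GROUP BY c.plan"),
]


def investigate(run_query: Callable[[str], Rows],
                steps: List[Tuple[str, str]] = WORKFLOW,
                max_queries: int = 27) -> Dict[str, Rows]:
    """Run each step in order and collect results, capped at max_queries."""
    findings: Dict[str, Rows] = {}
    for name, sql in steps[:max_queries]:
        findings[name] = run_query(sql)
    return findings
```

Here `run_query` is whatever function sends SQL to your database and returns rows. Passing it in as a parameter keeps the sketch independent of any particular client library.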

If each query takes 20 to 30 seconds on a legacy observability platform, the feedback loop collapses. With 6 to 27 queries per investigation, the agent spends minutes just waiting on data, and the operator will always be faster using native dashboards.
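
To make the arithmetic concrete, here is a small sketch that multiplies the query counts from the experiment by the per-query latencies above. It assumes the agent issues its queries one after another, and the 0.5-second figure is only an assumed stand-in for a "sub-second" engine.

```python
def total_wait_seconds(queries: int, seconds_per_query: float) -> float:
    """Time spent waiting on data if queries run one after another."""
    return queries * seconds_per_query


for queries in (6, 27):
    for latency in (20.0, 30.0, 0.5):  # 0.5 s is an assumed sub-second latency
        wait = total_wait_seconds(queries, latency)
        print(f"{queries:>2} queries x {latency:>4} s = {wait / 60:.1f} min")
```

At the low end, 6 queries at 20 seconds each is 2 minutes of waiting. At the high end, 27 queries at 30 seconds each is 13.5 minutes.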

Problem 4: The per-query tax #

Human analysts and AI agents approach investigations differently. A human writes one or a handful of queries, waits for results, and examines the data.

An AI agent enters a "Chain of Thought" loop, firing up to 27 queries in a short time period to map dependencies, check outliers, and validate hypotheses.

If your observability data lives in a solution or database with per-query pricing (like New Relic or BigQuery), your AI agent will destroy your budget. If you're using a traditional database with strict concurrency limits, the agent spends more time waiting in the query queue than actually solving problems.
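
A rough cost model shows why the query count matters under scan-based, per-query pricing. Every input below is a placeholder chosen for illustration (the number of investigations, the data scanned per query, and the price per terabyte are not any vendor's real rates), so treat the output as a ratio rather than a budget figure.

```python
def monthly_query_cost(investigations: int,
                       queries_per_investigation: int,
                       tb_scanned_per_query: float,
                       price_per_tb: float) -> float:
    """Monthly spend when every query is billed by the data it scans."""
    return (investigations * queries_per_investigation
            * tb_scanned_per_query * price_per_tb)


# Hypothetical inputs: 100 investigations a month, 0.5 TB scanned per query,
# and a made-up price of 5.0 currency units per TB scanned.
human = monthly_query_cost(100, 3, 0.5, 5.0)    # a handful of queries each
agent = monthly_query_cost(100, 27, 0.5, 5.0)   # up to 27 queries each
print(f"human: {human:.0f}, agent: {agent:.0f}, ratio: {agent / human:.0f}x")
```

Whatever the real prices are, the cost scales linearly with the number of queries, so an agent that asks nine times as many questions costs nine times as much.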

This leads to the core limitation: many AI SRE tools attempt to reason at scale on top of platforms not designed for high-volume, high-cardinality analytical queries with long retention. No prompt or fine-tuning can fully compensate for a data store that can't retain or serve what the copilot actually needs.

You can't "AI" your way out of a storage and query problem.

Why ClickHouse is the right database for building an AI SRE Copilot #

ClickHouse addresses three problems at their root: storage costs, high-cardinality performance, and query latency.

For observability workloads, modern solutions such as ClickStack, which uses ClickHouse as its core data engine, routinely achieve order-of-magnitude improvements over legacy observability platforms built on inverted indices.

At a high level, the differences look like this:

| Data problem | Legacy observability stacks built on inverted indices | ClickHouse-based observability |
| --- | --- | --- |
| Retention | 7–14 days of full logs, then aggressive sampling or rollups | Months of full-fidelity logs, metrics, and traces at petabyte scale |
| Cardinality | High-cardinality dimensions dropped or pre-aggregated to control index size | Native support for billions of unique values with sparse indexing and compression |
| Query speed | Seconds to minutes for multi-dimensional aggregations | Sub-second scans and aggregates on billions of rows for typical incident queries |
| LLM compatibility | Requires few-shot prompting or fine-tuning for custom DSLs | Zero-shot compatible via standard SQL |

The economics come from architecture, not marketing.

Columnar storage and compression = longer memory. Machine-generated logs and metrics compress extremely well when stored column-by-column. Real deployments often see 10x to 15x less storage compared to inverted index engines for the same raw telemetry volume. That difference translates directly into longer retention windows and more history for the copilot.
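
As a back-of-the-envelope sketch of that translation: if the storage budget stays fixed and daily telemetry volume is constant, retention scales with the storage reduction. Both of those conditions are simplifying assumptions, and `retention_days` is a hypothetical helper, not part of any tool.

```python
def retention_days(baseline_days: float, storage_reduction: float) -> float:
    """Retention achievable on the same storage budget.

    Assumes constant daily volume and that storage is the only constraint.
    """
    return baseline_days * storage_reduction


for baseline in (7, 14):
    for reduction in (10, 15):
        days = retention_days(baseline, reduction)
        print(f"{baseline} days at {reduction}x less storage -> {days:.0f} days")
```

Under these assumptions, a 14-day window at 10x less storage becomes 140 days, which is the difference between two weeks of memory and several months.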

Vectorized execution for analytical queries = the copilot’s feedback loop stays interactive. Incident queries rely on aggregations, filters, and time ranges. ClickHouse executes these operations in tight vectorized loops on compressed data. It can scan and aggregate billions of rows in under a second on modern hardware, keeping the AI feedback loop interactive even when the model issues dozens of queries.

Sparse primary indexes instead of global inverted indices = keep your high-cardinality fields. MergeTree tables in ClickHouse use ordered primary keys and lightweight indexes rather than heavy per-field inverted indices. This design tolerates high-cardinality dimensions, such as request IDs and user IDs, in the schema without causing catastrophic index growth.
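
The idea behind a sparse index can be shown with a toy model. The sketch below is a deliberate simplification and not how MergeTree is implemented: it keeps one index entry per block of sorted rows (a "granule", sized here at 8192 rows to match ClickHouse's default `index_granularity`) and uses binary search to decide which granules to scan. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence


def build_marks(sorted_keys: Sequence[int], granule_size: int) -> List[int]:
    """Store only the first key of each granule instead of indexing every row."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_size)]


def candidate_granules(marks: List[int], key: int) -> range:
    """Granule numbers that could contain `key`; all others are skipped."""
    # The key may sit at the tail of the granule before the first mark >= key.
    first = max(bisect_left(marks, key) - 1, 0)
    # It cannot sit in any granule whose first key is already greater than key.
    last = bisect_right(marks, key) - 1
    return range(first, last + 1)


def lookup(sorted_keys: Sequence[int], marks: List[int],
           granule_size: int, key: int) -> List[int]:
    """Positions of rows equal to `key`, scanning only candidate granules."""
    hits = []
    for granule in candidate_granules(marks, key):
        start = granule * granule_size
        end = min(start + granule_size, len(sorted_keys))
        hits.extend(pos for pos in range(start, end) if sorted_keys[pos] == key)
    return hits


keys = list(range(100_000))      # e.g. sorted request IDs, every value unique
marks = build_marks(keys, 8192)
print(len(keys), "rows,", len(marks), "index entries")
print(lookup(keys, marks, 8192, 50_000))
```

The index size depends on the number of rows divided by the granule size, not on how many distinct values the column holds. That is why a fully unique column such as a request ID does not inflate it.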

Standard SQL = Zero-Shot Fluency. LLMs have seen vast amounts of SQL in their training data. They are far less fluent in domain-specific query languages such as SPL, KQL, and PromQL. When you use a SQL-native database such as ClickHouse, you don't waste your context window teaching the model a new language or fine-tuning it on custom syntax. The model context focuses on the data, not the grammar.

When this storage engine powers a modern observability solution, the AI SRE copilot builds on a very different foundation. Retention spans months instead of days. Dimensions remain intact. Queries complete fast enough that a model can afford to iterate. This foundation gives AI the breadcrumbs it needs to traverse the stack.

How to solve the context window problem with SQL #

Here's the common question: "How does an AI agent read months of logs with a 128k token limit?"

It doesn't. The database does the reduction. The agent uses SQL to scan petabytes of history, and only the relevant result (kilobytes) returns to the context window.

Legacy observability tools typically offer two modes: "search" (list logs) and "aggregations" (time-series metrics for line charts). ClickHouse offers full SQL.

Full SQL lets the agent run complex logic (joins, window functions, and subqueries) to filter signals from noise inside the database layer. This keeps data dumps out of the context window.
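
The pattern is easy to demonstrate in miniature. The sketch below uses Python's built-in `sqlite3` purely as a stand-in so it runs without a server. The `logs` table, its columns, and the synthetic data are all invented for illustration, and it shows only a simple aggregation rather than joins or window functions.

```python
import sqlite3


def build_demo_db(n_rows: int = 50_000) -> sqlite3.Connection:
    """Create an in-memory table of synthetic request logs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE logs (ts INTEGER, service TEXT, version TEXT, status INTEGER)"
    )
    rows = (
        (i, "checkout",
         "v42" if i % 10 == 0 else "v41",   # 10% of traffic on the new version
         500 if i % 30 == 0 else 200)       # errors occur only on v42
        for i in range(n_rows)
    )
    conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", rows)
    return conn


SUMMARY_SQL = """
SELECT version,
       COUNT(*) AS requests,
       SUM(CASE WHEN status >= 500 THEN 1 ELSE 0 END) AS errors
FROM logs
WHERE service = 'checkout'
GROUP BY version
ORDER BY errors DESC
"""

conn = build_demo_db()
raw_rows = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
summary = conn.execute(SUMMARY_SQL).fetchall()
print(f"{raw_rows} rows scanned in the database, {len(summary)} rows sent to the model")
print(summary)
```

The model never sees the 50,000 raw rows. It sees two summary rows that already point at the suspect version, and the same shape holds when the table contains billions of rows.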

Note: You can absolutely build an AI SRE copilot without ClickHouse. Any database that gives you similar economics and query profiles can work. We’re biased because we’ve seen ClickHouse handle this at petabyte scale, but the architectural pattern matters more than the specific solution.

The reference architecture: AI copilot for SRE #

With the data substrate in place, the AI SRE copilot becomes a well-defined architectural pattern.

At a high level:

The key pieces are: