It’s 3 AM. A pager alert wakes you. The dashboard is a sea of red, but every query you run is agonizingly slow, incomplete, or times out. You suspect a "noisy neighbor" issue with a specific user or tenant, but the user_id field was dropped months ago to manage cardinality and control costs. You're flying blind.
This scenario has become painfully common for engineering teams. The move to modern, cloud-native architectures has caused an explosion in telemetry data. The culprit isn't just the volume; it's the variety and uniqueness of that data, a problem known as high cardinality. This challenge cripples all three pillars of traditional observability, slowing log searches to a crawl, making detailed tracing impractical, and breaking metrics systems with label explosions.
High-cardinality data refers to fields or attributes with a vast number of unique values. Think of identifiers like container_id, customer_tenant_id, request_trace_id, feature_flag_version, or commit_sha. These fields are not metrics in themselves; they are the high-value, exploratory dimensions essential for debugging complex systems.
The core of the issue is this: your observability platform is likely failing not because you have too much data, but because it’s built on the wrong architecture. This guide will break down why traditional observability tools, from time-series databases to log search indexes, are fundamentally unsuited for high-cardinality data. We'll explore the architectural limitations that cause slow queries and force painful data sampling, and then introduce a fundamentally different approach designed to handle this data natively, without the usual trade-offs.
What you'll learn in this guide: #
- The common symptoms of an observability system struggling with high-cardinality data.
- The core architectural flaws in traditional time-series databases and search indexes that lead to performance degradation.
- How a columnar database like ClickHouse is purpose-built to handle high cardinality without sacrificing speed or data fidelity.
- Practical examples and case studies of how organizations solve their most complex observability challenges.
The breaking point: symptoms of a system crippled by high-cardinality data #
Before a system fails catastrophically, it sends warning signs. For observability platforms facing a high-cardinality crisis, these symptoms manifest as a slow degradation of performance and utility, directly impacting your team's ability to resolve incidents.
Symptom 1: glacial query performance #
The most immediate and frustrating symptom is a dramatic slowdown in query performance. Queries that once returned in seconds now take minutes, and dashboards that were once snappy now fail to load entirely. During an incident, this lag is crippling. The ad-hoc exploration needed to uncover the root cause becomes impossible when every question takes five minutes to answer.
This is a classic sign of poor query performance in high-cardinality environments. As the engineering team at Sentry discovered, high-cardinality aggregations don't just slow systems down. They kill performance through memory exhaustion, long before the number of rows being scanned becomes the primary problem. The underlying indexes, designed for a world with fewer unique values, are struggling to resolve queries that span millions or even billions of unique series.
Symptom 2: dropped fields and forced data sampling #
As performance declines, teams are often forced into a painful compromise: to keep the system stable, they must discard the most valuable data. You can no longer ask, "Which specific user is affected?" because the user_id field has been dropped. Worse, you can no longer identify the impacted endpoints because the full endpoint path hasn’t been retained.
This practice of dropping fields or aggressively sampling data is a defense mechanism for platforms not built to handle cardinality. While it may reduce the immediate load, it creates critical blind spots. You sacrifice the granular detail required for effective debugging, leaving your team to navigate incidents with incomplete information.
Symptom 3: ingestion bottlenecks and data lag #
The third symptom is a system that simply can't keep up. The platform is unable to write and index the incoming high-cardinality data fast enough, leading to significant delays between when an event occurs and when it becomes queryable. These ingestion bottlenecks in your logging solution mean your "real-time" data is actually minutes, or in severe cases, hours old. During a production outage, this lag renders your observability data nearly useless for immediate triage. This is a challenge solved by companies like Didi, who migrated from Elasticsearch to ClickHouse to handle peak write traffic of over 40 GB/s across a 400-node cluster.
Under the hood: the architectural flaw in traditional observability tools #
These symptoms are not signs of a system that needs more hardware. They are the result of fundamental architectural flaws. Both time-series databases and search indexes were designed with assumptions about data shape that no longer apply in the modern cloud-native era.
The time-series database (TSDB) limitation: the label-set explosion #
Time-series databases like Prometheus are highly optimized for storing numerical data points over time. Their data model works by creating a unique time series for every distinct combination of a metric name and its key-value tags (or labels).
The problem arises from what is known as the label-set explosion. A simple metric like http_requests_total{path="/api/v1", status="200"} represents a single time series. However, if you add a high-cardinality label like user_id, the system must create a new time series for every unique user. In a system with millions of users, this causes a combinatorial explosion, creating billions of individual series.
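To make the combinatorics concrete, assume, purely for illustration, a service with 50 distinct path values, 5 status codes, and 2 million active users. Adding user_id as a label turns that single metric into 50 × 5 × 2,000,000 = 500 million distinct time series, and every additional high-cardinality label multiplies the count again.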
This leads to several problems:
- Memory exhaustion: The index that maps label sets to time series data must be held in memory. As cardinality explodes, this index consumes massive amounts of RAM, leading to performance degradation and system crashes.
- Storage inefficiency: Each new time series adds metadata and storage overhead. For high-cardinality blocks, a system like Thanos may require 6 MB to 30 MB of local disk space per block stored in object storage.
- Slow queries: Aggregating data across millions of individual time series is computationally expensive and frequently results in query timeouts, rendering dashboards and alerts unreliable. It also prevents any form of exploratory or wider trend analysis, forcing users into narrow workflows that inspect small subsets of time series at a time.
The issue is so well-known that Prometheus's own documentation explicitly warns, "Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values."
The search engine limitation: the aggregation bottleneck #
While log management platforms like the ELK Stack (Elasticsearch) are highly effective for full-text search and pinpointing specific high-cardinality values (like a single trace_id), they face significant hurdles when used for high-volume analytics and aggregations.
The challenge isn't necessarily the inverted index, but rather how search engines process analytical queries versus true columnar databases:
- Memory and concurrency bottlenecks: Search engines typically perform aggregations in the JVM heap, which should not exceed roughly 26-30 GB because compressed ordinary object pointers are no longer viable beyond that point. When attempting to group by high-cardinality fields, this memory constraint often leads to out-of-memory (OOM) errors. Furthermore, parallelization is rigidly tied to the sharding model (one core per shard). While increasing shards can improve concurrency, it introduces heavy overhead for cluster state metadata management.
- Inefficient data access: For aggregations, the engine relies on "doc values", columnar structures stored on disk. However, because these are typically sorted by document ID rather than by the value itself, they suffer from poor compression and increased I/O overhead compared to purpose-built analytical stores.
- Indexing overhead with high cardinality: Another limitation arises at write time. Elasticsearch maintains a per-field term index (the inverted index) to accelerate lookups and filters. When fields contain millions of unique values, such as user IDs, session IDs, or trace IDs, this index swells dramatically. Each distinct term adds entries to the dictionary and posting lists, increasing both on-heap and off-heap memory usage. As a result, ingestion slows due to heavier segment creation and frequent merges, while memory consumption grows as the engine caches these structures for query performance. Even with optimizations like doc values and reduced refresh intervals, high-cardinality fields impose unavoidable indexing and memory costs that constrain Elasticsearch's efficiency for analytical workloads.
A better way: solving high cardinality with a columnar architecture #
The limitations of legacy systems are not solvable with more hardware. They require a fundamentally different architectural approach. Columnar databases, such as ClickHouse, were purpose-built for large-scale analytical workloads and are structurally designed to handle high cardinality without the usual trade-offs.
Unlike traditional row-based databases that store all values for a single row together, a columnar database stores all values for a single column together. Instead of [ts, user, status], [ts, user, status], data is stored as [ts, ts, ts...], [user, user, user...], and [status, status, status...]. This simple change has profound implications for observability analytics. Crucially, ClickHouse sorts these columns based on a primary key. This sorting ensures that similar values are stored together, maximizing compression ratios and allowing the engine to skip reading vast amounts of data that aren't relevant to your query.
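As a rough sketch of what this looks like in practice (the table name, columns, and sort key below are hypothetical, not a prescribed schema), a wide events table in ClickHouse might be defined like this, with the ORDER BY clause determining how data is sorted, and therefore compressed and skipped, on disk:

```sql
-- Hypothetical wide-events table; adjust columns and the sort key to your workload
CREATE TABLE logs
(
    timestamp   DateTime64(3),
    level       LowCardinality(String),  -- few distinct values: dictionary-encoded
    service     LowCardinality(String),
    customer_id String,                  -- high cardinality: just another compressed column
    trace_id    String,
    message     String,
    -- optional skip index to speed up point lookups on a high-cardinality field
    INDEX trace_id_idx trace_id TYPE bloom_filter GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY (service, level, timestamp);
```

Because rows are physically ordered by (service, level, timestamp), each column file stores long runs of similar values, which is what drives both the compression ratios and the data skipping discussed below.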
The advantages of columnar databases like ClickHouse for observability #
- Very fast queries: Analytical queries (AVG, COUNT, GROUP BY) rarely need every column. A columnar engine only reads the specific columns required. To calculate AVG(latency_ms), ClickHouse scans only the latency_ms column, ignoring terabytes of other data. This is coupled with a vectorized query execution engine, which processes data in large batches (vectors) rather than row-by-row, maximizing CPU efficiency. Furthermore, queries are parallelized across all available CPU cores, splitting the work of reading, aggregating, and merging data to deliver sub-second query speeds.
- High cardinality is handled efficiently: This is the most crucial advantage. In ClickHouse, adding a new column with a billion unique trace_id values is a trivial operation. It's just another compressed block of data. It does not cause an index explosion or bloat a term dictionary. For specific point lookups (like finding a specific trace_id), ClickHouse utilizes lightweight secondary indexes, such as bloom filters, to locate data instantly. Unlike legacy systems limited by JVM heap sizes, ClickHouse distributes processing across nodes and spools to disk when necessary, meaning aggregations remain performant even as cardinality grows. This makes ClickHouse one of the few observability solutions specifically designed to handle high-cardinality data without dropping fields or compromising on speed.
- Excellent cost-efficiency and scale: ClickHouse builds on the above foundations with three architectural features to deliver the lowest TCO at scale.
- Industry-leading compression (10-30x is common) dramatically reduces the raw storage footprint; a short query sketch after this list shows how to check the ratio on your own tables.
- Native support for object storage (like S3) makes long-term data retention incredibly cheap. You're no longer forced into short, 14-day retention windows.
- In modern implementations like ClickHouse Cloud, the separation of storage and compute allows you to scale your query resources up or down independently from your storage, preventing ingestion workloads from interfering with analytics and optimizing costs. This combination is what allows teams to reframe the conversation from 'what can we afford to keep?' to 'why not send everything?'.
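As a sanity check rather than a promise, you can measure compression on your own data. The sketch below (assuming the hypothetical logs table from earlier) compares compressed and uncompressed sizes using ClickHouse's system.parts table:

```sql
-- Compare on-disk (compressed) size with raw (uncompressed) size per table
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active AND table = 'logs'
GROUP BY table;
```

Actual ratios depend on your data shape and sort key, but this is how the 10-30x figures above can be verified rather than taken on faith.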
ClickHouse in practice: retaining 100% of your data #
The theoretical advantages of a columnar architecture become tangible when applied to real-world observability challenges. Companies operating at massive scale have turned to ClickHouse after hitting the architectural limits of traditional tools.
Case study: how monday.com achieved "almost limitless observability" #
The team at monday.com faced a common challenge: their existing observability platform couldn't handle the scale and cardinality of their event data. They chose ClickHouse to build a new solution that was "efficient, cost-effective, and scalable for handling high-cardinality events and metrics."
Today, "ClickHouse powers monday.com's high-cardinality observability platform with cost-efficient, and fast analytics". Their success demonstrates that retaining high-fidelity data is not only possible but practical with the right architecture.
Case study: how Tesla built a quadrillion-scale observability platform #
Tesla, operating at an immense scale, found that traditional time-series databases could not handle their needs. They built their internal observability platform on ClickHouse to manage tens of billions of unique time series.
Their experience led them to a powerful conclusion: "No other system lets you slice and dice your data, ask interesting questions, and get answers in an acceptable amount of time. There's nothing out there that competes with ClickHouse".
Interactive debugging with raw logs #
With ClickHouse, you are no longer constrained by a vendor's UI or a restrictive query language. You can retain 100% of your raw, unsampled log data and run powerful SQL queries to answer novel questions in seconds. Imagine your current logging solution times out when you try to group errors by customer. With ClickHouse, a simple SQL query can instantly identify which customers are experiencing the most errors:
```sql
SELECT
    customer_id,
    count(*) AS errors
FROM logs
WHERE level = 'error'
  AND timestamp >= now() - interval 1 hour
GROUP BY customer_id
ORDER BY errors DESC
LIMIT 10;
```
This query runs in seconds across petabytes of data because ClickHouse only needs to scan the customer_id, level, and timestamp columns, skipping over all other log attributes. Furthermore, because ClickHouse sorts data by a primary key (in this case, timestamp and level), it utilizes a sparse primary index to skip reading massive ranges of data that don't match your criteria. This is the power of a columnar approach in action.
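Exploration rarely stops at the first answer. As a follow-up sketch (assuming hypothetical endpoint and error_code columns, and a customer surfaced by the previous query), you can drill straight into the worst-affected customer and break its errors down by endpoint:

```sql
-- Drill into a single customer surfaced by the previous query
SELECT
    endpoint,
    error_code,
    count(*) AS errors
FROM logs
WHERE level = 'error'
  AND customer_id = 'acme-corp'  -- hypothetical value taken from the previous result
  AND timestamp >= now() - interval 1 hour
GROUP BY endpoint, error_code
ORDER BY errors DESC
LIMIT 20;
```

Because endpoint and error_code are just additional columns, this kind of ad-hoc pivot needs no pre-aggregation, no new index, and no re-ingestion of data.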
Comparing approaches to high cardinality data #
The choice of data store has a profound impact on an observability platform's ability to handle high-cardinality data. Here’s a comparison of the different architectural approaches:
| Approach | Traditional TSDB (e.g., Prometheus) | Search Engine (e.g., Elasticsearch) | ClickHouse (Columnar) |
|---|---|---|---|
| Core architecture | Metric + Label-Set Index | Document Store + Inverted Index | Column-Oriented Storage |
| High-cardinality handling | Struggles severely. The "label-set explosion" leads to memory exhaustion and slow queries. | Effective for lookups, but struggles with aggregations, as JVM heap limits and the indexing overhead from high-cardinality terms both contribute to slower writes and frequent OOM errors when grouping over large unique value sets. | Native. Aggregations scale linearly with cardinality, handling cardinality orders of magnitude higher than other systems. |
| Query performance | Fast for pre-defined, low-cardinality metrics. Very slow for high-cardinality aggregations. | Fast for text search. Slow for analytics due to limited concurrency (one core per shard). | Extremely fast for ad-hoc analytical queries and aggregations on raw, high-cardinality data. |
| Data compression | Moderate. | Low to moderate. Sorting by Document ID rather than value limits compression efficiency. | Very high (10-30x is common), significantly reducing storage footprint. |
| Typical compromise | Drop labels, sample data, pre-aggregate metrics. | Heavy indexing, expensive hardware, short retention periods. | None required. Retain 100% of unsampled, high-fidelity data. |
From database to solution: the ClickStack advantage #
A powerful database is the necessary foundation, but a complete observability solution requires more. This is where the ClickHouse Observability Stack, or ClickStack, comes in. It packages the columnar engine into an open-source, OTel-native platform that teams can deploy in minutes.
It includes components for data ingestion and a purpose-built UI (HyperDX) for user-friendly searching and dashboarding, all on top of the performant ClickHouse core. By providing an opinionated yet flexible solution, ClickStack lowers the barrier to entry, making a highly scalable and cost-efficient observability solution accessible to teams of any size without the heavy lift of a DIY project.
Conclusion: stop compromising on your data #
The challenges of slow queries, dropped fields, and ingestion bottlenecks are not a scale problem. They are an architectural problem. Time-series databases and search indexes, while powerful for their original use cases, are fundamentally the wrong tools for modern, high-granularity observability data.
Columnar databases like ClickHouse were purpose-built for this kind of analytical workload. By storing data in columns, using vectorized execution, and applying aggressive compression, ClickHouse treats high cardinality as a feature, not a bug.
By embracing a columnar approach, SREs and developers can finally stop making painful compromises. You can keep all your data, ask any question you can imagine, and get answers in seconds. This transforms observability from a reactive, frustrating chore into a powerful, proactive tool for understanding and improving your systems.