The "three pillars of observability" framework emerged to address the complexity of distributed systems. However, this led to a fractured ecosystem of specialized tools, each optimizing for a single data type: Prometheus for metrics, the ELK stack for logs, and Jaeger for traces. While this "best-of-breed" approach seemed cost-effective initially, it created a deep architectural flaw: data silos. This model became a primary source of friction, trapping engineers between three difficult choices: a piecemeal and complex open-source stack, an expensive proprietary SaaS platform, or a resource-intensive DIY effort.
The user pain: disjointed analysis and runaway costs
The fundamental flaw of the three-pillars model is that it forces users to correlate data manually. During an incident, this fragmentation necessitates a slow workflow. An engineer receives an alert in Grafana, pivots to Kibana to hunt for related logs, then jumps to Jaeger to find a connecting trace. Each pivot is a context switch across different UIs and query languages, increasing Mean Time to Resolution (MTTR).
Modern tools attempt to mask this with UI-level linking, but this is a superficial fix. Because the data lives in separate databases, you cannot perform deep analytical queries across signals. You cannot write one query that joins metric spikes with log attributes and trace durations. This rigidity prevents the investigation of the complex "unknown-unknowns" that define modern outages. Furthermore, the architecture is inefficient. Essential context like a trace_id or customer_id is duplicated across all three data stores, inflating storage costs. Beyond storage, scaling, securing, and maintaining multiple distinct, stateful systems imposes a significant operational tax.
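For contrast, here is a minimal sketch of the kind of cross-signal query a unified store makes possible, assuming hypothetical otel_logs and otel_traces tables that share a TraceId column:

```sql
-- Hedged sketch: which services produce traces with error logs, and how slow are they?
-- Table and column names are assumptions for illustration.
SELECT
    t.ServiceName,
    uniqExact(t.TraceId)       AS traces_with_error_logs,
    quantile(0.95)(t.Duration) AS p95_span_duration_ns
FROM otel_traces AS t
INNER JOIN otel_logs AS l ON l.TraceId = t.TraceId
WHERE l.SeverityText = 'ERROR'
  AND t.Timestamp >= now() - INTERVAL 1 HOUR
GROUP BY t.ServiceName
ORDER BY traces_with_error_logs DESC;
```

In a siloed architecture, answering the same question requires exporting data from three systems or eyeballing dashboards side by side.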
While a unified model makes sense for most systems, a best-of-breed approach can be justified in hyperscale environments where a single dominant signal produces multiple gigabytes per second. In these cases, it can be financially rational to optimize specifically for that signal. For the vast majority of workloads, however, a unified model delivers better efficiency, simplicity, and analytical leverage.
The symptoms of a struggling observability platform include slow queries, forced data sampling, and ingestion lag. These issues result from architectural mismatches. The data engines powering most traditional observability tools (time-series databases and search indexes) were designed for specific, narrower use cases and are structurally unsuited for the high-dimensionality and analytical demands of modern systems.
Table 1: Database architectures compared
| Database type | High-cardinality handling | Analytical query performance | Operational complexity at scale | Insert performance |
|---|---|---|---|---|
| Log search engine (e.g., Elasticsearch) | Poor. Aggregations on high-cardinality fields create heavy memory pressure. The engine scales well for search, but deep analytics at high cardinality hit JVM heap limits. | Poor. Optimized for text search ("needle-in-a-haystack"), not for fast, complex aggregations. | High. Requires managing heavy indices and shard rebalancing as data volume grows. | Low. Inverted indices produce significant write amplification and require ingest-time string tokenization and processing. |
| Time-series database (e.g., Prometheus) | Poor. Granular labels cause a "combinatorial explosion," leading to system instability and memory exhaustion. | Poor. Efficient for pre-defined numerical metrics but struggles with ad-hoc analysis on raw event data. | Moderate. Forces teams to drop valuable context (labels) to maintain stability. | Moderate. Fast write performance is possible provided cardinality stays low. |
| Columnar analytical database (e.g., ClickHouse) | Excellent. Designed to treat high-cardinality data as a simple feature, storing it efficiently without performance degradation. | Excellent. Engineered for sub-second aggregations and deep analysis on raw, unsampled data at petabyte scale. | Moderate. Requires expertise in distributed database tuning and schema design. | High. Data must be sorted by the ordering key and compressed at insert time, but this is efficient enough to deliver GB/s insert rates on moderate hardware. |
Time-series databases like Prometheus are architecturally mismatched for modern observability due to the "cardinality" challenge. Their data model creates a unique time series for every distinct combination of a metric name and its key-value labels. In high-cardinality scenarios, such as tracking metrics per container in large Kubernetes clusters or tracking individual user sessions, a single metric can expand into millions of time series. This creates massive memory pressure and requires careful label management to prevent system instability.
While highly performant for fetching pre-defined metrics for a single time series, this model fails when performing the exploratory aggregations across many series required for debugging. This forces teams into a trade-off: to prevent system collapse, they must drop the labels that provide context. By removing identifiers like user_id or trace_id, you sacrifice the ability to perform deep, ad-hoc analysis.
While per-sample processing at insert time is typically minimal, allowing fast ingestion rates, high cardinality severely degrades insert performance: each new label set creates a new in-memory time series, index entries, and write-ahead log (WAL) writes, driving up memory usage and write amplification. As series churn grows, ingestion slows, the WAL and head block balloon in size, and the system becomes increasingly unstable even before queries run.
Log management platforms like Elasticsearch are masters of full-text search, built around an inverted index that excels at finding specific text strings. However, this architecture creates an analytics gap when tasked with the complex aggregations essential for modern observability. Queries like GROUP BY or COUNT DISTINCT force these search engines to rely on secondary structures called "doc values."
This creates a performance ceiling. While Elasticsearch can perform aggregations, it is constrained by JVM memory management and the computational overhead of maintaining inverted indexes for analytical workloads. Using a log search engine for deep analytics is an architectural compromise that becomes inefficient as data volumes grow. Companies like Didi migrated their logging platforms from Elasticsearch to ClickHouse to overcome these specific scaling and cost limitations.
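As a concrete illustration, the query below shows the kind of GROUP BY / COUNT DISTINCT aggregation over raw, unsampled logs that strains an inverted-index store but is routine for a columnar engine. It is a hedged sketch against a hypothetical otel_logs table; the column names are assumptions.

```sql
-- Illustrative analytics over raw logs: per-service volume, distinct users, and errors.
SELECT
    ServiceName,
    count()                         AS events,
    uniqExact(UserId)               AS distinct_users,   -- high-cardinality aggregation
    countIf(SeverityText = 'ERROR') AS errors
FROM otel_logs
WHERE Timestamp >= now() - INTERVAL 24 HOUR
GROUP BY ServiceName
ORDER BY errors DESC;
```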
Additionally, search engines like Elasticsearch struggle with insert performance because every document must be tokenized, analyzed, and written into multiple inverted-index structures. This string processing, sorting, and index maintenance amplifies write cost as data volume and field cardinality grow, forcing users to scale compute horizontally to handle even moderate insert workloads.
What is the new standard for observability?
The new standard redefines observability as the ability to explore "unknown unknowns": emergent, unpredictable failure modes you didn't anticipate and therefore couldn't pre-emptively monitor. It moves beyond the siloed "three pillars" model by unifying metrics, logs, and traces into a single, high-performance data store. This approach ingests raw, high-fidelity event data, treating observability as a data analytics problem rather than a storage challenge.
Implementing this standard requires three components: standardized collection (e.g., OpenTelemetry), high-performance storage (e.g., ClickHouse), and visualization (e.g., Grafana, HyperDX). Open-source stacks demonstrate this unified approach in practice. By consolidating all telemetry into a single unified data store, teams can correlate signals without complex joins or context switching.
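As a minimal sketch of what "no context switching" means in practice, the query below finds the slowest span in the last hour and then pulls every log line for that trace, all from the same SQL console. The otel_traces and otel_logs tables and their shared TraceId column are assumptions for illustration.

```sql
-- Hedged sketch: correlate traces and logs in a single store, no UI pivots.
WITH slowest AS (
    SELECT TraceId
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 1 HOUR
    ORDER BY Duration DESC
    LIMIT 1
)
SELECT Timestamp, ServiceName, SeverityText, Body
FROM otel_logs
WHERE TraceId IN (SELECT TraceId FROM slowest)
ORDER BY Timestamp;
```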
The shift from monitoring to observability redefines the questions we can ask. Monitoring focuses on "known unknowns" and uses dashboards to track pre-defined metrics like CPU utilization or error rates. These are vital for tracking expected behavior.
Observability, conversely, is the ability to ask novel, exploratory questions of your system's telemetry. It is the difference between asking "What is our average error rate?" and "For users on the new mobile client in Germany who experienced latency over 500ms, which feature flag was enabled?"
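That second question maps directly onto an analytical query. The sketch below is illustrative only, assuming a hypothetical otel_traces table with attribute maps and span durations stored in nanoseconds.

```sql
-- Hedged sketch of the exploratory question above; names and attributes are assumptions.
SELECT
    SpanAttributes['feature_flag'] AS feature_flag,
    count()                        AS slow_requests
FROM otel_traces
WHERE ResourceAttributes['client']  = 'mobile-v2'   -- "new mobile client"
  AND ResourceAttributes['country'] = 'DE'
  AND Duration > 500 * 1000000                      -- 500 ms expressed in nanoseconds
GROUP BY feature_flag
ORDER BY slow_requests DESC;
```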
When debugging a production incident, you need to answer questions like "Which specific users experienced errors?" But if you've aggregated away user_id to save costs, that question becomes impossible to answer. Aggregated data reveals symptoms (the what), but answering the who, where, and why requires raw identifiers. Dropping these fields to manage costs leaves you with alerts but no path to a solution.
This context lives in high-cardinality data: attributes with a vast number of unique values, like request_id or commit_sha. This is why granular data is non-negotiable: retaining 100% of your data, unsampled, is what makes it possible to diagnose the cause rather than just the symptom.
Table 2: Traditional vs. unified observability models
| Aspect | Traditional siloed model ("three pillars") | Modern unified analytics model |
|---|---|---|
| Core concept | Data collection into separate, specialized systems (logs, metrics, traces). | Unifying all telemetry into a single, high-performance analytics engine. |
| Data | Aggregated, sampled, or low-cardinality data to manage cost and performance. | Raw, unsampled, granular event data is retained for full context. |
| Primary goal | Monitoring "known-unknowns" with pre-defined dashboards. | Exploring "unknown-unknowns" with ad-hoc, exploratory queries. |
| User workflow | Manual correlation across multiple UIs. | Single-platform querying using SQL. |
| Cost & complexity | High TCO from data duplication across systems and operational overhead of multiple platforms. | Lower TCO from unified infrastructure and efficient compression, though requires database expertise. |
How does ClickHouse deliver on the new standard?
Delivering on this new standard imposes clear technical requirements on the underlying data store: real-time queries on raw data, support for high insertion rates, unlimited cardinality, and cost-efficient storage. The choice of database therefore becomes a critical architectural decision. The engine dictates whether you can actually deliver on the new standard or merely approximate it.
This requires a columnar database architecture. Unlike row-based systems that store all data for an event together, columnar databases group values by column: all timestamps together, all service names together. This structure, combined with vectorized query execution, yields significant advantages. First, it facilitates highly efficient compression because columns contain uniform data types. Second, analytical queries that only need a few columns (e.g., service_name, status_code) scan only relevant data, reducing I/O.
During an outage, slow queries impede resolution. ClickHouse delivers sub-second query performance across petabytes of data. The key is the vectorized query engine, which operates on batches of column values rather than rows. This leverages SIMD (Single Instruction, Multiple Data) CPU instructions, which process multiple data points in parallel. This transforms debugging from a process measured in minutes to one measured in milliseconds, enabling interactive exploration.
Observability systems must sustain extremely high insert rates because machine-generated telemetry such as logs, traces, and events arrives as massive, continuous throughput. Handling this efficiently requires a clear separation between read and write paths, and, more importantly, a write path that minimizes compute overhead. ClickHouse is designed for this model: writes are append-oriented, and at insert time data is only sorted into blocks by the ordering key, avoiding expensive per-row processing. Skip indices are cheap to build and are evaluated at query time rather than insert time, which enables ClickHouse to ingest data at gigabytes per second on moderate hardware without excessive CPU consumption.
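As a hedged illustration of this write path, the snippet below batches small inserts on the server side using ClickHouse's async_insert setting; the otel_logs table and its columns are assumptions for illustration.

```sql
-- Sketch only: server-side batching for high-frequency, small inserts.
SET async_insert = 1;           -- buffer and batch small inserts on the server
SET wait_for_async_insert = 0;  -- acknowledge once the buffer accepts the rows

-- Hypothetical table and columns.
INSERT INTO otel_logs (Timestamp, ServiceName, SeverityText, Body, TraceId)
VALUES (now(), 'checkout', 'ERROR', 'payment declined', 'abc123');
```

Large client-side batches achieve the same effect; the point is that rows are appended and merged in the background rather than processed one at a time.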
In traditional systems, fields like trace_id are treated as liabilities. ClickHouse changes this dynamic. Adding a field with a billion unique values is a standard operation; it is simply another compressed file on disk. This layout avoids bloating a central index.
ClickHouse uses a sparse primary index that holds "checkpoints" for blocks of data rather than mapping every row. This allows the system to quickly skip large, irrelevant blocks of data. For fast point lookups on specific identifiers, ClickHouse utilizes Bloom filters to avoid full column scans. Consequently, dimensions once dropped for performance reasons are now available for any query. Shopee demonstrated this capability by searching specific trace IDs in a dataset of over 30 billion rows in seconds.
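A minimal, illustrative table definition shows both mechanisms: the ORDER BY key drives the sparse primary index, and a Bloom-filter skip index accelerates point lookups on a high-cardinality TraceId. The schema and names are assumptions, not the canonical OpenTelemetry schema.

```sql
-- Illustrative table: sparse primary index via ORDER BY, Bloom filter for point lookups.
CREATE TABLE otel_traces
(
    Timestamp    DateTime64(9),
    ServiceName  LowCardinality(String),
    SpanName     LowCardinality(String),
    TraceId      String,
    Duration     UInt64,
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp));

-- Point lookup: the Bloom filter skips granules that cannot contain this TraceId.
SELECT * FROM otel_traces WHERE TraceId = 'abc123';
```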
ClickHouse's columnar architecture applies type-specific codecs to achieve compression ratios averaging 15-20x depending on data characteristics. This efficiency is enhanced by native support for tiered storage, allowing older data to move seamlessly to lower-cost object storage like S3 while remaining queryable. This decoupling of storage and compute allows teams to retain data for compliance and historical analysis at a fraction of the cost of index-heavy systems at scale.
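A hedged sketch of how this looks in practice: type-specific codecs are declared per column, and a TTL move rule shifts parts older than 30 days to an S3-backed volume. It assumes a storage policy named 'hot_and_cold' with a volume called 'cold' has already been configured on the cluster.

```sql
-- Sketch only: per-column codecs plus tiered storage via a TTL move rule.
CREATE TABLE otel_logs_tiered
(
    Timestamp   DateTime64(9) CODEC(Delta, ZSTD),
    ServiceName LowCardinality(String),
    Body        String CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY (ServiceName, Timestamp)
TTL toDateTime(Timestamp) + INTERVAL 30 DAY TO VOLUME 'cold'  -- assumed S3-backed volume
SETTINGS storage_policy = 'hot_and_cold';                     -- assumed policy name
```

Data moved to the cold volume remains queryable through the same table, so long-retention compliance queries need no separate pipeline.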
Considerations: trade-offs and migration
While a unified observability platform is the ideal end-state, migration does not require an all-or-nothing approach. Many teams find success by adopting ClickHouse in a "best-of-breed" capacity first, for example replacing an expensive logging solution like Elasticsearch while keeping their existing metrics and tracing pipelines. This phased approach allows teams to realize immediate cost and performance benefits for their most painful data silo before unifying other signals over time.
While ClickHouse offers superior analytics capabilities, it is not a drop-in replacement for every use case without consideration.
- Migration Path: Moving from ELK or Prometheus involves changing the storage backend. Teams typically migrate incrementally: first routing new trace data to ClickHouse via OpenTelemetry collectors while keeping existing Jaeger/Tempo in place, then migrating logs and metrics once they have validated that query patterns meet SLA requirements, often over three to six months.
- Operational Complexity: ClickHouse requires careful tuning of partition keys, ordering keys, and materialized views for optimal performance. Teams need expertise in distributed database operations, including managing replication, sharding strategies, and cluster topology.
- Learning Curve: Unlike UI-driven tools that abstract the query layer, leveraging the full power of ClickHouse requires proficiency in SQL. While this offers greater flexibility, it may require training for teams used to query languages like PromQL or Lucene. This can be mitigated by open-source tooling such as HyperDX, which provides a rich visualization and exploration layer on top of ClickHouse as part of ClickStack. It offers Lucene-like query semantics that are transparently transpiled into efficient SQL, giving users a familiar search experience (a sketch follows this list).
- Alternative Options: For teams with under 100GB daily telemetry and simple alerting needs, managed services like Datadog may offer faster time-to-value despite higher per-GB costs. Additionally, if the primary use case is purely text search without aggregation, inverted-index systems like Elasticsearch remain highly effective.
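To make the transpilation idea in the learning-curve point concrete, a Lucene-style search such as `service:checkout AND level:error AND "payment declined"` roughly corresponds to the SQL below. This is illustrative only; the exact SQL HyperDX generates may differ, and otel_logs is a hypothetical table.

```sql
-- Rough SQL equivalent of a Lucene-style log search (illustrative).
SELECT Timestamp, ServiceName, SeverityText, Body
FROM otel_logs
WHERE ServiceName = 'checkout'
  AND SeverityText = 'ERROR'
  AND Body ILIKE '%payment declined%'
ORDER BY Timestamp DESC
LIMIT 100;
```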
Proof at scale: The observability architectures of OpenAI, Anthropic, and Tesla
The benefits of this architecture are proven by its adoption in the most data-intensive environments where companies process petabytes of observability data daily. OpenAI rebuilt their observability platform on ClickHouse to achieve sub-second query performance on massive data throughput. Anthropic, developers of Claude, noted that with ClickHouse, "queries are lightning-fast, and money is not on fire as much," referring to the significant cost reductions compared to their previous solution. Tesla validated this scale, choosing ClickHouse for their metrics platform because "Data in ClickHouse is better than data anywhere else. No other system lets you slice and dice your data... There's nothing out there that competes with ClickHouse."
These considerations aside, the fundamental shift in observability remains: it is now an analytics problem that requires an analytics-grade solution. The three-pillars model is obsolete. In 2026, observability is a unified analytics discipline. This shift makes the underlying database the critical architectural decision. An engine not built for analytics will inevitably struggle with the scale and complexity of modern systems.
ClickHouse provides a foundation engineered for high-performance analytics on granular, raw event data. It delivers the speed to explore "unknown unknowns" interactively and the cost-efficiency to store all data. As system complexity grows and observability costs compound, the limitations of siloed architectures become untenable. The question for most teams isn't whether to adopt an analytics-first approach. Instead, the question is how quickly they can migrate before costs and capability gaps become critical. Ready to build on the right foundation? Get started with ClickHouse Cloud today.