The "three pillars of observability" framework emerged to address the complexity of distributed systems. However, this led to a fractured ecosystem of specialized tools, each optimizing for a single data type: Prometheus for metrics, the ELK stack for logs, and Jaeger for traces. While this "best-of-breed" approach seemed cost-effective initially, it created a deep architectural flaw: data silos. This model became a primary source of friction, trapping engineers between three difficult choices: a piecemeal and complex open-source stack, an expensive proprietary SaaS platform, or a resource-intensive DIY effort.
The user pain: disjointed analysis and runaway costs
The fundamental flaw of the three-pillars model is that it forces users to correlate data manually. During an incident, this fragmentation necessitates a slow workflow. An engineer receives an alert in Grafana, pivots to Kibana to hunt for related logs, then jumps to Jaeger to find a connecting trace. Each pivot is a context switch across different UIs and query languages, increasing Mean Time to Resolution (MTTR).
Modern tools attempt to mask this with UI-level linking, but this is a superficial fix. Because the data lives in separate databases, you cannot perform deep analytical queries across signals. You cannot write one query that joins metric spikes with log attributes and trace durations. This rigidity prevents the investigation of the complex "unknown-unknowns" that define modern outages. Furthermore, the architecture is inefficient. Essential context like a trace_id or customer_id is duplicated across all three data stores, inflating storage costs. Beyond storage, scaling, securing, and maintaining multiple distinct, stateful systems imposes a significant operational tax.
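For contrast, here is a minimal sketch of the kind of cross-signal query a unified store makes possible, assuming hypothetical otel_logs and otel_traces tables that share a TraceId column:

```sql
-- Hedged sketch: which services produce traces with error logs, and how slow are they?
-- Table and column names are assumptions for illustration.
SELECT
    t.ServiceName,
    uniqExact(t.TraceId)       AS traces_with_error_logs,
    quantile(0.95)(t.Duration) AS p95_span_duration_ns
FROM otel_traces AS t
INNER JOIN otel_logs AS l ON l.TraceId = t.TraceId
WHERE l.SeverityText = 'ERROR'
  AND t.Timestamp >= now() - INTERVAL 1 HOUR
GROUP BY t.ServiceName
ORDER BY traces_with_error_logs DESC;
```

In a siloed architecture, answering the same question requires exporting data from three systems or eyeballing dashboards side by side.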
While a unified model makes sense for most systems, a best-of-breed approach can be justified in hyperscale environments where a single dominant signal produces multiple gigabytes per second. In these cases, it can be financially rational to optimize specifically for that signal. For the vast majority of workloads, however, a unified model delivers better efficiency, simplicity, and analytical leverage.
The symptoms of a struggling observability platform include slow queries, forced data sampling, and ingestion lag. These issues result from architectural mismatches. The data engines powering most traditional observability tools (time-series databases and search indexes) were designed for specific, narrower use cases and are structurally unsuited for the high-dimensionality and analytical demands of modern systems.
Table 1: Database architectures compared
| Database type | High-cardinality handling | Analytical query performance | Operational complexity at scale | Insert performance |
|---|---|---|---|---|
| Log search engine (e.g., Elasticsearch) | Poor. Aggregations on high-cardinality fields create heavy memory pressure. The engine scales well for search, but deep analytics at high cardinality hit JVM heap limits. | Poor. Optimized for text search ("needle-in-a-haystack"), not for fast, complex aggregations. | High. Requires managing heavy indices and shard rebalancing as data volume grows. | Low. Inverted indices produce significant write amplification and require ingest-time string tokenization and processing. |
| Time-series database (e.g., Prometheus) | Poor. Granular labels cause a "combinatorial explosion," leading to system instability and memory exhaustion. | Poor. Efficient for pre-defined numerical metrics but struggles with ad-hoc analysis on raw event data. | Moderate. Forces teams to drop valuable context (labels) to maintain stability. | Moderate. Fast write performance is possible provided cardinality stays low. |
| Columnar analytical database (e.g., ClickHouse) | Excellent. Designed to treat high-cardinality data as a simple feature, storing it efficiently without performance degradation. | Excellent. Engineered for sub-second aggregations and deep analysis on raw, unsampled data at petabyte scale. | Moderate. Requires expertise in distributed database tuning and schema design. | High. Data must be sorted by the ordering key and compressed at insert time, but this is efficient enough to deliver GB/s insert rates on moderate hardware. |
Time-series databases like Prometheus are architecturally mismatched for modern observability due to the "cardinality" challenge. Their data model creates a unique time series for every distinct combination of a metric name and its key-value labels. In high-cardinality scenarios, such as tracking metrics per container in large Kubernetes clusters or tracking individual user sessions, a single metric can expand into millions of time series. This creates massive memory pressure and requires careful label management to prevent system instability.
While highly performant for fetching pre-defined metrics for a single time series, this model fails when performing the exploratory aggregations across many series required for debugging. This forces teams into a trade-off: to prevent system collapse, they must drop the labels that provide context. By removing identifiers like user_id or trace_id, you sacrifice the ability to perform deep, ad-hoc analysis.
While per-sample processing at insert time is typically minimal, allowing fast ingestion rates, high cardinality severely degrades insert performance: each new label set creates a new in-memory time series, index entries, and write-ahead log (WAL) writes, driving up memory usage and write amplification. As series churn grows, ingestion slows, the WAL and head block balloon in size, and the system becomes increasingly unstable even before queries run.
Log management platforms like Elasticsearch are masters of full-text search, built around an inverted index that excels at finding specific text strings. However, this architecture creates an analytics gap when tasked with the complex aggregations essential for modern observability. Queries like GROUP BY or COUNT DISTINCT force these search engines to rely on secondary structures called "doc values."
This creates a performance ceiling. While Elasticsearch can perform aggregations, it is constrained by JVM memory management and the computational overhead of maintaining inverted indexes for analytical workloads. Using a log search engine for deep analytics is an architectural compromise that becomes inefficient as data volumes grow. Companies like Didi migrated their logging platforms from Elasticsearch to ClickHouse to overcome these specific scaling and cost limitations.
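As a concrete illustration, the query below shows the kind of GROUP BY / COUNT DISTINCT aggregation over raw, unsampled logs that strains an inverted-index store but is routine for a columnar engine. It is a hedged sketch against a hypothetical otel_logs table; the column names are assumptions.

```sql
-- Illustrative analytics over raw logs: per-service volume, distinct users, and errors.
SELECT
    ServiceName,
    count()                         AS events,
    uniqExact(UserId)               AS distinct_users,   -- high-cardinality aggregation
    countIf(SeverityText = 'ERROR') AS errors
FROM otel_logs
WHERE Timestamp >= now() - INTERVAL 24 HOUR
GROUP BY ServiceName
ORDER BY errors DESC;
```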
Additionally, search engines like Elasticsearch struggle with insert performance because every document must be tokenized, analyzed, and written into multiple inverted-index structures. This string processing, sorting, and index maintenance amplifies write cost as data volume and field cardinality grow, forcing users to scale compute horizontally to handle even moderate insert workloads.
What is the new standard for observability?
The new standard redefines observability as the ability to explore "unknown unknowns": emergent, unpredictable failure modes you didn't anticipate and therefore couldn't pre-emptively monitor. It moves beyond the siloed "three pillars" model by unifying metrics, logs, and traces into a single, high-performance data store. This approach ingests raw, high-fidelity event data, treating observability as a data analytics problem rather than a storage challenge.
Implementing this standard requires three components: standardized collection (e.g., OpenTelemetry), high-performance storage (e.g., ClickHouse), and visualization (e.g., Grafana, HyperDX). Open-source stacks demonstrate this unified approach in practice. By consolidating all telemetry into a single unified data store, teams can correlate signals without complex joins or context switching.
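As a minimal sketch of what "no context switching" means in practice, the query below finds the slowest span in the last hour and then pulls every log line for that trace, all from the same SQL console. The otel_traces and otel_logs tables and their shared TraceId column are assumptions for illustration.

```sql
-- Hedged sketch: correlate traces and logs in a single store, no UI pivots.
WITH slowest AS (
    SELECT TraceId
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 1 HOUR
    ORDER BY Duration DESC
    LIMIT 1
)
SELECT Timestamp, ServiceName, SeverityText, Body
FROM otel_logs
WHERE TraceId IN (SELECT TraceId FROM slowest)
ORDER BY Timestamp;
```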
The shift from monitoring to observability redefines the questions we can ask. Monitoring focuses on "known unknowns" and uses dashboards to track pre-defined metrics like CPU utilization or error rates. These are vital for tracking expected behavior.
Observability, conversely, is the ability to ask novel, exploratory questions of your system's telemetry. It is the difference between asking "What is our average error rate?" and "For users on the new mobile client in Germany who experienced latency over 500ms, which feature flag was enabled?"
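That second question maps directly onto an analytical query. The sketch below is illustrative only, assuming a hypothetical otel_traces table with attribute maps and span durations stored in nanoseconds.

```sql
-- Hedged sketch of the exploratory question above; names and attributes are assumptions.
SELECT
    SpanAttributes['feature_flag'] AS feature_flag,
    count()                        AS slow_requests
FROM otel_traces
WHERE ResourceAttributes['client']  = 'mobile-v2'   -- "new mobile client"
  AND ResourceAttributes['country'] = 'DE'
  AND Duration > 500 * 1000000                      -- 500 ms expressed in nanoseconds
GROUP BY feature_flag
ORDER BY slow_requests DESC;
```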
When debugging a production incident, you need to answer questions like "Which specific users experienced errors?" But if you've aggregated away user_id to save costs, that question becomes impossible to answer. Aggregated data reveals symptoms (the what), but answering the who, where, and why requires raw identifiers. Dropping these fields to manage costs leaves you with alerts but no path to a solution.
This context lives in high-cardinality data: attributes with a vast number of unique values, like request_id or commit_sha. This is why granular data is non-negotiable: retaining 100% of your data, unsampled, is what makes it possible to diagnose the cause rather than just the symptom.
Table 2: Traditional vs. unified observability models
| Aspect | Traditional siloed model ("three pillars") | Modern unified analytics model |
|---|---|---|
| Core concept | Data collection into separate, specialized systems (logs, metrics, traces). | Unifying all telemetry into a single, high-performance analytics engine. |
| Data | Aggregated, sampled, or low-cardinality data to manage cost and performance. | Raw, unsampled, granular event data is retained for full context. |
| Primary goal | Monitoring "known-unknowns" with pre-defined dashboards. | Exploring "unknown-unknowns" with ad-hoc, exploratory queries. |
| User workflow | Manual correlation across multiple UIs. | Single-platform querying using SQL. |
| Cost & complexity | High TCO from data duplication across systems and operational overhead of multiple platforms. | Lower TCO from unified infrastructure and efficient compression, though requires database expertise. |
How does ClickHouse deliver on the new standard?
Delivering on this new standard imposes clear technical requirements on the underlying data store: real-time queries on raw data, support for high insertion rates, unlimited cardinality, and cost-efficient storage. The choice of database therefore becomes a critical architectural decision. The engine dictates whether you can actually deliver on the new standard or merely approximate it.
This requires a columnar database architecture. Unlike row-based systems that store all data for an event together, columnar databases group values by column: all timestamps together, all service names together. This structure, combined with vectorized query execution, yields significant advantages. First, it facilitates highly efficient compression because columns contain uniform data types. Second, analytical queries that only need a few columns (e.g., service_name, status_code) scan only relevant data, reducing I/O.
During an outage, slow queries impede resolution. ClickHouse delivers sub-second query performance across petabytes of data. The key is the vectorized query engine, which operates on batches of column values rather than rows. This leverages SIMD (Single Instruction, Multiple Data) CPU instructions, which process multiple data points in parallel. This transforms debugging from a process measured in minutes to one measured in milliseconds, enabling interactive exploration.
Observability systems must sustain extremely high insert rates because machine-generated telemetry such as logs, traces, and events arrives as massive, continuous throughput. Handling this efficiently requires a clear separation between read and write paths, and, more importantly, a write path that minimizes compute overhead. ClickHouse is designed for this model: writes are append-oriented, and at insert time data is only sorted into blocks by the ordering key, avoiding expensive per-row processing. Skip indices are cheap to build and are evaluated at query time rather than insert time, which enables ClickHouse to ingest data at gigabytes per second on moderate hardware without excessive CPU consumption.
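As a hedged illustration of this write path, the snippet below batches small inserts on the server side using ClickHouse's async_insert setting; the otel_logs table and its columns are assumptions for illustration.

```sql
-- Sketch only: server-side batching for high-frequency, small inserts.
SET async_insert = 1;           -- buffer and batch small inserts on the server
SET wait_for_async_insert = 0;  -- acknowledge once the buffer accepts the rows

-- Hypothetical table and columns.
INSERT INTO otel_logs (Timestamp, ServiceName, SeverityText, Body, TraceId)
VALUES (now(), 'checkout', 'ERROR', 'payment declined', 'abc123');
```

Large client-side batches achieve the same effect; the point is that rows are appended and merged in the background rather than processed one at a time.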
In traditional systems, fields like trace_id are treated as liabilities. ClickHouse changes this dynamic. Adding a field with a billion unique values is a standard operation; it is simply another compressed file on disk. This layout avoids bloating a central index.
ClickHouse uses a sparse primary index that holds "checkpoints" for blocks of data rather than mapping every row. This allows the system to quickly skip large, irrelevant blocks of data. For fast point lookups on specific identifiers, ClickHouse utilizes Bloom filters to avoid full column scans. Consequently, dimensions once dropped for performance reasons are now available for any query. Shopee demonstrated this capability by searching specific trace IDs in a dataset of over 30 billion rows in seconds.
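A minimal, illustrative table definition shows both mechanisms: the ORDER BY key drives the sparse primary index, and a Bloom-filter skip index accelerates point lookups on a high-cardinality TraceId. The schema and names are assumptions, not the canonical OpenTelemetry schema.

```sql
-- Illustrative table: sparse primary index via ORDER BY, Bloom filter for point lookups.
CREATE TABLE otel_traces
(
    Timestamp    DateTime64(9),
    ServiceName  LowCardinality(String),
    SpanName     LowCardinality(String),
    TraceId      String,
    Duration     UInt64,
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY (ServiceName, SpanName, toUnixTimestamp(Timestamp));

-- Point lookup: the Bloom filter skips granules that cannot contain this TraceId.
SELECT * FROM otel_traces WHERE TraceId = 'abc123';
```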
ClickHouse's columnar architecture applies type-specific codecs to achieve compression ratios averaging 15-20x depending on data characteristics. This efficiency is enhanced by native support for tiered storage, allowing older data to move seamlessly to lower-cost object storage like S3 while remaining queryable. This decoupling of storage and compute allows teams to retain data for compliance and historical analysis at a fraction of the cost of index-heavy systems at scale.
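A hedged sketch of how this looks in practice: type-specific codecs are declared per column, and a TTL move rule shifts parts older than 30 days to an S3-backed volume. It assumes a storage policy named 'hot_and_cold' with a volume called 'cold' has already been configured on the cluster.

```sql
-- Sketch only: per-column codecs plus tiered storage via a TTL move rule.
CREATE TABLE otel_logs_tiered
(
    Timestamp   DateTime64(9) CODEC(Delta, ZSTD),
    ServiceName LowCardinality(String),
    Body        String CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY (ServiceName, Timestamp)
TTL toDateTime(Timestamp) + INTERVAL 30 DAY TO VOLUME 'cold'  -- assumed S3-backed volume
SETTINGS storage_policy = 'hot_and_cold';                     -- assumed policy name
```

Data moved to the cold volume remains queryable through the same table, so long-retention compliance queries need no separate pipeline.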
Considerations: trade-offs and migration
While a unified observability platform is the ideal end-state, migration does not require an all-or-nothing approach. Many teams find success by adopting ClickHouse in a "best-of-breed" capacity first, for example replacing an expensive logging solution like Elasticsearch while keeping their existing metrics and tracing pipelines. This phased approach allows teams to realize immediate cost and performance benefits for their most painful data silo before unifying other signals over time.
While ClickHouse offers superior analytics capabilities, it is not a drop-in replacement for every use case without consideration.
- Migration Path: Moving from ELK or Prometheus involves changing the storage backend. Teams typically migrate incrementally: first routing new trace data to ClickHouse via OpenTelemetry collectors while keeping existing Jaeger/Tempo in place, then migrating logs and metrics once they have validated that query patterns meet SLA requirements, often over three to six months.
- Operational Complexity: ClickHouse requires careful tuning of partition keys, ordering keys, and materialized views for optimal performance. Teams need expertise in distributed database operations, including managing replication, sharding strategies, and cluster topology.
- Learning Curve: Unlike UI-driven tools that abstract the query layer, leveraging the full power of ClickHouse requires proficiency in SQL. While this offers greater flexibility, it may require training for teams used to query languages like PromQL or Lucene. This can be mitigated by open-source tooling such as HyperDX, which provides a rich visualization and exploration layer on top of ClickHouse as part of ClickStack. It offers Lucene-like query semantics that are transparently transpiled into efficient SQL, giving users a familiar search experience (a sketch follows this list).
- Alternative Options: For teams with under 100GB daily telemetry and simple alerting needs, managed services like Datadog may offer faster time-to-value despite higher per-GB costs. Additionally, if the primary use case is purely text search without aggregation, inverted-index systems like Elasticsearch remain highly effective.
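To make the transpilation idea in the learning-curve point concrete, a Lucene-style search such as `service:checkout AND level:error AND "payment declined"` roughly corresponds to the SQL below. This is illustrative only; the exact SQL HyperDX generates may differ, and otel_logs is a hypothetical table.

```sql
-- Rough SQL equivalent of a Lucene-style log search (illustrative).
SELECT Timestamp, ServiceName, SeverityText, Body
FROM otel_logs
WHERE ServiceName = 'checkout'
  AND SeverityText = 'ERROR'
  AND Body ILIKE '%payment declined%'
ORDER BY Timestamp DESC
LIMIT 100;
```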
Proof at scale: The observability architectures of OpenAI, Anthropic, and Tesla
The benefits of this architecture are proven by its adoption in the most data-intensive environments where companies process petabytes of observability data daily. OpenAI rebuilt their observability platform on ClickHouse to achieve sub-second query performance on massive data throughput. Anthropic, developers of Claude, noted that with ClickHouse, "queries are lightning-fast, and money is not on fire as much," referring to the significant cost reductions compared to their previous solution. Tesla validated this scale, choosing ClickHouse for their metrics platform because "Data in ClickHouse is better than data anywhere else. No other system lets you slice and dice your data... There's nothing out there that competes with ClickHouse."
These considerations aside, the fundamental shift in observability remains: it is now an analytics problem that requires an analytics-grade solution. The three-pillars model is obsolete. In 2026, observability is a unified analytics discipline. This shift makes the underlying database the critical architectural decision. An engine not built for analytics will inevitably struggle with the scale and complexity of modern systems.
ClickHouse provides a foundation engineered for high-performance analytics on granular, raw event data. It delivers the speed to explore "unknown unknowns" interactively and the cost-efficiency to store all data. As system complexity grows and observability costs compound, the limitations of siloed architectures become untenable. The question for most teams isn't whether to adopt an analytics-first approach. Instead, the question is how quickly they can migrate before costs and capability gaps become critical. Ready to build on the right foundation? Get started with ClickHouse Cloud today.