- The problem: Traditional Kubernetes monitoring tools fail at scale. They struggle with high cardinality data, leading to slow queries, alert fatigue, and unpredictable, high costs. This forces a painful trade-off between visibility and budget.
- The root cause: The issue isn't just the monitoring tools, but the legacy databases they're built on. Legacy time-series databases (like Prometheus) and search indexes were not designed for the volume and complexity of modern telemetry data.
- The solution: A modern, database-first approach using a columnar database like ClickHouse. It handles high cardinality by design, delivers sub-second analytics on petabytes of data, and is cost-efficient.
- ClickStack: An open-source, integrated observability stack powered by ClickHouse. It provides a complete solution for logs, metrics, and traces without vendor lock-in, offering the power of a custom-built system with the simplicity of an integrated platform.
Summary #
Kubernetes has become the standard for deploying modern applications, but gaining visibility into these dynamic environments remains a significant challenge. Legacy monitoring platforms were not designed for the sheer volume and complexity of data generated by today's cloud-native architectures. The symptoms are familiar to platform engineers: slow queries, alert fatigue, and rising SaaS bills that force difficult trade-offs between observability and budget. As a result, engineering teams spend more time managing monitoring tools and their unpredictable costs than building features.
This guide diagnoses the root cause of these Kubernetes observability challenges: a fundamental data architecture problem. We will examine why traditional monitoring fails at scale and present a modern, database-first approach that delivers speed and efficiency without the prohibitive cost. It's time to stop choosing between visibility and your budget.
Why is Kubernetes observability so difficult? #
Kubernetes generates more data than traditional monitoring systems can handle effectively. The core issue is a data architecture mismatch. Monitoring tools built for stable servers break when applied to Kubernetes's dynamic, distributed nature.
Three specific characteristics make Kubernetes particularly hard to monitor: high cardinality data (millions of unique identifiers), ephemeral workloads (containers constantly starting and stopping), and distributed request flows (single requests spanning dozens of services). Let’s talk about them one by one.
The data deluge: high volume and high cardinality #
A Kubernetes environment generates an enormous volume of data. Every pod, container, deployment, and network request emits a continuous stream of logs, metrics, and traces. However, the primary challenge is not volume, but high cardinality: the immense number of unique values a given dimension or label can have.
In Kubernetes, metadata like pod names with unique hashes (api-gateway-7b5d8f4c9f-x1y2z), container IDs, and trace IDs is inherently high cardinality. This data is essential for debugging, but it creates a combinatorial explosion for many monitoring backends. CI/CD pipeline observability makes this worse, as each new build or deployment injects a fresh set of high-cardinality identifiers, such as a commit_sha, into the system.
The ephemeral environment: constant churn and lost context #
Kubernetes workloads are, by design, ephemeral. Containers and pods are constantly created, terminated, and rescheduled based on demand, scaling events, or node availability. This constant churn challenges traditional monitoring tools, whose data models assume stable, long-lived hosts.
This dynamism makes it difficult to track performance historically and analyze trends, as the context of a metric can disappear as quickly as the pod that generated it. Capturing and querying this transient state with full context is performance-intensive and costly.
The microservices maze: distributed complexity and data silos #
In a microservices architecture, a single user request can traverse dozens of distributed services. To handle the volume, organizations have traditionally adopted specialized data stores for each signal type: time-series databases for metrics, search indexes for logs, and object stores for traces. While this ensures efficient storage, it inadvertently fragments the data into distinct silos.
While modern UIs like Grafana can visually place these signals side-by-side (e.g., clicking a button to jump from a trace to a log), this correlation is superficial. Because the data lives in physically separate engines, the backend cannot perform joint analysis. You cannot easily tie a spike in CPU (stored in a metrics DB) or in latency to the set of trace attributes (stored in a tracing DB) that explains it.
This makes troubleshooting a manual, disjointed process. Answering complex questions like "show me the trace attributes that are most common for users who hit a 'payment failed' log error during a spike in payment service latency" remains extremely difficult because the signals are isolated. This friction drives the need for a unified data backend capable of ingesting and analyzing logs, metrics, and traces within a single store.
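For contrast, here is roughly what that question looks like once all three signals live in one SQL-queryable store. This is a minimal sketch, assuming an OpenTelemetry-style schema (otel_logs and otel_traces tables with TraceId, Body, ServiceName, Duration, and SpanAttributes columns) and an illustrative 'payment' service name; adapt the names to your own deployment.

```sql
-- Hedged sketch: table and column names assume an OpenTelemetry-style schema
-- in a single ClickHouse database; the 'payment' service name is illustrative.
SELECT
    attribute_key,
    attribute_value,
    count() AS occurrences
FROM otel_traces AS t
ARRAY JOIN
    mapKeys(t.SpanAttributes)   AS attribute_key,
    mapValues(t.SpanAttributes) AS attribute_value
WHERE t.ServiceName = 'payment'
  AND t.Duration > 2000000000          -- spans slower than 2 s (duration in ns)
  AND t.TraceId IN
  (
      -- traces whose logs contain the 'payment failed' error
      SELECT TraceId
      FROM otel_logs
      WHERE Body ILIKE '%payment failed%'
  )
GROUP BY attribute_key, attribute_value
ORDER BY occurrences DESC
LIMIT 20;
```

Because logs and traces share one engine, the correlation happens in a single query rather than across two UIs.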
| Challenge | Description | Impact on traditional monitoring |
|---|---|---|
| High cardinality data | An immense number of unique label values from pod IDs, container IDs, and commit hashes. | Overwhelms time-series databases, causing a combinatorial explosion of unique series, high memory usage, and slow queries. |
| Ephemeral environment | Containers and pods are constantly being created, terminated, and rescheduled by Kubernetes. | Data models that assume stable hosts fail. It becomes difficult to track historical performance as context disappears with the pod. |
| Distributed complexity | Because metrics, logs, and traces have different structures, they are traditionally stored in separate, optimized databases (e.g., TSDBs vs. Search Indices). | Prevents analytical correlation. Engineers cannot run queries across data types (e.g., filtering metrics based on trace data), forcing manual "swivel-chair analysis" to connect the dots during root-cause analysis. |
The current landscape: why existing monitoring stacks fail at scale #
In the quest for Kubernetes observability, engineering teams face a flawed choice that pits cost against capability. The market presents two dominant paths, each with fundamental architectural trade-offs. Path one is the fragmented, open-source stack, which promises freedom but often delivers operational complexity and technical limitations at scale.
Path two is the proprietary, all-in-one SaaS platform, offering a unified experience at a high and unpredictable cost. This dilemma forces a compromise: either accept a solution that costs too much or one that doesn't meet technical needs. Neither path provides the clear, fast, and complete visibility modern environments demand.
Path 1: the operational complexity of fragmented open-source #
While Prometheus is the de facto open-source standard for cloud-native metrics, its architecture struggles with the high cardinality data typical of Kubernetes environments. Its data model creates a unique time series for every combination of a metric name and its label values, so high-cardinality labels cause a combinatorial explosion in the number of series Prometheus must track: a single metric with 10,000 pod names, 100 endpoints, and 5 status codes already yields five million unique series. This explosion strains Prometheus's single-node design by overwhelming the in-memory head block where recent data is indexed, leading to significant memory consumption and even crashes.
The ecosystem offers distributed, Prometheus-compatible systems like Thanos and Mimir, which provide long-term storage and a global query view. However, they transform a single monitoring tool into a complex, distributed system requiring significant operational overhead to deploy, manage, and scale.
This complexity is compounded when teams adopt a fragmented, "best-of-breed" stack, using separate tools like Loki for logs and Tempo for traces. Since the data resides in separate, optimized databases, you cannot natively run queries that join or aggregate across logs, metrics, and traces. This forces engineers into a manual "swivel-chair analysis", pivoting between different interfaces and query languages or relying on opinionated workflows that correlate crudely at the application layer, all of which increases Mean Time to Resolution (MTTR).
Path 2: the prohibitive cost of proprietary SaaS platforms #
Proprietary SaaS platforms like Datadog present themselves as the 'easy button' for observability, but this simplicity comes at a steep price. Their complex, SKU-heavy pricing models result in a high per-gigabyte price and unpredictable costs at scale. The misalignment is most damaging with legacy per-host pricing, which is ill-suited to containerized, microservice-based environments. Modern infrastructure relies on autoscaling to absorb traffic spikes, yet host-based pricing models often bill on the peak number of concurrent nodes. This forces organizations to pay for temporary capacity as if it were permanent infrastructure and effectively erases the cost efficiencies of the cloud.
This pricing pressure forces engineering teams into a defensive posture, compelling them to adopt aggressive sampling and data-tiering strategies. The core issue with sampling is that it's a bet against the unknown. Teams risk discarding the exact low-frequency, high-impact event (the "needle in the haystack") that could be the root cause of a critical outage. Furthermore, these platforms lock you into proprietary agents and query languages.
Instead of just avoiding lock-in, consider what a standard like SQL provides: the ability to perform deep, ad-hoc analysis, use a vast ecosystem of data science tools like Python and notebooks, and ask complex questions of your data that are impossible with proprietary DSLs. Once dashboards and institutional knowledge are built on a vendor's platform, migrating becomes a daunting and expensive task.
| Approach | Key technologies | Primary weaknesses | Cost implications |
|---|---|---|---|
| Fragmented open-source | Prometheus, Loki, Thanos, Mimir | Struggles with high cardinality at scale. Creates data silos (logs, metrics, traces separate by default). Complex to deploy, manage, and scale. | High operational overhead (engineering time and effort to maintain complex distributed systems). |
| Proprietary "all-in-one" SaaS | Datadog, etc. | Punitive pricing model. Forces data sampling, risking missed critical events. Proprietary agents and query languages create vendor lock-in. | Unpredictable and high costs that penalize scale and containerized environments; expensive data retention plans. |
The root solution: why your database is the key to observability #
The operational challenges of Kubernetes monitoring, from slow dashboards to unpredictable costs, are symptoms of a deeper issue. The problem isn't the tools themselves, but the databases they're built on. Observability in modern systems is fundamentally a data problem, characterized by massive volume, high cardinality, and the need for real-time analysis. The solution, therefore, isn't another dashboard. It's a better database architecture designed for this specific workload.
How columnar storage enables sub-second analytics #
When an incident occurs, Site Reliability Engineers (SREs) need answers in seconds, not minutes. They must be able to interactively query and aggregate data across billions of rows to identify trends and pinpoint root causes. This is where traditional row-based databases and search indexes falter. Modern observability platforms achieve sub-second aggregations by pairing columnar storage with a vectorized, massively parallel query engine.
A columnar database stores all values for a single column together, meaning a query only reads the specific columns it needs (like latency_ms or status_code) from disk. This dramatically reduces I/O and allows for superior compression. On top of this storage layout, ClickHouse's speed comes from three core architectural principles:
- Data skipping: By storing data sorted on disk by a primary key, ClickHouse uses its sparse primary index to determine which blocks of data it can skip entirely, avoiding reading unnecessary data in the first place.
- Vectorized query execution: It processes data in large blocks, or vectors, instead of row-by-row. This modern approach maximizes CPU cache usage and uses SIMD (Single Instruction, Multiple Data) instructions to perform the same operation on multiple data points simultaneously.
- Parallel processing: Queries are automatically parallelized across all available CPU cores to compute intermediate states and merge them for the final result.
This combination of I/O reduction, CPU efficiency, and parallelism delivers the speed required for true real-time, exploratory analysis.
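To make these principles concrete, here is a minimal sketch of what a columnar log table and a typical incident query might look like. The schema, column names, and the 'api-gateway' service are illustrative assumptions, not a prescribed layout.

```sql
-- A minimal sketch of a columnar log table; names and types are illustrative.
CREATE TABLE k8s_logs
(
    Timestamp   DateTime64(9),
    ServiceName LowCardinality(String),
    PodName     String,
    StatusCode  UInt16,
    LatencyMs   Float64,
    Body        String
)
ENGINE = MergeTree
-- Data is sorted on disk by this key; the sparse primary index lets
-- ClickHouse skip entire blocks that cannot match a query's filter.
ORDER BY (ServiceName, Timestamp);

-- Only the ServiceName, Timestamp, and LatencyMs columns are read from disk;
-- Body and the other columns are never touched.
SELECT
    toStartOfMinute(Timestamp) AS minute,
    quantile(0.99)(LatencyMs)  AS p99_latency_ms
FROM k8s_logs
WHERE ServiceName = 'api-gateway'
  AND Timestamp > now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;
```

The ORDER BY key drives data skipping on the ServiceName and Timestamp filters, while the query itself touches only the three columns it references.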
Solving high cardinality by design, not by patchwork #
Unlike traditional time-series databases, ClickHouse's columnar architecture is resilient to high cardinality by design. High-cardinality attributes are not indexed as millions of individual time series; they are simply values within a column. This avoids tracking unique combinations of labels across the entire dataset, which is what overwhelms traditional TSDBs. While a single column with many unique values might compress less efficiently than one with few, the impact is minimal at scale, and the architecture is fundamentally more resilient to the combinatorial problem. A `GROUP BY` on a high-cardinality field like a user ID or trace ID is a highly optimized scan operation. For aggregations on extremely high-cardinality data, ClickHouse can even spool intermediate data to disk, allowing it to scale far beyond the limits of available memory.
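Continuing the illustrative k8s_logs sketch from above, aggregating over a high-cardinality column and opting into disk spilling look roughly like this; the spill threshold is an arbitrary example.

```sql
-- Aggregating by a high-cardinality column is just a scan plus a hash
-- aggregation; there is no per-series index to explode.
SELECT
    PodName,
    count()        AS requests,
    avg(LatencyMs) AS avg_latency_ms
FROM k8s_logs
GROUP BY PodName              -- tens of thousands of unique pods are fine
ORDER BY avg_latency_ms DESC
LIMIT 10;

-- For extreme cardinality, let the aggregation spill intermediate state to
-- disk instead of failing when it outgrows memory (threshold is illustrative).
SET max_bytes_before_external_group_by = 10000000000;  -- ~10 GB
```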
How modern architecture significantly reduces cost at scale #
A modern observability architecture must be economically viable at scale. ClickHouse achieves this through a combination of advanced compression and the ability to run on commodity hardware. Its columnar storage format groups similar data together, enabling effective compression ratios, often in the 3x-10x range. This is accomplished using algorithms like LZ4 for speed and ZSTD for higher compression, which can be applied on a per-column basis for optimal results.
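As an illustration of per-column compression, the sketch below applies different codecs to columns of the hypothetical k8s_logs table and then inspects the achieved ratios; the codec choices are examples, not recommendations.

```sql
-- Per-column codecs (illustrative choices on the hypothetical k8s_logs table).
ALTER TABLE k8s_logs
    MODIFY COLUMN Body      String  CODEC(ZSTD(3)),       -- higher ratio for bulky text
    MODIFY COLUMN LatencyMs Float64 CODEC(Gorilla, LZ4);  -- float-oriented codec for numeric series

-- Inspect the compression actually achieved, column by column.
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'k8s_logs';
```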
This efficiency is further enhanced by decoupling storage from compute, a modern approach that uses cost-effective object storage like Amazon S3. This separation allows resources to scale independently, so teams are not forced to scale expensive servers just to retain more data. It also gives them the freedom to 'send everything' without punitive costs, ensuring that critical diagnostic data is never missed due to budget constraints.
| Challenge | The ClickHouse architectural solution | Key benefit |
|---|---|---|
| Slow queries at scale | Columnar storage & vectorized processing: Queries only read required columns and process data in efficient blocks (vectors). | Sub-second query performance and aggregations across billions of rows, enabling real-time, interactive analysis. |
| High cardinality crisis | Columnar data model: High-cardinality labels are simply values in a column, not millions of individual time series. | Resilient to the high cardinality issues that break Prometheus. GROUP BY on any field is a fast, optimized scan operation. |
| High costs | Advanced compression & decoupled storage/compute: Achieves superior data compression (10x+) and uses cheap object storage (e.g., S3). | Significantly lower storage costs and independent scaling of resources, making it affordable to store all your data without sampling. |
Proof at scale: how industry leaders manage Kubernetes telemetry data #
Theoretical benefits are one thing. Production-grade proof is another. Leaders operating at extreme scale have validated this architectural shift.
Tesla: from billions of time series to quadrillion-row analytics #
Struggling with Prometheus's inability to handle tens of billions of unique time series from high-cardinality data, Tesla built their monitoring platform on ClickHouse. They found it was the only system that could "slice and dice" their data effectively, culminating in a successful load test that ingested over a quadrillion rows in 11 days. And with PromQL support currently in development in open-source ClickHouse, the path for Prometheus-native environments is becoming even clearer.
Powering AI workloads with cost-efficient, high-speed queries #
This performance extends beyond metrics to the high-volume, high-cardinality logging challenges faced by AI pioneers. OpenAI, operating at an extreme scale, handles billions of rows in milliseconds, with engineers describing the performance as finishing "as fast as I could blink." Similarly, Anthropic uses ClickHouse to manage the large volume of telemetry from training models like Claude, confirming both performance and efficiency: with ClickHouse, "queries are lightning-fast, and money is not on fire as much."
From logistics to e-commerce: global scale deployments #
This unified architecture is what enables innovators to solve observability at scale. For example, Didi migrated its logging and trace solution from Elasticsearch to ClickHouse, reducing hardware costs by over 30% and improving query speeds by 4x. Similarly, e-commerce leader Shopee built its large-scale distributed tracing platform on ClickHouse to gain real-time visibility into system operations, allowing it to quickly identify and resolve bottlenecks across its entire platform.
ClickStack: an open, unified solution for Kubernetes observability #
For years, using ClickHouse to handle massive-scale observability data was a strategy reserved for elite engineering teams with the resources to build custom platforms around it. ClickStack removes this barrier, offering a high-performance, open-source observability stack that makes the power of ClickHouse accessible to everyone. It delivers a complete, out-of-the-box experience for logs, metrics, and traces, designed specifically to address the challenges of modern Kubernetes monitoring, from centralizing petabytes of container logs to analyzing application performance.
Unifying logs, metrics, and traces in a single database #
ClickStack eliminates this fragmentation by treating logs, metrics, and traces as context-rich 'wide events' within a single ClickHouse database. While this unified model might seem duplicative to users of siloed systems, it is highly efficient and a natural fit for a columnar database like ClickHouse.
The data compresses extremely well and is very performant to query, as analytics only scan the required columns, significantly reducing I/O. This integrated approach eliminates the need for engineers to manually correlate data across different UIs and query languages, a common pain point in fragmented open-source stacks. For teams migrating from other tools, ClickStack offers a clear path forward, with documented examples for migrating existing agents such as Filebeat via a Filebeat → Vector → OTel pipeline.
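As a small illustration of querying wide events, the hedged sketch below assumes the default OpenTelemetry exporter layout, where Kubernetes metadata lands in a ResourceAttributes map column; column and attribute names may differ in your deployment.

```sql
-- Which pods are producing the most error logs right now?
-- (Assumes an OTel-style otel_logs table; names are illustrative.)
SELECT
    ResourceAttributes['k8s.namespace.name'] AS namespace,
    ResourceAttributes['k8s.pod.name']       AS pod,
    count()                                  AS error_logs
FROM otel_logs
WHERE SeverityText = 'ERROR'
  AND Timestamp > now() - INTERVAL 15 MINUTE
GROUP BY namespace, pod
ORDER BY error_logs DESC
LIMIT 20;
```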
Go beyond observability: unifying with business analytics #
A key strategic differentiator of this database-first approach is the ability to unify observability and business data. As Arup Malakar from Sierra puts it, "It would be really cool if we no longer thought of observability and analytics as two different islands, but just one data problem, powered by a really good compute engine like ClickHouse." By using standard SQL, teams can go beyond traditional monitoring to practice "Observability Science", correlating system behavior with business outcomes, such as how API latency impacts user conversion rates, all within the same platform.
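For example, correlating checkout latency with conversion might look like the sketch below. The checkout_events table, the session.id span attribute, the converted flag, and the 'checkout-api' service name are all hypothetical; the point is simply that business and telemetry data can meet in one SQL engine.

```sql
-- Hypothetical: correlate p95 checkout latency with conversion rate per hour.
SELECT
    toStartOfHour(Timestamp)               AS hour,
    quantile(0.95)(Duration) / 1e6         AS p95_latency_ms,   -- ns to ms
    countIf(converted)                     AS conversions,
    round(countIf(converted) / count(), 3) AS conversion_rate
FROM
(
    SELECT Timestamp, Duration, SpanAttributes['session.id'] AS session_id
    FROM otel_traces
    WHERE ServiceName = 'checkout-api'      -- illustrative service name
) AS spans
LEFT JOIN checkout_events AS c USING (session_id)   -- hypothetical business table
GROUP BY hour
ORDER BY hour;
```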
Built on OpenTelemetry for portability and no vendor lock-in #
ClickStack offers an end-to-end, open-source observability solution by combining the OpenTelemetry Collector for data ingestion, ClickHouse as the storage and query layer, and HyperDX as the user interface. Built on a foundation of open standards, ClickStack is OTel-native, aligning with the industry standard for cloud native observability and ensuring data portability. This commitment to OpenTelemetry prevents vendor lock-in and allows teams to use a large, growing ecosystem.
The power of custom-built, with the simplicity of an integrated stack #
ClickStack strikes a rare balance: the power and cost savings of a custom-built observability solution with the simplicity of an integrated, opinionated stack. This approach avoids the choice between complex, fragmented open-source tools and costly, proprietary SaaS platforms. Whether you're looking to replace an entire expensive SaaS platform or start by significantly reducing your logging bill, ClickStack provides a scalable, cost-effective path forward.
The new standard for Kubernetes observability #
Mastering Kubernetes observability at scale requires a fundamental shift in perspective: the challenge is not about better dashboards, but a better database. The slow queries and rising costs that plague modern cloud native observability efforts are symptoms of an architectural mismatch, where legacy time-series databases and search indexes struggle with high-cardinality, high-volume telemetry. The solution is to fix the data foundation.
To make this power accessible, ClickStack provides a complete, open-source Kubernetes monitoring solution, bundling the OpenTelemetry collector and a modern UI with ClickHouse. It offers the performance of a custom-built system in an integrated, easy-to-deploy package, addressing the trade-off between fragmented open-source stacks and expensive SaaS platforms.
Ready to stop choosing between visibility and budget? Explore ClickStack and see how leaders like OpenAI and Anthropic are solving Kubernetes monitoring at scale.