TL;DR: minimizing the need for log data compromise #
- The problem: Aggressive log sampling, often necessitated by the scaling costs of legacy platforms, creates visibility gaps. You risk missing the critical errors needed to diagnose outages and undermining the statistical integrity of your data.
- The cause: Engineers typically rely on separate stacks for search (ELK) and trends (Prometheus). As data grows, neither scales efficiently for the other’s workload, forcing teams to sample logs to control costs.
- The solution: OpenTelemetry provides the universal standards and semantic conventions needed to break down the silos between logs, metrics, and traces. When this unified, OpenTelemetry-standard data needs to be stored at scale, modern columnar architectures answer the call. With ClickHouse, teams benefit from 10-20x compression and a cloud-native design that separates compute from cost-efficient object storage. ClickStack (the ClickHouse Observability Stack) embodies this architecture to help reduce these trade-offs.
- The proof: Engineering teams are achieving petabyte-scale ingestion and real-time analysis on 100% of their observability data, moving away from sampling to gain complete visibility without a runaway budget.
Introduction: the scaling compromise #
As infrastructure scales into the petabytes, observability costs often follow a steep upward trajectory. To keep budgets under control, engineers are frequently advised to accept a difficult compromise: aggressive sampling.
Historically, engineers have relied on separate stacks to manage this data: search-based tools (like ELK) to isolate specific errors, and metric stores (like Prometheus) to monitor system health. Sampling became the primary survival mechanism to keep these distinct architectures affordable.
However, this compromise is a symptom of using legacy tools at a scale they were not designed for. The choice between complete visibility and a manageable budget does not have to be binary. In practice, sampling introduces risks that create critical blind spots.
You risk losing the "needle in the haystack" #
The most immediate risk is missing the low-frequency, high-severity event you didn't know you needed to look for. Critical system failures are often outliers: a single corrupt transaction, a slow API call causing tail latency, or a novel security threat. Aggressive sampling statistically increases the likelihood of discarding these exact events. By definition, it is difficult to write a sampling rule for an "unknown unknown" error you have never seen before. When a major outage occurs, the log line containing the root cause may have been dropped.
You compromise data integrity #
Beyond missing rare events, sampling undermines the statistical integrity of your data. When a percentage of telemetry is discarded, analytics become approximations. For example, sampling-based systems can under- or overestimate crucial metrics, such as request duration, rendering percentile-based SLOs and trend analyses less reliable. This makes it difficult to accurately measure the true impact of minor issues or perform forensic analysis after a security incident.
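To see why, consider a quick back-of-the-envelope check in SQL. The table and column names below (`requests`, `duration_ms`) are purely illustrative, and the query-time `SAMPLE` clause (which requires a `SAMPLE BY` key in the table definition) is used here only to approximate the effect of discarding data at ingest: the tail values that drive a p99 are exactly the rows a sample is most likely to miss.

```sql
-- p99 latency over the full dataset vs. a 1% sample.
-- Tail events are rare by definition, so the sampled estimate can
-- land well below (or occasionally above) the true value.
SELECT quantile(0.99)(duration_ms) AS p99_full
FROM requests;

SELECT quantile(0.99)(duration_ms) AS p99_sampled
FROM requests SAMPLE 1 / 100;  -- requires SAMPLE BY in the table definition
```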
You create operational complexity #
Maintaining complex sampling rules is a significant operational burden. Engineers must constantly adjust sampling configurations as applications evolve and new services come online. The criteria for what constitutes a "low-value" log today could be the critical signal for tomorrow's outage, trapping engineers in a reactive cycle of rule adjustments.
| Risk category | Description | Business impact |
|---|---|---|
| Missing critical events | Low-frequency but high-severity events ("unknown unknowns") are statistically likely to be dropped by sampling rules. | Increased Mean Time to Resolution (MTTR) during major outages; missed security threats from unusual sources. |
| Compromised analytics | Discarding data skews metrics, making percentile-based SLOs, trend analysis, and capacity planning less reliable. | Inaccurate performance measurements; reduced trust in dashboards; difficult resource allocation and decision-making. |
| Operational complexity | Engineers must constantly create, manage, and update complex sampling rules as services and applications evolve. | Engineering resources spent on reactive infrastructure management; risk of human error leading to data loss. |
The architectural shift: unifying search and trends #
For years, sampling was a reaction to the architectural boundaries of available tools. We relied on search engines (Splunk, Elastic) that excelled at finding a specific error message but were less efficient at high-level aggregations and trends. Conversely, we used metric stores (Prometheus) that were excellent for trends but lacked the context of the raw event.
Modern OLAP databases eliminate this trade-off by excelling at both fast aggregations and high-speed filtering. ClickStack, the ClickHouse Observability Stack, embodies this architecture to provide an end-to-end, open-source solution. It allows engineers to see high-level trends and quickly drill down to the raw logs without jumping between tools.
The solution starts with open, vendor-neutral data collection via OpenTelemetry. OpenTelemetry acts as the critical bridge, providing the shared semantic conventions that allow you to seamlessly correlate logs with metrics and traces. From there, ClickStack uses ClickHouse as the scalable storage and analytics engine, and HyperDX as a purpose-built UI.
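To make that workflow concrete, here is a minimal sketch of the "trend first, then drill down" pattern in SQL. The table and column names (`otel_logs`, `Timestamp`, `ServiceName`, `SeverityText`, `Body`, `TraceId`) follow the defaults created by the OpenTelemetry Collector's ClickHouse exporter, and the `checkout` service is a placeholder; your schema and names may differ.

```sql
-- 1. Trend: error volume per service per minute over the last hour.
SELECT
    toStartOfMinute(Timestamp) AS minute,
    ServiceName,
    count() AS errors
FROM otel_logs
WHERE Timestamp > now() - INTERVAL 1 HOUR
  AND SeverityText = 'ERROR'
GROUP BY minute, ServiceName
ORDER BY minute, ServiceName;

-- 2. Drill down: the raw log lines behind a spike, with TraceId available
--    to pivot straight into the corresponding traces.
SELECT Timestamp, ServiceName, Body, TraceId
FROM otel_logs
WHERE Timestamp > now() - INTERVAL 10 MINUTE
  AND ServiceName = 'checkout'
  AND SeverityText = 'ERROR'
ORDER BY Timestamp DESC
LIMIT 100;
```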
From storage overhead to efficient analytics #
Legacy platforms often face two challenges at scale: storage overhead and query performance. The first is indexing overhead: reliance on inverted indexes for search can produce storage requirements that grow larger than the raw data itself. It's not uncommon for an initial 3GB of logs to consume 30GB of disk space in an Elasticsearch cluster.
The second challenge is analytical performance. While effective for search, inverted index architectures can struggle with the large-scale aggregations needed for observability. In contrast, ClickHouse is built for analytics. It uses lightweight sparse primary indexes on data that is physically sorted on disk. When you query a time range, ClickHouse uses this index to skip reading vast blocks of data. This I/O reduction is combined with a vectorized execution engine that processes data in batches and massively parallel processing that uses available CPU cores.
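As a rough illustration (not ClickStack's actual schema), the sketch below shows how the ORDER BY key of a MergeTree table defines the on-disk sort order that the sparse primary index is built over; table and column names are placeholders.

```sql
-- Illustrative schema: the ORDER BY key physically sorts data on disk,
-- and the sparse primary index stores one entry per granule (8192 rows
-- by default), letting range queries skip entire blocks instead of
-- scanning them.
CREATE TABLE logs
(
    Timestamp    DateTime64(3),
    ServiceName  LowCardinality(String),
    SeverityText LowCardinality(String),
    Body         String
)
ENGINE = MergeTree
ORDER BY (ServiceName, Timestamp);

-- Only the granules whose (ServiceName, Timestamp) range overlaps the
-- predicate are read; everything else is skipped at the index level.
SELECT count()
FROM logs
WHERE ServiceName = 'payments'
  AND Timestamp > now() - INTERVAL 15 MINUTE;
```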
From row-oriented storage to columnar compression #
Traditional platforms are often row-oriented, storing all values for a single log event together. ClickHouse uses a columnar format, storing all values for a single column (e.g., all timestamps, all service names) together. This structure is highly effective for compression, routinely achieving 10-20x compression ratios.
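You can check the effect on your own tables: the system.columns table exposes compressed and uncompressed sizes per column. The table name `logs` below is a placeholder.

```sql
-- Per-column compression for a hypothetical `logs` table.
SELECT
    name AS column,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'logs'
ORDER BY data_compressed_bytes DESC;
```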
A common trade-off with columnar stores has been handling messy, semi-structured logs efficiently. ClickHouse addresses this with its native JSON data type. This feature offers the performance of a defined schema with the flexibility of a document store.
The JSON data type works by dynamically analyzing the data structure:
- Optimized column generation: It automatically identifies the most frequent properties and creates dedicated, optimized columns for them.
- Long-tail bucketing: Rare or "long-tail" properties are bucketed into a generic store to prevent "column explosion," which effectively balances write performance with query speed.
- Native types: Integers stay integers and aren't lazily cast to strings, maintaining high query performance.
Note: While this simplifies ingestion significantly, it does require engineers to consider query patterns. Accessing the dedicated columns is faster than querying the "long-tail" bucket, meaning schema design still matters for optimal performance on your most frequent queries.
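The sketch below shows a table with a JSON column absorbing arbitrary log attributes, and a query that addresses a nested path directly; `dynamicType()` confirms the value keeps its native integer type. Table, column, and attribute names are illustrative, and on some ClickHouse versions the JSON type must still be enabled explicitly.

```sql
-- Illustrative table: the JSON column absorbs arbitrary log attributes.
CREATE TABLE json_logs
(
    Timestamp  DateTime64(3),
    Body       String,
    Attributes JSON
)
ENGINE = MergeTree
ORDER BY Timestamp;

INSERT INTO json_logs VALUES
    (now(), 'checkout failed', '{"http": {"status_code": 502}, "user_id": 42}');

-- Nested paths are addressed directly; dynamicType() shows the value
-- kept its native Int64 type instead of being cast to a string.
SELECT
    Attributes.http.status_code AS status,
    dynamicType(Attributes.http.status_code) AS native_type
FROM json_logs;
```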
This efficiency applies not just to logs, but also to high-volume traces and high-cardinality metrics. To further bridge the gap between metrics and logs, ClickHouse is actively working on PromQL support, which will allow engineers to bring their existing Prometheus queries and dashboards directly to ClickHouse without rewriting logic.
From coupled architecture to decoupled storage and compute #
Legacy systems often tightly couple storage and compute, forcing you to scale expensive servers just to retain more data. In contrast, ClickHouse was designed for the cloud, using cost-efficient and scalable object storage like Amazon S3 or Google Cloud Storage. This decouples storage from compute, allowing you to retain 100% of your data at a lower cost while scaling compute resources independently.
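For self-managed clusters, one possible shape of this setup is a tiered table that moves older parts to an object-storage-backed volume; in ClickHouse Cloud the separation is handled for you. The storage policy name, volume name, and schema below are placeholders and assume the policy is already defined in the server configuration.

```sql
-- Assumes a storage policy 's3_tiered' with a 'cold' volume backed by
-- S3 (or GCS) is defined in the server configuration; names are placeholders.
CREATE TABLE logs_longterm
(
    Timestamp   DateTime64(3),
    ServiceName LowCardinality(String),
    Body        String
)
ENGINE = MergeTree
ORDER BY (ServiceName, Timestamp)
TTL toDateTime(Timestamp) + INTERVAL 30 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 's3_tiered';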
Proof in production: handling massive scale #
This architectural shift is becoming the standard for companies operating at the largest scale.
Cloudflare, which helps secure and accelerate a significant portion of the internet, was an early adopter of ClickHouse. Their previous Elasticsearch-based pipeline struggled to ingest millions of requests per second, and that ingestion bottleneck limited their ability to query data in real time. By migrating to ClickHouse, they removed the bottleneck. The engineering outcome was the ability to query across massive datasets with low latency, powering critical systems like firewall analytics.
More recently, Anthropic required an architecture capable of ingesting semi-structured data in real time to support their Claude AI models. When the Claude 3 and 3.5 models saw explosive user adoption, their existing observability systems faced scaling friction. With ClickHouse, Anthropic can now affordably store and instantly query 100% of their log data. The system is robust enough that engineers report they "sometimes don't even notice that the database is running."
| Company | Challenge | Solution with ClickHouse | Key outcome |
|---|---|---|---|
| Cloudflare | Previous pipeline struggled with ingestion throughput and resource efficiency at millions of logs per second. | Migrated logging pipelines to a self-managed ClickHouse cluster for superior ingestion speed and compression. | Petabyte-scale real-time analysis. Engineers can grep through massive datasets in seconds, enabling faster debugging and reliable customer analytics. |
| Anthropic | Existing systems faced scaling friction due to the data volume generated by Claude AI models. | Deployed ClickHouse to ingest and analyze 100% of their log data in real-time within their secure environment. | Full data retention at scale. Enabled deep model insights and robust debugging without requiring data sampling. |
Conclusion: stop sampling, start analyzing #
The trade-offs regarding log sampling have shifted. The practice of sampling was largely a workaround for the architectural limitations of search-based platforms and the separation of "search" and "metrics" tools. Storing 100% of your data is becoming the strategic standard for modern engineering teams.
A modern columnar platform like ClickStack, built on ClickHouse, changes the relationship between cost and performance. By decoupling compute from cost-efficient object storage and using 10-20x compression, it is economically feasible to retain petabytes of data. This architecture provides the visibility to find any "needle in the haystack," accurately analyze trends, and reduce the risk of missing the one critical event that explains a major outage.
It's time to move beyond the choice of sampling and start analyzing everything.
Ready to see for yourself? Start a free ClickHouse Cloud trial to experience petabyte-scale analytics firsthand. You can also read our case studies with companies like Anthropic and Cloudflare who manage massive-scale observability with ClickHouse.