Running OpenTelemetry in production eventually exposes the challenges and tradeoffs of data storage strategy. The decisions made at this stage directly determine the effectiveness of any observability initiative.
Poor storage strategies typically lead to three predictable outcomes: exponentially rising costs, degraded query performance, and reduced value from telemetry data. Many engineering teams invest heavily in building advanced OpenTelemetry pipelines, only to encounter unmanageable storage expenses or unusable datasets within months.
This guide focuses on practical, implementation-level guidance for engineers responsible for designing and maintaining scalable OpenTelemetry storage systems. It explores proven architectural patterns, effective data volume management techniques, and scalability approaches that maintain both performance and cost efficiency.
Whether the goal is to deploy OpenTelemetry for the first time or optimize an existing environment, the following best practices provide a framework for balancing cost control with operational capability.
1. Storage architecture options #
When it comes to OpenTelemetry data storage, the strategy you use will depend on several characteristics of your engineering system. Before we get into the implementation details, evaluate your requirements against the following.
Product considerations #
- Data volume and growth: Be realistic about total volume and per-signal type growth. Most people underestimate by 3-5x.
- Query patterns: Different teams have different needs - SREs need real-time alerting, product teams need historical trend analysis, etc.
- Technical expertise: Your team’s experience with specific databases will impact operational success.
- Cost constraints: Observability typically consumes 5-15% of infrastructure budgets. Set boundaries.
- Compliance requirements: For example, regulated industries need immutable audit trails for specific telemetry.
Implementation considerations #
As you build your chosen architecture, remember:
- Schema evolution: OpenTelemetry semantic conventions change. Your storage solution must absorb those changes without breaking or forcing disruptive data migrations.
- Cross-signal correlation: The magic of observability happens when you connect metrics, traces, and logs. Your architecture should support correlation at the database layer, beyond simple trace ID lookups.
- Query performance at scale: Solutions that work great with 10 services fail with 100. Are you at 10 now, but on your way to 100?
- Operational complexity: Every additional component adds operational overhead. Balance capability vs cost of maintenance.
ClickStack’s approach #
ClickStack approaches these challenges with a flexible architecture designed to scale alongside growing telemetry data. Purpose-built for OpenTelemetry at scale, it maintains compatibility with evolving semantic conventions and supports seamless cross-signal correlation—an essential capability for effective root cause analysis in modern, distributed systems.
2. Tiered storage for cost-performance balance #
Tiered storage gives you the best cost-performance balance by having multiple storage layers with different characteristics and retention periods.
Hot Tier (0-7 days) #
Your hot tier handles operational data for immediate troubleshooting:
- Requirements: High performance storage with sub-second query response times
- Technology: SSD or memory-optimized database technologies
- Performance: Enough IOPS to handle peak write and query loads
- Cost: 5-10x more expensive per GB than cold storage
Warm Tier (7-30 days) #
The warm tier is for trend analysis and post-mortems:
- Requirements: Balanced performance-cost storage
- Technology: Columnar storage formats for compression and query performance
- Performance: Lower IOPS requirements than the hot tier
- Data Movement: Automated policies to move data from hot tier
Cold Tier (30+ days) #
Your cold tier holds historical data for compliance and capacity planning:
- Requirements: Low cost object storage or archival databases
- Compression: 10:1 or better for cost efficiency
- Query Patterns: Batch oriented with relaxed performance expectations
- Lifecycle: Automated policies for data expiration
Unified tier - Separation of storage and compute in ClickHouse Cloud #
While tiered storage provides a structured approach to balancing performance and cost, ClickHouse Cloud takes this concept further by combining the advantages of each tier into a unified, elastic architecture. It employs a modern separation of storage and compute that eliminates the traditional compromises between hot, warm, and cold data: all data is stored durably in low-cost object storage, fronted by a distributed cache and by local caches on the compute nodes.
This design lets compute resources scale elastically with workload demand while ensuring that every dataset, regardless of age or access frequency, receives the same high-performance query experience as a hot tier. It also allows read and write compute to be separated, so user query performance is not impacted by an increase in insert throughput. The result is the low-cost retention of cold storage with the responsiveness typically reserved for operational datasets: cost efficiency without sacrificing query speed or analytical depth.
Starting Point #
For teams beginning their OpenTelemetry journey, a practical approach is to start with a two-tier model consisting of operational and archival storage, then introduce a cold tier once data volumes exceed approximately 100 GB per day. This structure provides a clear path to scalability without unnecessary complexity in the early stages.
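As a rough sketch of what that two-tier split can look like at the Collector level, the same OTLP stream can be fanned out to an operational store and an archival bucket. The exporter settings below (the ClickHouse endpoint, the `ttl` retention field, and the S3 bucket) are illustrative assumptions and will vary with your collector-contrib version and environment:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:

exporters:
  # Operational tier: queryable store with a bounded retention window
  clickhouse:
    endpoint: tcp://clickhouse:9000   # hypothetical endpoint
    database: otel
    ttl: 720h                         # ~30 days of operational retention
  # Archival tier: cheap, durable object storage for compliance and replay
  awss3:
    s3uploader:
      region: us-east-1
      s3_bucket: telemetry-archive    # hypothetical bucket
      s3_prefix: otel

service:
  pipelines:
    logs/operational:
      receivers: [otlp]
      processors: [batch]
      exporters: [clickhouse]
    logs/archive:
      receivers: [otlp]
      processors: [batch]
      exporters: [awss3]
```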
ClickStack OSS offers an excellent foundation for this model, providing full control over deployment, configuration, and storage optimization. It allows teams to incrementally evolve their observability stack as data volumes grow.
Alternatively, ClickHouse Cloud provides a fully managed option that simplifies this progression entirely. Its separation of storage and compute ensures that all data, whether recent or historical, receives consistent query performance while benefiting from the cost efficiency of object storage. In this architecture, there is no need to manage multiple tiers manually, as data of any age is treated equally and queried with the same low-latency performance.
This flexibility allows organizations to begin with ClickStack OSS and scale up, or start directly with ClickHouse Cloud to take advantage of automatic elasticity and simplified operations.
3. Signal-specific storage: telemetry types #
Different telemetry signals have unique storage and query requirements. Historically, this led teams to adopt specialized databases for each signal type, creating fragmented architectures that become increasingly difficult to scale and operate as data volumes grow.
Metrics Storage #
Metrics have distinct characteristics that require tailored optimization:
- High write throughput for frequent, time-series data points
- Efficient compression for repeated numerical values
- Fast aggregation across time intervals
- Handling high cardinality without performance degradation
Purpose-built time-series databases such as Prometheus, InfluxDB, or TimescaleDB meet these requirements effectively, up to a point. As the number of metrics and unique labels grows, cardinality quickly becomes a scaling bottleneck. These stores are also typically not designed for exploratory workflows where users query across time series, for example when performing higher-level trend analysis.
Trace Storage #
Distributed traces present a different set of challenges:
- Variable-sized documents with parent-child relationships
- Graph-based query capabilities for service dependency analysis
- Efficient storage of high-cardinality span attributes
Specialized trace storage systems such as Jaeger, Tempo, or Elasticsearch with trace-specific schema optimizations work well at moderate data volumes, typically in the gigabyte range per day. However, they struggle to scale linearly due to the complexity of indexing high-cardinality span attributes and the storage overhead of maintaining large dependency graphs. Performance can degrade sharply once trace volumes reach billions of spans, and cost grows disproportionately to data volume.
Log Storage #
Logs bring another layer of complexity to observability pipelines:
- Full-text search for quick troubleshooting
- Term-frequency indexing for efficient filtering and queries
- Variable retention based on log source and importance
Log-focused solutions such as Elasticsearch, Loki, or OpenSearch are effective for moderate-scale operations. However, search-oriented log stores with inverted indices consume large amounts of disk and memory as datasets expand. Scaling horizontally increases infrastructure cost, while performance tuning becomes increasingly fragile. Loki addresses some of these issues by simplifying indexing, but that comes at the cost of less flexible query capabilities for unstructured or exploratory search.
The Silo Problem #
Running separate systems for metrics, traces, and logs introduces data silos that hinder cross-signal correlation. When each telemetry type resides in its own store, identifying issues that span across metrics and logs, or linking traces to corresponding performance metrics, requires stitching data together at the application layer. Teams often end up passing trace IDs or other identifiers across systems and tools, creating operational friction and delaying root cause analysis.
This fragmentation is manageable at small scale but becomes a major challenge as telemetry data and team size increase. The inability to query and correlate across signals in real time limits observability value precisely when the system needs it most.
For Smaller Organizations #
For environments with fewer than 500 services, a combination of Prometheus (metrics), Tempo (traces), and Loki (logs) offers a practical and cost-effective solution. These tools integrate well, provide sufficient observability, and maintain reasonable operational overhead at smaller scales.
However, as data volumes grow or as the need for unified analysis across telemetry types increases, this architecture becomes difficult to sustain. At that point, moving toward a unified, columnar-based approach such as ClickStack powered by ClickHouse provides the scalability and performance needed to manage metrics, traces, and logs within a single system, while preserving cost efficiency and operational simplicity.
4. Advanced data volume management techniques #
As telemetry adoption grows, data volume management becomes key to controlling costs and maintaining performance. Implement these techniques early in your OpenTelemetry journey.
Strategic Filtering #
Filtering is your first line of defense against data volume explosion:
- Put filter processors early in the pipeline to minimize resource usage
- Target high-volume, low-value data such as health checks and background processes
- Use both allowlists and blocklists depending on data criticality
- Consider both collection-time and query-time filtering
Here is an example filtering configuration for health checks:
```yaml
processors:
  filter:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - http.server.duration
        resource_attributes:
          - key: http.target
            value: /health
```
Column-oriented databases such as ClickHouse, the foundation of ClickStack, deliver exceptional storage efficiency through advanced compression algorithms. Compression ratios typically range from 10x to as high as 170x, depending on data structure and redundancy. This allows large volumes of telemetry data to be retained at a fraction of the raw storage footprint without compromising query performance.
Combined with the use of object storage in modern architectures like ClickHouse Cloud, storage costs become a relatively small part of the total observability expense. Object storage provides durable, low-cost retention, while columnar compression ensures that even extensive datasets remain efficient to store and query.
While this dramatically reduces the pressure to drop or sample data, filtering still plays an important role. By reducing the amount of data that needs to be read and processed during queries, filtering improves responsiveness, lowers compute demand, and helps ensure that engineers focus on the most relevant telemetry signals.
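As a rough worked example: if filtering drops, say, half of the low-value telemetry and the remainder compresses at a conservative 15x, then 1 TB of raw data per day becomes about 500 GB ingested and roughly 33 GB on disk, or around 1 TB of storage for a 30-day window.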
Sampling traces #
Choosing the right sampling approach is critical for trace data:
- Tail-based sampling preserves important traces while reducing volume
- Probabilistic sampling provides baseline coverage across all services
- Deterministic sampling ensures consistent decisions for related traces
Most organizations achieve a 70–90% reduction in trace storage while maintaining diagnostic capabilities through intelligent sampling. The key is to preserve error traces and performance outliers while sampling normal operations more aggressively.
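For example, here is a minimal sketch of that keep-the-outliers approach using the Collector's tail_sampling processor; the thresholds and percentages are illustrative and should be tuned to your traffic:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # wait for late spans before deciding
    policies:
      # Always keep traces that contain an error
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep unusually slow traces
      - name: slow
        type: latency
        latency:
          threshold_ms: 500
      # Keep a small baseline of everything else
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

A trace is kept if any policy matches it, so errors and outliers survive even when the baseline rate is aggressive.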
Even though systems like ClickStack, powered by ClickHouse, can efficiently store high volumes of trace data, sampling still provides clear advantages. It reduces query load, shortens analysis time, and ensures that engineers focus on the most meaningful traces without overwhelming dashboards or alerting systems.
Cardinality control techniques #
High cardinality is the silent killer of metrics systems. It occurs when the number of unique combinations of metric labels or attributes grows exponentially, causing storage bloat, performance degradation, and unstable query times.
Common approaches to mitigating this issue include:
- Reducing dimensionality by grouping or pruning high-cardinality attributes that multiply storage requirements.
- Applying hashing algorithms to attributes in order to preserve analytical value while minimizing unique value expansion.
Removing or hashing high-cardinality attributes can reduce metrics storage requirements by 40–60% in high-traffic environments while maintaining essential analytical capabilities.
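At collection time, one hedged way to apply both techniques is the Collector's attributes processor; the attribute names below are purely illustrative:

```yaml
processors:
  attributes/cardinality:
    actions:
      # Replace the raw value with a hash, keeping the ability to group and
      # correlate without storing the raw identifier itself
      - key: user.id
        action: hash
      # Drop attributes that add little analytical value but multiply unique series
      - key: http.request.header.x-request-id
        action: delete
```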
However, column-oriented databases like ClickHouse, which power ClickStack, fundamentally change how this problem is addressed. ClickHouse efficiently handles high-cardinality datasets through its compressed columnar storage, vectorized query execution, and advanced indexing techniques. This architecture enables engineers to store and query large volumes of diverse telemetry data without compromising performance.
As a result, users can send telemetry data freely without the constant concern of hitting cardinality limits, gaining both analytical depth and predictable scalability across metrics, traces, and logs.
5. Common gotchas and technical solutions #
Below are some common mistakes that organizations implementing OpenTelemetry storage run into:
Collecting too much unhelpful data #
The "collect everything" approach quickly backfires:
- Collecting everything "just in case" is expensive and reduces the signal-to-noise ratio. You can mitigate this by implementing value-based collection: require justification for new telemetry data types.
- Regularly review usage patterns and prune unused metrics. One way to operationalize this is to create a telemetry data catalog to track ownership and usage of collected signals.
ClickStack’s usage analytics can help you identify unused metrics and logs so you can make data-driven decisions about what to collect and retain.
Insufficient Resource Allocation #
OpenTelemetry Collector resource constraints are a common source of data loss:
- Underestimating Collector resource requirements causes crashes and data loss during traffic spikes. To solve this, implement proper resource limits, memory management, and auto-scaling based on telemetry data volume (see the sketch below). It’s also a good idea to put predictive scaling in place based on historical patterns.
- Lastly, get in the habit of monitoring Collector metrics as closely as application metrics.
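Here is a minimal sketch of the memory side of this, using the Collector's memory_limiter processor; the limits and the backend endpoint are placeholders to size against your own deployment:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  # Apply back-pressure (refuse data) instead of crashing when memory runs hot;
  # it should be the first processor in every pipeline
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500        # size against the container's memory limit
    spike_limit_mib: 300   # headroom for short bursts
  batch:

exporters:
  otlphttp:
    endpoint: https://telemetry-backend.example.com   # hypothetical backend

service:
  telemetry:
    metrics:
      level: detailed      # expose the Collector's own metrics for monitoring
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```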
Ignoring Schema Management #
Most modern telemetry stores, including ClickHouse, natively support semi-structured data. Features such as JSON types with dynamic column creation mean users do not need to predefine their schema upfront. Columns can be created automatically as new attributes appear in the data.
This flexibility is especially valuable in dynamic environments such as Kubernetes, where attributes and labels can vary across deployments and evolve over time. Semi-structured support allows teams to capture this variability without pipeline failures or costly schema migrations, making it easier to onboard new services and telemetry sources.
However, as observability implementations mature, schema management becomes essential. Without a consistent approach, attribute naming can drift across services, resulting in analytical blind spots, inaccurate aggregations, and reduced correlation between metrics, traces, and logs.
To maintain data quality and consistency:
- Establish a central attribute registry that defines approved attributes and naming conventions.
- Validate telemetry against this registry at collection or ingestion time (see the sketch after this list).
- Use OpenTelemetry semantic conventions as a foundational schema, extending them with organization-specific attributes where necessary.
- Create service-level agreements (SLAs) for introducing schema changes to ensure updates remain controlled and backward compatible.
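As one example of normalizing attributes at collection time, the Collector's attributes processor can map legacy names onto the approved ones; the attribute names here are hypothetical:

```yaml
processors:
  attributes/normalize:
    actions:
      # Populate the approved semantic-convention name from a legacy attribute
      # when the approved name is missing...
      - key: deployment.environment
        from_attribute: env
        action: insert
      # ...then drop the legacy spelling so dashboards see a single name
      - key: env
        action: delete
```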
By combining schema flexibility with disciplined management, teams gain the best of both worlds: the agility to ingest evolving telemetry data without disruption and the reliability required for accurate, cross-signal analysis.
Conclusion #
OpenTelemetry storage implementation requires thoughtful architecture decisions, signal-specific optimizations, and ongoing data volume management. These best practices provide a framework for striking the right balance between cost and capability.
Start by choosing the right storage architecture for your use case. Implement tiered storage or use separation of storage and compute to get the cost-performance balance right, and use signal-specific storage to get the most out of your metrics, traces, and logs. Apply data volume management early to prevent costs from getting out of control, and address common implementation pitfalls before they impact your operations.
As your OpenTelemetry adoption grows, revisit these best practices regularly. What works at 10 services doesn’t scale to 100 or 1,000. The most successful observability implementations evolve over time with regular pruning of unused telemetry and ongoing optimization of storage patterns.
Whether you’re building your own storage or using platforms like HyperDX that implement these best practices out of the box, focusing on these technical fundamentals will get you measurable improvements in cost and capability. Your future self and your infrastructure budget will thank you.