What is OLAP? #
What does OLAP stand for? #
OLAP stands for Online Analytical Processing. It is an approach to answering multi-dimensional analytical queries on large datasets, often in sub-second to second timeframes. Unlike OLTP (Online Transaction Processing), which optimises for inserting, updating, and retrieving individual records, OLAP is designed for reading and aggregating data across many dimensions. An OLTP system powers a retail checkout. An OLAP system answers questions like "what was our total revenue by product category across all regions last quarter?"
OLAP enables organisations to explore historical and real-time data and uncover patterns that drive business decisions. The category encompasses everything from traditional data warehouses running overnight batch reports to modern real-time OLAP databases serving sub-second dashboards on billions of rows.
The distinction matters because not all analytical workloads are the same. A CFO reviewing quarterly revenue can tolerate a few seconds of query latency. A fraud detection system scanning live transactions cannot. Both are OLAP workloads, but they demand very different architectures. This guide covers the full spectrum.
What problem does OLAP solve? #
Transactional databases excel at point operations but collapse under the weight of analytical workloads. When a query scans millions of rows to compute aggregates across multiple dimensions, a system optimised for individual record writes becomes a bottleneck. Organisations accumulate vast amounts of data from every part of their operations, yet extracting insights from that data at speed remains difficult without the right architecture.
OLAP solves this by reshaping how data is stored, indexed, and queried so that analytical questions that would take hours on a transactional database return in seconds. This speed difference determines whether analysis informs real-time decisions or arrives too late to matter.
Consider a concrete example. An e-commerce company wants to know which product categories drove the most revenue last quarter, broken down by region and customer segment. The underlying table has 2 billion rows. On a transactional database, this query takes 30 minutes because the system reads every column of every row, processes them one at a time, and contends with production writes happening simultaneously. On an OLAP database with columnar storage, the same query reads only the three relevant columns, skips data that doesn't match the filters, and returns in under a second.
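The shape of that query can be sketched in miniature. The schema and figures below are illustrative (a toy in-memory SQLite table standing in for the 2-billion-row fact table), but the access pattern is the same one a columnar engine optimises: filter on one column, group by three, aggregate a fifth.

```python
import sqlite3

# Hypothetical denormalised fact table of order line items.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_items (
        order_date TEXT, region TEXT, customer_segment TEXT,
        product_category TEXT, revenue REAL
    )
""")
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?, ?, ?)",
    [
        ("2026-01-15", "EMEA", "Enterprise", "Electronics", 1200.0),
        ("2026-02-03", "NA",   "SMB",        "Apparel",      300.0),
        ("2026-03-20", "NA",   "Enterprise", "Electronics",  900.0),
    ],
)

# Revenue by category, region, and segment for Q1 2026. Only the
# filter column and four others are touched, which is why columnar
# engines answer this class of query so cheaply at scale.
rows = conn.execute("""
    SELECT product_category, region, customer_segment, SUM(revenue)
    FROM order_items
    WHERE order_date BETWEEN '2026-01-01' AND '2026-03-31'
    GROUP BY product_category, region, customer_segment
    ORDER BY SUM(revenue) DESC
""").fetchall()
print(rows[0])  # highest-revenue combination
```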
Where did the term OLAP come from? #
Edgar F. Codd, the computer scientist who invented the relational model, coined the term OLAP in 1993 in a paper commissioned by Arbor Software (later acquired by Hyperion Solutions, then Oracle). That paper, "Providing OLAP to User-Analysts: An IT Mandate," defined 12 rules that characterise a true OLAP system, establishing the conceptual foundation for the category.
The sponsorship arrangement proved controversial. When Computerworld discovered the undisclosed funding, it retracted the paper. The definitions survived regardless and became the industry standard.
OLAP products predated the term. Express, first released in 1970 and later sold by Information Resources Inc., performed OLAP-style queries decades before Codd gave the category a name. Essbase, the most commercially successful early OLAP system, shipped in 1992 and became the de facto standard for enterprise financial reporting.

How does OLAP work? #
What are dimensions, measures, and cubes in OLAP? #
Dimensions are the categorical axes along which data is analysed. Common dimensions include time (year, quarter, month), geography (country, state, city), product (category, brand, SKU), and customer (segment, region). Measures are the numerical values stored at the intersections of dimensions: revenue, quantity sold, profit margin, cost.
A cube is the conceptual model that brings these together. Imagine a three-dimensional array where each axis represents a dimension and each cell contains a measure. The intersection for "Q2 2026, Widget A, North America" contains a revenue figure of $2.4 million. The term "cube" persists even when dealing with far more than three dimensions (a ten-dimensional model is still called a cube in common usage, not a hypercube).
Cubes are logical structures, not physical ones. How they are stored depends on the OLAP architecture (MOLAP, ROLAP, or HOLAP), which is covered in the next section.
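Because a cube is logical rather than physical, a small one can be sketched as a plain mapping from dimension coordinates to a measure. The dimension values and figures here are illustrative:

```python
# A cube modelled as a mapping: (time, product, geography) -> revenue.
cube = {
    ("Q2 2026", "Widget A", "North America"): 2_400_000,
    ("Q2 2026", "Widget A", "Europe"):        1_100_000,
    ("Q2 2026", "Widget B", "North America"):   800_000,
}

# Reading one cell: the measure at a specific intersection.
revenue = cube[("Q2 2026", "Widget A", "North America")]

# Rolling up over the geography dimension for one product and quarter.
widget_a_q2 = sum(
    value for (quarter, product, _region), value in cube.items()
    if quarter == "Q2 2026" and product == "Widget A"
)
print(revenue, widget_a_q2)  # 2400000 3500000
```

How a real engine stores these cells (as pre-computed arrays or as relational tables) is exactly the MOLAP/ROLAP/HOLAP question covered next.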
What are OLAP hierarchies and why do they matter? #
Hierarchies define parent-child relationships within dimensions and enable navigation from coarse to fine grain. In a time dimension, the hierarchy might flow from Year → Quarter → Month → Day. In geography: Country → State → City → Postal Code. In product: Category → Subcategory → Product → SKU.
These relationships are not automatically inferred from data. They are explicitly modelled based on business logic. Hierarchies matter because they enable two critical operations: drill-down (moving from coarse to finer grain, for example from annual revenue to monthly revenue) and roll-up (aggregating from fine to coarse grain).
Without hierarchies, an analyst asking "show me monthly sales for Q2" would need to manually specify which months belong to Q2. With them, the system understands this relationship and handles aggregation automatically. This turns data exploration from a rigid process into intuitive navigation.
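The mechanics can be sketched in a few lines. The hierarchy is an explicit mapping, as the text describes, and roll-up is just aggregation through that mapping (figures are illustrative):

```python
# A time hierarchy modelled explicitly: each month maps to its quarter.
month_to_quarter = {
    "2026-01": "Q1", "2026-02": "Q1", "2026-03": "Q1",
    "2026-04": "Q2", "2026-05": "Q2", "2026-06": "Q2",
}
monthly_sales = {"2026-01": 10, "2026-02": 12, "2026-03": 9, "2026-04": 15}

# Roll-up: aggregate the fine grain (month) into the coarse grain (quarter).
quarterly = {}
for month, amount in monthly_sales.items():
    quarter = month_to_quarter[month]
    quarterly[quarter] = quarterly.get(quarter, 0) + amount

# Drill-down: the hierarchy answers "which months belong to Q1?"
# so the analyst never has to specify them manually.
q1_months = [m for m, q in month_to_quarter.items() if q == "Q1"]
print(quarterly, q1_months)
```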
What operations can you perform on OLAP data? #
OLAP systems support five core operations that shape how users interact with analytical data.
Slice removes one dimension from the cube by fixing it to a single value. Slicing by Q1 2026 reduces a four-dimensional cube to three dimensions and shows all products and regions for that quarter.
Dice narrows the cube along multiple dimensions simultaneously. Filter to Region = North America AND Product Category = Electronics AND Time = 2025–2026 to produce a focused subcube.
Drill-down navigates from a summary level to finer detail within a hierarchy. Start with annual sales, then expand to quarterly, then monthly, then daily to pinpoint when a trend shifted.
Roll-up does the reverse: aggregating fine-grained data into coarser categories. Combine daily figures into weekly totals, then monthly, then quarterly.
Pivot rotates the cube, swapping which dimensions appear as rows and columns. A financial analyst displays months as rows and regions as columns, then pivots to regions as rows and products as columns without re-running the query.
These operations work together. An analyst slices by a specific quarter, drills down into one region, then pivots to compare product categories side by side — all within seconds.
In modern OLAP systems, these operations map to SQL constructs rather than proprietary interfaces. A slice is a WHERE clause. A dice is multiple WHERE conditions. Drill-down and roll-up change the GROUP BY granularity. A pivot restructures the SELECT and GROUP BY. The conceptual vocabulary from the 1990s persists, but the implementation is now standard SQL running on columnar databases.
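That mapping can be shown concretely. The sketch below runs the classic operations as plain SQL against a toy in-memory SQLite table (table and column names are hypothetical); a columnar OLAP database executes the same statements, just far faster and at far larger scale:

```python
import sqlite3

# Illustrative fact table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (quarter TEXT, region TEXT, category TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    ("Q1", "NA", "Electronics", 100.0),
    ("Q1", "EU", "Electronics",  80.0),
    ("Q1", "NA", "Apparel",      40.0),
    ("Q2", "NA", "Electronics", 120.0),
])

# Slice: fix one dimension to a single value (a WHERE clause).
slice_q1 = conn.execute(
    "SELECT SUM(revenue) FROM sales WHERE quarter = 'Q1'").fetchone()[0]

# Dice: filter along several dimensions at once (multiple conditions).
dice = conn.execute(
    "SELECT SUM(revenue) FROM sales "
    "WHERE quarter = 'Q1' AND region = 'NA' AND category = 'Electronics'"
).fetchone()[0]

# Roll-up / drill-down: change the GROUP BY granularity.
by_quarter = conn.execute(
    "SELECT quarter, SUM(revenue) FROM sales GROUP BY quarter ORDER BY quarter"
).fetchall()
print(slice_q1, dice, by_quarter)
```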
OLAP architectures: MOLAP, ROLAP, and HOLAP #
These three architectures defined OLAP for two decades. Understanding them explains why the industry moved to modern columnar databases, and which ideas survived the transition.
What is MOLAP and when is it the right choice? #
MOLAP stands for Multidimensional OLAP. Data is pre-aggregated and stored in specialised multidimensional array structures (cubes) built offline before queries arrive. Query performance is exceptional because the answers are already computed: the system retrieves pre-built aggregates rather than calculating on demand.
This speed comes at a cost. Cubes must be rebuilt whenever underlying data changes, creating refresh downtime that can last hours. Scalability is limited by the physical storage required to hold all possible aggregations. Adding a new dimension or hierarchy requires redesigning and rebuilding the cube.
MOLAP works well for stable, well-understood datasets where schemas don't change frequently, such as financial consolidation or standard corporate reporting with a fixed set of KPIs. It is less suitable for real-time data or exploratory analysis where users ask unanticipated questions.
What is ROLAP and how does it differ from MOLAP? #
ROLAP stands for Relational OLAP. Instead of pre-computing cubes, ROLAP systems run analytical queries directly against relational databases at query time, computing aggregations dynamically. This approach scales to virtually unlimited data volumes because no pre-aggregation is required.
ROLAP queries are slower than MOLAP because work happens on demand rather than in advance. But the tradeoff is acceptable for many use cases, especially when combined with columnar storage and vectorised execution. ROLAP always queries current data, making it a natural fit for organisations that cannot tolerate staleness.
The principles underlying ROLAP evolved into the foundation of modern cloud data warehouses and columnar OLAP databases. Snowflake, BigQuery, and ClickHouse all extend ROLAP ideas with columnar compression, vectorised execution, and distributed computing to achieve both scale and speed.
What is HOLAP and when does the hybrid approach make sense? #
HOLAP stands for Hybrid OLAP. It combines MOLAP and ROLAP by storing aggregates in pre-computed multidimensional cubes while keeping detail-level data in relational tables. Queries on aggregated data hit the MOLAP layer (fast). Drill-through to detail records queries the relational layer (slower but complete).
In practice, HOLAP is complex to implement and maintain. Data consistency across two storage layers introduces synchronisation challenges. Determining which aggregations to pre-compute requires careful capacity planning.
HOLAP rarely appears as a standalone deployed system today, but the underlying idea — pre-computed aggregates for speed, raw detail for completeness — lives on inside modern columnar databases. In ClickHouse, continuous materialized views incrementally maintain pre-aggregated tables as data is inserted, serving the same role as the MOLAP layer but updated in real time rather than via batch refresh. The full detail data stays in the underlying MergeTree tables for drill-through. Both layers live in a single system with no synchronisation overhead.
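The hybrid idea is simple enough to sketch. This toy maintains a running aggregate on every insert while keeping the raw detail for drill-through; it illustrates the concept behind incrementally maintained materialized views, not any specific engine's implementation:

```python
# Two "layers" in one process: raw detail plus a pre-computed aggregate
# that is updated on write rather than rebuilt in batch.
detail = []   # raw rows, the detail layer
agg = {}      # category -> total revenue, the aggregate layer

def insert(category, revenue):
    detail.append((category, revenue))                # full detail retained
    agg[category] = agg.get(category, 0.0) + revenue  # aggregate kept current

for row in [("Electronics", 100.0), ("Apparel", 40.0), ("Electronics", 25.0)]:
    insert(*row)

# Fast path: read the pre-computed aggregate directly.
print(agg["Electronics"])  # 125.0
# Drill-through: scan the detail rows behind that number.
print([r for r in detail if r[0] == "Electronics"])
```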
How do the three OLAP architectures compare? #
| Dimension | MOLAP | ROLAP | HOLAP |
|---|---|---|---|
| Storage | Pre-computed cubes | Relational tables | Aggregates + relational detail |
| Query speed | Very fast | Moderate | Fast (aggregates), slower (detail) |
| Scalability | Limited | Unlimited | Scales to detail data size |
| Data freshness | Batch refresh | Real-time | Mixed |
| Flexibility | Low (schema-bound) | High (ad-hoc) | Medium |
| Current relevance | Declining | Underpins modern cloud DWH | Idea survives in materialized views |
OLAP vs OLTP: what is the difference? #
How do OLAP and OLTP workloads differ? #
OLAP and OLTP represent fundamentally different approaches to data processing, each optimised for distinct access patterns. OLAP systems handle analytical queries that scan millions or billions of rows, compute aggregations, and return results in seconds. OLTP systems handle transactional operations that read or write individual records in milliseconds, prioritising consistency and isolation. For a detailed comparison, see the full OLTP vs OLAP guide.
| Dimension | OLAP | OLTP |
|---|---|---|
| Purpose | Analytical queries, decision support | Transactional processing, operational data |
| Query type | Complex aggregations across millions/billions of rows | Simple reads/writes on individual records |
| Data volume | GB to PB | GB to low TB |
| Response time | Sub-second to minutes | Milliseconds |
| Schema design | Denormalised (star/snowflake schema) | Normalised (3NF/BCNF) |
| Storage format | Columnar | Row-oriented |
| Update pattern | Bulk loads, append-heavy | Row-level INSERT/UPDATE/DELETE |
| Concurrency | Tens to hundreds of analytical queries | Thousands to tens of thousands of transactions |
| Typical systems | ClickHouse, Snowflake, BigQuery, Redshift | PostgreSQL, MySQL, Oracle, SQL Server |
These differences make the workloads incompatible on shared hardware. Running analytical queries on a production transactional database destroys transaction performance. The industry standard is to separate concerns: OLTP for the source of truth, OLAP for analytics, connected by CDC or ETL. The modern composed architecture — PostgreSQL for transactions, ClickHouse for analytics, CDC for real-time synchronisation — gives each system what it needs without compromise. See Unifying OLTP and OLAP for the full analysis.
How has OLAP evolved? From cubes to columnar databases #
Why did traditional OLAP cubes fall out of favour? #
OLAP cubes dominated data analysis from the 1990s through the early 2010s. Tools like Essbase and SQL Server Analysis Services pre-aggregated data into multidimensional structures optimised for dimensional analysis.
Five factors drove cubes into decline. Data volume explosion was first: cubes required predefinition of all dimensions and hierarchies, and as organisations accumulated more data attributes, cube explosion became unmanageable. Second, refresh downtime. Building a cube from raw data took hours. As data freshness expectations shifted from weekly to daily to hourly, rebuild windows became unacceptable.
Third, rigidity. Adding a new dimension required rebuilding the entire cube. Fourth, proprietary lock-in: Essbase and SSAS were expensive specialised platforms with proprietary query languages. Fifth, SQL's dominance. SQL became the universal query language, and analysts preferred writing queries to navigating proprietary cube interfaces.
Modern columnar databases solved the underlying problem that cubes addressed. They process raw data fast enough that pre-aggregation is unnecessary for most workloads. A columnar database can compute aggregations across billions of rows in sub-second time, eliminating the need to predefine every possible query in a cube structure.
The shift also reflected a broader change in how organisations use data. In the cube era, a small team of BI specialists built cubes for a known set of reports. Today, hundreds of engineers, analysts, and product managers run ad-hoc queries across company data. The flexibility of SQL on a columnar database serves this self-service model far better than rigid cube structures ever could.
How do columnar databases accelerate OLAP queries? #
Columnar storage organises data by column instead of by row. In a table with 200 columns, a row-oriented database stores all 200 values for record one, then all 200 for record two. A columnar database stores all values from column one together, then all values from column two.
This difference transforms analytical performance. A query computing average customer age for orders over $10,000 touches three columns: age, order amount, and customer ID. A row-oriented database reads all 200 columns to access these three. A columnar database reads only those three.
For a billion-row table, this reduces I/O by roughly 98%. In practice, the result is 10–100x faster query execution. Columnar storage also enables aggressive compression because values within a column tend to be similar — customer ages cluster around certain ranges, geographic regions repeat, payment amounts concentrate in common bands. Compression ratios of 10:1 or higher are typical.
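The arithmetic behind that I/O figure is straightforward, and the query itself becomes a tight scan over a few contiguous arrays. A minimal sketch (uniform column widths assumed; data is illustrative):

```python
# Reading 3 of 200 equally sized columns touches 1.5% of the bytes,
# roughly a 98% reduction in I/O.
total_columns, needed_columns = 200, 3
io_fraction = needed_columns / total_columns
print(f"fraction of bytes read: {io_fraction:.1%}")  # 1.5%

# Columnar layout in miniature: each column is its own array, so the
# aggregation walks two short arrays instead of 200-field rows.
ages = [34, 29, 41, 57]                         # "age" column
amounts = [12000.0, 8000.0, 15000.0, 11000.0]   # "order amount" column

qualifying = [age for age, amt in zip(ages, amounts) if amt > 10_000]
avg_age = sum(qualifying) / len(qualifying)
print(avg_age)  # 44.0
```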
ClickHouse exemplifies columnar database performance. In benchmarks it scans over 1 billion rows per second on commodity hardware, and queries that take minutes on traditional databases complete in under a second.
What role do semantic layers play in modern OLAP? #
Semantic layers emerged to fill the gap left by cube decline. Tools like Looker, dbt metrics, and AtScale sit between columnar databases and business intelligence platforms, providing a business-friendly abstraction over raw data.
A semantic layer defines dimensions, hierarchies, and metrics in business terms. Instead of writing SQL that joins three tables and applies complex filtering, an analyst selects "customer segment" and "monthly revenue." The semantic layer translates these business concepts into SQL executed against the underlying database.
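The translation step can be sketched in a few lines. This is a deliberately minimal model of the idea (the metric and dimension names, and the generated SQL, are hypothetical); production semantic layers add joins, access control, and caching on top of the same principle:

```python
# Business terms map to SQL fragments; a query is assembled from
# the analyst's selections rather than hand-written.
METRICS = {"monthly_revenue": "SUM(revenue)"}
DIMENSIONS = {
    "customer_segment": "customer_segment",
    "month": "strftime('%Y-%m', order_date)",
}

def build_query(metric: str, dimension: str, table: str = "orders") -> str:
    """Translate a (metric, dimension) selection into executable SQL."""
    dim_expr = DIMENSIONS[dimension]
    return (f"SELECT {dim_expr} AS {dimension}, "
            f"{METRICS[metric]} AS {metric} "
            f"FROM {table} GROUP BY {dim_expr}")

sql = build_query("monthly_revenue", "customer_segment")
print(sql)
```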
The key difference from cubes is flexibility. Semantic layers query raw data in columnar databases rather than pre-aggregated structures. Adding a new dimension requires no rebuilds or downtime — just a change to the semantic model. Query performance stays fast because the underlying columnar database processes raw data efficiently.
This architecture represents the modern OLAP stack: raw data in a columnar database, a semantic layer for business-friendly abstractions, BI tools for visualisation. Organisations get OLAP's dimensional analysis without cube complexity.
The trend is accelerating. dbt's metrics layer and Looker's semantic modelling both gained significant adoption in 2024–2025. Open-source alternatives like Cube.js provide semantic layer capabilities without vendor lock-in. ClickHouse integrates natively with dbt through a supported adapter, so teams can define transformations and semantic models in dbt and execute them against ClickHouse. In every case, the analytical database handles the heavy computation while the semantic layer handles business logic and access control. The split mirrors the OLTP/OLAP separation: each component does what it does best.
What is real-time OLAP? #
What makes real-time OLAP different from traditional batch OLAP? #
Traditional OLAP runs on stale data. A nightly ETL job extracts data from production systems, transforms it, and loads it into a data warehouse. Analysts query data that is one day old. Strategic decisions get made on yesterday's information.
Real-time OLAP processes data within seconds of generation. The database must handle continuous high-throughput ingestion and sub-second analytical queries simultaneously — not one or the other. This enables use cases that batch OLAP cannot serve: fraud detection on transactions as they occur, live operational dashboards, dynamic pricing that reacts to inventory changes, and real-time personalisation.
The technical challenge is significant. Batch OLAP accepts hours of latency because it processes massive volumes infrequently. Real-time OLAP must ingest thousands or millions of events per second, index them immediately, and make them queryable within seconds. Traditional data warehouses designed for batch loading were not built for this. The full guide to real-time analytics covers the architecture in detail.
What latency and freshness should you expect from a real-time OLAP system? #
Data freshness and query latency exist on a spectrum. Understanding the tiers helps set realistic expectations.
Batch OLAP operates at the far end: data freshness ranges from hours to days. Near-real-time OLAP achieves freshness in the minutes range, with ETL jobs running every 5–15 minutes. Real-time OLAP delivers freshness in the seconds range — data ingested at 10:00:05 becomes queryable by 10:00:10. Streaming OLAP approaches sub-second windows, necessary for fraud detection and similar high-velocity use cases.
Query latency should be sub-second for common dashboards and a few seconds for complex ad-hoc analysis. A dashboard showing top products by sales volume should return in under one second. A detailed cohort analysis touching billions of rows might take a few seconds.
Modern columnar databases like ClickHouse achieve these latencies at massive scale, answering analytical queries over billions of rows in about a second while simultaneously ingesting millions of new rows per second.
Speed and efficiency are two sides of the same coin. The engines that deliver real-time query performance do so because they use hardware more efficiently — better compression, fewer bytes read from disk, more work done per CPU cycle. Even if sub-second latency is not a requirement, that efficiency translates directly into lower cost. A query that finishes in one second on a real-time OLAP engine might take thirty seconds on a batch-oriented warehouse — and the warehouse needs thirty times the compute to match the same throughput. For teams where speed itself is not the priority, the cost savings alone can justify choosing a more efficient engine. The fast query response becomes a free by-product.
How does change data capture (CDC) connect OLTP and OLAP in real time? #
Change data capture extracts row-level modifications from source databases as they happen. Every INSERT, UPDATE, and DELETE operation gets captured and streamed to a CDC pipeline. This replaces batch ETL with continuous synchronisation.
Traditional batch ETL extracts a full snapshot from production systems once daily. Changes made between batch windows are invisible to analysts. CDC captures changes continuously, reducing latency from hours to seconds and transferring only changed rows rather than entire datasets.
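The consumer side of a CDC pipeline reduces to applying a stream of change events to a replica. The event shapes below are hypothetical and heavily simplified (real CDC tools emit richer payloads with transaction IDs, log positions, and before/after images), but the apply logic is the essence:

```python
# Applying row-level change events to a replica keyed by primary key.
replica = {}  # primary key -> current row

def apply_change(event: dict) -> None:
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]     # upsert the latest image
    elif op == "delete":
        replica.pop(key, None)          # tolerate out-of-order deletes

events = [
    {"op": "insert", "key": 1, "row": {"status": "pending", "amount": 42.0}},
    {"op": "update", "key": 1, "row": {"status": "shipped", "amount": 42.0}},
    {"op": "insert", "key": 2, "row": {"status": "pending", "amount": 10.0}},
    {"op": "delete", "key": 2},
]
for event in events:
    apply_change(event)

print(replica)  # {1: {'status': 'shipped', 'amount': 42.0}}
```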
ClickHouse integrates with CDC tools like PeerDB to provide sub-10-second data freshness from PostgreSQL and other OLTP databases. CDC pipelines stream changes directly into ClickHouse tables, keeping analytics current with transactional data.
This "composed architecture" — dedicated OLTP + dedicated OLAP + CDC — is the modern standard for unifying transactional and analytical workloads. Each database optimises for its purpose without compromise, while users enjoy near-real-time analytical visibility into transactional data.
The practical impact is substantial. An organisation running PostgreSQL for its transactional application and ClickHouse for analytics gets sub-10-second data freshness without any changes to its application code. Engineers don't need to build custom ETL pipelines or schedule batch jobs. The CDC tool handles replication continuously. When a customer places an order in the transactional system, it appears in the analytics system within seconds, available for dashboards and ad-hoc queries.
Common OLAP use cases #
How is OLAP used for business intelligence and reporting? #
OLAP powers the dashboards and reports that guide business decisions across every department. Sales teams analyse revenue by product, region, and time period to identify which offerings drive growth. Marketing uses it to track campaign ROI and attribution across channels. Finance is one of the heaviest users: consolidating revenue, expenses, and profit margins by cost centre and business unit is a natural fit for OLAP's hierarchical aggregation. Executive KPI scorecards round out the picture, updating hourly or daily.
The biggest advantage is self-service analytics. Rather than submitting requests to IT and waiting days for a custom report, business analysts explore data interactively. They pivot tables, filter dimensions, drill down into anomalies, and answer follow-up questions in minutes. OLAP's multi-dimensional structure maps naturally to how businesses think about their data: dimensions like time, geography, product, and customer paired with measures like revenue, quantity, and cost.
Traditional BI tools like Tableau, Looker, Power BI, and Grafana all sit on top of OLAP backends, whether cloud data warehouses or real-time OLAP databases. The architecture is consistent: users interact with visual dashboards, those dashboards query the OLAP database, and results return in seconds or sub-seconds.
A retail company, for example, builds a self-service analytics layer where regional managers view their own stores' performance, filter by product category, compare against the previous year, and drill into individual store metrics. Without OLAP, each of these views requires a separate report request. With OLAP, the manager explores freely within a single dashboard, and the underlying database handles the computation in real time.
How does OLAP support ad-hoc analysis and data exploration? #
Ad-hoc analysis is unstructured exploration where the analyst does not know the question in advance. They start with a hypothesis, query the data, discover unexpected patterns, and ask new questions on the fly. OLAP enables this through fast interactive querying across arbitrary dimensions and aggregations.
An analyst starts with "What were our top 10 selling products last month?", then pivots to "Which regions drove the most growth?", then drills down into "Why did sales drop in the Northeast?" Each question requires a different slice of the data, different aggregations, and different filters. In traditional systems, each query takes minutes. In OLAP systems, answers come back in seconds, keeping the analyst in flow.
Modern OLAP databases achieve this speed through columnar storage and compression. Only the relevant columns are scanned. Vectorised query engines process batches of values at once rather than one row at a time, maximising CPU efficiency. The result: scanning billions of rows and returning aggregated results in sub-second time is routine.
What role does OLAP play in financial planning and forecasting? #
Financial planning relies on slicing budgets across time periods, departments, and cost centres. OLAP systems store historical actuals and approved budgets side by side, enabling variance analysis: actual spending versus planned spending, broken down by any dimension. A CFO sees that the Northeast region overspent by 15% while the West came in under budget, then drills down to which expense categories drove the variance.
Consolidation is another common use case in large enterprises with subsidiaries. Each subsidiary maintains its own ledger. Consolidation rolls these up to a group-wide view, eliminating intercompany transactions and ensuring consistent accounting policies. OLAP handles the hierarchical nature of this process efficiently.
Scenario modelling asks "what if" questions. "What if we increase prices by 10%? How does that affect gross margin?" "What if we shift 20% of budget from sales to product development?" OLAP systems store baseline actuals and allow analysts to create derived scenarios without modifying source data.
Rolling forecasts add another layer. Rather than annual budget cycles where plans go stale within weeks, companies forecast the next 12 or 24 months on a rolling basis. Each month, the oldest actuals drop off and new forecast months extend the horizon. OLAP handles this well because fact tables append new data incrementally and summary views refresh continuously. The system scales as history grows without requiring rearchitecture.
Risk management is a related use case in financial services. An institution uses OLAP to roll up millions of individual loan transactions to view risk exposure by region, then drill down into specific loan types or branches to identify concentrations. Interactive analysis at this scale is essential for quarterly regulatory reporting and internal risk committees.
How is OLAP used in operational analytics — logs, metrics, and events? #
Operational analytics is a growing OLAP use case. Modern OLAP databases ingest hundreds of millions to billions of log lines, metrics, and events per day from applications, infrastructure, and user interactions. Engineers query this data in real time to detect anomalies and understand system behaviour.
Observability queries ask about application health: error rates by service, 95th and 99th percentile latencies, slowest database queries, resource consumption by container. These questions require filtering trillions of events by service, endpoint, and time window, then computing percentiles and aggregations. The volume is massive — a moderate-scale cloud platform generates terabytes of observability data daily — and latency must be sub-second so engineers can investigate incidents as they happen.
Security analytics detects threats by analysing access patterns, failed authentication attempts, and anomalous data access across time windows. A security team queries "show me all accounts with more than 50 failed login attempts in the past hour, grouped by IP address and geography." This requires scanning billions of authentication events and returning results in seconds. That is exactly the access pattern OLAP was designed for.
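That query translates directly into a GROUP BY with a HAVING filter. The sketch below runs it against a toy in-memory SQLite table (schema and data are hypothetical, and the threshold is lowered from 50 to 3 so the tiny dataset triggers it):

```python
import sqlite3

# Hypothetical authentication-event table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE auth_events (ts TEXT, ip TEXT, country TEXT, outcome TEXT)")
conn.executemany("INSERT INTO auth_events VALUES (?,?,?,?)", [
    ("2026-05-01T10:01:00", "203.0.113.7",  "US", "failure"),
    ("2026-05-01T10:02:00", "203.0.113.7",  "US", "failure"),
    ("2026-05-01T10:03:00", "203.0.113.7",  "US", "failure"),
    ("2026-05-01T10:04:00", "198.51.100.2", "DE", "failure"),
    ("2026-05-01T10:05:00", "198.51.100.2", "DE", "success"),
])

# Accounts of interest: repeated failures within the time window,
# grouped by IP address and geography.
suspicious = conn.execute("""
    SELECT ip, country, COUNT(*) AS failures
    FROM auth_events
    WHERE outcome = 'failure'
      AND ts >= '2026-05-01T10:00:00'
    GROUP BY ip, country
    HAVING COUNT(*) >= 3
""").fetchall()
print(suspicious)  # [('203.0.113.7', 'US', 3)]
```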
Product analytics tracks user behaviour: feature adoption, onboarding drop-off, engagement trends, cohort retention. A product manager asks "what percentage of users who signed up last month completed onboarding within 7 days, broken down by acquisition channel?" This is a multi-dimensional aggregation across millions of user events. OLAP databases answer it instantly.
ClickHouse is widely adopted for operational analytics because it handles this volume and latency natively. The data model fits naturally: each event becomes a row in a fact table. Data is compressed aggressively, so terabytes of raw logs fit in gigabytes of storage. Queries are parallelised across CPU cores and, in distributed deployments, across nodes.
OpenAI and Anthropic both run their observability infrastructure on ClickHouse, ingesting petabytes of logs, metrics, and traces to monitor training runs, inference pipelines, and production services in real time.
LLM observability is an emerging subcategory. Unlike traditional infrastructure monitoring, it tracks non-deterministic model outputs — response quality, latency per token, hallucination rates, and cost per call. Langfuse, now part of ClickHouse, is an open-source platform purpose-built for this. It captures traces and evaluations from LLM applications and stores them natively in ClickHouse, giving teams real-time visibility into how their AI systems behave in production.
ClickStack packages ClickHouse, OpenTelemetry ingestion, and the HyperDX UI into a unified observability solution for logs, metrics, and traces. It is available as open source for self-hosted deployments and as a managed service in ClickHouse Cloud.
What do different OLAP use cases require? #
| Use case | Latency requirement | Data freshness | Concurrency | Typical data volume |
|---|---|---|---|---|
| Executive dashboards | 1–5 seconds | Hours (batch OK) | Low (10–50 users) | GB to TB |
| Self-service BI | Sub-second to seconds | Hours to minutes | Medium (50–500 users) | TB |
| Financial planning | Seconds | Daily/weekly batch | Low | GB to TB |
| Operational analytics | Sub-second | Seconds (real-time) | High (100–1,000+ queries/sec) | TB to PB |
| Product analytics | Sub-second | Seconds to minutes | High | TB to PB |
| Ad-hoc exploration | Seconds to minutes | Varies | Low | TB to PB |
OLAP database examples in 2026 #
The OLAP market has split into two broad categories. Cloud data warehouses — Snowflake, BigQuery, Redshift, Databricks, Microsoft Fabric — are managed platforms optimised for batch processing and periodic analysis. They handle petabyte-scale datasets, offer standard SQL, and remove operational burden. Their limitations are architectural: sub-second interactive queries at high concurrency are expensive, and real-time ingestion below minute latencies is not their primary design target.
Real-time OLAP databases — ClickHouse, Apache Pinot, Apache Druid — target a different design point: sub-second queries, high concurrency, and continuous ingestion. Many are open source and available both self-hosted and as managed cloud services.
For local OLAP work, DuckDB embeds an OLAP engine inside a Python process for single-machine analytics. ClickHouse offers equally accessible options: clickhouse-local runs queries without starting a server, chDB embeds the ClickHouse engine inside Python, and clickhousectl simplifies local development further. These options let teams start locally and scale to distributed production without changing query syntax or architecture.
How do OLAP databases compare? #
| Database | Query latency | Concurrency | Deployment | Deployment complexity | Open source |
|---|---|---|---|---|---|
| ClickHouse | Sub-second | Very high | Self-hosted, cloud | Low | Yes |
| Snowflake | Seconds to minutes | Low | Cloud only | Low | No |
| BigQuery | Seconds to minutes | Low | Cloud only (GCP) | Low | No |
| Redshift | Seconds to minutes | Low | Cloud only (AWS) | Low | No |
| Apache Pinot | Sub-second | Very high | Self-hosted | Very high | Yes |
| Apache Druid | Seconds | Medium | Self-hosted | Very high | Yes |
| DuckDB | Sub-second | Low | Embedded | Low | Yes |
How to choose an OLAP database: five evaluation criteria #
How should you measure query performance for OLAP? #
P95 and P99 latency under concurrent load matters far more than average latency. Some databases show excellent single-query benchmarks but degrade sharply when multiple queries run simultaneously. Test with your actual schema and queries, not synthetic data generated for marketing benchmarks.
Measure sustained performance over hours or days, not just initial runs. Results shift over time as background processes accumulate work or caches warm, so early numbers can flatter or understate a system. Single-query latency tells you almost nothing about production behaviour.
The best practice is to load a representative sample of your production data and execute your actual workload under realistic concurrency. Document your methodology so results remain reproducible. The full evaluation guide walks through this process in detail.
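The measurement loop described above can be sketched in a few lines. This is a minimal illustration, not a driver API: `run_query` is a stand-in for a real client call, and the simulated latencies are arbitrary.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(i: int) -> float:
    """Stand-in for a real database call; returns wall-clock latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.001 + (i % 10) * 0.0005)  # simulated variable query time
    return time.perf_counter() - start

def measure(concurrency: int, total_queries: int) -> dict:
    # Run the workload with a fixed number of in-flight queries
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(run_query, range(total_queries)))
    # quantiles(n=100) yields 99 cut points; index 94 is p95, index 98 is p99
    cuts = statistics.quantiles(latencies, n=100)
    return {"p50": statistics.median(latencies), "p95": cuts[94], "p99": cuts[98]}

stats = measure(concurrency=16, total_queries=200)
print({k: round(v, 4) for k, v in stats.items()})
```

Swapping the stub for your actual client and queries, and sweeping `concurrency` from your average to your peak, surfaces the tail-latency degradation that single-query benchmarks hide.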
What concurrency level does your workload need? #
Batch analytics operates at 1–10 concurrent queries, where individual query latency dominates the user experience. User-facing analytics and real-time dashboards often require hundreds to thousands of concurrent queries, where sustained throughput matters more than any single query's speed.
AI is accelerating this shift. A single natural language question from an AI agent doesn't generate one SQL query — it triggers dozens in rapid succession as the agent explores the dataset, tests hypotheses, and reasons through parallel possibilities. What looks like one user interaction becomes a burst of concurrent database queries. Internal analyst workloads now resemble external customer-facing workloads: high concurrency, low latency, interactive response times. Traditional data warehouses designed for batch-oriented ad-hoc analytics struggle with the simultaneous pressures of autonomous agents and real-time response expectations. This trend is reshaping what "adequate concurrency" means for OLAP systems.
Some databases excel at deep analytical queries at low concurrency. Others are built for high-concurrency workloads of simpler queries. Matching database design to your concurrency profile prevents expensive mistakes: a system optimised for 10 concurrent queries will collapse if you need 1,000.
Identify your peak concurrent query count, not your average. That number determines which systems are viable. A system that handles your average load comfortably but buckles under peak traffic creates exactly the kind of outage that erodes trust in analytics platforms.
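One way to turn peak traffic into a sizing number is Little's law: queries in flight ≈ arrival rate × latency. A sketch with purely illustrative figures (the workloads and numbers below are assumptions, not benchmarks):

```python
def peak_concurrency(peak_qps: float, p99_latency_s: float) -> float:
    """Little's law: average queries in flight = arrival rate x latency."""
    return peak_qps * p99_latency_s

# A dashboard fleet: 400 queries/sec at peak, p99 latency of 0.8 s
dashboards = peak_concurrency(peak_qps=400, p99_latency_s=0.8)  # 320 in flight

# AI agents: one question fans out into 40 queries over 2 seconds,
# so each active agent contributes ~20 QPS; assume 50 concurrent agents
agent_qps = 40 / 2
agents = peak_concurrency(peak_qps=agent_qps * 50, p99_latency_s=0.5)  # 500 in flight

print(dashboards, agents)
```

Note how tail latency feeds back into concurrency: if p99 doubles under load, the in-flight count doubles too, which is why systems degrade non-linearly near their limits.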
How fresh does your data need to be? #
The freshness spectrum runs from batch (hours/days), through near-real-time (minutes), to real-time (seconds), to streaming (sub-second). Each step up adds complexity and cost. Most BI dashboards work fine with hourly refreshes. Operational analytics and fraud detection need seconds. Financial trading and sensor monitoring need milliseconds.
Choose the freshness level that matches the business requirement, not the one that sounds most impressive. If a dashboard is correct within an hour, paying for sub-second freshness wastes money. If fraud must be caught within a two-minute window, hourly batch freshness is useless.
The ingestion method also matters. Batch loads are cheap but slow. Streaming from Kafka is fast but requires pipeline management. CDC from your source system balances cost and freshness well.
How should you evaluate total cost of ownership? #
Break costs into three categories: infrastructure (compute, storage, network), operations (managed service fees versus engineering time for self-hosted), and efficiency (cost per sustained query throughput, or $/QPS). Cloud managed services have higher per-query costs but eliminate operational burden. Self-hosted systems have lower per-query costs but require infrastructure expertise.
Calculate TCO for your actual workload, not theoretical maximums. Hidden costs include data egress fees (some cloud platforms charge heavily for moving data out), storage costs for intermediate tables or materialised views, and engineering time spent optimising queries or debugging performance.
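The three cost categories above can be collapsed into a single comparable number, cost per sustained QPS. A sketch with made-up figures (substitute your own measured costs and throughput):

```python
def tco_per_qps(infra_monthly: float, ops_monthly: float, sustained_qps: float) -> float:
    """Total monthly cost divided by sustained throughput gives $/QPS,
    comparable across managed and self-hosted options."""
    return (infra_monthly + ops_monthly) / sustained_qps

# Illustrative figures only -- not real vendor pricing.
managed = tco_per_qps(infra_monthly=12_000, ops_monthly=0, sustained_qps=200)
self_hosted = tco_per_qps(infra_monthly=4_000,
                          ops_monthly=6_000,   # a fraction of an engineer's time
                          sustained_qps=200)
print(f"managed: ${managed:.0f}/QPS, self-hosted: ${self_hosted:.0f}/QPS")
```

The point of the exercise is that the comparison flips depending on the ops line: at low engineering cost self-hosting wins, but once the operational burden exceeds the infrastructure saving, the managed option is cheaper.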
What SQL and data management features matter most? #
Standard SQL compliance, complex joins, window functions, and subqueries are table stakes. JSON and semi-structured data handling determines how easily you ingest varied data sources. Updates and deletes — some OLAP databases handle these efficiently while others require complex workarounds. Schema evolution without downtime prevents painful migration projects.
Compression and tiered storage reduce costs at scale. Evaluate against your actual requirements, not a feature checklist that includes capabilities you'll never use.
One often-overlooked feature is how the database handles schema evolution. Production OLAP systems change constantly: new columns get added, old ones get deprecated, data types get adjusted. A database that requires downtime or full table rewrites for schema changes creates operational pain that compounds over time. ClickHouse, for example, supports adding and removing columns without rewriting existing data, which keeps the system running as schemas evolve.
How can you structure the decision? #
| Requirement | Key metric | If yes, consider... |
|---|---|---|
| Sub-second queries at 1,000+ QPS | p99 latency under load | Real-time OLAP (ClickHouse, Pinot, Druid) |
| Ad-hoc analyst queries, seconds OK | Cost per query | Cloud data warehouse |
| Real-time data freshness (seconds) | Ingestion-to-query latency | Real-time OLAP with CDC |
| Single-machine local analysis | Startup time, memory efficiency | clickhouse-local, chDB, DuckDB |
| Minimal operational overhead | Managed service availability | ClickHouse Cloud, Snowflake, BigQuery |
| Open source, no vendor lock-in | Source availability, community size | ClickHouse, Druid, Pinot |
How does ClickHouse fit the OLAP landscape? #
Why is ClickHouse described as a real-time OLAP database? #
ClickHouse is a column-oriented SQL database management system designed specifically for online analytical processing. It combines the analytical depth of traditional OLAP with the speed and freshness of real-time systems: sub-second queries on billions of rows, continuous ingestion from streaming sources, and high concurrency for user-facing dashboards.
Traditional OLAP systems required batch loads every few hours or days. Modern applications need query results on freshly ingested data without that delay. ClickHouse eliminates the tradeoff. Data arrives continuously, and analytical queries run on the freshest data available, all within the same system.
ClickHouse can ingest millions of rows per second while simultaneously serving hundreds of analytical queries at sub-second latency. Traditional data warehouses optimise for one or the other. ClickHouse handles both.
What performance should you expect from ClickHouse on OLAP workloads? #
ClickHouse processes over 1 billion rows per second in analytical benchmarks. Columnar storage with advanced compression means queries read only the data they need, stored 5–10x more compactly than in row-oriented systems. Queries that take minutes in traditional databases often complete in milliseconds.
A well-tuned ClickHouse deployment compresses data aggressively (8:1 to 15:1 is common), meaning terabytes of raw data fit in hundreds of gigabytes of storage. Queries filter on the sorting key and skip entire granules (blocks of rows) that don't match, reading only relevant data. The technical deep dive explains the architecture behind this performance.
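The granule-skipping idea can be modelled in a few lines. This is a deliberately simplified toy (real sparse indexes store marks differently and combine them with compression), but it shows why a filter on the sorting key touches only a tiny fraction of the data:

```python
GRANULE_ROWS = 8192  # ClickHouse's default granule size

def build_index(sorted_keys):
    """Record the min and max key value per granule, as a sparse index does:
    one entry per block of rows, not one per row."""
    marks = []
    for i in range(0, len(sorted_keys), GRANULE_ROWS):
        block = sorted_keys[i:i + GRANULE_ROWS]
        marks.append((block[0], block[-1], i))
    return marks

def granules_to_read(marks, lo, hi):
    """Keep only granules whose [min, max] range overlaps the filter."""
    return [offset for mn, mx, offset in marks if mx >= lo and mn <= hi]

keys = list(range(1_000_000))            # data physically sorted by the key
marks = build_index(keys)
hits = granules_to_read(marks, lo=500_000, hi=500_100)
print(f"read {len(hits)} of {len(marks)} granules")
```

On a million sorted rows, a narrow range filter reads one granule out of 123; the other 99%+ of the table is never touched.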
How does ClickHouse handle data ingestion and freshness? #
ClickHouse supports both batch and streaming ingestion. For real-time pipelines, data ingested via Kafka, HTTP, or CDC tools becomes queryable within seconds. ClickPipes on ClickHouse Cloud provides managed ingestion from Kafka, S3, GCS, and other sources without maintaining pipeline infrastructure.
For OLTP-to-OLAP replication, PeerDB provides sub-10-second CDC from PostgreSQL, MySQL, and other relational databases. This eliminates the stale-data problem that plagued older OLAP architectures. Analysts query data that is current as of the last few seconds.
How does ClickHouse handle JOINs in OLAP queries? #
JOINs are production-ready in ClickHouse for a wide range of analytical workloads. ClickHouse supports all standard SQL join types (INNER, LEFT, RIGHT, FULL, CROSS) plus SEMI, ANTI, and ASOF joins for specialised use cases. Multiple join algorithms are available: hash, parallel hash, grace hash (which spills to disk for joins larger than memory), full sorting merge, partial merge, and direct.
Automatic join optimisation with statistics and global join reordering further improves performance. TPC-H join performance improved by nearly 4x during 2025 alone, with continued optimisations in 2026. Denormalisation remains a valid design choice for maximum single-query performance on specific access patterns, but JOINs are the right choice when they simplify schemas and make data easier to manage. Read the technical deep dive on JOIN support.
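The hash algorithm family listed above works the same way at any scale: build a hash table on one side, then probe it once per row of the other. A miniature illustration (generic pseudocode-style Python, not ClickHouse internals):

```python
from collections import defaultdict

def hash_join(fact_rows, dim_rows, fact_key, dim_key):
    """In-memory hash join: build a hash table on the (smaller) dimension
    side, then probe it once per fact-table row."""
    table = defaultdict(list)
    for row in dim_rows:                       # build phase
        table[row[dim_key]].append(row)
    for row in fact_rows:                      # probe phase
        for match in table.get(row[fact_key], []):
            yield {**row, **match}

sales = [{"product_id": 1, "amount": 30}, {"product_id": 2, "amount": 45}]
products = [{"product_id": 1, "category": "books"},
            {"product_id": 2, "category": "games"}]
joined = list(hash_join(sales, products, "product_id", "product_id"))
print(joined)
```

The variants in the list above adapt this scheme to constraints the toy ignores: parallel hash partitions the build across threads, and grace hash spills partitions to disk when the build side exceeds memory.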
How does ClickHouse handle semi-structured data? #
ClickHouse has a native JSON type (GA from version 25.3) that brings full analytical capability to semi-structured data. JSON fields can be queried, filtered, and aggregated using standard SQL syntax, including nested structures and arrays, without extraction functions or preprocessing.
Known paths can be materialised as physical columns, delivering the same query performance as typed columns. For highly dynamic data with unpredictable paths, ClickHouse supports shared data serialisation where dynamic paths are stored in columnar structures that support efficient data skipping without requiring every path to be materialised. This matters because modern OLAP workloads increasingly involve semi-structured data from APIs, event streams, and application logs.
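The idea of materialising known paths while leaving dynamic ones flexible can be modelled simply: frequent paths become per-path value lists (columns), and anything missing stays null. This is a toy model of the concept, not ClickHouse's storage format:

```python
import json

events = [
    '{"user": {"id": 7, "plan": "pro"}, "event": "click"}',
    '{"user": {"id": 9}, "event": "view", "extra": {"ab_test": "B"}}',
]

def materialise(raw_events, paths):
    """Pull known JSON paths out into per-path lists -- a sketch of storing
    frequent paths as real columns while unknown paths stay dynamic."""
    columns = {p: [] for p in paths}
    for raw in raw_events:
        doc = json.loads(raw)
        for path in paths:
            node = doc
            for part in path.split("."):
                node = node.get(part) if isinstance(node, dict) else None
                if node is None:
                    break
            columns[path].append(node)   # None where the path is absent
    return columns

cols = materialise(events, ["user.id", "user.plan", "event"])
print(cols)
```

Once a path is a column, it compresses and scans like any typed column, which is why queries on materialised paths match the performance of ordinary columns.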
What are ClickHouse's capabilities across OLAP evaluation criteria? #
| Evaluation criteria | ClickHouse capability |
|---|---|
| Query latency | Sub-second on billions of rows |
| Concurrency | Hundreds to thousands of concurrent queries |
| Data freshness | Seconds (via CDC, Kafka, ClickPipes) |
| SQL support | Full SQL with JOINs, window functions, subqueries, CTEs |
| Semi-structured data | Native JSON type with columnar storage (GA from 25.3) |
| Updates and deletes | Standard SQL UPDATE/DELETE, ReplacingMergeTree for CDC |
| Deployment | Open-source single binary, ClickHouse Cloud (managed) |
| Cost | Open source with zero licence cost; Cloud pricing based on compute and storage |
Best practices for OLAP implementations #
How should you design schemas for OLAP workloads? #
Start with the queries. OLAP schema design is query-driven: identify the most common access patterns and optimise the physical layout for those patterns. Star schemas work well as a logical model, but modern columnar databases like ClickHouse don't require strict star or snowflake schemas the way traditional ROLAP systems did.
The primary key (called the sorting key in ClickHouse) determines physical sort order and is the single biggest lever for query performance. A well-chosen sorting key allows the database to skip entire blocks of data during query execution. A poorly chosen one forces full table scans.
A common mistake is reaching for partitioning as a performance optimisation. In ClickHouse, partitioning is a data management feature, not a query performance feature. Use it for data lifecycle operations: TTL-based retention, bulk deletion by partition, and efficient data archival. For query performance, optimise the sorting key first — it is almost always the bigger lever.
What is the right approach to indexing and sort order in OLAP? #
The sorting key controls how data is physically arranged on disk. Queries that filter on sorting key columns benefit from data skipping: ClickHouse can skip entire granules (blocks of rows) that don't match the filter, reading only relevant data. This is far more efficient than traditional secondary indexes.
Choose sorting key columns based on the most common WHERE clause filters in your actual queries. Put the lowest-cardinality column first. The order matters: filtering first on a high-cardinality column wastes the skipping benefit. The guide to sparse primary indexes explains the mechanics in detail.
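The effect of column order can be demonstrated with a toy model: tiny granules, two candidate keys, and a count of how many granules a filter can skip under each. The granule size and data are illustrative, chosen only to make the effect visible:

```python
from itertools import product

GRANULE = 4  # tiny granule size so the effect shows on small data

def skippable(rows, filter_col, value):
    """Count granules a query filtering `filter_col == value` can skip,
    given rows already laid out in sorting-key order."""
    skipped = 0
    for i in range(0, len(rows), GRANULE):
        block = rows[i:i + GRANULE]
        if not any(r[filter_col] == value for r in block):
            skipped += 1
    return skipped

# Two columns: `status` (low cardinality) and `user_id` (high cardinality)
rows = [{"status": s, "user_id": u} for s, u in product(["err", "ok"], range(32))]

by_status_first = sorted(rows, key=lambda r: (r["status"], r["user_id"]))
by_user_first = sorted(rows, key=lambda r: (r["user_id"], r["status"]))

total = len(rows) // GRANULE
print("status-first key skips:", skippable(by_status_first, "status", "err"), "of", total)
print("user-first key skips:  ", skippable(by_user_first, "status", "err"), "of", total)
```

With the low-cardinality column first, equal values cluster into contiguous granules and half the table is skipped; with the high-cardinality column first, every granule contains a mix of statuses and nothing can be skipped.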
When should you denormalise vs use JOINs in OLAP? #
Both approaches are valid. Denormalisation (flattening dimension data into the fact table) eliminates JOINs at query time and can deliver the fastest single-query performance for a specific access pattern. But it increases storage, complicates data updates, and locks you into one schema design.
JOINs keep the schema normalised, make updates easier, and support multiple access patterns without duplicating data. In ClickHouse, JOINs are production-ready with multiple algorithms and automatic optimisation. Start with JOINs for flexibility. Denormalise specific tables only when benchmarks show a meaningful performance gap on critical queries. Explore best practices for joining tables.
How should you approach OLAP performance testing? #
Test with production-representative data, not synthetic datasets. Load at least 10% of your production volume, ideally more. Run the actual queries your application will execute, not generic benchmarks.
Measure p95 and p99 latency under concurrent load, not just single-query performance. Test sustained performance over hours, not minutes — some systems degrade as background processes accumulate or caches warm up in unexpected ways. Document your methodology so results remain reproducible. Use this guide to structure your evaluation.