OPTIMIZE TABLE ... FINAL in ClickHouse: when to use it, when to avoid it, and how merges work

Last updated: Jun 15, 2026

If you manage a ClickHouse cluster in production, you may have hit duplicate rows or the "too many parts" error. A search result, an older blog post, or a forum thread probably pointed you to OPTIMIZE TABLE ... FINAL. You ran it, the immediate symptom vanished, and you scheduled it in a cron job.

Weeks later, ClickHouse feels slower after OPTIMIZE, and your cluster's performance has degraded.

Key takeaways #

  • Don't blindly run OPTIMIZE TABLE ... FINAL on a schedule. It forces a physical rewrite of active parts in the targeted partition or partitions, consuming significant I/O and CPU.
  • OPTIMIZE ... FINAL bypasses the background merge scheduler's normal size guardrails, including max_bytes_to_merge_at_max_space_in_pool, which is typically around 150 GB.
  • On self-managed ReplicatedMergeTree, OPTIMIZE creates a replicated merge task. Depending on alter_sync, ClickHouse waits for execution on the current replica or all replicas. Other replicas may merge locally or fetch the merged part, depending on settings and replica state, which can add merge, network, and replication-queue pressure.
  • For current-state queries on ReplacingMergeTree, use SELECT ... FINAL because background merges provide eventual deduplication, not immediate query correctness. Enable do_not_merge_across_partitions_select_final=1 only when the partition key guarantees all versions of the same logical row stay in the same partition.
  • To prevent "too many parts," fix ingestion: batch inserts into larger blocks, target roughly one insert per second, or use asynchronous inserts when client-side batching is not practical.

Why scheduled OPTIMIZE TABLE ... FINAL backfires #

You might reach for OPTIMIZE TABLE ... FINAL in two common situations: when a ReplacingMergeTree query returns duplicate versions before background merges finish, or when inserts fail with a TOO_MANY_PARTS error because one partition has too many active parts.

Both symptoms create pressure for immediate relief, so forcing a merge can feel like the obvious next step.

The operation takes a while, but when it finishes, your part count drops and duplicates disappear. So you automate it to run nightly or hourly.

At first, the cluster seems healthier: part count drops, and duplicate versions in ReplacingMergeTree tables may disappear. But OPTIMIZE ... FINAL can create parts larger than the background merge scheduler would normally select.

New inserts still create new parts. If those new parts contain newer versions of rows already inside the oversized part, ClickHouse may not automatically merge them together. The result is that duplicates can accumulate again, while SELECT ... FINAL has to resolve more of that state at query time.

Very large parts can also make merge-dependent maintenance more expensive later. Lightweight deletes are reflected in query results when the delete completes, but ClickHouse physically removes the affected data later during background merges. If the affected rows live inside very large parts, that cleanup requires larger merge work. Heavier ALTER TABLE ... UPDATE and ALTER TABLE ... DELETE mutations can also compete with or wait behind merge activity, so they are best reserved for deliberate, large, often partition-aligned changes. During these rewrites, ClickHouse must keep old parts until the new parts are committed, so disk I/O and temporary disk usage can rise, and query latency can degrade.

The instinct to manually "vacuum" or compact a database is understandable. Many storage engines benefit from periodic operator-driven compaction.

In ClickHouse's MergeTree family, forcing large, unscheduled table rewrites usually works against the background merge scheduler. The scheduler's heuristics and safety limits are tuned to keep merges running as a resource-aware background process. Bypassing them shifts that work onto the same cluster serving live queries and ingestion.

What is a data part in ClickHouse (MergeTree) #

In ClickHouse, an INSERT into a MergeTree-family table writes data into a new immutable unit on storage called a data part.

A part is a collection of physical assets rather than a single file. In common wide-format parts, this includes compressed column files, mark files that map index granules to byte offsets, a sparse primary index, checksums, and metadata. Compact parts and newer storage formats can organize these files differently, but the core idea is the same: a part is a self-contained, immutable unit of sorted columnar data. You can read more about the structure in the table parts documentation.

Parts are immutable. Once written, the engine does not update them in place. It creates new parts, merges existing parts, and later removes parts that are no longer active.

Part names encode their history. The format is partition_id_minimum_block_number_maximum_block_number_level_data_version, where data_version is optional and is incremented when a mutation writes a new part. For example, a part named all_1_4_1_6 belongs to the all partition, covers the block-number range 1 through 4, has merge level 1, and carries data version 6. The level field tracks merge history only: level 0 means the part has not been produced by a merge, while a higher level indicates one or more merge generations in its lineage. Mutations are tracked separately in data_version, so a part can remain level 0 after being rewritten by a mutation.

To see what your active parts look like, query system.parts:

SELECT
    partition,
    name,
    level,
    rows,
    formatReadableSize(bytes_on_disk) AS size
FROM system.parts
WHERE table = 'your_table_name'
  AND active = 1
ORDER BY bytes_on_disk DESC;

Examining levels and sizes tells you whether ingestion is creating many fragmented, level-0 parts or whether background merges are consolidating them into larger, higher-level parts.

Healthy ClickHouse part management means allowing the engine to maintain a balanced distribution of parts continuously, not forcing every partition into a single huge part.

How ClickHouse background merges work (and why partitions matter) #

ClickHouse does not wait for manual intervention to clean up data. The engine continuously evaluates active parts in the background, selects candidates, and schedules merges to reduce fragmentation.

This process is scoped per partition. ClickHouse does not merge parts from two different partitions together.

The background merge scheduler balances reducing part count against write amplification. To protect the cluster from resource starvation, it operates under safety limits such as max_bytes_to_merge_at_max_space_in_pool.

This setting caps the total source-part size that an automatic background merge will select when enough resources are available. It roughly corresponds to the largest part size automatic merges normally create. The common default is around 150 GB.

Merging data requires reading source parts, merging rows according to the table engine's rules, writing compressed columnar output, and committing a replacement part. On large parts, ClickHouse can use vertical merge algorithms that process columns in a more memory-efficient way, but the operation still consumes CPU, memory, and I/O.

Merging multiple very large parts can create significant write amplification. That work competes with concurrent queries and ingestion for disk or object-storage bandwidth and can evict useful data from caches. The size limits keep normal background merging incremental and resource-aware.

OPTIMIZE TABLE in ClickHouse: what each variant does #

OPTIMIZE TABLE. Running the base command without modifiers asks ClickHouse to initiate an unscheduled merge. It may perform no merge if the table does not have useful candidates. By default, ClickHouse does not notify the client when no merge is performed unless optimize_throw_if_noop is enabled. On replicated tables, wait behavior depends on alter_sync.

OPTIMIZE TABLE ... PARTITION. This scopes the optimize request to a specific partition. It is most appropriate for bounded maintenance on a closed partition, such as a historical month of data that will not receive more writes.

OPTIMIZE TABLE ... FINAL. This forces optimization even if the data is already in one part. In practice, it forces ClickHouse to merge active parts in each targeted partition down to a single part, bypassing the normal automatic-merge size guardrails. It should be treated as an infrequent, intentional administrative operation, not routine maintenance.

Why OPTIMIZE TABLE ... FINAL is expensive #

ClickHouse must read the active source parts, apply the relevant MergeTree-family merge logic, write the merged output, and commit the new part. For large datasets, this is CPU- and I/O-intensive.

OPTIMIZE TABLE ... FINAL also bypasses the max_bytes_to_merge_at_max_space_in_pool safeguard. It can attempt merges that automatic background merging would not choose, limited primarily by available resources such as disk space.

Once very large parts exist, the background scheduler may not be able to merge future incoming parts with them. It will continue to merge smaller new parts where it can, but it may no longer be able to consolidate the whole partition automatically. For ReplacingMergeTree, this can mean duplicate versions accumulate again as new data arrives.

Why OPTIMIZE can add replication pressure #

For self-managed ReplicatedMergeTree, OPTIMIZE creates a replicated merge task. Depending on alter_sync, ClickHouse waits for execution on the current replica or all replicas.

Other replicas may merge locally or fetch the merged part, depending on settings and replica state. Either way, a forced large merge can add substantial CPU, disk, network, and replication-queue work. During that window, replicas can have different part structures, so queries may use different execution plans depending on which replica serves the read.

A cron-scheduled OPTIMIZE ... FINAL makes this recurring. The cluster repeatedly absorbs large administrative merge work, and inserts can see higher latency or failures under pressure.

ClickHouse Cloud uses SharedMergeTree with shared object storage and shared metadata, so it does not use the same explicit local-part replication model as self-managed ReplicatedMergeTree. The same principle still applies: forced large merges consume resources and should not be scheduled as routine maintenance.

OPTIMIZE TABLE ... FINAL vs. SELECT ... FINAL: what FINAL means #

OPTIMIZE TABLE ... FINAL is a storage operation. It rewrites physical data parts on disk or object storage.

SELECT ... FINAL is a query-time modifier. It applies the table engine's merge logic when reading data and does not alter the underlying parts.

Older guidance often recommended avoiding SELECT ... FINAL because earlier implementations were much more expensive. ClickHouse has since optimized FINAL significantly. The ClickHouse 23.12 release notes describe improvements that split data into intersecting and non-intersecting ranges so ClickHouse can process ranges that do not need finalization as normal reads. The 24.1 release notes describe additional vertical processing optimizations for FINAL with ReplacingMergeTree.

SELECT ... FINAL still has overhead, especially when a query must read many candidate rows that are not filtered by the sorting key. But it is the right tool when users need correct current-state results from unmerged ReplacingMergeTree data. It gives query-time correctness without forcing a physical rewrite of the storage layer.

How to deduplicate rows and handle ReplacingMergeTree safely #

Use ReplacingMergeTree for current-state deduplication when data arrives as immutable inserts. During background merges, ClickHouse identifies duplicates by the table's ORDER BY columns. If a version column is configured, it keeps the row with the highest version. Without a version column, replacement is based on merge order, so a version column is the safer pattern for update-style workloads.

Background merging is asynchronous, so you cannot rely on it for query-time correctness. Queries against unmerged data can still see multiple versions of the same logical row unless you use FINAL.

For better FINAL performance on partitioned tables, use do_not_merge_across_partitions_select_final when your schema guarantees that all versions of a logical row land in the same partition:

SELECT
    user_id,
    event_name,
    event_timestamp
FROM events_table FINAL
WHERE event_date = today()
SETTINGS do_not_merge_across_partitions_select_final = 1;

This setting tells ClickHouse to process partitions independently during SELECT ... FINAL, which can improve parallelism and reduce unnecessary cross-partition work. It is only correct when the partition key cooperates. If versions of the same row can land in different partitions, using this setting can produce incorrect current-state results.

If a dataset is large enough that SELECT ... FINAL is still too expensive for a hot query path, do not solve that by scheduling OPTIMIZE ... FINAL. Instead, shift the latest-state computation to ingestion time.

One common pattern is a materialized view into an AggregatingMergeTree table using argMaxState at ingest time and argMaxMerge at query time. That keeps a precomputed latest-state representation for fast reads, while the source ReplacingMergeTree can retain the raw immutable event stream and let background merges reclaim storage over time.

How to fix the "too many parts" error #

The "too many parts" error appears when the number of active parts in a partition exceeds the parts_to_throw_insert limit. This protects the cluster when inserts create parts faster than background merges can consolidate them. Check system.merge_tree_settings in your environment for the exact default in your ClickHouse version.

This is an ingestion problem, not a merge engine failure.

OPTIMIZE TABLE ... FINAL is not a fix. It rewrites a snapshot of the parts you already have, but it does not change the rate at which new parts arrive. The exception can return as soon as ingestion resumes, while your storage is left with very large parts that can make future operations harder.

Find the rapidly growing partition #

Run this diagnostic query to find where part growth is concentrated:

SELECT
    partition,
    count() AS active_parts,
    max(level) AS highest_level,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS total_size
FROM system.parts
WHERE table = 'your_table_name'
  AND active = 1
GROUP BY partition
HAVING active_parts > 150
ORDER BY active_parts DESC;

The 150 threshold is a monitoring heuristic, not a ClickHouse limit. It gives an early warning before inserts are delayed or rejected.

Fix excessive part creation with batch inserts #

The most effective client-side fix is batching inserts. ClickHouse recommends fairly large batches: at least 1,000 rows per insert, and ideally 10,000 to 100,000 rows, while keeping insert query frequency around once per second when using synchronous inserts.

Sending hundreds or thousands of tiny inserts per second creates too many parts. Every flush creates at least one new part per affected partition, and the merge scheduler then has to consolidate them.

Use asynchronous inserts to batch on the server #

If rewriting your upstream application to buffer data client-side is not practical, use server-side batching with asynchronous inserts.

Enable async_insert = 1 and use wait_for_async_insert = 1, which is the recommended production mode. ClickHouse buffers incoming inserts in memory and flushes them to storage when configurable thresholds are met, such as buffer size, elapsed time, or number of queued queries.

With wait_for_async_insert = 1, the client receives acknowledgment only after data is flushed to storage. That preserves durability expectations and lets your application receive insertion errors instead of hiding them in server logs.

Avoid over-partitioning #

Check your schema for over-partitioning. A common mistake is using a granular daily PARTITION BY key for low-volume data or for workloads that write across many dates at once.

If each insert touches many partitions, ClickHouse creates at least one part per partition per flush. The problem multiplies quickly.

Switching from daily toYYYYMMDD(date) partitions to monthly or yearly partitions can reduce part pressure when the workload does not require daily partition management. Monitor merge health with system.merges for in-flight merges and system.part_log for historical merge activity.

When to use OPTIMIZE TABLE, SELECT ... FINAL, or do nothing #

Trust background merges as your default. This is the right approach for most production workloads. The heuristics controlling merge selection and the limits protecting I/O bandwidth exist so operators do not have to intervene manually.

Use SELECT ... FINAL for reading deduplicated data. When you need a correct current-state view of a ReplacingMergeTree, apply FINAL at query time because background merges provide eventual deduplication, not immediate query correctness. Use partition-aware settings only when your partitioning scheme keeps every version of the same logical row in the same partition.

Use OPTIMIZE TABLE without FINAL as a merge request, not a guarantee. After a heavy backfill or maintenance event, a plain OPTIMIZE TABLE can ask ClickHouse to initiate an unscheduled merge. It may no-op if there are no useful merge candidates.

Use OPTIMIZE TABLE ... PARTITION for targeted maintenance. When an older partition is closed and will not receive more writes, optimizing that specific partition can be reasonable. It is more bounded than a table-wide forced rewrite.

Use OPTIMIZE TABLE ... FINAL only for bounded, one-time administrative operations. Reserve forced full rewrites for specific, scoped situations, such as collapsing all versions in a small, closed ReplacingMergeTree partition that will never receive further writes.
. Do not put it in a cron job or automated pipeline. In most cases, OPTIMIZE TABLE ... FINAL is probably the wrong lever to reach for. If in doubt, you should reach out to ClickHouse Support for guidance.

Conclusion #

ClickHouse is designed to manage its own storage. Background merges, immutable parts, and merge-size limits exist so operators do not have to run periodic forced rewrites.

Use SELECT ... FINAL for query-time correctness, fix ingestion when part counts climb, and reserve OPTIMIZE TABLE ... FINAL for bounded administrative operations. Trust the part lifecycle, and avoid putting forced rewrites in a cron job.

Whether you run ClickHouse yourself or use ClickHouse Cloud, the same principle applies. To evaluate ClickHouse Cloud, spin up a free trial, load as much of your own data as practical, run an evaluation at realistic scale, and compare against your existing system.

Frequently asked questions

Should I run OPTIMIZE TABLE ... FINAL regularly in ClickHouse?

No. OPTIMIZE ... FINAL forces a heavy rewrite of active parts in the targeted partition or partitions, bypasses automatic merge guardrails, and can add sustained CPU and I/O pressure.

What does OPTIMIZE TABLE ... FINAL actually do?

It forces ClickHouse to optimize even when data is already in one part. For MergeTree-family tables, that commonly means merging active parts in each targeted partition down to a single part.

Is OPTIMIZE TABLE replicated across replicas?

For self-managed ReplicatedMergeTree, OPTIMIZE creates a replicated merge task. Depending on alter_sync, ClickHouse waits for execution on the current replica or all replicas. Other replicas may merge locally or fetch the merged part, depending on settings and replica state. In ClickHouse Cloud, SharedMergeTree uses shared storage and shared metadata, so the self-managed local-part replication model does not apply in the same way.

Why can OPTIMIZE ... FINAL make performance worse over time?

It can create very large parts and substantial write amplification. Those large parts may be outside the size range automatic merges normally select, so future inserts can create new parts that are harder to consolidate with the largest existing parts.

What should I use instead of OPTIMIZE ... FINAL to remove duplicates in ReplacingMergeTree?

Use SELECT ... FINAL when you need correct query-time deduplication. Background merges will clean up storage over time, but they are asynchronous and should not be relied on for immediate query correctness.

How do I speed up SELECT ... FINAL on partitioned tables?

Use do_not_merge_across_partitions_select_final = 1 when all versions of the same logical row are guaranteed to stay within one partition. This lets ClickHouse process partitions independently during FINAL.

What causes the "too many parts" error?

Inserts are creating parts faster than background merges can consolidate them, usually because of too many small inserts or inserts that touch too many partitions.

What batch size should I target to avoid excessive part creation?

Use batches of at least 1,000 rows, ideally 10,000 to 100,000 rows, and keep synchronous insert frequency around once per second where possible.

Can ClickHouse batch small inserts automatically?

Yes. Use asynchronous inserts with async_insert = 1 and wait_for_async_insert = 1 so ClickHouse buffers small inserts and flushes them as larger parts based on configurable thresholds.

When is OPTIMIZE TABLE ... PARTITION appropriate?

It can be appropriate for bounded maintenance on a closed partition that will not receive more writes. It is safer than forcing a table-wide rewrite, but it should still be used deliberately.

Share this resource

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...