
June 2025 Newsletter

Mark Needham
Jun 10, 2025 - 8 minutes read

Hello, and welcome to the June 2025 ClickHouse newsletter!

This month, we’ve announced ClickStack, our new open-source observability solution. We also cover the mark cache, how the CloudQuery team built a full-text search engine with ClickHouse, building agentic applications with the MCP Server, analyzing FIX data, and more!

This month's featured community member is Joe Karlsson, Senior Developer Advocate at CloudQuery.

Joe is a seasoned developer advocate with over five years of experience building developer communities around cutting-edge data technologies. He has held roles at MongoDB, SingleStore, Tinybird, and now CloudQuery, where he specializes in creating technical content, proofs of concept, and educational resources that help developers effectively leverage modern data infrastructure tools.

Joe is a prolific writer in the data engineering space, covering everything from Kubernetes asset tracing to querying cloud infrastructure for expired dependencies. He's also shared his hands-on ClickHouse experience in How We Handle Billion-Row ClickHouse Inserts With UUID Range Bucketing and Six Months with ClickHouse at CloudQuery (The Good, The Bad, and the Unexpected).

➡️ Follow Joe on LinkedIn


25.5 release

ClickHouse 25.5 is here, and the vector similarity index has moved from experimental to beta.

We’ve also added Hive metastore catalog support, made clickhouse-local a bit easier to use (you can skip FROM and SELECT with stdin now), and made the Parquet reader handle Geo types.
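The headline change is the vector similarity index moving to beta. A minimal sketch of what using it looks like (hypothetical table; the exact index arguments have varied across recent ClickHouse versions, so check the docs for 25.5):

```sql
-- Hypothetical table with an HNSW-based vector similarity index (beta).
CREATE TABLE images
(
    id UInt64,
    embedding Array(Float32),
    INDEX idx_embedding embedding TYPE vector_similarity('hnsw', 'L2Distance', 3)
)
ENGINE = MergeTree
ORDER BY id;

-- A nearest-neighbour query of this shape can use the index:
SELECT id
FROM images
ORDER BY L2Distance(embedding, [0.1, 0.2, 0.3])
LIMIT 10;
```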

➡️ Read the release post

ClickStack: A high-performance OSS observability stack on ClickHouse

At the recent OpenHouse conference, Mike Shi announced ClickStack, our new open-source observability solution that delivers a complete, out-of-the-box experience for logs, metrics, traces, and session replay powered by ClickHouse's high-performance database technology.

This product announcement represents our increased investment in the observability ecosystem. It combines the ClickHouse columnar storage engine with a purpose-built UI from HyperDX - a company we recently acquired - to create an accessible, unified observability platform.

The stack is completed with native OpenTelemetry integration, providing standardized data collection that simplifies the instrumentation and ingestion of telemetry data from all your applications and services.

➡️ Read the blog post

Why (and how) CloudQuery built a full-text search engine with ClickHouse

Our featured community member, Joe Karlsson, and his colleague James Riley have published an insightful blog post detailing their innovative approach to implementing full-text search capabilities.

Rather than adding external search infrastructure like Elasticsearch or MeiliSearch, they built their search index directly within ClickHouse using ngrambf_v1 Bloom filter indices.

They also explain how they tuned performance, using multi-size ngram Bloom filters, weighted scoring, and thoughtful partitioning to support sub-400 ms search across more than 150 million rows. The post concludes with lessons learned, trade-offs around write performance, and a peek at upcoming features like LLM-based search and incremental indexing.
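For a sense of the building block involved (a sketch, not CloudQuery's actual schema), an ngrambf_v1 skipping index looks like this — its parameters are the ngram size, Bloom filter size in bytes, number of hash functions, and seed:

```sql
-- Hypothetical table: a 3-gram Bloom filter index for substring search.
CREATE TABLE search_items
(
    id UInt64,
    body String,
    INDEX idx_body_ngrams body TYPE ngrambf_v1(3, 10240, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY id;

-- The index lets ClickHouse skip granules that cannot contain the substring:
SELECT id
FROM search_items
WHERE body LIKE '%clickhouse%';
```

Bloom filters can produce false positives but never false negatives, which is why they work as a skipping index: a granule is only read if the filter says the ngrams might be there.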

➡️ Read the blog post

Mark Cache: The ClickHouse speed hack you’re not using (yet)

In his blog post on The New Stack, Anil Inamdar highlights the mark cache in ClickHouse.

This memory-resident mechanism stores metadata pointers that allow ClickHouse to quickly locate data without scanning or decompressing entire files, reducing query times and disk I/O for analytical workloads.

Anil explains how we can configure the size of this cache and then monitor performance using built-in metrics.
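The cache size is set server-side (the mark_cache_size setting in the server configuration), and its behavior can be observed from system tables. For example:

```sql
-- Current mark cache usage, from asynchronous metrics:
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE 'MarkCache%';

-- Cumulative hit/miss counters:
SELECT event, value
FROM system.events
WHERE event IN ('MarkCacheHits', 'MarkCacheMisses');
```

A high miss-to-hit ratio on a steady workload is a hint that the cache may be undersized for the number of parts and columns being read.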

➡️ Read the blog post

Building an agentic application with ClickHouse MCP Server

Lionel Palacin explores how agentic applications powered by LLMs can transform data interaction. Instead of clicking through filters and dropdowns, users can simply ask "Show me the price evolution in Manchester for the last 10 years" and get instant charts with explanations.

Lio takes us through the technical implementation using ClickHouse MCP Server and CopilotKit with React/Next.js, showing developers how to build their own conversational analytics experiences.

➡️ Read the blog post

Analyzing FIX Data With ClickHouse

Benjamin Wootton shows how we can use ClickHouse to analyze high-volume Financial Information eXchange (FIX) protocol data commonly used in capital markets trading.

Ben shows how to parse raw FIX messages using ClickHouse's built-in string and array functions, creating materialized views that incrementally process trade requests and confirmations. By joining this data with market prices and applying window functions, he calculates the financial impact of trade rejections on different banks' profit and loss positions.
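As a rough illustration of the parsing step (a sketch, not Ben's exact code; real FIX messages use the SOH control character as the delimiter, shown here as '|' for readability), tag=value pairs can be split into a map with string and array functions:

```sql
-- Turn a FIX message into a tag -> value map, then pull out fields by tag.
WITH
    '8=FIX.4.4|35=D|55=EURUSD|54=1|38=1000000' AS raw_msg,
    splitByChar('|', raw_msg) AS pairs,
    mapFromArrays(
        arrayMap(p -> splitByChar('=', p)[1], pairs),
        arrayMap(p -> splitByChar('=', p)[2], pairs)) AS fix
SELECT
    fix['35'] AS msg_type,   -- tag 35 = MsgType ('D' = New Order Single)
    fix['55'] AS symbol,     -- tag 55 = Symbol
    fix['38'] AS order_qty;  -- tag 38 = OrderQty
```

A materialized view over an ingest table can apply this extraction incrementally as messages arrive.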

➡️ Read the blog post

Building a scalable user segmentation pipeline with ClickHouse and Airflow - Part 1: Model Training

A/B Tasty is building a scalable, automated user segmentation pipeline using ClickHouse and Apache Airflow. In the first article of a two-part blog series, Jhon Steven Neira covers the model training phase that periodically learns the clusters (centroids) from user behavior data.

ClickHouse handles aggregating user behavior features and performing K-Means clustering in SQL, while Airflow ensures that training runs on schedule and that daily inference reliably uses the latest available model.

Steven provides a detailed walkthrough of implementing K-Means clustering in ClickHouse, demonstrating how to use aggregation states and materialized views to build an efficient segmentation system.
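To give a flavor of the approach (a simplified sketch with hypothetical tables, not the article's full pipeline), the K-Means assignment step — matching each user to its nearest centroid — maps naturally onto SQL:

```sql
-- user_features: one feature vector per user.
-- centroids: the current centroid vectors from the previous iteration.
SELECT
    u.user_id,
    argMin(c.centroid_id, L2Distance(u.features, c.vector)) AS nearest_centroid
FROM user_features AS u
CROSS JOIN centroids AS c
GROUP BY u.user_id;
```

The update step would then recompute each centroid as the element-wise mean of its assigned users' vectors (e.g. with the avgForEach aggregate combinator), and the loop repeats until the centroids stabilize.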

➡️ Read the blog post

ClickHouse in the wild: An odyssey through our data-driven marketing campaign in Q-Commerce

Parham Abbasi shares how Snapp! Market used ClickHouse to drive a personalized marketing campaign at scale. Millions of users were profiled using MBTI-style traits derived from real purchase behavior, like impulse levels, health focus, and price sensitivity.

The team used a multi-tiered data lake (Bronze-Silver-Gold) and ClickHouse’s ability to query Parquet directly to generate production-ready profiles. They also use the partial_merge join algorithm to keep memory use stable across multi-year datasets, enabling LLM-generated personas to be delivered at scale.
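The combination they describe can be sketched like this (hypothetical paths and columns): query Parquet files in place with the file table function, and pin the join algorithm to partial_merge so memory stays bounded on large inputs:

```sql
-- Join Parquet data lake files directly, without loading them into tables.
SELECT
    o.user_id,
    count() AS orders,
    any(p.segment) AS segment
FROM file('gold/orders/*.parquet', Parquet) AS o
INNER JOIN file('gold/profiles/*.parquet', Parquet) AS p
    ON o.user_id = p.user_id
GROUP BY o.user_id
SETTINGS join_algorithm = 'partial_merge';
```

The partial_merge algorithm sorts and spills the right-hand side in blocks instead of building one in-memory hash table, trading some speed for predictable memory use.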

➡️ Read the blog post

