ClickHouse as an Arrow Flight Server
Mark Needham
Arrow Flight is a high-performance RPC framework built on gRPC and designed for transferring large Arrow datasets between services efficiently. In this hands-on tutorial, we explore how to use ClickHouse as an Arrow Flight server, demonstrating both bulk data retrieval and streaming. We work with a real dataset of over 30 million UK property records to show the performance and capabilities you can expect in practice.
The video walks through three approaches to querying data via Arrow Flight. First, we run a simple bulk query that retrieves all 30 million rows in one go, which takes around 7-9 seconds. Then we stream the data in batches, which reaches about 4.7 million rows per second and completes in roughly 6.5 seconds. Finally, we compute aggregates on the fly while streaming, calculating running statistics such as the average and maximum property price as the data flows through.
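The three approaches can be sketched with the `pyarrow.flight` Python client. This is a minimal sketch under stated assumptions, not the code from the video: the endpoint `grpc://localhost:9090`, the table `uk_price_paid`, the column `price`, and the idea that the server accepts the SQL text as a command descriptor are all illustrative guesses, and authentication is left out. The helper names (`RunningStats`, `open_readers`, `fetch_bulk`, `stream_stats`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumptions for illustration only (not taken from the video):
LOCATION = "grpc://localhost:9090"          # hypothetical host and port
QUERY = "SELECT price FROM uk_price_paid"   # hypothetical table and column


@dataclass
class RunningStats:
    """Running count, sum, and maximum, updated one batch at a time."""
    count: int = 0
    total: float = 0
    maximum: Optional[float] = None

    def update(self, count, total, maximum):
        if count == 0:
            return  # empty (or all-null) batch: nothing to fold in
        self.count += count
        self.total += total
        if self.maximum is None or maximum > self.maximum:
            self.maximum = maximum

    @property
    def average(self):
        return self.total / self.count if self.count else None


def open_readers(location=LOCATION, query=QUERY):
    """Yield one Flight stream reader per endpoint reported for the query."""
    import pyarrow.flight as flight  # imported lazily; requires pyarrow

    client = flight.FlightClient(location)
    # Assumption: the server accepts the SQL text as a command descriptor.
    descriptor = flight.FlightDescriptor.for_command(query)
    info = client.get_flight_info(descriptor)
    for endpoint in info.endpoints:
        yield client.do_get(endpoint.ticket)


def fetch_bulk():
    """Approach 1: materialise every endpoint's stream as one Arrow table."""
    import pyarrow as pa

    return pa.concat_tables([reader.read_all() for reader in open_readers()])


def stream_stats():
    """Approaches 2 and 3: read batch by batch and aggregate on the fly."""
    import pyarrow.compute as pc

    stats = RunningStats()
    for reader in open_readers():
        for chunk in reader:
            batch = chunk.data
            if batch is None:  # metadata-only message, no rows
                continue
            prices = batch.column("price")
            # pc.count ignores nulls, so an all-null batch is skipped by update().
            stats.update(
                pc.count(prices).as_py(),
                pc.sum(prices).as_py(),
                pc.max(prices).as_py(),
            )
    return stats
```

Keeping only a count, a sum, and a maximum means the streaming variant never holds more than one batch in memory, whereas `fetch_bulk` materialises the whole result before any work can start.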
Key Learning Points:
- Setting up ClickHouse as an Arrow Flight server with basic configuration
- Connecting to ClickHouse using the Arrow Flight client protocol
- Bulk data retrieval vs. streaming approaches and their performance trade-offs
- Working with Arrow Flight metadata and endpoints for distributed queries
- Computing real-time aggregates during streaming without significant performance impact
- Practical performance benchmarks with a 30+ million row dataset
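
To reproduce rows-per-second figures like the ones quoted here, a small timing helper is enough. The sketch below is hypothetical (the names `Throughput` and `measure_throughput` are not from the video) and works on any iterable of objects that expose `num_rows`, such as Arrow record batches.

```python
import time
from typing import Iterable, NamedTuple


class Throughput(NamedTuple):
    rows: int
    seconds: float

    @property
    def rows_per_second(self):
        return self.rows / self.seconds if self.seconds > 0 else float("inf")


def measure_throughput(batches: Iterable, clock=time.perf_counter) -> Throughput:
    """Consume the batches, counting rows and elapsed wall-clock time."""
    start = clock()
    rows = 0
    for batch in batches:
        rows += batch.num_rows
    return Throughput(rows, clock() - start)
```

As a sanity check on the quoted numbers, 4.7 million rows per second sustained for 6.5 seconds corresponds to roughly 30.5 million rows, which is consistent with a dataset of "over 30 million" records.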