ClickHouse as an Arrow Flight Server

Mark Needham

Arrow Flight is a high-performance RPC framework built on gRPC that's specifically designed for efficiently transferring large Arrow datasets between services. In this hands-on tutorial, we explore how to use ClickHouse as an Arrow Flight server, demonstrating both bulk data retrieval and streaming approaches. We'll work with a real dataset of over 30 million UK property records to show you the practical performance characteristics and capabilities you can expect.

The video walks through three approaches to querying data via Arrow Flight. First, we look at a simple bulk query that retrieves all 30 million rows in one go, which takes around 7-9 seconds. Then we stream the data in batches, which raises throughput to about 4.7 million rows per second and completes in roughly 6.5 seconds. Finally, we demonstrate how to compute aggregates on the fly while streaming, calculating running statistics such as the average and maximum property price as the data flows through.
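The third approach, running aggregates computed while batches stream in, can be sketched roughly as follows. This is a minimal illustration rather than the code from the video: the `RunningStats` class and `stream_price_stats` helper are hypothetical names, the `price` column is assumed, and it assumes the server accepts a SQL string as a Flight command descriptor. The aggregation logic itself is plain Python; only the client function needs `pyarrow`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RunningStats:
    """Running count, sum, and maximum, updated one batch at a time."""
    count: int = 0
    total: int = 0
    maximum: Optional[int] = None

    def update(self, prices: Iterable[Optional[int]]) -> None:
        # Fold one batch of prices into the running totals, skipping nulls.
        for price in prices:
            if price is None:
                continue
            self.count += 1
            self.total += price
            if self.maximum is None or price > self.maximum:
                self.maximum = price

    @property
    def average(self) -> Optional[float]:
        # None until at least one non-null value has been seen.
        return self.total / self.count if self.count else None


def stream_price_stats(location: str, query: str, column: str = "price") -> RunningStats:
    """Stream query results over Arrow Flight and aggregate batch by batch.

    Assumptions: `location` is something like "grpc://localhost:<port>", and
    the server takes the SQL text as a command descriptor. Authentication is
    omitted here.
    """
    import pyarrow.flight as flight  # third-party: pip install pyarrow

    client = flight.FlightClient(location)
    info = client.get_flight_info(flight.FlightDescriptor.for_command(query))
    stats = RunningStats()
    for endpoint in info.endpoints:
        # Each chunk wraps one RecordBatch; only one batch is held at a time.
        for chunk in client.do_get(endpoint.ticket):
            stats.update(chunk.data.column(column).to_pylist())
    return stats
```

Converting each batch with `to_pylist()` keeps the sketch easy to follow but is slow at this scale; for a 30-million-row stream, reducing each batch with `pyarrow.compute` functions such as `sum` and `max` and folding only the per-batch results into `RunningStats` would likely be much faster.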

Key Learning Points:

  • Setting up ClickHouse as an Arrow Flight server with basic configuration
  • Connecting to ClickHouse using the Arrow Flight client protocol
  • Bulk data retrieval vs. streaming approaches and their performance trade-offs
  • Working with Arrow Flight metadata and endpoints for distributed queries
  • Computing real-time aggregates during streaming without significant performance impact
  • Practical performance benchmarks with a 30+ million row dataset
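
To make the bulk-retrieval and benchmarking points concrete, here is a rough sketch of timing a bulk read and converting it to rows per second. The helper names `rows_per_second` and `timed_bulk_read` are hypothetical, and the sketch assumes the server accepts a SQL string as a Flight command descriptor; authentication and the server's port are left to the caller.

```python
import time


def rows_per_second(row_count: int, elapsed_seconds: float) -> float:
    """Throughput as rows divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return row_count / elapsed_seconds


def timed_bulk_read(location: str, query: str):
    """Fetch every endpoint's result in full and report throughput.

    Assumption: a single client at `location` can redeem every ticket. In a
    distributed setup each endpoint may list its own locations instead.
    """
    import pyarrow.flight as flight  # third-party: pip install pyarrow

    client = flight.FlightClient(location)
    info = client.get_flight_info(flight.FlightDescriptor.for_command(query))
    print(f"endpoints reported: {len(info.endpoints)}")

    started = time.perf_counter()
    # read_all() materialises each endpoint's full result as an Arrow table.
    tables = [client.do_get(ep.ticket).read_all() for ep in info.endpoints]
    elapsed = time.perf_counter() - started

    rows = sum(table.num_rows for table in tables)
    return rows, elapsed, rows_per_second(rows, elapsed)
```

As a sanity check on the figures quoted above, `rows_per_second(30_000_000, 6.5)` comes out at roughly 4.6 million, in line with the reported streaming rate for a dataset of just over 30 million rows.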