Over the last year, we’ve noticed a strong pattern among customers who’ve migrated their analytics workloads to ClickHouse Cloud: after self-hosted ClickHouse, PostgreSQL is the most common source of migrations. ClickPipes made data replication and migration easy for these use cases. However, we found that users still face significant challenges migrating queries and application code from PostgreSQL to ClickHouse. To address this, a few months ago we started looking into ways to simplify and reduce the time required to migrate analytical queries from PostgreSQL to ClickHouse.
Today, we’re pleased to release pg_clickhouse v0.1.0, an Apache 2-licensed PostgreSQL extension to transparently execute analytics queries on ClickHouse directly from PostgreSQL.
Download pg_clickhouse from GitHub or PGXN.
Or kick the tires by spinning up a Docker instance:
```bash
docker run --name pg_clickhouse -e POSTGRES_PASSWORD=my_pass \
  -d ghcr.io/clickhouse/pg_clickhouse:18
```
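Once the container is running, you can open a psql session inside it. The image accepts the official postgres image’s environment variables, so we assume its conventions here, including the default postgres superuser:

```bash
# Open an interactive psql session in the running container;
# "postgres" is the default superuser in the official images.
docker exec -it pg_clickhouse psql -U postgres
```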
Consider starting with the tutorial, or watch Sai run through it in the video below.
Goals #
Consider the common case where an organization builds an application backed by Postgres, covering not only business data and transaction processing but also logging and metrics. As the product grows, user traffic and data volume increase exponentially. As a result, the analytical queries powering both real-time customer-facing features and observability systems begin to slow down.
Developers frequently mitigate these issues with PostgreSQL read replicas, a provisional solution at best. Eventually, they look to move the workload to a specialized analytics database such as ClickHouse. ClickPipes powers rapid data migration, but what to do with the existing PostgreSQL queries against that data, often generated by SQL libraries or ORMs?
The time-consuming part isn’t moving data; ClickPipes has that bit nailed. It’s rewriting months or years of analytics SQL baked into dashboards, ORMs, and cron jobs.
We imagined a PostgreSQL extension to reduce the need to migrate those queries at all, where one could follow the data migration with the workload migration simply by pointing those queries at a new Postgres database or schema.
Thus we set forth to build pg_clickhouse with a number of goals in mind:
- Provide ClickHouse query execution from PostgreSQL
- Allow existing PostgreSQL queries to run unmodified
- Push down query execution to ClickHouse
- Create a foundation for continual query and pushdown evolution
What if ClickHouse tables looked just like regular PostgreSQL tables? Say they live in a separate schema from the existing Postgres analytics tables but provide an identical structure. This pattern would allow existing queries to work as before, with just a change to search_path.
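As a minimal sketch, assume the application’s tables live in a schema named analytics, and foreign tables with the same structure have been imported into a schema named clickhouse (both schema names are illustrative):

```sql
-- Before: unqualified table names resolve to the Postgres tables.
SET search_path = analytics, public;

-- After the data migration: point the same queries at the
-- ClickHouse-backed foreign tables instead.
SET search_path = clickhouse, public;

-- This unmodified application query now executes on ClickHouse.
SELECT count(*) FROM uk_price_paid;
```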
History #
The SQL/MED standard addresses exactly this use case by providing for external data management via SQL; PostgreSQL implements it with database extensions called foreign data wrappers. PostgreSQL has supported foreign data wrappers since version 9.1 back in 2011, and a robust array of “FDW” extensions, as they’re commonly called, has grown in the ensuing years.
We cast about for existing solutions, and quickly found and validated clickhouse_fdw, developed by Ildus Kurbangaliev for Adjust based on initial work by Ibrar Ahmed at Percona. It supports not only raw data access but also query pushdown, including for some JOINs and aggregate functions.
The project originated in 2019 from a fork of postgres_fdw, the canonical reference implementation for PostgreSQL FDWs, as well as a fork of the ClickHouse C++ library. Sadly, it has seen only basic maintenance work since late 2020, mostly patches to ensure it works with newer versions of PostgreSQL. clickhouse_fdw was a great start, but hasn’t benefited from recent pushdown improvements in the PostgreSQL FDW API, including support for advanced aggregations, SEMI-JOINs, subqueries and more. It also lacked testing and support for platforms beyond Linux and lagged updates for new major PostgreSQL releases. After chatting about it with Ildus, we imported much of the functionality into a new project, pg_clickhouse, keeping the Apache 2 license for consistency of distribution.
Improvements #
While clickhouse_fdw and its predecessor, postgres_fdw, provided the foundation for our FDW, we set out to modernize the code and build process, to fix bugs and address shortcomings, and to engineer it into a complete product featuring near-universal pushdown for analytics queries and aggregations.
Such advances include:
- Adopting the standard PGXS build pipeline for PostgreSQL extensions
- Adding prepared INSERT support to, and adopting the latest supported release of, the ClickHouse C++ library
- Creating test cases and CI workflows to ensure it works on PostgreSQL versions 13-18 and ClickHouse versions 22-25
- Supporting TLS-based connections for both the binary protocol and the HTTP API, required for ClickHouse Cloud (see the setup sketch below)
- Adding Bool, Decimal, and JSON support
- Transparently pushing down aggregate functions, including ordered-set aggregates like percentile_cont()
- Pushing down SEMI JOINs
These last two features, transparent aggregate pushdown and SEMI JOIN pushdown, significantly advance the fitness of a foreign data wrapper for an analytics database. After all, the entire point is to benefit from the execution speed of running analytics workloads on a specialized and efficient engine. It wouldn’t be much use if it just returned millions of rows for PostgreSQL to aggregate.
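For orientation, here is roughly what a TLS-enabled ClickHouse Cloud setup looks like using the standard SQL/MED statements. The extension, wrapper, and option names in this sketch are illustrative assumptions; the tutorial has the exact incantation:

```sql
-- Load the extension (name assumed from the project name).
CREATE EXTENSION pg_clickhouse;

-- Foreign server pointing at a ClickHouse Cloud service over TLS.
-- The wrapper name and option keys (host, port) are assumptions
-- for illustration; consult the docs for the real ones.
CREATE SERVER clickhouse_svr
  FOREIGN DATA WRAPPER clickhouse_fdw
  OPTIONS (host 'abc123.clickhouse.cloud', port '9440');

-- ClickHouse credentials for the local role.
CREATE USER MAPPING FOR CURRENT_USER
  SERVER clickhouse_svr
  OPTIONS (user 'default', password 'secret');

-- Mirror the remote tables into a local schema.
CREATE SCHEMA clickhouse;
IMPORT FOREIGN SCHEMA "default"
  FROM SERVER clickhouse_svr INTO clickhouse;
```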
Aggregate Pushdown #
Ordered-set aggregates are some of the hardest functions to map between engines because the syntax does not translate directly. Ideally, an aggregate function fully pushes down to ClickHouse for efficient execution. Consider this query adapted from our HouseClick project:
```sql
SELECT
    type,
    round(min(price)) + 100 AS min,
    round(max(price)) AS max,
    round(percentile_cont(0.5) WITHIN GROUP (ORDER BY price)) AS median,
    round(percentile_cont(0.25) WITHIN GROUP (ORDER BY price)) AS "25th",
    round(percentile_cont(0.75) WITHIN GROUP (ORDER BY price)) AS "75th"
FROM
    uk.uk_price_paid
GROUP BY
    type
```
This query uses three aggregate functions. min() and max() push down automatically, as they have the same names in both ClickHouse and PostgreSQL. But not percentile_cont(), which returns the value at the given fraction of the ordered values, interpolating between adjacent values when necessary. No such function exists in ClickHouse, nor does ClickHouse support the WITHIN GROUP (ORDER BY x) ordered-set aggregate syntax.
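To see the interpolation concretely, the median of four values falls midway between the two middle ones:

```sql
-- Returns 2.5, halfway between the middle values 2 and 3.
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY v)
  FROM (VALUES (1), (2), (3), (4)) AS t(v);
```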
ClickHouse does, however, provide parametric aggregate functions, including quantile, which implement a subset of the ordered-set aggregate functionality. Thus, pg_clickhouse rewrites this query for ClickHouse as:
```sql
SELECT
    type,
    (round(min(price)) + 100),
    round(max(price)),
    round(quantile(0.5)(price)),
    round(quantile(0.25)(price)),
    round(quantile(0.75)(price))
FROM
    uk.uk_price_paid
GROUP BY
    type
```
Note the transparent conversion of percentile_cont()’s direct arguments (0.5, 0.25, 0.75) to quantile()’s parametric constants, and of the ORDER BY arguments to function arguments:
```text
percentile_cont(0.5) WITHIN GROUP (ORDER BY price) => quantile(0.5)(price)
```
More than that, pg_clickhouse, like clickhouse_fdw before it, translates PostgreSQL aggregate FILTER (WHERE) expressions to ClickHouse -If combinators. Here’s the full PostgreSQL query from HouseClick:
```sql
SELECT
    type,
    round(min(price)) + 100 AS min,
    round(min(price) FILTER (WHERE town='ILMINSTER' AND district='SOUTH SOMERSET' AND postcode1='TA19')) AS min_filtered,
    round(max(price)) AS max,
    round(max(price) FILTER (WHERE town='ILMINSTER' AND district='SOUTH SOMERSET' AND postcode1='TA19')) AS max_filtered,
    round(percentile_cont(0.5) WITHIN GROUP (ORDER BY price)) AS median,
    round(percentile_cont(0.5) WITHIN GROUP (ORDER BY price) FILTER (WHERE town='ILMINSTER' AND district='SOUTH SOMERSET' AND postcode1='TA19')) AS median_filtered,
    round(percentile_cont(0.25) WITHIN GROUP (ORDER BY price)) AS "25th",
    round(percentile_cont(0.25) WITHIN GROUP (ORDER BY price) FILTER (WHERE town='ILMINSTER' AND district='SOUTH SOMERSET' AND postcode1='TA19')) AS "25th_filtered",
    round(percentile_cont(0.75) WITHIN GROUP (ORDER BY price)) AS "75th",
    round(percentile_cont(0.75) WITHIN GROUP (ORDER BY price) FILTER (WHERE town='ILMINSTER' AND district='SOUTH SOMERSET' AND postcode1='TA19')) AS "75th_filtered"
FROM
    uk.uk_price_paid
GROUP BY
    type
```
Run it with EXPLAIN to see the query plan:
```text
                    QUERY PLAN
---------------------------------------------------
 Foreign Scan  (cost=1.00..-0.90 rows=1 width=112)
   Relations: Aggregate on (uk_price_paid)
```
Fully pushed down! This rewrite avoids shipping millions of rows back to PostgreSQL and keeps the heavy work inside ClickHouse. Use EXPLAIN (VERBOSE) to also output the query sent to ClickHouse (reformatted here):
```sql
SELECT
    type,
    (round(min(price)) + 100),
    round(minIf(price, ((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0))),
    round(max(price)),
    round(maxIf(price, ((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0))),
    round(quantile(0.5)(price)),
    round(quantileIf(0.5)(price, ((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0))),
    round(quantile(0.25)(price)),
    round(quantileIf(0.25)(price, ((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0))),
    round(quantile(0.75)(price)),
    round(quantileIf(0.75)(price, ((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0)))
FROM
    uk.uk_price_paid
GROUP BY
    type;
```
Note that each FILTER (WHERE) expression has been converted to an -If-suffixed ClickHouse function, which computes the equivalent filtering. In other words, this expression:
```sql
min(price) FILTER (WHERE town='ILMINSTER' AND district='SOUTH SOMERSET' AND postcode1='TA19')
```
Becomes:
```sql
minIf(price, ((((town = 'ILMINSTER') AND (district = 'SOUTH SOMERSET') AND (postcode1 = 'TA19'))) > 0))
```
SEMI JOIN Pushdown #
As we nailed down the basics of pg_clickhouse, we began testing pushdown against TPC-H, the venerable “decision support workload” database benchmark, loaded into ClickHouse at scaling factor 1. At first, having added support for the Decimal type, 10 of the 22 queries ran quickly, but only 3 of those fully pushed down from pg_clickhouse foreign tables to ClickHouse. One was Query 3, whose joins pushed down completely:
```sql
EXPLAIN (ANALYZE, COSTS)
-- using default substitutions
select
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    o_orderdate,
    o_shippriority
from
    customer,
    orders,
    lineitem
where
    c_mktsegment = 'BUILDING'
    and c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate < date '1995-03-15'
    and l_shipdate > date '1995-03-15'
group by
    l_orderkey,
    o_orderdate,
    o_shippriority
order by
    revenue desc,
    o_orderdate
LIMIT 10;
```
```text
                                             QUERY PLAN
---------------------------------------------------------------------------------------------------
 Foreign Scan  (cost=0.00..-10.00 rows=1 width=44) (actual time=60.146..60.162 rows=10.00 loops=1)
   Relations: Aggregate on (((customer) INNER JOIN (orders)) INNER JOIN (lineitem))
   FDW Time: 0.106 ms
 Planning:
   Buffers: shared hit=230
 Planning Time: 6.973 ms
 Execution Time: 61.567 ms
(7 rows)
```
Many of those that failed to push down, however, use JOINs to subqueries or EXISTS subqueries in WHERE clauses. Query 4 offers a perfect example of the latter case (with ANALYZE disabled, because it took too long):
```sql
EXPLAIN (COSTS, VERBOSE, BUFFERS)
-- using default substitutions
select
    o_orderpriority,
    count(*) as order_count
from
    orders
where
    o_orderdate >= date '1993-07-01'
    and o_orderdate < date(date '1993-07-01' + interval '3 month')
    and exists (select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate)
group by
    o_orderpriority
order by
    o_orderpriority;
```
```text
                                          QUERY PLAN
------------------------------------------------------------------------------------------------
 Sort  (cost=-80.86..-80.36 rows=200 width=40)
   Output: orders.o_orderpriority, (count(*))
   Sort Key: orders.o_orderpriority
   ->  HashAggregate  (cost=-90.50..-88.50 rows=200 width=40)
         Output: orders.o_orderpriority, count(*)
         Group Key: orders.o_orderpriority
         ->  Nested Loop  (cost=3.50..-93.00 rows=500 width=32)
               Output: orders.o_orderpriority
               Join Filter: (orders.o_orderkey = lineitem.l_orderkey)
               ->  HashAggregate  (cost=2.50..4.50 rows=200 width=4)
                     Output: lineitem.l_orderkey
                     Group Key: lineitem.l_orderkey
                     ->  Foreign Scan on tpch.lineitem  (cost=0.00..0.00 rows=0 width=4)
                           Output: lineitem.l_orderkey, lineitem.l_partkey, lineitem.l_suppkey, lineitem.l_linenumber, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, lineitem.l_tax, lineitem.l_returnflag, lineitem.l_linestatus, lineitem.l_shipdate, lineitem.l_commitdate, lineitem.l_receiptdate, lineitem.l_shipinstruct, lineitem.l_shipmode, lineitem.l_comment
                           Remote SQL: SELECT l_orderkey FROM tpch.lineitem WHERE ((l_commitdate < l_receiptdate))
               ->  Foreign Scan on tpch.orders  (cost=1.00..-0.50 rows=1 width=36)
                     Output: orders.o_orderkey, orders.o_custkey, orders.o_orderstatus, orders.o_totalprice, orders.o_orderdate, orders.o_orderpriority, orders.o_clerk, orders.o_shippriority, orders.o_comment
                     Remote SQL: SELECT o_orderkey, o_orderpriority FROM tpch.orders WHERE ((o_orderdate >= '1993-07-01')) AND ((o_orderdate < '1993-10-01')) ORDER BY o_orderpriority ASC NULLS LAST
 Planning:
   Buffers: shared hit=236
(20 rows)
```
Two foreign scans deep in the plan will never be efficient for such a simple query!
We’ve started to address these cases in two ways:
- Setting costs to encourage the PostgreSQL planner to push down queries, as befits analytics use cases
- More importantly, adding support for SEMI JOIN pushdown
These changes brought efficient execution (less than 1s) to 21 of the 22 queries, and full pushdown to 12, including Query 4:
```sql
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
-- using default substitutions
select
    o_orderpriority,
    count(*) as order_count
from
    orders
where
    o_orderdate >= date '1993-07-01'
    and o_orderdate < date(date '1993-07-01' + interval '3 month')
    and exists (select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate)
group by
    o_orderpriority
order by
    o_orderpriority;
```
```text
                                          QUERY PLAN
------------------------------------------------------------------------------------------------
 Foreign Scan  (cost=1.00..5.10 rows=1000 width=40) (actual time=51.835..51.847 rows=5.00 loops=1)
   Output: orders.o_orderpriority, (count(*))
   Relations: Aggregate on ((orders) LEFT SEMI JOIN (lineitem))
   Remote SQL: SELECT r1.o_orderpriority, count(*) FROM tpch.orders r1 LEFT SEMI JOIN tpch.lineitem r3 ON (((r3.l_commitdate < r3.l_receiptdate)) AND ((r1.o_orderkey = r3.l_orderkey))) WHERE ((r1.o_orderdate >= '1993-07-01')) AND ((r1.o_orderdate < '1993-10-01')) GROUP BY r1.o_orderpriority ORDER BY r1.o_orderpriority ASC
   FDW Time: 0.056 ms
 Planning:
   Buffers: shared hit=242
 Planning Time: 6.583 ms
 Execution Time: 54.937 ms
(9 rows)
```
This table compares query performance between regular PostgreSQL tables, pg_clickhouse prior to the introduction of SEMI JOIN pushdown, and pg_clickhouse with SEMI JOIN pushdown (as released today). The tests ran against PostgreSQL and ClickHouse tables loaded with TPC-H data at scaling factor 1; ✅ indicates full pushdown, while a dash indicates the query was canceled after one minute:
| Query | Postgres Runtime | Original Runtime | SEMI JOIN Runtime |
|---|---|---|---|
| 1 | 4478ms | ✅ 82ms | ✅ 73ms |
| 2 | 560ms | - | - |
| 3 | 1454ms | ✅ 74ms | ✅ 74ms |
| 4 | 650ms | - | ✅ 67ms |
| 5 | 452ms | - | ✅ 104ms |
| 6 | 740ms | ✅ 33ms | ✅ 42ms |
| 7 | 633ms | - | ✅ 83ms |
| 8 | 320ms | - | ✅ 114ms |
| 9 | 3028ms | - | ✅ 136ms |
| 10 | 6ms | 10ms | ✅ 10ms |
| 11 | 213ms | - | ✅ 78ms |
| 12 | 1101ms | 99ms | ✅ 37ms |
| 13 | 967ms | 1028ms | 1242ms |
| 14 | 193ms | 168ms | ✅ 51ms |
| 15 | 1095ms | 101ms | 522ms |
| 16 | 492ms | 1387ms | 1639ms |
| 17 | 1802ms | - | 9ms |
| 18 | 6185ms | - | 10ms |
| 19 | 64ms | 75ms | 65ms |
| 20 | 473ms | - | 4595ms |
| 21 | 1334ms | - | 1702ms |
| 22 | 257ms | - | 268ms |
Note the overall performance improvement for nearly all queries against pg_clickhouse foreign tables with SEMI JOIN support. In a few cases (queries 13, 15, and 16) the query optimizer selects slower plans, and clearly we need to get to the bottom of Query 2’s performance. But the overall performance gain for the other queries is undeniable.
The Future #
We’re super happy with these improvements and pleased to bring them to you in this first release. But we’re far from done. Our top focus is finishing pushdown coverage for analytic workloads before adding DML features. Our road map:
- Get the remaining 10 un-pushed-down TPC-H queries optimally planned
- Test and fix pushdown for the ClickBench queries
- Support transparent pushdown of all PostgreSQL aggregate functions
- Support transparent pushdown of all PostgreSQL functions
- Implement comprehensive subquery pushdown
- Allow server-level and user-level ClickHouse settings via CREATE SERVER and CREATE USER
- Support all ClickHouse data types
- Support lightweight DELETEs and UPDATEs
- Support batch insertion via COPY
- Add a function to execute an arbitrary ClickHouse query and return its results as a table
- Add support for pushdown of UNION queries when they all query the remote database
And more; so much to do! Install pg_clickhouse from GitHub and PGXN releases and try it on a real workload. Tell us via project issues where pushdown breaks. We’ll fix it.