SQL Dynamic Column Selection with ClickHouse

When working with datasets that contain a lot of columns, we’ll often want to compute aggregations on a subset of those columns.

Having to type out all the columns that we want to operate on is pretty tedious, so I was pleased to learn that ClickHouse has functionality that allows for dynamic column selection.

Import NYC Taxis dataset

We’re going to use the NYC taxis dataset and in particular the data for yellow taxis in January 2023. We’ll download the Parquet file for that month and then launch a ClickHouse Local instance and ingest it:

./clickhouse local -m

CREATE TABLE trips ENGINE MergeTree
ORDER BY (tpep_pickup_datetime) AS
from file('yellow tripdata Jan 2023.parquet', Parquet)
select *
SETTINGS schema_inference_make_columns_nullable = 0;

We can have a look at the schema of the table by running the following query:

DESCRIBE TABLE trips
SETTINGS describe_compact_output = 1;

┌─name──────────────────┬─type──────────┐
│ VendorID              │ Int64         │
│ tpep_pickup_datetime  │ DateTime64(6) │
│ tpep_dropoff_datetime │ DateTime64(6) │
│ passenger_count       │ Float64       │
│ trip_distance         │ Float64       │
│ RatecodeID            │ Float64       │
│ store_and_fwd_flag    │ String        │
│ PULocationID          │ Int64         │
│ DOLocationID          │ Int64         │
│ payment_type          │ Int64         │
│ fare_amount           │ Float64       │
│ extra                 │ Float64       │
│ mta_tax               │ Float64       │
│ tip_amount            │ Float64       │
│ tolls_amount          │ Float64       │
│ improvement_surcharge │ Float64       │
│ total_amount          │ Float64       │
│ congestion_surcharge  │ Float64       │
│ airport_fee           │ Float64       │
└───────────────────────┴───────────────┘

Dynamically selecting columns

Now, let’s say that we only want to work with the columns that contain _amount. Rather than having to type out those columns individually, we can use the COLUMNS clause to return the columns that match a regular expression. A query to return the first 10 rows for the amount columns would look like this:

FROM trips
SELECT COLUMNS('.*_amount')
LIMIT 10;

┌─fare_amount─┬─tip_amount─┬─tolls_amount─┬─total_amount─┐
│           0 │          0 │            0 │            0 │
│         120 │          0 │            0 │        120.3 │
│          45 │       9.06 │            0 │        54.36 │
│          75 │      15.06 │            0 │        90.36 │
│          55 │      14.45 │            0 │        72.25 │
│         4.5 │          0 │            0 │         6.55 │
│          10 │          0 │            0 │         10.8 │
│         115 │          5 │            0 │        120.3 │
│          78 │      15.76 │            0 │        94.56 │
│        19.5 │          0 │            0 │        21.55 │
└─────────────┴────────────┴──────────────┴──────────────┘
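The pattern doesn't have to match the whole column name. As far as I can tell, COLUMNS does an unanchored regular expression search, so the leading .* is optional. Here's a minimal sketch that checks this against a made-up one-row subquery rather than the trips table (the literal values and the choice of columns are purely illustrative):

```sql
-- Hypothetical inline data standing in for the trips table.
-- The pattern has no leading '.*' and should still match
-- any column whose name contains '_amount'.
SELECT COLUMNS('_amount')
FROM
(
    SELECT
        1 AS fare_amount,
        2 AS tip_amount,
        3 AS extra
);
```

This should return only fare_amount and tip_amount, since extra doesn't contain _amount.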

Let’s say we also want to return columns that contain the terms fee or tax. We can update the regular expression to include those:

FROM trips
SELECT
  COLUMNS('.*_amount|fee|tax')
ORDER BY rand()
LIMIT 3
FORMAT Vertical;

Row 1:
──────
fare_amount:  9.3
mta_tax:      0.5
tip_amount:   0
tolls_amount: 0
total_amount: 13.3
airport_fee:  0

Row 2:
──────
fare_amount:  10
mta_tax:      0.5
tip_amount:   2
tolls_amount: 0
total_amount: 16
airport_fee:  0

Row 3:
──────
fare_amount:  18.4
mta_tax:      0.5
tip_amount:   1
tolls_amount: 0
total_amount: 23.4
airport_fee:  0
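If we want to check which columns a pattern will pick up before running a bigger query, one option is to test the same regular expression against the column names in system.columns. This is only a sketch: it assumes the trips table lives in the current database, and it uses the match function, which (like COLUMNS) takes a re2 pattern and doesn't require it to cover the whole string.

```sql
-- Preview the column names that a COLUMNS pattern would select.
-- Assumes the trips table was created in the current database.
SELECT name, type
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'trips'
  AND match(name, '.*_amount|fee|tax');
```

For the table above, this should list the same six columns that appear in the previous output.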

Apply functions to all columns

We can also use the APPLY modifier to apply a function to every column that the expression returns. For example, if we wanted to find the maximum value of each of those columns, we could run the following query:

FROM trips
SELECT
  COLUMNS('.*_amount|fee|tax')
  APPLY(max)
FORMAT Vertical;

Row 1:
──────
max(fare_amount):  1160.1
max(mta_tax):      53.16
max(tip_amount):   380.8
max(tolls_amount): 196.99
max(total_amount): 1169.4
max(airport_fee):  1.25

Or maybe, we’d like to see the average instead:

FROM trips
SELECT
  COLUMNS('.*_amount|fee|tax')
  APPLY(avg)
FORMAT Vertical;

Row 1:
──────
avg(fare_amount):  18.36706861234277
avg(mta_tax):      0.48828997712900174
avg(tip_amount):   3.3679406710521764
avg(tolls_amount): 0.5184906575852216
avg(total_amount): 27.020383107155837
avg(airport_fee):  0.10489592293640923

Those values contain a lot of decimal places, but luckily we can fix that by chaining functions. In this case, we’ll apply the avg function, followed by the round function:

FROM trips
SELECT
  COLUMNS('.*_amount|fee|tax')
  APPLY(avg)
  APPLY(round)
FORMAT Vertical;

Row 1:
──────
round(avg(fare_amount)):  18
round(avg(mta_tax)):      0
round(avg(tip_amount)):   3
round(avg(tolls_amount)): 1
round(avg(total_amount)): 27
round(avg(airport_fee)):  0

But that rounds the averages to whole numbers. If we want to round to, say, 2 decimal places, we can do that as well. As well as a function name, APPLY accepts a lambda, which lets us pass extra arguments to the function. Here we use one to have round keep 2 decimal places:

FROM trips
SELECT
  COLUMNS('.*_amount|fee|tax')
  APPLY(avg)
  APPLY(col -> round(col, 2))
FORMAT Vertical;

Row 1:
──────
round(avg(fare_amount), 2):  18.37
round(avg(mta_tax), 2):      0.49
round(avg(tip_amount), 2):   3.37
round(avg(tolls_amount), 2): 0.52
round(avg(total_amount), 2): 27.02
round(avg(airport_fee), 2):  0.1
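Judging by the column headers in that output, the chained APPLY modifiers expand to nested function calls, with the first APPLY innermost. For comparison, here's a sketch of what we'd have to type by hand to get the same result, assuming the six columns matched by the pattern above:

```sql
-- Hand-written equivalent of
-- COLUMNS('.*_amount|fee|tax') APPLY(avg) APPLY(col -> round(col, 2))
SELECT
    round(avg(fare_amount), 2),
    round(avg(mta_tax), 2),
    round(avg(tip_amount), 2),
    round(avg(tolls_amount), 2),
    round(avg(total_amount), 2),
    round(avg(airport_fee), 2)
FROM trips
FORMAT Vertical;
```

Unlike the COLUMNS version, this list would need editing whenever a matching column is added to or removed from the table.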

Replacing columns

So far so good. But let’s say we want to adjust one of the values, while leaving the other ones as they are. For example, maybe we want to double the total amount and divide the MTA tax by 1.1. We can do that by using the REPLACE clause, which will replace a column while leaving the other ones as they are.

FROM trips
SELECT
  COLUMNS('.*_amount|fee|tax')
  REPLACE(
    total_amount*2 AS total_amount,
    mta_tax/1.1 AS mta_tax
  )
  APPLY(avg)
  APPLY(col -> round(col, 2))
FORMAT Vertical;

Row 1:
──────
round(avg(fare_amount), 2):               18.37
round(divide(avg(mta_tax), 1.1), 2):      0.44
round(avg(tip_amount), 2):                3.37
round(avg(tolls_amount), 2):              0.52
round(multiply(avg(total_amount), 2), 2): 54.04
round(avg(airport_fee), 2):               0.1

We can see that those two columns have both been replaced and the other columns are as they were in the previous query.

Excluding columns

We can also choose to exclude a field by using the EXCEPT clause. For example, to remove the tolls_amount column, we would write the following query:

FROM trips
SELECT
  COLUMNS('.*_amount|fee|tax') EXCEPT(tolls_amount)
  REPLACE(
    total_amount*2 AS total_amount,
    mta_tax/1.1 AS mta_tax
  )
  APPLY(avg)
  APPLY(col -> round(col, 2))
FORMAT Vertical;

Row 1:
──────
round(avg(fare_amount), 2):               18.37
round(divide(avg(mta_tax), 1.1), 2):      0.44
round(avg(tip_amount), 2):                3.37
round(multiply(avg(total_amount), 2), 2): 54.04
round(avg(airport_fee), 2):               0.1

tolls_amount has now been removed and the other columns remain.
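EXCEPT and REPLACE aren't tied to COLUMNS. They can also follow a plain *, which is handy when we want every column apart from one or two. A minimal sketch, again using a made-up one-row subquery in place of the trips table (the values are illustrative only):

```sql
-- Hypothetical inline data standing in for the trips table.
-- Drop tolls_amount, double total_amount, and keep everything else.
SELECT * EXCEPT(tolls_amount) REPLACE(total_amount * 2 AS total_amount)
FROM
(
    SELECT
        10 AS fare_amount,
        1.5 AS tolls_amount,
        12.5 AS total_amount
);
```

This should return a single row with fare_amount unchanged at 10 and total_amount doubled to 25.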

In Conclusion

Hopefully you’ve seen that even with a dataset that didn’t have that many columns, ClickHouse’s dynamic column selection functionality saves us a bunch of typing in our SQL queries.

Give these clauses a try on your own data and let us know how you get on!
