If a column is sparse (empty or contains mostly zeros), ClickHouse can encode it in a sparse format and automatically optimize calculations - the data does not require full decompression during queries. In fact, if you know how sparse a column is, you can define its ratio using the `ratio_of_defaults_for_sparse_serialization`

setting to optimize serialization.

This handy query can take a while, but it analyzes every row in your table and determines the ratio of values that are zero (or the default) in every column in the specified table:

`SELECT *`

APPLY x -> (x = defaultValueOfArgumentType(x)) APPLY avg APPLY x -> round(x, 3)

FROM table_name

FORMAT Vertical

For example, we ran this query above on the environmental sensors dataset table named `sensors`

which has over 20B rows and 19 columns:

`SELECT *`

APPLY x -> (x = defaultValueOfArgumentType(x)) APPLY avg APPLY x -> round(x, 3)

FROM sensors

FORMAT Vertical

Here is response:

Row 1:

──────

round(avg(equals(sensor_id, defaultValueOfArgumentType(sensor_id))), 3): 0

round(avg(equals(sensor_type, defaultValueOfArgumentType(sensor_type))), 3): 0.159

round(avg(equals(location, defaultValueOfArgumentType(location))), 3): 0

round(avg(equals(lat, defaultValueOfArgumentType(lat))), 3): 0.001

round(avg(equals(lon, defaultValueOfArgumentType(lon))), 3): 0.001

round(avg(equals(timestamp, defaultValueOfArgumentType(timestamp))), 3): 0

round(avg(equals(P1, defaultValueOfArgumentType(P1))), 3): 0.474

round(avg(equals(P2, defaultValueOfArgumentType(P2))), 3): 0.475

round(avg(equals(P0, defaultValueOfArgumentType(P0))), 3): 0.995

round(avg(equals(durP1, defaultValueOfArgumentType(durP1))), 3): 0.999

round(avg(equals(ratioP1, defaultValueOfArgumentType(ratioP1))), 3): 0.999

round(avg(equals(durP2, defaultValueOfArgumentType(durP2))), 3): 1

round(avg(equals(ratioP2, defaultValueOfArgumentType(ratioP2))), 3): 1

round(avg(equals(pressure, defaultValueOfArgumentType(pressure))), 3): 0.83

round(avg(equals(altitude, defaultValueOfArgumentType(altitude))), 3): 1

round(avg(equals(pressure_sealevel, defaultValueOfArgumentType(pressure_sealevel))), 3): 1

round(avg(equals(temperature, defaultValueOfArgumentType(temperature))), 3): 0.532

round(avg(equals(humidity, defaultValueOfArgumentType(humidity))), 3): 0.544

1 row in set. Elapsed: 992.041 sec. Processed 20.69 billion rows, 1.39 TB (20.86 million rows/s., 1.40 GB/s.)

From the results above:

- the
`sensor_id`

columns is not sparse at all. In fact, every row has a non-zero value - the
`sensor_type`

is only sparse about 15.9% of the time - the
`P0`

column is very sparse: 99.9% of the values are zero - the
`pressure`

column is quite sparse at 83% - and
`temperature`

column has 53.2% of its values missing or zero

Like we said, it's a handy query for computing how sparse your columns are in a ClickHouse table!