The UK property prices dataset
This data contains prices paid for real-estate property in England and Wales. The data is available since 1995, and the size of the dataset in uncompressed form is about 4 GiB (which will only take about 278 MiB in ClickHouse).
- Source: https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads
- Description of the fields: https://www.gov.uk/guidance/about-the-price-paid-data
- Contains HM Land Registry data © Crown copyright and database right 2021. This data is licensed under the Open Government Licence v3.0.
Create the Table
Preprocess and Insert the Data
We will use the url
function to stream the data into ClickHouse. We need to preprocess some of the incoming data first, which includes:
- splitting the
postcode
to two different columns -postcode1
andpostcode2
, which is better for storage and queries - converting the
time
field to date as it only contains 00:00 time - ignoring the UUid field because we don't need it for analysis
- transforming
type
andduration
to more readableEnum
fields using the transform function - transforming the
is_new
field from a single-character string (Y
/N
) to a UInt8 field with 0 or 1 - drop the last two columns since they all have the same value (which is 0)
The url
function streams the data from the web server into your ClickHouse table. The following command inserts 5 million rows into the uk_price_paid
table:
Wait for the data to insert - it will take a minute or two depending on the network speed.
Validate the Data
Let's verify it worked by seeing how many rows were inserted:
At the time this query was run, the dataset had 27,450,499 rows. Let's see what the storage size is of the table in ClickHouse:
Notice the size of the table is just 221.43 MiB!
Run Some Queries
Let's run some queries to analyze the data:
Query 1. Average Price Per Year
Query 2. Average Price per Year in London
Something happened to home prices in 2020! But that is probably not a surprise...
Query 3. The Most Expensive Neighborhoods
Speeding up queries with projections
We can speed up these queries with projections. See "Projections" for examples with this dataset.
Test it in the Playground
The dataset is also available in the Online Playground.