Foursquare places
Dataset
This dataset by Foursquare is available to download and to use for free under the Apache 2.0 license.
It contains over 100 million records of commercial points-of-interest (POI), such as shops, restaurants, parks, playgrounds, and monuments. It also includes additional metadata about those places, such as categories and social media information.
Data exploration
For exploring the data we'll use clickhouse-local
, a small command-line tool
that provides the full ClickHouse engine, although you could also use
ClickHouse Cloud, clickhouse-client
or even chDB
.
Run the following query to select the data from the s3 bucket where the data is stored:
We see that quite a few fields have ᴺᵁᴸᴸ
, so we can add some additional conditions
to our query to get back more usable data:
Run the following query to view the automatically inferred schema of the data using
the DESCRIBE
:
Loading the data into ClickHouse
If you'd like to persist the data on disk, you can use clickhouse-server
or ClickHouse Cloud.
To create the table, run the following command:
Take note of the use of the LowCardinality
data type for several columns which changes the internal representation of the data
types to be dictionary-encoded. Operating with dictionary encoded data significantly
increases the performance of SELECT
queries for many applications.
Additionally, two UInt32
MATERIALIZED
columns, mercator_x
and mercator_y
are created
that map the lat/lon coordinates to the Web Mercator projection
for easier segmentation of the map into tiles:
Let's break down what is happening above for each column.
mercator_x
This column converts a longitude value into an X coordinate in the Mercator projection:
longitude + 180
shifts the longitude range from [-180, 180] to [0, 360]- Dividing by 360 normalizes this to a value between 0 and 1
- Multiplying by
0xFFFFFFFF
(hex for maximum 32-bit unsigned integer) scales this normalized value to the full range of a 32-bit integer
mercator_y
This column converts a latitude value into a Y coordinate in the Mercator projection:
latitude + 90
shifts latitude from [-90, 90] to [0, 180]- Dividing by 360 and multiplying by pi() converts to radians for the trigonometric functions
- The
log(tan(...))
part is the core of the Mercator projection formula - multiplying by
0xFFFFFFFF
scales to the full 32-bit integer range
Specifying MATERIALIZED
makes sure that ClickHouse calculates the values for these
columns when we INSERT
the data, without having to specify these columns (which are not
part of the original data schema) in the `INSERT statement.
The table is ordered by mortonEncode(mercator_x, mercator_y)
which produces a
Z-order space-filling curve of mercator_x
, mercator_y
in order to significantly
improve geospatial query performance. This Z-order curve ordering ensures data is
physically organized by spatial proximity:
Two minmax
indices are also created for faster search:
As you can see, ClickHouse has absolutely everything you need for real-time mapping applications!
Run the following query to load the data:
Visualizing the data
To see what's possible with this dataset, check out adsb.exposed. adsb.exposed was originally built by co-founder and CTO Alexey Milovidov to visualize ADS-B (Automatic Dependent Surveillance-Broadcast) flight data, which is 1000x times larger. During a company hackathon Alexey added the Foursquare data to the tool.
Some of our favourite visualizations are produced here below for you to enjoy.