Skip to main content
Skip to main content

Foursquare places

Dataset

This dataset by Foursquare is available to download and to use for free under the Apache 2.0 license.

It contains over 100 million records of commercial points-of-interest (POI), such as shops, restaurants, parks, playgrounds, and monuments. It also includes additional metadata about those places, such as categories and social media information.

Data exploration

For exploring the data we'll use clickhouse-local, a small command-line tool that provides the full ClickHouse engine, although you could also use ClickHouse Cloud, clickhouse-client or even chDB.

Run the following query to select the data from the s3 bucket where the data is stored:

We see that quite a few fields have ᴺᵁᴸᴸ, so we can add some additional conditions to our query to get back more usable data:

Run the following query to view the automatically inferred schema of the data using the DESCRIBE:

Loading the data into ClickHouse

If you'd like to persist the data on disk, you can use clickhouse-server or ClickHouse Cloud.

To create the table, run the following command:

Take note of the use of the LowCardinality data type for several columns which changes the internal representation of the data types to be dictionary-encoded. Operating with dictionary encoded data significantly increases the performance of SELECT queries for many applications.

Additionally, two UInt32 MATERIALIZED columns, mercator_x and mercator_y are created that map the lat/lon coordinates to the Web Mercator projection for easier segmentation of the map into tiles:

Let's break down what is happening above for each column.

mercator_x

This column converts a longitude value into an X coordinate in the Mercator projection:

  • longitude + 180 shifts the longitude range from [-180, 180] to [0, 360]
  • Dividing by 360 normalizes this to a value between 0 and 1
  • Multiplying by 0xFFFFFFFF (hex for maximum 32-bit unsigned integer) scales this normalized value to the full range of a 32-bit integer

mercator_y

This column converts a latitude value into a Y coordinate in the Mercator projection:

  • latitude + 90 shifts latitude from [-90, 90] to [0, 180]
  • Dividing by 360 and multiplying by pi() converts to radians for the trigonometric functions
  • The log(tan(...)) part is the core of the Mercator projection formula
  • multiplying by 0xFFFFFFFF scales to the full 32-bit integer range

Specifying MATERIALIZED makes sure that ClickHouse calculates the values for these columns when we INSERT the data, without having to specify these columns (which are not part of the original data schema) in the `INSERT statement.

The table is ordered by mortonEncode(mercator_x, mercator_y) which produces a Z-order space-filling curve of mercator_x, mercator_y in order to significantly improve geospatial query performance. This Z-order curve ordering ensures data is physically organized by spatial proximity:

Two minmax indices are also created for faster search:

As you can see, ClickHouse has absolutely everything you need for real-time mapping applications!

Run the following query to load the data:

Visualizing the data

To see what's possible with this dataset, check out adsb.exposed. adsb.exposed was originally built by co-founder and CTO Alexey Milovidov to visualize ADS-B (Automatic Dependent Surveillance-Broadcast) flight data, which is 1000x times larger. During a company hackathon Alexey added the Foursquare data to the tool.

Some of our favourite visualizations are produced here below for you to enjoy.