Blog / Product

ClickHouse Newsletter July 2022: Geo queries for railway enthusiasts

author avatar
Christoph Wurm
Jul 14, 2022 - 9 minutes read

A warm welcome to you all. It is that time of the year when we wish our office in Amsterdam had air conditioning. Then again, that would mean it was ultramodern and on the outskirts of town rather than old and charming looking out on one of the central canals. Oh well, you can’t have it all in life, same as you can’t have a distributed database that satisfies all three guarantees of the CAP theorem.

Keep reading for our upcoming events (hope to see you in person in London or Munich), what’s new in ClickHouse 22.6 and some fun with geo queries.

By the way, if you’re reading this on our website, did you know you can receive every monthly newsletter as an email in your inbox as well? Sign up here.

Upcoming Events

Mark your calendars for these:

  • ClickHouse v22.7 Release Webinar
    When? Thursday, July 21 @ 9 am PDT / 5 pm GMT
    How do I join? Register here.

  • [IN PERSON] ClickHouse London Meetup
    Join us in the Cloudflare London office for a night full of talks. Cloudflare will share tips around schema management at scale and how they enable 100s of engineers to modify ClickHouse schemas. Hear from analytics startup Clippd about how they are using ClickHouse, and there are more talks on using ClickHouse for financial data, optimizing ClickHouse for ARM and visualizing data with ClickHouse.
    When? Wednesday, July 20 @ 6 pm BST
    How? Register here.

  • [IN PERSON] ClickHouse Silicon Valley Meetup
    We are very excited to be holding our next in-person ClickHouse meetup at the Barracuda offices! Please join us for an evening of talks, food and discussion. There will be talks from ClickHouse users, and the ClickHouse team will share our latest updates and are available for plenty of questions!
    When? Wednesday, July 20 @ 6 pm PDT
    How? Register here.

  • [IN PERSON] ClickHouse Munich Meetup
    We’ll be coming together at the Metoda office in Munich for talks from Metoda, Akamai and ClickHouse. A number of ClickHouse engineers live around Munich, so we’ll have a lot of expertise present, come prepared with your questions!
    When? Wednesday, July 27 @ 6 pm CEST
    How? Register here.

ClickHouse v22.6

What’s in our regular monthly June release:

  1. ARM as a first-class citizen We continue to improve the experience of ClickHouse users running on ARM. This year, we introduced official Docker images, and now we’ve made tons of performance optimizations. Also, ClickHouse now shows stack traces on ARM, useful for debugging.
  2. Search with dynamic values You can now use non-constant parameters for LIKE and MATCH. For example, to find customers that use their first name as part of their email address use SELECT * FROM customers WHERE match(email, firstName).
  3. Parameters for external functions You can now pass parameters to external user-defined functions.
  4. Show server certificate Use SELECT showCertificate() to see the server certificate. It better match what you are seeing in your client/browser!
  5. ZooKeeper writes You can now insert into the system.zookeeper table, directly manipulating data in ZooKeeper or ClickHouse Keeper. Should you? Let’s just say “it depends” and “only if you’re absolutely sure what you’re doing”.

Take a look at the release webinar slides, the recording and please upgrade (unless you want to stay on an LTS release).

Query of the Month: Geo queries for railway enthusiasts

Let’s say you are working for a multinational coffee store chain that has expanded into every corner of the world, and you’re looking to find out where you have maybe a few too many stores. And let’s assume that you have all data about your stores including their geographic coordinates in ClickHouse. You have a table and each row is one store and its latitude and longitude. How would you write a query to find the stores that are closest to each other?

Now, we’re not aware of a public dataset of coffee store locations, but there is this dataset of all public transport stops in England and Wales. Formally known as “National Public Transport Access Nodes” (ever asked a helpful stranger for directions to the nearest access node?), when you download the CSV the much more sensible filename is “Stops.csv”.

To load it into ClickHouse, run:

CREATE TABLE stops ENGINE = Memory AS
SELECT * FROM file('Stops.csv', 'CSVWithNames')
SETTINGS format_csv_allow_single_quotes = 0

First, we have to get the data the way we need it. Many stations have many rows, one for each entrance, bus stop, taxi rank, platform, etc. To get this down to one row per station let’s filter on just London Underground platforms (there is a 211-page schema guide in case you’re wondering how we knew what to filter on):

SELECT * FROM stops
WHERE (StopType = 'PLT') AND (CommonName LIKE '%Underground%')

There will still be multiple rows per station but they all seem to be completely identical, so let’s just group them into one:

SELECT
       CommonName AS Name,
       any(Longitude) AS Lon,
       any(Latitude) AS Lat
FROM stops
WHERE (StopType = 'PLT') AND (CommonName LIKE '%Underground%')
GROUP BY CommonName

How do we find the stations that are closest to each other? Like this:

SELECT Name,Lat, Lon,
    lagInFrame(Name) OVER (Rows BETWEEN 1 PRECEDING AND 1 PRECEDING) AS PrevName,
    lagInFrame(Lat) OVER (Rows BETWEEN 1 PRECEDING AND 1 PRECEDING) AS PrevLat,
    lagInFrame(Lon) OVER (Rows BETWEEN 1 PRECEDING AND 1 PRECEDING) AS PrevLon,
    geoDistance(Lon, Lat, PrevLon, PrevLat) AS Distance
FROM (
    SELECT CommonName AS Name, any(Longitude) AS Lon, any(Latitude) AS Lat
    FROM stops
    WHERE (StopType = 'PLT') AND (Name LIKE '%Underground%')
    GROUP BY CommonName
    ORDER BY Lat * Lon ASC
)
ORDER BY Distance ASC

Stations close to each other will have almost exactly the same latitude and longitude, so the product of the two will be almost exactly the same. Sorting the result set by Lat * Lon causes each station to be sorted next to the station closest to it. Then we use a window function to find the previous row (so the closest station to this station), calculate the distance between the two and order by that.

Turns out the two closest stations are not two different stations at all! At a distance of just 4 meters, “Heathrow Terminals 1-2-3 Underground Station” and “Heathrow Terminals 2 & 3 Underground Station” are really more or less the same station. The next result, however, is what we are looking for: Queensway and Bayswater are two separate stations at the opposite end of a city block. Google Maps says it takes 2 minutes to walk from one to the other. Why are there two stations so close to each other? Well, they serve different tube lines: Queensway is a stop on the Central line, Bayswater is served by the Circle and District lines. If you ever visit London and have a choice between the two, avoid the former and go with the latter. You’ll thank me, especially in summer.

Reading Corner

What we’ve been reading:

  1. How We Optimize Complex Queries at Processing Time Instana wrote about how they use materialized views and tagging of incoming data to speed up recurring complex queries in ClickHouse.

  2. Full-Text Search with Quickwit and ClickHouse in a Cluster-to-Cluster Context Contentsquare is using open source search engine Quickwit (written in Rust) to combine analytical queries in ClickHouse with full text search capabilities.

  3. ClickHouse scalability and power for building data-intensive applications Cube explains in detail how ClickHouse can be used to underpin analytics dashboards and how Cube are interacting with it in their headless BI platform.

  4. DENIC Improves Query Times By 10x with ClickHouse: DENIC manages all registrations for the .de domain. Read about how they chose ClickHouse for their in-house analytics platform.

  5. How QuickCheck uses ClickHouse to bring banking to the Unbanked: QuickCheck moved analytical queries from PostgreSQL to ClickHouse, query times went from “forever” to “instant”!

  6. Amsterdam Meetup With The ClickHouse Team – June 8th, 2022: Thanks to all who joined us in person in Amsterdam. Catch up on the recording here.

  7. ClickHouse Over the Years with Benchmarks: At ClickHouse, we are obsessed with benchmarks! Read here how we tested every ClickHouse version.

  8. Collecting Semi-structured Data from Kafka Topics Using ClickHouse Kafka Engine: In a guest post, Superology writes about how to ingest protocol buffers from Kafka into ClickHouse.

  9. ClickHouse + Cumul.io: Cumul.io writes about using their customer-facing analytics dashboarding technology with ClickHouse.

  10. ClickHouse + Deepnote: Announcing a new integration between collaborative data notebook Deepnote and ClickHouse!

  11. New ClickHouse Adopters: Welcome crypto & NFT visual explorer Santiment, security data lake Dassana, JSON data visualization platform GraphJSON, and privacy-friendly tag manager Scale8. Get yourself added as well!

Thanks for reading. We’ll see you next month!

The ClickHouse Team

Photo by delfi de la Rua on Unsplash

Share this post

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...
Follow us
X imageSlack imageGitHub image
Telegram imageMeetup imageRss image