Blog / Engineering

Visualizing Data with ClickHouse - Part 2 - Superset

author avatar
Dale McDiarmid
Oct 12, 2022 - 11 minutes read

In this post, we continue our series on the visualization of data in Clickhouse by exploring Superset as a popular tool for business users.

This blog post is part of a series:

What is Superset?

Apache Superset is an Apache Open Source visualization tool launched in 2017 as a result of an initiative at AirBnB. With a focus on SQL-based data stores, users are expected to be comfortable writing queries but with assistance from an IDE-like editor. Retaining the concept of dashboards, data analysts and more technical BI users are presented with a rich BI experience and a wide range of visualization types, including tree maps and box plots.

One of Superset's key strengths is its extensibility, allowing developers to create custom visualizations and plugins to meet specific organizational needs. Additionally, Superset supports a variety of data sources, including traditional relational databases, NoSQL databases, and cloud-native data warehouses, making it a versatile choice for organizations with diverse data ecosystems. The platform also emphasizes security and governance, providing features like role-based access control and data source management to ensure data integrity and compliance with organizational policies.

Getting started with Superset and ClickHouse

docker-compose represents the simplest getting started experience for new users of Superset.

First clone the Superset repository:

1git clone --depth=1  https://github.com/apache/superset.git
2cd superset

Prior to running any docker-compose commands, ensure the official ClickHouse driver will be installed by adding “clickhouse-connect” to the requirements file as shown below. We also add a map box key to enable geo visualizations.

1echo "clickhouse-connect" >> ./docker/requirements-local.txt
2echo "MAPBOX_API_KEY=" >> docker/.env-non-dev

Next, let's launch Superset

1export TAG=3.1.1
2docker compose -f docker-compose-image-tag.yml up

Finally, we're going to create our Superset admin and initialize Superset's database:

1SUPERSET_ID=$(docker ps -aqf "name=superset_app")
1docker exec -it ${SUPERSET_ID} superset fab create-admin \
2  --username admin --firstname Superset --lastname Admin \
3  --email [email protected] --password admin
4docker exec -it ${SUPERSET_ID}  superset db upgrade
5docker exec -it ${SUPERSET_ID}  superset init

You can access Superset via http://localhost:8088 and log in using the credentials admin/admin.

Adding database connection

Once we've logged in, we're going to create a database connection. Below we connect to sql.clickhouse.com to exploit our datasets using the user explorer (no password). Note the use of the HTTP port.

Use the following credentials:

  • Host - sql-clickhouse.clickhouse.com
  • Port - 443
  • User - demo
super-attach-db-2.gif

Creating a Superset dataset from a ClickHouse

Superset requires you to create virtual “datasets”. These can either provide a connection to a specific table or capture the results of a query. Below we create a connection to the hackernews table in sql.clickhouse.com, making this available for chart creation.

Note a Schema for ClickHouse is a database.

superset-create-dataset-2.gif

Creating a Superset chart

Once we've configured a dataset, we can create charts based on it.

Recreating the simple line chart visualization from our previous post, showing ClickHouse Hacker News posts over time is shown below:

superset-simple-line-2.gif

Using our forex dataset from a previous post, Superset is able to visualize the most commonly traded currency pairs using a treemap. Here we limit to data from the 2020s and assume the reader has created the dataset in Superset using the previous example. Note we use 1/avg(ask-bid) as an estimation of market activity in a currency pair. This computes the inverse of the spread (see our earlier post for the definition), for which a lower value indicates greater liquidity and trading volumes.

superset-treemap-2.gif

Creating a Superset dataset from a SQL query

We can perform a similar analysis of the GBP/EUR but visualize it as a calendar to see those days during which the largest change in price occurred during 2016 - unsurprisingly, the day of the Brexit referendum. This requires us to use the SQL IDE, where the results of a query can be saved as a dataset. This powerful feature allows us to capture more complex logic. This specific query is from our earlier forex post and requires a window function to compute the daily change.

SELECT base, quote, day, close, close - any(close) OVER (PARTITION BY base, quote ORDER BY base ASC, quote ASC, day ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS change FROM ( SELECT base, quote, day, argMax(ask, datetime) AS close FROM blogs.forex WHERE (quote = 'GBP') AND (base = 'EUR') AND (datetime > '2016-01-01 00:00:00.000') AND (datetime < '2017-01-01 00:00:00.000') GROUP BY base, quote, toStartOfDay(datetime) AS day ORDER BY base ASC, quote ASC, day ASC ) ORDER BY base ASC, quote ASC, day ASC

superset-calendar.gif

Although we haven’t explored the UK house price dataset in our blog series, this is often used for example purposes throughout our docs. By using a list of regional codes for the UK (ISO 3166-2) we can identify regions that have seen the largest percentage change in median house prices in the last 20 years. The query here uses some nice aggregate functions, joining our codes and prices, and yet still runs in less than 0.2 secs in our public play environment. The UK has undergone changes in regional codes over the last 25 years, so some houses have no valid value in the current codes - however, this compromises less than one-tenth of a percent and is not meaningful to our analysis.

SELECT code, (anyIf(med_2020, med_2020 > 0) - anyIf(med_2000, med_2000 > 0)) / anyIf(med_2000, med_2000 > 0) AS percent_change FROM ( SELECT code, medianIf(price, year = 2000) AS med_2000, medianIf(price, year = 2020) AS med_2020 FROM ( SELECT date, price, locality, town, district, county, code FROM uk_price_paid LEFT JOIN blogs.uk_codes AS codes ON (uk_price_paid.county = codes.name) OR (uk_price_paid.district = codes.name) OR (uk_price_paid.town = codes.name) OR (uk_price_paid.locality = codes.name) OR (replaceAll(uk_price_paid.district, 'CITY OF ', '') = codes.name) ) WHERE (code != '') AND ((toYear(date) = 2000) OR (toYear(date) = 2020)) GROUP BY code, toYear(date) AS year ORDER BY code ASC ) GROUP BY code ORDER BY percent_change DESC

Using the country map we can in turn visualize this.

superset-country-map.gif

Superset offers a box plot visualization, effective for showing the distributions of a metric across multiple groups. This allows us to show how precipitation varies per country using the weather dataset explored in a previous post. For this, we need a simple mapping of fips 10-4 codes to country names. Recall that the first 2 digits of the station id for the weather dataset represent a fips country code. Our mapping is this then loaded into a ClickHouse Complex Key Hashed dictionary, using our table as a source, for use in the following query via the dictGet function.

CREATE TABLE blogs.country_codes ( `name` LowCardinality(String), `code` LowCardinality(String) ) ENGINE = MergeTree ORDER BY code INSERT INTO blogs.country_codes SELECT Name as name, `FIPS 10-4` as code FROM url('https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv', 'CSVWithNames') CREATE DICTIONARY blogs.country_iso_codes ( `code` String, `name` String ) PRIMARY KEY code SOURCE(CLICKHOUSE(TABLE 'country_codes' DATABASE 'blogs')) LIFETIME(MIN 0 MAX 0) LAYOUT(COMPLEX_KEY_HASHED())

Below we show the dataset being created from a query that computes the average precipitation per year by country. We limit the countries to those in the EU by their FIPS code. This can, in turn, be used to create our box plot.

superset-box-plot.gif
SELECT year, avg(`precipitation`) AS `avg_precipitation`, dictGet(`blogs`.`country_iso_codes`, 'name', code) as country FROM `blogs`.`noaa_v2` WHERE date > '1970-01-01' AND code IN ('AL', 'AN', 'AU', 'BE', 'BO', 'CY', 'DA', 'EI', 'EZ', 'EN', 'FI', 'FR', 'GG', 'GI', 'GK', 'GM', 'GR', 'HR', 'HU', 'IC', 'IM', 'IT', 'JE', 'LG', 'LH', 'LO', 'LS', 'LU', 'MD', 'MK', 'MN', 'MT', 'NL', 'NO', 'PL', 'PO', 'RO', 'SI', 'SM', 'SP', 'SW', 'SZ', 'TU', 'UK', 'UP', 'VT') GROUP BY toStartOfYear(`date`) AS `year`, substring(station_id, 1, 2) as code HAVING avg_precipitation > 0 ORDER BY country, year ASC LIMIT 100000

As well as offering your usual numerical analysis charts, Superset allows some simple text analysis with a word cloud. Focusing on the Hacker News dataset from our earlier post, we can use the tokenization and array functions to identify the top N 2-term phrases (shingles).

WITH stop_words AS ( SELECT token FROM blogs.stop_words ) SELECT phrase, count() AS c FROM ( SELECT arrayJoin(shingles) AS shingle, concat(shingle.1, ' ', shingle.2) AS phrase FROM ( SELECT tokens, arrayFilter(t -> (NOT ((t.2) IS NULL)), arrayZip(tokens, arrayPushBack(arrayPopFront(tokens), NULL))) AS shingles FROM ( SELECT arrayFilter(t -> ((t NOT IN (stop_words)) AND (length(t) > 2)), alphaTokens(title)) AS tokens FROM default.hackernews WHERE (type IN ('story', 'comment')) ) WHERE length(tokens) > 0 ) ) GROUP BY phrase ORDER BY c DESC LIMIT 20

Note how we exclude phrases containing stop words. This obviously is a crude means of identifying top phrases and statistical based techniques would like yield more interesting results!

superset-wordcloud.gif

Finally, like Grafana, Superset offers the ability to combine chart types. Below we explore a more classical visualization mixing line and bar charts to show the value of the GBP/USD and USD/EUR currency pairs against their spread. Note how we use multiple axes and add a moving average of the spread over the previous 10 data points.

superset-multi-2.gif

Creating a Superset dashboard

Once you're happy with your chart, you can save it and optionally add it to an existing dashboard or create a new dashboard, as shown in the screenshot below.

2024-12-09_17-18-03.png

You can then go directly to the dashboard, or find it via the Dashboards link on the top navigation bar. Below is an example of a Hacker News dashboard that we created with a couple of charts:

2024-12-09_17-30-39.png

To publish a dashboard so that other people can view it, click on the 'Draft' button, which will then change to 'Published'

Conclusion

In this post, we have explored using some of the unique visualization capabilities of Superset and its ability to be highly useful to both technical and business users. In the next post, we will demonstrate how Metabase targets a simpler experience for rapid data investigation.

If you’re enthusiastic about the latest technologies and are passionate about Open Source, we’re currently hiring for our integrations team and would love to hear from you.

Share this post

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...
Follow us
X imageSlack imageGitHub image
Telegram imageMeetup imageRss image