Skip to main content

Kafka and the JSON data type

· 4 min read

With the introduction of the new JSON data type, ClickHouse is now a good choice of database for doing JSON analytics. In this guide, we're going to learn how to load JSON messages from Apache Kafka directly into a single JSON column in ClickHouse.

Setup Kafka

Let's start by running a Kafka broker on our machine. We're also going to map port 9092 to port 9092 on our host operating system so that it's easier to interact with Kafka:

docker run --name broker -p 9092:9092 apache/kafka:3.8.1

Ingest data into Kafka

Once that's running, we need to ingest some data. The Wikimedia recent changes feed is a good source of streaming data, so let's ingest that into the wiki_events topic:

curl -N https://stream.wikimedia.org/v2/stream/recentchange 2>/dev/null |
awk '/^data: /{gsub(/^data: /, ""); print}' |
jq -cr --arg sep ø '[.meta.id, tostring] | join($sep)' |
kcat -P -b localhost:9092 -t wiki_events -Kø

We can check tha the data's being ingested by running the following command:

kcat -C -b localhost:9092  -t wiki_events
{"$schema":"/mediawiki/recentchange/1.0.0","meta":{"uri":"https://www.wikidata.org/wiki/Q130972321","request_id":"5c687ded-4721-4bfc-ae6c-58ca25f4a6ce","id":"0fbb0982-c43b-4e8b-989b-db7e78dbdc76","dt":"2024-11-06T11:59:57Z","domain":"www.wikidata.org","stream":"mediawiki.recentchange","topic":"codfw.mediawiki.recentchange","partition":0,"offset":1228777205},"id":2338656448,"type":"edit","namespace":0,"title":"Q130972321","title_url":"https://www.wikidata.org/wiki/Q130972321","comment":"/* wbsetclaim-create:2||1 */ [[Property:P18]]: Mahdi Rrezaei Journalist.jpg","timestamp":1730894397,"user":"Wikimellatir","bot":false,"notify_url":"https://www.wikidata.org/w/index.php?diff=2270885254&oldid=2270870214&rcid=2338656448","minor":false,"patrolled":false,"length":{"old":4269,"new":4636},"revision":{"old":2270870214,"new":2270885254},"server_url":"https://www.wikidata.org","server_name":"www.wikidata.org","server_script_path":"/w","wiki":"wikidatawiki","parsedcomment":"<span dir=\"auto\"><span class=\"autocomment\">Created claim: </span></span> <a href=\"/wiki/Property:P18\" title=\"image | image of relevant illustration of the subject; if available, also use more specific properties (sample: coat of arms image, locator map, flag image, signature image, logo image, collage image)\"><span class=\"wb-itemlink\"><span class=\"wb-itemlink-label\" lang=\"en\" dir=\"ltr\">image</span> <span class=\"wb-itemlink-id\">(P18)</span></span></a>: Mahdi Rrezaei Journalist.jpg"}
{"$schema":"/mediawiki/recentchange/1.0.0","meta":{"uri":"https://www.wikidata.org/wiki/Q75756596","request_id":"eb116219-7372-4725-986f-790211708d36","id":"9e0d5299-5bd1-4c58-b796-9852afd8a84e","dt":"2024-11-06T11:59:54Z","domain":"www.wikidata.org","stream":"mediawiki.recentchange","topic":"codfw.mediawiki.recentchange","partition":0,"offset":1228777206},"id":2338656449,"type":"edit","namespace":0,"title":"Q75756596","title_url":"https://www.wikidata.org/wiki/Q75756596","comment":"/* wbeditentity-update-languages-and-other:0||55 */ mv labels and aliases matching [[Property:P528]] or [[Property:P3083]] to mul","timestamp":1730894394,"user":"Twofivesixbot","bot":true,"notify_url":"https://www.wikidata.org/w/index.php?diff=2270885237&oldid=2147709089&rcid=2338656449","minor":false,"patrolled":true,"length":{"old":30879,"new":27161},"revision":{"old":2147709089,"new":2270885237},"server_url":"https://www.wikidata.org","server_name":"www.wikidata.org","server_script_path":"/w","wiki":"wikidatawiki","parsedcomment":"<span dir=\"auto\"><span class=\"autocomment\">Changed label, description and/or aliases in 55 languages, and other parts: </span></span> mv labels and aliases matching <a href=\"/wiki/Property:P528\" title=\"catalog code | catalog name of an object, use with qualifier P972\"><span class=\"wb-itemlink\"><span class=\"wb-itemlink-label\" lang=\"en\" dir=\"ltr\">catalog code</span> <span class=\"wb-itemlink-id\">(P528)</span></span></a> or <a href=\"/wiki/Property:P3083\" title=\"SIMBAD ID | identifier for an astronomical object, in the University of Strasbourg&#039;s SIMBAD database\"><span class=\"wb-itemlink\"><span class=\"wb-itemlink-label\" lang=\"en\" dir=\"ltr\">SIMBAD ID</span> <span class=\"wb-itemlink-id\">(P3083)</span></span></a> to mul"}

So far, so good.

Ingest data into ClickHouse

Next, we're going to ingest the data into ClickHouse. First, let's enable the JSON type (which is currently experimental), by setting the following property:

SET allow_experimental_json_type = 1;

Now, we'll create the wiki_queue table, which uses the Kafka table engine.

CREATE TABLE wiki_queue
(
json JSON
)
ENGINE = Kafka(
'localhost:9092',
'wiki_events',
'clickhouse-consumer-group',
'JSONAsObject'
);

Note that we're using the JSONAsObject format, which will ensure that incoming messages are made available as a JSON object. This format can only be parsed into a table that has a single column with the JSON type.

Next, we'll create the underlying table to store the Wiki data:

CREATE TABLE wiki
(
json JSON,
id String MATERIALIZED getSubcolumn(json, 'meta.id')
)
ENGINE = MergeTree
ORDER BY id;

Finally, let's create a materialized view to populate the wiki table:

CREATE MATERIALIZED VIEW wiki_mv TO wiki AS 
SELECT json
FROM wiki_queue;

Querying JSON data in ClickHouse

We can then write queries against the wiki table. For example, we could count the number of bots that have committed changes:

SELECT json.bot, count()
FROM wiki
GROUP BY ALL
   ┌─json.bot─┬─count()─┐
1. │ true │ 2526 │
2. │ false │ 4691 │
└──────────┴─────────┘

Or we could find out the users that make the most changes on en.wikipedia.org:

SELECT
json.user,
count()
FROM wiki
WHERE json.server_name = 'en.wikipedia.org'
GROUP BY ALL
ORDER BY count() DESC
LIMIT 10
    ┌─json.user──────────────────────────────┬─count()─┐
1. │ Monkbot │ 267 │
2. │ Onel5969 │ 107 │
3. │ Bangwiki │ 37 │
4. │ HHH Pedrigree │ 28 │
5. │ REDACTED403 │ 23 │
6. │ KylieTastic │ 22 │
7. │ Tinniesbison │ 21 │
8. │ XTheBedrockX │ 20 │
9. │ 2001:4455:1DB:4000:51F3:6A16:408E:69FC │ 19 │
10. │ Wcquidditch │ 15 │
└────────────────────────────────────────┴─────────┘