Skip to main content

Structured Approach

First, we confirm we can read the JSON dataset and highlight the challenges of handling semi-structured data using more traditional types used in other databases. We don’t rely on Schema inference to map the JSON fields to columns in the example below - instead, we specify a format of JSONEachRow and map the fields explicitly to columns in the s3 functions.

SELECT type, `actor.display_login`, `repo.name`, created_at
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github/github-2022-flat.ndjson.gz',
'JSONEachRow',
'type String, `actor.avatar_url` String, `actor.display_login` String, ' ||
'`actor.id` Float64, `actor.login` String, `actor.url` String, `repo.id` Float64, ' ||
'`repo.name` String, `repo.url` String, created_at String, `payload.pull_request.updated_at` String, ' ||
'`payload.action` String, `payload.ref` String, `payload.ref_type` String, ' ||
'`payload.pull_request.user.login` String, `payload.pull_request.number` Float64, ' ||
'`payload.pull_request.title` String, `payload.pull_request.state` String, ' ||
'`payload.pull_request.author_association` String, `payload.pull_request.head.ref` String, ' ||
'`payload.pull_request.head.sha` String, `payload.pull_request.base.ref` String, ' ||
'`payload.pull_request.base.sha` String, `payload.size` Float64, `payload.distinct_size` Float64')
LIMIT 10;
typeactor.display_loginrepo.namecreated_at
PushEventLakshmipatil2021revacprogramming/pps-test1-Lakshmipatil20212022-01-04T07:00:00Z
MemberEventKStevenTKStevenT/HTML_ExternalWorkshop2022-01-04T07:00:00Z
PushEventSoumojit28Soumojit28/Oxytocin2022-01-04T07:00:00Z
PushEventgithub-actionsdiogoaraujo017/diogoaraujo0172022-01-04T07:00:00Z
PushEventAman-SonwaniAman-Sonwani/crwn-clothing2022-01-04T07:00:00Z
PushEventhuangshanyoumumingwutonghuangshanyoumumingwutong/picgo2022-01-04T07:00:00Z
PullRequestEventrfprodrfprod/nx-ng-starter2022-01-04T07:00:00Z
PushEventHelikopter-BojowyHelikopter-Bojowy/Exp-na-helikopterze2022-01-04T07:00:00Z
IssueCommentEventPRMerger-test-1MicrosoftDocs/CSIDev-Public2022-01-04T07:00:00Z
PushEventgithub-actionspioug/yield-data2022-01-04T07:00:00Z

Note this dataset is a subset of the example used later, with no nested objects within the JSON itself - the fields have been flattened using a period separator. Although nested objects can be handled through an explicit mapping, it requires either the use of the new JSON object field or (for older ClickHouse versions) Tuples, Map and Nested structures (see Other Approaches) further complicate usage.

This approach requires mapping all fields and has obvious limitations when the JSON is potentially dynamic or unknown. We could use an INSERT INTO SELECT statement to persist the results into a local Merge Tree table. Defining such a table would require the user to know all fields and express the verbose definition below.

CREATE table github_flat
(
type String,
`actor.avatar_url` String,
`actor.display_login` String,
`actor.id` Float64,
`actor.login` String,
`actor.url` String,
`repo.id` Float64,
`repo.name` String,
`repo.url` String,
created_at String,
`payload.pull_request.updated_at` String,
`payload.action` String,
`payload.ref` String,
`payload.ref_type` String,
`payload.pull_request.user.login` String,
`payload.pull_request.number` Float64,
`payload.pull_request.title` String,
`payload.pull_request.state` String,
`payload.pull_request.author_association` String,
`payload.pull_request.head.ref` String,
`payload.pull_request.head.sha` String,
`payload.pull_request.base.ref` String,
`payload.pull_request.base.sha` String,
`payload.size` Float64,
`payload.distinct_size` Float64
) ENGINE = MergeTree ORDER BY (type, `repo.name`, created_at);

INSERT INTO github_flat SELECT * FROM s3 ('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github/github-2022-flat.ndjson.gz', 'JSONEachRow');

SELECT count() from github_flat;
count()
1000000

Furthermore, if new properties are added to the JSON, the table would need to be updated, i.e., via ALTER TABLE. Naturally, this leads us to use ClickHouse’s semi-structured features.