Explore Github with ClickHouse powered Real-time Analytics

Imagine watching the pulse of the entire GitHub ecosystem as it happens - seeing which repositories are trending, understanding developer activity patterns, and analyzing open-source collaboration in real-time. This is the kind of insight we can unlock with real-time analytics.


SELECT message FROM (
SELECT '🚀 Welcome to the interactive guide to Real-time analytics with GitHub data! 📊'  as message, 0 as order
UNION ALL
SELECT 'All queries in this guide execute against our live demo environment sql.clickhouse.com.' as messsage, 1 as order
UNION ALL
SELECT 'To run a query simply use the ▶ button' as messsage, 2 as order
) ORDER BY order ASC

Real-time analytics refers to data processing that delivers insights to end users and customers as soon as the data is generated. Real-time analytics systems are built on top of event streams, which consist of a series of events ordered in time.

One such stream of data is GitHub events. When analyzing GitHub events in real-time, we need a database that can handle both high-speed ingestion and quick analytical queries. ClickHouse provides the capabilities to process our dataset of over 7 billion GitHub events, which grows as new events are ingested.

The system we build must have specific characteristics to achieve real-time analytics:

Ingestion speed - The value of streaming data decreases over time. We, therefore, need a fast data ingestion process to make the data immediately available for querying.
Query speed - We should aim for web page response times, i.e., a query's time to run and return a result should be sub-second and ideally 100 milliseconds or less.
Concurrency - Real-time analytics systems are often user-facing and must handle tens or hundreds of thousands of queries per second.

In this interactive guide, you'll learn how to build a real-time analytics system that can process and analyze GitHub activity as it happens. We'll start with some simple queries and build our way up to more complex ones, showing how to maintain performance while handling real-world data volumes.

The GitHub events data is continuously loaded into the github_events table of the github database in the ClickHouse SQL playground. We’ll query the dataset via a JavaScript widget embedded in the page. Each query has a link to the underlying query in the SQL playground, so you can play around with and tweak the queries yourself.

Let’s get started!

The GitHub events dataset contains individual records for various GitHub activities, including people liking, forking, or commenting on repositories.

That data has been ingested into a table called events. Below is an example of one row in the table:


1SELECT *
2FROM github.events
3WHERE (event_type = 'PullRequestReviewCommentEvent') 
4AND (repo_name = 'ClickHouse/ClickHouse')
5ORDER BY created_at ASC
6LIMIT 1
7FORMAT PrettyJSONEachRow;


{
	"file_time": "2019-09-23 11:00:00",
	"event_type": "PullRequestReviewCommentEvent",
	"actor_login": "excitoon",
	"repo_name": "ClickHouse/ClickHouse",
	"repo_id": "60246359",
	"created_at": "2019-09-23 11:25:54",
	"updated_at": "2019-09-23 11:25:54",
	"action": "created",
	"comment_id": "327062451",
	"body": "c03857b2aee08fee4616cecae6caf4dbd94555bf",
	"path": "dbms/src/TableFunctions/TableFunctionS3.h",
	"position": 5,
	"line": 0,
	"ref": "",
	"ref_type": "none",
	"creator_user_login": "excitoon",
	"number": 5596,
	"title": "s3 table function and storage",
	"labels": [
    	"can be tested",
    	"pr-feature"
	],
	"state": "closed",
	"locked": 0,
	"assignee": "",
	"assignees": [],
	"comments": 0,
	"author_association": "CONTRIBUTOR",
	"closed_at": "2019-09-22 21:53:07",
	"merged_at": "2019-09-22 21:53:07",
	"merge_commit_sha": "2054f80623f0454b1aabeccbaffc49e17e005926",
	"requested_reviewers": [
    	"stavrolia"
	],
	"requested_teams": [],
	"head_ref": "table_function_s3",
	"head_sha": "4608da13449dacbeabda77ec9d7d10bb8db1358b",
	"base_ref": "master",
	"base_sha": "617604193531842d5e4059ace9d3ef24eeaa8ab6",
	"merged": 0,
	"mergeable": 0,
	"rebaseable": 0,
	"mergeable_state": "unknown",
	"merged_by": "",
	"review_comments": 0,
	"maintainer_can_modify": 0,
	"commits": 0,
	"additions": 0,
	"deletions": 0,
	"changed_files": 0,
	"diff_hunk": "@@ -0,0 +1,25 @@n+#pragma oncen+n+#include n+#include n+#include ",
	"original_position": 5,
	"commit_id": "4608da13449dacbeabda77ec9d7d10bb8db1358b",
	"original_commit_id": "4608da13449dacbeabda77ec9d7d10bb8db1358b",
	"push_size": 0,
	"push_distinct_size": 0,
	"member_login": "",
	"release_tag_name": "",
	"release_name": "",
	"review_state": "none"
}

We can write the following query to get the schema for this table:


DESCRIBE github.events

The full schema is quite large, so we’re just showing the columns we’ll use in our queries below.


CREATE TABLE github.events
(
	`file_time` DateTime,
	`event_type` Enum8('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22), 
	`actor_login` LowCardinality(String),
	`repo_name` LowCardinality(String),
	`repo_id` LowCardinality(String),
	`created_at` DateTime,
..
)
ENGINE = MergeTree
ORDER BY (event_type, repo_name, toDate(created_at))

The most important thing to note here is the ordering key as indicated by the ORDER BY clause. This sorts the data by (event_type, repo_name, toDate(created_at)) because most of our queries will filter by event_type and repo_name with a possible date range filter. Sorting the data this way means that ClickHouse can easily prune the data it needs to read at query time.

Sorting data according to query patterns is one of the things that makes ClickHouse fast. You can read more in the Why is ClickHouse so fast? guide.

Let’s have a look at how much data we’ll be working with:


1SELECT formatReadableQuantity(sum(rows)) AS rows,
2       formatReadableSize(sum(data_compressed_bytes)) AS compressedBytes,
3       formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressedBytes
4FROM system.parts
5WHERE database = 'github' AND table = 'events';

At the time of writing, we’ve got just over 9 billion rows, which takes up over 400GB of space in its compressed form and over 3TB uncompressed. It’s a pretty sizable dataset!

ClickHouse is an analytics database optimized for fast aggregation queries written in SQL. Thanks to the ordering key, high compression, and parallelized execution, even complex aggregations can be fast for almost 10 billion rows.

Over the following few sections, we will first run some aggregations before learning how to optimize these queries for even faster performance.

Let’s start by getting an overview of the dataset. The following query counts the number of each event type:


1SELECT event_type, count() AS count, formatReadableQuantity(count)
2FROM github.events
3GROUP BY ALL
4ORDER BY count DESC LIMIT 8;

We can see a breakdown of the event types in the GitHub event types documentation.

GitHub is centered around repositories where various communities and people collaborate. However, not all repositories are created equal. Some will be extremely popular, and others will act as dumping grounds for code (I’m sure we’ve all got a few of those!).

One way to determine a repository's popularity is to look at its number of stars. We can compute the number of stars by counting the times that event_type = 'WatchEvent'. We’ll also filter the results so we find the most popular repositories in 2025:


1SELECT splitByChar('/', repo_name)[2] as project, count() as stars
2FROM github.events 
3WHERE event_type = 'WatchEvent'
4AND created_at >= '2025-01-01'
5GROUP BY ALL
6ORDER BY count() DESC 
7LIMIT 5;

When writing this guide, in March 2025, the top repository is deepseek-ai/DeepSeek-R1, which has received over 80,000 stars in about 6 weeks. This is the repository for DeepSeek’s R1 reasoning model, which was the talk of the (AI) world when it was released.

Next, we’ll see how the DeepSeek repository has accumulated those stars over time. The following query counts the number of stars received each day:


1SELECT toDate(created_at) AS day, count() AS dailyCount
2FROM github.events 
3WHERE event_type = 'WatchEvent' AND repo_name = 'deepseek-ai/DeepSeek-R1'
4GROUP BY ALL
5ORDER BY day;

When the model was first released, we saw a surge of over 10,000 stars in one day at the end of January, but it quickly fell off, and now, it receives between 100 and 300 stars per day.

Many people have starred the DeepSeek repository, but is there any other activity? We can count some of the other event types like this:


1SELECT countIf(event_type == 'WatchEvent') AS stars,
2       countIf(event_type == 'PullRequestEvent') AS prs,
3       countIf(event_type == 'ForkEvent') AS forks,
4       countIf(event_type == 'IssuesEvent') AS issues
5FROM github.events 
6WHERE repo_name = 'deepseek-ai/DeepSeek-R1';

We can see that, in addition to stars, many issues have been created for this repository.

We can see how those event types have evolved by writing the following query:


1SELECT toDate(created_at) AS day, event_type, count() as events
2FROM github.events 
3WHERE event_type IN ('IssuesEvent', 'ForkEvent', 'PullRequestEvent', 'WatchEvent')
4AND repo_name = 'deepseek-ai/DeepSeek-R1'
5GROUP BY ALL
6ORDER BY day

This chart more clearly shows the surge in interest on January 28th, 2025, followed by the quick tailoff.

Let’s move away from DeepSeek now and do some more general queries.

It might also be interesting to see which repositories have “emerged” in 2024 and become popular. We can again filter by event_type = 'WatchEvent' and conditionally count using the countIf function to create buckets for 2023 and 2024 before computing year-over-year growth.


1SELECT repo_name,
2       countIf(toYear(created_at) <= 2023) AS stars2023,
3       countIf(toYear(created_at) <= 2024) AS stars2024,
4       round((stars2024 - stars2023) / stars2023, 3) AS yoy,
5       min(created_at) AS first_seen
6FROM github.events
7WHERE event_type = 'WatchEvent'
8GROUP BY repo_name
9ORDER BY yoy DESC, stars2024 DESC
10LIMIT 10

We can see the emergence of ollama for running LLM models locally, as well as the hugely popular Zed editor. All of these projects have infinite growth with no stars in 2023.

If we filter to projects which already existed prior to 2024 with an established presence i.e. > 1000 stars.


1SELECT repo_name,
2        countIf(toYear(created_at) <= 2023) AS stars2023,
3        countIf(toYear(created_at) <= 2024) AS stars2024,
4        round((stars2024 - stars2023) / stars2023, 3) AS yoy,
5        min(created_at) AS first_seen
6FROM github.events
7WHERE event_type = 'WatchEvent'
8GROUP BY repo_name
9HAVING stars2023 > 1000
10ORDER BY yoy DESC, stars2024 DESC
11LIMIT 10

The conductor project looks particularly interesting with impressive community growth numbers in the last year. CopilotKit looks potentially applicable to ClickHouse.

Something that you often see with tech datasets is that there will be a dip in engagement over the weekend. Let’s see if that’s the case in this dataset by counting the number of events by the day of the week:


1SELECT toDayOfWeek(created_at) AS year, event_type, count() AS stars
2FROM github.events
3WHERE event_type IN (
4  'ForkEvent', 'IssuesEvent', 'WatchEvent', 'PushEvent', 'PullRequestEvent'
5)  
6GROUP BY ALL
7ORDER BY year ASC;

It’s hard to see the dip over the weekend for the less popular event types, as PushEvent dominates the chart. We can fix that by writing the following query to compute the relative change in the numbers of each event by day of the week:


1WITH eventTypeStars AS (
2  SELECT toDayOfWeek(created_at) AS dayOfWeek, event_type, 
3         count() AS stars, 
4         max(stars) OVER(PARTITION BY event_type) AS maxDay
5  FROM github.events
6  WHERE event_type IN (
7    'ForkEvent', 'IssuesEvent', 'WatchEvent', 'PushEvent', 'PullRequestEvent'
8  )  
9  GROUP BY ALL
10  ORDER BY dayOfWeek ASC
11)
12
13SELECT *, stars/maxDay AS ratio
14FROM eventTypeStars;

This more clearly shows that activity dropped significantly over the weekend, with Tuesday being the best day to “catch a star”. Another interesting thing in this chart is that all event types except PullRequestEvent recovered from Saturday to Sunday.

Let’s change track and explore the contributors on GitHub. We’re going to find the most active contributors by counting the number of pushes done on the top 1000 repositories:


1SELECT
2	actor_login,
3	countIf(event_type = 'PushEvent') AS c,
4	uniq(repo_name) AS repos,
5	countIf(event_type = 'IssuesEvent') AS issues,
6	countIf(event_type = 'WatchEvent') AS stars,
7	anyHeavy(repo_name)
8FROM github.events
9WHERE (event_type IN ('PushEvent', 'IssuesEvent', 'WatchEvent')) AND (repo_name IN (
10	SELECT repo_name
11	FROM github.events
12	WHERE event_type = 'WatchEvent'
13	GROUP BY repo_name
14	ORDER BY count() DESC
15	LIMIT 1000
16))
17GROUP BY actor_login
18HAVING (repos < 10000) AND (issues > 1) AND (stars > 1)
19ORDER BY c DESC
20LIMIT 20;

You may notice the query took a little longer to run than the others we’ve written. That’s because it is the first query we’ve written that doesn’t exploit the ordering key very well - we don’t filter by event_type or repo_name in the outer query. Therefore, many rows need to be scanned to compute the answer.

Let’s look at some ways to improve this query's performance.

Incremental materialized views shift the cost of computation from query time to insert time. An incremental materialized view is equivalent to a trigger that runs a query on blocks of data as they are inserted into a table. The result of this query is inserted into a second "target" table. These results are merged transparently in the background over time, ensuring the results are consistent with a query over the original table.

We’ll start by creating a materialized view for the inner query that returns the most popular repositories. The materialized view will write into the following target table:


CREATE TABLE repo_stars (
  repo_name String,
  count UInt32
)
ENGINE = SummingMergeTree 
ORDER BY (repo_name)

And the materialized view itself is shown below:


CREATE MATERIALIZED VIEW github.repo_stars_mv
TO github.repo_stars AS 
SELECT repo_name, count() as count 
FROM github.events
WHERE event_type = 'WatchEvent' 
GROUP BY repo_name

Incremental materialized views are only triggered when new rows are added to the base table (github.events in this case). We’ll also need to write a query to backfill previous data into the target table. The easiest way to do this is by writing an [INSERT INTO...SELECT](https://clickhouse.com/docs/sql-reference/statements/insert-into#inserting-the-results-of-select) query:


INSERT INTO github.repo_stars SELECT repo_name
FROM github.events
WHERE event_type = 'WatchEvent'
GROUP BY repo_name;

For an alternate way of backfilling data that’s more memory efficient and resilient to failure, see the backfilling data guide.

Once repo_stars is populated, we can update our query to find the most active contributors. Notice how the inner query now uses repo_stars:


1SELECT
2	actor_login,
3	countIf(event_type = 'PushEvent') AS c,
4	uniq(repo_name) AS repos,
5	countIf(event_type = 'IssuesEvent') AS issues,
6	countIf(event_type = 'WatchEvent') AS stars,
7	anyHeavy(repo_name)
8FROM github.events
9WHERE (event_type IN ('PushEvent', 'IssuesEvent', 'WatchEvent')) AND (repo_name IN (
10	SELECT repo_name
11	FROM github.repo_stars
12	GROUP BY repo_name
13	ORDER BY sum(count) DESC
14	LIMIT 1000
15))
16GROUP BY actor_login
17HAVING (repos < 10000) AND (issues > 1) AND (stars > 1)
18ORDER BY c DESC
19LIMIT 20;

You might also notice that we use sum(count). This ensures that any rows for a given repository are consolidated, even if ClickHouse’s background merging (which is eventual and not guaranteed to have completed) hasn’t finished yet.

That’s knocked a few seconds off the query time, but can we speed it up further?

We can also create an incremental materialized view for the main/outer part of the query. The target table is described below:


CREATE TABLE actors_per_repo
(
	`actor_login` LowCardinality(String),
	`repo_name` String,
	`pushes` UInt32,
	`issues` UInt32,
	`stars` UInt32
)
ENGINE = SummingMergeTree
ORDER BY (repo_name, actor_login);

This table uses the SummingMergeTree table engine. When data parts are merged with this table engine, ClickHouse replaces all the rows with the same sorting key, with one row containing summed values for numeric columns. This allows the table to receive incremental updates.

The SummingMergeTree is a simpler version of the AggregatingMergeTree table engine. This latter engine can be used for all types of aggregations, storing intermediate states for aggregation results and merging them correctly. A sum presents the simplest example, but all aggregation results in ClickHouse can be stored this way.

The definition of the materialized view is shown below:


CREATE MATERIALIZED VIEW github.actors_per_repo_mv 
TO github.actors_per_repo
AS 
SELECT
	actor_login,
	repo_name,
	countIf(event_type = 'PushEvent') AS pushes,
	countIf(event_type = 'IssuesEvent') AS issues,
	countIf(event_type = 'WatchEvent') AS stars
FROM github.events
GROUP BY actor_login, repo_name;

And, as before, we’ll need to backfill this table:


INSERT INTO actors_per_repo SELECT
	actor_login,
	repo_name,
	countIf(event_type = 'PushEvent') AS pushes,
	countIf(event_type = 'IssuesEvent') AS issues,
	countIf(event_type = 'WatchEvent') AS stars
FROM github.events
GROUP BY actor_login, repo_name;

Now we’re ready to update our query to use actors_per_repo and see if we’ve been able to improve the performance further:


1SELECT
2	actor_login,
3	sum(pushes) AS pushes,
4	uniq(repo_name) AS repos,
5	sum(issues) AS repo_issues,
6	sum(stars) AS stars,
7	argMax(repo_name, issues) AS repo
8FROM github.actors_per_repo
9WHERE repo_name IN (
10	SELECT repo_name
11	FROM github.repo_stars
12	GROUP BY repo_name
13	ORDER BY sum(count) DESC
14	LIMIT 1000
15)
16GROUP BY actor_login
17HAVING (repos < 10000) AND (repo_issues > 1) AND (stars > 1)
18ORDER BY pushes DESC
19LIMIT 20;

Introducing these materialized views has helped reduce the query time to a couple of seconds, which is fine for many cases. If that’s still not fast enough, we have one more trick up our sleeve.

Refreshable materialized views are similar to the materialized views that you see in relational databases like Postgres. You specify a query that will be run at intervals (a bit like a cron job), and the results are stored in a target table.

Refreshable materialized views are most frequently used when you want to speed up a query that contains joins, but they can also be used when you’re happy to trade off data freshness for improved query latency and need to store a result set periodically.

We will start by speeding up the query to find the most popular repositories. We’ll create the following target table. Notice how this table is ordered by stars, thus allowing the top N to be computed easily:


CREATE TABLE github.top_repos
(
	`repo_name` String,
	`stars` UInt64
)
ENGINE = MergeTree
ORDER BY stars;

And a refreshable materialized view to populate that table:


CREATE MATERIALIZED VIEW github.top_repos_mv
REFRESH EVERY 30 MINUTES TO github.top_repos
(
	`repo_name` String,
	`stars` Int64
)
AS SELECT
	repo_name,
	sum(count) AS stars
FROM github.repo_events_per_day
WHERE event_type = 'WatchEvent'
GROUP BY ALL;

Notice how the refreshable materialized view executes every 30 minutes. The target table top_repos will be atomically updated with the results.

If we update our query to use the new target table:


1SELECT
2	actor_login,
3	sum(pushes) AS pushes,
4	uniq(repo_name) AS repos,
5	sum(issues) AS repo_issues,
6	sum(stars) AS stars,
7	argMax(repo_name, issues) AS repo
8FROM github.actors_per_repo
9WHERE repo_name IN (
10	SELECT repo_name
11	FROM github.top_repos
12	ORDER BY stars DESC
13	LIMIT 1000
14)
15GROUP BY actor_login
16HAVING (repos < 10000) AND (repo_issues > 1) AND (stars > 1)
17ORDER BY pushes DESC
18LIMIT 20;

This improves our query further to around 1.5s. We could take this even further and use a refreshable materialized view to store the complete result of the query. In this case, the final query becomes very simple - it just retrieves the top 20 results from a table.

Our target table is defined below:


CREATE TABLE github.top_actors
(
	`actor_login` LowCardinality(String),
	`pushes` UInt64,
	`repos` UInt64,
	`repo_issues` UInt64,
	`stars` UInt64,
	`repo` String
)
ENGINE = MergeTree
ORDER BY pushes;

Now, let’s create a materialized view to populate the target table:


CREATE MATERIALIZED VIEW github.top_actors_mv
REFRESH EVERY 1 HOUR TO github.top_actors
AS SELECT
	actor_login,
	sum(pushes) AS pushes,
	uniq(repo_name) AS repos,
	sum(issues) AS repo_issues,
	sum(stars) AS stars,
	argMax(repo_name, issues) AS repo
FROM github.actors_per_repo
WHERE repo_name IN (
	SELECT repo_name
	FROM github.top_repos
	ORDER BY stars DESC
	LIMIT 1000
)
GROUP BY actor_login
HAVING (repos < 10000) AND (repo_issues > 1) AND (stars > 1)
ORDER BY pushes DESC

We can then write the following query to find the top 10 most active contributors:


1SELECT * 
2FROM github.top_actors 
3ORDER BY pushes DESC 
4LIMIT 10

1/100th of a second should be fast enough for any dashboard!

We’ll conclude this guide by revisiting the actors_per_repo table we created earlier. The table stores a row for every actor and repo name along with the number of pushes, issues, and stars.


CREATE TABLE actors_per_repo(	
  `actor_login` LowCardinality(String),
  `repo_name` String,
  `pushes` UInt32,
  `issues` UInt32,
  `stars` UInt32
)
ENGINE = SummingMergeTree
ORDER BY (repo_name, actor_login);

The table is optimized for filtering by repo_name and then actor_login.

In ClickHouse, queries will be faster if the filter criteria match a column earlier in the ordering key. If you’re curious why, see A Practical Introduction to Primary Indexes in ClickHouse.

We can optimize tables for different access patterns by adding projections. Projections store a copy of the data in a table in a different order. At query time, ClickHouse will decide which projection, if any, to use.

To make this table more generically useful, and allow users to query for their own contributions, we can add a projection that sorts by actor_login. We need first to set the following property so that ClickHouse knows what to do with projections when parts are merged:


ALTER TABLE github.actors_per_repo  
MODIFY SETTING deduplicate_merge_projection_mode='rebuild';

We can then create the projection and materialize it:


ALTER TABLE github.actors_per_repo 
ADD PROJECTION actor_login_order (
  SELECT * 
  ORDER BY actor_login
);
ALTER TABLE github.actors_per_repo 
MATERIALIZE PROJECTION actor_login_order;

With this projection in place, queries for specific users are also very fast:


1SELECT actor_login, splitByChar('/', repo_name)[2] as project, sum(pushes) as pushes, sum(issues) as issues, sum(stars) as stars
2FROM github.actors_per_repo
3WHERE actor_login = 'alexey-milovidov'
4GROUP BY actor_login, project
5ORDER BY pushes DESC
6LIMIT 5;

That’s the end of this interactive guide to querying GitHub events data with ClickHouse. We hope you enjoyed reading it. If you’d like to play around with more datasets in ClickHouse, head over to the SQL playground.

Explore GitHub stats with ClickHouse

What is Real-time analytics?

An intro to GitHub events

Fast SQL aggregations

Most popular event types

Top repositories by stars in 2025

Stars per day

Counting event types

Counting event types over time

Emerging Repos

Events by day of week

Most active contributors (not bots)

Faster SQL aggregations with incremental materialized views

Refreshable materialized views

Projections

That's a wrap!

Subscribe to our newsletter

Other demos

Observability demo - ClickStack demo

SQL Playground

ClickPy

CryptoHouse

Products

Resources

Company

Join our community

Comparisons

Partners