Using ClickHouse to Monitor Job Queues in GitHub Actions

Misha Shiryaev
Jan 5, 2023

Hello and Happy New Year from the ClickHouse development team! Anyone who uses GitHub Actions intensively runs into its missing metrics at some point. One of the most essential is the internal queue of workflow jobs: how many are currently running? How many are waiting in the queue? How many have already finished?

Luckily, there is a way to get this information, and it involves ClickHouse!

In this post, we will show how you can collect data about your GitHub Actions jobs using GitHub webhooks and store it in ClickHouse for analysis.

Necessary tools

GitHub Webhooks

We need the status for each existing workflow job. To achieve this, we'll use GitHub webhooks. A webhook can be scoped to a single repository or to a whole organization.

An organization-level webhook works much better because it reports on every repository in the organization, giving us the full picture.

We need to receive workflow_job events in JSON format:

webhook.png
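
The webhook can be configured in the UI as shown above, or created programmatically. Here is a sketch using the GitHub REST API; the organization name and receiver URL are placeholders, and the token needs the admin:org_hook scope:

import os

import requests

# Placeholders: use your organization and your receiver's public URL.
ORG = "my-org"
HOOK_URL = "https://example.com/github-webhook"
TOKEN = os.environ["GITHUB_TOKEN"]  # a token with the admin:org_hook scope

response = requests.post(
    f"https://api.github.com/orgs/{ORG}/hooks",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "name": "web",               # the only valid name for webhooks
        "active": True,
        "events": ["workflow_job"],  # deliver only workflow job events
        "config": {"url": HOOK_URL, "content_type": "json"},
    },
    timeout=10,
)
response.raise_for_status()
print("created hook", response.json()["id"])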

An example of the hook's payload, with the objects we need:

{
  "action": "completed",
  "workflow_job": {
    "id": 10232656529,
    "run_id": 3749960853,
    "workflow_name": "DocsCheck",
    "head_branch": "from-final",
    "run_url": "https://api.github.com/repos/ClickHouse/ClickHouse/actions/runs/3749960853",
    "run_attempt": 1,
    "node_id": "CR_kwDOA5dJV88AAAACYenykQ",
    "head_sha": "e791e7bbfef2a56cd7d22779046c24086140b325",
    "url": "https://api.github.com/repos/ClickHouse/ClickHouse/actions/jobs/10232656529",
    "html_url": "https://github.com/ClickHouse/ClickHouse/actions/runs/3749960853/jobs/6369064705",
    "status": "completed",
    "conclusion": "failure",
    "started_at": "2022-12-21T14:04:05Z",
    "completed_at": "2022-12-21T14:04:22Z",
    "name": "CheckLabels",
    "steps": [
      {..},
      ..
    ],
    "check_run_url": "https://api.github.com/repos/ClickHouse/ClickHouse/check-runs/10232656529",
    "labels": [
      "self-hosted",
      "style-checker"
    ],
    "runner_id": 14977597,
    "runner_name": "i-04aca26d696a1b457",
    "runner_group_id": 1,
    "runner_group_name": "Default"
  },
  "repository": {
    ..,
    "full_name": "ClickHouse/ClickHouse",
    "private": false,
    ..,
  },
  ..
}

AWS Lambda

The simplest way to go is an AWS Lambda function that receives the events and inserts them into a storage layer, such as ClickHouse! The code was developed under PR 44520 and is available in the ClickHouse repository for reuse. It reads the values from the event, optionally anonymizes sensitive information for private repositories, and inserts the data into our public playground ClickHouse instance.
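
To illustrate the idea, here is a minimal sketch of such a handler. It is not the lambda from the repository: the environment variables, the API Gateway proxy event shape, and the field flattening are assumptions for this example.

import json
import os

import requests  # packaged with the function; any HTTP client would do

# Assumed configuration; the real lambda manages its settings and secrets differently.
CLICKHOUSE_URL = os.environ["CLICKHOUSE_URL"]  # e.g. "https://host:8443"
CLICKHOUSE_AUTH = (os.environ["CLICKHOUSE_USER"], os.environ["CLICKHOUSE_PASSWORD"])


def handler(event, context):
    """Turn one workflow_job delivery into one row in default.workflow_jobs."""
    payload = json.loads(event["body"])  # API Gateway proxy integration
    job = payload["workflow_job"]

    row = {
        "id": job["id"],
        "run_id": job["run_id"],
        "workflow_name": job.get("workflow_name") or "",
        "head_branch": job["head_branch"],
        "run_url": job["run_url"],
        "run_attempt": job["run_attempt"],
        "node_id": job["node_id"],
        "head_sha": job["head_sha"],
        "url": job["url"],
        "html_url": job["html_url"],
        "status": job["status"],
        "conclusion": job["conclusion"] or "",     # null until the job completes
        "started_at": job["started_at"],
        "completed_at": job["completed_at"] or 0,  # null until the job completes
        "name": job["name"],
        "steps": len(job["steps"]),                # only the number of steps is stored
        "check_run_url": job["check_run_url"],
        "labels": job["labels"],
        "runner_id": job["runner_id"] or 0,        # null while the job is queued
        "runner_name": job["runner_name"] or "",
        "runner_group_id": job["runner_group_id"] or 0,
        "runner_group_name": job["runner_group_name"] or "",
        "repository": payload["repository"]["full_name"],
    }

    # One small insert per delivery; best_effort lets ClickHouse parse
    # the ISO 8601 timestamps GitHub sends.
    response = requests.post(
        CLICKHOUSE_URL,
        params={
            "query": "INSERT INTO default.workflow_jobs FORMAT JSONEachRow",
            "date_time_input_format": "best_effort",
        },
        data=json.dumps(row),
        auth=CLICKHOUSE_AUTH,
        timeout=10,
    )
    response.raise_for_status()
    return {"statusCode": 200, "body": "OK"}

Each delivery becomes one row, so every status transition of a job is preserved; the argMax(status, updated_at) queries below rely on exactly that.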

If you need a ClickHouse cluster, a development instance in ClickHouse Cloud is more than enough for this use case. Spin up a cluster on a free trial in minutes, let us deal with the infrastructure and get querying!

ClickHouse

Finally, in ClickHouse, you need to create a table to receive the events. Note that the schema flattens the payload slightly: steps stores only the number of steps, and repository stores the repository's full_name:

CREATE TABLE default.workflow_jobs
(
    `id` UInt64,
    `run_id` UInt64,
    `workflow_name` LowCardinality(String),
    `head_branch` LowCardinality(String),
    `run_url` String,
    `run_attempt` UInt16,
    `node_id` String,
    `head_sha` String,
    `url` String,
    `html_url` String,
    `status` Enum8('queued' = 1, 'in_progress' = 2, 'completed' = 3, 'waiting' = 4),
    `conclusion` LowCardinality(String),
    `started_at` DateTime,
    `completed_at` DateTime,
    `name` LowCardinality(String),
    `steps` UInt16,
    `check_run_url` String,
    `labels` Array(LowCardinality(String)),
    `runner_id` UInt64,
    `runner_name` String,
    `runner_group_id` UInt64,
    `runner_group_name` LowCardinality(String),
    `repository` LowCardinality(String),
    `updated_at` DateTime DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY toStartOfMonth(started_at)
ORDER BY (id, updated_at)

The result

After some events are received, getting as much info as we need is trivial. Suppose we want to see the number of jobs that were running or waiting over the last 10 days. If you are using ClickHouse Cloud, you can even visualize the results of this query with the SQL console:

SELECT
    uniq(id) AS num_jobs,
    toStartOfHour(started_at) AS hour
FROM default.workflow_jobs
WHERE has(labels, 'self-hosted')
    AND status IN ('in_progress', 'waiting')
    AND started_at > now() - INTERVAL 10 DAY
GROUP BY hour
ORDER BY hour ASC

jobs_queued_or_waiting.png

A more interesting query shows the total time jobs spend in the queue, bucketed by hour over the same 10-day window. Here we include the average, maximum, and 90th percentile:

SELECT
    max(time_spent) AS max_time_spent,
    avg(time_spent) AS avg_time_spent,
    quantile(0.9)(time_spent) AS `90th_percentile`,
    toStartOfHour(start_time) AS hour
FROM
(
    SELECT
        max(workflow_jobs.completed_at) - min(workflow_jobs.started_at) AS time_spent,
        argMax(status, updated_at) AS last_event,
        id,
        min(started_at) AS start_time
    FROM default.workflow_jobs
    WHERE has(labels, 'self-hosted')
        AND started_at > now() - INTERVAL 10 DAY
    GROUP BY id
    HAVING last_event = 'completed'
)
GROUP BY hour
ORDER BY hour ASC

time_in_queue.png

Finally, with a little more complexity, we can see the current queue size for our self-hosted GitHub runners, broken down by label and status:

SELECT
    last_status,
    count() AS queue,
    labels
FROM
(
    SELECT
        argMax(status, updated_at) AS last_status,
        labels,
        id,
        html_url
    FROM default.workflow_jobs
    WHERE has(labels, 'self-hosted')
        AND started_at > (now() - toIntervalDay(2))
    GROUP BY ALL
    HAVING last_status != 'completed'
)
GROUP BY ALL
ORDER BY
    labels ASC,
    last_status ASC

The result will look like the following:

play.png

Summary

In this post, we've shown how you can use ClickHouse and GitHub webhooks to monitor the job queues of your GitHub Actions workflows.
