Skip to main content

How to query Pandas DataFrames with chDB

Pandas is a popular library for data manipulation and analysis in Python. In version 2 of chDB, we've improved the performance of querying Pandas DataFrames and introduced the Python table function. In this guide, we will learn how to query Pandas using the Python table function.

Setup

Let's first create a virtual environment:

python -m venv .venv
source .venv/bin/activate

And now we'll install chDB. Make sure you have version 2.0.2 or higher:

pip install "chdb>=2.0.2"

And now we're going to install Pandas and a couple of other libraries:

pip install pandas requests ipython

We're going to use ipython to run the commands in the rest of the guide, which you can launch by running:

ipython

You can also use the code in a Python script or in your favorite notebook.

Creating a Pandas DataFrame from a URL

We're going to query some data from the StatsBomb GitHub repository. Let's first import requests and pandas:

import requests
import pandas as pd

Then, we'll load one of the matches JSON files into a DataFrame:

response = requests.get(
"https://raw.githubusercontent.com/statsbomb/open-data/master/data/matches/223/282.json"
)
matches_df = pd.json_normalize(response.json(), sep='_')

Let's have a look what data we'll be working with:

matches_df.iloc[0]
match_id                                                                  3943077
match_date 2024-07-15
kick_off 04:15:00.000
home_score 1
away_score 0
match_status available
match_status_360 unscheduled
last_updated 2024-07-15T15:50:08.671355
last_updated_360 None
match_week 6
competition_competition_id 223
competition_country_name South America
competition_competition_name Copa America
season_season_id 282
season_season_name 2024
home_team_home_team_id 779
home_team_home_team_name Argentina
home_team_home_team_gender male
home_team_home_team_group None
home_team_country_id 11
home_team_country_name Argentina
home_team_managers [{'id': 5677, 'name': 'Lionel Sebastián Scalon...
away_team_away_team_id 769
away_team_away_team_name Colombia
away_team_away_team_gender male
away_team_away_team_group None
away_team_country_id 49
away_team_country_name Colombia
away_team_managers [{'id': 5905, 'name': 'Néstor Gabriel Lorenzo'...
metadata_data_version 1.1.0
metadata_shot_fidelity_version 2
metadata_xy_fidelity_version 2
competition_stage_id 26
competition_stage_name Final
stadium_id 5337
stadium_name Hard Rock Stadium
stadium_country_id 241
stadium_country_name United States of America
referee_id 2638
referee_name Raphael Claus
referee_country_id 31
referee_country_name Brazil
Name: 0, dtype: object

Next, we'll load one of the events JSON files and also add a column called match_id to that DataFrame:

response = requests.get(
"https://raw.githubusercontent.com/statsbomb/open-data/master/data/events/3943077.json"
)
events_df = pd.json_normalize(response.json(), sep='_')
events_df["match_id"] = 3943077

And again, let's have a look at the first row:

with pd.option_context("display.max_rows", None):
first_row = events_df.iloc[0]
non_nan_columns = first_row[first_row.notna()].T
display(non_nan_columns)
id                                   279b7d66-92b5-4daa-8ff6-cba8fce271d9
index 1
period 1
timestamp 00:00:00.000
minute 0
second 0
possession 1
duration 0.0
type_id 35
type_name Starting XI
possession_team_id 779
possession_team_name Argentina
play_pattern_id 1
play_pattern_name Regular Play
team_id 779
team_name Argentina
tactics_formation 442.0
tactics_lineup [{'player': {'id': 6909, 'name': 'Damián Emili...
match_id 3943077
Name: 0, dtype: object

Querying Pandas DataFrames

Next, let's see how to query these DataFrames using chDB. We'll import the library:

import chdb

We can query Pandas DataFrames by using the Python table function:

SELECT *
FROM Python(<name-of-variable>)

So, if we wanted to list the columns in matches_df, we could write the following:

chdb.query("""
DESCRIBE Python(matches_df)
SETTINGS describe_compact_output=1
""", "DataFrame")
                              name    type
0 match_id Int64
1 match_date String
2 kick_off String
3 home_score Int64
4 away_score Int64
5 match_status String
6 match_status_360 String
7 last_updated String
8 last_updated_360 String
9 match_week Int64
10 competition_competition_id Int64
11 competition_country_name String
12 competition_competition_name String
13 season_season_id Int64
14 season_season_name String
15 home_team_home_team_id Int64
16 home_team_home_team_name String
17 home_team_home_team_gender String
18 home_team_home_team_group String
19 home_team_country_id Int64
20 home_team_country_name String
21 home_team_managers String
22 away_team_away_team_id Int64
23 away_team_away_team_name String
24 away_team_away_team_gender String
25 away_team_away_team_group String
26 away_team_country_id Int64
27 away_team_country_name String
28 away_team_managers String
29 metadata_data_version String
30 metadata_shot_fidelity_version String
31 metadata_xy_fidelity_version String
32 competition_stage_id Int64
33 competition_stage_name String
34 stadium_id Int64
35 stadium_name String
36 stadium_country_id Int64
37 stadium_country_name String
38 referee_id Int64
39 referee_name String
40 referee_country_id Int64
41 referee_country_name String

We could then find out which referees have officiated more than one match by writing the following query:

chdb.query("""
SELECT referee_name, count() AS count
FROM Python(matches_df)
GROUP BY ALL
HAVING count > 1
ORDER BY count DESC
""", "DataFrame")
                    referee_name  count
0 César Arturo Ramos Palazuelos 3
1 Maurizio Mariani 3
2 Piero Maza Gomez 3
3 Mario Alberto Escobar Toca 2
4 Wilmar Alexander Roldán Pérez 2
5 Jesús Valenzuela Sáez 2
6 Wilton Pereira Sampaio 2
7 Darío Herrera 2
8 Andrés Matonte 2
9 Raphael Claus 2

Now, let's explore events_df.

chdb.query("""
SELECT pass_recipient_name, count()
FROM Python(events_df)
WHERE type_name = 'Pass' AND pass_recipient_name <> ''
GROUP BY ALL
ORDER BY count() DESC
LIMIT 10
""", "DataFrame")
               pass_recipient_name  count()
0 Davinson Sánchez Mina 76
1 Ángel Fabián Di María Hernández 64
2 Alexis Mac Allister 62
3 Enzo Fernandez 57
4 James David Rodríguez Rubio 56
5 Johan Andrés Mojica Palacio 55
6 Rodrigo Javier De Paul 54
7 Jefferson Andrés Lerma Solís 53
8 Jhon Adolfo Arias Andrade 52
9 Carlos Eccehomo Cuesta Figueroa 50

Joining Pandas DataFrames

We can also join DataFrames together in a query. For example, to get an overview of the match, we could write the following query:

chdb.query("""
SELECT home_team_home_team_name, away_team_away_team_name, home_score, away_score,
countIf(type_name = 'Pass' AND possession_team_id=home_team_home_team_id) AS home_passes,
countIf(type_name = 'Pass' AND possession_team_id=away_team_away_team_id) AS away_passes,
countIf(type_name = 'Shot' AND possession_team_id=home_team_home_team_id) AS home_shots,
countIf(type_name = 'Shot' AND possession_team_id=away_team_away_team_id) AS away_shots
FROM Python(matches_df) AS matches
JOIN Python(events_df) AS events ON events.match_id = matches.match_id
GROUP BY ALL
LIMIT 5
""", "DataFrame").iloc[0]
home_team_home_team_name    Argentina
away_team_away_team_name Colombia
home_score 1
away_score 0
home_passes 527
away_passes 669
home_shots 11
away_shots 19
Name: 0, dtype: object

Populating a table from a DataFrame

We can also create and populate ClickHouse tables from DataFrames. If we want to create a table in chDB, we need to use the Stateful Session API.

Let's import the session module:

from chdb import session as chs

Initialize a session:

sess = chs.Session()

Next, we'll create a database:

sess.query("CREATE DATABASE statsbomb")

Then, create an events table based on events_df:

sess.query("""
CREATE TABLE statsbomb.events ORDER BY id AS
SELECT *
FROM Python(events_df)
""")

We can then run the query that returns the top pass recipient:

sess.query("""
SELECT pass_recipient_name, count()
FROM statsbomb.events
WHERE type_name = 'Pass' AND pass_recipient_name <> ''
GROUP BY ALL
ORDER BY count() DESC
LIMIT 10
""", "DataFrame")
               pass_recipient_name  count()
0 Davinson Sánchez Mina 76
1 Ángel Fabián Di María Hernández 64
2 Alexis Mac Allister 62
3 Enzo Fernandez 57
4 James David Rodríguez Rubio 56
5 Johan Andrés Mojica Palacio 55
6 Rodrigo Javier De Paul 54
7 Jefferson Andrés Lerma Solís 53
8 Jhon Adolfo Arias Andrade 52
9 Carlos Eccehomo Cuesta Figueroa 50

Joining a Pandas DataFrame and table

Finally, we can also update our join query to join the matches_df DataFrame with the statsbomb.events table:

sess.query("""
SELECT home_team_home_team_name, away_team_away_team_name, home_score, away_score,
countIf(type_name = 'Pass' AND possession_team_id=home_team_home_team_id) AS home_passes,
countIf(type_name = 'Pass' AND possession_team_id=away_team_away_team_id) AS away_passes,
countIf(type_name = 'Shot' AND possession_team_id=home_team_home_team_id) AS home_shots,
countIf(type_name = 'Shot' AND possession_team_id=away_team_away_team_id) AS away_shots
FROM Python(matches_df) AS matches
JOIN statsbomb.events AS events ON events.match_id = matches.match_id
GROUP BY ALL
LIMIT 5
""", "DataFrame").iloc[0]
home_team_home_team_name    Argentina
away_team_away_team_name Colombia
home_score 1
away_score 0
home_passes 527
away_passes 669
home_shots 11
away_shots 19
Name: 0, dtype: object