跳到主要内容
跳到主要内容

如何使用 chDB 查询 Pandas DataFrame

Pandas 是 Python 中广泛使用的数据处理与分析库。 在 chDB 的第 2 版中,我们提升了对 Pandas DataFrame 的查询性能,并引入了 Python 表函数。 在本指南中,我们将学习如何使用 Python 表函数查询 Pandas DataFrame。

环境准备

我们先创建一个虚拟环境:

python -m venv .venv
source .venv/bin/activate

接下来我们将安装 chDB。 请确保安装的版本为 2.0.2 或更高:

pip install "chdb>=2.0.2"

接下来我们要安装 Pandas 和另外几个库:

pip 安装 pandas requests ipython

接下来我们将使用 ipython 来运行本指南其余部分中的命令,你可以通过运行以下命令来启动:

ipython

你也可以在 Python 脚本或常用的 Notebook 环境中使用这段代码。

通过 URL 创建 Pandas DataFrame

我们将从 StatsBomb GitHub 仓库 查询一些数据。 先导入 requests 和 pandas:

import requests
import pandas as pd

然后,我们将把其中一个 matches 的 JSON 文件加载到 DataFrame 中:

response = requests.get(
  "https://raw.githubusercontent.com/statsbomb/open-data/master/data/matches/223/282.json"
)
matches_df = pd.json_normalize(response.json(), sep='_')

先来看看我们将要处理的数据:

matches_df.iloc[0]
match_id                                                                  3943077
match_date                                                             2024-07-15
kick_off                                                             04:15:00.000
home_score                                                                      1
away_score                                                                      0
match_status                                                            available
match_status_360                                                      unscheduled
last_updated                                           2024-07-15T15:50:08.671355
last_updated_360                                                             None
match_week                                                                      6
competition_competition_id                                                    223
competition_country_name                                            南美洲
competition_competition_name                                         美洲杯
season_season_id                                                              282
season_season_name                                                           2024
home_team_home_team_id                                                        779
home_team_home_team_name                                                阿根廷
home_team_home_team_gender                                                   male
home_team_home_team_group                                                    None
home_team_country_id                                                           11
home_team_country_name                                                  阿根廷
home_team_managers              [{'id': 5677, 'name': 'Lionel Sebastián Scalon...
away_team_away_team_id                                                        769
away_team_away_team_name                                                 哥伦比亚
away_team_away_team_gender                                                   male
away_team_away_team_group                                                    None
away_team_country_id                                                           49
away_team_country_name                                                   哥伦比亚
away_team_managers              [{'id': 5905, 'name': 'Néstor Gabriel Lorenzo'...
metadata_data_version                                                       1.1.0
metadata_shot_fidelity_version                                                  2
metadata_xy_fidelity_version                                                    2
competition_stage_id                                                           26
competition_stage_name                                                      决赛
stadium_id                                                                   5337
stadium_name                                                    Hard Rock Stadium
stadium_country_id                                                            241
stadium_country_name                                         美国
referee_id                                                                   2638
referee_name                                                        Raphael Claus
referee_country_id                                                             31
referee_country_name                                                       巴西
Name: 0, dtype: object

接下来,我们将加载其中一个事件 JSON 文件,并在该 DataFrame 中添加一个名为 match_id 的列:

response = requests.get(
  "https://raw.githubusercontent.com/statsbomb/open-data/master/data/events/3943077.json"
)
events_df = pd.json_normalize(response.json(), sep='_')
events_df["match_id"] = 3943077

我们再来看一下第一行:

with pd.option_context("display.max_rows", None):
    first_row = events_df.iloc[0]
    non_nan_columns = first_row[first_row.notna()].T
    display(non_nan_columns)
id                                   279b7d66-92b5-4daa-8ff6-cba8fce271d9
index                                                                   1
period                                                                  1
timestamp                                                    00:00:00.000
minute                                                                  0
second                                                                  0
possession                                                              1
duration                                                              0.0
type_id                                                                35
type_name                                                     首发阵容
possession_team_id                                                    779
possession_team_name                                            Argentina
play_pattern_id                                                         1
play_pattern_name                                            常规比赛
team_id                                                               779
team_name                                                       Argentina
tactics_formation                                                   442.0
tactics_lineup          [{'player': {'id': 6909, 'name': 'Damián Emili...
match_id                                                          3943077
Name: 0, dtype: object

查询 Pandas DataFrame

接下来,我们来看如何使用 chDB 查询这些 DataFrame。 我们先导入该库:

import chdb

我们可以使用 Python 表函数对 Pandas DataFrame 进行查询:

SELECT *
FROM Python(<name-of-variable>)

因此,如果我们想列出 matches_df 中的列,可以这样写:

chdb.query("""
DESCRIBE Python(matches_df)
SETTINGS describe_compact_output=1
""", "DataFrame")
                              名称    类型
0                         match_id   Int64
1                       match_date  String
2                         kick_off  String
3                       home_score   Int64
4                       away_score   Int64
5                     match_status  String
6                 match_status_360  String
7                     last_updated  String
8                 last_updated_360  String
9                       match_week   Int64
10      competition_competition_id   Int64
11        competition_country_name  String
12    competition_competition_name  String
13                season_season_id   Int64
14              season_season_name  String
15          home_team_home_team_id   Int64
16        home_team_home_team_name  String
17      home_team_home_team_gender  String
18       home_team_home_team_group  String
19            home_team_country_id   Int64
20          home_team_country_name  String
21              home_team_managers  String
22          away_team_away_team_id   Int64
23        away_team_away_team_name  String
24      away_team_away_team_gender  String
25       away_team_away_team_group  String
26            away_team_country_id   Int64
27          away_team_country_name  String
28              away_team_managers  String
29           metadata_data_version  String
30  metadata_shot_fidelity_version  String
31    metadata_xy_fidelity_version  String
32            competition_stage_id   Int64
33          competition_stage_name  String
34                      stadium_id   Int64
35                    stadium_name  String
36              stadium_country_id   Int64
37            stadium_country_name  String
38                      referee_id   Int64
39                    referee_name  String
40              referee_country_id   Int64
41            referee_country_name  String

然后,我们可以通过编写如下查询来找出哪些裁判执法了多场比赛:

chdb.query("""
SELECT referee_name, count() AS count
FROM Python(matches_df)
GROUP BY ALL
HAVING count > 1
ORDER BY count DESC
""", "DataFrame")
                    裁判姓名  计数
0  César Arturo Ramos Palazuelos      3
1               Maurizio Mariani      3
2               Piero Maza Gomez      3
3     Mario Alberto Escobar Toca      2
4  Wilmar Alexander Roldán Pérez      2
5          Jesús Valenzuela Sáez      2
6         Wilton Pereira Sampaio      2
7                  Darío Herrera      2
8                 Andrés Matonte      2
9                  Raphael Claus      2

现在我们来查看一下 events_df

chdb.query("""
SELECT pass_recipient_name, count()
FROM Python(events_df)
WHERE type_name = 'Pass' AND pass_recipient_name <> ''
GROUP BY ALL
ORDER BY count() DESC
LIMIT 10
""", "DataFrame")
               传球接收者姓名  次数()
0            达文森 桑切斯 米纳       76
1  安赫尔 法比安 迪 马利亚 赫南德斯       64
2              亚历克西斯 马克·阿里斯特       62
3                   恩佐 费尔南德斯       57
4      詹姆斯 大卫 罗德里格斯 鲁比奥       56
5      约翰 安德烈斯 莫希卡 帕拉西奥       55
6           罗德里戈 哈维尔 德保罗       54
7     杰斐逊 安德烈斯 勒尔马 索利斯       53
8        约翰 阿道夫 奥 阿里亚斯 安德拉德       52
9  卡洛斯 埃塞霍莫 库埃斯塔 菲格罗阿       50

连接 Pandas DataFrame

我们也可以在查询中将多个 DataFrame 连接在一起。 例如,为了获得比赛的总体概况,我们可以编写如下查询:

chdb.query("""
SELECT home_team_home_team_name, away_team_away_team_name, home_score, away_score,
       countIf(type_name = 'Pass' AND possession_team_id=home_team_home_team_id) AS home_passes,
       countIf(type_name = 'Pass' AND possession_team_id=away_team_away_team_id) AS away_passes,
       countIf(type_name = 'Shot' AND possession_team_id=home_team_home_team_id) AS home_shots,
       countIf(type_name = 'Shot' AND possession_team_id=away_team_away_team_id) AS away_shots
FROM Python(matches_df) AS matches
JOIN Python(events_df) AS events ON events.match_id = matches.match_id
GROUP BY ALL
LIMIT 5
""", "DataFrame").iloc[0]
home_team_home_team_name    阿根廷
away_team_away_team_name     哥倫比亞
home_score                          1
away_score                          0
home_passes                       527
away_passes                       669
home_shots                         11
away_shots                         19
Name: 0, dtype: object

使用 DataFrame 填充表

我们也可以从 DataFrame 创建并填充 ClickHouse 表。 如果我们想在 chDB 中创建一张表,就需要使用 Stateful Session API。

让我们先导入 session 模块:

from chdb import session as chs

初始化会话:

sess = chs.Session()

接下来,我们将创建数据库:

sess.query("CREATE DATABASE statsbomb")

然后基于 events_df 创建一张 events 表:

sess.query("""
CREATE TABLE statsbomb.events ORDER BY id AS
SELECT * 
FROM Python(events_df)
""")

然后我们可以运行查询来返回接收次数最多的接收方:

sess.query("""
SELECT pass_recipient_name, count()
FROM statsbomb.events
WHERE type_name = 'Pass' AND pass_recipient_name <> ''
GROUP BY ALL
ORDER BY count() DESC
LIMIT 10
""", "DataFrame")
               pass_recipient_name  count()
0            Davinson Sánchez Mina       76
1  Ángel Fabián Di María Hernández       64
2              Alexis Mac Allister       62
3                   Enzo Fernandez       57
4      James David Rodríguez Rubio       56
5      Johan Andrés Mojica Palacio       55
6           Rodrigo Javier De Paul       54
7     Jefferson Andrés Lerma Solís       53
8        Jhon Adolfo Arias Andrade       52
9  Carlos Eccehomo Cuesta Figueroa       50

连接 Pandas DataFrame 和表

最后,我们还可以更新连接查询,将 matches_df DataFrame 与 statsbomb.events 表进行连接:

sess.query("""
SELECT home_team_home_team_name, away_team_away_team_name, home_score, away_score,
       countIf(type_name = 'Pass' AND possession_team_id=home_team_home_team_id) AS home_passes,
       countIf(type_name = 'Pass' AND possession_team_id=away_team_away_team_id) AS away_passes,
       countIf(type_name = 'Shot' AND possession_team_id=home_team_home_team_id) AS home_shots,
       countIf(type_name = 'Shot' AND possession_team_id=away_team_away_team_id) AS away_shots
FROM Python(matches_df) AS matches
JOIN statsbomb.events AS events ON events.match_id = matches.match_id
GROUP BY ALL
LIMIT 5
""", "DataFrame").iloc[0]
home_team_home_team_name    阿根廷
away_team_away_team_name     哥伦比亚
home_score                          1
away_score                          0
home_passes                       527
away_passes                       669
home_shots                         11
away_shots                         19
Name: 0, dtype: object