Setting up Apache Iceberg locally using PySpark

In this guide, we'll learn how to set up Apache Iceberg locally using PySpark: we'll create a table, populate it with data from a Parquet file, and query the resulting table. We'll then switch gears and see how to populate and query an Apache Iceberg table using ClickHouse.

Prerequisites

PySpark has a dependency on Java, so we need to install that first. We'd recommend using sdkman, which you can install like this:

curl -s "https://get.sdkman.io" | bash

You can install Java like this:

sdk install java 17.0.0-tem

We'll also be using the uv package manager, so make sure you have that installed.

Launching PySpark

We're going to launch PySpark using the uv tool.

uv run --with pyspark==3.5.2 pyspark \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.sandbox.type=hadoop \
  --conf spark.sql.catalog.sandbox.warehouse=/tmp/warehouse

If this works, you should see output like the following:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.2
      /_/

Using Python version 3.12.0 (main, Oct  2 2023 20:56:14)
Spark context Web UI available at http://mac.lan:4040
Spark context available as 'sc' (master = local[*], app id = local-1754397799642).
SparkSession available as 'spark'.
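
If you'd rather run everything from a standalone Python script instead of the interactive shell, the same settings can be applied on a SparkSession builder. A minimal sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

# Same catalog configuration as the pyspark invocation above;
# spark.jars.packages pulls in the Iceberg runtime at startup.
spark = (
    SparkSession.builder
    .appName("iceberg-sandbox")  # arbitrary name for this sketch
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.sandbox", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.sandbox.type", "hadoop")
    .config("spark.sql.catalog.sandbox.warehouse", "/tmp/warehouse")
    .getOrCreate()
)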

New York taxi dataset

Next, we're going to download two months of data from the New York taxi dataset:

curl "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet" -o yellow_tripdata_2023-02.parquet
curl "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet" -o yellow_tripdata_2023-01.parquet
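
It's worth a quick sanity check that the files downloaded correctly, for example by peeking at the schema of one of them from Spark:

# Inspect the raw Parquet schema before loading anything into Iceberg.
spark.read.parquet("yellow_tripdata_2023-01.parquet").printSchema()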

Create and populate Apache Iceberg table with PySpark

Let's now go back to our Spark shell and run the following command to create a namespace:

spark.sql("CREATE NAMESPACE IF NOT EXISTS sandbox.quickstart")
DataFrame[]
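
To double-check that the namespace exists, you can list the namespaces in the catalog:

# List namespaces in the sandbox catalog; 'quickstart' should appear.
spark.sql("SHOW NAMESPACES IN sandbox").show()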

We can then create an Iceberg table based on the data from the first file that we downloaded:

spark.sql("""
CREATE TABLE sandbox.quickstart.taxi_dataset
USING ICEBERG
AS SELECT * FROM parquet.`yellow_tripdata_2023-01.parquet`
""")
DataFrame[]
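
Because we're using a hadoop catalog, the table lives directly on the local filesystem under the warehouse path, split into data and metadata directories. You can see the layout with a few lines of Python:

import os

# A hadoop catalog lays tables out as <warehouse>/<namespace>/<table>/{data,metadata}.
for root, dirs, files in os.walk("/tmp/warehouse/quickstart/taxi_dataset"):
    for name in files:
        print(os.path.join(root, name))

This is the same path that we'll point ClickHouse at later in the guide.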

Querying Apache Iceberg with PySpark

We can query this table to make sure it's been created successfully:

spark.table("sandbox.quickstart.taxi_dataset").describe()
DataFrame[summary: string, VendorID: string, passenger_count: string, trip_distance: string, RatecodeID: string, store_and_fwd_flag: string, PULocationID: string, DOLocationID: string, payment_type: string, fare_amount: string, extra: string, mta_tax: string, tip_amount: string, tolls_amount: string, improvement_surcharge: string, total_amount: string, congestion_surcharge: string, airport_fee: string]

How many records will we be working with?

spark.table("sandbox.quickstart.taxi_dataset").count()
3066766

Just over 3 million, not bad.

Let's do one more query to count the number of rides by passenger count:

spark.sql("""
SELECT passenger_count, count(*)
FROM sandbox.quickstart.taxi_dataset
GROUP BY ALL
ORDER BY passenger_count
""").show()
+---------------+--------+
|passenger_count|count(1)|
+---------------+--------+
|           NULL|   71743|
|            0.0|   51164|
|            1.0| 2261400|
|            2.0|  451536|
|            3.0|  106353|
|            4.0|   53745|
|            5.0|   42681|
|            6.0|   28124|
|            7.0|       6|
|            8.0|      13|
|            9.0|       1|
+---------------+--------+
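
If you prefer the DataFrame API to SQL, the same aggregation looks like this:

from pyspark.sql import functions as F

# DataFrame API equivalent of the GROUP BY query above.
(spark.table("sandbox.quickstart.taxi_dataset")
    .groupBy("passenger_count")
    .agg(F.count("*").alias("rides"))
    .orderBy("passenger_count")
    .show())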

Working with Apache Iceberg tables from ClickHouse

Now let's have a look at how to work with Apache Iceberg tables from ClickHouse. We're going to use clickhouse-local, which you can launch like this:

curl https://clickhouse.com | sh
./clickhouse -mn

ClickHouse has supported Iceberg tables since version 23.2. We can create a table in ClickHouse that points to an Iceberg table by running the following:

CREATE TABLE taxi_dataset
ENGINE = IcebergLocal('/tmp/warehouse/quickstart/taxi_dataset/');

We can then treat this table as if it were any other table, but the queries will be executed against the Iceberg data.

Writing to Apache Iceberg tables with ClickHouse

Starting from ClickHouse 25.7, it's possible to write data to Iceberg tables. As of August 2025, this is an experimental feature, so we need to enable it by running the following:

SET allow_experimental_insert_into_iceberg = 1;

We can import the second Parquet file that we downloaded by running the following:

INSERT INTO taxi_dataset
SELECT *
FROM file('yellow_tripdata_2023-02.parquet');
0 rows in set. Elapsed: 0.481 sec. Processed 2.91 million rows, 47.74 MB (6.06 million rows/s., 99.22 MB/s.)
Peak memory usage: 437.60 MiB.
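
Because ClickHouse committed a new snapshot to the same table, Spark sees the extra rows too. Back in the Spark shell, refresh the cached metadata and re-count (the total should now be just under 6 million rows):

# Pick up the snapshot that ClickHouse just committed, then re-count.
spark.sql("REFRESH TABLE sandbox.quickstart.taxi_dataset")
spark.table("sandbox.quickstart.taxi_dataset").count()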

Querying Apache Iceberg with ClickHouse

We can then query that table like this:

SELECT count()
FROM taxi_dataset
WHERE tip_amount > 5;
   ┌─count()─┐
1. │ 1013386 │ -- 1.01 million
   └─────────┘

We can also see the history of the table by running the following:

SELECT *
FROM system.iceberg_history
FORMAT Vertical;

Row 1:
──────
database:            default
table:               taxi_dataset
made_current_at:     2025-08-05 13:52:04.270
snapshot_id:         6166062234151479044
parent_id:           0
is_current_ancestor: 1

Row 2:
──────
database:            default
table:               taxi_dataset
made_current_at:     2025-08-05 14:00:42.534
snapshot_id:         1467823196 -- 1.47 billion
parent_id:           6166062234151479044
is_current_ancestor: 1

There are two snapshots of the table - the one we created from PySpark and the new one created after ingesting data from ClickHouse.
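
Iceberg exposes the same history through metadata tables, so you can inspect the snapshots from Spark as well:

# Iceberg's snapshots metadata table lists both commits.
spark.sql("""
SELECT committed_at, snapshot_id, operation
FROM sandbox.quickstart.taxi_dataset.snapshots
""").show(truncate=False)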

We can write a time travel query that counts the rides with a tip of more than $5 as of the first snapshot, by providing that snapshot's snapshot_id:

SELECT count()
FROM taxi_dataset
WHERE tip_amount > 5
SETTINGS iceberg_snapshot_id = 6166062234151479044;
   ┌─count()─┐
1. │  517597 │
   └─────────┘

We get roughly half as many results when running against the first snapshot, which makes sense: that snapshot doesn't include February's data.
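
Spark supports the same kind of time travel through Iceberg's VERSION AS OF clause, so we can run the equivalent query from the Spark shell:

# Time travel from Spark: query the table as of the first snapshot.
spark.sql("""
SELECT count(*)
FROM sandbox.quickstart.taxi_dataset VERSION AS OF 6166062234151479044
WHERE tip_amount > 5
""").show()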
