Available now in the AWS Marketplace #
AWS Glue is Amazon’s serverless data integration service that extracts, transforms, and loads data between multiple data sources and services. Today, we’re announcing the launch of the official ClickHouse Connector for AWS Glue, which runs on Glue’s Apache Spark-based serverless ETL engine.
The new ClickHouse Connector for AWS Glue is built on our native Spark connector, enabling you to work with PySpark or Scala within Glue notebooks or ETL scripts. Our new connector eliminates the complexity of installing and managing the ClickHouse Spark connector. Now, with a few clicks, you can install it directly from the AWS Marketplace and add it to your Glue environment in no time.
The connector is available today from the AWS Marketplace.
In this blog, we’ll learn how to build scalable production-ready Apache Spark Jobs with AWS Glue in Python.
Setting up the connector #
We’ll start by setting up the connector.
Prerequisites #
Our connector works with version 4 of AWS Glue. Before you start, verify that you have access to the following versions:
- ClickHouse Cloud 25.4+
- AWS Glue 4 (Spark 3.3, Scala 2, Python 3)
Get your ClickHouse connection details #
Next, let’s gather the connection details for the ClickHouse service. You can find these by clicking the Connect button in ClickHouse Cloud.

Make a note of the following credentials, as we’ll be using them in a few steps.
- Host
- Port
- Username
- Password
Note: It’s worth downloading your credentials when you create your service for later use; otherwise, if you lose your password, you’ll need to use the reset functionality in the Connect modal.
Installing the connector via the Marketplace #
To install the connector, go to the AWS Marketplace and search for “ClickHouse Connector for AWS Glue.” It’s free to subscribe and add to your organization's account.

Creating a connection #
Find the Data Connection menu item in the left-hand navigation to create a connection with the ClickHouse Connector for AWS Glue. Take note of the class name here; we’ll reuse it later in our Spark configuration when setting up the job.

ClickHouse Connector for AWS Glue Details page
After selecting the connector, you can create a connection by supplying a name and description; the connection can then be reused by many jobs. You can also set up networking options if the data you’re retrieving through your Glue job requires a VPC connection or access via a specific Security Group.

AWS Glue configurable network settings
Creating an AWS Glue Job #
Heading back to ETL Jobs via the navigation, we can create a job using the script editor. The Visual UI is not supported at the moment, but we plan to add support for this option so that you can create jobs using the Glue no-code interface.

The different AWS Glue options for creating jobs within Glue Studio
Our first goal is to create a small job that writes test data to ClickHouse, so that we can verify our connection and that everything is set up correctly. The script editor provides a full IDE-like experience within the web browser, perfect for creating and editing Glue jobs. You can also set up version control for collaboration and review of Glue Job scripts, set a job on a schedule for repeated runs, and execute the job directly from the editor.

AWS Glue Script tab code editor
Before we get started with our script, we’ll do some setup: configure the right IAM permissions, select the AWS Glue version that ships the correct versions of Apache Spark, Scala, and Python, and define some parameters that our Spark job can use.
IAM Permissions for the Role #
You will need to follow the AWS documentation to create an IAM role that AWS Glue can assume when interacting with other AWS services, such as temporarily writing to Amazon S3 or reading data sources. Ensure that the `glue:GetJob` and `glue:GetJobs` permissions are also included in the policy attached to the role used in conjunction with the ClickHouse Connector for AWS Glue.

AWS Glue version #
It’s important to note that the ClickHouse connector currently supports running Spark 3.3, Scala 2, and Python 3. You’ll need to configure the Glue Version to be version 4, which supports those versions.

Configuring Job Parameters #
I also suggest setting up some parameters that can be reused in the code. This step is optional (you could replace the arguments section in the script below with hard-coded values), but parameters make it easier to run jobs against multiple environments like dev and production. Create the following job parameters, which can be modified at runtime:
- CLICKHOUSE_HOST
- CLICKHOUSE_PORT
- CLICKHOUSE_USERNAME
- CLICKHOUSE_PASSWORD
Note: parameters need to be supplied through the UI with a `--` prefix for them to be picked up and used in scripts.

Configuring job parameters under Job Details tab
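The same `--`-prefixed arguments can also be supplied when starting the job programmatically. Below is a minimal sketch using boto3’s `start_job_run`; the job name and connection values are placeholders, and in practice you would likely pull the password from a secrets store rather than hard-coding it.

```python
import boto3

glue = boto3.client("glue")

# Placeholder job name and connection details
response = glue.start_job_run(
    JobName="clickhouse-glue-example",
    Arguments={
        # Note the "--" prefix required for job parameters
        "--CLICKHOUSE_HOST": "your-service.clickhouse.cloud",
        "--CLICKHOUSE_PORT": "8443",
        "--CLICKHOUSE_USERNAME": "default",
        "--CLICKHOUSE_PASSWORD": "********",
    },
)
print(response["JobRunId"])
```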
Heading back to the script tab, we can start building a script to prepare for ingesting data into ClickHouse.
Setting up the Spark Environment #
The first thing to set up is registering the ClickHouse catalog in Spark, using the job parameters we created above. We retrieve those parameters with the `getResolvedOptions` method, pulling out all the `CLICKHOUSE_` values we set up in the UI, and then apply them to our Spark configuration.
```python
## @params: [JOB_NAME, CLICKHOUSE_HOST, CLICKHOUSE_PORT, CLICKHOUSE_USERNAME, CLICKHOUSE_PASSWORD]
args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'CLICKHOUSE_HOST',
    'CLICKHOUSE_PORT',
    'CLICKHOUSE_USERNAME',
    'CLICKHOUSE_PASSWORD',
])

spark.conf.set("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog")
# Protocol and SSL config for ClickHouse Cloud
spark.conf.set("spark.sql.catalog.clickhouse.protocol", "https")
spark.conf.set("spark.sql.catalog.clickhouse.option.ssl", "true")
spark.conf.set("spark.sql.catalog.clickhouse.option.ssl_mode", "NONE")
# Connection details
spark.conf.set("spark.sql.catalog.clickhouse.host", args["CLICKHOUSE_HOST"])
spark.conf.set("spark.sql.catalog.clickhouse.http_port", args["CLICKHOUSE_PORT"])
spark.conf.set("spark.sql.catalog.clickhouse.user", args["CLICKHOUSE_USERNAME"])
spark.conf.set("spark.sql.catalog.clickhouse.password", args["CLICKHOUSE_PASSWORD"])
# Suggestion: consider making this a parameter
spark.conf.set("spark.sql.catalog.clickhouse.database", "default")
# spark.clickhouse.write.format default value is arrow
spark.conf.set("spark.clickhouse.write.format", "json")
# spark.clickhouse.read.format default value is json
spark.conf.set("spark.clickhouse.read.format", "arrow")
```
Setting the Spark configuration from the resolved Glue Job parameters
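With the catalog configured, a quick smoke test can confirm that Glue can actually reach your ClickHouse service before you build out the rest of the job. The snippet below is a small sketch that queries ClickHouse’s built-in `system.one` table through the catalog; if the credentials or networking are wrong, this is where the job will fail fast.

```python
# Quick connectivity check: system.one is a built-in, single-row ClickHouse table
spark.sql("SELECT * FROM clickhouse.system.one").show()

# Optionally, list the databases exposed through the catalog
spark.sql("SHOW NAMESPACES IN clickhouse").show()
```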
Now that the connector is set up, we can start looking at how we write and read data into ClickHouse. The first thing we’ll need to do is create a table in our database. We’re just going to create a narrow table containing employees and their identifiers for testing the connector.
Managing database tables #
Our connector allows you to execute DDL operations with Spark SQL (Spark Connector ANTLR Grammar), which is extremely handy for creating, altering, or dropping tables and views. We will just create a simple table for this example, containing an `id` and a `name`.
For more information about what data types ClickHouse supports, you can have a look through our data types documentation when you’re creating your DDL.
```python
logger.info("spark create tbl")
# use ClickHouse database
spark.sql(
    "USE clickhouse;"
)

# create table DDL in the database
spark.sql(
    """
    CREATE TABLE default.glue_job_example (
        id BIGINT NOT NULL COMMENT 'sort key',
        name STRING
    ) USING ClickHouse
    TBLPROPERTIES (
        engine = 'MergeTree()',
        order_by = 'id'
    );
    """
)
```
DDL operations using Spark SQL
Note: We currently do not support sending multiple statements in a single method call. You will need to break these into individual calls.
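For example, if you want the job to be re-runnable, a drop-and-recreate has to be issued as two separate `spark.sql` calls rather than one semicolon-separated string. A sketch of that pattern:

```python
# Each DDL statement goes in its own spark.sql call
spark.sql("DROP TABLE IF EXISTS default.glue_job_example")

spark.sql(
    """
    CREATE TABLE default.glue_job_example (
        id BIGINT NOT NULL COMMENT 'sort key',
        name STRING
    ) USING ClickHouse
    TBLPROPERTIES (
        engine = 'MergeTree()',
        order_by = 'id'
    )
    """
)
```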
Writing data to ClickHouse #
We will import `Row` from `pyspark.sql` to create our DataFrame and populate it with employee data. It’s trivial to write data from DataFrames into ClickHouse; the simplest method is `writeTo`, introduced in Spark 3 as part of the `DataFrameWriterV2` API.
```python
from pyspark.sql import Row

# Create DataFrame
data = [
    Row(id=6, name="Mark"),
    Row(id=19, name="LAEG"),
    Row(id=50, name="Jim"),
    Row(id=71, name="Ian"),
    Row(id=12, name="APCJ"),
    Row(id=11, name="Peter"),
    Row(id=71, name="Eva"),
    Row(id=17, name="Jonny"),
    Row(id=40, name="Anthony"),
    Row(id=55, name="Petra"),
    Row(id=42, name="Nigel"),
    Row(id=48, name="BBC"),
]
df = spark.createDataFrame(data)

# Write DataFrame to ClickHouse
df.writeTo("clickhouse.default.glue_job_example").append()
```
Writing a Dataframe into a simple table in ClickHouse
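If you’d rather not rely on Spark’s schema inference, you can also build the DataFrame with an explicit schema so the column types line up with the `BIGINT` and `STRING` columns defined earlier. A short sketch:

```python
from pyspark.sql import Row
from pyspark.sql.types import LongType, StringType, StructField, StructType

# Explicit schema so the DataFrame types match the ClickHouse table definition
schema = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])

data = [Row(id=6, name="Mark"), Row(id=19, name="Luke")]
df = spark.createDataFrame(data, schema=schema)

df.writeTo("clickhouse.default.glue_job_example").append()
```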
Aggregating data sources before writing data to ClickHouse #
If we wanted to do some preprocessing and merge two sets of data to create a wide table within ClickHouse, we could use Spark to read from disparate data sources. Below is a pseudocode example of taking a DataFrame, reading additional data from S3, and creating aggregated summaries before ingesting into ClickHouse.
```python
from pyspark.sql.functions import avg, count, sum  # note: sum/count shadow the builtins

# Read from multiple sources
first_df = ...

# Maybe we need to read from another data source like S3
second_df = spark.read.parquet("s3://.../")

# We can join and create aggregated summaries before ingesting the data into ClickHouse
combined_df = (
    first_df
    .join(second_df, "column_to_join_on")
    .groupBy("column_to_group_by_1", "column_to_group_by_2")
    .agg(
        sum("column_to_sum").alias("sum_column_name"),
        count("column_to_count").alias("count_of_entries_column"),
        avg("column_to_average").alias("avg_column_name"),
    )
)

# Write the combined DataFrame to ClickHouse
combined_df.writeTo("clickhouse.database_name.table_name").append()
```
Combining dataframes before writing to ClickHouse
Reading data from ClickHouse #
You can read data from ClickHouse directly into a DataFrame using the `spark.sql` method. Here we’re just going to read back all the data we ingested in the previous step with a simple select query and log it to the output.
```python
# Read DataFrame from ClickHouse
df_read = spark.sql("""
    SELECT
        *
    FROM
        clickhouse.default.glue_job_example
""")

# Take the first 10 records from the dataframe
logger.info(str(df_read.take(10)))
```
Reading from ClickHouse into a Spark Dataframe
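Because the table is exposed through the catalog, you can also read it with the DataFrame API instead of raw SQL. A small sketch, assuming the same `glue_job_example` table; depending on connector support, simple filters may be pushed down to ClickHouse rather than evaluated in Spark:

```python
# Read via the DataFrame API and apply a simple filter and projection
df_filtered = (
    spark.read.table("clickhouse.default.glue_job_example")
    .filter("id > 40")
    .select("id", "name")
)

logger.info(str(df_filtered.take(10)))
```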
Full script for the example Glue job #
Copy and paste the whole script into your AWS Glue job. Remember to configure the connector, your environment, and variables as shown above!
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import Row

## @params: [JOB_NAME, CLICKHOUSE_HOST, CLICKHOUSE_PORT, CLICKHOUSE_USERNAME, CLICKHOUSE_PASSWORD]
args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'CLICKHOUSE_HOST',
    'CLICKHOUSE_PORT',
    'CLICKHOUSE_USERNAME',
    'CLICKHOUSE_PASSWORD',
])

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

logger.debug("setting spark conf")
spark.conf.set("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog")

# Protocol and SSL config for ClickHouse Cloud
spark.conf.set("spark.sql.catalog.clickhouse.protocol", "https")
spark.conf.set("spark.sql.catalog.clickhouse.option.ssl", "true")
spark.conf.set("spark.sql.catalog.clickhouse.option.ssl_mode", "NONE")

# Connection details
spark.conf.set("spark.sql.catalog.clickhouse.host", args["CLICKHOUSE_HOST"])
spark.conf.set("spark.sql.catalog.clickhouse.http_port", args["CLICKHOUSE_PORT"])
spark.conf.set("spark.sql.catalog.clickhouse.user", args["CLICKHOUSE_USERNAME"])
spark.conf.set("spark.sql.catalog.clickhouse.password", args["CLICKHOUSE_PASSWORD"])

# Suggestion: consider making the database name a parameter for use in different environments
spark.conf.set("spark.sql.catalog.clickhouse.database", "default")
# spark.clickhouse.write.format default value is arrow
spark.conf.set("spark.clickhouse.write.format", "json")
# spark.clickhouse.read.format default value is json
spark.conf.set("spark.clickhouse.read.format", "arrow")
logger.debug("spark conf set")

logger.debug("creating table in ClickHouse")
# use ClickHouse database
spark.sql(
    "USE clickhouse;"
)

# create table DDL in the database
spark.sql(
    """
    CREATE TABLE default.glue_job_example (
        id BIGINT NOT NULL COMMENT 'sort key',
        name STRING
    ) USING ClickHouse
    TBLPROPERTIES (
        engine = 'MergeTree()',
        order_by = 'id'
    );
    """
)
logger.debug("table created in ClickHouse")

# Create the DataFrame
data = [
    Row(id=6, name="Mark"),
    Row(id=19, name="Luke"),
    Row(id=50, name="Jim"),
    Row(id=71, name="Ian"),
]
df = spark.createDataFrame(data)

# Write DataFrame to ClickHouse
df.writeTo("clickhouse.default.glue_job_example").append()
logger.debug("data ingested into ClickHouse")

# Read DataFrame from ClickHouse
df_read = spark.sql("""
    SELECT
        *
    FROM
        clickhouse.default.glue_job_example
""")

logger.debug("data read from ClickHouse")
logger.debug(str(df_read.take(10)))

job.commit()
```
Taking your AWS Glue pipeline to production #
Of course, the above examples are just to get you started. As you develop your scripts in AWS Glue, it’s always good to think about how you build and manage your pipelines in a reusable, scalable way for production workloads. Here are some things to consider:
Tuning your Glue job #
Tuning the batch size of your data, and considering the types it contains, will be crucial for performance. As part of your `.writeTo` call, you can add options to tune `batchsize`, `socket_timeout`, and `connection_timeout`. I recommend going through all the Spark Connector configuration options and tuning the ones that make the most sense for your data shape and use case.
```python
# ClickHouse batch and connection settings
...
.option("batchsize", "100000")
.option("socket_timeout", "300000")
.option("connection_timeout", "10000")
...
```
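In context, those options sit on the same `writeTo` chain used earlier. A sketch of how that might look; the values here are illustrative starting points rather than recommendations for every workload:

```python
# Tune write behaviour per job rather than globally
(
    df.writeTo("clickhouse.default.glue_job_example")
    .option("batchsize", "100000")          # rows per insert batch
    .option("socket_timeout", "300000")     # milliseconds
    .option("connection_timeout", "10000")  # milliseconds
    .append()
)
```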
Using your Glue Job as part of a Glue Workflow #
How and when your Glue jobs run is crucial to your data pipeline. AWS Glue provides Glue Workflows, which let you configure triggers to kick off the execution of your Glue job. There are several ways you can trigger your job using Glue Workflows:
- Schedule
- On-demand (via Console, API, or CLI)
- EventBridge events
Below, we can set up a scheduled trigger in the Glue workflow console to run our job every day at 9 a.m.

Creating a scheduled trigger in Glue Workflows
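The same schedule can also be created programmatically. Below is a rough boto3 equivalent of the daily 9 a.m. trigger shown above; note that Glue cron expressions are evaluated in UTC, and the trigger and job names here are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Placeholder names; cron(0 9 * * ? *) fires every day at 09:00 UTC
glue.create_trigger(
    Name="clickhouse-daily-9am",
    Type="SCHEDULED",
    Schedule="cron(0 9 * * ? *)",
    Actions=[{"JobName": "clickhouse-glue-example"}],
    StartOnCreation=True,
)
```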

Editing job parameters for different environments
We can customize the job parameters so that a single job can populate different environments, writing different amounts of data into different tables. Follow the configuring job parameters section above and modify the script to add any additional parameters you need.

Adding parameters for your Glue Job within your Glue Workflow
Programmatically creating Glue Jobs with CloudFormation #
There are many neat features that help with managing and scheduling AWS Glue jobs. For example, you can create and manage version-controlled configurations by using AWS CloudFormation to define ETL objects like scripts or triggers.
Below is an example CloudFormation template that creates a scheduled trigger to run an AWS Glue job every 10 minutes, Monday to Friday.
```yaml
---
AWSTemplateFormatVersion: '2010-09-09'
# Sample CFN YAML to demonstrate creating a scheduled trigger
#
# Parameters section contains names that are substituted in the Resources section
# Parameters for CFN template
Parameters:
  # The existing Glue Job to be started by this trigger
  CFNJobName:
    Type: String
    Default:
  # The name of the trigger to be created
  CFNTriggerName:
    Type: String
    Default:
#
# Sample CFN YAML to demonstrate creating a scheduled trigger for a job
#
Resources:
  # Create trigger to run an existing job (CFNJobName) on a cron schedule.
  TriggerSample1CFN:
    Type: AWS::Glue::Trigger
    Properties:
      Name:
        Ref: CFNTriggerName
      Description: Trigger created with CloudFormation
      Type: SCHEDULED
      Actions:
        - JobName: !Ref CFNJobName
          # Arguments: JSON object
      # Run the trigger every 10 minutes on Monday to Friday
      Schedule: cron(0/10 * ? * MON-FRI *)
```
Example AWS CloudFormation template to schedule a Glue job
What's next for the ClickHouse Connector for AWS Glue? #
With the first release of the ClickHouse Connector for AWS Glue, we focused on the most common use case we see in production: batch write operations. Looking ahead, we plan to extend the functionality and enhance the user experience.
Before the end of the year, we’ll look to support AWS Glue’s no-code visual interface, making it even simpler to create and manage dataflows within Glue Studio.
The connector’s roadmap also includes adding more support for IAM roles with AWS Secrets Manager, along with Glue Data Catalog crawlers for automated discovery of data sources. If you’re interested in particular features being added, you can submit an integration request directly in ClickHouse Cloud.

Integrations tab within ClickHouse Cloud with “Request a new integration”
Ready to get started? #
If you’re excited about the potential of running Spark jobs inside AWS Glue, sign up for ClickHouse Cloud to get $300 of trial credits that you can use to create a database service on AWS. You can then subscribe to the “ClickHouse Connector for AWS Glue” for free by finding it on the AWS Marketplace.
If you have any feedback, feature requests, or issues relating to Apache Spark, you can create an issue in the ClickHouse Connector for Apache Spark repository. Don’t forget to share your experiments and use cases with others in our ClickHouse Community Slack!