Integrating Apache Spark with ClickHouse
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
There are two main ways to connect Apache Spark and ClickHouse:
- Spark Connector - The Spark connector implements the DataSourceV2 API and has its own catalog management. As of today, this is the recommended way to integrate ClickHouse and Spark.
- Spark JDBC - Integrate Spark and ClickHouse using a JDBC data source.
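As an illustration of the connector's catalog-based approach, the following configuration fragment registers ClickHouse as a Spark catalog. The property names follow the connector's `spark.sql.catalog.*` convention; the host, port, and credential values below are placeholders, and the catalog implementation class can vary between connector releases, so check the documentation for the version you use.

```properties
# spark-defaults.conf — illustrative values, adjust for your deployment
spark.sql.catalog.clickhouse                com.clickhouse.spark.ClickHouseCatalog
spark.sql.catalog.clickhouse.host           127.0.0.1
spark.sql.catalog.clickhouse.protocol       http
spark.sql.catalog.clickhouse.http_port      8123
spark.sql.catalog.clickhouse.user           default
spark.sql.catalog.clickhouse.password       ""
spark.sql.catalog.clickhouse.database       default
```

With the catalog registered, ClickHouse tables can be addressed from Spark SQL as `clickhouse.<database>.<table>`, for example `SELECT * FROM clickhouse.default.my_table`.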
Both solutions have been successfully tested and are fully compatible with the Java, Scala, PySpark, and Spark SQL APIs.
Spark Runtime Environments
Standard Spark runtimes
The Spark Connector works out of the box on environments that closely follow the upstream Apache Spark runtime, such as Amazon EMR or Kubernetes-based Spark deployments.
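On such a runtime, the connector can typically be pulled in at submit time via Maven coordinates, with the catalog configured on the command line. This is a launch-configuration sketch: the artifact coordinates shown (a Spark 3.5 / Scala 2.12 build) and the hostname are examples, and the exact artifact names and versions depend on the connector release you target.

```shell
spark-submit \
  --packages com.clickhouse.spark:clickhouse-spark-runtime-3.5_2.12:0.8.0 \
  --conf spark.sql.catalog.clickhouse=com.clickhouse.spark.ClickHouseCatalog \
  --conf spark.sql.catalog.clickhouse.host=clickhouse.internal \
  --conf spark.sql.catalog.clickhouse.protocol=http \
  --conf spark.sql.catalog.clickhouse.http_port=8123 \
  my_spark_job.py
```

The same `--packages` and `--conf` options apply whether the job is submitted to EMR or to a Spark-on-Kubernetes cluster, which is why these runtimes need no extra setup beyond the standard Spark configuration mechanism.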
Managed Spark platforms
Platforms such as AWS Glue and Databricks introduce additional abstractions and environment-specific behavior. While the core integration remains the same, they may require dedicated configuration and setup steps. See the respective documentation pages for details.