This guide is compatible with ClickHouse Cloud and for self-hosted ClickHouse v23.5+.
This guide shows how to migrate data from BigQuery to ClickHouse.
How long will exporting data to ClickHouse take?
Exporting data from BigQuery to ClickHouse is dependent on the size of your dataset. As a comparison, it takes about an hour to export the 4TB public Ethereum dataset from BigQuery to ClickHouse using this guide.
|Table||Rows||Files Exported||Data Size||BigQuery Export||Slot Time||ClickHouse Import|
|blocks||16,569,489||73||14.53GB||23 secs||37 min||15.4 secs|
|transactions||1,864,514,414||5169||957GB||1 min 38 sec||1 day 8hrs||18 mins 5 secs|
|traces||6,325,819,306||17,985||2.896TB||5 min 46 sec||5 days 19 hr||34 mins 55 secs|
|contracts||57,225,837||350||45.35GB||16 sec||1 hr 51 min||39.4 secs|
|Total||8.26 billion||23,577||3.982TB||8 min 3 sec||> 6 days 5 hrs||53 mins 45 secs|
1. Export table data to GCS
DECLARE export_path STRING;
DECLARE n INT64;
DECLARE i INT64;
SET i = 0;
-- We recommend setting n to correspond to x billion rows. So 5 billion rows, n = 5
SET n = 100;
WHILE i < n DO
SET export_path = CONCAT('gs://mybucket/mytable/', i,'-*.parquet');
uri = export_path,
format = 'PARQUET',
overwrite = true
SELECT * FROM mytable WHERE export_id = i
SET i = i + 1;
In the above query, we export our BigQuery table to the Parquet data format. We also have a
* character in our
uri parameter. This ensures the output is sharded into multiple files, with a numerically increasing suffix, should the export exceed 1GB of data.
This approach has a number of advantages:
- Google allows up to 50TB per day to be exported to GCS for free. Users only pay for GCS storage.
- Exports produce multiple files automatically, limiting each to a maximum of 1GB of table data. This is beneficial to ClickHouse since it allows imports to be parallelized.
- Parquet, as a column-oriented format, represents a better interchange format since it is inherently compressed and faster for BigQuery to export and ClickHouse to query
2. Importing data into ClickHouse from GCS
You must first create your table in ClickHouse:
-- If your BigQuery table contains a column of type STRUCT, you must enable this setting
-- to map that column to a ClickHouse column of type Nested
SET input_format_parquet_import_nested = 1;
CREATE TABLE default.mytable
ENGINE = MergeTree
ORDER BY (timestamp);
After creating the table, enable the setting
parallel_distributed_insert_select if you have multiple ClickHouse replicas in your cluster to speed up our export. If you only have one ClickHouse node, you can skip this step:
SET parallel_distributed_insert_select = 1;
Finally, we can insert the data from GCS into our ClickHouse table using the
INSERT INTO SELECT command, which inserts data into a table based on the results from a
To retrieve the data to
INSERT, we can use the s3Cluster function to retrieve data from our GCS bucket since GCS is interoperable with Amazon S3. If you only have one ClickHouse node, you can use the s3 table function instead of the
INSERT INTO mytable
ifNull(some_text, '') as some_text
SECRET used in the above query is your HMAC key associated with your GCS bucket.
ifNullwhen exporting nullable columns
In the above query, we use the
ifNull function with the
some_text column to insert data into our ClickHouse table with a default value. You can also make your columns in ClickHouse
Nullable, but this is not recommended as it may affect negatively performance.
Alternatively, you can
SET input_format_null_as_default=1 and any missing or NULL values will be replaced by default values for their respective columns, if those defaults are specified.
3. Testing successful data export
To test whether your data was properly inserted, simply run a
SELECT query on your new table:
SELECT * FROM mytable limit 10;
To export more BigQuery tables, simply redo the steps above for each additional table.
Further reading and support
In addition to this guide, we also recommend reading our blog post that shows how to use ClickHouse to speed up BigQuery and how to handle incremental imports.
If you are having issues transferring data from BigQuery to ClickHouse, please feel free to contact us at [email protected].