Skip to main content

· One min read

In order to build and contribute to ClickHouse, you must use LLVM and Clang.

These are the commands to build the latest version of LLVM and Clang on Linux:

git clone [email protected]:llvm/llvm-project.git
mkdir llvm-build
cd llvm-build
cmake -GNinja -DCMAKE_BUILD_TYPE:STRING=Release -DLLVM_ENABLE_PROJECTS=all -DLLVM_TARGETS_TO_BUILD=all ../llvm-project/llvm
time ninja
sudo ninja install

· 3 min read

This is useful when there are tables that have similar naming conventions and similar columns but are not replicated. An example is searching the system database for entries in the query log tables.

The query_log table is not replicated, and only queries that are executed on a specific node get logged. Data may also roll to a different table For example, data may be inserted into query_log_0, query_log_1, etc. Since one node may roll at a different time than others, it is useful to try to find the data we're looking for in tables that are not exactly named the same.

In essence, we need to do something like this, but in ClickHouse syntax:

SELECT column1, column2 FROM my_db.my_table_*

For this, we can use the clusterAllReplicas() to search all the nodes and the merge() table function to be able to use a regex pattern to search the multiple tables.

The following example shows how to query all tables with the prefix query_log:

clickhouse-cloud :) SELECT 
`event_time`,
`query_id`,
`query`,
`type`
FROM
clusterAllReplicas(default,merge('system', '^query_log*'))
WHERE
query ilike '%db1.table1%' and event_time > now() - toIntervalMinute(5);

SELECT
event_time,
query_id,
query,
type
FROM clusterAllReplicas(default, merge('system', '^query_log*'))
WHERE (query ILIKE '%db1.table1%') AND (event_time > (now() - toIntervalMinute(5)))

Query id: de95c13e-5759-436e-90d9-a12c1327889e

┌──────────event_time─┬─query_id─────────────────────────────┬─query──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─type────────┐
│ 2024-02-08 00:15:20 │ d1dd0d6a-4337-4e58-bdd1-c2c827b6dfe2 │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryStart │
│ 2024-02-08 00:15:20 │ d1dd0d6a-4337-4e58-bdd1-c2c827b6dfe2 │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryFinish │
└─────────────────────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘
┌──────────event_time─┬─query_id─────────────────────────────┬─query──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─type────────┐
│ 2024-02-08 00:15:20 │ f0ca43b2-544e-4b94-a21d-0f05e777fa96 │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryStart │
│ 2024-02-08 00:15:20 │ f0ca43b2-544e-4b94-a21d-0f05e777fa96 │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryFinish │
└─────────────────────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘
┌──────────event_time─┬─query_id─────────────────────────────┬─query──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─type────────┐
│ 2024-02-08 00:15:20 │ 5cc0a508-7f64-460b-a5be-949ef1d1f2ca │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryStart │
│ 2024-02-08 00:15:20 │ 5cc0a508-7f64-460b-a5be-949ef1d1f2ca │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryFinish │
│ 2024-02-08 00:15:20 │ d1e01cb0-a27c-44b2-829c-90fb2596c9c0 │ create table db1.table1
(
id Int32,
string_column String
)
engine = MergeTree
order by id │ QueryStart │
│ 2024-02-08 00:15:20 │ d1e01cb0-a27c-44b2-829c-90fb2596c9c0 │ create table db1.table1
(
id Int32,
string_column String
)
engine = MergeTree
order by id │ QueryFinish │
│ 2024-02-08 00:15:27 │ 6c2c6c3f-173e-464f-bfa0-643089ca085e │ insert into db1.table1
values
│ QueryStart │
│ 2024-02-08 00:15:27 │ 6c2c6c3f-173e-464f-bfa0-643089ca085e │ insert into db1.table1
values
│ QueryFinish │
└─────────────────────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘

10 rows in set. Elapsed: 0.046 sec. Processed 317.27 thousand rows, 33.57 MB (6.89 million rows/s., 729.43 MB/s.)
Peak memory usage: 67.04 MiB.

Note that the columns you select must exist on each of the tables being queried or you may encounter an error such as:

Received exception from server (version 24.0.2):
Code: 47. DB::Exception: Received from abc123.us-west-2.aws.clickhouse.cloud:9440. DB::Exception: Missing columns: 'hostname' while processing query: 'WITH 'query_log_0' AS _table

· 2 min read

Dictionaries created in ClickHouse Cloud may experience inconsistency during the initial creation phase. This means that you may not see any data in the dictionary right after creation. However, after several retries, the creation query may land on different replicas, and data will be visible.

This sometimes occurs because the dictionary was created before the part reached the server. As an example:

2024-01-25 13:38:25.615837 - CREATE DICTIONARY received
2024-01-25 13:38:25.626468 - CREATE DICTIONARY finished
2024-01-25 13:38:25.733008 - Part all_0_0_0 downloaded

As you can see, the part only arrived after the dictionary was created. This can be a bigger problem if you are using LIFETIME(MIN 0 MAX 0) because this means that dictionary will never be refreshed automatically. Therefore, the dictionary will remain empty until the command RELOAD DICTIONARIES is executed.

The solution to this issue is to use a SELECT query instead of specifying a source table when creating the dictionary and enabling the setting select_sequential_consistency=1.

Instead of specifying a source table:

SOURCE(CLICKHOUSE(
table 'test.temp_title_table_1706189903924'
user default password 'PASSWORD'))

Use a SELECT query with select_sequential_consistency=1:

SOURCE(CLICKHOUSE(QUERY
'SELECT songTitle, mappedTitle
FROM test.temp_title_table_1706189903924
SETTINGS select_sequential_consistency=1' USER default PASSWORD ''))

Why does this issue occur?

When you insert data and then create or reload a dictionary, the DDL may reach a replica before the data (or new data) does. This leads to the dictionaries being inconsistent between replicas. Then, depending on which replica receives the query, you may get different results.

Note that the same thing happens when you insert and immediately after read from a table. If you read from a replica that hasn't replicated the data yet, you won't see the newly inserted data. When you need sequential consistency, at the cost of performance (which is why it's generally not recommended to use) you can enable select_sequential_consistency.

The case of dictionaries is a bit trickier since dictionaries don't use the settings from the query, but the settings from the server. As a result, when loading data into the dictionary, even if you SET select_sequential_consistency=1 data may load inconsistently across replicas. Specifying select_sequential_consistency=1 in the dictionary source query allows the dictionary to adhere to this setting even if it's not globally enabled as a server setting.

· 3 min read

Question

How can I backup a specific partition in ClickHouse?

Answer

See the below example, this uses the S3(Minio) disk configuration listed in our docker compose examples page.

note

This does NOT apply to ClickHouse Cloud

Create a table:

ch_minio_s3 :) CREATE TABLE my_table
(
`event_time` DateTime,
`field_foo` String,
`field_bar` String,
`number` UInt256
)
ENGINE = MergeTree
PARTITION BY number % 2
ORDER BY tuple()

CREATE TABLE my_table
(
`event_time` DateTime,
`field_foo` String,
`field_bar` String,
`number` UInt256
)
ENGINE = MergeTree
PARTITION BY number % 2
ORDER BY tuple()

Query id: a1a54a5a-eac0-477c-b847-b40acaa62780

Ok.

0 rows in set. Elapsed: 0.016 sec.

Add some data that will fill both partitions equally:

ch_minio_s3 :) INSERT INTO my_table SELECT
toDateTime(now() + number) AS event_time,
randomPrintableASCII(10) AS field_foo,
randomPrintableASCII(20) AS field_bar,
number
FROM numbers(1000000)

INSERT INTO my_table SELECT
toDateTime(now() + number) AS event_time,
randomPrintableASCII(10) AS field_foo,
randomPrintableASCII(20) AS field_bar,
number
FROM numbers(1000000)

Query id: bf6ef803-5747-4ea1-ad00-a17967e349b6

Ok.

0 rows in set. Elapsed: 0.282 sec. Processed 1.00 million rows, 8.00 MB (3.55 million rows/s., 28.39 MB/s.)

verify data:

ch_minio_s3 :) SELECT
_partition_id AS partition_id,
cityHash64(sum(number)) AS hash,
count() AS count
FROM my_table
GROUP BY partition_id

SELECT
_partition_id AS partition_id,
cityHash64(sum(number)) AS hash,
count() AS count
FROM my_table
GROUP BY partition_id

Query id: d8febfb0-5339-4f97-aefa-ef0003128526

┌─partition_id─┬─cityHash64(sum(number))─┬──count─┐
015460940821314360342500000
111827822647069388611500000
└──────────────┴─────────────────────────┴────────┘

2 rows in set. Elapsed: 0.025 sec. Processed 1.00 million rows, 32.00 MB (39.97 million rows/s., 1.28 GB/s.)

backup partition with id 1 to configured s3 disk:

ch_minio_s3 :) BACKUP TABLE my_table PARTITION 1 TO Disk('s3','backups/');

BACKUP TABLE my_table PARTITION 1 TO Disk('s3', 'backups/')

Query id: 810f6144-e282-42e2-99d0-9a80c75a927d

┌─id───────────────────────────────────┬─status─────────┐
4d1da197-c4c9-4b6e-966c-76202eadbd53 │ BACKUP_CREATED │
└──────────────────────────────────────┴────────────────┘

1 row in set. Elapsed: 0.095 sec.

Drop the table:

ch_minio_s3 :) DROP TABLE my_table

DROP TABLE my_table

Query id: c3456044-4689-406e-82ac-8d08b8b618fe

Ok.

0 rows in set. Elapsed: 0.007 sec.

restore just partition with id 1 from backup:

ch_minio_s3 :) RESTORE TABLE my_table PARTITION 1 FROM Disk('s3','backups/');

RESTORE TABLE my_table PARTITION 1 FROM Disk('s3', 'backups/')

Query id: ea306c73-83c5-479f-9c0c-391594facc69

┌─id───────────────────────────────────┬─status───┐
│ ec6841a8-0607-465e-bc4d-d446f960d40a │ RESTORED │
└──────────────────────────────────────┴──────────┘

1 row in set. Elapsed: 0.065 sec.

validate the restored data:

ch_minio_s3 :) SELECT
_partition_id AS partition_id,
cityHash64(sum(number)) AS hash,
count() AS count
FROM my_table
GROUP BY partition_id

SELECT
_partition_id AS partition_id,
cityHash64(sum(number)) AS hash,
count() AS count
FROM my_table
GROUP BY partition_id

Query id: a916176d-6a6e-47fc-ba7d-79bb33b152d8

┌─partition_id─┬─────────────────hash─┬──count─┐
111827822647069388611500000
└──────────────┴──────────────────────┴────────┘

1 row in set. Elapsed: 0.012 sec. Processed 500.00 thousand rows, 16.00 MB (41.00 million rows/s., 1.31 GB/s.)

· 2 min read

Question

I'm writing data into ClickHouse cloud and need to be able ,when reading data, to guarantee that I'm getting the latest complete information.

Answer

Talking to same node

If you are using native protocol, or a session to do your write/read, you should then be connected to the same replica: in this scenario you're reading directly from the node where you're writing, then your read will always be consistent.

Talking to a random node

If you can't guarantee you're talking to the same node (for example talking to the node via HTTPS calls which get shuffled via the a load load balancer), you can either:

A)

  1. write your data
  2. connect to a new replica
  3. run SYSTEM SYNC REPLICA db.table_name LIGHTWEIGHT
  4. read the latest data

See SYSTEM commands reference

OR

B) read anytime with sequential consistency

SELECT 
...
SETTINGS select_sequential_consistency = 1

note when using ClickHouse Cloud and its default SharedMergeTree Engine, using insert_quorum_parallel is not required (it's a given)

Using SYSTEM SYNC REPLICAS or select_sequential_consistency will increase the load on ClickHouse Keeper and might have slower performance depending on the load on the service.

The recommended approach is to do the writes/read using the same session or the native protocol (sticky connection).

· 2 min read

If you install ClickHouse using brew, you may encounter an error from MacOS. By default, MacOS will not run applications or tools created by a developer who cannot be verified. When attempting to run any clickhouse command, you may see this error:

MacOS showing a developer verification error.

To get around this verification error, you need to remove the app from MacOS' quarintine bin either by finding the appropriate setting in your System Settings window, or using the terminal.

System settings process

The easiest way to remove the clickhouse executable from the quarintine bin is to:

  1. Open System settings.

  2. Navigate to Privacy & Security:

  3. Scroll to the bottom of the window to find a message saying _"clickhouse-macos-aarch64" was blocked from use because it is not from an identified developer".

  4. Click Allow Anyway.

  5. Enter your MacOS user password.

You should now be able to run clickhouse commands in your terminal.

Terminal process

You can perform this process using the command-line:

First find out where Homebrew installed the clickhouse executable:

which clickhouse

This should output something like:

/opt/homebrew/bin/clickhouse

Remove clickhouse from the quarantine bin by running xattr -d com.apple.quarantine following by the path from the previous command:

xattr -d com.apple.quarantine /opt/homebrew/bin/clickhouse

You should now be able to run the clickhouse executable:

clickhouse

This should output something like:

Use one of the following commands:
clickhouse local [args]
clickhouse client [args]
clickhouse benchmark [args]
...

· 2 min read

Here is a basic code snippet filemain.ts.

Package.json (place it under ./):

{
"name": "a simple clickhouse client example",
"version": "1.0.0",
"main": "main.js",
"license": "MIT",
"devDependencies": {
"typescript": "^5.3.2"
},
"dependencies": {
"@clickhouse/client": "^0.2.6"
}
}

Main.ts (place it under ./src):

import { ClickHouseClient, createClient } from '@clickhouse/client'; // or '@clickhouse/client-web'

interface ClickHouseResultSet<T> {
meta: Meta[];
data: T[];
rows: number;
statistics: Statistics;
}

interface Statistics {
elapsed: number;
rows_read: number;
bytes_read: number;
}

interface Meta {
name: string;
type: string;
}

interface Count {
c: number;
}

//Please replace client connection parameters like`host`
//`username`, `passowrd`, `database` as needed.

const initClickHouseClient = async (): Promise<ClickHouseClient> => {
const client = createClient({
host: 'https://FQDN.aws.clickhouse.cloud',
username: 'default',
password: 'password',
database: 'default',
application: `pingpong`,
});

console.log('ClickHouse ping');
if (!(await client.ping())) {
throw new Error('failed to ping clickhouse!');
}
console.log('ClickHouse pong!');
return client;
};

const main = async () => {
console.log('Initialising clickhouse client');
const client = await initClickHouseClient();

const row = await client.query({
query: `SELECT count() AS c FROM system.tables WHERE database='system'`,
});

const jsonRow: ClickHouseResultSet<Count> = await row.json();

console.log(`I have found ${jsonRow.data[0].c} system tables!`);

await client.close();
console.log(`👋`);
};

main();

To install the packages, run yarn from ./:

$ yarn
yarn install v1.22.19
[1/4] 🔍 Resolving packages...
[2/4] 🚚 Fetching packages...
[3/4] 🔗 Linking dependencies...
[4/4] 🔨 Building fresh packages...
✨ Done in 0.14s.

execute the code in main.ts from ./ with:

$ npx ts-node src/main.ts

will output:

Initialising clickhouse client
ClickHouse ping
ClickHouse pong!
I have found 120 system tables!
👋

· 2 min read

Background

If you don't like how clickhouse client displays the prompt in your terminal window, it's possible to change it by creating a single XML file. This article explains how to change the prompt to whatever you want.

The default prompt is your local computer name followed by :) :

However, you can edit the prompt to be whatever you want:

Steps

To edit the prompt, follow these steps:

  1. Find where you clickhouse executable is stored, and create a file call custom-config.xml in the same directory:

    ./
    ├── clickhouse
    ├── custom-config.xml
    ...
    ├── user_scripts
    └── uuid
  2. Inside custom-config.xml paste the following code:

    <?xml version="1.0" ?>
    <clickhouse>
    <prompt_by_server_display_name>
    <default>CUSTOM_PROMPT_HERE</default>
    </prompt_by_server_display_name>
    </clickhouse>
  3. Replace CUSTOM_PROMPT_HERE with whatever you want your prompt to say. You must keep the prompt to a single line between the opening and closing <default> tags:

    <?xml version="1.0" ?>
    <clickhouse>
    <prompt_by_server_display_name>
    <default>local_clickhouse_client $> </default>
    </prompt_by_server_display_name>
    </clickhouse>
  4. Save the custom-config.xml file.

  5. Start the Clickhouse server if it isn't already running:

    ./clickhouse server
  6. In a new terminal window, start the Clickhouse client with the --config-file=custom-config.xml argument:

    ./clickhouse client --config-file="custom-config.xml"
  7. The Clickhouse client should open and display your custom prompt:

· One min read

How do I check my ClickHouse Cloud Service state? I want to check if the Service is stopped, idle, or running, but I don't want to wake the Service up in doing so.

Answer

The ClickHouse Cloud API is great for checking the status of a cloud service. You need to create an API Key in your service before you can use the Cloud API. You can do this in ClickHouse Cloud clickhouse.cloud:

  1. To check the status of a service, run the following. Make sure to replace Key-ID and Key-Secret with your respective details:

    curl --user '[Key-ID]:[Key-Secret]' https://api.clickhouse.cloud/v1/organizations/[Org-ID]/services/[Service-ID]

    This will output something like:

    result":{"id":"[Service-ID]","name":"[Service-Name]","provider":"aws","region":"us-east-1","state":"**idle**","endpoints":[{"protocol":"nativesecure","host":"[Connect-URL]","port":9440},{"protocol":"https","host":"[Connect-URL]","port":8443}],"tier":"development","idleScaling":true,"idleTimeoutMinutes":15,"ipAccessList":[{"source":"[my-IP]","description":"[my-IP-name]"}],"createdAt":"2023-04-13T23:47:47Z"},"status":200}
  2. You can use the JQ utility to extract the state key:

    curl --user '[Key-ID]:[Key-Secret]' https://api.clickhouse.cloud/v1/organizations/[Org-ID]/services/[Service-ID] | jq '.state'

    This will output something like:

    **idle**
  3. Running the same command against an actively running service will output:

    **running**

· 2 min read

Yes, ClickHouse can perform vector search. The main advantages of using ClickHouse for vector search compared to using more specialized vector databases include:

  • Using ClickHouse's filtering and full-text search capabilities to refine your dataset before performing a search.
  • Performing analytics on your datasets.
  • Running a JOIN against your existing data.
  • No need to manage yet another database and complicate your infrastructure.

Here is a quick tutorial on how to use ClickHouse for vector search.

1. Create embeddings

Your data (documents, images, or structured data) must be converted to embeddings. We recommend creating embeddings using the OpenAI Embeddings API or using the open-source Python library SentenceTransformers.

You can think of an embedding as a large array of floating-point numbers that represent your data. Check out this guide from OpenAI to learn more about embeddings.

2. Store the embeddings

Once you have generated embeddings, you need to store them in ClickHouse. Each embedding should be stored in a separate row and can include metadata for filtering, aggregations, or analytics. Here's an example of a table that can store images with captions:

CREATE TABLE images
(
`_file` LowCardinality(String),
`caption` String,
`image_embedding` Array(Float32)
)
ENGINE = MergeTree;

Let's say you want to search for pictures of dogs in your dataset. You can use a distance function like cosineDistance to take an embedding of a dog image and search for related images:

SELECT
_file,
caption,
cosineDistance(
-- An embedding of your "input" dog picture
[0.5736801028251648, 0.2516217529773712, ..., -0.6825592517852783],
image_embedding
) AS score
FROM images
ORDER BY score ASC
LIMIT 10

This query returns the _file names and caption of the top 10 images most likely to be related to your provided dog image.

Further Reading

To follow a more in-depth tutorial on vector search using ClickHouse, please see: