Skip to main content

Can you use ClickHouse for vector search?

· 2 min read

Yes, ClickHouse can perform vector search. The main advantages of using ClickHouse for vector search compared to using more specialized vector databases include:

  • Using ClickHouse's filtering and full-text search capabilities to refine your dataset before performing a search.
  • Performing analytics on your datasets.
  • Running a JOIN against your existing data.
  • No need to manage yet another database and complicate your infrastructure.

Here is a quick tutorial on how to use ClickHouse for vector search.

1. Create embeddings

Your data (documents, images, or structured data) must be converted to embeddings. We recommend creating embeddings using the OpenAI Embeddings API or using the open-source Python library SentenceTransformers.

You can think of an embedding as a large array of floating-point numbers that represent your data. Check out this guide from OpenAI to learn more about embeddings.

2. Store the embeddings

Once you have generated embeddings, you need to store them in ClickHouse. Each embedding should be stored in a separate row and can include metadata for filtering, aggregations, or analytics. Here's an example of a table that can store images with captions:

CREATE TABLE images
(
`_file` LowCardinality(String),
`caption` String,
`image_embedding` Array(Float32)
)
ENGINE = MergeTree;

Let's say you want to search for pictures of dogs in your dataset. You can use a distance function like cosineDistance to take an embedding of a dog image and search for related images:

SELECT
_file,
caption,
cosineDistance(
-- An embedding of your "input" dog picture
[0.5736801028251648, 0.2516217529773712, ..., -0.6825592517852783],
image_embedding
) AS score
FROM images
ORDER BY score ASC
LIMIT 10

This query returns the _file names and caption of the top 10 images most likely to be related to your provided dog image.

Further Reading

To follow a more in-depth tutorial on vector search using ClickHouse, please see: