Yes, ClickHouse can perform vector search. The main advantages of using ClickHouse for vector search compared to using more specialized vector databases include:
- Using ClickHouse's filtering and full-text search capabilities to refine your dataset before performing a search.
- Performing analytics on your datasets.
- Running a
JOINagainst your existing data.
- No need to manage yet another database and complicate your infrastructure.
Here is a quick tutorial on how to use ClickHouse for vector search.
1. Create embeddings
Your data (documents, images, or structured data) must be converted to embeddings. We recommend creating embeddings using the OpenAI Embeddings API or using the open-source Python library SentenceTransformers.
You can think of an embedding as a large array of floating-point numbers that represent your data. Check out this guide from OpenAI to learn more about embeddings.
2. Store the embeddings
Once you have generated embeddings, you need to store them in ClickHouse. Each embedding should be stored in a separate row and can include metadata for filtering, aggregations, or analytics. Here's an example of a table that can store images with captions:
CREATE TABLE images
ENGINE = MergeTree;
3. Search for related embeddings
Let's say you want to search for pictures of dogs in your dataset. You can use a distance function like
cosineDistance to take an embedding of a dog image and search for related images:
-- An embedding of your "input" dog picture
[0.5736801028251648, 0.2516217529773712, ..., -0.6825592517852783],
) AS score
ORDER BY score ASC
This query returns the
_file names and
caption of the top 10 images most likely to be related to your provided dog image.
To follow a more in-depth tutorial on vector search using ClickHouse, please see: