dbpedia dataset
The dbpedia dataset contains 1 million articles from Wikipedia and their vector embeddings generated using the text-embedding-3-large model from OpenAI.
It is an excellent starter dataset for understanding vector embeddings, vector similarity search, and generative AI. We use it to demonstrate approximate nearest neighbor search in ClickHouse and a simple but powerful Q&A application.
Dataset details
The dataset contains 26 Parquet files located on huggingface.co. The files are named `0.parquet`, `1.parquet`, ..., `25.parquet`. To view some example rows of the dataset, please visit this Hugging Face page.
Create table
Create the `dbpedia` table to store the article id, title, text, and embedding vector:
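A minimal sketch of such a table definition, assuming 1536-dimensional embeddings and illustrative column names (the original listing may differ):

```sql
-- Sketch: column names and the 1536-dimensional embedding size are assumptions.
CREATE TABLE dbpedia
(
    id     String,
    title  String,
    text   String,
    vector Array(Float32) CODEC(NONE)
) ENGINE = MergeTree
ORDER BY id;
```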
Load table
To load the dataset from all Parquet files, run the following shell command:
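A sketch of such a loader script. It generates one `INSERT` statement per Parquet file and prints them; pipe the output to `clickhouse-client` to execute. The dataset URL and the Parquet column names are assumptions, so substitute the actual values from the Hugging Face page:

```shell
# Assumed dataset location on Hugging Face -- verify before running.
BASE_URL="https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/resolve/main"

# Emit one INSERT statement per file (0.parquet .. 25.parquet).
# The source column names (_id, title, text, embedding) are assumptions.
gen_inserts() {
  for i in $(seq 0 25); do
    echo "INSERT INTO dbpedia SELECT _id, title, text, \"text-embedding-3-large-1536-embedding\" AS vector FROM url('${BASE_URL}/${i}.parquet', 'Parquet');"
  done
}

# Usage:  sh load_dbpedia.sh | clickhouse-client --multiquery
gen_inserts
```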
Alternatively, individual SQL statements can be run as shown below to load each of the 26 Parquet files:
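For example, the first file could be loaded with a single statement like the following (the dataset URL and source column names are assumptions; check the Hugging Face page for the actual values):

```sql
INSERT INTO dbpedia
SELECT _id, title, text, "text-embedding-3-large-1536-embedding" AS vector
FROM url('https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M/resolve/main/0.parquet', 'Parquet');
```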
Verify that 1 million rows are seen in the `dbpedia` table:
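For instance:

```sql
SELECT count() FROM dbpedia;
-- expected result: 1000000
```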
Semantic search
Recommended reading: the "Vector embeddings" OpenAI guide
Semantic search (also referred to as similarity search) using vector embeddings involves the following steps:
- Accept a search query from a user in natural language, e.g. "Tell me about some scenic rail journeys", "Suspense novels set in Europe", etc.
- Generate an embedding vector for the search query using the same embedding model
- Find nearest neighbours to the search embedding vector in the dataset
The nearest neighbours are documents, images, or other content relevant to the user query. The retrieved results are the key input to Retrieval-Augmented Generation (RAG) in generative AI applications.
Run a brute-force vector similarity search
KNN (k-Nearest Neighbours) search, or brute-force search, involves calculating the distance of each vector in the dataset to the search embedding vector and then ordering the distances to find the nearest neighbours. With the `dbpedia` dataset, a quick way to observe semantic search in action is to use an embedding vector from the dataset itself as the search vector. For example:
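A sketch of such a query (the article title is an arbitrary example):

```sql
-- Use the embedding of an existing article as the search vector.
WITH (
    SELECT vector FROM dbpedia WHERE title = 'The Remains of the Day' LIMIT 1
) AS search_vector
SELECT title, cosineDistance(vector, search_vector) AS dist
FROM dbpedia
ORDER BY dist ASC
LIMIT 10;
```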
Note down the query latency so that we can compare it with the query latency of ANN search (using a vector index). Also record the query latency with a cold OS file cache and with `max_threads=1` to recognize the real compute and storage bandwidth usage (and extrapolate it to a production dataset with millions of vectors!)
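One way to take that measurement (`SYSTEM DROP FILESYSTEM CACHE` requires sufficient privileges; the article title is an arbitrary example):

```sql
-- Evict cached data so the query reads from storage.
SYSTEM DROP FILESYSTEM CACHE;

WITH (
    SELECT vector FROM dbpedia WHERE title = 'The Remains of the Day' LIMIT 1
) AS search_vector
SELECT title, cosineDistance(vector, search_vector) AS dist
FROM dbpedia
ORDER BY dist ASC
LIMIT 10
SETTINGS max_threads = 1;
```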
Build a vector similarity index
Run the following SQL to define and build a vector similarity index on the `vector` column:
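A sketch of the index definition; the exact `vector_similarity` parameters depend on the ClickHouse version, so consult the documentation referenced below:

```sql
-- HNSW index using cosine distance over the 1536-dimensional embeddings.
ALTER TABLE dbpedia
    ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 1536);

-- Build the index for the already-loaded data.
ALTER TABLE dbpedia MATERIALIZE INDEX vector_index;
```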
The parameters and performance considerations for index creation and search are described in the documentation.
Building and saving the index can take a few minutes, depending on the number of available CPU cores and the storage bandwidth.
Perform ANN search
Approximate Nearest Neighbours (ANN) refers to a group of techniques (e.g., special data structures like graphs and random forests) which compute results much faster than exact vector search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between result accuracy and search time.
Once the vector similarity index has been built, vector search queries will automatically use the index:
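For example, the same query shape used for brute-force search is now served by the index (the article title is an arbitrary example):

```sql
WITH (
    SELECT vector FROM dbpedia WHERE title = 'The Remains of the Day' LIMIT 1
) AS search_vector
SELECT title, cosineDistance(vector, search_vector) AS dist
FROM dbpedia
ORDER BY dist ASC
LIMIT 10;

-- Prefixing the query with EXPLAIN indexes = 1 shows whether
-- the vector similarity index was picked up.
```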
Generating embeddings for a search query
The similarity search queries seen so far use one of the existing vectors in the `dbpedia` table as the search vector. In real-world applications, the search vector has to be generated for a user query, which could be in natural language. The search vector should be generated using the same LLM model that was used to generate the embedding vectors for the dataset.
An example Python script is listed below to demonstrate how to programmatically call the OpenAI API to generate embedding vectors using the `text-embedding-3-large` model. The search embedding vector is then passed as an argument to the `cosineDistance()` function in the `SELECT` query.

Running the script requires an OpenAI API key to be set in the environment variable `OPENAI_API_KEY`. The OpenAI API key can be obtained after registering at https://platform.openai.com.
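A sketch of such a script, assuming the `openai` (v1) and `clickhouse-connect` packages, a local ClickHouse server, and 1536-dimensional embeddings; helper and variable names are illustrative:

```python
import json


def build_search_query(embedding, top_k=10):
    """Build the ClickHouse SELECT that ranks articles by cosine
    distance to the given embedding (table/column names as in this guide)."""
    vec = json.dumps(embedding)  # renders as a ClickHouse array literal [0.1, 0.2, ...]
    return (
        f"SELECT title, cosineDistance(vector, {vec}) AS dist "
        f"FROM dbpedia ORDER BY dist ASC LIMIT {top_k}"
    )


def main():
    # Lazy imports: 'openai' and 'clickhouse_connect' are third-party packages.
    import os
    from openai import OpenAI
    import clickhouse_connect

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    query_text = input("Enter a search phrase: ")

    # Generate the search embedding with the same model used for the dataset.
    resp = client.embeddings.create(
        model="text-embedding-3-large", input=query_text, dimensions=1536
    )
    embedding = resp.data[0].embedding

    # Run the brute-force / ANN search in ClickHouse (defaults to localhost:8123).
    ch = clickhouse_connect.get_client()
    for title, dist in ch.query(build_search_query(embedding)).result_rows:
        print(f"{dist:.4f}  {title}")


if __name__ == "__main__":
    main()
```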
Q&A demo application
The examples above demonstrated semantic search and document retrieval using ClickHouse. Next, a simple but powerful generative AI example application is presented.
The application performs the following steps:
- Accepts a topic as input from the user
- Generates an embedding vector for the topic by invoking the OpenAI API with the `text-embedding-3-large` model
- Retrieves highly relevant Wikipedia articles/documents using vector similarity search on the `dbpedia` table
- Accepts a free-form question in natural language from the user relating to the topic
- Uses the OpenAI `gpt-3.5-turbo` Chat API to answer the question based on the knowledge in the documents retrieved in step #3. The retrieved documents are passed as context to the Chat API and are the key link in generative AI.
A couple of example conversations from running the Q&A application are listed first, followed by the code for the application. Running the application requires an OpenAI API key to be set in the environment variable `OPENAI_API_KEY`. The OpenAI API key can be obtained after registering at https://platform.openai.com.
Code:
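The original listing is not reproduced here; the sketch below follows the steps described above, assuming the `openai` (v1) and `clickhouse-connect` packages and a local ClickHouse server. Helper names and the retrieval limit are illustrative:

```python
def build_prompt(question, documents):
    """Assemble chat messages that ground the answer in the retrieved documents."""
    context = "\n\n".join(documents)
    return [
        {
            "role": "system",
            "content": "Answer the question using only the provided context.\n\n"
                       f"Context:\n{context}",
        },
        {"role": "user", "content": question},
    ]


def main():
    # Lazy imports: 'openai' and 'clickhouse_connect' are third-party packages.
    import os
    from openai import OpenAI
    import clickhouse_connect

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    ch = clickhouse_connect.get_client()  # defaults to localhost:8123

    # Step 1-2: embed the user's topic with the dataset's embedding model.
    topic = input("Enter a topic: ")
    emb = client.embeddings.create(
        model="text-embedding-3-large", input=topic, dimensions=1536
    ).data[0].embedding

    # Step 3: retrieve the most relevant articles via vector similarity search.
    rows = ch.query(
        f"SELECT text FROM dbpedia ORDER BY cosineDistance(vector, {emb}) LIMIT 5"
    ).result_rows
    documents = [r[0] for r in rows]

    # Steps 4-5: answer a free-form question using the retrieved documents as context.
    question = input("Enter your question about the topic: ")
    answer = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=build_prompt(question, documents)
    )
    print(answer.choices[0].message.content)


if __name__ == "__main__":
    main()
```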