Date: 2017-02-14

LAION 5B dataset

The LAION 5B dataset contains 5.85 billion image-text embeddings and associated image metadata. The embeddings were generated using the OpenAI CLIP model ViT-L/14. Each embedding vector has 768 dimensions.

This dataset can be used to model the design, sizing, and performance aspects of a large-scale, real-world vector search application. It supports both text-to-image search and image-to-image search.

The complete dataset is available as a mixture of npy and Parquet files at the-eye.eu.

ClickHouse has made a subset of 100 million vectors available in an S3 bucket. The bucket contains 10 Parquet files, each holding 10 million rows.

We recommend users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the documentation.
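
As a rough starting point for that sizing exercise, the raw size of the vectors alone can be estimated from the row count and the vector dimension. The sketch below is only a back-of-the-envelope calculation: the helper name raw_vector_bytes is hypothetical, the 4-byte and 2-byte element widths are assumptions about how the vectors might be stored, and the result ignores image metadata, compression, and any index overhead, so the documentation remains the authoritative reference.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int) -> int:
    """Uncompressed size of the embedding vectors alone (no metadata, no index)."""
    return num_vectors * dimensions * bytes_per_value


NUM_VECTORS = 100_000_000  # the 100 million vector subset
DIMENSIONS = 768           # embedding dimension stated above

# Element widths are assumptions: 4 bytes for 32-bit floats, 2 bytes for a 16-bit type.
for label, width in (("4-byte values", 4), ("2-byte values", 2)):
    size = raw_vector_bytes(NUM_VECTORS, DIMENSIONS, width)
    print(f"{label}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

With 4-byte values this comes to 100,000,000 x 768 x 4 = 307,200,000,000 bytes, or about 307.2 GB before any compression or index structures are taken into account.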