Alexey Milovidov at Web Summit 2025: Working with massive datasets at scale
Alexey Milovidov
A practical deep dive into working with massive datasets in ClickHouse, using training data collections as fascinating case studies. This presentation gets hands-on with real data engineering challenges: loading, analyzing, and comparing datasets that range from Common Crawl's web pages to GitHub repositories, Wikipedia, Reddit, and beyond.
The demo walks through ingesting the FineWeb dataset (81.5 terabytes of web data) directly from Hugging Face into ClickHouse, then explores creative analytical approaches across different data sources. You'll see techniques for calculating "style fingerprints" of websites, tracking word trends across platforms, and mapping a billion photos geographically, all using SQL.
- Loading the 81.5TB FineWeb dataset from Hugging Face in 72 minutes using parallel ingestion
- Automatic schema inference from Parquet files with better compression ratios than the source format
- Comparing writing styles across platforms (Wikipedia vs Reddit vs Hacker News vs Bluesky) using token analysis
- Creating website "style fingerprints" with hash-based vectorization to find similar domains
- Tracking word trends over time and analyzing multimodal datasets like 1 billion photos with geographic queries
- Practical examples working with GitHub (300TB), Wikipedia (45GB), and other major open datasets
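
To give a feel for the "style fingerprint" idea in the list above, here is a minimal sketch of hash-based vectorization. The talk does this in SQL inside ClickHouse; the Python below is only an illustration, not the method shown in the demo. The vector size (`DIMENSIONS`), the regex tokenizer, the use of MD5 for bucketing, and all function names are assumptions made for this example.

```python
import hashlib
import math
import re
from collections import Counter

DIMENSIONS = 64  # assumed fingerprint size; the talk does not specify one


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bucket(token, dimensions=DIMENSIONS):
    """Map a token to a stable bucket index using a hash (MD5 is an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % dimensions


def fingerprint(text, dimensions=DIMENSIONS):
    """Count tokens per hash bucket and scale the vector to unit length."""
    counts = Counter(bucket(token, dimensions) for token in tokenize(text))
    vector = [float(counts.get(i, 0)) for i in range(dimensions)]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def cosine_similarity(a, b):
    """Dot product of two unit-length fingerprints."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    site_a = fingerprint("Show HN: a fast columnar database for analytics")
    site_b = fingerprint("Ask HN: which columnar database is fast for analytics?")
    print(round(cosine_similarity(site_a, site_b), 3))
```

Because every token lands in one of a fixed number of buckets, each site gets a vector of the same length regardless of vocabulary, and similar domains can be found by comparing vectors. Hash collisions merge unrelated tokens into one bucket, so a larger vector trades memory for precision.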