
Alexey Milovidov at Web Summit 2025: Working with massive datasets at scale

Alexey Milovidov

A practical deep dive into working with massive datasets in ClickHouse, using training data collections as case studies. The presentation gets hands-on with real data engineering challenges: loading, analyzing, and comparing datasets that range from Common Crawl's web pages to GitHub repositories, Wikipedia, Reddit, and beyond.

The demo walks through ingesting the FineWeb dataset (81.5 terabytes of web data) directly from Hugging Face into ClickHouse, then explores creative analytical approaches across different data sources. You'll see techniques for calculating "style fingerprints" of websites, tracking word trends across platforms, and mapping a billion photos geographically, all using SQL.

  • Loading the 81.5 TB FineWeb dataset from Hugging Face in 72 minutes using parallel ingestion
  • Automatic schema inference from Parquet files with better compression ratios than the source format
  • Comparing writing styles across platforms (Wikipedia vs Reddit vs Hacker News vs Bluesky) using token analysis
  • Creating website "style fingerprints" with hash-based vectorization to find similar domains
  • Tracking word trends over time and analyzing multimodal datasets like 1 billion photos with geographic queries
  • Practical examples working with GitHub (300TB), Wikipedia (45GB), and other major open datasets
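
To make the "style fingerprint" idea more concrete, here is a minimal Python sketch of hash-based vectorization. It is not the method shown in the talk (the demo uses SQL in ClickHouse, and its exact queries are not reproduced on this page); it only illustrates the general technique of hashing tokens into a fixed number of buckets and comparing the resulting vectors. The names `tokenize`, `fingerprint`, and `cosine_similarity`, the 64-bucket size, and the choice of MD5 as the hash are all assumptions made for illustration.

```python
import hashlib
import math
import re

DIMENSIONS = 64  # assumed fingerprint size; not taken from the talk


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fingerprint(text, dimensions=DIMENSIONS):
    """Hash every token into one of `dimensions` buckets and count hits.

    The count vector is scaled to unit length so that long and short
    texts are comparable. An empty text yields an all-zero vector.
    """
    vec = [0.0] * dimensions
    for token in tokenize(text):
        # hashlib is used instead of built-in hash(), which is
        # randomized per process for strings.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "little") % dimensions
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return vec
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of two unit-length vectors (their cosine similarity)."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Hypothetical sample texts standing in for the content of two domains.
    site_a = fingerprint("Breaking news: markets rally as rates hold steady")
    site_b = fingerprint("lol anyone else think this patch broke the game")
    print(cosine_similarity(site_a, site_a))
    print(cosine_similarity(site_a, site_b))
```

Because the vector has a fixed size regardless of vocabulary, fingerprints for many domains can be stored as plain arrays and compared pairwise. The trade-off is that unrelated tokens may collide in the same bucket, so a larger `dimensions` value gives finer discrimination at the cost of more storage.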
