Tutorials and Example Datasets
We have many resources to help you get started and learn how ClickHouse works:
- If you need to get ClickHouse up and running, check out our Quick Start
- The ClickHouse Tutorial analyzes a dataset of New York City taxi rides
In addition, the sample datasets are a great way to practice working with ClickHouse, learn important techniques and tricks, and see how to take advantage of the many powerful functions in ClickHouse. The sample datasets include:
Page | Description |
---|---|
Amazon Customer Review | Over 150M customer reviews of Amazon products |
AMPLab Big Data Benchmark | A benchmark dataset used for comparing the performance of data warehousing solutions. |
Brown University Benchmark | A new analytical benchmark for machine-generated log data |
Geo Data using the Cell Tower Dataset | Learn how to load OpenCelliD data into ClickHouse, connect Apache Superset to ClickHouse, and build a dashboard based on the data |
COVID-19 Open-Data | COVID-19 Open-Data is a large, open-source database of COVID-19 epidemiological data and related factors like demographics, economics, and government responses |
Terabyte Click Logs from Criteo | A terabyte of Click Logs from Criteo |
Environmental Sensors Data | Over 20 billion records of data from Sensor.Community, a contributor-driven global sensor network that creates Open Environmental Data. |
GitHub Events Dataset | Dataset containing all events on GitHub from 2011 to December 6, 2020, totaling 3.1 billion records. |
Writing Queries in ClickHouse using GitHub Data | Dataset containing all of the commits and changes for the ClickHouse repository |
Laion-400M dataset | Dataset containing 400 million images with English image captions |
New York Public Library "What's on the Menu?" Dataset | Dataset containing 1.3 million records of historical data on the menus of hotels, restaurants, and cafes, including the dishes and their prices. |
Anonymized Web Analytics | Dataset consisting of two tables containing anonymized web analytics data with hits and visits |
NOAA Global Historical Climatology Network | 2.5 billion rows of climate data for the last 120 years |
New York Taxi Data | Data for billions of taxi and for-hire vehicle (Uber, Lyft, etc.) trips originating in New York City since 2009 |
NYPD Complaint Data | Ingest and query Tab Separated Value data in 5 steps |
OnTime | Dataset containing the on-time performance of airline flights |
Crowdsourced air traffic data from The OpenSky Network 2020 | Data derived and cleaned from the full OpenSky dataset to illustrate the development of air traffic during the COVID-19 pandemic. |
Recipes Dataset | The RecipeNLG dataset, containing 2.2 million recipes |
Reddit comments dataset | Dataset containing publicly available comments on Reddit from December 2005 to March 2023 with over 14B rows of data in JSON format |
Analyzing Stack Overflow data with ClickHouse | Analyzing Stack Overflow data with ClickHouse |
Star Schema Benchmark (SSB, 2009) | The Star Schema Benchmark (SSB) data set and queries |
TPC-DS (2012) | The TPC-DS benchmark data set and queries. |
TPC-H (1999) | The TPC-H benchmark data set and queries. |
Taiwan Historical Weather Datasets | 131 million rows of weather observation data for the last 128 years |
The UK property prices dataset | Learn how to use projections to improve the performance of frequently run queries using the UK property dataset, which contains data about prices paid for real-estate property in England and Wales |
WikiStat | Explore the WikiStat dataset containing 0.5 trillion records. |
YouTube dataset of dislikes | A collection of dislikes of YouTube videos. |