Skip to main content

New York Public Library "What's on the Menu?" Dataset

The dataset is created by the New York Public Library. It contains historical data on the menus of hotels, restaurants and cafes with the dishes along with their prices.

Source: The data is in public domain.

The data is from library's archive and it may be incomplete and difficult for statistical analysis. Nevertheless it is also very yummy. The size is just 1.3 million records about dishes in the menus β€” it's a very small data volume for ClickHouse, but it's still a good example.

Download the Dataset​

Run the command:


Replace the link to the up to date link from if needed. Download size is about 35 MB.

Unpack the Dataset​

tar xvf 2021_08_01_07_01_17_data.tgz

Uncompressed size is about 150 MB.

The data is normalized consisted of four tables:

  • Menu β€” Information about menus: the name of the restaurant, the date when menu was seen, etc.
  • Dish β€” Information about dishes: the name of the dish along with some characteristic.
  • MenuPage β€” Information about the pages in the menus, because every page belongs to some menu.
  • MenuItem β€” An item of the menu. A dish along with its price on some menu page: links to dish and menu page.

Create the Tables​

We use Decimal data type to store prices.

id UInt32,
name String,
description String,
menus_appeared UInt32,
times_appeared Int32,
first_appeared UInt16,
last_appeared UInt16,
lowest_price Decimal64(3),
highest_price Decimal64(3)
) ENGINE = MergeTree ORDER BY id;

id UInt32,
name String,
sponsor String,
event String,
venue String,
place String,
physical_description String,
occasion String,
notes String,
call_number String,
keywords String,
language String,
date String,
location String,
location_type String,
currency String,
currency_symbol String,
status String,
page_count UInt16,
dish_count UInt16
) ENGINE = MergeTree ORDER BY id;

CREATE TABLE menu_page
id UInt32,
menu_id UInt32,
page_number UInt16,
image_id String,
full_height UInt16,
full_width UInt16,
uuid UUID
) ENGINE = MergeTree ORDER BY id;

CREATE TABLE menu_item
id UInt32,
menu_page_id UInt32,
price Decimal64(3),
high_price Decimal64(3),
dish_id UInt32,
created_at DateTime,
updated_at DateTime,
xpos Float64,
ypos Float64
) ENGINE = MergeTree ORDER BY id;

Import the Data​

Upload data into ClickHouse, run:

clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO dish FORMAT CSVWithNames" < Dish.csv
clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO menu FORMAT CSVWithNames" < Menu.csv
clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --query "INSERT INTO menu_page FORMAT CSVWithNames" < MenuPage.csv
clickhouse-client --format_csv_allow_single_quotes 0 --input_format_null_as_default 0 --date_time_input_format best_effort --query "INSERT INTO menu_item FORMAT CSVWithNames" < MenuItem.csv

We use CSVWithNames format as the data is represented by CSV with header.

We disable format_csv_allow_single_quotes as only double quotes are used for data fields and single quotes can be inside the values and should not confuse the CSV parser.

We disable input_format_null_as_default as our data does not have NULL. Otherwise ClickHouse will try to parse \N sequences and can be confused with \ in data.

The setting date_time_input_format best_effort allows to parse DateTime fields in wide variety of formats. For example, ISO-8601 without seconds like '2000-01-01 01:02' will be recognized. Without this setting only fixed DateTime format is allowed.

Denormalize the Data​

Data is presented in multiple tables in normalized form. It means you have to perform JOIN if you want to query, e.g. dish names from menu items. For typical analytical tasks it is way more efficient to deal with pre-JOINed data to avoid doing JOIN every time. It is called "denormalized" data.

We will create a table menu_item_denorm where will contain all the data JOINed together:

CREATE TABLE menu_item_denorm
ENGINE = MergeTree ORDER BY (dish_name, created_at)
ypos, AS dish_id, AS dish_name,
dish.description AS dish_description,
dish.menus_appeared AS dish_menus_appeared,
dish.times_appeared AS dish_times_appeared,
dish.first_appeared AS dish_first_appeared,
dish.last_appeared AS dish_last_appeared,
dish.lowest_price AS dish_lowest_price,
dish.highest_price AS dish_highest_price, AS menu_id, AS menu_name,
menu.sponsor AS menu_sponsor,
menu.event AS menu_event,
menu.venue AS menu_venue, AS menu_place,
menu.physical_description AS menu_physical_description,
menu.occasion AS menu_occasion,
menu.notes AS menu_notes,
menu.call_number AS menu_call_number,
menu.keywords AS menu_keywords,
menu.language AS menu_language, AS menu_date,
menu.location AS menu_location,
menu.location_type AS menu_location_type,
menu.currency AS menu_currency,
menu.currency_symbol AS menu_currency_symbol,
menu.status AS menu_status,
menu.page_count AS menu_page_count,
menu.dish_count AS menu_dish_count
FROM menu_item
JOIN dish ON menu_item.dish_id =
JOIN menu_page ON menu_item.menu_page_id =
JOIN menu ON menu_page.menu_id =;

Validate the Data​


SELECT count() FROM menu_item_denorm;


β”‚ 1329175 β”‚

Run Some Queries​

Averaged historical prices of dishes​


round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
round(avg(price), 2),
bar(avg(price), 0, 100, 100)
FROM menu_item_denorm
WHERE (menu_currency = 'Dollars') AND (d > 0) AND (d < 2022)


β”Œβ”€β”€β”€β”€d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 100, 100)─┐
β”‚ 1850 β”‚ 618 β”‚ 1.5 β”‚ β–ˆβ– β”‚
β”‚ 1860 β”‚ 1634 β”‚ 1.29 β”‚ β–ˆβ–Ž β”‚
β”‚ 1870 β”‚ 2215 β”‚ 1.36 β”‚ β–ˆβ–Ž β”‚
β”‚ 1880 β”‚ 3909 β”‚ 1.01 β”‚ β–ˆ β”‚
β”‚ 1890 β”‚ 8837 β”‚ 1.4 β”‚ β–ˆβ– β”‚
β”‚ 1900 β”‚ 176292 β”‚ 0.68 β”‚ β–‹ β”‚
β”‚ 1910 β”‚ 212196 β”‚ 0.88 β”‚ β–Š β”‚
β”‚ 1920 β”‚ 179590 β”‚ 0.74 β”‚ β–‹ β”‚
β”‚ 1930 β”‚ 73707 β”‚ 0.6 β”‚ β–Œ β”‚
β”‚ 1940 β”‚ 58795 β”‚ 0.57 β”‚ β–Œ β”‚
β”‚ 1950 β”‚ 41407 β”‚ 0.95 β”‚ β–Š β”‚
β”‚ 1960 β”‚ 51179 β”‚ 1.32 β”‚ β–ˆβ–Ž β”‚
β”‚ 1970 β”‚ 12914 β”‚ 1.86 β”‚ β–ˆβ–‹ β”‚
β”‚ 1980 β”‚ 7268 β”‚ 4.35 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–Ž β”‚
β”‚ 1990 β”‚ 11055 β”‚ 6.03 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚
β”‚ 2000 β”‚ 2467 β”‚ 11.85 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ β”‚
β”‚ 2010 β”‚ 597 β”‚ 25.66 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ β”‚

Take it with a grain of salt.

Burger Prices​


round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
round(avg(price), 2),
bar(avg(price), 0, 50, 100)
FROM menu_item_denorm
WHERE (menu_currency = 'Dollars') AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%burger%')


β”Œβ”€β”€β”€β”€d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)───────────┐
β”‚ 1880 β”‚ 2 β”‚ 0.42 β”‚ β–‹ β”‚
β”‚ 1890 β”‚ 7 β”‚ 0.85 β”‚ β–ˆβ–‹ β”‚
β”‚ 1900 β”‚ 399 β”‚ 0.49 β”‚ β–Š β”‚
β”‚ 1910 β”‚ 589 β”‚ 0.68 β”‚ β–ˆβ–Ž β”‚
β”‚ 1920 β”‚ 280 β”‚ 0.56 β”‚ β–ˆ β”‚
β”‚ 1930 β”‚ 74 β”‚ 0.42 β”‚ β–‹ β”‚
β”‚ 1940 β”‚ 119 β”‚ 0.59 β”‚ β–ˆβ– β”‚
β”‚ 1950 β”‚ 134 β”‚ 1.09 β”‚ β–ˆβ–ˆβ– β”‚
β”‚ 1960 β”‚ 272 β”‚ 0.92 β”‚ β–ˆβ–‹ β”‚
β”‚ 1970 β”‚ 108 β”‚ 1.18 β”‚ β–ˆβ–ˆβ–Ž β”‚
β”‚ 1980 β”‚ 88 β”‚ 2.82 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ β”‚
β”‚ 1990 β”‚ 184 β”‚ 3.68 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž β”‚
β”‚ 2000 β”‚ 21 β”‚ 7.14 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž β”‚
β”‚ 2010 β”‚ 6 β”‚ 18.42 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ β”‚



round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
round(avg(price), 2),
bar(avg(price), 0, 50, 100)
FROM menu_item_denorm
WHERE (menu_currency IN ('Dollars', '')) AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%vodka%')


β”Œβ”€β”€β”€β”€d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)─┐
β”‚ 1910 β”‚ 2 β”‚ 0 β”‚ β”‚
β”‚ 1920 β”‚ 1 β”‚ 0.3 β”‚ β–Œ β”‚
β”‚ 1940 β”‚ 21 β”‚ 0.42 β”‚ β–‹ β”‚
β”‚ 1950 β”‚ 14 β”‚ 0.59 β”‚ β–ˆβ– β”‚
β”‚ 1960 β”‚ 113 β”‚ 2.17 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–Ž β”‚
β”‚ 1970 β”‚ 37 β”‚ 0.68 β”‚ β–ˆβ–Ž β”‚
β”‚ 1980 β”‚ 19 β”‚ 2.55 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚
β”‚ 1990 β”‚ 86 β”‚ 3.6 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– β”‚
β”‚ 2000 β”‚ 2 β”‚ 3.98 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š β”‚

To get vodka we have to write ILIKE '%vodka%' and this definitely makes a statement.


Let's print caviar prices. Also let's print a name of any dish with caviar.


round(toUInt32OrZero(extract(menu_date, '^\\d{4}')), -1) AS d,
round(avg(price), 2),
bar(avg(price), 0, 50, 100),
FROM menu_item_denorm
WHERE (menu_currency IN ('Dollars', '')) AND (d > 0) AND (d < 2022) AND (dish_name ILIKE '%caviar%')


β”Œβ”€β”€β”€β”€d─┬─count()─┬─round(avg(price), 2)─┬─bar(avg(price), 0, 50, 100)──────┬─any(dish_name)──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
β”‚ 1090 β”‚ 1 β”‚ 0 β”‚ β”‚ Caviar β”‚
β”‚ 1880 β”‚ 3 β”‚ 0 β”‚ β”‚ Caviar β”‚
β”‚ 1890 β”‚ 39 β”‚ 0.59 β”‚ β–ˆβ– β”‚ Butter and caviar β”‚
β”‚ 1900 β”‚ 1014 β”‚ 0.34 β”‚ β–‹ β”‚ Anchovy Caviar on Toast β”‚
β”‚ 1910 β”‚ 1588 β”‚ 1.35 β”‚ β–ˆβ–ˆβ–‹ β”‚ 1/1 BrΓΆtchen Caviar β”‚
β”‚ 1920 β”‚ 927 β”‚ 1.37 β”‚ β–ˆβ–ˆβ–‹ β”‚ ASTRAKAN CAVIAR β”‚
β”‚ 1930 β”‚ 289 β”‚ 1.91 β”‚ β–ˆβ–ˆβ–ˆβ–‹ β”‚ Astrachan caviar β”‚
β”‚ 1940 β”‚ 201 β”‚ 0.83 β”‚ β–ˆβ–‹ β”‚ (SPECIAL) Domestic Caviar Sandwich β”‚
β”‚ 1950 β”‚ 81 β”‚ 2.27 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–Œ β”‚ Beluga Caviar β”‚
β”‚ 1960 β”‚ 126 β”‚ 2.21 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ– β”‚ Beluga Caviar β”‚
β”‚ 1970 β”‚ 105 β”‚ 0.95 β”‚ β–ˆβ–Š β”‚ BELUGA MALOSSOL CAVIAR AMERICAN DRESSING β”‚
β”‚ 1980 β”‚ 12 β”‚ 7.22 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– β”‚ Authentic Iranian Beluga Caviar the world's finest black caviar presented in ice garni and a sampling of chilled 100Β° Russian vodka β”‚
β”‚ 1990 β”‚ 74 β”‚ 14.42 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ β”‚ Avocado Salad, Fresh cut avocado with caviare β”‚
β”‚ 2000 β”‚ 3 β”‚ 7.82 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ β”‚ Aufgeschlagenes Kartoffelsueppchen mit Forellencaviar β”‚
β”‚ 2010 β”‚ 6 β”‚ 15.58 β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– β”‚ "OYSTERS AND PEARLS" "Sabayon" of Pearl Tapioca with Island Creek Oysters and Russian Sevruga Caviar β”‚

At least they have caviar with vodka. Very nice.

Online Playground​

The data is uploaded to ClickHouse Playground, example.