メインコンテンツまでスキップ
メインコンテンツまでスキップ

New York Public Library "What's on the Menu?" Dataset

The dataset is created by the New York Public Library. It contains historical data on the menus of hotels, restaurants and cafes with the dishes along with their prices.

Source: http://menus.nypl.org/data The data is in public domain.

The data is from library's archive and it may be incomplete and difficult for statistical analysis. Nevertheless it is also very yummy. The size is just 1.3 million records about dishes in the menus — it's a very small data volume for ClickHouse, but it's still a good example.

ダウンロード データセット

Run the command:

Replace the link to the up to date link from http://menus.nypl.org/data if needed. Download size is about 35 MB.

データセットを展開

Uncompressed size is about 150 MB.

The data is normalized consisted of four tables:

  • Menu — Information about menus: the name of the restaurant, the date when menu was seen, etc.
  • Dish — Information about dishes: the name of the dish along with some characteristic.
  • MenuPage — Information about the pages in the menus, because every page belongs to some menu.
  • MenuItem — An item of the menu. A dish along with its price on some menu page: links to dish and menu page.

テーブルを作成

We use Decimal data type to store prices.

データをインポート

Upload data into ClickHouse, run:

We use CSVWithNames format as the data is represented by CSV with header.

We disable format_csv_allow_single_quotes as only double quotes are used for data fields and single quotes can be inside the values and should not confuse the CSV parser.

We disable input_format_null_as_default as our data does not have NULL. Otherwise ClickHouse will try to parse \N sequences and can be confused with \ in data.

The setting date_time_input_format best_effort allows to parse DateTime fields in wide variety of formats. For example, ISO-8601 without seconds like '2000-01-01 01:02' will be recognized. Without this setting only fixed DateTime format is allowed.

データを非正規化

Data is presented in multiple tables in normalized form. It means you have to perform JOIN if you want to query, e.g. dish names from menu items. For typical analytical tasks it is way more efficient to deal with pre-JOINed data to avoid doing JOIN every time. It is called "denormalized" data.

We will create a table menu_item_denorm where will contain all the data JOINed together:

データを検証

Query:

Result:

クエリを実行

平均的な歴史的価格

Query:

Result:

Take it with a grain of salt.

ハンバーガーの価格

Query:

Result:

ウォッカ

Query:

Result:

To get vodka we have to write ILIKE '%vodka%' and this definitely makes a statement.

キャビア

Let's print caviar prices. Also let's print a name of any dish with caviar.

Query:

Result:

At least they have caviar with vodka. Very nice.

オンラインプレイグラウンド

The data is uploaded to ClickHouse Playground, example.