Skip to main content
Skip to main content

Recipes Dataset

The RecipeNLG dataset is available for download here. It contains 2.2 million recipes. The size is slightly less than 1 GB.

Download and Unpack the Dataset

  1. Go to the download page https://recipenlg.cs.put.poznan.pl/dataset.
  2. Accept Terms and Conditions and download zip file.
  3. Option: Using the md5sum dataset.zip to validate the zip file and it should be equal to 3a168dfd0912bb034225619b3586ce76.
  4. Unpack the zip file with unzip dataset.zip. You will get the full_dataset.csv file in the dataset directory.

Create a Table

Run clickhouse-client and execute the following CREATE query:

Insert the Data

Run the following command:

This is a showcase how to parse custom CSV, as it requires multiple tunes.

Explanation:

  • The dataset is in CSV format, but it requires some preprocessing on insertion; we use table function input to perform preprocessing;
  • The structure of CSV file is specified in the argument of the table function input;
  • The field num (row number) is unneeded - we parse it from file and ignore;
  • We use FORMAT CSVWithNames but the header in CSV will be ignored (by command line parameter --input_format_with_names_use_header 0), because the header does not contain the name for the first field;
  • File is using only double quotes to enclose CSV strings; some strings are not enclosed in double quotes, and single quote must not be parsed as the string enclosing - that's why we also add the --format_csv_allow_single_quote 0 parameter;
  • Some strings from CSV cannot parse, because they contain \M/ sequence at the beginning of the value; the only value starting with backslash in CSV can be \N that is parsed as SQL NULL. We add --input_format_allow_errors_num 10 parameter and up to ten malformed records can be skipped;
  • There are arrays for ingredients, directions and NER fields; these arrays are represented in unusual form: they are serialized into string as JSON and then placed in CSV - we parse them as String and then use JSONExtract function to transform it to Array.

Validate the Inserted Data

By checking the row count:

Query:

Result:

Example Queries

Top Components by the Number of Recipes:

In this example we learn how to use arrayJoin function to expand an array into a set of rows.

Query:

Result:

The Most Complex Recipes with Strawberry

Result:

In this example, we involve has function to filter by array elements and sort by the number of directions.

There is a wedding cake that requires the whole 126 steps to produce! Show that directions:

Query:

Result:

Online Playground

The dataset is also available in the Online Playground.