Parquet

Input	Output	Alias
✔	✔

Description

Apache Parquet is a columnar storage format widely used in the Hadoop ecosystem. ClickHouse supports both reading and writing this format.

Data types matching

The table below shows supported data types and how they match ClickHouse data types in INSERT and SELECT queries.

Parquet data type (`INSERT`)	ClickHouse data type	Parquet data type (`SELECT`)
`BOOL`	Bool	`BOOL`
`UINT8`, `BOOL`	UInt8	`UINT8`
`INT8`	Int8/Enum8	`INT8`
`UINT16`	UInt16	`UINT16`
`INT16`	Int16/Enum16	`INT16`
`UINT32`	UInt32	`UINT32`
`INT32`	Int32	`INT32`
`UINT64`	UInt64	`UINT64`
`INT64`	Int64	`INT64`
`FLOAT`	Float32	`FLOAT`
`DOUBLE`	Float64	`DOUBLE`
`DATE`	Date32	`DATE`
`TIME (ms)`	DateTime	`UINT32`
`TIMESTAMP`, `TIME (us, ns)`	DateTime64	`TIMESTAMP`
`STRING`, `BINARY`	String	`BINARY`
`STRING`, `BINARY`, `FIXED_LENGTH_BYTE_ARRAY`	FixedString	`FIXED_LENGTH_BYTE_ARRAY`
`DECIMAL`	Decimal	`DECIMAL`
`LIST`	Array	`LIST`
`STRUCT`	Tuple	`STRUCT`
`MAP`	Map	`MAP`
`UINT32`	IPv4	`UINT32`
`FIXED_LENGTH_BYTE_ARRAY`, `BINARY`	IPv6	`FIXED_LENGTH_BYTE_ARRAY`
`FIXED_LENGTH_BYTE_ARRAY`, `BINARY`	Int128/UInt128/Int256/UInt256	`FIXED_LENGTH_BYTE_ARRAY`
`JSON`	JSON	`JSON`

Arrays can be nested, and their element type can be Nullable. Tuple and Map types can also be nested.
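As an illustration of nested types, the following is a minimal sketch that writes a single row containing a nested Array with a NULL element and a Map whose values are Tuples. The column aliases and the output file name are hypothetical.

```sql
-- Write one row with a nested Array of Nullable values and a Map of Tuples.
SELECT
    [[1, NULL], [2]] AS nested_array,
    map('k', (1, 'x')) AS map_of_tuples
INTO OUTFILE 'nested.parquet'
FORMAT Parquet;
```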

Unsupported Parquet data types are `FIXED_SIZE_BINARY`, `UUID`, and `ENUM`.

The data types of ClickHouse table columns can differ from those of the corresponding fields in the inserted Parquet data. When inserting data, ClickHouse first interprets the data types according to the table above and then casts the data to the type declared for the ClickHouse table column.
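A minimal sketch of this two-step behavior, assuming a hypothetical file `events.parquet` whose `id` field is stored as `INT32`. The table, column, and file names are illustrative only. Note that the `file` table function resolves paths on the server side, while `FROM INFILE` reads the file on the client side.

```sql
-- Show the ClickHouse types inferred from the Parquet schema.
DESCRIBE TABLE file('events.parquet', 'Parquet');

-- Hypothetical destination table whose column type differs from the file's.
CREATE TABLE events
(
    id UInt64,   -- read as Int32 from the file, then cast to UInt64
    name String
)
ENGINE = MergeTree
ORDER BY id;

-- Values are interpreted per the mapping table, then cast to the column types.
INSERT INTO events FROM INFILE 'events.parquet' FORMAT Parquet;
```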

Example usage

Inserting data

Using a Parquet file named football.parquet with the following data:

Insert the data:
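A minimal sketch of such an insert, assuming a destination table named `football` (a hypothetical name) already exists with columns compatible with the file, and that the statement is run from a client that can read the local file:

```sql
-- Load the local Parquet file into the (assumed) existing table.
INSERT INTO football FROM INFILE 'football.parquet' FORMAT Parquet;
```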

Reading data

Read data using the Parquet format:
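A minimal sketch, assuming a hypothetical table named `football`; the output file name is illustrative:

```sql
-- Export all rows of the (assumed) table to a Parquet file on the client side.
SELECT *
FROM football
INTO OUTFILE 'football_out.parquet'
FORMAT Parquet;
```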

Tip

Parquet is a binary format, so its output is not human-readable in the terminal. Use the INTO OUTFILE clause to write Parquet files.

To exchange data with Hadoop, you can use the HDFS table engine.
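A minimal sketch of an HDFS-backed table that stores its data as Parquet. The HDFS URI, table name, and columns are hypothetical and must be adapted to your cluster.

```sql
-- Hypothetical table whose data lives in HDFS as a Parquet file.
CREATE TABLE hdfs_football
(
    name String,
    value UInt32
)
ENGINE = HDFS('hdfs://namenode:9000/data/football.parquet', 'Parquet');

-- Reads go through the Parquet input format.
SELECT * FROM hdfs_football;
```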

Format settings

Setting	Description	Default
`input_format_parquet_case_insensitive_column_matching`	Ignore case when matching Parquet columns with ClickHouse columns.	`0`
`input_format_parquet_preserve_order`	Avoid reordering rows when reading from Parquet files. Usually makes reading much slower.	`0`
`input_format_parquet_filter_push_down`	When reading Parquet files, skip whole row groups based on the WHERE/PREWHERE expressions and min/max statistics in the Parquet metadata.	`1`
`input_format_parquet_bloom_filter_push_down`	When reading Parquet files, skip whole row groups based on the WHERE expressions and bloom filter in the Parquet metadata.	`0`
`input_format_parquet_use_native_reader`	When reading Parquet files, use the native reader instead of the Arrow reader.	`0`
`input_format_parquet_allow_missing_columns`	Allow missing columns when reading Parquet input.	`1`
`input_format_parquet_local_file_min_bytes_for_seek`	Minimum number of bytes required for a local (file) read to seek instead of reading and ignoring data in the Parquet input format.	`8192`
`input_format_parquet_enable_row_group_prefetch`	Enable row group prefetching during Parquet parsing. Currently, only single-threaded parsing can prefetch.	`1`
`input_format_parquet_skip_columns_with_unsupported_types_in_schema_inference`	Skip columns with unsupported types during schema inference for the Parquet format.	`0`
`input_format_parquet_max_block_size`	Maximum block size for the Parquet reader.	`65409`
`input_format_parquet_prefer_block_bytes`	Average block size, in bytes, output by the Parquet reader.	`16744704`
`input_format_parquet_enable_json_parsing`	When reading Parquet files, parse JSON columns as ClickHouse JSON columns.	`1`
`output_format_parquet_row_group_size`	Target row group size in rows.	`1000000`
`output_format_parquet_row_group_size_bytes`	Target row group size in bytes, before compression.	`536870912`
`output_format_parquet_string_as_string`	Use Parquet String type instead of Binary for String columns.	`1`
`output_format_parquet_fixed_string_as_fixed_byte_array`	Use Parquet FIXED_LENGTH_BYTE_ARRAY type instead of Binary for FixedString columns.	`1`
`output_format_parquet_version`	Parquet format version for output format. Supported versions: 1.0, 2.4, 2.6 and 2.latest (default)	`2.latest`
`output_format_parquet_compression_method`	Compression method for Parquet output format. Supported codecs: snappy, lz4, brotli, zstd, gzip, none (uncompressed)	`zstd`
`output_format_parquet_compliant_nested_types`	In the Parquet file schema, use the name 'element' instead of 'item' for list elements. The name 'item' is a historical artifact of the Arrow library implementation. Enabling this setting generally increases compatibility, except perhaps with some old versions of Arrow.	`1`
`output_format_parquet_use_custom_encoder`	Use a faster Parquet encoder implementation.	`1`
`output_format_parquet_parallel_encoding`	Do Parquet encoding in multiple threads. Requires output_format_parquet_use_custom_encoder.	`1`
`output_format_parquet_data_page_size`	Target page size in bytes, before compression.	`1048576`
`output_format_parquet_batch_size`	Check the page size every this many rows. Consider decreasing it if you have columns whose average value size exceeds a few KB.	`1024`
`output_format_parquet_write_page_index`	Write the page index into Parquet files.	`1`
`input_format_parquet_import_nested`	Obsolete setting, does nothing.	`0`
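Settings from the table above can be applied per query with a SETTINGS clause. The following is a minimal sketch, assuming a hypothetical `football` table and illustrative file names; the values are examples, not recommendations.

```sql
-- Output: smaller row groups and an explicit compression codec.
SELECT *
FROM football
INTO OUTFILE 'football_small_groups.parquet'
FORMAT Parquet
SETTINGS
    output_format_parquet_row_group_size = 100000,
    output_format_parquet_compression_method = 'snappy';

-- Input: match Parquet column names to ClickHouse columns regardless of case.
SELECT *
FROM file('football.parquet', 'Parquet')
SETTINGS input_format_parquet_case_insensitive_column_matching = 1;
```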
