Parquet
| Input | Output | Alias |
|-------|--------|-------|
| ✔ | ✔ | |
Description
Apache Parquet is a columnar storage format widespread in the Hadoop ecosystem. ClickHouse supports read and write operations for this format.
Data Types Matching
The table below shows supported data types and how they match ClickHouse data types in INSERT and SELECT queries.
| Parquet data type (INSERT) | ClickHouse data type | Parquet data type (SELECT) |
|----------------------------|----------------------|----------------------------|
| BOOL | Bool | BOOL |
| UINT8, BOOL | UInt8 | UINT8 |
| INT8 | Int8/Enum8 | INT8 |
| UINT16 | UInt16 | UINT16 |
| INT16 | Int16/Enum16 | INT16 |
| UINT32 | UInt32 | UINT32 |
| INT32 | Int32 | INT32 |
| UINT64 | UInt64 | UINT64 |
| INT64 | Int64 | INT64 |
| FLOAT | Float32 | FLOAT |
| DOUBLE | Float64 | DOUBLE |
| DATE | Date32 | DATE |
| TIME (ms) | DateTime | UINT32 |
| TIMESTAMP, TIME (us, ns) | DateTime64 | TIMESTAMP |
| STRING, BINARY | String | BINARY |
| STRING, BINARY, FIXED_LENGTH_BYTE_ARRAY | FixedString | FIXED_LENGTH_BYTE_ARRAY |
| DECIMAL | Decimal | DECIMAL |
| LIST | Array | LIST |
| STRUCT | Tuple | STRUCT |
| MAP | Map | MAP |
| UINT32 | IPv4 | UINT32 |
| FIXED_LENGTH_BYTE_ARRAY, BINARY | IPv6 | FIXED_LENGTH_BYTE_ARRAY |
| FIXED_LENGTH_BYTE_ARRAY, BINARY | Int128/UInt128/Int256/UInt256 | FIXED_LENGTH_BYTE_ARRAY |
Arrays can be nested and can have a value of the Nullable type as an argument. Tuple and Map types can also be nested.
Unsupported Parquet data types are: FIXED_SIZE_BINARY, JSON, UUID, ENUM.
The data types of ClickHouse table columns can differ from the corresponding fields of the inserted Parquet data. When inserting data, ClickHouse interprets data types according to the table above and then casts the data to the data type set for the ClickHouse table column.
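The cast described above can be observed without a server. The following is a minimal sketch, assuming `clickhouse-local` is installed and on the `PATH`; the file name `ids.parquet` and the column name `id` are hypothetical. It writes an Int32 column to a Parquet file, then reads the file back with the column declared as UInt64 through the `file` table function, which applies the same interpret-then-cast step as an INSERT into a table with that column type.

```shell
# Hypothetical file and column names, chosen only for illustration.
PARQUET_FILE="ids.parquet"

# Write three Int32 values; they are stored as Parquet INT32.
WRITE_QUERY="SELECT toInt32(number) AS id FROM numbers(3) FORMAT Parquet"

# Read them back while declaring the column as UInt64.
READ_QUERY="SELECT id, toTypeName(id) FROM file('${PARQUET_FILE}', 'Parquet', 'id UInt64')"

# Run only when clickhouse-local is available.
if command -v clickhouse-local >/dev/null 2>&1; then
    clickhouse-local --query "$WRITE_QUERY" > "$PARQUET_FILE"
    clickhouse-local --query "$READ_QUERY"
else
    echo "skipped: clickhouse-local not found"
fi
```

The second query reports the type of `id` as declared in the structure argument rather than the type stored in the file.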
Example Usage
Inserting and Selecting Data
You can insert Parquet data from a file into a ClickHouse table by piping the file to clickhouse-client with an INSERT query that ends in FORMAT Parquet:
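A minimal sketch of the insert command, assuming `clickhouse-client` can reach a running server. The table name `events` and the file name `events.parquet` are hypothetical placeholders; the table must already exist with columns compatible with the file.

```shell
# Hypothetical names: table "events" and input file "events.parquet".
TABLE="events"
INPUT_FILE="events.parquet"
INSERT_QUERY="INSERT INTO ${TABLE} FORMAT Parquet"

# Run only when clickhouse-client and the input file are both present.
if command -v clickhouse-client >/dev/null 2>&1 && [ -f "$INPUT_FILE" ]; then
    # The Parquet bytes are streamed to the client on stdin.
    cat "$INPUT_FILE" | clickhouse-client --query="$INSERT_QUERY"
else
    echo "skipped: clickhouse-client or $INPUT_FILE not found"
fi
```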
You can select data from a ClickHouse table and save it to a file in the Parquet format by redirecting the output of a SELECT query that ends in FORMAT Parquet:
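A minimal sketch of the export command, under the same assumptions: `clickhouse-client` can reach a running server, and the table name `events` and output file name `events_export.parquet` are hypothetical.

```shell
# Hypothetical names: table "events" and output file "events_export.parquet".
TABLE="events"
OUTPUT_FILE="events_export.parquet"
SELECT_QUERY="SELECT * FROM ${TABLE} FORMAT Parquet"

# Run only when clickhouse-client is available.
if command -v clickhouse-client >/dev/null 2>&1; then
    # The client writes the Parquet bytes to stdout, which is redirected to the file.
    clickhouse-client --query="$SELECT_QUERY" > "$OUTPUT_FILE"
else
    echo "skipped: clickhouse-client not found"
fi
```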
To exchange data with Hadoop, you can use the HDFS table engine.
Format Settings
| Setting | Description | Default |
|---------|-------------|---------|
| input_format_parquet_case_insensitive_column_matching | Ignore case when matching Parquet columns with ClickHouse columns. | 0 |
| input_format_parquet_preserve_order | Avoid reordering rows when reading from Parquet files. Usually makes reading much slower. | 0 |
| input_format_parquet_filter_push_down | When reading Parquet files, skip whole row groups based on the WHERE/PREWHERE expressions and min/max statistics in the Parquet metadata. | 1 |
| input_format_parquet_bloom_filter_push_down | When reading Parquet files, skip whole row groups based on the WHERE expressions and the bloom filter in the Parquet metadata. | 0 |
| input_format_parquet_use_native_reader | When reading Parquet files, use the native reader instead of the Arrow reader. | 0 |
| input_format_parquet_allow_missing_columns | Allow missing columns when reading Parquet input. | 1 |
| input_format_parquet_local_file_min_bytes_for_seek | Minimum number of bytes required for a local file read to seek instead of reading and ignoring data in the Parquet input format. | 8192 |
| input_format_parquet_enable_row_group_prefetch | Enable row group prefetching during Parquet parsing. Currently, only single-threaded parsing can prefetch. | 1 |
| input_format_parquet_skip_columns_with_unsupported_types_in_schema_inference | Skip columns with unsupported types during schema inference for the Parquet format. | 0 |
| input_format_parquet_max_block_size | Maximum block size for the Parquet reader. | 65409 |
| input_format_parquet_prefer_block_bytes | Average block size in bytes output by the Parquet reader. | 16744704 |
| output_format_parquet_row_group_size | Target row group size in rows. | 1000000 |
| output_format_parquet_row_group_size_bytes | Target row group size in bytes, before compression. | 536870912 |
| output_format_parquet_string_as_string | Use the Parquet String type instead of Binary for String columns. | 1 |
| output_format_parquet_fixed_string_as_fixed_byte_array | Use the Parquet FIXED_LENGTH_BYTE_ARRAY type instead of Binary for FixedString columns. | 1 |
| output_format_parquet_version | Parquet format version for the output format. Supported versions: 1.0, 2.4, 2.6 and 2.latest (default). | 2.latest |
| output_format_parquet_compression_method | Compression method for the Parquet output format. Supported codecs: snappy, lz4, brotli, zstd, gzip, none (uncompressed). | zstd |
| output_format_parquet_compliant_nested_types | In the Parquet file schema, use the name 'element' instead of 'item' for list elements. This is a historical artifact of the Arrow library implementation. Generally increases compatibility, except perhaps with some old versions of Arrow. | 1 |
| output_format_parquet_use_custom_encoder | Use a faster Parquet encoder implementation. | 1 |
| output_format_parquet_parallel_encoding | Do Parquet encoding in multiple threads. Requires output_format_parquet_use_custom_encoder. | 1 |
| output_format_parquet_data_page_size | Target page size in bytes, before compression. | 1048576 |
| output_format_parquet_batch_size | Check the page size every this many rows. Consider decreasing it if you have columns whose average value size is above a few KB. | 1024 |
| output_format_parquet_write_page_index | Write a page index into Parquet files. | 1 |
| input_format_parquet_import_nested | Obsolete setting, does nothing. | 0 |
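
The settings above can be passed to `clickhouse-client` as command-line options of the same name. The following is a minimal sketch, assuming a reachable server; the table name `events`, the file names, and the chosen values (snappy compression, 100000-row row groups) are hypothetical and only illustrate the mechanism.

```shell
# Hypothetical names: table "events" and the two Parquet files below.
TABLE="events"
OUTPUT_FILE="events_snappy.parquet"
INPUT_FILE="events_mixed_case.parquet"
EXPORT_QUERY="SELECT * FROM ${TABLE} FORMAT Parquet"
IMPORT_QUERY="INSERT INTO ${TABLE} FORMAT Parquet"

if command -v clickhouse-client >/dev/null 2>&1; then
    # Export with a non-default codec and smaller row groups.
    clickhouse-client \
        --output_format_parquet_compression_method=snappy \
        --output_format_parquet_row_group_size=100000 \
        --query="$EXPORT_QUERY" > "$OUTPUT_FILE"

    # Import a file whose column names differ from the table's only by case.
    if [ -f "$INPUT_FILE" ]; then
        clickhouse-client \
            --input_format_parquet_case_insensitive_column_matching=1 \
            --query="$IMPORT_QUERY" < "$INPUT_FILE"
    fi
else
    echo "skipped: clickhouse-client not found"
fi
```

Passing a setting on the command line applies it to that client invocation only, so the server defaults listed in the table stay unchanged.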