Working with Avro, Arrow, and ORC data in ClickHouse

Apache has released multiple data formats actively used in analytics environments, including the popular Avro, Arrow, and Orc. ClickHouse supports importing and exporting data using any from that list.

Importing and exporting in Avro format

ClickHouse supports reading and writing Apache Avro data files, which are widely used in Hadoop systems.

To import from an avro file, we should use Avro format in the INSERT statement:

INSERT INTO sometable
FROM INFILE 'data.avro'
FORMAT Avro

With the file() function, we can also explore Avro files before actually importing data:

SELECT path, hits
FROM file('data.avro', Avro)
ORDER BY hits DESC
LIMIT 5;

┌─path────────────┬──hits─┐
│ Amy_Poehler     │ 62732 │
│ Adam_Goldberg   │ 42338 │
│ Aaron_Spelling  │ 25128 │
│ Absence_seizure │ 18152 │
│ Ammon_Bundy     │ 11890 │
└─────────────────┴───────┘

To export to Avro file:

SELECT * FROM sometable
INTO OUTFILE 'export.avro'
FORMAT Avro;

Avro and ClickHouse data types

Consider data types matching when importing or exporting Avro files. Use explicit type casting to convert when loading data from Avro files:

SELECT
    date,
    toDate(date)
FROM file('data.avro', Avro)
LIMIT 3;

┌──date─┬─toDate(date)─┐
│ 16556 │   2015-05-01 │
│ 16556 │   2015-05-01 │
│ 16556 │   2015-05-01 │
└───────┴──────────────┘

Avro messages in Kafka

When Kafka messages use Avro format, ClickHouse can read such streams using AvroConfluent format and Kafka engine:

CREATE TABLE some_topic_stream
(
    field1 UInt32,
    field2 String
)
ENGINE = Kafka() SETTINGS
kafka_broker_list = 'localhost',
kafka_topic_list = 'some_topic',
kafka_group_name = 'some_group',
kafka_format = 'AvroConfluent';

Working with Arrow format

Another columnar format is Apache Arrow, also supported by ClickHouse for import and export. To import data from an Arrow file, we use the Arrow format:

INSERT INTO sometable
FROM INFILE 'data.arrow'
FORMAT Arrow

Exporting to Arrow file works the same way:

SELECT * FROM sometable
INTO OUTFILE 'export.arrow'
FORMAT Arrow

Also, check data types matching to know if any should be converted manually.

Arrow data streaming

The ArrowStream format can be used to work with Arrow streaming (used for in-memory processing). ClickHouse can read and write Arrow streams.

To demonstrate how ClickHouse can stream Arrow data, let's pipe it to the following python script (it reads input stream in Arrow streaming format and outputs the result as a Pandas table):

import sys, pyarrow as pa

with pa.ipc.open_stream(sys.stdin.buffer) as reader:
  print(reader.read_pandas())

Now we can stream data from ClickHouse by piping its output to the script:

clickhouse-client -q "SELECT path, hits FROM some_data LIMIT 3 FORMAT ArrowStream" | python3 arrow.py

                           path  hits
0       b'Akiba_Hebrew_Academy'   241
1           b'Aegithina_tiphia'    34
2  b'1971-72_Utah_Stars_season'     1

ClickHouse can read Arrow streams as well using the same ArrowStream format:

arrow-stream | clickhouse-client -q "INSERT INTO sometable FORMAT ArrowStream"

We've used arrow-stream as a possible source of Arrow streaming data.

Importing and exporting ORC data

Apache ORC format is a columnar storage format typically used for Hadoop. ClickHouse supports importing as well as exporting Orc data using ORC format:

SELECT *
FROM sometable
INTO OUTFILE 'data.orc'
FORMAT ORC;

INSERT INTO sometable
FROM INFILE 'data.orc'
FORMAT ORC;

Also, check data types matching as well as additional settings to tune export and import.

Importing and exporting in Avro format​

Avro and ClickHouse data types​

Avro messages in Kafka​

Working with Arrow format​

Arrow data streaming​

Importing and exporting ORC data​

Further reading​