How to convert CSV to ORC

Al Brown
Last updated: Jun 8, 2026

To convert a CSV file to ORC, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then point it at the CSV and write the result into ORC:

1clickhouse local -q "SELECT * FROM file('events.csv') INTO OUTFILE 'events.orc' TRUNCATE FORMAT ORC"
12026-01-01	1	GB	click	5	1
22026-01-02	2	US	view	6.01	2
32026-01-03	3	DE	signup	7.02	3
42026-01-04	4	FR	purchase	8.03	4
52026-01-05	5	IN	click	9.04	5

ClickHouse reads the CSV header for column names, infers types from the data, and streams rows into a typed columnar ORC file. The source file is read in place with no import step, and because it streams, it handles files larger than RAM.

What you get #

The result is a real columnar ORC file with the column names from the header and the types ClickHouse inferred from the data. Check both ends with DESCRIBE:

1clickhouse local -q "DESCRIBE file('events.csv')"   # source
2clickhouse local -q "DESCRIBE file('events.orc')"   # result
1-- CSV (source):
2event_date	Nullable(Date)
3event_id	Nullable(Int64)
4country	Nullable(String)
5action	Nullable(String)
6amount	Nullable(Float64)
7quantity	Nullable(Int64)
8-- ORC (result):
9event_date	Nullable(Date32)
10event_id	Nullable(Int64)
11country	Nullable(String)
12action	Nullable(String)
13amount	Nullable(Float64)
14quantity	Nullable(Int64)

Read the ORC straight back, or aggregate on it without a load step:

1clickhouse local -q "SELECT * FROM file('events.orc') LIMIT 5"
12026-01-01	1	GB	click	5	1
22026-01-02	2	US	view	6.01	2
32026-01-03	3	DE	signup	7.02	3
42026-01-04	4	FR	purchase	8.03	4
52026-01-05	5	IN	click	9.04	5

Types carry over, with ORC's encoding #

This is the information you don't get from a drag-and-drop converter site: the type mapping is explicit and it round-trips. ORC is a typed columnar format, so the inferred CSV types are stored as real ORC types rather than as text. Two mappings are worth knowing up front, because they hold even when you pin the types yourself:

  • ORC has no 32-bit Date type, so a ClickHouse Date is written as Date32. The values are identical; only the on-disk width differs.
  • ORC integers are signed. An unsigned ClickHouse type such as UInt32 is stored as the matching signed Int32.

You can see this by pinning the schema on the way in. Even when the source columns are declared Date, UInt32, and UInt8, the ORC result reports Date32, Int32, and Int8:

1clickhouse local -q "
2SELECT * FROM file('events.csv', 'CSVWithNames',
3  'event_date Date, event_id UInt32, country String, action String, amount Float64, quantity UInt8')
4INTO OUTFILE 'events_typed.orc' TRUNCATE FORMAT ORC"
5clickhouse local -q "DESCRIBE file('events_typed.orc')"
1event_date	Nullable(Date32)
2event_id	Nullable(Int32)
3country	Nullable(String)
4action	Nullable(String)
5amount	Nullable(Float64)
6quantity	Nullable(Int8)

Pinning the schema is still worth doing when type inference guesses wrong on the CSV — for example when an ID or a ZIP code should stay a string rather than become an integer. Pass the format and an explicit structure as the second and third arguments to file().

Options worth setting #

ORC compresses each column block on write. ClickHouse defaults to LZ4; you can switch the codec with one setting:

1clickhouse local -q "
2SELECT * FROM file('events_large.csv')
3INTO OUTFILE 'events_large.orc' TRUNCATE FORMAT ORC
4SETTINGS output_format_orc_compression_method='zstd'"

On the 3,000,000-row sample CSV, the codec choice maps to file size like this:

1events_large.csv	123.28 MiB
2events_large_none.orc	54.54 MiB
3events_large.orc (lz4 default)	19.50 MiB
4events_large_zstd.orc	19.50 MiB

Even uncompressed, ORC is less than half the size of the CSV because it stores typed columns instead of text. With LZ4 or zstd it lands near 20 MiB, roughly a 6x reduction over the source CSV, and zstd will usually edge ahead of LZ4 on larger, less repetitive data at a small write-time cost. none is useful when another layer already compresses the bytes. Other useful knobs: output_format_orc_row_index_stride controls the row-group stride that lets readers skip blocks, and output_format_orc_string_as_string keeps ORC string columns as string rather than binary.

Convert it in Python instead #

If you work in a notebook, chDB is ClickHouse as an in-process Python module. The conversion is the same SQL, no server and no separate binary:

1import chdb
2
3chdb.query("SELECT * FROM file('events.csv') INTO OUTFILE 'events.orc' TRUNCATE FORMAT ORC")
4print(chdb.query("SELECT count() FROM file('events.orc')", "CSV"), end="")
120

The INTO OUTFILE ... FORMAT ORC clause writes the file directly from the engine, so you skip the pandas-then-to_orc round trip and the conversion handles inputs larger than memory. To read ORC back into a DataFrame, see how to read an ORC file in Python.

How fast is it? #

Converting events_large.csv (3,000,000 rows, ~123 MiB) to ORC, including parsing the CSV text from scratch every run:

1clickhouse local -q "SELECT * FROM file('events_large.csv') INTO OUTFILE 'events_large.orc' TRUNCATE FORMAT ORC"

~0.50 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). The number can wobble slightly under concurrent load. That is a one-time cost: you convert once, then every later query reads the typed columnar ORC instead of re-parsing CSV text.

Why convert at all #

CSV is convenient to produce but slow to scan repeatedly, because every query re-parses the text and re-infers the types. ORC stores the data column by column, already typed and compressed, so analytical scans read less and skip what they don't need. Parquet is the other common columnar choice and is often a better default for cross-tool interchange; see convert CSV to Parquet and Parquet vs CSV. Reach for ORC when your stack expects it, for example a Hive or Spark pipeline standardized on ORC.

Going the other way? See how to convert ORC to CSV.

The same SQL that converts a file on your laptop runs unchanged against a ClickHouse server and ClickHouse Cloud when the data outgrows one machine — no rewrite.

Run it yourself #

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample CSVs (including the ~123 MiB file used for the timing above), run.sh with every command on this page, a chDB run.py / run.ipynb, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-csv-to-orc

Related: run SQL on a CSV file, read an ORC file with SQL, and what is the ORC file format.

Share this resource

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...