How to read a semicolon-separated file

To read a semicolon-separated file, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then set the field delimiter to ; and query:

1clickhouse local -q "
2SELECT * FROM file('orders.csv', CustomSeparatedWithNames)
3LIMIT 10
4SETTINGS format_custom_field_delimiter = ';', format_custom_escaping_rule = 'CSV'"

1┌─order_date─┬─order_id─┬─country─┬─product───┬─revenue─┬─quantity─┐
2 1. │ 2026-01-01 │        1 │ GB      │ widget    │       5 │        1 │
3 2. │ 2026-01-02 │        2 │ US      │ gadget    │    6.01 │        2 │
4 3. │ 2026-01-03 │        3 │ DE      │ gizmo     │    7.02 │        3 │
5 4. │ 2026-01-04 │        4 │ FR      │ doohickey │    8.03 │        4 │
6 5. │ 2026-01-05 │        5 │ IN      │ widget    │    9.04 │        5 │
7    └────────────┴──────────┴─────────┴───────────┴─────────┴──────────┘

CustomSeparatedWithNames reads the header row for column names and infers types from the data; the ; delimiter is just a setting. The file is read in place with no import step first. Semicolon files are the common European export from spreadsheets and ERP systems, where the comma is already the decimal separator. That detail has a sting in the tail; see the gotcha section below.

See the schema without declaring one

CustomSeparatedWithNames takes the column names from the header and infers each type from the data. DESCRIBE shows what it found:

1clickhouse local -q "
2DESCRIBE file('orders.csv', CustomSeparatedWithNames)
3SETTINGS format_custom_field_delimiter = ';', format_custom_escaping_rule = 'CSV'"

1┌─name───────┬─type──────────────┐
21. │ order_date │ Nullable(Date)    │
32. │ order_id   │ Nullable(Int64)   │
43. │ country    │ Nullable(String)  │
54. │ product    │ Nullable(String)  │
65. │ revenue    │ Nullable(Float64) │
76. │ quantity   │ Nullable(Int64)   │
8   └────────────┴───────────────────┘

The header gave the names; the data gave the types. The format_custom_escaping_rule = 'CSV' setting means values follow CSV quoting rules, so a field that itself contains a ; can be wrapped in quotes and still parse. If your file has no header, use the CustomSeparated format instead of CustomSeparatedWithNames and pass an explicit schema as the third argument to file().

Aggregate directly on the file

Because the file is a table, the full SQL surface works on it (WHERE, GROUP BY, JOIN, window functions) with no load step in between:

1clickhouse local -q "
2SELECT country, count() AS orders, round(sum(revenue), 2) AS revenue, round(avg(quantity), 2) AS avg_qty
3FROM file('orders.csv', CustomSeparatedWithNames)
4GROUP BY country ORDER BY revenue DESC
5SETTINGS format_custom_field_delimiter = ';', format_custom_escaping_rule = 'CSV'
6FORMAT PrettyCompact"

1┌─country─┬─orders─┬─revenue─┬─avg_qty─┐
21. │ US      │      4 │    60.4 │     3.5 │
32. │ GB      │      4 │   56.36 │     2.5 │
43. │ AU      │      3 │   48.33 │       2 │
54. │ IN      │      3 │    45.3 │    2.67 │
65. │ FR      │      3 │   42.27 │    3.33 │
76. │ DE      │      3 │   39.24 │       4 │
8   └─────────┴────────┴─────────┴─────────┘

The decimal-comma gotcha

This is the catch with European semicolon files, and it is exactly why the delimiter is ; rather than , in the first place: the comma is the decimal separator. A price written 1234,50 means one thousand two hundred thirty-four and a half. ClickHouse infers types per the C locale, where the decimal point is ., so it sees 1234,50 as text and types the whole column as String:

1clickhouse local -q "
2DESCRIBE file('orders_eu.csv', CustomSeparatedWithNames)
3SETTINGS format_custom_field_delimiter = ';', format_custom_escaping_rule = 'CSV'"

1┌─name───────┬─type─────────────┐
21. │ order_date │ Nullable(Date)   │
32. │ order_id   │ Nullable(Int64)  │
43. │ country    │ Nullable(String) │
54. │ product    │ Nullable(String) │
65. │ revenue    │ Nullable(String) │   <- decimal commas, read as text
76. │ quantity   │ Nullable(Int64)  │
8   └────────────┴──────────────────┘

The fix is one expression: read the column as text, swap the comma for a point, and cast to a number. replaceOne(revenue, ',', '.')::Float64 does it inline:

1clickhouse local -q "
2SELECT country, replaceOne(revenue, ',', '.')::Float64 AS revenue_eur
3FROM file('orders_eu.csv', CustomSeparatedWithNames)
4ORDER BY revenue_eur DESC
5SETTINGS format_custom_field_delimiter = ';', format_custom_escaping_rule = 'CSV'
6FORMAT PrettyCompact"

1┌─country─┬─revenue_eur─┐
21. │ DE      │      1234.5 │
32. │ FR      │        1000 │
43. │ DE      │       89.99 │
54. │ NL      │        7.02 │
65. │ FR      │        6.01 │
7   └─────────┴─────────────┘

Now revenue_eur is a real Float64 and aggregates correctly:

1clickhouse local -q "
2SELECT round(sum(replaceOne(revenue, ',', '.')::Float64), 2) AS total_eur
3FROM file('orders_eu.csv', CustomSeparatedWithNames)
4SETTINGS format_custom_field_delimiter = ';', format_custom_escaping_rule = 'CSV'"

12337.52

If thousands are also grouped with a separator (for example 1.234,50), strip the grouping dot first: replaceOne(replaceAll(revenue, '.', ''), ',', '.')::Float64. Do the conversion once at read time and the rest of your SQL stays clean.

The same SQL runs unchanged across dozens of file formats and remote sources, and against a ClickHouse server or ClickHouse Cloud when the data outgrows your machine.

Compressed files work too

You do not unzip anything first. clickhouse local detects a .gz, .zst, .xz or similar suffix from the file name and decompresses on the fly, so a orders.csv.gz semicolon file is read exactly like the plain one — same file() call, same ; settings.

How fast is it?

The import step you skipped was the slow part. On a 3,000,000-row, ~110 MB semicolon file (orders_large.csv), a full GROUP BY country with sum and avg over every row runs in:

1clickhouse local -q "
2SELECT country, count() AS orders, round(sum(revenue),2) AS revenue, round(avg(quantity),3) AS avg_qty
3FROM file('orders_large.csv', CustomSeparatedWithNames)
4GROUP BY country ORDER BY revenue DESC
5SETTINGS format_custom_field_delimiter = ';', format_custom_escaping_rule = 'CSV'"

~0.59 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That includes parsing the delimited text from scratch on every run; there is no cached table. Convert the same data to Parquet once and repeat scans get faster again, because Parquet is columnar and typed.

The same SQL scales unchanged

The query you just ran on a laptop file is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. Nothing about SELECT ... WHERE ... GROUP BY changes. You swap file('orders.csv', ...) for a table name and the rest stays put. Prototype against a file on your machine, ship the identical logic to production.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo. It has generate.sh (builds the demo file, the European decimal-comma file, and the ~110 MB perf file), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-semicolon

1git clone https://github.com/ClickHouse/examples
2cd examples/local-analytics/clickhouse-local-semicolon
3./generate.sh && ./run.sh

See the schema without declaring one

Aggregate directly on the file

The decimal-comma gotcha

Compressed files work too

How fast is it?

The same SQL scales unchanged

Run it yourself

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

How to read a semicolon-separated file

See the schema without declaring one #

Aggregate directly on the file #

The decimal-comma gotcha #

Compressed files work too #

How fast is it? #

The same SQL scales unchanged #

Run it yourself #

Related #

Subscribe to our newsletter

More like this

How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

How to query a REST API in Python

How to convert Parquet to ORC

How to query a REST API with SQL

See the schema without declaring one

Aggregate directly on the file

The decimal-comma gotcha

Compressed files work too

How fast is it?

The same SQL scales unchanged

Run it yourself

Related