To read a file with a custom delimiter, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.
Install it with clickhousectl:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
Then use the CustomSeparated format and tell it the separator:
clickhouse local -q "
SELECT * FROM file('orders.txt', CustomSeparatedWithNames) LIMIT 5
SETTINGS format_custom_field_delimiter='|~|', format_custom_escaping_rule='CSV'"
1 GB widget 5 1
2 US gadget 6.01 2
3 DE gizmo 7.02 3
4 FR doohickey 8.03 4
5 IN widget 9.04 5
The file above separates fields with |~|, a multi-character sequence that no CSV or TSV reader would split on correctly. CustomSeparated handles it because you supply the delimiter yourself, and the file is read in place with no import step first.
CustomSeparated is configured by settings, not by a fixed dialect. The two you reach for most:
format_custom_field_delimiter sets what separates the columns. Default is \t. Set it to |~|, ::, ;, or any string.
format_custom_escaping_rule sets how a field's text is interpreted. CSV is the sane default: it honours quoting, so a value that itself contains the delimiter still parses when wrapped in quotes.
If the field delimiter is a single tab, you have a TSV, so use TSVWithNames. If it's a single comma, that's a CSV. CustomSeparated is for everything else.
Use CustomSeparatedWithNames when the first line holds column names. With names present, ClickHouse infers the column types from the data, so you don't declare a schema.
DESCRIBE prints the column names (from the header) and the types ClickHouse inferred (from the data), with no CREATE TABLE:
clickhouse local -q "
DESCRIBE file('orders.txt', CustomSeparatedWithNames)
SETTINGS format_custom_field_delimiter='|~|', format_custom_escaping_rule='CSV'"
order_id Nullable(Int64)
country Nullable(String)
product Nullable(String)
revenue Nullable(Float64)
quantity Nullable(Int64)
The header gave the names; the values gave the types. Columns come back Nullable because inference allows for empty fields. If inference guesses wrong, say an order ID that should stay a string, pass the format and an explicit schema as the second and third arguments to file(), exactly as you would for a CSV.
Filter and group by, directly on the file
Once the delimiter is set, the file is just a SQL source. The full ClickHouse dialect applies (WHERE, GROUP BY, aggregates, joins, window functions) with no load step in between:
clickhouse local -q "
SELECT country, count() AS orders, round(sum(revenue), 2) AS revenue
FROM file('orders.txt', CustomSeparatedWithNames)
GROUP BY country
ORDER BY revenue DESC
SETTINGS format_custom_field_delimiter='|~|', format_custom_escaping_rule='CSV'"
AU 2 26.16
IN 2 24.14
FR 2 22.12
DE 2 20.1
US 2 18.08
GB 2 16.06
Some exports don't end rows with a bare newline. They append a marker such as ;; or a record separator byte. CustomSeparated covers that with format_custom_row_after_delimiter. Here a file uses :: between fields and ;;\n after each row, with no header, so the schema is given explicitly:
clickhouse local -q "
SELECT * FROM file('orders_pipe.txt', CustomSeparated,
'order_id UInt32, country String, revenue Float64')
LIMIT 5
SETTINGS format_custom_field_delimiter=' :: ',
format_custom_row_after_delimiter=' ;;\n',
format_custom_escaping_rule='CSV'"
1 GB 5
2 US 6.01
3 DE 7.02
4 FR 8.03
5 IN 9.04
There are matching settings for the rest of the layout when you need them: format_custom_row_before_delimiter, format_custom_row_between_delimiter, and format_custom_result_before_delimiter / format_custom_result_after_delimiter for a fixed header or footer wrapping the whole result. The common pipe and semicolon cases have shorthand guides: pipe-delimited and semicolon-separated files.
You don't unzip anything first. clickhouse local detects the .gz extension and decompresses on the fly, so a custom-delimited .txt.gz is queried exactly like the plain file, with the same delimiter settings:
clickhouse local -q "
SELECT count() FROM file('orders.txt.gz', CustomSeparatedWithNames)
SETTINGS format_custom_field_delimiter='|~|', format_custom_escaping_rule='CSV'"
The same applies to .zst, .xz, and other supported codecs. The compression is inferred from the file name.
The delimiter being unusual costs nothing at scale. On a 3,000,000-row, ~112 MB file (orders_large.txt, |~| delimited), a full GROUP BY country with sum over every row completes in:
clickhouse local -q "
SELECT country, count() AS orders, round(sum(revenue), 2) AS revenue
FROM file('orders_large.txt', CustomSeparatedWithNames)
GROUP BY country ORDER BY revenue DESC
SETTINGS format_custom_field_delimiter='|~|', format_custom_escaping_rule='CSV'"
~0.55 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That parses the delimited text from scratch on every run across all cores; there is no cached table. A column-aware split on a multi-character separator, done with a hand-written script, would be slower and more error-prone on quoted fields.
US 300294 75094223.6
FR 300166 75079479.91
JP 300169 75064502.48
CA 300164 75009787.27
GB 300110 75002592.83
One binary reads this custom format alongside CSV, JSON, Parquet, ORC and many others, and the same SQL runs unchanged against a ClickHouse server or ClickHouse Cloud when the data outgrows your machine. You prototype against the file and ship the identical query.
The complete, runnable example lives here. It has generate.sh (creates the |~| file, a gzipped copy, a custom field-and-row-delimiter file, and the ~112 MB perf file), run.sh (every command above), and expected_output.txt:
github.com/ClickHouse/examples → local-analytics/clickhouse-local-custom-delimiter
git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-custom-delimiter
./generate.sh && ./run.sh