How to parse a log file with regex in SQL

Al Brown
Last updated: Jun 6, 2026

To parse a log file with regex in SQL, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then use the Regexp format and supply the pattern and column schema:

1clickhouse local -q "
2SELECT * FROM file('access.log', Regexp,
3  'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
4LIMIT 5
5SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
6         format_regexp_escaping_rule = 'Raw'
7FORMAT PrettyCompact"
1   ┌─ip───────────┬─ts─────────────────────────┬─method─┬─path───────────┬─status─┬─size─┬────rt─┐
21.203.0.113.707/Jun/2026:08:44:18 +0000GET/api/login     │    50012531.32932.192.168.1.1006/Jun/2026:23:43:01 +0000GET/static/app.js │    3017240.243.192.168.1.1006/Jun/2026:10:05:01 +0000 │ PUT    │ /cart          │    40438700.32454.203.0.113.706/Jun/2026:16:16:21 +0000GET/api/orders    │    20048550.72965.198.51.100.206/Jun/2026:21:35:01 +0000DELETE/static/app.js │    20029280.2457   └──────────────┴────────────────────────────┴────────┴────────────────┴────────┴──────┴───────┘

Each capture group in the regex maps to one column in the declared schema. The log is read in place with no import step, and from there it's ordinary SQL against typed columns.

Start with the raw line #

A web server access log is just text. Each line packs an IP, a timestamp, a request, a status code, a byte count and a response time into one string:

1203.0.113.7 - - [07/Jun/2026:08:44:18 +0000] "GET /api/login HTTP/1.1" 500 1253 1.329
2192.168.1.10 - - [06/Jun/2026:23:43:01 +0000] "GET /static/app.js HTTP/1.1" 301 724 0.2
3192.168.1.10 - - [06/Jun/2026:10:05:01 +0000] "PUT /cart HTTP/1.1" 404 3870 0.324

The usual approach is a chain of grep, awk and cut, rebuilt for every question. A regex with capture groups describes the line once, and then SQL answers any question against it.

One capture group per column #

The Regexp format is the gotcha that makes this work, and it behaves differently from CSV or Parquet. Those formats infer a schema; Regexp cannot. You must supply both the structure (the third argument to file()) and the pattern (format_regexp), and the order of the capture groups in the pattern must line up with the order of the columns in the structure.

1clickhouse local -q "
2SELECT * FROM file('access.log', Regexp,
3  'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
4LIMIT 5
5SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
6         format_regexp_escaping_rule = 'Raw'
7FORMAT PrettyCompact"
1   ┌─ip───────────┬─ts─────────────────────────┬─method─┬─path───────────┬─status─┬─size─┬────rt─┐
21.203.0.113.707/Jun/2026:08:44:18 +0000GET/api/login     │    50012531.32932.192.168.1.1006/Jun/2026:23:43:01 +0000GET/static/app.js │    3017240.243.192.168.1.1006/Jun/2026:10:05:01 +0000 │ PUT    │ /cart          │    40438700.32454.203.0.113.706/Jun/2026:16:16:21 +0000GET/api/orders    │    20048550.72965.198.51.100.206/Jun/2026:21:35:01 +0000DELETE/static/app.js │    20029280.2457   └──────────────┴────────────────────────────┴────────┴────────────────┴────────┴──────┴───────┘

Two settings carry the work:

  • format_regexp is the pattern. Seven capture groups, seven columns. The non-capturing parts (- -, the brackets, HTTP/1.1) match and are discarded.
  • format_regexp_escaping_rule = 'Raw' reads each captured field as a literal string with no unescaping, then casts it to the declared column type. status and size arrive as text and land as UInt16/UInt32; rt becomes Float64. (The other escaping rules, Escaped, Quoted, CSV and JSON, exist for fields that carry their own quoting; Raw is right for plain log text.)

A row that doesn't match the pattern raises an error by default, which is usually what you want: it tells you the regex is wrong or the log has a line you didn't account for.

Make the timestamp a real DateTime #

ts came back as a string because the log writes time in the Apache/NGINX style (07/Jun/2026:08:44:18 +0000). Convert it once with parseDateTime and the MySQL-style format codes, then sort and filter on it like any other timestamp:

1clickhouse local -q "
2SELECT
3  ip,
4  parseDateTime(substring(ts, 1, 20), '%d/%b/%Y:%H:%i:%s') AS event_time,
5  method, path, status
6FROM file('access.log', Regexp,
7  'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
8ORDER BY event_time
9LIMIT 5
10SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
11         format_regexp_escaping_rule = 'Raw'
12FORMAT PrettyCompact"
1   ┌─ip───────────┬──────────event_time─┬─method─┬─path───────────┬─status─┐
21.192.168.1.102026-06-06 10:05:01 │ PUT    │ /cart          │    40432.10.0.0.52026-06-06 12:33:01 │ POST   │ /api/login     │    30143.203.0.113.72026-06-06 16:16:21GET/api/orders    │    20054.172.16.0.92026-06-06 16:34:10 │ POST   │ /cart          │    40465.192.168.1.102026-06-06 16:40:47DELETE/static/app.js │    4017   └──────────────┴─────────────────────┴────────┴────────────────┴────────┘

substring(ts, 1, 20) drops the +0000 zone suffix so the format string matches cleanly. Now event_time is a true DateTime you can bucket by hour, range-filter, or join on.

Ask the log a real question #

Once the line is typed columns, the full SQL surface applies — WHERE, GROUP BY, conditional aggregates, window functions. Here's error rate and average response time per path, the kind of answer a grep pipeline makes painful:

1clickhouse local -q "
2SELECT
3  path,
4  count() AS hits,
5  countIf(status >= 400) AS errors,
6  round(100.0 * countIf(status >= 400) / count(), 1) AS error_pct,
7  round(avg(rt), 3) AS avg_rt_s
8FROM file('access.log', Regexp,
9  'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
10GROUP BY path
11ORDER BY hits DESC
12SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
13         format_regexp_escaping_rule = 'Raw'
14FORMAT PrettyCompact"
1   ┌─path───────────┬─hits─┬─errors─┬─error_pct─┬─avg_rt_s─┐
21./static/app.js │    52400.72632./api/orders    │    43750.9743./cart          │    3266.71.1154./health        │    3133.31.00665./api/login     │    3133.30.79376./index.html    │    221001.3928   └────────────────┴──────┴────────┴───────────┴──────────┘

countIf(status >= 400) counts errors inline, no subquery needed. The Regexp format earns its place when the line isn't cleanly delimited (quoted requests, bracketed timestamps, optional fields) because a single pattern handles the irregularity that a column splitter can't.

Gzipped logs work too #

Rotated logs are usually compressed. You don't unzip anything first: clickhouse local detects the .gz suffix and decompresses on the fly, so a .log.gz is queried exactly like a .log. The same applies to .zst, .xz and other codecs.

1clickhouse local -q "
2SELECT status, count() AS n
3FROM file('access.log.gz', Regexp,
4  'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
5GROUP BY status
6ORDER BY status
7SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
8         format_regexp_escaping_rule = 'Raw'
9FORMAT PrettyCompact"
1   ┌─status─┬─n─┐
21. │    200 │ 5 │
32. │    301 │ 4 │
43. │    401 │ 3 │
54. │    404 │ 4 │
65. │    500 │ 4 │
7   └────────┴───┘

To scan a whole directory of rotated logs in one query, pass a glob to file(), for example file('logs/access.*.log.gz', Regexp, ...), and the same pattern applies across every file.

How fast is it? #

Regex matching is the cost here, and it still runs at log-ingest speed. On a 2,000,000-line, ~163 MB access log (access_large.log), parsing every line with the regex and aggregating by status code completes in:

1clickhouse local --time -q "
2SELECT status, count() AS n, round(avg(rt),3) AS avg_rt
3FROM file('access_large.log', Regexp,
4  'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
5GROUP BY status ORDER BY status
6SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
7         format_regexp_escaping_rule = 'Raw'
8FORMAT Null"

~0.46 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That includes running the regex against all 2 million lines from scratch on every run; there's no cached table. If you'll query the same logs repeatedly, parse once and write the typed rows to Parquet (INTO OUTFILE 'access.parquet' FORMAT Parquet), then read that.

The same SQL scales unchanged #

The query you just ran on a laptop file is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. You swap file('access.log', Regexp, ...) for a table name, and WHERE ... GROUP BY ... countIf(...) stays put. Prototype the parse against one log on your machine, then ship the identical logic to production over the full log stream.

Run it yourself #

The complete, runnable example lives here. It has generate.sh (builds the demo log, a gzipped copy, and the ~163 MB perf file), run.sh (every command above), and expected_output.txt:

github.com/ClickHouse/examples → local-analytics/clickhouse-local-regexp

1git clone https://github.com/ClickHouse/examples
2cd examples/local-analytics/clickhouse-local-regexp
3./generate.sh && ./run.sh
Share this resource

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...