To parse a log file with regex in SQL, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.
Install it with clickhousectl:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
Then use the Regexp format and supply the pattern and column schema:
clickhouse local -q "
SELECT * FROM file('access.log', Regexp,
'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
LIMIT 5
SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
format_regexp_escaping_rule = 'Raw'
FORMAT PrettyCompact"
┌─ip───────────┬─ts─────────────────────────┬─method─┬─path───────────┬─status─┬─size─┬────rt─┐
1. │ 203.0.113.7 │ 07/Jun/2026:08:44:18 +0000 │ GET │ /api/login │ 500 │ 1253 │ 1.329 │
2. │ 192.168.1.10 │ 06/Jun/2026:23:43:01 +0000 │ GET │ /static/app.js │ 301 │ 724 │ 0.2 │
3. │ 192.168.1.10 │ 06/Jun/2026:10:05:01 +0000 │ PUT │ /cart │ 404 │ 3870 │ 0.324 │
4. │ 203.0.113.7 │ 06/Jun/2026:16:16:21 +0000 │ GET │ /api/orders │ 200 │ 4855 │ 0.729 │
5. │ 198.51.100.2 │ 06/Jun/2026:21:35:01 +0000 │ DELETE │ /static/app.js │ 200 │ 2928 │ 0.245 │
└──────────────┴────────────────────────────┴────────┴────────────────┴────────┴──────┴───────┘
Each capture group in the regex maps to one column in the declared schema. The log is read in place with no import step, and from there it's ordinary SQL against typed columns.
A web server access log is just text. Each line packs an IP, a timestamp, a request, a status code, a byte count and a response time into one string:
203.0.113.7 - - [07/Jun/2026:08:44:18 +0000] "GET /api/login HTTP/1.1" 500 1253 1.329
192.168.1.10 - - [06/Jun/2026:23:43:01 +0000] "GET /static/app.js HTTP/1.1" 301 724 0.2
192.168.1.10 - - [06/Jun/2026:10:05:01 +0000] "PUT /cart HTTP/1.1" 404 3870 0.324
The usual approach is a chain of grep, awk and cut, rebuilt for every question. A regex with capture groups describes the line once, and then SQL answers any question against it.
One capture group per column
The Regexp format is the gotcha that makes this work, and it behaves differently from CSV or Parquet. Those formats infer a schema; Regexp cannot. You must supply both the structure (the third argument to file()) and the pattern (format_regexp), and the order of the capture groups in the pattern must line up with the order of the columns in the structure.
clickhouse local -q "
SELECT * FROM file('access.log', Regexp,
'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
LIMIT 5
SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
format_regexp_escaping_rule = 'Raw'
FORMAT PrettyCompact"
┌─ip───────────┬─ts─────────────────────────┬─method─┬─path───────────┬─status─┬─size─┬────rt─┐
1. │ 203.0.113.7 │ 07/Jun/2026:08:44:18 +0000 │ GET │ /api/login │ 500 │ 1253 │ 1.329 │
2. │ 192.168.1.10 │ 06/Jun/2026:23:43:01 +0000 │ GET │ /static/app.js │ 301 │ 724 │ 0.2 │
3. │ 192.168.1.10 │ 06/Jun/2026:10:05:01 +0000 │ PUT │ /cart │ 404 │ 3870 │ 0.324 │
4. │ 203.0.113.7 │ 06/Jun/2026:16:16:21 +0000 │ GET │ /api/orders │ 200 │ 4855 │ 0.729 │
5. │ 198.51.100.2 │ 06/Jun/2026:21:35:01 +0000 │ DELETE │ /static/app.js │ 200 │ 2928 │ 0.245 │
└──────────────┴────────────────────────────┴────────┴────────────────┴────────┴──────┴───────┘
Two settings carry the work:
format_regexp is the pattern. Seven capture groups, seven columns. The non-capturing parts (- -, the brackets, HTTP/1.1) match and are discarded.
format_regexp_escaping_rule = 'Raw' reads each captured field as a literal string with no unescaping, then casts it to the declared column type. status and size arrive as text and land as UInt16/UInt32; rt becomes Float64. (The other escaping rules, Escaped, Quoted, CSV and JSON, exist for fields that carry their own quoting; Raw is right for plain log text.)
A row that doesn't match the pattern raises an error by default, which is usually what you want: it tells you the regex is wrong or the log has a line you didn't account for.
ts came back as a string because the log writes time in the Apache/NGINX style (07/Jun/2026:08:44:18 +0000). Convert it once with parseDateTime and the MySQL-style format codes, then sort and filter on it like any other timestamp:
clickhouse local -q "
SELECT
ip,
parseDateTime(substring(ts, 1, 20), '%d/%b/%Y:%H:%i:%s') AS event_time,
method, path, status
FROM file('access.log', Regexp,
'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
ORDER BY event_time
LIMIT 5
SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
format_regexp_escaping_rule = 'Raw'
FORMAT PrettyCompact"
┌─ip───────────┬──────────event_time─┬─method─┬─path───────────┬─status─┐
1. │ 192.168.1.10 │ 2026-06-06 10:05:01 │ PUT │ /cart │ 404 │
2. │ 10.0.0.5 │ 2026-06-06 12:33:01 │ POST │ /api/login │ 301 │
3. │ 203.0.113.7 │ 2026-06-06 16:16:21 │ GET │ /api/orders │ 200 │
4. │ 172.16.0.9 │ 2026-06-06 16:34:10 │ POST │ /cart │ 404 │
5. │ 192.168.1.10 │ 2026-06-06 16:40:47 │ DELETE │ /static/app.js │ 401 │
└──────────────┴─────────────────────┴────────┴────────────────┴────────┘
substring(ts, 1, 20) drops the +0000 zone suffix so the format string matches cleanly. Now event_time is a true DateTime you can bucket by hour, range-filter, or join on.
Once the line is typed columns, the full SQL surface applies — WHERE, GROUP BY, conditional aggregates, window functions. Here's error rate and average response time per path, the kind of answer a grep pipeline makes painful:
clickhouse local -q "
SELECT
path,
count() AS hits,
countIf(status >= 400) AS errors,
round(100.0 * countIf(status >= 400) / count(), 1) AS error_pct,
round(avg(rt), 3) AS avg_rt_s
FROM file('access.log', Regexp,
'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
GROUP BY path
ORDER BY hits DESC
SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
format_regexp_escaping_rule = 'Raw'
FORMAT PrettyCompact"
┌─path───────────┬─hits─┬─errors─┬─error_pct─┬─avg_rt_s─┐
1. │ /static/app.js │ 5 │ 2 │ 40 │ 0.726 │
2. │ /api/orders │ 4 │ 3 │ 75 │ 0.97 │
3. │ /cart │ 3 │ 2 │ 66.7 │ 1.11 │
4. │ /health │ 3 │ 1 │ 33.3 │ 1.006 │
5. │ /api/login │ 3 │ 1 │ 33.3 │ 0.793 │
6. │ /index.html │ 2 │ 2 │ 100 │ 1.392 │
└────────────────┴──────┴────────┴───────────┴──────────┘
countIf(status >= 400) counts errors inline, no subquery needed. The Regexp format earns its place when the line isn't cleanly delimited (quoted requests, bracketed timestamps, optional fields) because a single pattern handles the irregularity that a column splitter can't.
Rotated logs are usually compressed. You don't unzip anything first: clickhouse local detects the .gz suffix and decompresses on the fly, so a .log.gz is queried exactly like a .log. The same applies to .zst, .xz and other codecs.
clickhouse local -q "
SELECT status, count() AS n
FROM file('access.log.gz', Regexp,
'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
GROUP BY status
ORDER BY status
SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
format_regexp_escaping_rule = 'Raw'
FORMAT PrettyCompact"
┌─status─┬─n─┐
1. │ 200 │ 5 │
2. │ 301 │ 4 │
3. │ 401 │ 3 │
4. │ 404 │ 4 │
5. │ 500 │ 4 │
└────────┴───┘
To scan a whole directory of rotated logs in one query, pass a glob to file(), for example file('logs/access.*.log.gz', Regexp, ...), and the same pattern applies across every file.
Regex matching is the cost here, and it still runs at log-ingest speed. On a 2,000,000-line, ~163 MB access log (access_large.log), parsing every line with the regex and aggregating by status code completes in:
clickhouse local --time -q "
SELECT status, count() AS n, round(avg(rt),3) AS avg_rt
FROM file('access_large.log', Regexp,
'ip String, ts String, method String, path String, status UInt16, size UInt32, rt Float64')
GROUP BY status ORDER BY status
SETTINGS format_regexp = '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]+\" (\d+) (\d+) (\S+)',
format_regexp_escaping_rule = 'Raw'
FORMAT Null"
~0.46 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). That includes running the regex against all 2 million lines from scratch on every run; there's no cached table. If you'll query the same logs repeatedly, parse once and write the typed rows to Parquet (INTO OUTFILE 'access.parquet' FORMAT Parquet), then read that.
The query you just ran on a laptop file is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. You swap file('access.log', Regexp, ...) for a table name, and WHERE ... GROUP BY ... countIf(...) stays put. Prototype the parse against one log on your machine, then ship the identical logic to production over the full log stream.
The complete, runnable example lives here. It has generate.sh (builds the demo log, a gzipped copy, and the ~163 MB perf file), run.sh (every command above), and expected_output.txt:
github.com/ClickHouse/examples → local-analytics/clickhouse-local-regexp
git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-regexp
./generate.sh && ./run.sh