To analyze log files with SQL, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.
Install it with clickhousectl:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
Then read raw log lines with the LineAsString format:
clickhouse local -q "SELECT * FROM file('access.log', LineAsString) LIMIT 5"
198.51.100.248 - - [01/Jun/2026:00:00:00 +0000] "POST /login HTTP/1.1" 200 31481 "-" "Mozilla/5.0 (compatible)" 0.25
198.51.100.223 - - [01/Jun/2026:00:00:00 +0000] "POST /api/products HTTP/1.1" 200 11489 "-" "Mozilla/5.0 (compatible)" 1.82
198.51.100.227 - - [01/Jun/2026:00:00:00 +0000] "PUT /api/users HTTP/1.1" 200 6366 "-" "Mozilla/5.0 (compatible)" 0.906
198.51.100.154 - - [01/Jun/2026:00:00:00 +0000] "DELETE /api/orders HTTP/1.1" 304 29177 "-" "Mozilla/5.0 (compatible)" 0.76
198.51.100.170 - - [01/Jun/2026:00:00:00 +0000] "GET /images/logo.png HTTP/1.1" 200 21098 "-" "Mozilla/5.0 (compatible)" 0.485
LineAsString hands each log line to SQL as a single string column, so the file is read in place with no import or schema declaration. From there you parse the fields with a regex and run any aggregation the full ClickHouse SQL dialect supports.
So you can follow along without a production server, synthesize a realistic nginx combined access log with clickhouse local itself:
clickhouse local -q "
WITH
['GET','GET','GET','GET','POST','PUT','DELETE'] AS methods,
['/','/index.html','/api/users','/api/orders','/login','/static/app.js','/static/style.css','/images/logo.png','/api/products','/checkout'] AS paths,
[200,200,200,200,200,301,304,404,500,503] AS statuses
SELECT
'198.51.100.' || toString(cityHash64(number) % 254 + 1)
|| ' - - [' || formatDateTime(toDateTime('2026-06-01 00:00:00') + intDiv(number, 5), '%d/%b/%Y:%H:%i:%S +0000')
|| '] \"' || methods[(cityHash64(number, 1) % 7) + 1] || ' ' || paths[(cityHash64(number, 2) % 10) + 1] || ' HTTP/1.1\" '
|| toString(statuses[(cityHash64(number, 3) % 10) + 1]) || ' '
|| toString(cityHash64(number, 4) % 50000 + 200)
|| ' \"-\" \"Mozilla/5.0 (compatible)\" '
|| toString(round((cityHash64(number, 5) % 2000000) / 1000000.0, 3)) AS line
FROM numbers(50000)
INTO OUTFILE 'access.log' TRUNCATE FORMAT TSVRaw
"
This writes 50,000 nginx-format lines to access.log. The companion examples folder wraps this in a generate.sh that also produces rotated and multi-GB files.
A log line is just text until you split it into columns. Read each line as a string, then pull out the fields with extractGroups(): one regex, one capture group per field. Parse the nginx timestamp into a real DateTime with parseDateTime():
clickhouse local -q "
WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
SELECT
g[1] AS ip,
parseDateTime(g[2], '%d/%b/%Y:%H:%i:%S %z', 'UTC') AS ts,
g[3] AS method,
g[4] AS path,
toUInt16(g[5]) AS status,
toFloat64(g[7]) AS request_time
FROM file('access.log', LineAsString)
LIMIT 5
"
198.51.100.248 2026-06-01 00:00:00 POST /login 200 0.25
198.51.100.223 2026-06-01 00:00:00 POST /api/products 200 1.82
198.51.100.227 2026-06-01 00:00:00 PUT /api/users 200 0.906
198.51.100.154 2026-06-01 00:00:00 DELETE /api/orders 304 0.76
198.51.100.170 2026-06-01 00:00:00 GET /images/logo.png 200 0.485
extractGroups returns an array; g[1], g[2], … are the capture groups in order. parseDateTime takes an explicit format (%d/%b/%Y:%H:%i:%S %z), so the nginx timestamp, including its +0000 offset, becomes a proper DateTime you can filter, bucket, and sort on.
ClickHouse also has a dedicated Regexp input format that maps capture groups straight onto named columns. The extractGroups-on-LineAsString approach above keeps everything in one query and is the easiest to adapt to your own log shape: change the regex and the column list, and you are done.
For structured logs you do not even need the regex. JSON lines parse directly with SELECT * FROM file('app.log', JSONEachRow), and delimited logs work the same way you would run SQL on a CSV file.
With the fields extracted, aggregation is ordinary SQL. Status-code distribution in one query:
clickhouse local -q "
WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
SELECT toUInt16(g[5]) AS status, count() AS requests
FROM file('access.log', LineAsString)
GROUP BY status ORDER BY requests DESC
"
200 24837
301 5127
404 5098
503 5071
304 4985
500 4882
Averages hide the tail. Use quantile to get p95 and p99 of the request-time field directly:
clickhouse local -q "
WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
SELECT
round(quantile(0.95)(toFloat64(g[7])), 3) AS p95_s,
round(quantile(0.99)(toFloat64(g[7])), 3) AS p99_s,
round(max(toFloat64(g[7])), 3) AS max_s
FROM file('access.log', LineAsString)
"
Top URLs and noisiest IPs
Find the most-requested paths:
clickhouse local -q "
WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
SELECT g[4] AS path, count() AS hits
FROM file('access.log', LineAsString)
GROUP BY path ORDER BY hits DESC LIMIT 5
"
/login 5135
/images/logo.png 5076
/checkout 5052
/api/orders 5011
/api/users 5005
And the IPs responsible for the most 5xx errors, the kind of question you ask mid-incident:
clickhouse local -q "
WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
SELECT g[1] AS ip, count() AS errors
FROM file('access.log', LineAsString)
WHERE toUInt16(g[5]) >= 500
GROUP BY ip ORDER BY errors DESC LIMIT 5
"
198.51.100.67 66
198.51.100.193 56
198.51.100.4 55
198.51.100.242 55
198.51.100.254 54
Bucket the parsed timestamp with toStartOfMinute to plot traffic per minute (swap in toStartOfHour or toStartOfFiveMinutes as needed):
clickhouse local -q "
WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
SELECT
toStartOfMinute(parseDateTime(g[2], '%d/%b/%Y:%H:%i:%S %z', 'UTC')) AS minute,
count() AS requests
FROM file('access.log', LineAsString)
GROUP BY minute ORDER BY minute LIMIT 5
"
2026-06-01 00:00:00 300
2026-06-01 00:01:00 300
2026-06-01 00:02:00 300
2026-06-01 00:03:00 300
2026-06-01 00:04:00 300
Rotated and compressed logs
Real log directories are full of rotated, gzip-compressed files: access.log, access-1.log.gz, access-2.log.gz, and so on. A glob reads all of them in one query, and clickhouse local decompresses .gz transparently, with no gunzip step:
clickhouse local -q "
WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
SELECT toUInt16(g[5]) AS status, count() AS requests
FROM file('logs/*.log.gz', LineAsString)
GROUP BY status ORDER BY requests DESC
"
200 7422
404 1665
301 1617
304 1482
503 1455
500 1359
Use file('logs/*.log*', ...) to sweep plain and compressed files together. The glob works the same whether you point it at three rotated files or three hundred.
The same query scales from a sample file to a real archive without changing. To check, the examples folder can generate a 20-million-line, 2.4 GB log (./generate.sh --big). Parsing every line with the regex and computing status counts plus p95 latency over that file:
clickhouse local --time -q "
WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
SELECT toUInt16(g[5]) AS status, count() AS requests,
round(quantile(0.95)(toFloat64(g[7])), 3) AS p95
FROM file('big_access.log', LineAsString)
GROUP BY status ORDER BY requests DESC
"
That ran in ~1.80s (best of three, warm OS page cache) on an Apple M4 Pro laptop (14 cores, 24 GB RAM): a full regex parse of 20M lines plus a quantile aggregation, from a cold SQL prompt with no indexing or import.
ClickHouse's pitch for log work is breadth in one tool (raw text, regex, JSON, CSV, and transparent gzip across a directory glob) plus the fact that the exact same SQL runs unchanged on your laptop, on a server, and on ClickHouse Cloud when a one-off investigation turns into a permanent logging table. No rewrite when the logs outgrow the laptop.
Every command above, plus the data generator, lives in the clickhouse-local-logs examples folder:
git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-logs
./generate.sh && ./run.sh
Prefer Python? The same parse-and-aggregate pattern works in-process with chDB.