How to analyze log files with SQL

Al Brown
Last updated: Jun 8, 2026

To analyze log files with SQL, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

1curl https://clickhouse.com/cli | sh   # install clickhousectl
2clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then read raw log lines with the LineAsString format:

1clickhouse local -q "SELECT * FROM file('access.log', LineAsString) LIMIT 5"
1198.51.100.248 - - [01/Jun/2026:00:00:00 +0000] "POST /login HTTP/1.1" 200 31481 "-" "Mozilla/5.0 (compatible)" 0.25
2198.51.100.223 - - [01/Jun/2026:00:00:00 +0000] "POST /api/products HTTP/1.1" 200 11489 "-" "Mozilla/5.0 (compatible)" 1.82
3198.51.100.227 - - [01/Jun/2026:00:00:00 +0000] "PUT /api/users HTTP/1.1" 200 6366 "-" "Mozilla/5.0 (compatible)" 0.906
4198.51.100.154 - - [01/Jun/2026:00:00:00 +0000] "DELETE /api/orders HTTP/1.1" 304 29177 "-" "Mozilla/5.0 (compatible)" 0.76
5198.51.100.170 - - [01/Jun/2026:00:00:00 +0000] "GET /images/logo.png HTTP/1.1" 200 21098 "-" "Mozilla/5.0 (compatible)" 0.485

LineAsString hands each log line to SQL as a single string column, so the file is read in place with no import or schema declaration. From there you parse the fields with a regex and run any aggregation the full ClickHouse SQL dialect supports.

Generate a sample log #

So you can follow along without a production server, synthesize a realistic nginx combined access log with clickhouse local itself:

1clickhouse local -q "
2WITH
3  ['GET','GET','GET','GET','POST','PUT','DELETE'] AS methods,
4  ['/','/index.html','/api/users','/api/orders','/login','/static/app.js','/static/style.css','/images/logo.png','/api/products','/checkout'] AS paths,
5  [200,200,200,200,200,301,304,404,500,503] AS statuses
6SELECT
7  '198.51.100.' || toString(cityHash64(number) % 254 + 1)
8  || ' - - [' || formatDateTime(toDateTime('2026-06-01 00:00:00') + intDiv(number, 5), '%d/%b/%Y:%H:%i:%S +0000')
9  || '] \"' || methods[(cityHash64(number, 1) % 7) + 1] || ' ' || paths[(cityHash64(number, 2) % 10) + 1] || ' HTTP/1.1\" '
10  || toString(statuses[(cityHash64(number, 3) % 10) + 1]) || ' '
11  || toString(cityHash64(number, 4) % 50000 + 200)
12  || ' \"-\" \"Mozilla/5.0 (compatible)\" '
13  || toString(round((cityHash64(number, 5) % 2000000) / 1000000.0, 3)) AS line
14FROM numbers(50000)
15INTO OUTFILE 'access.log' TRUNCATE FORMAT TSVRaw
16"

This writes 50,000 nginx-format lines to access.log. The companion examples folder wraps this in a generate.sh that also produces rotated and multi-GB files.

Extract the fields #

A log line is just text until you split it into columns. Read each line as a string, then pull out the fields with extractGroups(): one regex, one capture group per field. Parse the nginx timestamp into a real DateTime with parseDateTime():

1clickhouse local -q "
2WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
3SELECT
4  g[1] AS ip,
5  parseDateTime(g[2], '%d/%b/%Y:%H:%i:%S %z', 'UTC') AS ts,
6  g[3] AS method,
7  g[4] AS path,
8  toUInt16(g[5]) AS status,
9  toFloat64(g[7]) AS request_time
10FROM file('access.log', LineAsString)
11LIMIT 5
12"
1198.51.100.248	2026-06-01 00:00:00	POST	/login	200	0.25
2198.51.100.223	2026-06-01 00:00:00	POST	/api/products	200	1.82
3198.51.100.227	2026-06-01 00:00:00	PUT	/api/users	200	0.906
4198.51.100.154	2026-06-01 00:00:00	DELETE	/api/orders	304	0.76
5198.51.100.170	2026-06-01 00:00:00	GET	/images/logo.png	200	0.485

extractGroups returns an array; g[1], g[2], … are the capture groups in order. parseDateTime takes an explicit format (%d/%b/%Y:%H:%i:%S %z), so the nginx timestamp, including its +0000 offset, becomes a proper DateTime you can filter, bucket, and sort on.

ClickHouse also has a dedicated Regexp input format that maps capture groups straight onto named columns. The extractGroups-on-LineAsString approach above keeps everything in one query and is the easiest to adapt to your own log shape: change the regex and the column list, and you are done.

For structured logs you do not even need the regex. JSON lines parse directly with SELECT * FROM file('app.log', JSONEachRow), and delimited logs work the same way you would run SQL on a CSV file.

Count status codes #

With the fields extracted, aggregation is ordinary SQL. Status-code distribution in one query:

1clickhouse local -q "
2WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
3SELECT toUInt16(g[5]) AS status, count() AS requests
4FROM file('access.log', LineAsString)
5GROUP BY status ORDER BY requests DESC
6"
1200	24837
2301	5127
3404	5098
4503	5071
5304	4985
6500	4882

Latency percentiles #

Averages hide the tail. Use quantile to get p95 and p99 of the request-time field directly:

1clickhouse local -q "
2WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
3SELECT
4  round(quantile(0.95)(toFloat64(g[7])), 3) AS p95_s,
5  round(quantile(0.99)(toFloat64(g[7])), 3) AS p99_s,
6  round(max(toFloat64(g[7])), 3)            AS max_s
7FROM file('access.log', LineAsString)
8"
11.903	1.982	2

Top URLs and noisiest IPs #

Find the most-requested paths:

1clickhouse local -q "
2WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
3SELECT g[4] AS path, count() AS hits
4FROM file('access.log', LineAsString)
5GROUP BY path ORDER BY hits DESC LIMIT 5
6"
1/login	5135
2/images/logo.png	5076
3/checkout	5052
4/api/orders	5011
5/api/users	5005

And the IPs responsible for the most 5xx errors, the kind of question you ask mid-incident:

1clickhouse local -q "
2WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
3SELECT g[1] AS ip, count() AS errors
4FROM file('access.log', LineAsString)
5WHERE toUInt16(g[5]) >= 500
6GROUP BY ip ORDER BY errors DESC LIMIT 5
7"
1198.51.100.67	66
2198.51.100.193	56
3198.51.100.4	55
4198.51.100.242	55
5198.51.100.254	54

Requests over time #

Bucket the parsed timestamp with toStartOfMinute to plot traffic per minute (swap in toStartOfHour or toStartOfFiveMinutes as needed):

1clickhouse local -q "
2WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
3SELECT
4  toStartOfMinute(parseDateTime(g[2], '%d/%b/%Y:%H:%i:%S %z', 'UTC')) AS minute,
5  count() AS requests
6FROM file('access.log', LineAsString)
7GROUP BY minute ORDER BY minute LIMIT 5
8"
12026-06-01 00:00:00	300
22026-06-01 00:01:00	300
32026-06-01 00:02:00	300
42026-06-01 00:03:00	300
52026-06-01 00:04:00	300

Rotated and compressed logs #

Real log directories are full of rotated, gzip-compressed files: access.log, access-1.log.gz, access-2.log.gz, and so on. A glob reads all of them in one query, and clickhouse local decompresses .gz transparently, with no gunzip step:

1clickhouse local -q "
2WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
3SELECT toUInt16(g[5]) AS status, count() AS requests
4FROM file('logs/*.log.gz', LineAsString)
5GROUP BY status ORDER BY requests DESC
6"
1200	7422
2404	1665
3301	1617
4304	1482
5503	1455
6500	1359

Use file('logs/*.log*', ...) to sweep plain and compressed files together. The glob works the same whether you point it at three rotated files or three hundred.

Performance #

The same query scales from a sample file to a real archive without changing. To check, the examples folder can generate a 20-million-line, 2.4 GB log (./generate.sh --big). Parsing every line with the regex and computing status counts plus p95 latency over that file:

1clickhouse local --time -q "
2WITH extractGroups(line, '^(\S+) - - \[([^\]]+)\] \"(\S+) (\S+) [^\"]*\" (\d+) (\d+) \"[^\"]*\" \"[^\"]*\" (\S+)') AS g
3SELECT toUInt16(g[5]) AS status, count() AS requests,
4       round(quantile(0.95)(toFloat64(g[7])), 3) AS p95
5FROM file('big_access.log', LineAsString)
6GROUP BY status ORDER BY requests DESC
7"

That ran in ~1.80s (best of three, warm OS page cache) on an Apple M4 Pro laptop (14 cores, 24 GB RAM): a full regex parse of 20M lines plus a quantile aggregation, from a cold SQL prompt with no indexing or import.

ClickHouse's pitch for log work is breadth in one tool (raw text, regex, JSON, CSV, and transparent gzip across a directory glob) plus the fact that the exact same SQL runs unchanged on your laptop, on a server, and on ClickHouse Cloud when a one-off investigation turns into a permanent logging table. No rewrite when the logs outgrow the laptop.

Run it yourself #

Every command above, plus the data generator, lives in the clickhouse-local-logs examples folder:

1git clone https://github.com/ClickHouse/examples
2cd examples/local-analytics/clickhouse-local-logs
3./generate.sh && ./run.sh

Prefer Python? The same parse-and-aggregate pattern works in-process with chDB.

Share this resource

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!
Loading form...