To read a .npy (NumPy) file from the command line, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.
Install it with clickhousectl:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
Then name the Npy format and query the file:
clickhouse local -q "SELECT * FROM file('revenue.npy', Npy) LIMIT 5"
┌─array─┐
1. │ 100 │
2. │ 107.1 │
3. │ 114.2 │
4. │ 121.3 │
5. │ 128.4 │
└───────┘
The file is read in place with no import step. An .npy file embeds its own dtype in the header, so ClickHouse resolves the column type from the file itself and returns the array as a single column called array.
This is the one gotcha that trips people up. Most formats are inferred from the file extension, so file('data.parquet') and file('data.csv') just work. .npy is the exception. The extension is not enough; you have to pass the Npy format as the second argument, every time:
clickhouse local -q "SELECT * FROM file('revenue.npy', Npy) LIMIT 5"
Drop the Npy and ClickHouse will not know how to parse the file. Once you name it, everything else behaves like any other table.
An .npy file carries its own dtype in the header (the >f8, <i4 and friends NumPy writes). DESCRIBE reads it back as a ClickHouse type, so you never write CREATE TABLE:
clickhouse local -q "DESCRIBE file('revenue.npy', Npy) FORMAT PrettyCompact"
┌─name──┬─type────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
1. │ array │ Float64 │ │ │ │ │ │
└───────┴─────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
A NumPy float64 array maps to Float64, int32 to Int32, and so on. The single column is always named array, because the file holds exactly one array and nothing else: no header row, no column names. (If you saved a 2D array, the column comes back as Array(Float32), one nested array per row.)
Rename the column and aggregate
The full SQL surface works on the array (WHERE, GROUP BY, aggregate functions, quantiles). Refer to the column as array:
clickhouse local -q "
SELECT count() AS n, round(avg(array), 2) AS mean, round(max(array), 2) AS max
FROM file('revenue.npy', Npy)
FORMAT PrettyCompact"
┌──n─┬───mean─┬───max─┐
1. │ 10 │ 121.35 │ 149.7 │
└────┴────────┴───────┘
array is a clunky name to read SQL against. Give it a real one by passing an explicit structure as the third argument — the same place you would override an inferred type:
clickhouse local -q "
SELECT revenue FROM file('revenue.npy', Npy, 'revenue Float64')
WHERE revenue > 130
ORDER BY revenue DESC
FORMAT PrettyCompact"
┌─revenue─┐
1. │ 149.7 │
2. │ 142.6 │
3. │ 135.5 │
└─────────┘
The type you declare must match the array's dtype; this renames the column, it does not cast across incompatible types.
Because each .npy file holds a single array, a NumPy workflow that produces several parallel arrays (a price array, a quantity array, an embedding column) lands as several files. There is no shared key inside the files; the only thing linking them is row position, the array index.
To turn parallel arrays back into rows, attach a row number to each with rowNumberInAllBlocks() and join on it:
clickhouse local -q "
SELECT r.rn AS i, r.array AS revenue, q.array AS quantity
FROM (SELECT rowNumberInAllBlocks() AS rn, array FROM file('revenue.npy', Npy)) AS r
INNER JOIN (SELECT rowNumberInAllBlocks() AS rn, array FROM file('quantity.npy', Npy)) AS q
USING rn
ORDER BY i
FORMAT PrettyCompact"
┌─i─┬─revenue─┬─quantity─┐
1. │ 0 │ 100 │ 1 │
2. │ 1 │ 107.1 │ 4 │
3. │ 2 │ 114.2 │ 2 │
4. │ 3 │ 121.3 │ 5 │
5. │ 4 │ 128.4 │ 3 │
6. │ 5 │ 135.5 │ 1 │
7. │ 6 │ 142.6 │ 4 │
8. │ 7 │ 149.7 │ 2 │
9. │ 8 │ 103.8 │ 5 │
10. │ 9 │ 110.9 │ 3 │
└───┴─────────┴──────────┘
Now the two arrays are a single table you can filter, group, and join further. This is the move whenever you have arrays saved straight out of NumPy and want them as relational rows without writing a Python merge step. NumPy and pandas handle this natively too, of course; the point of doing it in SQL is that the same query runs unchanged against the other formats sitting next to your .npy files (Parquet, CSV, Arrow) and against a server when the data grows.
The .npy layout is a contiguous block of fixed-width numbers, so reading it is close to a raw mmap. On a 3,000,000-value Float64 array (scores_large.npy, ~23 MB, generated by the example folder below), computing the count, mean, and 95th percentile across every value runs in:
clickhouse local -q "
SELECT count(), round(avg(array), 3), round(quantile(0.95)(array), 3)
FROM file('scores_large.npy', Npy)"
┌─count()─┬─round(avg(array), 3)─┬─round(quanti⋯(array), 3)─┐
1. │ 3000000 │ 499.929 │ 950.141 │
└─────────┴──────────────────────┴──────────────────────────┘
~0.17 seconds, best of three with the file warm in the OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM; clickhouse local 26.6.1.117). The number is honest but small: 3M doubles is a modest array, and the read is essentially memory-bandwidth bound.
The query you just ran on a local array is the same SQL you would run on a ClickHouse server or in ClickHouse Cloud. You swap file('scores_large.npy', Npy) for a table name and nothing else changes. You prototype against arrays on your laptop and ship the identical logic to production, with no separate "local dialect" to unlearn.
The complete, runnable example lives here. It has generate.sh (writes the two small arrays and the ~23 MB perf array), run.sh (every command above), and expected_output.txt:
github.com/ClickHouse/examples → local-analytics/clickhouse-local-npy
git clone https://github.com/ClickHouse/examples
cd examples/local-analytics/clickhouse-local-npy
./generate.sh && ./run.sh