To convert a .npy file to Parquet, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.
Install it with clickhousectl:
curl https://clickhouse.com/cli | sh # install clickhousectl
clickhousectl local use latest # download ClickHouse and put it on your PATH
Then write the Parquet file with INTO OUTFILE:
clickhouse local -q "SELECT array AS reading FROM file('readings.npy') INTO OUTFILE 'readings.parquet' FORMAT Parquet"
clickhouse local -q "DESCRIBE file('readings.parquet')"
The dtype is read from the .npy header and carried into Parquet as a typed column, so the file is converted locally with no upload step and no intermediate representation. The source .npy is left untouched.
A .npy file is a single typed NumPy array plus a small header describing its dtype and shape. There are no column names. When clickhouse local reads one, the array becomes a single column always named array:
clickhouse local -q "DESCRIBE file('readings.npy')"
That is the one thing to know before converting. Parquet is a columnar format with named columns, so you usually want to rename array to something meaningful as part of the SELECT:
clickhouse local -q "SELECT array AS reading FROM file('readings.npy') INTO OUTFILE 'readings.parquet' FORMAT Parquet"
clickhouse local -q "DESCRIBE file('readings.parquet')"
The dtype is preserved: a float64 array in, a Float64 column out. The type comes from the .npy header, not from sniffing values, so there is nothing to guess.
Add TRUNCATE before FORMAT (as the runnable example does) to overwrite an existing output file instead of erroring.
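To make the point about the header concrete, here is a minimal sketch in Python that reads the dtype and shape from a .npy header using only the standard library. It follows the published NPY layout (magic string, version bytes, header length, then a Python dict literal). The helper names read_npy_header and write_sample_npy are hypothetical, and this is an illustration of the file format, not how ClickHouse implements its reader.

```python
import ast
import os
import struct
import tempfile

MAGIC = b"\x93NUMPY"

def read_npy_header(path):
    """Return the header dict of a .npy file (format versions 1.0 to 3.0)."""
    with open(path, "rb") as f:
        if f.read(6) != MAGIC:
            raise ValueError("not a .npy file")
        major, _minor = f.read(2)
        # Version 1.0 stores the header length as uint16; 2.0 and 3.0 use uint32.
        if major == 1:
            (header_len,) = struct.unpack("<H", f.read(2))
        else:
            (header_len,) = struct.unpack("<I", f.read(4))
        encoding = "utf-8" if major >= 3 else "latin1"
        text = f.read(header_len).decode(encoding)
    # The header is a Python dict literal padded with spaces and a newline.
    return ast.literal_eval(text.strip())

def write_sample_npy(path, values):
    """Write a 1D little-endian float64 .npy (format 1.0) without NumPy."""
    header = "{'descr': '<f8', 'fortran_order': False, 'shape': (%d,), }" % len(values)
    # Pad so that magic + version + length field + header is a multiple of 64 bytes.
    unpadded = 6 + 2 + 2 + len(header) + 1
    header += " " * (-unpadded % 64) + "\n"
    with open(path, "wb") as f:
        f.write(MAGIC + bytes([1, 0]) + struct.pack("<H", len(header)))
        f.write(header.encode("latin1"))
        f.write(struct.pack("<%dd" % len(values), *values))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample_readings.npy")
        write_sample_npy(sample, [1.5, 2.5, 3.5])
        print(read_npy_header(sample))
```

For the three-value sample, the header dict has descr '<f8' (little-endian float64), fortran_order False, and shape (3,). The descr field is what determines the column type, which is why no values need to be inspected.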
A 2D .npy becomes one Array column
.npy files are often matrices: feature tables, image batches, embedding sets. A 2D array reads as one row per outer element, with each row holding a fixed-length Array:
clickhouse local -q "DESCRIBE file('embeddings.npy')"
You have two sensible ways to land that in Parquet. Keep each row as a vector in an Array column, which is the natural shape for embeddings:
clickhouse local -q "SELECT array AS embedding FROM file('embeddings.npy') INTO OUTFILE 'embeddings.parquet' FORMAT Parquet"
clickhouse local -q "SELECT count() AS rows, length(any(embedding)) AS dims FROM file('embeddings.parquet')"
Or expand the vector into named scalar columns, which is friendlier for tools that expect flat tables:
clickhouse local -q "
SELECT array[1] AS f0, array[2] AS f1, array[3] AS f2
FROM file('embeddings.npy')
INTO OUTFILE 'embeddings_flat.parquet' FORMAT Parquet"
clickhouse local -q "SELECT * FROM file('embeddings_flat.parquet') LIMIT 3"
0.88 -0.449 0.351
-0.659 -0.236 0.152
-0.009 0.651 -0.92
Both are lossless on the numbers. The only thing .npy does not carry is column names, because it never had any — you supply them in the SELECT.
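Writing array[1], array[2], and so on by hand gets tedious for wider vectors, such as a 16-wide matrix. One way to handle that is to generate the SELECT list. The sketch below is a hypothetical helper (flat_select is not part of ClickHouse or chDB) that builds the same query shape as above for any width. It assumes the file names are trusted, since they are pasted into the SQL without escaping.

```python
def flat_select(src, dst, dims, prefix="f"):
    """Build SQL that expands a fixed-width Array column into scalar columns."""
    # ClickHouse arrays are 1-indexed, so element i + 1 becomes column f{i}.
    cols = ", ".join(f"array[{i + 1}] AS {prefix}{i}" for i in range(dims))
    return f"SELECT {cols} FROM file('{src}') INTO OUTFILE '{dst}' FORMAT Parquet"

if __name__ == "__main__":
    # Prints the three-column query used above, on one line.
    print(flat_select("embeddings.npy", "embeddings_flat.parquet", 3))
```

The resulting string can be passed to clickhouse local -q or to chdb.query unchanged.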
Why bother, and why not just an online converter
The Parquet you get back is smaller and faster to query. The same embeddings.npy shrinks from 128 MB to 44 MB once it's columnar and compressed, and the column's name and type travel inside the file, so any Parquet reader can use it without knowing the NumPy format.
Online converters that require an upload dominate the search results for this task, and they work for a small array you don't mind handing over. The reasons to do it locally:
- No upload. The data never leaves your machine. That matters for embeddings, model weights, or anything proprietary.
- Correct types, no surprises. The dtype is read from the .npy header and mapped directly to a Parquet logical type. A float32 stays 32-bit; an int64 stays 64-bit.
- Scriptable. It's one shell line, so it drops into a Makefile, a cron job, or a CI step.
- Bigger than RAM. The conversion streams, so file size is bounded by disk, not memory.
You can set the Parquet compression codec explicitly if you want, for example SETTINGS output_format_parquet_compression_method = 'zstd' after the FORMAT Parquet clause.
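As a sketch of the scriptable case, the Python below converts every .npy file in the current directory by shelling out to clickhouse local, with the optional compression setting appended after FORMAT Parquet as described above. The function convert_command and the column name value are hypothetical choices for illustration, and the paths are assumed to be trusted because they are placed in the SQL without escaping.

```python
import shutil
import subprocess
from pathlib import Path

def convert_command(src, column, codec=None):
    """Build the clickhouse local argv that converts one .npy file to Parquet."""
    dst = str(Path(src).with_suffix(".parquet"))
    sql = (
        f"SELECT array AS {column} FROM file('{src}') "
        f"INTO OUTFILE '{dst}' TRUNCATE FORMAT Parquet"
    )
    if codec:
        # Optional: choose the Parquet compression codec explicitly.
        sql += f" SETTINGS output_format_parquet_compression_method = '{codec}'"
    return ["clickhouse", "local", "-q", sql]

if __name__ == "__main__":
    if shutil.which("clickhouse") is None:
        raise SystemExit("clickhouse is not on PATH")
    for npy in sorted(Path(".").glob("*.npy")):
        # check=True stops the batch on the first failed conversion.
        subprocess.run(convert_command(str(npy), "value", codec="zstd"), check=True)
```

Passing the command as an argument list avoids shell quoting issues, and TRUNCATE makes the script safe to re-run over existing outputs.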
On the 2,000,000-row by 16-wide float32 matrix in the example (embeddings.npy, 128 MB), the full conversion to Parquet runs in:
clickhouse local -q "SELECT array AS embedding FROM file('embeddings.npy') INTO OUTFILE 'embeddings.parquet' TRUNCATE FORMAT Parquet"
~0.69 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). That reads 32 million float values, builds the row groups, compresses, and writes the file. The number may shift slightly under concurrent load, but conversions of this size are sub-second.
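The figures quoted on this page can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated here (matrix shape, 4-byte float32, 44 MB output, 0.69 seconds) and treats MB as 10^6 bytes. It ignores the small .npy header, and the derived throughput is a back-of-envelope figure for this one run, not a general benchmark.

```python
rows, dims = 2_000_000, 16
bytes_per_float32 = 4

values = rows * dims                         # elements in the matrix
raw_mb = values * bytes_per_float32 / 1e6    # payload size of the .npy, in MB
parquet_mb = 44                              # output size quoted above
seconds = 0.69                               # best-of-three timing quoted above

compression_ratio = raw_mb / parquet_mb
throughput_mb_s = raw_mb / seconds
values_per_second = values / seconds

if __name__ == "__main__":
    print(f"{values:,} values, {raw_mb:.0f} MB raw")
    print(f"compression ratio: {compression_ratio:.1f}x")
    print(f"throughput: {throughput_mb_s:.0f} MB/s, "
          f"{values_per_second / 1e6:.1f} million values/s")
```

The element count and raw size match the 32 million values and 128 MB stated above; the ratio works out to roughly 2.9x.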
If you're already in Python, chDB is the same engine in-process. Run the identical SQL with chdb.query and write straight to a file:
import chdb
chdb.query(
"SELECT array AS embedding FROM file('embeddings.npy') "
"INTO OUTFILE 'embeddings.parquet' TRUNCATE FORMAT Parquet"
)
print(chdb.query("DESCRIBE file('embeddings.parquet')", "CSV"), end="")
"embedding","Array(Float32)","","","","",""
There's no NumPy load and no pyarrow round-trip in between; chDB reads the .npy and writes the Parquet directly. See how to read a .npy file in Python with chDB for the read side.
The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample .npy files (including the 128 MB matrix used for the timing above), run.sh with every command on this page, plus run.py and run.ipynb for the chDB version, and expected_output.txt.
github.com/ClickHouse/examples/tree/main/local-analytics/convert-npy-to-parquet
The same SQL scales without a rewrite: from a .npy on your laptop, to a ClickHouse server, to ClickHouse Cloud when the data outgrows one machine. For more on the target format, see what is a Parquet file and what is a .npy file.