How to convert NPY to Parquet

Al Brown
Last updated: Jun 15, 2026

To convert a .npy file to Parquet, use clickhouse local. It runs SQL directly on files from the command line, with no server to install. It's part of ClickHouse, so the same query scales to billions of rows when you outgrow your laptop.

Install it with clickhousectl:

curl https://clickhouse.com/cli | sh   # install clickhousectl
clickhousectl local use latest         # download ClickHouse and put it on your PATH

Then write the Parquet file with INTO OUTFILE:

clickhouse local -q "SELECT array AS reading FROM file('readings.npy') INTO OUTFILE 'readings.parquet' FORMAT Parquet"
clickhouse local -q "DESCRIBE file('readings.parquet')"

reading	Float64

The dtype is read from the .npy header and carried into Parquet as a typed column. The conversion runs entirely on your machine, with no upload step and no intermediate file.

NumPy's .npy holds exactly one array

A .npy file is a single typed NumPy array plus a small header describing its dtype and shape. There are no column names. When clickhouse-local reads one, the array becomes a single column always named array:

clickhouse local -q "DESCRIBE file('readings.npy')"

array	Float64

That is the one thing to know before converting. Parquet is a columnar format with named columns, so you usually want to rename array to something meaningful as part of the SELECT:

clickhouse local -q "SELECT array AS reading FROM file('readings.npy') INTO OUTFILE 'readings.parquet' FORMAT Parquet"
clickhouse local -q "DESCRIBE file('readings.parquet')"

reading	Float64

The dtype is preserved: a float64 array in, a Float64 column out. The type comes from the .npy header, not from sniffing values, so there is nothing to guess.
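To make "the type comes from the header" concrete, here is a minimal sketch that writes a tiny version 1.0 .npy file by hand and reads its header back using only the Python standard library. The helper names (build_npy_v1, read_npy_header) and the small dtype-to-column-type table are illustrative assumptions, not part of ClickHouse or NumPy; the table covers only the dtypes mentioned on this page.

```python
import ast
import struct

MAGIC = b"\x93NUMPY"

# Illustrative mapping for the dtypes discussed on this page (an assumption,
# not an exhaustive list of what ClickHouse supports).
DESCR_TO_COLUMN_TYPE = {"<f8": "Float64", "<f4": "Float32", "<i8": "Int64"}


def build_npy_v1(values):
    """Serialize a 1-D list of floats as a version 1.0 .npy file (little-endian float64)."""
    header = "{'descr': '<f8', 'fortran_order': False, 'shape': (%d,), }" % len(values)
    # Pad with spaces so magic + version + length field + header (with its
    # trailing newline) ends on a 64-byte boundary.
    prefix_len = len(MAGIC) + 2 + 2
    pad = (-(prefix_len + len(header) + 1)) % 64
    header_bytes = (header + " " * pad + "\n").encode("latin1")
    return (
        MAGIC
        + bytes([1, 0])                            # format version 1.0
        + struct.pack("<H", len(header_bytes))     # header length, uint16
        + header_bytes
        + struct.pack("<%dd" % len(values), *values)
    )


def read_npy_header(blob):
    """Return (header dict, offset of the first data byte)."""
    if blob[:6] != MAGIC:
        raise ValueError("not a .npy file")
    major = blob[6]
    if major == 1:
        (header_len,) = struct.unpack_from("<H", blob, 8)   # 2-byte length
        start = 10
    else:
        (header_len,) = struct.unpack_from("<I", blob, 8)   # 4-byte length in v2+
        start = 12
    header = ast.literal_eval(blob[start:start + header_len].decode("latin1"))
    return header, start + header_len


blob = build_npy_v1([1.5, 2.5, 4.0])
header, data_offset = read_npy_header(blob)
print(header["descr"], header["shape"], DESCR_TO_COLUMN_TYPE[header["descr"]])
```

The header is a plain Python dict literal with descr, fortran_order, and shape keys, which is why a reader can assign a column type without looking at a single value.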

Add TRUNCATE before FORMAT (as the runnable example does) to overwrite an existing output file instead of erroring.

A 2D .npy becomes one Array column

.npy files are often matrices: feature tables, image batches, embedding sets. A 2D array reads as one row per outer element, with each row holding a fixed-length Array:

clickhouse local -q "DESCRIBE file('embeddings.npy')"

array	Array(Float32)

You have two sensible ways to land that in Parquet. Keep each row as a vector in an Array column, which is the natural shape for embeddings:

clickhouse local -q "SELECT array AS embedding FROM file('embeddings.npy') INTO OUTFILE 'embeddings.parquet' FORMAT Parquet"
clickhouse local -q "SELECT count() AS rows, length(any(embedding)) AS dims FROM file('embeddings.parquet')"

2000000	16

Or expand the vector into named scalar columns, which is friendlier for tools that expect flat tables:

clickhouse local -q "
SELECT array[1] AS f0, array[2] AS f1, array[3] AS f2
FROM file('embeddings.npy')
INTO OUTFILE 'embeddings_flat.parquet' FORMAT Parquet"
clickhouse local -q "SELECT * FROM file('embeddings_flat.parquet') LIMIT 3"

0.88	-0.449	0.351
-0.659	-0.236	0.152
-0.009	0.651	-0.92

Both are lossless on the numbers. The only thing missing is column names, which .npy never had; you supply them in the SELECT.
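Typing out array[1] AS f0 through array[16] AS f15 by hand gets tedious for wide matrices. As a minimal sketch, the SELECT can be generated instead. The flat_select helper and its f prefix are hypothetical names chosen for illustration, and the function assumes the file paths contain no single quotes.

```python
def flat_select(src, dst, dims, prefix="f"):
    """Build a query that expands the `array` column into `dims` scalar columns."""
    # ClickHouse arrays are 1-indexed, so column f0 reads array[1].
    cols = ", ".join(f"array[{i + 1}] AS {prefix}{i}" for i in range(dims))
    return f"SELECT {cols} FROM file('{src}') INTO OUTFILE '{dst}' FORMAT Parquet"


query = flat_select("embeddings.npy", "embeddings_flat.parquet", dims=3)
print(query)
```

With dims=3 this produces the same query as the hand-written example; pass the resulting string to clickhouse local -q or chdb.query.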

Why bother, and why not just an online converter

The Parquet you get back is smaller and faster to query. The same embeddings.npy shrinks from 128 MB to 44 MB once it's columnar and compressed, and a typed columnar file skips the dtype header parse on every later read.

Upload-required converter sites own the obvious searches for this, and they work for a small array you don't mind handing over. The reasons to do it locally:

  • No upload. The data never leaves your machine. That matters for embeddings, model weights, or anything proprietary.
  • Correct types, no surprises. The dtype is read from the .npy header and mapped directly to the matching Parquet type. A float32 stays 32-bit; an int64 stays 64-bit.
  • Scriptable. It's one shell line, so it drops into a Makefile, a cron job, or a CI step.
  • Bigger than RAM. The conversion streams, so file size is bounded by disk, not memory.

You can set the Parquet compression codec explicitly if you want, for example SETTINGS output_format_parquet_compression_method = 'zstd' after the FORMAT Parquet clause.
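As one way to picture the "scriptable" point, here is a minimal sketch of a batch converter that builds one clickhouse local command per .npy file in a directory, with the codec setting appended after FORMAT Parquet. The function names, the default column name value, and the dry-run behavior are assumptions made for illustration. Nothing is executed unless run=True, which requires clickhouse on your PATH.

```python
import re
import subprocess
from pathlib import Path


def build_convert_command(src, column="value", codec="zstd"):
    """Return the argv list that converts one .npy file to a sibling .parquet file."""
    # Keep the alias a plain identifier so it can be dropped into the SQL safely.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", column):
        raise ValueError(f"unsafe column name: {column!r}")
    src = Path(src)
    dst = src.with_suffix(".parquet")
    # Assumes the paths contain no single quotes.
    query = (
        f"SELECT array AS {column} FROM file('{src.as_posix()}') "
        f"INTO OUTFILE '{dst.as_posix()}' TRUNCATE FORMAT Parquet "
        f"SETTINGS output_format_parquet_compression_method = '{codec}'"
    )
    return ["clickhouse", "local", "-q", query]


def convert_all(directory, run=False):
    """Build (and optionally run) one conversion per .npy file in `directory`."""
    commands = [build_convert_command(p) for p in sorted(Path(directory).glob("*.npy"))]
    if run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if a conversion fails
    return commands


# Dry run: show the query that would be executed for a single file.
print(build_convert_command("data/readings.npy", column="reading")[3])
```

Passing the query as a single argv element avoids shell quoting problems, which is the main reason to build the command as a list and not as one string.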

How fast is it?

On the 2,000,000-row by 16-wide float32 matrix in the example (embeddings.npy, 128 MB), this is the full conversion to Parquet:

clickhouse local -q "SELECT array AS embedding FROM file('embeddings.npy') INTO OUTFILE 'embeddings.parquet' TRUNCATE FORMAT Parquet"

It runs in ~0.69 seconds, best of three with a warm OS page cache, on an Apple M4 Pro laptop (14 cores, 24 GB RAM). That covers reading 32 million float values, building the row groups, compressing, and writing the file. The number may shift under concurrent load or on different hardware, but on a comparable machine a conversion of this size should stay sub-second.
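The figures quoted on this page are consistent with each other, as a quick back-of-the-envelope check shows. This sketch assumes MB means 10^6 bytes and ignores the small .npy header; the 44 MB Parquet size and the 0.69 s timing are the numbers quoted on this page, not new measurements.

```python
rows, dims = 2_000_000, 16
bytes_per_value = 4        # float32
seconds = 0.69             # timing quoted on this page
parquet_mb = 44            # Parquet size quoted on this page

values = rows * dims
raw_mb = values * bytes_per_value / 1_000_000
throughput_mb_s = round(raw_mb / seconds)
size_ratio = round(parquet_mb / raw_mb, 2)

print(values)            # 32000000 values
print(raw_mb)            # 128.0 MB of raw float32 data
print(throughput_mb_s)   # 186 MB/s of input processed
print(size_ratio)        # 0.34, so the Parquet file is about a third of the input
```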

chDB: the same conversion in Python

If you're already in Python, chDB is the same engine in-process. Run the identical SQL with chdb.query and write straight to a file:

import chdb

chdb.query(
    "SELECT array AS embedding FROM file('embeddings.npy') "
    "INTO OUTFILE 'embeddings.parquet' TRUNCATE FORMAT Parquet"
)
print(chdb.query("DESCRIBE file('embeddings.parquet')", "CSV"), end="")

"embedding","Array(Float32)","","","","",""

There's no NumPy load and no pyarrow round-trip in between; chDB reads the .npy and writes the Parquet directly. See how to read a .npy file in Python with chDB for the read side.

Run it yourself

The complete, runnable example lives in the ClickHouse examples repo: generate.sh to create the sample .npy files (including the 128 MB matrix used for the timing above), run.sh with every command on this page, plus run.py and run.ipynb for the chDB version, and expected_output.txt.

github.com/ClickHouse/examples/tree/main/local-analytics/convert-npy-to-parquet

The same SQL scales without a rewrite: from a .npy on your laptop, to a ClickHouse server, to ClickHouse Cloud when the data outgrows one machine. For more on the target format, see what is a Parquet file and what is a .npy file.
