How to read a TSV file in Python (faster than pandas)

To read a TSV file in Python, use chDB as a drop-in replacement for pandas. Change one import line and the code you already write keeps working:

import chdb.datastore as pd
df = pd.read_csv("events.tsv", sep="\t")

df supports the pandas API you already use (indexing, filtering, groupby, merge), but the work runs on ClickHouse's engine instead of pandas, so it stays fast as files grow. There is no server to start and no separate load step.

pip install chdb

Prefer the command line? See how to query a TSV file to do this with clickhouse local.

Read a TSV into a DataFrame

pd.read_csv works the way you expect. Pass sep="\t" for a tab-separated file and point it at a path:

import chdb.datastore as pd

df = pd.read_csv("data/events.tsv", sep="\t")
print(df)
print(df.dtypes)

   event_id country event_type  revenue         product
0         1      GB   purchase    49.99  Widget, Deluxe
1         2      AU       view     0.00          Gadget
2         3      GB   purchase    19.50          Widget
3         4      IN   purchase     5.00            Bolt
4         5      AU   purchase    99.00           Gizmo
5         6      IN       view     0.00            Bolt
event_id        int64
country           str
event_type        str
revenue       float64
product           str
dtype: object

Look at the dtypes. chDB inferred them from the data: event_id is an integer, revenue is a float. You did not declare a single type or pass a dtype map.
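To make the idea of inference concrete, here is a toy sketch in plain Python. It is not ClickHouse's actual inference logic, and the `infer_type` helper is hypothetical: it simply picks the narrowest of int, float, or str that parses every value in a column.

```python
def infer_type(values):
    """Return the narrowest of int, float, str that parses every value."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster
        except ValueError:
            # At least one value did not parse; try the next, wider type.
            continue
    return str


# Raw text as it would appear in the TSV, before any typing.
columns = {
    "event_id": ["1", "2", "3"],
    "revenue": ["49.99", "0.00", "19.50"],
    "country": ["GB", "AU", "GB"],
}

inferred = {name: infer_type(vals).__name__ for name, vals in columns.items()}
print(inferred)  # {'event_id': 'int', 'revenue': 'float', 'country': 'str'}
```

A real engine handles far more cases (dates, nulls, wider integer types), but the principle is the same: the types come from the values, not from you.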

Notice Widget, Deluxe in the product column. The comma inside it is just text. In a TSV the delimiter is a tab, so commas inside fields need no quoting and never split a column. That is why some teams ship TSV instead of CSV.
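You can check the delimiter behavior without chDB at all. The sketch below uses only Python's standard csv module on the first data row from the sample file, then shows what happens if the same row is naively converted to comma-separated text without quoting.

```python
import csv
import io

# First data row of the sample file, tab-separated.
line = "1\tGB\tpurchase\t49.99\tWidget, Deluxe\n"

# Parsed as TSV: the comma is ordinary text inside the last field.
tsv_fields = next(csv.reader(io.StringIO(line), delimiter="\t"))

# The same row with tabs swapped for commas and no quoting added.
csv_fields = next(csv.reader(io.StringIO(line.replace("\t", ","))))

print(len(tsv_fields), repr(tsv_fields[-1]))  # 5 'Widget, Deluxe'
print(len(csv_fields))                        # 6: the product name was split
```

To survive as CSV, that last field would have to be written as "Widget, Deluxe" in quotes. In the TSV it needs nothing.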

One thing is different from pandas under the hood: df is not the whole file loaded into memory. It is a lazy, ClickHouse-backed object that records what you asked for and runs it on ClickHouse's engine when you materialize a result, whether you print it, take len(), or call .to_pandas(). That is why it stays fast on large files: it only computes the rows and columns you actually use.
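If lazy evaluation is new to you, this toy class shows the general pattern. It is an illustration only, not how chDB is implemented: `LazyRows`, `where`, and `to_list` are hypothetical names, and a real engine builds a query plan rather than a list of Python callables.

```python
class LazyRows:
    """Toy lazy table: records filters, runs them only when materialized."""

    def __init__(self, source, filters=()):
        self._source = source            # callable that yields row dicts
        self._filters = tuple(filters)   # recorded steps, not yet applied

    def where(self, predicate):
        # Record the step and return a new lazy object; nothing is read yet.
        return LazyRows(self._source, self._filters + (predicate,))

    def to_list(self):
        # Materialize: read the source once, applying every recorded filter.
        return [
            row for row in self._source()
            if all(keep(row) for keep in self._filters)
        ]


reads = []


def source():
    reads.append("read")  # track when the data is actually read
    yield {"country": "GB", "event_type": "purchase", "revenue": 49.99}
    yield {"country": "AU", "event_type": "view", "revenue": 0.0}


purchases = LazyRows(source).where(lambda r: r["event_type"] == "purchase")
print(len(reads))           # 0: building the pipeline read nothing
print(purchases.to_list())  # the single purchase row
print(len(reads))           # 1: the read happened at materialization
```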

Filter and aggregate the way you already do

The pandas you write does not change. Filter with a boolean mask, group, and aggregate:

import chdb.datastore as pd

df = pd.read_csv("data/events.tsv", sep="\t")
revenue = df[df["event_type"] == "purchase"].groupby("country")["revenue"].sum()
print(revenue)

country
AU    99.00
GB    69.49
IN     5.00
Name: revenue, dtype: float64

Same syntax as pandas, same result. The difference is that chDB compiles the whole chain into one optimized query and only scans the columns the aggregation touches. Joins, computed columns with assign, the .str and .dt accessors, and merge across files all work the same way.
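To picture the chain as a single query, here is roughly the SQL that a filter, groupby, and sum correspond to. Treat this as an assumption about the shape of the query, not the text chDB actually generates. It runs on SQLite from the standard library as a stand-in, with the six sample rows typed in by hand, so you can check the totals without installing anything.

```python
import sqlite3

# The six rows from the sample TSV.
rows = [
    (1, "GB", "purchase", 49.99, "Widget, Deluxe"),
    (2, "AU", "view", 0.00, "Gadget"),
    (3, "GB", "purchase", 19.50, "Widget"),
    (4, "IN", "purchase", 5.00, "Bolt"),
    (5, "AU", "purchase", 99.00, "Gizmo"),
    (6, "IN", "view", 0.00, "Bolt"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "event_id INTEGER, country TEXT, event_type TEXT, revenue REAL, product TEXT)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)

# Filter, group, and sum expressed as one statement.
# Only country, event_type, and revenue are referenced.
query = """
    SELECT country, SUM(revenue) AS revenue
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY country
    ORDER BY country
"""
result = [(country, round(total, 2)) for country, total in conn.execute(query)]
conn.close()

print(result)  # [('AU', 99.0), ('GB', 69.49), ('IN', 5.0)]
```

The totals match the pandas-style output above, and the query never mentions event_id or product, which is what lets a columnar engine skip them.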

No header? Name the columns

A headerless TSV needs column names from you, exactly as in pandas. Pass names and sep="\t", and chDB still infers each column's type:

import chdb.datastore as pd

df = pd.read_csv(
    "data/events_noheader.tsv",
    sep="\t",
    names=["event_id", "country", "event_type", "revenue", "product"],
)
print(df.head(3))

   event_id country event_type  revenue         product
0         1      GB   purchase    49.99  Widget, Deluxe
1         2      AU       view     0.00          Gadget
2         3      GB   purchase    19.50          Widget

Leave names off a headerless file and the columns come back as c1, c2, c3 and so on. Pass header=None if you want that positional naming on purpose.
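For comparison, the standard library takes the same approach to headerless files. This sketch uses csv.DictReader with its fieldnames argument on two inline rows. It shows the naming step only: unlike the DataFrame route, every value stays a string.

```python
import csv
import io

# Two rows with no header line, as in a headerless TSV.
headerless = (
    "1\tGB\tpurchase\t49.99\tWidget, Deluxe\n"
    "2\tAU\tview\t0.00\tGadget\n"
)
names = ["event_id", "country", "event_type", "revenue", "product"]

# Supplying fieldnames tells the reader that the first line is data.
reader = csv.DictReader(io.StringIO(headerless), fieldnames=names, delimiter="\t")
records = list(reader)

print(len(records))            # 2: no row was consumed as a header
print(records[0]["product"])   # Widget, Deluxe
print(records[0]["revenue"])   # 49.99, still a string here
```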

Hand off to real pandas when you need it

When a library downstream needs an actual pandas DataFrame (scikit-learn, a plotting call, anything that mutates in place), call .to_pandas() to materialize one:

import chdb.datastore as pd

df = pd.read_csv("data/events.tsv", sep="\t")
pdf = df.to_pandas()   # a real pandas.DataFrame, in memory

The usual pattern is to do the heavy filtering and aggregation on the chDB object, where it is fast, then .to_pandas() the small result and continue in ordinary pandas.

Is it faster than `pandas.read_csv`?

On a file big enough that scanning is the actual cost, yes. TSV is a text format, so most of the work is parsing bytes into typed values, and chDB does that in compiled, multi-threaded ClickHouse code rather than the Python layer. Here is the aggregation above on a 3M-row (~85 MB) TSV, best-of-3 with a warm page cache, on an Apple M4 Pro (14 cores, 24 GB RAM, macOS) with chDB 4.1.8 and Python 3.14:

import pandas as pd:            0.484s
import chdb.datastore as pd:    0.056s
speedup:                        8.6x

About 8.6x here, from the same code with one import changed. The back-to-back ratio is the robust claim; absolute times shift with cache state and concurrent load. The gap widens as files grow, because plain pandas parses and holds the whole file in memory before it can group anything, while chDB streams and aggregates in one pass.
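If you want to reproduce a best-of-3 comparison on your own files, a small timing helper is enough. This is a minimal sketch, not the harness used for the numbers above: `best_of` and `speedup` are hypothetical helpers, and the two workloads are stand-ins you would replace with the pandas and chDB versions of the aggregation.

```python
import time


def best_of(fn, repeats=3):
    """Return the fastest wall-clock time, in seconds, over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline, candidate, repeats=3):
    """Ratio of the best baseline time to the best candidate time."""
    return best_of(baseline, repeats) / best_of(candidate, repeats)


# Stand-in workloads; swap in the two versions of the aggregation.
def slow():
    return sum(i * i for i in range(200_000))


def fast():
    return sum(i * i for i in range(20_000))


print(f"speedup: {speedup(slow, fast):.1f}x")
```

Taking the minimum rather than the mean discards runs disturbed by other processes, and running both versions back to back keeps the cache state comparable. The reported ratio follows directly from the two timings: 0.484 / 0.056 is about 8.6.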

The small case can cut the other way: on a tiny TSV (a few thousand rows) plain pandas is as fast or faster, because chDB pays a fixed per-query startup cost that dominates when there is barely any data. Reach for chDB when the file is large, or when you want one import to cover both.

Works in Jupyter

Printing a chDB object renders it as a table in a notebook, and .to_pandas() feeds straight into .plot(), joins, or any pandas code. The companion folder ships a run.ipynb you can open and run cell by cell.

Run it yourself

The complete, runnable example is here, with generate.sh to create the sample TSV files plus run.ipynb and a run.py mirror containing the exact code above:

github.com/ClickHouse/examples/tree/main/local-analytics/chdb-tsv

Working with other formats? The guides "Read a CSV file in Python" and "Read a JSON file in Python" use the same drop-in import chdb.datastore as pd pattern; only the reader changes. If you keep re-reading the same TSV, convert it to Parquet once and every later read gets faster.

The same code scales from a TSV on your laptop to a server to ClickHouse Cloud with no rewrite, so there is nothing to re-learn when the data outgrows one machine.
