How to scale vector search in Postgres (pgvector) for RAG and AI agents: memory limits, filtering, and when to go hybrid

Engineering teams default to Postgres and pgvector because nobody wants more infrastructure to manage. The out-of-the-box setup works great for prototyping.

But as retrieval-augmented generation (RAG) pipelines and AI agents move into production, teams hit a wall. Scaling vector search inside a relational database means moving beyond default configurations.

Getting there requires bypassing memory limits, fixing recall loss under metadata filters, and knowing when to hand the hot path to a dedicated vector database.

TL;DR

Prototypes: Default Postgres + pgvector works great for initial RAG development and small corpora. Tuning, quantization, and extensions become necessary past a few million high-dimensional vectors.
What breaks at scale: HNSW index size grows quickly as vector count and dimensionality increase. The index is stored on disk, but performance depends heavily on keeping the active working set cache-resident. When active graph pages are evicted from memory, latency can degrade sharply, and standard post-filtering can silently degrade recall.
How to scale in Postgres: Use pgvector 0.8+ iterative scans to improve filtering recall. For memory limits, use quantization, partitioning, and, where supported, disk-oriented vector indexes such as pgvectorscale/StreamingDiskANN.
When to go hybrid: Move to a dedicated vector database if sub-20 ms p99 latency is a requirement, you're scaling into the billions of vectors, or high write churn from frequent re-embedding would cause severe MVCC bloat.

How to scale vector search in Postgres at a glance

Scaling pgvector beyond prototypes comes down to three steps. First, fix recall under metadata filters by enabling pgvector 0.8+ iterative index scans. Second, bypass HNSW's RAM ceiling using quantization, tenant-level partitioning, and, where available, SSD-oriented vector indexes. Third, know the workload signals, such as sub-20 ms p99 latency, billions of vectors, GPU ANN, or heavy re-embedding churn, that justify a hybrid Postgres + dedicated vector database architecture.

Why pgvector HNSW slows down when the working set outgrows memory

Postgres vector search can slow down at scale because the Hierarchical Navigable Small World (HNSW) index is most effective when its active graph pages remain in cache. The graph can technically spill beyond memory, but HNSW's random-access traversal pattern is sensitive to cache misses, so latency can degrade sharply once the working set exceeds available RAM.

How much RAM does HNSW need per vector?

Take a baseline corpus using OpenAI's text-embedding-ada-002 model, which generates 1,536-dimensional vectors.

Storing a single vector at full precision (float32) requires 4 bytes per dimension, roughly 6 kilobytes per vector. The HNSW index adds overhead to maintain its multi-layered proximity graph. Pointers, neighbor lists, and structural metadata push the memory requirement well past raw vector storage. Real-world deployments can land around 20 to 25 kilobytes per vector with common settings once graph metadata is accounted for, though the exact footprint depends on index parameters, dimensions, and workload.

At one million vectors, that can mean tens of gigabytes of memory. At ten million vectors, memory pressure becomes a central design concern. At 100 million vectors, raw FP32 vector storage alone is over 600 gigabytes before index overhead is included.
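The arithmetic behind those figures can be sketched in a few lines. This is a back-of-the-envelope calculation, not a measurement: the 20 KB per-vector overhead figure is the lower end of the rough estimate above, and the function names are illustrative.

```python
# Back-of-the-envelope sizing for float32 embeddings.
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(dimensions: int) -> int:
    """Raw float32 payload for one vector, excluding row and index overhead."""
    return dimensions * BYTES_PER_FLOAT32


def corpus_gigabytes(num_vectors: int, bytes_per_vector: float) -> float:
    """Total footprint in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_vectors * bytes_per_vector / 1e9


DIMS = 1536
# Assumption: 20 KB per vector including graph overhead. This is a rough
# estimate taken from the text, not a value reported by pgvector.
ASSUMED_BYTES_WITH_OVERHEAD = 20 * 1024

for n in (1_000_000, 10_000_000, 100_000_000):
    raw = corpus_gigabytes(n, raw_vector_bytes(DIMS))
    est = corpus_gigabytes(n, ASSUMED_BYTES_WITH_OVERHEAD)
    print(f"{n:>11,} vectors: raw {raw:7.1f} GB, with assumed overhead {est:7.1f} GB")
```

Replace the assumed overhead with the index size you actually observe on a sample of your own data before using numbers like these for capacity planning.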

What happens when the HNSW working set is no longer cache-resident?

pgvector indexes are stored on disk like other Postgres indexes, but query performance depends heavily on the active graph pages being served from shared buffers or the operating system page cache. When the active HNSW working set no longer fits in memory, queries may need to fetch graph pages from storage during traversal.

HNSW traversal involves random access rather than sequential scanning. Disk spillover can cause severe page churn. Search latency can degrade from milliseconds to seconds, depending on storage architecture, cache hit rate, and concurrency. That can stall the active connection pool and eventually compete with standard OLTP workloads.

As a rough rule of thumb, pgvector scaling tends to fall into bands, though the exact thresholds depend heavily on dimensionality, hardware, recall targets, and filter patterns. Around one million vectors, naive implementations often succeed. Approaching ten million, quantization, partitioning, and strict memory tuning tend to matter more. Near 100 million, it's worth evaluating whether standard in-memory HNSW is still the right index structure. Beyond a billion vectors, a pure Postgres architecture is often no longer the simplest or most cost-effective design.

Common pgvector pitfalls: when you don't need ANN, dimension limits, and MVCC bloat

The most common myth in vector database architecture is the assumption that every semantic search requires an Approximate Nearest Neighbor (ANN) index.

For multi-tenant SaaS applications, developers often partition data by workspace or customer ID. If a single tenant only owns a few thousand vectors, building and querying an HNSW index can be overkill. A standard B-tree index on the partition key followed by a flat scan of the relevant vectors yields 100 percent recall with zero approximation loss, and can execute faster than traversing a graph structure.
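To make the flat-scan idea concrete, here is a minimal Python model of what the database does in that case: narrow to one tenant's rows, compute every distance exactly, and sort. The row layout and function names are hypothetical, and the sketch assumes non-zero vectors.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(rows, tenant_id, query, k):
    """Exact nearest neighbors for one tenant.

    rows: iterable of (doc_id, tenant_id, embedding) tuples.
    Every candidate is scored, so there is no approximation loss.
    """
    scored = [
        (cosine_distance(embedding, query), doc_id)
        for doc_id, row_tenant, embedding in rows
        if row_tenant == tenant_id  # stands in for the B-tree lookup on tenant
    ]
    scored.sort()
    return [doc_id for _, doc_id in scored[:k]]


rows = [
    ("a", 1, [1.0, 0.0]),
    ("b", 1, [0.0, 1.0]),
    ("c", 2, [1.0, 0.0]),  # different tenant, never scored
    ("d", 1, [0.9, 0.1]),
]
nearest = exact_top_k(rows, tenant_id=1, query=[1.0, 0.0], k=2)
```

The cost is linear in the tenant's row count, which is why this only makes sense when each tenant's slice is small.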

Another critical pitfall is ignoring Postgres internal storage mechanics, specifically the page size. Postgres stores data in 8-kilobyte pages by default (the block size is fixed when the server is built) and relies on the TOAST mechanism to compress large values or move them out-of-line.

The pgvector extension enforces a hard limit of 2,000 dimensions for HNSW and IVFFlat indexing on the standard vector type. Storing 3,072-dimension vectors from newer, wider embedding models works, but CREATE INDEX USING hnsw on a vector column of that width fails, because index entries have to fit within a Postgres page. The workarounds are halfvec, which raises the indexable dimension cap to 4,000, or dimension reduction.

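Teams also frequently treat vector indexes like B-trees, assuming they return exact, deterministic orderings. They do not: HNSW and IVFFlat are approximate, so results can differ from an exact scan and can shift as the index contents or search parameters change.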

Vector graphs degrade under high-concurrency update workloads. Postgres uses Multi-Version Concurrency Control (MVCC), meaning updates generate new row versions and leave behind dead tuples. In a highly volatile RAG corpus, these dead tuples accumulate before autovacuum can clean them. Queries recheck and discard stale entries, shrinking the effective candidate pool and degrading recall.

Finally, teams operating Postgres vector workloads often skip RAG observability entirely. They deploy vector search without establishing telemetry for embedding drift, recall degradation, or query performance. At minimum, track query latency percentiles and recall against a held-out ground truth set. Without these signals, recall degradation from MVCC bloat or embedding drift is invisible until users complain.
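A minimal sketch of those two signals, assuming you already log query latencies and can run an exact (flat-scan) query for each held-out test query. The function names are illustrative, and the percentile uses the simple nearest-rank definition.

```python
import math


def recall_at_k(ann_ids, exact_ids, k):
    """Fraction of the exact top-k that the ANN query also returned in its top-k."""
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("ground truth is empty")
    found = set(ann_ids[:k]) & truth
    return len(found) / len(truth)


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: IDs from the ANN index versus an exact scan of the same query.
ann_result = [1, 2, 3, 9]
exact_result = [1, 2, 3, 4]
recall = recall_at_k(ann_result, exact_result, k=4)

# Hypothetical latency samples in milliseconds.
latencies_ms = [12.0, 15.0, 11.0, 240.0]
p50 = percentile(latencies_ms, 50)
```

Averaging recall_at_k over a fixed query set on a schedule gives a trend line that makes gradual degradation visible.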

How to fix pgvector filtering and recall loss

The most common failure point in production RAG systems is the post-filtering trap.

Historically, developers applied metadata filters using a standard WHERE clause after the vector index had already retrieved its nearest neighbors. If your query requests ten nearest neighbors, the index retrieves ten. If the post-filter then rejects nine of those results because of strict metadata constraints, your query silently returns a single document. That destroys system recall and starves the language model of context.

How iterative index scans work in pgvector 0.8+

The fix for this arrived in pgvector 0.8 with the introduction of iterative index scans. Rather than retrieving a fixed batch of nearest neighbors, iterative scans keep pulling more candidates from the index until enough rows pass the WHERE clause.

Iterative scans are not automatic. Enable them per session or transaction:

SET hnsw.iterative_scan = strict_order;
-- or
SET hnsw.iterative_scan = relaxed_order;

For IVFFlat indexes, use:

SET ivfflat.iterative_scan = relaxed_order;

The index continues scanning candidates, applying the WHERE clause as it goes, and only ends the scan once it has enough valid, filtered results or reaches configured scan limits.
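The difference between the two behaviors is easy to see in a toy model. The sketch below is a simplification: it treats the index as a list already sorted by distance, whereas pgvector walks the HNSW graph in batches. The max_scan_tuples parameter mirrors the idea of a configured scan limit, and all other names are hypothetical.

```python
def post_filter_top_k(ranked, passes_filter, k):
    """Classic post-filtering: take the k nearest rows, then apply the filter."""
    return [row for row in ranked[:k] if passes_filter(row)]


def iterative_scan_top_k(ranked, passes_filter, k, max_scan_tuples=20_000):
    """Keep pulling candidates in distance order until k rows pass the filter
    or the scan limit is reached."""
    results = []
    for scanned, row in enumerate(ranked, start=1):
        if scanned > max_scan_tuples:
            break  # scan limit hit: return what we have, possibly fewer than k
        if passes_filter(row):
            results.append(row)
            if len(results) == k:
                break
    return results


def is_tenant_7(row):
    return row["tenant"] == 7


# 1,000 rows already ordered by distance; only every 10th row belongs to tenant 7.
ranked = [{"id": i, "tenant": 7 if i % 10 == 0 else 1} for i in range(1000)]

post_filtered = post_filter_top_k(ranked, is_tenant_7, k=10)
iterative = iterative_scan_top_k(ranked, is_tenant_7, k=10)
capped = iterative_scan_top_k(ranked, is_tenant_7, k=10, max_scan_tuples=50)
```

In this model, post-filtering returns a single row, the iterative scan returns the full ten, and the capped scan stops early with five. The last case is the trade-off a scan limit introduces: bounded work, but possibly fewer results than requested.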

How to avoid runaway queries with distance thresholds

Iterative scans introduce a dangerous edge case. If you execute a query using a distance-based filter, for example embedding <-> query < 0.1, alongside highly selective metadata filters, few rows satisfy both conditions, so the engine may traverse far more of the index than expected before it collects enough results.

Two mitigations address this. Inside pgvector, the hnsw.max_scan_tuples setting caps how many candidates the iterative scan will pull before stopping. The default is 20,000. Lower it to cap CPU usage, but be aware that a smaller value may cause the scan to stop before finding enough valid results. From SQL, pre-filter highly selective constraints using a materialized Common Table Expression (CTE) so Postgres resolves the metadata filter first, then runs the vector distance calculation only against the constrained subset.

Partition pruning for multi-tenant vector search

Native partition pruning is your first scaling tactic for filtering. Partition your embedding tables by tenant ID or workspace, and the Postgres query planner skips partitions that do not match the target tenant.

This reduces the vector search space from hundreds of millions of global vectors to the slice owned by a specific tenant. For multi-tenant SaaS, the filtering penalty often becomes much easier to control.

Your second scaling tactic is advanced pre-filtering using external index extensions where your Postgres environment supports them. For example, the pgvectorscale extension implements StreamingDiskANN and supports label-based filtering internally.

Rather than relying entirely on the Postgres query planner to intersect a B-tree metadata index with a vector index, disk-oriented vector indexes can internalize metadata labels and maintain traversal speed without sacrificing as much candidate volume. Availability varies by managed Postgres provider, so verify extension support before designing around it.

Production patterns to scale pgvector: disk-oriented indexes, quantization, and IVFFlat

Scaling past the prototyping phase means moving beyond the default Postgres vector stack to prevent memory exhaustion and maintain throughput. The standard architectural pattern for larger production workloads moves away from pure in-memory HNSW and uses the broader set of available extensions.

Use disk-oriented vector indexes where supported

Where available, pgvectorscale adds StreamingDiskANN, a disk-oriented ANN index designed to reduce the requirement that the vector graph remain entirely in RAM.

This index structure keeps more of the vector graph on solid-state storage instead of shared memory. By compressing the graph and reducing the SSD reads required per traversal, disk-oriented indexes can bypass part of HNSW's RAM ceiling. This can help Postgres serve much larger vector collections than would be practical with full in-memory HNSW alone.

Because extension availability differs across managed Postgres services, check your provider's supported extension list before making DiskANN or pgvectorscale a hard architectural dependency.

Use halfvec (FP16) quantization to cut memory in half

Beyond specialized indexes, quantization is often your most effective lever for scaling.

Deploying the halfvec data type changes the math of your deployment. By downcasting 32-bit floating-point embeddings to 16-bit floats, halfvec delivers an immediate two-fold density gain. Storage drops from four bytes per dimension to two. That cuts your vector storage requirement by 50 percent.

In a standard retrieval-then-rerank pipeline, the recall loss from FP16 quantization is often minimal, though it varies by embedding model and dataset. pgvector's bit type supports binary quantization, cutting storage roughly 32x at the cost of a larger recall hit that a rerank stage may recover.
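The storage ratios can be checked with Python's standard struct module, which supports IEEE half-precision floats through the "e" format code. This sketch measures payload bytes only (the on-disk types add a small header) and shows the rounding error FP16 introduces. The binary quantizer here uses a sign rule, one bit per dimension set when the component is positive, which is an illustrative model rather than pgvector's implementation.

```python
import struct


def pack_fp32(vec):
    """Serialize as little-endian float32: 4 bytes per dimension."""
    return struct.pack(f"<{len(vec)}f", *vec)


def pack_fp16(vec):
    """Serialize as little-endian float16: 2 bytes per dimension."""
    return struct.pack(f"<{len(vec)}e", *vec)


def fp16_roundtrip_error(vec):
    """Largest absolute error after converting each component to FP16 and back."""
    restored = struct.unpack(f"<{len(vec)}e", pack_fp16(vec))
    return max(abs(a - b) for a, b in zip(vec, restored))


def binary_quantize(vec):
    """One bit per dimension (1 if the component is positive), packed into bytes."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits.to_bytes((len(vec) + 7) // 8, "big")


vec = [0.1] * 1536  # stand-in embedding; real embeddings vary per dimension
fp32_bytes = len(pack_fp32(vec))
fp16_bytes = len(pack_fp16(vec))
bit_bytes = len(binary_quantize(vec))
print(fp32_bytes, fp16_bytes, bit_bytes, fp16_roundtrip_error(vec))
```

Byte counts tell you nothing about retrieval quality, so measure recall on your own queries before and after switching types.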

If standard quantization is not enough, use model-native dimensionality controls where available, or validate dimensionality reduction techniques such as PCA before indexing. Shrink vectors before they hit the database.

When IVFFlat is the better choice than HNSW

IVFFlat (Inverted File Flat) is the alternative when HNSW memory overhead is the bottleneck and disk-oriented extensions are unavailable. It ships with pgvector and builds much faster than HNSW.

IVFFlat indexes generally trail HNSW on recall and query latency, but they reduce memory overhead by clustering vectors into distinct lists rather than maintaining a large, interconnected graph. The trade is some recall and speed in exchange for predictable memory usage and simpler operations.

Pivot to an IVFFlat index and tune the lists parameter carefully. Larger values such as 4096 or 8192 may be useful for large datasets, depending on corpus size. Set probes at query time to control the recall-speed trade-off. A common starting point is the square root of the list count. For 4096 lists, that means starting with 64 probes. Tune empirically against your recall target, as the right value varies significantly by dataset.

Combined with quantization, this lets larger datasets run in memory-constrained environments where HNSW would otherwise become expensive or unstable.
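As a starting point for those two parameters, the sketch below encodes the square-root rule for probes described above, plus a commonly cited heuristic for lists (rows / 1000 up to one million rows, the square root of the row count beyond that). Treat both as initial guesses to tune against measured recall. The function names are illustrative.

```python
import math


def suggested_lists(row_count: int) -> int:
    """Starting value for the IVFFlat lists parameter (heuristic, not a rule)."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return math.isqrt(row_count)  # integer square root, rounded down


def suggested_probes(lists: int) -> int:
    """Starting value for probes at query time: roughly sqrt(lists)."""
    return max(1, math.isqrt(lists))


for rows in (500_000, 16_000_000, 100_000_000):
    lists = suggested_lists(rows)
    print(f"{rows:>11,} rows -> lists={lists}, probes={suggested_probes(lists)}")
```

Raising probes improves recall at the cost of latency, so sweep it against your ground-truth set rather than fixing it at the heuristic value.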

How to implement hybrid search in Postgres using RRF

Relying solely on vector embeddings for retrieval introduces blind spots. Exact keyword matches, serial numbers, and domain-specific acronyms get missed.

Hybrid retrieval combines semantic vector search with traditional lexical search, and is often required for high-accuracy RAG pipelines, especially when exact identifiers, acronyms, product names, or domain terms matter. You do not need to export data to an external search engine to achieve this. Postgres provides these patterns natively.

The foundation of native hybrid retrieval uses tsvector paired with Generalized Inverted Indexes (GIN). The tsvector type handles lexical keyword retrieval, and ts_rank scores matches by term frequency and position without corpus-wide statistics such as inverse document frequency. Full BM25 ranking requires an extension such as vchord_bm25, ParadeDB, or a custom implementation. You then execute your vector distance calculation to generate the semantic candidate list.

Lexical rankings and semantic distances operate on different mathematical scales, so they cannot be safely summed.

Merging the results requires Reciprocal Rank Fusion (RRF). RRF combines candidate lists by analyzing ordinal ranks rather than raw scores.

Implementing RRF in Postgres requires window functions and full outer joins. Generate a materialized CTE for the top-N lexical results, a second materialized CTE for the top-N vector results, and fuse them with the standard formula: sum(1.0 / (60 + rank)), where rank is a document's position in each list and 60 is the conventional smoothing constant. Write the numerator as 1.0 so SQL does not fall back to integer division. This rank-based fusion provides resilient retrieval that degrades gracefully if either the semantic or lexical signal is weak.
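The fusion step itself is small enough to show in full. This Python sketch applies the same formula to two already-ranked ID lists. It illustrates the arithmetic rather than replacing the SQL, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked ID lists using sum(1 / (k + rank)).

    ranked_lists: iterable of lists, each ordered best-first.
    Returns IDs ordered by fused score, ties broken by ID for determinism.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical_hits = ["doc3", "doc1", "doc7"]  # e.g. ordered by ts_rank
vector_hits = ["doc1", "doc4", "doc3"]   # e.g. ordered by vector distance
fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
```

A document that appears in only one list still receives a score, which is the behavior the full outer join provides in the SQL version.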

A second option for the lexical leg of hybrid search is sparse vectors. The sparsevec data type, shipped in pgvector 0.7.0, provides first-class support for sparse representations such as SPLADE output or BM25-weighted vectors generated externally.

Instead of maintaining parallel text-search structures, sparsevec stores only the non-zero elements of a mostly-zero vector, written as {index:value,...}/dimensions. Developers can index both dense semantic vectors and sparse lexical vectors using unified vector operations. This can simplify some sparse-retrieval patterns, but it does not automatically replace BM25 or a full inverted lexical index. Validate it against your vocabulary size, scoring model, and recall targets.

Postgres settings for vector search: work_mem, maintenance_work_mem, autovacuum, and pooling

Poorly configured vector operations are a common cause of AI agents failing in production. Vector workloads place heavy strain on memory allocators and background processes. You need precise postgresql.conf tuning to prevent lockups and latency spikes.

Index build settings: maintenance_work_mem and parallel workers

The most critical parameter during index builds is maintenance_work_mem. HNSW graph construction is memory-bound.

Left at the default setting, building an index on millions of high-dimensional vectors can spill to disk and slow dramatically. Increase maintenance_work_mem during index builds based on available RAM and concurrency. For large builds, values in the 8 GB to 16 GB range may be appropriate on sufficiently large instances. Also increase max_parallel_maintenance_workers so the index build uses multiple CPU cores effectively.

Query memory settings: work_mem and concurrency math

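Developers frequently fall into the work_mem multiplier trap. work_mem sets the memory available to a single query operation, such as a sort or hash, before it spills to disk. It is a per-operation limit, not a per-connection one: a plan with several sort or hash nodes can use that amount several times over, in every connection running it.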

Tune work_mem to 1 gigabyte to speed up a query, and a complex execution plan across 100 concurrent connections can demand far more memory than expected. That can trigger an Out-Of-Memory crash that kills the database process.

Set work_mem conservatively, typically between 16 megabytes and 64 megabytes unless you have tested the exact workload and concurrency model. Tune autovacuum_work_mem separately to prevent background cleanup processes from starving active memory.
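The multiplier effect is simple to quantify. The sketch below computes a worst-case bound under the assumption that every connection runs a plan with a fixed number of memory-hungry operations at the same time. Real usage is usually lower, and factors such as parallel workers are not modeled.

```python
def worst_case_work_mem_gb(work_mem_mb, connections, memory_ops_per_query):
    """Upper bound on memory requested through work_mem, in GiB.

    memory_ops_per_query is an assumed count of sort/hash nodes per plan.
    """
    return work_mem_mb * connections * memory_ops_per_query / 1024


aggressive = worst_case_work_mem_gb(work_mem_mb=1024, connections=100, memory_ops_per_query=3)
conservative = worst_case_work_mem_gb(work_mem_mb=64, connections=100, memory_ops_per_query=3)
print(f"work_mem=1GB:  up to {aggressive:.2f} GiB")
print(f"work_mem=64MB: up to {conservative:.2f} GiB")
```

If a specific query needs more memory, raising work_mem for that session or transaction is safer than raising the server-wide default.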

Pooling for vector workloads

Connection pooling is useful for applications that open many short-lived connections, especially serverless or ephemeral compute environments. Managed Postgres by ClickHouse includes bundled PgBouncer for server-side connection pooling.

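Be careful with prepared statements and long-running session state. PgBouncer transaction pooling does not preserve state across transactions, so session-level SET values, SQL-level PREPARE statements, and advisory locks are unreliable. That includes a session-level SET hnsw.iterative_scan; use SET LOCAL inside the transaction that runs the query instead. Protocol-level prepared statements are supported in transaction mode from PgBouncer 1.21 onward when max_prepared_statements is configured. If your vector workload depends on session state, connect directly to Postgres on port 5432.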

You also need to give the query planner accurate information. The effective_cache_size parameter must accurately reflect your available memory. If this setting is misconfigured, the Postgres query planner can reason incorrectly about the cost of random disk fetches versus memory hits during iterative scans, and may default to expensive sequential scans.

For operational ingestion, use the COPY command rather than streaming individual INSERT statements to reduce transaction overhead.

When building or rebuilding vector indexes on production tables, use CREATE INDEX CONCURRENTLY where supported. A plain CREATE INDEX takes a lock that blocks writes to the table for the duration of the build; the concurrent form avoids that at the cost of a slower build.

Managed Postgres for AI workloads: why local NVMe matters for vector indexes and analytical telemetry

Scaling vector search and RAG systems effectively requires infrastructure designed for low-latency operational workloads and high-volume analytical feedback loops. Managed Postgres by ClickHouse is the Unified Data Stack approach: vanilla Postgres handles OLTP as the system of record, while ClickHouse handles OLAP through ClickPipes CDC and the pg_clickhouse unified query layer in a single ClickHouse Cloud experience.

The primary differentiator for vector workloads is the hardware storage advantage.

Standard managed database services like AWS RDS and Aurora rely on network-attached storage: Elastic Block Store (EBS) volumes for RDS, and a distributed storage layer for Aurora. A disk read that misses cache pays a network round trip, which can add millisecond-level latency. For standard OLTP workloads, this is manageable. For graph traversals that chain many dependent random reads, the latency compounds.

Managed Postgres by ClickHouse uses local NVMe-backed storage physically colocated with compute, delivering microsecond-level disk latency, high IOPS, and consistent performance without network-storage jitter. When an iterative scan traverses an SSD-based vector graph, the local NVMe storage prevents the I/O subsystem from bottlenecking the CPU. This is the difference between a complex hybrid search executing in single-digit milliseconds versus timing out under disk pressure.

ClickHouse's public, reproducible PostgresBench results show the advantage on I/O-heavy transactional workloads. On one configuration:

Service	TPS
Managed Postgres by ClickHouse	28,668
AWS Aurora	12,628
AWS RDS	8,133

Managed Postgres by ClickHouse also supports a curated set of 100+ PostgreSQL extensions, including vector/pgvector 0.8.1 for vector data types plus IVFFlat and HNSW indexes, vchord for vector indexing, and vchord_bm25 for BM25 full-text search. For the RAG observability discussed above, Managed Postgres by ClickHouse also ships pg_stat_ch, which surfaces per-query telemetry (timings, rows, buffers, WAL, and errors) in the ClickHouse Cloud UI with no setup, making it easier to track the query-latency percentiles that production RAG monitoring depends on.

Beyond hardware, pg_clickhouse pairs Postgres with ClickHouse for analytics on the same platform. Production RAG is an analytical problem. You need to compute embedding drift, audit flat-scan recall matrices against ANN results, and aggregate massive logs of session-level A/B testing data to know whether your AI responses are actually improving. Without that continuous monitoring, silent recall failures pollute the context window of your AI agents and output quality degrades with no system alerts.

Running heavy GROUP BY aggregations on your primary Postgres node consumes CPU that user-facing vector queries and transactional workloads may need.

pg_clickhouse is a Postgres foreign data wrapper that exposes ClickHouse tables as foreign tables inside Postgres. It lets teams run analytical queries through a familiar Postgres interface while pushing supported work down to ClickHouse's columnar engine.

The data gets into ClickHouse via ClickPipes, the managed CDC pipeline that streams Postgres changes into ClickHouse with seconds-level latency. Your AI agents continue writing operational state to Postgres, while analytical workloads such as embedding-drift dashboards, recall audits, and session A/B aggregations can run against ClickHouse instead of competing directly with OLTP queries on the primary Postgres instance.

The result is a unified Postgres + ClickHouse stack: Postgres remains the transactional system of record, ClickHouse handles analytics, and teams can keep operational and analytical workflows connected without turning Postgres into an overloaded analytical engine.

When Postgres is not enough for vector search: latency, scale, and write churn thresholds

Even with managed Postgres optimizations and the analytical pairing, Postgres has limits for the vector workload itself. Engineering leaders need to recognize the specific breaking points where forcing Postgres to act as a pure vector database starts costing more in latency, cost, or migration risk than it returns.

When you breach these thresholds, a hybrid architecture using a dedicated vector database is the right move.

Latency threshold: when you need sub-20 ms p99

The first breakpoint is strict latency. If your SLAs demand sub-20-millisecond p99 latency for complex, high-dimensional vector searches under heavy concurrency, treat that as a strong signal to benchmark Postgres carefully against purpose-built vector-serving systems.

Dedicated vector databases like Weaviate and Qdrant are built specifically for vector-serving workloads. Pinecone is a managed service engineered for the same workload profile.

Scale threshold: hundreds of millions vs. billions of vectors

The second breakpoint is scale. While tuned Postgres vector search can reach large datasets, operating at multibillion-vector scale usually requires specialized distributed architectures.

If your workload requires GPU-accelerated Approximate Nearest Neighbor search or multi-region, active-active read/write synchronization for vector embeddings, Postgres native replication and CPU-bound index traversal will not be sufficient.

The most common trigger for migrating to hybrid architecture, however, is full-corpus write churn. Embedding models change. When you upgrade from an older embedding model to a newer, higher-dimension model, you may need to re-embed your entire historical corpus.

In Postgres, the MVCC architecture means updating one billion vectors generates one billion new row versions and leaves old versions behind for vacuum to clean. Regular autovacuum can mark space for reuse without blocking reads or writes, but actually shrinking the table requires heavier operations such as VACUUM FULL or pg_repack, which can be disruptive. Dedicated vector databases are often better suited to massive re-indexing and write-churn cycles.

CDC hybrid pattern: Postgres system of record + dedicated vector database

When these thresholds are met, the recommended solution is the CDC hybrid pattern. You reposition Postgres rather than abandoning it.

Postgres remains the system of record. If you are using Managed Postgres by ClickHouse, that role stays unchanged. Your application writes core entities, user metadata, and document text to the same Postgres instance, and the analytical layer continues running on the paired ClickHouse service via ClickPipes and pg_clickhouse. Only the hot vector path moves to the dedicated vector database.

You then use Change Data Capture pipelines, such as Debezium, or a custom logical-replication worker, to stream structural changes downstream to the embedding pipeline and the vector database. The worker generates embeddings via your embedding service of choice and writes the resulting vectors into the dedicated vector database. The vector database then serves high-throughput, low-latency semantic queries against those embeddings.

Your application queries the vector database for targeted document IDs, then fetches authoritative relational metadata from Postgres using indexed primary key lookups.
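A minimal sketch of that read path, using in-memory stand-ins for both systems. The function names, the (doc_id, score) hit format, and the row shape are all hypothetical; a real implementation would call your vector database client and run a primary-key query against Postgres.

```python
def hybrid_lookup(vector_hits, fetch_rows_by_id):
    """Join vector-search hits with authoritative rows from the system of record.

    vector_hits: list of (doc_id, score) pairs, best first.
    fetch_rows_by_id: callable taking a list of IDs and returning {id: row}.
    """
    ids = [doc_id for doc_id, _score in vector_hits]
    rows = fetch_rows_by_id(ids)
    # Keep the vector ranking. Skip IDs Postgres no longer has, which can
    # happen when a delete has not yet propagated through the CDC pipeline.
    return [rows[doc_id] for doc_id in ids if doc_id in rows]


# Stand-in for Postgres: document 2 was deleted after the vector store indexed it.
postgres_rows = {
    1: {"id": 1, "title": "Runbook"},
    3: {"id": 3, "title": "Design doc"},
}


def fetch_rows_by_id(ids):
    return {i: postgres_rows[i] for i in ids if i in postgres_rows}


hits = [(3, 0.91), (2, 0.88), (1, 0.80)]  # stand-in for a vector database response
documents = hybrid_lookup(hits, fetch_rows_by_id)
```

Treating Postgres as the final filter means a stale vector entry costs one wasted candidate rather than a stale answer.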

In this hybrid architecture, the analytical layer remains unified. Even though hot-path retrieval occurs in an external vector database, your interaction logs, semantic telemetry, and application state continue flowing through Postgres and ClickHouse.

Your engineering team can still run complex RAG observability analytics in SQL across the system state. Decoupling your retrieval engine does not have to fracture operational visibility.

Checklist: how to scale pgvector now, and when to move to a hybrid vector database

Scaling vector search inside Postgres is a workload-dependent problem, not a fixed hardware threshold.

Move beyond default implementations. Use pgvector 0.8+ iterative scans to improve filtered recall, apply halfvec or binary quantization where quality allows, partition by tenant or workspace when possible, and evaluate disk-oriented vector indexes where your environment supports them. You can scale advanced AI workloads further inside Postgres than many teams expect.

Out-of-memory crashes and runaway queries are avoidable once you can tell when to optimize your current stack versus when your latency and data scale demand a dedicated vector store.

If you're building high-throughput RAG agents, look at Managed Postgres by ClickHouse for a unified, low-latency Postgres + ClickHouse platform that pairs transactional Postgres with analytical ClickHouse infrastructure.

Before writing complex hybrid search infrastructure, do a deep architecture review. Benchmark your true recall requirements, map your metadata filter cardinality, establish a firm latency budget, and make sure you're building the right system for your specific scale.

Join the private preview for Managed Postgres by ClickHouse, run your RAG pipeline against real data at production scale, and see what the stack looks like when Postgres and analytics are designed to work together.

Frequently asked questions

How many vectors can pgvector handle in Postgres?

It depends on index type, dimensionality, RAM, storage, filtering patterns, and latency targets. HNSW typically becomes harder to operate as you move from millions into tens of millions of vectors because the graph benefits from staying memory-resident. Beyond that, you usually need quantization, partitioning, disk-oriented ANN where supported, or a hybrid architecture.

Why does HNSW in pgvector require so much RAM?

HNSW stores a multi-layer neighbor graph with pointer and metadata overhead, not just the raw vectors. In practice, index overhead can be significant, so memory grows quickly and performance degrades when the active graph cannot stay resident.

Why does vector search recall drop when I add metadata filters?

If you retrieve top-K neighbors first and apply filters afterward, most candidates can be filtered out, leaving too few results. The fix is to apply filters during retrieval with iterative index scans or pre-filter the candidate set before computing distances.

What are iterative index scans in pgvector and when should I use them?

Iterative index scans, introduced in pgvector 0.8, keep scanning the index until they find enough rows that match your WHERE clause or reach configured scan limits. Use them when you need better results with metadata filters, but avoid combining very strict distance thresholds with highly selective filters without pre-filtering or scan limits.

What is pgvectorscale and why use it with Postgres?

pgvectorscale is a Postgres extension from Timescale that adds StreamingDiskANN, a disk-oriented ANN index. It can reduce the need to keep the full HNSW-style graph in RAM. It is a useful Postgres extension where supported, but availability varies by managed Postgres provider.

When should I use IVFFlat instead of HNSW in Postgres?

Use IVFFlat when HNSW memory overhead becomes the bottleneck and you can trade some recall and latency for lower memory usage. IVFFlat can be a pragmatic option at larger scales when tuned carefully and paired with quantization.

Does halfvec hurt retrieval quality for RAG?

Usually not much in a retrieval-then-rerank pipeline, but you should test it. halfvec cuts vector storage in half by using FP16 rather than FP32 and often has minimal impact on final answer quality when a stronger reranker is used.

What Postgres settings matter most for pgvector performance?

The biggest levers are maintenance_work_mem for index build speed, work_mem for avoiding OOM under concurrency, and autovacuum tuning to prevent dead tuples from degrading performance. Connection pooling configuration also matters, especially if your application relies on session state or prepared statements.

When do I need a hybrid architecture with Postgres and a dedicated vector database?

Move to hybrid when you need sub-20 ms p99 under heavy concurrency, GPU ANN, multi-region active-active vector serving, or billions of vectors. Hybrid also helps when frequent re-embedding would cause severe MVCC bloat in Postgres.

What is the CDC pattern for syncing Postgres to a vector database?

Postgres remains the system of record, and CDC streams changes to a vector store that serves semantic queries. Your app queries the vector database for IDs, then fetches authoritative metadata and rows from Postgres by primary key for final assembly.
