Vector Databases: Selection, Architecture, and Optimization for AI Engineering
When building a RAG pipeline in production, choosing the right vector database is not an implementation detail - it is an architectural decision that directly impacts latency, operational costs, recall accuracy, and system scalability. The vector database market exceeded $2.65 billion in 2025 and is projected to reach $8.9 billion by 2030 (a 27.5% CAGR). The explosion in options has made selection increasingly complex.
This article is not a marketing overview of commercial features. It is a technical deep dive into how vector databases work internally, which indexing algorithms they use, and how to configure and optimize them for real workloads. We will analyze Qdrant, Pinecone, Milvus, and Weaviate on concrete engineering dimensions: HNSW architecture, quantization strategies, filtered search, DiskANN vs in-memory tradeoffs, and parameter tuning to hit the recall/latency targets your application requires.
Whether you are building a RAG system handling millions of documents with sub-50ms latency and 95%+ recall, or optimizing an existing system consuming too much memory, this article gives you the conceptual and practical tools to make informed decisions.
What You Will Learn
- Internal architecture of vector databases: how HNSW works at the algorithmic level
- IVF vs HNSW vs DiskANN: when to use which algorithm and why
- Scalar, product, and binary quantization: memory/accuracy tradeoffs
- Filtered vector search: pre-filtering, post-filtering, and the narrow filter problem
- Practical configuration of Qdrant, Milvus, and Pinecone with working code examples
- Benchmarking and production tuning: measuring and improving QPS and recall
Internal Architecture: How a Vector Database Works
A vector database differs from a relational database not only in the data it stores, but in the fundamental operation it must optimize. Instead of exact lookups on discrete keys, it performs Approximate Nearest Neighbor (ANN) search across high-dimensional spaces, typically 768-4096 dimensions for modern LLM embeddings.
Exact k-nearest neighbor search (kNN) has O(n*d) complexity where n is the number of vectors and d is the dimensionality. With 10 million vectors at 1536 dimensions (OpenAI ada-002 standard), an exact query would require roughly 15 billion floating-point operations - completely unacceptable for a real-time system. All modern vector databases therefore use ANN algorithms that sacrifice some recall to gain orders of magnitude in speed.
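To make the O(n*d) cost concrete, here is a minimal brute-force kNN in NumPy, scaled down to 10k vectors so it runs quickly; the function and variable names are illustrative:

```python
import numpy as np

def exact_knn(data: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Brute-force kNN on normalized vectors: one dot product per row -> O(n*d)."""
    scores = data @ query                  # n*d multiply-adds
    return np.argsort(scores)[::-1][:k]   # full sort for simplicity

# Scaled down from the 10M x 1536 example: same math, 1000x fewer vectors
n, d = 10_000, 1536
data = np.random.randn(n, d).astype(np.float32)
data /= np.linalg.norm(data, axis=1, keepdims=True)
query = data[0]                            # use a stored vector as the query

top = exact_knn(data, query, k=10)
assert top[0] == 0                         # the query itself ranks first
print(f"~{n * d:,} multiplications per query; ~15 billion at 10M vectors")
```

At 10M vectors every query repeats this work 1000x over, which is why ANN indexes trade a little recall for orders of magnitude less computation.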
The internal stack of a vector database operates across several layers:
- Storage layer: management of compressed vectors on disk or in memory, with mmap support for efficient access
- Index layer: ANN data structure (HNSW, IVF, DiskANN) for navigating the vector space
- Payload/metadata layer: scalar attributes associated with vectors for filtering
- Query planner: decides the optimal strategy combining vector search and payload filtering
- Replication/sharding layer: for distributed systems like Milvus or Pinecone
HNSW: Algorithmic Deep Dive
Hierarchical Navigable Small World (HNSW) is the dominant ANN algorithm in 2025, used by default in Qdrant and Weaviate and available in Milvus. Understanding its internal mechanics is essential for correct configuration.
HNSW builds a multi-layer hierarchical graph. At the highest layer, there are few nodes strongly connected to each other (the "hubs"), while lower layers become progressively denser until layer 0 which contains all vectors. During search, the algorithm starts at the top and descends through layers, progressively refining the nearest neighbor candidates. This approach is inspired by the "small world" phenomenon in social graphs: from any node, you can reach any other in just a few hops thanks to long-range connections.
The three fundamental HNSW parameters are:
- M (default 16): maximum number of bidirectional edges per node. Typical range: 8-64. Increasing M improves recall but grows memory and build time. For high-dimensional datasets (1536+), M=32-64 yields good results.
- efConstruction (default 100-200): size of the candidate list during index construction. Does not affect final index size, but determines connection quality. Higher values produce a better index but slower build time. Recommended range: 200-400 for high quality.
- ef (or efSearch, configurable at runtime): candidate list size during query. Must be >= k (number of results requested). Increasing ef improves recall but increases latency. Typical range: 50-500.
The fundamental tradeoff: M and efConstruction determine index quality (a costly one-time operation), while ef balances recall vs latency at query time and can be changed dynamically without rebuilding the index.
```python
# HNSW configuration in Qdrant - practical examples with explicit tradeoffs
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance,
    HnswConfigDiff, SearchParams
)

client = QdrantClient(url="http://localhost:6333")

# --- HIGH-RECALL configuration for critical RAG ---
# Target: recall >= 0.98, latency acceptable up to 50ms
# Cost: ~4x memory compared to baseline
client.recreate_collection(
    collection_name="rag_high_recall",
    vectors_config=VectorParams(
        size=1536,                  # OpenAI text-embedding-3-small
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=64,                       # High connectivity: better recall but +memory
        ef_construct=400,           # Slow build but high-quality index
        full_scan_threshold=10000,  # Below this size (in KB) fall back to brute force
        on_disk=False               # In-memory for minimum latency
    )
)

# --- BALANCED configuration for typical production ---
# Target: recall >= 0.95, latency < 20ms, optimized memory
client.recreate_collection(
    collection_name="rag_balanced",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=32,                       # Good recall/memory balance
        ef_construct=200,
        full_scan_threshold=5000,
        on_disk=False
    )
)

# --- LOW-LATENCY configuration for real-time ---
# Target: latency < 5ms, acceptable recall >= 0.90
client.recreate_collection(
    collection_name="rag_fast",
    vectors_config=VectorParams(
        size=768,                   # Compact embeddings (all-MiniLM-L6-v2)
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,
        ef_construct=128,
        full_scan_threshold=1000,
        on_disk=False
    )
)

# Configure ef at query time (more flexible than rebuild)
# query_embedding: 1536-dim embedding of the user query, computed upstream
results = client.search(
    collection_name="rag_balanced",
    query_vector=query_embedding,
    limit=10,
    search_params=SearchParams(
        hnsw_ef=128,                # Increase recall without index rebuild
        exact=False
    )
)
for hit in results:
    print(f"Score: {hit.score:.4f} | ID: {hit.id}")
```
IVF vs HNSW vs DiskANN: Choosing the Right Algorithm
HNSW is not the only available indexing algorithm. The choice strongly depends on memory constraints, dataset size, and update patterns.
IVF (Inverted File Index)
IVF partitions the vector space into nlist clusters via k-means.
During query, it searches only the nprobe clusters nearest to the query vector.
Key parameters are nlist (cluster count) and nprobe (clusters to inspect).
Recommended empirical formula: nlist = 4 * sqrt(n_vectors).
IVF pros: fast build, moderate memory, good for static datasets. IVF cons: requires re-clustering on significant data changes, lower recall than HNSW for the same computational budget, cold start on new collections.
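The sizing heuristic above can be wrapped in a small helper. The function name and the nprobe fractions below are illustrative defaults for this sketch, not official recommendations:

```python
import math

def suggest_ivf_params(n_vectors: int, recall_target: str = "balanced") -> dict:
    """Heuristic IVF sizing: nlist = 4 * sqrt(n); nprobe as a fraction of nlist."""
    nlist = max(1, round(4 * math.sqrt(n_vectors)))
    # Fraction of clusters to probe: more clusters inspected -> higher recall
    fraction = {"fast": 0.01, "balanced": 0.05, "high_recall": 0.15}[recall_target]
    nprobe = max(1, round(nlist * fraction))
    return {"nlist": nlist, "nprobe": nprobe}

print(suggest_ivf_params(1_000_000))          # {'nlist': 4000, 'nprobe': 200}
print(suggest_ivf_params(1_000_000, "fast"))  # {'nlist': 4000, 'nprobe': 40}
```

Treat these as starting points and sweep nprobe against measured recall, exactly as you would sweep ef for HNSW.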
HNSW
As described above, builds a multi-layer graph. The most versatile algorithm, used by default in most vector databases.
HNSW pros: excellent recall-speed tradeoff, native support for incremental updates, parameters configurable at query time. HNSW cons: requires the entire index to fit in RAM, becomes prohibitive beyond 50-100M vectors on standard hardware.
DiskANN
Developed by Microsoft Research, DiskANN is designed for datasets that do not fit in RAM. It keeps a compact navigable graph structure in memory while storing full vectors on NVMe SSDs. With PCIe Gen5 hardware, it maintains 95%+ recall and sub-10ms latency at billion-vector scale, with 10-20x lower DRAM cost compared to equivalent HNSW.
DiskANN pros: scales to billions of vectors on commodity hardware, reduced operational cost. DiskANN cons: requires fast NVMe SSDs, higher latency than in-memory HNSW, the base implementation is immutable (FreshDiskANN handles updates). Available in Milvus, Azure PostgreSQL, and increasingly other systems.
Beware the "HNSW for Everything" Anti-Pattern
Many teams configure in-memory HNSW for 50M+ vector datasets, then find themselves running 256GB RAM instances at unsustainable cost. The practical rule: if your dataset exceeds 10-20M vectors and you do not have ultra-low latency requirements (<5ms), seriously evaluate DiskANN or aggressive quantization before adding hardware.
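A back-of-the-envelope estimator helps spot this trap before provisioning hardware. The formula below (raw vectors plus graph links, ~2M links at layer 0 plus ~M for upper layers) is an approximation for this sketch, not any vendor's official sizing; real engines add per-segment and metadata overhead on top:

```python
def hnsw_memory_gb(n_vectors: int, dim: int, m: int = 16,
                   bytes_per_component: int = 4) -> float:
    """Rough in-memory HNSW footprint: raw float32 vectors + graph links.

    Layer 0 keeps ~2*M links per node and upper layers add ~M more on
    average; each link is modeled as a 4-byte node id.
    """
    vector_bytes = n_vectors * dim * bytes_per_component
    link_bytes = n_vectors * (2 * m + m) * 4
    return (vector_bytes + link_bytes) / 1024**3

# The "HNSW for everything" trap: 50M OpenAI-sized vectors at M=32
print(f"{hnsw_memory_gb(50_000_000, 1536, m=32):.0f} GB of RAM")  # ~304 GB
```

Even before replication, this is multiple 256GB-class machines; the same dataset under DiskANN or aggressive quantization needs a fraction of the DRAM.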
```python
# IVF_FLAT and HNSW configuration in Milvus - practical comparison
from pymilvus import (
    connections, Collection, CollectionSchema,
    FieldSchema, DataType
)

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="timestamp", dtype=DataType.INT64),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536)
]
schema = CollectionSchema(fields, description="RAG document collection")
collection = Collection("rag_docs", schema)

# --- IVF_FLAT: for static or near-static datasets ---
ivf_index_params = {
    "metric_type": "COSINE",
    "index_type": "IVF_FLAT",
    "params": {
        "nlist": 4096       # For 1M vectors: 4 * sqrt(1M) ≈ 4000
    }
}

# --- HNSW: for datasets with frequent updates ---
hnsw_index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {
        "M": 32,
        "efConstruction": 256
    }
}

# --- DiskANN: for datasets >50M vectors ---
# Build parameters are managed internally by Milvus;
# search_list is a query-time parameter (see below), analogous to HNSW ef
diskann_index_params = {
    "metric_type": "L2",
    "index_type": "DISKANN",
    "params": {}
}

collection.create_index(
    field_name="embedding",
    index_params=hnsw_index_params
)
collection.load()

# Query parameters for HNSW
search_params_hnsw = {
    "metric_type": "COSINE",
    "params": {
        "ef": 256           # Increase for more recall, decrease for more speed
    }
}

# Query parameters for IVF
search_params_ivf = {
    "metric_type": "COSINE",
    "params": {
        "nprobe": 64        # nprobe/nlist = fraction of clusters inspected
    }
}

# Query parameters for DiskANN
search_params_diskann = {
    "metric_type": "L2",
    "params": {
        "search_list": 100  # Candidate list size at query time
    }
}

# query_embedding: 1536-dim embedding of the user query, computed upstream
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params_hnsw,
    limit=10,
    output_fields=["content", "category", "timestamp"]
)
for hit in results[0]:
    print(f"Distance: {hit.distance:.4f} | Category: {hit.entity.get('category')}")
```
Vector Quantization: Compressing Without Losing Recall
Quantization is the most powerful technique for reducing vector database memory usage, with controllable impact on search quality. A float32 vector with 1536 dimensions occupies 6144 bytes (6KB). With quantization, we can reduce it to 384 bytes or less.
Scalar Quantization (SQ)
Maps each float32 value (4 bytes) to int8 (1 byte), achieving 4x compression. The algorithm analyzes the value distribution in each dimension and determines an optimal quantization range. Distances are computed directly on int8 values, which are computationally simpler than float32 operations. Typical recall loss: 1-3% vs float32.
When to use: recommended starting point for any deployment, 4x reduction with minimal quality impact. Supported by Qdrant, Milvus (IVF_SQ8), and Weaviate.
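The mechanics are easy to reproduce. This NumPy sketch mirrors the concept (quantile-based range, symmetric int8 mapping), not any database's exact implementation:

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray, quantile: float = 0.99):
    """Map float32 -> int8 with a symmetric range taken at the given quantile."""
    bound = np.quantile(np.abs(vectors), quantile)   # clip the outlier tail
    scale = float(bound) / 127.0
    q = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

vecs = np.random.randn(1000, 1536).astype(np.float32)
q, scale = scalar_quantize(vecs)
err = float(np.abs(dequantize(q, scale) - vecs).mean())
print(f"{vecs.nbytes // q.nbytes}x smaller, mean abs reconstruction error {err:.4f}")
```

The quantile clip is the same idea as Qdrant's `quantile` setting: sacrificing the extreme 1% of values buys a tighter range, and therefore finer resolution, for the other 99%.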
Product Quantization (PQ)
Divides the vector into m sub-vectors and quantizes each against a codebook of 2^nbits entries. Typical compression: 16-64x. Each vector is represented as a sequence of codebook indices. Approximate distances are computed via precomputed lookup tables (ADC - Asymmetric Distance Computation).
PQ tradeoff: aggressive compression (10-50MB instead of 10GB) but significant recall loss (5-15%). Requires codebook training on the dataset. Useful for huge datasets where memory is the primary constraint.
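A toy PQ trainer and encoder makes the sub-vector/codebook mechanics concrete. Everything here (the tiny Lloyd's k-means, the parameter choices) is illustrative and far simpler than production implementations:

```python
import numpy as np

def train_pq(data, m=8, nbits=4, iters=10):
    """Split dims into m sub-vectors; k-means each with 2^nbits centroids."""
    n, d = data.shape
    sub, k = d // m, 2 ** nbits
    codebooks = []
    for j in range(m):
        chunk = data[:, j*sub:(j+1)*sub]
        cents = chunk[np.random.choice(n, k, replace=False)].copy()
        for _ in range(iters):  # a few Lloyd iterations per subspace
            assign = np.argmin(((chunk[:, None] - cents[None]) ** 2).sum(-1), axis=1)
            for c in range(k):
                pts = chunk[assign == c]
                if len(pts):
                    cents[c] = pts.mean(0)
        codebooks.append(cents)
    return codebooks

def pq_encode(x, codebooks):
    """Each sub-vector becomes the index of its nearest centroid."""
    sub = len(x) // len(codebooks)
    return np.array([np.argmin(((cb - x[j*sub:(j+1)*sub]) ** 2).sum(-1))
                     for j, cb in enumerate(codebooks)], dtype=np.uint8)

data = np.random.randn(2000, 64).astype(np.float32)
cbs = train_pq(data, m=8, nbits=4)
code = pq_encode(data[0], cbs)
# 64 floats (256 bytes) -> 8 one-byte codes (4 bytes if 4-bit codes were packed)
print(f"256 bytes -> {code.nbytes} bytes")
```

The ADC lookup-table trick follows from this representation: precompute the distance from the query's sub-vectors to every centroid once, then each candidate's approximate distance is just m table lookups.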
Binary Quantization (BQ)
Reduces each dimension to 1 bit: the most compressed representation possible. A 1536-dimension vector becomes 192 bytes (32x compression vs float32). Distance is computed using Hamming distance (XOR + popcount), an extremely fast operation on modern CPUs. Qdrant reports speedups of up to 40x on distance calculations.
However, Binary Quantization only works well for embeddings with specific properties: values must be symmetrically distributed around zero (a property satisfied by OpenAI ada-002, Cohere embed-v3, and e5 models). For embeddings with asymmetric distributions, the recall drop can be severe (15-30%).
Qdrant introduced 1.5-bit and 2-bit quantization in 2025, offering a middle ground between scalar (4x) and binary (32x) compression.
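The core binary operations are easy to reproduce in NumPy (sign binarization, packed bits, XOR + popcount); this is a conceptual sketch, not Qdrant's implementation:

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """1 bit per dimension: keep only the sign, packed 8 dims per byte."""
    return np.packbits(vectors > 0, axis=1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance over packed bytes: XOR then popcount."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

vecs = np.random.randn(1000, 1536).astype(np.float32)
codes = binarize(vecs)                 # 1536 dims -> 192 bytes each
print(codes.shape)                     # (1000, 192)

q = binarize(vecs[:1])[0]
dists = [hamming(q, c) for c in codes]
assert dists[0] == 0                   # identical codes: distance 0
```

This also shows why the symmetric-distribution requirement matters: binarization keeps only the sign of each component, so embeddings whose values skew to one side collapse into near-identical codes.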
```python
# Quantization configuration in Qdrant - all types
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
    ProductQuantization, ProductQuantizationConfig, CompressionRatio,
    BinaryQuantization, BinaryQuantizationConfig,
    SearchParams, QuantizationSearchParams
)

client = QdrantClient(url="http://localhost:6333")

# --- 1. Scalar Quantization (SQ8) ---
# 4x compression, recall loss ~1-3%
# RECOMMENDED: best starting point
client.recreate_collection(
    collection_name="rag_sq8",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,     # Use 99th percentile to define the range
            always_ram=True    # Keep quantized vectors in RAM (+speed)
        )
    )
)

# --- 2. Product Quantization (PQ) ---
# 16-64x compression, recall loss 5-15%
# FOR huge datasets (>100M vectors) with severe memory constraints
client.recreate_collection(
    collection_name="rag_pq",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    quantization_config=ProductQuantization(
        product=ProductQuantizationConfig(
            compression=CompressionRatio.X16,
            always_ram=True
        )
    )
)

# --- 3. Binary Quantization (BQ) ---
# 32x compression, 40x speedup, variable recall loss
# ONLY for embeddings with symmetric distribution (OpenAI, Cohere)
client.recreate_collection(
    collection_name="rag_binary",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(
            always_ram=True
        )
    )
)

# rescore=True: use BQ for candidate generation, then
# recompute exact distances with float32 on top-k candidates
def search_with_rescore(client, collection_name, query_vector, limit=10):
    return client.search(
        collection_name=collection_name,
        query_vector=query_vector,
        limit=limit,
        search_params=SearchParams(
            quantization=QuantizationSearchParams(
                ignore=False,
                rescore=True,      # Final rescore with float32
                oversampling=3.0   # Fetch 3x candidates for rescore
            )
        )
    )

# Typical benchmark results (commodity hardware, 1M vectors, 1536-dim)
# float32: recall=1.00, p95_latency=45ms, memory=6.1GB
# SQ8:     recall=0.98, p95_latency=18ms, memory=1.6GB  <- sweet spot
# PQ16:    recall=0.91, p95_latency=8ms,  memory=0.4GB
# Binary:  recall=0.93, p95_latency=3ms,  memory=0.2GB (with rescore)
```
Practical Rule: Choosing Quantization
- Dataset <10M vectors, critical recall: float32 native (no quantization)
- Dataset 10-100M vectors: Scalar Quantization INT8, sweet spot quality/memory
- Dataset >100M vectors, limited memory: Product Quantization with rescore
- Ultra-low latency with OpenAI/Cohere embeddings: Binary Quantization + rescore
Filtered Vector Search: The Narrow Filter Problem
In practice, most RAG queries are not pure vector searches: you want documents that are semantically similar and that belong to a specific user, date range, category, or tenant. Filtered vector search is one of the most algorithmically challenging problems in vector databases.
The fundamental problem: with highly selective filters (e.g., "only documents from the last week" matching 0.1% of the dataset), the k nearest neighbors in vector space might all be excluded by the filter, forcing the search to explore a large portion of the HNSW graph before finding k valid results. This can increase latency by 10-100x compared to unfiltered search.
Filtering Strategies
Post-filtering: run the ANN search normally, then filter results. Works if the filter is not very selective (excludes less than 50% of results). Problem: if the filter excludes 99% of vectors, you need to retrieve 100x more candidates.
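Under an independence assumption, the number of candidates post-filtering must fetch grows as k divided by the filter's selectivity. A small helper makes the blow-up explicit (the function name and safety margin are made up for this sketch):

```python
def postfilter_fetch_size(k: int, selectivity: float, safety: float = 1.5) -> int:
    """ANN candidates to fetch so that ~k survive a filter that passes
    `selectivity` of the dataset, assuming filter and similarity are
    independent, with a safety margin for variance."""
    return round(k / selectivity * safety)

for s in (0.5, 0.1, 0.01, 0.001):
    print(f"selectivity {s:>6}: fetch ~{postfilter_fetch_size(10, s):,} candidates")
```

At 0.1% selectivity you would need ~15,000 candidates to surface 10 valid results, which is why pre-filtering or filter-aware graph traversal wins for narrow filters.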
Pre-filtering: first identify points satisfying the filter, then run ANN search only on that subset. Requires an efficient scalar index on the filtered field. Works well with highly selective filters but requires payload indexing.
Filterable HNSW (Qdrant): Qdrant implements a sophisticated HNSW extension that adds extra edges to the graph based on indexed payload values. The query planner estimates filter cardinality and dynamically chooses the strategy: if the filter is very selective it uses the payload index, otherwise it uses filterable HNSW.
For combinations of multiple strict filters, Qdrant recommends the ACORN search strategy, which better handles the disconnected graph regions that aggressive filtering can create.
```python
# Filtered vector search in Qdrant - best practices
import datetime

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Filter, FieldCondition, MatchValue, Range,
    SearchParams, SearchRequest
)

client = QdrantClient(url="http://localhost:6333")

# STEP 1: Create payload indexes for frequently filtered fields
# CRITICAL: without payload index, filtering scans all points
client.create_payload_index(
    collection_name="rag_docs",
    field_name="tenant_id",
    field_schema="keyword"
)
client.create_payload_index(
    collection_name="rag_docs",
    field_name="category",
    field_schema="keyword"
)
client.create_payload_index(
    collection_name="rag_docs",
    field_name="created_at",
    field_schema="integer"    # UNIX timestamp for range queries
)
client.create_payload_index(
    collection_name="rag_docs",
    field_name="relevance_score",
    field_schema="float"      # Indexed because it is filtered below
)

# STEP 2: Queries with filters - simple to complex
def search_by_tenant(query_vector, tenant_id, limit=10):
    return client.search(
        collection_name="rag_docs",
        query_vector=query_vector,
        query_filter=Filter(
            must=[
                FieldCondition(
                    key="tenant_id",
                    match=MatchValue(value=tenant_id)
                )
            ]
        ),
        limit=limit
    )

# Combined filter (narrow filter): uses filterable HNSW
def search_recent_high_quality(query_vector, tenant_id, days_back=7, limit=10):
    cutoff = int((datetime.datetime.now() -
                  datetime.timedelta(days=days_back)).timestamp())
    return client.search(
        collection_name="rag_docs",
        query_vector=query_vector,
        query_filter=Filter(
            must=[
                FieldCondition(
                    key="tenant_id",
                    match=MatchValue(value=tenant_id)
                ),
                FieldCondition(
                    key="created_at",
                    range=Range(gte=cutoff)
                ),
                FieldCondition(
                    key="relevance_score",
                    range=Range(gte=0.7)
                )
            ]
        ),
        limit=limit,
        search_params=SearchParams(
            hnsw_ef=256,      # Increase ef for narrow filters
            exact=False
        )
    )

# STEP 3: Batch search for performance (avoids N sequential single queries)
def batch_search(query_vectors, tenant_id, limit=10):
    """
    Use search_batch to reduce overhead of N independent queries.
    Typical throughput: 3-5x vs sequential queries.
    """
    requests = [
        SearchRequest(
            vector=qv,
            filter=Filter(
                must=[FieldCondition(
                    key="tenant_id",
                    match=MatchValue(value=tenant_id)
                )]
            ),
            limit=limit
        )
        for qv in query_vectors
    ]
    return client.search_batch(
        collection_name="rag_docs",
        requests=requests
    )
```
Database Comparison: Qdrant vs Pinecone vs Milvus vs Weaviate
Each database has a distinct strength profile. There is no universally optimal choice: the decision depends on deployment constraints, team capabilities, and specific requirements.
Qdrant
Written in Rust, it offers the best performance-to-operational-complexity ratio in 2025. Supports sophisticated filtering with payload indexes and filterable HNSW, scalar/product/binary quantization, multi-vector for named vectors, sparse vectors for native hybrid search. The simplest deployment path: single binary, Docker, or cloud managed. Excellent for teams wanting control without massive operational overhead.
Ideal for: enterprise RAG, multi-tenant systems, on-premise deployment, teams with Python expertise but without complex Kubernetes infrastructure.
Pinecone
Fully managed, serverless, zero ops. Pricing is higher than self-hosted alternatives but completely eliminates infrastructure operational cost. Excellent for teams preferring to focus on product without managing clusters. Supports serverless indexes with transparent autoscaling and multi-region replication. Latency is consistently low thanks to optimized infrastructure.
Ideal for: early-stage startups, small teams, variable workloads, proofs of concept that become production without rework.
Milvus / Zilliz Cloud
The most mature and feature-complete distributed system. Supports all index types (HNSW, IVF, DiskANN, ScaNN, GPU-accelerated), automatic Kubernetes sharding, compute/storage separation. The cloud version (Zilliz) wins throughput benchmarks on datasets exceeding 100M vectors. Significant operational overhead on Kubernetes.
Ideal for: datasets >50M vectors, teams with existing Kubernetes infrastructure, maximum throughput requirements, GPU acceleration.
Weaviate
Positioned between a pure vector database and a knowledge graph. Supports integrated modules for automatic embedding generation (text2vec-openai, text2vec-cohere), GraphQL query interface, and native BM25 hybridization. Requires more memory than alternatives for the same dataset size. Excellent for teams wanting to integrate retrieval and knowledge graph.
Ideal for: semantic search with knowledge graph, GraphQL teams, direct integration with model providers without managing an embedding pipeline.
Decision Matrix: How to Choose
- Small team, development speed: Pinecone (zero ops) or Qdrant (simplicity)
- Dataset >50M vectors, high throughput: Milvus with DiskANN or GPU index
- Multi-tenant RAG with complex filters: Qdrant (filterable HNSW)
- Knowledge graph + semantic search: Weaviate
- Already on PostgreSQL, moderate volume: pgvector (avoid additional infrastructure)
- Native hybrid search without overhead: Qdrant sparse vectors or Weaviate BM25
Pinecone: Configuration and Optimization
Pinecone has further simplified its SDK with the serverless architecture in 2024-2025. You no longer explicitly configure the indexing algorithm: Pinecone internally manages the index choice based on dataset size.
```python
# Pinecone - setup and optimization with SDK v3+
import os

from pinecone import Pinecone, ServerlessSpec, PodSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# --- Serverless Index (recommended for most use cases) ---
pc.create_index(
    name="rag-serverless",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# --- Pod Index (for guaranteed latency and high throughput) ---
pc.create_index(
    name="rag-pod-optimized",
    dimension=1536,
    metric="cosine",
    spec=PodSpec(
        environment="us-east1-gcp",
        pod_type="p2.x1",   # p1=storage, p2=speed, s1=storage-optimized
        pods=1,
        replicas=2,
        shards=1
    )
)

index = pc.Index("rag-serverless")

# Upsert with rich metadata for filtering
def upsert_documents(documents, embeddings):
    vectors = [
        {
            "id": doc["id"],
            "values": emb.tolist(),
            "metadata": {
                "text": doc["text"][:1000],   # Pinecone limit: 40KB metadata per vector
                "source": doc["source"],
                "tenant_id": doc["tenant_id"],
                "created_at": doc["created_at"],
                "category": doc["category"],
                "language": doc.get("language", "en")
            }
        }
        for doc, emb in zip(documents, embeddings)
    ]
    batch_size = 100
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        index.upsert(vectors=batch)

# Query with metadata filtering
def query_pinecone(query_embedding, tenant_id, limit=10, category=None):
    filter_dict = {"tenant_id": {"$eq": tenant_id}}
    if category:
        filter_dict["category"] = {
            "$in": category if isinstance(category, list) else [category]
        }
    return index.query(
        vector=query_embedding.tolist(),
        top_k=limit,
        filter=filter_dict,
        include_metadata=True
    )

# Pinecone Namespaces: logical multi-tenant isolation at no extra cost
# embedding / query_embedding: numpy arrays computed upstream
index.upsert(
    vectors=[{"id": "doc1", "values": embedding.tolist()}],
    namespace="tenant-acme-corp"
)
index.query(
    vector=query_embedding.tolist(),
    top_k=10,
    namespace="tenant-acme-corp"
)
```
Benchmarking and Recall Measurement
No optimization is valid without rigorous measurement. The standard framework for evaluating vector databases is based on three fundamental metrics:
- Recall@k: percentage of true k nearest neighbors found among the k results returned. The most important quality metric. Formula: |retrieved ∩ true| / k
- QPS (Queries Per Second): system throughput under load. Typically measured at a fixed recall target (e.g., "QPS @ recall=0.95").
- Latency percentiles (p50, p95, p99): mean latency is misleading. In production, the p99 matters: 99% of queries must complete within the SLA.
```python
# Vector database benchmarking framework
import time
from dataclasses import dataclass
from typing import Tuple

import numpy as np

@dataclass
class BenchmarkResult:
    mean_recall: float
    p50_latency_ms: float
    p95_latency_ms: float
    p99_latency_ms: float
    qps: float
    total_queries: int

class VectorDBBenchmark:
    """
    Benchmarking framework for a vector database.
    Generates ground truth with brute force and compares with ANN.
    """
    def __init__(self, collection_size: int, dim: int, n_test_queries: int = 1000):
        self.collection_size = collection_size
        self.dim = dim
        self.n_test_queries = n_test_queries

    def generate_test_data(self) -> Tuple[np.ndarray, np.ndarray]:
        data = np.random.randn(self.collection_size, self.dim).astype(np.float32)
        data = data / np.linalg.norm(data, axis=1, keepdims=True)
        queries = np.random.randn(self.n_test_queries, self.dim).astype(np.float32)
        queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        return data, queries

    def compute_ground_truth(self, data: np.ndarray, queries: np.ndarray,
                             k: int = 10) -> np.ndarray:
        """Brute force ground truth computation (slow but required as reference)."""
        ground_truth = np.zeros((len(queries), k), dtype=np.int64)
        for i, query in enumerate(queries):
            similarities = data @ query
            ground_truth[i] = np.argsort(similarities)[::-1][:k]
        return ground_truth

    def run_benchmark(self, search_fn, queries, ground_truth, k=10) -> BenchmarkResult:
        recalls = []
        latencies = []
        # Warmup to avoid cold start penalty
        for _ in range(10):
            search_fn(queries[0], k)
        for i, query in enumerate(queries):
            start = time.perf_counter()
            results = search_fn(query, k)
            elapsed_ms = (time.perf_counter() - start) * 1000
            latencies.append(elapsed_ms)
            retrieved = set(results[:k])
            true_set = set(ground_truth[i].tolist())
            recalls.append(len(retrieved & true_set) / k)
        total_time = sum(latencies) / 1000
        return BenchmarkResult(
            mean_recall=float(np.mean(recalls)),
            p50_latency_ms=float(np.percentile(latencies, 50)),
            p95_latency_ms=float(np.percentile(latencies, 95)),
            p99_latency_ms=float(np.percentile(latencies, 99)),
            qps=len(queries) / total_time,
            total_queries=len(queries)
        )

# Run benchmark sweeping ef values
# Assumes `data` was already upserted into "rag_docs" with point ids equal
# to row indices, so the ids returned by search match ground-truth indices
from qdrant_client.models import SearchParams

benchmark = VectorDBBenchmark(collection_size=1_000_000, dim=1536)
data, queries = benchmark.generate_test_data()
gt = benchmark.compute_ground_truth(data, queries[:100], k=10)

for ef_value in [32, 64, 128, 256, 512]:
    def search_fn(query_vector, k, ef=ef_value):
        results = client.search(
            collection_name="rag_docs",
            query_vector=query_vector.tolist(),
            limit=k,
            search_params=SearchParams(hnsw_ef=ef)
        )
        return [hit.id for hit in results]

    result = benchmark.run_benchmark(search_fn, queries[:100], gt, k=10)
    print(f"ef={ef_value:3d} | Recall: {result.mean_recall:.3f} | "
          f"P95: {result.p95_latency_ms:.1f}ms | QPS: {result.qps:.0f}")
```
Production Checklist
Before bringing a vector database to production, verify these critical points:
- Benchmark on your real dataset: generic results do not automatically transfer to your use case. Measure recall and latency with real queries.
- Payload indexes configured: every field you filter on must have an index, otherwise filter conditions scan all points.
- Appropriate quantization: evaluate SQ8 as default, measure recall loss. If acceptable, apply it immediately: the memory savings are significant.
- Backup and snapshots: configure automatic snapshots. Vector databases do not always have ACID transactions; a crash during ingestion can corrupt the index.
- Monitoring: track indexed_vectors_count vs vectors_count to detect indexing lag that degrades query performance.
- Memory sizing: calculate the actual footprint before deployment. A server with insufficient memory causes swapping that destroys latency.
- Test with narrow filters: if your application uses highly selective filters, test these scenarios explicitly. Latency under narrow filters is very different from unfiltered searches.
Common Anti-patterns to Avoid
- Indexing misconfigured during bulk ingestion: if the optimizer rebuilds the index on every batch, ingestion becomes extremely slow. In Qdrant, set indexing_threshold=0 to disable indexing during the bulk load, then restore it (e.g., 10k-20k) so the index is built once at the end.
- M too high without measuring: M=128 is not always better than M=32. Beyond a certain point, recall improves marginally while memory grows linearly. Measure with your dataset.
- No payload index on filtered fields: without an index, every filter condition is O(n). With 10M vectors, an unindexed filter is the difference between 5ms and 5000ms.
- Unnormalized vectors with cosine similarity: if using cosine similarity, vectors must be normalized. Some models do not normalize by default. Unnormalized vectors with cosine produce semantically incorrect results.
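A defensive normalization step before upsert avoids this class of bug entirely; a minimal sketch:

```python
import numpy as np

def ensure_unit_norm(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """L2-normalize rows so cosine similarity reduces to a plain dot product."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)   # eps guards zero vectors

emb = np.random.randn(100, 768).astype(np.float32) * 7.0   # unnormalized
emb = ensure_unit_norm(emb)
assert np.allclose(np.linalg.norm(emb, axis=1), 1.0, atol=1e-5)
```

Running this unconditionally in the ingestion pipeline is cheap, and it is a no-op for models that already normalize their output.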
Related Articles
- RAG: Retrieval-Augmented Generation Explained - RAG fundamentals to contextualize the role of the vector database
- Embeddings and Vector Search: BERT vs Sentence Transformers - Choosing the right embedding model for your pipeline
- Hybrid Retrieval: BM25 + Vector Search - Combine vector search with keyword search for better recall
- PostgreSQL with pgvector - Vector search on PostgreSQL without additional infrastructure
Conclusions and Next Steps
Selecting and optimizing a vector database is one of the most technical and impactful aspects of AI engineering. There are no universal answers: each system has a distinct strength profile and the optimal configuration depends on your specific workload.
The recommended path for a new project: start with Qdrant with SQ8 for operational simplicity and good performance, then measure recall and latency on your real dataset. If performance is insufficient, explore M and ef tuning. If memory is a constraint, evaluate product quantization or DiskANN. If you are already on PostgreSQL and have moderate volume (<5M vectors), consider pgvector before adding new infrastructure.
The next articles in this series build on these foundations: the Hybrid Retrieval article covers combining vector search with BM25 to improve recall on precise queries, while RAG in Production shows how to measure the end-to-end impact of vector database choices on RAG response quality.
The embedding concepts presented here connect directly to the Modern NLP series and to the PostgreSQL AI series for teams wanting to implement vector search on existing infrastructure.