Vector Databases: Selection, Architecture, and Optimization for AI Engineering
When building a RAG pipeline in production, choosing the right vector database is not an implementation detail - it is an architectural decision that directly impacts latency, operational costs, recall accuracy, and system scalability. The vector database market exceeded $2.65 billion in 2025 and is projected to reach $8.9 billion by 2030 (a 27.5% CAGR). The explosion in options has made selection increasingly complex.
This article is not a marketing overview of commercial features. It is a technical deep dive into how vector databases work internally, which indexing algorithms they use, and how to configure and optimize them for real workloads. We will analyze Qdrant, Pinecone, Milvus, and Weaviate on concrete engineering dimensions: HNSW architecture, quantization strategies, filtered search, DiskANN vs in-memory tradeoffs, and parameter tuning to hit the recall/latency targets your application requires.
Whether you are building a RAG system handling millions of documents with sub-50ms latency and 95%+ recall, or optimizing an existing system consuming too much memory, this article gives you the conceptual and practical tools to make informed decisions.
What You Will Learn
- Internal architecture of vector databases: how HNSW works at the algorithmic level
- IVF vs HNSW vs DiskANN: when to use which algorithm and why
- Scalar, product, and binary quantization: memory/accuracy tradeoffs
- Filtered vector search: pre-filtering, post-filtering, and the narrow filter problem
- Practical configuration of Qdrant, Milvus, and Pinecone with working code examples
- Benchmarking and production tuning: measuring and improving QPS and recall
Internal Architecture: How a Vector Database Works
A vector database differs from a relational database not only in the data it stores, but in the fundamental operation it must optimize. Instead of exact lookups on discrete keys, it performs Approximate Nearest Neighbor (ANN) search across high-dimensional spaces, typically 768-4096 dimensions for modern LLM embeddings.
Exact k-nearest neighbor search (kNN) has O(n*d) complexity where n is the number of vectors and d is the dimensionality. With 10 million vectors at 1536 dimensions (OpenAI ada-002 standard), an exact query would require roughly 15 billion floating-point operations - completely unacceptable for a real-time system. All modern vector databases therefore use ANN algorithms that sacrifice some recall to gain orders of magnitude in speed.
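To make the O(n*d) cost concrete, here is a minimal brute-force kNN in NumPy, scaled down to 10k vectors so it runs quickly; the function and variable names are illustrative:

```python
import numpy as np

def exact_knn(data: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Brute-force kNN on normalized vectors: one dot product per row -> O(n*d)."""
    scores = data @ query                  # n*d multiply-adds
    return np.argsort(scores)[::-1][:k]   # full sort for simplicity

# Scaled down from the 10M x 1536 example: same math, 1000x fewer vectors
n, d = 10_000, 1536
data = np.random.randn(n, d).astype(np.float32)
data /= np.linalg.norm(data, axis=1, keepdims=True)
query = data[0]                            # use a stored vector as the query

top = exact_knn(data, query, k=10)
assert top[0] == 0                         # the query itself ranks first
print(f"~{n * d:,} multiplications per query; ~15 billion at 10M vectors")
```

At 10M vectors every query repeats this work 1000x over, which is why ANN indexes trade a little recall for orders of magnitude less computation.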
The internal stack of a vector database operates across several layers:
- Storage layer: management of compressed vectors on disk or in memory, with mmap support for efficient access
- Index layer: ANN data structure (HNSW, IVF, DiskANN) for navigating the vector space
- Payload/metadata layer: scalar attributes associated with vectors for filtering
- Query planner: decides the optimal strategy combining vector search and payload filtering
- Replication/sharding layer: for distributed systems like Milvus or Pinecone
HNSW: Algorithmic Deep Dive
Hierarchical Navigable Small World (HNSW) is the dominant ANN algorithm in 2025, used by default in Qdrant and Weaviate and available in Milvus. Understanding its internal mechanics is essential for correct configuration.
HNSW builds a multi-layer hierarchical graph. At the highest layer, there are few nodes strongly connected to each other (the "hubs"), while lower layers become progressively denser until layer 0 which contains all vectors. During search, the algorithm starts at the top and descends through layers, progressively refining the nearest neighbor candidates. This approach is inspired by the "small world" phenomenon in social graphs: from any node, you can reach any other in just a few hops thanks to long-range connections.
The three fundamental HNSW parameters are:
- M (default 16): maximum number of bidirectional edges per node. Typical range: 8-64. Increasing M improves recall but grows memory and build time. For high-dimensional datasets (1536+), M=32-64 yields good results.
- efConstruction (default 100-200): size of the candidate list during index construction. Does not affect final index size, but determines connection quality. Higher values produce a better index but slower build time. Recommended range: 200-400 for high quality.
- ef (or efSearch, configurable at runtime): candidate list size during query. Must be >= k (number of results requested). Increasing ef improves recall but increases latency. Typical range: 50-500.
The fundamental tradeoff: M and efConstruction determine index quality (a costly one-time operation), while ef balances recall vs latency at query time and can be changed dynamically without rebuilding the index.
```python
# HNSW configuration in Qdrant - practical examples with explicit tradeoffs
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance,
    HnswConfigDiff, SearchParams
)

client = QdrantClient(url="http://localhost:6333")

# --- HIGH-RECALL configuration for critical RAG ---
# Target: recall >= 0.98, latency acceptable up to 50ms
# Cost: ~4x memory compared to baseline
client.recreate_collection(
    collection_name="rag_high_recall",
    vectors_config=VectorParams(
        size=1536,                  # OpenAI text-embedding-3-small
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=64,                       # High connectivity: better recall but +memory
        ef_construct=400,           # Slow build but high-quality index
        full_scan_threshold=10000,  # Below this size (in KB) fall back to brute force
        on_disk=False               # In-memory for minimum latency
    )
)

# --- BALANCED configuration for typical production ---
# Target: recall >= 0.95, latency < 20ms, optimized memory
client.recreate_collection(
    collection_name="rag_balanced",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=32,                       # Good recall/memory balance
        ef_construct=200,
        full_scan_threshold=5000,
        on_disk=False
    )
)

# --- LOW-LATENCY configuration for real-time ---
# Target: latency < 5ms, acceptable recall >= 0.90
client.recreate_collection(
    collection_name="rag_fast",
    vectors_config=VectorParams(
        size=768,                   # Compact embeddings (all-MiniLM-L6-v2)
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,
        ef_construct=128,
        full_scan_threshold=1000,
        on_disk=False
    )
)

# Configure ef at query time (more flexible than rebuild)
# query_embedding: 1536-dim embedding of the user query, computed upstream
results = client.search(
    collection_name="rag_balanced",
    query_vector=query_embedding,
    limit=10,
    search_params=SearchParams(
        hnsw_ef=128,                # Increase recall without index rebuild
        exact=False
    )
)
for hit in results:
    print(f"Score: {hit.score:.4f} | ID: {hit.id}")
```
IVF vs HNSW vs DiskANN: Choosing the Right Algorithm
HNSW is not the only available indexing algorithm. The choice strongly depends on memory constraints, dataset size, and update patterns.
IVF (Inverted File Index)
IVF partitions the vector space into nlist clusters via k-means.
During query, it searches only the nprobe clusters nearest to the query vector.
Key parameters are nlist (cluster count) and nprobe (clusters to inspect).
Recommended empirical formula: nlist = 4 * sqrt(n_vectors).
IVF pros: fast build, moderate memory, good for static datasets. IVF cons: requires re-clustering on significant data changes, lower recall than HNSW for the same computational budget, cold start on new collections.
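The sizing heuristic above can be wrapped in a small helper. The function name and the nprobe fractions below are illustrative defaults for this sketch, not official recommendations:

```python
import math

def suggest_ivf_params(n_vectors: int, recall_target: str = "balanced") -> dict:
    """Heuristic IVF sizing: nlist = 4 * sqrt(n); nprobe as a fraction of nlist."""
    nlist = max(1, round(4 * math.sqrt(n_vectors)))
    # Fraction of clusters to probe: more clusters inspected -> higher recall
    fraction = {"fast": 0.01, "balanced": 0.05, "high_recall": 0.15}[recall_target]
    nprobe = max(1, round(nlist * fraction))
    return {"nlist": nlist, "nprobe": nprobe}

print(suggest_ivf_params(1_000_000))          # {'nlist': 4000, 'nprobe': 200}
print(suggest_ivf_params(1_000_000, "fast"))  # {'nlist': 4000, 'nprobe': 40}
```

Treat these as starting points and sweep nprobe against measured recall, exactly as you would sweep ef for HNSW.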
HNSW
As described above, builds a multi-layer graph. The most versatile algorithm, used by default in most vector databases.
HNSW pros: excellent recall-speed tradeoff, native support for incremental updates, parameters configurable at query time. HNSW cons: requires the entire index to fit in RAM, becomes prohibitive beyond 50-100M vectors on standard hardware.
DiskANN
Developed by Microsoft Research, DiskANN is designed for datasets that do not fit in RAM. It keeps a compact navigable graph structure in memory while storing full vectors on NVMe SSDs. With PCIe Gen5 hardware, it maintains 95%+ recall and sub-10ms latency at billion-vector scale, with 10-20x lower DRAM cost compared to equivalent HNSW.
DiskANN pros: scales to billions of vectors on commodity hardware, reduced operational cost. DiskANN cons: requires fast NVMe SSDs, higher latency than in-memory HNSW, the base implementation is immutable (FreshDiskANN handles updates). Available in Milvus, Azure PostgreSQL, and increasingly other systems.
Beware the "HNSW for Everything" Anti-Pattern
Many teams configure in-memory HNSW for 50M+ vector datasets, then find themselves running 256GB RAM instances at unsustainable cost. The practical rule: if your dataset exceeds 10-20M vectors and you do not have ultra-low latency requirements (<5ms), seriously evaluate DiskANN or aggressive quantization before adding hardware.
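A back-of-the-envelope estimator helps spot this trap before provisioning hardware. The formula below (raw vectors plus graph links, ~2M links at layer 0 plus ~M for upper layers) is an approximation for this sketch, not any vendor's official sizing; real engines add per-segment and metadata overhead on top:

```python
def hnsw_memory_gb(n_vectors: int, dim: int, m: int = 16,
                   bytes_per_component: int = 4) -> float:
    """Rough in-memory HNSW footprint: raw float32 vectors + graph links.

    Layer 0 keeps ~2*M links per node and upper layers add ~M more on
    average; each link is modeled as a 4-byte node id.
    """
    vector_bytes = n_vectors * dim * bytes_per_component
    link_bytes = n_vectors * (2 * m + m) * 4
    return (vector_bytes + link_bytes) / 1024**3

# The "HNSW for everything" trap: 50M OpenAI-sized vectors at M=32
print(f"{hnsw_memory_gb(50_000_000, 1536, m=32):.0f} GB of RAM")  # ~304 GB
```

Even before replication, this is multiple 256GB-class machines; the same dataset under DiskANN or aggressive quantization needs a fraction of the DRAM.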
```python
# IVF_FLAT and HNSW configuration in Milvus - practical comparison
from pymilvus import (
    connections, Collection, CollectionSchema,
    FieldSchema, DataType
)

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="timestamp", dtype=DataType.INT64),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536)
]
schema = CollectionSchema(fields, description="RAG document collection")
collection = Collection("rag_docs", schema)

# --- IVF_FLAT: for static or near-static datasets ---
ivf_index_params = {
    "metric_type": "COSINE",
    "index_type": "IVF_FLAT",
    "params": {
        "nlist": 4096       # For 1M vectors: 4 * sqrt(1M) ≈ 4000
    }
}

# --- HNSW: for datasets with frequent updates ---
hnsw_index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {
        "M": 32,
        "efConstruction": 256
    }
}

# --- DiskANN: for datasets >50M vectors ---
# Build parameters are managed internally by Milvus;
# search_list is a query-time parameter (see below), analogous to HNSW ef
diskann_index_params = {
    "metric_type": "L2",
    "index_type": "DISKANN",
    "params": {}
}

collection.create_index(
    field_name="embedding",
    index_params=hnsw_index_params
)
collection.load()

# Query parameters for HNSW
search_params_hnsw = {
    "metric_type": "COSINE",
    "params": {
        "ef": 256           # Increase for more recall, decrease for more speed
    }
}

# Query parameters for IVF
search_params_ivf = {
    "metric_type": "COSINE",
    "params": {
        "nprobe": 64        # nprobe/nlist = fraction of clusters inspected
    }
}

# Query parameters for DiskANN
search_params_diskann = {
    "metric_type": "L2",
    "params": {
        "search_list": 100  # Candidate list size at query time
    }
}

# query_embedding: 1536-dim embedding of the user query, computed upstream
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params_hnsw,
    limit=10,
    output_fields=["content", "category", "timestamp"]
)
for hit in results[0]:
    print(f"Distance: {hit.distance:.4f} | Category: {hit.entity.get('category')}")
```
Vector Quantization: Compressing Without Losing Recall
Quantization is the most powerful technique for reducing vector database memory usage, with controllable impact on search quality. A float32 vector with 1536 dimensions occupies 6144 bytes (6KB). With quantization, we can reduce it to 384 bytes or less.
Scalar Quantization (SQ)
Maps each float32 value (4 bytes) to int8 (1 byte), achieving 4x compression. The algorithm analyzes the value distribution in each dimension and determines an optimal quantization range. Distances are computed directly on int8 values, which are computationally simpler than float32 operations. Typical recall loss: 1-3% vs float32.
When to use: recommended starting point for any deployment, 4x reduction with minimal quality impact. Supported by Qdrant, Milvus (IVF_SQ8), and Weaviate.
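The mechanics are easy to reproduce. This NumPy sketch mirrors the concept (quantile-based range, symmetric int8 mapping), not any database's exact implementation:

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray, quantile: float = 0.99):
    """Map float32 -> int8 with a symmetric range taken at the given quantile."""
    bound = np.quantile(np.abs(vectors), quantile)   # clip the outlier tail
    scale = float(bound) / 127.0
    q = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

vecs = np.random.randn(1000, 1536).astype(np.float32)
q, scale = scalar_quantize(vecs)
err = float(np.abs(dequantize(q, scale) - vecs).mean())
print(f"{vecs.nbytes // q.nbytes}x smaller, mean abs reconstruction error {err:.4f}")
```

The quantile clip is the same idea as Qdrant's `quantile` setting: sacrificing the extreme 1% of values buys a tighter range, and therefore finer resolution, for the other 99%.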
Product Quantization (PQ)
Divides the vector into m sub-vectors and quantizes each against a codebook of 2^nbits entries. Typical compression: 16-64x. Each vector is represented as a sequence of codebook indices. Approximate distances are computed via precomputed lookup tables (ADC - Asymmetric Distance Computation).
PQ tradeoff: aggressive compression (10-50MB instead of 10GB) but significant recall loss (5-15%). Requires codebook training on the dataset. Useful for huge datasets where memory is the primary constraint.
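A toy PQ trainer and encoder makes the sub-vector/codebook mechanics concrete. Everything here (the tiny Lloyd's k-means, the parameter choices) is illustrative and far simpler than production implementations:

```python
import numpy as np

def train_pq(data, m=8, nbits=4, iters=10):
    """Split dims into m sub-vectors; k-means each with 2^nbits centroids."""
    n, d = data.shape
    sub, k = d // m, 2 ** nbits
    codebooks = []
    for j in range(m):
        chunk = data[:, j*sub:(j+1)*sub]
        cents = chunk[np.random.choice(n, k, replace=False)].copy()
        for _ in range(iters):  # a few Lloyd iterations per subspace
            assign = np.argmin(((chunk[:, None] - cents[None]) ** 2).sum(-1), axis=1)
            for c in range(k):
                pts = chunk[assign == c]
                if len(pts):
                    cents[c] = pts.mean(0)
        codebooks.append(cents)
    return codebooks

def pq_encode(x, codebooks):
    """Each sub-vector becomes the index of its nearest centroid."""
    sub = len(x) // len(codebooks)
    return np.array([np.argmin(((cb - x[j*sub:(j+1)*sub]) ** 2).sum(-1))
                     for j, cb in enumerate(codebooks)], dtype=np.uint8)

data = np.random.randn(2000, 64).astype(np.float32)
cbs = train_pq(data, m=8, nbits=4)
code = pq_encode(data[0], cbs)
# 64 floats (256 bytes) -> 8 one-byte codes (4 bytes if 4-bit codes were packed)
print(f"256 bytes -> {code.nbytes} bytes")
```

The ADC lookup-table trick follows from this representation: precompute the distance from the query's sub-vectors to every centroid once, then each candidate's approximate distance is just m table lookups.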
Binary Quantization (BQ)
Reduces each dimension to 1 bit: the most compressed representation possible. A 1536-dimension vector becomes 192 bytes (32x compression vs float32). Distance is computed using Hamming distance (XOR + popcount), an extremely fast operation on modern CPUs. Qdrant reports speedups of up to 40x on distance calculations.
However, Binary Quantization only works well for embeddings with specific properties: values must be symmetrically distributed around zero (a property satisfied by OpenAI ada-002, Cohere embed-v3, and e5 models). For embeddings with asymmetric distributions, the recall drop can be severe (15-30%).
Qdrant introduced 1.5-bit and 2-bit quantization in 2025, offering a middle ground between scalar (4x) and binary (32x) compression.
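The core binary operations are easy to reproduce in NumPy (sign binarization, packed bits, XOR + popcount); this is a conceptual sketch, not Qdrant's implementation:

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """1 bit per dimension: keep only the sign, packed 8 dims per byte."""
    return np.packbits(vectors > 0, axis=1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance over packed bytes: XOR then popcount."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

vecs = np.random.randn(1000, 1536).astype(np.float32)
codes = binarize(vecs)                 # 1536 dims -> 192 bytes each
print(codes.shape)                     # (1000, 192)

q = binarize(vecs[:1])[0]
dists = [hamming(q, c) for c in codes]
assert dists[0] == 0                   # identical codes: distance 0
```

This also shows why the symmetric-distribution requirement matters: binarization keeps only the sign of each component, so embeddings whose values skew to one side collapse into near-identical codes.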
```python
# Quantization configuration in Qdrant - all types
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance,
    ScalarQuantization, ScalarQuantizationConfig, ScalarType,
    ProductQuantization, ProductQuantizationConfig, CompressionRatio,
    BinaryQuantization, BinaryQuantizationConfig,
    SearchParams, QuantizationSearchParams
)

client = QdrantClient(url="http://localhost:6333")

# --- 1. Scalar Quantization (SQ8) ---
# 4x compression, recall loss ~1-3%
# RECOMMENDED: best starting point
client.recreate_collection(
    collection_name="rag_sq8",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,     # Use 99th percentile to define the range
            always_ram=True    # Keep quantized vectors in RAM (+speed)
        )
    )
)

# --- 2. Product Quantization (PQ) ---
# 16-64x compression, recall loss 5-15%
# FOR huge datasets (>100M vectors) with severe memory constraints
client.recreate_collection(
    collection_name="rag_pq",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    quantization_config=ProductQuantization(
        product=ProductQuantizationConfig(
            compression=CompressionRatio.X16,
            always_ram=True
        )
    )
)

# --- 3. Binary Quantization (BQ) ---
# 32x compression, 40x speedup, variable recall loss
# ONLY for embeddings with symmetric distribution (OpenAI, Cohere)
client.recreate_collection(
    collection_name="rag_binary",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    ),
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(
            always_ram=True
        )
    )
)

# rescore=True: use BQ for candidate generation, then
# recompute exact distances with float32 on top-k candidates
def search_with_rescore(client, collection_name, query_vector, limit=10):
    return client.search(
        collection_name=collection_name,
        query_vector=query_vector,
        limit=limit,
        search_params=SearchParams(
            quantization=QuantizationSearchParams(
                ignore=False,
                rescore=True,      # Final rescore with float32
                oversampling=3.0   # Fetch 3x candidates for rescore
            )
        )
    )

# Typical benchmark results (commodity hardware, 1M vectors, 1536-dim)
# float32: recall=1.00, p95_latency=45ms, memory=6.1GB
# SQ8:     recall=0.98, p95_latency=18ms, memory=1.6GB  <- sweet spot
# PQ16:    recall=0.91, p95_latency=8ms,  memory=0.4GB
# Binary:  recall=0.93, p95_latency=3ms,  memory=0.2GB (with rescore)
```
Practical Rule: Choosing Quantization
- Dataset <10M vectors, critical recall: float32 native (no quantization)
- Dataset 10-100M vectors: Scalar Quantization INT8, sweet spot quality/memory
- Dataset >100M vectors, limited memory: Product Quantization with rescore
- Ultra-low latency with OpenAI/Cohere embeddings: Binary Quantization + rescore
Filtered Vector Search: The Narrow Filter Problem
In practice, most RAG queries are not pure vector searches: you want documents that are semantically similar and that belong to a specific user, date range, category, or tenant. Filtered vector search is one of the most algorithmically challenging problems in vector databases.
The fundamental problem: with highly selective filters (e.g., "only documents from the last week" matching 0.1% of the dataset), the k nearest neighbors in vector space might all be excluded by the filter, forcing the search to explore a large portion of the HNSW graph before finding k valid results. This can increase latency by 10-100x compared to unfiltered search.
Filtering Strategies
Post-filtering: run the ANN search normally, then filter results. Works if the filter is not very selective (excludes less than 50% of results). Problem: if the filter excludes 99% of vectors, you need to retrieve 100x more candidates.
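Under an independence assumption, the number of candidates post-filtering must fetch grows as k divided by the filter's selectivity. A small helper makes the blow-up explicit (the function name and safety margin are made up for this sketch):

```python
def postfilter_fetch_size(k: int, selectivity: float, safety: float = 1.5) -> int:
    """ANN candidates to fetch so that ~k survive a filter that passes
    `selectivity` of the dataset, assuming filter and similarity are
    independent, with a safety margin for variance."""
    return round(k / selectivity * safety)

for s in (0.5, 0.1, 0.01, 0.001):
    print(f"selectivity {s:>6}: fetch ~{postfilter_fetch_size(10, s):,} candidates")
```

At 0.1% selectivity you would need ~15,000 candidates to surface 10 valid results, which is why pre-filtering or filter-aware graph traversal wins for narrow filters.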
Pre-filtering: first identify points satisfying the filter, then run ANN search only on that subset. Requires an efficient scalar index on the filtered field. Works well with highly selective filters but requires payload indexing.
Filterable HNSW (Qdrant): Qdrant implements a sophisticated HNSW extension that adds extra edges to the graph based on indexed payload values. The query planner estimates filter cardinality and dynamically chooses the strategy: if the filter is very selective it uses the payload index, otherwise it uses filterable HNSW.
For combinations of multiple strict filters, Qdrant recommends the ACORN search strategy, which better handles the disconnected graph regions that aggressive filtering can create.
```python
# Filtered vector search in Qdrant - best practices
import datetime

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Filter, FieldCondition, MatchValue, Range,
    SearchParams, SearchRequest
)

client = QdrantClient(url="http://localhost:6333")

# STEP 1: Create payload indexes for frequently filtered fields
# CRITICAL: without payload index, filtering scans all points
client.create_payload_index(
    collection_name="rag_docs",
    field_name="tenant_id",
    field_schema="keyword"
)
client.create_payload_index(
    collection_name="rag_docs",
    field_name="category",
    field_schema="keyword"
)
client.create_payload_index(
    collection_name="rag_docs",
    field_name="created_at",
    field_schema="integer"    # UNIX timestamp for range queries
)
client.create_payload_index(
    collection_name="rag_docs",
    field_name="relevance_score",
    field_schema="float"      # Indexed because it is filtered below
)

# STEP 2: Queries with filters - simple to complex
def search_by_tenant(query_vector, tenant_id, limit=10):
    return client.search(
        collection_name="rag_docs",
        query_vector=query_vector,
        query_filter=Filter(
            must=[
                FieldCondition(
                    key="tenant_id",
                    match=MatchValue(value=tenant_id)
                )
            ]
        ),
        limit=limit
    )

# Combined filter (narrow filter): uses filterable HNSW
def search_recent_high_quality(query_vector, tenant_id, days_back=7, limit=10):
    cutoff = int((datetime.datetime.now() -
                  datetime.timedelta(days=days_back)).timestamp())
    return client.search(
        collection_name="rag_docs",
        query_vector=query_vector,
        query_filter=Filter(
            must=[
                FieldCondition(
                    key="tenant_id",
                    match=MatchValue(value=tenant_id)
                ),
                FieldCondition(
                    key="created_at",
                    range=Range(gte=cutoff)
                ),
                FieldCondition(
                    key="relevance_score",
                    range=Range(gte=0.7)
                )
            ]
        ),
        limit=limit,
        search_params=SearchParams(
            hnsw_ef=256,      # Increase ef for narrow filters
            exact=False
        )
    )

# STEP 3: Batch search for performance (avoids N sequential single queries)
def batch_search(query_vectors, tenant_id, limit=10):
    """
    Use search_batch to reduce overhead of N independent queries.
    Typical throughput: 3-5x vs sequential queries.
    """
    requests = [
        SearchRequest(
            vector=qv,
            filter=Filter(
                must=[FieldCondition(
                    key="tenant_id",
                    match=MatchValue(value=tenant_id)
                )]
            ),
            limit=limit
        )
        for qv in query_vectors
    ]
    return client.search_batch(
        collection_name="rag_docs",
        requests=requests
    )
```
Database Comparison: Qdrant vs Pinecone vs Milvus vs Weaviate
Each database has a distinct strength profile. There is no universally optimal choice: the decision depends on deployment constraints, team capabilities, and specific requirements.
Qdrant
Written in Rust, it offers the best performance-to-operational-complexity ratio in 2025. Supports sophisticated filtering with payload indexes and filterable HNSW, scalar/product/binary quantization, multi-vector for named vectors, sparse vectors for native hybrid search. The simplest deployment path: single binary, Docker, or cloud managed. Excellent for teams wanting control without massive operational overhead.
Ideal for: enterprise RAG, multi-tenant systems, on-premise deployment, teams with Python expertise but without complex Kubernetes infrastructure.
Pinecone
Fully managed, serverless, zero ops. Pricing is higher than self-hosted alternatives but completely eliminates infrastructure operational cost. Excellent for teams preferring to focus on product without managing clusters. Supports serverless indexes with transparent autoscaling and multi-region replication. Latency is consistently low thanks to optimized infrastructure.
Ideal for: early-stage startups, small teams, variable workloads, proofs of concept that become production without rework.
Milvus / Zilliz Cloud
The most mature and feature-complete distributed system. Supports all index types (HNSW, IVF, DiskANN, ScaNN, GPU-accelerated), automatic Kubernetes sharding, compute/storage separation. The cloud version (Zilliz) wins throughput benchmarks on datasets exceeding 100M vectors. Significant operational overhead on Kubernetes.
Ideal for: datasets >50M vectors, teams with existing Kubernetes infrastructure, maximum throughput requirements, GPU acceleration.
Weaviate
Positioned between a pure vector database and a knowledge graph. Supports integrated modules for automatic embedding generation (text2vec-openai, text2vec-cohere), GraphQL query interface, and native BM25 hybridization. Requires more memory than alternatives for the same dataset size. Excellent for teams wanting to integrate retrieval and knowledge graph.
Ideal for: semantic search with knowledge graph, GraphQL teams, direct integration with model providers without managing an embedding pipeline.
Decision Matrix: How to Choose
- Small team, development speed: Pinecone (zero ops) or Qdrant (simplicity)
- Dataset >50M vectors, high throughput: Milvus with DiskANN or GPU index
- Multi-tenant RAG with complex filters: Qdrant (filterable HNSW)
- Knowledge graph + semantic search: Weaviate
- Already on PostgreSQL, moderate volume: pgvector (avoid additional infrastructure)
- Native hybrid search without overhead: Qdrant sparse vectors or Weaviate BM25
Pinecone: Configuration and Optimization
Pinecone has further simplified its SDK with the serverless architecture in 2024-2025. You no longer explicitly configure the indexing algorithm: Pinecone internally manages the index choice based on dataset size.
```python
# Pinecone - setup and optimization with SDK v3+
import os

from pinecone import Pinecone, ServerlessSpec, PodSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# --- Serverless Index (recommended for most use cases) ---
pc.create_index(
    name="rag-serverless",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# --- Pod Index (for guaranteed latency and high throughput) ---
pc.create_index(
    name="rag-pod-optimized",
    dimension=1536,
    metric="cosine",
    spec=PodSpec(
        environment="us-east1-gcp",
        pod_type="p2.x1",   # p1=storage, p2=speed, s1=storage-optimized
        pods=1,
        replicas=2,
        shards=1
    )
)

index = pc.Index("rag-serverless")

# Upsert with rich metadata for filtering
def upsert_documents(documents, embeddings):
    vectors = [
        {
            "id": doc["id"],
            "values": emb.tolist(),
            "metadata": {
                "text": doc["text"][:1000],   # Pinecone limit: 40KB metadata per vector
                "source": doc["source"],
                "tenant_id": doc["tenant_id"],
                "created_at": doc["created_at"],
                "category": doc["category"],
                "language": doc.get("language", "en")
            }
        }
        for doc, emb in zip(documents, embeddings)
    ]
    batch_size = 100
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i:i + batch_size]
        index.upsert(vectors=batch)

# Query with metadata filtering
def query_pinecone(query_embedding, tenant_id, limit=10, category=None):
    filter_dict = {"tenant_id": {"$eq": tenant_id}}
    if category:
        filter_dict["category"] = {
            "$in": category if isinstance(category, list) else [category]
        }
    return index.query(
        vector=query_embedding.tolist(),
        top_k=limit,
        filter=filter_dict,
        include_metadata=True
    )

# Pinecone Namespaces: logical multi-tenant isolation at no extra cost
# embedding / query_embedding: numpy arrays computed upstream
index.upsert(
    vectors=[{"id": "doc1", "values": embedding.tolist()}],
    namespace="tenant-acme-corp"
)
index.query(
    vector=query_embedding.tolist(),
    top_k=10,
    namespace="tenant-acme-corp"
)
```
Benchmarking and Recall Measurement
No optimization is valid without rigorous measurement. The standard framework for evaluating vector databases is based on three fundamental metrics:
- Recall@k: percentage of true k nearest neighbors found among the k results returned. The most important quality metric. Formula: |retrieved ∩ true| / k
- QPS (Queries Per Second): system throughput under load. Typically measured at a fixed recall target (e.g., "QPS @ recall=0.95").
- Latency percentiles (p50, p95, p99): mean latency is misleading. In production, the p99 matters: 99% of queries must complete within the SLA.
```python
# Vector database benchmarking framework
import time
from dataclasses import dataclass
from typing import Tuple

import numpy as np

@dataclass
class BenchmarkResult:
    mean_recall: float
    p50_latency_ms: float
    p95_latency_ms: float
    p99_latency_ms: float
    qps: float
    total_queries: int

class VectorDBBenchmark:
    """
    Benchmarking framework for a vector database.
    Generates ground truth with brute force and compares with ANN.
    """
    def __init__(self, collection_size: int, dim: int, n_test_queries: int = 1000):
        self.collection_size = collection_size
        self.dim = dim
        self.n_test_queries = n_test_queries

    def generate_test_data(self) -> Tuple[np.ndarray, np.ndarray]:
        data = np.random.randn(self.collection_size, self.dim).astype(np.float32)
        data = data / np.linalg.norm(data, axis=1, keepdims=True)
        queries = np.random.randn(self.n_test_queries, self.dim).astype(np.float32)
        queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        return data, queries

    def compute_ground_truth(self, data: np.ndarray, queries: np.ndarray,
                             k: int = 10) -> np.ndarray:
        """Brute force ground truth computation (slow but required as reference)."""
        ground_truth = np.zeros((len(queries), k), dtype=np.int64)
        for i, query in enumerate(queries):
            similarities = data @ query
            ground_truth[i] = np.argsort(similarities)[::-1][:k]
        return ground_truth

    def run_benchmark(self, search_fn, queries, ground_truth, k=10) -> BenchmarkResult:
        recalls = []
        latencies = []
        # Warmup to avoid cold start penalty
        for _ in range(10):
            search_fn(queries[0], k)
        for i, query in enumerate(queries):
            start = time.perf_counter()
            results = search_fn(query, k)
            elapsed_ms = (time.perf_counter() - start) * 1000
            latencies.append(elapsed_ms)
            retrieved = set(results[:k])
            true_set = set(ground_truth[i].tolist())
            recalls.append(len(retrieved & true_set) / k)
        total_time = sum(latencies) / 1000
        return BenchmarkResult(
            mean_recall=float(np.mean(recalls)),
            p50_latency_ms=float(np.percentile(latencies, 50)),
            p95_latency_ms=float(np.percentile(latencies, 95)),
            p99_latency_ms=float(np.percentile(latencies, 99)),
            qps=len(queries) / total_time,
            total_queries=len(queries)
        )

# Run benchmark sweeping ef values
# Assumes `data` was already upserted into "rag_docs" with point ids equal
# to row indices, so the ids returned by search match ground-truth indices
from qdrant_client.models import SearchParams

benchmark = VectorDBBenchmark(collection_size=1_000_000, dim=1536)
data, queries = benchmark.generate_test_data()
gt = benchmark.compute_ground_truth(data, queries[:100], k=10)

for ef_value in [32, 64, 128, 256, 512]:
    def search_fn(query_vector, k, ef=ef_value):
        results = client.search(
            collection_name="rag_docs",
            query_vector=query_vector.tolist(),
            limit=k,
            search_params=SearchParams(hnsw_ef=ef)
        )
        return [hit.id for hit in results]

    result = benchmark.run_benchmark(search_fn, queries[:100], gt, k=10)
    print(f"ef={ef_value:3d} | Recall: {result.mean_recall:.3f} | "
          f"P95: {result.p95_latency_ms:.1f}ms | QPS: {result.qps:.0f}")
```
Production Checklist
Before bringing a vector database to production, verify these critical points:
- Benchmark on your real dataset: generic results do not automatically transfer to your use case. Measure recall and latency with real queries.
- Payload indexes configured: every field you filter on must have an index, otherwise filter conditions scan all points.
- Appropriate quantization: evaluate SQ8 as default, measure recall loss. If acceptable, apply it immediately: the memory savings are significant.
- Backup and snapshots: configure automatic snapshots. Vector databases do not always have ACID transactions; a crash during ingestion can corrupt the index.
- Monitoring: track indexed_vectors_count vs vectors_count to detect indexing lag that degrades query performance.
- Memory sizing: calculate the actual footprint before deployment. A server with insufficient memory causes swapping that destroys latency.
- Test with narrow filters: if your application uses highly selective filters, test these scenarios explicitly. Latency under narrow filters is very different from unfiltered searches.
Common Anti-patterns to Avoid
- Indexing misconfigured during bulk ingestion: if the optimizer rebuilds the index on every batch, ingestion becomes extremely slow. In Qdrant, set indexing_threshold=0 to disable indexing during the bulk load, then restore it (e.g., 10k-20k) so the index is built once at the end.
- M too high without measuring: M=128 is not always better than M=32. Beyond a certain point, recall improves marginally while memory grows linearly. Measure with your dataset.
- No payload index on filtered fields: without an index, every filter condition is O(n). With 10M vectors, an unindexed filter is the difference between 5ms and 5000ms.
- Unnormalized vectors with cosine similarity: if using cosine similarity, vectors must be normalized. Some models do not normalize by default. Unnormalized vectors with cosine produce semantically incorrect results.
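A defensive normalization step before upsert avoids this class of bug entirely; a minimal sketch:

```python
import numpy as np

def ensure_unit_norm(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """L2-normalize rows so cosine similarity reduces to a plain dot product."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)   # eps guards zero vectors

emb = np.random.randn(100, 768).astype(np.float32) * 7.0   # unnormalized
emb = ensure_unit_norm(emb)
assert np.allclose(np.linalg.norm(emb, axis=1), 1.0, atol=1e-5)
```

Running this unconditionally in the ingestion pipeline is cheap, and it is a no-op for models that already normalize their output.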
Related Articles
- RAG: Retrieval-Augmented Generation Explained - RAG fundamentals to contextualize the role of the vector database
- Embeddings and Vector Search: BERT vs Sentence Transformers - Choosing the right embedding model for your pipeline
- Hybrid Retrieval: BM25 + Vector Search - Combine vector search with keyword search for better recall
- PostgreSQL with pgvector - Vector search on PostgreSQL without additional infrastructure
Conclusions and Next Steps
Selecting and optimizing a vector database is one of the most technical and impactful aspects of AI engineering. There are no universal answers: each system has a distinct strength profile and the optimal configuration depends on your specific workload.
The recommended path for a new project: start with Qdrant with SQ8 for operational simplicity and good performance, then measure recall and latency on your real dataset. If performance is insufficient, explore M and ef tuning. If memory is a constraint, evaluate product quantization or DiskANN. If you are already on PostgreSQL and have moderate volume (<5M vectors), consider pgvector before adding new infrastructure.
The next articles in this series build on these foundations: the Hybrid Retrieval article covers combining vector search with BM25 to improve recall on precise queries, while RAG in Production shows how to measure the end-to-end impact of vector database choices on RAG response quality.
The embedding concepts presented here connect directly to the Modern NLP series and to the PostgreSQL AI series for teams wanting to implement vector search on existing infrastructure.