RAG with PostgreSQL: From Document to Answer
Have you ever wanted your AI system to answer questions based on your company's specific documents, without training a custom model? The solution is called Retrieval-Augmented Generation (RAG), and it is one of the most powerful and practical architectures in modern AI. PostgreSQL with pgvector is one of the best tools available to implement it.
RAG combines two complementary capabilities: semantic search (finding the documents most relevant to a question) and natural language generation (producing a coherent answer grounded in those documents). The result is a system that responds with knowledge from your own data, not just the general knowledge of a pre-trained model.
In this article, we will build a complete end-to-end RAG pipeline: from document ingestion to GPT-4-powered query answering, all running on PostgreSQL. No additional vector database, no external vector store service.
Series Overview
| # | Article | Focus |
|---|---|---|
| 1 | pgvector | Installation, operators, indexing |
| 2 | Embeddings Deep Dive | Models, distances, generation |
| 3 | RAG with PostgreSQL (you are here) | End-to-end RAG pipeline |
| 4 | Similarity Search | Algorithms and optimization |
| 5 | HNSW and IVFFlat | Advanced indexing strategies |
| 6 | RAG in Production | Scalability and performance |
What You Will Learn
- The complete architecture of a RAG system: components and data flow
- Document ingestion pipeline: loading, parsing, chunking
- Storage strategy in PostgreSQL with pgvector
- Retrieval: from query to selecting the most relevant chunks
- Generation: how to build the prompt and integrate with GPT-4
- Hybrid search: combining vector search and PostgreSQL full-text search
- RAG quality evaluation: metrics and tools
RAG Architecture: How It Works
A RAG system has two main phases that operate at different times:
Phase 1: Ingestion (offline)
This runs once (or periodically when documents change). The steps are:
- Load: Load documents from filesystem, URLs, databases, APIs
- Parse: Extract text from PDF, DOCX, HTML, Markdown
- Chunk: Split text into optimally-sized fragments
- Embed: Generate an embedding vector for each chunk
- Store: Save chunk + embedding + metadata in PostgreSQL
Phase 2: Retrieval + Generation (online, for each query)
- Query: The user asks a question in natural language
- Embed query: Transform the question into a vector using the same model
- Search: Find the k most similar chunks in PostgreSQL
- Context: Assemble the found chunks as context
- Generate: Send question + context to the LLM to get an answer
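The online phase reduces to a short pipeline. Here is a minimal sketch with stubbed components (`embed`, `search`, and `generate` are placeholder callables standing in for the real implementations built later in this article):

```python
def answer(question, embed, search, generate, k=5):
    """Online RAG phase: embed the query, retrieve context, then generate."""
    query_vector = embed(question)        # same embedding model as ingestion
    chunks = search(query_vector, k)      # k most similar chunks from the store
    context = "\n\n".join(chunks)         # assemble retrieved context
    return generate(question, context)    # grounded answer from the LLM
```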
RAG Flow Diagram
INGESTION (offline):
PDF Document
|
v
[Parser] -> Raw text
|
v
[Chunker] -> ["chunk 1", "chunk 2", ..., "chunk N"]
|
v
[Embedding Model] -> [[0.023, -0.841, ...], [0.891, 0.234, ...], ...]
|
v
[PostgreSQL + pgvector] -> Permanent storage
QUERY (online):
User question: "How does HNSW indexing work?"
|
v
[Embedding Model] -> [0.045, -0.823, ...] (query vector)
|
v
[PostgreSQL ANN Search] -> Top 5 most similar chunks
|
v
[Prompt Builder] -> "Use this context: [chunk1, chunk2, ...] Question: ..."
|
v
[GPT-4 / Claude] -> "HNSW (Hierarchical Navigable Small World) indexing ..."
|
v
Answer to user
Project Setup
Dependencies
# requirements.txt
openai>=1.12.0
psycopg2-binary>=2.9.9
langchain>=0.1.0
langchain-openai>=0.0.5
langchain-community>=0.0.20
pypdf>=3.17.0
python-dotenv>=1.0.0
tiktoken>=0.5.0
beautifulsoup4>=4.12.0
requests>=2.31.0
# Install
pip install -r requirements.txt
Database Configuration
-- PostgreSQL initial setup
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- trigram (fuzzy) matching; built-in full-text search needs no extension
-- Complete RAG schema
CREATE TABLE IF NOT EXISTS rag_documents (
id BIGSERIAL PRIMARY KEY,
-- Source information
source_path TEXT NOT NULL,
source_type TEXT NOT NULL CHECK (source_type IN ('pdf', 'txt', 'md', 'html', 'docx')),
source_hash TEXT NOT NULL, -- MD5 hash of original file
-- Chunk info
chunk_index INTEGER NOT NULL,
chunk_total INTEGER,
-- Content
title TEXT,
content TEXT NOT NULL,
content_length INTEGER GENERATED ALWAYS AS (length(content)) STORED,
-- Embedding
embedding_model TEXT NOT NULL DEFAULT 'text-embedding-3-small',
embedding vector(1536),
-- Metadata
metadata JSONB DEFAULT '{}',
tags TEXT[] DEFAULT '{}',
-- Timestamps
ingested_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE (source_path, chunk_index, source_hash)
);
-- HNSW index for fast vector search
CREATE INDEX idx_rag_embedding_hnsw
ON rag_documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- GIN index for full-text search
CREATE INDEX idx_rag_content_fts
ON rag_documents
USING gin (to_tsvector('english', content));
-- Common filter indexes
CREATE INDEX idx_rag_source_type ON rag_documents (source_type);
CREATE INDEX idx_rag_tags ON rag_documents USING gin (tags);
CREATE INDEX idx_rag_metadata ON rag_documents USING gin (metadata);
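psycopg2 does not understand the `vector` type out of the box. One simple approach, used alongside the `%s::vector` casts later in this article, is to serialize embeddings as pgvector's text literal (the alternative is an adapter such as `register_vector` from the pgvector Python package). A minimal helper:

```python
def to_pgvector(values: list[float]) -> str:
    """Render a Python list as a pgvector text literal, e.g. [0.1,-0.2,0.3]."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"
```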
Document Ingestion Pipeline
Project Structure
rag_system/
├── config.py # DB config, API keys, parameters
├── ingestion/
│ ├── __init__.py
│ ├── loaders.py # Document loading from various sources
│ ├── parsers.py # PDF, DOCX, HTML, Markdown parsing
│ ├── chunkers.py # Chunking strategies
│ └── pipeline.py # Ingestion pipeline orchestrator
├── retrieval/
│ ├── __init__.py
│ ├── embedder.py # Embedding generation
│ └── searcher.py # Vector search and hybrid search
├── generation/
│ ├── __init__.py
│ ├── prompts.py # Prompt templates
│ └── generator.py # LLM integration
├── rag.py # Main RAGSystem class
└── main.py # Entry point
config.py
import os
from dataclasses import dataclass
from dotenv import load_dotenv
load_dotenv()
@dataclass
class Config:
# Database
db_host: str = os.getenv("DB_HOST", "localhost")
db_port: int = int(os.getenv("DB_PORT", "5432"))
db_name: str = os.getenv("DB_NAME", "ragdb")
db_user: str = os.getenv("DB_USER", "postgres")
db_password: str = os.getenv("DB_PASSWORD", "")
# OpenAI
openai_api_key: str = os.getenv("OPENAI_API_KEY", "")
embedding_model: str = "text-embedding-3-small"
embedding_dim: int = 1536
chat_model: str = "gpt-4o-mini" # cost-effective default
# Chunking
chunk_size: int = 800
chunk_overlap: int = 150
min_chunk_size: int = 100
# Retrieval
top_k: int = 5
similarity_threshold: float = 0.65 # minimum cosine similarity
# Generation
max_context_tokens: int = 8000
temperature: float = 0.1 # low temperature for factual answers
def get_db_url(self) -> str:
return f"postgresql://{self.db_user}:{self.db_password}@{self.db_host}:{self.db_port}/{self.db_name}"
config = Config()
ingestion/loaders.py - Multi-Source Document Loading
import hashlib
from pathlib import Path
from dataclasses import dataclass
from typing import Optional
import requests
from bs4 import BeautifulSoup
@dataclass
class RawDocument:
content: str
source_path: str
source_type: str
source_hash: str
title: Optional[str] = None
metadata: Optional[dict] = None
def __post_init__(self):
if self.metadata is None:
self.metadata = {}
def load_text_file(path: str) -> RawDocument:
p = Path(path)
content = p.read_text(encoding="utf-8")
return RawDocument(
content=content,
source_path=path,
source_type="txt",
source_hash=hashlib.md5(content.encode()).hexdigest(),
title=p.stem
)
def load_markdown_file(path: str) -> RawDocument:
p = Path(path)
content = p.read_text(encoding="utf-8")
# Extract title from frontmatter or first H1 line
title = None
for line in content.split("\n"):
if line.startswith("# "):
title = line[2:].strip()
break
return RawDocument(
content=content,
source_path=path,
source_type="md",
source_hash=hashlib.md5(content.encode()).hexdigest(),
title=title
)
def load_pdf_file(path: str) -> RawDocument:
from pypdf import PdfReader
reader = PdfReader(path)
pages = []
for page in reader.pages:
pages.append(page.extract_text())
content = "\n\n".join(pages)
return RawDocument(
content=content,
source_path=path,
source_type="pdf",
source_hash=hashlib.md5(content.encode()).hexdigest(),
title=Path(path).stem,
metadata={"pages": len(reader.pages)}
)
def load_url(url: str) -> RawDocument:
response = requests.get(url, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
# Remove script, style, nav, header, footer
for tag in soup(["script", "style", "nav", "header", "footer"]):
tag.decompose()
content = soup.get_text(separator="\n", strip=True)
title = soup.title.string if soup.title else url
return RawDocument(
content=content,
source_path=url,
source_type="html",
source_hash=hashlib.md5(content.encode()).hexdigest(),
title=title
)
def load_document(source: str) -> RawDocument:
"""Smart loader that picks the correct parser."""
if source.startswith(("http://", "https://")):
return load_url(source)
p = Path(source)
loaders = {
".txt": load_text_file,
".md": load_markdown_file,
".pdf": load_pdf_file,
}
loader = loaders.get(p.suffix.lower())
if not loader:
raise ValueError(f"Unsupported file type: {p.suffix}")
return loader(source)
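The `source_hash` field is what makes ingestion incremental: identical content always produces the same digest, so an unchanged document can be skipped without re-embedding. A self-contained illustration of the idea:

```python
import hashlib

def fingerprint(content: str) -> str:
    """MD5 digest used as the document's change-detection key."""
    return hashlib.md5(content.encode()).hexdigest()

# Same content -> same hash -> is_already_ingested() returns True, ingestion skipped.
# Any edit -> new hash -> the document is re-chunked and re-embedded.
```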
ingestion/chunkers.py - Smart Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dataclasses import dataclass
from typing import Optional
@dataclass
class TextChunk:
content: str
chunk_index: int
source_path: str
source_type: str
source_hash: str
title: Optional[str] = None
metadata: Optional[dict] = None
def __post_init__(self):
if self.metadata is None:
self.metadata = {}
class SmartChunker:
"""
Chunker that adapts its strategy to the document type.
"""
def __init__(self, chunk_size: int = 800, chunk_overlap: int = 150):
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
# Generic text splitter
self._text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
separators=["\n\n", "\n", ". ", "! ", "? ", "; ", ", ", " "],
length_function=len
)
# Markdown-aware splitter (respects document structure)
self._md_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
separators=["## ", "# ", "\n\n", "\n", ". "],
length_function=len
)
def chunk(self, doc) -> list[TextChunk]:
"""Chunk a document, choosing the right strategy."""
if doc.source_type == "md":
raw_chunks = self._md_splitter.split_text(doc.content)
else:
raw_chunks = self._text_splitter.split_text(doc.content)
# Filter out chunks that are too small
raw_chunks = [c for c in raw_chunks if len(c.strip()) > 100]
return [
TextChunk(
content=chunk.strip(),
chunk_index=i,
source_path=doc.source_path,
source_type=doc.source_type,
source_hash=doc.source_hash,
title=doc.title,
metadata={
**doc.metadata,
"chunk_total": len(raw_chunks),
"char_count": len(chunk)
}
)
for i, chunk in enumerate(raw_chunks)
]
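To see what `chunk_overlap` buys you, here is a toy character-window splitter (not the LangChain splitter used above, which also respects separators): consecutive chunks share `overlap` characters, so text cut at a boundary still appears whole in one of the two neighboring chunks.

```python
def window_chunks(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    """Fixed-size windows with `overlap` shared characters between neighbors."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```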
ingestion/pipeline.py - Main Orchestrator
import psycopg2
from psycopg2.extras import execute_values
import json
import time
from .loaders import load_document
from .chunkers import SmartChunker
class IngestionPipeline:
def __init__(self, config, embedder):
self.config = config
self.embedder = embedder
self.chunker = SmartChunker(
chunk_size=config.chunk_size,
chunk_overlap=config.chunk_overlap
)
self.conn = psycopg2.connect(config.get_db_url())
def is_already_ingested(self, source_path: str, source_hash: str) -> bool:
"""Check if the document is already in DB with the same hash (unchanged)."""
with self.conn.cursor() as cur:
cur.execute(
"SELECT COUNT(*) FROM rag_documents WHERE source_path = %s AND source_hash = %s",
(source_path, source_hash)
)
return cur.fetchone()[0] > 0
def ingest(self, source: str, tags: list[str] = None, force: bool = False) -> dict:
"""
Process a document and insert it into PostgreSQL.
Returns statistics about the operation.
"""
tags = tags or []
start_time = time.time()
# 1. Load document
doc = load_document(source)
print(f"Loaded: {source} ({len(doc.content)} chars, hash: {doc.source_hash[:8]})")
# 2. Check if already present (incremental update)
if not force and self.is_already_ingested(source, doc.source_hash):
print(f" Skipped: document unchanged")
return {"skipped": True, "source": source}
# 3. Chunking
chunks = self.chunker.chunk(doc)
print(f" Chunked: {len(chunks)} chunks created")
# 4. Delete previous version (if exists)
with self.conn.cursor() as cur:
cur.execute("DELETE FROM rag_documents WHERE source_path = %s", (source,))
# 5. Generate embeddings in batch
texts = [c.content for c in chunks]
embeddings = self.embedder.embed_batch(texts)
print(f" Embeddings: {len(embeddings)} vectors of dim {len(embeddings[0])}")
# 6. Insert into PostgreSQL
rows = [
(
c.source_path,
c.source_type,
c.source_hash,
c.chunk_index,
len(chunks), # chunk_total
c.title,
c.content,
self.config.embedding_model,
str(embeddings[i]),  # pgvector text literal for the %s::vector cast below
json.dumps(c.metadata),
tags
)
for i, c in enumerate(chunks)
]
with self.conn.cursor() as cur:
execute_values(cur, """
INSERT INTO rag_documents
(source_path, source_type, source_hash, chunk_index, chunk_total,
title, content, embedding_model, embedding, metadata, tags)
VALUES %s
ON CONFLICT (source_path, chunk_index, source_hash) DO UPDATE SET
content = EXCLUDED.content,
embedding = EXCLUDED.embedding,
updated_at = NOW()
""", rows, template="(%s,%s,%s,%s,%s,%s,%s,%s,%s::vector,%s::jsonb,%s::text[])")
self.conn.commit()
elapsed = time.time() - start_time
stats = {
"source": source,
"chunks": len(chunks),
"embeddings": len(embeddings),
"elapsed_sec": round(elapsed, 2)
}
print(f" Completed in {elapsed:.1f}s - {stats}")
return stats
def ingest_directory(self, directory: str, extensions: list[str] = None) -> list[dict]:
"""Ingest all documents in a directory."""
from pathlib import Path
extensions = extensions or [".txt", ".md", ".pdf"]
results = []
for path in Path(directory).rglob("*"):
if path.suffix.lower() in extensions:
result = self.ingest(str(path))
results.append(result)
return results
Retrieval: Finding the Right Chunks
retrieval/searcher.py
import psycopg2
from dataclasses import dataclass
from typing import Optional
@dataclass
class SearchResult:
id: int
source_path: str
source_type: str
chunk_index: int
title: Optional[str]
content: str
similarity: float
metadata: dict
class HybridSearcher:
"""
Combines vector search (semantic) with full-text search (keyword).
Uses Reciprocal Rank Fusion to merge the two result sets.
"""
def __init__(self, config, embedder):
self.config = config
self.embedder = embedder
self.conn = psycopg2.connect(config.get_db_url())
def vector_search(self, query: str, top_k: int = 10,
source_type: Optional[str] = None,
tags: Optional[list[str]] = None) -> list[SearchResult]:
"""Semantic search with optional filters."""
# serialize as a pgvector text literal so %s::vector works without an adapter
query_embedding = str(self.embedder.embed_single(query))
threshold = 1 - self.config.similarity_threshold # convert to cosine distance
# Build dynamic query with optional filters
filters = ["embedding <=> %s::vector < %s"]
params = [query_embedding, threshold]
if source_type:
filters.append("source_type = %s")
params.append(source_type)
if tags:
filters.append("tags && %s::text[]")  # array overlap: at least one tag in common
params.append(tags)
where_clause = " AND ".join(filters)
with self.conn.cursor() as cur:
cur.execute(f"""
SELECT
id, source_path, source_type, chunk_index, title, content,
1 - (embedding <=> %s::vector) AS similarity,
metadata
FROM rag_documents
WHERE {where_clause}
ORDER BY embedding <=> %s::vector
LIMIT %s
""", [query_embedding] + params + [query_embedding, top_k])
rows = cur.fetchall()
return [
SearchResult(
id=r[0], source_path=r[1], source_type=r[2],
chunk_index=r[3], title=r[4], content=r[5],
similarity=round(r[6], 4), metadata=r[7]
)
for r in rows
]
def fulltext_search(self, query: str, top_k: int = 10) -> list[SearchResult]:
"""Full-text search with ts_rank for ranking."""
with self.conn.cursor() as cur:
cur.execute("""
SELECT
id, source_path, source_type, chunk_index, title, content,
ts_rank(to_tsvector('english', content),
plainto_tsquery('english', %s)) AS rank,
metadata
FROM rag_documents
WHERE to_tsvector('english', content) @@
plainto_tsquery('english', %s)
ORDER BY rank DESC
LIMIT %s
""", (query, query, top_k))
rows = cur.fetchall()
return [
SearchResult(
id=r[0], source_path=r[1], source_type=r[2],
chunk_index=r[3], title=r[4], content=r[5],
similarity=round(float(r[6]), 4), metadata=r[7]
)
for r in rows
]
def hybrid_search(self, query: str, top_k: int = 5,
vector_weight: float = 0.7) -> list[SearchResult]:
"""
Reciprocal Rank Fusion (RRF) to combine vector and full-text results.
RRF Score = sum(1 / (k + rank)) for each result list.
"""
k_rrf = 60 # standard RRF constant
# Get both result sets
vector_results = self.vector_search(query, top_k=top_k * 2)
fts_results = self.fulltext_search(query, top_k=top_k * 2)
# Compute RRF scores
scores = {}
all_results = {}
for rank, result in enumerate(vector_results):
scores[result.id] = scores.get(result.id, 0) + vector_weight / (k_rrf + rank + 1)
all_results[result.id] = result
fts_weight = 1 - vector_weight
for rank, result in enumerate(fts_results):
scores[result.id] = scores.get(result.id, 0) + fts_weight / (k_rrf + rank + 1)
all_results[result.id] = result
# Sort by RRF score and take top_k
sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
final_results = [all_results[id] for id in sorted_ids[:top_k]]
# Update similarity with normalized RRF score
max_score = scores[sorted_ids[0]] if sorted_ids else 1
for result in final_results:
result.similarity = round(scores[result.id] / max_score, 4)
return final_results
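The fusion step can be checked in isolation. Here is a standalone weighted RRF over toy ranked lists of document ids, using the same formula as `hybrid_search` above (k = 60):

```python
def rrf(rankings: list[tuple[list[int], float]], k: int = 60) -> list[int]:
    """Fuse ranked id lists; each entry is (ids_best_first, weight)."""
    scores: dict[int, float] = {}
    for ids, weight in rankings:
        for rank, doc_id in enumerate(ids):
            # RRF: each appearance contributes weight / (k + rank), rank 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document found by both searches accumulates score from both lists, which is exactly why hybrid search promotes it.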
Generation: From Context to Answer
generation/prompts.py
from string import Template
# System prompt that defines the AI's behavior
RAG_SYSTEM_PROMPT = """You are a precise and helpful AI assistant. Answer questions
based EXCLUSIVELY on the provided context documents.
Rules:
1. Use ONLY information present in the context. Do not make things up.
2. If the answer is not in the context, say so clearly.
3. Cite sources using [Source: filename, chunk X] after each claim.
4. Keep a professional and concise tone.
5. Structure the answer clearly with paragraphs or bullet points where appropriate.
"""
def build_rag_prompt(query: str, context_chunks: list, include_sources: bool = True) -> str:
"""
Build the prompt for the LLM with the retrieved context.
Args:
query: The user's question
context_chunks: List of SearchResult objects
include_sources: Whether to include source information
Returns:
The formatted prompt for the LLM
"""
if not context_chunks:
return f"Question: {query}\n\nNote: No relevant documents found in the knowledge base."
# Build context with numbering and source
context_parts = []
for i, chunk in enumerate(context_chunks, 1):
source_info = f"[Source: {chunk.source_path}, chunk {chunk.chunk_index}]" if include_sources else ""
context_parts.append(f"--- Document {i} {source_info} ---\n{chunk.content}")
context_text = "\n\n".join(context_parts)
return f"""Context from documents:
{context_text}
---
User question: {query}
Answer based on the provided context."""
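The context-assembly convention is easy to verify on its own. Below is a simplified stand-in for the assembly step of `build_rag_prompt` (`SimpleNamespace` substitutes for `SearchResult`):

```python
from types import SimpleNamespace

def assemble_context(chunks) -> str:
    """Number each chunk and tag it with its source, as the RAG prompt does."""
    parts = [
        f"--- Document {i} [Source: {c.source_path}, chunk {c.chunk_index}] ---\n{c.content}"
        for i, c in enumerate(chunks, 1)
    ]
    return "\n\n".join(parts)

chunk = SimpleNamespace(source_path="guide.md", chunk_index=0,
                        content="pgvector stores vectors.")
```

The per-chunk source tags are what allow the system prompt's citation rule ("[Source: filename, chunk X]") to work: the model can only cite what the prompt labels.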
generation/generator.py
from openai import OpenAI
from dataclasses import dataclass
from typing import Optional
import tiktoken
from .prompts import RAG_SYSTEM_PROMPT, build_rag_prompt
@dataclass
class RAGResponse:
answer: str
sources: list[dict]
model: str
total_tokens: int
prompt_tokens: int
completion_tokens: int
class RAGGenerator:
def __init__(self, config):
self.config = config
self.client = OpenAI(api_key=config.openai_api_key)
self.tokenizer = tiktoken.encoding_for_model("gpt-4o")  # o200k_base, shared by gpt-4o-mini
def count_tokens(self, text: str) -> int:
return len(self.tokenizer.encode(text))
def truncate_context(self, chunks: list, max_tokens: int) -> list:
"""
Truncate context to avoid exceeding the token limit.
Keeps the most relevant chunks (already sorted by similarity).
"""
selected = []
used_tokens = 0
for chunk in chunks:
chunk_tokens = self.count_tokens(chunk.content)
if used_tokens + chunk_tokens > max_tokens:
break
selected.append(chunk)
used_tokens += chunk_tokens
return selected
def generate(self, query: str, context_chunks: list,
stream: bool = False) -> RAGResponse:
"""
Generate a RAG response.
Args:
query: The user's question
context_chunks: Chunks retrieved from PostgreSQL
stream: If True, use streaming (not implemented here for simplicity)
"""
# Truncate context if necessary
max_context_tokens = self.config.max_context_tokens
truncated_chunks = self.truncate_context(context_chunks, max_context_tokens)
if len(truncated_chunks) < len(context_chunks):
print(f" Context truncated: {len(context_chunks)} -> {len(truncated_chunks)} chunks")
# Build the prompt
user_prompt = build_rag_prompt(query, truncated_chunks)
# Call the LLM
response = self.client.chat.completions.create(
model=self.config.chat_model,
messages=[
{"role": "system", "content": RAG_SYSTEM_PROMPT},
{"role": "user", "content": user_prompt}
],
temperature=self.config.temperature,
max_tokens=1500
)
answer = response.choices[0].message.content
usage = response.usage
# Prepare sources for the response
sources = [
{
"source": chunk.source_path,
"chunk_index": chunk.chunk_index,
"similarity": chunk.similarity,
"excerpt": chunk.content[:200] + "..."
}
for chunk in truncated_chunks
]
return RAGResponse(
answer=answer,
sources=sources,
model=self.config.chat_model,
total_tokens=usage.total_tokens,
prompt_tokens=usage.prompt_tokens,
completion_tokens=usage.completion_tokens
)
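The greedy budget logic in `truncate_context` can be verified without tiktoken. Here is the same algorithm with a pluggable counter (a whitespace word count stands in for the tokenizer):

```python
def fit_budget(texts: list[str], max_units: int,
               count=lambda t: len(t.split())) -> list[str]:
    """Keep texts in order (best-first) until the budget would be exceeded."""
    kept, used = [], 0
    for text in texts:
        cost = count(text)
        if used + cost > max_units:
            break  # stop at the first text that no longer fits
        kept.append(text)
        used += cost
    return kept
```

Because the chunks arrive sorted by similarity, breaking at the first overflow keeps the most relevant context and drops the least relevant.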
The Complete RAG System
rag.py - Main Class
from config import config, Config
from ingestion.pipeline import IngestionPipeline
from retrieval.searcher import HybridSearcher
from generation.generator import RAGGenerator
class EmbeddingService:
"""Wrapper for OpenAI embedding generation."""
def __init__(self, cfg: Config):
from openai import OpenAI
self.client = OpenAI(api_key=cfg.openai_api_key)
self.model = cfg.embedding_model
def embed_single(self, text: str) -> list[float]:
resp = self.client.embeddings.create(
input=[text.replace("\n", " ")],
model=self.model
)
return resp.data[0].embedding
def embed_batch(self, texts: list[str]) -> list[list[float]]:
cleaned = [t.replace("\n", " ").strip() for t in texts]
resp = self.client.embeddings.create(input=cleaned, model=self.model)
return [item.embedding for item in resp.data]
class RAGSystem:
"""
Complete RAG system: ingestion + retrieval + generation.
"""
def __init__(self, cfg: Config = None):
self.config = cfg or config
self.embedder = EmbeddingService(self.config)
self.ingestion = IngestionPipeline(self.config, self.embedder)
self.searcher = HybridSearcher(self.config, self.embedder)
self.generator = RAGGenerator(self.config)
def add_document(self, source: str, tags: list[str] = None) -> dict:
"""Add a document to the knowledge base."""
return self.ingestion.ingest(source, tags=tags)
def add_directory(self, directory: str, extensions: list[str] = None) -> list[dict]:
"""Add all documents in a directory."""
return self.ingestion.ingest_directory(directory, extensions)
def ask(self, question: str, use_hybrid: bool = True,
source_type: str = None) -> dict:
"""
Ask a question to the RAG system.
Returns:
dict with answer, sources, usage
"""
# 1. Retrieval
if use_hybrid:
chunks = self.searcher.hybrid_search(question, top_k=self.config.top_k)
else:
chunks = self.searcher.vector_search(
question, top_k=self.config.top_k, source_type=source_type
)
if not chunks:
return {
"answer": "No relevant information found to answer this question.",
"sources": [],
"retrieval": {"chunks_found": 0}
}
# 2. Generation
response = self.generator.generate(question, chunks)
return {
"answer": response.answer,
"sources": response.sources,
"retrieval": {
"chunks_found": len(chunks),
"top_similarity": chunks[0].similarity if chunks else 0
},
"usage": {
"model": response.model,
"total_tokens": response.total_tokens
}
}
main.py - System Usage
from rag import RAGSystem
# Initialize the system
rag = RAGSystem()
# --- INGESTION ---
print("=== Adding documents to the knowledge base ===")
# Add individual files
rag.add_document("docs/postgresql_guide.pdf", tags=["postgresql", "database"])
rag.add_document("docs/pgvector_tutorial.md", tags=["pgvector", "vector-search"])
rag.add_document("https://www.postgresql.org/docs/current/", tags=["official-docs"])
# Add an entire directory
stats = rag.add_directory("docs/", extensions=[".md", ".txt", ".pdf"])
print(f"Ingested {len(stats)} documents")
# --- QUERY ---
print("\n=== Querying the system ===")
questions = [
"How do I install pgvector on PostgreSQL 16?",
"What is the difference between HNSW and IVFFlat?",
"How do I optimize memory for vector search?",
]
for q in questions:
print(f"\nQuestion: {q}")
print("-" * 60)
result = rag.ask(q)
print(f"Answer:\n{result['answer']}")
print(f"\nSources used ({len(result['sources'])}):")
for src in result["sources"]:
print(f" - {src['source']} [similarity: {src['similarity']}]")
print(f"\nTokens used: {result['usage']['total_tokens']}")
Hybrid Search: PostgreSQL Full-Text + Vector
One of PostgreSQL's great strengths for RAG is that you can combine semantic (vector) search with classic full-text search in a single query. This is especially useful for queries containing precise technical terms (proper names, acronyms, software versions) that semantic search alone might not capture perfectly:
-- Pure SQL hybrid search: vector + full-text in a single query
WITH vector_search AS (
SELECT id, content, source_path, chunk_index,
1 - (embedding <=> %s::vector) AS vector_score,
ROW_NUMBER() OVER (ORDER BY embedding <=> %s::vector) AS vector_rank
FROM rag_documents
ORDER BY embedding <=> %s::vector
LIMIT 20
),
fts_search AS (
SELECT id, content, source_path, chunk_index,
ts_rank(to_tsvector('english', content),
plainto_tsquery('english', %s)) AS fts_score,
ROW_NUMBER() OVER (
ORDER BY ts_rank(to_tsvector('english', content),
plainto_tsquery('english', %s)) DESC
) AS fts_rank
FROM rag_documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
LIMIT 20
),
-- Reciprocal Rank Fusion
rrf AS (
SELECT
COALESCE(v.id, f.id) AS id,
COALESCE(v.content, f.content) AS content,
COALESCE(v.source_path, f.source_path) AS source_path,
-- Weighted RRF: vector weight 0.7, full-text weight 0.3
COALESCE(0.7 / (60 + v.vector_rank), 0) +
COALESCE(0.3 / (60 + f.fts_rank), 0) AS rrf_score
FROM vector_search v
FULL OUTER JOIN fts_search f ON v.id = f.id
)
SELECT id, content, source_path, rrf_score
FROM rrf
ORDER BY rrf_score DESC
LIMIT 5;
RAG Quality Evaluation
How do you measure whether your RAG system is working well? The key metrics are:
| Metric | What It Measures | Target | How to Compute |
|---|---|---|---|
| Recall@K | Right documents found in top K results | > 0.70 | Test set with ground truth |
| Precision@K | Retrieved results are actually relevant | > 0.60 | Manual annotation |
| Answer Faithfulness | Answer is grounded in retrieved context | > 0.80 | RAGAS framework |
| Answer Relevancy | Answer addresses the question asked | > 0.75 | RAGAS framework |
| P95 Latency | Response time at 95th percentile | < 3s | Production monitoring |
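Recall@K and Precision@K need nothing more than a labeled test set of relevant chunk ids. A minimal implementation:

```python
def recall_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the top-k results that are actually relevant."""
    top = retrieved[:k]
    return len(set(top) & relevant) / len(top) if top else 0.0
```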
# Evaluation with RAGAS
# pip install ragas
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset
# Prepare the test dataset
test_data = {
"question": [
"How do I create an HNSW index in pgvector?",
"What is the vector dimension limit in pgvector?",
],
"answer": [
# Answers generated by your RAG system
rag.ask("How do I create an HNSW index in pgvector?")["answer"],
rag.ask("What is the vector dimension limit in pgvector?")["answer"],
],
"contexts": [
# The chunks retrieved for each question
[c["excerpt"] for c in rag.ask("How do I create an HNSW index in pgvector?")["sources"]],
[c["excerpt"] for c in rag.ask("What is the vector dimension limit in pgvector?")["sources"]],
],
"ground_truth": [
"CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64)",
"The limit is 16000 dimensions for vector type in pgvector 0.7+",
]
}
dataset = Dataset.from_dict(test_data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(results)
Advanced Chunking Strategies
Chunking quality is one of the most important factors for RAG performance. A poorly calibrated chunking strategy can degrade results even with the best embedding model. Here are advanced strategies for specific use cases:
Header-Based Chunking for Structured Documents
import re
from typing import Generator
def chunk_by_headers(markdown_text: str, max_chunk_size: int = 800) -> Generator:
"""
Chunking that respects the hierarchical structure of Markdown documents.
Each H2/H3 section becomes a separate context, preserving the title
as the chunk header (critical for embedding quality - the model needs
to understand what topic the chunk is about).
"""
# Regex to find Markdown headers (H1-H4)
header_pattern = re.compile(r'^(#{1,4})\s+(.+)$', re.MULTILINE)