
Building a RAG Chat: Full Pipeline Guide

May 11, 2026
Updated May 14, 2026
30 min read


Building a production RAG chat system from scratch using OpenAI embeddings, pgvector HNSW indexing, and hybrid search. This post explains how the Ask AI page works: OpenAI's text-embedding-3-small model generates 1536-dimensional embeddings, which are stored in PostgreSQL with pgvector and indexed with HNSW for fast approximate nearest-neighbor search. Vector results are combined with PostgreSQL tsvector full-text search under weighted scoring, with relevance thresholds to prevent hallucination and source-grounded citations that show which post each answer came from. It also covers the complete API pipeline from query to streaming SSE response, including session management and rate limiting.

Building a RAG Chat System From Zero

The "Ask AI" page on this blog is not a generic chatbot. It's a Retrieval-Augmented Generation system that answers questions using only the content from this site's posts, and it shows you exactly which post each part of the answer came from.

Here's how it works, from embedding to response.


Why RAG Instead of Fine-Tuning

Fine-tuning a model on blog content would:

  • Require retraining every time a new post is published
  • Risk hallucinating facts not present in the training data
  • Give no way to cite sources in the response

RAG solves all three: query the content at runtime, inject the relevant chunks into the prompt, and return the source citations alongside the answer. No retraining needed.


Architecture

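At a high level, the flow described in the steps below looks like this (a simplified sketch, not the exact diagram):

```text
User question
     |
     v
Embed query (text-embedding-3-small, 1536-d)
     |
     +--> Vector search (pgvector HNSW, ef_search = 40)
     +--> Full-text search (tsvector)
             |
             v
   Reciprocal Rank Fusion (k = 60)
             |
             v
   Top 5-10 chunks -> prompt assembly -> LLM
             |
             v
   Source verification -> streaming SSE response
```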

Step 1: The Embedding Pipeline

Every published post is split into chunks and embedded. The chunks are stored in the rag_chunks table:

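A plausible shape for the table, reconstructed from the details in this post (column names other than chunk_index and embedding are assumptions):

```sql
-- Sketch of the rag_chunks table; exact columns are assumed,
-- but vector(1536) matches text-embedding-3-small's output.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE rag_chunks (
    id          BIGSERIAL PRIMARY KEY,
    post_slug   TEXT NOT NULL,
    chunk_index INT  NOT NULL,
    content     TEXT NOT NULL,
    metadata    JSONB NOT NULL DEFAULT '{}',  -- title, section heading, URL
    embedding   vector(1536) NOT NULL
);
```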

The embedding dimension (1536) comes from the model: text-embedding-3-small from OpenAI. The choice was pragmatic — it's the cheapest per-token of the high-quality embedding models and produces 1536-dimensional vectors that work well with pgvector's HNSW index.

Chunking Strategy

Posts are split on paragraph boundaries, not fixed token counts:

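A minimal sketch of paragraph-boundary chunking with a soft size cap (the function name and cap value are assumptions, not the site's actual code):

```python
def chunk_post(markdown: str, max_chars: int = 1500) -> list[str]:
    """Split on blank-line paragraph boundaries, packing whole
    paragraphs into chunks up to max_chars so that code blocks,
    lists, and blockquotes are never split mid-unit."""
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        # Start a new chunk rather than splitting a paragraph.
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```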

Why paragraph boundaries? Code blocks, lists, and blockquotes are semantic units. Splitting mid-paragraph would separate a code example from its explanation, making the chunk useless for both retrieval and generation.

Each chunk stores its chunk_index so the frontend can link back to the correct section of the post. Metadata includes the post slug, title, section heading, and URL.


Step 2: The HNSW Index

pgvector supports two index types for approximate nearest neighbor search: IVFFlat and HNSW. I chose HNSW for three reasons:

  • Faster build time — HNSW builds incrementally. IVFFlat requires a full rebuild when data changes.
  • Better recall at the same speed — at my dataset size (~50K chunks), HNSW reaches roughly 99% recall at around 10 ms per query.
  • No training required — IVFFlat needs a clustering step that depends on having representative data. HNSW needs no such step; its only knobs are the build parameters below.
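The DDL for these parameters would look like this (the index name is assumed; the operator class depends on the distance metric, here cosine):

```sql
CREATE INDEX rag_chunks_embedding_hnsw
    ON rag_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);
```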

The parameters:

  • m = 16 — each node connects to 16 neighbors. Higher = better recall, slower build. 16 is the sweet spot for datasets under 100K vectors.
  • ef_construction = 200 — the dynamic list size during construction. Higher = better index quality, slower build. 200 is conservative.

At query time, the search uses SET hnsw.ef_search = 40 — this controls the search breadth. Higher = better recall, slower query.


Step 3: Hybrid Search

Vector search alone misses exact keyword matches. "How do I install FastAPI?" matches the vector of "FastAPI installation guide" but misses the exact phrase match. Full-text search via tsvector catches what vector search misses.

The hybrid query combines both:

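The real query runs in SQL; as an in-Python sketch of the alpha-weighted blend it computes (function name and min-max normalization are my assumptions):

```python
def blend_scores(
    vector_hits: dict[str, float],   # chunk_id -> cosine similarity
    keyword_hits: dict[str, float],  # chunk_id -> ts_rank score
    alpha: float = 0.7,
) -> list[tuple[str, float]]:
    """Weighted blend of vector and keyword scores.

    Each source is min-max normalized to [0, 1] first, since cosine
    similarity and ts_rank live on different scales.
    """
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    v = normalize(vector_hits)
    k = normalize(keyword_hits)
    blended = {
        cid: alpha * v.get(cid, 0.0) + (1 - alpha) * k.get(cid, 0.0)
        for cid in set(v) | set(k)
    }
    return sorted(blended.items(), key=lambda x: x[1], reverse=True)
```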

The alpha parameter controls the weight between vector and keyword scores. 0.7 means 70% vector, 30% keyword — biased toward semantic understanding while still catching exact matches.


Step 4: Hybrid Ranking

Results from both searches are combined using Reciprocal Rank Fusion (RRF):

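A minimal RRF implementation looks like this (the function name is mine; the k=60 constant matches the post):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of chunk ids.

    Each appearance contributes 1 / (k + rank); k = 60 damps the
    influence of any single list's top positions.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```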

RRF is simple, fast, and doesn't require training a learned ranker. The constant k=60 prevents any single ranking from dominating.


Step 5: Context Assembly

The top 5-10 chunks are assembled into a prompt. The system prompt is:

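The exact wording isn't reproduced here; a plausible shape, based on the behavior this post describes:

```text
You are an assistant that answers questions about this blog using ONLY
the context chunks provided below. If the context does not contain the
answer, say so; do not guess and do not invent sources. After each
claim, cite the source post it came from.
```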

The user prompt includes the question and the chunk content:

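A template along these lines, with placeholders standing in for the assembled values (the exact format is assumed):

```text
Question: {question}

Context:
[1] {post_title} ({post_url})
{chunk_content}

[2] {post_title} ({post_url})
{chunk_content}
```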

Step 6: Source Verification

After the LLM generates a response, a verification step checks that each cited source actually exists in the provided chunks:

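A sketch of that check, assuming citations appear as bracketed slugs in the answer (the marker format and function name are assumptions):

```python
import re


def unverified_citations(answer: str, chunks: list[dict]) -> list[str]:
    """Return cited slugs that do NOT match any provided chunk."""
    known = {chunk["slug"] for chunk in chunks}
    cited = set(re.findall(r"\[([a-z0-9-]+)\]", answer))
    return sorted(cited - known)
```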

Unverified claims are flagged but not removed from the response — they're marked with a warning icon in the frontend. This happens rarely (less than 2% of queries) and usually when the LLM rephrases a source name.


Step 7: The Database Model

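A sketch of the model in SQLAlchemy 2.0 declarative style (column names beyond chunk_index and embedding are assumptions; metadata_ avoids SQLAlchemy's reserved attribute name):

```python
from pgvector.sqlalchemy import Vector
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class RagChunk(Base):
    __tablename__ = "rag_chunks"

    id: Mapped[int] = mapped_column(primary_key=True)
    post_slug: Mapped[str]
    chunk_index: Mapped[int]
    content: Mapped[str]
    metadata_: Mapped[dict] = mapped_column("metadata", JSONB)
    embedding: Mapped[list[float]] = mapped_column(Vector(1536))
```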

The Vector type comes from pgvector.sqlalchemy. It maps directly to PostgreSQL's vector extension type.


Cold Start: First User Experience

When a user visits the Ask AI page for the first time, there are no chunks to search. The solution: a pre-computed set of seed questions and answers, one per published post, generated during the embedding pipeline.

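The generation step might look like this minimal sketch; the function name, prompt wording, and the ask_llm callable are all assumptions, not the site's actual code:

```python
def build_seed_entry(post: dict, ask_llm) -> dict:
    """Generate one seed question for a post during the embedding pipeline.

    ask_llm is any callable taking a prompt string and returning text;
    in production this would call the chat completions API.
    """
    prompt = (
        "Write one short question a reader might ask that this post "
        f"answers directly.\n\nTitle: {post['title']}\n\n{post['excerpt']}"
    )
    return {
        "slug": post["slug"],
        "question": ask_llm(prompt).strip(),
        # The question is then embedded and stored alongside the chunks.
    }
```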

These seed questions are embedded and stored alongside the post chunks. On the first page load, the frontend fetches 3-5 seed questions as suggestions. When the user clicks one, it triggers a RAG query, which populates the embedding cache. Subsequent queries hit the cache.


What's Next

In the next post, I'll cover the production RAG pipeline — streaming responses via SSE, progressive rendering, citation badges, fallback strategies, rate limiting, and the cold-start UX flow in detail.


Built with FastAPI, pgvector, OpenAI embeddings, and zero third-party CMS.

Frequently Asked Questions

What is a RAG chat system?
A RAG chat system is a Retrieval-Augmented Generation system that answers questions using only the content from a specific site's posts, providing source citations for each part of the answer.

Why use RAG instead of fine-tuning?
RAG doesn't require retraining with each new post, avoids hallucinating facts not present in the training data, and allows source citations in the responses.

What does the embedding model do?
The embedding model converts each content chunk into a 1536-dimensional vector, which is used to find the most relevant chunks for answering a user's question.

Why split posts on paragraph boundaries?
Splitting on paragraph boundaries preserves semantic units like code blocks and lists, ensuring that examples and their explanations stay together for effective retrieval and generation.

Why HNSW instead of IVFFlat?
The HNSW index offers faster incremental builds and better recall at the same query speed than IVFFlat, making it more efficient for approximate nearest-neighbor search at this dataset size.


Building a RAG Chat: Full Pipeline Guide | Madhu Dadi