Building a RAG Chat System From Zero — Full Pipeline Guide

May 11, 2026
20 min read

How the Ask AI page works — embeddings, pgvector HNSW index, hybrid search with tsvector + vector scoring, relevance thresholds, and source-grounded answers. Full pipeline from query to response.

The "Ask AI" page on this blog is not a generic chatbot. It's a Retrieval-Augmented Generation system that answers questions using only the content from this site's posts, and it shows you exactly which post each part of the answer came from.

Here's how it works, from embedding to response.


Why RAG Instead of Fine-Tuning

Fine-tuning a model on blog content would:

  • Require retraining every time a new post is published
  • Risk hallucinating facts not present in the training data
  • Give no way to cite sources in the response

RAG solves all three: query the content at runtime, inject the relevant chunks into the prompt, and return the source citations alongside the answer. No retraining needed.
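To make that loop concrete, here is a minimal sketch of the runtime flow, not the actual handler: hybrid_search is shown in Step 3, build_prompt is a hypothetical helper sketched under Step 5, SYSTEM_PROMPT stands in for the system prompt from Step 5, and the model name is an assumption. The client usage follows the standard OpenAI SDK.

python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def answer_question(question: str) -> dict:
    # 1. Query the content at runtime (Step 3's hybrid search)
    chunks = await hybrid_search(question)
    # 2. Inject the relevant chunks into the prompt (Step 5's context assembly)
    prompt = build_prompt(question, chunks)
    # 3. Generate the answer; SYSTEM_PROMPT and the model name are assumptions
    completion = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    answer = completion.choices[0].message.content
    # 4. Return the retrieved chunks alongside the answer so the frontend can cite them
    return {"answer": answer, "chunks": chunks}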


Architecture

text
User Question
      │
      ▼
[Embedding Model] ──→ Question Vector
      │
      ▼
[pgvector HNSW Index] ──→ Top-K Similar Chunks (by vector distance)
      │
      ▼
[tsvector Full-Text Search] ──→ Top-K Chunks (by keyword relevance)
      │
      ▼
[Hybrid Scorer] ──→ Weighted + Reranked Results
      │
      ▼
[Context Assembly] ──→ Prompt with top chunks + question
      │
      ▼
[LLM] ──→ Generated Answer + Source Citations
      │
      ▼
[Source Verification] ──→ Verify citations match chunks
      │
      ▼
[Streaming Response] ──→ SSE to frontend

Step 1: The Embedding Pipeline

Every published post is split into chunks and embedded. The chunks are stored in the rag_chunks table:

sql
CREATE TABLE rag_chunks (
    id UUID PRIMARY KEY,
    post_id UUID REFERENCES posts(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536),
    metadata JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

The embedding dimension (1536) comes from the model: text-embedding-3-small from OpenAI. The choice was pragmatic — it's the cheapest per-token of the high-quality embedding models and produces 1536-dimensional vectors that work well with pgvector's HNSW index.
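The embedding call itself is a thin wrapper around the OpenAI SDK. A minimal sketch, assuming chunks are embedded in batches (the helper name is illustrative, not the actual pipeline code):

python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # text-embedding-3-small returns 1536-dimensional vectors by default
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )
    # one embedding per input string, in the same order as the input
    return [item.embedding for item in response.data]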

Chunking Strategy

Posts are split on paragraph boundaries, not fixed token counts:

python
def chunk_post(content: str, max_tokens: int = 500) -> list[str]:
    paragraphs = content.split("\n\n")
    chunks = []
    current = []

    for p in paragraphs:
        estimated_tokens = len(p.split())
        current_token_count = sum(len(c.split()) for c in current)

        if current_token_count + estimated_tokens > max_tokens and current:
            chunks.append("\n\n".join(current))
            current = [p]
        else:
            current.append(p)

    if current:
        chunks.append("\n\n".join(current))

    return chunks

Why paragraph boundaries? Code blocks, lists, and blockquotes are semantic units. Splitting mid-paragraph would separate a code example from its explanation, making the chunk useless for both retrieval and generation.

Each chunk stores its chunk_index so the frontend can link back to the correct section of the post. Metadata includes the post slug, title, section heading, and URL.
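For illustration, the metadata for a single chunk of this post might look like the following; the exact key names and URL format are assumptions based on the fields listed above:

python
# Illustrative metadata payload for one chunk (key names are assumptions)
chunk_metadata = {
    "post_slug": "building-a-rag-chat-system-from-zero",
    "post_title": "Building a RAG Chat System From Zero",
    "section_heading": "Step 1: The Embedding Pipeline",
    "url": "/posts/building-a-rag-chat-system-from-zero",
}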


Step 2: The HNSW Index

pgvector supports two index types for approximate nearest neighbor search: IVFFlat and HNSW. I chose HNSW for three reasons:

  • Incremental builds — HNSW builds incrementally as chunks are added. IVFFlat requires a full rebuild when data changes.
  • Better recall at the same speed — HNSW consistently achieves 99% recall at 10ms query time with my dataset size (~50K chunks).
  • No training required — IVFFlat needs a clustering step that depends on representative data; HNSW builds its graph without a training pass.
sql
CREATE INDEX idx_rag_chunks_embedding ON rag_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

The parameters:

  • m = 16 — each node connects to 16 neighbors. Higher = better recall, slower build. 16 is the sweet spot for datasets under 100K vectors.
  • ef_construction = 200 — the dynamic list size during construction. Higher = better index quality, slower build. 200 is conservative.

At query time, the search uses SET hnsw.ef_search = 40 — this controls the search breadth. Higher = better recall, slower query.
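Because ef_search is a session-level setting, it has to be issued on the same connection that runs the similarity query. A rough sketch in the style of the search code below; the helper name is illustrative:

python
from sqlalchemy import text

async def vector_search(db, query_embedding, limit: int = 10):
    # ef_search only affects the current session, so set it right before the query
    await db.execute(text("SET hnsw.ef_search = 40"))
    return await db.execute(
        text("""
            SELECT id, content, 1 - (embedding <=> :query_emb) AS vector_score
            FROM rag_chunks
            ORDER BY embedding <=> :query_emb
            LIMIT :limit
        """),
        {"query_emb": query_embedding, "limit": limit},
    )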


Step 3: Hybrid Search with tsvector

Vector search alone misses exact keyword matches. "How do I install FastAPI?" lands close to the vector for "FastAPI installation guide", but a query that hinges on an exact phrase or identifier can still slip past pure vector search. Full-text search via tsvector catches what vector search misses.

The hybrid query combines both:

python
async def hybrid_search(query: str, limit: int = 10):
    query_embedding = await embed(query)

    vector_results = await db.execute(
        text("""
            SELECT id, content, post_id, chunk_index,
                   1 - (embedding <=> :query_emb) AS vector_score
            FROM rag_chunks
            ORDER BY embedding <=> :query_emb
            LIMIT :limit
        """),
        {"query_emb": query_embedding, "limit": limit * 2}
    )

    fts_results = await db.execute(
        text("""
            SELECT id, content, post_id, chunk_index,
                   ts_rank(to_tsvector('english', content),
                           plainto_tsquery('english', :query)) AS fts_score
            FROM rag_chunks
            WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :query)
            ORDER BY fts_score DESC
            LIMIT :limit
        """),
        {"query": query, "limit": limit * 2}
    )

    return hybrid_rank(vector_results, fts_results, alpha=0.7)

The alpha parameter controls the weight between vector and keyword scores. 0.7 means 70% vector, 30% keyword — biased toward semantic understanding while still catching exact matches.


Step 4: Hybrid Ranking

Results from both searches are combined using Reciprocal Rank Fusion (RRF):

python
def hybrid_rank(vector_results, fts_results, alpha=0.7, k=60):
    scores = {}

    for rank, row in enumerate(vector_results):
        scores[row.id] = scores.get(row.id, 0) + alpha * (1 / (k + rank + 1))

    for rank, row in enumerate(fts_results):
        scores[row.id] = scores.get(row.id, 0) + (1 - alpha) * (1 / (k + rank + 1))

    ranked = sorted(scores.items(), key=lambda x: -x[1])
    return [chunk_id for chunk_id, _ in ranked[:10]]

RRF is simple, fast, and doesn't require training a learned ranker. The constant k=60 prevents any single ranking from dominating.
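A quick toy run shows the effect: a chunk that places high in both lists ends up ranked alongside one that only wins the vector search, instead of either list dominating. The Row namedtuple below stands in for the real result rows:

python
from collections import namedtuple

Row = namedtuple("Row", ["id"])

# Toy rankings: "a" wins the vector search, "b" places high in both lists
vector_results = [Row("a"), Row("b"), Row("c")]
fts_results = [Row("b"), Row("d"), Row("a")]

print(hybrid_rank(vector_results, fts_results, alpha=0.7))
# ['a', 'b', 'c', 'd'], with "a" and "b" nearly tied at the top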


Step 5: Context Assembly

The top 5-10 chunks are assembled into a prompt. The system prompt is:

text
You are a technical assistant for Madhu Dadi — AI, Python & Analytics Hub.
Answer the user's question based ONLY on the provided context.
If the context doesn't contain enough information, say so.
Always cite the source post title and section for each claim.
Format citations as [Source: Post Title → Section].

The user prompt includes the question and the chunk content:

text
Context:
[1] Post: "Understanding Python Classes" → Section: "Class Methods"
Content: Class methods are functions defined inside a class...

[2] Post: "FastAPI Routes" → Section: "Path Parameters"
Content: Path parameters are declared using Python type hints...

Question: How do I define a class method in Python?
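A minimal sketch of the assembly step, assuming each retrieved chunk dict carries its post title, section heading, and content; the helper name is not from the actual codebase:

python
def build_prompt(question: str, chunks: list[dict]) -> str:
    # Number each chunk and prefix it with its post title and section,
    # matching the context format shown above
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        blocks.append(
            f'[{i}] Post: "{chunk["post_title"]}" → Section: "{chunk["section_heading"]}"\n'
            f'Content: {chunk["content"]}'
        )
    return "Context:\n" + "\n\n".join(blocks) + f"\n\nQuestion: {question}"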

Step 6: Source Verification

After the LLM generates a response, a verification step checks that each cited source actually exists in the provided chunks:

python
import re

def verify_citations(response: str, chunks: list[dict]) -> dict:
    citations_found = re.findall(r'\[Source: (.+?)\]', response)
    valid_citations = []
    missing_citations = []

    for citation in citations_found:
        matched = any(citation in chunk["source"] for chunk in chunks)
        if matched:
            valid_citations.append(citation)
        else:
            missing_citations.append(citation)

    return {
        "verified_response": response,
        "citations": valid_citations,
        "unverified_claims": missing_citations
    }

Unverified claims are flagged but not removed from the response — they're marked with a warning icon in the frontend. This happens rarely (less than 2% of queries) and usually when the LLM rephrases a source name.
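For example, a response that cites a post not present in the retrieved chunks comes back flagged in unverified_claims. The chunk dicts and post titles below are illustrative:

python
chunks = [
    {"source": "Understanding Python Classes → Class Methods"},
    {"source": "FastAPI Routes → Path Parameters"},
]
response = (
    "Class methods take cls as their first argument "
    "[Source: Understanding Python Classes → Class Methods]. "
    "Decorators wrap functions [Source: Python Decorators Deep Dive → Basics]."
)

result = verify_citations(response, chunks)
# result["citations"]         -> ["Understanding Python Classes → Class Methods"]
# result["unverified_claims"] -> ["Python Decorators Deep Dive → Basics"]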


Step 7: The Database Model

python
class RagChunk(Base):
    __tablename__ = "rag_chunks"

    id: Mapped[uuid.UUID] = mapped_column(UUID, primary_key=True, default=uuid.uuid4)
    post_id: Mapped[uuid.UUID] = mapped_column(ForeignKey("posts.id", ondelete="CASCADE"))
    chunk_index: Mapped[int]
    content: Mapped[str] = mapped_column(Text)
    embedding: Mapped[Optional[Vector]] = mapped_column(Vector(1536))
    # "metadata" is reserved on SQLAlchemy declarative classes, so the Python
    # attribute uses a different name and maps explicitly to the "metadata" column
    chunk_metadata: Mapped[Optional[dict]] = mapped_column("metadata", JSONB)
    created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), server_default=func.now())

The Vector type comes from pgvector.sqlalchemy. It maps directly to PostgreSQL's vector extension type.


Cold Start: First User Experience

When a user visits the Ask AI page for the first time, there are no chunks to search. The solution: a pre-computed set of seed questions and answers, one per published post, generated during the embedding pipeline.

python
SEED_QUESTIONS = {
    "why-i-built-yet-another-blog-but-not-really": [
        "Why did you build your own blog platform?",
        "What features does this blog have that others don't?"
    ],
    "the-monorepo-that-runs-29-services": [
        "How is the monorepo structured?",
        "What are the 29 API routers?"
    ]
}

These seed questions are embedded and stored alongside the post chunks. On the first page load, the frontend fetches 3-5 seed questions as suggestions. When the user clicks one, it triggers a RAG query, which populates the embedding cache. Subsequent queries hit the cache.
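On the backend, the suggestion step can be as simple as sampling from the SEED_QUESTIONS mapping shown above. A rough sketch; the endpoint path and sampling logic are assumptions, not the actual route:

python
import random
from fastapi import APIRouter

router = APIRouter()

@router.get("/ask-ai/suggestions")
async def seed_suggestions(limit: int = 5):
    # Flatten the per-post seed questions and return a small random sample
    # for the first page load, before the user has asked anything
    all_questions = [q for qs in SEED_QUESTIONS.values() for q in qs]
    return {"suggestions": random.sample(all_questions, k=min(limit, len(all_questions)))}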


What's Next

In the next post, I'll cover the production RAG pipeline — streaming responses via SSE, progressive rendering, citation badges, fallback strategies, rate limiting, and the cold-start UX flow in detail.


Built with FastAPI, pgvector, OpenAI embeddings, and zero third-party CMS.

Frequently Asked Questions

What is a RAG chat system?

A RAG chat system is a Retrieval-Augmented Generation system that answers questions using only the content from a specific site's posts, providing source citations for each part of the answer.

Why use RAG instead of fine-tuning?

RAG is preferred because it doesn't require retraining with each new post, avoids hallucinating facts not present in the training data, and allows for source citations in the responses.

What does the embedding model do?

The embedding model converts each post chunk into a 1536-dimensional vector, which is used to find the most relevant content chunks for answering a user's question.

Why split posts on paragraph boundaries?

Splitting on paragraph boundaries preserves semantic units like code blocks and lists, ensuring that examples and their explanations remain together for effective retrieval and generation.

Why use the HNSW index instead of IVFFlat?

The HNSW index builds incrementally and offers better recall at the same query speed compared to the IVFFlat index, making it a better fit for approximate nearest neighbor search here.
