Not a tutorial with toy data. This is what a production RAG system looks like when real customers depend on it — chunking strategies, retrieval pipelines, eval harnesses, and the numbers that actually moved.
A B2B SaaS client came to me with a support problem. Their tier-1 ticket queue was drowning — customers asking the same questions about configuration, billing, and integrations that were already answered somewhere in their docs, help centre, and internal knowledge base. The average response time was 4 hours. Some tickets sat for a full business day.
They'd tried a keyword-search chatbot before. It was useless. Customers would ask "how do I set up SSO for my team?" and get back a page about password resets. After two weeks of bad answers, nobody trusted it, and ticket volume actually went up because people now had a new thing to complain about.
I built the replacement. Within the first month, tier-1 ticket volume dropped by roughly 40%. Average response time went from 4 hours to under 8 seconds.
Here's what I actually built and why.
The Architecture
The system has four layers:
- Ingestion — documents chunked and embedded into PostgreSQL with pgvector
- Retrieval — history-aware query rewriting + semantic search + reranking
- Generation — LLM generates answers with source citations
- Serving — streaming FastAPI backend via SSE, consumed by the client's existing React frontend
No LangChain initially. I used it later for the retrieval chain, but the first version was raw Python because I needed to understand every decision the system was making.
Chunking: The Part Everyone Gets Wrong
Most RAG tutorials show you RecursiveCharacterTextSplitter with a chunk size of 1000 (which it measures in characters by default, not tokens) and call it a day. That works for blog posts. It does not work for enterprise documentation with tables, nested lists, configuration examples, and multi-step procedures that span 3 pages.
I used two chunking strategies depending on content type:
Semantic chunking for conceptual docs (guides, explanations, FAQs):
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
    # Split where the distance between adjacent sentences exceeds the 85th percentile.
    breakpoint_threshold_amount=85,
)
docs = chunker.create_documents([page.content for page in pages])
```

Semantic chunking splits on meaning boundaries rather than character count. A paragraph about SSO setup stays together even if it's 1,500 tokens, instead of getting sliced in half and losing context.
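To make "splits on meaning boundaries" concrete, here is a toy sketch of the percentile idea. It is not the library's implementation. It assumes you already have one distance per adjacent sentence pair (for example, cosine distance between their embeddings), and the names `percentile` and `split_on_breakpoints` are hypothetical.

```python
def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile over a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def split_on_breakpoints(sentences: list[str], distances: list[float], pct: float = 85) -> list[str]:
    """distances[i] is the distance between sentences[i] and sentences[i + 1]."""
    if not sentences:
        return []
    if len(distances) != len(sentences) - 1:
        raise ValueError("need exactly one distance per adjacent sentence pair")
    if not distances:
        return [sentences[0]]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            # A jump in distance marks a topic change, so close the current chunk.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["A1.", "A2.", "A3.", "B1.", "B2."]
distances = [0.10, 0.12, 0.90, 0.11]  # made-up values; the big jump is between A3 and B1
print(split_on_breakpoints(sentences, distances))  # ['A1. A2. A3.', 'B1. B2.']
```

Only the distances above the chosen percentile become breakpoints, so chunk length follows the content rather than a fixed budget.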
Sliding-window chunking for reference docs (API specs, config tables, changelogs):
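```python
import tiktoken

# Assumption: cl100k_base, the encoding used by text-embedding-3-small.
tokenizer = tiktoken.get_encoding("cl100k_base")

def sliding_window_chunks(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    tokens = tokenizer.encode(text)
    chunks = []
    step = size - overlap  # overlap must be smaller than size
    for i in range(0, len(tokens), step):
        chunks.append(tokenizer.decode(tokens[i : i + size]))
        if i + size >= len(tokens):
            break  # window reached the end; another chunk would only repeat the tail
    return chunks
```

Reference docs have dense, interrelated information. The overlap ensures that a config parameter mentioned at the end of one chunk also appears at the start of the next, so retrieval doesn't miss context that straddles a boundary.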
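The window arithmetic is easy to get subtly wrong, so here is a small sketch that computes only the start offsets, with no tokenizer involved. `window_starts` is a hypothetical helper and the 2,000-token document is an invented example.

```python
def window_starts(n_tokens: int, size: int = 800, overlap: int = 200) -> list[int]:
    """Start offsets of each window for a document of n_tokens tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    starts = []
    for start in range(0, n_tokens, size - overlap):
        starts.append(start)
        if start + size >= n_tokens:
            break  # this window already covers the end of the document
    return starts


print(window_starts(2000))  # [0, 600, 1200]
```

With these defaults the windows cover tokens 0-799, 600-1399, and 1200-1999, so each adjacent pair shares 200 tokens. Stopping once a window reaches the end avoids a final chunk that is entirely contained in the one before it.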
Every chunk gets metadata: source URL, document title, section heading, last-updated timestamp. This metadata is critical for citations and for freshness-aware retrieval.
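One way to keep that metadata consistent is a small typed record that serialises to JSON for a JSONB column. This is a sketch only: the `source` key matches what the eval code reads, while the other field names, the `format_citation` layout, and the example values are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ChunkMetadata:
    source: str        # source URL
    title: str         # document title
    section: str       # section heading
    last_updated: str  # ISO 8601 timestamp


def to_jsonb(meta: ChunkMetadata) -> str:
    """Serialise the metadata for storage in a JSONB column."""
    return json.dumps(asdict(meta))


def format_citation(meta: ChunkMetadata) -> str:
    """Render a human-readable citation (the layout is an arbitrary choice)."""
    return f"{meta.title}, {meta.section} ({meta.source})"


meta = ChunkMetadata(
    source="https://example.com/docs/sso-setup",  # placeholder URL
    title="SSO setup",
    section="SAML",
    last_updated="2024-01-15T00:00:00+00:00",     # placeholder date
)
print(format_citation(meta))
```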
Embeddings and Storage
I used OpenAI's text-embedding-3-small (1536 dimensions) and stored everything in PostgreSQL with pgvector. Not a dedicated vector database — just a pgvector extension on the same PostgreSQL instance the application already used.
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    embedding VECTOR(1536),
    metadata JSONB NOT NULL DEFAULT '{}',
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON doc_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
```

Why pgvector over Pinecone or Qdrant? Three reasons:
- Operational simplicity. One database to back up, monitor, and scale. The client's ops team already knew PostgreSQL.
- Joins. I could join chunk results with the application's user and subscription tables to personalise answers per customer tier.
- Good enough performance. With ~50K chunks and IVFFlat indexing, query latency was under 40ms. At this scale, a dedicated vector DB adds complexity without measurable benefit.
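For reference, a nearest-neighbour lookup against this table could look roughly like the sketch below. It relies on pgvector's `<=>` cosine-distance operator, which pairs with the `vector_cosine_ops` index. The helper names are hypothetical, and the named-parameter style assumes a psycopg-like driver.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as a pgvector text literal, e.g. "[0.1,0.2]"."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# <=> is pgvector's cosine-distance operator (smaller means more similar).
NEAREST_CHUNKS_SQL = """
SELECT id, content, metadata, embedding <=> %(query)s::vector AS cosine_distance
FROM doc_chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""


def nearest_chunks_query(embedding: list[float], k: int = 10) -> tuple[str, dict]:
    """Return the SQL and parameters; execution is left to the caller's driver."""
    return NEAREST_CHUNKS_SQL, {"query": to_vector_literal(embedding), "k": k}


sql, params = nearest_chunks_query([0.1, 0.2, 3], k=5)
print(params)  # {'query': '[0.1,0.2,3.0]', 'k': 5}
```

With a psycopg cursor this would be run as `cursor.execute(sql, params)`. Because IVFFlat is an approximate index, recall depends on how many lists are probed per query; pgvector exposes that as the `ivfflat.probes` setting.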
Retrieval: History-Aware Rewriting + Reranking
Naive RAG embeds the user's question, finds the nearest chunks, and passes them to the LLM. This breaks immediately in a real conversation:
- User: "How do I set up SSO?"
- Bot: explains SAML SSO setup
- User: "What about for Google Workspace specifically?"
That follow-up — "What about for Google Workspace specifically?" — has no useful embedding on its own. It needs the conversation history to make sense.
I used a history-aware retriever that rewrites the user's query to be self-contained before embedding:
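```python
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

contextualize_system_prompt = """Given the chat history and latest user question,
reformulate it as a standalone question that captures full context.
Do NOT answer the question; just reformulate it."""

# The helper expects a prompt template with an "input" variable, not a bare string.
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", contextualize_system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

retriever = create_history_aware_retriever(
    llm,
    vectorstore.as_retriever(search_kwargs={"k": 10}),
    contextualize_prompt,
)
```

"What about for Google Workspace specifically?" becomes "How do I set up SSO with Google Workspace as the identity provider?" before it hits the vector store. Night and day difference in retrieval quality.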
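Stripped of the framework, the control flow is small: with no history the question goes to the retriever unchanged, otherwise an LLM rewrites it first. The sketch below mirrors that branch with a pluggable `rewrite` callable. `standalone_query` and `fake_rewrite` are hypothetical stand-ins so the example runs without an API key.

```python
from typing import Callable

History = list[tuple[str, str]]  # (role, content) pairs


def standalone_query(question: str, chat_history: History,
                     rewrite: Callable[[str, History], str]) -> str:
    """Return the text that should be embedded for retrieval."""
    if not chat_history:
        # First turn: nothing to resolve, so skip the extra LLM call.
        return question
    return rewrite(question, chat_history).strip()


def fake_rewrite(question: str, chat_history: History) -> str:
    # Stand-in for the LLM call; a real rewriter would use the prompt above.
    return "How do I set up SSO with Google Workspace as the identity provider?"


history = [("human", "How do I set up SSO?"), ("ai", "Here is how SAML SSO works...")]
print(standalone_query("What about for Google Workspace specifically?", history, fake_rewrite))
print(standalone_query("How do I set up SSO?", [], fake_rewrite))
```

Skipping the rewrite on the first turn keeps single-question conversations from paying for an extra LLM call.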
After retrieval, I rerank the top 10 results down to 5 using a cross-encoder. The initial vector search casts a wide net; the reranker picks the genuinely relevant chunks:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Score every (query, chunk) pair jointly, then keep the top_k highest-scoring chunks.
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    # Sort on the score only, so ties never fall back to comparing chunk text.
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```

This two-stage retrieval (broad vector search → precise reranking) was the single biggest quality improvement. Without reranking, about 30% of retrieved chunks were topically adjacent but not actually answering the question. With reranking, that dropped to under 10%.
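Citations need the chunk metadata to survive reranking, and the scores are useful later as a confidence signal. A variant that keeps both might look like this. It is a sketch with a pluggable scoring function so it runs without downloading a model; the dict shape of a chunk and the name `rerank_with_scores` are assumptions.

```python
from typing import Callable, Sequence


def rerank_with_scores(
    query: str,
    chunks: Sequence[dict],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_k: int = 5,
) -> list[tuple[float, dict]]:
    """Return the top_k (score, chunk) pairs, highest score first."""
    pairs = [(query, chunk["content"]) for chunk in chunks]
    scores = score_pairs(pairs)
    ranked = sorted(zip(scores, chunks), key=lambda item: item[0], reverse=True)
    return [(float(score), chunk) for score, chunk in ranked[:top_k]]


def word_overlap_scores(pairs: list[tuple[str, str]]) -> list[float]:
    # Toy scorer for the demo only: counts words shared by query and chunk.
    return [float(len(set(q.lower().split()) & set(c.lower().split()))) for q, c in pairs]


chunks = [
    {"content": "reset your password", "source": "docs/passwords"},
    {"content": "set up sso with saml", "source": "docs/sso-setup"},
]
print(rerank_with_scores("how do i set up sso", chunks, word_overlap_scores, top_k=1))
# [(3.0, {'content': 'set up sso with saml', 'source': 'docs/sso-setup'})]
```

With the cross-encoder from this section, `reranker.predict` can be passed as `score_pairs`.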
The Eval Harness
This is the part that separated the chatbot from its keyword-search predecessor. The previous bot had no systematic way to measure quality. When it gave a wrong answer, nobody knew until a customer complained.
I built a 200-question golden set:
```json
# eval_dataset.jsonl (simplified)
{"question": "How do I enable SSO for my organisation?", "expected_sources": ["docs/sso-setup"], "expected_answer_contains": ["SAML", "identity provider", "metadata URL"]}
{"question": "What's the difference between Team and Enterprise plans?", "expected_sources": ["docs/pricing"], "expected_answer_contains": ["audit log", "SSO", "API rate limit"]}
```

Each entry has the question, expected source documents, and key phrases the answer should contain. The eval runs weekly via a GitHub Actions cron job:
```python
def evaluate(qa_pairs: list[dict]) -> dict:
    results = {"retrieval_hits": 0, "answer_quality": 0, "total": len(qa_pairs)}
    for pair in qa_pairs:
        # Each golden question is standalone, so the chat history is empty.
        inputs = {"input": pair["question"], "chat_history": []}
        retrieved = retriever.invoke(inputs)
        retrieved_sources = [doc.metadata["source"] for doc in retrieved]
        # Retrieval hit: did we find the right source document?
        if any(src in retrieved_sources for src in pair["expected_sources"]):
            results["retrieval_hits"] += 1
        # Answer quality: does the generated answer contain expected key phrases?
        # The chain returns a dict; the generated text is under "answer".
        answer = chain.invoke(inputs)["answer"]
        if all(phrase.lower() in answer.lower() for phrase in pair["expected_answer_contains"]):
            results["answer_quality"] += 1
    return results
```

In the first week, retrieval accuracy was 78% and answer quality was 64%. After tuning the chunking strategy and adding reranking, those numbers climbed to 91% retrieval and 84% answer quality. The weekly regression run catches drift: when the client updates their docs and the re-chunked content degrades retrieval for existing questions, I know within 7 days instead of waiting for user complaints.
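The harness returns raw counts, so a scheduled job needs one more step: turn them into rates and compare against the last accepted baseline. A minimal sketch follows. `summarize`, `regressions`, and the 2-point tolerance are assumptions, and the counts are the ones implied by 91% and 84% of 200 questions.

```python
def summarize(results: dict) -> dict:
    """Convert the raw counts from evaluate() into rates between 0 and 1."""
    total = results["total"]
    if total == 0:
        raise ValueError("empty eval set")
    return {
        "retrieval_rate": results["retrieval_hits"] / total,
        "answer_rate": results["answer_quality"] / total,
    }


def regressions(current: dict, baseline: dict, tolerance: float = 0.02) -> list[str]:
    """Names of metrics that fell more than `tolerance` below the baseline."""
    return [name for name, value in current.items() if value < baseline[name] - tolerance]


baseline = summarize({"retrieval_hits": 182, "answer_quality": 168, "total": 200})
current = summarize({"retrieval_hits": 172, "answer_quality": 168, "total": 200})
print(regressions(current, baseline))  # ['retrieval_rate']
```

In CI, a non-empty list would translate into a non-zero exit code so the scheduled run fails visibly.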
Streaming via FastAPI + SSE
The chatbot streams responses token-by-token. Nobody wants to wait 8 seconds staring at a blank screen while the LLM generates a full response. Streaming shows the first token in under 500ms.
```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    # Assumed request shape, inferred from the fields the handler reads.
    message: str
    history: list = []

@app.post("/chat")
async def chat(request: ChatRequest):
    async def generate():
        # Emit each answer token as an SSE "data:" event as soon as the chain yields it.
        async for chunk in chain.astream({
            "input": request.message,
            "chat_history": request.history,
        }):
            if "answer" in chunk:
                yield f"data: {json.dumps({'token': chunk['answer']})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
```

The client's React frontend consumed this via the Vercel AI SDK's useChat hook, which handles SSE parsing, token accumulation, and loading states out of the box. I used this pattern across several client projects: the AI SDK's provider abstraction meant I could swap between OpenAI and Anthropic Claude on the backend without touching the frontend.
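For smoke-testing the endpoint without a browser, the wire format is simple enough to parse by hand. The sketch below assumes exactly the event shape emitted above (a `data:` line carrying `{"token": ...}`, terminated by `data: [DONE]`); `iter_tokens` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield tokens from SSE lines until the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # blank separator lines and other SSE fields are ignored
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)["token"]


stream = [
    'data: {"token": "Hello"}', "",
    'data: {"token": " world"}', "",
    "data: [DONE]", "",
]
print("".join(iter_tokens(stream)))  # Hello world
```

Any HTTP client that exposes the response body line by line can feed this generator.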
What Actually Moved the Numbers
After the first month in production:
| Metric | Before | After |
|---|---|---|
| Tier-1 ticket volume | ~850/month | ~510/month (−40%) |
| Avg response time | 4 hours | 8 seconds |
| Customer satisfaction (CSAT) | 3.2/5 | 4.1/5 |
| Escalation rate (bot → human) | N/A | 22% |
The 22% escalation rate was intentional. The bot is configured to hand off to a human when confidence is low or when the question involves account-specific data it can't access. Trying to answer everything is how the previous bot destroyed trust.
The 40% ticket reduction was specifically in tier-1 (how-do-I, where-is, what-does-this-mean). Tier-2 tickets (bugs, outages, billing disputes) were unaffected, which is exactly what you'd expect — the bot handles knowledge retrieval, not judgment calls.
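The handoff rule described above can be expressed as a small, testable function. This is only a sketch of the idea: the post does not say how confidence is measured, so using the top reranker score, the `min_score` threshold, and the phrase list for account-specific questions are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical phrases that signal a question about the customer's own account data.
ACCOUNT_TERMS = ("my invoice", "my account", "my subscription")


@dataclass
class Decision:
    escalate: bool
    reason: str


def decide(question: str, top_score: Optional[float], min_score: float = 0.0) -> Decision:
    """Escalate on account-specific questions or weak retrieval; otherwise answer."""
    lowered = question.lower()
    if any(term in lowered for term in ACCOUNT_TERMS):
        return Decision(True, "account-specific")
    if top_score is None or top_score < min_score:
        # Nothing retrieved, or the best chunk scored below the threshold.
        return Decision(True, "low-confidence")
    return Decision(False, "answer")


print(decide("Why was my invoice higher this month?", top_score=5.0))
print(decide("How do I set up SSO?", top_score=-3.2))
print(decide("How do I set up SSO?", top_score=4.1))
```

A threshold like this would need tuning against the golden set, since cross-encoder scores are not calibrated probabilities.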
Lessons
Chunking matters more than the LLM. I spent 60% of the project time on chunking and retrieval, 20% on the eval harness, and 20% on everything else (LLM prompts, streaming, UI). The model is the easiest part to swap out. Bad chunks with GPT-4 produce worse results than good chunks with GPT-4o-mini.
Eval before optimising. Without the 200-question golden set, every change was a guess. With it, I could measure the impact of each decision — semantic vs. fixed-size chunking, with vs. without reranking, different embedding models — and pick the one that actually improved retrieval.
pgvector is enough for most use cases. Unless you're at millions of documents with sub-10ms latency requirements, a dedicated vector database adds operational complexity without measurable benefit. PostgreSQL with pgvector handled ~50K chunks at under 40ms query latency without breaking a sweat.
Stream everything. The perceived performance difference between "wait 8 seconds then see the full answer" and "see the first word in 500ms" is enormous. Users don't mind waiting if they can see progress.
Build the escalation path first. The bot's "I'm not sure — let me connect you with a human" response is more important than any correct answer. Trust is built by knowing your limits, not by faking confidence.
If you're building a RAG system for production, the bar isn't "can it answer questions?" — it's "will customers trust it more than filing a ticket?" That trust comes from retrieval quality, honest escalation, and the eval harness that keeps both in check.