Volume 2 — RAG & Knowledge Retrieval
A working reference for the transition from Senior iOS Developer to AI Engineer
How to use this volume
Same format as Volume 1: every chapter has a real explanation, a diagram, working code, two interview Q&As, and an exercise. This volume is the one that maps most directly onto SmartStore AI — by the end, you'll have built (in Chapter 13) a minimal but real version of the exact pipeline your app needs.
Contents
1. What Is RAG, and What Problem Does It Actually Solve
2. Why Enterprises Use RAG
3. The RAG Pipeline End-to-End
4. Chunking Strategies
5. Embeddings for Retrieval (Practical)
6. Vector Databases and How They Search
7. Traditional Search vs. Vector Search vs. Hybrid Search
8. Metadata Filtering
9. Reranking
10. Grounded Answers and Citations
11. RAG vs. Fine-Tuning
12. Enterprise RAG: Permissions and Multi-Tenancy
13. Hands-On: Building a Minimal Product-Location RAG Pipeline
Appendix A: Glossary
Appendix B: Chapter Summary Table
Chapter 1 — What Is RAG, and What Problem Does It Actually Solve
Recall from Volume 1 (Chapters 2 and 11): an LLM generates plausible continuations of text — it has no live connection to your data, and it hallucinates when asked things outside what it actually "knows" from training.
Retrieval-Augmented Generation (RAG) solves this by inserting a retrieval step before generation: instead of asking the model to answer from memory, you first search your own data for the relevant pieces, and hand those to the model as context, instructing it to answer only from what you gave it.
Without RAG:                        With RAG:

User question                       User question
      │                                   │
      ▼                                   ▼
LLM ──→ Answer (from memory,        Search your own data
        may be wrong/stale/               │
        made up)                          ▼
                                    Relevant chunks retrieved
                                          │
                                          ▼
                                    LLM (answers using retrieved
                                    chunks as grounding)
                                          │
                                          ▼
                                    Answer (grounded, current,
                                    checkable)
This single architectural change solves three separate problems at once:
1. Staleness: the model's training data has a cutoff; your product catalog, store layout, or HR policy updates daily. RAG lets the model answer using current data without retraining anything.
2. Hallucination: answering from retrieved real text is far more reliable than answering from parametric memory, especially when combined with "answer only from the provided context" instructions (Chapter 10).
3. Privacy/scope: your proprietary data (product catalog, internal docs) never has to be baked into a model's weights. It stays in your own database, retrieved only when relevant.
For SmartStore AI specifically: "where's the olive oil" is a textbook RAG question. The model doesn't know your store's layout — it was never trained on it. It needs to be handed the actual aisle data for that specific store, every time, which is exactly what the retrieval step does.
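The retrieve-then-generate flow in the diagram can be sketched without any external services. This is a toy illustration under stated assumptions, not the pipeline built in Chapter 13: the catalog rows, the word-overlap scoring, and the helper names `words`, `retrieve`, and `build_prompt` are all made up for the example, and a real system would rank by embedding similarity rather than shared words.

```python
# Toy catalog: hypothetical rows standing in for real store data.
CATALOG = [
    "Product: Extra Virgin Olive Oil 500ml. Aisle: 7. Store: store_123.",
    "Product: Whole Milk 1L. Aisle: 2. Store: store_123.",
    "Product: Sourdough Bread. Aisle: 4. Store: store_123.",
]


def words(text: str) -> set[str]:
    """Lowercase a string and split it into a set of bare words."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def retrieve(question: str, catalog: list[str], top_k: int = 1) -> list[str]:
    """Rank catalog rows by how many words they share with the question."""
    q = words(question)
    ranked = sorted(catalog, key=lambda row: len(q & words(row)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Place the retrieved rows in front of the question as context."""
    context = "\n".join(chunks)
    return f"Answer only from this context:\n{context}\n\nQuestion: {question}"


question = "Where's the olive oil?"
chunks = retrieve(question, CATALOG)
prompt = build_prompt(question, chunks)
print(prompt)
```

The point of the sketch is the shape: the model only ever sees the one or two rows that retrieval selected, never the whole catalog.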
Interview Q&A
Q: Could you solve "where's the olive oil" by just putting your entire product catalog in the system prompt instead of using RAG? A: For a tiny catalog, maybe — but it doesn't scale. A real store catalog is thousands of products across multiple locations; stuffing all of it into every request wastes tokens (cost and latency), and you'd hit context window limits fast. Retrieval narrows it down to only the few relevant items per query.
Q: What's the actual mechanism that makes a RAG answer more trustworthy than a non-RAG answer? A: It's not magic — it's that the model is reasoning over text you handed it directly in the prompt (which it can directly "see" and reference) rather than trying to recall something from its training weights. Combined with an instruction to stick to the provided context, this dramatically narrows the room for fabrication, though it doesn't eliminate it entirely (Chapter 10 covers the limits).
Exercise: Name two SmartStore AI questions that are a good fit for RAG (need current/proprietary data) and two that aren't (general knowledge a model would already know, or something requiring a live computation rather than a document lookup).
Chapter 2 — Why Enterprises Use RAG
Outside of toy demos, RAG is the default architecture for serious enterprise AI because company knowledge is large, constantly changing, and often sensitive — exactly the conditions where fine-tuning (Chapter 11) is impractical.
Common categories where this shows up:
- Internal knowledge assistants: HR policies, IT runbooks, onboarding docs (this is your own "Internal Knowledge Assistant" alternative use case).
- Customer support: answering from product manuals, support tickets, return policies.
- Document Q&A: contracts, compliance documents, technical specs.
- Retail/product search: exactly SmartStore AI's core use case, covering product locations, descriptions, availability.
Knowledge source          Update frequency        Why RAG fits
──────────────────────────────────────────────────────────────────────────────────────
Store product catalog     Daily/hourly            Stale fast; RAG keeps answers current
HR policy docs            Monthly                 Changes break a fine-tuned model's "facts"
Support tickets archive   Continuous              Far too large to fit in any context window
Legal/compliance docs     Rare, but high-stakes   Needs traceable citations, not memorized facts
The pattern across all of these: the knowledge is too large to fit in a prompt, changes faster than you'd want to retrain a model, and the business needs traceability — "where did this answer come from" — which RAG gives you for free (you know exactly which chunk was retrieved) and a purely fine-tuned model does not.
Interview Q&A
Q: A company says "we want to fine-tune a model on our entire support ticket history instead of using RAG." What would you push back on? A: Fine-tuning bakes patterns into model weights at a point in time — it doesn't give you a way to trace which specific ticket informed an answer, it gets stale as new tickets come in (requiring expensive retraining to stay current), and it's a much heavier, slower iteration loop than updating a vector index. RAG handles continuously changing, traceable, large-volume knowledge far more naturally.
Q: Why does "traceability" matter so much in enterprise contexts specifically? A: Compliance, auditing, and trust — if a customer-facing or HR-facing answer turns out to be wrong, the business needs to know exactly which source document led to it, in order to fix the source, explain the error, or meet regulatory requirements. RAG retrieval inherently gives you that paper trail; a fine-tuned model's internal weights do not.
Exercise: From your own notes, you listed an "Internal Knowledge Assistant" and "Gift Card Wallet AI" as alternative use cases alongside SmartStore AI. For the Internal Knowledge Assistant idea specifically, name one type of document you'd want to retrieve from and one traceability requirement that document type would realistically need.
Chapter 3 — The RAG Pipeline End-to-End
RAG has two distinct phases that happen at completely different times: ingestion (done once, or whenever source data changes) and query time (done on every user request). Conflating these two is the most common source of confusion for people new to RAG.
INGESTION (offline, runs when data changes)
───────────────────────────────────────────
Raw documents (PDFs, DB rows, product catalog)
│
▼
Chunking (Chapter 4)
│
▼
Embedding each chunk (Chapter 5)
│
▼
Store vectors + metadata in Vector DB (Chapter 6)
QUERY TIME (online, runs on every user request)
───────────────────────────────────────────────
User question
│
▼
Embed the question (same embedding model as ingestion)
│
▼
Vector search → top-K most similar chunks (Chapter 6/7)
│
▼
(Optional) Rerank for precision (Chapter 9)
│
▼
Insert retrieved chunks into the LLM prompt as context
│
▼
LLM generates a grounded answer (Chapter 10)
The critical detail engineers miss: the same embedding model must be used for both ingestion and query-time embedding. If you embed your product catalog with one model and later embed user queries with a different model, the vectors live in different mathematical spaces and similarity comparisons become meaningless.
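To make the "different mathematical spaces" point concrete, here is a small sketch using two fake embedding models. Everything in it is an assumption for illustration: `toy_embed` is not a real embedding model, it just maps each (model name, word) pair to a fixed pseudo-random vector via a hash, which is enough to show that vectors from different models have no defined relationship even for identical text.

```python
import hashlib
import math


def toy_embed(text: str, model_name: str) -> list[float]:
    """Fake embedding: sum of per-word vectors derived from a hash of (model, word)."""
    vec = [0.0] * 32
    for word in text.lower().split():
        digest = hashlib.sha256(f"{model_name}:{word}".encode()).digest()
        for i, byte in enumerate(digest):  # sha256 yields 32 bytes, one per dimension
            vec[i] += (byte - 127.5) / 127.5
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


text = "extra virgin olive oil"
same_model = cosine(toy_embed(text, "model-a"), toy_embed(text, "model-a"))
cross_model = cosine(toy_embed(text, "model-a"), toy_embed(text, "model-b"))

print(f"same model:  {same_model:.3f}")   # identical text, identical space
print(f"cross model: {cross_model:.3f}")  # identical text, unrelated spaces
```

With the same model, identical text scores a similarity of 1. Across the two fake models, the score for the very same text is essentially arbitrary, which is the silent failure the paragraph above describes.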
For SmartStore AI's architecture specifically: ingestion is your product catalog being chunked/embedded into Qdrant (probably triggered whenever inventory/location data updates); query time is the FastAPI backend embedding the user's spoken or typed question and searching Qdrant for matching products before calling the LLM.
Interview Q&A
Q: Why is mixing embedding models between ingestion and query time such a serious bug, rather than just a minor quality issue? A: Embedding spaces are specific to the model that produced them — two different models can place semantically identical text at completely different coordinates. Comparing a query embedded with Model A against documents embedded with Model B isn't "slightly worse" similarity search, it's comparing numbers that have no defined relationship to each other — retrieval quality can collapse entirely, often silently (it still returns something, just not meaningfully relevant).
Q: At what point in this pipeline would caching meaningfully reduce cost/latency for a high-traffic app? A: At the embedding step for repeated or very similar queries (cache query embeddings keyed by normalized query text), and at the LLM generation step for genuinely repeated questions with the same retrieved context (e.g., Redis caching the final answer for common queries like "where's the milk" at a given store) — both of which map directly onto the Redis layer already in SmartStore AI's architecture.
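A minimal sketch of the first caching idea (query embeddings keyed by normalized query text) might look like the following. The in-memory dictionary stands in for Redis, and `EmbeddingCache`, `normalize_query`, and `fake_embed` are hypothetical names introduced only for this example.

```python
from typing import Callable


def normalize_query(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


class EmbeddingCache:
    """In-memory stand-in for a Redis-style cache of query embeddings."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # counts how often the real embedder was called

    def get(self, query: str) -> list[float]:
        key = normalize_query(query)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(key)
        return self._store[key]


def fake_embed(text: str) -> list[float]:
    """Placeholder embedder so the example runs without an API call."""
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
cache.get("Where's the milk?")
cache.get("  where's   the MILK? ")  # same key after normalization, served from cache
print(cache.misses)
```

In a real deployment the cache would also need an expiry policy and a key that includes the embedding model name, so that switching models does not serve stale vectors.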
Exercise: Draw (in plain text, like the diagram above) what triggers ingestion to re-run for SmartStore AI specifically — what events should cause a product's chunk/embedding to be regenerated?
Chapter 4 — Chunking Strategies
You can't embed an entire 500-page PDF (or your entire product catalog) as one vector — it would be far too coarse; a single vector can't represent "everything in this document" usefully. Chunking splits source data into smaller pieces, each gets its own embedding, and retrieval finds the most relevant pieces, not whole documents.
Common strategies:
1. Fixed-size chunking
[-------- 500 tokens --------][-------- 500 tokens --------]
Simple, fast, but can cut a sentence or idea awkwardly in half.
2. Fixed-size with overlap
[-------- 500 tokens --------]
                      [-------- 500 tokens --------]
Overlap (e.g. 50-100 tokens) reduces the "cut off mid-idea" problem
at the cost of some duplicate storage.
3. Semantic/structural chunking
[Paragraph 1][Paragraph 2][Section heading + content]...
Splits at natural boundaries (paragraphs, headings, list items)
instead of a fixed token count — usually higher quality, more
implementation effort.
4. Recursive chunking
Try splitting by section → if still too big, split by paragraph →
if still too big, split by sentence. Falls back progressively.
For structured data like a product catalog (which is SmartStore AI's actual case, not a PDF), chunking looks different: each product (or each product-location pair) is naturally already a sensible "chunk" — you're not splitting a wall of prose, you're deciding how much structured info belongs in one retrievable unit (just the product name + aisle? or name + aisle + description + price?).
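# Simple fixed-size chunking with overlap, for unstructured text (e.g. a policy PDF).
# Sizes are counted in whitespace-separated words, a rough stand-in for tokens.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # the last chunk already reaches the end of the text
        start += chunk_size - overlap  # step forward, keeping `overlap` words of context
    return chunks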
# Structured "chunking" for a product catalog: each row becomes one retrievable unit
def product_to_chunk(product: dict) -> str:
    return (
        f"Product: {product['name']}. "
        f"Category: {product['category']}. "
        f"Aisle: {product['aisle']}. "
        f"Store: {product['store_id']}."
    )
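Recursive chunking (strategy 4 in the list above) can be sketched in a similar style. This is a simplified illustration under stated assumptions: the separator order, the word-based size limit, and the function name `recursive_chunk` are choices made for the example. A production splitter would typically also merge adjacent small pieces back together and preserve the punctuation that `split` discards.

```python
def recursive_chunk(
    text: str,
    max_words: int = 100,
    separators: tuple[str, ...] = ("\n\n", "\n", ". "),
) -> list[str]:
    """Split on the coarsest separator first; use finer ones only for pieces still too long."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words:
        return [text]  # small enough, keep as one chunk
    if not separators:
        # No separators left: fall back to a hard split by word count.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    head, rest = separators[0], separators[1:]
    chunks: list[str] = []
    for piece in text.split(head):
        chunks.extend(recursive_chunk(piece, max_words, rest))
    return chunks


doc = "Returns policy.\n\n" + " ".join(["word"] * 12) + "\n\nShort closing note."
chunks = recursive_chunk(doc, max_words=5)
for c in chunks:
    print(repr(c))
```

Short paragraphs survive intact, and only the oversized middle paragraph is broken down further.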
Interview Q&A
Q: Why does chunk size matter for retrieval quality, not just storage cost? A: Too large, and a chunk's embedding becomes a blurry average of multiple ideas, so a query about one specific detail may not score highly against it even though the detail is technically present in the chunk. Too small, and you lose surrounding context the model would need to actually use the chunk correctly (the "lost in the middle" issue from Volume 1, Chapter 12 can also bite here if you retrieve many tiny fragmented chunks).
Q: For SmartStore AI, would you chunk at the level of "one product" or "one aisle containing many products"? Justify it. A: One product per chunk is the better default — a user query is almost always about a specific product, so retrieval needs to match at that granularity. Aisle-level chunks would force the model to scan through many unrelated products to find the one being asked about, reintroducing the "needle in a haystack" problem retrieval is meant to solve.
Exercise: Your product catalog includes a long free-text "description" field for some products (a paragraph) alongside short structured fields (name, aisle, price). Design a chunking approach that handles both without losing the structured fields' precision.
Chapter 5 — Embeddings for Retrieval (Practical)
Volume 1, Chapter 4 covered what embeddings are. This chapter covers the practical decisions you actually make when building retrieval.
Choosing an embedding model. You're choosing between API-hosted embedding models (e.g., OpenAI's embedding models) and self-hosted open-weight options. Key practical factors: embedding dimensionality (affects storage size and search speed — more dimensions isn't automatically better), whether the model was trained on data similar to yours (general web text vs. domain-specific), and cost at your expected ingestion + query volume. Check current provider docs for exact dimension counts and pricing, since these change.
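The storage side of the dimensionality tradeoff is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats (4 bytes per value) and counts only the raw vectors, not index overhead or payload; the function name and the 100,000-product catalog size are illustrative assumptions.

```python
def index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone, assuming float32 values by default."""
    return num_vectors * dimensions * bytes_per_value


small = index_size_bytes(100_000, 1536)  # 614_400_000 bytes, roughly 614 MB
large = index_size_bytes(100_000, 3072)  # doubling the dimensions doubles the storage

print(small, large)
```

Doubling the dimension count doubles storage and, roughly, the work per similarity comparison, which is why a larger embedding needs to earn its place through measured retrieval quality.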
Embedding documents vs. embedding queries. The same model embeds both sides, but some models offer separate "embed this as a document" and "embed this as a search query" modes, each tuned slightly differently for its side of the comparison. Using the wrong mode (or the wrong model entirely, per Chapter 3) silently degrades retrieval quality without throwing any error.
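How the document/query distinction is expressed depends on the model. Some open models (the E5 family is one example) expect a text prefix such as "query: " or "passage: ", while some hosted APIs take a parameter instead. The helper below is a hypothetical sketch of the prefix style only; `prepare_for_embedding` is not part of any library, and you should check your model's documentation for the convention it actually expects.

```python
from typing import Literal

# Prefix convention used by some open embedding models; treat as an assumption
# and confirm against the documentation of the model you choose.
PREFIXES = {"query": "query: ", "document": "passage: "}


def prepare_for_embedding(text: str, role: Literal["query", "document"]) -> str:
    """Attach the role-specific prefix before the text is sent to the embedder."""
    if role not in PREFIXES:
        raise ValueError(f"unknown role: {role!r}")
    return PREFIXES[role] + text.strip()


print(prepare_for_embedding(" Where's the olive oil? ", "query"))
print(prepare_for_embedding("Product: Canola Oil 1L. Aisle: 7.", "document"))
```

Routing every text through one helper like this makes it harder to embed a query in document mode by accident.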
# Embedding a batch of product chunks at ingestion time (OpenAI example)
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def embed_texts(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small",  # check current docs for the latest model name
        input=texts,
    )
    return [item.embedding for item in response.data]

product_chunks = [
    "Product: Extra Virgin Olive Oil 500ml. Category: Cooking Oils. Aisle: 7. Store: store_123.",
    "Product: Canola Oil 1L. Category: Cooking Oils. Aisle: 7. Store: store_123.",
]
vectors = embed_texts(product_chunks)
Interview Q&A
Q: Your retrieval quality is poor even though chunking looks reasonable. What embedding-related cause would you check first? A: Whether queries and documents were embedded with the same model and the same mode (document vs. query embedding, if the model distinguishes them) — a mismatch here silently produces meaningless similarity scores without any error message, making it the most common "invisible" root cause.
Q: Does a higher-dimensional embedding always mean better retrieval quality? A: No — higher dimensionality can capture more nuance but also increases storage and search cost, and beyond a certain point yields diminishing returns for a given domain; the right choice depends on benchmarking against your actual data and query patterns, not just picking the largest available model.
Exercise: Write the query-time equivalent of the embed_texts function above — a function that takes a single user question string and returns its embedding, ready to be compared against the stored product vectors.
Chapter 6 — Vector Databases and How They Search
A vector database stores embeddings (plus their original text and metadata) and is purpose-built to answer one question extremely fast, even across millions of vectors: "which stored vectors are most similar to this query vector?"
A naive approach — compare the query against every single stored vector one by one (a full linear scan) — works fine at small scale but becomes too slow as data grows. Vector databases instead use Approximate Nearest Neighbor (ANN) indexing algorithms, most commonly a structure called HNSW (Hierarchical Navigable Small World graphs), which trades a tiny amount of accuracy for massive speed gains by organizing vectors into a navigable graph structure instead of scanning everything.
Linear scan (small scale, fine):      ANN index (large scale, needed):

Query ──compare──▶ vector 1           Query ──▶ enter graph at a coarse level
      ──compare──▶ vector 2                 ──▶ navigate down through layers,
      ──compare──▶ vector 3                     only visiting a small fraction
      ... (every vector)                        of all stored vectors
      ──compare──▶ vector N                 ──▶ return approximate top-K nearest
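The linear-scan side of the diagram is short enough to write out in full. This sketch uses cosine similarity and tiny 2-dimensional vectors that are invented for the example; real embeddings have hundreds or thousands of dimensions, but the logic is the same.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def linear_scan(
    query_vector: list[float], stored: dict[int, list[float]], top_k: int = 2
) -> list[tuple[int, float]]:
    """Compare the query against every stored vector and keep the best top_k."""
    scored = [(pid, cosine_similarity(query_vector, vec)) for pid, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


stored = {
    1: [1.0, 0.0],
    2: [0.8, 0.6],
    3: [0.0, 1.0],
}
results = linear_scan([1.0, 0.1], stored, top_k=2)
print([pid for pid, _ in results])
```

The cost grows linearly with the number of stored vectors, which is the scaling problem ANN indexes are designed to avoid.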
You're already using Qdrant for SmartStore AI, so here's what that looks like concretely:
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

# Note: this `client` is the Qdrant client. If this code shares a module with
# Chapter 5's OpenAI `client`, give one of them a different name.
client = QdrantClient(url="http://localhost:6333")

# One-time setup: create a collection with the right vector size and distance metric.
# size must equal the embedding model's output dimension.
client.create_collection(
    collection_name="products",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Ingestion: store a product's vector + its original text + metadata
client.upsert(
    collection_name="products",
    points=[
        PointStruct(
            id=1,
            vector=vectors[0],  # from Chapter 5's embed_texts
            payload={
                "text": product_chunks[0],
                "store_id": "store_123",
                "aisle": 7,
                "category": "Cooking Oils",
            },
        )
    ],
)

# Query time: search for the closest matches to a query vector.
# query_vector is the user's question embedded with the same model as the catalog.
# Recent qdrant-client releases deprecate search() in favor of query_points();
# check the version you have installed.
results = client.search(
    collection_name="products",
    query_vector=query_vector,
    limit=5,
)
for r in results:
    print(r.score, r.payload["text"])
Note the payload field — this is exactly how metadata filtering (Chapter 8) becomes possible: you're not just storing a vector, you're storing structured fields alongside it that you can filter on directly inside the same search call.
Interview Q&A
Q: Why not just use a regular SQL database with a brute-force similarity calculation in application code instead of a dedicated vector database? A: It would work at very small scale, but vector databases exist specifically to make similarity search fast at scale via ANN indexing (HNSW or similar), and they natively support combining vector similarity with metadata filtering in a single efficient query — reimplementing that well in application code on top of a general-purpose SQL database is a significant undertaking for no real benefit once you're past toy-scale data.
Q: What's the accuracy tradeoff being made by using an approximate (vs. exact) nearest-neighbor search? A: ANN indexes like HNSW occasionally miss the true single best match in exchange for being dramatically faster at scale — in practice this tradeoff is tunable (most vector DBs expose parameters trading recall for speed) and the accuracy loss is negligible for the vast majority of real applications, especially since you're typically retrieving the top-K (e.g., 5) candidates, not relying on exactly one perfect match.
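One way to put a number on that tradeoff is recall@K: run the same query through an exact search and through the ANN index, then measure how many of the true top-K came back. The sketch below assumes you already have both ID lists; the function name and the sample IDs are invented for illustration.

```python
def recall_at_k(exact_ids: list[int], approx_ids: list[int]) -> float:
    """Fraction of the true top-K neighbors that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


exact = [12, 7, 31, 4, 19]    # hypothetical ground truth from a brute-force scan
approx = [12, 7, 4, 19, 88]   # hypothetical result from an ANN index
recall = recall_at_k(exact, approx)
print(recall)  # 0.8: four of the five true neighbors were found
```

Averaging this over a sample of real queries gives a concrete basis for tuning the index's speed/recall parameters.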
Exercise: Using the Qdrant example above as a template, write the upsert call for the second product chunk (product_chunks[1], canola oil) with id=2 and the appropriate payload fields.
Chapter 7 — Traditional Search vs. Vector Search vs. Hybrid Search
Traditional/keyword search (often built on an algorithm called BM25) matches based on exact or near-exact word overlap, weighted by how rare/distinctive those words are across the whole document collection. It's extremely good at exact terms — product SKUs, brand names, error codes — and extremely bad at paraphrases.
Vector search (Chapters 4-6) matches based on meaning, regardless of exact wording. It's the reverse: great at paraphrases and conceptual matches, weaker on exact rare tokens (an unusual SKU like "SKU-88421-X" may not embed distinctively at all — to an embedding model, it can look like generic noise).
Query: "olive oil"

Keyword (BM25) search:              Vector search:
Matches: products containing        Matches: products semantically
the literal words "olive"           close in meaning; could surface
and/or "oil"                        "extra virgin cooking oil" even
                                    without the literal word "olive"
Misses: "extra virgin cooking       Misses (sometimes): exact SKU
oil" if neither literal word        codes, brand abbreviations, or
appears                             very rare exact terms
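For the keyword column of this comparison, a compact BM25 scorer shows how "word overlap weighted by rarity" works in practice. This is a teaching sketch, not a production search engine: tokenization is a plain lowercase split, the `k1` and `b` values are commonly used defaults rather than tuned ones, and the four product names are invented.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    df: Counter = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no literal match, no contribution
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency is dampened and normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "extra virgin olive oil 500ml",
    "canola oil 1l",
    "extra virgin cooking oil",
    "whole milk 1l",
]
scores = bm25_scores("olive oil", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The rare word "olive" contributes far more than the common word "oil", and a document sharing no literal word with the query scores exactly zero, which is the paraphrase blind spot described above.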
Hybrid search runs both in parallel and combines the results (commonly via a fusion technique like Reciprocal Rank Fusion), giving you the precision of keyword matching for exact terms and the recall of vector matching for paraphrases and conceptual queries. In practice, many production RAG systems use hybrid search rather than vector search alone; pure vector search is the simplified version taught in tutorials, not what most mature systems actually run.
# Conceptual sketch of hybrid search combining Qdrant vector results
# with a keyword/BM25-style search, then merging by rank.
# keyword_search is a placeholder you would implement (e.g. Postgres full-text search);
# it is assumed to return objects with an .id that matches the Qdrant point IDs.
RRF_K = 60  # smoothing constant commonly used with Reciprocal Rank Fusion

def hybrid_search(query: str, query_vector: list[float], limit: int = 5):
    vector_results = client.search(
        collection_name="products", query_vector=query_vector, limit=limit * 2
    )
    keyword_results = keyword_search(query, limit=limit * 2)

    # Reciprocal Rank Fusion: each list contributes 1 / (RRF_K + rank), rank starting at 1
    scores = {}
    for results in (vector_results, keyword_results):
        for rank, r in enumerate(results, start=1):
            scores[r.id] = scores.get(r.id, 0) + 1 / (RRF_K + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:limit]
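The fusion step can be tried in isolation, without Qdrant or a keyword index. The standalone sketch below uses the usual form of Reciprocal Rank Fusion, which adds a constant k (60 is the customary value) to each 1-based rank; the product IDs and the two input rankings are invented for the example.

```python
def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists: each appearance adds 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


vector_ranking = ["olive-oil", "cooking-oil", "canola-oil"]
keyword_ranking = ["olive-oil", "canola-oil", "olive-tapenade"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print([item_id for item_id, _ in fused])
# ['olive-oil', 'canola-oil', 'cooking-oil', 'olive-tapenade']
```

Items that appear in both lists rise to the top, and no raw similarity or BM25 score is ever compared directly.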
Interview Q&A
Q: A user searches for a specific product by its exact SKU, and your pure vector-search RAG system fails to find it even though it's in the database. What's happening, and how would you fix it?
A: Embedding models often don't represent rare alphanumeric codes distinctively — the SKU can look like noise in vector space, so it doesn't reliably retrieve via semantic similarity alone. Fixing it means adding keyword/exact-match search (e.g., a SQL WHERE sku = ... lookup or full-text search) alongside vector search — i.e., moving to hybrid search rather than relying on vectors for everything.
Q: Why use Reciprocal Rank Fusion (or similar) instead of just averaging the raw similarity scores from each search method? A: Vector similarity scores and keyword relevance scores (like BM25) are on entirely different, non-comparable scales — averaging them directly is mathematically meaningless. Rank-based fusion sidesteps this by combining based on each result's position in its own ranked list, which is comparable across different scoring systems.
Exercise: For SmartStore AI, name one query type best served by pure keyword search, one best served by pure vector search, and one that genuinely needs both.
Chapter 8 — Metadata Filtering
Vector similarity answers "what's semantically closest," but real queries usually have hard structured constraints too — "olive oil, but specifically in this store," not any store in the entire chain's catalog. Metadata filtering combines a vector search with exact filters on structured fields stored alongside each vector (recall the payload field from Chapter 6).
Pure vector search:                 Vector search + metadata filter:
"olive oil" → top 5 matches         "olive oil" AND store_id = "store_123"
across the ENTIRE catalog,          → top 5 matches, but ONLY from
every store mixed together          that specific store's products
from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.search(
    collection_name="products",
    query_vector=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="store_id", match=MatchValue(value="store_123")),
            FieldCondition(key="category", match=MatchValue(value="Cooking Oils")),
        ]
    ),
    limit=5,
)
This is a near-mandatory feature for any multi-tenant or multi-location RAG system. Without it, SmartStore AI's assistant would happily retrieve and recommend products from a store the user isn't even in — technically "semantically relevant," practically useless (or actively wrong).
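A small pure-Python comparison shows why the filter needs to be part of the search rather than applied to its output. The scores below are made-up similarity values and the two helper functions are hypothetical; how a particular database applies the filter internally is not modeled here, only the effect on the results.

```python
# Hypothetical search hits: similarity score per product, across two stores.
products = [
    {"id": 1, "store_id": "store_999", "score": 0.95},
    {"id": 2, "store_id": "store_999", "score": 0.93},
    {"id": 3, "store_id": "store_123", "score": 0.90},
    {"id": 4, "store_id": "store_999", "score": 0.88},
    {"id": 5, "store_id": "store_123", "score": 0.85},
]


def top_k(items: list[dict], k: int) -> list[dict]:
    """Return the k highest-scoring items."""
    return sorted(items, key=lambda p: p["score"], reverse=True)[:k]


def filter_after_search(items: list[dict], store_id: str, k: int) -> list[dict]:
    """Take the global top k first, then drop other stores: may return too few."""
    return [p for p in top_k(items, k) if p["store_id"] == store_id]


def filter_during_search(items: list[dict], store_id: str, k: int) -> list[dict]:
    """Restrict to the store first, then take the top k of what remains."""
    return top_k([p for p in items if p["store_id"] == store_id], k)


after = filter_after_search(products, "store_123", k=2)
during = filter_during_search(products, "store_123", k=2)
print([p["id"] for p in after])   # [] because the global top 2 are both from store_999
print([p["id"] for p in during])  # [3, 5]
```

Filtering after the fact can leave the user with nothing even though matching products exist in their store.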
Interview Q&A
Q: Why store store_id as metadata on each vector rather than just running a separate Qdrant collection per store?
A: A single collection with metadata filtering is far easier to operate and scale (one index, one place to monitor and tune) than managing potentially thousands of separate per-store collections, and it allows cross-store queries when you actually need them (e.g., "which stores have this in stock") without restructuring anything.
Q: Could you achieve the same result by just filtering the results after getting them back from vector search, in application code? A: Technically yes for small result sets, but it's wasteful and fragile at scale — you'd need to over-fetch a much larger top-K to ensure enough post-filter results remain, and you lose the database's ability to use the filter to actually prune the search space efficiently. Native filtering (filtering during the search itself) is both faster and more correct.
Exercise: SmartStore AI wants to support "show me this in any store within 5 miles." Sketch what metadata fields you'd need to store per product to make that filterable, beyond store_id alone.
Chapter 9 — Reranking
Vector (or hybrid) search is optimized to be fast across potentially millions of items — it uses relatively lightweight similarity comparisons to quickly narrow a huge field down to a top-K candidate set. Reranking adds a second, slower-but-more-accurate pass over just that small candidate set, using a more powerful model (commonly a "cross-encoder") that directly compares the query against each candidate in full, rather than comparing pre-computed vectors.
Stage 1 (fast, broad):                 Stage 2 (slower, precise):

Vector/hybrid search                   Reranker model
over entire catalog                    over just the top 20-50 candidates
        │                                      │
        ▼                                      ▼
Top 20-50 candidates ─────────▶        Re-scored, re-ordered top 5
(good recall, rough ranking)           (much better precision)
The intuition: a pre-computed embedding has to represent a piece of text's meaning in isolation, without knowing what query it'll eventually be compared against. A cross-encoder reranker looks at the query and the candidate together, which lets it catch nuances pure vector similarity misses — at the cost of being too slow to run over your entire catalog, which is exactly why it's a second-stage refinement, not a replacement for the first-stage search.
# Conceptual sketch — many providers and open models offer reranking endpoints
def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
scored = [
{"item": c, "score": reranker_model.score(query, c["text"])}
for c in candidates
]
scored.sort(key=lambda x: x["score"], reverse=True)
return [s["item"] for s in scored[:top_n]]
For SmartStore AI, reranking is genuinely optional at small-to-medium catalog scale — hybrid search alone is often "good enough." It earns its cost once you notice the top-1 or top-3 results are frequently plausible but not quite the best match, which is exactly the failure mode reranking targets.
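To see the two-stage shape end to end without a real reranker, the sketch below wires a fake first stage to a placeholder scorer. Both are assumptions made for illustration: `first_stage` returns a hard-coded candidate list in rough order, and `overlap_score` is a crude word-overlap measure standing in for a cross-encoder, which would score each (query, candidate) pair with a model instead.

```python
from typing import Callable


def two_stage_search(
    query: str,
    candidates_fn: Callable[[str, int], list[str]],
    score_fn: Callable[[str, str], float],
    first_stage_k: int = 20,
    top_n: int = 3,
) -> list[str]:
    """Fetch a broad candidate set, then re-order only those candidates."""
    candidates = candidates_fn(query, first_stage_k)
    rescored = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return rescored[:top_n]


def first_stage(query: str, limit: int) -> list[str]:
    """Hypothetical first-stage results, already in rough similarity order."""
    return [
        "Olive Oil Soap Bar",
        "Olive Tapenade 200g",
        "Extra Virgin Olive Oil 500ml",
        "Canola Oil 1L",
    ][:limit]


def overlap_score(query: str, text: str) -> float:
    """Placeholder scorer: fraction of query words found in the candidate."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q)


top = two_stage_search("extra virgin olive oil", first_stage, overlap_score, top_n=2)
print(top)
```

The expensive scorer only ever sees the handful of candidates the first stage returned, which is what keeps the second pass affordable.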
Interview Q&A
Q: If cross-encoder rerankers are more accurate, why not just use them for the entire search instead of a two-stage pipeline? A: Cross-encoders compare the query against each candidate directly and can't be pre-computed/indexed the way embeddings can — running one against your entire catalog for every query would be far too slow at any real scale. The two-stage design exists specifically to get cross-encoder-level accuracy on a small candidate set without paying that cost across the whole dataset.
Q: How would you decide whether SmartStore AI actually needs a reranking stage, rather than just adding it because it sounds like a best practice? A: Measure it — run your evaluation set (Volume 1, Chapter 12) with and without reranking and compare relevance/faithfulness metrics and the actual cost/latency added. If hybrid search alone already retrieves the correct product in the top 1-3 results consistently, reranking adds cost and latency without a measurable quality gain; add it only once you see a concrete precision problem it would solve.
Exercise: Describe a SmartStore AI query where you'd expect reranking to meaningfully change the top result compared to vector search alone (hint: think about ambiguous product names with multiple close matches).
Chapter 10 — Grounded Answers and Citations
Retrieval gets you the right source material in front of the model. Grounding is making sure the model's final answer actually sticks to that material rather than drifting back into ungrounded generation (Volume 1, Chapter 11) — and citations make that traceable, so you (and the user) can verify where an answer came from.
The core technique is in the prompt structure itself:
SYSTEM PROMPT:
You are a retail navigation assistant. You will be given retrieved
product information inside <context> tags. Answer the user's question
using ONLY the information in <context>. If the answer isn't present
in <context>, say plainly that you don't have that information — do
not guess. When you answer, mention which product entry your answer
is based on.
USER PROMPT:
<context>
Product: Extra Virgin Olive Oil 500ml. Aisle: 7. Store: store_123.
Product: Canola Oil 1L. Aisle: 7. Store: store_123.
</context>
Question: Where's the olive oil?
Notice three deliberate design choices here, each one directly addressing a failure mode from earlier chapters:
1. Delimiting retrieved content (<context> tags) — separates data from instructions, addressing the prompt injection concern from Volume 1, Chapter 13.
2. "Use ONLY the information in context" — directly constrains the model away from filling gaps with parametric memory.
3. "Say plainly that you don't have that information": gives the model explicit permission to decline rather than guess, which tends to reduce confident fabrication.
def build_grounded_prompt(retrieved_chunks: list[str], question: str) -> dict:
    context_block = "\n".join(retrieved_chunks)
    return {
        "system": (
            "You are a retail navigation assistant. Answer using ONLY the "
            "information in the <context> tags below. If the answer isn't "
            "present, say you don't have that information."
        ),
        "user": f"<context>\n{context_block}\n</context>\n\nQuestion: {question}",
    }
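Citations become easier to check if each retrieved chunk carries a number the model can refer to. The following is one possible sketch, assuming the system prompt asks the model to cite sources as [1], [2], and so on; the helper names and the sample answer string are hypothetical, and the answer is written by hand rather than produced by a model.

```python
import re


def number_chunks(chunks: list[str]) -> str:
    """Prefix each chunk with [n] so the model can cite it by number."""
    return "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


def cited_sources(answer: str, chunks: list[str]) -> list[str]:
    """Map [n] markers in the answer back to chunks; reject out-of-range citations."""
    cited: list[str] = []
    for match in re.findall(r"\[(\d+)\]", answer):
        index = int(match)
        if not 1 <= index <= len(chunks):
            raise ValueError(f"answer cites [{index}], which is not in the context")
        if chunks[index - 1] not in cited:
            cited.append(chunks[index - 1])
    return cited


chunks = [
    "Product: Extra Virgin Olive Oil 500ml. Aisle: 7. Store: store_123.",
    "Product: Canola Oil 1L. Aisle: 7. Store: store_123.",
]
context = number_chunks(chunks)
answer = "The olive oil is in aisle 7 [1]."  # hypothetical model output
sources = cited_sources(answer, chunks)
print(sources)
```

An answer with no citation markers, or one that cites a number outside the context, is a cheap signal that the response deserves a closer look before it reaches the user.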
Interview Q&A
Q: You've grounded the prompt well, but the model still occasionally adds a plausible-sounding detail not present in the retrieved context. Is this a complete prompt-engineering failure, and what's the realistic next step? A: No — grounding instructions substantially reduce hallucination, they don't mathematically guarantee zero occurrences, because the underlying model is still a generative system (Volume 1, Chapter 2). The realistic next step is evaluation (Volume 1, Chapter 12): measure faithfulness on a golden set, and if it's still too frequent for your use case, consider lower temperature, stricter prompt wording, or a verification pass before showing the answer to a user.
Q: Why explicitly tell the model to mention which product entry its answer is based on, rather than just trusting the grounding instruction alone? A: Forcing explicit attribution makes ungrounded claims more visible and checkable — both to you during evaluation, and to the end user, who can sanity-check the answer against the cited source rather than blindly trusting a confident-sounding response.
Exercise: Rewrite the system prompt above to also require the model to respond in a specific JSON shape (e.g., {"aisle": ..., "product_name": ..., "found": true/false}) instead of free text — useful for a SwiftUI frontend that needs structured data, not prose, to render a result.
Chapter 11 — RAG vs. Fine-Tuning
These solve different problems and are frequently confused as competing alternatives for the same job.
| | RAG | Fine-tuning |
|---|---|---|
| Best for | Injecting current, large, or frequently changing knowledge | Changing behavior/style/format the model produces |
| Update speed | Update the index any time, instantly reflected | Requires retraining; slower iteration loop |
| Traceability | High — you know exactly which retrieved chunk informed the answer | Low — knowledge is baked into weights, not inspectable |
| Data volume needed | Works with as little as a handful of documents | Generally needs a meaningful curated dataset of examples |
| Typical use | "Answer using our product catalog / policy docs" | "Always respond in this exact tone/format" or teaching a narrow specialized skill |
A useful rule of thumb: if the problem is "the model doesn't know X," reach for RAG. If the problem is "the model knows X but won't behave/format the way I need," consider fine-tuning (or often, simpler prompt engineering first — fine-tuning is usually the heavier, later-stage tool, not the first thing to try).
They're also not mutually exclusive — a production system might fine-tune a model to be better at following a specific structured-output format or domain-specific reasoning style, and use RAG to feed it current factual content. SmartStore AI almost certainly only needs RAG; nothing in its requirements (current product/location lookup) calls for changing model behavior itself.
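The "update speed" row in the table is easy to see in code. With RAG, a catalog change only means re-embedding the entries whose text changed. The sketch below is one possible way to work out which entries those are, by comparing a content hash against the hash stored at the last ingestion. `fingerprint` and `plan_reindex` are hypothetical names, and where you store the old fingerprints is left open.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable hash of the text that would be embedded for a product."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current: dict[int, str], indexed: dict[int, str]) -> tuple[list[int], list[int]]:
    """Compare the current catalog against fingerprints from the last ingestion.

    current: product id -> embedding text as of now
    indexed: product id -> fingerprint stored at the last ingestion
    Returns (ids to re-embed and upsert, ids to delete from the index).
    """
    to_upsert = sorted(
        pid for pid, text in current.items() if indexed.get(pid) != fingerprint(text)
    )
    to_delete = sorted(pid for pid in indexed if pid not in current)
    return to_upsert, to_delete


old = {1: "Product: Canola Oil 1L. Aisle: 7.", 2: "Product: Whole Milk 1 Gallon. Aisle: 2."}
indexed = {pid: fingerprint(text) for pid, text in old.items()}

# Product 1 moved aisles, product 2 was discontinued, product 3 is new.
new = {1: "Product: Canola Oil 1L. Aisle: 9.", 3: "Product: Oat Milk 1L. Aisle: 2."}
print(plan_reindex(new, indexed))  # ([1, 3], [2])
```

Nothing comparable exists for fine-tuning: there is no way to patch a single changed fact into trained weights without another training run.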
Interview Q&A
Q: A team wants to fine-tune a model on their entire product catalog so it "just knows" the products without needing retrieval at query time. What's the problem with this plan? A: The catalog changes (new products, price changes, restocks) far faster than a fine-tuning/retraining cycle can keep up with, so the model's "knowledge" would constantly drift out of date, and there'd be no traceability for which catalog version informed any given answer. This is exactly the kind of frequently-changing, large-volume knowledge that RAG — not fine-tuning — is designed for.
Q: Give an example where fine-tuning, not RAG, is clearly the right tool. A: Teaching a model to consistently output a specific structured response format your downstream system depends on (e.g., a very particular JSON schema or terse internal shorthand) in a way that's more reliable than prompting alone — that's a behavior/format change, not a knowledge-injection problem, and is a legitimate fine-tuning use case.
Exercise: For each of these, decide RAG, fine-tuning, both, or neither, with one sentence justifying each: (a) "answer questions about today's store promotions"; (b) "always respond in a specific brand voice"; (c) "do basic arithmetic correctly"; (d) "cite the exact policy clause an answer came from."
Chapter 12 — Enterprise RAG: Permissions and Multi-Tenancy
The moment a RAG system serves more than one user, store, or tenant with different access levels, retrieval itself becomes a security boundary, not just a relevance mechanism. If your vector search can retrieve a document the requesting user isn't allowed to see, grounding that answer in it doesn't make it safe — it just means you've built a very articulate way to leak data.
Without access-aware retrieval:
Query embeds and searches the ENTIRE index, regardless of who's asking
│
▼
Risk: user A's question retrieves and surfaces user B's private data, simply because it was semantically relevant
With access-aware retrieval:
Query embeds and searches ONLY within the subset of data the requesting user/tenant is actually permitted to access
│
▼
Retrieval is pre-scoped by permission BEFORE similarity ranking ever happens, not filtered as an afterthought
The metadata filtering technique from Chapter 8 is exactly the mechanism used here — but framed as a security control, not just a relevance improvement: every retrieval call should include a mandatory filter scoping results to what the requesting identity is authorized to see (by store, by tenant, by role), applied as a must condition the application always injects, never something the model or the end user can influence via their query text.
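# Permission-aware retrieval: the user/session never gets to control this filter directly.
# Assumes `client` is an already-configured QdrantClient instance.
from qdrant_client.models import Filter, FieldCondition, MatchValue
def search_with_access_control(query_vector, user_store_id: str, limit: int = 5):
    # Note: recent qdrant-client releases deprecate search() in favor of
    # query_points(); adjust this call to match your installed version.
    return client.search(
        collection_name="products",
        query_vector=query_vector,
        # The store filter is built here, inside the function, so no caller can omit it.
        query_filter=Filter(
            must=[FieldCondition(key="store_id", match=MatchValue(value=user_store_id))]
        ),
        limit=limit,
    )
# user_store_id comes from authenticated session context (e.g. Firebase Auth claims),
# never from request body text the user could tamper with.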
This connects directly back to Volume 1, Chapter 13's core principle: never let untrusted input control something with security implications. A store-scoping filter derived from an authenticated session is trustworthy; a "store_id" parameter taken directly from a request body or, worse, parsed out of free-text user input, is not.
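To make the trusted/untrusted distinction concrete, here is a minimal sketch of resolving the store scope from verified session claims while ignoring anything the request body says. The `Session` shape, the `store_id` claim name, and `resolve_store_id` are all assumptions for illustration; your auth provider's verified-token payload will look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Verified identity, e.g. decoded auth-token claims (hypothetical shape)."""
    user_id: str
    claims: dict


class MissingStoreScope(Exception):
    """Raised when the session carries no store assignment."""


def resolve_store_id(session: Session, request_body: dict) -> str:
    # request_body is accepted only to make the point: it is never consulted
    # for scoping, even if it contains a "store_id" key.
    store_id = session.claims.get("store_id")
    if not isinstance(store_id, str) or not store_id:
        # Fail closed: no verified scope means no search at all.
        raise MissingStoreScope(f"user {session.user_id} has no store scope")
    return store_id


session = Session(user_id="u1", claims={"store_id": "store_123"})
tampered_body = {"question": "Where's the olive oil?", "store_id": "store_999"}
print(resolve_store_id(session, tampered_body))  # store_123
```

The value returned here is what you would pass as `user_store_id` to the search function above.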
Interview Q&A
Q: Where exactly should the access-control filter be enforced — in the prompt, in application code before the vector search, or relying on the LLM to "only mention things the user is allowed to see"? A: In application code, before/during the vector search call itself — never rely on the LLM to self-enforce access control via prompt instructions. The LLM has no reliable way to verify permissions; if unauthorized data is retrieved and placed in its context at all, it can leak that data regardless of instructions. The filter must be applied at the database query level, derived from authenticated session data the user cannot influence.
Q: A single shared Qdrant collection holds all stores' products with store_id metadata used for filtering. What's the security risk if a bug ever allows that filter to be skipped or bypassed?
A: Every store's product data becomes retrievable by any user, regardless of which store they're actually associated with — a single missed filter is a full cross-tenant data leak, not a minor relevance bug. This is exactly why the filter should be enforced as a mandatory, non-optional part of the search function itself (as in the code above) rather than something callers remember to add.
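Because a skipped filter fails silently (the search still returns plausible results), it's worth pinning the behavior down with a regression test. The sketch below uses a toy in-memory index rather than Qdrant so it runs anywhere. `ScopedIndex` is a hypothetical stand-in, and the two-dimensional vectors are made up for the example. The point is the shape of the API: `store_id` is a required keyword argument, so a caller cannot search without a scope.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ScopedIndex:
    """Toy in-memory index whose only search method requires a store_id."""

    def __init__(self) -> None:
        self._points: list[tuple[list[float], dict]] = []

    def add(self, vector: list[float], payload: dict) -> None:
        self._points.append((vector, payload))

    def search(self, query_vector: list[float], *, store_id: str, limit: int = 5) -> list[dict]:
        if not store_id:
            raise ValueError("store_id is required for every search")
        # Scope first, then rank: other stores' points are never candidates.
        candidates = [(v, p) for v, p in self._points if p["store_id"] == store_id]
        candidates.sort(key=lambda vp: cosine(query_vector, vp[0]), reverse=True)
        return [p for _, p in candidates[:limit]]


index = ScopedIndex()
index.add([1.0, 0.0], {"text": "Olive oil, aisle 7", "store_id": "store_123"})
index.add([1.0, 0.0], {"text": "Olive oil, aisle 4", "store_id": "store_456"})
index.add([0.0, 1.0], {"text": "Whole milk, aisle 2", "store_id": "store_123"})

results = index.search([1.0, 0.0], store_id="store_123")
# The regression check: an equally relevant point from another store must not appear.
assert all(r["store_id"] == "store_123" for r in results)
print([r["text"] for r in results])  # ['Olive oil, aisle 7', 'Whole milk, aisle 2']
```

Calling `index.search([1.0, 0.0])` without `store_id` raises a `TypeError`, so forgetting the scope is a crash during development rather than a leak in production.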
Exercise: SmartStore AI eventually adds a "store employee" role that should see internal-only fields (cost price, supplier info) that regular shoppers shouldn't. Sketch how you'd extend the metadata-filtering approach above to support this second, role-based access dimension alongside the existing store-scoping.
Chapter 13 — Hands-On: Building a Minimal Product-Location RAG Pipeline
This ties every chapter in this volume into one small, real, runnable pipeline — the actual core of SmartStore AI's RAG feature, minus the FastAPI wrapper and auth layer (which belong in later volumes/your actual build).
# ── ingest.py ──────────────────────────────────────────────────────────
# Run this whenever the product catalog changes.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333")
PRODUCTS = [
    {"id": 1, "name": "Extra Virgin Olive Oil 500ml", "category": "Cooking Oils", "aisle": 7, "store_id": "store_123"},
    {"id": 2, "name": "Canola Oil 1L", "category": "Cooking Oils", "aisle": 7, "store_id": "store_123"},
    {"id": 3, "name": "Whole Milk 1 Gallon", "category": "Dairy", "aisle": 2, "store_id": "store_123"},
]
def product_to_text(p: dict) -> str:
    # One product = one chunk: a short, self-describing sentence to embed.
    return f"Product: {p['name']}. Category: {p['category']}. Aisle: {p['aisle']}."
def embed(texts: list[str]) -> list[list[float]]:
    # A single API call embeds the whole batch; results come back in input order.
    response = openai_client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]
def ingest():
    # Create the collection only on the first run, so re-running after a
    # catalog change doesn't fail because the collection already exists.
    if not qdrant.collection_exists("products"):
        qdrant.create_collection(
            collection_name="products",
            # text-embedding-3-small returns 1536-dimensional vectors by default.
            vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
        )
    texts = [product_to_text(p) for p in PRODUCTS]
    vectors = embed(texts)
    # upsert overwrites points that share an id; products removed from the
    # catalog are not deleted by this script.
    qdrant.upsert(
        collection_name="products",
        points=[
            PointStruct(
                id=p["id"],
                vector=vec,
                payload={"text": text, "store_id": p["store_id"], "aisle": p["aisle"]},
            )
            for p, vec, text in zip(PRODUCTS, vectors, texts)
        ],
    )
    print(f"Ingested {len(PRODUCTS)} products.")
if __name__ == "__main__":
    ingest()
# ── query.py ───────────────────────────────────────────────────────────
# This is what your FastAPI endpoint calls on every user question.
import anthropic
from qdrant_client.models import Filter, FieldCondition, MatchValue
from ingest import qdrant, embed  # reuse the same client and embedding model as ingestion
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
def retrieve(question: str, store_id: str, limit: int = 3) -> list[str]:
    # Embed the question with the same model used for the product texts.
    query_vector = embed([question])[0]
    # Note: recent qdrant-client releases deprecate search() in favor of
    # query_points(); adjust this call to match your installed version.
    results = qdrant.search(
        collection_name="products",
        query_vector=query_vector,
        # Mandatory store scoping (Chapter 12): applied on every query.
        query_filter=Filter(must=[FieldCondition(key="store_id", match=MatchValue(value=store_id))]),
        limit=limit,
    )
    return [r.payload["text"] for r in results]
def answer(question: str, store_id: str) -> str:
    retrieved = retrieve(question, store_id)
    context_block = "\n".join(retrieved)
    response = anthropic_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=300,
        system=(
            "You are a retail navigation assistant. Answer using ONLY the "
            "information in the <context> tags below. If the answer isn't "
            "present, say you don't have that information."
        ),
        messages=[
            {"role": "user", "content": f"<context>\n{context_block}\n</context>\n\nQuestion: {question}"}
        ],
    )
    # The response is a list of content blocks; keep only the text ones.
    return "".join(block.text for block in response.content if block.type == "text")
if __name__ == "__main__":
    print(answer("Where's the olive oil?", store_id="store_123"))
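To run it, start a Qdrant instance at http://localhost:6333 and set OPENAI_API_KEY and ANTHROPIC_API_KEY in your environment. Then run ingest.py once, followed by query.py. That's the entire pipeline from Chapters 1 through 10, working end-to-end: chunking (each product is already chunk-sized), embedding, vector storage, metadata-filtered retrieval, and a grounded answer you can check against the retrieved entries.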
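The query.py prompt passes the retrieved entries to the model unlabeled. If you want the answer to point back at a specific entry (the attribution idea from Chapter 10's Q&A), one option is to number the entries in the context block and ask the model to cite by number. The sketch below is an assumption-laden illustration, not part of the pipeline above: `build_context_block`, `CITATION_INSTRUCTION`, and `cited_entries` are hypothetical names, and the bracketed-number convention is just one possible format.

```python
import re


def build_context_block(entries: list[str]) -> str:
    """Prefix each retrieved entry with a bracketed number the model can cite."""
    return "\n".join(f"[{i}] {text}" for i, text in enumerate(entries, start=1))


# Hypothetical extra sentence to append to the system prompt.
CITATION_INSTRUCTION = (
    "After your answer, cite the entry you relied on using its bracketed "
    "number, for example [1]."
)


def cited_entries(answer: str, entries: list[str]) -> list[str]:
    """Map [n] markers in the answer back to entries, ignoring out-of-range numbers."""
    numbers = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    return [entries[n - 1] for n in numbers if 1 <= n <= len(entries)]


retrieved = [
    "Product: Extra Virgin Olive Oil 500ml. Category: Cooking Oils. Aisle: 7.",
    "Product: Canola Oil 1L. Category: Cooking Oils. Aisle: 7.",
]
print(build_context_block(retrieved))
print(cited_entries("The olive oil is in aisle 7 [1].", retrieved))
```

In `answer()`, you would swap `"\n".join(retrieved)` for `build_context_block(retrieved)` and append `CITATION_INSTRUCTION` to the system prompt. An answer that cites a number outside the retrieved range is a useful signal to log during evaluation.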
Exercise (the real one): Extend this with a fourth product that's deliberately ambiguous (e.g., a second "olive oil" variant), run the query, and check whether the top retrieved result is the one you'd expect. If not, that's your first real retrieval-quality debugging session — work through Chapters 4-9 to figure out why.
Appendix A — Glossary
| Term | Meaning |
|---|---|
| RAG | Retrieval-Augmented Generation — retrieving relevant data before generation, instead of relying on model memory |
| Ingestion | The offline process of chunking, embedding, and storing source data |
| Chunking | Splitting source data into smaller retrievable units |
| ANN / HNSW | Approximate Nearest Neighbor search; HNSW is the graph-based indexing algorithm most vector DBs use |
| Hybrid search | Combining keyword (e.g. BM25) and vector search results, typically via rank fusion |
| Metadata filtering | Constraining vector search results by structured fields (store, category, role, etc.) |
| Reranking | A second, more precise scoring pass over a small candidate set from initial retrieval |
| Grounding | Constraining a model's answer to only use retrieved/provided content |
| Multi-tenancy | Serving multiple isolated customers/stores/users from shared infrastructure, with enforced access boundaries |
Appendix B — Chapter Summary Table
| # | Chapter | Core takeaway |
|---|---|---|
| 1 | What RAG solves | Retrieve real data first, then generate — fixes staleness, hallucination, and scope |
| 2 | Why enterprises use it | Knowledge too large/changing/sensitive for context-stuffing or fine-tuning |
| 3 | Pipeline end-to-end | Ingestion (offline) and query time (online) are distinct phases — don't conflate them |
| 4 | Chunking | Granularity choice directly drives retrieval precision |
| 5 | Embeddings, practically | Same model, same mode, for both documents and queries — always |
| 6 | Vector databases | ANN/HNSW indexing makes similarity search fast at scale |
| 7 | Hybrid search | Pure vector search alone misses exact terms; hybrid is the production default |
| 8 | Metadata filtering | Combines structured constraints with semantic search — essential for multi-store/tenant data |
| 9 | Reranking | A second, slower, more precise pass over a small candidate set — optional until proven necessary |
| 10 | Grounding & citations | Prompt structure (delimiting, "only use this," "say if unknown") is the real lever against hallucination |
| 11 | RAG vs. fine-tuning | RAG for knowledge, fine-tuning for behavior/format — not competing tools |
| 12 | Permissions | Access control belongs in the retrieval query itself, never in prompt instructions alone |
| 13 | Hands-on pipeline | The exact ingestion + query pattern SmartStore AI's backend needs |
Next: Volume 3 — AI Agents & MCP (tool calling, agent memory, multi-agent orchestration, and the Model Context Protocol — covers how SmartStore AI's assistant could go beyond answering questions into taking actions).