Volume 6 — Advanced Retrieval, Multimodal/Voice, and LLMOps

A working reference for the transition from Senior iOS Developer to AI Engineer

How to use this volume

Volumes 1-5 covered the original curriculum end to end. This volume goes beyond it — closing the technical gaps that didn't fit anywhere else (multi-hop retrieval, fine-tuning, knowledge graphs) and building the two input modalities SmartStore AI's own spec mentions but Volumes 1-5 never touched: voice and images. Same format throughout: explanation, diagram, code, two interview Q&As, and an exercise per chapter.

Contents

1. Query Rewriting and Expansion
2. Multi-Hop / Agentic Retrieval
3. GraphRAG: Knowledge Graphs as a Retrieval Layer
4. Fine-Tuning Fundamentals: When and How
5. Building a Fine-Tuning Dataset from Production Logs
6. Multimodal Input: Vision
7. Speech-to-Text Integration
8. Text-to-Speech Integration
9. Semantic Caching and Cost Optimization at Scale
10. Model Routing and Cascading
11. Prompt Versioning and A/B Testing in Production
12. Advanced Evaluation: Regression Testing and Continuous Eval
13. Hands-On: Adding Voice + Image Input to SmartStore AI

Appendix A — Glossary
Appendix B — Chapter Summary Table

Chapter 1 — Query Rewriting and Expansion

Volume 2's retrieval embeds the user's question exactly as typed. That's often suboptimal: a user's casual phrasing ("cheap pasta sauce") may sit farther in embedding space from how your catalog actually describes products ("Value Brand Marinara Sauce") than a rewritten version would. Query rewriting inserts an LLM call before retrieval to reformulate the question into something that retrieves better. The extra call costs latency and tokens on every query, so it earns its place only when it measurably improves retrieval quality.

Raw query:                          Rewritten/expanded query:
"cheap pasta sauce"                 "budget-friendly pasta sauce, marinara,
                                      tomato sauce, value brand"
        │                                    │
        ▼                                    ▼
   Embed & search                      Embed & search
   (may miss differently-              (broader phrasing increases
    worded matches)                     overlap with catalog text)

A related, often more effective technique is HyDE (Hypothetical Document Embeddings): instead of embedding the question itself, ask the model to write a hypothetical answer first, then embed that — because a plausible answer's phrasing is structurally closer to how real documents/products are written than a question's phrasing is.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def rewrite_query(question: str, conversation_history: str = "") -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=100,
        system=(
            "Rewrite the user's question to maximize retrieval recall: resolve "
            "any pronouns using the conversation history, and add likely "
            "synonyms or related terms. Return ONLY the rewritten query."
        ),
        messages=[{"role": "user", "content": f"History: {conversation_history}\nQuestion: {question}"}],
    )
    return "".join(b.text for b in response.content if b.type == "text")

def hyde_query(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=150,
        system="Write a short, plausible-sounding hypothetical answer to this question, as if it were a real product catalog entry. It doesn't need to be factually correct.",
        messages=[{"role": "user", "content": question}],
    )
    hypothetical = "".join(b.text for b in response.content if b.type == "text")
    return hypothetical  # this gets embedded instead of the raw question

The conversation-history resolution piece matters specifically for multi-turn chat: "is it gluten-free?" following "where's the pasta sauce?" needs "it" resolved to "the pasta sauce" before embedding, or retrieval has nothing meaningful to search for.
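
One way to limit the damage from an occasional bad rewrite is to search with both the rewritten and the raw query and merge the results. The sketch below is illustrative only: `rewrite_fn` and `retrieve_fn` are hypothetical stand-ins for `rewrite_query` and Volume 2's retrieval, passed in as plain callables so the example runs without any API access, and the merge order and fallback rule are assumptions rather than part of the original pipeline.

```python
from typing import Callable

def retrieve_with_rewrite(
    question: str,
    history: str,
    rewrite_fn: Callable[[str, str], str],
    retrieve_fn: Callable[[str], list[str]],
) -> list[str]:
    """Search with the rewritten query first, then the raw one, de-duplicated."""
    rewritten = rewrite_fn(question, history).strip()
    # If the rewrite came back empty or unchanged, one search is enough.
    queries = [question] if not rewritten or rewritten == question else [rewritten, question]

    merged: list[str] = []
    seen: set[str] = set()
    for query in queries:
        for chunk in retrieve_fn(query):
            if chunk not in seen:  # keep the first (highest-priority) occurrence
                seen.add(chunk)
                merged.append(chunk)
    return merged


# Stubs standing in for the LLM call and the vector search (hypothetical data).
def fake_rewrite(question: str, history: str) -> str:
    return "is the pasta sauce gluten-free" if "pasta sauce" in history else question

def fake_retrieve(query: str) -> list[str]:
    index = {
        "is the pasta sauce gluten-free": ["Marinara Sauce: gluten-free", "Pasta: contains wheat"],
        "is it gluten-free?": ["Pasta: contains wheat", "Gluten-free bread"],
    }
    return index.get(query, [])

results = retrieve_with_rewrite(
    "is it gluten-free?", "where's the pasta sauce?", fake_rewrite, fake_retrieve
)
```

Searching twice doubles the retrieval work, so this variant is only worth keeping if your eval suite shows it recovers results that the rewritten query alone misses.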

Interview Q&A

Q: Why would embedding a hypothetical answer (HyDE) ever retrieve better results than embedding the actual question? A: Embedding models are trained on text-to-text similarity, and a question's grammatical structure ("where's the X?") is often quite different in form from how the actual target content is written (a declarative product description). A hypothetical answer, even if factually wrong, mimics the style of the real documents being searched, which can produce a closer embedding-space match than the question's own phrasing.

Q: What's the cost trade-off of adding a query-rewriting step before every retrieval call? A: An extra LLM call adds latency and token cost to every single query, even ones where the original phrasing would have retrieved just fine — it's a net win only when measured retrieval quality improvement (via your eval suite, Volume 1 Ch.12) justifies that added cost, not something to add by default without checking.

Exercise: Take the query "is it still here" following a prior turn about "olive oil" in conversation history. Write what rewrite_query should ideally output, and explain what would go wrong in retrieval if this rewriting step were skipped.

Chapter 2 — Multi-Hop / Agentic Retrieval

Volume 2's RAG pipeline does one retrieval pass per question. Some questions genuinely need more than one — "is the store-brand olive oil cheaper than the national brand" requires retrieving both products, and a question like "what's a substitute for an ingredient that's out of stock" requires first finding out what's out of stock before knowing what to search for as a substitute.

Single-hop (Volume 2):                Multi-hop:
Question → one retrieval →            Question → retrieval 1 → reasoning about
   answer                               result → retrieval 2 (informed by
                                         result 1) → ... → final answer

This is exactly Volume 3's ReAct/agent pattern, applied specifically to retrieval as the tool, rather than treating "search the catalog" as a single fixed pipeline step:

tools = [{
    "name": "search_products",
    "description": "Search the product catalog by name or description.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}, "store_id": {"type": "string"}},
        "required": ["query", "store_id"],
    },
}]

def search_products(query: str, store_id: str) -> str:
    return "\n".join(retrieve(query, store_id))  # Volume 2's retrieve()

# Reusing Volume 3, Chapter 13's run_agent loop directly — the model now
# decides HOW MANY retrieval calls a question needs, and what to search
# for at each step, rather than your code fixing it at exactly one call.
answer = run_agent(
    "Is the store-brand olive oil cheaper than the national brand?",
    store_id="store_123",
)

The model might call search_products("store brand olive oil", ...), then search_products("national brand olive oil", ...), then compare the two results itself before answering — three steps your Volume 2 pipeline structurally couldn't perform, because it only ever ran retrieval once per request.
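
The dependent, step-by-step nature of multi-hop retrieval can be sketched without a model at all. In the example below, `next_query` is a hypothetical planner callable (a scripted stand-in for the model's tool-calling decisions), the catalog is a made-up dictionary, and `max_hops` is an assumed safety cap; none of these names come from Volume 3's `run_agent`.

```python
from typing import Callable, Optional

def multi_hop_retrieve(
    question: str,
    next_query: Callable[[str, list[str]], Optional[str]],
    search: Callable[[str], str],
    max_hops: int = 3,
) -> list[str]:
    """Run up to max_hops searches, each one chosen after seeing earlier results."""
    findings: list[str] = []
    for _ in range(max_hops):
        query = next_query(question, findings)
        if query is None:  # the planner decided it has enough information
            break
        findings.append(search(query))
    return findings


# Hypothetical stand-ins: a scripted planner instead of a model, a dict instead of a catalog.
CATALOG = {
    "out of stock baking ingredients": "Buttermilk: OUT OF STOCK",
    "buttermilk substitute": "Plain Yogurt: in stock, aisle 4",
}

def scripted_planner(question: str, findings: list[str]) -> Optional[str]:
    if not findings:
        return "out of stock baking ingredients"
    if len(findings) == 1 and "Buttermilk" in findings[0]:
        # The second query could not be written before the first result was known.
        return "buttermilk substitute"
    return None

findings = multi_hop_retrieve(
    "What's a substitute for the baking ingredient that's out of stock?",
    scripted_planner,
    lambda q: CATALOG.get(q, "no results"),
)
```

The hop cap is the piece worth copying into a real agent loop: it bounds the latency and cost of any single question, however the model decides to search.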

Interview Q&A

Q: Why not just always retrieve a larger top-K (e.g., 20 instead of 5) instead of building multi-hop retrieval, to make sure both products show up in one pass? A: A larger top-K helps when the needed information is all retrievable via one semantically similar query, but it doesn't help when the second piece of information genuinely depends on knowing the first (e.g., "find a substitute for whatever's out of stock" — you can't formulate that second search query until you know what's out of stock). Multi-hop retrieval handles dependent, sequential information needs that no single query, however broad, can capture.

Q: What's the realistic downside of moving from Volume 2's fixed single-retrieval pipeline to this agentic multi-hop pattern for every query? A: Latency and cost increase — every query now potentially takes multiple model round-trips and retrieval calls instead of one fixed pipeline pass, even for simple questions that didn't need it. This is why Volume 1's "is RAG/agentic complexity actually needed" framing matters: reserve multi-hop for query types that demonstrably need it, not as a blanket replacement for the simpler pipeline.

Exercise: Trace through, on paper, what search_products calls a correctly-functioning multi-hop agent should make for: "What pairs well with the pasta sauce we have in stock, and where's that item located?"

Chapter 3 — GraphRAG: Knowledge Graphs as a Retrieval Layer

Vector search (Volume 2) treats each chunk as an independent, isolated piece of meaning. It's excellent at "find the thing that's semantically similar to this query" and structurally blind to relationships between things — "what's frequently bought with this," "what are common substitutes for this," "which products belong to the same supplier." Those are graph questions, not similarity questions.

A knowledge graph stores entities as nodes and relationships as explicit edges, and lets you query by traversing those relationships directly, rather than hoping semantic similarity happens to surface them.

Vector search view:                    Graph view:
"olive oil" embedding ──similar to──▶  (Olive Oil)──substitute_for──▶(Canola Oil)
  "canola oil" embedding                    │
  (maybe, maybe not — depends              │frequently_bought_with
  entirely on how similar the                ▼
  embedding model judges them)          (Pasta)──located_in──▶(Aisle 7)
                                        These relationships are EXPLICIT,
                                        not inferred from similarity —
                                        traversal finds them reliably,
                                        every time.

# Conceptual sketch — could be a dedicated graph database (e.g. Neo4j) or,
# at smaller scale, just structured relationship tables in PostgreSQL
def get_substitutes(product_id: str) -> list[str]:
    rows = db.execute(
        "SELECT substitute_id FROM product_substitutes WHERE product_id = %s", (product_id,)
    ).fetchall()
    # fetchall() returns rows, not bare IDs (assumes tuple-style rows)
    return [row[0] for row in rows]

def get_frequently_bought_with(product_id: str) -> list[str]:
    rows = db.execute(
        "SELECT related_product_id FROM product_associations "
        "WHERE product_id = %s ORDER BY co_purchase_count DESC LIMIT 5", (product_id,)
    ).fetchall()
    return [row[0] for row in rows]

GraphRAG combines both: vector search finds the relevant starting entity (Volume 2's normal retrieval), then graph traversal pulls in explicitly related entities that similarity search alone might miss or might surface unreliably. For SmartStore AI specifically, "what's a substitute for X" (Chapter 2's exercise) is a far better fit for an explicit substitutes table than for hoping vector similarity happens to rank the right substitute highest.
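
The two-step combination can be sketched with an in-memory graph. Everything here is hypothetical: the `GRAPH` dictionary mirrors the diagram above rather than a real database, and `find_entity` is a stub for the vector-search step that would normally embed the query.

```python
from typing import Callable, Optional

# Hypothetical in-memory graph: entity -> {relationship: [related entities]}.
GRAPH = {
    "Olive Oil": {"substitute_for": ["Canola Oil"], "frequently_bought_with": ["Pasta"]},
    "Pasta": {"located_in": ["Aisle 7"]},
}

def graph_rag_context(
    query: str,
    find_entity: Callable[[str], Optional[str]],
    relations: tuple[str, ...] = ("substitute_for", "frequently_bought_with"),
) -> list[str]:
    """Step 1: similarity search picks the starting node. Step 2: follow explicit edges."""
    start = find_entity(query)
    if start is None:
        return []
    facts = []
    for relation in relations:
        for neighbor in GRAPH.get(start, {}).get(relation, []):
            facts.append(f"{start} --{relation}--> {neighbor}")
    return facts

# Stub for the vector-search step (a real system would embed the query).
def fake_find_entity(query: str) -> Optional[str]:
    return "Olive Oil" if "olive oil" in query.lower() else None

context = graph_rag_context("what can I use instead of olive oil?", fake_find_entity)
```

The returned facts would be added to the prompt alongside the normally retrieved chunks, so the model answers from explicit relationships instead of guessing them.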

This is genuinely the most optional chapter in this volume — a dedicated graph layer is real added complexity, worth it specifically when your product has meaningful, queryable relationships (substitutes, bundles, supplier hierarchies), not as a default upgrade to every RAG system.

Interview Q&A

Q: Why might vector similarity alone be an unreliable way to find product substitutes, even though "canola oil" and "olive oil" likely have high embedding similarity? A: High semantic similarity tells you two things are talked about similarly, not that one is a sanctioned, business-meaningful substitute for the other — a vector search might just as easily rank "olive oil soap" or an unrelated oil-adjacent product highly, with no way to distinguish "similar in meaning" from "actually a valid substitute" as a business fact. An explicit substitutes table encodes that business fact directly and reliably.

Q: At what point would you actually justify the engineering cost of adding a dedicated graph layer to SmartStore AI, versus just adding a few structured relationship tables in the existing PostgreSQL database? A: Structured relationship tables (as in the code above) are sufficient as long as the relationships are simple and the traversal queries are shallow (one or two hops, like substitutes or co-purchases). A dedicated graph database earns its place once you need deep, variable-length traversals (e.g., "find all products connected through a chain of substitutions") that become awkward to express and slow to run as relational queries.
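
To make "variable-length traversal" concrete, here is a minimal breadth-first sketch over a hypothetical substitution graph. The data and the `max_depth` cap are assumptions for illustration. In PostgreSQL the equivalent query would need a recursive CTE, which is the point at which relational queries start to feel awkward.

```python
from collections import deque

# Hypothetical substitution edges (product -> direct substitutes).
SUBSTITUTES = {
    "Butter": ["Margarine"],
    "Margarine": ["Coconut Oil"],
    "Coconut Oil": ["Butter"],  # cycles are common in real data
    "Olive Oil": ["Canola Oil"],
}

def substitution_chain(start: str, max_depth: int = 3) -> dict[str, int]:
    """Breadth-first traversal: every product reachable from start, with its hop count."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        product = queue.popleft()
        if depth_of[product] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in SUBSTITUTES.get(product, []):
            if neighbor not in depth_of:  # the visited check is what makes cycles safe
                depth_of[neighbor] = depth_of[product] + 1
                queue.append(neighbor)
    del depth_of[start]
    return depth_of

chain = substitution_chain("Butter")
```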

Exercise: Design the schema (just column names, no need for full SQL) for a product_substitutes table that could answer "what's a substitute for X" for SmartStore AI's catalog.

Chapter 4 — Fine-Tuning Fundamentals: When and How

Volume 2, Chapter 11 established the rule of thumb: RAG for knowledge, fine-tuning for behavior. This chapter goes one level deeper into how fine-tuning actually works, because "fine-tune the model" hides a real technical choice.

Full fine-tuning updates every weight in the model: it is expensive, requires substantial compute and data, and risks "catastrophic forgetting" (the model getting worse at things it wasn't being fine-tuned for). LoRA (Low-Rank Adaptation) and other PEFT (Parameter-Efficient Fine-Tuning) methods instead freeze the base model entirely and train a small set of additional "adapter" weights layered on top. That is dramatically cheaper and faster, and because the base weights are never overwritten, the original model's general capability can always be recovered by removing the adapter.

Full fine-tuning:                      LoRA / PEFT:
┌─────────────────────────┐            ┌──────────────────────────┐
│ Every weight in the     │            │ Base model weights       │
│ model gets updated:     │            │ (FROZEN, unchanged)      │
│ expensive, and risks    │            │   +                      │
│ forgetting general      │            │ Small adapter weights    │
│ capability              │            │ (the only thing trained) │
└─────────────────────────┘            └──────────────────────────┘
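
The size difference is easy to check with arithmetic. LoRA replaces the update to a weight matrix W with the product of two thin matrices, B and A, so the layer computes W x + scale * B (A x). The sketch below uses plain Python lists; the 4096 x 4096 layer size, the rank of 8, and the tiny 2 x 2 example are illustrative assumptions, not figures for any particular model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Trainable parameters for one weight matrix: full fine-tuning vs. a LoRA adapter."""
    full = d_out * d_in                 # every entry of W is trainable
    lora = rank * d_in + d_out * rank   # A is (rank x d_in), B is (d_out x rank)
    return full, lora

def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scale: float = 1.0) -> list[float]:
    """y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)

# Tiny example: 2x2 frozen weight, rank-1 adapter with B initialized to zeros.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]          # 1 x 2
B_zero = [[0.0], [0.0]]    # 2 x 1
x = [1.0, 1.0]
y_untrained = lora_forward(W, A, B_zero, x)   # identical to the base model's output
y_adapted = lora_forward(W, [[1.0, 0.0]], [[1.0], [0.0]], x)
```

With these assumed sizes the adapter trains 65,536 parameters against 16,777,216 for the full matrix, which is 1/256 of the weights. Because B starts at zero, an untrained adapter leaves the base model's output unchanged.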

When fine-tuning (of either kind) is actually worth reaching for, versus the lighter-weight tools from earlier volumes:

Problem                                    Right tool
──────────────────────────────────────────────────────────────
Model doesn't know X                       RAG (Volume 2)
Model needs to decide when to act          Tool calling / agents (Volume 3)
Model's tone/format is inconsistent         Better prompting first (Volume 1, Ch.9)
                                            — try this before fine-tuning
Model's tone/format STILL inconsistent      Fine-tuning (LoRA), once you have
after prompting, at meaningful scale         enough quality examples (typically
                                              hundreds to thousands)

Fine-tuning is genuinely the last resort in this list, not the first thing to reach for — it requires real data, real training infrastructure (or a provider's managed fine-tuning service), and a slower iteration loop than a prompt change. For SmartStore AI specifically, nothing in the current roadmap clearly needs it; RAG and good prompting cover the actual requirements.

Interview Q&A

Q: Why try prompt engineering before fine-tuning, given that fine-tuning seems like the more "thorough" fix for inconsistent model behavior? A: Prompt changes are nearly free to test and iterate on (no training run, no data collection needed) and often solve the same problem fine-tuning would — inconsistent formatting is frequently a matter of insufficiently explicit instructions or missing few-shot examples (Volume 1, Ch.9), not something requiring weight changes. Fine-tuning is the right tool only once you've exhausted prompting and still see the problem at a scale that justifies the much higher cost of collecting data and running training.

Q: Why does LoRA avoid the "catastrophic forgetting" risk that full fine-tuning carries? A: Because the original base model weights are never modified; they remain frozen exactly as trained. The small adapter layers learn the new, narrow behavior on top, and if something goes wrong, you can simply remove the adapter and you're back to the unmodified base model's full original capability. That isn't true of full fine-tuning, where the original weights themselves have been overwritten. One caveat: while the adapter is active, the combined model can still get worse at unrelated tasks, so LoRA makes forgetting smaller and reversible rather than impossible.

Exercise: Looking at SmartStore AI's actual feature set, identify one (if any) realistic future feature that would genuinely need fine-tuning rather than RAG/prompting — or explain why none of the currently planned features clear that bar.

Chapter 5 — Building a Fine-Tuning Dataset from Production Logs

If Chapter 4's bar for fine-tuning is ever cleared, the data has to come from somewhere real — and the most natural source, once a product has real usage, is its own production logs (the same audit logs from Volume 3, Chapter 12 and Volume 4's governance chapter).

Production logs (every request, Volume 3 Ch.12)
        │
        ▼
Filter for QUALITY signals:
  - User did NOT immediately rephrase (Volume 4, Ch.9 — a proxy for
    "the first answer was actually good")
  - Answer was grounded/cited correctly (Volume 2, Ch.10)
  - No flagged errors or low-confidence retrieval
        │
        ▼
Strip PII / sensitive fields (governance, Volume 4, Ch.11)
        │
        ▼
Format as prompt/completion pairs (JSONL)
        │
        ▼
Fine-tuning-ready dataset

import json

def build_finetune_dataset(logs: list[dict], output_path: str):
    examples = []
    for log in logs:
        # Quality filter: skip anything where the user immediately rephrased
        # (Volume 4, Ch.9's dissatisfaction signal) or that was flagged
        if log.get("user_rephrased_next_turn") or log.get("flagged_low_quality"):
            continue

        examples.append({
            "messages": [
                {"role": "system", "content": log["system_prompt"]},
                {"role": "user", "content": strip_pii(log["question"])},
                # Answers can echo PII from the question, so scrub them too
                {"role": "assistant", "content": strip_pii(log["answer"])},
            ]
        })

    with open(output_path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

def strip_pii(text: str) -> str:
    # Placeholder — a real implementation needs a proper PII-detection
    # step (names, emails, phone numbers) before any data leaves your logs
    # for use in a training dataset, per Volume 4's governance principles
    return text

The quality filter is the entire point of this chapter — fine-tuning on unfiltered production logs trains the model to reproduce whatever mix of good and bad answers it actually gave, including its own mistakes. The filtering step is what turns raw logs into something worth training on at all.
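
As a starting point for the PII step, a regex pass can mask the most mechanical patterns. This is a minimal sketch, not a complete solution: the function name `strip_pii_basic` and both patterns are assumptions for illustration, names and addresses are not detected at all, and the loose phone pattern will also mask other long digit runs such as barcodes.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: optional "+", then at least 9 digits/separators in total.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def strip_pii_basic(text: str) -> str:
    """Mask emails and phone-like numbers. Names and addresses are NOT handled."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

cleaned = strip_pii_basic("Email jane@example.com or call 555-123-4567 about aisle 7")
```

Over-masking is the safer failure direction for training data, which is why the phone pattern errs on the loose side.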

Interview Q&A

Q: Why is "the user didn't rephrase their question" a useful, if imperfect, quality filter for fine-tuning data? A: It's a cheap, scalable proxy for "the first answer was satisfactory," available automatically from existing usage data without requiring manual labeling of every interaction — imperfect because a user might not rephrase for reasons unrelated to quality (they gave up, or got distracted), but at scale it meaningfully biases the dataset toward genuinely good examples rather than including everything indiscriminately.

Q: What governance requirement from Volume 4 does this entire pipeline depend on having been in place from the start? A: Audit logging that actually captures enough detail (the system prompt, question, answer, and downstream signals like rephrasing) per request, retained long enough to accumulate a useful dataset — and a defined PII-handling/data-retention policy, since training data built from real user questions is exactly the kind of sensitive data governance (Volume 4, Ch.11) is meant to account for.

Exercise: Beyond "user didn't rephrase," propose one additional quality signal you could extract from SmartStore AI's logs to decide whether a given interaction belongs in a fine-tuning dataset.

Chapter 6 — Multimodal Input: Vision

Every prior volume assumed the user's question arrives as text. A real shopping assistant benefits enormously from also accepting a photo — "what aisle has more of this?" pointed at a product in hand, with no need to type its name correctly (or know it at all).

User takes a photo of a product
        │
        ▼
Image encoded (base64) and sent alongside
the text question, in one request
        │
        ▼
Vision-capable model identifies the product
from the image
        │
        ▼
Identified product name fed into the EXISTING
RAG pipeline (Volume 2) — nothing about retrieval
itself changes, only how the product name was obtained

import base64
import anthropic

client = anthropic.Anthropic()

def ask_with_image(image_path: str, question: str, store_id: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image_data}},
                {"type": "text", "text": "Identify the product in this photo. Reply with ONLY the product name."},
            ],
        }],
    )
    product_name = "".join(b.text for b in response.content if b.type == "text").strip()

    # Feed the identified product name AND the user's question into Volume 2's
    # existing pipeline: vision is a NEW INPUT METHOD, not a replacement for retrieval
    return answer(f"{question} (product: {product_name})", store_id)

// SwiftUI — capturing and encoding the image before sending
import UIKit

func encodeImageForRequest(_ image: UIImage) -> String? {
    guard let jpegData = image.jpegData(compressionQuality: 0.8) else { return nil }
    return jpegData.base64EncodedString()
}
// Send `encodeImageForRequest(capturedImage)` as the `data` field in the
// request body alongside the user's typed (or transcribed, Chapter 7) question.

Notice the architectural discipline here: vision is doing exactly one job — turning a photo into a product name/text description — after which it hands off to the exact same RAG pipeline built in Volume 2. This is a deliberately narrow scope; resist the temptation to have the vision-capable model try to also answer the location question directly from the image, since it has no actual knowledge of your store's specific aisle layout (Volume 1, Chapter 2's "generation isn't lookup" principle still applies, regardless of input modality).

Interview Q&A

Q: Why route the identified product name through the existing RAG pipeline rather than asking the vision model directly "where would this be in a grocery store"? A: The vision model can recognize what the product is, but has no actual knowledge of your specific store's specific aisle layout — answering directly from the image would produce a plausible-sounding but ungrounded guess (Volume 1, Ch.2/11), exactly the hallucination risk RAG exists to prevent. Vision should only ever handle identification; retrieval against your real data still has to handle location.

Q: A user photographs a product with packaging text in a language your catalog doesn't include. What's the most likely failure mode, and how would you address it? A: The vision model may correctly identify the product visually but produce a name/description that doesn't textually match how it's stored in your catalog (different language, different naming convention), causing the downstream retrieval step to miss it — addressing this likely means normalizing the identified name (e.g., translating or mapping to a canonical catalog name) before passing it into retrieval, rather than assuming visual identification and catalog matching are automatically the same problem.
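
The "map to a canonical catalog name" step can be sketched with fuzzy string matching from the standard library. The catalog names, the function name `to_catalog_name`, and the 0.6 cutoff are all hypothetical. String matching only covers naming-convention differences, so a name in another language would still need a translation step first.

```python
import difflib
from typing import Optional

# Hypothetical canonical catalog names.
CATALOG_NAMES = [
    "Value Brand Marinara Sauce",
    "Extra Virgin Olive Oil 500ml",
    "Canola Oil 1L",
]

def to_catalog_name(identified: str, cutoff: float = 0.6) -> Optional[str]:
    """Map a free-text product name to the closest catalog name, or None if nothing is close."""
    by_lower = {name.lower(): name for name in CATALOG_NAMES}
    matches = difflib.get_close_matches(
        identified.lower().strip(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None

match = to_catalog_name("extra virgin olive oil")
no_match = to_catalog_name("dish soap")
```

Returning `None` instead of the least-bad match matters here: an unmatched product should lead to a clarifying question for the user, not a confident answer about the wrong item.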

Exercise: Sketch the full SwiftUI-to-backend flow (in plain text, like earlier diagrams) for a user tapping a "scan product" button, taking a photo, and receiving an aisle location back — labeling each step with which chapter/volume's code it reuses.

Chapter 7 — Speech-to-Text Integration

SmartStore AI's own spec mentions an optional spoken response — the input side of voice (speech-to-text, STT) is the natural companion, and it's a place where your iOS background gives you a genuinely better-informed architectural choice than most backend-only AI engineers would make: on-device transcription vs. a cloud STT API is a real trade-off, not just an implementation detail.

On-device (iOS Speech framework):       Cloud (e.g. Whisper API):
+ No audio leaves the device             + Generally higher accuracy,
  (privacy)                                especially for accents/noise
+ Works offline                          + No on-device model/language
+ No per-request API cost                  limitations
- Accuracy varies by device/language     - Audio leaves the device — a
- Limited to languages Apple supports      privacy/governance consideration
                                            (Volume 4, Ch.11)
                                          - Network dependency, added latency,
                                            per-request cost

// On-device — iOS Speech framework
import Speech

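enum TranscriptionError: Error { case recognizerUnavailable }

func transcribeOnDevice(audioURL: URL) async throws -> String {
    // Assumes speech-recognition authorization has already been granted.
    guard let recognizer = SFSpeechRecognizer(), recognizer.isAvailable,
          recognizer.supportsOnDeviceRecognition else {
        throw TranscriptionError.recognizerUnavailable
    }
    let request = SFSpeechURLRecognitionRequest(url: audioURL)
    // Without this flag the framework may send audio to Apple's servers.
    request.requiresOnDeviceRecognition = true

    return try await withCheckedThrowingContinuation { continuation in
        var didResume = false  // the handler can fire more than once
        recognizer.recognitionTask(with: request) { result, error in
            guard !didResume else { return }
            if let error = error {
                didResume = true
                continuation.resume(throwing: error)
            } else if let result = result, result.isFinal {
                didResume = true
                continuation.resume(returning: result.bestTranscription.formattedString)
            }
        }
    }
}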

# Cloud — sent to backend, transcribed via a hosted STT API
import openai

def transcribe_cloud(audio_file_path: str) -> str:
    client = openai.OpenAI()
    with open(audio_file_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    return transcript.text  # check current API docs for the latest model name

For SmartStore AI specifically, on-device transcription is the more natural default: it keeps a customer's spoken query on their phone until it's already text (provided the recognizer is actually restricted to on-device recognition), avoids per-request transcription cost, and sends only a short text string rather than an audio upload over a store's spotty in-building network, which is exactly the scenario this feature is meant to serve. Cloud STT becomes worth considering only if on-device accuracy proves insufficient for your actual users' accents and environments, as measured against your eval suite (Volume 1, Ch.12) rather than assumed upfront.
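
A standard way to measure transcription accuracy is word error rate (WER): the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below is a plain-Python version for an eval script; the sample sentences are made up, and real evaluations usually also normalize punctuation before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)

wer = word_error_rate("where is the olive oil", "where is olive oil")
```

Here one of five reference words was dropped, giving a WER of 0.2. Running the same function over recordings grouped by accent or noise level turns "is on-device accuracy good enough?" into a number you can compare across both options.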

Interview Q&A

Q: Why is keeping voice transcription on-device a meaningfully better default for SmartStore AI specifically, compared to a generic enterprise AI product? A: A shopper's spoken query inside a physical store is plausibly low-bandwidth (in-building WiFi/cellular), latency-sensitive (they're standing in an aisle waiting for an answer), and the audio itself doesn't need to leave the device to be useful once transcribed — all three favor on-device processing specifically for this use case, even though a different product (e.g., a call-center transcription tool) might reasonably prefer cloud STT for its higher accuracy ceiling.

Q: If you measured on-device transcription accuracy and found it was meaningfully worse for users with certain accents, what would that imply for your evaluation strategy (Volume 1, Ch.12)? A: It implies your golden evaluation set needs to include voice samples across a representative range of accents/environments, not just clean, single-accent test audio — an eval set that doesn't reflect real usage diversity would miss exactly this kind of accuracy gap until real users hit it in production, which is a regression no CI pipeline (Volume 4, Ch.7) would catch without deliberately representative test data.

Exercise: Design a fallback strategy — if on-device transcription confidence is low (the Speech framework can expose confidence scores), should the app silently fall back to cloud STT, ask the user to repeat themselves, or something else? Justify your choice.

Chapter 8 — Text-to-Speech Integration

The actual feature named in SmartStore AI's spec — "optional spoken response" — is the output side of voice. The same on-device-vs-cloud trade-off from Chapter 7 applies here, with the calculus shifted slightly: TTS quality differences (how natural the voice sounds) tend to matter more to a satisfied user than STT accuracy differences do, since a stilted-but-correct transcription is invisible to the user, while a robotic-sounding spoken response is immediately, audibly noticeable.

On-device (AVSpeechSynthesizer):        Cloud (e.g. OpenAI TTS):
+ Instant, no network round-trip          + Noticeably more natural-sounding
+ Free, no per-request cost                 voices, currently
+ Works offline                           - Added latency (network round-trip
- Voice quality is more robotic-            before any audio starts playing)
  sounding, varies by iOS version          - Per-request cost

// On-device — instant, free, no network dependency
import AVFoundation

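// Keep the synthesizer alive beyond the call: a local instance can be
// deallocated before it finishes (or even starts) speaking.
let synthesizer = AVSpeechSynthesizer()

func speakOnDevice(_ text: String) {
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
    synthesizer.speak(utterance)
}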

# Cloud — backend generates audio, sent to the app for playback
import openai

def synthesize_cloud(text: str) -> bytes:
    client = openai.OpenAI()
    response = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
    return response.content  # raw audio bytes, streamed/played by the SwiftUI app

A sensible default for SmartStore AI: start with on-device AVSpeechSynthesizer for the MVP. It's free and instant, and "it's in aisle 7" doesn't need to sound expressive to be useful. Revisit cloud TTS only if user feedback (PostHog, Volume 4 Ch.9) specifically flags response voice quality as a friction point, rather than assuming the more "advanced" cloud option is automatically the better choice for this particular feature.

Interview Q&A

Q: Why might the "default to the simpler, on-device option, upgrade only if data justifies it" principle apply even more strongly to TTS than it did to STT in Chapter 7? A: For STT, transcription accuracy directly affects whether the downstream pipeline even understands the right question — a real correctness concern. For TTS, voice naturalness is a polish/satisfaction concern on top of an already-correct text answer; the actual information being conveyed is identical either way. That makes on-device TTS an even lower-risk default starting point, with cloud TTS purely an upgrade-for-experience decision, not a correctness one.

Q: A user is in a noisy store environment and the assistant's spoken response is hard to hear over ambient noise. Is this a TTS quality problem, and what would actually address it? A: Not primarily — this is more likely a playback/UX problem (volume, ducking other audio, lack of haptic/visual fallback) than a voice-naturalness problem; switching to a more "natural-sounding" cloud voice wouldn't meaningfully fix audibility in a loud environment. The right fix is more likely ensuring visual text is always shown alongside speech (so voice is a convenience, not the only channel) and handling playback volume/audio session configuration properly.

Exercise: Sketch the decision logic (in plain text, an if/else flow) for when SmartStore AI's app should actually trigger a spoken response at all, given it's explicitly described as "optional" — what should the on/off trigger be (a setting? a button press? always-on)?

Chapter 9 — Semantic Caching and Cost Optimization at Scale

Volume 4, Chapter 1's prototype-to-production checklist flagged cost tracking as a production concern. At real scale, the single highest-leverage cost optimization for a RAG/agent system is caching — but exact-match caching (hash the literal query string, return a cached answer on an identical repeat) misses an enormous number of effectively-duplicate queries that are phrased slightly differently.

Semantic caching instead checks whether a new query is embedding-similar to a recently answered one, and reuses that cached answer if the similarity exceeds a threshold — catching "where's the olive oil" and "where can I find olive oil" as the same cache hit, where exact-match caching would treat them as entirely unrelated requests.

New query arrives
        │
        ▼
Embed the query
        │
        ▼
Search a "recent answers" cache (itself
just a small Qdrant collection!) for a
past query above a similarity threshold
        │
        ▼
   Above threshold?           Below threshold?
        │                          │
        ▼                          ▼
  Return cached answer        Run full pipeline (Volume 2/3),
  (no LLM call needed —        cache the new query + answer
   fast, cheap)                 for future reuse

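import uuid

from qdrant_client.models import FieldCondition, Filter, MatchValue, PointStruct

# Assumes `qdrant` (a QdrantClient) and `embed` are the same helpers used by the
# retrieval code elsewhere, and that "query_cache" uses cosine distance, so a
# higher score means a closer match.

def check_semantic_cache(query: str, store_id: str, threshold: float = 0.95) -> str | None:
    query_vector = embed([query])[0]
    # Recent qdrant-client releases deprecate search() in favor of query_points();
    # check which one your installed version expects.
    results = qdrant.search(
        collection_name="query_cache",
        query_vector=query_vector,
        # Never reuse an answer cached for a different store
        query_filter=Filter(must=[FieldCondition(key="store_id", match=MatchValue(value=store_id))]),
        limit=1,
    )
    if results and results[0].score >= threshold:
        return results[0].payload["cached_answer"]
    return None

def cache_answer(query: str, answer: str, store_id: str):
    vector = embed([query])[0]
    qdrant.upsert(
        collection_name="query_cache",
        points=[PointStruct(id=str(uuid.uuid4()), vector=vector,
                             payload={"cached_answer": answer, "store_id": store_id})],
    )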

The similarity threshold (0.95 above) is the entire tuning knob here, and it's a genuine precision/cost trade-off: too low, and you'll serve a cached answer for a query that was actually meaningfully different (wrong answer, fast); too high, and you'll rarely get cache hits at all (always correct, but cache provides little benefit). This is exactly the kind of parameter to tune against your eval suite, not guess at.

Interview Q&A

Q: What's the specific risk of setting the semantic cache similarity threshold too low, beyond "occasionally wrong answers"? A: A wrong cached answer being served fails completely silently from the system's perspective — no error, no exception, just a confidently wrong response (the same hallucination-adjacent risk pattern from Volume 1, Ch.11, now introduced by your own caching layer rather than the model itself). This is exactly why threshold tuning needs to be validated against real eval data, not set by gut feeling, since the failure mode is invisible without deliberate testing.

Q: A store's product catalog updates (a price change, new stock) — what happens to previously cached answers, and what does this require you to add? A: Cached answers can go stale exactly the way Volume 2, Chapter 1 described un-grounded model knowledge going stale — the cache needs an invalidation strategy (a time-to-live expiry, or an explicit cache-clear triggered by the same ingestion events that trigger re-indexing, Volume 2 Ch.3) or it will confidently serve outdated information indefinitely.

Exercise: Design a cache invalidation policy for SmartStore AI's semantic cache — should it be a fixed TTL (e.g., 1 hour), tied to catalog update events, or some combination? Justify your choice.

Chapter 10 — Model Routing and Cascading

Not every question SmartStore AI's assistant receives needs the same model. "Where's the milk" and "compare the nutritional tradeoffs between these three cereal brands for someone managing diabetes" are not equally demanding — yet a naive system sends both to the same (likely most capable, most expensive) model every time.

Model routing classifies a query's complexity first and sends simple queries to a cheaper, faster model. Cascading is the closely related variant that tries the cheaper model first and escalates to a stronger one only when the cheaper model's confidence is low. Either way, the expensive model is reserved for queries that are genuinely complex.

Query arrives
        │
        ▼
Cheap, fast classifier (could even be a
small model, or simple heuristics) judges
complexity
        │
   ┌────┴────┐
   ▼          ▼
Simple      Complex
   │          │
   ▼          ▼
Cheap/fast   Stronger/more capable
model        model
   │          │
   └────┬─────┘
        ▼
   Final answer

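def classify_complexity(question: str) -> str:
    # Cheap heuristic only: anything that doesn't match a known-simple pattern
    # is treated as complex. A small classifier model can be added later for
    # the ambiguous middle.
    simple_patterns = ["where's", "where is", "what aisle", "store hours"]
    # iOS keyboards insert a curly apostrophe by default, so normalize it first
    normalized = question.lower().replace("\u2019", "'")
    if any(p in normalized for p in simple_patterns):
        return "simple"
    return "complex"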

def route_query(question: str, store_id: str) -> str:
    complexity = classify_complexity(question)
    model = "claude-haiku-4-5" if complexity == "simple" else "claude-sonnet-4-6"
    # check current model names/pricing tiers in provider docs — these change
    return run_agent(question, store_id=store_id, model=model)

The cheap heuristic-first approach matters: running an LLM call just to decide which LLM to use defeats much of the cost savings if the classification itself is expensive. Simple keyword/pattern heuristics handle the obviously-simple cases for free; only genuinely ambiguous cases need an actual (small, cheap) classifier model call.
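The "heuristics first, classifier only for the ambiguous middle" idea can be sketched as a three-way split. This is an illustration under assumptions: COMPLEX_PATTERNS is an invented list, classify_with_fallback is a hypothetical name, and llm_classifier is any callable you supply that wraps a small model and returns "simple" or "complex".

```python
from typing import Callable

SIMPLE_PATTERNS = ["where's", "where is", "what aisle", "store hours"]
COMPLEX_PATTERNS = ["compare", "tradeoff", "versus", "recommend"]  # illustrative only


def classify_with_fallback(
    question: str,
    llm_classifier: Callable[[str], str],
) -> str:
    """Heuristics first; call the classifier model only for ambiguous cases."""
    q = question.lower()
    looks_simple = any(p in q for p in SIMPLE_PATTERNS)
    looks_complex = any(p in q for p in COMPLEX_PATTERNS)

    if looks_simple and not looks_complex:
        return "simple"
    if looks_complex and not looks_simple:
        return "complex"
    # Either both pattern lists matched or neither did: pay for one classifier call.
    label = llm_classifier(question)
    # Fail safe: an unexpected label is routed to the stronger model.
    return label if label in ("simple", "complex") else "complex"
```

Defaulting to "complex" on an unexpected label trades a little cost for safety, since an over-routed query is only more expensive while an under-routed one may be wrong.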

Interview Q&A

Q: Why use simple keyword heuristics for complexity classification before reaching for a model-based classifier, rather than just always using a small classifier model? A: A model call, even a small/cheap one, still adds latency and (marginal) cost to every single request — if a meaningful fraction of queries are obviously simple by pattern alone ("where's X"), handling those for free with heuristics before ever calling a classifier model captures most of the savings with none of the added latency, reserving the classifier model only for genuinely ambiguous cases the heuristic can't confidently sort.

Q: What's the risk of a routing system that misclassifies a genuinely complex question as "simple," beyond just a worse answer? A: The cheaper/weaker model may answer confidently but less accurately on a question it wasn't well-suited for, and because routing happens invisibly to the user, they have no signal that they got the "budget" answer — this is a quality regression that's specifically hard to detect without monitoring outcome quality (Volume 4, Ch.9) segmented by which model actually handled each request.
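Segmenting outcome quality by model, as the answer above suggests, can be as simple as grouping logged outcome events. The sketch assumes each event is a dict with a "model" field and a boolean "rephrased" field; both field names are hypothetical and would need to match whatever your analytics events actually record.

```python
from collections import defaultdict


def rephrase_rate_by_model(events: list[dict]) -> dict[str, float]:
    """Group outcome events by the model that served them and compute the
    fraction of requests the user had to rephrase."""
    totals: dict[str, int] = defaultdict(int)
    rephrased: dict[str, int] = defaultdict(int)
    for e in events:
        totals[e["model"]] += 1
        if e["rephrased"]:
            rephrased[e["model"]] += 1
    return {m: rephrased[m] / totals[m] for m in totals}
```

A rephrase rate on the cheaper model that is clearly higher than on the stronger one is a hint that the router is sending it questions it cannot handle.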

Exercise: Add one more pattern to the simple_patterns heuristic list that would correctly route a genuinely simple SmartStore AI question, and one example question that looks simple by keyword pattern but is actually complex enough to deserve escalation — explain the mismatch.

Chapter 11 — Prompt Versioning and A/B Testing in Production

Volume 4, Chapter 7's CI pipeline gates deploys on eval scores — but that only catches regressions before shipping. Once multiple prompt variants exist (the current production prompt, and a candidate improvement), A/B testing answers a different question: which one actually performs better with real users, not just on your golden eval set.

Prompt v1 (current production)         Prompt v2 (candidate)
        │                                      │
        └──────────────┬───────────────────────┘
                        ▼
              Split real traffic
              (e.g. 90% v1, 10% v2)
                        │
              ┌─────────┴─────────┐
              ▼                   ▼
        Track outcomes        Track outcomes
        (PostHog events,      (PostHog events,
        Volume 4 Ch.9)        Volume 4 Ch.9)
              │                   │
              └─────────┬─────────┘
                        ▼
              Compare: which version had
              lower rephrase rate / higher
              satisfaction signal?
                        │
                        ▼
              Winner becomes the new v1;
              gradually increase its traffic share

import hashlib

PROMPTS = {
    "v1": "You are a retail navigation assistant. Answer concisely using only the provided context.",
    "v2": "You are a friendly, helpful retail assistant. Answer using only the provided context, and suggest one related product if relevant.",
}

def get_prompt_variant(user_id: str) -> tuple[str, str]:
    # Deterministic per-user assignment: the same user always gets the same
    # variant within an experiment, rather than a different one every request.
    # hashlib is used because Python's built-in hash() is salted per process
    # for strings, so it would reshuffle users on every restart.
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 10
    variant = "v2" if bucket == 0 else "v1"  # ~10% to v2
    return variant, PROMPTS[variant]

def answer_with_experiment(question: str, store_id: str, user_id: str) -> str:
    variant, system_prompt = get_prompt_variant(user_id)
    result = run_agent_with_system(question, store_id, system_prompt)
    track_query_outcome(user_id, question, answer_found=bool(result), store_id=store_id)
    posthog.capture(distinct_id=user_id, event="prompt_variant_used", properties={"variant": variant})
    return result

Treating prompts as versioned, tracked artifacts — not values edited in place in your codebase whenever someone has an idea — is what makes this whole chapter possible. A prompt change without versioning is a silent, unmeasured experiment running on every user simultaneously; A/B testing makes that experiment explicit, measured, and reversible.
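"Measured" also means checking that a difference between variants is larger than chance would explain, which matters when v2 only sees about 10% of traffic. The source does not name a statistical test; the sketch below uses a standard two-proportion z-test as one reasonable choice, with a hypothetical function name, applied to counts such as "rephrased requests out of total requests" per variant.

```python
import math


def two_proportion_z_test(bad_a: int, n_a: int, bad_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two rates,
    e.g. rephrase rate under v1 (a) versus v2 (b)."""
    if n_a == 0 or n_b == 0:
        raise ValueError("both variants need at least one observation")
    p_a, p_b = bad_a / n_a, bad_b / n_b
    pooled = (bad_a + bad_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both rates are identical at 0% or 100%
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

A small p-value says the gap is unlikely to be noise; it does not say the gap is large enough to matter, so read it alongside the actual rates.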

Interview Q&A

Q: Why assign variants deterministically per-user (hashing user_id into a fixed bucket) rather than randomly on every request? A: A user getting a different prompt variant on every single request makes their experience inconsistent (different tone/behavior turn to turn) and makes outcome attribution meaningless: you couldn't tell which variant caused a given rephrase/satisfaction signal if the variant changes mid-conversation. Deterministic per-user assignment keeps each user's experience consistent for the duration of the experiment and makes outcome data interpretable. The hash itself has to be stable across processes for this to hold; Python's built-in hash() is salted per process for strings, so a hashlib digest is the safer basis for bucketing.

Q: Your A/B test shows prompt v2 has a lower rephrase rate but costs noticeably more tokens per response (the "suggest a related product" addition). How do you decide whether to roll it out fully? A: This becomes an explicit cost-vs-quality trade-off decision, not a purely technical one — quantify both sides (token cost increase vs. the magnitude and business value of the rephrase-rate improvement) and make a deliberate call, rather than assuming "better outcome metric" automatically justifies any cost increase, or that "cheaper" automatically wins regardless of quality difference.

Exercise: Design one more PostHog event you'd want tracked specifically to compare prompt v1 vs. v2's performance, beyond rephrase rate — something that would catch a quality difference rephrase rate alone might miss.

Chapter 12 — Advanced Evaluation: Regression Testing and Continuous Eval

Volume 1, Chapter 12 introduced golden-dataset evaluation; Volume 4, Chapter 7 wired it into CI. This chapter closes the loop: a golden set tested only at deploy time misses drift that happens between deploys — catalog changes, shifting query patterns, a vector index that's quietly degraded. Continuous evaluation runs the same kind of checks on an ongoing basis against live production behavior, not just at the moment code changes.

CI-time evaluation (Volume 4, Ch.7):    Continuous evaluation (this chapter):
Runs once, when code changes             Runs on a schedule (e.g. hourly/daily),
                                          AND on a sample of real production
                                          traffic
        │                                         │
        ▼                                         ▼
Catches regressions FROM A CODE          Catches DRIFT — degradation that
CHANGE, before it ships                  happens with no code change at all
                                          (catalog drift, query pattern shift,
                                          index degradation)

import datetime

def run_continuous_eval(golden_questions: list[dict], baseline_score: float, alert_threshold: float = 0.1):
    if not golden_questions:
        raise ValueError("golden_questions is empty; nothing to evaluate")

    results = []
    for item in golden_questions:
        answer = run_agent(item["question"], store_id=item["store_id"])
        score = score_against_expected(answer, item["expected_answer"])  # e.g. LLM-as-judge, Volume 1 Ch.12
        results.append(score)

    current_score = sum(results) / len(results)
    regression = baseline_score - current_score

    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated as of Python 3.12)
    log_eval_run(timestamp=datetime.datetime.now(datetime.timezone.utc), score=current_score)  # feeds a Grafana dashboard, Volume 4 Ch.8

    if regression > alert_threshold:
        # This job cannot see deploy history itself, so the message only reports the drop
        alert_team(f"Eval score dropped from {baseline_score:.2f} to {current_score:.2f} on a scheduled run; check for recent deploys")

    return current_score

Run on a schedule (a cron job, or a scheduled CI pipeline run independent of any deploy), this turns evaluation from a one-time pre-deploy gate into an always-on signal — the same philosophy as Volume 4, Chapter 9's product analytics, but using your rigorous golden-set methodology instead of indirect proxy signals like rephrase rate, catching a different and complementary class of problem.
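The diagram above also mentions scoring a sample of real production traffic, which the scheduled golden-set job does not cover. One way to choose that sample is deterministic hashing of a request identifier, sketched below. The function name, the request_id format, and the 1% default are all assumptions, and how sampled requests get scored (they have no expected answer) is left out.

```python
import hashlib


def should_sample_for_eval(request_id: str, sample_rate: float = 0.01) -> bool:
    """Deterministically select roughly `sample_rate` of requests for scoring."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the digest to a value in [0, 1)
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Because the decision depends only on the request ID, a retried or replayed request lands in the same bucket, which keeps the sample reproducible when you investigate later.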

Interview Q&A

Q: Why is "no recent deploy, but eval scores dropped" specifically the failure mode continuous evaluation is designed to catch, that CI-gated evaluation (Volume 4, Ch.7) structurally cannot? A: CI-gated evaluation only ever runs in response to a code change — by definition, it has no opportunity to detect a regression that occurs without one (a catalog update breaking ingestion, Volume 2 Ch.3's triggers silently failing, an index slowly degrading). Continuous evaluation runs independently of deploys specifically to catch this class of drift, which is a real and common failure mode in production RAG systems.

Q: Why log every eval run's score over time (feeding a dashboard) rather than just alerting on threshold breaches? A: A single eval run's score can be noisy (model non-determinism, Volume 1 Ch.10); a time-series view lets you distinguish a genuine sustained degradation trend from normal run-to-run variance, and gives you the historical context to actually investigate when something does cross the alert threshold — an alert with no historical trend to compare against is much harder to triage.
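The logged history is what lets you separate a sustained drop from a single noisy run. A minimal sketch, assuming scores is the chronological list of logged eval scores; the function name and the three-run window are illustrative choices, not something the source prescribes.

```python
def is_sustained_regression(
    scores: list[float],
    baseline_score: float,
    alert_threshold: float = 0.1,
    window: int = 3,
) -> bool:
    """True only if EVERY one of the last `window` runs is below
    baseline by more than `alert_threshold`."""
    if len(scores) < window:
        return False  # not enough history to call it a trend
    recent = scores[-window:]
    return all(baseline_score - s > alert_threshold for s in recent)
```

Requiring several consecutive bad runs delays the alert by that many intervals, so the window size is itself a trade-off between noise and detection speed.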

Exercise: Your continuous eval job runs hourly and just fired an alert. List, in order, the first three things you'd check to find the root cause, drawing on specific chapters from this entire bootcamp.

Chapter 13 — Hands-On: Adding Voice + Image Input to SmartStore AI

This combines Chapters 6-8 into one real flow — a user scans a product and/or speaks a question, gets an answer, and optionally hears it spoken back, all routed through the exact agent and RAG pipeline built across Volumes 2-3.

# ── multimodal_query.py ──────────────────────────────────────────────
from agent import run_agent  # Volume 3, Chapter 13

def handle_multimodal_query(
    store_id: str,
    text_question: str | None = None,
    image_path: str | None = None,
    audio_path: str | None = None,
) -> dict:
    # Step 1: resolve input into a text question, regardless of modality
    if audio_path:
        text_question = transcribe_cloud(audio_path)  # Chapter 7, or the on-device
                                                          # equivalent, called from the app
    if image_path:
        product_name = identify_product_from_image(image_path)  # Chapter 6
        text_question = f"{text_question or ''} (product shown: {product_name})".strip()

    if not text_question:
        return {"answer": None, "error": "No question provided in any modality."}

    # Step 2: EXACTLY the same pipeline as every prior text-only example —
    # this is the whole point. Multimodal input collapses into the same
    # downstream system, it doesn't require a parallel one.
    answer_text = run_agent(text_question, store_id=store_id)

    return {
        "answer": answer_text,
        "audio_response": None,  # populated client-side via AVSpeechSynthesizer (Ch.8),
                                  # or here via synthesize_cloud() if using cloud TTS
    }

// SwiftUI — orchestrating capture, request, and optional spoken playback
func askWithVoiceOrPhoto(audioURL: URL?, image: UIImage?, storeId: String) async {
    var requestBody: [String: Any] = ["store_id": storeId]

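    // Only set the key when transcription succeeds; storing an Optional in
    // [String: Any] would put a wrapped nil into the request body.
    if let audioURL, let transcript = try? await transcribeOnDevice(audioURL: audioURL) {  // Ch.7
        requestBody["transcribed_text"] = transcript
    }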
    if let image, let encoded = encodeImageForRequest(image) {  // Ch.6
        requestBody["image_data"] = encoded
    }

    let response = try? await sendMultimodalRequest(requestBody)  // POSTs to handle_multimodal_query
    if let answer = response?.answer {
        speakOnDevice(answer)  // Chapter 8 — only if the user has voice responses enabled
    }
}

Notice what did not need to change: run_agent from Volume 3, the underlying RAG pipeline from Volume 2, and every guardrail from Volume 4 (auth, RBAC, RLS) all apply identically regardless of which modality the question arrived in. Voice and vision are input/output adapters sitting at the edges of a system that was already complete — which is exactly the payoff of having built the core architecture (Volumes 2-4) as cleanly separated layers in the first place.

Exercise (the real one): Implement the missing identify_product_from_image and transcribe_cloud function bodies using Chapter 6 and Chapter 7's code, wire them into handle_multimodal_query, and test it with a real product photo and a real recorded question — this is your first genuinely multimodal feature working end to end.

Appendix A — Glossary

Term	Meaning
Query rewriting	Reformulating a question before retrieval to improve recall
HyDE	Hypothetical Document Embeddings — embedding a generated hypothetical answer instead of the raw question
Multi-hop retrieval	Chaining multiple retrieval steps, each informed by the previous result
GraphRAG	Combining vector search with explicit graph/relationship traversal
LoRA / PEFT	Parameter-efficient fine-tuning — training small adapter weights instead of the full model
Semantic caching	Caching based on embedding similarity to past queries, not exact string match
Model routing/cascading	Sending simple queries to a cheaper model, escalating only when needed
A/B testing (prompts)	Comparing prompt variants against real production outcomes, not just eval sets
Continuous evaluation	Running eval checks on a schedule against live behavior, not just at deploy time

Appendix B — Chapter Summary Table

#	Chapter	Core takeaway
1	Query rewriting	Reformulate before embedding — raw phrasing isn't always retrieval-optimal
2	Multi-hop retrieval	Agentic tool-calling applied to retrieval itself, for dependent information needs
3	GraphRAG	Explicit relationships for substitute/association queries vector search can't reliably surface
4	Fine-tuning fundamentals	The last resort after RAG and prompting, with LoRA as the practical default
5	Fine-tuning datasets	Production logs, filtered for quality, are the realistic data source
6	Vision	Identification only — hands off to the existing RAG pipeline, never answers location directly
7	Speech-to-text	On-device is the better default for SmartStore AI's specific usage context
8	Text-to-speech	Voice naturalness is a polish concern, not a correctness one — start on-device
9	Semantic caching	Catches near-duplicate queries exact-match caching misses; threshold is a real tuning decision
10	Model routing	Cheap heuristics first, escalate only genuinely ambiguous/complex queries
11	Prompt A/B testing	Versioned, deterministically-assigned variants measured against real outcomes
12	Continuous evaluation	Catches drift between deploys that CI-gated evaluation structurally can't see
13	Hands-on multimodal	Voice and vision are adapters at the edge — the core pipeline never changes

Next: Volume 7 — The Job Transition Toolkit (system-design interview prep for AI roles, packaging SmartStore AI as a portfolio piece, and a behavioral/technical question bank for the iOS-to-AI-engineer pivot).