Volume 5 — End-to-End AI Projects

A working reference for the transition from Senior iOS Developer to AI Engineer


How to use this volume

This is the capstone. Volumes 1-4 covered concepts; this volume is six complete builds. Projects 1-5 each introduce at least one technique not covered earlier, with real architecture, real code, and an extension exercise. The sixth is the one that matters most to you specifically: it introduces no new technique, but gives the full picture of SmartStore AI's actual architecture, assembled from every preceding volume.

Contents

  1. Project: ChatGPT Clone (Streaming, History, Markdown)
  2. Project: PDF Q&A Assistant (Upload, Citations)
  3. Project: Document Summarizer (Map-Reduce for Long Documents)
  4. Project: Personal AI Agent (Tools Beyond Retail)
  5. Project: Multi-Agent System (Planner → Researcher → Writer → Reviewer)
  6. Capstone: The Enterprise AI Platform: SmartStore AI, Fully Assembled

Appendix A: Glossary
Appendix B: Project Summary Table
Appendix C: Where to Go From Here


Project 1 — ChatGPT Clone (Streaming, History, Markdown)

What this teaches beyond Volumes 1-4: every prior code example called the model and waited for the complete response. Real chat apps stream tokens as they're generated, so the user sees text appear progressively instead of staring at a blank screen for several seconds.

SwiftUI App                    FastAPI Backend                  Claude API
     │                              │                                │
     │──POST /chat (question)─────▶│                                │
     │                              │──messages.stream(...)─────────▶│
     │                              │◀──token "The"───────────────────│
     │◀──SSE: data: "The"───────────│                                │
     │                              │◀──token " olive"────────────────│
     │◀──SSE: data: " olive"────────│                                │
     │           ... (continues streaming until done) ...            │

Backend — streaming endpoint

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
client = anthropic.Anthropic()

def stream_response(messages: list):
    with client.messages.stream(
        model="claude-sonnet-4-6", max_tokens=1000, messages=messages
    ) as stream:
        for text in stream.text_stream:  # check current SDK docs for exact attribute name
            yield f"data: {text}\n\n"
    yield "data: [DONE]\n\n"

@app.post("/chat")
async def chat(question: str, session_id: str):
    history = load_session_state(session_id)  # Volume 3, Chapter 4 pattern
    history.append({"role": "user", "content": question})
    return StreamingResponse(stream_response(history), media_type="text/event-stream")
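
One caveat with the framing above: each chunk is written raw after "data: ", and SSE treats a blank line as the end of an event, so a chunk that itself contains a newline would be split or truncated on the client. A minimal sketch of one way around this, assuming you are free to change the wire format, is to JSON-encode each chunk. The helper names sse_frame and sse_parse are hypothetical, and the SwiftUI client would need to JSON-decode each payload to match.

```python
import json

DONE_MARKER = "[DONE]"  # same sentinel the endpoint above sends


def sse_frame(chunk: str) -> str:
    """Wrap one text chunk as a single SSE event.

    json.dumps escapes any newline inside the chunk, so the payload
    always stays on one physical line.
    """
    return f"data: {json.dumps(chunk)}\n\n"


def sse_parse(line: str):
    """Decode one 'data: ...' line back into text.

    Returns None for non-data lines and for the end-of-stream marker.
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == DONE_MARKER:
        return None
    return json.loads(payload)


framed = sse_frame("line one\nline two")
decoded = sse_parse(framed.strip())
```

With this framing, a chunk containing a line break survives the round trip intact, at the cost of one json.loads (or JSONDecoder call in Swift) per event.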

SwiftUI — consuming the stream

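import Foundation

func streamChat(question: String, sessionId: String) async throws -> AsyncThrowingStream<String, Error> {
    // Build the URL with URLComponents so the question is percent-encoded;
    // interpolating raw user text into the URL string can make URL(string:) return nil.
    var components = URLComponents(string: "https://your-api/chat")!
    components.queryItems = [
        URLQueryItem(name: "question", value: question),
        URLQueryItem(name: "session_id", value: sessionId),
    ]
    var request = URLRequest(url: components.url!)
    request.httpMethod = "POST"

    let (bytes, _) = try await URLSession.shared.bytes(for: request)

    return AsyncThrowingStream { continuation in
        let task = Task {
            do {
                for try await line in bytes.lines {
                    guard line.hasPrefix("data: ") else { continue }
                    let chunk = String(line.dropFirst(6))
                    if chunk == "[DONE]" { break }
                    continuation.yield(chunk)
                }
                // Finish whether the server sent [DONE] or simply closed the connection.
                continuation.finish()
            } catch {
                // Surface network errors to the caller instead of leaving the stream hanging.
                continuation.finish(throwing: error)
            }
        }
        // Stop reading from the network if the consumer stops iterating.
        continuation.onTermination = { _ in task.cancel() }
    }
}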

// In the view: append each yielded chunk to a @State string as it arrives,
// so the text visibly grows token-by-token, exactly like ChatGPT's UI.

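Conversation history (Volume 3, Chapter 4's Redis pattern) persists between turns. Note that the endpoint above only appends the user's question; the accumulated assistant reply also has to be written back to the session once the stream finishes, or the next turn will be missing it. Markdown rendering of the final accumulated text (code blocks, lists, bold) is handled entirely client-side once the full text is available, using AttributedString or a markdown-rendering library on the SwiftUI side; no backend change is needed.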

Interview Q&A

Q: Why does streaming improve perceived performance even though the total time to generate the full response is unchanged? A: Total generation time is the same, but the user sees the first tokens within a few hundred milliseconds instead of staring at nothing until the entire response is ready — perceived latency (time to first visible feedback) drops dramatically even though actual total latency doesn't change at all.

Q: What happens to your CI eval suite (Volume 4, Chapter 7) when the endpoint under test now streams instead of returning a single response? A: The eval script needs to consume the full SSE stream and concatenate it into the complete text before comparing against the golden answer — evaluation logic doesn't care whether the response arrived in one piece or streamed, but the test harness does need to handle the different response shape.
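
As a rough illustration of that harness change, the sketch below concatenates an SSE stream back into one string before any comparison happens. It assumes the "data: ..." and "[DONE]" framing shown in the backend above; the function name collect_sse_text is hypothetical, and in a real eval script the lines would come from your HTTP client's streaming line iterator rather than a hardcoded list.

```python
from typing import Iterable


def collect_sse_text(lines: Iterable[str]) -> str:
    """Concatenate the payloads of 'data: ' lines until the [DONE] marker."""
    parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data: "):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        parts.append(payload)
    return "".join(parts)


# Simulated stream, shaped like the Project 1 endpoint's output.
fake_stream = [
    "data: The", "",
    "data:  olive", "",
    "data:  oil is in aisle 4.", "",
    "data: [DONE]", "",
]
full_text = collect_sse_text(fake_stream)  # "The olive oil is in aisle 4."
```

The existing golden-answer comparison can then run on full_text unchanged.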

Exercise: Add a "stop generating" button — sketch (in plain text) what would need to happen on both the SwiftUI and FastAPI sides to actually cancel an in-progress stream rather than just hiding it in the UI.


Project 2 — PDF Q&A Assistant (Upload, Citations)

What this teaches beyond Volumes 1-4: SmartStore AI's RAG pipeline (Volume 2) ingests a known, pre-existing product catalog. This project handles the opposite case — a user uploads an arbitrary document at request time, which must be parsed, chunked, embedded, and made queryable immediately, scoped to that specific upload.

User uploads PDF
        │
        ▼
Extract text per page (pypdf)
        │
        ▼
Chunk each page's text (Volume 2, Ch.4)
        │
        ▼
Embed chunks, store in Qdrant with
metadata: {document_id, page_number}
        │
        ▼
User asks a question
        │
        ▼
Vector search filtered by document_id ──▶ retrieved chunks (each with page_number)
        │
        ▼
Grounded answer, WITH page citation
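   "...found in your document (page 7)"

# embed, chunk_text, qdrant, and client are assumed to be set up as in the earlier volumes.
from pypdf import PdfReader
from qdrant_client.models import FieldCondition, Filter, MatchValue, PointStruct
import uuid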

def ingest_pdf(file_path: str) -> str:
    document_id = str(uuid.uuid4())
    reader = PdfReader(file_path)

    all_chunks = []
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text()
        if not text.strip():
            continue
        for chunk in chunk_text(text, chunk_size=300, overlap=30):  # Volume 2, Ch.4
            all_chunks.append({"text": chunk, "page": page_num, "document_id": document_id})

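    if not all_chunks:
        # No extractable text (for example, a scanned PDF with no text layer).
        raise ValueError("No extractable text found in this PDF")

    vectors = embed([c["text"] for c in all_chunks])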
    qdrant.upsert(
        collection_name="pdf_documents",
        points=[
            PointStruct(
                id=str(uuid.uuid4()),
                vector=vec,
                payload={"text": c["text"], "page": c["page"], "document_id": c["document_id"]},
            )
            for c, vec in zip(all_chunks, vectors)
        ],
    )
    return document_id

def query_pdf(question: str, document_id: str, limit: int = 3) -> str:
    query_vector = embed([question])[0]
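    # Newer qdrant-client releases deprecate search() in favor of query_points();
    # check the docs for your installed version.
    results = qdrant.search(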
        collection_name="pdf_documents",
        query_vector=query_vector,
        query_filter=Filter(must=[FieldCondition(key="document_id", match=MatchValue(value=document_id))]),
        limit=limit,
    )
    context = "\n".join(f"[Page {r.payload['page']}]: {r.payload['text']}" for r in results)

    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=500,
        system="Answer using ONLY the provided context. Cite the page number(s) your answer comes from.",
        messages=[{"role": "user", "content": f"<context>\n{context}\n</context>\n\nQuestion: {question}"}],
    )
    return "".join(b.text for b in response.content if b.type == "text")

Notice document_id filtering here plays the exact same architectural role as store_id filtering in SmartStore AI's RAG pipeline (Volume 2, Chapter 8) — the pattern is identical: scope retrieval to the relevant subset before ranking by similarity. The only thing that changed is what the subset means.
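
To make "scope first, then rank" concrete, here is a toy in-memory version of the same idea. It is only a sketch: Qdrant applies the filter inside its index rather than scanning a list, and the names scoped_search and cosine, along with the sample points, are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scoped_search(points, query_vector, scope_key, scope_value, limit=3):
    """Filter to one scope first, then rank only that subset by similarity."""
    in_scope = [p for p in points if p["payload"].get(scope_key) == scope_value]
    ranked = sorted(in_scope, key=lambda p: cosine(p["vector"], query_vector), reverse=True)
    return ranked[:limit]


points = [
    {"vector": [1.0, 0.0], "payload": {"document_id": "doc-a", "page": 7}},
    {"vector": [0.9, 0.1], "payload": {"document_id": "doc-b", "page": 2}},
    {"vector": [0.0, 1.0], "payload": {"document_id": "doc-a", "page": 3}},
]
hits = scoped_search(points, [1.0, 0.0], "document_id", "doc-a")
pages = [h["payload"]["page"] for h in hits]  # [7, 3]
```

The doc-b point is very similar to the query but never appears in the results, because it was excluded before ranking. Swapping "document_id" for "store_id" gives the SmartStore case with no other change.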

Interview Q&A

Q: Why include page as metadata rather than just embedding the page number into the chunk's text itself? A: Keeping it as structured metadata lets you filter and cite programmatically (e.g., deduplicating or sorting citations by page) without depending on the model correctly parsing a number out of free text every time — the same reasoning as keeping store_id structured in Volume 2 rather than relying on the model to extract it from prose.
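
A small sketch of what "cite programmatically" can look like, assuming the retrieved payloads carry a "page" key as in ingest_pdf above. The helper names cited_pages and format_citation are hypothetical.

```python
def cited_pages(payloads):
    """Return the distinct page numbers from retrieved payloads, in ascending order."""
    return sorted({p["page"] for p in payloads})


def format_citation(pages):
    """Render a short human-readable citation from a list of page numbers."""
    if not pages:
        return "no pages cited"
    label = "page" if len(pages) == 1 else "pages"
    return f"{label} " + ", ".join(str(n) for n in pages)


retrieved = [
    {"page": 7, "text": "first chunk"},
    {"page": 2, "text": "second chunk"},
    {"page": 7, "text": "third chunk"},  # same page retrieved twice
]
pages = cited_pages(retrieved)       # [2, 7]
citation = format_citation(pages)    # "pages 2, 7"
```

None of this depends on the model's wording, which is the point of keeping the page number out of the free text.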

Q: A scanned PDF (an image of text, not real selectable text) is uploaded. What breaks in this pipeline, and what would you add to fix it? A: page.extract_text() returns empty or garbage for scanned image pages since there's no embedded text layer to extract — you'd need an OCR step (optical character recognition) before chunking, to first convert the scanned image into actual text, which this pipeline doesn't currently include.

Exercise: Extend query_pdf to return the actual list of cited pages as structured data (not just embedded in the answer text) alongside the answer — useful for a SwiftUI UI that wants to show clickable "jump to page 7" citation links.


Project 3 — Document Summarizer (Map-Reduce for Long Documents)

What this teaches beyond Volumes 1-4: summarizing a short document is one prompt. Summarizing a document longer than the model's context window (Volume 1, Chapter 7) requires a different technique entirely — map-reduce summarization.

Long document (too big for one context window)
        │
        ▼
Split into chunks (Volume 2, Ch.4)
        │
        ▼
MAP: summarize EACH chunk independently
   chunk 1 → summary 1
   chunk 2 → summary 2
   chunk 3 → summary 3
   ...
        │
        ▼
REDUCE: combine all chunk-summaries into
        one final summary
        │
        ▼
Still too long for one context window?
   → recursively reduce again (reduce the
     reduced summaries, repeat until it fits)
        │
        ▼
Final summary

def summarize_chunk(chunk: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=200,
        system="Summarize this section concisely, preserving key facts and figures.",
        messages=[{"role": "user", "content": chunk}],
    )
    return "".join(b.text for b in response.content if b.type == "text")

def reduce_summaries(summaries: list[str]) -> str:
    combined = "\n\n".join(summaries)
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=400,
        system="Combine these section summaries into one coherent overall summary.",
        messages=[{"role": "user", "content": combined}],
    )
    return "".join(b.text for b in response.content if b.type == "text")

def map_reduce_summarize(document: str, chunk_size: int = 2000, group_size: int = 5) -> str:
    chunks = chunk_text(document, chunk_size=chunk_size, overlap=0)
    summaries = [summarize_chunk(c) for c in chunks]

    # Still too big? Reduce in groups, and repeat until the remaining summaries
    # fit in one final reduce call. Character count is used here as a rough
    # stand-in for a real token budget.
    while len(summaries) > group_size and sum(len(s) for s in summaries) > chunk_size:
        grouped = [summaries[i:i + group_size] for i in range(0, len(summaries), group_size)]
        summaries = [reduce_summaries(group) for group in grouped]

    return reduce_summaries(summaries)

The recursive reduce step matters for genuinely long documents (a 500-page report might produce 200 chunk-summaries — still too much to combine in a single reduce call) — this is the same "lost in the middle" concern from Volume 1, Chapter 7 applying recursively, not just at the original document level.
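
To put numbers on that, the sketch below works out how many summaries remain after each pass for the 200-summary example. It assumes groups of 5 and that grouping repeats until at most one group is left, as the diagram describes; reduce_plan is a hypothetical helper that makes no model calls and only does the arithmetic.

```python
import math


def reduce_plan(num_summaries: int, group_size: int = 5) -> list[int]:
    """Number of summaries remaining after each reduce pass, ending at 1."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 or the count never shrinks")
    counts = [num_summaries]
    # Grouped passes: each pass turns every group of group_size into one summary.
    while counts[-1] > group_size:
        counts.append(math.ceil(counts[-1] / group_size))
    # Final pass: one call combines whatever is left.
    if counts[-1] > 1:
        counts.append(1)
    return counts


plan = reduce_plan(200)              # [200, 40, 8, 2, 1]
total_reduce_calls = sum(plan[1:])   # 51
```

So 200 chunk-summaries cost 51 reduce calls on top of the 200 map calls, spread over four passes. That is useful to know before estimating latency and API spend.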

Interview Q&A

Q: Why summarize each chunk independently in the map step, rather than summarizing the whole document in one pass once it's chunked? A: The whole point of chunking here is that the full document doesn't fit in one context window in the first place (Volume 1, Ch.7) — independent per-chunk summarization is exactly what makes it possible to process a document of any length, since each summarization call only ever needs to fit one chunk, not the whole source.

Q: What's lost by summarizing chunks independently, compared to a model that could somehow read the entire document at once? A: Cross-chunk relationships and connections that span chunk boundaries (a fact established early in the document that meaningfully reframes a later section) can be missed, since each chunk is summarized with no visibility into the others until the reduce step — this is a real, inherent limitation of map-reduce summarization, not a bug to be fully eliminated, only mitigated (e.g., via chunk overlap, Volume 2 Ch.4).
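
The overlap mitigation mentioned above can be sketched as follows. This is a hypothetical word-based stand-in, not Volume 2's chunk_text, which may count tokens or characters instead; it only shows how neighboring chunks end up sharing their boundary words.

```python
def chunk_words_with_overlap(text: str, chunk_size: int = 300, overlap: int = 30) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Each chunk repeats the last `overlap` words of the previous one, so a
    sentence that straddles a boundary appears whole in at least one chunk
    (as long as it is shorter than the overlap).
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words_with_overlap(sample, chunk_size=4, overlap=1)
# ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"]
```

Overlap only helps with relationships that sit close to a boundary. A fact on page 3 that reframes page 300 is still invisible to the map step.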

Exercise: Add an "extractive highlights" mode to this pipeline — instead of an abstractive (rewritten) summary, have the map step extract the single most important verbatim sentence from each chunk, and the reduce step select the best 5 across all of them. (Hint: this changes what you ask the model to do in summarize_chunk, not the overall map-reduce structure.)


Project 4 — Personal AI Agent (Tools Beyond Retail)

What this teaches beyond Volumes 1-4: Volume 3's hands-on agent used retail-specific tools. This project proves the same ReAct/tool-calling pattern generalizes to a completely different domain: a personal assistant managing reminders and a calendar. That is exactly the kind of project that shows an interviewer, or anyone reading your portfolio, that you understand the pattern itself rather than one memorized SmartStore-specific implementation.

User: "Remind me to call the supplier tomorrow at 9am, and what's on my
       calendar this afternoon?"
        │
        ▼
Agent decides: this needs TWO tool calls
        │
        ├──▶ create_reminder(text="Call the supplier", time="tomorrow 9am")
        │        → requires confirmation (Volume 3, Ch.11 — it's a real action)
        │
        └──▶ get_calendar_events(date="today", time_range="afternoon")
                 → read-only, no confirmation needed
        │
        ▼
Combined final answer

tools = [
    {
        "name": "create_reminder",
        "description": "Create a reminder for the user at a specific time.",
        "input_schema": {
            "type": "object",
            "properties": {"text": {"type": "string"}, "time": {"type": "string"}},
            "required": ["text", "time"],
        },
    },
    {
        "name": "get_calendar_events",
        "description": "Get the user's calendar events for a date and time range.",
        "input_schema": {
            "type": "object",
            "properties": {"date": {"type": "string"}, "time_range": {"type": "string"}},
            "required": ["date"],
        },
    },
]

HIGH_RISK_TOOLS = {"create_reminder"}  # Volume 3, Chapter 11 pattern

def create_reminder(text: str, time: str, confirmed: bool = False) -> str:
    if not confirmed:
        return f"CONFIRMATION_NEEDED: create reminder '{text}' for {time}?"
    # actually persist the reminder (e.g. to Postgres)
    return f"Reminder set: '{text}' at {time}"

def get_calendar_events(date: str, time_range: str = "all day") -> str:
    # placeholder — would call a real calendar API/MCP server (Volume 3, Ch.9-10)
    return "2:00 PM - Team sync, 4:00 PM - Dentist appointment"

tool_functions = {"create_reminder": create_reminder, "get_calendar_events": get_calendar_events}

This reuses the exact run_agent loop structure from Volume 3, Chapter 13, with the high-risk-tool confirmation gate from Volume 3, Chapter 11 wired in directly. That is the proof that the pattern, once built, transfers cleanly across domains without rewriting the orchestration logic.
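
The run_agent loop itself lives in Volume 3, but the step where the gate plugs in can be sketched on its own. The dispatcher below is an assumption about how that wiring might look, not the Volume 3 code: execute_tool and user_confirmed are hypothetical names, and the two tools are repeated from above only so the block runs standalone.

```python
HIGH_RISK_TOOLS = {"create_reminder"}


def create_reminder(text: str, time: str, confirmed: bool = False) -> str:
    if not confirmed:
        return f"CONFIRMATION_NEEDED: create reminder '{text}' for {time}?"
    return f"Reminder set: '{text}' at {time}"


def get_calendar_events(date: str, time_range: str = "all day") -> str:
    return "2:00 PM - Team sync, 4:00 PM - Dentist appointment"


tool_functions = {"create_reminder": create_reminder, "get_calendar_events": get_calendar_events}


def execute_tool(name: str, args: dict, user_confirmed: bool = False) -> str:
    """Run one tool call requested by the model.

    The confirmation flag comes from the app (a real user tap), never from
    the model's arguments, so any model-supplied "confirmed" key is dropped.
    """
    if name not in tool_functions:
        return f"ERROR: unknown tool '{name}'"
    safe_args = {k: v for k, v in args.items() if k != "confirmed"}
    if name in HIGH_RISK_TOOLS:
        return tool_functions[name](**safe_args, confirmed=user_confirmed)
    return tool_functions[name](**safe_args)


reminder_args = {"text": "Call the supplier", "time": "tomorrow 9am"}
first = execute_tool("create_reminder", reminder_args)
second = execute_tool("create_reminder", reminder_args, user_confirmed=True)
events = execute_tool("get_calendar_events", {"date": "today", "time_range": "afternoon"})
```

Here first is the CONFIRMATION_NEEDED prompt, second actually sets the reminder, and the read-only calendar call runs without any gate. Stripping "confirmed" from the model's arguments is a deliberate choice: the model should not be able to approve its own high-risk action.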

Interview Q&A

Q: What did NOT need to change when moving from a retail-domain agent (Volume 3) to this personal-assistant agent? A: The core loop — call the model with tools, check for tool_use blocks, execute, feed results back, repeat until a final text answer (Volume 3, Ch.3's ReAct loop) — is completely domain-agnostic. Only the tool definitions, their implementations, and the confirmation policy for specific tools changed; the orchestration logic itself is reusable infrastructure.

Q: Why is get_calendar_events a good candidate for direct MCP integration (Volume 3, Ch.8-10) rather than a custom-written function? A: Calendar access is exactly the kind of capability likely to already exist as a community or vendor-provided MCP server (Google Calendar, Outlook, etc.) — using an existing MCP server avoids reimplementing OAuth, API pagination, and calendar-specific logic yourself, which is the entire point of MCP's standardization (Volume 3, Ch.8).

Exercise: Add a third tool, cancel_reminder(reminder_id), and decide — with justification — whether it belongs in HIGH_RISK_TOOLS alongside create_reminder.


Project 5 — Multi-Agent System (Planner → Researcher → Writer → Reviewer)

What this teaches beyond Volumes 1-4: Volume 3, Chapter 7 introduced multi-agent systems conceptually. This project actually builds one — a four-stage pipeline that researches a topic, drafts content about it, and reviews/revises before returning a final result.

┌──────────┐     ┌────────────┐     ┌─────────┐     ┌──────────┐
│ Planner    │────▶│ Researcher  │────▶│ Writer   │────▶│ Reviewer  │
│ (breaks      │     │ (gathers      │     │ (drafts    │     │ (checks    │
│  goal into   │     │  facts via     │     │  content   │     │  quality,  │
│  subtasks)   │     │  tools/RAG)    │     │  from facts)│     │  approves  │
└──────────┘     └────────────┘     └─────────┘     │  or sends  │
                                                          │  back)     │
                                                          └──────┬─────┘
                                                                 │
                                            ┌────────────────────┴─────┐
                                            │ Approved?                 │
                                            │  YES → return final result │
                                            │  NO  → back to Writer      │
                                            │        with feedback       │
                                            └────────────────────────────┘

def planner_agent(goal: str) -> list[str]:
    return make_plan(goal)  # Volume 3, Chapter 5

def researcher_agent(subtask: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=400,
        system="Research this subtask and return key facts as bullet points.",
        messages=[{"role": "user", "content": subtask}],
    )
    return "".join(b.text for b in response.content if b.type == "text")

def writer_agent(facts: str, feedback: str = "") -> str:
    prompt = f"Facts:\n{facts}\n\nWrite a short, clear paragraph based on these facts."
    if feedback:
        prompt += f"\n\nRevise based on this reviewer feedback: {feedback}"
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=400, messages=[{"role": "user", "content": prompt}]
    )
    return "".join(b.text for b in response.content if b.type == "text")

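import json

def reviewer_agent(draft: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6", max_tokens=200,
        system='Review this draft. Respond with ONLY JSON: {"approved": true/false, "feedback": "..."}',
        messages=[{"role": "user", "content": draft}],
    )
    raw = "".join(b.text for b in response.content if b.type == "text")
    try:
        review = json.loads(raw)
        return {"approved": review["approved"] is True, "feedback": str(review.get("feedback", ""))}
    except (json.JSONDecodeError, KeyError, TypeError):
        # The model may wrap the JSON in prose or code fences despite the instruction.
        # Treat an unparseable review as "not approved" so the loop asks for a revision.
        return {"approved": False, "feedback": "Reviewer output was not valid JSON; re-check the draft."}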

def run_multi_agent_pipeline(goal: str, max_revisions: int = 2) -> str:
    subtasks = planner_agent(goal)
    facts = "\n".join(researcher_agent(t) for t in subtasks)

    draft = writer_agent(facts)
    for _ in range(max_revisions):
        review = reviewer_agent(draft)
        if review["approved"]:
            return draft
        draft = writer_agent(facts, feedback=review["feedback"])

    return draft  # return best effort after max revisions, even if not formally approved

The max_revisions cap is the same defensive pattern as Volume 3, Chapter 3's max_steps — without it, a Reviewer that never approves would loop indefinitely. Each agent here is just a differently-prompted function, not a fundamentally different mechanism — the "multi" in multi-agent is about role separation and explicit hand-offs (the graph structure in Volume 3, Chapter 6), not a new core capability.
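
The effect of the cap is easy to check without any API calls by injecting stub agents. The sketch below mirrors the control flow of run_multi_agent_pipeline's revision loop; run_review_loop, stub_writer, and never_approves are hypothetical names introduced only for this demonstration.

```python
def run_review_loop(write, review, facts: str, max_revisions: int = 2) -> tuple[str, int]:
    """Same revise-until-approved loop as above, with the agents passed in as callables.

    Returns the final draft and the number of times the writer was called.
    """
    draft = write(facts, "")
    writer_calls = 1
    for _ in range(max_revisions):
        verdict = review(draft)
        if verdict["approved"]:
            return draft, writer_calls
        draft = write(facts, verdict["feedback"])
        writer_calls += 1
    return draft, writer_calls  # best effort, not formally approved


def stub_writer(facts: str, feedback: str) -> str:
    suffix = f" revised for: {feedback}" if feedback else ""
    return f"draft({facts}){suffix}"


def never_approves(draft: str) -> dict:
    return {"approved": False, "feedback": "tighten the wording"}


final_draft, calls = run_review_loop(stub_writer, never_approves, "olive oil is in aisle 4")
# calls == 3: one initial draft plus max_revisions rewrites, then the loop stops
```

One detail this makes visible: when the cap is hit, the last rewrite is returned without ever being reviewed, so "best effort" really does mean unchecked.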

Interview Q&A

Q: Why does the Reviewer return structured JSON (approved/feedback) instead of free-form text? A: Structured output (Volume 1, Ch.9) lets the pipeline's control flow (the if review["approved"] check) make a reliable programmatic decision, rather than trying to parse an arbitrary free-text review to infer whether it was positive — exactly the same reasoning as Volume 2, Chapter 10's structured grounded-answer exercise.

Q: What's the realistic failure mode of this pipeline if the Researcher agent returns inaccurate "facts"? A: The error propagates downstream — the Writer drafts content based on those facts, and the Reviewer is checking writing quality/coherence, not independently verifying factual accuracy against a ground truth, so a confidently wrong fact from Research can sail through review unflagged. A more robust version would have the Researcher ground its output via RAG/citations (Volume 2) rather than relying on the model's own unverified knowledge.

Exercise: The Reviewer currently only checks the Writer's draft. Add a check for whether the Researcher's facts are actually relevant to the original goal — where in the pipeline would that check best fit, and what would it need to compare against?


Capstone — The Enterprise AI Platform: SmartStore AI, Fully Assembled

This isn't a new technique — it's every volume of this bootcamp, assembled into one real architecture. If Projects 1-5 were about learning individual building blocks in isolation, this is what they look like combined into the actual product you're building.

                              ┌─────────────────────┐
                              │   SwiftUI iOS App     │
                              │  (streaming chat UI,  │
                              │   Project 1's pattern)│
                              └──────────┬───────────┘
                                         │ HTTPS + Firebase ID token
                                         ▼
                  ┌──────────────────────────────────────────┐
                  │              FastAPI Backend                │
                  │  ┌────────────────────────────────────┐   │
                  │  │ Auth middleware (Vol.4, Ch.2)          │   │
                  │  │ RBAC checks (Vol.4, Ch.3)              │   │
                  │  │ Tenant context / RLS (Vol.4, Ch.4)     │   │
                  │  └─────────────────┬──────────────────┘   │
                  │                     ▼                        │
                  │  ┌────────────────────────────────────┐   │
                  │  │ Agent layer (Vol.3): ReAct loop,        │   │
                  │  │ tool calling, planning, memory           │   │
                  │  └────────┬───────────────┬─────────────┘   │
                  │            ▼                ▼                  │
                  │   ┌──────────────┐  ┌──────────────┐         │
                  │   │ RAG pipeline   │  │ Other tools    │         │
                  │   │ (Vol.2):        │  │ (calendar,     │         │
                  │   │ retrieve →      │  │ notifications, │         │
                  │   │ ground → answer │  │ MCP servers)    │         │
                  │   └──────┬───────┘  └──────┬───────┘         │
                  │           ▼                  ▼                  │
                  └──────────────────────────────────────────┘
                          │              │               │
                          ▼              ▼               ▼
                   ┌──────────┐  ┌──────────────┐  ┌──────────┐
                   │ Qdrant     │  │ PostgreSQL     │  │ Redis      │
                   │ (vectors)  │  │ (RLS, durable  │  │ (sessions, │
                   │            │  │  records)       │  │  caching)  │
                   └──────────┘  └──────────────┘  └──────────┘

   Cross-cutting, applied everywhere above (Vol.4):
   Secrets Manager · OpenTelemetry/Grafana traces · PostHog product analytics ·
   CI/CD with eval-gated deploys · Docker → Render → ECS Fargate

The actual build-order recommendation, given everything in this bootcamp: don't build top-to-bottom or bottom-to-top — build the thinnest possible vertical slice first (one product, one store, no auth, no agent — just Volume 2's ingest.py/query.py working end to end), then add layers outward from there in roughly this order:

  1. Volume 2's RAG pipeline, hardcoded single store, no auth (prove the core idea works)
  2. Volume 4's auth + RBAC (so it's safe to let a second real person use it)
  3. Volume 3's agent/tool-calling layer (so it can do more than answer one fixed question shape)
  4. Volume 4's observability + CI/CD (so you can trust it running unattended)
  5. Volume 3's multi-agent/MCP layer, and Volume 4's full production deployment — only once the above is solid and you have a real reason to need them

This order isn't arbitrary — it's deliberately back-loading the most complex pieces (multi-agent orchestration, full MCP servers) until the simpler architecture has already proven the product is worth the additional complexity. Building Phase 10's multi-agent system before Phase 1's basic RAG pipeline even works is the most common way ambitious solo projects stall out.

Interview Q&A

Q: Why build the vertical slice (single store, no auth) before adding authentication, even though "real products need auth"? A: Because the riskiest unknown early on is whether the core RAG retrieval and grounding actually produces good answers for your specific data — auth, RBAC, and deployment are all well-understood, mechanical work once you decide to do them. Validating the risky, uncertain part first (does this actually answer "where's the olive oil" well) before investing in the mechanical-but-necessary parts is the more efficient use of limited solo-builder time.

Q: Looking at the full architecture diagram, which single component, if it silently failed, would be hardest to detect without Volume 4's observability chapters? A: The RAG retrieval step returning stale or irrelevant results due to an ingestion pipeline failure (Volume 2, Ch.3's "what triggers re-ingestion" exercise). The system would keep answering confidently and fluently (no errors, no crashes) while quietly giving wrong aisle information, which is exactly the kind of silent failure that only structured tracing and product-analytics monitoring (Volume 4, Ch.8-9), not infrastructure uptime checks, would ever surface.
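
As a rough idea of what such a signal could look like, the sketch below builds a small structured record per retrieval that a trace or analytics event could carry. Everything here is an assumption for illustration: the function name retrieval_health, the 0.5 score threshold, and the 24-hour freshness window are placeholders to tune against your own data, not values from Volume 4.

```python
from datetime import datetime, timedelta, timezone


def retrieval_health(
    scores: list[float],
    last_ingested_at: datetime,
    now: datetime,
    min_top_score: float = 0.5,                    # assumed threshold
    max_index_age: timedelta = timedelta(hours=24),  # assumed freshness window
) -> dict:
    """Summarize one retrieval as fields suitable for a trace span or analytics event."""
    top_score = max(scores, default=0.0)
    index_age = now - last_ingested_at
    return {
        "result_count": len(scores),
        "top_score": top_score,
        "index_age_hours": round(index_age.total_seconds() / 3600, 1),
        "low_relevance": top_score < min_top_score,
        "stale_index": index_age > max_index_age,
    }


now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
record = retrieval_health([0.31, 0.22], last_ingested_at=now - timedelta(hours=30), now=now)
```

In this example both flags are raised even though the request itself would have returned a fluent answer with a 200 status, which is the gap uptime checks cannot see.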

There's no exercise for this chapter. The exercise is the actual SmartStore AI build, starting from wherever Phase 0 currently stands.


Appendix A — Glossary

Term | Meaning
Streaming | Returning a model's output incrementally as it's generated, rather than all at once
SSE | Server-Sent Events: a simple protocol for streaming text from server to client over HTTP
Map-reduce summarization | Summarizing chunks independently (map), then combining those summaries (reduce), recursively if needed
Vertical slice | The thinnest possible end-to-end version of a system, built first to validate the riskiest assumption
OCR | Optical Character Recognition: extracting text from scanned images, needed for non-text-layer PDFs

Appendix B — Project Summary Table

# | Project | New technique introduced
1 | ChatGPT Clone | Streaming responses (SSE) and progressive rendering
2 | PDF Q&A Assistant | Request-time ingestion of arbitrary uploads, page-level citations
3 | Document Summarizer | Map-reduce summarization for documents beyond the context window
4 | Personal AI Agent | Proving the agent pattern generalizes beyond one domain
5 | Multi-Agent System | A real, runnable Planner → Researcher → Writer → Reviewer pipeline
6 | Capstone | Every prior volume assembled into SmartStore AI's actual architecture, with a build-order recommendation

Appendix C — Where to Go From Here

This closes the five-volume bootcamp:

  • Volume 1 — AI Fundamentals (what's actually happening inside a model)
  • Volume 2 — RAG & Knowledge Retrieval (grounding answers in real data)
  • Volume 3 — Agents & MCP (models that act, not just answer)
  • Volume 4 — Production Enterprise AI (making it safe to run unattended)
  • Volume 5 — End-to-End Projects (proving every piece works together)

The honest next step isn't a Volume 6 — it's building. Everything needed to take SmartStore AI from its current Phase 0 status through a working, production-hardened product is now somewhere in these five documents, with working code you can copy directly rather than re-derive. When you hit a wall on something specific during the actual build, that's the right time to come back with the precise question — at that point you'll know exactly which chapter you're testing against reality.