SmartStore AI — Phase 10 Implementation Guide

Observability & Evaluation

What got built

backend/app/tracing.py        — OpenTelemetry setup
backend/app/eval.py             — golden-set evaluation harness
backend/tests/test_observability_eval.py
Updated: app/rag.py (answer_question now wraps retrieval and LLM generation in spans)

Key design decisions

Tracing was verified with OpenTelemetry's real InMemorySpanExporter rather than by reading the instrumentation code and trusting that it does what it says. test_answer_question_creates_retrieval_and_generation_spans runs answer_question, captures the spans it produced, and asserts on their names and attributes. That includes confirming that token usage from the (fake) LLM response landed on the llm_generation span's attributes, not just that the code compiles.

Two separate spans, retrieval and llm_generation, rather than one big span around the whole function. This is the entire point of the tracing chapter (Volume 4, Chapter 8): a slow request needs to be diagnosable by which stage was slow. A single span around all of answer_question would tell you the total time and nothing else.

Evaluation deliberately uses simple keyword matching, not LLM-as-judge. Volume 1, Chapter 12 of the bootcamp describes the two approaches as complementary, not interchangeable. For GoldenQuestions with a small, clear-cut set of acceptable answers ("aisle 7", "olive oil"), substring matching is cheaper and faster, and it removes the risk of the judge model itself being wrong. LLM-as-judge earns its place on more open-ended questions, which this simple golden set doesn't have yet.
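As a rough illustration of the keyword approach, here is a minimal sketch of a substring scorer. score_answer_sketch and its signature are hypothetical; the real score_answer in app/eval.py may take different arguments or apply different matching rules (for example, requiring all keywords rather than any).

```python
def score_answer_sketch(answer: str, expected_keywords: list[str]) -> bool:
    """Return True if any expected keyword appears in the answer.

    Matching is case-insensitive substring matching (an assumption of this sketch).
    """
    normalized = answer.casefold()
    return any(keyword.casefold() in normalized for keyword in expected_keywords)


# Passes: the expected location appears in the answer, regardless of case.
print(score_answer_sketch("You'll find olive oil in Aisle 7.", ["aisle 7"]))
# Fails: the answer names a different aisle.
print(score_answer_sketch("Try aisle 9, near the pasta.", ["aisle 7"]))
```

One limitation worth keeping in mind: plain substring matching would also accept "aisle 70" for the keyword "aisle 7", so expected keywords should be chosen to avoid that kind of overlap.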

The "product not found" golden question is included on purpose. GOLDEN_SET's third entry checks that asking about a nonexistent product ("unobtainium") produces an honest "I don't have that information" rather than a confident hallucination — this is Volume 2, Chapter 10's grounding behavior, now backed by an actual regression test instead of just a system-prompt instruction you're hoping works.
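A sketch of how such a "not found" entry can act as a regression test. GoldenQuestionSketch, its fields, the question wording, and the expected phrase are all assumptions made for illustration; the actual GOLDEN_SET entry in app/eval.py may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenQuestionSketch:
    """Hypothetical shape for a golden-set entry."""
    question: str
    expected_keywords: tuple[str, ...]


# Hypothetical entry: the only acceptable answer is an honest refusal.
NOT_FOUND = GoldenQuestionSketch(
    question="Where can I find unobtainium?",
    expected_keywords=("don't have that information",),
)


def passes(entry: GoldenQuestionSketch, answer: str) -> bool:
    """Case-insensitive check that any expected phrase appears in the answer."""
    text = answer.casefold()
    return any(keyword.casefold() in text for keyword in entry.expected_keywords)


honest = "I don't have that information."
hallucinated = "Unobtainium is in aisle 3, next to the batteries."

assert passes(NOT_FOUND, honest)
assert not passes(NOT_FOUND, hallucinated)
```

A confident but invented location fails the entry, which is the behavior a system-prompt instruction alone cannot guarantee.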

Verified test results

tests/test_observability_eval.py::test_answer_question_creates_retrieval_and_generation_spans PASSED
tests/test_observability_eval.py::test_score_answer_matches_expected_keyword PASSED
tests/test_observability_eval.py::test_run_eval_computes_correct_score PASSED
tests/test_observability_eval.py::test_golden_set_is_well_formed PASSED

Full suite: 39 passed

Running the eval suite for real

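from app.eval import GOLDEN_SET, run_eval
from app.rag import answer_question

# The callback receives (question, store_id). sid is ignored here, so every
# golden question runs against the one real store UUID you substitute below.
result = run_eval(GOLDEN_SET, lambda q, sid: answer_question(q, "<a real store UUID>"))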
print(f"Score: {result['score']:.0%}")
for r in result["results"]:
    if not r["passed"]:
        print(f"FAILED: {r['question']} -> {r['answer']}")

This requires a real Anthropic key and ingested data (Phases 2-3). It was not run live here, for the same reason as every other live-API step in this guide.
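For a dry run without a key, the same loop can be exercised against a stubbed answer function. The sketch below is an assumption-heavy stand-in: run_eval_sketch, the dictionary-based golden entries, the store ID, and the canned answers are all hypothetical. Only the result shape (score, results, and the question/answer/passed keys) is taken from the snippet above.

```python
def run_eval_sketch(golden_set, answer_fn):
    """Hypothetical stand-in for run_eval: score each answer by keyword match."""
    results = []
    for item in golden_set:
        answer = answer_fn(item["question"], item["store_id"])
        passed = any(
            keyword.casefold() in answer.casefold()
            for keyword in item["expected_keywords"]
        )
        results.append(
            {"question": item["question"], "answer": answer, "passed": passed}
        )
    # Fraction of questions passed; an empty golden set scores 0.0.
    score = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return {"score": score, "results": results}


# Hypothetical golden entries and store ID, for illustration only.
GOLDEN_SKETCH = [
    {"question": "Where is the olive oil?", "store_id": "store-1",
     "expected_keywords": ["aisle 7"]},
    {"question": "Which aisle has pasta?", "store_id": "store-1",
     "expected_keywords": ["aisle 4"]},
]


def stub_answer(question: str, store_id: str) -> str:
    """Canned answers in place of a live LLM call; one is deliberately wrong."""
    canned = {"Where is the olive oil?": "Olive oil is in aisle 7."}
    return canned.get(question, "Pasta is in aisle 9.")


result = run_eval_sketch(GOLDEN_SKETCH, stub_answer)
print(f"Score: {result['score']:.0%}")
for r in result["results"]:
    if not r["passed"]:
        print(f"FAILED: {r['question']} -> {r['answer']}")
```

A stub like this only checks the harness logic. It says nothing about how the real model answers, which still needs the live run described above.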

What's next

Phase 11 — CI/CD & MVP Deployment wires this eval suite into a real GitHub Actions pipeline that gates deploys, and gets the app actually running on Render.