SmartStore AI — Phase 10 Implementation Guide
Observability & Evaluation
What got built
backend/app/tracing.py — OpenTelemetry setup
backend/app/eval.py — golden-set evaluation harness
backend/tests/test_observability_eval.py
Updated: backend/app/rag.py (answer_question now wraps retrieval and LLM generation in separate spans)
Key design decisions
Tracing was verified with OpenTelemetry's real InMemorySpanExporter, not by reading the instrumentation code and assuming it does what it says. test_answer_question_creates_retrieval_and_generation_spans actually runs answer_question, captures the real spans it produced, and asserts on their names and attributes — including confirming token usage from the (fake) LLM response actually landed on the llm_generation span's attributes, not just that the code compiles.
Two separate spans, retrieval and llm_generation, not one big span around the whole function. This is the central point of the tracing chapter (Volume 4, Chapter 8): a slow request has to be diagnosable by which stage was slow. A single span around all of answer_question would report the total time and nothing else.
Evaluation uses simple keyword matching, not LLM-as-judge, deliberately. Volume 1, Chapter 12 of the bootcamp describes both approaches as complementary, not interchangeable — for GoldenQuestions with a small, clear-cut set of acceptable answers ("aisle 7", "olive oil"), substring matching is cheaper, faster, and has zero risk of the judge model itself being wrong. LLM-as-judge earns its place for more open-ended questions this simple golden set doesn't have yet.
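A rough sketch of what keyword-based scoring looks like is shown here. It is an assumption-laden illustration, not the contents of backend/app/eval.py: the GoldenQuestion fields, the case-insensitive "any keyword matches" rule, and the _sketch function names are hypothetical. Only the result shape (a score plus per-question question, answer, and passed fields) follows the usage shown later in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenQuestion:
    # Hypothetical fields; the real dataclass may differ.
    question: str
    store_id: str
    expected_keywords: tuple  # the answer passes if any one of these appears


def score_answer_sketch(answer, expected_keywords):
    # Case-insensitive substring match: cheap and deterministic.
    lowered = answer.lower()
    return any(keyword.lower() in lowered for keyword in expected_keywords)


def run_eval_sketch(golden_set, answer_fn):
    # answer_fn takes (question, store_id) and returns the answer text.
    results = []
    for item in golden_set:
        answer = answer_fn(item.question, item.store_id)
        results.append({
            "question": item.question,
            "answer": answer,
            "passed": score_answer_sketch(answer, item.expected_keywords),
        })
    passed = sum(1 for r in results if r["passed"])
    score = passed / len(results) if results else 0.0
    return {"score": score, "results": results}


# Canned answers stand in for the real RAG pipeline.
golden = [
    GoldenQuestion("Where is the olive oil?", "store-1", ("aisle 7",)),
    GoldenQuestion("Which oil is on sale?", "store-1", ("olive oil",)),
]
canned = {
    "Where is the olive oil?": "You will find it in Aisle 7.",
    "Which oil is on sale?": "Sunflower oil is on sale this week.",
}
result = run_eval_sketch(golden, lambda q, sid: canned[q])
```

With these canned answers the first question passes and the second fails, so result["score"] is 0.5. Neither outcome depends on a judge model, which is the tradeoff described above.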
The "product not found" golden question is included on purpose. GOLDEN_SET's third entry checks that asking about a nonexistent product ("unobtainium") produces an honest "I don't have that information" rather than a confident hallucination — this is Volume 2, Chapter 10's grounding behavior, now backed by an actual regression test instead of just a system-prompt instruction you're hoping works.
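To show how such an entry separates an honest refusal from a hallucination, here is a small sketch. The entry shape, the question wording, and passes_not_found_check are hypothetical; the real third entry of GOLDEN_SET may be structured differently. The apostrophe normalization is an added assumption, included because models sometimes emit a typographic apostrophe in "don't".

```python
# Hypothetical golden entry for a product that does not exist.
NOT_FOUND_ENTRY = {
    "question": "Which aisle is the unobtainium in?",
    "expected_keywords": ("don't have that information",),
}


def passes_not_found_check(answer, entry=NOT_FOUND_ENTRY):
    # Lowercase and map the typographic apostrophe (U+2019) to ASCII
    # so both spellings of "don't" match the expected keyword.
    normalized = answer.lower().replace("\u2019", "'")
    return any(keyword in normalized for keyword in entry["expected_keywords"])


honest = "I don't have that information for this store."
hallucinated = "Unobtainium is in aisle 12, next to the spices."

assert passes_not_found_check(honest)
assert not passes_not_found_check(hallucinated)
```

A confident but invented aisle number contains none of the expected refusal text, so it fails the check and shows up as a regression.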
Verified test results
tests/test_observability_eval.py::test_answer_question_creates_retrieval_and_generation_spans PASSED
tests/test_observability_eval.py::test_score_answer_matches_expected_keyword PASSED
tests/test_observability_eval.py::test_run_eval_computes_correct_score PASSED
tests/test_observability_eval.py::test_golden_set_is_well_formed PASSED
Full suite: 39 passed
Running the eval suite for real
from app.eval import GOLDEN_SET, run_eval
from app.rag import answer_question

result = run_eval(GOLDEN_SET, lambda q, sid: answer_question(q, "<a real store UUID>"))
print(f"Score: {result['score']:.0%}")
for r in result["results"]:
    if not r["passed"]:
        print(f"FAILED: {r['question']} -> {r['answer']}")
This requires a real Anthropic key and ingested data (Phases 2-3) — not run live here for the same reason as every other live-API step in this guide.
What's next
Phase 11 — CI/CD & MVP Deployment wires this eval suite into a real GitHub Actions pipeline that gates deploys, and gets the app actually running on Render.