SmartStore AI — Phase 9 Implementation Guide

Caching & Cost Optimization

What got built

backend/app/cache.py    — Redis session state + Qdrant-backed semantic caching
backend/tests/test_caching.py
Updated: app/rag.py (answer_question now accepts optional check_cache_fn/cache_answer_fn)

Key design decisions

Caching is opt-in per call site, not silently global. answer_question's cache parameters (check_cache_fn and cache_answer_fn) default to None, so a caller that passes neither gets no caching at all. Phase 7's agent loop, for example, bypasses this layer entirely, because tool-calling answers depend on dynamic state (store hours, current stock) that goes stale far faster than a static grounded answer. Making caching an explicit choice at each call site avoids accidentally caching an answer that should always be computed fresh.
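A minimal sketch of what opt-in caching at the call site can look like. The names check_cache_fn and cache_answer_fn are the ones app/rag.py exposes; the function body, the generate_fn stand-in for the retrieval-plus-LLM step, and the returned dict shape are assumptions made for illustration, not the actual implementation.

```python
from typing import Callable, Optional

# Hypothetical type aliases; the real signatures in app/rag.py may differ.
CheckCacheFn = Callable[[str, str], Optional[str]]
CacheAnswerFn = Callable[[str, str, str], None]


def answer_question(
    question: str,
    store_id: str,
    generate_fn: Callable[[str, str], str],  # assumed stand-in for retrieval + LLM
    check_cache_fn: Optional[CheckCacheFn] = None,
    cache_answer_fn: Optional[CacheAnswerFn] = None,
) -> dict:
    # A cache lookup only happens when the caller passed a lookup function.
    if check_cache_fn is not None:
        cached = check_cache_fn(question, store_id)
        if cached is not None:
            return {"answer": cached, "cache_hit": True}

    answer = generate_fn(question, store_id)

    # Writing to the cache is opt-in as well.
    if cache_answer_fn is not None:
        cache_answer_fn(question, answer, store_id)
    return {"answer": answer, "cache_hit": False}
```

With both parameters left at None the function never touches a cache, which is the behavior a dynamic-state caller such as the agent loop relies on.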

The semantic cache is scoped by store_id, exactly like every other layer in this app. test_semantic_cache_respects_store_scoping verifies that an identical question asked of two different stores never returns the other store's cached answer. This is the same multi-tenancy discipline from Phases 2 and 6, applied to a new layer: a cache is one more place where data could leak across the store boundary if it isn't deliberately scoped.
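To make the scoping rule concrete, here is a toy in-memory stand-in for the semantic cache. It is not the Qdrant-backed code in backend/app/cache.py: the class name, the method names, and the hand-written cosine similarity are all hypothetical, and the real layer would presumably express the store_id check as a filter in the vector search rather than a Python loop. The point it illustrates is that the store_id comparison happens before any similarity match is considered.

```python
import math
from typing import Optional


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Toy stand-in for a store-scoped semantic cache (hypothetical API)."""

    def __init__(self) -> None:
        # Each entry is (store_id, question embedding, cached answer).
        self._entries: list[tuple[str, list[float], str]] = []

    def put(self, store_id: str, vector: list[float], answer: str) -> None:
        self._entries.append((store_id, vector, answer))

    def get(self, store_id: str, vector: list[float], threshold: float) -> Optional[str]:
        best_score, best_answer = threshold, None
        for entry_store, entry_vector, answer in self._entries:
            # Entries from other stores are skipped, however similar they are.
            if entry_store != store_id:
                continue
            score = _cosine(vector, entry_vector)
            if score >= best_score:
                best_score, best_answer = score, answer
        return best_answer
```

An identical embedding cached for one store is a hit for that store and a miss for every other store, which is the property the scoping test checks.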

The session TTL is tested by actually waiting for the key to expire, not by asserting that the code "should" expire it. test_session_state_expires_after_ttl sets a 1-second TTL, sleeps 1.5 seconds, and confirms the key is gone from real Redis. This is slower than mocking time, but it tests the actual behavior rather than a description of it.
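The shape of such a test can be sketched as follows. The helper names (save_session, load_session, check_session_expires) and the "session:" key prefix are hypothetical, not taken from backend/app/cache.py. The client argument is assumed to be a redis-py client, whose setex(name, time, value) and get(name) methods are the only calls used.

```python
import json
import time


def save_session(client, session_id: str, state: dict, ttl_seconds: int) -> None:
    # SETEX writes the value and its expiry in a single command.
    client.setex(f"session:{session_id}", ttl_seconds, json.dumps(state))


def load_session(client, session_id: str):
    raw = client.get(f"session:{session_id}")
    return json.loads(raw) if raw is not None else None


def check_session_expires(client, ttl_seconds: int = 1, wait_seconds: float = 1.5) -> None:
    state = {"cart": ["sku-1"]}
    save_session(client, "ttl-demo", state, ttl_seconds)
    assert load_session(client, "ttl-demo") == state

    # Wait past the TTL for real instead of mocking the clock.
    time.sleep(wait_seconds)
    assert load_session(client, "ttl-demo") is None


# Against a live server (assumes redis-py is installed and Redis is on localhost):
#   import redis
#   check_session_expires(redis.Redis(host="localhost", port=6379))
```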

Verified test results

tests/test_caching.py::test_session_state_round_trips_through_real_redis PASSED
tests/test_caching.py::test_session_state_expires_after_ttl PASSED
tests/test_caching.py::test_semantic_cache_miss_then_hit PASSED
tests/test_caching.py::test_semantic_cache_respects_store_scoping PASSED
tests/test_caching.py::test_answer_question_uses_cache_when_provided_and_skips_the_llm_call PASSED

Full suite: 35 passed

The last test is worth highlighting: it uses a fake Anthropic client that raises an exception if it is called at all, and the test still passes. That shows a cache hit genuinely short-circuits the LLM call rather than returning the right answer by coincidence.
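The pattern behind that last test can be sketched like this. Everything here is a simplified assumption: ExplodingLLMClient, its complete method, and the cut-down answer_question stand-in are hypothetical and exist only to show why a client that fails on any call is a stronger check than comparing answers.

```python
class ExplodingLLMClient:
    """Fake client: any attempt to generate an answer fails the test."""

    def complete(self, prompt: str) -> str:  # hypothetical method name
        raise AssertionError("LLM was called despite a cache hit")


def answer_question(question, store_id, client, check_cache_fn=None):
    # Minimal stand-in for app.rag.answer_question; the real function does more.
    if check_cache_fn is not None:
        cached = check_cache_fn(question, store_id)
        if cached is not None:
            return cached
    return client.complete(question)


def test_cache_hit_skips_llm():
    result = answer_question(
        "Do you sell oat milk?",
        "store-1",
        client=ExplodingLLMClient(),
        # The cache always hits, so the client must never be reached.
        check_cache_fn=lambda q, sid: "Yes, in aisle 4.",
    )
    assert result == "Yes, in aisle 4."
```

If the cache check were skipped or ignored, the call would fall through to complete and the test would fail loudly instead of passing on a coincidentally correct answer.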

Using this for real

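from app.cache import check_semantic_cache, cache_answer
from app.rag import answer_question

# question and store_id are supplied by the caller.
result = answer_question(
    question, store_id,
    # Look up a semantically similar cached answer for this store.
    check_cache_fn=lambda q, sid: check_semantic_cache(q, sid, threshold=0.95),
    # Called to store a newly generated answer.
    cache_answer_fn=lambda q, a, sid: cache_answer(q, a, sid),
)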

The threshold=0.95 is the real tuning knob (Volume 6, Ch.9): start conservative (high), and lower it only when real eval data shows that doing so is safe.
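One way to check a lower threshold against eval data is sketched below. It assumes a labeled set of (similarity score, cached answer was acceptable) pairs. The function names are hypothetical, and the numbers in eval_pairs are made-up placeholders, not measurements from this project.

```python
def false_hit_rate(pairs, threshold: float) -> float:
    """Share of cache hits at this threshold that would serve a wrong answer."""
    hits = [acceptable for score, acceptable in pairs if score >= threshold]
    if not hits:
        return 0.0
    return sum(1 for acceptable in hits if not acceptable) / len(hits)


def lowest_safe_threshold(pairs, candidates, max_false_hit_rate: float = 0.0):
    """Walk candidates from high to low and stop at the first unsafe one."""
    safe = None
    for threshold in sorted(candidates, reverse=True):
        if false_hit_rate(pairs, threshold) > max_false_hit_rate:
            break
        safe = threshold
    return safe


# Placeholder eval pairs: (similarity score, cached answer was acceptable?)
eval_pairs = [(0.99, True), (0.97, True), (0.93, True), (0.91, False), (0.88, True)]

print(lowest_safe_threshold(eval_pairs, [0.95, 0.92, 0.90, 0.85]))  # prints 0.92
```

In this placeholder data the 0.91 pair is a wrong answer, so the sweep stops at 0.92 instead of continuing to 0.90.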

What's next

Phase 10 — Observability & Evaluation adds OpenTelemetry tracing across this whole pipeline and a real golden-set evaluation harness.