SmartStore AI — Phase 8 Implementation Guide

Streaming, Voice & Image Input

What got built

backend/app/streaming.py            — SSE-formatted streaming generator
backend/app/api/ask_stream.py        — GET /ask/stream
backend/app/vision.py                 — identify_product_from_image(), answer_from_image()
backend/app/api/ask_image.py           — POST /ask/image
backend/tests/test_streaming_vision.py

ios/SmartStoreAI/VoiceManager.swift   — on-device STT (Speech framework) + TTS (AVSpeechSynthesizer)
Updated: APIClient.swift (streamAsk, askImage), ChatViewModel.swift, ChatView.swift (mic + photo buttons, voice toggle)

A real API check that mattered before writing this

Before writing app/streaming.py, I checked the installed Anthropic SDK source (v0.112.0) rather than assuming stream.text_stream still exists. A dir() check on the class initially suggested it didn't, which would have been an important correction to flag. Looking deeper, text_stream turned out to be an instance attribute set inside __init__ (self.text_stream = self.__stream_text__()), which a class-level dir() check cannot see, so the original pattern was correct after all. The lesson holds either way: reading the actual source, rather than trusting a quick attribute check, is what separated "looks deprecated" from "is actually fine, just not visible the way I checked for it."
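The class-versus-instance distinction can be reproduced without the SDK. The sketch below uses a hypothetical stand-in class (MessageStream and _stream_text are illustrative names, not the real Anthropic class or its internals) to show why dir() on a class misses anything assigned inside __init__:

```python
class MessageStream:
    """Hypothetical stand-in for an SDK stream class; not the real Anthropic class."""

    def __init__(self):
        # Assigned per instance, so it lives in the instance __dict__,
        # not in the class namespace that dir(MessageStream) inspects.
        self.text_stream = self._stream_text()

    def _stream_text(self):
        yield "hello"


# Class-level checks miss attributes that only exist after __init__ runs.
class_level_visible = "text_stream" in dir(MessageStream)
class_hasattr = hasattr(MessageStream, "text_stream")

# Instance-level checks see the attribute.
instance = MessageStream()
instance_level_visible = "text_stream" in dir(instance)
first_chunk = next(instance.text_stream)

print(class_level_visible, class_hasattr, instance_level_visible, first_chunk)
# prints: False False True hello
```

A safer quick check is therefore to inspect an instance, or to read __init__ in the source, before concluding an attribute has been removed.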

Key design decisions

Vision identifies; it never answers the location question directly. identify_product_from_image returns only a product name, which then flows into the exact same answer_question from Phase 3. This is Volume 6, Chapter 6's rule, enforced in code, not just described: the vision model has no knowledge of this store's specific aisle layout, so letting it answer directly would reintroduce the exact grounding failure RAG exists to prevent.
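A minimal sketch of that chaining, under stated assumptions: the real signatures of identify_product_from_image and answer_question are not shown in this guide, so the version below takes both as injected callables, and the question template is purely illustrative.

```python
from typing import Callable


def answer_from_image(
    image_bytes: bytes,
    identify: Callable[[bytes], str],
    answer_question: Callable[[str], str],
) -> str:
    """Identify the product, then defer to the grounded text pipeline."""
    # Step 1: the vision model only names the product. It is never asked
    # where the product is, because it knows nothing about this store.
    product_name = identify(image_bytes).strip()

    # Step 2: the name goes through the same RAG path as a typed question.
    # The question wording here is an assumption, not the guide's template.
    return answer_question(f"Where can I find {product_name}?")


# Stubs standing in for the vision call and the Phase 3 pipeline.
def fake_identify(image_bytes: bytes) -> str:
    return " oat milk \n"


def fake_answer_question(question: str) -> str:
    return f"[grounded answer for: {question}]"


result = answer_from_image(b"\x89PNG", fake_identify, fake_answer_question)
print(result)
# prints: [grounded answer for: Where can I find oat milk?]
```

Keeping the vision step's return value down to a bare name means there is no code path by which an ungrounded aisle number could reach the shopper.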

On-device STT/TTS, not a cloud API, following the reasoning in Volume 6, Chapters 7-8: a shopper's spoken query plausibly happens on spotty in-store WiFi, and "the aisle is 7" doesn't need an expressive cloud voice to be useful. VoiceManager.swift uses Apple's Speech and AVFoundation frameworks directly, with no backend involvement for voice I/O at all; the transcribed text flows into send() exactly like typed text.

Voice responses are off by default (voiceResponsesEnabled = false), matching the "optional spoken response" language in SmartStore AI's own spec — this isn't a minor detail; an always-on spoken response in a quiet store would be a real, immediate UX complaint.

Verified test results

tests/test_streaming_vision.py::test_stream_answer_yields_sse_formatted_chunks PASSED
tests/test_streaming_vision.py::test_stream_answer_includes_done_sentinel_even_for_empty_stream PASSED
tests/test_streaming_vision.py::test_identify_product_from_image_extracts_text PASSED
tests/test_streaming_vision.py::test_answer_from_image_chains_identification_into_rag_pipeline PASSED

Full suite: 30 passed

The second test (empty_stream) is a real edge case worth having caught: it confirms the [DONE] sentinel is still sent even if the model's stream produces zero text chunks. Without the sentinel, a SwiftUI client waiting for it to know the stream ended would simply hang.
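The behavior that test pins down can be sketched as a small generator. This is an illustration, not the contents of app/streaming.py: the function names and the exact framing of each event are assumptions, and only the [DONE] sentinel comes from the guide itself.

```python
from typing import Iterable, Iterator

DONE_EVENT = "data: [DONE]\n\n"


def format_sse(chunk: str) -> str:
    """Frame one text chunk as a Server-Sent Event.

    SSE requires every line of a multi-line payload to carry its own
    "data:" prefix; a blank line terminates the event.
    """
    return "".join(f"data: {line}\n" for line in chunk.split("\n")) + "\n"


def sse_stream(text_chunks: Iterable[str]) -> Iterator[str]:
    """Yield SSE events for each chunk, then always yield the sentinel."""
    for chunk in text_chunks:
        yield format_sse(chunk)
    # Placed after the loop, so it is emitted even when the loop body
    # never runs (the model produced zero text chunks).
    yield DONE_EVENT


empty_events = list(sse_stream([]))
normal_events = list(sse_stream(["Aisle ", "7"]))
multiline_event = format_sse("a\nb")

print(empty_events)
# prints: ['data: [DONE]\n\n']
```

Because the sentinel is yielded unconditionally after the loop, the client's end-of-stream logic does not need a separate case for empty responses.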

Honest limitation

The Swift voice/photo code (VoiceManager.swift, the PhotosPicker wiring in ChatView.swift) follows current SwiftUI, Speech-framework, and PhotosUI patterns but, like every Swift file in this guide, wasn't compiled here. Microphone and Speech Recognition permission strings (NSMicrophoneUsageDescription, NSSpeechRecognitionUsageDescription) also need to be added to Info.plist before this will run at all on a real device or simulator.

What's next

Phase 9 — Caching & Cost Optimization adds Redis-backed session state and semantic caching on top of this now-multimodal pipeline.