6.1.2. Semantic Caching and Response Reuse
💡 First Principle: Semantic caching recognizes that two queries with different wording may be semantically identical and should return the same answer. Unlike prompt caching (which caches at the FM API level), semantic caching operates at the application level — it caches answers to previous questions and retrieves them for semantically similar new questions.
Semantic cache architecture with ElastiCache:
```python
import hashlib
import json

import boto3
import numpy as np
import redis

redis_client = redis.Redis(host=ELASTICACHE_ENDPOINT, port=6379, ssl=True)

def get_from_semantic_cache(query_embedding, similarity_threshold=0.92):
    """Check ElastiCache for semantically similar cached responses."""
    # Retrieve all cached query embeddings. A linear scan is fine for a
    # small cache; in practice, use a vector index instead.
    cached_keys = redis_client.keys('query:*')
    for key in cached_keys:
        cached_data = json.loads(redis_client.get(key))
        cached_embedding = np.array(cached_data['embedding'])
        # Cosine similarity between the new query and the cached query
        similarity = np.dot(query_embedding, cached_embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
        )
        if similarity >= similarity_threshold:
            return cached_data['response'], similarity
    return None, 0.0

def cache_response(query, query_embedding, response, ttl_seconds=3600):
    """Cache a query-response pair with TTL."""
    # Use a stable digest: Python's built-in hash() is randomized per
    # process, so identical queries would map to different keys across
    # application restarts.
    query_digest = hashlib.sha256(query.encode('utf-8')).hexdigest()
    cache_key = f"query:{query_digest}"
    redis_client.setex(
        cache_key,
        ttl_seconds,
        json.dumps({'embedding': query_embedding.tolist(), 'response': response})
    )
```
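The similarity check can be exercised without a Redis backend. The sketch below uses made-up three-dimensional vectors standing in for real embedding-model output (real embeddings have hundreds or thousands of dimensions), purely to show how the 0.92 threshold separates a paraphrase from an unrelated query:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (stdlib only)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings: paraphrases land close
# together in the embedding space, unrelated queries do not.
hours_q1 = [0.9, 0.1, 0.05]    # "What are your hours?"
hours_q2 = [0.88, 0.12, 0.07]  # "When do you close?"
refund_q = [0.1, 0.05, 0.95]   # "How do I request a refund?"

THRESHOLD = 0.92
paraphrase_hit = cosine_similarity(hours_q1, hours_q2) >= THRESHOLD  # cache hit
unrelated_hit = cosine_similarity(hours_q1, refund_q) >= THRESHOLD   # cache miss
```

Here `paraphrase_hit` is `True` and `unrelated_hit` is `False`: the two phrasings of the hours question clear the threshold, while the refund question does not, so only the paraphrase would be served from cache.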
Cache TTL strategy by content type:
| Content Type | Recommended TTL | Reason |
|---|---|---|
| FAQ / static documentation | 24–72 hours | Content changes infrequently |
| Product catalog queries | 1–4 hours | Inventory/pricing changes during day |
| News/current events | 15–30 minutes | Content freshness critical |
| User-specific queries | Do not cache | Privacy: responses contain personal context |
| Real-time data queries | Do not cache | Accuracy requires fresh retrieval |
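The table above can be encoded as a small policy map in application code; the content-type labels and the `choose_ttl` helper are illustrative names, not part of any AWS API. Using `None` for "do not cache" (and for any unknown content type) makes the safe behavior the default:

```python
# TTLs in seconds; None means "do not cache", per the table above.
TTL_POLICY = {
    'faq': 24 * 3600,             # 24-72 h band; low end as the default
    'product_catalog': 1 * 3600,  # 1-4 h band
    'news': 15 * 60,              # 15-30 min band
    'user_specific': None,        # privacy: never cache
    'real_time': None,            # accuracy: never cache
}

def choose_ttl(content_type):
    """Return a TTL in seconds, or None when the query must not be cached."""
    # .get() returns None for unrecognized types, so unclassified
    # content falls through to "do not cache" rather than being cached.
    return TTL_POLICY.get(content_type)
```

A caller would then only invoke `cache_response` when `choose_ttl(...)` returns a value, skipping the cache entirely otherwise.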
⚠️ Exam Trap: Semantic caching must never cache responses to queries that contain user-specific context (account balance, personal history, health data). If a cached response to "what is my account balance?" from User A gets returned to User B because their queries are semantically similar, you have a data privacy breach. The cache key must include user identity OR the cache must be restricted to impersonal factual queries.
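One way to enforce the user-isolation requirement is to fold the user identity into the cache key itself, so semantically identical queries from different users can never collide. `make_user_cache_key` below is a hypothetical helper, not an ElastiCache feature; the lookup side must then restrict its similarity scan to keys under that user's prefix:

```python
import hashlib

def make_user_cache_key(user_id, query):
    """Scope the cache key to a user so personalized answers never leak.

    The user_id appears both in the key prefix (so lookups can scan only
    that user's entries) and in the hashed payload.
    """
    digest = hashlib.sha256(f"{user_id}:{query}".encode('utf-8')).hexdigest()
    return f"query:{user_id}:{digest}"

# Identical wording from two users produces two distinct keys, so a
# cached "your account balance" answer can never cross user boundaries.
key_a = make_user_cache_key('userA', 'what is my account balance?')
key_b = make_user_cache_key('userB', 'what is my account balance?')
```

For impersonal factual queries (the FAQ case), the simpler shared key space remains appropriate; the per-user scheme is only needed once responses can contain personal context.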
Reflection Question: Your GenAI FAQ chatbot receives 80% identical or near-identical questions ("What are your hours?", "When do you close?", "Are you open on weekends?"). Semantic caching could eliminate most FM invocations for these. What similarity threshold would you set, and what is the one category of question you must explicitly exclude from caching?