Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

6.1.2. Semantic Caching and Response Reuse

💡 First Principle: Semantic caching recognizes that two queries with different wording may be semantically identical and should return the same answer. Unlike prompt caching (which caches at the FM API level), semantic caching operates at the application level — it caches answers to previous questions and retrieves them for semantically similar new questions.

Semantic cache architecture with ElastiCache:
import hashlib
import json

import numpy as np
import redis

ELASTICACHE_ENDPOINT = 'your-cluster.cache.amazonaws.com'  # replace with your endpoint
redis_client = redis.Redis(host=ELASTICACHE_ENDPOINT, port=6379, ssl=True)

def get_from_semantic_cache(query_embedding, similarity_threshold=0.92):
    """Check ElastiCache for semantically similar cached responses."""
    # Demo only: KEYS is O(n) and blocks Redis. In production, use a
    # vector search index instead of scanning every cached embedding.
    cached_keys = redis_client.keys('query:*')

    for key in cached_keys:
        cached_data = json.loads(redis_client.get(key))
        cached_embedding = np.array(cached_data['embedding'])

        # Cosine similarity between the new query and the cached one
        similarity = np.dot(query_embedding, cached_embedding) / (
            np.linalg.norm(query_embedding) * np.linalg.norm(cached_embedding)
        )

        if similarity >= similarity_threshold:
            return cached_data['response'], similarity

    return None, 0.0

def cache_response(query, query_embedding, response, ttl_seconds=3600):
    """Cache a query-response pair with TTL."""
    # Python's hash() is salted per process; use a stable digest so keys
    # survive restarts and are consistent across application servers.
    cache_key = f"query:{hashlib.sha256(query.encode()).hexdigest()}"
    redis_client.setex(
        cache_key,
        ttl_seconds,
        json.dumps({'embedding': query_embedding.tolist(), 'response': response})
    )
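To build intuition for the 0.92 similarity threshold, here is a minimal standalone sketch (no Redis, and the vectors are synthetic stand-ins for real embeddings): a small perturbation of an embedding, representing a paraphrased query, stays far above the threshold, while an unrelated query's embedding falls far below it.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic embeddings: b is a small perturbation of a (a paraphrase),
# c is an independent random vector (an unrelated query).
rng = np.random.default_rng(0)
a = rng.normal(size=256)
b = a + 0.1 * rng.normal(size=256)   # near-duplicate query
c = rng.normal(size=256)             # unrelated query

print(cosine_sim(a, b) >= 0.92)  # paraphrase clears the threshold
print(cosine_sim(a, c) >= 0.92)  # unrelated query does not
```

In practice you would tune the threshold empirically on a sample of real query pairs: too low and unrelated questions get wrong cached answers; too high and legitimate paraphrases miss the cache.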
Cache TTL strategy by content type:

| Content Type | Recommended TTL | Reason |
|---|---|---|
| FAQ / static documentation | 24–72 hours | Content changes infrequently |
| Product catalog queries | 1–4 hours | Inventory/pricing changes during the day |
| News / current events | 15–30 minutes | Content freshness is critical |
| User-specific queries | Do not cache | Privacy: responses contain personal context |
| Real-time data queries | Do not cache | Accuracy requires fresh retrieval |
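One way to wire the TTL strategy above into the caching layer is a simple lookup table. This is a sketch under assumptions: the content-type labels and the `choose_ttl` helper are illustrative, not part of any AWS SDK, and the concrete TTLs are midpoints chosen from the recommended ranges.

```python
# TTLs in seconds, mirroring the table above; None means "do not cache"
TTL_POLICY = {
    "faq": 48 * 3600,        # static docs: 24-72 h recommended
    "catalog": 2 * 3600,     # product catalog: 1-4 h
    "news": 20 * 60,         # current events: 15-30 min
    "user_specific": None,   # never cache: privacy
    "realtime": None,        # never cache: accuracy
}

def choose_ttl(content_type: str):
    """Return a TTL in seconds, or None if the response must not be cached."""
    # Unknown content types fall through to None, i.e. no caching by default
    return TTL_POLICY.get(content_type)

print(choose_ttl("faq"))            # 172800
print(choose_ttl("user_specific"))  # None
```

Defaulting unknown content types to "do not cache" is the safe failure mode: a missed cache costs one extra FM invocation, while a wrongly cached response can serve stale or private data.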

⚠️ Exam Trap: Semantic caching must never cache responses to queries that contain user-specific context (account balance, personal history, health data). If a cached response to "what is my account balance?" from User A gets returned to User B because their queries are semantically similar, you have a data privacy breach. The cache key must include user identity OR the cache must be restricted to impersonal factual queries.
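A minimal sketch of the first mitigation, namespacing the cache key by user identity (`user_scoped_key` is an illustrative helper, not an AWS API): two users asking the same question can then never share a cache entry.

```python
import hashlib

def user_scoped_key(user_id: str, query: str) -> str:
    """Namespace the cache key by user identity so semantically similar
    queries from different users never resolve to the same cached response."""
    digest = hashlib.sha256(query.encode()).hexdigest()
    return f"query:{user_id}:{digest}"

# Same question, different users -> distinct cache entries
k1 = user_scoped_key("user-a", "what is my account balance?")
k2 = user_scoped_key("user-b", "what is my account balance?")
print(k1 == k2)  # False
```

Note that key namespacing alone is not enough if the similarity scan matches on `query:*`: the lookup must also restrict its scan to the requesting user's prefix (e.g. `query:user-a:*`), or be limited to impersonal factual queries as the trap above describes.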

Reflection Question: Your GenAI FAQ chatbot receives 80% identical or near-identical questions ("What are your hours?", "When do you close?", "Are you open on weekends?"). Semantic caching could eliminate most FM invocations for these. What similarity threshold would you set, and what is the one category of question you must explicitly exclude from caching?
