Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

6.4. Reflection Checkpoint

Key Takeaways

  • FM cost has three levers: input tokens, output tokens, and price per token. Address them in order: prompt caching (cuts the cost of repeated static prompt tokens), model right-sizing (reduces price per token), then provisioned throughput (optimizes capacity pricing at high utilization).
  • Semantic caching delivers the highest cost reduction (zero FM cost for cache hits) but requires careful exclusion of user-specific and time-sensitive queries to avoid privacy violations and stale responses.
  • Retrieval latency and generation latency require different optimizations. Use AWS X-Ray traces to determine which one is the bottleneck before optimizing.
  • Streaming reduces perceived latency for user-facing applications at minimal architectural cost, so it is almost always worth implementing.
  • Quality drift is silent: it doesn't raise errors. Weekly scheduled evaluations against a golden dataset are the most reliable way to catch it.
  • CloudWatch native Bedrock metrics cover infrastructure health. Quality, accuracy, and business outcome metrics must be custom-published from your application.
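
The semantic-caching takeaway above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the exclusion patterns, the 0.9 similarity threshold, the `SemanticCache` class, and the toy two-dimensional embeddings are all assumptions made for the example. A real system would take its embeddings from an embedding model and its exclusion rules from its own privacy and freshness requirements.

```python
import math
import re

# Hypothetical exclusion rules: queries that look user-specific or
# time-sensitive are never served from, or written to, the cache.
EXCLUDE_PATTERNS = [
    re.compile(r"\b(my|mine|our)\b", re.IGNORECASE),              # user-specific
    re.compile(r"\b(today|now|latest|current)\b", re.IGNORECASE),  # time-sensitive
]


def is_cacheable(query: str) -> bool:
    """Return True only if no exclusion pattern matches the query."""
    return not any(p.search(query) for p in EXCLUDE_PATTERNS)


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Toy in-memory semantic cache keyed by embedding similarity."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # assumed value; tune per application
        self.entries = []           # list of (embedding, response) pairs

    def store(self, query, embedding, response) -> None:
        # Excluded queries are never written, so they can never leak later.
        if is_cacheable(query):
            self.entries.append((embedding, response))

    def lookup(self, query, embedding):
        # Excluded queries always fall through to the FM.
        if not is_cacheable(query):
            return None
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]  # cache hit: no FM call needed
        return None         # cache miss: caller invokes the FM


cache = SemanticCache()
cache.store("How do I reset a password?", [1.0, 0.0], "Use the reset link.")
hit = cache.lookup("How can I reset a password?", [0.99, 0.05])
skipped = cache.lookup("What is my account balance?", [1.0, 0.0])
```

Here `hit` receives the cached answer because the two embeddings are nearly parallel, while `skipped` is `None` even though its embedding matches exactly, because the query is excluded before any similarity check runs. Applying the exclusion on both the write path and the read path is a deliberate choice in this sketch: it keeps user-specific answers out of the cache entirely.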

Connecting Forward

Phase 7 covers Domain 5: testing, validation, and troubleshooting. This domain connects directly back to every prior domain: evaluation frameworks assess the quality of Domain 1's retrieval pipelines, troubleshooting methodology addresses failures in Domain 2's agent architectures, and the validation tools test compliance with Domain 3's safety requirements.

Self-Check Questions

  • Your FM application costs $45,000/month. A cost analysis shows: 60% is input token cost, 25% is output token cost, 15% is provisioned throughput. Rank these three optimization opportunities by implementation ROI, and name the specific technique for each.
  • Your CloudWatch dashboard shows FM application P99 latency increased from 4s to 18s after a Knowledge Base sync that added 200,000 new documents. The P50 latency is unchanged at 2.1s. What is the most likely cause, and what architectural change addresses it?
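
As a starting point for the first question, it may help to turn the percentages into dollar amounts before ranking anything. The sketch below uses only the figures given in the question; the `monthly_saving` helper and any reduction percentage passed to it are hypothetical what-if inputs, not measured results.

```python
MONTHLY_COST = 45_000  # dollars per month, from the question

# Cost shares in whole percent, from the question.
SHARE_PCT = {
    "input_tokens": 60,
    "output_tokens": 25,
    "provisioned_throughput": 15,
}

# Integer arithmetic keeps the dollar figures exact.
dollars = {name: MONTHLY_COST * pct // 100 for name, pct in SHARE_PCT.items()}


def monthly_saving(bucket: str, reduction_pct: int) -> int:
    """Dollars saved per month if `bucket` shrinks by `reduction_pct` percent.

    `reduction_pct` is a what-if assumption supplied by the reader.
    """
    return dollars[bucket] * reduction_pct // 100


for name, amount in dollars.items():
    print(f"{name}: ${amount:,}")
```

With the buckets expressed in dollars, each candidate technique can be compared by plugging an estimated reduction into `monthly_saving` and weighing the result against its implementation effort.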
Written by Alvin Varughese
Founder, 15 professional certifications