6.3. Monitoring and Observability
💡 First Principle: GenAI application monitoring must track three distinct layers simultaneously: infrastructure health (are the services up?), FM quality (are responses accurate and appropriate?), and business outcomes (is the AI actually helping users accomplish their goals?). Traditional application monitoring covers only the first layer.
An application where CloudWatch shows all green metrics while the FM has begun hallucinating frequently is a monitoring failure. Quality degradation — caused by model updates, knowledge base drift, prompt regressions, or data freshness issues — requires application-level metrics beyond infrastructure health.
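The infrastructure layer is the easiest to wire up, since Bedrock publishes invocation metrics to CloudWatch automatically. A minimal sketch of a P99 latency alarm follows; the alarm-naming scheme, model ID, and 8-second threshold are illustrative assumptions, not recommendations:

```python
def p99_latency_alarm_params(model_id: str, threshold_ms: float) -> dict:
    """Build put_metric_alarm parameters for Bedrock's built-in
    InvocationLatency metric, alarming on the P99 percentile."""
    return {
        "AlarmName": f"bedrock-p99-latency-{model_id}",  # illustrative naming scheme
        "Namespace": "AWS/Bedrock",                      # Bedrock's built-in metric namespace
        "MetricName": "InvocationLatency",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "ExtendedStatistic": "p99",       # percentile statistic, not a plain average
        "Period": 300,                    # evaluate over 5-minute windows
        "EvaluationPeriods": 3,           # require 3 consecutive breaches before alarming
        "Threshold": threshold_ms,        # align with your latency SLA
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

params = p99_latency_alarm_params("anthropic.claude-3-5-sonnet-20240620-v1:0", 8000)

# Applying the alarm requires AWS credentials:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Building the parameters as a plain dict keeps the alarm definition testable and reviewable without touching AWS.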
| Monitoring Layer | What It Measures | AWS Tool | Alert On |
|---|---|---|---|
| Infrastructure | Error rates, throttles, latency P50/P99 | CloudWatch Bedrock metrics | >1% error rate, P99 >SLA |
| Quality | Faithfulness, relevance, groundedness | Custom CloudWatch metrics (LLM-as-judge) | Quality score drops >10% from baseline |
| Business | Task completion rate, user satisfaction, abandonment | Custom events + CloudWatch | Completion rate drops >5% week-over-week |
| Drift | Response distribution shift over time | Scheduled Bedrock Model Evaluations | Statistical drift detected |
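The quality and business alerts in the table are comparisons against a rolling baseline rather than fixed thresholds. A sketch of the baseline check and the custom-metric payload it feeds CloudWatch; the `GenAI/Quality` namespace and metric names are assumptions, not AWS-defined values:

```python
def breaches_baseline(baseline: float, current: float, max_drop: float = 0.10) -> bool:
    """True when the current score has fallen more than max_drop
    (fractional) below the baseline, e.g. the table's 10% quality drop."""
    if baseline <= 0:
        return False                      # no meaningful baseline yet
    return (baseline - current) / baseline > max_drop

def quality_metric_datum(score: float, metric: str = "Faithfulness") -> dict:
    """Build one put_metric_data datum for a custom quality metric."""
    return {
        "MetricName": metric,             # e.g. Faithfulness, Relevance, Groundedness
        "Value": score,
        "Unit": "None",
    }

# Publishing requires AWS credentials:
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="GenAI/Quality",          # hypothetical custom namespace
#     MetricData=[quality_metric_datum(0.82)],
# )
```

A CloudWatch alarm on the custom metric then covers the "Quality score drops >10% from baseline" row the same way the infrastructure alarms cover error rates.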
⚠️ Common Misconception: "If there are no errors in CloudWatch Logs, the application is working correctly." In reality, FM applications can produce confident, fluent, semantically coherent — but factually incorrect — responses with no infrastructure error signal whatsoever. Quality monitoring requires evaluation metrics, not just error rates.
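One way to produce those evaluation metrics is the LLM-as-judge pattern from the table: a second model grades each response against its retrieved context. A sketch of the prompt construction and score parsing; the prompt wording, the score-extraction regex, and the judge model ID are illustrative assumptions, and the Bedrock Converse call is shown commented out since it needs credentials:

```python
import re

JUDGE_TEMPLATE = (
    "You are grading an answer against source passages.\n"
    "Rate faithfulness from 0.0 (contradicts sources) to 1.0 (fully grounded).\n"
    'Respond with only JSON: {{"faithfulness": <score>}}\n\n'
    "Sources:\n{context}\n\nAnswer:\n{answer}"
)

def build_judge_prompt(context: str, answer: str) -> str:
    """Fill the judge template with the retrieved context and the response."""
    return JUDGE_TEMPLATE.format(context=context, answer=answer)

def parse_faithfulness(judge_output: str):
    """Extract the numeric score from the judge's JSON reply, or None."""
    match = re.search(r'"faithfulness"\s*:\s*([01](?:\.\d+)?)', judge_output)
    return float(match.group(1)) if match else None

# Calling the judge model (requires AWS credentials):
# import boto3
# client = boto3.client("bedrock-runtime")
# resp = client.converse(
#     modelId="anthropic.claude-3-5-haiku-20241022-v1:0",  # cheaper judge model, illustrative
#     messages=[{"role": "user", "content": [{"text": build_judge_prompt(ctx, ans)}]}],
# )
# score = parse_faithfulness(resp["output"]["message"]["content"][0]["text"])
```

The parsed score can then be published as the custom CloudWatch quality metric from the table, closing the loop between evaluation and alerting.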