6.3.1. CloudWatch Metrics and Alarms for GenAI
💡 First Principle: GenAI monitoring requires a custom metrics layer on top of standard AWS infrastructure metrics because most quality signals — retrieval relevance, response accuracy, user satisfaction, hallucination rate — are not native CloudWatch metrics and must be published from your application.
Critical metrics for GenAI monitoring:
| Metric Category | Metric Name | Alarm Threshold | Source |
|---|---|---|---|
| Availability | 5xxErrorRate | > 1% over 5 min | CloudWatch (native) |
| Latency | P99ResponseTime | > 15s | Custom metric from Lambda |
| Cost | DailyTokenCost | > budget threshold | Custom from token counts |
| Quality | GroundingScore | < 0.7 average | Custom from Guardrails trace |
| Safety | GuardrailTriggerRate | > 5% of requests | Custom from Guardrails trace |
| Retrieval | AverageRetrievalScore | < 0.6 | Custom from Knowledge Bases |
| Throughput | ThrottledRequestRate | > 2% | Custom from retry logic |
| Accuracy | UserCorrectionRate | > 10% | Custom from user feedback |
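The DailyTokenCost row in the table is a derived metric: you compute a per-invocation cost from token counts and alarm on its daily SUM. A minimal sketch of that derivation — the per-1K-token prices below are placeholders, not real Bedrock rates, and `publish_cost_metric` is a hypothetical helper name:

```python
# Placeholder per-1K-token prices -- substitute the actual rates for your
# model from the Bedrock pricing page; these numbers are illustrative only.
PRICE_PER_1K_TOKENS = {'input': 0.003, 'output': 0.015}

def invocation_cost_usd(input_tokens, output_tokens, prices=PRICE_PER_1K_TOKENS):
    """Estimate the USD cost of one invocation from its token counts."""
    return (input_tokens / 1000) * prices['input'] + \
           (output_tokens / 1000) * prices['output']

def publish_cost_metric(cloudwatch, input_tokens, output_tokens):
    """Publish the cost as a custom metric; alarm on its SUM over a
    24-hour period to implement the DailyTokenCost budget threshold."""
    cloudwatch.put_metric_data(
        Namespace='GenAI/Cost',
        MetricData=[{
            'MetricName': 'InvocationCostUSD',
            'Value': invocation_cost_usd(input_tokens, output_tokens),
            'Unit': 'None'
        }]
    )
```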
Publishing custom quality metrics:
```python
import boto3

cloudwatch = boto3.client('cloudwatch')

def publish_response_quality_metrics(response_data):
    """Publish per-response quality signals to a custom CloudWatch namespace."""
    metrics = [
        {
            'MetricName': 'GroundingScore',
            'Value': response_data['grounding_score'],
            'Unit': 'None',
            'Dimensions': [
                {'Name': 'KnowledgeBaseId', 'Value': response_data['kb_id']},
                {'Name': 'ModelId', 'Value': response_data['model_id']}
            ]
        },
        {
            'MetricName': 'RetrievalTopScore',
            'Value': response_data['top_retrieval_score'],
            'Unit': 'None'
        },
        {
            'MetricName': 'ResponseTokenCount',
            'Value': response_data['output_tokens'],
            'Unit': 'Count'
        },
        {
            'MetricName': 'TotalLatencyMs',
            'Value': response_data['total_latency_ms'],
            'Unit': 'Milliseconds'
        }
    ]
    cloudwatch.put_metric_data(Namespace='GenAI/Quality', MetricData=metrics)
```
Composite alarms for multi-condition alerting:
```python
# Composite alarm: alert when BOTH latency is high AND error rate is elevated
cloudwatch.put_composite_alarm(
    AlarmName='GenAI-Critical-Degradation',
    AlarmDescription='Both latency and error rate elevated — likely capacity issue',
    AlarmRule='ALARM("GenAI-HighLatency") AND ALARM("GenAI-ElevatedErrorRate")',
    AlarmActions=['arn:aws:sns:...:GenAI-PagerDuty-Critical'],
    OKActions=['arn:aws:sns:...:GenAI-Recovery-Notification']
)
```
⚠️ Exam Trap: CloudWatch alarms on Bedrock native metrics (like InvocationLatency) measure the FM API call time, not your application's end-to-end response time. If your retrieval pipeline is slow, this metric won't capture it. Always instrument end-to-end latency in your application layer separately from Bedrock API latency.
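One way to capture the end-to-end time the trap describes is a timing wrapper around the whole request path (retrieval, generation, and post-processing together). A hypothetical sketch — `publish` stands in for whatever sink you use, typically a function that calls cloudwatch.put_metric_data with TotalLatencyMs:

```python
import functools
import time

def track_latency(publish):
    """Decorator: measure wall-clock end-to-end latency of the wrapped
    handler in milliseconds and hand it to `publish`, even on failure."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Covers retrieval + model call + post-processing, unlike
                # Bedrock's InvocationLatency, which times only the FM API call.
                publish((time.perf_counter() - start) * 1000.0)
        return inner
    return wrap
```

Because the publish happens in a `finally` block, latency is recorded even when the handler raises, so slow failures show up in the same metric as slow successes.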
Reflection Question: At 2pm on a Tuesday, users start reporting that the chatbot "keeps making things up." Your CloudWatch dashboard shows: Bedrock InvocationLatency = normal, Lambda error rate = 0%, 5xx rate = 0%. What category of metric is missing from your monitoring setup, and what specifically would have caught this issue?