Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

6.3.1. CloudWatch Metrics and Alarms for GenAI

💡 First Principle: GenAI monitoring requires a custom metrics layer on top of standard AWS infrastructure metrics because most quality signals — retrieval relevance, response accuracy, user satisfaction, hallucination rate — are not native CloudWatch metrics and must be published from your application.

Critical metrics for GenAI monitoring:
| Metric Category | Metric Name | Alarm Threshold | Source |
|---|---|---|---|
| Availability | 5xxErrorRate | > 1% over 5 min | CloudWatch (native) |
| Latency | P99ResponseTime | > 15 s | Custom metric from Lambda |
| Cost | DailyTokenCost | > budget threshold | Custom from token counts |
| Quality | GroundingScore | < 0.7 average | Custom from Guardrails trace |
| Safety | GuardrailTriggerRate | > 5% of requests | Custom from Guardrails trace |
| Retrieval | AverageRetrievalScore | < 0.6 | Custom from Knowledge Bases |
| Throughput | ThrottledRequestRate | > 2% | Custom from retry logic |
| Accuracy | UserCorrectionRate | > 10% | Custom from user feedback |
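The DailyTokenCost row above is a derived metric, not a native one: you estimate a dollar cost per invocation from its token counts and aggregate over the day. A minimal sketch of that per-invocation calculation (the per-1K-token prices here are placeholders, not real model pricing — substitute your model's actual rates):

```python
def token_cost_usd(input_tokens, output_tokens,
                   input_price_per_1k=0.003, output_price_per_1k=0.015):
    """Estimate the dollar cost of one invocation from its token counts.

    The default per-1K prices are illustrative placeholders only;
    look up the actual on-demand pricing for your model.
    """
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k
```

Publishing this value per request (e.g., as a `TokenCostUSD` custom metric with `Statistic: Sum`) lets a daily-budget alarm fire on the running total rather than on raw token counts.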
Publishing custom quality metrics:
```python
import boto3

# Create the client once at module load, outside the per-request path
cloudwatch = boto3.client('cloudwatch')

def publish_response_quality_metrics(response_data):
    """Publish per-response quality signals as custom CloudWatch metrics."""
    metrics = [
        {
            'MetricName': 'GroundingScore',
            'Value': response_data['grounding_score'],
            'Unit': 'None',
            'Dimensions': [
                {'Name': 'KnowledgeBaseId', 'Value': response_data['kb_id']},
                {'Name': 'ModelId', 'Value': response_data['model_id']}
            ]
        },
        {
            'MetricName': 'RetrievalTopScore',
            'Value': response_data['top_retrieval_score'],
            'Unit': 'None'
        },
        {
            'MetricName': 'ResponseTokenCount',
            'Value': response_data['output_tokens'],
            'Unit': 'Count'
        },
        {
            'MetricName': 'TotalLatencyMs',
            'Value': response_data['total_latency_ms'],
            'Unit': 'Milliseconds'
        }
    ]
    cloudwatch.put_metric_data(Namespace='GenAI/Quality', MetricData=metrics)
```
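Once these metrics flow into the `GenAI/Quality` namespace, the thresholds from the table become ordinary standard alarms. A sketch that assembles `put_metric_alarm` arguments for the GroundingScore threshold (the period, evaluation count, alarm name, and SNS topic ARN are assumptions, not fixed values):

```python
def build_grounding_alarm_kwargs(sns_topic_arn):
    """Build kwargs for cloudwatch.put_metric_alarm: fire when the average
    GroundingScore stays below 0.7 (the table's threshold) for 15 minutes."""
    return {
        'AlarmName': 'GenAI-LowGroundingScore',       # assumed name
        'Namespace': 'GenAI/Quality',
        'MetricName': 'GroundingScore',
        'Statistic': 'Average',
        'Period': 300,                # 5-minute windows (assumed)
        'EvaluationPeriods': 3,       # sustained for 3 windows = 15 min
        'Threshold': 0.7,
        'ComparisonOperator': 'LessThanThreshold',
        'TreatMissingData': 'notBreaching',  # quiet traffic should not page anyone
        'AlarmActions': [sns_topic_arn],
    }

# Usage (requires AWS credentials and a real SNS topic):
# cloudwatch.put_metric_alarm(
#     **build_grounding_alarm_kwargs('arn:aws:sns:us-east-1:123456789012:GenAI-Alerts'))
```

`TreatMissingData='notBreaching'` matters for quality metrics: if no requests arrive overnight, there are no data points, and the default (`missing`) behavior can leave the alarm in an ambiguous state.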
Composite alarms for multi-condition alerting:
```python
# Composite alarm: alert when BOTH latency is high AND error rate is elevated.
# Reuses the boto3 CloudWatch client; the child alarms must already exist.
cloudwatch.put_composite_alarm(
    AlarmName='GenAI-Critical-Degradation',
    AlarmDescription='Both latency and error rate elevated; likely capacity issue',
    AlarmRule='ALARM("GenAI-HighLatency") AND ALARM("GenAI-ElevatedErrorRate")',
    AlarmActions=['arn:aws:sns:...:GenAI-PagerDuty-Critical'],
    OKActions=['arn:aws:sns:...:GenAI-Recovery-Notification']
)
```

⚠️ Exam Trap: CloudWatch alarms on Bedrock native metrics (like InvocationLatency) measure the FM API call time, not your application's end-to-end response time. If your retrieval pipeline is slow, this metric won't capture it. Always instrument end-to-end latency in your application layer separately from Bedrock API latency.
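One way to capture that end-to-end number is to time the whole pipeline in your application code, independent of Bedrock's native `InvocationLatency`. A minimal sketch, where `retrieve` and `invoke_model` are stand-ins for whatever callables your pipeline actually uses (hypothetical names, not AWS APIs):

```python
import time

def timed_pipeline(retrieve, invoke_model, query):
    """Run retrieval then model invocation, returning the answer plus a
    latency breakdown so end-to-end time can be alarmed on separately
    from the FM API call time."""
    start = time.perf_counter()
    context = retrieve(query)                       # e.g., Knowledge Bases query
    retrieval_ms = (time.perf_counter() - start) * 1000

    model_start = time.perf_counter()
    answer = invoke_model(query, context)           # the FM call Bedrock measures
    model_ms = (time.perf_counter() - model_start) * 1000

    total_ms = (time.perf_counter() - start) * 1000
    # Publish total_ms as the custom TotalLatencyMs metric; alarm on it for
    # user-facing SLOs, and keep InvocationLatency for diagnosing the FM itself.
    return answer, {'retrieval_ms': retrieval_ms,
                    'model_ms': model_ms,
                    'total_ms': total_ms}
```

Comparing `total_ms` against `model_ms` on a dashboard makes it obvious when the retrieval stage, not the model, is the source of slowness.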

Reflection Question: At 2pm on a Tuesday, users start reporting that the chatbot "keeps making things up." Your CloudWatch dashboard shows: Bedrock InvocationLatency = normal, Lambda error rate = 0%, 5xx rate = 0%. What category of metric is missing from your monitoring setup, and what specifically would have caught this issue?

Written by Alvin Varughese, Founder, 15 professional certifications.