5.2.1. CloudWatch, X-Ray, and Observability Tools
💡 First Principle: Observability has three pillars—metrics, logs, and traces—and AWS provides dedicated tooling for each. Knowing which pillar a scenario requires tells you which service to recommend, and the exam frequently tests this mapping.
Metrics tell you what is happening: CPU utilization, memory usage, endpoint latency, invocation counts. Amazon CloudWatch collects and visualizes these metrics. You create CloudWatch Alarms that trigger when metrics cross thresholds—for example, alerting when endpoint latency exceeds 500ms or when GPU utilization drops below 10% (indicating over-provisioning).
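As a concrete sketch, the alarm described above can be expressed as a parameter set for CloudWatch's `PutMetricAlarm` API. The endpoint name `my-endpoint` and the alarm name are hypothetical; the namespace, metric name, and dimensions follow SageMaker's published endpoint metrics (note that `ModelLatency` is reported in microseconds):

```python
# Sketch: parameters for a CloudWatch alarm on SageMaker endpoint latency.
# "my-endpoint" is a hypothetical name; in practice you would pass this dict
# to boto3's cloudwatch.put_metric_alarm(**alarm_params).
alarm_params = {
    "AlarmName": "endpoint-latency-high",
    "Namespace": "AWS/SageMaker",           # SageMaker publishes endpoint metrics here
    "MetricName": "ModelLatency",           # per-request model latency, in microseconds
    "Dimensions": [
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    "Statistic": "Average",
    "Period": 60,                           # evaluate on 60-second windows
    "EvaluationPeriods": 3,                 # require 3 consecutive breaches
    "Threshold": 500_000,                   # 500 ms, expressed in microseconds
    "ComparisonOperator": "GreaterThanThreshold",
}
```

Requiring several consecutive evaluation periods avoids alarming on a single transient spike—a common exam distinction between "alert immediately" and "alert on sustained breach."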
Logs tell you why something happened: error messages, stack traces, request/response payloads. CloudWatch Logs stores log data from SageMaker training jobs, endpoints, and processing jobs. CloudWatch Logs Insights lets you query logs with its purpose-built, pipe-delimited query language—useful for troubleshooting why a specific inference request failed.
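A minimal sketch of such a troubleshooting query, using the Logs Insights query language (`fields`, `filter`, `sort`, `limit`). The endpoint name in the log group is hypothetical; the log group prefix follows SageMaker's convention for endpoint logs:

```python
# Sketch: a CloudWatch Logs Insights query that surfaces recent errors from a
# SageMaker endpoint's log group. "my-endpoint" is a hypothetical name.
log_group = "/aws/sagemaker/Endpoints/my-endpoint"

query = """
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
"""
# In practice you would run this with boto3:
# logs.start_query(logGroupName=log_group, queryString=query,
#                  startTime=..., endTime=...)
```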
Traces tell you where time is spent across distributed systems: how long each service in a pipeline took, where bottlenecks exist. AWS X-Ray provides distributed tracing across AWS services. For ML pipelines that span Lambda → SageMaker → DynamoDB → S3, X-Ray reveals which component introduces latency.
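The stitching X-Ray does across Lambda → SageMaker → DynamoDB → S3 relies on a trace ID propagated between services in the `X-Amzn-Trace-Id` header. As a sketch of the mechanism (the trace ID format below follows X-Ray's documented `1-<epoch hex>-<random hex>` layout; the helper function name is my own):

```python
import os
import time

# Sketch: constructing an X-Ray trace header. Every service in the request
# path forwards this value, which is how X-Ray correlates the per-service
# segments into a single end-to-end trace.
def new_trace_header():
    epoch_hex = format(int(time.time()), "08x")  # 8 hex chars: epoch seconds
    unique = os.urandom(12).hex()                # 24 hex chars of randomness
    trace_id = f"1-{epoch_hex}-{unique}"         # documented X-Ray trace ID format
    return f"Root={trace_id};Sampled=1"          # Sampled=1 -> record this request
```

In real applications you rarely build this by hand—instrumenting with the X-Ray SDK (or enabling active tracing on Lambda) generates and propagates the header automatically—but the exam-relevant point is that one ID follows the request through every service.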
| Tool | Pillar | Primary Use | Exam Signals |
|---|---|---|---|
| CloudWatch Metrics | Metrics | Monitor CPU, memory, latency, invocation count | "Monitor endpoint performance," "utilization metrics" |
| CloudWatch Alarms | Metrics | Alert on threshold breaches | "Alert when latency exceeds," "trigger scaling" |
| CloudWatch Logs | Logs | Store and query application logs | "Troubleshoot errors," "debug failures" |
| CloudWatch Logs Insights | Logs | Interactive log querying | "Analyze log patterns," "find error causes" |
| Lambda Insights | Metrics + Logs | Lambda function monitoring | "Lambda cold starts," "function duration" |
| AWS X-Ray | Traces | Distributed request tracing | "End-to-end latency," "pipeline bottleneck" |
| AWS CloudTrail | Audit logs | API call logging for compliance | "Who did what," "audit trail," "governance" |
| Amazon QuickSight | Visualization | BI dashboards for metrics | "Business dashboard," "visualize metrics" |
CloudTrail deserves special attention because it serves a different purpose than the other observability tools. While CloudWatch monitors operational health, CloudTrail monitors security and compliance by logging every API call made in your AWS account. The exam tests CloudTrail in Domain 4 security questions—"Who started that training job?", "When was the IAM policy changed?", "Was the S3 bucket accessed from an unusual location?"
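The "who started that training job?" question maps directly to CloudTrail's `LookupEvents` API. A minimal sketch of the request parameters (the 24-hour window is an arbitrary example choice):

```python
from datetime import datetime, timedelta, timezone

# Sketch: parameters for CloudTrail's LookupEvents API to answer
# "who started that training job?" over the last 24 hours.
# In practice: cloudtrail.lookup_events(**lookup_params) via boto3.
now = datetime.now(timezone.utc)
lookup_params = {
    "LookupAttributes": [
        # Filter to the SageMaker API call that starts a training job
        {"AttributeKey": "EventName", "AttributeValue": "CreateTrainingJob"},
    ],
    "StartTime": now - timedelta(hours=24),
    "EndTime": now,
}
# Each returned event record carries the IAM identity that made the call,
# which is exactly the "who did what" evidence audit questions ask for.
```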
⚠️ Exam Trap: CloudWatch monitors infrastructure metrics (CPU, latency). SageMaker Model Monitor monitors data and model metrics (feature drift, prediction distribution). A question about "monitoring model accuracy in production" needs Model Monitor, not CloudWatch. A question about "monitoring endpoint response time" needs CloudWatch, not Model Monitor.
Reflection Question: A SageMaker endpoint returns predictions in under 200ms during testing but experiences 2-second latency spikes in production during peak hours. Which observability tools would you use to diagnose this, and in what order?