4.3.1. CloudWatch Metrics, Logs, and Alarms
💡 First Principle: CloudWatch is the unified monitoring nervous system for all AWS services. Every service publishes metrics to CloudWatch automatically: Glue job duration, Lambda invocation errors, Kinesis iterator age, Redshift query throughput. Alarms turn passive metrics into active notifications, ensuring problems are detected before users notice.
CloudWatch Metrics: built-in metrics for every AWS service. Key data engineering metrics: Glue job run status and duration, Lambda errors and throttles, Kinesis IteratorAge (how far behind a consumer is, critical for detecting slow consumers), DynamoDB consumed capacity and throttled requests, and Redshift query duration.
CloudWatch Logs: centralized log storage. Glue jobs, Lambda functions, and EMR clusters write logs to CloudWatch Log Groups. CloudWatch Logs Insights provides a SQL-like query language for searching and analyzing logs interactively. Subscription filters stream log events to Lambda, Kinesis, or OpenSearch for real-time processing.
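As a sketch of querying Glue error logs with Logs Insights, the snippet below builds the parameters for boto3's `logs.start_query` call. The log group name and the one-hour lookback window are illustrative assumptions; substitute your own log group.

```python
import time

# Assumed log group name for Glue job error output -- adjust to your account.
LOG_GROUP = "/aws-glue/jobs/error"

# A Logs Insights query: the 20 most recent log lines containing "ERROR".
INSIGHTS_QUERY = """
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
""".strip()

def build_start_query_params(log_group, query, lookback_seconds=3600):
    """Build the parameter dict for logs_client.start_query()."""
    now = int(time.time())
    return {
        "logGroupName": log_group,
        "startTime": now - lookback_seconds,  # epoch seconds
        "endTime": now,
        "queryString": query,
    }

params = build_start_query_params(LOG_GROUP, INSIGHTS_QUERY)
# With boto3: boto3.client("logs").start_query(**params),
# then poll get_query_results(queryId=...) until the status is "Complete".
```

Queries run asynchronously: `start_query` returns a query ID, and results are fetched separately, which is why the polling step is noted in the comment.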
CloudWatch Alarms: trigger actions when metrics cross thresholds. Alarm → SNS → email/PagerDuty for alerting. Alarm → Lambda for automated remediation (e.g., restart a failed Glue job). Composite alarms combine multiple alarms with AND/OR logic.
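A minimal sketch of the alarm → SNS pattern for a lagging Kinesis consumer: the function below builds the parameter dict that would be passed to boto3's `cloudwatch.put_metric_alarm`. The stream name, SNS topic ARN, and the two-period evaluation window are illustrative assumptions.

```python
def build_iterator_age_alarm(stream_name, sns_topic_arn, threshold_ms=60_000):
    """Parameter dict for cloudwatch.put_metric_alarm(): alert when the
    consumer falls more than 60 seconds behind the producer."""
    return {
        "AlarmName": f"{stream_name}-iterator-age",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 60,               # evaluate each one-minute window
        "EvaluationPeriods": 2,     # require 2 consecutive breaches
        "Threshold": threshold_ms,  # milliseconds of consumer lag
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify the on-call topic
    }

alarm = build_iterator_age_alarm(
    "clickstream", "arn:aws:sns:us-east-1:123456789012:data-alerts")
# With boto3: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Requiring two consecutive breached periods is a common way to avoid paging on a single transient spike.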
Amazon Managed Grafana provides advanced dashboarding with CloudWatch as a data source. Use Grafana when teams need rich operational dashboards beyond what CloudWatch natively offers.
Key data engineering monitoring patterns:
| Metric | Service | Alert Threshold | Meaning |
|---|---|---|---|
| GetRecords.IteratorAgeMilliseconds | Kinesis | > 60,000 ms (60 s) | Consumer falling behind producer |
| glue.driver.aggregate.numFailedTasks | Glue | > 0 | Spark tasks failing |
| Errors | Lambda | > expected baseline | Function execution failures |
| DatabaseConnections | RDS/Aurora | > 80% of max_connections | Connection pool exhaustion |
| WriteThrottleEvents | DynamoDB | > 0 | Write capacity exceeded |
For centralized monitoring across pipelines, create a CloudWatch dashboard that aggregates metrics from all pipeline components (Glue job status, Lambda error rates, Kinesis throughput, and Redshift query duration), providing a single pane of glass for the data engineering team.
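One way to sketch such a dashboard is to build the JSON body accepted by `cloudwatch.put_dashboard`. The stream name, function name, and widget layout below are illustrative assumptions; a real pipeline dashboard would add widgets for Glue and Redshift as well.

```python
import json

def build_pipeline_dashboard_body(stream_name, function_name):
    """Dashboard body JSON for cloudwatch.put_dashboard(): two side-by-side
    metric widgets on a 24-column grid."""
    widgets = [
        {"type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
         "properties": {
             "title": "Kinesis iterator age",
             "metrics": [["AWS/Kinesis", "GetRecords.IteratorAgeMilliseconds",
                          "StreamName", stream_name]],
             "stat": "Maximum", "period": 60}},
        {"type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
         "properties": {
             "title": "Lambda errors",
             "metrics": [["AWS/Lambda", "Errors",
                          "FunctionName", function_name]],
             "stat": "Sum", "period": 60}},
    ]
    return json.dumps({"widgets": widgets})

body = build_pipeline_dashboard_body("clickstream", "transform-records")
# With boto3: boto3.client("cloudwatch").put_dashboard(
#     DashboardName="data-pipeline", DashboardBody=body)
```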
⚠️ Exam Trap: CloudWatch custom metrics require explicit PutMetricData API calls; they're not automatic. If a question describes monitoring a business metric (e.g., "records processed per batch"), you need to publish it as a custom metric from your Lambda or Glue code. Built-in metrics only cover infrastructure-level data.
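To make the custom-metric point concrete, here is a sketch of the payload such code would publish via boto3's `cloudwatch.put_metric_data`. The namespace `DataPipeline`, the metric name, and the `BatchId` dimension are illustrative assumptions, not AWS-defined names.

```python
from datetime import datetime, timezone

def build_records_processed_metric(batch_id, record_count):
    """Parameter dict for cloudwatch.put_metric_data(): publish a business
    metric ("records processed per batch") that CloudWatch cannot see on
    its own."""
    return {
        "Namespace": "DataPipeline",  # any custom (non-AWS/) namespace
        "MetricData": [{
            "MetricName": "RecordsProcessedPerBatch",
            "Dimensions": [{"Name": "BatchId", "Value": batch_id}],
            "Timestamp": datetime.now(timezone.utc),
            "Value": float(record_count),
            "Unit": "Count",
        }],
    }

payload = build_records_processed_metric("batch-0017", 15000)
# With boto3: boto3.client("cloudwatch").put_metric_data(**payload)
```

Once published, the custom metric behaves like any built-in one: it can drive alarms and appear on dashboards.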
Reflection Question: A Kinesis Data Streams consumer processes records for a real-time dashboard. The dashboard occasionally shows stale data. Which CloudWatch metric would you alarm on, and what does a high value indicate?