Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

4.3.1. CloudWatch Metrics, Logs, and Alarms

💡 First Principle: CloudWatch is the unified monitoring nervous system for AWS. Most services publish metrics to CloudWatch automatically: Glue job duration, Lambda invocation errors, Kinesis iterator age, Redshift query throughput. Alarms turn passive metrics into active notifications, so problems are detected before users notice.

CloudWatch Metrics: built-in metrics that AWS services publish without extra code. Key data engineering metrics: Glue job run status and duration, Lambda errors and throttles, Kinesis IteratorAge (how far behind a consumer is, which is critical for detecting slow consumers; Kinesis Data Streams publishes it as GetRecords.IteratorAgeMilliseconds), DynamoDB consumed capacity and throttled requests, and Redshift query duration.

CloudWatch Logs: centralized log storage. Glue jobs, Lambda functions, and EMR clusters write logs to CloudWatch log groups. CloudWatch Logs Insights provides a purpose-built, pipe-delimited query language for searching and analyzing logs interactively. Subscription filters stream log events to Lambda, Kinesis, or OpenSearch for real-time processing.
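
To make the Logs Insights description concrete, the sketch below builds the parameters for a StartQuery call that finds recent error lines. The log group name and the one-hour lookback are hypothetical values chosen for illustration, and the boto3 call is shown only as a comment because it needs credentials and a real log group.

```python
import time

# Hypothetical log group name; real names depend on how the job is deployed.
LOG_GROUP = "/aws/lambda/example-ingest-function"

# Logs Insights query: the 20 most recent messages containing "ERROR".
QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /ERROR/ "
    "| sort @timestamp desc "
    "| limit 20"
)


def build_start_query_params(log_group, query, lookback_seconds=3600, now=None):
    """Return keyword arguments for the CloudWatch Logs StartQuery API."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - lookback_seconds,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


params = build_start_query_params(LOG_GROUP, QUERY, now=1_700_000_000)

# With boto3 installed and credentials configured, this could be run as:
#   import boto3
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**params)["queryId"]
#   results = logs.get_query_results(queryId=query_id)  # poll until "Complete"
```

Queries run asynchronously, so a caller would typically poll get_query_results until the status field reports completion.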

CloudWatch Alarms: trigger actions when a metric crosses a threshold for a configured number of evaluation periods. Alarm → SNS → email/PagerDuty for alerting. Alarm → Lambda for automated remediation (e.g., restart a failed Glue job). Composite alarms combine multiple alarms with AND/OR logic.
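
As a rough illustration of the Alarm → SNS pattern and of composite alarms, the sketch below builds the request parameters for PutMetricAlarm and the rule string for PutCompositeAlarm. The function name, topic ARN, period, and threshold are all hypothetical; a real pipeline would tune them to its own traffic.

```python
def lambda_error_alarm(function_name, sns_topic_arn, threshold=1):
    """Keyword arguments for CloudWatch PutMetricAlarm on Lambda Errors."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,             # seconds per evaluation window (assumed)
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no invocations is not a failure
        "AlarmActions": [sns_topic_arn],     # SNS topic fans out to email etc.
    }


def composite_rule(alarm_names, operator="OR"):
    """Build an AlarmRule expression combining child alarms with AND/OR."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


# Hypothetical names and ARN, for illustration only.
alarm = lambda_error_alarm(
    "example-ingest-function",
    "arn:aws:sns:us-east-1:123456789012:example-alerts",
)
rule = composite_rule(["example-ingest-function-errors", "example-stream-lag"])

# With boto3 installed and credentials configured:
#   import boto3
#   cw = boto3.client("cloudwatch")
#   cw.put_metric_alarm(**alarm)
#   cw.put_composite_alarm(AlarmName="pipeline-unhealthy", AlarmRule=rule)
```

A composite alarm of this kind can cut alert noise, since one notification fires for the pipeline rather than one per component.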

Amazon Managed Grafana provides advanced dashboarding with CloudWatch as a data source. Use Grafana when teams need rich operational dashboards beyond what CloudWatch natively offers.

Key data engineering monitoring patterns:
Metric                                | Service    | Alert Threshold | Meaning
IteratorAge                           | Kinesis    | > 60 seconds    | Consumer falling behind producer
glue.driver.aggregate.numFailedTasks  | Glue       | > 0             | Spark tasks failing
Errors                                | Lambda     | > threshold     | Function execution failures
DatabaseConnections                   | RDS/Aurora | > 80% of max    | Connection pool exhaustion
WriteThrottleEvents                   | DynamoDB   | > 0             | Write capacity exceeded
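
Two of the thresholds above could be expressed as alarm definitions along the following lines. This is a sketch, not a recommended configuration: the stream and table names are hypothetical, the periods are assumptions, and the Kinesis metric is written with its Data Streams name (GetRecords.IteratorAgeMilliseconds), so the 60-second threshold is converted to milliseconds.

```python
def iterator_age_alarm(stream_name, max_lag_seconds=60):
    """PutMetricAlarm arguments: consumer lag on a Kinesis data stream."""
    return {
        "AlarmName": f"{stream_name}-iterator-age",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,                # assumed: sustained lag only
        "Threshold": max_lag_seconds * 1000,   # metric is in milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
    }


def write_throttle_alarm(table_name):
    """PutMetricAlarm arguments: any throttled write on a DynamoDB table."""
    return {
        "AlarmName": f"{table_name}-write-throttles",
        "Namespace": "AWS/DynamoDB",
        "MetricName": "WriteThrottleEvents",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # metric is absent when no throttling
    }


# Hypothetical resource names, for illustration only.
alarms = [iterator_age_alarm("example-stream"), write_throttle_alarm("example-table")]

# With boto3 installed and credentials configured:
#   import boto3
#   cw = boto3.client("cloudwatch")
#   for a in alarms:
#       cw.put_metric_alarm(**a)
```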

For centralized monitoring across pipelines, create a CloudWatch dashboard that aggregates metrics from all pipeline components — Glue job status, Lambda error rates, Kinesis throughput, and Redshift query duration — providing a single pane of glass for the data engineering team.
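
A dashboard of this kind is defined by a JSON body passed to PutDashboard. The sketch below assembles a two-widget body; the resource names, region, layout, and dashboard name are hypothetical, and a fuller dashboard would add Glue and Redshift widgets in the same way.

```python
import json


def metric_widget(title, metrics, x, y, region="us-east-1", stat="Sum"):
    """One metric widget in CloudWatch dashboard-body format."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": 12,   # the dashboard grid is 24 units wide
        "height": 6,
        "properties": {
            "title": title,
            "metrics": metrics,  # [namespace, metric, dim name, dim value]
            "stat": stat,
            "period": 300,
            "region": region,
        },
    }


def build_dashboard_body(function_name, stream_name):
    """Return the JSON string expected by PutDashboard's DashboardBody."""
    widgets = [
        metric_widget(
            "Lambda errors",
            [["AWS/Lambda", "Errors", "FunctionName", function_name]],
            x=0, y=0,
        ),
        metric_widget(
            "Kinesis iterator age (ms)",
            [["AWS/Kinesis", "GetRecords.IteratorAgeMilliseconds",
              "StreamName", stream_name]],
            x=12, y=0, stat="Maximum",
        ),
    ]
    return json.dumps({"widgets": widgets})


# Hypothetical resource names, for illustration only.
body = build_dashboard_body("example-ingest-function", "example-stream")

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="example-pipeline", DashboardBody=body)
```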

āš ļø Exam Trap: CloudWatch custom metrics require explicit PutMetricData API calls — they're not automatic. If a question describes monitoring a business metric (e.g., "records processed per batch"), you need to publish it as a custom metric from your Lambda or Glue code. Built-in metrics only cover infrastructure-level data.

Reflection Question: A Kinesis Data Streams consumer processes records for a real-time dashboard. The dashboard occasionally shows stale data. Which CloudWatch metric would you alarm on, and what does a high value indicate?

Written by Alvin Varughese
Founder • 15 professional certifications