4.3.3. Troubleshooting Pipeline Failures
💡 First Principle: Pipeline failures follow predictable patterns. Most failures fall into five categories: resource limits (memory, timeout, throttling), data issues (schema changes, nulls, corrupt files), permission errors (IAM role missing permissions), network issues (VPC endpoints, security groups), and logic errors (wrong transformation, incorrect joins). Knowing the category narrows the investigation immediately.
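The first move in any investigation is to place the failure in one of these categories. A minimal, illustrative classifier sketch (the regex patterns are assumptions for demonstration, not an official AWS taxonomy; categories are checked in order, so ambiguous messages match the earliest pattern):

```python
import re

# Illustrative error-message patterns for the five failure categories.
# These patterns are assumptions for the sketch, not an exhaustive list.
CATEGORY_PATTERNS = {
    "resource": re.compile(
        r"OutOfMemory|Java heap space|Task timed out|Throttling|Rate exceeded", re.I),
    "data": re.compile(r"schema|NullPointerException|corrupt|malformed", re.I),
    "permission": re.compile(r"AccessDenied|not authorized|Forbidden", re.I),
    "network": re.compile(
        r"Connection refused|Connect timed out|security group|VPC endpoint", re.I),
}

def classify_failure(message: str) -> str:
    """Return the likely failure category for a raw error message."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(message):
            return category
    return "logic"  # nothing matched: suspect the transformation logic itself
```

With a mapping like this, a triage script can route "OutOfMemoryError" straight to capacity/shuffle analysis and "AccessDenied" straight to the IAM role, instead of reading logs from scratch each time.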
Glue job troubleshooting: Check CloudWatch Logs for Spark executor errors. Common failures: out-of-memory (increase DPU count or enable auto-scaling), timeout (job exceeded maximum execution time), and data issues (schema mismatch, corrupt files). Job bookmarks can cause issues if the bookmark state becomes corrupted; reset it by deleting the bookmark.
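Resetting a corrupted bookmark can be done from the console or programmatically; `reset_job_bookmark` is the Glue API call for it. A minimal boto3 sketch (the job name is hypothetical, and the client is injectable so the helper can be exercised without AWS credentials):

```python
def reset_glue_bookmark(job_name: str, client=None):
    """Delete a Glue job's bookmark state so the next run reprocesses
    all input from scratch. Use when the bookmark is suspected corrupt."""
    if client is None:
        import boto3  # imported lazily; real calls need AWS credentials
        client = boto3.client("glue")
    return client.reset_job_bookmark(JobName=job_name)

# reset_glue_bookmark("my-etl-job")  # hypothetical job name
```

Note that after a reset, the next run reprocesses the entire source, so expect a longer (and more expensive) run.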
EMR troubleshooting: Check the EMR step logs in S3 (s3://aws-logs-{account}-{region}/elasticmapreduce/). Common issues: cluster sizing (under-provisioned instances), YARN resource allocation, and Spark executor configuration. EMR logs are stored hierarchically by cluster ID and step ID.
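Because the log hierarchy is fixed, the exact S3 prefix for a failed step can be built mechanically. A small helper, assuming the cluster uses the default aws-logs bucket naming shown above:

```python
def emr_step_log_prefix(account: str, region: str,
                        cluster_id: str, step_id: str) -> str:
    """Build the S3 prefix where EMR stores a step's logs (stderr.gz,
    stdout.gz, controller.gz) under the default log bucket layout."""
    return (f"s3://aws-logs-{account}-{region}/elasticmapreduce/"
            f"{cluster_id}/steps/{step_id}/")
```

From there, `stderr.gz` is usually the first file to read for a failed Spark step.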
Lambda troubleshooting: Check CloudWatch Logs for the function. Common issues: timeout (15-min limit exceeded), memory exhaustion, cold starts causing latency spikes, and permission errors (execution role missing required policies). X-Ray tracing provides distributed tracing for Lambda-based pipelines.
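Timeouts are easy to confirm in the logs: when the configured limit is hit, Lambda writes a line of the form `Task timed out after N seconds`. A sketch that extracts the duration from a log line (the message format is stable in practice, but treat the regex as an assumption):

```python
import re

# Lambda's timeout log line, e.g. "... Task timed out after 900.10 seconds"
TIMEOUT_RE = re.compile(r"Task timed out after ([\d.]+) seconds")

def timeout_seconds(log_line: str):
    """Return the reported duration in seconds, or None if the line
    is not a Lambda timeout message."""
    match = TIMEOUT_RE.search(log_line)
    return float(match.group(1)) if match else None
```

Pairing this with a CloudWatch Logs filter on the literal text "Task timed out" is a quick way to enumerate affected invocations.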
Performance tuning patterns: right-size compute (Glue DPUs, EMR instance types), optimize data formats (CSV → Parquet), improve partitioning, use caching (Redshift result cache, Athena query results), and eliminate unnecessary data movement (use Redshift Spectrum instead of COPY for infrequent queries).
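"Improve partitioning" concretely means writing Hive-style `key=value` directory layouts so Athena, Glue, and Spectrum can prune partitions instead of scanning everything. A stdlib-only sketch of that layout (a real pipeline would emit Parquet rather than CSV; the function and file names here are illustrative):

```python
import csv
import os

def write_partitioned(rows, out_dir, partition_key):
    """Write rows (list of dicts) as Hive-style partitions:
    out_dir/<key>=<value>/part-0000.csv. Illustrates the directory
    layout that partition pruning relies on; the partition column is
    dropped from the files because it is encoded in the path."""
    groups = {}
    for row in rows:
        groups.setdefault(row[partition_key], []).append(row)
    for value, part_rows in groups.items():
        part_dir = os.path.join(out_dir, f"{partition_key}={value}")
        os.makedirs(part_dir, exist_ok=True)
        fields = [k for k in part_rows[0] if k != partition_key]
        with open(os.path.join(part_dir, "part-0000.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            for row in part_rows:
                writer.writerow({k: row[k] for k in fields})
```

A query filtered on the partition column (for example `WHERE dt = '2024-01-01'`) then touches only the matching directory.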
⚠️ Exam Trap: When a Glue job fails with "out of memory," the answer is usually NOT to increase the number of workers. First, check if the job is processing more data than expected (missing job bookmark), reading unnecessarily wide datasets (lack of column projection), or creating excessive shuffle (repartition or coalesce the data). Architectural fixes before capacity increases.
Reflection Question: A Glue ETL job that ran successfully for months suddenly fails with "Exit Code: 1." CloudWatch Logs show a Java heap space error. The source data volume hasn't changed. What could have changed, and how would you investigate?