4.5. Reflection Checkpoint
Without operations knowledge, you will miss the 22% of exam questions that test monitoring, automation, and troubleshooting. Consider a scenario question about a Glue job timing out: unless you understand DPU allocation and job bookmarks, every answer choice looks plausible. Operations is like maintaining a car: building it is only half the job.
Key Takeaways
Before proceeding, ensure you can:
- Choose between Step Functions, MWAA, and Glue Workflows based on complexity and cost
- Select Athena for ad-hoc S3 queries and Redshift for frequent, high-concurrency analytics
- Configure CloudWatch alarms for key pipeline metrics (Glue duration, Lambda errors, Kinesis iterator age)
- Explain the difference between CloudTrail (API auditing) and CloudWatch Logs (application logging)
- Troubleshoot common pipeline failures: Glue OOM, Lambda timeout, Kinesis throttling
- Implement data quality checks with Glue Data Quality (DQDL rules) and DataBrew
- Identify and mitigate data skew in Spark jobs (salting, broadcast joins)
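As a quick illustration of the CloudWatch alarm takeaway, here is a minimal sketch of an alarm definition for Kinesis iterator age. The helper name `iterator_age_alarm`, the stream name, the 60-second threshold, and the evaluation settings are assumptions chosen for illustration; the namespace, metric name, and parameter keys follow the CloudWatch `put_metric_alarm` API.

```python
def iterator_age_alarm(stream_name, threshold_ms=60_000):
    """Build keyword arguments for CloudWatch's put_metric_alarm call.

    The threshold and evaluation settings are illustrative defaults,
    not recommended values.
    """
    return {
        "AlarmName": f"{stream_name}-iterator-age-high",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 60,               # seconds per data point
        "EvaluationPeriods": 5,     # alarm after 5 consecutive breaching periods
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: treat missing data as breaching so that a consumer
        # that stops polling entirely also raises the alarm.
        "TreatMissingData": "breaching",
    }


params = iterator_age_alarm("clickstream-events")  # hypothetical stream name

# With AWS credentials configured, the alarm could then be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the definition as plain data makes it easy to review or reuse for other streams. The same pattern would apply to Glue duration or Lambda error alarms with a different namespace, metric name, and dimensions.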
Connecting Forward
Phase 5 covers the final exam domain — security and governance. IAM policies, Lake Formation permissions, KMS encryption, and Macie PII detection protect the data your pipelines process. The monitoring skills from this phase (CloudTrail, CloudWatch) directly support the audit logging requirements in Phase 5.
Self-Check Questions
- A production data pipeline uses Glue ETL → Redshift COPY → QuickSight dashboards. The dashboard shows stale data every Monday morning. CloudWatch shows the Glue job succeeded Sunday night. Where would you investigate next, and what metrics or logs would you check?
- A Spark job on EMR processes clickstream data partitioned by user_id. The job has been gradually slowing over the past month despite no code changes. Data volume has grown 20%, but runtime has doubled. What's the most likely cause, and how would you confirm and fix it?
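
If you suspect skew on user_id, one way to reason about confirming and mitigating it is to compare partition sizes before and after salting the hot key. The following is a pure-Python simulation, not Spark code: the function names, the CRC32-based partitioner, and the synthetic data (one user producing half of all events) are assumptions made for illustration.

```python
import zlib


def partition_sizes(keys, num_partitions):
    """Count records per partition under simple hash partitioning."""
    sizes = [0] * num_partitions
    for key in keys:
        sizes[zlib.crc32(key.encode()) % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 is even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


def salt_keys(keys, hot_keys, num_salts):
    """Append a rotating salt suffix to hot keys only, spreading them out."""
    salted = []
    for i, key in enumerate(keys):
        if key in hot_keys:
            salted.append(f"{key}#{i % num_salts}")
        else:
            salted.append(key)
    return salted


# Hypothetical clickstream: one bot-like user produces half of all events.
keys = ["user_hot"] * 5000 + [f"user_{i}" for i in range(5000)]

before = partition_sizes(keys, 8)
after = partition_sizes(salt_keys(keys, {"user_hot"}, 16), 8)

print("skew before salting:", round(skew_ratio(before), 2))
print("skew after salting: ", round(skew_ratio(after), 2))
```

In a real Spark job, the equivalent check is to compare per-task input sizes and durations within a stage in the Spark UI. Note that salting a join key also requires replicating the other side of the join across every salt value, and salted aggregations need a second pass to combine the partial results.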