1.4. Reflection Checkpoint
Without these foundational concepts, every subsequent topic will feel disconnected — like trying to read a map without knowing which direction is north. Think of this checkpoint as your compass calibration.
Key Takeaways
Before proceeding, ensure you can:
- Explain the five stages of a data pipeline and identify which stage an exam scenario is testing
- Distinguish between batch and streaming paradigms and identify when micro-batching applies
- Apply the cost-performance-reliability triangle to evaluate competing service options
- Explain why schema-on-read and schema-on-write serve different purposes and when each applies
- Identify the five most heavily tested services (Glue, S3, Redshift, Kinesis, Lake Formation) and their primary roles
Connecting Forward
In Phase 2, you'll apply these mental models to specific AWS services. When we examine Kinesis Data Streams vs. Firehose, you'll use the batch-vs-streaming framework. When we compare Glue ETL vs. EMR, you'll use the managed-vs-serverless trade-off. The optimization triangle will guide every "most cost-effective" or "minimum latency" question you encounter.
Self-Check Questions
- A retail company generates 10 million order records per day in its RDS PostgreSQL database. The analytics team wants daily sales reports. The CEO also wants real-time fraud alerts. How many pipeline paradigms does this require, and what AWS services would you consider for each?
- An IoT platform stores raw sensor data in S3 as JSON. A machine learning team needs to run Spark jobs on the data, while a business intelligence team needs structured dashboards. Describe an architecture that serves both teams. Which pattern from this phase does it follow?
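To check your reasoning on the first scenario, the batch-vs-streaming decision can be sketched as a simple freshness-requirement lookup. This is an illustrative study aid, not AWS guidance: the function name, thresholds, and service pairings below are assumptions chosen to mirror the framework from this phase.

```python
def choose_paradigm(max_latency_seconds: float) -> str:
    """Map a data-freshness requirement to a pipeline paradigm.

    Thresholds are illustrative: sub-second needs suggest streaming,
    minutes-scale buffering suggests micro-batching, and anything
    slower is comfortably served by batch.
    """
    if max_latency_seconds < 1:
        return "streaming"    # e.g., Kinesis Data Streams
    if max_latency_seconds < 300:
        return "micro-batch"  # e.g., Firehose buffered delivery
    return "batch"            # e.g., scheduled Glue ETL job

# The retail scenario requires two paradigms from one source:
print(choose_paradigm(24 * 3600))  # daily sales reports -> batch
print(choose_paradigm(0.5))        # real-time fraud alerts -> streaming
```

If your answer identified two paradigms (batch for the reports, streaming for the fraud alerts) fed from the same RDS source, you applied the framework correctly.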