Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.8. Reflection Checkpoint

Ingestion and transformation patterns account for 34% of exam questions, the largest single domain, so gaps here are costly. Consider this: if you cannot explain why Firehose stages data in S3 before loading it into Redshift, the architecture questions will feel like guessing.

Key Takeaways

Before proceeding, ensure you can:

  • Choose between Kinesis Data Streams, Firehose, and MSK based on latency, ecosystem, and operational requirements
  • Explain when to use DMS (database CDC) vs AppFlow (SaaS integration) for data ingestion
  • Select the appropriate transformation service: Glue ETL for serverless, EMR for custom/complex, Lambda for lightweight, Redshift SQL for in-warehouse
  • Justify format choices: Parquet for analytics, Avro for streaming/schema evolution, CSV/JSON for raw landing
  • Compare orchestration services: Step Functions for AWS-native, MWAA for complex/cross-system, Glue Workflows for Glue-only
  • Apply CI/CD and IaC concepts to data pipeline deployment using CodePipeline, CloudFormation, and CDK
  • Identify throttling scenarios and architectural solutions (buffering with SQS, improved partition keys, on-demand scaling)
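The partition-key point in the last bullet can be made concrete with a small simulation. Kinesis assigns each record to a shard by hashing its partition key, so a low-cardinality key (e.g., a single store ID) concentrates all traffic on one shard and triggers throttling even when total capacity is sufficient. The sketch below mimics that assignment with an MD5-hash-modulo mapping; the field names (`store_id`, `txn_id`) are illustrative, not from any real API:

```python
import hashlib
from collections import Counter

def shard_for(partition_key: str, num_shards: int) -> int:
    # Kinesis hashes the partition key (MD5) to pick a shard; this
    # modulo mapping is a simplified stand-in for that behavior.
    digest = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return digest % num_shards

NUM_SHARDS = 4
records = [{"store_id": "store-7", "txn_id": f"txn-{i}"} for i in range(10_000)]

# Hot key: every record shares one store_id, so a single shard takes it all.
hot = Counter(shard_for(r["store_id"], NUM_SHARDS) for r in records)

# High-cardinality key: unique txn_ids spread records across all shards.
spread = Counter(shard_for(r["txn_id"], NUM_SHARDS) for r in records)

print("hot-key distribution:   ", dict(hot))     # one shard holds all 10,000
print("high-cardinality spread:", dict(spread))  # traffic spread across shards
```

Switching the partition key from `store_id` to `txn_id` is the kind of "improved partition key" fix the exam expects you to recognize.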

Connecting Forward

In Phase 3, you'll apply the pipeline knowledge from this phase to the destination side — choosing between S3, Redshift, DynamoDB, and Aurora based on access patterns and cost. You'll also learn about the new open table formats (Apache Iceberg, S3 Tables) and vector databases that appeared in the v1.1 syllabus update. The data formats and partitioning concepts from Section 2.5 will directly inform how you design data store schemas in Phase 3.

Self-Check Questions

  1. A retail company receives 50,000 point-of-sale transactions per second during Black Friday but only 500/second on normal days. They need transactions available for fraud analysis within 5 seconds and for daily reporting within 24 hours. Design a pipeline that handles both use cases. Which ingestion service handles the streaming path? Which handles the batch path? How do you handle the 100x traffic spike?

  2. An ETL pipeline runs a Glue Spark job that reads 200 GB of CSV files from S3, joins them with a 5 GB reference table in RDS, converts the output to Parquet, and writes it to a curated S3 zone. The job runs nightly and currently takes 3 hours. The team wants to reduce both runtime and cost. Identify at least three optimizations and the AWS features that enable each.
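For question 1, a good first step is sizing the streaming path. A provisioned Kinesis shard ingests up to 1,000 records/s or 1 MiB/s, whichever limit binds first; the average record size below is an assumption for illustration. This sketch shows why the 100x spike is hard to pre-provision for, and why on-demand capacity mode is attractive:

```python
import math

# Kinesis per-shard ingest limits (provisioned mode).
RECORDS_PER_SHARD = 1_000      # records per second
BYTES_PER_SHARD = 1_048_576    # 1 MiB per second
AVG_RECORD_BYTES = 500         # assumed average POS transaction size

def shards_needed(records_per_sec: int) -> int:
    # Capacity must satisfy both the record-count and throughput limits.
    by_count = math.ceil(records_per_sec / RECORDS_PER_SHARD)
    by_bytes = math.ceil(records_per_sec * AVG_RECORD_BYTES / BYTES_PER_SHARD)
    return max(by_count, by_bytes)

print(shards_needed(500))     # normal day -> 1 shard
print(shards_needed(50_000))  # Black Friday peak -> 50 shards
```

Paying for 50 shards year-round to cover one day's peak is exactly the cost profile that on-demand mode (or aggressive resharding) addresses.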
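For question 2, it helps to attach numbers to "reduce cost". Glue bills per DPU-hour; the rate below is the published us-east-1 price at the time of writing and the DPU count and optimized runtime are hypothetical, so treat this purely as a back-of-the-envelope model:

```python
# Assumed Glue rate: $0.44 per DPU-hour (us-east-1; verify for your region).
DPU_HOUR_RATE = 0.44

def glue_job_cost(dpus: int, hours: float, rate: float = DPU_HOUR_RATE) -> float:
    # Glue charges DPUs x runtime at the hourly rate.
    return round(dpus * hours * rate, 2)

baseline = glue_job_cost(dpus=10, hours=3.0)   # nightly 3-hour CSV job
# Hypothetical optimized run: Parquet input plus filter pushdown cuts
# runtime to 1 hour on the same 10 DPUs.
optimized = glue_job_cost(dpus=10, hours=1.0)

print(f"baseline:  ${baseline:.2f}/night")   # $13.20
print(f"optimized: ${optimized:.2f}/night")  # $4.40
```

Because runtime is the multiplier, optimizations that shrink scanned data (Parquet conversion upstream, partition pruning, pushdown predicates) pay off twice: faster jobs and a proportionally smaller bill.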

Written by Alvin Varughese
Founder, 15 professional certifications