AWS Certified Data Engineer — Associate (DEA-C01) Study Guide
A First-Principles Approach to Data Engineering on AWS
Welcome to the AWS Certified Data Engineer — Associate (DEA-C01) Study Guide. This guide moves beyond surface-level memorization. It is designed to build a robust mental model of how data pipelines work on AWS — understanding the why behind every architectural decision, service selection, and optimization trade-off.
Each topic is aligned with the official DEA-C01 Exam Objectives (v1.1), targeting the specific cognitive skills required for success. The exam is heavily scenario-based — roughly 60% of questions present a business situation and ask you to choose the best AWS service or architectural approach. Rote memorization of service features will not get you through; you need to understand when and why each service is the right tool.
Exam Details: 65 questions (50 scored + 15 unscored) | 130 minutes | Passing score: 720/1000
Prerequisites: 2–3 years of data engineering experience and 1–2 years of hands-on AWS experience, plus familiarity with ETL/ELT pipelines, SQL, data lakes, Git, and general networking, storage, and compute concepts. No machine learning training or programming-language-specific syntax is tested.
Exam Domain Weights
- Domain 1: Data Ingestion and Transformation (34%)
- Domain 2: Data Store Management (26%)
- Domain 3: Data Operations and Support (22%)
- Domain 4: Data Security and Governance (18%)

Domain 1 commands over a third of the exam: ingestion patterns, transformation services (especially AWS Glue and EMR), and pipeline orchestration are your highest-value study areas. Domain 2's data store selection questions are the second-heaviest block, so invest deeply in understanding when to choose S3, Redshift, DynamoDB, or Aurora. Domains 3 and 4 together account for the remaining 40%, with monitoring, data quality, IAM, Lake Formation, and encryption rounding out the exam.
Table of Contents
- Phase 1: First Principles of Data Engineering on AWS
- 1.1. The Data Pipeline Mental Model
- 1.1.1. Why Data Pipelines Exist: The Source-to-Insight Problem
- 1.1.2. The Five Stages of a Data Pipeline
- 1.1.3. Batch vs Streaming: Two Paradigms for Data Movement
- 1.2. The AWS Data Engineering Landscape
- 1.2.1. Core Service Categories and When They Apply
- 1.2.2. Managed vs Serverless: The Operational Trade-Off
- 1.3. Thinking Like a Data Engineer
- 1.3.1. The Three Optimization Axes: Cost, Performance, and Reliability
- 1.3.2. Schema-on-Read vs Schema-on-Write: A Foundational Decision
- 1.4. Reflection Checkpoint
- Phase 2: Data Ingestion and Transformation (34%)
- 2.1. Streaming Data Ingestion
- 2.1.1. Amazon Kinesis Data Streams
- 2.1.2. Amazon Kinesis Data Firehose
- 2.1.3. Amazon MSK (Managed Streaming for Apache Kafka)
- 2.1.4. Change Data Capture: DynamoDB Streams and AWS DMS
- 2.2. Batch Data Ingestion
- 2.2.1. Amazon S3 as an Ingestion Layer
- 2.2.2. AWS Glue Crawlers and Batch Jobs
- 2.2.3. AWS Database Migration Service and AppFlow
- 2.3. Scheduling, Triggers, and Ingestion Patterns
- 2.3.1. Event-Driven Ingestion with EventBridge
- 2.3.2. Scheduled Ingestion with MWAA and Glue Triggers
- 2.3.3. Throttling, Rate Limits, and Fan-Out Patterns
- 2.4. Data Transformation Services
- 2.4.1. AWS Glue ETL: DynamicFrames, Spark, and Job Bookmarks
- 2.4.2. Amazon EMR and Apache Spark
- 2.4.3. Lightweight Transforms with Lambda and Redshift SQL
- 2.5. Data Formats and Conversion
- 2.5.1. Columnar vs Row-Based Formats
- 2.5.2. Format Selection and Conversion Patterns
- 2.6. Pipeline Orchestration
- 2.6.1. AWS Step Functions and State Machines
- 2.6.2. Amazon MWAA (Apache Airflow) and Glue Workflows
- 2.6.3. Event-Driven Architectures with SNS and SQS
- 2.7. Programming Concepts for Data Pipelines
- 2.7.1. SQL for Data Transformation and Query Optimization
- 2.7.2. CI/CD and Infrastructure as Code
- 2.7.3. Serverless Deployment with Lambda and SAM
- 2.8. Reflection Checkpoint
- Phase 3: Data Store Management (26%)
- 3.1. Choosing the Right Data Store
- 3.1.1. Amazon S3: The Data Lake Foundation
- 3.1.2. Amazon Redshift: The Cloud Data Warehouse
- 3.1.3. Amazon DynamoDB: NoSQL at Scale
- 3.1.4. Amazon RDS, Aurora, and Relational Options
- 3.2. Specialized and Emerging Data Stores
- 3.2.1. Amazon OpenSearch, Neptune, DocumentDB, Keyspaces, and MemoryDB
- 3.2.2. Open Table Formats: Apache Iceberg and S3 Tables
- 3.2.3. Vector Indexes and Vectorization Concepts
- 3.3. Data Cataloging Systems
- 3.3.1. AWS Glue Data Catalog and Crawlers
- 3.3.2. Business Data Catalogs: Amazon SageMaker Catalog
- 3.4. Data Lifecycle Management
- 3.4.1. S3 Storage Classes and Lifecycle Policies
- 3.4.2. Data Retention, Archiving, and Deletion Strategies
- 3.5. Data Modeling and Schema Evolution
- 3.5.1. Schema Design for Redshift, DynamoDB, and Lake Formation
- 3.5.2. Schema Evolution, Data Lineage, and Data Quality
- 3.6. Reflection Checkpoint
- Phase 4: Data Operations and Support (22%)
- 4.1. Automating Data Processing
- 4.1.1. Orchestration with Step Functions and MWAA
- 4.1.2. Processing with Glue, EMR, Lambda, and Athena
- 4.2. Data Analysis on AWS
- 4.2.1. SQL Analysis with Athena and Redshift
- 4.2.2. Visualization and Exploration: QuickSight, DataBrew, and Notebooks
- 4.3. Monitoring and Maintaining Pipelines
- 4.3.1. CloudWatch Metrics, Logs, and Alarms
- 4.3.2. CloudTrail, CloudTrail Lake, and Log Analysis
- 4.3.3. Troubleshooting Pipeline Failures
- 4.4. Ensuring Data Quality
- 4.4.1. Validation, Profiling, and Quality Rules
- 4.4.2. Data Sampling and Skew Mechanisms
- 4.5. Reflection Checkpoint
- Phase 5: Data Security and Governance (18%)
- 5.1. Authentication on AWS
- 5.1.1. IAM Fundamentals: Users, Roles, and Policies
- 5.1.2. VPC Security, Endpoints, and Credential Management
- 5.2. Authorization and Access Control
- 5.2.1. IAM Policies, RBAC, and Attribute-Based Access
- 5.2.2. Lake Formation Permissions and Fine-Grained Access
- 5.3. Data Encryption and Masking
- 5.3.1. Encryption at Rest and in Transit with KMS
- 5.3.2. Data Masking, Anonymization, and PII Protection
- 5.4. Audit Logging
- 5.4.1. CloudTrail and CloudWatch Logs for Auditing
- 5.4.2. Centralized Log Analysis: CloudTrail Lake, Athena, and OpenSearch
- 5.5. Data Privacy and Governance
- 5.5.1. PII Detection with Macie and Data Sovereignty
- 5.5.2. Governance Frameworks: Config, SageMaker Catalog, and Data Sharing
- 5.6. Reflection Checkpoint
- Phase 6: Exam Readiness
- 6.1. Exam Strategy and Time Management
- 6.2. Quick Reference: Decision Trees and Service Comparisons
- 6.3. Mixed-Topic Practice Questions
- Phase 7: Glossary
- Phase 8: Conclusion