Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2. Data Preparation for ML (28%)

Data preparation is the heaviest-weighted domain on the MLA-C01 at 28%, and for good reason: in production ML, data work is commonly estimated to consume 60–80% of engineering effort. This domain covers the full pipeline from raw data to model-ready features: ingestion and storage (choosing formats, S3 strategies, and streaming with Kinesis), transformation and feature engineering (cleaning, scaling, encoding), and data integrity (bias detection with Clarify, quality validation, labeling, and compliance).

Expect the exam to test your ability to choose between AWS data services for specific scenarios. "When do you use Glue vs. Data Wrangler vs. EMR vs. DataBrew?" is the recurring decision pattern. The sections below are organized to mirror that pipeline flow — ingestion first, then transformation, then validation — because each stage's output becomes the next stage's input. A format decision in 2.1 constrains your transformation options in 2.2, which constrains your bias detection approach in 2.3.
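The idea that each stage's output becomes the next stage's input can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`ingest`, `transform`, `validate`), the two-column records, and the sample values are hypothetical, and no AWS service is called.

```python
# Hypothetical three-stage pipeline; names and data are illustrative only.

def ingest(raw_rows):
    """Stage 1 (ingestion): parse raw comma-separated strings into records."""
    records = []
    for row in raw_rows:
        age, income = row.split(",")
        records.append({"age": float(age), "income": float(income)})
    return records

def transform(records):
    """Stage 2 (transformation): min-max scale each column to [0, 1]."""
    scaled = [dict(r) for r in records]  # copy so the input is not mutated
    for key in ("age", "income"):
        values = [r[key] for r in records]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        for r in scaled:
            r[key] = (r[key] - lo) / span
    return scaled

def validate(features):
    """Stage 3 (validation): check every scaled value lies in [0, 1]."""
    return all(0.0 <= v <= 1.0 for r in features for v in r.values())

raw = ["25,40000", "35,60000", "45,80000"]
features = transform(ingest(raw))  # each stage consumes the previous output
is_valid = validate(features)
```

Note how the parsing choice in `ingest` (numeric fields stored as floats) is what makes the scaling in `transform` possible, and the scaling range in turn defines what `validate` checks. This is a toy analogue of an early format decision constraining later stages.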

Written by Alvin Varughese, Founder (15 professional certifications)