2.4. Reflection Checkpoint
Key Takeaways
Before proceeding, ensure you can:
- Select the appropriate data format (Parquet, RecordIO, CSV, Avro) based on the ML workload and access pattern
- Choose between S3, EFS, FSx for Lustre, and EBS based on throughput, access pattern, and sharing requirements
- Distinguish between Kinesis Data Streams, Data Firehose, and MSK for streaming ingestion scenarios
- Apply the correct data cleaning technique (outlier treatment, imputation strategy, deduplication) based on data characteristics
- Select the right encoding for categorical features based on cardinality and ordinality
- Match the AWS transformation tool (Glue, DataBrew, EMR, Data Wrangler) to the scenario's scale, user, and integration needs
- Use SageMaker Clarify's pre-training bias metrics, such as Class Imbalance (CI) and Difference in Proportions of Labels (DPL), to detect data bias and select a mitigation strategy
- Explain data quality validation as an automated pipeline gate, not a one-time check
- Distinguish between Ground Truth (training data labeling) and A2I (production prediction review)
- Configure encryption at rest and in transit for ML data pipelines
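The Clarify bias metrics in the takeaways above come down to two small formulas. A minimal sketch, following the definitions in the SageMaker Clarify documentation; the group counts below are hypothetical, chosen only for illustration:

```python
# Pre-training bias metrics as defined in the SageMaker Clarify docs.
# The group counts below are hypothetical, for illustration only.

def class_imbalance(n_a: int, n_d: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d): how over-represented the
    advantaged facet group (n_a examples) is relative to the
    disadvantaged one (n_d). Ranges from -1 to 1; 0 is balanced."""
    return (n_a - n_d) / (n_a + n_d)

def difference_in_proportions_of_labels(pos_a: int, n_a: int,
                                        pos_d: int, n_d: int) -> float:
    """DPL = q_a - q_d: the difference in positive-label rates
    between the two groups. 0 means labels are evenly distributed."""
    return pos_a / n_a - pos_d / n_d

# 800 examples in Group A (560 positive), 200 in Group B (60 positive).
print(class_imbalance(800, 200))                               # 0.6
print(difference_in_proportions_of_labels(560, 800, 60, 200))  # ~0.4
```

A CI near 0 with a large DPL still signals label bias, so check both metrics rather than either alone.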
Connecting Forward
In the next phase, you'll build on this data foundation to develop ML models. You'll learn how to select the right algorithm for your prepared data, train models efficiently using SageMaker's built-in algorithms and custom frameworks, tune hyperparameters, and evaluate model performance. The data quality and feature engineering decisions you make in Phase 2 directly determine what's possible in Phase 3—a well-prepared dataset with clean, bias-checked, properly encoded features is the best foundation for model development.
Self-Check Questions
- A team stores 2 TB of training data as CSV files in S3. Training a model on an ml.p3.2xlarge takes 8 hours, with 2 hours spent just loading data. They want to reduce total training time. Identify two changes (one to data format, one to storage architecture) that would have the most impact.
- A dataset for predicting employee attrition has a "department" feature with 5 values and an "employee_id" feature with 50,000 unique values. What encoding would you use for each, and why is one-hot encoding inappropriate for one of them?
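The encoding contrast in the question above can be sketched with pandas; the column names come from the question, but the data values are hypothetical:

```python
# Sketch of encoding choices for the two features in the question,
# using pandas; the data values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "department": ["sales", "hr", "it", "sales", "it"],
    "employee_id": ["e001", "e002", "e003", "e004", "e005"],
})

# "department" has low cardinality (5 values in the full dataset), so
# one-hot encoding adds at most 5 columns -- cheap and interpretable.
dept_encoded = pd.get_dummies(df["department"], prefix="dept")

# "employee_id" has 50,000 unique values: one-hot would explode into
# 50,000 sparse columns, and an identifier that is unique per row
# carries no generalizable signal anyway -- drop it instead.
features = df.drop(columns=["employee_id"])

print(dept_encoded.columns.tolist())
```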
- SageMaker Clarify reports a Class Imbalance (CI) metric of 0.6 for a sensitive facet: 80% of examples belong to Group A and 20% to Group B. The DPL shows that the positive label rate is 70% for Group A and 30% for Group B. Describe the bias problem and recommend a mitigation strategy.
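One common mitigation for the facet imbalance in this question is resampling. A minimal sketch with hypothetical counts (800 Group A vs. 200 Group B examples); in practice you might instead use a managed option such as the balancing transforms in SageMaker Data Wrangler or sample reweighting:

```python
# Minimal sketch of one mitigation: oversample the under-represented
# facet group so training sees both groups equally. Counts are
# hypothetical (800 Group A vs. 200 Group B examples).
import random

group_a = [{"group": "A"} for _ in range(800)]  # over-represented facet
group_b = [{"group": "B"} for _ in range(200)]  # under-represented facet

# Resample Group B with replacement until it matches Group A's size.
group_b_upsampled = random.choices(group_b, k=len(group_a))
balanced = group_a + group_b_upsampled

print(len(balanced))  # 1600 rows, with a 50/50 group split
```

Balancing group counts addresses CI directly; the label-rate gap that DPL measures may additionally require stratified resampling within each group or a review of how the labels were produced.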