6.2.6. Tricky Distinctions & Common Pitfalls (ML Focus)
First Principle: A nuanced understanding of seemingly similar ML concepts and AWS services, together with the ability to anticipate common misconfigurations, is critical for designing robust ML solutions and avoiding errors.
The AWS MLS-C01 exam tests deep understanding, often by asking you to distinguish between similar ML concepts or AWS services and to identify common pitfalls.
Common Areas of Confusion (ML Focus):
- Kinesis Data Streams vs. Kinesis Firehose vs. Kinesis Data Analytics (ingestion sketch after this list):
- Data Streams: Real-time ingestion with millisecond latency, shard-based scaling, record retention and replay; you build and manage the consumers.
- Firehose: Fully managed delivery to S3, Redshift, OpenSearch, or Splunk; buffers records, so near-real-time rather than real-time; optional Lambda transformation.
- Data Analytics: Runs SQL or Apache Flink applications over streaming data for on-the-fly analytics, reading from Data Streams or Firehose.
- SageMaker Real-time Endpoints vs. Batch Transform vs. Asynchronous Inference (deployment sketch after this list):
- Real-time: Low latency, single predictions, persistent endpoint, higher cost.
- Batch: High throughput, offline, no persistent endpoint, cost-effective for large datasets.
- Asynchronous: Large payloads, long processing, intermittent traffic, managed queue, cost-effective.
- SageMaker Data Wrangler vs. SageMaker Processing Jobs vs. AWS Glue (processing-job sketch after this list):
- Data Wrangler: Visual, interactive data prep for ML, generates code.
- Processing Jobs: Managed execution environment for custom Spark/Scikit-learn scripts, often used for data prep after Data Wrangler or for model evaluation.
- Glue: Serverless ETL, broader data integration, for general data lake transformations before ML-specific prep.
- SageMaker Feature Store vs. DynamoDB for Features:
- Feature Store: Purpose-built, online/offline stores, time-travel queries, consistency between training and inference, managed metadata; the offline store is backed by S3, the online store is a managed low-latency key-value store (feature-group sketch after this list).
- DynamoDB: General-purpose NoSQL; you can build your own online feature store on it, but it lacks the built-in ML-specific capabilities of SageMaker Feature Store (e.g., time-travel, training/inference consistency).
- SageMaker Model Monitor vs. CloudWatch for Metrics:
- Model Monitor: Purpose-built monitoring of deployed models for data quality, model quality, bias drift, and feature attribution drift (monitoring-schedule sketch after this list).
- CloudWatch: General-purpose monitoring, infrastructure metrics (CPU, memory), custom metrics.
- AWS AI Services vs. SageMaker:
- AI Services: Pre-trained models behind high-level APIs (e.g., Comprehend, Rekognition, Translate); no ML expertise needed, limited customization (one-call example after this list).
- SageMaker: Build custom models, requires ML expertise, full control, more flexibility.
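A minimal boto3 sketch of the two ingestion paths from the Kinesis comparison above; the stream and delivery-stream names are hypothetical placeholders:

```python
import json
import boto3

# Hypothetical resource names for illustration.
STREAM_NAME = "clickstream-events"          # Kinesis Data Streams
DELIVERY_STREAM_NAME = "clickstream-to-s3"  # Kinesis Data Firehose

event = json.dumps({"user_id": "u-123", "page": "/checkout"}).encode("utf-8")

# Data Streams: shard-based, millisecond latency; you manage the consumers
# (e.g., a Lambda function or a Kinesis Data Analytics application).
kinesis = boto3.client("kinesis")
kinesis.put_record(StreamName=STREAM_NAME, Data=event, PartitionKey="u-123")

# Firehose: fully managed delivery to S3/Redshift/OpenSearch with buffering,
# so records land in near real time (seconds to minutes), not milliseconds.
firehose = boto3.client("firehose")
firehose.put_record(DeliveryStreamName=DELIVERY_STREAM_NAME, Record={"Data": event})
```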
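A sketch of the three SageMaker deployment paths using the SageMaker Python SDK; the image URI, model artifact, role, and bucket paths are placeholders, and the three options are alternatives rather than one workflow:

```python
from sagemaker.model import Model
from sagemaker.async_inference import AsyncInferenceConfig

# Placeholder artifacts and role; substitute your own.
model = Model(
    image_uri="<ecr-image-uri>",
    model_data="s3://my-bucket/model/model.tar.gz",
    role="<execution-role-arn>",
)

# 1) Real-time: persistent endpoint, low latency, billed while it is running.
model.deploy(initial_instance_count=1, instance_type="ml.m5.large",
             endpoint_name="realtime-endpoint")

# 2) Batch transform: offline scoring of an S3 dataset, no persistent endpoint.
transformer = model.transformer(instance_count=1, instance_type="ml.m5.xlarge",
                                output_path="s3://my-bucket/batch-output/")
transformer.transform(data="s3://my-bucket/batch-input/", content_type="text/csv",
                      split_type="Line")

# 3) Asynchronous: requests are queued, large payloads and long processing are
#    supported, and results are written to S3.
model.deploy(initial_instance_count=1, instance_type="ml.m5.large",
             endpoint_name="async-endpoint",
             async_inference_config=AsyncInferenceConfig(
                 output_path="s3://my-bucket/async-output/"))
```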
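A minimal SageMaker Processing sketch that runs a custom scikit-learn preprocessing script on managed infrastructure; the script name, bucket paths, and role are hypothetical:

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(
    framework_version="1.2-1",          # managed scikit-learn container
    role="<execution-role-arn>",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Runs your own preprocess.py against data staged from S3, then writes the
# transformed output back to S3; nothing persists after the job finishes.
processor.run(
    code="preprocess.py",               # hypothetical local script
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)
```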
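A minimal SageMaker Feature Store sketch that creates a feature group with both online and offline stores and ingests a couple of records; the group name, bucket, and role are hypothetical:

```python
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

df = pd.DataFrame({
    "customer_id": pd.Series(["c1", "c2"], dtype="string"),
    "avg_order_value": [42.5, 17.0],
    "event_time": [1700000000.0, 1700000000.0],  # required event-time feature
})

fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)       # infer feature types
fg.create(
    s3_uri="s3://my-bucket/offline-store/",      # offline store (S3)
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn="<execution-role-arn>",
    enable_online_store=True,                    # low-latency online store
)
while fg.describe()["FeatureGroupStatus"] == "Creating":
    time.sleep(5)                                # creation is asynchronous
fg.ingest(data_frame=df, max_workers=1, wait=True)
```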
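A Model Monitor sketch that baselines the training data and schedules hourly data-quality checks; it assumes an existing endpoint with data capture enabled, and all names and paths are placeholders:

```python
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Baseline from the training data: statistics plus suggested constraints.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitor/baseline/",
)

# Hourly data-quality checks against traffic captured at the endpoint.
monitor.create_monitoring_schedule(
    monitor_schedule_name="clickstream-data-quality",
    endpoint_input="realtime-endpoint",          # data capture must be enabled
    output_s3_uri="s3://my-bucket/monitor/reports/",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```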
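For the AI Services contrast, sentiment analysis with Amazon Comprehend is one API call with no model to train or host, whereas the SageMaker route would mean building, training, and deploying your own model:

```python
import boto3

comprehend = boto3.client("comprehend")

# One call, pre-trained model, no endpoint to manage; customization is limited
# to what the service exposes (e.g., custom classification in Comprehend).
result = comprehend.detect_sentiment(
    Text="The checkout flow keeps timing out and I lost my cart.",
    LanguageCode="en",
)
print(result["Sentiment"], result["SentimentScore"])
```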
Common Pitfalls:
- Data Leakage: Information from the validation/test set or from future data leaking into training. Common in feature engineering (e.g., target encoding without cross-validation, using future data for lag features); see the leakage-safe pipeline sketch after this list.
- Class Imbalance: Not addressing imbalance in classification problems, leading to high accuracy but poor performance on minority class.
- Overfitting: Model performs well on training data but poorly on unseen data.
- Not using VPC mode for SageMaker: Exposing notebooks and training jobs to the public internet unnecessarily.
- Ignoring Cost Optimization: Not using Spot instances, auto-scaling, or proper S3 storage classes.
- "Accuracy" as Sole Metric: Especially for imbalanced classification or other problem types.
- No MLOps: Manual processes leading to errors, slow deployment, poor reproducibility.
- Ignoring Bias/Explainability: Lack of transparency and potential ethical/compliance issues.
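A leakage-safe target-encoding sketch for the data-leakage pitfall: keeping the fitted encoder inside the Pipeline means each cross-validation fold fits it on its own training split only (synthetic data; assumes scikit-learn >= 1.3 for TargetEncoder):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import TargetEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.choice(["A", "B", "C"], size=(1000, 1))  # one categorical feature
y = rng.integers(0, 2, size=1000)                # binary target

# LEAKY: TargetEncoder().fit_transform(X, y) on the full dataset before
# splitting lets every fold "see" labels from its own test rows.

# SAFE: the encoder is re-fit inside each cross-validation training fold.
pipe = Pipeline([
    ("encode", TargetEncoder()),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print("mean AUC:", scores.mean())
```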
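And a small synthetic example of why accuracy alone misleads on imbalanced data: a model that never predicts fraud still scores about 99% accuracy while catching zero fraud cases:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% fraud
y_pred = np.zeros_like(y_true)                    # "always legitimate" model

print("accuracy :", accuracy_score(y_true, y_pred))                 # ~0.99
print("recall   :", recall_score(y_true, y_pred, zero_division=0))  # 0.0
print("precision:", precision_score(y_true, y_pred, zero_division=0))
print("f1       :", f1_score(y_true, y_pred, zero_division=0))
```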
Scenario: One exam question asks for the best way to process real-time clickstream data for immediate insights; another asks you to troubleshoot a classification model that shows high accuracy but misses most fraud cases.
Reflection Question: How do you apply a First Principles approach to differentiate Kinesis Data Streams from Kinesis Firehose for real-time ingestion, and how does understanding the impact of class imbalance and the limitations of "accuracy" as a metric help you identify common pitfalls and select the correct solution for specific ML use cases?