1.1.1. Why ML Engineering Differs from Software Engineering
💡 First Principle: Software engineering manages code complexity; ML engineering manages data complexity on top of code complexity. Every challenge of traditional software still applies—plus an entirely new category of failures where the code is correct but the system behaves incorrectly because the data changed.
In traditional software, you write unit tests that validate outputs for known inputs. If all tests pass, you're confident the system works. But an ML model that passes every validation metric during training can still fail catastrophically in production if the real-world data distribution shifts. Imagine deploying a fraud detection model trained on 2023 transaction patterns—it might miss entirely new fraud techniques that emerge in 2024, even though nothing in the code changed.
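The distribution-shift idea above can be sketched numerically. This is a minimal, hand-rolled two-sample Kolmogorov–Smirnov check on synthetic transaction amounts; the distributions, sample sizes, and the 0.1 alert threshold are all illustrative (a production system would use a managed tool like SageMaker Model Monitor rather than a hand-written check):

```python
import bisect
import random

def empirical_cdf(sorted_xs, x):
    # Fraction of the sample that is <= x.
    return bisect.bisect_right(sorted_xs, x) / len(sorted_xs)

def ks_statistic(sample_a, sample_b):
    # Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    # the two empirical CDFs, evaluated at every observed point.
    a, b = sorted(sample_a), sorted(sample_b)
    return max(abs(empirical_cdf(a, x) - empirical_cdf(b, x)) for x in a + b)

random.seed(42)
# Hypothetical transaction amounts: training data vs. shifted production traffic.
train_amounts = [random.lognormvariate(4.0, 1.0) for _ in range(2000)]
prod_amounts = [random.lognormvariate(4.6, 1.0) for _ in range(2000)]

stat = ks_statistic(train_amounts, prod_amounts)
print(f"KS statistic = {stat:.3f}, drift = {stat > 0.1}")
```

Note that nothing in the model or code has to change for this statistic to move: the alert fires purely because the incoming data no longer looks like the training data.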
This creates three challenges that don't exist in traditional software:
Testing is fundamentally harder. You can't enumerate all valid inputs to an ML model. Instead, you rely on statistical metrics (accuracy, precision, recall) that give you estimates of performance, not guarantees. The exam frequently tests whether you understand which metric matters for a given business scenario.
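To see why the choice of metric matters, consider a toy fraud scenario (the counts are hypothetical): a model that predicts "legitimate" for every transaction scores 99% accuracy on imbalanced data while catching zero fraud, which is exactly the distinction exam scenarios probe:

```python
# Confusion-matrix counts for a model that always predicts "legitimate"
# on a dataset of 990 legitimate and 10 fraudulent transactions.
tp, fp, fn, tn = 0, 0, 10, 990

accuracy = (tp + tn) / (tp + fp + fn + tn)        # 0.99 -- looks excellent
precision = tp / (tp + fp) if (tp + fp) else 0.0  # no positive predictions at all
recall = tp / (tp + fn) if (tp + fn) else 0.0     # 0.0 -- catches no fraud

print(f"accuracy={accuracy}, precision={precision}, recall={recall}")
```

For a fraud-detection scenario, recall on the fraud class is the number to watch; accuracy alone hides the failure.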
Reproducibility requires tracking more than code. A software bug is reproducible if you have the code and the input. Reproducing an ML bug requires the code, the training data, the hyperparameters, the random seeds, and the exact library versions. This is why SageMaker Experiments and Model Registry exist—and why the exam tests your understanding of model versioning.
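A stdlib-only sketch of what that tracking has to capture. The field names and helper below are illustrative, not the SageMaker Experiments API; the point is that a run manifest records everything besides the code that determines the result:

```python
import hashlib
import json
import platform
import random

def run_manifest(hyperparams, seed, data_bytes):
    # Hypothetical helper: snapshot everything needed to reproduce a run.
    random.seed(seed)  # fix randomness before any training-time sampling
    return {
        "python_version": platform.python_version(),
        "hyperparameters": hyperparams,
        "random_seed": seed,
        # Hash of the training data, so a silent data change is detectable.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }

manifest = run_manifest(
    {"learning_rate": 0.1, "max_depth": 6},
    seed=42,
    data_bytes=b"training-data-placeholder",
)
print(json.dumps(manifest, indent=2))
```

Hashing the data is the key move: two runs with identical code and hyperparameters but different `data_sha256` values explain a "nothing changed but the model differs" mystery immediately.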
Failures are silent. A crashing application alerts you immediately. A model whose accuracy degrades from 95% to 85% over six months might go unnoticed without explicit monitoring. This is the core motivation behind SageMaker Model Monitor and data drift detection—topics that Domain 4 tests heavily.
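The monitoring idea can be made concrete with a rolling-window accuracy check that turns silent degradation into an explicit signal. The window size, threshold, and simulated error rate below are all illustrative:

```python
from collections import deque

class AccuracyMonitor:
    # Minimal sketch: track correctness of recent predictions and flag
    # when rolling accuracy falls below a threshold.
    def __init__(self, window=100, threshold=0.90):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, label):
        self.window.append(prediction == label)

    def degraded(self):
        if len(self.window) < self.window.maxlen:
            return False  # not enough evidence to alert yet
        return sum(self.window) / len(self.window) < self.threshold

monitor = AccuracyMonitor(window=100, threshold=0.90)
for i in range(100):
    # Hypothetical stream where the model is wrong 15% of the time.
    monitor.record(prediction=1, label=1 if i >= 15 else 0)
print(monitor.degraded())  # True: rolling accuracy 0.85 < 0.90
```

This is conceptually what a model-quality monitoring schedule does: compare recent performance against a baseline and alert on the gap, since no exception will ever be thrown on your behalf.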
⚠️ Exam Trap: When a question describes a model performing well in development but poorly in production, don't jump to "the model needs retraining." First consider whether there's a data mismatch—different preprocessing, different feature distributions, or missing features between training and inference environments. The exam tests this distinction repeatedly.
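A quick sketch of checking for that data mismatch before reaching for retraining. The function, feature names, values, and the order-of-magnitude heuristic are all hypothetical; the idea is to diff a training-time record against an inference-time record:

```python
def skew_report(train_row, serve_row):
    # Hypothetical helper: flag missing features and gross scale mismatches
    # between a training-time record and an inference-time record.
    issues = []
    missing = set(train_row) - set(serve_row)
    if missing:
        issues.append(f"missing features at inference: {sorted(missing)}")
    for name in set(train_row) & set(serve_row):
        t, s = train_row[name], serve_row[name]
        # Crude check: an order-of-magnitude gap often means a preprocessing
        # step (e.g., scaling) ran in training but not in serving.
        if isinstance(t, (int, float)) and isinstance(s, (int, float)):
            if t != 0 and abs(s) > 10 * abs(t):
                issues.append(f"scale mismatch on '{name}': train={t}, serve={s}")
    return issues

train_row = {"amount_scaled": 0.42, "country_code": 7, "velocity": 1.3}
serve_row = {"amount_scaled": 4200.0, "country_code": 7}  # unscaled, missing feature
for issue in skew_report(train_row, serve_row):
    print(issue)
```

Either finding (a feature present in training but absent at inference, or a raw value where a scaled one was expected) explains the dev-versus-production gap without any retraining.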
Reflection Question: If an ML model's accuracy drops gradually over three months without any code changes, what are the most likely root causes, and which AWS service would you use to detect this?