5.1.2. SageMaker Model Monitor in Production
💡 First Principle: Model Monitor works by continuously comparing production data against a baseline you define at deployment time. The quality of your monitoring depends entirely on the quality of your baseline—a poorly constructed baseline produces either false alarms (too sensitive) or missed detections (too lenient).
Setting up Model Monitor follows a four-step process that the exam frequently tests:
Step 1: Enable Data Capture on your SageMaker endpoint. This captures a sample of inference requests and responses, storing them in S3. You configure what percentage of traffic to capture—100% for low-volume endpoints, a sample for high-volume ones.
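A minimal sketch of Step 1 using the SageMaker Python SDK; the bucket, endpoint name, and instance settings are placeholders, and `model` is assumed to be an existing `sagemaker.Model`:

```python
from sagemaker.model_monitor import DataCaptureConfig

# Capture 100% of requests and responses to S3 (use a lower
# sampling_percentage on high-volume endpoints).
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/model-monitor/captured-data",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="fraud-detector",
    data_capture_config=capture_config,
)
```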
Step 2: Create a Baseline using SageMaker's baselining job. This analyzes your training data and produces two artifacts: statistics (mean, standard deviation, min/max, and distribution sketches for each feature) and constraints (the rules production data is expected to satisfy). Together they define the "normal" against which production data is measured.
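A sketch for Step 2, continuing the same placeholders; `suggest_baseline` runs a processing job that writes `statistics.json` and `constraints.json` to the output location:

```python
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Analyze the training data to produce the baseline artifacts.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/training/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/model-monitor/baseline",
    wait=True,
)
```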
Step 3: Schedule Monitoring Jobs that run at regular intervals (hourly, daily) to compare captured production data against the baseline. Each job produces a violations report identifying which features have drifted and by how much.
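A sketch for Step 3, reusing the `monitor` object from Step 2; the schedule name is a placeholder:

```python
from sagemaker.model_monitor import CronExpressionGenerator

monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-detector-data-quality",
    endpoint_input="fraud-detector",
    output_s3_uri="s3://my-bucket/model-monitor/reports",
    # Compare each hourly batch of captured data against the baseline.
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```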
Step 4: Configure CloudWatch Alarms to trigger when violations exceed thresholds. These alarms can notify your team via SNS, trigger a Lambda function for automated remediation, or kick off a retraining pipeline via EventBridge.
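For Step 4, a hedged boto3 sketch: it assumes the data-quality monitor's CloudWatch namespace (`aws/sagemaker/Endpoints/data-metrics`) and a per-feature drift metric for a hypothetical feature `transaction_amount`, with a placeholder SNS topic:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="fraud-detector-drift-transaction-amount",
    Namespace="aws/sagemaker/Endpoints/data-metrics",
    MetricName="feature_baseline_drift_transaction_amount",
    Dimensions=[
        {"Name": "Endpoint", "Value": "fraud-detector"},
        {"Name": "MonitoringSchedule", "Value": "fraud-detector-data-quality"},
    ],
    Statistic="Maximum",
    Period=3600,                  # match the hourly monitoring cadence
    EvaluationPeriods=1,
    Threshold=0.2,                # drift distance to tolerate; tune per feature
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:drift-alerts"],
)
```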
Model Monitor supports four monitoring types, each testing a different dimension (the SDK class mapping is sketched after the table):
| Monitor Type | What It Checks | Baseline Source | Exam Signal |
|---|---|---|---|
| Data Quality | Feature statistics, missing values, data types | Training dataset statistics | "Feature distributions shifted" |
| Model Quality | Accuracy, precision, recall vs. baseline | Model evaluation metrics | "Model performance degraded" (requires ground truth) |
| Bias Drift | Fairness metrics across groups | Clarify bias baseline (computed with model predictions) | "Disproportionate impact on group" |
| Feature Attribution Drift | SHAP value distributions | Feature importance baseline from Clarify | "Feature importance rankings changed" |
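In the SageMaker Python SDK, these four types map to four distinct monitor classes; knowing the mapping helps connect scenario wording to the right tool:

```python
from sagemaker.model_monitor import (
    DefaultModelMonitor,         # Data Quality
    ModelQualityMonitor,         # Model Quality (requires ground truth)
    ModelBiasMonitor,            # Bias Drift (backed by Clarify)
    ModelExplainabilityMonitor,  # Feature Attribution Drift (Clarify SHAP)
)
```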
⚠️ Exam Trap: Model Quality monitoring requires ground truth labels for production data to compute accuracy metrics. If the scenario doesn't mention ground truth availability, Model Quality monitoring can't be used—only Data Quality monitoring (which compares feature distributions without needing labels). The exam tests this prerequisite.
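The ground-truth requirement shows up directly in the SDK: a Model Quality schedule takes a `ground_truth_input` S3 location where you upload labels as they become available. A hedged sketch (the role ARN, paths, and attribute settings are placeholders):

```python
from sagemaker.model_monitor import (
    CronExpressionGenerator,
    EndpointInput,
    ModelQualityMonitor,
)

mq_monitor = ModelQualityMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

mq_monitor.create_monitoring_schedule(
    monitor_schedule_name="fraud-detector-model-quality",
    endpoint_input=EndpointInput(
        endpoint_name="fraud-detector",
        destination="/opt/ml/processing/input_data",
        inference_attribute="0",  # where the prediction sits in the response
    ),
    # Labels uploaded here are merged with captured predictions
    # before accuracy/precision/recall can be computed.
    ground_truth_input="s3://my-bucket/ground-truth/",
    problem_type="BinaryClassification",
    output_s3_uri="s3://my-bucket/model-monitor/quality-reports",
    schedule_cron_expression=CronExpressionGenerator.daily(),
)
```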
Reflection Question: A team has deployed a model with Model Monitor but is getting too many false positive alerts. What are two approaches to reduce noise without missing real drift?