5.2. Monitoring and Optimizing Infrastructure
💡 First Principle: ML infrastructure costs scale with two dimensions simultaneously: compute intensity (GPUs are expensive) and data volume (more data means more storage and processing). Without active cost monitoring, ML workloads can generate surprise bills that dwarf the value the model produces. Observability isn't optional—it's the financial immune system of your ML platform.
What breaks without infrastructure monitoring? Consider a SageMaker endpoint auto-scaled for holiday traffic that never scales back down. Or a training job stuck in a retry loop, spinning up expensive GPU instances every hour. Or an EMR cluster running idle after the ETL job completes because nobody configured termination. These aren't hypothetical—they're the most common cost overruns in ML systems, and the exam tests your ability to prevent them.
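One way to catch the "endpoint that never scales back down" failure mode is a CloudWatch alarm on the endpoint's `Invocations` metric. A minimal sketch, assuming a hypothetical endpoint named `churn-model-prod`; the function builds the keyword arguments you would pass to boto3's `cloudwatch.put_metric_alarm`, and the thresholds are illustrative, not prescriptive:

```python
# Sketch: a CloudWatch alarm that fires when a SageMaker endpoint sits idle,
# so an over-scaled or forgotten endpoint doesn't bill silently.
# Endpoint name and thresholds are illustrative assumptions.

def idle_endpoint_alarm(endpoint_name, variant="AllTraffic", idle_hours=6):
    """Build kwargs for cloudwatch.put_metric_alarm(**kwargs)."""
    return {
        "AlarmName": f"{endpoint_name}-idle",
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant},
        ],
        "Statistic": "Sum",
        "Period": 3600,                    # evaluate hourly buckets
        "EvaluationPeriods": idle_hours,   # idle for N consecutive hours
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",   # no datapoints at all counts as idle
    }

alarm = idle_endpoint_alarm("churn-model-prod")
# boto3.client("cloudwatch").put_metric_alarm(**alarm)  # requires AWS credentials
```

Wiring the alarm to an SNS topic (via `AlarmActions`) closes the loop: an idle endpoint pages someone instead of billing quietly.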
Think of ML infrastructure monitoring like the dashboard in a fleet management system. You're not just tracking whether individual vehicles are running—you're tracking fuel consumption per mile, idle time, maintenance schedules, and route efficiency. CloudWatch is your fleet dashboard, CloudTrail is your GPS log, and Cost Explorer is your fuel expense report.
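The "fuel expense report" side of this is queryable programmatically. A sketch of a Cost Explorer request for daily SageMaker spend, grouped by usage type; the dates are illustrative assumptions, and the dict maps onto the parameters of boto3's `ce.get_cost_and_usage`:

```python
# Sketch: Cost Explorer query for daily SageMaker spend, broken down by
# usage type (training hours, endpoint hours, storage, etc.).
# Dates are illustrative; pass the dict to
# boto3.client("ce").get_cost_and_usage(**query) with valid credentials.

def sagemaker_cost_query(start, end):
    return {
        "TimePeriod": {"Start": start, "End": end},  # ISO dates, end exclusive
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "Filter": {
            "Dimensions": {
                "Key": "SERVICE",
                "Values": ["Amazon SageMaker"],
            }
        },
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }

query = sagemaker_cost_query("2024-06-01", "2024-07-01")
```

Grouping by `USAGE_TYPE` is what separates "expensive training run" from "endpoint left running", which is exactly the distinction the fleet-dashboard framing calls for.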
⚠️ Common Misconception: Spot Instances are always the cheapest option for ML training. Spot can save up to 90% over On-Demand, but if your training job doesn't checkpoint and gets interrupted at 95% completion, you lose all progress and pay for the rerun. The exam tests whether you know that managed Spot training in SageMaker needs checkpointing to S3 to resume after an interruption (without it, the job restarts from scratch), and that short jobs (<1 hour) benefit less: interruption risk per hour is roughly constant, but the fixed overhead of re-provisioning and restarting is a larger fraction of a short job's total runtime.
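The Spot-specific pieces of a SageMaker training job can be sketched as follows. The bucket name and timing values are illustrative assumptions; these fields would be merged into a full `create_training_job` request to the boto3 SageMaker client:

```python
# Sketch: the managed-Spot fields of a SageMaker create_training_job request.
# Bucket name and durations are illustrative assumptions; merge into the full
# request for boto3.client("sagemaker").create_training_job(**request).

def spot_training_fields(checkpoint_bucket, max_run_s=4 * 3600):
    return {
        "EnableManagedSpotTraining": True,
        # MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds: it bounds
        # runtime PLUS time spent waiting for Spot capacity after interruptions.
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_run_s,
            "MaxWaitTimeInSeconds": max_run_s + 3600,
        },
        # Without a checkpoint location, an interrupted job restarts from scratch.
        "CheckpointConfig": {
            "S3Uri": f"s3://{checkpoint_bucket}/checkpoints/",
            "LocalPath": "/opt/ml/checkpoints",
        },
    }

fields = spot_training_fields("my-ml-bucket")
```

The `MaxWaitTimeInSeconds` vs `MaxRuntimeInSeconds` distinction is a favorite exam trap: the wait budget must cover both the run itself and any capacity gaps between interruptions.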