3.1.2. SageMaker Built-in Algorithms and When to Apply Them
Each specialized algorithm has a specific data format preference and hyperparameter vocabulary. DeepAR expects JSON Lines with "start" timestamps and "target" arrays — feeding it tabular CSV causes cryptic failures. BlazingText in unsupervised mode (Word2Vec) needs plain text with one sentence per line, while supervised mode needs lines prefixed with `__label__`. Object Detection requires RecordIO, or individual images paired with JSON bounding-box annotations. Understanding these input contracts prevents the most common training failures and saves hours of debugging. When the exam mentions "time-series forecasting with related time series," DeepAR is almost always the answer. When it mentions "fast text classification with millions of categories," BlazingText supervised mode is the target.
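The two text-based contracts above can be sketched in a few lines. This is illustrative only: the demand values, category id, and support-ticket label below are invented, not taken from any real dataset.

```python
import json

# DeepAR input contract: JSON Lines -- one JSON object per time series,
# with a "start" timestamp and a "target" array of observed values.
# "cat" (categorical grouping) and "dynamic_feat" are optional fields.
deepar_record = {
    "start": "2024-01-01 00:00:00",
    "target": [112.0, 118.0, 109.0, 131.0],  # invented daily demand values
    "cat": [0],  # optional: e.g. encodes which product/store this series is
}
deepar_line = json.dumps(deepar_record)

# BlazingText supervised contract: one example per line, space-tokenized,
# with one or more __label__ prefixes before the text.
blazingtext_line = "__label__billing how do i update my payment method"

print(deepar_line)
print(blazingtext_line)
```

Writing one such JSON object per line (and one such labeled sentence per line) into the training channel files is what satisfies each algorithm's input contract.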
💡 First Principle: SageMaker built-in algorithms are pre-optimized for AWS infrastructure—they scale automatically, support distributed training out of the box, and work with SageMaker's training/hosting pipeline with minimal configuration. The trade-off is less flexibility compared to custom code. The exam tests whether you know which built-in algorithm matches a given problem.
| Algorithm | Problem Type | Data Type | Key Exam Signal |
|---|---|---|---|
| Linear Learner | Classification / Regression | Tabular | "Linear relationship," "simple classification," "fast training" |
| XGBoost | Classification / Regression | Tabular | "Tabular data," "feature importance," "gradient boosting," "structured data" |
| K-Means | Clustering | Tabular | "Group similar items," "segmentation," "no labels" |
| Random Cut Forest | Anomaly Detection | Tabular / Time-series | "Anomaly," "outlier," "unusual patterns" |
| DeepAR | Time-series Forecasting | Time-series | "Forecast," "multiple related time-series," "demand prediction" |
| BlazingText | Text Classification / Word2Vec | Text | "Text classification," "word embeddings," "fast NLP" |
| Image Classification | Image Classification | Images | "Classify images," "ResNet," "image labels" |
| Object Detection | Object Detection | Images | "Locate objects," "bounding boxes," "detect items in image" |
| Semantic Segmentation | Pixel-level Classification | Images | "Pixel-level labeling," "segment regions," "autonomous driving" |
| Factorization Machines | Recommendation / Sparse data | Sparse tabular | "Recommendation," "sparse features," "click-through prediction" |
| LDA (Latent Dirichlet Allocation) | Topic Modeling | Text | "Discover topics," "topic modeling," "document themes" |
| IP Insights | Anomaly Detection | Entity-to-IP address pairs | "Unusual IP activity," "login anomalies" |
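The "minimal configuration" workflow is the same for every algorithm in the table: retrieve the AWS-managed container image, configure an Estimator, and point it at S3. A sketch using DeepAR, assuming SageMaker Python SDK v2; the function name, role ARN, bucket, region, and hyperparameter values are placeholders chosen for illustration, and the function is not executed here because it requires AWS credentials and real S3 data.

```python
def launch_deepar_training(role_arn: str, bucket: str):
    """Sketch of the built-in-algorithm workflow (SageMaker Python SDK v2
    assumed). Placeholders throughout -- nothing runs without AWS access."""
    import sagemaker
    from sagemaker.estimator import Estimator

    # 1. Look up the AWS-managed container image for the algorithm.
    image_uri = sagemaker.image_uris.retrieve(
        framework="forecasting-deepar", region="us-east-1", version="1"
    )

    # 2. Configure the managed training job.
    estimator = Estimator(
        image_uri=image_uri,
        role=role_arn,
        instance_count=1,
        instance_type="ml.c5.2xlarge",
        output_path=f"s3://{bucket}/deepar/output",
    )
    estimator.set_hyperparameters(
        time_freq="D",         # daily observations
        context_length=30,     # window the model conditions on
        prediction_length=14,  # forecast horizon
        epochs=100,
    )

    # 3. Point it at the JSON Lines channels and train.
    estimator.fit({
        "train": f"s3://{bucket}/deepar/train/",
        "test": f"s3://{bucket}/deepar/test/",
    })
    return estimator
```

Note that distributed training and hosting integration come from the container itself; the caller only picks instance types and hyperparameters.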
When NOT to use built-in algorithms: If your problem requires a custom neural network architecture, a framework not natively supported, or an algorithm with hyperparameters that SageMaker's built-in version doesn't expose, use Script Mode or Bring Your Own Container (BYOC) instead. The built-in algorithms are also less suitable when you need to customize the training loop itself (e.g., curriculum learning, custom loss functions).
⚠️ Exam Trap: XGBoost in SageMaker comes in two flavors: the built-in algorithm (container managed by AWS, limited hyperparameters) and the open-source version run through Script Mode (full XGBoost API). If a question asks about "SageMaker's XGBoost" without qualification, it means the built-in. If it mentions "custom XGBoost configuration" or "XGBoost script mode," it means the open-source version running in a managed container.
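The two flavors look similar in code but differ in what you control. A side-by-side sketch, assuming SageMaker Python SDK v2; the function names, role ARN, bucket, and `train.py` entry point are placeholders, and neither function is executed here since both require AWS credentials.

```python
def builtin_xgboost(role_arn: str, bucket: str):
    """Built-in flavor: AWS-managed container, hyperparameters limited to
    the documented set (num_round is required). Sketch only."""
    import sagemaker
    from sagemaker.estimator import Estimator

    image_uri = sagemaker.image_uris.retrieve(
        framework="xgboost", region="us-east-1", version="1.7-1"
    )
    estimator = Estimator(
        image_uri=image_uri,
        role=role_arn,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path=f"s3://{bucket}/xgb/output",
    )
    # Only the hyperparameters the built-in container exposes are accepted.
    estimator.set_hyperparameters(
        num_round=100, objective="binary:logistic", max_depth=6, eta=0.2
    )
    return estimator


def script_mode_xgboost(role_arn: str, bucket: str):
    """Script-mode flavor: managed container, but you supply your own
    train.py and can use the full open-source XGBoost API inside it
    (custom objectives, callbacks, a custom training loop). Sketch only."""
    from sagemaker.xgboost import XGBoost

    return XGBoost(
        entry_point="train.py",  # your script; full xgboost API available
        framework_version="1.7-1",
        role=role_arn,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        hyperparameters={"max_depth": 6},  # forwarded to train.py as args
        output_path=f"s3://{bucket}/xgb-script/output",
    )
```

The exam-relevant distinction: in the built-in flavor, hyperparameters go straight to the AWS container; in script mode they are forwarded to your own script, which is free to do anything the open-source library supports.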
Reflection Question: A team needs to forecast daily demand for 500 products across 20 stores. Each product-store combination has 2 years of daily history. Which SageMaker built-in algorithm is designed for this exact scenario, and what makes it superior to training 10,000 individual models?