2.2. FM Deployment and Lifecycle Management
💡 First Principle: FM deployment is not a one-time event — models have lifecycles, traffic patterns change, new model versions release, and a deployment strategy that works at 1,000 requests/day breaks at 1,000,000. The right deployment architecture accounts for current load, future growth, cost thresholds, and failure modes from day one.
The deployment decision directly affects every other operational concern: cost (on-demand vs. provisioned), latency (warm vs. cold), reliability (single-model vs. fallback), and governance (model version control, rollback capability). Exam scenarios will present traffic patterns and constraints and ask you to select the appropriate deployment configuration.
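The fallback concern above can be sketched as a small helper. This is a minimal sketch, assuming boto3's `bedrock-runtime` client and its `converse(modelId=..., messages=...)` API; the function name, the `retryable` parameter, and the stub-friendly design are illustrative, not an official pattern.

```python
def invoke_with_fallback(client, model_ids, messages, retryable=(Exception,)):
    """Try each model ID in order, falling back when a retryable error occurs.

    `client` is any object exposing a Bedrock-style converse(modelId=..., messages=...)
    method, e.g. boto3.client("bedrock-runtime"). In production, narrow `retryable`
    to throttling/availability errors such as client.exceptions.ThrottlingException
    instead of the broad default used here for illustration.
    """
    last_error = None
    for model_id in model_ids:
        try:
            return client.converse(modelId=model_id, messages=messages)
        except retryable as err:
            last_error = err  # remember why this model failed, then try the next one
    raise last_error  # every model in the fallback chain failed
```

Ordering the list as primary model first, cheaper or higher-quota model second gives you the single-model-vs-fallback trade-off described above without changing the calling code.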
| Deployment Mode | Traffic Pattern | Cost Model | Latency | Best For |
|---|---|---|---|---|
| Bedrock On-Demand | Unpredictable / low volume | Per token | ~100ms | Prototypes, variable workloads |
| Bedrock Provisioned | High, sustained (>60-70% util.) | Per MU/hour | <50ms | Production SLAs, fine-tuned models |
| SageMaker Endpoint | Custom model, always-on | Per instance/hour | <100ms warm | Full GPU control, custom models |
| SageMaker Serverless | Infrequent, spiky | Per compute-second + data processed | Seconds (cold start) | Infrequently invoked fine-tuned models |
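The table's decision logic can be expressed as a small function. This is a sketch of the heuristics only; the function name, the `traffic` labels, and the 60% cutoff encoding are illustrative assumptions, and the thresholds are rules of thumb rather than hard limits.

```python
def recommend_deployment(custom_model: bool, sustained_utilization: float,
                         traffic: str) -> str:
    """Map the comparison table's heuristics to a deployment mode.

    `traffic` is one of "unpredictable", "sustained", "always_on", "infrequent".
    The ~60-70% utilization cutoff for provisioned throughput is the table's
    own rule of thumb, encoded here at its lower bound.
    """
    if not custom_model:
        # Bedrock-hosted FM: the choice hinges on traffic shape and utilization
        if traffic == "sustained" and sustained_utilization >= 0.6:
            return "bedrock-provisioned"
        return "bedrock-on-demand"
    # Custom model: always-on traffic justifies a dedicated endpoint
    if traffic == "always_on":
        return "sagemaker-endpoint"
    return "sagemaker-serverless"
```

On the exam, walking a scenario through exactly this order of questions (hosted vs. custom, then traffic shape, then utilization) usually isolates the intended answer.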
Common Misconception: Provisioned throughput is always more cost-effective than on-demand for Bedrock. Provisioned throughput saves money only when utilization consistently exceeds ~60–70%. Below that threshold, you pay for idle capacity. On-demand pricing is almost always cheaper for variable or low-volume workloads.
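The break-even claim can be made concrete with a one-line calculation. The prices below are hypothetical, not actual AWS rates, and `mu_tokens_per_hr` is an assumed stand-in for the throughput one model unit can sustain.

```python
def breakeven_utilization(provisioned_rate_per_hr: float,
                          mu_tokens_per_hr: float,
                          ondemand_price_per_token: float) -> float:
    """Utilization above which provisioned throughput beats on-demand.

    At utilization u, one model unit processes u * mu_tokens_per_hr tokens per
    hour, which would cost u * mu_tokens_per_hr * ondemand_price_per_token on
    demand. Provisioned costs provisioned_rate_per_hr regardless of u, so the
    break-even point is where the two costs are equal.
    """
    return provisioned_rate_per_hr / (mu_tokens_per_hr * ondemand_price_per_token)

# Hypothetical numbers: $100/hr per MU, 1M tokens/hr capacity, $0.0002/token on-demand
u_star = breakeven_utilization(100.0, 1_000_000, 0.0002)  # 0.5, i.e. 50% utilization
```

Below `u_star`, every provisioned hour includes paid-for idle capacity; above it, on-demand token charges would exceed the flat hourly rate, which is the whole substance of the misconception above.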