3.2.1. Deployment Options and Configuration Parameters
💡 First Principle: Deployment options decide how the model's capacity is provisioned and billed, while runtime parameters decide how it generates on each call. Provisioning is about predictable cost and throughput; runtime parameters are about output behavior.
On the deployment side, you generally choose between pay-as-you-go (billed per token, suited to variable or low volume) and provisioned throughput (capacity reserved in Provisioned Throughput Units, or PTUs, suited to predictable high-volume workloads that need consistent throughput).
On the runtime side, the parameters you'll see most are temperature (how random the token sampling is), top-p (an alternative control on randomness that limits sampling to the smallest set of most-likely tokens whose combined probability reaches p), and max tokens (a cap on completion length, which bounds cost and response size). Temperature and top-p overlap in effect, so the usual guidance is to adjust one of them and leave the other at its default.
| Parameter | Controls | Raise it to... | Lower it to... |
|---|---|---|---|
| Temperature | Randomness/creativity | Get more varied, creative output | Get more focused, consistent output |
| Top-p | Candidate token pool | Allow more diverse word choices | Restrict to the most likely words |
| Max tokens | Completion length cap | Allow longer responses | Bound length and cost |
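To make the first two rows of the table concrete, here is a minimal sketch of how temperature and top-p reshape a next-token distribution. It is a toy illustration, not any service's actual implementation: the token scores in `toy_logits` and the function names are hypothetical, and real models work over vocabularies of many thousands of tokens.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Hypothetical next-token scores, chosen only for illustration.
toy_logits = {"Paris": 4.0, "Lyon": 2.0, "Rome": 1.0, "banana": 0.0}

focused = softmax_with_temperature(toy_logits, 0.2)  # low temperature
varied = softmax_with_temperature(toy_logits, 1.5)   # high temperature

baseline = softmax_with_temperature(toy_logits, 1.0)
narrow_pool = top_p_filter(baseline, 0.5)   # low top-p: few candidates
wide_pool = top_p_filter(baseline, 0.99)    # high top-p: many candidates
```

Lowering the temperature pushes almost all of the probability onto the top-scoring token, while raising it spreads probability across the alternatives. Lowering top-p shrinks the candidate pool itself, so unlikely tokens cannot be sampled at all.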
⚠️ Exam Trap: For tasks needing consistent, factual output (like extracting a date), you want low temperature, not high. High temperature is for creative variety. Reaching for high temperature to "improve" a factual task is a common wrong answer.
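One way to keep this distinction straight in practice is to store runtime parameters per task type. The sketch below assumes the common OpenAI-style parameter names (`temperature`, `top_p`, `max_tokens`); the task labels, the helper name, and every numeric value are illustrative assumptions, not recommended defaults.

```python
# Illustrative presets only: task labels and values are assumptions for this sketch.
PRESETS = {
    # Factual extraction: low temperature for consistent output, short cap.
    "date_extraction": {"temperature": 0.0, "top_p": 1.0, "max_tokens": 50},
    # Creative work: higher temperature for variety, longer cap.
    "brainstorming": {"temperature": 1.0, "top_p": 1.0, "max_tokens": 500},
}

def runtime_params(task):
    """Return a copy of the runtime parameters for a hypothetical task label."""
    if task not in PRESETS:
        raise KeyError(f"unknown task: {task!r}")
    return dict(PRESETS[task])  # copy so callers cannot mutate the presets

extraction = runtime_params("date_extraction")
brainstorming = runtime_params("brainstorming")
```

The returned dictionary could then be passed along with a request to whichever client library is in use. Note that only temperature differs between the presets for randomness; top-p stays at its default so the two controls are not adjusted together.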
Reflection Question: A summarization feature gives wildly different summaries each time it runs on the same document. Which parameter would you adjust, and in which direction?