Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.2.1. Deployment Options and Configuration Parameters

💡 First Principle: Deployment options decide how the model's capacity is provisioned and billed, while runtime parameters decide how it generates on each call. Provisioning is about predictable cost and throughput; runtime parameters are about output behavior.

On the deployment side, you generally choose between pay-as-you-go (billed per token, well suited to variable or low volume) and provisioned throughput (capacity reserved in Provisioned Throughput Units, or PTUs, for predictable high-volume workloads). On the runtime side, the parameters you'll see most are temperature (how random the token sampling is), top-p (an alternative randomness control that restricts sampling to the smallest pool of candidate tokens whose combined probability reaches p), and max tokens (a cap on completion length, which bounds cost and response size).
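The cost side of that choice can be sketched as a break-even calculation. This is a minimal illustration, not real pricing: the per-million-token price, the flat monthly reserved cost, and both function names are hypothetical placeholders, and the sketch ignores the throughput ceiling that reserved capacity also imposes.

```python
def pay_as_you_go_cost(tokens_per_month: int, price_per_million: float) -> float:
    """Monthly cost when every token is billed individually."""
    return tokens_per_month / 1_000_000 * price_per_million


def break_even_tokens(reserved_monthly_cost: float, price_per_million: float) -> float:
    """Monthly token volume at which both billing models cost the same."""
    return reserved_monthly_cost / price_per_million * 1_000_000


# Hypothetical placeholder prices, NOT real rates.
PRICE_PER_MILLION = 2.0      # pay-as-you-go price per 1M tokens
RESERVED_MONTHLY = 4000.0    # assumed flat monthly cost of reserved capacity

threshold = break_even_tokens(RESERVED_MONTHLY, PRICE_PER_MILLION)
print(f"Break-even volume: {threshold:,.0f} tokens/month")

for volume in (100_000_000, 5_000_000_000):
    payg = pay_as_you_go_cost(volume, PRICE_PER_MILLION)
    cheaper = "pay-as-you-go" if payg < RESERVED_MONTHLY else "provisioned"
    print(f"{volume:>13,} tokens: pay-as-you-go = {payg:,.2f} -> {cheaper} is cheaper")
```

Below the break-even volume, per-token billing costs less; above it, the reserved capacity does. Substitute the actual rates for your model and region before drawing any conclusion.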

| Parameter | Controls | Raise it to... | Lower it to... |
|---|---|---|---|
| Temperature | Randomness/creativity | Get more varied, creative output | Get more focused, consistent output |
| Top-p | Candidate token pool | Allow more diverse word choices | Restrict to the most likely words |
| Max tokens | Completion length cap | Allow longer responses | Bound length and cost |
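To make the first two rows concrete, here is a minimal sketch of how temperature and top-p are commonly applied to a model's raw token scores. The token list, the scores, and the function names are made up for illustration, and a real service may implement the details differently.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


tokens = ["2024", "2023", "soon", "banana"]  # hypothetical candidates
logits = [4.0, 2.0, 1.0, -1.0]               # made-up scores

for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(f"temperature={temperature}:", [round(p, 3) for p in probs])

pool = top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.9)
print("top_p=0.9 keeps:", [tokens[i] for i in pool])
```

Raising the temperature flattens the distribution so unlikely tokens are sampled more often; lowering top-p shrinks the candidate pool before sampling happens at all.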

⚠️ Exam Trap: For tasks needing consistent, factual output (like extracting a date), you want low temperature, not high. High temperature is for creative variety. Reaching for high temperature to "improve" a factual task is a common wrong answer.
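The trap can be illustrated by sampling the same toy distribution many times and counting how many different answers come back. This is a sketch under stated assumptions: the candidate scores, the helper names, and the run count are hypothetical, and a real model samples a whole sequence of tokens rather than one.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def distinct_outputs(logits, temperature, runs=200, seed=0):
    """Count how many different tokens appear across repeated samples."""
    rng = random.Random(seed)  # seeded so the demo is repeatable
    return len({sample_token(logits, temperature, rng) for _ in range(runs)})


date_logits = [3.0, 2.0, 1.0, 0.5]  # made-up scores for four candidate dates

print("distinct answers at temperature 0.05:", distinct_outputs(date_logits, 0.05))
print("distinct answers at temperature 2.0: ", distinct_outputs(date_logits, 2.0))
```

At a very low temperature the top-scoring candidate dominates, so repeated runs agree; at a high temperature the lower-scoring candidates appear regularly, which is the inconsistency a factual extraction task cannot tolerate.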

Reflection Question: A summarization feature gives wildly different summaries each time it runs on the same document. Which parameter would you adjust, and in which direction?

Written by Alvin Varughese
Founder, 18 professional certifications