1.2.1. Inference Parameters and Output Control
💡 First Principle: Text generation parameters don't control what the model knows — they control how the model samples from its probability distribution over possible next tokens. Getting inference parameters wrong wastes budget, produces inconsistent outputs, and causes responses to fail quality thresholds.
When a model generates text, it computes a probability distribution over its entire vocabulary for the next token. The inference parameters shape how it samples from that distribution:
| Parameter | Effect | Practical Use |
|---|---|---|
| Temperature | Scales probability distribution. Low = peaked/deterministic; High = flattened/random | 0-0.2 for factual Q&A, data extraction. 0.7-1.0 for creative writing, brainstorming |
| Top-p (nucleus) | Samples from the smallest set of top tokens whose cumulative probability ≥ p | 0.9 typical for most tasks; reduce for factual precision |
| Top-k | Considers only the k highest-probability tokens | Limits extreme randomness; 50-100 typical range |
| Max tokens | Hard cap on response length | Critical for cost control and context window management |
| Stop sequences | String(s) that halt generation | Use to enforce structured output boundaries |
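To make the interaction between these filters concrete, here is a minimal, illustrative sketch of temperature, top-k, and top-p applied to a toy logit vector. The function name, defaults, and toy values are assumptions for illustration, not any provider's API:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token id from raw logits after applying temperature, top-k, and top-p.

    Toy sketch (assumed helper, not a real library call): real inference engines
    apply the same filters, just vectorized over the full vocabulary.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    # Temperature divides the logits before softmax: low T peaks the
    # distribution toward the argmax; high T flattens it toward uniform.
    scaled = [l / max(temperature, 1e-6) for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = sorted(((i, e / total) for i, e in enumerate(exps)),
                   key=lambda pair: -pair[1])
    # Top-k: keep only the k highest-probability tokens.
    if top_k is not None:
        probs = probs[:top_k]
    # Top-p (nucleus): keep the smallest prefix whose cumulative probability >= p.
    if top_p is not None:
        kept, cum = [], 0.0
        for tok, p in probs:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    # Renormalize the surviving tokens and draw one.
    z = sum(p for _, p in probs)
    r = rng.random() * z
    for tok, p in probs:
        r -= p
        if r <= 0:
            return tok
    return probs[-1][0]
```

Note how near-zero temperature makes the sample collapse onto the highest-logit token, and `top_k=1` forces the same greedy behavior regardless of temperature — two routes to the "deterministic" end of the dial.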
The cost equation: every token costs money: `cost = input_tokens × input_price + output_tokens × output_price`. Max tokens sets your cost ceiling per request, but setting it too low truncates responses mid-sentence, causing downstream parsing failures.
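The cost formula can be captured in a few lines. A minimal sketch, assuming per-1K-token pricing (the price figures below are placeholders, not any provider's actual rates):

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one request: input_tokens x input_price + output_tokens x output_price.

    Prices are expressed per 1,000 tokens, a common billing convention
    (assumed here; check your provider's pricing page for real rates).
    """
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# Example with placeholder prices: 1,000 input tokens at $0.003/1K
# plus 500 output tokens at $0.015/1K.
cost = request_cost_usd(1000, 500, 0.003, 0.015)
```

Because output tokens typically cost several times more than input tokens, a tight max-tokens setting lowers the worst-case bill far more than trimming the prompt does.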
⚠️ Exam Trap: Temperature 0 does not guarantee identical outputs for the same prompt across all models. Some models retain stochasticity at temperature 0 due to non-deterministic floating-point operations on parallel hardware. For truly deterministic outputs at scale, use caching (covered in Phase 6) or structured output enforcement via JSON Schema.
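Even without a full JSON Schema validator, the spirit of structured output enforcement can be sketched with the standard library alone: parse the reply and reject anything missing required fields before it reaches downstream code. The function name and required keys are assumptions for illustration:

```python
import json

def parse_structured_reply(raw: str, required_keys: list[str]) -> dict:
    """Parse a model reply as JSON and verify required keys are present.

    Raises ValueError on malformed JSON or missing fields, so callers can
    retry or fall back instead of silently processing a broken response.
    (Minimal stdlib sketch; a real pipeline might use a JSON Schema library.)
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"reply missing required keys: {missing}")
    return data

# Usage: a well-formed reply passes; anything else raises before
# downstream parsing can fail in stranger ways.
reply = parse_structured_reply('{"answer": "42", "confidence": 0.9}',
                               ["answer", "confidence"])
```

Failing fast at the boundary like this is what turns "sometimes the model truncates mid-sentence" from a silent data-quality bug into an explicit, retryable error.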
Reflection Question: A production chatbot returns inconsistent answer lengths — sometimes 2 sentences, sometimes 10 paragraphs for the same query type. Which inference parameters would you tune first, and why?