1.1.1. The FM as a Prediction Engine
💡 First Principle: Transformer models work by computing attention scores that let every token in a sequence attend to every other token — this global attention mechanism is what allows FMs to handle long-range dependencies, follow complex instructions, and reason across multi-step chains.
Modern foundation models are built on the transformer architecture introduced in the 2017 paper "Attention Is All You Need." The critical mechanism is self-attention: when processing a token, the model learns to weight the relevance of every other token in the context. Because attention scales quadratically with sequence length, enlarging the context window is computationally expensive, which is why long-context models cost more to run.
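The quadratic cost is easy to see with a toy calculation (an illustrative sketch, not a real pricing or FLOP model): a full self-attention layer computes one score per ordered pair of tokens, so an n-token sequence needs n × n scores.

```python
def attention_scores(seq_len: int) -> int:
    """Number of pairwise attention scores a full self-attention
    layer computes for a sequence of seq_len tokens (n * n)."""
    return seq_len * seq_len

# Doubling the context quadruples the attention work.
for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_scores(n):>12,} scores")
```

Doubling from 2,000 to 4,000 tokens quadruples the score count, which is the intuition behind long-context pricing.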
Key concepts the exam assumes you know:
| Concept | What It Means | Why It Matters for System Design |
|---|---|---|
| Token | The base unit of text (≈ 0.75 words on average) | Cost and limits are token-based, not word-based |
| Pre-training | Initial training on massive corpus; sets base capabilities | You can't change pre-training via Bedrock — it's baked in |
| Inference | Running the trained model to generate output | What Bedrock does — you pay per token of input + output |
| Parameters | The model's learned weights (billions to trillions) | More params ≠ better for all tasks; see Domain 2 |
| Context window | The total token limit for input + output combined | Determines what the model can "see" in one call |
| Temperature | Controls randomness of token selection | Higher = more varied/creative output; lower = more deterministic (near-greedy as it approaches 0) |
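The ≈ 0.75 words-per-token heuristic from the table can be turned into a rough budgeting helper. This is a sketch only: real token counts vary by model and tokenizer, so use the provider's tokenizer for billing-accurate numbers.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token estimate from word count using the ~0.75
    words-per-token rule of thumb; real tokenizers differ."""
    words = len(text.split())
    return round(words / words_per_token)

def fits_context(prompt: str, max_output_tokens: int, context_window: int) -> bool:
    """Context windows cover input + output combined, so reserve
    room for the expected completion before sending the prompt."""
    return estimate_tokens(prompt) + max_output_tokens <= context_window

print(estimate_tokens("a short example prompt"))  # 4 words -> ~5 tokens
```

Note that `fits_context` checks input and output against a single limit, reflecting the "input + output combined" rule from the table.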
Autoregressive generation: FMs generate output one token at a time, with each new token conditioned on all previous tokens. This is what makes streaming responses possible (tokens can be emitted as soon as they are generated) and why truncating output mid-stream can leave results incomplete or inconsistent.
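The decoding loop can be sketched with a toy next-token model (a hypothetical bigram lookup table, nothing like a real FM): each step conditions on what has been generated so far, and each token can be streamed out the moment it exists.

```python
# Toy "model": the next token depends only on the last token generated.
# A real FM conditions on the entire sequence via attention, but the
# loop structure (predict one token, append, repeat) is the same.
NEXT = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}

def generate(prompt: str, max_tokens: int = 10):
    tokens = prompt.split()
    for _ in range(max_tokens):
        nxt = NEXT.get(tokens[-1], "<eos>")
        if nxt == "<eos>":       # stop token ends generation
            break
        tokens.append(nxt)       # condition the next step on this token
        yield nxt                # streaming: emit as soon as it exists

print(list(generate("the")))  # ['cat', 'sat', 'down']
```

Because the function is a generator, a caller can display each token immediately, which is exactly how streaming APIs expose partial responses.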
⚠️ Exam Trap: Candidates often confuse context window limits with memory. The context window resets every API call unless you explicitly pass conversation history. "Memory" in an agent or chatbot is always implemented externally — in DynamoDB, in Bedrock Knowledge Bases session context, or in custom stores.
Reflection Question: If a foundation model generates text by predicting statistically likely next tokens rather than retrieving facts, what architectural component must a production system add to ensure factually accurate responses about proprietary company data?