Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

1.1.1. The FM as a Prediction Engine

💡 First Principle: Transformer models work by computing attention scores that let every token in a sequence attend to every other token — this global attention mechanism is what allows FMs to handle long-range dependencies, follow complex instructions, and reason across multi-step chains.

Modern foundation models are built on the transformer architecture introduced in the 2017 "Attention is All You Need" paper. The critical mechanism is self-attention: when processing a token, the model learns to weight the importance of every other token in the context. This is why increasing context window size is computationally expensive (attention scales quadratically with sequence length) and why long-context models are more expensive to run.
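To make the quadratic-cost claim concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python (no framework, no learned Q/K/V projections — each token vector serves as its own query, key, and value). Everything here is illustrative, not Bedrock-specific: for a sequence of n tokens, every token scores every other token, which is the n² term that makes long contexts expensive.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over n token vectors of
    dimension d. Cost is O(n^2 * d): each of the n queries computes
    a score against all n keys."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:                              # one query per token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]                # n scores per query
        weights = softmax(scores)                 # attention weights sum to 1
        out = [sum(w * v[i] for w, v in zip(weights, tokens))
               for i in range(d)]                 # weighted sum of values
        outputs.append(out)
    return outputs

# Toy 3-token sequence with 2-dimensional embeddings.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = self_attention(seq)
```

Doubling the sequence length quadruples the number of score computations, which is exactly why long-context models cost more to run.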

Key concepts the exam assumes you know:
| Concept | What It Means | Why It Matters for System Design |
|---|---|---|
| Token | The base unit of text (≈ 0.75 words on average) | Cost and limits are token-based, not word-based |
| Pre-training | Initial training on a massive corpus; sets base capabilities | You can't change pre-training via Bedrock — it's baked in |
| Inference | Running the trained model to generate output | What Bedrock does — you pay per token of input + output |
| Parameters | The model's learned weights (billions to trillions) | More params ≠ better for all tasks; see Domain 2 |
| Context window | The total token limit for input + output combined | Determines what the model can "see" in one call |
| Temperature | Controls randomness of token selection | Higher = more creative/varied; lower = more deterministic |
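The temperature row is worth seeing numerically. This is a hedged sketch of the standard mechanism (toy logits, not real model output): raw scores are divided by the temperature before softmax, so low temperatures sharpen the distribution toward the top token and high temperatures flatten it.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw logits into a sampling distribution, scaled by
    temperature. Low temperature -> near-deterministic (top token
    dominates); high temperature -> flatter, more varied choices."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                  # toy scores for 3 candidate tokens
cold = apply_temperature(logits, 0.2)     # sharply peaked on token 0
hot = apply_temperature(logits, 2.0)      # much flatter distribution
```

At temperature 0.2 the top token takes nearly all the probability mass; at 2.0 the three tokens are much closer together, which is why higher settings read as "more creative."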

Autoregressive generation: FMs generate output one token at a time, with each new token conditioned on all previous tokens. This is why streaming responses are possible (tokens can be emitted as soon as they are generated) and why truncating output mid-stream can leave responses incomplete or internally inconsistent.
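The decoding loop above can be sketched in a few lines. This is a toy illustration, not a real model: `next_token_fn` stands in for the FM's next-token predictor, and yielding each token as it is produced is exactly what a streaming API does.

```python
def generate(prompt_tokens, next_token_fn, max_new_tokens, stop_token=None):
    """Autoregressive decoding: pick the next token from the full
    sequence so far, append it, and feed the extended sequence back
    in. Yielding each token immediately is streaming."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        nxt = next_token_fn(tokens)   # conditioned on ALL prior tokens
        if nxt == stop_token:
            break
        tokens.append(nxt)
        yield nxt                     # emit immediately (streaming)

# Toy "model": echoes the last token until the sequence reaches 5 tokens.
toy_model = lambda toks: toks[-1] if len(toks) < 5 else "<eos>"
streamed = list(generate(["hello", "world"], toy_model, 10, stop_token="<eos>"))
```

Note that each iteration re-reads the entire sequence so far, which is also why generating long outputs gets slower per token as the sequence grows.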

⚠️ Exam Trap: Candidates often confuse context window limits with memory. The context window resets every API call unless you explicitly pass conversation history. "Memory" in an agent or chatbot is always implemented externally — in DynamoDB, in Bedrock Knowledge Bases session context, or in custom stores.
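A minimal sketch of what "external memory" means in practice. The message shape here is modeled on Bedrock's Converse API (`role` plus a list of content blocks), and the in-memory dict stands in for DynamoDB or another session store — both are assumptions for illustration, not a definitive implementation.

```python
class ConversationStore:
    """External 'memory' for a stateless model API. The model forgets
    everything between calls, so the application must persist and
    replay the history itself (in production: DynamoDB, a session
    store, etc.; here, a plain dict)."""

    def __init__(self):
        self._sessions = {}

    def history(self, session_id):
        """Return the full message list for a session (empty if new)."""
        return self._sessions.setdefault(session_id, [])

    def append(self, session_id, role, text):
        # Message shape modeled on Bedrock's Converse API (assumption).
        self.history(session_id).append(
            {"role": role, "content": [{"text": text}]}
        )

store = ConversationStore()
store.append("user-42", "user", "What is our refund policy?")
store.append("user-42", "assistant", "Refunds are processed within 14 days.")
# Every new API call must include the replayed history, because the
# context window resets between calls.
messages_for_next_call = store.history("user-42")
```

The key design point: "memory" is just the application replaying stored messages into the next call's context window; nothing persists inside the model.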

Reflection Question: If a foundation model generates text by predicting statistically likely next tokens rather than retrieving facts, what architectural component must a production system add to ensure factually accurate responses about proprietary company data?

Written by Alvin Varughese, Founder (15 professional certifications)