3.1.1. How Generative AI Models Work
💡 First Principle: Generative models work on tokens: small chunks of text (roughly word-pieces). You send a prompt (tokens in); the model computes a probability for every possible next token, picks one (often, but not always, the most likely), appends it, and repeats until it emits a stop token or reaches a length limit, producing a completion (tokens out). It's prediction all the way down; the model itself performs no fact lookup.
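The predict-append-repeat loop described above can be sketched in a few lines of Python. This is a toy illustration, not how any real model is implemented: the table NEXT_TOKEN_PROBS, its probabilities, the "<end>" stop token, and the function name generate are all made up for the example, and the sketch always picks the single most likely token, whereas real systems often sample.

```python
# Toy next-token table with invented probabilities. A real LLM computes these
# scores with a neural network that looks at the whole context, not a lookup
# keyed on the last token only.
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def generate(prompt_tokens, max_new_tokens=10):
    """Return the completion tokens for a prompt using greedy prediction."""
    tokens = list(prompt_tokens)
    if not tokens:
        return []
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # the toy table has no prediction for this token
        # Greedy choice: take the highest-probability next token.
        next_token = max(candidates, key=candidates.get)
        if next_token == "<end>":
            break  # stop token: the completion is finished
        tokens.append(next_token)
    # Everything after the prompt is the completion (tokens out).
    return tokens[len(prompt_tokens):]


print(generate(["the"]))  # ['cat', 'sat']
```

Note that nothing in the loop checks whether the output is true; it only follows the probabilities, which is the point of the First Principle.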
Two terms you must know. The context window is the maximum number of tokens the model can consider at once (prompt plus completion). Anything that does not fit is simply not seen by the model; in a long conversation the earliest content is typically what gets dropped, which is why the model appears to "forget" it. A large language model (LLM) is a generative model trained on enormous text corpora to do this prediction well across many topics. Because the model predicts plausible continuations rather than retrieving verified facts, it can generate fluent text that's wrong: the hallucination problem from Phase 1, now with a mechanism attached.
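To make the context-window limit concrete, here is a minimal sketch of one common way an application might trim a conversation: keep the newest messages that fit and drop the oldest. The function name fit_to_context, the message token counts, and the window sizes are hypothetical values chosen for illustration; real products may instead summarize old messages or return an error.

```python
def fit_to_context(messages, context_window, reserved_for_completion):
    """Keep the most recent (text, token_count) messages that fit the budget.

    The window covers prompt plus completion, so part of it is reserved
    for the tokens the model will generate.
    """
    budget = context_window - reserved_for_completion
    kept = []
    used = 0
    # Walk from newest to oldest and stop at the first message that won't fit.
    for text, n_tokens in reversed(messages):
        if used + n_tokens > budget:
            break
        kept.append((text, n_tokens))
        used += n_tokens
    kept.reverse()  # restore chronological order
    return kept


# Made-up token counts for a four-message conversation.
conversation = [("msg 1", 40), ("msg 2", 30), ("msg 3", 50), ("msg 4", 20)]

visible = fit_to_context(conversation, context_window=120, reserved_for_completion=40)
print([text for text, _ in visible])  # ['msg 3', 'msg 4']
```

With a budget of 80 prompt tokens, the two earliest messages never reach the model, so it cannot use anything they said.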
⚠️ Exam Trap: Tokens are not the same as words. A long word may be several tokens; punctuation and spaces count too. Pricing and context-window limits are measured in tokens, not words — a distinction the exam likes to probe.
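The gap between word count and token count can be seen with a toy tokenizer. This is only a sketch under stated assumptions: TOY_VOCAB and toy_tokenize are invented for the example, and the greedy longest-match rule stands in for the learned vocabularies that production tokenizers use, so real token counts and split points will differ.

```python
# Hypothetical vocabulary of word-pieces; real tokenizers learn theirs from data.
TOY_VOCAB = {"un", "believ", "able", "the", "cat", " ", "!"}


def toy_tokenize(text):
    """Split text into pieces using greedy longest-match against TOY_VOCAB."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary piece matches: emit the single character.
            tokens.append(text[i])
            i += 1
    return tokens


text = "the cat unbelievable!"
tokens = toy_tokenize(text)
print(len(text.split()), "words")  # 3 words
print(len(tokens), "tokens")       # 8 tokens
print(tokens)  # ['the', ' ', 'cat', ' ', 'un', 'believ', 'able', '!']
```

Three words become eight tokens here because the long word splits into pieces and the spaces and punctuation are counted as well, which is why limits and prices quoted in tokens cannot be read as word counts.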
Reflection Question: If a model has a fixed context window and you keep adding to a long conversation, what eventually happens to the earliest messages, and why?