3.1.5. The Transformer Architecture
Transformers are the architecture that made modern generative AI possible. They power GPT, DALL-E, and virtually every large language model you've heard of. Understanding transformers—even at a high level—is essential because they're the foundation of the largest exam domain (Generative AI, 20-25%).
💡 The breakthrough insight: Before transformers, language models processed text sequentially—one word at a time, like reading a book aloud. Transformers process entire sequences in parallel, using "attention" to understand how every word relates to every other word. This is why they can understand that in "The bank by the river" and "The bank approved the loan," the word "bank" means completely different things.
The attention mechanism in action:
Consider the sentence "The cat sat on the mat because it was tired." The attention mechanism asks: "When trying to understand 'it', which words should I pay attention to?" It learns that "it" (a pronoun) most likely refers to "cat" (the subject that could be tired), not "mat" (objects don't get tired). An attention weight of 0.72 on "cat" would mean the model is roughly 72% focused on that word when interpreting "it."
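The intuition above can be sketched in a few lines of Python. This is a toy, not real transformer code: the similarity scores between "it" and the other words are made-up numbers standing in for learned query-key dot products, and the softmax turns them into attention weights that sum to 1.

```python
import math

def softmax(scores):
    """Convert raw similarity scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# HYPOTHETICAL similarity scores between the query for "it" and each
# word in "The cat sat on the mat" (higher = more related). In a real
# transformer these come from learned query/key vectors.
words  = ["The", "cat", "sat", "on", "the", "mat"]
scores = [0.1,   2.0,   0.5,  0.1,  0.1,   1.0]

weights = softmax(scores)
for word, weight in zip(words, weights):
    print(f"{word:>4}: {weight:.2f}")
```

Whatever raw scores the model computes, the softmax guarantees the weights form a distribution, so "72% focused on cat" is a meaningful statement.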
How transformers process language:
| Step | What Happens | Example |
|---|---|---|
| 1. Tokenization | Text split into tokens | "Hello world" → ["Hello", "world"] |
| 2. Embedding | Tokens converted to vectors | Each token → 768+ numbers |
| 3. Attention | Vectors weighted by relationships | "it" attends to "cat" |
| 4. Feed-forward | Patterns combined and refined | Multiple layers of processing |
| 5. Output | Generate prediction or next token | "The cat sat on the..." → "mat" |
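Steps 1 and 2 of the table can be mimicked with a deliberately simplified sketch. Real models use learned subword tokenizers and learned embedding matrices; here a whitespace split and random vectors stand in for both, just to show the shape of the data at each step.

```python
import random

random.seed(0)
EMBED_DIM = 8  # real models use 768+ dimensions per token

def tokenize(text):
    # Step 1: split text into tokens. Real tokenizers split into
    # subwords, so one English word can become several tokens.
    return text.split()

vocab = {}
def embed(token):
    # Step 2: map each token to a vector. Here the vectors are random
    # placeholders; in a real model they are learned during training.
    if token not in vocab:
        vocab[token] = [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    return vocab[token]

tokens = tokenize("The cat sat on the mat")
vectors = [embed(t) for t in tokens]
print(tokens)           # ['The', 'cat', 'sat', 'on', 'the', 'mat']
print(len(vectors[0]))  # 8 numbers per token (768+ in real models)
```

From here, steps 3-5 (attention, feed-forward layers, output) operate entirely on these vectors, never on the raw text.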
Key transformer concepts for the exam:
| Concept | What It Means | Why It Matters |
|---|---|---|
| Attention | Mechanism to weigh word relationships | Enables understanding of context |
| Parallel processing | Entire sequence processed at once | Much faster than sequential models |
| Pre-training | Learning from massive text data | Creates foundation knowledge |
| Fine-tuning | Adapting to specific tasks | Customizes for your use case |
| Tokens | Units of text (often subwords) | Billing and limits based on tokens |
The transformer family tree:
- Encoder models (like BERT): Understand text, create embeddings, power search
- Decoder models (like GPT, DALL-E): Generate new content
- Encoder-decoder models: Transform input to output (translation, summarization)
Why transformers matter for Azure:
- Azure OpenAI Service: Provides access to GPT-4, GPT-3.5, DALL-E (all transformer-based)
- Embeddings: BERT-style encoders create searchable vector representations
- Token-based pricing: You pay per token processed—understanding tokenization helps predict costs
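Because billing is per token, a cost estimate is simple arithmetic once you know your token counts. The rates below are hypothetical placeholders, not real Azure OpenAI prices; always check the current pricing page.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_rate_per_1k=0.01,       # HYPOTHETICAL rate
                  completion_rate_per_1k=0.03):  # HYPOTHETICAL rate
    """Estimate the cost of one request from its token counts.
    Prompt (input) and completion (output) tokens are often billed
    at different per-1,000-token rates."""
    return (prompt_tokens / 1000) * prompt_rate_per_1k \
         + (completion_tokens / 1000) * completion_rate_per_1k

# A 500-token prompt producing a 200-token completion:
print(f"${estimate_cost(500, 200):.4f}")  # $0.0110 at these example rates
```

The practical takeaway: shorter prompts and tighter completions directly reduce cost, which is why token counting matters beyond just model limits.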
⚠️ Exam Trap: GPT stands for "Generative Pre-trained Transformer"—it's IN the name. If a question asks about the architecture behind GPT or large language models, the answer is "transformer." Don't confuse it with "neural network" (too generic) or "deep learning" (also too generic).
Reflection Question: If transformers process sequences in parallel, why can they still understand word order? (Answer: Position encodings—each token's position is encoded in its vector representation.)
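The sinusoidal position encoding from the original Transformer paper ("Attention Is All You Need") can be computed directly: even vector indices use sine, odd indices use cosine, at geometrically decreasing frequencies. Each position gets a distinct vector that is added to the token's embedding, so the same token at position 0 and position 5 looks different to the attention layers.

```python
import math

def position_encoding(pos, dim):
    """Sinusoidal position encoding: even indices use sin, odd use cos,
    with the angle frequency decreasing as the index grows."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

print(position_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(position_encoding(1, 4))  # a different vector for position 1
```

Many newer models use learned or rotary position encodings instead, but the principle is the same: position information is injected into each token's vector so that parallel attention can still respect word order.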