6.1.4. Embeddings and Retrieval-Augmented Generation
Embeddings: Embeddings convert text into numerical vectors (lists of numbers) that capture semantic meaning. Think of it like GPS coordinates for concepts—similar ideas have similar coordinates.
How embeddings work: When you create an embedding, the model analyzes the text and outputs a vector—typically 1,536 numbers for OpenAI's text-embedding-ada-002 model. These numbers encode the semantic meaning of the text in a way that enables mathematical comparison.
Vector similarity:
- Texts with similar meanings have vectors pointing in similar directions
- Cosine similarity measures how "alike" two vectors are; it ranges from -1 to 1, where 1 means identical direction (same meaning), 0 means unrelated, and -1 means opposite
- This enables finding "similar" content without keyword matching
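The comparison above can be sketched in plain Python. This is a minimal illustration of cosine similarity using toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions); the vector values are invented for the example.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — similar concepts get similar coordinates
reset_password  = [0.90, 0.10, 0.30]
recover_account = [0.85, 0.15, 0.35]
pizza_recipe    = [0.10, 0.90, 0.20]

print(cosine_similarity(reset_password, recover_account))  # close to 1
print(cosine_similarity(reset_password, pizza_recipe))     # much lower
```

Even though "reset password" and "recover account" share no keywords, their vectors point in nearly the same direction, so their cosine similarity is high.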
Example embedding use case: Imagine searching a knowledge base for "How do I reset my password?" Using embeddings:
- Convert the query to a vector
- Compare against vectors of all documents
- Return documents with highest similarity—even if they say "recovering account access" instead of "reset password"
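The three steps above can be sketched as a small search function. The document texts and pre-computed vectors here are invented for illustration; in a real system each embedding would come from a call to an embedding model rather than being hard-coded.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def semantic_search(query_vec, documents, top_k=1):
    # Rank (text, embedding) pairs by similarity to the query vector
    ranked = sorted(documents,
                    key=lambda d: cosine_similarity(query_vec, d[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

# Toy knowledge base with pre-computed (hypothetical) embeddings
knowledge_base = [
    ("Recovering account access",  [0.88, 0.12, 0.30]),
    ("Office holiday schedule",    [0.05, 0.95, 0.10]),
]
query = [0.90, 0.10, 0.28]  # pretend embedding of "How do I reset my password?"
print(semantic_search(query, knowledge_base))  # → ['Recovering account access']
```

Note that the top result never mentions "reset" or "password"—the match is on meaning, not keywords.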
Embedding use cases:
- Semantic search: Find documents with similar meaning, not just matching keywords
- Classification: Group documents by topic
- Recommendation: Find similar items
- Retrieval-Augmented Generation (RAG): Ground AI responses in your data
- Anomaly detection: Find outliers in text data
- Clustering: Group similar items together automatically
Retrieval-Augmented Generation (RAG): RAG is a critical pattern that combines generative AI with search to reduce hallucinations:
- User asks a question
- System searches your knowledge base using embeddings
- Relevant documents retrieved and added to the prompt
- Model generates response grounded in your actual data
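Steps 3 and 4 hinge on prompt assembly: the retrieved passages are injected into the prompt so the model answers from your data rather than from its training data alone. A minimal sketch, with an invented prompt template and a made-up retrieved document:

```python
def build_rag_prompt(question, retrieved_docs):
    # Concatenate retrieved passages into the prompt; the instruction tells
    # the model to stay grounded in the provided context.
    context = "\n\n".join(retrieved_docs)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical passage returned by the embedding search in step 2
docs = ["To reset your password, open Settings > Security and choose 'Reset'."]
prompt = build_rag_prompt("How do I reset my password?", docs)
# This assembled prompt is what gets sent to the model at inference time.
```

Because the grounding happens entirely inside the prompt, the model itself is untouched—which is exactly why RAG needs no retraining.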
RAG benefits:
- Responses based on YOUR verified information
- Dramatically reduced hallucinations
- Up-to-date answers (knowledge base can be updated)
- Traceable sources for verification
- No model retraining required
⚠️ Exam Trap: RAG does NOT require fine-tuning or retraining the model. It works by providing context at inference time through the prompt. This is a key distinction—RAG is faster to implement and doesn't require ML expertise.