2.2.3. Encoding Techniques: One-Hot, Label, Binary, and Tokenization
💡 First Principle: ML models operate on numbers, not categories. Encoding translates categorical data into numerical representations—but each encoding method carries implicit assumptions about relationships between categories. Choosing the wrong one tells the model things that aren't true, and the exam tests whether you understand these implicit assumptions.
The most dangerous encoding mistake is applying label encoding (assigning integers 1, 2, 3...) to nominal categories, which have no inherent order (like colors or countries). This tells the model that "red=1 < blue=2 < green=3," implying a mathematical relationship that doesn't exist.
| Encoding | How It Works | Assumption | Best For | Watch Out |
|---|---|---|---|---|
| One-hot | Creates binary column per category | No ordinal relationship | Nominal with <20 categories | High cardinality → massive feature space |
| Label encoding | Assigns integer per category | Ordinal relationship exists | Ordinal categories (low/med/high) | Misleading for nominal categories |
| Binary encoding | Encodes integer as binary digits | Partial order acceptable | High cardinality (100+ categories) | Less interpretable |
| Target encoding | Replaces category with mean of target | Category relates to target | High cardinality + tree models | Can cause data leakage if not careful |
| Tokenization | Splits text into tokens (words/subwords) | Text needs numerical representation | NLP tasks, text features | Vocabulary size and OOV handling |
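Binary encoding from the table above is easy to see in a few lines of code: each category gets an integer ID, and the ID is split into its binary digits, so n categories need only ceil(log2(n)) columns instead of n one-hot columns. This is an illustrative sketch in plain Python, not a library implementation (in practice you'd use a package such as category_encoders).

```python
import math

def binary_encode(categories):
    """Binary-encode nominal categories: integer ID -> binary digit columns.

    n unique categories need only ceil(log2(n)) columns, versus n
    columns for one-hot. Illustrative sketch, not a library API.
    """
    uniques = sorted(set(categories))
    ids = {cat: i for i, cat in enumerate(uniques)}
    width = max(1, math.ceil(math.log2(len(uniques))))
    # Each row becomes `width` 0/1 columns, most significant bit first.
    return [[(ids[cat] >> bit) & 1 for bit in reversed(range(width))]
            for cat in categories]

# 5 categories fit in ceil(log2(5)) = 3 columns instead of 5 one-hot columns.
rows = binary_encode(["a", "b", "c", "d", "e"])
```

Note the table's caveat in action: the columns no longer map to a single category each, which is why binary encoding is less interpretable than one-hot.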
One-hot encoding is the safe default for nominal categories. A "color" feature with values {red, blue, green} becomes three binary columns: color_red, color_blue, color_green. The trade-off: if your categorical feature has 10,000 unique values (like zip codes), one-hot creates 10,000 new columns—the "curse of dimensionality."
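The color example above is a one-liner with pandas `get_dummies` (column names below follow its `prefix_value` convention):

```python
import pandas as pd

# One-hot encode a nominal "color" feature: each unique value
# becomes its own 0/1 indicator column.
df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
# Produces columns color_blue, color_green, color_red (alphabetical).
```

Run this on a 10,000-value zip code column instead and you get 10,000 columns, which is exactly the dimensionality explosion the table warns about.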
Tokenization for text data splits raw text into tokens that can be converted to numerical vectors. SageMaker's BlazingText and Hugging Face transformers handle tokenization natively. For the exam, understand that tokenization is the first step in any NLP pipeline and that different tokenizers (word-level, subword/BPE, character-level) make different trade-offs between vocabulary size and representation granularity.
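The vocabulary and out-of-vocabulary (OOV) mechanics mentioned above can be sketched with a minimal word-level tokenizer in plain Python. Real pipelines (BlazingText, Hugging Face tokenizers) use subword schemes like BPE, but the core idea is the same: build a word-to-ID map from a corpus, and send unseen words to a reserved OOV token instead of failing.

```python
def build_vocab(corpus, oov_token="<unk>"):
    """Word-level tokenizer sketch: map each word to an integer ID.

    ID 0 is reserved for out-of-vocabulary words seen at inference time.
    """
    vocab = {oov_token: 0}
    for text in corpus:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, oov_token="<unk>"):
    # Unknown words map to the OOV id rather than raising an error.
    return [vocab.get(w, vocab[oov_token]) for w in text.lower().split()]

vocab = build_vocab(["the cat sat", "the dog sat"])
ids = encode("the bird sat", vocab)  # "bird" is out-of-vocabulary
```

Subword tokenizers exist precisely to shrink this trade-off: splitting "bird" into known character or subword pieces keeps the vocabulary small while avoiding a total information loss on OOV words.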
⚠️ Exam Trap: When a question mentions "high cardinality categorical feature" (thousands of unique values), one-hot encoding is usually the wrong answer because of dimensionality explosion. Look for binary encoding, target encoding, or embedding layers as alternatives. The question stem will hint at cardinality—watch for phrases like "thousands of unique product IDs."
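Target encoding, one of the alternatives named above, is short enough to sketch with pandas. The column names (`product_id`, `clicked`) are hypothetical; the essential leakage guard from the table is that category means are computed only on the training split, with unseen categories falling back to the global mean.

```python
import pandas as pd

# Hypothetical click data: replace each high-cardinality product_id with
# the mean of the target, computed ONLY on the training split.
train = pd.DataFrame({"product_id": ["A", "A", "B", "B", "C"],
                      "clicked":    [1,   0,   1,   1,   0]})
test = pd.DataFrame({"product_id": ["A", "B", "D"]})  # "D" unseen in training

means = train.groupby("product_id")["clicked"].mean()
global_mean = train["clicked"].mean()

train["product_enc"] = train["product_id"].map(means)
# Unseen categories get the global mean rather than NaN.
test["product_enc"] = test["product_id"].map(means).fillna(global_mean)
```

Computing `means` on the full dataset (train plus test) would leak test-set labels into the features, which is the "data leakage" caveat from the table; stricter variants use out-of-fold means even within the training split.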
Reflection Question: A feature representing "country" has 195 unique values. Another feature representing "satisfaction_rating" has values {very_unsatisfied, unsatisfied, neutral, satisfied, very_satisfied}. Which encoding would you use for each, and why?