Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.2.3. Encoding Techniques: One-Hot, Label, Binary, and Tokenization

💡 First Principle: ML models operate on numbers, not categories. Encoding translates categorical data into numerical representations—but every encoding method carries implicit assumptions about the relationships between categories. Choosing the wrong method teaches the model relationships that don't exist, and the exam tests whether you understand these implicit assumptions.

The most dangerous mistake in encoding is treating nominal categories (no inherent order, like colors or countries) with ordinal encoding (assigning integers 1, 2, 3...). This tells the model that "red=1 < blue=2 < green=3," implying a mathematical relationship that doesn't exist.
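A minimal sketch of this mistake with pandas (the column values and mapping below are illustrative, not from any specific dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "satisfaction": ["low", "high", "medium", "low"],  # ordinal: order is real
    "color": ["red", "blue", "green", "red"],          # nominal: no order
})

# Correct: an explicit ordinal mapping preserves the true ranking.
order = {"low": 0, "medium": 1, "high": 2}
df["satisfaction_enc"] = df["satisfaction"].map(order)

# Incorrect for nominal data: integer codes imply blue < green < red,
# a mathematical ordering that doesn't exist for colors.
df["color_enc_bad"] = df["color"].astype("category").cat.codes

print(df)
```

A linear model fed `color_enc_bad` will treat "red" as three times "blue"—exactly the false relationship the principle above warns about.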

| Encoding | How It Works | Assumption | Best For | Watch Out |
|---|---|---|---|---|
| One-hot | Creates a binary column per category | No ordinal relationship | Nominal features with <20 categories | High cardinality → massive feature space |
| Label encoding | Assigns an integer per category | Ordinal relationship exists | Ordinal categories (low/med/high) | Misleading for nominal categories |
| Binary encoding | Encodes the category integer as binary digits | Partial order acceptable | High cardinality (100+ categories) | Less interpretable |
| Target encoding | Replaces category with the mean of the target | Category relates to target | High cardinality + tree models | Can cause data leakage if not careful |
| Tokenization | Splits text into tokens (words/subwords) | Text needs numerical representation | NLP tasks, text features | Vocabulary size and OOV handling |

One-hot encoding is the safe default for nominal categories. A "color" feature with values {red, blue, green} becomes three binary columns: color_red, color_blue, color_green. The trade-off: if your categorical feature has 10,000 unique values (like zip codes), one-hot creates 10,000 new columns—the "curse of dimensionality."
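The color example above can be sketched with `pandas.get_dummies` (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# One binary column per unique value: color_blue, color_green, color_red.
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot.shape)  # (4, 3) -- 4 rows, one column per category
```

Swap in a zip-code column with 10,000 unique values and the same call produces 10,000 columns, which is why the table above caps one-hot at low-cardinality features.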

Tokenization for text data splits raw text into tokens that can be converted to numerical vectors. SageMaker's BlazingText and Hugging Face transformers handle tokenization natively. For the exam, understand that tokenization is the first step in any NLP pipeline and that different tokenizers (word-level, subword/BPE, character-level) make different trade-offs between vocabulary size and representation granularity.
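A minimal word-level tokenizer sketch, assuming a toy corpus; real pipelines (e.g. Hugging Face tokenizers) use subword schemes like BPE, but the core idea—mapping tokens to integer IDs with an out-of-vocabulary fallback—is the same:

```python
from collections import Counter

corpus = ["the cat sat", "the dog sat on the mat"]

# Build the vocabulary from training text; reserve ID 0 for OOV tokens.
counts = Counter(tok for line in corpus for tok in line.split())
vocab = {"<OOV>": 0}
for tok in sorted(counts):
    vocab[tok] = len(vocab)

def encode(text):
    # Words never seen in training map to the <OOV> ID instead of failing.
    return [vocab.get(tok, 0) for tok in text.split()]

print(encode("the cat ran"))  # "ran" is OOV and maps to ID 0
```

Word-level tokenization keeps each token meaningful but grows the vocabulary with every new word; subword tokenizers trade some granularity for a bounded vocabulary and far fewer OOV cases.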

⚠️ Exam Trap: When a question mentions "high cardinality categorical feature" (thousands of unique values), one-hot encoding is usually the wrong answer because of dimensionality explosion. Look for binary encoding, target encoding, or embedding layers as alternatives. The question stem will hint at cardinality—watch for phrases like "thousands of unique product IDs."
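Target encoding, one of those alternatives, can be sketched as follows (the product-ID data is hypothetical). Note the leakage caveat from the table: computing category means on the full dataset lets target information leak into the feature, so in practice you would use out-of-fold means or smoothing—this shows only the basic mechanism:

```python
import pandas as pd

train = pd.DataFrame({
    "product_id": ["A", "A", "B", "B", "C"],
    "clicked":    [1,   0,   1,   1,   0],
})

# Replace each category with the mean of the target for that category,
# computed on training data only.
means = train.groupby("product_id")["clicked"].mean()
train["product_id_te"] = train["product_id"].map(means)
print(train["product_id_te"].tolist())  # [0.5, 0.5, 1.0, 1.0, 0.0]
```

One numeric column replaces thousands of one-hot columns, which is why target encoding pairs well with high-cardinality features and tree models.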

Reflection Question: A feature representing "country" has 195 unique values. Another feature representing "satisfaction_rating" has values {very_unsatisfied, unsatisfied, neutral, satisfied, very_satisfied}. Which encoding would you use for each, and why?

Written by Alvin Varughese, Founder (15 professional certifications)