3.1.3. Text Preprocessing (Tokenization, Stemming, Lemmatization)
First Principle: Text preprocessing fundamentally transforms raw text data into a structured numerical representation suitable for ML algorithms, addressing linguistic variations and optimizing feature extraction for NLP tasks.
Working with text data for machine learning (Natural Language Processing - NLP) requires specialized preprocessing steps to convert human language into a numerical format that algorithms can understand.
Key Concepts of Text Preprocessing:
- Purpose: Convert raw, unstructured text into a clean, normalized, and numerical format suitable for ML algorithms. Reduce noise and dimensionality.
- Common Steps (a combined sketch follows these items):
  - Lowercasing: Convert all text to lowercase so that "The" and "the" are treated as the same word.
  - Punctuation Removal: Remove punctuation marks (e.g., '.', '!', ',').
  - Noise Removal: Remove special characters, HTML tags, or other irrelevant symbols.
  - Stop Word Removal: Remove common words that carry little meaning on their own (e.g., "a", "an", "the", "is").
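A minimal sketch of these cleaning steps chained together, using only Python's standard library (the stop word set here is a tiny illustrative sample, not a full curated list):

```python
import re
import string

# Illustrative stop word sample; real pipelines use a fuller list (e.g., NLTK's).
STOP_WORDS = {"a", "an", "and", "are", "is", "it", "of", "or", "the", "to"}

def clean_text(text: str) -> list[str]:
    text = text.lower()                                   # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)                  # noise removal: strip HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = text.split()                                 # naive whitespace split
    return [t for t in tokens if t not in STOP_WORDS]     # stop word removal

print(clean_text("The <b>quick</b> brown fox is running!"))
# ['quick', 'brown', 'fox', 'running']
```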
- Tokenization (sketch below):
  - What it is: Breaking continuous text into smaller units called tokens (words, subwords, or characters).
  - Example: "Hello, world!" -> ["Hello", ",", "world", "!"], or ["Hello", "world"] if punctuation is stripped first.
- Stemming (sketch below):
  - What it is: Reducing words to a root form (stem) by stripping suffixes with fast, rule-based heuristics (e.g., the Porter stemmer).
  - Example: "running", "runs" -> "run"
  - Limitation: Stems may not be actual words (e.g., "easily" -> "easili"), and irregular forms such as "ran" are left unchanged because there is no suffix to strip.
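NLTK's Porter stemmer shows both the behavior and the limitation:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "ran", "easily"]:
    print(word, "->", stemmer.stem(word))
# running -> run
# runs -> run
# ran -> ran       (irregular form: no suffix to strip)
# easily -> easili (the stem is not a real word)
```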
- Lemmatization (sketch below):
  - What it is: Reducing words to their base or dictionary form (lemma) using vocabulary lookup and morphological analysis, often guided by part of speech.
  - Example: "running", "runs", "ran" -> "run" (a real dictionary word, even for the irregular "ran")
  - Benefit: More accurate than stemming and always produces actual words, though typically slower.
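A minimal lemmatization sketch with NLTK's WordNet lemmatizer (assuming the "wordnet" corpus is downloaded; supplying the part of speech matters, since the default treats every word as a noun):

```python
import nltk
nltk.download("wordnet", quiet=True)  # one-time download of the WordNet dictionary
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["running", "runs", "ran"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # pos="v": treat as a verb
# running -> run
# runs -> run
# ran -> run  (morphological analysis resolves the irregular form)
```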
- Vectorization: Converting text tokens into numerical vectors, e.g., Bag-of-Words, TF-IDF, or word embeddings such as Word2Vec or FastText (TF-IDF sketch below).
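As one vectorization example, scikit-learn's TfidfVectorizer turns a list of documents into a sparse TF-IDF matrix (the two-review corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the product works great",
    "the product stopped working",
]
vectorizer = TfidfVectorizer()        # lowercases, tokenizes, and weights terms
X = vectorizer.fit_transform(corpus)  # sparse matrix of shape (n_documents, n_terms)

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray().round(2))                # one TF-IDF vector per document
```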
AWS Tools:
- SageMaker Data Wrangler provides built-in transformations for common text preprocessing steps.
- SageMaker Processing Jobs or AWS Glue ETL jobs run custom Python/Spark scripts using libraries like NLTK or spaCy.
- Amazon Comprehend offers higher-level NLP capabilities (e.g., sentiment analysis, entity recognition) without requiring explicit preprocessing for many tasks (sketch below).
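For contrast, a minimal Comprehend sketch using boto3 (assuming AWS credentials are configured; the region is an arbitrary example). Note that raw text is passed in directly, with no manual preprocessing:

```python
import boto3

# Region choice is illustrative; use the region you operate in.
comprehend = boto3.client("comprehend", region_name="us-east-1")

response = comprehend.detect_sentiment(
    Text="The delivery was fast and the product works great!",
    LanguageCode="en",
)
print(response["Sentiment"])       # e.g., POSITIVE
print(response["SentimentScore"])  # confidence scores per sentiment class
```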
Scenario: You are building a model to classify customer reviews. The raw text data contains variations like "running," "ran," and "runs," as well as punctuation and common words like "the" and "a." You need to prepare this text so that these variations are treated consistently and irrelevant words are removed.
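Putting the pieces together for this scenario, one possible preprocessing function (a sketch using NLTK; the review text is invented) could look like this:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "wordnet", "stopwords"):
    nltk.download(resource, quiet=True)  # one-time downloads

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess_review(review: str) -> list[str]:
    review = review.lower()                                               # lowercasing
    review = review.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = word_tokenize(review)                                        # tokenization
    return [lemmatizer.lemmatize(t, pos="v")                              # lemmatization
            for t in tokens
            if t not in stop_words]                                       # stop word removal

print(preprocess_review("The app keeps running but it ran slowly yesterday."))
# ['app', 'keep', 'run', 'run', 'slowly', 'yesterday']
```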
Reflection Question: How do text preprocessing techniques like tokenization, stemming, and lemmatization transform raw text into a structured numerical representation suitable for ML algorithms, and how does each step address linguistic variation and improve feature extraction for NLP tasks?