3.4. Data Processing for AI Models and Grounding
Data processing is the bridge between raw business data and agent-ready knowledge. Phase 1 covered grounding conceptually; this section covers the practical design decisions architects must make about data processing pipelines for AI models and agent grounding.
The exam bullet — "Design data processing for AI models and grounding" — tests whether you can design the pipeline that transforms raw data into retrievable, accurate knowledge.
The Data Processing Pipeline:
A typical pipeline runs five stages in order: cleaning, chunking, enrichment, indexing, and serving. Each stage carries design decisions, covered below.
Pipeline Design Decisions:
1. Cleaning — Remove duplicates, fix encoding issues, standardize formats, strip irrelevant content (headers, footers, boilerplate). Garbage in = garbage out.
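The cleaning step above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the `BOILERPLATE` strings are hypothetical examples of footer text, and duplicate detection here is a simple hash of the normalized content.

```python
import hashlib
import re

# Hypothetical boilerplate lines to strip (stand-ins for real headers/footers).
BOILERPLATE = {"Confidential - Internal Use Only", "Page"}

def clean_documents(docs: list[str]) -> list[str]:
    """Normalize whitespace, strip boilerplate lines, and drop exact duplicates."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for doc in docs:
        lines = [ln.strip() for ln in doc.splitlines()]
        lines = [ln for ln in lines if ln and ln not in BOILERPLATE]
        text = re.sub(r"\s+", " ", " ".join(lines))
        # Hash the normalized, lowercased text so formatting differences
        # don't defeat duplicate detection.
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            cleaned.append(text)
    return cleaned
```

Note that deduplication happens after normalization, so two copies of the same document that differ only in whitespace or casing still collapse to one.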
2. Chunking — Split documents into segments small enough for retrieval but large enough to preserve context. Key decisions:
- Chunk size: Typically 500-1500 tokens. Too small = fragmented context. Too large = irrelevant padding.
- Overlap: 10-20% overlap between chunks preserves context at boundaries.
- Strategy: By paragraph (simple), by semantic section (smarter), or by sliding window (balanced).
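The sliding-window strategy with overlap can be sketched as follows. This is a simplified example that counts words as a stand-in for tokens; a real pipeline would use the tokenizer of the target model.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Sliding-window chunking. Sizes are in words, approximating tokens;
    overlap (here 20% of chunk_size) preserves context at chunk boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # window advances by size minus overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the end of the text
    return chunks
```

Because consecutive windows share `overlap` words, a sentence that straddles a chunk boundary appears intact in at least one chunk.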
3. Enrichment — Add metadata that improves retrieval: document type, topic classification, date, author, confidence level. Enriched metadata enables filtered searches ("only search HR policies from 2024 or later").
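A minimal sketch of metadata-filtered retrieval, assuming each chunk carries a hypothetical `doc_type` and `year` field. The field names and categories are illustrative, not a specific product's schema.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_type: str  # e.g. "hr_policy" — illustrative classification
    year: int

def filter_chunks(chunks: list[Chunk], doc_type: str, min_year: int) -> list[Chunk]:
    """Apply metadata filters before (or alongside) relevance ranking,
    e.g. 'only search HR policies from 2024 or later'."""
    return [c for c in chunks if c.doc_type == doc_type and c.year >= min_year]
```

In practice the filter runs inside the search engine rather than in application code, but the principle is the same: metadata narrows the candidate set before semantic ranking.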
4. Indexing — Create searchable indexes. Two approaches:
- Keyword-based indexing (traditional search) — Fast, predictable, works for exact matches
- Vector embedding indexing (semantic search) — Finds relevant content based on meaning, handles paraphrasing
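The contrast between the two approaches can be shown with toy implementations: an inverted index for keyword search, and cosine similarity over embedding vectors for semantic search. The embeddings here would come from an embedding model in a real system; only the similarity math is shown.

```python
import math
from collections import defaultdict

def build_inverted_index(docs: list[str]) -> dict[str, set[int]]:
    """Keyword indexing: map each term to the set of documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index

def keyword_search(index: dict[str, set[int]], query: str) -> set[int]:
    """Exact-match retrieval: documents containing every query term."""
    term_sets = [index.get(t.lower(), set()) for t in query.split()]
    return set.intersection(*term_sets) if term_sets else set()

def cosine(a: list[float], b: list[float]) -> float:
    """Similarity between two embedding vectors; near 1.0 means
    semantically close, even if the wording differs."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Keyword search fails on paraphrases ("annual leave" will not match "vacation days"), which is exactly the gap vector indexing closes.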
5. Serving — How the index serves agent queries at inference time. Design for latency (search must return in milliseconds), freshness (how quickly new data appears in search results), and availability (what happens when the index is down).
For Custom AI Model Training:
When designing data processing for model training (not just grounding), additional considerations apply:
- Training data quality — Labeled data must be accurate and representative. Biased training data produces biased models.
- Data splitting — Training, validation, and test sets must be properly separated to avoid data leakage.
- Data versioning — Track which data version was used to train which model version. Essential for reproducibility and audit.
- Privacy considerations — Ensure training data doesn't contain PII that shouldn't be memorized by the model.
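One common technique for leakage-free, reproducible splitting is to assign each record to a split by hashing a stable identifier. This sketch assumes records have a stable ID (e.g. a document ID); hashing guarantees the same record always lands in the same split across pipeline runs and data versions.

```python
import hashlib

def assign_split(record_id: str, train: float = 0.8, val: float = 0.1) -> str:
    """Deterministic hash-based split assignment. Splitting on a stable ID
    (rather than random shuffling per run) prevents validation rows from
    leaking into training when the pipeline is re-run on updated data."""
    digest = hashlib.md5(record_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100 / 100  # uniform value in [0.0, 1.0)
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"
```

Hashing the document ID (not the row ID) also keeps all chunks of one document in the same split, which avoids a subtler form of leakage.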
Exam Trap: The exam may describe a scenario where an agent returns outdated information even though the source data was updated. The issue is usually in the indexing step — the data pipeline didn't re-index after the update. Always design refresh schedules that match the data's freshness requirements.
Reflection Question: A company's agent grounds on a 50,000-page SharePoint library. Users report that the agent sometimes gives answers that mix information from different documents. How would you redesign the chunking and indexing strategy to prevent this?