2.2.3. Data Quality and Its Impact on AI Solutions
💡 First Principle: AI is only as good as the data it works with—garbage in, garbage out applies more to AI than to any previous technology. Data type determines which AI approach works. Data quality determines how well it works. Representative datasets determine whether it works fairly. Understanding this trio helps you diagnose AI failures and set realistic expectations with stakeholders.
Data types and their AI implications:
| Data Type | Examples | AI Approach | Business Consideration |
|---|---|---|---|
| Structured | Databases, spreadsheets, CRM records | Traditional ML excels (classification, prediction) | Clean schemas needed; missing values degrade results |
| Unstructured | Emails, documents, images, audio | Generative AI and specialized AI Services | Volume is high; quality varies widely |
| Semi-structured | JSON, XML, tagged documents | Both approaches can work | Consistent tagging is critical |
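The distinctions in the table above can be made concrete in code. This is a minimal sketch with hypothetical sample records (the field names and values are illustrative, not from any real system): structured data is directly addressable by schema, semi-structured data depends on consistent tagging, and unstructured data offers no fields at all.

```python
import csv
import io
import json

# Hypothetical samples of each data type (illustrative only).
structured = io.StringIO("employee_id,department\n101,Finance\n102,HR\n")
semi_structured = '{"employee_id": 103, "tags": ["remote", "manager"]}'
unstructured = "Hi team, the Q3 policy update is attached. Thanks, Dana."

# Structured: a fixed schema lets traditional ML consume fields directly.
rows = list(csv.DictReader(structured))
print(rows[0]["department"])                 # → Finance

# Semi-structured: keys exist, but inconsistent tagging shows up fast.
record = json.loads(semi_structured)
print(record.get("department", "MISSING"))   # → MISSING

# Unstructured: no schema — generative AI or specialized AI services are
# needed to extract meaning; here only naive keyword matching is possible.
print("policy" in unstructured.lower())      # → True
```

Note how the semi-structured record silently lacks the `department` key the structured source guarantees, which is exactly why the table calls consistent tagging critical.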
Data quality dimensions that affect AI:
- Accuracy — Incorrect data produces incorrect AI outputs. If your product database has wrong prices, Copilot will confidently quote wrong prices.
- Completeness — Missing fields mean AI can't fully answer questions. If employee profiles lack department info, Copilot can't organize information by team.
- Currency — Outdated data produces outdated answers. If your SharePoint still has last year's policies, grounded AI will cite old policies.
- Consistency — Conflicting information across sources confuses AI. If HR and Finance have different employee counts, AI may give contradictory answers.
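Three of the four dimensions above (completeness, currency, consistency) can be checked mechanically before AI is ever deployed. The sketch below uses hypothetical employee records and an assumed 365-day staleness threshold; none of these names or values come from a real system.

```python
from datetime import date

# Hypothetical records from an HR system (illustrative only).
hr_records = [
    {"id": 1, "name": "Ada", "department": "Finance", "updated": date(2025, 1, 10)},
    {"id": 2, "name": "Ben", "department": None,      "updated": date(2023, 6, 1)},
]
finance_headcount = 3  # the Finance system disagrees with HR's 2 records
today = date(2025, 6, 1)

# Completeness: missing fields mean AI can't fully answer questions.
incomplete = [r["id"] for r in hr_records if r["department"] is None]

# Currency: records untouched for over a year will surface stale answers.
stale = [r["id"] for r in hr_records if (today - r["updated"]).days > 365]

# Consistency: conflicting counts across sources produce contradictory answers.
consistent = len(hr_records) == finance_headcount

print(incomplete)   # → [2]
print(stale)        # → [2]
print(consistent)   # → False
```

Accuracy is the one dimension such checks cannot catch automatically: a wrong price in a clean, complete, current record still yields a confidently wrong answer.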
Representative datasets and fairness: When training or fine-tuning AI models, the training data must represent all the groups the AI will serve. If a hiring model is trained primarily on data from one demographic group, it may perform poorly—or unfairly—for others. This connects directly to the Fairness principle in responsible AI.
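A simple representativeness audit can surface this risk before training. The sketch below counts group membership in a hypothetical training set and flags groups below an assumed 10% threshold; the group labels, counts, and cutoff are all illustrative, not a standard fairness metric.

```python
from collections import Counter

# Hypothetical demographic labels in a hiring model's training data.
training_groups = ["A"] * 900 + ["B"] * 80 + ["C"] * 20

counts = Counter(training_groups)
total = sum(counts.values())

# Flag any group under an assumed 10% share — such groups often see
# degraded or unfair model performance relative to the majority group.
underrepresented = {g: c / total for g, c in counts.items() if c / total < 0.10}
print(underrepresented)   # → {'B': 0.08, 'C': 0.02}
```

A flagged group does not automatically mean the model is unfair, but it tells you where to test performance per group, which is how the Fairness principle is verified in practice.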
⚠️ Exam Trap: When asked how to improve AI response quality for organizational questions, "use a better model" is often wrong. The answer is usually to improve the underlying data quality—because grounded AI reflects the quality of its data sources.
Reflection Question: A company deploys Copilot but finds it gives inconsistent answers about company policies. The policies are stored across SharePoint, an intranet wiki, and PDF documents—some outdated. What's the root cause, and what should they fix first?