5.1.3. Data Lineage and Documentation (SageMaker Model Cards)

First Principle: Robust governance and risk management for AI require clear documentation and the ability to trace the lineage of a model from its training data to its final predictions.

You must be able to answer the questions: "Where did this model come from?" and "What data was it trained on?"

Data Lineage:
- Concept: The practice of tracking the origin, movement, and transformation of data throughout its lifecycle. For AI, it means knowing which specific dataset version was used to train which specific model version.
- Importance: Essential for reproducibility, debugging (poor model behavior can be traced back to the exact data the model was trained on), and auditing.
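
To make the dataset-version-to-model-version link concrete, here is a minimal sketch of a lineage record, assuming the training data fits in memory as a list of rows. The names `fingerprint`, `LineageRecord`, and `record_lineage`, the model version string, and the S3 URI are all hypothetical and used only for illustration; a managed lineage service would replace this in practice.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def fingerprint(rows: list[dict]) -> str:
    """Return a stable SHA-256 hash of the dataset contents."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical record tying one model version to one dataset version."""
    model_version: str
    dataset_uri: str
    dataset_sha256: str
    columns: tuple[str, ...]


def record_lineage(model_version: str, dataset_uri: str, rows: list[dict]) -> LineageRecord:
    # Capture every column name that appears in the training rows.
    columns = tuple(sorted({key for row in rows for key in row}))
    return LineageRecord(model_version, dataset_uri, fingerprint(rows), columns)


# Illustrative data only; the values and the URI are made up.
rows = [
    {"income": 52000, "debt_ratio": 0.31, "defaulted": 0},
    {"income": 34000, "debt_ratio": 0.58, "defaulted": 1},
]
record = record_lineage("credit-risk-v3", "s3://example-bucket/train/v7.csv", rows)
print(json.dumps(asdict(record), indent=2))
```

Because the hash changes whenever any row changes, two models that report the same `dataset_sha256` were trained on identical data, and the stored `columns` show which variables were present at training time.
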
Source Citation (for Generative AI):
- Concept: In RAG applications, the ability to cite the specific source document(s) that were retrieved and used to generate the model's answer.
- Importance: Builds trust by letting users verify the model's response against the source of truth.
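
The idea of returning sources alongside an answer can be sketched as follows. This is a toy illustration, not a production RAG pipeline: the retriever ranks passages by simple word overlap, the generation step is replaced by echoing the retrieved text, and the names `Passage`, `retrieve`, and `answer_with_citations` as well as the document names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # identifier of the document the text came from
    text: str


def retrieve(query: str, corpus: list[Passage], top_k: int = 2) -> list[Passage]:
    """Toy retriever: rank passages by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(p.text.lower().split())), p) for p in corpus]
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_k]]


def answer_with_citations(query: str, corpus: list[Passage]) -> dict:
    passages = retrieve(query, corpus)
    # A real system would send the passages to a language model here;
    # this sketch simply returns the retrieved text as the "answer".
    answer = " ".join(p.text for p in passages)
    # Keep the source identifiers with the answer so users can verify it.
    return {"answer": answer, "sources": [p.source for p in passages]}


corpus = [
    Passage("refund-policy.pdf", "Refunds are issued within 14 days of purchase."),
    Passage("shipping-faq.md", "Standard shipping takes 3 to 5 business days."),
    Passage("privacy.html", "We never sell customer data."),
]
result = answer_with_citations("how long do refunds take", corpus)
print(result["answer"])
print("Sources:", result["sources"])
```

The key design point is that the source identifiers travel with the retrieved text from the start, so the citation list reflects exactly what was placed in the model's context rather than being reconstructed afterwards.
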
Documenting Data Origins & Model Details:
- Tool: Amazon SageMaker Model Cards are the primary tool for documenting data origins and model details.
- Purpose: As covered in 4.2.2, they act as a central, standardized document to record all critical information, including:
  - The source and description of the training data.
  - The model's intended use cases and limitations.
  - The evaluation metrics and bias reports.
- Importance: This documentation is essential for governance, compliance, and sharing information across teams.
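
As a rough sketch of how those fields might be captured programmatically, the example below assembles model card content as JSON. The section and field names (`model_overview`, `intended_uses`, `training_details`, `evaluation_details`, `additional_information`) are assumed to match the SageMaker model card JSON schema and should be verified against the current schema before use. The helper names, the card name, and all values are hypothetical. The `create_card` function is shown for context only and is not called here, since it requires AWS credentials.

```python
import json


def build_model_card_content(description, intended_use, training_data_note,
                             evaluation_note, caveats):
    """Assemble model card content as a JSON string (field names are assumptions)."""
    content = {
        "model_overview": {"model_description": description},
        "intended_uses": {"intended_uses": intended_use},
        # Source and description of the training data.
        "training_details": {"training_observations": training_data_note},
        # Evaluation metrics and bias findings, summarized as free text.
        "evaluation_details": [
            {"name": "holdout-evaluation", "evaluation_observation": evaluation_note}
        ],
        # Known limitations.
        "additional_information": {"caveats_and_recommendations": caveats},
    }
    return json.dumps(content)


def create_card(name, content_json):
    """Create a draft card with boto3 (needs AWS credentials; not run in this sketch)."""
    import boto3  # imported lazily so the rest of the example runs without it

    client = boto3.client("sagemaker")
    return client.create_model_card(
        ModelCardName=name,
        Content=content_json,
        ModelCardStatus="Draft",
    )


content_json = build_model_card_content(
    description="Credit risk classifier (illustrative).",
    intended_use="Rank loan applications for manual review.",
    training_data_note="Trained on dataset version v7; protected attributes excluded.",
    evaluation_note="AUC and bias metrics recorded on a held-out set.",
    caveats="Not validated for small-business lending.",
)
print(json.dumps(json.loads(content_json), indent=2))
```

Keeping the card content in code alongside the training pipeline means the documentation can be regenerated whenever the model or its data changes, rather than drifting out of date.
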

Scenario: A regulator audits a financial services company and asks it to prove that a specific credit risk model was not trained on data that included a prohibited variable, such as race.

Reflection Question: How would a well-maintained system with clear data lineage and a comprehensive SageMaker Model Card allow the company to quickly and definitively answer the regulator's question?

💡 Tip: Good documentation is not optional in a professional AI environment; it's a core component of governance and risk management.