Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.
3.2.2. Training and Validation Datasets
Machine learning requires splitting data into separate sets for different purposes:
Training dataset: Used to teach the model patterns. The model learns from this data.
Validation dataset: Used to evaluate model performance. The model never sees this during training—it's a "practice test."
Test dataset: A completely held-out set for final evaluation before deployment. The model sees this only once.
Why split the data? If you test a model on the same data it learned from, it might just memorize answers instead of learning patterns. Validation data tests whether the model can generalize to new, unseen examples.
Typical splits:
| Set | Percentage | Purpose |
|---|---|---|
| Training | 60-80% | Model learns patterns |
| Validation | 10-20% | Tune hyperparameters |
| Test | 10-20% | Final evaluation |
Common pitfalls:
- Data leakage: Information from validation/test accidentally used in training
- Overfitting: Model memorizes training data instead of learning patterns
- Underfitting: Model too simple to capture patterns
Written byAlvin Varughese
Founder•15 professional certifications