Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.2.2. Training and Validation Datasets

Machine learning requires splitting data into separate sets for different purposes:

Training dataset: Used to fit the model. The model learns its patterns (its parameters) from this data.

Validation dataset: Used to evaluate the model during development, for example to tune hyperparameters and compare candidate models. The model is never fit on this data, so it works like a "practice test."

Test dataset: A completely held-out set for the final evaluation before deployment. It is used only once, after all training and tuning are finished.

Why split the data? If you test a model on the same data it learned from, a model that merely memorized the answers would score just as well as one that learned real patterns, so the score tells you nothing. Validation data checks whether the model can generalize to new, unseen examples.
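
The memorization problem can be illustrated with a deliberately silly "model." This is a minimal sketch, not a real learning algorithm: the toy data, the lookup-table model, and the names `predict` and `accuracy` are all assumptions made up for the example.

```python
# Toy data (hypothetical): the label is 1 for odd x and 0 for even x.
data = [(x, x % 2) for x in range(200)]
train, held_out = data[:150], data[150:]

# A "model" that only memorizes: it stores every training pair verbatim
# and falls back to guessing 0 for any input it has not seen.
memory = dict(train)

def predict(x):
    return memory.get(x, 0)

def accuracy(examples):
    correct = sum(predict(x) == y for x, y in examples)
    return correct / len(examples)

train_accuracy = accuracy(train)
held_out_accuracy = accuracy(held_out)
print(train_accuracy)     # 1.0
print(held_out_accuracy)  # 0.5
```

Scored on its own training data, the memorizer looks perfect. Scored on held-out data, it does no better than a coin flip, which is the gap a separate validation set is meant to expose.
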

Typical splits:
Set         Percentage  Purpose
Training    60-80%      Model learns patterns
Validation  10-20%      Tune hyperparameters
Test        10-20%      Final evaluation

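
A three-way split along these lines might be sketched as follows. This is one possible approach using only the Python standard library; the function name `split_dataset`, the 70/15/15 default fractions, and the fixed seed are assumptions chosen for illustration, and real projects often use a library helper instead.

```python
import random

def split_dataset(rows, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle rows, then cut them into (train, validation, test) lists.

    Whatever remains after the training and validation portions
    becomes the test set.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")

    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test

train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Shuffling before cutting matters when the rows are stored in some meaningful order (by date or by class, for instance). Time-ordered data is a common exception, where a chronological split is usually the safer choice.
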
Common pitfalls:
  • Data leakage: Information from validation/test accidentally used in training
  • Overfitting: Model memorizes training data instead of learning patterns
  • Underfitting: Model too simple to capture patterns
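
Data leakage often enters through preprocessing rather than through the model itself. The sketch below assumes a simple standardization step (subtract the mean, divide by the standard deviation); the helper names `fit_scaler` and `transform` and the sample values are hypothetical. The point is only that the statistics should come from the training set alone.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Compute the mean and standard deviation used for scaling."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot scale constant data")
    return mu, sigma

def transform(values, scaler):
    """Apply previously computed statistics to any split."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values for two splits.
train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
validation_values = [10.0, 12.0]

# Correct: statistics are computed from the training data only,
# then reused unchanged on the validation data.
scaler = fit_scaler(train_values)
train_scaled = transform(train_values, scaler)
validation_scaled = transform(validation_values, scaler)

# Leaky: validation values influence the statistics, so information
# about "unseen" data has crept into the training pipeline.
leaky_scaler = fit_scaler(train_values + validation_values)

print(scaler[0], leaky_scaler[0])
```

The two means differ because the leaky version has already looked at the validation values. The same rule applies to any fitted preprocessing step, such as imputing missing values or selecting features.
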

Written by Alvin Varughese
Founder, 15 professional certifications