Generalization in Machine Learning for Better Performance

Mathanraj Sharma
4 min read · Apr 29, 2019

Have you ever noticed your model making false predictions on your test data? Even though you trained it with enough data, you still get false negatives or false positives. Why is that?

Either your model is underfitting or overfitting your training data. Generalization measures how well your model predicts unseen data, so it is important to come up with the best-generalized model you can for better performance on future data. Let us first understand what underfitting and overfitting are, and then look at the best practices for training a generalized model.

A: Underfitting, B: Generalized, C: Overfitting

What is Underfitting?

Underfitting is a state where the model fails to capture the patterns in the training data, and therefore also fails to generalize to new data. You can spot it by watching the loss function during training: a simple rule of thumb is that if both the training loss and the cross-validation loss are high, your model is underfitting.

Lack of data, too few features, low variance in the training data, or a high regularization rate can all cause underfitting. A simple first step is to add more shuffled data to the training set. Depending on the cause, you can also introduce more meaningful features, apply feature crossing, add higher-order polynomials as features, or reduce the regularization rate if you are using regularization. In some cases, simply trying a different training algorithm works fine.
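
As a minimal sketch of the polynomial-features fix, assuming scikit-learn and a synthetic quadratic dataset (both are illustrative assumptions, not from the original article):

```python
# Sketch: adding higher-order polynomial features to fix underfitting.
# PolynomialFeatures(degree=2) also adds pairwise feature crosses (x1 * x2).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)  # quadratic target

linear = LinearRegression().fit(X, y)  # a straight line underfits this data
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:", round(linear.score(X, y), 3))  # noticeably lower
print("poly   R^2:", round(poly.score(X, y), 3))    # captures the curvature
```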

What is Overfitting?

Overfitting is a situation where your model learns the variance in the training data too well; experts describe it as the model starting to memorize the noise instead of learning the underlying pattern. A simple rule of thumb for identifying overfitting: if your training loss is low and your cross-validation loss is high, your model is overfitting.
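
To make this rule of thumb concrete, here is a minimal sketch of comparing training loss against validation loss, assuming scikit-learn and a synthetic dataset (purely illustrative):

```python
# Sketch: diagnosing fit from the gap between training and validation loss.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

model = LinearRegression().fit(X_train, y_train)
train_loss = mean_squared_error(y_train, model.predict(X_train))
val_loss = mean_squared_error(y_val, model.predict(X_val))

# Both losses high              -> underfitting
# Low training, high validation -> overfitting
print(f"train MSE: {train_loss:.2f}, validation MSE: {val_loss:.2f}")
```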

Uncleaned data, too few training steps, or excessive model complexity (for example, too many or too large weights) can cause overfitting. It is always recommended to preprocess your data and build a good data pipeline, to select only necessary and meaningful features with good variance, and to reduce the complexity of the model with a good regularization algorithm (the L1 or L2 norm).
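
As a minimal sketch of L1 versus L2 regularization, assuming scikit-learn's Ridge and Lasso with an illustrative alpha:

```python
# Sketch: reducing model complexity with L2 (Ridge) and L1 (Lasso) penalties.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 norm: shrinks all weights toward zero
lasso = Lasso(alpha=1.0).fit(X, y)  # L1 norm: drives useless weights to zero

print("non-zero Ridge weights:", (ridge.coef_ != 0).sum())
print("non-zero Lasso weights:", (lasso.coef_ != 0).sum())
```

A note on the design choice: the L1 norm doubles as feature selection because it can zero out weights entirely, while the L2 norm only shrinks them.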

Comparison

What are the best practices to get a generalized model?

It is important to have a training dataset with good variance (i.e., a shuffled dataset). The best way to achieve this is to compute a hash of an appropriate feature and split the data into training, evaluation, and test sets based on the computed hash value; the evaluation set is then used to cross-validate the trained model. It is always good to ensure that the distribution across all the datasets is stationary (the same).
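
A minimal sketch of such a hash-based split, assuming pandas and a hypothetical "user_id" key column (the 80/10/10 ratio is also an assumption):

```python
# Sketch: a stable hash-based train/evaluation/test split keyed on one feature.
import hashlib
import pandas as pd

def hash_bucket(value, buckets: int = 10) -> int:
    """Map a key to a stable bucket in [0, buckets) via MD5."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

df = pd.DataFrame({"user_id": range(1000), "feature": range(1000)})
bucket = df["user_id"].map(hash_bucket)

train = df[bucket < 8]        # ~80%
evaluation = df[bucket == 8]  # ~10%, used to cross-validate the trained model
test = df[bucket == 9]        # ~10%, untouched until the final check

print(len(train), len(evaluation), len(test))
```

Because the bucket is derived from a hash rather than from random shuffling, the same row always lands in the same split, even when new data is appended later.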

Handling outliers is also important, and it always depends on the task at hand. If you are training a model to detect anomalies, you must keep the outliers; those anomalies may be exactly the labels you need to identify, so you cannot classify or detect them without the outliers. On the other hand, if you are building a regression model, it is usually good to remove outliers.
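
Here is a minimal sketch of removing outliers with the interquartile-range (IQR) rule, one common choice among several (z-scores are another); the data is synthetic:

```python
# Sketch: dropping outliers with the 1.5 * IQR rule. Only do this when
# outliers are noise, not when they are the anomalies you want to detect.
import numpy as np

rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(50, 5, 500), [120.0, -40.0]])  # 2 injected

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
mask = (values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)
cleaned = values[mask]

print(f"removed {len(values) - len(cleaned)} outliers")
```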

Use resampling during training. Resampling lets you reconstruct your sample dataset in different ways for each iteration. One of the most popular resampling techniques is k-fold cross-validation, which trains and tests the model k times, each time holding out a different subset of the training data.
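
A minimal sketch with scikit-learn's cross_val_score; the iris dataset and logistic regression are illustrative stand-ins:

```python
# Sketch: 5-fold cross-validation. Each fold takes one turn as the held-out
# set while the model trains on the remaining four folds.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("fold accuracies:", scores)
print("mean accuracy:", round(scores.mean(), 3))
```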

It is always good to know when to stop training. Determining the right moment is largely a matter of human judgment: once both the training loss and the validation loss have reached a good (low) value and stop improving, stop training, because pushing further tends to lead to overfitting.
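
In practice you can automate that insight with early stopping. A minimal sketch assuming TensorFlow/Keras and a toy synthetic dataset (the architecture and patience value are illustrative):

```python
# Sketch: early stopping in Keras. Training halts once validation loss stops
# improving for `patience` consecutive epochs, and the best weights are kept.
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[stop], verbose=0)
```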

Learn to do some feature engineering when needed. In some cases your model cannot converge because there is no meaningful relationship in the raw features you have. Creating feature crosses and introducing new features with meaningful relationships helps the model converge.
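
As a minimal sketch of a hand-made feature cross in pandas (the column names are hypothetical):

```python
# Sketch: feature crosses. A cross exposes an interaction between features
# that a linear model could not learn from the raw columns alone.
import pandas as pd

df = pd.DataFrame({
    "latitude_bin":  ["north", "north", "south", "south"],
    "longitude_bin": ["east", "west", "east", "west"],
    "rooms": [2, 3, 2, 4],
    "area":  [70, 95, 60, 120],
})

# Categorical cross: one combined feature instead of two independent ones.
df["lat_x_lon"] = df["latitude_bin"] + "_" + df["longitude_bin"]

# Numeric cross: the product captures "large AND many rooms" jointly.
df["rooms_x_area"] = df["rooms"] * df["area"]

print(df)
```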

In addition to these practices, parameter tuning, hyperparameter tuning, and regularization algorithms also help generalize the model for better performance.
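
A minimal sketch of hyperparameter tuning that combines grid search with k-fold cross-validation, assuming scikit-learn (the parameter grid is illustrative):

```python
# Sketch: tuning the regularization rate with GridSearchCV + 5-fold CV.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # C = inverse regularization rate
    cv=5,
)
grid.fit(X, y)

print("best C:", grid.best_params_["C"])
print("best CV accuracy:", round(grid.best_score_, 3))
```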

I hope you now have a basic idea of generalization, underfitting, and overfitting. Use this as a base and keep exploring the subtopics for a deeper understanding.

Don’t forget to applaud if you find this article useful. Your questions and feedback are always welcome.

