What is overfitting and underfitting in machine learning?

  • What is overfitting?
  • What is underfitting?
  • Why do we get an underfit or overfit on given data?
  • How to get the best fit?

A supervised machine learning model learns the relationship between the features and the target from a training data set. During training, the model is given both the features and the target and learns how to map the former to the latter. A trained model is then evaluated on a testing set, where we give it only the features and it makes predictions. We compare these predictions with the known targets for the testing set to calculate accuracy.
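
As a rough illustration of this train-then-test workflow, here is a minimal sketch in plain Python. The toy data set, the one-feature threshold "model", and the helper names (train_test_split, fit_threshold, accuracy) are all assumptions made for the example, not part of any particular library.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into a training list and a testing list."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train_rows):
    """'Training': put a threshold halfway between the two class means."""
    class0 = [x for x, y in train_rows if y == 0]
    class1 = [x for x, y in train_rows if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def predict(threshold, feature):
    """Map a feature to a predicted target (assumes class 1 has larger values)."""
    return 1 if feature > threshold else 0

def accuracy(threshold, rows):
    """Fraction of rows whose prediction matches the known target."""
    correct = sum(predict(threshold, x) == y for x, y in rows)
    return correct / len(rows)

# Toy data (an assumption): one numeric feature and a binary target.
rng = random.Random(42)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(100)]

train_rows, test_rows = train_test_split(data)
threshold = fit_threshold(train_rows)           # sees features AND targets
train_accuracy = accuracy(threshold, train_rows)
test_accuracy = accuracy(threshold, test_rows)  # predictions vs. held-out targets
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

The point of the split is that the test rows play no part in choosing the threshold, so the test accuracy is an estimate of how the model behaves on data it has not seen.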

A machine learning model is said to underfit when it cannot capture the underlying trend of the data. Underfitting hurts accuracy on the training data as well as on new data, because the model simply does not fit the data well enough. It usually happens when the model is too simple for the problem, for example when we fit a linear model to non-linear data, or when there is too little informative data to learn the pattern from. In such cases the rules the model can express are too simple and rigid, so it will probably make a lot of wrong predictions. Underfitting can be reduced by using a more flexible model, adding more informative features, or decreasing regularization. In a nutshell: underfitting means high bias and low variance.

A machine learning model is said to overfit when it fits the training data too closely. When a model is too flexible for the amount of data, or is trained on it for too long, it starts learning from the noise and inaccurate entries in the data set. The model then fails to generalize: it does well on the training data but does not categorize new data correctly, because it has memorized too many details and too much noise. Non-parametric and non-linear methods are more prone to overfitting, because these algorithms have more freedom in building the model from the data set and can therefore build unrealistic models. To avoid overfitting, use a linear algorithm if the data is linear, or constrain the model with parameters such as the maximum depth when using decision trees. In a nutshell: overfitting means high variance and low bias.

To learn more about bias and variance, check out What is bias-variance trade-off?

Let us look at the following graphs:

The first graph, a linear regression, clearly underfits the observations (the training set). The second graph, a polynomial regression of degree 13, clearly overfits them. The third graph, a polynomial regression of degree 5, is the best fit.
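
A comparison like the one in the graphs can be reproduced roughly as follows. This is only a sketch: the sine-shaped trend, the noise level, and the sample sizes are assumptions chosen for illustration, and the fits use NumPy's Polynomial.fit rather than whatever produced the original graphs.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_trend(x):
    # Assumed underlying trend for this toy example.
    return np.sin(2 * np.pi * x)

# Small noisy training set and a larger held-out test set.
x_train = np.linspace(0.0, 1.0, 20)
y_train = true_trend(x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.0, 1.0, 200)
y_test = true_trend(x_test) + rng.normal(0.0, 0.2, x_test.size)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 5, 13):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    train_err, test_err = results[degree]
    print(f"degree {degree:2d}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
```

The training error can only go down as the degree rises, so it cannot tell a good fit from an overfit. The test error is the number to watch: a degree that keeps lowering the training error while the test error stops improving or gets worse is a sign of overfitting.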

Our aim is to find the fit that minimizes the error on unseen data, not just on the training set.

To get a better understanding of overfitting and underfitting in layman’s terms, visit this link.

The following can be kept in mind while building any supervised learning model:
Poor performance on the training data could be because the model is too simple to describe the target well. Performance can be improved by increasing model flexibility. To increase model flexibility, try the following:

  • Add new features and/or change the types of feature processing used.
  • Decrease the amount of regularization used.
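
To make the first suggestion concrete, here is a small sketch of adding one new feature to an underfitting linear model. The quadratic toy data and the helper name train_mse are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 50)
y = 0.5 * x**2 + rng.normal(0.0, 0.3, x.size)  # assumed quadratic trend plus noise

def train_mse(features, target):
    """Fit ordinary least squares and return the training mean squared error."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return float(np.mean((features @ coef - target) ** 2))

ones = np.ones_like(x)
linear_features = np.column_stack([ones, x])        # intercept + x: too simple here
richer_features = np.column_stack([ones, x, x**2])  # adds x**2 as a new feature

mse_linear = train_mse(linear_features, y)
mse_richer = train_mse(richer_features, y)
print(f"training MSE, linear features: {mse_linear:.3f}")
print(f"training MSE, with x**2 added: {mse_richer:.3f}")
```

The model is still linear in its coefficients; only the feature set changed. That is often the cheapest way to add flexibility when the training error itself is poor.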

If your model is overfitting the training data, it makes sense to take actions that reduce model flexibility. To reduce model flexibility, try the following:

  • Feature selection: consider using fewer feature combinations, and decrease the number of numeric attribute bins.
  • Increase the amount of regularization used.
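
The effect of increasing regularization can be sketched with ridge (L2) regression. The example below is an illustration under assumptions: the toy data, the degree-9 polynomial features, and the ridge_fit helper are made up for the demonstration, and for simplicity the intercept is penalized along with the other weights.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, x.size)  # assumed trend plus noise

# Degree-9 polynomial features: flexible enough to chase noise in 15 points.
features = np.vander(x, 10, increasing=True)

def ridge_fit(features, target, strength):
    """Closed-form ridge regression: solve (X^T X + strength * I) w = X^T y."""
    n_features = features.shape[1]
    gram = features.T @ features + strength * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ target)

norms = {}
for strength in (1e-6, 1e-2, 1.0):
    weights = ridge_fit(features, y, strength)
    norms[strength] = float(np.linalg.norm(weights))
    print(f"strength {strength:g}: weight norm {norms[strength]:.3f}")
```

A larger strength pulls the weights toward zero, which smooths the fitted curve and makes the model less able to follow noise. Too large a value swings the model back toward underfitting, so the strength is usually chosen by checking error on held-out data.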

Accuracy on training and test data could be poor because the learning algorithm did not have enough data to learn from. You could improve performance by doing the following:

  • Increase the amount of training data examples.
  • Increase the number of passes or iterations on the existing training data.
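
The second suggestion can be illustrated with a tiny gradient descent loop. This is a sketch under assumptions: the five data points, the learning rate, and the function names are invented for the example.

```python
def mean_squared_error(w, b, xs, ys):
    """Average squared difference between predictions w*x + b and targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, passes, learning_rate=0.05):
    """Fit y ~ w*x + b with full-batch gradient descent for a number of passes."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(passes):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy values (assumed), roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]

losses = {}
for passes in (5, 50, 500):
    w, b = gradient_descent(xs, ys, passes)
    losses[passes] = mean_squared_error(w, b, xs, ys)
    print(f"{passes:3d} passes: w={w:.3f}, b={b:.3f}, training MSE={losses[passes]:.4f}")
```

With too few passes the model has not yet converged and looks like an underfit even though it is capable of fitting the data. More passes help only up to the point of convergence; after that, more or better training data is what improves accuracy.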