What is regularisation?

  • Why should we apply regularisation instead of a simple linear regression model?
  • What are the commonly used regularisation methods?
  • How do we apply regularisation with SGDRegressor?
  • What is the difference between Ridge and Lasso Regularisation?

Often, a linear regression model with a large number of features suffers from some of the following:

  • Overfitting: the model fits the training data closely but fails to generalize to unseen data.
  • Multicollinearity: several features are strongly correlated with one another, which makes the coefficient estimates unstable.
  • Computational cost: fitting and using the model becomes computationally intensive.

These problems make it difficult to build a model that is both accurate on unseen data and stable.
To address them, one applies one of the regularization techniques.

Regularization techniques calibrate the coefficients (weights) of a multiple linear regression model by minimizing an adjusted loss function: the usual least-squares loss plus a penalty term. The idea is that the penalty, computed as a function of the coefficients, discourages large coefficient values; the exact form of the penalty depends on the regularization technique.
Adjusted loss function = Residual Sum of Squares + F(w1, w2, …, wn)
Residual Sum of Squares = || Xw - y ||²
In the above equation, “F” is a function of the weights (the regression coefficients).
Thus, if the linear regression model is the following:
Y = w1x1 + w2x2 + w3x3 + bias
then it could be regularized using the following function:
Adjusted Loss Function = Residual Sum of Squares (RSS) + F(w1, w2, w3)
Here the coefficients are estimated by minimizing the adjusted loss function instead of the RSS alone.
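
To make the adjusted loss concrete, here is a minimal NumPy sketch. The helper names (rss, adjusted_loss, l2_penalty, l1_penalty), the toy data and the value alpha = 0.1 are made up for this illustration; F is shown in two common forms, a squared penalty and an absolute-value penalty.

```python
import numpy as np

def rss(X, y, w, bias=0.0):
    """Residual sum of squares of the linear model X @ w + bias."""
    residuals = X @ w + bias - y
    return float(residuals @ residuals)

def adjusted_loss(X, y, w, penalty, bias=0.0):
    """RSS plus a penalty F(w) on the weights (the bias is not penalised)."""
    return rss(X, y, w, bias) + penalty(w)

alpha = 0.1  # hypothetical regularisation strength

def l2_penalty(w):
    """Squared penalty: alpha * (w1^2 + w2^2 + ...)."""
    return alpha * float(w @ w)

def l1_penalty(w):
    """Absolute-value penalty: alpha * (|w1| + |w2| + ...)."""
    return alpha * float(np.abs(w).sum())

# Toy data with three features, matching Y = w1x1 + w2x2 + w3x3 + bias.
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 1.0]])
y = np.zeros(3)
w = np.array([1.0, -2.0, 0.5])

print("RSS:", rss(X, y, w))
print("RSS + squared penalty:", adjusted_loss(X, y, w, l2_penalty))
print("RSS + absolute penalty:", adjusted_loss(X, y, w, l1_penalty))
```

For the same weights, the adjusted loss is always at least as large as the plain RSS, so a minimiser is pushed towards smaller weights.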

Once a regression model is built, you could apply one of the regularization techniques if either of the following symptoms appears:

  • Lack of generalization: a model with high accuracy on the training data fails to generalize to unseen or new data.
  • Model instability: refitting produces noticeably different regression models with different accuracies, and it becomes difficult to select one of them.

Types of Regularization

1. Ridge Regression
Ridge regression addresses some of these problems by imposing a penalty on the size of the coefficients.
Ridge Regression is a remedial measure taken to alleviate collinearity amongst the predictor variables in a regression model. When the feature variables are strongly correlated, the least-squares coefficient estimates become very sensitive to small changes in the data, i.e. the model has high variance.
To alleviate this issue, Ridge Regression adds a penalty proportional to the sum of the squared coefficients, accepting a small bias in exchange for lower variance:
min( || Xw - y ||² + alpha * || w ||² )
The complexity parameter alpha controls the amount of shrinkage: the larger the value of alpha, the greater the shrinkage, and the more robust the coefficients become to collinearity. The penalty term is the squared L2 norm of the coefficient vector, so this is also known as L2 regularization.
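
The shrinkage effect of alpha can be seen in a small NumPy sketch. It assumes a model without an intercept, in which case the ridge objective above has the closed-form solution w = (XᵀX + alpha·I)⁻¹ Xᵀy. The function name ridge_closed_form and the synthetic, nearly collinear data are made up for this illustration.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimise ||Xw - y||^2 + alpha * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data: x2 is almost a copy of x1, so the two features are collinear.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=50)

alphas = [0.0, 0.1, 1.0, 10.0]
norms = []
for alpha in alphas:
    w = ridge_closed_form(X, y, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha={alpha:5.1f}  w={w}  ||w||={norms[-1]:.3f}")
```

With alpha = 0 this is plain least squares, and the two collinear weights can take large values of opposite sign; as alpha grows, the norm of w shrinks.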

2. Lasso Regression
Lasso Regression is quite similar to Ridge Regression in that both techniques have the same premise. We again add a penalty term to the regression objective in order to reduce the effect of collinearity and thus the model variance. However, instead of the squared penalty used by ridge regression, lasso uses an absolute-value penalty:
min( || Xw - y ||² + alpha * || w ||_1 )
Here || w ||_1 = |w1| + |w2| + … + |wn| is the L1 norm of the coefficient vector, so this is also known as L1 regularization.
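
The practical difference between the squared and the absolute-value penalty is easiest to see in a simplified one-weight case. The sketch below assumes the data term reduces to (w - z)², where z is the unpenalised least-squares weight; this is a simplification for illustration, not the general multi-feature solution. The function names ridge_1d and lasso_1d are made up.

```python
def ridge_1d(z, alpha):
    """Minimiser of (w - z)**2 + alpha * w**2: proportional shrinkage."""
    return z / (1.0 + alpha)

def lasso_1d(z, alpha):
    """Minimiser of (w - z)**2 + alpha * abs(w): soft-thresholding."""
    magnitude = max(abs(z) - alpha / 2.0, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if z > 0 else -magnitude

alpha = 1.0
for z in [2.0, 0.4, -0.3]:
    print(f"z={z:5.1f}  ridge={ridge_1d(z, alpha):6.3f}  lasso={lasso_1d(z, alpha):6.3f}")
```

The squared penalty scales every weight towards zero without reaching it, while the absolute-value penalty sets weights with |z| <= alpha/2 to exactly zero.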

3. ElasticNet Regression
ElasticNet is a hybrid of the Lasso and Ridge Regression techniques. It uses both L1 and L2 regularization, taking on the effects of both techniques:
min( || Xw - y ||² + z_1 * || w ||_1 + z_2 * || w ||² )
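
In scikit-learn, ElasticNet is parametrised by alpha and l1_ratio rather than by z_1 and z_2: its penalty is alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||², and its data term is scaled as ||y - Xw||² / (2 * n_samples). The sketch below shows one way to map between the two forms; the helper to_sklearn_params, the chosen weights and the synthetic data are made up for this illustration, and the weights are relative to scikit-learn's scaled data term, not the raw RSS.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def to_sklearn_params(z1, z2):
    """Map penalty weights z1 (L1) and z2 (squared L2) to (alpha, l1_ratio).

    Solves z1 = alpha * l1_ratio and z2 = 0.5 * alpha * (1 - l1_ratio).
    Requires z1 + 2 * z2 > 0.
    """
    alpha = z1 + 2.0 * z2
    l1_ratio = z1 / alpha
    return alpha, l1_ratio

alpha, l1_ratio = to_sklearn_params(z1=0.05, z2=0.025)
print("alpha:", alpha, "l1_ratio:", l1_ratio)

# Synthetic data: 10 features, only the first two influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
model.fit(X, y)
print(model.coef_)
```

Setting l1_ratio = 1 gives a pure L1 penalty and l1_ratio = 0 a pure L2 penalty.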

Differences between the Ridge and Lasso Regressions

  • Built-in feature selection: This is frequently mentioned as a useful property of the L1 norm that the L2 norm does not have. It is a consequence of the L1 penalty tending to produce sparse coefficients. For example, suppose the model has 100 coefficients but only 10 of them are non-zero; this is effectively saying that “the other 90 predictors are useless in predicting the target values”. The L2 penalty produces non-sparse coefficients, so it does not have this property. Thus one can say that Lasso regression does a form of “parameter selection”, since the feature variables that aren’t selected have a weight of exactly 0.

  • Sparsity: This means that only very few entries in a matrix (or vector) are non-zero. The L1 penalty tends to produce many coefficients that are exactly zero (or very small), with only a few large coefficients. This is connected to the previous point, where Lasso performs a type of feature selection.

  • Computational efficiency: The L1-penalized problem does not have an analytical solution in general, but the L2-penalized problem does. This allows L2 solutions to be computed efficiently. However, L1 solutions are sparse, which allows them to be used with sparse algorithms and can make the computation more efficient.
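
The sparsity difference can be checked with a small scikit-learn sketch. The data here is synthetic (20 features, of which only the first three influence the target), and the alpha values are arbitrary choices for illustration. Note that the two estimators scale their objectives differently, so the same alpha does not mean the same penalty strength in Ridge and Lasso.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only the first three influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count the coefficients that are not exactly zero.
n_nonzero_ridge = int(np.count_nonzero(ridge.coef_))
n_nonzero_lasso = int(np.count_nonzero(lasso.coef_))
print("non-zero ridge coefficients:", n_nonzero_ridge)
print("non-zero lasso coefficients:", n_nonzero_lasso)
```

Ridge keeps a (small) non-zero weight on every feature, while Lasso drops most of the uninformative ones entirely.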

Let us look at an example to get a better sense of what we are saying here. (Run the notebook associated with the following code cells to get a better sense of the data set used.)
The data set consists of 79 input features, and the target ranges from 34.9k to 755k.
We first fit sklearn.linear_model.LinearRegression on this data set:

As can be seen, the RMSE on unseen data is very high for this model, while the RMSE on the training set is low. This indicates overfitting to the training set.

When the data set was inspected for multicollinearity using seaborn.heatmap(), there were multiple instances where input variables were highly collinear.
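
As a text-only alternative to a heatmap, the correlation matrix can also be scanned for strongly correlated feature pairs. This is a minimal NumPy sketch on synthetic data, not the notebook's code; the function name highly_correlated_pairs and the 0.9 threshold are arbitrary choices for illustration.

```python
import numpy as np

def highly_correlated_pairs(X, threshold=0.9):
    """Return (i, j, corr) for feature pairs with |correlation| above threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns of X are features
    n_features = corr.shape[0]
    pairs = []
    for i in range(n_features):
        for j in range(i + 1, n_features):
            if abs(corr[i, j]) > threshold:
                pairs.append((i, j, float(corr[i, j])))
    return pairs

# Synthetic features: b is almost a copy of a, c is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.05 * rng.normal(size=200)
c = rng.normal(size=200)
X = np.column_stack([a, b, c])

pairs = highly_correlated_pairs(X)
print(pairs)
```

The same correlation matrix (np.corrcoef, or DataFrame.corr() in pandas) is what one would pass to seaborn.heatmap() for a visual check.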

Ridge Regression was then applied to the data set, which gives a much lower RMSE than LinearRegression:
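
The same effect can be reproduced on synthetic data. The sketch below is not the notebook's code and does not use the data set described above: it generates 30 collinear features from 5 hidden factors, trains on only 40 samples, and compares train and test RMSE for LinearRegression and Ridge. All names, sizes and the value alpha = 1.0 are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_latent, n_features = 5, 30
mixing = rng.normal(size=(n_latent, n_features))
beta = rng.normal(size=n_latent)

def make_data(n_samples):
    """Features are noisy mixtures of a few latent factors, hence collinear."""
    latent = rng.normal(size=(n_samples, n_latent))
    X = latent @ mixing + 0.1 * rng.normal(size=(n_samples, n_features))
    y = latent @ beta + rng.normal(scale=1.0, size=n_samples)
    return X, y

X_train, y_train = make_data(40)   # few samples relative to 30 features
X_test, y_test = make_data(200)

def rmse(model, X, y):
    """Root mean squared error of the model's predictions on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

ols_train_rmse, ols_test_rmse = rmse(ols, X_train, y_train), rmse(ols, X_test, y_test)
ridge_train_rmse, ridge_test_rmse = rmse(ridge, X_train, y_train), rmse(ridge, X_test, y_test)

print(f"LinearRegression  train RMSE={ols_train_rmse:.3f}  test RMSE={ols_test_rmse:.3f}")
print(f"Ridge             train RMSE={ridge_train_rmse:.3f}  test RMSE={ridge_test_rmse:.3f}")
```

In practice alpha would be tuned, for example with cross-validation (sklearn.linear_model.RidgeCV), rather than fixed by hand.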

For Lasso and ElasticNet Regression, please check out the following notebook. It is the same one from which the above code cells were extracted. You can also find SGDRegressor with regularization methods at the following link:

One can always look at the scikit-learn documentation as well:
For Lasso : https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso
For Ridge : https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
For ElasticNet : https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet
For SGDRegressor : https://scikit-learn.org/stable/modules/sgd.html#regression
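
Regarding the question about SGDRegressor: regularisation is selected through its penalty parameter ("l2", "l1" or "elasticnet"), with alpha as the strength and l1_ratio as the L1/L2 mix for "elasticnet". Here is a minimal sketch on synthetic data; the data and the chosen parameter values are assumptions for illustration only. The features are standardised first because SGD is sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 features, only the first two influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

scores = {}
for penalty in ["l2", "l1", "elasticnet"]:
    # penalty picks the regulariser, alpha its strength;
    # l1_ratio is only used when penalty="elasticnet".
    model = make_pipeline(
        StandardScaler(),
        SGDRegressor(penalty=penalty, alpha=0.001, l1_ratio=0.5,
                     max_iter=1000, tol=1e-3, random_state=0),
    )
    model.fit(X, y)
    scores[penalty] = model.score(X, y)  # R^2 on the training data
    print(penalty, round(scores[penalty], 4))
```

With penalty="l2" this corresponds to a ridge-style model, with "l1" to a lasso-style model, and with "elasticnet" to a mix of the two, all fitted by stochastic gradient descent instead of a direct solver.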