- What loss function is being used in `LinearRegression()`?
- What loss function is being used in `SGDRegressor()`?
- What method or algorithm is used in `LinearRegression()` and `SGDRegressor()`?
- Why does `SGDRegressor` give different values every time you run `fit` on it?
- When should one use `LinearRegression()` and `SGDRegressor()`?

Before we get into the theory of it all, let us look at the following sets of code (to learn more about the dataset used, you can visit the notebook through the following code cells).

The following RMSE loss is obtained via `sklearn.linear_model.LinearRegression()`:

The following RMSE loss is obtained via `sklearn.linear_model.SGDRegressor()`:

At first glance, the RMSE loss obtained via the two different regression models is approximately the same on both the training set and the test set. But what is actually happening behind these few lines of code? How are the coefficients and the intercept being computed in the two cases?

We will first understand what is happening and then learn when one should use which regression model to obtain optimal results.
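Since the notebook's code cells are not embedded here, a minimal reproduction of the comparison might look like the following sketch. The synthetic dataset is an assumption standing in for the notebook's data; the coefficients and noise level are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's dataset (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -1.0, 2.0, 0.5]) + 1.0 + rng.normal(scale=0.5, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for model in (LinearRegression(), SGDRegressor(random_state=0)):
    model.fit(X_train, y_train)
    # RMSE = square root of the mean squared error on the test set.
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    results[type(model).__name__] = rmse
    print(type(model).__name__, round(rmse, 3))
```

On data like this, the two RMSE values come out very close to each other, which is the observation the rest of this post explains.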

### `sklearn.linear_model.LinearRegression()`

According to the `scikit-learn` documentation, `LinearRegression()` is ordinary least squares linear regression: it fits a linear model with coefficients `w` to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the **linear approximation** (note that computing this fit involves **matrix operations**).

Also, `scikit-learn`'s standard linear regression object is essentially a thin wrapper around a least-squares solver from `scipy`, packaged as a predictor object. Conceptually, it uses the **Normal Equation** to compute the minimizer analytically, and will give you a warning if the relevant matrix is non-invertible. In other words, the Normal Equation is an **analytical approach** to linear regression with a least-squares cost function.
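To make the "analytical" part concrete, here is a sketch of the Normal Equation solved directly with NumPy and checked against `LinearRegression()`. The synthetic data and its coefficients are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative synthetic data (an assumption, not a real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Normal Equation: augment X with a column of ones for the intercept,
# then solve (X^T X) w = X^T y analytically.
Xb = np.hstack([np.ones((100, 1)), X])
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

lr = LinearRegression().fit(X, y)
# The closed-form solution matches scikit-learn's fitted parameters.
print(np.allclose(w[0], lr.intercept_), np.allclose(w[1:], lr.coef_))
```

No iteration, no learning rate: the parameters drop out of one matrix solve, which is exactly why this route becomes expensive as the number of features grows.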

Also, if we wish to apply regularization techniques with `LinearRegression()`, we cannot do that directly by tuning the parameters of this class. We need to use `Ridge()` (a linear model that addresses some of the problems of ordinary least squares by imposing an l2 penalty on the size of the coefficients), `Lasso()` (a linear model that estimates sparse coefficients with l1 regularization), or `ElasticNet()` (a linear regression model trained with both l1- and l2-norm regularization of the coefficients) from the `scikit-learn` library.
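A short sketch of the three regularized alternatives side by side; the `alpha` values and synthetic data here are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Illustrative data: three of the five true coefficients are exactly zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)                    # l2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)                    # l1 penalty: sparse coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of l1 and l2

# The l1 penalty tends to drive irrelevant coefficients to (near) zero.
print(np.round(lasso.coef_, 2))
```

The practical takeaway: the regularization lives in the class choice (and its `alpha`), not in `LinearRegression()` itself.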

### `sklearn.linear_model.SGDRegressor()`

The class `SGDRegressor()` implements a plain **stochastic gradient descent learning routine** which supports different loss functions and penalties to fit linear regression models.

In stochastic gradient descent, we repeatedly run through the training set one data point at a time and update the parameters according to the gradient of the error with respect to each individual data point. In other words, gradient descent takes an **iterative approach**: it starts from initial values of the coefficients and intercept and slowly improves them using derivatives.
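The one-point-at-a-time update can be sketched from scratch in a few lines. This is a minimal illustration, not scikit-learn's actual implementation; the learning rate, epoch count, and noiseless synthetic data are all illustrative assumptions.

```python
import numpy as np

# Illustrative noiseless data with known coefficients [2, -1] and intercept 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

w = np.zeros(2)   # coefficients, initialized to zero for simplicity
b = 0.0           # intercept
lr = 0.01         # fixed learning rate (illustrative)

for epoch in range(50):
    for i in rng.permutation(len(X)):     # visit points in a fresh random order
        err = (X[i] @ w + b) - y[i]       # residual for this single point
        w -= lr * err * X[i]              # gradient of 0.5 * err**2 w.r.t. w
        b -= lr * err                     # gradient of 0.5 * err**2 w.r.t. b

print(np.round(w, 2), round(b, 2))
```

Note the shuffling inside the loop: the order in which points are visited is random, which is the root of the run-to-run variation discussed below.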

The `loss` parameter of `SGDRegressor()` provides the opportunity to change the loss function being used. The default is `squared_loss` (renamed to `squared_error` in recent scikit-learn versions), which refers to the ordinary least squares fit.

`SGDRegressor()` also provides a `penalty` parameter which acts as a regularization term; its default value is `l2`, which corresponds to ridge regression. It can also be set to `l1` or `elasticnet`.

As we run the second code cell provided above multiple times, we will obtain slightly different loss values each time. One should understand that since `SGDRegressor()` is an iterative approach (the training data is shuffled differently on each run), the parameters (that is, the coefficients and the intercept) obtained for the regression fit will differ slightly from one function call to the next. This can be prevented by fixing the random state of the model: to have reproducible output across multiple function calls, set the `random_state` parameter to an integer while declaring the `SGDRegressor()` model.
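A quick check of the reproducibility claim, on illustrative synthetic data:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Illustrative data (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

# Two independent fits with the same random_state shuffle the data
# identically, so the learned parameters match exactly.
a = SGDRegressor(random_state=42).fit(X, y)
b = SGDRegressor(random_state=42).fit(X, y)
print(np.array_equal(a.coef_, b.coef_))
```

Drop the `random_state` argument and the two fits will generally differ slightly, which is exactly the behavior asked about at the top of this post.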

### **When to use which class for Linear Regression model fitting?**

It is to be understood at this point that ordinary least squares, being an analytical approach, is not memory efficient as the size and/or the number of features of a dataset grows. So the `LinearRegression()` approach is an effective and time-saving option when one is working with a dataset with a small number of features.

When it comes to memory efficiency, `SGDRegressor()` comes to the rescue. We can train `SGDRegressor` on a training dataset that does not fit into RAM, and we can update an `SGDRegressor` model with a new batch of data (via its `partial_fit` method) without retraining on the whole dataset. So the `SGDRegressor()` approach is an effective one when one is working with a large dataset, that is, a large number of data points and/or features.
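The incremental-update workflow can be sketched as follows. In a real out-of-core setting each mini-batch would be read from disk; here the batches are generated in place, and the batch size and count are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(random_state=0)

# Stream the data in mini-batches: only one batch is in memory at a time.
for _ in range(200):
    X_batch = rng.normal(size=(50, 3))
    y_batch = X_batch @ np.array([1.0, -1.0, 2.0]) + 0.5
    model.partial_fit(X_batch, y_batch)   # incremental update, no full retrain

# Evaluate on a fresh batch the model has never seen.
X_new = rng.normal(size=(100, 3))
y_new = X_new @ np.array([1.0, -1.0, 2.0]) + 0.5
print(round(model.score(X_new, y_new), 3))
```

`LinearRegression()` has no equivalent of `partial_fit`: its analytical solve needs the whole design matrix at once.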

One can always look at the `scikit-learn` documentation as well:

For `LinearRegression`: sklearn.linear_model.LinearRegression — scikit-learn 0.24.2 documentation

For `SGDRegressor`: 1.5. Stochastic Gradient Descent — scikit-learn 0.24.2 documentation

To learn more about OLS and SGD, one can watch the following videos:

Ordinary Least Squares: 3.2: Linear Regression with Ordinary Least Squares Part 1 - Intelligence and Learning - YouTube (better for smaller datasets)

Stochastic Gradient Descent: Gradient Descent, Step-by-Step - YouTube (better for larger datasets)