What is the difference between LinearRegression and SGDRegressor in scikit-learn library in Python?

  • What loss function is being used in LinearRegression() ?
  • What loss function is being used in SGDRegressor() ?
  • What method or algorithm is used in LinearRegression() and SGDRegressor() ?
  • Why does SGDRegressor give different values every time you run fit on it?
  • When should one use LinearRegression() and SGDRegressor() ?

Before we get into the theory of it all, let us look at the following pieces of code (to learn more about the dataset used, you can visit the notebook via the code cells below).

The following RMSE loss is obtained via sklearn.linear_model.LinearRegression() :

The following RMSE loss is obtained via sklearn.linear_model.SGDRegressor() :
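The original notebook cells (and their exact RMSE values) are not reproduced here, but a minimal sketch of this kind of comparison, using a synthetic dataset as a stand-in, might look like the following. Note that the features are standardized first, since SGD is sensitive to feature scales:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's dataset
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for SGD, which is sensitive to feature magnitudes
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

lr = LinearRegression().fit(X_train, y_train)
sgd = SGDRegressor(max_iter=1000, random_state=0).fit(X_train, y_train)

rmse_lr = mean_squared_error(y_test, lr.predict(X_test)) ** 0.5
rmse_sgd = mean_squared_error(y_test, sgd.predict(X_test)) ** 0.5
print(rmse_lr, rmse_sgd)  # the two RMSE values come out very close
```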

At first glance, the RMSE loss obtained via the two different regression models is approximately the same on both the training set and the test set. But what is actually happening behind these few lines of code? How are the coefficients and the intercept computed in each case?

We will first understand what is happening and then learn when one should use which regression model to obtain optimal results.


According to the scikit-learn documentation, LinearRegression() is ordinary least squares linear regression. LinearRegression() fits a linear model with coefficients w to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. (Note that this fit involves matrix operations.)

Also, scikit-learn's standard linear regression object is essentially a thin wrapper around SciPy's least-squares solver, exposed as a predictor object. Conceptually, it uses the Normal Equation to compute the minimizer analytically, and it will give you a warning if the relevant matrix is singular (non-invertible). In other words, the Normal Equation is an analytical approach to linear regression with a least-squares cost function.
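The Normal Equation, w = (XᵀX)⁻¹Xᵀy, can be checked against LinearRegression directly. The sketch below uses synthetic data and solves the system with np.linalg.solve for illustration; scikit-learn itself delegates to SciPy's least-squares routine rather than forming this inverse explicitly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept is estimated as an extra coefficient
Xb = np.hstack([X, np.ones((100, 1))])

# Normal Equation: solve (XᵀX) w = Xᵀy
w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

lr = LinearRegression().fit(X, y)
print(w[:3], lr.coef_)      # coefficients agree
print(w[3], lr.intercept_)  # intercepts agree
```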

Also, if we wish to apply regularization techniques with LinearRegression() , we cannot do so directly by tuning the parameters of this class. We instead need to use the Ridge() class (which addresses some of the problems of ordinary least squares by imposing an l2 penalty on the size of the coefficients), the Lasso() class (a linear model that estimates sparse coefficients via l1 regularization), or the ElasticNet() class (a linear regression model trained with both l1 - and l2 -norm regularization of the coefficients) from the scikit-learn library.
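A sketch of how these regularized variants are used in practice (synthetic data; the alpha values here are illustrative, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# alpha controls the regularization strength in all three classes
ridge = Ridge(alpha=1.0).fit(X, y)                    # l2 penalty
lasso = Lasso(alpha=1.0).fit(X, y)                    # l1 penalty (sparse coefficients)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of l1 and l2
```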


The class SGDRegressor() implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models.

In stochastic gradient descent, we repeatedly run through the training set one data point at a time and update the parameters according to the gradient of the error with respect to each individual data point. In other words, gradient descent is an iterative approach: it starts with random values for the coefficients and the intercept and gradually improves them using derivatives.
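The per-point update rule described above can be sketched in plain NumPy. This is purely illustrative of the idea, not scikit-learn's actual implementation, and uses noiseless synthetic data with hypothetical true coefficients [3, -2] and intercept 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 1.0

w = rng.normal(size=2)  # random initial coefficients
b = 0.0                 # initial intercept
lr = 0.01               # learning rate

for epoch in range(50):
    for i in rng.permutation(len(X)):  # one data point at a time
        error = (X[i] @ w + b) - y[i]  # residual for this point
        w -= lr * error * X[i]         # gradient of squared error w.r.t. w
        b -= lr * error                # gradient w.r.t. the intercept

print(w, b)  # approaches [3, -2] and 1
```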

The loss parameter of SGDRegressor() provides the opportunity to change the loss function being used. The default is squared_loss , which refers to the ordinary least squares fit. (In recent scikit-learn versions this default is named squared_error .)

SGDRegressor() also provides a penalty parameter, which acts as a regularization term; its default value is l2 , which corresponds to Ridge regression. It can also be set to l1 or elasticnet .
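Configuring the penalty looks like this (synthetic data; alpha and l1_ratio values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# penalty selects the regularization term; alpha sets its strength
ridge_like = SGDRegressor(penalty="l2", alpha=0.0001, random_state=0).fit(X, y)
lasso_like = SGDRegressor(penalty="l1", alpha=0.0001, random_state=0).fit(X, y)
enet_like = SGDRegressor(penalty="elasticnet", l1_ratio=0.5, random_state=0).fit(X, y)
```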

If we run the second code cell provided above multiple times, we obtain slightly different loss values each time. One should understand that because SGDRegressor() is an iterative approach, the parameters (that is, the coefficients and the intercept) obtained for the regression fit will differ slightly from one function call to the next. This can be prevented by fixing the random state of the model: to have reproducible output across multiple function calls, set the random_state parameter to an integer value when constructing the SGDRegressor() model.
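A quick demonstration of this on synthetic data: with random_state fixed, two independent fits produce identical coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# With a fixed random_state, repeated fits shuffle the data the same way
# and therefore arrive at exactly the same parameters
a = SGDRegressor(random_state=42).fit(X, y).coef_
b = SGDRegressor(random_state=42).fit(X, y).coef_
print(np.allclose(a, b))  # True: fixing random_state makes fits reproducible
```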

When to use which class for Linear Regression model fitting?

It should be understood at this point that ordinary least squares, being an analytical approach, is not memory efficient as the size and/or the number of features of a dataset grows: solving the Normal Equation involves operating on matrices whose dimensions scale with the number of features. So the LinearRegression() approach is an effective and time-saving option when one is working with a dataset with a small number of features.

When it comes to memory efficiency, SGDRegressor() comes to the rescue: we can train SGDRegressor on a training set that does not fit into RAM, and we can update the model with a new batch of data without retraining on the whole dataset. So the SGDRegressor() approach is an effective one when working with a large dataset, that is, a large number of data points and/or features.
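This incremental style of training is exposed through SGDRegressor's partial_fit method, which updates the model one batch at a time. A sketch, streaming synthetic batches (with hypothetical true coefficients [1, 2, 3]) as if the full dataset did not fit in RAM:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(random_state=0)

# Stream the data in batches, as if the full set did not fit in memory
for _ in range(100):
    X_batch = rng.normal(size=(50, 3))
    y_batch = X_batch @ np.array([1.0, 2.0, 3.0]) + 0.5
    model.partial_fit(X_batch, y_batch)  # updates the model in place

print(model.coef_)  # approaches [1, 2, 3]
```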

One can always look at the scikit-learn documentation as well:
For LinearRegression : sklearn.linear_model.LinearRegression — scikit-learn 0.24.2 documentation
For SGDRegressor : 1.5. Stochastic Gradient Descent — scikit-learn 0.24.2 documentation

To learn more about OLS and SGD, one can watch the following videos:
Ordinary Least Squares: 3.2: Linear Regression with Ordinary Least Squares Part 1 - Intelligence and Learning - YouTube (better for smaller datasets)
Stochastic Gradient Descent: Gradient Descent, Step-by-Step - YouTube (better for larger datasets)