How to build a polynomial regression model in Python using scikit-learn?

  • What is Polynomial Regression?
  • How is polynomial regression different from linear regression?
  • How to find the degree of polynomial which best fits the given data?

Simple linear regression is used to predict continuous numerical values. There is one feature variable x that is used to predict the target variable y. Two constants, an intercept and a slope, act as the parameters of the equation.
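In conventional notation, the simple linear regression model can be written as below. The symbol names are an assumption, not necessarily the ones used in the original figure: \beta_0 stands for the intercept and \beta_1 for the slope.

```latex
y = \beta_0 + \beta_1 x
```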

In multiple linear regression, we predict values using more than one feature variable. These feature variables are made into a matrix of features and then used for prediction of the target variable. The equation can be represented as follows:
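With n feature variables x_1, ..., x_n, the model is conventionally written as below. The coefficient names \beta_0, ..., \beta_n are again an assumption rather than the article's own notation.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
```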

Polynomial regression is also a type of regression and is used to make predictions using polynomial powers of the feature variables. It can be understood better using the equation shown below:
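For a single feature variable x and a polynomial of degree n, a conventional way to write the model is the following. The coefficient names \beta_0, ..., \beta_n are assumed notation.

```latex
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n
```

The model is non-linear in x but still linear in the coefficients, which is why an ordinary linear regression can fit it once the powers of x are available as features.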

Look at the following scatter plot. It can be seen that a linear regression model will not fit the given data, where x is the feature variable and y is the target variable:

Just for the sake of it, we fit a linear regression model on the given data and get the following RMSE values on the training and test sets.
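A minimal sketch of this step might look like the following. The data set used in the article is not available here, so the code assumes a synthetic stand-in (a noisy sine curve), and the `rmse` helper is a hypothetical name. The printed numbers will therefore not match the article's values exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: synthetic noisy sine data standing in for the article's data set.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.2, size=200)

X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit a straight line to data that is clearly curved.
linear_model = LinearRegression().fit(X_train, y_train)


def rmse(model, features, target):
    """Root mean squared error of the model's predictions (hypothetical helper)."""
    return float(np.sqrt(mean_squared_error(target, model.predict(features))))


train_rmse = rmse(linear_model, X_train, y_train)
test_rmse = rmse(linear_model, X_test, y_test)
print(f"Train RMSE: {train_rmse:.3f}, Test RMSE: {test_rmse:.3f}")
```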

The target lies in the range -1.5 to 1.5, so an RMSE of around 0.43 is quite high. We get the following graph, in which the line represents the model predictions.

We infer from the scatter plot that a polynomial curve can fit the data.

The sklearn.preprocessing module provides a class called PolynomialFeatures, which computes the higher powers of the feature variables. It returns them as a new feature matrix, leaving the given data frame unchanged.
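As a small illustration of what PolynomialFeatures produces, the sketch below expands a toy one-column input (made-up values, not the article's data) up to degree 3.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy input with a single feature column (illustrative values only).
X = np.array([[2.0], [3.0]])

poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

# Each row now holds 1, x, x^2, x^3:
# the row for x = 2 is [1, 2, 4, 8] and the row for x = 3 is [1, 3, 9, 27].
print(X_poly)

# The original array is untouched; the expansion is returned as a new array.
print(X.shape, X_poly.shape)
```

The leading column of ones is the bias term, which PolynomialFeatures includes by default (`include_bias=True`).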

For details, see the scikit-learn documentation for PolynomialFeatures.

We compute the desired degrees or powers of the feature variables using PolynomialFeatures and then use the LinearRegression class to fit the regression model on our data:
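The two steps could be sketched as follows for degree 4. As before, the noisy sine data is an assumed stand-in for the article's data set, and a straight-line baseline is fitted alongside purely for comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Assumption: synthetic noisy sine data standing in for the article's data set.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.2, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.3, random_state=0
)

# Expand the single feature x into the columns 1, x, x^2, x^3, x^4.
poly = PolynomialFeatures(degree=4)
X_train_poly = poly.fit_transform(X_train)  # set up the expansion on the training set
X_test_poly = poly.transform(X_test)        # apply the same expansion to the test set

# A linear regression on the expanded features is a polynomial regression in x.
poly_model = LinearRegression().fit(X_train_poly, y_train)

poly_train_rmse = float(
    np.sqrt(mean_squared_error(y_train, poly_model.predict(X_train_poly)))
)
poly_test_rmse = float(
    np.sqrt(mean_squared_error(y_test, poly_model.predict(X_test_poly)))
)

# Straight-line baseline on the raw feature, for comparison only.
line_model = LinearRegression().fit(X_train, y_train)
line_test_rmse = float(
    np.sqrt(mean_squared_error(y_test, line_model.predict(X_test)))
)

print(f"Degree 4 -> Train RMSE: {poly_train_rmse:.3f}, Test RMSE: {poly_test_rmse:.3f}")
print(f"Degree 1 -> Test RMSE: {line_test_rmse:.3f}")
```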

We get the following results. The RMSE drops significantly with degree 4, and the fitted curve looks as follows:

But how do we find out which degree of polynomial best fits the data?

For that, we compute the Train RMSE and Test RMSE for different degrees of polynomial and find out which degree polynomial regression works best for the given data set:
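One way to sketch this search is a loop over candidate degrees. The range 1 to 12 and the synthetic sine data are assumptions made for illustration, and `make_pipeline` is used here as a convenience for chaining the expansion and the regression; the article's own code may be organised differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumption: synthetic noisy sine data standing in for the article's data set.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.2, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.3, random_state=0
)

degrees = range(1, 13)  # hypothetical search range
results = []            # one (degree, train RMSE, test RMSE) tuple per degree

for degree in degrees:
    # The pipeline expands the features and then fits a linear regression on them.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_rmse = float(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
    test_rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    results.append((degree, train_rmse, test_rmse))
    print(f"degree={degree:2d}  train RMSE={train_rmse:.3f}  test RMSE={test_rmse:.3f}")

# Pick the degree with the lowest test RMSE.
best_degree, _, best_test_rmse = min(results, key=lambda row: row[2])
print(f"Lowest test RMSE {best_test_rmse:.3f} at degree {best_degree}")
```

Plotting the second and third entries of each tuple against the degree gives the kind of train and test RMSE curves discussed here.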

As the Test RMSE is low for degrees 4-9, one can zoom in on the graph and find out the optimal error value and build the final model accordingly.

You may also notice that the test RMSE increases sharply beyond a certain degree. As the degree of the polynomial increases, so does the complexity of the model, which leads to overfitting on the training set.

To know more about overfitting, check out What is overfitting and underfitting in machine learning?