- What are the different statistical measures to validate these assumptions?
- What are the ways one can check the assumptions on a dataset in Python?

Regression is a parametric approach. "Parametric" means it makes assumptions about the data for the purpose of analysis. Regression fails to deliver good results on datasets that don't fulfill its assumptions. Therefore, for a successful regression analysis, it's essential to validate these assumptions.

Linear Regression is a supervised machine learning model that represents the linear relationship between independent variables (features) and a dependent variable (target).

The dataset used to showcase the validation of the different assumptions contains data points about 50 start-ups. It has 4 columns: "R&D Spend", "Administration", "Marketing Spend", and "Profit". Here, "Profit" is the target variable.

**1. Linear Relationship between the features and target:**

A linear relationship (or linear association) is a statistical term describing a straight-line relationship between two variables. Linear regression models capture only linear relationships between features and target. According to this assumption, a linear relationship exists between the features and the target. This can be validated by plotting a scatter plot of each feature against the target.
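Since the 50 start-ups dataset itself isn't reproduced here, here is a minimal sketch of the check using synthetic data standing in for two of the columns ("R&D Spend" vs. "Profit"); Pearson's r is added as a numeric measure of the linear association:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Synthetic stand-in for the start-ups data: R&D Spend vs. Profit
rng = np.random.default_rng(42)
rd_spend = rng.uniform(0, 150_000, size=50)
profit = 0.8 * rd_spend + 50_000 + rng.normal(0, 5_000, size=50)

# Scatter plot: a roughly straight-line cloud suggests a linear relationship
plt.scatter(rd_spend, profit)
plt.xlabel("R&D Spend")
plt.ylabel("Profit")

# Pearson's r quantifies the strength of the linear association
r, _ = pearsonr(rd_spend, profit)
print(f"Pearson r = {r:.3f}")
```

With the real dataset, you would replace the synthetic arrays with the corresponding DataFrame columns and repeat the plot for each feature.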

**2. Little or no Multicollinearity between the features:**

Multicollinearity exists when the features in a dataset are moderately or highly correlated with each other. In a model with correlated features, it becomes a tough task to figure out the true relationship of a predictor with the target. In other words, it becomes difficult to find out which feature is actually contributing to predicting the target variable.

Pair plots and/or heatmaps can be used for identifying highly correlated features.
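As a sketch, again on synthetic stand-ins for the three feature columns, the correlation matrix gives the numbers behind a heatmap, and the Variance Inflation Factor (VIF) is a common statistical measure of multicollinearity (values above about 5, or 10 by some conventions, are considered problematic):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 50
rd = rng.normal(70_000, 20_000, n)
admin = rng.normal(120_000, 25_000, n)           # independent of the others
marketing = 1.5 * rd + rng.normal(0, 10_000, n)  # deliberately correlated with R&D

X = np.column_stack([rd, admin, marketing])

# Correlation matrix (what a heatmap visualises): off-diagonal values
# near +/-1 flag collinear feature pairs
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# VIF: regress each feature on the others; VIF = 1 / (1 - R^2)
X_const = np.column_stack([np.ones(n), X])  # intercept for the auxiliary regressions
vifs = [variance_inflation_factor(X_const, i) for i in range(1, 4)]
for name, v in zip(["R&D Spend", "Administration", "Marketing Spend"], vifs):
    print(name, round(v, 2))
```

Here the inflated VIFs for "R&D Spend" and "Marketing Spend" reflect the correlation deliberately built into the synthetic data.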

**3. Homoscedasticity:**

Homoscedasticity describes a situation in which the error term (that is, the "noise" or random disturbance in the relationship between the features and the target) is the same across all values of the independent variables. A scatter plot of residual values vs. predicted values is a good way to check for homoscedasticity. There should be no clear pattern in the distribution; if there is a specific pattern, the data is heteroscedastic.
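A minimal sketch of that residuals-vs-predicted plot, fitting an ordinary least-squares model on synthetic data (column names and values assumed, not from the article's dataset):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 150_000, size=(50, 1))  # stand-in feature, e.g. R&D Spend
y = 0.8 * X[:, 0] + 50_000 + rng.normal(0, 5_000, 50)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
resid = y - fitted

# Residuals vs. fitted values: a shapeless band around zero suggests
# homoscedasticity; a funnel/fan shape suggests heteroscedasticity
plt.scatter(fitted, resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
```

For a formal check, tests such as Breusch-Pagan (available in `statsmodels`) can supplement the visual inspection.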

**4. Normal distribution of error terms:**

According to this assumption, the error terms (residuals) follow a normal distribution. Normal distribution of the residuals can be validated by plotting a Q-Q (quantile-quantile) plot. For a normal distribution, the plot shows a fairly straight line.

However, it should be noted that as the sample size increases, the normality assumption for the residuals becomes less important. This is a consequence of an extremely important result in statistics known as the central limit theorem.
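A Q-Q plot can be drawn with `scipy.stats.probplot`; a sketch using synthetic residuals (the real check would use the residuals from the fitted model):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
resid = rng.normal(0, 5_000, size=50)  # stand-in for model residuals

# Q-Q plot: points hugging the reference line indicate normal residuals.
# probplot also returns the correlation of the fitted line, a numeric
# summary of how straight the plot is
(osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm", plot=plt)
print(f"Q-Q correlation = {r:.3f}")
```

Values of the returned correlation close to 1 support the normality assumption; markedly lower values suggest heavy tails or skew.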

**5. Little or No autocorrelation in the residuals:**

Autocorrelation occurs when the residual errors are dependent on each other. The presence of correlation in the error terms drastically reduces the model's accuracy.

Autocorrelation can be checked using the Durbin-Watson test, whose null hypothesis is that there is no serial correlation. The test statistic (DW) lies between 0 and 4. DW = 2 implies no autocorrelation, 0 < DW < 2 implies positive autocorrelation, and 2 < DW < 4 indicates negative autocorrelation.
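The statistic is available in `statsmodels`; a sketch on synthetic, independent residuals (so DW should land near 2), with the definition computed by hand as a cross-check:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
resid = rng.normal(0, 1, size=100)  # independent residuals -> DW near 2

dw = durbin_watson(resid)
print(f"Durbin-Watson = {dw:.2f}")

# Same statistic from its definition: sum of squared successive
# differences divided by the sum of squared residuals
dw_manual = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
```

In practice `resid` would be the residuals of the fitted regression model, ordered as the observations were collected.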

Hi, it is a great article.

For assumption 2, how can we detect any multicollinearity from the pairplot?

Thank you @alvinyong11 !!

A pairplot is basically a collection of scatter plots between the different columns of a dataset, so it helps us visualise whether there is any correlation between features. One can also get the exact correlation values from a heatmap.

Here, as you can see, there seems to be little or no correlation between *Administration* and *R&D Spend*, or between *Administration* and *Marketing Spend*, but there does seem to be some correlation between *R&D Spend* and *Marketing Spend*, which shows up as we look at higher values of both features.

You did an excellent job of explaining this concept…

Thank you @tanyachawla for this excellent code based clarification of assumptions of linear regression