- What is multi-collinearity?
- What are the effects of multi-collinearity in a given data set?
- How can one detect multi-collinearity?
- How can one reduce multi-collinearity?
Linear regression models are used to model the relationship between target (dependent) variables and input (independent) variables.
When input variables that are assumed to be independent of each other are in fact closely related, this correlation is referred to as collinearity. When this correlation is observed among two or more explanatory variables, it is known as multi-collinearity.
Besides making predictions, a linear regression model identifies the individual effect of each input variable on the target variable. Multi-collinearity is particularly undesirable because it makes these individual effects difficult to isolate, which hurts the interpretability of the model. In other words, multi-collinearity can be viewed as a phenomenon where two or more input variables are moderately or highly linearly related, to the extent that one input variable can be predicted from the others with substantial accuracy.
Effects of Multi-collinearity
- Uncertainty in coefficient estimates or unstable variance: Small changes in the data (adding or removing rows or columns) can cause large changes in the coefficients.
- Increased standard error: Reduces the precision of the estimates and lowers the chance of detecting a real effect.
- Decreased statistical significance: Because the standard error increases, the z-statistic (the coefficient divided by its standard error) declines. This makes it harder to detect a statistically significant coefficient and can lead to a type II error.
- Distorted coefficients and p-values: The importance of a correlated explanatory variable is masked because its effect is shared with the variables it is collinear with.
- Overfitting: The high variance of the coefficient estimates means the model can end up fitting noise in the training data.
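The increase in standard error listed above can be illustrated with a small simulation. This is a minimal sketch on synthetic data using only NumPy; the helper ols_with_std_errors, the variable names, and the noise scale of 0.05 are assumptions made for illustration.

```python
import numpy as np

def ols_with_std_errors(X, y):
    """Fit OLS with an intercept; return (coefficients, standard errors)."""
    A = np.column_stack([np.ones(len(X)), X])        # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # least-squares coefficients
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]                    # residual degrees of freedom
    sigma2 = resid @ resid / dof                     # estimated noise variance
    cov = sigma2 * np.linalg.inv(A.T @ A)            # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)

# Case 1: the second input is unrelated to the first.
x2_independent = rng.normal(size=n)
# Case 2: the second input is almost a copy of the first (assumed scale 0.05).
x2_collinear = x1 + 0.05 * rng.normal(size=n)

y_ind = 3.0 * x1 + 2.0 * x2_independent + noise
y_col = 3.0 * x1 + 2.0 * x2_collinear + noise

_, se_ind = ols_with_std_errors(np.column_stack([x1, x2_independent]), y_ind)
_, se_col = ols_with_std_errors(np.column_stack([x1, x2_collinear]), y_col)

# Index 0 is the intercept, so the slices below cover x1 and x2 only.
print("std errors, independent inputs:", se_ind[1:])
print("std errors, collinear inputs:  ", se_col[1:])
```

Under these assumptions the standard errors of both slopes should come out much larger in the collinear case, even though the noise level and the true coefficients are identical in the two setups.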
Detection of Multi-collinearity
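- Correlation Matrix and/or Heatmap: A correlation matrix lists the pairwise correlations between the input variables. A heatmap makes it easier to read by using colour to distinguish positive from negative correlations and colour intensity or marker size to show magnitude.
- Variance Inflation Factor (VIF): VIF is the ratio of the variance of a coefficient estimate when fitting the full model to the variance of that estimate when the variable is fit on its own. Equivalently, the VIF of an input variable is 1 / (1 - R^2), where R^2 is obtained by regressing that variable on all the other input variables. The minimum possible value is 1, which indicates no collinearity. As a common rule of thumb, a value above 5 suggests that collinearity should be addressed.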
In the output of the below cell, you can see that VIF is inf, that is, infinity, for a few variables. This indicates perfect correlation and is also the reason for the warning.
It is to be noted that variance_inflation_factor is imported from statsmodels.stats.outliers_influence.
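The two detection methods can also be sketched from first principles. The following is a minimal illustration on synthetic data, not the original notebook cell: it computes VIF directly from the 1 / (1 - R^2) definition with NumPy rather than calling the statsmodels function, and the helper vif, the three synthetic columns, and the noise scale of 0.1 are all assumptions.

```python
import numpy as np

def vif(X):
    """Return one VIF per column, computed as 1 / (1 - R^2).

    R^2 comes from regressing each column on all remaining columns
    plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        # A perfectly predictable column has R^2 = 1, i.e. an infinite VIF.
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1 (assumed scale 0.1)
x3 = rng.normal(size=n)              # unrelated to the other two
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)  # pairwise correlation matrix
vifs = vif(X)

print("correlation matrix:\n", np.round(corr, 3))
print("VIF per column:", np.round(vifs, 2))
```

In this setup the first two columns should show a pairwise correlation close to 1 and VIF values well above 5, while the third column should stay near the minimum of 1. A correlation matrix only reveals pairwise relationships, whereas VIF also picks up a variable that is a combination of several others.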
Handling Multi-collinearity
The need to reduce multi-collinearity depends on its severity and your primary goal for your regression model. Keep the following three points in mind:
- The severity of the problems increases with the degree of multi-collinearity. Therefore, if you have only moderate multi-collinearity, you may not need to resolve it.
- Multi-collinearity affects only the specific independent variables that are correlated. Therefore, if multi-collinearity is not present for the independent variables that you are particularly interested in, you may not need to resolve it. Suppose your model contains the experimental variables of interest and some control variables. If high multi-collinearity exists for the control variables but not the experimental variables, then you can interpret the experimental variables without problems.
- Multi-collinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce even severe multi-collinearity.
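The last point can be checked with a bootstrap experiment. This is a minimal sketch on synthetic data, assuming that the points to be predicted follow the same collinear pattern as the training data; the helpers fit_ols and predict, the 200 resamples, and the noise scale of 0.05 are illustrative choices.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    """Apply fitted coefficients to new rows."""
    return beta[0] + X @ beta[1:]

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)          # strongly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X_new = X[:20]                               # rows to predict (same collinear pattern)

coefs, preds = [], []
for _ in range(200):                         # refit on bootstrap resamples
    idx = rng.integers(0, n, size=n)
    beta = fit_ols(X[idx], y[idx])
    coefs.append(beta[1:])
    preds.append(predict(beta, X_new))

coef_spread = np.std(coefs, axis=0)          # variability of each slope
pred_spread = np.std(preds, axis=0).mean()   # average variability of a prediction

print("std of slopes across resamples:     ", coef_spread)
print("mean std of predictions across fits:", pred_spread)
```

Under these assumptions the individual slopes should vary far more between resamples than the predictions do, because the fitted models trade weight between x1 and x2 while their combined effect stays roughly constant.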
Multi-collinearity can be handled in the following ways:
- Introduce penalization or remove highly correlated variables: Lasso regression can eliminate variables that carry redundant information by shrinking their coefficients to zero, while ridge regression shrinks and stabilizes the coefficients of correlated variables without removing them. Redundant variables can also be identified and dropped by observing the VIF.
- Combine highly correlated variables: Since the collinear variables contain redundant information, they can be combined into a smaller set of uncorrelated variables using methods such as PCA.
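Both approaches can be sketched with NumPy alone. The code below is a minimal illustration, not a full implementation: it uses the closed-form ridge solution (lasso is omitted because it has no closed form), computes PCA through a singular value decomposition, and the helpers ridge_fit and pca_scores, the penalty alpha=10, and the synthetic data are assumptions.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression on centred data (intercept not penalised)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

def pca_scores(X, n_components):
    """Project standardised columns onto the leading principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, _, vt = np.linalg.svd(Z, full_matrices=False)   # rows of vt are components
    return Z @ vt[:n_components].T

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)     # collinear with x1
x3 = rng.normal(size=n)                 # independent input
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + 2.0 * x2 + 1.0 * x3 + rng.normal(size=n)

_, coef_ols = ridge_fit(X, y, alpha=0.0)      # alpha = 0 reduces to ordinary least squares
_, coef_ridge = ridge_fit(X, y, alpha=10.0)   # assumed penalty strength

scores = pca_scores(X, n_components=2)        # two uncorrelated replacement inputs
score_corr = np.corrcoef(scores, rowvar=False)

print("OLS coefficients:  ", np.round(coef_ols, 3))
print("ridge coefficients:", np.round(coef_ridge, 3))
print("correlation between PCA scores:", score_corr[0, 1])
```

The penalty trades a small amount of bias for much more stable coefficients on the collinear pair, and the PCA scores are uncorrelated by construction. The cost of PCA is that each component is a mixture of the original inputs, so the individual effect of each original variable is no longer directly readable.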