What is the importance of the R^2 value? Do we need to remove outliers to get a good correlation?

1. Is it possible to explain the importance of the R^2 value for linear regression models?
2. To get a good correlation, should we remove outliers from the data set? If yes, what steps should be taken to eliminate the outliers carefully?

Hey @me16s058, welcome to the community,
The general idea of the R^2 value is to check how well the model fits the data: it is the proportion of the variance in the target that the model explains. Usually, the higher the R^2 value, the better the fit, but there are exceptions. You can check this link: Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?
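To make the definition concrete, here is a minimal sketch that computes R^2 by hand as 1 - SS_res / SS_tot for a one-feature least-squares fit. The data points and the helper names `fit_line` and `r_squared` are made up for illustration; they are not from the original question.

```python
def fit_line(x, y):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) * (xi - mean_x) for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative data only (roughly y = 2x with small noise).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
pred = [slope * xi + intercept for xi in x]
r2 = r_squared(y, pred)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.4f}")
```

A model that always predicts the mean of y gets R^2 = 0, and a perfect fit gets R^2 = 1, which is why R^2 is read as "share of variance explained".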
As for outliers, correlation is usually sensitive to them, so whether to remove them depends on what you want from the model. There are times when outliers help in building a better model and times when they don't. In ATM fraud detection, for example, outliers can be a huge help. There are several ways to handle them: dropping, imputation, and transformation. Here is an interesting discussion on handling outliers: https://statisticsbyjim.com/basics/remove-outliers/
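As a quick illustration of how sensitive correlation can be, the sketch below computes the Pearson correlation on a small made-up data set, then again after appending a single extreme point. The numbers and the `pearson` helper are assumptions for the example only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Illustrative data: an almost perfectly linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0]
r_clean = pearson(x, y)

# The same data plus one extreme point.
x_out = x + [9]
y_out = y + [-20.0]
r_out = pearson(x_out, y_out)

print(f"without outlier: r={r_clean:.3f}")
print(f"with one outlier: r={r_out:.3f}")
```

In this toy case one point is enough to flip the sign of the correlation. Whether that point is noise to drop or the signal you care about (as in fraud detection) is a modelling decision, not something the statistic can tell you.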

If I may add to the outlier issue: as soon as you remove some outliers, others will appear, because outliers are usually measured as distances from the descriptive measures (mean, median, mode). There is always a trade-off between removing outliers to get a high R^2 (explanatory power) and the actual application of the model. Indeed, as you mention, analyzing outliers in fraud prevention is much more important than achieving a high R^2.
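The "new outliers appear" effect can be sketched with a simple rule that flags any value more than k standard deviations from the mean. The rule, the threshold k=2, the data, and the function names are all assumptions chosen for illustration; other rules (for example IQR-based ones) can behave similarly.

```python
import statistics


def flag_outliers(values, k=2.0):
    """Return the values lying more than k population std devs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > k * sd]


def trim_repeatedly(values, k=2.0, max_passes=10):
    """Apply the rule again after each removal; record what each pass removed."""
    remaining = list(values)
    removed_per_pass = []
    for _ in range(max_passes):
        flagged = flag_outliers(remaining, k)
        if not flagged:
            break
        removed_per_pass.append(flagged)
        remaining = [v for v in remaining if v not in flagged]
    return remaining, removed_per_pass


# Illustrative data: 30 looks normal only while 100 inflates the spread.
data = [10, 11, 12, 13, 14, 15, 16, 30, 100]
remaining, removed = trim_repeatedly(data)
print("removed per pass:", removed)
print("remaining:", remaining)
```

On the first pass only 100 is flagged. Once it is gone, the mean and standard deviation shrink and 30 becomes an outlier on the second pass, so repeated trimming should be done deliberately rather than until nothing is flagged.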

Thanks for the clear explanation.

When I use SGD regression on the non-smoker dataset, the predictions and RMSE seem to improve compared to Linear Regression.
However, when I use SGD regression on the smoker dataset (274, 1), the RMSE and the predictions were high. The model seems unable to train on the data and fit it, so it is difficult to compare with linear regression. Please let me know if I have made any mistakes on my side.
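One common cause of this symptom, though nothing in this thread confirms it is the cause here, is running gradient descent on unscaled features. The sketch below is a hand-written, plain-Python SGD on synthetic data; it is not scikit-learn's SGDRegressor, and the data, learning rate, and epoch count are assumptions. It standardizes the feature first and then compares the training RMSE with an ordinary least-squares fit.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def ols_fit(x, y):
    """Closed-form least squares for one feature; returns (slope, intercept)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sgd_fit(z, y, lr=0.005, epochs=200):
    """Per-sample SGD on squared error with a constant learning rate."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for zi, yi in zip(z, y):
            err = (w * zi + b) - yi
            w -= lr * err * zi
            b -= lr * err
    return w, b


# Synthetic data (hypothetical): one feature in the range 20..59, a target in the
# thousands, plus a deterministic pseudo-noise term.
x = [float(v) for v in range(20, 60)]
y = [250.0 * xi + 5000.0 + ((i * 37) % 11 - 5) * 100.0 for i, xi in enumerate(x)]

# Standardize the feature before SGD so a small learning rate is stable.
mean_x, sd_x = statistics.mean(x), statistics.pstdev(x)
z = [(xi - mean_x) / sd_x for xi in x]

slope, intercept = ols_fit(x, y)
rmse_ols = rmse(y, [slope * xi + intercept for xi in x])

w, b = sgd_fit(z, y)
rmse_sgd = rmse(y, [w * zi + b for zi in z])

print(f"OLS RMSE: {rmse_ols:.1f}")
print(f"SGD RMSE (standardized feature): {rmse_sgd:.1f}")
```

With the feature standardized, SGD should land close to the least-squares solution. Without scaling, the same learning rate can make the updates blow up. In scikit-learn the usual way to get the same effect is to put a StandardScaler in front of SGDRegressor in a pipeline.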

You can check this forum thread to know more about SGDRegressor.