- What is RMSE?
- How do outliers affect the RMSE?
- How to evaluate RMSE in Python?
- What does it mean when RMSE for the test set is way higher than that for the training set?
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are. RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the best-fit line.
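If $\hat{y}_i$ is the predicted value of the i-th observation, and $y_i$ is the corresponding true value, then the RMSE estimated over $n$ observations is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$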
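To make the definition concrete, here is a minimal sketch that computes RMSE by hand: square each difference between predicted and true value, average the squares, and take the square root. The helper name rmse and the sample numbers are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between predictions and true values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, chosen only to keep the arithmetic easy to follow.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

# Residuals are -1, 0, 1, 2, so the mean squared error is (1 + 0 + 1 + 4) / 4 = 1.5.
example_rmse = rmse(y_true, y_pred)
print(example_rmse)
```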
Good or bad value of RMSE
Note that the RMSE for a given data set has the same unit as the target variable, i.e. the variable one is trying to predict. There is therefore no absolute good or bad value of RMSE; it has to be judged relative to the range and standard deviation of the target variable.
Take a house price regression analysis, where the price of a house is the target variable. If prices range from 79k to 600k USD, with the majority of values lying between 100k and 250k, then an RMSE of 15k to 20k may indicate a reasonably good fit.
Although a smaller RMSE is always better, you can judge whether a given RMSE is acceptable by knowing what to expect from your target variable in a given data set.
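One possible way to put an RMSE in context is to compare it with the spread of the target. The sketch below uses invented house prices and an invented model RMSE of 18k. It relies on the fact that a model which always predicts the mean price has an RMSE equal to the (population) standard deviation of the prices, so that value serves as a naive baseline.

```python
import math
import statistics

def rmse(y_true, y_pred):
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices in USD, loosely following the 79k to 600k example.
prices = [79_000, 110_000, 125_000, 140_000, 160_000,
          185_000, 210_000, 250_000, 320_000, 600_000]

# Baseline: always predict the mean price.
mean_price = statistics.fmean(prices)
baseline_rmse = rmse(prices, [mean_price] * len(prices))
target_std = statistics.pstdev(prices)  # matches baseline_rmse up to rounding

model_rmse = 18_000.0  # assumed RMSE of some fitted model, for illustration only

rmse_over_std = model_rmse / target_std
rmse_over_range = model_rmse / (max(prices) - min(prices))
print(f"baseline RMSE: {baseline_rmse:.0f}, RMSE/std: {rmse_over_std:.3f}, "
      f"RMSE/range: {rmse_over_range:.3f}")
```

A ratio well below 1 against the standard deviation suggests the model does noticeably better than always guessing the mean.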
RMSE and outliers
Note that RMSE penalizes outliers heavily. The formula squares the difference between the actual value and the predicted value, and this difference tends to be larger for outliers, so they contribute disproportionately to the total. Hence the error for a typical data point taken from the range where most of the data lies (100k to 250k in the example above) will usually be smaller than the RMSE evaluated over the whole data set.
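The effect of squaring can be seen in a small sketch with invented prediction errors: five typical errors of around 10k, then the same set with a single 150k outlier added.

```python
import math

def rmse_from_errors(errors):
    """RMSE computed directly from a list of residuals (predicted minus actual)."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical residuals in USD.
typical_errors = [10_000, -12_000, 8_000, -9_000, 11_000]
outlier_error = 150_000
errors_with_outlier = typical_errors + [outlier_error]

rmse_without = rmse_from_errors(typical_errors)
rmse_with = rmse_from_errors(errors_with_outlier)

# Fraction of the total squared error that comes from the single outlier.
outlier_share = outlier_error ** 2 / sum(e * e for e in errors_with_outlier)

print(f"RMSE without outlier: {rmse_without:.0f}")
print(f"RMSE with outlier:    {rmse_with:.0f}")
print(f"Outlier share of squared error: {outlier_share:.1%}")
```

In this made-up case one point out of six accounts for almost all of the squared error, which is why the RMSE over the whole set ends up far above the typical error size.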
RMSE in Python
mean_squared_error from the sklearn.metrics module can be used to evaluate RMSE for a given data set. In scikit-learn versions that accept it, the squared parameter of this function needs to be set to False; otherwise MSE will be calculated, which is given by the following formula:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$
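As a rough sketch of the call described above, with made-up prices: this version takes the square root of the MSE rather than passing squared=False. That is a choice made here for portability, since newer scikit-learn releases deprecate the squared parameter and provide a separate root_mean_squared_error function.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted house prices in USD.
y_true = np.array([200_000, 150_000, 320_000, 110_000])
y_pred = np.array([210_000, 140_000, 300_000, 115_000])

# mean_squared_error takes the true values first, then the predictions.
mse = mean_squared_error(y_true, y_pred)

# RMSE is the square root of MSE, so this works regardless of the `squared` parameter.
rmse = float(np.sqrt(mse))
print(mse, rmse)
```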
Difference in value of RMSE for training and test set
When the RMSE for the test set is much higher than the RMSE for the training set, the model has most likely overfit the training set and does not give a generalized regression line that fits unseen data well. This is not specific to RMSE; the same goes for any evaluation metric.
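The gap between training and test RMSE can be illustrated with a deliberately overfit toy model. The sketch below swaps the regression line for a 1-nearest-neighbour predictor, which simply memorizes the training set. The data is synthetic and the seed, sizes, and noise level are arbitrary assumptions.

```python
import math
import random

def rmse(y_true, y_pred):
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

random.seed(0)

def make_data(n):
    """Synthetic data: a linear trend plus Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [3.0 * x + random.gauss(0, 2.0) for x in xs]
    return xs, ys

x_train, y_train = make_data(30)
x_test, y_test = make_data(30)

def predict_1nn(x):
    """Return the target of the closest training point (memorizes the training set)."""
    nearest = min(range(len(x_train)), key=lambda j: abs(x_train[j] - x))
    return y_train[nearest]

train_rmse = rmse(y_train, [predict_1nn(x) for x in x_train])
test_rmse = rmse(y_test, [predict_1nn(x) for x in x_test])

# Training RMSE is zero because every training point is its own nearest neighbour;
# the test RMSE is not, because the memorized noise does not carry over to new data.
print(f"train RMSE: {train_rmse:.3f}, test RMSE: {test_rmse:.3f}")
```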
I am having a problem calculating train_rmse in assignment1 - Linear and Logistic regression. I get the following error:
y_true and y_pred have different number of output.
I ran this code :
train_rmse = mean_squared_error(train_inputs, train_preds ,squared = False)
Oh. I figured it out…Thank You for this info…!!
hey plz help me.I encountered the same error
The problem is in the mean_squared_error call. It should be:
mean_squared_error(train_targets, train_preds, squared = False)
I put train_inputs instead of train_targets. Hence I was getting the error.
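For anyone hitting the same error, here is a small sketch that reproduces the mix-up with made-up arrays. The names train_inputs, train_targets and train_preds follow the thread, but the values and shapes are invented. Passing the 2-D feature matrix as the first argument raises a ValueError, while passing the 1-D targets works.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical data: 4 rows with 3 feature columns, and one target per row.
train_inputs = np.array([[1.0, 2.0, 3.0],
                         [2.0, 3.0, 4.0],
                         [3.0, 4.0, 5.0],
                         [4.0, 5.0, 6.0]])
train_targets = np.array([10.0, 20.0, 30.0, 40.0])
train_preds = np.array([12.0, 18.0, 33.0, 39.0])

# Wrong: features passed where the true targets belong (3 columns vs. 1 output).
try:
    mean_squared_error(train_inputs, train_preds)
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)
print("Error with train_inputs:", mismatch_error)

# Right: compare targets with predictions, then take the square root for RMSE.
train_rmse = float(np.sqrt(mean_squared_error(train_targets, train_preds)))
print("train_rmse:", train_rmse)
```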