- What is RMSE?
- How do outliers affect the RMSE?
- How to evaluate RMSE in Python?
- What does it mean when RMSE for the test set is way higher than that for the training set?

**Root Mean Square Error** (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are. RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the best-fit line.

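If $\hat{y}_i$ is the predicted value of the *i-th* observation, and $y_i$ is the corresponding true value, then the RMSE estimated over $n$ observations is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$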
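
As a rough illustration of this definition (the square root of the mean of the squared differences between predicted and true values), here is a minimal pure-Python sketch. The function name `rmse` and the sample numbers are made up for this example.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    # Square each residual (predicted minus true), average, then take the root.
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / n)

# Illustrative values only: the residuals are -1, 0, 2, -1.
value = rmse([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 6.0])
print(value)
```

The squared residuals here are 1, 0, 4 and 1, so the mean is 1.5 and the RMSE is its square root.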

**Good or bad value of RMSE**

It is important to know that the RMSE value for a given data set has the same unit as the target variable, i.e. the variable one is trying to predict. This means there is no absolute good or bad value of RMSE; it can only be judged relative to the range and standard deviation of the target variable.

For example, in a house price regression analysis where the price of a house is the target variable, if prices range from 79k to 600k USD with the majority of values lying between 100k and 250k, then an RMSE of 15k to 20k may indicate a reasonably good fit.

So, although the smaller the RMSE, the better, you can judge whether a given RMSE is acceptable by knowing what to expect from your target variable in a given data set.
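
One way to put an RMSE value in context is to express it as a fraction of the target's range and of its standard deviation. The sketch below does that with the standard library only. The price list and the RMSE of 18k are hypothetical numbers, chosen just to resemble the house price example, and `relative_rmse` is a helper name invented here.

```python
import statistics

def relative_rmse(rmse_value, targets):
    """Return RMSE as a fraction of the target range and of the target std."""
    target_range = max(targets) - min(targets)
    target_std = statistics.pstdev(targets)  # population standard deviation
    return rmse_value / target_range, rmse_value / target_std

# Hypothetical house prices in USD (not real data).
prices = [79_000, 110_000, 125_000, 140_000, 160_000,
          185_000, 210_000, 250_000, 600_000]
rmse_value = 18_000  # assumed RMSE of some model

frac_of_range, frac_of_std = relative_rmse(rmse_value, prices)
print(f"RMSE / range: {frac_of_range:.3f}")
print(f"RMSE / std:   {frac_of_std:.3f}")
```

A ratio to the standard deviation well below 1 suggests the model does better than always predicting the mean. What counts as "good enough" still depends on the problem.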

**RMSE and outliers**

Note that RMSE penalizes outliers heavily. The formula squares the difference between the actual value and the predicted value, so the large errors that typically occur at outliers contribute disproportionately to the total. Hence, the RMSE calculated over only the data points that lie in the range of the majority (in the above example, 100k to 250k) will usually be lower than the RMSE evaluated over the whole data set.
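
The effect of a single outlier can be seen in a small sketch. All numbers below are invented for illustration (prices in thousands of USD): five typical points are each predicted with an error of 5, then one outlier with an error of 300 is added.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(
        sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

# Typical points: every prediction is off by exactly 5 (illustrative values).
y_true = [100, 120, 150, 180, 200]
y_pred = [105, 115, 155, 175, 205]
rmse_typical = rmse(y_true, y_pred)

# Add one outlier whose prediction is off by 300.
rmse_with_outlier = rmse(y_true + [600], y_pred + [300])

print(rmse_typical, rmse_with_outlier)
```

Because the error is squared before averaging, the one large miss dominates the sum and pulls the RMSE far above the typical error of 5.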

**RMSE in Python**

The function `mean_squared_error` from the module `sklearn.metrics` (part of scikit-learn) can be used to evaluate RMSE for a given data set. The parameter `squared` of this function needs to be set to `False`, otherwise MSE will be calculated, which is given by the following formula:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$
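
Here is a minimal sketch of the scikit-learn route, assuming `numpy` and `scikit-learn` are installed. It takes the square root of the MSE explicitly rather than passing `squared=False`. To my knowledge, recent scikit-learn releases have deprecated that parameter in favour of a separate `root_mean_squared_error` function, so the explicit square root should behave the same across versions. The arrays are made-up sample values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 6.0])

# First argument: true targets; second argument: predictions.
mse = mean_squared_error(y_true, y_pred)

# RMSE is the square root of MSE.
rmse = np.sqrt(mse)
print(mse, rmse)
```

Both arguments must describe the same quantity and have the same shape, i.e. targets and predictions, not inputs and predictions.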

**Difference in value of RMSE for training and test set**

When the RMSE evaluated on the test set is much higher than the RMSE on the training set, the model overfits the training set and does not give a generalized regression line that fits unseen data well. This is not the case for RMSE alone; it holds for any evaluation metric.
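
To make the train/test gap concrete, here is a toy sketch in plain Python. It uses a deliberately overfitting "model": a 1-nearest-neighbour lookup that simply memorizes the training points. This is not the regression model discussed above, only a stand-in chosen because it overfits by construction. All data values are invented.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(
        sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

def predict_1nn(x_train, y_train, x):
    """Return the target of the closest training input (pure memorization)."""
    idx = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[idx]

# Invented data: roughly y = 2x, with hand-picked noise in the training targets.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [2.5, 3.4, 6.8, 7.5, 10.6]
x_test = [1.4, 2.6, 3.4, 4.6]
y_test = [2.8, 5.2, 6.8, 9.2]

train_preds = [predict_1nn(x_train, y_train, x) for x in x_train]
test_preds = [predict_1nn(x_train, y_train, x) for x in x_test]

train_rmse = rmse(y_train, train_preds)  # 0.0: every training point is memorized
test_rmse = rmse(y_test, test_preds)     # larger: the noise was memorized too

print(train_rmse, test_rmse)
```

A training RMSE of zero next to a clearly non-zero test RMSE is the overfitting pattern described above.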

I am having a problem calculating `train_rmse` in assignment 1 (Linear and Logistic Regression). I get the following error:

y_true and y_pred have different number of output.

I ran this code:

`train_rmse = mean_squared_error(train_inputs, train_preds, squared=False)`

Please Help.

Oh. I figured it out…Thank You for this info…!!

Hey, please help me. I encountered the same error.

The problem is in the `mean_squared_error` call. It should be:

`mean_squared_error(train_targets, train_preds, squared=False)`

I put `train_inputs` instead of `train_targets`, hence the error.