- Is it wise to remove outliers?
- What should we know before removing outliers?
If you are not familiar with what outliers are, or with their causes and effects, you can visit this link.
Before we get into the reasoning, let us look at the two examples below.
In this first example, we will see how an outlier affects different statistical measures.
Look at the following data set. The data set tells us the starting salary of recent graduate students from a given college in a given year (in USD):
Using a box plot, we can clearly see that the data set contains one outlier:
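A box plot usually flags points that fall more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule by hand. The salary values are made up for illustration and are not the article's data set, and `iqr_outliers` is a hypothetical helper name.

```python
import statistics

# Hypothetical starting salaries in USD (illustrative values only,
# not the data set used in the article).
salaries = [48000, 50000, 51000, 52000, 53000,
            54000, 55000, 56000, 58000, 250000]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


outliers = iqr_outliers(salaries)
print(outliers)
```

The factor `k=1.5` is the common convention, not a law. A larger value flags fewer points.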
Let us look at some measures for the two data sets, that is, one with the outlier and one without:
Before we draw any conclusions, let us look at an example with a larger data set that contains multiple outliers. We will also train a machine learning (ML) model and compare the RMSE (root mean squared error) results.
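As a rough sketch of such a comparison, the snippet below computes the mean, median, and standard deviation with and without the extreme value. The salaries are hypothetical, and `summarize` is an illustrative helper, not part of the original analysis.

```python
import statistics

# Hypothetical starting salaries in USD (illustrative values only).
salaries = [48000, 50000, 51000, 52000, 53000,
            54000, 55000, 56000, 58000, 250000]
# Drop the single extreme value to build the second data set.
without_outlier = [s for s in salaries if s != 250000]


def summarize(values):
    """Collect a few common summary statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


with_stats = summarize(salaries)
without_stats = summarize(without_outlier)
for name in with_stats:
    print(f"{name:>6}: with={with_stats[name]:.1f}  "
          f"without={without_stats[name]:.1f}")
```

For these made-up values, the mean moves from 53,000 to 72,700 when the outlier is included, while the median only moves from 53,000 to 53,500. This is why the median is often described as the more robust measure of the two.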
The following data set has been extracted from Kaggle.
Let us look at the data set:
To build a model that predicts the price of a house in the USA, we will drop the ‘Address’ column and first look at the pair plot to get a better idea of the features in the data set:
Let us first train a model with the outliers and obtain the RMSE for the same:
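For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. A minimal sketch follows. The prices are hypothetical, and the `rmse` helper is written by hand here rather than taken from whichever library the original analysis used.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical house prices in USD and hypothetical model predictions.
actual = [200000, 250000, 300000]
predicted = [210000, 240000, 330000]
score = rmse(actual, predicted)
print(f"RMSE: {score:.2f}")
```

Because the errors are squared before averaging, a single large miss (such as the 30,000 error above) weighs more heavily than several small ones. This is one reason outliers can dominate the metric.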
Using a box plot, we can easily visualize the outliers in the training set:
After removing the outliers from the training set, we train the model again and test it:
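The procedure can be sketched end to end as follows. Everything here is an assumption made for illustration: the (area, price) rows are invented, the model is a one-feature least-squares line rather than the model used in the article, and outliers are dropped with the 1.5 × IQR rule on the target. The important detail is that only the training rows are filtered, and both models are scored on the same untouched test set.

```python
import math
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def iqr_bounds(values, k=1.5):
    """Lower and upper fences of the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def evaluate(train_rows, test_rows):
    """Fit on train_rows, then return the RMSE on test_rows."""
    slope, intercept = fit_line([x for x, _ in train_rows],
                                [y for _, y in train_rows])
    predictions = [slope * x + intercept for x, _ in test_rows]
    return rmse([y for _, y in test_rows], predictions)


# Hypothetical (area, price) rows; the last training row is extreme.
train = [(50, 152), (60, 178), (70, 214), (80, 238),
         (90, 272), (100, 301), (110, 329), (120, 600)]
test = [(55, 166), (85, 254), (115, 347)]

# Filter the training rows only; the test set stays as it is.
lower, upper = iqr_bounds([price for _, price in train])
filtered_train = [(x, y) for x, y in train if lower <= y <= upper]

rmse_with = evaluate(train, test)
rmse_without = evaluate(filtered_train, test)
print(f"RMSE with outliers:    {rmse_with:.2f}")
print(f"RMSE without outliers: {rmse_without:.2f}")
```

On toy data like this, the comparison says nothing about real data sets. Whether removal helps depends on whether the extreme rows are errors or legitimate observations that will also appear at prediction time.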
In the second example, removing outliers does not improve our model at all. So our first instinct with outliers should not be to remove them, but to build models that accommodate natural outliers as well as possible. In most cases, we will find that our models are able to achieve this.
Now, going back to the first example, sorting the data gave us a clear idea of the outlier, and removing that data point gave us statistical measures that are truer to most of the points in the data set. The measures computed without the outlier give us a more accurate idea of what a student’s starting salary might be when graduating from that same college in the next one or two years.
Ultimately, when it comes to real-world data and problems, analysts must investigate unusual values and use their expertise to determine whether the outliers present are legitimate data points. Statistical procedures don’t know the subject matter or the data collection process and can’t make the final determination. One should not include or exclude an observation based entirely on the results of a hypothesis test or statistical measure. Doing research about the data set you have and acquiring in-depth knowledge about the different features will help you understand the different values present in the data set.
If you want to know more about how to detect and remove outliers, you can read the following articles: