- What are the causes of outliers in a data set?
- What are the effects of outliers on a data set?
- What are the effects of outliers on linear regression?
Outliers are extreme values that deviate from the other observations in a data set. They may indicate variability in measurement, experimental error, or a novelty. In other words, an outlier is an observation that diverges from the overall pattern of a sample or data set.
Causes of Outliers:
The following are some of the common causes of outliers in a given data set:
- Measurement Error - This occurs when the measurement instrument used turns out to be faulty.
- Data Entry Error - Human errors made during data collection, recording, or entry can cause outliers in data.
- Experimental Error - These errors arise during data extraction, experiment planning, or the execution of an experiment.
- Data Processing Error - These arise when the data set is manipulated or extracted.
- Sampling Error - This happens when data is extracted from the wrong source or mixed from several sources.
- Intentional Outlier - These are dummy outliers created to test detection methods.
- Natural Outlier - An outlier that is not artificial, i.e. not caused by an error, is a natural outlier. In the process of producing, collecting, processing and analyzing data, outliers can come from many sources and hide in many dimensions.
Effect of outliers on a data set:
Outliers can have a large impact on the results of data analysis and on various statistical measures.
Some of the most common effects are as follows:
- If the outliers are non-randomly distributed, they can decrease normality.
- They increase the error variance and reduce the power of statistical tests.
- They can bias or otherwise influence estimates.
- They can also violate the basic assumptions of regression as well as other statistical models.
Let us look at the following data set to understand some of these effects.
The following boxplot shows that there is an outlier in the given data set:
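A boxplot conventionally marks a point as an outlier when it falls more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The sketch below applies that rule directly. The values in `y` and the helper name `iqr_outliers` are hypothetical and are not the article's data set; plotting libraries may also compute quartiles slightly differently from `statistics.quantiles`.

```python
import statistics

# Hypothetical sample (not the article's data); 50 is planted as the outlier.
y = [2, 4, 5, 7, 8, 10, 11, 13, 50]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


flagged = iqr_outliers(y)
print("Flagged as outliers:", flagged)
```

With this sample, only the value 50 falls outside the fences.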
We now drop the outlier and compare different measures of central tendency for column y of the two data sets. The difference is easy to notice:
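As a rough illustration of this comparison, the sketch below computes the mean, median, and sample standard deviation of a column with and without its outlier. The values are hypothetical stand-ins for column y, and `summarize` is a helper name introduced here for illustration.

```python
import statistics

# Hypothetical stand-in for column y; 50 is the planted outlier.
y_with = [2, 4, 5, 7, 8, 10, 11, 13, 50]
y_without = [v for v in y_with if v != 50]


def summarize(values):
    """Collect a few summary statistics for one column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary_with = summarize(y_with)
summary_without = summarize(y_without)

for name in ("mean", "median", "stdev"):
    print(f"{name:>6}: with outlier = {summary_with[name]:.2f}, "
          f"without = {summary_without[name]:.2f}")
```

In this sample the mean moves from 7.5 to roughly 12.2 when the outlier is included, while the median only moves from 7.5 to 8, which is why the median is often described as the more robust measure.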
Effect of outliers on Linear Regression:
Let us look at the scatter plot and the best-fit linear equation when the outlier exists in the data set and when it is dropped:
The parameters of the fitted line change drastically between the two graphs. Outliers therefore affect the accuracy measures of a linear regression model and can lead to errors in estimation.
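The change in the fitted parameters can be reproduced with a minimal ordinary least squares sketch. The `x` and `y` values are hypothetical, not the article's data, and `fit_line` is a helper written for this example rather than a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical data; the last point (9, 50) is the planted outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 4, 5, 7, 8, 10, 11, 13, 50]

slope_with, intercept_with = fit_line(x, y)
slope_without, intercept_without = fit_line(x[:-1], y[:-1])

print(f"with outlier:    y = {slope_with:.2f}x + {intercept_with:.2f}")
print(f"without outlier: y = {slope_without:.2f}x + {intercept_without:.2f}")
```

On this sample a single extreme point raises the slope from about 1.52 to 3.9 and pulls the intercept below zero, so predictions for the remaining points would be noticeably worse.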