Everything you need to know about Outliers
Today in this blog we will study everything about outliers. If you have never heard about outliers, don't worry, we will start absolutely from scratch. We will answer five questions about outliers to understand them inside and out:
- What are Outliers?
- When are Outliers dangerous?
- What are the effects of Outliers on different ML algorithms?
- How to treat Outliers?
- How to detect Outliers?
and then finally we will see some techniques for outlier detection and removal…
So let's begin with the first question.
1. What are Outliers?
In simple words, Sharmaji ka beta is an outlier. I mean, if I consider a class of 50 students where everyone scored around 40–60 marks out of 100 in a test, but Sharmaji's son scored 99 out of 100, then he is considered an outlier. Outliers distract models and give a false representation of the data. Now this leads us to our next question: are outliers always dangerous, or can they sometimes be useful?
2. When are Outliers dangerous?
Now this is an interesting question, and this is a point where you really need to focus and invest some time. In many cases we may say that outliers are dangerous, but not every time. For example, in anomaly detection projects: if I am looking at a credit card transaction and trying to figure out whether it is a normal transaction or a fraudulent one, it is basically an outlier that I am trying to detect. In that case, if I had removed all the outliers from the training data assuming they could be dangerous, I would actually have thrown away the very signal I was looking for.
3. What are the effects of Outliers on different ML algorithms?
There are a few algorithms that are badly impacted by the presence of outliers in the data. Some of these are Linear Regression, Logistic Regression, the AdaBoost classifier and many Deep Learning algorithms. Now, if you look closely, there is one thing common to all the algorithms mentioned above, and that is weights. All of these are weight-based algorithms, so we can infer that outliers are harmful for weight-based algorithms. Tree-based algorithms like Decision Trees, Random Forests etc., on the other hand, are only mildly affected by outliers.
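To see this for yourself, here is a minimal sketch on made-up data (the numbers, the injected outlier and the model settings are just assumptions for illustration): one extreme point visibly drags the Linear Regression fit, while a Decision Tree barely moves.

```python
# A minimal sketch (made-up data) showing how a single outlier shifts a
# weight-based model (Linear Regression) far more than a tree-based one.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = np.arange(50, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(0, 1, 50)       # clean linear relationship

y_out = y.copy()
y_out[-1] = 500                                 # inject one extreme outlier

for name, model in [("LinearRegression", LinearRegression()),
                    ("DecisionTree", DecisionTreeRegressor(max_depth=3))]:
    clean = model.fit(X, y).predict([[25.0]])[0]
    dirty = model.fit(X, y_out).predict([[25.0]])[0]
    print(f"{name}: prediction at x=25 -> clean={clean:.1f}, with outlier={dirty:.1f}")
```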
4. How to treat Outliers?
There are actually three ways you can treat the outliers in your data.
a. Trimming
b. Capping
c. Treat like missing values
a. Trimming- In this technique you simply remove the outliers. Its advantage is that it is very fast to implement, as you simply delete those data points, but if there are many outliers in your data then it can make the dataset very thin, which could be a bad idea.
b. Capping- This is a more preferred way to treat outliers. You select two thresholds: any value larger than the upper threshold is considered an outlier and is given the value of the upper threshold, and any value smaller than the lower threshold is considered an outlier and is assigned the value of the lower threshold.
c. Treat like missing values- Replace all your outliers with np.nan and then handle them the same way you normally treat missing values (imputation, deletion, etc.). A sketch of all three treatments follows below.
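Here is a minimal sketch of all three treatments on a made-up "marks" column (the column name, the values and the thresholds are assumptions for illustration, not a prescription):

```python
# A minimal sketch of trimming, capping and NaN-treatment on a made-up column.
import numpy as np
import pandas as pd

df = pd.DataFrame({"marks": [45, 52, 48, 55, 41, 60, 99]})   # 99 is the outlier
lower, upper = 30, 70                                          # assumed thresholds

# a. Trimming: drop rows that fall outside the thresholds
trimmed = df[(df["marks"] >= lower) & (df["marks"] <= upper)]

# b. Capping: clip values to the thresholds instead of dropping them
capped = df.assign(marks=df["marks"].clip(lower, upper))

# c. Treat like missing values: mark outliers as NaN, then impute (median here)
as_missing = df.assign(marks=df["marks"].where(df["marks"].between(lower, upper)))
imputed = as_missing.fillna({"marks": as_missing["marks"].median()})

print(trimmed.shape, capped["marks"].max(), imputed["marks"].iloc[-1])
```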
5. How to detect Outliers?
For a Normal Distribution we check whether the data point is larger than (mean + 3 × standard deviation) or smaller than (mean − 3 × standard deviation); if so, that data point is flagged as an outlier.
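As a quick sketch, the 3-standard-deviation rule can be applied like this on a made-up, roughly normal column (the data and the planted outliers are assumptions for illustration):

```python
# A minimal sketch of the 3-standard-deviation rule on made-up normal data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(np.append(rng.normal(50, 5, 200), [95, 4]))   # two planted outliers

mean, std = s.mean(), s.std()
upper, lower = mean + 3 * std, mean - 3 * std

outliers = s[(s > upper) | (s < lower)]
print(f"bounds: [{lower:.1f}, {upper:.1f}], outliers found: {outliers.values}")
```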
For a skewed distribution we plot a boxplot, which makes outlier detection very easy: any point beyond the whiskers, i.e. more than 1.5 × IQR away from the quartiles, is treated as an outlier.
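Here is the same idea in code, the 1.5 × IQR rule that a boxplot uses for its whiskers, again on made-up skewed data:

```python
# A minimal sketch of IQR-based filtering (the boxplot rule) on made-up skewed data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series(np.append(rng.exponential(10, 200), [150, 200]))  # right-skewed + outliers

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # the boxplot whisker limits

filtered = s[(s >= lower) & (s <= upper)]         # keep only the non-outlier points
print(f"IQR bounds: [{lower:.1f}, {upper:.1f}], removed {len(s) - len(filtered)} points")
```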
Now finally I would like to mention the two most used techniques for outlier treatment: for a Normal Distribution we use Z-score treatment, which is very similar to the 3-standard-deviation rule we saw above, and for skewed data we use IQR-based filtering, following the 1.5 × IQR rule we just discussed.
I hope you guys got to learn something new and enjoyed this blog. If you did like it, then share it with your friends. Take care and keep learning.
You could also reach me through Linkedin - https://www.linkedin.com/in/harsh-mishra-4b79031b3/