SMS Spam Predictor: An End-to-End ML Classification Project
Ever wondered how Google moves some of your emails to the spam folder without ever asking you? Today we will build something similar: an ML model that can detect whether an SMS is spam or not.
All the code written in this blog is on my GitHub: https://github.com/HarshMishra2002/sms-spam-predictor
Link to the dataset used: https://www.kaggle.com/uciml/sms-spam-collection-dataset
Link to the deployed Heroku app: https://sms-spam-detector-harsh.herokuapp.com/
Let's start the project. The complete project is written in Python, with Jupyter Notebook used as the IDE.
First, we will import some essential libraries.
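The exact import cell isn't shown here, but a typical set for this kind of project looks like this (a sketch, not the notebook's literal cell):

```python
# Core libraries used throughout the notebook:
# numpy/pandas for data handling, matplotlib/seaborn for plots
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```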
Now we will load the dataset and take a first look at it.
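In the notebook this is a single `pd.read_csv('spam.csv', encoding='latin-1')` call on the Kaggle file; here a tiny inline sample stands in for that file so the sketch is self-contained. The two example rows and the column layout mimic the real CSV, which has 5,572 rows.

```python
import io
import pandas as pd

# Inline stand-in for the Kaggle CSV; in the notebook:
# df = pd.read_csv('spam.csv', encoding='latin-1')
sample = io.StringIO(
    'v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4\n'
    'ham,"Go until jurong point, crazy..",,,\n'
    'spam,"Free entry in 2 a wkly comp to win FA Cup",,,\n'
)
df = pd.read_csv(sample)
print(df.shape)      # (2, 5) -- the real file is (5572, 5)
print(df.head())
```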
We will start the data cleaning process by checking for null values; if any are present, we need to handle them accordingly.
We can see that the last three columns have hardly any non-null values, so it makes no sense to keep them.
The names of the remaining columns are not very descriptive, so we will rename them according to the information they hold: v1 becomes target and v2 becomes text.
Looking at the target column, we have two values in it: spam and ham. We have to convert them to 1 and 0 respectively, so we will use LabelEncoder for it.
Now comes the last step of data cleaning: we will check for duplicate rows and remove them if present.
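The whole cleaning stage can be sketched as follows. The four-row frame below is a hypothetical stand-in for the real dataframe; the column names and steps (drop the empty columns, rename, label-encode, de-duplicate) follow the walkthrough above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for the real 5572-row dataframe
df = pd.DataFrame({
    'v1': ['ham', 'spam', 'ham', 'ham'],
    'v2': ['Ok lar...', 'WINNER!! Claim your prize', 'Ok lar...', 'See you soon'],
    'Unnamed: 2': [None] * 4,
    'Unnamed: 3': [None] * 4,
    'Unnamed: 4': [None] * 4,
})

print(df.isnull().sum())                       # the three extra columns are all NaN
df = df.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'])
df = df.rename(columns={'v1': 'target', 'v2': 'text'})

le = LabelEncoder()                            # ham -> 0, spam -> 1 (alphabetical)
df['target'] = le.fit_transform(df['target'])

df = df.drop_duplicates(keep='first')          # one duplicate row removed here
print(df)
```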
Now we will start with EDA (exploratory data analysis). Let's check the class balance of the dataset.
As we can see, around 88% of the rows have target value 0, so the data is imbalanced. Next we will compute the number of characters, number of words, and number of sentences for each text and try to extract some insights from the data. For this we will use Python's NLTK library.
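A sketch of those three engineered features is below. The notebook uses NLTK's `word_tokenize` and `sent_tokenize` (which need `nltk.download('punkt')`); simple `split`-based approximations stand in here so the sketch runs without downloads, and the two-row frame is hypothetical.

```python
import pandas as pd

# Hypothetical frame; the notebook applies nltk.word_tokenize / nltk.sent_tokenize
# to the real text column instead of the split() approximations used here.
df = pd.DataFrame({
    'target': [0, 1],
    'text': ['See you at home. Call me.',
             'WINNER!! You have won a free prize. Claim now!'],
})

df['num_characters'] = df['text'].apply(len)
df['num_words'] = df['text'].apply(lambda t: len(t.split()))
df['num_sentences'] = df['text'].apply(
    lambda t: len([s for s in t.split('.') if s.strip()]))

# Mean of each feature per class: spam messages come out longer
print(df.groupby('target')[['num_characters', 'num_words', 'num_sentences']].mean())
```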
If we look at the mean values, we can clearly see that spam messages are mostly longer, which makes sense. Now we will use the seaborn library to visualize and confirm the same.
We will finish EDA by plotting a correlation heatmap.
Next comes data preprocessing, with five steps to follow:
Lower case
Tokenization
Removing special characters
Removing stop words and punctuation
Stemming
For all of this we will create a single function, but first let's import some necessary libraries.
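A sketch of that function, covering the five steps above, is below. The notebook uses `nltk.word_tokenize` and `stopwords.words('english')` from `nltk.corpus` (both need `nltk.download(...)` calls); here `split()` and a small illustrative stop-word set stand in so the sketch is self-contained.

```python
import string
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Illustrative stop-word set; the notebook uses the full NLTK English list
STOP_WORDS = {'i', 'a', 'an', 'the', 'is', 'are', 'to', 'in', 'of', 'and', 'you', 'have'}

def transform_text(text):
    text = text.lower()                                   # 1. lower case
    tokens = text.split()                                 # 2. tokenization (notebook: nltk.word_tokenize)
    tokens = [t for t in tokens if t.isalnum()]           # 3. remove special characters
    tokens = [t for t in tokens                           # 4. remove stop words and punctuation
              if t not in STOP_WORDS and t not in string.punctuation]
    tokens = [ps.stem(t) for t in tokens]                 # 5. stemming
    return ' '.join(tokens)

print(transform_text('Did you WIN a FREE entry?? Loving it'))
```

In the notebook this function is then applied with `df['text'].apply(transform_text)` to build the new column.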
We will apply this transform_text function to the text column and store the result in a new column.
We will create word clouds and look at the most frequently occurring words in spam and non-spam messages.
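The notebook builds the images with the wordcloud library (`WordCloud().generate(...)`); since the underlying idea is just token frequency, a `collections.Counter` sketch over a hypothetical preprocessed spam corpus shows the same information in text form.

```python
from collections import Counter

# Hypothetical preprocessed spam tokens; in the notebook this comes from the
# transformed text column filtered to target == 1, then fed to WordCloud
spam_corpus = 'free prize win free call claim free prize'.split()

# Most frequent spam words -- the same ranking a word cloud visualizes by size
print(Counter(spam_corpus).most_common(3))
```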
Now comes the final and most important part: MODEL BUILDING.
Today we will implement the Naive Bayes algorithm, but before that we need to convert our text into numbers so that we can train sklearn models on it.
For that we will use the TF-IDF vectorizer.
TF-IDF is an information retrieval technique that weighs a term’s frequency (TF) and its inverse document frequency (IDF). Each word or term that occurs in the text has its respective TF and IDF score.
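In code, scikit-learn's `TfidfVectorizer` handles both scores at once. The three-document corpus below is hypothetical; the notebook fits the vectorizer on the full preprocessed text column instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny hypothetical corpus standing in for the preprocessed SMS texts
corpus = [
    'free prize claim now',
    'see you at home',
    'free entry win prize',
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus).toarray()
print(X.shape)                     # (documents, vocabulary size)
print(sorted(tfidf.vocabulary_))   # the learned vocabulary
```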
We saw earlier that the data is imbalanced, so we give priority to precision_score rather than accuracy_score. MultinomialNB is selected as the model since it has the highest precision score.
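A minimal sketch of that final step is below, on a hypothetical toy split. The notebook compares GaussianNB, MultinomialNB, and BernoulliNB on a real train/test split of the vectorized data before settling on MultinomialNB; only the winning model is shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical toy split; the notebook uses train_test_split on the full data
train_texts = ['free prize claim now', 'win free entry now',
               'see you at home', 'call me later']
train_y = [1, 1, 0, 0]
test_texts = ['claim your free prize', 'see you later']
test_y = [1, 0]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_texts)
X_test = tfidf.transform(test_texts)

mnb = MultinomialNB()
mnb.fit(X_train, train_y)
pred = mnb.predict(X_test)
print('accuracy :', accuracy_score(test_y, pred))
print('precision:', precision_score(test_y, pred))
```

With the imbalanced real data, precision matters most: it tells us how many of the messages flagged as spam actually were spam.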
DEMO OF THE WEBSITE HOSTED ON HEROKU
I hope you learned something new and enjoyed this blog. If you liked it, share it with your friends. Take care, and keep learning.
You can also reach me on LinkedIn: https://www.linkedin.com/in/harsh-mishra-4b79031b3/