IPL Win Prediction ML Project: A Classification Solution to a Regression Problem
I know the title of this blog sounds a bit absurd, but that is really how it works. Today I will build an end-to-end ML project that predicts the win probability, as a percentage, of both teams playing a match.
At first this looks like a regression problem, because the output is a number. We will use a classification model instead, since several classifiers can report the probability of each class along with the predicted class.
All the code in this blog is on my GitHub: https://github.com/HarshMishra2002/ipl-win-predictor
Link to the dataset used: https://www.kaggle.com/ramjidoolla/ipl-data-set
Let's start with the problem statement:
Nowadays a lot of websites show another segment beside the scorecard: the win probability of the teams playing the game. That is what we will build today.
We will make predictions only after the first innings has ended, that is, from the moment the chase begins. The features we need are the batting team (in the second innings), the bowling team, target, overs left, wickets left, current run rate, required run rate and result. So in the end we want a dataset with one column for each of these features.
Let's begin with the data we have initially. There are two datasets: matches.csv and deliveries.csv.
I will import the necessary libraries first and then load both datasets.
Now let's get an overview of both datasets using the head function and the shape attribute.
I also recommend that you look at the data yourself to get some more insight. We already know which columns we need, so let's start creating them one by one.
We need a target column, which the data does not have. What we do have is the total runs scored on each delivery, so we can sum those to get the first-innings total of every match. Adding one to that total gives the target.
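A minimal sketch of this step on a toy dataframe. The column names match_id, inning and total_runs are assumptions about deliveries.csv, so check them against the real file.

```python
import pandas as pd

# Toy stand-in for deliveries.csv; the column names are assumed.
delivery = pd.DataFrame({
    "match_id":   [1, 1, 1, 1, 2, 2, 2],
    "inning":     [1, 1, 2, 2, 1, 1, 2],
    "total_runs": [4, 2, 1, 6, 0, 3, 1],
})

# Sum the runs per match and innings, then keep only the first innings.
total_score_df = (
    delivery.groupby(["match_id", "inning"])["total_runs"].sum().reset_index()
)
total_score_df = total_score_df[total_score_df["inning"] == 1].copy()

# The chasing side needs one run more than the first-innings total.
total_score_df["target"] = total_score_df["total_runs"] + 1
print(total_score_df)
```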
Now we will merge this ‘total_score_df’ dataframe with the ‘match’ dataframe using the merge function, so that every match row carries its target.
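The merge could look roughly like this. The key columns (id in the match data, match_id in the score data) are assumed names, and the rows are made up.

```python
import pandas as pd

# Toy stand-ins; "id" and "match_id" are assumed key column names.
match = pd.DataFrame({
    "id": [1, 2],
    "city": ["Mumbai", "Delhi"],
    "winner": ["Team A", "Team B"],
})
total_score_df = pd.DataFrame({"match_id": [1, 2], "target": [7, 4]})

# Attach the target to each match by joining on the match identifier.
match_df = match.merge(
    total_score_df[["match_id", "target"]],
    left_on="id",
    right_on="match_id",
)
print(match_df)
```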
The data covers many past IPL editions, so we need to clean it up. For example, some teams have changed their names, some teams were added later and some were withdrawn from the tournament. Let's make all the necessary changes.
We also won't consider rain-affected matches, that is, matches where the Duckworth-Lewis method was applied.
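One way to handle renamed and defunct teams is a rename mapping followed by a filter. The mapping and the list of teams below are illustrative only; the team1 and team2 column names are assumptions.

```python
import pandas as pd

# Toy match data; "team1" and "team2" are assumed column names.
match_df = pd.DataFrame({
    "team1": ["Delhi Daredevils", "Mumbai Indians", "Kochi Tuskers Kerala"],
    "team2": ["Chennai Super Kings", "Delhi Capitals", "Mumbai Indians"],
})

# Illustrative mapping of an old team name to its current name.
renames = {"Delhi Daredevils": "Delhi Capitals"}
# Illustrative (incomplete) list of teams we want to keep.
current_teams = ["Delhi Capitals", "Mumbai Indians", "Chennai Super Kings"]

# Unify the names in both team columns.
for col in ["team1", "team2"]:
    match_df[col] = match_df[col].replace(renames)

# Keep only the matches where both sides are in the list.
keep = match_df["team1"].isin(current_teams) & match_df["team2"].isin(current_teams)
match_df = match_df[keep]
print(match_df)
```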
Now we will keep only the important columns, drop the rest, and see what our match dataset looks like.
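The Duckworth-Lewis filter and the column selection can be sketched together as below. The dl_applied flag and the other column names are assumed, and which columns count as important is a judgment call.

```python
import pandas as pd

# Toy match data; "dl_applied" (1 = Duckworth-Lewis used) is an assumed column.
match_df = pd.DataFrame({
    "match_id": [1, 2, 3],
    "city": ["Mumbai", "Delhi", "Chennai"],
    "winner": ["Team A", "Team B", "Team C"],
    "target": [181, 145, 162],
    "dl_applied": [0, 1, 0],
    "venue": ["Stadium X", "Stadium Y", "Stadium Z"],
})

# Drop the rain-affected matches.
match_df = match_df[match_df["dl_applied"] == 0]

# Keep only the columns needed later (an illustrative selection).
match_df = match_df[["match_id", "city", "winner", "target"]]
print(match_df)
```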
Now let's merge this ‘match_df’ dataset with the ‘delivery’ dataset. This gives us a single large dataset on which the rest of the feature extraction will take place.
As I mentioned earlier, we start predicting once the second innings begins, so we only need the second-innings data. From it we estimate the probability that the batting team chases the target and the probability that the bowling team defends it.
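A small sketch of the merge and the second-innings filter, again on made-up rows with assumed column names.

```python
import pandas as pd

# Toy stand-ins for the cleaned match data and the ball-by-ball data.
match_df = pd.DataFrame({
    "match_id": [1, 2],
    "city": ["Mumbai", "Delhi"],
    "winner": ["Team A", "Team B"],
    "target": [7, 4],
})
delivery = pd.DataFrame({
    "match_id":   [1, 1, 1, 2, 2],
    "inning":     [1, 2, 2, 1, 2],
    "total_runs": [4, 1, 6, 3, 1],
})

# One row per delivery, each carrying its match-level columns.
delivery_df = match_df.merge(delivery, on="match_id")

# Keep only the chase (second innings).
delivery_df = delivery_df[delivery_df["inning"] == 2].copy()
print(delivery_df)
```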
Remember that the dataset I want to build has a column named runs_left. To get it, we first need the current score after every ball, which we then subtract from the target.
Similarly, we need to find the total balls left after every ball.
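The current score is a running total within each match, so a grouped cumulative sum fits. A sketch with assumed column names:

```python
import pandas as pd

# Toy second-innings data: two chases with targets of 7 and 4.
delivery_df = pd.DataFrame({
    "match_id":   [1, 1, 1, 2, 2],
    "target":     [7, 7, 7, 4, 4],
    "total_runs": [1, 6, 0, 2, 1],
})

# Running score of the chasing team, restarted for every match.
delivery_df["current_score"] = (
    delivery_df.groupby("match_id")["total_runs"].cumsum()
)

# Runs still needed after each ball.
delivery_df["runs_left"] = delivery_df["target"] - delivery_df["current_score"]
print(delivery_df)
```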
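One possible formula for balls left is sketched below. It assumes a 120-ball innings, an over column numbered from 1 and a ball column counting from 1 within the over. Wides and no-balls can push the ball number past 6 in real data, so treat this as a simplification.

```python
import pandas as pd

# Toy data; "over" (starting at 1) and "ball" (1..6) are assumed columns.
delivery_df = pd.DataFrame({
    "over": [1, 1, 20],
    "ball": [1, 6, 6],
})

# Balls bowled so far = completed overs * 6 + balls in the current over.
balls_bowled = (delivery_df["over"] - 1) * 6 + delivery_df["ball"]

# A T20 innings has 20 overs of 6 balls, i.e. 120 legal deliveries.
delivery_df["balls_left"] = 120 - balls_bowled
print(delivery_df)
```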
To get the wickets-in-hand column we first need to preprocess the player_dismissed column. Let's look at that column first.
As we can see, if no player was dismissed on a particular ball the column shows NaN, and if a batsman was out the column holds his name. I will replace every NaN with 0 and every name with 1, take the cumulative sum to get the wickets fallen so far, and subtract that from 10 to get the wickets in hand.
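Those three steps can be sketched like this, with placeholder batter names:

```python
import numpy as np
import pandas as pd

# Toy data: NaN means no wicket fell on that ball.
delivery_df = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2],
    "player_dismissed": [np.nan, "Batter A", np.nan, "Batter B", np.nan, np.nan],
})

# 1 where a name is present, 0 where the value is NaN.
delivery_df["is_wicket"] = delivery_df["player_dismissed"].notna().astype(int)

# Wickets fallen so far in each chase.
delivery_df["wickets_fallen"] = (
    delivery_df.groupby("match_id")["is_wicket"].cumsum()
)

# A side starts with 10 wickets in hand.
delivery_df["wickets_left"] = 10 - delivery_df["wickets_fallen"]
print(delivery_df)
```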
Now we will calculate the current run rate (runs scored per over so far) and the required run rate (runs still needed per over remaining).
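Both rates are runs per over, so each is a run count times 6 divided by a ball count. A sketch, assuming a 120-ball innings and the hypothetical column names crr and rrr:

```python
import pandas as pd

# Toy data: two moments in chases of a 180-run target.
delivery_df = pd.DataFrame({
    "current_score": [30, 120],
    "runs_left":     [150, 60],
    "balls_left":    [90, 30],
})

# Balls already bowled in a 120-ball innings.
balls_bowled = 120 - delivery_df["balls_left"]

# Current run rate: runs scored per over so far.
delivery_df["crr"] = delivery_df["current_score"] * 6 / balls_bowled

# Required run rate: runs still needed per remaining over.
# Rows with balls_left == 0 would divide by zero and need separate handling.
delivery_df["rrr"] = delivery_df["runs_left"] * 6 / delivery_df["balls_left"]
print(delivery_df)
```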
Now we come to the final column, ‘result’. The logic is simple: if the batting team (of the second innings) successfully chased the target we put 1 in the result column, otherwise we put 0, which means the bowling team won.
Our final dataset is now ready, so let's see how it looks.
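With a winner column available, the label is a simple comparison. The column names here are assumed:

```python
import pandas as pd

# Toy data: the chasing team and the eventual match winner.
delivery_df = pd.DataFrame({
    "batting_team": ["Team A", "Team A", "Team B"],
    "winner":       ["Team A", "Team C", "Team B"],
})

# 1 if the chasing team won the match, 0 if the bowling team won.
delivery_df["result"] = (
    delivery_df["batting_team"] == delivery_df["winner"]
).astype(int)
print(delivery_df)
```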
The preprocessing and feature extraction are now complete, so it is time to train our model. As I said at the start, we want the win probabilities of both teams, and a few classification models can give us that. Today I will use LogisticRegression. RandomForestClassifier could also be used, so I want you to try it yourself and compare the results.
Here the target is the result column, so that becomes y, and all the remaining columns go into X.
X has three categorical columns. To deal with them we use OneHotEncoder and put it in a pipeline together with logistic regression, then train that pipeline on our data.
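A minimal sketch of such a pipeline on synthetic data. The three categorical columns are assumed to be batting_team, bowling_team and city, the labels are generated by a made-up rule, and the solver choice is just one reasonable option.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data that only mimics the shape of the real feature table.
rng = np.random.default_rng(42)
teams = ["Mumbai Indians", "Chennai Super Kings", "Royal Challengers Bangalore"]
cities = ["Mumbai", "Chennai", "Bangalore"]
n = 300
X = pd.DataFrame({
    "batting_team": rng.choice(teams, n),
    "bowling_team": rng.choice(teams, n),
    "city": rng.choice(cities, n),
    "runs_left": rng.integers(1, 200, n),
    "balls_left": rng.integers(1, 120, n),
    "wickets_left": rng.integers(1, 11, n),
})
# Made-up label: the chase succeeds when under 9 runs per over are needed.
y = (X["runs_left"] < 1.5 * X["balls_left"]).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# One-hot encode the categorical columns, pass the numeric ones through.
cat_cols = ["batting_team", "bowling_team", "city"]
encoder = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="passthrough",
)

pipe = Pipeline([
    ("encode", encoder),
    ("model", LogisticRegression(solver="liblinear")),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
```

Keeping the encoder inside the pipeline means the same encoding is applied automatically at prediction time.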
You can see that the pipeline has trained successfully on our data. Now we will make predictions and check the accuracy score.
The accuracy comes out at around 80%. You can definitely improve on that, but let's move on to the most important thing: the win probability of each team in a given match situation. For that we use the predict_proba function.
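A sketch of how accuracy_score and predict_proba fit together. To stay short it uses numeric features only and synthetic labels, so the numbers it prints say nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic numeric features; the real table also has categorical columns.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "runs_left": rng.integers(1, 200, n),
    "balls_left": rng.integers(1, 120, n),
    "wickets_left": rng.integers(1, 11, n),
})
# Made-up label: the chase succeeds when under 9 runs per over are needed.
y = (X["runs_left"] < 1.5 * X["balls_left"]).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Share of test rows where the predicted class matches the label.
acc = accuracy_score(y_test, model.predict(X_test))
print("accuracy:", acc)

# One hypothetical match situation: 40 needed off 30 balls, 6 wickets left.
scenario = pd.DataFrame(
    {"runs_left": [40], "balls_left": [30], "wickets_left": [6]}
)

# Columns follow model.classes_: index 0 is class 0 (bowling team wins),
# index 1 is class 1 (batting team wins). Each row sums to 1.
proba = model.predict_proba(scenario)
print("bowling team win probability:", proba[0][0])
print("batting team win probability:", proba[0][1])
```

Multiplying the two values by 100 gives the percentages shown to the user.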
This shows a 98% chance that Mumbai Indians win this game and roughly a 2% chance that RCB win, which makes sense in the given scenario.
I hope you learned something new and enjoyed this blog. If you liked it, then share it with your friends. Take care and keep learning.
You can also reach me through my LinkedIn account: https://www.linkedin.com/in/harsh-mishra-4b79031b3/