IPL Win Prediction ML Project: A Classification Solution to a Regression Problem

Harsh Mishra
Oct 16, 2021


I know the title of this blog sounds a bit absurd, but that is exactly what this project is. Today I will build an end-to-end ML project that predicts the winning probability, as a percentage, of both teams playing a match.

At first this looks like a regression problem, but we will actually use classification models, because some ML classification models also report the probability of the class they predict.

All the code written in this blog is on my GitHub: https://github.com/HarshMishra2002/ipl-win-predictor

Link to the dataset used: https://www.kaggle.com/ramjidoolla/ipl-data-set

Let's start with the problem statement:

Nowadays many websites have started to show another segment beside the scorecard: the win probability of the teams playing the game. That's what we will be building today.

This is an image from the app ‘ESPNcricinfo’.

We will make predictions only after the end of the first innings, when the chase begins. The features we are looking for are the batting team (in the second innings), the bowling team, target, overs left, wickets left, current run rate, required run rate and result. So in the end we need our data to look something like this:

Let's begin with the datasets we have initially. There are two: the first is matches.csv and the second is deliveries.csv.

I will import some necessary libraries first and then the datasets.

Now let's look at the datasets and get an overview of both using the head function and the shape attribute.
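Here is a minimal sketch of the loading and overview steps. To keep it runnable without the Kaggle files, it reads two tiny in-memory CSV strings in place of matches.csv and deliveries.csv; the column names and rows are illustrative assumptions, not the full schema of the real files.

```python
import io

import pandas as pd

# In-memory stand-ins for matches.csv and deliveries.csv (assumed columns).
matches_csv = io.StringIO(
    "id,city,team1,team2,winner\n"
    "1,Mumbai,Mumbai Indians,Chennai Super Kings,Mumbai Indians\n"
    "2,Delhi,Delhi Daredevils,Mumbai Indians,Mumbai Indians\n"
)
deliveries_csv = io.StringIO(
    "match_id,inning,over,ball,total_runs\n"
    "1,1,1,1,4\n"
    "1,1,1,2,0\n"
    "1,2,1,1,6\n"
)

# With the real files you would pass the file paths instead.
match = pd.read_csv(matches_csv)
delivery = pd.read_csv(deliveries_csv)

print(match.head())    # first rows of the match-level data
print(match.shape)     # (rows, columns); shape is an attribute, not a call
print(delivery.shape)
```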

I also recommend that you go through the data yourself to get more insights. For now we know which columns we need, so let's start creating them one by one.

We need a target column, which is not present. What we do have is a column with the total runs scored on each delivery, so we can sum it to get the total runs at the end of the first innings; adding one to that gives us the target.

The total_score_df dataframe gives the total runs scored in the first innings of each match.
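The aggregation can be sketched as a groupby over match and innings. The toy rows and the column names (match_id, inning, total_runs) are assumptions for illustration.

```python
import pandas as pd

# Toy ball-by-ball data: two matches, runs scored on each delivery.
delivery = pd.DataFrame({
    "match_id":   [1, 1, 1, 1, 2, 2, 2],
    "inning":     [1, 1, 2, 2, 1, 1, 2],
    "total_runs": [4, 6, 1, 2, 0, 3, 6],
})

# Sum runs per (match, innings), then keep only the first innings.
total_score_df = (
    delivery.groupby(["match_id", "inning"])["total_runs"].sum().reset_index()
)
total_score_df = total_score_df[total_score_df["inning"] == 1].copy()

# The chasing side needs one run more than the first-innings total.
total_score_df["target"] = total_score_df["total_runs"] + 1
print(total_score_df)
```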

Now we will merge this ‘total_score_df’ dataframe with the ‘match’ dataframe using the merge function.
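A small sketch of that merge on toy frames. It assumes the match table identifies a match by `id` while the totals use `match_id`; check the real column names before copying this.

```python
import pandas as pd

match = pd.DataFrame({
    "id": [1, 2],
    "city": ["Mumbai", "Delhi"],
    "winner": ["Mumbai Indians", "Delhi Capitals"],
})
total_score_df = pd.DataFrame({"match_id": [1, 2], "total_runs": [165, 140]})

# Inner join: each match row gains its first-innings total.
match_df = match.merge(
    total_score_df[["match_id", "total_runs"]],
    left_on="id",
    right_on="match_id",
)
print(match_df)
```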

Since the data covers many past IPL editions, we need to clean it up. For example, there have been instances where a team changed its name, a new team was added, or a team withdrew from the tournament. So let's make all the necessary changes.
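One way to sketch this cleanup is a rename map followed by a filter on the teams you want to keep. The mapping and team list below are illustrative assumptions: Delhi Daredevils was renamed Delhi Capitals, while treating Deccan Chargers as Sunrisers Hyderabad is a modelling choice rather than a literal rename.

```python
import pandas as pd

match_df = pd.DataFrame({
    "team1": ["Delhi Daredevils", "Deccan Chargers", "Mumbai Indians", "Pune Warriors"],
    "team2": ["Mumbai Indians", "Chennai Super Kings", "Delhi Daredevils", "Mumbai Indians"],
})

# Assumed mapping from old names to current ones.
rename_map = {
    "Delhi Daredevils": "Delhi Capitals",
    "Deccan Chargers": "Sunrisers Hyderabad",
}
# Assumed (shortened) list of teams to keep.
current_teams = [
    "Delhi Capitals", "Sunrisers Hyderabad",
    "Mumbai Indians", "Chennai Super Kings",
]

for col in ["team1", "team2"]:
    match_df[col] = match_df[col].replace(rename_map)

# Drop matches that involve a team outside the list (e.g. a withdrawn side).
keep = match_df["team1"].isin(current_teams) & match_df["team2"].isin(current_teams)
match_df = match_df[keep]
print(match_df)
```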

We also won't consider rain-affected matches, that is, matches where the Duckworth-Lewis method was applied.

Now we will extract the important features, drop the rest, and see how our match dataset looks.
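The last two steps might look like the sketch below. It assumes a 0/1 flag column named `dl_applied` marks Duckworth-Lewis matches and that four columns are enough to carry forward; both are assumptions about the data, and `umpire1` just stands in for the columns that get dropped.

```python
import pandas as pd

match_df = pd.DataFrame({
    "match_id": [1, 2, 3],
    "city": ["Mumbai", "Delhi", "Chennai"],
    "winner": ["Mumbai Indians", "Delhi Capitals", "Chennai Super Kings"],
    "total_runs": [165, 140, 180],
    "dl_applied": [0, 1, 0],
    "umpire1": ["A", "B", "C"],
})

# Keep only matches that were not decided with Duckworth-Lewis.
match_df = match_df[match_df["dl_applied"] == 0]

# Keep only the columns needed later.
match_df = match_df[["match_id", "city", "winner", "total_runs"]]
print(match_df)
```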

Now let's merge this ‘match_df’ dataset with the ‘delivery’ dataset, which gives us a single large dataset on which the rest of the feature extraction takes place.

As I mentioned earlier, we start predicting once the second innings begins, so we only need the second-innings data to estimate the probability that the batting team chases the target or the bowling team defends it.
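A sketch of the merge and the second-innings filter on toy data. Both frames have a `total_runs` column here, so pandas appends its default suffixes: `total_runs_x` is the first-innings total from the match table and `total_runs_y` is the runs scored on each ball. The frame and column names are assumptions.

```python
import pandas as pd

match_df = pd.DataFrame({"match_id": [1], "city": ["Mumbai"], "total_runs": [10]})
delivery = pd.DataFrame({
    "match_id":   [1, 1, 1, 1],
    "inning":     [1, 1, 2, 2],
    "total_runs": [4, 6, 1, 2],
})

# Overlapping column names get the default suffixes _x (left) and _y (right).
delivery_df = match_df.merge(delivery, on="match_id")

# Only the chase matters for this model.
delivery_df = delivery_df[delivery_df["inning"] == 2]
print(delivery_df)
```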

The dataset I want to build also has a column named runs_left. To get it, we first need the current score, which we can then subtract from the target.

total_runs_y is the column that holds the runs scored on each ball, so applying a cumulative sum to it, match by match, gives the current score of the match.
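A minimal sketch of the running score and runs_left, assuming the rows are already in ball order and that a `target` column is available. Grouping by match keeps one match's score from leaking into the next.

```python
import pandas as pd

delivery_df = pd.DataFrame({
    "match_id":     [1, 1, 1, 2, 2],
    "target":       [11, 11, 11, 4, 4],
    "total_runs_y": [1, 4, 0, 2, 2],
})

# Running total of runs within each match.
delivery_df["current_score"] = (
    delivery_df.groupby("match_id")["total_runs_y"].cumsum()
)
delivery_df["runs_left"] = delivery_df["target"] - delivery_df["current_score"]
print(delivery_df)
```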

Similarly, we also need to find the total balls left.
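One possible formula for balls left, sketched under two assumptions: overs are numbered from 1 to 20 and balls from 1 to 6 within an over. Extra deliveries such as wides and no-balls are not handled here.

```python
import pandas as pd

delivery_df = pd.DataFrame({
    "over": [1, 1, 2, 20],
    "ball": [1, 6, 1, 6],
})

# Balls bowled so far in a 20-over (120-ball) innings.
balls_bowled = (delivery_df["over"] - 1) * 6 + delivery_df["ball"]
delivery_df["balls_left"] = 120 - balls_bowled
print(delivery_df)
```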

To get the wickets-in-hand column we first need to preprocess the player_dismissed column. Let's look at it first.

As we can see, if no player was dismissed on a particular ball the column shows NaN, and if a batsman was out, the batsman's name is given. I will first replace every NaN with 0 and every name with 1, then apply a cumulative sum to get the total wickets fallen, and finally subtract that from 10 to get the wickets in hand.
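The wicket count can be sketched as follows. Instead of two separate replacements, it uses `notna()` to get the same 0/1 flag in one step; the toy rows and the `wickets` column name are assumptions.

```python
import numpy as np
import pandas as pd

delivery_df = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2],
    "player_dismissed": [
        np.nan, "V Kohli", np.nan, "AB de Villiers", np.nan, "RG Sharma",
    ],
})

# 1 where a batsman was dismissed on that ball, 0 where the value is NaN.
delivery_df["is_wicket"] = delivery_df["player_dismissed"].notna().astype(int)

# Wickets fallen so far in each match, then wickets still in hand.
wickets_fallen = delivery_df.groupby("match_id")["is_wicket"].cumsum()
delivery_df["wickets"] = 10 - wickets_fallen
print(delivery_df)
```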

Now we will calculate the current run rate and the required run rate.
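Both rates are runs per over, so a sketch needs only the score, the runs left and the balls left. The column names `crr` and `rrr` are assumptions. Rows where no ball has been bowled or no ball is left would divide by zero and may need to be dropped.

```python
import pandas as pd

delivery_df = pd.DataFrame({
    "current_score": [50, 120],
    "runs_left":     [101, 31],
    "balls_left":    [84, 24],
})

balls_bowled = 120 - delivery_df["balls_left"]

# Current run rate: runs scored per over so far.
delivery_df["crr"] = delivery_df["current_score"] * 6 / balls_bowled
# Required run rate: runs needed per remaining over.
delivery_df["rrr"] = delivery_df["runs_left"] * 6 / delivery_df["balls_left"]
print(delivery_df)
```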

Now we come to the final column, ‘result’. The logic is: if the batting team (of the second innings) successfully chased the target we put 1 in the result column; otherwise we put 0, meaning the bowling team won.
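A sketch of the label, assuming the merged data has `batting_team` and `winner` columns. Ties and no-results are ignored here and would also be labelled 0.

```python
import pandas as pd

delivery_df = pd.DataFrame({
    "batting_team": ["Mumbai Indians", "Chennai Super Kings"],
    "winner":       ["Mumbai Indians", "Mumbai Indians"],
})

# 1 if the chasing side won the match, otherwise 0.
delivery_df["result"] = (
    delivery_df["batting_team"] == delivery_df["winner"]
).astype(int)
print(delivery_df)
```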

Our final dataset is now ready; let's see how it looks.

The preprocessing and feature extraction are now complete, so it's time to train our model. As I said at the start, we will predict the winning probabilities of both teams, and a few classification models can do that. Today I will use LogisticRegression. RandomForestClassifier could also be used, so I encourage you to try it yourself and compare the results.

Here our target is the result column, so that is y, and all the remaining columns go into X.
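Splitting features from the label, plus a train/test split, can be sketched like this. The toy `final_df`, the 80/20 split and the fixed `random_state` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

final_df = pd.DataFrame({
    "batting_team": ["Mumbai Indians", "Chennai Super Kings"] * 5,
    "bowling_team": ["Chennai Super Kings", "Mumbai Indians"] * 5,
    "runs_left":    [10, 120, 35, 80, 60, 15, 99, 42, 7, 150],
    "balls_left":   [60, 30, 24, 48, 36, 12, 60, 30, 6, 90],
    "result":       [1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
})

X = final_df.drop(columns="result")  # every column except the label
y = final_df["result"]               # 1 = chasing team won

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(X_train.shape, X_test.shape)
```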

X has three categorical columns, so we use OneHotEncoder to handle them and build a pipeline with logistic regression; we then fit that pipeline on the training data.

We have successfully trained the pipeline on our data. Now we will make predictions and look at the accuracy score.
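The encoder, pipeline, fit and accuracy check can be sketched end to end on synthetic data. Everything about the data is made up, including the rule that generates the label, and the categorical column names (`batting_team`, `bowling_team`, `city`) are assumptions; only the scikit-learn wiring is the point.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the final dataset.
rng = np.random.default_rng(0)
n = 200
teams = ["Mumbai Indians", "Chennai Super Kings", "Royal Challengers Bangalore"]
X = pd.DataFrame({
    "batting_team": rng.choice(teams, n),
    "bowling_team": rng.choice(teams, n),
    "city": rng.choice(["Mumbai", "Chennai", "Bangalore"], n),
    "runs_left": rng.integers(1, 200, n),
    "balls_left": rng.integers(1, 120, n),
    "wickets": rng.integers(1, 11, n),
})
# Made-up labelling rule so that there is something to learn.
y = (X["runs_left"] < 1.4 * X["balls_left"]).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# One-hot encode the categorical columns, pass numeric ones through.
categorical = ["batting_team", "bowling_team", "city"]
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)
pipe = Pipeline([
    ("encode", encoder),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

The accuracy printed for this synthetic data says nothing about the score on the real IPL data.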

The accuracy we get on the IPL data is around 80%. You can definitely improve it, but now let's look at the most important thing: the win probability of each team in a given match situation. For that we use the predict_proba function.

This shows that there is a 98% chance that Mumbai Indians win this game and roughly a 2% chance that RCB win, which makes sense in the given scenario.
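To show how predict_proba turns a classifier into a win-probability meter, here is a tiny sketch with a plain LogisticRegression on three numeric features. The training rows are invented and the numbers it prints are not the ones quoted above. With labels 0 and 1, the first output column is the probability of class 0 (bowling team wins) and the second is the probability of class 1 (batting team wins).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented rows: [runs_left, balls_left, wickets_in_hand]
X = np.array([
    [10, 60, 9], [20, 50, 8], [15, 30, 7], [30, 40, 6],   # easy chases
    [90, 20, 3], [120, 30, 2], [80, 12, 4], [150, 40, 1],  # hard chases
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = batting team won

model = LogisticRegression(max_iter=1000).fit(X, y)

# One match situation: 12 runs needed off 48 balls with 8 wickets in hand.
proba = model.predict_proba(np.array([[12, 48, 8]]))

# Columns follow model.classes_, which is [0, 1] here.
bowling_pct, batting_pct = (proba[0] * 100).round(1)
print(f"Bowling team: {bowling_pct}%  Batting team: {batting_pct}%")
```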

I hope you learned something new and enjoyed this blog. If you liked it, share it with your friends. Take care. Keep learning.

You can also reach me through my LinkedIn account: https://www.linkedin.com/in/harsh-mishra-4b79031b3/

Written by Harsh Mishra

Data science / ML enthusiast | Front-end developer | CS engineering student