ML Mini Project: Car Price Predictor

Harsh Mishra
5 min read · Oct 31, 2021


Today we will make a machine learning mini project in which we build a model and train it to predict the price of a car. The model will take the car name, brand, km driven, and fuel type as input features to predict the output.

All the code written in this blog is on my GitHub: https://github.com/HarshMishra2002/car-price-prediction

Link for the dataset used: https://github.com/HarshMishra2002/car-price-prediction/blob/main/quikr_car.csv

Let's start the project. The complete project is written in Python, and Jupyter Notebook is used as the IDE.

First, let's import some important Python libraries.
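The original import cell is not reproduced here, so this is only a guess at the minimum needed for the steps in this post: pandas for the table and NumPy for the numeric helpers.

```python
# Assumed minimal imports for this walkthrough (not necessarily the author's exact list).
import numpy as np
import pandas as pd

print(pd.__version__, np.__version__)
```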

Now we will import the data and see how it looks by calling the head function.
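A minimal sketch of this step. So that it runs without the file, it reads a tiny hand-written stand-in (the three rows are made up for illustration); with the real dataset downloaded you would call pd.read_csv("quikr_car.csv") instead. The variable name car is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for a few rows of quikr_car.csv so the sketch runs offline.
# With the real file: car = pd.read_csv("quikr_car.csv")
sample_csv = io.StringIO('''name,company,year,Price,kms_driven,fuel_type
Hyundai Santro Xing,Hyundai,2007,"80,000","45,000 kms",Petrol
Mahindra Jeep CL550,Mahindra,2006,"4,25,000",40 kms,Diesel
Maruti Suzuki Alto,Maruti,2018,Ask For Price,"22,000 kms",Petrol
''')

car = pd.read_csv(sample_csv)

# head() shows the first rows (up to five) of the table.
print(car.head())
```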

Now let’s get a basic idea of the data.
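One way to get that overview is with shape, info() and a count of missing values. The small table below is a made-up sample that imitates the raw data, not the real file.

```python
import pandas as pd

# Hypothetical raw sample imitating the kinds of values found in the dataset.
car = pd.DataFrame({
    "name": ["Hyundai Santro Xing", "Used Commercial Maruti", "Maruti Suzuki Alto"],
    "company": ["Hyundai", "Used", "Maruti"],
    "year": ["2007", "sale", "2018"],
    "Price": ["80,000", "Ask For Price", "3,50,000"],
    "kms_driven": ["45,000 kms", "Petrol", "22,000 kms"],
    "fuel_type": ["Petrol", None, "Diesel"],
})

print(car.shape)          # (rows, columns)
car.info()                # column names, non-null counts and dtypes
print(car.isna().sum())   # missing values per column
```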

Now we will start the data cleaning. For that, let's see what kind of data each column holds and which columns need cleaning.

First let’s look at the year column. We also saw that year and some other columns, like Price and kms_driven, which should be of integer data type, are actually of object data type, so that is something we need to take care of.
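A quick way to inspect a column is to list its unique values. This sketch uses a made-up sample of raw year values; the non-year strings are illustrative.

```python
import pandas as pd

# Hypothetical sample of raw 'year' values; the real column comes from quikr_car.csv.
car = pd.DataFrame({"year": ["2007", "2014", "sale", "150k", "2018", "/-Rs"]})

print(car["year"].unique())

# Values that are not made purely of digits will need cleaning later.
bad_years = car.loc[~car["year"].str.isnumeric(), "year"].tolist()
print(bad_years)
```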

The year column contains a lot of gibberish values. We will take care of them later; first let’s see what cleaning the other columns need as well.

In the kms_driven column we have to remove the "kms" suffix and the comma: 46,000 kms -> 46000

So, to summarize, these are the issues we have to fix during data cleaning:

  • names are pretty inconsistent
  • names have company names attached to them
  • some names are spam, like ‘Maruti Ertiga showroom condition with’ and ‘Well mentained Tata Sumo’
  • company: many of the values are not company names at all, like ‘Used’, ‘URJENT’, and so on
  • year has many non-year values
  • year is of object type; change it to integer
  • Price has ‘Ask for Price’ entries
  • Price has commas in its values and is of object type
  • kms_driven has object values with ‘kms’ at the end
  • kms_driven has NaN values, and two rows have ‘Petrol’ in them
  • fuel_type has NaN values

First we will start with the year column.

We use the isnumeric function to keep only the rows that have numeric values, and then convert the column to the integer data type.
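A minimal sketch of the year cleanup on a made-up sample, using the pandas string method str.isnumeric as the filter:

```python
import pandas as pd

# Hypothetical sample of raw 'year' values.
car = pd.DataFrame({"year": ["2007", "2014", "sale", "150k", "2018"]})

# Keep only the rows whose year consists of digits, then convert to integers.
car = car[car["year"].str.isnumeric()].copy()
car["year"] = car["year"].astype(int)

print(car["year"].tolist())
```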

Now coming to the Price column: we first remove the ‘Ask for Price’ rows, then replace the commas with nothing to remove them, and finally convert the column to the integer data type.
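Here is one way this could look on a made-up sample. The exact capitalization of the placeholder text in the file is an assumption, so the comparison below ignores case.

```python
import pandas as pd

# Hypothetical sample of raw 'Price' values.
car = pd.DataFrame({"Price": ["80,000", "Ask For Price", "4,25,000", "3,50,000"]})

# Drop the placeholder rows (case-insensitive, since the exact spelling is assumed).
car = car[car["Price"].str.lower() != "ask for price"].copy()

# Remove the commas and convert to integers.
car["Price"] = car["Price"].str.replace(",", "", regex=False).astype(int)

print(car["Price"].tolist())
```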

Now coming to the kms_driven column: we will use the split function to get rid of the ‘kms’ at the end, the replace function to get rid of the ‘,’, and finally the isnumeric function to remove any non-numeric values.
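A sketch of those three steps on a made-up sample. The explicit dropna call is my addition: filtering with isnumeric fails if missing values are still present, so they are removed first here.

```python
import pandas as pd

# Hypothetical sample of raw 'kms_driven' values, including a NaN and a stray 'Petrol'.
car = pd.DataFrame({"kms_driven": ["45,000 kms", "40 kms", "Petrol", None, "22,000 kms"]})

# Assumption: drop missing values first so the boolean filter below has no NaN in it.
car = car.dropna(subset=["kms_driven"]).copy()

# Keep the part before the space ("45,000"), then strip the comma ("45000").
car["kms_driven"] = (
    car["kms_driven"].str.split(" ").str.get(0).str.replace(",", "", regex=False)
)

# Drop anything that is still not a number (for example 'Petrol') and convert.
car = car[car["kms_driven"].str.isnumeric()].copy()
car["kms_driven"] = car["kms_driven"].astype(int)

print(car["kms_driven"].tolist())
```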

Now, in the name column we need only the first three words, so we split each name on spaces, keep the first three pieces, and join them back together.
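Keeping the first three words can be done with the pandas string methods split, slice and join. The names below are a made-up sample.

```python
import pandas as pd

# Hypothetical sample of raw 'name' values.
car = pd.DataFrame({
    "name": [
        "Hyundai Santro Xing XO eRLX Euro III",
        "Maruti Suzuki Alto 800 Vxi",
        "Ford Figo",
    ]
})

# Split on spaces, keep at most the first three words, and join them again.
car["name"] = car["name"].str.split(" ").str.slice(0, 3).str.join(" ")

print(car["name"].tolist())
```

Names with fewer than three words are left unchanged.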

So finally we have a clean data set on which we can build our model. Let's make two variables, X and y: X holds the input features and y is the output column.
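A small sketch of the split into features and target. It assumes the target column is Price and that every other column is used as a feature; the rows are made up.

```python
import pandas as pd

# Hypothetical cleaned sample.
car = pd.DataFrame({
    "name": ["Hyundai Santro Xing", "Maruti Suzuki Alto", "Honda City 1.5"],
    "company": ["Hyundai", "Maruti", "Honda"],
    "year": [2007, 2018, 2014],
    "Price": [80000, 350000, 520000],
    "kms_driven": [45000, 22000, 30000],
    "fuel_type": ["Petrol", "Petrol", "Diesel"],
})

X = car.drop(columns="Price")  # input features: everything except the target
y = car["Price"]               # output column

print(X.columns.tolist())
```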

Now we will apply a train-test split and import the other libraries we need. Today we will be using the LinearRegression model, and for a regression problem we evaluate the model with the r2 score.
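A sketch of the imports and the split using scikit-learn. The test_size of 0.2 is an assumed value, and the data is a made-up ten-row sample.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical cleaned sample with ten rows.
X = pd.DataFrame({
    "year": [2008, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018],
    "kms_driven": [90000, 80000, 70000, 60000, 50000, 40000, 30000, 20000, 15000, 10000],
})
y = pd.Series([100000, 150000, 180000, 200000, 240000, 300000, 350000, 400000, 450000, 500000])

# Assumed test_size; no random_state is set, so each run gives a different split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(len(X_train), len(X_test))
```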

The most common interpretation of r-squared is how well the regression model fits the observed data. For example, an r-squared of 60% means that the model explains 60% of the variance in the observed target values.
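To make that concrete, r-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The numbers below are made up purely for illustration, and the result is compared with scikit-learn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation: 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean: 20.0
r2_manual = 1 - ss_res / ss_tot                 # 1 - 0.75 / 20 = 0.9625

print(r2_manual, r2_score(y_true, y_pred))
```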

We also need to build a pipeline that first preprocesses the data and then applies the model to it. For preprocessing we simply encode all the categorical columns using OneHotEncoder.

After fitting the OHE on the data, we create the pipeline as discussed before. To apply OHE to our data we will use scikit-learn's column transformer.
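A sketch of how the encoder, column transformer and pipeline could be wired together with scikit-learn. Which columns are treated as categorical, the names ohe, column_trans and pipe, and the sample rows are all assumptions. Fitting the encoder on the full feature table first means every category is known even if it later lands only in the test split.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical cleaned features.
X = pd.DataFrame({
    "name": ["Maruti Suzuki Swift", "Honda City 1.5", "Maruti Suzuki Swift", "Honda City 1.5"],
    "company": ["Maruti", "Honda", "Maruti", "Honda"],
    "year": [2012, 2014, 2016, 2018],
    "kms_driven": [60000, 40000, 30000, 10000],
    "fuel_type": ["Petrol", "Diesel", "Petrol", "Petrol"],
})
cat_cols = ["name", "company", "fuel_type"]  # assumed categorical columns

# Fit once on all rows to collect the full list of categories per column.
ohe = OneHotEncoder()
ohe.fit(X[cat_cols])

# Encode the categorical columns with those categories; pass the numeric ones through.
column_trans = make_column_transformer(
    (OneHotEncoder(categories=ohe.categories_), cat_cols),
    remainder="passthrough",
)

# Preprocessing followed by the model, in a single estimator.
pipe = make_pipeline(column_trans, LinearRegression())
print(pipe)
```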

Now our pipeline is ready and so is our model. Let's calculate its r2 score.
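Putting the pieces together, a self-contained sketch of fitting the pipeline and scoring it could look like this. The twelve rows are made up, so the printed score says nothing about the real dataset, and the test_size of 0.25 is an assumed value.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical cleaned data set (twelve rows).
car = pd.DataFrame({
    "name": ["Maruti Suzuki Swift", "Hyundai Santro Xing", "Honda City 1.5"] * 4,
    "company": ["Maruti", "Hyundai", "Honda"] * 4,
    "year": [2010, 2012, 2014, 2016, 2011, 2013, 2015, 2017, 2009, 2012, 2016, 2018],
    "kms_driven": [60000, 45000, 30000, 20000, 55000, 40000,
                   28000, 15000, 70000, 48000, 25000, 10000],
    "fuel_type": ["Petrol", "Petrol", "Diesel"] * 4,
    "Price": [210000, 180000, 520000, 380000, 160000, 450000,
              350000, 300000, 400000, 260000, 230000, 700000],
})
X = car.drop(columns="Price")
y = car["Price"]
cat_cols = ["name", "company", "fuel_type"]

ohe = OneHotEncoder().fit(X[cat_cols])
column_trans = make_column_transformer(
    (OneHotEncoder(categories=ohe.categories_), cat_cols),
    remainder="passthrough",
)
pipe = make_pipeline(column_trans, LinearRegression())

# No random_state here, so the split (and the score) changes from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

score = r2_score(y_test, y_pred)
print(score)
```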

The r2 score is very low. In the train-test split we did not set an important input, ‘random_state’, which is why we get a different r2 score every time we run the model. To see this, we run the model in a loop of 1000 iterations, using the loop index as the random_state. Our goal is to get the max r2 score.

So we store all the r2 scores in a list and find the index of the highest score; that index is the random_state that gives the max r2.
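A sketch of that search on made-up data. To keep it quick it tries only 50 values of random_state rather than 1000; the sample rows, the test_size of 0.25 and all variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical cleaned data set (twelve rows).
car = pd.DataFrame({
    "company": ["Maruti", "Hyundai", "Honda"] * 4,
    "year": [2010, 2012, 2014, 2016, 2011, 2013, 2015, 2017, 2009, 2012, 2016, 2018],
    "kms_driven": [60000, 45000, 30000, 20000, 55000, 40000,
                   28000, 15000, 70000, 48000, 25000, 10000],
    "fuel_type": ["Petrol", "Petrol", "Diesel"] * 4,
    "Price": [210000, 180000, 520000, 380000, 160000, 450000,
              350000, 300000, 400000, 260000, 230000, 700000],
})
X = car.drop(columns="Price")
y = car["Price"]
cat_cols = ["company", "fuel_type"]

ohe = OneHotEncoder().fit(X[cat_cols])
column_trans = make_column_transformer(
    (OneHotEncoder(categories=ohe.categories_), cat_cols),
    remainder="passthrough",
)

scores = []
for i in range(50):  # the post uses 1000; 50 keeps this sketch fast
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=i
    )
    pipe = make_pipeline(column_trans, LinearRegression())
    pipe.fit(X_train, y_train)
    scores.append(r2_score(y_test, pipe.predict(X_test)))

# The list index equals the random_state, so argmax gives the best-scoring split.
best_state = int(np.argmax(scores))
print(best_state, scores[best_state])
```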

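So we reached an r2 score of 0.889, which is pretty good compared to what we got at first. Keep in mind that this is the score on the most favorable of the 1000 splits, so it is an optimistic estimate of how the model will do on unseen data.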

I hope you learned something new and enjoyed this blog. If you liked it, then share it with your friends. Take care. Keep learning.

You can also reach me through my LinkedIn account: https://www.linkedin.com/in/harsh-mishra-4b79031b3/


Written by Harsh Mishra

Data science / ML enthusiast | Front-end developer | CS engineering student
