When you start your journey towards data science or data analysis, one thing is certain: a major task in both roles is handling missing values, whether you work in Python, R or any other platform or language. It’s said that a data scientist or data analyst spends almost 75-80% of their time on data wrangling, sometimes referred to as data munging.
Now let’s see how you can handle missing values using Python in action.
Handling Missing Values Using Python
There are several ways to handle missing values in your dataset. However, the right choice depends largely on the nature of the data and of the missing values. Below are a few options for handling missing values.
- Drop missing values
- Dropping a complete row
- Dropping a complete column
- Filling missing values with a summary statistic
- Forward fill and Backward fill
Before moving forward, let’s first get comfortable with the dataset that we are going to use for this post: the PIMA Indian Diabetes Dataset, which you can download from here.
About PIMA Indian Diabetes Dataset
Importing necessary modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Loading the dataset
Now we will use the pandas read_csv() function to load our .csv file and save it as data, as shown below:
data = pd.read_csv("C:\\Users\\Pankaj\\Desktop\\PIMA\\diabetes.csv")
Checking what’s inside our dataset
Finding the shape of the dataset
To find the shape of the dataset, we can use the shape attribute of our DataFrame.
data.shape
Output- (768, 9)
Printing Top 5 rows
Now let’s print the top 5 rows of our dataset to see what’s inside. We can use the .head() method, which prints 5 rows by default; to print more or fewer rows, simply pass the number of rows as a parameter. For example, to print the top 10 rows, write data.head(10). That’s all.
Checking basic information like datatypes
Using the .info() method on a DataFrame, we can get basic information such as the features and their datatypes, the number of non-null values in each column, and the number of rows and columns, as shown below:
data.info()
Output-
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
Checking basic summary statistics like mean, median etc.
The .describe() method gives us basic statistical details about our dataset, such as the mean, the median (the 50% row), and the minimum and maximum values, as shown below:
Checking for missing values or NaN values
Generally, missing values can appear in many forms: as NaN, as 0, as ?, or even written out as “missing”. In our case, as we can see above, there are a lot of 0’s, so we will first convert 0 to NaN and then calculate how much data we are missing.
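For text markers such as ? or “missing”, one option is to convert them to NaN while loading the file. The sketch below assumes a tiny made-up CSV (the values and the toy name are hypothetical, not the PIMA file) and uses the na_values parameter of read_csv:

```python
import io
import pandas as pd

# Hypothetical CSV text; "?" and "missing" stand in for absent readings.
raw = io.StringIO(
    "Glucose,BMI\n"
    "148,33.6\n"
    "?,26.6\n"
    "183,missing\n"
)

# na_values lists extra strings that read_csv should parse as NaN.
toy = pd.read_csv(raw, na_values=["?", "missing"])

# Count the NaN values found in each column.
print(toy.isna().sum())
```

Zeros are a different case: a 0 can be a real value in some columns, so it is usually safer to replace it column by column after loading.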
There are a few ways to find out how much of our dataset is missing:
Using .info() method
When you call the .info() method on the DataFrame as shown above, any NaN or value marked as missing shows up as a lower non-null count. But in our case the missing values are in the form of 0, so .info() treats them as valid values and is not a useful way to identify them.
Using .describe() method
Similar to .info(), we can also use the .describe() method on the DataFrame, as shown above. From there we can clearly see that our features or predictor variables Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin and BMI have a minimum value of 0. Of these, Pregnancies records the number of pregnancies, so 0 is a legitimate value; leaving that out, the others can’t be 0. This means that our features contain missing values, but in another form (in this case 0), as mentioned above.
So let’s check how much data our dataset is missing:
(data.iloc[:, 1:6] == 0).sum()
Output-
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
dtype: int64
From the above we can clearly see that Glucose, BloodPressure and BMI contain far fewer 0’s than SkinThickness and Insulin, so we have to apply a different strategy to each group.
First, let’s convert the 0’s into NaN by using the .replace() method.
You can either do this manually, picking each feature one by one and replacing 0 with NaN, or write a for loop that converts 0 into NaN automatically and quickly, as shown below:
# Assigning the result back is more reliable than inplace=True on a selected column
data["Glucose"] = data["Glucose"].replace(0, np.nan)
Automatic and quick way:
# Columns 1 to 5 are Glucose through BMI
for col in data.columns[1:6]:
    data[col] = data[col].replace(0, np.nan)
Now let’s check that all of our 0’s have been converted into NaN:
(data.iloc[:, 1:6] == 0).sum()
Output-
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
dtype: int64
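Once the zeros are gone, it can also help to count the NaN values directly. Here is a minimal sketch on a small made-up frame (the values and the toy name are hypothetical stand-ins for a few PIMA columns):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few columns after replacing 0 with NaN.
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0],
    "BMI": [33.6, 26.6, 23.3, np.nan],
})

missing_count = toy.isna().sum()    # number of NaN values per column
missing_share = toy.isna().mean()   # fraction of NaN values per column

print(missing_count)
print(missing_share)
```

The share per column is a quick way to see which features are only lightly affected and which are mostly empty.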
Now let’s check our dataset again for confirmation, using the .head() method to print the top 10 rows as shown below:
From here, our main task starts: imputing values in place of NaN. From the above steps, we now know some basic details about the dataset.
Dropping missing values:
Dropping every row that contains a missing value is not generally advised, because it deletes all observations where any variable is missing, ultimately reducing the size of our dataset and the quality of our model. Let’s see this on the PIMA dataset, as shown below:
data = data.dropna()
Now let’s check the shape of our data again:
data.shape
Output- (392, 9)
Almost 50% of our data is deleted, which is not good for us. If only a few rows contain missing values, dropping them is not so bad, but generally we need a more robust method. So this method is only advisable when the NaN values are few in number.
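One way to limit the damage is to drop rows only when a lightly affected column is missing. The sketch below uses a small made-up frame (values and the toy name are hypothetical) and the subset parameter of dropna:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Glucose has one gap, Insulin is mostly empty.
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0],
})

# Dropping on any NaN keeps only the single fully populated row.
dropped_all = toy.dropna()

# Dropping only where Glucose is missing keeps three rows;
# the NaN values in Insulin are left for imputation later.
dropped_subset = toy.dropna(subset=["Glucose"])

print(dropped_all.shape, dropped_subset.shape)
```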
Dropping a row or column:
If you find that a column or row contains a high number of NaNs and it’s not sensible to fill them with any value, you can drop it with the .drop() method, or select only the rows and columns you want to keep with the .loc[] or .iloc[] indexers, as shown here.
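A minimal sketch of these options on a made-up frame (values, the toy name and the threshold of 3 are hypothetical choices for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, np.nan],
    "Insulin": [np.nan, np.nan, np.nan, 94.0],
    "Age": [50, 31, 32, 21],
})

# Drop a named column outright.
without_insulin = toy.drop(columns=["Insulin"])

# Drop a single row by its index label.
without_row_3 = toy.drop(index=3)

# Keep only the columns that have at least 3 non-null values.
dense_columns = toy.dropna(axis=1, thresh=3)

print(list(dense_columns.columns))
```

The thresh form is handy when you do not want to name the sparse columns by hand.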
Filling missing values with a summary statistic
We can easily fill missing values with summary statistics like the mean, median or mode. Let’s take an example by considering the Age feature of our dataset.
Filling with the mean replaces all NaN values with the mean of the non-null values.
Filling with the median replaces all NaN values with the median of the non-null values.
Filling with the mode replaces all NaN values with the mode of the non-null values.
Generally, the median is a safer choice than the mean, as the mean can be pulled by outliers present in our dataset while the median is largely unaffected.
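The three fills can be sketched with fillna as follows. This is an illustration on a made-up column with one outlier (the values and variable names are hypothetical), not the original post’s code:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one outlier (100.0) and two gaps.
bmi = pd.Series([20.0, 22.0, 22.0, 100.0, np.nan, np.nan], name="BMI")

# Mean of the non-null values is 41.0, pulled up by the outlier.
filled_mean = bmi.fillna(bmi.mean())

# Median of the non-null values is 22.0, unaffected by the outlier.
filled_median = bmi.fillna(bmi.median())

# mode() returns a Series (there can be ties), so take the first entry.
filled_mode = bmi.fillna(bmi.mode()[0])

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```

Here the single outlier moves the mean fill to 41.0, while the median and mode fills both stay at 22.0.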
Using Forward fill and Backward fill
Forward fill or ‘ffill’ fills each NaN with the previous non-null value in the feature. Similarly, backward fill or ‘bfill’ fills each NaN with the next non-null value. But note that if there is no non-null value before a NaN (for forward fill) or after it (for backward fill), the NaN remains even after filling. Here’s how you can use forward fill and backward fill:
For Backward fill
For Forward fill
We can also specify the axis along which values are propagated: axis=0 fills down each column, and axis=1 fills across each row.
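A minimal sketch of both directions and of the axis option, using the DataFrame .ffill() and .bfill() methods on a made-up frame (the values and names are hypothetical). Older tutorials often write fillna(method='ffill') instead, which recent pandas versions deprecate:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [np.nan, 1.0, np.nan, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, np.nan, np.nan],
})

# Forward fill: carry the last valid value down each column.
# The leading NaN in "a" stays, because nothing comes before it.
forward = toy.ffill()

# Backward fill: pull the next valid value up each column.
# The trailing NaN values in "b" stay, because nothing comes after them.
backward = toy.bfill()

# axis=1: carry values from left to right within each row.
across = toy.ffill(axis=1)

print(forward)
print(backward)
print(across)
```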
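Using SimpleImputer from sklearn.impute:
Another way is to impute missing data with scikit-learn. Imputing means making an educated guess as to what the missing values could be. Say, for any given column with missing values, we compute the mean of all the non-missing entries and replace every missing value with that mean, as shown below. (Older scikit-learn releases offered this as Imputer in sklearn.preprocessing; that class was removed in version 0.22 in favour of SimpleImputer.)
import numpy as np
from sklearn.impute import SimpleImputer
# X is the feature matrix; NaN values in each column are replaced with that column's mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)
X = imp.transform(X)
Set strategy='most_frequent' above to use the mode instead.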
Because of this ability to transform our data, imputers are known as transformers. After transforming our data, we can then fit our supervised learning model to it.
We can also do both things at once, imputing and fitting the model, by using scikit-learn’s Pipeline object as shown below:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
from sklearn.linear_model import LogisticRegression
imp = SimpleImputer(strategy='mean')  # missing_values defaults to NaN
log = LogisticRegression()
steps = [('imputation', imp), ('logistic_regression', log)]
pipeline = Pipeline(steps)
Now we can split our data into training and test sets and then use the .fit() and .predict() methods of the Pipeline object. You can also download the Jupyter Notebook from GitHub. Here are some references to the official documentation, where you can learn more about the above steps: Imputer, Pipeline.
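To make the split, fit and predict steps concrete, here is a rough sketch on synthetic data. The random features, the labels, the 10% missing rate and the 70/30 split are all assumptions made for illustration and are not taken from the PIMA notebook:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the features and labels; not the PIMA data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries to simulate missing values.
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),
    ("logistic_regression", LogisticRegression()),
])

# The column means used for imputation are learned from the training split only.
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Mean accuracy on the held-out split.
accuracy = pipeline.score(X_test, y_test)
print(accuracy)
```

Fitting the imputer inside the pipeline keeps the test rows from influencing the fill values, which is one reason to prefer this over imputing the whole dataset up front.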
Hope you liked our post. Do share it.