As a data scientist or data analyst, you have to get your data into shape before feeding it to machine learning models. When you come across terms like data cleaning (or data cleansing), pre-processing, or wrangling in the data science community, they all refer to the same thing: pre-modelling activities such as removing outliers and filling NaN or missing values with educated guesses. If you are pursuing a career as a data scientist, you should know that this is where you will spend most of your time when working with different datasets.
Step-by-step process to prepare your data for machine learning models:
Installing required packages
Before you go any further, make sure you have Python and that the expected version is available from your command line. You can check this by running python --version.
Check whether you have installed pip or not
Additionally, you’ll need to make sure you have pip available. You can check this by running python -m pip --version.
If you installed Python from source or with an installer from python.org, you should already have pip. But if you are using Linux and installed Python using your OS package manager, you may have to install pip separately; check here.
If pip isn’t already installed, then first try to bootstrap it from the standard library:
python -m ensurepip --default-pip
If that still doesn’t allow you to run pip, securely download get-pip.py and run python get-pip.py. This will install or upgrade pip.
For more information about how to install packages, you can check here.
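The checks above can also be done from inside Python itself, which is handy in a notebook. This is only a minimal sketch using the standard library; the variable names are my own.

```python
import importlib.util
import sys

# Interpreter version as a (major, minor, micro) tuple.
python_version = sys.version_info[:3]
print("Python version:", ".".join(str(part) for part in python_version))

# find_spec returns None when a module cannot be found, so this asks
# "is pip importable from this interpreter?" without importing it.
pip_available = importlib.util.find_spec("pip") is not None
print("pip available:", pip_available)
```

If pip is reported as unavailable, the bootstrap steps described above are the place to start.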
Importing data files
To import data in CSV or Excel format, we can use the pandas read_csv() and read_excel() functions, as shown in our previous post. You can check the post here. For a complete list of functions, see the official documentation here.
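As a quick illustration of the import step, here is a small sketch. The CSV text and the file names in the comments are made up for the example, and an in-memory buffer stands in for a real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """alcohol,pH,quality
9.4,3.51,5
9.8,3.20,5
11.2,3.16,6
"""

# read_csv accepts a path or any file-like object; with a real file you
# would pass something like "my_data.csv" (a made-up name) instead.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # number of rows and columns
print(df.dtypes)  # inferred type of each column

# Excel files are read the same way with pd.read_excel("my_data.xlsx"),
# which needs an Excel engine such as openpyxl to be installed.
```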
Exploratory Data Analysis
Exploratory Data Analysis, or EDA, is the important process of performing initial investigations on data to discover patterns, spot irregularities, and check assumptions with the help of summary statistics and graphical representations.
When you are dealing with a single variable (univariate), you can calculate summary statistics such as the mean and standard deviation for each field in the raw dataset. When dealing with two variables (bivariate), you look at the relationship between each variable in the dataset and the target variable of interest, for example through their correlation.
Below you can see a visual EDA, i.e. a heatmap of the Wine Quality dataset showing correlations.
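To make the univariate and bivariate ideas concrete, here is a minimal sketch with pandas. The numbers are made-up toy values, and treating "quality" as the target variable is an assumption for the example.

```python
import pandas as pd

# Made-up toy values; "quality" plays the role of the target variable.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 11.2, 10.5, 12.0, 9.0],
    "pH": [3.51, 3.20, 3.16, 3.30, 3.10, 3.45],
    "quality": [5, 5, 6, 6, 7, 4],
})

# Univariate: count, mean, std, min, quartiles and max for every column.
summary = df.describe()
print(summary)

# Bivariate: Pearson correlation of each feature with the target.
target_corr = df.drop(columns="quality").corrwith(df["quality"])
print(target_corr)
```

A positive value in target_corr means the feature tends to rise with the target, and a negative value means it tends to fall.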
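The Wine Quality data itself is not reproduced here, so the following is only a sketch of how such a correlation heatmap could be drawn, using made-up toy values and plain matplotlib. The column names and the output file name are assumptions for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up toy values, not the real Wine Quality data.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 11.2, 10.5, 12.0, 9.0],
    "pH": [3.51, 3.20, 3.16, 3.30, 3.10, 3.45],
    "quality": [5, 5, 6, 6, 7, 4],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

fig, ax = plt.subplots(figsize=(4, 3))
# Fix the colour scale to [-1, 1] so colours are comparable across plots.
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
fig.savefig("correlation_heatmap.png")  # hypothetical output file name
```

Libraries such as seaborn offer a one-line heatmap as well; matplotlib is used here only to keep the dependencies small.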
Working with missing values
In our previous post, we showed a few ways to deal with missing values. When you have missing values in your dataset, you can do a few things: drop the rows or columns that contain missing (NaN) values, make an educated guess to fill in the values, or calculate a summary statistic such as the mean or median and fill with that. See here to learn how to do that.
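The options above can be sketched in a few lines of pandas. The values are made up, and filling with the column median is just one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Made-up toy values with two gaps.
df = pd.DataFrame({
    "alcohol": [9.4, np.nan, 11.2, 10.5],
    "pH": [3.51, 3.20, np.nan, 3.30],
    "quality": [5, 5, 6, 6],
})

# How many values are missing in each column?
missing_counts = df.isna().sum()
print(missing_counts)

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill each gap with the median of its own column.
filled = df.fillna(df.median())
print(filled)
```

Dropping is simplest but throws data away, so filling is usually preferred when only a few values are missing.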
Working with outliers
While there is no single definition of an outlier, in simple words we can define it as a data point whose value is far greater or far less than most of the rest of the data.
When you are building a machine learning model, outlier detection is an important step towards an accurate model and a good score. To spot outliers, we can use box plots.
In a box plot, the usual rule flags points that lie more than 1.5 IQRs (interquartile ranges) below the first quartile or above the third quartile. For roughly symmetric data, that is about 2 IQRs away from the median, which makes it a common criterion for outliers.
In the final phase, you can drop the features that are unimportant and make sure that all of your important features are included in your dataset.
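The box plot rule can also be applied directly in code. This is a minimal sketch using the conventional 1.5 × IQR fences beyond the quartiles; the values are made up and the 1.5 multiplier is a convention, not a hard rule.

```python
import pandas as pd

# Made-up toy values with one obviously extreme point.
values = pd.Series([10, 12, 11, 13, 12, 11, 14, 12, 40], name="toy_feature")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences 1.5 IQRs beyond the quartiles, as drawn by box plot whiskers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]
cleaned = values[~is_outlier]

print(outliers.tolist())
```

Whether to drop, cap, or keep flagged points depends on the data, so it is worth looking at them before removing anything.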
Export or save file
You can export your cleaned or formatted dataset to the file format you want, such as CSV or Excel. To learn how, see our post.
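As a small sketch of the export step, the snippet below writes a made-up DataFrame to CSV and reads it back. The temporary folder and the file name are assumptions so the example does not touch your own files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up toy values standing in for your cleaned dataset.
cleaned = pd.DataFrame({
    "alcohol": [9.4, 10.5, 11.2],
    "quality": [5, 6, 6],
})

# A temporary folder keeps the sketch self-contained; in practice you
# would choose your own output path.
output_dir = Path(tempfile.mkdtemp())
csv_path = output_dir / "cleaned_data.csv"  # hypothetical file name

# index=False leaves the row index out of the file.
cleaned.to_csv(csv_path, index=False)

# Read the file back to confirm what was written.
round_trip = pd.read_csv(csv_path)
print(round_trip)

# cleaned.to_excel(...) works the same way for Excel output, but it
# needs an Excel engine such as openpyxl to be installed.
```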
I hope you like my post. Do share it with your friends.