If you want a career as a data scientist, getting your data and organizing it into proper shape will be your day-to-day task before you apply any machine learning algorithm. In this post, I am sharing basic pandas code that will help you get your data into shape. The pandas package, developed by Wes McKinney, is a high-level data manipulation tool built on the NumPy package. It is great for routine data analysis tasks such as quick Exploratory Data Analysis (EDA), drawing attractive plots, and pre-processing tasks like cleaning and merging, which make your data ready for a machine learning package like scikit-learn. With Python and the pandas package you can also automate your data processing tasks easily. For the code below, we have used the data set "Tips.csv", which I have stored on my PC at the path C:\Users\Pankaj\Desktop\Tips\Tips.csv. So get ready to get your hands dirty with the pandas code below.
Reading a data set stored in CSV, Excel, HTML, etc. using pandas
Importing the pandas library
import pandas as pd
From CSV file
Tips = pd.read_csv("C:\\Users\\Pankaj\\Desktop\\Tips\\Tips.csv", index_col=0)
The code above and the code below do the same thing; the only difference is how we pass the file to the read_csv() function. Above, we gave the complete path to the file because the file is in a different directory from the one where we saved our Python code file (.py extension). If the CSV/Excel file is in the directory you run your code from (usually the same directory as your code file), you can just pass the filename in place of the path, as shown below:
Tips = pd.read_csv("Tips.csv", index_col=0)
or
Tips = pd.read_csv("C:\\Users\\Pankaj\\Desktop\\Tips\\Tips.csv", index_col=0)
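To see what index_col=0 does without needing the file on disk, here is a minimal sketch that reads a small in-memory CSV. The sample rows and the "id" column are made up for illustration and are not the real Tips.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for Tips.csv (values invented for illustration).
csv_text = """id,Total_bill,tip,Sex
1,16.99,1.01,Female
2,10.34,1.66,Male
3,21.01,3.50,Male
"""

# index_col=0 uses the first column ("id") as the row index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without index_col, pandas adds a default 0..n-1 index and keeps "id" as a column.
default_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.shape)     # (3, 3)
print(default_index.shape)  # (3, 4)
```

On Windows you can also write the path as a raw string, for example r"C:\Users\Pankaj\Desktop\Tips\Tips.csv", which avoids doubling every backslash.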
From an Excel file
Tips_excel = pd.read_excel("C:\\Users\\Pankaj\\Desktop\\Tips\\Tips.xlsx", sheet_name=0, index_col=0)
From an HTML file
Tips_htm = pd.read_html("C:\\Users\\Pankaj\\Desktop\\Tips\\Tips.htm") #Returns a list of data frames, one per table found in the file
Writing your Data Frame directly to CSV or Excel file
Once you have finished pre-processing tasks like cleansing (cleaning) and manipulation on your data set (here Tips is our data set), you will want to save it. To do so, we provide the path where we want to save our data set, along with the file name, as shown below:
To save in .csv format, comma separated and without indices:
Tips.to_csv("C:\\Users\\Pankaj\\Desktop\\Tips\\Tips.csv", sep=",", index=False)
To save in .xlsx format:
Tips.to_excel("C:\\Users\\Pankaj\\Desktop\\Tips\\Tips.xlsx")
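A quick way to check a save is to write the file and read it straight back. The sketch below does that for CSV in a temporary folder; the sample frame and the file name Tips_clean.csv are made up for illustration. Saving to .xlsx works along the same lines, but usually needs an extra Excel writer package such as openpyxl to be installed.

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample frame standing in for Tips.
tips = pd.DataFrame({"Total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "Tips_clean.csv")
    # index=False leaves the row index out of the file.
    tips.to_csv(out_path, sep=",", index=False)
    # Read the file back while the temporary folder still exists.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))  # ['Total_bill', 'tip']
print(reloaded.shape)          # (2, 2)
```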
Exploring our data set
Tips.head() #Returns the first 5 rows of our data set. If you pass a value, like .head(2), it returns only that many rows.
Tips.tail() #Returns the last 5 rows of our data set
Tips.columns #Returns the column names of our data set
Tips.shape #Gives the shape of the data set as (m, n), where m is the number of rows and n is the number of columns
Tips.info() #Prints the column names, non-null counts, data types and memory usage
Summary/descriptive statistics of numeric data
Tips.describe() #Gives a statistical summary of the numeric columns: count, mean, standard deviation, min, quartiles (the 50% row is the median) and max
Tips["Sex"].value_counts() #Counts how often each unique value occurs in a column
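Here is a small sketch of these exploration calls on a toy frame, including value_counts() applied to a single column. The four rows are invented for illustration; your own Tips data will give different numbers.

```python
import pandas as pd

# Hypothetical sample frame standing in for Tips.
tips = pd.DataFrame({
    "Total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "Sex": ["Female", "Male", "Male", "Male"],
})

print(tips.shape)    # (4, 3)
print(tips.head(2))  # first two rows only

# describe() summarises the numeric columns by default and skips "Sex".
summary = tips.describe()
print(list(summary.columns))  # ['Total_bill', 'tip']

# value_counts() on one column counts each distinct value.
sex_counts = tips["Sex"].value_counts()
print(sex_counts["Male"])  # 3
```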
Converting the data frame to its NumPy array representation
Tips.to_numpy()
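In recent pandas versions the conversion is done with to_numpy(). A minimal sketch on a made-up two-column frame shows what comes back:

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame with one float column and one integer column.
tips = pd.DataFrame({"Total_bill": [16.99, 10.34], "Size": [2, 3]})

arr = tips.to_numpy()

print(isinstance(arr, np.ndarray))  # True
print(arr.shape)                    # (2, 2)
# Mixed float and integer columns are upcast to one common dtype.
print(arr.dtype)                    # float64
```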
Converting data types
Convert the column to a string or object type
Tips['Total_bill'] = Tips['Total_bill'].astype(str)
Convert the column to a categorical variable
Tips['Sex'] = Tips['Sex'].astype('category')
Converting to numeric data
Tips['Total_bill'] = pd.to_numeric(Tips['Total_bill'], errors='coerce')
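The effect of errors='coerce' is easiest to see with a value that cannot be parsed. This is a minimal sketch on invented data, where "n/a" is a hypothetical bad entry:

```python
import pandas as pd

# Hypothetical sample frame: Total_bill arrives as text, with one bad entry.
tips = pd.DataFrame({
    "Total_bill": ["16.99", "10.34", "n/a"],
    "Sex": ["Female", "Male", "Male"],
})

# The dtype name must be lowercase "category".
tips["Sex"] = tips["Sex"].astype("category")

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
tips["Total_bill"] = pd.to_numeric(tips["Total_bill"], errors="coerce")

print(tips.dtypes)
print(tips["Total_bill"].isna().sum())  # 1
```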
Data Handling (Finding, Removing, Replacing Missing values)
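Drop duplicates
Tips = Tips.drop_duplicates()
Drop rows with missing values
Tips = Tips.dropna()
Drop a column
Tips.drop("tip", axis=1) #axis=0 drops rows, axis=1 drops columns. Returns a new data frame; Tips itself is unchanged.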
Fill with provided values
Tips["sex"] = Tips["Sex"].fillna("missing") #Stores the result in a new column "sex"; the original "Sex" column is unchanged
Tips[["Total_bill","Size"]] = Tips[["Total_bill","Size"]].fillna(0)
Fill missing values with summary statistics
mean = Tips["tip"].mean()
Tips["tip"] = Tips["tip"].fillna(mean)
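The steps in this section can be sketched end to end on a tiny frame. The rows below are invented so that there is one duplicate row and two missing values; the names deduped, complete and filled are just for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame: row 3 repeats row 0, and two values are missing.
tips = pd.DataFrame({
    "tip": [1.0, np.nan, 3.0, 1.0],
    "Sex": ["Female", "Male", None, "Female"],
})

# Remove exact duplicate rows (row 3 goes away).
deduped = tips.drop_duplicates()

# Keep only rows with no missing value at all (only row 0 survives).
complete = deduped.dropna()

# Alternatively, keep the rows and fill the gaps.
filled = deduped.copy()
filled["Sex"] = filled["Sex"].fillna("missing")
tip_mean = filled["tip"].mean()  # mean() skips NaN, so this is (1.0 + 3.0) / 2
filled["tip"] = filled["tip"].fillna(tip_mean)

print(len(deduped), len(complete))  # 3 1
print(filled["tip"].tolist())       # [1.0, 2.0, 3.0]
```

Whether to drop or fill depends on your data; dropping is simpler but throws away rows that may still hold useful values in other columns.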
Basic Summary of your data
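Tips.sum() #Sum of the values in each column
Tips.min() #Lowest value in each column
Tips.max() #Highest value in each column
Tips.idxmax() #Index label of the highest value in each column
Tips.idxmin() #Index label of the lowest value in each column
Tips.mean(numeric_only=True) #Mean of each numeric column; numeric_only=True skips text columns
Tips.median(numeric_only=True) #Median of each numeric column
Tips.corr(numeric_only=True) #Correlation between the numeric columns (numeric_only needs pandas 1.5 or later)
Tips.count() #Number of non-missing values in each column
Tips['tip'].median() #Any of the above can also be called on a single column of the data frame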
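When a data frame mixes numeric and text columns, it helps to restrict the statistics to the numeric ones. The sketch below assumes pandas 1.5 or later (for numeric_only in corr()) and uses invented values chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical sample frame with two numeric columns and one text column.
tips = pd.DataFrame({
    "Total_bill": [10.0, 20.0, 30.0],
    "tip": [1.0, 2.0, 3.0],
    "Sex": ["Female", "Male", "Male"],
})

# numeric_only=True leaves the text column "Sex" out of the calculation.
means = tips.mean(numeric_only=True)
corr = tips.corr(numeric_only=True)

print(means["tip"])          # 2.0
print(tips["tip"].idxmax())  # 2
print(corr.shape)            # (2, 2)
```

Here tip is exactly one tenth of Total_bill, so the correlation between the two columns comes out at (very nearly) 1.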
Selecting values from Data frame
Label-based (.loc)
Tips.loc[:, ["tip","sex"]] #Column access by label
Position-based (.iloc)
Tips.iloc[[1,2]] #Row access: selects the second and third rows, because Python starts indexing from 0
Tips.iloc[:,[0,1]] #Column access: selects the first and second columns
Tips.iloc[[1,2],[0,1]] #Both row and column access
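The difference between .loc and .iloc shows up most clearly when the index is not 0, 1, 2. A minimal sketch, using a made-up frame with the hypothetical labels "a", "b" and "c":

```python
import pandas as pd

# Hypothetical sample frame with string labels as the index.
tips = pd.DataFrame(
    {"Total_bill": [16.99, 10.34, 21.01], "tip": [1.01, 1.66, 3.50]},
    index=["a", "b", "c"],
)

by_label = tips.loc["b", "tip"]  # row labelled "b", column labelled "tip"
by_position = tips.iloc[1, 1]    # second row, second column

# Label slices include both endpoints; position slices exclude the end.
label_slice = tips.loc["a":"b"]
position_slice = tips.iloc[0:1]

print(by_label, by_position)                  # 1.66 1.66
print(len(label_slice), len(position_slice))  # 2 1
```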
Combining Data
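pd.merge(df1, df2) #Joins df1 and df2 on the columns they have in common (inner join by default)
pd.concat([df1,df2,df3]) #Stacks the data frames df1, df2, df3 on top of each other, row-wise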
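To make the difference between the two concrete, here is a small sketch with two invented frames that share a hypothetical "customer" column:

```python
import pandas as pd

# Hypothetical frames sharing the key column "customer".
df1 = pd.DataFrame({"customer": ["A", "B", "C"], "tip": [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({"customer": ["B", "C", "D"], "Size": [2, 4, 3]})

# merge() matches rows by key; the default inner join keeps only B and C.
merged = pd.merge(df1, df2, on="customer")

# concat() just stacks rows; ignore_index=True renumbers them 0..n-1.
stacked = pd.concat([df1, df1], ignore_index=True)

print(merged["customer"].tolist())  # ['B', 'C']
print(merged.shape)                 # (2, 3)
print(len(stacked))                 # 6
```

Passing how="left", how="right" or how="outer" to merge() keeps the unmatched rows from one or both sides instead of dropping them.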
Get the unique values of a column
Tips["tip"].unique() #Gives the unique entries of the column "tip"
Setting the index
Tips.index = ["Customer1","Customer2","Customer3","Customer4", ...] #One label per row; the list must be as long as the data frame
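Typing the labels out by hand only works for a handful of rows. One possible way to build them for any number of rows is sketched below, on an invented three-row frame:

```python
import pandas as pd

# Hypothetical sample frame standing in for Tips.
tips = pd.DataFrame({"tip": [1.01, 1.66, 3.50]})

# Build one label per row so the list length always matches the frame.
tips.index = [f"Customer{i}" for i in range(1, len(tips) + 1)]

print(list(tips.index))              # ['Customer1', 'Customer2', 'Customer3']
print(tips.loc["Customer2", "tip"])  # 1.66
```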
To learn about the pandas library and pandas code in detail, you can refer to the pandas documentation. I hope you like my post and that this pandas code helps you in your day-to-day data munging tasks. Do share your comments with me. Till then, have a happy time with the pandas library. 🙂