Whether you are working as a Data Scientist or looking to build a career in Data Science, your work pipeline includes extracting and loading the dataset, data cleansing and munging, computing summary statistics, doing some Exploratory Data Analysis (EDA) and, after all these things, building a model using machine learning. The Anscombe Quartet dataset is one example that shows how relying only on summary statistics can be troublesome and how badly it can affect our machine learning model.
Anscombe Quartet Dataset
For this post, we are going to use the Anscombe Quartet dataset, which is stored as an Excel file that we can read using pd.read_excel(). We also need to import the Pandas, NumPy and Matplotlib packages with their common aliases, as shown below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_excel("C:\\Users\\Pankaj\\Desktop\\Practice\\anscombes-quartet.xlsx")
Above we have given the complete path to the file because the file is in a different directory from the one where we saved our Python code file (.py extension). If the Excel/CSV file and your code file are in the same directory, you can just use the filename in place of the path. For example, we can write:
data = pd.read_excel("anscombes-quartet.xlsx")
Anscombe's Quartet is a group of four datasets that appear to be similar when using typical summary statistics, but when you plot all the groups using the Matplotlib package, you'll see a different story. Each dataset consists of eleven (x, y) pairs. We have labelled the four pairs of columns as (X, Y), (X.1, Y.1), (X.2, Y.2), (X.3, Y.3).
Calculating Summary Statistics of Anscombe Quartet:
To calculate the mean, variance and correlation coefficient, we can write small Python functions that each take the data as input and return the statistic. This is how the functions look:
For mean –
def mean(data):
    # Arithmetic mean of one column
    return np.mean(data)
For variance –
def var(data):
    # ddof=1 gives the sample variance (np.var defaults to the population variance)
    return np.var(data, ddof=1)
For correlation coefficient –
def cor(data1, data2):
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient
    return np.corrcoef(data1, data2)[0, 1]
Example to calculate
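As a minimal sketch of such a calculation, the block below applies the three helpers to the first pair only. The numbers are the published Anscombe values for group I, typed in by hand; it is an assumption that columns X and Y of the Excel file hold these same values. With the DataFrame loaded above you would pass data["X"] and data["Y"] instead of the lists.

```python
import numpy as np

def mean(data):
    # Arithmetic mean of one column
    return float(np.mean(data))

def var(data):
    # Sample variance (ddof=1); np.var defaults to ddof=0
    return float(np.var(data, ddof=1))

def cor(data1, data2):
    # Off-diagonal entry of the 2x2 correlation matrix
    return float(np.corrcoef(data1, data2)[0, 1])

# Published Anscombe values for group I (assumed to match columns X and Y)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

print("Mean of X:", round(mean(x), 3))
print("Mean of Y:", round(mean(y), 3))
print("Variance of X:", round(var(x), 3))
print("Variance of Y:", round(var(y), 3))
print("Correlation (X, Y):", round(cor(x, y), 3))
```

Repeating the same calls for the other three pairs is what leads to the comparison that follows.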
After calculating the mean, variance and correlation coefficient for all four labelled pairs of the Anscombe Quartet dataset, (X, Y), (X.1, Y.1), (X.2, Y.2), (X.3, Y.3), we can conclude that:
- Mean of X = Mean of X.1 = Mean of X.2 = Mean of X.3 = 9
- Mean of Y = Mean of Y.1 = Mean of Y.2 = Mean of Y.3 = 7.5
- Variance of X = Variance of X.1 = Variance of X.2 = Variance of X.3 = 11 (sample variance, i.e. ddof=1 in np.var)
- Variance of Y = Variance of Y.1 = Variance of Y.2 = Variance of Y.3 = 4.125 (approx)
- Correlation between (X,Y) = Correlation between (X.1,Y.1) = Correlation between (X.2,Y.2) = Correlation between (X.3,Y.3) = 0.816
- Also, when you fit a linear regression to each pair with np.polyfit(), you get the slope (a) = 0.5 and the intercept (b) = 3, so the equation of the line becomes y = 0.5x + 3 in every case.
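The regression claim in the last point can be checked with a short sketch like the one below. The values are the published Anscombe numbers, entered by hand and keyed by the column labels used in this post; that the Excel file contains exactly these values under these labels is an assumption. The helper name fit_line is only illustrative.

```python
import numpy as np

# Groups I to III share the same x values in the published dataset
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

# Assumed mapping of the published values to this post's column labels
quartet = {
    ("X", "Y"): (x_shared,
                 [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    ("X.1", "Y.1"): (x_shared,
                     [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    ("X.2", "Y.2"): (x_shared,
                     [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ("X.3", "Y.3"): ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def fit_line(x, y):
    """Least-squares straight line through the points; returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)

fits = {labels: fit_line(x, y) for labels, (x, y) in quartet.items()}

for (x_label, y_label), (slope, intercept) in fits.items():
    print(f"{y_label} = {slope:.2f} * {x_label} + {intercept:.2f}")
```

np.polyfit returns the coefficients from the highest degree down, so with deg=1 the first value is the slope and the second is the intercept.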
If you want to know more about how to calculate other summary statistics such as the standard deviation, or how to use .describe(), .info() etc., see our post here.
Although the summary statistics of the four pairs are pretty much the same, plotting them tells a different story. If you want to know how to plot these graphs, check out our post. Fig. 1 is for (X, Y), fig. 2 is for (X.1, Y.1), fig. 3 is for (X.2, Y.2) and fig. 4 is for (X.3, Y.3). Below you can see all four figures:
- In Anscombe’s Quartet dataset, group I (X, Y) has a linear relationship.
- Group II (X.1, Y.1) has a non-linear relationship.
- In group III (X.2, Y.2), most of the data have a linear relationship, but there is an outlier.
- In group IV (X.3, Y.3), X.3 takes the same value for every point except one outlier, so apart from that outlier the two variables show no relationship at all.
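One way to reproduce these four pictures is sketched below, assuming the published Anscombe values (entered by hand here) match the Excel file. Each panel is a scatter plot with its fitted line from np.polyfit() drawn on top; the function name plot_quartet and the panel titles are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Groups I to III share the same x values in the published dataset
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

# (panel title, x values, y values); mapping to the column labels is assumed
groups = [
    ("Group I (X, Y)", x_shared,
     [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    ("Group II (X.1, Y.1)", x_shared,
     [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    ("Group III (X.2, Y.2)", x_shared,
     [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ("Group IV (X.3, Y.3)", [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

def plot_quartet(groups):
    """Draw a 2x2 grid: one scatter plot plus fitted line per group."""
    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
    for ax, (title, x, y) in zip(axes.flat, groups):
        slope, intercept = np.polyfit(x, y, deg=1)
        line_x = np.array([min(x), max(x)])
        ax.scatter(x, y)
        ax.plot(line_x, slope * line_x + intercept, color="red")
        ax.set_title(title)
    fig.tight_layout()
    return fig

fig = plot_quartet(groups)
# Call plt.show() to display the figure in an interactive session.
```

Sharing the axes across panels makes it easier to see that the red line is practically identical in all four plots while the points around it are not.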
So you can see that Exploratory Data Analysis (EDA) is an important step when working with a dataset, so always do the EDA first. For more information about Pandas and NumPy, you can check the Pandas Documentation and the NumPy Documentation.