In this post, we will apply linear regression to the Boston Housing Dataset using all of the available features. In the previous post, we applied linear regression to predict the price from a single feature of the dataset, RM: the average number of rooms.
We are going to use the Boston Housing dataset, which contains information about different houses in Boston. There are 506 samples and 13 feature variables in this dataset. Our aim is to predict house prices using the given features.
Let’s see how to apply linear regression to the Boston Housing Dataset in action:
First, import all the libraries we will need to build our linear regression model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Now we will use the pandas .read_csv() function to load our .CSV file and save it as data, as shown below:
filepath = "C:\\Users\\Pankaj\\Desktop\\Dataset\\Boston_housing.csv"
data = pd.read_csv(filepath)
To check the column names of the dataset we can use .columns attribute as shown below:
data.columns

Output-
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'b', 'lstat', 'medv'],
      dtype='object')
The Boston Housing Dataset characteristics, or column descriptions, are as follows:
- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq. ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
- nox: nitric oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- b: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- lstat: percentage of lower-status population
- medv: median value of owner-occupied homes in $1000s (our target variable)
To get basic details about our Boston Housing dataset, such as null or missing values and data types, we can use .info() as shown below:
data.info()

Output-
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null int64
ptratio    506 non-null float64
b          506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB
Below are the top 5 rows that are returned by default when using .head() on a DataFrame.
data.head()
Generally, missing values can appear in many forms, such as 0, ?, or the string “missing”. In our case, as we can see above, there are a lot of 0’s, so we can replace them with NaN to calculate how much data we are missing.
data.zn.replace(0, np.nan, inplace=True)
data.chas.replace(0, np.nan, inplace=True)
After replacing, let’s use the .info() method again to see the details about missing values in our dataset:
data.info()

Output-
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim       506 non-null float64
zn         134 non-null float64
indus      506 non-null float64
chas        35 non-null float64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null int64
ptratio    506 non-null float64
b          506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(12), int64(2)
memory usage: 55.4 KB
Let’s calculate the percentage of missing values in our dataset. Generally, if a feature has around 20-25% missing values, we can impute them in different ways, for example with the mean, the median, or an educated guess. But if it’s more than that, it’s better to remove those features, otherwise they can affect our results. As we can see below, both “zn” and “chas” are missing more than 70% of their data, so we will remove both of these features.
data.isnull().sum()/len(data)*100

Output-
crim        0.000000
zn         73.517787
indus       0.000000
chas       93.083004
nox         0.000000
rm          0.000000
age         0.000000
dis         0.000000
rad         0.000000
tax         0.000000
ptratio     0.000000
b           0.000000
lstat       0.000000
medv        0.000000
dtype: float64
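For completeness, if a column had only a small share of missing values (roughly the 20-25% mentioned above), a minimal imputation sketch could look like the following. The column name "some_column" is only a placeholder for illustration, not a column we actually impute in this post:

# Hedged sketch: impute a mostly-complete column with its median.
# "some_column" is a hypothetical column name used only as an example.
median_value = data["some_column"].median()
data["some_column"] = data["some_column"].fillna(median_value)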
We can drop both columns as shown below:
data = data.drop("zn", 1) data = data.drop("chas", 1)
Now let’s check the percentage of missing data again to make sure everything is fine, so that we can go ahead without any worries.
data.isnull().sum()/len(data)*100

Output-
crim       0.0
indus      0.0
nox        0.0
rm         0.0
age        0.0
dis        0.0
rad        0.0
tax        0.0
ptratio    0.0
b          0.0
lstat      0.0
medv       0.0
dtype: float64
Now, to get some basic statistics about our data, such as the mean, count, quartiles, etc., we can use the .describe() method as shown below:
data.describe()
Now let’s plot the histogram of all the available features to see the distribution.
data.hist(bins=50, figsize=(15, 15))
plt.show()

(A grid of histograms, one per column, is displayed.)
Now let’s check the correlation between all the feature variables and the target variable by plotting a heatmap, as shown below:
# set the size of the figure
sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.heatmap(data.corr().round(2), square=True, cmap='RdYlGn', annot=True)
Observations from the above Heatmap:
- From the above correlation plot, we can see that medv is strongly correlated with lstat and rm.
- rad and tax are strongly correlated with each other, so we don’t include them together in our features, to avoid multicollinearity. The same applies to dis and age, which have a correlation of -0.75. So we will exclude these four features from our feature list (a quick numeric check follows below). You can find the reason behind this here.
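As a quick numeric check of the two pairs mentioned above (a sketch, assuming data still holds the cleaned DataFrame from the earlier steps), we can print their pairwise correlations directly:

# pairwise correlations for the two strongly related feature pairs
print(data['rad'].corr(data['tax']))
print(data['dis'].corr(data['age']))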
Scikit-learn expects the features and the target in separate arrays, conventionally X and y. Here medv is our target variable, and we can extract the feature and target arrays from our dataset as shown below. For X we drop the medv column along with the four excluded features, and for y we keep only the medv column:
X = data.drop(["medv","rad","tax","dis","age"], 1).values y= data["medv"].values
In order to get a better idea of the effect of each feature on the value of medv, all of the independent variables are visualized against it.
plt.figure(figsize=(20, 15))
features = ['crim', 'indus', 'nox', 'rm', 'ptratio', 'b', 'lstat']
target = data['medv']
for i, col in enumerate(features):
    # a 3 x 3 grid comfortably fits the 7 remaining features
    plt.subplot(3, 3, i + 1)
    plt.scatter(data[col], target, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('medv')
plt.show()
From the above visualization, we can conclude that medv and rm follow a linear relationship: medv increases as rm (the number of rooms) increases. Also, as lstat increases, medv (the price) decreases.
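To back this observation up with numbers, a small check of how each remaining column correlates with the target could look like this (a sketch, assuming data is the cleaned DataFrame from above):

# correlation of every remaining column with medv, strongest positive first
print(data.corr()['medv'].sort_values(ascending=False))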
Let’s see the shape of our X (features) and y (target variable):
print(X.shape)
print(y.shape)

Output-
(506, 7)
(506,)
We can see that there are 7 features with 506 observations each, and 1 target variable with the same 506 observations. All is well, so we can now fit our linear regression to the Boston Housing Dataset.
Applying scikit-learn’s Linear Regression to the Boston Housing dataset’s predictor (independent) variables to predict the value of the dependent variable medv:
Now, let’s apply linear regression to the Boston Housing Dataset. For that, we first split the data into training and testing sets. We train the model with 70% of the data and test with the remaining 30%.
We do this because if we compute the performance metric on the same data that we used to fit the model, the result will not tell us how well the model generalizes to unseen data. For this reason, it is common to split the data into two sets: a training set on which we fit the model, and a labelled test set on which we make predictions and compare them with the known labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(len(X_train), "train set + ", len(X_test), "test set")

Output-
354 train set + 152 test set
Finally, to check that we have split our training and test sets correctly, we print their shapes as shown below:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

Output-
(354, 7)
(354,)
(152, 7)
(152,)
Since the target variable here is quantitative (a continuously varying variable, medv), this is a regression problem. We imported LinearRegression from sklearn.linear_model at the start of the post. Now, first instantiate LinearRegression(), then use .fit() to fit the linear regression, and then predict the prices using .predict(), as shown below:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
Now we will compute metrics to measure the model’s performance. Here we compute R squared, which quantifies the fraction of the variance in the target variable that is explained by the feature variables, and RMSE, the root mean squared error of the predictions.
r2 = lr.score(X_test, y_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("The model performance is") print("--------------------------------------") print('R2 score is {}'.format(r2)) print('RMSE is {}'.format(rmse)) print("\n") Output- The model performance is -------------------------------------- R2 score is 0.6387311528873696 RMSE is 5.188377408265187
Let’s also plot a scatter plot of the test labels against the predicted labels, which in an ideal case would follow a straight line.
plt.scatter(y_test, y_pred)
plt.show()
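To make the “ideal case” visible, a small variation of the same plot (a sketch using the y_test and y_pred arrays from above) can overlay the diagonal that perfect predictions would fall on:

plt.scatter(y_test, y_pred)
# the diagonal marks perfect predictions (predicted == actual)
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, 'r--')
plt.xlabel('Actual medv')
plt.ylabel('Predicted medv')
plt.show()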
This is a reasonable result, but in practice you would rarely apply linear regression to the Boston Housing Dataset this way, i.e. out of the box. Most of the time we use regularization, which puts further constraints on the model coefficients; a small preview is sketched below. In the coming posts, we will look at a few ways to increase the model’s performance.
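As a preview of what regularization looks like (a sketch only; the details are left to a later post), Ridge regression from scikit-learn can be swapped in for LinearRegression with one extra hyperparameter, alpha, that controls how strongly the coefficients are constrained:

from sklearn.linear_model import Ridge

# alpha controls the regularization strength; alpha=1.0 is only an illustrative choice
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(ridge.score(X_test, y_test))  # R^2 on the test set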
Hope you liked the post.