In this post, we will build a simple linear regression model using Python's scikit-learn (sklearn) library.
When it comes to defining machine learning, we can say it is the art and science of giving machines, especially computers, the ability to learn to make decisions from data without being explicitly programmed. A basic example we see every day is our email inbox, where a computer predicts whether an incoming email is spam or not.
In regression tasks, the target variable (also called the dependent or response variable) is a continuously varying quantity, such as the price of a house in the Boston Housing dataset.
The dataset for Linear Regression:
The dataset that I am going to use for building a simple linear regression model with scikit-learn is the Boston Housing dataset, which you can download from here. For now, let's try to predict the price from a single feature of the dataset, RM: the average number of rooms.
Let's see how to build a simple linear regression model using Python's scikit-learn library:
First, import all the libraries that we are going to need to build our linear regression model.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
In the next step, we use the pandas .read_csv() function to load our .csv file and save it as data, as shown below (adjust the file path to wherever you saved the dataset):
data = pd.read_csv("C:\\Users\\Pankaj\\Desktop\\Dataset\\Boston_housing.csv", sep = ",")
data.head()
Below are the top 5 rows that .head() returns by default on a DataFrame.
To get basic details about our Boston Housing dataset, such as null values and data types, we can use .info() as shown below:
data.info()

Output -
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null int64
ptratio    506 non-null float64
b          506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB
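If you also want an explicit per-column count of missing values, one quick optional check is:

# Optional: count missing values in each column (all zeros here, since every column has 506 non-null entries)
data.isnull().sum()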
To check the column names of the dataset, we can use the .columns attribute as shown below:
data.columns

Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'b', 'lstat', 'medv'],
      dtype='object')
The Boston Housing dataset's columns are described as follows:
- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq. ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
- nox: nitric oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- b: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- lstat: percentage of lower-status population
- medv: median value of owner-occupied homes in $1000s (this is our target variable)
You can also find basic statistical details such as the mean, median, standard deviation, etc. about the dataset using the .describe() method as shown below:
data.describe()
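Since we will focus on rm and medv below, you can also restrict the summary to just those two columns; a minimal optional variant:

# Summary statistics for only the feature and target we care about
data[["rm", "medv"]].describe()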
Scikit-learn expects the features and the target in separate arrays, conventionally named X and y. Here medv (median house value) is our target variable, so we extract the feature and target arrays from our dataset as shown below: in X we drop the medv column, and in y we keep only the medv column.
X = data.drop("medv", axis=1).values
y = data["medv"].values
Here we are trying to predict the price from a single feature of the dataset, RM: the average number of rooms, which is our feature variable in this case. To extract it we can write:
X_rooms = X[:, 5]
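Here 5 is the position of the rm column in X. If you would rather not hard-code the index, a small optional sketch is to look it up from the column names instead:

# Look up the position of 'rm' in the feature columns instead of hard-coding 5
rm_index = data.drop("medv", axis=1).columns.get_loc("rm")
X_rooms = X[:, rm_index]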
To check the data type of both our feature variable and target variable, we can use the type() function as shown below:
type(X_rooms), type(y)

Output - (numpy.ndarray, numpy.ndarray)
To get X_rooms and y into the two-dimensional shape that scikit-learn expects, we can use the .reshape() function as shown below (note that .reshape() returns a new array, so we assign the result back):

X_rooms = X_rooms.reshape(-1, 1)
y = y.reshape(-1, 1)
To find the shape of both feature variable and target variable, we can use .shape attribute as shown below:
print(X_rooms.shape, y.shape)

Output - (506, 1) (506, 1)
We can also plot a heatmap of the Boston Housing dataset using Seaborn's heatmap function with the code shown below, where data.corr() computes the pairwise correlation between columns:
sns.heatmap(data.corr(), square=True, cmap='RdYlGn')
With the 'RdYlGn' colormap, dark green cells indicate a strong positive correlation, while red cells indicate a negative correlation. In our case the cell for rm and medv shows green, which means the two are positively correlated.
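To put an exact number on that relationship, you can also compute the single correlation value directly (an optional check):

# Correlation between average number of rooms (rm) and median house value (medv)
data["rm"].corr(data["medv"])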
Now let's create a scatter plot of X_rooms against y as shown below:
plt.scatter(X_rooms, y, color='green', s=3)
plt.ylabel("Value of house/1000($)")
plt.xlabel("Number of rooms")
plt.show()
From the above plot, we can conclude that houses with more rooms tend to have higher prices.
Since the target variable here is quantitative, this is a regression problem. We imported the linear_model module from sklearn at the start of the post. Now we first instantiate LinearRegression(), then use .fit() to fit the linear regression, and finally predict prices using .predict(), as shown below:
regression = linear_model.LinearRegression()
regression.fit(X_rooms, y)

plt.scatter(X_rooms, y, color='green', s=3)

# We want to check the regressor's predictions over the range of the data:
Data_range = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1, 1)
y_pred = regression.predict(Data_range)

plt.plot(Data_range, y_pred, color="black", linewidth=3)
plt.show()
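Once the model is fitted, it can also be useful to inspect the learned slope, intercept and the R-squared score on the data we fitted; here is a minimal sketch using the attributes LinearRegression exposes:

# Slope, intercept and R^2 of the fitted single-feature model
print(regression.coef_, regression.intercept_)
print(regression.score(X_rooms, y))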
This is how we can build a simple linear regression model in Python with a single feature variable. In the next posts, we will build a linear regression model using all the feature variables, use train_test_split() to split our dataset into training and test sets, and measure the accuracy of the model.
We have also shared how to Perform Linear Regression using Least Squares in one of our posts.
Hope you like our post. Please share and subscribe to our newsletter. 🙂