In this post, we will see how to split data for Machine Learning with scikit-learn (sklearn), since splitting your data into a training set and a test set is always a best practice.
In our previous post, we defined Machine Learning as the art and science of giving machines, especially computers, the ability to learn to make decisions from data, all without being explicitly programmed. A basic example that we see every day is our email inbox, where a machine predicts whether an email is spam or not.
To build a Machine Learning model, we can use different algorithms depending on our feature variables and the target variable. In the data science world, you will often hear people talking about two terms, the training set and the test set, and sometimes a validation set as well. Let’s understand both the training set and the test set in this post.
Why do we need to split data for Machine Learning models?
When we build a machine learning model, we compute a metric to measure the model’s performance. For classification models, the commonly used metric is accuracy, defined as the number of correct predictions divided by the total number of data points. But now two questions arise: which data do we use to compute the accuracy? And how well will our model or algorithm perform on new data that it has never seen before?
We could compute this metric on the same data that we used to fit the classifier, but since that data was used to train the model, this will not tell us how well our model generalizes to unseen data.
So for this reason it’s common to split our data into two sets: a training set on which we train and fit the classifier, and a labelled test set on which we make our predictions and finally compare these predictions with the known labels.
After this, we can easily compute our metric, in this case accuracy for classification, to measure the model’s performance.
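The accuracy definition above can be sketched in a few lines of plain Python. The labels and predictions here are hypothetical, made up purely to illustrate the arithmetic:

```python
# A minimal sketch of the accuracy metric: the fraction of
# predictions that match the known labels.
y_true = [1, 0, 1, 1, 0]   # hypothetical known test labels
y_pred = [1, 0, 0, 1, 0]   # hypothetical model predictions

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 4 correct out of 5 -> 0.8
```

In practice, scikit-learn computes this for you, for example via a classifier’s score method on the test set.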
How to split data for Machine Learning models?
When you are working with a single feature variable and a target variable, you can use the same process that we have used here. But when there is more than one feature variable, we can use scikit-learn’s train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
In the above code:
train_test_split returns four arrays: the training data, the test data, the training labels and the test labels. By default, train_test_split splits the data into 75% training data and 25% test data, which we can think of as a good rule of thumb.
The test_size keyword argument specifies what proportion of the original data is used for the test set. Here we have set test_size=0.3, which means 70% training data and 30% test data.
The random_state keyword argument sets a seed for the random number generator that splits our data into a training and a test set. The best part of this keyword argument is that, by passing the same seed later, you can reproduce the same split of your data, and therefore the same results.
As we want the labels in our dataset to be distributed in the same proportions in both the training set and the test set, we use the keyword argument stratify=y, where y is the array containing the labels.
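Putting these keyword arguments together, here is a small self-contained sketch using made-up, deliberately imbalanced data (80 samples of class 0 and 20 of class 1), so we can see that test_size controls the split proportions, stratify preserves the class balance, and random_state makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 2 features, imbalanced labels
# (80 of class 0, 20 of class 1).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# test_size=0.3 -> 70 training samples, 30 test samples.
print(X_train.shape, X_test.shape)  # (70, 2) (30, 2)

# stratify=y keeps the 80%/20% class balance in both splits.
print(np.bincount(y_train))  # [56 14]
print(np.bincount(y_test))   # [24  6]

# Passing the same random_state reproduces the exact same split.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)
print(np.array_equal(X_train, X_train2))  # True
```

Without stratify, a small or unlucky split could leave the rarer class under-represented in the test set, which would distort the accuracy we compute on it.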
If you want to know more about how to split data for Machine Learning with scikit-learn, and want to dig deeper into the scikit-learn library, you can check its official documentation here. In our next post, we will see how to split data for Machine Learning using train_test_split and build a linear regression model on the Boston Housing dataset.