Cross Validation

Aman S
4 min read · Oct 2, 2022
  • Cross-validation is a statistical method for estimating how well a machine learning model performs on data it has not seen.
  • In machine learning we build models to predict the outcome of certain events.
  • The simplest way to check whether a model is good enough is the Train/Test method.
  • We split the data set into two parts: a training set, used to fit the model, and a testing set, used to measure its accuracy.
  • A single split, however, does not tell us how the model will behave on other unseen data. The score depends on which samples happened to land in the test set, so we need stronger assurance about the predictions we get from the model.
  • That is why validating the model is so important.
  • To evaluate any machine learning model, it must be tested on unseen data. Its performance there tells us whether the model is under-fitting, over-fitting, or generalizing well.

Cross-validation (CV) is one of the techniques used to test the effectiveness of machine learning models. The general procedure is:

  1. Shuffle the dataset to remove any ordering.
  2. Split the data into K folds. K = 5 or K = 10 works well in most cases.
  3. Hold out one fold for testing and use all the remaining folds for training.
  4. Train (fit) the model on the training set, test (evaluate) it on the test set, and note the score for that split.
  5. Repeat this process K times, each time choosing a different fold as the test data.
  6. In every iteration the model is therefore trained and tested on a different partition of the data.
  7. At the end, average the scores from all the splits to get the mean score.
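The steps above can be sketched in plain Python. This is only an illustration of the procedure, not how any library implements it: the helper names `k_fold_indices` and `cross_validate`, the toy majority-class "model", and the made-up data are all my own assumptions.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def cross_validate(fit, score, X, y, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, repeat for every fold."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # All folds except fold i form the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score(model, [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return scores, sum(scores) / len(scores)


# Hypothetical toy model: always predict the most common training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def accuracy(model, X_test, y_test):
    return sum(1 for label in y_test if label == model) / len(y_test)


X = [[i] for i in range(20)]   # made-up features
y = [0] * 14 + [1] * 6         # made-up labels
scores, mean_score = cross_validate(fit_majority, accuracy, X, y, k=5)
print("scores per fold:", scores)
print("mean score:", mean_score)
```

Any pair of `fit` and `score` functions with the same shape could be passed in; the loop itself does not care which model is being evaluated.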

K Fold Cross Validation

In K Fold cross validation the input data is divided into ‘K’ folds, hence the name K Fold.

This technique works well because the model is trained and evaluated on a variety of samples rather than on one fixed split.

In every iteration we use one fold as test data and the rest as training data. Note that the training and test data change with every iteration, and every sample is used for testing exactly once, which adds to the effectiveness of this method.

Because most of the data is used for training (fitting) in every iteration, and all of it is used for validation at some point, the averaged score is far less dependent on one lucky or unlucky split. K Fold cross-validation does not by itself change the model, but it gives a more trustworthy picture of under-fitting and over-fitting, which helps us choose a model that generalizes well and makes better predictions on unknown data.

Stratified K Fold Cross Validation

Stratified K Fold is used when random shuffling and splitting of the data is not sufficient and we want each fold to reflect the distribution of the whole data set. In the case of a regression problem, folds are selected so that the mean response value is approximately equal in all the folds. In the case of classification, folds are selected so that each has the same proportion of class labels. Stratified K Fold is most useful for classification problems, especially with imbalanced classes, where it is very important to have the same percentage of labels in every fold.
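To make the idea of "same proportion of class labels in every fold" concrete, here is a small pure-Python sketch for the classification case. It is an assumption-laden illustration (the function name `stratified_folds` and the 80/20 toy labels are mine), not the algorithm scikit-learn uses internally: it simply shuffles the samples of each class and deals them out to the folds in turn.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Return k lists of indices whose class mix mirrors the full data set."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the samples of this class round-robin across the folds.
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds


labels = ["a"] * 80 + ["b"] * 20   # made-up, imbalanced labels
for number, fold in enumerate(stratified_folds(labels, k=5), start=1):
    print(f"fold {number}:", dict(Counter(labels[i] for i in fold)))
```

With plain random splitting, a fold could end up with very few samples of the rare class "b"; dealing each class separately keeps the 80/20 ratio in every fold.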

Code:

Import Libraries

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
digits= load_digits()

We first build the models without K Fold cross validation and note the accuracy on the test data.

Split the data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data,digits.target,test_size=0.3)

Function to get the score with different models:

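def get_score(model, X_train, X_test, y_train, y_test):
    # Fit on the training split, then return the accuracy on the test split
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

The models we compare are:

  1. Logistic Regression
  2. SVM
  3. Random Forest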

WITHOUT USING FOLDS

# For Logistic Regression
# (newer scikit-learn releases deprecate the multi_class argument;
#  wrapping the model in sklearn.multiclass.OneVsRestClassifier is the suggested replacement)
print(f"Score for Logistic Regression is: {get_score(LogisticRegression(solver='liblinear', multi_class='ovr'), X_train, X_test, y_train, y_test)}")
# For SVM
print(f"Score for SVM is: {get_score(SVC(gamma='auto'), X_train, X_test, y_train, y_test)}")
# For Random Forest
print(f"Score for Random Forest is: {get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test)}")

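If we re-execute the code, we get different scores every time, because train_test_split draws a new random split on each run unless random_state is fixed (and the random forest is itself randomized).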
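A quick way to see this effect is to repeat the split with a few different seeds and compare the scores. The sketch below is my own addition: the seed values, the use of `random_state`, and the choice of the random forest as the example model are assumptions made for illustration. The forest's own seed is fixed so that only the split changes between runs.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()

split_scores = []
for seed in range(5):
    # A different random_state gives a different 70/30 split each time.
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.3, random_state=seed
    )
    # Fixed model seed: any change in score comes from the split alone.
    model = RandomForestClassifier(n_estimators=40, random_state=0)
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

print("scores per split:", [round(s, 4) for s in split_scores])
print("spread (max - min):", round(max(split_scores) - min(split_scores), 4))
```

The spread between the best and worst split gives a feel for how much a single train/test score can be trusted.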

USING FOLDS

We use Stratified K Fold Cross Validation: every model is fitted and scored once per fold, and the per-fold scores are then averaged.
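A minimal sketch of how this step could look is given below. Several details are my assumptions rather than the original code: 3 folds, `shuffle=True` with a fixed `random_state`, the `make_models` helper, and wrapping logistic regression in `OneVsRestClassifier` (which keeps the one-vs-rest scheme without the deprecated `multi_class` argument).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target


def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


def make_models():
    # Fresh, unfitted models for every fold (hypothetical helper).
    return {
        "Logistic Regression": OneVsRestClassifier(LogisticRegression(solver="liblinear")),
        "SVM": SVC(gamma="auto"),
        "Random Forest": RandomForestClassifier(n_estimators=40, random_state=0),
    }


# Assumed settings: 3 stratified folds, shuffled with a fixed seed.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_scores = {name: [] for name in make_models()}

for train_idx, test_idx in skf.split(X, y):
    for name, model in make_models().items():
        fold_scores[name].append(
            get_score(model, X[train_idx], X[test_idx], y[train_idx], y[test_idx])
        )

for name, scores in fold_scores.items():
    print(f"{name}: folds={np.round(scores, 3)}, mean={np.mean(scores):.3f}")
```

Comparing the mean scores, rather than one score from one split, is what makes the comparison between the three models fairer.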

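sklearn HAS AN API FOR CROSS VALIDATION

cross_val_score runs the same fold-by-fold fit-and-score loop in a single call and returns one score per fold. For a classifier, passing an integer as cv makes it use stratified folds by default.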
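As a hedged example, the call could look like the following. The fold count, the explicit `StratifiedKFold` object, and the fixed seeds are assumptions on my part; only two of the three models are shown to keep the sketch short.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# Assumed settings: 3 stratified folds, shuffled with a fixed seed.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# One accuracy score per fold for each model.
svm_scores = cross_val_score(SVC(gamma="auto"), digits.data, digits.target, cv=cv)
rf_scores = cross_val_score(
    RandomForestClassifier(n_estimators=40, random_state=0),
    digits.data,
    digits.target,
    cv=cv,
)

print("SVM:", svm_scores, "mean:", svm_scores.mean())
print("Random Forest:", rf_scores, "mean:", rf_scores.mean())
```

Passing the same `cv` object to every call means all models are scored on identical folds, so their mean scores are directly comparable.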

Using this technique we get a more reliable estimate of a model's performance, which helps us pick the model and parameters that generalize best.