K-Fold Cross Validation Technique

K-fold cross-validation divides the dataset into k equal-sized folds and repeatedly trains and tests the model so that each fold is used for testing exactly once. It is used to detect overfitting and to obtain a more reliable estimate of model performance than a single train/test split.

Example: if K = 5, the dataset is split into 5 folds and the train/test cycle runs 5 times. In each iteration, one fold is held out for testing and the remaining four folds are used for training. The pictorial representation below shows the flow across the folds.


[Figure: K-fold cross-validation flow]
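To make the fold mechanics concrete, here is a minimal sketch using scikit-learn's KFold to print which sample indices land in the training and test sets on each of the 5 iterations. The toy arrays below are made up purely for illustration (they are not the heart.csv data), and random_state=3 simply mirrors the seed used later in this article.

# minimal sketch: inspecting how KFold assigns samples to folds
# (toy data for illustration only, not the heart.csv dataset)
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

kf = KFold(n_splits=5, shuffle=True, random_state=3)
for i, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    # in each iteration one fold (2 samples here) is held out for testing
    print(f'Fold {i}: train indices {train_idx}, test indices {test_idx}')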


K-Fold Implementation

  1. Importing all the dependencies
# importing dependencies
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
import warnings
warnings.simplefilter('ignore')
  2. Importing models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
  3. Reading the data
df = pd.read_csv('heart.csv')
df.head()
  4. Separating the features (independent variables) from the target (dependent variable)
X = df.drop(columns='target')
y = df['target']
  5. Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=3)
  6. Model training
algorithm = [LogisticRegression(max_iter=100), SVC(kernel='linear'), KNeighborsClassifier(), RandomForestClassifier()]
def compare_models_performance():
    # fit each model on the single train/test split and report its test accuracy
    for ml in algorithm:
        ml.fit(X_train, y_train)
        y_pred = ml.predict(X_test)
        score = accuracy_score(y_test, y_pred)
        print(f'Accuracy Score of model {ml} =', score*100)
compare_models_performance()
Output
Accuracy Score of model LogisticRegression() = 80.32786885245902
Accuracy Score of model SVC(kernel='linear') = 77.04918032786885
Accuracy Score of model KNeighborsClassifier() = 65.57377049180327
Accuracy Score of model RandomForestClassifier() = 78.68852459016394
  7. Performing cross-validation
algorithm = [LogisticRegression(max_iter=100), SVC(kernel='linear'), KNeighborsClassifier(), RandomForestClassifier()]
score_list = []
def model_train_cv():
    # run 5-fold cross-validation for each model and record its mean accuracy
    for model in algorithm:
        cv_score = cross_val_score(model, X, y, cv=5)
        mean_accuracy = round(cv_score.mean()*100, 2)
        score_list.append(mean_accuracy)
        print(f'Cross validation accuracy {model}: {mean_accuracy}')
    best_score = max(score_list)  # use the built-in max instead of shadowing it
    print(f'Best Score is {best_score}')
model_train_cv()
Output
Cross validation accuracy LogisticRegression(): 83.15
Cross validation accuracy SVC(kernel='linear'): 82.83
Cross validation accuracy KNeighborsClassifier(): 64.39
Cross validation accuracy RandomForestClassifier(): 81.49
Best Score is 83.15
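A side note: when an integer cv=5 is passed together with a classifier, cross_val_score uses stratified folds by default. If you want explicit control over the splitting (number of folds, shuffling, random seed), you can pass a fold object directly. The sketch below reuses the X and y defined in the steps above; the shuffle and random_state values are illustrative choices, not values prescribed by the article.

# minimal sketch: passing an explicit fold splitter instead of cv=5
# (StratifiedKFold preserves the class ratio of 'target' in every fold;
#  n_splits/shuffle/random_state here are illustrative choices)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)
cv_score = cross_val_score(LogisticRegression(max_iter=100), X, y, cv=skf)
print(f'Stratified 5-fold accuracy: {round(cv_score.mean()*100, 2)}')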