Cross Validation Methods

Hi,

In this blog, I will explain a few other cross-validation methods used in statistics.

Stratified cross-validation.

This technique is closely related to k-fold validation. Because standard k-fold divides the data into k folds by random sampling with a uniform distribution, it can give unreliable estimates on imbalanced datasets. Stratified k-fold is the improved version of k-fold cross-validation: the dataset is still divided into k equal folds, but the ratio of target classes in each fold is kept the same as in the entire dataset. This allows it to work well for imbalanced datasets.

For example, suppose the original dataset contains far fewer Teenagers than Adults, so the distribution of this target variable is imbalanced. With stratified k-fold cross-validation, this ratio of target classes is preserved in every fold.
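
As a quick illustration, here is a minimal sketch using scikit-learn's StratifiedKFold. The toy dataset and the logistic regression classifier below are only placeholder choices for this example.

```python
# A minimal sketch of stratified k-fold cross-validation with scikit-learn.
# The dataset (X, y) and the LogisticRegression model are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

# Imbalanced toy dataset: roughly 90% of one class, 10% of the other.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in skf.split(X, y):
    # Each fold preserves the ~90/10 class ratio of the full dataset.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("Mean accuracy across folds:", np.mean(scores))
```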

Monte Carlo cross-validation.

The Monte Carlo CV approach repeatedly splits the whole dataset at random into a training set and a test set. This approach is intended to overcome several difficulties, especially when working with constrained or small datasets. When the dataset is small, techniques like k-fold cross-validation can produce very small test sets that might not give accurate estimates of model performance.

Any split percentage we choose can be used, such as 70%/30% or 60%/40%. The only requirement is that each iteration uses a different random train-test split.


The next step is to fit the model on the training set for that iteration and calculate the accuracy of the fitted model on the test set. Repeat these iterations many times (50, 100, 500 or even more) and take the average of all the test errors to judge how well your model performs.
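
Here is a minimal sketch of this idea using scikit-learn's ShuffleSplit, which repeatedly draws random train-test splits; the dataset and model are again only placeholders.

```python
# A minimal sketch of Monte Carlo (repeated random split) cross-validation
# using scikit-learn's ShuffleSplit; the dataset and model are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, random_state=0)

# 100 different random 70/30 train-test splits.
mc_cv = ShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
scores = []
for train_idx, test_idx in mc_cv.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Average test accuracy over all random splits.
print("Mean accuracy over 100 splits:", np.mean(scores))
```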


Benefits: This approach can be used to:

  1. Determine how well your model generalizes across different random partitions of the data.
  2. Obtain more reliable performance estimates on small datasets.

Time series cross-validation.

Data collected over a period of time is referred to as a time series. This type of data helps us understand which variables are influenced by which factors over time. Some examples of time series data are weather records, stock market prices, etc.

Cross-validation is not straightforward for time series datasets, because the training and test sets cannot be selected at random from among the data points. Instead, this technique uses time as the key component when cross-validating time series data.

The dataset is divided into training and test sets according to time because the order of the data is crucial for time series-related problems.

For instance, start training with a small subset of the data, forecast the later data points, and check their accuracy. The forecasted data points are then included as part of the next training set, and the following data points are forecasted. This process continues.
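
A minimal sketch of this expanding-window scheme using scikit-learn's TimeSeriesSplit is shown below; the synthetic series and linear model are placeholders for this example.

```python
# A minimal sketch of time series cross-validation with scikit-learn's
# TimeSeriesSplit; the synthetic series and model are placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error

# Simple synthetic series: predict y(t) from the time index t.
t = np.arange(100).reshape(-1, 1)
y = 0.5 * t.ravel() + np.random.RandomState(0).normal(scale=2.0, size=100)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(t)):
    # Training indices always come before test indices, so the model
    # never sees "future" data during training.
    model = LinearRegression().fit(t[train_idx], y[train_idx])
    mse = mean_squared_error(y[test_idx], model.predict(t[test_idx]))
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}, MSE={mse:.2f}")
```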

Advantage: This method reduces the possibility of data leakage by simulating real-life situations where information from the future is not available during training. This makes the evaluation more realistic.

Thank You!!
