Cross-validation: compelling reasons

I found myself immersed in cross-validation methods today. Cross-validation helps us assess the performance of our models and ensure they generalize well to unseen data.

 

Cross-validation is like a quality control checkpoint for our models. It helps us:

 

  1. Robust Model Evaluation: Cross-validation provides a robust and realistic evaluation of a model’s performance. Instead of relying solely on a single train-test split, it leverages multiple subsets of the data for training and testing. This gives a more comprehensive picture of how well the model generalizes to unseen data.

 

  2. Guards Against Overfitting: Overfitting is a common problem where a model becomes too tailored to the training data and performs poorly on new data. Cross-validation acts as a safeguard by revealing instances where a model may be overfitting: if a model performs well on training data but poorly on validation data, cross-validation exposes the issue.

 

  3. Utilizes Data Effectively: In situations where data is limited, cross-validation makes efficient use of the available information. It ensures that every data point contributes to both training and testing, thereby maximizing the use of the dataset.

 

  4. Applicability to Various Datasets: Cross-validation is versatile and can be applied to a wide range of datasets, regardless of size or characteristics, whether small or large, balanced or imbalanced.

5 Fold Cross Validation Method on Diabetes, Obesity and Inactivity Data Sets

Hi,

During today’s class, I learned about K-Fold cross validation using polynomial models to analyze Diabetes, Obesity, and Inactivity data.

 

There are 354 data points common to all three data sets (Diabetes, Obesity, and Inactivity). First, we compute the training error on the full dataset. As training progresses, the error generally decreases as the model learns to fit the training data better; therefore, the training error curve should steadily decrease.

 

Using the K-Fold cross-validation method, we estimate the test error from the 354 data points. For a good balance, we split the data into 5 folds; the K value typically ranges from 5 to 10. If we choose a smaller K, each model is trained on a smaller portion of the data, which can lead to higher bias. On the other hand, if we use a very large K, we’ll have many folds, but each fold will contain very few data points, which can lead to higher variance in our model performance estimate. Hence, a value between 5 and 10 is a balanced choice. Since we are using 5 folds, the data groups will be separated into 71, 71, 71, 71, and 70.

 

Any repetitive or duplicate data points in the dataset must first be identified; with that done, this CV procedure effectively fits all 354 data points. Moving on to the distinct data groups: in the first iteration, we randomly select one group of 71 data points to serve as test data, while the remaining 4 groups (71, 71, 71, and 70) serve as training (TRN) data, and we calculate the MSE on the test data. In the same manner, the 2nd group serves as TEST data and the remaining 4 groups (71, 71, 71, and 70) as TRN data for the second iteration, and we again calculate the MSE on the TEST data. The same process is repeated for all 5 iterations, and an MSE is obtained for each. This way we can see the test error start to increase once the model begins to overfit.
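To make this concrete, here is a minimal sketch of the procedure in Python with scikit-learn. The arrays are synthetic stand-ins for the 354 common data points, and the degree-2 pipeline is just one example of a polynomial model, not the exact model used in class:

```python
# Minimal sketch of 5-fold CV with a polynomial model (synthetic data).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(10, 25, size=(354, 1))          # placeholder predictor
y = 0.3 * X[:, 0] + rng.normal(0, 1, 354)       # placeholder response

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # folds of 71/71/71/71/70
fold_mses = []
for train_idx, test_idx in kf.split(X):
    # 4 folds (~283 points) train the model; the remaining fold is held out.
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(X[train_idx], y[train_idx])
    fold_mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("Per-fold test MSE:", np.round(fold_mses, 3))
print("Mean CV MSE:", np.mean(fold_mses))
```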

 

At the end of the class, I had a question for the instructor about another commonly used method, stratified cross-validation, which is quite similar to the K-Fold cross-validation method. I have sent an email about it and am waiting for the instructor’s insights.

 

Thank You!!

T-Statistic test on Crab Shell Data

Hello,

In today’s blog, I want to describe the t-test on the crab shell example, which was discussed in last week’s class.

The claim that there is no difference between the means of the pre-molt and post-molt data is the Null Hypothesis (H0). Conversely, the claim that there is a difference in means between the pre-molt and post-molt data is the Alternative Hypothesis (H1).

Next, we calculate the t-statistic and degrees of freedom (df) using the formulas below.

  • t-statistic = (difference in sample means) / (standard error of the difference)
  • df = n1 + n2 − 2

Where, n1 = sample size of the first group,

n2 = sample size of the second group.

 

Using these two values (the t-statistic and the degrees of freedom), we find the p-value.

 

As per last Wednesday’s class example regarding crab shells, we must analyze the pre-molt and post-molt crab shell size data for differences. If we assume there is no difference between the pre-molt and post-molt shell sizes, that is the Null Hypothesis (H0). Conversely, if we assume there is a difference in shell size between the pre-molt and post-molt data, that is the Alternative Hypothesis (H1).

Here, let’s consider the Alternative Hypothesis (H1) and run a t-test on the data.

 

As mentioned above, we compute the value using the t-statistic formula and then proceed to find the p-value. If the p-value is small, it means there is likely a real and meaningful difference between the two datasets.

 

Based on the calculated p-value: if it is less than 0.05, we can conclude there is enough evidence of a difference between the two datasets; otherwise, we cannot.
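For reference, here is a minimal sketch of this two-sample t-test in Python with scipy; the pre_molt and post_molt arrays are invented stand-ins for the actual crab shell measurements:

```python
# Minimal sketch of a two-sample t-test (synthetic crab shell data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pre_molt = rng.normal(129, 10, 100)    # placeholder pre-molt shell sizes
post_molt = rng.normal(143, 10, 100)   # placeholder post-molt shell sizes

t_stat, p_value = stats.ttest_ind(pre_molt, post_molt)
df = len(pre_molt) + len(post_molt) - 2  # df = n1 + n2 - 2

print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.4g}")
if p_value < 0.05:
    print("Reject H0: evidence of a difference between pre- and post-molt sizes.")
else:
    print("Fail to reject H0: no significant difference detected.")
```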

 

Thank You!

 

 

 

Cross Validation Methods

Hi,

In this blog, I will explain a few other cross-validation methods used in statistics.

Stratified cross-validation.

This validation technique is comparable to k-fold validation. Since plain k-fold divides the data into k folds without preserving the class distribution, it can perform poorly on imbalanced datasets. Stratified k-fold is the improved k-fold cross-validation technique: the dataset is still divided into k equal folds, but the ratio of target variable classes in each fold is the same as in the entire dataset. This enables it to work well for imbalanced datasets.

For example, if the original dataset contains far fewer Teenagers than Adults, the target variable distribution is imbalanced. In the stratified k-fold cross-validation technique, this ratio of target variable instances is maintained across all the folds.
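A minimal sketch of this in Python with scikit-learn’s StratifiedKFold, using a synthetic imbalanced target (the 10/90 split stands in for the Teenager/Adult example):

```python
# Minimal sketch of stratified k-fold on an imbalanced target.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)   # 10% "Teenager", 90% "Adult": imbalanced

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each test fold keeps the 10:90 class ratio of the full dataset.
    ratio = y[test_idx].mean()
    print(f"Fold {i}: test size = {len(test_idx)}, minority share = {ratio:.2f}")
```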

Monte Carlo cross-validation.

The Monte Carlo CV approach repeatedly splits the whole dataset at random into training and test sets. This approach is intended to overcome several difficulties, especially when working with constrained or small datasets. When the dataset is small, it can be difficult to use cross-validation techniques like k-fold cross-validation, because doing so can produce very small test sets that might not give accurate estimates of model performance.

Any split percentage can be chosen, such as 70%/30% or 60%/40%; the only requirement is that each iteration draws a new random train-test split.

 

The next step is to fit the model on the training set in that iteration and calculate the accuracy of the fitted model on the test set. Repeat these iterations many times (50, 100, 500, or even more) and take the average of all the test errors to conclude how well your model performs.

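A minimal sketch of Monte Carlo CV using scikit-learn’s ShuffleSplit (repeated random 70/30 splits); the model and the small synthetic dataset are placeholders:

```python
# Minimal sketch of Monte Carlo (repeated random split) cross-validation.
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 1))                 # a deliberately small dataset
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, 60)

# 100 random 70/30 splits; averaging the scores gives the CV estimate.
mc_cv = ShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=mc_cv,
                         scoring="neg_mean_squared_error")
print("Average test MSE over 100 random splits:", -scores.mean())
```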

 

Benefits: This method can be used to:

  1. Determine how effectively your model generalizes across different random data partitions.
  2. Obtain more reliable performance estimates from small datasets.

Time series cross-validation.

Data collected over a period of time is referred to as a time series. With this type of data, one can understand which variables are influenced by which factors over time. Some examples of time series data are weather records, stock market prices, etc.

Cross-validation is not straightforward for time series datasets: the test and training sets cannot be randomly selected from among the data points. Therefore, this technique uses time as the key component for cross-validating time series data.

The dataset is divided into training and test sets according to time because the order of the data is crucial for time series-related problems.

For instance, start training with a small subset of the data, forecast the later data points, and check their accuracy. The forecasted data points are then included as part of the next training dataset, and the next data points are forecasted, and so on.
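A minimal sketch with scikit-learn’s TimeSeriesSplit, which implements this expanding-window idea on a small synthetic series:

```python
# Minimal sketch of expanding-window time series cross-validation.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(30)  # placeholder for 30 ordered observations
tscv = TimeSeriesSplit(n_splits=4)
for i, (train_idx, test_idx) in enumerate(tscv.split(series), start=1):
    # The training window always precedes the test window in time,
    # so no future information leaks into training.
    print(f"Split {i}: train = [0..{train_idx[-1]}], "
          f"test = [{test_idx[0]}..{test_idx[-1]}]")
```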

Advantage: This method reduces the possibility of data leakage by simulating real-life situations where information from the future is not accessible during model training. This enhances the authenticity of the evaluation process.

Thank You!!

Cross Validation Methods

Hi Team,

Continuing from the previous blog, I explain two cross-validation methods in this post and will write about the rest of the methods in an upcoming blog.

Types of cross-validation: There are several cross-validation methods in statistics. However, in this blog, I’ll explain two of them, as mentioned above.

  1. Leave-one-out cross-validation (LOOCV)

In this CV method, the test set consists of a single observation, and the training set contains the remaining n − 1 observations. We repeat this process n times, excluding a different data point each time.

Disadvantage: When dealing with complex and large datasets, it requires n iterations to validate the model. Also, the estimate can have high variance, since each iteration’s test error depends on a single data point.
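A minimal sketch of LOOCV in Python with scikit-learn, on a small synthetic dataset:

```python
# Minimal sketch of leave-one-out cross-validation (LOOCV).
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 1))
y = 1.5 * X[:, 0] + rng.normal(0, 0.3, 20)

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # Each iteration holds out exactly one observation as the test set.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(f"{len(errors)} iterations; LOOCV MSE = {np.mean(errors):.4f}")
```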

 

  2. K-Fold cross-validation

This is a commonly used CV method for assessing performance and preventing overfitting. In this technique, the whole dataset is partitioned into k parts of equal size. It’s known as k-fold since there are k parts, where k can be any number.

 

For instance, suppose there are 500 records and we choose K = 5, i.e., 5 experiments to perform on these 500 records.

Then, Records/K = 500/5 = 100 records per fold.

 

In Experiment 1, 100 of the 500 records serve as the test set (with the remaining 400 as training data) to find ACCURACY1.

Then we hold out a different 100 of the 500 records to find ACCURACY2 in Experiment 2.

The same approach continues for Experiments 3, 4, and 5 to find ACCURACY3, ACCURACY4, and ACCURACY5.

 

Based on the 5 accuracy values obtained, we can find the MEAN, MIN, and MAX accuracy across the 5 experiments.
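A minimal sketch of this 500-record example in Python with scikit-learn; the classifier and the synthetic dataset are placeholders:

```python
# Minimal sketch of 5-fold CV on 500 records, reporting accuracy stats.
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # 500 / 5 = 100 per fold
accuracies = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("Accuracies:", np.round(accuracies, 3))
print(f"MEAN = {accuracies.mean():.3f}, "
      f"MIN = {accuracies.min():.3f}, MAX = {accuracies.max():.3f}")
```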

 

Thank You!!

Class V : Sep 18, 2023

Hello,

Greetings!

Today we explored the concept of performing regression analysis using two variables.

 

Y = β0 + β1X1 + β2X2 + … + βpXp + ϵ

 

Where, Y = Diabetes dataset

X1 = Inactivity dataset

X2 = Obesity dataset
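A minimal sketch of fitting this two-predictor model with statsmodels; the values below are synthetic stand-ins for the actual diabetes, inactivity, and obesity columns:

```python
# Minimal sketch of two-variable regression: Y = b0 + b1*X1 + b2*X2 + e.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
inactivity = rng.uniform(15, 35, 354)   # X1 (placeholder values)
obesity = rng.uniform(25, 45, 354)      # X2 (placeholder values)
diabetes = (2 + 0.15 * inactivity + 0.10 * obesity
            + rng.normal(0, 0.5, 354))  # Y (placeholder values)

X = sm.add_constant(np.column_stack([inactivity, obesity]))  # adds beta_0
model = sm.OLS(diabetes, X).fit()
print(model.params)  # estimates of beta_0, beta_1, beta_2
```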

 

If the relationship between the independent variable and the dependent variable is parabolic, then a quadratic model is used.

For instance, if the volume of a certain chemical affects the quality of a product, a quadratic model would suggest that there is an optimal volume that maximizes quality: too little or too much of the chemical lowers quality.
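A minimal sketch of fitting such a quadratic with numpy, on invented chemical-volume data:

```python
# Minimal sketch of a quadratic fit: quality = a*volume^2 + b*volume + c.
import numpy as np

volume = np.linspace(0, 10, 50)
quality = (-1.0 * (volume - 5) ** 2 + 25
           + np.random.default_rng(5).normal(0, 1, 50))  # invented data

# np.polyfit returns coefficients from highest degree down: [a, b, c].
a, b, c = np.polyfit(volume, quality, deg=2)
optimal_volume = -b / (2 * a)  # vertex of the parabola (a < 0 gives a peak)
print(f"a = {a:.2f}, optimal volume ~ {optimal_volume:.2f}")
```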

 

Overfitting occurs when a predictive model is too complex and fits the sample data too closely. This can lead to poor performance on new data.

For instance, when predicting stock prices, an overfitted model could fit a small sample of historical data very closely, capturing short-term fluctuations, but it may struggle to provide accurate forecasts for future stock prices.

 

In my next blog, I’ll be posting about cross-validation for assessing the efficacy of the model.

Class IV (Online) : Sep 15, 2023

I plan to use the White test and the Goldfeld-Quandt test instead of the Breusch-Pagan test, identify the limitations of each test for assessing heteroscedasticity in the diabetes and inactivity data models, and then compare the results with the Breusch-Pagan test. This will help me make informed decisions in my analysis.
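As a starting point, here is a minimal sketch of both tests using statsmodels, on synthetic data rather than the actual diabetes/inactivity files:

```python
# Minimal sketch of the White and Goldfeld-Quandt heteroscedasticity tests.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white, het_goldfeldquandt

rng = np.random.default_rng(8)
inactivity = rng.uniform(15, 35, 300)                        # placeholder predictor
diabetes = 1 + 0.2 * inactivity + rng.normal(0, 0.05 * inactivity)  # spread grows

X = sm.add_constant(inactivity)
resid = sm.OLS(diabetes, X).fit().resid

_, white_p, _, _ = het_white(resid, X)       # (lm, p, F, F-p)
_, gq_p, _ = het_goldfeldquandt(diabetes, X)  # (F, p, ordering)
print(f"White test p = {white_p:.4g}, Goldfeld-Quandt p = {gq_p:.4g}")
```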

If I have any unclear topics during my analysis, I will definitely seek clarification from the instructor in the coming class.

Class III : Sep 13, 2023

As mentioned in the last blog post, I have explored and understood the concepts of the least squares linear regression method and kurtosis.

The least squares linear regression method represents the relation between variables in a scatterplot. The procedure fits the line to the data points in a way that minimizes the sum of the squared vertical distances between the line and the points. It is also known as the line of best fit or trendline.

The linear equation: y = b + mx

Where, y = dependent variable,

x = independent variable,

b = y-intercept,

m = slope of the line.

 

To get the value of m, the formula below is used:

m = (NΣ(xy) − ΣxΣy) / (NΣ(x²) − (Σx)²)

To get the value of b, the formula below is used:

b = (Σy − mΣx) / N

Where, N = number of observations.
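A minimal sketch computing m and b directly from these formulas, on made-up observations:

```python
# Minimal sketch of the least squares slope/intercept formulas.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up observations
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
N = len(x)

# m = (N*sum(xy) - sum(x)*sum(y)) / (N*sum(x^2) - (sum(x))^2)
m = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
# b = (sum(y) - m*sum(x)) / N
b = (np.sum(y) - m * np.sum(x)) / N
print(f"Line of best fit: y = {b:.3f} + {m:.3f}x")
```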

Kurtosis is a statistical measure that quantifies the shape of a probability distribution. It provides information about the tails and peakedness of the distribution, and it helps in analyzing the outliers of a data set. Peakedness in a data distribution is the degree to which data values are concentrated around the mean.

  • Positive excess kurtosis indicates heavier tails and a sharper peak than the normal distribution, i.e., kurtosis > 3.
  • Negative excess kurtosis indicates lighter tails and a flatter distribution, i.e., kurtosis < 3.
  • Zero excess kurtosis indicates moderate tails and a medium-peaked curve, the same as the normal distribution, i.e., kurtosis = 3.
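A minimal sketch comparing kurtosis values with scipy on synthetic samples; fisher=False reports plain (non-excess) kurtosis, where the normal distribution scores 3:

```python
# Minimal sketch of kurtosis for heavy-, normal-, and light-tailed samples.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(6)
normal = rng.normal(size=100_000)
heavy_tailed = rng.standard_t(df=5, size=100_000)   # heavier tails (> 3)
light_tailed = rng.uniform(-1, 1, size=100_000)     # lighter tails (< 3)

for name, data in [("normal", normal), ("t(5)", heavy_tailed),
                   ("uniform", light_tailed)]:
    print(f"{name}: kurtosis = {kurtosis(data, fisher=False):.2f}")
```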

 

Heteroscedasticity in Regression Analysis 

Heteroscedasticity means unequal scatter; in regression analysis, we talk about heteroscedasticity in the context of residuals. Heteroscedasticity is a change in the spread of the residuals over the range of measured values, and it produces a cone or funnel shape in residual plots.

 

Breusch-Pagan test is used to test for heteroscedasticity in regression analysis. 

As discussed in today’s class, the Breusch-Pagan test can be applied to a coin toss scenario: testing whether a coin is fair, meaning it has an equal probability of landing heads or tails. Suppose we flip the coin 100 times and record the outcomes.

There are two hypotheses, as below:

  

Null Hypothesis (H0): There is no heteroscedasticity in the coin toss data, implying that the variance of the outcomes (heads or tails) remains constant across all tosses. 

Alternative Hypothesis (Ha): There is heteroscedasticity in the coin toss data, suggesting that the variance of the outcomes (heads or tails) is not constant and may vary across the tosses. 

  • If the p-value is greater than the significance level (e.g., 0.05), we do not have enough evidence to reject the null hypothesis. This suggests that the variance of the coin toss outcomes remains relatively constant, and there is no significant heteroscedasticity.

 

  • If the p-value is less than the significance level, we may reject the null hypothesis in favor of the alternative hypothesis. This implies there is evidence of heteroscedasticity, indicating that the variance of the coin toss outcomes may not be constant and the coin’s behavior may vary across the tosses.
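A minimal sketch of the Breusch-Pagan test with statsmodels, on synthetic and deliberately heteroscedastic regression data:

```python
# Minimal sketch of the Breusch-Pagan heteroscedasticity test.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)   # residual spread grows with x

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value = {lm_pvalue:.4g}")
# p < 0.05 -> reject H0 of constant variance (heteroscedasticity present)
```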

 

 

1st Session

As per the first assignment regarding the Centers for Disease Control and Prevention 2018 data, I’ve analyzed the three data sets, i.e., Diabetes, Obesity, and Inactivity.

 

The Diabetes dataset has 3000+ observations, Obesity has 300+ observations, whereas Inactivity has 1600+ observations.

 

I understood that the three data sets share similar variables (columns), each with its respective data.

 

I figured out that FIPS can serve as the primary variable to retrieve common observations from the three data sets. With it, we can find how many counties have diabetes together with obesity, diabetes together with inactivity, and obesity together with inactivity.

 

 

Hence, using an intersection methodology in any programming language, we can retrieve the common observations from these three datasets.

 

Doing that, we get 354 common observations, using FIPS as the primary variable across all three data sets.

 

Comparing just the diabetes and inactivity datasets, we get 1370 common observations using the same FIPS key.
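A minimal sketch of this FIPS intersection with pandas; the file names and column name are assumptions about how the CDC data might be stored:

```python
# Minimal sketch of intersecting the three datasets on the FIPS code.
import pandas as pd

diabetes = pd.read_csv("diabetes.csv")      # hypothetical file names
obesity = pd.read_csv("obesity.csv")
inactivity = pd.read_csv("inactivity.csv")

# Inner merges keep only counties (FIPS codes) present in every dataset.
common = (diabetes.merge(obesity, on="FIPS")
                  .merge(inactivity, on="FIPS"))
print(len(common))  # 354 in the class example

# Pairwise overlap, e.g., diabetes with inactivity:
diab_inact = diabetes.merge(inactivity, on="FIPS")
print(len(diab_inact))  # 1370 in the class example
```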

 

I understood that 1370 counties have individuals with both diabetes and inactivity, which we can summarize with mean, median, skewness, and kurtosis values.

 

As per today’s class (September 11), I understood how to represent this retrieved data in a histogram and how to figure out whether the data is clean based on the mean, median, skewness, and kurtosis calculations.

 

Initially, during today’s class, I was confused by the kurtosis calculation and the least squares linear model; however, by the end of the class I understood them in pieces. Very soon I will explore and understand more about the kurtosis calculation and the least squares linear model.