Project 2 – Geospatial Analysis and Clustering of Police Shootings: Uncovering Patterns and Insights
Handling of Missing Geo-coordinates in Python – Nov 8th
I wrote a Python script that uses the Pandas library to deal with missing longitude and latitude data in a dataset. The script starts by reading a CSV file called 'Data.csv' and creating a Pandas DataFrame (data_df). The dataset contains city-level records, but some of the geographic coordinates are missing.
To tackle this problem, the script generates two dictionaries: one for matching cities to their longitude values (city_longitude_mapping) and another for latitude values (city_latitude_mapping). These dictionaries are created by removing rows with missing values in the respective columns, eliminating duplicate cities, and setting the city as the index.
The script then uses the Pandas fillna method to fill in the missing values in the ‘longitude’ and ‘latitude’ columns. It does this by utilizing the dictionaries created earlier to map each city to its corresponding geo-coordinate, effectively filling in the gaps in the dataset.
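Here is a minimal sketch of that idea, assuming a CSV named 'Data.csv' with 'city', 'longitude', and 'latitude' columns (the exact column names are assumptions based on the description above):

```python
import pandas as pd

# Read the dataset (file and column names are assumed for illustration)
data_df = pd.read_csv("Data.csv")

# Build city -> coordinate lookups from the rows that do have values
city_longitude_mapping = (
    data_df.dropna(subset=["longitude"])
           .drop_duplicates(subset=["city"])
           .set_index("city")["longitude"]
)
city_latitude_mapping = (
    data_df.dropna(subset=["latitude"])
           .drop_duplicates(subset=["city"])
           .set_index("city")["latitude"]
)

# Fill the gaps by mapping each city to its known coordinates
data_df["longitude"] = data_df["longitude"].fillna(data_df["city"].map(city_longitude_mapping))
data_df["latitude"] = data_df["latitude"].fillna(data_df["city"].map(city_latitude_mapping))
```

Cities that never appear with coordinates anywhere in the file will, of course, remain missing and need another strategy, such as geocoding.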
Thank You!
Optimizing K-Nearest Neighbors with Elbow Method for Latitude and Longitude Variables – Nov 6th
The K-Nearest Neighbors (KNN) model is crucial for classifying latitude and longitude data based on proximity to other data points. However, determining the optimal number of neighbors (K) is challenging. To address this, we adapt the elbow method used for clustering algorithms.
First, we ensure the dataset includes the latitude and longitude variables, and we can create a new feature combining these values to better capture spatial relationships.
Next, we select a range of candidate K values and train a KNN model for each one, evaluating performance with metrics like accuracy, precision, recall, and F1 score.
Then we plot the performance metrics against the range of K values. Initially, performance improves as neighbors are added, but at a certain point adding more no longer improves it significantly. This point is the elbow.
The K value at the elbow point is the optimal number of neighbors; using it fine-tunes the model and yields more precise classification outcomes for latitude and longitude data.
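A rough sketch of this procedure, assuming a prepared feature matrix X (latitude/longitude plus any other encoded variables) and a label vector y already exist:

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# X (features) and y (labels) are assumed to be prepared beforehand
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

k_values = range(1, 31)
scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, knn.predict(X_test)))

# Plot accuracy vs. K and look for the elbow where improvement flattens out
plt.plot(list(k_values), scores, marker="o")
plt.xlabel("Number of neighbors (K)")
plt.ylabel("Accuracy")
plt.title("Choosing K with the elbow method")
plt.show()
```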
Thanks!
Decision Trees – Nov 3rd
This blog explains the Decision Trees machine learning prediction method, which can be effectively used in our Project 2 dataset.
Decision trees are great at handling datasets with multiple variables, which makes them perfect for situations where states may show different patterns based on various factors.
To prepare our data, we should handle missing values by either imputing or removing them to ensure a clean dataset. We should also encode categorical variables into a numerical format for effective modeling.
Next, identify relevant features that influence state characteristics, considering demographics and other factors. Then, split our dataset into training and testing sets to assess the model’s performance.
Further, we train the decision tree model on the training dataset using the identified features and evaluate its performance on the testing set using metrics like accuracy, precision, and recall. Visualizing the decision tree helps interpret the decision-making process and understand the factors influencing dataset characteristics.
Consider hyperparameter tuning by adjusting the tree depth for optimal performance. Finally, utilize the trained Decision Tree model to predict outcomes for new data points, gaining valuable insights into the factors shaping state characteristics.
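As a hedged sketch of these steps (a prepared DataFrame df with encoded features and a 'target' column is an assumption for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report

# df and its 'target' column are placeholders for the prepared Project 2 data
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# max_depth is the hyperparameter mentioned above; shallow trees are easier to interpret
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

# Accuracy, precision, and recall on the held-out test set
print(classification_report(y_test, tree.predict(X_test)))

# Visualize the decision paths
plot_tree(tree, feature_names=list(X.columns), filled=True)
plt.show()
```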
In conclusion, Decision Trees provide a structured approach to navigating the complexities of decision-making within the dataset. By interpreting the decision paths, we can gain a deeper understanding of the factors influencing dataset characteristics, contributing to informed decision-making in various fields.
Thank You!!
KNN algorithm for data classification – Nov 1
This blog explains the KNN (K-Nearest Neighbors) machine learning algorithm, which can be effectively used on our Project 2 dataset.
KNN operates on the principle of similarity, making it ideal for datasets where states with similar features tend to cluster together.
Handle Missing Values: Ensure a complete dataset by either imputing or removing missing values.
Feature Scaling: Normalize numerical features to ensure equal importance during distance calculations.
Feature Selection: Identify relevant features for our analysis, considering factors such as demographics, city, and state variables that may influence characteristics.
Data Splitting: Split our dataset into training and testing sets to evaluate the model’s performance.
Choosing K: Determine the best value for K (number of neighbors) through techniques like cross-validation.
Model Training: Train the KNN model on the training dataset using the selected features.
Model Evaluation: Assess the model’s performance on the testing set using metrics like accuracy, precision, and recall.
Prediction: Utilize the trained KNN model to predict the cluster or category of new data points, revealing hidden patterns within the dataset.
KNN, with its focus on proximity and similarity, becomes a valuable machine learning algorithm for uncovering patterns and relationships within the dataset. By exploring the dataset based on shared characteristics, KNN offers a unique perspective on geographical dynamics.
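A small sketch tying these steps together, with scaling and cross-validated selection of K (X and y are assumed to be the encoded features and labels):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# X and y are placeholders for the prepared features and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features so each one contributes equally to the distance calculation, then fit KNN
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Choose K with 5-fold cross-validation on the training set
grid = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 31))}, cv=5)
grid.fit(X_train, y_train)

print("Best K:", grid.best_params_["knn__n_neighbors"])
print("Test accuracy:", grid.best_estimator_.score(X_test, y_test))
```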
Thank You!!
The Advantages and Challenges of DBSCAN – Oct 30th
One of the major benefits of DBSCAN is its ability to accurately represent geographical patterns within states, regardless of their shape. Unlike other clustering algorithms that assume clusters to be of a certain shape, DBSCAN can handle clusters of any form. This makes it a reliable algorithm for identifying and understanding spatial patterns in the data.
Another advantage of DBSCAN is its robustness when it comes to dealing with noise and outliers. Geospatial datasets often contain irregularities and outliers, which can affect the accuracy of clustering algorithms. However, DBSCAN is designed to handle such noise and outliers effectively. It can identify and classify them as noise points, ensuring that they do not disrupt the clustering process.
However, setting the parameters in DBSCAN can be a challenging task. The distance threshold and minimum points for clustering are crucial parameters that need to be carefully chosen. If the distance threshold is set too high, clusters that should be separate may be merged, resulting in oversimplified cluster structures. On the other hand, if the distance threshold is set too low, the clustering becomes overly fragmented, and points that should be part of the same cluster end up in separate clusters or are labeled as noise. Finding the right balance is essential for obtaining meaningful insights from the geospatial data.
Additionally, DBSCAN copes well with imperfect geospatial data: records with missing coordinates can be imputed beforehand or excluded without derailing the clustering of the remaining points. This is particularly useful when dealing with incomplete datasets, but it again requires careful parameter tuning. The choice of distance threshold and minimum points can have a significant impact on the clustering results, especially when some coordinates had to be imputed. It is important to consider the implications of missing data and adjust the parameters accordingly.
Despite these considerations, DBSCAN remains a powerful algorithm for mapping the relationships between USA states. It provides a unique perspective on the spatial connections and patterns within the data.
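One common way to narrow down the distance threshold (eps) is a sorted k-distance plot, sketched below under the assumption that coords is an array of latitude/longitude pairs:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# coords: array of shape (n_samples, 2) with latitude/longitude (assumed to exist)
k = 5  # roughly the min_samples value you intend to use
nn = NearestNeighbors(n_neighbors=k).fit(coords)
distances, _ = nn.kneighbors(coords)

# Sort each point's distance to its k-th nearest neighbor; the "knee" of the curve suggests eps
k_distances = np.sort(distances[:, -1])
plt.plot(k_distances)
plt.xlabel("Points sorted by distance")
plt.ylabel(f"Distance to {k}-th nearest neighbor")
plt.show()
```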
DBSCAN Clustering – Oct 27th
In this blog, I am going to describe the DBSCAN clustering algorithm.
DBSCAN operates by defining a neighborhood of radius epsilon around each data point and then determining whether the density of points within that neighborhood meets a specified threshold. The epsilon parameter determines the maximum distance between points for them to be considered neighbors (and hence part of the same cluster), while a second parameter specifies the minimum number of points a neighborhood must contain for its center to be considered a core point.
By analyzing the density of points, DBSCAN can identify core points, which are surrounded by a sufficient number of neighboring points, as well as border points, which fall within the neighborhood of a core point but do not have enough neighbors to be core points themselves. Isolated points that do not belong to any cluster are labeled as noise.
In the context of the project dataset, DBSCAN can effectively identify clusters of states that may have irregular shapes or non-traditional spatial distributions. This is particularly useful when analyzing datasets where the geographical distribution of states does not conform to typical cluster shapes.
Furthermore, DBSCAN degrades gracefully when the dataset has missing coordinates. Rows without latitude and longitude values can be imputed or set aside, and because DBSCAN relies on local density rather than global centroids, the clusters found in the remaining points are not distorted. This adaptability allows for more comprehensive analysis and insights, even when dealing with incomplete or imperfect geospatial data.
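A minimal sketch of running DBSCAN on the geo-coordinates, assuming missing values have already been imputed or dropped (the DataFrame and column names are assumptions):

```python
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Use only rows with complete coordinates (or impute them first, as in the Nov 8th post)
coords = data_df[["latitude", "longitude"]].dropna()

# Scale so that eps means the same thing in both dimensions
scaled = StandardScaler().fit_transform(coords)

# eps and min_samples are placeholders and would need tuning on the real data
db = DBSCAN(eps=0.3, min_samples=5).fit(scaled)

# A label of -1 marks noise points; all other labels are cluster IDs
coords = coords.assign(cluster=db.labels_)
print(coords["cluster"].value_counts())
```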
Thank You
Advantages, Disadvantages, and Limitations of Hierarchical Clustering – Oct 25th
I wanted to state the advantages, disadvantages, and limitations of using the Hierarchical clustering algorithm for Project 2.
It is a powerful method for uncovering complex relationships within a dataset, and it has several advantages over other clustering methods. One of its strengths is its ability to handle mixed data types, which is especially useful when analyzing datasets that contain both numerical and categorical variables.
However, as datasets get larger, the computational complexity of Hierarchical Clustering increases, and efficient algorithms become necessary to handle the analysis. It’s also important to carefully choose linkage methods and distance metrics to ensure the dendrogram’s structure is accurate. Linkage methods determine how clusters are merged together, while distance metrics measure the similarity or dissimilarity between data points. Choosing the right combination of linkage method and distance metric is crucial to obtaining meaningful results.
Another consideration when using Hierarchical Clustering is missing data. While the method can handle missing data well, it may not be the best choice for datasets with many missing values. In such cases, imputation techniques may be necessary to fill in the missing values before clustering.
Despite these considerations, the method’s ability to reveal hierarchical relationships makes it an essential algorithm for thorough analysis. Hierarchical Clustering can help identify patterns and relationships within a dataset that may not be immediately apparent. With careful consideration of the factors mentioned above, Hierarchical Clustering can be a powerful algorithm for uncovering complex relationships within a dataset.
Hierarchical Clustering – Oct 23rd
As in previous blogs, here I'm interested in sharing what I've learned about the hierarchical clustering machine learning method.
Hierarchical clustering is also a powerful machine learning algorithm for data analysis that allows us to explore the relationships between different variables in a dataset. By creating a dendrogram, we can see how different states are related to each other based on a variety of factors, such as race, gender, or age.
One of the key benefits of hierarchical clustering is its ability to handle both numerical and categorical data. This means that we can include a wide range of variables in our analysis. By looking at the relationships between such variables, we can gain a deeper understanding of the complex factors that shape our dataset model.
Another advantage of hierarchical clustering is its flexibility. We can adjust the parameters of the analysis to focus on specific aspects of the data. This allows us to tailor our analysis to the specific questions we want to answer, and to uncover insights that might be missed by other methods.
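A small sketch of building a dendrogram with SciPy (the numeric feature matrix X and the 'ward' linkage are illustrative assumptions):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# X is assumed to be a numeric feature matrix (categorical variables encoded beforehand)
Z = linkage(X, method="ward", metric="euclidean")

# The dendrogram shows how observations merge into clusters at increasing distances
dendrogram(Z)
plt.xlabel("Observations")
plt.ylabel("Merge distance")
plt.show()
```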
Advantages, Disadvantages, and Limitations of K-Means – Oct 20th
I wanted to state the advantages, disadvantages, and limitations of using the K-means clustering method for our Project 2.
K-Means is a popular clustering algorithm that is widely used for analyzing large datasets. It offers several advantages that make it a useful method for data analysis. One of its main strengths is its efficiency, as it can handle large datasets with a relatively low computational cost. This makes it particularly suitable for analyzing big data, where traditional methods may be computationally expensive.
However, it is important to keep in mind that K-Means has some limitations that can affect its applicability and the accuracy of its results. One of the main assumptions of K-Means is that clusters are spherical and equally sized. This assumption may not hold true for all datasets, especially when dealing with complex and diverse shapes and characteristics, such as for states in the USA in our Project 2. As a result, K-Means may not accurately represent the underlying structure of the data in such cases.
Another limitation of K-Means is its sensitivity to the initial placement of cluster centers. The algorithm starts by randomly initializing the cluster centers, and the final results can vary depending on this initialization. This means that different runs of the algorithm can produce different results, which can be problematic when trying to obtain consistent and reliable clustering outcomes.
Furthermore, careful preprocessing of the data is necessary when using K-Means. This is particularly important when dealing with missing values in the dataset. Imputation methods, which are used to fill in missing values, can have a significant impact on the clustering outcome. Different imputation methods can lead to different results, and the choice of imputation method should be carefully considered to ensure the validity of the clustering analysis.
Thank You!
K Means Clustering Pattern – Oct 18th
In this blog, I'm going to explain what I have explored about the K-Means clustering machine learning method.
K-Means clustering is a popular machine-learning technique that is widely used in data analysis and pattern recognition. It is powerful for uncovering hidden patterns and relationships in large datasets, making it an ideal method for analyzing the diverse information collected from different states.
The K-Means algorithm works by dividing a dataset into a predetermined number of clusters, with each cluster representing a group of data points that share similar characteristics. The algorithm assigns each data point to the closest cluster center based on its distance from the center. The process continues until the algorithm converges and the clusters are formed.
K-Means works directly on numerical features, so categorical variables such as City, County, and State need to be encoded before clustering, while latitude and longitude can be used as they are. By clustering states on these features, we can identify groups of states that share similar characteristics. This can help us uncover patterns and trends that may not be immediately apparent when looking at the data as a whole.
Overall, K-Means clustering is useful for uncovering patterns and relationships in large datasets. By using this method to analyze state-level data, we can gain valuable insights into the similarities and develop more effective interpretations.
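As a hedged sketch of the idea (using only the numeric geo-features; City, County, and State would need encoding first, and the DataFrame name is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Numeric geo-features; categorical columns would need one-hot or similar encoding first
features = data_df[["latitude", "longitude"]].dropna()
scaled = StandardScaler().fit_transform(features)

# n_clusters is a placeholder; in practice it is chosen with the elbow method or silhouette score
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
features = features.assign(cluster=kmeans.fit_predict(scaled))

# Average coordinates per cluster give a quick sense of the regional groupings
print(features.groupby("cluster").mean())
```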
Thank You!!
Dealing with Missing Data 2 – Oct 16th
In this blog, here is a common strategy for handling missing data for the different types of variables in this dataset.
For categorical variables like “City,” “County,” “Armed,” “Gender,” “Flee,” and “Race,” it is crucial to fill in the missing values in a way that preserves the integrity and reliability of the data analysis. The mode value substitution method offers a straightforward approach to addressing this issue.
The mode is the value that occurs most frequently in each of these variables, so replacing missing entries with the mode keeps the overall distribution of the variable largely unchanged. This is important because it ensures that any subsequent analysis or modeling performed on the data will not be unduly biased or skewed by the imputed values.
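A minimal sketch of mode substitution for those columns (data_df is assumed to be the loaded DataFrame, with the column names described above):

```python
import pandas as pd

data_df = pd.read_csv("Data.csv")  # file name assumed

# Fill each categorical column's missing entries with its most frequent value
categorical_cols = ["City", "County", "Armed", "Gender", "Flee", "Race"]
for col in categorical_cols:
    mode_value = data_df[col].mode().iloc[0]   # the most frequent value in the column
    data_df[col] = data_df[col].fillna(mode_value)
```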
Dealing with Missing Data 1 – Oct 13th
Let’s talk about how we can make sure our analyses and models are accurate and reliable by addressing missing data. I’ll go over some different approaches for handling missing data in this part 1 and discuss which methods work best for a few types of variables.
- Age
When it comes to numerical variables like “Age”, there are a couple of ways to handle missing values. One option is to fill in the gaps with either the median or a predetermined value. Another approach is to use regression models to estimate the missing values based on other data that are available.
2. Latitude & Longitude
When it comes to mapping and spatial analysis, geospatial data is crucial. If we stumble upon latitude and longitude values that are missing, it might be useful to check out external geocoding services that can assist us in determining coordinates using other location information.
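A hedged sketch of the numeric side, with median fill for Age; geocoding of missing coordinates would rely on an external service (for example geopy) and is only indicated in a comment:

```python
import pandas as pd

data_df = pd.read_csv("Data.csv")  # file name assumed

# Numeric variable: fill missing ages with the median
data_df["Age"] = data_df["Age"].fillna(data_df["Age"].median())

# Latitude/longitude: rows still missing coordinates could be sent to an external
# geocoding service (e.g., geopy's Nominatim), looked up from the City/State columns.
```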
In my next blog, I’ll be posting about the remaining missing value variables.
Data Quality Assessment on Shootings in the United States
As per today’s class, I started exploring a dataset that encompasses various aspects of shootings that occurred in the USA.
My focus was on understanding the data, identifying discrepancies, and addressing missing values in order to prepare the dataset for further analysis.
Missing Values :
The dataset we are working with contains several columns with missing values, including “threat_type,” “City,” “County,” “Latitude & Longitude,” “Age,” and “Race.” These gaps in the data can significantly impact the accuracy and reliability of any analyses or models we wish to build.
Duplicate Records :
One of the specific issues we encountered was the presence of duplicate records in the “armed_with” column. Identifying and addressing these duplicates is essential to avoid skewing our analysis or modeling results.
Discrepancies :
- Gender – Where gender information is absent for the individuals involved as suspects or victims, and there are no identifiable names or other details, a notable concern arises regarding the credibility of the documented shooting.
- County – The dataset has gaps in the county attribute even where the corresponding city is provided. This raises the question of whether the missing county information should be inferred from the available city data.
As I continue, I will further explore techniques to clean and prepare the data, ensuring that it is reliable and fit for analysis.
Project 1 : Developing a Model to Predict Diabetes Rates Using US County Data on Obesity and Physical Activity
Regression Analysis
Today, I began by loading the dataset into a Python DataFrame and performing a linear regression analysis. Specifically, I aimed to predict the percentage of diabetic individuals based on the percentages of obesity and inactivity within each county. The R-squared value measures the goodness of fit of the model, i.e., how well the linear regression explains the variance in diabetes rates.
Upon analyzing the initial linear regression model, I found that the R-squared value was not very high. To improve the model’s performance, I explored a polynomial regression approach, allowing for more complex relationships between the variables. This led to a higher R-squared value, suggesting that a polynomial regression model might better capture the underlying trends in the data.
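A rough sketch of that comparison, assuming the merged county DataFrame has '%diabetic', '%obese', and '%inactive' columns (the exact column names are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# df is assumed to be the merged county-level DataFrame
X = df[["%obese", "%inactive"]]
y = df["%diabetic"]

# Plain linear regression
linear = LinearRegression().fit(X, y)
print("Linear R^2:", r2_score(y, linear.predict(X)))

# A degree-2 polynomial adds squared and interaction terms, allowing curved relationships
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)
print("Polynomial R^2:", r2_score(y, poly.predict(X_poly)))
```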
Cross-validation : compelling reasons
I found myself immersed in cross-validation methods today. Cross-validation helps us assess the performance, potential, and predictions of our models and ensure they generalize well to unseen data.
Cross-validation is like a quality control checkpoint for our models. It helps us:
- Robust Model Evaluation: Cross-validation provides a robust and realistic evaluation of a model’s performance. Instead of relying solely on a single train-test split, it leverages multiple subsets of the data for training and testing. This helps in obtaining a more comprehensive understanding of how well the model generalizes to unseen data.
- Overfitting: Overfitting is a common problem where a model becomes too tailored to the training data and performs poorly on new data. Cross-validation acts as a safeguard by revealing instances where a model may be overfitting: if a model performs well on training data but poorly on validation data, cross-validation will expose the issue.
- Utilizes Data Effectively: In situations where data is limited, cross-validation makes efficient use of the available information. It ensures that every data point contributes to both training and testing, thereby maximizing the use of the dataset.
- Applicability to Various Datasets: Cross-validation is versatile and can be applied to a wide range of datasets, regardless of size or characteristics, whether dealing with small or large datasets and balanced or imbalanced data.
5 Fold Cross Validation Method on Diabetes, Obesity and Inactivity Data Sets
Hi,
During today’s class, I learned about K-Fold cross validation using polynomial models to analyze Diabetes, Obesity, and Inactivity data.
There are 354 data points common to all three data sets (Diabetes, Obesity, and Inactivity). We first compute the training error on the whole set. As training progresses, the error generally decreases as the model learns to fit the training data better, so the training error curve should steadily decrease.
Using the K-Fold cross-validation method, we estimate the test error from the 354 data points. For a good balance, we split the data into 5 folds; the K value typically ranges from 5 to 10. If we choose a smaller K, each held-out fold is a larger portion of the data and the model is trained on smaller subsets, which can lead to higher bias. On the other hand, with a very large K we have many folds, but each fold contains very few data points, which can lead to higher variance in the performance estimates. Hence, a value between 5 and 10 is a balanced choice. Since we are using 5 folds, the data is split into groups of 71, 71, 71, 71, and 70.
Any repeated or duplicate data points in the dataset must then be labeled, so that this CV procedure effectively covers all 354 data points. Moving on to the distinct data groups: in the first iteration, one group of 71 randomly selected points serves as the test data while the remaining 4 groups (71, 71, 71, and 70) serve as training data, and we calculate the MSE on the test data. In the same manner, the 2nd group becomes the test data and the remaining 4 groups the training data for the second iteration, and we again calculate the test MSE. The same process is repeated for all 5 iterations, yielding an MSE for each. This way, we can see the test error start to increase once the model begins to overfit.
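A sketch of that procedure with scikit-learn, assuming X and y are NumPy arrays holding the predictors and the diabetes values for the 354 common data points:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# X, y are assumed to be NumPy arrays for the 354 common data points
X_poly = PolynomialFeatures(degree=2).fit_transform(X)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mse = []
for train_idx, test_idx in kf.split(X_poly):
    model = LinearRegression().fit(X_poly[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X_poly[test_idx])))

print("MSE per fold:", fold_mse)
print("Average test MSE:", np.mean(fold_mse))
```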
At the end of the class, I had a question for the instructor regarding the commonly used stratified cross-validation method, which is quite similar to the K-Fold cross-validation method. I have sent an email about it and am waiting for the instructor's insights.
Thank You!!
T-Statistic test on Crab Shell Data
Hello,
In today's blog, I wanted to describe the t-test on the crab shell example that was discussed in last week's class.
The null hypothesis (H0) states that there is no difference between the means of the pre-molt and post-molt data. The alternative hypothesis (H1) states that there is a difference between the means of the pre-molt and post-molt data.
Later, we must calculate the t-statistic and degrees of freedom (df) using the formulas below.
- t-statistic = (difference in sample means) / (standard error of the difference)
- df = n1 + n2 − 2
Where, n1 = Sample size of the first group.
n2 = Sample size of the second group.
Using these two (t- statistic and degrees of freedom (df)), we must find the p-value.
As in last Wednesday's class example regarding crab shells, we must analyze the pre-molt and post-molt crab shell sizes for differences. Assuming there is no difference between the pre-molt and post-molt shell sizes is the null hypothesis (H0); assuming there is a difference is the alternative hypothesis (H1).
Here, we run a t-test on the data to decide between H0 and H1.
As mentioned above, we compute the t-statistic with the formula and then find the p-value. If the p-value is small, it indicates a real and meaningful difference between the two groups.
Based on the calculated p-value, if it is less than 0.05, we reject the null hypothesis and conclude there is evidence of a difference between the two groups; otherwise, we do not have enough evidence to claim a difference.
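A hedged sketch with SciPy, assuming pre_molt and post_molt are arrays of the crab shell sizes:

```python
from scipy import stats

# pre_molt and post_molt are assumed to be arrays of crab shell sizes
t_stat, p_value = stats.ttest_ind(post_molt, pre_molt)

print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < 0.05:
    print("Reject H0: the pre-molt and post-molt means differ.")
else:
    print("Fail to reject H0: not enough evidence of a difference.")
```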
Thank You!
Cross Validation Methods
Hi,
In this blog, I will be explaining a few other cross-validation methods used in statistics.
Stratified cross-validation.
This validation technique is comparable to k-fold validation. Because plain k-fold divides the data into k folds without regard to class proportions, it is poorly suited to imbalanced datasets. Stratified k-fold is the improved k-fold cross-validation technique: the dataset is still divided into k folds, but the ratio of target classes in each fold matches the ratio in the entire dataset. This enables it to work well for imbalanced datasets.
For example, if the original dataset contains far fewer teenagers than adults, the target variable distribution is imbalanced. In the stratified k-fold cross-validation technique, this ratio of target classes is maintained in every fold.
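A small sketch of stratified splitting with scikit-learn (X and y are assumed NumPy arrays, with y as the imbalanced target):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each fold preserves the class proportions found in y
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    values, counts = np.unique(y[test_idx], return_counts=True)
    print(f"Fold {fold} test-set class counts:", dict(zip(values, counts)))
```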
Monte Carlo cross-validation.
The Monte Carlo CV approach repeatedly divides the data into random training and test sets. It is intended to overcome several difficulties, especially when working with constrained or small datasets: with little data, techniques like k-fold cross-validation can produce very small test sets that may not give accurate estimates of model performance.
Any split percentage can be chosen, such as 70/30 or 60/40; the only requirement is that each iteration uses a fresh random train-test split.
In each iteration, the model is fit on the training set and its accuracy is calculated on the test set. Repeating these iterations many times (50, 100, 500, or even more) and averaging the test errors tells us how well the model performs.
Benefits: This method can be used to:
- Determine how effectively your model generalizes to various random data partitions.
- With tiny datasets, obtain more accurate performance estimations.
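One way to implement the Monte Carlo splitting described above is scikit-learn's ShuffleSplit; a hedged sketch (the data and the model are placeholders):

```python
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.linear_model import LinearRegression

# 100 random 70/30 splits; each iteration reshuffles which rows land in the test set
mc_cv = ShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=mc_cv)

print("Average score over 100 random splits:", scores.mean())
```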
Time series cross-validation.
Data collected over a period is referred to as a time series. One can understand what variables are influenced by what factors throughout time using this type of data. Some examples of time series data are weather records, Stock market, etc.
Cross-validation is not simple in the case of time series datasets. The test set or the train set cannot be randomly selected from among the data examples. Therefore, using time as the key component, this technique is utilized to cross-validate time series data.
The dataset is divided into training and test sets according to time because the order of the data is crucial for time series-related problems.
For instance, start training with a small subset of the data, forecast the later data points, and check their accuracy. The forecasted data points are then included in the next training set, the following points are forecasted, and so on.
Advantage: This approach reduces the possibility of data leakage by simulating the real-life situation where information from the future is not accessible during training. This enhances the authenticity of the evaluation process.
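scikit-learn's TimeSeriesSplit follows exactly this pattern; a minimal sketch, assuming X and y are ordered by time:

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)

# Training windows only ever contain data that comes before the test window
for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx.min(), "-", train_idx.max(),
          "| test:", test_idx.min(), "-", test_idx.max())
```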
Thank You!!
Cross Validation Methods
Hi Team,
Continuing from the previous blog, I explain two cross-validation methods in this post and will cover the remaining methods in an upcoming blog.
Types of cross-validation: There are several cross-validation methods in Statistics. However, in this blog, I’ll be explaining two cross-validation methods as said above.
- Leave-one-out cross-validation (LOOCV)
In this CV method, the test set consists of a single observation and the training set contains the remaining n − 1 observations. We repeat this process n times, excluding a different data point each time.
Disadvantage: For large or complex datasets, it requires n iterations to validate the model. The estimate can also have high variance, since each iteration's test set is a single data point.
- K-Fold cross-validation
This is the commonly used CV method for assessing the performance and preventing overfitting. In this technique, the whole dataset is partitioned in k parts of equal size. It’s known as k-fold since there are k parts where k can be any number.
For instance, if there are 500 records and we require 5 experiments (K value) to perform on these 500 records.
Then, Records/K = 500/5 = 100
In Experiment 1, 100 of the 500 records are held out as the test set (the model is trained on the other 400) to find ACCURACY1.
In Experiment 2, a different 100 records are held out to find ACCURACY2.
The same approach continues for Experiments 3, 4, and 5 to find the ACCURACY3, ACCURACY4 and ACCURACY5.
Based on the obtained 5 Accuracy values, we can find out the MEAN, MIN, and MAX Accuracy from 5 experiments.
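A sketch of that experiment with cross_val_score (the classifier and the 500-record dataset X, y are placeholders):

```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

# X, y are assumed to be the 500 records and their labels
cv = KFold(n_splits=5, shuffle=True, random_state=42)
accuracies = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")

print("Accuracy per experiment:", accuracies)
print("Mean:", accuracies.mean(), "Min:", accuracies.min(), "Max:", accuracies.max())
```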
Thank You!!
Class V : Sep 18, 2023
Hello,
Greetings!
Today we explored the concept of performing regression analysis using two variables.
Y = β0 + β1X1 + β2X2 + … + βpXp + ε
Where, Y = diabetes data (response variable)
X1 = inactivity data (predictor)
X2 = obesity data (predictor)
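A hedged sketch of fitting this two-predictor model with statsmodels (the DataFrame and column names are assumptions based on the datasets above):

```python
import statsmodels.api as sm

# df is assumed to hold the merged county data
X = sm.add_constant(df[["%inactive", "%obese"]])  # adds the intercept term β0
y = df["%diabetic"]

model = sm.OLS(y, X).fit()
print(model.summary())   # coefficients β1, β2, R-squared, etc.
```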
If the relationship between an independent variable and the dependent variable has a parabolic shape, a quadratic model is used.
For instance, if the volume of a certain chemical affects the quality of a product, a quadratic model would suggest that there is an optimal volume that maximizes quality: too little or too much of the chemical lowers quality.
Overfitting occurs when a predictive model is too complex and fits the sample data too closely. This can lead to poor performance on new data.
For instance, when predicting stock prices, an overfitted model could fit a small sample of historical data very closely, capturing short-term fluctuations, but it may struggle to provide accurate forecasts for future stock prices.
In my next blog, I’ll be posting about cross-validation for assessing the efficacy of the model.
Class IV (Online) : Sep 15, 2023
I plan to use the White test and Goldfeld-Quandt test instead of the Breusch-Pagan test to assess heteroscedasticity in the diabetes and inactivity data models, note the limitations of each test, and then compare the results with the Breusch-Pagan test. This will help me make informed decisions in my analysis.
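For reference, a hedged sketch of both tests with statsmodels, assuming `model` is a fitted OLS result, `y` its response, and `X` its design matrix with a constant (as in the regression sketch above):

```python
from statsmodels.stats.diagnostic import het_white, het_goldfeldquandt

# White test: uses the residuals and the full design matrix (with constant)
white_stat, white_pvalue, _, _ = het_white(model.resid, X)
print("White test p-value:", white_pvalue)

# Goldfeld-Quandt test: compares residual variance between two halves of the data
gq_stat, gq_pvalue, _ = het_goldfeldquandt(y, X)
print("Goldfeld-Quandt p-value:", gq_pvalue)
```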
If I have any unclear topics during my analysis, I will definitely seek clarification from the instructor in the coming class.
Class III : Sep 13, 2023
As mentioned in my last blog post, I have explored and understood the concepts of the least-squares linear regression method and kurtosis.
The least-squares linear regression method represents the relationship between variables in a scatterplot. The procedure fits the line to the data points in a way that minimizes the sum of the squared vertical distances between the line and the points. It is also known as the line of best fit or trendline.
The linear equation, y = b + mx
Where, y = dependent variable
x = independent variable
b = y-intercept
m = slope of the line
To get the value of m, the formula below is used:
m = (NΣ(xy) − ΣxΣy) / (NΣ(x²) − (Σx)²)
To get the value of b, the formula below is used:
b = (Σy − mΣx) / N
Where, N = number of observations
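These formulas translate directly into a few lines of Python (the x and y values here are purely illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable (illustrative)
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])   # dependent variable (illustrative)
N = len(x)

m = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
b = (np.sum(y) - m * np.sum(x)) / N

print("slope m =", m, "intercept b =", b)
```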
Kurtosis is a statistical measure that quantifies the shape of a probability distribution. It provides information about the tails and peakedness of the distribution, and it helps in analyzing the outliers of a data set. Peakedness is the degree to which data values are concentrated around the mean.
- Positive excess kurtosis indicates heavier tails and a more peaked distribution, i.e., kurtosis greater than that of the normal distribution (kurtosis > 3).
- Negative excess kurtosis indicates lighter tails and a flatter distribution, i.e., kurtosis less than that of the normal distribution (kurtosis < 3).
- Zero excess kurtosis indicates moderate tails and a medium-peaked curve, i.e., kurtosis equal to that of the normal distribution (kurtosis = 3).
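For a quick check, SciPy reports kurtosis directly; by default it returns excess kurtosis (normal = 0), and fisher=False gives the normal = 3 convention used above:

```python
import numpy as np
from scipy.stats import kurtosis

data = np.random.normal(size=10_000)   # illustrative sample
print(kurtosis(data, fisher=False))    # approximately 3 for a normal distribution
print(kurtosis(data))                  # excess kurtosis, approximately 0
```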
Heteroscedasticity in Regression Analysis
Heteroscedasticity means unequal scatter; in regression analysis, we talk about heteroscedasticity in the context of residuals. Heteroscedasticity is a change in the spread of the residuals over the range of measured values, and it produces a cone or funnel shape in residual plots.
Breusch-Pagan test is used to test for heteroscedasticity in regression analysis.
As discussed in today's class, consider the Breusch-Pagan test applied to a coin toss scenario: testing whether a coin is fair, meaning it has an equal probability of landing heads or tails. Suppose we flip the coin 100 times and record the outcomes.
The hypotheses are of two types, as below:
Null Hypothesis (H0): There is no heteroscedasticity in the coin toss data, implying that the variance of the outcomes (heads or tails) remains constant across all tosses.
Alternative Hypothesis (Ha): There is heteroscedasticity in the coin toss data, suggesting that the variance of the outcomes (heads or tails) is not constant and may vary across the tosses.
- If the p-value is greater than the significance level (e.g., 0.05), we do not have enough evidence to reject the null hypothesis. This suggests that the variance of the coin toss outcomes remains relatively constant and there is no significant heteroscedasticity.
- If the p-value is less than the significance level, we may reject the null hypothesis in favor of the alternative hypothesis. This implies there is evidence of heteroscedasticity, indicating that the variance of the coin toss outcomes may not be constant and there could be variations in the coin’s behavior across the tosses.
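A hedged sketch of how the Breusch-Pagan test is typically run with statsmodels, assuming `model` is a fitted OLS result and `X` its design matrix with a constant (both placeholders for the actual regression):

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# Test the regression residuals against the explanatory variables
bp_stat, bp_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)

print("Breusch-Pagan p-value:", bp_pvalue)
if bp_pvalue < 0.05:
    print("Evidence of heteroscedasticity (reject H0).")
else:
    print("No significant heteroscedasticity (fail to reject H0).")
```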
1st Session
As per the given first assignment regarding the Centers for Disease Control and Prevention 2018 data, I have analyzed the three data sets, i.e., Diabetes, Obesity, and Inactivity.
The Diabetes dataset has 3000+ observations, Obesity has 300+ observations, and Inactivity has 1600+ observations.
I understood that the three data sets have similar variables (columns), each with its own respective data.
I figured out that FIPS can serve as the primary variable for retrieving common observations from the three data sets. With it, we can find how many counties have both diabetes and obesity data, both diabetes and inactivity data, and both obesity and inactivity data.
Hence, using an intersection (inner join) in any programming language, we can retrieve the common observations from these datasets.
Doing that, we get 354 observations when considering FIPS as the primary variable across all three data sets.
Comparing just the diabetes and inactivity datasets, we get 1370 observations using the same FIPS key.
I understood that these 1370 counties have both diabetes and inactivity data, for which we can compute mean, median, skewness, and kurtosis values.
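A sketch of that intersection with pandas merges (the file names are assumptions; the FIPS column is the key described above):

```python
import pandas as pd

diabetes = pd.read_csv("diabetes.csv")      # file names assumed
obesity = pd.read_csv("obesity.csv")
inactivity = pd.read_csv("inactivity.csv")

# Inner joins on FIPS keep only the counties present in every dataset
all_three = diabetes.merge(obesity, on="FIPS").merge(inactivity, on="FIPS")
print(len(all_three))        # 354 common counties, as noted above

# Pairwise comparison of diabetes and inactivity
diab_inact = diabetes.merge(inactivity, on="FIPS")
print(len(diab_inact))       # 1370 counties with both diabetes and inactivity data
```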
As per today's class (September 11), I understood how to represent this retrieved data in a histogram and to figure out whether the data is clean or not based on the mean, median, skewness, and kurtosis calculations.
Initially, during today's class, I was confused by the kurtosis calculation and the least-squares linear model; however, by the end of the class I understood them in pieces. Very soon I will explore and understand more about the kurtosis calculation and the least-squares linear model.