Data Quality Assessment on Shootings in the United States

Following today’s class, I started exploring a dataset that covers various aspects of shootings that occurred in the USA.

My focus was on understanding the data, identifying discrepancies, and addressing missing values in order to prepare the dataset for further analysis.

Missing Values : 

The dataset we are working with contains several columns with missing values, including “threat_type,” “City,” “County,” “Latitude & Longitude,” “Age,” and “Race.” These gaps in the data can significantly impact the accuracy and reliability of any analyses or models we wish to build.
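A quick way to quantify these gaps is to count the missing entries per column. The sketch below assumes the data is held in a pandas DataFrame; the small frame and its lowercase column names are invented for illustration and are not the real records.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the shootings dataset (not real records).
df = pd.DataFrame({
    "threat_type": ["shoot", None, "threat", "attack"],
    "city": ["Boston", "Dallas", None, "Miami"],
    "age": [34.0, np.nan, 27.0, np.nan],
    "race": ["W", "B", None, None],
})

# Count and percentage of missing values per column, largest first.
missing_count = df.isna().sum()
missing_report = pd.DataFrame({
    "missing": missing_count,
    "percent": (missing_count / len(df) * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_report)
```

Looking at the percentage next to the raw count helps decide, column by column, whether to drop rows, impute values, or leave the gaps as they are.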

Duplicate Records :

One of the specific issues we encountered was the presence of duplicate records in the “armed_with” column. Identifying and addressing these duplicates is essential to avoid skewing our analysis or modeling results.
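The exact form of the duplication is not spelled out above, so the following is only a sketch of two checks that are commonly useful: rows that repeat an earlier row in every column, and "armed_with" labels that differ only by case or stray whitespace. The sample frame and the "date" and "city" columns are hypothetical.

```python
import pandas as pd

# Hypothetical sample (not real records).
df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-02", "2023-01-03"],
    "city": ["Boston", "Boston", "Dallas", "Miami"],
    "armed_with": ["gun", "gun", "Gun ", "knife"],
})

# 1) Rows that repeat an earlier row in every column.
duplicate_rows = df[df.duplicated(keep="first")]

# 2) Labels in "armed_with" that differ only by case or whitespace.
normalized = df["armed_with"].str.strip().str.lower()
label_counts = normalized.value_counts()

# Drop exact duplicate rows and keep the normalized labels.
deduped = df.drop_duplicates().assign(armed_with=normalized)

print(duplicate_rows)
print(label_counts)
```

Whether a repeated row is a true duplicate or two separate incidents is a judgment call that depends on which columns identify an incident, so it is worth inspecting the flagged rows before dropping them.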

Discrepancies :

  • Gender – Some records are missing the gender of the individual involved, whether suspect or victim. Because these records also lack identifiable names and details, the credibility of the documented shooting is a concern.
  • County – Some records are missing the county even though the corresponding city is provided. This raises the question of whether the missing county should be inferred from the available city data.
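One possible way to approach the county question is to build a lookup from the rows where both city and county are present and use it to fill the gaps. This is a minimal sketch under stated assumptions: the sample data is invented, and the "state" column is an assumption added here because the same city name can occur in more than one state.

```python
import pandas as pd

# Hypothetical sample (not real records); "state" is an assumed column.
df = pd.DataFrame({
    "city": ["Boston", "Boston", "Dallas", "Dallas", "Miami"],
    "state": ["MA", "MA", "TX", "TX", "FL"],
    "county": ["Suffolk", None, "Dallas", None, None],
})

keys = ["city", "state"]

# Distinct (city, state, county) combinations from fully populated rows.
known = df.dropna(subset=keys + ["county"]).drop_duplicates(subset=keys + ["county"])

# Keep only (city, state) pairs that map to exactly one county.
unambiguous = known[~known.duplicated(subset=keys, keep=False)]
lookup = unambiguous[keys + ["county"]].rename(columns={"county": "county_inferred"})

# Fill missing counties from the lookup; unmatched rows stay missing.
merged = df.merge(lookup, on=keys, how="left")
merged["county"] = merged["county"].fillna(merged["county_inferred"])
merged = merged.drop(columns="county_inferred")

print(merged)
```

Cities that never appear with a county (Miami in this sample) remain missing, and cities that span several counties are skipped on purpose, so those cases would still need an external reference or a manual check.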

As I continue, I will explore further techniques to clean and prepare the data so that it is reliable and fit for analysis.

Regression Analysis

Today, I began by loading the dataset into a Python DataFrame and performing a linear regression analysis. Specifically, I aimed to predict the percentage of diabetic individuals in each county from the percentages of obesity and inactivity. I used the R-squared value, a measure of goodness of fit, to assess how well the linear regression model explained the variance in diabetes rates.
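To make the R-squared calculation concrete, here is a minimal sketch of a two-predictor linear regression fitted by ordinary least squares with NumPy, where R-squared is computed as 1 - SS_res / SS_tot. The data is synthetic and the variable names are placeholders; this is not the actual dataset or necessarily the library used in the analysis.

```python
import numpy as np

# Synthetic county-level percentages (illustrative only, not real data).
rng = np.random.default_rng(0)
n = 200
obesity = rng.uniform(15, 40, n)
inactivity = rng.uniform(10, 35, n)
diabetes = 2.0 + 0.15 * obesity + 0.10 * inactivity + rng.normal(0, 0.5, n)


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients and R-squared."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    return coef, 1.0 - ss_res / ss_tot


X = np.column_stack([obesity, inactivity])
coef, r2 = fit_ols(X, diabetes)

print("intercept, obesity, inactivity:", coef)
print("R-squared:", r2)
```

An R-squared close to 1 means the predictors account for most of the variation in the response, while a value near 0 means the model does little better than predicting the mean.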

Upon analyzing the initial linear regression model, I found that the R-squared value was not very high. To improve the model’s performance, I explored a polynomial regression approach, allowing for more complex relationships between the variables. This led to a higher R-squared value, suggesting that a polynomial regression model might better capture the underlying trends in the data.
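The comparison between a linear and a polynomial fit can be sketched as follows. The example uses synthetic data with a deliberately curved relationship and a degree-2 expansion (squares and the interaction term); the degree, the variable names, and the data are all assumptions for illustration, not the settings used in the analysis.

```python
import numpy as np

# Synthetic data with a curved dependence on obesity (illustrative only).
rng = np.random.default_rng(1)
n = 300
obesity = rng.uniform(15, 40, n)
inactivity = rng.uniform(10, 35, n)
diabetes = 3.0 + 0.01 * obesity ** 2 + 0.10 * inactivity + rng.normal(0, 0.5, n)


def r_squared(A, y):
    """R-squared of the least-squares fit of y on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)


# Design matrices: intercept + linear terms, then added degree-2 terms.
linear_terms = np.column_stack([np.ones(n), obesity, inactivity])
quadratic_terms = np.column_stack([
    linear_terms,
    obesity ** 2,
    obesity * inactivity,
    inactivity ** 2,
])

r2_linear = r_squared(linear_terms, diabetes)
r2_quadratic = r_squared(quadratic_terms, diabetes)

print("linear R-squared:   ", r2_linear)
print("quadratic R-squared:", r2_quadratic)
```

One caution when reading such a comparison: R-squared measured on the same data used for fitting cannot decrease when extra terms are added, so a higher value alone does not show that the polynomial model generalizes better. Checking the score on held-out data or with cross-validation gives a fairer comparison.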