US Obesity Rate Patterns

Fall 2024 Data Science Project

Isaac Plowman, Praharsh Nanduri, Justin Nguyen, Kevin Ferry, Charles Kim

Contributions:

A: Project idea B: Dataset Curation and Preprocessing C: Data Exploration and Summary Statistics D: ML Algorithm Design/Development E: ML Algorithm Training and Test Data Analysis F: Visualization, Result Analysis, Conclusion G: Final Tutorial Report Creation H: Additional (not listed above)

Kevin: Project Idea ~ Helped look for kaggle datasets. Dataset curation and Preprocessing ~ Helped clean data and look at ways to deal with missing values. ML Algorithm Design/Development ~ Helped develop ideas for ML algorithm design and some of the early questions we aim to answer using ML. Also worked on encoding some of the categorical variables. Visualization, Result Analysis, Conclusion ~ Worked on writing what the reader will experience when reading our tutorial and what they will be able to learn from utilizing our code.

Charles: Project Idea - worked with group to search for kaggle datasets. ML Algorithm and Training Test worked on developing initial model design and idea. Summary Statistics - Charles worked on extending the existing ANOVA test with post-hoc test, to pinpoint which income value had a significant impact on obesity rate. Linear Regression model. After initial findings, he came up with the idea to test several different models to see which one works best, as well as explore why these models didn't work well for our given data set. Result Analysis, Conclusion, Tutorial Charles worked on the final insights and explaining, given all of our models, why random forest worked the best.

Justin: Data Exploration and Summary Statistics - Helped look for possible datasets to use. Data Exploration and Summary Statistics - Implemented linear regression hypothesis test. ML Alg Training - Added different models to the model dictionary. Visualization, Result Analysis - Added labels and legend to various graphs, wrote analysis and explanation of model results.

Praharsh: Project Idea - Helped look for kaggle datasets. Data Exploration and Summary Statistics ~ Implemented a Chi-Squared test and a ANOVA test and developed descriptions/tutorials for each test. ML Algorithm and Training Test Data Analysis - Created the dictionary for models and chose/imported the corresponding models. Visualization, Result Analysis, Conclusion & Tutorial ~ Helped curate descriptions regarding visualizations of each plot and also curated descriptions regarding the choice behind Linear Regression and why we found Random Forest as our most accurate model.

Isaac: Project Idea - researched a topic and found the Nutrition_Activity_Obesity CSV from data.gov. Wrote most of the introduction paragraph. Dataset Curation - Cleaned the dataset by removing irrelevant columns and showing basic dataset info. ML Algorithm Design/Development - did feature engineering by changing the categories to numerical values and specified the features and target variables. ML Training - Helped implement and test the different models. Insight and Conclusion - Helped write the descriptions of the steps and contributed to writing the conclusion.

Overall, the team felt everyone contributed equally.

Introduction

Our topic is focused on obesity rates and patterns in the US. Specifically, our project involves in looking at numerous factors like physical activity, geographic location, income, and education for correlation with obesity. We can also use the chosen data for predicting the percentage of a certain specification (e.g. percentage of adults who engage in no leisure-time physical activity) based on the features of a geographical area.

Unfortuantely, obesity is a highly prevalent disease that has been affecting people negatively globally. Data from 2023 shows that in 23 U.S. states, one in five adults in the U.S. has obesity (CDC data). With growing fast-food and snacking markets and an increase in prices of healthy foods, people are more influenced to buy less healthy food alternatives for a cheap price. Additionally, obesity is linked to many chronic conditions like diabetes, heart disease, and certain cancers. We examine these datasets for predicting healthcare needs, and for the assistance of creating nutritional guidelines and policies to mitigate the growth of obesity. Additionally, we can locate disparities in obesity rates and the connections between obesity and certain economic, racial, and geographical groups. These datasets allow us to examine environmental influences on obesity and understand the connection between fast food availability and health outcomes. Understanding this information provides knowledge on populations that are at risk, which determines the amount of resources to allocate and where to allocate them.

Data Curation

For our project, we analyzed three different datasets from two different websites. The links to where we gathered data include the following:

Nutrition_Activity_Obesity From Data.gov:

https://catalog.data.gov/dataset/nutrition-physical-activity-and-obesity-behavioral-risk-factor-surveillance-system

FastFoodRestaurants and Datafiniti_Fast_Food_Restaurants From Kaggle.com:

https://www.kaggle.com/datasets/imtkaggleteam/fast-food-restaurants-across-america/data?select=FastFoodRestaurants.csv

"Nutrition_Activity_Obesity" refers to the dataset that includes different topics or questions and their corresponding percentage. The "FastFoodRestaurants" datasets include the locations of fast food restaurants across the US. These datasets will be used for analysis and pattern prediction, but before that we must process and clean their data.

First we must create the data frames using the csv files. Down below we display the dataframes.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

ffr_df = pd.read_csv('FastFoodRestaurants.csv')
obesity_df = pd.read_csv('Nutrition_Activity_Obesity.csv')
Datafini_ffr_df = pd.read_csv('Datafiniti_Fast_Food_Restaurants.csv')

display(obesity_df)
display(ffr_df)
display(Datafini_ffr_df)
YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question Data_Value_Unit Data_Value_Type ... GeoLocation ClassID TopicID QuestionID DataValueTypeID LocationID StratificationCategory1 Stratification1 StratificationCategoryId1 StratificationID1
0 2020 2020 US National Behavioral Risk Factor Surveillance System Physical Activity Physical Activity - Behavior Percent of adults who engage in no leisure-tim... NaN Value ... NaN PA PA1 Q047 VALUE 59 Race/Ethnicity Hispanic RACE RACEHIS
1 2014 2014 GU Guam Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... NaN Value ... (13.444304, 144.793731) OWS OWS1 Q036 VALUE 66 Education High school graduate EDU EDUHSGRAD
2 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... NaN Value ... NaN OWS OWS1 Q036 VALUE 59 Income $50,000 - $74,999 INC INC5075
3 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... NaN Value ... NaN OWS OWS1 Q037 VALUE 59 Income Data not reported INC INCNR
4 2015 2015 US National Behavioral Risk Factor Surveillance System Physical Activity Physical Activity - Behavior Percent of adults who achieve at least 300 min... NaN Value ... NaN PA PA1 Q045 VALUE 59 Income Less than $15,000 INC INCLESS15
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
93244 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... NaN Value ... (43.23554134300048, -108.10983035299967) OWS OWS1 Q037 VALUE 56 Income Less than $15,000 INC INCLESS15
93245 2022 2022 WY Wyoming BRFSS Physical Activity Physical Activity - Behavior Percent of adults who engage in no leisure-tim... NaN Value ... (43.23554134300048, -108.10983035299967) PA PA1 Q047 VALUE 56 Education Less than high school EDU EDUHS
93246 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... NaN Value ... (43.23554134300048, -108.10983035299967) OWS OWS1 Q036 VALUE 56 Age (years) 35 - 44 AGEYR AGEYR3544
93247 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... NaN Value ... (43.23554134300048, -108.10983035299967) OWS OWS1 Q037 VALUE 56 Income $35,000 - $49,999 INC INC3550
93248 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... NaN Value ... (43.23554134300048, -108.10983035299967) OWS OWS1 Q036 VALUE 56 Education Less than high school EDU EDUHS

93249 rows × 33 columns

address city country keys latitude longitude name postalCode province websites
0 324 Main St Massena US us/ny/massena/324mainst/-1161002137 44.921300 -74.890210 McDonald's 13662 NY http://mcdonalds.com,http://www.mcdonalds.com/...
1 530 Clinton Ave Washington Court House US us/oh/washingtoncourthouse/530clintonave/-7914... 39.532550 -83.445260 Wendy's 43160 OH http://www.wendys.com
2 408 Market Square Dr Maysville US us/ky/maysville/408marketsquaredr/1051460804 38.627360 -83.791410 Frisch's Big Boy 41056 KY http://www.frischs.com,https://www.frischs.com...
3 6098 State Highway 37 Massena US us/ny/massena/6098statehighway37/-1161002137 44.950080 -74.845530 McDonald's 13662 NY http://mcdonalds.com,http://www.mcdonalds.com/...
4 139 Columbus Rd Athens US us/oh/athens/139columbusrd/990890980 39.351550 -82.097280 OMG! Rotisserie 45701 OH http://www.omgrotisserie.com,http://omgrotisse...
... ... ... ... ... ... ... ... ... ... ...
9995 3013 Peach Orchard Rd Augusta US us/ga/augusta/3013peachorchardrd/-791445730 33.415257 -82.024531 Wendy's 30906 GA http://www.wendys.com,http://wendys.com
9996 678 Northwest Hwy Cary US us/il/cary/678northwesthwy/787691191 42.217300 -88.255800 Lee's Oriental Martial Arts 60013 IL http://www.mcdonalds.com
9997 1708 Main St Longmont US us/co/longmont/1708mainst/-448666054 40.189190 -105.101720 Five Guys 80501 CO http://fiveguys.com
9998 67740 Highway 111 Cathedral City US us/ca/cathedralcity/67740highway111/-981164808 33.788640 -116.482150 El Pollo Loco 92234 CA http://www.elpolloloco.com,http://elpolloloco.com
9999 5701 E La Palma Ave Anaheim US us/ca/anaheim/5701elapalmaave/554191587 33.860074 -117.789762 Carl's Jr. 92807 CA http://www.carlsjr.com

10000 rows × 10 columns

id dateAdded dateUpdated address categories city country keys latitude longitude name postalCode province sourceURLs websites
0 AVwcmSyZIN2L1WUfmxyw 2015-10-19T23:47:58Z 2018-06-26T03:00:14Z 800 N Canal Blvd American Restaurant and Fast Food Restaurant Thibodaux US us/la/thibodaux/800ncanalblvd/1780593795 29.814697 -90.814742 SONIC Drive In 70301 LA https://foursquare.com/v/sonic-drive-in/4b7361... https://locations.sonicdrivein.com/la/thibodau...
1 AVwcmSyZIN2L1WUfmxyw 2015-10-19T23:47:58Z 2018-06-26T03:00:14Z 800 N Canal Blvd Fast Food Restaurants Thibodaux US us/la/thibodaux/800ncanalblvd/1780593795 29.814697 -90.814742 SONIC Drive In 70301 LA https://foursquare.com/v/sonic-drive-in/4b7361... https://locations.sonicdrivein.com/la/thibodau...
2 AVwcopQoByjofQCxgfVa 2016-03-29T05:06:36Z 2018-06-26T02:59:52Z 206 Wears Valley Rd Fast Food Restaurant Pigeon Forge US us/tn/pigeonforge/206wearsvalleyrd/-864103396 35.803788 -83.580553 Taco Bell 37863 TN https://www.yellowpages.com/pigeon-forge-tn/mi... http://www.tacobell.com,https://locations.taco...
3 AVweXN5RByjofQCxxilK 2017-01-03T07:46:11Z 2018-06-26T02:59:51Z 3652 Parkway Fast Food Pigeon Forge US us/tn/pigeonforge/3652parkway/93075755 35.782339 -83.551408 Arby's 37863 TN http://www.yellowbook.com/profile/arbys_163389... http://www.arbys.com,https://locations.arbys.c...
4 AWQ6MUvo3-Khe5l_j3SG 2018-06-26T02:59:43Z 2018-06-26T02:59:43Z 2118 Mt Zion Parkway Fast Food Restaurant Morrow US us/ga/morrow/2118mtzionparkway/1305117222 33.562738 -84.321143 Steak 'n Shake 30260 GA https://foursquare.com/v/steak-n-shake/4bcf77a... http://www.steaknshake.com/locations/23851-ste...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 AV12gJwna4HuVbed9Ayg 2017-07-24T21:28:46Z 2018-04-07T13:19:06Z 3460 Robinhood Rd Fast Food Restaurants Winston-Salem US us/nc/winston-salem/3460robinhoodrd/-66712705 36.117563 -80.316553 Pizza Hut 27106 NC https://www.allmenus.com/nc/winston-salem/7341... http://www.pizzahut.com
9996 AV12gJxKIxWefVJwhpzS 2017-07-24T21:28:46Z 2018-04-07T13:19:05Z 3069 Kernersville Rd Fast Food Restaurants Winston-Salem US us/nc/winston-salem/3069kernersvillerd/-66712705 36.077718 -80.176748 Pizza Hut 27107 NC https://www.allmenus.com/nc/winston-salem/7340... http://www.pizzahut.com
9997 AVwdJMdSByjofQCxl8Vr 2015-10-24T00:17:32Z 2018-04-07T13:19:05Z 838 S Main St Fast Food Restaurants Kernersville US us/nc/kernersville/838smainst/-66712705 36.111015 -80.089165 Pizza Hut 27284 NC https://www.allmenus.com/nc/kernersville/73400... http://www.pizzahut.com
9998 AVwdl2cykufWRAb57ZPs 2016-04-05T02:59:45Z 2018-04-07T13:19:05Z 1702 Glendale Dr SW Fast Food Restaurants Wilson US us/nc/wilson/1702glendaledrsw/-66712705 35.719981 -77.945795 Pizza Hut 27893 NC https://www.allmenus.com/nc/wilson/73403-pizza... http://www.pizzahut.com
9999 AVwdecWKIN2L1WUfwMWU 2016-11-08T02:26:32Z 2018-04-07T13:19:05Z 1405 W Broad St Fast Food Restaurants Elizabethtown US us/nc/elizabethtown/1405wbroadst/-66712705 34.632778 -78.624615 Pizza Hut 28337 NC https://www.allmenus.com/nc/elizabethtown/7339... http://www.pizzahut.com,http://api.citygridmed...

10000 rows × 15 columns

Now, we can start cleaning and exploring the data of our dataframes. First, let's clean and examine obesity_df. It seems that all the values in the column, "Data_Value_Unit," are NaN, so let's check the count of non NaN values using the count function.

In [ ]:
print("Data_Value_Unit # of Non NaN:", obesity_df['Data_Value_Unit'].count()) # Count the number of rows
Data_Value_Unit # of Non NaN: 0

Seeing that this column is entirely NaN, we can drop it.

In [ ]:
obesity_df = obesity_df.drop(columns=['Data_Value_Unit'])

It's also clear that the values of the the column "Data_Value_Type" are useless because all the values seem to have "Value" as their literal. We can test to see if all the values in this column are "Value" by using the unique method.

In [ ]:
obesity_df['Data_Value_Type'].unique()
Out[ ]:
array(['Value'], dtype=object)

We see that the only unique value is "Value," so we can also drop those columns.

In [ ]:
obesity_df = obesity_df.drop(columns=['Data_Value_Type'])
display(obesity_df)
YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question Data_Value Data_Value_Alt ... GeoLocation ClassID TopicID QuestionID DataValueTypeID LocationID StratificationCategory1 Stratification1 StratificationCategoryId1 StratificationID1
0 2020 2020 US National Behavioral Risk Factor Surveillance System Physical Activity Physical Activity - Behavior Percent of adults who engage in no leisure-tim... 30.6 30.6 ... NaN PA PA1 Q047 VALUE 59 Race/Ethnicity Hispanic RACE RACEHIS
1 2014 2014 GU Guam Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 29.3 29.3 ... (13.444304, 144.793731) OWS OWS1 Q036 VALUE 66 Education High school graduate EDU EDUHSGRAD
2 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 28.8 28.8 ... NaN OWS OWS1 Q036 VALUE 59 Income $50,000 - $74,999 INC INC5075
3 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 32.7 32.7 ... NaN OWS OWS1 Q037 VALUE 59 Income Data not reported INC INCNR
4 2015 2015 US National Behavioral Risk Factor Surveillance System Physical Activity Physical Activity - Behavior Percent of adults who achieve at least 300 min... 26.6 26.6 ... NaN PA PA1 Q045 VALUE 59 Income Less than $15,000 INC INCLESS15
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
93244 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 24.5 24.5 ... (43.23554134300048, -108.10983035299967) OWS OWS1 Q037 VALUE 56 Income Less than $15,000 INC INCLESS15
93245 2022 2022 WY Wyoming BRFSS Physical Activity Physical Activity - Behavior Percent of adults who engage in no leisure-tim... 36.0 36.0 ... (43.23554134300048, -108.10983035299967) PA PA1 Q047 VALUE 56 Education Less than high school EDU EDUHS
93246 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 35.2 35.2 ... (43.23554134300048, -108.10983035299967) OWS OWS1 Q036 VALUE 56 Age (years) 35 - 44 AGEYR AGEYR3544
93247 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 35.3 35.3 ... (43.23554134300048, -108.10983035299967) OWS OWS1 Q037 VALUE 56 Income $35,000 - $49,999 INC INC3550
93248 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 41.0 41.0 ... (43.23554134300048, -108.10983035299967) OWS OWS1 Q036 VALUE 56 Education Less than high school EDU EDUHS

93249 rows × 31 columns

We can also get rid of some of the ID columns and the Data_Value_Alt column since they're described by other columns. This helps the data frame become more readable. We can see if Data_Value_Alt is needed by comparing it to the Data Value column.

In [ ]:
obesity_df['Data_Value_Alt'].equals(obesity_df['Data_Value'])
Out[ ]:
True

Since Data_Value_Alt is identical to Data Value we can drop it along with the ID columns.

In [ ]:
obesity_df = obesity_df.drop(columns=['Data_Value_Alt','QuestionID', 'DataValueTypeID', 'ClassID', 'TopicID', 'LocationID', 'StratificationCategoryId1', 'StratificationID1'])
obesity_df
Out[ ]:
YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question Data_Value Data_Value_Footnote_Symbol ... Sample_Size Total Age(years) Education Gender Income Race/Ethnicity GeoLocation StratificationCategory1 Stratification1
0 2020 2020 US National Behavioral Risk Factor Surveillance System Physical Activity Physical Activity - Behavior Percent of adults who engage in no leisure-tim... 30.6 NaN ... 31255.0 NaN NaN NaN NaN NaN Hispanic NaN Race/Ethnicity Hispanic
1 2014 2014 GU Guam Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 29.3 NaN ... 842.0 NaN NaN High school graduate NaN NaN NaN (13.444304, 144.793731) Education High school graduate
2 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 28.8 NaN ... 62562.0 NaN NaN NaN NaN $50,000 - $74,999 NaN NaN Income $50,000 - $74,999
3 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 32.7 NaN ... 60069.0 NaN NaN NaN NaN Data not reported NaN NaN Income Data not reported
4 2015 2015 US National Behavioral Risk Factor Surveillance System Physical Activity Physical Activity - Behavior Percent of adults who achieve at least 300 min... 26.6 NaN ... 30904.0 NaN NaN NaN NaN Less than $15,000 NaN NaN Income Less than $15,000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
93244 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 24.5 NaN ... 111.0 NaN NaN NaN NaN Less than $15,000 NaN (43.23554134300048, -108.10983035299967) Income Less than $15,000
93245 2022 2022 WY Wyoming BRFSS Physical Activity Physical Activity - Behavior Percent of adults who engage in no leisure-tim... 36.0 NaN ... 159.0 NaN NaN Less than high school NaN NaN NaN (43.23554134300048, -108.10983035299967) Education Less than high school
93246 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 35.2 NaN ... 450.0 NaN 35 - 44 NaN NaN NaN NaN (43.23554134300048, -108.10983035299967) Age (years) 35 - 44
93247 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 35.3 NaN ... 512.0 NaN NaN NaN NaN $35,000 - $49,999 NaN (43.23554134300048, -108.10983035299967) Income $35,000 - $49,999
93248 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 41.0 NaN ... 146.0 NaN NaN Less than high school NaN NaN NaN (43.23554134300048, -108.10983035299967) Education Less than high school

93249 rows × 23 columns

We can also see the data value footnote columns in case they are of use

In [ ]:
obesity_df['Data_Value_Footnote_Symbol'].unique()
Out[ ]:
array([nan, '~'], dtype=object)
In [ ]:
obesity_df['Data_Value_Footnote'].unique()
Out[ ]:
array([nan, 'Data not available because sample size is insufficient.'],
      dtype=object)

These basically describe the missing data which can already be seen by using the dropna method.

In [ ]:
obesity_df = obesity_df.drop(columns=['Data_Value_Footnote_Symbol', 'Data_Value_Footnote'])
obesity_df
Out[ ]:
YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question Data_Value Low_Confidence_Limit ... Sample_Size Total Age(years) Education Gender Income Race/Ethnicity GeoLocation StratificationCategory1 Stratification1
0 2020 2020 US National Behavioral Risk Factor Surveillance System Physical Activity Physical Activity - Behavior Percent of adults who engage in no leisure-tim... 30.6 29.4 ... 31255.0 NaN NaN NaN NaN NaN Hispanic NaN Race/Ethnicity Hispanic
1 2014 2014 GU Guam Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 29.3 25.7 ... 842.0 NaN NaN High school graduate NaN NaN NaN (13.444304, 144.793731) Education High school graduate
2 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 28.8 28.1 ... 62562.0 NaN NaN NaN NaN $50,000 - $74,999 NaN NaN Income $50,000 - $74,999
3 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 32.7 31.9 ... 60069.0 NaN NaN NaN NaN Data not reported NaN NaN Income Data not reported
4 2015 2015 US National Behavioral Risk Factor Surveillance System Physical Activity Physical Activity - Behavior Percent of adults who achieve at least 300 min... 26.6 25.6 ... 30904.0 NaN NaN NaN NaN Less than $15,000 NaN NaN Income Less than $15,000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
93244 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 24.5 16.3 ... 111.0 NaN NaN NaN NaN Less than $15,000 NaN (43.23554134300048, -108.10983035299967) Income Less than $15,000
93245 2022 2022 WY Wyoming BRFSS Physical Activity Physical Activity - Behavior Percent of adults who engage in no leisure-tim... 36.0 27.9 ... 159.0 NaN NaN Less than high school NaN NaN NaN (43.23554134300048, -108.10983035299967) Education Less than high school
93246 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 35.2 30.6 ... 450.0 NaN 35 - 44 NaN NaN NaN NaN (43.23554134300048, -108.10983035299967) Age (years) 35 - 44
93247 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 35.3 30.2 ... 512.0 NaN NaN NaN NaN $35,000 - $49,999 NaN (43.23554134300048, -108.10983035299967) Income $35,000 - $49,999
93248 2022 2022 WY Wyoming BRFSS Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... 41.0 31.9 ... 146.0 NaN NaN Less than high school NaN NaN NaN (43.23554134300048, -108.10983035299967) Education Less than high school

93249 rows × 21 columns

In the following code, we drop the columns in the dataset that we will not use for our analyses.

In [ ]:
obesity_df = obesity_df.drop(columns=['YearStart', 'Datasource', 'Topic', 'Low_Confidence_Limit','Sample_Size', 'Total', 'StratificationCategory1'])

To continue, we can take a look at the columns and their quantity of missing data using the info method.

let's start with the columns

In [ ]:
obesity_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93249 entries, 0 to 93248
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   YearEnd                 93249 non-null  int64  
 1   LocationAbbr            93249 non-null  object 
 2   LocationDesc            93249 non-null  object 
 3   Class                   93249 non-null  object 
 4   Question                93249 non-null  object 
 5   Data_Value              84014 non-null  float64
 6   High_Confidence_Limit   84014 non-null  float64
 7   Age(years)              19980 non-null  object 
 8   Education               13320 non-null  object 
 9   Gender                  6660 non-null   object 
 10  Income                  23310 non-null  object 
 11  Race/Ethnicity          26640 non-null  object 
 12  GeoLocation             91513 non-null  object 
 13  Stratification1         93240 non-null  object 
dtypes: float64(2), int64(1), object(11)
memory usage: 10.0+ MB

As we can tell, a few columns have very few non-null entries. These include Gender, Income, Race/ethncitiy, age, and total. Unfortunately, these columns may be important for prediction and analysis as these could play a significant role in the obesity or inactivity rates at a certain location. We will handle these missing values later on, but for now let's do a basic exploration and anaysis of the data. Specifically, let's take a look at the summary of data, the variance, and the co-variance between the features of set and the level of obesity.

Data Exploration and Summary Statistics

First, let's look at the summary of the numerical columns that we've kept.

In [ ]:
obesity_df.describe()
Out[ ]:
YearEnd Data_Value High_Confidence_Limit
count 93249.000000 84014.000000 84014.000000
mean 2016.308068 31.226492 36.134303
std 3.308679 10.021059 10.978276
min 2011.000000 0.900000 3.000000
25% 2013.000000 24.400000 28.700000
50% 2017.000000 31.200000 36.000000
75% 2019.000000 37.000000 42.200000
max 2022.000000 77.600000 87.700000

We can see that the total number of rows is 93249 and can observe the different characteristics of the values like their quartiles and mean. We also notice that there are about 10000 entries missing in the Data_Value column.

The most important variable to understand is the data_value column, which displays the percentage of a sample described by the associated question in the row. For example, the question/topic could be the percentage of adults who engage in no phyiscal activity, and the data value is the percentage. Let's look at the general distribution of this value.

In [ ]:
obesity_df['Data_Value'].plot(kind='hist', bins = 30)
plt.show()
No description has been provided for this image

As we can see, it follows some what of a bell curve shape where the max percentage is 77.6% and the min is 0.9%. The highest frequency of values lies between the range of 30-35%.

To digress, let's take a look at the other data frames for cleaning and examination.

Let's look at ffr first

In [ ]:
display(ffr_df)
address city country keys latitude longitude name postalCode province websites
0 324 Main St Massena US us/ny/massena/324mainst/-1161002137 44.921300 -74.890210 McDonald's 13662 NY http://mcdonalds.com,http://www.mcdonalds.com/...
1 530 Clinton Ave Washington Court House US us/oh/washingtoncourthouse/530clintonave/-7914... 39.532550 -83.445260 Wendy's 43160 OH http://www.wendys.com
2 408 Market Square Dr Maysville US us/ky/maysville/408marketsquaredr/1051460804 38.627360 -83.791410 Frisch's Big Boy 41056 KY http://www.frischs.com,https://www.frischs.com...
3 6098 State Highway 37 Massena US us/ny/massena/6098statehighway37/-1161002137 44.950080 -74.845530 McDonald's 13662 NY http://mcdonalds.com,http://www.mcdonalds.com/...
4 139 Columbus Rd Athens US us/oh/athens/139columbusrd/990890980 39.351550 -82.097280 OMG! Rotisserie 45701 OH http://www.omgrotisserie.com,http://omgrotisse...
... ... ... ... ... ... ... ... ... ... ...
9995 3013 Peach Orchard Rd Augusta US us/ga/augusta/3013peachorchardrd/-791445730 33.415257 -82.024531 Wendy's 30906 GA http://www.wendys.com,http://wendys.com
9996 678 Northwest Hwy Cary US us/il/cary/678northwesthwy/787691191 42.217300 -88.255800 Lee's Oriental Martial Arts 60013 IL http://www.mcdonalds.com
9997 1708 Main St Longmont US us/co/longmont/1708mainst/-448666054 40.189190 -105.101720 Five Guys 80501 CO http://fiveguys.com
9998 67740 Highway 111 Cathedral City US us/ca/cathedralcity/67740highway111/-981164808 33.788640 -116.482150 El Pollo Loco 92234 CA http://www.elpolloloco.com,http://elpolloloco.com
9999 5701 E La Palma Ave Anaheim US us/ca/anaheim/5701elapalmaave/554191587 33.860074 -117.789762 Carl's Jr. 92807 CA http://www.carlsjr.com

10000 rows × 10 columns

We don't need the keys, postal code, and websites column. Also, the country column is irrelevant if all the rows describe a position in the US.

In [ ]:
ffr_df.drop(columns=['keys', 'websites', 'postalCode'])
# let's check if all the country values is US
ffr_df['country'].unique()
Out[ ]:
array(['US'], dtype=object)

Because we see that US is the only object we can drop that column. We understand that the data only describes data originating from the US.

In [ ]:
ffr_df.drop(columns=['keys', 'websites', 'country'])
Out[ ]:
address city latitude longitude name postalCode province
0 324 Main St Massena 44.921300 -74.890210 McDonald's 13662 NY
1 530 Clinton Ave Washington Court House 39.532550 -83.445260 Wendy's 43160 OH
2 408 Market Square Dr Maysville 38.627360 -83.791410 Frisch's Big Boy 41056 KY
3 6098 State Highway 37 Massena 44.950080 -74.845530 McDonald's 13662 NY
4 139 Columbus Rd Athens 39.351550 -82.097280 OMG! Rotisserie 45701 OH
... ... ... ... ... ... ... ...
9995 3013 Peach Orchard Rd Augusta 33.415257 -82.024531 Wendy's 30906 GA
9996 678 Northwest Hwy Cary 42.217300 -88.255800 Lee's Oriental Martial Arts 60013 IL
9997 1708 Main St Longmont 40.189190 -105.101720 Five Guys 80501 CO
9998 67740 Highway 111 Cathedral City 33.788640 -116.482150 El Pollo Loco 92234 CA
9999 5701 E La Palma Ave Anaheim 33.860074 -117.789762 Carl's Jr. 92807 CA

10000 rows × 7 columns

Now we can start hypothesis testing to find any co-variation and relationships between the features.

First, let's try to find the relationship between education level and obesity. Let's say that H0: The distribution of obesity among different levels of education are the same (education has no impact on obesity), and Ha: The distributions of obesity among different levels of education are different (education does have an impact on obesity).

Since we will be comparing two categorical types of data, we can use the Chi-Squared test - this test estimates the chances that two sets of categorical data come from the same distribution - put simply, if the distribution of obesity rate is the same across all education values, then we can say that education has no influence on obesity rate.

Below is the contingency table and a plot of the relationship:

In [ ]:
import scipy.stats as st


def obesity_level(val):
  if val > 37:
    return 'High'
  elif 24.4 <= val <= 37:
    return 'Medium'
  else:
    return 'Low'

obesity_df['level_of_obesity'] = obesity_df['Data_Value'].apply(obesity_level)

new_obesity = obesity_df.dropna(subset = ['Education', 'Data_Value'])

contingency_table = pd.crosstab(new_obesity['Education'], new_obesity['level_of_obesity'])

print(contingency_table)

contingency_table.plot(kind = 'bar')

plt.title('Relationship between education and obesity')
plt.ylabel('Obesity count')

plt.show()
level_of_obesity                  High   Low  Medium
Education                                           
College graduate                   867  1173    1273
High school graduate               637   527    2149
Less than high school             1118   621    1574
Some college or technical school   613   855    1845
No description has been provided for this image

Now we can calculate the p-value using a Chi-Squared test and compare it to our significant value of 0.05. We can use this information to see whether we would need to reject the Null Hypothesis.

In [ ]:
res = st.chi2_contingency(contingency_table)

f"{res.pvalue:.162f}"
Out[ ]:
'0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000005'

Since we found our p-value to be 5.1 x 10^(-162), we can say that it is less than the significant value of 0.05. Since our p-value is less than 0.05, we can state that we reject the Null Hypothesis. Therefore, education does impact level of obesity because the different levels of education have different distributions of obesity.

In this analysis, we are examining the relationship between the number of fast food restaurants in each U.S. state and the obesity rate for those states. First, we filter the obesity dataset to include only data for the U.S. states. We then narrow it down further so that only the most recent obesity data for each state is included. Next, we calculate the number of fast food restaurants in each state by counting occurrences in the fast food restaurant dataset. We then merge the obesity data and fast food restaurant count data on the state column to combine the relevant information into a single dataset. Using this merged dataset, we perform a linear regression hypothesis test to observe the relationship between the number of fast food restaurants and obesity rates. The scatter plot shows the data points, while the regression line, represented in red, visualizes the trend in the data. The goal is to determine if there is a noticeable pattern, such as a positive correlation, where an increase in the number of fast food restaurants corresponds to a higher obesity rate in the state

In [ ]:
import statsmodels.api as sm
import seaborn as sns

state_names = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut',
               'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa',
               'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
               'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
               'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
               'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
               'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']

filtered_obesity_df = obesity_df[obesity_df['LocationDesc'].isin(state_names)]
most_recent_df = filtered_obesity_df.loc[filtered_obesity_df.groupby('LocationDesc')['YearEnd'].idxmax()]

ffr_df['province'].replace('Co Spgs', 'CO')
province_counts = ffr_df['province'].value_counts().reset_index()
province_counts.columns = ['LocationAbbr', 'count']

merged_df = pd.merge(province_counts, most_recent_df, on='LocationAbbr', how='inner')

X = merged_df['count']
y = merged_df['Data_Value']

X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
plt.figure(figsize=(10, 6))

sns.scatterplot(data=merged_df, x='count', y='Data_Value', color='blue', label='Data Points')

sns.regplot(data=merged_df, x='count', y='Data_Value', scatter=False, color='red', label='Regression Line')

plt.title('Relationship Between Fast Food Restaurant Count and Obesity Rates')
plt.xlabel('Number of Fast Food Restaurants')
plt.ylabel('Obesity Rate (%)')
plt.legend()
plt.grid(True)
No description has been provided for this image

From the visualization above, we had a much lower result than we expected. We were expecting a strong positive correlation between obesity percentage and number of fast food restaurants per state. The result was still positive, but had a weak correlation with a value of only 0.06.

Let's try to find the relationship between income and obesity. Let's say that H0: The mean obesity rate is the same across every income group, and Ha: The mean obesity rate is not the same across every income group (that means at least one group would be different).

Since we are comparing multiple income groups and looking at differences among means, it is best to use ANOVA testing.

ANOVA testing allows us to compare the means of multiple groups (income values), to determine if there are significant differences between them. Using this, we can hope to see whether one or more income values cause a significantly different outcome in obesity rate - signifying that yes, income value does influence obesity rate.

In [ ]:
income_df = obesity_df[['Income', 'Data_Value']].dropna()

g1 = income_df[income_df['Income'] == 'Less than $15,000']['Data_Value']
g2 = income_df[income_df['Income'] == '$15,000 - $24,999']['Data_Value']
g3 = income_df[income_df['Income'] == '$25,000 - $34,999']['Data_Value']
g4 = income_df[income_df['Income'] == '$35,000 - $49,999']['Data_Value']
g5 = income_df[income_df['Income'] == '$50,000 - $74,999']['Data_Value']
g6 = income_df[income_df['Income'] == '$75,000 or greater']['Data_Value']
g7 = income_df[income_df['Income'] == 'Data not reported']['Data_Value']

res = st.f_oneway(g1, g2, g3, g4, g5, g6, g7)

res.pvalue
Out[ ]:
1.027379650584492e-30

Since our p-value is 1.03 x 10^(-30), we can see that it is less than the significant value of 0.05, therefore we would reject the Null Hypothesis. Therefore, we can state that there is at least one group that's mean is different from the others.

We can display this down below:

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

# Group the data based on Income categories and calculate their means
income_groups = [
    "Less than $15,000",
    "$15,000 - $24,999",
    "$25,000 - $34,999",
    "$35,000 - $49,999",
    "$50,000 - $74,999",
    "$75,000 or greater",
    "Data not reported"
]

mean_values = [
    g1.mean(),
    g2.mean(),
    g3.mean(),
    g4.mean(),
    g5.mean(),
    g6.mean(),
    g7.mean()
]

# Plot the means
plt.bar(income_groups, mean_values, color='skyblue')
plt.xlabel('Income Group')
plt.ylabel('Mean Data Value')
plt.title('Mean Obesity Data Value by Income Group')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
No description has been provided for this image

At first glance, our means are extremely similar, despite our results from ANOVA. However, this could imply that, despite the means are numerically close, the differences between each income's means are still statistically significant. Then, we can utilize a post-hoc test to test which income group is significantly different. While ANOVA allows us to see whether a group exists that influences obesity rate, post-hoc tests allow us to pinpoint exactly which group does so.

In [ ]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(
    endog = income_df['Data_Value'],    # Dependent variable
    groups = income_df['Income'],  # Categorical group labels
    alpha = 0.05                          # Significance level
)

# Print the results
print(tukey)

# Optionally, plot the results
tukey.plot_simultaneous()
plt.show()
            Multiple Comparison of Means - Tukey HSD, FWER=0.05             
============================================================================
      group1             group2       meandiff p-adj   lower   upper  reject
----------------------------------------------------------------------------
 $15,000 - $24,999  $25,000 - $34,999  -0.3854 0.6571 -1.0789  0.3081  False
 $15,000 - $24,999  $35,000 - $49,999  -0.5283 0.2709 -1.2218  0.1652  False
 $15,000 - $24,999  $50,000 - $74,999  -0.8289 0.0078 -1.5227 -0.1351   True
 $15,000 - $24,999 $75,000 or greater  -1.3017    0.0 -1.9952 -0.6082   True
 $15,000 - $24,999  Data not reported  -2.3067    0.0 -3.0002 -1.6132   True
 $15,000 - $24,999  Less than $15,000   0.1441 0.9964 -0.5494  0.8377  False
 $25,000 - $34,999  $35,000 - $49,999  -0.1429 0.9966 -0.8364  0.5506  False
 $25,000 - $34,999  $50,000 - $74,999  -0.4435 0.4904 -1.1373  0.2503  False
 $25,000 - $34,999 $75,000 or greater  -0.9163 0.0019 -1.6098 -0.2227   True
 $25,000 - $34,999  Data not reported  -1.9213    0.0 -2.6148 -1.2278   True
 $25,000 - $34,999  Less than $15,000   0.5296 0.2681  -0.164  1.2231  False
 $35,000 - $49,999  $50,000 - $74,999  -0.3006 0.8624 -0.9944  0.3932  False
 $35,000 - $49,999 $75,000 or greater  -0.7734 0.0175 -1.4669 -0.0799   True
 $35,000 - $49,999  Data not reported  -1.7784    0.0 -2.4719 -1.0849   True
 $35,000 - $49,999  Less than $15,000   0.6724 0.0644 -0.0211   1.366  False
 $50,000 - $74,999 $75,000 or greater  -0.4728 0.4086 -1.1666   0.221  False
 $50,000 - $74,999  Data not reported  -1.4778    0.0 -2.1716  -0.784   True
 $50,000 - $74,999  Less than $15,000    0.973 0.0007  0.2793  1.6668   True
$75,000 or greater  Data not reported   -1.005 0.0004 -1.6985 -0.3115   True
$75,000 or greater  Less than $15,000   1.4458    0.0  0.7523  2.1393   True
 Data not reported  Less than $15,000   2.4508    0.0  1.7573  3.1444   True
----------------------------------------------------------------------------
No description has been provided for this image

Based on this, we can clearly tell that groups with less than $15,000 income tend to have higher data values, implying higher obesity rates, and their values are significantly different across the board.

In comparison, the more moderate income ranges do not have significant differences from each other, implying that those with lower income are of higher significance toward influencing the percentage of obesity.

ML Design/Development

With these analyses, we are given an idea of the connection between the different features of the data sets and their correlation with obesity and inactivity. We can now transition to using ML for prediction and pattern detection. However, we need to manipulate our features as our data frame is currently unusable for ML applications. To start, we need to turn our data into numerical values, since every column is categorical. We can do this via assigning a number to each different category for each column. This is kind of a different variation of one hot encoding where we dont need to add a new column for every category. The columns that we will examine include the following: Education, Income, Age, Race, Data_Value, and location.

In [ ]:
# Education is first:
display(obesity_df['Education'].unique())
array([nan, 'High school graduate', 'Less than high school',
       'Some college or technical school', 'College graduate'],
      dtype=object)
In [ ]:
# assign each level of education a number from 1 to 4
education_dict = {'Less than high school': 1, 'High school graduate': 2, 'Some college or technical school': 3, 'College graduate':4}
obesity_df['Education'] = obesity_df['Education'].replace(education_dict)
display(obesity_df['Education'].unique())
<ipython-input-85-4d2faeddb84a>:3: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  obesity_df['Education'] = obesity_df['Education'].replace(education_dict)
array([nan,  2.,  1.,  3.,  4.])
In [ ]:
# Then income
display(obesity_df['Income'].unique())
array([nan, '$50,000 - $74,999', 'Data not reported', 'Less than $15,000',
       '$25,000 - $34,999', '$15,000 - $24,999', '$35,000 - $49,999',
       '$75,000 or greater'], dtype=object)
In [ ]:
# assign each level of income a number from 1 to 6 leaving 'Data not reported' to be converted to nan
income_dict = {'Less than $15,000':1, '$15,000 - $24,999':2, '$25,000 - $34,999':3, '$35,000 - $49,999':4, '$50,000 - $74,999':5, '$75,000 or greater':6}
obesity_df['Income'] = obesity_df['Income'].replace(income_dict)
display(obesity_df['Income'].unique())
obesity_df['Income'] = obesity_df['Income'].replace({'Data not reported': float('NaN')})
array([nan, 5, 'Data not reported', 1, 3, 2, 4, 6], dtype=object)
<ipython-input-87-73d415d0410c>:5: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  obesity_df['Income'] = obesity_df['Income'].replace({'Data not reported': float('NaN')})
In [ ]:
# Age
display(obesity_df['Age(years)'].unique())
array([nan, '25 - 34', '55 - 64', '18 - 24', '45 - 54', '35 - 44',
       '65 or older'], dtype=object)
In [ ]:
# assign each level of age a number from 1 to 6
age_dict = {'18 - 24':1, '25 - 34':2, '35 - 44':3, '45 - 54':4, '55 - 64':5, '65 or older':6}
obesity_df['Age(years)'] = obesity_df['Age(years)'].replace(age_dict)
display(obesity_df['Age(years)'].unique())
<ipython-input-89-18bb04d1ef4c>:3: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  obesity_df['Age(years)'] = obesity_df['Age(years)'].replace(age_dict)
array([nan,  2.,  5.,  1.,  4.,  3.,  6.])
In [ ]:
# Race/ethnicity
display(obesity_df['Race/Ethnicity'].unique())
array(['Hispanic', nan, 'American Indian/Alaska Native', 'Asian',
       'Non-Hispanic White', 'Other', '2 or more races',
       'Hawaiian/Pacific Islander', 'Non-Hispanic Black'], dtype=object)
In [ ]:
# assign each specified race a number from 1 to 8
race_dict = {'Non-Hispanic White':1, 'Non-Hispanic Black':2, 'Hispanic':3, 'Asian':4, 'American Indian/Alaska Native':5, 'Hawaiian/Pacific Islander':6, '2 or more races':7, 'Other':8}
obesity_df['Race/Ethnicity'] = obesity_df['Race/Ethnicity'].replace(race_dict)
display(obesity_df['Race/Ethnicity'].unique())
<ipython-input-91-77004d7ace83>:3: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  obesity_df['Race/Ethnicity'] = obesity_df['Race/Ethnicity'].replace(race_dict)
array([ 3., nan,  5.,  4.,  1.,  8.,  7.,  6.,  2.])
In [ ]:
# Class: Target Value for first Model
display(obesity_df['Class'].unique())
array(['Physical Activity', 'Obesity / Weight Status',
       'Fruits and Vegetables'], dtype=object)
In [ ]:
# assign each class a number from 1 to 3
class_dict = {'Obesity / Weight Status':1, 'Physical Activity':2, 'Fruits and Vegetables':3}
obesity_df['Class'] = obesity_df['Class'].replace(class_dict)
display(obesity_df['Class'].unique())
<ipython-input-93-fc1c6120ad8f>:3: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  obesity_df['Class'] = obesity_df['Class'].replace(class_dict)
array([2, 1, 3])

With these columns now numerical, we need to find a way to fill in the missing values. Unfortunately, the dataset does not have a single row with no single missing value. However, we can see that the data has been sampled from multiple locations. We can use this to our advantage by filling in missing data in these locations with KNN imputation to replace missing data using similar datapoints. This is basically guessing what categories go in each column for a specific row, which may bias the data, but this is essential for doing any sort of ML algorithm.

To start, let's take a look at the quantity of entries in each location.

In [ ]:
obesity_df['LocationDesc'].value_counts()
Out[ ]:
count
LocationDesc
National 1736
West Virginia 1736
Oklahoma 1736
Mississippi 1736
Oregon 1736
Wisconsin 1736
Kansas 1736
Florida 1736
Idaho 1736
Arizona 1736
Montana 1736
Georgia 1736
North Carolina 1736
Pennsylvania 1736
North Dakota 1736
South Carolina 1736
Nebraska 1736
Tennessee 1736
Missouri 1736
Nevada 1736
Iowa 1736
Indiana 1736
Ohio 1736
Alaska 1736
Vermont 1736
Colorado 1736
Kentucky 1736
Utah 1736
New York 1736
Wyoming 1736
District of Columbia 1736
Alabama 1736
Rhode Island 1736
Delaware 1736
Washington 1736
Maine 1736
Michigan 1736
Virginia 1736
California 1736
Texas 1736
Connecticut 1736
Massachusetts 1736
Arkansas 1736
Illinois 1736
New Hampshire 1736
New Mexico 1736
Maryland 1736
Minnesota 1736
Hawaii 1736
Louisiana 1736
South Dakota 1736
New Jersey 1493
Puerto Rico 1316
Guam 1260
Virgin Islands 644

let's group the data by these locations and show an example dataframe from one of those locations.

In [ ]:
features = obesity_df[['Education', 'Income', 'LocationDesc', 'Age(years)', 'Race/Ethnicity', 'Data_Value', 'Class']]
grouped_features = features.groupby('LocationDesc')
grouped_features.count()
display(grouped_features.get_group('Alabama'))
Education Income LocationDesc Age(years) Race/Ethnicity Data_Value Class
9 NaN NaN Alabama 2.0 NaN 35.2 1
48 NaN NaN Alabama 5.0 NaN 35.3 1
119 NaN NaN Alabama 3.0 NaN 31.9 1
236 NaN NaN Alabama NaN NaN 37.7 1
305 NaN NaN Alabama NaN 6.0 NaN 1
... ... ... ... ... ... ... ...
88925 NaN NaN Alabama 2.0 NaN 21.2 2
88926 NaN NaN Alabama NaN 7.0 24.9 2
88927 NaN 4.0 Alabama NaN NaN 34.7 1
88928 NaN NaN Alabama NaN 8.0 NaN 2
88929 NaN NaN Alabama NaN 6.0 NaN 1

1736 rows × 7 columns

Now, let's use Scikit to do KNN imputation.

In [ ]:
from sklearn.impute import KNNImputer

k = 3
imputer = KNNImputer(n_neighbors=k)
new_table = pd.DataFrame()
# knn imputation for each location
for loc in grouped_features.groups.keys():
  # make a new sub table with no missing values
  new_subtable = imputer.fit_transform(grouped_features.get_group(loc).drop(columns = ['LocationDesc']))

  new_subtable = pd.DataFrame(new_subtable, columns = grouped_features.get_group(loc).drop(columns = 'LocationDesc').columns)
  # append sub table to the new table
  new_table = pd.concat([new_table, new_subtable])

display(new_table)
Education Income Age(years) Race/Ethnicity Data_Value Class
0 2.666667 3.666667 2.000000 7.333333 35.200000 1.0
1 2.333333 4.666667 5.000000 7.333333 35.300000 1.0
2 3.000000 4.666667 3.000000 7.333333 31.900000 1.0
3 3.333333 4.666667 4.000000 7.333333 37.700000 1.0
4 2.333333 4.333333 3.333333 6.000000 34.133333 1.0
... ... ... ... ... ... ...
1731 4.000000 1.000000 3.000000 3.666667 24.500000 1.0
1732 1.000000 1.333333 6.000000 6.666667 36.000000 2.0
1733 3.000000 3.666667 3.000000 3.666667 35.200000 1.0
1734 3.000000 4.000000 3.666667 3.666667 35.300000 1.0
1735 1.000000 5.666667 4.333333 3.666667 41.000000 1.0

93249 rows × 6 columns

We should also round the non integers to the nearest integers to match the categorical data.

In [ ]:
new_table['Education'] = new_table['Education'].apply(lambda x: round(x))
new_table['Income'] = new_table['Income'].apply(lambda x: round(x))
new_table['Age(years)'] = new_table['Age(years)'].apply(lambda x: round(x))
new_table['Race/Ethnicity'] = new_table['Race/Ethnicity'].apply(lambda x: round(x))
display(new_table)
Education Income Age(years) Race/Ethnicity Data_Value Class
0 3 4 2 7 35.200000 1.0
1 2 5 5 7 35.300000 1.0
2 3 5 3 7 31.900000 1.0
3 3 5 4 7 37.700000 1.0
4 2 4 3 6 34.133333 1.0
... ... ... ... ... ... ...
1731 4 1 3 4 24.500000 1.0
1732 1 1 6 7 36.000000 2.0
1733 3 4 3 4 35.200000 1.0
1734 3 4 4 4 35.300000 1.0
1735 1 6 4 4 41.000000 1.0

93249 rows × 6 columns

Now that we have our data ready for processing, we can take a moment to consider an important observation. The class data value describes the level of obesity, activity, or fruit/vegetable consumption. We can look at what exactly each data value describes via the question column.

In [ ]:
display(obesity_df['Class'].value_counts())
display(obesity_df['Question'].value_counts())
count
Class
2 47885
1 36234
3 9130

count
Question
Percent of adults aged 18 years and older who have obesity 18117
Percent of adults aged 18 years and older who have an overweight classification 18117
Percent of adults who engage in no leisure-time physical activity 18089
Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination) 7449
Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week 7449
Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination) 7449
Percent of adults who engage in muscle-strengthening activities on 2 or more days a week 7449
Percent of adults who report consuming fruit less than one time daily 4565
Percent of adults who report consuming vegetables less than one time daily 4565

We can group these questions into the three categories of obesity, activity, and fruit/vegetable consumption.

Obesity (1):

Negative: Percent of adults aged 18 years and older who have obesity; Percent of adults aged 18 years and older who have an overweight classification

Activity (2):

Negative: Percent of adults who engage in no leisure-time physical activity;

Positive: Percent of adults who achieve at least 300 minutes a week of moderate-intensity...; Percent of adults who achieve at least 150 minutes a week of moderate-intensity...; Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination); Percent of adults who engage in muscle-strengthening activities on 2 or more days a week;

Fruit/Vegetable consumption (3):

Negative: Percent of adults who report consuming fruit less than one time daily Percent of adults who report consuming vegetables less than one time daily

Overall Idea

A higher data value for the topic like obesity indicates a negative conclusion.

A higher data value for activity in terms of no leisure time physical activity indiciates a negative conclusion, whereas a higher value for the rest is positive.

Finally, A higher data value for fruit/vegetable consumption indicates a negative conclusion.

These ideas are important for regressors. In the following code, we explore the accuracy of different models in predicting the percentage of obesity given a description of a population (Race, Age range, Education, and Income).

For each different category, we can make a linear regression model for predicting the percentage for that category.

ML Algorithm Training and Test Data Analysis

In [ ]:
display(new_table)
Education Income Age(years) Race/Ethnicity Data_Value Class
0 3 4 2 7 35.200000 1.0
1 2 5 5 7 35.300000 1.0
2 3 5 3 7 31.900000 1.0
3 3 5 4 7 37.700000 1.0
4 2 4 3 6 34.133333 1.0
... ... ... ... ... ... ...
1731 4 1 3 4 24.500000 1.0
1732 1 1 6 7 36.000000 2.0
1733 3 4 3 4 35.200000 1.0
1734 3 4 4 4 35.300000 1.0
1735 1 6 4 4 41.000000 1.0

93249 rows × 6 columns

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

obesity = new_table[new_table['Class'] == 1]
activity = new_table[new_table['Class'] == 2]
nutrition = new_table[new_table['Class'] == 3]

dfs = [("Obesity",obesity), ("Activity",activity), ("Nutrition",nutrition)]

for group_name, df in dfs:
  target_features = df[['Education', 'Income', 'Age(years)', 'Race/Ethnicity']]
  model_target = df['Data_Value']
  X_train, X_test, y_train, y_test = train_test_split(target_features, model_target, test_size = 0.2, random_state = 42)
  scaler = StandardScaler()
  X_train_scaled = scaler.fit_transform(X_train)
  X_test_scaled = scaler.transform(X_test)
  model = LinearRegression()
  model.fit(X_train_scaled, y_train)
  y_pred = model.predict(X_test_scaled)
  mse = mean_squared_error(y_test, y_pred)
  r2 = r2_score(y_test, y_pred)
  print(f"Mean Squared Error for {group_name}: {mse}")
  print(f"R-squared: {r2}")

  plt.scatter(y_test, y_pred, color='blue', alpha=0.5, label = "Predicted values")
  plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linewidth=2, label = "Ideal Prediction")
  plt.title(f"{group_name} - Actual vs Predicted")
  plt.xlabel('Actual Data Value')
  plt.ylabel('Predicted Data Value')
  plt.grid(True)
  plt.legend()
  plt.show()
Mean Squared Error for Obesity: 28.361608311875717
R-squared: 0.2556615957364562
No description has been provided for this image
Mean Squared Error for Activity: 129.32780845095235
R-squared: 0.007346128474098879
No description has been provided for this image
Mean Squared Error for Nutrition: 102.35161305502322
R-squared: 0.1852917167638467
No description has been provided for this image

Depicted above is our Linear Regression Model, which acts as our prelimary analysis on obesity rate, given the continuous nature of obesity rate for each population.

Linear regression is a model that is used to display the relationship between a target and a single or multiple features. In Linear Regression, we aim to identify the best fitting line/equation that would represent this relationship. In addition, we will be able to understand whether or not our data may not fall under a linear pattern and may need additional models to try and grasp an understanding of the data.

From the start, we prioritized focusing on obesity as the primary target variable in the dataset, which includes multiple classes such as activity and nutrition as shown in the visualizations above. Given this initial focus, we chose to move forward with obesity as it aligns with the project's objectives. Thus, while other classes like activity and nutrition were considered, the decision to move forward with obesity is based on the project's initial goal.

In [ ]:
target_features = obesity[['Education', 'Income', 'Age(years)', 'Race/Ethnicity']]
model_target = obesity['Data_Value']
X_train, X_test, y_train, y_test = train_test_split(target_features, model_target, test_size = 0.2, random_state = 42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error for Obesity: {mse}")
print(f"R-squared: {r2}")
plt.scatter(y_test, y_pred, color='blue', alpha=0.5, label = "Predicted Values")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linewidth=2, label = "Ideal Prediction")
plt.title(f"Obesity - Actual vs Predicted")
plt.xlabel('Actual Data Value')
plt.ylabel('Predicted Data Value')
plt.grid(True)
plt.legend()
plt.show()
Mean Squared Error for Obesity: 28.361608311875717
R-squared: 0.2556615957364562
No description has been provided for this image

The Mean Squared Error (MSE) for obesity using linear regression is approximately 28.362. MSE measures the average squared difference between the predicted and actual values. In this case, an MSE of 28.362 means that, on average, the squared difference between the predicted and actual values is 28.362. A lower MSE indicates a better fit, so while this value isn’t ideal, it provides a starting point for evaluation. As this is the first model we have tested, the result suggests that the model is reasonably performing but may not be the most optimal.

Our R-squared value using linear regression is 0.256, which means that about 25.6% of the variance in the obesity data is explained by the model. R-squared can range from -1 to 1. A value of 1 would indicate that the model perfectly explains the variance in the target variable, while a value of 0 means that the model does not improve on simply predicting the mean of the target variable. A negative R-squared indicates that the model performs worse than predicting the mean.

In [ ]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline

models = {
    'Polynomial': make_pipeline(PolynomialFeatures(7), LinearRegression()),
    'KNN': KNeighborsRegressor(n_neighbors=20),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'Linear Regression': LinearRegression()
}

Polynomial Regression: Type of linear regression that allows for curving the model to fit more complex patterns in the data. It does this by using polynomial features (higher powers of the input features). Essentially, it tries to fit a curve to the data, rather than just a straight line, to better capture more complex patterns.

KNN: A simple model that makes predictions based on the data points that are closest to a given point. It looks at the "K" nearest neighbors (data points) to make predictions about a new data point. In simpler terms, it looks at its closest neighbors and takes their average or majority.

Decision Tree: makes predictions by splitting the data into smaller groups based on certain conditions (like answering yes/no questions) and making a prediction based on the group the data point belongs to similar to a flowchart, where each decision leads down a path to a final prediction.

Random Forest: Essentially multiple decision trees working together, and the results of all the decision trees are averaged to get a final prediction that is usually more accurate than a single tree.

Gradient Boosting: Similar to random forest, but instead of taking the average of the decision trees, it builds trees one by one where each one improves from the previous tree.

In [ ]:
for model_name, model in models.items():
    model = model
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Mean Squared Error for {model_name}: {mse}")
    print(f"R-squared: {r2}")

    plt.scatter(y_test, y_pred, color='blue', alpha=0.5, label = "Predicted Values")
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linewidth=2, label = "Ideal Prediction")
    plt.title(f"({model_name}) Obesity - Actual vs Predicted")
    plt.xlabel('Actual Data Value')
    plt.ylabel('Predicted Data Value')
    plt.grid(True)
    plt.legend()
    plt.show()
Mean Squared Error for Polynomial: 22.50360691296797
R-squared: 0.4094023626735278
No description has been provided for this image
Mean Squared Error for KNN: 23.434549173987705
R-squared: 0.3849701771144749
No description has been provided for this image
Mean Squared Error for Decision Tree: 22.36625415404215
R-squared: 0.4130071276881139
No description has been provided for this image
Mean Squared Error for Random Forest: 22.248140042551135
R-squared: 0.41610698254477285
No description has been provided for this image
Mean Squared Error for Gradient Boosting: 22.79475606241995
R-squared: 0.40176127649383364
No description has been provided for this image
Mean Squared Error for Linear Regression: 28.361608311875717
R-squared: 0.2556615957364562
No description has been provided for this image

The above plots show the predicted values of five different models. These models include Polynomial Regressor, KNN, Decision Tree, Random Forest, Gradient Boosting, and Linear Regression. The MSE and R^2 score for Polynomial Regressor is approximately 22.504 and 0.409 respectively. The MSE and R^2 score for KNN is approximately 23.435 and 0.385 respectively. The MSE and R^2 score for Decision Tree is approximately 22.366 and 0.413 respectively. The MSE and R^2 score for Random forest is approximately 22.248 and 0.416 respectively. The MSE and R^2 score for Gradient Boosting is approximately 22.795 and 0.402 respectively. The MSE and R^2 score for Linear Regression is approximately 28.362 and 0.256 respectively. When we look at these models, we want to identify the model with the lowest MSE and the highest R^2 score. As we can see, the Random Forest regressor was the most accurate model with the lowest Mean Squared Error and highest R^2 value.

The red lines in the graph represent the ideal distribution of the predicted values. If the predicted values were all neatly placed upon this line, the accuracy of the model would be perfect.

Insights and Conclusion

Throughout the tutorial, we have gained a fair amount of knowledge regarding using the data gathered for prediction of a population's level of obesity given its features. For example, if we are looking at a population of people who are hispanic, 18-24 years old, have a college education, and high income, we can assess their level of obesity.

In [ ]:
race_val = race_dict['Hispanic']
age_val = age_dict['18 - 24']
income_val = income_dict['$75,000 or greater']
education_val = education_dict['College graduate']
obesity_percentage = models['Random Forest'].predict([[education_val, income_val, age_val, race_val]])
print(obesity_percentage)
[18.20874881]

This prediction is the estimated percentage of obesity within the population described by the given categories.

In addition, through our hypothesis testing, we were able to determine which features were most significant in assessing obesity rate, especially income and education. We were also able to find out that, despite our initial hypothesis, the density of fast food locations does NOT have a significant influence on obesity rate

Unfortunately, despite testing several different models, we could not find a model that could get better than 22.25 mse. Given the nature of the continuous value we are predicting - the projected obesity percentage of a population given age, income, etc, it is difficult to directly classify a population as obese, not obese, or somewhere in the middle. However, due to the way the data points are scattered so close to each other, it is also difficult to get a clear decision boundary for regression. Thus, we believe that we can best predict the obesity rate using Random Forest, but there will still be some error.

Through our tutorial, the average reader will first be introduced to our dataset, and why we selected this one for our goal of predicting obesity rate. They will then be guided through our sanitization of the data - deleting redundant/irrelevant features, understanding which values are most relevant to our goal, and figuring out which null features need to be estimated, or simply dropped.

Next, the reader will be introduced to our hypothesis testing, which will introduce the user to some preliminary insights of our data: testing whether the distribution of obesity among different levels of education are the same using the Chi-Squared Test, searching for a linear relationship between fast food restaurant count and obesity rate, and utilizing the ANOVA test and Tukey's HSD to identify the connection between lower incomes and higher obesity rates.

Finally, the reader will be guided through our ML findings. After some further data manipulation to turn classified data into numerical values and estimating missing values using KNN imputation, we will introduce the user to machine learning using Linear Regression, to see if it can reliably predict the obesity rate's continuous value. We will then show other various models, explaining what each does, and showing their results to see which predicts the obesity rate the best. After all of our testings, the reader will be able to understand that Random Forest had the lowest error rate for obesity rate predictions, and can see an example predicted value given several key features.

We made sure to describe each step of the tutorial clearly and with code demonstrations to show our thought process throughout the analysis. This, coupled with our descriptions of each model and hypothesis testing, will allow the uninformed user to utilize our colab to understand how the obesity rate of each population can be predicted, as well as which categories were most influential in determining this. In addition, given our extensive data sanitization, explanation of which data is most important, and usage of several models, we believe that a more advanced user will learn more about the topic.