US Obesity Rate Patterns
Fall 2024 Data Science Project
Isaac Plowman, Praharsh Nanduri, Justin Nguyen, Kevin Ferry, Charles Kim
Contributions:
A: Project idea B: Dataset Curation and Preprocessing C: Data Exploration and Summary Statistics D: ML Algorithm Design/Development E: ML Algorithm Training and Test Data Analysis F: Visualization, Result Analysis, Conclusion G: Final Tutorial Report Creation H: Additional (not listed above)
Kevin: Project Idea ~ Helped look for kaggle datasets. Dataset curation and Preprocessing ~ Helped clean data and look at ways to deal with missing values. ML Algorithm Design/Development ~ Helped develop ideas for ML algorithm design and some of the early questions we aim to answer using ML. Also worked on encoding some of the categorical variables. Visualization, Result Analysis, Conclusion ~ Worked on writing what the reader will experience when reading our tutorial and what they will be able to learn from utilizing our code.
Charles: Project Idea - worked with the group to search for Kaggle datasets. ML Algorithm and Training Test - worked on developing the initial model design and idea. Summary Statistics - extended the existing ANOVA test with a post-hoc test to pinpoint which income bracket had a significant impact on obesity rate, and worked on the linear regression model. After the initial findings, he came up with the idea to test several different models to see which one works best, as well as explore why some models didn't work well for our given data set. Result Analysis, Conclusion, Tutorial - worked on the final insights and explaining, given all of our models, why random forest worked the best.
Justin: Data Exploration and Summary Statistics - Helped look for possible datasets to use. Data Exploration and Summary Statistics - Implemented linear regression hypothesis test. ML Alg Training - Added different models to the model dictionary. Visualization, Result Analysis - Added labels and legend to various graphs, wrote analysis and explanation of model results.
Praharsh: Project Idea - Helped look for kaggle datasets. Data Exploration and Summary Statistics ~ Implemented a Chi-Squared test and a ANOVA test and developed descriptions/tutorials for each test. ML Algorithm and Training Test Data Analysis - Created the dictionary for models and chose/imported the corresponding models. Visualization, Result Analysis, Conclusion & Tutorial ~ Helped curate descriptions regarding visualizations of each plot and also curated descriptions regarding the choice behind Linear Regression and why we found Random Forest as our most accurate model.
Isaac: Project Idea - researched a topic and found the Nutrition_Activity_Obesity CSV from data.gov. Wrote most of the introduction paragraph. Dataset Curation - Cleaned the dataset by removing irrelevant columns and showing basic dataset info. ML Algorithm Design/Development - did feature engineering by changing the categories to numerical values and specified the features and target variables. ML Training - Helped implement and test the different models. Insight and Conclusion - Helped write the descriptions of the steps and contributed to writing the conclusion.
Overall, the team felt everyone contributed equally.
Introduction
Our topic is focused on obesity rates and patterns in the US. Specifically, our project involves looking at numerous factors like physical activity, geographic location, income, and education for correlation with obesity. We can also use the chosen data to predict the percentage for a given measure (e.g., the percentage of adults who engage in no leisure-time physical activity) based on the features of a geographical area.
Unfortunately, obesity is a highly prevalent disease that negatively affects people around the world. Data from 2023 shows that in 23 U.S. states, one in five adults has obesity (CDC data). With growing fast-food and snacking markets and rising prices for healthy foods, people are increasingly pushed toward cheaper, less healthy food alternatives. Additionally, obesity is linked to many chronic conditions like diabetes, heart disease, and certain cancers. We examine these datasets to help predict healthcare needs and to assist in creating nutritional guidelines and policies that mitigate the growth of obesity. We can also locate disparities in obesity rates and the connections between obesity and certain economic, racial, and geographical groups. These datasets allow us to examine environmental influences on obesity and understand the connection between fast food availability and health outcomes. Understanding this information identifies populations that are at risk, which helps determine how many resources to allocate and where to allocate them.
Data Curation
For our project, we analyzed three different datasets from two different websites. The links to where we gathered data include the following:
Nutrition_Activity_Obesity From Data.gov:
FastFoodRestaurants and Datafiniti_Fast_Food_Restaurants From Kaggle.com:
"Nutrition_Activity_Obesity" refers to the dataset that includes different topics or questions and their corresponding percentage. The "FastFoodRestaurants" datasets include the locations of fast food restaurants across the US. These datasets will be used for analysis and pattern prediction, but before that we must process and clean their data.
First we must create the data frames using the csv files. Down below we display the dataframes.
import pandas as pd
import matplotlib.pyplot as plt
# Load the three datasets into dataframes and display them
ffr_df = pd.read_csv('FastFoodRestaurants.csv')
obesity_df = pd.read_csv('Nutrition_Activity_Obesity.csv')
Datafini_ffr_df = pd.read_csv('Datafiniti_Fast_Food_Restaurants.csv')
display(obesity_df)
display(ffr_df)
display(Datafini_ffr_df)
YearStart | YearEnd | LocationAbbr | LocationDesc | Datasource | Class | Topic | Question | Data_Value_Unit | Data_Value_Type | ... | GeoLocation | ClassID | TopicID | QuestionID | DataValueTypeID | LocationID | StratificationCategory1 | Stratification1 | StratificationCategoryId1 | StratificationID1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020 | 2020 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | NaN | Value | ... | NaN | PA | PA1 | Q047 | VALUE | 59 | Race/Ethnicity | Hispanic | RACE | RACEHIS |
1 | 2014 | 2014 | GU | Guam | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | (13.444304, 144.793731) | OWS | OWS1 | Q036 | VALUE | 66 | Education | High school graduate | EDU | EDUHSGRAD |
2 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | NaN | OWS | OWS1 | Q036 | VALUE | 59 | Income | $50,000 - $74,999 | INC | INC5075 |
3 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | NaN | OWS | OWS1 | Q037 | VALUE | 59 | Income | Data not reported | INC | INCNR |
4 | 2015 | 2015 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who achieve at least 300 min... | NaN | Value | ... | NaN | PA | PA1 | Q045 | VALUE | 59 | Income | Less than $15,000 | INC | INCLESS15 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
93244 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | (43.23554134300048, -108.10983035299967) | OWS | OWS1 | Q037 | VALUE | 56 | Income | Less than $15,000 | INC | INCLESS15 |
93245 | 2022 | 2022 | WY | Wyoming | BRFSS | Physical Activity | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | NaN | Value | ... | (43.23554134300048, -108.10983035299967) | PA | PA1 | Q047 | VALUE | 56 | Education | Less than high school | EDU | EDUHS |
93246 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | (43.23554134300048, -108.10983035299967) | OWS | OWS1 | Q036 | VALUE | 56 | Age (years) | 35 - 44 | AGEYR | AGEYR3544 |
93247 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | (43.23554134300048, -108.10983035299967) | OWS | OWS1 | Q037 | VALUE | 56 | Income | $35,000 - $49,999 | INC | INC3550 |
93248 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | NaN | Value | ... | (43.23554134300048, -108.10983035299967) | OWS | OWS1 | Q036 | VALUE | 56 | Education | Less than high school | EDU | EDUHS |
93249 rows × 33 columns
address | city | country | keys | latitude | longitude | name | postalCode | province | websites | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 324 Main St | Massena | US | us/ny/massena/324mainst/-1161002137 | 44.921300 | -74.890210 | McDonald's | 13662 | NY | http://mcdonalds.com,http://www.mcdonalds.com/... |
1 | 530 Clinton Ave | Washington Court House | US | us/oh/washingtoncourthouse/530clintonave/-7914... | 39.532550 | -83.445260 | Wendy's | 43160 | OH | http://www.wendys.com |
2 | 408 Market Square Dr | Maysville | US | us/ky/maysville/408marketsquaredr/1051460804 | 38.627360 | -83.791410 | Frisch's Big Boy | 41056 | KY | http://www.frischs.com,https://www.frischs.com... |
3 | 6098 State Highway 37 | Massena | US | us/ny/massena/6098statehighway37/-1161002137 | 44.950080 | -74.845530 | McDonald's | 13662 | NY | http://mcdonalds.com,http://www.mcdonalds.com/... |
4 | 139 Columbus Rd | Athens | US | us/oh/athens/139columbusrd/990890980 | 39.351550 | -82.097280 | OMG! Rotisserie | 45701 | OH | http://www.omgrotisserie.com,http://omgrotisse... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 3013 Peach Orchard Rd | Augusta | US | us/ga/augusta/3013peachorchardrd/-791445730 | 33.415257 | -82.024531 | Wendy's | 30906 | GA | http://www.wendys.com,http://wendys.com |
9996 | 678 Northwest Hwy | Cary | US | us/il/cary/678northwesthwy/787691191 | 42.217300 | -88.255800 | Lee's Oriental Martial Arts | 60013 | IL | http://www.mcdonalds.com |
9997 | 1708 Main St | Longmont | US | us/co/longmont/1708mainst/-448666054 | 40.189190 | -105.101720 | Five Guys | 80501 | CO | http://fiveguys.com |
9998 | 67740 Highway 111 | Cathedral City | US | us/ca/cathedralcity/67740highway111/-981164808 | 33.788640 | -116.482150 | El Pollo Loco | 92234 | CA | http://www.elpolloloco.com,http://elpolloloco.com |
9999 | 5701 E La Palma Ave | Anaheim | US | us/ca/anaheim/5701elapalmaave/554191587 | 33.860074 | -117.789762 | Carl's Jr. | 92807 | CA | http://www.carlsjr.com |
10000 rows × 10 columns
id | dateAdded | dateUpdated | address | categories | city | country | keys | latitude | longitude | name | postalCode | province | sourceURLs | websites | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AVwcmSyZIN2L1WUfmxyw | 2015-10-19T23:47:58Z | 2018-06-26T03:00:14Z | 800 N Canal Blvd | American Restaurant and Fast Food Restaurant | Thibodaux | US | us/la/thibodaux/800ncanalblvd/1780593795 | 29.814697 | -90.814742 | SONIC Drive In | 70301 | LA | https://foursquare.com/v/sonic-drive-in/4b7361... | https://locations.sonicdrivein.com/la/thibodau... |
1 | AVwcmSyZIN2L1WUfmxyw | 2015-10-19T23:47:58Z | 2018-06-26T03:00:14Z | 800 N Canal Blvd | Fast Food Restaurants | Thibodaux | US | us/la/thibodaux/800ncanalblvd/1780593795 | 29.814697 | -90.814742 | SONIC Drive In | 70301 | LA | https://foursquare.com/v/sonic-drive-in/4b7361... | https://locations.sonicdrivein.com/la/thibodau... |
2 | AVwcopQoByjofQCxgfVa | 2016-03-29T05:06:36Z | 2018-06-26T02:59:52Z | 206 Wears Valley Rd | Fast Food Restaurant | Pigeon Forge | US | us/tn/pigeonforge/206wearsvalleyrd/-864103396 | 35.803788 | -83.580553 | Taco Bell | 37863 | TN | https://www.yellowpages.com/pigeon-forge-tn/mi... | http://www.tacobell.com,https://locations.taco... |
3 | AVweXN5RByjofQCxxilK | 2017-01-03T07:46:11Z | 2018-06-26T02:59:51Z | 3652 Parkway | Fast Food | Pigeon Forge | US | us/tn/pigeonforge/3652parkway/93075755 | 35.782339 | -83.551408 | Arby's | 37863 | TN | http://www.yellowbook.com/profile/arbys_163389... | http://www.arbys.com,https://locations.arbys.c... |
4 | AWQ6MUvo3-Khe5l_j3SG | 2018-06-26T02:59:43Z | 2018-06-26T02:59:43Z | 2118 Mt Zion Parkway | Fast Food Restaurant | Morrow | US | us/ga/morrow/2118mtzionparkway/1305117222 | 33.562738 | -84.321143 | Steak 'n Shake | 30260 | GA | https://foursquare.com/v/steak-n-shake/4bcf77a... | http://www.steaknshake.com/locations/23851-ste... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | AV12gJwna4HuVbed9Ayg | 2017-07-24T21:28:46Z | 2018-04-07T13:19:06Z | 3460 Robinhood Rd | Fast Food Restaurants | Winston-Salem | US | us/nc/winston-salem/3460robinhoodrd/-66712705 | 36.117563 | -80.316553 | Pizza Hut | 27106 | NC | https://www.allmenus.com/nc/winston-salem/7341... | http://www.pizzahut.com |
9996 | AV12gJxKIxWefVJwhpzS | 2017-07-24T21:28:46Z | 2018-04-07T13:19:05Z | 3069 Kernersville Rd | Fast Food Restaurants | Winston-Salem | US | us/nc/winston-salem/3069kernersvillerd/-66712705 | 36.077718 | -80.176748 | Pizza Hut | 27107 | NC | https://www.allmenus.com/nc/winston-salem/7340... | http://www.pizzahut.com |
9997 | AVwdJMdSByjofQCxl8Vr | 2015-10-24T00:17:32Z | 2018-04-07T13:19:05Z | 838 S Main St | Fast Food Restaurants | Kernersville | US | us/nc/kernersville/838smainst/-66712705 | 36.111015 | -80.089165 | Pizza Hut | 27284 | NC | https://www.allmenus.com/nc/kernersville/73400... | http://www.pizzahut.com |
9998 | AVwdl2cykufWRAb57ZPs | 2016-04-05T02:59:45Z | 2018-04-07T13:19:05Z | 1702 Glendale Dr SW | Fast Food Restaurants | Wilson | US | us/nc/wilson/1702glendaledrsw/-66712705 | 35.719981 | -77.945795 | Pizza Hut | 27893 | NC | https://www.allmenus.com/nc/wilson/73403-pizza... | http://www.pizzahut.com |
9999 | AVwdecWKIN2L1WUfwMWU | 2016-11-08T02:26:32Z | 2018-04-07T13:19:05Z | 1405 W Broad St | Fast Food Restaurants | Elizabethtown | US | us/nc/elizabethtown/1405wbroadst/-66712705 | 34.632778 | -78.624615 | Pizza Hut | 28337 | NC | https://www.allmenus.com/nc/elizabethtown/7339... | http://www.pizzahut.com,http://api.citygridmed... |
10000 rows × 15 columns
Now we can start cleaning and exploring the data in our dataframes. First, let's clean and examine obesity_df. It seems that all of the values in the "Data_Value_Unit" column are NaN, so let's check the count of non-NaN values using the count method.
print("Data_Value_Unit # of Non NaN:", obesity_df['Data_Value_Unit'].count()) # Count the number of non-NaN values in the column
Data_Value_Unit # of Non NaN: 0
Seeing that this column is entirely NaN, we can drop it.
obesity_df = obesity_df.drop(columns=['Data_Value_Unit'])
The "Data_Value_Type" column also appears uninformative, because every entry seems to be the literal string "Value". We can confirm that all the values in this column are "Value" by using the unique method.
obesity_df['Data_Value_Type'].unique()
array(['Value'], dtype=object)
We see that the only unique value is "Value," so we can drop this column as well.
obesity_df = obesity_df.drop(columns=['Data_Value_Type'])
display(obesity_df)
YearStart | YearEnd | LocationAbbr | LocationDesc | Datasource | Class | Topic | Question | Data_Value | Data_Value_Alt | ... | GeoLocation | ClassID | TopicID | QuestionID | DataValueTypeID | LocationID | StratificationCategory1 | Stratification1 | StratificationCategoryId1 | StratificationID1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020 | 2020 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | 30.6 | 30.6 | ... | NaN | PA | PA1 | Q047 | VALUE | 59 | Race/Ethnicity | Hispanic | RACE | RACEHIS |
1 | 2014 | 2014 | GU | Guam | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 29.3 | 29.3 | ... | (13.444304, 144.793731) | OWS | OWS1 | Q036 | VALUE | 66 | Education | High school graduate | EDU | EDUHSGRAD |
2 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 28.8 | 28.8 | ... | NaN | OWS | OWS1 | Q036 | VALUE | 59 | Income | $50,000 - $74,999 | INC | INC5075 |
3 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 32.7 | 32.7 | ... | NaN | OWS | OWS1 | Q037 | VALUE | 59 | Income | Data not reported | INC | INCNR |
4 | 2015 | 2015 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who achieve at least 300 min... | 26.6 | 26.6 | ... | NaN | PA | PA1 | Q045 | VALUE | 59 | Income | Less than $15,000 | INC | INCLESS15 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
93244 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 24.5 | 24.5 | ... | (43.23554134300048, -108.10983035299967) | OWS | OWS1 | Q037 | VALUE | 56 | Income | Less than $15,000 | INC | INCLESS15 |
93245 | 2022 | 2022 | WY | Wyoming | BRFSS | Physical Activity | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | 36.0 | 36.0 | ... | (43.23554134300048, -108.10983035299967) | PA | PA1 | Q047 | VALUE | 56 | Education | Less than high school | EDU | EDUHS |
93246 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 35.2 | 35.2 | ... | (43.23554134300048, -108.10983035299967) | OWS | OWS1 | Q036 | VALUE | 56 | Age (years) | 35 - 44 | AGEYR | AGEYR3544 |
93247 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 35.3 | 35.3 | ... | (43.23554134300048, -108.10983035299967) | OWS | OWS1 | Q037 | VALUE | 56 | Income | $35,000 - $49,999 | INC | INC3550 |
93248 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 41.0 | 41.0 | ... | (43.23554134300048, -108.10983035299967) | OWS | OWS1 | Q036 | VALUE | 56 | Education | Less than high school | EDU | EDUHS |
93249 rows × 31 columns
We can also get rid of some of the ID columns and the Data_Value_Alt column, since their information is captured by other columns. This makes the dataframe more readable. We can check whether Data_Value_Alt is needed by comparing it to the Data_Value column.
obesity_df['Data_Value_Alt'].equals(obesity_df['Data_Value'])
True
Since Data_Value_Alt is identical to Data_Value, we can drop it along with the ID columns.
obesity_df = obesity_df.drop(columns=['Data_Value_Alt','QuestionID', 'DataValueTypeID', 'ClassID', 'TopicID', 'LocationID', 'StratificationCategoryId1', 'StratificationID1'])
obesity_df
YearStart | YearEnd | LocationAbbr | LocationDesc | Datasource | Class | Topic | Question | Data_Value | Data_Value_Footnote_Symbol | ... | Sample_Size | Total | Age(years) | Education | Gender | Income | Race/Ethnicity | GeoLocation | StratificationCategory1 | Stratification1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020 | 2020 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | 30.6 | NaN | ... | 31255.0 | NaN | NaN | NaN | NaN | NaN | Hispanic | NaN | Race/Ethnicity | Hispanic |
1 | 2014 | 2014 | GU | Guam | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 29.3 | NaN | ... | 842.0 | NaN | NaN | High school graduate | NaN | NaN | NaN | (13.444304, 144.793731) | Education | High school graduate |
2 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 28.8 | NaN | ... | 62562.0 | NaN | NaN | NaN | NaN | $50,000 - $74,999 | NaN | NaN | Income | $50,000 - $74,999 |
3 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 32.7 | NaN | ... | 60069.0 | NaN | NaN | NaN | NaN | Data not reported | NaN | NaN | Income | Data not reported |
4 | 2015 | 2015 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who achieve at least 300 min... | 26.6 | NaN | ... | 30904.0 | NaN | NaN | NaN | NaN | Less than $15,000 | NaN | NaN | Income | Less than $15,000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
93244 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 24.5 | NaN | ... | 111.0 | NaN | NaN | NaN | NaN | Less than $15,000 | NaN | (43.23554134300048, -108.10983035299967) | Income | Less than $15,000 |
93245 | 2022 | 2022 | WY | Wyoming | BRFSS | Physical Activity | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | 36.0 | NaN | ... | 159.0 | NaN | NaN | Less than high school | NaN | NaN | NaN | (43.23554134300048, -108.10983035299967) | Education | Less than high school |
93246 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 35.2 | NaN | ... | 450.0 | NaN | 35 - 44 | NaN | NaN | NaN | NaN | (43.23554134300048, -108.10983035299967) | Age (years) | 35 - 44 |
93247 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 35.3 | NaN | ... | 512.0 | NaN | NaN | NaN | NaN | $35,000 - $49,999 | NaN | (43.23554134300048, -108.10983035299967) | Income | $35,000 - $49,999 |
93248 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 41.0 | NaN | ... | 146.0 | NaN | NaN | Less than high school | NaN | NaN | NaN | (43.23554134300048, -108.10983035299967) | Education | Less than high school |
93249 rows × 23 columns
We can also inspect the data value footnote columns in case they are of use.
obesity_df['Data_Value_Footnote_Symbol'].unique()
array([nan, '~'], dtype=object)
obesity_df['Data_Value_Footnote'].unique()
array([nan, 'Data not available because sample size is insufficient.'], dtype=object)
These columns simply flag the missing data, which we can already identify with the isna and dropna methods, so we can drop them.
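For instance, a quick sketch of how to count the missing values in each remaining column:
print(obesity_df.isna().sum())  # number of NaN entries per column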
obesity_df = obesity_df.drop(columns=['Data_Value_Footnote_Symbol', 'Data_Value_Footnote'])
obesity_df
YearStart | YearEnd | LocationAbbr | LocationDesc | Datasource | Class | Topic | Question | Data_Value | Low_Confidence_Limit | ... | Sample_Size | Total | Age(years) | Education | Gender | Income | Race/Ethnicity | GeoLocation | StratificationCategory1 | Stratification1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020 | 2020 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | 30.6 | 29.4 | ... | 31255.0 | NaN | NaN | NaN | NaN | NaN | Hispanic | NaN | Race/Ethnicity | Hispanic |
1 | 2014 | 2014 | GU | Guam | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 29.3 | 25.7 | ... | 842.0 | NaN | NaN | High school graduate | NaN | NaN | NaN | (13.444304, 144.793731) | Education | High school graduate |
2 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 28.8 | 28.1 | ... | 62562.0 | NaN | NaN | NaN | NaN | $50,000 - $74,999 | NaN | NaN | Income | $50,000 - $74,999 |
3 | 2013 | 2013 | US | National | Behavioral Risk Factor Surveillance System | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 32.7 | 31.9 | ... | 60069.0 | NaN | NaN | NaN | NaN | Data not reported | NaN | NaN | Income | Data not reported |
4 | 2015 | 2015 | US | National | Behavioral Risk Factor Surveillance System | Physical Activity | Physical Activity - Behavior | Percent of adults who achieve at least 300 min... | 26.6 | 25.6 | ... | 30904.0 | NaN | NaN | NaN | NaN | Less than $15,000 | NaN | NaN | Income | Less than $15,000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
93244 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 24.5 | 16.3 | ... | 111.0 | NaN | NaN | NaN | NaN | Less than $15,000 | NaN | (43.23554134300048, -108.10983035299967) | Income | Less than $15,000 |
93245 | 2022 | 2022 | WY | Wyoming | BRFSS | Physical Activity | Physical Activity - Behavior | Percent of adults who engage in no leisure-tim... | 36.0 | 27.9 | ... | 159.0 | NaN | NaN | Less than high school | NaN | NaN | NaN | (43.23554134300048, -108.10983035299967) | Education | Less than high school |
93246 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 35.2 | 30.6 | ... | 450.0 | NaN | 35 - 44 | NaN | NaN | NaN | NaN | (43.23554134300048, -108.10983035299967) | Age (years) | 35 - 44 |
93247 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 35.3 | 30.2 | ... | 512.0 | NaN | NaN | NaN | NaN | $35,000 - $49,999 | NaN | (43.23554134300048, -108.10983035299967) | Income | $35,000 - $49,999 |
93248 | 2022 | 2022 | WY | Wyoming | BRFSS | Obesity / Weight Status | Obesity / Weight Status | Percent of adults aged 18 years and older who ... | 41.0 | 31.9 | ... | 146.0 | NaN | NaN | Less than high school | NaN | NaN | NaN | (43.23554134300048, -108.10983035299967) | Education | Less than high school |
93249 rows × 21 columns
In the following code, we drop the columns in the dataset that we will not use for our analyses.
obesity_df = obesity_df.drop(columns=['YearStart', 'Datasource', 'Topic', 'Low_Confidence_Limit','Sample_Size', 'Total', 'StratificationCategory1'])
To continue, let's take a look at the columns and their quantities of missing data using the info method.
obesity_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93249 entries, 0 to 93248
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   YearEnd                93249 non-null  int64
 1   LocationAbbr           93249 non-null  object
 2   LocationDesc           93249 non-null  object
 3   Class                  93249 non-null  object
 4   Question               93249 non-null  object
 5   Data_Value             84014 non-null  float64
 6   High_Confidence_Limit  84014 non-null  float64
 7   Age(years)             19980 non-null  object
 8   Education              13320 non-null  object
 9   Gender                 6660 non-null   object
 10  Income                 23310 non-null  object
 11  Race/Ethnicity         26640 non-null  object
 12  GeoLocation            91513 non-null  object
 13  Stratification1        93240 non-null  object
dtypes: float64(2), int64(1), object(11)
memory usage: 10.0+ MB
As we can tell, a few columns have very few non-null entries. These include Gender, Income, Race/Ethnicity, Education, and Age(years). Unfortunately, these columns may be important for prediction and analysis, as they could play a significant role in the obesity or inactivity rates at a certain location. We will handle these missing values later on, but for now let's do a basic exploration and analysis of the data. Specifically, let's take a look at the summary of the data, the variance, and the covariance between the features of the dataset and the level of obesity.
Data Exploration and Summary Statistics
First, let's look at the summary of the numerical columns that we've kept.
obesity_df.describe()
YearEnd | Data_Value | High_Confidence_Limit | |
---|---|---|---|
count | 93249.000000 | 84014.000000 | 84014.000000 |
mean | 2016.308068 | 31.226492 | 36.134303 |
std | 3.308679 | 10.021059 | 10.978276 |
min | 2011.000000 | 0.900000 | 3.000000 |
25% | 2013.000000 | 24.400000 | 28.700000 |
50% | 2017.000000 | 31.200000 | 36.000000 |
75% | 2019.000000 | 37.000000 | 42.200000 |
max | 2022.000000 | 77.600000 | 87.700000 |
We can see that the total number of rows is 93,249 and can observe characteristics of the values such as their quartiles and mean. We also notice that roughly 9,200 entries (93,249 - 84,014) are missing from the Data_Value column.
The most important variable to understand is the Data_Value column, which gives the percentage of a sample described by the associated question in the row. For example, the question/topic could be the percentage of adults who engage in no physical activity, and the data value is that percentage. Let's look at the general distribution of this value.
obesity_df['Data_Value'].plot(kind='hist', bins = 30)
plt.show()
As we can see, the distribution roughly follows a bell-curve shape, with a maximum of 77.6% and a minimum of 0.9%. The highest frequency of values lies in the 30-35% range.
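Since we mentioned variance and covariance earlier, here is a quick sketch of how the remaining numeric columns vary together (the categorical features are handled later, in the ML section):
# Covariance and correlation between the numeric columns that remain
print(obesity_df[['YearEnd', 'Data_Value', 'High_Confidence_Limit']].cov())
print(obesity_df[['YearEnd', 'Data_Value', 'High_Confidence_Limit']].corr())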
Next, let's take a look at the other dataframes for cleaning and examination, starting with ffr_df.
display(ffr_df)
address | city | country | keys | latitude | longitude | name | postalCode | province | websites | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 324 Main St | Massena | US | us/ny/massena/324mainst/-1161002137 | 44.921300 | -74.890210 | McDonald's | 13662 | NY | http://mcdonalds.com,http://www.mcdonalds.com/... |
1 | 530 Clinton Ave | Washington Court House | US | us/oh/washingtoncourthouse/530clintonave/-7914... | 39.532550 | -83.445260 | Wendy's | 43160 | OH | http://www.wendys.com |
2 | 408 Market Square Dr | Maysville | US | us/ky/maysville/408marketsquaredr/1051460804 | 38.627360 | -83.791410 | Frisch's Big Boy | 41056 | KY | http://www.frischs.com,https://www.frischs.com... |
3 | 6098 State Highway 37 | Massena | US | us/ny/massena/6098statehighway37/-1161002137 | 44.950080 | -74.845530 | McDonald's | 13662 | NY | http://mcdonalds.com,http://www.mcdonalds.com/... |
4 | 139 Columbus Rd | Athens | US | us/oh/athens/139columbusrd/990890980 | 39.351550 | -82.097280 | OMG! Rotisserie | 45701 | OH | http://www.omgrotisserie.com,http://omgrotisse... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 3013 Peach Orchard Rd | Augusta | US | us/ga/augusta/3013peachorchardrd/-791445730 | 33.415257 | -82.024531 | Wendy's | 30906 | GA | http://www.wendys.com,http://wendys.com |
9996 | 678 Northwest Hwy | Cary | US | us/il/cary/678northwesthwy/787691191 | 42.217300 | -88.255800 | Lee's Oriental Martial Arts | 60013 | IL | http://www.mcdonalds.com |
9997 | 1708 Main St | Longmont | US | us/co/longmont/1708mainst/-448666054 | 40.189190 | -105.101720 | Five Guys | 80501 | CO | http://fiveguys.com |
9998 | 67740 Highway 111 | Cathedral City | US | us/ca/cathedralcity/67740highway111/-981164808 | 33.788640 | -116.482150 | El Pollo Loco | 92234 | CA | http://www.elpolloloco.com,http://elpolloloco.com |
9999 | 5701 E La Palma Ave | Anaheim | US | us/ca/anaheim/5701elapalmaave/554191587 | 33.860074 | -117.789762 | Carl's Jr. | 92807 | CA | http://www.carlsjr.com |
10000 rows × 10 columns
We don't need the keys or websites columns. Also, the country column is irrelevant if every row describes a location in the US, so let's verify that before dropping it.
# Check whether all the country values are 'US'
ffr_df['country'].unique()
array(['US'], dtype=object)
Because 'US' is the only value, the country column carries no information: the data only describes locations within the US. We can drop it along with the keys and websites columns.
ffr_df = ffr_df.drop(columns=['keys', 'websites', 'country'])
display(ffr_df)
address | city | latitude | longitude | name | postalCode | province | |
---|---|---|---|---|---|---|---|
0 | 324 Main St | Massena | 44.921300 | -74.890210 | McDonald's | 13662 | NY |
1 | 530 Clinton Ave | Washington Court House | 39.532550 | -83.445260 | Wendy's | 43160 | OH |
2 | 408 Market Square Dr | Maysville | 38.627360 | -83.791410 | Frisch's Big Boy | 41056 | KY |
3 | 6098 State Highway 37 | Massena | 44.950080 | -74.845530 | McDonald's | 13662 | NY |
4 | 139 Columbus Rd | Athens | 39.351550 | -82.097280 | OMG! Rotisserie | 45701 | OH |
... | ... | ... | ... | ... | ... | ... | ... |
9995 | 3013 Peach Orchard Rd | Augusta | 33.415257 | -82.024531 | Wendy's | 30906 | GA |
9996 | 678 Northwest Hwy | Cary | 42.217300 | -88.255800 | Lee's Oriental Martial Arts | 60013 | IL |
9997 | 1708 Main St | Longmont | 40.189190 | -105.101720 | Five Guys | 80501 | CO |
9998 | 67740 Highway 111 | Cathedral City | 33.788640 | -116.482150 | El Pollo Loco | 92234 | CA |
9999 | 5701 E La Palma Ave | Anaheim | 33.860074 | -117.789762 | Carl's Jr. | 92807 | CA |
10000 rows × 7 columns
Now we can start hypothesis testing to find covariation and relationships between the features.
First, let's try to find the relationship between education level and obesity. Let's say that H0: the distribution of obesity among different levels of education is the same (education has no impact on obesity), and Ha: the distributions of obesity among different levels of education are different (education does have an impact on obesity).
Since we are comparing two categorical variables, we can use the Chi-Squared test. This test estimates the probability that two sets of categorical data come from the same distribution. Put simply, if the distribution of obesity level is the same across all education values, then we can say that education has no influence on obesity rate.
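For reference, the chi-squared statistic compares the observed counts $O_{ij}$ in the contingency table to the counts $E_{ij}$ expected if education and obesity level were independent:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}$$

A large statistic (and therefore a small p-value) means the observed table is unlikely under independence.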
Below is the contingency table and a plot of the relationship:
import scipy.stats as st
# Bucket Data_Value into Low/Medium/High using the 25th (24.4) and 75th (37.0)
# percentiles seen in the describe() output above
def obesity_level(val):
    if val > 37:
        return 'High'
    elif 24.4 <= val <= 37:
        return 'Medium'
    else:
        return 'Low'
obesity_df['level_of_obesity'] = obesity_df['Data_Value'].apply(obesity_level)
new_obesity = obesity_df.dropna(subset = ['Education', 'Data_Value'])
contingency_table = pd.crosstab(new_obesity['Education'], new_obesity['level_of_obesity'])
print(contingency_table)
contingency_table.plot(kind = 'bar')
plt.title('Relationship between education and obesity')
plt.ylabel('Obesity count')
plt.show()
level_of_obesity                  High   Low  Medium
Education
College graduate                   867  1173    1273
High school graduate               637   527    2149
Less than high school             1118   621    1574
Some college or technical school   613   855    1845
Now we can calculate the p-value using a Chi-Squared test and compare it to our significance level of 0.05. We can use this information to decide whether we need to reject the null hypothesis.
res = st.chi2_contingency(contingency_table)
f"{res.pvalue:.162f}"
'0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000005'
Our p-value is about 5.1 x 10^(-162), far below the significance level of 0.05, so we reject the null hypothesis. Therefore, education does impact the level of obesity, because the different levels of education have different distributions of obesity.
In this analysis, we examine the relationship between the number of fast food restaurants in each U.S. state and the obesity rate for those states. First, we filter the obesity dataset to include only data for the U.S. states. We then narrow it down further so that only the most recent obesity data for each state is included. Next, we calculate the number of fast food restaurants in each state by counting occurrences in the fast food restaurant dataset. We then merge the obesity data and the fast food restaurant counts on the state column to combine the relevant information into a single dataset. Using this merged dataset, we perform a linear regression hypothesis test to observe the relationship between the number of fast food restaurants and obesity rates. The scatter plot shows the data points, while the regression line, shown in red, visualizes the trend in the data. The goal is to determine whether there is a noticeable pattern, such as a positive correlation where an increase in the number of fast food restaurants corresponds to a higher obesity rate in the state.
import statsmodels.api as sm
import seaborn as sns
state_names = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut',
'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa',
'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan',
'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
filtered_obesity_df = obesity_df[obesity_df['LocationDesc'].isin(state_names)]
most_recent_df = filtered_obesity_df.loc[filtered_obesity_df.groupby('LocationDesc')['YearEnd'].idxmax()]
ffr_df['province'] = ffr_df['province'].replace('Co Spgs', 'CO')  # fix a mislabeled province value and assign the result back
province_counts = ffr_df['province'].value_counts().reset_index()
province_counts.columns = ['LocationAbbr', 'count']
merged_df = pd.merge(province_counts, most_recent_df, on='LocationAbbr', how='inner')
X = merged_df['count']
y = merged_df['Data_Value']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
plt.figure(figsize=(10, 6))
sns.scatterplot(data=merged_df, x='count', y='Data_Value', color='blue', label='Data Points')
sns.regplot(data=merged_df, x='count', y='Data_Value', scatter=False, color='red', label='Regression Line')
plt.title('Relationship Between Fast Food Restaurant Count and Obesity Rates')
plt.xlabel('Number of Fast Food Restaurants')
plt.ylabel('Obesity Rate (%)')
plt.legend()
plt.grid(True)
plt.show()
The result of the visualization above was much weaker than we expected. We were expecting a strong positive correlation between obesity percentage and the number of fast food restaurants per state. The relationship was still positive, but the correlation was weak, with a value of only 0.06.
Let's try to find the relationship between income and obesity. Let's say that H0: The mean obesity rate is the same across every income group, and Ha: The mean obesity rate is not the same across every income group (that means at least one group would be different).
Since we are comparing multiple income groups and looking at differences among means, it is best to use ANOVA testing.
ANOVA testing allows us to compare the means of multiple groups (income values), to determine if there are significant differences between them. Using this, we can hope to see whether one or more income values cause a significantly different outcome in obesity rate - signifying that yes, income value does influence obesity rate.
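Concretely, the one-way ANOVA F statistic compares the variation between the group means to the variation within the groups:

$$F = \frac{\sum_{k} n_k(\bar{x}_k - \bar{x})^2 / (K - 1)}{\sum_{k}\sum_{i}(x_{ki} - \bar{x}_k)^2 / (N - K)}$$

where $K$ is the number of income groups and $N$ is the total number of observations; a large $F$ (and hence a small p-value) indicates that at least one group mean differs.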
income_df = obesity_df[['Income', 'Data_Value']].dropna()
g1 = income_df[income_df['Income'] == 'Less than $15,000']['Data_Value']
g2 = income_df[income_df['Income'] == '$15,000 - $24,999']['Data_Value']
g3 = income_df[income_df['Income'] == '$25,000 - $34,999']['Data_Value']
g4 = income_df[income_df['Income'] == '$35,000 - $49,999']['Data_Value']
g5 = income_df[income_df['Income'] == '$50,000 - $74,999']['Data_Value']
g6 = income_df[income_df['Income'] == '$75,000 or greater']['Data_Value']
g7 = income_df[income_df['Income'] == 'Data not reported']['Data_Value']
res = st.f_oneway(g1, g2, g3, g4, g5, g6, g7)
res.pvalue
1.027379650584492e-30
Since our p-value is 1.03 x 10^(-30), well below the significance level of 0.05, we reject the null hypothesis. Therefore, we can state that at least one income group's mean obesity rate is different from the others.
We can display this down below:
import pandas as pd
import matplotlib.pyplot as plt
# Group the data based on Income categories and calculate their means
income_groups = [
"Less than $15,000",
"$15,000 - $24,999",
"$25,000 - $34,999",
"$35,000 - $49,999",
"$50,000 - $74,999",
"$75,000 or greater",
"Data not reported"
]
mean_values = [
g1.mean(),
g2.mean(),
g3.mean(),
g4.mean(),
g5.mean(),
g6.mean(),
g7.mean()
]
# Plot the means
plt.bar(income_groups, mean_values, color='skyblue')
plt.xlabel('Income Group')
plt.ylabel('Mean Data Value')
plt.title('Mean Obesity Data Value by Income Group')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
At first glance, the means look extremely similar despite our results from ANOVA. This implies that, although the means are numerically close, the differences between them are still statistically significant. We can then use a post-hoc test to identify which income group is significantly different. While ANOVA tells us whether any group influences obesity rate, post-hoc tests allow us to pinpoint exactly which groups do so.
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey = pairwise_tukeyhsd(
endog = income_df['Data_Value'], # Dependent variable
groups = income_df['Income'], # Categorical group labels
alpha = 0.05 # Significance level
)
# Print the results
print(tukey)
# Optionally, plot the results
tukey.plot_simultaneous()
plt.show()
          Multiple Comparison of Means - Tukey HSD, FWER=0.05
============================================================================
      group1             group2        meandiff  p-adj   lower   upper  reject
----------------------------------------------------------------------------
$15,000 - $24,999   $25,000 - $34,999   -0.3854  0.6571  -1.0789  0.3081  False
$15,000 - $24,999   $35,000 - $49,999   -0.5283  0.2709  -1.2218  0.1652  False
$15,000 - $24,999   $50,000 - $74,999   -0.8289  0.0078  -1.5227 -0.1351   True
$15,000 - $24,999   $75,000 or greater  -1.3017  0.0     -1.9952 -0.6082   True
$15,000 - $24,999   Data not reported   -2.3067  0.0     -3.0002 -1.6132   True
$15,000 - $24,999   Less than $15,000    0.1441  0.9964  -0.5494  0.8377  False
$25,000 - $34,999   $35,000 - $49,999   -0.1429  0.9966  -0.8364  0.5506  False
$25,000 - $34,999   $50,000 - $74,999   -0.4435  0.4904  -1.1373  0.2503  False
$25,000 - $34,999   $75,000 or greater  -0.9163  0.0019  -1.6098 -0.2227   True
$25,000 - $34,999   Data not reported   -1.9213  0.0     -2.6148 -1.2278   True
$25,000 - $34,999   Less than $15,000    0.5296  0.2681  -0.164   1.2231  False
$35,000 - $49,999   $50,000 - $74,999   -0.3006  0.8624  -0.9944  0.3932  False
$35,000 - $49,999   $75,000 or greater  -0.7734  0.0175  -1.4669 -0.0799   True
$35,000 - $49,999   Data not reported   -1.7784  0.0     -2.4719 -1.0849   True
$35,000 - $49,999   Less than $15,000    0.6724  0.0644  -0.0211  1.366   False
$50,000 - $74,999   $75,000 or greater  -0.4728  0.4086  -1.1666  0.221   False
$50,000 - $74,999   Data not reported   -1.4778  0.0     -2.1716 -0.784    True
$50,000 - $74,999   Less than $15,000    0.973   0.0007   0.2793  1.6668   True
$75,000 or greater  Data not reported   -1.005   0.0004  -1.6985 -0.3115   True
$75,000 or greater  Less than $15,000    1.4458  0.0      0.7523  2.1393   True
Data not reported   Less than $15,000    2.4508  0.0      1.7573  3.1444   True
----------------------------------------------------------------------------
Based on this, we can clearly tell that groups with less than $15,000 income tend to have higher data values, implying higher obesity rates, and their values are significantly different across the board.
In comparison, the more moderate income ranges do not differ significantly from each other, implying that lower income has the strongest influence on the percentage of obesity.
ML Design/Development
With these analyses, we now have an idea of how the different features of the data sets correlate with obesity and inactivity. We can now transition to using ML for prediction and pattern detection. However, we need to transform our features, as our dataframe is currently unusable for ML applications. To start, we need to turn our data into numerical values, since every column is categorical. We can do this by assigning a number to each category in each column. This is essentially label (ordinal) encoding, a variation on one-hot encoding that does not require adding a new column for every category. The columns that we will examine include the following: Education, Income, Age, Race, Data_Value, and location.
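As a side note, pandas' map method is an equivalent, non-destructive way to apply such dictionaries; a small sketch using the same Education mapping defined below:
# Sketch only: map sends values not found in the dictionary (including NaN) to NaN,
# avoiding the downcasting FutureWarning that replace can emit in newer pandas versions
education_codes = obesity_df['Education'].map({'Less than high school': 1, 'High school graduate': 2,
                                               'Some college or technical school': 3, 'College graduate': 4})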
# Education is first:
display(obesity_df['Education'].unique())
array([nan, 'High school graduate', 'Less than high school', 'Some college or technical school', 'College graduate'], dtype=object)
# assign each level of education a number from 1 to 4
education_dict = {'Less than high school': 1, 'High school graduate': 2, 'Some college or technical school': 3, 'College graduate':4}
obesity_df['Education'] = obesity_df['Education'].replace(education_dict)
display(obesity_df['Education'].unique())
array([nan, 2., 1., 3., 4.])
# Then income
display(obesity_df['Income'].unique())
array([nan, '$50,000 - $74,999', 'Data not reported', 'Less than $15,000', '$25,000 - $34,999', '$15,000 - $24,999', '$35,000 - $49,999', '$75,000 or greater'], dtype=object)
# assign each level of income a number from 1 to 6 leaving 'Data not reported' to be converted to nan
income_dict = {'Less than $15,000':1, '$15,000 - $24,999':2, '$25,000 - $34,999':3, '$35,000 - $49,999':4, '$50,000 - $74,999':5, '$75,000 or greater':6}
obesity_df['Income'] = obesity_df['Income'].replace(income_dict)
display(obesity_df['Income'].unique())
obesity_df['Income'] = obesity_df['Income'].replace({'Data not reported': float('NaN')})
array([nan, 5, 'Data not reported', 1, 3, 2, 4, 6], dtype=object)
# Age
display(obesity_df['Age(years)'].unique())
array([nan, '25 - 34', '55 - 64', '18 - 24', '45 - 54', '35 - 44', '65 or older'], dtype=object)
# assign each level of age a number from 1 to 6
age_dict = {'18 - 24':1, '25 - 34':2, '35 - 44':3, '45 - 54':4, '55 - 64':5, '65 or older':6}
obesity_df['Age(years)'] = obesity_df['Age(years)'].replace(age_dict)
display(obesity_df['Age(years)'].unique())
array([nan, 2., 5., 1., 4., 3., 6.])
# Race/ethnicity
display(obesity_df['Race/Ethnicity'].unique())
array(['Hispanic', nan, 'American Indian/Alaska Native', 'Asian', 'Non-Hispanic White', 'Other', '2 or more races', 'Hawaiian/Pacific Islander', 'Non-Hispanic Black'], dtype=object)
# assign each specified race a number from 1 to 8
race_dict = {'Non-Hispanic White':1, 'Non-Hispanic Black':2, 'Hispanic':3, 'Asian':4, 'American Indian/Alaska Native':5, 'Hawaiian/Pacific Islander':6, '2 or more races':7, 'Other':8}
obesity_df['Race/Ethnicity'] = obesity_df['Race/Ethnicity'].replace(race_dict)
display(obesity_df['Race/Ethnicity'].unique())
array([ 3., nan, 5., 4., 1., 8., 7., 6., 2.])
# Class: Target Value for first Model
display(obesity_df['Class'].unique())
array(['Physical Activity', 'Obesity / Weight Status', 'Fruits and Vegetables'], dtype=object)
# assign each class a number from 1 to 3
class_dict = {'Obesity / Weight Status':1, 'Physical Activity':2, 'Fruits and Vegetables':3}
obesity_df['Class'] = obesity_df['Class'].replace(class_dict)
display(obesity_df['Class'].unique())
array([2, 1, 3])
With these columns now numerical, we need a way to fill in the missing values. Unfortunately, not a single row in the dataset is free of missing values. However, we can see that the data has been sampled from multiple locations. We can use this to our advantage: within each location, we apply KNN imputation to fill in missing values using similar data points. This is essentially guessing which category belongs in each empty cell of a row, which may bias the data, but it is essential for running any ML algorithm.
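To make the idea concrete, here is a tiny sketch of KNN imputation on made-up numbers (the values are purely illustrative): each NaN is replaced using the k rows that look most similar on the features that are present.
import numpy as np
from sklearn.impute import KNNImputer
toy = [[1.0, 2.0],
       [2.0, np.nan],  # this missing value gets filled in from the 2 nearest rows
       [3.0, 6.0],
       [2.1, 4.1]]
print(KNNImputer(n_neighbors=2).fit_transform(toy))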
To start, let's take a look at the quantity of entries in each location.
obesity_df['LocationDesc'].value_counts()
count | |
---|---|
LocationDesc | |
National | 1736 |
West Virginia | 1736 |
Oklahoma | 1736 |
Mississippi | 1736 |
Oregon | 1736 |
Wisconsin | 1736 |
Kansas | 1736 |
Florida | 1736 |
Idaho | 1736 |
Arizona | 1736 |
Montana | 1736 |
Georgia | 1736 |
North Carolina | 1736 |
Pennsylvania | 1736 |
North Dakota | 1736 |
South Carolina | 1736 |
Nebraska | 1736 |
Tennessee | 1736 |
Missouri | 1736 |
Nevada | 1736 |
Iowa | 1736 |
Indiana | 1736 |
Ohio | 1736 |
Alaska | 1736 |
Vermont | 1736 |
Colorado | 1736 |
Kentucky | 1736 |
Utah | 1736 |
New York | 1736 |
Wyoming | 1736 |
District of Columbia | 1736 |
Alabama | 1736 |
Rhode Island | 1736 |
Delaware | 1736 |
Washington | 1736 |
Maine | 1736 |
Michigan | 1736 |
Virginia | 1736 |
California | 1736 |
Texas | 1736 |
Connecticut | 1736 |
Massachusetts | 1736 |
Arkansas | 1736 |
Illinois | 1736 |
New Hampshire | 1736 |
New Mexico | 1736 |
Maryland | 1736 |
Minnesota | 1736 |
Hawaii | 1736 |
Louisiana | 1736 |
South Dakota | 1736 |
New Jersey | 1493 |
Puerto Rico | 1316 |
Guam | 1260 |
Virgin Islands | 644 |
Let's group the data by these locations and show an example dataframe from one of those locations.
features = obesity_df[['Education', 'Income', 'LocationDesc', 'Age(years)', 'Race/Ethnicity', 'Data_Value', 'Class']]
grouped_features = features.groupby('LocationDesc')
grouped_features.count()
display(grouped_features.get_group('Alabama'))
Education | Income | LocationDesc | Age(years) | Race/Ethnicity | Data_Value | Class | |
---|---|---|---|---|---|---|---|
9 | NaN | NaN | Alabama | 2.0 | NaN | 35.2 | 1 |
48 | NaN | NaN | Alabama | 5.0 | NaN | 35.3 | 1 |
119 | NaN | NaN | Alabama | 3.0 | NaN | 31.9 | 1 |
236 | NaN | NaN | Alabama | NaN | NaN | 37.7 | 1 |
305 | NaN | NaN | Alabama | NaN | 6.0 | NaN | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
88925 | NaN | NaN | Alabama | 2.0 | NaN | 21.2 | 2 |
88926 | NaN | NaN | Alabama | NaN | 7.0 | 24.9 | 2 |
88927 | NaN | 4.0 | Alabama | NaN | NaN | 34.7 | 1 |
88928 | NaN | NaN | Alabama | NaN | 8.0 | NaN | 2 |
88929 | NaN | NaN | Alabama | NaN | 6.0 | NaN | 1 |
1736 rows × 7 columns
Now, let's use Scikit to do KNN imputation.
from sklearn.impute import KNNImputer
k = 3
imputer = KNNImputer(n_neighbors=k)
new_table = pd.DataFrame()
# knn imputation for each location
for loc in grouped_features.groups.keys():
    # make a new sub table with no missing values
    new_subtable = imputer.fit_transform(grouped_features.get_group(loc).drop(columns = ['LocationDesc']))
    new_subtable = pd.DataFrame(new_subtable, columns = grouped_features.get_group(loc).drop(columns = 'LocationDesc').columns)
    # append sub table to the new table
    new_table = pd.concat([new_table, new_subtable])
display(new_table)
Education | Income | Age(years) | Race/Ethnicity | Data_Value | Class | |
---|---|---|---|---|---|---|
0 | 2.666667 | 3.666667 | 2.000000 | 7.333333 | 35.200000 | 1.0 |
1 | 2.333333 | 4.666667 | 5.000000 | 7.333333 | 35.300000 | 1.0 |
2 | 3.000000 | 4.666667 | 3.000000 | 7.333333 | 31.900000 | 1.0 |
3 | 3.333333 | 4.666667 | 4.000000 | 7.333333 | 37.700000 | 1.0 |
4 | 2.333333 | 4.333333 | 3.333333 | 6.000000 | 34.133333 | 1.0 |
... | ... | ... | ... | ... | ... | ... |
1731 | 4.000000 | 1.000000 | 3.000000 | 3.666667 | 24.500000 | 1.0 |
1732 | 1.000000 | 1.333333 | 6.000000 | 6.666667 | 36.000000 | 2.0 |
1733 | 3.000000 | 3.666667 | 3.000000 | 3.666667 | 35.200000 | 1.0 |
1734 | 3.000000 | 4.000000 | 3.666667 | 3.666667 | 35.300000 | 1.0 |
1735 | 1.000000 | 5.666667 | 4.333333 | 3.666667 | 41.000000 | 1.0 |
93249 rows × 6 columns
We should also round the imputed values to the nearest integer so they match the original categorical codes.
new_table['Education'] = new_table['Education'].apply(lambda x: round(x))
new_table['Income'] = new_table['Income'].apply(lambda x: round(x))
new_table['Age(years)'] = new_table['Age(years)'].apply(lambda x: round(x))
new_table['Race/Ethnicity'] = new_table['Race/Ethnicity'].apply(lambda x: round(x))
display(new_table)
Education | Income | Age(years) | Race/Ethnicity | Data_Value | Class | |
---|---|---|---|---|---|---|
0 | 3 | 4 | 2 | 7 | 35.200000 | 1.0 |
1 | 2 | 5 | 5 | 7 | 35.300000 | 1.0 |
2 | 3 | 5 | 3 | 7 | 31.900000 | 1.0 |
3 | 3 | 5 | 4 | 7 | 37.700000 | 1.0 |
4 | 2 | 4 | 3 | 6 | 34.133333 | 1.0 |
... | ... | ... | ... | ... | ... | ... |
1731 | 4 | 1 | 3 | 4 | 24.500000 | 1.0 |
1732 | 1 | 1 | 6 | 7 | 36.000000 | 2.0 |
1733 | 3 | 4 | 3 | 4 | 35.200000 | 1.0 |
1734 | 3 | 4 | 4 | 4 | 35.300000 | 1.0 |
1735 | 1 | 6 | 4 | 4 | 41.000000 | 1.0 |
93249 rows × 6 columns
Now that we have our data ready for processing, we can take a moment to consider an important observation. The Class value indicates whether a row describes obesity, physical activity, or fruit/vegetable consumption. We can see exactly what each data value measures via the Question column.
display(obesity_df['Class'].value_counts())
display(obesity_df['Question'].value_counts())
count | |
---|---|
Class | |
2 | 47885 |
1 | 36234 |
3 | 9130 |
| Question | count |
|---|---|
| Percent of adults aged 18 years and older who have obesity | 18117 |
| Percent of adults aged 18 years and older who have an overweight classification | 18117 |
| Percent of adults who engage in no leisure-time physical activity | 18089 |
| Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination) | 7449 |
| Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week | 7449 |
| Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination) | 7449 |
| Percent of adults who engage in muscle-strengthening activities on 2 or more days a week | 7449 |
| Percent of adults who report consuming fruit less than one time daily | 4565 |
| Percent of adults who report consuming vegetables less than one time daily | 4565 |
We can group these questions into the three categories of obesity, activity, and fruit/vegetable consumption.
Obesity (1):
- Negative: Percent of adults aged 18 years and older who have obesity
- Negative: Percent of adults aged 18 years and older who have an overweight classification

Activity (2):
- Negative: Percent of adults who engage in no leisure-time physical activity
- Positive: Percent of adults who achieve at least 300 minutes a week of moderate-intensity...
- Positive: Percent of adults who achieve at least 150 minutes a week of moderate-intensity...
- Positive: Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)
- Positive: Percent of adults who engage in muscle-strengthening activities on 2 or more days a week

Fruit/Vegetable consumption (3):
- Negative: Percent of adults who report consuming fruit less than one time daily
- Negative: Percent of adults who report consuming vegetables less than one time daily
Overall Idea
A higher data value for an obesity question indicates a negative conclusion.
A higher data value for activity indicates a negative conclusion when the question concerns no leisure-time physical activity, whereas a higher value for the remaining activity questions is positive.
Finally, a higher data value for fruit/vegetable consumption indicates a negative conclusion.
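For reference, one way to make this grouping explicit in code is a small lookup table from question text to (class, polarity). The mapping below is purely illustrative; it is an assumption about how each question would be tagged and is not used by the models that follow.

```python
# Illustrative (hypothetical) mapping of each Question to its class and polarity.
# A "negative" polarity means a higher percentage is an unfavorable outcome.
question_polarity = {
    'Percent of adults aged 18 years and older who have obesity': (1, 'negative'),
    'Percent of adults aged 18 years and older who have an overweight classification': (1, 'negative'),
    'Percent of adults who engage in no leisure-time physical activity': (2, 'negative'),
    'Percent of adults who engage in muscle-strengthening activities on 2 or more days a week': (2, 'positive'),
    'Percent of adults who report consuming fruit less than one time daily': (3, 'negative'),
    'Percent of adults who report consuming vegetables less than one time daily': (3, 'negative'),
    # ...the remaining aerobic-activity questions would likewise be tagged (2, 'positive')
}
```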
These ideas are important for regressors. In the following code, we explore the accuracy of different models in predicting the percentage of obesity given a description of a population (Race, Age range, Education, and Income).
For each different category, we can make a linear regression model for predicting the percentage for that category.
ML Algorithm Training and Test Data Analysis
display(new_table)
| | Education | Income | Age(years) | Race/Ethnicity | Data_Value | Class |
|---|---|---|---|---|---|---|
| 0 | 3 | 4 | 2 | 7 | 35.200000 | 1.0 |
| 1 | 2 | 5 | 5 | 7 | 35.300000 | 1.0 |
| 2 | 3 | 5 | 3 | 7 | 31.900000 | 1.0 |
| 3 | 3 | 5 | 4 | 7 | 37.700000 | 1.0 |
| 4 | 2 | 4 | 3 | 6 | 34.133333 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 1731 | 4 | 1 | 3 | 4 | 24.500000 | 1.0 |
| 1732 | 1 | 1 | 6 | 7 | 36.000000 | 2.0 |
| 1733 | 3 | 4 | 3 | 4 | 35.200000 | 1.0 |
| 1734 | 3 | 4 | 4 | 4 | 35.300000 | 1.0 |
| 1735 | 1 | 6 | 4 | 4 | 41.000000 | 1.0 |

93249 rows × 6 columns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
obesity = new_table[new_table['Class'] == 1]
activity = new_table[new_table['Class'] == 2]
nutrition = new_table[new_table['Class'] == 3]
dfs = [("Obesity",obesity), ("Activity",activity), ("Nutrition",nutrition)]
# Fit and evaluate a baseline linear regression for each of the three categories
for group_name, df in dfs:
    target_features = df[['Education', 'Income', 'Age(years)', 'Race/Ethnicity']]
    model_target = df['Data_Value']
    X_train, X_test, y_train, y_test = train_test_split(target_features, model_target, test_size=0.2, random_state=42)

    # Standardize the features so they are on a comparable scale
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Mean Squared Error for {group_name}: {mse}")
    print(f"R-squared: {r2}")

    # Plot predicted vs. actual values for this category
    plt.scatter(y_test, y_pred, color='blue', alpha=0.5, label="Predicted values")
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linewidth=2, label="Ideal Prediction")
    plt.title(f"{group_name} - Actual vs Predicted")
    plt.xlabel('Actual Data Value')
    plt.ylabel('Predicted Data Value')
    plt.grid(True)
    plt.legend()
    plt.show()
Mean Squared Error for Obesity: 28.361608311875717
R-squared: 0.2556615957364562
Mean Squared Error for Activity: 129.32780845095235
R-squared: 0.007346128474098879
Mean Squared Error for Nutrition: 102.35161305502322
R-squared: 0.1852917167638467
Depicted above is our linear regression model, which serves as our preliminary analysis of obesity rate, since the obesity rate for each population is a continuous value.
Linear regression models the relationship between a target and one or more features by finding the best-fitting line (or hyperplane) for that relationship. It also gives us a useful baseline: if the fit is poor, that tells us the data likely do not follow a linear pattern and we will need additional, more flexible models to capture it.
From the start, we prioritized obesity as the primary target variable, even though the dataset also contains other classes such as activity and nutrition, as shown in the visualizations above. While those classes were considered, focusing on obesity best aligns with the project's original objectives.
target_features = obesity[['Education', 'Income', 'Age(years)', 'Race/Ethnicity']]
model_target = obesity['Data_Value']
X_train, X_test, y_train, y_test = train_test_split(target_features, model_target, test_size = 0.2, random_state = 42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error for Obesity: {mse}")
print(f"R-squared: {r2}")
plt.scatter(y_test, y_pred, color='blue', alpha=0.5, label = "Predicted Values")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linewidth=2, label = "Ideal Prediction")
plt.title(f"Obesity - Actual vs Predicted")
plt.xlabel('Actual Data Value')
plt.ylabel('Predicted Data Value')
plt.grid(True)
plt.legend()
plt.show()
Mean Squared Error for Obesity: 28.361608311875717
R-squared: 0.2556615957364562
The Mean Squared Error (MSE) for obesity using linear regression is approximately 28.362: on average, the squared difference between a predicted and actual value is about 28.4, which corresponds to a typical error of roughly 5.3 percentage points. A lower MSE indicates a better fit, so while this value isn't ideal, it gives us a starting point for evaluation. As this is the first model we tested, the result suggests the model performs reasonably but is probably not optimal.
Our R-squared value using linear regression is 0.256, which means that about 25.6% of the variance in the obesity data is explained by the model. R-squared has a maximum of 1, which would indicate that the model perfectly explains the variance in the target variable; a value of 0 means the model does no better than always predicting the mean of the target, and a negative value (R-squared is not bounded below) means the model performs worse than predicting the mean.
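To make these two numbers concrete, here is a short sketch that recomputes them directly from their definitions, using the `y_test` and `y_pred` arrays from the obesity model above, and then prints the intercept and coefficients that define the fitted line:

```python
import numpy as np

# MSE: the mean of the squared residuals
mse_manual = np.mean((y_test - y_pred) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE (by hand): {mse_manual:.3f}, R^2 (by hand): {r2_manual:.3f}")

# The "best fitting line" itself: an intercept plus one coefficient per feature.
# Note the coefficients are in units of the standardized (scaled) features.
print("Intercept:", model.intercept_)
print("Coefficients:", dict(zip(target_features.columns, model.coef_)))
```

These hand-computed values should match the scikit-learn metrics printed above.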
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
models = {
'Polynomial': make_pipeline(PolynomialFeatures(7), LinearRegression()),
'KNN': KNeighborsRegressor(n_neighbors=20),
'Decision Tree': DecisionTreeRegressor(random_state=42),
'Random Forest': RandomForestRegressor(random_state=42),
'Gradient Boosting': GradientBoostingRegressor(random_state=42),
'Linear Regression': LinearRegression()
}
Polynomial Regression: A type of linear regression that can curve the model to fit more complex patterns by adding polynomial features (higher powers and products of the input features). Essentially, it fits a curve to the data rather than just a straight line.
KNN: A simple model that makes predictions based on the data points closest to a given point. It looks at the "K" nearest neighbors of a new data point and takes their average (for regression) or their majority vote (for classification).
Decision Tree: Makes predictions by repeatedly splitting the data into smaller groups based on conditions (like answering yes/no questions) and predicting from the group a data point falls into. It works like a flowchart, where each decision leads down a path to a final prediction.
Random Forest: Essentially multiple decision trees working together; the predictions of all the trees are averaged to get a final prediction that is usually more accurate than a single tree (see the small sketch after these descriptions).
Gradient Boosting: Similar to Random Forest in that it combines many trees, but instead of averaging independent trees, it builds them one by one, with each new tree improving on the errors of the previous ones.
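To make the Random Forest idea concrete, here is a rough sketch of "many trees averaged". It is a simplification (scikit-learn's implementation also randomizes which features are considered at each split), but it shows the core mechanism using the obesity training split from above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n_trees = 10
tree_predictions = []

for _ in range(n_trees):
    # Bootstrap sample: draw training rows with replacement
    idx = rng.integers(0, len(X_train_scaled), size=len(X_train_scaled))
    tree = DecisionTreeRegressor(random_state=42)
    tree.fit(X_train_scaled[idx], y_train.iloc[idx])
    tree_predictions.append(tree.predict(X_test_scaled))

# The "forest" prediction is simply the average of the individual trees' predictions
forest_pred = np.mean(tree_predictions, axis=0)
print("Hand-rolled forest MSE:", mean_squared_error(y_test, forest_pred))
```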
# Fit each candidate model on the obesity training data and compare the results
for model_name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Mean Squared Error for {model_name}: {mse}")
    print(f"R-squared: {r2}")

    # Plot predicted vs. actual values for this model
    plt.scatter(y_test, y_pred, color='blue', alpha=0.5, label="Predicted Values")
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linewidth=2, label="Ideal Prediction")
    plt.title(f"({model_name}) Obesity - Actual vs Predicted")
    plt.xlabel('Actual Data Value')
    plt.ylabel('Predicted Data Value')
    plt.grid(True)
    plt.legend()
    plt.show()
Mean Squared Error for Polynomial: 22.50360691296797
R-squared: 0.4094023626735278
Mean Squared Error for KNN: 23.434549173987705
R-squared: 0.3849701771144749
Mean Squared Error for Decision Tree: 22.36625415404215
R-squared: 0.4130071276881139
Mean Squared Error for Random Forest: 22.248140042551135
R-squared: 0.41610698254477285
Mean Squared Error for Gradient Boosting: 22.79475606241995
R-squared: 0.40176127649383364
Mean Squared Error for Linear Regression: 28.361608311875717
R-squared: 0.2556615957364562
The above plots show the predicted values of six different models: Polynomial Regression, KNN, Decision Tree, Random Forest, Gradient Boosting, and Linear Regression. Their scores are, respectively: Polynomial Regression (MSE ~22.504, R^2 ~0.409), KNN (MSE ~23.435, R^2 ~0.385), Decision Tree (MSE ~22.366, R^2 ~0.413), Random Forest (MSE ~22.248, R^2 ~0.416), Gradient Boosting (MSE ~22.795, R^2 ~0.402), and Linear Regression (MSE ~28.362, R^2 ~0.256). We are looking for the model with the lowest MSE and the highest R^2 score, and by both measures the Random Forest regressor was the most accurate.
The red lines in the graphs represent the ideal relationship between predicted and actual values; if every predicted value fell exactly on this line, the model's predictions would be perfect.
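As an optional extra check (a sketch beyond the single train/test split used above, with an assumed 5-fold setup), cross-validation can help confirm that Random Forest's advantage is not an artifact of one particular split:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# 5-fold cross-validated R^2 on the obesity training data for the best model
cv_scores = cross_val_score(
    RandomForestRegressor(random_state=42),
    X_train_scaled, y_train,
    cv=5, scoring='r2',
)
print(f"Mean CV R^2: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
```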
Insights and Conclusion
Throughout the tutorial, we have shown how the gathered data can be used to predict a population's level of obesity given its features. For example, if we are looking at a population of people who are Hispanic, 18-24 years old, college educated, and high income, we can estimate their level of obesity.
race_val = race_dict['Hispanic']
age_val = age_dict['18 - 24']
income_val = income_dict['$75,000 or greater']
education_val = education_dict['College graduate']
obesity_percentage = models['Random Forest'].predict([[education_val, income_val, age_val, race_val]])
print(obesity_percentage)
[18.20874881]
This prediction is the estimated percentage of obesity within the population described by the given categories.
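Note that the Random Forest above was trained on standardized features, so to keep a query in the same feature space it can be passed through the fitted scaler first. A minimal sketch of that adjustment (using the same `scaler` fit on the obesity training data):

```python
import pandas as pd

# Build the query with the same column names used during training,
# then standardize it with the fitted scaler before predicting.
query = pd.DataFrame(
    [[education_val, income_val, age_val, race_val]],
    columns=['Education', 'Income', 'Age(years)', 'Race/Ethnicity'],
)
obesity_percentage = models['Random Forest'].predict(scaler.transform(query))
print(obesity_percentage)
```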
In addition, through our hypothesis testing we were able to determine which features were most significant in assessing obesity rate, most notably income and education. We also found that, contrary to our initial hypothesis, the density of fast food locations does NOT have a significant influence on obesity rate.
Unfortunately, despite testing several different models, we could not find one with an MSE better than about 22.25. The value we are predicting, the projected obesity percentage of a population given its age, income, and other features, is continuous, so it is difficult to cleanly classify a population as obese, not obese, or somewhere in between. At the same time, because the data points are scattered so close together, it is also difficult for a regressor to find a clear trend. Thus, we believe Random Forest gives the best obesity rate predictions, but some error will remain.
Through our tutorial, the average reader will first be introduced to our dataset and why we selected it for our goal of predicting obesity rate. They will then be guided through our cleaning of the data: deleting redundant or irrelevant features, understanding which values are most relevant to our goal, and deciding which missing values need to be estimated or simply dropped.
Next, the reader will be introduced to our hypothesis testing, which offers some preliminary insights into the data: testing whether the distribution of obesity across different levels of education is the same using the Chi-Squared test, searching for a linear relationship between fast food restaurant count and obesity rate, and using the ANOVA test and Tukey's HSD to identify the connection between lower incomes and higher obesity rates.
Finally, the reader will be guided through our machine learning findings. After further data manipulation to turn the categorical data into numerical values and estimate missing values with KNN imputation, we introduce Linear Regression to see whether it can reliably predict the continuous obesity rate. We then present several other models, explaining what each does and showing its results, to see which predicts the obesity rate best. After all of this testing, the reader will understand that Random Forest had the lowest error for obesity rate predictions, and can see an example prediction given several key features.
We made sure to describe each step of the tutorial clearly, with code demonstrations showing our thought process throughout the analysis. This, together with our descriptions of each model and hypothesis test, will allow an uninformed reader to use our Colab to understand how the obesity rate of a population can be predicted, as well as which categories were most influential in determining it. In addition, given our extensive data cleaning, explanation of which data matters most, and use of several models, we believe a more advanced reader will also learn more about the topic.