3 Results
3.0.1 Base OLS Model
To help explain the cancer death rate per 100,000 people in the United States, we examine an Ordinary Least Squares (OLS) model.
Before delving into that, we need to illustrate how we arrived at our best model.
We began by confirming that using the mean to fill in the missing 2018 values for adltObesity and physInact was appropriate.
We did this by comparing our base OLS model, which includes every variable under consideration, with the identical OLS model fitted without the 2018 data.
| Dependent variable: Deathsper100k | OLS | OLS No 2018 |
|---|---|---|
| adltSmoking | 1.648*** (p = 0.001) | 1.727*** (p = 0.004) |
| uninsured | 0.019 (p = 0.947) | -0.079 (p = 0.812) |
| STIs | -0.066*** (p = 0.000) | -0.066*** (p = 0.000) |
| excDrinking | 0.235 (p = 0.467) | 0.232 (p = 0.557) |
| foodIns | 0.487 (p = 0.439) | 0.661 (p = 0.395) |
| pctWhite | -0.069 (p = 0.389) | -0.071 (p = 0.454) |
| pctBlack | -0.345** (p = 0.038) | -0.348* (p = 0.076) |
| pctFemale | 19.861*** (p = 0.000) | 19.342*** (p = 0.000) |
| physInact | 0.329 (p = 0.327) | 0.356 (p = 0.345) |
| adltObesity | 2.071*** (p = 0.00000) | 1.973*** (p = 0.00001) |
| povStatusper100k | 0.001** (p = 0.024) | 0.001* (p = 0.058) |
| pctAgeOvr50 | 6.713*** (p = 0.000) | 6.912*** (p = 0.000) |
| Constant | -1,047.305*** (p = 0.000) | -1,023.885*** (p = 0.000) |
| Observations | 204 | 153 |
| \(R^2\) | 0.898 | 0.895 |
| Adjusted \(R^2\) | 0.891 | 0.886 |
| Note: | P-values reported in parentheses; *p<0.1; **p<0.05; ***p<0.01 | |
Examining the table above, we observe that the coefficients barely change and that the significance levels are stable: only pctBlack and povStatusper100k move, from the 5% level to the 10% level.
We also see that the \(R^2\) values are practically the same for both models.
This supports our method for filling the missing values of the adltObesity and physInact variables.
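As a rough illustration of the mean fill described above, the sketch below replaces missing entries with the mean of the observed ones. The function name `fill_with_mean` and the sample readings are hypothetical, and the text does not say whether the mean was taken within each state or across all states, so the example simply averages whatever list it is given.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical adltObesity readings for one state over four years;
# the third entry stands in for the missing 2018 value.
obesity = [30.0, 31.0, None, 32.0]
filled = fill_with_mean(obesity)
print(filled)
```

Refitting the model without the imputed year, as in the "OLS No 2018" column, is then a check that the filled values are not driving the estimates.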
Reflecting on our map and density plot alongside the results of the base OLS models, one might conclude that the change from year to year and from state to state does not contribute significantly to explaining cancer deaths.
This aligns precisely with our initial thoughts at this stage.
We proceeded to compare three new models: one where we controlled for the unexplained variability between each year, one where we controlled for the unexplained variability between each state, and the other controlling for states and years together.
3.0.2 New Models
Upon reviewing the regression results, it becomes evident that creating dummy variables for each year only introduces noise to our model without significantly altering the p-values or coefficients.
However, when we control at the state level, every variable loses significance except pctFemale, which remains significant only at the 10% level.
A similar pattern emerges when we include both state and year controls: only pctFemale and pctWhite are significant, and only at the 10% level.
Based on these findings, we opted not to pursue any fixed effects models further.
Nonetheless, these models underscore the importance of our demographic variables, particularly pctFemale.
| Dependent variable: Deathsper100k | Year Fixed Effect (OLS) | State Fixed Effect (panel linear) | State Year Fixed Effect (panel linear) |
|---|---|---|---|
| adltSmoking | 1.279** (p = 0.013) | 0.059 (p = 0.876) | 0.097 (p = 0.796) |
| uninsured | 0.107 (p = 0.705) | -0.588 (p = 0.328) | -0.581 (p = 0.328) |
| STIs | -0.056*** (p = 0.00000) | -0.009 (p = 0.414) | -0.007 (p = 0.475) |
| excDrinking | 0.227 (p = 0.482) | -0.327 (p = 0.193) | -0.357 (p = 0.145) |
| foodIns | 0.174 (p = 0.786) | 0.892 (p = 0.202) | 0.933 (p = 0.169) |
| pctWhite | -0.057 (p = 0.475) | 2.299 (p = 0.175) | 2.099* (p = 0.077) |
| pctBlack | -0.447*** (p = 0.009) | 3.625 (p = 0.196) | 3.397 (p = 0.180) |
| pctFemale | 20.157*** (p = 0.000) | 16.133* (p = 0.052) | 15.301* (p = 0.057) |
| physInact | 0.626* (p = 0.081) | 0.310 (p = 0.194) | 0.177 (p = 0.364) |
| adltObesity | 2.321*** (p = 0.000) | 0.036 (p = 0.923) | -0.030 (p = 0.928) |
| povStatusper100k | 0.001** (p = 0.035) | -0.001 (p = 0.265) | -0.001 (p = 0.294) |
| pctAgeOvr50 | 7.129*** (p = 0.000) | 2.104 (p = 0.442) | 1.962 (p = 0.400) |
| factor(Year)2017 | -0.354 (p = 0.861) | | |
| factor(Year)2018 | -2.511 (p = 0.234) | | |
| factor(Year)2019 | -5.761** (p = 0.025) | | |
| Constant | -1,075.977*** (p = 0.000) | | |
| Observations | 204 | 204 | 204 |
| \(R^2\) | 0.901 | 0.118 | 0.201 |
| Adjusted \(R^2\) | 0.893 | -0.297 | -0.151 |
| Note: | P-values reported in parentheses; *p<0.1; **p<0.05; ***p<0.01 | | |
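One plausible reading of why most predictors lose significance under state fixed effects is that the fixed-effects ("within") estimator uses only variation inside each state over time. The sketch below illustrates that mechanism with a single regressor and a hypothetical two-state panel; the data, the function names, and the one-regressor setup are all assumptions for illustration and are not taken from our dataset.

```python
from collections import defaultdict

def slope(x, y):
    """OLS slope of y on x (with an intercept) for a single regressor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def demean_by_group(groups, values):
    """Subtract each group's mean from its members (the within transformation)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for g, v in zip(groups, values)]

# Hypothetical panel: two states observed for three years each.
state = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
y = [10.0, 10.0, 10.0, 50.0, 50.0, 50.0]  # y differs across states, not within them

pooled = slope(x, y)  # uses between-state and within-state variation
within = slope(demean_by_group(state, x), demean_by_group(state, y))
print(pooled, within)
```

In this toy panel the pooled slope is large because the two states differ, while the within slope is zero because nothing changes inside a state. Predictors that vary mostly across states rather than over four years would behave similarly.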
Returning to our Base OLS model, we opted to break it down into three distinct parts to facilitate the development of our optimal model.
We decided to create separate models for the demographic variables, for potentially significant variables that might be too highly correlated with one another to be informative when included together, and for variables we deemed important to cancer death rates.
Before examining the findings of this regression, it’s worth noting that we excluded pctAsian from our model as we deemed it inappropriate to include. This decision was made due to the absence of cancer death data for Asian males, as mentioned earlier.
Furthermore, we selected pctBlack over pctHisp because we possessed cancer death data for the Black population but lacked such data for the Hispanic population.
3.0.3 Models Omitting Demographics
| Dependent variable: Deathsper100k | Base OLS | Demographic Variables | Correlated Variables 1 | Correlated Variables 2 |
|---|---|---|---|---|
| adltSmoking | 1.648*** (p = 0.001) | | | 3.654*** (p = 0.000) |
| uninsured | 0.019 (p = 0.947) | | | -1.428*** (p = 0.002) |
| STIs | -0.066*** (p = 0.000) | | | -0.065*** (p = 0.000) |
| excDrinking | 0.235 (p = 0.467) | | 0.298 (p = 0.656) | |
| foodIns | 0.487 (p = 0.439) | | 0.947 (p = 0.507) | |
| pctWhite | -0.069 (p = 0.389) | 0.558*** (p = 0.000) | | |
| pctBlack | -0.345** (p = 0.038) | 0.292 (p = 0.141) | | |
| pctFemale | 19.861*** (p = 0.000) | 16.567*** (p = 0.000) | | |
| physInact | 0.329 (p = 0.327) | | 3.984*** (p = 0.000) | |
| adltObesity | 2.071*** (p = 0.00000) | | | 2.106*** (p = 0.0003) |
| povStatusper100k | 0.001** (p = 0.024) | | 0.0002 (p = 0.820) | |
| pctAgeOvr50 | 6.713*** (p = 0.000) | 9.948*** (p = 0.000) | | |
| Constant | -1,047.305*** (p = 0.000) | -893.022*** (p = 0.000) | 37.428* (p = 0.090) | 73.403*** (p = 0.000) |
| Observations | 204 | 204 | 204 | 204 |
| \(R^2\) | 0.898 | 0.668 | 0.348 | 0.520 |
| Adjusted \(R^2\) | 0.891 | 0.661 | 0.335 | 0.510 |
| Note: | P-values reported in parentheses; *p<0.1; **p<0.05; ***p<0.01 | | | |
Examining the variables separately, we observe that the demographic variables alone account for 66.8% of the variability in cancer deaths (\(R^2\) = 0.668).
We will revisit this observation later, as we identified pctAgeOvr50 and pctFemale as among the most influential variables in our models, necessitating a dedicated discussion section to thoroughly explore their implications.
Turning to “Correlated Variables 1”, we note that only physical inactivity emerges as statistically significant at the 1% level, with a p-value of 0.00000000000021.
This finding surprised us, as we anticipated that isolating other potentially correlated variables might provide insights into their impact on cancer deaths.
For instance, separating physical inactivity from adult obesity allowed physical inactivity to show its importance, but removing uninsured status did not yield a similar effect on poverty status or food insecurity.
This led us to the conclusion that socioeconomic factors may not exert as significant an influence on cancer deaths as we initially hypothesized.
Continuing with the analysis of physical inactivity, we found that, holding all other variables constant, a one-percentage-point increase in the proportion of physically inactive adults corresponds to an average increase of 3.98 cancer deaths per 100,000 people, a statistically significant finding at the 1 percent level.
However, despite this significance, we have decided not to include physical inactivity in our final model.
Instead, adltObesity appeared to be more pivotal in explaining our data.
Nonetheless, it’s crucial to acknowledge that physical inactivity still plays a role in cancer deaths.
Additionally, we found that excessive drinking did not significantly contribute to explaining the variability of cancer deaths across America.
Finally, in “Correlated Variables 2,” all our variables demonstrate statistical significance at the 1 percent level.
However, in our final model, uninsured status becomes insignificant.
Therefore, we replaced it with food insecurity to enhance the optimization of our results.
3.0.4 Best Model
| Dependent variable: Deathsper100k | Final Model |
|---|---|
| adltSmoking | 2.290*** (p = 0.00000) |
| foodIns | 1.010** (p = 0.022) |
| STIs | -0.062*** (p = 0.000) |
| adltObesity | 2.203*** (p = 0.000) |
| pctWhite | -0.129* (p = 0.078) |
| pctFemale | 21.423*** (p = 0.000) |
| pctBlack | -0.437*** (p = 0.006) |
| pctAgeOvr50 | 6.437*** (p = 0.000) |
| Constant | -1,112.375*** (p = 0.000) |
| Observations | 204 |
| \(R^2\) | 0.894 |
| Adjusted \(R^2\) | 0.890 |
| Note: | P-values reported in parentheses; *p<0.1; **p<0.05; ***p<0.01 |
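The \(R^2\) and adjusted \(R^2\) rows of the table above are linked by the usual degrees-of-freedom adjustment, which penalizes each additional predictor. The sketch below applies the standard textbook formula to the reported values (204 observations, 8 predictors); the function name `adjusted_r2` is our own, and we assume the reported figure was computed this way.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a linear model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the Final Model table: R2 = 0.894, n = 204, 8 predictors.
final = adjusted_r2(0.894, 204, 8)
print(round(final, 3))
```

The result agrees with the reported adjusted value of 0.890 to three decimal places, so the small gap between the two statistics reflects only the eight estimated slopes.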
After analyzing all our variables, we’ve selected adltSmoking, foodIns, STIs, adltObesity, pctWhite, pctFemale, pctBlack, and pctAgeOvr50 as the chosen variables to explain cancer deaths across the United States.
Our final model yields an \(R^2\) of 0.894 (adjusted \(R^2\) of 0.890), indicating that it accounts for roughly 89% of the variability in cancer death rates.
Among these variables, we found adltSmoking, adltObesity, pctAgeOvr50, and pctFemale to be the most important.
All four variables exhibit the largest coefficients and are statistically significant at the 1 percent level.
Both of our race demographic variables, pctBlack and pctWhite, demonstrate statistical significance, with pctBlack being significant at the 1 percent level and pctWhite at the 10 percent level.
Their coefficients are smaller in magnitude than those of the other demographic variables, and both are negative.
The negative coefficients for both race demographics, coupled with pctFemale exhibiting a large positive coefficient, present a puzzling aspect of our research.
This observation hints at one of the limitations of our study.
Our dataset does not provide information on all cancer cases; rather, it presents data solely on cancer deaths, allowing us to explore factors beyond the cause of death alone.
Including adult smoking in our model is logical due to its significance in predicting cancer deaths. Smoking not only directly causes cancer but also contributes to an overall unhealthy lifestyle. Holding all other variables constant, a one-percentage-point increase in the proportion of adult smokers is associated with an average increase of 2.29 cancer deaths per 100,000 people.
Similarly, obesity correlates with an unhealthy lifestyle, and as the National Library of Medicine states, “Obesity has been linked to several common cancers including breast, colorectal, esophageal, kidney, gallbladder, uterine, pancreatic, and liver cancer. Obesity also increases the risk of dying from cancer and may influence the treatment choices. About 4–8% of all cancers are attributed to obesity.” As we found, and the NLM affirms, obesity does appear to be correlated with cancer deaths. This correlation makes sense; individuals who are already unhealthy may find it harder to combat diseases compared to those who are healthy. Holding all other variables constant, a one-percentage-point increase in the proportion of obese adults is associated with an average increase of 2.2 cancer deaths per 100,000 people.
Our findings suggest that regions with unhealthy residents tend to have higher death rates.
Therefore, promoting a healthier lifestyle could be an effective strategy to combat cancer deaths.
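Because the model is linear, the predicted change from moving several predictors at once is the sum of each coefficient times its change. The sketch below works through one hypothetical scenario (adult smoking down two percentage points, adult obesity down one) using coefficients copied from the Final Model table. The function name and the scenario are illustrative assumptions, and the result describes an association in our data rather than a causal effect.

```python
# Coefficients copied from the Final Model table
# (change in deaths per 100,000 per one-unit change in the predictor).
FINAL_COEFS = {
    "adltSmoking": 2.290,
    "foodIns": 1.010,
    "STIs": -0.062,
    "adltObesity": 2.203,
    "pctWhite": -0.129,
    "pctFemale": 21.423,
    "pctBlack": -0.437,
    "pctAgeOvr50": 6.437,
}

def predicted_change(deltas, coefs=FINAL_COEFS):
    """Predicted change in Deathsper100k for the given predictor changes,
    holding every predictor not listed in `deltas` constant."""
    return sum(coefs[name] * delta for name, delta in deltas.items())

# Hypothetical scenario: smoking falls 2 points, obesity falls 1 point.
change = predicted_change({"adltSmoking": -2.0, "adltObesity": -1.0})
print(round(change, 3))
```

Under these assumptions the model predicts roughly 6.8 fewer cancer deaths per 100,000 people.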
Despite the significance of these coefficients, their importance is overshadowed by pctAgeOvr50 and pctFemale.
During the initial stages of our research, the variable pctAgeOvr50 was not included in our dataset.
We introduced it later because, in our best model at that point (the “Pre Best Model” below), the coefficient for the percentage of the population that is female was 27.3, nearly seven times larger than the next largest coefficient.
A coefficient that large suggested the presence of omitted variable bias.
We speculated that age might be the missing variable, as older individuals tend to experience higher mortality rates for various reasons, including cancer.
| Dependent variable: Deathsper100k | Pre Best Model | Without pctFemale | Final Model |
|---|---|---|---|
| adltSmoking | 4.132*** (p = 0.000) | 0.881 (p = 0.121) | 2.290*** (p = 0.00000) |
| foodIns | -0.279 (p = 0.582) | 2.589*** (p = 0.00005) | 1.010** (p = 0.022) |
| STIs | -0.084*** (p = 0.000) | -0.080*** (p = 0.000) | -0.062*** (p = 0.000) |
| adltObesity | 1.934*** (p = 0.00000) | 1.506*** (p = 0.0003) | 2.203*** (p = 0.000) |
| pctWhite | -0.205** (p = 0.020) | 0.030 (p = 0.775) | -0.129* (p = 0.078) |
| pctFemale | 27.319*** (p = 0.000) | | 21.423*** (p = 0.000) |
| pctBlack | -0.915*** (p = 0.00000) | 1.323*** (p = 0.000) | -0.437*** (p = 0.006) |
| pctAgeOvr50 | | 10.613*** (p = 0.000) | 6.437*** (p = 0.000) |
| Constant | -1,287.863*** (p = 0.000) | -93.680*** (p = 0.00002) | -1,112.375*** (p = 0.000) |
| Observations | 204 | 204 | 204 |
| \(R^2\) | 0.844 | 0.775 | 0.894 |
| Adjusted \(R^2\) | 0.838 | 0.767 | 0.890 |
| Note: | P-values reported in parentheses; *p<0.1; **p<0.05; ***p<0.01 | | |
Upon examining the “Final Model” regression, we observe that pctAgeOvr50 has the second largest coefficient (6.44) and that including it lowers the pctFemale coefficient from 27.32 to 21.42, absorbing part of the omitted variable bias associated with pctFemale.
However, pctFemale still has a much larger coefficient than the other variables.
This reveals a limitation in our research: our model still suffers from omitted variable bias, most of which appears to be absorbed by pctFemale.
In order to further investigate the omitted variable bias, we constructed a model excluding pctFemale entirely.
Surprisingly, in this model, both adltSmoking and pctWhite lose statistical significance.
Conversely, foodIns emerges as statistically significant at the 1 percent level, with a coefficient of 2.59.
Not only is this coefficient considerably larger than previously observed, but it has also changed from negative to positive since the inclusion of pctAgeOvr50.
Analyzing how the omitted variable bias alters the model highlights that, although incorporating pctAgeOvr50 and pctFemale reduces some of the omitted variable bias, it does not entirely resolve it.
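The mechanism behind omitted variable bias can be shown with a small synthetic example. In the textbook two-regressor case, leaving out a variable shifts the remaining coefficient by the omitted variable's true coefficient times the slope from regressing the omitted variable on the included one. The data below are constructed by hand so the arithmetic is exact; the variables `x1`, `x2`, and `y` are hypothetical stand-ins and are not drawn from our dataset.

```python
def slope(x, y):
    """OLS slope of y on x (with an intercept) for a single regressor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

x1 = [-1.0, -1.0, 1.0, 1.0]     # included regressor
noise = [-1.0, 1.0, -1.0, 1.0]  # uncorrelated with x1 by construction
x2 = [0.5 * a + e for a, e in zip(x1, noise)]    # omitted regressor, correlated with x1
y = [2.0 * a + 3.0 * b for a, b in zip(x1, x2)]  # true model: beta1 = 2, beta2 = 3

short_slope = slope(x1, y)  # regression of y on x1 that omits x2
delta = slope(x1, x2)       # how strongly the omitted x2 moves with x1
print(short_slope, 2.0 + 3.0 * delta)
```

Both printed numbers coincide: the short regression credits `x1` with its own effect plus part of the effect of `x2`. A variable such as pctFemale could pick up the influence of a correlated, unmeasured factor in the same way.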
Another important set of factors behind cancer death rates that our model fails to cover is the personal health characteristics of each individual.
These differences could be in their genetics or underlying health conditions, thus impacting cancer death rates.
Also, since we are looking at the state level, we are not accounting for differences between urban and rural lifestyles, access to care, and other important factors that impact cancer death rates at a more local level.
Future research should concentrate on obtaining data that addresses the omitted variable bias we have identified.
Despite our research shortcomings, our final regression model demonstrates a correlation between healthier lifestyles and fewer cancer deaths. It’s important to note that this correlation does not imply that being healthier guarantees immunity from cancer; rather, it suggests that healthier lifestyles are associated with a lower incidence of cancer-related deaths.