Cancer is among the leading causes of death globally. Many factors are related to cancer mortality, including socioeconomic status, age, and race. In this project, we aimed to build a multiple linear regression model to predict cancer mortality rates across counties in the United States. For the final model, we chose six predictors, mainly related to education level, race, employment status, income, and incidence rate. The final model explains about 48% of the variation in county-level cancer mortality (R-squared 48.14%), so it has moderate predictive ability.
The data for this project were aggregated from multiple sources, including the American Community Survey (census.gov), clinicaltrials.gov, and cancer.gov. The final dataset contains mean per capita (per 100,000) cancer mortality and related demographic information for 3047 counties.
Based on the selection criteria and parsimony, we chose the model with 6 predictors, which has the smallest BIC, a comparatively large adjusted R-squared, and a small subset size. Our final regression model contains 6 variables, mainly related to education level, race, employment status, income, and incidence rate. This multiple linear regression model explains 48.14% of the variation in the dependent variable (R-squared 48.14%; adjusted R-squared 48.04%).
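The selection criteria above can be illustrated with a minimal sketch. It is not the project's actual code: the function names (adjusted_r2, fit_ols, best_subset) are hypothetical, the data are synthetic, and an exhaustive search over subsets is assumed because the text does not state which search procedure was used. The BIC is written in its common Gaussian form, n*ln(RSS/n) + k*ln(n), up to an additive constant. The last lines check that the reported R-squared and adjusted R-squared are consistent with each other, assuming all 3047 counties and 6 predictors entered the fit.

```python
import itertools
import numpy as np


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (plus intercept)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def fit_ols(X, y):
    """Fit y on X with an intercept; return (r2, adj_r2, bic)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    resid = y - A @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - rss / tss
    # Gaussian BIC up to a constant; k = p + 1 counts the intercept.
    bic = n * np.log(rss / n) + (p + 1) * np.log(n)
    return r2, adjusted_r2(r2, n, p), bic


def best_subset(X, y):
    """Exhaustive search: return (columns, r2, adj_r2, bic) with the lowest BIC."""
    best = None
    for k in range(1, X.shape[1] + 1):
        for cols in itertools.combinations(range(X.shape[1]), k):
            r2, adj, bic = fit_ols(X[:, list(cols)], y)
            if best is None or bic < best[3]:
                best = (cols, r2, adj, bic)
    return best


# Synthetic illustration only: columns 0 and 2 carry the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(size=200)
cols, r2, adj, bic = best_subset(X, y)
print("selected columns:", cols, "adjusted R^2:", round(adj, 4))

# Consistency check on the reported figures (assumes n = 3047, p = 6).
print(round(adjusted_r2(0.4814, 3047, 6), 4))  # 0.4804
```

Because the BIC penalty grows with ln(n) per added predictor, it tends to favor smaller subsets than adjusted R-squared does, which matches the emphasis on parsimony above.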