90 | ?…2 | 77053 | HS-grad | 9 | Widowed | ?…7 | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |
34 | Private | 216864 | HS-grad | 9 | Divorced | Other-service | Unmarried | White | Female | 0 | 3770 | 45 | United-States | <=50K |
38 | Private | 150601 | 10th | 6 | Separated | Adm-clerical | Unmarried | White | Male | 0 | 3770 | 40 | United-States | <=50K |
U.S. Adult Census: Income Prediction with Logistic Regression
1 Summary
This report investigates income prediction using the UCI Adult Dataset (Kohavi and Becker 1996), which compiles demographic and income data from the 1994 U.S. Census. The primary objective is to predict whether an individual earns over $50,000 annually using factors such as age, education level, and hours worked per week. A logistic regression model was fitted and then used to predict income levels on held-out test cases, with performance assessed using the ROC curve (AUC: Area Under the Curve), sensitivity, specificity, and accuracy. The findings show that while the model achieves an overall accuracy of about 80%, it struggles to identify higher-income individuals (low sensitivity and only moderate precision), which warrants further refinement.
The insights derived from this study not only validate the role of education and work intensity in income determination but also suggest avenues for future research, such as integrating geographic and intersectional demographic variables to capture the complexities of income disparities. Overall, the analysis offers a comprehensive approach to understanding income inequality and provides actionable information for policy makers and individuals aiming to navigate economic opportunities.
2 Introduction
2.1 Dataset Overview
The dataset selected for this project is the UCI Adult Dataset (Kohavi and Becker 1996), available through the UCI Machine Learning Repository (Dua and Graff 2017). It contains demographic and income data collected by the U.S. Census Bureau and is widely used for predicting whether an individual’s income exceeds $50,000 per year based on various demographic factors.
2.2 Dataset Details:
- Dataset Name: UCI Adult Dataset (Kohavi and Becker 1996)
- Source: 1994 U.S. Census database, compiled by Ronny Kohavi and Barry Becker
- Total Observations: 32,561
- Total Variables: 15
2.3 Variables and Their Types
Variable Index | Variable Name | Type | Description |
---|---|---|---|
0 | age | continuous | Age of the individual |
1 | workclass | categorical | Employment sector |
2 | fnlwgt | continuous | Final weight: estimated number of individuals the record represents |
3 | education | categorical | Highest level of education attained |
4 | education-num | continuous | Numerical version of education level |
5 | marital-status | categorical | Marital status |
6 | occupation | categorical | Type of occupation |
7 | relationship | categorical | Position in household |
8 | race | categorical | Race of the individual |
9 | sex | categorical | Gender (Male/Female) |
10 | capital-gain | continuous | Capital gains earned |
11 | capital-loss | continuous | Capital losses incurred |
12 | hours-per-week | continuous | Average hours worked per week |
13 | native-country | categorical | Country of origin |
14 | income | categorical | Income level (<=50K, >50K) |
This dataset includes both categorical and numerical variables, making it suitable for analyzing relationships between demographic attributes and income levels. Further exploration and preprocessing may involve handling missing values and encoding categorical features.
2.4 Research Question
How accurately can key demographic factors predict whether an individual’s annual income exceeds $50,000?
This study aims to use demographic variables to predict income levels without pre-assuming key predictors. Our team initially analyzed different aspects of the dataset before deciding to focus on demographic influences on income such as age, education, and hours worked.
2.5 Literature Context
Prior research supports the importance of demographic factors in income prediction. Jo (2023) analyzed the Adult dataset and identified capital gain, education, relationship status, and occupation as key predictors. Similarly, Azzollini, Breen, and Nolan (2023) found that demographic differences explained 40% of income inequality across OECD countries, reinforcing the relevance of our analysis.
2.6 Objective
To develop and evaluate a predictive model that estimates the probability of an individual earning more than $50,000 annually based on their demographic characteristics:
- Prediction: Build a robust model to forecast whether an individual’s annual income will exceed $50,000.
- Model Evaluation: Assess model performance to ensure that the model provides reliable predictions.
3 Methods & Results
3.1 Loading the Libraries and Dataset
We will start by importing the necessary R libraries for data analysis and preprocessing. We then load the dataset into R by referencing the downloaded file path.
3.2 Data Wrangling
We will begin by cleaning Table 2. First, we remove missing values and convert the income column into a factor variable to ensure R treats it as a categorical variable. This transformation is crucial for statistical modeling and visualization, especially when income is used as a binary outcome in logistic regression. We also create new column names to streamline readability and analysis.
age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |
34 | Private | 216864 | HS-grad | 9 | Divorced | Other-service | Unmarried | White | Female | 0 | 3770 | 45 | United-States | <=50K |
Number of Rows: 32561
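The report's R code is not reproduced in this text, so the following is only a rough Python sketch of the cleaning step. It assumes that `?` is the dataset's missing-value marker (as the preview rows suggest) and that any row containing it is dropped. The names `COLUMNS`, `RAW`, `clean_rows`, and `high_income` are hypothetical, and the three sample rows are copied from the table above.

```python
import csv
import io

# Column names as shown in the cleaned table above.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]

# Three rows copied from the preview above, rewritten as comma-separated text.
RAW = """\
90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
"""


def clean_rows(text, missing_marker="?"):
    """Parse rows, drop rows with a missing marker, and encode income as 0/1."""
    cleaned = []
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    for values in reader:
        row = dict(zip(COLUMNS, (v.strip() for v in values)))
        if missing_marker in row.values():
            continue  # assumption: "?" means the value is missing
        # Binary outcome: 1 for ">50K", 0 for "<=50K".
        row["high_income"] = 1 if row["income"] == ">50K" else 0
        cleaned.append(row)
    return cleaned


rows = clean_rows(RAW)
print(len(rows), rows[0]["workclass"], rows[0]["high_income"])
```

If `?` is not declared as a missing-value marker when the file is read, a missing-value filter leaves such rows in place, so it is worth comparing row counts before and after cleaning.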
After removing missing values from Table 3, we randomly sample 10% of the data (which contains a total of 32561 observations), bringing our sample size to 3256 data points. The sample is then split into training and testing sets (80-20 split) for prediction analysis.
age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32 | Private | 48458 | HS-grad | 9 | Never-married | Sales | Own-child | Black | Female | 0 | 1669 | 45 | United-States | <=50K |
33 | Private | 75167 | HS-grad | 9 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 55 | United-States | <=50K |
30 | Private | 54608 | Bachelors | 13 | Married-civ-spouse | Adm-clerical | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
49 | State-gov | 203039 | 11th | 7 | Married-civ-spouse | Adm-clerical | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
24 | Private | 249046 | Bachelors | 13 | Never-married | Tech-support | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K |
39 | State-gov | 152307 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
Number of Rows: 2604
Above, we can see that Table 3 has been successfully split, with Table 4 containing 2604 rows, representing about 80% of our sample size: 3256.
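The row counts above can be reproduced with a short Python sketch. It is not the report's R code: the function name `sample_and_split`, the seed, and the use of truncation when converting fractions to counts are all assumptions, chosen because truncation yields the reported 3256 and 2604 rows.

```python
import random


def sample_and_split(n_rows, sample_frac=0.10, train_frac=0.80, seed=2024):
    """Return (train_idx, test_idx) for a random subsample of row indices."""
    rng = random.Random(seed)  # arbitrary fixed seed for reproducibility
    n_sample = int(n_rows * sample_frac)  # truncates 3256.1 to 3256
    # Sampling without replacement; the result is already in random order.
    sample = rng.sample(range(n_rows), n_sample)
    n_train = int(n_sample * train_frac)  # truncates 2604.8 to 2604
    return sample[:n_train], sample[n_train:]


train_idx, test_idx = sample_and_split(32561)
print(len(train_idx), len(test_idx))
```

Under these assumptions the remaining 652 sampled rows form the testing set.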
3.3 Exploratory Data Analysis (EDA)
3.3.1 Pairwise Plot:
To focus on the most relevant variables, we will exclude columns that do not directly contribute to addressing our research question. Hence, we have retained demographic predictors such as age, education level, and hours worked per week. These predictors were chosen based on prior literature (Azzollini, Breen, and Nolan 2023; Smith-Edgell 2024), theoretical considerations, and empirical evidence from exploratory analyses, which indicate that they have a significant influence on income levels.
Using Table 4 with the irrelevant columns dropped, we create pairwise plots to examine relationships between continuous variables (age, hours_per_week, education_num) and the response variable, as well as associations among the input variables.
Figure 1 shows that age is right-skewed, hours_per_week peaks around 40, and education_num has a bimodal distribution. Weak correlations (< 0.6) suggest minimal multicollinearity.
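The multicollinearity check described above amounts to computing pairwise Pearson correlations and comparing them with the 0.6 cut-off. The Python sketch below illustrates the idea on hypothetical toy values (not taken from the dataset); the names `pearson`, `data`, and `THRESHOLD` are illustrative only.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical toy values, not taken from the dataset.
data = {
    "age": [25, 38, 47, 52, 31, 60],
    "education_num": [9, 13, 10, 14, 9, 10],
    "hours_per_week": [40, 45, 40, 50, 35, 20],
}

THRESHOLD = 0.6  # cut-off used in the text above

for a, b in combinations(data, 2):
    r = pearson(data[a], data[b])
    flag = "check for multicollinearity" if abs(r) >= THRESHOLD else "ok"
    print(f"{a} vs {b}: r = {r:.3f} ({flag})")
```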
The following code generates a summary table for the continuous variables, reporting the mean, standard deviation, median, variance, maximum, and minimum of each.
name | mean | sd | median | variance | max | min |
---|---|---|---|---|---|---|
age | 39.11214 | 13.816498 | 37 | 190.895604 | 90 | 17 |
education_num | 10.15822 | 2.601664 | 10 | 6.768657 | 16 | 1 |
hours_per_week | 40.11444 | 12.064844 | 40 | 145.560468 | 99 | 1 |
Table 5 shows that the average age is 39 years (SD = 13.82) with a range of 17 to 90. The average education level (education_num) is 10 years (SD = 2.6), reflecting high school or some college education. For hours_per_week, the average is 40.11 hours (SD = 12.06), with a maximum of 99 hours, indicating some individuals work significantly long hours.
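As a minimal sketch of how such a summary can be computed, the Python block below uses the standard `statistics` module, whose `stdev` and `variance` are sample (n - 1) estimates, on a hypothetical list of ages. It then checks that the variances reported in Table 5 equal the squared standard deviations. The names `summarise`, `ages`, and `TABLE_5` are illustrative, and the toy ages are not from the dataset.

```python
import statistics


def summarise(values):
    """Summary statistics matching the columns of Table 5."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),          # sample standard deviation
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance
        "max": max(values),
        "min": min(values),
    }


# Hypothetical toy ages, not taken from the dataset.
ages = [17, 25, 37, 41, 58, 90]
summary = summarise(ages)
print(summary)

# (sd, variance) pairs as reported in Table 5.
TABLE_5 = {
    "age": (13.816498, 190.895604),
    "education_num": (2.601664, 6.768657),
    "hours_per_week": (12.064844, 145.560468),
}

# Consistency check: variance should equal sd squared, up to rounding.
for name, (sd, variance) in TABLE_5.items():
    print(name, abs(sd ** 2 - variance) < 1e-3)
```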
3.4 Proposed Method: Logistic Regression and ROC Curve
Why is Logistic Regression Appropriate?
Logistic regression is suitable for modeling binary outcomes like income categories (<=50K and >50K). It estimates the probability of an individual falling into a specific category based on predictors, then classifies the predictions based on a threshold. The ROC Curve, on the other hand, is a reliable tool for evaluating the effectiveness of our model.
3.4.1 Assumptions:
- Independence of observations.
- No high correlation among predictors.
- A large enough sample size for reliable estimates.
3.4.2 Limitations:
- Potential underfitting if too few predictors are included.
3.5 Fit the Logistic Regression Model
In the following code, we fit the logistic regression model to Table 4 using the relevant predictors.
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -8.6096196 | 0.3890123 | -22.13200 | 0 |
age | 0.0488833 | 0.0039918 | 12.24582 | 0 |
education_num | 0.3146841 | 0.0218138 | 14.42589 | 0 |
hours_per_week | 0.0519209 | 0.0046899 | 11.07072 | 0 |
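Table 6 shows that all predictors are statistically significant (their p-values are displayed as 0 because they fall below the table's rounding precision). Furthermore, education_num has the largest coefficient (0.3146841), meaning the largest change in log-odds per one-unit increase. Because the predictors are measured on different scales, this is a per-unit comparison and does not by itself rank their overall impact on model predictions.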
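To make the coefficients in Table 6 concrete, the Python sketch below converts them to odds ratios, rescales them by the standard deviations from Table 5, and computes a predicted probability for one hypothetical individual. The names `COEF`, `SD`, and `predict_probability` and the example person are illustrative, and it is an assumption that the Table 5 standard deviations describe the same data the model was fitted on.

```python
import math

# Estimates copied from Table 6.
COEF = {
    "(Intercept)": -8.6096196,
    "age": 0.0488833,
    "education_num": 0.3146841,
    "hours_per_week": 0.0519209,
}

# Standard deviations copied from Table 5 (assumed to describe the fitting data).
SD = {"age": 13.816498, "education_num": 2.601664, "hours_per_week": 12.064844}


def predict_probability(age, education_num, hours_per_week):
    """Probability of income >50K under the fitted logistic model."""
    logit = (
        COEF["(Intercept)"]
        + COEF["age"] * age
        + COEF["education_num"] * education_num
        + COEF["hours_per_week"] * hours_per_week
    )
    return 1.0 / (1.0 + math.exp(-logit))  # inverse of the logit link


# Odds ratio per one-unit increase in each predictor.
odds_ratios = {name: math.exp(b) for name, b in COEF.items() if name in SD}

# Change in log-odds per one standard deviation, comparable across scales.
per_sd_effect = {name: COEF[name] * SD[name] for name in SD}

# Hypothetical individual: age 40, education_num 13 (Bachelors), 40 hours per week.
p = predict_probability(age=40, education_num=13, hours_per_week=40)

print(odds_ratios)
print(per_sd_effect)
print(round(p, 3))
```

Under these assumptions education_num still has the largest effect per standard deviation, so the ordering suggested by the raw coefficients survives rescaling.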
3.6 Visualizing the ROC Curve
To evaluate the model, we will use the ROC curve to visualize the trade-off between sensitivity and specificity across classification thresholds. The AUC (Area Under the Curve) will be calculated to quantify model performance, with values closer to 1 indicating strong discrimination and values near 0.5 suggesting random guessing.
Figure 2 shows that the AUC (Area Under the Curve) obtained for the model (0.8005) is well above 0.5, indicating that the model performs much better than random guessing. This AUC suggests good discriminatory power in distinguishing between individuals earning <=50K and >50K based on the selected predictors.
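One way to read the AUC is as the probability that a randomly chosen >50K individual receives a higher predicted score than a randomly chosen <=50K individual. The Python sketch below computes it that way on hypothetical scores and labels; it illustrates the definition only and is not the routine used to produce Figure 2. The function name `auc` and the toy data are assumptions.

```python
def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities and true labels (1 means >50K).
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2, 0.1, 0.7]
labels = [1, 1, 1, 0, 0, 0, 0, 1]

print(auc(scores, labels))
```

A value of 0.5 corresponds to scores that carry no information about the label, and 1.0 to scores that separate the two classes perfectly.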
3.7 Test the Model on the Testing Dataset
Next, we apply the fitted model to the testing dataset, classify each observation, and summarize the results in a confusion matrix.
3.8 Classification Results and Model Metrics
Metric | Value |
---|---|
Sensitivity | 0.3333333 |
Specificity | 0.9291339 |
Precision | 0.5714286 |
Accuracy | 0.7975460 |
Cohen’s Kappa | 0.3085229 |
From Table 7, we observe the following metrics:
- Sensitivity (SN): 33.33%. The model correctly identifies 33.33% of higher-income individuals.
- Specificity (SP): 92.91%. 92.91% of lower-income individuals are correctly classified.
- Precision (PR): 57.14%. 57.14% of individuals predicted to earn >50K actually do, so roughly 43% of positive predictions are false positives.
- Accuracy (ACC): 79.75%. 79.75% of overall predictions are correct.
- Cohen’s Kappa (κ): 0.3085. Only fair agreement, better than random chance but with room for improvement.
3.9 Interpretation
- Strong specificity, but low sensitivity and moderate precision, suggest the model needs improvement in identifying high-income individuals.
- High accuracy reflects solid overall performance but overlooks class imbalance.
- Low Cohen’s Kappa indicates the need for refinement to improve consistency of predictions.
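The report does not list the confusion-matrix counts. The Python sketch below uses counts back-calculated from Table 7, assuming the testing set holds the 652 sampled rows not used for training (3256 - 2604). These counts reproduce the reported metrics to the printed precision but remain an inference, and the function name `classification_metrics` is hypothetical. The sketch also computes the accuracy of always predicting <=50K, which gives context for the class-imbalance point above.

```python
# Counts inferred from Table 7 and an assumed 652-row testing set.
# They are NOT stated in the report.
TP, FN, FP, TN = 48, 96, 36, 472


def classification_metrics(tp, fn, fp, tn):
    """Metrics from a 2x2 confusion matrix, with >50K as the positive class."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    p_chance = (
        (tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)
    ) / total ** 2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": accuracy,
        "kappa": (accuracy - p_chance) / (1 - p_chance),
        # Accuracy of always predicting the larger class.
        "majority_baseline": max(tp + fn, fp + tn) / total,
    }


metrics = classification_metrics(TP, FN, FP, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.7f}")
```

Under these inferred counts, always predicting <=50K would already be right about 77.9% of the time, so the model's 79.75% accuracy is a modest gain. This is why sensitivity and kappa are the more informative metrics here.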
4 Discussion
4.1 Summary of Findings and Implications
The logistic regression model showed strong predictive power (AUC = 0.8005067), distinguishing income levels considerably better than random guessing (AUC = 0.5). These findings can inform policies aimed at reducing income inequality. Education and hours worked were key predictors, emphasizing the need for skill development and work-life balance. Understanding the factors behind income disparities can help individuals make more informed career decisions and pursue opportunities for skill enhancement.
4.2 Expectations and Results
The model’s AUC (0.8005067) is strong, reflecting the importance of predictors like age, education, and hours worked. Overall, the results are consistent with expectations from the research study:
- Age correlates with experience, leading to higher salaries.
- Education increases income, with those holding a degree earning significantly more.
- Hours Worked reflects labor input, where more hours can translate to higher pay.
4.3 Future Research
- Geographic Influence on Income: Including geographic variables may reveal regional disparities in income linked to education and job opportunities.
- Intersectionality of Demographics: Exploring how race, gender, and marital status interact could improve the model’s accuracy in predicting income.
- Health and Disability Status: Accounting for health conditions or disability could provide additional insight into income disparities, since these conditions can limit education or work opportunities.