U.S. Adult Census: Income Prediction with Logistic Regression

Authors

Benjamin Gerochi

Izzy Zhou

Michael Tham

Yui Mikuriya

1 Summary

This report investigates income prediction using the UCI Adult Dataset (Kohavi and Becker 1996), which compiles demographic and income data from the 1994 U.S. Census. The primary objective is to predict whether an individual earns over $50,000 annually using factors such as age, education level, and hours worked per week. Using a logistic regression model, we predict income levels on held-out test cases and assess model performance with the ROC curve and its AUC (Area Under the Curve), sensitivity, specificity, and accuracy. The findings show that while the model achieves good overall accuracy (about 80%), its low sensitivity means that many high-income individuals are misclassified, which warrants further refinement.

The insights derived from this study not only validate the role of education and work intensity in income determination but also suggest avenues for future research, such as integrating geographic and intersectional demographic variables to capture the complexities of income disparities. Overall, the analysis offers a comprehensive approach to understanding income inequality and provides actionable information for policy makers and individuals aiming to navigate economic opportunities.

2 Introduction

2.1 Dataset Overview

The dataset selected for this project is the UCI Adult Dataset (Kohavi and Becker 1996), available through the UCI Machine Learning Repository (Dua and Graff 2017). It contains demographic and income data collected by the U.S. Census Bureau and is widely used for predicting whether an individual’s income exceeds $50,000 per year based on various demographic factors.

2.2 Dataset Details:

Dataset Name: UCI Adult Dataset (Kohavi and Becker 1996)
Source: 1994 U.S. Census database, compiled by Ronny Kohavi and Barry Becker
Total Observations: 32,561
Total Variables: 15

2.3 Variables and Their Types

Table 1: Variable Index and Descriptions

Variable Index	Variable Name	Type	Description
0	age	continuous	Age of the individual
1	workclass	categorical	Employment sector
2	fnlwgt	continuous	Final weight: estimated number of individuals the record represents
3	education	categorical	Highest level of education attained
4	education-num	continuous	Numerical version of education level
5	marital-status	categorical	Marital status
6	occupation	categorical	Type of occupation
7	relationship	categorical	Position in household
8	race	categorical	Race of the individual
9	sex	categorical	Gender (Male/Female)
10	capital-gain	continuous	Capital gains earned
11	capital-loss	continuous	Capital losses incurred
12	hours-per-week	continuous	Average hours worked per week
13	native-country	categorical	Country of origin
14	income	categorical	Income level (<=50K, >50K)

This dataset includes both categorical and numerical variables, making it suitable for analyzing relationships between demographic attributes and income levels. Further exploration and preprocessing may involve handling missing values and encoding categorical features.

2.4 Research Question

How accurately can key demographic factors predict whether an individual’s annual income exceeds $50,000?

This study aims to use demographic variables to predict income levels without pre-assuming key predictors. Our team initially analyzed different aspects of the dataset before deciding to focus on demographic influences on income such as age, education, and hours worked.

2.5 Literature Context

Prior research supports the importance of demographic factors in income prediction. Jo (2023) analyzed the Adult dataset and identified capital gain, education, relationship status, and occupation as key predictors. Similarly, Azzollini, Breen, and Nolan (2023) found that demographic differences explained 40% of income inequality across OECD countries, reinforcing the relevance of our analysis.

2.6 Objective

To develop and evaluate a predictive model that estimates the probability of an individual earning more than $50,000 annually based on their demographic characteristics:

Prediction: Build a robust model to forecast whether an individual’s annual income will exceed $50,000.
Model Evaluation: Assess model performance to ensure that the model provides reliable predictions.

3 Methods & Results

3.1 Loading the Libraries and Dataset

We will start by importing the necessary R libraries for data analysis and preprocessing. We then load the dataset into R by referencing the downloaded file path.

Table 2: Raw Adult Income Dataset

90	?…2	77053	HS-grad	9	Widowed	?…7	Not-in-family	White	Female	4356	40	United-States	<=50K
82	Private	132870	HS-grad	9	Widowed	Exec-managerial	Not-in-family	White	Female	4356	18	United-States	<=50K
66	?	186061	Some-college	10	Widowed	?	Unmarried	Black	Female	4356	40	United-States	<=50K
54	Private	140359	7th-8th	4	Divorced	Machine-op-inspct	Unmarried	White	Female	3900	40	United-States	<=50K
41	Private	264663	Some-college	10	Separated	Prof-specialty	Own-child	White	Female	3900	40	United-States	<=50K
34	Private	216864	HS-grad	9	Divorced	Other-service	Unmarried	White	Female	3770	45	United-States	<=50K
38	Private	150601	10th	6	Separated	Adm-clerical	Unmarried	White	Male	3770	40	United-States	<=50K

3.2 Data Wrangling

We begin by cleaning Table 2. First, we remove rows with missing (NA) values and convert the income column into a factor so that R treats it as a categorical variable. This transformation is crucial for statistical modeling and visualization, especially when income is used as a binary outcome in logistic regression. We also assign descriptive column names to improve readability. As Table 3 shows, unknown entries coded as "?" are not NA values, so they remain in the cleaned data.

Table 3: Cleaned Adult Income Dataset

age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_loss	hours_per_week	native_country	income
90	?	77053	HS-grad	9	Widowed	?	Not-in-family	White	Female	4356	40	United-States	<=50K
82	Private	132870	HS-grad	9	Widowed	Exec-managerial	Not-in-family	White	Female	4356	18	United-States	<=50K
66	?	186061	Some-college	10	Widowed	?	Unmarried	Black	Female	4356	40	United-States	<=50K
54	Private	140359	7th-8th	4	Divorced	Machine-op-inspct	Unmarried	White	Female	3900	40	United-States	<=50K
41	Private	264663	Some-college	10	Separated	Prof-specialty	Own-child	White	Female	3900	40	United-States	<=50K
34	Private	216864	HS-grad	9	Divorced	Other-service	Unmarried	White	Female	3770	45	United-States	<=50K

Number of Rows: 32561

After removing missing values from Table 3, we randomly sample 10% of the data (which contains a total of 32561 observations), bringing our sample size to 3256 data points. The sample is then split into training and testing sets (80-20 split) for prediction analysis.

Table 4: Training Set of Adult Dataset

age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_loss	hours_per_week	native_country	income
32	Private	48458	HS-grad	9	Never-married	Sales	Own-child	Black	Female	1669	45	United-States	<=50K
33	Private	75167	HS-grad	9	Married-civ-spouse	Transport-moving	Husband	White	Male	0	55	United-States	<=50K
30	Private	54608	Bachelors	13	Married-civ-spouse	Adm-clerical	Husband	White	Male	0	40	United-States	<=50K
49	State-gov	203039	11th	7	Married-civ-spouse	Adm-clerical	Husband	White	Male	0	40	United-States	<=50K
24	Private	249046	Bachelors	13	Never-married	Tech-support	Own-child	White	Male	0	40	United-States	<=50K
39	State-gov	152307	Bachelors	13	Married-civ-spouse	Prof-specialty	Husband	White	Male	0	40	United-States	<=50K

Number of Rows: 2604

As shown above, the sample drawn from Table 3 has been split, with Table 4 containing 2604 rows, about 80% of our 3256-observation sample.
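The sampling and splitting arithmetic above can be sketched as follows. The report's analysis was done in R, so this is only an illustrative Python sketch under stated assumptions: row indices stand in for the actual observations, sizes are assumed to be rounded down, and the seed value is hypothetical.

```python
import random

TOTAL_ROWS = 32561      # rows in the cleaned dataset (Table 3)
SAMPLE_FRACTION = 0.10  # 10% random sample
TRAIN_FRACTION = 0.80   # 80-20 train/test split
SEED = 123              # hypothetical seed, used only for reproducibility

rng = random.Random(SEED)

# Draw the 10% sample; row indices stand in for the observations.
sample_size = int(TOTAL_ROWS * SAMPLE_FRACTION)  # rounds down
sample_idx = rng.sample(range(TOTAL_ROWS), sample_size)

# Shuffle the sample, then cut it into training and testing parts.
train_size = int(sample_size * TRAIN_FRACTION)   # rounds down
shuffled = sample_idx[:]
rng.shuffle(shuffled)
train_idx = shuffled[:train_size]
test_idx = shuffled[train_size:]

print(sample_size, len(train_idx), len(test_idx))
```

Under these assumptions the sketch yields 3256 sampled rows, 2604 training rows, and 652 testing rows, which matches the row counts reported for the sample and for Table 4.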

3.3 Exploratory Data Analysis (EDA)

3.3.1 Pairwise Plot:

To focus on the most relevant variables, we will exclude columns that do not directly contribute to addressing our research question. Hence, we have retained demographic predictors such as age, education level, and hours worked per week. These predictors were chosen based on prior literature (Azzollini, Breen, and Nolan 2023; Smith-Edgell 2024), theoretical considerations, and empirical evidence from exploratory analyses, which indicate that they have a significant influence on income levels.

Using Table 4 with the irrelevant columns dropped, we create pairwise plots to examine relationships between continuous variables (age, hours_per_week, education_num) and the response variable, as well as associations among the input variables.

Figure 1: Pairwise Plot of Response and Predictors

Figure 1 shows that age is right-skewed, hours_per_week peaks around 40, and education_num has a bimodal distribution. All pairwise correlations are below 0.6, suggesting limited multicollinearity among the predictors.

We then compute key summary statistics for the continuous variables: mean, standard deviation, median, variance, maximum, and minimum.

Table 5: Summary Statistics Table of Relevant Predictors

name	mean	sd	median	variance	max	min
age	39.11214	13.816498	37	190.895604	90	17
education_num	10.15822	2.601664	10	6.768657	16	1
hours_per_week	40.11444	12.064844	40	145.560468	99	1

Table 5 shows that the average age is 39 years (SD = 13.82) with a range of 17 to 90. The average education level (education_num) is about 10 (SD = 2.6), which corresponds to a high school diploma or some college education. For hours_per_week, the average is 40.11 hours (SD = 12.06), with a maximum of 99 hours, indicating that some individuals work very long hours.
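The six statistics in Table 5 can be computed along the following lines. This is a Python sketch rather than the R code used in the report; the helper name `summarize` and the toy values are hypothetical. It also checks that the variances in Table 5 equal the squared standard deviations, as they should.

```python
import statistics

def summarize(values):
    """Return the six statistics reported in Table 5 for one variable."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),           # sample SD (n - 1)
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "max": max(values),
        "min": min(values),
    }

# Hypothetical toy values, not the real training data.
toy_hours = [40, 45, 18, 40, 55, 40]
stats = summarize(toy_hours)
print(stats)

# Consistency check on Table 5 itself: variance should equal sd squared.
table5 = {
    "age": (13.816498, 190.895604),
    "education_num": (2.601664, 6.768657),
    "hours_per_week": (12.064844, 145.560468),
}
consistent = {
    name: abs(sd ** 2 - variance) < 1e-3
    for name, (sd, variance) in table5.items()
}
print(consistent)
```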

3.4 Proposed Method: Logistic Regression and ROC Curve

Why is Logistic Regression Appropriate?

Logistic regression is suitable for modeling binary outcomes like income categories (<=50K and >50K). It estimates the probability of an individual falling into a specific category based on predictors, then classifies the predictions based on a threshold. The ROC Curve, on the other hand, is a reliable tool for evaluating the effectiveness of our model.
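For reference, the standard form of a logistic regression model with the three predictors retained in Section 3.3 can be written as below. The symbols \beta_0 to \beta_3, p, and the classification threshold c are notation introduced here for illustration; the report does not state which threshold was used.

```latex
\[
\log\frac{p}{1-p}
  = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{education\_num}
  + \beta_3\,\text{hours\_per\_week},
\qquad
p = \Pr(\text{income} > 50\text{K})
\]
\[
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{age}
      + \beta_2\,\text{education\_num} + \beta_3\,\text{hours\_per\_week})}},
\qquad
\widehat{\text{income}} =
\begin{cases}
  {>}50\text{K} & \text{if } p \ge c, \\
  {\le}50\text{K} & \text{otherwise.}
\end{cases}
\]
```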

3.4.1 Assumptions:

Independence of observations.
No high correlation among predictors.
A large enough sample size for reliable estimates.

3.4.2 Limitations:

Potential underfitting if too few predictors are included.

3.5 Fit the Logistic Regression Model

We fit the logistic regression model to the training set (Table 4) using the relevant predictors: age, education_num, and hours_per_week.

Table 6: Summary of the Logistic Regression Model

term	estimate	std.error	statistic
(Intercept)	-8.6096196	0.3890123	-22.13200
age	0.0488833	0.0039918	12.24582
education_num	0.3146841	0.0218138	14.42589
hours_per_week	0.0519209	0.0046899	11.07072

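Table 6 shows that all predictors are statistically significant: every z-statistic exceeds 11 in absolute value, which corresponds to p-values far below 0.05. Furthermore, education_num has the largest coefficient (0.3146841), so a one-unit increase in education level is associated with the largest increase in the log-odds of earning >50K. Because the predictors are measured on different scales, this comparison is per unit of each variable rather than a ranking of overall importance.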
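To make the coefficients in Table 6 more concrete, the sketch below converts them into odds ratios and into a predicted probability for one individual. It is a Python illustration, not the report's R code; the function name `predict_probability` and the example individual are hypothetical, while the coefficient values are copied from Table 6.

```python
import math

# Coefficient estimates copied from Table 6.
COEFFICIENTS = {
    "intercept": -8.6096196,
    "age": 0.0488833,
    "education_num": 0.3146841,
    "hours_per_week": 0.0519209,
}

def predict_probability(age, education_num, hours_per_week):
    """Estimated probability of earning >50K under the fitted model."""
    log_odds = (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["age"] * age
        + COEFFICIENTS["education_num"] * education_num
        + COEFFICIENTS["hours_per_week"] * hours_per_week
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# Odds ratio for a one-unit increase in each predictor.
odds_ratios = {
    name: math.exp(beta)
    for name, beta in COEFFICIENTS.items()
    if name != "intercept"
}

# Hypothetical individual: age 40, education_num 13 (Bachelors), 40 hours.
p = predict_probability(age=40, education_num=13, hours_per_week=40)
print(round(p, 3), {k: round(v, 3) for k, v in odds_ratios.items()})
```

For this hypothetical individual the estimated probability is about 0.38, and each additional education level multiplies the odds of earning >50K by roughly 1.37, holding age and hours fixed.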

3.6 Visualizing the ROC Curve

To evaluate the model, we will use the ROC curve to visualize the trade-off between sensitivity and specificity across classification thresholds. The AUC (Area Under the Curve) will be calculated to quantify model performance, with values closer to 1 indicating strong discrimination and values near 0.5 suggesting random guessing.
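As an illustration of how an ROC curve and its AUC are obtained, the sketch below sweeps the classification threshold over a small set of predicted probabilities and integrates the resulting curve with the trapezoidal rule. It is written in Python with hypothetical toy labels and scores; the function names `roc_points` and `auc` are also hypothetical and do not come from the report's R code.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Sweep thresholds from the highest score down to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical labels (1 means >50K) and predicted probabilities.
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

curve = roc_points(toy_labels, toy_scores)
print(auc(curve))
```

On this toy data the AUC is 0.75: of the 16 possible (positive, negative) pairs, the positive case receives the higher score in 12.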

Figure 2: ROC Curve of the Logistic Regression Model

Figure 2 shows that the AUC (Area Under the Curve) obtained for the model (0.8005) is well above 0.5, indicating that the model performs much better than random guessing. This AUC suggests that the model has good discriminatory power, distinguishing between individuals earning <=50K and >50K based on the selected predictors.

3.7 Test the Model on the Testing Dataset

Next, we apply the model to the testing dataset, classify each individual, and visualize the results in a confusion matrix.

Figure 3: Confusion Matrix of Full Model on Testing Set

3.8 Classification Results and Model Metrics

Table 7: Classification Results and Model Metrics

Metric	Value
Sensitivity	0.3333333
Specificity	0.9291339
Precision	0.5714286
Accuracy	0.7975460
Cohen’s Kappa	0.3085229

From Table 7, we observe the following metrics:

Sensitivity (SN): 33.33% - The model correctly identifies 33.33% of higher-income individuals.
Specificity (SP): 92.91% - 92.91% of lower-income individuals are correctly classified.
Precision (PR): 57.14% - 57.14% of predicted >50K individuals actually earn >50K, indicating many false positives.
Accuracy (ACC): 79.75% - 79.75% of overall predictions are correct.
Cohen’s Kappa (κ): 0.3085 - Fair agreement beyond chance, with room for improvement.
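The metrics listed above all follow from the four cells of a confusion matrix. The Python sketch below shows the formulas. The counts are an assumption: they are inferred from Table 7 together with a test set of 3256 - 2604 = 652 rows, not read from Figure 3.

```python
# Counts inferred from Table 7, assuming a 652-row test set;
# they are NOT read from Figure 3.
TP, FN = 48, 96    # actual >50K: predicted >50K, predicted <=50K
FP, TN = 36, 472   # actual <=50K: predicted >50K, predicted <=50K

total = TP + FN + FP + TN

sensitivity = TP / (TP + FN)   # share of >50K individuals found
specificity = TN / (TN + FP)   # share of <=50K individuals found
precision = TP / (TP + FP)     # share of >50K predictions that are right
accuracy = (TP + TN) / total   # share of all predictions that are right

# Cohen's kappa compares observed agreement with chance agreement.
expected = ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Precision", precision),
    ("Accuracy", accuracy),
    ("Cohen's Kappa", kappa),
]:
    print(f"{name}: {value:.7f}")
```

With these inferred counts, the computed values agree with Table 7 to the reported precision.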

3.9 Interpretation

Strong specificity combined with low sensitivity and moderate precision suggests that the model needs improvement in identifying high-income individuals.
High accuracy reflects solid overall performance but is partly driven by class imbalance, since most individuals earn <=50K.
Low Cohen’s Kappa indicates that agreement beyond chance is limited and that the model needs refinement.

4 Discussion

4.1 Summary of Findings and Implications

The logistic regression model showed strong predictive power (AUC = 0.8005067), demonstrating that it distinguishes income levels substantially better than random guessing. These findings can inform policies aimed at reducing income inequality. Education and hours worked were key predictors, emphasizing the need for skill development and work-life balance. Understanding the factors behind income disparities can help individuals make more informed career decisions and pursue opportunities for skill enhancement.

4.2 Expectations and Results

The model’s AUC (0.8005067) is strong, reflecting the importance of predictors like age, education, and hours worked. Overall, the results are consistent with expectations from the research study:

Age: Correlates with experience, leading to higher salaries.
Education: Increases income, with those holding a degree earning significantly more.
Hours Worked: Reflects labor input, where more hours can translate to higher pay.

4.3 Future Research

Geographic Influence on Income: Including geographic variables may reveal regional disparities in income linked to education and job opportunities.
Intersectionality of Demographics: Exploring how race, gender, and marital status interact could improve the model’s accuracy in predicting income.
Health and Disability Status: Accounting for health conditions or disability could provide additional insight into income disparities, since these conditions can limit education or work opportunities.

5 References

Azzollini, L., R. Breen, and B. Nolan. 2023. “Demographic Behaviour and Earnings Inequality Across OECD Countries.” Journal of Economic Inequality 21: 441–61. https://doi.org/10.1007/s10888-022-09559-1.

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information and Computer Sciences. https://archive.ics.uci.edu/ml.

Jo, K. 2023. “Income Prediction Using Machine Learning Techniques.” University of California, Los Angeles.

Kohavi, R., and B. Becker. 1996. “UCI Machine Learning Repository: Adult Data Set.” UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/adult.

Smith-Edgell, A. 2024. “Proof Point: Financial Returns After a Post-Secondary Education Have Diminished.” RBC Thought Leadership. https://thoughtleadership.rbc.com/proof-point-financial-returns-after-a-post-secondary-education-have-diminished/#:~:text=Incomes%20are%20positively%20correlated%20with%20higher%20education&text=Respondents%20with%20a%20bachelor’s%20degree,median%20income%20in%20the%20sample.