
Advances in Imputing Race and Ethnicity to Administrative Data

Paper Session

Friday, Jan. 5, 2024 2:30 PM - 4:30 PM (CST)

Grand Hyatt, Lone Star Ballroom Salon A
Hosted By: American Economic Association & Committee on Economic Statistics
  • Chair: Mark Mazur, U.S. Treasury Department

BISG Validation using Tax-Linked Data

Elena Derby, Joint Committee on Taxation
Connor Dowd, Joint Committee on Taxation
Jacob Mortenson, Joint Committee on Taxation

Abstract

Bayesian Improved Surname Geocoding (BISG) is an approach to imputing race and ethnicity onto individual-level records that lack racial and ethnic information. BISG combines the likelihood that an individual resides in a certain geographical area, given their race and ethnicity, with the probability of belonging to a certain race and ethnicity, given their surname, to estimate the set of probabilities associated with belonging to different racial and ethnic groups. However, validity testing of BISG in tax contexts has been limited, due to a lack of data containing an individual’s name, race and ethnicity, location, and tax information. This paper estimates the statistical bias and uncertainty associated with estimating racial and ethnic disparities in tax outcomes using BISG-imputed race and ethnicity probabilities. We use a new set of matched administrative data for this exercise, consisting of residents of buildings that qualified for the low-income housing tax credit (LIHTC). These data contain the name, location, race, and ethnicity of the residents, providing a novel setting in which to assess the validity of BISG estimates. We then match these data with tax records and estimate tax disparities across racial and ethnic groups twice: once using observed race and ethnicity, and once using the set of probabilities estimated using BISG. This comparison allows us to estimate the accuracy and uncertainty associated with using BISG to estimate tax disparities for this group. Given that LIHTC residents tend to have lower incomes, we focus on the earned income tax credit, child tax credit, and average tax rates. This validation exercise will be informative about potential sources of statistical bias associated with using BISG to estimate racial disparities.
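
The Bayesian update described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the group labels, and all numbers are hypothetical, and the probability-weighted mean is only one common way to turn imputed probabilities into group-level estimates, assumed here for illustration.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine P(race | surname) with P(geography | race) via Bayes' rule.

    Both arguments are dicts keyed by group label (hypothetical inputs).
    Returns P(race | surname, geography), normalized to sum to 1.
    """
    unnormalized = {
        group: p_race_given_surname[group] * p_geo_given_race[group]
        for group in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all groups have zero probability")
    return {group: value / total for group, value in unnormalized.items()}


def weighted_group_mean(outcomes, posteriors, group):
    """Probability-weighted mean of an outcome for one group.

    Assumption: each record contributes in proportion to its imputed
    probability of belonging to `group`.
    """
    weights = [p[group] for p in posteriors]
    weight_sum = sum(weights)
    if weight_sum == 0:
        raise ValueError("no probability mass for this group")
    return sum(w * y for w, y in zip(weights, outcomes)) / weight_sum


# Hypothetical inputs for a single record with two illustrative groups.
surname_probs = {"group_a": 0.6, "group_b": 0.4}   # P(race | surname)
geo_probs = {"group_a": 0.001, "group_b": 0.003}   # P(geography | race)
posterior = bisg_posterior(surname_probs, geo_probs)
```

In this made-up example the surname alone favors group_a, but the geography term is three times as likely under group_b, so the posterior shifts toward group_b. Comparing `weighted_group_mean` against a mean computed from observed group membership is the kind of check the validation exercise describes.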

Measuring Marriage Penalties and Bonuses by Race and Ethnicity: An Application of Race and Ethnicity Re-Weighting to Tax Data

Emily Lin, U.S. Treasury Department
Rachel Costello, U.S. Treasury Department
Portia DeFilippes, U.S. Treasury Department
Robin Fisher, U.S. Treasury Department
Ben Klemens, U.S. Treasury Department

Abstract

Tax law can have different impacts on individuals in different racial and ethnic groups because individuals’ tax return characteristics vary across groups on average. As tax forms do not collect information about individual race or ethnicity, it has been challenging to use administrative tax data to analyze tax differentials by race and ethnicity. To facilitate better understanding, the U.S. Treasury Department's Office of Tax Analysis imputes race/ethnicity information for a stratified random sample of taxpayers on its microsimulation model. This paper uses this information to simulate the marriage-penalty and -bonus outcomes under the federal individual income tax system by race and ethnicity. Legal scholars such as Moran and Whitford (1996) and Brown (1997, 2021) have discussed probable disparate marriage penalty outcomes by race due to group differences in the spousal division of income. Comparing Black and White married couples shows sparse evidence of group disparities on average. However, for income levels above $75,000, Black couples have a higher penalty rate and a lower bonus rate relative to White couples with the same income, other things being equal. Hispanic couples on average have a higher penalty rate, a lower bonus rate, and a smaller bonus amount relative to White couples.

Tax Expenditures by (Improved Imputed) Race and Ethnicity

Julie Anne Cronin, U.S. Treasury Department
Portia DeFilippes, U.S. Treasury Department
Robin Fisher, U.S. Treasury Department

Abstract

U.S. tax forms do not collect information about race or ethnicity. While no tax rule is established based on the taxpayer’s race or ethnicity, not taking race and ethnicity into consideration in the policymaking process can result in the unintentional consequence of widening racial and ethnic disparities in after-tax income. Fisher (2023) imputed race to the Office of Tax Analysis’ Individual Tax Model by applying Bayesian inference to a set of explanatory variables available in tax data, including total income, filing status, age, number of dependents, taxable interest, presence of farm income, first name, last name, and the ZIP Code Tabulation Area (ZCTA). Cronin, DeFilippes and Fisher (2023) used this imputation to analyze the distribution of tax expenditures by race and ethnicity (RH). This paper extends Cronin et al. (2023) by using better data sources for the imputation of RH, including Asian families as one of the RH categories, and by considering the effects of filing status by RH on previous results. We find that by updating the geocode and tax data to more recent years and using the richer Census data for first names, we are able to better estimate RH, especially for Asian families. We also find that filing status varies significantly across RH groups and reduces the measured differences in tax expenditure benefits by RH for certain tax expenditures. The measured difference in the benefits of the tax expenditure for preferential rates on capital gains and dividends is unchanged from the earlier paper.

Using Multiple Data Sources to Learn about the Race and Ethnicity of Taxpayers

James Pearce, Congressional Budget Office
Shannon Mok, Congressional Budget Office
Rebecca Heller, Congressional Budget Office
Jonathan Rothbaum, U.S. Census Bureau

Abstract

A difficulty in using administrative tax data to study income distribution and other aspects of economic well-being is that tax data lacks information on race and ethnicity. The Congressional Budget Office (CBO) maintains an individual tax model that statistically merges administrative tax data with the Current Population Survey (CPS) to create a household distribution model that is used for CBO reports. The CBO tax model primarily uses information about income from tax returns, with supplemental information on non-filers, nontaxable sources of income, and household structure from the CPS. While the merged data contains race and ethnicity data from the CPS, CBO has not used it for analysis by race because it is not clear that the statistical match preserves the relationships between income, tax liability, and race and ethnicity. CBO is working with the Census Bureau to assess the validity of estimates of household income and other factors that affect tax liability by race and ethnicity in CBO’s statistically matched data, by comparing it to Census Bureau data that match CPS records to administrative tax data at the individual level. Understanding the quality of CBO's statistically matched CPS and tax data could expand the ability to use alternative data sources and decrease the need for access to highly sensitive individually linked data. This paper will present preliminary comparisons of CBO's statistically matched data and Census’s linked data and discuss implications of the differences for future CBO analysis of income and taxes by race.

Discussant(s)
Robert McClelland, Tax Policy Center, Urban Institute and Brookings Institution
Rhonda Vonshay Sharpe, Women's Institute for Science, Equity and Race
Charles Hokayem, U.S. Census Bureau
Sheridan Fuller, Federal Reserve Board
JEL Classifications
  • H2 - Taxation, Subsidies, and Revenue
  • C5 - Econometric Modeling