Big Data and Advances in Economic Statistics

Paper Session

Friday, Jan. 3, 2025 2:30 PM - 4:30 PM (PST)

Hilton San Francisco Union Square, Continental Ballroom 1&2

Hosted By: American Economic Association & Committee on Economic Statistics

Chair: Karen Dynan, Harvard University

Expanding the Frontier of Economic Statistics Using Big Data: A Case Study of Regional Employment

Brian Quistorff

U.S. Bureau of Economic Analysis

Abe Dunn

U.S. Bureau of Economic Analysis

Eric English

U.S. Census Bureau

Kyle Hood

U.S. Bureau of Economic Analysis

Lowell Mason

U.S. Bureau of Labor Statistics

Abstract

Big data offer potentially enormous benefits for improving economic measurement, but they also present challenges (e.g., lack of representativeness and instability), so their value is not always clear. We propose a framework for quantifying the usefulness of these data sources for specific applications, relative to existing official sources. We specifically weigh the potential benefits of additional granularity and timeliness, while examining the accuracy of any new or improved estimates relative to the accuracy of existing official statistics. We apply the methodology to employment estimates using data from a payroll processor (Paychex), considering both the improvement of existing state-level estimates and the production of new, more timely, county-level estimates. We find that incorporating payroll data can improve existing state-level estimates by 9% based on out-of-sample mean absolute error, although the improvement is considerably higher for smaller state-industry cells. We also produce new county-level estimates that could provide more timely granular estimates than previously available, while also falling within an acceptable accuracy standard. We demonstrate the practical importance of these experimental estimates by investigating a hypothetical application during the COVID-19 pandemic, a period in which more timely and granular information could have assisted in implementing effective policies. Relative to existing estimates, we find that the Paychex data series could help identify areas of the country where employment was lagging. Moreover, we demonstrate the value of a more timely series, even when its accuracy is lower than that of official estimates. More broadly, this paper demonstrates how to systematically use big data to expand the frontier of economic measurement.
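The accuracy comparison described in the abstract above rests on out-of-sample mean absolute error (MAE). The following is a minimal sketch of how a relative MAE improvement could be computed. It is not the authors' code: the function names and all numbers are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between benchmark values and estimates."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_mae, candidate_mae):
    """Share by which the candidate estimate lowers MAE versus the baseline."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return (baseline_mae - candidate_mae) / baseline_mae


# Hypothetical held-out periods (made-up employment levels, in thousands).
benchmark = [100, 102, 98, 105]        # later, revised values treated as truth
official_early = [104, 99, 101, 101]   # existing early estimate
with_payroll = [103, 100, 100, 102]    # early estimate augmented with payroll data

baseline_mae = mean_absolute_error(benchmark, official_early)
candidate_mae = mean_absolute_error(benchmark, with_payroll)
improvement = relative_improvement(baseline_mae, candidate_mae)

print(f"baseline MAE:  {baseline_mae:.2f}")
print(f"candidate MAE: {candidate_mae:.2f}")
print(f"improvement:   {improvement:.1%}")
```

Because the periods used for scoring are held out from estimation, the improvement reflects forecasting accuracy rather than in-sample fit.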

From Online Job Postings to Economic Insights: A Machine Learning Approach to Structuring Naturally-Occurring Data

Tatjana Dahlhaus

Bank of Canada

Reinhard Ellwanger

Bank of Canada

Gabriela Galassi

Bank of Canada

Pierre-Yves Yanni

Bank of Canada

Abstract

This paper develops a novel matched vacancy-company dataset by combining two large Canadian datasets derived from naturally occurring data: daily job postings provided by the largest online job board in the country and visits to Points of Interest (mostly companies) computed from smartphone location data. Because company names are reported inconsistently across job postings and datasets, we enhance state-of-the-art natural language processing algorithms to match the data. We first demonstrate that the resulting dataset contains granular high-frequency information not available through official statistics. We then use the dataset to study technological change around the COVID-19 pandemic. Our results suggest that the expansion of tech firms was a major driver of changes in digital jobs. Moreover, the uptick in digital employment creation resulted mainly from new digital production rather than digital adoption.
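To illustrate the matching problem described above, here is a minimal sketch of linking inconsistently written company names across two sources. It uses plain string similarity from the Python standard library as a stand-in; it is not the natural language processing approach developed in the paper. The suffix list, the threshold, and the company names are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "ltd", "corp", "co", "limited",
                  "incorporated", "corporation"}


def normalize(name):
    """Lowercase, drop punctuation, and remove common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def best_match(name, candidates, threshold=0.85):
    """Return the most similar candidate, or None if none clears the threshold."""
    target = normalize(name)
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


# Made-up names: one from a job posting, two from a points-of-interest list.
points_of_interest = ["MAPLE LEAF BAKERY", "Northern Hardware Ltd"]
print(best_match("Maple Leaf Bakery Inc.", points_of_interest))
print(best_match("Unrelated Firm", points_of_interest))
```

A threshold trades off false matches against missed matches; in practice it would need to be tuned against a hand-labeled sample.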

Nowcasting Distributional National Accounts for the United States: A Machine Learning Approach

Marina Gindelsky

George Washington University H.O. Stekler Program on Forecasting

Gary Cornwall

George Washington University

Abstract

Inequality statistics are usually calculated from high-quality, comprehensive survey or administrative microdata. These data are typically available with a lag of at least 9 months from the reference period. In turbulent times, there is interest in knowing the distributional impacts of observable aggregate business cycle and policy changes sooner. In this paper, we use an elastic net, a generalized model that incorporates Lasso and Ridge regressions as special cases, to first predict the overall Gini coefficient and then nowcast the decile-level income shares. Our model, trained on the period 2000 to 2019, uses national accounts data (NIPA), published by the Bureau of Economic Analysis, as features instead of the underlying microdata. We show that this approach closely fits the distribution of decile-level income shares for the in-sample period and performs well for a pseudo-out-of-sample period of 2020-2022, which represents one of the most turbulent periods in recent economic history. We find that we can estimate inequality approximately one month after the end of the calendar year, reducing the present lag by almost a year.
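For reference, one common way to write the elastic net objective is shown below. The abstract does not state which parametrization the authors use, so the notation here is an assumption: y_i is the target (for example, the Gini coefficient or a decile income share), x_i is the vector of national accounts features, \lambda \ge 0 sets the overall penalty strength, and \alpha \in [0, 1] sets the mix between the two penalties.

```latex
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\, \beta} \;
    \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2
    + \lambda \left[ \alpha \lVert \beta \rVert_1
    + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right]
```

Setting \alpha = 1 leaves only the \ell_1 penalty (Lasso), and setting \alpha = 0 leaves only the squared \ell_2 penalty (Ridge), which is the sense in which both are special cases.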

Slowly Scaling Per-Record Differential Privacy

Brian Finley

U.S. Census Bureau

Justin Doty

U.S. Census Bureau

Ashwin Machanavajjhala

Tumult Labs

Mikaela Meyer

MITRE

David Pujol

Tumult Labs

Anthony Caruso

U.S. Census Bureau

William Sexton

Tumult Labs

Zachary Terner

MITRE

Abstract

We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy (Seeman et al., 2023) guarantee degrades slowly in the protected records’ influence on the statistics being released. For context, formal privacy mechanisms generally add randomness to published statistics. If the statistics’ distribution changes little with the addition, deletion, or alteration of a single record in the underlying dataset, an attacker looking at these statistics will find it plausible that any particular record was present or absent or took any particular value, preserving the records’ privacy. More influential records (those whose absence, presence, or alteration would change the statistics’ distribution more) typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records’ influence varies widely, as is common in economic data. We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility.
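The linear degradation mentioned above can be seen in a small sketch of the standard Laplace mechanism applied to an unbounded sum. This illustrates the baseline behavior only, not the slowly scaling mechanisms proposed in the paper. It assumes neighboring datasets differ by the addition or removal of one record; the function names, the noise scale, and the payroll figures are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_sum(values, scale, rng):
    """Release the sum of the values plus Laplace noise (unbiased)."""
    return sum(values) + laplace_noise(scale, rng)


def per_record_epsilon(value, scale):
    """Privacy loss for one record under add/remove of that record.

    Removing a record shifts the sum by |value|, and the ratio of Laplace
    densities is then bounded by exp(|value| / scale), so the loss grows
    linearly with the record's influence.
    """
    return abs(value) / scale


# Made-up establishment payrolls with one very large outlier.
payrolls = [40_000, 250_000, 90_000_000]
scale = 100_000  # hypothetical noise scale

rng = random.Random(0)  # seeded only so the sketch is reproducible
release = noisy_sum(payrolls, scale, rng)
losses = [per_record_epsilon(p, scale) for p in payrolls]

print(f"noisy total: {release:,.0f}")
for payroll, eps in zip(payrolls, losses):
    print(f"payroll {payroll:>12,}: per-record epsilon = {eps:g}")
```

With a fixed noise scale, the largest establishment's privacy loss is hundreds of times that of the smallest, which is the problem that motivates guarantees that grow more slowly with influence.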

Discussant(s)

Nela Richardson

ADP

Pawel Adrjan

Indeed

David Johnson

Committee on National Statistics, National Academies of Sciences, Engineering, and Medicine

Lars Vilhuber

Cornell University

JEL Classifications

C8 - Data Collection and Data Estimation Methodology; Computer Programs
E2 - Consumption, Saving, Production, Investment, Labor Markets, and Informal Economy
