Big Data and Advances in Economic Statistics

Paper Session

Friday, Jan. 3, 2025 2:30 PM - 4:30 PM (PST)

Hilton San Francisco Union Square, Continental Ballroom 1&2
Hosted By: American Economic Association & Committee on Economic Statistics
  • Chair: Karen Dynan, Harvard University

Expanding the Frontier of Economic Statistics Using Big Data: A Case Study of Regional Employment

Brian Quistorff, U.S. Bureau of Economic Analysis
Abe Dunn, U.S. Bureau of Economic Analysis
Eric English, U.S. Census Bureau
Kyle Hood, U.S. Bureau of Economic Analysis
Lowell Mason, U.S. Bureau of Labor Statistics

Abstract

Big data offers potentially enormous benefits for improving economic measurement, but it also presents challenges (e.g., lack of representativeness and instability), so its value is not always clear. We propose a framework for quantifying the usefulness of these data sources for specific applications, relative to existing official sources. Specifically, we weigh the potential benefits of additional granularity and timeliness against the accuracy of any new or improved estimates, relative to the accuracy of existing official statistics. We apply the methodology to employment estimates using data from a payroll processor, considering both the improvement of existing state-level estimates and the production of new, more timely, county-level estimates. We find that incorporating payroll data can improve existing state-level estimates by 9% based on out-of-sample mean absolute error, with considerably larger improvements for smaller state-industry cells. We also produce new county-level estimates that are more timely and granular than previously available, while falling within an acceptable accuracy standard. We demonstrate the practical importance of these experimental estimates through a hypothetical application during the COVID-19 pandemic, a period in which more timely and granular information could have assisted in implementing effective policies. Relative to existing estimates, we find that the Paychex data series could help identify areas of the country where employment was lagging. We also demonstrate the value of a more timely series even when its accuracy is lower than that of official estimates. More broadly, this paper demonstrates how to systematically use big data to expand the frontier of economic measurement.
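The accuracy comparison in this abstract boils down to out-of-sample mean absolute error against later, revised official figures. A minimal sketch with entirely hypothetical employment numbers (the 9% figure comes from the paper's actual data, not from this toy):

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error against later, revised benchmark estimates."""
    return np.mean(np.abs(np.asarray(actual) - np.asarray(predicted)))

# Hypothetical employment levels (thousands) for four state-industry cells.
benchmark      = [100.0, 250.0, 80.0, 40.0]  # revised official benchmark
official_early = [104.0, 243.0, 84.0, 43.0]  # early official estimate
blended        = [102.5, 245.5, 82.0, 41.5]  # official blended with payroll data

improvement = 1 - mae(benchmark, blended) / mae(benchmark, official_early)
print(f"out-of-sample MAE improvement: {improvement:.0%}")
```

The same ratio, computed over many vintages and cells, is how an improvement like the paper's 9% would be reported.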

From Online Job Postings to Economic Insights: A Machine Learning Approach to Structuring Naturally-Occurring Data

Tatjana Dahlhaus, Bank of Canada
Reinhard Ellwanger, Bank of Canada
Gabriela Galassi, Bank of Canada
Pierre-Yves Yanni, Bank of Canada

Abstract

This paper develops a novel matched vacancy-company dataset by combining two large Canadian datasets derived from naturally occurring data: daily job postings provided by the largest online job board in the country and visits to Points of Interest (mostly companies) computed from smartphone location data. Because company names are reported inconsistently across job postings and across the two datasets, we enhance state-of-the-art natural language processing algorithms to match the records. We first demonstrate that the resulting dataset contains granular, high-frequency information not available through official statistics. We then use the dataset to study technological change around the COVID-19 pandemic. Our results suggest that the expansion of tech firms was a major driver of changes in digital jobs. Moreover, the uptick in digital employment creation resulted mainly from new digital production rather than digital adoption.
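Record linkage on inconsistently declared company names typically starts by normalizing names and scoring string similarity. A minimal sketch using Python's standard-library difflib (the authors' actual matching uses enhanced NLP algorithms, not this simple ratio; the names and the 0.85 threshold below are illustrative):

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    suffixes = {"inc", "ltd", "llc", "corp", "co", "limited", "incorporated"}
    tokens = [t for t in name.split() if t not in suffixes]
    return " ".join(tokens)

def best_match(posting_name, poi_names, threshold=0.85):
    """Return the POI name most similar to a posting's company name, if any."""
    scored = [(SequenceMatcher(None, normalize(posting_name),
                               normalize(n)).ratio(), n) for n in poi_names]
    score, name = max(scored)
    return name if score >= threshold else None

pois = ["Acme Manufacturing Inc.", "Maple Leaf Foods", "Northern Rail Co."]
print(best_match("ACME Manufacturing", pois))
```

In practice a threshold like this trades false matches against missed matches and would be tuned on labeled pairs.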

Nowcasting Distributional National Accounts for the United States: A Machine Learning Approach

Marina Gindelsky, George Washington University H.O. Stekler Program on Forecasting
Gary Cornwall, George Washington University

Abstract

Inequality statistics are usually calculated from high-quality, comprehensive survey or administrative microdata. Consequently, these data are typically available only with a lag of at least nine months from the reference period. In turbulent times, there is interest in knowing the distributional impacts of observable aggregate business-cycle and policy changes sooner. In this paper, we use an elastic net, a generalized model that incorporates lasso and ridge regressions as special cases, first to predict the overall Gini coefficient and then to nowcast the decile-level income shares. Our model, trained on the period 2000 to 2019, uses national accounts data (NIPA), published by the Bureau of Economic Analysis, as features instead of the underlying microdata. We show that this approach closely fits the distribution of decile-level income shares for the in-sample period and performs well for a pseudo-out-of-sample period of 2020-2022, one of the most turbulent periods in recent economic history. We find that we can estimate inequality approximately one month after the end of the calendar year, reducing the present lag by almost a year.
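The elastic net the authors describe penalizes a least-squares fit with a mix of L1 (lasso) and L2 (ridge) terms. A self-contained coordinate-descent sketch on synthetic data, with columns standing in for NIPA features and the target for a Gini series (not the paper's specification or tuning):

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net(X, y, alpha=0.02, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + alpha*(l1_ratio*||b||_1
    + (1 - l1_ratio)/2 * ||b||_2^2); lasso (l1_ratio=1) and ridge
    (l1_ratio=0) are the special cases the abstract mentions."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            partial = y - X @ beta + X[:, j] * beta[j]  # residual w/o feature j
            z = X[:, j] @ partial / n
            denom = X[:, j] @ X[:, j] / n + alpha * (1 - l1_ratio)
            beta[j] = soft_threshold(z, alpha * l1_ratio) / denom
    return beta

# Toy nowcast on synthetic data: sparse truth plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 6))
true_beta = np.array([0.5, -0.3, 0.0, 0.0, 0.2, 0.0])
y = X @ true_beta + rng.normal(scale=0.05, size=80)
beta_hat = elastic_net(X, y)
print(np.round(beta_hat, 2))
```

The L1 term zeroes out uninformative features while the L2 term stabilizes correlated ones, which is why the elastic net suits a setting with many candidate national-accounts predictors.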

Slowly Scaling Per-Record Differential Privacy

Brian Finley, U.S. Census Bureau
Justin Doty, U.S. Census Bureau
Ashwin Machanavajjhala, Tumult Labs
Mikaela Meyer, MITRE
David Pujol, Tumult Labs
Anthony Caruso, U.S. Census Bureau
William Sexton, Tumult Labs
Zachary Terner, MITRE

Abstract

We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that the per-record differential privacy guarantee (Seeman et al., 2023) degrades only slowly with a protected record's influence on the released statistics. For context, formal privacy mechanisms generally add randomness to published statistics. If the statistics' distribution changes little with the addition, deletion, or alteration of a single record in the underlying dataset, an attacker looking at these statistics will find it plausible that any particular record was present or absent or took any particular value, preserving the records' privacy. More influential records (those whose absence, presence, or alteration would change the statistics' distribution more) typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let the guarantees degrade rapidly (linearly or quadratically) with influence. While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records' influence varies widely, as is common in economic data. We develop mechanisms whose privacy guarantees instead degrade as slowly as logarithmically with influence. These allow for the accurate, unbiased release of statistics while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility.
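The contrast the abstract draws can be pictured as two privacy-loss profiles: a baseline whose per-record loss grows linearly with a record's influence, versus the paper's goal of roughly logarithmic growth. The curves below are purely illustrative shapes, not the paper's actual mechanisms, and the per-unit budget is a made-up parameter:

```python
import numpy as np

def linear_loss(influence, eps_unit=0.1):
    """Baseline: per-record privacy loss grows linearly with influence,
    as under existing per-record DP mechanisms (illustrative shape)."""
    return eps_unit * np.asarray(influence, dtype=float)

def log_loss(influence, eps_unit=0.1):
    """Target behavior: loss growing only logarithmically with influence
    (illustrative shape, not the paper's exact mechanism)."""
    return eps_unit * np.log1p(np.asarray(influence, dtype=float))

influences = np.array([1, 10, 100, 10_000])  # e.g., establishment payroll sizes
print("linear:", linear_loss(influences))
print("log   :", np.round(log_loss(influences), 2))
```

At an influence of 10,000 the linear profile implies an effectively meaningless guarantee, while the logarithmic profile keeps the loss bounded near one unit of budget, which is the point of the paper's slowly scaling mechanisms.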

Discussant(s)
Nela Richardson, ADP
Pawel Adrjan, Indeed
David Johnson, Committee on National Statistics, National Academies of Sciences, Engineering, and Medicine
Lars Vilhuber, Cornell University
JEL Classifications
  • C8 - Data Collection and Data Estimation Methodology; Computer Programs
  • E2 - Consumption, Saving, Production, Investment, Labor Markets, and Informal Economy