Big Data and Advances in Economic Statistics
Paper Session
Friday, Jan. 3, 2025 2:30 PM - 4:30 PM (PST)
- Chair: Karen Dynan, Harvard University
From Online Job Postings to Economic Insights: A Machine Learning Approach to Structuring Naturally-Occurring Data
Abstract
This paper develops a novel matched vacancy-company dataset by combining two large Canadian datasets derived from naturally occurring data: daily job postings provided by the largest online job board in the country and visits to Points of Interest (mostly companies) computed from smartphone location data. Since the company names are declared inconsistently across job postings and datasets, we enhance state-of-the-art natural language processing algorithms to match the data. We first demonstrate that the resulting dataset contains granular high-frequency information not available through official statistics. We then use the dataset to study technological change around the COVID-19 pandemic. Our results suggest that the expansion of tech firms was a major driver of changes in digital jobs. Moreover, the uptick in digital employment creation resulted mainly from new digital production rather than digital adoption.
Nowcasting Distributional National Accounts for the United States: A Machine Learning Approach
Abstract
Inequality statistics are usually calculated from high-quality, comprehensive survey or administrative microdata. Naturally, this data is typically available with a lag of at least 9 months from the reference period. In turbulent times, there is interest in knowing the distributional impacts of observable aggregate business cycle and policy changes sooner. In this paper, we use an elastic net, a generalized model that incorporates Lasso and Ridge regressions as special cases, to first predict the overall Gini coefficient and then nowcast the decile-level income shares. Our model, trained on the period 2000 to 2019, uses national accounts data (NIPA), published by the Bureau of Economic Analysis, as features instead of the underlying microdata. We show that this approach closely fits the distribution of decile-level income shares for the in-sample period and performs well for a pseudo-out-of-sample period of 2020-2022, which represents one of the most turbulent periods in recent economic history. We find that we can estimate inequality approximately one month after the end of the calendar year, reducing the present lag by almost a year.
Slowly Scaling Per-Record Differential Privacy
Abstract
We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy (Seeman et al., 2023) guarantee degrades slowly in the protected records’ influence on the statistics being released. For context, formal privacy mechanisms generally add randomness to published statistics. If the statistics’ distribution changes little with the addition, deletion, or alteration of a single record in the underlying dataset, an attacker looking at these statistics will find it plausible that any particular record was present or absent or took any particular value, preserving the records’ privacy. More influential records - those whose absence, presence, or alteration would change the statistics’ distribution more - typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records’ influence varies widely, as is common in economic data. We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility.
Discussant(s)
- Nela Richardson, ADP
- Pawel Adrjan, Indeed
- David Johnson, Committee on National Statistics, National Academies of Sciences, Engineering, and Medicine
- Lars Vilhuber, Cornell University
JEL Classifications
- C8 - Data Collection and Data Estimation Methodology; Computer Programs
- E2 - Consumption, Saving, Production, Investment, Labor Markets, and Informal Economy