Editor's Note
On May 28, 2025, Xiaohong Chen, Malcolm K. Brachman Professor of Economics at Yale University, Member of the American Academy of Arts and Sciences and the first China-based Fellow of the Econometric Society, presented her research at Peking University Guanghua School of Management's "Distinguished Scholars Forum." Her talk explored how nonparametric over-identification and efficient learning advance causal inference in the Generative AI (GAI) era, offering new frameworks for complex economic analysis.

Professor Xiaohong Chen
Establishing economic causal inference models and applying them to real data is central to empirical research. With observational data alone and no credible economic model, one typically uncovers only correlations rather than causal relationships—for instance, whether growth in imports from China caused higher U.S. unemployment cannot be resolved by mere data associations.
Statistically, a model is a collection of probability distributions for random variables indexed by unknown parameters. According to parameter dimensionality, models fall into parametric (finite-dimensional), nonparametric (infinite-dimensional), semiparametric (finite-dimensional target parameter with infinite-dimensional nuisance), and semi-nonparametric (containing both finite- and infinite-dimensional target parameters). Semi-nonparametric models balance flexibility and informativeness: they avoid functional-form misspecification while yielding sharper estimation and inference than fully nonparametric approaches, and they naturally support over-identification tests.
In nonparametric settings—for example the nonparametric instrumental-variables (NPIV) model, which links endogenous variables to instruments via unknown functions—Chen and Santos (2018) introduce the notion of local over-identification. They show that over-identification is vital for obtaining efficient estimators and for constructing nontrivial specification tests. Classical “counting” rules of GMM cannot determine identification status here because both parameters and restrictions are infinite-dimensional, necessitating this local definition.
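To make the NPIV setup concrete, here is a minimal simulation sketch of a sieve (series) two-stage least-squares estimator: the unknown function is approximated by a polynomial basis in the endogenous regressor and instrumented by a richer polynomial basis in the instrument. The data-generating process, basis choices, and dimensions are all illustrative assumptions, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative NPIV design: Y = h0(X) + U with E[U | Z] = 0,
# X endogenous through the shared noise V.
n = 5000
Z = rng.uniform(-1, 1, n)
V = rng.normal(0, 1, n)
X = 0.8 * Z + 0.4 * V                      # endogenous regressor
U = 0.5 * V + rng.normal(0, 0.1, n)        # correlated with X, mean-zero given Z
h0 = lambda x: 1 + x - 0.5 * x**2          # true structural function (assumed)
Y = h0(X) + U

# Sieve 2SLS: approximate h0 with a polynomial basis P(X),
# instrumented by a higher-order polynomial basis Q(Z).
def poly_basis(v, k):
    return np.column_stack([v**j for j in range(k + 1)])

P = poly_basis(X, 3)                       # basis for the unknown function
Q = poly_basis(Z, 5)                       # instrument basis, dim(Q) >= dim(P)

# beta_hat = (P'Q (Q'Q)^{-1} Q'P)^{-1} P'Q (Q'Q)^{-1} Q'Y
Pi = P.T @ Q @ np.linalg.pinv(Q.T @ Q) @ Q.T
beta = np.linalg.solve(Pi @ P, Pi @ Y)

h_hat = lambda x: poly_basis(np.asarray(x), 3) @ beta
```

Having more instrument basis functions than approximating functions (6 versus 4 here) is the finite-dimensional analogue of the over-identifying restrictions discussed above; in the infinite-dimensional limit, the "counting" comparison breaks down and the local definition takes over.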
Chen and Santos further demonstrate that local identification governs model efficiency. In locally just-identified models, all regular, asymptotically linear estimators are first-order equivalent, implying no efficiency gains and only trivial local-power tests. By contrast, locally over-identified models admit genuine efficiency improvements and powerful tests, highlighting the deep link between model design, efficiency, and testing capability.
Consider canonical causal-inference frameworks. The unconfoundedness model is globally just-identified, so no efficiency gains exist. Hahn (1998) and Hahn & Ridder (2013) propose nonparametric imputation and propensity-score matching estimators that reach the semiparametric efficiency bound. When the propensity score is known (as in randomized experiments), the model becomes locally over-identified, yet the efficiency bound remains unchanged—showing that identification status need not affect the bound in special cases.
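The efficient influence function under unconfoundedness leads to the familiar augmented inverse-propensity-weighting (AIPW) form of the ATE estimator. Below is a minimal sketch in that spirit; the simulated design, the linear outcome fits, and the known propensity score (as in a randomized experiment) are simplifying assumptions for brevity, whereas the cited estimators use nonparametric imputation and propensity-score fits.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data satisfying unconfoundedness given X (illustrative).
n = 20000
X = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-0.5 * X))                 # true propensity score
D = rng.binomial(1, p)
Y = 1.0 * D + 2.0 * X + rng.normal(0, 1, n)    # true ATE = 1.0

# Simple plug-in fits of the outcome regressions E[Y | D=d, X].
# (In practice these would be flexible nonparametric/ML estimators.)
def fit_mean(Xs, Ys):
    A = np.column_stack([np.ones_like(Xs), Xs])
    b, *_ = np.linalg.lstsq(A, Ys, rcond=None)
    return lambda x: b[0] + b[1] * x

m1 = fit_mean(X[D == 1], Y[D == 1])
m0 = fit_mean(X[D == 0], Y[D == 0])
p_hat = p                                      # propensity known, as in an RCT

# AIPW / efficient-influence-function estimate of the ATE.
psi = (m1(X) - m0(X)
       + D * (Y - m1(X)) / p_hat
       - (1 - D) * (Y - m0(X)) / (1 - p_hat))
ate_hat = psi.mean()
```

When the propensity score is known, plugging in the true score (as above) versus an estimated one changes the local identification status but not the efficiency bound, which is the point of the special case discussed in the text.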
A two-period difference-in-differences (DiD) design is globally just-identified, but multi-period or staggered-adoption settings can be locally over-identified, yielding notable efficiency gains and nontrivial tests. Modern DiD applications—with multiple treatment times and staggered rollout—often embody such local over-identification, allowing more efficient estimation and more stringent specification checks. Exploiting the full panel history can markedly improve precision and reliability.
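For reference, the globally just-identified two-period case reduces to a difference of average changes. The simulated panel below is an illustrative assumption (unit fixed effects, a common trend, and a constant treatment effect), not data from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-period panel: treated units receive effect tau in period 2 (illustrative).
n = 4000
treated = rng.binomial(1, 0.5, n)
alpha = rng.normal(0, 1, n)                     # unit fixed effects
y1 = alpha + rng.normal(0, 1, n)                # pre-treatment outcome
trend = 0.5                                     # common time trend
tau = 2.0                                       # treatment effect on the treated
y2 = alpha + trend + tau * treated + rng.normal(0, 1, n)

# DiD: average change for treated units minus average change for comparisons.
delta = y2 - y1
att_hat = delta[treated == 1].mean() - delta[treated == 0].mean()
```

With only one pre-period and one comparison group there is a single contrast and nothing to reweight, which is exactly why no efficiency gain is available; the multi-period sketches further below show where the extra information comes from.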
For instrumental-variables (IV) models, local identification turns on whether imposed restrictions bind on a set of positive measure. Strict constraints create local over-identification and permit more efficient estimation; weak constraints leave the model locally just-identified, offering no efficiency margin. Hence assessing the tightness of restrictions is crucial in IV practice.
Historically, local over-identification in causal inference arose from parametric assumptions on nuisance components such as propensity scores. These parametric forms often lacked solid theoretical or empirical support and are no longer prevalent in the machine-learning era, where flexible nonparametric estimators dominate. Modern causal-inference work, built on the potential-outcome framework plus machine learning, typically produces locally just-identified models, so regular estimators are first-order asymptotically equivalent, leaving little room for efficiency gains or nontrivial specification tests. By contrast, classical parametric structural models allow explicit over-identification, enabling both sharper inference and meaningful tests, which suggests renewed attention to over-identification within contemporary analytic frameworks.
Chen, Sant’Anna, and Xie (2025) propose an efficient estimator for DiD and event-study (ES) designs. In 2024, over 30% of NBER applied-micro working papers used DiD or ES, surpassing any other method; the two designs are likewise popular in finance and macro. Despite advances improving robustness to heterogeneous treatment effects, key econometric issues remain:
(1) researchers often treat all pre-treatment periods as interchangeable (or ignore them entirely) without theoretical or empirical justification, sacrificing precision;
(2) while modern DiD accommodates rich effect heterogeneity, its finite-sample power lacks systematic evaluation, and the trade-offs between flexibility and power versus traditional two-way fixed effects are unclear;
(3) practitioners routinely report multiple DiD and ES estimates side by side with no formal guidance on whether they rely on the same identification assumptions or target the same parameter.

The paper sets a unified framework for DiD and ES under “parallel trends + no anticipation,” delivering semiparametric efficiency analysis and comparison:
(1) recasting DiD identification as constraints on the joint distribution of observables;
(2) deriving efficiency bounds across single-wave, staggered-adoption, and covariate-adjusted settings;
(3) proving that efficiency requires non-uniform weighting across pre-treatment periods and comparison groups;
(4) showing sizeable empirical gains even in the simplest design. DiD is typically nonparametrically over-identified, so extra moment conditions raise efficiency.
Under “large n, fixed T,” the authors construct a closed-form efficient influence-function estimator whose weights are proportional to the (conditional) covariances across groups and periods, satisfy Neyman orthogonality, and pair naturally with machine learning. With a working parametric model, the estimator remains doubly robust to mild misspecification. The efficiency bound implies that equal-weighting all pre-periods or using only the last period is sub-optimal; covariance-driven weights should aggregate periods and groups according to information content. Graphical tools reveal the weight geometry, making efficiency sources transparent.
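The covariance-driven weighting can be sketched numerically. In the illustrative panel below, three pre-treatment baselines carry unequal noise, each baseline yields an unbiased 2x2 contrast for the effect, and the contrasts are combined with GLS weights proportional to the inverse of their estimated covariance. This is a simplified stand-in for the paper's efficient influence-function estimator, under assumed noise variances and a single treatment date.

```python
import numpy as np

rng = np.random.default_rng(3)

# Panel with 3 pre-treatment periods and 1 post period (illustrative).
# Pre-period noise variances differ, so the baselines carry unequal information.
n, tau = 3000, 1.5
treated = rng.binomial(1, 0.5, n)
alpha = rng.normal(0, 1, n)                          # unit fixed effects
sig = np.array([2.0, 1.0, 0.5])                      # assumed pre-period noise sds
Y_pre = alpha[:, None] + rng.normal(0, 1, (n, 3)) * sig
Y_post = alpha + tau * treated + rng.normal(0, 0.5, n)

# One 2x2 DiD contrast per pre-period baseline; each is unbiased for tau
# under parallel trends + no anticipation.
deltas = Y_post[:, None] - Y_pre                     # n x 3 changes
est = np.array([deltas[treated == 1, t].mean() - deltas[treated == 0, t].mean()
                for t in range(3)])

# Estimate the covariance of the three contrasts from per-unit influence terms,
# then combine them with minimum-variance (GLS) weights w ∝ Sigma^{-1} 1.
p1 = treated.mean()
center = np.where(treated[:, None] == 1,
                  deltas[treated == 1].mean(axis=0),
                  deltas[treated == 0].mean(axis=0))
phi = (treated[:, None] / p1
       - (1 - treated[:, None]) / (1 - p1)) * (deltas - center)
Sigma = phi.T @ phi / n**2                           # covariance of `est`
w = np.linalg.solve(Sigma, np.ones(3))
w /= w.sum()                                         # weights sum to one
att_hat = w @ est
```

Here the low-noise baseline receives the largest weight, illustrating the text's point: equal-weighting all pre-periods, or using only the last one, discards information that the covariance structure makes available.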
They also propose a nonparametric Hausman-type test—requiring no homoskedasticity or weak serial-correlation assumptions—to decide whether to drop certain pre-treatment baselines or comparison groups, and supply visualization to diagnose sensitivity to identification assumptions.
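A generic Hausman-style comparison of this kind can be sketched as follows: two baselines give two contrasts that estimate the same parameter under the identification assumptions, so a large standardized difference flags a violation. The variance of the difference is estimated from per-unit influence terms, with no homoskedasticity or serial-correlation assumptions. This is an illustrative analogue, not the paper's exact test statistic.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two pre-period baselines; under parallel trends + no anticipation both
# 2x2 contrasts estimate the same effect, so their difference should be ~ 0.
n, tau = 4000, 1.0
treated = rng.binomial(1, 0.5, n)
alpha = rng.normal(0, 1, n)
Y_pre = alpha[:, None] + rng.normal(0, 1, (n, 2))
Y_post = alpha + tau * treated + rng.normal(0, 1, n)
deltas = Y_post[:, None] - Y_pre                       # n x 2 changes

est = np.array([deltas[treated == 1, t].mean() - deltas[treated == 0, t].mean()
                for t in range(2)])

# Variance of the difference of the two contrasts from per-unit influence
# terms (robust to heteroskedasticity and arbitrary serial correlation).
p1 = treated.mean()
center = np.where(treated[:, None] == 1,
                  deltas[treated == 1].mean(axis=0),
                  deltas[treated == 0].mean(axis=0))
phi = (treated[:, None] / p1
       - (1 - treated[:, None]) / (1 - p1)) * (deltas - center)
d = phi[:, 0] - phi[:, 1]
var_diff = (d @ d) / n**2

H = (est[0] - est[1])**2 / var_diff                    # ~ chi2(1) under H0
reject = H > 3.84                                      # 5% critical value
```

Rejection suggests the two baselines do not rely on the same identification content, which is precisely the practical question the test is designed to settle before dropping a baseline or comparison group.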
Simulations using CPS and Compustat data and a re-analysis of hospitalizations and medical spending demonstrate practical gains: in both single-shot and staggered treatments, mean-squared error and confidence-interval width fall sharply, with precision gains commonly exceeding 40%. Achieving comparable accuracy with traditional estimators would need at least 30% more observations. Aligning estimation with the identification structure’s information content thus markedly improves inference.
The framework complements existing multi-period, staggered-adoption DiD estimators. Traditional two-way fixed effects blend weights ambiguously under heterogeneous effects; ad-hoc aggregation across time or groups also loses efficiency. The new efficiency bounds offer a common yardstick for method comparison, guiding design choices between efficiency and robustness. The approach is especially suitable for small samples, dynamic effects, or high-dimensional covariates, as it exploits all data information without strong extra assumptions. It can likewise benchmark other identification strategies such as synthetic control. By fully characterizing DiD and ES information and delivering estimators that attain the information frontier—along with simple tests and visualization—the paper furnishes practitioners a higher-precision toolkit.
Future work should explore how to secure local over-identification in more economic settings to harness latent data information and boost analytic effectiveness. Deeper study of efficiency limits under varying model structures and development of more flexible, powerful tests will better suit complex real-world contexts.
In summary, as econometrics develops in the GAI era, the importance of nonparametric over-identification and efficient learning has grown markedly. These ideas not only enhance estimation efficiency and accuracy but also provide a firmer theoretical and methodological foundation for future causal inference and policy analysis.