Staggered DiD with Nonlinear Outcomes: Methods and Usage

NonlinearDiD Package

2026-05-03

1 Introduction

The NonlinearDiD package implements difference-in-differences estimators for staggered treatment adoption with binary, count, and other nonlinear outcomes.

1.1 Why Nonlinear Outcomes Require Special Methods

The landmark Callaway & Sant’Anna (2021) framework assumes that parallel trends holds on the probability scale:

\[E[Y_{it}(0) - Y_{it-1}(0) \mid G = g] = E[Y_{it}(0) - Y_{it-1}(0) \mid G = \infty]\]

For continuous outcomes (wages, test scores), this is natural. But for binary outcomes (employment, hospitalization, default), parallel trends in probabilities implies parallel trends in log-odds only under a very specific functional form restriction.

Roth & Sant’Anna (2023) show this creates two distinct problems:

  1. Sensitivity to functional form: Whether pre-treatment trends appear parallel depends entirely on which scale you test.

  2. Treatment effect heterogeneity conflation: In nonlinear models, the “treatment effect” includes both the true effect and Jensen’s inequality adjustments.

This package addresses both problems.


2 Installation

# From CRAN (forthcoming)
install.packages("NonlinearDiD")

# Development version from GitHub
remotes::install_github("example/NonlinearDiD")

3 Binary Outcomes: Staggered DiD

3.1 Simulating Data

dat <- sim_binary_panel(
  n            = 800,
  nperiods     = 8,
  prop_treated = 0.6,
  n_cohorts    = 3,
  true_att     = c(0.15, 0.25, 0.20),  # heterogeneous treatment effects
  base_prob    = 0.3,
  seed         = 42
)

head(dat)
#>   id period y g D alpha_i     x1 x2
#> 1  1      1 1 0 0  0.8921 0.6888  1
#> 2  1      2 1 0 0  0.8921 0.6888  1
#> 3  1      3 0 0 0  0.8921 0.6888  1
#> 4  1      4 1 0 0  0.8921 0.6888  1
#> 5  1      5 0 0 0  0.8921 0.6888  1
#> 6  1      6 0 0 0  0.8921 0.6888  1
cat("Treatment cohorts:", table(dat$g[dat$period == 1]), "\n")
#> Treatment cohorts: 320 160 160 160
cat("Baseline outcome rate:", round(mean(dat$y[dat$D == 0]), 3), "\n")
#> Baseline outcome rate: 0.343

3.2 Estimating ATT(g,t) with Logit Model

res <- nonlinear_attgt(
  data          = dat,
  yname         = "y",
  tname         = "period",
  idname        = "id",
  gname         = "g",
  xformla       = ~ x1 + x2,
  outcome_model = "logit",
  estimand      = "att",
  control_group = "nevertreated",
  doubly_robust = TRUE
)

summary(res)
#> 
#> === Nonlinear DiD Summary ===
#> 
#> Model:         logit 
#> Estimand:      att 
#> Control group: nevertreated 
#> 
#> --- Post-treatment ATT(g,t) ---
#>   Mean ATT: 0.0327
#>   Median ATT: 0.0287
#>   Min ATT: -0.0531  |  Max ATT: 0.1634
#>   N(g,t) cells: 14
#> 
#> --- Pre-treatment ATT(g,t) (should be ~0 under PT) ---
#>   Mean: 0.0399  |  SD: 0.0568
#>   N(g,t) cells: 10
#> 
#> --- Convergence ---
#>   Converged: 24 / 24

The nonlinear_attgt() function estimates a separate ATT for each (group, time) pair. The key difference from the linear case: we model the change in log-odds for control units as the counterfactual, not the change in probability.

3.3 Event-Study Aggregation

agg_dynamic <- nonlinear_aggte(res, type = "dynamic")
print(agg_dynamic)
#> 
#> === Nonlinear DiD: Aggregated ATT ===
#> 
#> Aggregation type: dynamic 
#> Outcome model:   logit 
#> 
#> Overall ATT: 0.0327  (SE: 0.0130, t = 2.508, p = 0.0121)
#> 95% CI: [0.0071, 0.0582]
#> 
#>  label       att      se     ci_lo   ci_hi n_groups  post
#>     -6  0.155126 0.04903  0.059034 0.25122        1 FALSE
#>     -5  0.082694 0.04779 -0.010974 0.17636        1 FALSE
#>     -4  0.108083 0.04850  0.013032 0.20313        1 FALSE
#>     -3  0.002701 0.03444 -0.064793 0.07019        2 FALSE
#>     -2  0.024059 0.03406 -0.042694 0.09081        2 FALSE
#>     -1  0.000000 0.00000  0.000000 0.00000        3 FALSE
#>      0  0.005890 0.02791 -0.048816 0.06060        3  TRUE
#>      1  0.062026 0.02836  0.006445 0.11761        3  TRUE
#>      2  0.002272 0.03493 -0.066189 0.07073        2  TRUE
#>      3  0.065044 0.03432 -0.002223 0.13231        2  TRUE
#>      4  0.045172 0.03519 -0.023797 0.11414        2  TRUE
#>      5 -0.001089 0.04710 -0.093395 0.09122        1  TRUE
#>      6  0.029827 0.04741 -0.063086 0.12274        1  TRUE
plot(agg_dynamic)
Event-study plot for binary outcome DiD
Event-study plot for binary outcome DiD

3.4 Group-Level ATT

agg_group <- nonlinear_aggte(res, type = "group")
print(agg_group)
#> 
#> === Nonlinear DiD: Aggregated ATT ===
#> 
#> Aggregation type: group 
#> Outcome model:   logit 
#> 
#> Overall ATT: 0.0327  (SE: 0.0130, t = 2.508, p = 0.0121)
#> 95% CI: [0.0071, 0.0582]
#> 
#>  label      att      se      ci_lo   ci_hi n_periods post
#>      2 0.003981 0.01821 -0.0317023 0.03966         7 TRUE
#>      4 0.043094 0.02214 -0.0003072 0.08650         5 TRUE
#>      7 0.107062 0.03449  0.0394586 0.17467         2 TRUE

The group-level ATT tells us: cohort 4 experienced the largest treatment effect (they adopted treatment earlier, when baseline rates were lower).

4 Odds-Ratio DiD

For some binary outcome applications, the odds-ratio DiD is preferable because: - It is invariant to the choice of reference group - It does not require parallel trends in probabilities (only in log-odds) - It has a natural multiplicative interpretation

# Use simple 2-period case for clarity
dat2 <- dat[dat$period %in% c(3, 5), ]

res_or <- odds_ratio_did(
  data           = dat2,
  yname          = "y",
  tname          = "period",
  idname         = "id",
  treat_period   = 5,
  control_period = 3,
  gname          = "g"
)

print(res_or)
#> 
#> === Odds-Ratio DiD ===
#> 
#> Interpretation: OR-DiD = 0.964 [95% CI: 0.634, 1.464] 
#> 
#> Log(OR-DiD): -0.0372  (SE: 0.2135)
#> 95% CI [OR]: [0.6340, 1.4642]
#> t-stat: -0.174  |  p-value: 0.8618

An OR-DiD of 1 means no treatment effect. Values > 1 indicate the treatment increased the odds of Y = 1.


5 Count Outcomes: Poisson DiD

For count outcomes (patents, hospital visits, crimes), we assume parallel trends on the log scale (multiplicative parallel trends):

\[\frac{E[Y_{it}(0)]}{E[Y_{it-1}(0)]} = \frac{E[Y_{jt}(0)]}{E[Y_{jt-1}(0)]}\]

count_dat <- sim_count_panel(
  n            = 600,
  nperiods     = 6,
  prop_treated = 0.5,
  true_rr      = c(1.5, 2.0, 1.3),  # rate ratios per cohort
  base_rate    = 10,
  seed         = 7
)

summary(count_dat$y)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.00   10.00   14.00   15.53   19.00   63.00
# Staggered Poisson DiD
res_count <- nonlinear_attgt(
  data          = count_dat,
  yname         = "y",
  tname         = "period",
  idname        = "id",
  gname         = "g",
  outcome_model = "poisson",
  estimand      = "att",
  control_group = "nevertreated"
)

agg_count <- nonlinear_aggte(res_count, type = "dynamic")
plot(agg_count)

# Simple 2x2 Poisson DiD (Wooldridge QMLE)
count_sub <- count_dat[count_dat$period %in% c(2, 4), ]
res_pois  <- count_did_poisson(
  count_sub,
  yname          = "y",
  tname          = "period",
  idname         = "id",
  treat_period   = 4,
  control_period = 2,
  gname          = "g"
)
print(res_pois)
#> 
#> === Count DiD (Poisson QMLE) ===
#> 
#> ATT (log rate ratio): 0.2643  (SE: 0.0510)
#> Rate ratio:           1.3026  [95% CI: 1.1787, 1.4394]
#> ATT (APE, count scale): 3.9909
#> t-stat: 5.185  |  p-value: 0.0000

6 Doubly-Robust Estimation

The doubly-robust estimator combines an outcome model with a propensity score model. It is consistent if either (but not necessarily both) is correctly specified.

res_dr <- binary_did_dr(
  data           = dat[dat$period %in% c(3, 5), ],
  yname          = "y",
  tname          = "period",
  idname         = "id",
  treat_period   = 5,
  control_period = 3,
  gname          = "g",
  xformla        = ~ x1 + x2,
  outcome_model  = "logit"
)
print(res_dr)
#> 
#> === Doubly-Robust Binary DiD ===
#> 
#> ATT:   -0.0080  (SE: 0.0475)
#> 95% CI: [-0.1010, 0.0851]
#> t-stat: -0.168  |  p-value: 0.8670
#> N (treated): 480  |  N (control): 320

7 Nonparametric Bounds

When we are uncertain about functional form, we can compute sharp bounds on the ATT. Under parallel trends alone (with no functional form assumption), the ATT for binary outcomes is point identified. But under weaker assumptions (e.g., only monotone treatment response), we get intervals.

bounds <- nonlinear_bounds(
  data          = dat,
  yname         = "y",
  tname         = "period",
  idname        = "id",
  gname         = "g",
  bound_type    = "manski",  # widest (no assumptions)
  control_group = "nevertreated"
)

# Show post-treatment bounds
post_bounds <- bounds[bounds$post, ]
head(post_bounds[, c("group", "time", "lb", "ub", "identified")])
#>   group time       lb      ub identified
#> 2     2    2 -0.67500 0.32500       TRUE
#> 3     2    3 -0.64375 0.35625       TRUE
#> 4     2    4 -0.62500 0.37500       TRUE
#> 5     2    5 -0.58750 0.41250       TRUE
#> 6     2    6 -0.56250 0.43750       TRUE
#> 7     2    7 -0.58750 0.41250       TRUE

8 Comparison: Linear vs. Nonlinear DiD

A critical question: does it matter? When baseline probabilities are far from 0 and 1, the answer is yes — the linear DiD can be severely biased.

# Linear comparison (for illustration)
res_linear <- nonlinear_attgt(
  data          = dat,
  yname         = "y",
  tname         = "period",
  idname        = "id",
  gname         = "g",
  outcome_model = "linear",
  estimand      = "att"
)

agg_lin  <- nonlinear_aggte(res_linear, type = "dynamic")
agg_logit <- nonlinear_aggte(res, type = "dynamic")

# The two produce different estimates when baseline rates are moderate
cat("Linear overall ATT:",  round(agg_lin$overall_att, 4), "\n")
cat("Logit overall ATT:",   round(agg_logit$overall_att, 4), "\n")
cat("True ATT (avg):    ",  round(mean(c(0.15, 0.25, 0.20)), 4), "\n")

9 Methodological Details

9.1 Identifying Assumption

For the logit-scale ATT(g,t) estimator, the key identifying assumption is:

Parallel Trends on the Logit Scale: For all \(t \geq 2\) and all treated cohorts \(g\):

\[[\text{logit}(E[Y_{it}(0)]) - \text{logit}(E[Y_{it-1}(0)])] \cdot \mathbf{1}_{G_i = g}\] \[= [\text{logit}(E[Y_{it}(0)]) - \text{logit}(E[Y_{it-1}(0)])] \cdot \mathbf{1}_{G_i = \infty}\]

This is a stronger assumption than CS2021’s linear parallel trends when treatment effects are heterogeneous (it restricts both the means and the full distribution).

9.2 DR Estimator

The doubly-robust estimator for binary outcomes uses the influence function:

\[\hat{\psi}^{DR}_{g,t} = \frac{1}{n} \sum_i \left[ \frac{D_i}{\hat{p}} (\Delta Y_i - \hat{\mu}(X_i)) - \frac{(1-D_i)\hat{\pi}(X_i)}{1-\hat{\pi}(X_i)} \cdot \frac{1}{\hat{p}} (\Delta Y_i - \hat{\mu}(X_i)) \right]\]

where \(\Delta Y_i = Y_{it} - Y_{it-1}\), \(\hat{\pi}\) is the propensity score, and \(\hat{\mu}\) is the outcome regression on controls.


10 References

Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.

Roth, J., & Sant’Anna, P. H. C. (2023). When is parallel trends sensitive to functional form? Econometrica, 91(2), 737-747.

Wooldridge, J. M. (2023). Simple approaches to nonlinear difference-in- differences with panel data. The Econometrics Journal, 26(3), 31-66.

Sant’Anna, P. H. C., & Zhao, J. (2020). Doubly robust difference-in-differences estimators. Journal of Econometrics, 219(1), 101-122.

Manski, C. F. (1990). Nonparametric bounds on treatment effects. American Economic Review, 80(2), 319-323.