Introduction

The summclust package computes leverage statistics for clustered errors and fast CRV3(J) variance-covariance matrices, as described in MacKinnon, J.G., Nielsen, M.Ø., Webb, M.D., 2022. Leverage, influence, and the jackknife in clustered regression models: Reliable inference using summclust.

It is a post-estimation command and currently supports methods for objects of type lm (from stats) and fixest (from the fixest package).

CRV 1-3 Cluster Robust Variance Estimators and Jackknife formulations

summclust handles cluster robust variance estimation of linear regression models of the form

\[\begin{equation} y = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{G} \end{bmatrix} = X\beta + u = \begin{bmatrix} X_{1} \\ X_{2} \\ \vdots \\ X_{G} \end{bmatrix} \beta + \begin{bmatrix} u_{1} \\ u_{2} \\ \vdots \\ u_{G} \end{bmatrix}, \end{equation}\]

where cluster \(g\) contains \(N_{g}\) observations, so that \(N = \sum_{g = 1}^{G} N_{g}\), and \(E(u|X) = 0\). The regression errors \(u\) are allowed to be correlated within clusters but are assumed to be uncorrelated across clusters. In consequence, the model's covariance matrix is block diagonal. For each cluster, we denote \(E(u_{g} u_{g}') = \Omega_{g}\).

The literature on cluster robust inference has proposed three different estimators, which all follow the same ‘sandwich’ structure

\[\begin{equation} (X'X)^{-1} (\sum_{g=1}^{G} \Sigma_{g} ) (X'X)^{-1}. \end{equation}\]

The three different types of CRV estimators depend on how \(\Sigma_{g}\) is estimated.

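The most common cluster robust estimator, the CRV1 estimator, is defined as

\[\begin{equation} CRV1: \hat{V}_{1}(\hat{\beta}) = m (X'X)^{-1} \left(\sum_{g=1}^{G} s_{g} s_{g}'\right) (X'X)^{-1}, \end{equation}\]

where \(s_g = X_{g}'\hat{u}_{g}\) and \(m\) is a small-sample correction factor. For CRV1, a common choice is \(m = \frac{G}{G-1} \cdot \frac{N-1}{N-k}\), with \(k\) the number of estimated coefficients.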
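The CRV1 formula can be traced step by step in base R. The sketch below is illustrative only: the simulated data, the variable names and the choice \(m = \frac{G}{G-1} \cdot \frac{N-1}{N-k}\) are assumptions made for this example, not part of summclust.

```r
set.seed(1)
G   <- 10                               # hypothetical number of clusters
N_g <- 20                               # hypothetical cluster size
N   <- G * N_g
cl  <- rep(seq_len(G), each = N_g)      # cluster id of each observation
x   <- rnorm(N)
u   <- rnorm(G)[cl] + rnorm(N)          # shared cluster shock: errors correlate within clusters
y   <- 1 + 0.5 * x + u

X        <- cbind(intercept = 1, x = x)
k        <- ncol(X)
XtX_inv  <- solve(crossprod(X))         # (X'X)^{-1}
beta_hat <- XtX_inv %*% crossprod(X, y)
u_hat    <- drop(y - X %*% beta_hat)    # OLS residuals

# Scores s_g = X_g' u_hat_g, stored as the columns of a k x G matrix
S <- sapply(seq_len(G), function(g) {
  crossprod(X[cl == g, , drop = FALSE], u_hat[cl == g])
})
meat <- tcrossprod(S)                   # sum_g s_g s_g'

m    <- G / (G - 1) * (N - 1) / (N - k) # assumed small-sample correction
crv1 <- m * XtX_inv %*% meat %*% XtX_inv

sqrt(diag(crv1))                        # CRV1 standard errors
```

Storing the scores as columns of one matrix means the sum of outer products reduces to a single tcrossprod() call.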

The CRV2 estimator is computed as

\[\begin{equation} CRV2: \hat{V}_{2}(\hat{\beta}) = (X'X)^{-1} (\sum_{g=1}^{G} s^{2}_{g} s^{2}_{g}') (X'X)^{-1}. \end{equation}\]

where \(s^{2}_g = X_{g}' M_{gg}^{-1/2} \hat{u}_{g}\).

\(M_{gg}\) is defined as …

Last, the CRV3 estimator is defined as

\[\begin{equation} CRV3: \hat{V}_{3}(\hat{\beta}) = \frac{G-1}{G} (X'X)^{-1} \left(\sum_{g=1}^{G} s^{3}_{g} s^{3}_{g}'\right) (X'X)^{-1}, \end{equation}\]

with \(s^{3}_{g} = X_{g}' M_{gg}^{-1} \hat{u}_{g}\).

Building on work by Niccodemi and …, MacKinnon, Nielsen and Webb (MNW) show that the CRV3 estimator can be computed as a jackknife estimator.

First, define \(\hat{\beta}^{(g)}\), the OLS estimate of (1) when cluster \(g\) is omitted:

\[\begin{equation} \hat{\beta}^{(g)} = (X'X - X_{g}'X_{g})^{-1}(X'y - X_{g}'y_{g}), \quad g = 1, \ldots , G. \end{equation}\]

MNW show that the CRV3 estimator is equivalent to computing

\[\begin{equation} \hat{V}_{3}(\hat{\beta}) = \frac{G-1}{G} \sum_{g = 1}^{G} (\hat{\beta}^{(g)} - \hat{\beta}) (\hat{\beta}^{(g)} - \hat{\beta})'. \end{equation}\]

They further propose the following jackknife estimator, CRV3J:

\[\begin{equation} \hat{V}_{3J}(\hat{\beta}) = \frac{G-1}{G} \sum_{g = 1}^{G} (\hat{\beta}^{(g)} - \bar{\beta}) (\hat{\beta}^{(g)} - \bar{\beta})', \end{equation}\]

with \(\bar{\beta} = G^{-1} \sum_{g=1}^{G} \hat{\beta}^{(g)}\).

Both estimators can be computed very quickly (as long as the number of clusters does not get too large), and both estimators are implemented in summclust.
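To make the jackknife formulation concrete, here is a minimal base R sketch on simulated data. It computes the leave-one-cluster-out estimates from cluster-level cross products, checks them against brute-force refits, and then forms both variance estimators with the jackknife scaling \((G-1)/G\). The data-generating process and all object names are assumptions for illustration, and this is not how summclust is implemented internally.

```r
set.seed(42)
G   <- 8                                # hypothetical number of clusters
N_g <- 25                               # hypothetical cluster size
cl  <- rep(seq_len(G), each = N_g)
x   <- rnorm(G * N_g)
y   <- 1 + 0.5 * x + rnorm(G)[cl] + rnorm(G * N_g)
X   <- cbind(intercept = 1, x = x)

XtX      <- crossprod(X)                # X'X
Xty      <- crossprod(X, y)             # X'y
beta_hat <- solve(XtX, Xty)             # full-sample OLS estimate

# beta^(g) = (X'X - X_g'X_g)^{-1} (X'y - X_g'y_g); only k x k systems are solved
beta_g <- sapply(seq_len(G), function(g) {
  Xg <- X[cl == g, , drop = FALSE]
  yg <- y[cl == g]
  solve(XtX - crossprod(Xg), Xty - crossprod(Xg, yg))
})                                      # k x G matrix, one column per omitted cluster

# Brute-force check: refit the regression without cluster g
beta_g_refit <- sapply(seq_len(G), function(g) coef(lm(y ~ x, subset = cl != g)))

beta_bar <- rowMeans(beta_g)            # mean of the leave-one-out estimates
d3  <- beta_g - drop(beta_hat)          # centred at the full-sample estimate
d3j <- beta_g - beta_bar                # centred at the jackknife mean

crv3  <- (G - 1) / G * tcrossprod(d3)
crv3j <- (G - 1) / G * tcrossprod(d3j)

sqrt(diag(crv3))
sqrt(diag(crv3j))
```

Expanding the outer products shows that the two estimators differ only by the term \((G-1)(\bar{\beta} - \hat{\beta})(\bar{\beta} - \hat{\beta})'\), so CRV3 standard errors are never smaller than CRV3J standard errors.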

The summclust function

library(summclust)
library(lmtest)
library(haven)

nlswork <- read_dta("http://www.stata-press.com/data/r9/nlswork.dta")
# keep only the model variables and drop rows with missing values (for now)
nlswork <- nlswork[, c("ln_wage", "grade", "age", "birth_yr", "union", "race", "msp", "ind_code")]
nlswork <- na.omit(nlswork)

lm_fit <- lm(
  ln_wage ~ as.factor(grade) + as.factor(age) + as.factor(birth_yr) + union +  race + msp,
  data = nlswork)

summclust_res <- summclust(
  obj = lm_fit,
  cluster = ~ind_code,
  type = "CRV3")

# CRV3-based inference - exactly matches output of summclust-stata
coeftable(summclust_res, param = c("msp", "union"))
#>             coef     tstat         se      p_val  conf_int_l  conf_int_u
#> union  0.2039597  2.440122 0.08358587 0.03281561  0.01998847 0.387930980
#> msp   -0.0275151 -1.956404 0.01406412 0.07628064 -0.05847002 0.003439815

summary(summclust_res, param = c("msp","union"))
#>             coef     tstat         se      p_val  conf_int_l  conf_int_u
#> union  0.2039597  2.440122 0.08358587 0.03281561  0.01998847 0.387930980
#> msp   -0.0275151 -1.956404 0.01406412 0.07628064 -0.05847002 0.003439815
#>  
#>            leverage partial-leverage-msp partial-leverage-union    beta-msp
#> Min.     0.09332052          0.001622359           0.0006662968 -0.03320040
#> 1st Qu.  0.70440923          0.009133996           0.0048899422 -0.02893131
#> Median   3.51549151          0.056682344           0.0379535242 -0.02776470
#> Mean     5.41666667          0.083333333           0.0833333333 -0.02691999
#> 3rd Qu.  6.41132962          0.106083114           0.1004277711 -0.02610221
#> Max.    20.28918187          0.312994532           0.3597669210 -0.01583453
#>         beta-union
#> Min.     0.1624754
#> 1st Qu.  0.1994694
#> Median   0.2045197
#> Mean     0.2053997
#> 3rd Qu.  0.2056569
#> Max.     0.2754228

To visually inspect the leverage statistics, use the plot method:

plot(summclust_res, param = c("msp","union"))
#> (a list of three plots: $residual_leverage, $coef_leverage, $coef_beta)

Using summclust with lmtest and fixest

Note that you can also use the CRV3 and CRV3J covariance matrices computed via summclust with the lmtest and fixest packages.

library(lmtest)
library(fixest)

df <- length(summclust_res$cluster) - 1

# with lmtest
CRV1 <- coeftest(lm_fit, sandwich::vcovCL(lm_fit, ~ind_code), df = df)
CRV3 <- coeftest(lm_fit, summclust_res$vcov, df = df)

CRV1[c("union", "race", "msp"),]
#>          Estimate  Std. Error   t value     Pr(>|t|)
#> union  0.20395972 0.061167499  3.334446 0.0066585766
#> race  -0.08619813 0.016150418 -5.337207 0.0002384275
#> msp   -0.02751510 0.009293046 -2.960827 0.0129561148
CRV3[c("union", "race", "msp"),]
#>          Estimate Std. Error   t value    Pr(>|t|)
#> union  0.20395972 0.08358587  2.440122 0.032815614
#> race  -0.08619813 0.01904684 -4.525586 0.000864074
#> msp   -0.02751510 0.01406412 -1.956404 0.076280639

confint(CRV1)[c("union", "race", "msp"),]
#>             2.5 %       97.5 %
#> union  0.06933097  0.338588481
#> race  -0.12174496 -0.050651302
#> msp   -0.04796896 -0.007061245
confint(CRV3)[c("union", "race", "msp"),]
#>             2.5 %       97.5 %
#> union  0.01998847  0.387930980
#> race  -0.12811995 -0.044276312
#> msp   -0.05847002  0.003439815

# with fixest
feols_fit <- feols(
  ln_wage ~ as.factor(grade) + as.factor(age) + as.factor(birth_yr) + union +  race + msp,
  data = nlswork)

fixest::coeftable(
  feols_fit,
  vcov = summclust_res$vcov,
  ssc = ssc(adj = FALSE, cluster.adj = FALSE)
)[c("msp", "union", "race"),]
#>          Estimate Std. Error   t value     Pr(>|t|)
#> msp   -0.02751510 0.01406412 -1.956404 5.043213e-02
#> union  0.20395972 0.08358587  2.440122 1.469134e-02
#> race  -0.08619813 0.01904684 -4.525586 6.059226e-06

The p-values and confidence intervals from fixest::coeftable() differ from those of lmtest::coeftest() and summclust::coeftable(). The reason is that fixest::coeftable() uses different degrees of freedom for the t-distribution in these calculations (I believe it uses t(N-1)), whereas the lmtest call above sets df to the number of clusters minus one.
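The effect of the degrees of freedom can be reproduced with base R alone. The sketch below takes the CRV3 t statistic for union from the output above and evaluates it against a t-distribution with G - 1 degrees of freedom and against the normal limit that a very large df approaches. G = 12 is an assumption inferred from the mean partial leverage of 1/12 reported by summary(), and the normal distribution is only a stand-in for whatever large df fixest actually uses.

```r
t_union <- 2.440122                     # CRV3 t statistic for union, copied from the output above
G       <- 12                           # assumed number of clusters (1 / mean partial leverage)

# t(G - 1), matching df = length(summclust_res$cluster) - 1 in the lmtest call
p_cluster <- 2 * pt(-abs(t_union), df = G - 1)

# Normal limit, approached by a t-distribution when df is very large
p_normal <- 2 * pnorm(-abs(t_union))

round(c(t_G_minus_1 = p_cluster, normal = p_normal), 4)
```

With only a handful of clusters, the heavier tails of t(G - 1) give a noticeably larger p-value for the same t statistic and standard error.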