svytest: Diagnostic Tests for Survey Weight Informativeness

Introduction

Survey weights are indispensable for producing unbiased population estimates under complex sampling designs. Yet, when fitting regression models, analysts often ask: are the weights truly informative for my outcome, or can I safely ignore them? Failure to include informative weights can lead to biased estimates and invalid inference, while including non-informative weights may unnecessarily inflate variance. Analysts need robust tools to assess the informativeness of survey weights in regression contexts.

The svytest package provides a suite of diagnostic tests designed to answer this question. These tests adapt classical specification tests (Hausman, 1978; Pfeffermann, 1993) and more recent developments (Pfeffermann & Sverchkov, 1999, 2003; Wu & Fuller, 2005) to the survey regression setting, and supplement them with non-parametric methods, to evaluate whether survey weights materially affect regression estimates. These tests are diagnostic tools for deciding whether weights should enter a regression analysis; they should not be used to draw causal conclusions about the relationship between \(\vec{Y}\) and \(\textbf{X}\). Their scope is limited to assessing the necessity of survey weights in regression.

This vignette introduces each test, explains the underlying statistical machinery, and demonstrates their use on example data. We conclude by outlining a simulation study design that evaluates their performance. For demonstration purposes, we use the included svytestCE dataset, a curated subset of the Consumer Expenditure Survey. See ?svytestCE for details. Furthermore, the svytest package currently supports linear regression models fitted via svyglm from the survey package. Future versions may extend support to generalized linear models and other modeling frameworks.

General Review of Survey Weight Diagnostic Tests

Bollen et al. (2016) provide a comprehensive review of the landscape of diagnostic tests for survey weight informativeness. Their central insight is that, despite the proliferation of methods, nearly all tests fall into two broad categories: Difference-in-Coefficients (DC) tests and Weight-Association (WA) tests.

Bollen and colleagues emphasize that DC and WA tests are theoretically equivalent: testing whether weighted and unweighted coefficients differ is equivalent to testing whether weights are correlated with the outcome conditional on covariates. In practice, however, the finite-sample behavior of the tests can diverge, and simulation evidence is limited. Some studies suggest that classical Hausman-type DC tests may over-reject in small samples, while certain WA formulations may be more stable.

They also highlight a number of practical considerations for applied use.

This classification provides a unifying framework for the diagnostics implemented in svytest: the package offers both DC-style tests (e.g., diff_in_coef_test) and WA-style tests (e.g., wa_test with DD, PS1, PS2, WF variants), along with extensions such as the estimating equations test and permutation-based approaches.

Diagnostic Test Implementation

Throughout this section, we describe each diagnostic test implemented in the svytest package. For each test, we outline the statistical formulation and assumptions, and provide example code demonstrating its use. Let \(\{(y_k, \vec{X}_k, w_k): k \in S\}\) denote the observed data for sampled units, where \(y_k\) is the outcome, \(\vec{X}_k\) is a vector of covariates, and \(w_k\) is the survey weight for unit \(k\) in the sample \(S\). We create a svyglm object from the survey package to demonstrate the tests.

# Load packages
library(survey)
library(svytest)

# Construct survey design
des <- svydesign(ids = ~1, weights = ~FINLWT21, data = svytestCE)

# Fit weighted regression
fit <- svyglm(TOTEXPCQ ~ ROOMSQ + BATHRMQ + BEDROOMQ + FAM_SIZE + AGE, design = des)

Given the many different options, the function run_all_diagnostic_tests provides a convenient wrapper to run all tests and summarize results. It produces a recommendation that can guide analysts on whether to use weights in their regression analysis, though there is no formal consensus on decision rules in the survey weight diagnostics literature.

# Run all diagnostic tests
all_results <- run_all_diagnostic_tests(fit, alpha = 0.05)
print(all_results)
#> 
#> Diagnostic Tests for Informative Weights
#> # A tibble: 10 × 4
#>    test       statistic   p.value reject
#>    <chr>          <dbl>     <dbl> <lgl> 
#>  1 DD              2.56 0.0178    TRUE  
#>  2 PS1             2.40 0.0186    TRUE  
#>  3 PS1q            3.69 0.0000138 TRUE  
#>  4 PS2             4.68 0.00930   TRUE  
#>  5 PS2q            5.55 0.00389   TRUE  
#>  6 WF              1.97 0.0802    FALSE 
#>  7 HP             15.3  0.0179    TRUE  
#>  8 PS3             2.03 0.0581    FALSE 
#>  9 perm_mean      NA    0.0450    TRUE  
#> 10 perm_mahal     NA    0.128     FALSE 
#> 
#> Recommendation:
#>  At least one test rejects H0 at significance level = 0.05. Recommendation: use survey weights in regression.

Difference-in-Coefficients Tests

The Difference-in-Coefficients (DC) tests assess whether the coefficients estimated from weighted and unweighted regression models differ significantly. The underlying logic is that if weights are informative, they will affect the estimated coefficients. The DC test implemented in svytest is based on the Hausman-Pfeffermann framework (Pfeffermann, 1993; Pfeffermann & Sverchkov, 1999). The test statistic is constructed as follows:

Let \(W = \text{diag}(w_1, \dots, w_n)\) be the diagonal weight matrix. We define the unweighted and weighted estimators as \[\hat\beta_{U} = (X^\top X)^{-1} X^\top y,\] \[\hat\beta_{W} = (X^\top W X)^{-1} X^\top W y.\] The DC test statistic is then \[T = (\hat\beta_{W} - \hat\beta_{U})^\top [\text{Var}(\hat\beta_{W}) - \text{Var}(\hat\beta_{U})]^{-1} (\hat\beta_{W} - \hat\beta_{U}).\] Under the null hypothesis \(H_0: E(\hat\beta_{W} - \hat\beta_{U}) = 0\) that weights are non-informative, \(T\) asymptotically follows a chi-squared distribution with degrees of freedom \(p\) equal to the number of coefficients tested, i.e. \(T \sim \chi^2_p\).
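To make the construction concrete, the statistic can be computed by hand on simulated data. The sketch below assumes homoscedastic errors and uses a pooled residual variance estimate; the variable names are illustrative and do not reflect the package's internals.

```r
# Hand-computed DC statistic on simulated, non-informative weights (sketch)
set.seed(1)
n <- 500
X <- cbind(1, runif(n))                       # design matrix with intercept
y <- X %*% c(1, 2) + rnorm(n)                 # outcome; weights are non-informative
w <- runif(n, 0.5, 2)                         # hypothetical survey weights
W <- diag(w)

beta_U <- solve(t(X) %*% X, t(X) %*% y)                 # unweighted estimator
beta_W <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)     # weighted estimator

# Variance estimates under homoscedasticity (pooled residual variance)
s2  <- sum(w * (y - X %*% beta_W)^2) / sum(w)
V_U <- s2 * solve(t(X) %*% X)
A   <- solve(t(X) %*% W %*% X)
V_W <- s2 * A %*% (t(X) %*% W %*% W %*% X) %*% A

d <- beta_W - beta_U
T_stat  <- drop(t(d) %*% solve(V_W - V_U, d))
p_value <- pchisq(T_stat, df = ncol(X), lower.tail = FALSE)
```

Because OLS is the best linear unbiased estimator under homoscedasticity, \(\text{Var}(\hat\beta_{W}) - \text{Var}(\hat\beta_{U})\) is positive semi-definite, so the quadratic form is well defined.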

The choice of variance estimator is crucial for the DC test. By default, the user assumes homoscedastic residuals, and the variances are estimated from the pooled residual variance of the weighted model. We also provide heteroskedasticity-consistent options. The svytest package implements this test in the diff_in_coef_test function. The available heteroskedasticity-robust variance estimators are “HC0”, “HC1”, “HC2”, and “HC3” (MacKinnon & White, 1985), as implemented in the sandwich package.
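For intuition, an HC3-style covariance can be assembled by hand in base R; the simulated fit and variable names below are illustrative, and in practice the sandwich package performs this computation.

```r
# Hand-rolled HC3 covariance for an unweighted fit (illustrative sketch)
set.seed(2)
n <- 300
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + x)   # heteroskedastic errors
fit_u <- lm(y ~ x)

X <- model.matrix(fit_u)
e <- resid(fit_u)
h <- hatvalues(fit_u)
bread <- solve(crossprod(X))              # (X'X)^{-1}
meat  <- crossprod(X * (e / (1 - h)))     # HC3 scales residual i by 1/(1 - h_i)
V_HC3 <- bread %*% meat %*% bread
```

HC0 uses the raw residuals in the meat matrix; HC1-HC3 apply successively stronger small-sample corrections based on the leverages \(h_i\).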

# Run DC test with equal residual variance
res_equal <- diff_in_coef_test(fit, var_equal = TRUE)
print(res_equal)
#> 
#> Hausman-Pfeffermann Difference-in-Coefficients Test
#> Chi-squared = 15.3278  df = 6  p-value = 0.0179 
#> 
#> Unweighted coefficients:
#> (Intercept)      ROOMSQ     BATHRMQ    BEDROOMQ    FAM_SIZE         AGE 
#>  -207.41727   372.95582  1599.03461    71.90571   447.79437    19.73244 
#> 
#> Weighted coefficients:
#> (Intercept)      ROOMSQ     BATHRMQ    BEDROOMQ    FAM_SIZE         AGE 
#>  -327.93795   361.18174  1596.63690    65.48708   454.68955    23.36097

# Run DC test with heteroskedasticity-robust variance (HC3)
res_robust <- diff_in_coef_test(fit, var_equal = FALSE, robust_type = "HC3")
summary(res_robust)
#> 
#> Difference-in-Coefficients Test
#> Call:
#> diff_in_coef_test(model = fit, var_equal = FALSE, robust_type = "HC3")
#> 
#> Method:
#>  Hausman-Pfeffermann Difference-in-Coefficients Test
#> 
#> Test Statistic:
#>  Chi-sq = 62288.1804  on 6 df , p-value = 0.0000

Weight-Association Tests

Weight‑association (WA) tests provide an alternative to difference‑in‑coefficients tests. Rather than directly comparing weighted and unweighted coefficient vectors, WA tests ask: are the survey weights associated with the outcome (or residuals) after conditioning on covariates?

Formally, the null hypothesis is \(H_0: E(y \mid X, w) = E(y \mid X)\), i.e. weights are non‑informative given the covariates.

DuMouchel–Duncan (DD)

The DD test (DuMouchel & Duncan, 1983) is one of the earliest WA diagnostics.

  1. Fit the unweighted regression: \[\hat\beta_U = (X^\top X)^{-1} X^\top y.\]

  2. Compute residuals: \[e = y - X\hat\beta_U.\]

  3. Regress residuals on the weights: \[e = \gamma_0 + \gamma_1 w + u.\]

If \(\gamma_1 = 0\), weights are not associated with residuals. A significant slope indicates that weights explain variation in residuals, hence are informative. The test statistic is an \(F\)-test on \(\gamma_1\).
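The three steps above can be sketched directly with lm on simulated data; the informative-weight mechanism below is purely illustrative.

```r
# DD sketch: regress unweighted residuals on the weights
set.seed(3)
n <- 400
x <- runif(n)
y <- 1 + x + rnorm(n)
w <- exp(0.5 * y + rnorm(n, sd = 0.25))   # weights depend on y: informative
e <- resid(lm(y ~ x))                     # steps 1-2: unweighted fit, residuals
dd  <- lm(e ~ w)                          # step 3: residuals on weights
fst <- summary(dd)$fstatistic             # F-test on the slope gamma_1
p_value <- unname(pf(fst[1], fst[2], fst[3], lower.tail = FALSE))
```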

results <- wa_test(fit, type = "DD")
print(results)
#> 
#> DuMouchel-Duncan Weight-Association Test
#> F = 2.5557  df1 = 6  df2 = 21914  p-value = 0.0178

Pfeffermann–Sverchkov PS1

The PS1 test (Pfeffermann & Sverchkov, 1999) augments the regression with functions of the weights: \[y = X\beta + f(w)\theta + \varepsilon.\]

  • Under \(H_0\), \(\theta = 0\).
  • \(f(w)\) can be linear (\(w\)), quadratic (\(w, w^2\)), or user‑supplied.
  • The quadratic version is denoted PS1q.

This is implemented as a nested model comparison:
- Reduced model: \(y = X\beta + \varepsilon\).
- Full model: \(y = X\beta + f(w)\theta + \varepsilon\).
- An \(F\)-test evaluates whether the auxiliary regressors \(f(w)\) improve fit.
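The nested comparison maps directly onto anova in base R. The sketch below uses simulated data and the linear and quadratic choices of \(f(w)\) described above.

```r
# PS1 / PS1q sketch as nested-model F-tests
set.seed(4)
n <- 400
x <- runif(n)
w <- runif(n, 0.5, 2)
y <- 1 + x + rnorm(n)

reduced   <- lm(y ~ x)
full_lin  <- lm(y ~ x + w)               # PS1:  f(w) = w
full_quad <- lm(y ~ x + w + I(w^2))      # PS1q: f(w) = (w, w^2)

p_lin  <- anova(reduced, full_lin)$`Pr(>F)`[2]
p_quad <- anova(reduced, full_quad)$`Pr(>F)`[2]
```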

results <- wa_test(fit, type = "PS1")
print(results)
#> 
#> Pfeffermann-Sverchkov WA Test 1
#> F = 2.4024  df1 = 7  df2 = 21913  p-value = 0.0186

Pfeffermann–Sverchkov PS2

The PS2 test (Pfeffermann & Sverchkov, 2003) is a two‑step procedure:

  1. Regress weights on covariates: \[w = X\alpha + \eta, \quad \hat w = X\hat\alpha.\]

  2. Augment the outcome regression with \(\hat w\): \[y = X\beta + g(\hat w)\theta + \varepsilon.\]

  • As with PS1, quadratic terms (\(\hat w, \hat w^2\)) can be included (PS2q).
  • Under \(H_0\), \(\theta = 0\).
  • The test statistic is again an \(F\)-test comparing reduced and full models.

This approach conditions out the part of the weights predictable from \(X\), focusing on whether the residual informativeness of weights matters for \(y\).
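A two-covariate sketch of the two-step procedure is below. Note that \(\hat w\) is itself a linear combination of the columns of \(X\), so a purely linear \(g(\hat w)\) term would be collinear with the design; this sketch therefore uses \(g(\hat w) = \hat w^2\). The package's exact choice of \(g\) may differ.

```r
# PS2 sketch: condition out the X-predictable part of the weights
set.seed(5)
n <- 400
x1 <- runif(n)
x2 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)
w  <- exp(0.2 * x1 + 0.1 * x2 + rnorm(n, sd = 0.3))

w_hat   <- fitted(lm(w ~ x1 + x2))        # step 1: regress weights on covariates
reduced <- lm(y ~ x1 + x2)
full    <- lm(y ~ x1 + x2 + I(w_hat^2))   # step 2: augment with g(w_hat)
p_value <- anova(reduced, full)$`Pr(>F)`[2]
```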

results <- wa_test(fit, type = "PS2")
print(results)
#> 
#> Pfeffermann-Sverchkov WA Test 2
#> F = 4.6786  df1 = 2  df2 = 21918  p-value = 0.0093

Wu–Fuller (WF)

The WF test (Wu & Fuller, 2005) is a more elaborate WA test that can be seen as a bridge between DC and WA frameworks.

  • Let \(\hat\beta_W\) and \(\hat\beta_U\) be the weighted and unweighted estimators.
  • Define the difference: \[d = \hat\beta_W - \hat\beta_U.\]
  • Construct the quadratic form: \[T = d^\top \widehat{\mathrm{Var}}(d)^{-1} d.\]

Unlike the Hausman–Pfeffermann DC test, the WF test is framed as an \(F\)-test in the WA family, but the intuition is similar: if weights are non‑informative, the two estimators should not differ systematically.

results <- wa_test(fit, type = "WF")
print(results)
#> 
#> Wu-Fuller Weight-Association Test
#> F = 1.9663  df1 = 5  df2 = 21915  p-value = 0.0802

Estimating Equations Test

The Estimating Equations (EE) test of Pfeffermann & Sverchkov (2003) provides a third diagnostic framework for assessing weight informativeness. Unlike the Difference‑in‑Coefficients and Weight‑Association tests, which compare regression fits directly, the EE test works at the level of the score equations that define regression estimators.

For linear regression with Gaussian errors and identity link, the unweighted OLS estimator \(\hat\beta_{\text{unw}}\) solves the estimating equations \[\sum_{i=1}^n u_i = 0, \qquad u_i = x_i \bigl(y_i - x_i^\top \hat\beta_{\text{unw}}\bigr).\]

To adjust for potential informativeness of weights, define \[q_i = \frac{w_i}{E_s(w_i \mid x_i)}, \qquad R_i = (1 - q_i) u_i,\]

where \(E_s(w_i \mid x_i)\) is estimated by regressing \(w\) (or \(\log w\)) on \(X\). Let \[\bar R = \frac{1}{n} \sum_{i=1}^n R_i, \qquad S = \frac{1}{n-1} \sum_{i=1}^n (R_i - \bar R)(R_i - \bar R)^\top.\]

The Hotelling-type test statistic is \[F = \frac{n-p}{p} \, \bar R^\top S^{-1} \bar R,\]

with numerator degrees of freedom \(p\) (the number of tested coefficients) and denominator degrees of freedom \(n-p\).
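The formulas above can be transcribed directly on simulated data, with \(E_s(w_i \mid x_i)\) estimated by a linear regression of \(w\) on \(X\); all names below are illustrative.

```r
# EE test sketch: Hotelling-type statistic on the adjusted scores
set.seed(6)
n <- 400
x <- runif(n)
y <- 1 + x + rnorm(n)
w <- runif(n, 0.5, 2)

X <- cbind(1, x)
beta_u <- coef(lm(y ~ x))
u <- X * drop(y - X %*% beta_u)      # score contributions u_i = x_i * e_i
q <- w / fitted(lm(w ~ x))           # q_i = w_i / E_s(w_i | x_i)
R <- (1 - q) * u                     # R_i = (1 - q_i) u_i

Rbar <- colMeans(R)
S <- cov(R)
p <- ncol(X)
F_stat  <- (n - p) / p * drop(t(Rbar) %*% solve(S, Rbar))
p_value <- pf(F_stat, p, n - p, lower.tail = FALSE)
```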

linear_results <- estim_eq_test(fit, q_method = "linear")
print(linear_results)
#> 
#> Pfeffermann-Sverchkov Estimating Equations Test (linear case)
#> F = 2.0305  df1 = 6  df2 = 21920  p-value = 0.0581

log_results <- estim_eq_test(fit, q_method = "log")
print(log_results)
#> 
#> Pfeffermann-Sverchkov Estimating Equations Test (log case)
#> F = 3.3232  df1 = 6  df2 = 21920  p-value = 0.0028

Permutation Tests

The permutation tests implemented in svytest provide a non-parametric approach to assessing the informativeness of survey weights. These tests do not rely on asymptotic distributions and instead use the empirical distribution of a test statistic under random permutations of the data. The null hypothesis is that, conditional on covariates \(X\), the survey weights \(w\) are non‑informative with respect to the outcome \(y\). Under \(H_0\), permuting the weights should not change the distribution of any statistic that measures the effect of weighting.
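The permutation logic can be sketched by hand: compute a statistic measuring the effect of weighting, then recompute it under random shuffles of the weights. The statistic below (difference in mean fitted values between weighted and unweighted fits) is a simplified stand-in for the package's pred_mean statistic.

```r
# Permutation sketch: shuffle weights, recompute the weighting effect
set.seed(7)
n <- 300
B <- 200
x <- runif(n)
y <- 1 + x + rnorm(n)
w <- runif(n, 0.5, 2)

Xmat <- cbind(1, x)
mean_u <- mean(Xmat %*% coef(lm(y ~ x)))          # unweighted predicted mean

weff <- function(wts) {
  mean(Xmat %*% coef(lm(y ~ x, weights = wts))) - mean_u
}
obs <- weff(w)
null_dist <- replicate(B, weff(sample(w)))        # permute weights under H0
p_value <- (1 + sum(abs(null_dist) >= abs(obs))) / (B + 1)
```

The add-one correction in the p-value keeps it strictly positive, a standard convention for Monte Carlo permutation tests.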

perm_mean_results <- perm_test(fit, stat = "pred_mean", B = 1000, engine = "R")
print(perm_mean_results)
#> 
#> Permutation test for weight informativeness (pred_mean)
#> Observed = 6856.7709  Null = 6888.7717  Effect = -31.4661  p-value = 0.0500

library(Rcpp)
perm_mahal_results <- perm_test(fit, stat = "coef_mahal", B = 1000, engine = "C++")
print(perm_mahal_results)
#> 
#> Permutation test for weight informativeness (coef_mahal)
#> Observed = 80517058.4472  Null = 0.0000  Effect = 45326375.8879  p-value = 0.0969

Simulation Study: Replication of Wang et al. (2023)

To evaluate the finite-sample performance of the diagnostic tests implemented in the package, we replicated the first simulation study of Wang et al. (2023) and extended it to include our non-parametric permutation tests.

Design

Following Wang et al. (2023), we generated a finite population of size \(N=3000\) from the linear model \(Y_i = 1 + X_i + \varepsilon_i, \quad i=1,\ldots,N,\) where \(X_i \sim \text{Uniform}(0,1)\) and \(\varepsilon_i \sim N(0,\sigma^2)\) with \(\sigma \in \{0.1, 0.2\}\). Samples of size \(n \in \{100,200\}\) were drawn with probability proportional to weights \(W_i = \alpha Y_i + 0.3 X_i + \delta U_i,\) where \(U_i \sim N(0,1)\), \(\delta \in \{1,1.5\}\), and \(\alpha \in \{0,0.2,0.4,0.6\}\) controls the informativeness of the weights. When \(\alpha=0\), weights are non-informative.
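One replicate of this design can be generated as follows. The shift to keep the size measures positive, the use of sample(..., prob = ...) as a stand-in for exact PPS sampling, and the inverse-size weight definition are simplifications, not necessarily the choices made by Wang et al.

```r
# One replicate of the simulation design (illustrative parameter settings)
set.seed(8)
N <- 3000; n <- 100
sigma <- 0.2; delta <- 1; alpha <- 0.4

X <- runif(N)
Y <- 1 + X + rnorm(N, sd = sigma)
U <- rnorm(N)
Wt <- alpha * Y + 0.3 * X + delta * U
Wt <- Wt - min(Wt) + 0.1                   # shift so size measures are positive
idx <- sample(N, n, prob = Wt)             # draw proportional to size
samp <- data.frame(y = Y[idx], x = X[idx],
                   w = sum(Wt) / (n * Wt[idx]))   # approximate inverse-probability weights
```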

For each configuration \((n,\sigma,\delta,\alpha)\), we generated \(1000\) samples and applied the following tests: DD, PS1, PS1q, PS2, PS2q, HP, PS3, and the permutation tests with the predicted-mean (pred_mean) and coefficient Mahalanobis (coef_mahal) statistics.

A test was deemed to reject if its \(p\)-value fell below the nominal level \(0.05\). The empirical rejection rate across replications estimates the size of the test (when \(\alpha=0\)) and its power (when \(\alpha>0\)).

Results

Consistent with Wang et al. (2023), the PS2 test exhibited the highest power across most configurations, followed by DD, HP, and WF. The PS3 (estimating equations) test was generally less powerful. Our added permutation tests showed competitive performance: the coefficient Mahalanobis statistic was particularly sensitive when informativeness manifested through slope differences, while the predicted mean statistic was more sensitive to shifts in overall prediction levels. Both maintained nominal size when \(\alpha = 0\).

These results highlight that permutation-based diagnostics can serve as robust, distribution-free alternatives to parametric tests, complementing the existing DC and WA families.

Empirical rejection rates (Study 1 design). Columns correspond to diagnostic tests; rows to simulation scenarios.
n sigma delta alpha DD PS1 PS1q PS2 PS2q HP PS3 pred_mean coef_mahal
100 0.1 1.5 0.0 0.048 0.072 0.071 0.071 0.070 0.048 0.021 0.038 0.048
100 0.1 1.5 0.2 0.076 0.076 0.080 0.096 0.098 0.072 0.047 0.056 0.062
100 0.1 1.5 0.4 0.102 0.118 0.135 0.143 0.122 0.100 0.083 0.079 0.073
100 0.1 1.5 0.6 0.183 0.174 0.193 0.228 0.200 0.180 0.136 0.141 0.105
100 0.1 1.0 0.0 0.029 0.063 0.073 0.062 0.063 0.029 0.031 0.026 0.035
100 0.1 1.0 0.2 0.066 0.097 0.114 0.117 0.097 0.061 0.049 0.051 0.052
100 0.1 1.0 0.4 0.184 0.179 0.236 0.286 0.212 0.178 0.160 0.117 0.091
100 0.1 1.0 0.6 0.377 0.316 0.383 0.458 0.364 0.375 0.307 0.208 0.160
100 0.2 1.5 0.0 0.047 0.068 0.060 0.052 0.067 0.045 0.032 0.047 0.043
100 0.2 1.5 0.2 0.090 0.099 0.111 0.104 0.099 0.088 0.063 0.082 0.073
100 0.2 1.5 0.4 0.245 0.238 0.246 0.286 0.264 0.242 0.190 0.246 0.186
100 0.2 1.5 0.6 0.588 0.502 0.500 0.577 0.549 0.580 0.499 0.543 0.437
100 0.2 1.0 0.0 0.058 0.088 0.087 0.072 0.077 0.058 0.039 0.036 0.048
100 0.2 1.0 0.2 0.150 0.181 0.204 0.213 0.201 0.145 0.127 0.133 0.111
100 0.2 1.0 0.4 0.577 0.471 0.479 0.573 0.537 0.569 0.492 0.487 0.385
100 0.2 1.0 0.6 0.926 0.839 0.843 0.905 0.889 0.924 0.892 0.840 0.705
200 0.1 1.5 0.0 0.033 0.068 0.075 0.062 0.069 0.032 0.035 0.044 0.041
200 0.1 1.5 0.2 0.075 0.086 0.082 0.102 0.082 0.075 0.071 0.056 0.060
200 0.1 1.5 0.4 0.157 0.154 0.189 0.214 0.175 0.155 0.136 0.133 0.106
200 0.1 1.5 0.6 0.327 0.282 0.321 0.409 0.329 0.323 0.295 0.278 0.200
200 0.1 1.0 0.0 0.043 0.080 0.082 0.062 0.086 0.043 0.027 0.028 0.042
200 0.1 1.0 0.2 0.119 0.152 0.193 0.208 0.162 0.115 0.102 0.087 0.085
200 0.1 1.0 0.4 0.358 0.301 0.399 0.467 0.350 0.353 0.316 0.254 0.198
200 0.1 1.0 0.6 0.670 0.574 0.677 0.735 0.631 0.669 0.631 0.490 0.357
200 0.2 1.5 0.0 0.057 0.065 0.065 0.052 0.059 0.055 0.031 0.046 0.046
200 0.2 1.5 0.2 0.149 0.141 0.145 0.168 0.149 0.148 0.125 0.138 0.114
200 0.2 1.5 0.4 0.552 0.447 0.455 0.538 0.519 0.548 0.486 0.539 0.429
200 0.2 1.5 0.6 0.905 0.824 0.832 0.870 0.860 0.901 0.895 0.876 0.776
200 0.2 1.0 0.0 0.046 0.082 0.094 0.069 0.076 0.044 0.040 0.045 0.041
200 0.2 1.0 0.2 0.309 0.267 0.293 0.341 0.313 0.307 0.257 0.286 0.204
200 0.2 1.0 0.4 0.898 0.816 0.852 0.879 0.853 0.896 0.871 0.834 0.734
200 0.2 1.0 0.6 0.998 0.989 0.993 0.992 0.994 0.998 0.998 0.994 0.980

References