Type: Package
Title: Survey Weight Diagnostic Tests
Version: 1.1.0
Description: Provides diagnostic tests for assessing the informativeness of survey weights in regression models. Implements difference-in-coefficients tests (Hausman 1978 <doi:10.2307/1913827>; Pfeffermann 1993 <doi:10.2307/1403631>), weight-association tests (DuMouchel and Duncan 1983 <doi:10.2307/2288185>; Pfeffermann and Sverchkov 1999 https://www.jstor.org/stable/25051118; Pfeffermann and Sverchkov 2003 <ISBN:9780470845672>; Wu and Fuller 2005 https://www.jstor.org/stable/27590461), estimating equations tests (Pfeffermann and Sverchkov 2003 <ISBN:9780470845672>), and non-parametric permutation tests. Includes simulation utilities replicating Wang et al. (2023 <doi:10.1111/insr.12509>) and extensions.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Depends: R (≥ 4.1.0)
Imports: Rcpp, survey
LinkingTo: Rcpp, RcppArmadillo
Suggests: knitr, MASS, rmarkdown, sampling, dplyr, tidyr, tibble, future.apply, broom, testthat (≥ 3.0.0)
VignetteBuilder: knitr
RoxygenNote: 7.3.3
Config/testthat/edition: 3
NeedsCompilation: yes
Packaged: 2025-10-27 14:57:19 UTC; cnlub
Author: Corbin Lubianski [aut, cre, cph]
Maintainer: Corbin Lubianski <cnlubianski@yahoo.com>
Repository: CRAN
Date/Publication: 2025-10-30 20:20:02 UTC

svytest: Survey Weight Regression Diagnostics

Description

The **svytest** package provides diagnostic tools for assessing the role of survey weights in regression modeling. It implements formal tests such as the Hausman-Pfeffermann Difference-in-Coefficients test, permutation-based diagnostics, and Wald-type assessments of weight informativeness. The package also includes curated datasets and reproducible workflows for applied survey analysis.

Details

The package builds on the general Hausman specification test (Hausman, 1978) and its adaptation to survey weights by Pfeffermann (1993). It also incorporates subsequent developments in testing for informative weights, weight trimming, and model diagnostics in complex survey settings (e.g., DuMouchel & Duncan, 1983; Wu & Fuller, 2005; Asparouhov & Muthen, 2007; Breidt et al., 2013; Wang et al., 2023). These methods allow researchers to formally assess whether survey weights materially affect regression estimates, and to evaluate the robustness of analytic inferences under complex sampling.

Functions

Datasets

Author(s)

Maintainer: Corbin Lubianski cnlubianski@yahoo.com (ORCID) [copyright holder]

References

Asparouhov, T. & B. Muthen (2007). Testing for informative weights and weight trimming in multivariate modeling with survey data. *Mplus Web Notes*, 2, 3394-3399. https://www.statmodel.com/download/JSM2007000745.pdf

Bollen, K. A., Biemer, P. P., Karr, A. F., Tueller, S., & Berzofsky, M. E. (2016). Are Survey Weights Needed? A Review of Diagnostic Tests in Regression Analysis. *Annual Review of Statistics and Its Application*, 3, 375-392. doi:10.1146/annurev-statistics-011516-012958

Breidt, F. J., Opsomer, J. D., Herndon, W., Cao, R., & Francisco-Fernández, M. (2013). Testing for informativeness in analytic inference from complex surveys. *Proceedings of the 59th ISI World Statistics Congress*, 889-893.

DuMouchel, W. H., & Duncan, G. J. (1983). Using Sample Survey Weights in Multiple Regression Analyses of Stratified Samples. *Journal of the American Statistical Association*, 78, 535-543.

Hausman, J. A. (1978). Specification Tests in Econometrics. *Econometrica*, 46(6), 1251-1271. doi:10.2307/1913827

Pfeffermann, D. (1993). The Role of Sampling Weights When Modeling Survey Data. *International Statistical Review*, 61(2), 317-337. doi:10.2307/1403631

Pfeffermann, D. & Nathan, G. (1985). Problems in model identification based on data from complex sample surveys. *Bulletin of the International Statistical Institute*, 51(12.2), 1-12.

Pfeffermann, D. & Sverchkov, M. (1999). Parametric and Semi-Parametric Estimation of Regression Models Fitted to Survey Data. *Sankhya: The Indian Journal of Statistics, Series B*, 61(1), 166-186.

Pfeffermann, D. & Sverchkov, M. (2003). Fitting generalized linear models under informative sampling. In: *Analysis of Survey Data*. Chichester, UK: John Wiley & Sons, Ltd., pp. 175-195.

Pfeffermann, D. & Sverchkov, M. (2007). Small area estimation under informative probability sampling of areas and within the selected areas. *Journal of the American Statistical Association*, 102(480), 1427-1439.

Toth, D. (2021). rpms: Recursive Partitioning for Modeling Survey Data. R package version 0.5.1. https://CRAN.R-project.org/package=rpms

Wang, F., Wang, H., & Yan, J. (2023). Diagnostic Tests for the Necessity of Weight in Regression With Survey Data. *International Statistical Review*, 91(1), 55-71.

Wu, Y., & Fuller, W. A. (2005). Preliminary testing procedures for regression with survey samples. *Proceedings of the Joint Statistical Meetings, Survey Research Methods Section*, 3683-3688.

See Also

diff_in_coef_test, wa_test, perm_test, svytestCE


Internal: check full rank before solving

Description

Internal: check full rank before solving

Usage

.check_full_rank(M, name = "design matrix")

Internal helper: Mahalanobis coefficient distance

Description

Computes the Mahalanobis distance between two coefficient vectors using a supplied precision matrix.

Usage

.coef_mahal_stat(beta, beta0, XtX)

Arguments

beta

Estimated coefficient vector.

beta0

Baseline coefficient vector.

XtX

Precision matrix (X'X).

Value

Numeric scalar distance.


Internal helper: Evaluate one permutation

Description

Computes the test statistic for a single permutation of weights.

Usage

.perm_eval_one(idx, X, y, w_use, fit_null_beta, fit_null_XtX, stat)

Arguments

idx

Integer index vector for permuted weights.

X

Numeric design matrix.

y

Numeric response vector.

w_use

Normalized weight vector.

fit_null_beta

Baseline coefficient vector.

fit_null_XtX

Baseline precision matrix.

stat

Character string, statistic type.

Value

Numeric scalar statistic.


Internal helper: Predicted mean statistic

Description

Computes the mean of fitted values given design matrix and coefficients.

Usage

.pred_mean_stat(X, beta)

Arguments

X

Numeric design matrix.

beta

Coefficient vector.

Value

Numeric scalar, mean of predictions.


Internal helper: Weighted least squares fit

Description

Performs a weighted least squares regression using QR decomposition, with a generalized inverse fallback if the system is singular.

Usage

.wls_fit(X, y, w)

Arguments

X

Numeric design matrix.

y

Numeric response vector.

w

Numeric vector of weights.

Value

A list with elements beta, mu, resid, sigma2, and XtX.
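
As a rough illustration of the QR approach described above, the sketch below fits weighted least squares by scaling the rows of X and y by the square root of the weights. The function name wls_qr_sketch is hypothetical, the definition of XtX as X'WX is an assumption, and the generalized inverse fallback is omitted; this is not the package's internal code.

```r
# Hypothetical sketch of a QR-based weighted least squares fit (base R only).
wls_qr_sketch <- function(X, y, w) {
  sw <- sqrt(w)                      # row scaling turns WLS into OLS
  qrx <- qr(sw * X)                  # QR of the scaled design matrix
  beta <- qr.coef(qrx, sw * y)       # coefficients from the scaled problem
  mu <- drop(X %*% beta)             # fitted values on the original scale
  resid <- y - mu
  sigma2 <- sum(w * resid^2) / (nrow(X) - qrx$rank)
  list(beta = beta, mu = mu, resid = resid, sigma2 = sigma2,
       XtX = crossprod(X, w * X))    # assumption: XtX means X'WX here
}

# Simulated data, for illustration only
set.seed(10)
n <- 100
x <- rnorm(n)
X <- cbind(Intercept = 1, x = x)
y <- 1 + 2 * x + rnorm(n)
w <- runif(n, 0.5, 2)

fit <- wls_qr_sketch(X, y, w)
ref <- lm(y ~ x, weights = w)        # reference fit from stats::lm
all.equal(unname(fit$beta), unname(coef(ref)))
```

Comparing against lm() with a weights argument is a convenient sanity check, since both solve the same weighted normal equations.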


Difference-in-Coefficients Test for Survey Weights

Description

Implements the Hausman-Pfeffermann Difference-in-Coefficients test to assess whether survey weights significantly affect regression estimates.

Usage

diff_in_coef_test(
  model,
  lower.tail = FALSE,
  var_equal = TRUE,
  robust_type = c("HC0", "HC1", "HC2", "HC3"),
  coef_subset = NULL,
  na.action = stats::na.omit
)

## S3 method for class 'diff_in_coef_test'
print(x, ...)

## S3 method for class 'diff_in_coef_test'
summary(object, ...)

## S3 method for class 'diff_in_coef_test'
tidy(x, ...)

## S3 method for class 'diff_in_coef_test'
glance(x, ...)

Arguments

model

An object of class svyglm.

lower.tail

Logical; passed to pchisq().

var_equal

Logical; assume equal residual variance between models. If FALSE, a heteroskedasticity-robust variance estimator is used.

robust_type

Character; type of heteroskedasticity-robust variance estimator to use if var_equal = FALSE. Options are "HC0", "HC1", "HC2", "HC3", as used in the 'sandwich' package.

coef_subset

Character vector of coefficient names to include in the test. Defaults to all coefficients.

na.action

Function to handle missing data before fitting the test.

x

An object of class diff_in_coef_test

...

Additional arguments passed to methods

object

An object of class diff_in_coef_test

Details

Let X denote the design matrix and y the response vector. Define the unweighted OLS estimator

\hat\beta_{U} = (X^\top X)^{-1} X^\top y,

and the survey-weighted estimator

\hat\beta_{W} = (X^\top W X)^{-1} X^\top W y,

where W = \mathrm{diag}(w_1, \ldots, w_n) is the diagonal matrix of survey weights.

The test statistic is based on the difference

d = \hat\beta_{W} - \hat\beta_{U}.

Under the null hypothesis that weights are not informative, d has mean zero and variance V_d. The test statistic is

T = d^\top V_d^{-1} d,

which is asymptotically \chi^2_p distributed with p equal to the number of coefficients tested.

If var_equal = TRUE, V_d is estimated assuming equal residual variance across weighted and unweighted models. If var_equal = FALSE, a heteroskedasticity-robust estimator (e.g. HC0–HC3) is used.

This test is a survey-weighted adaptation of the Hausman specification test (Hausman, 1978), as proposed by Pfeffermann (1993).
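
The quantities above can be sketched in base R on simulated data. The variance used here, V_d = \sigma^2 (A - B)(A - B)^\top with A = (X^\top W X)^{-1} X^\top W and B = (X^\top X)^{-1} X^\top, is one common equal-variance form and is an assumption for illustration; the package's own estimator of V_d may differ. All object names are hypothetical.

```r
# Illustrative sketch only: d, V_d, and T under an equal-variance assumption.
set.seed(1)
n <- 200
x <- rnorm(n)
X <- cbind(1, x)
y <- 1 + 2 * x + rnorm(n)
w <- runif(n, 0.5, 2)                        # simulated weights, unrelated to y

A <- solve(crossprod(X, w * X), t(w * X))    # (X'WX)^{-1} X'W
B <- solve(crossprod(X), t(X))               # (X'X)^{-1} X'
beta_w <- drop(A %*% y)                      # weighted estimator
beta_u <- drop(B %*% y)                      # unweighted estimator
d <- beta_w - beta_u                         # difference in coefficients

# Residual variance from the unweighted fit (assumed common to both models)
sigma2 <- sum((y - X %*% beta_u)^2) / (n - ncol(X))
V_d <- sigma2 * tcrossprod(A - B)            # Var(d) when Var(y) = sigma2 * I

T_stat <- drop(t(d) %*% solve(V_d, d))       # T = d' V_d^{-1} d
p_value <- pchisq(T_stat, df = length(d), lower.tail = FALSE)
c(statistic = T_stat, df = length(d), p.value = p_value)
```

Because d = (A - B) y, the product form of V_d is positive semi-definite by construction, which avoids numerical trouble when inverting it.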

Value

An object of class "diff_in_coef_test" containing:

statistic

Chi-squared test statistic

parameter

Degrees of freedom

p.value

P-value for the test

betas_unweighted

Unweighted coefficient estimates

betas_weighted

Weighted coefficient estimates

vcov_diff

Estimated variance-covariance matrix of coefficient differences

diff_betas

Vector of coefficient differences

call

Function call

References

Hausman, J. A. (1978). Specification Tests in Econometrics. *Econometrica*, 46(6), 1251-1271. doi:10.2307/1913827

Pfeffermann, D. (1993). The Role of Sampling Weights When Modeling Survey Data. *International Statistical Review*, 61(2), 317-337. doi:10.2307/1403631

See Also

svytestCE for the curated Consumer Expenditure dataset included in this package, which can be used to demonstrate the Difference-in-Coefficients test.

Examples

# Load in survey package (required) and load in example data
library(survey)
data(api, package = "survey")

# Create a survey design and fit a weighted regression model
des <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat)
fit <- svyglm(api00 ~ ell + meals, design = des)

# Run the difference-in-coefficients diagnostic test under different variance assumptions;
# each call reports the chi-squared statistic, df, and p-value
summary(diff_in_coef_test(fit, var_equal = TRUE))
summary(diff_in_coef_test(fit, var_equal = FALSE, robust_type = "HC3"))


Estimating Equations Test for Informative Sampling (Linear Case)

Description

Implements the Pfeffermann-Sverchkov estimating equations test for informativeness of survey weights in the linear regression case (Gaussian with identity link). The test compares unweighted estimating equations to adjusted-weight equations using q_i = w_i / E_s(w_i \mid x_i).

Usage

estim_eq_test(
  model,
  coef_subset = NULL,
  q_method = c("linear", "log"),
  stabilize = TRUE,
  na.action = stats::na.omit
)

## S3 method for class 'estim_eq_test'
print(x, ...)

## S3 method for class 'estim_eq_test'
summary(object, ...)

## S3 method for class 'estim_eq_test'
tidy(x, ...)

## S3 method for class 'estim_eq_test'
glance(x, ...)

Arguments

model

An object of class svyglm with family = gaussian(identity).

coef_subset

Optional character vector of coefficient names to include in the test. Defaults to all coefficients in the model matrix.

q_method

Method for estimating E_s(w \mid x): "linear" (default, OLS regression of w on X) or "log" (regress log(w) on X, then exponentiate).

stabilize

Logical; if TRUE (default) clips extreme q values.

na.action

Function to handle missing data.

x

An object of class estim_eq_test

...

Additional arguments passed to methods

object

An object of class estim_eq_test

Details

For linear regression, the per-observation score is u_i = x_i (y_i - x_i^\top \hat\beta_{\text{unw}}) at the unweighted OLS estimate. The test statistic is based on R_i = (1 - q_i) u_i, where q_i = w_i / E_s(w_i \mid x_i). The Hotelling F statistic is F = \frac{n-p}{p} \bar R^\top S^{-1} \bar R, with df1 = p, df2 = n - p.
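
A minimal base R sketch of this statistic on simulated data follows. It assumes the "linear" method for E_s(w \mid x), takes S to be the usual sample covariance from cov(), and skips the stabilization step; these choices and all object names are assumptions, not the package's implementation.

```r
# Illustrative sketch only: Hotelling-type F statistic for the estimating equations test.
set.seed(2)
n <- 300
x <- runif(n, -1, 1)
X <- cbind(1, x)
p <- ncol(X)
y <- 1 + 0.5 * x + rnorm(n)
w <- 2 + 0.5 * x + runif(n, -0.3, 0.3)       # simulated weights depending on x only

beta_u <- solve(crossprod(X), crossprod(X, y))          # unweighted OLS
e <- drop(y - X %*% beta_u)                             # OLS residuals
w_hat <- drop(X %*% solve(crossprod(X), crossprod(X, w)))  # linear fit of w on X
q <- w / w_hat                                          # adjusted weights q_i

R <- (1 - q) * (X * e)                 # row i is (1 - q_i) * x_i * e_i
Rbar <- colMeans(R)
S <- cov(R)                            # assumption: divisor n - 1
F_stat <- (n - p) / p * drop(t(Rbar) %*% solve(S, Rbar))
p_value <- pf(F_stat, df1 = p, df2 = n - p, lower.tail = FALSE)
c(statistic = F_stat, df1 = p, df2 = n - p, p.value = p_value)
```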

Value

An object of class "estim_eq_test" containing:

statistic

Hotelling F statistic

p.value

p-value under F distribution

df1

Numerator df (# tested coefficients)

df2

Denominator df (n - p)

Rbar

Mean estimating-equation contrast vector

S

Sample covariance of R

terms

Names of tested coefficients

n

Sample size

call

Matched call

method

Description string

References

Pfeffermann, D., & Sverchkov, M. Y. (2003). Fitting generalized linear models under informative sampling. In R. L. Chambers & C. J. Skinner (Eds.), *Analysis of Survey Data* (Ch. 12). Wiley.

See Also

diff_in_coef_test, wa_test, perm_test

Examples

# Load in survey package (required) and load in example data
library(survey)
data("svytestCE", package = "svytest")

# Create a survey design and fit a weighted regression model
des <- svydesign(ids = ~1, weights = ~FINLWT21, data = svytestCE)
fit <- svyglm(TOTEXPCQ ~ ROOMSQ + BATHRMQ + BEDROOMQ + FAM_SIZE + AGE, design = des)

# Run estimating equations diagnostic test; reports F statistic, df's, and p-value
results <- estim_eq_test(fit, q_method = "linear")
print(results)


Likelihood-Ratio Test for Informative Survey Weights (In production)

Description

Implements the Breidt-Herndon likelihood-ratio test for assessing whether survey weights are informative in linear regression models. The test compares maximized log-likelihoods under equal weights (null) and survey weights (alternative), with an asymptotic distribution given by a weighted chi-squared mixture.

Usage

lr_test(
  model,
  coef_subset = NULL,
  na.action = stats::na.omit,
  likelihood = c("pseudo", "scaled")
)

## S3 method for class 'lr_test'
print(x, ...)

## S3 method for class 'lr_test'
summary(object, ...)

## S3 method for class 'lr_test'
tidy(x, ...)

## S3 method for class 'lr_test'
glance(x, ...)

Arguments

model

An object of class svyglm.

coef_subset

Optional character vector of coefficient names to include in the test. Defaults to all coefficients.

na.action

Function to handle missing data before testing.

likelihood

Character string specifying the likelihood form: "pseudo" (default) for raw weighted likelihood, or "scaled" to normalize weights by their mean.

x

An object of class lr_test

...

Additional arguments passed to methods

object

An object of class lr_test

Details

The null hypothesis is that survey weights are not informative (equal weights suffice). The alternative allows weights to affect the likelihood. The asymptotic null distribution is a weighted chi-squared mixture; here we approximate the p-value using a Satterthwaite moment-matching approach.
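
For illustration, a Satterthwaite approximation replaces the mixture \sum_j \lambda_j \chi^2_1 with a scaled chi-squared a \chi^2_\nu whose first two moments match, giving a = \sum_j \lambda_j^2 / \sum_j \lambda_j and \nu = (\sum_j \lambda_j)^2 / \sum_j \lambda_j^2. The helper below is a hypothetical sketch of that calculation with made-up eigenvalues; it is not the function used inside lr_test().

```r
# Hypothetical helper: Satterthwaite p-value for a weighted chi-squared mixture.
satterthwaite_p <- function(stat, eigvals) {
  a <- sum(eigvals^2) / sum(eigvals)        # scale factor
  nu <- sum(eigvals)^2 / sum(eigvals^2)     # approximate degrees of freedom
  list(scale = a, df = nu,
       p.value = pchisq(stat / a, df = nu, lower.tail = FALSE))
}

# With equal unit eigenvalues the approximation reduces to an ordinary chi-squared
res_equal <- satterthwaite_p(stat = 5, eigvals = c(1, 1, 1))

# With unequal eigenvalues the effective df drops below the number of terms
res_mixed <- satterthwaite_p(stat = 5, eigvals = c(2, 1, 0.5))
str(res_mixed)
```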

Value

An object of class "lr_test" containing:

statistic

Likelihood-ratio test statistic (non-negative)

p.value

P-value for the test (Satterthwaite approximation)

df

Approximate degrees of freedom

eigvals

Eigenvalues of the Gamma matrix

logLik_null

Maximized log-likelihood under equal weights

logLik_alt

Maximized log-likelihood under survey weights

method

Name of the test performed

call

Function call

References

Breidt, F. J., & Opsomer, J. D. (1997). Testing for informativeness in analytic inference from complex surveys. *Survey Methodology*, 23(1), 1-11.

Herndon, J. (2022). Testing and adjusting for informative sampling in survey data. *Journal of Survey Statistics and Methodology*, 10(3), 455-480.

See Also

diff_in_coef_test, wa_test, svytestCE


Permutation test for weight informativeness in survey regression

Description

Non-parametric test that permutes survey weights (optionally within blocks) to generate the null distribution of a chosen statistic. Supports fast closed-form WLS (linear case) via C++ and a pure R engine.

Usage

perm_test(
  model,
  stat = c("pred_mean", "coef_mahal"),
  B = 1000,
  coef_subset = NULL,
  block = NULL,
  normalize = TRUE,
  engine = c("C++", "R"),
  custom_fun = NULL,
  na.action = stats::na.omit
)

## S3 method for class 'perm_test'
print(x, ...)

## S3 method for class 'perm_test'
summary(object, ...)

## S3 method for class 'perm_test'
tidy(x, ...)

## S3 method for class 'perm_test'
glance(x, ...)

Arguments

model

An object of class svyglm (the Gaussian family is currently the best-supported case).

stat

Statistic to use. Options:

  • "pred_mean": Compares the mean predicted outcome under weighted vs. unweighted regression. Simple, interpretable, and directly tied to differences in fitted population means. Sensitive to shifts in overall prediction levels caused by informative weights.

  • "coef_mahal": Computes the Mahalanobis distance between the weighted and unweighted coefficient vectors, using the unweighted precision matrix (X'X) as the metric. Captures joint shifts in regression coefficients, not just mean predictions. More powerful when informativeness manifests as changes in slopes or multiple coefficients simultaneously.

B

Number of permutations (e.g., 1000).

coef_subset

Optional character vector of coefficient names to include.

block

Optional factor defining blocks (e.g., strata) for blockwise permutations; weights are permuted only within its levels.

normalize

Logical; if TRUE (default), normalize weights to have mean 1.

engine

"C++" for fast WLS or "R" for pure R loop.

custom_fun

Optional function(model, X, y, wts) -> scalar statistic (overrides stat).

na.action

Function to handle missing data.

x

An object of class perm_test

...

Additional arguments passed to methods

object

An object of class perm_test

Details

This procedure implements a non‑parametric randomization test for the informativeness of survey weights. The null hypothesis is that, conditional on the covariates X, the survey weights w are non‑informative with respect to the outcome y. Under this null, permuting the weights across observations should not change the distribution of any statistic that measures the effect of weighting.

The algorithm is:

  1. Fit the unweighted regression

    \hat\beta_{U} = (X^\top X)^{-1} X^\top y

    and the weighted regression

    \hat\beta_{W} = (X^\top W X)^{-1} X^\top W y,

    where W = \mathrm{diag}(w).

  2. Compute the observed test statistic T_{\mathrm{obs}}:

    • For "pred_mean": the difference in mean predicted outcomes between weighted and unweighted fits.

    • For "coef_mahal": the Mahalanobis distance

      T = (\hat\beta_{W} - \hat\beta_{U})^\top (X^\top X)(\hat\beta_{W} - \hat\beta_{U}),

      using the unweighted precision matrix as the metric.

    • For a user‑supplied custom_fun, any scalar function of (X,y,w).

  3. Generate the null distribution by permuting the weights:

    w^{*(b)} = P_b w, \quad b=1,\ldots,B,

    where each P_b is a permutation matrix. If a block factor is supplied, permutations are restricted within block levels.

  4. Recompute the test statistic T^{*(b)} for each permuted weight vector. The empirical distribution of T^{*(b)} represents the null distribution under non‑informative weights.

  5. The two‑sided permutation p‑value is

    p = \frac{1 + \sum_{b=1}^B I\{|T^{*(b)} - T_0| \ge |T_{\mathrm{obs}} - T_0|\}} {B+1},

    where T_0 is the baseline statistic under equal weights.

Intuitively, if the weights are informative, the observed statistic will lie in the tails of the permutation distribution, leading to a small p‑value. If the weights are non‑informative, shuffling them destroys any spurious association with the outcome, and the observed statistic is typical of the permutation distribution.
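
The steps above can be sketched in a few lines of base R for the "coef_mahal" statistic. This is a simplified illustration on simulated data without blocking; the helper names wls_beta and mahal are hypothetical, and the code is not the package's R or C++ engine.

```r
# Illustrative sketch only: permutation test with a Mahalanobis-type statistic.
set.seed(3)
n <- 150
x <- rnorm(n)
X <- cbind(1, x)
y <- 1 + 2 * x + rnorm(n)
w <- runif(n, 0.5, 2)
w <- w / mean(w)                               # normalize weights to mean 1

# Closed-form WLS coefficients
wls_beta <- function(X, y, w) {
  drop(solve(crossprod(X, w * X), crossprod(X, w * y)))
}

XtX <- crossprod(X)                            # metric from the unweighted fit
beta_u <- wls_beta(X, y, rep(1, n))            # step 1: unweighted estimate

# Distance between weighted and unweighted coefficients for a given weight vector
mahal <- function(w) {
  d <- wls_beta(X, y, w) - beta_u
  drop(t(d) %*% XtX %*% d)
}

T_obs <- mahal(w)                              # step 2: observed statistic
T_0 <- mahal(rep(1, n))                        # baseline under equal weights
B <- 199
T_perm <- replicate(B, mahal(sample(w)))       # steps 3-4: permute and recompute

# Step 5: two-sided permutation p-value centered at the baseline
p_value <- (1 + sum(abs(T_perm - T_0) >= abs(T_obs - T_0))) / (B + 1)
p_value
```

Adding 1 to both the numerator and the denominator keeps the p-value strictly positive, which is the usual convention for Monte Carlo permutation tests.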

Value

An object of class "perm_test" with fields:

stat_obs

Observed statistic with actual weights

stat_null

Baseline statistic under equal weights (for centering)

perm_stats

Vector of permutation statistics

p.value

Permutation p-value (two-sided, centered at baseline)

effect

Observed minus median of permutation stats

stat

Statistic name

B

Number of permutations

call

Matched call

method

Description string

Examples

# Load in survey package (required) and load in example data
library(survey)
data(api, package = "survey")

# Create a survey design and fit a weighted regression model
des <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat)
fit <- svyglm(api00 ~ ell + meals, design = des)

# Run permutation diagnostic test; reports permutation statistics with p-value
results <- perm_test(fit, stat = "pred_mean", B = 1000, engine = "R")
print(results)


Plot method for permutation test objects

Description

Produces a histogram of the permutation distribution with a vertical line indicating the observed statistic.

Usage

## S3 method for class 'perm_test'
plot(x, bins = 30, col = "lightgray", line_col = "red", ...)

Arguments

x

An object of class "perm_test".

bins

Number of histogram bins (default 30).

col

Color for histogram bars (default "lightgray").

line_col

Color for observed statistic line (default "red").

...

Additional arguments passed to hist().

Value

Called for its side effect of drawing a base R plot. The function returns NULL invisibly.


Pfeffermann-Nathan Predictive Power Test (svyglm, K-fold CV, fold-mean option) (In production)

Description

Implements the predictive power test following Wang et al. (2023, Sec. 2.2): split observations into estimation and validation sets; fit unweighted and weighted linear regressions on the estimation set; compute validation squared-error differences D_i = (y_i - \hat y_{u,i})^2 - (y_i - \hat y_{w,i})^2; test H_0: E[D_i] = 0 with Z = \bar D / (s_D / \sqrt{n_V}). Supports K-fold CV (default) and a "fold-mean" option to reduce dependence among errors by using per-fold means as the test observations.
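
As a minimal sketch of the single-split version of this procedure, the code below uses stats::lm() on simulated data rather than svyglm, with a 50/50 estimation and validation split. The data, object names, and the use of lm() are assumptions for illustration only.

```r
# Illustrative sketch only: predictive power Z statistic from one random split.
set.seed(4)
n <- 400
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
w <- runif(n, 0.5, 2)                          # simulated weights
dat <- data.frame(y = y, x = x, w = w)

est <- sample(n, n / 2)                        # estimation set indices
val <- dat[-est, ]                             # validation set

fit_u <- lm(y ~ x, data = dat[est, ])                  # unweighted fit
fit_w <- lm(y ~ x, data = dat[est, ], weights = w)     # weighted fit

# Validation squared-error differences D_i
D <- (val$y - predict(fit_u, newdata = val))^2 -
     (val$y - predict(fit_w, newdata = val))^2

Z <- mean(D) / (sd(D) / sqrt(length(D)))       # Z = Dbar / (s_D / sqrt(n_V))
p_value <- 2 * pnorm(-abs(Z))                  # two-sided normal p-value
c(statistic = Z, p.value = p_value, n_val = length(D))
```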

Usage

pred_power_test(
  model,
  kfold = TRUE,
  K = 5,
  est_split = 0.5,
  use_fold_means = TRUE,
  seed = NULL
)

## S3 method for class 'pred_power_test'
print(x, ...)

## S3 method for class 'pred_power_test'
summary(object, ...)

## S3 method for class 'pred_power_test'
tidy(x, ...)

## S3 method for class 'pred_power_test'
glance(x, ...)

Arguments

model

A fitted svyglm with family = gaussian(identity).

kfold

Logical; if TRUE, use K-fold cross-validation (default TRUE).

K

Integer number of folds (default 5).

est_split

Proportion for estimation set if kfold = FALSE (default 0.5).

use_fold_means

Logical; if TRUE (default), compute one D per fold as the mean of within-fold D_i, then form Z using the K fold means. This stabilizes the test by reducing the dependence noted in Wang et al. (2023).

seed

Optional integer seed for reproducibility.

x

An object of class pred_power_test

...

Additional arguments passed to methods

object

An object of class pred_power_test

Value

An object of class "pred_power_test" with fields:

statistic

Z statistic

p.value

Two-sided p-value

mean_diff

Mean of D (fold mean if use_fold_means = TRUE)

n_val

Count of observations used in Z (K if use_fold_means = TRUE, else total validation n)

K

Number of folds (if kfold = TRUE)

method

Description string

call

Matched call


Run All Diagnostic Tests for Informative Weights

Description

This function runs all implemented diagnostic tests:

  • wa_test() with types "DD", "PS1", "PS1q", "PS2", "PS2q", "WF"

  • diff_in_coef_test()

  • estim_eq_test()

  • perm_test() with stats "pred_mean" and "coef_mahal"

Usage

run_all_diagnostic_tests(model, alpha = 0.05, B = 1000)

## S3 method for class 'run_all_diagnostic_tests'
print(x, ...)

Arguments

model

A fitted svyglm object.

alpha

Significance level used to decide rejection (default 0.05).

B

Number of permutations for permutation tests (default 1000).

x

An object of class run_all_diagnostic_tests

...

Additional arguments passed to methods

Value

A list with:

results

Data frame of test names, statistics, p-values, reject indicator

recommendation

Character string with suggested action

raw

List of raw test outputs, including permutation test objects

See Also

wa_test, diff_in_coef_test, estim_eq_test, perm_test

Examples

# Load in survey package (required) and load in example data
library(survey)
data(api, package = "survey")

# Create a survey design and fit a weighted regression model
des <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat)
fit <- svyglm(api00 ~ ell + meals, design = des)

# Run all diagnostic tests and return a list of statistics, including a recommendation
results <- run_all_diagnostic_tests(fit)
print(results)


Subset of 2015 Consumer Expenditure (CE) Dataset

Description

A curated subset of rows and columns from the Consumer Expenditure (CE) dataset that is provided in the [rpms](https://CRAN.R-project.org/package=rpms) package by Daniell Toth. This example dataset is designed for demonstration purposes within this package. Please refrain from using this dataset for inferential purposes.

Usage

svytestCE

Format

A data frame with _n_ rows and _m_ variables:

NEWID

Consumer unit identifying variable, constructed using the first seven digits of a unique identifier.

CID

Cluster Identifier for all clusters (constructed using PSU, REGION, STATE, and POPSIZE).

QINTRVMO

Month for which the data were collected.

FINLWT21

Final sample weight used to make population inferences.

STATE

State FIPS code indicating the location of the consumer unit.

REGION

Region code: 1 = Northeast, 2 = Midwest, 3 = South, 4 = West.

BLS_URBN

Indicator of urban (1) versus rural (2) residence status.

POPSIZE

Population size class of the PSU, ranging from 1 (largest) to 5 (smallest).

CUTENURE

Housing tenure: 1 = Owned with mortgage; 2 = Owned without mortgage; 3 = Owned (mortgage not reported); 4 = Rented; 5 = Occupied without cash rent; 6 = Student housing.

ROOMSQ

Number of rooms (including finished living areas but excluding bathrooms).

BATHRMQ

Number of bathrooms in the consumer unit.

BEDROOMQ

Number of bedrooms in the consumer unit.

VEHQ

Number of owned vehicles.

FAM_TYPE

Household type based on the relationship of members to the reference person; for example, 1 = Married Couple only, 2 = Married Couple with children (oldest < 6 years), 3 = Married Couple with children (oldest 6-17 years), etc.

FAM_SIZE

Number of members in the consumer unit (family size).

PERSLT18

Count of persons less than 18 years old in the consumer unit.

PERSOT64

Count of persons older than 64 years in the consumer unit.

NO_EARNR

Number of earners in the consumer unit.

AGE

Age of the primary earner.

EDUCA

Education level of the primary earner, coded as 1 = None, 2 = 1st-8th Grade, 3 = Some high school, 4 = High school, 5 = Some college, 6 = AA degree, 7 = Bachelor's degree, 8 = Advanced degree.

SEX

Gender of the primary earner (F = Female, M = Male).

MARITAL

Marital status of the primary earner (1 = Married, 2 = Widowed, 3 = Divorced, 4 = Separated, 5 = Never Married).

MEMBRACE

Race of the primary earner (e.g., 1 = White, 2 = Black, 3 = Native American, 4 = Asian, 5 = Pacific Islander, 6 = Multi-race).

HORIGIN

Indicator of Hispanic, Latino, or Spanish origin (Y for yes, N for no).

ARM_FORC

Indicator if the primary earner is a member of the armed forces (Y/N).

IN_COLL

Current college enrollment status for the primary earner (Full for full time, Part for part time, No for not enrolled).

EARNTYPE

Type of employment for the primary earner: 1 = Full time all year, 2 = Part time all year, 3 = Full time part-year, 4 = Part time part-year.

OCCUCODE

Occupational code representing the primary job of the earner.

INCOMEY

Type of employment: coded as 1 = Employee of a private company, 2 = Federal government employee, 3 = State government employee, 4 = Local government employee, 5 = Self-employed, 6 = Working without pay in a family business.

FINCBTAX

Amount of consumer unit income before taxes in the past 12 months.

SALARYX

Wage or salary income received in the past 12 months, before deductions.

SOCRRX

Income received from Social Security and Railroad Retirement in the past 12 months.

TOTEXPCQ

Total expenditures reported for the current quarter.

TOTXEST

Total taxes paid (estimated) in the current period.

EHOUSNGC

Total expenditures for housing in the current quarter.

HEALTHCQ

Expenditures for health care during the current quarter.

FOODCQ

Expenditures on food during the current quarter.

Details

This example dataset is a subset extracted from the complete CE dataset used by the rpms package. It is intended to illustrate how to work with survey data in the context of recursive partitioning. The original CE data contain 68,415 observations on 47 variables; this example contains a smaller selection for ease of demonstration. In the curated subset, several columns were removed because they were mostly missing, redundant, or not relevant to the examples. Rows were filtered to keep strictly positive values of the salary, expenditure, and tax variables. Weights were not recalibrated after these changes.

Note

For more information on the methodology and details behind the original dataset, please see: Toth, D. (2021). *rpms: Recursive Partitioning for Modeling Survey Data* (v0.5.1). CRAN. https://CRAN.R-project.org/package=rpms.

Source

rpms package on CRAN

See Also

rpms for an overview of the functions provided in the original package.


DuMouchel-Duncan WA test

Description

DuMouchel-Duncan WA test

Usage

wa_DD(y, X, wts)

Pfeffermann-Sverchkov Test 1

Description

Pfeffermann-Sverchkov Test 1

Usage

wa_PS1(y, X, wts, quadratic = FALSE, aux_design = NULL)

Pfeffermann-Sverchkov Test 2

Description

Pfeffermann-Sverchkov Test 2

Usage

wa_PS2(y, X, wts, quadratic = FALSE, aux_design = NULL)

Wu-Fuller test

Description

Wu-Fuller test

Usage

wa_WF(y, X, wts)

Weight-Association Tests for Survey Weights

Description

Implements several weight-association tests that examine whether survey weights are informative about the response variable after conditioning on covariates. Variants include DuMouchel-Duncan (DD), Pfeffermann-Sverchkov (PS1 and PS2, with optional quadratic terms or user-supplied auxiliary designs), and Wu-Fuller (WF).

Usage

wa_test(
  model,
  type = c("DD", "PS1", "PS1q", "PS2", "PS2q", "WF"),
  coef_subset = NULL,
  aux_design = NULL,
  na.action = stats::na.omit
)

## S3 method for class 'wa_test'
print(x, ...)

## S3 method for class 'wa_test'
summary(object, ...)

## S3 method for class 'wa_test'
tidy(x, ...)

## S3 method for class 'wa_test'
glance(x, ...)

Arguments

model

An object of class svyglm.

type

Character string specifying the test type: "DD", "PS1", "PS1q", "PS2", "PS2q", "WF".

coef_subset

Optional character vector of coefficient names to include in the test. Defaults to all coefficients.

aux_design

Optional matrix or function to generate auxiliary regressors for PS1/PS2 tests. If a function, it should take X and y and return a matrix of extra columns to include.

na.action

Function to handle missing data before testing.

x

An object of class wa_test

...

Additional arguments passed to methods

object

An object of class wa_test

Details

Let y denote the response, X the design matrix of covariates, and w the survey weights. The null hypothesis in all cases is that the weights are non-informative given X, i.e. they do not provide additional information about y beyond the covariates.

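The following test variants are implemented and are selected through type:

  • "DD": DuMouchel-Duncan test.

  • "PS1" and "PS1q": Pfeffermann-Sverchkov Test 1, without and with quadratic terms.

  • "PS2" and "PS2q": Pfeffermann-Sverchkov Test 2, without and with quadratic terms.

  • "WF": Wu-Fuller test.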

In all cases, the reported statistic is an F-test with numerator degrees of freedom equal to the number of auxiliary regressors added, and denominator degrees of freedom equal to the residual degrees of freedom from the augmented regression.
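
To make the augmented-regression idea concrete, the sketch below follows the textbook form of the DuMouchel-Duncan test: regress y on X, add the weight and its interactions with the covariates, and compare the two fits with an F-test. It uses stats::lm() and anova() on simulated data; the data and object names are assumptions, and the package's internal wa_DD() may be organized differently.

```r
# Illustrative sketch only: DuMouchel-Duncan style test via nested lm() fits.
set.seed(5)
n <- 250
x1 <- rnorm(n)
x2 <- rnorm(n)
w <- runif(n, 0.5, 2)                          # simulated weights
y <- 1 + x1 - x2 + rnorm(n)

fit0 <- lm(y ~ x1 + x2)                        # base regression on X
fit1 <- lm(y ~ x1 + x2 + w + w:x1 + w:x2)      # augmented with w and w * X

dd <- anova(fit0, fit1)                        # F-test on the added terms
F_stat <- dd$F[2]
df_num <- dd$Df[2]                             # number of auxiliary regressors
df_den <- dd$Res.Df[2]                         # residual df of augmented fit
p_value <- dd$`Pr(>F)`[2]
c(statistic = F_stat, df1 = df_num, df2 = df_den, p.value = p_value)
```

Here the numerator degrees of freedom equal 3 (the weight itself plus two interactions), matching the rule stated above.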

Value

An object of class "wa_test" containing:

statistic

F-test statistic

parameter

Degrees of freedom (numerator, denominator)

p.value

P-value for the test

method

Name of the test performed

call

Function call

References

DuMouchel, W. H., & Duncan, G. J. (1983). Using sample survey weights in multiple regression analyses of stratified samples. *Journal of the American Statistical Association*, 78(383), 535-543.

Pfeffermann, D., & Sverchkov, M. (1999). Parametric and semi-parametric estimation of regression models fitted to survey data. *Sankhya: The Indian Journal of Statistics, Series B*, 61(1), 166-186.

Pfeffermann, D., & Sverchkov, M. (2003). Fitting generalized linear models under informative sampling. In R. L. Chambers & C. J. Skinner (Eds.), *Analysis of Survey Data* (pp. 175-196). Wiley.

Wu, Y., & Fuller, W. A. (2005). Preliminary testing procedures for regression with survey samples. In *Proceedings of the Joint Statistical Meetings, Survey Research Methods Section* (pp. 3683-3688). American Statistical Association.

See Also

diff_in_coef_test for the Hausman-Pfeffermann difference-in-coefficients test, and svytestCE for the example dataset included in this package.

Examples

# Load in survey package (required) and load in example data
library(survey)
data(api, package = "survey")

# Create a survey design and fit a weighted regression model
des <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat)
fit <- svyglm(api00 ~ ell + meals, design = des)

# Run weight-association diagnostic test; reports F-stat, df's, and p-value
results <- wa_test(fit, type = "DD")
print(results)