lsReg is designed for settings where a large number of
candidate covariates must each be tested for association with an
outcome, given a fixed set of baseline covariates. Rather than fitting a
separate model for each candidate, lsReg pre-computes and
caches expensive quantities from the base model once, then reuses them
for every subsequent test. This is the pattern used in genome-wide
association studies (GWAS), where thousands or millions of genetic
variants are tested one at a time against a common base model.
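The caching idea rests on a standard least-squares fact (the Frisch–Waugh–Lovell theorem): once the outcome and a candidate covariate are residualized on the base covariates, the candidate's coefficient comes from a cheap one-variable regression on the residuals. The following self-contained base-R snippet illustrates the principle; it is not lsReg's internal code, just a sketch of why caching the base-model projection pays off:

```r
# Illustration (not lsReg internals): residualize y and a candidate z on the
# base design once, then get the candidate's slope from the residuals alone.
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))           # base design: intercept + one covariate
z <- rnorm(n)                     # one candidate covariate
y <- X %*% c(2, 1) + 0.5 * z + rnorm(n)

qrX <- qr(X)                      # cache the expensive factorization once
ry  <- qr.resid(qrX, y)           # outcome residualized on the base model
rz  <- qr.resid(qrX, z)           # candidate residualized on the base model

beta_fwl  <- sum(rz * ry) / sum(rz^2)        # slope from residual regression
beta_full <- coef(lm(y ~ X[, 2] + z))["z"]   # slope from the full model fit
all.equal(unname(beta_full), beta_fwl)       # TRUE: the two slopes agree
```

Because `qr(X)` depends only on the base covariates, it can be reused for every candidate, which is exactly the economy lsReg exploits.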
The workflow has two steps:

1. Fit the base model with glm(), then call lsReg() to cache the
quantities needed for repeated testing.
2. Call addcovar() once per candidate covariate, retrieving the
parameter estimate and test statistic from the allocated memory object.

We use a simulated dataset with 1,000 observations. The outcome
ylin was generated as a linear function of x1,
x2, z5, and z9. The variables
z1 through z10 are candidate covariates to be
screened.
library(lsReg)
datafile <- system.file("extdata", "simulated_data.rds", package = "lsReg")
dat <- readRDS(datafile)
head(dat[, c("ylin", "x1", "x2", "z1", "z2", "z5", "z9")])
#> ylin x1 x2 z1 z2 z5 z9
#> 1 1.2926463 -1.52199116 1 0 1 0 0.5931028
#> 2 3.9721109 -0.96937043 1 0 0 0 0.6778513
#> 3 0.5331866 -0.38494124 0 0 0 0 0.8131135
#> 4 3.3306664 -0.24568841 0 1 1 0 0.2320682
#> 5 4.1558941 -0.49640532 0 0 1 0 0.7719654
#> 6 2.5245574 0.07975954 0 0 1 0 0.2162713

The base model includes only the baseline covariates x1
and x2.
basemdl <- glm(ylin ~ x1 + x2, data = dat, family = gaussian)
summary(basemdl)
#>
#> Call:
#> glm(formula = ylin ~ x1 + x2, family = gaussian, data = dat)
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4.56684 0.11907 38.355 < 2e-16 ***
#> x1 1.03466 0.08637 11.979 < 2e-16 ***
#> x2 0.59230 0.17517 3.381 0.00075 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 7.616899)
#>
#> Null deviance: 8798.1 on 999 degrees of freedom
#> Residual deviance: 7594.0 on 997 degrees of freedom
#> AIC: 4873.2
#>
#> Number of Fisher Scoring iterations: 2

Pass the fitted base model to lsReg(), specifying that
each test will add one column (colstoadd = 1) and that the
Wald test statistic should be used.
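The code chunk for this step did not survive extraction; a minimal sketch of the call, using the function and argument named in the text (colstoadd = 1) and assuming the Wald test is selected via an argument named test, is:

```r
# Sketch only: colstoadd = 1 is stated in the text; the name of the
# test-selection argument (test = "Wald") is an assumption.
mem <- lsReg(basemdl, colstoadd = 1, test = "Wald")
```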
Loop over z1 through z10, calling
addcovar() for each. After each call the parameter estimate
for the candidate covariate is stored in mem$fitdata$betab
and the Wald test statistic in mem$testvalue.
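The loop itself is missing from the extracted text; a sketch consistent with the fields named above and the printed results below, assuming addcovar() takes the memory object plus the new column and returns the updated object, is:

```r
# Sketch only: the exact signature of addcovar() is an assumption;
# mem$fitdata$betab and mem$testvalue are the fields named in the text.
candidates <- paste0("z", 1:10)
results <- data.frame(variable = candidates,
                      estimate = NA_real_, statistic = NA_real_)
for (i in seq_along(candidates)) {
  mem <- addcovar(mem, dat[[candidates[i]]])
  results$estimate[i]  <- mem$fitdata$betab
  results$statistic[i] <- mem$testvalue
}
```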
results$pvalue <- 2 * pnorm(-abs(results$statistic))
print(results, digits = 4)
#> variable estimate statistic pvalue
#> 1 z1 -0.469022 -1.405562 0.1598543
#> 2 z2 -0.327854 -1.875926 0.0606655
#> 3 z3 -0.096312 -0.532620 0.5942964
#> 4 z4 -0.096532 -0.409402 0.6822450
#> 5 z5 0.744905 3.565794 0.0003628
#> 6 z6 -0.024623 -0.082187 0.9344980
#> 7 z7 0.682225 2.241809 0.0249737
#> 8 z8 0.001659 0.005704 0.9954488
#> 9 z9 -0.748608 -2.471511 0.0134543
#> 10 z10 -0.533338 -1.774089 0.0760484

The variables z5 and z9 have the largest
test statistics and the smallest p-values, consistent with the
data-generating model. The remaining variables are null and mostly show
small statistics, although z7 crosses the nominal 0.05 threshold by
chance.
To confirm that lsReg produces the same estimates and
test statistics as a full glm() fit, we fit the models
ylin ~ x1 + x2 + z5 and ylin ~ x1 + x2 + z9
directly and compare.
glm_z5 <- glm(ylin ~ x1 + x2 + z5, data = dat, family = gaussian)
glm_z9 <- glm(ylin ~ x1 + x2 + z9, data = dat, family = gaussian)
glm_est_z5 <- coef(glm_z5)["z5"]
glm_stat_z5 <- coef(summary(glm_z5))["z5", "t value"]
glm_est_z9 <- coef(glm_z9)["z9"]
glm_stat_z9 <- coef(summary(glm_z9))["z9", "t value"]
lsreg_est_z5 <- results$estimate[results$variable == "z5"]
lsreg_stat_z5 <- results$statistic[results$variable == "z5"]
lsreg_est_z9 <- results$estimate[results$variable == "z9"]
lsreg_stat_z9 <- results$statistic[results$variable == "z9"]
comparison <- data.frame(
variable = c("z5", "z9"),
glm_estimate = c(glm_est_z5, glm_est_z9),
lsreg_estimate = c(lsreg_est_z5, lsreg_est_z9),
glm_statistic = c(glm_stat_z5, glm_stat_z9),
lsreg_statistic = c(lsreg_stat_z5, lsreg_stat_z9)
)
print(comparison, digits = 6)
#> variable glm_estimate lsreg_estimate glm_statistic lsreg_statistic
#> z5 z5 0.744905 0.744905 3.56579 3.56579
#> z9 z9 -0.748608 -0.748608 -2.47151 -2.47151

The estimates and test statistics from lsReg match those
from glm() to numerical precision.
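The visual comparison can also be made programmatic with all.equal(), applied to the columns of the comparison data frame built above:

```r
# Confirms agreement up to numerical tolerance rather than printed digits.
all.equal(comparison$glm_estimate,  comparison$lsreg_estimate)
all.equal(comparison$glm_statistic, comparison$lsreg_statistic)
```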