Large-Scale Linear Regression with lsReg

Overview

lsReg is designed for settings where a large number of candidate covariates must each be tested for association with an outcome, given a fixed set of baseline covariates. Rather than fitting a separate model for each candidate, lsReg pre-computes and caches expensive quantities from the base model once, then reuses them for every subsequent test. This is the pattern used in genome-wide association studies (GWAS), where thousands or millions of genetic variants are tested one at a time against a common base model.

The workflow has two steps:

  1. Allocate — fit the base model with glm(), then call lsReg() to cache the quantities needed for repeated testing.
  2. Test — call addcovar() once per candidate covariate, retrieving the parameter estimate and test statistic from the allocated memory object.
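The saving behind this pattern is standard least-squares algebra: by the Frisch–Waugh–Lovell theorem, the coefficient of a candidate column z in the augmented model equals the slope from a scalar regression of the base-model residuals of the outcome on the base-model residuals of z, so quantities like (X'X)^{-1} and the base residuals can be computed once and reused for every candidate. The following base-R sketch illustrates the idea on simulated data; it is illustrative only and does not show lsReg's actual internals.

```r
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n), rbinom(n, 1, 0.3))      # base design: intercept, x1, x2
y <- drop(X %*% c(4.5, 1.0, 0.6)) + rnorm(n, sd = 2.8)
z <- rnorm(n)                                    # one candidate covariate

## Cached once from the base model
XtX_inv <- solve(crossprod(X))
r_y     <- y - X %*% (XtX_inv %*% crossprod(X, y))   # base residuals of y

## Per-candidate work: residualize z, then one scalar regression
z_r    <- z - X %*% (XtX_inv %*% crossprod(X, z))
beta_z <- sum(z_r * r_y) / sum(z_r^2)

## Same coefficient as the full augmented fit
full <- lm(y ~ X[, 2] + X[, 3] + z)
all.equal(beta_z, unname(coef(full)["z"]))
```

The per-candidate step costs only a few matrix-vector products, which is why screening many candidates is much cheaper than refitting the full model each time.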

Example data

We use a simulated dataset with 1,000 observations. The outcome ylin was generated as a linear function of x1, x2, z5, and z9. The variables z1 through z10 are candidate covariates to be screened.

library(lsReg)

datafile <- system.file("extdata", "simulated_data.rds", package = "lsReg")
dat <- readRDS(datafile)
head(dat[, c("ylin", "x1", "x2", "z1", "z2", "z5", "z9")])
#>        ylin          x1 x2 z1 z2 z5        z9
#> 1 1.2926463 -1.52199116  1  0  1  0 0.5931028
#> 2 3.9721109 -0.96937043  1  0  0  0 0.6778513
#> 3 0.5331866 -0.38494124  0  0  0  0 0.8131135
#> 4 3.3306664 -0.24568841  0  1  1  0 0.2320682
#> 5 4.1558941 -0.49640532  0  0  1  0 0.7719654
#> 6 2.5245574  0.07975954  0  0  1  0 0.2162713

Step 1: Fit the base model

The base model includes only the baseline covariates x1 and x2.

basemdl <- glm(ylin ~ x1 + x2, data = dat, family = gaussian)
summary(basemdl)
#> 
#> Call:
#> glm(formula = ylin ~ x1 + x2, family = gaussian, data = dat)
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  4.56684    0.11907  38.355  < 2e-16 ***
#> x1           1.03466    0.08637  11.979  < 2e-16 ***
#> x2           0.59230    0.17517   3.381  0.00075 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 7.616899)
#> 
#>     Null deviance: 8798.1  on 999  degrees of freedom
#> Residual deviance: 7594.0  on 997  degrees of freedom
#> AIC: 4873.2
#> 
#> Number of Fisher Scoring iterations: 2

Step 2: Allocate memory

Pass the fitted base model to lsReg(), specifying that each test will add one column (colstoadd = 1) and that the Wald test statistic should be used.

mem <- lsReg(basemdl, colstoadd = 1, testtype = "wald")
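For a single added coefficient, the Wald statistic is simply the estimate divided by its standard error, which for a Gaussian model is the "t value" reported by summary(). A small self-contained base-R check on simulated data (independent of lsReg):

```r
set.seed(2)
n  <- 500
x1 <- rnorm(n)
z  <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * z + rnorm(n)

fit  <- glm(y ~ x1 + z, family = gaussian)
beta <- coef(fit)["z"]
se   <- sqrt(diag(vcov(fit)))["z"]       # standard error of the z coefficient
wald <- unname(beta / se)

## Same value as the "t value" column of summary(fit)
all.equal(wald, coef(summary(fit))["z", "t value"])
```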

Step 3: Test each candidate covariate

Loop over z1 through z10, calling addcovar() once per candidate. After each call, the parameter estimate for the candidate covariate is available in mem$fitdata$betab and the Wald test statistic in mem$testvalue.

zvars <- paste0("z", 1:10)

results <- data.frame(
  variable  = zvars,
  estimate  = NA_real_,
  statistic = NA_real_
)

for (i in seq_along(zvars)) {
  xr <- as.matrix(dat[, zvars[i], drop = FALSE])
  addcovar(mem, xr)
  results$estimate[i]  <- mem$fitdata$betab[1]
  results$statistic[i] <- mem$testvalue[1, 1]
}

Results

results$pvalue <- 2 * pnorm(-abs(results$statistic))
print(results, digits = 4)
#>    variable  estimate statistic    pvalue
#> 1        z1 -0.469022 -1.405562 0.1598543
#> 2        z2 -0.327854 -1.875926 0.0606655
#> 3        z3 -0.096312 -0.532620 0.5942964
#> 4        z4 -0.096532 -0.409402 0.6822450
#> 5        z5  0.744905  3.565794 0.0003628
#> 6        z6 -0.024623 -0.082187 0.9344980
#> 7        z7  0.682225  2.241809 0.0249737
#> 8        z8  0.001659  0.005704 0.9954488
#> 9        z9 -0.748608 -2.471511 0.0134543
#> 10      z10 -0.533338 -1.774089 0.0760484

The variables z5 and z9 have the largest test statistics and the smallest p-values, consistent with the data-generating model. The remaining variables are null; most of their statistics are small, although z7 happens to fall below the nominal 0.05 level, the kind of false positive expected when ten null hypotheses are tested. The two-sided p-values use the standard normal distribution, which is an accurate approximation to the t distribution given nearly 1,000 residual degrees of freedom.
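In a real screen with many candidates, the raw p-values should be adjusted for multiple testing. A minimal sketch using base R's p.adjust on the ten p-values from the table above (rounded to four digits for readability):

```r
pvals <- c(z1 = 0.1599, z2 = 0.0607, z3 = 0.5943, z4 = 0.6822,
           z5 = 0.0004, z6 = 0.9345, z7 = 0.0250, z8 = 0.9954,
           z9 = 0.0135, z10 = 0.0760)

## Bonferroni controls the family-wise error rate;
## Benjamini-Hochberg (BH) controls the false discovery rate
adj <- data.frame(
  raw        = pvals,
  bonferroni = p.adjust(pvals, method = "bonferroni"),
  bh         = p.adjust(pvals, method = "BH")
)
round(adj, 4)
```

Under either adjustment, z5 remains the strongest signal, while the borderline candidates (such as z7) are attenuated, as expected in a screen of mostly null covariates.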

Verification against standard GLM

To confirm that lsReg produces the same estimates and test statistics as a full glm() fit, we fit the models ylin ~ x1 + x2 + z5 and ylin ~ x1 + x2 + z9 directly and compare.

glm_z5 <- glm(ylin ~ x1 + x2 + z5, data = dat, family = gaussian)
glm_z9 <- glm(ylin ~ x1 + x2 + z9, data = dat, family = gaussian)

glm_est_z5  <- coef(glm_z5)["z5"]
glm_stat_z5 <- coef(summary(glm_z5))["z5", "t value"]

glm_est_z9  <- coef(glm_z9)["z9"]
glm_stat_z9 <- coef(summary(glm_z9))["z9", "t value"]

lsreg_est_z5  <- results$estimate[results$variable == "z5"]
lsreg_stat_z5 <- results$statistic[results$variable == "z5"]

lsreg_est_z9  <- results$estimate[results$variable == "z9"]
lsreg_stat_z9 <- results$statistic[results$variable == "z9"]

comparison <- data.frame(
  variable  = c("z5", "z9"),
  glm_estimate  = c(glm_est_z5,  glm_est_z9),
  lsreg_estimate = c(lsreg_est_z5, lsreg_est_z9),
  glm_statistic  = c(glm_stat_z5,  glm_stat_z9),
  lsreg_statistic = c(lsreg_stat_z5, lsreg_stat_z9)
)
print(comparison, digits = 6)
#>    variable glm_estimate lsreg_estimate glm_statistic lsreg_statistic
#> z5       z5     0.744905       0.744905       3.56579         3.56579
#> z9       z9    -0.748608      -0.748608      -2.47151        -2.47151

The estimates and test statistics from lsReg match those from glm() to numerical precision.