ssp.glm.rF: Balanced Subsampling for Preserving Rare Features in Generalized Linear Models

This vignette illustrates how to use ssp.glm.rF() for generalized linear models with rare binary features. A rare feature is a binary covariate that takes the value 1 in only a small fraction of observations. Ordinary subsampling can miss many of these expressed rare-feature observations, making estimation of the corresponding coefficients unstable.

ssp.glm.rF() uses rarity-aware sampling probabilities for one-step balance-score sampling, two-step optimal subsampling, optional response balancing for binary outcomes, automatic rare-feature detection, and a combined estimator fitted on the union of the pilot and second-step subsamples.

Setup

library(subsampling)

Simulated Logistic Regression Example

We first simulate a logistic regression dataset with two rare binary features and two continuous covariates. In the formula Y ~ ., the model matrix contains an intercept column. Numeric rareFeature.index values follow the original covariate order supplied by the user; the function internally shifts the indices to account for the intercept column.

set.seed(2)
N <- 5000
Z1 <- rbinom(N, 1, 0.04)
Z2 <- rbinom(N, 1, 0.07)
X1 <- rnorm(N)
X2 <- rnorm(N)
eta <- 0.5 + 0.5 * Z1 + 0.5 * Z2 + 0.5 * X1 + 0.5 * X2
Y <- rbinom(N, 1, plogis(eta))

data <- data.frame(Y, Z1, Z2, X1, X2)
formula <- Y ~ .
rareFeature.index <- 1:2
n.plt <- 300
n.ssp <- 700

One-Step Balanced Sampling

The default criterion is "BL-Uni". It draws one Poisson subsample with probabilities proportional to the rare-feature balance score. For "BL-Uni" and "Uni", the expected sample size is n.plt + n.ssp.

For observation \(i\), the balance score is \[ b(Z_i) = \sum_{j=1}^{d_r} \frac{|Z_{ij} - \bar{Z}_j|}{\bar{Z}_j}, \] where \(d_r\) is the number of rare features and \(\bar{Z}_j\) is the prevalence of the \(j\)th rare feature in the full data. Observations containing expressed rare features receive larger scores and therefore larger sampling probabilities.

fit_bl <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "BL-Uni",
  rareFeature.index = rareFeature.index
)

summary(fit_bl)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     family = "quasibinomial", criterion = "BL-Uni", rareFeature.index = rareFeature.index)
#> 
#> Design Summary:
#>                   Variable  Value
#>          Pilot Sample Size   1001
#>    Expected Subsample Size   1000
#>      Actual Subsample Size   1001
#>      Unique Subsample Size   1001
#>        Combined Union Size   1001
#>              Overlap Count   1001
#>      Overlap Rate in Pilot   100%
#>  Overlap Rate in Subsample   100%
#>    Expected Subsample Rate    20%
#>      Actual Subsample Rate 20.02%
#>      Unique Subsample Rate 20.02%
#> 
#> Sample Composition (Logistic Regression):
#>          Sample Size Y_count Y_rate Rare_row_rate
#>           Pilot 1001     656 65.53%        49.15%
#>       Subsample 1001     656 65.53%        49.15%
#>  Combined union 1001     656 65.53%        49.15%
#>       Full data 5000    3044 60.88%        11.32%
#> 
#> Subsample Rare-Feature Summary:
#>  Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#>       Z1             212         21.18%        212     4.24%             100%
#>       Z2             298         29.77%        372     7.44%           80.11%
#> 
#> Subsample Rare-Pattern Distribution:
#>  Rare_features_in_row Count   Rate
#>                0_ones   509 50.85%
#>                1_ones   474 47.35%
#>                2_ones    18   1.8%
#> 
#> Pilot Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.4701     0.0951  4.9413  <0.0001
#> Z1          0.5197     0.1816  2.8620   0.0042
#> Z2          0.4508     0.1600  2.8169   0.0048
#> X1          0.3368     0.0891  3.7818   0.0002
#> X2          0.5454     0.0859  6.3484  <0.0001
#> 
#> Second-Step Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.4701     0.0951  4.9413  <0.0001
#> Z1          0.5197     0.1816  2.8620   0.0042
#> Z2          0.4508     0.1600  2.8169   0.0048
#> X1          0.3368     0.0891  3.7818   0.0002
#> X2          0.5454     0.0859  6.3484  <0.0001
#> 
#> Combined-Union Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.4701     0.0951  4.9413  <0.0001
#> Z1          0.5197     0.1816  2.8620   0.0042
#> Z2          0.4508     0.1600  2.8169   0.0048
#> X1          0.3368     0.0891  3.7818   0.0002
#> X2          0.5454     0.0859  6.3484  <0.0001

The summary reports the realized sample size, response composition, rare-feature coverage, and coefficient estimates. Because this is a one-step method, the pilot, subsample, and combined estimators are the same.

Two-Step Rareness-Aware Optimal Subsampling

Two-step criteria such as "Lopt", "Aopt", "R-Lopt", and "BL-Lopt" first draw a pilot sample, fit a pilot estimator, compute second-step sampling probabilities, and then refit the model on the second-step subsample. The final combined estimator is fitted on the union of the pilot and second-step samples.

For two-step methods, balance.X.plt = TRUE draws the pilot sample using the balance score.

fit_rlopt <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "R-Lopt",
  balance.X.plt = TRUE,
  rareFeature.index = c("Z1", "Z2")
)

summary(fit_rlopt)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     family = "quasibinomial", criterion = "R-Lopt", balance.X.plt = TRUE, 
#>     rareFeature.index = c("Z1", "Z2"))
#> 
#> Design Summary:
#>                   Variable  Value
#>          Pilot Sample Size    300
#>    Expected Subsample Size    700
#>      Actual Subsample Size    702
#>      Unique Subsample Size    702
#>        Combined Union Size    892
#>              Overlap Count    110
#>      Overlap Rate in Pilot 36.67%
#>  Overlap Rate in Subsample 15.67%
#>    Expected Subsample Rate    14%
#>      Actual Subsample Rate 14.04%
#>      Unique Subsample Rate 14.04%
#> 
#> Sample Composition (Logistic Regression):
#>          Sample Size Y_count Y_rate Rare_row_rate
#>           Pilot  300     211 70.33%           55%
#>       Subsample  702     332 47.29%           49%
#>  Combined union  892     483 54.15%        45.74%
#>       Full data 5000    3044 60.88%        11.32%
#> 
#> Subsample Rare-Feature Summary:
#>  Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#>       Z1             150         21.37%        212     4.24%           70.75%
#>       Z2             209         29.77%        372     7.44%           56.18%
#> 
#> Subsample Rare-Pattern Distribution:
#>  Rare_features_in_row Count   Rate
#>                0_ones   358    51%
#>                1_ones   329 46.87%
#>                2_ones    15  2.14%
#> 
#> Pilot Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.7663     0.1986  3.8579   0.0001
#> Z1          0.5044     0.3246  1.5541   0.1202
#> Z2          0.3788     0.3488  1.0860   0.2775
#> X1          0.5885     0.1784  3.2984   0.0010
#> X2          0.8452     0.1801  4.6936  <0.0001
#> 
#> Second-Step Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.3045     0.1166  2.6118   0.0090
#> Z1          0.7609     0.2004  3.7969   0.0001
#> Z2          0.6093     0.1799  3.3874   0.0007
#> X1          0.5275     0.0927  5.6909  <0.0001
#> X2          0.5109     0.0974  5.2464  <0.0001
#> 
#> Combined-Union Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.4040     0.0943  4.2854  <0.0001
#> Z1          0.6489     0.1860  3.4887   0.0005
#> Z2          0.6148     0.1625  3.7833   0.0002
#> X1          0.5406     0.0803  6.7342  <0.0001
#> X2          0.5595     0.0767  7.2953  <0.0001

Automatically Account for Rarity if `rareFeature.index = NULL`

If rareFeature.index = NULL, the function searches for binary columns whose prevalence is below rareThreshold.

fit_auto <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "BL-Uni",
  rareFeature.index = NULL,
  rareThreshold = 0.09
)

fit_auto$rareFeature.index
#> Z1 Z2 
#>  2  3

Balancing the Outcome for Logistic Regression

For binary outcomes, balance.Y.ssp = TRUE applies a case-control style allocation for one-step "Uni" and "BL-Uni" methods. The option balance.Y.all = TRUE includes all observations with Y = 1 and subsamples from observations with Y = 0.

fit_y_balanced <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "BL-Uni",
  balance.Y.ssp = TRUE,
  rareFeature.index = c("Z1", "Z2")
)

c(
  full_Y_rate = mean(data$Y),
  subsample_Y_rate = fit_y_balanced$Y.proportion.ssp
)
#>      full_Y_rate subsample_Y_rate 
#>        0.6088000        0.5185557

Objective Weights

By default, sampled observations are fitted with inverse-probability weights. For one-step methods whose sampling probabilities do not depend on the response, objective.weight = "unweighted" can be used. Two-step methods currently use a weighted second-step objective.

Control Options

The control argument accepts alpha, b, and poi.method. The default poi.method = "exact" computes Poisson probabilities using full-data normalization. The alternative poi.method = "estimated" uses the pilot sample to estimate the normalizing quantity.

fit_estimated <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "R-Lopt",
  balance.X.plt = TRUE,
  rareFeature.index = c("Z1", "Z2"),
  control = list(poi.method = "estimated", b = 2),
  record.stage.time = TRUE
)

fit_estimated$stage.time
#>      balance_score     pilot.estimate        subsampling subsample.estimate 
#>              0.001              0.001              0.001              0.001 
#>    combining.union    result.assembly 
#>              0.001              0.000

Non-Binomial Families

The rare-feature machinery can also be used with non-binomial GLMs. Response balancing options are ignored for non-binary outcomes.

set.seed(3)
N_g <- 3000
Z <- rbinom(N_g, 1, 0.05)
X <- rnorm(N_g)
y <- 1 + 0.6 * Z + 0.2 * X + rnorm(N_g)
gaussian_data <- data.frame(y, Z, X)

fit_gaussian <- ssp.glm.rF(
  y ~ .,
  data = gaussian_data,
  n.plt = 200,
  n.ssp = 500,
  family = "gaussian",
  criterion = "BL-Uni",
  rareFeature.index = "Z"
)

summary(fit_gaussian)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm.rF(formula = y ~ ., data = gaussian_data, n.plt = 200, 
#>     n.ssp = 500, family = "gaussian", criterion = "BL-Uni", rareFeature.index = "Z")
#> 
#> Design Summary:
#>                   Variable  Value
#>          Pilot Sample Size    680
#>    Expected Subsample Size    700
#>      Actual Subsample Size    680
#>      Unique Subsample Size    680
#>        Combined Union Size    680
#>              Overlap Count    680
#>      Overlap Rate in Pilot   100%
#>  Overlap Rate in Subsample   100%
#>    Expected Subsample Rate 23.33%
#>      Actual Subsample Rate 22.67%
#>      Unique Subsample Rate 22.67%
#> 
#> Sample Composition:
#>          Sample Size Rare_row_rate
#>           Pilot  680        22.21%
#>       Subsample  680        22.21%
#>  Combined union  680        22.21%
#>       Full data 3000         5.03%
#> 
#> Subsample Rare-Feature Summary:
#>  Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#>        Z             151         22.21%        151   5.0333%             100%
#> 
#> Subsample Rare-Pattern Distribution:
#>  Rare_features_in_row Count   Rate
#>                0_ones   529 77.79%
#>                1_ones   151 22.21%
#> 
#> Pilot Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.9475     0.0450 21.0475  <0.0001
#> Z           0.6655     0.1009  6.5962  <0.0001
#> X           0.2663     0.0417  6.3806  <0.0001
#> 
#> Second-Step Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.9475     0.0450 21.0475  <0.0001
#> Z           0.6655     0.1009  6.5962  <0.0001
#> X           0.2663     0.0417  6.3806  <0.0001
#> 
#> Combined-Union Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.9475     0.0450 21.0475  <0.0001
#> Z           0.6655     0.1009  6.5962  <0.0001
#> X           0.2663     0.0417  6.3806  <0.0001