ssp.glm.rF: Balanced Subsampling for Preserving Rare Features in Generalized Linear Models

This vignette illustrates how to use ssp.glm.rF for logistic regression with rare features. We focus on the practical usage of the function; additional statistical theory and algorithmic details will be released in a forthcoming paper.

Installation

You can install the development version of subsampling from GitHub with:

# install.packages("devtools")
devtools::install_github("dqksnow/Subsampling")
library(subsampling)

Example: Logistic Regression with Simulated Data Containing Rare Features

We introduce the usage of ssp.glm.RF using simulated data. The covariate matrix \(X\) contains \(d=5\) variables: three independent rare binary features \(Z\) with probabilities p_rare; two non-rare features jointly normally distributed. The full dataset size is \(N = 1 \times 10^4\).

set.seed(2)
N <- 1e4
d_rare <- 3
d_cont <- 2
p_rare <- c(0.01, 0.02, 0.05)
beta0 <- c(0.5, rep(0.5, d_rare), rep(0.5, d_cont)) 
corr <- 0.5
sigmax  <- matrix(corr, d_cont, d_cont) + diag(1-corr, d_cont)
X <- MASS::mvrnorm(N, rep(0, d_cont), sigmax)
Z <- do.call(cbind, lapply(seq_along(p_rare), function(i) {
rbinom(N, 1, p_rare[i])
}))
X <- cbind(Z, X)
P <- 1 / (1 + exp(-(beta0[1] + X %*% beta0[-1])))
Y <- as.integer(rbinom(N, 1, P))
colnames(X) <- paste0("X", 1:(d_rare + d_cont))
rareFeature.index <- c(1:d_rare)
data <- data.frame(Y = Y, X)
formula <- Y ~ .
head(data)
#>   Y X1 X2 X3         X4         X5
#> 1 0  0  0  1 -1.4469846 -0.1065169
#> 2 0  0  0  1  0.7179401 -0.3977719
#> 3 1  0  0  0  0.6218452  2.1283836
#> 4 0  0  0  0 -0.1739511 -1.7839170
#> 5 1  0  0  0 -0.5609966  0.4219965
#> 6 1  1  0  0  0.3786550 -0.1492964
summary(data)
#>        Y                X1               X2               X3        
#>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
#>  Median :1.0000   Median :0.0000   Median :0.0000   Median :0.0000  
#>  Mean   :0.6136   Mean   :0.0095   Mean   :0.0191   Mean   :0.0499  
#>  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
#>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
#>        X4                  X5           
#>  Min.   :-3.810962   Min.   :-4.025550  
#>  1st Qu.:-0.670567   1st Qu.:-0.662755  
#>  Median : 0.019920   Median :-0.002225  
#>  Mean   : 0.009841   Mean   : 0.009267  
#>  3rd Qu.: 0.691971   3rd Qu.: 0.706946  
#>  Max.   : 3.701031   Max.   : 3.416948

Key Arguments

The main function interface is

ssp.glm.rF(formula,
           data,
           subset = NULL,
           n.plt,
           n.ssp,
           family = 'binomial',
           criterion = 'BL-Uni',
           sampling.method = 'poisson',
           likelihood = 'weighted',
           balance.plt = TRUE,
           balance.Y = FALSE,
           rareFeature.index = NULL,
           control = list(...),
           contrasts = NULL,
           ...
           ) 

The idea of subsampling is to draw a size \(n \ll N\) subsample and fit the model on it. The subsampling probabilities depend on the chosen method. For a general background of subsampling, refer to the documentation of ssp.glm. The function ssp.glm.rF extends ssp.glm to address the rare feature problem, where some binary features take the value 1 infrequently.

Why rare features require special treatment? A rare feature is a Bernoulli-distributed covariate \(Z\) with a low probability of being \(1\). In a large-scale dataset, its frequency \(N_{\cdot 1} = \sum_{i=1}^{N}Z_i \ll N\). The key challenge is that the estimator of the coefficient for a rare feature converges at rate \(O(1/N_{\cdot 1})\) rather than \(O(1/N)\). In other words, the accuracy for a rare feature’s coefficient estimation depends directly on the number of \(Z=1\) cases in the dataset. It implicates that a good subsampling method should assign higher sampling probabilities to observations with rare feature expressions. Classical criterion to compute sampling probabilities such as Uni and Lopt cannot sufficiently upweight those rare observations. Aopt can upweight them but replies heavily on the quality of the pilot estimator.

To address this issue, ssp.glm.rF incorporates the balance score \(b(Z)\). For the \(i\)-th observation, it is defined as: \[ b(Z_i) = \sum_{j=1}^{d_{r}} \frac{\left| Z_{ij} - \bar{Z}_{\cdot j}\right|}{\bar{Z}_{\cdot j}}, \] where \(\bar{Z}_{\cdot j} = N_{\cdot j} / N\) is the mean of the \(j\)-th rare feature in the full dataset, and \(d_r\) is the dimension of rare features. The balance score upweights observations that contain expressed rare features and aggregates contributions when multiple rare features occur together. Furthermore, it does not rely on unknown model parameters, thus does not require a pilot estimator.

In the function, criterion='BL-Uni' uses sampling probabilities proportional to the balance score. criterion = "R-Lopt"extends L-optimality so that it also accounts for rarity, albeit requires a pilot estimator. If balance.plt = TRUE, the pilot sample is drawn using "BL-Uni", which is important because an unstable pilot harms "Aopt", "Lopt", or "R-Lopt" performance. rareFeature.index must specify which columns of the design matrix correspond to rare features.

Outputs

Below are two examples demonstrating the use of ssp.glm with different configurations.

n.plt <- 300
n.ssp <- 2000
BL.Uni.results <- ssp.glm.rF(formula = formula, 
data = data, 
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = 'BL-Uni',
sampling.method = 'poisson',
likelihood = 'weighted',
balance.plt = TRUE,
balance.Y = FALSE,
rareFeature.index = rareFeature.index
)
summary(BL.Uni.results)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     family = "quasibinomial", criterion = "BL-Uni", sampling.method = "poisson", 
#>     likelihood = "weighted", balance.plt = TRUE, balance.Y = FALSE, 
#>     rareFeature.index = rareFeature.index)
#> 
#> Subsample Summary:
#>                 Variable  Value
#>        Total Sample Size  10000
#>  Expected Subsample Size   2000
#>    Actual Subsample Size   1964
#>    Unique Subsample Size   1964
#>  Expected Subsample Rate    20%
#>    Actual Subsample Rate 19.64%
#>    Unique Subsample Rate 19.64%
#> 
#> Rare Features Summary (pilot, subsample, full):
#>                X1      X2      X3      
#> size_pilot     "1964"  "1964"  "1964"  
#> count1_pilot   "95"    "191"   "471"   
#> prop_pilot     "4.84%" "9.73%" "23.98%"
#> coverage_pilot "100%"  "100%"  "94.39%"
#>                ""      ""      ""      
#> size_subsample "1964"  "1964"  "1964"  
#> count1_ssp     "95"    "191"   "471"   
#> prop_ssp       "4.84%" "9.73%" "23.98%"
#> coverage_ssp   "100%"  "100%"  "94.39%"
#>                ""      ""      ""      
#> size_full      "10000" "10000" "10000" 
#> count1_full    "95"    "191"   "499"   
#> rare_rate      "0.95%" "1.91%" "4.99%" 
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.4777     0.0639  7.4734  <0.0001
#> X1          0.6419     0.2570  2.4972   0.0125
#> X2          0.6119     0.1899  3.2227   0.0013
#> X3          0.5804     0.1253  4.6306  <0.0001
#> X4          0.5596     0.0705  7.9369  <0.0001
#> X5          0.4629     0.0666  6.9560  <0.0001

The Subsample Summary reports the distribution of the three rare features in the pilot sample, the subsample, and the full dataset. When "BL-Uni" or Uni criterion is used, the pilot sample and the subsample return the same results. The value size_subsample corresponds to the realized subsample size. The value count1_ssp represents the number of observations with \(X1=1\) in the subsample. Here, 95 out of 1964 observations (4.7%). The corresponding coverage_ssp = 100% indicates that all 95 rare cases for \(X1=1\) in the full dataset are included in the subsample. The remaining quantities follow the same interpretation as in ssp.glm.

R.Lopt.results <- ssp.glm.rF(formula = formula, 
data = data, 
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = 'R-Lopt',
sampling.method = 'poisson',
likelihood = 'weighted',
balance.plt = TRUE,
balance.Y = FALSE,
rareFeature.index = rareFeature.index
)
summary(R.Lopt.results)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     family = "quasibinomial", criterion = "R-Lopt", sampling.method = "poisson", 
#>     likelihood = "weighted", balance.plt = TRUE, balance.Y = FALSE, 
#>     rareFeature.index = rareFeature.index)
#> 
#> Subsample Summary:
#>                 Variable  Value
#>        Total Sample Size  10000
#>  Expected Subsample Size   2000
#>    Actual Subsample Size   1935
#>    Unique Subsample Size   1935
#>  Expected Subsample Rate    20%
#>    Actual Subsample Rate 19.35%
#>    Unique Subsample Rate 19.35%
#> 
#> Rare Features Summary (pilot, subsample, full):
#>                X1       X2       X3      
#> size_pilot     "278"    "278"    "278"   
#> count1_pilot   "43"     "55"     "61"    
#> prop_pilot     "15.47%" "19.78%" "21.94%"
#> coverage_pilot "45.26%" "28.8%"  "12.22%"
#>                ""       ""       ""      
#> size_subsample "1935"   "1935"   "1935"  
#> count1_ssp     "91"     "186"    "350"   
#> prop_ssp       "4.7%"   "9.61%"  "18.09%"
#> coverage_ssp   "95.79%" "97.38%" "70.14%"
#>                ""       ""       ""      
#> size_full      "10000"  "10000"  "10000" 
#> count1_full    "95"     "191"    "499"   
#> rare_rate      "0.95%"  "1.91%"  "4.99%" 
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.4720     0.0580  8.1360  <0.0001
#> X1          0.6529     0.2598  2.5129   0.0120
#> X2          0.6327     0.1914  3.3056   0.0009
#> X3          0.5875     0.1284  4.5764  <0.0001
#> X4          0.5602     0.0574  9.7554  <0.0001
#> X5          0.5358     0.0623  8.5930  <0.0001

As recommended by survey::svyglm, when working with binomial models, it is advisable to use use family=quasibinomial() to avoid a warning issued by glm. Refer to svyglm() help documentation Details. The ‘quasi’ version of the family objects provide the same point estimates.