ssp.glm.rF: Balanced
Subsampling for Preserving Rare Features in Generalized Linear
ModelsThis vignette illustrates how to use ssp.glm.rF for
logistic regression with rare features. We focus on the practical usage
of the function; additional statistical theory and algorithmic details
will be released in a forthcoming paper.
You can install the development version of subsampling
from GitHub with:
We introduce the usage of ssp.glm.RF using simulated
data. The covariate matrix \(X\)
contains \(d=5\) variables: three
independent rare binary features \(Z\)
with probabilities p_rare; two non-rare features jointly
normally distributed. The full dataset size is \(N = 1 \times 10^4\).
set.seed(2)
N <- 1e4
d_rare <- 3
d_cont <- 2
p_rare <- c(0.01, 0.02, 0.05)
beta0 <- c(0.5, rep(0.5, d_rare), rep(0.5, d_cont))
corr <- 0.5
sigmax <- matrix(corr, d_cont, d_cont) + diag(1-corr, d_cont)
X <- MASS::mvrnorm(N, rep(0, d_cont), sigmax)
Z <- do.call(cbind, lapply(seq_along(p_rare), function(i) {
rbinom(N, 1, p_rare[i])
}))
X <- cbind(Z, X)
P <- 1 / (1 + exp(-(beta0[1] + X %*% beta0[-1])))
Y <- as.integer(rbinom(N, 1, P))
colnames(X) <- paste0("X", 1:(d_rare + d_cont))
rareFeature.index <- c(1:d_rare)
data <- data.frame(Y = Y, X)
formula <- Y ~ .
head(data)
#> Y X1 X2 X3 X4 X5
#> 1 0 0 0 1 -1.4469846 -0.1065169
#> 2 0 0 0 1 0.7179401 -0.3977719
#> 3 1 0 0 0 0.6218452 2.1283836
#> 4 0 0 0 0 -0.1739511 -1.7839170
#> 5 1 0 0 0 -0.5609966 0.4219965
#> 6 1 1 0 0 0.3786550 -0.1492964
summary(data)
#> Y X1 X2 X3
#> Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
#> 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
#> Median :1.0000 Median :0.0000 Median :0.0000 Median :0.0000
#> Mean :0.6136 Mean :0.0095 Mean :0.0191 Mean :0.0499
#> 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
#> Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
#> X4 X5
#> Min. :-3.810962 Min. :-4.025550
#> 1st Qu.:-0.670567 1st Qu.:-0.662755
#> Median : 0.019920 Median :-0.002225
#> Mean : 0.009841 Mean : 0.009267
#> 3rd Qu.: 0.691971 3rd Qu.: 0.706946
#> Max. : 3.701031 Max. : 3.416948The main function interface is
ssp.glm.rF(formula,
data,
subset = NULL,
n.plt,
n.ssp,
family = 'binomial',
criterion = 'BL-Uni',
sampling.method = 'poisson',
likelihood = 'weighted',
balance.plt = TRUE,
balance.Y = FALSE,
rareFeature.index = NULL,
control = list(...),
contrasts = NULL,
...
) The idea of subsampling is to draw a size \(n \ll N\) subsample and fit the model on
it. The subsampling probabilities depend on the chosen method. For a
general background of subsampling, refer to the documentation of
ssp.glm. The function ssp.glm.rF extends
ssp.glm to address the rare feature problem, where some
binary features take the value 1 infrequently.
Why rare features require special treatment? A rare feature is a
Bernoulli-distributed covariate \(Z\)
with a low probability of being \(1\).
In a large-scale dataset, its frequency \(N_{\cdot 1} = \sum_{i=1}^{N}Z_i \ll N\).
The key challenge is that the estimator of the coefficient for a rare
feature converges at rate \(O(1/N_{\cdot
1})\) rather than \(O(1/N)\). In
other words, the accuracy for a rare feature’s coefficient estimation
depends directly on the number of \(Z=1\) cases in the dataset. It implicates
that a good subsampling method should assign higher sampling
probabilities to observations with rare feature expressions. Classical
criterion to compute sampling probabilities such as Uni and
Lopt cannot sufficiently upweight those rare observations.
Aopt can upweight them but replies heavily on the quality
of the pilot estimator.
To address this issue, ssp.glm.rF incorporates the
balance score \(b(Z)\). For the \(i\)-th observation, it is defined as: \[
b(Z_i) = \sum_{j=1}^{d_{r}} \frac{\left|
Z_{ij} - \bar{Z}_{\cdot j}\right|}{\bar{Z}_{\cdot j}},
\] where \(\bar{Z}_{\cdot j} = N_{\cdot
j} / N\) is the mean of the \(j\)-th rare feature in the full dataset,
and \(d_r\) is the dimension of rare
features. The balance score upweights observations that contain
expressed rare features and aggregates contributions when multiple rare
features occur together. Furthermore, it does not rely on unknown model
parameters, thus does not require a pilot estimator.
In the function, criterion='BL-Uni' uses sampling
probabilities proportional to the balance score.
criterion = "R-Lopt"extends L-optimality so that it also
accounts for rarity, albeit requires a pilot estimator. If
balance.plt = TRUE, the pilot sample is drawn using
"BL-Uni", which is important because an unstable pilot
harms "Aopt", "Lopt", or "R-Lopt"
performance. rareFeature.index must specify which columns
of the design matrix correspond to rare features.
Below are two examples demonstrating the use of ssp.glm
with different configurations.
n.plt <- 300
n.ssp <- 2000
BL.Uni.results <- ssp.glm.rF(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = 'BL-Uni',
sampling.method = 'poisson',
likelihood = 'weighted',
balance.plt = TRUE,
balance.Y = FALSE,
rareFeature.index = rareFeature.index
)
summary(BL.Uni.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "BL-Uni", sampling.method = "poisson",
#> likelihood = "weighted", balance.plt = TRUE, balance.Y = FALSE,
#> rareFeature.index = rareFeature.index)
#>
#> Subsample Summary:
#> Variable Value
#> Total Sample Size 10000
#> Expected Subsample Size 2000
#> Actual Subsample Size 1964
#> Unique Subsample Size 1964
#> Expected Subsample Rate 20%
#> Actual Subsample Rate 19.64%
#> Unique Subsample Rate 19.64%
#>
#> Rare Features Summary (pilot, subsample, full):
#> X1 X2 X3
#> size_pilot "1964" "1964" "1964"
#> count1_pilot "95" "191" "471"
#> prop_pilot "4.84%" "9.73%" "23.98%"
#> coverage_pilot "100%" "100%" "94.39%"
#> "" "" ""
#> size_subsample "1964" "1964" "1964"
#> count1_ssp "95" "191" "471"
#> prop_ssp "4.84%" "9.73%" "23.98%"
#> coverage_ssp "100%" "100%" "94.39%"
#> "" "" ""
#> size_full "10000" "10000" "10000"
#> count1_full "95" "191" "499"
#> rare_rate "0.95%" "1.91%" "4.99%"
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.4777 0.0639 7.4734 <0.0001
#> X1 0.6419 0.2570 2.4972 0.0125
#> X2 0.6119 0.1899 3.2227 0.0013
#> X3 0.5804 0.1253 4.6306 <0.0001
#> X4 0.5596 0.0705 7.9369 <0.0001
#> X5 0.4629 0.0666 6.9560 <0.0001The Subsample Summary reports the distribution of the
three rare features in the pilot sample, the subsample, and the full
dataset. When "BL-Uni" or Uni criterion is
used, the pilot sample and the subsample return the same results. The
value size_subsample corresponds to the realized subsample
size. The value count1_ssp represents the number of
observations with \(X1=1\) in the
subsample. Here, 95 out of 1964 observations (4.7%). The corresponding
coverage_ssp = 100% indicates that all 95 rare cases for
\(X1=1\) in the full dataset are
included in the subsample. The remaining quantities follow the same
interpretation as in ssp.glm.
R.Lopt.results <- ssp.glm.rF(formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = 'quasibinomial',
criterion = 'R-Lopt',
sampling.method = 'poisson',
likelihood = 'weighted',
balance.plt = TRUE,
balance.Y = FALSE,
rareFeature.index = rareFeature.index
)
summary(R.Lopt.results)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "R-Lopt", sampling.method = "poisson",
#> likelihood = "weighted", balance.plt = TRUE, balance.Y = FALSE,
#> rareFeature.index = rareFeature.index)
#>
#> Subsample Summary:
#> Variable Value
#> Total Sample Size 10000
#> Expected Subsample Size 2000
#> Actual Subsample Size 1935
#> Unique Subsample Size 1935
#> Expected Subsample Rate 20%
#> Actual Subsample Rate 19.35%
#> Unique Subsample Rate 19.35%
#>
#> Rare Features Summary (pilot, subsample, full):
#> X1 X2 X3
#> size_pilot "278" "278" "278"
#> count1_pilot "43" "55" "61"
#> prop_pilot "15.47%" "19.78%" "21.94%"
#> coverage_pilot "45.26%" "28.8%" "12.22%"
#> "" "" ""
#> size_subsample "1935" "1935" "1935"
#> count1_ssp "91" "186" "350"
#> prop_ssp "4.7%" "9.61%" "18.09%"
#> coverage_ssp "95.79%" "97.38%" "70.14%"
#> "" "" ""
#> size_full "10000" "10000" "10000"
#> count1_full "95" "191" "499"
#> rare_rate "0.95%" "1.91%" "4.99%"
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.4720 0.0580 8.1360 <0.0001
#> X1 0.6529 0.2598 2.5129 0.0120
#> X2 0.6327 0.1914 3.3056 0.0009
#> X3 0.5875 0.1284 4.5764 <0.0001
#> X4 0.5602 0.0574 9.7554 <0.0001
#> X5 0.5358 0.0623 8.5930 <0.0001As recommended by survey::svyglm, when working with
binomial models, it is advisable to use use
family=quasibinomial() to avoid a warning issued by
glm. Refer to svyglm()
help documentation Details. The ‘quasi’ version of the family
objects provide the same point estimates.