multChernoff: Get Started

library(multChernoff)

Overview

multChernoff computes finite-sample tail bounds for the likelihood ratio test (LRT) under multinomial sampling. These bounds can be used to obtain conservative p-values and critical values when the sample size is not large relative to the alphabet size, so standard chi-squared asymptotics may be inaccurate.

The package exposes three main functions:

When to Use This Package

The package is most useful when:

In this setting, tailProbBound() and criticalValue() provide conservative finite-sample alternatives to asymptotic likelihood-ratio approximations.

Basic Usage

Tail Probabilities

The most direct workflow is to evaluate the finite-sample tail bound for a fixed LRT value.

tailProbBound(x = 20, k = 7, n = 50)
#> [1] 0.304216

This value is a conservative upper bound on the tail probability. The function also supports independent multinomial samples by passing vectors of matching lengths for k and n, for example tailProbBound(x = 12, k = c(3, 4), n = c(12, 20)).

Computing a Critical Value

To calibrate a test at level p, use criticalValue().

criticalValue(k = 10, n = 40, p = 0.05)
#> [1] 30.77278

The returned value can be used as a finite-sample alternative to the standard chi-squared cutoff.

Example: What percentage of butterflies were unseen by Corbet?

Naturalist Alexander Steven Corbet spent two years trapping butterflies in Peninsular Malaysia. Here is the data collected by Corbet.

Frequency 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Species 118 74 44 24 29 22 20 19 20 15 12 14 6 12 6

Here we ask the question: what percentage of butterflies in Malaya belonged to the species that Corbet had not seen? That is, we want to estimate the proportion of butterflies from all the unseen species. Clearly, the MLE is zero based on the sample. Instead, we ask for an upper bound with 95% confidence.

Let k = 435 + 1, where 435 is the number of species observed by Corbet. The corresponding observed distribution is the distribution from the table above with a zero entry at the end. In the following program, we maximize the last entry of the underlying distribution subject to a tail bound on the LRT, which equals the Kullback-Leibler divergence KL(observed distribution || true distribution) multiplied by twice the sample size.

See the paper at the bottom of this page for details.

library(plyr)

# Corbet butterfly data
# https://en.wikipedia.org/wiki/Unseen_species_problem
corbet.butterfly <- data.frame(j=1:15, n_j=c(118,74,44,24,29,22,20,19,20,15,12,14,6,12,6))
n.butterfly <- Reduce(c, alply(corbet.butterfly, 1, function(.df) rep(.df$j, .df$n_j)))

alpha <- 0.05
n.observed <- c(n.butterfly, 0)    # the last one is the unseen
n <- sum(n.observed)
k <- length(n.observed)
p.observed <- n.observed / n

# critical value
t.alpha <- criticalValue(k, n, p=alpha, verbose = TRUE)

cat(sprintf("critical value = %f\n", t.alpha))
#> critical value = 962.402970

We get printout critical value = 962.402970. This value is then used in the following convex program.

# convex program
p <- Variable(k)
obj <- p[k]
constr <- list(p>=0, 
               sum(p) == 1, 
               2 * n * sum(p.observed * (log(p.observed) - log(p))) <= t.alpha)
prob <- Problem(Maximize(obj), constr)
result <- solve(prob, verbose=FALSE) # you should try other solvers

# result
print(result$status)
unseen <- result$value

cat(sprintf("unseen <= %f\n", unseen))

With MOSEK solver, the percentage of unseen species is at most 21.1%.

References

F. Richard Guo and Thomas S. Richardson (2021). Chernoff-Type Concentration of Empirical Probabilities in Relative Entropy. IEEE Transactions on Information Theory, 67(1), 549-558. https://doi.org/10.1109/TIT.2020.3034539