power.nb is a package developed for estimating
statistical power and required sample size in differential abundance
microbiome studies using negative binomial models. The package includes
tools for simulation-based power analysis and sample size estimation
using generalized additive models (GAMs), and visualization utilities
for exploring the relationship between power, effect size, abundance,
and sample size.
power.nb presents two novel microbiome data simulations
frameworks: the Mixture of Gaussians (MixGaussSim) and the Reduced Rank
(RRSim) simulation approaches. MixGaussSim models the distribution of
mean abundance of taxa and the distribution of fold change (as a
function of mean abundance of taxa) by finite Mixture of Gaussian
distributions. Microbiome count data is then simulated from a negative
binomial distribution. Detailed description of the MixGaussSim method is
presented in the paper “Investigating
statistical power of differential abundance studies” (Agronah and Bolker 2025).
RRSim (this simulation method is still under development and some
functions in RRSim are yet to tested), on the other hand models the mean
abundance of all taxa jointly with a mixed-effects model while
accounting for correlations among taxa within subjects and zero
inflation in microbiome data. Given the high dimensionality of
microbiome data (usually having hundreds or thousands of taxa),
modelling correlations between taxa requires estimating thousands or
even millions of parameters for the correlation matrix. RRSim uses the
reduced rank functionality implemented in the glmmTMB (Magnusson et al. 2017) packages to reduce the
number of parameter estimates required of the variance-correlation
matrix. Details of this method is presented in the thesis “A Novel
Approach for Simulation-Based Power Estimation and Joint Modeling of
Microbiome Counts” (Agronah 2025)
Install the package from GitHub using devtools:
install.packages("devtools")
devtools::install_github("magronah/power.nb")
Since the goal of differential abundance studies is to identify taxa that differ significantly between groups, statistical power must be estimated at the level of individual taxa. Because of the complexity of microbiome data, analytical approaches based on theoretical distributions of test statistics (e.g., non-central t or chi-squared distributions) are not feasible (Arnold et al. 2011). Therefore, power estimation typically relies on simulation-based methods that can mimic the characteristics of real microbiome data.
Statistical power for a given taxon depends on both its mean abundance and effect size (defined in this package as fold changes) (Agronah and Bolker 2025). Thus, estimating the distributions of mean abundance and fold change of taxa explicitly will offer great benefit for estimation of statistical power for individual taxon. We use the MixGaussSim simulation approach to simulate microbiome data for power calculation due to its flexibility in modelling the distributions of taxa mean abundances and effect sizes explicitly as as well as their relationship.
There are one of two ways to obtain the parameters of the MixGaussSim method for data simulation. Users can:
1. pre-specify a set of parameters: To help users decide on plausible parameter values, we have run MixGaussSim on some of the microbiome datasets used in the paper “Microbiome differential abundance methods produce different results across 38 datasets” (Nearing et al. 2022) and presented the parameter estimates obtained from each of these datasets (see Appendix for the parameter estimates from these datasets ).
2. estimate parameters from an existing dataset: Secondly, users can also obtain estimates directly from fitting the model implemented in MixGaussSim to existing datasets. The following steps outlines the process for estimating parameters from an existing dataset.
We use a subset of the Obesity dataset, one of the datasets analyzed in (Nearing et al. 2022). The full dataset contains 52 samples and 1203 taxa.
##Load libraries
library(tidyverse)
library(dplyr)
library(power.nb)
library(patchwork)
library(rlist)
library(latex2exp)
data_path = system.file("pkgdata", package = "power.nb")
##Read count data and metadata
data <- read.table(
file.path(data_path, "ob_ross_ASVs_table.tsv"),
header = TRUE, sep = "\t",
check.names = FALSE, comment.char = "",
row.names = 1
)
metadata <- read.table(
file.path(data_path, "ob_ross_metadata.tsv"),
header = TRUE, sep = "\t",
check.names = FALSE, comment.char = ""
)
metadata <- metadata %>%
setNames(c("SampleID", "Groups"))
dim(data); dim(metadata)
## [1] 1203 52
## [1] 52 2
head(data,2); head(metadata,2)
Low abundant tax exhibit high variability, posing challenges in detecting significant differences between groups (Love, Huber, and Anders 2014). Pre-filtering steps, as is routinely done in differential abundance analysis, is used to filter out these low abundant taxa. We filter out taxa with less that 10 count in less that 5 samples.
filter_data = filter_low_count(countdata = data,
metadata = metadata,
abund_thresh = 3,
sample_thresh = 2,
sample_colname = "SampleID",
group_colname = "Groups")
dim(filter_data)
## [1] 707 52
We estimate fold change for each taxon and dispersion parameters
using the negative binomial model implemented in the DESeq2
package (Love, Huber, and Anders 2014).
The negative binomial model is widely used in both microbiome and
RNA-seq analyses and is implemented in the NBZIMM (Yi 2020) and edgeR R packages
(Robinson, McCarthy, and Smyth 2010). The
model implemented in the DESeq2 package is described as
follows:
Let \(K_{ij}\) represent the
observed count for the \(i^{\text{th}}\) taxon in the \(j^{\text{th}}\) sample. The model assumes
that
\[K_{ij} \sim \text{NB}(\mu_{ij},
\alpha_{ij}),\] where the mean is defined as \(\mu_{ij} = s_j q_{ij}\), and \(s_j\) is the normalization factor for
sample \(j\). The expected abundance
prior to normalization, \(q_{ij}\), is
modeled as
\[\log q_{ij} = \sum_r x_{jr}
\beta_{ir},\] with \(x_{jr}\)
denoting the covariates and \(\beta_{ir}\) the associated coefficients.
The dispersion parameter is assumed constant across samples for each
taxon, such that \(\alpha_{ij} =
\alpha_i\). This formulation links the variance and mean
through
\[\mathrm{Var}(K_{ij}) = \mu_i + \alpha_i
\mu_i^2.\]
Estimation of the regression coefficients \(\hat{\beta}_{ir}\) and dispersion
parameters \(\hat{\alpha}_i\) was
carried out using the empirical Bayes shrinkage procedures implemented
in the DESeq2 package Anders and
Huber (2010).
unique(metadata$Groups)
## [1] "OB" "H"
foldchange_est <- deseqfun(countdata = filter_data,
metadata = metadata,
alpha_level = 0.1,
ref_name = "H",
group_colname = "Groups",
sample_colname = "SampleID")
logfoldchange = foldchange_est$deseq_estimate$log2FoldChange
head(logfoldchange)
## [1] 0.3472147294 -0.8840585417 0.2167456449 -0.5366652659 0.0008622315
## [6] -0.8221867027
We modelled the log mean abundance (defined as the logarithm of the arithmetic mean abundance across both control and treatment groups) as a finite mixture of Gaussian distributions. To determine the optimal number of mixture components, denoted by \(k\), we employed a parametric bootstrap approach to sequentially compare models with \(k\) and \(k+1\) components.
Let the observed data be denoted by \(\mathbf{y} = (y_1, y_2, \dots, y_n)\), where each \(y_i\) represents the log mean abundance for the \(i^{\text{th}}\) taxon. The mixture model with \(k\) components is given by
\[f_k(y_i \mid \boldsymbol{\theta}_k) = \sum_{j=1}^{k} \pi_{j,k} \, \phi(y_i \mid \mu_{j,k}, \sigma_{j,k}^2),\]
where \(\pi_{j,k}\) are the mixing proportions satisfying \(\sum_{j=1}^{k} \pi_{j,k} = 1\), and \(\phi(\cdot \mid \mu, \sigma^2)\) denotes the normal density with mean \(\mu\) and variance \(\sigma^2\).
To test whether adding an additional component improves model fit, we formulated the hypotheses as:
\[ \begin{cases} H_0: \mathbf{y} \sim f_k(y_i \mid \boldsymbol{\theta}_k) & \textrm{(the data follow a $k$-component mixture model)} \\ H_a: \mathbf{y} \sim f_{k+1}(y_i \mid \boldsymbol{\theta}_{k+1}) & \textrm{(the data follow a $(k+1)$-component mixture model)} \end{cases}\] {=latex}’
The test statistic is the likelihood ratio statistic:
’ \[\Lambda = 2 \left[ \ell_{k+1}(\widehat{\boldsymbol{\theta}}_{k+1}) - \ell_{k}(\widehat{\boldsymbol{\theta}}_{k}) \right],\] where \(\ell_{k}(\widehat{\boldsymbol{\theta}}_{k})\) and \(\ell_{k+1}(\widehat{\boldsymbol{\theta}}_{k+1})\) are the maximized log-likelihoods under the \(k\)- and \((k+1)\)-component models, respectively. {=latex}’
To approximate the null distribution of \(\Lambda\), we performed a parametric bootstrap using \(B = 100\) bootstrap samples generated under \(H_0\). For each bootstrap replicate \(b = 1, \dots, B\), data were simulated from the fitted \(k\)-component model, and both the null and alternative models were re-fitted to obtain \(\Lambda^{(b)}\). The empirical \(p\)-value is computed as
\[p = \frac{1}{B} \sum_{b=1}^{B} I\left( \Lambda^{(b)} \ge \Lambda_{\text{obs}} \right),\] where \(\Lambda_{\text{obs}}\) is the observed likelihood ratio statistic.
If \(p < 0.05\), the null hypothesis \(H_0\) is rejected in favor of the \((k+1)\)-component model. Testing proceeds sequentially for \(k = 1, 2, \dots, 5\) until the \(p\)-value exceeds the \(5\%\) significance threshold, at which point the \(k\)-component model from the last accepted test is selected as the optimal mixture size (Benaglia et al. 2010).
### function to mute printing out messages about
### update on the number of iterations
quiet <- function(expr) {
out <- suppressWarnings(suppressMessages(
capture.output(res <- eval.parent(substitute(expr)))
))
res}
### Path to other files used in this vignette
extdata_path = system.file("extdata", package = "power.nb")
logmean = log(rowMeans(filter_data))
logmeanFit <- readRDS(file.path(extdata_path, "logmeanFit.rds"))
## The code below was used to generate the precomputed
## `logmeanFit.rds` object. The saved result is loaded
## throughout this vignette to reduce vignette build times.
# logmeanFit = quiet(logmean_fit(logmean, sig = 0.05,
# max.comp = 4, max.boot = 100))
We also modelled log fold change of taxa as a finite mixture of Gaussian distributions. Fold change is typically related to mean abundance (Love, Huber, and Anders 2014); therefore, we expressed the mean and standard deviation parameters of the individual Gaussian components as functions of log mean abundance. Detailed description of the model fitted to log fold changes is presented in the paper “Investigating statistical power of differential abundance studies” (Agronah and Bolker 2025).
logfoldchangeFit <- readRDS(file.path(extdata_path, "logfoldchangeFit.rds"))
logfoldchangeFit
## $par
## logitprob_1 mu_int_1 mu_int_2 mu_slope_1 mu_slope_2 logsd_.1_1
## 1.688126642 -1.054068569 0.115132772 -0.611302399 0.007166436 -0.554719649
## logsd_.1_2 logsd_.L_1 logsd_.L_2
## -0.512670667 0.415862601 0.233292472
##
## $np
## [1] 2
##
## $sd_ord
## [1] 1
##
## $aic
## [1] 1609.076
## The code below was used to generate the precomputed
## `logfoldchangeFit.rds` object. The saved result is loaded
## throughout this vignette to reduce vignette build times.
# logfoldchangeFit <- quiet(logfoldchange_fit(logmean,
# logfoldchange,
# ncore = 1,
# max_sd_ord = 2,
# max_np = 5,
# minval = -5,
# maxval = 5,
# itermax = 100,
# NP = 800,
# seed = 100))
Because dispersion tends to depend on mean count abundance with rarer
taxa showing greater variability we modeled the dispersion as a
nonlinear function of the mean, following the approach implemented in
DESeq2. The nonlinear function is defined by
\[ d = c_0 + \frac{c_1}{m} \]
where \(d\) and \(m\) denote the scaled dispersion and mean abundance respectively. The term \(c_0\) represents the asymptotic dispersion level for high abundance taxa, and \(c_1\) captures additional dispersion variability. This procedure enables estimation of dispersion values corresponding to given mean counts, facilitating their use in downstream simulations and power analyses.
dispersion = foldchange_est$dispersion
dispersionFit = dispersion_fit(dispersion, logmean)
dispersionFit$param
To simulate microbiome data, we first simulate log mean counts and corresponding log fold changes. The simulated log mean counts represent overall log mean counts (i.e., log of the arithmetic mean of the counts in all of treatment and control groups). We then use the simulated log mean count to estimate corresponding dispersion values for each sample through the fitted nonlinear function described in the equation here (see section on Modelling the Distribution of Log Fold Changes). Using the simulated log mean counts and fold changes, we now compute the group specific mean counts (i.e., mean counts of taxa for control group and mean count of taxa for treatment group). We then simulate microbiome count data from the negative binomial distribution using the group specific mean counts and the estimated dispersion values.
### Simulated log mean count and log foldchange from the fitted
### log mean count and log foldchange models
logmean_param = logmeanFit$param
logfoldchange_param = logfoldchangeFit
notu = dim(filter_data)[1]
notu
## [1] 707
logmean_sim = logmean_sim_fun(logmean_param, notu)
logfoldchange_sim = logfoldchange_sim_fun(logmean_sim = logmean_sim,
logfoldchange_param =
logfoldchange_param,
max_lfc = 15,
max_iter = 30000)
head(logfoldchange_sim)
## [1] 0.16326156 0.08548872 -0.09128340 0.28545329 0.21357079 1.09021814
Estimating the dispersion values is done internally within the
countdata_sim_fun function.
nsim = 4
ntreat = sum(metadata$Groups == "OB")
ncont = sum(metadata$Groups == "H")
dispersion_param = dispersionFit$param
countdata_sims = countdata_sim_fun(logmean_param,
logfoldchange_param,
dispersion_param,
nsamp_per_group = NULL,
ncont = ncont,
ntreat = ntreat,
notu,
nsim = nsim,
disp_scale = 0.7,
max_lfc = 15,
maxlfc_iter = 1000,
seed = 121)
## mixtools package, version 2.0.0.1, Released 2022-12-04
## This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772 and the Chan Zuckerberg Initiative: Essential Open Source Software for Science (Grant No. 2020-255193).
dim(countdata_sims$treat_countdata_list$sim_1)
## [1] 707 31
dim(countdata_sims$control_countdata_list$sim_1)
## [1] 707 21
dim(countdata_sims$countdata_list$sim_1)
## [1] 707 52
dim(countdata_sims$metadata_list$sim_1)
## [1] 52 2
We compare the following comparisons between our simulation and the actual dataset
we come distribution of variances of counts of taxa and
the means of taxa from out simulation with those of the actual dataset
compare_dataset <- function(countdata_sim_list,countdata_obs,method = c("var", "mean")){
# Check if method is either "var" or "mean"
if (!(method %in% c("var", "mean"))) {
stop("Invalid method. Please choose either 'var' or 'mean'.")
}
# Calculate variance or mean based on the chosen method
if (method == "var") {
calc_func <- stats::var
xlab_text <- "$\\log$(variance of taxa)"
} else {
calc_func <- mean
xlab_text <- "$\\log$(mean of taxa)"
}
# Create a list combining simulation data and observed data
dlist <- list.append(countdata_sim_list, obs_filt = countdata_obs)
# Calculate variance or mean for each dataset in the list
vars <- dlist |> purrr::map(~ apply(., 1, calc_func)) |>
purrr::map_dfr(~ tibble(var = .), .id = "type")
# Separate observed and simulated data
vars_obs <- vars[vars$type == "obs_filt", ]
vars_sim <- vars[vars$type != "obs_filt", ]
# Create ggplot for visualization
p <- ggplot(vars_sim, aes(x = log(var), colour = type)) +
geom_density(lwd = 1.5) +
geom_density(data = vars_obs, aes(x = log(var)),
colour = "black", linetype = "dashed", lwd = 1.5) +
xlab(TeX(xlab_text)) + theme_bw()
return(p)
}
countdata_sim_list = countdata_sims$countdata_list[1]
countdata_obs = filter_data
p11 = compare_dataset(countdata_sim_list,countdata_obs,method = "mean")
p12 = compare_dataset(countdata_sim_list,countdata_obs,method = "var")
(p11|p12) + plot_layout(guides = "collect")
dim(countdata_obs)
## [1] 707 52
We estimate statistical power for a given taxon as the probability of rejecting the null hypothesis that the mean abundance in the control and treatment groups are the same for that given taxon. To estimate statistical power for various combinations of log mean abundance and log fold change, we fitted a shape-constrained generalized additive model (GAM) (Pya and Wood 2015). The event that a given taxon is significantly different between groups is treated as a Bernoulli trial.
Define the logistic function as \[\text{Logist}(x) = \left(1 + \exp(-x)\right)^{-1}\].
The model predicting statistical power as a function of log fold change and log mean abundance is given by:
\[\begin{align} y &\sim \text{Bernoulli}(p_i) \\ p_i &= \text{Logist}(x) \\ \eta &= \beta_0 + f_1(x_1, x_2), \end{align}\]
where:
Power and fold change are positively correlated. Additionally, effect sizes of highly abundant taxa are more likely to be detected—hence having higher power—than those of rare taxa. To account for these relationships, the function \(f_1\) was constrained to be a monotonically increasing function of both log mean abundance and log fold change.
We used the DESeq2 package to compute p-values for each
taxon, applying the Benjamini and Hochberg method for false discovery
rate (FDR) correction. We used the default critical p-value threshold of
0.1 in the DESeq2 package.
The follow steps outlines the power estimation procedure:
To train the GAM on a sufficiently large sample size, we generate
multiple microbiome count datasets following the procedure described in
the section Simulate count
microbiome data. We then estimate p-values associated with the
simulated fold changes using the DESeq2 package.
countdata_list = countdata_sims$countdata_list
metadata_list = countdata_sims$metadata_list
desq_est = quiet(deseq_fun_est(metadata_list = metadata_list,
countdata_list = countdata_list,
alpha_level = 0.1,
group_colname = "Groups",
sample_colname = "Samples",
num_cores = 1,
ref_name = "control"))
deseq_est_list = lapply(desq_est, function(x){x$deseq_estimate})
names(desq_est$sim_1)
## [1] "logfoldchange" "dispersion" "deseq_estimate" "normalised_count"
## [5] "deseq_object" "metadata" "countdata"
We now fit the GAM for predicting statistical power for individual taxon.
true_lmean_list = countdata_sims$logmean_list
true_lfoldchange_list = countdata_sims$logfoldchange_list
gamFit <- gam_fit(deseq_est_list,
true_lfoldchange_list,
true_lmean_list,
grid_len = 50,
alpha_level=0.1)
range(true_lfoldchange_list)
## [1] -8.952604 7.643726
range(true_lmean_list)
## [1] -2.383209 6.458936
cont_breaks = seq(0,1,0.1)
combined_data = gamFit$combined_data
power_estimate = gamFit$power_estimate
contour_plot <- contour_plot_fun(combined_data,
power_estimate,
cont_breaks)
contour_plot
nsim = 4
nsample_vec = seq(10, 100, 20)
countdata_sims_list = list()
for(j in 1:length(nsample_vec)){
countdata_sims_list[[j]] = countdata_sim_fun(logmean_param,
logfoldchange_param,
dispersion_param,
nsamp_per_group = nsample_vec[j],
ncont = NULL,
ntreat = NULL,
notu,
nsim = nsim,
disp_scale = 0.3,
max_lfc = 15,
maxlfc_iter = 1000,
seed = 121)
}
names(countdata_sims_list) = paste0("sample_",nsample_vec)
desq_est_list = list()
for(i in 1:length(countdata_sims_list)){
countdata_list = countdata_sims_list[[i]]$countdata_list
metadata_list = countdata_sims_list[[i]]$metadata_list
desq_est_list[[i]] = deseq_fun_est(metadata_list = metadata_list,
countdata_list = countdata_list,
alpha_level = 0.1,
group_colname = "Groups",
sample_colname = "Samples",
num_cores = 1,
ref_name = "control")
}
names(desq_est_list) = paste0("sample_",nsample_vec)
We now fit a GAM model to predict statistical power. This time, we predict GAM with covariates as mean count of taxa, fold change of taxa and group sample size. Let \(n_k\) for \(k=1,2,...,N\) denote the sample size per group corresponding to the \(k^{th}\) dataset. For a given dataset, we assume equal sample sizes across all groups. The value of \(n_k\) is constant within each dataset but varies across datasets. The GAM model is defined by
\[\begin{equation} \begin{cases} y &\sim \text{Bernoulli}(p_i) \\ p_i &= \text{Logist}(\eta) \\ \eta &= \beta_0 + f_1({x}_{1i},{x}_{2i}) + f_2({x}_{1i},n_k) + f_3({x}_{2i},n_k), \end{cases} \label{eqn2} \end{equation}\] where the functions \(f_1, f_2\) and \(f_3\) are two-dimensional smoothing surfaces with basis generated by the tensor product smooth of pairs of \(x_{1i},x_{2i}\) and \(n_k\) . \(f_1, f_2\) and \(f_3\) are constrained to be monotonically increasing functions.
pow_ss <- readRDS(file.path(extdata_path, "gam_fit_MultiSamples.rds"))
## The code below was used to generate the precomputed
## `gam_fit_MultiSamples.rds` object.
## The saved result is loaded throughout
## this vignette to reduce vignette build times.
# deseq_list = lapply(desq_est_list, function(x){
# read_data(x, "deseq_estimate")
# })
#
# pval_est_list <- lapply(deseq_list, function(sample_list) {
# lapply(sample_list, function(sim_df) {
# sim_df$padj
# })
# })
#
#
# logfoldchange_list = read_data(countdata_sims_list,"logfoldchange_list")
# logmean_list = read_data(countdata_sims_list,"logmean_list")
#
# pow_ss <- power_fun_ss(pval_est_list,
# logmean_list,
# nsample_vec = nsample_vec,
# logfoldchange_list,
# alpha_level=0.1)
For a given target power, the sample size required to detect a taxon with a given fold change and mean abundance is obtained by a root-finding technique. Let \(g(x_{1i}, x_{2i}, n_k)\) denote the estimated linear predictor from the fitted GAM. The predicted power is obtained via the logistic transformation
\[\begin{align} \hat{p}_i = \text{Logist}\big(g(x_{1i}, x_{2i}, n_k)\big). \end{align}\]
Given \({x}_{1i}, {x}_{2i}\) and a target power \(p^\ast\), the required sample size \(n_k\) is estimated by finding the root of the equation: \[\begin{align} \text{Logist}\big(g(x_{1i}, x_{2i}, n_k)\big) - p^\ast = 0. \end{align}\]
target_power = 0.8; model = pow_ss$gam_mod; abs_lfc = 1.2; logmean = 4
ss_estimate = uniroot_ss(target_power, logmean, abs_lfc, model,
xmin = log2(10), xmax = log2(5000),
maxiter = 10000,
max_report = 2000)
ss_estimate
## $sample_size_per_group
## [1] 277.471
##
## $conclusion
## [1] "Success."
##
## $predicted_power_min_n
## [1] 0.004765494
##
## $predicted_power_max_n
## [1] 0.9882709
target_power_vec <- seq(0.5, 0.9, 0.05) #c(0.6, 0.7, 0.8, 0.9)
abs_lfc_vec <- c(0.5, 1.0, 1.5, 2.0)
logmean_vec <- c(2, 4, 6)
param_grid <- expand.grid(
target_power = target_power_vec,
abs_lfc = abs_lfc_vec,
logmean = logmean_vec
)
param_grid$sample_size_per_group <- apply(
param_grid,
1,
function(x) {
uniroot_ss(
target_power = as.numeric(x["target_power"]),
logmean = as.numeric(x["logmean"]),
abs_lfc = as.numeric(x["abs_lfc"]),
model = pow_ss$gam_mod,
xmin = log2(10),
xmax = log2(5000),
maxiter = 10000,
max_report = 2000)$sample_size_per_group
}
)
head(param_grid)
ggplot(param_grid,
aes(x = target_power,
y = sample_size_per_group,
group = 1)) +
geom_line() +
geom_point() +
facet_grid(
abs_lfc ~ logmean,
labeller = labeller(
abs_lfc = function(x) paste0("|log(fold change)| = ", x),
logmean = function(x) paste0("log(mean count) = ", x)
)
) +
labs(
x = "Target Power",
y = "Estimated Sample Size per Group",
title = "Estimated Sample Size by Target Power"
) +
theme_bw() +
theme(
plot.title = element_text(hjust = 0.5)
)
| Dataset | \(n_{comp}\) | \(p_1\) | \(\mu_1\) | \(\sigma_1\) | \(n_{comp}\) | \(p_2\) | \(\mu_2\) | \({f(x)}_{\sigma_2}\) | ||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ArcticFireSoils | 4 | (0.10,0.38,0.37,0.15) | (-2.81,-1.56,0.31,3.27) | (0.45,0.80,1.35,2.39) | 2 | (-3.88) | \(0.13 + 0.06x_i,\;-1.55 - 0.62x_i\) | \(-0.65 + 0.30x_i,\;0.44 + 0.46x_i\) | ||||
| Blueberry | 4 | (0.37,0.22,0.29,0.12) | (-0.47,-1.45,0.83,2.54) | (0.71,0.52,1.05,1.66) | 2 | (4.97) | \(0.16 - 4.31x_i,\;0.01 + 0.01x_i\) | \(-1.17 + 2.07x_i,\;-1.07 + 0.13x_i\) | ||||
| cdi_schubert | 3 | (0.54,0.37,0.09) | (-2.02,0.78,4.17) | (0.94,1.57,2.04) | 2 | (-1.16) | \(-1.42 - 0.37x_i,\;2.61 + 0.77x_i\) | \(-0.27 + 0.54x_i,\;-0.17 + 0.23x_i\) | ||||
| cdi_vincent | 2 | (0.56,0.44) | (3.06,-0.03) | (2.10,0.89) | 2 | (4.93) | \(1.04 - 1.14x_i,\;-0.21 - 0.09x_i\) | \(1.05 - 0.49x_i,\;0.18 + 0.13x_i\) | ||||
| Chemerin | 3 | (0.56,0.02,0.42) | (-0.86,0.74,2.27) | (1.14,0.09,2.32) | 2 | (4.52) | \(2.95 + 3.42x_i,\;-0.02 - 0.00x_i\) | \(-1.25 - 2.24x_i,\;-1.31 + 0.07x_i\) | ||||
| crc_baxter | 4 | (0.16,0.33,0.20,0.32) | (-3.13,-2.01,-0.49,2.73) | (0.55,0.76,2.46,1.21) | 2 | (4.65) | \(4.25 + 3.11x_i,\;-0.09 - 0.03x_i\) | \(0.34 + 1.84x_i,\;-0.39 + 0.43x_i\) | ||||
| crc_zeller | 5 | (0.13,0.40,0.30,0.13,0.03) | (-2.32,-1.33,0.15,2.33,5.75) | (0.50,0.71,1.09,1.70,2.52) | 2 | (3.66) | \(0.78 + 0.88x_i,\;-0.16 - 0.02x_i\) | \(1.53 - 0.39x_i - 1.30x_i^2,\;-0.31 + 0.35x_i - 0.03x_i^2\) | ||||
| edd_singh | 1 | 1 | (-0.42) | (2.39) | 2 | (-1.99) | \(0.09 + 0.01x_i,\;-2.20 - 0.72x_i\) | \(0.13 + 0.28x_i,\;-0.54 + 0.16x_i\) | ||||
| Exercise | 3 | (0.15,0.48,0.36) | (-1.20,0.17,2.39) | (0.50,1.10,1.97) | 2 | (-4.83) | \(0.03 - 0.02x_i,\;2.27 - 2.38x_i\) | \(-1.23 + 0.05x_i,\;-1.55 + 1.39x_i\) | ||||
| glass_plastic_oberbeckmann | 3 | (0.44,0.50,0.06) | (-0.08,1.91,5.71) | (0.64,1.23,1.18) | 2 | (-3.83) | \(-0.07 - 0.05x_i,\;2.17 + 0.13x_i\) | \(-0.51 + 0.06x_i,\;-1.39 + 0.51x_i\) | ||||
| GWMC_ASIA_NA | 4 | (0.10,0.35,0.38,0.17) | (-4.72,-3.18,-1.17,1.41) | (0.61,0.98,1.50,2.28) | 2 | (4.08) | \(3.69 - 0.21x_i,\;0.11 + 0.03x_i\) | \(0.26 - 1.68x_i,\;0.55 + 0.46x_i\) | ||||
| GWMC_HOT_COLD | 5 | (0.14,0.35,0.37,0.00,0.14) | (-4.73,-3.16,-1.06,-0.93,1.57) | (0.69,1.55,1.05,0.07,2.28) | 2 | (3.58) | \(1.62 + 0.25x_i,\;-0.03 - 0.01x_i\) | \(0.33 + 0.36x_i,\;0.68 + 0.65x_i\) | ||||
| hiv_dinh | 3 | (0.49,0.45,0.06) | (0.42,2.15,5.79) | (0.84,1.25,0.70) | 2 | (-2.29) | \(0.33 + 0.11x_i,\;-1.54 - 0.68x_i\) | \(-0.36 + 0.14x_i,\;-0.93 + 0.25x_i\) | ||||
| hiv_lozupone | 3 | (0.30,0.28,0.42) | (-0.12,1.30,3.80) | (0.92,0.50,1.83) | 2 | (-4.86) | \(0.30 + 0.24x_i,\;-4.39 + 0.21x_i\) | \(-0.04 + 0.17x_i,\;-0.79 - 0.25x_i\) | ||||
| hiv_noguerajulian | 4 | (0.09,0.46,0.25,0.19) | (-2.83,-1.71,0.21,3.64) | (0.46,0.75,1.33,2.32) | 2 | (-2.44) | \(0.07 + 0.04x_i,\;-0.47 - 0.19x_i\) | \(-0.72 + 0.19x_i,\;-0.18 - 0.20x_i\) | ||||
| ibd_papa | 3 | (0.06,0.49,0.45) | (0.14,1.71,4.69) | (0.69,0.57,0.98) | 2 | (-1.31) | \(-0.32 + 0.11x_i,\;0.57 + 0.11x_i\) | \(0.16 + 0.06x_i,\;-1.30 + 0.48x_i\) | ||||
| Ji_WTP_DS | 3 | (0.33,0.30,0.37) | (-0.68,2.61,6.67) | (0.84,2.69,1.84) | 2 | (4.44) | \(2.59 - 1.66x_i,\;0.10 + 0.00x_i\) | \(-1.34 - 1.83x_i,\;-0.78 + 0.04x_i\) | ||||
| MALL | 3 | (0.39,0.11,0.50) | (-0.65,0.65,3.26) | (0.82,0.40,1.83) | 2 | (4.26) | \(-0.09 + 1.07x_i,\;0.08 + 0.05x_i\) | \(-0.22 - 1.87x_i + 1.43x_i^2,\;0.11 + 0.20x_i - 0.03x_i^2\) | ||||
| ob_goodrich | 4 | (0.15,0.16,0.62,0.06) | (-3.89,-2.39,-0.17,3.31) | (0.58,1.61,0.98,2.47) | 2 | (4.90) | \(-0.35 - 2.37x_i,\;-0.09 - 0.02x_i\) | \(-2.92 + 1.76x_i,\;-0.71 + 0.39x_i\) | ||||
| ob_ross | 2 | (0.34,0.66) | (-0.58,2.34) | (0.75,1.98) | 2 | (4.40) | \(0.81 + 0.42x_i,\;0.02 + 0.01x_i\) | \(-2.19 + 0.27x_i,\;-0.11 + 0.01x_i\) | ||||
| ob_turnbaugh | 3 | (0.55,0.03,0.42) | (-0.98,0.14,1.22) | (0.79,0.08,1.66) | 2 | (1.60) | \(-1.24 - 0.36x_i,\;0.28 + 0.10x_i\) | \(-0.38 + 0.46x_i,\;-0.24 + 0.18x_i\) | ||||
| ob_zhu | 3 | (0.57,0.01,0.42) | (-0.12,0.53,2.77) | (0.82,0.04,1.97) | 2 | (-2.12) | \(-0.30 + 0.05x_i,\;1.03 + 0.47x_i\) | \(0.01 + 0.11x_i,\;-0.61 + 0.19x_i\) | ||||
| ob_zupancic | 3 | (0.45,0.32,0.23) | (-1.98,-0.56,1.91) | (0.87,1.42,0.41) | 2 | (-1.90) | \(-0.07 + 0.03x_i,\;1.26 + 0.18x_i\) | \(-0.04 + 0.17x_i,\;-0.74 - 0.29x_i\) | ||||
| par_scheperjans | 3 | (0.39,0.32,0.28) | (-1.71,-0.31,1.34) | (0.71,1.02,1.58) | 2 | (4.16) | \(0.88 + 0.44x_i,\;-0.10 - 0.02x_i\) | \(-0.07 + 0.62x_i,\;-0.44 + 0.15x_i\) | ||||
| sed_plastic_hoellein | 3 | (0.41,0.39,0.20) | (-0.03,1.46,3.32) | (0.65,1.05,1.67) | 2 | (-4.73) | \(-0.07 - 0.00x_i,\;-0.52 + 0.41x_i\) | \(-0.42 + 0.10x_i,\;0.23 - 0.99x_i\) | ||||
| sed_plastic_rosato | 3 | (0.40,0.43,0.17) | (-0.25,1.58,4.06) | (0.70,1.23,1.93) | 2 | (4.65) | \(-0.25 - 1.49x_i,\;-0.08 - 0.04x_i\) | \(0.70 + 0.67x_i,\;-1.61 + 0.19x_i\) | ||||
| sw_plastic_frere | 5 | (0.06,0.25,0.33,0.27,0.08) | (-1.88,-1.04,0.19,1.93,4.08) | (0.33,0.52,0.83,1.35,2.21) | 2 | (2.68) | \(1.82 + 0.35x_i,\;-0.06 + 0.01x_i\) | \(0.43 - 0.40x_i,\;-0.54 + 0.11x_i\) | ||||
| t1d_alkanani | 5 | (0.06,0.30,0.47,0.15,0.02) | (-3.06,-2.54,-1.91,-0.87,0.82) | (0.17,0.30,0.46,0.72,1.37) | 5 | (0.23,-2.85,-4.30,4.83) | \(-1.75 + 0.01x_i,\;-0.36 - 2.61x_i,\;-0.34 + 2.04x_i\) | \(3.94 + 3.15x_i,\;-1.28 - 1.36x_i,\;-3.81 - 3.44x_i\) | ||||
| \(-0.13 - 0.93x_i,\;-0.03 - 0.02x_i\) | \(-2.31 + 3.62x_i,\;-1.12 + 0.63x_i\) | |||||||||||
| t1d_mejialeon | 2 | (0.47,0.53) | (0.73,3.57) | (0.92,1.94) | 2 | (-2.93) | \(0.25 + 0.02x_i,\;-3.32 + 0.13x_i\) | \(-0.04 + 0.06x_i,\;-0.25 + 0.35x_i\) | ||||
| wood_plastic_kesy | 3 | (0.29,0.42,0.29) | (-1.16,1.16,4.65) | (0.70,1.43,2.46) | 2 | (3.96) | \(0.28 + 2.24x_i,\;-0.00 - 0.01x_i\) | \(0.12 + 0.67x_i,\;-1.31 + 0.10x_i\) |
We would like to express our sincere gratitude to Dr. Jacob T. Nearing for his generosity in granting us access to the 38 microbiome datasets used in the paper “Microbiome differential abundance methods produce different results across 38 datasets” (Nearing et al. 2022). These datasets were invaluable in developing, testing, and validating this R package for power and sample size calculation in microbiome studies.
We also sincerely appreciate the following individuals for running the functions in this package on the microbiome datasets published in (Nearing et al. 2022), thereby providing users with plausible parameter values for their research work: Alim Tuguzbay, Arjya Bhattacharjee, Atiyah Sulaiman, George Vinichenko, Josh Effero, Mahdiya Adnan, Mir Akbaar Khaled, Nevin Xu, Nicole Padoun, Riya Srivastava, and Shaira Habib (University of Alberta: Alim Tuguzbay, Arjya Bhattacharjee, Atiyah Sulaiman, George Vinichenko, Josh Effero, Mahdiya Adnan, Mir Akbaar Khaled, Nevin Xu, and Shaira Habib; McMaster University: Nicole Padoun and Riya Srivastava.