Help for package myTAI

Type:

Package

Title:

Evolutionary Transcriptomics

Version:

2.3.4

Date:

2025-11-10

Maintainer:

Hajk-Georg Drost <hajk-georg.drost@tuebingen.mpg.de>

Description:

Investigate the evolution of biological processes by capturing evolutionary signatures in transcriptomes (Drost et al. (2018) <doi:10.1093/bioinformatics/btx835>). This package aims to provide a transcriptome analysis environment to quantify the average evolutionary age of genes contributing to a transcriptome of interest.

NeedsCompilation:

yes

License:

GPL-2

Depends:

R (≥ 4.1)

Imports:

S7, patchwork, purrr, tidyr, Rcpp, memoise, fitdistrplus (≥ 1.1-5), dplyr (≥ 0.3.0), RColorBrewer (≥ 1.1-2), ggplot2 (≥ 1.0.1), ggforce, ggridges, ggtext, readr (≥ 0.2.2), tibble, ggplotify, ggrepel, Matrix, pheatmap

Suggests:

knitr (≥ 1.6), rmarkdown (≥ 0.3.3), testthat (≥ 0.9.1), mgcv, Seurat, SeuratObject, uwot, decor, DESeq2, gganimate, taxize

LinkingTo:

RcppArmadillo, RcppThread, Rcpp

URL:

https://drostlab.github.io/myTAI/

BugReports:

https://github.com/drostlab/myTAI/issues

RoxygenNote:

7.3.3

Encoding:

UTF-8

LazyData:

true

LazyDataCompression:

VignetteBuilder:

knitr

Collate:

'globals.R' 'utils_S7.R' 'utils_math.R' 'phyloset_base.R' 'phyloset_bulk.R' 'phyloset_sc.R' 'stat_distributions.R' 'stat_test_result.R' 'RcppExports.R' 'age.apply.R' 'datasets.R' 'destroy_pattern.R' 'gatai_convergence_plots.R' 'genes_filter.R' 'genes_patterns.R' 'genes_transforms.R' 'myTAI-package.R' 'omit_matrix.R' 'plot_contribution.R' 'plot_distribution_expression.R' 'plot_distribution_pTAI.R' 'plot_distribution_strata.R' 'plot_gene_heatmap.R' 'plot_gene_profiles.R' 'plot_gene_space.R' 'plot_mean_var.R' 'plot_relative_expression.R' 'plot_sample_space.R' 'plot_signature.R' 'plot_signature_gene_quantiles.R' 'plot_signature_multiple.R' 'plot_signature_transformed.R' 'plot_strata_expression.R' 'plot_utils.R' 'stat_ci_estimation.R' 'stat_early_conservation_test.R' 'stat_flatline_test.R' 'stat_generic_conservation_test.R' 'stat_late_conservation_test.R' 'stat_null_conservation_txis.R' 'stat_pairwise_test.R' 'stat_reductive_hourglass_test.R' 'stat_reverse_hourglass_test.R' 'taxid.R' 'tf_PS.R' 'tf_stability.R' 'zzz.R'

Packaged:

2025-11-11 09:54:33 UTC; hdrost001

Author:

Hajk-Georg Drost

[aut, cre], Stefan Manolache

[aut, ctb], Jaruwatana Sodai Lotharukpong

[aut, ctb], Nikola Kalábová

[aut, ctb], Filipa Martins Costa [aut, ctb], Kristian K Ullrich

[aut, ctb]

Repository:

CRAN

Date/Publication:

2025-11-11 11:10:02 UTC

Calculate TXI for Raw Expression Data

Description

Internal function to calculate TXI for expression data.

Usage

.TXI(expression_matrix, strata_values)

Arguments

expression_matrix

Matrix of expression values

strata_values

Numeric vector of phylostratum values

Value

Vector of TXI values

Calculate TXI for single cell expression sparse matrix.

Description

Internal function to calculate TXI for expression data.

Usage

.TXI_sc(expression, strata_values)

Arguments

expression

Matrix of expression values, dgmatrix

strata_values

Numeric vector of phylostratum values

Value

Vector of TXI values

Adaptive TXI calculation for single cell expression

Description

Automatically selects the best TXI implementation based on dataset size and available computational resources. Uses R implementation for smaller datasets and C++ implementation with optimal parallelization for larger datasets.

Usage

.TXI_sc_adaptive(expression, strata_values, force_method = NULL, ncores = NULL)

Arguments

expression

Matrix of expression values, dgCMatrix

strata_values

Numeric vector of phylostratum values

force_method

Character, force specific method: "r", "cpp_simple", or "cpp_batched"

ncores

Integer, number of cores to use (default: parallel::detectCores())

Details

Based on performance benchmarking: - R implementation is fastest for < 50,000 cells - C++ batched implementation becomes advantageous for >= 100,000 cells - Optimal core count scales with dataset size

Value

Vector of TXI values

Collapse Expression Data Across Replicates

Description

Internal function to collapse expression data across replicates by taking row means.

Usage

.collapse_replicates(expression, groups)

Arguments

expression

Matrix of expression counts

groups

Factor indicating group membership

Value

Matrix with collapsed expression data

Compute Dimensional Reduction

Description

Compute PCA or UMAP on expression data when not available in stored reductions.

Usage

.compute_reduction(expression, method = c("PCA", "UMAP"), seed = 42)

Arguments

expression

Expression matrix with genes as rows and cells as columns

method

Character string: "PCA" or "UMAP"

seed

Integer seed for reproducible results (default: 42)

Value

Matrix with cells as rows and dimensions as columns

Fit Gamma Distribution Parameters

Description

Fit gamma distribution parameters using a robust method that filters outliers iteratively to find the best fit.

Usage

.fit_gamma(x)

Arguments

x

Numeric vector of data values

Details

This function uses an iterative approach to filter outliers and find the gamma distribution parameters that best fit the data, improving robustness compared to standard fitting methods. # Fit gamma distribution # params <- .fit_gamma(data_vector)

Value

List with shape and rate parameters

Fit Normal Distribution Parameters

Description

Fit normal distribution parameters using method of moments.

Usage

.fit_normal(x)

Arguments

x

Numeric vector of data values

Details

# Fit normal distribution # params <- .fit_normal(data_vector)

Value

List with mean and sd parameters

Get Expression Matrix from Seurat Object

Description

Extract expression matrix from Seurat object, preserving sparse format when possible.

Usage

.get_expression_from_seurat(seurat, layer)

Arguments

seurat

A Seurat object

layer

Character string specifying which layer to use

Value

Expression matrix (sparse or dense)

Calculate P-Value from Distribution

Description

Internal function to calculate p-values from cumulative distribution functions based on test statistics and alternative hypothesis specifications.

Usage

.get_p_value(
  cdf,
  test_stat,
  params,
  alternative = c("two-sided", "less", "greater")
)

Arguments

cdf

Cumulative distribution function

test_stat

Numeric test statistic value

params

List of distribution parameters

alternative

Character string specifying alternative hypothesis ("two-sided", "less", "greater")

Details

This function calculates p-values using the appropriate tail(s) of the distribution: - "greater": Uses upper tail (1 - CDF) - "less": Uses lower tail (CDF) - "two-sided": Uses 2 * minimum of both tails # Calculate p-value (internal use) # pval <- .get_p_value(pnorm, 1.96, list(mean=0, sd=1), "two-sided")

Value

Numeric p-value

Memoized Null Conservation TXI Generation

Description

Memoized version of generate_conservation_txis for improved performance with repeated calls using the same parameters.

Usage

.memo_generate_conservation_txis(strata_vector, count_matrix, sample_size)

Details

This function caches results to avoid recomputing expensive permutations when the same parameters are used multiple times.

Calculate pTXI for Raw Expression Data

Description

Internal function to calculate pTXI for expression data.

Usage

.pTXI(expression_matrix, strata_values)

Arguments

expression_matrix

Matrix of expression values

strata_values

Numeric vector of phylostratum values

Value

Matrix of pTXI values

Shared Gene Heatmap Implementation

Description

Internal helper function that contains the shared logic for creating gene heatmaps.

Usage

.plot_gene_heatmap_impl(
  expression_matrix,
  strata,
  gene_ids,
  num_strata,
  genes = NULL,
  top_p = NULL,
  top_k = 30,
  std = TRUE,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  show_gene_age = TRUE,
  show_gene_ids = FALSE,
  gene_annotation = NULL,
  gene_annotation_colors = NULL,
  annotation_col = NULL,
  annotation_col_colors = NULL,
  ...
)

Arguments

expression_matrix

Matrix of expression values (genes x samples)

strata

Factor vector of gene phylostrata

gene_ids

Character vector of all gene IDs in the dataset (used for phylostratum mapping)

num_strata

Integer number of phylostrata

genes

Character vector of specific genes to plot. If NULL, uses top dynamic genes

top_p

Proportion of most dynamic genes to include (default: NULL). Ignored if top_k is specified.

top_k

Absolute number of top genes to select (default: 30). Takes precedence over top_p.

std

Logical indicating whether to use standardized expression values (default: TRUE)

cluster_rows

Logical indicating whether to cluster genes/rows (default: FALSE)

cluster_cols

Logical indicating whether to cluster identities/columns (default: FALSE)

show_gene_age

Logical indicating whether to show gene age as row annotation (default: TRUE)

show_gene_ids

Logical indicating whether to show gene names (default: FALSE)

gene_annotation

Data frame with custom gene annotations, rownames should match gene IDs (default: NULL)

gene_annotation_colors

Named list of color vectors for custom gene annotations (default: NULL)

annotation_col

Data frame with column annotations (default: NULL)

annotation_col_colors

List of colors for column annotations (default: NULL)

...

Additional arguments passed to pheatmap::pheatmap

Value

A ggplot object showing the gene expression heatmap

Prepare Single-Cell Plot Data

Description

Prepare data frame for single-cell plotting with flexible identities

Usage

.prepare_sc_plot_data(
  phyex_set,
  primary_identity = NULL,
  secondary_identity = NULL
)

Arguments

phyex_set

A ScPhyloExpressionSet object

primary_identity

Primary identity column name

secondary_identity

Secondary identity column name (optional)

Value

List with sample data and aggregated data

Create Pseudobulk Expression Data

Description

Aggregate single-cell data by cell identity to create pseudobulk expression.

Usage

.pseudobulk_expression(expression, groups)

Arguments

expression

Expression matrix with genes as rows and cells as columns

groups

Factor vector indicating which group each cell belongs to

Value

Matrix of pseudobulked expression

Standardise Expression Data

Description

Standardise gene expression data by centring and scaling each gene.

Usage

.to_std_expr(e)

Arguments

e

Matrix of expression values with genes as rows and samples as columns

Details

This function standardises each gene's expression profile by subtracting the mean and dividing by the standard deviation. Genes with zero or undefined variance are set to zero. This is useful for comparing expression patterns across genes with different absolute expression levels. # Standardise expression data # std_expr <- .to_std_expr(expression_matrix)

Value

Matrix with standardised expression values (mean=0, sd=1 for each gene)

Bulk PhyloExpressionSet Class

Description

S7 class for bulk RNA-seq phylotranscriptomic expression data. This class handles expression data with biological replicates and provides bootstrapping functionality for statistical analysis.

Usage

BulkPhyloExpressionSet(
  strata = stop("@strata is required"),
  strata_values = stop("@strata_values is required"),
  expression = stop("@expression is required"),
  groups = stop("@groups is required"),
  name = "Phylo Expression Set",
  species = character(0),
  index_type = "TXI",
  identities_label = "Identities",
  gene_ids = character(0),
  null_conservation_sample_size = 5000L,
  .null_conservation_txis = NULL,
  .bootstrapped_txis = NULL
)

Arguments

strata

Factor vector of phylostratum assignments for each gene

strata_values

Numeric vector of phylostratum values used in TXI calculations

expression

Matrix of expression counts with genes as rows and samples as columns

groups

Factor vector indicating which identity each sample belongs to

name

Character string naming the dataset (default: "Phylo Expression Set")

species

Character string specifying the species (default: NULL)

index_type

Character string specifying the transcriptomic index type (default: "TXI")

identities_label

Character string labeling the identities (default: "Stages")

gene_ids

Character vector of gene identifiers (default: character(0), auto-generated from expression rownames if not provided)

null_conservation_sample_size

Numeric value for null conservation sample size (default: 5000)

.null_conservation_txis

Precomputed null conservation TXI values (default: NULL)

.bootstrapped_txis

Precomputed bootstrapped TXI values (default: NULL)

Details

The BulkPhyloExpressionSet class is designed for bulk RNA-seq data with biological replicates. It extends the base PhyloExpressionSetBase class with bulk-specific functionality.

Replicate Handling: Expression data across biological replicates is collapsed by taking row means within each experimental condition or developmental stage.

Computed Properties: In addition to inherited computed properties from the base class, this class provides:

expression_collapsed - Matrix of expression data collapsed across replicates (genes x identities)
bootstrapped_txis - Matrix of bootstrapped TXI values for statistical inference (500 bootstrap samples x identities)

Inherited computed properties from PhyloExpressionSetBase include:

gene_ids - Character vector of gene identifiers
identities - Character vector of identity labels
sample_names - Character vector of sample names
num_identities - Integer count of unique identities
num_samples - Integer count of total samples
num_genes - Integer count of genes
num_strata - Integer count of phylostrata
index_full_name - Full name of the transcriptomic index type
group_map - List mapping identity names to sample names
TXI - Numeric vector of TXI values for each identity
TXI_sample - Numeric vector of TXI values for each sample
null_conservation_txis - Matrix of null conservation TXI values for statistical testing

Statistical Analysis: The class supports confidence interval estimation and standard deviation calculation through bootstrapped TXI values, enabling robust statistical analysis of developmental or experimental patterns.

Value

A BulkPhyloExpressionSet object

Convert Data to BulkPhyloExpressionSet

Description

Convert a data frame with phylostratum, gene ID, and expression data into a BulkPhyloExpressionSet object.

Usage

BulkPhyloExpressionSet_from_df(
  data,
  groups = colnames(data[, 3:ncol(data)]),
  name = deparse(substitute(data)),
  strata_legend = NULL,
  ...
)

Arguments

data

A data frame where column 1 contains phylostratum information, column 2 contains gene IDs, and columns 3+ contain expression data

groups

A factor or character vector indicating which group each sample belongs to. Default uses column names from expression data

name

A character string naming the dataset. Default uses the variable name

strata_legend

A data frame with two columns: phylostratum assignments and name of each stratum. If NULL, no labels will be added (default: NULL) If NULL, uses sorted unique values from column 1

...

Additional arguments passed to BulkPhyloExpressionSet constructor

Value

A BulkPhyloExpressionSet object

Count Transformation Functions

Description

Predefined list of transformation functions for count data normalization.

Usage

COUNT_TRANSFORMS

Format

A named list of transformation functions:

none: Identity transformation (no change)
sqrt: Square root transformation
log2: log2(x+1) transformation
vst: Variance stabilizing transformation (DESeq2)
rlog: Regularized log transformation (DESeq2)
rank: Rank transformation within each sample

Details

This object provides a list of predefined transformation functions for gene expression matrix. For 'rlog()' and 'vst()' transformations (from 'DESeq2'), the expression matrix is multiplied by 2 prior to rounding. This is done to preserve more information for the lower expression values (especially for TPM).

Conservation Test Result S7 Class

Description

S7 class extending TestResult for conservation-specific test results, including TXI profiles and null distributions.

Usage

ConservationTestResult(
  method_name = stop("@method_name is required"),
  test_stat = stop("@test_stat is required"),
  fitting_dist = stop("@fitting_dist is required"),
  params = stop("@params is required"),
  alternative = "two-sided",
  null_sample = stop("@null_sample is required"),
  data_name = character(0),
  p_label = "p_val",
  test_txi = stop("@test_txi is required"),
  null_txis = stop("@null_txis is required"),
  modules = list()
)

Arguments

method_name

Character string specifying the statistical test method

test_stat

Numeric value of the test statistic

fitting_dist

Character string specifying the fitted distribution

params

Named list of distribution parameters

alternative

Character string specifying the alternative hypothesis

null_sample

Numeric vector of null distribution values

data_name

Character string describing the data

p_label

Character string for p-value label

test_txi

Numeric vector of observed TXI values

null_txis

Matrix of null TXI distributions from permutations

modules

Optional list of developmental modules used in the test

Details

ConservationTestResult extends TestResult with phylotranscriptomic-specific information including the observed TXI profile and null TXI distributions generated by permutation testing.

Value

A ConservationTestResult object

Distribution S7 Class

Description

S7 class for representing probability distributions used in statistical testing, including PDF, CDF, quantile functions, and fitting procedures.

Usage

Distribution(
  name = stop("@name is required"),
  pdf = stop("@pdf is required"),
  cdf = stop("@cdf is required"),
  quantile_function = stop("@quantile_function is required"),
  fitting_function = stop("@fitting_function is required"),
  param_names = stop("@param_names is required")
)

Arguments

name

Character string identifying the distribution

pdf

Function for probability density function

cdf

Function for cumulative distribution function

quantile_function

Function for quantile calculations

fitting_function

Function to fit distribution parameters from data

param_names

Character vector of parameter names

Details

The Distribution class provides a unified interface for different probability distributions used in phylotranscriptomic testing. Each distribution includes the necessary functions for statistical inference.

Examples

# Access predefined distributions
normal_dist <- distributions$normal
gamma_dist <- distributions$gamma

Generate Phylostratum Colors

Description

Generate a color palette for phylostrata visualization using a log-scaled transformation.

Usage

PS_colours(n)

Arguments

n

number of colors to generate

Value

A character vector of color codes

Examples

# Generate colors for 5 phylostrata
colors <- PS_colours(5)

PhyloExpressionSet Base Class

Description

Abstract S7 base class for storing and manipulating phylotranscriptomic expression data. This class provides the common interface for both bulk and single-cell phylotranscriptomic data.

Usage

PhyloExpressionSetBase(
  strata = stop("@strata is required"),
  strata_values = stop("@strata_values is required"),
  expression = stop("@expression is required"),
  groups = stop("@groups is required"),
  name = "Phylo Expression Set",
  species = character(0),
  index_type = "TXI",
  identities_label = "Identities",
  gene_ids = character(0),
  null_conservation_sample_size = 5000L,
  .null_conservation_txis = NULL
)

Arguments

strata

Factor vector of phylostratum assignments for each gene

strata_values

Numeric vector of phylostratum values used in TXI calculations

expression

Matrix of expression counts with genes as rows and samples as columns

groups

Factor vector indicating which identity each sample belongs to

name

Character string naming the dataset (default: "Phylo Expression Set")

species

Character string specifying the species (default: NULL)

index_type

Character string specifying the transcriptomic index type (default: "TXI")

identities_label

Character string labeling the identities (default: "Identities")

gene_ids

Character vector of gene identifiers (default: character(0), auto-generated from expression rownames if not provided)

null_conservation_sample_size

Numeric value for null conservation sample size (default: 5000)

.null_conservation_txis

Precomputed null conservation TXI values (default: NULL)

Details

The PhyloExpressionSetBase class serves as the foundation for phylotranscriptomic analysis, providing shared functionality for both bulk and single-cell data types.

Abstract Properties: Subclasses must implement the expression_collapsed property to define how expression data should be collapsed across replicates or cells.

Computed Properties: Several properties are computed automatically when accessed:

gene_ids - Character vector of gene identifiers (rownames of expression matrix)
identities - Character vector of identity labels (colnames of collapsed expression)
sample_names - Character vector of sample names (colnames of expression matrix)
num_identities - Integer count of unique identities
num_samples - Integer count of total samples
num_genes - Integer count of genes
num_strata - Integer count of phylostrata
index_full_name - Full name of the transcriptomic index type
group_map - List mapping identity names to sample names
TXI - Numeric vector of TXI values for each identity (computed from collapsed expression)
TXI_sample - Numeric vector of TXI values for each sample (computed from raw expression)
null_conservation_txis - Matrix of null conservation TXI values for statistical testing

Validation: The class ensures consistency between expression data, phylostratum assignments, and groupings. All gene-level vectors must have matching lengths, and sample groupings must be consistent.

Value

A PhyloExpressionSetBase object

Single-Cell PhyloExpressionSet Class

Description

S7 class for single-cell phylotranscriptomic expression data. This class stores expression matrices and metadata, with support for dimensional reductions and pseudobulking functionality.

Usage

ScPhyloExpressionSet(
  strata = stop("@strata is required"),
  strata_values = stop("@strata_values is required"),
  expression = stop("@expression is required"),
  groups = stop("@groups is required"),
  name = "Phylo Expression Set",
  species = character(0),
  index_type = "TXI",
  identities_label = "Identities",
  gene_ids = character(0),
  null_conservation_sample_size = 5000L,
  .null_conservation_txis = NULL,
  .pseudobulk_cache = list(),
  .TXI_sample = numeric(0),
  metadata = NULL,
  selected_idents = character(0),
  idents_colours = list(),
  reductions = list()
)

Arguments

strata

Factor vector of phylostratum assignments for each gene

strata_values

Numeric vector of phylostratum values used in TXI calculations

expression

Sparse or dense matrix of expression counts with genes as rows and cells as columns

groups

Factor vector indicating which identity each cell belongs to (derived from selected_idents column in metadata)

name

Character string naming the dataset (default: "Phylo Expression Set")

species

Character string specifying the species (default: NULL)

index_type

Character string specifying the transcriptomic index type (default: "TXI")

identities_label

Character string labeling the identities (default: "Cell Type")

gene_ids

Character vector of gene identifiers (default: character(0), auto-generated from expression rownames if not provided)

null_conservation_sample_size

Numeric value for null conservation sample size (default: 5000)

.null_conservation_txis

Precomputed null conservation TXI values (default: NULL)

.pseudobulk_cache

Internal cache for pseudobulked expression matrices by different groupings

.TXI_sample

Internal storage for computed TXI values

metadata

Data frame with cell metadata, where rownames correspond to cell IDs and columns contain cell attributes

selected_idents

Character string specifying which metadata column is currently used for grouping cells

idents_colours

List of named character vectors specifying colors for each identity level, organized by metadata column name

reductions

List of dimensional reduction matrices (PCA, UMAP, etc.) with cells as rows and dimensions as columns

Details

The ScPhyloExpressionSet class provides a comprehensive framework for single-cell phylotranscriptomic analysis. Key features include:

Identity Management: The selected_idents property determines which metadata column is used for grouping cells. When changed, it automatically updates the groups property and invalidates cached pseudobulk data to ensure consistency.

Dimensional Reductions: The reductions property stores pre-computed dimensional reductions (PCA, UMAP, etc.). If not provided during construction from Seurat objects, basic PCA and UMAP are computed automatically.

Color Management: idents_colours allows custom color schemes for different metadata columns, ensuring consistent visualization across plots.

Computed Properties: Several properties are computed automatically when accessed:

available_idents - Character vector of factor columns in metadata that can be used for grouping (automatically detected from metadata)
expression_collapsed - Matrix of pseudobulked expression data (genes x identities), created by summing expression within each identity group
TXI_sample - Named numeric vector of TXI (Transcriptomic Age Index) values for each cell, computed using efficient C++ implementation

Inherited computed properties from PhyloExpressionSetBase include:

gene_ids - Character vector of gene identifiers
identities - Character vector of identity labels
sample_names - Character vector of sample names (cell IDs)
num_identities - Integer count of unique cell types/identities
num_samples - Integer count of total cells
num_genes - Integer count of genes
num_strata - Integer count of phylostrata
index_full_name - Full name of the transcriptomic index type
group_map - List mapping identity names to cell IDs
TXI - Numeric vector of TXI values for each identity (computed from pseudobulked expression)
null_conservation_txis - Matrix of null conservation TXI values for statistical testing

These properties use lazy evaluation and caching for optimal performance.

Value

A ScPhyloExpressionSet object

Examples

# Create from Seurat object
data(example_phyex_set_sc)
sc_set <- example_phyex_set_sc

# Switch to different cell grouping
sc_set@selected_idents <- "day"

# Access pseudobulked data (computed au  tomatically)
pseudobulk <- sc_set@expression_collapsed

# Access TXI values for each cell
txi_values <- sc_set@TXI_sample

Create Single-Cell PhyloExpressionSet from Expression Matrix

Description

Create a ScPhyloExpressionSet object from an expression matrix and metadata.

Usage

ScPhyloExpressionSet_from_matrix(
  expression_matrix,
  strata,
  metadata,
  groups_column = NULL,
  name = "Single-Cell Phylo Expression Set",
  ...
)

Arguments

expression_matrix

Sparse or dense expression matrix with genes as rows and cells as columns

strata

Factor vector of phylostratum assignments for each gene

metadata

Data frame with cell metadata, rownames should match colnames of expression_matrix

groups_column

Character string specifying which metadata column to use for initial grouping (default: first factor column found)

name

A character string naming the dataset (default: "Single-Cell Phylo Expression Set")

...

Additional arguments passed to ScPhyloExpressionSet constructor

Details

This function creates a ScPhyloExpressionSet from basic components. The groups_column parameter determines the initial selected_idents value, which can be changed later using the setter. All discrete columns in metadata are automatically converted to factors for consistent handling.

Value

A ScPhyloExpressionSet object

Convert Seurat Object to Single-Cell PhyloExpressionSet

Description

Convert a Seurat object with phylostratum information into a ScPhyloExpressionSet object for single-cell phylotranscriptomic analysis. Automatically extracts dimensional reductions if present, or computes basic PCA and UMAP if none are available.

Usage

ScPhyloExpressionSet_from_seurat(
  seurat,
  strata,
  layer = "counts",
  selected_idents = NULL,
  name = "Single-Cell PhyloExpressionSet",
  seed = 42,
  ...
)

Arguments

seurat

A Seurat object containing single-cell expression data

strata

Factor vector of phylostratum assignments for each gene

layer

Character string specifying which layer to use from the Seurat object (default: "counts")

selected_idents

Character string specifying which metadata column to use for grouping (default: NULL, uses active idents)

name

A character string naming the dataset (default: "Single-Cell Phylo Expression Set")

seed

Integer seed for reproducible UMAP computation (default: 42)

...

Additional arguments passed to ScPhyloExpressionSet constructor

Value

A ScPhyloExpressionSet object

Calculate Transcriptomic Age Index (TAI)

Description

Calculate the transcriptomic age index values for a PhyloExpressionSet. This function provides backward compatibility with the old TAI() function.

Usage

TAI(phyex_set)

Arguments

phyex_set

A PhyloExpressionSet object

Value

Numeric vector of TAI values for each identity

Examples

# Calculate TAI values
tai_values <- TAI(example_phyex_set)

Calculate Transcriptomic Divergence Index (TDI)

Description

Calculate the transcriptomic divergence index values for a PhyloExpressionSet. This function provides backward compatibility with the old TDI() function.

Usage

TDI(phyex_set)

Arguments

phyex_set

A PhyloExpressionSet object

Value

Numeric vector of TDI values for each identity

Calculate Transcriptomic Evolutionary Index (TEI)

Description

Calculate the transcriptomic evolutionary index values for a PhyloExpressionSet. This function provides backward compatibility with the old TEI() function.

Usage

TEI(phyex_set)

Arguments

phyex_set

A PhyloExpressionSet object

Value

Numeric vector of TEI values for each identity

Transcriptomic Index Name Mapping

Description

Named list mapping transcriptomic index abbreviations to full names.

Usage

TI_map

Format

A named list with 5 elements:

TXI: Transcriptomic Index
TAI: Transcriptomic Age Index
TDI: Transcriptomic Divergence Index
TPI: Transcriptomic Polymorphism Index
TEI: Transcriptomic Evolutionary Index

Calculate Transcriptomic Polymorphism Index (TPI)

Description

Calculate the transcriptomic polymorphism index values for a PhyloExpressionSet. This function provides backward compatibility with the old TPI() function.

Usage

TPI(phyex_set)

Arguments

phyex_set

A PhyloExpressionSet object

Value

Numeric vector of TPI values for each identity

Calculate Transcriptomic Index (TXI)

Description

Calculate the transcriptomic index values for a PhyloExpressionSet. This function provides backward compatibility with the old TXI() function.

Usage

TXI(phyex_set)

Arguments

phyex_set

A PhyloExpressionSet object

Value

Numeric vector of TXI values for each identity

Examples

# Calculate TXI values
txi_values <- TXI(example_phyex_set)

Confidence Intervals for Transcriptomic Index (TXI)

Description

Compute confidence intervals for the TXI using bootstrapped TXI values.

Usage

TXI_conf_int(phyex_set, probs = c(0.025, 0.975))

Arguments

phyex_set

A BulkPhyloExpressionSet object

probs

Numeric vector of probabilities for the confidence interval (default: c(0.025, 0.975))

Details

This function returns confidence intervals for the TXI for each identity (sample or group), based on the bootstrapped TXI values stored in the PhyloExpressionSet object.

Value

A tibble with first column Identity names, second column lower bound, third column upper bound

Standard Deviation for TXI

Description

Return a named vector of standard deviations for the TXI for each identity.

Usage

TXI_std_dev(phyex_set)

Arguments

phyex_set

A BulkPhyloExpressionSet object

Value

Named numeric vector of standard deviations, names are identities

Test Result S7 Class

Description

S7 class for storing and manipulating statistical test results from phylotranscriptomic conservation tests.

Usage

TestResult(
  method_name = stop("@method_name is required"),
  test_stat = stop("@test_stat is required"),
  fitting_dist = stop("@fitting_dist is required"),
  params = stop("@params is required"),
  alternative = "two-sided",
  null_sample = stop("@null_sample is required"),
  data_name = character(0),
  p_label = "p_val"
)

Arguments

method_name

Character string identifying the test method

test_stat

Numeric test statistic value

fitting_dist

Distribution object used for null hypothesis testing

params

List of fitted distribution parameters

alternative

Character string specifying alternative hypothesis ("two-sided", "less", "greater")

null_sample

Numeric vector of null distribution samples

data_name

Character string naming the dataset (optional)

p_label

Character string for p-value label (default: "p_val")

Details

The TestResult class provides computed properties including: - 'p_value': Computed p-value based on test statistic and fitted distribution

Value

A TestResult object

Age Category Specific apply Function

Description

This function performs the split-apply-combine methodology on Phylostrata or Divergence Strata stored within the input PhyloExpressionSet.

This function is very useful to perform any phylostratum or divergence-stratum specific analysis.

Usage

age.apply(phyex_set, FUN, ..., as.list = FALSE)

Arguments

phyex_set

a standard PhyloExpressionSet object.

FUN

a function to be performed on the corresponding expression matrix of each phylostratum or divergence-stratum.

...

additional arguments of FUN.

as.list

a boolean value specifying whether the output format shall be a matrix or a list object.

Details

This function uses the split function to subset the expression matrix into phylostratum specific sub-matrices. Internally using lapply, any function can be performed to the sub-matrices. The return value of this function is a numeric matrix storing the return values by FUN for each phylostratum and each developmental stage s. Note that the input FUN must be an function that can be applied to a matrix (e.g., colMeans). In case you use an anonymous function you could use function(x) apply(x , 2 , var) as an example to compute the variance of each phylostratum and each developmental stage s.

Value

Either a numeric matrix storing the return values of the applied function for each age class or a numeric list storing the return values of the applied function for each age class in a list.

Author(s)

Hajk-Georg Drost

Examples

 
# source the example dataset
data(example_phyex_set)
 
# Example 1
# get the relative expression profiles for each phylostratum
age.apply(example_phyex_set, relative_expression)

# this is analogous to 
rel_exp_matrix(example_phyex_set)
# Example 2
# compute the mean expression profiles for each phylostratum
age.apply(example_phyex_set, colMeans)

# Example 3
# compute the variance profiles for each phylostratum
age.apply(example_phyex_set, function(x) apply(x , 2 , var))

# Example 4
# compute the range for each phylostratum
# Note: in this case, the range() function returns 2 values for each phylostratum
# and each developmental stage, hence one should use the argument 'as.list = TRUE'
# to make sure that the results are returned properly 
age.apply(example_phyex_set, function(x) apply(x , 2 , range), as.list = TRUE)

as_BulkPhyloExpressionSet

Description

This function is an alias for BulkPhyloExpressionSet_from_df. Please refer to the documentation for BulkPhyloExpressionSet_from_df for usage details, arguments, and examples.

Usage

as_BulkPhyloExpressionSet(
  data,
  groups = colnames(data[, 3:ncol(data)]),
  name = deparse(substitute(data)),
  strata_legend = NULL,
  ...
)

Arguments

data

A data frame with phylostratum assignments and gene expression data

groups

Vector of group labels for the samples/replicates

name

Character string to name the dataset

strata_legend

Optional data frame mapping phylostratum numbers to labels

...

Additional arguments passed to BulkPhyloExpressionSet_from_df

Value

A BulkPhyloExpressionSet object

Convert BulkPhyloExpressionSet to Data Frame

Description

Convert a BulkPhyloExpressionSet object back to the original data frame format with phylostratum, gene ID, and expression data as columns.

Usage

as_data_frame(phyex_set, use_collapsed = FALSE)

Arguments

phyex_set

A BulkPhyloExpressionSet object

use_collapsed

Logical indicating whether to use collapsed expression data (default: FALSE)

Value

A data frame where column 1 contains phylostratum information, column 2 contains gene IDs, and columns 3+ contain expression data

Examples

# Convert BulkPhyloExpressionSet back to data frame
df <- as_data_frame(example_phyex_set)
df_collapsed <- as_data_frame(example_phyex_set, use_collapsed = TRUE)

Check if object is a BulkPhyloExpressionSet

Description

Checks if the input is a PhyloExpressionSet S7 object and throws an error if not.

Usage

check_BulkPhyloExpressionSet(phyex_set)

Arguments

phyex_set

An object to check

Value

Invisibly returns TRUE if check passes, otherwise throws an error

Check if object is a PhyloExpressionSet

Description

Checks if the input is a PhyloExpressionSet S7 object and throws an error if not.

Usage

check_PhyloExpressionSet(phyex_set)

Arguments

phyex_set

An object to check

Value

Invisibly returns TRUE if check passes, otherwise throws an error

Check if object is a ScPhyloExpressionSet

Description

Checks if the input is a PhyloExpressionSet S7 object and throws an error if not.

Usage

check_ScPhyloExpressionSet(phyex_set)

Arguments

phyex_set

An object to check

Value

Invisibly returns TRUE if check passes, otherwise throws an error

Collapse PhyloExpressionSet Replicates

Description

Convert a PhyloExpressionSet with replicates to one with collapsed expression data.

Usage

collapse(phyex_set, ...)

Arguments

phyex_set

A PhyloExpressionSet object

...

Additional arguments passed to methods

Value

A new PhyloExpressionSet object with collapsed expression data

Examples

# Collapse replicates in a PhyloExpressionSet
collapsed_set <- collapse(example_phyex_set)

Calculate Confidence Intervals for Test Result

Description

Calculate confidence intervals for the null distribution of a test result.

Usage

conf_int(test_result, probs = c(0.025, 0.975))

Arguments

test_result

A TestResult object

probs

Numeric vector of probabilities for quantiles (default: c(0.025, 0.975))

Details

# Calculate 95 # ci <- conf_int(test_result, probs = c(0.025, 0.975))

Value

Numeric vector of quantiles from the null distribution

Calculate Consensus Gene Set

Description

Calculate consensus genes from multiple GATAI runs based on frequency threshold.

Usage

consensus(x, p = 0.5)

Arguments

x

List of gene sets from different GATAI runs

p

Frequency threshold (default: 0.5)

Details

This function identifies genes that appear in at least p proportion of the input gene sets, providing a consensus set of genes across multiple GATAI runs. # Calculate consensus from multiple runs # consensus_genes <- consensus(gatai_runs, p = 0.5)

Value

Character vector of consensus genes

Create Convergence Plots for GATAI Analysis

Description

Create plots showing how consensus gene set sizes and p-values converge across GATAI runs for different threshold values.

Usage

convergence_plots(phyex_set, runs, ps = c(0.5))

Arguments

phyex_set

A PhyloExpressionSet object

runs

List of GATAI run results

ps

Vector of consensus thresholds to analyze (default: c(0.5))

Details

This function analyzes how consensus gene sets and their statistical significance change as more GATAI runs are included in the analysis. It uses cached null distributions for efficient p-value calculation. # Create convergence plots # conv_plots <- convergence_plots(phyex_set, gatai_runs, ps = c(0.25, 0.5, 0.75))

Value

A list with two ggplot objects:

counts: Plot showing convergence of consensus set sizes
pval: Plot showing convergence of p-values

Calculate TXI for Single-Cell Expression Data (C++ Implementation)

Description

Efficiently calculate TXI values for sparse single-cell expression matrices using batch processing and parallel computation.

Usage

cpp_txi_sc(expression, strata_values, batch_size = 2000L, ncores = 10L)

Arguments

expression

Sparse expression matrix (genes x cells) - dgCMatrix format

strata_values

Numeric vector of phylostratum values for each gene

batch_size

Integer, number of cells to process per batch (default: 2000)

ncores

Integer, number of cores to use for parallel processing (default: 10, automatically capped at available cores)

Details

This function processes large sparse single-cell expression matrices efficiently by: - Splitting cells into batches to manage memory usage - Using parallel processing across batches - Leveraging sparse matrix operations to skip zero entries - Handling cells with zero expression by returning NA

Value

Numeric vector of TXI values for each cell

Author(s)

Kristian K Ullrich

Destroy Phylotranscriptomic Pattern Using GATAI

Description

Apply the GATAI algorithm to identify and remove genes that contribute to phylotranscriptomic patterns.

Usage

destroy_pattern(
  phyex_set,
  num_runs = 20,
  runs_threshold = 0.5,
  analysis_dir = NULL,
  plot_results = TRUE,
  max_generations = 10000,
  seed = 1234,
  extended_analysis = FALSE,
  min_pval = 0.05,
  always_return_genes = FALSE,
  ...
)

Arguments

phyex_set

A PhyloExpressionSet object (bulk or single cell, the latter which will get pseudo-bulked)

num_runs

Number of GATAI runs to perform (default: 20)

runs_threshold

Threshold for gene removal consistency across runs (default: 0.5)

analysis_dir

Directory to store GATAI analysis results (default: NULL)

plot_results

Whether to plot the results. If analysis dir is given, this will be ignored.

max_generations

Integer. Maximum number of generations (iterations) for the genetic algorithm (default 10000).

seed

Random seed for reproducibility (default: 1234)

extended_analysis

Whether to show the multiple runs and convergence plots (default: FALSE)

min_pval

Minimum p-value for which the pattern is considered destroyed (default: 0.05).

always_return_genes

Whether to return genes even when the pattern is not destroyed (default: FALSE).

...

Additional arguments passed to gataiR::gatai

Details

This function requires the gataiR package to be installed. GATAI systematically removes genes that contribute to phylotranscriptomic patterns by iteratively testing gene removal and evaluating the impact on the overall transcriptomic signature.

Value

A list containing GATAI results including identified genes that contribute to the pattern

Author(s)

Filipa Martins Costa

Diagnose Test Robustness

Description

Evaluate the robustness of conservation tests across different sample sizes for null distribution generation.

Usage

diagnose_test_robustness(
  test,
  phyex_set,
  sample_sizes = c(500, 1000, 5000, 10000),
  plot_result = TRUE,
  num_reps = 5,
  ...
)

Arguments

test

Function representing the conservation test to evaluate

phyex_set

A PhyloExpressionSet object

sample_sizes

Numeric vector of sample sizes to test (default: c(500, 1000, 5000, 10000))

plot_result

Logical indicating whether to plot results (default: TRUE)

num_reps

Number of replicates for each sample size (default: 5)

...

Additional arguments passed to the test function

Details

This function assesses how consistent test results are across different sample sizes for null distribution generation, helping to determine appropriate sample sizes for reliable testing.

Value

A data frame with test results across different sample sizes

Examples

# Diagnose flatline test robustness
robustness <- diagnose_test_robustness(stat_flatline_test, 
                                       example_phyex_set,
                                       sample_sizes=c(10,20),
                                       plot_result=FALSE,
                                       num_reps=3)

Predefined Distribution Objects

Description

List of predefined Distribution objects for use in statistical testing.

Usage

distributions

Format

A named list containing Distribution objects:

normal: Normal distribution with mean and sd parameters
gamma: Gamma distribution with shape and rate parameters

Details

This list provides ready-to-use Distribution objects for common statistical tests. Each distribution includes appropriate fitting functions and statistical functions for hypothesis testing.

Downsample ScPhyloExpressionSet

Description

Create a downsampled copy of a ScPhyloExpressionSet object with fewer cells per identity.

Usage

downsample(phyex_set, downsample = 10)

Arguments

phyex_set

A ScPhyloExpressionSet object

downsample

Integer, number of cells to keep per identity (default: 10)

Details

This function creates a new ScPhyloExpressionSet with a subset of cells, maintaining the same proportional representation across identities. The sampling is stratified by the current selected_idents grouping. All metadata and reductions are filtered to match the selected cells.

Value

A new ScPhyloExpressionSet object with downsampled cells

Examples

# Downsample to 20 cells per identity
small_set <- downsample(example_phyex_set_sc, downsample = 20)

# Change grouping and downsample
example_phyex_set_sc@selected_idents <- "day"
treatment_set <- downsample(example_phyex_set_sc, downsample = 15)

Downsample Expression Matrix by Groups

Description

Downsample an expression matrix by randomly selecting a specified number of samples from each group.

Usage

downsample_expression(expression_matrix, groups, downsample = 10)

Arguments

expression_matrix

Expression matrix with genes as rows and samples as columns

groups

Factor vector indicating which group each sample belongs to

downsample

Integer, number of samples to keep per group (default: 10)

Details

This function randomly samples up to downsample samples from each group level. The returned expression matrix is converted to dense format and maintains column names from the original matrix. Useful for creating balanced subsets for visualization or when memory is limited.

Value

A dense expression matrix (genes x downsampled samples)

Early Expression Pattern

Description

Generate an ideal early expression pattern for S developmental stages.

Usage

early_gene(S)

Arguments

S

Number of developmental stages

Details

Creates a pattern where expression starts low, increases in early stages, and remains high in later stages.

Value

Numeric vector representing early expression pattern

Early Conservation Score Function

Description

Compute the early conservation score by comparing mid and late developmental stages to early stages.

Usage

ec_score(txi, modules)

Arguments

txi

Numeric vector of transcriptomic index values

modules

A named list with elements 'early', 'mid', and 'late' containing stage indices for each developmental module

Details

The score is computed as the minimum of: - D1: mean(mid) - mean(early) - D2: mean(late) - mean(early)

Higher scores indicate stronger early conservation patterns. # Compute early conservation score # modules <- list(early = 1:3, mid = 4:6, late = 7:9) # score <- ec_score(txi_values, modules)

Value

A numeric value representing the early conservation score

Example phyex set

Description

Arabidopsis thaliana embryogenesis dataset from Hoffman et al. 2019. check the phyexSet package for more details

Usage

example_phyex_set

Format

A BulkPhyloExpressionSet with 8 developmental stages (3 reps each) and 27520 genes

Source

phyexSets::Athaliana.embryogenesis_2019 matched with the Arabidopsis thaliana phylomap from phylomapr

Example phyex set old

Description

Arabidopsis thaliana embryogenesis dataset from Xiang et al. 2011. check the phyexSet package for more details

Usage

example_phyex_set_old

Format

A BulkPhyloExpressionSet with 7 developmental stages (1 rep each) and 25096 genes

Source

phyexSets::Athaliana.embryogenesis_2011 matched with the Arabidopsis thaliana phylomap from phylomapr

Load Example Single-Cell PhyloExpressionSet

Description

Creates and returns an example ScPhyloExpressionSet object for testing and examples.

Usage

example_phyex_set_sc

Format

A ScPhyloExpressionSet with 1000 cells and 1000 genes

Source

Synthetically generated data

Format P-Value for Scientific Notation

Description

Format p-values in scientific notation for plot annotations.

Usage

exp_p(p, sci_thresh = 4)

Arguments

p

Numeric p-value

sci_thresh

Numeric threshold for using scientific notation (number of decimal places)

Details

This function formats p-values in scientific notation using the format "p = a × 10^b" which is suitable for ggplot2 annotations and maintains proper mathematical formatting.

Value

Expression object for use in plot annotations

Examples

# Format p-value for plotting
expr <- exp_p(0.001)

Create Full GATAI Convergence Plot

Description

Create a comprehensive plot showing GATAI convergence across multiple runs and thresholds.

Usage

full_gatai_convergence_plot(phyex_set, runs, p = 0.5, ps = c(0.25, 0.5, 0.75))

Arguments

phyex_set

A PhyloExpressionSet object

runs

List of GATAI run results

p

Consensus threshold for petal plot (default: 0.5)

ps

Vector of consensus thresholds for convergence plots (default: c(0.25, 0.5, 0.75))

Details

This function creates a comprehensive visualization of GATAI convergence including consensus set sizes, p-values, threshold comparisons, and gene removal patterns across multiple runs. # Create full convergence plot # conv_plot <- full_gatai_convergence_plot(phyex_set, gatai_runs)

Value

A patchwork composition showing convergence analysis

Animate GATAI Destruction Process

Description

Create an animation showing how the transcriptomic signature changes during the GATAI gene removal process across generations.

Usage

gatai_animate_destruction(
  phyex_set,
  save_file = NULL,
  fps = 10,
  width = 1000,
  height = 800,
  ...
)

Arguments

phyex_set

A PhyloExpressionSet object containing the original gene expression data.

save_file

Optional file path to save the animation (default: NULL, returns animation object).

fps

Frames per second for the animation (default: 20).

width

Width of the animation in pixels (default: 1000).

height

Height of the animation in pixels (default: 800).

...

Additional arguments passed to gataiR::gatai.

Details

This function runs GATAI for a single run while saving intermediate TAI values at each generation. It then creates an animated plot showing how the transcriptomic signature evolves as genes are progressively removed. The animation shows: - The original signature (generation 0) - Progressive changes through each generation - Final signature after convergence

The intermediate file format contains generation numbers in the first column and TAI values for each developmental stage in subsequent columns.

Value

If save_file is NULL, returns a gganimate animation object. If save_file is specified, saves the animation and returns the file path invisibly.

Filter Dynamic Expression Genes

Description

Filter genes based on expression variance to select the most dynamically expressed genes.

Usage

genes_filter_dynamic(e, thr = 0.9)

Arguments

e

Matrix of expression values with genes as rows and samples as columns

thr

Threshold quantile for variance filtering (default: 0.9)

Details

This function calculates the variance for each gene across samples and retains only genes with variance above the specified quantile threshold. This helps focus analysis on genes that show significant expression changes. # Filter top 10 # filtered_expr <- genes_filter_dynamic(expression_matrix, thr = 0.9)

Value

Matrix containing only genes above the variance threshold

Select Lowly Expressed Genes

Description

Select genes with mean expression below a specified threshold.

Usage

genes_lowly_expressed(phyex_set, threshold = 1)

Arguments

phyex_set

A PhyloExpressionSet object

threshold

Mean expression threshold (default: 1)

Details

This function identifies genes with low mean expression levels, which might be candidates for filtering or separate analysis.

Value

Character vector of gene IDs with mean expression <= threshold

Examples

# Select genes with mean expression <= 1
low_expr_genes <- genes_lowly_expressed(example_phyex_set, threshold = 1)

Gene Expression Filtering Functions

Description

Collection of functions for filtering genes based on expression patterns in PhyloExpressionSet objects.

Generic function to select genes with the highest values for a given expression metric.

Usage

genes_top_expr(phyex_set, FUN = rowMeans, top_p = 0.99, top_k = NULL, ...)

Arguments

phyex_set

A PhyloExpressionSet object

FUN

Function to calculate gene-wise expression metric (default: rowMeans)

top_p

Quantile threshold for gene selection (default: 0.99). Ignored if top_k is specified.

top_k

Absolute number of top genes to select (default: NULL). Takes precedence over top_p.

...

Additional arguments passed to FUN

Details

This function applies the specified function to calculate a metric for each gene across samples, then selects genes above the specified quantile threshold or the top k genes by absolute count. If both top_p and top_k are specified, top_k takes precedence.

Value

Character vector of gene IDs with metric values >= top_p quantile or top top_k genes

Examples

# Select top 1% most expressed genes by mean
high_expr_genes <- genes_top_expr(example_phyex_set, function(x) apply(x, 1, mean), top_p = 0.99)

# Select top 100 most expressed genes
top_100_genes <- genes_top_expr(example_phyex_set, function(x) apply(x, 1, mean), top_k = 100)

Select Top Mean Expressed Genes

Description

Select genes with the highest mean expression across samples.

Usage

genes_top_mean(phyex_set, top_p = 0.99, top_k = NULL)

Arguments

phyex_set

A PhyloExpressionSet object

top_p

Quantile threshold for gene selection (default: 0.99). Ignored if top_k is specified.

top_k

Absolute number of top genes to select (default: NULL). Takes precedence over top_p.

Details

This function identifies genes with the highest mean expression levels, which are often the most reliably detected and functionally important.

Value

Character vector of gene IDs with mean expression >= top_p quantile or top top_k genes

Examples

# Select top 1% most expressed genes by mean
high_expr_genes <- genes_top_mean(example_phyex_set, top_p = 0.99)

# Select top 1000 most expressed genes
top_1000_genes <- genes_top_mean(example_phyex_set, top_k = 1000)

Select Top Variable Genes

Description

Select genes with the highest variance across samples.

Usage

genes_top_variance(phyex_set, top_p = 0.99, top_k = NULL)

Arguments

phyex_set

A PhyloExpressionSet object

top_p

Quantile threshold for gene selection (default: 0.99). Ignored if top_k is specified.

top_k

Absolute number of top genes to select (default: NULL). Takes precedence over top_p.

Details

This function identifies genes with the highest variance across samples, which are often the most informative for downstream analyses.

Value

Character vector of gene IDs with variance >= top_p quantile or top top_k genes

Examples

# Select top 1% most variable genes
high_var_genes <- genes_top_variance(example_phyex_set, top_p = 0.99)

# Select top 500 most variable genes
top_500_var_genes <- genes_top_variance(example_phyex_set, top_k = 500)

Geometric Mean

Description

This function computes the geometric mean of a numeric input vector x.

Usage

geom_mean(x)

Arguments

x

a numeric vector for which geometric mean computations shall be performed.

Details

x <- 1:10

geom_mean(x)

Author(s)

Hajk-Georg Drost

Calculate Gene Expression Angles

Description

Calculate developmental stage angles for genes based on their expression patterns.

Usage

get_angles(e)

Arguments

e

Matrix of expression values with genes as rows and stages as columns

Details

This function uses PCA to project genes and ideal expression patterns into 2D space, then calculates the angle of each gene relative to ideal patterns. The angles represent the peak expression timing during development.

Value

Numeric vector of angles representing each gene's expression pattern

Goodness of Fit Test

Description

Perform a Kolmogorov-Smirnov test to assess goodness of fit between the null sample and fitted distribution.

Usage

goodness_of_fit(test_result)

Arguments

test_result

A TestResult object

Details

This function tests whether the null sample follows the fitted distribution using the Kolmogorov-Smirnov test. A significant result indicates poor fit. # Test goodness of fit # gof_result <- goodness_of_fit(test_result)

Value

A ks.test result object

Late Expression Pattern

Description

Generate an ideal late expression pattern for S developmental stages.

Usage

late_gene(S)

Arguments

S

Number of developmental stages

Details

Creates a pattern where expression starts high and decreases in later stages. This is the inverse of the early pattern.

Value

Numeric vector representing late expression pattern

Late Conservation Score Function

Description

Compute the late conservation score by comparing early and mid developmental stages to late stages.

Usage

lc_score(txi, modules)

Arguments

txi

Numeric vector of transcriptomic index values

modules

A named list with elements 'early', 'mid', and 'late' containing stage indices for each developmental module

Details

The score is computed as the minimum of: - D1: mean(early) - mean(late) - D2: mean(mid) - mean(late)

Higher scores indicate stronger late conservation patterns. # Compute late conservation score # modules <- list(early = 1:3, mid = 4:6, late = 7:9) # score <- lc_score(txi_values, modules)

Value

A numeric value representing the late conservation score

Match Gene Expression Data with Phylostratum Map

Description

Join gene expression data with a phylostratum mapping to create a BulkPhyloExpressionSet object.

Usage

match_map(
  data,
  phylomap,
  groups = colnames(data[, 2:ncol(data)]),
  name = NULL,
  ...
)

Arguments

data

A data frame where column 1 contains gene IDs and columns 2+ contain expression data

phylomap

A data frame with two columns: phylostratum assignments and gene IDs

groups

A factor or character vector indicating which group each sample belongs to. Default uses column names from expression data

name

A character string naming the dataset. Default uses the variable name

...

Additional arguments passed to as_BulkPhyloExpressionSet

Value

A BulkPhyloExpressionSet object

Examples

# Match expression data with phylostratum map
# bulk_set <- match_map(expression_data, phylo_map, 
#                       groups = c("stage1", "stage2", "stage3"),
#                       name = "Matched Dataset")

Match Expression Matrix with Phylostratum Map

Description

Join single-cell gene expression matrix with a phylostratum mapping to create a ScPhyloExpressionSet object.

Usage

match_map_sc_matrix(
  expression_matrix,
  metadata,
  phylomap,
  strata_legend = NULL,
  ...
)

Arguments

expression_matrix

Expression matrix with genes as rows and cells as columns

metadata

Data frame with cell metadata, rownames should match colnames of expression_matrix

phylomap

A data frame with two columns: phylostratum assignments and gene IDs

strata_legend

A data frame with two columns: phylostratum assignments and name of each stratum. If NULL, numeric labels will be used (default: NULL)

...

Additional arguments passed to ScPhyloExpressionSet_from_matrix

Details

This function combines phylostratum mapping with expression matrix and metadata to create a ScPhyloExpressionSet. Only genes present in both the expression matrix and phylomap will be retained. All discrete metadata columns are converted to factors automatically.

Value

A ScPhyloExpressionSet object

Match Single-Cell Expression Data with Phylostratum Map (Seurat)

Description

Join single-cell gene expression data (from a Seurat object) with a phylostratum mapping to create a ScPhyloExpressionSet object. Automatically extracts dimensional reductions and metadata.

Usage

match_map_sc_seurat(
  seurat,
  phylomap,
  layer = "counts",
  strata_legend = NULL,
  ...
)

Arguments

seurat

A Seurat object containing single-cell expression data

phylomap

A data frame with two columns: phylostratum assignments and gene IDs

layer

Character string specifying which layer to use from the Seurat object (default: "counts")

strata_legend

A data frame with two columns: phylostratum assignments and name of each stratum. If NULL, numeric labels will be used (default: NULL)

...

Additional arguments passed to ScPhyloExpressionSet_from_seurat

Details

This is a convenience function that combines phylostratum mapping with Seurat object conversion. Only genes present in both the expression data and phylomap will be retained. The function extracts all metadata and dimensional reductions from the Seurat object.

Value

A ScPhyloExpressionSet object

Mid Expression Pattern

Description

Generate an ideal mid expression pattern for S developmental stages.

Usage

mid_gene(S)

Arguments

S

Number of developmental stages

Details

Creates a pattern where expression starts low, peaks in middle stages, and returns to low in later stages.

Value

Numeric vector representing mid expression pattern

Modulo Pi Function

Description

Wrap angles to the range (-pi, pi].

Usage

mod_pi(x)

Arguments

x

Numeric vector of angles

Details

Ensures angles are in the standard range (-pi, pi] by wrapping around 2pi.

Value

Numeric vector of wrapped angles

Create S7 Options Property

Description

Helper function to create an S7 property with a limited set of valid options.

Usage

new_options_property(
  class = S7::class_any,
  options,
  default = options[[1]],
  name = NULL
)

Arguments

class

The S7 class for the property (default: S7::class_any)

options

Character vector of valid options for the property

default

Default value for the property (default: first option)

name

Optional name for the property

Details

# Create a property that only accepts specific values # color_prop <- new_options_property(options = c("red", "blue", "green"))

Value

An S7 property object with validation for allowed options

Create S7 Required Property

Description

Helper function to create an S7 property that is required (throws error if not provided).

Usage

new_required_property(
  class = S7::class_any,
  name,
  validator = NULL,
  getter = NULL,
  setter = NULL
)

Arguments

class

The S7 class for the property (default: S7::class_any)

name

Name of the property (used in error messages)

validator

Optional validation function for the property

getter

Optional getter function for the property

setter

Optional setter function for the property

Details

# Create a required property # required_prop <- new_required_property(class = class_character, name = "gene_ids")

Value

An S7 property object that is required

Normalise Stage Expression Data

Description

Normalise expression data to a specified total expression level per sample.

Usage

normalise_stage_expression(phyex_set, total = 1e+06)

Arguments

phyex_set

A PhyloExpressionSet object

total

Numeric value to normalise each sample to (default: 1e6)

Value

A PhyloExpressionSet object with normalised expression data

Examples

# Normalise to 1 million total expression per sample
normalised_set <- normalise_stage_expression(example_phyex_set, total = 1e6)

Compute TXI Profiles Omitting Each Gene

Description

For each gene i, exclude the corresponding gene i from the PhyloExpressionSet and compute the TXI profile for the dataset with gene i excluded.

This procedure results in a TXI profile matrix storing the TXI profile for each omitted gene i.

Usage

omit_matrix(phyex_set)

Arguments

phyex_set

A PhyloExpressionSet object

Details

This function systematically removes each gene and recalculates the transcriptomic index profile to assess the contribution of individual genes to the overall pattern. This is useful for identifying genes that have a large influence on the phylotranscriptomic signature.

Value

A numeric matrix storing TXI profiles for each omitted gene i

Author(s)

Hajk-Georg Drost

Examples

# Compute omit matrix for a PhyloExpressionSet
omit_mat <- omit_matrix(example_phyex_set)

Calculate Phylostratum-Specific Transcriptomic Index

Description

Calculate pTXI values for expression data. This is a generic function that dispatches based on the input type.

Usage

pTXI(phyex_set, reps = FALSE)

Arguments

phyex_set

A PhyloExpressionSet object

reps

Whether to return pTXI for each sample instead for each group

Value

Matrix of pTXI values

Pairwise Score Function

Description

Compute the pairwise contrast score between two groups of developmental stages.

Usage

pair_score(txi, modules, alternative = c("greater", "less"))

Arguments

txi

Numeric vector of transcriptomic index values

modules

A named list with elements 'contrast1' and 'contrast2' containing stage indices for each contrast group

alternative

Character string specifying the alternative hypothesis: "greater" or "less"

Details

The score is computed as mean(contrast1) - mean(contrast2). For alternative = "less", the score is negated. # Compute pairwise score # modules <- list(contrast1 = 1:3, contrast2 = 7:9) # score <- pair_score(txi_values, modules, "greater")

Value

A numeric value representing the pairwise contrast score

Permute Strata in PhyloExpressionSet

Description

Returns a copy of the PhyloExpressionSet with permuted strata and corresponding strata values.

Usage

permute_PS(phyex_set)

Arguments

phyex_set

A PhyloExpressionSet object

Value

A new PhyloExpressionSet object with permuted strata and strata_values

Create Petal Plot for Gene Removal Analysis

Description

Create a petal plot showing how many genes are removed per run relative to the consensus set.

Usage

petal_plot(sets, p = 0.5)

Arguments

sets

List of gene sets from GATAI runs

p

Consensus threshold (default: 0.5)

Details

This function creates a petal plot visualization showing the relationship between individual GATAI runs and the consensus gene set, highlighting how many genes are unique to each run. # Create petal plot # petal <- petal_plot(gatai_runs, p = 0.5)

Value

A ggplot2 petal plot

Plot Phylostratum Contribution to Transcriptomic Index

Description

Create a stacked area plot showing the contribution of each phylostratum to the overall transcriptomic index across developmental stages or cell types.

Usage

plot_contribution(phyex_set, type = c("stacked", "line"))

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet)

type

One of "stacked" or "line". If stacked, will show the lines stacked in a cumulative area plot, otherwise they will not be stacked.

Details

This function visualizes how different phylostrata contribute to the overall transcriptomic index pattern across developmental stages (bulk data) or cell types (single-cell data). Each area represents the contribution of a specific phylostratum, with older strata typically shown in darker colors and younger strata in lighter colors.

The plot uses the PS_colours function to create a consistent color scheme that matches other myTAI visualizations.

Value

A ggplot2 object showing phylostratum contributions as a stacked area plot

Examples

# Create contribution plot for bulk data
contrib_plot <- plot_contribution(example_phyex_set)

# Create contribution plot for single-cell data
contrib_plot_sc <- plot_contribution(example_phyex_set_sc)

Plot Cullen-Frey Diagram for Distribution Assessment

Description

Create a Cullen-Frey diagram to assess which distribution family best fits the null sample data.

Usage

plot_cullen_frey(test_result)

Arguments

test_result

A TestResult object

Details

The Cullen-Frey diagram plots skewness vs. kurtosis to help identify appropriate distribution families for the null sample data.

Value

A Cullen-Frey plot from the fitdistrplus package

Comparing expression levels distributions across developmental stages

Description

plot_distribution_expression generates plots that help to compare the distribution of expression levels through various developmental stages or cell types, highlighting each stage with distinct colors. By default, a log transformation is applied to the expression values.

Usage

plot_distribution_expression(
  phyex_set,
  show_identities = TRUE,
  show_strata = FALSE,
  log_transform = TRUE,
  seed = 123
)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet).

show_identities

Logical, whether to show identity-specific distributions.

show_strata

Logical, whether to show stratum-specific distributions.

log_transform

Logical, whether to apply log transformation to expression values (default: TRUE).

seed

Seed for reproducible color selection.

Value

A ggplot2 object showing expression levels distributions across identities

Recommendation

Apply a square root transformation to enhance the visualization of differences in the distributions: plot_distribution_expression(transform_counts(phyex_set, sqrt))

Author(s)

Filipa Martins Costa

Partial TAI Distribution Plotting Functions

Description

Functions for plotting and comparing partial TAI distributions using PhyloExpressionSet S7 objects.

plot_distribution_pTAI generates 2 plots that help to compare the distribution of the quotient of expression by partial TAI through various developmental stages or cell types, highlighting each stage with distinct colors.

Usage

plot_distribution_pTAI(
  phyex_set,
  stages = NULL,
  xlab = "Expression / Partial TAI",
  ylab = "Density",
  main = "Density Distribution of Expression / Partial TAI by Developmental Stage"
)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet).

stages

A numeric vector specifying the indices of the stages to compare. Each index corresponds to a stage in the PhyloExpressionSet. If NULL, all stages are used.

xlab

Label of x-axis.

ylab

Label of y-axis.

main

Figure title.

Value

A ggplot2 object showing partial TAI distributions

Author(s)

Filipa Martins Costa

QQ plot comparing partial TAI distributions across developmental stages against a reference stage

Description

plot_distribution_partialTAI_qqplot generates a QQ plot to compare the partial TAI distributions of various developmental stages against a reference stage. It visualizes quantile differences between the reference and other stages, highlights each stage with distinct colors, and annotates the plot with the p-values from the nonparametric ks.test to indicate the significance of distribution differences.

Usage

plot_distribution_pTAI_qqplot(
  phyex_set,
  reference_stage_index = 1,
  xlab = "Quantiles of Reference Stage",
  ylab = "Quantiles of Other Stages",
  main = "QQ Plot: Developmental Stages vs Reference Stage (p-values from KS test)",
  alpha = 0.7,
  size = 1.2
)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet).

reference_stage_index

An integer specifying the index of the reference developmental stage. The partial TAI distribution of this stage will be used as the reference for comparisons with other stages (default: stage index 1).

xlab

Label of x-axis.

ylab

Label of y-axis.

main

Figure title.

alpha

Transparency of the points.

size

Size of the points.

Value

A ggplot2 object showing a qqplot of partial TAI distributions

Author(s)

Filipa Martins Costa

Plot Distribution of Genes Across Phylostrata

Description

Create a bar plot showing the distribution of genes across phylostrata, with options for showing observed vs. expected ratios.

Usage

plot_distribution_strata(
  strata,
  selected_gene_ids = names(strata),
  as_log_obs_exp = FALSE
)

Arguments

strata

Named factor vector of phylostratum assignments (names are gene IDs)

selected_gene_ids

Character vector of gene IDs to include in the plot (default: all genes in strata)

as_log_obs_exp

Logical indicating whether to show log2(observed/expected) ratios instead of raw counts (default: FALSE)

Details

This function visualizes how genes are distributed across different phylostrata. When as_log_obs_exp=FALSE, it shows raw gene counts per stratum. When TRUE, it shows log2 ratios of observed vs. expected gene counts, useful for identifying enrichment or depletion of specific strata in gene sets.

Value

A ggplot2 object showing the phylostratum distribution

Examples

# Plot raw gene counts by strata
p1 <- plot_distribution_strata(example_phyex_set@strata)

# Plot observed vs expected ratios for selected genes
p2 <- plot_distribution_strata(example_phyex_set@strata, 
                               selected_gene_ids = example_phyex_set@gene_ids[5:20],
                               as_log_obs_exp = TRUE)

Plot Comprehensive GATAI Results

Description

Create a suite of plots summarizing the effects of GATAI gene removal on phylotranscriptomic patterns.

Usage

plot_gatai_results(
  phyex_set,
  gatai_result,
  conservation_test = stat_flatline_test,
  runs_threshold = 0.5,
  signature_plot_type = c("separate", "combined")
)

Arguments

phyex_set

A PhyloExpressionSet object containing the original gene expression data.

gatai_result

Result list from destroy_pattern(), containing GATAI analysis output.

conservation_test

Function for conservation test (default: stat_flatline_test).

runs_threshold

Threshold for gene removal consistency across runs (default: 0.5).

signature_plot_type

Type of signature plot: "separate" for individual plots, "combined" for overlay (default: both options).

Details

This function provides a comprehensive visualization of the impact of GATAI gene removal, including transcriptomic signature plots, gene expression profiles, heatmaps, mean-variance relationships, phylostrata distributions, conservation test comparisons, and convergence diagnostics.

Value

A named list of ggplot/patchwork objects and results:

signature_plots

Signature plots before/after GATAI and top variance removal

heatmap_plot

Heatmap of GATAI-removed genes

profiles_plot

Gene expression profiles of GATAI-removed genes

profiles_plot_facet

Faceted gene profiles by strata

gene_space_plot

Gene space plot of GATAI-removed genes

mean_var_plot

Mean-variance plot highlighting GATAI-removed genes

strata_plot

Phylostrata distribution plot (log obs/exp) for GATAI-removed genes

null_dist_plot

Null distribution plot with test statistics and p-values

convergence_plots

GATAI convergence plots (if available)

Author(s)

Filipa Martins Costa, Stefan Manolache

Plot Gene Expression Heatmap

Description

Create a heatmap showing gene expression patterns across conditions with optional dendrograms and gene age annotation.

Usage

plot_gene_heatmap(
  phyex_set,
  genes = NULL,
  top_p = NULL,
  top_k = 30,
  std = TRUE,
  show_reps = FALSE,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  show_gene_age = TRUE,
  show_gene_ids = FALSE,
  gene_annotation = NULL,
  gene_annotation_colors = NULL,
  ...
)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet)

genes

Character vector of specific gene IDs to include in the heatmap (default: NULL for auto-selection of dynamic genes)

top_p

Numeric value specifying the top proportion of genes to include (default: NULL). Ignored if top_k is specified.

top_k

Absolute number of top genes to select (default: 30). Takes precedence over top_p.

std

Logical indicating whether to standardize expression values (default: TRUE)

show_reps

Logical indicating whether to show replicates or collapsed data (default: FALSE)

cluster_rows

Logical indicating whether to cluster genes/rows (default: FALSE)

cluster_cols

Logical indicating whether to cluster conditions/columns (default: FALSE)

show_gene_age

Logical indicating whether to show gene age annotation (default: TRUE)

show_gene_ids

Logical indicating whether to show gene identifiers (default: FALSE)

gene_annotation

Data frame with custom gene annotations, rownames should match gene IDs (default: NULL)

gene_annotation_colors

Named list of color vectors for custom gene annotations (default: NULL)

...

Additional arguments passed to specific methods

Details

This function creates a comprehensive heatmap visualization of gene expression patterns. By default, genes are ordered by their expression angle (developmental trajectory). The function supports clustering of both genes and identities, and can optionally display gene age (phylostratum) as a colored annotation bar.

For bulk data, the heatmap shows expression across developmental conditions. For single-cell data, the heatmap shows expression across cell types.

The gene age annotation uses the PS_colours function to create a consistent color scheme across different myTAI visualizations.

Custom gene annotations can be provided via the gene_annotation parameter, which should be a data frame with gene IDs as rownames and annotation categories as columns. Corresponding colors should be provided via gene_annotation_colors as a named list where names match the annotation column names.

Value

A ggplot object (converted from pheatmap) showing the gene expression heatmap

Examples

# Basic heatmap with gene age annotation
p1 <- plot_gene_heatmap(example_phyex_set, show_gene_age = TRUE)

# Single-cell heatmap with subset of cells
p2 <- plot_gene_heatmap(example_phyex_set_sc, show_reps = TRUE, max_cells_per_type = 3)

# Custom gene annotation example
gene_ids <- example_phyex_set@gene_ids[1:3]
gene_annot <- data.frame(
  Category = c("High", "Medium", "Low"),
  row.names = gene_ids
)
colors <- list(Category = c("High" = "red", "Medium" = "yellow", "Low" = "blue"))
p3 <- plot_gene_heatmap(example_phyex_set |> select_genes(gene_ids), gene_annotation = gene_annot, 
                       gene_annotation_colors = colors, show_gene_age = FALSE)

Plot Individual Gene Expression Profiles

Description

Create a plot showing expression profiles for individual genes across developmental stages or cell types, with various visualization options.

Usage

plot_gene_profiles(
  phyex_set,
  genes = NULL,
  show_set_mean = FALSE,
  show_reps = FALSE,
  transformation = c("log", "std_log", "none"),
  colour_by = c("manual", "strata", "stage"),
  colours = NULL,
  max_genes = 100,
  show_labels = TRUE,
  label_size = 1.75,
  show_legend = TRUE,
  facet_by_strata = FALSE
)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet)

genes

Character vector of gene IDs to plot. If NULL, top expressing genes are selected

show_set_mean

Logical indicating whether to show the mean expression across all genes (default: FALSE)

show_reps

Logical indicating whether to show individual replicates (bulk) or cells (single-cell) (default: FALSE)

transformation

Character string specifying expression transformation: "log" (log1p), "std_log" (standardized log1p), or "none" (default: "log")

colour_by

Character string specifying coloring scheme: "strata" (by phylostratum), "stage" (by developmental stage/cell type), or "manual" (default: "manual")

colours

Optional vector of colors for manual coloring (default: NULL)

max_genes

Maximum number of genes to plot when genes=NULL (default: 100)

show_labels

Logical indicating whether to show gene labels (default: TRUE)

label_size

Font size of gene id labels if shown (default: 0.5).

show_legend

Logical indicating whether to show legend (default: TRUE)

facet_by_strata

Logical indicating whether to facet by phylostratum (default: FALSE)

Details

This function creates detailed visualizations of individual gene expression patterns across development (bulk data) or cell types (single-cell data). Genes can be colored by phylostratum or developmental stage, and various transformations can be applied to highlight different aspects of the data.

Value

A ggplot2 object showing gene expression profiles

Author(s)

Filipa Martins Costa, Stefan Manolache, Hajk-Georg Drost

Examples

# Plot specific genes for bulk data
p1 <- plot_gene_profiles(example_phyex_set, genes = example_phyex_set@gene_ids[1:5])

# Plot for single-cell data with faceting by strata
p2 <- plot_gene_profiles(example_phyex_set_sc, facet_by_strata = TRUE)

Plot Gene Space Using PCA

Description

Create a PCA plot showing genes in expression space with ideal expression patterns overlaid as reference points.

Usage

plot_gene_space(
  phyex_set,
  top_p = 0.2,
  genes = NULL,
  colour_by = c("identity", "strata")
)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet)

top_p

Proportion of most dynamic genes to include when genes=NULL (default: 0.2)

genes

Character vector of specific genes to plot. If NULL, uses top dynamic genes

colour_by

Character string specifying coloring scheme: "identity" (by peak expression stage/identity) or "strata" (by phylostratum) (default: "identity")

Details

This function creates a PCA visualization of genes in expression space, with ideal expression patterns (early, mid, late, reverse mid) overlaid as reference points. The analysis uses log-transformed and standardized expression values. Genes are colored either by their phylostratum or by their peak expression stage.

Value

A ggplot2 object showing the gene space PCA plot

Examples

# Plot gene space colored by identity
p1 <- plot_gene_space(example_phyex_set, colour_by = "identity")

# Plot specific genes colored by strata
p2 <- plot_gene_space(example_phyex_set, 
                      genes = example_phyex_set@gene_ids[1:5], 
                      colour_by = "strata")

Plot Mean-Variance Relationship

Description

Create a scatter plot showing the relationship between mean expression and variance for genes, colored by phylostratum, with optional highlighting and labeling of specific genes.

Usage

plot_mean_var(
  phyex_set,
  highlight_genes = NULL,
  colour_by = c("none", "strata")
)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet) containing gene expression data.

highlight_genes

Optional character vector of gene IDs to highlight and label on the plot.

colour_by

Character string specifying coloring scheme: "none" (default), "strata" colors by phylostratum

Details

This function plots the mean expression versus variance for each gene, with points colored by phylostratum. Optionally, specific genes can be highlighted and labeled. This visualization helps identify expression patterns and heteroscedasticity in the data.

The function uses collapsed expression data (averaged across replicates for bulk data, or averaged across cells per cell type for single-cell data).

Value

A ggplot2 object showing the mean-variance relationship.

Examples

# Create mean-variance plot for bulk data
mv_plot <- plot_mean_var(example_phyex_set)

# Highlight and label specific genes in single-cell data
mv_plot_sc <- plot_mean_var(example_phyex_set_sc, 
  highlight_genes = example_phyex_set_sc@gene_ids[1:3])

# Color by phylostratum
mv_plot_colored <- plot_mean_var(example_phyex_set, colour_by = "strata")

Plot Null TXI Sample Distribution

Description

Create a plot showing the null TXI distribution sample compared to the observed test TXI values across developmental stages.

Usage

plot_null_txi_sample(test_result)

Arguments

test_result

A ConservationTestResult object containing null TXI distributions

Details

This function creates a visualization of the null hypothesis testing by plotting: - Gray lines representing individual null TXI samples from permutations - A horizontal line showing the mean of null distributions - A colored line showing the observed test TXI values

The plot helps visualize how the observed TXI pattern compares to what would be expected under the null hypothesis of no conservation signal.

Value

A ggplot2 object showing null samples as gray lines and test TXI as colored line

Plot Mean Relative Expression Levels as Barplot

Description

Plots mean relative expression levels for age category groups using a PhyloExpressionSet S7 object, with statistical testing.

Usage

plot_relative_expression_bar(phyex_set, groups, p_adjust_method = NULL, ...)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet).

groups

A list of integer vectors specifying age categories (e.g., phylostrata) for each group (2+ groups).

p_adjust_method

P-value adjustment for multiple testing.

...

Further arguments passed to ggplot2 geoms.

Value

ggplot2 object.

Plot Relative Expression Profiles (Line Plot)

Description

Plots relative expression profiles for age categories using a PhyloExpressionSet S7 object.

Usage

plot_relative_expression_line(
  phyex_set,
  groups,
  modules = NULL,
  adjust_range = TRUE,
  alpha = 0.1,
  ...
)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet).

groups

A list of integer vectors specifying age categories (e.g., phylostrata) for each group (1 or 2 groups).

modules

Optional list for shading modules: list(early=..., mid=..., late=...).

adjust_range

Logical, adjust y-axis range for both panels (if 2 groups).

alpha

Transparency for shaded module area.

...

Further arguments passed to ggplot2 geoms.

Value

ggplot2 object or list of ggplot2 objects.

Plot Sample Space Visualization

Description

Create a dimensional reduction plot to visualize sample relationships in gene expression space using PCA or UMAP.

Usage

plot_sample_space(
  phyex_set,
  method = c("PCA", "UMAP"),
  colour_by = c("identity", "TXI"),
  seed = 42,
  ...
)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet)

method

Character string specifying the dimensionality reduction method: "PCA" or "UMAP" (default: "PCA")

colour_by

Character string specifying what to colour by: "identity" (default), "TXI"

seed

Integer seed for reproducible UMAP results (default: 42)

...

Additional arguments passed to specific methods

Details

This function performs log1p transformation on expression data, removes genes with zero variance, and applies the specified dimensionality reduction method. Samples are coloured by their group assignments or TAI values.

Value

A ggplot2 object showing the sample space visualisation

Examples

# Create PCA plot coloured by identity
pca_plot <- plot_sample_space(example_phyex_set, method = "PCA", colour_by = "identity")

# Create UMAP plot coloured by TXI
if (requireNamespace("uwot", quietly = TRUE)) {
    umap_plot <- plot_sample_space(example_phyex_set, method = "UMAP", colour_by = "TXI")
}

Plot Transcriptomic Signature

Description

Create a plot of the transcriptomic index signature across developmental stages or cell types, with options for showing individual samples/cells and statistical testing.

Usage

plot_signature(
  phyex_set,
  show_reps = TRUE,
  show_p_val = TRUE,
  conservation_test = stat_flatline_test,
  colour = NULL,
  ...
)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet)

show_reps

Logical, whether to show individual replicates

show_p_val

Logical, whether to show conservation test p-value

conservation_test

Function, conservation test to use for p-value calculation

colour

Character, custom color for the plot elements

...

Additional arguments passed to specific methods

Details

This function creates visualizations appropriate for the data type:

**Bulk data (BulkPhyloExpressionSet):** - Line plots showing TXI trends across developmental stages - Optional individual biological replicates as jittered points - Optional conservation test p-values

**Single-cell data (ScPhyloExpressionSet):** - Sina plots showing TXI distributions across cell types or other identities - Mean TXI values overlaid as line - Optional individual cells using geom_sina for better visualization - Flexible identity selection from metadata via additional parameters: - 'primary_identity': Character, name of metadata column for x-axis (default: current selected identities) - 'secondary_identity': Character, name of metadata column for coloring/faceting - 'facet_by_secondary': Logical, whether to facet by secondary identity (default: FALSE uses colouring)

Value

A ggplot2 object showing the transcriptomic signature

Examples

# Basic signature plot for bulk data
p <- plot_signature(example_phyex_set)

Plot Signature Across Gene Expression Quantiles

Description

Create a plot showing how the transcriptomic signature changes when genes are progressively removed based on expression quantiles.

Usage

plot_signature_gene_quantiles(
  phyex_set,
  quantiles = c(1, 0.99, 0.95, 0.9, 0.8),
  selection_FUN = genes_top_mean,
  ...
)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet)

quantiles

Numeric vector of quantiles to test (default: c(1.0, 0.99, 0.95, 0.90, 0.80))

selection_FUN

Function to select genes for removal (default: genes_top_mean)

...

Additional arguments passed to plot_signature_multiple

Details

This function systematically removes genes based on expression quantiles and shows how the transcriptomic signature changes. This is useful for understanding the contribution of highly expressed genes to the overall pattern and for assessing the robustness of phylotranscriptomic patterns.

The analysis works with both bulk and single-cell data, helping to determine whether phylotranscriptomic patterns are driven by a few highly expressed genes or represent broad transcriptomic trends.

Value

A ggplot2 object showing signatures across different quantiles

Examples

# Plot signature across expression quantiles for bulk data
phyex_set <- example_phyex_set |>
    select_genes(example_phyex_set@gene_ids[1:100])
phyex_set@null_conservation_sample_size <- 500
p <- plot_signature_gene_quantiles(phyex_set, quantiles = c(0.95, 0.90))

Plot Multiple Transcriptomic Signatures

Description

Create a plot comparing multiple transcriptomic signatures on the same axes, with options for statistical testing and transformations.

Usage

plot_signature_multiple(
  phyex_sets,
  legend_title = "Phylo Expression Set",
  show_p_val = TRUE,
  conservation_test = stat_flatline_test,
  transformation = NULL,
  colours = NULL,
  ...
)

Arguments

phyex_sets

A vector of PhyloExpressionSet objects to compare (BulkPhyloExpressionSet or ScPhyloExpressionSet)

legend_title

Title for the legend (default: "Phylo Expression Set")

show_p_val

Logical indicating whether to show p-values (default: TRUE)

conservation_test

Function to use for conservation testing (default: stat_flatline_test)

transformation

Optional transformation function to apply to all datasets (default: NULL)

colours

Optional vector of colors for each dataset (default: NULL)

...

Additional arguments passed to plot_signature

Details

This function allows comparison of multiple transcriptomic signatures by overlaying them on the same plot. Each signature is colored differently and can be tested for conservation patterns (bulk data only). If a transformation is provided, it's applied to all datasets before plotting.

The function automatically adapts to the data type: - **Bulk data**: Line plots with optional statistical testing - **Single-cell data**: Violin plots showing distributions

All datasets must use the same axis labels (developmental stages or cell types).

Value

A ggplot2 object showing multiple transcriptomic signatures

Examples

# Compare multiple bulk datasets
phyex_set <- example_phyex_set

bulk_list <- c(phyex_set, 
  phyex_set |> remove_genes(phyex_set@gene_ids[1:5]))
p <- plot_signature_multiple(bulk_list, legend_title = "Dataset")

Plot Signature Under Different Transformations

Description

Compare transcriptomic signatures under various data transformations to assess the robustness of phylotranscriptomic patterns.

Usage

plot_signature_transformed(phyex_set, transformations = COUNT_TRANSFORMS, ...)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet)

transformations

Named list of transformation functions (default: COUNT_TRANSFORMS)

...

Additional arguments passed to plot_signature_multiple

Details

This function applies different transformations to the same dataset and compares the resulting transcriptomic signatures. This is useful for assessing whether phylotranscriptomic patterns are robust to different data processing approaches or are artifacts of specific transformations.

The analysis works with both bulk and single-cell data, helping to determine whether phylotranscriptomic patterns are consistent across different normalization and transformation methods.

Value

A ggplot2 object showing signatures under different transformations

Examples

# Single-cell data with custom transformations

phyex_set <- example_phyex_set

custom_transforms <- list(raw = identity, log = log1p)
p <- plot_signature_transformed(phyex_set, transformations = custom_transforms)

Plot Expression Levels by Phylostratum

Description

Create a boxplot showing the distribution of expression levels for each phylostratum.

Usage

plot_strata_expression(phyex_set, aggregate_FUN = mean)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet)

aggregate_FUN

Function to aggregate expression across identities (default: mean)

Details

This function creates a boxplot visualization showing how expression levels vary across different phylostrata. Each point represents a gene, and the boxes show the distribution of expression levels within each phylostratum.

For bulk data, expression is aggregated across developmental stages. For single-cell data, expression is aggregated across cell types.

Value

A ggplot2 object showing expression distributions by phylostratum

Examples

# Plot expression by strata using mean aggregation for bulk data
p1 <- plot_strata_expression(example_phyex_set, aggregate_FUN = mean)

# Plot using median aggregation for single-cell data
p2 <- plot_strata_expression(example_phyex_set_sc, aggregate_FUN = median)

Print Method for TestResult

Description

Print method for TestResult objects showing test summary information.

Arguments

x

A TestResult object

...

Additional arguments (ignored)

Details

# Print a test result # print(test_result)

Value

Invisibly returns the TestResult object

Calculate Quantile Ranks

Description

Calculate quantile ranks for a numeric vector, handling ties using average method.

Usage

quantile_rank(x)

Arguments

x

numeric vector for which to calculate quantile ranks

Value

A numeric vector of quantile ranks between 0 and 1

Examples

# Calculate quantile ranks for a vector
ranks <- quantile_rank(c(1, 2, 3, 4, 5))

Reductive Hourglass Score Function

Description

Compute the reductive hourglass score by comparing early and late developmental stages to mid stages.

Usage

reductive_hourglass_score(txi, modules)

Arguments

txi

Numeric vector of transcriptomic index values

modules

A named list with elements 'early', 'mid', and 'late' containing stage indices for each developmental module

Details

The score is computed as the minimum of: - D1: mean(early) - mean(mid) - D2: mean(late) - mean(mid)

Higher scores indicate stronger reductive hourglass patterns (mid stages dominated by older genes). # Compute reductive hourglass score # modules <- list(early = 1:3, mid = 4:6, late = 7:9) # score <- reductive_hourglass_score(txi_values, modules)

Value

A numeric value representing the reductive hourglass score

Compute Relative Expression Matrix for PhyloExpressionSet

Description

Computes relative expression profiles for all age categories in a PhyloExpressionSet.

Usage

rel_exp_matrix(phyex_set)

Arguments

phyex_set

A PhyloExpressionSet object (BulkPhyloExpressionSet or ScPhyloExpressionSet).

Value

A matrix with age categories as rows and identities as columns, containing relative expression values.

Relative Expression Functions

Description

Functions for computing and plotting relative expression profiles using PhyloExpressionSet S7 objects.

Computes the relative expression profile for a given gene expression matrix. The relative expression is calculated by normalizing the column means of the matrix to a [0, 1] scale.

Usage

relative_expression(count_matrix)

Arguments

count_matrix

A numeric matrix where columns represent developmental stages/cell types and rows represent genes.

Value

A numeric vector of relative expression values for each stage (column) in the input matrix.

Remove Genes from PhyloExpressionSet

Description

Remove specified genes from a PhyloExpressionSet object.

Usage

remove_genes(
  phyex_set,
  genes,
  new_name = paste(phyex_set@name, "perturbed"),
  reuse_null_txis = TRUE
)

Arguments

phyex_set

A PhyloExpressionSet object

genes

Character vector of gene IDs to remove

new_name

Character string for the new dataset name (default: auto-generated)

reuse_null_txis

Logical indicating whether to reuse precomputed null conservation TXIs (default: TRUE)

Value

A PhyloExpressionSet object with the specified genes removed

Examples

# Remove specific genes
filtered_set <- remove_genes(example_phyex_set, example_phyex_set@gene_ids[1:5], 
                            new_name = "Filtered Dataset")

Rename a PhyloExpressionSet

Description

Returns a copy of the PhyloExpressionSet with a new name.

Usage

rename_phyex_set(phyex_set, new_name)

Arguments

phyex_set

A PhyloExpressionSet object

new_name

Character string for the new dataset name

Value

The PhyloExpressionSet object with the updated name

Reverse Mid Expression Pattern

Description

Generate an ideal reverse mid expression pattern for S developmental stages.

Usage

rev_mid_gene(S)

Arguments

S

Number of developmental stages

Details

Creates a pattern where expression starts high, drops in middle stages, and returns to high in later stages. This is the inverse of the mid pattern.

Value

Numeric vector representing reverse mid expression pattern

Reverse Hourglass Score Function

Description

Compute the reverse hourglass score by comparing mid developmental stages to early and late stages.

Usage

reverse_hourglass_score(txi, modules)

Arguments

txi

Numeric vector of transcriptomic index values

modules

A named list with elements 'early', 'mid', and 'late' containing stage indices for each developmental module

Details

The score is computed as the minimum of: - D1: mean(mid) - mean(early) - D2: mean(mid) - mean(late)

Higher scores indicate stronger reverse hourglass patterns (mid stages dominated by younger genes). # Compute reverse hourglass score # modules <- list(early = 1:3, mid = 4:6, late = 7:9) # score <- reverse_hourglass_score(txi_values, modules)

Value

A numeric value representing the reverse hourglass score

Row-wise Variance Calculation

Description

Calculate the variance for each row of a matrix or data frame.

Usage

rowVars(x, ...)

Arguments

x

A numeric matrix or data frame

...

Additional arguments passed to rowSums and rowMeans

Details

This function computes the sample variance for each row using the formula: var = sum((x - mean(x))^2) / (n - 1) # Calculate row variances for a matrix # mat <- matrix(1:12, nrow = 3) # row_vars <- rowVars(mat)

Value

A numeric vector containing the variance for each row

Calculate Stratum-Specific Transcriptomic Index

Description

Calculate the stratum-specific transcriptomic index (sTXI) by summing pTXI values within each phylostratum.

Usage

sTXI(phyex_set, option = "identity")

Arguments

phyex_set

A PhyloExpressionSet object

option

Character string specifying calculation method: - "identity": Sum pTXI values within each stratum - "add": Cumulative sum across strata

Value

Matrix of sTXI values with strata as rows and identities as columns

Examples

# Calculate sTXI values
stxi_values <- sTXI(example_phyex_set, option = "identity")
stxi_cumsum <- sTXI(example_phyex_set, option = "add")

Save GATAI Analysis Results to PDF

Description

Save removed gene IDs and all GATAI analysis plots to a PDF file.

Usage

save_gatai_results_pdf(
  phyex_set,
  gatai_result,
  analysis_dir = "gatai_analysis",
  prefix = "report",
  ...
)

Arguments

phyex_set

A PhyloExpressionSet object containing the original gene expression data.

gatai_result

Result list from destroy_pattern(), containing GATAI analysis output.

analysis_dir

Directory to save the PDF file.

prefix

Optional prefix for the PDF filename (default: "report").

...

Additional arguments passed to plot_gatai_results().

Value

Invisibly returns the path to the saved PDF.

Select Genes from PhyloExpressionSet

Description

Extract a subset of genes from a PhyloExpressionSet object.

Usage

select_genes(phyex_set, ...)

Arguments

phyex_set

A PhyloExpressionSet object

...

Additional arguments passed to methods (typically includes 'genes' parameter)

Value

A PhyloExpressionSet object containing only the selected genes

Examples

# Select specific genes
selected_set <- select_genes(example_phyex_set, example_phyex_set@gene_ids[1:10])

Gene Expression Transformation Functions

Description

Collection of functions for transforming gene expression data in PhyloExpressionSet objects.

Generic function to set the expression matrix in a PhyloExpressionSet object.

Usage

set_expression(phyex_set, new_expression, new_name = NULL, ...)

Arguments

phyex_set

A PhyloExpressionSet object

new_expression

Matrix to set as the new expression

new_name

Optional new name for the dataset

...

Additional arguments

Value

A PhyloExpressionSet object with updated expression

Early Conservation Test

Description

Test for early conservation patterns in transcriptomic data by comparing early developmental stages to mid and late stages.

Usage

stat_early_conservation_test(phyex_set, modules, ...)

Arguments

phyex_set

A PhyloExpressionSet object

modules

A named list with elements 'early', 'mid', and 'late' containing stage indices for each developmental module

...

Additional arguments passed to stat_generic_conservation_test

Details

The early conservation test evaluates whether early developmental stages show lower transcriptomic index values (indicating older genes) compared to later stages. The test computes a score based on the minimum difference between mid vs. early and late vs. early TXI values.

Value

A ConservationTestResult object with early conservation test results

Examples

# Define developmental modules
p <- example_phyex_set_old |> 
     select_genes(example_phyex_set_old@gene_ids[1:100])
modules <- list(early = 1:2, mid = 3:5, late = 6:7)
result <- stat_early_conservation_test(p, modules=modules)

Flat Line Test for Conservation Pattern

Description

Perform a flat line test to assess whether the transcriptomic index profile shows a flat (non-varying) pattern across developmental stages.

Usage

stat_flatline_test(phyex_set, ...)

Arguments

phyex_set

A PhyloExpressionSet object

...

Additional arguments passed to stat_generic_conservation_test

Details

The flat line test evaluates whether the TXI profile remains constant across developmental stages by testing the variance of the profile against a null distribution. A significant result indicates rejection of the flat line pattern.

Value

A test result object containing p-value and test statistics

Examples

# Perform flat line test
result <- stat_flatline_test(example_phyex_set)

Generate Null Conservation TXI Distribution

Description

Generate a null distribution of transcriptomic index values for conservation testing by permuting phylostratum assignments.

Usage

stat_generate_conservation_txis(strata_vector, count_matrix, sample_size)

Arguments

strata_vector

Numeric vector of phylostratum assignments

count_matrix

Matrix of expression counts

sample_size

Number of permutations to generate

Details

This function creates a null distribution by randomly permuting phylostratum assignments while keeping the expression matrix fixed. This preserves the expression structure while breaking the phylostratum-expression relationship. # Generate null TXI distribution # null_txis <- stat_generate_conservation_txis(strata_vec, expr_matrix, 1000)

Value

Matrix of permuted TXI values for null hypothesis testing

Generic Conservation Test Framework

Description

Perform a generic conservation test by comparing observed TXI patterns against a null distribution generated by permutation.

Usage

stat_generic_conservation_test(
  phyex_set,
  test_name,
  scoring_function,
  fitting_dist,
  alternative = c("two-sided", "greater", "less"),
  p_label = p_label,
  custom_null_txis = NULL,
  plot_result = TRUE
)

Arguments

phyex_set

A PhyloExpressionSet object

test_name

Character string naming the test

scoring_function

Function to compute test statistic from TXI profile

fitting_dist

Distribution object for fitting the null distribution

alternative

Character string specifying alternative hypothesis: "two-sided", "greater", or "less" (default: "two-sided")

p_label

Label for p-value in results (default: p_label)

custom_null_txis

Optional custom null TXI distribution (default: NULL)

plot_result

Logical indicating whether to plot results (default: TRUE)

Details

This function provides a generic framework for conservation testing by: 1. Generating null TXI distributions via permutation 2. Computing test statistics using the provided scoring function 3. Fitting the specified distribution to the null sample 4. Computing p-values based on the alternative hypothesis

Value

A ConservationTestResult object containing test statistics and p-values

Late Conservation Test

Description

Test for late conservation patterns in transcriptomic data by comparing late developmental stages to early and mid stages.

Usage

stat_late_conservation_test(phyex_set, modules, ...)

Arguments

phyex_set

A PhyloExpressionSet object

modules

A named list with elements 'early', 'mid', and 'late' containing stage indices for each developmental module

...

Additional arguments passed to stat_generic_conservation_test

Details

The late conservation test evaluates whether later developmental stages show lower transcriptomic index values (indicating older genes) compared to earlier stages. The test computes a score based on the minimum difference between early vs. late and mid vs. late TXI values.

Value

A ConservationTestResult object with late conservation test results

Examples

# Define developmental modules
modules <- list(early = 1:2, mid = 3:5, late = 6:7)
result <- stat_late_conservation_test(example_phyex_set_old, modules)

Pairwise Conservation Test

Description

Test for significant differences in transcriptomic index values between two groups of developmental stages.

Usage

stat_pairwise_test(phyex_set, modules, alternative = c("greater", "less"), ...)

Arguments

phyex_set

A PhyloExpressionSet object

modules

A named list with elements 'contrast1' and 'contrast2' containing stage indices for each contrast group

alternative

Character string specifying the alternative hypothesis: "greater" (contrast1 > contrast2) or "less" (contrast1 < contrast2)

...

Additional arguments passed to stat_generic_conservation_test

Details

The pairwise test compares the mean transcriptomic index values between two groups of developmental stages. This is useful for testing specific hypotheses about differences in gene age composition between developmental periods.

Value

A ConservationTestResult object with pairwise test results

Author(s)

Jaruwatana Sodai Lotharukpong

Examples

# Define contrast groups
modules <- list(contrast1 = 1:3, contrast2 = 4:7)
result <- stat_pairwise_test(example_phyex_set, modules, alternative = "greater")

Reductive Hourglass Test

Description

Test for reductive hourglass patterns in transcriptomic data by comparing early and late developmental stages to mid developmental stages.

Usage

stat_reductive_hourglass_test(phyex_set, modules, ...)

Arguments

phyex_set

A PhyloExpressionSet object

modules

A named list with elements 'early', 'mid', and 'late' containing stage indices for each developmental module

...

Additional arguments passed to stat_generic_conservation_test

Details

The reductive hourglass test evaluates whether mid developmental stages show lower transcriptomic index values (indicating older genes) compared to both early and late stages. This creates an hourglass-shaped pattern where ancient genes dominate during mid-development. The test computes a score based on the minimum difference between early vs. mid and late vs. mid TXI values.

Value

A ConservationTestResult object with reductive hourglass test results

Examples

# Define developmental modules
modules <- list(early = 1:2, mid = 3:5, late = 6:7)
result <- stat_reductive_hourglass_test(example_phyex_set_old, modules=modules)

Reverse Hourglass Test

Description

Test for reverse hourglass patterns in transcriptomic data by comparing mid developmental stages to early and late developmental stages.

Usage

stat_reverse_hourglass_test(phyex_set, modules, ...)

Arguments

phyex_set

A PhyloExpressionSet object

modules

A named list with elements 'early', 'mid', and 'late' containing stage indices for each developmental module

...

Additional arguments passed to stat_generic_conservation_test

Details

The reverse hourglass test evaluates whether mid developmental stages show higher transcriptomic index values (indicating younger genes) compared to both early and late stages. This creates a reverse hourglass pattern where recently evolved genes dominate during mid-development. The test computes a score based on the minimum difference between mid vs. early and mid vs. late TXI values.

Value

A ConservationTestResult object with reverse hourglass test results

Examples

# Define developmental modules
modules <- list(early = 1:2, mid = 3:5, late = 6:7)
result <- stat_reverse_hourglass_test(example_phyex_set_old, modules=modules)

Calculate Phylostratum Enrichment

Description

Calculate log2(observed/expected) enrichment ratios for phylostrata in a selected gene set compared to the background distribution.

Usage

strata_enrichment(strata, selected_gene_ids)

Arguments

strata

Named factor vector of phylostratum assignments (names are gene IDs)

selected_gene_ids

Character vector of gene IDs to test for enrichment

Details

This function calculates enrichment or depletion of phylostrata in a gene set by comparing the observed proportion of each stratum in the selected genes to the expected proportion based on the background distribution in all genes.

Positive values indicate enrichment (more genes than expected), while negative values indicate depletion (fewer genes than expected).

Value

A data frame with columns:

Stratum: Phylostratum factor levels
log_obs_exp: Log2 ratio of observed vs expected proportions

Examples

# Calculate enrichment for a gene set
enrichment <- strata_enrichment(example_phyex_set@strata, example_phyex_set@gene_ids[1:30])
print(enrichment)

Retrieve taxonomy categories from NCBI Taxonomy

Description

This function retrieves category information from NCBI Taxonomy and is able to filter kingdom specific taxids.

Usage

taxid(db.path, download = FALSE, update = FALSE, filter = NULL)

Arguments

db.path

path to download and store the NCBI Taxonomy categories.dmp file. Default is the tempdir() directory.

download

a logical value specifying whether or not the categories.dmp shall be downloaded (download = TRUE) or whether a local version already exists on the users machine (download = TRUE - in this case please specify the db.path argument to target the local categories.dmp file).

update

should the local file be updated? Please specify the db.path argument to target the local categories.dmp file.

filter

a character string specifying the kingdom of life for which taxids shall be returned. Options are "Archea", "Bacteria", "Eukaryota", "Viruses", "Unclassified".

Value

A tibble object containing taxonomy category information

Author(s)

Hajk-Georg Drost

Examples


# download categories.dmp file to current working directory 
# and filter for 'Archea' taxids
Archea.taxids <- taxid(db.path = getwd(), filter = "Archea", download = TRUE)

# Once the NCBI Taxonomy 'categories.dmp' file is downloaded to your machine ('download = TRUE')
# the 'taxid()' function can be proceed on the local 'categories.dmp' file
# e.g. filter for Virus taxids
Virus.taxids <- taxid(db.path = getwd(), filter = "Viruses")

Short Alias for Transform Counts

Description

Convenience alias for transform_counts function.

Usage

tf(phyex_set, FUN, FUN_name = deparse(substitute(FUN)), new_name = NULL, ...)

Arguments

phyex_set

A PhyloExpressionSet object

FUN

Function to apply

FUN_name

Name of the transformation function (optional)

new_name

Name for the new dataset (optional)

...

Additional arguments passed to FUN

Value

A PhyloExpressionSet object with transformed expression data

Transform Phylostratum Values

Description

This function performs transformation of phylostratum values.

Usage

tf_PS(phyex_set, transform = "qr")

Arguments

phyex_set

a PhyloExpressionSet S7 object.

transform

a character vector of any valid function that transforms PS values. Possible values can be:

transform = "qr" (or "quantilerank") : quantile rank transformation analogous to Julia function StatsBase.quantilerank using method = :tied.

Details

This function transforms the phylostratum assignment. The return value of this function is a PhyloExpressionSet object with transformed phylostratum tfPhylostratum as the first column, satisfying PhyloExpressionSetBase. Note that the input transform must be an available function, currently limited to only "qr" (or "quantilerank").

Value

a PhyloExpressionSet object storing transformed Phylostratum levels.

Author(s)

Jaruwatana Sodai Lotharukpong and Lukas Maischak

Examples


# get the relative expression profiles for each phylostratum
tfPES <- tf_PS(example_phyex_set, transform = "qr")

Perform Permutation Tests Under Different Transformations

Description

tf_stability statistically evaluates the stability of phylotranscriptomics permutation tests (e.g., stat_flatline_test, stat_reductive_hourglass_test, etc.) under different data transformations using a PhyloExpressionSet.

Usage

tf_stability(
  phyex_set,
  conservation_test = stat_flatline_test,
  transforms = COUNT_TRANSFORMS
)

Arguments

phyex_set

a PhyloExpressionSet.

conservation_test

a conservation test function (e.g. stat_flatline_test, stat_reductive_hourglass_test, etc.)

transforms

named list of transformation functions (default: COUNT_TRANSFORMS)

Details

Assesses the stability of data transforms on the permutation test of choice. See tf, stat_flatline_test, stat_reductive_hourglass_test, etc.

Value

Named numeric vector of p-values for each transformation.

Author(s)

Jaruwatana Sodai Lotharukpong

References

Lotharukpong JS et al. (2023) (unpublished)

Create Threshold Comparison Plots

Description

Create plots comparing how different consensus thresholds affect gene set sizes and p-values in GATAI analysis.

Usage

threshold_comparison_plots(phyex_set, runs)

Arguments

phyex_set

A PhyloExpressionSet object

runs

List of GATAI run results

Details

This function analyzes how the choice of consensus threshold (minimum number of runs a gene must appear in) affects the final gene set size and statistical significance. Uses cached null distributions for efficient p-value calculation. # Create threshold comparison plots # thresh_plots <- threshold_comparison_plots(phyex_set, gatai_runs)

Value

A list with two ggplot objects:

counts: Plot showing gene counts across thresholds
pval: Plot showing p-values across thresholds

Transform Expression Counts in PhyloExpressionSet

Description

Apply a transformation function to the expression counts in a PhyloExpressionSet.

Usage

transform_counts(
  phyex_set,
  FUN,
  FUN_name = deparse(substitute(FUN)),
  new_name = NULL,
  ...
)

Arguments

phyex_set

A PhyloExpressionSet object

FUN

Function to apply

FUN_name

Name of the transformation function (optional)

new_name

Name for the new dataset (optional)

...

Additional arguments passed to FUN

Value

A PhyloExpressionSet object with transformed expression data