Metric and Non-Gaussian Variants of Archetypal Analysis

One simplex, several geometries

Classical archetypal analysis (AA) represents each observation as a convex mixture of a small number of extreme profiles (Cutler and Breiman 1994). In yaap, the main objects keep the same interpretation across most variants:

coefficients describe each archetype as a mixture of training observations.
compositions describe each observation as a mixture of archetype profiles.
coordinates are the fitted archetype profiles, when an input-space profile is available.

What changes is the notion of closeness. For Gaussian models, this is usually an inner product that defines lengths and angles in feature space. For non-Gaussian or directional models, it is a likelihood or angular objective.

Ordinary AA uses squared Euclidean distance.
Metric AA uses a metric matrix \(G\) to weight feature-space residuals.
Kernel AA implicitly maps the data into another feature space and uses that space’s inner product.
Probabilistic AA uses a likelihood objective, and directional AA uses an angular objective.

(Seth and Eugster 2016; Olsen et al. 2022).

Variant	Main function	What changes	Typical data
Metric Gaussian AA	`run_aa(..., scale = G)`	Feature inner product	Curves, spectra, correlated features
Kernel AA	`run_aa(..., method = "kernel")`	Implicit feature-space inner product	Nonlinear geometry
Probabilistic AA	`run_aa(..., method = "paa", family = ...)`	Likelihood/objective	Binary, count, compositional observations
Directional AA	`run_aa(..., method = "directional")`	Angular objective	Unit directions, polarity-invariant signals

This vignette is organized as a set of recipes. Each section starts with the modeling situation where the variant is useful, then shows the smallest code pattern needed to fit the model and interpret the result. The examples are small and synthetic so they can be rerun quickly.

Metric Gaussian AA

Use this recipe when the rows are still numeric profiles, but ordinary Euclidean distance is the wrong residual scale. Common cases include calibrated measurement error, correlated features, spectra, and functional curves.

In yaap, the standard Gaussian objective uses the following matrices:

\(X\), the data matrix (\(N \times M\))
\(S\), the composition matrix (\(N \times K\))
\(A\), the archetype matrix (\(K \times M\))
\(E = X - SA\), the residual matrix
\(W = \mathrm{diag}(w_1, \ldots, w_N)\), the diagonal matrix of sample weights
\(G\), the feature metric, a \(M \times M\) positive-definite matrix

The fitted objective is a weighted and metric-adjusted sum of squared errors:

\[\mathcal{L}(S,A; W, G) = \mathrm{tr}\{W E G E^\top\}\]

Here, \(W\) is used to adjust the contribution of each sample, and \(G\) defines how feature-space residuals are measured. If \(G\) is diagonal, its entries say how strongly each feature contributes to squared error. \(G\)’s off-diagonal entries add cross-terms to the quadratic error, so residuals in nearby or related features can be judged together.

The weights argument sets \(W\); when it is omitted, \(W\) is the identity matrix. The scale argument sets \(G\). Conceptually, every non-kernel Gaussian use of scale is a choice of this feature metric:

scale = FALSE uses the identity metric
scale = TRUE uses the feature-wise sample variances as a diagonal metric
a numeric vector of column divisors gives a diagonal metric through per-feature scale factors
a positive-definite matrix supplies a full non-diagonal metric.

Recipe: known measurement-error metric

A feature metric is useful when technical measurement error is known from instrument calibration or prior domain knowledge. In the recipe below, the true archetypal structure is a triangle in the first two measured features. The last two features are nuisance measurements with large correlated technical noise. For simplicity, we do not add any other irreducible noise. Plain Euclidean AA treats all feature directions as having the same reliability. Metric Gaussian AA uses the technical-noise precision matrix, so residuals are judged relative to measurement reliability.

It is important that this metric describes measurement noise, not the total covariance of X_metric. The total covariance of X_metric contains the archetypal signal we want to model. Using its inverse would also downweight meaningful variation between archetypes.

# Define the signal archetypes.
A_signal_true <- rbind(
    left  = c(-1,      0, 0, 0),
    right = c(1,       0, 0, 0),
    top   = c(0, sqrt(3), 0, 0)
)

# Basic dimensions
K <- nrow(A_signal_true)
M <- ncol(A_signal_true)
N <- 220  # samples

# Archetype compositions for each sample
S_metric_true <- sample_simplex(N, K)
X_metric_mean <- S_metric_true %*% A_signal_true

# Build block-diagonal measurement-noise covariance
Sigma_noise <- matrix(0, nrow = M, ncol = M)
# Signal features: small, moderately correlated noise
Sigma_noise[1:2, 1:2] <- 0.03^2 * matrix(c(1, -0.5, -0.5, 1), nrow = 2)
# Nuisance features: large, weakly correlated noise
Sigma_noise[3:4, 3:4] <- 3^2 * matrix(c(1, 0.2, 0.2, 1), nrow = 2)
# Precision matrix used as metric for AA
G_noise <- solve(Sigma_noise)

# Correlated anisotropic noise
X_noise <- matrix(rnorm(N * M), ncol = M) %*% chol(Sigma_noise)
X_metric <- X_metric_mean + X_noise
colnames(X_metric) <- c("feature_1", "feature_2", "noise_1", "noise_2")

set.seed(42)
fit_plain <- run_aa(
    X_metric,
    K = K,
    scale = FALSE,
    sd_threshold = 0,
    init = "random",
    nrep = 20
)
fit_plain
#> 
#> Call:
#> run_aa(x = X_metric, K = K, init = "random", scale = FALSE, sd_threshold = 0, 
#>     nrep = 20)
#> 
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3 
#> Converged after 50 iterations.
#> Final Loss Metrics:
#>      loss        r2
#>  263.5798 0.9399025

set.seed(42)  # same starts
fit_metric <- run_aa(
    X_metric,
    K = K,
    scale = G_noise,
    sd_threshold = 0,
    init = "random",
    nrep = 20
)
fit_metric
#> 
#> Call:
#> run_aa(x = X_metric, K = K, init = "random", scale = G_noise, 
#>     sd_threshold = 0, nrep = 20)
#> 
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3 
#> Converged after 30 iterations.
#> Final Loss Metrics:
#>      loss        r2
#>  616.9888 0.9971891

The two fits use the same observations and the same simplex model. The metric fit differs only in how residuals are judged: directions with larger technical measurement noise count less, and the off-diagonal entries in G_noise account for correlated errors.

	Final loss	R2	Noise-metric archetype error	Composition RMSE
plain	264	0.940	4189.15	5.22
metric	617	0.997	57.24	3.10

The loss values are on different scales because each fit uses its own residual metric. The archetype error and composition RMSE are evaluated against the known generating structure, using the noise precision matrix to match the fitted archetypes to the truth and thus are comparable across fits.

Metric Gaussian AA with known correlated measurement noise.

Functional AA

Functional data use the same metric idea, but the metric is usually not supplied by hand. Use this recipe when analyzing functional data with the fda package.

In functional analysis, a curve is represented as a weighted sum of basis functions, such as sine waves in Fourier analysis. The appropriate feature metric is computed from the inner-product matrix of the basis functions.

The package fda provides a rich set of basis functions and tools for working with functional data. run_aa.fd() works directly with fda::fd objects and applies the basis inner-product metric (computed using fda::eval.penalty()) to the Gaussian AA engine automatically, so users can fit the functional object directly. The resulting archetypes and fitted values are also fd objects so that they can be post-processed with the same tools as the input data.

Below we demonstrate functional AA on synthetic curves with known archetypal structure. The true archetypes are smooth curves with peaks at different times. The data are noisy samples from the convex hull of those archetypes.

N <- 90  # number of curves
t_fd <- seq(0, 1, length.out = 80)  # time points for functional data
A_fd_true <- rbind(
    early_peak = exp(-60 * (t_fd - 0.18)^2),
    late_peak  = exp(-60 * (t_fd - 0.82)^2),
    ramp       = 0.15 + 0.85 * t_fd
)
# Archetype compositions for each sample
S_fd_true <- sample_simplex(N, 3)
X_fd_noise <- matrix(rnorm(N * length(t_fd), sd = 0.025), nrow = N)
X_fd <- S_fd_true %*% A_fd_true + X_fd_noise

# Convert the sampled curves to an fda::fd object
basis_fd <- fda::create.bspline.basis(rangeval = c(0, 1), nbasis = 14)
fd_data <- fda::Data2fd(argvals = t_fd, y = t(X_fd), basisobj = basis_fd)

fit_fd <- run_aa(fd_data, K = 3, sd_threshold = 0)
fit_fd
#> 
#> Call:
#> run_aa(x = fd_data, K = 3, sd_threshold = 0)
#> 
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3 
#> Converged after 45 iterations.
#> Final Loss Metrics:
#>        loss        r2
#>  0.01743686 0.9988108

The resulting archetypes are functional objects, and the fitted values are curves in the same space. The plot below shows the fitted archetypes overlaid on a sample of the data curves.


A_fit_fd <- coordinates(fit_fd)
class(A_fit_fd)  # fitted archetype curves
#> [1] "fd"

sample_ix <- seq(1, nrow(X_fd), length.out = 25) # sample curves for plotting

# Plot sampled curves, then overlay archetype curves
matplot(
    t_fd,
    t(X_fd[sample_ix, ]),
    type = "l", lty = 1, col = pal_data,
    xlab = "t", ylab = "Value", main = "Functional AA"
)
invisible( # plot.fd prints a "done" message that we want to suppress
    plot(A_fit_fd, lty = 1, lwd = 2, col = pal_aa[1:3], add = TRUE)
)
legend(
    "top",
    legend = anames(fit_fd),
    lty = 1, lwd = 2, col = pal_aa[1:3],
    bty = "n", horiz = TRUE
)

Functional Gaussian AA fitted directly to an fda::fd object.

Kernel AA

Kernel AA is the nonlinear analog of changing the inner product. Instead of choosing an explicit feature metric (\(G\)), it works with a \(N \times N\) Gram matrix \(K_{ij} = k(x_i, x_j)\) and fits archetypes in the kernel-induced feature space (Mørup and Hansen 2012). With the linear kernel, this recovers the ordinary Euclidean geometry. With nonlinear kernels, the convex hull is formed after the implicit feature map.

Use this recipe when the relevant extremes are defined by a nonlinear geometry rather than by the convex hull in the original feature space. The example below has a dense core and an outer ring. Linear AA sees the input space and places proxy archetypes around broad outer directions. A Laplace kernel changes the geometry so local neighborhoods are emphasized.

# Choose the sample sizes
n_outer <- 120
n_inner <- 80

# Simulate an outer ring
theta_outer <- runif(n_outer, 0, 2 * pi)
r_outer <- 1 + 0.12 * runif(n_outer)
X_outer <- cbind(r_outer * cos(theta_outer), r_outer * sin(theta_outer))

# Simulate an inner core
theta_inner <- runif(n_inner, 0, 2 * pi)
r_inner <- 0.15 * runif(n_inner)
X_inner <- cbind(r_inner * cos(theta_inner), r_inner * sin(theta_inner))

# Combine the two groups
X_ring <- rbind(X_outer, X_inner)
colnames(X_ring) <- c("x", "y")

set.seed(42)
fit_linear <- run_aa(
    X_ring,
    K = 7,
    scale = FALSE,
    init = "random",
    nrep = 20
)
fit_linear
#> 
#> Call:
#> run_aa(x = X_ring, K = 7, init = "random", scale = FALSE, nrep = 20)
#> 
#> PGD Archetypal Analysis:
#> Number of Archetypes: 7 
#> DID NOT CONVERGE after 100 iterations.
#> Final Loss Metrics:
#>       loss        r2
#>  0.5960806 0.9956172

set.seed(42)
fit_kernel <- run_aa(
    X_ring,
    K = 7,
    method = "kernel",
    kernel = "laplace",
    kernel_args = list(sigma = 0.35),
    init = "random",
    nrep = 20
)
fit_kernel
#> 
#> Call:
#> run_aa(x = X_ring, K = 7, method = "kernel", init = "random", 
#>     nrep = 20, kernel = "laplace", kernel_args = list(sigma = 0.35))
#> 
#> Kernel Archetypal Analysis (laplace kernel):
#> Number of Archetypes: 7 
#> Number of Samples: 200 
#> Input-space coordinate proxy: available
#> Converged after 33 iterations.
#> Final Loss Metrics:
#>      loss        r2
#>  91.84126 0.5407937

The two loss metrics are not directly comparable because they are computed in different spaces. Even the r2 values are not directly comparable because they are computed relative to different null models. There is no simple scalar diagnostic that decides whether to use kernel AA or linear AA. Use domain knowledge and visual inspection of the fitted profiles.

plot(
    X_ring, frame.plot = FALSE, axes = FALSE,
    col = pal_data, asp = 1, pch = 19, cex = 0.45,
    main = "Linear AA vs kernel-AA proxies", xlab = "", ylab = ""
)
points(coordinates(fit_linear), pch = 4, cex = 1.25, lwd = 1.5, col = pal_aa[1])
points(coordinates(fit_kernel), pch = 1, cex = 1.45, lwd = 1.6, col = pal_aa[2])
legend(
    "topleft",
    legend = c("Data", "Linear AA", "Kernel-AA"),
    pch = c(19, 4, 1),
    col = c(pal_data, pal_aa[1:2]),
    pt.cex = c(0.7, 1.2, 1.4),
    bty = "n"
)

Linear AA archetypes and kernel-AA input-space proxy archetypes.

For nonlinear kernels, coordinates are input-space proxy coordinates computed as coefficients %*% data. They are useful for plotting, but they are not exact preimages of the Hilbert-space archetypes. This distinction also explains why fitted() and out-of-sample predict() for type = "reconstruction" are not defined for kernel_archetypes. predict() for type = "compositions" on the other hand is defined and returns the new composition weights using cross-Gram evaluation (new samples vs every training sample).

In this particular example, the kernel-AA considers the inner core and outer ring as two separate archetypal directions, while linear AA places archetypes around the outer ring and considers the inner core as a uniformly mixed profile.

The kernel choice and bandwidth are substantive modeling decisions. Very local kernels can make many points effectively extreme in feature space, a phenomenon also discussed in kernel frame methods (Mair et al. 2018). In practice, compare a small grid of kernel parameters and inspect whether the resulting compositions are stable.

Probabilistic AA

Use PAA when the observed entries are generated from a distribution such as binary responses, counts, or category counts. PAA changes the objective rather than the inner product. It keeps the simplex structure, but archetypes are profiles in the parameter space of an observation family (Seth and Eugster 2016). In yaap, the supported families are:

`family =`	Input data	Parameter space
`"gaussian"`	Real-valued numeric matrix	Mean profiles on the data scale
`"binomial"`	Binary 0/1 response matrix	Response probabilities
`"poisson"`	Non-negative count matrix	Poisson rates, or expected counts
`"multinomial"`	Rows of category counts	Category probabilities that sum to one

PAA’s main caveat is interpretive: the convex hull lives in parameter space. For non-Gaussian families, plot(fit, what = "coordinates") is intentionally unavailable because the observed data and fitted archetypes are not in the same space. PAA also does not currently support sample weights, robust fitting, or missing-data fitting. For Gaussian data, PAA is equivalent to ordinary AA but less versatile and slower. Prefer the standard Gaussian fitters unless you specifically need the PAA interface.

Recipe: binary responses

The following example represents observations as answers to six yes/no items. The archetypes are probability profiles, not binary observations. For your own data, use an N x M matrix of 0/1 responses.

N <- 120
P_true <- rbind(
    broad_yes = c(0.85, 0.80, 0.75, 0.20, 0.15, 0.10),
    middle    = c(0.25, 0.75, 0.35, 0.70, 0.35, 0.65),
    broad_no  = c(0.10, 0.15, 0.20, 0.80, 0.85, 0.75)
)
K <- nrow(P_true)
M <- ncol(P_true)
S_bin <- sample_simplex(N, K)
P_bin <- S_bin %*% P_true
X_bin <- matrix(rbinom(N * M, size = 1, prob = as.vector(P_bin)), nrow = N)
colnames(X_bin) <- paste0("item_", seq_len(M))
head(X_bin)
#>      item_1 item_2 item_3 item_4 item_5 item_6
#> [1,]      0      1      0      1      0      0
#> [2,]      1      0      1      1      1      1
#> [3,]      0      0      1      0      0      0
#> [4,]      0      0      0      1      0      0
#> [5,]      0      1      0      1      0      1
#> [6,]      1      0      0      0      0      0

fit_binomial <- run_aa(
    X_bin,
    K = K,
    method = "paa",
    family = "binomial"
)

fit_binomial
#> 
#> Call:
#> run_aa(x = X_bin, K = K, method = "paa", family = "binomial")
#> 
#> PAA Archetypal Analysis (binomial family):
#> Number of Archetypes: 3 
#> Converged after 39 iterations.
#> Final Loss Metrics:
#>      loss        r2
#>  322.2442 0.2965358

Binomial PAA archetype response probabilities.

Because the archetypes are in parameter space, the fitted values are family parameters. The compositions are still convex weights over archetypes, but the reconstruction is not always a convex combination of observed data. Instead, it is a convex combination of archetype profiles in parameter space.

The predict() method for PAA supports both types of prediction. However, type = "reconstruction" returns fitted values in parameter space, not observed data in the original sample space.

S_new <- predict(fit_binomial, X_bin[1:5, ], type = "compositions")
round(S_new, 3)
#>         A1    A2    A3
#> [1,] 0.546 0.389 0.064
#> [2,] 0.696 0.000 0.304
#> [3,] 0.494 0.000 0.506
#> [4,] 1.000 0.000 0.000
#> [5,] 0.316 0.684 0.000

P_new <- predict(fit_binomial, X_bin[1:5, ], type = "reconstruction")
round(P_new, 2)
#>      item_1 item_2 item_3 item_4 item_5 item_6
#> [1,]   0.06   0.45   0.26   0.55   0.27   0.71
#> [2,]   0.30   0.30   0.55   0.70   0.42   0.41
#> [3,]   0.51   0.51   0.68   0.50   0.41   0.29
#> [4,]   0.00   0.00   0.36   1.00   0.45   0.59
#> [5,]   0.00   0.68   0.11   0.32   0.14   0.87

Recipe: multinomial counts

Next, we fit a multinomial example. The data are counts of five terms in three document types:

“visual” documents have high counts of the first two terms and low counts of the last three.
“text” documents have the opposite pattern.
“mixed” documents have moderate counts of all terms.

The archetypes are term-probability profiles, not observed count vectors. The fitted and predicted values are probabilities whose rows sum to one. If expected counts are needed, multiply those probabilities by the document totals.

# Define term probabilities for three document types
Topic_true <- rbind(
    visual = c(0.55, 0.25, 0.10, 0.05, 0.05),
    text   = c(0.05, 0.10, 0.55, 0.25, 0.05),
    mixed  = c(0.15, 0.25, 0.15, 0.20, 0.25)
)
# Basic dimensions
K <- nrow(Topic_true)
M <- ncol(Topic_true)
N <- 80
# Archetype compositions
S_txt <- sample_simplex(N, K)
Theta <- S_txt %*% Topic_true

totals <- 39 + sample(40, N, replace = TRUE) # total term counts per document
X_txt <- matrix(0L, nrow = N, ncol = M)     # observed term counts
for (i in seq_len(N)) {
    X_txt[i, ] <- as.integer(rmultinom(1, totals[i], Theta[i, ]))
}
colnames(X_txt) <- paste0("term_", seq_len(ncol(X_txt)))

fit_multi <- run_aa(
    X_txt,
    K = K,
    method = "paa",
    family = "multinomial",
    tol = 1e-6
)
fit_multi
#> 
#> Call:
#> run_aa(x = X_txt, K = K, method = "paa", family = "multinomial", 
#>     tol = 1e-06)
#> 
#> PAA Archetypal Analysis (multinomial family):
#> Number of Archetypes: 3 
#> Converged after 45 iterations.
#> Final Loss Metrics:
#>      loss         r2
#>  6856.474 0.01086952

P_hat <- fitted(fit_multi)
head(round(P_hat, 2), 3)
#>      term_1 term_2 term_3 term_4 term_5
#> [1,]   0.19   0.20   0.29   0.18   0.14
#> [2,]   0.25   0.22   0.23   0.16   0.14
#> [3,]   0.44   0.23   0.19   0.09   0.04
rowSums(P_hat[1:3, ])
#> [1] 1 1 1

expected_counts <- rowSums(X_txt) * P_hat
head(round(expected_counts, 1), 3)
#>      term_1 term_2 term_3 term_4 term_5
#> [1,]   13.3   13.9   20.3   12.7    9.8
#> [2,]   10.1    8.7    9.1    6.4    5.6
#> [3,]   27.9   14.7   11.8    6.0    2.7

Directional AA

Use directional AA when row direction matters more than magnitude, especially when signs can be arbitrary. The implementation uses a Watson-style loss that is insensitive to polarity after hemisphere handling (Olsen et al. 2022).

The recipe below draws points from a spherical triangle, then flips some rows by multiplying them by -1. Euclidean AA treats the flipped rows as opposite points. Directional AA can model them as the same direction.

A_dir_true <- rbind(
    c(1, 0, 0),
    c(0, 1, 0),
    c(0, 0.15, 1)
)
K <- nrow(A_dir_true)
N <- 120
S_dir <- sample_simplex(N, K)
X_dir <- unit_rows(S_dir %*% A_dir_true)
flip <- sample(c(-1, 1), nrow(X_dir), replace = TRUE)
X_dir_flip <- X_dir * flip
colnames(X_dir_flip) <- c("x", "y", "z")

set.seed(42)
fit_euclidean_dir <- run_aa(
    X_dir_flip,
    K = K,
    sd_threshold = 0,
    init = "random",
    nrep = 20
)
fit_euclidean_dir
#> 
#> Call:
#> run_aa(x = X_dir_flip, K = K, init = "random", sd_threshold = 0, 
#>     nrep = 20)
#> 
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3 
#> Converged after 62 iterations.
#> Final Loss Metrics:
#>      loss        r2
#>  19.46239 0.8378134

set.seed(42)
fit_directional <- run_aa(
    X_dir_flip,
    K = K,
    method = "directional",
    sd_threshold = 0,
    init = "random",
    nrep = 20
)
fit_directional
#> 
#> Call:
#> run_aa(x = X_dir_flip, K = K, method = "directional", init = "random", 
#>     sd_threshold = 0, nrep = 20)
#> 
#> Directional Archetypal Analysis:
#> Number of Archetypes: 3 
#> Converged after 50 iterations.
#> Final Loss Metrics:
#>        loss        r2
#>  0.05942967 0.9995048

round(fit_directional[["directions"]], 3)
#>          x     y     z
#> [1,] 0.028 0.200 0.979
#> [2,] 0.992 0.111 0.053
#> [3,] 0.057 0.997 0.046
round(rowSums(fitted(fit_directional)^2), 6)[1:6]
#> [1] 1 1 1 1 1 1

Orthographic view of polarity-flipped spherical data.

Directional AA returns directional_archetypes, which extends the ordinary archetypes class. fitted() and out-of-sample predict() return row-normalized reconstructions aligned to the original row polarity when the training data are available. The directional loss is not comparable to Euclidean SSE, so compare directional fits with directional diagnostics, not with ordinary residual sums of squares.

Current limitations are stricter than for Gaussian AA: input must be a dense numeric matrix with no zero rows; missing data, robust fitting, and custom scale values are not supported.