Classical archetypal analysis (AA) represents each observation as a
convex mixture of a small number of extreme profiles (Cutler and Breiman 1994). In yaap,
the main objects keep the same interpretation across most variants:
coefficients describe each archetype as a mixture of
training observations.compositions describe each observation as a mixture of
archetype profiles.coordinates are the fitted archetype profiles, when an
input-space profile is available.What changes is the notion of closeness. For Gaussian models, this is usually an inner product that defines lengths and angles in feature space. For non-Gaussian or directional models, it is a likelihood or angular objective.
(Seth and Eugster 2016; Olsen et al. 2022).
| Variant | Main function | What changes | Typical data |
|---|---|---|---|
| Metric Gaussian AA | run_aa(..., scale = G) |
Feature inner product | Curves, spectra, correlated features |
| Kernel AA | run_aa(..., method = "kernel") |
Implicit feature-space inner product | Nonlinear geometry |
| Probabilistic AA | run_aa(..., method = "paa", family = ...) |
Likelihood/objective | Binary, count, compositional observations |
| Directional AA | run_aa(..., method = "directional") |
Angular objective | Unit directions, polarity-invariant signals |
This vignette is organized as a set of recipes. Each section starts with the modeling situation where the variant is useful, then shows the smallest code pattern needed to fit the model and interpret the result. The examples are small and synthetic so they can be rerun quickly.
Use this recipe when the rows are still numeric profiles, but ordinary Euclidean distance is the wrong residual scale. Common cases include calibrated measurement error, correlated features, spectra, and functional curves.
In yaap, the standard Gaussian objective uses the
following matrices:
The fitted objective is a weighted and metric-adjusted sum of squared errors:
\[\mathcal{L}(S,A; W, G) = \mathrm{tr}\{W E G E^\top\}\]
Here, \(W\) is used to adjust the contribution of each sample, and \(G\) defines how feature-space residuals are measured. If \(G\) is diagonal, its entries say how strongly each feature contributes to squared error. \(G\)’s off-diagonal entries add cross-terms to the quadratic error, so residuals in nearby or related features can be judged together.
The weights argument sets \(W\); when it is omitted, \(W\) is the identity matrix. The
scale argument sets \(G\).
Conceptually, every non-kernel Gaussian use of scale is a
choice of this feature metric:
scale = FALSE uses the identity metricscale = TRUE uses the feature-wise sample variances as
a diagonal metricA feature metric is useful when technical measurement error is known from instrument calibration or prior domain knowledge. In the recipe below, the true archetypal structure is a triangle in the first two measured features. The last two features are nuisance measurements with large correlated technical noise. For simplicity, we do not add any other irreducible noise. Plain Euclidean AA treats all feature directions as having the same reliability. Metric Gaussian AA uses the technical-noise precision matrix, so residuals are judged relative to measurement reliability.
It is important that this metric describes measurement noise, not the
total covariance of X_metric. The total covariance of
X_metric contains the archetypal signal we want to model.
Using its inverse would also downweight meaningful variation between
archetypes.
# Define the signal archetypes.
A_signal_true <- rbind(
left = c(-1, 0, 0, 0),
right = c(1, 0, 0, 0),
top = c(0, sqrt(3), 0, 0)
)
# Basic dimensions
K <- nrow(A_signal_true)
M <- ncol(A_signal_true)
N <- 220 # samples
# Archetype compositions for each sample
S_metric_true <- sample_simplex(N, K)
X_metric_mean <- S_metric_true %*% A_signal_true
# Build block-diagonal measurement-noise covariance
Sigma_noise <- matrix(0, nrow = M, ncol = M)
# Signal features: small, moderately correlated noise
Sigma_noise[1:2, 1:2] <- 0.03^2 * matrix(c(1, -0.5, -0.5, 1), nrow = 2)
# Nuisance features: large, weakly correlated noise
Sigma_noise[3:4, 3:4] <- 3^2 * matrix(c(1, 0.2, 0.2, 1), nrow = 2)
# Precision matrix used as metric for AA
G_noise <- solve(Sigma_noise)
# Correlated anisotropic noise
X_noise <- matrix(rnorm(N * M), ncol = M) %*% chol(Sigma_noise)
X_metric <- X_metric_mean + X_noise
colnames(X_metric) <- c("feature_1", "feature_2", "noise_1", "noise_2")set.seed(42)
fit_plain <- run_aa(
X_metric,
K = K,
scale = FALSE,
sd_threshold = 0,
init = "random",
nrep = 20
)
fit_plain
#>
#> Call:
#> run_aa(x = X_metric, K = K, init = "random", scale = FALSE, sd_threshold = 0,
#> nrep = 20)
#>
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3
#> Converged after 50 iterations.
#> Final Loss Metrics:
#> loss r2
#> 263.5798 0.9399025
set.seed(42) # same starts
fit_metric <- run_aa(
X_metric,
K = K,
scale = G_noise,
sd_threshold = 0,
init = "random",
nrep = 20
)
fit_metric
#>
#> Call:
#> run_aa(x = X_metric, K = K, init = "random", scale = G_noise,
#> sd_threshold = 0, nrep = 20)
#>
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3
#> Converged after 30 iterations.
#> Final Loss Metrics:
#> loss r2
#> 616.9888 0.9971891The two fits use the same observations and the same simplex model.
The metric fit differs only in how residuals are judged: directions with
larger technical measurement noise count less, and the off-diagonal
entries in G_noise account for correlated errors.
| Final loss | R2 | Noise-metric archetype error | Composition RMSE | |
|---|---|---|---|---|
| plain | 264 | 0.940 | 4189.15 | 5.22 |
| metric | 617 | 0.997 | 57.24 | 3.10 |
The loss values are on different scales because each fit uses its own residual metric. The archetype error and composition RMSE are evaluated against the known generating structure, using the noise precision matrix to match the fitted archetypes to the truth and thus are comparable across fits.
Metric Gaussian AA with known correlated measurement noise.
Functional data use the same metric idea, but the metric is usually
not supplied by hand. Use this recipe when analyzing functional data
with the fda package.
In functional analysis, a curve is represented as a weighted sum of basis functions, such as sine waves in Fourier analysis. The appropriate feature metric is computed from the inner-product matrix of the basis functions.
The package fda provides a rich set of basis functions
and tools for working with functional data. run_aa.fd()
works directly with fda::fd objects and applies the basis
inner-product metric (computed using fda::eval.penalty())
to the Gaussian AA engine automatically, so users can fit the functional
object directly. The resulting archetypes and fitted values are also
fd objects so that they can be post-processed with the same
tools as the input data.
Below we demonstrate functional AA on synthetic curves with known archetypal structure. The true archetypes are smooth curves with peaks at different times. The data are noisy samples from the convex hull of those archetypes.
N <- 90 # number of curves
t_fd <- seq(0, 1, length.out = 80) # time points for functional data
A_fd_true <- rbind(
early_peak = exp(-60 * (t_fd - 0.18)^2),
late_peak = exp(-60 * (t_fd - 0.82)^2),
ramp = 0.15 + 0.85 * t_fd
)
# Archetype compositions for each sample
S_fd_true <- sample_simplex(N, 3)
X_fd_noise <- matrix(rnorm(N * length(t_fd), sd = 0.025), nrow = N)
X_fd <- S_fd_true %*% A_fd_true + X_fd_noise
# Convert the sampled curves to an fda::fd object
basis_fd <- fda::create.bspline.basis(rangeval = c(0, 1), nbasis = 14)
fd_data <- fda::Data2fd(argvals = t_fd, y = t(X_fd), basisobj = basis_fd)
fit_fd <- run_aa(fd_data, K = 3, sd_threshold = 0)
fit_fd
#>
#> Call:
#> run_aa(x = fd_data, K = 3, sd_threshold = 0)
#>
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3
#> Converged after 45 iterations.
#> Final Loss Metrics:
#> loss r2
#> 0.01743686 0.9988108The resulting archetypes are functional objects, and the fitted values are curves in the same space. The plot below shows the fitted archetypes overlaid on a sample of the data curves.
A_fit_fd <- coordinates(fit_fd)
class(A_fit_fd) # fitted archetype curves
#> [1] "fd"
sample_ix <- seq(1, nrow(X_fd), length.out = 25) # sample curves for plotting
# Plot sampled curves, then overlay archetype curves
matplot(
t_fd,
t(X_fd[sample_ix, ]),
type = "l", lty = 1, col = pal_data,
xlab = "t", ylab = "Value", main = "Functional AA"
)
invisible( # plot.fd prints a "done" message that we want to suppress
plot(A_fit_fd, lty = 1, lwd = 2, col = pal_aa[1:3], add = TRUE)
)
legend(
"top",
legend = anames(fit_fd),
lty = 1, lwd = 2, col = pal_aa[1:3],
bty = "n", horiz = TRUE
)Functional Gaussian AA fitted directly to an fda::fd object.
Kernel AA is the nonlinear analog of changing the inner product. Instead of choosing an explicit feature metric (\(G\)), it works with a \(N \times N\) Gram matrix \(K_{ij} = k(x_i, x_j)\) and fits archetypes in the kernel-induced feature space (Mørup and Hansen 2012). With the linear kernel, this recovers the ordinary Euclidean geometry. With nonlinear kernels, the convex hull is formed after the implicit feature map.
Use this recipe when the relevant extremes are defined by a nonlinear geometry rather than by the convex hull in the original feature space. The example below has a dense core and an outer ring. Linear AA sees the input space and places proxy archetypes around broad outer directions. A Laplace kernel changes the geometry so local neighborhoods are emphasized.
# Choose the sample sizes
n_outer <- 120
n_inner <- 80
# Simulate an outer ring
theta_outer <- runif(n_outer, 0, 2 * pi)
r_outer <- 1 + 0.12 * runif(n_outer)
X_outer <- cbind(r_outer * cos(theta_outer), r_outer * sin(theta_outer))
# Simulate an inner core
theta_inner <- runif(n_inner, 0, 2 * pi)
r_inner <- 0.15 * runif(n_inner)
X_inner <- cbind(r_inner * cos(theta_inner), r_inner * sin(theta_inner))
# Combine the two groups
X_ring <- rbind(X_outer, X_inner)
colnames(X_ring) <- c("x", "y")set.seed(42)
fit_linear <- run_aa(
X_ring,
K = 7,
scale = FALSE,
init = "random",
nrep = 20
)
fit_linear
#>
#> Call:
#> run_aa(x = X_ring, K = 7, init = "random", scale = FALSE, nrep = 20)
#>
#> PGD Archetypal Analysis:
#> Number of Archetypes: 7
#> DID NOT CONVERGE after 100 iterations.
#> Final Loss Metrics:
#> loss r2
#> 0.5960806 0.9956172
set.seed(42)
fit_kernel <- run_aa(
X_ring,
K = 7,
method = "kernel",
kernel = "laplace",
kernel_args = list(sigma = 0.35),
init = "random",
nrep = 20
)
fit_kernel
#>
#> Call:
#> run_aa(x = X_ring, K = 7, method = "kernel", init = "random",
#> nrep = 20, kernel = "laplace", kernel_args = list(sigma = 0.35))
#>
#> Kernel Archetypal Analysis (laplace kernel):
#> Number of Archetypes: 7
#> Number of Samples: 200
#> Input-space coordinate proxy: available
#> Converged after 33 iterations.
#> Final Loss Metrics:
#> loss r2
#> 91.84126 0.5407937The two loss metrics are not directly comparable because they are
computed in different spaces. Even the r2 values are not
directly comparable because they are computed relative to different null
models. There is no simple scalar diagnostic that decides whether to use
kernel AA or linear AA. Use domain knowledge and visual inspection of
the fitted profiles.
plot(
X_ring, frame.plot = FALSE, axes = FALSE,
col = pal_data, asp = 1, pch = 19, cex = 0.45,
main = "Linear AA vs kernel-AA proxies", xlab = "", ylab = ""
)
points(coordinates(fit_linear), pch = 4, cex = 1.25, lwd = 1.5, col = pal_aa[1])
points(coordinates(fit_kernel), pch = 1, cex = 1.45, lwd = 1.6, col = pal_aa[2])
legend(
"topleft",
legend = c("Data", "Linear AA", "Kernel-AA"),
pch = c(19, 4, 1),
col = c(pal_data, pal_aa[1:2]),
pt.cex = c(0.7, 1.2, 1.4),
bty = "n"
)Linear AA archetypes and kernel-AA input-space proxy archetypes.
For nonlinear kernels, coordinates are input-space proxy
coordinates computed as coefficients %*% data. They are
useful for plotting, but they are not exact preimages of the
Hilbert-space archetypes. This distinction also explains why
fitted() and out-of-sample predict() for
type = "reconstruction" are not defined for
kernel_archetypes. predict() for
type = "compositions" on the other hand is defined and
returns the new composition weights using cross-Gram evaluation (new
samples vs every training sample).
In this particular example, the kernel-AA considers the inner core and outer ring as two separate archetypal directions, while linear AA places archetypes around the outer ring and considers the inner core as a uniformly mixed profile.
The kernel choice and bandwidth are substantive modeling decisions. Very local kernels can make many points effectively extreme in feature space, a phenomenon also discussed in kernel frame methods (Mair et al. 2018). In practice, compare a small grid of kernel parameters and inspect whether the resulting compositions are stable.
Use PAA when the observed entries are generated from a distribution
such as binary responses, counts, or category counts. PAA changes the
objective rather than the inner product. It keeps the simplex structure,
but archetypes are profiles in the parameter space of an observation
family (Seth and Eugster 2016). In
yaap, the supported families are:
family = |
Input data | Parameter space |
|---|---|---|
"gaussian" |
Real-valued numeric matrix | Mean profiles on the data scale |
"binomial" |
Binary 0/1 response matrix | Response probabilities |
"poisson" |
Non-negative count matrix | Poisson rates, or expected counts |
"multinomial" |
Rows of category counts | Category probabilities that sum to one |
PAA’s main caveat is interpretive: the convex hull lives in parameter
space. For non-Gaussian families,
plot(fit, what = "coordinates") is intentionally
unavailable because the observed data and fitted archetypes are not in
the same space. PAA also does not currently support sample
weights, robust fitting, or missing-data fitting. For
Gaussian data, PAA is equivalent to ordinary AA but less versatile and
slower. Prefer the standard Gaussian fitters unless you specifically
need the PAA interface.
The following example represents observations as answers to six
yes/no items. The archetypes are probability profiles, not binary
observations. For your own data, use an N x M matrix of 0/1
responses.
N <- 120
P_true <- rbind(
broad_yes = c(0.85, 0.80, 0.75, 0.20, 0.15, 0.10),
middle = c(0.25, 0.75, 0.35, 0.70, 0.35, 0.65),
broad_no = c(0.10, 0.15, 0.20, 0.80, 0.85, 0.75)
)
K <- nrow(P_true)
M <- ncol(P_true)
S_bin <- sample_simplex(N, K)
P_bin <- S_bin %*% P_true
X_bin <- matrix(rbinom(N * M, size = 1, prob = as.vector(P_bin)), nrow = N)
colnames(X_bin) <- paste0("item_", seq_len(M))
head(X_bin)
#> item_1 item_2 item_3 item_4 item_5 item_6
#> [1,] 0 1 0 1 0 0
#> [2,] 1 0 1 1 1 1
#> [3,] 0 0 1 0 0 0
#> [4,] 0 0 0 1 0 0
#> [5,] 0 1 0 1 0 1
#> [6,] 1 0 0 0 0 0fit_binomial <- run_aa(
X_bin,
K = K,
method = "paa",
family = "binomial"
)
fit_binomial
#>
#> Call:
#> run_aa(x = X_bin, K = K, method = "paa", family = "binomial")
#>
#> PAA Archetypal Analysis (binomial family):
#> Number of Archetypes: 3
#> Converged after 39 iterations.
#> Final Loss Metrics:
#> loss r2
#> 322.2442 0.2965358Binomial PAA archetype response probabilities.
Because the archetypes are in parameter space, the fitted values are family parameters. The compositions are still convex weights over archetypes, but the reconstruction is not always a convex combination of observed data. Instead, it is a convex combination of archetype profiles in parameter space.
The predict() method for PAA supports both types of
prediction. However, type = "reconstruction" returns fitted
values in parameter space, not observed data in the original sample
space.
S_new <- predict(fit_binomial, X_bin[1:5, ], type = "compositions")
round(S_new, 3)
#> A1 A2 A3
#> [1,] 0.546 0.389 0.064
#> [2,] 0.696 0.000 0.304
#> [3,] 0.494 0.000 0.506
#> [4,] 1.000 0.000 0.000
#> [5,] 0.316 0.684 0.000
P_new <- predict(fit_binomial, X_bin[1:5, ], type = "reconstruction")
round(P_new, 2)
#> item_1 item_2 item_3 item_4 item_5 item_6
#> [1,] 0.06 0.45 0.26 0.55 0.27 0.71
#> [2,] 0.30 0.30 0.55 0.70 0.42 0.41
#> [3,] 0.51 0.51 0.68 0.50 0.41 0.29
#> [4,] 0.00 0.00 0.36 1.00 0.45 0.59
#> [5,] 0.00 0.68 0.11 0.32 0.14 0.87Next, we fit a multinomial example. The data are counts of five terms in three document types:
The archetypes are term-probability profiles, not observed count vectors. The fitted and predicted values are probabilities whose rows sum to one. If expected counts are needed, multiply those probabilities by the document totals.
# Define term probabilities for three document types
Topic_true <- rbind(
visual = c(0.55, 0.25, 0.10, 0.05, 0.05),
text = c(0.05, 0.10, 0.55, 0.25, 0.05),
mixed = c(0.15, 0.25, 0.15, 0.20, 0.25)
)
# Basic dimensions
K <- nrow(Topic_true)
M <- ncol(Topic_true)
N <- 80
# Archetype compositions
S_txt <- sample_simplex(N, K)
Theta <- S_txt %*% Topic_true
totals <- 39 + sample(40, N, replace = TRUE) # total term counts per document
X_txt <- matrix(0L, nrow = N, ncol = M) # observed term counts
for (i in seq_len(N)) {
X_txt[i, ] <- as.integer(rmultinom(1, totals[i], Theta[i, ]))
}
colnames(X_txt) <- paste0("term_", seq_len(ncol(X_txt)))
fit_multi <- run_aa(
X_txt,
K = K,
method = "paa",
family = "multinomial",
tol = 1e-6
)
fit_multi
#>
#> Call:
#> run_aa(x = X_txt, K = K, method = "paa", family = "multinomial",
#> tol = 1e-06)
#>
#> PAA Archetypal Analysis (multinomial family):
#> Number of Archetypes: 3
#> Converged after 45 iterations.
#> Final Loss Metrics:
#> loss r2
#> 6856.474 0.01086952
P_hat <- fitted(fit_multi)
head(round(P_hat, 2), 3)
#> term_1 term_2 term_3 term_4 term_5
#> [1,] 0.19 0.20 0.29 0.18 0.14
#> [2,] 0.25 0.22 0.23 0.16 0.14
#> [3,] 0.44 0.23 0.19 0.09 0.04
rowSums(P_hat[1:3, ])
#> [1] 1 1 1
expected_counts <- rowSums(X_txt) * P_hat
head(round(expected_counts, 1), 3)
#> term_1 term_2 term_3 term_4 term_5
#> [1,] 13.3 13.9 20.3 12.7 9.8
#> [2,] 10.1 8.7 9.1 6.4 5.6
#> [3,] 27.9 14.7 11.8 6.0 2.7Use directional AA when row direction matters more than magnitude, especially when signs can be arbitrary. The implementation uses a Watson-style loss that is insensitive to polarity after hemisphere handling (Olsen et al. 2022).
The recipe below draws points from a spherical triangle, then flips
some rows by multiplying them by -1. Euclidean AA treats
the flipped rows as opposite points. Directional AA can model them as
the same direction.
A_dir_true <- rbind(
c(1, 0, 0),
c(0, 1, 0),
c(0, 0.15, 1)
)
K <- nrow(A_dir_true)
N <- 120
S_dir <- sample_simplex(N, K)
X_dir <- unit_rows(S_dir %*% A_dir_true)
flip <- sample(c(-1, 1), nrow(X_dir), replace = TRUE)
X_dir_flip <- X_dir * flip
colnames(X_dir_flip) <- c("x", "y", "z")set.seed(42)
fit_euclidean_dir <- run_aa(
X_dir_flip,
K = K,
sd_threshold = 0,
init = "random",
nrep = 20
)
fit_euclidean_dir
#>
#> Call:
#> run_aa(x = X_dir_flip, K = K, init = "random", sd_threshold = 0,
#> nrep = 20)
#>
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3
#> Converged after 62 iterations.
#> Final Loss Metrics:
#> loss r2
#> 19.46239 0.8378134
set.seed(42)
fit_directional <- run_aa(
X_dir_flip,
K = K,
method = "directional",
sd_threshold = 0,
init = "random",
nrep = 20
)
fit_directional
#>
#> Call:
#> run_aa(x = X_dir_flip, K = K, method = "directional", init = "random",
#> sd_threshold = 0, nrep = 20)
#>
#> Directional Archetypal Analysis:
#> Number of Archetypes: 3
#> Converged after 50 iterations.
#> Final Loss Metrics:
#> loss r2
#> 0.05942967 0.9995048round(fit_directional[["directions"]], 3)
#> x y z
#> [1,] 0.028 0.200 0.979
#> [2,] 0.992 0.111 0.053
#> [3,] 0.057 0.997 0.046
round(rowSums(fitted(fit_directional)^2), 6)[1:6]
#> [1] 1 1 1 1 1 1Orthographic view of polarity-flipped spherical data.
Directional AA returns directional_archetypes, which
extends the ordinary archetypes class.
fitted() and out-of-sample predict() return
row-normalized reconstructions aligned to the original row polarity when
the training data are available. The directional loss is not comparable
to Euclidean SSE, so compare directional fits with directional
diagnostics, not with ordinary residual sums of squares.
Current limitations are stricter than for Gaussian AA: input must be
a dense numeric matrix with no zero rows; missing data, robust fitting,
and custom scale values are not supported.
Use the simplest model that matches the data-generating assumptions:
scale.Across all variants, use multiple seeds or nrep where
supported. The optimization problems remain non-convex, so stable
conclusions should not rely on a single initialization.