---
title: "Choosing an Initialization Method for Archetypal Analysis"
bibliography: references.bib
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Choosing an Initialization Method for Archetypal Analysis}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.width = 7,
    fig.height = 5.5,
    fig.align = "center",
    warning = FALSE
)
set.seed(42)
par(bty = "n")
library(yaap)
```

> This vignette assumes familiarity with the basic AA workflow covered in the
  **Introduction to Archetypal Analysis with yaap** vignette.

## Quick Start

For a first Euclidean AA fit, start with `"furthest_sum"` or `"kmeans_pp"`.
`"furthest_sum"` is a strong boundary-seeking default for convex data, while
`"kmeans_pp"` is a softer alternative when you want some randomness without
falling back to uniform sampling. If the geometry is unknown and you can afford
several starts, use `"random"` with a larger `nrep`. If you only have a few
starts and runtime is less important, use `"aa_pp"` because each added point is
chosen against the current AA reconstruction error.

```{r quick-recommendation, eval = FALSE}
library(yaap)

toy <- read.csv(system.file("extdata", "toy.csv", package = "yaap"))

fit <- run_aa(toy, K = 3, nrep = 5,  init = "furthest_sum")
fit <- run_aa(toy, K = 3, nrep = 5,  init = "kmeans_pp")
fit <- run_aa(toy, K = 3, nrep = 20, init = "random")
fit <- run_aa(toy, K = 3, nrep = 3,  init = "aa_pp")
```

## Background

Archetypal analysis (AA) minimizes a reconstruction loss, written in the
standard Euclidean formulation as

$$\mathcal{L}(X, SA) = \|X - S A\|_F^2.$$

As with many matrix decomposition problems, the objective is non-convex in
both $S$ and $A$, and gradient descent can terminate at a **local minimum**. As
a result, the quality of the final solution depends substantially on the starting
point.

yaap implements several options for initializing the archetype coordinates,
which have been suggested in the literature, through the `aa_init()` function
and the `init` argument.

## Methods at a Glance

The table below summarizes the seven methods by their strategy, whether they
involve random sampling, their relative speed, how strongly they depend on the
observed sampling density, how strongly they impose geometric assumptions, and
any mandatory extra arguments.

| Method                 | Strategy                                                                           | Stochastic?      | Speed  | Density sensitivity | Geometry sensitivity | Mandatory extras |
| ---------------------- | ---------------------------------------------------------------------------------- | ---------------- | ------ | ------------------- | -------------------- | :--------------: |
| `"random"`             | Sample $K$ rows uniformly at random                                                | Yes              | Fast   | High                | Low                  | —                |
| `"dirichlet"`          | Construct each archetype as a random convex combination of rows                    | Yes              | Fast   | High                | Low                  | —                |
| `"kmeans_pp"`          | Sample proportional to squared distance to the nearest current archetype           | Yes              | Medium | Medium              | Medium               | —                |
| `"aa_pp"`              | Sample using residual reconstruction error from the current hull                   | Yes              | Slow   | Medium              | Medium               | —                |
| `"furthest_first"`     | Select the point furthest from all current archetypes                              | Seed only        | Medium | Low-medium          | High                 | —                |
| `"furthest_sum"`       | Select the point that maximizes the *sum* of distances to all current archetypes   | Seed only        | Medium | Low                 | High                 | —                |
| `"hull_outmost"`       | Select frequently outlying convex-hull candidates                                  | Partitioned only | Medium | Low                 | High                 | `hull_method`    |

Density and geometry are counterbalancing sensitivities. Uniform methods such
as `random` and diffuse `dirichlet` starts are strongly affected by where the
data are numerous: a sparsely sampled corner may need many restarts before it
is selected. Their advantage is that independent restarts genuinely explore
different parts of the empirical distribution, which can help avoid repeatedly
converging to the same local minimum when many fits are affordable.

More biased boundary-seeking methods such as `furthest_sum` are less affected
by interior density because, once a corner is represented by at least one
observed point, the greedy distance criterion can still discover it. The cost of
that bias is geometric sensitivity and lower restart diversity: repeated starts
often return the same, or nearly the same, boundary configuration. These methods
therefore make sense when the assumed geometry is credible or when running many
full AA optimizations is too expensive.

Large-data approximations are controlled orthogonally with `batch_size`,
`batch_type`, and `batch_replace`. By default, `batch_type = "distal"` samples
candidate rows with probability proportional to squared distance from the data
center, preserving the coreset idea of focusing computation on likely boundary
points [@Mair2019]. Use `batch_type = "uniform"` when you want an unbiased
mini-batch approximation, for example with `method = "aa_pp"`.


## The `aa_init()` Function

`aa_init()` runs any initialization method as a standalone step, returning a
named list with two elements:

* **`A`** — $K \times M$ matrix of initial archetype coordinates.
* **`B`** — $K \times N$ row-stochastic matrix expressing each archetype as a
  convex combination of data points ($A = B X$).

```{r aa-init-basic}
toy <- read.csv(system.file("extdata", "toy.csv", package = "yaap"))
X <- as.matrix(toy) # 250 × 2
K <- 3

init <- aa_init(X, K = K, method = "furthest_sum")
str(init)
```

The `A` matrix provides the initial archetype coordinates that will seed the
optimization loop; `B` is the convex-hull certificate for those coordinates. If
`run_aa()` is initialized with `aa_init()` and a method name or a custom function,
the `B` returned is carried into the algorithm initialization. If you instead
pass a pre-computed coordinate matrix such as `init$A`, `run_aa()` checks that
the coordinates still admit a feasible `B`. Rows outside the allowed data hull
are projected back into the hull with a warning.

```{r initialization-helpers, include = FALSE}
draw_simplex <- function(A, border, fill = NULL, lty = 1, lwd = 2,
                         pch = 17, cex = 1.1) {
    if (nrow(A) >= 3) {
        hull <- chull(A)
        polygon(A[hull, , drop = FALSE],
            border = border, lwd = lwd, lty = lty,
            col = fill
        )
    } else if (nrow(A) == 2) {
        lines(A, col = border, lwd = lwd, lty = lty)
    }
    points(A, pch = pch, cex = cex, col = border)
}

selection_deviation_score <- function(counts, K, n, pseudocount = 0.5) {
    total_counts <- sum(counts)
    N_runs <- total_counts / K
    expected <- total_counts / n
    p_null <- K / n

    log_fc <- log2((counts + pseudocount) / (expected + pseudocount))
    p_lower <- pbinom(counts, size = N_runs, prob = p_null)
    p_upper <- pbinom(counts - 1, size = N_runs, prob = p_null,
        lower.tail = FALSE
    )
    p_value <- pmin(1, 2 * pmin(p_lower, p_upper))
    signed_log_p <- sign(counts - expected) *
        -log10(pmax(p_value, .Machine$double.xmin))

    list(log_fc = log_fc, p_value = p_value, signed_log_p = signed_log_p)
}

plot_selected_density <- function(X, counts, main, K, color_max = NULL,
                                  pseudocount = 0.5) {
    score <- selection_deviation_score(counts, K, nrow(X), pseudocount)
    signed_log_p <- score$signed_log_p

    pal <- colorRamp(c("blue", "white", "red"))
    cmax <- color_max
    if (is.null(cmax)) cmax <- max(abs(signed_log_p))
    probs <- if (cmax > 0) {
        pmax(0, pmin(1, (signed_log_p + cmax) / (2 * cmax)))
    } else {
        rep(0.5, length(signed_log_p))
    }
    point_cols <- pal(probs) / 255
    point_cols <- rgb(point_cols[, 1], point_cols[, 2], point_cols[, 3])
    ix <- order(abs(signed_log_p))
    plot(
        X[ix, ],
        pch = 21, cex = 1, col = "gray50", bg = point_cols[ix],
        main = main, xlab = "", ylab = "", axes = FALSE, asp = 1
    )
}

plot_data_with_points <- function(X, A, main, col = "grey75", view = c(1, 2),
                                  pch = 4, point_col = "black") {
    plot(
        X[, view, drop = FALSE],
        pch = 16, cex = 0.45, col = col,
        main = main, xlab = "", ylab = "", axes = FALSE, asp = 1
    )
    points(A[, view, drop = FALSE],
        pch = pch, cex = 1.25, lwd = 1.6,
        col = point_col
    )
}

make_concentric_circles <- function(n = 180, noise = 0.03) {
    n_inner <- n %/% 2
    n_outer <- n - n_inner
    theta   <- c(runif(n_inner, 0, 2 * pi),
                 runif(n_outer, 0, 2 * pi))
    radius  <- c(rep(0.4, n_inner),
                 rep(1.0, n_outer))
    X <- cbind(radius * cos(theta),
               radius * sin(theta))
    X <- X + rnorm(length(X), sd = noise)
    colnames(X) <- c("x", "y")
    list(
        X = X,
        view = c(1, 2),
        col = c(rep("#0072B2", n_inner), rep("#D55E00", n_outer)),
        main = "Concentric circles",
        kernel = "laplace",
        kernel_args = list(sigma = 0.25)
    )
}

make_swiss_roll <- function(n = 180, noise = 0.025) {
    t <- runif(n, 1.5 * pi, 4.5 * pi)
    h <- runif(n, -1, 1)
    X <- cbind(t * cos(t) / 7, h, t * sin(t) / 7)
    X <- X + rnorm(length(X), sd = noise)
    colnames(X) <- c("x", "height", "z")
    roll_cols <- colorRampPalette(c("#0072B2", "#009E73", "#D55E00"))(100)
    col_id <- pmax(1, pmin(100, ceiling(99 * (t - min(t)) / diff(range(t))) + 1))
    list(
        X = X,
        view = c(1, 3),
        col = roll_cols[col_id],
        main = "Swiss roll",
        kernel = "rbf",
        kernel_args = list(sigma = 0.45)
    )
}
```

### Supplying Your Own Initialization

The built-in method names cover the common cases, but `run_aa()` also accepts
either a precomputed matrix of archetype coordinates or a custom initialization
function.

If the starting coordinates are already available, pass them directly as a
matrix in the same units as the `X`. `run_aa()` will preprocess it the same way
it preprocesses `X`, handles the required checks, and constructs `B`.

```{r precomputed-init-matrix}
X <- as.matrix(toy)
K <- 3
tol <- 0.01
eps <- 0

A0 <- X[1:K, , drop = FALSE]

fit_matrix <- run_aa(X, K = K, init = A0, scale = FALSE, tol = tol, eps = eps)
```

A custom function is useful when initialization itself needs code. The function
must take `X` and `K`, and return the same two objects as `aa_init()`: the
archetype coordinates `A` and the data weights `B`. Additional arguments can
be passed through `init_args`. Below is a deliberately simple initializer that
chooses the first `K` rows of the data.

```{r custom-init-function}
X <- as.matrix(toy)
K <- 3
tol <- 0.01

first_k_init <- function(X, K, ...) {
    A <- X[1:K, , drop = FALSE]
    B <- diag(1, nrow = K, ncol = nrow(X))
    list(A = A, B = B)
}

fit_custom <- run_aa(X, K = K, init = first_k_init, scale = FALSE, tol = tol)
```

If your custom initializer naturally produces only archetype coordinates,
`fit_simplex()` exposes the same simplex mapping used internally to construct a
feasible `B`. Reconstructing `A_init` as `B %*% X` ensures the coordinates passed
to AA are on the data hull:

```{r custom-init-fit-simplex, eval = FALSE}
X <- as.matrix(toy)
K <- 3

A_user <- some_initializer(X, K = K)
B <- fit_simplex(A_user, X)
A_init <- B %*% X
init <- list(A = A_init, B = B)
```

These two forms are useful when the starting point comes from a previous
analysis, a domain-specific rule, or an external clustering workflow. For most
workflows, passing a method name remains the shortest form.

```{r run-aa-init-short, eval = FALSE}
X <- as.matrix(toy)
K <- 3
tol <- 0.01

fit <- run_aa(X, K = K, init = "furthest_sum", tol = tol)
fit <- run_aa(X, K = K, init = "aa_pp", tol = tol)
fit <- run_aa(X,
    K = K, init = "aa_pp",
    init_args = list(batch_size = 200, batch_type = "uniform"),
    tol = tol
)
```

## Comparing Initializer Biases

```{r init-selection-density}
X <- as.matrix(toy)
K <- 3

method_labels <- c(
    random         = "Random",
    kmeans_pp      = "k-means++",
    aa_pp          = "AA++",
    furthest_first = "Furthest First",
    furthest_sum   = "Furthest Sum",
    hull_outmost   = "Hull-outmost"
)
N_expected <- 10  # Expected selections per point across all runs
N_runs <- nrow(X) * N_expected
# "significance" cutoff for coloring points red/blue
color_cutoff <- -log10(1 / nrow(X))

selected_counts <- list()
for (method in names(method_labels)) {
    counts <- rep(0, nrow(X))
    for (run in 1:N_runs) {
        if (method == "hull_outmost") {
            init_obj <- aa_init(X, K = K, method = method,
                                hull_method = "partitioned")
        } else {
            init_obj <- aa_init(X, K = K, method = method)
        }
        selected <- which(init_obj$B > 0, arr.ind = TRUE)[, 2]
        counts[selected] <- counts[selected] + 1
    }
    selected_counts[[method]] <- counts
}
```

The first practical difference between initializers is not whether they are
"good" or "bad"; it is where they tend to look. The toy data below are simple:
250 two-dimensional points spread near the vertices of a triangle. The figure
below repeats each row-selecting initializer many times and colors each point by
how often it was selected compared with uniform random sampling. Red points are
selected more often than random, blue points are selected less often, and white
points are close to the random baseline.

```{r init-selection-density-plot, fig.width = 8, fig.height = 7, out.width="100%", results='hold', echo=FALSE}

op <- par(mfrow = c(2, 3), mar = c(2, 2, 3, 0.5), bty = "n")
for (method in names(method_labels)) {
    plot_selected_density(
        X,
        selected_counts[[method]],
        main = method_labels[method],
        K = K,
        color_max = color_cutoff
    )
}
par(op)
```

As expected, `random` selects archetypes uniformly from the observed rows, so
the points do not diverge much from their null probability of selection. This
is probably not the best strategy for a convex geometry like the toy data, but
it is unbiased and thus robust to unknown data structure. With enough restarts
this strategy will eventually explore every region, but any single run may miss
the corners.

On the other extreme, methods like `furthest_first`, `furthest_sum`, and
`hull_outmost` are exploring more aggressively the boundary of the data
geometry. We see that most of them converge to the same set of candidate points
and never pick any point in the middle of the cloud. `furthest_first` is the
only one that shows some variation in the selected points, but it still has a
strong corner preference. For a simple convex shape that bias is often useful:
the corners are exactly where good archetypes should start.

Clustering-inspired methods such as `kmeans_pp` and `aa_pp` sit between these
two extremes. These methods adapt ideas developed to initialize clustering
algorithms to the AA setting. They are more probabilistic in nature and thus
represent softer versions of their boundary-searching counterparts. They also
rarely select any point in the center of the cloud, but they explore boundary
regions more diffusely and do not explicitly maximize extremeness.

More details about the methodology employed by each method can be found in the
[Method Reference](#method-reference) section at the end of this vignette.
The rest of the vignette narrows to three representatives: `random` as an
exploratory baseline, `kmeans_pp` as the softer outward-biased initializer, and
`furthest_sum` for the strong greedy boundary bias.

## Increasing the Number of Archetypes

With $K = 3$, the toy data roughly match the three-corner shape. When $K$
increases, the question changes from "did we find the corners?" to "how do the
extra archetypes cover the cloud?" The following chunk is the user-facing call:
only the `method` and `K` values change.

```{r increasing-k}
X <- as.matrix(toy)
k_grid <- c(3, 7, 15)
k_methods <- c("random", "kmeans_pp", "furthest_sum")

k_inits <- list()
for (method in k_methods) {
    method_inits <- list()
    for (K in k_grid) {
        k_lab <- paste("K =", K)
        method_inits[[k_lab]] <- aa_init(X, K = K, method = method)
    }
    k_inits[[method]] <- method_inits
}
```

```{r increasing-k-plot, fig.width = 8.5, fig.height = 6, out.width="100%", results='hold', echo=FALSE}
op <- par(
    mfrow = c(length(k_methods), length(k_grid)),
    mar = c(0.5, 0.5, 2.2, 0.2), oma = c(0, 0, 0, 0), bty = "n"
)
for (method in k_methods) {
    for (k_label in names(k_inits[[method]])) {
        init_obj <- k_inits[[method]][[k_label]]
        plot(
            X,
            pch = 16, cex = 0.42, col = "grey75",
            main = sprintf("%s\n%s", method_labels[method], k_label),
            xlab = "", ylab = "", axes = FALSE, asp = 1
        )
        draw_simplex(
            init_obj$A,
            border = "firebrick3",
            fill = adjustcolor("firebrick3", 0.05),
            cex = 0.95
        )
    }
}
par(op)
```

As $K$ grows, the non-aggressive methods start to waste extra archetypes in
the center of the convex hull. This is visible for `random`, and eventually also
for `kmeans_pp`: once the main corners have been sampled, additional points are
often placed in the interior regions where there are definitely no archetypes
for this triangle. Those starts can still optimize successfully, but the
initialization budget is no longer being used to approximate the boundary.

`furthest_sum` behaves differently. It keeps adding points on the outside of the
cloud and, as $K$ increases, the selected polygon becomes a finer approximation
to the true convex hull boundary. In fact, points generated by `furthest_sum`
have a useful geometric property: they lie in the minimal convex set of the
unselected data points [Theorem 2 in [@Morup2012]].

## When Initialization Matters

The toy triangle is intentionally simple. If we start from one initialization
and then let `run_aa()` optimize, most methods reach essentially the same final
fit. The red dashed triangle below is the starting point; the blue solid
triangle is the fitted solution.

```{r init-before-after-fit}
X <- as.matrix(toy)
K <- 3
focused_methods <- c("random", "kmeans_pp", "furthest_sum")
fits <- list()
for (method in focused_methods) {
    fits[[method]] <- run_aa(
        x        = X,
        K        = K,
        init     = method,
        scale    = FALSE,
        max_iter = 60,
        tol      = 0.01
    )
}
```

```{r init-before-after-plot, echo=FALSE, fig.width = 8.5, fig.height = 6, out.width="100%", results='hold'}
loss_values <- c()
for (method in focused_methods) {
    loss_values <- c(loss_values, fits[[method]]$loss$loss)
}
loss_ylim <- range(loss_values)
op <- par(mfrow = c(2, 3), bty = "n")
par(mar = c(0.8, 0.6, 3, 0.5))
for (method in focused_methods) {
    fit <- fits[[method]]
    plot(fit,
        what = "coordinates",
        show_anames = FALSE,
        main = sprintf("%s\n%d steps", method_labels[method], nrow(fit$loss) - 1),
        args.data.scatter = list(pch = 16, cex = 0.45, col = "grey75"),
        pch = 19,
        col = "steelblue4",
        axes = FALSE,
        asp = 1
    )
    draw_simplex(
        fit$init,
        border = adjustcolor("firebrick3", 0.75),
        fill = adjustcolor("firebrick3", 0.05),
        lty = 2
    )
}
par(mar = c(2.8, 4.2, 3, 0.5))
for (method in focused_methods) {
    plot(fits[[method]],
        what = "loss",
        lwd = 1.8,
        col = "steelblue4",
        ylim = loss_ylim,
        main = "Loss",
        ylab = "",
        las = 1
    )
}
par(op)
```

This is the reassuring case: the data geometry is simple enough that a mediocre
start can still be corrected by the optimizer. Initialization matters more when
the geometry is less well described by straight-line Euclidean distances.

## Nonlinear Geometries

The previous examples still live in a simple Euclidean geometry: distances in
the plotted coordinates describe the structure we care about. Some datasets are
not like that. In concentric circles, the inner circle is important even though
the outer circle contains the most distant points. In a Swiss roll, points that
look close in a two-dimensional projection may be far apart along the rolled
surface.

```{r nonlinear-init}
K <- 8
nonlinear_data <- list(
    circles    = make_concentric_circles(),
    swiss_roll = make_swiss_roll()
)
nonlinear_methods <- c("random", "kmeans_pp", "furthest_sum")

nonlinear_inits <- list()
for (data_shape in names(nonlinear_data)) {
    data_obj <- nonlinear_data[[data_shape]]
    out <- list()
    for (method in nonlinear_methods) {
        out[[method]] <- aa_init(data_obj$X, K = K, method = method)
    }
    nonlinear_inits[[data_shape]] <- out
}
```

```{r nonlinear-init-plot, fig.width = 8.5, fig.height = 6, out.width="100%", results='hold', echo=FALSE}
op <- par(
    mfrow = c(length(nonlinear_data), length(nonlinear_methods)),
    mar = c(2, 2, 3, 0.5), bty = "n"
)
for (data_shape in names(nonlinear_data)) {
    data_obj <- nonlinear_data[[data_shape]]
    for (method in nonlinear_methods) {
        init_obj <- nonlinear_inits[[data_shape]][[method]]
        plot_data_with_points(
            data_obj$X,
            init_obj$A,
            main = sprintf("%s\n%s", data_obj$main, method_labels[method]),
            col = adjustcolor(data_obj$col, 0.62),
            view = data_obj$view
        )
    }
}
par(op)
```

First inspect the Euclidean initializations. The biased methods are doing what
they were designed to do: they seek far-away points. On these shapes, that can
over-emphasize the outside and leave the center or inner structure
under-represented. `random`, on the other hand, follows the observed sampling
density rather than the outer geometry. On these uniformly sampled shapes, that
keeps it from concentrating only on the outside, but it also means performance
would change if important regions were sampled sparsely.

For these cases, the better answer is often to change the space in which AA is
fit. A kernel method lets nearby points influence each other according to a
smooth similarity rule rather than only through straight-line distance in the
input coordinates.

```{r kernel-user-code, eval=FALSE}
fit_kernel <- archetypes_kernel_pgd(
    x      = X,
    K      = K,
    kernel = "laplace",
    init   = "furthest_sum",
    tol    = 0.01
)
```

```{r nonlinear-kernel-setup, include = FALSE, warning=FALSE}
kernel_methods <- c("furthest_sum", "kmeans_pp")
kernel_inits <- list()
for (data_shape in names(nonlinear_data)) {
    data_obj <- nonlinear_data[[data_shape]]
    out <- list()
    for (method in kernel_methods) {
        kernel_fit <- archetypes_kernel_pgd(
            data_obj$X,
            K = K,
            init = method,
            kernel = data_obj$kernel,
            kernel_args = data_obj$kernel_args,
            max_iter = 1,
            tol = 0.01,
            tol_r2 = 1
        )
        out[[method]] <- list(
            euclidean = nonlinear_inits[[data_shape]][[method]]$A,
            kernel = kernel_fit$init %*% data_obj$X
        )
    }
    kernel_inits[[data_shape]] <- out
}
```

```{r nonlinear-kernel-plot, fig.width = 8.5, fig.height = 6, out.width="100%", results='hold', echo=FALSE}
op <- par(
    mfrow = c(length(nonlinear_data), length(kernel_methods)),
    mar = c(2, 2, 3, 0.5), bty = "n"
)
for (data_shape in names(nonlinear_data)) {
    data_obj <- nonlinear_data[[data_shape]]
    for (method in kernel_methods) {
        init_obj <- kernel_inits[[data_shape]][[method]]
        plot(
            data_obj$X[, data_obj$view, drop = FALSE],
            pch = 16, cex = 0.45, col = adjustcolor(data_obj$col, 0.55),
            main = sprintf("%s\n%s", data_obj$main, method_labels[method]),
            xlab = "", ylab = "", axes = FALSE, asp = 1
        )
        points(init_obj$euclidean[, data_obj$view, drop = FALSE],
            pch = 4, cex = 1.2, lwd = 1.5, col = "black"
        )
        points(init_obj$kernel[, data_obj$view, drop = FALSE],
            pch = 1, cex = 1.25, lwd = 1.6, col = "#D55E00"
        )
    }
}
par(fig = c(0, 1, 0, 1), mar = c(0, 0, 0, 0), new = TRUE)
plot.new()
legend(
    "center",
    legend = c("Euclidean init", "Kernel init"),
    pch = c(4, 1), col = c("black", "#D55E00"),
    pt.cex = c(2.1, 2.2), bty = "n", cex = 1.2
)
par(op)
```

The same initialization names still work. The important difference is that the
kernel version measures spread in the similarity space used by the model. This
can rescue boundary-seeking methods on shapes where the raw coordinates make
the outside look more important than the rest of the structure.

## Conclusions and Recommendations

There is no universal best initializer. The right choice depends on how much
you know about the geometry, whether useful regions are sampled evenly, and how
many full optimization restarts you can afford.

| Situation                                      | Recommended `init`                                                          |
| ---------------------------------------------- | ----------------------------------------------------------------------------|
| Simple convex-ish data                         | `"furthest_sum"`, `"kmeans_pp"`                                             |
| Uneven sampling density, corners represented   | `"furthest_sum"`, `"furthest_first"`, `"hull_outmost"`                      |
| Unknown structure, enough restarts             | `"random"` or `"kmeans_pp"` with a larger `nrep`                            |
| Unknown structure, few restarts                | `"aa_pp"`                                                                   |
| Very large dataset, speed critical             | `"furthest_sum"` with `batch_size` or `"hull_outmost"` partitioned          |
| Reproducing legacy results                     | `"furthest_sum"` (PCHA), `"random"` [archetypes], `"dirichlet"` directional |

Choose along two main axes: how much you trust the geometry, and how much
restart diversity you can afford. Strongly biased initializers such as
`furthest_sum` and `hull_outmost` are useful when the convex geometry is trusted
or full optimization runs are expensive, but repeated starts often return the
same boundary configuration. `random` is the opposite extreme: it makes the
fewest geometric assumptions and explores different basins across restarts, but
it follows the empirical sampling density, so sparse corners may require a
larger `nrep`.

`kmeans_pp` and `aa_pp` are the practical middle ground. They keep stochastic
variation across restarts while biasing the draw toward points that improve
coverage or reconstruction. Use them when you want a better first guess than
`random` without committing as strongly to one boundary-seeking geometry. If an
extreme corner is absent from the sample entirely, no row-selecting initializer
can recover it without changing the model, the data collection, or the candidate
set.

If the geometry is visibly nonlinear, changing the model is usually more
important than fine-tuning the Euclidean initializer. Use
`archetypes_kernel_pgd()` with a suitable kernel, then choose one of the
kernel-compatible initializers such as `"kmeans_pp"` or `"furthest_sum"`.

For memory efficiency and speed, any initializer can be restricted to a
candidate batch with `batch_size`. The default `batch_type = "distal"` samples
candidate rows in proportion to their squared distance from the data center,
which is the coreset idea of @Mair2019: if archetypes are expected near the
boundary, it is wasteful to spend most candidate memory on central points. Use
`batch_type = "uniform"` when the goal is an unbiased mini-batch approximation,
especially for `method = "aa_pp"`.

### Initialization Caveats {#initialization-caveats}

The default `scale = FALSE` in `run_aa()` preserves the data on its supplied
feature scale. Setting `scale = TRUE` applies column-wise standardization
(z-scoring) before fitting; this often improves the conditioning of Gaussian
Euclidean fits, but it interacts differently with each initialization strategy.

@Mair2023 show that FurthestSum is sensitive to preprocessing: standardization
can distort the Euclidean distances it uses to choose boundary points. AA++ is
less affected because it samples from residuals tied directly to the AA objective
[@Mair2023].

The cost of this objective-aware sampling is runtime. Among the built-in
initializers, `"aa_pp"` is usually the slowest because after the first two
points it solves a small AA projection problem at each step, essentially a
greedy AA analysis. In return, each added point is expected not to increase
the reconstruction cost, and the expected cost decreases whenever the new
point expands the current hull.

Initializer names are not supported identically by every model family. The
Euclidean solvers (`method = "pgd"` and `"nnls"`) call `aa_init()` on the
preprocessed data, so they accept the full catalogue of initializer strings
described above, as well as custom initializer functions and coordinate
matrices. PAA uses family-specific parameter profiles and does not support the
Euclidean `scale` argument.

Kernel AA is the main exception. With `method = "kernel"`, initializer strings
are limited to `"random"`, `"dirichlet"`, `"furthest_first"`, `"kmeans_pp"`,
and `"furthest_sum"`. Row-selection methods can be expressed using selected row
indices and pairwise distances from the kernel Gram matrix; `"dirichlet"` is
also available because it constructs the coefficient matrix $B$ directly and
does not require input-space archetype coordinates. Candidate batching is
supported for these kernel initializers, with distal batches computed from
kernel-space center distances. The remaining strings are not accepted as kernel
initializers: `"aa_pp"` requires convex-hull residual projections and
`"hull_outmost"` requires explicit hull candidates. Those operations would need
separate kernel-space implementations. If you need one of those starts for a
kernel fit, pass an explicit `K x n` non-negative coefficient matrix instead;
row indices or row names are also accepted when you want to select observed rows
directly.

The directional solver also accepts the `aa_init()` method
strings, but it defaults to `"dirichlet"` for historical reasons.

## Method Reference {#method-reference}

### `random`

Selects $K$ rows of $X$ uniformly at random without replacement.

```{r uniform-ex, eval = FALSE}
X <- as.matrix(toy)
K <- 3
init <- aa_init(X, K = K, method = "random")
```

Uniform sampling is the simplest possible initialization and requires no
distance computations. By design it explores the whole geometry with equal
probability, so it makes few assumptions about where useful archetypes should
lie. When combined with many random restarts via the `nrep` argument in
`run_aa()`, even simple uniform initialization can find good solutions as
pointed out in [@Krohne2019].

### `dirichlet`

Instead of sampling archetypes from the data, `dirichlet` constructs the
coefficients matrix $B$ directly by sampling each row from a symmetric Dirichlet
distribution $\text{Dirichlet}(\alpha \mathbf{1}_N)$. Thus the resulting
archetypes are random convex combinations of **all** data points.

The `alpha` parameter controls the shape of the distribution:

* **`alpha = 1`** (default): all possible combinations are equally likely
* **`alpha < 1`**: sparse combinations are more likely, so only a few points
  contribute to each archetype.
* **`alpha > 1`**: near uniform combinations are more likely, so many points
  contribute to each archetype.

As $\alpha \to 0$ the behavior approaches `random` (archetypes snap to data
points), while for $\alpha \gg 1$ the archetypes are more likely to lie near the
center of the data hull. The plot below compares initial archetype placements
across three variants to demonstrate the convergence of `dirichlet` to
`random`.

```{r dirichlet-vs-random-plot, fig.height = 3, fig.width=7, results='hold', echo=FALSE}
X <- as.matrix(toy)
K <- 3
dir_variants <- list(
    `random` = list(method = "random"),
    `dirichlet alpha = 1` = list(method = "dirichlet", alpha = 1),
    `dirichlet alpha = 0.01` = list(method = "dirichlet", alpha = 0.01)
)
n_dir_runs <- 3

dir_inits <- list()
for (variant in names(dir_variants)) {
    args <- dir_variants[[variant]]
    dir_inits[[variant]] <- list()
    for (run in 1:n_dir_runs) {
        if (args$method == "dirichlet") {
            dir_inits[[variant]][[run]] <- aa_init(
                X,
                K = K,
                method = args$method,
                alpha = args$alpha
            )
        } else {
            dir_inits[[variant]][[run]] <- aa_init(
                X,
                K = K,
                method = args$method
            )
        }
    }
}
op <- par(mfrow = c(1, 3), mar = c(2, 2, 2.5, 0.5), bty = "n")
run_cols <- adjustcolor(c("firebrick3", "darkorange3", "purple4"), 0.7)

for (nm in names(dir_inits)) {
    plot(
        X,
        pch = 16, cex = 0.45, col = "grey75",
        main = nm, xlab = "", ylab = "",
        xlim = range(X[, 1]) + c(-1, 1),
        ylim = range(X[, 2]) + c(-1, 1),
        axes = FALSE
    )
    for (run_ix in 1:length(dir_inits[[nm]])) {
        draw_simplex(
            dir_inits[[nm]][[run_ix]]$A,
            border = run_cols[run_ix],
            fill   = adjustcolor(run_cols[run_ix], 0.06),
            lty    = run_ix,
            cex    = 1.1
        )
    }
    if (nm == names(dir_inits)[1]) {
        legend(
            "topleft",
            legend = paste("Run", 1:n_dir_runs),
            col = run_cols, lty = 1:n_dir_runs, lwd = 2, pch = 17,
            bty = "n", cex = 0.7
        )
    }
}
par(op)
```

`"random"` archetypes (left) always coincide with data points and cluster
near high-density regions. `"dirichlet"` with $\alpha = 1$ (center) samples
weights from all data points and often produces archetypes near the center of
the hull. Lowering $\alpha$ (right) makes weights sparse, which pushes the
random mixtures toward specific data points and toward the behavior of
`"random"`.

### Furthest First (`"furthest_first"`)

Selects the first archetype at random, with bias toward points far from
the mean, and then iteratively adds the single point that is *furthest* from the
already-selected set (the nearest-archetype distance to the set is maximized).

```{r furthest-first-ex, eval = FALSE}
X <- as.matrix(toy)
K <- 3
init <- aa_init(X, K = K, method = "furthest_first")
```

Furthest First is a greedy adaptation of the $k$-center algorithm. It produces
well-separated archetypes and is deterministic after the first random draw.
Because it targets the globally farthest point at each step, it tends to spread
archetypes evenly across the data hull, though this spread is not specifically
tuned to the AA reconstruction objective.

### k-means++ (`"kmeans_pp"`)

Selects the first archetype at random, with bias toward points far from the mean,
and then samples each subsequent archetype with probability proportional to the
*squared* distance to the nearest already-selected archetype.

```{r kmeans-pp-ex, eval = FALSE}
X <- as.matrix(toy)
K <- 3
init <- aa_init(X, K = K, method = "kmeans_pp")
```

The k-means++ sampling rule is a softer version of *Furthest First*. Instead of
always picking the single farthest point, it gives more chances to points that
are far from the already-selected archetypes. It can also be thought of as an
approximation of *AA++* which uses the squared distance to the nearest archetype
as a proxy for the distance to the convex hull of the already-selected archetypes
(distance to corners instead of distance to faces).

### Furthest Sum (`"furthest_sum"`)

Selects archetypes by maximizing the *sum* of distances to all current archetypes [@Morup2012].
An optional refinement phase, controlled by `refinement_steps` (default 10), cycles through
the selected set and replaces each archetype with a better candidate to remove the
effect of the initial random selection.

```{r furthest-sum-ex, eval = FALSE}
X <- as.matrix(toy)
K <- 3
# Default refinement
init <- aa_init(X, K = K, method = "furthest_sum")

# More aggressive refinement
init <- aa_init(X, K = K, method = "furthest_sum", refinement_steps = 30)

# No refinement
init <- aa_init(X, K = K, method = "furthest_sum", refinement_steps = 0)
```

FurthestSum is the historical default initialization for AA [@Morup2012] and
remains the default when `init = NULL` is passed to `run_aa()`.

### Batched Coreset-Style Initialization

Supplying `batch_size` restricts initialization to a candidate batch before
selecting archetypes. With the default `batch_type = "distal"`, candidates are
sampled with probability proportional to each point's squared distance from the
data mean. This is the coreset idea of [@Mair2019]: reduce the effective problem
size while preserving the extremal structure that boundary-seeking methods such
as `furthest_sum` rely on.

```{r coreset-ex, eval = FALSE}
X <- as.matrix(toy)
K <- 3
batch_size <- 60

# batch_size must be at least K
init <- aa_init(X, K = K, method = "furthest_sum", batch_size = batch_size)
```

For large datasets where computing all pairwise distances is expensive, the
batch reduces memory use and runtime. Because distal batching downweights the
center, it avoids spending most candidate slots in regions where archetypes are
unlikely to be useful.

### AA++ (`"aa_pp"`)

AA++ [@Mair2023] selects each archetype by sampling with probability
proportional to the *squared residual distance from the convex hull* of the
already-selected archetypes. Concretely, after selecting $k - 1$ archetypes it
solves a small NNLS projection to find, for each data point, its best
approximation as a convex combination of the current archetypes; the squared
reconstruction error becomes the sampling weight for the $k$-th archetype.

```{r aa-pp-ex, eval = FALSE}
X <- as.matrix(toy)
K <- 3
init <- aa_init(X, K = K, method = "aa_pp")
```

Because the sampling probability is tied to the AA objective itself rather than
to a surrogate (such as nearest-archetype distance), AA++ has a formal guarantee:
each new archetype decreases the expected objective function^[Proposition 3.2 of
[@Mair2023]].

Compared to the other initialization methods, AA++ is more computationally
intensive because it requires solving $K-2$ NNLS problems (the first two
archetypes are selected via `kmeans_pp`) in order to compute the sampling
probabilities for each subsequent archetype.
To mitigate this cost, AA++ can be used with `batch_size` to give a Monte Carlo
approximation of the exact method. See the next section for details.

### Batched AA++

AA++ can also use `batch_size`, giving a variant of the Monte Carlo AA++
approximation. Instead of projecting the full dataset at each step, it draws a
candidate batch, projects only those rows, and samples the next archetype from
their residuals.

```{r aa-pp-mc-ex, eval = FALSE}
X <- as.matrix(toy)
K <- 3
batch_size <- 100
batch_type <- "uniform"

# Larger batch_size -> closer approximation to exact AA++
init <- aa_init(X,
    K = K,
    method = "aa_pp",
    batch_size = batch_size,
    batch_type = batch_type
)
```

The `batch_size` parameter trades approximation accuracy for speed and must
satisfy `batch_size >= K`. When `batch_size == nrow(X)` the method reduces to
exact AA++. When `batch_size << nrow(X)` only a small fraction of the data is
evaluated at each step, making batched AA++ useful for large datasets where
exact AA++ becomes memory-bound. Use `batch_type = "uniform"` for the least
biased approximation, or keep the default `batch_type = "distal"` if you want
the candidate batches to emphasize likely boundary points.

### Hull-outmost (`"hull_outmost"`)

Builds a pool of candidate hull vertices, then ranks them by an outmost-vote
criterion: each candidate votes for the vertex farthest from it, and the $K$
most-voted candidates are selected as archetypes. The `hull_method` argument
(required) controls how the candidate pool is constructed:

* **`"full"`** — exact convex hull of $X$ in the original feature space.
  Uses `grDevices::chull` for 2-D data and the `geometry` package for higher
  dimensions.
* **`"projected"`** — union of convex hulls computed across random
  low-dimensional projections of $X$. Controlled by `projected_dim`
  (default `2`) and `n_projection_max`.
* **`"partitioned"`** — union of hull vertices from random partitions of the
  data rows. Controlled by `n_partitions` (default `10`). The fastest strategy.

```{r hull-outmost-ex, eval = FALSE}
X <- as.matrix(toy)
K <- 3
# Full hull — exact but requires 'geometry' for D > 2
init <- aa_init(X, K = K, method = "hull_outmost", hull_method = "full")

# Projected hull — works in any dimension without extra packages
init <- aa_init(X,
    K = K, method = "hull_outmost",
    hull_method = "projected", projected_dim = 2
)

# Partitioned hull — fastest, suitable for large n
init <- aa_init(X,
    K = K, method = "hull_outmost",
    hull_method = "partitioned", n_partitions = 15
)
```

The partitioned and projected strategies are designed for datasets where
computing the exact hull is intractable. The `use_unique_candidates` argument
(default `FALSE`) controls whether duplicate hull candidates (a point appearing
in multiple projections or partitions) cast only one vote.

## Session Information

```{r session-info, echo = FALSE}
sessionInfo()
```

## References
