The bootkmeans algorithm [1] augments standard \(k\)-means with sequential bootstrap resamples. Each resample fits \(k\)-means and passes its centres forward to initialize the next resample. Across iterations, out-of-bag (OOB) allocations produce fuzzy memberships (row-normalized OOB tallies), hard labels, and centres averaged over the terminal window of iterations. An adaptive stopping rule applies a Breusch–Godfrey (BG) serial-correlation test [2,3] to the objective trace; when serial correlation is no longer detected at level \(\alpha\), the outer loop terminates.
A brief overview of the method is given as follows; for more information, see [1]:
1. Bootstrap rows and fit \(k\)-means; the first pass uses user-supplied centres or random starts, and subsequent iterations reuse the previous iteration's centres.
2. Compute the squared Euclidean distance from each original observation to each current centre, allocate each observation to its nearest centre, and record the OOB allocations.
3. Track the \(k\)-means objective as the sum of per-row minimum squared distances at each iteration.
4. Test for serial correlation in the recent window of objective values using the BG test. If it is not significant, stop; otherwise continue up to a maximum number of iterations.
5. Return the final outputs: fuzzy memberships `U`, hard labels `cluster`, averaged centres `centers`, the objective trace `soslist`, and the diagnostic \(p\)-value `p.value`.
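To make the overview concrete, below is a minimal R sketch of a single iteration of the loop described above. The helper `one_iteration` and the running OOB `tally` matrix (an \(n \times k\) count of out-of-bag allocations) are illustrative names only, not the package's internal implementation; see [1] for the actual algorithm.

``` r
one_iteration <- function(x, centres, tally) {
  n   <- nrow(x)
  idx <- sample(n, n, replace = TRUE)                # bootstrap resample of rows
  km  <- stats::kmeans(x[idx, ], centers = centres)  # k-means warm-started at previous centres
  # squared Euclidean distance from every original row to every current centre
  d2  <- t(apply(x, 1, function(row) colSums((t(km$centers) - row)^2)))
  alloc <- max.col(-d2)                              # allocate each row to its nearest centre
  oob   <- setdiff(seq_len(n), idx)                  # out-of-bag rows for this resample
  tally[cbind(oob, alloc[oob])] <- tally[cbind(oob, alloc[oob])] + 1
  list(centres   = km$centers,
       tally     = tally,
       objective = sum(d2[cbind(seq_len(n), alloc)]))  # sum of per-row minimum squared distances
}
```

Row-normalising the accumulated tally yields the fuzzy membership matrix \(U\), and a BG test on the recent window of objective values (for example via `lmtest::bgtest()`) supplies the stopping diagnostic.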
You can install bootkmeans from GitHub.
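A typical GitHub installation call is sketched below; the repository path is a placeholder, not the confirmed location of the package:

``` r
# install.packages("devtools")
# the "<github-user>/bootkmeans" path is hypothetical; substitute the actual repository
devtools::install_github("<github-user>/bootkmeans")
```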
From the standard clustering dataset `iris`, we run the `boot.kmeans()` function with the true number of groups `groups = 3`, a minimum of `iterations = 500` bootstrap iterations before conducting BG tests, and up to a maximum of `maxsamp = 1000` bootstrap iterations. The maximum number of iterations and the number of random starts for the `kmeans()` algorithm are `itermax = 1000` and `nstart = 5`, respectively. Below we look at some basic functionality.
set.seed(1)
x <- as.matrix(iris[, -5])
fit <- boot.kmeans(
data = x,
groups = 3,
iterations = 500,
itermax = 1000,
nstart = 5,
verbose = FALSE,
maxsamp = 1000
)
We look at the true cluster means of the dataset and compare them to the centres obtained from the algorithm run.
cat("bootkmeans centres:\n")
#> bootkmeans centres:
fit$centers
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 6.843712 3.072078 5.715812 2.0636626
#> 2 5.883789 2.742195 4.379250 1.4236944
#> 3 5.006358 3.427070 1.463814 0.2463964
cat("true cluster centers")
#> true cluster centers
aggregate(. ~ Species, data = iris, FUN = mean)
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1 setosa 5.006 3.428 1.462 0.246
#> 2 versicolor 5.936 2.770 4.260 1.326
#> 3 virginica 6.588 2.974 5.552 2.026
We take a look at the probabilistic cluster memberships for the first 10 observations, which are drawn from the first cluster; these observations are well separated from the other two clusters.
fit$U[1:10, ]
#> [,1] [,2] [,3]
#> [1,] 0 0 1
#> [2,] 0 0 1
#> [3,] 0 0 1
#> [4,] 0 0 1
#> [5,] 0 0 1
#> [6,] 0 0 1
#> [7,] 0 0 1
#> [8,] 0 0 1
#> [9,] 0 0 1
#> [10,] 0 0 1
The remaining two clusters have a higher degree of cluster overlap, which we can see in the first 10 observations of cluster two:
fit$U[51:60,]
#> [,1] [,2] [,3]
#> [1,] 0.46111111 0.5388889 0
#> [2,] 0.02890173 0.9710983 0
#> [3,] 0.85955056 0.1404494 0
#> [4,] 0.00000000 1.0000000 0
#> [5,] 0.02824859 0.9717514 0
#> [6,] 0.00000000 1.0000000 0
#> [7,] 0.05208333 0.9479167 0
#> [8,] 0.00000000 1.0000000 0
#> [9,] 0.01666667 0.9833333 0
#> [10,] 0.00000000 1.0000000 0
Then we examine the BG-test \(p\)-value, which terminates the algorithm some time after the initial 500 bootstrap iterations.
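For example, the \(p\)-value is stored in the fitted object; the printed value is omitted here because it depends on the particular run:

``` r
fit$p.value
```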
We can use the function `bootk.hardsoftvis()` with our `fit` above to create a scatterplot matrix of pairwise variables, coloured by whether observations receive soft (uncertain) cluster assignments (blue) or hard (certain) assignments (green). From the scatterplot matrix, it is evident that only observations in regions of cluster overlap are assigned soft cluster memberships.
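A call along the following lines produces the plot; passing the fit as the only argument is an assumption here, so consult the function's documentation for the exact interface:

``` r
# assumed usage: pass the boot.kmeans() fit directly (signature not confirmed)
bootk.hardsoftvis(fit)
```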
We can compare contingency matrices between the hard clusters determined by `boot.kmeans()`, the traditional `kmeans()` [4] function from package stats [5], and the fuzzy \(c\)-means algorithm [6] via `FKM()` from package fclust [7]. To do so, we use the included function `compare.clusters()` with argument `what = "all"`, which runs all three algorithms. We then pass the stored results to `compare.tables()` to display the contingency tables.
res <- compare.clusters(
data = x,
groups = 3,
nstart = 10,
what = "all"
)
compare.tables(res, true.labs = iris$Species)
#>
#> true.labs 1 2 3
#> setosa 50 0 0
#> versicolor 0 2 48
#> virginica 0 36 14
#>
#> true.labs 1 2 3
#> setosa 0 0 50
#> versicolor 2 48 0
#> virginica 36 14 0
#> NULL
To compare how uncertainty is quantified by `boot.kmeans()` and by `FKM()`, we simulate a dataset where we assume true probabilistic cluster assignments rather than hard clusters. We generate a mixture of Gaussians in two dimensions
with ten clusters, where nine outer clusters have means equally spaced
on a circle of radius six, and one central component is at the origin.
All clusters share an identity covariance and have the same number of
observations. This creates clear cluster interiors but overlapping
borders where posteriors are ambiguous.
Given the known mixture, we compute the true posterior probabilities and define the uncertainty of observation \(i\) as \(1 - \max_{k}U_{i,k}\), where \(U\) is the membership matrix with one row per observation \(i\) and one column per cluster \(k\); higher uncertainty values indicate points near regions of cluster overlap, or decision boundaries.
set.seed(1)
numClusters <- 10
numObsPerCluster <- 30
radius <- 6
dim <- 2
centreNumObs <- numObsPerCluster
angles <- seq(0, 2 * pi, length.out = numClusters)
outerMean <- t(sapply(1:(numClusters - 1),
function(i) c(radius * cos(angles[i]),
radius * sin(angles[i]))))
centreMean <- c(0, 0)
allMeans <- rbind(outerMean, centreMean)
covMat <- diag(1, dim)
data <- do.call(rbind,
c(lapply(1:(numClusters - 1),
function(i) MASS::mvrnorm(numObsPerCluster,
outerMean[i, ], covMat)),
list(MASS::mvrnorm(centreNumObs, centreMean, covMat))
))
# within cluster density calculation
densities <- sapply(1:numClusters,
function(i) mvtnorm::dmvnorm(data, mean = allMeans[i, ],
sigma = covMat))
totDensity <- rowSums(densities) # overall densities
U <- densities / totDensity # probabilistic (fuzzy) cluster assignments
hardU <- apply(U, 1, which.max) # assigning hard cluster labels
plotDF <- data.frame(X = data[,1],
Y = data[,2],
U = 1- apply(U, 1, max),
Z = factor(hardU))
library(ggplot2)  # attach ggplot2 for the plot below
ggplot(plotDF, aes(x = X, y = Y, col = Z, size = U)) +
geom_point() +
theme_classic() +
coord_equal() +
labs(x = expression(X[1]),
y = expression(X[2]),
col = "Hard Cluster",
size = "Uncertainty") +
theme(panel.border = element_rect(fill = NA, colour = "black", linewidth = 1))
Now we examine the fuzzy results of both `boot.kmeans()` and `FKM()` using `ggplot()` [8], where observations are sized by uncertainty as above and coloured by the hard cluster memberships returned by the respective algorithm.
set.seed(1)
compareFUZZ <- compare.clusters(data, groups = 10)
BKMres <- compareFUZZ$bkm
BKMplotDF <- data.frame(X = data[,1], Y = data[,2],
U = 1 - apply(BKMres$U, 1, max),
Z = factor(BKMres$clusters))
FKMres <- compareFUZZ$fkm
FKMplotDF <- data.frame(X = data[,1], Y = data[,2],
U = 1 - apply(FKMres$U, 1, max),
Z = factor(FKMres$clus[,1]))
cpal <- scales::hue_pal()(10)
mplots <- function(df, title) {
ggplot(df, aes(x = X, y = Y, col = Z, size = U)) +
geom_point(alpha = 0.85) +
coord_equal() +
scale_color_manual(values = cpal) +
scale_size_continuous(limits = c(0,1), range = c(1,5)) +
labs(x = expression(X[1]),
y = expression(X[2]),
col = "Cluster",
size = "Uncertainty",
title = title) +
theme_classic() +
theme(panel.border = element_rect(fill = NA, color = "black"),
legend.position = "bottom")
}
p1 <- mplots(BKMplotDF, "bootkmeans")
p2 <- mplots(FKMplotDF, "fuzzy cmeans")
library(patchwork)  # patchwork provides the p1 + p2 composition and plot_layout()
(p1 + p2) + plot_layout(guides = "collect") &
theme(legend.position = "bottom",
legend.box = "vertical",
legend.box.just = "center")
The `boot.kmeans()` procedure produces uncertainty estimates that are more closely aligned with the true data-generating process, whereas fuzzy clustering tends to report low uncertainty primarily near cluster centers.
To estimate clustering accuracy, we compute the proportion of correctly classified observations after optimally permuting the cluster labels using `matchLabels()` from package Thresher [9]. We sum the diagonal of the contingency table under the best label permutation and divide by the total number of observations:
cat("bootkmeans clustering accuracy: \n")
#> bootkmeans clustering accuracy:
sum(diag(Thresher::matchLabels(table(BKMres$clusters, hardU))))/nrow(data)
#> [1] 0.9833333
cat("\nFuzzy cmeans clustering accuracy: \n")
#>
#> Fuzzy cmeans clustering accuracy:
sum(diag(Thresher::matchLabels(table(FKMres$clus[,1], hardU))))/nrow(data)
#> [1] 0.9866667
Both methods achieve comparable accuracy under this metric. To assess recovery of the fuzzy/probabilistic structure, we compare the true membership matrix with the estimated membership matrices \(U\) from each method using the Frobenius Adjusted Rand Index [10], computed with the included `fari()` function adapted from [11]. This index generalizes the Adjusted Rand Index to fuzzy partitions.
cat("bootkmeans FARI: \n")
#> bootkmeans FARI:
fari(U, BKMres$U)
#> [1] 0.9855035
cat("\nFuzzy cmeans FARI: \n")
#>
#> Fuzzy cmeans FARI:
fari(U, FKMres$U)
#> [1] 0.8373161
Here we see that `boot.kmeans()` more accurately captures the underlying probabilistic memberships than fuzzy \(c\)-means.
References
[1] Ghashti, J.S., Andrews, J.L., Thompson, J.R.J., Epp, J. and H.S. Kochar (2025). A bootstrap augmented \(k\)-means algorithm for fuzzy partitions. Submitted.
[2] Breusch, T.S. (1978). Testing for Autocorrelation in Dynamic Linear Models, Australian Economic Papers, 17, 334-355.
[3] Godfrey, L.G. (1978). Testing Against General Autoregressive and Moving Average Error Models when the Regressors Include Lagged Dependent Variables, Econometrica, 46, 1293-1301.
[4] R Core Team (2025). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[5] Hartigan, J.A. and M.A. Wong (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100–108.
[6] Bezdek, J.C. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum.
[7] Ferraro, M.B., Giordani, P. and A. Serafini (2019). fclust: An R Package for Fuzzy Clustering, The R Journal, 11.
[8] Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
[9] Coombes, K.R. (2025). Thresher: Threshing and Reaping for Principal Components. R package version 1.1.5.
[10] Andrews, J.L., Browne, R. and C.D. Hvingelby (2022). On Assessments of Agreement Between Fuzzy Partitions. Journal of Classification, 39, 326–342.
[11] Andrews, J.L. (2013). FARI. GitHub repository, https://github.com/its-likeli-jeff/FARI.