bootkmeans: Bootstrap Augmented k-Means Algorithm for Parameter-less Fuzzy Clustering

Jesse S. Ghashti, Jeffrey L. Andrews, John R.J. Thompson, Joyce Epp, and Harkunwar S. Kochar

Introduction

The bootkmeans algorithm [1] augments standard \(k\)-means with sequential bootstrap resamples. Each resample fits \(k\)-means and passes its centres forward to initialize the next resample. Across iterations, out-of-bag (OOB) allocations produce fuzzy memberships (row-normalized OOB tallies) and hard labels, and centres are averaged over the terminal window of iterations. An adaptive stopping rule applies a Breusch–Godfrey (BG) serial-correlation test [2,3] to the objective trace; when serial correlation is no longer detected at level \(\alpha\), the outer loop terminates.

A brief overview of the method is given above; for more information, see [1].
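As a rough illustration of these two ingredients, the sketch below row-normalizes a hypothetical OOB tally matrix into fuzzy memberships and applies a BG test to a stand-in objective trace via lmtest::bgtest(). The names oob_tally and obj_trace are placeholders, and the package's internal regression specification may differ.

set.seed(1)
# hypothetical n x k matrix counting how often each observation was
# allocated to each cluster while out-of-bag
oob_tally <- matrix(rpois(5 * 3, lambda = 4), nrow = 5, ncol = 3)
U <- oob_tally / rowSums(oob_tally)  # row-normalized tallies = fuzzy memberships

# BG test for serial correlation in the objective trace; the outer loop
# stops once the p-value exceeds alpha
obj_trace <- cumsum(rnorm(500))      # stand-in for the k-means objective trace
lmtest::bgtest(obj_trace ~ seq_along(obj_trace))$p.value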

Installing

You can install the current version of bootkmeans from GitHub with:

library(devtools)
install_github("https://github.com/ghashti-j/bootkmeans")
library(bootkmeans)

Sample Usage

From the standard clustering dataset iris, we run the boot.kmeans() function with the true number of groups groups = 3, a minimum of iterations = 500 bootstrap iterations before conducting BG-tests, and up to a maximum of maxsamp = 1000 bootstrap iterations. The maximum number of iterations and the number of random starts for the kmeans() algorithm are itermax = 1000 and nstart = 5, respectively. Below we look at some basic functionality.

set.seed(1)
x <- as.matrix(iris[, -5])

fit <- boot.kmeans(
  data       = x,
  groups     = 3,
  iterations = 500,
  itermax    = 1000,
  nstart     = 5,
  verbose    = FALSE,
  maxsamp    = 1000
)

We look at the true cluster means of the dataset and compare them to the centres obtained from the algorithm run.

cat("bootkmeans centres:\n")
#> bootkmeans centres:
fit$centers
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     6.843712    3.072078     5.715812   2.0636626
#> 2     5.883789    2.742195     4.379250   1.4236944
#> 3     5.006358    3.427070     1.463814   0.2463964

cat("true cluster centers")
#> true cluster centers
aggregate(. ~ Species, data = iris, FUN = mean)
#>      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1     setosa        5.006       3.428        1.462       0.246
#> 2 versicolor        5.936       2.770        4.260       1.326
#> 3  virginica        6.588       2.974        5.552       2.026
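To quantify the agreement, a quick illustrative check matches each estimated centre to its nearest true centre by Euclidean distance; small distances indicate close recovery of the species means:

true_centres <- as.matrix(aggregate(. ~ Species, data = iris, FUN = mean)[, -1])
# cross-distances between the 3 fitted centres (rows 1-3) and the
# 3 true centres (rows 4-6)
d <- as.matrix(dist(rbind(fit$centers, true_centres)))[1:3, 4:6]
apply(d, 1, min)  # distance from each fitted centre to its closest true centre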

We take a look at the probabilistic cluster memberships for the first 10 observations, which are drawn from cluster one; these observations are well separated from the other two clusters.

fit$U[1:10, ]
#>       [,1] [,2] [,3]
#>  [1,]    0    0    1
#>  [2,]    0    0    1
#>  [3,]    0    0    1
#>  [4,]    0    0    1
#>  [5,]    0    0    1
#>  [6,]    0    0    1
#>  [7,]    0    0    1
#>  [8,]    0    0    1
#>  [9,]    0    0    1
#> [10,]    0    0    1

The remaining two clusters have a higher degree of cluster overlap, which we can see in the first 10 observations of cluster two:

fit$U[51:60,]
#>             [,1]      [,2] [,3]
#>  [1,] 0.46111111 0.5388889    0
#>  [2,] 0.02890173 0.9710983    0
#>  [3,] 0.85955056 0.1404494    0
#>  [4,] 0.00000000 1.0000000    0
#>  [5,] 0.02824859 0.9717514    0
#>  [6,] 0.00000000 1.0000000    0
#>  [7,] 0.05208333 0.9479167    0
#>  [8,] 0.00000000 1.0000000    0
#>  [9,] 0.01666667 0.9833333    0
#> [10,] 0.00000000 1.0000000    0
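A compact way to summarize this pattern across the whole dataset is the mean per-observation uncertainty, \(1 - \max_{k}U_{i,k}\), within each true species (an illustrative summary using only the fitted object; given the memberships above, we expect a value near zero for setosa and larger values for the two overlapping species):

unc <- 1 - apply(fit$U, 1, max)  # per-observation uncertainty
tapply(unc, iris$Species, mean)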

Then we examine the p-value of the BG-test that terminated the algorithm sometime after the initial 500 bootstrap iterations; a p-value above \(\alpha\) indicates that serial correlation is no longer detected in the objective trace, which triggers the stop.

cat("\n p-value for BG-test: \n")
#> 
#>  p-value for BG-test:
fit$p.value
#> [1] 0.1085817

We can use the function bootk.hardsoftvis() with our fit above to create a scatterplot matrix of pairwise variables, coloured by whether an observation received a soft (uncertain) cluster assignment (blue) or a hard (certain) assignment (green). From the scatterplot matrix, it is evident that only observations in regions of cluster overlap are assigned soft cluster memberships.

bootk.hardsoftvis(x, fit, plotallvars = TRUE)
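The same hard/soft split can be tabulated directly from the membership matrix; here we count observations whose maximum membership is exactly one as hard (a sketch based only on fit$U; the plotting function may use a different internal criterion):

table(ifelse(apply(fit$U, 1, max) == 1, "hard", "soft"))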

We can compare contingency matrices of the hard clusters determined by boot.kmeans() against both the traditional kmeans() [5] function from package stats [4] and the fuzzy \(c\)-means algorithm [6] using FKM() from package fclust [7]. To do so, we use the included function compare.clusters() with argument what = "all" to run all three algorithms, then pass the stored results to compare.tables() to display the contingency tables.

res <- compare.clusters(
  data   = x,
  groups = 3,
  nstart = 10,
  what   = "all"
)

compare.tables(res, true.labs = iris$Species)
#>             
#> true.labs     1  2  3
#>   setosa     50  0  0
#>   versicolor  0  2 48
#>   virginica   0 36 14
#>             
#> true.labs     1  2  3
#>   setosa      0  0 50
#>   versicolor  2 48  0
#>   virginica  36 14  0
#> NULL
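The same best-permutation accuracy used later in this document can be computed from these results; the sketch below assumes the boot.kmeans() component is stored as res$bkm with hard labels in res$bkm$clusters, mirroring the compareFUZZ$bkm usage further down:

sum(diag(Thresher::matchLabels(table(res$bkm$clusters, iris$Species)))) / nrow(x)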

To compare how uncertainty is quantified for boot.kmeans() compared to FKM(), we simulate a dataset where we assume true probabilistic cluster assignments rather than hard clusters. We generate a mixture of Gaussians in two dimensions with ten clusters, where nine outer clusters have means equally spaced on a circle of radius six, and one central component is at the origin. All clusters share an identity covariance and have the same number of observations. This creates clear cluster interiors but overlapping borders where posteriors are ambiguous.

Given the known mixture, we compute the true posterior probabilities and define the uncertainty of observation \(i\) as \(1 - \max_{k}U_{i,k}\), where \(U\) is the membership matrix over clusters \(k\); higher uncertainty values indicate points near regions of cluster overlap, or decision boundaries.

set.seed(1)
numClusters      <- 10   # nine outer clusters plus one central cluster
numObsPerCluster <- 30
radius           <- 6    # radius of the circle of outer-cluster means
dim              <- 2
centreNumObs     <- numObsPerCluster
angles           <- seq(0, 2 * pi, length.out = numClusters)

# means of the nine outer clusters, equally spaced on the circle
outerMean <- t(sapply(1:(numClusters - 1),
                      function(i) c(radius * cos(angles[i]),
                                    radius * sin(angles[i]))))
centreMean <- c(0, 0)      # central cluster at the origin
allMeans   <- rbind(outerMean, centreMean)
covMat     <- diag(1, dim) # shared identity covariance

# draw observations cluster-by-cluster from the Gaussian mixture
data <- do.call(rbind,
                c(lapply(1:(numClusters - 1),
                         function(i) MASS::mvrnorm(numObsPerCluster,
                                                   outerMean[i, ], covMat)),
                  list(MASS::mvrnorm(centreNumObs, centreMean, covMat))))

# within cluster density calculation
densities <- sapply(1:numClusters, 
                    function(i) mvtnorm::dmvnorm(data, mean = allMeans[i, ], 
                                                 sigma = covMat))

totDensity <- rowSums(densities) # overall densities
U <- densities / totDensity # probabilistic (fuzzy) cluster assignments
hardU <- apply(U, 1, which.max) # assigning hard cluster labels 
plotDF <- data.frame(X = data[, 1],
                     Y = data[, 2],
                     U = 1 - apply(U, 1, max),
                     Z = factor(hardU))

library(ggplot2)

ggplot(plotDF, aes(x = X, y = Y, col = Z, size = U)) +
  geom_point() +
  theme_classic() +
  coord_equal() +
  labs(x = expression(X[1]),
       y = expression(X[2]),
       col = "Hard Cluster",
       size = "Uncertainty") +
  theme(panel.border = element_rect(fill = NA, colour = "black", linewidth = 1))

Now we examine the fuzzy results of both boot.kmeans() and FKM() using ggplot() [8], where observations are sized by uncertainty as above and coloured by the hard cluster memberships returned by the respective algorithm.

set.seed(1) 

compareFUZZ <- compare.clusters(data, groups = 10)
BKMres <- compareFUZZ$bkm

BKMplotDF <- data.frame(X = data[,1], Y = data[,2],
                        U = 1 - apply(BKMres$U, 1, max),
                        Z = factor(BKMres$clusters))

FKMres <- compareFUZZ$fkm
FKMplotDF <- data.frame(X = data[,1], Y = data[,2], 
                        U = 1 - apply(FKMres$U, 1, max),
                        Z = factor(FKMres$clus[,1]))

cpal <- scales::hue_pal()(10)

mplots <- function(df, title) {
  ggplot(df, aes(x = X, y = Y, col = Z, size = U)) +
    geom_point(alpha = 0.85) +
    coord_equal() +
    scale_color_manual(values = cpal) +
    scale_size_continuous(limits = c(0,1), range = c(1,5)) +
    labs(x = expression(X[1]),
         y = expression(X[2]),
         col = "Cluster",
         size = "Uncertainty",
         title = title) +
    theme_classic() +
    theme(panel.border = element_rect(fill = NA, color = "black"),
          legend.position = "bottom")
}

library(patchwork)  # needed for plot composition and plot_layout()

p1 <- mplots(BKMplotDF, "bootkmeans")
p2 <- mplots(FKMplotDF, "fuzzy cmeans")
(p1 + p2) + plot_layout(guides = "collect") &
  theme(legend.position = "bottom",
        legend.box = "vertical",
        legend.box.just = "center")

The boot.kmeans() procedure produces uncertainty estimates that are more closely aligned with the true data-generating process, whereas fuzzy clustering tends to report low uncertainty primarily near cluster centres.
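One illustrative way to quantify this claim is to correlate each method's estimated per-observation uncertainty with the true posterior uncertainty computed earlier (both already stored in the plotting data frames):

cor(plotDF$U, BKMplotDF$U)  # bootkmeans uncertainty vs truth
cor(plotDF$U, FKMplotDF$U)  # fuzzy c-means uncertainty vs truth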

To estimate clustering accuracy, we compute the proportion of correctly classified observations after optimally permuting the cluster labels using matchLabels() from package Thresher [9]: we sum the diagonal of the contingency table under the best label permutation and divide by the total number of observations.

cat("bootkmeans clustering accuracy: \n")
#> bootkmeans clustering accuracy:
sum(diag(Thresher::matchLabels(table(BKMres$clusters, hardU))))/nrow(data)
#> [1] 0.9833333
cat("\nFuzzy cmeans clustering accuracy: \n")
#> 
#> Fuzzy cmeans clustering accuracy:
sum(diag(Thresher::matchLabels(table(FKMres$clus[,1], hardU))))/nrow(data)
#> [1] 0.9866667

Both methods achieve comparable accuracy under this metric. To assess recovery of the fuzzy/probabilistic structure, we compare the true membership matrix with the estimated membership matrices \(U\) from each method using the Frobenius Adjusted Rand Index (FARI) [10] via the included fari() function, adapted from [11]. This index generalizes the Adjusted Rand Index to fuzzy partitions.
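As a quick sanity check (assuming fari() is normalized like the ARI, so that identical partitions score one), comparing the true membership matrix with itself should return 1:

fari(U, U)  # expected to be 1 for identical fuzzy partitions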

cat("bootkmeans FARI: \n")
#> bootkmeans FARI:
fari(U, BKMres$U)
#> [1] 0.9855035
cat("\nFuzzy cmeans FARI: \n")
#> 
#> Fuzzy cmeans FARI:
fari(U, FKMres$U)
#> [1] 0.8373161

Here we see that boot.kmeans() more accurately captures the underlying probabilistic memberships than fuzzy \(c\)-means.

References

[1] Ghashti, J.S., Andrews, J.L., Thompson, J.R.J., Epp, J. and H.S. Kochar (2025). A bootstrap augmented \(k\)-means algorithm for fuzzy partitions. Submitted.

[2] Breusch, T.S. (1978). Testing for Autocorrelation in Dynamic Linear Models, Australian Economic Papers, 17, 334–355.

[3] Godfrey, L.G. (1978). Testing Against General Autoregressive and Moving Average Error Models when the Regressors Include Lagged Dependent Variables, Econometrica, 46, 1293–1301.

[4] R Core Team (2025). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

[5] Hartigan, J.A. and M.A. Wong (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100–108.

[6] Bezdek, J.C. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum.

[7] Ferraro, M.B., Giordani, P. and A. Serafini (2019). fclust: An R Package for Fuzzy Clustering, The R Journal, 11.

[8] Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

[9] Coombes, K.R. (2025). Thresher: Threshing and Reaping for Principal Components. R package version 1.1.5.

[10] Andrews, J.L., Browne, R. and C.D. Hvingelby (2022). On Assessments of Agreement Between Fuzzy Partitions. Journal of Classification, 39, 326–342.

[11] Andrews, J.L. (2013). FARI. GitHub repository, https://github.com/its-likeli-jeff/FARI