Implementation of the DISCO Metric for Internal Clustering Evaluation
R implementation of the DISCO (Density-based Internal Score for Clustering Outcomes) metric — a Cluster Validity Index (CVI) for evaluating density-based clustering results, including explicit evaluation of noise point quality. Translated from the original Python reference implementation by Beer, Krieger, Weber et al. (2025).
discoCVI evaluates clustering quality without ground truth labels by using density-connectivity distance (dc-distance) instead of Euclidean distance. It is the first CVI that:
Traditional CVIs like Silhouette and Davies-Bouldin assume convex clusters and ignore noise. On non-convex data, they rank the wrong algorithm higher:
| Algorithm | Silhouette | Davies-Bouldin | DISCO |
|---|---|---|---|
| DBSCAN (correct) | −0.16 ❌ | 2.8 (bad) ❌ | 0.68 ✅ |
| K-Means (wrong) | 0.37 ✅ | 0.6 (good) ✅ | 0.22 ❌ |
DISCO correctly identifies DBSCAN as better on non-convex data.
# Install from GitHub
devtools::install_github("aminentezari/discoCVI")Dependencies (installed automatically): -
FNN — fast k-nearest neighbor search
library(discoCVI)
library(dbscan)
# 1. Generate synthetic data
data <- make_circles(n_samples = 300, noise = 0.05, factor = 0.4, random_state = 42)
X <- data$X
# 2. Apply clustering
db_labels <- dbscan(X, eps = 0.2, minPts = 5)$cluster - 1L # 0 -> -1 for noise
km_labels <- kmeans(X, centers = 2, nstart = 10)$cluster - 1L
# 3. Evaluate
disco_score(X, db_labels) # -> 0.6900921813
disco_score(X, km_labels) # -> 0.0270091908
# 4. Per-point scores
scores <- disco_samples(X, db_labels)
summary_disco_scores(scores)
# 5. Visualise
plot_disco_scores(X, db_labels, scores, main = "DBSCAN - DISCO Scores")
# 6. Compare algorithms
compare_clusterings(X, list(DBSCAN = db_labels, KMeans = km_labels))disco_score(X, labels, min_points = 5)Returns a single number summarising overall clustering quality.
| Argument | Type | Description |
|---|---|---|
X |
matrix / data.frame | n x p feature matrix |
labels |
integer vector | Cluster labels; use -1 for noise |
min_points |
integer | k for dc-distance. Default: 5 |
Returns: Single numeric in [-1, 1].
disco_samples(X, labels, min_points = 5)Returns per-point scores — useful for identifying problematic points.
scores <- disco_samples(X, labels)
bad_points <- which(scores < 0)p_cluster(X, labels, min_points, precomputed_dc_dists)Silhouette-style scores using dc-distances. -1 is
treated as a valid cluster here.
p_noise(X, labels, min_points, dc_dists)Returns list(p_sparse, p_far) for noise points only.
compute_dc_distances(X, min_points = 5)Returns the n x n DC-distance matrix directly.
D <- compute_dc_distances(X)
s1 <- p_cluster(D, labels1, precomputed_dc_dists = TRUE)
s2 <- p_cluster(D, labels2, precomputed_dc_dists = TRUE)| Function | Description |
|---|---|
make_circles(n, noise, factor, seed) |
Concentric circles dataset |
make_moons(n, noise, seed) |
Two interleaving crescents |
make_blobs(n, centers, std, seed) |
Gaussian blobs with variable density |
plot_disco_scores(X, labels, scores) |
Red-yellow-green scatter plot |
compare_clusterings(X, list(...)) |
Rank multiple clusterings by DISCO |
summary_disco_scores(scores) |
Mean, median, SD, quantiles, n_negative |
k(x) = d_euclidean(x, x_{min_points-1})
dm(x, y) = max( k(x), k(y), d_euclidean(x, y) )
dc(x, y) = max edge weight on minimax path x -> y in MST
Cluster points:
a = mean dc-distance to own cluster (excluding self)
b = min over other clusters of mean dc-distance
p(x) = (b - a) / max(a, b)
Noise points:
p_sparse = (k(n) - k_max(C)) / max(k(n), k_max(C))
p_far = (dc_min(n,C) - k_max(C)) / max(dc_min(n,C), k_max(C))
p(n) = min(p_sparse, p_far)
Overall:
DISCO = (1/n) * sum(p(x_i))
This R package is a line-by-line translation of the
Python reference (disco.py + dctree.py). Three
bugs were found and fixed:
Python includes self (dist=0) in k-NN, so effective neighbor is k-1.
FNN::get.knn excludes self:
k <- min_points - 1
knn_result <- FNN::get.knn(X, k = k)
core_dists <- knn_result$nn.dist[, k]igraph::mst breaks equal-weight edges differently than
Python’s Prim. Fixed by porting Python’s exact
_get_mst_edges() (dctree.py lines 401-438).
R and Python RNGs produce different K-Means partitions from the same seed. Fixed by exporting Python labels as CSV and loading in R:
# Python
pd.DataFrame({"km_label": km.labels_}).to_csv("python_km_labels.csv", index=False)# R
km_labels <- as.integer(read.csv("python_km_labels.csv")$km_label)After all fixes, R and Python produce identical scores to 10 decimal places:
DISCO - DBSCAN : 0.6900921813 (R == Python)
DISCO - K-Means : 0.0270091908 (R == Python)
Eight benchmark datasets validated. All differences are within machine epsilon (< 1e-14).
| Dataset | n | Clusters | DBSCAN | K-Means | Match |
|---|---|---|---|---|---|
| Concentric Circles | 300 | 2 | 0.6900921813 | 0.0270091908 | R == Python |
| Two Moons | 1000 | 2 | 0.8179598377 | 0.2832395719 | R == Python |
| 3-Spiral | 312 | 3 | 0.5836514046 | -0.0014176359 | R == Python |
| Complex9 | 3031 | 9 | 0.4209464514 | 0.0551096775 | R == Python |
| Complex8 | 2551 | 8 | 0.0718687652 | 0.0122574749 | R == Python |
| Dartboard1 | 1000 | 4 | 0.8743426861 | -0.0033636711 | R == Python |
| Blobs | 450 | 3 | 0.8548953163 | 0.8575691834 | R == Python |
| Diagonal Blobs | 600 | 3 | 0.7215536253 | 0.7043253397 | R == Python |
Note: Negative K-Means scores on Dartboard1 and 3-Spiral are correct and expected — DISCO penalises algorithms that cut through ring or spiral structures.
discoCVI/
├── R/
│ └── disco.R <- All functions (single combined file)
├── man/ <- Auto-generated by devtools::document()
├── DESCRIPTION
├── NAMESPACE <- Auto-generated by devtools::document()
├── LICENSE
└── README.md
| Score | Meaning |
|---|---|
| 0.7 - 1.0 | Excellent - well-separated, compact clusters |
| 0.4 - 0.7 | Good - clear cluster structure |
| 0.0 - 0.4 | Moderate - overlapping clusters |
| -0.3 - 0.0 | Poor - misassigned points |
| -1.0 - -0.3 | Very poor - clustering worse than random |
# WRONG: dbscan() uses 0 for noise
labels <- dbscan(X, eps = 0.2)$cluster
# CORRECT: DISCO uses -1 for noise
labels <- dbscan(X, eps = 0.2)$cluster - 1L@article{beer2025disco,
title = {DISCO: Internal Evaluation of Density-Based Clustering},
author = {Beer, Anna and Krieger, Lena and Weber, Pascal and
Ritzert, Martin and Assent, Ira and Plant, Claudia},
journal = {arXiv preprint arXiv:2503.00127},
year = {2025}
}R Package:
Amin Entezari (2025). discoCVI: Implementation of the DISCO Metric for
Internal Clustering Evaluation. R package.
https://github.com/aminentezari/discoCVI
MIT © Amin Entezari
Original Python implementation by Beer, Krieger, Weber, Ritzert, Assent, Plant (2025).
For questions: amin_entezari@outlook.com