Cluster-wise assessment of cluster stability
Introduction
Validation is very important in cluster analysis, because clustering methods tend to generate clusterings even for fairly homogeneous data sets. Most clustering methods assume a certain model or prototype for clusters, and this may be adequate for some parts of the data, but not for others. Cluster analysis is often carried out in an exploratory manner, and the patterns found by cluster analysis are not necessarily meaningful.
An important aspect of cluster validity is stability. Stability means that a meaningful valid cluster should not disappear easily if the data set is changed in a non-essential way. There can be several conceptions of what a “non-essential change” of the data set is. In terms of statistical modelling it could be demanded that a data set drawn from the same underlying distribution should give rise to more or less the same clustering (though the true underlying distribution is unknown). It could also be of interest whether clusterings remain stable under the addition of outliers, under subsetting, or under “jittering”, i.e., the addition of a random error to every point to simulate measurement errors.
Given a clustering generated by a clustering method on a data set, the following principle is discussed in the present paper:
- Interpret the Jaccard coefficient (Jaccard, 1901) as a measure of similarity between two subsets of a set based on set membership.
- Resample new data sets from the original one (using various strategies) and apply the clustering method to them.
- For every given cluster in the original clustering, find the most similar cluster in the new clustering and record the similarity value.
- Assess the cluster stability of every single cluster by the mean similarity taken over the resampled data sets.
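The four steps above can be sketched in code. Here `cluster_method` is a placeholder for any function that maps a data set to a list of clusters, each given as a set of point indices; all names are illustrative and not taken from the paper's implementation:

```python
import numpy as np

def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| between two index sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def cluster_stability(X, cluster_method, B=100, rng=None):
    """Mean maximal Jaccard similarity of each original cluster to the
    clusters found on B non-parametric bootstrap resamples (sketch)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    n = len(X)
    original = cluster_method(X)              # list of sets of point indices
    sums = np.zeros(len(original))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # draw n points with replacement
        resampled = cluster_method(X[idx])    # cluster the bootstrap sample
        # translate resampled clusters back to original point indices
        mapped = [set(idx[list(c)]) for c in resampled]
        drawn = set(idx)
        for k, c in enumerate(original):
            # compare on the points actually present in the resample
            sums[k] += max((jaccard(c & drawn, m) for m in mapped), default=0.0)
    return sums / B
```

Note that each original cluster is intersected with the set of drawn points before comparison, since points absent from the bootstrap sample cannot be recovered by any resampled cluster.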
The approach taken in the present article has the following two important characteristics:
- It is applicable to very general clustering methods, including methods based on (not necessarily metric) dissimilarity measures, non-partitioning methods and methods that include an estimator of the number of clusters (so that the determination of this number is not an aim of the present approach), as well as conventional methods based on Euclidean data with a fixed number of clusters such as k-means. No particular cluster model is assumed.
- The approach is cluster-wise. The idea behind this is that many data sets contain meaningful clusters for which a certain cluster model is adequate, but they do not necessarily consist only of such clusters. Therefore, the result of a clustering method may reveal some important meaningful patterns in the data set, while other clusters in the same clustering can be spurious. The reason for this is not necessarily the choice of the wrong clustering method; it may well be that no single method delivers a satisfactory result for the whole data set. Note that none of the approaches in the literature cited above is cluster-wise.
The clustering shown in Fig. 1 has been obtained by a normal mixture model with unrestricted covariance matrices for the mixture components and a noise component modelled as a uniform distribution on the convex hull of the data. The number of clusters has been estimated by the Bayesian information criterion. The procedure is explained in Fraley and Raftery (1998) and implemented in the package MCLUST for the statistical software R. A tuning constant for the initial estimation of the noise component has to be specified; it was chosen so that the distinction between noise and non-noise points is made based on the 10th nearest neighbor of every point, see Byers and Raftery (1998). This is implemented in the R-package PRABCLUS. Several other clustering methods have been carried out on this data set, but none of them led to more convincing results.
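The nearest-neighbor idea behind the noise/non-noise split can be illustrated as follows: points whose distance to their 10th nearest neighbor is unusually large are flagged as noise. The simple quantile cut-off used here is a hypothetical stand-in for the mixture-based decision rule of Byers and Raftery (1998), not the PRABCLUS implementation:

```python
import numpy as np

def knn_noise_flags(X, k=10, quantile=0.9):
    """Flag points whose k-th nearest-neighbor distance exceeds a quantile
    cut-off (illustrative stand-in for Byers-Raftery noise detection)."""
    X = np.asarray(X, dtype=float)
    # full pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # k-th nearest-neighbor distance; column 0 is each point's self-distance
    kth = np.sort(d, axis=1)[:, k]
    return kth > np.quantile(kth, quantile)
```

Points in dense regions have small k-th neighbor distances and are kept; isolated points stand out and are flagged.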
Usually, in such an analysis, the normal components are interpreted as clusters, but this does not seem to be reasonable for all components in the given data set. This motivates the cluster-wise approach: it would be very helpful to know to what extent the normal components can be interpreted as stable patterns of the data, and it can reasonably be suspected that this applies to some but not all of the components. The methods suggested in the present paper confirm stability only for clusters nos. 1, 7 and 8, see Section 5.
Stability is not the only aspect of cluster validity, and therefore a stable cluster is not guaranteed to be a meaningful pattern. With another clustering of the same data set, it will be illustrated why meaningless clusters are sometimes stable.
Some alternative methods of cluster validation are homogeneity- and/or separation-based validation indexes, comparison of different clustering methods on the same data, visual cluster validation, tests of homogeneity of the data set against a clustering alternative, and the use of external information; see Gordon (1999), Halkidi et al. (2002), Hennig (2004b, 2005), Milligan and Cooper (1985) and the references given therein.
The analysis of the sensitivity of a clustering against perturbations of the data has a long history as well, see, e.g., Rand (1971) and Milligan (1996). The adjusted Rand index (Hubert and Arabie, 1985) has been used often to measure the similarity between two complete clusterings.
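For reference, the adjusted Rand index between two complete partitions, given as label vectors, can be computed from their contingency table; a minimal sketch:

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie, 1985) between two
    partitions of the same points, given as label vectors."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    # contingency table: counts of points per pair of clusters
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    table = np.zeros((ia.max() + 1, ib.max() + 1), dtype=np.int64)
    np.add.at(table, (ia, ib), 1)
    comb2 = lambda x: x * (x - 1) // 2          # number of pairs
    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(len(a))
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)
```

The index equals 1 for identical partitions (up to label renaming) and has expected value 0 under random labelling.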
Some work on robustness properties in cluster analysis (e.g., Garcia-Escudero and Gordaliza, 1999, Hennig, 2004a) is also related to the assessment of stability in cluster analysis. It turns out in this work that classical robustness concepts such as the finite sample breakdown point (Donoho and Huber, 1983) are heavily data dependent when applied to cluster analysis.
The paper proceeds as follows. The basic method, based on a non-parametric bootstrap, is introduced in Section 2. Section 3 discusses some alternative approaches to carry out the resampling. The approaches are compared in Section 4 by means of a simulation study. Section 5 applies the methodology to the snails distribution ranges data and a concluding discussion is given in Section 6.
Bootstrapping the Jaccard coefficient
A sequence of mappings (E_n)_{n∈ℕ} is called a general clustering method if E_n maps a set X_n = {x_1, …, x_n} of entities (this is how X_n is always defined throughout the paper) onto a collection of subsets of X_n. Note that it is assumed that entities with different indexes can be distinguished. This means that the elements of X_n are interpreted as data points and that x_i ≠ x_j for i ≠ j, even if, for example, x_i and x_j coincide in terms of their values. It is not assumed how the entities are defined. …
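The role of indexes in this definition can be made concrete: clusters are compared as sets of entity indexes, so two points that coincide in their values remain distinct entities. A minimal illustration:

```python
# Clusters are sets of indices into the data, not sets of values,
# so duplicate data points remain distinct entities.
points = [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0)]  # x_1 equals x_2 as a value
cluster_a = {0, 1}   # entities 1 and 2
cluster_b = {1, 2}   # entities 2 and 3

similarity = len(cluster_a & cluster_b) / len(cluster_a | cluster_b)
print(similarity)  # prints 0.3333333333333333 (= 1/3)
```

Comparing by values instead would merge the two identical points and change the Jaccard coefficient.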
Alternative resampling and simulation schemes
The non-parametric bootstrap is not the only possibility to generate new similar but somewhat distorted data sets from the original data set, which can be used to assess stability. As already observed by Monti et al. (2001), a disadvantage of the non-parametric bootstrap, particularly in connection with cluster analysis, is the occurrence of multiple points in the bootstrapped data set. Multiple points can be seen as miniclusters in themselves. For some implementations of clustering and multidimensional …
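The multiple-points issue is easy to verify empirically: a bootstrap sample of size n drawn with replacement contains on average only about (1 − 1/e) · n ≈ 63% distinct points, whereas subsampling without replacement produces no duplicates at all. A small demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
boot = rng.integers(0, n, size=n)                 # bootstrap: with replacement
sub = rng.choice(n, size=n // 2, replace=False)   # subsample: without

print(len(np.unique(boot)) / n)         # close to 1 - 1/e: duplicates present
print(len(np.unique(sub)) == len(sub))  # True: no duplicates
```

Every duplicated index in `boot` corresponds to a multiple point in the resampled data set, which some clustering or multidimensional scaling implementations cannot handle.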
A simulation study
To assess the performance of a method for cluster stability assessment, it is necessary to find out whether the method can distinguish “better” from “worse” clusters in the same clustering. Therefore data sets have to be constructed in which there are well-defined “true” clusters. Clustering methods have to be applied which make some sense for this kind of data, but which do not necessarily find all of the clusters, and which do not necessarily have to be the best methods for these data.
Data
Data example
Every point in the data set shown in Fig. 1 represents a distribution range of a species of snails in North-Western Europe. The data have been generated from a 0–1 matrix indicating whether each of the 366 species (data points) involved is present on each of 306 grid squares of a grid spanning a map of North-Western Europe. Clustering of such distribution ranges is interesting because some theories about the species differentiation process predict the occurrence (and a particular …
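For illustration, a dissimilarity between such 0–1 distribution ranges could be based on the Jaccard coefficient between the sets of occupied grid squares; this sketch is only an example and not necessarily the dissimilarity actually used for these data:

```python
import numpy as np

def jaccard_distance_matrix(presence):
    """Pairwise Jaccard distances 1 - |A & B| / |A | B| between the rows
    of a 0-1 presence-absence matrix (species x grid squares)."""
    P = np.asarray(presence, dtype=bool)
    inter = (P[:, None, :] & P[None, :, :]).sum(-1)  # shared grid squares
    union = (P[:, None, :] | P[None, :, :]).sum(-1)  # occupied by either
    with np.errstate(invalid="ignore"):              # 0/0 for two empty rows
        d = 1.0 - inter / union
    return np.nan_to_num(d, nan=0.0)
```

The resulting matrix can be fed into any dissimilarity-based clustering method, in line with the generality of the approach described above.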
Discussion
The simulation study and the example suggest that the various schemes to measure the stability of the clusters by computing the average maximum Jaccard coefficient over resampled (or modified) data sets can be very informative. Only the “jittering alone” schemes cannot be recommended. A good strategy in practice can be the use of one of the schemes bootstrap, bootstrap/jittering and subsetting together with one of the noise schemes. The number of bootstrap replications does not have to be …
References
- Ben-Hur, A., Elisseeff, A., Guyon, I., 2002. A stability based method for discovering structure in clustered data. In: ...
- Problems in gene clustering based on gene expression data. J. Multivariate Anal. (2004).
- Byers, S., Raftery, A.E., 1998. Nearest-neighbor clutter removal for estimating features in spatial point processes. J. Amer. Statist. Assoc.
- Cuesta-Albertos, J.A., Gordaliza, A., Matrán, C., 1997. Trimmed k-means: an attempt to robustify quantizers. Ann. Statist.
- Donoho, D.L., Huber, P.J., 1983. The notion of breakdown point.
- Dudoit, S., Fridlyand, J., 2002. A prediction-based resampling method to estimate the number of clusters in a dataset. Genome Biol.
- Fraley, C., Raftery, A.E., 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput. J.
- García-Escudero, L.A., Gordaliza, A., 1999. Robustness properties of k means and trimmed k means. J. Amer. Statist. Assoc.
- Gordon, A.D., 1999. Classification.
- Gower, J.C., Legendre, P., 1986. Metric and Euclidean properties of dissimilarity coefficients. J. Classification.
- Bootstrapping finite mixture models.
- Halkidi, M., Batistakis, Y., Vazirgiannis, M., 2002. Cluster validity methods, Part I. SIGMOD Record.
- Hausdorf, B., Hennig, C. Biotic element analysis in biogeography. Systematic Biol.