Cluster-wise assessment of cluster stability
Introduction
Validation is very important in cluster analysis, because clustering methods tend to generate clusterings even for fairly homogeneous data sets. Most clustering methods assume a certain model or prototype for clusters, and this may be adequate for some parts of the data, but not for others. Cluster analysis is often carried out in an exploratory manner, and the patterns found by cluster analysis are not necessarily meaningful.
An important aspect of cluster validity is stability. Stability means that a meaningful valid cluster should not disappear easily if the data set is changed in a non-essential way. There can be several conceptions of what a “non-essential change” of the data set is. In terms of statistical modelling it could be demanded that a data set drawn from the same underlying distribution should give rise to more or less the same clustering (though the true underlying distribution is unknown). It could also be of interest whether clusterings remain stable under the addition of outliers, under subsetting, or under “jittering”, i.e., the addition of a random error to every point to simulate measurement errors.
Given a clustering generated by a clustering method on a data set, the following principle is discussed in the present paper:
- Interpret the Jaccard coefficient (Jaccard, 1901) as a measure of similarity between two subsets of a set based on set membership.
- Resample new data sets from the original one (using various strategies) and apply the clustering method to them.
- For every given cluster in the original clustering, find the most similar cluster in the new clustering and record the similarity value.
- Assess the cluster stability of every single cluster by the mean similarity taken over the resampled data sets.
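The four steps above can be sketched in code. Here `cluster_method` is a placeholder for any function that maps a data set to a list of clusters, each given as a set of point indices; all names are illustrative and not taken from the paper's implementation:

```python
import numpy as np

def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| between two index sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def cluster_stability(X, cluster_method, B=100, rng=None):
    """Mean maximal Jaccard similarity of each original cluster to the
    clusters found on B non-parametric bootstrap resamples (sketch)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    n = len(X)
    original = cluster_method(X)              # list of sets of point indices
    sums = np.zeros(len(original))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # draw n points with replacement
        resampled = cluster_method(X[idx])    # cluster the bootstrap sample
        # translate resampled clusters back to original point indices
        mapped = [set(idx[list(c)]) for c in resampled]
        drawn = set(idx)
        for k, c in enumerate(original):
            # compare on the points actually present in the resample
            sums[k] += max((jaccard(c & drawn, m) for m in mapped), default=0.0)
    return sums / B
```

Note that each original cluster is intersected with the set of drawn points before comparison, since points absent from the bootstrap sample cannot be recovered by any resampled cluster.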
The approach taken in the present article has the following two important characteristics:
- It is applicable to very general clustering methods, including methods based on (not necessarily metric) dissimilarity measures, non-partitioning methods and methods that include an estimator of the number of clusters (so that the determination of this number is not an aim of the present approach), as well as conventional methods based on Euclidean data with a fixed number of clusters such as k-means. No particular cluster model is assumed.
- The approach is cluster-wise. The idea behind this is that many data sets contain meaningful clusters for which a certain cluster model is adequate, but they do not necessarily consist only of such clusters. Therefore, the result of a clustering method may reveal some important meaningful patterns in the data set, while other clusters in the same clustering can be spurious. The reason for this is not necessarily the choice of the wrong clustering method; it may well be that no single method delivers a satisfactory result for the whole data set. Note that none of the approaches in the literature cited above is cluster-wise.
The clustering shown in Fig. 1 has been obtained by a normal mixture model with unrestricted covariance matrices for the mixture components and a noise component modelled as a uniform distribution on the convex hull of the data. The number of clusters has been estimated by the Bayesian information criterion. The procedure is explained in Fraley and Raftery (1998) and implemented in the package MCLUST for the statistical software R. A tuning constant for the initial estimation of the noise component has to be specified; it was chosen so that the distinction between noise and non-noise points is made based on the 10th nearest neighbor of every point, see Byers and Raftery (1998). This is implemented in the R-package PRABCLUS. Several other clustering methods have been carried out on this data set, but none of them led to more convincing results.
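The nearest-neighbor idea behind the noise/non-noise split can be illustrated as follows: points whose distance to their 10th nearest neighbor is unusually large are flagged as noise. The simple quantile cut-off used here is a hypothetical stand-in for the mixture-based decision rule of Byers and Raftery (1998), not the PRABCLUS implementation:

```python
import numpy as np

def knn_noise_flags(X, k=10, quantile=0.9):
    """Flag points whose k-th nearest-neighbor distance exceeds a quantile
    cut-off (illustrative stand-in for Byers-Raftery noise detection)."""
    X = np.asarray(X, dtype=float)
    # full pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # k-th nearest-neighbor distance; column 0 is each point's self-distance
    kth = np.sort(d, axis=1)[:, k]
    return kth > np.quantile(kth, quantile)
```

Points in dense regions have small k-th neighbor distances and are kept; isolated points stand out and are flagged.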
Usually, in such an analysis, the normal components are interpreted as clusters, but this does not seem to be reasonable for all components in the given data set. This motivates the cluster-wise approach: it would be very helpful to know to what extent the normal components can be interpreted as stable patterns of the data, and it can reasonably be suspected that this applies to some but not all of the components. The methods suggested in the present paper confirm stability only for clusters nos. 1, 7 and 8, see Section 5.
Stability is not the only aspect of cluster validity, and therefore a stable cluster is not guaranteed to be a meaningful pattern. With another clustering of the same data set, it will be illustrated why meaningless clusters are sometimes stable.
Some alternative methods of cluster validation are homogeneity- and/or separation-based validation indexes, comparison of different clustering methods on the same data, visual cluster validation, tests of homogeneity of the data set against a clustering alternative, and the use of external information; see Gordon (1999), Halkidi et al. (2002), Hennig (2004b, 2005), Milligan and Cooper (1985) and the references given therein.
The analysis of the sensitivity of a clustering against perturbations of the data has a long history as well, see, e.g., Rand (1971) and Milligan (1996). The adjusted Rand index (Hubert and Arabie, 1985) has been used often to measure the similarity between two complete clusterings.
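For reference, the adjusted Rand index between two complete partitions, given as label vectors, can be computed from their contingency table; a minimal sketch:

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie, 1985) between two
    partitions of the same points, given as label vectors."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    # contingency table: counts of points per pair of clusters
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    table = np.zeros((ia.max() + 1, ib.max() + 1), dtype=np.int64)
    np.add.at(table, (ia, ib), 1)
    comb2 = lambda x: x * (x - 1) // 2          # number of pairs
    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(len(a))
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)
```

The index equals 1 for identical partitions (up to label renaming) and has expected value 0 under random labelling.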
Some work on robustness properties in cluster analysis (e.g., Garcia-Escudero and Gordaliza, 1999, Hennig, 2004a) is also related to the assessment of stability in cluster analysis. It turns out in this work that classical robustness concepts such as the finite sample breakdown point (Donoho and Huber, 1983) are heavily data dependent when applied to cluster analysis.
The paper proceeds as follows. The basic method, based on a non-parametric bootstrap, is introduced in Section 2. Section 3 discusses some alternative approaches to carry out the resampling. The approaches are compared in Section 4 by means of a simulation study. Section 5 applies the methodology to the snails distribution ranges data and a concluding discussion is given in Section 6.
Bootstrapping the Jaccard coefficient
A sequence of mappings (E_n)_{n∈ℕ} is called a general clustering method if E_n maps a set X_n = {x_1, …, x_n} of entities (this is how X_n is always defined throughout the paper) onto a collection of subsets of X_n. Note that it is assumed that entities with different indexes can be distinguished. This means that the elements of X_n are interpreted as data points and that x_i ≠ x_j for i ≠ j, even if, for example, x_i and x_j coincide in terms of their values. It is not assumed how the entities are defined. …
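The role of indexes in this definition can be made concrete: clusters are compared as sets of entity indexes, so two points that coincide in their values remain distinct entities. A minimal illustration:

```python
# Clusters are sets of indices into the data, not sets of values,
# so duplicate data points remain distinct entities.
points = [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0)]  # x_1 equals x_2 as a value
cluster_a = {0, 1}   # entities 1 and 2
cluster_b = {1, 2}   # entities 2 and 3

similarity = len(cluster_a & cluster_b) / len(cluster_a | cluster_b)
print(similarity)  # prints 0.3333333333333333 (= 1/3)
```

Comparing by values instead would merge the two identical points and change the Jaccard coefficient.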
Alternative resampling and simulation schemes
The non-parametric bootstrap is not the only possibility to generate new similar but somewhat distorted data sets from the original data set, which can be used to assess stability. As already observed by Monti et al. (2001), a disadvantage of the non-parametric bootstrap, particularly in connection with cluster analysis, is the occurrence of multiple points in the bootstrapped data set. Multiple points can be seen as miniclusters in themselves. For some implementations of clustering and multidimensional …
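The multiple-points issue is easy to verify empirically: a bootstrap sample of size n drawn with replacement contains on average only about (1 − 1/e) · n ≈ 63% distinct points, whereas subsampling without replacement produces no duplicates at all. A small demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
boot = rng.integers(0, n, size=n)                 # bootstrap: with replacement
sub = rng.choice(n, size=n // 2, replace=False)   # subsample: without

print(len(np.unique(boot)) / n)         # close to 1 - 1/e: duplicates present
print(len(np.unique(sub)) == len(sub))  # True: no duplicates
```

Every duplicated index in `boot` corresponds to a multiple point in the resampled data set, which some clustering or multidimensional scaling implementations cannot handle.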
A simulation study
To assess the performance of a method for cluster stability assessment, it is necessary to find out whether the method can distinguish “better” from “worse” clusters in the same clustering. Therefore data sets have to be constructed in which there are well-defined “true” clusters. Clustering methods have to be applied which make some sense for this kind of data, but which do not necessarily find all of the clusters, and which do not necessarily have to be the best methods for these data.
Data
Data example
Every point in the data set shown in Fig. 1 represents a distribution range of a species of snails in North-Western Europe. The data have been generated from a 0–1 matrix indicating whether each of the 366 species (data points) involved is present on each of 306 grid squares of a grid spanning a map of North-Western Europe. Clustering of such distribution ranges is interesting because some theories about the species differentiation process predict the occurrence (and a particular …
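For illustration, a dissimilarity between such 0–1 distribution ranges could be based on the Jaccard coefficient between the sets of occupied grid squares; this sketch is only an example and not necessarily the dissimilarity actually used for these data:

```python
import numpy as np

def jaccard_distance_matrix(presence):
    """Pairwise Jaccard distances 1 - |A & B| / |A | B| between the rows
    of a 0-1 presence-absence matrix (species x grid squares)."""
    P = np.asarray(presence, dtype=bool)
    inter = (P[:, None, :] & P[None, :, :]).sum(-1)  # shared grid squares
    union = (P[:, None, :] | P[None, :, :]).sum(-1)  # occupied by either
    with np.errstate(invalid="ignore"):              # 0/0 for two empty rows
        d = 1.0 - inter / union
    return np.nan_to_num(d, nan=0.0)
```

The resulting matrix can be fed into any dissimilarity-based clustering method, in line with the generality of the approach described above.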
Discussion
The simulation study and the example suggest that the various schemes to measure the stability of the clusters by computing the average maximum Jaccard coefficient over resampled (or modified) data sets can be very informative. Only the “jittering alone” schemes cannot be recommended. A good strategy in practice can be the use of one of the schemes bootstrap, bootstrap/jittering and subsetting together with one of the noise schemes. The number of bootstrap replications does not have to be …
References
- Ben-Hur, A., Elisseeff, A., Guyon, I., 2002. A stability based method for discovering structure in clustered data. In: ...
- Problems in gene clustering based on gene expression data. J. Multivariate Anal. (2004).
- Byers, S., Raftery, A.E., 1998. Nearest-neighbor clutter removal for estimating features in spatial point processes. J. Amer. Statist. Assoc.
- Cuesta-Albertos, J.A., Gordaliza, A., Matrán, C., 1997. Trimmed k-means: an attempt to robustify quantizers. Ann. Statist.
- Donoho, D.L., Huber, P.J., 1983. The notion of breakdown point.
- Dudoit, S., Fridlyand, J., 2002. A prediction-based resampling method to estimate the number of clusters in a dataset. Genome Biol.
- Fraley, C., Raftery, A.E., 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput. J.
- García-Escudero, L.A., Gordaliza, A., 1999. Robustness properties of k means and trimmed k means. J. Amer. Statist. Assoc.
- Gordon, A.D., 1999. Classification.
- Gower, J.C., Legendre, P., 1986. Metric and Euclidean properties of dissimilarity coefficients. J. Classification.
- Bootstrapping finite mixture models.
- Halkidi, M., Batistakis, Y., Vazirgiannis, M., 2002. Cluster validity methods, Part I. SIGMOD Record.
- Hausdorf, B., Hennig, C. Biotic element analysis in biogeography. Systematic Biol.