Mixture model clustering for mixed data with missing information
Introduction
Missing observations are frequently seen in multivariate data sets. For example, a specimen may be damaged so that not all of its attributes can be measured, or an inexpensive and easily administered test may be given to every item in the sample while a more expensive test is administered only to a random sub-sample of the items. In such situations the data matrix is incomplete, with not all attributes observed for all items. Missing values of this kind can be regarded as accidental missing values.
Review papers on partially missing data include those by Afifi and Elashoff (1966), Hartley and Hocking (1971), Orchard and Woodbury (1972), and Dempster et al. (1977); monographs include those by Little and Rubin (1987) and Schafer (1997). The approaches available for handling such data in classification studies are restricted by the reluctance of the investigator to make assumptions about the data (Gordon, 1999) and by the lack of a formal model for cluster analysis. If the objective is to cluster the data, some technique for handling the incomplete observations must be built into the clustering procedure.
Gordon (1999, p. 26) notes that Gower's (1971) general (dis)similarity coefficient can be used as one strategy to cope with missing variables, by assuming that the contribution that would have been provided by the incompletely recorded variable to the proximity between the two items is equal to the weighted mean of the contributions provided by the variables for which complete information is available.
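As an illustration of this strategy, the sketch below computes Gower's general similarity coefficient between two items, averaging only over the variables observed in both; averaging over the observed variables is equivalent to giving each missing variable the mean contribution of the observed ones. The function name, the numeric coding of categorical levels, and the use of NaN for missing values are our own conventions, not Gower's:

```python
import numpy as np

def gower_similarity(x, y, ranges, is_categorical):
    """Gower's (1971) general similarity between two items, skipping
    variables that are missing (NaN) in either item.

    x, y           : 1-D arrays (categorical levels coded as numbers)
    ranges         : sample range of each continuous variable
    is_categorical : boolean mask marking the categorical variables
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    observed = ~(np.isnan(x) | np.isnan(y))     # usable variables only
    if not observed.any():
        return np.nan                           # no shared information
    scores = np.empty(x.shape)
    # categorical variables score 1 if the levels match, 0 otherwise
    cat = observed & is_categorical
    scores[cat] = (x[cat] == y[cat]).astype(float)
    # continuous variables score 1 - |x - y| / range
    con = observed & ~is_categorical
    scores[con] = 1.0 - np.abs(x[con] - y[con]) / ranges[con]
    # average over the observed variables only
    return scores[observed].mean()
```

For instance, with one matching categorical variable and the continuous variable missing in the second item, the coefficient reduces to the categorical contribution alone.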
Data are described as ‘missing at random’ when the probability that a variable is missing for a particular individual may depend on the values of the observed variables for that individual, but not on the value of the missing variable. That is, the distribution of the missing data mechanism does not depend on the missing values. For example, censored data are certainly not missing at random.
Rubin (1976) showed that the process that causes the missing data can be ignored when making likelihood-based inferences about the parameter of the data if the data are ‘missing at random’ and the parameter of the missing data process is ‘distinct’ from the parameter of the data. When the data are missing in this manner, the appropriate likelihood is simply the density of the observed data, regarded as a function of the parameters. ‘Missing at random’ is a central concept in the work of Little and Rubin (1987).
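For components with conditionally independent variables, this observed-data density is simply the product of the marginals of the observed coordinates, since the missing coordinates integrate out. A minimal illustration for an independent (diagonal-covariance) multivariate normal; the helper below is our own, not from the paper:

```python
import numpy as np

def observed_loglik(x, mean, sd):
    """Log density of the observed coordinates of one item under an
    independent multivariate normal: missing coordinates (NaN) are
    integrated out, so they simply drop from the product of marginals."""
    x, mean, sd = map(np.asarray, (x, mean, sd))
    obs = ~np.isnan(x)
    z = (x[obs] - mean[obs]) / sd[obs]
    return float(np.sum(-0.5 * z ** 2 - np.log(sd[obs] * np.sqrt(2 * np.pi))))
```

With one of two standard-normal coordinates missing, the value is just the univariate standard normal log density at the observed coordinate.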
The EM algorithm of Dempster et al. (1977) is a general iterative procedure for maximum likelihood estimation in incomplete data problems. Their general model includes both the conceptual missing data formulation used in finite mixture models and the accidental missing data discussed earlier. Many authors, for example McLachlan and Krishnan (1997), have discussed the EM algorithm and its properties.
Little and Schluchter (1985) present maximum likelihood procedures using the EM algorithm for the general location model with missing data. They note that their model reduces to that of Day (1969) for K-component normal mixtures when there is one K-level categorical variable that is completely missing. Little and Rubin (1987) and Schafer (1997) point out that parametric mixture models lend themselves well to implementing incomplete-data methods. We implement their approach to produce explicit methodology for clustering mixed (categorical/continuous) data using a mixture likelihood approach when data are missing at random. We illustrate this approach by clustering Byar's prostate cancer data, and show that the proposed methodology can detect meaningful structure in mixed data even when the amount of missing information is fairly extreme.
The mixture approach to clustering data
Suppose that p attributes are measured on n individuals. Let x₁,…,xₙ be the observed values of a random sample from a mixture of K underlying populations in unknown proportions π₁,…,π_K. Let the density of x in the kth group be f_k(x; θ_k), where θ_k is the parameter vector for group k, and let Φ = (π₁,…,π_{K−1}, θ₁,…,θ_K). The density of x can then be written as

f(x; Φ) = ∑_{k=1}^{K} π_k f_k(x; θ_k),

where ∑_{k=1}^{K} π_k = 1 and π_k ⩾ 0 for k = 1,…,K.
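For concreteness, the mixture density f(x; Φ) = ∑ π_k f_k(x; θ_k) can be evaluated as below for univariate normal components. This is an illustrative sketch only; the components in the paper are multivariate and of mixed type:

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def mixture_density(x, weights, means, sds):
    """f(x; Phi) = sum_k pi_k f_k(x; theta_k) for normal components f_k."""
    weights = np.asarray(weights, float)
    # the mixing proportions must form a probability vector
    assert np.all(weights >= 0) and abs(weights.sum() - 1.0) < 1e-9
    return sum(pi * normal_pdf(x, m, s) for pi, m, s in zip(weights, means, sds))
```

An equal-weight mixture of two identical standard normals is, of course, just a standard normal density.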
The EM algorithm of Dempster et al. (1977) is applied to the mixture likelihood to obtain maximum likelihood estimates of the parameters.
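The following is a minimal sketch of such an EM iteration for a Gaussian mixture with conditionally independent (diagonal-covariance) components and NaN entries assumed missing at random: the E-step uses the observed-data density of each item (a product of marginals over its observed variables), and the M-step forms weighted estimates from the observed entries only. It is an illustration of the idea, not the authors' Multimix implementation, and all names are our own:

```python
import numpy as np

def em_mixture_missing(X, K, n_iter=100, init_mu=None, seed=0):
    """EM for a K-component diagonal-covariance Gaussian mixture,
    tolerating NaN entries that are missing at random."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = ~np.isnan(X)                       # observed-data mask
    Xf = np.where(obs, X, 0.0)               # NaNs zeroed for arithmetic
    pi = np.full(K, 1.0 / K)
    if init_mu is None:                      # crude default initialisation
        mu = np.nanmean(X, axis=0) + rng.normal(0.0, 1.0, (K, p))
    else:
        mu = np.array(init_mu, float)
    var = np.tile(np.nanvar(X, axis=0) + 1e-6, (K, 1))
    for _ in range(n_iter):
        # E-step: log observed-data density, summed over observed variables
        logdens = np.zeros((n, K))
        for k in range(K):
            ll = -0.5 * (np.log(2 * np.pi * var[k]) + (Xf - mu[k]) ** 2 / var[k])
            logdens[:, k] = np.log(pi[k] + 1e-300) + np.where(obs, ll, 0.0).sum(axis=1)
        logdens -= logdens.max(axis=1, keepdims=True)   # stabilise
        resp = np.exp(logdens)
        resp /= resp.sum(axis=1, keepdims=True)         # posterior probabilities
        # M-step: weighted estimates from observed entries only
        pi = resp.mean(axis=0)
        for k in range(K):
            w = resp[:, k][:, None] * obs               # weight per observed cell
            tot = w.sum(axis=0) + 1e-12
            mu[k] = (w * Xf).sum(axis=0) / tot
            var[k] = (w * (Xf - mu[k]) ** 2).sum(axis=0) / tot + 1e-6
    return pi, mu, var, resp
```

On two well-separated clusters with a few entries deleted, the posterior probabilities recover the grouping; items contribute to the estimates through whatever coordinates they have observed.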
Application
The approach will be illustrated by considering the clustering of cases on the basis of the pre-trial variables of the prostate cancer clinical trial data of Byar and Green (1980), reproduced in Andrews and Herzberg (1985, pp. 261–274). The data are available at http://lib.stat.cmu.edu/datasets/Andrews/T46.1. The data were obtained from a randomized clinical trial comparing four treatments for 506 patients with prostatic cancer. These patients had been grouped on clinical criteria into Stage 3 or Stage 4 of the disease.
Discussion
When clustering real multivariate data sets having large numbers of attributes, it is rare that all variables are either categorical or continuous as some approaches based on finite mixture models require. The Multimix approach allows the clustering of mixed data containing both types of variables.
Missing values are also a problem in many classification studies. The lack of a formal model restricts the number of approaches that can cope with incomplete data sets. The finite mixture model leads to a formal framework within which incomplete observations can be accommodated in the likelihood itself.
References (28)

- Afifi, A.A., Elashoff, R.M. Missing observations in multivariate statistics I: review of the literature. J. Amer. Statist. Assoc. (1966)
- Andrews, D.F., Herzberg, A.M. Data: A Collection of Problems from Many Fields for the Student and Research Worker (1985)
- Banfield, J.D., Raftery, A.E. Model-based Gaussian and non-Gaussian clustering. Biometrics (1993)
- Beaton, A.E., 1964. The use of special matrix operators in statistical calculus. Educational Testing Service Research...
- Byar, D.P., Green, S.B. The choice of treatment for cancer patients based on covariate information: application to prostate cancer. Bull. Cancer (1980)
- Celeux, G., Soromenho, G. An entropy criterion for assessing the number of clusters in a mixture model. J. Class. (1996)
- Day, N.E. Estimating the components of a mixture of normal distributions. Biometrika (1969)
- Dempster, A.P. Elements of Continuous Multivariate Analysis (1969)
- Dempster, A.P., Laird, N.M., Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B (1977)
- Edwards, D. Introduction to Graphical Modelling (1995)
- Everitt, B.S. A note on parameter estimation for Lazarsfeld's latent class model using the EM algorithm. Multivariate Behavioral Res.
- Gordon, A.D. Classification (1999)
- Gower, J.C. A general coefficient of similarity and some of its properties. Biometrics (1971)
- Hartley, H.O., Hocking, R.R. The analysis of incomplete data. Biometrics (1971)