Mixture model clustering for mixed data with missing information
Introduction
Missing observations are frequently seen in multivariate data sets. For example, a specimen may be damaged so that not all of its attributes can be measured, or an inexpensive and easily administered test may be given to every item in the sample while a more expensive test is administered only to a random sub-sample of the items. In such situations the data matrix is incomplete, with not all attributes observed for all items. Missing values of this kind can be regarded as accidental missing values.
Review papers on partially missing data include those by Afifi and Elashoff (1966), Hartley and Hocking (1971), Orchard and Woodbury (1972), and Dempster et al. (1977); monographs include those by Little and Rubin (1987) and Schafer (1997). The approaches available for handling such data in classification studies are restricted by the reluctance of the investigator to make assumptions about the data (Gordon, 1999) and by the lack of a formal model for cluster analysis. If the objective is to cluster the data, some technique for handling the incomplete observations must be built into the clustering procedure.
Gordon (1999, p. 26) notes that Gower's (1971) general (dis)similarity coefficient can be used as one strategy to cope with missing variables, by assuming that the contribution that would have been provided by the incompletely recorded variable to the proximity between the two items is equal to the weighted mean of the contributions provided by the variables for which complete information is available.
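As an illustration of this strategy, the sketch below computes Gower's general similarity coefficient between two items, averaging only over the variables observed in both; averaging over the observed variables is equivalent to giving each missing variable the mean contribution of the observed ones. The function name, the numeric coding of categorical levels, and the use of NaN for missing values are our own conventions, not Gower's:

```python
import numpy as np

def gower_similarity(x, y, ranges, is_categorical):
    """Gower's (1971) general similarity between two items, skipping
    variables that are missing (NaN) in either item.

    x, y           : 1-D arrays (categorical levels coded as numbers)
    ranges         : sample range of each continuous variable
    is_categorical : boolean mask marking the categorical variables
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    observed = ~(np.isnan(x) | np.isnan(y))     # usable variables only
    if not observed.any():
        return np.nan                           # no shared information
    scores = np.empty(x.shape)
    # categorical variables score 1 if the levels match, 0 otherwise
    cat = observed & is_categorical
    scores[cat] = (x[cat] == y[cat]).astype(float)
    # continuous variables score 1 - |x - y| / range
    con = observed & ~is_categorical
    scores[con] = 1.0 - np.abs(x[con] - y[con]) / ranges[con]
    # average over the observed variables only
    return scores[observed].mean()
```

For instance, with one matching categorical variable and the continuous variable missing in the second item, the coefficient reduces to the categorical contribution alone.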
Data are described as ‘missing at random’ when the probability that a variable is missing for a particular individual may depend on the values of the observed variables for that individual, but not on the value of the missing variable. That is, the distribution of the missing data mechanism does not depend on the missing values. For example, censored data are certainly not missing at random.
Rubin (1976) showed that the process that causes the missing data can be ignored when making likelihood-based inferences about the parameter of the data if the data are ‘missing at random’ and the parameter of the missing data process is ‘distinct’ from the parameter of the data. When the data are missing in this manner, the appropriate likelihood is simply the density of the observed data, regarded as a function of the parameters. ‘Missing at random’ is a central concept in the work of Little and Rubin (1987).
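For components with conditionally independent variables, this observed-data density is simply the product of the marginals of the observed coordinates, since the missing coordinates integrate out. A minimal illustration for an independent (diagonal-covariance) multivariate normal; the helper below is our own, not from the paper:

```python
import numpy as np

def observed_loglik(x, mean, sd):
    """Log density of the observed coordinates of one item under an
    independent multivariate normal: missing coordinates (NaN) are
    integrated out, so they simply drop from the product of marginals."""
    x, mean, sd = map(np.asarray, (x, mean, sd))
    obs = ~np.isnan(x)
    z = (x[obs] - mean[obs]) / sd[obs]
    return float(np.sum(-0.5 * z ** 2 - np.log(sd[obs] * np.sqrt(2 * np.pi))))
```

With one of two standard-normal coordinates missing, the value is just the univariate standard normal log density at the observed coordinate.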
The EM algorithm of Dempster et al. (1977) is a general iterative procedure for maximum likelihood estimation in incomplete data problems. Their general model includes both the conceptual missing data formulation used in finite mixture models and the accidental missing data discussed earlier. Many authors, for example McLachlan and Krishnan (1997), have discussed the EM algorithm and its properties.
Little and Schluchter (1985) present maximum likelihood procedures using the EM algorithm for the general location model with missing data. They note that their model reduces to that of Day (1969) for K-component normal mixtures when there is one K-level categorical variable that is completely missing. Little and Rubin (1987) and Schafer (1997) point out that parametric mixture models lend themselves well to implementing incomplete-data methods. We implement their approach to produce explicit methodology for clustering mixed (categorical/continuous) data using a mixture likelihood approach when data are missing at random. We illustrate this approach by clustering Byar's prostate cancer data, and show that the proposed methodology can detect meaningful structure in mixed data even when the amount of missing information is fairly extreme.
The mixture approach to clustering data
Suppose that p attributes are measured on n individuals. Let x₁,…,xₙ be the observed values of a random sample from a mixture of K underlying populations in unknown proportions π₁,…,π_K. Let the density of x in the kth group be f_k(x; θ_k), where θ_k is the parameter vector for group k, and let Φ = (π₁,…,π_{K−1}, θ₁,…,θ_K). The density of x can then be written as

f(x; Φ) = ∑_{k=1}^{K} π_k f_k(x; θ_k),

where ∑_{k=1}^{K} π_k = 1 and π_k ⩾ 0 for k = 1,…,K.
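For concreteness, the mixture density f(x; Φ) = ∑ π_k f_k(x; θ_k) can be evaluated as below for univariate normal components. This is an illustrative sketch only; the components in the paper are multivariate and of mixed type:

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def mixture_density(x, weights, means, sds):
    """f(x; Phi) = sum_k pi_k f_k(x; theta_k) for normal components f_k."""
    weights = np.asarray(weights, float)
    # the mixing proportions must form a probability vector
    assert np.all(weights >= 0) and abs(weights.sum() - 1.0) < 1e-9
    return sum(pi * normal_pdf(x, m, s) for pi, m, s in zip(weights, means, sds))
```

An equal-weight mixture of two identical standard normals is, of course, just a standard normal density.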
The EM algorithm of Dempster et al. (1977) is applied to the mixture likelihood to obtain maximum likelihood estimates of the parameters.
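The following is a minimal sketch of such an EM iteration for a Gaussian mixture with conditionally independent (diagonal-covariance) components and NaN entries assumed missing at random: the E-step uses the observed-data density of each item (a product of marginals over its observed variables), and the M-step forms weighted estimates from the observed entries only. It is an illustration of the idea, not the authors' Multimix implementation, and all names are our own:

```python
import numpy as np

def em_mixture_missing(X, K, n_iter=100, init_mu=None, seed=0):
    """EM for a K-component diagonal-covariance Gaussian mixture,
    tolerating NaN entries that are missing at random."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = ~np.isnan(X)                       # observed-data mask
    Xf = np.where(obs, X, 0.0)               # NaNs zeroed for arithmetic
    pi = np.full(K, 1.0 / K)
    if init_mu is None:                      # crude default initialisation
        mu = np.nanmean(X, axis=0) + rng.normal(0.0, 1.0, (K, p))
    else:
        mu = np.array(init_mu, float)
    var = np.tile(np.nanvar(X, axis=0) + 1e-6, (K, 1))
    for _ in range(n_iter):
        # E-step: log observed-data density, summed over observed variables
        logdens = np.zeros((n, K))
        for k in range(K):
            ll = -0.5 * (np.log(2 * np.pi * var[k]) + (Xf - mu[k]) ** 2 / var[k])
            logdens[:, k] = np.log(pi[k] + 1e-300) + np.where(obs, ll, 0.0).sum(axis=1)
        logdens -= logdens.max(axis=1, keepdims=True)   # stabilise
        resp = np.exp(logdens)
        resp /= resp.sum(axis=1, keepdims=True)         # posterior probabilities
        # M-step: weighted estimates from observed entries only
        pi = resp.mean(axis=0)
        for k in range(K):
            w = resp[:, k][:, None] * obs               # weight per observed cell
            tot = w.sum(axis=0) + 1e-12
            mu[k] = (w * Xf).sum(axis=0) / tot
            var[k] = (w * (Xf - mu[k]) ** 2).sum(axis=0) / tot + 1e-6
    return pi, mu, var, resp
```

On two well-separated clusters with a few entries deleted, the posterior probabilities recover the grouping; items contribute to the estimates through whatever coordinates they have observed.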
Application
The approach will be illustrated by considering the clustering of cases on the basis of the pre-trial variables of the prostate cancer clinical trial data of Byar and Green (1980), reproduced in Andrews and Herzberg (1985, pp. 261–274). The data are available at http://lib.stat.cmu.edu/datasets/Andrews/T46.1. The data were obtained from a randomized clinical trial comparing four treatments for 506 patients with prostatic cancer. These patients had been grouped on clinical criteria into Stage 3 or Stage 4 of the disease.
Discussion
When clustering real multivariate data sets having large numbers of attributes, it is rare that all variables are either categorical or continuous as some approaches based on finite mixture models require. The Multimix approach allows the clustering of mixed data containing both types of variables.
Missing values are also a problem in many classification studies. The lack of a formal model restricts the number of approaches that can cope with incomplete data sets. The finite mixture model leads to a formal framework within which incomplete observations can be accommodated in the likelihood itself.
References (28)

- Afifi, A.A., Elashoff, R.M. Missing observations in multivariate statistics I: review of the literature. J. Amer. Statist. Assoc. (1966)
- Andrews, D.F., Herzberg, A.M. Data: A Collection of Problems from Many Fields for the Student and Research Worker (1985)
- Banfield, J.D., Raftery, A.E. Model-based Gaussian and non-Gaussian clustering. Biometrics (1993)
- Beaton, A.E., 1964. The use of special matrix operators in statistical calculus. Educational Testing Service Research...
- Byar, D.P., Green, S.B. The choice of treatment for cancer patients based on covariate information: application to prostate cancer. Bull. Cancer (1980)
- Celeux, G., Soromenho, G. An entropy criterion for assessing the number of clusters in a mixture model. J. Class. (1996)
- Day, N.E. Estimating the components of a mixture of normal distributions. Biometrika (1969)
- Dempster, A.P. Elements of Continuous Multivariate Analysis (1969)
- Dempster, A.P., Laird, N.M., Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B (1977)
- Edwards, D. Introduction to Graphical Modelling (1995)
- Everitt, B.S. A note on parameter estimation for Lazarsfeld's latent class model using the EM algorithm. Multivariate Behavioral Res.
- Gordon, A.D. Classification (1999)
- Gower, J.C. A general coefficient of similarity and some of its properties. Biometrics (1971)
- Hartley, H.O., Hocking, R.R. The analysis of incomplete data. Biometrics (1971)