Airway disease in childhood comprises a heterogeneous group of disorders. Attempts to distinguish different phenotypes have generally considered few disease dimensions. The present study examines phenotypes of childhood wheeze and chronic cough, by fitting a statistical model to data representing multiple disease dimensions.
From a population-based, longitudinal cohort study of 1,650 preschool children, 319 with parent-reported wheeze or chronic cough were included. Phenotypes were identified by latent class analysis using data on symptoms, skin-prick tests, lung function and airway responsiveness from two preschool surveys. These phenotypes were then compared with respect to outcome at school age.
The model distinguished three phenotypes of wheeze and two phenotypes of chronic cough. Subsequent wheeze, chronic cough and inhaler use at school age differed clearly between the five phenotypes. The wheeze phenotypes shared features with previously described entities and partly reconciled discrepancies between existing sets of phenotype labels.
This novel, multidimensional approach has the potential to identify clinically relevant phenotypes, not only in paediatric disorders but also in adult obstructive airway diseases, where phenotype definition is an equally important issue.
It is widely accepted that childhood asthma comprises several distinct disorders, characterised by the common symptom of wheeze 1–4. Distinguishing between these disorders is clinically important, since aetiology, pathophysiology, potential for therapy and outcome may differ 1, 5, 6. Similarly, it has been emphasised that, although some children with chronic cough might suffer from a variant form of asthma, “lumping” together all chronic coughers under the term “cough-variant asthma” is probably wrong 7.
Obstructive airway diseases clearly have multiple dimensions which involve atopy, disordered lung function, airway responsiveness and a variety of symptoms. Despite this, traditional phenotype definitions have used simple distinctions, such as a clinical classification into “exclusive viral wheeze” triggered only by colds and “multiple-trigger wheeze” triggered also by other factors 8, or a retrospective classification by symptom history into “early transient”, “persistent” and “late-onset” wheeze 2, 3. Since they are limited to single dimensions, such phenotype definitions embody an arbitrary element and may not properly reflect underlying disease processes. Furthermore, it is unclear how the different sets of phenotype labels relate to each other and whether they identify similar entities. For instance, is “exclusive viral wheeze” the same condition as “early transient wheeze”? An agreed system of classification that appropriately reflects underlying disease processes and, potentially, therapeutic responses is still lacking. It has been proposed that statistical methods which can account for multiple dimensions of airway disease may facilitate the identification of relevant phenotypes 9.
Latent class analysis (LCA) 10, 11 is a statistical method developed in the social sciences, which is used to identify distinct subsets (classes) underlying the observed heterogeneity in a population. Such classes are not directly observable and must be determined from the observed data. LCA has recently been used in medical research to identify disease phenotypes 12, 13. The aims of the present study were: 1) to apply LCA to a multivariate data set, combining symptoms and physiological measurements in order to identify and describe phenotypes of wheeze and cough in childhood; and 2) to explore the validity of the resultant phenotypes by assessing how well they predicted future outcomes. The emphasis of the present paper is on the potential of this approach to identify phenotypes of obstructive airway disease.
MATERIALS AND METHODS
Subjects and study design
In a population-based cohort study of 1,650 white children recruited in 1990 at the age of 0–5 yrs in Leicestershire, UK 14–19, parents completed postal questionnaires on respiratory symptoms, exposures and sociodemographic characteristics in 1990, 1998 and 2003. Between 1992 and 1994, a nested sample of 795 children was invited for physiological measurements and interviews 16, 17, including 222 children with parent-reported wheeze and 226 children with chronic cough (cough occurring apart from colds) in 1990, and a random sample of 347 previously asymptomatic children. The study was approved by the Leicester Health Authority Committee on the Ethics of Clinical Research Investigation.
Identification of phenotypes was based on data from the first two surveys (1990 and 1992–1994). Of the 488 respondents to the second survey (1992–1994), data were analysed from 319 with a positive response in either survey (1990 or 1992–1994) to one or both of the following questions: “Has your child ever had attacks of wheezing?” and “Does he/she usually have a cough apart from colds?” (fig. 1).
In the next step, prognosis was compared across phenotypes, using data on current (i.e. within the previous 12 months) wheeze, frequent wheeze, bronchodilator use and cough without colds from two recent surveys, carried out in 1998 and 2003, when the children were aged 8–13 and 13–18 yrs, respectively. All children who were asymptomatic in the first two surveys, 169 in total, served as a control group.
Physiological measurements included in the present analysis were age- and height-standardised z-scores 20 of the pre-bronchodilator forced expiratory volume in 0.5 s (FEV0.5), bronchial responsiveness (provocative concentration of methacholine causing a 20% decrease in transcutaneous oxygen tension (PC20-Ptc,O2)) 21, and atopy assessed by skin-prick testing. Subjects responding to one or more of four aeroallergens (cat hair, dog dander, Dermatophagoides pteronyssinus and mixed grass pollen) were designated atopic (for further details see supplementary data).
To identify phenotypes, LCA was applied to a set of variables measured in the sample of 319 children during the first two surveys. LCA assumes that the population is composed of subpopulations (latent classes), each having its distinctive distribution of the included variables 10. If these variables represent disease manifestations, the latent classes can be interpreted as clinical phenotypes. Application of LCA involves two prior decisions: 1) which variables to include, and 2) how many latent classes to include in the model. When choosing which variables to include, there has to be a balance between using all potentially relevant information and the need to limit the number of parameters in the model. In the present study all parent-reported symptom data relating to cough and wheeze from the first two surveys and all measurements of atopy, lung function and bronchial responsiveness were considered for inclusion. Multiple correspondence analysis 22 was then used to make a narrower selection. In addition, the variables age and sex were included (tables 1 and 2). In order to choose the appropriate number of latent classes, the model was repeatedly fitted with the number of classes increasing stepwise from 1 (model 1) to 7 (model 7). These models were then compared using bootstrapped p-values for the likelihood ratio (LR) test and the Bayesian information criterion (BIC) 10.
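In general terms, a latent class model of this kind can be written as a finite mixture, and the BIC as a penalised likelihood. The notation below (K, π_k, θ_jk, p, n) is introduced here for illustration only and is not taken from the original analysis. The product form assumes that the variables are independent within each class, which is the usual LCA assumption; the model actually fitted may relax it.

```latex
\[
f(\mathbf{y}_i) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} f_{jk}\left(y_{ij} \mid \theta_{jk}\right),
\qquad \sum_{k=1}^{K} \pi_k = 1
\]
\[
\mathrm{BIC} = -2 \log \hat{L} + p \log n
\]
```

Here π_k is the proportion of the population in class k, f_jk is the class-specific distribution of variable j (categorical or normal), L̂ is the maximised likelihood, p is the number of free parameters and n is the sample size. Written this way, a lower BIC is preferred, and every additional class has to improve the fit by more than the penalty its extra parameters incur.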
The model was fitted by maximum likelihood estimation using Multimix, a Fortran program designed to fit latent class models including both continuous and categorical variables 24. The variables FEV0.5 and log-transformed PC20-Ptc,O2 23 were treated as continuous with a normal distribution and all other variables as categorical. The program was adapted to deal with missing data 25 and conditional questions, such as questions on shortness of breath or seasonality of symptoms, which were asked only of those children reporting wheeze ever. For further details on the modelling approach see the supplementary data.
LCA allows computing of the probability of belonging to a particular phenotype given the observed features of a subject. As is common practice in LCA 10, each child in the sample was assigned to the phenotype for which it had the highest membership probability. In the present study, groups of children assigned in this way to different phenotypes are referred to as “phenotype clusters”. Two-sided Fisher’s exact tests were used to test associations between phenotype clusters and prognostic end-points. A Bonferroni-corrected significance level was used to account for multiple pair-wise testing.
RESULTS
The sample used for phenotype definition (n = 319) consisted of 189 (59%) children with wheeze ever reported in 1990 and/or in 1992–1994 and 130 (41%) children with cough apart from colds (but no wheeze), reported in at least one survey. The sample contained 160 (50%) females and the median (range) age was 3.3 (0.3–5.4) and 6.3 (4.1–8.8) yrs in 1990 and 1992–1994, respectively. The healthy control group consisted of 169 asymptomatic children.
The two criteria applied to determine the number of phenotypes did not agree: the bootstrapped p-values for the LR test indicated five phenotypes (model 5), while the BIC preferred a model with only two (model 2). As this method is exploratory and has the potential to reveal new phenotypes, the authors chose to present model 5 (tables 1 and 2), knowing that the heterogeneity in the data might be sufficiently represented by fewer phenotypes (see tables E2–E5 in the supplementary data for detailed results for the models with two to five phenotypes). The main characteristics of the five phenotypes are summarised as follows (details in tables 1 and 2). To simplify the discussion, each phenotype was given a summary label describing its most pertinent characteristics.
Phenotype A: persistent cough
Children with this phenotype typically suffered from cough apart from colds at both surveys. Wheeze ever was more common than in phenotype B but considerably less common than in phenotypes C, D and E. FEV0.5 values tended to be slightly lower and bronchial responsiveness greater than in asymptomatic children.
Phenotype B: transient cough
Cough apart from colds occurred only in the first survey and wheeze ever was rarely reported. FEV0.5 and bronchial responsiveness were comparable with those of asymptomatic children.
Phenotype C: atopic persistent wheeze
Attacks of wheeze were frequent in both surveys. Attacks occurred with and without colds and were commonly accompanied by shortness of breath. For almost a third of the children with this phenotype, summer was the season with more frequent attacks in the second survey. Cough apart from colds and being woken at night by cough were common. Sensitisation to at least one allergen was likely. FEV0.5 values were typically lower and bronchial responsiveness greater than in asymptomatic children.
Phenotype D: nonatopic persistent wheeze
Attacks of wheeze were likely in both surveys, although not as frequent as in phenotype C. Attacks tended to be accompanied by shortness of breath and occurred with and without colds. They were generally worse at night and, in the second survey, were more common in winter. Atopic sensitisation was rare. FEV0.5 was similar to, and bronchial responsiveness greater than, that in asymptomatic children.
Phenotype E: transient viral wheeze
Attacks of wheeze tended to occur prior to the first survey or, if reported at the first survey, were infrequent. Attacks had subsided by the second survey. Wheeze tended to occur only with colds. FEV0.5 was similar to that in asymptomatic children; bronchial responsiveness was slightly greater.
For each child in the sample, membership probabilities were computed for each of the identified phenotypes. Children were then assigned to the phenotypes for which they had the highest probability (phenotype clusters). For 271 (85%) children, the highest membership probability was >0.9 indicating clear membership, while for nine (3%) children, the highest membership probability was <0.6 indicating more ambiguous membership.
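The assignment step described above can be sketched in a few lines. This is a minimal illustration, not the fitted model: the two classes, their parameters and all function names are hypothetical, variables are assumed independent within a class, and a missing value (None) is simply left out of the likelihood, which is one simple way of handling it and not necessarily the adaptation used in the study. The 0.9 and 0.6 cut-offs are those quoted in the text; the label "intermediate" is added here for the range in between.

```python
import math

# Hypothetical two-class model, for illustration only (not the fitted model).
CLASSES = {
    "persistent": {"weight": 0.4, "p_wheeze": 0.9, "fev_mean": -0.8, "fev_sd": 1.0},
    "transient": {"weight": 0.6, "p_wheeze": 0.2, "fev_mean": 0.0, "fev_sd": 1.0},
}


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def membership_probabilities(wheeze, fev_z):
    """Posterior class probabilities by Bayes' rule.

    wheeze is True/False/None (categorical), fev_z is a float or None
    (continuous, normal within each class). None means "missing" and the
    variable is skipped.
    """
    joint = {}
    for name, c in CLASSES.items():
        lik = c["weight"]
        if wheeze is not None:
            lik *= c["p_wheeze"] if wheeze else 1.0 - c["p_wheeze"]
        if fev_z is not None:
            lik *= normal_pdf(fev_z, c["fev_mean"], c["fev_sd"])
        joint[name] = lik
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}


def assign(probs):
    """Modal assignment: the class with the highest membership probability."""
    return max(probs, key=probs.get)


def membership_clarity(probs):
    """Grade the assignment using the cut-offs quoted in the text."""
    top = max(probs.values())
    if top > 0.9:
        return "clear"
    if top < 0.6:
        return "ambiguous"
    return "intermediate"


probs = membership_probabilities(wheeze=True, fev_z=-1.0)
print(assign(probs), membership_clarity(probs))
```

Because the probabilities are retained alongside the assigned label, children with ambiguous membership can be identified and, if desired, examined separately.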
To investigate the relationship between phenotypes identified in the sequential steps of the analysis (models 1–5), the number of children “flowing” from the phenotype clusters of a given model into the clusters of the subsequent model with one more phenotype was determined (fig. 2). The phenotypes showed a high degree of stability across models. Children grouped into one phenotype at an early stage tended to be grouped together again at later stages. Thus, four of the phenotypes in the five-phenotype model were essentially distinguished at earlier stages (phenotypes A and B by model 4 (clusters 4A and 4B) and phenotypes C and E by model 3 (3B and 3C)), with phenotype D appearing as the only “new” phenotype at the fifth stage.
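Counting this “flow” amounts to cross-tabulating each child's cluster label in one model against the label in the next. The sketch below shows the idea with made-up labels for seven children; the function name and the data are hypothetical and do not reproduce the study's results.

```python
from collections import Counter


def cluster_flow(earlier, later):
    """Count children moving from each cluster of one model to each cluster of the next.

    earlier and later are sequences of cluster labels, one entry per child,
    in the same child order.
    """
    if len(earlier) != len(later):
        raise ValueError("both models must label the same children")
    return Counter(zip(earlier, later))


# Made-up cluster labels for seven children under two successive models.
model3 = ["3A", "3A", "3B", "3B", "3C", "3C", "3A"]
model4 = ["4A", "4B", "4C", "4C", "4D", "4D", "4A"]

flow = cluster_flow(model3, model4)
for (source, target), count in sorted(flow.items()):
    print(f"{source} -> {target}: {count}")
```

A cluster whose members all move to a single cluster in the next model is stable; a cluster whose members split between two targets is the one being subdivided.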
Comparing prognosis across identified phenotypes
In 1998, at age 8–13 yrs (fig. 3), the prevalence of current wheeze was highest in phenotype cluster C (atopic persistent wheeze; 37 (71%) out of 52 respondents), lower in phenotype cluster D (nonatopic persistent wheeze; 14 (35%) out of 40), followed by A (persistent cough; 21 (25%) out of 84) and E (transient viral wheeze; eight (24%) out of 34), and lowest in B (transient cough; seven (10%) out of 72) and in asymptomatics (17 (11%) out of 158). A similar pattern was found for the outcomes of frequent wheeze (at least four attacks in the last 12 months) and use of bronchodilators.
Statistical tests for differences in the prevalence of the four prognostic end-points between the phenotype clusters were performed. The present authors were interested in pair-wise comparisons between children with persistent cough (phenotype A) and asymptomatics and between the two cough phenotypes (A and B), because persistent coughers represented a novel group (see Discussion section). It is still disputed whether children with chronic cough, or a subgroup of them, have a different probability of developing wheeze compared with asymptomatic children. The present authors also tested for differences between the two more persistent wheeze phenotypes (C and D). In order to limit the problem of multiple testing, no more pair-wise comparisons were considered. The Bonferroni-corrected significance level for these tests was 0.0042 (overall significance level (0.05) divided by number of tests (12)). The outcomes at 8–13 yrs (fig. 3) tended to be more prevalent in cluster C than in D, with significant differences for current wheeze (p = 0.001) and for use of bronchodilators (p = 0.002). Prognosis of asthma-related outcomes tended to be worse for phenotype cluster A (persistent cough) than for phenotype cluster B (transient cough) and asymptomatics, with significant differences for use of bronchodilators (p<0.001 and p = 0.001, respectively). Prevalence of cough apart from colds at 8–13 yrs was higher in A (44%) than in B (18%; p = 0.001) and asymptomatics (12%; p<0.001).
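The Bonferroni threshold and one of the pair-wise comparisons can be sketched as follows. This is a minimal pure-Python illustration, not the software used in the study. The function name is hypothetical, and the two-sided p-value is computed by summing the probabilities of all tables no more likely than the observed one, which is one common convention. The counts are the current-wheeze figures reported above for clusters C (37 out of 52) and D (14 out of 40) at age 8–13 yrs.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so that ties are not lost to rounding.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# Bonferroni correction: overall level divided by the number of tests.
alpha = 0.05 / 12

# Current wheeze at 8-13 yrs: cluster C 37 yes / 15 no, cluster D 14 yes / 26 no.
p_c_vs_d = fisher_exact_two_sided(37, 15, 14, 26)
print(f"threshold = {alpha:.4f}, significant = {p_c_vs_d < alpha}")
```

Dividing by 12 rather than by the number of comparisons of primary interest is conservative, which is why differences with p-values between 0.0042 and 0.05 are described as marked but not significant.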
In 2003, at age 13–18 yrs (fig. 3), prognostic differences between phenotype clusters remained qualitatively similar for all four outcomes. Marked differences, although not significant at the Bonferroni-corrected level, remained between C and D for current wheeze (56 versus 31%; p = 0.038) and inhaler use (65 versus 36%; p = 0.013). Prevalence of cough apart from colds again differed significantly between A (41%) and B (16%; p = 0.002).
DISCUSSION
The present study describes a novel approach to phenotype recognition in children with wheeze and cough using LCA. By applying this method to data on respiratory symptoms and physiological measurements from a population-based childhood cohort, three wheeze phenotypes and two cough phenotypes were identified. These phenotypes were predictive of outcomes at school age and later childhood. What distinguishes these entities from previously used phenotypes is that they were derived directly from data, rather than defined a priori, and that they account for multiple disease dimensions.
LCA as a multidimensional clustering technique
Historically, clinicians have refined diagnosis by resolving complex diseases into discrete, clinically useful subsets. These subsets, referred to as disease phenotypes, provide a way of classifying patients into groups of individuals with similar disease characteristics. In the present study, phenotypes were treated as unknown and were derived from the observed heterogeneity in a sample of symptomatic children. The chosen technique, LCA, can be interpreted as a form of cluster analysis. It has, however, important advantages over algorithmic clustering techniques such as hierarchical or k-means clustering. First, it is based on a formal statistical model that can readily accommodate features measured in different modes (categorical, continuous or count variables). Secondly, the algorithm typically used to fit such models was designed to deal with missing values 26. Thus, these models meet major challenges of real-life epidemiological and clinical data. Thirdly, the resulting clusters are not rigid: individuals are not forced into a single class. Rather, each individual can be assigned to various classes with differing probabilities. This soft form of classification more closely corresponds to the clinical situation, where some patients have features common to more than one condition. In the present study sample, the majority of children could clearly be classified into one of the phenotypes, i.e. with a high probability, but for a minority of children there remained some ambiguity. A possible downside of this approach is that the method does not directly produce clear-cut diagnostic rules for the clinical setting. However, once phenotype definitions obtained by this technique are validated, for instance by application of the model to independent data sets, the results can be translated into simplified diagnostic algorithms.
LCA shares some limitations with other clustering techniques. First, the problem of determining the number of classes has not been completely resolved 10. Different statistical criteria can be used to determine the number of classes, but may yield different results, as has been the case in the present study. Secondly, some prior decisions need to be made, such as the type and number of variables to include. This method therefore also involves some degree of subjectivity, although considerably less than a priori phenotype definitions. In the present application, another multivariate statistical method, multiple correspondence analysis, was used to assist variable selection and reduce the risk of subjective choices. The phenotypes identified are influenced by the range of data included. Therefore, it is necessary that all dimensions considered to be relevant for phenotype definition are represented by the included variables. As long as the same disease dimensions are included, results obtained by applying this approach to different cohorts should be comparable, even if the single variables representing these disease dimensions might differ (e.g. skin-prick tests versus specific immunoglobulin E measurements). In the current analysis, the authors deliberately focused on clinical dimensions (signs, symptoms and physiological measurements), i.e. dimensions related to disease expression and not to disease causes. The reason for this was to keep the methodology simple and transparent at this early stage of research. Using appropriate adjustments to the statistical model, future applications may extend this approach to include important risk factors of wheezing disorders, such as smoking.
The data set used for the present study, obtained from an ongoing population-based cohort, had a small sample size and a considerable proportion of missing values (12.8%). These problems are typical of clinical and epidemiological data. The data set thus provided a suitable test bed for the new approach. Although only 11% of individuals had complete data for all variables, all 319 individuals contributed to the analysis. This highlights the advantage of using an estimation procedure which makes best use of all available information in spite of missing values. The fact that not all targeted children responded at survey 1 (1,422 (86%) out of 1,650) and survey 2 (488 (61%) out of 795) might have induced some selection bias. This will mainly have affected the prevalence of identified phenotypes within the sample, but is less likely to have influenced the type of phenotypes found.
A further limitation of the present study sample might have been the considerable age spread of the children at the time of data collection. The probability of observing certain features such as atopy or wheeze ever changes naturally with age. This was partially accounted for by including age in the model, which allowed for a narrower age spread within phenotypes.
Phenotypes of wheeze
The present model distinguished the phenotypes transient viral wheeze (phenotype E), related to colds and mainly affecting nonatopic children, and atopic persistent wheeze (phenotype C), associated with multiple triggers and atopy. This suggests that the previously proposed categorisations transient and persistent wheeze 2, 3, 27, and viral and multiple-trigger wheeze 8, 28, 29 might reflect a single phenotypic dichotomy. Phenotypes E and C appear to reconcile the discrepancies between these two sets of labels. Children with the atopic persistent wheeze phenotype were mostly atopic, had the highest levels of bronchial responsiveness, lowest lung function and poorest prognosis, which agrees with findings from other groups 3, 30. Children with the transient viral wheeze phenotype were generally nonatopic and had normal lung function. This matches findings from the German Multicentre Asthma Study 31 but contrasts with reports from Tucson (AZ, USA) 2 describing impaired lung function both in infancy and at early school age in early transient wheezers.
A third phenotype of wheeze (phenotype D; fig. 2) was labelled as nonatopic persistent wheeze and was characterised by a low rate of atopy, similar to the phenotype labelled nonatopic wheeze by the Tucson group 27, 32. A low rate of atopy and the winter season predominance distinguished this phenotype from the atopy-associated phenotype. It is known from experimental studies that a nonatopic form of viral wheeze may persist in a mild form into adult life 33. Phenotype D also shares features with what has been described as “intrinsic asthma” in adult respiratory medicine 34. No evidence was found for a distinct late-onset phenotype, characterised by wheeze reported only in the second survey 2, 3, 27. The application of LCA to the present data set therefore provided support for 1) the distinction between transient and persistent wheeze, recognising that the former is associated with viral infections and the latter with other triggers, and for 2) the existence of a third form of wheeze which is nonatopic but largely persistent.
Phenotypes of cough
One of the two cough phenotypes that the present model identified (phenotype A) was associated with reduced lung function, increased bronchial responsiveness and a significantly higher risk of later wheeze compared with asymptomatics and with children belonging to the other cough phenotype (fig. 3). The statistical model therefore appears to have identified, within the large group of children with nonspecific cough, a group which exhibits features of a condition called “cough-variant asthma”. It is clear that lumping together all children with chronic cough under this term leads to an over-diagnosis of asthma 7. The present multidimensional approach could help to single out a subgroup of children who might indeed profit from asthma treatment.
Implications for research and clinical practice
Reliable phenotype definitions are important for research and clinical practice. They are useful for describing the natural history of the disease and for studying underlying mechanisms and the role of environmental and genetic factors. In the clinical setting, the ability to allocate children to phenotypes allows informed counselling of parents and is a prerequisite for phenotype-specific treatment 5, 6. More accurate phenotype definitions might also help to explain seemingly conflicting results in time trends and international prevalence of asthma 19, 35.
In all these settings, phenotypes are only useful if they reflect true aetiological entities. Statistical techniques that are designed to detect the structures underlying multivariate data, such as latent class analysis, have the potential to identify such phenotypes. However, because these methods are exploratory, it is important to validate the resulting phenotypes. In the present study, recent outcome data were used to provide support for phenotypes identified from early symptoms and physiological measurements. Identifying similar phenotypes using independent data sets is an additional necessary validation step (external validation). Further development of this approach and application to other cohorts should help increase understanding of phenotypic variability, not only in childhood respiratory disorders but also in adult obstructive airway diseases, where phenotype definition is an equally important issue 9.
This work was funded by the Swiss National Science Foundation (PROSPER grant 3233-069348 and 3200-069349, and SNF grant 823B - 046481) and the Swiss Society of Pneumology. Original data collection was funded by the UK National Asthma Campaign. Follow-up data collection was funded by grants from: University Hospitals of Leicester NHS Trust (R&D), Leicestershire and Rutland Partnership Trust, Medisearch, Trent NHS Regional Health Authority, and the UK Department of Health (grant 0020014). J. Grigg of the Division of Academic Paediatrics, Institute of Cell and Molecular Science, Queen Mary University (London, UK) helped finance the follow-up survey in 2003 via a grant from the UK Department of Health (No. 0020014).
Statement of interest
Acknowledgements
The authors would like to thank all the children and their parents for participating in the study; T. Davis (Business Manager of the Children’s Directorate, Leicester City West Primary Care Trust) for his assistance with the Leicestershire Child Health Database; and M. Egger, M-P. Strippoli, M. Brinkhof (all at the Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland) and P. Latzin (Dept of Paediatrics and Institute of Social and Preventive Medicine, University of Bern) for their valuable comments on the manuscript.
This manuscript has supplementary data accessible from www.erj.ersjournals.com
- Received November 15, 2007.
- Accepted January 8, 2008.
- © ERS Journals Ltd