Abstract
Background: Cystic fibrosis (CF) is a multisystem disease in which the assessment of disease severity based on lung function alone may not be appropriate. The aim of the study was to develop a comprehensive machine-learning algorithm to assess clinical status independent of lung function in children.
Methods: A comprehensive prospectively collected clinical database (Toronto, Canada) was used to apply unsupervised cluster analysis. The defined clusters were then compared by current and future lung function, risk of future hospitalisation, and risk of future pulmonary exacerbation treated with oral antibiotics. A k-nearest-neighbours (KNN) algorithm was used to prospectively assign clusters. The methods were validated in a paediatric clinical CF dataset from Great Ormond Street Hospital (GOSH).
Results: The optimal cluster model identified four (A–D) phenotypic clusters based on 12 200 encounters from 530 individuals. Two clusters (A and B) consistent with mild disease were identified with high forced expiratory volume in 1 s (FEV1), and low risk of both hospitalisation and pulmonary exacerbation treated with oral antibiotics. Two clusters (C and D) consistent with severe disease were also identified with low FEV1. Cluster D had the shortest time to both hospitalisation and pulmonary exacerbation treated with oral antibiotics. The outcomes were consistent in 3124 encounters from 171 children at GOSH. The KNN cluster allocation error rate was low, at 2.5% (Toronto) and 3.5% (GOSH).
Conclusion: Machine learning derived phenotypic clusters can predict disease severity independent of lung function and could be used in conjunction with functional measures to predict future disease trajectories in CF patients.
Machine learning-derived clusters can be used to define clinical status in children with cystic fibrosis https://bit.ly/3nudlPG
Introduction
Cystic fibrosis (CF) is characterised by lung disease, pancreatic insufficiency and malabsorption of nutrients, and can lead to numerous comorbidities such as CF-related diabetes (CFRD) and male infertility [1, 2]. Respiratory complications are the greatest cause of mortality, and therefore, current standards for assessing disease severity, monitoring disease progression and evaluating clinical trials rely heavily on lung function as an outcome measure [3].
People with CF have seen profound improvements in care, and many children now maintain lung function in the normal range [4–6]. Novel therapeutics (e.g. ivacaftor and elexacaftor/tezacaftor/ivacaftor) that correct the underlying molecular defect responsible for CF are expected to further improve lung function decline and the overall prognosis of people living with this disease [7]. Nonetheless, these treatments are not a cure, and there is still a need to monitor disease progression, and thus a call to develop new measures that adequately detect mild disease and can predict disease trajectories [8], especially in the paediatric age group.
Unsupervised cluster analysis, a form of machine learning, is a common approach to identify subgroups of disease. Unlike supervised methods, where multivariate classification is anchored to predefined labels, such as death or arbitrary thresholds of lung function, unsupervised analysis will group data based on natural patterns found both within and between variables [9]. Relevance of the groups, or clusters, is then assessed through association with outcome measures. The method has been applied to respiratory illnesses including COPD, asthma and bronchiectasis [10–12], as well as in CF [13–15]; however, these studies have associated clusters with death and transplant, which are uncommon events in paediatrics.
In order to develop a paediatric-specific outcome measure, it is important that a stronger link is established between routinely collected clinical variables and milder outcomes. Furthermore, any new measure should be derived independently of lung function in order to deviate from its historic reliance, and accordingly provide a complementary measure to monitor CF disease.
The aims of this study were to 1) use unsupervised clustering to identify clusters in a large paediatric Toronto CF dataset (TCF), and investigate whether these clusters distinguish patients based on current and future lung function measured by spirometry (forced expiratory volume in 1 s (FEV1)), risk of future hospitalisation and risk of future pulmonary exacerbation treated with oral antibiotics; 2) validate the clusters internally by investigating trends across age and time; 3) validate the clusters externally using a large paediatric United Kingdom CF dataset from Great Ormond Street Hospital (GOSH); 4) evaluate the repeatability in applying the clusters as a clinical measure.
Methods
Data
The TCF is an encounter-based registry that records clinical data at every CF clinic visit. To ensure relevance to the current CF population, analyses were limited to the most recent two decades (2000–2018). Adults (aged >18 years) were excluded, and clinical data recorded after a lung transplant were censored. Oral and inhaled antibiotics captured outside of clinical encounters were recorded, but not included as an encounter, and each hospitalisation was summarised as a single encounter. The recent history of hospitalisations and pulmonary exacerbation events treated with antibiotics were captured using a 12-month look-back window. Missing data were random throughout the dataset and were excluded from the cluster model.
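The 12-month look-back window can be illustrated with a short Python sketch. It is not the study's implementation (analyses were carried out in R); the function name `count_in_lookback`, the 365-day window length and the choice to exclude events on the encounter day itself are all assumptions made for illustration, and the dates are invented.

```python
from datetime import date, timedelta


def count_in_lookback(encounter_date, event_dates, window_days=365):
    """Count events in the window_days before an encounter.

    Assumption: the window is [encounter - window_days, encounter),
    so an event on the encounter day itself is not counted.
    """
    start = encounter_date - timedelta(days=window_days)
    return sum(1 for d in event_dates if start <= d < encounter_date)


# Hypothetical hospitalisation dates for one child (not study data).
hospitalisations = [date(2015, 1, 10), date(2015, 9, 2), date(2016, 3, 15)]
clinic_visit = date(2016, 4, 1)

n_prior_year = count_in_lookback(clinic_visit, hospitalisations)
print(n_prior_year)
```

The same helper could be applied to pulmonary exacerbation dates to derive the second look-back variable for each encounter.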
Clinical data from patients with CF from a second specialist children's hospital, GOSH, were used to validate the TCF-derived clusters. Data were obtained from hospital admission records, microbiology lab results, spirometry tests and clinical notes, which were available from 2009 to 2017. Data were merged and analysed in the GOSH-DRIVE digital research environment (DRE) (an electronic healthcare records (EHR) database) (Aridhia DRE, Aridhia, Edinburgh, UK). The same exclusions applied to the TCF dataset were also applied to the GOSH data. This study was approved by the research ethics board at the Hospital for Sick Children (REB#1000060824) and covered under the ethical approval 17/LO/0008 (R&D#19IA07) at GOSH.
Cluster analysis
All analyses were carried out in R software [16]. Paediatric CF physician input and the CF literature identified an initial list of 25 variables as relevant to CF health (table 1). In order to reduce noise and redundancy in the model, Pearson correlation tests and principal component analyses were used to inform decisions on excluding correlated variables and those with minimal contribution to the variance in the data. FEV1 was deliberately excluded from the model and was instead assessed as an outcome measure to corroborate the disease severities of the clusters.
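As an illustration of the correlation screening step, the sketch below flags pairs of variables whose Pearson correlation exceeds a cut-off. It is written in Python rather than R, the 0.8 threshold is an assumed value (the text does not state one), and the variable names and values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold.

    The threshold is a hypothetical cut-off chosen for this example.
    """
    names = sorted(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged


# Invented toy values: two growth measures that move together, one that does not.
toy = {
    "weight_z": [-1.2, -0.5, 0.0, 0.4, 1.1],
    "bmi_z": [-1.0, -0.6, 0.1, 0.5, 1.0],
    "hosp_prior_year": [2, 0, 3, 0, 1],
}
flagged = correlated_pairs(toy)
print(flagged)
```

A flagged pair is only a prompt for review; which member of a correlated pair to drop remains a clinical and analytical judgement.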
Table 1. Description of the initial list of Toronto cystic fibrosis dataset variables identified for their relevance to cystic fibrosis health
Partitioning around medoids clustering [19] was used to generate between three and five clusters. Initially, clustering was carried out on all combinations of variables (range 3–11), resulting in a total of 1981 cluster models per cluster number. The maximum number of variables included in the cluster combinations was restricted to 11 in the first instance for computational and practical reasons. Superior models were identified by silhouette width (a measure of within-cluster similarity). Additional details are provided in the supplementary material.
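The figure of 1981 models follows from counting every subset of 3 to 11 variables drawn from 11 candidates. The Python sketch below reproduces that count and shows one way to compute a mean silhouette width for a labelled set of points. The placeholder variable names, the Euclidean distance and the toy points are assumptions for illustration; the partitioning around medoids step itself is not implemented here.

```python
from itertools import combinations
from math import comb, dist

# Eleven placeholder names standing in for the candidate variables.
candidates = [f"var{i}" for i in range(1, 12)]
subsets = [c for k in range(3, 12) for c in combinations(candidates, k)]
n_models = len(subsets)
assert n_models == sum(comb(11, k) for k in range(3, 12))
print(n_models)


def mean_silhouette(points, labels):
    """Mean silhouette width using Euclidean distance (an assumption)."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy points: two tight, well-separated groups.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
print(good > bad)
```

A labelling that matches the natural grouping yields a silhouette width close to 1, whereas a labelling that splits the groups yields a low or negative value, which is why the measure can be used to shortlist models.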
Outcomes
The final candidate models (n=36) were assessed by comparing between-cluster differences in outcomes. Specifically, the models were ranked by model fit estimated from the Bayesian information criterion of each of time to hospitalisation (typically courses of intravenous antibiotics), time to pulmonary exacerbation treated with oral antibiotics, and a linear regression model of FEV1 % predicted (calculated from Global Lung Function Initiative reference equations [20]) (details can be found in the supplementary material). In addition, the models were ranked by sample size. Those that ranked best across all four parameters were assessed independently, and an optimal model was chosen as the one with the best between-cluster separation in outcomes.
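One way to combine rankings across the four parameters is sketched below in Python. Summing ranks is an assumption made for this example (the text does not specify how rankings were combined), ties are not handled, and the model names and values are invented. The `bic` helper uses the standard definition, BIC = k ln(n) − 2 ln(L), for k parameters, n observations and maximised likelihood L.

```python
from math import log


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate better fit."""
    return n_params * log(n_obs) - 2.0 * log_likelihood


def ranks(values, reverse=False):
    """Rank values from 1 (best). Lowest is best unless reverse=True."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out


# Hypothetical candidate models with invented BIC values and sample sizes.
models = {
    "m1": {"bic_hosp": 1200.0, "bic_pex": 1500.0, "bic_fev1": 900.0, "n": 12000},
    "m2": {"bic_hosp": 1100.0, "bic_pex": 1550.0, "bic_fev1": 950.0, "n": 9000},
    "m3": {"bic_hosp": 1300.0, "bic_pex": 1600.0, "bic_fev1": 990.0, "n": 8000},
}
names = list(models)
total = {name: 0 for name in names}

# Lower BIC is better for each of the three outcome models.
for key in ("bic_hosp", "bic_pex", "bic_fev1"):
    for name, r in zip(names, ranks([models[m][key] for m in names])):
        total[name] += r

# Larger sample size is better.
for name, r in zip(names, ranks([models[m]["n"] for m in names], reverse=True)):
    total[name] += r

best = min(total, key=total.get)
print(total, best)
```

In the study the top-ranked models were then assessed individually, so a rank total like this would only shortlist candidates rather than select the final model.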
The final optimal model was also analysed to determine cluster association with future lung function as the rate of change in FEV1 % predicted at 1 year from each encounter (estimated using linear mixed models with random slopes and intercepts) stratified between clusters [21]. Additionally, the proportion of encounters in each cluster was calculated for different thresholds of FEV1 % predicted. Finally, time to transplant or death from an individual's first cluster assignment was calculated.
Internal validation
On average, older children in the study cohort were anticipated to be more unwell than younger children, and children of the same age were anticipated to be healthier in the late 2010s than in the early 2000s. To validate the cluster detection of these trends in disease severity, the proportion of encounters in each cluster was assessed across time and age.
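The proportion of encounters in each cluster within a stratum (an age group or a time period) can be tabulated as in the following Python sketch. The field names, age bands and records are hypothetical and serve only to show the calculation.

```python
from collections import Counter, defaultdict


def cluster_proportions(encounters, stratum_key):
    """Proportion of encounters in each cluster within each stratum."""
    counts = defaultdict(Counter)
    for enc in encounters:
        counts[enc[stratum_key]][enc["cluster"]] += 1
    return {
        stratum: {c: n / sum(ctr.values()) for c, n in sorted(ctr.items())}
        for stratum, ctr in counts.items()
    }


# Invented encounter records (not study data).
toy = [
    {"age_group": "2-9", "cluster": "A"},
    {"age_group": "2-9", "cluster": "A"},
    {"age_group": "2-9", "cluster": "B"},
    {"age_group": "2-9", "cluster": "C"},
    {"age_group": "10-18", "cluster": "B"},
    {"age_group": "10-18", "cluster": "C"},
    {"age_group": "10-18", "cluster": "D"},
    {"age_group": "10-18", "cluster": "D"},
]
by_age = cluster_proportions(toy, "age_group")
print(by_age)
```

Passing a different `stratum_key`, such as a calendar-period field, gives the corresponding breakdown over time.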
External validation
Clusters were defined for the GOSH data using the variables identified in the TCF optimal model. Clustering was also carried out on a smaller TCF dataset using the same time period as the GOSH data for a matched comparison. Between-cluster trends in outcomes were compared between the populations.
Cluster allocation
To apply the clusters as a clinical measure, new data (i.e. from a clinic encounter) can be allocated to the closest cluster using a k-nearest-neighbours (KNN) algorithm [22]. The mean error rate of the method was estimated by assigning a randomly selected 20% of encounters (test data) to clusters generated from the remaining 80% of data [23], which was compared to the cluster assignment from the optimal model for both TCF and GOSH datasets, in 1000 iterations.
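The allocation and hold-out check can be sketched in Python as follows. This is a simplified illustration, not the study's procedure: it reuses the existing cluster labels for the 80% training portion rather than re-running the clustering, and the value k=3, the Euclidean distance, the number of iterations and the toy data are all assumptions.

```python
import random
from collections import Counter
from math import dist


def knn_assign(train_x, train_labels, point, k=3):
    """Assign the majority cluster among the k nearest training points."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: dist(train_x[i], point))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def holdout_error(x, labels, test_fraction=0.2, iterations=100, k=3, seed=0):
    """Mean misallocation rate over repeated random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    n_test = max(1, round(test_fraction * len(x)))
    errors = []
    for _ in range(iterations):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        train_x = [x[i] for i in train]
        train_y = [labels[i] for i in train]
        wrong = sum(knn_assign(train_x, train_y, x[i], k) != labels[i]
                    for i in test)
        errors.append(wrong / n_test)
    return sum(errors) / len(errors)


# Toy standardised data: two well-separated groups of ten encounters each.
mild = [(i * 0.1, j * 0.1) for i in range(5) for j in range(2)]
severe = [(10 + i * 0.1, 10 + j * 0.1) for i in range(5) for j in range(2)]
x = mild + severe
labels = ["A"] * 10 + ["D"] * 10

error = holdout_error(x, labels)
new_cluster = knn_assign(x, labels, (9.5, 9.8))
print(error, new_cluster)
```

With real clinical data the clusters overlap, so a non-zero error rate is expected; the toy groups here are deliberately far apart so that the output is easy to check by eye.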
Results
Data
The TCF contains 78 014 clinical encounters from 1309 people with CF. After exclusions, the dataset included 20 586 encounters from 575 children between 2000 and 2018 (figure 1). From the initial list of 25 variables (table 1), the review process identified 11 candidate variables for iterative clustering (variable selection details can be found in the supplementary material).
Figure 1. Flow chart summarising the study population and data exclusions applied to the Toronto cystic fibrosis (TCF) dataset.
Optimal cluster model
The final optimal cluster model consisted of four clusters defined by nine variables: body mass index (BMI), height, hospitalisations in prior year, cough and previous rates of infection with Pseudomonas aeruginosa, Staphylococcus aureus, Stenotrophomonas spp., Haemophilus influenzae and Aspergillus spp. The model was generated from 12 200 complete encounters involving 530 individuals aged 2–18 years (mean±sd 10.79±4.38 years). Of the individuals included, 11% received a transplant and 12% died before the end of the study period. Thirteen individuals received ivacaftor, making up 0.03% of the encounters (additional patient characteristics can be found in supplementary table S1).
Based on the assessment of individuals within the clusters, two clusters were consistent with milder disease (clusters A and B) and two clusters were consistent with more severe disease (clusters C and D). The four-cluster model provided more granularity on between-cluster outcomes than the three-cluster model, while the five-cluster model did not show a meaningful distinction.
The addition of FEV1 to the optimal model did not change the results, suggesting that the variables included already combine to influence FEV1.
Cluster characteristics
Both severe clusters (C and D) were characterised by low BMI, reduced height and weight, high numbers of hospitalisations in the prior year, high rates of previous Aspergillus spp. infection and high prevalence of chronic cough. These clusters were composed of older children with a high prevalence of both pancreatic insufficiency and CFRD, and a high use of chronic inhaled antibiotics (table 2). In addition, cluster C had the highest rate of previous P. aeruginosa infection, and cluster D had the greatest number of pulmonary exacerbations treated with oral antibiotics in the prior year.
Table 2. Summary of patient characteristics and clinical variables for each cluster
In contrast, the two mild clusters (A and B) were composed of younger children with lower prevalence of CFRD and pancreatic insufficiency, and were characterised by high growth parameters, low rates of previous P. aeruginosa and Aspergillus spp. infection and low numbers of hospitalisations in prior year (table 2). In addition, cluster A had the lowest prevalence of chronic cough.
Cluster outcomes
There were 10 351 complete lung function measurements in the optimal model. Cluster A had the highest FEV1 (mean±sd 93±15.5%), cluster B had intermediate FEV1 (mean±sd 83.5±17.9%) and clusters C and D had low FEV1 (mean±sd 68.5±21% and 64.6±22.2%, respectively) (figure 2a). Differences in FEV1 were significant between all clusters (post hoc Tukey test p<0.05). Lung function alone did not distinguish the clusters (figure 2a).
Figure 2. The association between each cluster and lung function. a) Violin plots summarising forced expiratory volume in 1 s (FEV1) % predicted by cluster, where the violin height represents the density of data. Boxes within the violins display the median (middle line) and the 25th and 75th percentiles, horizontal lines indicate 1.5×interquartile range and points are outliers. b) Stacked bar graph demonstrating the proportion of encounters in each cluster across different thresholds of FEV1 % predicted. c) The predicted rate of change in FEV1 % predicted over 1 year stratified across clusters. The predicted trajectories for three different ages, i) 10 years, ii) 13 years and iii) 16 years, are shown. The corresponding slopes are provided in supplementary table S2.
The greatest proportion of encounters in cluster A (29.6%) had an FEV1 >100% and the greatest proportion of encounters in cluster D (46.6%) had an FEV1 <40% (figure 2b). When the trajectory of each individual's encounters was investigated, encounters in cluster A with FEV1 <40% (n=6 encounters from n=4 people) probably resulted from inconsistent data entry. FEV1 in cluster A declined less with increasing age, and in cluster B FEV1 was relatively stable (figure 2c). Cluster C had the steepest decline in FEV1 over 1 year, which remained stable across age (figure 2c). FEV1 in cluster D increased over 1 year, and this increase was more pronounced in early childhood. The slope in this cluster was highly variable between individuals (sd 9.64) (figure 2c).
The risk of both future hospitalisation and future pulmonary exacerbation treated with oral antibiotics was lowest in cluster A, and highest in cluster D (figure 3a and b; table 3). Cluster B had a slightly higher risk of pulmonary exacerbation treated with oral antibiotics than cluster C, while cluster C had a comparatively much higher risk of hospitalisation (figure 3a and b; table 3). While death and transplant in childhood (age <18 years) were rare (n=21), time to death or transplant later in adulthood was linked to first severe disease cluster in childhood, in which cluster D had the highest risk of both death and transplant (figure 3c and d; table 3).
Figure 3. Time to event analyses (marginal means and rates model) by cluster. Analysis included repeated events where time was re-initiated when an individual switched clusters for a) time to pulmonary exacerbation treated with oral antibiotics and b) time to hospitalisation. In addition, analysis included time to event from an individual's first cluster assignment in the Toronto cystic fibrosis dataset for c) time to death and d) time to transplant. Hazard ratios are provided in table 3.
Table 3. Hazard ratios and 95% confidence intervals for each cluster as compared to cluster A across each time-to-event analysis
Internal validation
Age and year of encounter were not included in the model; nonetheless, there were clear age-related and temporal trends, such that older children were more likely to be in the more severe clusters C and D, and newer cohorts (at the same age) were more likely to be in the milder clusters A and B (figure 4). These observations were consistent with expected temporal and age-related trends.
Figure 4. The proportion of individuals in each cluster across a) age at different years and b) time at different ages. There were 7494 encounters before 2010, and 4706 encounters after 2010. The proportion of encounters in clusters A and D increased across decades, and the proportion of encounters in clusters B and C decreased across decades.
External validation
The GOSH data included 12 912 encounters from 187 children. Cough and oral antibiotic data were not available in this dataset so were removed from analysis. After data exclusions and the removal of missing values, the cluster model comprised 3124 encounters from 171 children aged 1–17.9 years (mean 8.2 years) (exclusion criteria can be found in supplementary figure S2).
For direct comparison, the TCF data were re-analysed using the same time period and variables as the GOSH data; there were 6623 encounters from 338 children aged 2–18 years (mean 11.1 years) in the revised TCF cluster model.
There was a similar gradient in risk of hospitalisation and FEV1 in both populations. Cluster D had the shortest time to hospitalisation, and the lowest FEV1 % predicted (mean±sd GOSH 69.2±19.3%, TCF 68.5±21.7%), whereas cluster A had the lowest risk of hospitalisation and highest FEV1 % predicted (mean±sd GOSH 85.7±16.4%, TCF 87.8±18%) (figure 5). Cluster differences in FEV1 were all significant (post hoc Tukey test p<0.05), with the exception of clusters C and D in the GOSH analysis (p=1.00).
Figure 5. Comparison of cluster outcomes in the i) Great Ormond Street Hospital and ii) Toronto datasets. a) Time to hospitalisation (marginal means and rates model) by cluster. Analysis included repeated events where time was re-initiated when an individual switched clusters. Hazard ratios are provided in supplementary table S3. b) Violin plots summarising forced expiratory volume in 1 s (FEV1) % predicted by cluster, where the violin height represents the density of data. Boxes within the violins display the median (middle line) and the 25th and 75th percentiles; horizontal lines indicate 1.5×interquartile range; and points are outliers.
Cluster allocation
The KNN method for cluster allocation accurately assigned the test encounters to the same clusters as the original clustering, with an error rate of 2.5% for TCF encounters and 3.5% for GOSH encounters.
Discussion
Phenotypic clusters using a range of clinical outcomes collected during routine clinical care allow for a comprehensive overview of CF health, and our results show that they meaningfully represent both mild and severe classes of CF disease. Within clusters, lung function was concordant with disease outcomes, whereas the range of individual lung function values observed within each cluster was wide. In the current era of CF, where up to 85% of affected children are reported to have mild to normal lung function [4–6], a multifactorial tool may provide further insight into disease progression.
The unsupervised algorithm benefits from the exclusion of FEV1, since young children who cannot perform spirometry are still captured. Where FEV1 is normal, the clusters may help to explain who is at a greater risk of hospitalisation or pulmonary exacerbation and who could benefit from targeted management.
It was surprising that FEV1 in cluster D increased over 1 year in younger children, although it suggests that clinicians may already recognise that these individuals have more severe disease and require more intense interventions. The greater variability in the slopes over 12 months further suggests that these trends may reflect treatment effects.
The results of the validation carried out in the GOSH population were very similar to those of the TCF data re-analysed with a subset of the original population (n=338 people). Although the GOSH population was at a higher risk of hospitalisations overall, a further investigation into clinical practices reveals differences between these two centres. For instance, routine regular hospitalisations for i.v. antibiotics are common for children with severe disease at GOSH, whereas hospitalisations in Toronto are typically only for acute exacerbations. Despite this significant difference in management approach, FEV1 values were similar across clusters between the two populations. Clustering on the same variables yielded similar outcomes, highlighting the robustness of the cluster method. To ensure applicability in the modern CF era, the model should be updated in a predominantly newborn screening cohort, as well as in genetically diverse cohorts to ensure generalisability.
Multivariate scores have rarely been used in routine clinical care, but as EHRs become more common and data are stored centrally, the implementation of this type of phenotypic clustering into clinical practice will become more feasible. The KNN algorithm can be used to calculate a cluster for each CF encounter based on the input of the nine cluster variables (table 2). As such, following appropriate governance and evaluation, the algorithm could be incorporated into EHR systems to provide clinicians with an overall picture of patient status and inform clinical decision making. Investigation into treatment effects across clusters with chronic therapies (e.g. dornase alfa and hypertonic saline) was limited in this study by missing and inconsistent data, but these effects need to be explored further to better understand whether treatments can modify disease severity status.
The algorithm could also be implemented in patient-facing EHR portals or apps to provide patients with a more comprehensive picture of their health status, where single clinical measures, such as BMI or recent infection, may not be as meaningful or interpretable as indicators of health. Future work involving patients and families will help to identify an appropriate presentation of cluster labels that gives better insight into overall health status.
A further potential application of the clusters is as an end-point in research studies where changes in FEV1 are not detectable; improvements in health status may instead be indicated by movement from a severe cluster (C/D) to a mild cluster (A/B). This may be particularly attractive in trials involving young children, and/or in the rapidly evolving era of highly effective cystic fibrosis transmembrane conductance regulator (CFTR) modulator treatments being integrated into routine care. Clusters may also be used in national registry reports to standardise clinics or regions by disease severity for matched comparisons of populations, or to highlight those who may benefit from more resources.
A current barrier to implementation is missing clinical data; future analyses are needed to triangulate evidence from existing data to ensure robust estimation of clusters. In addition, future work will explore the integration of filters to prevent the miscalculation of clusters from unrealistic clinical values. A limitation of the approach is that we standardised variables, which means the relative importance of each variable in defining the transition between clusters is unknown, and future work is needed to explore this.
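To illustrate what standardising variables involves, the Python sketch below applies a z-score transformation so that variables measured in different units contribute on a comparable scale to the distances used in clustering. The z-score with population standard deviation is an assumption, as the exact standardisation method is not stated here, and the values are invented.

```python
from statistics import mean, pstdev


def standardise(values):
    """Z-score standardisation: subtract the mean, divide by the SD.

    Assumption: population SD is used; a constant variable maps to zeros.
    """
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Invented values on very different scales (not study data).
hospitalisations = [0, 1, 2, 3, 4]          # counts per year
height_cm = [100.0, 110.0, 120.0, 130.0, 140.0]

z_hosp = standardise(hospitalisations)
z_height = standardise(height_cm)
print([round(z, 3) for z in z_hosp])
print([round(z, 3) for z in z_height])
```

After the transformation both variables have mean 0 and standard deviation 1, which removes the effect of units but also means that the scale of a variable no longer signals its importance to the cluster assignment.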
Cluster analysis involves largely subjective decisions at each step of the cluster pipeline, and further refinement of the model may include the selection of different parameters to better understand the stability of the model. At present, the optimal model performs well and is an important first step, but it may not remain the best model as highly effective CFTR modulators begin to change clinical outcomes and prognosis. Future work should determine how to routinely update models as patient characteristics and available data change. For instance, cough was repeatedly highlighted as important in defining outcomes during iterative clustering, but did not appear to impact the GOSH validation. While cough is limited by subjectivity and a high proportion of missing values, its inclusion allowed the model to encompass patient-reported symptoms. More objective symptoms collected routinely through EHR or patient apps may be required to better capture the lived experience of the patient. CFTR modulator therapy was excluded from the model due to minimal data, but should be considered once these treatments are available to a wider proportion of CF patients. Similarly, lung clearance index should be considered for inclusion as it becomes a more routinely collected clinical variable, and extrapulmonary manifestations not captured in the registry should be included when using EHR data. An advantage of using an integrated cluster algorithm is that the variables can be updated, the time period adjusted, and subgroups refined as new information becomes available.
This analysis focused on time-dependent, continuous variables, both to assess cluster change over time and because clustering with partitioning around medoids performs better with continuous variables. For example, the inclusion of genetic sex resulted in clusters completely defined by sex alone, with less association with clinical outcomes. This demonstrates that the unsupervised nature of clustering, in which the algorithm aims to find similar groups without a predefined purpose, requires clinical interpretation to ensure real-world value. In selecting meaningful, continuous variables, the look-back window to capture an individual's clinical history means the frequency of microbiology sampling could introduce a potential bias, and sicker patients with more hospitalisations may be overrepresented.
The phenotypic clusters were broadly comparable to a previous study of an adult CF population, which found that P. aeruginosa, i.v. antibiotics and pancreatic insufficiency were consistent in severe clusters with a high risk of death, and vice versa for milder clusters with a low risk of death [13]. The study identified seven clusters within 25 variables. We restricted the number of possible clusters to between three and five to simplify the practical interpretation of the results, and we were able to reduce the number of variables to nine for a more feasible clinical application. Despite these differences, both approaches identified robust clusters with similar variables associated with severe disease. Other cluster analyses in CF have anchored clusters to physician-determined levels of disease severity, which may be biased by indication [24], or have included FEV1 in the model [14]. Our analysis provides a more objective categorisation of disease severity, by statistically testing the clusters against known indicators of health that are mainly excluded from the model itself.
Cluster analyses have been applied to other diseases and further emphasise that knowledge-based inclusion of variables is necessary to ensure meaningful translation of the results [11, 12, 25, 26]. Another common approach has been to transform variables into linear combinations [10, 11, 25], which we decided would complicate the interpretation of our results. Two studies have gone further to prospectively apply the clusters by generating an external scoring system using decision tree analyses [13, 25]. The advantage of the KNN algorithm is that it directly allocates clusters to an encounter to predict disease severity. Our results show it has an extremely low error rate (<5%).
Multivariate clinical scoring systems have been derived in CF, but they typically assume an additive/multiplicative independent association between the variables included [27–30]. Cluster analysis instead makes no assumption about the nature of the relationship, only how individuals are similar to one another through shared patterns across the variables. Additionally, multivariate models must be linked to an event of interest or a specific outcome, and the advantage of this approach is that the clusters are correlated with important clinical outcomes, but not developed with an explicit prediction model of a single outcome.
Conclusion
It is feasible to develop machine learning based phenotypic clusters that summarise the overall health status of children living with CF, and these may provide a holistic way to track the progression of disease across childhood. The cluster algorithm can be updated regularly to accompany a rapidly changing therapeutic environment.
Footnotes
This article has supplementary material available from erj.ersjournals.com
Conflict of interest: N. Filipow has nothing to disclose.
Conflict of interest: G. Davies reports personal fees for lectures from Chiesi Limited, outside the submitted work.
Conflict of interest: E. Main has nothing to disclose.
Conflict of interest: N.J. Sebire has nothing to disclose.
Conflict of interest: C. Wallis has nothing to disclose.
Conflict of interest: F. Ratjen reports grants and personal fees for consultancy from Vertex, Calithera, Proteostasis, TranslateBio, Genentech, Bayer and Boehringer Ingelheim, outside the submitted work.
Conflict of interest: S. Stanojevic reports grants from SickKids Foundation and European Respiratory Society, during the conduct of the study.
Support statement: G. Davies was supported by a grant from the UCL's Wellcome Institutional Strategic Support Fund 3 (grant reference 204841/Z/16/Z). S. Stanojevic received funding from the Program for Individualized Cystic Fibrosis Therapy Synergy Grant and the European Respiratory Society. N. Filipow received funding from a UCL, GOSH and Toronto SickKids studentship. All research at Great Ormond Street Hospital NHS Foundation Trust and UCL Great Ormond Street Institute of Child Health is made possible by the NIHR Great Ormond Street Hospital Biomedical Research Centre. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health. Funding information for this article has been deposited with the Crossref Funder Registry.
- Received July 23, 2020.
- Accepted December 22, 2020.
- Copyright ©The authors 2021. For reproduction rights and permissions contact permissions@ersnet.org