Reply to “COVID-19 prediction models should adhere to methodological and reporting standards”
- 1The D-Lab, Dept of Precision Medicine, GROW - School for Oncology, Maastricht University Medical Center+, Maastricht, The Netherlands
- 2Dept of Radiology and Nuclear Medicine, GROW- School for Oncology and Developmental Biology, Maastricht University Medical Center+, Maastricht, The Netherlands
- Guangyao Wu, The D-Lab, Dept of Precision Medicine, GROW - School for Oncology, Maastricht University Medical Center+, 6229 ER, Maastricht, The Netherlands. E-mail: g.wu{at}maastrichtuniversity.nl
Abstract
A single standardised methodology for prediction models is difficult to follow; researchers should instead adhere to generally accepted reporting standards appropriate to their research needs and journal submission requirements https://bit.ly/30zfMIw
From the authors:
We would like to thank G.S. Collins, M. van Smeden, and R.D. Riley for their commentary on the design, analysis, and reporting of our article [1]. However, their comments seem to stem from a traditional biostatistics angle rather than from a translational machine-learning research perspective, and the overwhelming majority of their criticisms arise from misunderstandings or misreadings.
The authors inaccurately state that we randomly split datasets. As described in our manuscript, we nonrandomly split the data by time and place, making it a stronger design according to the TRIPOD statement. The use of independent cohorts to test model generalisability makes it a TRIPOD type 3 study [2]. We agree that splitting reduces the training dataset size, increasing the probability of overfitting. However, as an RNA virus, SARS-CoV-2 may mutate rapidly and develop diverse characteristics. Hence, we split the datasets by time and place rather than using cross-validation or bootstrapping.
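A nonrandom split by time and place can be sketched as follows. This is a minimal illustration on hypothetical data; the column names, sites, and cut-off date are invented for the example, not taken from the original study:

```python
import pandas as pd

# Hypothetical patient-level data: each row has a site and an admission date.
df = pd.DataFrame({
    "site": ["A", "A", "B", "B", "C", "C"],
    "admission_date": pd.to_datetime([
        "2020-01-10", "2020-02-15", "2020-01-20",
        "2020-03-05", "2020-03-10", "2020-03-20",
    ]),
    "outcome": [0, 1, 0, 1, 0, 1],
})

# Split by place: hold out an entire site as an external test cohort.
train_by_place = df[df["site"] != "C"]
test_by_place = df[df["site"] == "C"]

# Split by time: train on earlier admissions, test on later ones.
cutoff = pd.Timestamp("2020-03-01")
train_by_time = df[df["admission_date"] < cutoff]
test_by_time = df[df["admission_date"] >= cutoff]
```

Unlike a random split, both partitions preserve the temporal and geographic structure the model will face at deployment.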
The authors used all 75 candidate predictors, rather than the seven selected ones, to perform their sample size calculations for our training dataset [3]. Although we agree that using candidate predictors is the more rigorous approach compared to using only the selected ones, it is overly strict for the modern machine-learning and -omics field, and disregards the power of the feature dimensionality reduction and selection methods we employed. While we understand that overfitting remains possible, the validation of the model on five datasets from unrelated institutions strengthens the likelihood that the model presented is robust. Test set results are presented separately to improve understanding of robustness, because poor performance on a small test set is easily hidden by combining it with a large test set where performance is good. More importantly, the selected variables make sense from a clinical point of view [4, 5], making our models explainable, transparent, and therefore acceptable to end-users.
We agree that excluding missing data may lead to biases, and list this as our first limitation in the Discussion. Given the time-critical nature of this quickly developing pandemic, we decided that excluding 38 patients was preferable to imputation and that any bias introduced by such a selection would be revealed in the five external validations and further validations post-publication. The authors inaccurately state that we assume that continuous predictors are linearly associated with the outcome. We emphasise that neither feature selection nor modelling assume a linear association between predictors and outcomes. The process of randomising the outcomes and re-running the analysis is a powerful sanity check against overfitting [6].
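The outcome-randomisation check can be illustrated with scikit-learn's `permutation_test_score`, which refits the model on shuffled labels many times. This is a minimal sketch on synthetic data, not the original analysis:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

# Synthetic stand-in for a clinical dataset with informative predictors.
X, y = make_classification(n_samples=300, n_features=7, n_informative=5,
                           random_state=0)

# Refit the model on permuted outcomes many times: if the true-label score
# does not clearly exceed the permuted-label scores, the apparent
# performance is likely an artefact of overfitting.
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, n_permutations=100, random_state=0)

print(f"true-label accuracy: {score:.2f}")
print(f"permuted-label accuracy: {perm_scores.mean():.2f} (p = {p_value:.3f})")
```

A model that is genuinely learning signal scores near chance on the permuted outcomes and substantially better on the true ones.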
We must point out that the Adaptive Synthetic (ADASYN) algorithm is a published and validated method for dealing with dataset imbalance. Whilst we agree that this methodology could introduce an error in the model intercept, we believe this error can be estimated when calculating the model's performance on the five external validation datasets. Everyone has their preferred metrics, and often a better metric can be found than those commonly reported. This is especially true in the convergence zone between machine learning and clinical application, where reporting possibly suboptimal metrics that are easier to understand may have added benefit over more technical metrics used by data scientists. Reporting confusion matrices, a widely used and readily understandable way of evaluating classification performance, can easily be defended. Equally, reporting the universally adopted sensitivity and specificity metrics, as well as the results from the calibration plots, aligns well with the readership of this esteemed publication.
The authors call our risk groupings arbitrary. Using three risk groups was a requirement of the clinicians and is common in the clinic, including for COVID-19: low-risk (home care), medium-risk (hospital surveillance), and high-risk (ICU admission). The risk probability thresholds were based on the 25th and 75th probability percentiles in the balanced training set. With these thresholds, the low-risk group had <20% incidence of severe outcomes and the high-risk group had >75% chance of severe outcomes on each test set, which the clinicians deemed clinically useful. The authors reprimand us for not reporting the model parameters explicitly. For us, the main aim of any clinical triage model is its application to individual patients in a clinical setting. We believe both a nomogram and a web calculator satisfy this requirement. In addition, for model evaluation, the model parameters can be fully reconstructed from the nomogram.
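The percentile-based thresholding can be sketched with NumPy. The predicted probabilities below are randomly generated for illustration, not the study's model output:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical predicted probabilities of a severe outcome on the
# (balanced) training set.
probs = rng.uniform(0.0, 1.0, size=200)

# Thresholds at the 25th and 75th percentiles of the training probabilities.
low_cut, high_cut = np.percentile(probs, [25, 75])

def risk_group(p):
    """Map a predicted probability to a triage group."""
    if p < low_cut:
        return "low"        # home care
    if p < high_cut:
        return "medium"     # hospital surveillance
    return "high"           # ICU admission

groups = [risk_group(p) for p in probs]
```

On the training set this assigns roughly a quarter of patients to each tail group and half to the middle group; the clinical usefulness of the cut-offs is then judged by the observed outcome rates within each group on the test sets.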
There are numerous checklists and guidelines for diagnostic and predictive models [7–10]. In retrospect, we agree that TRIPOD is a more appropriate checklist than STARD for modelling studies, given its detail on the reporting of methodology and results. We chose a more familiar checklist from the submission guidelines of this journal (guidelines in which TRIPOD was not listed) and will include TRIPOD reporting in future work. Given the quickly changing nature of machine learning and the increasing number of guidelines, standards are hard to forge, even as the need for them in the reporting of modelling studies increases.
Overall, we believe our work is useful and explainable, and have received positive feedback from colleagues, including clinicians, who appreciate that their requirements have been taken into account. We are currently prospectively validating our models out of a conviction that only this approach can truly validate a predefined model.
Supplementary Material
Shareable PDF (ERJ-02918-2020.Shareable): this one-page PDF can be shared freely online.
Footnotes
Conflict of interest: Dr. Wu has nothing to disclose.
Conflict of interest: Dr. Woodruff has (minority) shares in Oncoradiomics, outside the submitted work.
Conflict of interest: Dr. Chatterjee has nothing to disclose.
Conflict of interest: Dr. Lambin has minority shares in The Medical Cloud Company, and reports grants from Varian Medical, Oncoradiomics, ptTheragnostic/DNAmito and Health Innovation Ventures, personal fees from Oncoradiomics, BHV, Varian, Elekta, ptTheragnostic and Convert Pharmaceuticals, outside the submitted work; and has patents PCT/NL2014/050248, PCT/NL2014/050728 and PCT/EP2014/059089 licensed, and patents N2024482, N2024889 and N2024889 pending.
- Received July 27, 2020.
- Accepted July 30, 2020.
- Copyright ©ERS 2020.
This version is distributed under the terms of the Creative Commons Attribution Non-Commercial Licence 4.0.