Abstract
We evaluated performance characteristics and estimated the minimal clinically important difference (MCID) of data-driven texture analysis (DTA), a high-resolution computed tomography (HRCT)-derived measurement of lung fibrosis, in subjects with idiopathic pulmonary fibrosis (IPF).
The study population included 141 subjects with IPF from two interventional clinical trials who had both baseline and nominal 54- or 60-week follow-up HRCT. DTA scores were computed and compared with forced vital capacity (FVC), diffusing capacity of the lung for carbon monoxide, distance covered during a 6-min walk test and St George's Respiratory Questionnaire scores to assess the method's reliability, validity and responsiveness. Anchor- and distribution-based methods were used to estimate its MCID.
DTA had acceptable reliability in subjects appearing stable according to anchor variables at follow-up. Correlations between the DTA score and other clinical measurements at baseline were moderate to weak and in the hypothesised directions. Acceptable responsiveness was demonstrated by moderate to weak correlations (in the directions hypothesised) between changes in the DTA score and changes in other parameters. Using FVC as an anchor, MCID was estimated to be 3.4%.
Quantification of lung fibrosis extent on HRCT using DTA is reliable, valid and responsive, and an increase of ∼3.4% represents a clinically important change.
In subjects with IPF, quantification of lung fibrosis extent on HRCT using data-driven texture analysis shows acceptable performance characteristics and a minimal clinically important difference in the range of 3.4–6.4%.
Introduction
Idiopathic pulmonary fibrosis (IPF) is a chronic fibrosing interstitial lung disease whose aetiology remains unknown [1]. It is characterised by progressive scarring of the lung parenchyma that leaves patients with increasing dyspnoea, decreasing quality of life and shortened survival. Its median survival is estimated at 3–5 years, although individual prognosis can vary significantly [2–5]. Some patients suffer rapid disease progression and early death, while others decline more slowly with variable periods of clinical stability [6]. Accurate assessment of disease severity and of change over time is critical for both clinical care and therapeutic trials.
Pulmonary physiology, particularly forced vital capacity (FVC), is the standard method for longitudinal monitoring of disease progression. However, FVC is an indirect measure of disease activity with drawbacks including dependency on technique and patient effort, variability in rate of change [7], and lack of sensitivity to subtle changes in disease status [8]. Although its role has been debated [9], decline in FVC is generally accepted as a surrogate end-point for death in IPF [10]. Still, it is widely recognised that additional valid and reliable outcome measures are needed [6, 11].
High-resolution computed tomography (HRCT) plays an essential role in the evaluation of patients with IPF. It provides noninvasive visualisation of lung parenchyma, can diagnose IPF without lung biopsy in the majority of patients and has been used for entry into clinical trials. However, visual assessment of HRCT images is limited by interobserver variation [12] and is insufficiently precise for longitudinal evaluation [13].
Computational methods for quantitative evaluation of HRCT have emerged as promising objective markers of disease severity in pulmonary fibrosis [14]. HRCT-derived scores for fibrosis extent correlate with degree of physiological impairment at baseline and may be more sensitive to subtle changes in disease status than physiological metrics [15]. In previous work, we showed that extent of lung fibrosis, quantified on HRCT using a method called data-driven texture analysis (DTA), provides an IPF severity index that correlates with expert visual assessment and lung function, and could be used to predict longitudinal disease behaviour better than semiquantitative visual scores or HRCT lung histogram-based metrics [16].
To be accepted as outcome measures, the performance characteristics of HRCT lung fibrosis scores require further study. In this work, we analysed pooled subject-level data from two IPF treatment trials (PANTHER-IPF [17] and RAINIER [18]) to assess the reliability, validity and responsiveness, and to estimate the minimal clinically important difference (MCID) (the smallest difference in an outcome measure that would be meaningful to the patient [19]) of DTA.
The sequential HRCT analysis of the PANTHER-IPF data has been reported previously [16], but test characteristics were not evaluated. Some of the data reported here have been presented in poster format [20].
Methods
Study design and population
As all analyses were conducted retrospectively on previously collected, de-identified data, this study was exempt from additional institutional review board approval. Methods for the PANTHER-IPF and RAINIER trials have been published previously [17, 18]. We included subjects who had both baseline and follow-up data (nominally at 15 and 12.5 months for PANTHER-IPF and RAINIER, respectively). Briefly, for inclusion in PANTHER-IPF, a three-armed trial of placebo versus N-acetylcysteine versus a three-drug combination (prednisone, azathioprine and N-acetylcysteine), patients required a diagnosis of IPF made using criteria similar to subsequently published international consensus guidelines [1]. A subset of 72 subjects underwent both baseline and nominal 15-month follow-up volumetric (axial slice thickness and spacing ≤1.25 mm) HRCT. RAINIER was a placebo-controlled trial of simtuzumab, a monoclonal antibody against the lysyl oxidase-like 2 enzyme, conducted from March 2011 to January 2016 and terminated prematurely for lack of efficacy. In this trial the diagnosis of IPF was made in accordance with accepted criteria [1]. A subset of subjects from sites in the USA underwent both baseline and nominal 54-week follow-up HRCT. In RAINIER, HRCT protocols were more varied, but only series with axial slice thickness ≤2.5 mm and limited gaps (slice spacing ≤10 mm) were included in this analysis (n=69). HRCT scans showing excessive motion artefacts, inadequate inspiration or an incomplete depiction of the lung parenchyma, identified by visual assessment, were omitted from this analysis. Baseline and follow-up HRCT were matched on protocol to the extent possible. Case selection and summary characteristics of HRCT are included in supplementary figure E1 and supplementary table E1.
In each trial, standard demographic, physiological and patient-reported outcome variables were collected, including FVC, diffusing capacity of the lung for carbon monoxide (DLCO), distance covered during a 6-min walk test (6MWD) and response data from the St George's Respiratory Questionnaire (SGRQ). The SGRQ is a respiratory disease-specific health-related quality of life questionnaire with 50 items separated into three domains (Symptoms, Activity and Impacts). Each domain score and the SGRQ total score has a range from 0 to 100, with higher scores corresponding to greater impairments [21].
HRCT analysis
DTA is a machine learning method capable of automatic detection and quantification of lung fibrosis on HRCT [16]. It is trained to discriminate fibrosis using radiologist-identified image regions demonstrating normal lung parenchyma and usual interstitial pneumonia patterns. Exemplar regions labelled as reticulation, honeycombing or traction bronchiectasis were used to define the fibrosis category. The algorithm classifies local regions in axial sections as either normal lung or fibrosis in a sliding window fashion over lung fields, which are identified in a separate segmentation process. The DTA fibrosis score is computed as the percentage of the total number of window regions classified as fibrosis (figure 1). Performing classification on axial images enables analysis of studies with noncontiguous sections.
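As a minimal sketch of the final scoring step only (not the classifier itself), the DTA fibrosis score can be illustrated as the percentage of window regions labelled as fibrosis. The function name `dta_fibrosis_score` and the boolean-label input are assumptions made for illustration; the trained classifier and lung segmentation are outside the scope of this example.

```python
from typing import Iterable


def dta_fibrosis_score(window_labels: Iterable[bool]) -> float:
    """Return the percentage of window regions classified as fibrosis.

    `window_labels` holds one boolean per sliding-window region inside the
    segmented lung fields: True = fibrosis, False = normal lung.
    (Hypothetical input format, assumed for this sketch.)
    """
    labels = list(window_labels)
    if not labels:
        raise ValueError("no window regions inside the lung fields")
    return 100.0 * sum(labels) / len(labels)


# Hypothetical example: 7 of 25 window regions flagged as fibrosis.
example_labels = [True] * 7 + [False] * 18
print(dta_fibrosis_score(example_labels))  # 28.0
```

Because the score is a simple proportion of classified windows, it does not depend on sections being contiguous, which is consistent with the statement above about noncontiguous studies.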
Statistical analyses
Summary statistics were generated for baseline characteristics. In the performance characteristics analyses, we used several disease severity variables as anchors against which DTA scores were compared. Anchors included FVC, DLCO, 6MWD and SGRQ scores.
Several analyses were conducted to support the validity of the DTA score as a measure capable of capturing baseline and change in IPF severity. For concurrent validity analyses, we examined associations between baseline values for the DTA score and each anchor (FVC, DLCO, 6MWD and SGRQ scores) by using Spearman correlation coefficients. Known-groups validity was assessed by comparing mean DTA fibrosis scores across anchor-defined, discrete subgroups of IPF severity determined by stratifying the cohort on baseline values for FVC, DLCO, 6MWD and SGRQ. One-way ANOVA was used for statistical comparisons, as well as p-value-adjusted pairwise comparisons using the Tukey method.
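The known-groups comparison can be sketched as a one-way ANOVA F statistic computed over severity subgroups. The example below is illustrative only: the group values are invented, the function name `one_way_anova_f` is an assumption, and no p-value or Tukey adjustment is computed because the Python standard library has no F distribution (the study itself used R).

```python
from statistics import mean
from typing import Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F statistic for a one-way ANOVA across independent groups."""
    k = len(groups)
    if k < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    n_total = sum(len(g) for g in groups)
    if n_total <= k:
        raise ValueError("need more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variation")
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Invented baseline DTA scores (%) for three hypothetical severity subgroups.
mild, moderate, severe = [21, 22, 23], [22, 23, 24], [25, 26, 27]
print(round(one_way_anova_f([mild, moderate, severe]), 2))  # 13.0
```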
Responsiveness was assessed with Spearman correlation coefficients between change from baseline in the DTA score and change from baseline in each anchor variable. These values were also used to gauge the appropriateness of anchors for estimation of MCID. Following Cohen's rule of thumb, anchors with correlation coefficients ≥0.30 when compared with DTA change were considered appropriate [22, 23].
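The responsiveness check described above can be sketched as follows, assuming paired change scores for DTA and one anchor. The data are invented and the helper names (`average_ranks`, `spearman`, `anchor_is_appropriate`) are hypothetical; the study's own analysis was run in R. Comparing the absolute value of the coefficient with 0.30 follows the way the threshold is applied in the Results.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(vx * vy)


def anchor_is_appropriate(rho: float, threshold: float = 0.30) -> bool:
    """Apply the >=0.30 rule of thumb to the absolute correlation."""
    return abs(rho) >= threshold


# Invented change scores: DTA change (%) and relative FVC change (%).
dta_change = [1.0, 4.0, 2.5, 6.0, 0.5]
fvc_change = [-2.0, -8.0, -9.0, -12.0, 1.0]
rho = spearman(dta_change, fvc_change)
print(round(rho, 2), anchor_is_appropriate(rho))  # -0.9 True
```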
DTA scores were compared across groups of subjects stratified into discrete, anchor-defined categories of change in IPF severity. Stratification levels for each anchor variable were selected to represent groups that were “much worse”, “slightly worse”, “same”, “slightly better” and “much better”. Cut-off values for each anchor were selected based on published data. For example, in IPF the MCID for FVC is estimated to be ∼5% [24], so the “same” group for this anchor was defined as subjects whose change in FVC was within ±5% relative to baseline and the cut-off between “slightly worse” and “much worse” was set at twice this value. Other published MCIDs in IPF are ∼30 m for 6MWD [25] and 7 points for SGRQ total score [26]. In chronic obstructive pulmonary disease, MCID for DLCO has been estimated to be ∼10% [27]. Test–retest reliability was evaluated by computing the intraclass correlation coefficient (ICC) of baseline and follow-up DTA scores for subjects in the “same” groups defined by stable values in each external anchor. ICC values <0.5, 0.5–0.75, 0.75–0.9 and >0.90 were interpreted as poor, moderate, good and excellent reliability, respectively [28]. Mean changes in DTA scores for anchor-defined groups were compared using ANOVA and p-value-adjusted pairwise comparisons using t-tests.
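The FVC-based stratification and the ICC interpretation bands can be illustrated with a short sketch. The ±5% "same" band and the doubled cut-off for "much worse" come from the text above; the symmetric cut-offs on the improvement side, the handling of values that fall exactly on a boundary, and the function names are assumptions made for illustration.

```python
def classify_fvc_change(baseline_fvc: float, followup_fvc: float,
                        mcid_pct: float = 5.0) -> str:
    """Assign a change category from relative FVC change (% of baseline)."""
    if baseline_fvc <= 0:
        raise ValueError("baseline FVC must be positive")
    rel_change = 100.0 * (followup_fvc - baseline_fvc) / baseline_fvc
    # Boundary handling (<= versus <) is an assumption of this sketch.
    if rel_change <= -2 * mcid_pct:
        return "much worse"
    if rel_change < -mcid_pct:
        return "slightly worse"
    if rel_change <= mcid_pct:
        return "same"
    if rel_change < 2 * mcid_pct:
        return "slightly better"
    return "much better"


def interpret_icc(icc: float) -> str:
    """Map an ICC to the poor/moderate/good/excellent bands quoted above."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"


# Invented FVC values in litres: baseline 3.00 L at every row.
for followup in (2.60, 2.80, 2.90, 3.20, 3.40):
    print(followup, classify_fvc_change(3.00, followup))
print(interpret_icc(0.78), interpret_icc(0.91))  # good excellent
```

Test–retest reliability is then computed only on subjects falling in the "same" category for a given anchor, so the choice of cut-off directly determines which subjects contribute to each ICC.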
MCID is defined as the smallest difference in an outcome measure that can be considered important and that would lead a clinician to consider a change in therapy [25]. While there is no consensus agreement on the ideal method to determine MCID, current best practice is to use several approaches to estimate a practical range [23, 26]. Attempts to triangulate the MCID for DTA were made using both anchor- and distribution-based methods. Anchor-based estimation of MCID is a special case of responsiveness where mean change in the DTA score for subjects who change minimally according to a given anchor provides an estimate for MCID. We considered the mean values of DTA change scores in the “slightly worse” groups, described earlier, as estimates of MCID for worsening. Effect size, the difference in mean DTA scores at baseline and follow-up divided by standard deviation of the DTA score at baseline, was used to estimate the magnitude of MCID. Effect size values of 0.2, 0.5 and 0.8 are considered small, medium and large, respectively [24].
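As a hedged illustration of the anchor-based quantities, the sketch below computes the mean DTA change and the effect size for a single anchor-defined group. The scores are invented, the function names are hypothetical, and the use of the sample standard deviation (n−1 denominator) is an assumption, since the text does not state which form was used.

```python
from statistics import mean, stdev
from typing import Sequence


def mean_change(baseline: Sequence[float], followup: Sequence[float]) -> float:
    """Mean of follow-up minus baseline DTA scores (anchor-based MCID estimate
    when computed in the 'slightly worse' group)."""
    if len(baseline) != len(followup) or len(baseline) < 2:
        raise ValueError("need paired scores for at least two subjects")
    return mean([f - b for b, f in zip(baseline, followup)])


def effect_size(baseline: Sequence[float], followup: Sequence[float]) -> float:
    """Difference in mean DTA score divided by the baseline SD."""
    if len(baseline) != len(followup) or len(baseline) < 2:
        raise ValueError("need paired scores for at least two subjects")
    return (mean(followup) - mean(baseline)) / stdev(baseline)


# Invented DTA scores (%) for three subjects in a "slightly worse" group.
baseline_scores = [20.0, 30.0, 40.0]
followup_scores = [24.0, 33.0, 45.0]
print(mean_change(baseline_scores, followup_scores))            # 4.0
print(round(effect_size(baseline_scores, followup_scores), 2))  # 0.4
```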
Distribution-based methods use sample data only, relying on statistical properties of the distribution of outcome scores to estimate the MCID [19]. We used the standard deviation of the DTA score at baseline and the standard error of measurement for these estimates. The amount of change in the outcome variable that corresponds to a moderate effect size, i.e. one half of the standard deviation at baseline, can be used as an estimate of MCID [25, 29]. Finally, the standard error of measurement (SEM; another estimate of MCID [24]) was calculated as SEM = SD × √(1 − ICC), where SD is the standard deviation of the DTA score at baseline.
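The two distribution-based estimates are simple enough to work through numerically. The sketch below uses the pooled baseline standard deviation reported in the Results (12.9%) and, purely as an assumption for illustration, the mean ICC reported there (0.83); the inputs the authors actually used for their tabulated estimates may differ. The function names are hypothetical.

```python
from math import sqrt


def half_sd(sd_baseline: float) -> float:
    """Half of the baseline SD (change corresponding to a moderate effect size)."""
    return 0.5 * sd_baseline


def standard_error_of_measurement(sd_baseline: float, icc: float) -> float:
    """SEM = SD at baseline x sqrt(1 - ICC)."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC must lie between 0 and 1")
    return sd_baseline * sqrt(1.0 - icc)


sd_dta = 12.9   # baseline SD of the DTA score (%), from the Results
icc_dta = 0.83  # mean ICC from the Results; using it here is an assumption
print(round(half_sd(sd_dta), 2))                                 # 6.45
print(round(standard_error_of_measurement(sd_dta, icc_dta), 2))  # 5.32
```

A higher ICC shrinks the SEM, so a more reliable measurement yields a smaller distribution-based threshold for change that exceeds measurement noise.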
Statistical analyses were performed using R version 3.4.2 [30] and p-values <0.05 were considered statistically significant.
Results
The final cohort comprised 141 subjects who had baseline and follow-up data available for analysis. Demographics, baseline values and changes at follow-up are presented in table 1. The mean±SD age of the pooled cohort was 68.0±8.2 years and 108 (76.6%) were male. Mean±SD baseline FVC % pred was 68.9±15.2%, DLCO % pred was 43.6±11.9%, 6MWD was 393.0±93.5 m and SGRQ total score was 39.4±17.2. The mean±SD DTA score at baseline was 28.0±12.9%. On average, subjects showed slight progression (mean FVC decline 6.14% relative to baseline) over the follow-up period.
Table 2 shows baseline correlations between the DTA score and the clinical variables. There were weak to moderate correlations in the expected directions between DTA score and each anchor at baseline. Table 3 presents the results for the known-groups validity analyses of baseline data. Mean values for the DTA score were generally higher for subjects with poorer lung function, 6MWD and health-related quality of life. ANOVA showed the mean baseline DTA score differed across FVC, DLCO and SGRQ groups defined by severity, but not across the spectrum of 6MWD values. Tukey multiple comparison of means showed the mean DTA score was significantly different between any two FVC tertiles and any two DLCO tertiles (p<0.05, with Bonferroni adjustment). The subgroup with the lowest SGRQ score had a mean DTA score significantly different from the other two SGRQ-defined subgroups, whose mean DTA scores were not significantly different from each other.
Table 4 demonstrates responsiveness using Spearman correlations between the DTA change score, calculated as follow-up score minus baseline score, and changes in FVC, DLCO, 6MWD and SGRQ values. Correlations were weak to moderate and in the expected directions. The absolute value of all correlation coefficients was ≥0.30, supporting the appropriateness of each anchor for estimation of MCID.
Data for anchor-based estimation of MCID are presented in table 5. Mean change in DTA scores was stratified into groups according to “much worse”, “slightly worse”, “same”, “slightly better” and “much better” changes in each external anchor variable. ICC for baseline and follow-up DTA scores in each subgroup of stable subjects ranged from 0.78 to 0.91 (mean 0.83), showing good to excellent reliability. ANOVA showed significant differences in the means of DTA change scores across each set. Increase in the DTA score was consistently greater for subjects with larger declines in pulmonary function, 6MWD and health-related quality of life. The mean DTA change score in the “slightly worse” group is an estimate of MCID. Effect sizes in this group are small to medium for each anchor.
Table 6 summarises results for estimation of MCID using anchor- and distribution-based methods. Estimates of MCID were fairly consistent, ranging from 3.4% to 6.4%.
Discussion
HRCT of the chest is relied upon for the diagnosis and management of patients with IPF. Computational methods that produce quantitative HRCT scores for disease extent show promise as precise and objective outcome measures in IPF, but require further systematic performance testing. In this study, we used baseline and longitudinal data from two well-characterised populations to evaluate the performance characteristics of DTA. Confirming our prior work, correlations between the DTA score and pulmonary function tests at baseline were moderate. Extending our previous findings, additional anchor variables (6MWD and SGRQ) and known-groups validity testing show that DTA can distinguish subjects with differing levels of disease severity. It also showed good to excellent test–retest reliability in subjects determined to be stable based on anchor variables and it was responsive to changes in measured disease severity. Finally, we estimated, using both anchor- and distribution-based methods, MCID for worsening to be in the range of 3.4–6.4%.
Other researchers have used computational methods to evaluate lung fibrosis on HRCT. Like DTA, scores from multiple methods correlate with measures of pulmonary physiology. For example, Jacob et al. [31] showed that CALIPER, a quantification method based on local histograms of pixel intensity within volumes of interest, provided image-derived metrics of lung fibrosis that correlated more strongly with FVC at baseline than did visual scoring. Park et al.’s [32] texture-based quantification system showed fibrosis score correlations with baseline FVC and their measure of reticulation was predictive of decline in FVC at 1-year follow-up. Kim et al. [33] observed that a quantitative lung fibrosis score, computed by a machine learning algorithm trained with image textural features and expert labelled image regions, correlated with baseline values for FVC and DLCO. At 7-month follow-up, change in quantitative lung fibrosis score was also associated with changes in FVC and DLCO. Salisbury et al. [15] have also analysed HRCT scans from the PANTHER-IPF cohort. Using the AMFM (Adaptive Multi-Feature Method) algorithm they showed that baseline score for a ground-glass reticular pattern was independently associated with risk of a composite outcome of death, hospitalisation or 10% decline in FVC over 60 weeks. The change in this score was only weakly correlated (r= −0.25; p=0.01) with change in FVC at follow-up.
DTA is implemented as a simple convolutional neural network. It is based on unsupervised feature learning; image features used for classification were discovered in an initial clustering process, in contrast to engineered features that are chosen by the algorithm designer. In image texture analysis, engineered features are often based on first- and second-order pixel statistics within local regions. A weakness of feature engineering is the bias introduced in the design and feature selection process. Learned features rely on fewer design choices and tend to capture important details better than manually designed features [34]. Future work will evaluate the benefits of more complex convolutional neural network architectures in detection and quantification of diffuse lung diseases.
In October 2014, the US Food and Drug Administration approved two antifibrotic drugs for IPF (pirfenidone and nintedanib) based on changes in FVC [35]. In the confirmatory trials, the modelled average decline in FVC was ∼100 mL per year in subjects on either of the approved treatments [36, 37] compared with ∼200 mL per year in subjects on placebo [38]. These approvals have reshaped the landscape for future drug trials in IPF, as most upcoming trial subjects will be on one of these drugs [8, 39]. As measuring differences in FVC below 100 mL will be difficult, additional reliable, responsive and validated outcome measures of disease activity are needed.
While repeat HRCT within a short time interval was not available in this study, we observed acceptable reliability (ICC=0.78–0.91) in DTA fibrosis scores in subjects who remained stable, based on each anchor variable, during the follow-up period. We also observed that greater DTA fibrosis scores corresponded with greater degree of physiological impairment, reduced exercise tolerance and reduced patient-reported quality of life, and that changes in DTA fibrosis scores were moderately correlated with changes in external anchors. Confirmation in a separate population would be ideal; however, an appropriate, independent dataset with sequential scans and physiology was not available at the time of this analysis. As quantitative imaging using machine learning continues to advance, there is an increasingly urgent need for standardised imaging cohorts in IPF and other fibrotic lung diseases that can be used to develop, test and validate methods. It is likely that such datasets, if made available, would help drive innovation in the field.
Estimates of MCID are useful for consistent interpretation of results and for sample size calculations in clinical trial design. Distribution-based methods are more straightforward to calculate and provide an estimate of the degree of change in an outcome that is unlikely to be attributable to random measurement variation [19]. However, they lack the context provided by external anchors. Anchor-based methods are generally preferred [23], because they determine MCID as the degree of change in the outcome that is associated with a clinically relevant change in an external variable. We chose FVC, DLCO, 6MWD and SGRQ as external anchors because they are well-known indices of severity in IPF, and are routinely measured in clinical care and therapeutic trials. Of these, only FVC could be considered a validated outcome and this may be the best anchor. However, we included DLCO, 6MWD and SGRQ because, despite showing greater variability, they met minimum criteria for appropriateness and have been used as anchors in estimates of MCID of FVC [25]. Progression of morphological fibrosis on HRCT may be relatively independent of physiological progression, which may explain why the correlations between these measures in our study are not very strong. In fact, DTA may function best as a complementary measure rather than a substitute for physiological evaluation.
Strengths of the present study include use of pooled data, acquired prospectively in clinical trials, and the use of four external anchor variables in testing DTA's performance characteristics and estimation of its MCID. There are also several limitations to be noted. First, this was a post hoc analysis and data beyond 60 weeks were not available. Follow-up HRCT was available on only a small fraction of total subjects enrolled in each trial and this may represent a selection bias toward subjects with less aggressive disease progression. Second, there was variation in HRCT parameters and a slightly different follow-up interval in the trials. Differences in HRCT acquisition and reconstruction parameters, and in the level of lung inflation during a scan, are well-known sources of variation in quantitative HRCT of the lungs [40]. These effects can be alleviated by using standardised HRCT protocols that require only short breath holds and coaching subjects on the importance of reaching full inspiration for the scan [41]. We speculate that improved consistency in HRCT characteristics would reduce variability and improve performance of DTA or any quantitative image analysis method. Third, repeat HRCT over a short time interval was not available for test–retest analysis. Fourth, subjects either remained clinically stable or declined, so our MCID estimates are for worsening only. Finally, other methods for fibrosis quantification on HRCT have been proposed, but we did not perform direct comparisons of different algorithms.
This study demonstrates that quantitative measurement of lung fibrosis on HRCT is a reliable, valid and responsive measure of disease severity in a cohort combining subjects with IPF from two clinical trial populations. We estimated that DTA's MCID for worsening in IPF is in the range of 3.4–6.4%. This work suggests that quantitative HRCT using DTA, an image-based measure of morphology, may be a valuable additional tool for assessing outcomes in IPF that should be tested in prospective clinical trials.
Supplementary material
Please note: supplementary material is not edited by the Editorial Office, and is uploaded as it has been supplied by the author.
Footnotes
This article has supplementary material available from erj.ersjournals.com
Author contributions: Concept and design: S.M. Humphries, J.J. Swigris, K.K. Brown and D.A. Lynch. Data acquisition, analysis and interpretation: S.M. Humphries, J.J. Swigris, Q. Gong, J.S. Sundy, G. Raghu, M. Strand, M.I. Schwarz, K.K. Brown, K.R. Flaherty, R. Sood, T.G. O'Riordan and D.A. Lynch. Drafted manuscript for important intellectual contribution: S.M. Humphries, J.J. Swigris, K.K. Brown and D.A. Lynch. Review and finalising of the manuscript: all authors.
Support statement: Analysis of PANTHER-IPF study data was partially supported by NIH/NHLBI R01 HL091743 (K.R. Flaherty). Gilead Sciences funded quantitative analysis of HRCT in the RAINIER study. Funding information for this article has been deposited with the Crossref Funder Registry.
Conflict of interest: S.M. Humphries reports service contract for quantitative analysis of RAINIER HRCT scans from Gilead Sciences, during the conduct of the study; personal fees from Boehringer Ingelheim, grants from NHLBI, and service contract from PAREXEL Informatics, outside the submitted work; in addition, S.M. Humphries has a patent “Systems and methods for automatic detection and quantification of pathology using dynamic feature classification” pending to National Jewish Health.
Conflict of interest: J.J. Swigris has nothing to disclose.
Conflict of interest: K.K. Brown reports multiple lung fibrosis grants from NHLBI, personal fees from AstraZeneca, Bayer, Biogen, Fibrogen, Galecto, MedImmune, Novartis, Aeolus, ProMetic, Patara, Third Pole, aTyr and Boehringer Ingelheim, conversations under CDAs with Genoa, Galapagos and Global Blood Therapeutics, grants and personal fees from Gilead Sciences, and submitted grant from Roche/Genentech, outside the submitted work.
Conflict of interest: M. Strand has nothing to disclose.
Conflict of interest: Q. Gong has nothing to disclose.
Conflict of interest: J.S. Sundy reports being a full-time employee and stockholder in Gilead Sciences, Inc.
Conflict of interest: G. Raghu has been a consultant on IPF and fibrotic lung diseases for Boehringer Ingelheim, BMS, Bellerophon, Roche/Genentech and Veracyte, and a consultant on IPF studies for Biogen, Fibrogen, Gilead Sciences, Nitto, Promedior, Patara and Sanofi, outside the submitted work.
Conflict of interest: M.I. Schwarz has nothing to disclose.
Conflict of interest: K.R. Flaherty reports grants and personal fees from Boehringer Ingelheim and Roche/Genentech, personal fees from Veracyte, Aeolus, Pharmakea, Fibrogen and Sanofi-Genzyme, and grants from Afferent, outside the submitted work.
Conflict of interest: R. Sood reports that Gilead Sciences paid for cost of services for running the IPF clinical trial, during the conduct of the study.
Conflict of interest: T.G. O'Riordan is a full-time employee and stockholder of Gilead Sciences.
Conflict of interest: D.A. Lynch reports grants from NHLBI, personal fees and research support from PAREXEL and Veracyte, personal fees from Boehringer Ingelheim, Genentech/Roche and Acceleron, outside the submitted work; in addition, D.A. Lynch has a patent “Systems and methods for automatic detection and quantification of pathology using dynamic feature classification” pending to National Jewish Health.
- Received July 23, 2018.
- Accepted July 26, 2018.
- Copyright ©ERS 2018