Abstract
Computer-based computed tomography (CT) analysis can provide objective quantitation of disease in idiopathic pulmonary fibrosis (IPF). A computer algorithm, CALIPER, was compared with conventional CT and pulmonary function measures of disease severity for mortality prediction.
CT and pulmonary function variables (forced expiratory volume in 1 s, forced vital capacity, diffusion capacity of the lung for carbon monoxide, transfer coefficient of the lung for carbon monoxide and composite physiologic index (CPI)) of 283 consecutive patients with a multidisciplinary diagnosis of IPF were evaluated against mortality. Visual and CALIPER CT features included total extent of interstitial lung disease, honeycombing, reticular pattern, ground glass opacities and emphysema. In addition, CALIPER scored pulmonary vessel volume (PVV) while traction bronchiectasis and consolidation were only scored visually. A combination of mortality predictors was compared with the Gender, Age, Physiology model.
On univariate analyses, all visual and CALIPER-derived interstitial features and functional indices were predictive of mortality to a 0.01 level of significance. On multivariate analysis, visual CT parameters were discarded. Independent predictors of mortality were CPI (hazard ratio (95% CI) 1.05 (1.02–1.07), p<0.001) and two CALIPER parameters: PVV (1.23 (1.08–1.40), p=0.001) and honeycombing (1.18 (1.06–1.32), p=0.002). A three-group staging system derived from this model was powerfully predictive of mortality (2.23 (1.85–2.69), p<0.0001).
CALIPER-derived parameters, in particular PVV, are more accurate prognostically than traditional visual CT scores. Quantitative tools such as CALIPER have the potential to improve staging systems in IPF.
CALIPER-derived parameters such as pulmonary vessel volume are more accurate prognostically than visual CT scores http://ow.ly/2b5G304exlA
Introduction
Accurate prognostication is central to the management of patients with idiopathic pulmonary fibrosis (IPF). In addition to informing a patient of their probable life expectancy [1], accurate prediction of a patient's likely clinical course allows the institution of appropriate management, which presently may include antifibrotic medication, referral for transplantation [2] or palliative care pathways.
However, prognostication in IPF is fraught with difficulties. An array of prognostic indicators have been used over the years, but with varying degrees of success in IPF. Pulmonary function tests (PFTs) such as diffusion capacity of the lung for carbon monoxide (DLCO) are perhaps the most sensitive markers of disease severity, but are associated with a measurement variation of 10–15% per test [3]. Visual computed tomographic (CT) evaluation is subject to interobserver variation [4, 5]. Composite indices have been proposed, but are yet to be fully validated [6]. As a result, new tools that may be more accurate in predicting a patient's prognosis are required. A recent recommendation from the Fleischner Society [7] has emphasised computer-based quantitative CT analysis as a potential outcome measure in IPF.
Establishing whether a quantitative tool is a suitable marker of disease outcome requires the evaluation of the tool against other markers of baseline disease severity. A sophisticated quantitative CT algorithm (CALIPER) has been shown to have better correlations with PFTs than semiquantitative visual CT evaluation [8].
The aims of our study were to compare CALIPER, visual CT scoring and PFTs against survival in IPF. Optimal stratification was explored using CALIPER variables, visual variables, PFTs and the Gender, Age, Physiology (GAP) score.
Methods
A retrospective analysis of an interstitial lung disease database identified all consecutive newly attending patients receiving a multidisciplinary team diagnosis of IPF according to published guidelines [9], over a 4.5-year period (January 2007 to July 2011). Patients with a departmental, noncontrast, supine, volumetric CT were included in the study cohort (figure 1). CT, echocardiography and PFT protocols are included in the online supplementary material, as are details of CALIPER CT evaluation. The Digital Imaging and Communications in Medicine images for the CT scans were transferred to the Mayo Clinic (Rochester, MN, USA) for blinded CALIPER processing. Approval for this study of clinically indicated CT and pulmonary function data was obtained from the institutional ethics committees of the Royal Brompton Hospital (London, UK) and Mayo Clinic and informed patient consent was not required.
Visual CT evaluation
Each CT scan was evaluated independently by two radiologists with 5 and 7 years of thoracic imaging experience, blinded to all clinical information. An initial training dataset of 15 nonstudy cases was used to help to identify pre-existing biases. The scores of the test cases were reviewed and the most widely discrepant results discussed with a third radiologist.
CTs were scored on a lobar basis using a continuous scale. The total extent of interstitial lung disease (ILD) was initially estimated to the nearest 5%, then subclassified into four patterns: reticular, ground glass opacification, honeycombing and consolidation, using definitions from the Fleischner Society glossary of terms for thoracic imaging [10]. To derive a lobar percentage for each parenchymal pattern, the total lobar ILD extent was multiplied by individual lobar parenchymal pattern extents and divided by 100. Furthermore, the percentage (to the nearest 5%) of each lobe that contained mosaicism (decreased attenuation component) or emphysema was recorded. The individual lobar percentages of each parenchymal pattern were summed for each radiologist and divided by six to create an averaged lobar score per pattern, per scorer, per case.
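The lobar arithmetic described above can be sketched as follows. This is a minimal illustration, not the scoring software used in the study: the function names and the example scores are hypothetical, and it assumes that each pattern extent is recorded as a percentage of the lobar ILD and that six lobes are scored (the text divides by six).

```python
LOBES = 6  # assumption: six lobes per case, as implied by "divided by six"


def lobar_pattern_percent(total_ild_percent, pattern_share_percent):
    """Percentage of one lobe occupied by a pattern.

    total_ild_percent: lobar ILD extent (0-100).
    pattern_share_percent: share of that ILD taken by the pattern (0-100).
    """
    return total_ild_percent * pattern_share_percent / 100.0


def averaged_pattern_score(lobes):
    """Average the lobar pattern percentages over all lobes for one scorer.

    lobes: list of (total_ild_percent, pattern_share_percent), one per lobe.
    """
    if len(lobes) != LOBES:
        raise ValueError("expected one entry per lobe")
    return sum(lobar_pattern_percent(t, p) for t, p in lobes) / LOBES


# Hypothetical scores for one pattern (e.g. reticular) in one case.
example = [(20, 50), (10, 100), (0, 0), (40, 25), (60, 50), (30, 0)]
score = averaged_pattern_score(example)  # lobar values 10, 10, 0, 10, 30, 0
```

In this example the six lobar values sum to 60, giving an averaged score of 10% for the pattern.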
Traction bronchiectasis, as defined in the Fleischner Society glossary of terms [10], was assigned a categorical “severity” score that took into account the average degree of airway dilatation within areas of fibrosis as well as the extent of dilatation throughout the lobe, and was given a Gestalt score of none (0), mild (1), moderate (2) or severe (3). An index of pulmonary hypertension (main pulmonary artery:ascending aorta ratio) was assessed by two scorers using electronic caliper measurements of the ascending aorta and pulmonary artery diameters at the level of the pulmonary artery bifurcation [11]. Consensus formulation for visual scores is outlined in the online supplementary material.
CALIPER CT evaluation
Data processing
Initial data processing steps involved extraction of the lung from the surrounding thoracic structures and segmentation into upper, middle and lower zones. Lung segmentation was performed using an adaptive density-based morphological approach [12], while airway segmentation involved iterative three-dimensional region growing, density thresholding (thresholds including −950 HU and −960 HU) and connected components analysis. Segmentation of pulmonary vessels, prior to their extraction, was achieved using an optimised multiscale tubular structure enhancement filter based on the eigenvalues of the Hessian matrix. The filters calculated the second-order derivatives that occurred in the regions that surrounded each pulmonary voxel. The eigenvalues of the Hessian matrix that were constructed from the derivatives were then analysed, and from these values, it was possible to determine the likelihood that an underlying voxel was connected to a dense tubular structure and therefore represented a vessel [13, 14].
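The principle behind Hessian-based tubular structure detection can be illustrated with a deliberately simplified sketch. It is a single-scale, two-dimensional analogue written for illustration only; it is not the CALIPER filter, which is three-dimensional, multiscale and optimised. The function names, the 0.25 ratio threshold and the toy images are assumptions.

```python
import math


def hessian_eigenvalues(img, r, c):
    """Eigenvalues of the 2x2 finite-difference Hessian at interior pixel (r, c).

    Returned as (smaller magnitude, larger magnitude).
    """
    dxx = img[r][c + 1] - 2 * img[r][c] + img[r][c - 1]
    dyy = img[r + 1][c] - 2 * img[r][c] + img[r - 1][c]
    dxy = (img[r + 1][c + 1] - img[r + 1][c - 1]
           - img[r - 1][c + 1] + img[r - 1][c - 1]) / 4.0
    mean = (dxx + dyy) / 2.0
    radius = math.sqrt(((dxx - dyy) / 2.0) ** 2 + dxy ** 2)
    small, large = sorted((mean - radius, mean + radius), key=abs)
    return small, large


def looks_tubular(img, r, c, ratio=0.25):
    """A bright tube has one strongly negative eigenvalue and one near zero."""
    small, large = hessian_eigenvalues(img, r, c)
    return large < 0 and abs(small) <= ratio * abs(large)


# Toy images: a bright vertical line and an isolated bright blob.
line = [[0, 0, 100, 0, 0] for _ in range(5)]
blob = [[0] * 5 for _ in range(5)]
blob[2][2] = 100

on_line = looks_tubular(line, 2, 2)   # curvature across the line only
on_blob = looks_tubular(blob, 2, 2)   # curvature in both directions
off_line = looks_tubular(line, 2, 1)  # dark pixel beside the line
```

Only the pixel on the line is flagged: intensity there falls away sharply across the structure but is constant along it, which is the eigenvalue signature that the filters described above exploit.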
The pulmonary vessel volume (PVV) score quantified the volumes of pulmonary arteries and veins excluding vessels at the lung hilum as a percentage of lung volume (figure 2). The PVV score was subdivided according to vessel size. All vessels on a single CT image that had a cross-sectional area <5 mm² (PVV5) or <10 mm² (PVV10) or >5 mm² (PVV>5) were summed, and expressed as volume (cm³) after adjusting for z-axis CT slice thickness. The 5 mm² and 10 mm² thresholds were chosen after analysing a range of vessel size thresholds as described in the online supplementary material, and are in line with vessel size thresholds analysed in patients with chronic obstructive pulmonary disease [15]. The PVV>5 threshold was examined to remove the capture of potentially misclassified reticular pattern in the PVV variable in cases with extensive fibrosis.
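The size-based subdivision of the PVV can be sketched as below. This is an illustrative reading of the description, not the CALIPER implementation: the function name, the per-image lists of cross-sectional areas and the 1 mm slice thickness are all hypothetical.

```python
def vessel_volume_cm3(areas_by_slice_mm2, slice_thickness_mm, keep):
    """Sum the cross-sections selected by keep(area) over all CT images,
    multiply by slice thickness (mm^2 x mm = mm^3) and convert to cm^3."""
    total_area_mm2 = sum(
        area for areas in areas_by_slice_mm2 for area in areas if keep(area)
    )
    return total_area_mm2 * slice_thickness_mm / 1000.0


# Hypothetical vessel cross-sectional areas (mm^2) on three CT images.
slices = [[2.0, 4.0, 12.0], [3.0, 7.0], [6.0, 20.0]]
thickness_mm = 1.0  # assumed z-axis slice thickness

pvv5 = vessel_volume_cm3(slices, thickness_mm, lambda a: a < 5)      # <5 mm^2
pvv10 = vessel_volume_cm3(slices, thickness_mm, lambda a: a < 10)    # <10 mm^2
pvv_gt5 = vessel_volume_cm3(slices, thickness_mm, lambda a: a > 5)   # >5 mm^2
```

With the strict inequalities given in the text, a vessel of exactly 5 mm² would be counted in neither PVV5 nor PVV>5; how the original software treats that boundary is not stated.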
Parenchymal tissue type classification was applied to 15×15×15-voxel volume units using texture analysis, computer vision-based image understanding of volumetric histogram signature mapping features and 3D morphology [13]. The CALIPER tool was trained by subspecialty thoracic radiologist consensus assessment of pathologically confirmed datasets [13, 16].
Pattern evaluation
CALIPER evaluation of CT data involved algorithmic identification and volumetric quantification of every voxel volume unit into one of eight radiological parenchymal features: normal lung, three grades of decreased lung attenuation (grade 1: mild, 2: moderate and 3: marked), ground glass opacification, reticular pattern, honeycombing and the pulmonary vessels (figure 2). Volumes for all eight parenchymal features were converted into a percentage using the total lung volume, also measured by CALIPER. Total extent of ILD represented the sum of ground glass, reticular and honeycomb percentages. The sum of grade 2 and 3 decreased lung attenuation represented emphysema [8].
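The derivation of the summary variables from the eight feature volumes can be sketched as follows. The dictionary keys, function names and volumes are hypothetical, and total lung volume is passed in separately because the text states that it is measured by CALIPER rather than specifying how it relates to the eight features.

```python
def to_percentages(volumes_ml, total_lung_volume_ml):
    """Express each feature volume as a percentage of total lung volume."""
    return {
        name: 100.0 * volume / total_lung_volume_ml
        for name, volume in volumes_ml.items()
    }


def ild_extent(pct):
    """Total ILD extent: ground glass + reticular + honeycombing."""
    return pct["ground_glass"] + pct["reticular"] + pct["honeycombing"]


def emphysema_extent(pct):
    """Emphysema: grade 2 + grade 3 decreased lung attenuation."""
    return pct["low_attenuation_grade2"] + pct["low_attenuation_grade3"]


# Hypothetical feature volumes (mL) for one patient.
volumes = {
    "normal": 2800,
    "low_attenuation_grade1": 400,
    "low_attenuation_grade2": 120,
    "low_attenuation_grade3": 80,
    "ground_glass": 200,
    "reticular": 240,
    "honeycombing": 40,
    "vessels": 120,
}
pct = to_percentages(volumes, total_lung_volume_ml=4000)
ild = ild_extent(pct)              # 5 + 6 + 1
emphysema = emphysema_extent(pct)  # 3 + 2
```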
Statistical analysis
Data are presented as median, mean±sd or n (%). Interobserver variation for visual scores was assessed using the single-determination standard deviation. Linear regression analyses were performed to explore relationships between the PVV and CALIPER ILD extent, visual ILD and reticular pattern extents and right ventricular systolic pressure (RVSP). Linear regression was performed to evaluate relationships between PVV subdivisions and PVV and DLCO. Univariate and multivariate Cox regression analyses were used to investigate relationships within and between the three datasets: CALIPER CT evaluation, visual CT evaluation and PFTs. In all study analyses, p<0.01 was considered significant.
The hazard ratios for those parameters that were independent predictors of mortality on multivariate Cox regression analysis were used to generate a formula that represented an estimate of mortality for each patient. The hazard ratios for CALIPER parameters alone that were independent predictors of mortality on multivariate Cox regression analysis were also used to generate a separate formula that represented an estimate of mortality for each patient.
The mortality estimates derived from the hazard ratio scores were converted into categorical scores and compared to mortality estimates derived from the GAP index staging system [17], using univariate and bivariate Cox regression analyses, Kaplan–Meier survival plots and the log-rank test. Robustness of results was confirmed using bootstrapping and resampling of the dataset up to 1000 times. Goodness of fit of the survival models was calculated using Harrell's concordance index. Assumptions of linearity and proportional hazards were tested by visual inspection of Martingale residuals and scaled Schoenfeld residuals. Statistical analyses were performed using Stata (version 12; StataCorp, College Station, TX, USA).
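For readers unfamiliar with Harrell's concordance index, a minimal sketch of the textbook definition for right-censored data is given below. It is not the Stata routine used in the study, it does not treat tied survival times specially, and the function name and example data are hypothetical.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's concordance index.

    times: follow-up time per patient.
    events: 1 if death was observed, 0 if the patient was censored.
    risk_scores: model output, higher meaning higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable when patient i died strictly before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up (months), vital status and risk scores.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [3, 2, 2, 1]
c_index = harrell_c(times, events, risk)  # 4.5 concordant of 5 usable pairs
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking of patients by survival time.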
Results
Demographic data
The study population consisted of 283 consecutive patients with a multidisciplinary diagnosis of IPF. The median age at presentation was 67 years; the mean follow-up time was 30±21.5 months and 210 (74%) patients died during the study period. Data on vital status were completed in 98.6% of cases, with four (1.4%) patients censored. Demographic data and average visual score, CALIPER score and PFT data are presented in table 1. Interobserver variation values for the visual scores are provided in the online supplementary material.
Mortality analyses
All visual and CALIPER-derived interstitial features were predictive of mortality. The CALIPER-derived PVV (shown in figure 2) and PVV5, PVV10 and PVV>5 were highly significant on univariate analysis, as were all PFTs, the RVSP and the GAP score (table 2). When RVSP and CALIPER ILD extent were each placed alongside PVV in a bivariate mortality analysis, only PVV remained independently predictive of mortality. On linear regression analysis, major co-linearity was demonstrated between PVV and CALIPER ILD extent (R² 0.76, p<0.0001) and between PVV and visual ILD extent (R² 0.57, p<0.0001) [8]. A lesser degree of co-linearity was demonstrated between PVV and RVSP (R² 0.20, p<0.0001) and between PVV and visual reticular pattern extent (R² 0.16, p<0.0001). Relationships between PVV subdivisions and both functional indices and survival are shown in the online supplementary material.
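The coefficient of determination reported for these simple linear regressions can be computed as in the sketch below, which uses the identity that, for a single predictor, it equals the squared Pearson correlation. The function name and the data are hypothetical and are not study values.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Hypothetical paired measurements (e.g. ILD extent and PVV, both in %).
ild_extent_pct = [1, 2, 3, 4]
pvv_pct = [1, 3, 2, 4]
r2 = r_squared(ild_extent_pct, pvv_pct)  # sxy = 4, sxx = syy = 5
```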
On stepwise proportional hazards analysis, the independent CALIPER-derived predictors of mortality were honeycombing and the PVV (table 3). When inserted separately into multivariate models, PVV5 and PVV10 did not retain significance. However, PVV>5 was independently predictive of mortality (online supplementary table S4). On multivariate analysis of pulmonary function indices, the composite physiologic index (CPI) was the strongest predictor of mortality. A multivariate Cox regression analysis combining CALIPER scores, visual scores, PFTs and the GAP score demonstrated the CPI (which has been previously validated using CALIPER [8]) to be the variable that best quantified the severity of ILD. The two remaining CT variables independently predictive of mortality were derived by CALIPER: honeycombing and PVV. No visually scored CT parameters were independently associated with mortality (table 3). When the GAP score was substituted for the CPI in the final multivariate model, although it was retained as an independent predictor of mortality, it was not as strong as the CPI in the model (table 3). These findings were maintained when PVV>5 was substituted for PVV.
Derivation of composite variables
A further mortality evaluation compared the GAP index staging system [17] to two mortality estimates derived from the hazard ratios of two multivariate models. One formula was derived from the hazard ratios of the final three independent predictors of mortality (CALIPER–CPI score) as follows:
CALIPER–CPI score = (CALIPER PVV×23.0904) + (CALIPER honeycombing×18.3795) + (CPI×4.5065)
The second formula was derived from the hazard ratios of the two variables that on multivariate analysis of CALIPER scores were independent predictors of mortality (CALIPER-only score):
CALIPER-only score = (CALIPER PVV×52.9004) + (CALIPER honeycombing×12.0524)
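The two formulae can be transcribed directly, as in the sketch below. The weights are those given in the text; the function names and the example patient are hypothetical, and it is assumed that PVV and honeycombing are entered as percentages of lung volume and the CPI in its usual units.

```python
def caliper_cpi_score(pvv, honeycombing, cpi):
    """CALIPER-CPI score using the weights given in the text."""
    return pvv * 23.0904 + honeycombing * 18.3795 + cpi * 4.5065


def caliper_only_score(pvv, honeycombing):
    """CALIPER-only score using the weights given in the text."""
    return pvv * 52.9004 + honeycombing * 12.0524


# Hypothetical patient: PVV 5%, honeycombing 2%, CPI 50.
combined = caliper_cpi_score(pvv=5.0, honeycombing=2.0, cpi=50.0)
imaging_only = caliper_only_score(pvv=5.0, honeycombing=2.0)
```

For this hypothetical patient the scores are 377.536 and 288.6068, respectively; the values have no meaning in isolation and serve only to rank patients before categorisation.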
Both the CALIPER–CPI and CALIPER-only scores were converted into categorical scores by aligning the individual scores in ascending numerical order and dividing the respective cohorts into three equally sized groups (n=83).
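The conversion of a continuous score into three ranked groups can be sketched as follows. This is one straightforward way to form tertiles and is not necessarily the procedure used in the study; the function name and scores are hypothetical, and tied scores are separated by their original order.

```python
def tertile_groups(scores):
    """Rank scores in ascending order and split them into three groups of
    (as near as possible) equal size. Returns a group label 1-3 per patient,
    in the original patient order."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    groups = [0] * n
    for rank, patient in enumerate(order):
        groups[patient] = 1 + (rank * 3) // n
    return groups


# Hypothetical composite scores for six patients.
scores = [30, 10, 50, 20, 60, 40]
groups = tertile_groups(scores)  # lowest two -> 1, middle two -> 2, top two -> 3
```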
Comparison of composite variables
Univariate Cox regression analyses demonstrated that the CALIPER–CPI and CALIPER-only categories were not only of similar prognostic strength to the GAP index staging system, but demonstrated improved goodness of fit as models (table 4).
On bivariate Cox regression analysis, the GAP index staging system did not retain significance against the CALIPER–CPI categories, a result confirmed with bootstrapping of 1000 samples (table 4). The GAP index staging system was also shown to be a weaker predictor of mortality than the CALIPER-only categories, again confirmed with bootstrapping of 1000 samples. These findings were maintained when PVV>5 was substituted for PVV (online supplementary table S5). Kaplan–Meier survival curves demonstrated similar separation of the groups using the CALIPER–CPI and CALIPER-only categories when compared to the GAP index staging system (log-rank test p<0.0001 for the GAP index staging system, CALIPER–CPI and CALIPER-only categories) (figure 3).
When the CALIPER–CPI and CALIPER-only scores were adjusted such that patient numbers in each of the groups were identical to the patient numbers in the GAP staging system groups, the relationships on univariate analysis and bivariate analysis with bootstrapping did not change. Goodness of fit of the two models was similar to that of the GAP index staging system (Harrell's c-index 0.66 for the CALIPER–CPI score and 0.65 for the CALIPER-only score). Kaplan–Meier survival curves again demonstrated similar separation of the groups using the new adjusted CALIPER–CPI and CALIPER-only categories when compared to the GAP index staging system (log-rank test p<0.0001 for the adjusted CALIPER–CPI and CALIPER-only categories) (figure 3).
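The Kaplan–Meier curves referred to above rest on the product-limit estimator, sketched below for a single group. This is a generic illustration rather than the Stata procedure used in the study; the function name and follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time per patient.
    events: 1 if death was observed, 0 if the patient was censored.
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


# Hypothetical follow-up (months) for five patients in one category.
times = [6, 12, 12, 18, 24]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)  # steps at 6, 12 and 18 months
```

Patients censored at the same time as a death are counted as at risk for that death, which is the usual convention.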
Discussion
Our study has demonstrated that computer-derived quantitative CT parameters are better predictors of mortality in IPF than visually scored parameters. When patients are stratified on the basis of CALIPER variables and the CPI, mortality prediction is improved when compared to stratification using the GAP index staging system. Central to the strength of mortality prediction using CALIPER was a single variable, the PVV. The PVV is a novel CALIPER variable with no visually scored equivalent. Accordingly, the PVV may represent a new parameter in the evaluation of patients with IPF.
DLCO has long been considered the parameter that best reflects disease severity at baseline in IPF [18], but is handicapped by measurement noise in the range 5–15% [3, 19]. Consequently, interest is increasing in exploring other potential markers of disease severity or worsening in IPF, such as peripheral blood [20] and imaging [7] biomarkers. Given the rapid technological advances of computer-based quantitative CT [13], an exploration of their potential accuracy in assessing IPF is timely.
In the current study, on univariate analysis, visually scored reticular pattern, honeycombing and traction bronchiectasis were predictive of mortality, confirming the conclusions of a previous study which evaluated the prognostic significance of the same patterns in patients with a histopathologically proven usual interstitial pneumonia pattern [21].
The final combined visual, CALIPER and CPI multivariate model identified three independent predictors of survival. One of these was the CPI, and the finding confirms previous reports highlighting the strength of signal of the CPI in predicting outcome in fibrosing lung disease [22, 23]. Similarly, the prognostic signal associated with honeycombing when scored by CALIPER confirms results from visual scoring of the same pattern in IPF [24].
The co-linearity demonstrated between PVV and CALIPER/visual ILD extent and between PVV and RVSP suggests that the PVV might represent a variable that simultaneously captures disease within the interstitial and vascular compartments. It is noteworthy that CALIPER software was originally designed to segment out and discard vascular structures (rather than quantify them) and in so doing optimise the classification of parenchymal patterns. In cases with more severe fibrosis on CT, vessel segmentation inevitably captures a minor degree of peripheral reticular pattern (figure 2); however, there was notably only a minor degree of co-linearity between visual reticular pattern and PVV. Furthermore, the ability of PVV to predict mortality when vessels <5 mm² were excluded from analysis argues against the likelihood that a significant part of the PVV signal is misclassified reticular pattern. In any event, as software algorithms improve, there remains the potential to refine vessel delineation by CALIPER, or a similar quantitative tool, and thus improve the prognostic signal of this new CT parameter.
The basis of the signal provided by PVV is obscure. PVV has not previously been studied as a prognostic marker in IPF or considered as a prognostic indicator independent of pulmonary hypertension. Recent IPF studies assessing the complex interactions between angiogenic and angiostatic mediators in disease pathogenesis [25] have primarily considered the vasculature through the prism of overt pulmonary hypertension [26].
Blood perfusion in areas of fibrosis has been shown to be reduced [27], but conversely increased in spared lung adjacent to areas of fibrosis [28, 29]. It follows that the strong correlations between ILD extent and vessels may reflect regional, subclinically elevated local pulmonary arterial pressures within mildly fibrotic lung, or capillary bed destruction in more advanced disease, which may produce a preferential diversion of blood flow to relatively spared or nonfibrotic lung. The vascular capacitance of spared lung (the upper and middle lobes in patients with IPF, a predominantly basal disease) may result in an increase in vessel volume in more advanced disease. The identification of greater numbers of vessels, of a size that could be detected by CALIPER, could therefore act as a surrogate marker for the extent and severity of parenchymal disease in IPF.
Another possible explanation for the relationship between PVV and ILD extent relates to the increased negative intrathoracic pressure that noncompliant fibrotic lungs need to generate during inspiration. The transmission of high negative pressures through the pleural space into the parenchyma could in turn be exerted on the vasculature, resulting in dilatation throughout the lung and an increase in capacitance. A third possible mechanism relates to pleuroparenchymal and/or bronchial-pulmonary artery anastomoses described histopathologically in patients with fibrosing lung disease [30]. While the clinical importance of shunting within the lungs is yet to be established [31], the development of shunts could theoretically increase the PVV as fibrosis progresses.
The evaluation of CALIPER in our previous study highlighted the close correlations between parenchymal patterns scored using an automated computer system and pulmonary function indices [8]. The results suggested that a tool such as CALIPER, when used alone, could be a viable alternative to PFTs in predicting outcome in IPF. Our mortality analyses have also underlined the importance of integrating structural and functional parameters for prognostication in IPF. The CPI, when combined with CALIPER variables, produced a stronger model with which to predict mortality than achieved by CALIPER or functional indices alone.
The GAP score is a multidimensional continuous score that aims to predict mortality in IPF by utilising commonly measured clinical and physiological variables [17]. The GAP score was shown to be a strong univariate predictor of mortality in our study, and was enhanced when combined with CALIPER variables in a multivariate model. However, the model was not as powerful as the combination of CALIPER variables with CPI. The findings highlight a relative weakness of the GAP score consequent upon its relatively coarse nine-point gradation, whereas the continuous nature of the CPI allows the CPI to be more discriminatory.
The GAP staging system represents a categorical version of the GAP score [17]. A consequence of stratification according to the GAP staging system is that only 50% of patients in the current study constituted GAP groups 1 and 3 (patients with mild and severe disease, respectively). Yet it is precisely such patients that require identification in IPF cohorts. Those identified as likely to have limited disease can be monitored using a watch-and-wait policy, while those with more severe disease might be referred earlier for transplantation [2]. When stratification with the CALIPER–CPI categories was performed, despite the arbitrary division of the patients into evenly sized groups, prognostication and goodness of fit were improved when compared to the categorical GAP staging system.
Emphysema when scored visually or by CALIPER was not found to be predictive of mortality in the current study. There have been conflicting reports in the literature regarding the impact on survival when emphysema coexists with fibrosis. Whilst some authors have argued that emphysema is a poor prognostic determinant in IPF [32–35], other studies have refuted this observation [36–40]. While our findings concur with the latter view, more detailed evaluation of emphysema subtypes and distribution within the lung, as well as correction for baseline disease severity in patients with IPF may help clarify the primary effect of emphysema on survival in IPF.
A limitation of our study lies in the lack of an external validation cohort with which to confirm our study findings. However, the scarcity of well-characterised populations of IPF patients, even in tertiary centres, is well recognised. A potential solution lies in the pooling of international multicentre populations, similar to the COPDGene study population [41], but such an undertaking will require close collaborative efforts between tertiary centres.
In conclusion, we have shown that quantitative computer-derived CT variables in IPF are better predictors of mortality than any visually scored CT parameter. Stratification using CALIPER variables and CPI provides a stronger mortality signal than stratification using the GAP index. One CALIPER variable in particular, the PVV, has the strongest link with mortality and could be a new index in the evaluation of IPF.
Footnotes
This article has supplementary material available from erj.ersjournals.com
Conflict of interest: Disclosures can be found alongside the online version of this article at erj.ersjournals.com
- Received May 20, 2016.
- Accepted September 7, 2016.
- Copyright ©ERS 2017