The purpose of the present study was to evaluate the accuracy of the diagnosis of idiopathic pulmonary fibrosis (IPF) by respiratory physicians in six European countries, and to calculate the interobserver agreement between high-resolution computed tomography reviewers and histology reviewers in IPF diagnosis.
The diagnosis of usual interstitial pneumonia (UIP) was assessed by a local investigator, following the American Thoracic Society/European Respiratory Society consensus statement, and confirmed when a minimum of two out of three expert reviewers from each expert panel agreed with the diagnosis. The level of agreement between readers within each expert panel was calculated by weighted kappa.
The diagnosis of UIP was confirmed by the expert panels in 87.2% of cases. A total of 179 thoracic high-resolution computed tomography scans were independently reviewed, and an interobserver agreement of 0.40 was found. Open or thoracoscopic lung biopsy was performed in 97 patients, 82 of whom could be reviewed by the expert committee. The weighted kappa between histology readers was 0.30.
It is concluded that, although the level of agreement between the readers within each panel was only fair to moderate, the overall accuracy of a clinical diagnosis of idiopathic pulmonary fibrosis in expert centres is good (87.2%).
Idiopathic pulmonary fibrosis (IPF) is a specific form of a chronic fibrosing interstitial pneumonia limited to the lung, and is typically characterised by the histological appearance of usual interstitial pneumonia (UIP) on open or thoracoscopic lung biopsy (OLB and TLB, respectively) 1. The clinical diagnosis of IPF is based on the exclusion of known causes of interstitial lung disease, a restrictive lung function pattern with impaired gas exchange and the presence of a typical pattern of bibasilar reticular abnormalities with minimal ground-glass opacities on thoracic high-resolution computed tomography (HRCT) 1.
Patients with IPF show worse survival than those with other types of idiopathic interstitial pneumonia 2–4. Since the diagnosis of IPF depends upon the expertise of the pathologist and radiologist, it is important that the clinician knows the diagnostic accuracy of thoracic HRCT and of lung biopsy in UIP. Various studies have calculated the accuracy of thoracic HRCT in fibrotic lung diseases 5–7, evaluated interobserver agreement for the diagnosis of different thoracic HRCT patterns (e.g. ground-glass and reticular pattern) 8, 9 in patients with biopsy-proven nonspecific interstitial pneumonia (NSIP) or UIP 3, or in different forms of interstitial lung disease 10, 11. Studies on interobserver agreement amongst pathologists are sparse 12, and only one study with a multicentric prospective design has addressed the issue of diagnostic accuracy in UIP in relation to both radiologist and pathologist 7. No study has addressed this issue in view of the new American Thoracic Society (ATS)/European Respiratory Society (ERS) consensus criteria 1. Therefore, the aim of the present study was to evaluate the diagnostic accuracy of respiratory physicians in IPF, and to calculate the interobserver agreement between HRCT reviewers and histology reviewers in the diagnosis of UIP.
All of the patients presented in the current study were included in the Idiopathic Pulmonary Fibrosis International Group Exploring N-Acetylcysteine I Annual (IFIGENIA) trial 13. The IFIGENIA trial is a European prospective double-blind placebo-controlled trial studying the effect of high-dose N-acetylcysteine in combination with standard therapy (prednisone and azathioprine) in patients with IPF. Following the judgment of a local investigator, patients were included if the diagnosis of IPF was based on the international consensus criteria 1, and they were aged 18–75 yrs. Newly diagnosed (<6 months) as well as previously diagnosed (>6 months) patients were considered for the study. The IFIGENIA trial was approved by the local ethical committee of the participating centres and every patient signed their informed consent.
HRCT scanning protocol
HRCT of the thorax was performed in the supine position during breath-holding at full inspiration, with 1- or 1.5-mm-thick sections at 1-cm intervals throughout the entire thorax. Images were reconstructed using a high-frequency algorithm at window levels appropriate for the pulmonary parenchyma (mean −500 to −700 HU; width 1,400–2,000 HU). No intravenous contrast was administered.
Review by the radiology committee
The local investigator provided copies of the original HRCT scan and sent these to the international trial coordinator (G. Corvasce). A copy was sent to each of the three members of the radiology committee (C.D.R. Flower, F. Laurent and J. Verschakelen). The copies of the HRCT scans were reviewed independently without knowledge of clinical, physiological or pathological parameters. The international trial coordinator ensured that the three members of the radiology committee (reviewers A, B and C) were unaware of the patient’s identity. Each member of the committee confirmed the diagnosis of UIP on thoracic HRCT based on the criteria of the international consensus statement 1. The degree of confidence in the diagnosis was recorded in terms of the scan being very suggestive, probable or unlikely for the diagnosis. The UIP diagnosis on thoracic HRCT was confirmed if the scan was scored as very suggestive or probable for UIP, and rejected if it was scored as unlikely. If disagreement occurred between the three members of the radiology committee, the UIP diagnosis agreed by the majority of the three members was accepted as definite.
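The radiology decision rule described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the handling of uninterpretable or missing scans (simply left out of the vote here) is an assumption, since the text does not specify it.

```python
# Grades that count as a confirmation of UIP for a single reader.
CONFIRMING = {"very suggestive", "probable"}
VALID = CONFIRMING | {"unlikely"}


def reader_confirms(grade: str) -> bool:
    """True if one reader's confidence grade confirms UIP."""
    if grade not in VALID:
        raise ValueError(f"unknown grade: {grade!r}")
    return grade in CONFIRMING


def radiology_panel_confirms(grades) -> bool:
    """Majority vote over the available reader grades.

    Assumption: only interpretable, reviewed scans are passed in.
    """
    votes = [reader_confirms(g) for g in grades]
    if not votes:
        raise ValueError("no grades supplied")
    return 2 * sum(votes) > len(votes)


example = radiology_panel_confirms(["very suggestive", "unlikely", "probable"])
```

In this example two of the three readers confirm, so the panel confirms UIP on HRCT.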
Review of lung biopsy specimens by the histology committee
The diagnosis of UIP according to the criteria of the ATS/ERS consensus classification 1 was assessed by an independent panel of three pathology experts (A.G. Nicholson, E.K. Verbeken and F. Capron). The local investigator sent OLB or TLB slides to the international trial coordinator, who blinded the cases and sent them to two members of the pathology review committee (reviewers D and E). All slides were graded as being very suggestive, probable or unlikely for the diagnosis of UIP. For each observer, the UIP diagnosis on lung biopsy was confirmed if the slide was scored as very suggestive or probable for UIP and rejected if it was scored as unlikely. If the two reviewers disagreed as to diagnosis of UIP, the slides were sent to the third member of the pathology committee (reviewer F) and assessed in an identical fashion. The diagnosis agreed by the majority of the three members was accepted as final. The slides were reviewed independently without knowledge of clinical or physiological parameters.
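The sequential histology review can be sketched in the same way. The names below are hypothetical and the code only restates the rule in the text: reviewers D and E grade first, and reviewer F is consulted only when their binary verdicts differ, in which case F's verdict necessarily forms the majority.

```python
from typing import Optional

CONFIRMING = {"very suggestive", "probable"}
VALID = CONFIRMING | {"unlikely"}


def confirms(grade: str) -> bool:
    """True if a single pathology grade confirms UIP."""
    if grade not in VALID:
        raise ValueError(f"unknown grade: {grade!r}")
    return grade in CONFIRMING


def histology_panel_confirms(grade_d: str, grade_e: str,
                             grade_f: Optional[str] = None) -> bool:
    """Two-reader review with a third reader as tie-breaker."""
    d, e = confirms(grade_d), confirms(grade_e)
    if d == e:
        return d  # D and E agree, reviewer F is not needed
    if grade_f is None:
        raise ValueError("D and E disagree; reviewer F's grade is required")
    return confirms(grade_f)  # F sides with one of them and forms the majority


tie_broken = histology_panel_confirms("probable", "unlikely", "very suggestive")
```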
Definite diagnosis of UIP
The diagnosis of UIP was rejected when one or both committees did not confirm a diagnosis of UIP.
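As a minimal sketch, the combined rule amounts to a logical AND over the committee verdicts. The function name is hypothetical, and treating a patient without a reviewable biopsy as judged on HRCT alone is an assumption made for illustration; the text does not state this explicitly.

```python
from typing import Optional


def definite_uip(hrct_confirmed: bool,
                 histology_confirmed: Optional[bool] = None) -> bool:
    """Definite UIP requires that no reviewing committee rejected it.

    histology_confirmed is None when no biopsy was available for review
    (assumption: the HRCT verdict then decides alone).
    """
    if histology_confirmed is None:
        return hrct_confirmed
    return hrct_confirmed and histology_confirmed


rejected_by_one = definite_uip(hrct_confirmed=True, histology_confirmed=False)
```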
Weighted kappa coefficients (κw) were used to measure the level of interobserver agreement. The κw were calculated using a method recommended for comparing level of agreement with categorical data 14, along with their respective 95% confidence intervals (CIs).
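For readers unfamiliar with the statistic, a weighted kappa for two readers grading on an ordered scale can be computed as sketched below. The weighting scheme is not stated in the text, so linear disagreement weights are assumed here; the function name is hypothetical and confidence intervals are not computed.

```python
def weighted_kappa(ratings_a, ratings_b, categories):
    """Linear-weighted kappa for two raters on an ordered scale.

    categories lists the grades from lowest to highest.
    kappa_w = 1 - (weighted observed disagreement / weighted chance disagreement)
    """
    k = len(categories)
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b) or k < 2:
        raise ValueError("need two equally long, non-empty rating lists")
    index = {c: i for i, c in enumerate(categories)}

    # Cross-tabulate the paired ratings.
    table = [[0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        table[index[a]][index[b]] += 1
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]

    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # linear weight (assumed scheme)
            observed += w * table[i][j] / n
            expected += w * row[i] * col[j] / (n * n)
    if expected == 0:
        raise ValueError("kappa undefined: both raters used one category only")
    return 1 - observed / expected


GRADES = ["unlikely", "probable", "very suggestive"]
perfect = weighted_kappa(GRADES, GRADES, GRADES)
```

With identical ratings the weighted observed disagreement is zero, so the coefficient equals 1; ratings that agree no more often than chance give a value near 0.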
A total of 36 local investigators from six European countries were included (table 1), giving 179 HRCT scans and 82 OLB or TLB specimens for review.
Radiology reviewer A reviewed 178 HRCT scans (one scan was never reviewed), reviewer B 176 (two scans were judged uninterpretable and one was never reviewed) and reviewer C 176 (two scans were judged uninterpretable and one was never reviewed; fig. 1). After combining the observations of all three radiologists, the 532 HRCT observations were judged to be unlikely in 67 (12.6%) cases, probable in 203 (38.2%) and very suggestive in 258 (48.5%) for the diagnosis of UIP. For four (0.8%) observations, the HRCT scan was judged to be uninterpretable because of lack of quality. A total of 238 HRCT observations could be correlated with the results of lung biopsy. When the HRCT scans were judged to be unlikely, probable and very suggestive for UIP, 67.5, 84.4 and 91.7%, respectively, of the corresponding lung biopsy specimens were positive for UIP (fig. 2).
All 82 biopsy specimens (44 OLB and 38 TLB) were sent to the international trial coordinator for review by pathology reviewers D and E. After combining the observations of the three histology reviewers, the 178 OLB/TLB observations were judged to be unlikely in 33 (18.5%) cases, probable in 66 (37.1%) and very suggestive in 76 (42.7%) for the diagnosis of UIP. For three (1.7%) observations, the biopsy slide was judged to be uninterpretable. Reviewer D reviewed all 82 OLB/TLB specimens and reviewer E 79 (three were judged uninterpretable). Histology reviewer F was solicited to review 14 biopsy slides (fig. 1).
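The percentage breakdowns reported for the pooled HRCT and histology observations follow directly from the category counts, as the short check below illustrates. The counts are taken from the text; the helper name is hypothetical.

```python
def percent_breakdown(counts):
    """Share of each category among all observations, in percent (1 decimal)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations")
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Pooled reader observations as reported in the text.
hrct = {"unlikely": 67, "probable": 203, "very suggestive": 258,
        "uninterpretable": 4}
histology = {"unlikely": 33, "probable": 66, "very suggestive": 76,
             "uninterpretable": 3}

hrct_shares = percent_breakdown(hrct)
histology_shares = percent_breakdown(histology)
```

The category counts sum to the 532 HRCT and 178 histology observations quoted above, and the computed shares reproduce the reported percentages.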
In 12.8% of the patients, the diagnosis of UIP was rejected by at least one review committee (table 1; fig. 1). The diagnosis of UIP was confirmed by the pathology review committee for 84% of the 82 OLB/TLB specimens. The diagnosis of UIP on HRCT was confirmed for 92.7% of the 165 HRCT scans (fig. 1).
Table 2 summarises the level of agreement between the three different HRCT reviewers; κw ranged 0.33–0.46. No important differences in κw were seen within the different subgroups. Table 3 summarises the level of agreement between the two pathology reviewers; a κw of 0.30 (95% CI 0.12–0.48) was calculated. The level of agreement was 0.84 (0.55–1.14) in the subgroup of those patients in whom the diagnosis of UIP was not confirmed on HRCT. When the severity of lung function impairment (forced vital capacity of >60% or <60% of the predicted value) was taken into account, no difference in level of agreement was observed.
Two salient findings emerge from the present study. First, the diagnosis of IPF proposed by a respiratory specialist was rejected in 12.8% of cases after review of histology and HRCT by expert committee. Secondly, the mean level of agreement between the three different HRCT reviewers was 0.40, and between the two pathology reviewers 0.30.
The diagnostic accuracy of a pulmonary physician in IPF in relation to the ATS/ERS diagnostic criteria 1 remains to be established. A confident diagnosis of IPF proposed by a clinician was confirmed in 87.2% of cases in the present study. The rejection of the diagnosis was not based on clinical criteria, but rather on HRCT and/or lung biopsy findings that were not compatible with the diagnosis of UIP. Hunninghake et al. 7 found the probability of a patient being given a confident diagnosis by the referring clinician to be 81%, similar to the present results. Although the study of Hunninghake et al. 7 represented the first published prospective multicentric study regarding the level of agreement between clinicians, radiologists and pathologists as to the diagnosis of IPF, it is not clear from their study on which clinical grounds the diagnosis of IPF was made, since no clinical or radiological criteria were provided for the diagnosis of IPF.
The present study also addressed the question of agreement between histology reviewers in the diagnosis of UIP in view of the new pathological classification 1. The interobserver agreement between the histology reviewers was low, with a mean κw of 0.30, a level scored as showing fair agreement following the proposed interpretation of κw of Brennan and Silman 14. The κw ranges 0–1, with 0 indicating only chance agreement and 1 perfect agreement 14. Nicholson et al. 15 and Cherniack et al. 16 studied levels of agreement between pathologists for individual histological parameters (e.g. extent of fibrosis) in biopsy specimens showing UIP, and found κw ranging 0.56–0.76 and -0.06–0.30, respectively. However, since their κw were calculated for a different purpose, their results are not comparable with those of the present study. Nicholson et al. 17 presented another study examining the prognostic significance of histological patterns of idiopathic interstitial pneumonia. Slides of 37 lung biopsy specimens with UIP, 28 with NSIP and 13 with desquamative interstitial pneumonia or respiratory bronchiolitis-associated interstitial lung disease were reviewed independently by two pulmonary histopathologists. They found an overall κw of 0.49, but the level of agreement in distinguishing between UIP and NSIP was 0.26. In view of patient selection, this latter figure is comparable with the present study. It suggests that distinguishing a UIP or NSIP pattern based on histology is difficult, and perhaps more difficult with knowledge of the lobar histopathological variability in UIP and NSIP in the same lung 18. This finding was confirmed in a recent study 12 in which observer agreement on UIP as final histological diagnosis of 0.49 and on NSIP of 0.32 were found.
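The verbal labels attached to κw values can be expressed as a simple lookup. The cut-offs below are a commonly used banding and are an assumption here: the text only states that 0.30 counts as fair agreement and that 0.33–0.46 is fair to moderate, and does not reproduce the full scale of reference 14.

```python
# Upper bound of each band and its label (assumed cut-offs, see lead-in).
BANDS = [
    (0.20, "poor"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "good"),
    (1.00, "very good"),
]


def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to a verbal strength-of-agreement label."""
    if kappa > 1:
        raise ValueError("kappa cannot exceed 1")
    for upper, label in BANDS:
        if kappa <= upper:
            return label
    raise AssertionError("unreachable")


histology_label = interpret_kappa(0.30)
hrct_labels = (interpret_kappa(0.33), interpret_kappa(0.46))
```

Under these assumed bands the histology κw of 0.30 falls in the fair band and the HRCT range of 0.33–0.46 spans fair to moderate, in line with the wording used in the text.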
The level of agreement between the HRCT readers was fair to moderate 14. This κw is comparable with those of MacDonald et al. 9 and Flaherty et al. 3 (table 4), who compared the interobserver agreement on HRCT for patients with NSIP and UIP. Hunninghake et al. 7 found an interobserver agreement of 0.54. Since the radiological criteria for the diagnosis of UIP on HRCT were not mentioned in this latter study, it is not possible to explain the difference between their κw and those of MacDonald et al. 9, Flaherty et al. 3 and the present study. Other studies 8, 10, 19, 20, however, have found a higher level of agreement, but their study populations and the purposes for which observer variability were calculated differed significantly from those of the present study (table 4). Aziz et al. 21 found an observer agreement of 0.48 on the first-choice diagnosis in a cohort of 131 patients with diffuse parenchymal lung disease; the κw in the IPF cohort was 0.50.
It is important to emphasise the fact that radiologists with differing levels of experience and expertise may interpret radiographic images differently. The radiologists in the present study are specialists in thoracic imaging and have extensive expertise in the interpretation of HRCT scans. Each reader was blinded to the clinical parameters, and the reading was performed separately, so that the different readers could not influence each other. The κw was used to evaluate observer variability in order to remove that component of agreement attributable to chance. Although this method of statistical analysis permits a more accurate assessment of observer variability than unadjusted data, a κw may underestimate a high level of agreement 14. In the present study, the level of agreement between the different readers was unexpectedly low. Might this be due to the high prevalence of the disease in the study population, or might it be observer variation bias? The interpretation of κw depends upon the prevalence of the disease, which was high (0.84) in the present study 14. The prevalence of the disease was high because the HRCT scans and lung biopsy slides were from patients selected by a local investigator who had already confirmed the diagnosis of IPF to conform to the ATS/ERS criteria. The higher prevalence of the disease in the present study population is a possible explanation for the conflicting finding that 67% of the HRCT observations reported as unlikely for UIP corresponded to histologically confirmed UIP. However, this is not astonishing, since a recent publication reported that 59% of patients with definite or probable NSIP on CT had a histological diagnosis of UIP 4. The present authors assume that many of the CT scans that had been reported as being unlikely for UIP would fulfil the CT criteria for NSIP.
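The dependence of kappa on prevalence can be illustrated with two hypothetical pairs of readers who agree on exactly the same proportion of cases (80%) but work in populations with different disease prevalence. The tables below are invented for illustration and are not study data; an unweighted two-category kappa is used for simplicity.

```python
def cohen_kappa_2x2(both_pos, a_only, b_only, both_neg):
    """Unweighted Cohen's kappa from the four cells of a 2x2 agreement table."""
    n = both_pos + a_only + b_only + both_neg
    p_obs = (both_pos + both_neg) / n            # observed agreement
    a_pos = (both_pos + a_only) / n              # reader A positive rate
    b_pos = (both_pos + b_only) / n              # reader B positive rate
    p_exp = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical tables, 100 cases each, observed agreement 0.80 in both.
balanced = cohen_kappa_2x2(both_pos=40, a_only=10, b_only=10, both_neg=40)
high_prevalence = cohen_kappa_2x2(both_pos=75, a_only=10, b_only=10, both_neg=5)
```

When most cases are positive, chance agreement is already high, so the same raw agreement yields a much lower kappa (roughly 0.22 against 0.60 in this invented example).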
The results of the present study may evoke concerns about diagnostic accuracy in IPF. This form of lung fibrosis is a rare disease, and no single accurate diagnostic test for it exists. Studies of diagnostic accuracy in IPF are performed mostly in tertiary referral centres. Even in these studies, significant interobserver variability exists. In most of these studies, as in the present one, prior knowledge of the presence of a form of interstitial lung disease exists, which may incite observer bias and therefore influence the results in terms of diagnostic accuracy. The incidence of IPF is low in a general pulmonary practice. Diagnostic accuracy (i.e. sensitivity and specificity) also depends upon the prevalence of the disease. A lower prevalence of disease results in a higher number of false positive and false negative diagnoses. If very costly therapeutic options come to market in the future, the only means of ensuring the greatest possible diagnostic accuracy in IPF is to refer patients to centres with expertise in pulmonary histology and thoracic imaging and clinical experience of IPF 22.
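One way to illustrate the influence of prevalence is through predictive values: with sensitivity and specificity held fixed, the proportion of positive diagnoses that are correct falls as the disease becomes rarer. The sensitivity and specificity of 0.90 used below are hypothetical values chosen only for illustration; 0.84 is the prevalence reported in the present study and 0.05 is an arbitrary low-prevalence setting.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical test characteristics (not study results).
SENS, SPEC = 0.90, 0.90

ppv_expert, npv_expert = predictive_values(SENS, SPEC, prevalence=0.84)
ppv_general, npv_general = predictive_values(SENS, SPEC, prevalence=0.05)
```

Under these assumptions a positive diagnosis is far less reliable in the low-prevalence setting than in the pre-selected study population, which is consistent with the argument for referral to expert centres.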
In summary, it has been shown that the accuracy of a clinical diagnosis of idiopathic pulmonary fibrosis is 87.2%. Given that idiopathic pulmonary fibrosis has such a poor prognosis 2 in relation to other forms of idiopathic interstitial pneumonia 3, it was concluded that the use of independent high-resolution computed tomography and histology panels to ensure accurate diagnosis of idiopathic pulmonary fibrosis, as performed in the Idiopathic Pulmonary Fibrosis International Group Exploring N-Acetylcysteine I Annual study 13, is extremely valuable and helps minimise bias. The present study demonstrated that the use of reviewer panels for radiology and histology in idiopathic pulmonary fibrosis trials is feasible. It is important that the clinician knows that an accurate diagnosis of idiopathic pulmonary fibrosis requires specific expertise that is available in tertiary referral centres with the close collaboration of histopathologists, radiologists and clinicians.
Statement of interest
Statements of interest for M. Thomeer, M. Demedts, C.D.R. Flower, J. Verschakelen, F. Laurent, A.G. Nicholson, E.K. Verbeken, F. Capron, M. Sardina, G. Corvasce and I. Lankhorst, and the Idiopathic Pulmonary Fibrosis International Group Exploring N-Acetylcysteine I Annual (IFIGENIA) study can be found at www.erj.ersjournals.com/misc/statements.shtml
The members of the Idiopathic Pulmonary Fibrosis International Group Exploring N-Acetylcysteine I Annual (IFIGENIA) study group are as follows.
Steering committee: J. Behr (Grosshadern Clinic, Ludwig Maximilian University, Munich, Germany); R. Buhl (Johannes Gutenberg University Clinic, Mainz, Germany); U. Costabel (Ruhrland Clinic, Essen, Germany); R. Dekhuijzen (Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands); M. Demedts (Chairman) and M. Thomeer (University Hospitals, Catholic University of Leuven, Leuven, Belgium); H.M. Jansen (Academic Medical Centre, Amsterdam, The Netherlands); W. MacNee (University of Edinburgh Medical School, Edinburgh, UK); and B. Wallaert (Calmette Hospital, Lille Regional University Hospital, Lille, France).
Country coordinators: P. de Vuyst (Erasmus University Hospital, Brussels, Belgium); B. Wallaert (France); J. Behr (Germany); S. Petruzzelli (Cardiothoracic Dept, Pisa University, Pisa, Italy); J.M.M. van den Bosch (St Antonius Hospital, Nieuwegein, The Netherlands); E. Rodríguez-Becerra (Virgen del Rocío University Hospital, Seville, Spain); W. MacNee (UK).
Radiology review committee: C.D.R. Flower (Evelyn Hospital, Cambridge, UK); J. Verschakelen (University Hospitals, Catholic University of Leuven, Leuven, Belgium); F. Laurent (Cardiological Hospital, Bordeaux University Hospital, Bordeaux, France).
Histology review committee: A.G. Nicholson (Royal Brompton Hospital, London, UK); E.K. Verbeken (University Hospitals, Catholic University of Leuven, Leuven); F. Capron (Pitie-Salpetriere Hospital, Paris, France).
Local investigators. Belgium: M. Demedts, P. de Vuyst, E. Michiels (East Limburg Hospital, Genk), H. Slabbynck (Middelheim General Hospital, Antwerp), M. Thomeer. France: A. Bourdin and P. Chanez (Arnaud de Villeneuve Hospital, Montpellier), J. Cadranel (Tenon Hospital, Paris), P. Camus (Le Bocage University Hospital, Dijon), P. Delaval (Pontchaillou Hospital, Rennes), N. Just and B. Wallaert (Calmette Hospital, Lille Regional University Hospital, Lille, France), J.F. Muir (Bois Guillaume Hospital, Rouen). Germany: U. Costabel, R. Baumgartner (Grosshadern Clinic, Ludwig Maximilian University, Munich), J. Behr, R. Bonnet and I. Mäder (Bad Berka Central Clinic, Bad Berka), R. Buhl, A.M. Kirsten (Johannes Gutenberg University Clinic, Mainz), R. Loddenkemper (Heckeshorn Lung Clinic, Zehlendorf Clinic, Berlin), A. Meyer (Eppendorf University Hospital, Hamburg), J. Müller-Quernheim (Borstel Research Centre, Medical Clinic, Borstel), H. Steveling (Ruhrland Clinic, Essen, Germany), T. Welte (Magdeburg University Clinic, Magdeburg), H. Worth (Clinic Fürth, Fürth). Italy: G. Anzalone (Prato Hospital, Prato), G.B. Bottino (DIMI, Genoa University, Genoa), G. Bustacchini (S. Maria delle Croci Hospital, Ravenna), M. Dottorini (R. Silvestrini Hospital, Perugia), S. Gasparini (Torrette Hospital, Torrette di Ancona), C. Giuntini (Cardiothoracic Dept, Pisa University, Pisa), A. Rossi (IRCCS S. Matteo General Hospital, Pavia), G. Simon (Azienda Ospedaliera Villa Sofia, Palermo). The Netherlands: F. Beaumont (Bosch Medicentrum, Locatie Grootziekengasthuis, Hertogenbosch), M. Drent (Maastricht University Hospital, Maastricht), H.M. Jansen, J.M.M. van den Bosch, and F.J.J. van den Elshout (Rijnstate Hospital, Arnheim). Spain: J. Ancochea Bermudez (Hospital Universitario de la Princesa, Madrid), L. Callol Sanchez (Hospital Universitario Del Aire, Madrid), J.L. Llorente (Hospital De Cruces, Baracaldo-Bilbao), J.M. Rodriguez-Arias and I. Vigil (Hospital Sant Pau, Barcelona), E. Rodríguez-Becerra (Hospital Universitario Virgen del Rocío, Seville).
Zambon personnel and consultants: A. Ardia (consultant), M. Sardina, G. Corvasce, and I. Lankhorst (consultant).
- Received May 11, 2006.
- Accepted November 21, 2007.
- © ERS Journals Ltd