Abstract
Chest radiography for the diagnosis of active pulmonary tuberculosis (PTB) is limited by poor specificity and reader inconsistency. Scoring systems have been employed successfully for improving the performance of chest radiography for various pulmonary diseases. We conducted a systematic review to assess the diagnostic accuracy and reproducibility of scoring systems for PTB.
We searched multiple databases for studies that evaluated the accuracy and reproducibility of chest radiograph scoring systems for PTB. We summarised results for specific radiographic features and scoring systems associated with PTB. Where appropriate, we estimated pooled performance of similar studies using a random effects model.
13 studies were included in the review, nine of which were in low tuberculosis (TB) burden settings. No scoring system was based solely on radiographic findings. All studies used systems with various combinations of clinical and radiological features. 11 studies involved scoring systems that were used for making decisions concerning hospital respiratory isolation. None of the included studies reported data on intra- or inter-reporter reproducibility. Upper lobe infiltrates (pooled diagnostic OR 3.57, 95% CI 2.38–5.37, five studies) and cavities (diagnostic OR range 1.97–25.66, three studies) were significantly associated with PTB. Sensitivities of the scoring systems were high (median 96%, IQR 93–98%), but specificities were low (median 46%, IQR 35–50%).
Chest radiograph scoring systems appear useful in ruling out PTB in hospitals, but their low specificity precludes ruling in PTB. There is a need to develop accurate scoring systems for people living with HIV and for outpatient settings, especially in high TB burden settings.
Introduction
Despite early diagnosis being a key principle of tuberculosis (TB) control, in 2010 the global case detection rate for all forms of TB was only 65% [1]. Limitations of existing diagnostic tests are considered to contribute to the low case detection rate [1]. Sputum smear microscopy and chest radiography are two of the most commonly used tests for TB in most high TB burden countries. Smear microscopy has low sensitivity and fails to detect nearly half of all TB cases [2]. Smear microscopy is of particularly limited value in extrapulmonary TB, children, and people living with HIV (PLWH).
Chest radiography is a rapid test that has been used for over a century to diagnose pulmonary TB (PTB) [3]. Chest radiography can be performed at the point-of-care and incorporated into screening and diagnostic algorithms, usually as an add-on test to smear microscopy among persons with possible PTB, or as a screening test among specific populations, such as immigrants from high-burden TB settings. While chest radiography is acknowledged to have high sensitivity for detecting pulmonary abnormalities, its use for diagnosing PTB has been limited by modest specificity, and high inter- and intra-observer differences in reporting of radiographs [4]. Consequently, the probability of diagnosing active PTB based on a chest radiograph is dependent on the reader and not well standardised.
An analogous impediment to the use of chest radiography for the diagnosis of occupational lung diseases was overcome by the development of standardised methods for the reading of chest radiographs, a system that is now employed successfully by the International Union Against Cancer, the International Labour Organization, and the US National Institute for Occupational Safety and Health [5, 6]. Scoring systems have also been developed for grading the severity and extent of pulmonary disease among patients with cystic fibrosis [7], and form part of the lung injury score for assessing the severity of adult respiratory distress syndrome [8]. Similarly, a standardised scoring system for PTB that assigns weights to specific features of chest radiographs consistent with PTB, if accurate and reproducible, could potentially augment TB case detection rates using largely pre-existing resources. Such a standardised scoring system for PTB also has the potential to be combined with newer nucleic acid amplification tests, such as Xpert MTB/RIF (Cepheid, Sunnyvale, CA, USA) [9], either as a triage test to reduce costs or as an add-on test in Xpert MTB/RIF negative persons.
A systematic review of clinical prediction rules for isolating inpatients with suspected PTB was published in 2005 [10], but to our knowledge, no previous systematic reviews have assessed the performance of radiography scoring systems for PTB. Therefore, we carried out a systematic review to estimate the diagnostic accuracy of scoring systems using chest radiograph features for active PTB in patients with possible disease. A secondary objective was to assess the reproducibility of chest radiograph scoring systems for PTB.
Methods
We followed the guidelines for systematic reviews of diagnostic test accuracy recommended by the Cochrane collaboration diagnostic test accuracy working group, including writing a detailed protocol before starting the review [11, 12].
We aimed to include randomised controlled trials and observational studies of all study designs (i.e. cross-sectional, case–control and cohort) that assessed the performance of radiographic scoring systems for the diagnosis of PTB. We included studies that reported data from which we could extract true positive, true negative, false positive, and false negative values for determining sensitivity and specificity estimates, as well as studies that only reported summary measures of diagnostic accuracy (defined below).
Participants were patients with possible PTB who were ≥15 years old. We restricted studies to those that included a minimum of 10 patients with TB. With the aim of evaluating patients similar to those who present in routine clinical practice, we excluded studies that exclusively involved specific patient groups, such as patients with pneumoconioses, malignancies (both haematological and solid organ) and immune-mediated inflammatory disease, and patients on haemodialysis. We also excluded studies that investigated asymptomatic contacts of TB patients.
The index test was any chest radiograph scoring system, with the comparator being no chest radiograph scoring system.
The target condition was TB of the pulmonary parenchyma, pleura and intrathoracic lymph nodes. We included miliary TB if the disease involved either pulmonary parenchyma or multiple sites, one of which was the lung.
We considered liquid or solid culture as the reference standard for active PTB.
A radiograph scoring system was defined as a system that assigned numerical weights to specific features of chest radiographs consistent with PTB (such as cavitary lesions), with or without the presence of clinical findings.
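As a minimal sketch of what such a system looks like in practice, the Python snippet below sums numerical weights over the features recorded as present and compares the total with a cut-off. The feature names, weights and cut-off are hypothetical placeholders chosen for illustration; they are not taken from any of the included studies.

```python
# Hypothetical weights for illustration only; not taken from any included study.
EXAMPLE_WEIGHTS = {
    "upper_lobe_infiltrate": 2,  # radiographic feature
    "cavity": 2,                 # radiographic feature
    "fever": 1,                  # clinical finding
    "weight_loss": 1,            # clinical finding
}
EXAMPLE_CUTOFF = 2  # hypothetical threshold: total >= cut-off counts as positive


def total_score(findings, weights=EXAMPLE_WEIGHTS):
    """Sum the weights of the features recorded as present for one patient."""
    return sum(w for feature, w in weights.items() if findings.get(feature, False))


def is_score_positive(findings, weights=EXAMPLE_WEIGHTS, cutoff=EXAMPLE_CUTOFF):
    """Apply the decision rule: positive when the total reaches the cut-off."""
    return total_score(findings, weights) >= cutoff


patient = {"cavity": True, "fever": True}
print(total_score(patient), is_score_positive(patient))
```

A rule that treats a patient as positive when any single feature is present corresponds to giving every feature a weight of 1 and setting the cut-off to 1.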
Sensitivity is the proportion of patients with PTB who are correctly identified by the scoring system. Specificity is the proportion of patients without PTB who are correctly identified by the scoring system. Positive predictive value (PPV) is the proportion of patients considered positive by the scoring system who are correctly diagnosed with PTB. Negative predictive value (NPV) is the proportion of patients considered negative by the scoring system who are correctly diagnosed without PTB.
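These four definitions can be sketched directly from the cells of a two-by-two table. The counts in the usage example are hypothetical and chosen only to illustrate the arithmetic; the function assumes that none of the denominators is zero.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV from a two-by-two table.

    tp: PTB, score positive      fp: no PTB, score positive
    fn: PTB, score negative      tn: no PTB, score negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # PTB patients correctly identified
        "specificity": tn / (tn + fp),  # non-PTB patients correctly identified
        "ppv": tp / (tp + fp),          # score-positive patients who have PTB
        "npv": tn / (tn + fn),          # score-negative patients without PTB
    }


# Hypothetical counts for illustration only.
measures = accuracy_measures(tp=48, fp=135, fn=2, tn=115)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```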
Diagnostic odds ratio (DOR) is the odds of a patient with PTB having a specific clinical or radiographic feature divided by the odds of a participant without PTB having the same clinical or radiographic feature.
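The definition above can be written as a short calculation. The sketch below also adds a 95% confidence interval computed on the log scale; that interval method is an assumption made for illustration, as the review does not state how intervals for individual studies were obtained. The counts are hypothetical and the function assumes that no cell is zero.

```python
import math


def diagnostic_odds_ratio(a, b, c, d, z=1.96):
    """Return (DOR, lower, upper) for one radiographic or clinical feature.

    a: PTB, feature present       b: no PTB, feature present
    c: PTB, feature absent        d: no PTB, feature absent
    The log-scale interval is an assumed method, not one named in the review.
    """
    odds_with_ptb = a / c      # odds of the feature among patients with PTB
    odds_without_ptb = b / d   # odds of the feature among patients without PTB
    dor = odds_with_ptb / odds_without_ptb
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(dor) - z * se_log)
    upper = math.exp(math.log(dor) + z * se_log)
    return dor, lower, upper


# Hypothetical counts for illustration only.
dor, lower, upper = diagnostic_odds_ratio(a=30, b=40, c=20, d=110)
print(f"DOR {dor:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```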
Reproducibility refers to agreement of the scoring system when a chest radiograph is read more than once. Agreement could either be “intra-reader”, when the same person reads the chest radiograph more than once, blinded to his or her previous reading, or “inter-reader”, when two or more people read the same chest radiograph. Agreement is a reflection of the repeatability of a test (the scoring system) and is independent of the accuracy of the test.
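One common way to quantify inter-reader agreement beyond chance is Cohen's kappa. The review does not name an agreement statistic, so the choice of kappa here is an assumption, and the readings in the usage example are hypothetical.

```python
def cohens_kappa(reader_a, reader_b):
    """Chance-corrected agreement between two readers of the same radiographs.

    Each list holds one categorical reading per radiograph, in the same order.
    Kappa is an assumed statistic; the review does not prescribe one.
    """
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("both readers must rate the same non-empty set of radiographs")
    n = len(reader_a)
    observed = sum(x == y for x, y in zip(reader_a, reader_b)) / n
    categories = set(reader_a) | set(reader_b)
    # Agreement expected by chance, from each reader's marginal proportions.
    expected = sum(
        (reader_a.count(c) / n) * (reader_b.count(c) / n) for c in categories
    )
    if expected == 1:
        raise ValueError("kappa is undefined when both readers use a single category")
    return (observed - expected) / (1 - expected)


# Hypothetical readings (1 = score positive, 0 = score negative).
reader_1 = [1, 1, 0, 0, 1, 0, 1, 0]
reader_2 = [1, 0, 0, 0, 1, 0, 1, 1]
print(cohens_kappa(reader_1, reader_2))
```

The same function applies to intra-reader agreement when the two lists hold the first and second readings by one person.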
Search methods for identification of studies
We searched MEDLINE (1946 to August 30, 2012), EMBASE (1947 to August 30, 2012) and Web of Science (1899 to August 30, 2012) for relevant articles, using published filters for diagnostic tests to improve sensitivity [13, 14]. We used the terms sensitiv*[tw] OR diagnos*[tw] OR di [fs] AND radiograph*[MeSH] OR chest xray[tw] OR mass chest x-Ray[MeSH] OR photofluorograph*[tw] OR scor*[tw] AND tuberculosis(sub-headings: lymph node/miliary/multidrug-resistant/Pleural/Pulmonary) [MeSH] OR Mycobacterium tuberculosis [MeSH]. The detailed search strategy can be found in the online supplementary material. We also reviewed the reference lists of included articles and review articles identified through the search, and hand-searched World Health Organization reports.
Selection of studies
Initially, two review authors (L.M. Pinto and K.R. Steingart) independently scrutinised titles and abstracts in English, French and Spanish for eligibility. Citations deemed relevant by either reviewer were selected and the papers retrieved for full-text review. Next, each eligible article was independently assessed by two reviewers (L.M. Pinto and K.R. Steingart) against the selection criteria. Disagreements were resolved by discussion between the reviewers. A list of excluded studies with their reasons for exclusion was maintained.
Assessment of study quality
Two reviewers (L.M. Pinto and K.R. Steingart) independently assessed study quality using the core set of 11 items from Quality Assessment of Diagnostic Accuracy Studies (QUADAS), a validated tool to evaluate the presence of bias and variation in diagnostic accuracy studies [15]. As recommended, each item was scored as “yes”, “no”, or “unclear”.
Data extraction
Two review authors (L.M. Pinto and K.R. Steingart) independently extracted the data from each study, using a data extraction form that was piloted and then finalised based on the experience gained from the pilot. Disagreements were resolved by discussion. Data were extracted for various characteristics, including the following: author; publication year; study design; country income status classified by the World Bank List of Economies, World Bank 2012 data [16]; two-by-two tables (with cells documenting true positives, false negatives, false positives and true negatives for the diagnostic test) of individual radiographic features; and details of the scoring systems and their performance characteristics. The data extraction form is included in the online supplementary material.
Statistical analysis
Data from the two-by-two tables were used to calculate sensitivity and specificity estimates for the scoring system for individual studies, along with their 95% confidence intervals at cut-offs determined by the study authors (mostly based on optimal sensitivities and specificities using receiver operating characteristic curves). Forest plots were generated to display sensitivity and specificity estimates using Meta-Disc (version 1.4) [17]. DORs for specific radiographic features associated with PTB were determined when data were provided. Heterogeneity was assessed by visual inspection of forest plots and the degree of heterogeneity was measured by the I-squared statistic (I²). DORs for specific radiographic features were pooled only if the radiographic features and patient populations were similar across studies and I² ≤75%. Meta-analysis was performed using a random effects approach, to account for the variability across studies and to derive conservative assessments of the uncertainty in the estimates [18], using Meta-Disc (version 1.4) [17]. Formal assessment of publication bias, using methods such as funnel plots or regression tests, was not performed because such techniques have not been found to be useful for diagnostic data [19]. An estimation of language bias was attempted by retrieving citations from the search strategy with and without a language filter, and the “filtered” citations were reported as a percentage of the overall citations retrieved.
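The pooling step and the heterogeneity check can be sketched as follows. The sketch assumes the DerSimonian–Laird estimator for the between-study variance, which is one common random effects approach; the review itself only states that a random effects model was fitted in Meta-Disc, so this is an illustration rather than a reproduction of the analysis. The input values are hypothetical.

```python
import math


def pool_random_effects(log_dors, variances, z=1.96):
    """Pool study-level log DORs with an assumed DerSimonian-Laird model.

    log_dors: natural log of each study's DOR
    variances: variance of each log DOR (for example 1/a + 1/b + 1/c + 1/d)
    Returns the pooled DOR, its 95% CI, tau-squared and I-squared (%).
    """
    k = len(log_dors)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    w = [1 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * yi for wi, yi in zip(w, log_dors)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_dors))  # Cochran's Q
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0        # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0    # I-squared, percent
    w_re = [1 / (v + tau2) for v in variances]             # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, log_dors)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return {
        "pooled_dor": math.exp(pooled),
        "ci": (math.exp(pooled - z * se), math.exp(pooled + z * se)),
        "tau2": tau2,
        "i2_percent": i2,
    }


# Hypothetical study-level inputs for illustration only.
result = pool_random_effects(
    log_dors=[math.log(3.0), math.log(4.5), math.log(2.8)],
    variances=[0.10, 0.20, 0.15],
)
if result["i2_percent"] <= 75:  # pooling rule described in the text
    print(result)
```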
Results
We identified 12 883 citations, of which 9137 unique articles were identified after exclusion of duplicate articles. We conducted the search with and without the language filters to assess the degree of bias, and found that our search strategy with language filters included 81% of all studies. After screening titles and abstracts, 187 articles were found to satisfy the criteria for further review and their full-texts were retrieved. After full-text review, 174 articles were excluded for various reasons and 13 articles (all observational studies) were included in the systematic review (fig. 1) [20–32].
Figure 1. PRISMA flow diagram for included and excluded studies. CXR: chest radiograph.
Included studies
We did not identify any scoring system that was based exclusively on radiographic criteria. Of the 13 included studies, 12 studies involved scoring systems that combined clinical and radiographic features [20–31] and one study involved validation of 13 clinical prediction rules for inpatient respiratory isolation [32]. This study was included as it retrospectively applied prediction rules from six studies identified by our systematic review to a group of eligible participants with symptoms suggestive of PTB. This study also tested rules that were derived from seven other studies, which were excluded from our systematic review as they did not satisfy our reference standard [33, 34] or did not use specific features visualised on the chest radiograph as part of the scoring system [35–39]. Table 1 lists the characteristics of the 12 included original studies, containing a total of 5767 participants, and the single study that validated six of these scoring systems (validated in 345 participants). The median number of individuals with possible PTB included in the studies was 283 (interquartile range 177–431).
Seven studies included all patients with possible PTB [20–25, 32], five studies included patients with possible PTB who were found to have negative sputum smears [26–30], and one study specifically excluded PLWH [31]. Nine studies were performed in high-income countries. Five studies involved radiologists; two studies included pulmonologists and five studies did not report the specialty of the radiograph reader. The demographic characteristics of the patients are provided in the online supplementary material. When reported, the majority of patients were male. Eleven studies included PLWH, who represented 11–61% of eligible patients.
Excluded studies
We identified two studies that were designed with the aim of deriving a clinical–radiographic scoring system for PLWH among hospitalised patients found to be sputum smear-negative [40, 41]. Both studies satisfied the majority of our inclusion criteria, and found the presence of mediastinal adenopathy and cavities to be significantly associated with PTB in univariate analysis. However, neither of the studies derived a score for PTB. The study by Le Minor et al. [41] concluded that the “numbers were insufficient to develop a score for TB”, while the study by Davis et al. [40] stated that “after exhaustive testing, we were unable to identify any combination of factors which reliably predicted bacteriologically confirmed tuberculosis”.
We also excluded 13 studies that used automated computer-assisted diagnosis as none of these studies used culture as a reference standard, a criterion for inclusion in this review [42–54]. Five studies that involved the grading of chest radiographs were excluded, as these studies were designed to grade the severity of PTB based on the extent of abnormalities visualised on the chest radiograph and not the diagnostic accuracy of scoring systems [55–59]. We also excluded three studies that used the Chest Radiograph Reading and Recording System (CRRS) [60–62], despite these studies demonstrating the CRRS tool to have good reliability for features of PTB visualised on a chest radiograph, as these studies did not use culture as a reference standard.
Assessment of methodological quality
As seen in figure 2, all the studies suffered from verification bias, as the results of the chest radiograph and/or the clinical components of the scoring system played a role in the selection of those patients who would be investigated further with culture, the reference standard. Seven (54%) of the 13 studies did not include a sample that was considered representative of the target population, as these studies did not enrol all individuals with possible PTB in a consecutive or random manner, often only including admitted patients without a suitable control group. Six (46%) studies did not specify whether the person assigning scores to the patients for the various components of the scoring system was blinded to the results of the reference standard.
Figure 2. Quality assessment of the included studies using the Quality Assessment of Diagnostic Accuracy Studies tool.
Findings
Studies that included all patients with possible PTB
We identified six studies that included all patients with possible PTB [20–25]. All studies were performed in an inpatient setting. All studies were aimed at deriving optimal prediction scores to identify patients who were likely to have PTB and require respiratory isolation (table 1). In univariate analyses, the most common radiographic features across studies found to be significantly associated with PTB were upper lobe infiltrates (pooled DOR 6.65, 95% CI 4.42–10.01; five studies) (fig. 3a), and cavities (DOR range 2.11–10.08; three studies) (online supplementary material).
Figure 3. Diagnostic odds ratio for active pulmonary tuberculosis (PTB) with an upper lobe infiltrate visualised on the chest radiograph. a) Among all patients with possible PTB and b) among smear-negative patients with possible PTB. The size of each square is proportional to the sample size of the study, such that larger studies are represented by larger squares. Diamonds represent the pooled estimate for the diagnostic OR. The lines represent the confidence intervals around the respective estimates. df: degrees of freedom.
The details of the parameters included in the scores and their respective weights are summarised in table 2, along with the performance characteristics of the scoring system and the final rule to aid in decision-making. The studies used several different methods to derive weights for the scoring system: logistic regression of the parameters found significant by univariate analysis (three studies); classification and regression tree analysis (one study) [21]; general regression neural network analysis (one study) [22]; and Chi-squared recursive partitioning (one study) [23]. All six studies achieved a sensitivity of the scoring system greater than 80% (median 95%, range 81–100%). For the five studies that reported specificity data, specificity estimates were low (median 42%, range 22–72%), suggesting a poor rule-in value for PTB.
Figure 4 shows forest plots of sensitivities and specificities of the scoring systems reported in individual studies. We did not pool estimates because of the differences in the scoring systems.
Figure 4. Scoring systems for studies that included all patients with possible pulmonary tuberculosis. a) The estimated sensitivity and b) the specificity of the study (black squares). The size of the square is proportional to the sample size of the study, such that larger studies are represented by larger squares. The lines represent the confidence intervals around the respective estimates.
Studies that only included patients with possible PTB who were found to be sputum smear negative
We identified five studies in this category [26–30]. Four studies were conducted in an inpatient setting for the purpose of determining a clinical rule for respiratory isolation [26, 27, 29, 30]; while one study was performed in an outpatient setting [28] (table 1). As with the previously described set of studies that included all patients with possible PTB, in the univariate analysis the most common radiographic features across studies found to be associated with PTB were upper lobe infiltrates (pooled DOR 3.57, 95% CI 2.38–5.37; five studies) (fig. 3b), and cavities (DOR range 1.97–25.66; three studies). We did not pool estimates based on scoring systems because of the differences in these scoring systems (online supplementary material).
To derive weights for the scoring system, two studies used logistic regression of the parameters found significant in the univariate analysis [27, 30], while three studies involved validation of previous studies [26, 28, 29]. One of the validation studies used bootstrapping, which is a resampling method aimed at improving the internal validity of the data [27]. All studies achieved a sensitivity of the scoring system of at least 93% (median 96%, range 93–98%). However, specificity estimates were low (median 35%, range 14–50%), again suggesting a poor rule-in value for PTB (fig. 5). We did not consider these scoring systems to be similar and, therefore, did not pool the accuracy estimates.
Figure 5. Scoring systems for studies that included smear-negative patients with possible pulmonary tuberculosis. a) The estimated sensitivity and b) the specificity of each scoring system (black squares). The size of the square is proportional to the sample size of the study, such that larger studies are represented by larger squares. The lines represent the confidence intervals around the respective estimates.
One study, performed in an inpatient setting, excluded PLWH (table 1) [31]. Logistic regression was used to derive weights for the score, but the study also validated the score derived by Wisnivesky et al. [30]. This study found a sensitivity of 97% and specificity of 42% (table 2).
Study that validated various clinical prediction rules
One study evaluated 13 different clinical prediction rules (identified through a comprehensive literature search) for the respiratory isolation of inpatients with suspected PTB [32]. As noted above, six of the 13 prediction rules met criteria for inclusion in our review. The authors applied the various rules, retrospectively, to emergency room patients who had been included in an earlier study that derived a scoring system for PTB [25] (this study is also included in the current systematic review). Similar to the original studies, the validation study found that most scoring systems had poor specificity for PTB. A comparison of the performance characteristics of the six scoring systems in their respective derivation studies and in the validation study is provided in table 3.
Reproducibility
None of the included studies reported data on intra-reader or inter-reader reproducibility.
Discussion
We conducted this systematic review with the aim of assessing the diagnostic accuracy of standardised radiographic scoring systems for the diagnosis of PTB, and whether standardisation improves the performance of chest radiography. Our review failed to find any study that exclusively relied on radiographic features to derive a score, and all the included studies combined defined radiographic criteria with different clinical criteria. While the aim of the review was to assess the utility of radiographic scoring systems as diagnostic tools, especially in low-income, high-TB burden outpatient settings where such systems would be extremely beneficial if accurate, there appears to be a dearth of such studies. Most of the included studies were hospital-based, decision-to-isolate studies in high-income, low-TB burden settings.
Patients with PTB can generate up to 44 quanta of TB bacilli per hour (one quantum is defined as the infectious dose) [63], highlighting the necessity for rapid respiratory isolation of patients with PTB in the hospital setting. Yet, the unnecessary respiratory isolation of patients considerably increases costs to the healthcare system [64] and in resource-limited settings, where isolation beds may be in short supply, unnecessary isolation of some patients may preclude appropriate isolation of others. Scoring systems that improve the accuracy of decisions to subject patients with possible PTB to respiratory isolation can considerably improve the efficiency of healthcare systems and utilisation of resources. The scores developed suffered from low specificity, and had a high rule-out value (median NPV 99%, range 93–100%, values reported in seven studies) but a poor rule-in value (median PPV 22.5%, range 8–61%, values reported in eight studies) for PTB. The validation of six of these scores in a separate study reflected the same lack of specificity. A stratified analysis, by country income status, reflected better performance of scoring systems in high-income countries compared with middle- and low-income countries, but the improvement was unlikely to be clinically significant (online supplementary material). However, such scores may still be useful for limiting the number of patients for whom further investigations would be warranted, especially among patients who are smear-negative. By comparison, subjective assessments of chest radiographs are known to have extremely poor specificity and therefore warrant further testing for a greater number of patients suspected of having PTB.
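The contrast between rule-out and rule-in value follows from how predictive values depend on disease prevalence. The sketch below takes the median sensitivity (96%) and specificity (46%) reported in this review and treats them, as a simplification, as if they described a single scoring system. The prevalence values are hypothetical and are not drawn from the included studies.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence               # expected true-positive fraction
    fn = (1 - sensitivity) * prevalence         # expected false-negative fraction
    fp = (1 - specificity) * (1 - prevalence)   # expected false-positive fraction
    tn = specificity * (1 - prevalence)         # expected true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Median accuracy from the review; prevalence values are hypothetical.
for prevalence in (0.05, 0.10, 0.30):
    ppv, npv = predictive_values(0.96, 0.46, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With high sensitivity and low specificity, a negative score leaves few missed cases, whereas most positive scores at low prevalence are false positives.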
The prediction rule developed by Wisnivesky et al. [30] was validated in four studies, two of which were conducted in patients who had negative sputum smears. The scoring system consistently demonstrated sensitivity higher than 92%, but had poor specificity. As a rule-out test, this scoring system appears to be validated in multiple studies. The study by Soto et al. [28] was a validation study of a score derived by the same research group in an earlier study [27]. Although the cut-off for the score was modified in the validation cohort, the scoring system performed well in a subgroup of patients with no prior history of TB. Four studies considered the patient to be positive for PTB if they had any one of the features present [20, 23, 26, 29, 30], making these systems resemble checklists rather than weighted scoring systems.
We identified only one study that assessed a clinical–radiographic scoring system for outpatients. Our systematic review also failed to identify a clinical–radiographic scoring system for PLWH with possible PTB. Bock et al. [20] performed a subgroup analysis in PLWH, but found no radiographic feature to be significantly associated with PTB in this subgroup, a finding that is consistent with the atypical nature of radiographic manifestations of PTB described among PLWH [65].
Automated computer-assisted diagnosis employs techniques such as texture analysis for reading digital chest radiographs, and appears to be a promising modality for standardising and improving the diagnostic performance of digital chest radiography [66]. However, our review suggested a lack of methodologically high-quality studies. Researchers in this field could help advance their techniques better by following published guidelines for conducting and reporting their work, to ensure that their efforts contribute to a high-quality evidence base [67, 68].
The strength of our systematic review is in the extensive review of the literature, with two reviewers independently performing screening, quality assessment, and data extraction. However, three caveats need to be acknowledged while interpreting the results of the systematic review. First, 12 of the 13 studies were conducted among inpatients, a population in whom the manifestations of TB disease are likely to be subject to a spectrum bias compared with outpatients [69]. Secondly, nine of the 13 studies were conducted in high-income, low-TB burden settings, and the radiographic manifestations of the disease, pre-test probabilities and technical quality of radiographs are likely to be different in such settings compared with low-income, high-burden settings [70]. Lastly, the purpose of most of the studies was to assess the likelihood of PTB for purposes of respiratory isolation in hospitals, and the derivation of such scores could have different aims (and consequently different cut-offs) from scoring systems derived with diagnostic purposes in mind.
We restricted our search to articles written in English, French and Spanish, but an assessment for language bias suggested that we included a high proportion of the available literature. However, we may have inadvertently failed to include articles in other languages, and acknowledge this as a shortcoming of the review.
As assessed with QUADAS, we judged the studies to be at risk of bias for several items. In all studies, the results of the chest radiograph and/or the clinical components of the scoring system played a role in deciding which patients would receive a culture (verification bias). This bias may have led to overestimates of diagnostic accuracy [15]. This is especially true for studies in which radiographic findings were used to guide the decision for requesting sputum cultures for TB. Seven (54%) studies were not considered to include a representative sample (selection bias). Six (46%) studies did not provide adequate information about blinding of the radiographic interpretation. Selection bias and absence of blinding are features of study design that have also been associated with inflated accuracy estimates [71, 72]. These limitations in the quality of the included studies need to be taken into consideration when interpreting the results.
Conclusions
Our systematic review did not identify a scoring system for PTB based solely on radiographic features. The development and validation of such a system could help standardise the interpretation of chest radiographs. The review identified clinical–radiographic scoring systems for predicting the likelihood of PTB, among patients admitted to hospitals. Such scoring systems are intended for assessing the need for respiratory isolation. Most of these systems have high sensitivity but low specificity for PTB. There is a pressing need to derive accurate scoring systems for PLWH and outpatients, especially in low-resource settings. Technological advances in the interpretation of chest radiographs, such as computer-assisted diagnosis, need to be validated in well-designed studies to assess their utility.
Acknowledgments
We thank L.A. Kloda (Life Sciences Library, McGill University, Montreal, Canada) for her guidance and assistance in formulating the search strategy.
Footnotes
This article has supplementary material available from www.erj.ersjournals.com
Support statement: This work was supported by the European and Developing Countries Clinical Trials Partnership (TB-NEAT grant) and the Canadian Institutes of Health Research (CIHR) (MOP-89918). MP is supported by salary awards from CIHR and Fonds de recherche du Québec – Santé. L.M. Pinto is supported by a fellowship from the Shastri Indo-Canadian Institute. These agencies had no role in the analysis of data and decision to publish.
Conflict of interest: None declared.
- Received July 12, 2012.
- Accepted October 11, 2012.
- ©ERS 2013