
Preventive Veterinary Medicine

Volume 92, Issue 3, 15 November 2009, Pages 249-255

Summary receiver operating characteristics (SROC) and hierarchical SROC models for analysis of diagnostic test evaluations of antibody ELISAs for paratuberculosis

https://doi.org/10.1016/j.prevetmed.2009.08.019

Abstract

Critical, systematic reviews of available diagnostic test evaluations are a meticulous approach to synthesizing evidence about a diagnostic test. However, such reviews often find that data quality is poor due to deficiencies in the design and reporting of the test evaluations, so formal statistical comparisons are discouraged. Even when only simple summary measures are appropriate, the strong correlation between sensitivity and specificity and their dependence on differences in diagnostic threshold across studies create the need for tools to summarise the properties of the diagnostic test under investigation.

This study presents summary receiver operating characteristics (SROC) analysis as a means to synthesize information from diagnostic test evaluation studies. Using data from a review of diagnostic tests for ante mortem diagnosis of paratuberculosis as an illustration, SROC and hierarchical SROC (HSROC) analyses were used to estimate the overall diagnostic accuracies of antibody ELISAs for bovine paratuberculosis while accounting for covariates: the target condition (infectious or infected) used in the test evaluation (one for the evaluation of Se and one for Sp), and the type of test (serum vs. milk). The methods gave comparable results regarding the estimated diagnostic log odds ratio, considering the small sample size and the quality of the data. The SROC analysis found a difference in the performance of tests when the target condition for evaluation of Se was infected rather than infectious, suggesting that ELISAs are not suitable for detecting infected cattle. However, the SROC model does not take differences in sample size between study units into account, whereas the HSROC allows for both between- and within-study variation. Considering the small sample size, more credibility should be given to the results of the HSROC. For both methods the area under the (H)SROC curve was calculated, and the results were comparable.

The conclusion is that while the SROC is simpler and easier to implement, analyse and interpret, the HSROC has properties that justify the extra effort involved in the analysis.

Introduction

The ability to correctly diagnose a specific disease or infection has always been a central component of veterinary medicine. For many pathogens, the veterinarian has a wide range of tools, ranging from clinical and pathological examinations to serological assays for detection of antibodies and direct detection of the pathogen. Furthermore, diagnostic tests are being used in a widening range of settings: confirmation of clinical cases, identification of subclinical infections, surveillance, certification of disease freedom, etc. The World Organisation for Animal Health (OIE) acknowledged this at its 71st General Session in May 2003, when the International Committee adopted Resolution No. XXIX, which introduced ‘fitness for purpose’ as a criterion for validation. ‘Fitness for purpose’ implies that to properly evaluate a diagnostic test, the context of its application must be considered, so that the condition detected by the test reflects the purpose for which the test is intended to be used. However, the vast array of available diagnostic tools and their many purposes have further increased the need for good studies of the reliability and performance of the available tests.

The design and analysis of diagnostic test evaluation studies have become part of the standard curriculum in many basic epidemiology courses. Traditionally, the performance of a diagnostic test in an epidemiological setting is defined conditionally on the disease status as the sensitivity, Se = Pr(T+|D+), and specificity, Sp = Pr(T−|D−), where D denotes the condition to be detected (the truly “diseased”) and T the test result. In the classic design of a test evaluation, a perfect reference test, often referred to as a ‘gold standard’, is used to discriminate between truly diseased and truly non-diseased individuals, which are subsequently tested with the diagnostic tests under evaluation. This approach has the advantage that an exact definition of a ‘diseased’ individual can be made, but the disadvantage that there is often a discrepancy between the defined disease condition and the condition relevant for the decision problem, i.e., the ‘fitness for purpose’ is not always met by the definition of ‘disease’. For example, classic test evaluations often rely on agent detection methods to establish the true disease status when serological assays are evaluated. While agent detection methods may reliably establish the “gold standard” for detection of infectious animals, they are usually a poor ‘gold standard’ for detection of infected or clinically affected animals. As a consequence, infected animals can often only be included in the study if they are infectious, i.e., shedding or excreting the agent, thereby potentially introducing selection bias. Therefore, in recent years, latent class analysis (Hui and Walter, 1980) has been widely accepted as an alternative method in which tests are evaluated in the absence of a ‘gold standard’. This approach somewhat circumvents the selection bias, but at the cost of introducing a more abstract disease definition, which partly depends on the tests used in the evaluation (Toft et al., 2003).
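As a minimal sketch (with made-up counts, not data from any study), these definitions reduce to simple proportions from a two-by-two table of test result against true status:

```python
# Minimal sketch: Se and Sp estimated from a classic 'gold standard'
# test evaluation. The counts below are illustrative only.
tp, fn = 45, 15  # truly diseased animals: test-positive / test-negative
tn, fp = 90, 10  # truly non-diseased animals: test-negative / test-positive

se = tp / (tp + fn)  # Se = Pr(T+ | D+)
sp = tn / (tn + fp)  # Sp = Pr(T- | D-)
print(f"Se = {se:.2f}, Sp = {sp:.2f}")  # Se = 0.75, Sp = 0.90
```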

Greiner and Gardner (2000) advocate that tests should essentially always be evaluated in the population in which they are intended to be used. Hence, there is an apparent need for a new evaluation of a diagnostic test whenever the purpose or target population changes. In practice, however, proper evaluation of a diagnostic test is very expensive and time consuming; therefore, alternatives to starting over each time are needed. One such possibility is the use of critical reviews, and potentially meta-analyses, of published test evaluation studies to disseminate available information. Meta-analysis is a well-established statistical discipline with a good theoretical foundation, but its practical application is often diminished somewhat by the lack of good data. The same can be said of meta-analysis of diagnostic test evaluation studies: there are several examples of critical reviews within human medicine, but few that go beyond a critical subjective comparison. Considering the points made in the previous paragraphs, it seems that formal comparisons of Se and Sp across published studies should be discouraged, and most often a critical review would exclude all but a few studies from further comparison. However, diagnostic test evaluations are improving, due to more stringent review procedures for submitted manuscripts and better reporting driven by initiatives such as STARD, the OIE guidelines, and courses and workshops such as the DTE-workshop series. Furthermore, even if formal inference should be avoided, it can often be justified to summarise Se and Sp across studies, just to give an idea of the potential of a diagnostic technique when applied for a specific purpose. However, simple averaging of Se and Sp across studies will generally not work, since a high Se is usually achieved by lowering the requirements for Sp.
To illustrate this point, consider the following example: assume that three studies are reported on the exact same serological assay, differing only in the threshold selected for determining a positive test. The estimated pairs of Se and Sp are (0.25, 0.95), (0.95, 0.25) and (0.75, 0.75), reflecting scenarios where the ability to rule out disease, to rule in disease, or a trade-off between the two was deemed important. The average pair of Se and Sp is (0.65, 0.65), which does not summarise the properties reported by the three studies well. Clearly, there is a need for more adequate means of summarising such data. One such method is the use of summary ROC (SROC) curves, which are analogous to the receiver operating characteristics (ROC) curves known from the evaluation of, e.g., serological assays (Greiner et al., 2000).
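The logit transformation commonly used in SROC analysis (the Moses–Littenberg parameterisation; the names D and S follow that convention rather than the article) makes this threshold dependence explicit: D = logit(Se) + logit(Sp) is the diagnostic log odds ratio, a threshold-robust accuracy measure, while S = logit(Se) − logit(Sp) acts as a proxy for the threshold. A sketch applied to the three pairs above:

```python
from math import log

def logit(p):
    """Log-odds transform, logit(p) = ln(p / (1 - p))."""
    return log(p / (1 - p))

# The three illustrative (Se, Sp) pairs from the example in the text.
studies = [(0.25, 0.95), (0.95, 0.25), (0.75, 0.75)]

for se, sp in studies:
    D = logit(se) + logit(sp)  # log diagnostic odds ratio: overall accuracy
    S = logit(se) - logit(sp)  # threshold proxy: sign flips with the cut-off
    print(f"Se={se:.2f} Sp={sp:.2f}  D={D:+.2f}  S={S:+.2f}")
```

Note that the first two pairs, which look contradictory on the (Se, Sp) scale, yield identical D values with opposite S values: the transformation separates accuracy from threshold choice, which is what naive averaging fails to do.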

The objective of this study was to present graphical and model-based approaches to estimate, summarise and compare SROC curves as a means to synthesize evidence from diagnostic test evaluation studies. To illustrate the concepts, we present the synthesis of a critical review of the accuracies of ELISAs used for detection of different stages of paratuberculosis in cattle and compare various SROC techniques based on the derived set of comparable studies.

Section snippets

Critical review of antibody ELISAs for detection of paratuberculosis in cattle

Paratuberculosis is a chronic infection of particular concern in ruminants. The infection, which is caused by Mycobacterium avium subsp. paratuberculosis (MAP), may develop slowly, with disease progression usually taking place over several years. In Nielsen and Toft (2008), a critical review of the accuracies of antibody ELISAs, interferon-γ assays and faecal culture techniques for ante mortem diagnosis of paratuberculosis was conducted. The means of addressing paratuberculosis and level

Results

The review process and subsequent exclusion of studies that did not meet the inclusion criteria resulted in a dataset of 36 test evaluation studies at least partially suitable for further analyses. Of the 36 studies, 31 evaluated serum antibody ELISAs and 5 evaluated milk antibody ELISAs. Three used the Antel Biosystems milk ELISA; 6 used the HerdChek serum ELISA from IDEXX Laboratories, Westbrook, Maine, USA; 2 used the serum ELISA from IDEXX Scandinavia; 6 used the Parachek serum ELISA

Discussion

In this study we have compared simple SROC and hierarchical SROC models on a dataset of 36 diagnostic test evaluations of antibody ELISAs for ante mortem diagnosis of paratuberculosis, using three different definitions of the target condition in the evaluation studies. The two models gave somewhat different, although comparable, results.

Using the simple SROC model based on linear regression of the transformed Se and Sp estimates, we found that there was a difference between the diagnostic log odds
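A simple SROC model of this kind can be sketched as an ordinary least-squares regression of D on S, followed by back-transformation to a curve in (Se, Sp) space. The (Se, Sp) pairs below are illustrative only, not the article's data, and the unweighted fit mirrors the simple model's limitation noted above: study sizes are ignored.

```python
# Sketch of a Moses-Littenberg-style SROC fit: least-squares regression
# of D = a + b*S, then back-transformation to an SROC curve.
# The (Se, Sp) pairs are illustrative, NOT the article's dataset.
from math import exp, log

def logit(p):
    return log(p / (1 - p))

pairs = [(0.60, 0.95), (0.75, 0.90), (0.85, 0.80), (0.95, 0.60)]
D = [logit(se) + logit(sp) for se, sp in pairs]  # log diagnostic odds ratios
S = [logit(se) - logit(sp) for se, sp in pairs]  # threshold proxies

# Unweighted least-squares fit of D = a + b*S.
n = len(pairs)
s_bar, d_bar = sum(S) / n, sum(D) / n
b = sum((s - s_bar) * (d - d_bar) for s, d in zip(S, D)) / sum(
    (s - s_bar) ** 2 for s in S
)
a = d_bar - b * s_bar

def sroc_se(sp):
    """Expected Se on the fitted SROC curve at a given Sp.

    Substituting D = logit(Se) + logit(Sp) and S = logit(Se) - logit(Sp)
    into D = a + b*S and solving for logit(Se) gives:
    logit(Se) = (a - logit(Sp) * (1 + b)) / (1 - b).
    """
    lse = (a - logit(sp) * (1 + b)) / (1 - b)
    return 1 / (1 + exp(-lse))

print(f"a = {a:.3f}, b = {b:.3f}, Se at Sp=0.90: {sroc_se(0.90):.2f}")
```

A slope b close to zero indicates a symmetric SROC curve, i.e. a roughly constant diagnostic odds ratio across thresholds; the HSROC model instead treats accuracy and threshold as random effects, allowing for both within- and between-study variation.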

Conflict of interest statement

None declared.

Acknowledgement

This study was co-funded by the European Commission within the Sixth Framework Programme, as part of the project ParaTBTools (contract no. 023106 (FOOD)).

