Abstract
The use of noninferiority randomised trials for patients with advanced non-small cell lung cancer has emerged during the past 10–15 years but has raised some issues related to their justification and methodology. The present systematic review aimed to assess trial characteristics and methodological aspects.
All randomised clinical trials with a hypothesis of noninferiority/equivalence, published in English, were identified. Several readers extracted a priori defined methodological information. A qualitative analysis was then performed.
We identified 20 randomised clinical trials (three phase II and 17 phase III), 11 of them being conducted in strong collaboration with industry. We highlighted some deficiencies in the reports like the lack of justification for both the noninferiority assumption and the definition of the noninferiority margin, as well as inconsistencies between the results and the authors' conclusions. CONSORT guidelines were better followed for general items than for specific items (p<0.001).
Improvement in the reporting of the methodology of noninferiority/equivalence trials is needed to avoid misleading interpretation and to allow readers to be fully aware of the assumptions underlying the trial designs. Such trials should be restricted to specific situations in which there is a strong justification for why a noninferiority hypothesis is acceptable.
Reporting on noninferiority trials should be improved and results interpreted using the noninferiority margin http://ow.ly/EB6dg
Introduction
Despite the intense research efforts in advanced non-small cell lung cancer (NSCLC) in the past 15 years, the prognosis of these patients remains dismal [1]. Moreover, the number of trials involving new agents reporting a positive outcome without meeting their primary end-point has increased over time and the magnitude of benefit in survival has tended to decrease over time, together with a sharp increase in trial sample size [2].
We are faced, then, with a situation where there is definitely a need to improve patient outcome (already in the first-line setting and even more in the second-line setting) while, in parallel, the conduct of noninferiority trials has developed during the last 10–15 years [3]. This means that investigators and pharmaceutical companies are willing to test experimental drugs (or therapeutic schemes), accepting that those drugs may have decreased efficacy. Indeed, noninferiority does not mean equivalence (statistically, strict equivalence could be demonstrated only if an infinite sample size could be reached) and noninferiority will in fact be translated, for implementation in a clinical trial, into a loss of efficacy smaller than an amount to be specified. This amount is called the noninferiority margin. There are, of course, some arguments that may lead to the design of noninferiority trials [3]. Improvement can be hoped for in patients' quality of life (thereby favouring a treatment with a better quality of survival rather than a longer survival). Reducing toxicity might be another situation in which a noninferiority design could be considered, in particular because this can improve patient quality of life or lead, should noninferiority be established, to a treatment opportunity for further patients who would otherwise have been assessed as unfit for treatment. Decreasing costs might be another parameter, as cost-effectiveness is taken into account by payers when deciding on reimbursement of new drugs. This expected benefit in a secondary outcome should be obvious in order to accept the potential reduced efficacy and should be clearly formulated at the time the trial is designed. There is therefore a need to carefully justify the rationale for conducting a noninferiority trial. Beyond this justification, a second inherent difficulty is to define what one means by decreased efficacy and how the noninferiority margin will be determined.
There are essentially two proposed methods to determine the noninferiority margin [3–6]: either to choose it by a clinical judgment that should take into account the expected benefits on secondary outcomes, with the drawback that there will be no proof that the experimental treatment performs better than the historical standard, or to choose it in order to maintain a proportion of the benefit that has been obtained by the reference treatment in the trials or in the meta-analyses that led one to consider this treatment as a new standard. In fact, the trial results would be more convincing in the case of success when a small margin is used but the required sample sizes may become very large.
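As a purely illustrative sketch of the second (effect-retention) approach, the margin can be derived on the log hazard ratio scale from the historical effect of the reference treatment. The function name, the use of a point estimate (rather than a confidence limit) for the historical hazard ratio and the example value of 0.80 are assumptions made for illustration; they are not taken from the reviewed trials.

```python
import math


def retention_margin(hr_reference_vs_control: float, retained_fraction: float) -> float:
    """Noninferiority margin (hazard ratio, experimental vs reference).

    hr_reference_vs_control: historical hazard ratio of the reference
        treatment against its own comparator (must be below 1).
    retained_fraction: share of that historical benefit the experimental
        treatment must preserve (0.5 means 50% retention).
    """
    if not 0 < hr_reference_vs_control < 1:
        raise ValueError("the reference treatment must have shown a benefit (HR < 1)")
    if not 0 <= retained_fraction < 1:
        raise ValueError("retained_fraction must lie in [0, 1)")
    # Historical benefit on the log scale (a positive number).
    log_benefit = -math.log(hr_reference_vs_control)
    # The experimental arm may lose at most the non-retained share of it.
    return math.exp((1 - retained_fraction) * log_benefit)


# Hypothetical historical effect: HR 0.80 for the reference treatment.
margin_50 = retention_margin(0.80, 0.50)
margin_75 = retention_margin(0.80, 0.75)
print(f"50% retention: HR margin {margin_50:.3f}")
print(f"75% retention: HR margin {margin_75:.3f}")
```

Requiring a larger retained fraction gives a margin closer to 1, which is more convincing in the case of success but, as noted above, increases the required sample size.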
Importantly, noninferiority trials require a specific methodology [7] and special guidelines (Consolidated Standards of Reporting Trials (CONSORT)) have been published on the items to report when publishing [8]. For the interested reader, some specific methodological aspects of noninferiority trials are developed in the online supplementary material.
There have been previous reviews assessing the purpose and the methodological aspects of noninferiority trials [3, 9, 10] in oncology (mainly breast cancer, NSCLC and colorectal cancer): all three identified critical deficiencies in design and reporting, notably in the choice and justification of the noninferiority margin. However, no review, to our knowledge, has focused on trials performed in patients with advanced NSCLC. Therefore, we thought it worth conducting a systematic review in that field, aiming to: 1) report on the general methodology of the trials and on the specific aspects linked to the noninferiority assumption(s); 2) provide an overview of the noninferiority margins that have been used; 3) assess result interpretation according to the noninferiority hypothesis; 4) assess whether the CONSORT guidelines were followed; and 5) assess the reported reasons for carrying out noninferiority trials.
Methods
Trial identification
We defined the following eligibility criteria for a trial to be included in our review: 1) randomised clinical trial conducted in patients with advanced NSCLC; 2) trial testing a noninferiority hypothesis for anticancer drugs/regimens (or an equivalence hypothesis); and 3) trial published as a full paper in English (or French, as French is spoken by all the authors of this review). We excluded publications reporting protocols only, as well as duplicate publications, and publications reporting outcomes other than that tested for noninferiority or data pooled from several trials.
The literature search was performed by two experienced librarians (F. Pasleau and S. Danyi) in April 2014. In addition, we examined the bibliography of review papers selected by the same search. The search strategy is described in the online supplementary material.
Data extraction
The following data, defined before the literature search, were extracted from all publications. General characteristics: 1) year of publication, journal, first author, involvement of a pharmaceutical company in the conduct of the trial or in the interpretation of the results (scored as none, uncertain when a grant was provided, or certain when some authors of the publication were company employees); 2) patients and treatment data (stage of disease, standard and experimental treatments, and treatment setting). Trial methodology: 1) trial phase, trial design and trial type (frequentist or Bayesian), primary outcome, secondary outcomes, description in the methods section of type of hypothesis (noninferiority/superiority) for each outcome, rationale for noninferiority, whether interim analyses were planned and, if yes, the method for adjusting the final analysis; 2) if a noninferiority hypothesis was used for the primary outcome, measure of treatment effect, margin of noninferiority, method for choosing the margin, whether the sample size was consistent with the design of noninferiority and, if yes, the alternative hypothesis for power calculation. Trial results: results for the primary outcome in two patient populations, intent-to-treat (ITT) and per protocol (PP). Conclusion: the authors' conclusions and whether they were in accordance with the hypothesis of noninferiority.
In addition, we qualitatively assessed the trials using: 1) the CONSORT guidelines (the general guideline for reporting parallel-group randomised trials [11] and its extension for reporting noninferiority and equivalence randomised trials [8]); and 2) the risk of bias evaluation tool developed by the Cochrane Collaboration [12]. The CONSORT checklist consists of 37 items to be reported for randomised clinical trials and 12 extended items specific to studies with noninferiority hypotheses. We scored them as 1 (not reported), 2 (reported, but partially or inadequately) or 3 (reported).
The risk of bias tool from the Cochrane Collaboration is specifically dedicated to aspects of trials that are most linked to the risk of obtaining a biased estimate of the treatment effect. The scoring was performed as recommended by the Cochrane Collaboration, as low risk of bias, unclear risk of bias or high risk of bias in six domains (two related to the randomisation procedure, two related to the blinding of the study, one related to bias induced by incomplete outcome data and one related to bias induced by selective reporting).
Each publication was read, for data extraction, by three to 11 readers (median 5.5). All extractions were reviewed by one author (M. Paesmans) who performed a final and common evaluation by resolving the inconsistencies or, in case of subjectivity in the assessments, by calculating an average score. All final evaluations were submitted to the readers, reviewed and approved.
Analysis
We summarised our data extraction in descriptive tables providing individual results for the qualitative characteristics that we extracted from each publication as well as for each domain of the risk of bias tool.
For the items from the CONSORT guidelines, we used frequency tabulations and, for each of the items, calculated the mode, median and mean (although the numeric scores, in fact, reflect only an ordinal scale). We also calculated, for the general and extended items, an overall mean result. For the risk of bias tool, we calculated the number of items with a poor assessment, i.e. assessment of unclear risk of bias or high risk of bias (ranging from 0 to 6). We compared the medians of the distributions of the mean CONSORT scores (general and extended) with a Wilcoxon signed-rank test. We also assessed the impact of the following variables, both for the risk of bias tool (number of items with poor assessment) and the mean CONSORT scores, with Mann–Whitney tests: year of publication (dichotomised according to the median value), setting (first or second line), primary outcome (overall survival or not), whether the sample size could be reproduced from the details provided in the publication (yes or no), consistency of the authors' conclusions with the results and the pre-planned design (yes or no), involvement of company members in the conduct of the trials and the analysis of the data, and high or low risk of bias according to the Cochrane tool.
SAS v9.3 (SAS Institute, Cary, NC, USA) was used for the statistical analysis.
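The analysis itself was run in SAS. For readers who want to see what the paired comparison of mean CONSORT scores involves, the following is a minimal standard-library Python sketch of a Wilcoxon signed-rank test using the normal approximation. The function name is ours and the per-trial scores are hypothetical values invented for illustration; they are not the data of this review, and with so few pairs an exact test would be preferable.

```python
import math
from statistics import NormalDist


def signed_rank_test(x, y):
    """Paired Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, w_minus, two_sided_p). Zero differences are dropped
    and tied absolute differences receive the average of their ranks.
    """
    # Round to avoid spurious inequalities caused by floating-point noise.
    diffs = [round(a - b, 9) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        group_size = j - i + 1
        tie_term += group_size ** 3 - group_size
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48  # tie-corrected
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, w_minus, p_value


# Hypothetical mean scores (1-3 scale) for six trials, for illustration only.
general_items = [2.4, 2.2, 2.5, 2.1, 2.3, 2.6]
extended_items = [2.0, 2.1, 1.9, 1.8, 2.2, 2.0]
w_plus, w_minus, p_value = signed_rank_test(general_items, extended_items)
print(f"W+ = {w_plus}, W- = {w_minus}, two-sided p = {p_value:.3f}")
```

The test is paired because each trial contributes both a general and an extended mean score; the Mann–Whitney tests mentioned above compare independent groups of trials instead.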
Results
Trials identified
We identified 20 eligible trials (all published in English), which were published between 2002 and 2013 [13–32]: 11 were conducted in chemotherapy-naïve patients (i.e. first-line setting) [13–23] and nine in populations of patients with a progression or relapse following previous chemotherapy [24–32]. By definition, all were conducted in patients with locally advanced or metastatic tumour. Only one trial was specific to patients with stage IV disease. There were three phase II trials and 17 phase III trials. Two trials were purely academic, and 11 were fully conducted together with the pharmaceutical industry with members of the company included as authors (table 1). Table 1 provides the other general characteristics of the trials including the description of the standard and experimental arms.
Trial methodology
General methodology
The general methodological characteristics of the trials are reported in table 2. All had primary objectives with a noninferiority assumption (equivalence in one) and none of them had a Bayesian design (not reported in the table). In none of the trials was it explicitly specified in the methods section whether the secondary objectives were analysed with a superiority assumption or not, except in that by Natale et al. [31], in which all outcomes were first tested for superiority (noninferiority was to be tested only in the case of failure to detect improvement). Among the three phase II trials, two used response and one progression-free survival as the primary outcome. Of the 17 remaining trials, 12 had overall survival as the primary end-point (with a methodology focusing on 1-year survival in three), four analysed progression-free survival and one made use of response. Interim analyses were planned in six (30%) trials, and their results were adequately taken into account at the time of the final analysis in five of these.
Specific aspects linked to the noninferiority hypothesis
Trial characteristics specific to the assumption of noninferiority are described in table 3. By rationale for noninferiority, we mean an explicit statement about an expected benefit on a secondary outcome and we found such a statement in only six trials. This expected benefit was in all cases reduced toxicity (plus, in one trial, ease of administration). It should be noted that two trials planned to test first for superiority (classical methodology and classical test based on the observed results) and then, in case of failure to detect superiority, for noninferiority. Three trials were designed to test first for the noninferiority hypothesis but calculated their sample size a priori with the alternative hypothesis of superiority of the experimental treatment, which decreases the power to demonstrate noninferiority if both treatments are truly equal (more details can be found in the online supplementary material). The method used to define the noninferiority margin was described in only six articles. In five studies, it was the effect-retention method (50% retention for four trials and proportion not specified in one). Use of the conventional method, on the basis of clinical reasoning and taking into account toxicity, was reported once, as was the use of guidelines from the European Medicines Agency, without providing a reference to these guidelines [23]. In one study, two different margins were used and the effect-retention method was mentioned for the secondary definition of the margin. Except for the two trials that tested first for superiority and the trial in which sample size was not justified, all but one article reported an a priori sample size estimation that was consistent with the noninferiority margin. However, sufficient details allowing reproduction of the calculation of sample size were often missing: in six cases, the alternative hypothesis for determining statistical power was not reported and the targeted statistical power was not specified in one trial.
When specified, the statistical power ranged from 75% to 90%. As most of the trials expressed the noninferiority margin in relative terms, we calculated the absolute difference accepted by the authors for concluding on noninferiority, using a summary parameter from the results (expected if reported or observed) in the control arm (table 3). When survival was the primary end-point, the median accepted absolute loss was 1.35 months (range 1.1–3.5 months).
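To illustrate why sizing a noninferiority trial under an alternative hypothesis of superiority is a concern, the sketch below uses a Schoenfeld-type approximation for the number of events needed in a 1:1 trial analysed on the log hazard ratio scale. The formula choice, the one-sided 2.5% significance level, the margin of 1.25 and the assumed true hazard ratio of 0.90 are all illustrative assumptions and are not taken from any specific trial in this review.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()


def events_required(margin_hr, assumed_true_hr, alpha=0.025, power=0.80):
    """Approximate number of events needed to show HR < margin_hr
    (one-sided alpha, 1:1 allocation) if the true HR is assumed_true_hr."""
    z_alpha = _NORMAL.inv_cdf(1 - alpha)
    z_beta = _NORMAL.inv_cdf(power)
    distance = math.log(margin_hr) - math.log(assumed_true_hr)
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / distance ** 2)


def power_for_events(events, margin_hr, true_hr, alpha=0.025):
    """Approximate power to demonstrate noninferiority with a given
    number of events when the true hazard ratio is true_hr."""
    z_alpha = _NORMAL.inv_cdf(1 - alpha)
    distance = math.log(margin_hr) - math.log(true_hr)
    return _NORMAL.cdf(math.sqrt(events) / 2 * distance - z_alpha)


MARGIN = 1.25  # illustrative noninferiority margin

# Sized assuming the two treatments are truly equal (HR = 1).
events_equal = events_required(MARGIN, 1.0)
# Sized assuming the experimental arm is truly better (HR = 0.90).
events_superior = events_required(MARGIN, 0.90)
# Power actually available if the treatments are in fact equal.
power_if_equal = power_for_events(events_superior, MARGIN, 1.0)

print(f"events needed assuming HR = 1.00: {events_equal}")
print(f"events needed assuming HR = 0.90: {events_superior}")
print(f"power under HR = 1.00 with the smaller trial: {power_if_equal:.2f}")
```

Under these assumptions, the trial sized with the optimistic alternative needs far fewer events, but its power to demonstrate noninferiority when the two treatments are truly equal falls well below the nominal target.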
Presentation of trial results
The results were presented using an ITT approach in 17 trials and using a PP approach in four (table 4). Both results were presented in two trials. In one trial, the results on the primary outcome were not presented. A benefit in a secondary outcome was documented in 12 trials. In total, for the 20 trials, 23 comparisons for noninferiority were presented (one trial had two experimental arms and one trial had three experimental arms). We compared the qualitative conclusions presented by the authors with the results, and we considered that the qualitative conclusions were consistent with the results and the trial design in 11 out of these 23 comparisons. The most frequent reason for inconsistency was claiming equivalence on the basis of absence of statistically significant difference, even if noninferiority was not demonstrated according to the a priori margin. For instance, in the study by Maruyama et al. [28], the primary end-point was overall survival and the noninferiority margin was defined as a hazard ratio ⩽1.25. The 95% confidence interval for the hazard ratio estimate was 0.89–1.40 (ITT analysis only), with an upper boundary >1.25. Although the authors correctly concluded that noninferiority could not be demonstrated, they also added “however, there was no statistically significant difference in overall survival”. This last statement suggests equivalence between both treatments, which is not correct. Another example of misleading conclusion can be found in the study by Hanna et al. [25]. The primary definition of noninferiority was a hazard ratio for overall survival ⩽1.11 (primary analysis, conventional method) or ⩽1.21 (secondary analysis, effect-retention method). The confidence interval obtained from the trial data was 0.80–1.20 (ITT analysis only). With the definition to be used for the primary analysis, the conclusion of noninferiority cannot be claimed. 
However, in the abstract, there was no mention of the noninferiority assumptions, no report of the hazard ratio estimate and a conclusion stating “treatment with pemetrexed resulted in clinically equivalent efficacy outcomes, but with significantly fewer side effects […] and should be considered a standard treatment option”. This conclusion is not consistent with the trial design and is subject to false interpretation by the reader.
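The decision rule applied in the two examples above can be written down in a few lines. This is a sketch only: the function and the wording of its labels are ours, and it assumes a hazard ratio of experimental versus standard treatment for which values above 1 favour the standard arm.

```python
def interpret_hazard_ratio_ci(ci_lower, ci_upper, margin, no_difference=1.0):
    """Classify a confidence interval for the hazard ratio against a
    pre-specified noninferiority margin (margin > no_difference)."""
    if ci_upper < no_difference:
        return "superiority shown"
    if ci_upper <= margin:
        return "noninferiority shown"
    if ci_lower > margin:
        return "inferiority shown"
    # The interval contains the margin: the data remain compatible with
    # a loss of efficacy larger than the one accepted a priori.
    return "noninferiority not shown"


# Confidence intervals and margins quoted in the text above.
maruyama = interpret_hazard_ratio_ci(0.89, 1.40, margin=1.25)
hanna_primary = interpret_hazard_ratio_ci(0.80, 1.20, margin=1.11)
hanna_secondary = interpret_hazard_ratio_ci(0.80, 1.20, margin=1.21)

print("Maruyama et al. [28], margin 1.25:", maruyama)
print("Hanna et al. [25], margin 1.11:", hanna_primary)
print("Hanna et al. [25], margin 1.21:", hanna_secondary)
```

Note that none of the branches returns "equivalent": an interval that includes 1 but exceeds the margin supports neither noninferiority nor equivalence.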
Qualitative assessment
The assessment of the CONSORT items is presented in table 5 (frequency tabulations). The general mean score was significantly higher (2.3) than the mean score for extended items (2.0) (p<0.001, Wilcoxon signed-rank test). Among the general items that were poorly described (median <2) were the items related to the randomisation procedure (generation of the allocated arm, allocated treatment concealment and implementation of the randomisation procedure), the item about the availability of the trial in a registry and that about the public availability of the protocol. In the extended items, the lack of clarity about the noninferiority hypothesis in the title, the lack of specification of the rationale for noninferiority and the absence of an explicit statement about patient selection (are they similar to those included in superiority trials?) were the main drawbacks. It should be stressed that the CONSORT checklist does not allow documentation of the lack of transparency of the a priori sample size estimation, as the only requirement is to report a sample size estimation based on the noninferiority assumption and the noninferiority margin. We did not identify, for any of the covariates that we pre-specified, any significant association with the general mean score (all p>0.05, Mann–Whitney tests). The same applies to the extended items. For the risk of bias tool, we only identified a higher risk when the primary outcome was different from overall survival, higher risk being explained by the lack of blinding in all the trials but one. The individual results for the risk of bias tool assessment are shown in table 6.
Discussion
The use of noninferiority designs is a recent development and, up to now, has remained quite modest in patients with advanced NSCLC. Indeed, a recent review by Sacher et al. [2] identified 118 randomised phase III clinical trials (targeting the same populations) published during 2001–2010 while, in the present review, we selected 15 such studies during the same period (13%, 95% CI 8–20%). The authors, although not addressing the issue of noninferiority assumption, reported a significant shift in the design and interpretation of trials with an increasing use of progression-free survival as primary outcome, despite the subjectivity of its assessment and despite the lack of demonstrated surrogacy for overall survival. They also raised concerns because of the increasing rate of inadequate result interpretation: indeed, they observed, more frequently in recent years than before, trials considered positive despite not reaching statistical significance on their primary outcome, on the basis of arguments related to secondary outcomes. A benefit observed in a secondary outcome without reaching the pre-defined difference on the primary outcome cannot be considered sufficient evidence to claim that the experimental arm is to be recommended.
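The proportion quoted above (15 out of 118 trials) and its confidence interval can be checked with a short calculation. The method used to obtain the published interval is not stated; the sketch below assumes a Wilson score interval, which is one common choice and gives the same rounded bounds.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, total, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / total
    denominator = 1 + z ** 2 / total
    centre = (p_hat + z ** 2 / (2 * total)) / denominator
    half_width = (
        z
        * math.sqrt(p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2))
        / denominator
    )
    return centre - half_width, centre + half_width


# 15 noninferiority trials among 118 phase III trials (figures from the text).
lower, upper = wilson_interval(15, 118)
print(f"proportion: {15 / 118:.1%}")
print(f"95% CI (Wilson, assumed method): {lower:.1%} to {upper:.1%}")
```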
To conclude about the value of an experimental treatment on the basis of benefits on other outcomes without a positive, statistically significant result on the primary outcome is, in fact, the aim of noninferiority trials. This could be viewed as a better approach than wrongly interpreting the results of superiority trials. Indeed, in a noninferiority trial, the definition of noninferiority is decided upfront with the choice of the noninferiority margin and the conclusion should be reached only if the noninferiority is statistically proven according to the design of the trial and the a priori assumptions. However, we found comparable methodological flaws in the present review, with about half of the conclusions (11 out of 23) not being consistent with trial results or design. As the absence of a statistically significant difference in a superiority trial cannot be interpreted as equivalence between the standard-care and experimental arm [11], failure to demonstrate noninferiority cannot be shifted to an interpretation of “similar efficacy” even if the point estimates for the primary end-point appear to be close. Indeed, when the experimental treatment is not “noninferior” to the standard one, the confidence interval for the treatment effect includes the value of the treatment effect corresponding to the noninferiority margin and the trial results are, therefore, statistically compatible with the inferiority of the experimental arm (table 7).
The issue of defining noninferiority is complex and, although debated in the literature [3], there is no consensus about the best method to be used. In the trials analysed here, the choice of the noninferiority margin is neither described nor discussed and the noninferiority margin itself is most often defined in relative terms, preventing the reader from interpreting the translation of this definition into absolute differences. We calculated these absolute differences and we believe that, in some cases, the price to pay to get a secondary possible benefit is potentially as high as losing 2.4 months from 12 months [28] or 3.5 months from 14 months [22] of median survival. In one trial, investigators even accepted an absolute loss of 10% in response starting from an absolute rate of 5% [30]: with such an assumption, no treatment could be definitely noninferior and also less toxic, and there is no need to conduct a clinical trial to achieve this result.
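As an illustration of how a relative margin translates into an absolute difference, the sketch below converts a hazard ratio margin into a loss in median survival. It assumes exponentially distributed survival times, so that the median in the experimental arm equals the control median divided by the hazard ratio; this simplification and the function name are ours and are not necessarily the calculation used for table 3.

```python
import math


def absolute_loss_in_median(control_median_months, margin_hr):
    """Largest loss in median survival (months) still counted as
    noninferior, assuming exponential survival in both arms."""
    if margin_hr <= 1:
        raise ValueError("a noninferiority margin on the HR scale exceeds 1")
    worst_accepted_median = control_median_months / margin_hr
    return control_median_months - worst_accepted_median


# Control median of 12 months with a margin of HR <= 1.25, as in [28].
loss = absolute_loss_in_median(12, 1.25)
print(f"accepted loss: {loss:.1f} months out of 12")
```

With a control median of 12 months and a margin of 1.25, this gives the loss of 2.4 months quoted above for the trial by Maruyama et al. [28].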
There is also a clear lack of proper documentation of why the investigators did not have the “ambition” to identify experimental treatments superior to the control arm but chose to attempt to demonstrate noninferiority. The expected benefits are rarely explicitly mentioned and, if so, the hypotheses on these benefits are never quantified. The planned switch from superiority to noninferiority that we observed in two trials in our review is of particular concern because there appears to be no other justification for noninferiority than the failure to detect superiority of the experimental arm. The use of the superiority of the experimental arm as alternative hypothesis for power calculation (at least in three trials out of 20), which is in contradiction with the noninferiority assumption as it consists of using a true situation different from the hypothesis, is also worrying. Some authors have argued that there might be some pressure on investigators to produce significant results (which lead to better publications) or influence from industry (often funding trials) willing to conclude positively on the trial results [2]. A pharmaceutical company was involved in all the trials included in our review, including at least funding but also often authorship for company employees. The exact role of company employees is seldom described but authorship should mean a significant contribution to design, data analysis or data interpretation.
The use of progression-free survival or, even more so, response as the primary outcome is not a marginal choice (eight out of 20 trials) and, in our opinion, should be avoided in the context of noninferiority trials. Its use in superiority trials is debated, notably because of its subjectivity, the influence of the intervals between assessments and the lack of impact on survival in the case of small benefits [33]. It has been shown that, in the population of patients with advanced NSCLC, large benefits in progression-free survival are needed to warrant a (small) benefit in survival [33]; the impact of a noninferior progression-free survival on overall survival should therefore be studied further using retrospective studies. Furthermore, the practical difficulty of blinding chemotherapy regimens for patients and investigators induces a higher risk of bias when using progression-free survival as outcome.
There were three phase II trials among those eligible for our review. Accepting noninferiority in early development of a drug/regimen is even more intriguing and challenging than for phase III trials, and should imply a much more careful justification of the rationale for this choice (anecdotally, this was not given for two of these three trials).
We showed that the quality of reporting of noninferiority trials is worse for the items linked to noninferiority than for the general aspects, as measured by the CONSORT scores. Our review also identified some deficiencies in the reporting of the determination of the sample size; reports were often not detailed enough to be reproducible. The important recommendation to carry out both ITT and PP analyses is, in practice, seldom followed. Indeed, contrary to superiority trials, in noninferiority trials, the PP analysis is important to make sure that there was no dilution in the effect of the experimental treatment in the ITT population (leading to an overoptimistic conclusion of noninferiority) [7].
In conclusion, we believe that noninferiority trials for advanced NSCLC might be misleading and their results have little value for clinical practice. The situation portrayed by Elie et al. [7], who see a place for noninferiority trials when standard treatments have very high efficacy that cannot be improved any more meaningfully, is far from being common today in the population of patients with advanced NSCLC. Thus, the choice of a noninferiority design for future trials should be restricted to carefully selected situations. Such a situation was reached in the trial by Park et al. [18] in which the question was about the number of cycles required, with an obvious hope of reducing toxicity. If this choice is made, as already stated by Tanaka et al. [3], improvement in methodology should be achieved by a full description of the choice of the noninferiority margin (the effect-retention method should be favoured as it is the only one warranting that the experimental treatment is better than the previous standard), clear hypotheses about expected benefits in secondary outcomes, conduct of both ITT and PP analyses, and strict interpretation of results according to the design and the pre-specified hypotheses.
Our key messages are the following. When an experimental treatment is shown to be noninferior to the standard one, the interpretation of this result should be careful and take into account the noninferiority margin, which represents the a priori acceptable loss in efficacy: noninferiority does not mean that there is no difference between the experimental approach and the standard one. The observed benefit(s) on other end-points, especially those justifying the noninferiority design (if any) should be integrated in the conclusion too. However, the expected a priori benefits are seldom explicitly stated, which is a bad practice that needs to be improved as well as the reporting on methodological aspects of noninferiority trials. The impact of those trials on clinical practice should be low.
Footnotes
This article has supplementary material available from erj.ersjournals.com
Conflict of interest: None declared.
Previous articles in this series: No. 1: Powell HA, Baldwin DR. Multidisciplinary team management in thoracic oncology: more than just a concept? Eur Respir J 2014; 43: 1776–1786. No. 2: Shlomi D, Ben-Avi R, Balmor GR, et al. Screening for lung cancer: time for large-scale screening by chest computed tomography. Eur Respir J 2014; 44: 217–238. No. 3: De Ruysscher D, Nakagawa K, Asamura H. Surgical and nonsurgical approaches to small-size nonsmall cell lung cancer. Eur Respir J 2014; 44: 483–494. No. 4: Van Schil PE, Opitz I, Weder W, et al. Multimodal management of malignant pleural mesothelioma: where are we today? Eur Respir J 2014; 44: 754–764. No. 5: Kim L, Tsao MS. Tumour tissue sampling for lung cancer management in the era of personalised therapy: what is good enough for molecular testing? Eur Respir J 2014; 44: 1011–1022. No. 6: Blum T, Schönfeld N. The lung cancer patient, the pneumologist and palliative care: a developing alliance. Eur Respir J 2015; 45: 211–226.
- Received May 20, 2014.
- Accepted November 17, 2014.
- Copyright ©ERS 2015