We identified nineteen eligible articles, published between August 2017 and December 2019, describing thirteen individual reviews. Ten reviews[16-25] appeared as peer-reviewed articles, of which five also appeared in the form of one or more conference abstracts[26-31]; a further three reviews[32-34] were described in conference proceedings only. A flow diagram is shown in Additional file 2, and eligible reviews are summarised in Additional file 3.
Description of relevant reviews
All trials included in eligible reviews investigated the addition of one or more treatments, such as abiraterone, celecoxib, docetaxel and zoledronic acid, to the standard-of-care of androgen deprivation therapy (ADT) compared to ADT alone, including two combination treatments (zoledronic acid plus ADT in combination with each of docetaxel and celecoxib[35, 36]). One relevant trial[37] compared multiple research treatments under the same protocol, such that data from 14 randomised comparisons, arising from nine trial protocols, were represented across the reviews. Each review used data from between three and twelve randomised comparisons (Figure 1), comprising between 1,773 and 7,844 patients. The source data from each relevant trial are given in Additional files 4 and 5. The theoretical network resulting from analysis of all such data simultaneously is shown in Figure 2.
Sources of variation
We observed considerable variation between the included reviews in terms of review aims, eligibility criteria and included data, statistical methodology, reporting and inference.
1. Review aims
All thirteen eligible reviews either stated or implied an aim to synthesise data on optimal treatments for hormone-sensitive prostate cancer. Two reviews stated the additional aim of including updated results[21] and/or improved methodology[20, 21]. Four others specifically aimed to evaluate efficacy within pre-defined patient subgroups[19, 22-24], and four stated the aim of incorporating health economic considerations[24] or adverse effects[17, 22, 34].
2. Included trials
Ten of the 13 reviews described themselves as “systematic”, and all but one[16] reported that a formal search strategy had been used. All reviews specified a disease setting of hormone-sensitive prostate cancer (HSPC). Eight reviews[16, 18, 20, 21, 25, 32-34] only included trials in metastatic disease (M1). One of the largest relevant trials (STAMPEDE[35, 36, 38, 39]) randomised both M1 and high-risk non-metastatic (M0) patients, but reported M1-specific results, making it eligible for most M1-only reviews. Three other reviews explicitly included trials in the high-risk[19] or locally-advanced[17, 22] non-metastatic setting, although one[17] ultimately limited its analysis to M1 due to lack of data. Only one review[23] included the STAMPEDE direct comparison of abiraterone vs docetaxel[39], published online in February 2018.
3. Included treatments
The set of included treatments varied depending upon the aims of the review. Ten reviews only included data comparing docetaxel or abiraterone plus ADT to ADT alone – reflecting the focus of clinical interest – although two such reviews[18, 19] also included data from the zoledronic acid plus docetaxel combination comparison of STAMPEDE[35], treating this as an additional docetaxel trial. The three remaining reviews permitted a wider, but varied, range of treatments (Figure 1). Although this presumably reflects deliberate choices made by review authors, only one review[21] gave an explicit justification, referring to earlier work[7] where the treatment (sodium clodronate) was considered separately due to “differences in mechanisms of action” and because it “is not commonly used in practice”. By contrast, two other treatments rarely used in recent times (estramustine phosphate and flutamide[40, 41]) were included in one review[25].
4. Included participants
Patient inclusions were necessarily governed by the reported data. The vast majority of included trials conformed to the intention-to-treat principle; the exceptions were two small, older trials[25, 40, 41] in which small numbers of patients were not analysed due to protocol deviation or non-eligibility.
Two reviews[23, 24] restricted to patients with “high volume metastatic disease” (HVD), of which one[23] additionally restricted to newly-diagnosed mHSPC; that is, patients who had not received prior therapy for prostate cancer. As STAMPEDE was considered highly clinically relevant but did not have published HVD-specific results at the time of publication, it was instead included in a sensitivity analysis. Only two reviews[19, 22] investigated patient subgroups other than M0/M1 or HVD: looking at age, performance status, Gleason score and presence of visceral metastases. Neither used the “deft” approach to testing for subgroup interactions in the meta-analytic context as recommended by Fisher et al[42].
Despite the availability of STAMPEDE results for M0 and M1 patients separately, it was not always clear that review authors extracted or analysed data consistently. For example, one review[16] specified that only M1 patients were eligible, but reported figures suggest that M0 patients were sometimes also included.
5. Included outcomes
Eleven of the 13 reviews reported overall survival (OS) results, and ten reported results on an intermediate survival outcome. Definitions of intermediate outcomes varied between trials, and were handled differently between reviews. One review[19] considered that “data on secondary outcomes … were not reported consistently enough between trials to allow for pooling of data”, while most other reviews did attempt such analysis.
Three reviews[20, 23, 25] imposed a specific definition of the intermediate outcome, resulting in fewer but possibly more comparable included trials. Another[21] specified a list of desired elements, but argued in favour of including two trials omitting one such element[12, 13] on the basis that definitions were similar enough overall to allow a clinical interpretation from the pooled result. One further review[22] appeared similar but was unclear; the others did not provide sufficient information.
6. Included results
Although three reviews explicitly stated that the most recent available trial report would be used[17, 19, 21], many reviews were inconsistent or unclear. For example, one review[18] referenced updated results for an included trial[43] but apparently used an older set of results[44] in their analysis. Updated OS results from another trial were published in a conference abstract[45], with intermediate outcome results presented at the conference itself; but only a single review[21] incorporated them. Particularly in a time-to-event context, updated results can increase power and precision by capturing additional events[46].
7. Statistical methods
A wide range of statistical methods was used. Three reviews[16, 32, 33] simply carried out pairwise meta-analyses of included treatments versus standard-of-care, with inference for indirect comparisons based upon a test of subgroup difference[47]. A more common approach, used in five reviews[17-19, 22, 24], was the “Bucher method”[48], which is applicable to three-treatment triangular networks but has been criticised for estimating a separate heterogeneity variance for each comparison[47]. Two reviews[18, 19] accommodated the “docetaxel plus zoledronic acid” comparison from STAMPEDE within this framework by treating it as an additional docetaxel comparison, reflecting a similar approach sometimes used in pairwise meta-analysis[49]. Four other reviews analysed networks of four or more treatments using multiple treatment comparison (MTC) methods, either using frequentist multivariate analysis[21] or a Bayesian framework[20, 23, 25]. Of the nine frequentist reviews, six used random-effects modelling, one[17] used common-effect modelling, one[18] used a hybrid method (see Additional file 3), and one[24] was unclear.
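For readers unfamiliar with the Bucher method: given hazard ratios for treatments A and B each compared against a common comparator C in independent trials, the adjusted indirect comparison subtracts the log hazard ratios and sums their variances. A minimal sketch in Python (the function name and all numbers used below are illustrative, not taken from any of the included reviews):

```python
import math

def bucher_indirect(hr_ac, ci_ac, hr_bc, ci_bc, z=1.96):
    """Bucher adjusted indirect comparison of A vs B via common comparator C.

    hr_ac, hr_bc: hazard ratios for A vs C and B vs C
    ci_ac, ci_bc: (lower, upper) 95% confidence intervals for those HRs
    """
    # Recover standard errors of the log hazard ratios from the 95% CIs
    se_ac = (math.log(ci_ac[1]) - math.log(ci_ac[0])) / (2 * z)
    se_bc = (math.log(ci_bc[1]) - math.log(ci_bc[0])) / (2 * z)
    # Indirect log HR; variances add because the trials are independent
    log_hr = math.log(hr_ac) - math.log(hr_bc)
    se = math.sqrt(se_ac**2 + se_bc**2)
    hr = math.exp(log_hr)
    ci = (math.exp(log_hr - z * se), math.exp(log_hr + z * se))
    return hr, ci
```

For instance, illustrative inputs of HR 0.63 (0.52–0.76) for A vs C and 0.77 (0.68–0.87) for B vs C give an indirect HR of roughly 0.82 with a 95% CI just crossing 1 – the kind of borderline result described in the comparison of primary results below.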
Due to its adaptive multi-arm design[37], multiple treatment comparisons from the STAMPEDE trial may be correlated. If a review includes such comparisons as though they were independent trials, double-counting of control arm observations may lead to inflated variances. However, only three reviews[20, 21, 23] explicitly discussed this issue, despite it being indicated in the PRISMA-NMA statement[4]. One such review[20] stated that “treatment comparisons… from the same study were modelled… with a [Bayesian] correlation prior distributed uniformly on 0–0.95”. Another[21] sought to estimate the correlations themselves using event counts by treatment arm. Both also included zoledronic acid combination arms separately from docetaxel and celecoxib alone, which added strength to the docetaxel network comparison. The remaining review[23] was alone in including direct comparison data from STAMPEDE of abiraterone vs docetaxel[39]. Despite correctly noting “differences in the period of enrolment” between the direct comparison and the original comparisons against ADT, and “uncertainty in the extent of overlap of populations for each of the comparisons”[23], they did not attempt to formally account for this, choosing instead to perform sensitivity analyses.
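To illustrate the correlation issue, one standard approximation based on arm-level event counts (in the spirit of, but not necessarily identical to, the approach taken by the review that estimated correlations from event counts) treats the variance of each arm's log hazard rate as roughly the reciprocal of its event count, so that two contrasts sharing a control arm share that arm's variance as covariance. The function name and event counts below are hypothetical:

```python
import math

def shared_control_correlation(events_a, events_b, events_c):
    """Approximate correlation between log HR(A vs C) and log HR(B vs C)
    when arms A, B and control C come from the same multi-arm trial.

    Assumes var(log hazard rate) ~ 1/events for each arm, so the shared
    control arm contributes 1/events_c to both variances and to the
    covariance between the two contrasts.
    """
    var_ac = 1.0 / events_a + 1.0 / events_c
    var_bc = 1.0 / events_b + 1.0 / events_c
    cov = 1.0 / events_c  # control-arm events are counted in both contrasts
    return cov / math.sqrt(var_ac * var_bc)
```

With equal event counts in all three arms this gives a correlation of 0.5, the familiar result for balanced multi-arm trials; analysing such contrasts as though they came from independent trials ignores this entirely.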
8. Reporting
Three reviews were reported in conference proceedings only[32-34], and a further two[16, 25] took the form of “letters to the editor”; understandably, these six reviews conformed poorly to PRISMA-NMA guidelines[4]. The eight peer-reviewed articles conformed better, albeit to varying degrees (see Additional file 6). Risk-of-bias assessment and handling of multi-arm trials were common omissions, and only two reviews[21, 22] published their protocol in advance. There was also some evidence of outcome reporting bias; for example, one review[25] presented an indirect estimate for the intermediate outcome but not for overall survival, despite evidence that both outcomes were analysed. Reporting of source data and description of statistical methodology were often poor, making it difficult to recreate the reported indirect treatment comparisons. Inconsistencies in use of source data, and minor reporting errors such as inconsistent patient or event counts, further hindered attempts to make reasonable judgments as to how such analyses might be recreated.
Comparison of primary results and of reviewers’ interpretations
Twelve of the 13 reviews analysed overall survival (OS), of which nine explicitly reported an indirect estimate of abiraterone versus docetaxel. Despite the dissimilarities described above, results were fairly similar, with HRs of around 0.80 and of borderline significance at the 5% level (Figure 3). Eight reviews drew tentative conclusions regarding an OS advantage for abiraterone over docetaxel. By contrast, three reviews[19, 24, 34] stated categorically that there was no difference in OS; the conclusions of the final review[25] were unclear. Notably, conclusions differed among three reviews including an identical set of trials: two[17, 19] stated explicitly that their analysis did not demonstrate statistical significance, whilst the third[18] stated that “despite several limitations stemming from the paucity of comparative evidence, our results favour [abiraterone] over [docetaxel]”.
Of the ten reviews which analysed an intermediate outcome, seven reported indirect estimates. Due to the variations in intermediate outcome definition, we took the results most prominently presented or described in each review (see Additional file 3). The estimates here were more varied, with HRs ranging from 0.50 to 0.84 (Figure 3). In four reviews[17, 20-22] the estimates were strongly significant at conventional levels, and this was reflected in the reviewers’ conclusions. Two reviews[23, 24] concentrated on the high-volume disease (HVD) sub-population and as such differed noticeably in terms of available power and estimated effects, appearing as outliers in Figure 3. One[23] concluded that a “positive trend” was seen both in overall survival and in the intermediate outcome, whilst the other[24] stated that “no statistically significant difference” was seen. The remaining outlying result is taken from a review[25] for which descriptions of methodology and source data were particularly limited, and we were unable to recreate their analysis.