The methodology of this study was pre-specified in a protocol, which was made available online prior to the start of the study [10]. The final report is given in line with the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) [11].
Rater selection
Each investigator (SM, SR, IM and VY) selected one independent rater from a cohort of academics based on the following criteria (to the best of each investigator’s knowledge):
- Knowledge of research methodology;
- Potential and/or demonstrated past interest in conducting systematic reviews of clinical trials;
- Independence from each other and from the investigators (e.g. no joint publication listed in PubMed or other known prior academic collaboration);
- Positive response to the written invitation to participate as a rater.
Of the potential raters contacted, the first who agreed to participate were selected; hence, a total of four independent raters participated in this study. Each investigator (SR, IM and VY) revealed the identity of their chosen rater to the principal investigator (SM) only and remained unaware of each other's rater selections until all ratings had been completed.
The number of raters was chosen in accordance with a similar study assessing the inter-rater reliability of the CQS-1, published elsewhere [8]. Rater selection was quasi-random; that is, although no selection according to a random sequence was conducted, each rater's acceptance to participate was left to chance. Raters were free to accept or decline a once-off written invitation, without any further effort by the investigators to secure study participation.
Rater blinding
To ensure rater independence, no rater interaction took place during the rating process, thus avoiding any interaction effect on the results. The raters remained unaware of each other's participation in this study until all ratings had been completed. However, in order to investigate the use of the CQS-2 under conditions as close as possible to the practical routine of trial appraisal, the raters were not blinded to the references of the trial reports, the author names and affiliations, the acknowledgements, or the funding sources. In addition, to obtain the raters' informed consent regarding their participation, they were informed about the full content of the study protocol. Hence, each rater was aware that their judgment would be compared with those of the other raters.
Sample size calculation
The number of required trial reports was calculated based on a minimum expected agreement between raters of 70% and a 95% confidence interval (CI) width of 15%, using the appropriate formula for sample size calculation: N = 1/E² (with N = number of required articles and E = confidence interval width) [12]. In line with the applied sample size calculation method, a minimum of 1/0.15² = 44.4 trial reports, rounded up to 45, was required.
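For illustration only, this calculation can be expressed as a minimal sketch (the function name is hypothetical and not part of the cited method [12]):

```python
import math

# Sample size for an agreement study: N = 1 / E^2, where E is the desired
# 95% confidence interval width expressed as a proportion (here 15% = 0.15).
def required_reports(e: float) -> int:
    return math.ceil(1 / e ** 2)  # round up to the next whole trial report

print(required_reports(0.15))  # 45
```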
Trial report selection
All 45 trial reports were selected from PubMed; the references are listed in the S1 Additional file. The database was searched by the principal investigator (SM) using the search term ‘prospective AND clinical AND controlled AND trial’ with the set limits: ‘Abstract’ and ‘Free full text’ [Text availability], ‘Clinical trial’ [Article type], ‘From 2022/1/1 to 2022/05/31’ [Publication date] and ‘Best match’ [Display options]. Citation abstracts were checked to determine whether they described a prospective, controlled clinical trial published in the English language. Trials were quasi-randomly selected by choosing the first 45 relevant citations from the resulting search list; trial protocols and trials published in languages other than English were excluded.
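For illustration, a broadly equivalent search can be run programmatically against the NCBI E-utilities ESearch endpoint. This is a hedged sketch, not the study's actual procedure: the study used the PubMed web interface, and the mapping of the web limits to search-field tags and to 'sort=relevance' (Best match) is an approximation:

```python
import requests

# Approximate programmatic equivalent of the PubMed web search described
# above; the field tags and sorting are an assumed mapping of the web limits.
params = {
    "db": "pubmed",
    "term": ("prospective AND clinical AND controlled AND trial "
             "AND free full text[sb] AND clinical trial[pt]"),
    "datetype": "pdat",      # filter by publication date
    "mindate": "2022/01/01",
    "maxdate": "2022/05/31",
    "sort": "relevance",     # corresponds to PubMed's 'Best match'
    "retmax": 100,           # fetch enough hits to screen the first 45
    "retmode": "json",
}
r = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
                 params=params, timeout=30)
pmids = r.json()["esearchresult"]["idlist"]
print(len(pmids), pmids[:5])  # candidate citations in best-match order
```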
Trial rating process
The raters had no extensive expertise in the conduct of systematic reviews of randomised controlled trials. One rater was an epidemiologist and statistician with eight years’ experience; two were dentists employed at academic institutions with 2–3 years of work experience (one of whom had two years’ experience in bias risk assessment); and one was a statistician with 25 years of experience, including experience in bias risk assessment but not in the use of trial appraisal tools during systematic reviews.
The raters’ content knowledge of the trials was not assessed; however, owing to the quasi-random nature of the trial selection, it was assumed to be slight. No calibration or training in the use of either CQS version was carried out. All raters received the study protocol [10] as their only source of information on how to apply the CQS-1 and CQS-2.
Each rater received from the principal investigator (SM), via email, a download link for the 45 trial reports. An MS Excel assessment template was prepared for each CQS version in line with the published specifications for each appraisal method [8, 9]. Each rater received only one template at a time, in a random sequence for CQS-1 and CQS-2 rating. The random sequence (S1 Additional file) was generated using block randomisation (block size = 2) across a total of eight rating events. Raters entered their rating results into the template and returned it to the principal investigator via email. Each rater received the next template two weeks after submitting the completed previous one.
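Under one interpretation of this design (one block of size 2 per rater, so that each rater receives the CQS-1 and CQS-2 templates in random order across the 4 × 2 = 8 rating events), the sequence could be generated as in the following sketch; the rater labels and seed are hypothetical, and the actual sequence used is given in the S1 Additional file:

```python
import random

random.seed(2022)  # hypothetical seed, for reproducibility of the sketch only

raters = ["Rater A", "Rater B", "Rater C", "Rater D"]  # hypothetical labels
sequence = {}
for rater in raters:
    block = ["CQS-1", "CQS-2"]  # one block of size 2 per rater
    random.shuffle(block)       # randomise the template order within the block
    sequence[rater] = block     # first entry = first template sent

for rater, order in sequence.items():
    print(rater, "->", " then ".join(order))
```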
The Composite Quality Score (CQS)
The CQS includes: (i) binary rating of a trial report per appraisal criterion (scores: 0 = invalid/falsified, 1 = corroborated); (ii) multiplication of the individual rating scores into an overall appraisal score; and (iii) identification of invalid/falsified trial reports based on a zero overall appraisal score.
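A minimal sketch of this scoring logic (the function name is hypothetical):

```python
from math import prod

def cqs_overall_score(criterion_scores: list[int]) -> int:
    """Multiply the binary per-criterion scores (0 or 1) into one overall score."""
    return prod(criterion_scores)

# A trial failing any single criterion receives a zero overall score and is
# thereby identified as invalid/falsified.
print(cqs_overall_score([1, 1, 1, 1]))  # 1 -> corroborated on all criteria
print(cqs_overall_score([1, 0, 1, 1]))  # 0 -> invalid/falsified
```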
CQS-1
The CQS-1 was originally developed as a composite of two trial appraisal categories for systematic and random error [8]. For each category, the following criteria were set:
(a) Systematic error (Randomisation)
Criterion I
‘Randomisation’ for allocation to treatment groups is in some form reported in the text (Yes = 1 / No = 0);
Criterion II
Concealment of the random allocation is in some form reported in the text (Yes = 1 / No = 0).
(b) Random error (Sample size)
Criterion III
The sample size of any particular treatment group reported in the trial report is not less than N = 200 (Yes = 1 / No = 0).
The minimum sample size limit (N) was calculated using the formula N = {[P1 × (100 − P1) + P2 × (100 − P2)] / (P2 − P1)²} × f(α,β) [13] and was based on the assumption that the difference in intervention effect between study groups (P1 − P2) is not less than 10%, with α = 5% and β = 20%, that is, f(α,β) = 7.9 [14].
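The underlying event rates P1 and P2 are not stated here; for illustration only, assumed rates of P1 = 90% and P2 = 80% (a 10% difference) reproduce the stated minimum of roughly 200 participants per group:

```python
# Minimum group sample size:
# N = {[P1(100 - P1) + P2(100 - P2)] / (P2 - P1)^2} * f(a, b),
# with P1 and P2 as percentages and f(a, b) = 7.9 for alpha = 5%, beta = 20%.
def min_group_size(p1: float, p2: float, f_alpha_beta: float = 7.9) -> float:
    return ((p1 * (100 - p1) + p2 * (100 - p2)) / (p2 - p1) ** 2) * f_alpha_beta

print(min_group_size(90, 80))  # 197.5 -> a limit of about N = 200 per group
```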
CQS-2
The CQS-2 is an update of the CQS-1 and is based on a systematic review with meta-analysis of meta-epidemiological study evidence concerning the lack of trial design characteristics associated with over- or under-estimation of the correct effect estimate due to systematic error alone [9]. In contrast to the CQS-1, the CQS-2 does not include a category for random error. The following criteria were set:
Criterion I
‘Randomisation’ for allocation to treatment groups is in some form reported in the text (Yes = 1 / No = 0);
Criterion II
The following are in some form reported in the text (Yes = 1 / No = 0):
- Keeping the random allocation sequence in a locked computer file; and
- Translation of the sequence into identical, coded, serially administered containers and/or sealed, opaque envelopes; and
- Reassurance that the person who generated the sequence did not administer it;
Criterion III
Double-blinding or the blinding of at least two of the three groups (trial participants, trial personnel, and trial outcome assessors) is in some form reported in the text (Yes = 1 / No = 0); and
Criterion IV
The sample size of any particular treatment group reported in the trial is not less than N = 100 (Yes = 1 / No = 0).
Statistical analysis
The inter-rater reliabilities for the overall appraisal scores of the CQS-1 and the CQS-2 were established by use of the Brennan-Prediger coefficient (BPC) [12]. As in a previous study [8], this study did not use Cohen's kappa for the inter-rater reliability analysis. Cohen's kappa is affected by a paradox that biases the statistic: situations of high inter-rater agreement can produce low kappa values. Hence, Cohen's kappa is increasingly being replaced by newer coefficients, such as the BPC, which does not suffer from this shortcoming [12].
The BPCs of both CQS versions were compared using the two-sample z-test. All data analyses were carried out using SAS statistical software [15]. A 5% significance level was used.
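For intuition, the BPC for the simplest case of two raters and q = 2 categories takes the form BPC = (pa − pe)/(1 − pe), with chance agreement fixed at pe = 1/q. The following sketch illustrates this two-rater form and the z-test comparison; the study itself analysed four raters in SAS, and the hypothetical standard errors below are for illustration only, not the variance formula used in the study:

```python
import math

def brennan_prediger(ratings_a: list[int], ratings_b: list[int], q: int = 2) -> float:
    """BPC = (pa - pe) / (1 - pe), with chance agreement fixed at pe = 1/q."""
    n = len(ratings_a)
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n  # observed agreement
    pe = 1 / q                                                  # chance agreement
    return (pa - pe) / (1 - pe)

ratings_a = [1, 0, 1, 1, 0, 1]  # hypothetical binary ratings, rater A
ratings_b = [1, 0, 1, 0, 0, 1]  # hypothetical binary ratings, rater B
print(round(brennan_prediger(ratings_a, ratings_b), 2))  # pa = 5/6 -> BPC = 0.67

# Two BPCs can be compared with a two-sample z-test:
bp1, se1 = 0.85, 0.06  # hypothetical coefficient and standard error, CQS-1
bp2, se2 = 0.70, 0.08  # hypothetical coefficient and standard error, CQS-2
z = (bp1 - bp2) / math.sqrt(se1 ** 2 + se2 ** 2)
print(round(z, 2))  # compared against the 5% significance level
```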
During the secondary analysis, the BPC for each single criterion and each corroboration level was established for both CQS versions. The corroboration levels indicate the number of consecutive criteria with which a trial has complied (e.g. level C2 indicates compliance with Criteria I and II; level C3 with Criteria I, II and III, etc.). Once a criterion has been rated with a 0-score, the corroboration level remains unchanged even if a subsequent criterion is rated with a 1-score, for example corroboration level C2: Criteria I and II = 1-score, Criterion III = 0-score, Criterion IV = 1-score [5].
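A minimal sketch of this corroboration-level logic (the function name is hypothetical):

```python
def corroboration_level(criterion_scores: list[int]) -> str:
    """Count consecutive 1-scores from Criterion I onward; stop at the first 0."""
    level = 0
    for score in criterion_scores:
        if score == 0:
            break  # later 1-scores no longer raise the level
        level += 1
    return f"C{level}"

# Example from the text: Criteria I and II = 1-score, Criterion III = 0-score,
# Criterion IV = 1-score -> corroboration level C2.
print(corroboration_level([1, 1, 0, 1]))  # C2
```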