In this study, we aimed to train (Stage 1), calibrate (Stage 2) and evaluate (Stage 3) a binary ML classifier (‘the classifier’) designed to reduce study identification workload in maintaining the CCSR, with an acceptably low attendant risk of ‘missing’ records of ‘included’ (eligible) studies. We therefore needed to assemble three separate data sets from the CCSR screening workflows (see below and ‘Availability of data and materials’).
Training (Stage 1)
In Stage 1, we assembled a training data set containing bibliographic title-abstract records of all study reports (articles) manually screened for eligibility for the CCSR from its first search date (20th March 2020) up until 18th October 2020. Embase.com records had only recently been added to the CCSR's sources by mid-October, and a backlog of medRxiv preprints was still being processed. As the CCSR's other sources were trial registers (not bibliographic title-abstract records), most of the training set records were from PubMed. These records had originally been identified using conventional Boolean searches of selected electronic bibliographic databases and trials registries, before being manually screened and labelled as either ‘included’ (eligible for the CCSR) or ‘excluded’ (ineligible) by Cochrane information specialists or the Cochrane Crowd6. After removing trials registry records, we were left with 59,513 records, of which 20,878 were labelled as ‘included’ in the CCSR, and 38,635 were ‘excluded’. These records were imported into EPPI-Reviewer10, assigned to code sets, and used to train a logistic regression classifier using tri-gram ‘bag of words’ features, implemented in the SciKit-Learn Python library, with ‘included’ records designated as the positive class (class of interest) and ‘excluded’ records as the negative class.
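The Stage 1 training step can be sketched roughly as follows. This is a minimal illustration, not the deployed EPPI-Reviewer implementation: the example records are invented stand-ins for the 59,513 title-abstract records, and the exact vectoriser settings (including whether the tri-gram features span uni- to tri-grams, as assumed here) are not reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for title-abstract records and their manual labels
# (1 = 'included', the positive class; 0 = 'excluded').
texts = [
    "randomised trial of remdesivir in hospitalised covid-19 patients",
    "review of influenza vaccination uptake in older adults",
]
labels = [1, 0]

classifier = make_pipeline(
    # 'Bag of words' counts over uni-, bi- and tri-grams (ngram_range
    # is an assumption; only 'tri-gram' is stated in the text).
    CountVectorizer(ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(texts, labels)

# Predicted probabilities for the positive class, rescaled to the
# 0-100 score range referred to in Stage 2.
scores = classifier.predict_proba(texts)[:, 1] * 100
```

A classifier trained this way assigns each unseen record a score, which is the quantity thresholded in the calibration stage below.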
Calibration (Stage 2)
In Stage 2, we assembled a calibration data set containing 16,123 similar records of study reports manually screened for eligibility for the CCSR between 19th October and 2nd December 2020, again labelled as ‘included’ (6,005 eligible records) or ‘excluded’ (10,118 ineligible records) by the same people and process, and with trials registry records having again been removed. The records were imported into EPPI-Reviewer, assigned to code sets, and used to calibrate the classifier developed in Stage 1. Specifically, we applied the classifier to the 16,123 calibration records, which automatically assigned a score (0-100) to each record. We then computed the threshold score that captured > 99% of the ‘included’ records present in this data set (i.e. recall > 0.99). A recall of 0.99 is the threshold level currently required for ML classifiers to be deployed in Cochrane systems and workflows10. We also computed standard performance metrics, namely: (cumulative) recall, (cumulative) precision and net workload reduction.
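The Stage 2 threshold search can be sketched as below: because recall over the positive class can only fall as the threshold rises, the calibrated cut-off is the highest score at which recall still exceeds 0.99. The function name and the synthetic score distributions are illustrative assumptions, not part of the study's workflow.

```python
import numpy as np

def threshold_for_recall(scores, labels, target_recall=0.99):
    """Return the highest integer threshold (0-100) at which recall
    over the positive class (labels == 1) still exceeds target_recall."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    n_included = (labels == 1).sum()
    # Scan from the strictest threshold downwards; the first threshold
    # that captures enough 'included' records is the highest valid one.
    for t in range(100, -1, -1):
        captured = ((scores >= t) & (labels == 1)).sum()
        if captured / n_included > target_recall:
            return t
    return 0

# Illustrative calibration data: 'included' records tend to score
# higher than 'excluded' records, with some overlap.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.uniform(40, 100, 500),   # 'included'
                         rng.uniform(0, 60, 500)])    # 'excluded'
labels = np.concatenate([np.ones(500, int), np.zeros(500, int)])
t = threshold_for_recall(scores, labels)
```

Records scoring below the calibrated threshold are the candidates for removal from manual screening, which is what the net workload reduction metric quantifies.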
Evaluation (Stage 3)
In Stage 3, we assembled an evaluation data set of similar records of study reports manually screened for eligibility for the CCSR between 4th and 19th January 2021, once again labelled as ‘included’ (2,310 eligible records) or ‘excluded’ (2,412 ineligible records), with trials registry records removed. The records were imported into EPPI-Reviewer, assigned to code sets, and used to evaluate the classifier developed in Stage 1. Specifically, we applied the classifier to the 4,722 evaluation records, identified the ‘included’ and ‘excluded’ records scoring above and below the threshold score we had computed in Stage 2, and then computed (cumulative) recall, (cumulative) precision and net workload reduction. We also analysed characteristics of ‘included’ study reports that would have been ‘missed’ by the workflow if the classifier had been implemented.
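The Stage 3 metrics can be sketched as follows. This assumes ‘net workload reduction’ means the share of records scoring below the threshold (i.e. records that would no longer need manual screening); the function and the toy inputs are illustrative, not the study's own code.

```python
import numpy as np

def evaluate(scores, labels, threshold):
    """Recall, precision and net workload reduction at a given
    threshold, with labels == 1 marking 'included' records."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    flagged = scores >= threshold            # records kept for screening
    tp = (flagged & (labels == 1)).sum()     # 'included' records captured
    recall = tp / (labels == 1).sum()
    precision = tp / flagged.sum()
    # Assumed definition: fraction of records screened out entirely.
    workload_reduction = 1 - flagged.mean()
    return recall, precision, workload_reduction

# Toy example: five records, threshold of 50.
recall, precision, workload_reduction = evaluate(
    scores=[90, 80, 70, 20, 10],
    labels=[1, 1, 0, 1, 0],
    threshold=50,
)
```

In this toy example two of the three ‘included’ records score above the threshold, so recall is 2/3, and two of the five records are screened out, so the net workload reduction is 0.4; the ‘included’ record scoring 20 is the kind of ‘missed’ study report whose characteristics are analysed above.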
Finally, we compared key characteristics of study reports across the three data sets described in this section (training, calibration, evaluation), to check post hoc that they comprised sufficiently similar sets of records to validate our results from calibrating and evaluating the classifier.