2.1 Literature preparation
2.1.1 Classification criteria
The classification criteria took into consideration the ambiguity of title and abstract wording and the variety of study types. In the finest version of classification, i.e., classifying the literature by its nature regardless of the screeners' intent to include or exclude it in their reviews, the literature was divided into 5 categories: RCTs, randomization-unclear controlled trials (RUCTs), non-randomized clinical trials/studies (NRCTs), non-clinical literature (NC), and Unclear. The RUCTs comprised comparative studies in which the group-assignment method was not stated clearly enough. The NRCTs included clinical studies other than RCTs and RUCTs, e.g., case-control studies. The NC category included medical reviews, laboratory experiments, and other literature not related to medicine at all. The Unclear category contained literature whose titles and abstracts carried so little information that it could not be assigned to any of the aforementioned categories.
Following the finest classification, the categories were further merged for 3-category and 2-category training. The 3-category scheme comprised RCTs, may-be-RCTs, and non-RCTs: the may-be-RCTs category included both RUCTs and Unclear, while the non-RCTs category contained the literature classified as NRCTs and NC. Finally, traditional dichotomous classification was obtained by further merging the may-be-RCTs and non-RCTs categories, as sketched below.
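For illustration, this label merging can be expressed as a simple mapping. The following minimal Python sketch uses illustrative label strings for the 5 categories; the actual identifiers used in the study's pipeline are not specified.

# Illustrative mapping from 5-category labels to the merged 3- and
# 2-category labels (label strings are hypothetical identifiers).
FIVE_TO_THREE = {
    "RCT": "RCT",
    "RUCT": "may-be-RCT",
    "Unclear": "may-be-RCT",
    "NRCT": "non-RCT",
    "NC": "non-RCT",
}

THREE_TO_TWO = {
    "RCT": "RCT",
    "may-be-RCT": "non-RCT",   # dichotomous scheme merges may-be-RCTs into non-RCTs
    "non-RCT": "non-RCT",
}

def merge_labels(label_5: str) -> tuple[str, str]:
    """Derive the 3-category and 2-category labels from a 5-category label."""
    label_3 = FIVE_TO_THREE[label_5]
    label_2 = THREE_TO_TWO[label_3]
    return label_3, label_2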
2.1.2 Literature collection and labelling
Chinese literature published between Jan 1st, 2014 and Dec 31st, 2018 and indexed under 'Oral Science' in the China National Knowledge Infrastructure (CNKI) was searched and exported. Citations were excluded when key information, such as the abstract, authors, or publication information, was missing.
Literature classification was conducted separately by 2 experienced researchers (S.C. and Y.X.) using an online platform that our group developed especially for this study. The platform is designed to assist researchers with screening and label-checking tasks for Chinese literature citations, and provides citation management, screening assistance, and automatic label checking. The citations were screened by the 2 researchers independently according to the 5-category criteria. After the labelling results of the 2 researchers were compared, citations with identical labels were used as the gold standard. Disagreements were resolved by consensus after peer discussion and/or by referring to an experienced senior researcher (C.L.) for the final decision. The 3- and 2-category labels were derived from the 5-category labels according to the aforementioned classification criteria.
2.2 AI screener training
2.2.1 CNN model description
In this paper, a customized neural network architecture is proposed that uses a CNN to process text inputs (Fig. 1). The input of the model is the citation of a Chinese study, including the title and abstract, while the output is the confidence coefficient for whether the study should be labelled as each of the categories. The architecture represents words as vectors, and the input text is a concatenation of word vectors. During preprocessing, JIEBA,25 a Chinese text segmentation tool, was employed as the tokenizer for text segmentation and stop-word removal. Each citation was then processed into a sequence of words following the order in which the words appeared in the original text. Word vectors were randomly initialized and further adjusted during model training.
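A minimal preprocessing sketch in Python is given below. The stop-word list shown is a hypothetical placeholder, as the actual list used in the study was not specified.

import jieba  # JIEBA, the Chinese text segmentation tool used in this study

# Hypothetical stop-word list; the actual list used was not specified.
STOP_WORDS = {"的", "了", "和", "与", "在"}

def preprocess(citation_text: str) -> list[str]:
    """Tokenize a citation (title + abstract) and drop stop words,
    preserving the order in which words appear in the original text."""
    tokens = jieba.lcut(citation_text)
    return [t for t in tokens if t not in STOP_WORDS and t.strip()]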
Next, the i-th word corresponded to a d-dimensional word vector. A convolution operation involved a filter applied to a window of w words to produce a new feature. Multiple filters were applied to each possible window of words in the sequence, producing multiple feature maps. Subsequently, a max-pooling operation was applied over each feature map, and the maximum value was taken as the feature corresponding to that particular filter. These features formed the penultimate layer and were passed to a fully connected layer. The final output was the probability of the citation belonging to a specific category, e.g., RCT, ranging between 0 and 1; the closer the probability to 1, the more likely the citation belonged to that category, and vice versa. Cross entropy was used as the loss function, and the Adam optimizer updated the network parameters by backpropagation.
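A minimal sketch of this architecture follows, assuming PyTorch for illustration (the original implementation's framework is not stated); the vocabulary size, embedding dimension d, window sizes w, and filter counts are placeholder values, not the settings used in the study.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Sketch of the described architecture: word embeddings, parallel
    convolutions over windows of w words, max-pooling over each feature
    map, and a fully connected output layer."""

    def __init__(self, vocab_size=50000, d=128, windows=(3, 4, 5),
                 n_filters=128, n_classes=5, dropout_keep_prob=0.5):
        super().__init__()
        # Randomly initialized word vectors, adjusted during training.
        self.embedding = nn.Embedding(vocab_size, d)
        # One filter bank per window size w; each filter spans w words.
        self.convs = nn.ModuleList(
            nn.Conv1d(d, n_filters, kernel_size=w) for w in windows
        )
        self.dropout = nn.Dropout(1.0 - dropout_keep_prob)
        self.fc = nn.Linear(n_filters * len(windows), n_classes)

    def forward(self, token_ids):            # (batch, seq_len)
        x = self.embedding(token_ids)        # (batch, seq_len, d)
        x = x.transpose(1, 2)                # (batch, d, seq_len)
        # Max-pool each feature map down to one feature per filter.
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        penultimate = self.dropout(torch.cat(feats, dim=1))
        return self.fc(penultimate)          # logits; softmax -> probabilities

# Cross entropy as the loss and Adam as the optimizer, as described above:
# model = TextCNN()
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# loss_fn = nn.CrossEntropyLoss()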
2.2.2 Parameter adjustment
The model parameters were tuned on the dev set, with the learning rate initially set to 0.0005. The batch size was set to an integer power of 2, and the number of filters and the size of each convolution window were adjusted as well. The pooled results were treated with dropout, with the keep probability tuned within the range (0, 1). In addition, to prevent over-fitting of the model, an L2 penalty was applied, the weight of which was adjusted according to the accuracy on the dev set.
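For illustration only, these tuned settings can be summarized as a search space; apart from the initial learning rate, all candidate values below are assumptions, not the values reported in the study.

# Illustrative hyperparameter space matching the description above.
search_space = {
    "learning_rate": 0.0005,               # initial learning rate (as stated)
    "batch_size": [32, 64, 128, 256],      # integer powers of 2
    "n_filters": [64, 128, 256],           # number of filters per window size
    "windows": [(2, 3, 4), (3, 4, 5)],     # convolution window sizes (in words)
    "dropout_keep_prob": [0.5, 0.7, 0.9],  # keep probability tuned within (0, 1)
    "l2_lambda": [0.0, 1e-4, 1e-3],        # L2 penalty weight against over-fitting
}
# Candidate settings were compared by accuracy on the dev set.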
2.2.3 Model storage and usage
After each training process with the respective classification strategy, a stable model was obtained, and the model parameters and word vectors were saved. When a model was used, the parameters were loaded and the dropout keep probability of the fully connected layer ('dropout_keep_prob') was set to 1. The input data were then processed, and the results were the predicted probabilities of a citation belonging to each literature category. The results were saved together with the relevant citation data.
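Building on the TextCNN sketch above, model storage and reuse could look as follows; the file path and function names are hypothetical.

import torch

# After training: torch.save(model.state_dict(), "cnn_5cat.pt")

def load_screener(path: str) -> TextCNN:
    """Load a stored model for screening; 'dropout_keep_prob' is set to 1
    so dropout is disabled and predictions are deterministic."""
    model = TextCNN(dropout_keep_prob=1.0)
    model.load_state_dict(torch.load(path))
    model.eval()
    return model

def predict_probs(model: TextCNN, token_ids: torch.Tensor) -> torch.Tensor:
    """Return per-category probabilities for a batch of processed citations."""
    with torch.no_grad():
        return torch.softmax(model(token_ids), dim=1)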
2.3 Prospective validation trial
2.3.1 Sample and model preparation
In this comparative study, a sample of 1,422 citations that had been kept isolated from the screening researchers and the CNN models was used to evaluate the performance of the models trained with the different classification strategies. During model preparation, the cutting thresholds for the probability coefficients given by the CNN models were determined. In our previously published study,26 the sensitivity (SEN) and specificity (SPE) of the CNN model could be adjusted by setting different thresholds. The primary goal was to keep the screening as sensitive as possible: ideally, the sensitivity should be 1, but if such SEN could not be reached, 0.95 was the bottom line. The secondary goal was to keep the SPE as high as possible; therefore, an SPE higher than 0.8 was desired but not mandatory. The High-sensitivity Threshold was hereby determined as follows: provided the SPE can be maintained above 0.8, we first seek SEN = 1; if no such threshold exists, the standard is lowered to 1 > SEN > 0.95 with SPE still above 0.8; if no threshold meets this subordinate standard either, the High-sensitivity Threshold is set at SEN = 0.95 irrespective of how low the SPE is. Receiver-operating characteristic (ROC) curves and the area under the curve (AUC), along with the SEN and SPE at the chosen thresholds, were used to compare the models obtained with the different classification strategies.
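This threshold rule can be stated compactly in code. The sketch below assumes ROC operating points given as (threshold, SEN, SPE) tuples; breaking ties by the highest specificity is our assumption, as the selection among qualifying points is not specified.

def high_sensitivity_threshold(roc_points):
    """Select the High-sensitivity Threshold from ROC operating points,
    following the rule described above. roc_points: list of
    (threshold, sen, spe) tuples."""
    def best(candidates):
        # Assumption: among acceptable points, keep the highest specificity.
        return max(candidates, key=lambda p: p[2])[0] if candidates else None

    # 1. SEN = 1 while SPE > 0.8.
    t = best([p for p in roc_points if p[1] == 1.0 and p[2] > 0.8])
    if t is not None:
        return t
    # 2. Subordinate standard: 1 > SEN > 0.95 while SPE > 0.8.
    t = best([p for p in roc_points if 0.95 < p[1] < 1.0 and p[2] > 0.8])
    if t is not None:
        return t
    # 3. Fall back to SEN >= 0.95 irrespective of how low the SPE is.
    return best([p for p in roc_points if p[1] >= 0.95])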
It should be noted that, while the 2-category model had only one High-sensitivity Threshold (for the RCTs category), the 3- and 5-category models had 3 and 5 High-sensitivity Thresholds, respectively, one for each category. For example, when trained with the 5-category strategy, the CNN model provides 5 probabilities for a single citation, so 5 thresholds were needed to determine to which category the citation belonged. This could result in a citation being labelled with multiple category tags if the probabilities for those categories were all above the respective thresholds. Therefore, a combination of the different thresholds was needed; the workflow of this combination is shown in Fig. 2. Since the sensitivity of citation screening was the main aim, no citations were discarded until the positive selection was completed. The first screening was for RCTs, and the positive results were kept. Then, the RUCTs and Unclear categories were screened from the remaining results and kept aside. Finally, screening for NRCTs and NC was performed on the remainder, where the negative results (Ambiguous) were kept and the positive ones were discarded.
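A sketch of this threshold-combination workflow follows; the category keys and data structures are illustrative assumptions.

def cascade_screen(citations, probs, thresholds):
    """Sketch of the 5-category threshold-combination workflow (Fig. 2).
    probs[c][cat] is the model probability of citation c for category cat;
    thresholds[cat] is that category's High-sensitivity Threshold.
    Returns the citations retained by the screening."""
    kept, remaining = [], list(citations)

    # Positive selection in priority order: no citation is discarded
    # until all positive selection is completed.
    for cat in ("RCT", "RUCT", "Unclear"):
        positives = [c for c in remaining if probs[c][cat] >= thresholds[cat]]
        kept.extend(positives)
        remaining = [c for c in remaining if c not in set(positives)]

    # Finally screen the remainder for NRCTs and NC: positives are
    # discarded, negatives ("Ambiguous") are kept.
    ambiguous = [
        c for c in remaining
        if probs[c]["NRCT"] < thresholds["NRCT"]
        and probs[c]["NC"] < thresholds["NC"]
    ]
    return kept + ambiguous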
2.3.2 Screening performance validation
The 1,422 samples underwent both manual and CNN model screening. The same independent screening process was performed by the 2 experienced researchers, and the final classification labels agreed upon after discussion were used as the gold standard. The independent results of both researchers prior to discussion were taken as the performance of manual screening. The fully prepared CNN models performed the screening according to the aforementioned thresholds and workflow. Finally, SEN and SPE were reported with 95% confidence intervals (95% CI).
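As an illustration of this final reporting step, SEN and SPE with 95% CIs can be computed from the confusion counts; a Wilson score interval is used below as an assumption, since the CI method applied in the study is not stated.

from math import sqrt

def sensitivity_specificity_ci(tp, fn, tn, fp, z=1.96):
    """Compute SEN and SPE with 95% CIs from confusion counts.
    A Wilson score interval is used here for illustration."""
    def wilson(successes, n):
        p = successes / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return p, (centre - half, centre + half)

    sen = wilson(tp, tp + fn)  # true positives among all actual positives
    spe = wilson(tn, tn + fp)  # true negatives among all actual negatives
    return sen, spe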