Pythia takes a set of natural-language questions, derived by splitting the review's key questions (i.e., research questions) into their component phrases, and uses modern semantic text-encoding approaches to represent the text of the citations in a form that can be used by deep neural networks.
Pythia returns the top-ranked citations for each question, amounting to a total of 100 citations (e.g., with two questions it selects the top 50 for each; with 10 questions it selects only the top 10 for each). A citation can appear in the results for more than one question. A human then screens the 100 citations, annotating, in the abstracts deemed relevant, the specific terms that indicate relevance, organized by the corresponding aspect of the eligibility criteria (see Fig. 1 for an example of an annotated abstract). Based on these screening decisions and term annotations, Pythia refines its search and returns the next 100 top-ranked unscreened citations to be screened and tagged, continuing until convergence is achieved (i.e., no further relevant abstracts are retrieved) or a set number of iterations (batches) is reached. For this project, we limited each review to 10 batches.
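To make the batching behavior concrete, the following is a minimal sketch, in Python, of how the per-question quota described above could be computed; it is illustrative rather than Pythia's actual implementation, and rank_citations is a hypothetical placeholder for the ranking model.

```python
# Minimal sketch (not Pythia's actual code) of splitting the 100-citation batch
# evenly across the review's questions. rank_citations is a hypothetical
# placeholder that returns citations ordered by relevance for one question.

BATCH_SIZE = 100

def select_batch(questions, rank_citations, already_screened):
    """Take the top-ranked unscreened citations for each question, 100 in total."""
    per_question = BATCH_SIZE // len(questions)  # e.g., 2 questions -> 50 each, 10 -> 10 each
    batch = set()                                # a citation may rank for more than one question
    for question in questions:
        ranked = [c for c in rank_citations(question) if c not in already_screened]
        batch.update(ranked[:per_question])
    return batch
```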
Creating the Dataset
The literature collection searched by Pythia was constructed from the metadata of all PubMed records (ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/). For each citation, we concatenated the title and abstract and indexed the resulting text, along with the publication year of the paper. We discarded any citation without an abstract, leaving a subset of approximately 21 million articles (of the original 31 million articles in PubMed).
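As a rough illustration of this preprocessing step, the snippet below assumes each PubMed record has already been parsed into a Python dict with pmid, title, abstract, and year fields; these field names are illustrative and do not reflect the actual baseline XML schema.

```python
# Illustrative sketch of corpus construction: concatenate title and abstract,
# keep the publication year, and drop citations without an abstract.

def build_corpus(records):
    corpus = {}
    for rec in records:
        abstract = (rec.get("abstract") or "").strip()
        if not abstract:                       # discard citations without an abstract
            continue
        corpus[rec["pmid"]] = {
            "text": f'{rec.get("title", "")} {abstract}'.strip(),  # title + abstract
            "year": rec.get("year"),
        }
    return corpus
```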
Selecting the Citations for Screening
Pythia uses the natural-language questions provided by the review team to produce a list of 100 potentially relevant citations. This set is screened by a human, who annotates each citation as relevant or irrelevant and tags words or phrases as indicative of relevance based on the eligibility criteria, using the Population, Intervention, Comparator, Outcome (PICO) framework. Pythia then extracts a set of positive key phrases from the articles annotated as included and a set of negative key phrases from the articles annotated as rejected. Each key phrase is executed as a search, and Pythia retrieves 200 articles using the BM25 retrieval algorithm (10). Any abstracts that have been previously retrieved are excluded, and the remaining articles are ranked using a deep-learning model designed to retrieve both citations and snippets (11). Pythia penalizes the score of each article containing a negative key phrase and increases the score of any article containing a positive key phrase, thus taking into account the feedback provided by users. Finally, Pythia returns the top 100 citations for evaluation by the human screener.
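The retrieve-and-re-rank step can be sketched roughly as follows; bm25_search, neural_score, and text_of are hypothetical stand-ins for the BM25 index, the deep-learning re-ranker, and an abstract-text lookup, and the boost and penalty weights are invented for illustration rather than taken from Pythia.

```python
# Schematic sketch (not Pythia's actual code) of key-phrase retrieval followed by
# neural re-ranking, with a boost for positive and a penalty for negative key phrases.

RETRIEVE_K = 200   # articles retrieved per key-phrase query
RETURN_K = 100     # citations returned for human screening

def rank_candidates(pos_phrases, neg_phrases, bm25_search, neural_score, text_of, seen):
    # Gather candidates: each key phrase is run as a BM25 query.
    candidates = set()
    for phrase in pos_phrases + neg_phrases:
        for pmid in bm25_search(phrase, k=RETRIEVE_K):
            if pmid not in seen:                      # exclude previously retrieved abstracts
                candidates.add(pmid)

    # Re-rank with the neural model, adjusting scores using the screener's feedback.
    scored = []
    for pmid in candidates:
        text = text_of(pmid).lower()
        score = neural_score(pmid)
        if any(p.lower() in text for p in pos_phrases):
            score *= 1.5                              # boost: shares a positive key phrase
        if any(p.lower() in text for p in neg_phrases):
            score *= 0.5                              # penalty: shares a negative key phrase
        scored.append((score, pmid))

    scored.sort(reverse=True)
    return [pmid for _, pmid in scored[:RETURN_K]]
```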
Experimental Setup
The first 100 articles are exported and screened manually, with the human screener adjudicating the relevance of each abstract and, in abstracts deemed potentially relevant, tagging keywords that indicate relevance by the specific aspect of the eligibility criteria. These “curated” articles are used by Pythia to refine the search and re-rank the full corpus of abstracts. Pythia then exports a new set of 100 top-ranked abstracts to be screened and tagged. For this project, we limited each review to 10 cycles of article export and manual screening, representing 1,000 citations screened per review. Because the same citation might be identified (and screened) for more than one key question, the total number of unique citations screened ranged from 800 to 1,000 across projects.
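A rough sketch of the export-screen-refine cycle described above is given below, reusing the hypothetical select_batch helper from the earlier sketch; screen_and_tag and refine_ranking are invented placeholders for the human screening step and Pythia's re-ranking, not actual Pythia functions.

```python
# Rough sketch of the iterative screening cycle: export a batch, screen it manually,
# then feed the curated articles back to the ranker before the next export.

MAX_BATCHES = 10   # 10 cycles of 100 citations -> up to 1,000 screening decisions per review

def run_review(questions, rank_citations, screen_and_tag, refine_ranking):
    screened = set()
    for _ in range(MAX_BATCHES):
        batch = select_batch(questions, rank_citations, screened)
        decisions = screen_and_tag(batch)      # relevance labels plus PICO term tags
        screened |= batch
        if not any(decisions.values()):        # convergence: no new relevant abstracts
            break
        refine_ranking(decisions)              # curated articles refine the next ranking
    return screened                            # unique citations screened (at most 1,000)
```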
Evaluation
To evaluate Pythia, we selected a convenience sample of seven systematic reviews on a variety of clinical topics undertaken by the Brown Evidence-based Practice Center in the last five years (12–18). When the project began, three reviews were completed and four were ongoing. By the time of the final analysis, all reviews were completed, although one was still undergoing peer review (15). All reviews followed the methods set out in the EPC Methods Guide (19). The search strategies for all seven reviews were developed by a trained medical librarian and peer reviewed using the 2015 PRESS assessment form (20). After a series of pilot rounds to ensure consistency, the review team double-screened all retrieved citations for relevance.
In three of these reviews, Pythia was used prospectively, with a human annotator screening 10 batches of 100 citations each. Pythia re-ranked all articles in the database after each batch was screened and provided another set of 100 for screening.
In the other four reviews, Pythia was evaluated retrospectively, with annotation performed automatically based on the labels assigned to the abstracts during the original review's manual screening. This created a list of known positive and negative articles. To estimate the importance of the human-generated PICO tags used in the prospective annotation, the machine automatically extracted "key phrases" from relevant and irrelevant articles for use in the analysis. We examined three settings, namely “no key phrases”, “positive key phrases only”, and “positive and negative key phrases” (see the sketch after the list below).
- In the “no key phrases” setting, Pythia did not use key phrases when ranking articles.
- In the “positive key phrases only” setting, Pythia used only the positive key phrases (extracted from known relevant citations), increasing the score of any article that shared a key phrase with a known positive article.
- In the “positive and negative key phrases” setting, Pythia additionally used the negative key phrases (extracted from known irrelevant citations), penalizing any citation that shared a key phrase with a known negative article while increasing the score of any citation that shared a key phrase with a known positive article.
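Expressed as code, the three settings amount to toggling the positive boost and the negative penalty in the scoring step; the sketch below uses invented weights and a hypothetical base_score, so it illustrates the logic rather than Pythia's actual parameters.

```python
# Hedged sketch of the three retrospective settings as variations of one scoring rule.

POS_BOOST, NEG_PENALTY = 1.5, 0.5   # illustrative weights, not Pythia's actual values

def adjusted_score(base_score, has_pos_phrase, has_neg_phrase, setting):
    score = base_score
    if setting in ("positive key phrases only", "positive and negative key phrases"):
        if has_pos_phrase:
            score *= POS_BOOST       # reward overlap with known relevant articles
    if setting == "positive and negative key phrases":
        if has_neg_phrase:
            score *= NEG_PENALTY     # penalize overlap with known irrelevant articles
    return score                     # "no key phrases" leaves the base score unchanged
```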
With this design, we may have missed relevant articles that were not included in the provided list (i.e., articles the reviewers of the original review never saw). Therefore, we expect that the retrospective scores would improve with human inspection. The studies screened as relevant were compared with the studies with PubMed identifiers (PMIDs) included in the final report. Any studies screened in through the prospective process but not identified by the original report's searches were assessed for eligibility by the original report's primary investigator. None was found to be eligible.
Performance Measures
For each systematic review, the final included citations that had a PMID (indicating that they could be found in PubMed) were considered the reference standard (T). This set was divided according to whether a citation was identified by Pythia (TP) or not (FN). The citations identified by Pythia (P) were divided according to whether they were included in the final report (TP) or not (FP). We omitted the number of citations correctly rejected (TN) because this number is extremely large (the source set included approximately 21 million citations) and is not of particular interest.
We were interested in two dimensions of classification performance: workload and sensitivity (i.e., recall). Sensitivity was defined as the proportion of truly relevant citations identified by Pythia (TP/T). To measure workload, we defined precision as the proportion of screened citations that were relevant (TP/P). To make this number more intuitive, we used the NNR, defined as the number of citations the reviewer had to screen for each relevant citation found (1/Precision). Our aim was to maximize sensitivity while minimizing workload.
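For clarity, the sketch below simply restates the definitions above in code and works through them with made-up counts; none of the numbers correspond to results from the reviews.

```python
# Worked sketch of the performance measures, using the confusion-matrix counts
# defined above (TP, FN, FP); the counts in the example call are invented.

def screening_metrics(tp, fn, fp):
    t = tp + fn                  # reference standard: all relevant citations (T)
    p = tp + fp                  # all citations identified by Pythia (P)
    sensitivity = tp / t         # TP / T
    precision = tp / p           # TP / P
    nnr = 1 / precision          # citations screened per relevant citation found
    return sensitivity, precision, nnr

# Example with made-up counts: 40 relevant found, 10 missed, 960 irrelevant screened.
print(screening_metrics(tp=40, fn=10, fp=960))   # -> (0.8, 0.04, 25.0)
```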
For comparison, we report precision and NNR for the manual screening process, with precision defined as the number of relevant articles included in the final report as a proportion of the total number of articles retrieved by the PubMed searches for each review. This calculation does not account for double-screening, so each abstract was counted only once. By definition, the sensitivity of the manual screening process is 100%. Where P values are reported, they were calculated using the Fisher exact test.
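Where a 2x2 comparison of screening outcomes is available, a P value from the Fisher exact test can be obtained as in the sketch below; the counts are invented, and this table layout is only one plausible way the comparison might be framed, not necessarily the contingency table used in the analysis.

```python
# Illustrative Fisher exact test on a made-up 2x2 table of screening outcomes.

from scipy.stats import fisher_exact

#                relevant, irrelevant
pythia_counts = [40, 960]       # hypothetical counts for Pythia-assisted screening
manual_counts = [50, 4950]      # hypothetical counts for manual screening

odds_ratio, p_value = fisher_exact([pythia_counts, manual_counts])
print(f"odds ratio = {odds_ratio:.2f}, P = {p_value:.4f}")
```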