In this section, we present an overview of the different IR methodologies in the literature, dividing the discussion into two subsections: clinical IR tools and methodologies. The clinical IR tools subsection focuses on IR tools and systems that have been developed or implemented for clinical IR. The methodologies subsection discusses querying, indexing, and ranking methodologies that have been proposed and evaluated in the field of clinical IR.
5.3.1 Clinical Information Retrieval Tools
Traditionally, SQL-based searching or querying systems were used to build clinical IR systems, but these were not effective in searching the highly unstructured free-text EHR data (19). Consequently, advanced clinical IR tools are now being developed using more modern search engine techniques.
IR Tools:
Lucene is a Java-based IR tool that provides a set of APIs for building full-text search on documents (20). It includes tools for indexing, searching, and ranking documents, as well as support for various query types, such as Boolean query searches. Lucene is widely used as the foundational tool for building custom search applications and is also used as the core search engine in many commercial products.
Solr is an open-source enterprise search platform built on top of Lucene (21). It provides a standalone server that can be used to index and search large collections of documents, as well as a rich set of features for managing and scaling search applications, including support for distributed search and faceted navigation. Solr is commonly used to build search applications for websites, intranets, and other large-scale systems.
Elasticsearch is an open-source full-text search engine that provides a distributed indexing system on top of Lucene. Many clinical IR systems have been developed leveraging Elasticsearch, some of which are as follows. Researchers from the Mayo Clinic developed a distributed infrastructure with two Hadoop clusters to process HL7 messages into an Elasticsearch index. This index could provide high-speed text searching (0.2 s per query) over a dataset of 25 million HL7-derived JSON documents (22). SigSaude is another platform that integrated patient information from the student-run clinics of the Federal University of Rio Grande do Norte; the platform was built on top of an Elasticsearch index, and the data views were created using Kibana (23).
Lemur is a research project focused on developing IR and natural language processing techniques for use in large-scale search applications (24). It includes tools for indexing, searching, and evaluating the performance of IR systems, as well as support for a variety of advanced features such as query expansion and language modeling. Lemur is primarily used as a research platform and is not as widely used as Lucene, Solr, or Elasticsearch in commercial applications.
IR Systems:
Essie is a concept-based search engine developed by NIH, with concept-based query expansion and probabilistic relevancy ranking (25, 26). Lucene-based search engines have long been used for clinical IR and patient cohort detection (27, 28). Yadav et al. proposed a system based on a modified Apache Lucene ranking algorithm, with a feedback mechanism driven by the number of clicks and likes/dislikes recorded for the search results (29).
EMERSE (Electronic Medical Record Search Engine), launched in 2005, is one of the earliest non-commercial EMR search engines. EMERSE supports free-text queries and has been used by many hospital systems. Researchers from the University of Michigan documented how EMERSE has been used in their hospital system, enabling the retrieval of information for clinicians, administrators, and clinical or translational researchers (30). EMERSE uses clinical narratives and may not be the best search engine if queries involve structured electronic health record data such as demographic information or lab tests. EMERSE has been successfully used in screening clinical notes to identify patient cohorts, such as to identify glaucoma patients with poor medication compliance (31).
CogStack is an IR system built to integrate document retrieval and information extraction for a large UK NHS Trust (32). The CogStack platform includes a stack of services that enable full-text clinical data searches, real-time risk prediction, and alerts for advanced patient monitoring (33). Wang et al. used the CogStack platform to implement a real-time psychosis risk detection and alerting service in a real-world EHR system; this was the first study to create and deploy an early-stage psychosis detection and alerting system in clinical practice (33).
MetaMap is a natural language processing tool commonly utilized in constructing IR systems (34). MetaMap was developed to retrieve relevant MEDLINE citations based on user queries. It allows users to search the titles and abstracts of MEDLINE citations by mapping concepts in the text to the UMLS Metathesaurus. Researchers create simple hashes that map the Concept Unique Identifiers (CUIs) from MetaMap to patient records (27, 35). The U.S. National Library of Medicine (NLM) manages the MEDLINE/PubMed database, which contains bibliographic references to biomedical articles; users can download these MEDLINE/PubMed records for research purposes.
CDAPubMed is an open-source web browser extension developed in 2012 to incorporate EHR elements into biomedical literature retrieval methods (36). The Retrieval And Visualization in ELectronic health records (RAVEL) project aims at retrieving relevant elements within the patient's EHR and visualizing them. The project proposed an extensive industrial research and development effort on the EHR, taking the following factors into account: IR, data visualization, and semantic indexing (22, 37). Medreadfast is a hybrid browser designed specifically for combining EHR keyword search with an automatically inferred hierarchical document index (38).
Although most of these tools were developed between 2005 and 2012, it can be observed that they are still used for clinical IR research. This suggests that more advanced clinical IR methods—utilizing advanced machine learning techniques—could be integrated into these already-established workflows to improve their efficiency and effectiveness.
5.3.2 Methodologies
This section summarizes the methods used in the reviewed articles for the following three IR components: Querying, Indexing, and Ranking.
Query Methods:
Keyword search is the simplest technique to search over free-text EHRs. It involves identifying and searching for the lexicalized (surface) forms of specific words or phrases within a collection of EHR documents or a clinical database. To perform a keyword search, the user enters a query containing one or more keywords into the search field of a search engine or database. The keyword search engine then looks for documents or records that contain those keywords and returns a list of results ranked according to the number of occurrences of these keywords. Early clinical IR systems used keyword search, which did not always return the most relevant or accurate results, particularly if the keywords used in the query were too broad (39). Studies demonstrated that this method may not be well-suited for searching for more complex clinical information as it relies on the surface form of query terms rather than the underlying semantics of the search query (38).
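As an illustrative sketch (not drawn from any of the reviewed systems), occurrence-count keyword ranking can be expressed in a few lines of Python; the example notes and query are hypothetical:

```python
from collections import Counter

def keyword_search(query, documents):
    """Rank documents by the total number of query-keyword occurrences."""
    keywords = query.lower().split()
    scores = {}
    for doc_id, text in documents.items():
        # Naive whitespace tokenization; punctuation handling is omitted
        counts = Counter(text.lower().split())
        score = sum(counts[kw] for kw in keywords)
        if score > 0:
            scores[doc_id] = score
    # Highest occurrence count first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

notes = {
    "note1": "Patient reports chest pain. Chest pain worsens on exertion.",
    "note2": "No chest pain reported. Follow-up in two weeks.",
    "note3": "Routine visit, no complaints.",
}
print(keyword_search("chest pain", notes))
```

Note that "note2" is retrieved despite describing the absence of chest pain, which is exactly the negation problem discussed next.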
The limitations of keyword-based search led to the development of more advanced querying and ranking systems that could interpret the semantics of complex clinical texts in EHRs. One such limitation is the issue of negation, which can lead to retrieving irrelevant documents that nonetheless contain the query keywords: the presence of a query keyword does not always imply that the document is relevant. For instance, a record stating "no family history of cancer" could be retrieved for a query searching for patients with "cancer". This issue of negation has to be addressed to avoid retrieving EHRs that contain query phrases in contexts that are not relevant to the query. Garcelon et al. tried to address this problem by extracting subtexts from each original patient record and classifying them into four categories: "patient-not negated", "patient-negated", "family history-not negated", and "family history-negated" (40). By using contextual information, such as negation, temporality, and the subject of clinical mentions, semantic contexts can be incorporated into an Elasticsearch-based indexing/scoring system (41, 42).
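A minimal sketch of this kind of four-way context classification, assuming toy regular-expression cue lists (real systems, such as those built on NegEx, use far richer lexicons and scope rules), might look like:

```python
import re

# Hypothetical cue lists; production systems use curated lexicons and scope rules
NEGATION_CUES = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)
FAMILY_CUES = re.compile(r"\b(family history|mother|father|sibling)\b", re.IGNORECASE)

def classify_mention(sentence):
    """Assign a sentence to one of four subject/polarity context categories."""
    subject = "family history" if FAMILY_CUES.search(sentence) else "patient"
    polarity = "negated" if NEGATION_CUES.search(sentence) else "not negated"
    return f"{subject}-{polarity}"

print(classify_mention("No family history of cancer."))       # family history-negated
print(classify_mention("Patient diagnosed with lung cancer."))  # patient-not negated
```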
In biology, an ontology is a formal representation of a set of concepts and their interactions within a domain. It helps to classify, annotate, and query biological data by organizing and standardizing the information within a certain area (43). Ontologies and other knowledge-based resources are used to extract the semantic nature and associations of medical terms, which are then used at the record level to infer the patient's overall medical history (44–46). Semantic search enhances the representation of both queries and free-text EHRs by expressing concepts and their contexts. In 2011, Gurulingappa et al. developed a computational platform for clinical IR with the aim of exploring clinical ontology-based semantic search techniques (47). Afzal et al. proposed query generation from Medical Logic Modules (MLMs) (48), where they built different query sets from the concepts used in MLMs. These sets were then expanded with a domain ontology derived from SNOMED CT. More details about semantic search are discussed in later sections of this paper.
Concept-based information retrieval (CBIR) is a type of IR system that uses concepts, or high-level abstractions, to represent and index the content of documents. These concepts are typically derived from the words and phrases that appear in the documents and are organized into a hierarchy or ontology to provide a more intuitive and meaningful representation of the information. This method can be more effective than a traditional keyword-based search, as it offers less opportunity for ambiguity and vocabulary mismatch. In these systems, queries and documents are standardized from their original terms to concepts from medical ontologies. Early uses of CBIR for biomedical literature (49) have been adapted for clinical IR using SNOMED CT concepts (8, 48, 50). Researchers used MetaMap to identify UMLS concepts and to map the UMLS and SNOMED concept IDs in the EHRs to the queries (50). Formal Concept Analysis (FCA) is another method to derive the concept hierarchy and match it with the indexed documents (51, 52).
Query expansion is another mechanism through which concepts can be integrated into the query. Instead of altering the query to a concept-based representation, the sets of synonyms in an ontology accompanying the concepts found in the query are added as additional query terms. This has been used, for instance, to perform query expansion using the UMLS Metathesaurus (53–56). Topic modelling is a technique used in natural language processing to identify and extract the main themes in a collection of text documents. It can be used to expand patient queries by identifying related concepts and keywords that are present in the EHR notes but not included in the original query (8). As with UMLS and SNOMED-based query expansion, MeSH-based query expansion has also been utilized (57).
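A simplified sketch of synonym-based query expansion, using a small hypothetical in-memory synonym map in place of an actual UMLS or MeSH lookup:

```python
# Toy synonym map standing in for UMLS/MeSH lookups (hypothetical entries)
SYNONYMS = {
    "heart attack": ["myocardial infarction", "MI"],
    "high blood pressure": ["hypertension"],
}

def expand_query(query):
    """Append ontology synonyms for any concept phrase found in the query."""
    expanded = [query]
    lowered = query.lower()
    for concept, synonyms in SYNONYMS.items():
        if concept in lowered:
            expanded.extend(synonyms)
    return expanded

print(expand_query("history of heart attack"))
# ['history of heart attack', 'myocardial infarction', 'MI']
```

The expanded term list is then submitted to the search engine in place of the original query, typically as a disjunction of the original and added terms.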
Clinical IR queries can be mapped to a common data model, like the Observational Medical Outcomes Partnership (OMOP) Common Data Model, to standardize queries. This involves the extraction of entity mention types from patient-level IR queries and mapping them to a subset of OMOP data fields (58). Wen and colleagues proposed an empirical data model that is implemented to cover major entity mention types in cohort identification tasks (41). They investigated the Clinical Data Repository tables from the Mayo Clinic and Oregon Health & Science University to map the corresponding fields in both a structured and an unstructured format to the proposed data model. In 2020, Shi et al. investigated the relationship between different querying approaches and the characteristics of the cohort definition structure or query taxonomy. But even after developing a 59-parameter taxonomy, they failed to find any significant associations (59).
Modern IR systems frequently utilize automatic query expansion to increase the search space, as the original query may be too narrow or ambiguous, or the search terms may not accurately capture the relevant information. The reformulated query with the expansion terms achieves better results than the original query. The expanded query can be used to obtain more accurate and relevant information from EHRs, which can aid in making better clinical decisions and improving patient outcomes. In clinical IR, researchers have proposed several methods for query expansion based on features of medical language and clinical needs (47). Semantic Query Expansion (SQE) techniques use semantically similar terms to expand the queries (51, 52). Based on the meaning of the words in the query, semantic query expansion seeks to develop useful candidate features suitable for query expansion. Utilizing the clinical associations between terms from ontologies, including knowledge of synonyms and hypernyms/hyponyms, and semantic relationships among medical concepts, such as symptoms, exams and tests, diagnoses, and treatments, led to an improvement in the precision and recall values of IR systems (60). In a recent paper, Wang, Qi (61) used a Candecomp/Parafac-Alternating Least Squares (CP-ALS) decomposition algorithm to identify latent variables, or hidden factors, within EHRs to enhance the initial query. These latent variables can represent important concepts or patterns in the EHR data, such as disease progression, treatment effectiveness, or patient outcomes. In another study, Kreuzthaler, Pfeifer (62) used a log-likelihood-based co-occurrence analysis to identify patterns of co-occurrence between ICD-10 codes and related keywords. By comparing the log-likelihood of different pairs of terms, this method could identify terms that are most likely to be related to each other. The identified co-occurring terms were then used as candidates for expanding the initial query.
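One standard way to compute such a co-occurrence statistic is Dunning's G² (log-likelihood ratio) over a 2×2 contingency table; the sketch below is illustrative and is not the exact implementation used in the cited study:

```python
import math

def g2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: records containing both terms; k12, k21: records containing only
    one of the two terms; k22: records containing neither."""
    n = k11 + k12 + k21 + k22
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    observed = [[k11, k12], [k21, k22]]
    g = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if observed[i][j] > 0:
                g += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * g

# Independent terms score ~0; strongly co-occurring terms score high
print(g2(25, 25, 25, 25))  # 0.0
print(g2(10, 0, 0, 10))
```

Term pairs with the highest G² values are taken as the most strongly associated and thus as expansion candidates.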
Term weighting is the process of assigning a weight to each term in a document in order to reflect the importance of that term in the document. This method can improve the effectiveness of IR systems by helping them identify and prioritize the most relevant terms and documents. Semantic term weighting is a type of term weighting that takes into account the meaning and context of the terms being used, rather than just their frequency within a document. A variety of techniques can be used to calculate semantic term weights, including methods that take into account the co-occurrence of terms within a document, the relationships between terms, and the overall structure and content of the document. Yang et al. proposed an algorithm for SQE by improving expansion term weights (63) and their similarity calculation using Word2Vec, GloVe, and BERT (64–66). Wang et al. proposed an automatic part-of-speech-based term weighting scheme which iteratively calculates the term weights using a cyclic coordinate method; they used a golden section line search algorithm along each coordinate to optimize an objective function defined by mean average precision (MAP) (67). Yang et al. weighted the terms with semantic similarities and assigned calculated category weights and co-occurrence frequencies between expansion terms and multiple query terms. If semantic term weighting is applied to an index, instead of the query, two challenges arise: determining the meaning of a medical term in a given clinical text, and assigning semantic weights to the large number of terms in the indexed clinical texts (68). Hence, term weighting is mostly applied to search queries.
Query expansion using a combination of multiple techniques has been shown to produce more effective results than relying on a single expansion system, as described above. Several studies have reported that combining different external resources can significantly improve the effectiveness of query expansion. For instance, some researchers have proposed a method that combines medical concept weighting and expansion collection weighting, which has been shown to improve retrieval effectiveness compared with uniform weighting methods (69, 70). Specifically, the medical concept weighting approach assigns different weights to medical concepts based on their importance in representing the information needs of the query, while the expansion collection weighting approach assigns different weights to the expansion terms based on their relevance to the collection as a whole. The combination of these two approaches has been found to enhance the performance of the IR system by capturing both the query-specific and collection-specific aspects of relevance.
Relevance feedback is the process of incorporating feedback on the retrieved documents. Generally, this is done with manual user feedback (e.g., from relevance judgments collected from users). Pseudo-relevance feedback, however, is an automatic feedback mechanism that often improves retrieval performance without manual interaction (8). The Rocchio algorithm is a popular relevance feedback algorithm which models the feedback information in a vector space model. Hyperspace Analogue to Language (HAL) is a method for representing and analyzing high-dimensional text data by mapping it into a lower-dimensional space, called a "hyperspace", in a way that preserves the similarity relationships between the text data (71). Researchers have also proposed a HAL-based Rocchio model, called HRoc, to better incorporate proximity information into query expansion (72). Zhu et al. used a Mixture of Relevance Models (MRM) (56) to build a clinical IR system for discharge summaries; for query expansion, they derived related terms from a relevance model using pseudo-relevance feedback.
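The Rocchio update itself is compact; the sketch below represents queries and documents as simple {term: weight} dictionaries, with the commonly used (but here arbitrary) parameter values alpha=1.0, beta=0.75, gamma=0.15:

```python
def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: move the query vector toward the centroid of relevant
    documents and away from the centroid of non-relevant ones.
    All vectors are {term: weight} dictionaries."""
    terms = set(query_vec)
    for doc in relevant + non_relevant:
        terms.update(doc)
    updated = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        non = sum(d.get(t, 0.0) for d in non_relevant) / max(len(non_relevant), 1)
        updated[t] = alpha * query_vec.get(t, 0.0) + beta * rel - gamma * non
    return updated

# Pseudo-relevance feedback: treat the top-ranked documents as relevant
expanded = rocchio({"diabetes": 1.0}, [{"diabetes": 1.0, "insulin": 1.0}], [])
print(expanded)  # "insulin" now carries weight in the query
```

In pseudo-relevance feedback, the "relevant" set is simply the top-k documents from an initial retrieval pass, so no manual judgments are required.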
Multi-modal search enables searching using both text and visuals, as well as retrieval that includes images, charts, and other illustrations from relevant documents in addition to text. Both textual and visual information are included in the query and document representations. Techniques from the fields of natural language processing, IR, and content-based image retrieval allow both text and images to be embedded in the query and document representations. However, few researchers have attempted to implement multi-modal search systems in the clinical domain. Within the time span covered by this review, we found only one such study, by Demner-Fushman, Antani (73), which used a combination of techniques and tools from the fields of NLP, IR, and content-based image retrieval.
Indexing Methods:
The index is one of the key components of an IR system. Indexing is the process of collecting and managing the data, including its storage, to facilitate efficient IR. In this section, we review the different methods for building an IR index found in the literature.
Inverted indexes are commonly used in IR systems because they allow for fast and efficient searching of large collections of documents. An inverted index acts as a map between terms and the documents to which they belong. It is particularly useful for handling full-text searches, in which users enter a keyword or phrase and the system returns all documents containing that term, and numerous papers have used inverted indexing for clinical IR. Elasticsearch is designed as an inverted index-based search engine to facilitate fast and accurate IR (20); consequently, the projects built on Elasticsearch indirectly use an inverted index-based indexing system (22, 23, 41, 74, 75). In a recent paper, Dai et al. proposed an inverted index-based IR system to find cohorts of patients, with a special focus on family disease history (76).
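A minimal inverted index with Boolean AND retrieval can be sketched as follows; the example documents are hypothetical:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """Boolean AND retrieval: documents containing every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "type 2 diabetes mellitus",
    2: "diabetes insipidus",
    3: "gestational diabetes mellitus",
}
index = build_inverted_index(docs)
print(search(index, "diabetes", "mellitus"))  # documents containing both terms
```

Because each query term is resolved with a single dictionary lookup, the cost of a query depends on the length of the posting lists rather than on the size of the whole collection.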
Rule-based indexing is a method of indexing documents in an IR system based on a set of predefined rules or criteria. These rules can be used to classify the EHR documents into categories, or to extract specific information, such as keywords or metadata, from the documents. Rule-based indexing systems typically involve the use of software programs or scripts that are designed to parse the documents and apply predefined rules to extract the relevant information. Edinger et al. experimented with rule-based indexing, developing rules for identifying clinical document sections (26). Rule-based indexing systems can be efficient and reliable, but they can also be inflexible and require significant manual effort to maintain and update the rules as the content of the documents changes. JointEmbed is an IR approach that automatically generates continuous vector space embeddings that implicitly capture semantic information, leveraging multiple knowledge sources such as free text cases and pre-existing knowledge graphs (77). JointEmbed was used for the medical CBR task of retrieving pertinent patient electronic health records, where the quality of the retrieval is crucial due to potential health implications.
Ranking Methods:
A ranking model matches queries with relevant documents and scores each document's relevance to the query. In this section, we discuss different ranking approaches, ranging from probabilistic models to deep learning-based ranking methods.
Clinical information can be retrieved and synthesized when using semantically similar terms from EHR vectors or embeddings. Vector search is a technique used in IR systems to find documents or other data items that match a given query based on their vector representation. In a vector search, documents are represented as vectors in a high-dimensional space. Various approaches, such as term frequency-inverse document frequency (TF-IDF) and word embeddings, can be used to generate these vectors. The vectors are then used to calculate the similarity between the query and the documents or data items, and the most similar documents or data items are returned as search results.
Vector Space Models (VSM), which use word vectors or embeddings, are used to select similar terms from multiple EHRs and evaluate their performance quantitatively and qualitatively across multiple chart review tasks (78). VSMs have gained interest recently with the emergence of deep representation models and vector search techniques in IR systems. VSM methods have proved to be efficient in patient identification, which retrieves patient records corresponding to a specific treatment sequence (79). In order to find similar terms to support chart reviews, researchers introduced a novel vector space model called the medical-context vector space model. It is a collection of clinical terms which are normalized with their frequencies in various medical contexts. VSMs are widely used in open-domain IR systems because they provide a simple and effective way to represent and compare documents and queries. They are also relatively easy to implement and can be used in a variety of different types of clinical IR tasks, including clinical document classification, text similarity, and search.
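As a minimal illustration of vector-space ranking, the sketch below uses raw term-count vectors and cosine similarity rather than trained embeddings; the notes and query are hypothetical:

```python
import math
from collections import Counter

def vectorize(text):
    """Raw term-count vector for a text."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

notes = {
    "a": "acute kidney failure on admission",
    "b": "kidney stones removed",
    "c": "no cardiac issues",
}
query = vectorize("kidney failure")
ranked = sorted(notes, key=lambda d: cosine(query, vectorize(notes[d])), reverse=True)
print(ranked)  # most similar note first
```

Replacing the count vectors with TF-IDF weights or learned embeddings changes only the `vectorize` step; the ranking machinery stays the same.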
TF-IDF and BM25 are two of the most popular VSM algorithms used in clinical IR. TF-IDF is a statistical weighting scheme that reflects how relevant a query word is to a document in a corpus. It is calculated by multiplying the term frequency (TF) of a word by the inverse document frequency (IDF) of the word. The TF of a word is the number of times the word appears in a document, while the IDF inversely reflects how common the word is across all documents in the corpus. TF-IDF has been widely used to identify the most important clinical terms or concepts within EHRs (68). Okapi BM25 is a probabilistic ranking model, which compares each query word and its number of occurrences in the given document with its frequency in the entire document collection (80). Although BM25 is based on the principles of TF and IDF, it also takes into account factors such as the length of the document and the average document length in the corpus, and it includes two tunable parameters, k1 and b, that can be adjusted to fine-tune the ranking function. By default, Elasticsearch uses the BM25 ranking algorithm (23, 41, 74, 75), and the scalability of the model is ensured by Elasticsearch's distributed architecture (22). Hristidis et al. compared a Clinical ObjectRank (CO) system, which uses an authority-flow algorithm that exploits the associations between entities in EHRs to discover the most relevant ones, against BM25. Their results showed that CO outperformed BM25 in terms of sensitivity (65% vs. 38%), by 71% on average, while maintaining the specificity (64% vs. 61%) (39). VSMs, such as TF-IDF and BM25, have been widely adopted in clinical IR systems due to their ability to effectively rank the relevance of documents to a query. However, these models are limited in their ability to capture complex concepts and relationships within the text.
One of the main limitations of vector space ranking models is their reliance on term frequency and inverse document frequency as the sole measures of relevance. This approach does not take into account the context in which words appear in the text, which can make it difficult to capture subtle nuances and relationships between concepts.
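For reference, the Okapi BM25 score described above can be sketched directly from its formula (using a common non-negative IDF variant); the tiny corpus below is illustrative only:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a query.

    corpus: list of tokenized documents, used for IDF and average length."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)  # term frequency in this document
        # Length normalization: longer-than-average documents are penalized
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
    return score

corpus = [["chest", "pain", "on", "exertion"], ["no", "pain"], ["headache"]]
print(bm25_score(["chest", "pain"], corpus[0], corpus))
```

The k1 parameter controls how quickly repeated occurrences of a term saturate, and b controls how strongly document length is normalized.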
A class of techniques known as Learning to Rank (LTR or LETOR) uses supervised machine learning (ML) to address ranking problems. LTR ranks the document set based on the relative relevance of each document in the corpus (6, 81). With the recent advancement of deep learning and Pretrained Language Models (PLMs), neural LTR approaches have been adopted in the latest clinical IR systems (82). In their research, Arvanitis et al. proposed a k-nearest document search algorithm to efficiently compute the similarity between two EHRs (83). In this algorithm, the similarity between two EHRs is measured by comparing their content, represented as a set of features, to the content of other EHRs in the corpus.
RankNet, one of the most popular LETOR algorithms, is a supervised learning algorithm that uses neural networks to learn the ranking function from relevance judgments. AdaRank is another LETOR algorithm, a boosting-based approach that is particularly useful in the context of clinical IR (84). It is designed to optimize the trade-off between relevance and diversity of the retrieved documents by iteratively adjusting the weights of the features used to rank the documents based on feedback from relevance judgments. AdaRank uses a loss function to measure the difference between the predicted relevance scores and the actual relevance judgments, and it can take into account multiple features, such as the text of the documents, the author, the publication date, and the source, to rank the documents. In several studies, the AdaRank algorithm has been shown to outperform VSMs, to handle the complex and diverse nature of clinical documents like EHRs, and to improve the performance of clinical IR systems (85).
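The pairwise idea behind RankNet can be illustrated with its cross-entropy loss on a single document pair; the sketch below deliberately omits the neural scoring model and gradient updates:

```python
import math

def ranknet_pair_loss(score_i, score_j, label):
    """Pairwise cross-entropy loss for a single document pair.

    score_i, score_j: model scores for documents i and j.
    label: 1 if document i should rank above document j, 0 otherwise."""
    # Modeled probability that i is ranked above j
    p = 1.0 / (1.0 + math.exp(-(score_i - score_j)))
    eps = 1e-12  # guard against log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

# Scores that agree with the label give a small loss; disagreement, a large one
low = ranknet_pair_loss(3.0, 0.0, 1)
high = ranknet_pair_loss(0.0, 3.0, 1)
```

Training minimizes this loss over many labeled pairs, so the learned scoring function orders documents consistently with the relevance judgments.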
With the success of deep learning-based contextualized language models, neural IR systems have been developed that facilitate the use of contextualized embeddings for the task of relevance ranking. BERT (Bidirectional Encoder Representations from Transformers) is a contextualized language model that uses the Transformer encoder architecture with self-attention mechanisms to learn contextual relations between words (or sub-words) in text. BERT-based clinical language models, such as BioBERT and ClinicalBERT, have enabled researchers to contextualize query and document embeddings for different clinical IR applications, including patient cohort retrieval. A query with a patient's target characteristics and a document corpus are passed to these language models to retrieve the clinical reports of similar patients (82, 86). Shi, Syeda-mahmood (87) proposed an approach that used lexicon-driven concept detection to identify relevant concepts in sentences from EHRs and then used these concepts as queries, which served as input to train a Sentence-BERT (SBERT) model. In a recent study (88), the authors explored the use of masking techniques during the fine-tuning stage of BERT for a reading comprehension QA task on clinical notes. The results suggested that transformer-based QA systems may benefit from moderate masking during fine-tuning, likely because masking forces the model to learn abstract context patterns rather than relying on specific clinical terms or relations.
Re-ranking refers to the process of adjusting the ranking of a subset of documents retrieved by an initial ranking function. The initial ranking function, such as TF-IDF or BM25, is applied to the entire corpus of documents, and the re-ranking process then focuses on the top N documents that it retrieved. The goal of re-ranking is to improve the relevance of the top-ranking documents by taking into account additional information or criteria that were not considered in the initial ranking. Based on expanded search terms and users' feedback, the retrieved outputs are re-ranked to generate new ranking scores (42). Thus, clinical IR becomes a two-step process, where 1) a ranked set of documents is retrieved for the user query, and 2) the retrieved documents are re-ranked based on the expanded query (56). Kullback-Leibler (KL) divergence, a measure of the difference between two probability distributions, can be used to compare the relevance of different documents to a user's query; it was used in a study by Yang et al. to compare the similarity of an EHR document's content to the contents of other relevant documents in their clinical IR system (63). The documents with the lowest KL divergence are considered the most similar to the other relevant documents and are ranked higher.
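KL-divergence-based re-ranking of this kind can be sketched by smoothing term distributions and ordering candidate documents by ascending divergence from a reference distribution built from known relevant text; the vocabulary and documents below are hypothetical, and the cited system differs in its details:

```python
import math
from collections import Counter

def term_distribution(text, vocab):
    """Smoothed term-probability distribution restricted to a fixed vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[t] for t in vocab)
    # Additive smoothing keeps every probability non-zero, so KL stays finite
    return {t: (counts[t] + 0.01) / (total + 0.01 * len(vocab)) for t in vocab}

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

vocab = ["diabetes", "insulin", "fracture"]
reference = term_distribution("diabetes insulin diabetes", vocab)  # relevant text
candidates = {
    "doc_a": term_distribution("diabetes insulin therapy", vocab),
    "doc_b": term_distribution("ankle fracture repair", vocab),
}
# Lower divergence from the reference means a higher final rank
ranked = sorted(candidates, key=lambda name: kl_divergence(reference, candidates[name]))
print(ranked)
```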
While there is growing interest in deep learning and language model-based approaches, they are not yet widely adopted in the field. Of the papers reviewed, only 12 used deep learning methods, and of those, only 5 employed pretrained language models like BERT. In contrast, 39 papers employed machine learning-based IR methods, and TF-IDF and BM25 together appeared in more than 70 papers. This suggests that there is a need for more research in the area of deep learning and language model-based IR in the clinical domain. Such approaches have the potential to improve the accuracy and relevance of retrieval results, and thus can play an important role in supporting clinical decision-making.