Building of Dictionaries:
We summarized the current research on PTSD integration into the RDoC framework and built our context-dependent keyword and sentence dictionaries from that research and from subject matter experts (SMEs). Artificial intelligence models struggle to classify narratives in niche domains when they have not been trained on them or tailored to the specialized subject matter. We aim to address this problem by including SMEs in dictionary development[13]. Our dictionaries cover the following RDoC domains:
Negative Valence Systems: Research consistently supports the relevance of negative valence systems in PTSD, characterized by fear and anxiety symptoms, and particularly anxious avoidance of trauma-related cues. This anxiety may generalize to neutral cues during flashbacks, and key mechanisms involving the amygdala, prefrontal cortex, and hippocampus are implicated in fear conditioning and extinction[14], [15]. PTSD is associated with overgeneralized fear, impairments in fear extinction, and cue generalization. Amygdala dysfunction and hypoactivity in the ventromedial prefrontal cortex contribute to heightened fear responses and hindered extinction[16]. Genetic factors, including the BDNF Val66Met allele, are linked to impaired fear extinction, impacting treatment response[17].
Positive Valence Systems: Positive valence systems, focusing on reward learning and valuation, are understudied in PTSD, with anhedonia reflecting emotional numbing and diminished goal-oriented behavior[18]. Reward processing deficits involve dopamine and serotonin systems, influenced by genetic factors[19]. Oxytocin and SSRIs show promise in addressing reward deficits and anhedonia in PTSD treatment[20].
Cognitive Systems: Cognitive deficits in PTSD affect attention, planning, and memory, with attentional bias towards threat stimuli and memory biases contributing to hyperarousal[21]. Epigenetic modifications and gene polymorphisms, like in the glucocorticoid receptor (GR) gene, are linked to memory deficits in PTSD[22], [23]. Effective treatment may improve cognitive deficits.
Arousal and Regulatory Systems: Hyperarousal, a core symptom of PTSD, involves heightened nervousness, sleep problems, and increased startle responses, with sympathetic nervous system overdrive contributing[8], [24], [25]. Genetic variations in adrenergic receptors influence emotional memory, and medications like prazosin and propranolol show efficacy in treating PTSD-related hyperarousal[26], [27].
Systems for Social Processes: Social processes, including attachment, communication, and self-perception, are affected in PTSD, particularly in cases of complex PTSD or interpersonal trauma[28]. Concepts like shame, guilt, and paranoid distrust are prevalent in interpersonally traumatized PTSD patients and merit further study[29], [30], [31].
Sensorimotor Systems: Current transdiagnostic research explores sensorimotor abnormalities in children, individuals at risk of psychosis, and first-episode psychosis patients, among others[32], [33], [34]. Sensorimotor dysfunction, recognized only in recent years, can be used to enhance early identification and develop effective treatments.
We utilized a “Human-in-the-Loop” approach, incorporating subject matter expertise to develop sentence dictionaries [35]. We developed sentence dictionaries to address a limitation of existing keyword dictionaries, which often include common English words that lack context specificity. Through sentence dictionaries, we aim to capture the entire context in which a word associated with an RDoC domain is used. Figures 1 and 2 illustrate the iterative flow of sentence dictionary development and the study workflow, respectively. Supplementary Tables 1 and 2 present the RDoC keyword and sentence dictionaries. The study aims to identify population-specific and disease-trajectory-specific RDoC domains for early PTSD diagnosis and treatment research.
The steps of the iterative workflow correspond to: (A) Data collection; (B) Building of keyword and sentence dictionary from literature[36], [37] and SMEs
Sentence transformer model:
We used the pre-trained model all-mpnet-base-v2 [38], a transformer-based natural language model, to identify the presence of RDoC domains in clinical notes of PTSD patients. The model is based on the MPNet architecture and, according to Sentence-Transformers, has the highest performance in generating sentence embeddings [38], [13]. We did not perform any additional fine-tuning on our dataset. The model provided by Sentence-Transformers was used in its original form, which has an output dimension of 768. Thus, the output of this model for each sentence in the PTSD dataset is an embedding of length 768.
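As a rough illustration of this step, the sketch below shows how the off-the-shelf model could be loaded and applied with the Sentence-Transformers package. The helper names `encode_sentences` and `check_embedding_shape` are hypothetical and not part of the study code; the import is deferred so that the shape check can run without the package installed.

```python
from typing import Sequence

MODEL_NAME = "all-mpnet-base-v2"  # pre-trained model named in the text
EMBEDDING_DIM = 768               # output dimension stated in the text


def encode_sentences(sentences: Sequence[str], model_name: str = MODEL_NAME):
    """Embed sentences with the pre-trained model, without fine-tuning."""
    # Imported lazily: requires the `sentence-transformers` package and
    # downloads the model weights on first use.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # encode() returns one embedding vector per input sentence.
    return model.encode(list(sentences))


def check_embedding_shape(embeddings, n_sentences: int,
                          dim: int = EMBEDDING_DIM) -> bool:
    """Hypothetical sanity check: one vector of length `dim` per sentence."""
    rows = [list(row) for row in embeddings]
    return len(rows) == n_sentences and all(len(row) == dim for row in rows)
```

For example, `check_embedding_shape(encode_sentences(["Patient reports nightmares."]), 1)` would be expected to hold for this model, since each sentence maps to a single 768-dimensional vector.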
The RDoC extraction pipeline involves:
1. Collecting and preprocessing 5.67 million PTSD clinical notes from the UPMC EMR system.
2. Building a keyword dictionary from literature and SMEs.
3. Conducting keyword searches and extracting sentences from 110,000 clinical notes.
4. Creating a context-specific RDoC sentence dictionary, evaluated by SMEs for categorization and inclusion (Table 3).
5. Utilizing sentence transformers to extract RDoC information by comparing cosine similarity scores between sentences, where Sentence A is the new sentence from a clinical note that is being annotated and Sentence B is a sentence from the sentence dictionary. The selection criteria involve identifying the presence of a keyword in both sentences. Once the keyword is identified in Sentence A, the sentence transformer considers only those dictionary sentences (Sentence B) in which the same keyword is present. Subsequently, the transformer assesses the cosine similarity scores between Sentence A and each candidate Sentence B until the optimal match is determined. This is a supervised approach.
6. Calculating F1 macro scores to determine the optimal threshold for identifying RDoC categories, comparing manual annotation by SMEs with the sentence transformer output.
7. Identifying and visualizing RDoC in two use cases: (i) across multiple patient populations and (ii) throughout various disease trajectories.
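The matching logic of step 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the dictionary entries are invented examples, keywords are treated as single lowercase words, and a toy bag-of-words embedder stands in for all-mpnet-base-v2 so that the example runs without external packages. Any function that maps a sentence to a numeric vector could be passed as `embed` instead.

```python
import math
import re
from collections import Counter
from typing import Callable, List, NamedTuple, Optional, Sequence, Tuple


class DictEntry(NamedTuple):
    keyword: str   # single lowercase keyword (simplifying assumption)
    domain: str    # RDoC domain assigned by SMEs
    sentence: str  # dictionary sentence ("Sentence B")


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z']+", text.lower())


def make_toy_embedder(corpus: Sequence[str]) -> Callable[[str], List[float]]:
    """Bag-of-words stand-in for a sentence transformer (assumption)."""
    vocab = sorted({tok for s in corpus for tok in tokenize(s)})

    def embed(sentence: str) -> List[float]:
        counts = Counter(tokenize(sentence))
        return [float(counts[tok]) for tok in vocab]

    return embed


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(sentence_a: str, dictionary: Sequence[DictEntry],
               embed: Callable[[str], Sequence[float]],
               threshold: float = 0.3) -> Optional[Tuple[str, float]]:
    """Return (domain, score) of the best dictionary match, or None."""
    tokens_a = set(tokenize(sentence_a))
    # Keep only entries whose keyword occurs in both Sentence A and Sentence B.
    candidates = [e for e in dictionary
                  if e.keyword in tokens_a and e.keyword in tokenize(e.sentence)]
    if not candidates:
        return None
    vec_a = embed(sentence_a)
    score, entry = max(
        ((cosine_similarity(vec_a, embed(e.sentence)), e) for e in candidates),
        key=lambda pair: pair[0],
    )
    return (entry.domain, score) if score >= threshold else None


# Invented dictionary entries, for illustration only.
RDOC_DICTIONARY = [
    DictEntry("nightmares", "Arousal and Regulatory Systems",
              "Patient reports nightmares and trouble sleeping"),
    DictEntry("avoids", "Negative Valence Systems",
              "Patient avoids reminders of the trauma"),
]
embed = make_toy_embedder([e.sentence for e in RDOC_DICTIONARY])
result = best_match("She reports frequent nightmares and trouble sleeping",
                    RDOC_DICTIONARY, embed)
```

Restricting the comparison to dictionary sentences that share a keyword with Sentence A keeps the number of similarity computations small and ties every match to an SME-reviewed keyword. The default threshold of 0.3 mirrors the value reported for the real model and carries no meaning for the toy embedder.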
Statistical analysis:
The RDoC domain categories used for data collection were refined iteratively because of the large volume of clinical notes, as illustrated in Fig. 2. The initial stages comprised applying a keyword dictionary, extracting sentences containing these keywords, having SMEs label the sentences into categories, and using sentence transformers to identify RDoC domains along with the corresponding keywords. Table 3 shows the count of identified patients, with SMEs reviewing a subset of randomly selected cases (N = 8,351 sentences). These procedures were crucial in the labeling process.
Table 1
Number of patients identified by the Sentence Transformer and number of randomly selected cases reviewed and manually annotated by subject matter experts
RDoC | No. of patient cases | Randomly selected cases reviewed and manually annotated by subject matter experts |
Arousal regulation | 18724 | 443 |
Cognitive systems | 17740 | 2165 |
Negative valence | 21829 | 2629 |
Positive valence | 18360 | 2797 |
Sensorimotor systems | 9770 | 74 |
Social process | 17014 | 243 |
SMEs initially developed labels that best defined each domain, but noticed missing keywords, prompting ongoing refinement. Inclusion of these keywords enhanced the dictionary, improving RDoC identification accuracy. In the final iteration, SMEs verified correct assignments. Comparing manual SME annotations with sentence transformer results, Supplementary Table 3 shows that a 0.3 cosine similarity threshold yielded the best F1 macro scores across all RDoC domains, ensuring an F1 score of at least 80% across all domains (N = 8,351 sentences).
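The threshold selection described above can be sketched as a simple sweep. The example below is illustrative only: it assumes a binary setup for a single domain, in which a sentence is predicted to belong to the domain when its best cosine similarity meets the threshold, and the scores and SME labels are made-up values rather than study data. The function names are hypothetical.

```python
from typing import Hashable, List, Sequence, Tuple


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    f1_scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def best_threshold(scores: Sequence[float], sme_labels: Sequence[bool],
                   thresholds: Sequence[float]) -> Tuple[float, float]:
    """Return the (threshold, macro F1) pair with the highest macro F1."""
    results = []
    for t in thresholds:
        # Predict "belongs to the domain" when similarity meets the threshold.
        preds = [s >= t for s in scores]
        results.append((t, macro_f1(sme_labels, preds)))
    return max(results, key=lambda r: r[1])


# Made-up similarity scores and SME annotations, for illustration only.
toy_scores = [0.10, 0.20, 0.35, 0.50, 0.80, 0.25]
toy_labels = [False, False, True, True, True, False]
chosen = best_threshold(toy_scores, toy_labels, [0.1, 0.2, 0.3, 0.4, 0.5])
```

In a multi-domain setting, the same sweep could be run per domain and a single threshold chosen that keeps every domain's score above a required minimum; how the study combined domains is not specified here, so that step is left out.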