Design
This longitudinal study used data collected in an open cohort design for the ETMED-L project [20]. Each of the four data collection waves (March 2021, November 2021, November 2022, and November 2023) included medical students from all curriculum years (1 to 6), except the last wave, in which first-year students were excluded to prioritize repeated participation. Linear Mixed Models (LMMs) were applied because they effectively accommodate limited and discontinuous data, providing a reliable way to estimate the longitudinal trajectories of empathy across the complete six years of medical education, even though each participant was assessed at a maximum of four time points.
Data Collection
At each of the four data collection waves, an online questionnaire investigating mental health and interpersonal competence was sent to all matriculated medical students at the University of Lausanne (Switzerland), except for the students who were at the university as part of an academic exchange. It took approximately one hour to complete the online questionnaire fully, and the students received 50 CHF (~ 50 USD) for each completed questionnaire. The Research Ethics Committee of the canton of Vaud approved the project (project number 2020–02474), and all participants provided written informed consent. At the end of the questionnaire, contact information for mental health help services was provided if students felt the need to seek support.
Measures
Empathy and emotion recognition
To adopt a more comprehensive framework of empathy, we used five measures from three validated instruments as well as an emotion recognition test described in more detail in the published study protocol [20]. First, we assessed empathy with the widely used Jefferson Scale of Physician Empathy - Student Version (JSPE-S [6]), which was developed to measure medical students’ orientations or attitudes toward empathic relationships in the context of patient care. Second, the validated French version of the Questionnaire of Cognitive and Affective Empathy (QCAE [9, 21]) was used as a multidimensional measure that assesses cognitive and affective empathy separately. Third, for behavioural empathy, observing interactions with patients would have been ideal, but such data would have been difficult to obtain. After careful examination of existing self-report instruments, we found that the validated French version of the Ability to Modify Self-Presentation Scale (AMSP [22, 23]) could serve as a proxy for a behavioural dimension of empathy, as it assesses one’s ability to modify one’s behaviour according to the social situation at hand. Finally, we used the Geneva Emotion Recognition Test – Short Form (GERT-S [24]) to measure emotion recognition accuracy. Although empathy has traditionally been assessed through self-report questionnaires, there are also well-validated performance-based tasks that evaluate the ability to recognize emotions in individuals depicted in pictures or short videos [25]. The GERT-S presents 42 video clips of actors expressing one of 14 emotions while saying pseudolinguistic sentences. The final score is the number of emotions correctly recognized by the participant.
Such emotion recognition tasks have been found to correlate significantly, though modestly, with self-reported measures of both cognitive and affective empathy (Murphy and Lilienfeld, 2019). This suggests that while the ability to recognize others' emotions is linked to both understanding and resonating with those emotions, these tasks also capture aspects of emotional processing that go beyond these dimensions.
Psychosocial and health-related covariates
Four sociological covariates were analysed: identifying as male (gender identity recoded into 1 = male and 2 = female or nonbinary), parents’ education (number of parents with a college or university degree), having a partner (1 = yes, 0 = no), and social support. Social support was measured as the average score of two questions adapted from the Swiss Household Panel survey [26]: “If necessary, in your opinion, to what extent can someone provide you with practical help, this means concrete help or useful advice, if 0 means "not at all" and 10 "a great deal"?” for practical support and “To what extent can someone be available in case of need and show understanding, by talking with you for example, if 0 means "not at all" and 10 "a great deal"?” for emotional support.
Regarding psychological covariates, having consulted a psychotherapist during the past year (1 = yes, 0 = no) and coping strategies were assessed. The French version of the coping section of the Euronet questionnaire [27, 28] was used to measure three types of coping strategies [29]: emotion-focused, problem-focused, and help-seeking. Moreover, mental health and burnout indicators were included. For mental health, the Center for Epidemiological Studies-Depression [30] was used to measure depression symptoms, and the State-Trait Anxiety Inventory [31] was used for anxiety symptoms. Additionally, the emotional exhaustion, cynicism, and academic efficacy (reversed) dimensions of burnout were assessed with the Maslach Burnout Inventory Student-Survey [32].
Finally, two health-related covariates were included: physical activity (number of hours per week) and satisfaction with one’s own health (“Are you satisfied with your health?” rated on a scale from 1 = very unsatisfied to 5 = very satisfied).
Statistical Analysis
To investigate the longitudinal trajectories of empathy during medical school, separate LMMs were fitted for each empathy and emotion recognition measure, with curriculum year and covariates as fixed effects. The fixed effects capture the systematic variation associated with changes across curriculum years and control for potential confounding factors. Random intercepts were incorporated at the student level to account for the correlation between repeated measures within the same student while allowing for the inherent variability between students. The restricted maximum likelihood method was used to produce unbiased estimates of variance and covariance parameters [33, 34]. Two temporal variance-covariance structures, first-order autoregressive (AR(1)) and first-order autoregressive moving average (ARMA(1,1)), were tested to account for the potential temporal spillover of empathy and emotion recognition, with the best-fitting structure selected via likelihood ratio tests. Nonlinear trajectories of empathy and emotion recognition were subsequently tested by adding quadratic or cubic terms for time. The best-fitting model was again selected via likelihood ratio tests and is presented as the final model. For all final models, we report marginal R² as the percentage of variation explained by the fixed part of the model, conditional R² as the percentage of variation explained by both the fixed and random parts of the model, and the intraclass correlation coefficient (ICC) as the proportion of total variance attributable to between-student differences. Standardized β values are presented for effect size estimation: values between 0.10 and 0.29 are considered small, those between 0.30 and 0.49 medium, and those of 0.50 or greater large [35].
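As an illustration of how the reported model summaries relate to one another, the following minimal sketch computes marginal R², conditional R², and the ICC from the variance components of a random-intercept model, and labels a standardized β using the thresholds stated above. It assumes the common variance-decomposition definitions for Gaussian random-intercept models and ignores the autoregressive residual structure; the container `VarianceComponents`, the function names, the "negligible" label for values below 0.10, and the example values are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class VarianceComponents:
    """Hypothetical container for the variance components of a random-intercept LMM."""
    fixed: float      # variance of the fixed-effects predictions
    intercept: float  # between-student (random-intercept) variance
    residual: float   # within-student residual variance


def marginal_r2(vc: VarianceComponents) -> float:
    """Share of total variance explained by the fixed effects alone."""
    return vc.fixed / (vc.fixed + vc.intercept + vc.residual)


def conditional_r2(vc: VarianceComponents) -> float:
    """Share of total variance explained by fixed and random effects together."""
    return (vc.fixed + vc.intercept) / (vc.fixed + vc.intercept + vc.residual)


def icc(vc: VarianceComponents) -> float:
    """Share of the variance not explained by fixed effects that lies between students."""
    return vc.intercept / (vc.intercept + vc.residual)


def effect_size_label(beta: float) -> str:
    """Label a standardized beta using the thresholds given in the text.

    The 'negligible' label for |beta| < 0.10 is an assumption of this sketch.
    """
    size = abs(beta)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "negligible"


# Illustrative values only (not results from the study).
example = VarianceComponents(fixed=2.0, intercept=5.0, residual=3.0)
print(marginal_r2(example), conditional_r2(example), icc(example))
print(effect_size_label(-0.35))
```

In this sketch the ICC is computed on the variance left after the fixed effects are accounted for, which is why its denominator omits the fixed-effects variance.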
To explore significant differences between specific time points, pairwise comparisons of each year with the preceding year were additionally conducted, with Holm-Bonferroni corrections for multiple comparisons. However, unlike LMMs, these comparisons do not account for the correlation between repeated measures within the same student and are thus a less accurate representation of longitudinal trajectories.
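The Holm-Bonferroni procedure can be sketched as follows: the p values are sorted in ascending order, the smallest is multiplied by the number of comparisons m, the next by m - 1, and so on, with adjusted values capped at 1 and forced to be non-decreasing. The function name `holm_adjust` and the example p values are hypothetical; this is a minimal illustration of the general procedure, not the code used in the study.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Return Holm-Bonferroni adjusted p values in the original order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Illustrative p values only (not results from the study).
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

A comparison is then declared significant when its adjusted p value falls below the chosen threshold, here .05.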
The highest missing rate at the item level was 0.59% (AMSP items). Multiple imputation has been shown to offer little gain when missing rates are lower than 5% [36]. Thus, missing data at the item level were replaced by mean scores if fewer than 20% of the items were missing. If 20% or more of the items were missing, the total score was considered missing, and missing data at the score level were then handled with full information maximum likelihood in the LMMs. Stata version 17 [37] and R version 4.2.2 [38] were used for the analyses, and p values < .05 were considered significant.
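The item-level rule can be illustrated with a minimal sketch. It assumes that "mean scores" refers to the respondent's own mean across the items they answered on that scale and that the total score is a sum of items; neither detail is specified in the text. The function name `scale_score` and the example responses are hypothetical.

```python
from typing import Optional, Sequence


def scale_score(items: Sequence[Optional[float]],
                max_missing: float = 0.20) -> Optional[float]:
    """Sum score for one respondent on one scale, with mean imputation.

    Missing items (None) are replaced by the respondent's mean over answered
    items when fewer than `max_missing` of the items are missing. Otherwise
    the total score is treated as missing (None).
    """
    if not items:
        return None
    observed = [x for x in items if x is not None]
    n_missing = len(items) - len(observed)
    # 20% or more missing items: the total score is considered missing.
    if n_missing / len(items) >= max_missing:
        return None
    person_mean = sum(observed) / len(observed)
    filled = [x if x is not None else person_mean for x in items]
    return sum(filled)


# Illustrative responses only (not data from the study).
print(scale_score([4, 5, None, 3, 4, 5]))  # 1 of 6 items missing: imputed
print(scale_score([None, 1, 2, 3, 4]))     # 1 of 5 items missing: score is None
```

Scores returned as `None` correspond to the score-level missing data that the text says were handled by full information maximum likelihood in the LMMs.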