We present the results of this review in four sections. The first reports the results of the selection process. The second details the characteristics of the selected studies, such as their area of focus, publication year and location, and data source. The third describes the machine learning models used in the studies to predict and study mental health outcomes, and the fourth summarizes how those models were validated.
3.1 Selected Studies
Our search strategies yielded a total of 22,082 records from Google Scholar, EMBASE, PsycINFO, and PubMed. Figure 1 shows the flow of our search strategy and results. All 60 records from PsycINFO and PubMed were reviewed, along with 280 records from Google Scholar and the 100 most relevant records from EMBASE. Based on titles and abstracts, 79 records were selected for further review. Most of these 79 records were excluded at full review because they did not focus on the population of interest: they examined majority or racially homogeneous populations and/or did not discuss immigrant/migrant status. We also reviewed five abstracts identified through citation searching. Ultimately, 13 publications were included in this review.
3.2 Publication Characteristics
Table 1 presents high-level characteristics of the reviewed publications. All but two of the analyzed articles were published in the last three years; the two earliest are from 2017 (31, 32). More than half of the papers were from the US or incorporated US-based populations, four were from Europe, and the rest were from Asia. Among the 13 articles, five focused on refugee populations (32-36), three on Hispanic populations in the US (31, 37, 38), two on black individuals (39, 40), one on Native Americans (41), one on Korean immigrants in the US (42), and one on immigrant populations in Europe (43). The areas of mental health focus included stress (35), ADHD (39, 40), trauma (32, 33), depression (36, 38, 40), PTSD (34), psychological distress (42), schizophrenia (43), suicidal ideation (37, 41), and substance abuse (31).
Surveys (32, 34, 36, 42), drawings (33), secondary data sets (including EHR data, surveillance data, and national sample sets) (31, 33, 37, 41, 43), internet-based posts (35, 38), and genomic sequencing data (39, 40) were analyzed in the included publications (see Table 2). Various populations were considered, and sample sizes varied widely with the type of data collected and analyzed. For example, Augsburger and Elbert (32) enrolled 56 resettled refugees in a study to prospectively analyze their risk-taking, Goldstein and Bailey (37) used a retrospective dataset with 22,968 unique Hispanic patients, and Acion et al. (31) included 99,013 Hispanic individuals in their secondary data analysis. Children were also included in the reviewed studies: one study examined the depression and PTSD levels of 631 refugee children residing in Turkey (34), and another analyzed drawings from 2,480 Syrian refugee children to identify predictors of exposure to violence and mental well-being (33). Other sample sets comprised 150,000 unique tweets from Twitter (35) and 441,000 unique conversations from internet message boards and social media sites (38). Genomic sequencing data were collected from 4,179 black individuals (40) and 524 black individuals (39).
Most reviewed studies used supervised learning with the aim of explaining or predicting specific MH outcomes. For example, to classify substance use disorder treatment success in Hispanic patients, Acion et al. compared 16 different ML models to an ensemble method they called "Super Learning" (31). Similarly, Huber et al. compared several ML algorithms, including decision trees, support vector machines, naïve Bayes, logistic regression, and K-nearest neighbors, to determine the model with the best predictive power for classifying schizophrenia spectrum disorders in migrants (43). Two studies explored the impact of trauma exposure on MH using ML (32, 33), and two used social media data to understand MH at a population-health level through ML algorithms (35, 38). All study aims are listed in Table 2.
3.3 Machine Learning Model Performance and Characteristics
Table 3 summarizes ML characteristics and model performance. All 13 included publications fell into one of three categories: classification (31, 35, 37, 39-43), regression (32-34, 36), and unsupervised topic modeling (38).
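To make the third category concrete: unsupervised topic modeling groups documents by co-occurring vocabulary without any labeled outcomes. The sketch below uses latent Dirichlet allocation (LDA) in scikit-learn as a generic illustration only; the reviewed study's specific method, corpus, and parameters are not known here, and the example documents are invented.

```python
# Generic unsupervised topic-modeling sketch (LDA via scikit-learn); the
# documents and parameters are illustrative, not from any reviewed study.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "struggling to find housing after resettlement",
    "feeling anxious and isolated in a new country",
    "looking for work and language classes",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# The top-weighted words per topic suggest recurring themes in the posts.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")
```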
The publications used between one (32-34, 36, 39, 40, 42) and 16 (31) ML models. In studies where multiple ML models were used, the aim was often to compare them and identify the one with the best predictive power. For example, Acion et al. evaluated their 16 models using the area under the receiver operating characteristic curve (AUC) to classify substance use disorder treatment success in Hispanic patients (31), and Huber et al. likewise compared their five algorithms to determine which had the best predictive power (43). Two of the studies used linear regression (34, 36). All of the studies developed custom models to meet their study aims. The most common software environments were R (31, 32), SPSS (34, 42), and Python (35, 39, 40).
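As a rough illustration of this comparative workflow (a minimal sketch on synthetic data, not the code of any included study), several common classifiers and a simple stacking ensemble, loosely analogous to an ensemble "super learner", can be scored by cross-validated AUC:

```python
# Minimal sketch: compare several classifiers and a stacking ensemble
# by cross-validated AUC on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for survey/EHR predictors and a binary outcome
# (e.g., treatment success).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

base_models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(probability=True, random_state=0),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Stacking ensemble that combines the base learners' predictions
# through a second-stage logistic regression.
ensemble = StackingClassifier(
    estimators=list(base_models.items()),
    final_estimator=LogisticRegression(max_iter=1000),
)

for name, model in {**base_models, "ensemble": ensemble}.items():
    aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {aucs.mean():.3f}")
```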
Predictors included in the models were typically sociodemographic characteristics (31, 34, 37, 41-43); some studies also included MH variables and experiences (31, 32, 34, 37, 41-43) collected from EHRs or surveys. One study first screened its 653 input variables (covering sociodemographic data, childhood/adolescence experiences, psychiatric history, past criminal history, social and sexual functioning, hospitalization details, prison data, and psychopathological symptoms) to determine the best predictor variables, then trained a final ML algorithm using only those (43).
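That two-stage design can be outlined as follows. This is a hedged sketch only: the univariate screen, the classifier, and all data are illustrative assumptions rather than the study's actual pipeline.

```python
# Rough sketch of a two-stage pipeline: screen a wide variable pool,
# then fit the final classifier on the selected predictors only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Placeholder for a wide feature matrix (hundreds of survey/EHR items).
X, y = make_classification(n_samples=500, n_features=653,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    # Stage 1: keep only the strongest predictor variables.
    ("screen", SelectKBest(score_func=f_classif, k=20)),
    # Stage 2: train the final algorithm on the reduced variable set.
    ("model", DecisionTreeClassifier(random_state=0)),
])
pipeline.fit(X_train, y_train)
print(f"held-out accuracy: {pipeline.score(X_test, y_test):.3f}")
```

Fitting the selector inside the pipeline keeps the screening step confined to the training data, which avoids leaking information from the test set into variable selection.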
Two studies did not report the performance of their best algorithm (37, 38). The other studies most commonly reported accuracy and AUC. For example, Acion et al., classifying substance use disorder treatment success in Hispanic patients, found that the AUC of the studied models ranged from 0.793 to 0.820; the ensemble method achieved an AUC of 0.820, which was not significantly better than the traditional logistic regression model's AUC of 0.805 (31). Huber et al. identified a tree algorithm that differentiated native Europeans from non-European migrants with schizophrenia with an accuracy of 74.5% and a predictive power of AUC = 0.75 (43). In Liu et al., the trained ML model predicted ADHD in African American patients with an accuracy of 78% (39). In a similar study classifying ADHD, depression, anxiety, autism, intellectual disabilities, speech/language disorder, developmental delays, and oppositional defiant disorder in African Americans, the model distinguished patients with at least one MH diagnosis from controls with an accuracy of 65% (40). A second model in that study, aimed at predicting the diagnosis of two or more MH disorders, had low accuracy, with an exact match rate of 7.2-9.3% (40). Khatua and Nejdl (35) analyzed tweets acquired from Twitter feeds of self-identified refugees and categorized them into themes of the immigrant struggle with accuracies of 61.61% and 75.89%.
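For reference, the metrics cited above (accuracy, AUC, and the exact match rate used for multi-diagnosis prediction) can be computed as in this sketch; the prediction arrays are toy placeholders, not data from any reviewed study.

```python
# Illustrative computation of the reported metrics on toy predictions.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Binary classification: accuracy and AUC.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_score = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4])  # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)
print("accuracy:", accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))

# Multi-label prediction (two or more diagnoses per patient): the exact
# match rate counts a prediction as correct only if every label matches,
# which is why it is a much stricter criterion than per-label accuracy.
Y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
Y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
exact_match = (Y_true == Y_pred).all(axis=1).mean()
print("exact match rate:", exact_match)
```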
The included studies also used p-values to assess their ML algorithms. Goldstein and Bailey used multivariable logistic regression to examine the relationship between experienced discrimination and suicidal ideation in Hispanic patients (37). They found that 19.0% of Hispanic patients who experienced discrimination also experienced suicidal ideation, compared with 11.5% of patients who did not experience discrimination (p = 0.001). Moreover, Hispanic patients had 1.72 times the odds of having suicidal thoughts if they experienced discrimination compared to those who did not (p = 0.003). A study by Erol and Seçinti used regression analysis to study the relationship between PTSD and depression and various predictors in adolescent refugee minors (34). They found that moderate and severe changes in family income level and stress in food access predicted depression scores and PTSD symptoms (p < 0.01). Drydakis (36) used random effects models to estimate the relationship between the number of mobile applications facilitating immigrants' societal integration and immigrants' integration, health, and mental health. The results showed a negative association between the number of standard m-Integration applications and adverse MH status (p < 0.01). Model performance was also assessed using importance and normalized importance (42), root-mean-square error (RMSE) (32), and Least Absolute Shrinkage and Selection Operator (LASSO) coefficients (33).
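An analysis of this kind, a multivariable logistic regression whose exponentiated coefficients are read as odds ratios with Wald p-values, can be sketched as below. The data are synthetic stand-ins and the predictor set is an assumption for illustration only, not the variables used by Goldstein and Bailey.

```python
# Sketch of a multivariable logistic regression yielding odds ratios
# and p-values, analogous in form to the analysis summarized above.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "discrimination": rng.integers(0, 2, n),  # 1 = experienced discrimination
    "age": rng.normal(40, 12, n),             # hypothetical covariate
})
# Synthetic binary outcome loosely tied to the discrimination indicator.
linpred = -2.0 + 0.55 * df["discrimination"] + 0.01 * (df["age"] - 40)
df["suicidal_ideation"] = (rng.random(n) < 1 / (1 + np.exp(-linpred))).astype(int)

model = smf.logit("suicidal_ideation ~ discrimination + age", data=df).fit()
print(np.exp(model.params))  # exp(coefficient) = odds ratio
print(model.pvalues)         # Wald p-values for each predictor
```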
3.4 Cross Validation
Six studies used internal cross-validation methods (31-33, 40, 41, 43). Only one study used an external data set to validate its ML algorithm (39); this external validation reduced the algorithm's accuracy from 78% to 70-75% (39). Almost half of the included publications did not use, or did not report, a cross-validation method (34, 36-38, 42).
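The distinction between internal cross-validation and external validation can be illustrated with the following sketch; the "external" cohort here is a synthetic hold-out, whereas in a real study it would come from a different site or population, which is typically why performance drops.

```python
# Contrast between internal k-fold cross-validation and external validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# One synthetic dataset split into a development cohort and a stand-in
# "external" cohort (purely illustrative).
X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_dev, X_ext, y_dev, y_ext = train_test_split(X, y, test_size=0.2,
                                              random_state=0)

model = RandomForestClassifier(random_state=0)

# Internal validation: repeated train/test splits within the development data.
internal = cross_val_score(model, X_dev, y_dev, cv=5, scoring="accuracy")
print(f"internal CV accuracy: {internal.mean():.3f}")

# External validation: train once on all development data, then test on
# the held-out cohort that the model never saw during development.
model.fit(X_dev, y_dev)
print(f"external accuracy: {model.score(X_ext, y_ext):.3f}")
```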