We investigated the classification ability of voice features in different scenarios. Results showed that voice features can assist clinical diagnosis.
First, after matching, descriptive statistics indicated no distributional bias between the case and control groups in any task; this ruled out potential confounding factors (Pan et al., 2019) in predicting depression from voice. Using MFCC features, we built several logistic regression models for the different classification scenarios with the i-vector method.
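As a rough illustration of this setup, the sketch below trains a logistic regression classifier and reports the F-score and AUC on held-out data. The feature matrix is synthetic and merely stands in for i-vectors derived from MFCCs; the dimensions and split are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
# Hypothetical stand-in for i-vectors: 200 speakers x 100-dim features,
# binary labels (case vs. control).
X = rng.normal(size=(200, 100))
y = rng.integers(0, 2, size=200)
X[y == 1] += 0.3  # inject a small class difference so the model has signal

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = clf.predict(X_te)             # hard labels, for the F-score
prob = clf.predict_proba(X_te)[:, 1] # scores, for the AUC
print(f"F-score: {f1_score(y_te, pred):.2f}, AUC: {roc_auc_score(y_te, prob):.2f}")
```

The same pipeline is reused across all seven tasks; only the pair of groups fed into it changes.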
Type 1 comprises the AH and DH models, which establish the baseline classification ability of voice features. When classifying healthy versus non-healthy participants, the F-score was 0.82 and the AUC was 0.79, showing the extent to which voice features can distinguish mental illness from healthy controls.
Type 2 consists of the DH, BH and SH models, which investigate the ability of voice features to distinguish a specific mental illness from healthy controls. For the DH, BH and SH tasks, the F-score ranged from 0.73 to 0.80 and the AUC from 0.75 to 0.80. A substantial body of research has already classified depression versus health using voice features, and our DH results are consistent with those findings (Afshan et al., 2018; Alghowinem et al., 2013; Cummins et al., 2015; Horwitz et al., 2013; Quatieri & Malyska, 2012; Pan et al., 2019; Sidorov & Minker, 2014), so the performance on this task can serve as another baseline. The DH results also illustrate the effectiveness of our dataset. A ROC difference test showed no significant difference in the pairwise comparisons among the AH task and the three mental-illness-versus-healthy tasks.
Type 3 comprises the DB, DS and BS models, which further show the performance of voice features on pairwise classification among the three mental illnesses. Model performance on both DS (F-score = 0.83; AUC = 0.83) and BS (F-score = 0.91; AUC = 0.92) was strong; in fact, the F-score and AUC for the BS task are the highest across all seven tasks. The DB model, however, performed worst of all tasks (F-score = 0.44), and its AUC of 0.5 means that voice features did not contribute to discriminating depression from bipolar disorder.
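An AUC of 0.5 is exactly the chance level: a scorer whose outputs are unrelated to the labels attains it in expectation. A quick synthetic check:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)  # synthetic binary ground truth
random_scores = rng.random(10_000)        # scores unrelated to the labels
print(round(roc_auc_score(labels, random_scores), 2))  # hovers around 0.5
```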
A further pairwise ROC test showed no significant difference in model performance among the AH, DH, BH, SH, DS and BS tasks. However, the DB model performed significantly worse than the BS model. See Table 6.
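The text does not state which ROC difference test was used; DeLong's test is a common choice for paired comparisons. For two independent tasks, the comparison can be approximated by bootstrapping each task's AUC, as in this sketch (the function name and the synthetic demo data are ours, not the paper's):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_diff(y1, p1, y2, p2, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for the AUC difference between
    two independent classification tasks (labels y*, scores p*)."""
    rng = np.random.default_rng(seed)
    obs = roc_auc_score(y1, p1) - roc_auc_score(y2, p2)
    diffs = []
    for _ in range(n_boot):
        i1 = rng.integers(0, len(y1), len(y1))
        i2 = rng.integers(0, len(y2), len(y2))
        if np.unique(y1[i1]).size < 2 or np.unique(y2[i2]).size < 2:
            continue  # a resample lacked both classes; AUC undefined, skip
        diffs.append(roc_auc_score(y1[i1], p1[i1]) - roc_auc_score(y2[i2], p2[i2]))
    diffs = np.asarray(diffs)
    # p-value: how often the bootstrap distribution crosses zero
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return obs, min(p, 1.0)

# Synthetic demo: a near-perfect scorer vs. a chance-level one
rng = np.random.default_rng(1)
y_a = rng.integers(0, 2, 400)
p_a = y_a * 0.5 + rng.random(400) * 0.5  # separable: AUC near 1
y_b = rng.integers(0, 2, 400)
p_b = rng.random(400)                    # chance: AUC near 0.5
diff, p_value = bootstrap_auc_diff(y_a, p_a, y_b, p_b)
```

A small `p_value` here mirrors the DB-versus-BS finding: one task's AUC is reliably higher than the other's.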
To our knowledge, this is the first study to examine the discriminatory power of voice features for depression versus other mental illnesses with similar depressive symptoms. It shows that voice features can be applied not only to classifying depression versus health but also to detecting other mental illnesses, with considerable predictive accuracy. This research characterizes the distinguishing abilities of voice features under different classification scenarios.
The results are promising, which may be because the i-vectors capture mental-illness-relevant information. To extract i-vectors, the case and control groups are first pooled to learn the information shared between the two groups; this shared part is then removed from the voice data to obtain the i-vectors, which therefore capture mental-illness-relevant voice information. i-vector-based systems have proved effective on both short and long utterances, from 10 s to 5 min (Guo, Nookala, & Alwan, 2017; Guo et al., 2016, 2018; Guo, Yang, Arsikere, & Alwan, 2017). Evidently, they also capture different voice information for different mental illnesses when classifying between them.

Different mental illnesses manifest differently in the voice. For example, vocal features change with a depressed patient's mental condition and emotional state, perceived as monotony, hoarseness, breathiness, glottalization, and slur in the patient's voice (France & Shiavi, 2000; Low, Maddage, Lech, Sheeber, & Allen, 2010; Moore, Clements, Peifer, & Weisser, 2003; Mundt, Snyder, Cannizzaro, Chappie, & Geralts, 2007; Ozdas, Shiavi, Silverman, Silverman, & Wilkes, 2004; Trevino, Quatieri, & Malyska, 2011). The voice of schizophrenia patients can be characterized by poverty of speech, increased pauses, and distinctive tone and intensity, which have been associated with core negative symptoms such as diminished emotional expression, lack of vocal intonation, alogia, and difficulty controlling the voice to express affective and emotional content in the proper social context (Alpert, Rosenberg, Pouget, & Shaw, 2000; Cohen, Mitchell, Docherty, & Horan, 2016; Cohen, Najolia, Kim, & Dinzeo, 2012; Galynker, Cohen, & Cai, 2000; Guo, Xu, et al., 2018; Hoekert, Kahn, Pijnenborg, & Aleman, 2007; Millan, Fone, Steckler, & Horan, 2014; Trémeau et al., 2005).
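The extraction idea described above — pool both groups, estimate what they share, and remove it — can be caricatured as a simple mean-removal step. This is a deliberate simplification; real i-vector extraction fits a universal background GMM and a total-variability matrix, which this NumPy sketch with made-up feature vectors does not attempt.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical fixed-length voice feature vectors, one row per speaker
case = rng.normal(loc=1.0, scale=1.0, size=(50, 20))
control = rng.normal(loc=1.0, scale=1.0, size=(50, 20))
control[:, :5] -= 0.5  # small group-specific offset on a few dimensions

pooled = np.vstack([case, control])
shared = pooled.mean(axis=0)  # crude estimate of what both groups share

# Removing the shared part leaves residuals that emphasize
# group-relevant (illness-relevant) variation
case_resid = case - shared
control_resid = control - shared
```

By construction the pooled residuals average to zero, so any systematic difference remaining between `case_resid` and `control_resid` is group-specific.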
This research failed to classify depression versus bipolar disorder. Unipolar depression and bipolar depression are quite similar: bipolar disorder combines a depressive phase with a manic phase, which is why the two are so hard to distinguish. Research on bipolar disorder usually tracks users' phone-call recordings to extract voice features and detect different emotional states (depressive state, manic state) (Faurholt-Jepsen et al., 2016; Maxhuni et al., 2016). Future work should track emotional states to help distinguish depression from bipolar disorder, or explore other approaches.