In this section, the authors review widely used datasets for textual emotion detection, the main focus of this article, to inform researchers about recent advancements in the field. Researchers can either create their own datasets or use publicly available ones. This section highlights publicly available datasets with reliable labeling or annotation processes that are widely used in textual emotion detection research. They are described as follows.
The EmotionLines dataset represents a significant effort in the field of natural language processing and emotion recognition, providing researchers with a rich resource for studying emotions as expressed in dialogue. This dataset comprises 29,245 labeled utterances extracted from 2,000 dialogues, offering a diverse collection of emotional expressions for analysis and modeling (Chen et al., 2018).
EmotionLines categorizes each utterance into one of eight emotion labels: Ekman's six basic emotions (anger, disgust, fear, happiness, sadness, and surprise), plus a neutral category and a non-neutral category.
One of the notable challenges of the EmotionLines dataset is the imbalance in class distributions among the emotional categories. This imbalance can affect the performance of machine learning models trained on the dataset, as they may become biased towards predicting the majority classes (such as neutral or happiness) more accurately than the minority classes (such as anger or disgust).
To address this issue, researchers often employ various techniques such as resampling methods (e.g., oversampling minority classes), adjusting loss functions to penalize misclassifications of minority classes more heavily, or using ensemble methods designed to handle imbalanced datasets effectively. These approaches help in achieving more balanced performance across all emotion categories and enhance the robustness of emotion recognition models trained on EmotionLines.
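As a minimal sketch of two of these remedies, the following Python snippet computes inverse-frequency class weights (which could be passed to a weighted loss) and performs random oversampling of minority classes. The function names and the toy label counts are illustrative assumptions and do not reflect the actual EmotionLines class distribution.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until every class
    has as many examples as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy data for illustration only; these are NOT EmotionLines statistics.
toy_labels = ["neutral"] * 6 + ["happiness"] * 3 + ["anger"] * 1
toy_texts = [f"utterance {i}" for i in range(len(toy_labels))]

weights = inverse_frequency_weights(toy_labels)
balanced_texts, balanced_labels = random_oversample(toy_texts, toy_labels)
```

In this toy setting the rarest class (anger) receives the largest weight, and after oversampling every class has six examples. Oversampling should be applied to the training split only, so that duplicated utterances do not leak into evaluation data.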
The EmotionLines dataset has facilitated advancements in several areas of emotion research within natural language processing and human-computer interaction. Researchers have utilized this dataset to develop and evaluate emotion recognition systems, sentiment analysis tools, and dialogue systems that can better understand and respond to emotional cues in text-based interactions.
EmoBank represents a pivotal dataset in the domain of computational linguistics, focusing on the nuanced exploration of emotions conveyed through language. Developed by Buechel and Hahn (2017), EmoBank comprises roughly 10,000 sentences drawn from a diverse range of genres, ensuring broad applicability and relevance across different linguistic contexts.
One of the distinguishing features of EmoBank lies in its dual annotation approach, capturing not only the emotions expressed by writers but also the emotions perceived by readers. This dual perspective enriches the dataset by providing insights into how emotional content is conveyed and interpreted in textual communication. Such annotations offer a comprehensive view of emotional dynamics within language, facilitating deeper analyses into the alignment or divergence between intended and perceived emotional states.
In addition to expressive and perceptive annotations, a subset of the EmoBank corpus has been annotated according to Ekman's six basic emotions: anger, disgust, fear, happiness, sadness, and surprise. This annotation schema, established by Paul Ekman, categorizes emotions based on universal facial expressions and physiological responses, providing a standardized framework for understanding emotional states across cultures and contexts.
The inclusion of Ekman's basic emotions in EmoBank enables mappings between different representation formats of emotions within the dataset. This linkage facilitates interdisciplinary research, allowing computational linguists and psychologists alike to explore correlations between linguistic expressions and fundamental emotional categories. Such mappings not only enhance the dataset's utility for emotion recognition and sentiment analysis tasks but also foster a deeper understanding of how specific emotions manifest linguistically.
EmoBank's contributions extend beyond its dual annotation and comprehensive emotion mapping capabilities. It serves as a fundamental resource for advancing various applications in computational linguistics, including sentiment analysis, emotion-aware dialogue systems, and affective computing. By providing a large-scale, diverse dataset with rich emotional annotations, EmoBank supports the development and evaluation of sophisticated machine learning models that can interpret and respond to emotional cues in text effectively.
The EmotionPush dataset represents a pioneering effort in the realm of natural language processing and emotion analysis, focusing specifically on private dialogues in social spoken-language interactions. Developed as a repository of instant message logs and corresponding read event logs from real conversations on Facebook Messenger, EmotionPush comprises a total of 162,031 message logs. Key to its creation is the emphasis on ensuring data privacy and utility through innovative masking techniques and novel task proposals.
Privacy is paramount in datasets derived from private conversations, and EmotionPush addresses this concern by meticulously masking all named entities. Each entity is anonymized using a code composed of its type and a unique identifier, thus safeguarding the identities of individuals while maintaining the integrity of the conversational data for analysis. Furthermore, to balance data utility with privacy, the dataset is released partially in its original textual form and partially in the form of word embeddings, ensuring that researchers can explore both semantic and syntactic aspects of the conversations.
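The masking idea can be sketched as follows. This is an illustrative assumption rather than the procedure used by the EmotionPush authors: the code format `<TYPE_ID>`, the function name `mask_entities`, and the input format (a list of entity mentions with their types, assumed to come from an upstream named-entity recognizer) are all hypothetical.

```python
import re


def mask_entities(text, entities):
    """Replace each entity mention with a code built from its type and a
    per-type identifier, so repeated mentions share the same code.

    `entities` is a list of (surface_form, entity_type) pairs, assumed to be
    produced by a named-entity recognizer that is not shown here.
    """
    codes = {}     # (lowercased surface form, type) -> assigned code
    counters = {}  # entity type -> number of distinct entities seen so far
    masked = text
    # Process longer mentions first so that a short mention does not
    # break up a longer one that contains it.
    for surface, etype in sorted(entities, key=lambda e: -len(e[0])):
        key = (surface.lower(), etype)
        if key not in codes:
            counters[etype] = counters.get(etype, 0) + 1
            codes[key] = f"<{etype}_{counters[etype]}>"
        pattern = re.compile(r"\b" + re.escape(surface) + r"\b", re.IGNORECASE)
        masked = pattern.sub(codes[key], masked)
    return masked


# Invented example sentence; not taken from the dataset.
example = mask_entities(
    "Alice told Bob that Alice will be in Paris.",
    [("Alice", "PERSON"), ("Bob", "PERSON"), ("Paris", "LOCATION")],
)
print(example)
```

Because both mentions of the same person map to the same code, the conversational structure (who is referred to where) is preserved even though the identity itself is removed.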
The availability of EmotionPush has significant implications for advancing research in natural language processing, particularly in emotion-aware computing and dialogue systems. Researchers can leverage this dataset to train and evaluate machine learning models that understand and respond to emotional cues in real-time conversations. By analyzing patterns in emotional expressions and response dynamics, advancements can be made in sentiment analysis, affective computing, and the development of empathetic AI-driven interfaces.
Experimental results with EmotionPush underscore its efficacy in supporting state-of-the-art models for emotion classification and response time prediction. As research continues, future directions may involve expanding the dataset to include additional emotional dimensions, such as nuanced sentiment analysis or cultural variations in emotional expression. Furthermore, advancements in privacy-preserving techniques and data augmentation methodologies will enhance the dataset's utility while maintaining robust privacy protections.
The EmoryNLP dataset is a comprehensive resource for studying emotional expressions within narrative contexts, offering annotated utterances across various scenes and episodes. Developed to capture the spectrum of human emotions as portrayed in fictional narratives, EmoryNLP comprises 97 episodes, 897 scenes, and a total of 12,606 annotated utterances. Each utterance is labeled with one of seven emotions: six primary emotions (sad, mad, scared, powerful, peaceful, and joyful) plus a default neutral category.
The structure of EmoryNLP facilitates a detailed exploration of emotional dynamics within narrative discourse. Episodes and scenes provide contextual frameworks within which utterances are analyzed, capturing the interplay between characters, events, and emotional states throughout a narrative arc. The annotation process ensures that each utterance is classified according to its predominant emotional content, enabling researchers to delve into the distribution and portrayal of emotions across different narrative contexts.
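The episode, scene, and utterance hierarchy can be represented with a few simple data classes, as in the sketch below. The class and field names are hypothetical and do not correspond to the released file format; the example utterances are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

# The seven labels described above (six primary emotions plus neutral).
EMOTIONS = {"sad", "mad", "scared", "powerful", "peaceful", "joyful", "neutral"}


@dataclass
class Utterance:
    speaker: str
    text: str
    emotion: str

    def __post_init__(self):
        # Reject labels outside the seven-category scheme.
        if self.emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {self.emotion}")


@dataclass
class Scene:
    scene_id: str
    utterances: list = field(default_factory=list)


@dataclass
class Episode:
    episode_id: str
    scenes: list = field(default_factory=list)

    def emotion_counts(self):
        """Count emotion labels over all utterances in all scenes."""
        return Counter(u.emotion for s in self.scenes for u in s.utterances)


# Invented toy episode; not taken from the dataset.
episode = Episode(
    "e01",
    [
        Scene(
            "e01_s01",
            [
                Utterance("A", "I got the job!", "joyful"),
                Utterance("B", "That's great.", "joyful"),
                Utterance("A", "I start on Monday.", "neutral"),
            ],
        )
    ],
)
counts = episode.emotion_counts()
```

Keeping utterances nested inside scenes makes it straightforward to compute label distributions per scene or per episode, or to feed preceding utterances to a model as context.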
Willcox's feeling wheel, a conceptual framework for categorizing emotions, serves as the basis for EmoryNLP's emotion annotation (Willcox, 1982). This approach aligns with established psychological theories of emotion, providing a structured yet nuanced representation of emotional states ranging from core emotions like sadness, anger, and fear to more nuanced states like powerfulness, peacefulness, and joyfulness. By incorporating these categories, EmoryNLP enriches the dataset with a diverse palette of emotional expressions that reflect the complexity of human affective experiences.
Analyzing EmoryNLP offers insights into how emotions are portrayed and perceived in narrative discourse, shedding light on storytelling techniques, character development, and emotional arcs within literary and cinematic works. Future research directions may involve expanding the dataset to include additional genres, languages, or cultural contexts to capture broader nuances in emotional expression and perception. Furthermore, advancements in computational linguistics and machine learning will continue to refine methodologies for emotion annotation and analysis, enhancing the dataset's utility and applicability across interdisciplinary research domains.
The SemEval-2019 Task 3 dataset represents a significant contribution to the field of natural language processing, specifically focusing on emotion recognition and classification within textual data. Developed by Chatterjee et al. (2019), this dataset comprises a total of 30,000 texts, consisting of 15,000 emotion-labeled texts and an additional 15,000 unlabeled texts. The labeled texts are categorized into three primary emotions: happy, sad, and angry, providing a structured foundation for studying emotional expressions in various linguistic contexts.
The SemEval-2019 Task 3 dataset is structured to facilitate research in emotion classification, offering a balanced distribution of labeled texts across three core emotional states: happy, sad, and angry. Each text instance within the labeled subset is manually annotated to reflect one of these emotional categories, ensuring a reliable ground truth for training and evaluating machine learning models.
In addition to the labeled texts, the dataset includes a substantial set of 15,000 unlabeled texts. These texts give researchers a pool for exploring semi-supervised learning approaches, in which algorithms leverage both labeled and unlabeled data to improve classification accuracy and robustness. This aspect of the dataset encourages the development of methods that can harness large amounts of unannotated data to enhance the performance of emotion recognition systems.
While the SemEval-2019 Task 3 dataset provides a valuable resource for studying basic emotions like happiness, sadness, and anger, challenges such as ambiguity in emotional expressions and cultural variations in emotion perception remain pertinent. Addressing these challenges requires robust annotation methodologies and cross-cultural validation to ensure the dataset's applicability across diverse linguistic and cultural contexts.
Moreover, future research directions may involve expanding the dataset to include additional emotional categories, exploring multimodal approaches that incorporate visual and auditory cues alongside textual data, and adapting models to handle nuanced emotional expressions beyond the core emotions defined in the current dataset.
In the realm of natural language understanding and sentiment analysis, the GoEmotions dataset emerges as a comprehensive resource for studying the spectrum of human emotions expressed in online conversations. Curated from Reddit comments, this dataset comprises 58,000 meticulously labeled instances, each annotated across 27 distinct emotion categories along with a Neutral label. Developed to capture the diverse emotional nuances inherent in digital communication, GoEmotions offers a nuanced exploration of how individuals express and perceive emotions in an online context (Demszky et al., 2020).
GoEmotions distinguishes itself through its expansive range of emotion categories, totaling 27 labels that encompass a broad spectrum of emotional states. These categories include admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, and surprise. Such granularity allows researchers to delve deeply into specific emotional nuances and variations that might otherwise be overlooked in broader sentiment analyses.
The availability of GoEmotions has significantly bolstered research in natural language processing (NLP) and sentiment analysis. Researchers and practitioners can utilize this dataset to train and evaluate machine learning models capable of accurately detecting and interpreting a wide array of emotions in text. Applications span sentiment analysis tools, emotion-aware chatbots, social media monitoring systems, and beyond, where understanding emotional nuances is crucial for enhancing user interaction and engagement.
Beyond its utility in NLP applications, GoEmotions provides valuable insights into the emotional dynamics of online communities. By analyzing how emotions are expressed across different topics and user interactions on Reddit, researchers can gain a deeper understanding of societal trends, cultural shifts, and collective emotional responses within digital spaces. Such insights not only inform academic research but also offer practical implications for designing platforms that foster positive emotional experiences and mitigate negative sentiments.
As the field of emotion-aware computing continues to evolve, GoEmotions serves as a foundational resource for advancing methodologies and technologies that incorporate emotional intelligence into computational systems. Future directions may involve expanding the dataset to include additional languages, dialects, or social media platforms to capture diverse cultural and linguistic nuances in emotional expression. Moreover, ongoing efforts to refine annotation methodologies and ensure dataset scalability will further enhance GoEmotions' impact and utility across various domains.
Table 2 summarizes some of these datasets and the range of sentiments/emotions they cover:
Table 2
Dataset | Data size | Sentiments/emotions | Number of categories |
EmotionLines | 29,245 labeled utterances from 2,000 dialogues | anger, disgust, fear, happiness, sadness, surprise, neutral, and non-neutral | 8 |
EmotionPush | 91,000 records | joy, sadness, anger, fear, disgust, and surprise | 6 |
EmotionNLP | 60,000 records | anger, anticipation, disgust, fear, joy, sadness, surprise, and trust | 8 |
SemEval Tasks: SST-2 | 9,613 reviews | positive and negative | 2 |
SemEval Tasks: SemEval-2014 (Task 4) | 5,936 reviews for training and testing | positive, negative, and neutral | 3 |
SemEval Tasks: SemEval-2018 (Affect in dataset task) | 1,758 reviews for testing | anger, joy, sadness, and fear | 4 |
GoEmotions | 58,000 records | admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, and neutral | 28 |
EmoBank | 10,548 records | Valence-Arousal-Dominance (VAD) model | - |
Table 2 shows that most existing datasets in natural language processing categorize emotions into a limited range of 6 to 10 categories, which simplifies the complexity of human emotional expression but may not capture the full spectrum of nuanced emotions. An exception is the GoEmotions dataset, which stands out with its annotation of 28 distinct emotion categories. However, this richness comes with class-balance limitations: some emotions are underrepresented compared to others, which can affect the performance of models trained on this data.
Despite the availability of datasets like GoEmotions that expand the number of emotion categories, there remains a scarcity of datasets that comprehensively address additional dimensions such as valence, arousal, and dominance. These dimensions are crucial in understanding the intensity, positivity or negativity, and control associated with emotions, which are essential for applications ranging from affective computing to mental health assessments. The scarcity of datasets covering these scales underscores a significant gap in emotion data resources, limiting the development and accuracy of models that aim to capture the multifaceted nature of human emotions in textual data.
Finding 3
Most available datasets categorize emotions into 6–10 categories. The only exception is the GoEmotions dataset, which offers 28 categories but has balance limitations. Additionally, few datasets address the valence, arousal, and dominance scales, despite the high demand for this type of emotion data in many applications.