With the growing role of social media and online communities in our daily lives, monitoring and managing hate speech has become particularly important. Online platforms not only expand the channels through which people exchange information and facilitate communication, but also foster hate speech owing to the anonymity and freedom of online expression [1]. The United Nations defines hate speech as any kind of communication in speech, writing, or behavior that attacks or uses pejorative or discriminatory language with reference to a person or a group on the basis of who they are; in other words, based on their religion, ethnicity, nationality, race, color, descent, gender, or other identity factors [2]. Hate speech can lead to social conflicts and vicious incidents, and even endanger the harmonious development of society. For example, the second study on hate speech and discrimination in Costa Rica, conducted in 2022, found that hate speech caused division and a hostile atmosphere in the country [3]. Therefore, detecting hate speech has become a key social issue and is regarded as an important research task with potentially significant benefits.
In multimodal scenarios, hate speech detection (HSD) faces great challenges. Owing to the huge volume of data involved, traditional manual detection methods are inefficient [4, 5]. With the development of natural language processing and machine learning, hate speech in text can now be identified automatically [6]. However, with the development of multimedia, communication is no longer limited to text: people increasingly prefer to superimpose modalities, using pictures, videos, and text together as carriers of information. As information carriers diversify, hate speech becomes harder to detect. As shown in Fig. 1, an image may not be hateful on its own, but it can convey different emotional tendencies when paired with a particular sentence. The examples in Fig. 1, once text is paired with images, express gender discrimination, insult, pornography, and other hateful connotations: the text in the first image reflects anti-lesbian remarks; the second image implies support for drugs, particularly through the needle and other signs in the picture; the text in the third image, combined with the picture, implies racial discrimination; and the fourth image compares a hanging to a swing, carrying connotations of violent death.
Multimodal hate speech detection is defined as a technology that uses multimodal information, including images and text, to capture and distinguish hate speech [7]. Compared with text-only analysis, this approach can automate hate speech detection more accurately and efficiently.
This paper aims to explore the methods and applications of multimodal hate speech detection and to construct a detection model based on image-text fusion that improves the accuracy and robustness of hate speech detection. Hateful Memes, currently the most representative multimodal hate speech dataset, is used for training and validating the model.
The main contributions of our work are summarized as follows:
1) This paper proposes a new joint model for multimodal hate speech detection, which uses a moving-window, multi-level visual feature extractor for the visual modality and a RoBERTa pre-trained model for the textual modality. In addition, the model structure is relatively clear and highly interpretable.
2) This paper innovatively introduces a multi-head self-attention mechanism at the fusion stage. This mechanism allows the model to dynamically adjust the weights of text features, effectively fusing image and text information so that important text segments align better with image features, thereby improving the fusion effect.
3) This paper conducts a series of experiments on the characteristics, content, and modalities of Hateful Memes, the benchmark dataset for multimodal hate speech. We separate the text and the image in each meme, removing the overlaid text so that it does not interfere with image feature extraction. With the proposed joint method, this paper achieves better results on Hateful Memes than the SOTA models.
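To make the attention-based fusion idea in contribution 2) concrete, the following is a minimal NumPy sketch of multi-head attention fusion in which text tokens act as queries and image patches as keys/values, so each text segment is reweighted by how strongly it matches visual content. All dimensions, the random projection weights, and the cross-attention formulation are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_fusion(text_feats, image_feats, num_heads=4, seed=0):
    """Illustrative multi-head attention fusion (not the paper's exact model):
    text tokens are queries, image patches are keys/values."""
    d = text_feats.shape[-1]
    assert d % num_heads == 0
    dh = d // num_heads
    rng = np.random.default_rng(seed)
    # Random matrices stand in for learned projection weights.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    Q = text_feats @ Wq    # (T, d) text queries
    K = image_feats @ Wk   # (P, d) image keys
    V = image_feats @ Wv   # (P, d) image values
    T, P = Q.shape[0], K.shape[0]
    # Split into heads: (heads, tokens, dh).
    Qh = Q.reshape(T, num_heads, dh).transpose(1, 0, 2)
    Kh = K.reshape(P, num_heads, dh).transpose(1, 0, 2)
    Vh = V.reshape(P, num_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (heads, T, P).
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dh)
    attn = softmax(scores, axis=-1)
    # Aggregate image values per text token, merge heads, project out.
    fused = (attn @ Vh).transpose(1, 0, 2).reshape(T, d) @ Wo
    return fused, attn

# Toy example: 5 text tokens and 9 image patches, feature dimension 16.
text = np.random.default_rng(1).standard_normal((5, 16))
image = np.random.default_rng(2).standard_normal((9, 16))
fused, attn = multi_head_fusion(text, image)
# fused has shape (5, 16); each row of attn sums to 1 over the 9 patches.
```

In a trained model the projection matrices would be learned and the fused features would feed a classification head; the sketch only shows how the attention weights let text features be dynamically reweighted against image features.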