The MixAttacker method mainly includes three parts:
(1) For the original input sentence, the words that are most sensitive to the classification model are selected. That is, an importance score is calculated for each word, and the words are sorted in descending order of these scores.
(2) The words are processed in the order given by the above ranking. For each word, WoBERT generates a new word based on the context of the sentence in which the original word occurs, and this new word replaces the original word in the sentence.
(3) The best sentence among those obtained in the WoBERT prediction stage is selected and further processed with back-translation.
The specific processing flow of the above three parts is illustrated in Fig. 1.
3.1 Adversarial attacks in text classification
An adversarial example is an example obtained after a reasonable transformation of the original text. Specifically, a qualified adversarial example should have the following characteristics: (1) It has some changes compared with the original input. (2) It is difficult for humans to detect such changes. (3) The adversarial example makes the classification result different from the original text, that is, the attack is successful.
For a text classification task, given a dataset of N sentences and their corresponding labels \(\left(X,Y\right)=\{\left(X_{1},Y_{1}\right),\left(X_{2},Y_{2}\right),\dots ,\left(X_{N},Y_{N}\right)\}\), there is a classifier F that maps \(X\to Y\). For a sentence \(X\in \mathbf{X}\), an effective adversarial example \(X_{adv}\) should satisfy the following requirements:
$$F\left(X_{adv}\right)\ne F\left(X\right),\quad \text{and}\quad Sim\left(X_{adv},X\right)\ge \epsilon ,\quad \text{and}\quad Ptb\left(X_{adv},X\right)\le \mu \tag{1}$$
Sim(·) measures the semantic similarity between the original sentence and the adversarial example, and Ptb(·) measures the degree of perturbation added to the adversarial example relative to the original sentence. In addition, a high-quality adversarial example should not only satisfy the semantic-similarity and perturbation constraints but also remain fluent.
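To make the constraints in Formula 1 concrete, the sketch below checks a candidate against a classifier. The concrete choices of Sim(·) (cosine similarity over sentence embeddings) and Ptb(·) (the fraction of changed word positions) are illustrative assumptions of this sketch, not the paper's exact definitions.

```python
# Illustrative check of Formula 1: a candidate is a valid adversarial example
# only if the label flips, semantic similarity stays high, and the perturbation
# stays small. `classify`, `embed`, `segment`, and the thresholds are
# placeholders standing in for the paper's actual components.
from typing import Callable, List
import math

def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def perturbation_ratio(orig_words: List[str], adv_words: List[str]) -> float:
    # Fraction of word positions that differ (assumes equal-length segmentations).
    changed = sum(1 for a, b in zip(orig_words, adv_words) if a != b)
    return changed / max(len(orig_words), 1)

def is_valid_adversarial(x, x_adv, classify: Callable, embed: Callable,
                         segment: Callable, eps: float = 0.8, mu: float = 0.2) -> bool:
    label_flipped = classify(x_adv) != classify(x)                  # F(X_adv) != F(X)
    sim_ok = cosine(embed(x_adv), embed(x)) >= eps                  # Sim(X_adv, X) >= eps
    ptb_ok = perturbation_ratio(segment(x), segment(x_adv)) <= mu   # Ptb(X_adv, X) <= mu
    return label_flipped and sim_ok and ptb_ok
```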
3.2 Characteristics of Chinese
In Chinese, the character is the smallest unit of text, while the word, as defined by Wikipedia, is the smallest unit that can be used independently and carries semantic content. In order to preserve as much of the semantics of the original corpus as possible and to ensure more reasonable sentence transformations, we process the original sentence at the word level when generating adversarial examples.
Chinese words can consist of one, two, or more characters, as shown in Table 2. A proper grasp of these characteristics helps keep the generated results closer to our design goals.
Table 2
Chinese character and word
| Number of Chinese characters | Chinese word |
| --- | --- |
| One character | 风 (wind), 书 (book), 糖 (sugar) |
| Two characters | 电脑 (computer), 家庭 (family), 跑步 (run) |
| More than two characters | 巧克力 (chocolate), 言而无信 (fail to keep faith) |
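Because the attack operates at the word level, the input sentence must first be segmented into words. A minimal sketch follows, using jieba only as a stand-in segmenter (the method itself segments with WoBERT's word-level vocabulary and tags part of speech with LTP).

```python
# Word-level segmentation of a Chinese sentence. jieba is used here only as a
# stand-in; the method itself relies on WoBERT's word-level tokenization.
import jieba

sentence = "无论失去什么, 都不要失去好心情。"
words = jieba.lcut(sentence)
print(words)  # e.g. ['无论', '失去', '什么', ',', ' ', '都', '不要', '失去', '好', '心情', '。']
```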
3.3 Ranking the importance of words
Considering that each word in the input sentence has a different influence on the classification result, it is necessary to calculate the importance of each word in the sentence so as to select the words that are most sensitive to the classification model for subsequent processing. This makes the generation of adversarial examples more targeted and further ensures that the differences between the adversarial example and the original text are minimized.
Following TEXTFOOLER [9], we compare the change in label prediction between the original sentence and the sentence with the target word removed, and use this change as the main basis for scoring the target word. On top of this, we introduce a part-of-speech weighting strategy. Scoring the target words from these two perspectives distinguishes the importance of each word in the sentence, as summarized in Formula 2.
$$I_{w_i}=\alpha \left(TF_{w_i}+A\right) \tag{2}$$
The score \(I_{w_i}\) is assigned to each word \(w_i\in X\). The equation describes how part-of-speech weighting is applied after the score \(TF_{w_i}\) is computed. In a given scenario, the part of speech of word \(w_i\) is fixed and unique, and α is the weight of that part of speech. For example, when classifying emotional tendencies, α can be set to 100 for adjectives and 1 for other parts of speech to highlight the importance of adjectives. During the experiments, we found that \(TF_{w_i}\) fluctuates within a small interval (roughly \(TF_{w_i}\in (-3,10)\)). Since \(TF_{w_i}\) may be negative, a positive constant A is added so that the weight is applied to a number greater than 1, which yields the desired weighting effect. The specific definition of \(TF_{w_i}\) is as follows:
$$TF_{w_i}=\begin{cases}F_{Y}\left(X\right)-F_{Y}\left(X_{\backslash w_i}\right), & \text{if } F\left(X\right)=F\left(X_{\backslash w_i}\right)=Y \\ \left(F_{Y}\left(X\right)-F_{Y}\left(X_{\backslash w_i}\right)\right)+\left(F_{\bar{Y}}\left(X_{\backslash w_i}\right)-F_{\bar{Y}}\left(X\right)\right), & \text{if } F\left(X\right)=Y,\ F\left(X_{\backslash w_i}\right)=\bar{Y},\ \text{and } Y\ne \bar{Y}\end{cases} \tag{3}$$
By querying the classification model F, we can judge the influence of \(w_i\) in the sentence, where \(X_{\backslash w_i}=\{w_1,w_2,\dots ,w_{i-1},w_{i+1},\dots ,w_n\}\) denotes the sentence after deleting the word \(w_i\), and \(F_{Y}(\cdot)\) denotes the predicted score for label Y. The formula shows how the sensitivity of \(w_i\) to the classification model is calculated in two cases (the predicted label is unchanged or changed after the word is deleted).
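A minimal sketch of Formulas 2 and 3 follows, assuming the classifier exposes per-label probabilities and that the original sentence is correctly classified as Y. The adjective weight of 100 and the offset A follow the example values mentioned above, and the part-of-speech tag is treated as given (e.g., from LTP).

```python
# Sketch of the word-importance score I_{w_i} = alpha * (TF_{w_i} + A) from
# Formulas 2 and 3. `predict_probs` is assumed to return a dict of
# label -> probability for a list of words joined back into a sentence.
from typing import Callable, Dict, List

A = 10.0  # positive offset so the weighted quantity stays above 1 (illustrative value)

def tf_score(words: List[str], i: int, y: str,
             predict_probs: Callable[[List[str]], Dict[str, float]]) -> float:
    probs = predict_probs(words)
    reduced = words[:i] + words[i + 1:]              # X \ w_i
    probs_del = predict_probs(reduced)
    y_del = max(probs_del, key=probs_del.get)        # predicted label after deletion
    if y_del == y:                                   # prediction unchanged (first case)
        return probs[y] - probs_del[y]
    # prediction changed to some other label \bar{Y} (second case)
    return (probs[y] - probs_del[y]) + (probs_del[y_del] - probs[y_del])

def importance(words: List[str], i: int, y: str, pos_tag: str,
               predict_probs: Callable[[List[str]], Dict[str, float]]) -> float:
    alpha = 100.0 if pos_tag == "adjective" else 1.0  # part-of-speech weighting example
    return alpha * (tf_score(words, i, y, predict_probs) + A)
```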
3.4 Using WoBERT to generate replacements
In order to generate qualified adversarial examples as efficiently as possible, we no longer rely on heuristics over corpus data; instead, following BERT-Attack (Li et al., 2020), we use a pre-trained masked language model to predict alternatives for keywords from their context, thus generating new sentences that are similar in semantics but different in surface form.
We chose to make predictions for the Chinese words in the sentence for the following reasons:
(1) Shorter sequences and faster processing: a Chinese word consists of n Chinese characters, so the sequence obtained by segmenting a sentence at the word level is necessarily shorter than the character-level sequence, and a word-level model therefore processes it faster than a character-level model.
(2) Mitigating the exposure bias problem in generation: a word-based model predicts one complete Chinese word (n characters) at a time, which is equivalent to a character-based model making n recursively dependent predictions over n steps. The exposure bias problem of the character-based model is therefore more severe than that of the word-based model.
(3) Lower uncertainty of word meanings reduces modeling complexity: in Chinese, the meaning of a character carries more uncertainty than the meaning of a word. As a result, a single embedding layer is sufficient to represent a word's meaning, whereas representing a character's meaning requires multiple layers of embeddings.
In most current BERT-style pre-trained language models for Chinese, such as BERT-wwm (Cui et al., 2021) and AMBERT (Zhang et al., 2021), the basic unit is still the character. These models only incorporate word information indirectly and do not truly capture word-level information from the original corpus. The emergence of WoBERT, a Chinese pre-trained model that genuinely operates at word granularity, confirms the desirability of a word-based pre-training model. WoBERT continues pre-training from the open-source RoBERTa-wwm-ext (Cui et al., 2021) released by Harbin Institute of Technology, using words as the unit and masked language modeling (MLM) as the training task. Its experimental results (Sun, 2020) show that WoBERT is comparable to BERT on NLP tasks that do not require exact boundaries and is significantly faster. In particular, the word-level model works better than the character-based model on generation tasks. For these reasons, the WoBERT-MLM model is finally chosen to predict replacement candidates.
In addition, to preserve the original semantics as much as possible and reduce the cost of running MLM predictions in each iteration, we follow the BERT-Attack setting: the selected words are not masked in the original sentence; instead, the complete sequence is used as input, and top-N predictions of replacement words are made at the positions where the selected words occur. Building on this idea, we design our approach for expanding the sources of the replacement candidate set. Throughout the WoBERT prediction and sentence generation process, the replacement candidate set we search contains the results generated by the following three modes:
(1) mode1: For the target word, predict possible replacement words at the word's position in the original sentence and add the top-N predicted words to the replacement candidate set.
(2) mode2: For the target word, predict possible insertion words immediately before or after the word's position in the original sentence, and join each of the top-N insertion words with the target word to form a new word in the replacement candidate set.
(3) mode3: For each replacement word selected in mode1, take the new sentence obtained after replacing the original word as input, predict possible insertion words before or after the replacement word's position, and join each of the top-N insertion words with the replacement word to form a new word in the replacement candidate set.
The whole replacement candidate set generation strategy is illustrated in Fig. 2.
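The sketch below shows how mode1 could be realized with a Hugging Face masked language model, following the unmasked-input setting described above. The model identifier is a placeholder (an assumption of this sketch, not the released WoBERT checkpoint), the word-to-token mapping assumes a fast tokenizer, and WoBERT itself requires its own word-level tokenizer; mode2 and mode3 would reuse the same prediction call at an insertion slot next to the target or replacement word.

```python
# Sketch of mode1: predict top-N replacement words at the target word's
# position without masking it, following the BERT-Attack setting.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "some-word-level-chinese-mlm"  # placeholder, not a real model id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def top_n_replacements(words, target_index, n=10):
    """Return the top-N vocabulary items predicted at the target word's slot."""
    # Keep the full word sequence as input; do not replace the target with [MASK].
    inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    # Map the target word index to its first sub-token position (fast tokenizer).
    token_pos = inputs.word_ids(0).index(target_index)
    top_ids = torch.topk(logits[token_pos], n).indices.tolist()
    return [tokenizer.decode([tid]).strip() for tid in top_ids]

# mode2/mode3 follow the same pattern but score an insertion slot next to the
# target word (mode2) or next to an already-substituted replacement (mode3).
```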
3.5 Back-Translation
In each iteration, the new sentences generated by the MLM method may have two problems: first, they may not successfully attack the target model, and second, they may contain unreasonable grammar or collocations that hurt the fluency of the expression. To alleviate these problems while keeping the semantic loss small, back-translation is introduced. Back-translation takes a sentence in a given language, translates it into an intermediate language, and then translates it back into the original language. This process aims to optimize details such as adjusting the sentence structure and standardizing the wording while ensuring that the semantics conveyed by the sentence are not distorted.
To ensure the quality of back-translation, MixAttacker uses the Youdao API with English as the intermediate language, so that the "Chinese-English-Chinese" translation process gives the sentence more of the characteristics of a high-quality adversarial example. Results of the back-translation are illustrated in Table 3.
Table 3
Examples generated at the WoBERT and Back-Translation stages
| (label) Original text | (label) WoBERT generation | (label) Back-translation |
| --- | --- | --- |
| (none) 无论失去什么, 都不要失去好心情。 | (none) 无论失去什么, 都我不想失去好心情。 | (like) 无论我失去什么, 我都不想失去我的好心情。 |
| (politic) 台南法院将于4月8日开庭审理张铭清遇袭案。 | (politic) 南湖法院将于4月8日开庭审理张铭清遇袭案。 | (society) 南湖法院将于4月8日审理张某的案件。 |
| (politic) 文化部: 营业性演出不得以假唱假演奏欺骗观众。 | (politic) 中国文联解读: 营业性演出不得以假唱假演奏欺骗观众。 | (society) 商业表演不应该假唱来欺骗观众。 |
| (negative) 虽然预定比较方便, 但是宾馆的对顾客的关怀不足, 房间冷得让人睡不着觉服务人员还说暖和的!!! | (positive) 虽然预定比较方便, 但是宾馆的对顾客的关怀不足, 房间有些寒冷得让人睡不着觉服务人员还说暖和的!!! | (positive) 虽然预订比较方便, 但是酒店对顾客的关心是不够的, 房间有些冷让人睡不着, 服务人员也说温暖!!! |
| (negative) 房间还可以, 不脏;热水放了30分钟后才有点热, 洗的不爽;还不可以刷卡。 | (positive) 房间还可以, 不脏太;热水放了30分钟后才有点热, 洗的不爽;还不可以刷卡。 | (positive) 房间还可以, 不太脏;热水放了30分钟才有点热。你还不能刷卡。 |
(The back-translation examples in the table reflect the effect of expression-level adjustments on the sentences generated in the WoBERT stage; they show that the back-translated sentences are generally more fluent.)
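The back-translation step itself reduces to a "Chinese-English-Chinese" round trip. A minimal sketch follows; the paper calls the Youdao API, but here the translation call is abstracted into a generic `translate(text, src, dst)` function, which is an assumption of this sketch rather than an actual client.

```python
# Sketch of the "Chinese -> English -> Chinese" back-translation step. The
# paper uses the Youdao translation API; `translate` is an abstract callable
# so the round trip can be expressed without tying it to any specific client.
from typing import Callable

def back_translate(text: str,
                   translate: Callable[[str, str, str], str],
                   pivot: str = "en") -> str:
    """Round-trip the sentence through an intermediate language."""
    intermediate = translate(text, "zh", pivot)   # Chinese -> English
    return translate(intermediate, pivot, "zh")   # English -> Chinese

# Usage: back_translate("无论失去什么, 都不要失去好心情。", youdao_translate),
# where `youdao_translate(text, src, dst)` wraps the actual API call.
```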
In practice, we not only apply back-translation after the WoBERT stage but also place it at the top of the overall algorithm as an independent adversarial example generation strategy. Specifically (as illustrated in Algorithm 1), the model first back-translates the original input text and uses the back-translated result to attack the target model. If the attack succeeds, the back-translated result is output directly as the adversarial example (examples generated this way are illustrated in Table 4). Otherwise, the algorithm enters the processing flow illustrated in Fig. 1.
Table 4
Adversarial examples generated only by back-translation
| (label) Original text | (label) Back-translation |
| --- | --- |
| (negative) 感觉一般, 最便宜的房间没有窗户, 房间太小。不过, 里要去的办公地点很近。 | (positive) 最便宜的房间没有窗户, 而且太小了。然而, 办公室很近。 |
| (positive) 酒店还可以, 房内有电脑, 数字电视。早餐一般啦。隔音效果比较差, 下面是商业街, 不太安静。 | (negative) 酒店没问题。房间里有电脑和数字电视。早餐一般般。隔音效果差, 下面是商业街, 不是很安静。 |
| (positive) 酒店的硬件设施达到了四星的标准, 24小时供应热水, 水量也很足。但是服务只能算是三星的水平, 早餐最多是二星的标准。由于团队过多, 酒店晚上餐厅不提供点餐服务, 只有自助餐。和其他地方的HolidayInn比起来, 性价比差远了。 | (negative) 酒店的硬件达到了四星级标准, 热水全天24小时供应。但是服务只有三星级, 早餐最多是两星级。由于团体人数多, 酒店餐厅晚上不提供点餐服务, 只提供自助餐。与其他地方的HolidayInn相比, 它的性价比远不如假日酒店。 |
3.6 Algorithm description
We summarize the whole MixAttacker process in Algorithm 1.
Algorithm 1: MixAttacker algorithm

Input: text data X with original label Y; classification model F; functions: back-translation function BTranslate, word segmentation function Split, word-influence calculation function Score, descending sort by word influence SortByScore, top-N WoBERT prediction Pred
Output: adversarial example Xadv of text X
1. Initialization: Xadv ← X, Importance ← {}, Candidates ← {}
2. X' ← BTranslate(X)
3. If F(X') ≠ Y:
4.     Xadv ← X'
5.     Return Xadv
6. Else:
7.     W ← {w1, w2, …, wn} ← Split(X)
8.     for wi in W:
9.         Importance.add(Score(wi, X, Y, F))
10.    Wscore ← SortByScore(W, Importance)
11.    Candidates ← Pred(X) using the three modes
12.    for wi in Wscore:
13.        If Perturbation > 20%:
14.            Return None
15.        L ← {}
16.        for c in Candidates[wi]:
17.            L[c] ← w1 … wi−1 [c] wi+1 … wn
18.        If ∃ c in Candidates[wi] s.t. F(L[c]) ≠ Y:
19.            L' ← L[c'] where L[c'] has maximum similarity with X
20.        Else:
21.            L' ← L[c'] where L[c'] causes the maximum reduction in the probability of label Y in F(L[c'])
22.        X' ← BTranslate(L')
23.        If F(X') ≠ Y:
24.            Return Xadv ← X'
25.        Elif F(L') ≠ Y:
26.            Return Xadv ← L'
27. Return None
The overall process of the algorithm is described as follows:
Part1 (line 2–5): Back-translate the input sentence and attack the classifier with the back-translation result; output the back-translation result as the adversarial example if the predicted label changes. Otherwise, enter Part2 (as illustrated in Fig. 1).
Part2 (line 6–26):
A (line 6–10): Use WoBERT to segment the input sentence, use the LTP tool to tag the part of speech of each resulting word, and then obtain the words in descending order of importance via the importance-ranking calculation.
B (line 11–26): Iterate sequentially through the words in the descending sequence of word importance:
① (line 11–21) For the target word, new words are predicted using WoBERT-MLM (the three prediction modes used in this stage are illustrated in Fig. 2) and the top-N predicted words are selected. The n sentences obtained by replacing the original word with each of them are then used to attack the classifier. Finally, a sentence is selected that changes the predicted label and has the highest semantic similarity to the original input; if no such sentence exists, the sentence with the lowest confidence score for the original label is selected.
② (line 21–26) Back-translate the sentence selected in ① and use the back-translation result to attack the classifier. If the predicted label changes, the back-translation result is output as the adversarial example; if not, the result of ① is checked: if the sentence selected in ① changes the predicted label, it is output as the adversarial example. Otherwise, step B is performed on the next word in the descending sequence of word importance.
Note that when the perturbation threshold is reached (i.e., the last word allowed to be modified has been processed), if neither ① nor ② produces a sentence that changes the predicted label, the algorithm returns None.
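For reference, the following Python sketch condenses Algorithm 1 into a single driver function. All helpers (classify, label_prob, back_translate, segment, importance, candidates_for, similarity) are assumed to exist and mirror the functions named in Algorithm 1; the 20% perturbation threshold follows line 13, and sentence reconstruction by simple concatenation is an assumption of the sketch.

```python
# Condensed sketch of Algorithm 1. All helper callables are assumptions that
# stand in for the components named in the algorithm (BTranslate, Split,
# Score, Pred with its three modes, the classifier F, and Sim).
def mix_attacker(x, y, classify, label_prob, back_translate, segment,
                 importance, candidates_for, similarity, ptb_limit=0.2):
    # Part 1 (lines 2-5): try pure back-translation first.
    bt = back_translate(x)
    if classify(bt) != y:
        return bt

    # Part 2 (lines 6-26): word-by-word substitution guided by importance.
    words = segment(x)
    order = sorted(range(len(words)),
                   key=lambda i: importance(words, i, y), reverse=True)
    current = list(words)
    changed = 0
    for i in order:
        if changed / len(words) > ptb_limit:      # perturbation threshold (line 13)
            return None
        variants = [current[:i] + [c] + current[i + 1:]
                    for c in candidates_for(words, i)]   # candidates from the 3 modes
        if not variants:
            continue
        flips = [v for v in variants if classify("".join(v)) != y]
        if flips:                                  # lines 18-19: label flips, keep most similar
            best = max(flips, key=lambda v: similarity("".join(v), x))
        else:                                      # lines 20-21: keep the biggest confidence drop
            best = min(variants, key=lambda v: label_prob("".join(v), y))
        sent = "".join(best)
        bt_sent = back_translate(sent)             # line 22
        if classify(bt_sent) != y:                 # lines 23-24
            return bt_sent
        if classify(sent) != y:                    # lines 25-26
            return sent
        current = best
        changed += 1
    return None                                    # line 27
```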