The overall architecture of the HiDilated model proposed in this paper is shown in Fig. 2. It consists of two main components: the main classification task module and the optimization learning task module.
Main Classification Task: First, a text encoder encodes the input text sequence. A Multi-Granularity Fusion (MGF) module is designed to fully extract textual semantic information, centered on a Multi-Scale Gated-Dilated Convolutional Neural Network (MGDC) that captures long-distance dependencies between words. To avoid the noise introduced by heterogeneous fusion, the text features are reshaped through a fully connected layer and fed directly, in a serial structure, as node inputs to the label structure encoder. In the feature fusion layer, the textual information is then updated by the hierarchy-aware structure encoder. Because the labels stand in hierarchical relationships, the label structure encoder employs a Hierarchical Graph Convolutional Network (Hierarchical-GCN) to aggregate label information along three directions: top-down, bottom-up, and self-loop.
Optimization Learning Task: The structural encoder is utilized again to encode the labels. The label semantics and the text semantics obtained from the text encoder are projected into a joint embedding space. Two optimization learning tasks are introduced to model the semantic matching relationship between text and labels.
Overall, the model performs HMTC tasks under the guidance of both the main classification learning task and the optimization learning task.
2.1 Multi-Scale Gated-Dilated Convolution
This section will introduce dilated convolution, multi-scale dilated convolution, and multi-scale gated-dilated convolution in sequence.
When a CNN is used for feature extraction, the kernel size determines the scope of the captured features, which somewhat restricts the application of CNNs in text classification. In addition, the pooling operations applied during convolution and transmission often discard critical information, reducing the model's overall comprehension of the text.
Dilated Convolutional Neural Networks (DCNN) [10, 11] are an effective means of enlarging a network's receptive field, which has made them popular in semantic segmentation tasks in computer vision [12, 13]; they have also been successfully introduced into NLP [14] and speech processing [15]. By inserting "holes" into the convolution, a DCNN avoids the information loss caused by common downsampling methods. Moreover, by choosing different dilation rates, the receptive field can be increased exponentially without adding parameters, while resolution and coverage are maintained. Under the same parameter budget, a DCNN can therefore extract richer features and cover the length of most sentences with fewer layers, improving model efficiency. This makes DCNNs well suited to capturing the long-range dependencies found in longer texts. The actual receptive field of a single-layer DCNN is:
$$N=k+\left(k-1\right)\times (d-1)\tag{1}$$
where \(k\) is the kernel size and \(d\) is the dilation rate, i.e., the spacing between the positions the kernel samples.
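As a quick check of Eq. (1), a few lines of Python compute the single-layer receptive field for the kernel size and dilation rates used later in this section:

```python
def dilated_receptive_field(k: int, d: int) -> int:
    """Effective receptive field of a single dilated conv layer, Eq. (1)."""
    return k + (k - 1) * (d - 1)

# kernel size 2 with dilation rates 1, 2, 3 (the combination adopted below)
print([dilated_receptive_field(2, d) for d in (1, 2, 3)])  # → [2, 3, 4]
```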
A Multi-Scale Dilated Convolutional Neural Network (MDC) consists of multiple stacked single-layer DCNNs, and its receptive field can be expressed as:
$$R_{L+1}=\left(R_{L}-1\right)+N_{L+1}\tag{2}$$
where \(R_{L}\) denotes the overall receptive field of the \(L\)-th convolutional layer and \(N_{L+1}\) the actual receptive field of the \((L+1)\)-th layer.
Figure 3 illustrates the structure of a three-layer one-dimensional MDC with a kernel size of 2 and dilation rates of (1, 2, 3).
Improper combinations of convolutional kernel size and dilation rate may lead to the gridding effect in dilated convolution. This effect not only disrupts the continuity between word representations but also causes the loss of important local information.
Therefore, this paper sets the dilation rates of the MDC to (1, 2, 3) with a kernel size of 2. This combination lets the upper convolutional layers reach information at longer distances while still fully covering the underlying input. Moreover, since very distant text may carry no relevant information, it keeps the top layer from processing overly distant positions, reducing the influence of irrelevant information during feature extraction. The resulting MDC is a three-layer one-dimensional dilated convolutional structure with an overall receptive field of 7. Compared with a three-layer one-dimensional ordinary convolution with kernel size 2, whose overall receptive field is only 4, the MDC substantially enlarges the network's receptive field without increasing the computational load.
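The receptive-field arithmetic above follows directly from Eqs. (1) and (2); the short sketch below verifies it, assuming \(R_{0}=1\) for the raw input:

```python
def stacked_receptive_field(k: int, dilations) -> int:
    """Overall receptive field of stacked dilated conv layers, applying
    R_{L+1} = (R_L - 1) + N_{L+1} (Eq. 2) with R_0 = 1 for the raw input."""
    r = 1
    for d in dilations:
        n = k + (k - 1) * (d - 1)  # single-layer receptive field, Eq. (1)
        r = (r - 1) + n
    return r

print(stacked_receptive_field(2, (1, 2, 3)))  # MDC as configured: 7
print(stacked_receptive_field(2, (1, 1, 1)))  # ordinary 3-layer conv: 4
```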
Furthermore, this paper introduces a gating mechanism. Inspired by the gating in LSTMs, which strengthens a model's ability to process sequential information, Dauphin et al. proposed the gated convolutional neural network (GCNN) [16], computed as follows:
$$T=\left(W_{L1}*H+b_{L1}\right)\otimes \sigma \left(W_{L2}*H+b_{L2}\right)\tag{3}$$
where \(H\) denotes the text features output by the previous layer, and \(W\) and \(b\) denote the convolutional kernels and biases, respectively. The right-hand side has two branches: one uses a linear activation to prevent vanishing gradients, and the other uses a sigmoid to compress its features to [0, 1]; through training, the network thereby learns which features to pass on.
The gating unit enables the model to retain a certain degree of nonlinearity while providing a linear propagation path for gradients, effectively alleviating the problem of gradient vanishing. Additionally, the gating mechanism selectively transmits information flow, transmitting relevant features and forgetting irrelevant ones, thereby strengthening effective information and reducing the impact of ineffective information.
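To make Eq. (3) concrete, here is a minimal pure-Python sketch of a one-dimensional gated (dilated) convolution; the weights and inputs are illustrative, not taken from the paper:

```python
import math

def conv1d(x, w, b, dilation=1):
    """Valid-padding 1-D convolution with an optional dilation rate."""
    k = len(w)
    span = (k - 1) * dilation
    return [sum(w[j] * x[i + j * dilation] for j in range(k)) + b
            for i in range(len(x) - span)]

def gated_conv(x, w_lin, b_lin, w_gate, b_gate, dilation=1):
    """Eq. (3): linear branch multiplied element-wise by a sigmoid gate branch."""
    linear = conv1d(x, w_lin, b_lin, dilation)
    gate = [1.0 / (1.0 + math.exp(-v)) for v in conv1d(x, w_gate, b_gate, dilation)]
    return [a * g for a, g in zip(linear, gate)]

# a gate of all zeros yields sigmoid(0) = 0.5, i.e. the linear branch halved
print(gated_conv([1.0, 2.0, 3.0, 4.0], [1.0, 0.0], 0.0, [0.0, 0.0], 0.0))
```

The gate values in (0, 1) scale each position of the linear branch, which is how the mechanism passes relevant features and suppresses irrelevant ones.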
In this paper, MDC is used to replace the traditional convolution in GCNN, referred to as Multi-Scale Gated-Dilated Convolutional Neural Network (MGDC). The structure of this module is shown in Fig. 4.
As shown in Fig. 4, the MGDC feeds the text representation into two parallel MDC layers, producing two representation vectors A and B. B is passed through a sigmoid activation and multiplied element-wise with A to obtain the final text representation vector.
2.2 Text Encoder
To address the insufficient extraction of text features commonly seen in text classification tasks, this paper introduces MGDC, Bi-GRU, and multi-head self-attention at different positions of the feature extraction layer, designing a multi-granularity fusion module that combines multiple feature extractors in parallel to enrich the semantic features of the text. The structure of this module is shown in Fig. 5.
First, GloVe embeds each word of the text into a vector space, forming the initial word vectors. Bi-GRU is then employed to extract shallow local semantic features of the text. Compared with LSTM and vanilla RNNs, GRU is computationally cheaper, which improves training efficiency to a certain extent.
Meanwhile, to better capture long-distance dependencies and comprehensive sentence semantics, a multi-head self-attention mechanism performs attention weighting in parallel, extracting features at different granularities. The result is a representation that reflects the text's key local features more comprehensively without losing semantic information. Attention not only improves performance but also dynamically reveals which words and sentences contribute to the classification of the text.
Finally, to enlarge the network's receptive field and fully capture long-distance dependencies, multi-scale gated-dilated convolutions extract the high-level local semantic features of the text, and the feature vectors from the two parallel modules are fused to obtain the final text representation vector.
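One plausible reading of this data flow, with the sub-modules left as stand-in callables (the actual Bi-GRU, attention, and MGDC layers are not reproduced here, and the exact wiring is an assumption based on the description above), is:

```python
def multi_granularity_fusion(x, bigru, self_attn, mgdc,
                             fuse=lambda a, b: [u + v for u, v in zip(a, b)]):
    """Sketch of the MGF module's data flow: the Bi-GRU output feeds two
    parallel branches (multi-head self-attention and MGDC), whose outputs
    are fused into the final text representation. All sub-modules are
    hypothetical stand-ins for the real layers."""
    shallow = bigru(x)             # shallow local semantics
    attended = self_attn(shallow)  # long-distance dependencies via attention
    local = mgdc(shallow)          # high-level local semantics via MGDC
    return fuse(attended, local)   # final text representation
```

With identity stand-ins for every branch, the element-wise fusion simply doubles the input, which makes the shape of the pipeline easy to check.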
2.3 Label Encoder
The label structure encoding module adopts Hierarchical-GCN for encoding.
2.3.1 Prior hierarchical information
Assuming a hierarchical path \(e_{i,j}\) exists between parent node \(v_{i}\) and child node \(v_{j}\), the prior probabilities \(P\left(U_{j}|U_{i}\right)\) from parent to child and \(P\left(U_{i}|U_{j}\right)\) from child to parent are given by:
$$\left\{\begin{array}{c}P\left(U_{j}|U_{i}\right)=\frac{P(U_{j}\cap U_{i})}{P\left(U_{i}\right)}=\frac{P\left(U_{j}\right)}{P\left(U_{i}\right)}=\frac{N_{j}}{N_{i}}\\ P\left(U_{i}|U_{j}\right)=\frac{P(U_{j}\cap U_{i})}{P\left(U_{j}\right)}=\frac{P\left(U_{j}\right)}{P\left(U_{j}\right)}=1\end{array}\right.\tag{4}$$
Here, \(U_{k}\) denotes the event that \(v_{k}\) occurs, \(P\left(U_{j}|U_{i}\right)\) the probability that child node \(v_{j}\) occurs given that parent node \(v_{i}\) occurs, \(P(U_{j}\cap U_{i})\) the probability that \(v_{i}\) and \(v_{j}\) occur simultaneously, and \(N_{k}\) the occurrence count of \(v_{k}\). Since a child label is always accompanied by its parent, \(P(U_{j}\cap U_{i})=P(U_{j})\).
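As a worked example of Eq. (4), using hypothetical label occurrence counts from a training set:

```python
def top_down_prior(counts, parent, child):
    """P(U_j | U_i) = N_j / N_i (Eq. 4). Every child occurrence implies its
    parent, so the bottom-up prior P(U_i | U_j) is always 1."""
    return counts[child] / counts[parent]

counts = {"science": 120, "physics": 30}  # hypothetical label frequencies
print(top_down_prior(counts, "science", "physics"))  # → 0.25
```

These priors then serve as the edge weights of the weighted adjacency matrices used by the Hierarchical-GCN below.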
2.3.2 Hierarchical Graph Convolutional Neural Network
Hierarchical-GCN encodes each node based on its neighbor nodes, aggregating information from three directions: top-down, bottom-up, and self-loop. In the hierarchical graph, each node represents a label, and each directed edge represents a paired label-related feature. This paper uses prior hierarchical information as the weighted adjacency matrix of Hierarchical-GCN.
Specifically, the label hierarchy can be defined as a directed graph \(G=(V_{t},\overrightarrow{E},\overleftarrow{E})\), where \(V_{t}\) denotes the set of nodes in the hierarchy, \(\overrightarrow{E}\) the top-down hierarchical paths weighted by the parent-to-child prior probabilities, and \(\overleftarrow{E}\) the bottom-up paths, weighted analogously.
2.3.3 Fusion Layer
After the text and label representation vectors are obtained, and to avoid the noise introduced by heterogeneous fusion, the text features are used directly as the node input of the structural encoder in a serial data stream, and the textual information is updated through the hierarchy-aware structural encoder.
Considering the different numbers of text representation vectors and label nodes, the text representation vectors are first reshaped through a linear transformation to obtain vectors consistent with the number of label nodes, and they are then input into the structural encoder. Subsequently, Hierarchical-GCN is used to combine the text semantics with the prior hierarchical information.
$$S_{t}=\sigma \left(\overrightarrow{E}\cdot V_{t}\cdot W_{g_{1}}+\overleftarrow{E}\cdot V_{t}\cdot W_{g_{2}}\right)\tag{5}$$
where \(\sigma (\cdot )\) is the ReLU activation function, \(W_{g_{1}}\) and \(W_{g_{2}}\) are the weight matrices of the Hierarchical-GCN, and \(S_{t}\) is the text representation vector enriched with label hierarchy information.
Each sample updates its textual information over the same hierarchical structure, yielding a hidden state of class-specific, hierarchy-aware text features that serves as the final input to the classifier.
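A toy, pure-Python sketch of the propagation in Eq. (5) on a two-node hierarchy (parent → child with prior 0.25); the self-loop direction mentioned earlier is omitted for brevity, and all matrices are illustrative rather than learned:

```python
def matmul(A, B):
    """Plain list-of-lists matrix multiplication."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def hierarchical_gcn(E_fwd, E_bwd, V, W1, W2):
    """Eq. (5): ReLU(E_fwd.V.W1 + E_bwd.V.W2) over the weighted label graph."""
    top_down = matmul(matmul(E_fwd, V), W1)
    bottom_up = matmul(matmul(E_bwd, V), W2)
    return [[max(0.0, x + y) for x, y in zip(r1, r2)]
            for r1, r2 in zip(top_down, bottom_up)]

E_fwd = [[0.0, 0.25], [0.0, 0.0]]  # parent -> child edge, prior N_j / N_i
E_bwd = [[0.0, 0.0], [1.0, 0.0]]   # child -> parent edge, prior 1
V = [[1.0], [2.0]]                 # 1-dim node features, for illustration
W = [[1.0]]                        # identity weight matrix stand-in
print(hierarchical_gcn(E_fwd, E_bwd, V, W, W))  # → [[0.5], [1.0]]
```

Each node's update mixes information flowing down from its parent and up from its child, weighted by the priors of Eq. (4).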
2.4 Optimizing Learning Tasks
In the optimization learning module, Hierarchical-GCN is used once more as a label structure encoder to encode the labels on their own:
$$S_{l}=\sigma \left(\overrightarrow{E}\cdot V_{l}\cdot W_{g_{3}}+\overleftarrow{E}\cdot V_{l}\cdot W_{g_{4}}\right)\tag{6}$$
where \(V_{l}\) denotes the set of nodes carrying label hierarchy information, \(W_{g_{3}}\) and \(W_{g_{4}}\) are the weight matrices of the Hierarchical-GCN, and \(S_{l}\) is the resulting label representation vector.
Subsequently, following HiMatch [8], the text semantics and label semantics are projected into a joint embedding space to capture, in a hierarchy-aware manner, the semantic matching relationships between the text and both coarse-grained and fine-grained labels, on which the optimization learning is performed.
2.5 Classification Learning
BCE treats samples of different classification difficulties equally; that is, it assigns no weights to any predicted labels. In practice, when the label distribution is long-tailed, i.e., a few head labels have far more samples than the rest, most samples belong to head labels and are easy to classify. Although each such sample incurs only a small loss, their cumulative loss during training is substantial, and as training progresses the loss of the head labels gradually dominates. In other words, when the losses of a large number of easy samples are accumulated, the contribution of the hard samples, being few in number, is almost entirely drowned out, so the gradients of the easy samples dominate the model's optimization. To minimize the training loss, the model keeps improving its accuracy on head labels, while its ability to classify tail labels may be neglected.
The consequence is that while classification performance on head labels improves, performance on tail labels, especially deeply nested, hard-to-classify ones, remains unsatisfactory. Yet in practice, further improving overall model performance often hinges on precisely these hard samples. Effectively handling imbalanced data so that the model attends evenly to samples of different difficulties is therefore crucial.
To strengthen the classification of tail labels and thereby improve overall performance, this paper adopts the Focal Balanced Loss (FB Loss) for the HMTC task from a reweighting perspective. FB Loss weights each sample's loss by its classification difficulty: the many easy samples receive small weights and the hard samples large ones, so the model focuses on the hard samples. In effect, the loss raises the share of hard tail samples in the training loss, directing the model toward learning the tail labels and thereby improving its overall classification accuracy. FB Loss is defined as:
$${\mathcal{L}}_{FB}\left(x,y\right)=-\frac{1}{NC}\sum _{k=1}^{N}\sum _{i=1}^{C}\left[{y}_{i}^{k}{\left(1-{z}_{i}^{k}\right)}^{\beta }\text{log}\left({z}_{i}^{k}\right)+(1-{y}_{i}^{k}){\left({z}_{i}^{k}\right)}^{\beta }\text{log}\left(1-{z}_{i}^{k}\right)\right]\tag{7}$$
where \(N\) is the number of samples, \(C\) the number of labels, \(y_{i}^{k}\in \{0,1\}\) the ground truth for label \(i\) of sample \(k\), and \(z_{i}^{k}\) the corresponding predicted probability. The equation can be viewed as adding modulation factors \({\left(1-{z}_{i}^{k}\right)}^{\beta }\) and \({\left({z}_{i}^{k}\right)}^{\beta }\) to the ordinary cross-entropy loss, exploiting the rapid scaling of power functions to dynamically shrink the weights of easy samples and concentrate the loss on the training of hard samples. Repeated experiments showed that the model performs best with \(\beta =2\).
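A direct implementation of Eq. (7), assuming `z_pred` holds per-label predicted probabilities; comparing β = 2 against β = 0 (which reduces to plain BCE) shows the down-weighting of an easy, well-classified sample:

```python
import math

def fb_loss(y_true, z_pred, beta=2.0):
    """Focal Balanced loss, Eq. (7): the modulation factors (1-z)^beta and
    z^beta shrink the contribution of confidently classified labels."""
    n, c = len(y_true), len(y_true[0])
    total = 0.0
    for yk, zk in zip(y_true, z_pred):
        for y, z in zip(yk, zk):
            total += (y * (1 - z) ** beta * math.log(z)
                      + (1 - y) * z ** beta * math.log(1 - z))
    return -total / (n * c)

y, z = [[1, 0]], [[0.9, 0.1]]    # one easy, well-classified sample
print(fb_loss(y, z, beta=2.0))   # small: factors of (0.1)^2 apply
print(fb_loss(y, z, beta=0.0))   # plain BCE, roughly 100x larger here
```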
Guided by the joint embedding loss and the matching learning loss, the model performs classification learning, feeding the final features into a fully connected layer for prediction. The overall loss function comprises the classification loss, the joint embedding loss, and the hierarchy-aware matching loss:
$$\mathcal{L}={\mathcal{L}}_{FB}(y,\widehat{y})+{\lambda }_{1}{\mathcal{L}}_{joint}+{\lambda }_{2}{\mathcal{L}}_{match}\tag{8}$$
where \(y\) and \(\widehat{y}\) are the true and predicted labels, respectively, and \({\lambda }_{1}\) and \({\lambda }_{2}\) are hyperparameters balancing the joint embedding loss and the matching learning loss.