The overall architecture of the HiDilated model proposed in this paper is shown in Fig. 2. It consists of two main components: the main classification task module and the optimization learning task module.
Main Classification Task: First, a text encoder encodes the input text sequence. A Multi-Granularity Fusion (MGF) module is designed to fully extract textual semantic information, centered on a Multi-Scale Gated-Dilated Convolutional Neural Network (MGDC) that captures long-distance dependencies between words. To avoid the noise introduced by heterogeneous fusion, the text features are reshaped through a fully connected layer and fed directly, in a serial structure, as node inputs to the label structure encoder. In the feature fusion layer, the textual information is then updated by the hierarchy-aware structure encoder. Because the labels stand in hierarchical relationships, the label structure encoder employs a Hierarchical Graph Convolutional Network (Hierarchical-GCN) to aggregate label information along three directions: top-down, bottom-up, and self-loop.
Optimization Learning Task: The structural encoder is utilized again to encode the labels. The label semantics and the text semantics obtained from the text encoder are projected into a joint embedding space. Two optimization learning tasks are introduced to model the semantic matching relationship between text and labels.
Overall, the model performs HMTC tasks under the guidance of both the main classification learning task and the optimization learning task.
2.1 Multi-Scale Gated-Dilated Convolution
This section will introduce dilated convolution, multi-scale dilated convolution, and multi-scale gated-dilated convolution in sequence.
When a CNN is used for feature extraction, the kernel size determines the scope of the captured features, which somewhat restricts the application of CNNs in text classification. In addition, the pooling operations applied during convolution and transmission often discard critical information, reducing the model's overall comprehension of the text.
Dilated Convolutional Neural Networks (DCNN) [10, 11] are an effective means of enlarging a network's receptive field, which has made them popular in semantic segmentation tasks in computer vision [12, 13]; they have also been successfully introduced into NLP [14] and speech processing [15]. By inserting "holes" into the convolution, a DCNN avoids the information loss caused by common downsampling methods. Moreover, by choosing different dilation rates, the receptive field can be increased exponentially without adding parameters, while resolution and coverage are maintained. Under the same parameter budget, a DCNN can therefore extract richer features and cover the length of most sentences with fewer layers, improving model efficiency. This makes DCNNs well suited to capturing the long-range dependencies found in longer texts. The actual receptive field of a single-layer DCNN is:
$$N=k+\left(k-1\right)\times (d-1)\tag{1}$$
where \(k\) is the kernel size and \(d\) is the dilation rate, i.e., the spacing between the positions the kernel samples.
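As a quick check of Eq. (1), a few lines of Python compute the single-layer receptive field for the kernel size and dilation rates used later in this section:

```python
def dilated_receptive_field(k: int, d: int) -> int:
    """Effective receptive field of a single dilated conv layer, Eq. (1)."""
    return k + (k - 1) * (d - 1)

# kernel size 2 with dilation rates 1, 2, 3 (the combination adopted below)
print([dilated_receptive_field(2, d) for d in (1, 2, 3)])  # → [2, 3, 4]
```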
A Multi-Scale Dilated Convolutional Neural Network (MDC) consists of multiple stacked single-layer DCNNs, and its receptive field can be expressed as:
$$R_{L+1}=\left(R_{L}-1\right)+N_{L+1}\tag{2}$$
where \(R_{L}\) denotes the overall receptive field of the \(L\)-th convolutional layer and \(N_{L+1}\) the actual receptive field of the \((L+1)\)-th layer.
Figure 3 illustrates the structure of a three-layer one-dimensional MDC with a kernel size of 2 and dilation rates of (1, 2, 3).
Improper combinations of convolutional kernel size and dilation rate may lead to the gridding effect in dilated convolution. This effect not only disrupts the continuity between word representations but also causes the loss of important local information.
Therefore, this paper sets the dilation rates of the MDC to (1, 2, 3) with a kernel size of 2. This combination lets the upper convolutional layers reach information at longer distances while still fully covering the underlying input. Moreover, since very distant text may carry no relevant information, it keeps the top layer from processing overly distant positions, reducing the influence of irrelevant information during feature extraction. The resulting MDC is a three-layer one-dimensional dilated convolutional structure with an overall receptive field of 7. Compared with a three-layer one-dimensional ordinary convolution with kernel size 2, whose overall receptive field is only 4, the MDC substantially enlarges the network's receptive field without increasing the computational load.
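The receptive-field arithmetic above follows directly from Eqs. (1) and (2); the short sketch below verifies it, assuming \(R_{0}=1\) for the raw input:

```python
def stacked_receptive_field(k: int, dilations) -> int:
    """Overall receptive field of stacked dilated conv layers, applying
    R_{L+1} = (R_L - 1) + N_{L+1} (Eq. 2) with R_0 = 1 for the raw input."""
    r = 1
    for d in dilations:
        n = k + (k - 1) * (d - 1)  # single-layer receptive field, Eq. (1)
        r = (r - 1) + n
    return r

print(stacked_receptive_field(2, (1, 2, 3)))  # MDC as configured: 7
print(stacked_receptive_field(2, (1, 1, 1)))  # ordinary 3-layer conv: 4
```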
Furthermore, this paper introduces a gating mechanism. Inspired by the gating in LSTMs, which strengthens a model's ability to process sequential information, Dauphin et al. proposed the gated convolutional neural network (GCNN) [16], computed as follows:
$$T=\left(W_{L1}*H+b_{L1}\right)\otimes \sigma \left(W_{L2}*H+b_{L2}\right)\tag{3}$$
where \(H\) denotes the text features output by the previous layer, and \(W\) and \(b\) denote the convolutional kernels and biases, respectively. The right-hand side has two branches: one uses a linear activation to prevent vanishing gradients, and the other uses a sigmoid to compress its features to [0, 1]; through training, the network thereby learns which features to pass on.
The gating unit enables the model to retain a certain degree of nonlinearity while providing a linear propagation path for gradients, effectively alleviating the problem of gradient vanishing. Additionally, the gating mechanism selectively transmits information flow, transmitting relevant features and forgetting irrelevant ones, thereby strengthening effective information and reducing the impact of ineffective information.
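To make Eq. (3) concrete, here is a minimal pure-Python sketch of a one-dimensional gated (dilated) convolution; the weights and inputs are illustrative, not taken from the paper:

```python
import math

def conv1d(x, w, b, dilation=1):
    """Valid-padding 1-D convolution with an optional dilation rate."""
    k = len(w)
    span = (k - 1) * dilation
    return [sum(w[j] * x[i + j * dilation] for j in range(k)) + b
            for i in range(len(x) - span)]

def gated_conv(x, w_lin, b_lin, w_gate, b_gate, dilation=1):
    """Eq. (3): linear branch multiplied element-wise by a sigmoid gate branch."""
    linear = conv1d(x, w_lin, b_lin, dilation)
    gate = [1.0 / (1.0 + math.exp(-v)) for v in conv1d(x, w_gate, b_gate, dilation)]
    return [a * g for a, g in zip(linear, gate)]

# a gate of all zeros yields sigmoid(0) = 0.5, i.e. the linear branch halved
print(gated_conv([1.0, 2.0, 3.0, 4.0], [1.0, 0.0], 0.0, [0.0, 0.0], 0.0))
```

The gate values in (0, 1) scale each position of the linear branch, which is how the mechanism passes relevant features and suppresses irrelevant ones.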
In this paper, MDC is used to replace the traditional convolution in GCNN, referred to as Multi-Scale Gated-Dilated Convolutional Neural Network (MGDC). The structure of this module is shown in Fig. 4.
As shown in Fig. 4, the MGDC feeds the text representation into two parallel MDC layers, producing two representation vectors A and B. B is passed through a sigmoid activation and multiplied element-wise with A to obtain the final text representation vector.
2.2 Text Encoder
To address the insufficient extraction of text features commonly seen in text classification tasks, this paper introduces MGDC, Bi-GRU, and multi-head self-attention at different positions of the feature extraction layer, designing a multi-granularity fusion module that combines multiple feature extractors in parallel to enrich the semantic features of the text. The structure of this module is shown in Fig. 5.
First, GloVe embeds each word of the text into a vector space, forming the initial word vectors. Bi-GRU is then employed to extract shallow local semantic features of the text. Compared with LSTM and vanilla RNNs, GRU is computationally cheaper, which improves training efficiency to a certain extent.
Meanwhile, to better capture long-distance dependencies and comprehensive sentence semantics, a multi-head self-attention mechanism performs attention weighting in parallel, extracting features at different granularities. The result is a representation that reflects the text's key local features more comprehensively without losing semantic information. Attention not only improves performance but also dynamically reveals which words and sentences contribute to the classification of the text.
Finally, to enlarge the network's receptive field and fully capture long-distance dependencies, multi-scale gated-dilated convolutions extract the high-level local semantic features of the text, and the feature vectors from the two parallel modules are fused to obtain the final text representation vector.
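One plausible reading of this data flow, with the sub-modules left as stand-in callables (the actual Bi-GRU, attention, and MGDC layers are not reproduced here, and the exact wiring is an assumption based on the description above), is:

```python
def multi_granularity_fusion(x, bigru, self_attn, mgdc,
                             fuse=lambda a, b: [u + v for u, v in zip(a, b)]):
    """Sketch of the MGF module's data flow: the Bi-GRU output feeds two
    parallel branches (multi-head self-attention and MGDC), whose outputs
    are fused into the final text representation. All sub-modules are
    hypothetical stand-ins for the real layers."""
    shallow = bigru(x)             # shallow local semantics
    attended = self_attn(shallow)  # long-distance dependencies via attention
    local = mgdc(shallow)          # high-level local semantics via MGDC
    return fuse(attended, local)   # final text representation
```

With identity stand-ins for every branch, the element-wise fusion simply doubles the input, which makes the shape of the pipeline easy to check.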
2.3 Label Encoder
The label structure encoding module adopts Hierarchical-GCN for encoding.
2.3.1 Prior hierarchical information
Assuming a hierarchical path \(e_{i,j}\) exists between parent node \(v_{i}\) and child node \(v_{j}\), the prior probabilities \(P\left(U_{j}|U_{i}\right)\) from parent to child and \(P\left(U_{i}|U_{j}\right)\) from child to parent are given by:
$$\left\{\begin{array}{c}P\left(U_{j}|U_{i}\right)=\frac{P(U_{j}\cap U_{i})}{P\left(U_{i}\right)}=\frac{P\left(U_{j}\right)}{P\left(U_{i}\right)}=\frac{N_{j}}{N_{i}}\\ P\left(U_{i}|U_{j}\right)=\frac{P(U_{j}\cap U_{i})}{P\left(U_{j}\right)}=\frac{P\left(U_{j}\right)}{P\left(U_{j}\right)}=1\end{array}\right.\tag{4}$$
Here, \(U_{k}\) denotes the event that \(v_{k}\) occurs, \(P\left(U_{j}|U_{i}\right)\) the probability that child node \(v_{j}\) occurs given that parent node \(v_{i}\) occurs, \(P(U_{j}\cap U_{i})\) the probability that \(v_{i}\) and \(v_{j}\) occur simultaneously, and \(N_{k}\) the occurrence count of \(v_{k}\). Since a child label is always accompanied by its parent, \(P(U_{j}\cap U_{i})=P(U_{j})\).
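As a worked example of Eq. (4), using hypothetical label occurrence counts from a training set:

```python
def top_down_prior(counts, parent, child):
    """P(U_j | U_i) = N_j / N_i (Eq. 4). Every child occurrence implies its
    parent, so the bottom-up prior P(U_i | U_j) is always 1."""
    return counts[child] / counts[parent]

counts = {"science": 120, "physics": 30}  # hypothetical label frequencies
print(top_down_prior(counts, "science", "physics"))  # → 0.25
```

These priors then serve as the edge weights of the weighted adjacency matrices used by the Hierarchical-GCN below.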
2.3.2 Hierarchical Graph Convolutional Neural Network
Hierarchical-GCN encodes each node based on its neighbor nodes, aggregating information from three directions: top-down, bottom-up, and self-loop. In the hierarchical graph, each node represents a label, and each directed edge represents a paired label-related feature. This paper uses prior hierarchical information as the weighted adjacency matrix of Hierarchical-GCN.
Specifically, the label hierarchy can be defined as a directed graph \(G=(V_{t},\overrightarrow{E},\overleftarrow{E})\), where \(V_{t}\) denotes the set of nodes in the hierarchy, \(\overrightarrow{E}\) the top-down hierarchical paths weighted by the parent-to-child prior probabilities, and \(\overleftarrow{E}\) the bottom-up paths, weighted analogously.
2.3.3 Fusion Layer
After the text and label representation vectors are obtained, and to avoid the noise introduced by heterogeneous fusion, the text features are used directly as the node input of the structural encoder in a serial data stream, and the textual information is updated through the hierarchy-aware structural encoder.
Considering the different numbers of text representation vectors and label nodes, the text representation vectors are first reshaped through a linear transformation to obtain vectors consistent with the number of label nodes, and they are then input into the structural encoder. Subsequently, Hierarchical-GCN is used to combine the text semantics with the prior hierarchical information.
$$S_{t}=\sigma \left(\overrightarrow{E}\cdot V_{t}\cdot W_{g_{1}}+\overleftarrow{E}\cdot V_{t}\cdot W_{g_{2}}\right)\tag{5}$$
where \(\sigma (\cdot )\) is the ReLU activation function, \(W_{g_{1}}\) and \(W_{g_{2}}\) are the weight matrices of the Hierarchical-GCN, and \(S_{t}\) is the text representation vector enriched with label hierarchy information.
Each sample updates its textual information over the same hierarchical structure, yielding a hidden state of class-specific, hierarchy-aware text features that serves as the final input to the classifier.
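A toy, pure-Python sketch of the propagation in Eq. (5) on a two-node hierarchy (parent → child with prior 0.25); the self-loop direction mentioned earlier is omitted for brevity, and all matrices are illustrative rather than learned:

```python
def matmul(A, B):
    """Plain list-of-lists matrix multiplication."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def hierarchical_gcn(E_fwd, E_bwd, V, W1, W2):
    """Eq. (5): ReLU(E_fwd.V.W1 + E_bwd.V.W2) over the weighted label graph."""
    top_down = matmul(matmul(E_fwd, V), W1)
    bottom_up = matmul(matmul(E_bwd, V), W2)
    return [[max(0.0, x + y) for x, y in zip(r1, r2)]
            for r1, r2 in zip(top_down, bottom_up)]

E_fwd = [[0.0, 0.25], [0.0, 0.0]]  # parent -> child edge, prior N_j / N_i
E_bwd = [[0.0, 0.0], [1.0, 0.0]]   # child -> parent edge, prior 1
V = [[1.0], [2.0]]                 # 1-dim node features, for illustration
W = [[1.0]]                        # identity weight matrix stand-in
print(hierarchical_gcn(E_fwd, E_bwd, V, W, W))  # → [[0.5], [1.0]]
```

Each node's update mixes information flowing down from its parent and up from its child, weighted by the priors of Eq. (4).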
2.4 Optimizing Learning Tasks
In the optimization learning module, Hierarchical-GCN is used once more as a label structure encoder to encode the labels on their own:
$$S_{l}=\sigma \left(\overrightarrow{E}\cdot V_{l}\cdot W_{g_{3}}+\overleftarrow{E}\cdot V_{l}\cdot W_{g_{4}}\right)\tag{6}$$
where \(V_{l}\) denotes the set of nodes carrying label hierarchy information, \(W_{g_{3}}\) and \(W_{g_{4}}\) are the weight matrices of the Hierarchical-GCN, and \(S_{l}\) is the resulting label representation vector.
Subsequently, following HiMatch [8], the text semantics and label semantics are projected into a joint embedding space to capture, in a hierarchy-aware manner, the semantic matching relationships between the text and both coarse-grained and fine-grained labels, on which the optimization learning is performed.
2.5 Classification Learning
BCE treats samples of different classification difficulties equally; that is, it assigns no weights to any predicted labels. In practice, when the label distribution is long-tailed, i.e., a few head labels have far more samples than the rest, most samples belong to head labels and are easy to classify. Although each such sample incurs only a small loss, their cumulative loss during training is substantial, and as training progresses the loss of the head labels gradually dominates. In other words, when the losses of a large number of easy samples are accumulated, the contribution of the hard samples, being few in number, is almost entirely drowned out, so the gradients of the easy samples dominate the model's optimization. To minimize the training loss, the model keeps improving its accuracy on head labels, while its ability to classify tail labels may be neglected.
The consequence is that while classification performance on head labels improves, performance on tail labels, especially deeply nested, hard-to-classify ones, remains unsatisfactory. Yet in practice, further improving overall model performance often hinges on precisely these hard samples. Effectively handling imbalanced data so that the model attends evenly to samples of different difficulties is therefore crucial.
To strengthen the classification of tail labels and thereby improve overall performance, this paper adopts the Focal Balanced Loss (FB Loss) for the HMTC task from a reweighting perspective. FB Loss weights each sample's loss by its classification difficulty: the many easy samples receive small weights and the hard samples large ones, so the model focuses on the hard samples. In effect, the loss raises the share of hard tail samples in the training loss, directing the model toward learning the tail labels and thereby improving its overall classification accuracy. FB Loss is defined as:
$${\mathcal{L}}_{FB}\left(x,y\right)=-\frac{1}{NC}\sum _{k=1}^{N}\sum _{i=1}^{C}\left[{y}_{i}^{k}{\left(1-{z}_{i}^{k}\right)}^{\beta }\text{log}\left({z}_{i}^{k}\right)+(1-{y}_{i}^{k}){\left({z}_{i}^{k}\right)}^{\beta }\text{log}\left(1-{z}_{i}^{k}\right)\right]\tag{7}$$
where \(N\) is the number of samples, \(C\) the number of labels, \(y_{i}^{k}\in \{0,1\}\) the ground truth for label \(i\) of sample \(k\), and \(z_{i}^{k}\) the corresponding predicted probability. The equation can be viewed as adding modulation factors \({\left(1-{z}_{i}^{k}\right)}^{\beta }\) and \({\left({z}_{i}^{k}\right)}^{\beta }\) to the ordinary cross-entropy loss, exploiting the rapid scaling of power functions to dynamically shrink the weights of easy samples and concentrate the loss on the training of hard samples. Repeated experiments showed that the model performs best with \(\beta =2\).
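A direct implementation of Eq. (7), assuming `z_pred` holds per-label predicted probabilities; comparing β = 2 against β = 0 (which reduces to plain BCE) shows the down-weighting of an easy, well-classified sample:

```python
import math

def fb_loss(y_true, z_pred, beta=2.0):
    """Focal Balanced loss, Eq. (7): the modulation factors (1-z)^beta and
    z^beta shrink the contribution of confidently classified labels."""
    n, c = len(y_true), len(y_true[0])
    total = 0.0
    for yk, zk in zip(y_true, z_pred):
        for y, z in zip(yk, zk):
            total += (y * (1 - z) ** beta * math.log(z)
                      + (1 - y) * z ** beta * math.log(1 - z))
    return -total / (n * c)

y, z = [[1, 0]], [[0.9, 0.1]]    # one easy, well-classified sample
print(fb_loss(y, z, beta=2.0))   # small: factors of (0.1)^2 apply
print(fb_loss(y, z, beta=0.0))   # plain BCE, roughly 100x larger here
```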
Guided by the joint embedding loss and the matching learning loss, the model performs classification learning, feeding the final features into a fully connected layer for prediction. The overall loss function comprises the classification loss, the joint embedding loss, and the hierarchy-aware matching loss:
$$\mathcal{L}={\mathcal{L}}_{FB}(y,\widehat{y})+{\lambda }_{1}{\mathcal{L}}_{joint}+{\lambda }_{2}{\mathcal{L}}_{match}\tag{8}$$
where \(y\) and \(\widehat{y}\) are the true and predicted labels, respectively, and \({\lambda }_{1}\) and \({\lambda }_{2}\) are hyperparameters balancing the joint embedding loss and the matching learning loss.