Text clustering is the challenging task of discovering and extracting groups of similar elements from text collections [1], in which documents are typically grouped according to content similarity. Many clustering algorithms have been introduced, and they fall into several categories. One group is partition-based, such as K-means and K-medoids, in which each data point is assigned to the cluster center at the smallest distance from it [2]. These methods aim to optimize the distance between the samples and the cluster centers. Another type of clustering is distribution-based: it represents the data with a set of probability distribution functions and assumes that the points in a given cluster are most likely drawn from the same distribution [2]. Without a suitable distribution function, this approach cannot achieve high clustering accuracy.
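As a concrete illustration of the partition-based idea, the sketch below implements Lloyd's K-means iteration in plain NumPy: each point is assigned to its nearest center, then each center moves to the mean of its assigned points. The toy two-blob data and all parameter choices are illustrative only.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """One run of Lloyd's algorithm: assign points to the nearest
    center, then move each center to the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)          # smallest-distance assignment
        for j in range(k):
            if np.any(labels == j):        # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# two well-separated blobs; K-means should recover them
X = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5.0, 0.1, (20, 2))])
labels, centers = kmeans(X, k=2)
```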
Density-based clustering algorithms, on the other hand, can cluster data with any arbitrary distribution [3]. They can identify clusters of any shape, because a cluster is defined as a contiguous region of dense points separated from other clusters by regions of sparse points [2]. Another common approach is corpus-based clustering, which reduces feature dimensionality by merging synonyms with the help of large lexical resources such as WordNet, thereby shrinking the document vectors and removing redundant information [2]. Non-negative matrix factorization (NMF) and latent semantic indexing (LSI) are two well-known corpus-based methods in which documents are transformed into a lower-dimensional feature space whose features are usually linear combinations of the original ones.
Recently, NMF has received much attention and has found many applications in pattern recognition and text mining [4–7]. It is a linear-algebraic model that reduces the dimensionality of non-negative vectors by expressing the data matrix as the product of a weight matrix and a basis matrix. The central goal of NMF is to recover global features of a system as non-negative combinations of local features. Because of this feature-extraction capability, it has been widely used in natural language, image, and audio processing [8]. However, since NMF is a simple linear model, it cannot capture hidden nonlinear relationships in the data, which limits its ability to handle complex data [9]. Matrix-decomposition methods such as NMF also typically suffer from two further problems: they cannot hierarchically discover topics and their subtopics, and their clustering results are far from human clustering.
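The factorization described above can be sketched with the classic Lee–Seung multiplicative update rules; this is one standard instantiation of NMF, not necessarily the variant used in the cited works, and the matrix sizes and iteration count here are arbitrary.

```python
import numpy as np

def nmf(V, r, n_iter=200, seed=0):
    """Factor a non-negative matrix V (n x m) as W (n x r) @ H (r x m)
    using Lee & Seung's multiplicative update rules."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 0.1   # strictly positive initialization
    H = rng.random((r, m)) + 0.1
    eps = 1e-10                    # guard against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.random.default_rng(1).random((30, 20))
W, H = nmf(V, r=5)
err = np.linalg.norm(V - W @ H)    # reconstruction error after fitting
```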
Deep learning has received much attention for discovering and extracting nonlinear structures [10, 11], and many researchers have used it to make significant progress in learning deep representations for various tasks. Deep learning techniques learn representations in a hierarchical and nonlinear manner [12]. However, a main challenge of deep clustering arises when the data have high dimensionality and complex structure, or when there are many clusters, which makes the clustering process time-consuming and inefficient [13]. In such cases, representing the data in a lower-dimensional space is unavoidable and improves clustering. Dimensionality reduction can also capture semantic similarity between texts more effectively and thus produce better clusters. Many dimension-reduction methods have been applied to text, such as independent component analysis (ICA), singular value decomposition (SVD), and principal component analysis (PCA) [14]. Deep methods today usually address this problem with autoencoders, a nonlinear dimension-reduction technique for learning low-dimensional representations [15]. However, most current deep learning-based clustering methods reduce dimensionality without considering the manifold structure of the data, so similar data points do not necessarily end up close to each other in the learned space [10, 11, 16–18].
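A minimal sketch of the autoencoder idea, assuming a single linear bottleneck trained by gradient descent on the squared reconstruction error; real deep clustering models use several nonlinear layers, so this only illustrates learning a low-dimensional code and reconstructing the input from it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))            # toy "document" vectors
d, k, lr = 20, 5, 0.01                    # input dim, code dim, step size
W1 = rng.normal(scale=0.1, size=(d, k))   # encoder weights
W2 = rng.normal(scale=0.1, size=(k, d))   # decoder weights

def recon_loss(W1, W2):
    R = X @ W1 @ W2 - X                   # reconstruction residual
    return (R ** 2).mean()

loss_before = recon_loss(W1, W2)
for _ in range(300):                      # plain gradient descent
    Z = X @ W1                            # low-dimensional code (bottleneck)
    R = Z @ W2 - X
    gW2 = (2.0 / X.size) * (Z.T @ R)      # d(loss)/dW2
    gW1 = (2.0 / X.size) * (X.T @ (R @ W2.T))  # d(loss)/dW1
    W1 -= lr * gW1
    W2 -= lr * gW2
loss_after = recon_loss(W1, W2)           # lower than loss_before
```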
This paper develops a deep text clustering method with a local manifold in the autoencoder layer (DCTMA) that uses multiple similarity matrices to capture manifold information; the final similarity matrix is the average of these matrices. This similarity matrix is used as additional information alongside the data representation and is attached to the bottleneck (embedding) layer of the autoencoder as an additional term in the clustering loss. The idea is that pairs of samples in the same cluster should have similar representations in the embedding space. As a result, similar data points are placed next to one another, which mitigates the representation-learning accuracy problem. In the presented model, dimensionality reduction is achieved with high accuracy while clusters are simultaneously detected in a deep end-to-end framework. The consensus similarity matrix is fed to a deep network on top of an autoencoder, thereby combining the benefits of a rich similarity measure and a context-aware representation of documents in a single end-to-end solution.
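The requirement that similar samples stay close in the embedding space is commonly written as a penalty of the form sum_ij S_ij * ||z_i - z_j||^2, which for a symmetric similarity matrix S with degree matrix D equals 2 * tr(Z^T (D - S) Z). The sketch below is a hypothetical illustration of such a term, not necessarily the paper's exact loss, and checks the identity numerically against a brute-force sum:

```python
import numpy as np

def manifold_penalty(Z, S):
    """sum_{i,j} S_ij * ||z_i - z_j||^2, computed via the graph
    Laplacian identity: penalty = 2 * tr(Z^T (D - S) Z)."""
    D = np.diag(S.sum(axis=1))   # degree matrix
    L = D - S                    # unnormalized graph Laplacian
    return 2.0 * np.trace(Z.T @ L @ Z)

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 3))                 # embedded representations
S = rng.random((6, 6)); S = (S + S.T) / 2   # symmetric similarities
# brute-force evaluation of the same quantity
brute = sum(S[i, j] * np.sum((Z[i] - Z[j]) ** 2)
            for i in range(6) for j in range(6))
```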
Most current similarity measures are based on cosine similarity (CS) and Euclidean distance (ED); ED considers only the magnitude of the vectors, while CS captures only the angle between them. Neither is an effective or suitable measure for the probability distributions often used in text analysis [19]. In this paper, we therefore combine several document similarity measures, each of which assesses the similarity between two documents from a different point of view. A consensus similarity matrix is then computed for each dataset as the average of these three text similarity measures. It captures document similarity well and, in turn, helps the deep model produce good clusters.
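A consensus matrix of the kind described can be sketched as the average of three pairwise similarity views. The concrete measures below (cosine, a Euclidean-distance kernel, and a Jensen–Shannon-based similarity on row-normalized distributions) are illustrative stand-ins; the paper's actual three measures may differ.

```python
import numpy as np

def consensus_similarity(X):
    """Average three pairwise similarity views of the rows of X
    (e.g. non-negative term-frequency vectors)."""
    n = len(X)
    # view 1: cosine similarity (angle only)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = Xn @ Xn.T
    # view 2: Euclidean distance mapped into (0, 1] (magnitude-aware)
    d = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    euc = 1.0 / (1.0 + d)
    # view 3: 1 - normalized Jensen-Shannon divergence (distributional)
    P = X / X.sum(axis=1, keepdims=True)
    def js(p, q):
        m = (p + q) / 2
        kl = lambda a, b: np.sum(np.where(a > 0, a * np.log(a / b), 0.0))
        return (kl(p, m) + kl(q, m)) / 2
    jss = np.array([[1.0 - js(P[i], P[j]) / np.log(2) for j in range(n)]
                    for i in range(n)])
    return (cos + euc + jss) / 3.0

X = np.random.default_rng(0).random((5, 8)) + 0.01  # strictly positive rows
S = consensus_similarity(X)   # symmetric, with ones on the diagonal
```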
The proposed model has been evaluated on three datasets with several evaluation metrics. According to the experimental results, the presented method outperforms recently published methods on the same datasets. The key contributions of this paper are as follows:
- A deep end-to-end clustering architecture is used to learn representations and cluster labels jointly.
- A new loss function is used that incorporates similarity matrices to capture manifold information for better manifold clustering.
- A combined similarity criterion that considers not only direction but also magnitude and semantics is added as an extra term to the clustering loss, improving performance and yielding better clustering results.
The remainder of this paper is organized as follows. Section 2 highlights related work. Section 3 presents the developed system in detail. Section 4 reports and discusses the experimental results. Finally, Section 5 concludes the paper and outlines future work.