Text clustering is the challenging task of discovering and extracting groups of similar elements from text collections [1], in which documents are typically grouped according to content similarity. Many clustering algorithms have been introduced, and they fall into several categories. One group is partition-based, such as K-means and K-medoids, in which each data point is assigned to the cluster center at the smallest distance from it [2]. These methods aim to optimize the distance between the samples and the cluster centers. Another type of clustering is distribution-based: it represents the data with a set of probability distribution functions and assumes that the points in a given cluster are most likely drawn from the same distribution [2]. Without a suitable distribution function, this approach cannot achieve high clustering accuracy.
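As a concrete illustration of the partition-based idea, the sketch below implements Lloyd's K-means iteration in plain NumPy: each point is assigned to its nearest center, then each center moves to the mean of its assigned points. The toy two-blob data and all parameter choices are illustrative only.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """One run of Lloyd's algorithm: assign points to the nearest
    center, then move each center to the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)          # smallest-distance assignment
        for j in range(k):
            if np.any(labels == j):        # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# two well-separated blobs; K-means should recover them
X = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5.0, 0.1, (20, 2))])
labels, centers = kmeans(X, k=2)
```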
Density-based clustering algorithms, on the other hand, can cluster data with any arbitrary distribution [3]. They can identify clusters of any shape, because a cluster is defined as a contiguous region of dense points separated from other clusters by regions of sparse points [2]. Another common approach is corpus-based clustering, which reduces feature dimensionality by merging synonyms with the help of large lexical resources such as WordNet, thereby shrinking the document vectors and removing redundant information [2]. Non-negative matrix factorization (NMF) and latent semantic indexing (LSI) are two well-known corpus-based methods in which documents are transformed into a lower-dimensional feature space whose features are usually linear combinations of the original ones.
Recently, NMF has received much attention and has found many applications in pattern recognition and text mining [4–7]. It is a linear-algebraic model that reduces the dimensionality of non-negative vectors by expressing the data matrix as the product of a weight matrix and a basis matrix. The central goal of NMF is to recover global features of a system as non-negative combinations of local features. Because of this feature-extraction capability, it has been widely used in natural language, image, and audio processing [8]. However, since NMF is a simple linear model, it cannot capture hidden nonlinear relationships in the data, which limits its ability to handle complex data [9]. Matrix-decomposition methods such as NMF also typically suffer from two further problems: they cannot hierarchically discover topics and their subtopics, and their clustering results are far from human clustering.
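The factorization described above can be sketched with the classic Lee–Seung multiplicative update rules; this is one standard instantiation of NMF, not necessarily the variant used in the cited works, and the matrix sizes and iteration count here are arbitrary.

```python
import numpy as np

def nmf(V, r, n_iter=200, seed=0):
    """Factor a non-negative matrix V (n x m) as W (n x r) @ H (r x m)
    using Lee & Seung's multiplicative update rules."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 0.1   # strictly positive initialization
    H = rng.random((r, m)) + 0.1
    eps = 1e-10                    # guard against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.random.default_rng(1).random((30, 20))
W, H = nmf(V, r=5)
err = np.linalg.norm(V - W @ H)    # reconstruction error after fitting
```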
Deep learning has received much attention for discovering and extracting nonlinear structures [10, 11], and many researchers have used it to make significant progress in learning deep representations for various tasks. Deep learning techniques learn representations in a hierarchical and nonlinear manner [12]. However, a main challenge of deep clustering arises when the data have high dimensionality and complex structure, or when there are many clusters, which makes the clustering process time-consuming and inefficient [13]. In such cases, representing the data in a lower-dimensional space is unavoidable and improves clustering. Dimensionality reduction can also capture semantic similarity between texts more effectively and thus produce better clusters. Many dimension-reduction methods have been applied to text, such as independent component analysis (ICA), singular value decomposition (SVD), and principal component analysis (PCA) [14]. Deep methods today usually address this problem with autoencoders, a nonlinear dimension-reduction technique for learning low-dimensional representations [15]. However, most current deep learning-based clustering methods reduce dimensionality without considering the manifold structure of the data, so similar data points do not necessarily end up close to each other in the learned space [10, 11, 16–18].
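A minimal sketch of the autoencoder idea, assuming a single linear bottleneck trained by gradient descent on the squared reconstruction error; real deep clustering models use several nonlinear layers, so this only illustrates learning a low-dimensional code and reconstructing the input from it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))            # toy "document" vectors
d, k, lr = 20, 5, 0.01                    # input dim, code dim, step size
W1 = rng.normal(scale=0.1, size=(d, k))   # encoder weights
W2 = rng.normal(scale=0.1, size=(k, d))   # decoder weights

def recon_loss(W1, W2):
    R = X @ W1 @ W2 - X                   # reconstruction residual
    return (R ** 2).mean()

loss_before = recon_loss(W1, W2)
for _ in range(300):                      # plain gradient descent
    Z = X @ W1                            # low-dimensional code (bottleneck)
    R = Z @ W2 - X
    gW2 = (2.0 / X.size) * (Z.T @ R)      # d(loss)/dW2
    gW1 = (2.0 / X.size) * (X.T @ (R @ W2.T))  # d(loss)/dW1
    W1 -= lr * gW1
    W2 -= lr * gW2
loss_after = recon_loss(W1, W2)           # lower than loss_before
```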
This paper develops a deep text clustering method with a local manifold in the autoencoder layer (DCTMA) that uses multiple similarity matrices to capture manifold information; the final similarity matrix is the average of these matrices. This similarity matrix is used as additional information alongside the data representation and is attached to the bottleneck (embedding) layer of the autoencoder as an additional term in the clustering loss. The idea is that pairs of samples in the same cluster should have similar representations in the embedding space. As a result, similar data points are placed next to one another, which mitigates the representation-learning accuracy problem. In the presented model, dimensionality reduction is achieved with high accuracy while clusters are simultaneously detected in a deep end-to-end framework. The consensus similarity matrix is fed to a deep network on top of an autoencoder, thereby combining the benefits of a rich similarity measure and a context-aware representation of documents in a single end-to-end solution.
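The requirement that similar samples stay close in the embedding space is commonly written as a penalty of the form sum_ij S_ij * ||z_i - z_j||^2, which for a symmetric similarity matrix S with degree matrix D equals 2 * tr(Z^T (D - S) Z). The sketch below is a hypothetical illustration of such a term, not necessarily the paper's exact loss, and checks the identity numerically against a brute-force sum:

```python
import numpy as np

def manifold_penalty(Z, S):
    """sum_{i,j} S_ij * ||z_i - z_j||^2, computed via the graph
    Laplacian identity: penalty = 2 * tr(Z^T (D - S) Z)."""
    D = np.diag(S.sum(axis=1))   # degree matrix
    L = D - S                    # unnormalized graph Laplacian
    return 2.0 * np.trace(Z.T @ L @ Z)

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 3))                 # embedded representations
S = rng.random((6, 6)); S = (S + S.T) / 2   # symmetric similarities
# brute-force evaluation of the same quantity
brute = sum(S[i, j] * np.sum((Z[i] - Z[j]) ** 2)
            for i in range(6) for j in range(6))
```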
Most current similarity measures are based on cosine similarity (CS) and Euclidean distance (ED); ED considers only the magnitude of the vectors, while CS captures only the angle between them. Neither is an effective or suitable measure for the probability distributions often used in text analysis [19]. In this paper, we therefore combine several document similarity measures, each of which assesses the similarity between two documents from a different point of view. A consensus similarity matrix is then computed for each dataset as the average of these three text similarity measures. It captures document similarity well and, in turn, helps the deep model produce good clusters.
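A consensus matrix of the kind described can be sketched as the average of three pairwise similarity views. The concrete measures below (cosine, a Euclidean-distance kernel, and a Jensen–Shannon-based similarity on row-normalized distributions) are illustrative stand-ins; the paper's actual three measures may differ.

```python
import numpy as np

def consensus_similarity(X):
    """Average three pairwise similarity views of the rows of X
    (e.g. non-negative term-frequency vectors)."""
    n = len(X)
    # view 1: cosine similarity (angle only)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = Xn @ Xn.T
    # view 2: Euclidean distance mapped into (0, 1] (magnitude-aware)
    d = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    euc = 1.0 / (1.0 + d)
    # view 3: 1 - normalized Jensen-Shannon divergence (distributional)
    P = X / X.sum(axis=1, keepdims=True)
    def js(p, q):
        m = (p + q) / 2
        kl = lambda a, b: np.sum(np.where(a > 0, a * np.log(a / b), 0.0))
        return (kl(p, m) + kl(q, m)) / 2
    jss = np.array([[1.0 - js(P[i], P[j]) / np.log(2) for j in range(n)]
                    for i in range(n)])
    return (cos + euc + jss) / 3.0

X = np.random.default_rng(0).random((5, 8)) + 0.01  # strictly positive rows
S = consensus_similarity(X)   # symmetric, with ones on the diagonal
```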
The proposed model has been evaluated on three datasets with several evaluation metrics. According to the experimental results, the presented method outperforms recently published methods on the same datasets. The key contributions of this paper are as follows:
- A deep end-to-end clustering architecture is used to learn representations and cluster labels jointly.
- A new loss function is used that incorporates similarity matrices to capture manifold information for better manifold clustering.
- A combined similarity criterion that considers not only direction but also magnitude and semantics is added as an extra term to the clustering loss, improving performance and yielding better clustering results.
The remainder of this paper is organized as follows. Section 2 highlights related work. Section 3 presents the developed system in detail. Section 4 reports and discusses the experimental results. Finally, Section 5 concludes the paper and outlines future work.