Currently available and widely used machine learning and deep learning strategies are becoming more and more complex and advanced (Goodfellow et al. 2017; Andrew Ng 2018; Janet and Kulik 2020; Wei et al. 2020; Géron 2022). They provide many possibilities, and various trends and innovations are constantly emerging. Currently, these innovations include, among others: hyperautomation – thanks to machine learning technology, companies automate numerous repetitive processes based on huge volumes of data, enhancing the speed, accuracy, and reliability of the work; the Internet of Things (IoT) – this technology connects numerous small devices across a network and allows for seamless communication between them, making them considerably more intelligent; cyber security – machine learning is used to identify cyber threats, fight cybercrime, and enhance current antivirus software; no-code ML – such platforms allow companies to work without an engineer or developer and let users with limited technical skills create their own tools with a drag-and-drop interface, which reduces costs and time; deep learning – further intensive development of this technology using multilayer neural networks with different architectures, in particular for modelling multidimensional data (> 2D), used in applications that require image recognition, autonomous movement, or voice interaction; semi- and self-supervised learning – automation of traditionally manual processes, such as data labelling; reinforcement learning – allows software to find solutions by interacting with the environment, uses a reward and punishment system, and lets the machine learn by experimenting with potential paths and then deciding which one yields the best reward and is most effective.
One of the constantly developing strategies is unsupervised machine learning, which does not require human intervention, since the algorithms are designed to identify groupings and patterns hidden in the data (Tripathy et al. 2021). Labels are not used in the calculation. This type of learning is able to look at the data and identify similarities. Unsupervised machine learning uses clustering approaches, which mine data to find groupings. The advantages of such methods include solving the problem by learning from the data and classifying it without any labels. It is very helpful at the initial stage of data analysis in finding patterns in data, i.e. subsets of similar objects, and dimensionality reduction can be easily accomplished. Unsupervised learning is a valuable tool for data scientists, as it can help to understand raw data and to find to what degree the data are similar. Such a task is defined in this work. The problem was to find which of the sensors or data preparation methods provide the correct answer about the grouping of objects. Experimental data, namely the signals obtained in voltammetric measurements, were the input for the models. It was checked whether the samples without labels are correctly assigned to subsets that are known but not used in the calculations. This approach allowed for the development of an overall correct research strategy, which in many aspects used the knowledge about the experiments and a considerable amount of domain information in the field of analytical chemistry. As a result, recommendations were formulated on the correct acquisition of data for supervised modelling.
K-means is an iterative clustering algorithm based on the assumption that similar data points lie in a close neighbourhood in the data space (Blokdyk 2021). The value of k is the number of clusters, each represented by a centroid, into which the data set is divided. Objects are assigned to clusters based on the measurement of the distance between the points and the centroids. Various mathematical distance functions are typically used in the calculations. Miscellaneous stop criteria are also applied, e.g. the calculations end when the membership of objects to clusters does not change in the next iteration. The k-means clustering method is the simplest and most commonly used algorithm, but it is limited by the fact that the number of clusters has to be pre-determined.
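For illustration only, the sketch below shows how such a k-means run could look in Python with scikit-learn; the data matrix X is a hypothetical stand-in for a set of voltammetric signals and is not part of the original study.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data matrix: rows are samples (signals), columns are
# current readings at consecutive potentials.
rng = np.random.default_rng(0)
X = rng.random((60, 200))

# k (the number of clusters, i.e. centroids) has to be chosen in advance.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # cluster membership of each sample
centroids = kmeans.cluster_centers_   # one centroid per cluster
```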
Hierarchical clustering allows for the formation of multiple clusters that are distinct from each other, while the objects inside each cluster are highly similar to one another (Everitt et al. 2011). Important parameters of this algorithm are the distance function and the agglomeration procedure. In the first step, the distance matrix is calculated, which is the basis for further iterative clustering. The algorithm treats each observation as a separate cluster, then finds the two most similar clusters and merges them. This step is repeated iteratively until all clusters are merged together. For visual representation of the clusters, a dendrogram is formed, which shows the hierarchical similarity between the clusters.
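A minimal sketch of such agglomerative clustering with SciPy is given below, again on a hypothetical signal matrix X; Ward linkage with the Euclidean distance is only one possible choice of distance function and agglomeration procedure.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = rng.random((60, 200))                 # hypothetical signal matrix

# Pairwise distances and iterative merging; the distance function and the
# agglomeration (linkage) method are the key parameters.
Z = linkage(X, method="ward", metric="euclidean")

# Cut the hierarchy into a chosen number of clusters ...
labels = fcluster(Z, t=3, criterion="maxclust")

# ... or draw the dendrogram to inspect the full merge hierarchy.
dendrogram(Z)
plt.show()
```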
In the unsupervised learning group, one of the most commonly used algorithms is Principal Component Analysis (PCA) (Jolliffe 2002). This strategy makes it possible to reduce the dimensionality of the data space by transforming the correlated input features into new, mutually orthogonal principal components, with little loss of information. Typically, the first few principal components describe a significant percentage of the information contained in the original data. A significant reduction of the number of variables enables objects to be visualized in a space with a smaller number of dimensions. When a reduction to two dimensions is sufficient, hidden relationships between samples and original variables can be observed in the plane. PCA is often used in voltammetry to qualitatively evaluate complex samples from the obtained signals.
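The following sketch, using a hypothetical matrix X of voltammograms, illustrates how such a two-dimensional PCA projection could be obtained with scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((60, 200))        # hypothetical voltammograms

# Project the correlated current readings onto two orthogonal
# principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)    # sample coordinates in the PC1-PC2 plane

# Fraction of the original variance captured by each component.
print(pca.explained_variance_ratio_)
```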
Clustering evaluation
In the case of supervised methods, different coefficients of model quality assessment are typically used: for classification models these are accuracy, sensitivity, precision, specificity, F1-score, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and the confusion matrix, while for regression models they are the root mean squared error (RMSE) and the assessment of the predicted/measured relationship (Powers 2020). Numerical coefficients allow for unambiguous decisions regarding the usefulness of the defined models. There is a need to define analogous measures to assess grouping efficiency in clustering methods.
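For illustration, these coefficients are readily available in common libraries; the snippet below (with made-up labels and values, not data from this work) shows how a few of them could be computed with scikit-learn.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_squared_error, precision_score, recall_score)

# Classification example with made-up labels.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))       # sensitivity
print(f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

# Regression example: RMSE from made-up measured/predicted values.
y_meas = [1.2, 2.4, 3.1]
y_hat = [1.0, 2.5, 3.3]
print(mean_squared_error(y_meas, y_hat) ** 0.5)
```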
Clustering validity indices (CVIs) are used to validate clustering results and to find the correct number of clusters in a dataset. The CVIs are intended to quantify the separability and compactness of the clusters and to reflect the geometric structure of the dataset. The indicators that can be used to estimate the number of clusters include the Calinski–Harabasz (CH) index, the Davies–Bouldin (DB) index, the Silhouette (S) index, and the gap statistic.
Assume that the dataset {\({x}_{ij}\)}, i = 1,…,n, j = 1,…,p, consists of p features measured on n independent observations. The variables used in the CVI definitions (formulae (1)–(4)) are listed below:
n is the total number of observations in the dataset,
k is the number of clusters,
\({C}_{i}\) is the i-th cluster,
\({n}_{i}\) is the number of objects in the cluster \({C}_{i}\),
\(d\left(x,y\right)\) is the distance between x and y (the most common choice is the Euclidean distance),
c is the centroid of the dataset,
\({c}_{i}\) is the centroid of the cluster \({C}_{i}\).
Calinski and Harabasz Index
The CH index (also known as the Variance Ratio Criterion) was proposed by Calinski and Harabasz (Calinski and Harabasz 1974) to determine the ideal number of clusters k and is defined as (1):
$$CH\left(k\right)=\frac{\sum_{i}{n}_{i}\,{d}^{2}\left({c}_{i},c\right)/(k-1)}{\sum_{i}\sum_{x\in {C}_{i}}{d}^{2}\left(x,{c}_{i}\right)/(n-k)} \left(1\right)$$
The score is the ratio of the between-cluster dispersion to the within-cluster dispersion summed over all clusters (where dispersion is defined as the sum of squared distances). The CH index is a measure of how similar an object is to its own cluster (compactness) compared to other clusters (separation). Compactness is estimated based on the distances from the data points in a cluster to its cluster centroid, and separation is based on the distance of the cluster centroids from the global centroid. A higher value of the CH index means that the clusters are compact and well separated, although there is no accepted cut-off value. Typically, the k which gives a peak, or at least an abrupt elbow, on the line plot of CH indices is selected. If the line is horizontal, there is no reason to prefer one solution over the others.
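In practice, the CH index does not have to be coded from Eq. (1) by hand; the sketch below scans a range of k values with scikit-learn on a hypothetical data matrix and prints CH(k), whose peak or elbow suggests the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = rng.random((60, 200))               # hypothetical signal matrix

# A peak (or abrupt elbow) of CH over k suggests the number of clusters.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))
```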
Silhouette Index
The Silhouette index S was introduced by Kaufman and Rousseeuw (Rousseeuw 1987) and was built to show graphically how well each element is categorized in a given clustering output. It is defined as (2):
$$S\left(k\right)=\frac{1}{k}\sum_{i}\left(\frac{1}{{n}_{i}}\sum_{x\in {C}_{i}}\frac{b\left(x\right)-a\left(x\right)}{\text{max}\left[a\left(x\right), b\left(x\right)\right]}\right) \left(2\right)$$
where
$$a\left(x\right)=\frac{1}{{n}_{i}-1}\sum _{y\in {C}_{i},x\ne y}d\left(x,y\right)$$
$$b\left(x\right)=\underset{j,\,j\ne i}{\text{min}}\ \frac{1}{{n}_{j}}\sum_{y\in {C}_{j}}d\left(x,y\right)$$
The optimal number of clusters is the value of k for which \(S\left(k\right)\) is maximal. The Silhouette value is a measure of how similar an object is to its own cluster (compactness) compared to other clusters (separation). It can be used to study the separation distance between the resulting clusters. The S index validates the clustering performance based on the pairwise difference of between- and within-cluster distances. The Silhouette plot displays a measure of how close each point in one cluster is to points in the neighbouring clusters and thus provides a way to visually assess parameters such as the number of clusters.
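A corresponding sketch for the Silhouette index is given below; silhouette_score returns the mean value over all samples, while silhouette_samples provides the per-object values needed for the Silhouette plot. The data matrix is again hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = rng.random((60, 200))               # hypothetical signal matrix

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # mean S over all objects

# Per-object silhouette values for a chosen k, e.g. to draw the Silhouette plot.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
s_values = silhouette_samples(X, labels)
```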
Gap Statistics
Tibshirani et al. (Tibshirani et al. 2001) proposed a way to determine the ideal number of clusters k in a dataset using the gap statistic. The idea of the gap statistic (3) is to compare the total within-cluster sum of squares around the cluster centroids for various values of k with its expected value generated from a reasonable reference null distribution.
$$Gap\left(k\right)={E}_{n}^{*}\left[\text{log}\left({w}_{k}\right)\right]-\text{log}\left({w}_{k}\right) \left(3\right)$$
$${w}_{k}=\sum_{i}\frac{1}{2{n}_{i}}\sum_{x\in {C}_{i}}{d}^{2}\left(x,{c}_{i}\right)$$
where \({E}_{n}^{*}\) denotes the expectation under a sample of size n from the reference distribution. The gap statistic measures the deviation of the observed \({w}_{k}\) value from its expected value under the null hypothesis. The value of k that maximizes the gap statistic is the estimate of the ideal number of clusters. The basic idea of the gap statistic is to choose the value of k at which the biggest jump in within-cluster distance occurs, based on the overall behaviour of uniformly drawn reference samples; beyond this point only a very slight reduction in within-cluster distance is obtained.
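The gap statistic is not available as a single call in scikit-learn, so the sketch below implements Eq. (3) directly, using k-means for the partitions and a uniform reference distribution drawn over the bounding box of the data; all data and parameter values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, labels):
    """w_k as in Eq. (3): squared distances to cluster centroids, scaled by 1/(2*n_i)."""
    w = 0.0
    for lab in np.unique(labels):
        pts = X[labels == lab]
        centroid = pts.mean(axis=0)
        w += ((pts - centroid) ** 2).sum() / (2 * len(pts))
    return w

def gap_statistic(X, k, n_refs=20, seed=0):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    log_wk = np.log(within_dispersion(X, labels))
    # Reference null distribution: uniform samples over the bounding box of X.
    ref_log_wk = []
    for _ in range(n_refs):
        X_ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
        ref_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_ref)
        ref_log_wk.append(np.log(within_dispersion(X_ref, ref_labels)))
    # Gap(k) = E*_n[log(w_k)] - log(w_k); the k that maximizes it is selected.
    return np.mean(ref_log_wk) - log_wk
```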
Davies–Bouldin Index
The Davies–Bouldin (DB) index is one of the measures for evaluating clustering algorithms (Davies and Bouldin 1979). It is most commonly used to evaluate the goodness of a split made by the k-means clustering algorithm for a given number of clusters. The Davies–Bouldin index is calculated as the average similarity of each cluster with the cluster most similar to it. The lower the average similarity is, the better the clusters are separated and the better the result of the clustering. DB can be defined using Eq. (4):
$$DB\left(k\right)=\frac{1}{k}\sum_{i}\underset{j,\,j\ne i}{\text{max}}\left[\left(\frac{1}{{n}_{i}}\sum_{x\in {C}_{i}}d\left(x,{c}_{i}\right)+\frac{1}{{n}_{j}}\sum_{x\in {C}_{j}}d\left(x,{c}_{j}\right)\right)/d\left({c}_{i},{c}_{j}\right)\right] \left(4\right)$$
The index is defined as the average similarity between each cluster \({C}_{i}\), i = 1,...,k, and the cluster \({C}_{j}\) most similar to it. A lower Davies–Bouldin index corresponds to a model with better separation between the clusters. The similarity is a measure that compares the distance between clusters with the size of the clusters themselves. Zero is the lowest possible score, and values closer to zero indicate a better partition. The computation of the Davies–Bouldin index is simpler than that of the Silhouette score, and it is based solely on quantities and features inherent to the dataset, as its computation only uses point-wise distances.
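As with the other indices, the DB index is available in scikit-learn; the sketch below evaluates it over a range of k on a hypothetical data matrix, with lower values indicating a better partition.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.random((60, 200))               # hypothetical signal matrix

# Lower DB values (closer to zero) indicate more compact, better-separated clusters.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, davies_bouldin_score(X, labels))
```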