Developing a highly accurate machine learning model is a significant milestone in artificial intelligence. To enhance experimental quality and reduce costs, it is important to apply ML-based models to the defect detection problem [21–23]. Previous studies have employed various ML algorithms, showing differing levels of effectiveness and accuracy [18, 21, 24].
This study aims to develop a prototype for SDP analysis that enhances accuracy while optimizing testing costs. We explore several established ML techniques on a publicly available dataset to improve upon previous research findings. Specifically, SVM, MLP, KNN, NB, and DT classifiers are employed to classify modules as defective or defect-free, and their performance is evaluated using metrics such as accuracy, sensitivity, specificity, and confusion matrices.
3.1 Data description
The study employed the JM1 dataset [25], obtained from the PROMISE software engineering repository, which is freely accessible and part of the NASA Metrics Data Program (MDP). JM1 originates from a coding tool utilized in NASA spacecraft projects, programmed predominantly in the C language. Table 1 summarizes the dataset's key characteristics.
Table 1
Title | Language | Source | Modules | Features | Defective | Defect-free | Defect-free rate (%)
JM1 | C | NASA spacecraft instrument | 10885 | 21 | 2106 | 8779 | 80.65
This dataset's attributes comprise four McCabe measures, twelve Halstead measures, and various other metrics. McCabe metrics focus on programming constructs and can be extracted directly from source code, making them method-level metrics [26]. In contrast, Halstead metrics are derived from counts of operators and operands and can be computed automatically by software tools [25]. The defect feature (D) takes two values, True and False, representing the presence or absence of software defects, respectively; in this study, defective modules are treated as class 2 and defect-free modules as class 1. Table 2 details the attributes of the JM1 dataset.
Table 2
Metrics | Feature No. | Feature name | Feature description
McCabe metrics | 1 | loc | Code line count
 | 2 | v(g) | Cyclomatic complexity
 | 3 | ev(g) | Essential complexity
 | 4 | iv(g) | Design complexity
Halstead metrics | 5 | N | Total operators + operands
 | 6 | V | Volume
 | 7 | L | Program length
 | 8 | D | Difficulty
 | 9 | I | Intelligence
 | 10 | E | Program writing effort
 | 11 | B | Bugs
 | 12 | T | Time estimator
 | 13 | lOCode | Code line count
 | 14 | lOComment | Comment lines count
 | 15 | lOBlank | Blank lines count
 | 16 | lOCodeAndComment | Code/comments lines
Other metrics | 17 | uniqOp | Unique operators
 | 18 | uniqOpnd | Unique operands
 | 19 | totalOp | Total operators
 | 20 | totalOpnd | Total operands
 | 21 | branchCount | Flow of graph
Defect | | D | Module has defects or not
3.2. Classifiers
In the realm of software defect prediction, ML algorithms play a crucial role in categorizing faulty modules. This study assesses the effectiveness of the SVM, MLP, KNN, NB, and DT algorithms in classifying software defects. These algorithms were chosen for their diverse operational characteristics and widespread use in research. Each algorithm is briefly described below.
3.2.1 Support Vector Machine. SVM is a highly effective tool for data classification. It operates by identifying a linear separator that maximizes the margin between the classes of data (Fig. 1). While ideally suited for binary classification tasks, SVM requires pairwise calculations for datasets with multiple classes [27]. In SVM, training samples are mapped to points in space so that the distance between the classes is maximized, and new samples are then assigned to a class according to which side of this margin they fall on. The aim is to determine the hyperplane that best separates the classes, represented mathematically by Eq. (1).
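For illustration, the following is a minimal sketch of training a linear SVM with scikit-learn; the random feature matrix, labels, and hyperparameters (kernel, C) are placeholder assumptions, not the settings used in this study.
```python
# Minimal linear SVM sketch (illustrative data and hyperparameters)
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 21))      # placeholder for the 21 JM1 features
y = rng.integers(0, 2, size=200)    # placeholder defect labels (0 = defect-free, 1 = defective)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A linear kernel searches for the maximum-margin separating hyperplane
clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```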
3.2.2 Multi-Layer Perceptron. The perceptron neural network is a supervised learning method that employs binary classification. Its binary classifier evaluates whether a given input, presented as a numerical vector, belongs to a specific class. The classifier utilizes a linear prediction function, which involves a feature vector and a corresponding set of weights. In an artificial neural network with multiple neurons, each output neuron operates independently, facilitating individual learning for each output. Figure 2 illustrates a basic representation of a perceptron neural network [28–34].
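A minimal sketch of a multi-layer perceptron classifier follows; the hidden-layer size, iteration limit, and random data are illustrative assumptions only.
```python
# Minimal MLP sketch (illustrative architecture and data)
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=200)    # placeholder defect labels

# One hidden layer of 10 neurons; the output neuron gives the binary decision
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=1).fit(X, y)
print(mlp.predict(X[:5]))
```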
3.2.3 K Nearest Neighbour. KNN is a non-parametric technique used for both classification and regression. The parameter K refers to the number of nearest neighbours considered in the data space, and the output differs depending on whether the method is used for classification or regression. Once a value of K is chosen, an unlabelled test sample is classified by comparing it to its K nearest neighbours in the training set. Distance measures such as the Mahalanobis, Manhattan, Euclidean, and Jaccard distances are used to compute neighbourhood distances or to weight the neighbours [35–37].
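A minimal KNN sketch is given below; the value of K and the distance metric are illustrative choices, not the configuration adopted in this study.
```python
# Minimal KNN sketch (illustrative K and distance metric)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=200)    # placeholder defect labels

# The metric can be swapped, e.g. "euclidean" or "manhattan"
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
print(knn.predict(X[:5]))
```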
3.2.4 Naive Bayesian. The Bayesian approach is a classification model that relies on conditional probability: the probability density function of each class is used to determine the likelihood of assigning test data to that class. Bayes’ formula, also referred to as Bayes’ rule, gives the probability of an event and relates the terms shown in Eq. (2): P(C) is the prior probability of class C, P(C|X) is the posterior probability of C given the observation X, P(X|C) is the likelihood of X given C, and P(X) is the marginal probability of X.
$$P\left( {C|X} \right)=\frac{{P\left( {X|C} \right)P\left( C \right)}}{{P\left( X \right)}}$$
2
Assuming that the problem’s independent variables consist of n features of the form X = (X1, X2, ..., Xn), the class of a sample is obtained by computing the posterior probability P(Ck|X1, ..., Xn) for each class Ck and assigning the sample to the class with the highest posterior.
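As an illustration, a minimal Gaussian naive Bayes sketch is shown below (it assumes conditionally independent, Gaussian-distributed features; the data are placeholders).
```python
# Minimal Gaussian naive Bayes sketch (illustrative data)
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=200)    # placeholder defect labels

nb = GaussianNB().fit(X, y)
# predict_proba returns P(C_k | X_1, ..., X_n); the class with the largest posterior is chosen
print(nb.predict_proba(X[:3]))
print(nb.predict(X[:3]))
```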
3.2.5 Decision Tree. The DT is an advanced model used to facilitate decision-making in a hierarchical format. It assesses potential outcomes, including those influenced by chance events, resource costs, and utility, and is visually represented as a tree structure. Each node represents a decision or condition, with edges pointing to its child nodes. To determine the optimal decision, the algorithm analyses data and uses metrics such as entropy or Gini Impurity to partition the data into distinct categories. This process continues recursively until a complete DT is constructed. Known for their high interpretability and readability, decision trees are considered essential tools in data mining, artificial intelligence decision-making, and machine learning [38, 39].
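The following minimal decision-tree sketch uses entropy as the split criterion; the criterion, depth limit, and data are illustrative assumptions rather than the study's settings.
```python
# Minimal decision tree sketch (illustrative criterion and data)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=200)    # placeholder defect labels

# Splits are chosen recursively to reduce entropy (criterion="gini" would use Gini impurity)
dt = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=4).fit(X, y)
print(dt.predict(X[:5]))
```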
3.3 Feature selection
Feature selection methods are employed to decrease the number of features utilized in training and testing prediction models. These methods fall into three primary categories:
- Filter method: selects features independently of the ML algorithm applied, relying on pre-defined criteria (a minimal sketch of this approach follows the list).
- Wrapper method: evaluates feature subsets using a classification function, incorporating feedback from the learning algorithm; it treats the learner as a black box.
- Combined model: integrates the evaluation criteria of the two previous methods at different search stages by embedding feature selection within the classifier structure; it may, however, face challenges such as determining the optimal number of features and overlooking feature relationships, which can affect the final selection outcome.
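For the filter category, a minimal sketch based on a univariate score is shown below; this is purely illustrative and is not the selection method adopted in this study.
```python
# Minimal filter-method sketch: rank features by a univariate score, independent of any classifier
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=200)    # placeholder defect labels

# Keep the 10 features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```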
Feature selection techniques serve several purposes:
- Simplifying models for easier interpretation by researchers [40].
- Reducing training times [41].
- Mitigating the curse of dimensionality [42].
- Enhancing data compatibility with the learning model class [43].
- Extracting inherent patterns from the input space [44–48].
These techniques assume that certain features in the data may be redundant or irrelevant, and can thus be safely excluded without significant information loss [43]. Distinguishing between redundant and irrelevant features is crucial, as a relevant feature may become redundant when another closely related feature is present [49].
This article employs a feature selection method based on subset evaluation, which identifies candidate feature subsets through systematic search. The search is guided by optimization algorithms rooted in artificial intelligence; specifically, a binary genetic algorithm, used as a wrapper method, is employed to identify the optimal feature subset.
The genetic algorithm is a search method widely employed to find approximate solutions to search and optimization problems. Based on evolutionary principles such as heredity, mutation, and natural selection, it is effective at pattern prediction and matching, which makes it a promising alternative to regression-based forecasting techniques. Genetic-algorithm programming transforms inputs into solutions through a modelled evolutionary process in which a fitness function evaluates candidate solutions until specified conditions are met; the algorithm relies on randomized operators.
The aim of the current research is to maximize the Correct Classification Rate (CCR). Because the genetic algorithm minimizes its objective function, the fitness function is defined as the inverse of the CCR. The algorithm stops iterating once the stopping condition (a specified number of iterations) is reached. The genetic algorithm implementation follows these steps (a minimal sketch of this wrapper procedure is given after the list):
- Initializing the population and evaluating individuals.
- Applying crossover to chromosomes and evaluating results.
- Introducing mutations to chromosomes and evaluating outcomes.
- Merging initial, crossover, and mutated populations.
- Sorting merged populations and selecting a population equal to the initial size (truncation).
- If stop conditions are not met, returning to step 2.
- Termination [50, 51].
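A minimal sketch of such a binary GA wrapper is given below. The population size, crossover and mutation operators, the number of generations, and the decision tree used to estimate the CCR are all illustrative assumptions, not the authors' settings; the fitness is the inverse of the cross-validated CCR, as described above.
```python
# Sketch of a binary GA wrapper for feature selection (illustrative parameters and data)
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=300)    # placeholder defect labels

def fitness(mask):
    """Inverse of the correct classification rate (the GA minimizes this)."""
    if mask.sum() == 0:
        return np.inf
    ccr = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:, mask.astype(bool)], y, cv=5).mean()
    return 1.0 / max(ccr, 1e-9)

pop_size, n_genes, n_gen, p_mut = 20, X.shape[1], 10, 0.1
pop = rng.integers(0, 2, size=(pop_size, n_genes))          # step 1: initial population

for _ in range(n_gen):
    # step 2: one-point crossover between shuffled parent pairs
    parents = pop[rng.permutation(pop_size)]
    cut = rng.integers(1, n_genes)
    children = np.vstack([np.hstack([parents[::2, :cut], parents[1::2, cut:]]),
                          np.hstack([parents[1::2, :cut], parents[::2, cut:]])])
    # step 3: bit-flip mutation of the current population
    mutants = pop ^ (rng.random(pop.shape) < p_mut)
    # steps 4-5: merge, sort by fitness, truncate back to the initial population size
    merged = np.vstack([pop, children, mutants])
    scores = np.array([fitness(ind) for ind in merged])
    pop = merged[np.argsort(scores)[:pop_size]]

print("selected features:", np.flatnonzero(pop[0]))
```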
In addition to feature selection, feature combination methods can also be employed. One such method is the feature dimension reduction technique.
3.4 Fisher’s discriminant analysis
Fisher's discriminant ratio (FDR), also described as Fisher's separation rate, is a statistical measure used to enhance classification accuracy. By assuming a Gaussian distribution of the sample data in each natural class, the FDR quantifies the degree of class differentiation, evaluates the influence of each feature on this differentiation, and assesses the potential benefit of combining features for improved detection. In two-class scenarios, the FDR performs best when the data exhibit a Gaussian or quasi-Gaussian distribution. The two classes typically share an overlapping region, which complicates classification. Enhancing separation between the classes involves reducing this overlap, as illustrated in Fig. 3, and the FDR quantifies this improvement. As the shared area diminishes, the difference between the class means of a feature increases while the class variances decrease, thereby enhancing differentiation. Eq. (3) defines the FDR for a two-class problem [52].
$$FDR=\frac{{{{\left( {{\mu _1} - {\mu _2}} \right)}^2}}}{{\sigma _{1}^{2}+\sigma _{2}^{2}}}$$
3
To optimize and increase the separation of data from two classes, coefficients a = [a1, a2, ..., an] are applied to the features. Here, n represents the number of features. Therefore, Eq. (3) is modified as follows:
$$FDR=\frac{{{a^{\text{T}}}\left( {{\mu _1} - \mu } \right){{\left( {{\mu _1} - \mu } \right)}^{\text{T}}}a+{a^{\text{T}}}\left( {{\mu _2} - \mu } \right){{\left( {{\mu _2} - \mu } \right)}^{\text{T}}}a}}{{{a^{\text{T}}}\left( {\sigma _{1}^{2}} \right)a+{a^{\text{T}}}\left( {\sigma _{2}^{2}} \right)a}}$$
4
where µ = [µ1, µ2, ..., µn] is the mean of each feature over the whole dataset, µ1 = [µ1,1, µ1,2, ..., µ1,n] and µ2 = [µ2,1, µ2,2, ..., µ2,n] are the per-feature means of classes 1 and 2, σ1 = [σ1,1, σ1,2, ..., σ1,n] and σ2 = [σ2,1, σ2,2, ..., σ2,n] are the per-feature standard deviations of classes 1 and 2, and a is the vector of FDR coefficients. Obtaining the coefficients a that maximize the FDR requires an optimization algorithm; one of the fastest algorithms in this field is Particle Swarm Optimization (PSO) [53]. Since PSO minimizes its objective function while the goal here is to maximize the FDR, the fitness function of the PSO is defined as the inverse of Eq. (4).
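For illustration, a minimal sketch of the per-feature FDR of Eq. (3) is shown below; the two synthetic classes are placeholders for the defective and defect-free samples.
```python
# Sketch of the per-feature Fisher discriminant ratio of Eq. (3) (illustrative data)
import numpy as np

rng = np.random.default_rng(6)
X1 = rng.normal(loc=0.0, scale=1.0, size=(100, 4))   # class 1 samples (placeholder)
X2 = rng.normal(loc=1.5, scale=1.2, size=(100, 4))   # class 2 samples (placeholder)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
var1, var2 = X1.var(axis=0), X2.var(axis=0)

# FDR_k = (mu1_k - mu2_k)^2 / (sigma1_k^2 + sigma2_k^2) for each feature k
fdr = (mu1 - mu2) ** 2 / (var1 + var2)
print("per-feature FDR:", fdr)
```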
The objective of the optimization process is to achieve the most effective outcome possible by using decision variables to select the optimal solution, typically by maximizing or minimizing objectives subject to constraints [54].
PSO is a population-based metaheuristic method used for addressing optimization problems. In PSO, each potential solution is represented as a particle with a specific velocity. Each particle assesses the objective function at its current position in the search space. The direction for the next movement in the search space is determined by combining information from the particle's own best position (personal best) and the best position among all particles in the swarm (global best). After each iteration, the algorithm proceeds to update the velocity vector using Eq. (5) and the position vector using Eq. (6).
$$V_{{ij}}^{{t+1}}=wV_{{ij}}^{t}+{c_1}r_{1}^{t}\left( {personalbes{t_{ij}} - X_{{ij}}^{t}} \right)+{c_2}r_{2}^{t}\left( {globalbes{t_j} - X_{{ij}}^{t}} \right)$$
5
Eq. (5) involves several key coefficients and variables, including the inertia coefficient denoted by w, the personal learning coefficient represented by c1, and the collective learning coefficient denoted by c2. Additionally, r1 and r2 are random numbers that follow a uniform distribution, while personal-best is the particle’s best personal memory and global-best is the best memory of all particles in each iteration.
$$X_{{ij}}^{{t+1}}=X_{{ij}}^{t}+V_{{ij}}^{{t+1}},\,\,\,\,\,\left( {i=1,\,2,\,\, \cdots ,\,p} \right),\,\,\,\,\,\left( {j=1,\,2,\,\, \cdots ,\,n} \right)$$
6
In these formulas, p is the total number of particles, n is the number of dimensions (decision variables), X is the position vector, and V is the velocity vector. After the velocity and position vectors are updated and all particles relocated, the next iteration begins, and this process continues until the optimal solution is reached [55]. The PSO algorithm is implemented as follows (a minimal sketch is given after these steps):
- Generating and evaluating the initial population.
- Identifying both the best global and personal memories.
- Updating the position and velocity vectors and evaluating the new solutions.
- If the stopping criterion (a specified number of iterations) has not been met, return to step 2.
- End.
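A minimal PSO sketch following the update rules of Eqs. (5) and (6) is given below; the swarm size, coefficients w, c1, c2, and the simple test objective are illustrative assumptions (in the actual study the objective would be the inverse FDR of Eq. (4)).
```python
# Sketch of PSO using the velocity/position updates of Eqs. (5) and (6) (illustrative settings)
import numpy as np

rng = np.random.default_rng(7)

def objective(x):
    # placeholder objective to minimize (e.g. the inverse FDR of Eq. (4) in the actual study)
    return np.sum(x ** 2)

p, n, n_iter = 30, 5, 100            # particles, dimensions, iterations
w, c1, c2 = 0.7, 1.5, 1.5            # inertia, personal and collective learning coefficients

X = rng.uniform(-1, 1, size=(p, n))  # positions
V = np.zeros((p, n))                 # velocities
pbest = X.copy()
pbest_val = np.array([objective(x) for x in X])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random((p, n)), rng.random((p, n))
    V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)   # Eq. (5)
    X = X + V                                                   # Eq. (6)
    vals = np.array([objective(x) for x in X])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = X[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best solution:", gbest, "objective value:", pbest_val.min())
```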
Managing large datasets not only places strain on computer hardware but can also impact the performance of machine-learning algorithms. To mitigate these challenges, our study employs PCA to uncover patterns in the data. PCA aims to identify correlations between variables, and when significant correlations exist, reducing the data dimensions can prove beneficial.
3.5 Principal component analysis
PCA is a widely utilized method for analysing large datasets characterized by numerous feature dimensions per observation. It constitutes a statistical technique aimed at reducing data dimensionality while retaining maximal information content. This facilitates enhanced comprehension and visualization of intricate, multidimensional datasets. By transforming data into a new coordinate system, PCA empowers researchers to represent data changes using fewer dimensions compared to the original dataset. In practical applications, many studies employ the first two principal components to visualize data in two dimensions and identify clusters of correlated data points. Ultimately, PCA aids in discerning the principal directions of variance within high-dimensional data, projecting them into a lower-dimensional subspace while preserving crucial information [56].
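A minimal PCA sketch is shown below; the projection onto two components mirrors the two-dimensional visualization described above, and the random matrix is a placeholder for the 21-feature JM1 data.
```python
# Minimal PCA sketch: project onto the first two principal components (illustrative data)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 21))            # placeholder for the 21-feature JM1 matrix

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)               # coordinates in the new two-dimensional subspace
print("explained variance ratio:", pca.explained_variance_ratio_)
```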
3.6 Performance evaluation
In general, a predictor can have four possible outcomes:
- True Positive: Modules that are accurately identified as faulty.
- False Positive: Non-defective modules that are mistakenly labelled as defective.
- True Negative: Non-defective modules that are correctly classified as non-faulty.
- False Negative: Faulty modules that are inaccurately classified as non-faulty.
True positives and true negatives indicate the correct diagnosis by the predictor, while false positives and false negatives have distinct consequences in predicting software defects, leading to the following effects:
- Inaccurate defect detection can result in additional time and costs during software development and testing.
- Consistent false positives and false negatives can diminish confidence in the accuracy of defect predictions and cause team members to doubt their diagnostic abilities.
- Frequent false positives can lead to real bugs being ignored or overlooked, a serious problem for teams facing many spurious issues.
- False positives can decrease development team productivity by requiring additional time and resources to investigate incorrect issues, leading to delays in software delivery and reduced overall quality.
- Repeated false positives and false negatives can demotivate development and testing teams, resulting in frustration and a sense of ineffectiveness that negatively impacts morale and productivity.
To minimize the impact of inaccurate software defect detection, it is crucial to employ the most effective tools and techniques for testing and evaluating software quality. Ongoing training of development and testing teams can enhance their ability to provide precise diagnoses while fostering a culture that prioritizes software quality enhancement and the avoidance of false detection.
Within the realm of software defect prediction, the current research utilized K-fold cross-validation with 500 folds to evaluate the classification algorithms. The evaluation criteria were derived from the confusion matrix, namely precision, specificity, sensitivity (recall), accuracy, and F-measure, computed from the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts. The criteria are defined in Eqs. (7)–(11):
$$Precision=\frac{{TP}}{{TP+FP}}$$
7
$$Specificity=\frac{{TN}}{{TN+FP}}$$
8
$$Sensitivity\left( {recall} \right)=\frac{{TP}}{{TP+FN}}$$
9
$$Accuracy=\frac{{TP+TN}}{{TP+TN+FP+FN}}$$
10
$$F{\text{-}}Measure=2 \times \frac{{\left( {precision \times recall} \right)}}{{\left( {precision+recall} \right)}}$$
11
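For illustration, a minimal sketch of computing Eqs. (7)–(11) from the entries of a confusion matrix is given below; the TP, FP, TN, and FN counts are made-up placeholders, not results from this study.
```python
# Sketch of the metrics in Eqs. (7)-(11) computed from a confusion matrix (illustrative counts)
tp, fp, tn, fn = 80, 15, 320, 25

precision   = tp / (tp + fp)                                   # Eq. (7)
specificity = tn / (tn + fp)                                   # Eq. (8)
recall      = tp / (tp + fn)                                   # Eq. (9), sensitivity
accuracy    = (tp + tn) / (tp + tn + fp + fn)                  # Eq. (10)
f_measure   = 2 * precision * recall / (precision + recall)    # Eq. (11)

print(precision, specificity, recall, accuracy, f_measure)
```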