Developing a highly accurate machine learning model is a significant milestone in artificial intelligence. To enhance experimental quality and reduce costs, it is important to apply ML-based models to the defect detection problem [21–23]. Previous studies have employed various ML algorithms, showing differing levels of effectiveness and accuracy [18, 21, 24].
This study aims to develop a prototype for SDP analysis that enhances accuracy while optimizing testing costs. We explore several established ML techniques on a publicly available dataset to improve upon previous research findings. Specifically, SVM, MLP, KNN, NB, and DT classifiers are employed to classify modules as defective or defect-free, and their performance is evaluated using metrics such as accuracy, sensitivity, specificity, and confusion matrices.
3.1 Data description
The study employed the JM1 dataset [25], obtained from the PROMISE software engineering repository, which is freely accessible and part of the NASA Metrics Data Program (MDP). JM1 originates from a coding tool utilized in NASA spacecraft projects, programmed predominantly in the C language. Table 1 summarizes the dataset's key characteristics.
Table 1
Title | Language | Source | Modules | Features | Defective | Defect-free | Defect-free rate (%)
JM1 | C | NASA spacecraft instrument | 10885 | 21 | 2106 | 8779 | 80.65
This dataset's attributes comprise four McCabe measures, twelve Halstead measures, and various other metrics. McCabe metrics focus on programming constructs and can be extracted directly from source code, making them method-level metrics [26]. In contrast, Halstead metrics are derived from counts of operators and operands and can be computed automatically by software tools [25]. The defect feature (D) takes two values, True and False, representing the presence or absence of software defects, respectively; in this study, defective modules are treated as class 2 and defect-free modules as class 1. Table 2 details the attributes of the JM1 dataset.
Table 2
Metrics | Feature No. | Feature name | Feature description
McCabe metrics | 1 | loc | Code line count
 | 2 | v(g) | Cyclomatic complexity
 | 3 | ev(g) | Essential complexity
 | 4 | iv(g) | Design complexity
Halstead metrics | 5 | N | Total operators + operands
 | 6 | V | Volume
 | 7 | L | Program length
 | 8 | D | Difficulty
 | 9 | I | Intelligence
 | 10 | E | Program writing effort
 | 11 | B | Bugs
 | 12 | T | Time estimator
 | 13 | lOCode | Code line count
 | 14 | lOComment | Comment lines count
 | 15 | lOBlank | Blank lines count
 | 16 | lOCodeAndComment | Code/comments lines
Other metrics | 17 | uniqOp | Unique operators
 | 18 | uniqOpnd | Unique operands
 | 19 | totalOp | Total operators
 | 20 | totalOpnd | Total operands
 | 21 | branchCount | Flow of graph
Defect | | D | Module has defects or not
3.2. Classifiers
In the realm of software defect prediction, ML algorithms play a crucial role in categorizing faulty modules. This study assesses the effectiveness of the SVM, MLP, KNN, NB, and DT algorithms in classifying software defects. These algorithms were chosen for their diverse operational characteristics and widespread use in research. Each algorithm is briefly described below.
3.2.1 Support Vector Machine. SVM is a highly effective tool for data classification. It operates by identifying a linear separator that maximizes the margin between the classes of data (Fig. 1). While ideally suited for binary classification tasks, SVM requires pairwise calculations for datasets with multiple classes [27]. In SVM, training samples are mapped to points in space so that the distance between the classes is maximized, and new samples are then assigned to a class according to which side of this margin they fall on. The aim is to determine the hyperplane that best separates the classes, represented mathematically by Eq. (1).
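For illustration, the following is a minimal sketch of training a linear SVM with scikit-learn; the random feature matrix, labels, and hyperparameters (kernel, C) are placeholder assumptions, not the settings used in this study.
```python
# Minimal linear SVM sketch (illustrative data and hyperparameters)
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 21))      # placeholder for the 21 JM1 features
y = rng.integers(0, 2, size=200)    # placeholder defect labels (0 = defect-free, 1 = defective)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A linear kernel searches for the maximum-margin separating hyperplane
clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```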
3.2.2 Multi-Layer Perceptron. The perceptron neural network is a supervised learning method that employs binary classification. Its binary classifier evaluates whether a given input, presented as a numerical vector, belongs to a specific class. The classifier utilizes a linear prediction function, which involves a feature vector and a corresponding set of weights. In an artificial neural network with multiple neurons, each output neuron operates independently, facilitating individual learning for each output. Figure 2 illustrates a basic representation of a perceptron neural network [28–34].
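A minimal sketch of a multi-layer perceptron classifier follows; the hidden-layer size, iteration limit, and random data are illustrative assumptions only.
```python
# Minimal MLP sketch (illustrative architecture and data)
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=200)    # placeholder defect labels

# One hidden layer of 10 neurons; the output neuron gives the binary decision
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=1).fit(X, y)
print(mlp.predict(X[:5]))
```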
3.2.3 K Nearest Neighbour. KNN is a non-parametric technique used for both classification and regression. The parameter K refers to the number of nearest neighbours considered in the data space, and the output differs depending on whether the method is used for classification or regression. Once a value of K is chosen, an unlabelled test sample is classified by comparing it to its K nearest neighbours in the training set. Distance measures such as the Mahalanobis, Manhattan, Euclidean, and Jaccard distances are used to compute neighbourhood distances or to weight the neighbours [35–37].
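A minimal KNN sketch is given below; the value of K and the distance metric are illustrative choices, not the configuration adopted in this study.
```python
# Minimal KNN sketch (illustrative K and distance metric)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=200)    # placeholder defect labels

# The metric can be swapped, e.g. "euclidean" or "manhattan"
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
print(knn.predict(X[:5]))
```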
3.2.4 Naive Bayesian. The Bayesian approach is a classification model that relies on conditional probability: the probability density function of each class is used to determine the likelihood of assigning test data to that class. Bayes’ formula, also referred to as Bayes’ rule, gives the probability of an event and relates the terms shown in Eq. (2): P(C) is the prior probability of class C, P(C|X) is the posterior probability of C given the observation X, P(X|C) is the likelihood of X given C, and P(X) is the marginal probability of X.
$$P\left( {C|X} \right)=\frac{{P\left( {X|C} \right)P\left( C \right)}}{{P\left( X \right)}}$$
2
Assuming that the problem’s independent variables consist of n features of the form X = (X1, X2, ..., Xn), the class of a sample is obtained by computing the posterior probability P(Ck|X1, ..., Xn) for each class Ck and assigning the sample to the class with the highest posterior.
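As an illustration, a minimal Gaussian naive Bayes sketch is shown below (it assumes conditionally independent, Gaussian-distributed features; the data are placeholders).
```python
# Minimal Gaussian naive Bayes sketch (illustrative data)
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=200)    # placeholder defect labels

nb = GaussianNB().fit(X, y)
# predict_proba returns P(C_k | X_1, ..., X_n); the class with the largest posterior is chosen
print(nb.predict_proba(X[:3]))
print(nb.predict(X[:3]))
```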
3.2.5 Decision Tree. The DT is an advanced model used to facilitate decision-making in a hierarchical format. It assesses potential outcomes, including those influenced by chance events, resource costs, and utility, and is visually represented as a tree structure. Each node represents a decision or condition, with edges pointing to its child nodes. To determine the optimal decision, the algorithm analyses data and uses metrics such as entropy or Gini Impurity to partition the data into distinct categories. This process continues recursively until a complete DT is constructed. Known for their high interpretability and readability, decision trees are considered essential tools in data mining, artificial intelligence decision-making, and machine learning [38, 39].
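The following minimal decision-tree sketch uses entropy as the split criterion; the criterion, depth limit, and data are illustrative assumptions rather than the study's settings.
```python
# Minimal decision tree sketch (illustrative criterion and data)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=200)    # placeholder defect labels

# Splits are chosen recursively to reduce entropy (criterion="gini" would use Gini impurity)
dt = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=4).fit(X, y)
print(dt.predict(X[:5]))
```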
3.3 Feature selection
Feature selection methods are employed to decrease the number of features utilized in training and testing prediction models. These methods fall into three primary categories:
- Filter method: selects features independently of the ML algorithm applied, relying on pre-defined criteria (a minimal sketch of this approach follows the list).
- Wrapper method: evaluates feature subsets using a classification function, incorporating feedback from the learning algorithm; it treats the learner as a black box.
- Combined model: integrates the evaluation criteria of the two previous methods at different search stages by embedding feature selection within the classifier structure; it may, however, face challenges such as determining the optimal number of features and overlooking feature relationships, which can affect the final selection outcome.
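For the filter category, a minimal sketch based on a univariate score is shown below; this is purely illustrative and is not the selection method adopted in this study.
```python
# Minimal filter-method sketch: rank features by a univariate score, independent of any classifier
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=200)    # placeholder defect labels

# Keep the 10 features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```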
Feature selection techniques serve several purposes:
- Simplifying models for easier interpretation by researchers [40].
- Reducing training times [41].
- Mitigating the curse of dimensionality [42].
- Enhancing data compatibility with the learning model class [43].
- Extracting inherent patterns from the input space [44–48].
These techniques assume that certain features in the data may be redundant or irrelevant, and can thus be safely excluded without significant information loss [43]. Distinguishing between redundant and irrelevant features is crucial, as a relevant feature may become redundant when another closely related feature is present [49].
This article employs a feature selection method based on subset evaluation, which identifies candidate feature subsets through systematic search. The search is guided by optimization algorithms rooted in artificial intelligence; specifically, a binary genetic algorithm, used as a wrapper method, is employed to identify the optimal feature subset.
The genetic algorithm is a search method widely employed to find approximate solutions to search and optimization problems. Based on evolutionary principles such as heredity, mutation, and natural selection, it is effective at pattern prediction and matching, which makes it a promising alternative to regression-based forecasting techniques. Genetic-algorithm programming transforms inputs into solutions through a modelled evolutionary process in which a fitness function evaluates candidate solutions until specified conditions are met; the algorithm relies on randomized operators.
The aim of the current research is to maximize the Correct Classification Rate (CCR). Because the genetic algorithm minimizes its objective function, the fitness function is defined as the inverse of the CCR. The algorithm stops iterating once the stopping condition (a specified number of iterations) is reached. The genetic algorithm implementation follows these steps (a minimal sketch of this wrapper procedure is given after the list):
- Initializing the population and evaluating individuals.
- Applying crossover to chromosomes and evaluating results.
- Introducing mutations to chromosomes and evaluating outcomes.
- Merging initial, crossover, and mutated populations.
- Sorting merged populations and selecting a population equal to the initial size (truncation).
- If stop conditions are not met, returning to step 2.
- Termination [50, 51].
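A minimal sketch of such a binary GA wrapper is given below. The population size, crossover and mutation operators, the number of generations, and the decision tree used to estimate the CCR are all illustrative assumptions, not the authors' settings; the fitness is the inverse of the cross-validated CCR, as described above.
```python
# Sketch of a binary GA wrapper for feature selection (illustrative parameters and data)
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 21))      # placeholder feature matrix
y = rng.integers(0, 2, size=300)    # placeholder defect labels

def fitness(mask):
    """Inverse of the correct classification rate (the GA minimizes this)."""
    if mask.sum() == 0:
        return np.inf
    ccr = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:, mask.astype(bool)], y, cv=5).mean()
    return 1.0 / max(ccr, 1e-9)

pop_size, n_genes, n_gen, p_mut = 20, X.shape[1], 10, 0.1
pop = rng.integers(0, 2, size=(pop_size, n_genes))          # step 1: initial population

for _ in range(n_gen):
    # step 2: one-point crossover between shuffled parent pairs
    parents = pop[rng.permutation(pop_size)]
    cut = rng.integers(1, n_genes)
    children = np.vstack([np.hstack([parents[::2, :cut], parents[1::2, cut:]]),
                          np.hstack([parents[1::2, :cut], parents[::2, cut:]])])
    # step 3: bit-flip mutation of the current population
    mutants = pop ^ (rng.random(pop.shape) < p_mut)
    # steps 4-5: merge, sort by fitness, truncate back to the initial population size
    merged = np.vstack([pop, children, mutants])
    scores = np.array([fitness(ind) for ind in merged])
    pop = merged[np.argsort(scores)[:pop_size]]

print("selected features:", np.flatnonzero(pop[0]))
```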
In addition to feature selection, feature combination methods can also be employed. One such method is the feature dimension reduction technique.
3.4 Fisher’s discriminant analysis
Fisher's discriminant ratio (FDR), also described as Fisher's separation rate, is a statistical measure used to enhance classification accuracy. By assuming a Gaussian distribution of the sample data in each natural class, the FDR quantifies the degree of class differentiation, evaluates the influence of each feature on this differentiation, and assesses the potential benefit of combining features for improved detection. In two-class scenarios, the FDR performs best when the data exhibit a Gaussian or quasi-Gaussian distribution. The two classes typically share an overlapping region, which complicates classification. Enhancing separation between the classes involves reducing this overlap, as illustrated in Fig. 3, and the FDR quantifies this improvement. As the shared area diminishes, the difference between the class means of a feature increases while the class variances decrease, thereby enhancing differentiation. Eq. (3) defines the FDR for a two-class problem [52].
$$FDR=\frac{{{{\left( {{\mu _1} - {\mu _2}} \right)}^2}}}{{\sigma _{1}^{2}+\sigma _{2}^{2}}}$$
3
To optimize and increase the separation of data from two classes, coefficients a = [a1, a2, ..., an] are applied to the features. Here, n represents the number of features. Therefore, Eq. (3) is modified as follows:
$$FDR=\frac{{{a^{\text{T}}}\left( {{\mu _1} - \mu } \right){{\left( {{\mu _1} - \mu } \right)}^{\text{T}}}a+{a^{\text{T}}}\left( {{\mu _2} - \mu } \right){{\left( {{\mu _2} - \mu } \right)}^{\text{T}}}a}}{{{a^{\text{T}}}\left( {\sigma _{1}^{2}} \right)a+{a^{\text{T}}}\left( {\sigma _{2}^{2}} \right)a}}$$
4
where µ = [µ1, µ2, ..., µn] is the mean of each feature over the whole dataset, µ1 = [µ1,1, µ1,2, ..., µ1,n] and µ2 = [µ2,1, µ2,2, ..., µ2,n] are the per-feature means of classes 1 and 2, σ1 = [σ1,1, σ1,2, ..., σ1,n] and σ2 = [σ2,1, σ2,2, ..., σ2,n] are the per-feature standard deviations of classes 1 and 2, and a is the vector of FDR coefficients. Obtaining the coefficients a that maximize the FDR requires an optimization algorithm; one of the fastest algorithms in this field is Particle Swarm Optimization (PSO) [53]. Since PSO minimizes its objective function while the goal here is to maximize the FDR, the fitness function of the PSO is defined as the inverse of Eq. (4).
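For illustration, a minimal sketch of the per-feature FDR of Eq. (3) is shown below; the two synthetic classes are placeholders for the defective and defect-free samples.
```python
# Sketch of the per-feature Fisher discriminant ratio of Eq. (3) (illustrative data)
import numpy as np

rng = np.random.default_rng(6)
X1 = rng.normal(loc=0.0, scale=1.0, size=(100, 4))   # class 1 samples (placeholder)
X2 = rng.normal(loc=1.5, scale=1.2, size=(100, 4))   # class 2 samples (placeholder)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
var1, var2 = X1.var(axis=0), X2.var(axis=0)

# FDR_k = (mu1_k - mu2_k)^2 / (sigma1_k^2 + sigma2_k^2) for each feature k
fdr = (mu1 - mu2) ** 2 / (var1 + var2)
print("per-feature FDR:", fdr)
```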
The objective of the optimization process is to achieve the most effective outcome possible by using decision variables to select the optimal solution, typically by maximizing or minimizing objectives subject to constraints [54].
PSO is a population-based metaheuristic method used for addressing optimization problems. In PSO, each potential solution is represented as a particle with a specific velocity. Each particle assesses the objective function at its current position in the search space. The direction for the next movement in the search space is determined by combining information from the particle's own best position (personal best) and the best position among all particles in the swarm (global best). After each iteration, the algorithm proceeds to update the velocity vector using Eq. (5) and the position vector using Eq. (6).
$$V_{{ij}}^{{t+1}}=wV_{{ij}}^{t}+{c_1}r_{1}^{t}\left( {personalbes{t_{ij}} - X_{{ij}}^{t}} \right)+{c_2}r_{2}^{t}\left( {globalbes{t_j} - X_{{ij}}^{t}} \right)$$
5
Eq. (5) involves several key coefficients and variables, including the inertia coefficient denoted by w, the personal learning coefficient represented by c1, and the collective learning coefficient denoted by c2. Additionally, r1 and r2 are random numbers that follow a uniform distribution, while personal-best is the particle’s best personal memory and global-best is the best memory of all particles in each iteration.
$$X_{{ij}}^{{t+1}}=X_{{ij}}^{t}+V_{{ij}}^{{t+1}},\,\,\,\,\,\left( {i=1,\,2,\,\, \cdots ,\,p} \right),\,\,\,\,\,\left( {j=1,\,2,\,\, \cdots ,\,n} \right)$$
6
In these formulas, p is the total number of particles, n is the number of dimensions (decision variables), X is the position vector, and V is the velocity vector. After the velocity and position vectors are updated and all particles relocated, the next iteration begins, and this process continues until the optimal solution is reached [55]. The PSO algorithm is implemented as follows (a minimal sketch is given after these steps):
- Generating and evaluating the initial population.
- Identifying both the best global and personal memories.
- Updating the position and velocity vectors and evaluating the new solutions.
- If the stopping criterion (a specified number of iterations) has not been met, return to step 2.
- End.
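A minimal PSO sketch following the update rules of Eqs. (5) and (6) is given below; the swarm size, coefficients w, c1, c2, and the simple test objective are illustrative assumptions (in the actual study the objective would be the inverse FDR of Eq. (4)).
```python
# Sketch of PSO using the velocity/position updates of Eqs. (5) and (6) (illustrative settings)
import numpy as np

rng = np.random.default_rng(7)

def objective(x):
    # placeholder objective to minimize (e.g. the inverse FDR of Eq. (4) in the actual study)
    return np.sum(x ** 2)

p, n, n_iter = 30, 5, 100            # particles, dimensions, iterations
w, c1, c2 = 0.7, 1.5, 1.5            # inertia, personal and collective learning coefficients

X = rng.uniform(-1, 1, size=(p, n))  # positions
V = np.zeros((p, n))                 # velocities
pbest = X.copy()
pbest_val = np.array([objective(x) for x in X])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random((p, n)), rng.random((p, n))
    V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)   # Eq. (5)
    X = X + V                                                   # Eq. (6)
    vals = np.array([objective(x) for x in X])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = X[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best solution:", gbest, "objective value:", pbest_val.min())
```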
Managing large datasets not only places strain on computer hardware but can also impact the performance of machine-learning algorithms. To mitigate these challenges, our study employs PCA to uncover patterns in the data. PCA aims to identify correlations between variables, and when significant correlations exist, reducing the data dimensions can prove beneficial.
3.5 Principal component analysis
PCA is a widely utilized method for analysing large datasets characterized by numerous feature dimensions per observation. It constitutes a statistical technique aimed at reducing data dimensionality while retaining maximal information content. This facilitates enhanced comprehension and visualization of intricate, multidimensional datasets. By transforming data into a new coordinate system, PCA empowers researchers to represent data changes using fewer dimensions compared to the original dataset. In practical applications, many studies employ the first two principal components to visualize data in two dimensions and identify clusters of correlated data points. Ultimately, PCA aids in discerning the principal directions of variance within high-dimensional data, projecting them into a lower-dimensional subspace while preserving crucial information [56].
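A minimal PCA sketch is shown below; the projection onto two components mirrors the two-dimensional visualization described above, and the random matrix is a placeholder for the 21-feature JM1 data.
```python
# Minimal PCA sketch: project onto the first two principal components (illustrative data)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 21))            # placeholder for the 21-feature JM1 matrix

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)               # coordinates in the new two-dimensional subspace
print("explained variance ratio:", pca.explained_variance_ratio_)
```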
3.6 Performance evaluation
In general, a predictor can have four possible outcomes:
- True Positive: Modules that are accurately identified as faulty.
- False Positive: Non-defective modules that are mistakenly labelled as defective.
- True Negative: Non-defective modules that are correctly classified as non-faulty.
- False Negative: Faulty modules that are inaccurately classified as non-faulty.
True positives and true negatives indicate the correct diagnosis by the predictor, while false positives and false negatives have distinct consequences in predicting software defects, leading to the following effects:
- Inaccurate defect detection can result in additional time and costs during software development and testing.
- Consistent false positives and false negatives can diminish confidence in the accuracy of defect predictions and cause team members to doubt their diagnostic abilities.
- Frequent false positives can lead to real bugs being ignored or overlooked, a serious problem for teams facing many spurious issues.
- False positives can decrease development team productivity by requiring additional time and resources to investigate incorrect issues, leading to delays in software delivery and reduced overall quality.
- Repeated false positives and false negatives can demotivate development and testing teams, resulting in frustration and a sense of ineffectiveness that negatively impacts morale and productivity.
To minimize the impact of inaccurate software defect detection, it is crucial to employ the most effective tools and techniques for testing and evaluating software quality. Ongoing training of development and testing teams can enhance their ability to provide precise diagnoses while fostering a culture that prioritizes software quality enhancement and the avoidance of false detection.
Within the realm of software defect prediction, the current research utilized K-fold cross-validation with 500 folds to evaluate the classification algorithms. The evaluation criteria were derived from the confusion matrix, namely precision, specificity, sensitivity (recall), accuracy, and F-measure, computed from the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts. The criteria are defined in Eqs. (7)–(11):
$$Precision=\frac{{TP}}{{TP+FP}}$$
7
$$Specificity=\frac{{TN}}{{TN+FP}}$$
8
$$Sensitivity\left( {recall} \right)=\frac{{TP}}{{TP+FN}}$$
9
$$Accuracy=\frac{{TP+TN}}{{TP+TN+FP+FN}}$$
10
$$F{\text{-}}Measure=2 \times \frac{{\left( {precision \times recall} \right)}}{{\left( {precision+recall} \right)}}$$
11
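For illustration, a minimal sketch of computing Eqs. (7)–(11) from the entries of a confusion matrix is given below; the TP, FP, TN, and FN counts are made-up placeholders, not results from this study.
```python
# Sketch of the metrics in Eqs. (7)-(11) computed from a confusion matrix (illustrative counts)
tp, fp, tn, fn = 80, 15, 320, 25

precision   = tp / (tp + fp)                                   # Eq. (7)
specificity = tn / (tn + fp)                                   # Eq. (8)
recall      = tp / (tp + fn)                                   # Eq. (9), sensitivity
accuracy    = (tp + tn) / (tp + tn + fp + fn)                  # Eq. (10)
f_measure   = 2 * precision * recall / (precision + recall)    # Eq. (11)

print(precision, specificity, recall, accuracy, f_measure)
```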