Just-In-Time Software Defect Prediction (JIT-SDP) is a technique used in software development to identify code changes that are likely to introduce defects at the moment they are committed. This approach analyzes the changed code for patterns associated with defects, such as complex code structures, poor coding practices, or incomplete testing. The goal of just-in-time defect prediction is to catch potential issues early in the development process, at the commit level, before they become more difficult and expensive to fix (Chen, Zhao et al. 2018, Cabral, Minku et al. 2019). By detecting these issues early, developers can take corrective action and prevent defects from propagating, saving the time and cost of extensive testing and rework later in the development cycle. In this type of approach, if the goal is only to identify defects at the change level, the chronological order of the data is not essential. However, if the goal is timely and early identification, chronology must be taken into account. When chronology matters, the performance of the prediction model may degrade over time as the underlying data distribution shifts, a phenomenon called Concept Drift (CD) (Gangwar, Kumar et al. 2021), and the model must be retrained on recent, consistent data. Problems set in dynamic environments that require real-time or near-real-time processing, such as JIT-SDP or learning from logs, recorded events, operations, and sensor readings, are becoming increasingly common. Incremental learning has therefore attracted the attention of many researchers in recent years (Gama, Žliobaitė et al. 2014).
Several studies have investigated Concept Drift Detection (CDD) in the context of SDP and JIT-SDP to evaluate model stability. Previous studies have focused on monitoring model performance and data distribution to detect CD over time (McIntosh and Kamei 2018, Gangwar, Kumar et al. 2021, Cabral and Minku 2022). These methods require labels for newly arriving data, which is often not feasible in practical applications. However, a method based on model interpretation has been proposed in (Demšar and Bosnić 2018), which promises to identify CD more accurately without requiring labels for the input data. An interpretation algorithm determines the impact of each feature of the dataset on the class label and explains why the model makes a particular prediction. For instance, in (Jiarpakdee, Tantithamthavorn et al. 2020), the features with the greatest impact on the probability that a commit is defective are reported as the most important features. To identify points where CD occurred over time, Demšar et al. (Demšar and Bosnić 2018) extracted points with a significant difference between consecutive model interpretation vectors of average effect size, using a statistical test that detects change points. They evaluated this method on a synthetic dataset against a baseline that monitors the Error Rate (ER) derived from the accuracy of an incremental Naive Bayes classifier. The Naive Bayes (NB) classifier is widely used in SDP and is known for its simplicity; surprisingly, it often outperforms more complex classification models (Ji, Huang et al. 2019). In this study, we improve their method to detect CD on defect data and evaluate our proposed method on a total of 20 JIT-SDP datasets. Our aim is to propose a reliable and stable model for detecting CD on JIT-SDP datasets so that predictions remain accurate over time.
We demonstrate that methods based on interpretation vectors with positive and negative effect sizes outperform methods based on the average interpretation vector, for both incremental and non-incremental classifiers. The method based on non-incremental learning also outperforms the incremental one, in both detection performance and speed. To support this claim, we compared the proposed methods with several baseline methods. Unlike the study (Demšar and Bosnić 2018), which used only a single baseline based on ER monitoring, we apply a variety of baselines that monitor different criteria of the Naive Bayes classifier, including threshold-dependent measures such as Recall and Precision and threshold-independent measures such as ER and AUC. Additionally, we identify the features that have the greatest impact on creating and discovering CD over time. For this purpose, instead of obtaining the significant distances between whole model-interpretation vectors, we extracted the significant distances produced by each individual feature and ranked the features by their impact on the discovery of CD points. Popular algorithms for CD detection via performance monitoring (Demšar and Bosnić 2018) include FLORA (Widmer and Kubat 1996), ADWIN (Bifet and Gavalda 2007), CUSUM (Page 1954), the Page-Hinkley test (PH) (Page 1954), Statistical Process Control (SPC) (Page 1954), and RCD (Gonçalves Jr and Barros 2012). Similar to (Demšar and Bosnić 2018), we apply the Page-Hinkley (PH) test to detect significant changes both in the performance measures of the Naive Bayes classifier and in the interpretation-vector distances obtained over time. We evaluated our proposed methods against the baseline methods by performing the Friedman test on well-known performance measures for CDD methods: cdAccuracy, MTD, MTR, and MTFA. These measures are described in more detail in the following sections.
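As a rough illustration of how such monitoring works, a minimal Page-Hinkley test raises an alarm when a monitored statistic, such as the classifier's error rate or an interpretation-vector distance, shifts upward. The parameter values `delta` and `lam` below are illustrative defaults, not the settings used in our experiments.

```python
class PageHinkley:
    """Minimal Page-Hinkley test for an upward shift in a monitored stream.

    `delta` tolerates small fluctuations around the running mean; `lam` is
    the alarm threshold on the accumulated deviation (illustrative values)."""

    def __init__(self, delta=0.005, lam=2.0):
        self.delta, self.lam = delta, lam
        self.mean, self.n = 0.0, 0          # running mean of the stream
        self.cum, self.min_cum = 0.0, 0.0   # cumulative deviation and its minimum

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.lam   # True when drift is signalled

# A monitored error rate that jumps from 10% to 40% after time step 50.
stream = [0.10] * 50 + [0.40] * 30
ph = PageHinkley()
alarms = [t for t, x in enumerate(stream) if ph.update(x)]
```

The first alarm lands a few steps after the shift at t = 50; the lag is the price the test pays for being robust to transient noise.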
Research (Lin, Tantithamthavorn et al. 2021) showed that the interpretation of a general JIT-SDP model (interpretation over a combination of datasets) is not consistent with the local interpretation (interpretation of each dataset individually). As a result, general JIT-SDP models are unable to fully explain the variation in local JIT-SDP model interpretations. In this paper, we utilize the interpretation of local JIT-SDP models to detect CD on individual projects. Specifically, we present the following main ideas:
- Detecting CD points by monitoring threshold-dependent and threshold-independent performance measures
- Identifying CD points on unlabeled software change-level data using model interpretation
- Extracting the features that are effective in CD discovery
Figure 1 provides a clearer understanding of how the model interprets a series of training sets. Light blue denotes a positive impact of a range of feature values on defect detection, while bold blue indicates the overall positive impact of each feature. Conversely, orange and red signify a negative impact on defect detection: if a feature value falls within those ranges, the sample is recognized as clean. In this example, la, lt, nf, and ns are considered effective features, since in certain intervals their blue parts dominate the orange and red parts, giving them a more significant role in defect detection.
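The per-feature view of Figure 1 can be sketched numerically: for each change metric, scan a grid of values and record where the contribution to the defect class turns positive (the "blue" region). The Gaussian class-conditional parameters below are invented for illustration only; the feature names la, lt, nf, and ns are the ones shown in the figure.

```python
import numpy as np

# Invented class-conditional Gaussians for the four metrics of Figure 1.
features = ["la", "lt", "nf", "ns"]
mu_clean  = np.array([10.0, 200.0, 2.0, 1.0])
mu_defect = np.array([80.0, 600.0, 5.0, 2.0])
var       = np.array([900.0, 4.0e4, 4.0, 1.0])   # shared per-feature variance

def contribution(x, j):
    """log p(x_j | defect) - log p(x_j | clean): positive values mean this
    feature value pushes the model toward the defect class."""
    def logpdf(x, m, v):
        return -0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v)
    return logpdf(x, mu_defect[j], var[j]) - logpdf(x, mu_clean[j], var[j])

# For each feature, find the smallest grid value whose contribution is positive,
# i.e. where the defect-leaning ("blue") interval begins.
thresholds = {}
for j, name in enumerate(features):
    grid = np.linspace(0.0, 2.0 * mu_defect[j], 400)
    c = contribution(grid, j)
    thresholds[name] = float(grid[c > 0].min())
```

With equal class variances, each threshold sits at the midpoint between the two class means, which matches the intuition that values beyond that point are more typical of defect-inducing changes.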
Our paper is organized as follows. Section 2 reviews previous research related to our study, Sections 3 and 4 present our proposed methodology and results, respectively, and the final section discusses conclusions and future work.