Just-In-Time Software Defect Prediction (JIT-SDP) is a technique used in software development to identify code changes that are likely to introduce defects at the moment they are committed. This approach analyzes the changed code for patterns associated with defects, such as complex code structures, poor coding practices, or incomplete testing. The goal of just-in-time defect prediction is to catch potential issues early in the development process, at the commit level, before they become more difficult and expensive to fix (Chen, Zhao et al. 2018, Cabral, Minku et al. 2019). By detecting these issues early, developers can take corrective action and prevent defects from propagating, saving the time and cost of extensive testing and rework later in the development cycle. In this type of approach, if the goal is only to identify defects at the change level, the chronological order of the data is not essential. However, if the goal is timely and early identification, chronology must be taken into account. When chronology matters, the performance of the prediction model may degrade over time as the underlying data distribution shifts, a phenomenon called Concept Drift (CD) (Gangwar, Kumar et al. 2021), and the model must be retrained on recent, consistent data. Problems set in dynamic environments that require real-time or near-real-time processing, such as JIT-SDP or learning from logs, recorded events, operations, and sensor readings, are becoming increasingly common. Incremental learning has therefore attracted the attention of many researchers in recent years (Gama, Žliobaitė et al. 2014).
Several studies have investigated Concept Drift Detection (CDD) in the context of SDP and JIT-SDP to evaluate model stability. Previous studies have focused on monitoring model performance and data distribution to detect CD over time (McIntosh and Kamei 2018, Gangwar, Kumar et al. 2021, Cabral and Minku 2022). These methods require labels for newly arriving data, which is often not feasible in practical applications. However, a method based on model interpretation has been proposed in (Demšar and Bosnić 2018), which promises to identify CD more accurately without requiring labels for the input data. An interpretation algorithm determines the impact of each feature of the dataset on the class label and explains why the model makes a particular prediction. For instance, in (Jiarpakdee, Tantithamthavorn et al. 2020), the features with the greatest impact on the probability that a commit is defective are reported as the most important features. To identify points where CD occurred over time, Demšar et al. (Demšar and Bosnić 2018) extracted points with a significant difference between consecutive model interpretation vectors of average effect size, using a statistical test that detects change points. They evaluated this method on a synthetic dataset against a baseline that monitors the Error Rate (ER) derived from the accuracy of an incremental Naive Bayes classifier. The Naive Bayes (NB) classifier is widely used in SDP and is known for its simplicity; surprisingly, it often outperforms more complex classification models (Ji, Huang et al. 2019). In this study, we improve their method to detect CD on defect data and evaluate our proposed method on a total of 20 JIT-SDP datasets. Our aim is to propose a reliable and stable model for detecting CD on JIT-SDP datasets so that predictions remain accurate over time.
We demonstrate that methods based on interpretation vectors with positive and negative effect sizes outperform methods based on the average interpretation vector, for both incremental and non-incremental classifiers. The method based on non-incremental learning also outperforms the incremental one, in both detection performance and speed. To support this claim, we compared the proposed methods with several baseline methods. Unlike the study (Demšar and Bosnić 2018), which used only a single baseline based on ER monitoring, we apply a variety of baselines that monitor different criteria of the Naive Bayes classifier, including threshold-dependent measures such as Recall and Precision and threshold-independent measures such as ER and AUC. Additionally, we identify the features that have the greatest impact on creating and discovering CD over time. For this purpose, instead of obtaining the significant distances between whole model-interpretation vectors, we extracted the significant distances produced by each individual feature and ranked the features by their impact on the discovery of CD points. Popular algorithms for CD detection via performance monitoring (Demšar and Bosnić 2018) include FLORA (Widmer and Kubat 1996), ADWIN (Bifet and Gavalda 2007), CUSUM (Page 1954), the Page-Hinkley test (PH) (Page 1954), Statistical Process Control (SPC) (Page 1954), and RCD (Gonçalves Jr and Barros 2012). Similar to (Demšar and Bosnić 2018), we apply the Page-Hinkley (PH) test to detect significant changes both in the performance measures of the Naive Bayes classifier and in the interpretation-vector distances obtained over time. We evaluated our proposed methods against the baseline methods by performing the Friedman test on well-known performance measures for CDD methods: cdAccuracy, MTD, MTR, and MTFA. These measures are described in more detail in the following sections.
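As a rough illustration of how such monitoring works, a minimal Page-Hinkley test raises an alarm when a monitored statistic, such as the classifier's error rate or an interpretation-vector distance, shifts upward. The parameter values `delta` and `lam` below are illustrative defaults, not the settings used in our experiments.

```python
class PageHinkley:
    """Minimal Page-Hinkley test for an upward shift in a monitored stream.

    `delta` tolerates small fluctuations around the running mean; `lam` is
    the alarm threshold on the accumulated deviation (illustrative values)."""

    def __init__(self, delta=0.005, lam=2.0):
        self.delta, self.lam = delta, lam
        self.mean, self.n = 0.0, 0          # running mean of the stream
        self.cum, self.min_cum = 0.0, 0.0   # cumulative deviation and its minimum

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.lam   # True when drift is signalled

# A monitored error rate that jumps from 10% to 40% after time step 50.
stream = [0.10] * 50 + [0.40] * 30
ph = PageHinkley()
alarms = [t for t, x in enumerate(stream) if ph.update(x)]
```

The first alarm lands a few steps after the shift at t = 50; the lag is the price the test pays for being robust to transient noise.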
Research (Lin, Tantithamthavorn et al. 2021) showed that the interpretation of a general JIT-SDP model (interpretation over a combination of datasets) is not consistent with the local interpretation (interpretation of each dataset individually). As a result, general JIT-SDP models are unable to fully explain the variation in local JIT-SDP model interpretations. In this paper, we utilize the interpretation of local JIT-SDP models to detect CD on individual projects. Specifically, we present the following main ideas:
- Detecting CD points by monitoring threshold-dependent and threshold-independent performance measures
- Identifying CD points on unlabeled software change-level data using model interpretation
- Extracting the features that are effective in CD discovery
Figure 1 provides a clearer understanding of how the model interprets a series of training sets. Light blue denotes a positive impact of a range of feature values on defect detection, while bold blue indicates the overall positive impact of each feature. Conversely, orange and red signify a negative impact on defect detection: if a feature value falls within those ranges, the sample is recognized as clean. In this example, la, lt, nf, and ns are considered effective features, since in certain intervals their blue parts dominate the orange and red parts, giving them a more significant role in defect detection.
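The per-feature view of Figure 1 can be sketched numerically: for each change metric, scan a grid of values and record where the contribution to the defect class turns positive (the "blue" region). The Gaussian class-conditional parameters below are invented for illustration only; the feature names la, lt, nf, and ns are the ones shown in the figure.

```python
import numpy as np

# Invented class-conditional Gaussians for the four metrics of Figure 1.
features = ["la", "lt", "nf", "ns"]
mu_clean  = np.array([10.0, 200.0, 2.0, 1.0])
mu_defect = np.array([80.0, 600.0, 5.0, 2.0])
var       = np.array([900.0, 4.0e4, 4.0, 1.0])   # shared per-feature variance

def contribution(x, j):
    """log p(x_j | defect) - log p(x_j | clean): positive values mean this
    feature value pushes the model toward the defect class."""
    def logpdf(x, m, v):
        return -0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v)
    return logpdf(x, mu_defect[j], var[j]) - logpdf(x, mu_clean[j], var[j])

# For each feature, find the smallest grid value whose contribution is positive,
# i.e. where the defect-leaning ("blue") interval begins.
thresholds = {}
for j, name in enumerate(features):
    grid = np.linspace(0.0, 2.0 * mu_defect[j], 400)
    c = contribution(grid, j)
    thresholds[name] = float(grid[c > 0].min())
```

With equal class variances, each threshold sits at the midpoint between the two class means, which matches the intuition that values beyond that point are more typical of defect-inducing changes.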
Our paper is organized as follows. Section 2 reviews previous research related to our study, Sections 3 and 4 present our proposed methodology and results, respectively, and the final section discusses conclusions and future work.