
defect-prediction's Introduction

defect-prediction's People

Contributors

huytu7, rahlk

Forkers

huytu7

defect-prediction's Issues

Standardized Results so far

Traditional approach:
1/ Release-based: all the bugs are calculated at once per release, and all the independent metrics are collected for every file in the project.
2/ Learning from all the available examples in the past ([:i_release]):
=> build the defect prediction model by training on all the files from past releases.

Modern approach:
1/ Just-in-time based: each row is a file updated by a commit within a release, and all the independent metrics are collected only for those updated files.
2/ Incremental learning: learn from release i, predict on release i + 1.

*** Notes: say abinit_core.F90 is updated 3 times; the dataset for that release will then contain 3 defective_abinit_core.F90 entries and 3 clean_abinit_core.F90 entries. If the file abinit_miscellaneous.F90 is updated in release 1 but not in release 2, the release-2 dataset will contain no information on abinit_miscellaneous.F90.
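
A minimal sketch of this just-in-time row construction, assuming commit records with a release id, the touched files with their metrics, and a buggy/clean label (the field names are illustrative, not the repo's actual schema):

```python
# Minimal sketch of just-in-time dataset construction. The record layout
# (release, files, buggy) is an assumption for illustration only.
def jit_dataset(commits, release):
    """One row per (commit, touched file); files untouched in this release never appear."""
    rows = []
    for commit in commits:
        if commit["release"] != release:
            continue
        for path, metrics in commit["files"].items():
            rows.append({"file": path, "buggy": commit["buggy"], **metrics})
    return rows
```

Under this scheme a file touched three times in a release yields three rows, and a file untouched in release 2 contributes nothing to the release-2 dataset.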

Intuitions:
1. Not all files are updated from release to release (only the files needed for bug fixing and for adding functionality).
2. With incremental learning, the previous release is enough to understand and predict whether a future change to a file is a bug fix or not (a minimal sketch of this loop follows).
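
A minimal sketch of that loop, assuming one pandas DataFrame per release and a scikit-learn learner (random forest is only a placeholder here, not necessarily the project's learner):

```python
# Incremental learning sketch: train on release i, predict release i + 1.
# `release_frames` and `feature_cols` are assumed inputs, not the repo's API.
from sklearn.ensemble import RandomForestClassifier

def incremental_eval(release_frames, feature_cols, label_col="buggy"):
    results = []
    for i in range(len(release_frames) - 1):
        train, test = release_frames[i], release_frames[i + 1]
        clf = RandomForestClassifier(random_state=0)
        clf.fit(train[feature_cols], train[label_col])
        preds = clf.predict(test[feature_cols])
        results.append((i + 1, i + 2, preds))  # (train release, test release, predictions)
    return results
```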

RQ4

Column legend for the tables below: Train/Test = the releases used for training and testing; Prec = precision; Pd = probability of detection (recall); Pf = probability of false alarm; F1 = F1 score; IFA = number of initial false alarms before the first correct defect prediction; PCI20 = proportion of changes inspected at a 20% inspection-effort cutoff. Prec, Pd, Pf, F1, and PCI20 are reported as percentages; IFA is a count.

LAMMPS (Language: C)

Train   Test    Prec    Pd      Pf      F1      IFA     PCI20
1       2       67      66      52      67      0       20
2       3       59      63      64      61      2       21
3       4       59      37      37      45      0       19
4       5       81      62      58      71      1       16
5       6       53      62      60      57      1       21
6       7       86      48      31      62      0       17
7       8       82      45      40      58      4       16
8       9       53      57      58      54      0       23
9       10      65      57      49      61      0       20

[figure: LAMMPS release-to-release results]

ABINIT (Language: FORTRAN)

Train   Test    Prec    Pd      Pf      F1      IFA     PCI20
1       2       52      53      49      52      0       18
2       3       50      62      60      56      0       16
3       4       52      51      52      52      3       16
4       5       50      44      45      47      0       16
5       6       51      40      41      45      0       15
6       7       54      43      40      48      3       17
7       8       52      48      46      50      0       15
8       9       54      46      46      50      1       16
9       10      51      50      49      51      1       16

[figure: ABINIT release-to-release results]

MDANALYSIS (Language: Python)

Train   Test    Prec    Pd      Pf      F1      IFA     PCI20
1       2       55      46      50      51      0       21
2       3       52      47      45      50      13      23
3       4       46      56      68      50      0       21
4       5       54      56      56      55      3       22
5       6       57      59      54      58      1       21
6       7       52      52      51      52      0       24
7       8       48      46      47      47      5       23
8       9       47      44      58      46      0       20
9       10      62      54      44      58      1       21

[figure: MDANALYSIS release-to-release results]

Buggy Commit Messages

RQ3: Is keyword searching over commits consistent with our Mechanical Turk labels? How about the human-in-the-loop AI bug-report reading method?

KEYWORDS:

Project     Precision   Recall    F1
abinit      58.94%      90.13%    71.20%
libmesh     43%         92%       59.12%
lammps      13.38%      89.62%    23.28%
mdanalysis  51.43%      89.62%    31.28%

FASTREAD:

Project     Precision   Recall    F1
abinit      72.63%      87.83%    79.51%
libmesh     49.89%      90.06%    64.20%
lammps      23.85%      97.53%    34.27%
mdanalysis  41.44%      94.43%    57.60%

For both precision and F1, FASTREAD achieved better performance than keyword searching alone; the human-in-the-loop AI bug-report reading method is more consistent with our Mechanical Turk results than keyword searches are.
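
For reference, a hedged sketch of how a keyword-based labeler can be scored against the manual (Mechanical Turk) labels; the keyword list is illustrative, not the project's actual list:

```python
# Score a keyword-based bug labeler against manual labels.
# BUG_KEYWORDS is an assumed example list, not the repo's actual keywords.
from sklearn.metrics import precision_recall_fscore_support

BUG_KEYWORDS = ("bug", "fix", "error", "fault", "patch")

def keyword_label(message):
    """True if any bug keyword appears in the commit message."""
    text = message.lower()
    return any(keyword in text for keyword in BUG_KEYWORDS)

def score(messages, manual_labels):
    predicted = [keyword_label(m) for m in messages]
    precision, recall, f1, _ = precision_recall_fscore_support(
        manual_labels, predicted, average="binary")
    return precision, recall, f1
```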

RQ4: Why FFT?

Median G-score performance delta of FFT versus LR, RF, and SVM:

[figure selection_022: median G-score performance deltas of FFT vs. LR, RF, and SVM]
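
For context, a minimal illustrative classifier in the fast-and-frugal-tree style (assuming that is what FFT refers to here; this is not the repo's implementation): each level tests one metric against a threshold, one branch exits immediately with a prediction, and only the other branch descends.

```python
# Illustrative fast-and-frugal tree: a short list of cues, each able to exit
# early with a prediction; the final cue decides both ways. The cue list in
# the example is made up, not a tree learned from this repo's data.
def fft_predict(row, cues):
    """cues: list of (metric, threshold, label_if_above) tuples, typically ~4 deep."""
    for metric, threshold, label_if_above in cues[:-1]:
        if row[metric] > threshold:
            return label_if_above                      # early exit on this cue
    metric, threshold, label_if_above = cues[-1]
    return label_if_above if row[metric] > threshold else (not label_if_above)

example_cues = [("la", 100, True), ("nf", 5, True), ("entropy", 0.6, True)]
print(fft_predict({"la": 20, "nf": 3, "entropy": 0.8}, example_cues))  # True (buggy)
```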

Comparison between different tricks for data collection

4 treatments in total, each repeated 15 times (a sketch of how the treatments differ follows the list below):

  • use all data from previous releases (_all)
  • use only the previous release (_incremental)
  • use half of the training data to learn (_reduce_1) - assuming that half of the training data is garbage, so using only half removes some bad data points and noise
  • use half of the training data to learn and half of the test data to test (_reduce_2) - same assumption as above, plus assuming that half of the test data is also garbage
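
A minimal sketch of the four treatments, assuming one pandas DataFrame per release (the function and variable names are illustrative):

```python
# Build (train, test) data for predicting release i + 1 under each treatment.
# `releases` is assumed to be a list of per-release pandas DataFrames.
import pandas as pd

def build_treatment(releases, i, treatment, seed=0):
    test = releases[i + 1]
    if treatment == "_all":            # all previous releases
        train = pd.concat(releases[: i + 1])
    elif treatment == "_incremental":  # only the immediately previous release
        train = releases[i]
    elif treatment == "_reduce_1":     # drop half the training rows as assumed noise
        train = releases[i].sample(frac=0.5, random_state=seed)
    elif treatment == "_reduce_2":     # drop half of both the training and test rows
        train = releases[i].sample(frac=0.5, random_state=seed)
        test = test.sample(frac=0.5, random_state=seed)
    else:
        raise ValueError(f"unknown treatment: {treatment}")
    return train, test
```

Repeating each treatment 15 times would then presumably vary the random seed (and any learner randomness) per repeat.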

Pf (probability of false alarm) results summary:
https://docs.google.com/spreadsheets/d/1UNOxsWn_eDygba70HaNikUcZwWYsyeVeisOjPk6z2GY/edit?usp=sharing

Raw:
_all results: https://docs.google.com/spreadsheets/d/1iOpCQSixeIyofm1-GdizgWoueCc89NcCQefZC3f8UJY/edit?usp=sharing
_incremental results: https://docs.google.com/spreadsheets/d/1tmrfi3lbcgreN7WF2XhpUjD5OQFIiJEeLbLsdokGj6E/edit?usp=sharing
_reduce_1: https://docs.google.com/spreadsheets/d/10e0c7obf4RnI10gqOKhBiMbKINh0mL6aJqFEofAW09k/edit?usp=sharing
_reduce_2:
https://docs.google.com/spreadsheets/d/1KVAwkxvZtwfgUenracYQJrR6DVTDxfFmiDLiL0cfrB4/edit?usp=sharing

Disagreement Rate Graphs (9 projects)

mdanalysis - 775 bugs found, against an estimate of 792 bugs, in 1274 reviewed commits (3303 commits in total).
[figure: mdanalysis disagreement rate]
lammps - 573 bugs found, against an estimate of 601 bugs, in 858 reviewed commits (7324 commits in total).
[figure: lammps disagreement rate]
libmesh - 1399 bugs found, against an estimate of 1495 bugs, in 2221 reviewed commits (8679 commits in total).
[figure: libmesh disagreement rate]
abinit - 676 bugs found, against an estimate of 698 bugs, in 1121 reviewed commits (5392 commits in total).
[figure: abinit disagreement rate]

RQ4: Which identification and prediction system performs best for buggy commits? (P_OPT20)

The figures below indicate the numerical improvements (separate from the statistical-testing results) of using F3T as the buggy-commit identification and prediction system instead of the standard Commit.Guru system. In these figures, the higher the vertical bar, the better F3T performs compared with another learning method. Let X be the F3T score and Y be the score of another data mining method; the height of each bar is then the median X - Y seen across all tests in a project (a small computation sketch follows the list below):

  • K: Keyword, S: SMOTE, SVM: Support Vector Machines, LR: Logistic Regression, RF: Random Forest
  • F3T performs as well as, or better than, the other learners in 21/27 cases for both G-score and Popt(20).
  • RF is widely adopted for defect prediction tasks (ranked first by Ghotra et al.), yet surprisingly F3T usually performs much better than Keyword+SMOTE+RF (except in 2 cases for Popt(20)).
  • When F3T performs comparatively better, it can do so by a large amount (up to 25% and 27% absolute improvement, equivalent to up to 103% and 85% relative improvement for G-score and Popt(20)).
  • When F3T performs comparatively worse, the size of its loss is not large (see the negative bars on the left-hand side, where F3T often loses by just 3% and 7% absolute, or only 11% and 7% relative).
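
A small sketch of that bar-height computation; the score lists are placeholders, and the G-score referenced above is commonly computed as the harmonic mean of Pd and 1 - Pf:

```python
# Height of one bar: median over all tests of (F3T score - other learner's score).
import statistics

def g_score(pd_, pf):
    """Commonly used G-score: harmonic mean of recall (Pd) and 1 - Pf, both in [0, 1]."""
    return 2 * pd_ * (1 - pf) / (pd_ + (1 - pf))

def bar_height(f3t_scores, other_scores):
    return statistics.median(x - y for x, y in zip(f3t_scores, other_scores))
```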

[figure rq4_final: median score deltas of F3T vs. other learners]

[figure rq4_final_2: median score deltas of F3T vs. other learners]
