
defect-prediction's Introduction

defect-prediction's People

Contributors

huytu7, rahlk

Forkers

huytu7

defect-prediction's Issues

Standardized Results so far

Traditional approach:
1/ Release-based: all the bugs are calculated at once per release, and all the independent metrics are collected for every file in the project.
2/ Learning from all the available examples in the past ([:i_release]):
=> build the defect prediction model by training on all the files from past releases.

Modern approach:
1/ Just-in-time based: each row is a file updated by a commit within a release, and all the independent metrics are collected only for those updated files.
2/ Incremental learning: learn from release i, predict on release i + 1.

*** Notes: say abinit_core.F90 is updated 3 times; the dataset for that release will then contain 3 defective_abinit_core.F90 entries and 3 clean_abinit_core.F90 entries. If the file abinit_miscellaneous.F90 is updated in release 1 but not in release 2, the release-2 dataset will contain no information on abinit_miscellaneous.F90.
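
A minimal sketch of this just-in-time row construction, assuming commit records with a release id, the touched files with their metrics, and a buggy/clean label (the field names are illustrative, not the repo's actual schema):

```python
# Minimal sketch of just-in-time dataset construction. The record layout
# (release, files, buggy) is an assumption for illustration only.
def jit_dataset(commits, release):
    """One row per (commit, touched file); files untouched in this release never appear."""
    rows = []
    for commit in commits:
        if commit["release"] != release:
            continue
        for path, metrics in commit["files"].items():
            rows.append({"file": path, "buggy": commit["buggy"], **metrics})
    return rows
```

Under this scheme a file touched three times in a release yields three rows, and a file untouched in release 2 contributes nothing to the release-2 dataset.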

Intuitions:
1. Not all files are updated from release to release (only the files needed for bug fixing and for adding functionality).
2. With incremental learning, the previous release is enough to understand and predict whether a future change to a file is a bug fix or not (a minimal sketch of this loop follows).
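
A minimal sketch of that loop, assuming one pandas DataFrame per release and a scikit-learn learner (random forest is only a placeholder here, not necessarily the project's learner):

```python
# Incremental learning sketch: train on release i, predict release i + 1.
# `release_frames` and `feature_cols` are assumed inputs, not the repo's API.
from sklearn.ensemble import RandomForestClassifier

def incremental_eval(release_frames, feature_cols, label_col="buggy"):
    results = []
    for i in range(len(release_frames) - 1):
        train, test = release_frames[i], release_frames[i + 1]
        clf = RandomForestClassifier(random_state=0)
        clf.fit(train[feature_cols], train[label_col])
        preds = clf.predict(test[feature_cols])
        results.append((i + 1, i + 2, preds))  # (train release, test release, predictions)
    return results
```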

RQ4

Column legend for the tables below: Train/Test = the releases used for training and testing; Prec = precision; Pd = probability of detection (recall); Pf = probability of false alarm; F1 = F1 score; IFA = number of initial false alarms before the first correct defect prediction; PCI20 = proportion of changes inspected at a 20% inspection-effort cutoff. Prec, Pd, Pf, F1, and PCI20 are reported as percentages; IFA is a count.

LAMMPS (Language: C)

Train   Test    Prec    Pd      Pf      F1      IFA     PCI20
1       2       67      66      52      67      0       20
2       3       59      63      64      61      2       21
3       4       59      37      37      45      0       19
4       5       81      62      58      71      1       16
5       6       53      62      60      57      1       21
6       7       86      48      31      62      0       17
7       8       82      45      40      58      4       16
8       9       53      57      58      54      0       23
9       10      65      57      49      61      0       20

[figure: LAMMPS release-to-release results]

ABINIT (Language: FORTRAN)

Train   Test    Prec    Pd      Pf      F1      IFA     PCI20
1       2       52      53      49      52      0       18
2       3       50      62      60      56      0       16
3       4       52      51      52      52      3       16
4       5       50      44      45      47      0       16
5       6       51      40      41      45      0       15
6       7       54      43      40      48      3       17
7       8       52      48      46      50      0       15
8       9       54      46      46      50      1       16
9       10      51      50      49      51      1       16

[figure: ABINIT release-to-release results]

MDANALYSIS (Language: Python)

Train   Test    Prec    Pd      Pf      F1      IFA     PCI20
1       2       55      46      50      51      0       21
2       3       52      47      45      50      13      23
3       4       46      56      68      50      0       21
4       5       54      56      56      55      3       22
5       6       57      59      54      58      1       21
6       7       52      52      51      52      0       24
7       8       48      46      47      47      5       23
8       9       47      44      58      46      0       20
9       10      62      54      44      58      1       21

[figure: MDANALYSIS release-to-release results]

Buggy Commit Messages

RQ3: Is keyword searching over commits consistent with our Mechanical Turk labels? How about the human-in-the-loop AI bug-report reading method?

KEYWORDS:

Project     Precision   Recall    F1
abinit      58.94%      90.13%    71.20%
libmesh     43%         92%       59.12%
lammps      13.38%      89.62%    23.28%
mdanalysis  51.43%      89.62%    31.28%

FASTREAD:

Project     Precision   Recall    F1
abinit      72.63%      87.83%    79.51%
libmesh     49.89%      90.06%    64.20%
lammps      23.85%      97.53%    34.27%
mdanalysis  41.44%      94.43%    57.60%

For both precision and F1, FASTREAD achieved better performance than keyword searching alone; the human-in-the-loop AI bug-report reading method is more consistent with our Mechanical Turk results than keyword searches are.
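
For reference, a hedged sketch of how a keyword-based labeler can be scored against the manual (Mechanical Turk) labels; the keyword list is illustrative, not the project's actual list:

```python
# Score a keyword-based bug labeler against manual labels.
# BUG_KEYWORDS is an assumed example list, not the repo's actual keywords.
from sklearn.metrics import precision_recall_fscore_support

BUG_KEYWORDS = ("bug", "fix", "error", "fault", "patch")

def keyword_label(message):
    """True if any bug keyword appears in the commit message."""
    text = message.lower()
    return any(keyword in text for keyword in BUG_KEYWORDS)

def score(messages, manual_labels):
    predicted = [keyword_label(m) for m in messages]
    precision, recall, f1, _ = precision_recall_fscore_support(
        manual_labels, predicted, average="binary")
    return precision, recall, f1
```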

RQ4: Why FFT?

Median G-score performance delta of FFT versus LR, RF, and SVM:

[figure selection_022: median G-score performance deltas of FFT vs. LR, RF, and SVM]
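
For context, a minimal illustrative classifier in the fast-and-frugal-tree style (assuming that is what FFT refers to here; this is not the repo's implementation): each level tests one metric against a threshold, one branch exits immediately with a prediction, and only the other branch descends.

```python
# Illustrative fast-and-frugal tree: a short list of cues, each able to exit
# early with a prediction; the final cue decides both ways. The cue list in
# the example is made up, not a tree learned from this repo's data.
def fft_predict(row, cues):
    """cues: list of (metric, threshold, label_if_above) tuples, typically ~4 deep."""
    for metric, threshold, label_if_above in cues[:-1]:
        if row[metric] > threshold:
            return label_if_above                      # early exit on this cue
    metric, threshold, label_if_above = cues[-1]
    return label_if_above if row[metric] > threshold else (not label_if_above)

example_cues = [("la", 100, True), ("nf", 5, True), ("entropy", 0.6, True)]
print(fft_predict({"la": 20, "nf": 3, "entropy": 0.8}, example_cues))  # True (buggy)
```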

Comparison between different tricks for data collection

4 treatments in total, each repeated 15 times (a sketch of how the treatments differ follows the list below):

  • use all data from previous releases (_all)
  • use only the previous release (_incremental)
  • use half of the training data to learn (_reduce_1) - assuming that half of the training data is garbage, so using only half removes some bad data points and noise
  • use half of the training data to learn and half of the test data to test (_reduce_2) - same assumption as above, plus assuming that half of the test data is also garbage
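
A minimal sketch of the four treatments, assuming one pandas DataFrame per release (the function and variable names are illustrative):

```python
# Build (train, test) data for predicting release i + 1 under each treatment.
# `releases` is assumed to be a list of per-release pandas DataFrames.
import pandas as pd

def build_treatment(releases, i, treatment, seed=0):
    test = releases[i + 1]
    if treatment == "_all":            # all previous releases
        train = pd.concat(releases[: i + 1])
    elif treatment == "_incremental":  # only the immediately previous release
        train = releases[i]
    elif treatment == "_reduce_1":     # drop half the training rows as assumed noise
        train = releases[i].sample(frac=0.5, random_state=seed)
    elif treatment == "_reduce_2":     # drop half of both the training and test rows
        train = releases[i].sample(frac=0.5, random_state=seed)
        test = test.sample(frac=0.5, random_state=seed)
    else:
        raise ValueError(f"unknown treatment: {treatment}")
    return train, test
```

Repeating each treatment 15 times would then presumably vary the random seed (and any learner randomness) per repeat.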

Pf (probability of false alarm) results summary:
https://docs.google.com/spreadsheets/d/1UNOxsWn_eDygba70HaNikUcZwWYsyeVeisOjPk6z2GY/edit?usp=sharing

Raw:
_all results: https://docs.google.com/spreadsheets/d/1iOpCQSixeIyofm1-GdizgWoueCc89NcCQefZC3f8UJY/edit?usp=sharing
_incremental results: https://docs.google.com/spreadsheets/d/1tmrfi3lbcgreN7WF2XhpUjD5OQFIiJEeLbLsdokGj6E/edit?usp=sharing
_reduce_1: https://docs.google.com/spreadsheets/d/10e0c7obf4RnI10gqOKhBiMbKINh0mL6aJqFEofAW09k/edit?usp=sharing
_reduce_2:
https://docs.google.com/spreadsheets/d/1KVAwkxvZtwfgUenracYQJrR6DVTDxfFmiDLiL0cfrB4/edit?usp=sharing

Disagreement Rate Graphs (9 projects)

mdanalysis - 775 bugs found, against an estimate of 792 bugs, in 1274 reviewed commits (3303 commits in total).
[figure: mdanalysis disagreement rate]
lammps - 573 bugs found, against an estimate of 601 bugs, in 858 reviewed commits (7324 commits in total).
[figure: lammps disagreement rate]
libmesh - 1399 bugs found, against an estimate of 1495 bugs, in 2221 reviewed commits (8679 commits in total).
[figure: libmesh disagreement rate]
abinit - 676 bugs found, against an estimate of 698 bugs, in 1121 reviewed commits (5392 commits in total).
[figure: abinit disagreement rate]

RQ4: Which identification and prediction system performs best for buggy commits? (P_OPT20)

The figures below indicate the numerical improvements (separate from the statistical-testing results) of using F3T as the buggy-commit identification and prediction system instead of the standard Commit.Guru system. In these figures, the higher the vertical bar, the better F3T performs compared with another learning method. Let X be the F3T score and Y be the score of another data mining method; the height of each bar is then the median X - Y seen across all tests in a project (a small computation sketch follows the list below):

  • K: Keyword, S: SMOTE, SVM: Support Vector Machines, LR: Logistic Regression, RF: Random Forest
  • F3T performs as well as, or better than, the other learners in 21/27 cases for both G-score and Popt(20).
  • RF is widely adopted for defect prediction tasks (ranked first by Ghotra et al.), yet surprisingly F3T usually performs much better than Keyword+SMOTE+RF (except in 2 cases for Popt(20)).
  • When F3T performs comparatively better, it can do so by a large amount (up to 25% and 27% absolute improvement, equivalent to up to 103% and 85% relative improvement for G-score and Popt(20)).
  • When F3T performs comparatively worse, the size of its loss is not large (see the negative bars on the left-hand side, where F3T often loses by just 3% and 7% absolute, or only 11% and 7% relative).
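
A small sketch of that bar-height computation; the score lists are placeholders, and the G-score referenced above is commonly computed as the harmonic mean of Pd and 1 - Pf:

```python
# Height of one bar: median over all tests of (F3T score - other learner's score).
import statistics

def g_score(pd_, pf):
    """Commonly used G-score: harmonic mean of recall (Pd) and 1 - Pf, both in [0, 1]."""
    return 2 * pd_ * (1 - pf) / (pd_ + (1 - pf))

def bar_height(f3t_scores, other_scores):
    return statistics.median(x - y for x, y in zip(f3t_scores, other_scores))
```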

[figure rq4_final: median score deltas of F3T vs. other learners]

[figure rq4_final_2: median score deltas of F3T vs. other learners]
