sysevr / sysevr

sysevr's Introduction

SySeVR: A Framework for Using Deep Learning to Detect Vulnerabilities

Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, Zhaoxuan Chen. SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities. IEEE Transactions on Dependable and Secure Computing (TDSC). 2021. doi: 10.1109/TDSC.2021.3051525.

We propose SySeVR, a general framework for using deep learning to detect vulnerabilities. To evaluate SySeVR, we collect the Semantics-based Vulnerability Candidate (SeVC) dataset, which contains all kinds of vulnerabilities that are available from the National Vulnerability Database (NVD) and the Software Assurance Reference Dataset (SARD).

At a high level, the SyVC representation corresponds to a piece of code in a program that may be vulnerable based on a syntax analysis. The SeVC representation corresponds to the extended statements of the SyVCs, with the extension to incorporate some of the other statements that are semantically related to the SyVCs.

The SeVC dataset covers 1,591 open source C/C++ programs from the NVD and 14,000 programs from the SARD. It contains 420,627 SeVCs: 56,395 vulnerable SeVCs and 364,232 SeVCs that are not vulnerable. Four types of SyVCs are involved:

Library/API Function Call: This accommodates the vulnerabilities that are related to library/API function calls.
Array Usage: This accommodates the vulnerabilities that are related to arrays (e.g., improper use in array element access, array address arithmetic, address transfer as a function parameter).
Pointer Usage: This accommodates the vulnerabilities that are related to pointers (e.g., improper use in pointer arithmetic, reference, address transfer as a function parameter).
Arithmetic Expression: This accommodates the vulnerabilities that are related to improper arithmetic expressions (e.g., integer overflow).

sysevr's Issues

Where did you get the dict_cwe2father.pkl in make_label.py?

f = open("dict_cwe2father.pkl", 'rb')
dict_cwe2father = pickle.load(f)
f.close()

f = open("label_vec_type.pkl", 'rb')
label_vec_type = pickle.load(f)
f.close()

f = open("dict_testcase2code_rndargs_500_test.pkl", 'rb')
dict_testcase2code = pickle.load(f)
f.close()

Dependency on igraph is missing

access_db_operate.py imports igraph. Neither the dependency on that library nor its required version is given in the readme.

There is no .joernIndex folder

I'm not able to find the .joernIndex directory. What is the cause of this problem? I'm running the source code using the joern.jar file, but I don't see any output generated, and my database is not updated either.

The txt and pkl files output by getvullineforcountiing.py are always empty

How can I extract the dataset as two columns, one for source code and another for vulnerability type?

Hello Dear.

First I want to say thanks for this great repo!

But I am a little confused, and I have no idea whether my approach is correct or not.

I downloaded this repo and unzipped the four folders (API function call, Array usage, etc.).

Then I built a data frame for each folder with two columns: one for the source code and the other containing 1 or 0
to indicate whether the source code has a vulnerability or not.

Is this correct?

Have a nice day!
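
One way to build such a two-column table is to split each archive's text file on its dashed separator lines. The sketch below assumes every record consists of a header line, the slice statements, a final label line, and then a line of dashes; `parse_sevc_records` and the sample text are hypothetical and not part of the repository.

```python
def parse_sevc_records(text):
    """Split raw SeVC text into (code, label) pairs.

    Assumed record layout (an assumption, not confirmed by the authors):
    header line, one or more code lines, one label line, dashed separator.
    """
    records = []
    block = []

    def flush():
        # A usable record needs a header, at least one code line, and a label.
        if len(block) >= 3:
            code = "\n".join(block[1:-1])
            label = int(block[-1])
            records.append((code, label))
        del block[:]

    for line in text.splitlines():
        line = line.strip()
        if line and set(line) == {"-"}:
            flush()              # a line made only of dashes ends a record
        elif line:
            block.append(line)
    flush()                      # handle a final record without a separator
    return records


# Hypothetical sample in the assumed layout.
sample = """\
1 dir/a.c memmove 28
int64_t source [ 100 ] = { 0 } ;
memmove ( data , source , 100 * sizeof ( int64_t ) );
1
------------------------------
2 dir/b.c strcpy 10
char buf [ 10 ] ;
0
------------------------------
"""

rows = parse_sevc_records(sample)
for code, label in rows:
    print(label, repr(code))
```

The resulting list of pairs can be handed to any table library; the label column is kept as an integer so that a non-binary label, if one exists in a file, is not silently lost.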

How to read the input program or source code

How do I read the input program or source code for the next step?
Please show me the code for that.

Training with Keras.metrics is giving the correct number of true negatives and false negatives, but 0 for true positives and false positives

I tried the NVD dataset and got training values with all labels equal to 0. Some other people are in the same situation, but they get all 1s instead.
The problem might be in the multi_labels_to_two function in preprocess_dlIntput.py. Has anyone fixed this, or does anyone have any advice?
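
A quick check worth running before training is whether both classes survive the label conversion. The sketch below is only an assumption about what a multi-class-to-binary mapping should do (any non-zero label becomes 1); `to_binary` and the sample labels are hypothetical and do not reproduce the repository's multi_labels_to_two.

```python
from collections import Counter


def to_binary(labels):
    """Map multi-class labels to 0/1, treating any non-zero label as vulnerable."""
    return [1 if int(label) != 0 else 0 for label in labels]


# Hypothetical multi-class labels; 0 stands for "not vulnerable" here.
labels = [0, 3, 0, 7, 0, 0]
binary = to_binary(labels)
counts = Counter(binary)
print(counts)

# If only one class remains, metrics such as true positives will stay at 0.
if len(counts) < 2:
    raise ValueError("only one class present after conversion: %r" % dict(counts))
```

If the counter shows a single class, the problem lies in the labels fed to the model rather than in the Keras metrics.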

The following error occurs in points_get.py

While running points_get.py, I got the following error:

(n18 {childNum:"12",code:"strncpy ( lastcomm , me -> comm , sizeof ( lastcomm ) )",functionId:4,isCFGNode:"True",location:"9:1:274:319",type:"ExpressionStatement"})
Traceback (most recent call last):
File "points_get.py", line 246, in <module>
_dict = get_all_pointer_use(j)
File "points_get.py", line 137, in get_all_pointer_use
count=int(location[0])
ValueError: invalid literal for int() with base 10: '"'

How can I solve it? Please help!
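
The message `invalid literal for int() with base 10: '"'` suggests that `location` is still a quoted string such as `"9:1:274:319"` when it is indexed, so `location[0]` is the quote character rather than the first number. A minimal sketch of a more tolerant parser follows; `parse_location` is a hypothetical helper, not the function used in points_get.py.

```python
def parse_location(raw):
    """Turn a location such as '"9:1:274:319"' into a list of integers.

    Surrounding whitespace and quote characters are stripped first, so the
    helper accepts both quoted and unquoted values.
    """
    cleaned = raw.strip().strip("\"'")
    return [int(part) for part in cleaned.split(":")]


location = parse_location('"9:1:274:319"')
count = location[0]   # first field, the value the failing line tries to read
print(count)
```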

IOError: [Errno 21] Is a directory: './w2v-model/wordmodel3'

This error occurs while running create_w2vmodel.py.
How can I fix this issue?
I also have a doubt: what value should I give to w2v_model_path = "./w2v_model/wordmodel3"?
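
Errno 21 means the path passed for writing already exists as a directory, so the model path should name a file inside a directory rather than the directory itself. The sketch below (Python 3) shows one way to validate such a path before saving; `prepare_model_path` is a hypothetical helper and the temporary directory stands in for the real working directory.

```python
import os
import tempfile


def prepare_model_path(model_path):
    """Make sure model_path can be written as a file.

    Creates the parent directory if needed and refuses a path that already
    exists as a directory (the situation that produces Errno 21).
    """
    if os.path.isdir(model_path):
        raise IsADirectoryError(
            "%r is a directory; pass a file name inside it instead" % model_path
        )
    parent = os.path.dirname(model_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return model_path


# Demonstration in a throwaway directory (an assumption for the example).
base = tempfile.mkdtemp()
model_path = prepare_model_path(os.path.join(base, "w2v_model", "wordmodel3"))
print(model_path)
```

Note also that the error message mentions `./w2v-model/` while the variable uses `./w2v_model/`; the two spellings refer to different directories.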

How did you generate/collect non-vulnerable data samples?

Hi,

Thanks for sharing this interesting work. I want to ask about the process you followed to generate or collect the non-vulnerable training samples. In the paper, you mention that vulnerable samples were collected from NVD and NIST; these sources only provide vulnerable samples (i.e., CVEs), so how did you add the 43,913 code gadgets that are not vulnerable?

Looking forward to your reply.

Thanks

Question about f.close() in data_preprocess.py under slice2code

While reading the source of data_preprocess.py, I ran into the following question.
Line 28 reads:

f = open(file_path,'a+')

Inside the for loop that starts on line 30, line 51 reads:

f.close()

Running the program as it stands produces the following error:

Traceback (most recent call last):
  File "data_preprocess.py", line 48, in <module>
    f.write(str(sentence)+'\n')
ValueError: I/O operation on closed file.

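I think the cause is that f.close() was mistakenly placed inside the for loop that starts on line 30, so from the second iteration onward the file is already closed and can no longer be written to. After I moved f.close() out of that for loop, the script reported success, as shown below: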

pointersuse_slices.txt
arraysuse_slices.txt
integeroverflow_slices.txt
api_slices.txt
\success!

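Did I make a mistake in one of the other steps, or does the source code itself have a small bug? Has anyone else seen this, or run into other problems when running the code?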
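
A `with` block avoids this class of bug, because the file is closed exactly once, after the loop has finished. The sketch below is a minimal illustration under assumed names; `append_sentences` and the output file are hypothetical and do not reproduce data_preprocess.py.

```python
import os
import tempfile


def append_sentences(file_path, sentences):
    """Append one sentence per line; the file is closed when the block exits."""
    with open(file_path, "a+") as f:
        for sentence in sentences:
            # Safe on every iteration: the file stays open until the loop ends.
            f.write(str(sentence) + "\n")


# Demonstration with a throwaway output file.
out_path = os.path.join(tempfile.mkdtemp(), "slices_out.txt")
append_sentences(out_path, ["first slice", "second slice"])
print(open(out_path).read())
```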

Issue installing joern

I'm having trouble installing joern, so it is not able to connect to the database. How should I proceed?

The code to label the statement of vulnerability

I see that you share a pkl file indicating which lines of code in the SARD dataset are vulnerable. But for NVD, there is no such file. Could you share the code that computes the vulnerable lines of code for NVD?

Which part of the dataset is not vulnerable?

Failed to execute get_cfg_relation.py

Failed to execute get_cfg_relation.py

Using Python 2:

from py2neo.ext.gremlin import Gremlin 
getting ImportError: No module named ext.gremlin

Using Python 3:

from joern.all import JoernSteps
ModuleNotFoundError: No module named 'joern.all'

Dependencies Configuration:

I followed Vagrantized Joern to set up the project dependencies.
Everything installed successfully, and I was able to generate .joernindex and load it into Neo4j.
I am running my Python code in a conda environment (with Python 3.6), which I activated before installing the dependencies and running the code.
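
Before debugging further, it can help to list which of the required modules the active interpreter can actually import. The sketch below tries each import and records the failure message; the module list is taken from the errors reported in these issues, and `check_modules` is a hypothetical helper.

```python
import importlib


def check_modules(names):
    """Return {module name: None if importable, else the error text}."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = None
        except Exception as exc:  # ImportError, or e.g. SyntaxError in old packages
            status[name] = "%s: %s" % (type(exc).__name__, exc)
    return status


# Modules named in the reported errors, plus a standard-library control.
report = check_modules(["json", "py2neo.ext.gremlin", "joern.all", "igraph"])
for name, error in report.items():
    print(name, "OK" if error is None else error)
```

Running it once under each interpreter (Python 2 and Python 3) shows which environment, if any, has the full set installed.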

What should I do if I want to use this technique to analyze a Java project?

This project inspires me a lot. I very much like the idea of generating SeVCs from the AST and PDG. I'm doing research on analyzing Java projects, but sysevr targets C/C++. I would like to know whether there is an easy way to adapt sysevr into a Java code analyzer that generates SeVCs for Java source code.

Thanks to anyone who can answer my question!

gremlinPlugin

Hi,
python-joern can't connect to neo4j because of gremlinPlugin, and I am unable to install it.
Can you please help me?

Issue while executing get_cfg_relation.py

Every call to j.connectToDatabase reports the runtime error below:

Dictionary keys changed during iteration (raised from site-packages/joern-0.1-py3.10.egg/joern/all.py).

I installed and set up everything using https://github.com/tophertimzen/vagrantized-Joern/blob/master/setup.sh
Is this prototype not in working condition?

Can anyone please help me here, or advise whether a different joern version is needed?

Thanks

Problems when running get_cfg_relation.py

Traceback (most recent call last):
File "get_cfg_relation.py", line 329, in <module>
main()
File "get_cfg_relation.py", line 269, in main
j.connectToDatabase()
File "/usr/local/lib/python2.7/dist-packages/joern-0.1-py2.7.egg/joern/all.py", line 27, in connectToDatabase
self.gremlin = Gremlin(self.graphDb)
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/ext/gremlin/__init__.py", line 31, in __init__
super(Gremlin, self).__init__(graph, "GremlinPlugin")
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/core.py", line 2638, in __init__
extensions = self.graph.resource.metadata["extensions"]
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/core.py", line 198, in metadata
self.get()
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/core.py", line 234, in get
response = self.__base.get(headers=headers, redirect_limit=redirect_limit, **kwargs)
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/packages/httpstream/http.py", line 959, in get
return self.__get_or_head("GET", if_modified_since, headers, redirect_limit, **kwargs)
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/packages/httpstream/http.py", line 936, in __get_or_head
return rq.submit(redirect_limit=redirect_limit, **kwargs)
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/packages/httpstream/http.py", line 433, in submit
http, rs = submit(self.method, uri, self.body, self.headers)
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/packages/httpstream/http.py", line 362, in submit
raise SocketError(code, description, host_port=uri.host_port)
py2neo.packages.httpstream.http.SocketError: Connection refused

I'm running in Ubuntu 20.02.
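
A `Connection refused` error at this point usually means nothing is listening on the address that py2neo tries to reach. The sketch below checks whether a TCP port accepts connections; the host `localhost` and port 7474 (Neo4j's default HTTP port) are assumptions about the local setup, and `port_is_open` is a hypothetical helper.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


# Assumed address of the Neo4j server used by joern.
print("Neo4j reachable:", port_is_open("localhost", 7474))
```

If this prints False, the database server needs to be started (or its address corrected) before get_cfg_relation.py can connect.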

Some data records are really weird

For instance, in API function call.zip, there are records like:
------------------------------
8140 64420/CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54e.c memmove 28
void CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54b_badSink(int64_t * data)
CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54c_badSink ( data );
void CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54c_badSink(int64_t * data)
CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54d_badSink ( data );
void CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54d_badSink(int64_t * data)
CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54e_badSink ( data );
void CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54e_badSink(int64_t * data)
int64_t source [ 100 ] = { 0 } ;
memmove ( data , source , 100 * sizeof ( int64_t ) );
printLongLongLine ( data [ 0 ] );
void printLongLongLine (int64_t longLongIntNumber)
printf ( "%lld\n" , longLongIntNumber );
1
------------------------------

As I understand it, each record delimited by the --- lines represents a vulnerable or clean SeVC, and the first and second lines give its corresponding file and function. What, then, is the meaning of this record? Thanks a lot for clarifying this.

AttributeError: 'ProgbarLogger' object has no attribute 'log_values'

@yezihagendasi @MilesQLi

Does anyone know how to fix this error? I get it while running bgru.py. The error is as follows:
Train...
(0, 0)
start
Epoch 1/10
Traceback (most recent call last):
File "bgru.py", line 220, in <module>
main(traindataSetPath, testdataSetPath, realtestdataSetPath, weightPath, resultPath, batchSize, maxLen, vectorDim, layers, dropout)
File "bgru.py", line 85, in main
model.fit_generator(train_generator, steps_per_epoch=steps_epoch, epochs=10)
File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1426, in fit_generator
initial_epoch=initial_epoch)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_generator.py", line 229, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 77, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 336, in on_epoch_end
self.progbar.update(self.seen, self.log_values)
AttributeError: 'ProgbarLogger' object has no attribute 'log_values'
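
The `(0, 0)` printed before `Epoch 1/10` hints that the training set may be empty or smaller than one batch. If the number of steps per epoch is zero, the progress-bar callback may never see a batch, which is one commonly reported way to end up without `log_values`. This is an assumption about the cause, not a confirmed diagnosis. The sketch below shows a guard to run before calling fit_generator; `compute_steps_per_epoch` is a hypothetical helper.

```python
def compute_steps_per_epoch(num_samples, batch_size):
    """Return the number of full batches per epoch, refusing a value of zero."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    steps = num_samples // batch_size
    if steps == 0:
        raise ValueError(
            "no full batch available: %d samples for batch size %d; "
            "check that the training data path is correct and not empty"
            % (num_samples, batch_size)
        )
    return steps


# Hypothetical sizes: 130 samples with a batch size of 64 give 2 steps.
print(compute_steps_per_epoch(130, 64))
```

Failing early with a clear message points to the data-loading step instead of an error deep inside Keras.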

The "Program data" directory does not match the code

It seems that the dataset directory "Program data" does not match what the code references. For example, in make_label_nvd.py:

code_path = './data_source/linux_kernel/' #slice code of software
label_path = './C/label_source/linux_kernel/' #labels

Any idea how I can get the full dataset?

Why does the access_db_operate script take so much time for SARD data?

When I run access_db_operate.py, it takes a very long time to process the data. Can anyone tell me whether this is normal behavior or whether I might be doing something wrong?