
sysevr's Introduction

SySeVR: A Framework for Using Deep Learning to Detect Vulnerabilities

Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, Zhaoxuan Chen. SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities. IEEE Transactions on Dependable and Secure Computing (TDSC). 2021. doi: 10.1109/TDSC.2021.3051525.


We propose a general framework for using deep learning to detect vulnerabilities, named SySeVR. To evaluate SySeVR, we collect the Semantics-based Vulnerability Candidate (SeVC) dataset, which contains all kinds of vulnerabilities that are available from the National Vulnerability Database (NVD) and the Software Assurance Reference Dataset (SARD).

At a high level, the SyVC representation corresponds to a piece of code in a program that may be vulnerable based on a syntax analysis. The SeVC representation corresponds to the extended statements of the SyVCs, with the extension to incorporate some of the other statements that are semantically related to the SyVCs.

The SeVC dataset covers 1,591 open source C/C++ programs from the NVD and 14,000 programs from the SARD. It contains 420,627 SeVCs, of which 56,395 are vulnerable and 364,232 are not. Four types of SyVCs are involved:

  1. Library/API Function Call: This accommodates the vulnerabilities that are related to library/API function calls.
  2. Array Usage: This accommodates the vulnerabilities that are related to arrays (e.g., improper use in array element access, array address arithmetic, address transfer as a function parameter).
  3. Pointer Usage: This accommodates the vulnerabilities that are related to pointers (e.g., improper use in pointer arithmetic, reference, address transfer as a function parameter).
  4. Arithmetic Expression: This accommodates the vulnerabilities that are related to improper arithmetic expressions (e.g., integer overflow).
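
The dataset counts quoted above imply a noticeable class imbalance, which matters when training a classifier on the SeVCs. The short check below only restates those numbers; the variable names are illustrative and do not come from the repository.

```python
# Counts taken from the dataset description above.
vulnerable_sevcs = 56_395
non_vulnerable_sevcs = 364_232

# The two classes should add up to the stated total of 420,627 SeVCs.
total_sevcs = vulnerable_sevcs + non_vulnerable_sevcs

# Share of vulnerable samples, useful when choosing class weights or metrics.
vulnerable_ratio = vulnerable_sevcs / total_sevcs

print(f"total SeVCs: {total_sevcs}")
print(f"vulnerable share: {vulnerable_ratio:.1%}")
```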


sysevr's Issues

Where did you get the dict_cwe2father.pkl in make_label.py?

import pickle

# Load the three pickled inputs; "with" closes each file even if loading fails.
with open("dict_cwe2father.pkl", 'rb') as f:
    dict_cwe2father = pickle.load(f)

with open("label_vec_type.pkl", 'rb') as f:
    label_vec_type = pickle.load(f)

with open("dict_testcase2code_rndargs_500_test.pkl", 'rb') as f:
    dict_testcase2code = pickle.load(f)

There is no .joernIndex folder

I can't find the .joernIndex directory. What is causing this? I'm running Joern on the source code using the joern.jar file, but no output is generated, and my database is not updated either.

How can I extract the dataset as two columns, one for source code and another for vulnerability type?

Hello Dear.

First I want to say thanks for this great repo!

But I am a little confused, and I don't know whether my approach is correct or not.

I downloaded this repo, then I unzipped the four folders (API function call, Array usage, etc.).

Then I built a data frame for each folder with two columns: one for the source code and the other containing 1 or 0 to indicate whether the source code has a vulnerability or not.

Is this correct?

Have a nice day!
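
A minimal sketch of the approach described in this question. It assumes a record layout that the question does not confirm: records separated by lines of dashes, a header line first, the code statements next, and a 0/1 label on the last line. The function name `parse_slice_records` and its parameters are hypothetical, not part of the repository.

```python
def parse_slice_records(lines, separator_prefix="-----"):
    """Split slice-file lines into (header, code, label) tuples.

    Assumed layout (not confirmed by the repository): each record is
    delimited by a line of dashes, starts with a header line, and ends
    with an integer label (1 = vulnerable, 0 = not vulnerable).
    """
    records, block = [], []
    # A trailing sentinel separator flushes the final record.
    for raw in list(lines) + [separator_prefix]:
        line = raw.strip()
        if line.startswith(separator_prefix):
            if len(block) >= 2:
                header, *code, label = block
                records.append((header, "\n".join(code), int(label)))
            block = []
        elif line:
            block.append(line)
    return records


sample = [
    "------------------------------",
    "1 a/b.c memmove 28",
    "void f(int * data)",
    "memmove ( data , src , 4 );",
    "1",
    "------------------------------",
    "2 c/d.c strcpy 5",
    "strcpy ( a , b );",
    "0",
]
rows = parse_slice_records(sample)
for header, code, label in rows:
    print(label, header)
```

The resulting list of tuples can be passed to a data frame constructor if a tabular form is needed. Note that the label here only says whether a record is vulnerable; it does not carry a vulnerability type.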

The following error occurs in points_get.py

While running points_get.py, I got the following error:

(n18 {childNum:"12",code:"strncpy ( lastcomm , me -> comm , sizeof ( lastcomm ) )",functionId:4,isCFGNode:"True",location:"9:1:274:319",type:"ExpressionStatement"})
Traceback (most recent call last):
File "points_get.py", line 246, in <module>
_dict = get_all_pointer_use(j)
File "points_get.py", line 137, in get_all_pointer_use
count=int(location[0])
ValueError: invalid literal for int() with base 10: '"'

How can I solve it? Please help me!
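
The traceback suggests that `location[0]` was a double-quote character, which would happen if the location value still carried its surrounding quotes (as in `location:"9:1:274:319"` in the printed node). The sketch below shows one way to parse the first field defensively. It is not the repository's code, and the helper name `first_location_field` is hypothetical.

```python
def first_location_field(location):
    """Return the first colon-separated field of a location string as an int.

    Surrounding whitespace and quote characters are stripped first, so both
    '9:1:274:319' and '"9:1:274:319"' yield 9.
    """
    cleaned = location.strip().strip("\"'")
    return int(cleaned.split(":")[0])


print(first_location_field('"9:1:274:319"'))
print(first_location_field("9:1:274:319"))
```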

How did you generate/collect non-vulnerable data samples?

Hi,

Thanks for sharing this interesting work. I want to ask about the process you followed to generate or collect the non-vulnerable training samples. In the paper, you mention that vulnerable samples were collected from NVD and NIST, but these only provide vulnerable samples (i.e., CVEs), so how did you obtain the 43,913 code gadgets that are not vulnerable?

Looking forward to your reply.

Thanks

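Question about f.close() in data_preprocess.py under slice2code

While reading the source of data_preprocess.py, I ran into the following question.
Line 28 reads:

f = open(file_path,'a+')

Inside the for loop that starts on line 30, line 51 reads:

f.close()

Running the program as it stands produces the following error:

Traceback (most recent call last):
  File "data_preprocess.py", line 48, in <module>
    f.write(str(sentence)+'\n')
ValueError: I/O operation on closed file.

I believe the cause is that f.close() was mistakenly placed inside the for loop on line 30, so from the second iteration onward the file is no longer open. I therefore moved f.close() out of that for loop, and the run then reported success, as shown below:

pointersuse_slices.txt
arraysuse_slices.txt
integeroverflow_slices.txt
api_slices.txt
\success!

Did I perhaps make a mistake in some other step, or does the source code itself have a small problem? Has anyone else run into this, or into other problems when running the code?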
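
The fix described in this issue can be sketched as follows. This is not the actual data_preprocess.py; the function name `write_sentences` and its arguments are hypothetical. The point is only that the file is opened once, written to inside the loop, and closed after the loop finishes.

```python
import os
import tempfile


def write_sentences(file_path, sentences):
    """Append each sentence to file_path on its own line.

    The "with" block keeps the file open for the whole loop and closes it
    exactly once afterwards, so no write hits a closed file.
    """
    with open(file_path, "a+") as f:
        for sentence in sentences:
            f.write(str(sentence) + "\n")


# Usage example with a temporary file.
tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, "slices_out.txt")
write_sentences(out_path, ["first slice", "second slice", "third slice"])
with open(out_path) as f:
    written = f.read().splitlines()
print(written)
```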

The code to label the statement of vulnerability

I see that you share a pkl file indicating which lines of code in the SARD dataset are vulnerable. But for NVD there is no such file. Could you share the code that computes the vulnerable lines of code for NVD?

Failed to execute get_cfg_relation.py

get_cfg_relation.py fails in both environments I tried:

  1. With Python 2:
     from py2neo.ext.gremlin import Gremlin
     fails with ImportError: No module named ext.gremlin
  2. With Python 3:
     from joern.all import JoernSteps
     fails with ModuleNotFoundError: No module named 'joern.all'

Dependencies Configuration:

  1. Followed Vagrantized Joern to set up the project dependencies.
     - Everything installed successfully, and I was able to generate .joernIndex and load it into Neo4j.
     - I am running my Python code in a conda environment (Python 3.6), which I activated before installing the dependencies and running the code.
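
Both failures above are import errors, so a reasonable first diagnostic is to check which of the required modules the active interpreter can actually see. The sketch below uses only the standard library; the helper name `can_import` is hypothetical. Run it with the same interpreter (and the same conda environment) that runs get_cfg_relation.py.

```python
import importlib


def can_import(module_name):
    """Return True if module_name can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:  # also covers ModuleNotFoundError
        return False
    return True


# Module names taken from the two import statements quoted above.
for name in ("py2neo", "py2neo.ext.gremlin", "joern.all"):
    print(name, "->", "importable" if can_import(name) else "missing")
```

If `py2neo` is importable but `py2neo.ext.gremlin` is not, the installed py2neo version may differ from the one the scripts were written against; this is an assumption to verify, not something the issue confirms.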

What should I do if I want to use this technique to analyze a Java project?

This project inspires me a lot. I very much like the idea of generating SeVCs from the AST and PDG. I'm doing research on analyzing Java projects, but sysevr targets C/C++. I'd like to know whether there is an easy way to adapt sysevr into a Java code analyzer that generates SeVCs for Java source code.

Thanks for anyone who could answer my question!

gremlinPlugin

Hi,
python-joern can't connect to Neo4j because of gremlinPlugin, and I am not able to install it.
Can you please help me?

Problems when running get_cfg_relation.py

Traceback (most recent call last):
File "get_cfg_relation.py", line 329, in <module>
main()
File "get_cfg_relation.py", line 269, in main
j.connectToDatabase()
File "/usr/local/lib/python2.7/dist-packages/joern-0.1-py2.7.egg/joern/all.py", line 27, in connectToDatabase
self.gremlin = Gremlin(self.graphDb)
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/ext/gremlin/__init__.py", line 31, in __init__
super(Gremlin, self).__init__(graph, "GremlinPlugin")
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/core.py", line 2638, in __init__
extensions = self.graph.resource.metadata["extensions"]
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/core.py", line 198, in metadata
self.get()
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/core.py", line 234, in get
response = self.__base.get(headers=headers, redirect_limit=redirect_limit, **kwargs)
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/packages/httpstream/http.py", line 959, in get
return self.__get_or_head("GET", if_modified_since, headers, redirect_limit, **kwargs)
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/packages/httpstream/http.py", line 936, in __get_or_head
return rq.submit(redirect_limit=redirect_limit, **kwargs)
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/packages/httpstream/http.py", line 433, in submit
http, rs = submit(self.method, uri, self.body, self.headers)
File "/home/yjl/.local/lib/python2.7/site-packages/py2neo/packages/httpstream/http.py", line 362, in submit
raise SocketError(code, description, host_port=uri.host_port)
py2neo.packages.httpstream.http.SocketError: Connection refused

I'm running on Ubuntu 20.02.
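
The traceback ends in `SocketError: Connection refused`, which usually means nothing is listening on the address that py2neo tried to reach. A quick check that is independent of py2neo is sketched below. The host and port are assumptions (Neo4j's default HTTP port is 7474; a custom configuration may differ), and the helper name `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or host not resolvable
        return False


# Assumed default Neo4j HTTP endpoint; adjust to the local configuration.
print("Neo4j reachable on localhost:7474:", is_port_open("localhost", 7474))
```

If this prints `False`, the Neo4j server is probably not running or is bound to a different port, and starting it before calling `connectToDatabase()` is the first thing to try.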

Some data records are really weird

For instance, in API function call.zip, there are records like:
------------------------------
8140 64420/CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54e.c memmove 28
void CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54b_badSink(int64_t * data)
CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54c_badSink ( data );
void CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54c_badSink(int64_t * data)
CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54d_badSink ( data );
void CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54d_badSink(int64_t * data)
CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54e_badSink ( data );
void CWE121_Stack_Based_Buffer_Overflow__CWE805_int64_t_alloca_memmove_54e_badSink(int64_t * data)
int64_t source [ 100 ] = { 0 } ;
memmove ( data , source , 100 * sizeof ( int64_t ) );
printLongLongLine ( data [ 0 ] );
void printLongLongLine (int64_t longLongIntNumber)
printf ( "%lld\n" , longLongIntNumber );
1
------------------------------

As I understand it, each record delimited by the --- lines represents a vulnerable or clean SeVC, and the first and second lines give its corresponding file and function. What, then, is the meaning of this record? Thanks a lot for clarifying this.

AttributeError: 'ProgbarLogger' object has no attribute 'log_values'

@yezihagendasi @MilesQLi

Does anyone know how to fix this error? I get it when running bgru.py. The error is as follows:
Train...
(0, 0)
start
Epoch 1/10
Traceback (most recent call last):
File "bgru.py", line 220, in <module>
main(traindataSetPath, testdataSetPath, realtestdataSetPath, weightPath, resultPath, batchSize, maxLen, vectorDim, layers, dropout)
File "bgru.py", line 85, in main
model.fit_generator(train_generator, steps_per_epoch=steps_epoch, epochs=10)
File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1426, in fit_generator
initial_epoch=initial_epoch)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_generator.py", line 229, in fit_generator
callbacks.on_epoch_end(epoch, epoch_logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 77, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 336, in on_epoch_end
self.progbar.update(self.seen, self.log_values)
AttributeError: 'ProgbarLogger' object has no attribute 'log_values'
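
The `(0, 0)` printed before `start` hints that the training set may be empty, in which case `steps_per_epoch` could be zero and the epoch would end without any batch having been processed. That is a guess based on the log above, not something the issue confirms. The sketch below shows a guard that could be run before `fit_generator`; the function name and arguments are hypothetical and are not taken from bgru.py.

```python
import math


def compute_steps_per_epoch(num_samples, batch_size):
    """Return the number of batches per epoch, failing fast on empty data."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    steps = math.ceil(num_samples / batch_size)
    if steps == 0:
        # An empty dataset would otherwise surface later as an obscure
        # callback error instead of a clear message.
        raise ValueError("no training samples found; check the dataset path")
    return steps


print(compute_steps_per_epoch(100, 32))
```

If the guard raises, the paths passed to bgru.py (for example `traindataSetPath`) are the first thing to check.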

The "Program data" directory does not match the code

It seems that the dataset directory "Program data" does not match the paths referenced in the code. For example, in make_label_nvd.py:

code_path = './data_source/linux_kernel/' #slice code of software
label_path = './C/label_source/linux_kernel/' #labels

Any idea how I can get the full dataset?
