dessertlab / evil

EVIL (Exploiting software VIa natural Language) is an approach to automatically generate software exploits in assembly/Python language from descriptions in natural language. The approach leverages Neural Machine Translation (NMT) techniques and a dataset that we developed for this work.

License: GNU General Public License v3.0

Languages: Shell 2.01%, Python 97.60%, C++ 0.34%, Cython 0.05%
Topics: nmt, shellcode, encoder, decoder, software-exploitation, exploit, assembly, linux, seq2seq, codebert

evil's Introduction

EVIL: Exploiting Software via Natural Language

This repository contains the dataset and the code related to the paper EVIL: Exploiting Software via Natural Language accepted for publication at the 32nd International Symposium on Software Reliability Engineering (ISSRE 2021) conference.

The paper is publicly available on IEEE Xplore. The slide presentation is available on SlideShare, and the video presentation of the paper is available on YouTube.

EVIL is an approach to automatically generate software exploits in assembly/Python language from descriptions in natural language. The approach leverages Neural Machine Translation (NMT) techniques and a dataset that we developed for this work.

This repository contains:

  1. A substantive dataset containing exploits collected from shellcode databases, and their descriptions in the English language. The dataset includes both assembly code (i.e., shellcodes and decoders) and Python code (i.e., encoders). Such data is valuable to support research in machine translation for security-oriented applications, since the techniques are data-driven.
  2. The code to reproduce the experiments described in the paper.
  3. The appendix of the paper containing additional information on the test set.

Dataset

To automatically generate Python and assembly programs used for security exploits, we curated a large dataset for training NMT models. Each sample in the dataset consists of a snippet of code from these exploits and its corresponding description in the English language. We collected exploits from publicly available databases (exploitdb, shellstorm), public repositories (e.g., GitHub), and programming guidelines. In particular, we focused on exploits targeting Linux, the most common OS for security-critical network services, running on IA-32 (i.e., the 32-bit version of the x86 Intel Architecture). The dataset is stored in the folder EVIL/datasets and consists of two parts:

  1. Encoders: a Python dataset, which contains Python code used by exploits to encode the shellcode;
  2. Decoders: an assembly dataset, which includes shellcode and decoders to revert the encoding. This dataset extends the Shellcode_IA32 dataset.

Both datasets are already split into train, dev, and test sets. encoder-*.in contains the natural language intents and encoder-*.out contains the corresponding code snippets. Detailed information on the dataset is available in the paper.
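As a minimal sketch of how the paired files can be consumed, the snippet below reads an intent file and a snippet file line by line and zips them into (intent, code) pairs. It assumes that line i of the .in file describes line i of the .out file, which is how the issue reports further down refer to the data. The function name load_pairs and the helper read_lines are hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List, Tuple


def read_lines(path: Path) -> List[str]:
    """Return the lines of a text file without their trailing newline."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_pairs(in_path: Path, out_path: Path) -> List[Tuple[str, str]]:
    """Pair each intent with the code snippet on the same line number.

    Assumption: the two files are line-aligned, so a length mismatch is
    treated as an error rather than silently truncated.
    """
    intents = read_lines(in_path)
    snippets = read_lines(out_path)
    if len(intents) != len(snippets):
        raise ValueError(
            f"{in_path.name} has {len(intents)} lines, "
            f"{out_path.name} has {len(snippets)} lines"
        )
    return list(zip(intents, snippets))
```

A call would look like load_pairs(dataset_dir / "encoder-train.in", dataset_dir / "encoder-train.out"), where dataset_dir is wherever the encoder files sit under EVIL/datasets in your checkout.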

Experiments

We provide the code to replicate the experiments of the paper. In particular, the repository contains the code to generate assembly/Python exploits with CodeBERT and Seq2Seq models. We also added the code to run the pre-processing and post-processing phases. The detailed steps to replicate the experiments are described in the INSTALL.md file.

Appendix

The folder EVIL/Appendix contains detailed information on the 20 encoders and decoders used in the test set. It includes the source URL, the total number of lines (n_t) of each program, and the number of syntactically correct (n_syn) and semantically correct (n_sem) lines generated by our approach, for both the encoders in Python and the decoders in assembly. In total, the test set for the Python programs contains 375 unique pairs of Python code snippets (not including prints) along with their natural language descriptions. The test set for assembly contains 305 unique pairs of code snippets (95 of which are multi-line snippets) and natural language intents.
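The three counts lend themselves to simple per-program ratios. The sketch below assumes that the natural reading applies, namely the fraction of generated lines that are syntactically (n_syn / n_t) or semantically (n_sem / n_t) correct. The class name ProgramResult and the sample numbers are made up for illustration and are not taken from the appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgramResult:
    """Line counts for one test program, named as in the appendix."""
    n_t: int    # total number of lines in the program
    n_syn: int  # lines generated with correct syntax
    n_sem: int  # lines generated with correct semantics

    def __post_init__(self) -> None:
        if self.n_t <= 0:
            raise ValueError("n_t must be positive")
        if not (0 <= self.n_syn <= self.n_t and 0 <= self.n_sem <= self.n_t):
            raise ValueError("n_syn and n_sem must lie between 0 and n_t")

    @property
    def syntactic_ratio(self) -> float:
        return self.n_syn / self.n_t

    @property
    def semantic_ratio(self) -> float:
        return self.n_sem / self.n_t


# Hypothetical values, for illustration only.
example = ProgramResult(n_t=20, n_syn=18, n_sem=15)
print(f"syntactic: {example.syntactic_ratio:.2f}, semantic: {example.semantic_ratio:.2f}")
```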

Contacts

For further information, contact us via email: [email protected] (Pietro) and [email protected] (Erfan).

evil's People

Contributors

piliguori, taisazero

Forkers

kenuosec cridin1

evil's Issues

There are some errors in the dataset.

For example, in the Python dataset:
in the file "encoder-train.in", line 3230 is "define the method serialize_headers with an argument self."
in the file "encoder-train.out", line 3230 is "def streaming_content ( self ) :"
The method name in the code clearly does not match the one in the intent.
Errors like this occur in large numbers in the dataset, so that after IP resolution a single input ends up with multiple outputs.
For example, after IP the pair above becomes:
"define the method var0 with an argument self." and "def streaming_content ( self ) :"
The placeholder var0 does not represent the code correctly.
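One rough way to look for pairs like the one reported here is sketched below. It is not part of the repository: the regular expressions only cover intents phrased as "define the method NAME" and code of the form "def NAME (", and both function names are hypothetical. The first function flags a pair whose method names differ, and the second groups pairs by intent to list intents that map to more than one distinct snippet.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed phrasing, taken from the single example quoted in this issue.
INTENT_NAME = re.compile(r"define the method (\w+)")
CODE_NAME = re.compile(r"def\s+(\w+)\s*\(")


def method_name_mismatch(intent: str, snippet: str) -> bool:
    """True when both sides name a method and the two names differ."""
    intent_match = INTENT_NAME.search(intent)
    code_match = CODE_NAME.search(snippet)
    if intent_match is None or code_match is None:
        return False  # nothing to compare
    return intent_match.group(1) != code_match.group(1)


def conflicting_intents(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each intent that has several distinct snippets to those snippets."""
    outputs = defaultdict(set)
    for intent, snippet in pairs:
        outputs[intent.strip()].add(snippet.strip())
    return {
        intent: sorted(snippets)
        for intent, snippets in outputs.items()
        if len(snippets) > 1
    }


reported = (
    "define the method serialize_headers with an argument self.",
    "def streaming_content ( self ) :",
)
print(method_name_mismatch(*reported))
```

Running conflicting_intents on the pairs after placeholder substitution would show how many intents are affected, which gives a rough measure of the size of the problem.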

CodeBERT_Launch fails

CodeBERT_Launch fails in eval_prep.py while opening files test_1.gold and test_1.output since it cannot find these files.
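A quick way to narrow this down is to check whether the two files exist before the evaluation step runs. The sketch below is only a diagnostic aid: the report does not say which directory eval_prep.py reads from, so the directory argument is an assumption you would need to fill in, and missing_files is a hypothetical helper name.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(directory: Path, names: Iterable[str]) -> List[str]:
    """Return the names that are not regular files inside directory."""
    return [name for name in names if not (directory / name).is_file()]


# File names come from the report; the directory is a placeholder.
expected = ["test_1.gold", "test_1.output"]
print(missing_files(Path("."), expected))
```

If both names are listed as missing, the earlier step that should write them probably did not finish, or it wrote them to a different directory.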
