TransAST: A Machine Translation-Based Approach for Obfuscated Malicious JavaScript Detection

Yan Qin, Weiping Wang†, Zixian Chen, Hong Song, Shigeng Zhang († corresponding author: [email protected])

Paper

Requirements

The codebase was tested on a server with an Intel Xeon Silver 4114 CPU (2.20 GHz) and the following configuration:

  • Ubuntu 16.04
  • Python 3.10
  • PyTorch 1.13.0
  • Gensim 3.8.1
  • Sentencepiece 0.1.95
  • 2 × NVIDIA TITAN V (12 GB each) with CUDA 11.7, and 256 GB of memory

To run the code, install the Python packages listed above.
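A quick way to confirm that an environment matches the versions above is to compare them against what is installed. The following is a minimal sketch, not part of the repository; the distribution names `torch`, `gensim`, and `sentencepiece` are assumptions about how the packages were installed, and `find_mismatches` is a hypothetical helper.

```python
from importlib import metadata

# Versions copied from the Requirements list. The distribution names are
# assumptions about how the packages were installed (e.g. from PyPI).
EXPECTED = {"torch": "1.13.0", "gensim": "3.8.1", "sentencepiece": "0.1.95"}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (wanted, found)} for packages that differ or are missing."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Loose prefix comparison so build tags such as "1.13.0+cu117" still match.
        if found is None or not found.startswith(wanted):
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

An empty result means all three packages were found at the listed versions; it says nothing about CUDA or driver compatibility.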

Data

Provided data

We provide some JS files and their corresponding obfuscated JS files (to be uploaded) for a quick test. Please download and unzip the data.

Customize your own data

Stay tuned for data preparation scripts.

Please organize your own data as follows:

TransAST
│
├─── Origin JS file (or corresponding AST sequences file)
│    ├─ train
│    │    ├─ Benign JS or AST sequences
│    │    └─ Malicious JS or AST sequences
│    └─ test
│         ├─ Benign JS or AST sequences
│         └─ Malicious JS or AST sequences
│
├─── Corresponding Obfuscated JS file (or corresponding AST sequences file)
│    ├─ train
│    │    ├─ Benign JS or AST sequences
│    │    └─ Malicious JS or AST sequences
│    └─ test
│         ├─ Benign JS or AST sequences
│         └─ Malicious JS or AST sequences
│
└─── ...

The obfuscated counterpart of each file can be obtained by processing the origin code with the JavaScript Obfuscator Tool.
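Before running the later steps, it can help to check that a data folder follows this layout. The sketch below is illustrative only: `train` and `test` come from the tree above, while the folder names `origin`, `obfuscated`, `benign`, and `malicious` are hypothetical placeholders, since the README describes the folders rather than naming them.

```python
from pathlib import Path

# "train" and "test" come from the layout above; the other folder names are
# hypothetical placeholders, so rename them to match your own data.
VARIANTS = ("origin", "obfuscated")
SPLITS = ("train", "test")
LABELS = ("benign", "malicious")


def missing_dirs(root):
    """List every expected <variant>/<split>/<label> directory absent under root."""
    root = Path(root)
    return [
        str(Path(variant, split, label))
        for variant in VARIANTS
        for split in SPLITS
        for label in LABELS
        if not (root / variant / split / label).is_dir()
    ]
```

Calling `missing_dirs("TransAST")` returns an empty list when all eight leaf folders exist.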

Compressed sequence

./js2astseq.py: The JS code is processed into an AST sequence.

./transformer/courpus.py: Create a corpus and generate the raw_corpus.

./transformer/tokenize_transformer.py: Create a dictionary based on raw_corpus and generate a dict.

./transformer/trans_corpus.py: Compress the corpus using the generated dict, turning raw_corpus into char_corpus.

./transformer/spm.py: Train a SentencePiece (spm) model (.model) on the compressed char_corpus data for subsequent feature processing.

In each of the files above, the placeholder '.........' (argument) must be replaced with the actual location of the corresponding file on your system.
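To make the idea behind these steps concrete, here is a minimal sketch of turning an AST into a node-type sequence and then compressing it to one character per node type. It is not the code from the scripts above: the ESTree-style dictionary input, the pre-order traversal, the starting code point `0x4E00`, and the helper names `ast_to_sequence`, `build_dict`, and `compress` are all assumptions made for illustration.

```python
def ast_to_sequence(node):
    """Pre-order walk over an ESTree-style AST (nested dicts/lists), collecting node types."""
    sequence = []
    if isinstance(node, dict):
        if "type" in node:
            sequence.append(node["type"])
        for child in node.values():
            sequence.extend(ast_to_sequence(child))
    elif isinstance(node, list):
        for child in node:
            sequence.extend(ast_to_sequence(child))
    return sequence


def build_dict(sequences, first_code_point=0x4E00):
    """Assign one unique character to each distinct node type, in order of first appearance."""
    mapping = {}
    for sequence in sequences:
        for node_type in sequence:
            if node_type not in mapping:
                mapping[node_type] = chr(first_code_point + len(mapping))
    return mapping


def compress(sequence, mapping):
    """Turn a node-type sequence into a compact string, one character per node."""
    return "".join(mapping[node_type] for node_type in sequence)


# Hand-written AST for `x = 1;` in ESTree form, standing in for real parser output.
example_ast = {
    "type": "Program",
    "body": [{
        "type": "ExpressionStatement",
        "expression": {
            "type": "AssignmentExpression",
            "operator": "=",
            "left": {"type": "Identifier", "name": "x"},
            "right": {"type": "Literal", "value": 1},
        },
    }],
}

raw = ast_to_sequence(example_ast)   # plays the role of one raw_corpus entry
mapping = build_dict([raw])          # plays the role of the dict
print(raw)
print(compress(raw, mapping))        # plays the role of one char_corpus entry
```

Mapping each node type to a single character shortens the sequences considerably, which is presumably why the compressed corpus, not the raw one, is what the spm model is trained on.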

Translation task

./transformer/engine.py: Train the translation model.

where

  • -l1-path: Origin char_corpus file path of the training set;
  • -l2-path: Corresponding obfuscated char_corpus file path of the training set;
  • -test-l1-path: Origin char_corpus file path of the test set;
  • -test-l2-path: Corresponding obfuscated char_corpus file path of the test set;
  • -dict: File path of the dict;
  • -spm: File path of the spm model (.model);
  • -save-dir: Directory where the model is saved.

./transformer/translate.py: Use the trained model to translate, that is, to de-obfuscate the obfuscated JS code back into origin JS code.

where

  • -input: Obfuscated char_corpus file path of the test set;
  • -output: File path for the origin char_corpus that the translation model produces from the obfuscated char_corpus file;
  • -model: File path of the translation model;
  • -dict: File path of the dict;
  • -spm: File path of the spm model (.model).

Please check the configuration parameters carefully; you can always add your own model config.
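As a rough illustration of how the two scripts fit together, the sketch below assembles one training command and one translation command from the options listed above. The option names are copied as written in this README (single leading dash), so verify them against each script's argument parser; every path is a hypothetical placeholder, and `build_command` is a helper invented for this example.

```python
import shlex


def build_command(script, options):
    """Build an argv list of the form: python <script> -name value ..."""
    argv = ["python", script]
    for name, value in options.items():
        argv += [f"-{name}", str(value)]
    return argv


# All paths below are hypothetical placeholders; substitute your own files.
train_cmd = build_command("./transformer/engine.py", {
    "l1-path": "data/origin/train.char_corpus",
    "l2-path": "data/obfuscated/train.char_corpus",
    "test-l1-path": "data/origin/test.char_corpus",
    "test-l2-path": "data/obfuscated/test.char_corpus",
    "dict": "data/dict",
    "spm": "data/spm.model",
    "save-dir": "checkpoints",
})

translate_cmd = build_command("./transformer/translate.py", {
    "input": "data/obfuscated/test.char_corpus",
    "output": "data/translated/test.char_corpus",
    "model": "checkpoints/model.pt",
    "dict": "data/dict",
    "spm": "data/spm.model",
})

for cmd in (train_cmd, translate_cmd):
    # Inspect the command first; run it with subprocess.run(cmd, check=True) if it looks right.
    print(shlex.join(cmd))
```

Note that the same dict and spm model are passed to both steps, and that the file written to -output is what the detection task later evaluates.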

Detection task

./TextCNN_SPM.py: Use the textCNN model to evaluate how well the translation model de-obfuscates code.

where

  • -file-path: Origin char_corpus file path of the training set;
  • -dir-path: Directory containing the origin char_corpus file of the training set (the path above without the file name);
  • -test-file-path: Origin char_corpus file path of the test set;
  • -test-dir-path: Directory containing the origin char_corpus file of the test set (the path above without the file name);
  • -shuffle: Default;
  • -sp: File path of the spm model (.model);
  • -embed-len: Default is 1000;
  • -device: Default is 0;
  • -test: Used only in the second run;
  • -snapshot: File path of the model trained by textCNN during the first run.

Please check the configuration parameters carefully; you can always add your own model config.

Note: You should run it twice. The first run shows how the model performs at detection on the origin JS code. The second run shows how it performs when the training set is the origin JS code and the test set is the de-obfuscated JS code (i.e. the translation performance). -test and -snapshot are used only during the second run.
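The two-run procedure in the note can be sketched as a pair of commands. This is an illustration under assumptions rather than a documented invocation: option names are copied from the list above, `-test` is assumed to be a switch that takes no value, the second run is assumed to point its test options at the de-obfuscated output of translate.py, and `detection_runs` is a hypothetical helper.

```python
def detection_runs(train_file, train_dir, test_file, test_dir,
                   translated_file, translated_dir, spm_model, snapshot):
    """Return the argv lists for the two TextCNN_SPM.py runs described in the note."""
    base = ["python", "./TextCNN_SPM.py",
            "-file-path", train_file, "-dir-path", train_dir,
            "-sp", spm_model]
    # Run 1: train on origin code and evaluate on the origin test set.
    first = base + ["-test-file-path", test_file, "-test-dir-path", test_dir]
    # Run 2: load the run-1 snapshot and evaluate on the de-obfuscated test set.
    # Treating -test as a value-less switch is an assumption.
    second = base + ["-test-file-path", translated_file,
                     "-test-dir-path", translated_dir,
                     "-test", "-snapshot", snapshot]
    return first, second
```

Comparing the metrics of the two runs gives a sense of how much detection quality is retained after de-obfuscation by the translation model.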

BibTeX

@inproceedings{10202623,
  author={Qin, Yan and Wang, Weiping and Chen, Zixian and Song, Hong and Zhang, Shigeng},
  booktitle={2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)},
  title={TransAST: A Machine Translation-Based Approach for Obfuscated Malicious JavaScript Detection},
  year={2023},
  pages={327-338},
  doi={10.1109/DSN58367.2023.00040}
}
