hexterisk / static-malwired

Classifies whether a PE file is benign or malware. If it is malware, the PE is then classified into one of several malware classes. Deployed with Flask.

Home Page: https://hexterisk.github.io/blog/posts/2020/07/20/classification-of-malware-through-static-analysis/

Static Malwired

Classifies whether a PE file is benign or malware, based on static analysis. A given PE is classified into one of the classes defined in the config file. The user must modify the classes to suit the dataset and requirements: edit the classes in config.py and the model in train.py as needed. The POC in the IPython notebook was run on a much smaller dataset. The model reaches 93% accuracy on a dataset of decent size.

Also check out the repository hosting the POC for a classifier based on dynamic analysis.

DISCLAIMER: The whole suite is provided, but users have to acquire a dataset and train the model themselves.

Algorithm

A blog post covers the features used in this POC. Apart from the usual (and unreliable) static information from the headers in the PE file format, the major part is the use of raw bytes with the following algorithm.

  • Create shingles out of the binary data from executables. Take a run of consecutive words (ideally 3) from the data and hash them into an integer. This retains a little more of the structure of the data than hashing words individually. Each unique hash value is called a shingle.

  • Calculate the MinHash signature of each set of shingles. MinHash signatures can be used to approximate the Jaccard similarity between any two sets, which is faster than computing the union and intersection of two large sets because the signatures are much shorter than the sets (of shingles, in this case) themselves. It works because the probability that two sets produce the same minimum hash value equals their Jaccard similarity, so the fraction of matching signature entries estimates it. Read more here.

  • Implement Locality Sensitive Hashing (LSH) over these signatures. Comparing a query's MinHash signature against every stored signature costs O(n) in the number of stored sets. LSH can bring this down to a sub-linear cost. The algorithm ensures that sets with higher Jaccard similarity to the query have a higher probability of being returned than sets with lower similarity. Read about the algorithm here.
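
The three steps above can be sketched in a few dozen lines of standard-library Python. This is an illustration under stated assumptions, not the code used in features.py: a "word" is assumed to be a fixed 4-byte chunk, BLAKE2b is an arbitrary choice of hash, the signature length (128) and the banding scheme (32 bands of 4 rows) are placeholder values, and every function and class name here is hypothetical.

```python
import hashlib
import random
from collections import defaultdict

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus of the hash family


def shingles(data, word_size=4, k=3):
    """Hash each window of k consecutive word_size-byte words to a 64-bit int."""
    window = word_size * k
    result = set()
    for start in range(0, len(data) - window + 1, word_size):
        digest = hashlib.blake2b(data[start:start + window], digest_size=8).digest()
        result.add(int.from_bytes(digest, "big"))
    return result


def make_hash_params(num_hashes=128, seed=1):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]


def minhash(shingle_set, params):
    """Signature = the minimum of each hash function over the whole shingle set."""
    if not shingle_set:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    return [min((a * s + b) % PRIME for s in shingle_set) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


class LSHIndex:
    """Banded LSH: signatures sharing any full band land in the same bucket."""

    def __init__(self, bands=32, rows=4):
        self.bands = bands
        self.rows = rows
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        if len(signature) != self.bands * self.rows:
            raise ValueError("signature length must equal bands * rows")
        for band in range(self.bands):
            yield band, tuple(signature[band * self.rows:(band + 1) * self.rows])

    def add(self, name, signature):
        for band, key in self._band_keys(signature):
            self.tables[band][key].add(name)

    def query(self, signature):
        """Return names of stored signatures that collide in at least one band."""
        candidates = set()
        for band, key in self._band_keys(signature):
            candidates |= self.tables[band].get(key, set())
        return candidates
```

A query only touches the buckets its own bands hash to, so it avoids scanning every stored signature. Raising the number of rows per band makes collisions rarer and the candidate set stricter, while raising the number of bands does the opposite.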

Code Base

The dataset for malware analysis was formed from the database of MalShare.

DISCLAIMER: The dataset built from MalShare is in no way adequate. Kindly find other sources for the database. The dataset I used was confidential and I cannot disclose it. Therefore, I've only mentioned and released the code for the open-source site I used. Pull requests adding similar sources would be appreciated. The script downloader.py can be modified to include classes for different sources.

Flow:

  1. Run downloader.py to download the malware database. Add more classes for processing data from more websites, or simply skip this step if you already have a database and don't need to download anything. Make sure the directory structure for the dataset folder is as follows:
static-malwired/
├── dataset/
│   ├── classA
│   │   ├── exeA
│   │   ├── exeB
│   │   .
│   │   .
│   │   .
│   │   └── exeZ
│   ├── classB
│   │   ├── exeA
│   │   .
│   │   └── exeZ
│   ├── classC
│   │   ├── exeA
│   │   .
│   │   └── exeZ
.   .
.   .
.   .
│   └── classZ
├── app.py
├── builder.py
.
.
.
└── transformer.py
  2. Run builder.py to build the dataset from the downloaded database.

  3. Run train.py to train a model.

  4. Run predict.py to predict a sample's type. Provide the path to the file to be classified and the path to the trained model as command-line arguments, in that order.
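
Before building the dataset, it can help to confirm that the dataset folder follows the layout shown in the first step (one subdirectory per class, with executables directly inside it). The helper below is a small sketch for that check and is not part of the repository; the function name count_samples and the default folder name "dataset" are assumptions for illustration.

```python
from pathlib import Path


def count_samples(dataset_dir="dataset"):
    """Return {class_name: number_of_files} for each class subdirectory."""
    root = Path(dataset_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {root}")
    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Only files directly inside the class folder count as samples.
        counts[class_dir.name] = sum(1 for f in class_dir.iterdir() if f.is_file())
    return counts
```

Classes that come back with a count of zero, or with far fewer samples than the others, are worth fixing before training, since the classes in config.py are expected to match the dataset.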

Components:

app.py: Flask RESTful API on which the project has been deployed.

builder.py: Builds the dataset from all the files in the database.

config.py: Contains the list of malware classes, which can be edited as required.

downloader.py: Script that scrapes pages of MalShare and downloads PEs from their database using their API.

features.py: Script that extracts features from the files in the dataset. The original script was taken from the ember project and modified with various tweaks for multiclass classification and file-specific signatures.

predict.py: A Python script to predict a PE's class using a trained model.

recomposer.py: Script that recomposes a file's header and section names and injects random data to produce a variant of the file. Provided to facilitate dataset augmentation. Should be used if general file information features such as headers and sections are used.

train.py: A Python script to train the model, given the data.

transformer.py: Transforms the given files into vectorized features.
