Quo Vadis

This repository is part of the following publication: https://dl.acm.org/doi/10.1145/3560830.3563726

Quo Vadis: Hybrid Machine Learning Meta-Model Based on Contextual and Behavioral Malware Representations

⚠️ The model is a research prototype, provided as-is, without warranty of any kind, in a pre-alpha state.

Dataset

Dataset structure used for model pre-training is as follows:


Raw PE samples and in-the-wild filepaths are not disclosed due to the Privacy Policy. However:

  • PE emulation dataset available in emulation.dataset
  • Filepath dataset (open sources only, in-the-wild paths used for pre-training are excluded):

Citation

If you are inspired by the work or use data, please cite us:

@inproceedings{10.1145/3560830.3563726,
author = {Trizna, Dmitrijs},
title = {Quo Vadis: Hybrid Machine Learning Meta-Model Based on Contextual and Behavioral Malware Representations},
year = {2022},
isbn = {9781450398800},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3560830.3563726},
doi = {10.1145/3560830.3563726},
booktitle = {Proceedings of the 15th ACM Workshop on Artificial Intelligence and Security},
pages = {127–136},
numpages = {10},
keywords = {reverse engineering, neural networks, malware, emulation, convolutions},
location = {Los Angeles, CA, USA},
series = {AISec'22}
}

Architecture

Hybrid, modular structure for malware classification. Supported modules:


Environment Setup

Tested on Python 3.8.x - 3.9.x. Because of a large number of dependencies with specific versions (due to pre-trained machine learning models), we suggest using a virtual environment or conda:

% python3 -m venv QuoVadisEnv
% source QuoVadisEnv/bin/activate
(QuoVadisEnv)% python -m pip install -r requirements.txt
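Before installing the pinned dependencies, it may help to confirm that the interpreter inside the environment falls in the tested range. A minimal sketch; the helper name `is_tested_python` is hypothetical and not part of this repository:

```python
import sys

# Range stated in this README: Python 3.8.x - 3.9.x
TESTED_VERSIONS = {(3, 8), (3, 9)}


def is_tested_python(version_info=sys.version_info):
    """Return True if the interpreter's major.minor is in the tested range."""
    return (version_info[0], version_info[1]) in TESTED_VERSIONS


if __name__ == "__main__":
    if not is_tested_python():
        major, minor = sys.version_info[0], sys.version_info[1]
        print(f"Warning: Python {major}.{minor} is outside the tested 3.8-3.9 range")
```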

Usage

The API is available in models.py.

Definition of classifier

from models import CompositeClassifier

classifier = CompositeClassifier(meta_model = "MultiLayerPerceptron", 
                                   modules = ["ember", "emulation"],
                                   root = "/home/user/quo.vadis/",
                                   load_meta_model = True)

Available pretrained configurations:

meta_model = 'LogisticRegression', modules = ['ember', 'emulation', 'filepaths', 'malconv']
meta_model = 'MultiLayerPerceptron', modules = ['ember', 'emulation']
meta_model = 'MultiLayerPerceptron', modules = ['ember', 'emulation', 'filepaths']
meta_model = 'MultiLayerPerceptron', modules = ['ember', 'emulation', 'filepaths', 'malconv']
meta_model = 'MultiLayerPerceptron', modules = ['emulation']
meta_model = 'MultiLayerPerceptron', modules = ['filepaths']
meta_model = 'XGBClassifier', modules = ['ember', 'emulation']
meta_model = 'XGBClassifier', modules = ['ember', 'emulation', 'filepaths']
meta_model = 'XGBClassifier', modules = ['ember', 'emulation', 'filepaths', 'malconv']
meta_model = 'XGBClassifier', modules = ['emulation']
meta_model = 'XGBClassifier', modules = ['filepaths']
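Since only the combinations listed above ship with pretrained weights, a quick check before constructing the classifier can save a failed load. The sketch below simply encodes that list; the helper `has_pretrained` is hypothetical, and ignoring module order is an assumption rather than documented behavior of `CompositeClassifier`:

```python
# Pretrained (meta_model, modules) combinations copied from the list above.
FIVE_COMBOS = [
    ["ember", "emulation"],
    ["ember", "emulation", "filepaths"],
    ["ember", "emulation", "filepaths", "malconv"],
    ["emulation"],
    ["filepaths"],
]

PRETRAINED = {
    "LogisticRegression": [["ember", "emulation", "filepaths", "malconv"]],
    "MultiLayerPerceptron": FIVE_COMBOS,
    "XGBClassifier": FIVE_COMBOS,
}


def has_pretrained(meta_model, modules):
    """Return True if the combination is in the list (module order ignored)."""
    wanted = sorted(modules)
    return any(sorted(combo) == wanted for combo in PRETRAINED.get(meta_model, []))


print(has_pretrained("MultiLayerPerceptron", ["ember", "emulation"]))
```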

Evaluation on PE list

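import os
pe_dir = "/path/to/PE/samples"
# os.listdir() returns bare file names; join them with the directory to get full paths
pefiles = [os.path.join(pe_dir, f) for f in os.listdir(pe_dir)]
x = classifier.preprocess_pelist(pefiles)
probs = classifier.predict_proba(x)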

To get probabilities directly from the PE list, without the intermediate preprocessed array, use predict_proba_pelist() instead of predict_proba():

probs = classifier.predict_proba_pelist(pefiles)

If filepaths is listed in modules, you must also supply the filepath of each PE sample at the moment of execution, using the pathlist= argument:

import pandas as pd
filepaths = pd.read_csv("filepaths.csv", header=None)
probs = classifier.predict_proba_pelist(pefiles, pathlist=filepaths.values.tolist())

Note: pefiles and filepaths must have the same length, i.e. len(pefiles) == len(filepaths).
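The length requirement can be checked up front, before any emulation time is spent. A minimal sketch; `check_aligned` is a hypothetical helper, and it assumes the i-th filepath belongs to the i-th PE file:

```python
def check_aligned(pefiles, pathlist):
    """Raise ValueError unless every PE file has exactly one matching filepath."""
    if len(pefiles) != len(pathlist):
        raise ValueError(
            f"got {len(pefiles)} PE files but {len(pathlist)} filepaths"
        )
    # Pair them up so the i-th sample and the i-th path can be inspected together.
    return list(zip(pefiles, pathlist))
```

Calling it just before predict_proba_pelist() or fit_pelist() turns a silent misalignment into an immediate, readable error.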

Re-Training

To re-train, call the fit_pelist() method and provide ground-truth labels for the PE files -- malware (1) or benign (0):

labels = load_labels()
classifier.fit_pelist(pefiles, labels, pathlist=filepaths.values.tolist())
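`load_labels()` above is a placeholder for whatever labeling source you have. As one possible sketch, assuming a hypothetical directory layout with `benign/` and `malware/` subfolders (this layout is not prescribed by the repository), the file list and labels can be built together so they stay aligned:

```python
import os


def collect_labeled_samples(root):
    """Return (pefiles, labels) from root/benign (label 0) and root/malware (label 1)."""
    pefiles, labels = [], []
    for dirname, label in (("benign", 0), ("malware", 1)):
        folder = os.path.join(root, dirname)
        # Sort for a reproducible order across runs and platforms.
        for name in sorted(os.listdir(folder)):
            pefiles.append(os.path.join(folder, name))
            labels.append(label)
    return pefiles, labels
```

The returned lists can then be passed as the first two arguments of fit_pelist().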

Example

An example usage can be found under example.py:

# python example.py --example --how ember emulation filepaths

[*] Loading model...
WARNING:root:[!] Loading pretrained weights for ember model from: ./modules/sota/ember/parameters/ember_model.txt
WARNING:root:[!] Loading pretrained weights for filepath model from: ./modules/filepath/pretrained/torch.model
WARNING:root:[!] Using speakeasy emulator config from: ./data/emulation.dataset/sample_emulation/speakeasy_config.json
WARNING:root:[!] Loading pretrained weights for emulation model from: ./modules/emulation/pretrained/torch.model
WARNING:root:[!] Loading pretrained weights for late fusion MultiLayerPerceptron model from: ./modules/late_fustion_model/MultiLayerPerceptron15_ember_emulation_filepaths.model

[*] Legitimate 'calc.exe' analysis...
WARNING:root:[!] Taking current filepath for: evaluation/adversarial/samples_goodware/calc.exe
WARNING:root: [+] 0/0 Finished emulation evaluation/adversarial/samples_goodware/calc.exe, took: 0.19s, API calls acquired: 6
[!] Given path evaluation/adversarial/samples_goodware/calc.exe, probability (malware): 0.000005
[!] Individual module scores:

       ember  filepaths  emulation
0  0.000015    0.00319   0.062108 

WARNING:root: [+] 0/0 Finished emulation evaluation/adversarial/samples_goodware/calc.exe, took: 0.11s, API calls acquired: 6
[!] Given path C:\users\myuser\AppData\Local\Temp\exploit.exe, probability (malware): 0.549334
[!] Individual module scores:

       ember  filepaths  emulation
0  0.000015   0.999984   0.062108 

[*] BoratRAT analysis...
WARNING:root: [+] 0/0 Finished emulation ./b47c77d237243747a51dd02d836444ba067cf6cc4b8b3344e5cf791f5f41d20e, took: 0.25s, API calls acquired: 194

[!] Given path %USERPROFILE%\Downloads\BoratRat.exe, probability (malware): 0.9997
[!] Individual module scores:

       ember  filepaths  emulation
0  0.035511   0.999602    0.96526 

WARNING:root: [+] 0/0 Finished emulation ./b47c77d237243747a51dd02d836444ba067cf6cc4b8b3344e5cf791f5f41d20e, took: 0.25s, API calls acquired: 194

[!] Given path C:\windows\system32\calc.exe, probability (malware): 0.0392
[!] Individual module scores:

       ember  filepaths  emulation
0  0.035511   0.086567    0.96526 
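The output above shows per-module scores being combined into a single probability by the late fusion meta-model. The sketch below illustrates the idea with a plain logistic combination. It is not the repository's implementation: the weights and bias are hypothetical values chosen only for illustration, so the result will not match the probabilities printed above.

```python
import math


def late_fusion(scores, weights, bias):
    """Combine per-module malware scores into one probability (logistic meta-model)."""
    z = bias + sum(weights[name] * score for name, score in scores.items())
    return 1.0 / (1.0 + math.exp(-z))


# Module scores taken from the 'exploit.exe' run above.
scores = {"ember": 0.000015, "filepaths": 0.999984, "emulation": 0.062108}

# Hypothetical weights and bias (NOT the trained values).
weights = {"ember": 2.0, "filepaths": 2.0, "emulation": 2.0}
bias = -3.0

print(late_fusion(scores, weights, bias))
```

The point of the design is that the meta-model sees only the module outputs, so a module can be added or removed without touching the others.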

Evaluation

More detailed information about modules and individual tests:

  • ./modules/emulation/
  • ./modules/filepaths/
  • ./modules/sota/

Note! Parameters for the sota models can be downloaded from here.

Performance of the model on a proprietary dataset of ~90k PE samples with filepaths from real-world systems:


DET and ROC curves:


Detection rate with fixed False Positive rate:
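For readers reproducing this metric on their own data, the following is a minimal sketch of how a detection rate (true positive rate) at a fixed false positive rate can be computed from labels and predicted malware probabilities. The function name and the "flag when score >= threshold" convention are assumptions, not taken from the repository's evaluation code:

```python
def detection_rate_at_fpr(labels, scores, max_fpr):
    """Return (TPR, threshold) for the lowest threshold whose FPR <= max_fpr.

    labels: 1 for malware, 0 for benign; scores: malware probabilities.
    A sample is flagged as malware when score >= threshold.
    """
    benign = [s for s, y in zip(scores, labels) if y == 0]
    malware = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malware:
        raise ValueError("need at least one benign and one malware sample")

    best_tpr, best_threshold = 0.0, float("inf")
    # Try every observed score as a threshold, from strictest to most permissive.
    for threshold in sorted(set(scores), reverse=True):
        fpr = sum(s >= threshold for s in benign) / len(benign)
        if fpr > max_fpr:
            break  # FPR can only grow as the threshold drops further
        best_tpr = sum(s >= threshold for s in malware) / len(malware)
        best_threshold = threshold
    return best_tpr, best_threshold
```

If no threshold satisfies the constraint, the function returns a detection rate of 0.0 with an infinite threshold, i.e. nothing is flagged.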


Future work

  • Experiments with retrained MalConv / Ember weights -- it makes sense to evaluate them on the same distribution
    • Note: this does not matter much, since our goal is not to compare our modules with MalConv / Ember directly but to improve on them, so keeping the original parameters is arguably even better. The main takeaway is that combining multiple modules boosts results drastically, while each module on its own is noticeably weaker (even the API call module, which is trained on the same distribution).
  • Run GAMMA against the composite solution (not just the ember/malconv modules), since the attacks look highly targeted. It would be interesting to see whether it can generate evasive samples against the complete pipeline (although defining that in secml_malware might be painful).
  • Work on the CompositeClassifier() API:
    • make it easy to pass one or more PE samples, and document the additional options (providing a PE directory, a predefined emulation report directory, etc.)
    • .update() to further train the network on own examples that were previously flagged incorrectly
    • work without a submitted filepath (PE-only mode): provide paths as a separate argument to .fit()?
  • Additional modules:
    • (a) Autoruns checks (see the Sysinternals book for a full list of the registry keys analyzed)
    • (b) network connection information
    • etc.


quo.vadis's Issues

EMBER LGBM feature extraction error

I am trying to evaluate the composite classifier (including both the EMBER and MalConv modules).
While the MalConv model has no pretrained weights (as noted in the README), I am evaluating EMBER with the pretrained model provided by Elastic.
The feature extractor crashes at this step (it crashes inside EMBER, but the same line appears in the code provided inside the project):

entry_section = lief_binary.section_from_offset(lief_binary.entrypoint).name

This happens because the lief method returns None (even for legitimate software), and the resulting exception is not caught.
Hence, execution halts.
I am trying to better isolate this issue.

Attaching the sample code I wrote to test this:

import os

from models import CompositeClassifier

if __name__ == '__main__':
    # 'FOLDER' is a directory of PE samples next to this script
    base_dir = os.path.join(os.path.dirname(__file__), 'FOLDER')
    # Skip hidden files such as .DS_Store
    files = [os.path.join(base_dir, f) for f in os.listdir(base_dir) if not f.startswith('.')]
    print(files)
    clf = CompositeClassifier(modules=['ember', 'emulation'])
    x = clf.preprocess_pelist(files)
    print(clf.predict_proba(x))
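One possible guard for the failing line is sketched below. It uses only the two attributes that appear in the line quoted above (`section_from_offset` and `entrypoint`), and it assumes that an empty section name is an acceptable fallback for the downstream feature extractor; that assumption has not been verified against EMBER. The helper name `entry_section_name` is hypothetical.

```python
def entry_section_name(lief_binary):
    """Return the name of the section holding the entry point, or "" if none is found."""
    section = lief_binary.section_from_offset(lief_binary.entrypoint)
    # The lookup can return None instead of a section object, so check before
    # dereferencing .name.
    return section.name if section is not None else ""
```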

Issue with BoratRAT Sample Link in the Code

Hello dtrizna,

I hope you're doing well. I was going through your code and noticed that the link provided for the BoratRAT sample from VX-UNDERGROUND seems to be broken or not accessible:
https://samples.vx-underground.org/samples/Families/BoratRAT/Samples/b47c77d237243747a51dd02d836444ba067cf6cc4b8b3344e5cf791f5f41d20e.7z
I tried accessing it multiple times, but without success. I believe this could be an oversight, or the sample might have been moved or removed from the source.

It would be greatly appreciated if you could check on this and possibly provide an updated link or suggest an alternative source for the sample. This will be very helpful for those of us who are trying to follow along with your work.

Thank you for your attention to this matter and for your contributions to the community. Looking forward to your response.

Best regards
