
MalKG

Official repository for MalKG

Paper title: Knowledge Graph Generation and Entity Prediction for Contextual Malware Threat Intelligence

Relation Extraction

Installation and Requirements

All Python files were run using Python 3.8. All notebook files were run in Google Colab.

SpaCy

$pip3 install -U pip setuptools wheel
$pip3 install -U spacy
$python3 -m spacy download en_core_web_sm
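
To confirm the installation, you can load the model and run it over a sample sentence (a minimal check, not part of the repository's scripts):

import spacy

# Load the small English model installed above.
nlp = spacy.load("en_core_web_sm")

# Tokenize and sentence-split a sample line from a threat report.
doc = nlp("The Emotet loader was delivered through a phishing email.")
print([token.text for token in doc])
print([sent.text for sent in doc.sents])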

DocRED

Requires a CUDA-compatible GPU with 16GB of video memory and 32GB of RAM.

$pip3 install -r Code/DocRED/code/requirements.txt

Flair12

$pip3 install -U flair

Training

Data Set

We trained the DocRED model with a set of 64 hand-annotated threat reports. These reports were annotated using BRAT and can be found under Code/Training Data Parser/input/.
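
For reference, BRAT stores its annotations in standoff .ann files next to each .txt file; entity lines begin with a T identifier, followed by a tab-separated type/offset field and the surface text (e.g. T1, Malware 112 118, Emotet). The following reader for such entity lines is purely illustrative (the function name is ours); the actual conversion is handled by SpaCy_parser.py in the next step:

# Illustrative reader for entity ("T") lines in a BRAT standoff .ann file.
def read_brat_entities(ann_path):
    entities = []
    with open(ann_path, encoding="utf-8") as f:
        for line in f:
            if not line.startswith("T"):
                continue  # skip relation, attribute, and note lines
            tag, type_and_span, text = line.rstrip("\n").split("\t")
            if ";" in type_and_span:
                continue  # discontinuous spans are ignored in this sketch
            etype, start, end = type_and_span.split()
            entities.append({"id": tag, "type": etype,
                             "start": int(start), "end": int(end),
                             "text": text})
    return entities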

Preprocessing

Converting BRAT annotations into JSON files for DocRED

To convert the text files and BRAT annotation files into the appropriate format for DocRED, go to Code/Training Data Parser/ and run:

$python3 SpaCy_parser.py

The output will be under Code/Preprocessing/annotated_data.json.
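
Each entry in this file follows the DocRED document schema: tokenized sentences, a vertexSet grouping the mentions of each entity, and relation labels between entities. An illustrative (made-up) document looks roughly like this; the field names follow the published DocRED format, while the text, entity types, and relation are placeholders:

# Illustrative document in DocRED format (values are placeholders).
doc = {
    "title": "example_threat_report",
    "sents": [["Emotet", "drops", "TrickBot", "."]],
    "vertexSet": [
        [{"name": "Emotet", "sent_id": 0, "pos": [0, 1], "type": "Malware"}],
        [{"name": "TrickBot", "sent_id": 0, "pos": [2, 3], "type": "Malware"}],
    ],
    "labels": [{"h": 0, "t": 1, "r": "drops", "evidence": [0]}],
}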

Cleaning Up JSON files for DocRED

Due to memory constraints, we were only able to run documents containing 80 or fewer named entities through DocRED, so we broke documents with more than 80 named entities into multiple documents. We also excluded documents containing more than 16384 words and changed some of the named entity classifications to match Flair12. Finally, we split the documents into a training set and a validation set. All of this can be accomplished by navigating to Code/Preprocessing/ and running:

$python3 docred_preprocessing.py

The output will be under Code/DocRED Input/train_data.json and Code/DocRED Input/validate_data.json.
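
The filtering and splitting logic amounts to roughly the following (a simplified sketch of the constraints described above, not the actual docred_preprocessing.py; a real split must also remap the label indices of each piece):

MAX_ENTITIES = 80      # DocRED memory limit on named entities per document
MAX_WORDS = 16384      # longer documents are excluded entirely

def filter_and_split(documents):
    kept = []
    for doc in documents:
        if sum(len(sent) for sent in doc["sents"]) > MAX_WORDS:
            continue  # document is too long for DocRED
        if len(doc["vertexSet"]) <= MAX_ENTITIES:
            kept.append(doc)
            continue
        # Break an over-sized document into pieces of at most 80 entities.
        for i in range(0, len(doc["vertexSet"]), MAX_ENTITIES):
            piece = dict(doc)
            piece["vertexSet"] = doc["vertexSet"][i:i + MAX_ENTITIES]
            kept.append(piece)
    return kept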

DocRED

We ran DocRED using Google Colab with the notebook Code/DocRED/DOCRED.ipynb. Code/DocRED/data/ should contain train_data.json, validate_data.json, and test_data.json. These files are processed by running:

$python3 gen_data.py --in_path ../data --out_path prepro_data

The number of epochs used for training can be set in Code/DocRED/code/train.py; we have it set to 10,000. Every 5 epochs, DocRED compares the current model to the previous best and saves it to Code/DocRED/code/checkpoint/checkpoint_BiLSTM.zip if it performs better. Training can be done by running:

$CUDA_VISIBLE_DEVICES=0 python3 train.py --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_validate

Testing

Converting PDFs to TXT

Threat report PDFs were run through Adobe Acrobat using the Action Wizard with Export PDFs to TXTs.sequ to convert them into TXT files.
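
If Acrobat is unavailable, a scripted alternative (not used in this repository; the directory name below is a placeholder) can extract the text with pdfminer.six:

from pathlib import Path
from pdfminer.high_level import extract_text  # pip3 install pdfminer.six

# Convert every threat report PDF in a directory to a plain-text file.
for pdf in Path("threat_reports").glob("*.pdf"):
    pdf.with_suffix(".txt").write_text(extract_text(str(pdf)), encoding="utf-8")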

Named Entity Recognition

We used Flair12 and SetExpan to extract named entities from our test data. Entities can be extracted from threat reports by navigating to Code/NER/ and either using the notebook file or running:

$python3 automated_flair12.py

The output will be under Code/Preprocessing/threatreport_flair12_data.json.
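
The extraction itself is driven by automated_flair12.py; for orientation, the underlying Flair API for tagging a sentence looks like the following sketch, shown here with Flair's stock English NER model rather than the malware-specific Flair12 tag set:

from flair.data import Sentence
from flair.models import SequenceTagger

# Load a pre-trained tagger (the stock English NER model as a stand-in).
tagger = SequenceTagger.load("ner")

sentence = Sentence("Emotet was distributed by the TA542 group via phishing emails.")
tagger.predict(sentence)

# Print each predicted entity span and its label.
for span in sentence.get_spans("ner"):
    print(span.text, span.tag)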

Cleaning Up JSON files for DocRED

As with the training data, we needed to break the testing data apart due to memory constraints. This can be done by navigating to Code/Preprocessing/ and running:

$python3 docred_preprocessing.py

All of the testing documents will be output under Code/DocRED Input/threatreport_flair12_test_data_all.json and Code/DocRED Input/threatreport_setexpan_test_data_all.json. However, we were only able to test 256 documents at a time, so Code/DocRED Input/Threat Report Flair12 Data/ and Code/DocRED Input/Threat Report SetExpan Data/ contain the testing data split into batches of 256 documents.
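
Splitting a full test file into 256-document batches can be sketched as follows (filenames are placeholders; the repository's own split is produced by docred_preprocessing.py):

import json

BATCH_SIZE = 256  # maximum number of documents we could test per DocRED run

with open("threatreport_flair12_test_data_all.json", encoding="utf-8") as f:
    docs = json.load(f)

# Write the documents out in consecutive batches of 256.
for i in range(0, len(docs), BATCH_SIZE):
    with open(f"test_data_{i // BATCH_SIZE}.json", "w", encoding="utf-8") as out:
        json.dump(docs[i:i + BATCH_SIZE], out)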

DocRED

In order to test with DocRED, the JSON file you want to test must be renamed to test_data.json and placed under Code/DocRED/data/. This can be processed by running:

$python3 gen_data.py --in_path ../data --out_path prepro_data

and testing can be done by running the following, where --input_theta is the confidence threshold above which a predicted relation is kept:

$CUDA_VISIBLE_DEVICES=0 python3 test.py --save_name checkpoint_BiLSTM --train_prefix dev_train --test_prefix dev_test --input_theta 0.5

The results of the test will be output as Code/DocRED/code/dev_test_index.json.

Converting the DocRED output into CSVs

To convert DocRED output into CSVs, go to Code/Postprocessing/ and run:

$python3 format.py

The CSVs are written to the corresponding csvs directories, as well as to Results/.
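
For reference, DocRED's test output is a JSON list of predicted triples keyed by entity indices into each document's vertexSet. A minimal conversion to CSV could look like the sketch below; it assumes the standard DocRED output fields (title, h_idx, t_idx, r) and an output filename of our choosing, and is not the repository's format.py:

import csv
import json

with open("dev_test_index.json", encoding="utf-8") as f:
    predictions = json.load(f)

with open("predictions.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["title", "head_entity_index", "tail_entity_index", "relation"])
    for p in predictions:
        writer.writerow([p["title"], p["h_idx"], p["t_idx"], p["r"]])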

Citations

Yao, Y., Ye, D., Li, P., Han, X., Lin, Y., Liu, Z., Huang, L., Zhou, J., Sun, M. (2019). DocRED: A Large-Scale Document-Level Relation Extraction Dataset. Proceedings of ACL 2019.
