
websentinel's Introduction

pip install numpy pandas joblib tokenizers langdetect scikit-learn tk

1. Data Tokenization

(create_data_for_tokenization.py)

To train a phishing website detection model, you first need to tokenize all the HTML files using Byte Pair Encoding (BPE). We will use the tokenizers library for this. Once the HTML files are in their respective folders, run the following command.


python create_data_for_tokenization.py --labeled_data_folder labeled_data --vocab_size 300 --min_frequency 5

The script takes three parameters as inputs:

  • labeled_data_folder: Folder containing data for phishing and legitimate websites.

  • vocab_size: Maximum number of tokens to keep in the vocabulary.

  • min_frequency: Tokens with a frequency lower than this value are ignored.

The script preprocesses the HTML data, tokenizes it with byte-level BPE, and saves the tokenizer's vocabulary and configuration for later use.
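To make the roles of vocab_size and min_frequency concrete, here is a toy, standard-library-only sketch of how BPE merges are learned. It is not the code in create_data_for_tokenization.py and not the tokenizers library's implementation; the function names are hypothetical, and unlike a real byte-level tokenizer it starts from only the bytes seen in the input and skips pre-tokenization.

```python
from collections import Counter


def apply_merge(tokens, left, right):
    """Replace every adjacent (left, right) pair in tokens with left + right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe_merges(texts, vocab_size, min_frequency):
    """Learn BPE merges over raw bytes (illustration only)."""
    # Every document starts as a list of single-byte tokens.
    docs = [[bytes([b]) for b in text.encode("utf-8")] for text in texts]
    vocab = {tok for doc in docs for tok in doc}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for doc in docs:
            pairs.update(zip(doc, doc[1:]))
        if not pairs:
            break
        (left, right), freq = pairs.most_common(1)[0]
        # Stop once the most frequent pair is rarer than min_frequency.
        if freq < min_frequency:
            break
        merges.append((left, right))
        vocab.add(left + right)
        docs = [apply_merge(doc, left, right) for doc in docs]
    return merges, vocab


# Hypothetical miniature corpus standing in for the labeled HTML files.
pages = ["<div><a href=x></a></div>", "<div><p>hi</p></div>"] * 3
merges, vocab = learn_bpe_merges(pages, vocab_size=40, min_frequency=5)
print(len(merges), "merges learned;", len(vocab), "tokens in vocabulary")
```

A larger vocab_size allows more merges and therefore longer tokens, while a larger min_frequency stops merging earlier on small or very diverse corpora.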

2. Model Training

(train_phishing_detection_model.py)

Once we have created a Byte Pair Encoding tokenizer, we can use it to tokenize HTML files and extract features for machine learning. On top of the BPE tokens, we apply TF-IDF scores to get a feature representation of each HTML file. Run the following command to train your own model.


python train_phishing_detection_model.py --tokenizer_folder tokenizer/ --labeled_data_folder labeled_data/ --ignore_other_languages 1 --apply_different_thresholds 1 --save_model_dir saved_models

The script takes five parameters as inputs:

  • tokenizer_folder: Folder containing tokenizer files. The default folder is 'tokenizers'

  • labeled_data_folder: Folder containing data for phishing and legitimate websites.

  • ignore_other_languages: Whether to ignore languages other than English. Set it to 0 if you want to include all languages.

  • apply_different_thresholds: Whether to apply different confidence thresholds during model evaluation.

  • save_model_dir: Directory in which to save the model files.
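The idea of applying TF-IDF on top of BPE tokens can be sketched with the standard library alone. This is a minimal illustration, not the feature extraction used by train_phishing_detection_model.py: the function names, the smoothed IDF formula, and the sample tokens are all assumptions made for the example.

```python
import math
from collections import Counter


def document_frequencies(tokenized_docs):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return df


def tfidf_vector(tokens, df, n_docs):
    """Map each token of one document to its TF-IDF weight."""
    counts = Counter(tokens)
    total = len(tokens)
    vector = {}
    for token, count in counts.items():
        tf = count / total
        # Smoothed IDF (an assumed variant) keeps unseen tokens finite.
        idf = math.log((1 + n_docs) / (1 + df.get(token, 0))) + 1
        vector[token] = tf * idf
    return vector


# Hypothetical BPE token sequences for three HTML files.
docs = [
    ["<div", ">", "login"],
    ["<div", ">", "news"],
    ["<form", "password", "login"],
]
df = document_frequencies(docs)
vec = tfidf_vector(docs[2], df, n_docs=len(docs))
print(sorted(vec.items(), key=lambda item: -item[1]))
```

Tokens that appear in few documents, such as "password" here, receive a higher weight than tokens shared by many documents, which is what makes the representation useful to a classifier.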

3. Model Testing

(test_model.py)

Once we have a trained model, we can test it live on any website using the following command.


python test_model.py --tokenizer_folder tokenizer --threshold 0.5 --model_dir saved_models --website_to_test *url*

The script takes four parameters as inputs:

  • tokenizer_folder: Folder containing tokenizer files. The default folder is 'tokenizers'

  • threshold: Threshold to use for making final predictions. By default, the value is 0.5.

  • model_dir: Directory where saved model files exist.

  • website_to_test: Website you want to test. Include the "http://" or "https://" prefix in the URL; without it, the script fails with an error.
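The threshold and website_to_test parameters can be illustrated with a short sketch. It assumes that the model outputs a phishing probability between 0 and 1 and that scores at or above the threshold are flagged; neither detail is confirmed by the source, and both function names are hypothetical.

```python
from urllib.parse import urlparse


def has_http_scheme(url):
    """Return True if the URL starts with an explicit http or https scheme."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def label_from_score(phishing_probability, threshold=0.5):
    """Turn a score into a label (assumes score >= threshold means phishing)."""
    return "phishing" if phishing_probability >= threshold else "legitimate"


# The same hypothetical score yields different labels at different thresholds.
for threshold in (0.3, 0.5, 0.7):
    print(threshold, label_from_score(0.6, threshold))

print(has_http_scheme("https://example.com"))
print(has_http_scheme("example.com"))
```

Raising the threshold makes the classifier flag fewer sites as phishing, trading missed detections for fewer false alarms.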

Using Pre-trained Model

To use the pre-trained model, go to the 'pretrained_models' directory and unzip the 'document-frequency-dictionary.zip' file. Extract it into that same directory rather than into a new one. Once that is done, you can run the following command to use the pre-trained model.


python test_pretrained_model.py --tokenizer_folder pretrained_models --threshold 0.5 --model_dir pretrained_models --website_to_test *url*

The script takes four parameters as inputs:

  • tokenizer_folder: Folder containing tokenizer files. The default folder is 'tokenizers' but here we will use 'pretrained_models'.

  • threshold: Threshold to use for making final predictions. By default, the value is 0.5.

  • model_dir: Directory where saved model files exist. The pre-trained model files exist in 'pretrained_models'.

  • website_to_test: Website you want to test. Include the "http://" or "https://" prefix in the URL; without it, the script fails with an error.
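The unzip step required before running the pre-trained model can also be done from Python. The sketch below extracts an archive into the directory that already contains it; the helper name extract_in_place is hypothetical, and the archive path is taken from the description above.

```python
import zipfile
from pathlib import Path


def extract_in_place(zip_path):
    """Extract an archive into the directory that already contains it."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(zip_path.parent)
        return archive.namelist()


if __name__ == "__main__":
    archive_path = Path("pretrained_models") / "document-frequency-dictionary.zip"
    # Only extract when the archive is actually present.
    if archive_path.exists():
        print(extract_in_place(archive_path))
```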


websentinel's People

Contributors

adityas1731
