
CGT Classification Project

This project involves training and evaluating various machine learning models for classification tasks using the CGT dataset. The models include a BERT-based model, a Feedforward Neural Network (FFNN), an LSTM-based classifier, and a pool of traditional classifiers.

Installation

To get started with this project, install the required libraries using pip:

pip install numpy pandas torch scikit-learn tqdm transformers xgboost

Dataset

The dataset used in this project is the CGT dataset, available at https://github.com/gsalzer/cgt.

Download and Setup

  1. Clone the dataset repository:
git clone https://github.com/gsalzer/cgt.git
  2. Place the cloned repository in the appropriate directory structure:
project_directory/
├── dataset/               # Cloned CGT dataset repository
└── your_project_files/    # Your project files

Configuration

Device Setup

The project uses a GPU if available:

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Random Seed

Random seeds are set for reproducibility:

RANDOM_SEED = 0
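
A typical way to apply the seed across the relevant libraries (a sketch; the exact calls used in the project are assumptions):

import random

import numpy as np
import torch

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RANDOM_SEED)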

Dataset Path

Specify the path to the dataset:

PATH_TO_DATASET = os.path.join("..", "dataset", "cgt")

Training Configurations

  • Model Type: BERT (microsoft/codebert-base)
  • Max Features: 500
  • Batch Size: 1
  • Number of Folds: 10
  • Number of Epochs: 25
  • Number of Labels: 20
  • Learning Rate: 0.001
  • Test Size: 0.1
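
In code, these settings correspond to module-level constants along these lines (names not used in the snippets below are assumptions):

BERT_MODEL_TYPE = "microsoft/codebert-base"
MAX_FEATURES = 500
BATCH_SIZE = 1
NUM_FOLDS = 10
NUM_EPOCHS = 25
NUM_LABELS = 20
LEARNING_RATE = 0.001
TEST_SIZE = 0.1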

File Configurations

The project handles three file types: source, runtime, and bytecode. The active type is selected through the FILE_TYPE constant, which the logging setup below also uses.
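
For example (the exact string values are assumptions):

FILE_TYPE = "source"  # one of "source", "runtime", "bytecode"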

Log Directory

Logs are stored in a directory created if it doesn't already exist:

LOG_DIR = os.path.join("log", FILE_TYPE)
os.makedirs(LOG_DIR, exist_ok=True)

Preprocessing

Preprocessing Functions

Functions are provided to preprocess hex data and Solidity code:

  • Hex Data Preprocessing: Converts hex data to a readable byte string.
  • Solidity Code Preprocessing: Removes comments and blank lines.
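
Minimal sketches of what these two helpers might look like; the function names and exact cleaning rules are assumptions:

import re

def preprocess_hex(hex_data):
    # Drop an optional "0x" prefix and split the string into space-separated byte pairs
    if hex_data.startswith("0x"):
        hex_data = hex_data[2:]
    return " ".join(hex_data[i:i + 2] for i in range(0, len(hex_data), 2))

def preprocess_solidity(code):
    # Strip block comments, line comments, and blank lines
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    code = re.sub(r"//[^\n]*", "", code)
    return "\n".join(line for line in code.splitlines() if line.strip())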

Data Initialization

Initialize inputs, labels, and groundtruth from the dataset:

inputs, labels, gt = init_inputs_and_gt(dataset)

Setting Labels

Set up labels based on groundtruth:

labels = set_labels(dataset, labels, gt)

Vectorization

TF-IDF vectorizer is used to convert text data into numerical features:

VECTORIZER = TfidfVectorizer(max_features=MAX_FEATURES)

Models

BERTModelTrainer

Handles training and evaluation of a BERT-based model. Uses the transformers library to load a BERT model for sequence classification.

FFNNClassifier

A simple feedforward neural network with three fully connected layers for classification tasks.
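
A minimal sketch of such a network, assuming TF-IDF inputs of MAX_FEATURES dimensions and NUM_LABELS outputs (the hidden size is an assumption):

import torch.nn as nn

class FFNNClassifier(nn.Module):
    def __init__(self, input_dim=500, hidden_dim=128, num_labels=20):
        super().__init__()
        # Three fully connected layers with ReLU activations in between
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, x):
        return self.net(x)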

LSTMClassifier

An LSTM-based model for text classification, initialized with pretrained GloVe embeddings.
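
A minimal sketch of the model, assuming the constructor signature used in the training snippet further below (layer layout and output dimension are assumptions):

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, embedding_weights, num_labels=20):
        super().__init__()
        assert embedding_weights.shape == (vocab_size, embedding_dim)
        # Initialize the embedding layer with the pretrained GloVe vectors (index 0 is padding)
        self.embedding = nn.Embedding.from_pretrained(
            torch.from_numpy(embedding_weights).float(), freeze=False, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_labels)

    def forward(self, x):
        _, (hidden, _) = self.lstm(self.embedding(x))
        # Classify from the final hidden state of the last LSTM layer
        return self.fc(hidden[-1])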

Load GloVe Embeddings

Download the GloVe embeddings from Kaggle and extract the file to the appropriate directory:

project_directory/
├── asset/
│   └── glove.6B.100d.txt  # GloVe embeddings file
└── your_project_files/    # Your project files

Load the GloVe embeddings:

glove_embeddings = load_glove_embeddings(os.path.join("..", "asset", "glove.6B.100d.txt"))
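
A minimal sketch of the loader, assuming the standard GloVe text format (one word followed by its vector per line):

import numpy as np

def load_glove_embeddings(path):
    # Map each word to its embedding vector
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings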

Training and Evaluation

Trainer Class

Handles the training and evaluation of a neural network model.
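
A sketch of the idea, assuming a multi-label setup with BCE loss and an Adam optimizer (the real class's interface and internals are assumptions):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

class Trainer:
    def __init__(self, model, lr=0.001, num_epochs=25):
        self.model = model.to(DEVICE)
        self.criterion = nn.BCEWithLogitsLoss()  # multi-label loss; an assumption
        self.optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        self.num_epochs = num_epochs

    def fit(self, loader: DataLoader):
        self.model.train()
        for _ in range(self.num_epochs):
            for x, y in loader:
                self.optimizer.zero_grad()
                loss = self.criterion(self.model(x.to(DEVICE)), y.to(DEVICE))
                loss.backward()
                self.optimizer.step()

    @torch.no_grad()
    def evaluate(self, loader: DataLoader):
        self.model.eval()
        # Threshold sigmoid outputs at 0.5 for multi-label predictions
        preds = [(torch.sigmoid(self.model(x.to(DEVICE))) > 0.5).cpu() for x, _ in loader]
        return torch.cat(preds)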

CrossValidator Class

Performs k-fold cross-validation of a model, training and evaluating it across multiple folds.
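
A sketch of the idea using scikit-learn's KFold over the training set; the real class wraps the project's Trainer, whose fit/evaluate interface is assumed here:

import numpy as np
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, Subset

def k_fold_cv(trainer, train_data, num_folds=10, batch_size=1):
    kf = KFold(n_splits=num_folds, shuffle=True, random_state=0)
    for fold, (train_idx, val_idx) in enumerate(kf.split(np.arange(len(train_data)))):
        # Train on this fold's split and validate on the held-out fold
        train_loader = DataLoader(Subset(train_data, train_idx), batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(Subset(train_data, val_idx), batch_size=batch_size)
        trainer.fit(train_loader)        # hypothetical Trainer method
        trainer.evaluate(val_loader)     # hypothetical Trainer method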

ClassifiersPoolEvaluator Class

Evaluates a pool of classifiers using TF-IDF features and k-fold cross-validation.
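
Conceptually, the evaluation looks like this (the classifier pool and scoring are assumptions; since the project installs xgboost, an XGBoost model could also join the pool):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier

def pool_evaluation(x, y, num_folds=10):
    pool = {
        "logreg": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
        "random_forest": RandomForestClassifier(),
    }
    for name, clf in pool.items():
        # Average F1 across the folds for each classifier in the pool
        scores = cross_val_score(clf, x, y, cv=num_folds, scoring="f1_micro")
        print(f"{name}: {np.mean(scores):.3f}")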

Initializing and Training Models

  1. BERT Model:

    model = RobertaForSequenceClassification.from_pretrained(BERT_MODEL_TYPE, num_labels=NUM_LABELS, ignore_mismatched_sizes=True)
    model.config.problem_type = "multi_label_classification"
    model.to(DEVICE)
    
    tokenizer = RobertaTokenizer.from_pretrained(BERT_MODEL_TYPE)
    
    x = tokenizer(INPUTS, add_special_tokens=True, max_length=512, return_token_type_ids=False, padding="max_length", truncation=True, return_attention_mask=True, return_tensors='pt')
    y = LABELS
    
    # Split input IDs, attention masks, and labels in a single call so they stay aligned
    x_train, x_test, train_masks, test_masks, y_train, y_test = train_test_split(x['input_ids'], x['attention_mask'], y, test_size=TEST_SIZE, random_state=RANDOM_SEED)
    
    train_data = TensorDataset(x_train, train_masks, torch.tensor(y_train).float())
    test_data = TensorDataset(x_test, test_masks, torch.tensor(y_test).float())
    
    CrossValidator(BERTModelTrainer(model), train_data, test_data).k_fold_cv(log_id="bert")
  2. FFNN Model:

    model = FFNNClassifier()
    
    x = torch.FloatTensor(VECTORIZER.fit_transform(INPUTS).toarray())
    y = torch.FloatTensor(LABELS)
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=TEST_SIZE, random_state=RANDOM_SEED)
    train_data = TensorDataset(x_train, y_train)
    test_data = TensorDataset(x_test, y_test)
    
    CrossValidator(Trainer(model), train_data, test_data).k_fold_cv(log_id="ffnn")
  3. LSTM Model:

    embeddings = load_glove_embeddings("path_to_glove_file")
    embedding_dim = len(next(iter(embeddings.values())))
    hidden_dim = 128
    
    # Reserve index 0 for padding; word indices start at 1
    word_to_idx = {word: idx + 1 for idx, word in enumerate(embeddings)}
    vocab_size = len(word_to_idx) + 1
    weights = np.vstack([np.zeros((1, embedding_dim)), np.array(list(embeddings.values()))])
    
    model = LSTMClassifier(vocab_size, embedding_dim, hidden_dim, weights)
    
    def texts_to_sequences(texts):
        # Map whitespace tokens to GloVe indices, dropping out-of-vocabulary words
        return [[word_to_idx[w] for w in text.split() if w in word_to_idx] for text in texts]
    
    def pad_sequences(seqs, maxlen):
        # Truncate or right-pad each sequence with the padding index 0
        return [seq[:maxlen] + [0] * (maxlen - len(seq[:maxlen])) for seq in seqs]
    
    x = torch.tensor(pad_sequences(texts_to_sequences(INPUTS), MAX_SEQUENCE_LENGTH))
    y = torch.FloatTensor(LABELS)
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=TEST_SIZE, random_state=RANDOM_SEED)
    train_data = TensorDataset(x_train, y_train)
    test_data = TensorDataset(x_test, y_test)
    
    CrossValidator(Trainer(model), train_data, test_data).k_fold_cv(log_id="lstm")
  4. Classifiers Pool Evaluation:

    evaluator = ClassifiersPoolEvaluator()
    evaluator.pool_evaluation()

Evaluation and Results

Metrics such as precision, recall, and F1 score are calculated and saved to a CSV file.
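
For example, with scikit-learn and pandas (a sketch; the actual log columns and averaging strategy are assumptions):

import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def save_metrics(y_true, y_pred, path):
    # Micro-averaged precision, recall, and F1 over all labels
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
    pd.DataFrame([{"precision": p, "recall": r, "f1": f1}]).to_csv(path, index=False)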
