bert-sentiment's Introduction

Hi there 👋

I'm a data scientist / machine learning engineer.

bert-sentiment's People

Contributors

10-zin, abhishekkrthakur

bert-sentiment's Issues

Getting issue while loading model

I get the error below when loading the model on my local system. The model was trained on Colab.

Traceback (most recent call last):
  File "app.py", line 74, in <module>
    MODEL.load_state_dict(torch.load(config.MODEL_PATH, map_location=torch.device('cpu'))) #New created
  File "C:\Users\Vijender\Downloads\bert_sentiment\lib\site-packages\torch\nn\modules\module.py", line 830, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BERTBaseUncased:
        Missing key(s) in state_dict: "bert.embeddings.word_embeddings.weight", "bert.embeddings.position_embeddings.weight", "bert.embeddings.token_type_embeddings.weight", "bert.embeddings.LayerNorm.weight", "bert.embeddings.LayerNorm.bias", "bert.encoder.layer.0.attention.self.query.weight", "bert.encoder.layer.0.attention.self.query.bias",

Loading DataParallel GPU model on CPU

Follow-up to issue #1.
@abhishekkrthakur: Can you give any pointers on how to load a DataParallel GPU model on CPU?
Following the PyTorch docs, I tried the code below, but it still raises the RuntimeError above:

device = torch.device('cpu')
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH, map_location=device))
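
One common cause of this error (an assumption here, since the traceback lists only the missing keys and the thread does not confirm it) is that the checkpoint was saved from a model wrapped in `torch.nn.DataParallel`, so every key in the saved state dict carries a `module.` prefix that the unwrapped model does not expect. A minimal sketch of stripping that prefix before loading; `strip_module_prefix` is a hypothetical helper, and the toy dictionary with made-up values stands in for the result of `torch.load`:

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Toy stand-in for a checkpoint saved from a DataParallel-wrapped model.
# The second key name and both values are made up for illustration.
saved = OrderedDict([
    ("module.bert.embeddings.word_embeddings.weight", "tensor-0"),
    ("module.out.bias", "tensor-1"),
])
cleaned = strip_module_prefix(saved)
print(list(cleaned))

# With a real checkpoint, the helper would be used roughly like this:
#   state = torch.load(PATH, map_location=torch.device("cpu"))
#   model.load_state_dict(strip_module_prefix(state))
```

Keys without the prefix pass through unchanged, so the helper should be harmless on checkpoints saved from an unwrapped model.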

IndexError: index ? is out of bounds for axis 0 with size ?

Hello,

I've tried your implementation of the BERT model to predict the sentiment of a sentence.

When training the model on Google TPUs, I hit this issue on the following line:

review = str(self.review[item])

I've spent a lot of time on this, but I can't figure out why it throws an array index out of bounds error.

I started training with a small dataset of 10,000 rows.

Stack trace:

bi = 0, loss = 0.6744452714920044
bi = 10, loss = 0.6480506658554077
bi = 20, loss = 0.6070395708084106
bi = 30, loss = 0.3570273816585541
bi = 40, loss = 0.322771281003952
bi = 50, loss = 0.42349475622177124
bi = 60, loss = 0.2848508358001709
bi = 70, loss = 0.2577969431877136
bi = 80, loss = 0.4233595132827759
bi = 90, loss = 0.600457489490509
bi = 100, loss = 0.22680382430553436
bi = 110, loss = 0.09512724727392197
bi = 120, loss = 0.14158135652542114
bi = 130, loss = 0.653974175453186
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 134, in _loader_worker
    _, data = next(data_iter)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "", line 12, in __getitem__
    review = str(self.review[item])
IndexError: index 4244 is out of bounds for axis 0 with size 979
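
The final line says that index 4244 was requested from an array of only 979 rows, so the indices being drawn do not match the dataset being indexed. One possible cause, which is a guess and not confirmed anywhere in this thread, is a sampler built over one dataset (for example the training split) being passed to a `DataLoader` over a smaller one (for example the validation split). A quick check, in plain Python with hypothetical names and toy values, is to scan the sampler's indices against the dataset length before training:

```python
def find_bad_indices(sampler, dataset_len):
    """Return the indices yielded by `sampler` that fall outside [0, dataset_len)."""
    return [i for i in sampler if not 0 <= i < dataset_len]


# Toy stand-ins: a validation set of 979 rows, and a sampler that was
# (mistakenly, in this hypothetical) built over a larger training set.
valid_len = 979
train_sampler = range(0, 9000, 1061)

bad = find_bad_indices(train_sampler, valid_len)
print(bad)
```

With real objects, the equivalent call would be along the lines of `find_bad_indices(valid_sampler, len(valid_dataset))`; an empty list means every sampled index is valid for that dataset.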

In Dataset.py

I tried to run your code and got an unknown keyword argument error:
"pad_to_max_len" is not recognized.

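import torch
from torch.utils.data import Dataset

import config


class CustomDataset(Dataset):
    def __init__(self, review, target):
        super().__init__()
        self.review = review
        self.target = target
        self.tokenizer = config.TOKENIZER
        self.max_len = config.MAX_LEN
        # Tokenize all reviews once up front; padding=True pads to the longest review.
        # list(...) is used because the tokenizer expects a list of strings.
        self.train_encodings = self.tokenizer(list(review), truncation=True, padding=True)

    def __len__(self):
        return len(self.review)

    def __getitem__(self, idx):
        # Build one example: a tensor per tokenizer field, plus the label.
        item = {key: torch.tensor(val[idx]) for key, val in self.train_encodings.items()}
        item['labels'] = torch.tensor(self.target[idx])
        return item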

Take a look at this version; it may be helpful.

index out of range

Hi, I'm getting the error below when I run the training script. Can you help me?

File "C:\Users\thanisb\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\nn\functional.py", line 1724, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
[Finished in 33.4s with exit code 1]
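
`torch.embedding` raises this error when an index passed to an embedding layer is not smaller than the number of rows in that layer's table. With BERT, two commonly reported triggers (assumptions here, since the issue includes no code or data) are token ids from a tokenizer whose vocabulary is larger than the model's, and sequences longer than the model's maximum position count. A minimal plain-Python sketch of a pre-flight check, with a hypothetical helper name and toy values:

```python
def check_embedding_inputs(ids, vocab_size, max_positions):
    """Return a list of human-readable problems found in one token-id sequence."""
    problems = []
    if len(ids) > max_positions:
        problems.append(f"sequence length {len(ids)} exceeds {max_positions} positions")
    bad = [i for i in ids if not 0 <= i < vocab_size]
    if bad:
        problems.append(f"token ids outside [0, {vocab_size}): {bad}")
    return problems


# Toy values; 30522 and 512 are the vocabulary size and position limit
# of bert-base-uncased.
ok = check_embedding_inputs([101, 2023, 102], vocab_size=30522, max_positions=512)
broken = check_embedding_inputs([101, 40000, 102], vocab_size=30522, max_positions=512)
print(ok)
print(broken)
```

Running a check like this over a few batches before the forward pass can narrow down which of the two conditions, if either, applies.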

TypeError: __init__() got an unexpected keyword argument 'comment_text' and AttributeError: module 'config' has no attribute 'DEVICE'

While adapting the code for multilingual toxic comment classification, I am getting errors:
import config
import dataset
import engine
import torch
import pandas as pd
import torch.nn as nn
import numpy as np

from model import BERTBaseUncased
from sklearn import model_selection
from sklearn import metrics
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

def run():
  df1 = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\jigsaw-toxic-comment-train.csv", usecols = ["comment_text","toxic"])
  df2 = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\jigsaw-unintended-bias-train.csv", usecols = ["comment_text","toxic"])

  df_train = pd.concat([df1,df2], axis=0).reset_index(drop=True)
  
  df_valid = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\validation.csv")

  train_dataset = dataset.BERTDataset(
      comment_text=df_train.comment_text.values,
      target=df_train.toxic.values
  )

  train_data_loader = torch.utils.data.DataLoader(
      train_dataset,
      batch_size=config.TRAIN_BATCH_SIZE, 
      num_workers=4
  )

  valid_dataset = dataset.BERTDataset(
      comment_text=df_valid.comment_text.values, 
      target=df_valid.toxic.values
  )

  valid_data_loader = torch.utils.data.DataLoader(
      valid_dataset, 
      batch_size=config.VALID_BATCH_SIZE, 
      num_workers=1
  )

  device = torch.device(config.DEVICE)
  model = BERTBaseUncased()
  model.to(device)

  param_optimizer = list(model.named_parameters())
  no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
  optimizer_parameters = [
      {
          "params": [
              p for n, p in param_optimizer if not any(nd in n for nd in no_decay)
          ],
          "weight_decay": 0.001,
      },
      {
          "params": [
              p for n, p in param_optimizer if any(nd in n for nd in no_decay)
          ],
          "weight_decay": 0.0,
      },
  ]

  num_train_steps = int(len(df_train) / config.TRAIN_BATCH_SIZE * config.EPOCHS)
  optimizer = AdamW(optimizer_parameters, lr=3e-5)
  scheduler = get_linear_schedule_with_warmup(
      optimizer, num_warmup_steps=0, num_training_steps=num_train_steps
  )

  best_accuracy = 0
  for epoch in range(config.EPOCHS):
      engine.train_fn(train_data_loader, model, optimizer, device, scheduler)
      outputs, targets = engine.eval_fn(valid_data_loader, model, device)
      targets = np.array(targets) >= 0.5
      accuracy = metrics.roc_auc_score(targets, outputs)
      print(f"AUC Score = {accuracy}")
      if accuracy > best_accuracy:
          torch.save(model.state_dict(), config.MODEL_PATH)
          best_accuracy = accuracy

if __name__ == "__main__":
  run()

Error message:

PS C:\Users\saura\Desktop\tcc\src> python train.py
Traceback (most recent call last):
  File "train.py", line 86, in <module>
    run()
  File "train.py", line 46, in run
    device = torch.device(config.DEVICE)
AttributeError: module 'config' has no attribute 'DEVICE'
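
Both errors in the title come down to names that do not exist where the script looks for them: `config.py` defines no `DEVICE`, and the dataset constructor has no `comment_text` parameter (the `BERTDataset` shown elsewhere on this page takes `review` and `target`). A minimal sketch of the two adjustments, using stand-in objects so that it runs on its own; the stand-in `config` contents and the sample rows are hypothetical:

```python
from types import SimpleNamespace

# Stand-in for the repository's config module, which lacks DEVICE.
config = SimpleNamespace(TRAIN_BATCH_SIZE=8)

# Option A: define DEVICE in config.py, e.g. DEVICE = "cuda" or DEVICE = "cpu".
# Option B: fall back to CPU when the attribute is missing.
device_name = getattr(config, "DEVICE", "cpu")


class BERTDataset:
    """Stand-in with the same constructor signature as the dataset on this page."""

    def __init__(self, review, target):
        self.review = review
        self.target = target

    def __len__(self):
        return len(self.review)


comments = ["you are great", "you are awful"]  # hypothetical sample rows
labels = [0, 1]

# Pass the comment text under the parameter name the constructor declares.
train_dataset = BERTDataset(review=comments, target=labels)
print(device_name, len(train_dataset))
```

The alternative to renaming the keyword at the call site is to rename the constructor parameter to `comment_text` in `dataset.py`; either way, the two names have to agree.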

Dataloader possible bug

For some reason I am unable to iterate through the PyTorch DataLoader. It could be something I am missing, or the DataLoader has a bug.

import transformers
from sklearn import model_selection
import torch
import pandas as pd
    
    
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=True)
max_len = 512
train_batch_size = 8
    
"""
This class takes reviews and targets as arguments 
- Split the reviews and tokenizes
"""
class BERTDataset:
    def __init__(self, review, target):
        self.review = review
        self.target = target
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.review)

    def __getitem__(self, item):
        review = str(self.review[item])
        review = " ".join(review.split())

        tokenized_inputs = self.tokenizer.encode_plus(
            review,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding=True,
            truncation=True
        )

        ids = tokenized_inputs["input_ids"]
        mask = tokenized_inputs["attention_mask"]
        token_type_ids = tokenized_inputs["token_type_ids"]

        return {
            "ids": torch.tensor(ids, dtype=torch.long),
            "mask": torch.tensor(mask, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
            "targets": torch.tensor(self.target[item], dtype=torch.float),
        }


dfx = pd.read_csv(training_file).fillna("none")
dfx['sentiment'] = dfx['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

df_train, df_valid = model_selection.train_test_split(
                        dfx,
                        test_size=0.1,
                        random_state=42,
                        stratify=dfx['sentiment'].values
                        )

# reset indices 
df_train = df_train.reset_index(drop=True)

# get ids, tokens, masks and targets  
train_dataset = BERTDataset(review=df_train['review'], target=df_train['sentiment'])

# load into pytorch dataset object
# DataLoader inputs tensor dataset of Inputs and targets 
train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size=train_batch_size, num_workers=0)

# Iterating to the Data loader
train_iter = iter(train_data_loader)
print(type(train_iter))

review, labels = train_iter.next()

When iterating through the DataLoader, the following error comes up:

RuntimeError Traceback (most recent call last)
    <ipython-input-19-c99d0829d5d9> in <module>()
          2 print(type(train_iter))
          3 
    ----> 4 images, labels = train_iter.next()
    
    5 frames
    /usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
         53             storage = elem.storage()._new_shared(numel)
         54             out = elem.new(storage)
    ---> 55         return torch.stack(batch, 0, out=out)
         56     elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
         57             and elem_type.__name__ != 'string_':
    
    RuntimeError: stack expects each tensor to be equal size, but got [486] at entry 0 and [211] at entry 1

Appreciate your inputs.
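
The error message says the tensors in one batch have different lengths (486 and 211), which the default collate function cannot stack. In the code above, `padding=True` on a single sequence pads to the longest sequence in that call, which for one review means no padding at all; in recent `transformers` releases, `padding="max_length"` pads every example to `max_length` instead. The plain-Python sketch below only illustrates the effect of fixed-length padding; `pad_to_length` is a hypothetical helper, not part of any library:

```python
def pad_to_length(ids, max_len, pad_id=0):
    """Truncate or right-pad `ids` to exactly `max_len`; also return the attention mask."""
    ids = list(ids)[:max_len]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Two toy "reviews" of different lengths, given as token ids.
batch = [[101, 7592, 102], [101, 2023, 2003, 2307, 102]]
padded = [pad_to_length(ids, max_len=8) for ids in batch]

for ids, mask in padded:
    print(len(ids), ids, mask)
```

Once every example has the same length, the batch can be stacked. Note also that each batch from this dataset is a dictionary, so it would be read as `batch = next(train_iter)` and indexed by key (`batch["ids"]`, `batch["targets"]`) rather than unpacked into two names.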

line 15 model.py

TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str
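
A commonly reported cause of this message (an assumption here, as the issue gives no code or versions) is that newer `transformers` releases return a dict-like output object from the model's forward call rather than a tuple. Tuple-unpacking a dict-like object yields its keys, so the value handed to dropout is a string such as `"pooler_output"`. The stand-in below reproduces the effect with an `OrderedDict` and made-up values:

```python
from collections import OrderedDict

# Stand-in for a dict-like model output (field names mirror BERT's output).
outputs = OrderedDict(
    last_hidden_state=[[0.1, 0.2]],
    pooler_output=[0.3, 0.4],
)

# Unpacking iterates over the keys, so o2 is a string, not the pooled vector.
_, o2 = outputs
print(type(o2).__name__, o2)

# Reading the field by name returns the actual value.
pooled = outputs["pooler_output"]
print(pooled)
```

If this is indeed the cause, the usual remedies are to pass `return_dict=False` to the BERT call so that it returns a tuple again, or to keep the output object and read `outputs.pooler_output` by name.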
