
bert-sentiment's Issues

index out of range

Hi, I'm getting the error below when I run the training script. Can you help me?

File "C:\Users\thanisb\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\nn\functional.py", line 1724, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
[Finished in 33.4s with exit code 1]
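
"index out of range in self" from torch.embedding almost always means an input id fell outside the embedding table: either the tokenizer doesn't match the model's vocabulary, or a sequence is longer than BERT's 512 position embeddings. A minimal sanity check, assuming the bert-base-uncased pairing this repo uses (the sample text is a placeholder):

import transformers

tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
model = transformers.BertModel.from_pretrained("bert-base-uncased")

# Encode one training review and check both failure modes explicitly.
ids = tokenizer.encode("sample review text", add_special_tokens=True)
assert max(ids) < model.config.vocab_size, "token id outside the model's vocab"
assert len(ids) <= model.config.max_position_embeddings, "review longer than 512 positions"

If either assertion fires, check that MAX_LEN in config.py is at most 512 and that the tokenizer and model checkpoints agree.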

Getting issue while loading model

Getting the issue below while loading the model on my local system. The model was trained on Colab.

Traceback (most recent call last):
  File "app.py", line 74, in <module>
    MODEL.load_state_dict(torch.load(config.MODEL_PATH, map_location=torch.device('cpu'))) #New created
  File "C:\Users\Vijender\Downloads\bert_sentiment\lib\site-packages\torch\nn\modules\module.py", line 830, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BERTBaseUncased:
        Missing key(s) in state_dict: "bert.embeddings.word_embeddings.weight", "bert.embeddings.position_embeddings.weight", "bert.embeddings.token_type_embeddings.weight", "bert.embeddings.LayerNorm.weight", "bert.embeddings.LayerNorm.bias", "bert.encoder.layer.0.attention.self.query.weight", "bert.encoder.layer.0.attention.self.query.bias",
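
A hedged guess, since the traceback is truncated: a checkpoint saved on Colab from a model wrapped in nn.DataParallel stores every key as "module.bert....", so a plain BERTBaseUncased reports all of its own keys as missing. Wrapping the local model the same way before loading makes the keys line up:

import torch
import torch.nn as nn

MODEL = BERTBaseUncased()
MODEL = nn.DataParallel(MODEL)  # match the wrapper used at save time
MODEL.load_state_dict(torch.load(config.MODEL_PATH, map_location=torch.device("cpu")))

Alternatively, strip the "module." prefix from the checkpoint keys instead of wrapping; that variant is sketched under the DataParallel issue below.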

TypeError: __init__() got an unexpected keyword argument 'comment_text' and AttributeError: module 'config' has no attribute 'DEVICE'

While writing the code for multilingual toxic comment classification I am getting these errors:
import config
import dataset
import engine
import torch
import pandas as pd
import torch.nn as nn
import numpy as np

from model import BERTBaseUncased
from sklearn import model_selection
from sklearn import metrics
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

def run():
  df1 = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\jigsaw-toxic-comment-train.csv", usecols=["comment_text", "toxic"])
  df2 = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\jigsaw-unintended-bias-train.csv", usecols=["comment_text", "toxic"])

  df_train = pd.concat([df1,df2], axis=0).reset_index(drop=True)
  
  df_valid = pd.read_csv(r"C:\Users\saura\Desktop\tcc\input\validation.csv")

  train_dataset = dataset.BERTDataset(
      comment_text=df_train.comment_text.values,
      target=df_train.toxic.values
  )

  train_data_loader = torch.utils.data.DataLoader(
      train_dataset,
      batch_size=config.TRAIN_BATCH_SIZE, 
      num_workers=4
  )

  valid_dataset = dataset.BERTDataset(
      comment_text=df_valid.comment_text.values, 
      target=df_valid.toxic.values
  )

  valid_data_loader = torch.utils.data.DataLoader(
      valid_dataset, 
      batch_size=config.VALID_BATCH_SIZE, 
      num_workers=1
  )

  device = torch.device(config.DEVICE)
  model = BERTBaseUncased()
  model.to(device)

  param_optimizer = list(model.named_parameters())
  no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
  optimizer_parameters = [
      {
          "params": [
              p for n, p in param_optimizer if not any(nd in n for nd in no_decay)
          ],
          "weight_decay": 0.001,
      },
      {
          "params": [
              p for n, p in param_optimizer if any(nd in n for nd in no_decay)
          ],
          "weight_decay": 0.0,
      },
  ]

  num_train_steps = int(len(df_train) / config.TRAIN_BATCH_SIZE * config.EPOCHS)
  optimizer = AdamW(optimizer_parameters, lr=3e-5)
  scheduler = get_linear_schedule_with_warmup(
      optimizer, num_warmup_steps=0, num_training_steps=num_train_steps
  )

  best_accuracy = 0
  for epoch in range(config.EPOCHS):
      engine.train_fn(train_data_loader, model, optimizer, device, scheduler)
      outputs, targets = engine.eval_fn(valid_data_loader, model, device)
      targets = np.array(targets) >= 0.5
      accuracy = metrics.roc_auc_score(targets, outputs)
      print(f"AUC Score = {accuracy}")
      if accuracy > best_accuracy:
          torch.save(model.state_dict(), config.MODEL_PATH)
          best_accuracy = accuracy

if __name__ == "__main__":
  run()

error message:

PS C:\Users\saura\Desktop\tcc\src> python train.py
Traceback (most recent call last):
  File "train.py", line 86, in <module>
    run()
  File "train.py", line 46, in run
    device = torch.device(config.DEVICE)
AttributeError: module 'config' has no attribute 'DEVICE'
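
Both errors point at files not shown in the issue, so what follows is a hedged sketch of the two likely fixes, not the repo's definitive code. First, config.py simply has no DEVICE attribute. Second, the TypeError suggests dataset.BERTDataset's __init__ declares a parameter named review rather than comment_text, so the keyword at the call site has to match:

# In config.py -- add the missing attribute (any valid torch device string):
DEVICE = "cuda"  # or "cpu" when no GPU is available

# In train.py -- pass whatever keyword dataset.BERTDataset actually declares;
# if its signature is __init__(self, review, target), then:
train_dataset = dataset.BERTDataset(
    review=df_train.comment_text.values,
    target=df_train.toxic.values,
)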

In Dataset.py

I just tried to run your code and got an error: the keyword argument pad_to_max_length is not recognized (newer transformers versions removed it). I rewrote the dataset like this:

import torch
import config
from torch.utils.data import Dataset


class CustomDataset(Dataset):

    def __init__(self, review, target):
        super(CustomDataset, self).__init__()
        self.review = review
        self.target = target
        self.tokenizer = config.TOKENIZER
        self.max_len = config.MAX_LEN
        # Tokenize everything up front; padding=True pads this call to its
        # longest sequence, replacing the removed pad_to_max_length flag.
        self.train_encodings = self.tokenizer(review, truncation=True, padding=True)

    def __len__(self):
        return len(self.review)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.train_encodings.items()}
        item['labels'] = torch.tensor(self.target[idx])
        return item

Just take a look at this; it can be helpful.

line 15 model.py

TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str
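
A hedged sketch of the likely cause, assuming line 15 of model.py tuple-unpacks the BERT output (_, o2 = self.bert(...)): transformers >= 4.x returns a ModelOutput object by default, so unpacking yields its keys (strings), and the dropout layer then receives a str. Requesting the legacy tuple restores the old behavior:

import torch.nn as nn
import transformers


class BERTBaseUncased(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = transformers.BertModel.from_pretrained("bert-base-uncased")
        self.bert_drop = nn.Dropout(0.3)
        self.out = nn.Linear(768, 1)

    def forward(self, ids, mask, token_type_ids):
        # return_dict=False restores the old (sequence_output, pooled_output)
        # tuple, so the unpacking gets tensors, not dict keys.
        _, o2 = self.bert(
            ids,
            attention_mask=mask,
            token_type_ids=token_type_ids,
            return_dict=False,
        )
        bo = self.bert_drop(o2)  # o2 is a Tensor again, not a str
        return self.out(bo)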

Loading DataParallel GPU model on CPU

Follow-up to issue #1.
@abhishekkrthakur: Can you give any leads on how to load a DataParallel GPU model on CPU?
As per the PyTorch docs I tried the following, but it still raises the above RuntimeError:

device = torch.device('cpu')
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH, map_location=device))
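
A hedged sketch rather than a confirmed fix: checkpoints saved from an nn.DataParallel-wrapped model prefix every parameter name with "module.", which a bare model rejects. Stripping the prefix while loading on CPU usually works:

import torch

device = torch.device("cpu")
state_dict = torch.load(PATH, map_location=device)
# Drop the "module." prefix that nn.DataParallel added at save time.
state_dict = {
    (k[len("module."):] if k.startswith("module.") else k): v
    for k, v in state_dict.items()
}
model = TheModelClass(*args, **kwargs)
model.load_state_dict(state_dict)

The mirror-image option, wrapping the local model in nn.DataParallel before load_state_dict, is sketched under the state_dict issue above.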

IndexError: index ? is out of bounds for axis 0 with size ?

Hello,

I've tried your implementation of the BERT model in order to predict the sentiment of a sentence.

When training the model on Google TPUs, I hit this issue on this line:

review = str(self.review[item])

I've spent a lot of time trying to find the issue, but I can't figure out why it throws an array index out of bounds.

I started training with a small dataset of 10,000 rows.

Stack trace:

bi = 0, loss = 0.6744452714920044
bi = 10, loss = 0.6480506658554077
bi = 20, loss = 0.6070395708084106
bi = 30, loss = 0.3570273816585541
bi = 40, loss = 0.322771281003952
bi = 50, loss = 0.42349475622177124
bi = 60, loss = 0.2848508358001709
bi = 70, loss = 0.2577969431877136
bi = 80, loss = 0.4233595132827759
bi = 90, loss = 0.600457489490509
bi = 100, loss = 0.22680382430553436
bi = 110, loss = 0.09512724727392197
bi = 120, loss = 0.14158135652542114
bi = 130, loss = 0.653974175453186
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/parallel_loader.py", line 134, in _loader_worker
    _, data = next(data_iter)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "", line 12, in __getitem__
    review = str(self.review[item])
IndexError: index 4244 is out of bounds for axis 0 with size 979
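
A hedged guess at the cause: index 4244 being out of bounds for size 979 suggests self.review is a pandas Series kept from a train/validation split without resetting its index, so self.review[item] does a label-based lookup against the original DataFrame's index instead of a 0..978 positional one. Passing plain numpy arrays (or resetting the index) makes positional indexing safe; the names below follow the repo's train.py and are assumptions:

# Reset the index after splitting, or hand the dataset raw arrays.
df_valid = df_valid.reset_index(drop=True)

valid_dataset = BERTDataset(
    review=df_valid.review.values,   # .values yields a positional numpy array
    target=df_valid.sentiment.values,
)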

Dataloader possible bug

For some reason I am unable to iterate through the PyTorch DataLoader. It could be something I am missing, or the DataLoader has a bug.

import transformers
from sklearn import model_selection
import torch
import pandas as pd
    
    
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=True)
max_len = 512
train_batch_size = 8
    
"""
This class takes reviews and targets as arguments 
- Split the reviews and tokenizes
"""
class BERTDataset:
    def __init__(self, review, target):
        self.review = review
        self.target = target
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.review)

    def __getitem__(self, item):
        review = str(self.review[item])
        review = " ".join(review.split())

        tokenized_inputs = self.tokenizer.encode_plus(
            review,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding=True,
            truncation=True
        )

        ids = tokenized_inputs["input_ids"]
        mask = tokenized_inputs["attention_mask"]
        token_type_ids = tokenized_inputs["token_type_ids"]

        return {
            "ids": torch.tensor(ids, dtype=torch.long),
            "mask": torch.tensor(mask, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
            "targets": torch.tensor(self.target[item], dtype=torch.float),
        }


dfx = pd.read_csv(training_file).fillna("none")
dfx['sentiment'] = dfx['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

df_train, df_valid = model_selection.train_test_split(
                        dfx,
                        test_size=0.1,
                        random_state=42,
                        stratify=dfx['sentiment'].values
                        )

# reset indices 
df_train = df_train.reset_index(drop=True)

# get ids, tokens, masks and targets  
train_dataset = BERTDataset(review=df_train['review'], target=df_train['sentiment'])

# load into pytorch dataset object
# DataLoader inputs tensor dataset of Inputs and targets 
train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size=train_batch_size, num_workers=0)

# Iterating to the Data loader
train_iter = iter(train_data_loader)
print(type(train_iter))

review, labels = train_iter.next()

When iterating through the DataLoader, the following error comes up:

RuntimeError Traceback (most recent call last)
    <ipython-input-19-c99d0829d5d9> in <module>()
          2 print(type(train_iter))
          3 
    ----> 4 images, labels = train_iter.next()
    
    5 frames
    /usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
         53             storage = elem.storage()._new_shared(numel)
         54             out = elem.new(storage)
    ---> 55         return torch.stack(batch, 0, out=out)
         56     elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
         57             and elem_type.__name__ != 'string_':
    
    RuntimeError: stack expects each tensor to be equal size, but got [486] at entry 0 and [211] at entry 1

Appreciate your inputs.
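
A hedged sketch of the likely fix: encode_plus tokenizes one review at a time, so padding=True ("pad to the longest in this call") never actually pads anything and every example keeps its own length; default_collate then cannot stack tensors of size [486] and [211]. Padding every example to the same fixed length in __getitem__ resolves it:

tokenized_inputs = self.tokenizer.encode_plus(
    review,
    None,
    add_special_tokens=True,
    max_length=self.max_len,
    padding="max_length",  # pad each example to max_len, not to the batch longest
    truncation=True,
)

Note also that __getitem__ returns a dict, so the batch arrives as a dict of stacked tensors: use batch = next(train_iter) and then batch["ids"], batch["targets"], rather than tuple-unpacking review, labels.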
