
pytorch-nlp's Introduction

💕 Now Archived 💕

With the PyTorch toolchain maturing, it's time to archive repos like this one. You'll be able to find more developed options for every part of this toolkit.

Happy developing! ✨

Feel free to contact me if anyone wants to unarchive this repo and continue developing it. You can reach me at "petrochukm [at] gmail.com".


Basic Utilities for PyTorch Natural Language Processing (NLP)

PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. torchnlp extends PyTorch to provide you with basic text data processing functions.


Logo by Chloe Yeo, Corporate Sponsorship by WellSaid Labs

Installation 🐾

Make sure you have Python 3.6+ and PyTorch 1.0+. You can then install pytorch-nlp using pip:

pip install pytorch-nlp

Or to install the latest code via:

pip install git+https://github.com/PetrochukM/PyTorch-NLP.git

Docs

The complete documentation for PyTorch-NLP is available via our ReadTheDocs website.

Get Started

Within an NLP data pipeline, you'll want to implement these basic steps:

1. Load your Data 🐿

Load the IMDB dataset, for example:

from torchnlp.datasets import imdb_dataset

# Load the imdb training dataset
train = imdb_dataset(train=True)
train[0]  # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}

Load a custom dataset, for example:

from pathlib import Path

from torchnlp.download import download_file_maybe_extract

directory_path = Path('data/')
train_file_path = Path('trees/train.txt')

download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])

open(directory_path / train_file_path)

Don't worry, we'll handle caching for you!

2. Text to Tensor

Tokenize and encode your text as a tensor.

For example, a WhitespaceEncoder breaks text into tokens whenever it encounters a whitespace character.

from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
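The encoder also exposes a decode method, so the indices can be mapped back to text. A minimal usage sketch (assuming the default whitespace detokenization round-trips the input):

from torchnlp.encoders.text import WhitespaceEncoder

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])
tensor = encoder.encode("this ain't funny")
print(encoder.decode(tensor))  # "this ain't funny"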

3. Tensor to Batch

With your loaded and encoded data in hand, you'll want to batch your dataset.

import torch
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]

PyTorch-NLP builds on top of PyTorch's existing torch.utils.data.sampler, torch.stack and default_collate to support sequential inputs of varying lengths!
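If you'd rather let torch.utils.data.DataLoader drive the loop, the same pieces plug in directly. A sketch (reusing the sampler and collate function from the snippet above):

import torch
from torchnlp.encoders.text import stack_and_pad_tensors
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors

encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

# `batch_sampler` yields batches of indices; `collate_fn` pads each batch to a common length.
loader = torch.utils.data.DataLoader(
    encoded_data,
    batch_sampler=train_batch_sampler,
    collate_fn=lambda batch: collate_tensors(batch, stack_tensors=stack_and_pad_tensors))

for batch in loader:
    pass  # each `batch` holds the padded tensors together with their original lengths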

4. Training and Inference

With your batch in hand, you can use PyTorch to develop and train your model using gradient descent. For example, check out this example code for training on the Stanford Natural Language Inference (SNLI) Corpus.

Last But Not Least

PyTorch-NLP has a couple more NLP-focused utility packages to support you! 🤗

Deterministic Functions

Now that you've set up your pipeline, you may want to ensure that some functions run deterministically. Wrap any code that's random with fork_rng and you'll be good to go, like so:

import random
import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))

This will always print:

Random: 224899943
Numpy: 843828735
Torch: 843828736

Pre-Trained Word Vectors

Now that you've computed your vocabulary, you may want to make use of pre-trained word vectors to set your embeddings, like so:

import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

vocab_set = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]

Neural Networks Layers

For example, from the neural network package, apply the state-of-the-art LockedDropout:

import torch
from torchnlp.nn import LockedDropout

input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)

# Apply a LockedDropout to `input_`
dropout(input_) # RETURNS: torch.FloatTensor (6x3x10)

Metrics

Compute common NLP metrics such as the BLEU score.

from torchnlp.metrics import get_moses_multi_bleu

hypotheses = ["The brown fox jumps over the dog ็ฌ‘"]
references = ["The quick brown fox jumps over the lazy dog ็ฌ‘"]

# Compute BLEU score with the official BLEU perl script
get_moses_multi_bleu(hypotheses, references, lowercase=True)  # RETURNS: 47.9

Help ❓

Looking at the longer examples in examples/ may also help you.

Need more help? We are happy to answer your questions via Gitter Chat.

Contributing

We've released PyTorch-NLP because we found a lack of basic toolkits for NLP in PyTorch. We hope that other organizations can benefit from the project. We are thankful for any contributions from the community.

Contributing Guide

Read our contributing guide to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to PyTorch-NLP.

Related Work

torchtext and PyTorch-NLP differ in architecture and feature set; otherwise, they are similar. Both provide pre-trained word vectors, datasets, iterators and text encoders. PyTorch-NLP also provides neural network modules and metrics. From an architecture standpoint, torchtext is object oriented with external coupling, while PyTorch-NLP is object oriented with low coupling.

AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.

Authors

Citing

If you find PyTorch-NLP useful for an academic publication, then please use the following BibTeX to cite it:

@misc{pytorch-nlp,
  author = {Petrochuk, Michael},
  title = {PyTorch-NLP: Rapid Prototyping with PyTorch Natural Language Processing (NLP) Tools},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/PetrochukM/PyTorch-NLP}},
}

pytorch-nlp's People

Contributors

benjamin-work, floscha, grant, jiaqiliu, jmribeiro, mrshu, nzw0301, petrochukm, qqaatw, rishishian, seanhung21, songheony, timgates42, wxl1999, xingxingzhang, zbyte64


pytorch-nlp's Issues

Input batch size doesn't match hidden[0] batch size

Hi. I am using an LSTMCell. However, I get this error:
Input batch size 1 doesn't match hidden[0] batch size 512

This is my testing code:

import torch
from torch import nn
from torchnlp.nn import WeightDrop

lstm = nn.LSTMCell(256, 512)
weights = ['weight_hh']
weight_drop_gru = WeightDrop(lstm, weights, dropout=0.5)
input_ = torch.randn(1, 256)
hidden_state = torch.randn(1, 512)
weight_drop_gru(input_, hidden_state)

The input and hidden both have a batch_size of 1. Why does the GRUCell work but LSTMCell doesn't work?
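A likely explanation (my assumption, not confirmed in this thread): nn.LSTMCell expects its hidden state as an (h, c) tuple, while nn.GRUCell takes a single tensor. Passing a bare tensor makes hidden[0] a row of length 512, hence the size mismatch. A minimal sketch of the call with both states:

import torch
from torch import nn
from torchnlp.nn import WeightDrop

lstm = nn.LSTMCell(256, 512)
weight_drop_lstm = WeightDrop(lstm, ['weight_hh'], dropout=0.5)

input_ = torch.randn(1, 256)
hidden_state = torch.randn(1, 512)
cell_state = torch.randn(1, 512)

# LSTMCell takes an (h, c) tuple, unlike GRUCell, which takes a single hidden tensor.
output = weight_drop_lstm(input_, (hidden_state, cell_state))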

Add SQUAD Dataset

One of the standard datasets used for machine comprehension tasks is SQuAD. Thus, I suggest adding it to the torchnlp.datasets package.

Can't get pre-trained FastText embedding

Expected Behavior

I want to get a pre-trained FastText embedding in reasonable amount of time (and save it to a caching folder)

Actual Behavior

It takes a lot of time (ETA: 1000 hours) and eventually results in:

ConnectionResetError: [Errno 104] Connection reset by peer

I tested it on macOS and Ubuntu 16.04

Steps to Reproduce the Problem

Run

from torchnlp.word_to_vector import FastText

vectors = FastText(cache='cache')

import error: fails to find sru.cu and no attribute 'encode' issue

Expected Behavior

import torchnlp.nn as nn should work

Actual Behavior

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/torchnlp/nn/__init__.py", line 8, in <module>
    from torchnlp.nn.sru import SRU
  File "/opt/conda/lib/python3.7/site-packages/torchnlp/nn/sru.py", line 68, in <module>
    SRU_CODE = open('sru.cu').read()
FileNotFoundError: [Errno 2] No such file or directory: 'sru.cu'
If I try from within /opt/conda/lib/python3.7/site-packages/torchnlp/nn, I get:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/torchnlp/nn/__init__.py", line 8, in <module>
    from torchnlp.nn.sru import SRU
  File "/opt/conda/lib/python3.7/site-packages/torchnlp/nn/sru.py", line 69, in <module>
    SRU_PROG = Program(SRU_CODE.encode('utf-8'), 'sru_prog.cu'.encode('utf-8'))
  File "/opt/conda/lib/python3.7/site-packages/pynvrtc/compiler.py", line 52, in __init__
    include_names)
  File "/opt/conda/lib/python3.7/site-packages/pynvrtc/interface.py", line 200, in nvrtcCreateProgram
    c_char_p(encode_str(src)), c_char_p(encode_str(name)),
  File "/opt/conda/lib/python3.7/site-packages/pynvrtc/interface.py", line 54, in encode_str
    return s.encode("utf-8")
AttributeError: 'bytes' object has no attribute 'encode'

sru.cu is there but I get other errors down the line...

Steps to Reproduce the Problem

  1. python3.7
  2. cuda 10
  3. both pip installations yield the same errors
  4. import torchnlp.nn as nn

Attention's softmax dimension is wrong?

Hi, I found that the weights computed by nn.Attention are always 1 in this example:
atten = nn.Attention(3, attention_type='dot')
a = Variable(torch.randn(1, 1, 3))
b = Variable(torch.randn(1, 2, 3))
output, weights = atten(a, b)
and the output is:
Variable containing: (0 ,.,.) = 1 1 [torch.FloatTensor of size 1x1x2]
The weights are always 1 whatever I feed into the attention layer with shape [1, n, m], i.e. whenever the batch size is 1.
Then I checked the code and found nn.Softmax(dims=0). In my opinion this operator applies softmax along the first dimension (if a tensor's size is [z, y, x], it is applied along the z dimension). But the first dimension is batch_size * output_len, so the softmax runs in the wrong direction; it should be applied along the last dimension, query_len, which holds the real scores for every context in an instance.

E.g.

Given a query vector of shape [1, 1, 3] and a context of shape [1, 2, 3] (i.e. the context is a 2-row, 3-column matrix), the context is transposed to [1, 3, 2] and scoring produces scores of shape [1, 1, 2]. With the current operation, the scores are reshaped to [1, 2] and softmax is applied along the first dimension. As a result, the weights are always 1.

I changed it to nn.Softmax(dims=-1), which operates along the last dimension (query_len), and it worked:
Variable containing: (0 ,.,.) = 0.1426 0.8574 [torch.FloatTensor of size 1x1x2]

Thanks for your great work.
-Chen

Add End-To-End Example

Add an end-to-end example using the modules to load data, encode data, iterate over the data, and train a model.

IdentityEncoder, iteration over 0-d tensor error

It's expected that the IdentityEncoder can work with a scalar value:

>>> from torchnlp.text_encoders import IdentityEncoder
>>> i = IdentityEncoder([1])
>>> i.encode(1)
tensor([5])
>>> i.decode(i.encode(1))
1
>>> i.decode(i.encode(1)[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/michaelp/.local/lib/python3.6/site-packages/torchnlp/text_encoders/identity_encoder.py", line 44, in decode
    tokens = [self.itos[index] for index in tensor]
  File "/Users/michaelp/.pyenv/versions/3.6.6/lib/python3.6/site-packages/torch/tensor.py", line 381, in __iter__
    raise TypeError('iteration over a 0-d tensor')
TypeError: iteration over a 0-d tensor

Dataset does not return tuple

In the documentation for each dataset, it says Returns: Tuple with the training tokens, dev tokens and test tokens in order if their respective boolean argument is true.

However, when only a single split is selected, a tuple is not returned. The selected set is just returned directly.

E.g. running len(penn_treebank_dataset(train=True)) should return 1, but instead returns 929589.

I understand that this was probably intended as a convenience measure, but a) the behavior and the docs should match, and b) generally it's better to have a function return consistently shaped data.
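For reference, the two call patterns the issue contrasts (a short illustration; the dev/test keyword names are assumed from the library's other datasets):

from torchnlp.datasets import penn_treebank_dataset

# A single split currently comes back directly, not wrapped in a tuple...
train = penn_treebank_dataset(train=True)

# ...while requesting several splits returns them as a tuple, in order.
train, dev, test = penn_treebank_dataset(train=True, dev=True, test=True)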

Ujson reliance doesn't permit successful install

Expected Behavior

Having trouble installing this due to not being able to install ujson.
As I understand it, ujson is faster than json but has installation issues on some Python setups.
for example: jjjake/internetarchive#14

Steps to Reproduce the Problem

My system

import sys
print(sys.version)
3.6.3 |Anaconda custom (64-bit)| (default, Oct 13 2017, 12:02:49)
[GCC 7.2.0]
  1. pip3 install py pytorch-nlp

missing fine-grained option in smt dataset

PyTorch-NLP/torchnlp/datasets/smt.py

Although "fine-grained" is an argument in the smt_dataset() function, it is not used inside the function so this option does not work. A small modification is needed. Specifically, it should be given as an argument when the function parse_tree() is called.

Solution

lines 105 and 107 in smt.py should be changed (respectively) from:
examples.extend(parse_tree(line, subtrees=subtrees))
examples.append(parse_tree(line, subtrees=subtrees))
to:
examples.extend(parse_tree(line, subtrees=subtrees, fine_grained=fine_grained))
examples.append(parse_tree(line, subtrees=subtrees, fine_grained=fine_grained))

...and I believe the problem is fixed!

[Issue with Example Code] Running into Expected size issue

When trying to run main.py, I get an error when it gets to evaluating with a validation dataset:

RuntimeError: Expected hidden[0] size (1, 70, 1150), got (1, 10, 1150)
coming from the line:
val_loss = evaluate(val_data, val_source_sampler, val_target_sampler, eval_batch_size)

(70 happens to be the value of bptt, while 10 is the eval_batch_size)

Any idea why this might be happening?

Full error:

val_loss = evaluate(val_data, val_source_sampler, val_target_sampler, eval_batch_size)

  File "<ipython-input-1-b515fee49737>", line 196, in evaluate
    output, hidden = model(data, hidden)

  File "C:\Users\john\Desktop\Deep_Learning_A_Z\pytorch-0.4.1-py36_cuda90_cudnn7he774522_1\Lib\site-packages\torch\nn\modules\module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)

  File "C:\Users\john\Desktop\Deep_Learning_A_Z\PyTorch-NLP-master\examples\awd-lstm-lm\model.py", line 114, in forward
    raw_output, new_h = rnn(raw_output, hidden[l])

  File "C:\Users\john\Desktop\Deep_Learning_A_Z\pytorch-0.4.1-py36_cuda90_cudnn7he774522_1\Lib\site-packages\torch\nn\modules\module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)

  File "C:\Users\john\Desktop\Deep_Learning_A_Z\PyTorch-NLP-master\torchnlp\nn\weight_drop.py", line 24, in forward
    return original_module_forward(*args)

  File "C:\Users\john\Desktop\Deep_Learning_A_Z\pytorch-0.4.1-py36_cuda90_cudnn7he774522_1\Lib\site-packages\torch\nn\modules\rnn.py", line 184, in forward
    self.check_forward_args(input, hx, batch_sizes)

  File "C:\Users\john\Desktop\Deep_Learning_A_Z\pytorch-0.4.1-py36_cuda90_cudnn7he774522_1\Lib\site-packages\torch\nn\modules\rnn.py", line 153, in check_forward_args
    'Expected hidden[0] size {}, got {}')

  File "C:\Users\john\Desktop\Deep_Learning_A_Z\pytorch-0.4.1-py36_cuda90_cudnn7he774522_1\Lib\site-packages\torch\nn\modules\rnn.py", line 149, in check_hidden_size
    raise RuntimeError(msg.format(expected_hidden_size, tuple(hx.size())))

RuntimeError: Expected hidden[0] size (1, 70, 1150), got (1, 10, 1150)

RuntimeError in torchnlp.nn._weight_drop wrapped by torch.nn.DataParallel

Expected Behavior

I want to convert torch.nn.Linear modules into weight-drop linear modules in my (possibly large) model, and I want to train the model on multiple GPUs. However, I get a RuntimeError in my sample code. First, I have _weight_drop(), which drops part of the weights of a torch.nn.Linear (see the code below).

Actual Behavior

RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/generic/THCTensorMathBlas.cu:255

Steps to Reproduce the Problem

  1. Run this code in Python 3.7 and PyTorch 1.1.0 with 2 GPUs

     import torch
     from torchnlp.nn import _weight_drop
     from torch.utils.data import Dataset, DataLoader
    
     input_size = 5
     hidden_size = 5
     output_size =2
     batch_size = 30
     data_size = 100
    
     device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    
     class RandomDataset(Dataset):
    
         def __init__(self, size, length):
             self.len = length
             self.data = torch.randn(length, size)
    
         def __getitem__(self, index):
             return self.data[index]
    
         def __len__(self):
             return self.len
    
     rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                              batch_size=batch_size, shuffle=True)
    
     class Model(torch.nn.Module):
         # Our model
         def __init__(self, D_in, H, D_out):
             super(Model, self).__init__()
             self.linear1 = torch.nn.Linear(D_in, H)
             self.linear2 = torch.nn.Linear(H, D_out)
    
         def forward(self, input):
             h_relu = self.linear1(input).clamp(min=0)
             output = self.linear2(h_relu)
             print("\tIn Model: input size", input.size(),
                   "output size", output.size(), torch.cuda.current_device())
             return output
    
     model = Model(input_size, hidden_size, output_size)
     linear_module_list = [v for v in model.named_modules() if isinstance(v[1], torch.nn.Linear)]
     for name, module in linear_module_list:
         _weight_drop(module, ['weight'], dropout=0.5)
     model.to(device)
     model = torch.nn.DataParallel(model)
    
     data = list(rand_loader)[0]
     input = data.to(device)
     output = model(input)
    
  2. However, "output = model(input)" is not computed in this code, and this error message is raised:

     In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2]) 0
     ---------------------------------------------------------------------------
     RuntimeError                              Traceback (most recent call last)
     <ipython-input-3-4a83ee11bad2> in <module>
           2 while(True):
           3     input = data.to(device)
     ----> 4     output = model(input)
           5     print("Outside: input size", input.size(),
           6           "output_size", output.size())
    
     ~/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
         491             result = self._slow_forward(*input, **kwargs)
         492         else:
     --> 493             result = self.forward(*input, **kwargs)
         494         for hook in self._forward_hooks.values():
         495             hook_result = hook(self, input, result)
    
     ~/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
         150             return self.module(*inputs[0], **kwargs[0])
         151         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
     --> 152         outputs = self.parallel_apply(replicas, inputs, kwargs)
         153         return self.gather(outputs, self.output_device)
         154 
    
     ~/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
         160 
         161     def parallel_apply(self, replicas, inputs, kwargs):
     --> 162         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
         163 
         164     def gather(self, outputs, output_device):
    
     ~/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
          81         output = results[i]
          82         if isinstance(output, Exception):
     ---> 83             raise output
          84         outputs.append(output)
          85     return outputs
    
     ~/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in _worker(i, module, input, kwargs, device)
          57                 if not isinstance(input, (list, tuple)):
          58                     input = (input,)
     ---> 59                 output = module(*input, **kwargs)
          60             with lock:
          61                 results[i] = output
    
     ~/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
         491             result = self._slow_forward(*input, **kwargs)
         492         else:
     --> 493             result = self.forward(*input, **kwargs)
         494         for hook in self._forward_hooks.values():
         495             hook_result = hook(self, input, result)
    
     <ipython-input-2-ac863085502c> in forward(self, input)
          33 
          34     def forward(self, input):
     ---> 35         h_relu = self.linear1(input).clamp(min=0)
          36         output = self.linear2(h_relu)
          37         print("\tIn Model: input size", input.size(),
    
     ~/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
         491             result = self._slow_forward(*input, **kwargs)
         492         else:
     --> 493             result = self.forward(*input, **kwargs)
         494         for hook in self._forward_hooks.values():
         495             hook_result = hook(self, input, result)
    
     <ipython-input-1-66fbe470597c> in forward(*args, **kwargs)
          18             #w.to(device)
          19             setattr(module, name_w, w)
     ---> 20         return original_module_forward(*args)
          21 
          22     def extra_repr(*args):
    
     ~/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/linear.py in forward(self, input)
          90     @weak_script_method
          91     def forward(self, input):
     ---> 92         return F.linear(input, self.weight, self.bias)
          93 
          94     def extra_repr(self):
    
     ~/miniconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/functional.py in linear(input, weight, bias)
        1404     if input.dim() == 2 and bias is not None:
        1405         # fused op is marginally faster
     -> 1406         ret = torch.addmm(bias, input, weight.t())
        1407     else:
        1408         output = input.matmul(weight.t())
    
     RuntimeError: arguments are located on different GPUs at /opt/conda/conda-bld/pytorch_1556653114079/work/aten/src/THC/generic/THCTensorMathBlas.cu:255
    

The main reason for this error is that a linear multiplication is computed between two tensors that live on different GPUs. I tried to modify my _weight_drop() function to manually assign the current device inside the DataParallel process, but it does not work. Any idea how to fix this? The code works fine in single-GPU or CPU mode.

Add GLUE datasets

GLUE datasets are standard for evaluating NLU tasks.

In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks.

LockedDropout - Doc Misleading

x (:class:`torch.FloatTensor` [batch size, sequence length, rnn hidden size]): Input to

Hi, in the code of LockedDropout, I found it confusing.
The doc/comment says:

LockedDropout applies the same dropout mask to every time step.
x shape: [batch size, sequence length, rnn hidden size]

However, based on the implementation, I found that we are not applying the same dropout mask to every time step, but applying the same mask to every datapoint in a batch.

Looking forward to your reply. Thanks!
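For reference, a minimal sketch of what "the same mask at every time step" would mean for batch-first input, as the docstring describes it (this is my illustration of the documented behavior, not the shipped implementation):

import torch

def locked_dropout(x, p=0.5, training=True):
    # x: [batch size, sequence length, hidden size]
    if not training or p == 0:
        return x
    # Sample one mask per sequence and hidden unit, then reuse it at every time step.
    mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - p) / (1 - p)
    return x * mask  # broadcasts across the time dimension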

Support loading fasttext model from custom file

What if I want to use my own pretrained fastText model (or even the Common Crawl model instead of the standard Wikipedia one)? E.g. look at what they publish now: https://fasttext.cc/docs/en/crawl-vectors.html.
The current FastText implementation:

    url_base = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.{}.vec'
    aligned_url_base = 'https://s3.amazonaws.com/arrival/embeddings/wiki.multi.{}.vec'

    def __init__(self, language="en", aligned=False, **kwargs):
        if aligned:
            url = self.aligned_url_base.format(language)
        else:
            url = self.url_base.format(language)

doesn't allow you to do such basic stuff.
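As an interim workaround (a sketch independent of torchnlp's internals), a local .vec file can be read directly; the format is a "count dim" header followed by one token and its float values per line:

import torch

def load_vec_file(path):
    # Read a fastText .vec text file into a {token: tensor} dict.
    vectors = {}
    with open(path, encoding='utf-8') as stream:
        next(stream)  # skip the "<count> <dim>" header line
        for line in stream:
            pieces = line.rstrip().split(' ')
            vectors[pieces[0]] = torch.tensor([float(value) for value in pieces[1:]])
    return vectors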

Tempfile PermissionError after bleu demo on windows

Expected Behavior

ok

Actual Behavior

Traceback (most recent call last):
File "D:/doc/nlp/ContextTransformer/demo.py", line 7, in
bleu=get_moses_multi_bleu(hypotheses, references, lowercase=True) # RETURNS: 47.9
File "C:\ProgramData\Anaconda3\lib\site-packages\torchnlp\metrics\bleu.py", line 86, in get_moses_multi_bleu
with open(hypothesis_file.name, "r") as read_pred:
PermissionError: [Errno 13] Permission denied: 'C:\Users\XUECHA~1\AppData\Local\Temp\tmpluhikfae'

Steps to Reproduce the Problem

from torchnlp.metrics import get_moses_multi_bleu

hypotheses = ["The brown fox jumps over the dog ็ฌ‘"]
references = ["The quick brown fox jumps over the lazy dog ็ฌ‘"]

# Compute BLEU score with the official BLEU perl script
bleu=get_moses_multi_bleu(hypotheses, references, lowercase=True) # RETURNS: 47.9
print(bleu)

Consider not using lambdas as default arguments to enable pickling

Problem description

Some classes have lambdas as default arguments (e.g. here). This prevents these objects from being pickled.

Steps to Reproduce the Problem

import pickle
from torchnlp.text_encoders import StaticTokenizerEncoder

encoder = StaticTokenizerEncoder(['hi', 'you'])
pickle.dumps(encoder) 

# raises error
PicklingError: Can't pickle <function StaticTokenizerEncoder.<lambda> at 0x7fb77c042b70>: attribute lookup StaticTokenizerEncoder.<lambda> on torchnlp.text_encoders.static_tokenizer_encoder failed

Solution

Replace the lambdas with regular module-level functions.
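A minimal sketch of that fix (the class and function names below are illustrative, not the library's actual internals): the default tokenizer lives at module level, so pickle can resolve it by its qualified name.

import pickle

def _split_on_whitespace(text):
    # A module-level default: picklable, unlike a lambda in the signature.
    return text.split()

class PicklableEncoder:
    def __init__(self, sample, tokenize=_split_on_whitespace):
        self.tokenize = tokenize
        self.vocab = sorted({token for text in sample for token in tokenize(text)})

encoder = PicklableEncoder(['hi', 'you'])
pickle.dumps(encoder)  # no PicklingError, since no lambda is referenced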

Test on IMDB dataset is failing

Expected Behavior

The test assertion checks if the first row of the Test Dataset of IMDB matches the given text and sentiment.

Actual Behavior

The assertion fails, as the order of rows returned is not fixed. This is due to glob.iglob, which returns files in arbitrary order.

Steps to Reproduce the Problem

python -m pytest tests/datasets/test_imdb.py

or...

>>> from torchnlp.datasets import imdb
>>> train, test = imdb.imdb_dataset(train=True, test=True)
>>> print(test[0]['text'])

AttributeError: 'WeightDropGRU' object has no attribute '_flat_weights'

I normally call:

WeightDropGRU(input_size=num_input_features, hidden_size=hidden_size,
              num_layers=num_layers, batch_first=True, dropout=dropout_gru,
              bidirectional=bidirectional, weight_dropout=drop_weight)

On Mac everything works well, but on Linux this error happens. (My PyTorch version is 1.13.)

Traceback (most recent call last):
File "exper_special.py", line 88, in <module>
NER.to(device)
File "/home/.pyenv/versions/anaconda3-2019.07/envs/allennlp_dw2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 426, in to
return self._apply(convert)
File "/home//.pyenv/versions/anaconda3-2019.07/envs/allennlp_dw2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
module._apply(fn)
File "/home/.pyenv/versions/anaconda3-2019.07/envs/allennlp_dw2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
module._apply(fn)
File "/home/.pyenv/versions/anaconda3-2019.07/envs/allennlp_dw2/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 124, in _apply
self.flatten_parameters()
File "/home/.pyenv/versions/anaconda3-2019.07/envs/allennlp_dw2/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 104, in flatten_parameters
all_weights = self._flat_weights
File "/home/.pyenv/versions/anaconda3-2019.07/envs/allennlp_dw2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 585, in __getattr__
type(self).__name__, name))
AttributeError: 'WeightDropGRU' object has no attribute '_flat_weights'

Allow StaticTokenizerEncoder to take any iterable

Actual Behavior

Right now, parameter sample of StaticTokenizerEncoder must be a list (explicit check).

It forces the user to pre-load the whole dataset in memory, which is not desirable for very large datasets.

Expected Behavior

It would be great if StaticTokenizerEncoder (and all child classes) could take any iterable for sample (not necessarily a list).

Therefore, sample could for instance be an iterator: the encoder would go once through the whole dataset to compute token counts, which could then be saved (e.g. pickled) for later use.
And token counts are typically much smaller than the dataset itself.

Steps to Reproduce the Problem

This raises a TypeError: Sample must be a list.

from torchnlp.encoders.text import WhitespaceEncoder
iterable = (x for x in ['hello world', 'PyTorch NLP'])
encoder = WhitespaceEncoder(iterable)

Proposal

This virtually just implies removing the explicit check (if not isinstance(sample, list) at torchnlp.encoders.text.StaticTokenizerEncoder:67).
I tried, and tests pass just fine. I can make a PR with this if you think this is a good idea.
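For illustration, the memory-friendly pattern this proposal enables (a sketch, not the library's code): stream the sample once and keep only the token counts.

import collections

def count_tokens(lines):
    # `lines` can be any iterable, e.g. a generator that reads a large file lazily.
    counts = collections.Counter()
    for line in lines:
        counts.update(line.split())
    return counts

corpus = (text for text in ['hello world', 'PyTorch NLP'])
print(count_tokens(corpus))  # Counter({'hello': 1, 'world': 1, 'PyTorch': 1, 'NLP': 1})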

The SpacyEncoder tokenizes incorrectly

Expected Behavior

The SpacyEncoder should tokenize the sentence "This ain't funny." into ["This", "ai", "n't", "funny", "."].

Actual Behavior

The SpacyEncoder tokenizes the sentence "This ain't funny." into ["This", "ain't", "funny."].

Steps to Reproduce the Problem

  1. Install the latest PyTorch-NLP (0.3.5) and spaCy (2.0.11)
>>> from torchnlp.text_encoders import SpacyEncoder
>>> encoder = SpacyEncoder(["This ain't funny."])
>>> encoder.vocab
['<pad>', '<unk>', '</s>', '<s>', '<copy>', 'This', "ain't", 'funny.']

It behaves differently from the example.

Notes

I checked the source code and used Spacy to tokenize directly.

>>> import spacy
>>> from spacy.tokenizer import Tokenizer
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp(u"This ain't funny.")
>>> for token in doc:
...     print(token.text)
...
This
ai
n't
funny
.
>>> tokenizer = Tokenizer(nlp.vocab)
>>> doc = tokenizer(u"This ain't funny.")
>>> for token in doc:
...     print(token.text)
...
This
ain't
funny.

I tried running the same code on Spacy's website and got the same result. It seems the tokenizer alone can't work properly.

I'm not sure whether there's something wrong with spaCy, or whether it is being used in the wrong way. Maybe I should open an issue on spaCy's repository?
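A possible explanation (my assumption, not verified against the library's source): spacy.tokenizer.Tokenizer(nlp.vocab) builds a bare tokenizer with no prefix/suffix/infix rules or exceptions, so it only splits on whitespace, whereas the pipeline's own tokenizer applies the full rule set:

import spacy

nlp = spacy.load('en_core_web_sm')

# The pipeline's configured tokenizer handles contractions and punctuation.
print([token.text for token in nlp.tokenizer(u"This ain't funny.")])
# ['This', 'ai', "n't", 'funny', '.']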

bleu metric doesn't seem to work properly

I tried to test the get_moses_multi_bleu metric and it doesn't seem to work properly.
I ran the following:

x = ['abc']
y = ['abc']
print(get_moses_multi_bleu(x,y))

and it prints 0.0 instead of 1

Edit: I believe the reason is that the sentence is too short for 2-gram, 3-gram, etc. matches, so there should be a warning message in that case.

RuntimeError: Vector for token darang has 230 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions.

Expected Behavior

Load FastText vectors

Environment:
Ubuntu 16.04
Python 3.6.4
Pytorch 0.4.1

Actual Behavior

Throws the following error:

File "", line 1, in
File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/fast_text.py", line 83, in init
super(FastText, self).init(name, url=url, **kwargs)
File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 72, in init
self.cache(name, cache, url=url)
File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 153, in cache
word, len(entries), dim))
RuntimeError: Vector for token darang has 230 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions.

Steps to Reproduce the Problem

  1. Open python console
  2. Write the following code:
        from torchnlp.word_to_vector import FastText
        vectors = FastText()
    
    
  3. Throws the error mentioned above.

FastText memory requirements

Attempted to instantiate FastText with 13 GB of RAM free, the system runs out of memory.

File "torchnlp\word_to_vector\pretrained_word_vectors.py", line 171, in cache
    self.vectors = torch.Tensor(vectors).view(-1, dim)
MemoryError

I've seen it mentioned elsewhere that 12 GB should be enough but it looks like we might have two copies of the vectors kicking around at one point.
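A practical mitigation, assuming FastText forwards the same keyword arguments as the GloVe example in the README (I have not verified is_include for FastText specifically): filter the vectors down to your vocabulary while caching.

from torchnlp.word_to_vector import FastText

vocab_set = {'hello', 'world'}  # hypothetical vocabulary
# Keep only the vectors for tokens that are actually used, cutting peak memory.
vectors = FastText(is_include=lambda token: token in vocab_set)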

Special tokens should be properly encoded by text_encoders

Expected Behavior

encoder = MosesEncoder( ["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
print (encoder.encode("<s> hello </s>"))

--CONSOLE---
tensor([3, 5, 2])

Actual Behavior

encoder = MosesEncoder( ["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
print (encoder.encode("<s> hello </s>"))

--CONSOLE---
tensor([ 5, 6, 7, 8, 5, 14, 6, 7])

Explanation

Most of these tokenizers are not aware of these special tokens and end up splitting each special token into multiple tokens. For instance, the '<s>' token becomes '<', 's', '>'.

My solution to this problem was to create a method for masking special tokens and another one to restore them in place.

    def _mask_reserved_tokens(self, sequence):
        reserved_tokens = re.findall(r'\<pad\>|\<unk\>|\</s\>|\<s\>|\<copy\>', sequence)
        sequence = re.sub(r'\<pad\>|\<unk\>|\</s\>|\<s\>|\<copy\>', "RESERVEDTOKENMASK", sequence)
        return reserved_tokens, sequence

    def _restore_reserved_tokens(self, reserved_tokens, sequence):
        sequence = _detokenize(sequence)
        for token in reserved_tokens:
            sequence = sequence.replace('RESERVEDTOKENMASK', token, 1)
        return _tokenize(sequence)

Then the encode function becomes:

    def encode(self, sequence):
        """ Encodes a ``sequence``.
        Args:
            sequence (str): String ``sequence`` to encode.
        Returns:
            torch.Tensor: Encoding of the ``sequence``.
        """
        sequence = super().encode(sequence)
        reserved_tokens, sequence = self._mask_reserved_tokens(sequence)
        sequence = self.tokenize(sequence)
        sequence = self._restore_reserved_tokens(reserved_tokens, sequence)
        vector = [self.stoi.get(token, self.unknown_index) for token in sequence]
        if self.append_eos:
            vector.append(self.eos_index)
        return torch.tensor(vector)

I don't know if this is just a problem that I have, but if not, I believe this should be handled natively.

Http 403 when calling FastText()

Expected Behavior

Calling FastText() should successfully download data from an S3 bucket.

Actual Behavior

Making a call to FastText() is currently raising the error urllib.error.HTTPError: HTTP Error 403: Forbidden

Steps to Reproduce the Problem

  1. Run a clean Python 3.6 REPL using Docker with the command docker run -it --rm python:3.6 bash. This should work the same in Python 3.6 and 3.7.
  2. Install latest pytorch-nlp package: pip install torchvision pytorch-nlp
  3. Run this code to get the HTTPError:
from torchnlp.word_to_vector import FastText
vectors = FastText()
  1. This will produce this error:
wiki.en.vec: 0.00B [00:00, ?B/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/fast_text.py", line 83, in __init__
    super(FastText, self).__init__(name, url=url, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 71, in __init__
    self.cache(name, cache, url=url)
  File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 110, in cache
    download_file_maybe_extract(url=url, directory=cache, check_files=[name])
  File "/usr/local/lib/python3.6/site-packages/torchnlp/download.py", line 160, in download_file_maybe_extract
    urllib.request.urlretrieve(url, filename=filepath, reporthook=_reporthook(t))
  File "/usr/local/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/local/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/local/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/local/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
  1. Using the `aligned=True` option gives a 404:
>>> FastText(aligned=True)
wiki.multi.en.vec: 0.00B [00:00, ?B/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/fast_text.py", line 83, in __init__
    super(FastText, self).__init__(name, url=url, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 71, in __init__
    self.cache(name, cache, url=url)
  File "/usr/local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 110, in cache
    download_file_maybe_extract(url=url, directory=cache, check_files=[name])
  File "/usr/local/lib/python3.6/site-packages/torchnlp/download.py", line 160, in download_file_maybe_extract
    urllib.request.urlretrieve(url, filename=filepath, reporthook=_reporthook(t))
  File "/usr/local/lib/python3.6/urllib/request.py", line 248, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/local/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/local/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/local/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/local/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

The S3 URL in question comes from:

url_base = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.{}.vec'

Attempting to download this file from the AWS CLI using language='en' gives this error:

$ aws s3 cp s3://fasttext-vectors/wiki.en.vec .
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

Looking at this bucket in the AWS console gives the error "All access to this object has been disabled". Hopefully it's just a matter of adjusting the bucket permissions/policy.

How to retrain GloVe or other word-vectorization models?

How can I retrain GloVe or other word-vectorization algorithms?
I am looking for a sequence-based model for protein-protein binding prediction, which requires retraining a character-vectorization model on amino acid sequences. In this repo, I only found the pretrained API.

Apply SRU layer: 'Variable' object has no attribute 'new_zeros'

Unable to recreate the SRU example code provided in the README. Getting an AttributeError: 'Variable' object has no attribute 'new_zeros'. Running PyTorch 0.3.0.

Expected Behavior

RETURNS: (
output [torch.FloatTensor (6x3x20)],
hidden_state [torch.FloatTensor (2x3x20)]
)

Actual Behavior

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 sru(input_)

/anaconda3/envs/nlp/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    355             result = self._slow_forward(*input, **kwargs)
    356         else:
--> 357             result = self.forward(*input, **kwargs)
    358         for hook in self._forward_hooks.values():
    359             hook_result = hook(self, input, result)

/anaconda3/envs/nlp/lib/python3.6/site-packages/torchnlp/nn/sru.py in forward(self, input_, c0)
    509         dir_ = 2 if self.bidirectional else 1
    510         if c0 is None:
--> 511             zeros = input_.new_zeros(input_.size(1), self.hidden_size * dir_)
    512             c0 = [zeros for i in range(self.num_layers)]
    513         else:

/anaconda3/envs/nlp/lib/python3.6/site-packages/torch/autograd/variable.py in __getattr__(self, name)
     65         if name in self._fallthrough_methods:
     66             return getattr(self.data, name)
---> 67         return object.__getattribute__(self, name)
     68
     69     def __getitem__(self, key):

AttributeError: 'Variable' object has no attribute 'new_zeros'

Steps to Reproduce the Problem

from torchnlp.nn import SRU
import torch

input_ = torch.autograd.Variable(torch.randn(6, 3, 10))
sru = SRU(10, 20)

sru(input_)
