practical_nlp_in_pytorch's People

Contributors

keitakurita


practical_nlp_in_pytorch's Issues

Memory allocation for h_t and c_t

I think it would be a tiny bit faster if you allocated the full memory for h_t and c_t from the beginning in OptimizedLSTM, i.e.:

    if init_states is None:
        h_t, c_t = (torch.zeros(bs, self.hidden_size).to(x.device),
                    torch.zeros(bs, self.hidden_size).to(x.device))

instead of:

    if init_states is None:
        h_t, c_t = (torch.zeros(self.hidden_size).to(x.device),
                    torch.zeros(self.hidden_size).to(x.device))
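
For reference, a minimal numerical sketch with made-up sizes showing that the two initializations produce the same values; the suggestion is only about allocating the batch dimension up front instead of relying on broadcasting inside the time-step loop:

    import torch

    bs, hidden_size = 4, 8                   # made-up sizes
    gates = torch.randn(bs, hidden_size)     # stand-in for a per-batch gate computation

    h_row  = torch.zeros(hidden_size)        # current version: broadcasts up to (bs, hidden_size)
    h_full = torch.zeros(bs, hidden_size)    # proposed version: already (bs, hidden_size)

    assert torch.equal(gates + h_row, gates + h_full)   # same result either way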

Question: calculation of memory length for validation in TransformerXL

Hi, very helpful repo, learned a lot from it.

I got a question about an implementation detail in TransformerXL.

In the transformer_xl_from_scratch notebook, the memory length during validation is calculated as val_memory_length + train_bptt - val_bptt.

Why isn't it just set to val_memory_length?
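
For reference, a quick numeric sketch of what the formula does, under the (hedged) reading that the goal is to keep the attention context, memory length plus BPTT length, the same at validation time as during training; the numbers are made up:

    # Made-up lengths, just to see what the formula does.
    train_bptt, val_bptt = 64, 32
    val_memory_length = 128

    val_mem = val_memory_length + train_bptt - val_bptt   # 160
    # The total window (memory + current segment) is unchanged by the shorter val BPTT:
    assert val_mem + val_bptt == val_memory_length + train_bptt   # 192 == 192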

Looking forward to your reply.

Not enough elements in Tuple for the keyword argument in NaiveLSTM

In /deep_dives/lstm_from_scratch.ipynb,

class NaiveLSTM(nn.Module): 
    ....
    def forward(self, x: torch.Tensor, 
                init_states: Optional[Tuple[torch.Tensor]]=None
               ) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        """Assumes x is of shape (batch, sequence, feature)"""
        ...

Shouldn't init_states be:

    init_states: Optional[Tuple[torch.Tensor, torch.Tensor]] = None

instead of just:

    init_states: Optional[Tuple[torch.Tensor]] = None

init_states needs exactly two values to unpack into h_t and c_t in:

....
else:
    h_t, c_t = init_states 
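
For reference, a sketch of the signature with the two-element tuple hint suggested above (body elided as in the notebook):

    from typing import Optional, Tuple

    import torch
    from torch import nn

    class NaiveLSTM(nn.Module):
        # ...
        def forward(self, x: torch.Tensor,
                    init_states: Optional[Tuple[torch.Tensor, torch.Tensor]] = None
                   ) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
            """Assumes x is of shape (batch, sequence, feature)."""
            ...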

Transformer - decoder block does not use encoder output for keys and values in attention mechanism

From the transformer implementation here:

class DecoderBlock(nn.Module):
    level = TensorLoggingLevels.enc_dec_block
    def __init__(self, d_model=512, d_feature=64,
                 d_ff=2048, n_heads=8, dropout=0.1):
        super().__init__()
        self.masked_attn_head = MultiHeadAttention(d_model, d_feature, n_heads, dropout)
        self.attn_head = MultiHeadAttention(d_model, d_feature, n_heads, dropout)
        self.position_wise_feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

        self.layer_norm1 = LayerNorm(d_model)
        self.layer_norm2 = LayerNorm(d_model)
        self.layer_norm3 = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_out, 
                src_mask=None, tgt_mask=None):
        # Apply attention to inputs
        att = self.masked_attn_head(x, x, x, mask=src_mask)
        x = x + self.dropout(self.layer_norm1(att))
        # Apply attention to the encoder outputs and outputs of the previous layer
        att = self.attn_head(queries=att, keys=x, values=x, mask=tgt_mask)
        x = x + self.dropout(self.layer_norm2(att))
        # Apply position-wise feedforward network
        pos = self.position_wise_feed_forward(x)
        x = x + self.dropout(self.layer_norm2(pos))
        return x

In the forward method, shouldn't

    att = self.attn_head(queries=att, keys=x, values=x, mask=tgt_mask)

be

    att = self.attn_head(queries=att, keys=enc_out, values=enc_out, mask=tgt_mask)

so that the second attention layer actually attends over the encoder outputs?
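
For comparison, here is the forward method with only the proposed change applied (everything else kept exactly as in the quoted block):

    def forward(self, x, enc_out,
                src_mask=None, tgt_mask=None):
        # Self-attention over the decoder inputs
        att = self.masked_attn_head(x, x, x, mask=src_mask)
        x = x + self.dropout(self.layer_norm1(att))
        # Attend over the encoder outputs, as proposed in this issue
        att = self.attn_head(queries=att, keys=enc_out, values=enc_out, mask=tgt_mask)
        x = x + self.dropout(self.layer_norm2(att))
        # Apply position-wise feedforward network
        pos = self.position_wise_feed_forward(x)
        x = x + self.dropout(self.layer_norm2(pos))
        return x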

config

    class Config(dict):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            for k, v in kwargs.items():
                setattr(self, k, v)

        def set(self, key, val):
            self[key] = val
            setattr(self, key, val)

    config = Config(
        testing=False,
        bert_model_name="bert-base-uncased",
        max_lr=3e-5,
        epochs=1,
        use_fp16=False,
        bs=4,
        discriminative=False,
        max_seq_len=128,
    )

What is the importance of the set method here? Could you please give an example?
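
For what it's worth, a small sketch of what set does, based only on the Config class quoted above: it updates both the dict entry and the attribute, so both access styles stay in sync.

    # Using the Config class quoted above.
    cfg = Config(bs=4, max_seq_len=128)

    cfg.set("bs", 16)
    print(cfg["bs"], cfg.bs)      # 16 16  -- both views updated

    # By contrast, touching only one side leaves the other stale:
    cfg.epochs = 2                # attribute only; cfg["epochs"] raises KeyError
    cfg["max_lr"] = 3e-5          # dict entry only; cfg.max_lr raises AttributeError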

reader.read produced a KeyError

Hey Kei

I found your excellent tutorial when I was searching for ELMO.

I installed the latest AllenNLP (0.8.4) and downloaded your code. When I ran the cell that contains:

    train_ds, test_ds = (reader.read(DATA_ROOT / fname) for fname in ["train.csv", "test.csv"])

it produced the following error message:

    KeyError: "None of [['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']] are in the [index]"

The train.csv is from a download of the Jigsaw dataset from when I participated in the toxic comment competition. Its header looks like:

    "id","comment_text","toxic","severe_toxic","obscene","threat","insult","identity_hate"

Have you had a chance to run the code w/ the latest AllenNLP? If not, which version were you using? Just being lazy and hoping for a quick pointer before I dive in...
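
Not from the notebook, but a quick hypothetical sanity check: load the file with pandas and confirm that the label columns the reader indexes with actually match the CSV header.

    import pandas as pd

    # DATA_ROOT is the same path passed to reader.read above.
    df = pd.read_csv(DATA_ROOT / "train.csv")
    print(df.columns.tolist())

    label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
    print([c for c in label_cols if c not in df.columns])   # should print []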

Thx,
SH

var vs tensor error in nb

;-)

    attn(q, k, v)

    RuntimeError                              Traceback (most recent call last)
    in <module>
    ----> 1 attn(q, k, v)

    /opt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
        323         for hook in self._forward_pre_hooks.values():
        324             hook(self, input)
    --> 325         result = self.forward(*input, **kwargs)
        326         for hook in self._forward_hooks.values():
        327             hook_result = hook(self, input, result)

    in forward(self, q, k, v, mask)
         28         attn = attn / attn.sum(dim=-1, keepdim=True)
         29         attn = self.dropout(attn)
    ---> 30         output = torch.bmm(attn, v)  # (Batch, Seq, Feature)
         31         log_size(output, "attention output size")  # (Batch, Seq, Seq)
         32         return output

    RuntimeError: bmm(): argument 'mat2' (position 1) must be Variable, not torch.FloatTensor

    attn_head = AttentionHead(20, 20)
    attn_head(q, k, v)

    TypeError: torch.mm received an invalid combination of arguments - got (torch.FloatTensor, Variable), but expected one of:
     * (torch.FloatTensor source, torch.FloatTensor mat2)
          didn't match because some of the arguments have invalid types: (torch.FloatTensor, !Variable!)
     * (torch.SparseFloatTensor source, torch.FloatTensor mat2)
          didn't match because some of the arguments have invalid types: (!torch.FloatTensor!, !Variable!)
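
Both errors complain about mixing the pre-0.4 Variable type with a plain FloatTensor, so a hedged guess is that q, k and v were built inconsistently on an old PyTorch. A minimal sketch with made-up batch/sequence sizes (the feature size 20 matches AttentionHead(20, 20) above) that builds all three the same way; on PyTorch >= 0.4, Variable and Tensor are merged and this mixing cannot happen:

    import torch

    batch, seq_len, d_feature = 2, 5, 20      # made-up batch/seq sizes

    # Build all three inputs the same way so bmm/mm see matching types.
    q = torch.randn(batch, seq_len, d_feature)
    k = torch.randn(batch, seq_len, d_feature)
    v = torch.randn(batch, seq_len, d_feature)

    attn(q, k, v)          # attn / attn_head are the modules defined in the notebook
    attn_head(q, k, v)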

SciBert embedding

Hi!
If I want to use SciBERT embeddings in my model, is it enough to just replace this code:

    from allennlp.data.token_indexers import PretrainedBertIndexer

    token_indexer = PretrainedBertIndexer(
        pretrained_model="bert-base-uncased",
        max_pieces=config.max_seq_len,
        do_lowercase=True,
    )

    def tokenizer(s: str):
        return token_indexer.wordpiece_tokenizer(s)[:config.max_seq_len - 2]

with this code:

    from allennlp.data.token_indexers import PretrainedBertIndexer

    token_indexer = PretrainedBertIndexer(
        pretrained_model="scibert-scivocab-uncased",
        max_pieces=config.max_seq_len,
        do_lowercase=True,
    )

    def tokenizer(s: str):
        return token_indexer.wordpiece_tokenizer(s)[:config.max_seq_len - 2]
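
A hedged sketch of an alternative, assuming the AllenNLP 0.8-era API where pretrained_model also accepts a local path: "scibert-scivocab-uncased" is not one of the names the stock BERT loader knows about, so pointing at a downloaded SciBERT vocab file may be needed (the path below is hypothetical):

    from allennlp.data.token_indexers import PretrainedBertIndexer

    # Hypothetical local path to a downloaded SciBERT vocabulary.
    token_indexer = PretrainedBertIndexer(
        pretrained_model="/path/to/scibert_scivocab_uncased/vocab.txt",
        max_pieces=config.max_seq_len,
        do_lowercase=True,
    )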

ImportError: cannot import name 'PretrainedBertIndexer'

When trying to understand bert_text_classification.ipynb, this part of the notebook:

    from allennlp.data.token_indexers import PretrainedBertIndexer

    token_indexer = PretrainedBertIndexer(
        pretrained_model="bert-base-uncased",
        max_pieces=config.max_seq_len,
        do_lowercase=True,
    )

    # apparently we need to truncate the sequence here, which is a stupid design decision
    def tokenizer(s: str):
        return token_indexer.wordpiece_tokenizer(s)[:config.max_seq_len - 2]

gives this error:

    ImportError                               Traceback (most recent call last)
    in ()
    ----> 1 from allennlp.data.token_indexers import PretrainedBertIndexer
          2
          3 token_indexer = PretrainedBertIndexer(
          4     pretrained_model="bert-base-uncased",
          5     max_pieces=config.max_seq_len,

    ImportError: cannot import name 'PretrainedBertIndexer'

    NOTE: If your import is failing due to a missing package, you can
    manually install dependencies using either !pip or !apt.

    To view examples of installing some common dependencies, click the
    "Open Examples" button below.

The allennlp version used is 1.0.0.

It seems the versions differ, and I could not find a solution. What should I do?
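
PretrainedBertIndexer was removed from AllenNLP around the 1.0 release, so one option, assuming the notebook targets the pre-1.0 API, is to pin an older release (the exact version the notebook was written against is a guess here):

    # In a notebook / Colab cell; restart the runtime after installing.
    !pip install "allennlp==0.9.0"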

Size of the tensors

Hi, thank you for your tutorial. But I can't figure out why some weight tensors have shape (input_size, hidden_size) and others have shape (hidden_size, hidden_size).
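
A minimal shape sketch, assuming the convention in the from-scratch LSTM notebook where the input path and the recurrent path are separate matrix multiplies (names here are illustrative, not the notebook's):

    import torch

    input_size, hidden_size, bs = 10, 32, 4

    x_t = torch.randn(bs, input_size)            # current input
    h_t = torch.randn(bs, hidden_size)           # previous hidden state

    W_i = torch.randn(input_size, hidden_size)   # multiplies the input
    U_i = torch.randn(hidden_size, hidden_size)  # multiplies the hidden state

    out = x_t @ W_i + h_t @ U_i                  # both terms are (bs, hidden_size)
    print(out.shape)                             # torch.Size([4, 32])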

Attribute error while loading databunch

Hi,

I have created a custom databunch, which I am trying to load using load_data, but I am getting an attribute error:

    File "/home/views.py", line 641, in get
      path, r"/home/data_save.pkl")
    File "/usr/local/lib/python3.7/site-packages/fastai/basic_data.py", line 281, in load_data
      ll = torch.load(source, map_location='cpu') if defaults.device == torch.device('cpu') else torch.load(source)
    File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
      return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
    File "/usr/local/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
      result = unpickler.load()
    AttributeError: Can't get attribute 'FastAiBertTokenizer' on <module '__main__' from 'manage.py'>

The FastAiBertTokenizer class has been defined in the program, but I am still getting the error.

Maybe I have to define this class or import it in the context where I'm loading the databunch, but I don't know how.

This is the code:

    path = Path()
    data = load_data(path, r"data_save.pkl")
    bert_model = CustomBertModel()
    learn = Learner(data, bert_model, metrics=[accuracy])
    st2 = torch.load(r"final_model_base.pth", map_location=torch.device('cpu'))
    learn.model.state_dict(st2)
Can you help me with this?
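
A hedged sketch of one workaround: pickle looks the class up under the module path recorded at save time (here __main__), so making FastAiBertTokenizer visible under that name before calling load_data is one thing to try; the import path below is hypothetical.

    import __main__
    from myproject.tokenizers import FastAiBertTokenizer   # hypothetical module path

    # Re-expose the class where the pickle expects to find it.
    __main__.FastAiBertTokenizer = FastAiBertTokenizer

    data = load_data(path, r"data_save.pkl")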

License

You do not have a license file.

Any restrictions on reusing / modifying your code and/or blog text?

Thanks

transformer_xl encoder

Thanks for this great tutorial on transformers. In the transformer_xl model I don't see any Encoder class, and the decoder is fed directly with the inputs. Could you clarify which part of the transformer_xl code is responsible for the encoder?
