
nlp-tutorials's Introduction

Natural Language Processing Tutorial

Tutorial in Chinese can be found in mofanpy.com.

This repo includes many simple implementations of models in Natural Language Processing (NLP).

All code implementations in this tutorial are organized as follows:

  1. Search Engine
  2. Understand Word (W2V)
  3. Understand Sentence (Seq2Seq)
  4. All about Attention
  5. Pretrained Models

Thanks to @W1Fl for contributing simplified Keras code in simple_realize, and to @ruifanxu for a PyTorch version of this NLP tutorial.

Installation

$ git clone https://github.com/MorvanZhou/NLP-Tutorials
$ cd NLP-Tutorials/
$ sudo pip3 install -r requirements.txt

TF-IDF

TF-IDF numpy code

TF-IDF short sklearn code
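
For orientation, both versions compute tf-idf(t, d) = tf(t, d) * log(N / df(t)). A minimal numpy sketch of that formula (my own illustration, not the repo's exact code):

    import numpy as np

    docs = [["i", "like", "cats"], ["i", "like", "dogs"], ["dogs", "bark"]]
    vocab = sorted({w for d in docs for w in d})
    # term frequency: count of each word per document   [n_doc, n_vocab]
    tf = np.array([[d.count(w) for w in vocab] for d in docs], dtype=float)
    # inverse document frequency: rarer words score higher
    df = np.count_nonzero(tf, axis=0)          # docs containing each word
    idf = np.log(len(docs) / df)               # [n_vocab]
    tfidf = tf * idf                           # [n_doc, n_vocab]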


Word2Vec

Efficient Estimation of Word Representations in Vector Space

Skip-Gram code

CBOW code
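
To make the two objectives concrete: skip-gram predicts context words from a center word, CBOW predicts the center word from its context. A tiny sketch of skip-gram pair generation (illustrative only; window size of 2 is an assumption):

    def skip_gram_pairs(tokens, window=2):
        # yield (center, context) training pairs within the window
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    yield center, tokens[j]

    print(list(skip_gram_pairs(["the", "cat", "sat"])))
    # [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'),
    #  ('sat', 'the'), ('sat', 'cat')]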


Seq2Seq

Sequence to Sequence Learning with Neural Networks

Seq2Seq code


CNNLanguageModel

Convolutional Neural Networks for Sentence Classification

CNN language model code


Seq2SeqAttention

Effective Approaches to Attention-based Neural Machine Translation

Seq2Seq Attention code


Transformer

Attention Is All You Need

Transformer code


ELMO

Deep contextualized word representations

ELMO code


GPT

Improving Language Understanding by Generative Pre-Training

GPT code


BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT code

My new attempt: BERT with window mask


nlp-tutorials's People

Contributors

dependabot[bot], morvanzhou, ruifan831, ruifanxu, taylor-swift1, w1fl


nlp-tutorials's Issues

About unsatisfactory CBOW results under PyTorch

Hello author, I am learning NLP through your code. When I run the CBOW code in the pytorch folder, the results are unsatisfactory and differ considerably from the TensorFlow version. What could be the cause? I used the TensorFlow version from requirements.txt, and the same-version CPU build of PyTorch.

About padding of sentences shorter than the maximum length

Hello, @ruifanxu.
I would like to ask you about the padding of sentences shorter than the maximum length.

    def mask(self, seqs):
        device = next(self.parameters()).device
        batch_size, seq_len = seqs.shape
        # upper-triangular look-ahead mask                  [seq_len, seq_len]
        mask = torch.triu(torch.ones((seq_len, seq_len), dtype=torch.long), diagonal=1).to(device)
        pad = torch.eq(seqs, self.padding_idx)              # [n, seq_len]
        # padded positions are masked at every step         [n, 1, seq_len, seq_len]
        mask = torch.where(pad[:, None, None, :], 1, mask[None, None, :, :]).to(device)
        return mask > 0                                     # [n, 1, seq_len, seq_len]

Suppose the longest sentence is 200 tokens.
My idea is that sentences shorter than 200 words should be zero-padded so that they do not affect the attention computation or the generation of the target sequence.
So if a sentence has only three words, the remaining 197 steps should be 0 (zero padding); otherwise the later positions of context ([n, step, model_dim]) will carry model_dim values at steps that hold no real tokens.
However, the code above does not seem to apply any zero-filling, only the future (look-ahead) mask. Am I misunderstanding something? Thank you.

Looking forward to your reply! Thank you.
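
For context, the pad mask above already removes padded positions inside attention, so no explicit zero-filling of the activations is needed. A minimal sketch of how such a boolean mask is typically consumed (hypothetical shapes, not the repo's exact code):

    import torch
    import torch.nn.functional as F

    n, heads, seq_len, dk = 2, 1, 5, 8
    q = k = v = torch.randn(n, heads, seq_len, dk)
    mask = torch.zeros(n, 1, seq_len, seq_len, dtype=torch.bool)
    mask[0, 0, :, 3:] = True  # pretend the last two tokens of sample 0 are padding

    scores = q @ k.transpose(-2, -1) / dk ** 0.5        # [n, heads, seq_len, seq_len]
    scores = scores.masked_fill(mask, float("-inf"))    # masked keys get -inf
    context = F.softmax(scores, dim=-1) @ v             # padding gets zero attention weight

Masked score positions receive -inf, so after the softmax the padded keys have zero weight and contribute nothing to the context vector.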

About the GPT program

Hello, I'd like to ask two questions:
  1. The model input seqs[:, :-1] is the sentence so far, and seqs[:, -1] is the ground-truth value to predict. But if a sentence is shorter than the longest one, it is padded up to that length, so wouldn't seqs[:, -1] always pick up a padding value?
  2. If each batch used the length of its own longest sentence, rather than the longest sentence in the whole corpus, as its step (time) length, would that solve the problem above? (A sketch of the idea follows below.)
Hope to hear from you! Thank you.
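
Point 2 describes dynamic per-batch padding. A minimal sketch of the idea (hypothetical helper, not the repo's code):

    import numpy as np

    def pad_batch(batch, pad_id=0):
        # pad every sequence to this batch's own max length,
        # not the corpus-wide max, so trailing steps are mostly real tokens
        max_len = max(len(s) for s in batch)
        return np.array([s + [pad_id] * (max_len - len(s)) for s in batch])

    print(pad_batch([[5, 3], [7, 1, 2, 9]]))
    # [[5 3 0 0]
    #  [7 1 2 9]]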

The attention step in the transformer may leak answers

Hello,
I noticed this piece of code in the decoder block:

    attn = self.mh[0].call(yz, yz, yz, look_ahead_mask, training)  # decoder self attention
    o1 = self.bn[0](attn + yz, training)
    attn = self.mh[1].call(o1, xz, xz, pad_mask, training)         # decoder + encoder attention
    o2 = self.bn[1](attn + o1, training)
    ffn = self.drop(self.ffn.call(o2), training)
    o = self.bn[2](ffn + o2, training)
    return o

Here, o1 = self.bn[0](attn + yz, training) may make training and inference behave inconsistently (the inference output depends on the labels). For example, in my experiment, changing the second position of the label (the position right after the GO token) changed the second position of the inference result, which produced a large gap between training accuracy and test accuracy.
Should a mask be added to yz at this step, or should the operation be removed, to avoid this effect?

show_tfidf in visual.py has a problem

    def show_tfidf(tfidf, vocb, filename):
        # [n_vocab, n_doc]
        plt.imshow(tfidf, cmap="YlGn", vmin=tfidf.min(), vmax=tfidf.max())
        plt.xticks(np.arange(tfidf.shape[1]+1), vocb, fontsize=6, rotation=90)
        plt.yticks(np.arange(tfidf.shape[0]), np.arange(1, tfidf.shape[0] + 1), fontsize=6)
        plt.tight_layout()
        plt.savefig("./visual/results/%s.png" % filename, format="png", dpi=500)
        plt.show()

The original plt.xticks and plt.yticks have a small problem; after a modification the problem goes away.
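
The thread does not show the modification, but the likely culprit is the off-by-one in plt.xticks: np.arange(tfidf.shape[1] + 1) produces one more tick position than there are labels in vocb, and Matplotlib rejects mismatched lengths. A hedged guess at the fix:

    # one tick position per vocabulary entry, so positions and labels line up
    plt.xticks(np.arange(tfidf.shape[1]), vocb, fontsize=6, rotation=90)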

Some transformer components are inconsistent with the paper

Hello,

  1. In the paper, q, k, and v come from matrix multiplications; in the code they are fully connected (Dense) layers.
  2. The paper uses LayerNormalization; the code uses BatchNormalization.

Does this affect the results?
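
Two notes that may help frame this. A Dense layer with no bias is exactly the matrix multiplication the paper describes, so point 1 is mostly notational. Point 2 is a real substitution; the paper's Add & Norm step looks like the sketch below (a Keras illustration of the paper's choice, not the repo's code):

    import tensorflow as tf

    dim = 32
    attn_out = tf.random.normal((2, 10, dim))   # [n, step, dim]
    residual = tf.random.normal((2, 10, dim))
    # LayerNorm normalizes each position over its feature axis only,
    # so no statistics are shared across samples or time steps
    ln = tf.keras.layers.LayerNormalization()
    o1 = ln(attn_out + residual)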

Error in the show_tfidf function in visual.py

Running the line

    plt.savefig("./visual/results/%s.png" % filename, format="png", dpi=500)

raises "No such file or directory".

Solution: replace the original code with the following:

    import os

    # create the output folder if it does not exist
    output_folder = './visual/'
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    plt.savefig(os.path.join(output_folder, '%s.png' % filename), format="png", dpi=500)

I wanted to open a pull request but found I don't know how, so I'm posting this issue instead. I just started learning NLP; many thanks to the author for the open-source tutorial, and I hope it keeps getting better.


Error when running skip-gram

    File "D:/Desktop/Information-Extraction-Chinese-master/skip-gram.py", line 79
        _loss: tf.Tensor = self.loss(x, y, True)
             ^
    SyntaxError: invalid syntax

    Process finished with exit code 1

Please help; many thanks.
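
A likely cause (my guess; the thread does not confirm it): annotated assignments such as _loss: tf.Tensor = ... are a Python 3.6+ feature (PEP 526), so an older interpreter reports them as invalid syntax. A hedged workaround is to drop the annotation:

    # variable annotations need Python >= 3.6; without one this is a plain assignment
    _loss = self.loss(x, y, True)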

BERT problem

Thanks for the tutorials.

Your implementation of BERT is different from the original paper.
You use all tokens for NSP (next sentence prediction) instead of only the "[CLS]" token.

https://github.com/MorvanZhou/NLP-Tutorials/blob/master/GPT.py

    def call(self, seqs, segs, training=False):
        embed = self.input_emb(seqs, segs)  # [n, step, dim]
        z = self.encoder(embed, training=training, mask=self.mask(seqs))  # [n, step, dim]
        mlm_logits = self.task_mlm(z)  # [n, step, n_vocab]
        nsp_logits = self.task_nsp(tf.reshape(z, [z.shape[0], -1]))  # [n, n_cls]
        return mlm_logits, nsp_logits

It seems to me that this should be changed as below

    def call(self, seqs, segs, training=False):
        embed = self.input_emb(seqs, segs)  # [n, step, dim]
        z = self.encoder(embed, training=training, mask=self.mask(seqs))  # [n, step, dim]
        mlm_logits = self.task_mlm(z)  # [n, step, n_vocab]
        # use only the [CLS] position (step 0) for next sentence prediction
        nsp_logits = self.task_nsp(tf.reshape(z[:, 0, :], [z.shape[0], -1]))  # [n, n_cls]
        return mlm_logits, nsp_logits

There are some reimplementations of the tutorials in PyTorch.

Hope they help people who are not comfortable with TensorFlow.

pytorch seq2seq error

Line 81 of the PyTorch version of seq2seq raises:
expected scalar type Long but found Int
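
A common cause (an assumption on my part; the issue gives no traceback): index tensors fed to embedding layers or classification losses must be int64 (Long), but were created as int32. Casting usually fixes it:

    import torch

    idx = torch.tensor([1, 2, 3], dtype=torch.int32)
    idx = idx.long()  # cast int32 -> int64, as embedding / loss targets expect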
