
nlp-tutorials's Introduction

Natural Language Processing Tutorial

Tutorial in Chinese can be found in mofanpy.com.

This repo includes many simple implementations of models in Natural Language Processing (NLP).

All code implementations in this tutorial are organized as follows:

  1. Search Engine
  2. Understand Word (W2V)
  3. Understand Sentence (Seq2Seq)
  4. All about Attention
  5. Pretrained Models

Thanks to @W1Fl for contributing simplified Keras code in simple_realize, and to @ruifanxu for a PyTorch version of this NLP tutorial.

Installation

$ git clone https://github.com/MorvanZhou/NLP-Tutorials
$ cd NLP-Tutorials/
$ sudo pip3 install -r requirements.txt

TF-IDF

TF-IDF numpy code

TF-IDF short sklearn code
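
For orientation, both versions compute tf-idf(t, d) = tf(t, d) * log(N / df(t)). A minimal numpy sketch of that formula (my own illustration, not the repo's exact code):

    import numpy as np

    docs = [["i", "like", "cats"], ["i", "like", "dogs"], ["dogs", "bark"]]
    vocab = sorted({w for d in docs for w in d})
    # term frequency: count of each word per document   [n_doc, n_vocab]
    tf = np.array([[d.count(w) for w in vocab] for d in docs], dtype=float)
    # inverse document frequency: rarer words score higher
    df = np.count_nonzero(tf, axis=0)          # docs containing each word
    idf = np.log(len(docs) / df)               # [n_vocab]
    tfidf = tf * idf                           # [n_doc, n_vocab]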


Word2Vec

Efficient Estimation of Word Representations in Vector Space

Skip-Gram code

CBOW code
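
To make the two objectives concrete: skip-gram predicts context words from a center word, CBOW predicts the center word from its context. A tiny sketch of skip-gram pair generation (illustrative only; window size of 2 is an assumption):

    def skip_gram_pairs(tokens, window=2):
        # yield (center, context) training pairs within the window
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    yield center, tokens[j]

    print(list(skip_gram_pairs(["the", "cat", "sat"])))
    # [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'),
    #  ('sat', 'the'), ('sat', 'cat')]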


Seq2Seq

Sequence to Sequence Learning with Neural Networks

Seq2Seq code


CNNLanguageModel

Convolutional Neural Networks for Sentence Classification

CNN language model code


Seq2SeqAttention

Effective Approaches to Attention-based Neural Machine Translation

Seq2Seq Attention code


Transformer

Attention Is All You Need

Transformer code


ELMO

Deep contextualized word representations

ELMO code


GPT

Improving Language Understanding by Generative Pre-Training

GPT code


BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT code

My new attempt: BERT with window mask


nlp-tutorials's People

Contributors

dependabot[bot], morvanzhou, ruifan831, ruifanxu, taylor-swift1, w1fl


nlp-tutorials's Issues

About unsatisfactory CBOW results under PyTorch

Hello author, I am learning NLP through your code. When I run the CBOW code in the pytorch folder, the results are unsatisfactory and differ considerably from the TensorFlow version. What could be the cause? I used the TensorFlow version from requirements.txt, and the same-version CPU build of PyTorch.

About padding of sentences shorter than the maximum length

Hello, @ruifanxu.
I would like to ask you about the padding of sentences shorter than the maximum length.

    def mask(self, seqs):
        device = next(self.parameters()).device
        batch_size, seq_len = seqs.shape
        # upper-triangular look-ahead mask                  [seq_len, seq_len]
        mask = torch.triu(torch.ones((seq_len, seq_len), dtype=torch.long), diagonal=1).to(device)
        pad = torch.eq(seqs, self.padding_idx)              # [n, seq_len]
        # padded positions are masked at every step         [n, 1, seq_len, seq_len]
        mask = torch.where(pad[:, None, None, :], 1, mask[None, None, :, :]).to(device)
        return mask > 0                                     # [n, 1, seq_len, seq_len]

Suppose the longest sentence is 200 tokens.
My idea is that sentences shorter than 200 words should be zero-padded so that they do not affect the attention computation or the generation of the target sequence.
So if a sentence has only three words, the remaining 197 steps should be 0 (zero padding); otherwise the later positions of context ([n, step, model_dim]) will carry model_dim values at steps that hold no real tokens.
However, the code above does not seem to apply any zero-filling, only the future (look-ahead) mask. Am I misunderstanding something? Thank you.

Looking forward to your reply! Thank you.
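
For context, the pad mask above already removes padded positions inside attention, so no explicit zero-filling of the activations is needed. A minimal sketch of how such a boolean mask is typically consumed (hypothetical shapes, not the repo's exact code):

    import torch
    import torch.nn.functional as F

    n, heads, seq_len, dk = 2, 1, 5, 8
    q = k = v = torch.randn(n, heads, seq_len, dk)
    mask = torch.zeros(n, 1, seq_len, seq_len, dtype=torch.bool)
    mask[0, 0, :, 3:] = True  # pretend the last two tokens of sample 0 are padding

    scores = q @ k.transpose(-2, -1) / dk ** 0.5        # [n, heads, seq_len, seq_len]
    scores = scores.masked_fill(mask, float("-inf"))    # masked keys get -inf
    context = F.softmax(scores, dim=-1) @ v             # padding gets zero attention weight

Masked score positions receive -inf, so after the softmax the padded keys have zero weight and contribute nothing to the context vector.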

About the GPT program

Hello, I'd like to ask two questions:
  1. The model input seqs[:, :-1] is the sentence so far, and seqs[:, -1] is the ground-truth value to predict. But if a sentence is shorter than the longest one, it is padded up to that length, so wouldn't seqs[:, -1] always pick up a padding value?
  2. If each batch used the length of its own longest sentence, rather than the longest sentence in the whole corpus, as its step (time) length, would that solve the problem above? (A sketch of the idea follows below.)
Hope to hear from you! Thank you.
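
Point 2 describes dynamic per-batch padding. A minimal sketch of the idea (hypothetical helper, not the repo's code):

    import numpy as np

    def pad_batch(batch, pad_id=0):
        # pad every sequence to this batch's own max length,
        # not the corpus-wide max, so trailing steps are mostly real tokens
        max_len = max(len(s) for s in batch)
        return np.array([s + [pad_id] * (max_len - len(s)) for s in batch])

    print(pad_batch([[5, 3], [7, 1, 2, 9]]))
    # [[5 3 0 0]
    #  [7 1 2 9]]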

The attention step in the transformer may leak answers

Hello,
I noticed this piece of code in the decoder block:

    attn = self.mh[0].call(yz, yz, yz, look_ahead_mask, training)  # decoder self attention
    o1 = self.bn[0](attn + yz, training)
    attn = self.mh[1].call(o1, xz, xz, pad_mask, training)         # decoder + encoder attention
    o2 = self.bn[1](attn + o1, training)
    ffn = self.drop(self.ffn.call(o2), training)
    o = self.bn[2](ffn + o2, training)
    return o

Here, o1 = self.bn[0](attn + yz, training) may make training and inference behave inconsistently (the inference output depends on the labels). For example, in my experiment, changing the second position of the label (the position right after the GO token) changed the second position of the inference result, which produced a large gap between training accuracy and test accuracy.
Should a mask be added to yz at this step, or should the operation be removed, to avoid this effect?

show_tfidf in visual.py has a problem

    def show_tfidf(tfidf, vocb, filename):
        # [n_vocab, n_doc]
        plt.imshow(tfidf, cmap="YlGn", vmin=tfidf.min(), vmax=tfidf.max())
        plt.xticks(np.arange(tfidf.shape[1]+1), vocb, fontsize=6, rotation=90)
        plt.yticks(np.arange(tfidf.shape[0]), np.arange(1, tfidf.shape[0] + 1), fontsize=6)
        plt.tight_layout()
        plt.savefig("./visual/results/%s.png" % filename, format="png", dpi=500)
        plt.show()

The original plt.xticks and plt.yticks have a small problem; after a modification the problem goes away.
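
The thread does not show the modification, but the likely culprit is the off-by-one in plt.xticks: np.arange(tfidf.shape[1] + 1) produces one more tick position than there are labels in vocb, and Matplotlib rejects mismatched lengths. A hedged guess at the fix:

    # one tick position per vocabulary entry, so positions and labels line up
    plt.xticks(np.arange(tfidf.shape[1]), vocb, fontsize=6, rotation=90)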

Some transformer components are inconsistent with the paper

Hello,

  1. In the paper, q, k, and v come from matrix multiplications; in the code they are fully connected (Dense) layers.
  2. The paper uses LayerNormalization; the code uses BatchNormalization.

Does this affect the results?
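
Two notes that may help frame this. A Dense layer with no bias is exactly the matrix multiplication the paper describes, so point 1 is mostly notational. Point 2 is a real substitution; the paper's Add & Norm step looks like the sketch below (a Keras illustration of the paper's choice, not the repo's code):

    import tensorflow as tf

    dim = 32
    attn_out = tf.random.normal((2, 10, dim))   # [n, step, dim]
    residual = tf.random.normal((2, 10, dim))
    # LayerNorm normalizes each position over its feature axis only,
    # so no statistics are shared across samples or time steps
    ln = tf.keras.layers.LayerNormalization()
    o1 = ln(attn_out + residual)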

Error in the show_tfidf function in visual.py

Running the line

    plt.savefig("./visual/results/%s.png" % filename, format="png", dpi=500)

raises "No such file or directory".

Solution: replace the original code with the following:

    import os

    # create the output folder if it does not exist
    output_folder = './visual/'
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    plt.savefig(os.path.join(output_folder, '%s.png' % filename), format="png", dpi=500)

I wanted to open a pull request but found I don't know how, so I'm posting this issue instead. I just started learning NLP; many thanks to the author for the open-source tutorial, and I hope it keeps getting better.


Error when running skip-gram

    File "D:/Desktop/Information-Extraction-Chinese-master/skip-gram.py", line 79
        _loss: tf.Tensor = self.loss(x, y, True)
             ^
    SyntaxError: invalid syntax

    Process finished with exit code 1

Please help; many thanks.
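
A likely cause (my guess; the thread does not confirm it): annotated assignments such as _loss: tf.Tensor = ... are a Python 3.6+ feature (PEP 526), so an older interpreter reports them as invalid syntax. A hedged workaround is to drop the annotation:

    # variable annotations need Python >= 3.6; without one this is a plain assignment
    _loss = self.loss(x, y, True)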

BERT problem

Thanks for the tutorials.

Your implementation of BERT is different from the original paper.
You use all tokens for NSP (next sentence prediction) instead of only the "[CLS]" token.

https://github.com/MorvanZhou/NLP-Tutorials/blob/master/GPT.py

    def call(self, seqs, segs, training=False):
        embed = self.input_emb(seqs, segs)  # [n, step, dim]
        z = self.encoder(embed, training=training, mask=self.mask(seqs))  # [n, step, dim]
        mlm_logits = self.task_mlm(z)  # [n, step, n_vocab]
        nsp_logits = self.task_nsp(tf.reshape(z, [z.shape[0], -1]))  # [n, n_cls]
        return mlm_logits, nsp_logits

It seems to me that this should be changed as below

    def call(self, seqs, segs, training=False):
        embed = self.input_emb(seqs, segs)  # [n, step, dim]
        z = self.encoder(embed, training=training, mask=self.mask(seqs))  # [n, step, dim]
        mlm_logits = self.task_mlm(z)  # [n, step, n_vocab]
        # use only the [CLS] position (step 0) for next sentence prediction
        nsp_logits = self.task_nsp(tf.reshape(z[:, 0, :], [z.shape[0], -1]))  # [n, n_cls]
        return mlm_logits, nsp_logits

There are some reimplementations of the tutorials in PyTorch.

Hope they help people who are not comfortable with TensorFlow.

pytorch seq2seq error

Line 81 of the PyTorch version of seq2seq raises:
expected scalar type Long but found Int
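
A common cause (an assumption on my part; the issue gives no traceback): index tensors fed to embedding layers or classification losses must be int64 (Long), but were created as int32. Casting usually fixes it:

    import torch

    idx = torch.tensor([1, 2, 3], dtype=torch.int32)
    idx = idx.long()  # cast int32 -> int64, as embedding / loss targets expect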
