delip / pytorchnlpbook Goto Github PK

Code and data accompanying Natural Language Processing with PyTorch published by O'Reilly Media https://amzn.to/3JUgR2L

License: Apache License 2.0

Jupyter Notebook 99.70% Python 0.13% Shell 0.17%

natural-language-processing nlp pytorch-nlp pytorch pytorch-tutorial deep-learning deep-neural-networks neural-networks neural-machine-translation

pytorchnlpbook's Introduction

Natural Language Processing with PyTorch

Build Intelligent Language Applications Using Deep Learning
By Delip Rao and Brian McMahan

Welcome. This is a companion repository for the book Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning.

Get Started!
Chapter 1: Introduction
- PyTorch Basics
Chapter 2: A Quick Tour of NLP
Chapter 3: Foundational Components of Neural Networks
- In-text examples
- Diving deep into supervised training
- Classifying sentiment of restaurant reviews using a Perceptron
Chapter 4: Feed-forward Networks for NLP
- Limitations of the Perceptron
- Introducing Multi-layer Perceptrons (MLPs)
- Introducing Convolutional Neural Networks (CNNs)
- Surname Classification with an MLP
- Surname Classification with a CNN
Chapter 5: Embedding Words and Types
- Using Pretrained Embeddings
- Learning Continuous Bag-of-words Embeddings (CBOW)
- Transfer Learning using Pre-trained Embeddings
Chapter 6: Sequence Modeling for NLP
- A sequence representation for Surnames
Chapter 7: Intermediate Sequence Modeling for NLP
- Generating novel surnames from sequence representations
- Unconditioned generation
- Conditioned generation
Chapter 8: Advanced Sequence Modeling for NLP
- Understanding PackedSequences
- Sequence to Sequence Learning
- Attention
- Neural Machine Translation
Chapter 9: Classics, Frontiers, Next Steps


pytorchnlpbook's Issues

Bug in update_train_state

There is a problem with the implementation of the update_train_state function in chapters 3-5. Specifically, when the loss is decreasing, train_state['early_stopping_best_val'] is not updated (except in the first epoch), so the early stopping criterion can only be met if the loss rises above its first-epoch value.

# If loss worsened
if loss_t >= train_state['early_stopping_best_val']:
    # Update step
    train_state['early_stopping_step'] += 1
# Loss decreased
else:
    # Save the best model
    if loss_t < train_state['early_stopping_best_val']:
        torch.save(model.state_dict(), train_state['model_filename'])

    # Reset early stopping step
    train_state['early_stopping_step'] = 0

Please add the line train_state['early_stopping_best_val'] = loss_t, like in later chapters.
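
The fix requested above can be sketched as follows. This is a minimal, dependency-free illustration and not the repository's actual function: the name update_early_stopping, the patience parameter (standing in for args.early_stopping_criteria), and the save_fn callback (standing in for the torch.save call) are all assumptions made for the example.

```python
def update_early_stopping(train_state, loss_t, patience, save_fn=None):
    """Update early-stopping bookkeeping for one validation loss.

    `patience` and `save_fn` are hypothetical stand-ins for
    args.early_stopping_criteria and the torch.save(...) call.
    """
    if loss_t >= train_state['early_stopping_best_val']:
        # Loss did not improve: count one more stale epoch
        train_state['early_stopping_step'] += 1
    else:
        # Loss improved: save, record the new best value, reset the counter
        if save_fn is not None:
            save_fn()
        train_state['early_stopping_best_val'] = loss_t
        train_state['early_stopping_step'] = 0

    train_state['stop_early'] = train_state['early_stopping_step'] >= patience
    return train_state


state = {'early_stopping_best_val': float('inf'),
         'early_stopping_step': 0,
         'stop_early': False}
for val_loss in [0.9, 0.7, 0.6, 0.65, 0.62]:
    state = update_early_stopping(state, val_loss, patience=2)
```

With the best value tracked on every improvement, two consecutive epochs that fail to beat 0.6 are enough to set stop_early in this toy run.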

can't get data

I tried to run the shell script to get the data, but the connection was dropped. Could you upload the data to this project or another one? Thank you.

Nltk punkt download error

Please help me solve this.

5_1_Pretrained_Embeddings.ipynb notebook

Use the file glove.6B.100d.txt from Kaggle [link] with the appropriate from_embeddings_file method:

def from_embeddings_file(cls, embedding_file):
    """Instantiate from pre-trained vector file.

    Vector file should be of the format:
        word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
        word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

    Args:
        embedding_file (str): location of the file
    Returns:
        instance of PretrainedEmbeddings
    """
    word_to_index = {}
    word_vectors = []

    with open(embedding_file, encoding="utf8") as fp:
        for line in fp.readlines():
            line = line.split(" ")
            word = line[0]
            vec = np.array([float(x) for x in line[1:]])

            word_to_index[word] = len(word_to_index)
            word_vectors.append(vec)

    return cls(word_to_index, word_vectors)

make_embedding_matrix assumes that the words are fed in the same order as the vocab

I'm studying 5_3_Document_Classification_with_CNN.

The make_embedding_matrix helper docs say that it should be fed in a list of words in the dataset. However, for the embedding matrix to return the correct embedding of a word from pretrained embeddings, the word list should be fed in the same order as in the vocabulary. Furthermore, there should be no gaps in the word indices in the vocabulary. These are big assumptions.

I think the correct way to construct the embedding matrix is to pass the vocab to the make_embedding_matrix function, and use the token_to_idx mapping in the vocab to find which rows of the embedding matrix should be populated.

Correct me if I'm wrong.
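
The proposal above can be sketched roughly as follows. This is an illustration under stated assumptions, not the helper from the notebook: the function name make_embedding_matrix_from_vocab is hypothetical, plain dicts stand in for the vocabulary and the pretrained vectors, plain lists stand in for the NumPy array, and the small uniform initialization for missing words is a placeholder choice.

```python
import random


def make_embedding_matrix_from_vocab(token_to_idx, pretrained, embedding_size, seed=0):
    """Build a matrix whose row i belongs to the token with vocabulary index i.

    token_to_idx: dict token -> row index (hypothetical stand-in for the vocab)
    pretrained:   dict token -> list of floats (hypothetical stand-in for GloVe)
    """
    rng = random.Random(seed)
    # Size the matrix by the largest index, so gaps in the indices are tolerated
    n_rows = max(token_to_idx.values()) + 1
    matrix = [[0.0] * embedding_size for _ in range(n_rows)]

    for token, idx in token_to_idx.items():
        if token in pretrained:
            # Row position comes from the vocab, not from list order
            matrix[idx] = list(pretrained[token])
        else:
            # Placeholder initialization for out-of-vocabulary tokens
            matrix[idx] = [rng.uniform(-0.1, 0.1) for _ in range(embedding_size)]
    return matrix


vocab = {'<MASK>': 0, 'dog': 3, 'cat': 1}   # note the gap at index 2
glove = {'cat': [1.0, 2.0], 'dog': [3.0, 4.0]}
matrix = make_embedding_matrix_from_vocab(vocab, glove, embedding_size=2)
```

Because each row is addressed through the vocabulary index, neither the order of the words nor gaps in the indices affect which vector a token receives.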

Chapter 3 In Text Notebook

Chapter 3 In Text Notebook does not load. Also tried loading it separately using Jupyter.

cant download the dataset

connection refused...

Agnews Classifier Predict error

Hello,
Thanks for a great book. I've been working through the examples and they're really helpful. One issue I noticed in the agnews classifier: when I ran the predict_category function, I got an error when trying to predict one of the sports category samples:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-1b3d5180f20c> in <module>
      3   print("="*30)
      4   for sample in sample_group:
----> 5     pred = predict_category(sample, classifier, dc.vectorizer, dc._train_ds.max_seq_length+1)
      6     print(f"Prediction: {pred['category']} (p={pred['prob']:0.2f})")
      7     print(f"\t + Sample: {sample}")

<ipython-input-16-75dc8b0470ad> in predict_category(title, classifer, vectorizer, max_length)
     12   """
     13   title = preprocess_text(title)
---> 14   vectorized_title = torch.tensor(vectorizer.vectorize(title, max_length))
     15 
     16   # add batch dim so you have a batch of size 1

~/nlpbook/ag/ag/vectorizer.py in vectorize(self, title, vector_len)
     33 
     34     out_vector = np.zeros(vector_len, dtype=np.int64)
---> 35     out_vector[:len(vector)] = vector
     36     out_vector[len(vector):] = self.title_vocab.mask_idx
     37 

ValueError: cannot copy sequence with size 22 to array axis with dimension 21

The max_seq_length in the training set was 20, and in this line in the text, we pass in max_seq_length+1 effectively making the sequence 21 tokens long:

prediction = predict_category(sample, classifier, vectorizer, dataset._max_seq_length + 1)

However, the required length is 22, so when I changed the function call to pass max_seq_length+2, it worked. This raises a more general question:

When the test data's max_seq_length could potentially be larger than that of the training data, what do we do? How do we handle that? Do we just pass in a larger value for the max_seq_length? Even if we do that, how do we foresee how big of a value we might need?

Thanks.
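
One common way to handle inputs longer than anything seen in training is to truncate them to a fixed length inside the vectorizer. The sketch below assumes, as the 20-versus-22 mismatch in the traceback suggests, that the vectorizer wraps each title in a begin marker and an end marker. The function signature and index values are hypothetical and the code is not the repository's vectorize method.

```python
def vectorize(token_indices, vector_len, begin_idx, end_idx, mask_idx):
    """Return exactly vector_len indices: begin + tokens + end, then padding.

    All parameter names are hypothetical; they mirror the idea of begin/end
    markers and a mask (padding) index described in the issue.
    """
    body_budget = vector_len - 2          # room left after begin and end markers
    if body_budget < 0:
        raise ValueError("vector_len must be at least 2")

    body = list(token_indices)[:body_budget]   # truncate over-long inputs
    vector = [begin_idx] + body + [end_idx]
    padding = [mask_idx] * (vector_len - len(vector))
    return vector + padding


long_title = list(range(10, 40))          # 30 tokens, longer than the budget
short_title = [10, 11]

long_vec = vectorize(long_title, 22, begin_idx=2, end_idx=3, mask_idx=0)
short_vec = vectorize(short_title, 22, begin_idx=2, end_idx=3, mask_idx=0)
```

Truncation loses the tail of long titles, so the fixed length is a trade-off; a value of the training maximum plus the two markers keeps every training example intact.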

cuda defaults to false in ch4.cnn instead of true

https://github.com/joosthub/PyTorchNLPBook/blob/db9fc8fe48a2416b36b21dde0dfce787c6ade2b2/chapters/chapter_4/4_4_cnn_surnames/4_4_Classifying_Surnames_with_a_CNN.ipynb#L552

The code only checks whether CUDA is not available in order to switch the flag to false, but the flag already starts as false.

Chapter 03 Yelp Dataset has a Typo

Hi everyone,

Chapter 3 does not load Yelp data due to a typo on the last line of the dataset:

Line Review
73357: "1","Capital City Transfer han

Passing the nrows argument with the number of rows minus one fixed it for me:

train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names = ['rating', 'review'], nrows=73356)

Skipping bad lines is another option:

train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names = ['rating', 'review'], error_bad_lines=False)

Or just append a " to this line.

Still, would be nice to fix this typo on the dataset.

8_5_NMT_No_Sampling

Hello
I can't understand how to use the model. It has trained well, but how can I translate my own phrase? It only seems possible to translate sentences from the dataset.
It's too hard for me to redo all these operations myself: vectorize and encode, then forward and decode...
Can somebody help me?

Jupyter Notebook Kernel keeps dying in the Toy Dataset Problem on a Macbook Pro

All the previous code from Chapter 1 seems to run fine, but this particular piece of code seems to cause the issue:

lr = 0.01
input_dim = 2

batch_size = 1000
n_epochs = 12
n_batches = 5

seed = 1337

torch.manual_seed(seed)
np.random.seed(seed)

perceptron = Perceptron(input_dim=input_dim)
optimizer = optim.Adam(params=perceptron.parameters(), lr=lr)
bce_loss = nn.BCELoss()

losses = []

x_data_static, y_truth_static = get_toy_data(batch_size)
fig, ax = plt.subplots(1, 1, figsize=(10,5))
visualize_results(perceptron, x_data_static, y_truth_static, ax=ax, title='Initial Model State')
plt.axis('off')
#plt.savefig('initial.png')

change = 1.0
last = 10.0
epsilon = 1e-3
epoch = 0
while change > epsilon or epoch < n_epochs or last > 0.3:
#for epoch in range(n_epochs):
    for _ in range(n_batches):

        optimizer.zero_grad()
        x_data, y_target = get_toy_data(batch_size)
        y_pred = perceptron(x_data).squeeze()
        loss = bce_loss(y_pred, y_target)
        loss.backward()
        optimizer.step()

        loss_value = loss.item()
        losses.append(loss_value)

        change = abs(last - loss_value)
        last = loss_value

    fig, ax = plt.subplots(1, 1, figsize=(10,5))
    visualize_results(perceptron, x_data_static, y_truth_static, ax=ax, epoch=epoch,
                      title=f"{loss_value}; {change}")
    plt.axis('off')
    epoch += 1
    #plt.savefig('epoch{}_toylearning.png'.format(epoch))

Would you happen to know what the cause could be?

An error in NewsDataset class

def load_vectorizer_only(vectorizer_filepath):
    with open(vectorizer_filepath) as fp:
        return NameVectorizer.from_serializable(json.load(fp))

NameVectorizer should be changed to NewsVectorizer.

Typo in 3_5_Classifying_Yelp_Review_Sentiment.py

Hi, thanks for providing the code. I found a small typo in 3_5_Classifying_Yelp_Review_Sentiment.py. In the ReviewVectorizer class, the comment of vectorize should be one-hot rather than one-hit.

Manual Data Download

I get the following error when I click on the manual download link:

That’s an error.

The requested URL was not found on this server. That’s all we know.

Errors in Yelp notebook

In the Chapter 3 notebook on preprocessing the Yelp data (LITE), there is duplicated code in blocks 4, 9, and 10. I don't have the book in front of me right now, so maybe there's a reason why you do everything two or three times, but just reading the notebook I don't see why. If I'm wrong, some comments would be good here.

Function preprocess_text does not seem to strip punctuations

def preprocess_text(text):
    text = ' '.join(word.lower() for word in text.split(" "))
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

Calling preprocess_text('Are you a, boy or a girl?') returns:

'are you a , boy or a girl ? '
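
As written, the function keeps the characters . , ! ? and pads them with spaces so they become separate tokens; it removes only other non-letter characters. If punctuation should be dropped entirely, a variant could look like the sketch below. The name preprocess_text_strip_punct is hypothetical and this is an illustration, not a change proposed by the book's authors.

```python
import re


def preprocess_text_keep_punct(text):
    """Same behavior as the function quoted above: . , ! ? become tokens."""
    text = text.lower()
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text


def preprocess_text_strip_punct(text):
    """Hypothetical variant: drop everything except letters."""
    text = text.lower()
    # Replace each run of non-letter characters with a single space
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()


sample = 'Are you a, boy or a girl?'
kept = preprocess_text_keep_punct(sample)       # 'are you a , boy or a girl ? '
stripped = preprocess_text_strip_punct(sample)  # 'are you a boy or a girl'
```

Keeping punctuation as tokens lets a model use it as a signal, so which variant is appropriate depends on the task.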

No early stopping implemented in Chapter 3 (yelp)

The early_stopping_best_val never gets updated, so the loss is ALWAYS smaller and early stopping never happens.
This line of code is missing from the if-else statement in the update_train_state function:

train_state['early_stopping_best_val'] = loss_t

.

Resource wordnet not found.
Please use the NLTK Downloader to obtain the resource:

dropout error

In fact, F.dropout(xx, p=0.5) applies dropout in both the model's train mode and eval mode. You should write F.dropout(xx, p=0.5, training=self.training); otherwise, when you execute a cell more than once, the result is different each time. I heard about this book in the CS224 class; the professor said that it's a great book and that he would be sad if we didn't read it. Even though no one seems to be maintaining this book now, I want to say "Thank you, great authors!"
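
The difference can be checked directly. The short sketch below is illustrative only; it calls torch.nn.functional.dropout with and without the training flag on a tensor of ones, and the variable names are made up for the example.

```python
import torch
import torch.nn.functional as F

x = torch.ones(4, 8)

# Default is training=True, so activations are zeroed (and the survivors
# rescaled by 1 / (1 - p)) regardless of the module's train/eval mode
always_on = F.dropout(x, p=0.5)

# With training=False the input passes through unchanged; inside a module
# this flag would typically be self.training
eval_out = F.dropout(x, p=0.5, training=False)
```

Using the nn.Dropout module instead of the functional form is another option, since a module follows model.train() and model.eval() automatically.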

Can't download train data for chapter 3

download.py doesn't work. And when I followed the link to the training set from the README (https://github.com/delip/PyTorchNLPBook/blob/master/data/README.md), I got:
`404. That’s an error.

The requested URL was not found on this server. That’s all we know.`

What should I do to download your examples?

random forest fraud data set

Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

I got the error below in the notebook 5_2_munging_frankenstein.ipynb.
Please help with this.

LookupError Traceback (most recent call last)
in ()
----> 1 tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
2 with open(args.raw_dataset_txt) as fp:
3 book = fp.read()
4 sentences = tokenizer.tokenize(book)

/usr/local/lib/python3.6/dist-packages/nltk/data.py in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
832
833 # Load the resource.
--> 834 opened_resource = _open(resource_url)
835
836 if format == 'raw':

/usr/local/lib/python3.6/dist-packages/nltk/data.py in open(resource_url)
950
951 if protocol is None or protocol.lower() == 'nltk':
--> 952 return find(path, path + ['']).open()
953 elif protocol.lower() == 'file':
954 # urllib might not use mode='rb', so handle this one ourselves:

/usr/local/lib/python3.6/dist-packages/nltk/data.py in find(resource_name, paths)
671 sep = '*' * 70
672 resource_not_found = '\n%s\n%s\n%s\n' % (sep, msg, sep)
--> 673 raise LookupError(resource_not_found)
674
675

LookupError:

Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')

Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/nltk_data'
- '/usr/lib/nltk_data'

running_loss definition

I can't understand this code:
running_loss += (loss_t - running_loss) / (batch_index + 1)
I think loss_t is the loss of the current batch, and loss_t can be less than running_loss, so what does (loss_t - running_loss) mean? Can anyone explain what this code does?
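
The update is the standard incremental form of an arithmetic mean: after k batches, mean_k = mean_(k-1) + (x_k - mean_(k-1)) / k, so the difference can be negative whenever the current batch loss is below the running average. The sketch below uses made-up loss values to check that the update matches the plain mean at every step.

```python
batch_losses = [0.9, 0.5, 0.7, 0.3]   # made-up per-batch losses

running_loss = 0.0
for batch_index, loss_t in enumerate(batch_losses):
    # Move the running mean toward the new value by 1/k of the gap
    running_loss += (loss_t - running_loss) / (batch_index + 1)

    # Compare against the mean computed from scratch over the batches so far
    seen = batch_losses[:batch_index + 1]
    plain_mean = sum(seen) / len(seen)
    assert abs(running_loss - plain_mean) < 1e-12
```

This form avoids storing every batch loss or keeping a separate sum and count.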

get_num_batches uses integer division in Chapter3:ReviewDataSet

Isn't this wrong if data_size is not a multiple of batch_size? Shouldn't it be:

    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset
        
        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return int(np.ceil(len(self)/batch_size))
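
Which count is right depends on whether the batch generator drops the final partial batch. The sketch below contrasts the two cases; the function name num_batches and its drop_last parameter are hypothetical and only illustrate the arithmetic.

```python
import math


def num_batches(data_size, batch_size, drop_last):
    """Number of batches yielded for a dataset of data_size items.

    drop_last=True discards a final partial batch, so floor division applies;
    drop_last=False keeps it, so the count is rounded up.
    """
    if drop_last:
        return data_size // batch_size
    return math.ceil(data_size / batch_size)


full_only = num_batches(1050, 128, drop_last=True)      # 8 full batches (1024 items)
with_partial = num_batches(1050, 128, drop_last=False)  # 8 full batches plus one of 26
```

If the data loader is configured to drop the last partial batch, the integer-division version already reports the number of batches actually produced.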

YELP raw_train.csv file no longer available on Google Drive, please provide alternate source

raw_train.csv

https://drive.google.com/open?id=1xeUnqkhuzGGzZKThzPeXe2Vf6Uu_g_xM gives a 404 error

Please provide an updated link to the exact dataset used in the book, or to an entirely new set of Yelp CSV-formatted datasets (train, test, and reviews_with_splits_lite).

error report for chapter3 code

An error occurred when I ran the code below from Chapter 3. Could anyone please help me figure it out?

classifier = classifier.to(args.device)

loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)

train_state = make_train_state(args)

epoch_bar = tqdm.notebook.tqdm(desc='training routine',
                               total=args.num_epochs,
                               position=0)

dataset.set_split('train')
train_bar = tqdm.notebook.tqdm(desc='split=train',
                               total=dataset.get_num_batches(args.batch_size),
                               position=1,
                               leave=True)
dataset.set_split('val')
val_bar = tqdm.notebook.tqdm(desc='split=val',
                             total=dataset.get_num_batches(args.batch_size),
                             position=1,
                             leave=True)

try:
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index

        # Iterate over training dataset

        # setup: batch generator, set loss and acc to 0, set train mode on
        dataset.set_split('train')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.0
        running_acc = 0.0
        classifier.train()

        for batch_index, batch_dict in enumerate(batch_generator):
            # the training routine is these 5 steps:

            # --------------------------------------
            # step 1. zero the gradients
            optimizer.zero_grad()

            # step 2. compute the output
            y_pred = classifier(x_in=batch_dict['x_data'].float())

            # step 3. compute the loss
            loss = loss_func(y_pred, batch_dict['y_target'].float())
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # step 4. use loss to produce gradients
            loss.backward()

            # step 5. use optimizer to take gradient step
            optimizer.step()
            # -----------------------------------------
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            # update bar
            train_bar.set_postfix(loss=running_loss,
                                  acc=running_acc,
                                  epoch=epoch_index)
            train_bar.update()

        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)

        # Iterate over val dataset

        # setup: batch generator, set loss and acc to 0; set eval mode on
        dataset.set_split('val')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.
        running_acc = 0.
        classifier.eval()

        for batch_index, batch_dict in enumerate(batch_generator):

            # compute the output
            y_pred = classifier(x_in=batch_dict['x_data'].float())

            # compute the loss
            loss = loss_func(y_pred, batch_dict['y_target'].float())
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            val_bar.set_postfix(loss=running_loss,
                                acc=running_acc,
                                epoch=epoch_index)
            val_bar.update()

        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)

        train_state = update_train_state(args=args,
                                         model=classifier,
                                         train_state=train_state)

        scheduler.step(train_state['val_loss'][-1])

        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()

        if train_state['stop_early']:
            break

        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()

except KeyboardInterrupt:
    print("Exiting loop")

ValueError: num_samples should be a positive integer value, but got num_samples=0

Duplicate data/ directory

The data/ directory exists at the top level and also in every chapter. I suspect this is supposed to be a symlink, but git and links aren't the best of friends. You could use a post_hook script to create the links, or a % command in the notebook to ensure the link is there and points to the right place.

where is early_stopping_criteria param in args set?

Hello,

I tried your sample code in Chapter 3 (CNN Classifier) and I found a line of code saying that:

# Stop early ?
train_state['stop_early'] = \
    train_state['early_stopping_step'] >= args.early_stopping_criteria

where args.early_stopping_criteria does not seem to be included in the args set.
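
If the attribute is indeed missing from args, one way to supply it is to add it where the Namespace is built. The sketch below is a minimal illustration: the value 5 and the other field are assumptions for the example, not values confirmed by the issue.

```python
from argparse import Namespace

# Hypothetical args; early_stopping_criteria=5 is an assumed value meaning
# "stop after 5 consecutive epochs without improvement"
args = Namespace(
    num_epochs=100,
    early_stopping_criteria=5,
)

train_state = {'early_stopping_step': 5, 'stop_early': False}

# Stop early ?
train_state['stop_early'] = \
    train_state['early_stopping_step'] >= args.early_stopping_criteria
```

Without the attribute, the comparison would raise an AttributeError the first time it runs.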

Document classification with CNN: accuracy reaches 100% after just one epoch

Does anyone else get the same result? I'm not sure whether there is a bug in the code.

Who on earth thought this was a smart way to get data for the book. Awful. Cost me hours.

yelp training data down

cannot access yelp training data in google drive

Unnecessary code lines in SurnameGenerationModel class in 7_3_Model1_Unconditioned_Surname_Generation-Origin.ipynb

It seems the code lines below are unnecessary. I commented these lines out and the result is exactly the same. Would you agree?

class SurnameGenerationModel(nn.Module):
    def __init__(self, char_embedding_size, char_vocab_size, rnn_hidden_size,
                 batch_first=True, padding_idx=0, dropout_p=0.5):
        """
        Args:
            char_embedding_size (int): The size of the character embeddings
            char_vocab_size (int): The number of characters to embed
            rnn_hidden_size (int): The size of the RNN's hidden state
            batch_first (bool): Informs whether the input tensors will
                have batch or the sequence on the 0th dimension
            padding_idx (int): The index for the tensor padding;
                see torch.nn.Embedding
            dropout_p (float): the probability of zeroing activations using
                the dropout method. higher means more likely to zero.
        """
        super(SurnameGenerationModel, self).__init__()

        self.char_emb = nn.Embedding(num_embeddings=char_vocab_size,
                                     embedding_dim=char_embedding_size,
                                     padding_idx=padding_idx)

        self.rnn = nn.GRU(input_size=char_embedding_size,
                          hidden_size=rnn_hidden_size,
                          batch_first=batch_first)

        self.fc = nn.Linear(in_features=rnn_hidden_size,
                            out_features=char_vocab_size)

        self._dropout_p = dropout_p

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the model

        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, char_vocab_size)
        """
        x_embedded = self.char_emb(x_in)

        y_out, _ = self.rnn(x_embedded)

        #batch_size, seq_size, feat_size = y_out.shape
        #y_out = y_out.contiguous().view(batch_size * seq_size, feat_size)

        y_out = self.fc(F.dropout(y_out, p=self._dropout_p))

        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)

        #new_feat_size = y_out.shape[-1]
        #y_out = y_out.view(batch_size, seq_size, new_feat_size)

        return y_out
