
delip / pytorchnlpbook


Code and data accompanying Natural Language Processing with PyTorch published by O'Reilly Media https://amzn.to/3JUgR2L

License: Apache License 2.0

Languages: Jupyter Notebook 99.70%, Python 0.13%, Shell 0.17%

Topics: natural-language-processing, nlp, pytorch-nlp, pytorch, pytorch-tutorial, deep-learning, deep-neural-networks, neural-networks, neural-machine-translation

pytorchnlpbook's Introduction

Natural Language Processing with PyTorch

Build Intelligent Language Applications Using Deep Learning
By Delip Rao and Brian McMahan

Welcome. This is a companion repository for the book Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning.


pytorchnlpbook's People

Contributors

braingineer, delip, igordzreyev, reynoldsnlp



pytorchnlpbook's Issues

Bug in update_train_state

There is a problem with the implementation of the update_train_state function in chapters 3-5. Specifically, when the loss is decreasing, train_state['early_stopping_best_val'] is not updated (except in the first epoch), so the early-stopping criterion can only be met if the loss rises above its value from the first epoch.

# If loss worsened
if loss_t >= train_state['early_stopping_best_val']:
    # Update step
    train_state['early_stopping_step'] += 1
# Loss decreased
else:
    # Save the best model
    if loss_t < train_state['early_stopping_best_val']:
        torch.save(model.state_dict(), train_state['model_filename'])

    # Reset early stopping step
    train_state['early_stopping_step'] = 0

Please add the line train_state['early_stopping_best_val'] = loss_t, like in later chapters.
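For reference, a minimal sketch of the corrected branch, mirroring the later chapters (it assumes the same train_state dict, loss_t, and model used in the chapter notebooks):

    # If loss worsened
    if loss_t >= train_state['early_stopping_best_val']:
        # Count one more step toward early stopping
        train_state['early_stopping_step'] += 1
    # Loss decreased
    else:
        # Save the best model and remember the new best value
        torch.save(model.state_dict(), train_state['model_filename'])
        train_state['early_stopping_best_val'] = loss_t

        # Reset early stopping step
        train_state['early_stopping_step'] = 0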

can't get data

I tried to run the shell script to get the data, but the network connection fails. Could you upload the data to this project or another project? Thanks.

5_1_Pretrained_Embeddings.ipynb notebook

Using the glove.6B.100d.txt file from Kaggle [link], the appropriate from_embeddings_file method is:

def from_embeddings_file(cls, embedding_file):
    """Instantiate from pre-trained vector file.

    Vector file should be of the format:
        word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
        word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

    Args:
        embedding_file (str): location of the file
    Returns: 
        instance of PretrainedEmbeddings
    """
    word_to_index = {}
    word_vectors = []

    with open(embedding_file, encoding="utf8") as fp:
        for line in fp.readlines():
            line = line.split(" ")
            word = line[0]
            vec = np.array([float(x) for x in line[1:]])

            word_to_index[word] = len(word_to_index)
            word_vectors.append(vec)

    return cls(word_to_index, word_vectors)

make_embedding_matrix assumes that the words are fed in the same order as the vocab

I'm studying 5_3_Document_Classification_with_CNN.

The make_embedding_matrix helper's docstring says it should be fed a list of words in the dataset. However, for the embedding matrix to return the correct embedding for a word from the pretrained embeddings, the word list must be fed in the same order as the vocabulary, and there must be no gaps in the vocabulary's word indices. These are big assumptions.

I think the correct way to construct the embedding matrix is to pass the vocab to make_embedding_matrix and use the vocab's token-to-index mapping to decide which rows of the embedding matrix should be populated; see the sketch below.
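A minimal sketch of that idea (hypothetical helper; it assumes the book-style Vocabulary exposes a _token_to_idx dict and __len__, and that embeddings is a PreTrainedEmbeddings-like object with get_embedding(word) and word_to_index):

    import numpy as np

    def make_embedding_matrix_from_vocab(vocab, embeddings, embedding_size):
        """Fill row idx of the matrix with the pretrained vector of vocab's token idx."""
        final_embeddings = np.zeros((len(vocab), embedding_size))
        for word, idx in vocab._token_to_idx.items():
            if word in embeddings.word_to_index:
                final_embeddings[idx, :] = embeddings.get_embedding(word)
            else:
                # token has no pretrained vector (e.g. <UNK>, <MASK>): small random init
                final_embeddings[idx, :] = np.random.uniform(-0.25, 0.25, embedding_size)
        return final_embeddings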

Correct me if I'm wrong.

Agnews Classifier Predict error

Hello,
Thanks for a great book. I've been working through the examples and it's really helpful. One issue I noticed in the agnews classifier: when running the predict_category function, I got an error when trying to predict one of the sports-category samples:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-1b3d5180f20c> in <module>
      3   print("="*30)
      4   for sample in sample_group:
----> 5     pred = predict_category(sample, classifier, dc.vectorizer, dc._train_ds.max_seq_length+1)
      6     print(f"Prediction: {pred['category']} (p={pred['prob']:0.2f})")
      7     print(f"\t + Sample: {sample}")

<ipython-input-16-75dc8b0470ad> in predict_category(title, classifer, vectorizer, max_length)
     12   """
     13   title = preprocess_text(title)
---> 14   vectorized_title = torch.tensor(vectorizer.vectorize(title, max_length))
     15 
     16   # add batch dim so you have a batch of size 1

~/nlpbook/ag/ag/vectorizer.py in vectorize(self, title, vector_len)
     33 
     34     out_vector = np.zeros(vector_len, dtype=np.int64)
---> 35     out_vector[:len(vector)] = vector
     36     out_vector[len(vector):] = self.title_vocab.mask_idx
     37 

ValueError: cannot copy sequence with size 22 to array axis with dimension 21

The max_seq_length in the training set was 20, and in this line in the text, we pass in max_seq_length+1 effectively making the sequence 21 tokens long:

prediction = predict_category(sample, classifier, vectorizer, dataset._max_seq_length + 1)

However, the required length is 22, so when I changed the function call to pass max_seq_length+2, it worked. This raises a more general question:

When the test data's max_seq_length could potentially be larger than that of the training data, what do we do? How do we handle that? Do we just pass in a larger value for the max_seq_length? Even if we do that, how do we foresee how big of a value we might need?
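One defensive option for the general question is to truncate at vectorization time instead of erroring. A standalone sketch of such a helper (hypothetical, not code from the book), which could be called at the end of vectorize so longer, unseen titles are silently clipped to the training-time length:

    import numpy as np

    def pad_or_truncate(indices, vector_len, mask_idx):
        """Fit a list of token indices into a fixed-length vector: truncate if it is
        longer than vector_len, pad with mask_idx otherwise."""
        indices = list(indices)[:vector_len]
        out_vector = np.zeros(vector_len, dtype=np.int64)
        out_vector[:len(indices)] = indices
        out_vector[len(indices):] = mask_idx
        return out_vector

The trade-off is that overflow tokens are dropped rather than raising a ValueError.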

Thanks.

Chapter 03 Yelp Dataset has a Typo

Hi everyone,

Chapter 3 does not load Yelp data due to a typo on the last line of the dataset:

Line 73357 (review column): "1","Capital City Transfer han

Passing the nrows argument with the number of rows minus 1 fixed it for me:

train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names = ['rating', 'review'], nrows=73356)

Or

train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names = ['rating', 'review'], error_bad_lines=False)

Or by just appending a closing " to that line.

Still, it would be nice to fix this typo in the dataset.

8_5_NMT_No_Sampling

Hello
I can't understand how to use the model. It trained well, but how can I translate my own phrase? It only seems possible to translate sentences from the dataset.
It's too hard for me to do all these operations by hand: vectorize and encode, then run the forward pass and decode the output...
Can somebody help me?

Jupyter Notebook Kernel keeps dying in the Toy Dataset Problem on a Macbook Pro

All the previous code from Chapter 1 seems to run fine, but this particular piece of code seems to be causing the issue:

lr = 0.01
input_dim = 2

batch_size = 1000
n_epochs = 12
n_batches = 5

seed = 1337

torch.manual_seed(seed)
np.random.seed(seed)

perceptron = Perceptron(input_dim=input_dim)
optimizer = optim.Adam(params=perceptron.parameters(), lr=lr)
bce_loss = nn.BCELoss()

losses = []

x_data_static, y_truth_static = get_toy_data(batch_size)
fig, ax = plt.subplots(1, 1, figsize=(10,5))
visualize_results(perceptron, x_data_static, y_truth_static, ax=ax, title='Initial Model State')
plt.axis('off')
#plt.savefig('initial.png')

change = 1.0
last = 10.0
epsilon = 1e-3
epoch = 0
while change > epsilon or epoch < n_epochs or last > 0.3:
#for epoch in range(n_epochs):
    for _ in range(n_batches):

        optimizer.zero_grad()
        x_data, y_target = get_toy_data(batch_size)
        y_pred = perceptron(x_data).squeeze()
        loss = bce_loss(y_pred, y_target)
        loss.backward()
        optimizer.step()

        loss_value = loss.item()
        losses.append(loss_value)

        change = abs(last - loss_value)
        last = loss_value

    fig, ax = plt.subplots(1, 1, figsize=(10,5))
    visualize_results(perceptron, x_data_static, y_truth_static, ax=ax, epoch=epoch,
                      title=f"{loss_value}; {change}")
    plt.axis('off')
    epoch += 1
    #plt.savefig('epoch{}_toylearning.png'.format(epoch))

Would you happen to know what the cause could be?

An error in NewsDataset class

def load_vectorizer_only(vectorizer_filepath):
    with open(vectorizer_filepath) as fp:
        return NameVectorizer.from_serializable(json.load(fp))

NameVectorizer should be altered to NewsVectorizer.

Typo in 3_5_Classifying_Yelp_Review_Sentiment.py

Hi, thanks for providing the code. I found a small typo in 3_5_Classifying_Yelp_Review_Sentiment.py. In the ReviewVectorizer class, the comment in vectorize should say "one-hot" rather than "one-hit".


Manual Data Download

I am getting the following error if I click on the manual download link:

That’s an error.

The requested URL was not found on this server. That’s all we know.

Errors in Yelp notebook

In the Chapter 3 notebook on preprocessing the Yelp data (LITE), there is duplicated code in blocks 4, 9, and 10. I don't have the book in front of me right now, so maybe there's a reason why you do everything two or three times, but just reading the notebook I don't see why. If I'm wrong, some comments would be good here.

Function preprocess_text does not seem to strip punctuations

import re

def preprocess_text(text):
    text = ' '.join(word.lower() for word in text.split(" "))
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

Calling preprocess_text('Are you a, boy or a girl?') returns:

'are you a , boy or a girl ? '

No early stopping implemented in Chapter 3 (yelp)

The early_stopping_best_val never gets updated, so the current loss is always considered smaller and early stopping never happens.
This line of code is missing from the if-else statement in the update_train_state function:

train_state['early_stopping_best_val'] = loss_t


Resource wordnet not found.
Please use the NLTK Downloader to obtain the resource:
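As with the punkt issue further down, the fix suggested by the full NLTK error message is to download the resource once:

    import nltk
    nltk.download('wordnet')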

dropout error

In fact, F.dropout(xx, p=0.5) applies dropout in both the model's train mode and eval mode. You should write F.dropout(xx, p=0.5, training=self.training); otherwise, when you execute a cell more than once, the result is different. I heard about this book in the CS224 class; the professor said it's a great book and that he would be sad if we didn't read it. Even now there seems to be no one maintaining this book. I want to say: "Thank you, great authors!"
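A minimal sketch of the suggested fix (a hypothetical toy module, not code from the book), showing that the training flag makes eval-mode outputs deterministic:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyClassifier(nn.Module):
        def __init__(self, in_features=16, out_features=2, dropout_p=0.5):
            super().__init__()
            self.fc = nn.Linear(in_features, out_features)
            self._dropout_p = dropout_p

        def forward(self, x_in):
            # training=self.training disables dropout automatically under model.eval()
            x = F.dropout(x_in, p=self._dropout_p, training=self.training)
            return self.fc(x)

    model = TinyClassifier()
    model.eval()
    x = torch.randn(4, 16)
    # With the flag in place, repeated eval-mode calls give identical outputs.
    assert torch.equal(model(x), model(x))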

Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

Got the below error in notebook 5_2_munging_frankenstein.ipynb.
Please help with this.

LookupError                               Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
      2 with open(args.raw_dataset_txt) as fp:
      3     book = fp.read()
      4 sentences = tokenizer.tokenize(book)

/usr/local/lib/python3.6/dist-packages/nltk/data.py in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
    832
    833     # Load the resource.
--> 834     opened_resource = _open(resource_url)
    835
    836     if format == 'raw':

/usr/local/lib/python3.6/dist-packages/nltk/data.py in _open(resource_url)
    950
    951     if protocol is None or protocol.lower() == 'nltk':
--> 952         return find(path_, path + ['']).open()
    953     elif protocol.lower() == 'file':
    954         # urllib might not use mode='rb', so handle this one ourselves:

/usr/local/lib/python3.6/dist-packages/nltk/data.py in find(resource_name, paths)
    671     sep = '*' * 70
    672     resource_not_found = '\n%s\n%s\n%s\n' % (sep, msg, sep)
--> 673     raise LookupError(resource_not_found)
    674
    675

LookupError:


Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')

Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/nltk_data'
- '/usr/lib/nltk_data'

running_loss definition

I can't understand this code:

running_loss += (loss_t - running_loss) / (batch_index + 1)

I think loss_t is the loss for the current batch. If loss_t is less than running_loss, then (loss_t - running_loss) is negative? Can anyone explain what this code means?

get_num_batches uses integer division in Chapter 3's ReviewDataset

Isn't this wrong if the data size is not a multiple of batch_size? Shouldn't it be:

    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset
        
        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return int(np.ceil(len(self)/batch_size))

error report for chapter3 code

An error occurred when I ran the code below from Chapter 3; could anyone please help me figure it out?

classifier = classifier.to(args.device)

loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)

train_state = make_train_state(args)

epoch_bar = tqdm.notebook.tqdm(desc='training routine',
                               total=args.num_epochs,
                               position=0)

dataset.set_split('train')
train_bar = tqdm.notebook.tqdm(desc='split=train',
                               total=dataset.get_num_batches(args.batch_size),
                               position=1,
                               leave=True)
dataset.set_split('val')
val_bar = tqdm.notebook.tqdm(desc='split=val',
                             total=dataset.get_num_batches(args.batch_size),
                             position=1,
                             leave=True)

try:
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index

        # Iterate over training dataset

        # setup: batch generator, set loss and acc to 0, set train mode on
        dataset.set_split('train')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.0
        running_acc = 0.0
        classifier.train()

        for batch_index, batch_dict in enumerate(batch_generator):
            # the training routine is these 5 steps:

            # --------------------------------------
            # step 1. zero the gradients
            optimizer.zero_grad()

            # step 2. compute the output
            y_pred = classifier(x_in=batch_dict['x_data'].float())

            # step 3. compute the loss
            loss = loss_func(y_pred, batch_dict['y_target'].float())
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # step 4. use loss to produce gradients
            loss.backward()

            # step 5. use optimizer to take gradient step
            optimizer.step()
            # -----------------------------------------
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            # update bar
            train_bar.set_postfix(loss=running_loss,
                                  acc=running_acc,
                                  epoch=epoch_index)
            train_bar.update()

        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)

        # Iterate over val dataset

        # setup: batch generator, set loss and acc to 0; set eval mode on
        dataset.set_split('val')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.
        running_acc = 0.
        classifier.eval()

        for batch_index, batch_dict in enumerate(batch_generator):

            # compute the output
            y_pred = classifier(x_in=batch_dict['x_data'].float())

            # compute the loss
            loss = loss_func(y_pred, batch_dict['y_target'].float())
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            val_bar.set_postfix(loss=running_loss,
                                acc=running_acc,
                                epoch=epoch_index)
            val_bar.update()

        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)

        train_state = update_train_state(args=args,
                                         model=classifier,
                                         train_state=train_state)

        scheduler.step(train_state['val_loss'][-1])

        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()

        if train_state['stop_early']:
            break

        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()

except KeyboardInterrupt:
    print("Exiting loop")

ValueError: num_samples should be a positive integer value, but got num_samples=0

Duplicate data/ directory

The data/ directory exists at the top level and also in every chapter. I suspect this is supposed to be a symlink, but git and symlinks aren't the best of friends. You could use a post-hook script to create the links, or a % command in the notebook to ensure the link is there and points to the right place.
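A minimal sketch of such a notebook cell (an assumption about the layout: the notebook runs one directory below the repository root, where the shared data/ lives):

    import os

    # Create a chapter-level data/ symlink pointing at the top-level data/ directory,
    # if it does not already exist.
    if not os.path.exists('data'):
        os.symlink(os.path.join('..', 'data'), 'data')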

where is early_stopping_criteria param in args set?

Hello,

I tried your sample code in Chapter 3 (CNN Classifier) and I found a line of code saying that:

# Stop early ?
train_state['stop_early'] = \
    train_state['early_stopping_step'] >= args.early_stopping_criteria

where args.early_stopping_criteria does not seem to be included in the args set.
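A minimal sketch of the missing piece (hypothetical values; the notebook's Namespace has many more fields than shown here):

    from argparse import Namespace

    # Adding early_stopping_criteria to args makes the comparison above work:
    # it is the number of epochs without improvement to tolerate before stopping.
    args = Namespace(
        learning_rate=0.001,
        num_epochs=100,
        early_stopping_criteria=5,
    )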

Unnecessary code lines in the SurnameGenerationModel class in 7_3_Model1_Unconditioned_Surname_Generation-Origin.ipynb

It seems the commented-out code lines below are unnecessary; I uncommented them and the result was just the same. Would you agree?

class SurnameGenerationModel(nn.Module):
    def __init__(self, char_embedding_size, char_vocab_size, rnn_hidden_size,
                 batch_first=True, padding_idx=0, dropout_p=0.5):
        """
        Args:
            char_embedding_size (int): The size of the character embeddings
            char_vocab_size (int): The number of characters to embed
            rnn_hidden_size (int): The size of the RNN's hidden state
            batch_first (bool): Informs whether the input tensors will
                have batch or the sequence on the 0th dimension
            padding_idx (int): The index for the tensor padding;
                see torch.nn.Embedding
            dropout_p (float): the probability of zeroing activations using
                the dropout method. higher means more likely to zero.
        """
        super(SurnameGenerationModel, self).__init__()

        self.char_emb = nn.Embedding(num_embeddings=char_vocab_size,
                                     embedding_dim=char_embedding_size,
                                     padding_idx=padding_idx)

        self.rnn = nn.GRU(input_size=char_embedding_size,
                          hidden_size=rnn_hidden_size,
                          batch_first=batch_first)

        self.fc = nn.Linear(in_features=rnn_hidden_size,
                            out_features=char_vocab_size)

        self._dropout_p = dropout_p

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the model

        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, char_vocab_size)
        """
        x_embedded = self.char_emb(x_in)

        y_out, _ = self.rnn(x_embedded)

        #batch_size, seq_size, feat_size = y_out.shape
        #y_out = y_out.contiguous().view(batch_size * seq_size, feat_size)

        y_out = self.fc(F.dropout(y_out, p=self._dropout_p))

        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)

        #new_feat_size = y_out.shape[-1]
        #y_out = y_out.view(batch_size, seq_size, new_feat_size)

        return y_out
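For what it's worth, a small standalone check of why the results match (a sketch, not code from the notebook): nn.Linear acts on the last dimension, so applying it directly to a (batch, seq, features) tensor is equivalent to flattening to 2-D, applying it, and reshaping back.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    fc = nn.Linear(8, 5)
    y_out = torch.randn(2, 3, 8)        # (batch, seq, feat)

    # Apply the layer directly to the 3-D tensor...
    direct = fc(y_out)

    # ...and via the flatten / apply / reshape route from the commented-out lines.
    batch_size, seq_size, feat_size = y_out.shape
    flat = fc(y_out.contiguous().view(batch_size * seq_size, feat_size))
    roundtrip = flat.view(batch_size, seq_size, -1)

    assert torch.allclose(direct, roundtrip)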
