delip / pytorchnlpbook
Code and data accompanying Natural Language Processing with PyTorch published by O'Reilly Media https://amzn.to/3JUgR2L
License: Apache License 2.0
def load_vectorizer_only(vectorizer_filepath):
    with open(vectorizer_filepath) as fp:
        return NameVectorizer.from_serializable(json.load(fp))
NameVectorizer should be changed to NewsVectorizer.
The data/ directory exists at the top level and also in every chapter. I suspect this is supposed to be a symlink, but git and links aren't the best of friends. You could use a post_hook script to create the links, or a % command in the notebook to ensure the link is there and points to the right place.
Hi everyone,
Chapter 3 does not load the Yelp data because of a typo on the last line of the dataset: the review is cut off and its closing quote is missing.
Line Review
73357: "1","Capital City Transfer han
Passing the number of rows minus one as the nrows argument fixed it for me:
train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names = ['rating', 'review'], nrows=73356)
Or (on pandas 1.3 and later, on_bad_lines='skip' replaces error_bad_lines=False):
train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names = ['rating', 'review'], error_bad_lines=False)
Or just append a " to that line.
Still, it would be nice to fix this typo in the dataset.
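The third workaround (appending the missing quote) can also be done in code instead of editing the file by hand. The following is a minimal sketch, assuming the only defect is a single unclosed quoted field at the end of the text; close_dangling_quote is a hypothetical helper, not part of the book's code, and the sample rows are made up apart from the truncated line quoted above.

```python
import csv
import io


def close_dangling_quote(csv_text, quotechar='"'):
    """Append a closing quote if the text ends inside an open quoted field.

    Hypothetical helper: assumes at most one field was left unclosed.
    """
    # Escaped quotes are written as two quote characters, so a well-formed
    # file contains an even number of them; an odd count means one is missing.
    if csv_text.count(quotechar) % 2 == 1:
        csv_text = csv_text.rstrip("\n") + quotechar + "\n"
    return csv_text


# The second row mimics the truncated last line of the dataset.
raw = '"2","Great food"\n"1","Capital City Transfer han\n'
fixed = close_dangling_quote(raw)
rows = list(csv.reader(io.StringIO(fixed)))
```

The repaired text can then be handed to pd.read_csv through io.StringIO, which keeps the truncated review instead of dropping it.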
The commented-out lines in the code below seem unnecessary: I commented them out and the result is exactly the same. Would you agree?
class SurnameGenerationModel(nn.Module):
    def __init__(self, char_embedding_size, char_vocab_size, rnn_hidden_size,
                 batch_first=True, padding_idx=0, dropout_p=0.5):
        """
        Args:
            char_embedding_size (int): The size of the character embeddings
            char_vocab_size (int): The number of characters to embed
            rnn_hidden_size (int): The size of the RNN's hidden state
            batch_first (bool): Informs whether the input tensors will
                have batch or the sequence on the 0th dimension
            padding_idx (int): The index for the tensor padding;
                see torch.nn.Embedding
            dropout_p (float): the probability of zeroing activations using
                the dropout method. higher means more likely to zero.
        """
        super(SurnameGenerationModel, self).__init__()
        self.char_emb = nn.Embedding(num_embeddings=char_vocab_size,
                                     embedding_dim=char_embedding_size,
                                     padding_idx=padding_idx)
        self.rnn = nn.GRU(input_size=char_embedding_size,
                          hidden_size=rnn_hidden_size,
                          batch_first=batch_first)
        self.fc = nn.Linear(in_features=rnn_hidden_size,
                            out_features=char_vocab_size)
        self._dropout_p = dropout_p

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the model
        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, char_vocab_size)
        """
        x_embedded = self.char_emb(x_in)
        y_out, _ = self.rnn(x_embedded)
        #batch_size, seq_size, feat_size = y_out.shape
        #y_out = y_out.contiguous().view(batch_size * seq_size, feat_size)
        y_out = self.fc(F.dropout(y_out, p=self._dropout_p))
        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)
        #new_feat_size = y_out.shape[-1]
        #y_out = y_out.view(batch_size, seq_size, new_feat_size)
        return y_out
Does anyone else get the same results? I'm not sure whether there is a bug in the code.
I'm studying 5_3_Document_Classification_with_CNN.
The make_embedding_matrix helper docs say that it should be fed in a list of words in the dataset. However, for the embedding matrix to return the correct embedding of a word from pretrained embeddings, the word list should be fed in the same order as in the vocabulary. Furthermore, there should be no gaps in the word indices in the vocabulary. These are big assumptions.
I think the correct way to construct the embedding matrix is to pass the vocabulary to make_embedding_matrix and use the vocabulary's token_to_idx mapping to decide which rows of the embedding matrix should be populated.
Correct me if I'm wrong.
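The suggestion above can be sketched as follows. This is an illustration under stated assumptions, not the book's helper: token_to_idx and pretrained are hypothetical plain dictionaries standing in for the vocabulary and the parsed GloVe file, and plain lists are used so the sketch has no dependencies (the notebook itself works with NumPy arrays).

```python
import random


def make_embedding_matrix(token_to_idx, pretrained, embedding_dim, seed=0):
    """Build a matrix in which row i holds the vector of the token with index i.

    token_to_idx: dict token -> row index (hypothetical vocabulary mapping)
    pretrained:   dict token -> list of floats (e.g. parsed GloVe vectors)
    """
    rng = random.Random(seed)
    # Size the matrix from the largest index so gaps in the indices are tolerated.
    num_rows = max(token_to_idx.values()) + 1
    matrix = [[0.0] * embedding_dim for _ in range(num_rows)]
    for token, idx in token_to_idx.items():
        if token in pretrained:
            matrix[idx] = list(pretrained[token])
        else:
            # Token missing from the pretrained file: small random initialisation.
            matrix[idx] = [rng.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
    return matrix


# Index 3 and 4 are deliberately unused to show that gaps do not shift rows.
token_to_idx = {"<MASK>": 0, "<UNK>": 1, "dog": 2, "cat": 5}
pretrained = {"cat": [1.0, 2.0], "dog": [3.0, 4.0]}
matrix = make_embedding_matrix(token_to_idx, pretrained, embedding_dim=2)
```

Because each row is addressed through the vocabulary's own index, the result no longer depends on the order in which the words are passed in.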
Using the file glove.6B.100d.txt from Kaggle [link], the appropriate from_embeddings_file method is:
    @classmethod
    def from_embeddings_file(cls, embedding_file):
        """Instantiate from pre-trained vector file.

        Vector file should be of the format:
            word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
            word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

        Args:
            embedding_file (str): location of the file
        Returns:
            instance of PretrainedEmbeddings
        """
        word_to_index = {}
        word_vectors = []
        # Read the file explicitly as UTF-8
        with open(embedding_file, encoding="utf8") as fp:
            for line in fp.readlines():
                line = line.split(" ")
                word = line[0]
                vec = np.array([float(x) for x in line[1:]])
                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)
        return cls(word_to_index, word_vectors)
All the previous code from Chapter 1 seems to run fine, but this particular part seems to be causing the issue:
lr = 0.01
input_dim = 2
batch_size = 1000
n_epochs = 12
n_batches = 5
seed = 1337
torch.manual_seed(seed)
np.random.seed(seed)
perceptron = Perceptron(input_dim=input_dim)
optimizer = optim.Adam(params=perceptron.parameters(), lr=lr)
bce_loss = nn.BCELoss()
losses = []
x_data_static, y_truth_static = get_toy_data(batch_size)
fig, ax = plt.subplots(1, 1, figsize=(10,5))
visualize_results(perceptron, x_data_static, y_truth_static, ax=ax, title='Initial Model State')
plt.axis('off')
#plt.savefig('initial.png')
change = 1.0
last = 10.0
epsilon = 1e-3
epoch = 0
while change > epsilon or epoch < n_epochs or last > 0.3:
#for epoch in range(n_epochs):
    for _ in range(n_batches):
        optimizer.zero_grad()
        x_data, y_target = get_toy_data(batch_size)
        y_pred = perceptron(x_data).squeeze()
        loss = bce_loss(y_pred, y_target)
        loss.backward()
        optimizer.step()
        loss_value = loss.item()
        losses.append(loss_value)
        change = abs(last - loss_value)
        last = loss_value
    fig, ax = plt.subplots(1, 1, figsize=(10,5))
    visualize_results(perceptron, x_data_static, y_truth_static, ax=ax, epoch=epoch,
                      title=f"{loss_value}; {change}")
    plt.axis('off')
    epoch += 1
    #plt.savefig('epoch{}_toylearning.png'.format(epoch))
Would you happen to know what the cause could be?
code checks is not available to switch to false (but starts as false)
There is a problem with the implementation of the update_train_state function in chapters 3-5. Specifically, when the loss is decreasing, train_state['early_stopping_best_val'] is not updated (except in the first epoch), so the early stopping criterion can only be met if the loss rises above its value from the first epoch.
# If loss worsened
if loss_t >= train_state['early_stopping_best_val']:
    # Update step
    train_state['early_stopping_step'] += 1
# Loss decreased
else:
    # Save the best model
    if loss_t < train_state['early_stopping_best_val']:
        torch.save(model.state_dict(), train_state['model_filename'])
    # Reset early stopping step
    train_state['early_stopping_step'] = 0
Please add the line train_state['early_stopping_best_val'] = loss_t, like in later chapters.
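The requested fix can be sketched in isolation. This is a simplified illustration, not the book's function: update_early_stopping, its patience parameter, and the save_fn callback (standing in for torch.save) are hypothetical names, and the loss values are made up.

```python
def update_early_stopping(train_state, loss_t, patience, save_fn=None):
    """Track the best validation loss and count epochs without improvement."""
    if loss_t >= train_state["early_stopping_best_val"]:
        # Loss did not improve on the best value seen so far
        train_state["early_stopping_step"] += 1
    else:
        # Loss improved: save the model, then record the new best value
        if save_fn is not None:
            save_fn()
        train_state["early_stopping_best_val"] = loss_t  # the missing line
        train_state["early_stopping_step"] = 0
    train_state["stop_early"] = train_state["early_stopping_step"] >= patience
    return train_state


state = {"early_stopping_best_val": float("inf"),
         "early_stopping_step": 0,
         "stop_early": False}
history = []
for loss in [1.0, 0.8, 0.6, 0.7, 0.65]:
    state = update_early_stopping(state, loss, patience=2)
    history.append((state["early_stopping_best_val"],
                    state["early_stopping_step"],
                    state["stop_early"]))
```

With the best value updated on every improvement, the two epochs after the minimum of 0.6 count as non-improving and training stops. Without that line the best value would stay at 1.0 and both epochs would be treated as improvements.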
I tried to run the shell script to get the data, but the network connection fails. Could you upload the data to this project or to another one? Thank you.
Chapter 3 In Text Notebook does not load. Also tried loading it separately using Jupyter.
def preprocess_text(text):
    text = ' '.join(word.lower() for word in text.split(" "))
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text
Calling preprocess_text('Are you a, boy or a girl?')
returns:
'are you a , boy or a girl ? '
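Assuming the concern here is the trailing space left behind when the input ends in punctuation, one possible adjustment is to strip the result before returning it. This is a sketch of that single change, not a confirmed fix from the authors.

```python
import re


def preprocess_text(text):
    """Lowercase, pad punctuation with spaces, and collapse everything else."""
    text = text.lower()
    # Put spaces around the punctuation marks that are kept as tokens
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Replace any run of other characters (including spaces) with one space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    # Drop the leading/trailing space introduced by the padding above
    return text.strip()


result = preprocess_text("Are you a, boy or a girl?")
```

Splitting the stripped result on spaces then yields no empty token at the end.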
In fact, F.dropout(xx, p=0.5) applies dropout in both the model's train mode and eval mode, because its training argument defaults to True. You should write F.dropout(xx, p=0.5, training=self.training); otherwise the result differs each time you execute a cell. I heard about this book in the CS224 class; the professor said that it's a great book and that he would be sad if we didn't read it. Even though no one seems to be maintaining this book now, I want to say "Thank you, great authors!"
Resource wordnet not found.
Please use the NLTK Downloader to obtain the resource:
An error occurred when I ran the code below from Chapter 3. Could anyone please help me figure it out?
`classifier = classifier.to(args.device)
loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)
train_state = make_train_state(args)
epoch_bar = tqdm.notebook.tqdm(desc='training routine',
                               total=args.num_epochs,
                               position=0)
dataset.set_split('train')
train_bar = tqdm.notebook.tqdm(desc='split=train',
                               total=dataset.get_num_batches(args.batch_size),
                               position=1,
                               leave=True)
dataset.set_split('val')
val_bar = tqdm.notebook.tqdm(desc='split=val',
                             total=dataset.get_num_batches(args.batch_size),
                             position=1,
                             leave=True)
try:
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index
        # Iterate over training dataset
        # setup: batch generator, set loss and acc to 0, set train mode on
        dataset.set_split('train')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.0
        running_acc = 0.0
        classifier.train()
        for batch_index, batch_dict in enumerate(batch_generator):
            # the training routine is these 5 steps:
            # --------------------------------------
            # step 1. zero the gradients
            optimizer.zero_grad()
            # step 2. compute the output
            y_pred = classifier(x_in=batch_dict['x_data'].float())
            # step 3. compute the loss
            loss = loss_func(y_pred, batch_dict['y_target'].float())
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)
            # step 4. use loss to produce gradients
            loss.backward()
            # step 5. use optimizer to take gradient step
            optimizer.step()
            # -----------------------------------------
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            # update bar
            train_bar.set_postfix(loss=running_loss,
                                  acc=running_acc,
                                  epoch=epoch_index)
            train_bar.update()
        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)
        # Iterate over val dataset
        # setup: batch generator, set loss and acc to 0; set eval mode on
        dataset.set_split('val')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.
        running_acc = 0.
        classifier.eval()
        for batch_index, batch_dict in enumerate(batch_generator):
            # compute the output
            y_pred = classifier(x_in=batch_dict['x_data'].float())
            # compute the loss
            loss = loss_func(y_pred, batch_dict['y_target'].float())
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            val_bar.set_postfix(loss=running_loss,
                                acc=running_acc,
                                epoch=epoch_index)
            val_bar.update()
        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)
        train_state = update_train_state(args=args,
                                         model=classifier,
                                         train_state=train_state)
        scheduler.step(train_state['val_loss'][-1])
        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()
        if train_state['stop_early']:
            break
        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()
except KeyboardInterrupt:
    print("Exiting loop")`
ValueError: num_samples should be a positive integer value, but got num_samples=0
cannot access yelp training data in google drive
Hello,
Thanks for a great book. I've been working through the examples and it's really helpful. One issue I noticed in the agnews classifier: when I ran the predict_category function, I got an error when trying to predict one of the sports-category samples:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-1b3d5180f20c> in <module>
3 print("="*30)
4 for sample in sample_group:
----> 5 pred = predict_category(sample, classifier, dc.vectorizer, dc._train_ds.max_seq_length+1)
6 print(f"Prediction: {pred['category']} (p={pred['prob']:0.2f})")
7 print(f"\t + Sample: {sample}")
<ipython-input-16-75dc8b0470ad> in predict_category(title, classifer, vectorizer, max_length)
12 """
13 title = preprocess_text(title)
---> 14 vectorized_title = torch.tensor(vectorizer.vectorize(title, max_length))
15
16 # add batch dim so you have a batch of size 1
~/nlpbook/ag/ag/vectorizer.py in vectorize(self, title, vector_len)
33
34 out_vector = np.zeros(vector_len, dtype=np.int64)
---> 35 out_vector[:len(vector)] = vector
36 out_vector[len(vector):] = self.title_vocab.mask_idx
37
ValueError: cannot copy sequence with size 22 to array axis with dimension 21
The max_seq_length in the training set was 20, and in this line in the text, we pass in max_seq_length+1, effectively making the sequence 21 tokens long:
prediction = predict_category(sample, classifier, vectorizer, dataset._max_seq_length + 1)
However, the required length is 22, so when I changed the function call to pass max_seq_length+2, it worked. This raises a more general question:
When the test data's max_seq_length could potentially be larger than that of the training data, what do we do? How do we handle that? Do we just pass in a larger value for max_seq_length? Even if we do that, how do we foresee how large a value we might need?
Thanks.
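One common way to handle the question above is to truncate any sequence that is longer than the fixed length instead of trying to guess a large enough value. The sketch below assumes that approach; pad_or_truncate is a hypothetical helper operating on plain lists of token indices, not the book's vectorize method, and whether truncation is acceptable depends on the task.

```python
def pad_or_truncate(indices, vector_len, mask_idx=0):
    """Return exactly vector_len indices: clip long inputs, pad short ones."""
    # Slicing never fails, even when the input is longer than vector_len
    clipped = list(indices[:vector_len])
    # Fill the remaining positions with the mask (padding) index
    return clipped + [mask_idx] * (vector_len - len(clipped))


# A 22-token input, as in the traceback above, fitted into 21 positions
long_title = list(range(1, 23))
fitted = pad_or_truncate(long_title, vector_len=21)
padded = pad_or_truncate([5, 6], vector_len=4)
```

A vectorizer that wraps titles in begin and end tokens would probably want to clip the title tokens first and add the end token afterwards, so that the end marker is not cut off.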
I got the error below in notebook 5_2_munging_frankenstein.ipynb.
Please help with this.
LookupError Traceback (most recent call last)
in ()
----> 1 tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
2 with open(args.raw_dataset_txt) as fp:
3 book = fp.read()
4 sentences = tokenizer.tokenize(book)
/usr/local/lib/python3.6/dist-packages/nltk/data.py in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
832
833 # Load the resource.
--> 834 opened_resource = _open(resource_url)
835
836 if format == 'raw':
/usr/local/lib/python3.6/dist-packages/nltk/data.py in _open(resource_url)
950
951 if protocol is None or protocol.lower() == 'nltk':
--> 952 return find(path, path + ['']).open()
953 elif protocol.lower() == 'file':
954 # urllib might not use mode='rb', so handle this one ourselves:
/usr/local/lib/python3.6/dist-packages/nltk/data.py in find(resource_name, paths)
671 sep = '*' * 70
672 resource_not_found = '\n%s\n%s\n%s\n' % (sep, msg, sep)
--> 673 raise LookupError(resource_not_found)
674
675
LookupError:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download('punkt')
Searched in:
- '/root/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/usr/nltk_data'
- '/usr/lib/nltk_data'
Hello
I can't understand how to use the model. It trained well, but how can I translate my own phrase? It seems possible to translate only sentences from the dataset.
It's too hard for me to redo all these operations myself: vectorize and encode, and then forward and encode...
Can somebody help me?
connection refused...
I am getting the following error if I click on the manual download link:
The requested URL was not found on this server. That’s all we know.
https://drive.google.com/open?id=1xeUnqkhuzGGzZKThzPeXe2Vf6Uu_g_xM gives a 404 error
Please provide an updated link to the exact dataset used in the book, or to an entirely new set of Yelp CSV-formatted datasets (train, test, and reviews_with_splits_lite).
The early_stopping_best_val never gets updated, so the loss is ALWAYS smaller and early stopping never happens.
This line of code is missing from the if-else statement in the update_train_state function:
train_state['early_stopping_best_val'] = loss_t
Hello,
I tried your sample code in Chapter 3 (CNN Classifier) and I found a line of code saying that:
# Stop early ?
train_state['stop_early'] = \
    train_state['early_stopping_step'] >= args.early_stopping_criteria
where args.early_stopping_criteria does not seem to be included in the args set.
I can't understand this code:
running_loss += (loss_t - running_loss) / (batch_index + 1)
I think loss_t is the loss of the current batch, and loss_t can be smaller than running_loss, so what does (loss_t - running_loss) mean? Can anyone explain what this code does?
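The update in question is the standard incremental form of an arithmetic mean: after batch k (counting from 0), running_loss holds the average of the first k + 1 batch losses. The difference loss_t - running_loss may be negative, which simply pulls the average down. A small sketch with made-up loss values; running_mean is a hypothetical helper written only for this illustration.

```python
def running_mean(values):
    """Average a sequence one element at a time, as the training loop does."""
    running = 0.0
    for index, value in enumerate(values):
        # Move the current mean towards the new value by 1 / (count so far)
        running += (value - running) / (index + 1)
    return running


batch_losses = [4.0, 2.0, 3.0]  # made-up per-batch losses
incremental = running_mean(batch_losses)
direct = sum(batch_losses) / len(batch_losses)
```

Both incremental and direct come out to the same value, so the loop reports the mean loss over the batches seen so far without storing them all.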
In the Chapter 3 notebook on preprocessing the Yelp data (LITE), there is duplicated code in blocks 4, 9, and 10. I don't have the book in front of me right now, so maybe there's a reason why you do everything two or three times, but just reading the notebook I don't see why. If I'm wrong, some comments would be good here.
Isn't this wrong if the data size is not a multiple of batch_size? Shouldn't it be:
    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset

        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return int(np.ceil(len(self)/batch_size))
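Which count is right depends on whether the loader keeps the final, incomplete batch. PyTorch's DataLoader has a drop_last flag for this: when it is set, floor division gives the correct number, and when it is not, the ceiling does. The sketch below makes that choice explicit; num_batches is a hypothetical standalone function, and the assumption that the notebook's batch generator drops the last batch should be checked against its generate_batches helper.

```python
import math


def num_batches(num_items, batch_size, drop_last=True):
    """Number of batches a loader yields for num_items examples."""
    if drop_last:
        # The incomplete final batch is discarded
        return num_items // batch_size
    # The incomplete final batch is kept as a smaller batch
    return math.ceil(num_items / batch_size)


dropped = num_batches(10, 4, drop_last=True)
kept = num_batches(10, 4, drop_last=False)
```

Here 10 examples with a batch size of 4 give two full batches, plus a third batch of 2 only when the last batch is kept.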
download.py doesn't work. And when I tried to follow the link to the train set from the README (https://github.com/delip/PyTorchNLPBook/blob/master/data/README.md), I got:
`404. That’s an error.
The requested URL was not found on this server. That’s all we know.`
What should I do to download your examples?