show-attend-and-tell-pytorch-lightning

Encoder-Decoder CNN-LSTM model with an attention mechanism for image captioning. Trained using the Microsoft COCO dataset.

Motivation

I am fascinated by image captioning because it combines the knowledge of computer vision and natural language processing.

Main Paper

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:2048-2057. [arXiv]

Paper notes:

  • The image is passed through a CNN encoder to create annotations with L locations and D features at each location.
  • The LSTM decoder is initialized by passing the mean of the annotation features into separate linear projections for the hidden and cell state.
  • At each step, an additive attention mechanism creates a weight for each of the L locations in the annotations. This weight can be thought of as the relative importance of the location. In Soft Attention, the annotations and attention weights are multiplied at each location, and the summation over the weighted feature maps creates a single context vector of size D (see the sketch after these notes).
  • The context vector and the previous word embedding are concatenated together as input to the LSTM.
  • A deep output layer projects the previous word embedding, hidden state, and context vector into the embedding dimension. These vectors are summed and passed through a projection to create logits over the vocab size.
  • The model is trained end-to-end using a cross entropy loss of the word prediction at each step.
  • An additional doubly stochastic loss is applied. It penalizes the squared difference between 1 and the sum of the attention weights over all timesteps at each location. "This can be interpreted as encouraging the model to pay equal attention to every part of the image over the course of generation." (Sec 4.2.1)
  • Use dropout and early stopping on Bleu score.
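
Below is a minimal sketch of the soft additive attention described above, assuming annotations of shape (batch, L, D) and a decoder hidden state of size decoder_dim; the module and argument names are illustrative, not the repo's exact code (see the attention module in model.py).

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Soft attention: one weight per annotation location, then a weighted sum."""
    def __init__(self, encoder_dim, decoder_dim, attention_dim):
        super().__init__()
        self.enc_proj = nn.Linear(encoder_dim, attention_dim)   # project annotations
        self.dec_proj = nn.Linear(decoder_dim, attention_dim)   # project hidden state
        self.score = nn.Linear(attention_dim, 1)                # scalar score per location

    def forward(self, annotations, hidden):
        # annotations: (batch, L, D), hidden: (batch, decoder_dim)
        att = self.score(torch.tanh(self.enc_proj(annotations) + self.dec_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(att, dim=1)               # (batch, L, 1) relative importance of each location
        context = (alpha * annotations).sum(dim=1)      # (batch, D) context vector z_t
        return context, alpha.squeeze(-1)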

Original source code: kelvinxu/arctic-captions (https://github.com/kelvinxu/arctic-captions)

Setup

Create a new conda environment

conda create --name pytorch python=3.9
conda activate pytorch

Install PyTorch. The install command can be generated at https://pytorch.org/:

conda install pytorch==1.9.0 torchvision==0.10.0 torchaudio==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge

Requirements

pip install -r requirements.txt

Create the interactive notebook kernel:

conda install ipykernel jupyter
python -m ipykernel install --user --name pytorch --display-name "pytorch"

Dataset

MS COCO (Microsoft Common Objects in Context)

Links: the images and annotation files can be downloaded from the MS COCO website (https://cocodataset.org/#download).

I downsampled the images to 256 to speed up the image loading pipeline. You can see how that is done in https://github.com/Lukeasargen/GarbageML/blob/main/resize.py.
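
The linked script does the actual resizing; as a rough sketch of the same idea (assuming the shorter side is scaled to 256 and hypothetical folder names):

from pathlib import Path
from PIL import Image

def resize_folder(src="train2014", dst="train2014_256", size=256):
    Path(dst).mkdir(exist_ok=True)
    for p in Path(src).glob("*.jpg"):
        img = Image.open(p).convert("RGB")
        w, h = img.size
        scale = size / min(w, h)  # scale the shorter side to `size`
        img.resize((round(w * scale), round(h * scale)), Image.BILINEAR).save(Path(dst) / p.name)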

Use preprocess.ipynb to build a tokenized version of the dataset. It saves the image path paired with captions in a single json file. The captions are tokenized and padded. It also saves a dictionary mapping word strings to token indexes called "vocab_stoi".
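
A rough sketch of the tokenization side of that preprocessing; "vocab_stoi" and "UNK" come from the repo, while the other special token names and the exact padding scheme are assumptions:

from collections import Counter

def build_vocab_stoi(tokenized_captions, vocab_size=6400):
    # Keep the most frequent words; everything else maps to UNK at encode time.
    counts = Counter(w for caption in tokenized_captions for w in caption)
    specials = ["<PAD>", "<START>", "<END>", "UNK"]  # <PAD>/<END> names are assumptions
    words = [w for w, _ in counts.most_common(vocab_size - len(specials))]
    return {w: i for i, w in enumerate(specials + words)}

def encode(caption, vocab_stoi, max_len):
    tokens = [vocab_stoi["<START>"]] + [vocab_stoi.get(w, vocab_stoi["UNK"]) for w in caption]
    tokens = tokens[:max_len - 1] + [vocab_stoi["<END>"]]
    return tokens + [vocab_stoi["<PAD>"]] * (max_len - len(tokens))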

Splits

          train    restval  val    test   train+restval  total
images    82783    30504    5000   5000   113287         123287
captions  414113   152634   25010  25010  566747         616767

Captions Counts

  • 122959 images have 5 captions
  • 324 images have 6 captions
  • 4 images have 7 captions

Captions Lengths (Log Scale)

caption_length_histogram

In preprocessing, I combine train and restval and just call it train. Additionally, every image is given exactly 5 captions so the built-in pytorch collate_fn can batch them.

Here are 2 test images. The captions have been decoded using a vocab size of 6400, so unknown words are decoded as "UNK".

test_samples

Model

Encoder

You can use any of these torchvision models. Just get the name string correct for the --encoder_arch argument.

resnet18, resnet34, resnet50, resnet101, resnet152,
resnext50_32x4d, resnext101_32x8d, wide_resnet50_2, wide_resnet101_2,
squeezenet1_0, squeezenet1_1,
densenet121, densenet169, densenet201, densenet161,
shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5*, shufflenet_v2_x2_0*,
mobilenet_v2, mobilenet_v3_large, mobilenet_v3_small,
mnasnet0_5, mnasnet0_75*, mnasnet1_0, mnasnet1_3*

*supported but no pretrained version available

The code that modifies the torchvision models is in get_encoder() in model.py. Below is a simplified snippet that does the same thing; check the full function for details.

from torchvision import models
import torch.nn as nn

def get_encoder(args):
    m = models.__dict__[args.encoder_arch](pretrained=args.pretrained)
    layers = list(m.children())[:-1]  # Remove the classifier layers (the full function handles each arch)
    # final_dim and final_size are set by which arch is used
    # A 1x1 conv projects the feature maps to encoder_dim channels
    layers.append(nn.Conv2d(final_dim, args.encoder_dim, kernel_size=1, stride=1, bias=True))
    # Pool or upsample the spatial grid to encoder_size x encoder_size
    if args.encoder_size < final_size:
        layers.append(nn.AdaptiveAvgPool2d((args.encoder_size, args.encoder_size)))
    elif args.encoder_size > final_size:
        layers.append(nn.Upsample((args.encoder_size, args.encoder_size), mode="bilinear", align_corners=False))
    layers.append(FlattenShuffle())  # defined in model.py; flattens the grid and moves channels last
    norm = Normalize(args.mean, args.std, inplace=True)  # defined in model.py; normalizes the input images
    return nn.Sequential(norm, *layers)
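
For reference, the encoder then maps a batch of images to the annotation grid consumed by the attention module; the shape below is an assumption based on FlattenShuffle flattening the spatial grid and moving channels last:

encoder = get_encoder(args)
annotations = encoder(imgs)  # assumed shape (batch, L, D) = (batch, encoder_size**2, encoder_dim)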

Decoder

The decoder is built in the SAT __init__, which assigns nn.Modules as attributes of the SAT pl.LightningModule. Here is a minimal forward pass with greedy decoding.

annotations = self.encoder(img)                 # (batch, L, D) feature locations
h, c = self.init_lstm(annotations)              # init hidden/cell from the mean annotation
prev_words = torch.LongTensor([[self.stoi("<START>")]*img.shape[0]]).to(self.device)
for step in range(caplen):
    embed_prev_words = self.embedding(prev_words)
    zt, alpha = self.attention(annotations, h[-1])  # context vector and attention weights
    beta = torch.sigmoid(self.beta(h[-1]))          # gating scalar applied to the context vector
    h_in = torch.cat([embed_prev_words, beta*zt], dim=1).unsqueeze(0) # dim=0 is the sequence length, which is 1 in greedy decoding
    _, (h, c) = self.lstm(h_in, (h, c))
    logit = self.deep_output(embed_prev_words, h[-1], zt)
    probs = F.softmax(logit, dim=1)
    prev_words = torch.argmax(probs, dim=1)         # greedy: take the most likely word

Training

The model is trained end-to-end with label smoothing cross entropy loss on the word outputs and a doubly stochastic loss (Sec 4.2.1 Equation 14).

loss = LabelSmoothing()(logits_packed, targets_packed)  # word-level cross entropy with label smoothing
loss += self.hparams.att_gamma*((1-alphas.sum(dim=1))**2).mean()  # doubly stochastic attention penalty
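
LabelSmoothing is defined in model.py. As a sketch of one common label smoothing cross entropy formulation (assuming packed logits of shape (N, vocab_size) and packed integer targets of shape (N,); not necessarily the repo's exact implementation):

import torch.nn as nn
import torch.nn.functional as F

class LabelSmoothing(nn.Module):
    def __init__(self, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing

    def forward(self, logits, targets):
        log_probs = F.log_softmax(logits, dim=-1)
        nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(1)).squeeze(1)  # cross entropy term
        uniform = -log_probs.mean(dim=-1)                                       # uniform (smoothed) term
        return ((1 - self.smoothing) * nll + self.smoothing * uniform).mean()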

Pytorch Lightning Utilities

The pl.Trainer implements these utilities and callbacks (a configuration sketch follows the list):

  • Gradient Clipping - uses nn.utils.clip_grad_norm_ and nn.utils.clip_grad_value_
  • Validation Interval - check_val_every_n_epoch
  • Early Stopping - callbacks.early_stopping.EarlyStopping
  • Model Checkpointing - callbacks.model_checkpoint.ModelCheckpoint
  • Mixed Precision - setting precision=16 uses the native amp in PyTorch
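
A hedged sketch of how these settings map onto the pl.Trainer in the PyTorch Lightning 1.x releases used here; the monitored metric name and the clipping values are assumptions, not the repo's defaults:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

trainer = pl.Trainer(
    gpus=1,
    precision=16,                    # native PyTorch AMP mixed precision
    gradient_clip_val=5.0,           # clips the gradient norm by default
    gradient_clip_algorithm="norm",  # "value" switches to clip_grad_value_
    check_val_every_n_epoch=1,
    callbacks=[
        EarlyStopping(monitor="val_bleu4", mode="max", patience=5),  # metric name assumed
        ModelCheckpoint(monitor="val_bleu4", mode="max"),
    ],
)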

Training Methodology

I ran a bunch of small experiments on a subset of MSCOCO to help me understand some characteristics of the model.

Note: experiments 0-4 use the wrong bleu4 calculation. I forgot to remove the padding tokens, and because the model never outputs padding tokens the scores are very low. This is fixed for experiments >4.

Experiment 0 (Adam vs Adamw) and Experiment 1 (Dropout)

  • v15: --opt=adam --dropout=0.2
  • v16: --opt=adamw --dropout=0.2
  • v17: --opt=adam --dropout=0.5

experiment_0_to_1

Using weight decay with adamw (v16) makes no difference from regular adam (v15). Using more dropout (v17) helped this small model.

Results: Use opt=adam and dropout=0.5.

Experiment 2 (Pretrain and Finetune), Experiment 3 (Lower lr), and Experiment 4 (reduce encoder_dim)

  • v18: --encoder_lr=4e-3 --decoder_lr=4e-3
  • v19: --pretrained --decoder_lr=4e-3
  • v20: --pretrained --encoder_finetune --encoder_lr=4e-3 --decoder_lr=4e-3
  • v21: --pretrained --encoder_finetune --encoder_lr=4e-3 --decoder_lr=1e-3
  • v22: --pretrained --encoder_finetune --encoder_lr=1e-5 --decoder_lr=1e-3
  • v23: --pretrained --encoder_dim=256 --encoder_lr=1e-5 --decoder_lr=1e-3
  • v24: --pretrained --encoder_finetune --encoder_dim=256 --encoder_lr=1e-5 --decoder_lr=1e-3

experiments_2_to_4

Training from scratch (v18) is a terrible idea. A pretrained encoder does better (v19), and if you fine tune the encoder (v20) the bleu score improves. Also, lowering the learning rate on the decoder (v21) and then the encoder (v22) performs best on a model of this size (hidden size of 128 and one layer).

Reducing the feature dimension with a randomly initialized 1x1 convolution decreases the validation score (v23). If you reduce the features and also fine tune the encoder (v24), you can achieve results close to a poorly fine tuned encoder (v21, where the learning rate was too high). I will note that the reduced output dimension took 14% less time to complete the same number of steps and used less gpu memory.

On fine tuning: I believe fine tuning the encoder helps by allowing the encoder to become invariant to the new training augmentations. With a frozen encoder, the decoder is forced to learn this invariance on its own; if the encoder is fine tuned, the signal from the decoder can push the encoder to learn the invariance and produce more useful features. Additionally, the pretraining dataset may not have covered the captioning dataset's input space, so the output features may have weak representations.

On reducing feature dimension: Adding a new 1x1 convolution usually underperforms the pretrained features. The new layer always requires training, so its quality is driven by the training data; on a smaller dataset these features are prone to overfitting, which likely explains why they underperform the features transferred from ImageNet pretraining.

Results: Use lower learning rates and finetune a pretrained encoder.

Code improvements for experiments >4: The primary fix was to remove special tokens when computing the bleu score.
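
As a sketch of the corrected metric using NLTK's corpus_bleu (the repo uses its own bleu implementation; the special token names here are assumptions):

from nltk.translate.bleu_score import corpus_bleu

SPECIAL = {"<PAD>", "<START>", "<END>"}  # token names assumed

def bleu4(references, hypotheses):
    # references: one list of reference token lists per image; hypotheses: one token list per image
    refs = [[[w for w in ref if w not in SPECIAL] for ref in image_refs] for image_refs in references]
    hyps = [[w for w in hyp if w not in SPECIAL] for hyp in hypotheses]
    return corpus_bleu(refs, hyps, weights=(0.25, 0.25, 0.25, 0.25))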

The model architecture was also improved to follow the deep output layer (Equation 7) and to support a multilayer LSTM. Adding a second layer did not help on this small subset of 32k training samples (bleu4 dropped from 18.82 with 1 layer to 18.26 with 2 layers). I added embedding normalization. I made a bucket sampler to feed batches of similar-length sequences. Additionally, a quick run with the single layer model and the attention gamma in the loss increased from 1 to 2 dropped bleu4 from 18.82 to 16.41, so an attention gamma of 1 is fine. The final improvement was teacher forcing with scheduling.
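
A minimal sketch of the bucket-sampler idea, sorting by caption length and batching neighbours so padding is minimized; the class and argument names are illustrative, not the repo's --bucket_sampler implementation:

import random
from torch.utils.data import Sampler

class BucketSampler(Sampler):
    # Yields batches of dataset indexes whose captions have similar lengths.
    def __init__(self, lengths, batch_size, shuffle=True):
        self.lengths, self.batch_size, self.shuffle = lengths, batch_size, shuffle

    def __iter__(self):
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batches = [order[i:i + self.batch_size] for i in range(0, len(order), self.batch_size)]
        if self.shuffle:
            random.shuffle(batches)  # shuffle the batch order, keep similar lengths together
        yield from batches

    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size

It would be used through DataLoader(dataset, batch_sampler=BucketSampler(caption_lengths, batch_size=160)).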

Experiment 5 (Teacher Forcing Schedule)

The value in parentheses is a hard-coded factor I changed: factor*epochs is the epoch at which the sigmoid crosses 0.5, so 1.0 means epsilon reaches 0.5 at the final epoch.
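
One way to parameterize such a schedule is sketched below; this is an assumption about the shape of the curve, not the repo's exact formula (the steepness tau in particular is made up):

import math

def inv_sigmoid_epsilon(epoch, total_epochs, factor=1.0, tau=None):
    # Probability of feeding the ground-truth word (teacher forcing) at this epoch.
    # Crosses 0.5 at epoch = factor * total_epochs; tau controls how fast it falls.
    tau = tau if tau is not None else total_epochs / 10  # steepness is an assumption
    return 1.0 / (1.0 + math.exp((epoch - factor * total_epochs) / tau))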

  • v39: bleu4=16.72, --decoder_tf=inv_sigmoid(0.5) --dropout=0.1
  • v40: bleu4=20.28, --decoder_tf=inv_sigmoid(0.7) --dropout=0.0
  • v41: bleu4=20.65, --decoder_tf=inv_sigmoid(0.8) --dropout=0.2
  • v42: bleu4=21.79, --decoder_tf=inv_sigmoid(1.0) --dropout=0.5
  • v43: bleu4=22.06, --decoder_tf=inv_sigmoid(1.2) --dropout=0.5
  • v44: bleu4=20.27, --decoder_tf=always --dropout=0.5
  • v45: bleu4= 8.11, --decoder_tf=None --dropout=0.5

experiments_5_bleu4

Use v45 as a control, because it has no teacher forcing. The first attempts (v39, v40, v41) did so badly that I thought reducing dropout would help increase the gradient signal. It turns out that dropout was not the controlling variable: v44 always used teacher forcing and it underperforms compared to the scheduled v42 and v43. The main takeaway is that simply using more teacher forcing for longer improved validation bleu.

Given how poorly the decoder does without teacher forcing, I did not test the linear and exponential schedules; both drop epsilon much more quickly than inverse sigmoid. I think all methods would benefit from more "warm up" steps with teacher forcing always on.

Results: Use more teacher forcing for longer.

Experiment 6 (Optical Transformations and Fine Tuning)

  • v58: bleu4=23.06, --aug_optical_strength=0.0
  • v60: bleu4=23.87, --aug_optical_strength=0.0 --encoder_finetune
  • v61: bleu4=21.90, --aug_optical_strength=0.2
  • v62: bleu4=23.71, --aug_optical_strength=0.2 --encoder_finetune

experiments_6_bleu4

Fine tuning the encoder (v60, v62) is beneficial regardless of the input transformations. Further, with this set of 180k image-caption pairs, the performance with fine tuning was about the same for both transformation strengths. Fine tuning the shufflenet_v2_x0_5 encoder took on average 32% longer to train.

Results: Fine tune the encoder if possible. Different image augmentations do not substantially change performance when fine tuning.

Experiment 7 (Label Smoothing)

This experiment is about decoder performance. The encoder is frozen and the input transformations are just horizontal flip, color jitter, and gaussian noise.

  • v63: bleu4=22.34, label_smoothing=0.0
  • v64: bleu4=22.58, label_smoothing=0.05
  • v65: bleu4=22.74, label_smoothing=0.1
  • v66: bleu4=22.82, label_smoothing=0.2
  • v67: bleu4=23.18, label_smoothing=0.4
  • v68: bleu4=22.68, label_smoothing=0.6

experiments_7_bleu4

A higher smoothing appears to be beneficial at higher learning rates; this is visible in the bleu4 chart early in training (before step 3k). It also looks like higher smoothing does not level off or lose score near the end of training; the slope of the line is still positive.

Results: Any label smoothing outperforms one-hot labels. This model is robust to most smoothing values. Smooth labels help the model continue to make improvements late in training.

Experiment 8 (Teacher Forcing and Learning Rate Schedule)

  • v70: bleu4=23.79, --decoder_tf=always --scheduler=plateau
  • v71: bleu4=24.08, --decoder_tf=inv_sigmoid (match v70 schedule)
  • v72: bleu4=24.61, --decoder_tf=inv_sigmoid --scheduler=plateau
  • v73: bleu4=22.88, --decoder_tf=inv_sigmoid --scheduler=exp --lr_gamma=0.9
  • v74: bleu4=24.54, --decoder_tf=inv_sigmoid --scheduler=cosine

experiments_8_bleu4

Always teacher forcing (v70) underperforms inv_sigmoid (v71) with the same learning rate schedule. Using a plateau scheduler and inv_sigmoid (v72) had the best bleu4 score. An exponential schedule (v73) did the worst. The cosine annealing with restarts (v74) had an issue where it ended training on a high learning rate; I modified the schedule creation to always end on the lowest learning rate possible.

Results: Inverse Sigmoid is best and use cosine or plateau schedule.

Results: You do not need to normalize pretrained embeddings, but you should normalize randomly initialized embeddings. The bleu scores are ambiguous, but the pretrained embeddings reach lower perplexity.

Summary of Experiments

  • adamw with a little weight decay
  • dropout fairly high, at least 0.1
  • pretrained encoder, fine tuned with an lr around 1e-5
  • decoder lr scales with batch size: with batch=160, start with lr=2e-3; if you increase the batch, you can increase the lr
  • always use --deep_output; without it the decoder ignores the image
  • a 1 layer lstm is fine
  • --bucket_sampler saves time
  • keep att_gamma at 1.0
  • teacher forcing with an inv_sigmoid schedule
  • weak image transformations; not too much cropping
  • label smoothing between 0.1 and 0.4 is beneficial
  • cosine or plateau lr schedule; make sure the lr is at least 1000 times lower at the end of training
  • pretrained glove embeddings (non-normalized, embedding_lr=1e-5) or randomly initialized embeddings (normalized, embedding_lr=2e-2)

Things not tested that just kind of work

  • lr_warmup_steps with a few hundred steps
  • --grad_clip=norm --clip_value=5.0

Sampling a Caption

The steps for training match the inference steps, with a slight modification to how the RNN/LSTM decoder outputs are interpreted. The output of the decoder is a probability distribution over the vocabulary, and it must be sampled to get the predicted word. The simplest method is to pick the most likely word; this is called greedy search. An extension of this idea is beam search: a breadth-first search that tracks several candidate sequences and at each step keeps the topk most likely. Greedy and beam search always select the most likely words, which can lead to common phrases being oversampled in the generation. I also tried sampling from the whole distribution using torch.multinomial. The results were poor at first, but rescaling the probabilities into a sharper distribution produced captions with many more unique words that still follow language rules.
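
A sketch of that rescaled sampling, written as a temperature below 1 to sharpen the distribution (the exact rescaling used in model.py may differ):

import torch
import torch.nn.functional as F

def sample_next_word(logits, temperature=0.5):
    # Temperature < 1 sharpens the distribution before sampling one token per batch item.
    probs = F.softmax(logits / temperature, dim=-1)  # logits: (batch, vocab_size)
    return torch.multinomial(probs, num_samples=1).squeeze(1)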

The best way to understand the sampling process is to look in model.py in the forward method. Nearly every line is commented with an explanation.

Results

[WIP]

models with size 128, 256, 512

Example Predictions

Good Predictions

Inaccurate Predictions

Funny Predictions

References

Main Paper

  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:2048-2057. [arXiv]

Papers, repositories, or any site that taught me something useful:

  • https://pytorch-lightning.readthedocs.io/en/latest/

  • Microsoft COCO Captions : https://arxiv.org/abs/1504.00325

  • https://www.nltk.org/api/nltk.translate.html - Bleu Score

  • sgrvinod, a-PyTorch-Tutorial-to-Image-Captioning (https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning)

    • similar preprocessing
    • freeze the encoder by setting requires_grad=False in encoder init
    • permute output of encoder
    • attention block (relu is wrong, I use tanh)
    • decoder forward with reducing batch size (my implementation uses different indexing to avoid sorting)
    • gradient clipping (I used pl clipping, which calls pt clipping)
    • train.py L186 - I took the code for doubly stochastic attention regularization
    • early stopping on bleu4 metric
  • AaronCCWong, Show-Attend-And-Tell (https://github.com/AaronCCWong/Show-Attend-and-Tell)

    • clean additive attention module
    • decoder.py L113 - beam search: add the scores to the output and topk over a 1d list, then use // and % to get the vocab and sequence indexes
  • Ronald J. Williams, David Zipser; A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Comput 1989; 1 (2): 270–280. doi: https://doi.org/10.1162/neco.1989.1.2.270

    • 3.2 Teacher-Forced Real-Time Recurrent Learning
  • Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting on association for computational linguistics (p./pp. 311--318).

    • I wrote my own bleu score function using this original paper
  • Pascanu, R., Gülçehre, Ç., Cho, K., & Bengio, Y. (2014). How to Construct Deep Recurrent Neural Networks. CoRR, abs/1312.6026. [arXiv]

    • 1 Introduction - Stacks of RNNs potentially operate at different timescales
    • 3.3.2 Deep Output RNN - intermediate layers can be used to compute the output
    • 3.3.3 Stacked RNN - all hidden states can possibly be used to compute the output; also the input can be fed to all layers, making shortcut connections
  • Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3156-3164. [arXiv]

    • 3.1. LSTM-based Sentence Generator - sampled with a beam width of 20
    • 4.3.1 Training Details - avoid overfitting by starting with a pretrained Encoder. pretrained word embeddings had no significant impact. use dropout. embedding and hidden size of 512
    • 4.3.3 Transfer Learning, Data Size and Label Quality - "we see gains by adding more training data since the whole process is data-driven and overfitting prone"
    • 4.3.4 Generation Diversity Discussion - "If we take the best candidate, the sentence is present in the training set 80% of the times."
  • Bengio, S., Vinyals, O., Jaitly, N. & Shazeer, N. (2015). Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks.. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama & R. Garnett (eds.), NIPS (p./pp. 1171-1179). [arXiv]

    • 1 Introduction - discrepancy between training with teacher forcing and inference. inference mistakes made early can be amplified by pushing the model to a state space not seen during training. use a curriculum to force the model to deal with its own mistakes
    • 2.4 Bridging the Gap with Scheduled Sampling - sample the true input at each step with epsilon probability. decrease epsilon over training
    • 4.1 Image Captioning - dropout had a negative impact on metrics besides log likelihood. random sampling for an entire sequence rather than at each step had bad results
  • Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML 2017. arXiv:1706.04599 [arXiv]

    • 4.2 Extension to Multiclass Models - "The method to get an optimal temperature T for a trained model is through minimizing the negative log likelihood for a held-out validation dataset."
  • Inan, H., Khosravi, K., & Socher, R. (2017). Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling [arXiv]

    • Reusing the embedding matrix as the output projection matrix is approximately the same as using a KL-divergence loss between the y-hat prediction and the y* empirical target distribution

Issues

TypeError: iteration over a 0-d tensor

When I run train.py, the following error occurs at the first epoch.

Epoch 0: 0%| | 0/128000 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 259, in
main(args)
File "train.py", line 253, in main
val_dataloaders=val_loader
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
self._run(model)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
self.dispatch()
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
self.accelerator.start_training(self)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
return self.run_train()
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
self.train_loop.run_training_epoch()
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 442, in optimizer_step
using_lbfgs=is_lbfgs,
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 329, in optimizer_step
self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in run_optimizer_step
self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 193, in optimizer_step
optimizer.step(closure=lambda_closure, **kwargs)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/optim/optimizer.py", line 89, in wrapper
return func(*args, **kwargs)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/optim/adam.py", line 66, in step
loss = closure()
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 733, in train_step_and_backward_closure
split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
training_step_output = self.trainer.accelerator.training_step(args)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
return self.training_type_plugin.training_step(*args)
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 155, in training_step
return self.lightning_module.training_step(*args, **kwargs)
File "/media/user/New Volume/CODE/internet/22.7.2021/Show-Attend-and-Tell-Pytorch-Lightning-main/model.py", line 535, in training_step
logits_packed, targets_packed, alphas = self.train_batch(batch, epsilon)
File "/media/user/New Volume/CODE/internet/22.7.2021/Show-Attend-and-Tell-Pytorch-Lightning-main/model.py", line 468, in train_batch
if not any(incomplete_idxs): break # All the captions are done
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/tensor.py", line 585, in iter
raise TypeError('iteration over a 0-d tensor')
TypeError: iteration over a 0-d tensor
Exception ignored in: <function tqdm.__del__ at 0x7fceeff11dd0>
Traceback (most recent call last):
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/tqdm/std.py", line 1152, in del
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/tqdm/std.py", line 1306, in close
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/tqdm/std.py", line 1499, in display
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/tqdm/std.py", line 1155, in str
File "/home/user/anaconda3/envs/pytorch/lib/python3.7/site-packages/tqdm/std.py", line 1457, in format_dict

Question about Attention Module

Hi!

I have a small question about the "Attention module" part of your code.

Before passing final attention linear layer, there is tanh for non-linearity not ReLU.
And "flattened.shape[1]**-0.5" is multiplied after final attention.

Is there a special reason for using tanh not ReLU?
And why is that value multiplied?

Original code below.
att = self.f_att(torch.tanh(att_enc+att_dec))*flattened.shape[1]**-0.5 # att.shape = (batch, locations, 1)
