
multimodal's Introduction


TorchMultimodal (Beta Release)

Models | Example scripts | Getting started | Code overview | Installation | Contributing | License

Introduction

TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale, including both content understanding and generative models. TorchMultimodal contains:

  • A repository of modular and composable building blocks (fusion layers, loss functions, datasets and utilities).
  • A collection of common multimodal model classes built up from said building blocks with pretrained weights for canonical configurations.
  • A set of examples that show how to combine these building blocks with components and common infrastructure from across the PyTorch Ecosystem to replicate state-of-the-art models published in the literature. These examples should serve as baselines for ongoing research in the field, as well as a starting point for future work.

Models

TorchMultimodal contains a number of models, including ALBEF, BLIP-2, CLIP, CoCa, DDPM, FLAVA, MAE, MDETR, MUGEN, and Omnivore.

Example scripts

In addition to the above models, we provide example scripts for training, fine-tuning, and evaluating models on popular multimodal tasks. Examples can be found under examples/ and include the following:

Model     Supported Tasks
ALBEF     Retrieval, Visual Question Answering
DDPM      Training and Inference (notebook)
FLAVA     Pretraining, Fine-tuning, Zero-shot
MDETR     Phrase grounding, Visual Question Answering
MUGEN     Text-to-video retrieval, Text-to-video generation
Omnivore  Pre-training, Evaluation

Getting started

Below we give minimal examples of how you can write a simple training or zero-shot evaluation script using components from TorchMultimodal.

FLAVA zero-shot example
import torch
from PIL import Image
from torchmultimodal.models.flava.model import flava_model
from torchmultimodal.transforms.bert_text_transform import BertTextTransform
from torchmultimodal.transforms.flava_transform import FLAVAImageTransform

# Define helper function for zero-shot prediction
def predict(zero_shot_model, image, labels):
  zero_shot_model.eval()
  with torch.no_grad():
      image = image_transform(image)["image"].unsqueeze(0)
      texts = text_transform(labels)
      _, image_features = zero_shot_model.encode_image(image, projection=True)
      _, text_features = zero_shot_model.encode_text(texts, projection=True)
      scores = image_features @ text_features.t()
      probs = torch.nn.Softmax(dim=-1)(scores)
      label = labels[torch.argmax(probs)]
      print(
          "Label probabilities: ",
          {labels[i]: probs[:, i] for i in range(len(labels))},
      )
      print(f"Predicted label: {label}")


image_transform = FLAVAImageTransform(is_train=False)
text_transform = BertTextTransform()
zero_shot_model = flava_model(pretrained=True)
img = Image.open("my_image.jpg")  # point to your own image
predict(zero_shot_model, img, ["dog", "cat", "house"])

# Example output:
# Label probabilities:  {'dog': tensor([0.80590]), 'cat': tensor([0.0971]), 'house': tensor([0.0970])}
# Predicted label: dog
MAE training example
import torch
from torch.utils.data import DataLoader
from torchmultimodal.models.masked_auto_encoder.model import vit_l_16_image_mae
from torchmultimodal.models.masked_auto_encoder.utils import (
  CosineWithWarmupAndLRScaling,
)
from torchmultimodal.modules.losses.reconstruction_loss import ReconstructionLoss
from torchmultimodal.transforms.mae_transform import ImagePretrainTransform

mae_transform = ImagePretrainTransform()
dataset = MyDatasetClass(transforms=mae_transform)  # you should define this
dataloader = DataLoader(dataset, batch_size=8)

# Instantiate model and loss
mae_model = vit_l_16_image_mae()
mae_loss = ReconstructionLoss()

# Define optimizer and lr scheduler
optimizer = torch.optim.AdamW(mae_model.parameters())
lr_scheduler = CosineWithWarmupAndLRScaling(
  optimizer, max_iters=1000, warmup_iters=100  # you should set these
)

# Train one epoch
for batch in dataloader:
  optimizer.zero_grad()  # clear gradients from the previous step
  model_out = mae_model(batch["images"])
  loss = mae_loss(model_out.decoder_pred, model_out.label_patches, model_out.mask)
  loss.backward()
  optimizer.step()
  lr_scheduler.step()

Code overview

torchmultimodal/diffusion_labs: contains components for building diffusion models. For more details on these components, see diffusion_labs/README.md.

torchmultimodal/models: look here for model classes as well as any other modeling code specific to a given architecture. E.g. the directory torchmultimodal/models/blip2 contains modeling components specific to BLIP-2.

torchmultimodal/modules: look here for common, generic building blocks that can be stitched together to build a new architecture. This includes layers like codebooks, patch embeddings, or transformer encoders/decoders; losses like contrastive loss with temperature or reconstruction loss; encoders like ViT and BERT; and fusion modules like Deep Set fusion.
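
As a small illustration of how these building blocks compose, the sketch below pairs the contrastive loss with arbitrary encoder outputs. It is a minimal example, not taken from the repo, and assumes only that ContrastiveLossWithTemperature accepts two batches of embeddings with matching dimensions.

import torch
from torchmultimodal.modules.losses.contrastive_loss_with_temperature import (
    ContrastiveLossWithTemperature,
)

# Stand-ins for the outputs of any image/text encoders (batch of 8, embedding dim 512)
image_embeddings = torch.randn(8, 512)
text_embeddings = torch.randn(8, 512)

contrastive_loss = ContrastiveLossWithTemperature()
loss = contrastive_loss(image_embeddings, text_embeddings)
loss.backward()  # the learnable temperature parameter also receives a gradient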

torchmultimodal/transforms: look here for common data transforms from popular models, e.g. CLIP, FLAVA, and MAE.

Installation

TorchMultimodal requires Python >= 3.8. The library can be installed with or without CUDA support. The following assumes conda is installed.

Prerequisites

  1. Create and activate a conda environment

    conda create -n torch-multimodal python=<python_version>
    conda activate torch-multimodal
    
  2. Install pytorch, torchvision, and torchaudio. See PyTorch documentation.

    # Use the current CUDA version as seen here: https://pytorch.org/get-started/locally/
    # Select the nightly PyTorch build, Linux as the OS, and conda. Pick the most recent CUDA version.
    conda install pytorch torchvision torchaudio pytorch-cuda=<cuda_version> -c pytorch-nightly -c nvidia
    
    # For CPU-only install
    conda install pytorch torchvision torchaudio cpuonly -c pytorch-nightly
    

Install from binaries

Nightly binaries on Linux for Python 3.8 and 3.9 can be installed via pip wheels. For now we only support the Linux platform through PyPI.

python -m pip install torchmultimodal-nightly

Building from Source

Alternatively, you can build from our source code and run our examples:

git clone --recursive https://github.com/facebookresearch/multimodal.git multimodal
cd multimodal

pip install -e .

For developers please follow the development installation.

Contributing

We welcome any feature requests, bug reports, or pull requests from the community. See the CONTRIBUTING file for how to help out.

License

TorchMultimodal is BSD licensed, as found in the LICENSE file.


multimodal's Issues

Incremental addition of the new modality

🚀 The feature, motivation and pitch

🤗 Hello! Thank you for your work!

I see that this repo has model configurations that work with certain modalities, which is great.

I have a question though: what if I have a pretrained encoder for another modality (e.g. audio) and data for training (audio-text pairs and audio-image pairs)?

  • How can I train a model that will be able to solve tasks with my new modality?
  • In other words, which components should I use to fuse the new modality with the other ones? Should I implement a new model, or can I use existing components as fusers?

Alternatives

No response

Additional context

It would be great if a user who has N pretrained encoders for arbitrary modalities could pass them to some fuser model and train it to solve cross-modal tasks, or add the new modality to an existing model.
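
Purely as an illustrative sketch of the idea (this is not TorchMultimodal's fusion API), late fusion over N pretrained encoders could look roughly like the following, with the encoders frozen and only a small fusion head trained:

import torch
from torch import nn

class LateFusionHead(nn.Module):
    """Toy late-fusion module: concatenate per-modality embeddings and project."""

    def __init__(self, encoders: dict, embed_dims: dict, out_dim: int):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)  # e.g. {"audio": ..., "text": ...}
        self.proj = nn.Linear(sum(embed_dims.values()), out_dim)

    def forward(self, inputs: dict) -> torch.Tensor:
        feats = [self.encoders[name](x) for name, x in inputs.items()]
        return self.proj(torch.cat(feats, dim=-1))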

[FLAVA] Can't Access ImageNet

🚀 The feature, motivation and pitch

I already have a huggingface account with an access token, but when I follow the readme to launch pretraining by running "python -m flava.train config=flava/configs/pretraining/debug.yaml", I first get the following failing log:
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/2.0.0/datasets/imagenet-1k/imagenet-1k.py (ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=100)")))

I thought it was a connection issue, but when I type the URL https://raw.githubusercontent.com/huggingface/datasets/2.0.0/datasets/imagenet-1k/imagenet-1k.py directly into Chrome, I get "404 not found".

I want to know how to access the dataset. Thanks a lot!

Alternatives

No response

Additional context

No response

CoCa model implementation

🚀 The feature, motivation and pitch

Thank you for your awesome works!
I have some questions about CoCa model implementation.

In */multimodal/torchmultimodal/models/coca/coca_model.py, it seems like we can decide whether using CascadedAttentionPooler or just single AttentionPooler.
However, when using CascadedAttentionPooler, dimensions are not matched at the second loop.

For example, after vision feature is extracted from VisionEncoder and its feature has shape of (B, h*w, dim).
It has to pass through vision_pooler layers (pooled_outputs = self.vision_pooler(image_embeddings)) and when using CascadedAttentionPooler, 'self.vision_pooler' class has 2 sequential AttentionPooler layers.
After passed through 1st AttentionPooler layer, feature has shape of (B,256,q_dim) and it doesn't matched with the LayerNorm in the second loop which is supporting 'dim', not 'q_dim'.
Is it okay if I arbitrarily modify the input dimension of the second AttnetionPooler layer?

Similary, when using 'vision_cls_token' with CascadedAttentionPooler, shape of vision feature is (B, h*w + 1(cls), dim) (e.g., B,1025,768).
And at the vision_pooler layer, it return learnable tokens after cross-attention with vision feature and it has (B,256,q_dim) shape for each captioning_image_embeddings, and contrastive_image_embeddings, respectively.
If you intended to not using visual features directly, is it necessary to add 'cls_token' at the initial stage?
I mean, what is the purpose of adding 'cls_token' at the front of visual features even though, you're not using them directly.

Thank you again!

Alternatives

No response

Additional context

No response

ModuleNotFoundError: No module named 'datasets'

Issue description

When trying to run torchmultimodal/examples/flava/train.py, I get the following error:

ModuleNotFoundError: No module named 'datasets'.

torchmultimodal/torchmultimodal/datasets seems to be empty.

mini-imageNet

Due to the large size of the ImageNet dataset, I am using the MiniImageNet dataset. I modified the YAML file accordingly.
datasets:
  _target_: flava.definitions.TrainingDatasetsInfo
  selected:
    - image
    - vl
    - text
  image:
    _target_: flava.definitions.TrainingSingleDatasetInfo
    train:
      - _target_: flava.definitions.HFDatasetInfo
        key: mini_train
        subset: default
        data_dir: >-
          /home/liumaofu/hyy/multimodal/examples/flava/mini/ok/train/
    val:
      - _target_: flava.definitions.HFDatasetInfo
        key: mini_val
        subset: default
        data_dir: >-
          /home/liumaofu/hyy/multimodal/examples/flava/mini/ok/val/
At the same time, I modified the examples/flava/data/utils.py file:
def build_datasets_from_info(dataset_infos: List[HFDatasetInfo], split: str = "train"):
    dataset_list = []
    for dataset_info in dataset_infos:
        print(f"Loading dataset from {dataset_info.data_dir}")

        current_dataset = load_from_disk(dataset_info.data_dir)

        if dataset_info.remove_columns is not None:
            current_dataset = current_dataset.remove_columns(dataset_info.remove_columns)
        if dataset_info.rename_columns is not None:
            for rename in dataset_info.rename_columns:
                current_dataset = current_dataset.rename_column(rename[0], rename[1])

        dataset_list.append(current_dataset)

    return concatenate_datasets(dataset_list)

However, when executing python -m flava.train config=flava/configs/pretraining/debug.yaml, an error is reported: Directory /home/liumaofu/hyy/multimodal/examples/flava/mini/ok/train/ is neither a dataset directory nor a dataset dict directory.
The structure of my miniimagenet dataset is as follows:
miniImagenet
|-- train
| |-- class1
| | |-- image1.jpg
| | |-- image2.jpg
| | |-- ...
| |-- class2
| | |-- image1.jpg
| | |-- image2.jpg
| | |-- ...
| |-- ...
|-- val
| |-- class1
| | |-- image1.jpg
| | |-- image2.jpg
| | |-- ...
| |-- class2
| | |-- image1.jpg
| | |-- image2.jpg
| | |-- ...
| |-- ...
|-- test
| |-- class1
| | |-- image1.jpg
| | |-- image2.jpg
| | |-- ...
| |-- class2
| | |-- image1.jpg
| | |-- image2.jpg
| | |-- ...
| |-- ..
I have made sure their storage paths are not the problem. May I ask why this error is reported, and what should I do?
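
If the directories are plain class-per-folder image trees like the one above, one possible workaround (my assumption, not an official fix) is to first build a Hugging Face dataset from the folder and save it in the on-disk format that load_from_disk expects:

from datasets import load_dataset

# Build an HF dataset from a raw image folder; the "imagefolder" builder
# is available in recent versions of the `datasets` library.
ds = load_dataset("imagefolder", data_dir="/home/liumaofu/hyy/multimodal/examples/flava/mini/ok/train/")
# Hypothetical output path; load_from_disk should then accept this directory.
ds.save_to_disk("/home/liumaofu/hyy/multimodal/examples/flava/mini/ok/train_hf/")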

Cannot fine-tune CLIP model in GPU

🚀 The feature, motivation and pitch

Hello!
I'm now fine-tuning the clip_vit_b16 model from models.clip.model. I also use CLIPTransform and torchvision's CocoCaptions to perform image and text preprocessing on the MSCOCO dataset. The input data to the model is an image-text pair (one image corresponds to one of its five captions). But when I loaded the image, text, and model onto the GPU using .to(device), the program reported the error "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!". May I ask how to solve it?
clip.txt is my code, please check it.
clip.txt
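
For reference, a generic pattern that usually resolves this error (a sketch only, since I have not seen clip.txt) is to move the model once and then move every batch tensor inside the loop:

import torch
from torchmultimodal.models.clip.model import clip_vit_b16

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = clip_vit_b16().to(device)  # weight loading omitted in this sketch

for images, texts in dataloader:  # assumes a DataLoader yielding preprocessed tensors
    images, texts = images.to(device), texts.to(device)  # move *every* input tensor
    output = model(images, texts)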

Alternatives

We printed out the devices where the image, text, and model parameters are located and found that they were all on cuda:0. Therefore, we were surprised to get an error saying that not all tensors were on the same device. We have already installed the relevant third-party libraries.

Additional context

No response

training flava with ddp and activation checkpointing gives runtime error

Hi, I'm following the tutorial here and trying to train flava with torchmm. However, training with ddp and activation checkpointing gives the following error.

File "/data/root/anaconda3/envs/torchdata/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 157, in backward torch.autograd.backward(outputs_with_grad, args_with_grad) File "/data/root/anaconda3/envs/torchdata/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside theforwardfunction. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiplecheckpointfunctions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations. Parameter at index 493 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
I also tried some other configurations, shown below; only ddp + activation checkpointing gives the above error. I'm wondering if there is some other change we need to make to the code.

launch  activation checkpoint  result
ddp     ✓                      ✗
ddp     ✗                      ✓
fsdp    ✓                      ✓

My env: pytorch=1.13.1, cuda 11.7, torchmultimodal (via pip, from source), and I launched my script with
torchrun --rdzv_endpoint="localhost:1234" --nproc_per_node=4 -m flava.native.train config=flava/native/configs/pretrain_debug.yaml

Many thanks in advance!

Add support for LLaVA model

🚀 The feature, motivation and pitch

LLaVA currently seems to be a strong open-source competitor to GPT-4V, but it doesn't seem to be supported by the library. Do you plan on adding it? If yes, is there something I could contribute to help?

Alternatives

No response

Additional context

No response

README links are broken

Can this model be used for duplicate detection from both image and text?

🚀 The feature, motivation and pitch

A model for near duplicate detection from both image and text.

Given two inputs, each composed of an image and text, determine whether they are semantically duplicates or not.

inputA = (imageA, textA)
inputB = (imageB, textB)

Determine whether inputA and inputB are near duplicates or not.

Alternatives

No response

Additional context

No response

Linear probing on vision tasks

Hi, Thank you for sharing the code! Can I ask a question regarding the ImageNet linear probing experiments?

In Appendix B.2, the paper mentions that "We extract image features from the final layer of the image encoder (before the multi-modal encoder) and train a logistic regression classifier (L-BFGS implementation from [80]) on the extracted image features."

I wonder if the image features are the encoded_image or the projected_embeddings, from the output of the encode_image().
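
For what it's worth, based on the zero-shot example in the README above, encode_image with projection=True appears to return both tensors, so either can be inspected. This is a guess on my part, not an authoritative answer; flava_model and image below are placeholders:

# Following the pattern from the README's zero-shot example
encoded_image, projected_embeddings = flava_model.encode_image(image, projection=True)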

Thank you!

Image transform results between HF and our version do not line up

Issue description

Image transform results between HF and our version do not line up.

Code example

A minimal repro here https://colab.research.google.com/drive/1tcghYqhPjy2G1sbkzy2UUbOmbzrQTkG5#scrollTo=wdCanLBZC2w8
If you see the last few cells, the text outputs match but the image outputs don't.

A possible discrepancy is that the HF version has a center crop which is missing in our transform.

Need eyes from @apsdehal to move forward

FLAVA pretraining docs run into IMAGENET_TAR env variable issue

Following the documentation at https://github.com/facebookresearch/multimodal/tree/main/examples/flava, authenticating with the HF dataset, and running python -m flava.train config=flava/configs/pretraining/debug.yaml runs into the following issue:

omegaconf.errors.InterpolationResolutionError: KeyError raised while resolving interpolation: "Environment variable 'IMAGENET_TAR' not found"

I'm assuming IMAGENET_TAR is an env variable that's used to cache the downloaded dataset, but this might not be set for the first run?
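
If that guess is right, a minimal sketch of pointing the variable at a local tarball before training (the path below is hypothetical) would be:

import os

# One way to set it (alternatively, export IMAGENET_TAR in the shell before launching
# `python -m flava.train ...`). The path below is hypothetical.
os.environ["IMAGENET_TAR"] = "/path/to/imagenet.tar.gz"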

Unimodal evaluation results using models pretrained on ImageNet and CCNews

Hi! First of all congratulations on the impressive paper and this very well written repo. I've been playing around with this model, and I have a question about Table 4 / C.1 in the paper.
In columns 1 and 2, you report unimodal evaluation results in NLP and vision, but the pretraining dataset used here is PMD.
If I understand correctly, the normal pretraining pipeline is to use ImageNet-1k to train the image encoder, and CCNews+BookCorpus for the text encoder.
I was wondering if you have GLUE finetuning and image linear probe results on (unimodal) models pretrained on ImageNet-1k and CCNews+BookCorpus?
Thanks a lot!

Deprecate PretrainedMixin

Following up on this comment.

Currently we use PretrainedMixin to enable loading a model from a URL. However, this approach has the restriction that it can only be used for models that have a class in the TorchMultimodal repository (hence we can add the Mixin). For use cases where the model comes from another repository, we can't rely on PretrainedMixin (for example, when we use a model from torchvision, or when we construct a simple model with nn.Sequential(...)).

Hence I think it is better to use a util function load_module_from_url since it offers more flexibility.

This issue is for a plan to deprecate PretrainedMixin and use load_module_from_url utils function instead.
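
A minimal sketch of what such a utility could look like (assuming torch.hub's download helper; not necessarily the final API):

import torch
from torch import nn

def load_module_from_url(module: nn.Module, url: str, strict: bool = True) -> None:
    """Load weights from a URL into any nn.Module, with no mixin required."""
    state_dict = torch.hub.load_state_dict_from_url(url, map_location="cpu")
    module.load_state_dict(state_dict, strict=strict)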

TorchMultimodal requires Python >=3.8

The README says that TorchMultimodal requires Python >= 3.7; however, if you try to use it on 3.7 it will fail because typing.Literal is only available in >= 3.8.

In torchmultimodal/models/flava.py:12 should go from:

from typing import Any, Callable, List, Literal, Optional, Tuple, Union

to

from typing import Any, Callable, List, Optional, Tuple, Union
try:
    from typing import Literal
except ImportError:
    from typing_extensions import Literal

Or update the README to only support Python 3.8.

Code example

Using a Python 3.7 environment:

from torchmultimodal.models import flava
ImportError                               Traceback (most recent call last)
<ipython-input-8-ef9f3b511f96> in <module>()
----> 1 from torchmultimodal.models import flava

/content/torchmultimodal-dev/torchmultimodal/models/flava.py in <module>()
     10 from collections import namedtuple, OrderedDict
     11 from functools import partial
---> 12 from typing import Any, Callable, List, Literal, Optional, Tuple, Union
     13 
     14 import torch

ImportError: cannot import name 'Literal' from 'typing' (/usr/lib/python3.7/typing.py)

System Info

PyTorch version: 1.12.0.dev20220328
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0
Libc version: glibc-2.10

Python version: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.144+-x86_64-with-debian-buster-sid
Is CUDA available: False
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] pytorch-lightning==1.5.10
[pip3] torch==1.12.0.dev20220328
[pip3] torchmetrics==0.7.3
[pip3] torchmultimodal==0.1.0a0
[pip3] torchtext==0.13.0.dev20220328
[pip3] torchvision==0.13.0.dev20220328
[conda] blas                      2.113                       mkl    conda-forge
[conda] blas-devel                3.9.0            13_linux64_mkl    conda-forge
[conda] cudatoolkit               11.3.1              ha36c431_10    conda-forge
[conda] libblas                   3.9.0            13_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            13_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            13_linux64_mkl    conda-forge
[conda] liblapacke                3.9.0            13_linux64_mkl    conda-forge
[conda] mkl                       2022.0.1           h8d4b97c_803    conda-forge
[conda] mkl-devel                 2022.0.1           ha770c72_804    conda-forge
[conda] mkl-include               2022.0.1           h8d4b97c_803    conda-forge
[conda] numpy                     1.21.5           py37hf2998dd_0    conda-forge
[conda] pytorch                   1.12.0.dev20220328 py3.7_cuda11.3_cudnn8.3.2_0    pytorch-nightly
[conda] pytorch-lightning         1.5.10                   pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch-nightly
[conda] torchmetrics              0.7.3                    pypi_0    pypi
[conda] torchmultimodal           0.1.0a0                   dev_0    <develop>
[conda] torchtext                 0.13.0.dev20220328            py37    pytorch-nightly
[conda] torchvision               0.13.0.dev20220328      py37_cu113    pytorch-nightly
  • How you installed TorchMultimodal (conda, pip, source): pip from local build
  • Build command you used (if compiling from source): python setup.py develop

Fine-tuning and scaling up blog post?

The announcement blog post mentions there will be two more things:

Apart from the code, we are also releasing a tutorial for fine-tuning multimodal foundation models, and a blog post (with code pointers) on how to scale up such models using techniques from PyTorch Distributed (FSDP and activation checkpointing). We hope such examples and tutorials will serve to demystify a number of advanced features available in the PyTorch ecosystem.

If they are out, could you please help with the links?

Fine-tuning flava model

Hi,
Thank you for the great model. I want to ask: in order to fine-tune the model, do I need to create a Hugging Face account or not? I want to fine-tune the FLAVA model on my own dataset for a downstream task like image classification. Can you please guide me through the steps for fine-tuning the FLAVA model? Thank you.

Support for CoCa Model

🚀 The feature, motivation and pitch

Is there any plan to add support for the CoCa model?

The only place where I could find pretrained models for it is the open_clip repo. I think it would be really valuable if you could add it to this library.

Note: https://github.com/mlfoundations/open_clip has some models with unofficial pretrained weights.

Alternatives

No response

Additional context

No response

Training log file for flava full

🚀 The feature, motivation and pitch

Hi guys,
If possible, can you attach the training log file of the flava-full model in the readme.md section?
Thank you!

Alternatives

No response

Additional context

No response

Clip model sample training code

🚀 The feature, motivation and pitch

Hello
I wonder if you are going to have sample training code (like the ones you have in the /examples folder) for the CLIP model?

Alternatives

No response

Additional context

No response

How to use ImageNet in FLAVA?

🚀 The feature, motivation and pitch

The link to the ImageNet dataset, https://huggingface.co/datasets/aps/imagenet2012#dataset-summary, is not found.

And even if I download imagenet_object_localization_patched2019.tar.gz from Kaggle and set IMAGENET_TAR, it still downloads from huggingface.

But huggingface.co needs registration, so the download always fails; the failing log is:
ConnectionError: Couldn't reach https://huggingface.co/datasets/imagenet-1k/resolve/1500f8c59b214ce459c0a593fa1c87993aeb7700/data/train_images_0.tar.gz (error 401)

I want to know how to use the dataset.

Alternatives

No response

Additional context

No response

Latency in picking up upstream changes from TorchText TransformerEncoder temporarily broke CLIPTextEncoder

Issue description

In this PR CLIPTextEncoder was refactored due to the upstream change. The latter passed our internal CI but broke the external CI (for the same tests) since in OSS we depend on the nightly build of torchtext, which hasn't yet incorporated the bleeding-edge changes that are synced manually by TorchText here.

Implications for TorchMultimodal CI

  1. Short-term: given our pinning on the upstream nightlies for development, upstream developers will need to keep the changes local to upstream codebase (both internally and externally). TorchMultimodal will need to catch up after the changes land in the upstream OSS nightly.
  2. Mid- to long-term : Building from upstream source will put us in a better sync with upstream development.

Implications for TorchMultimodal Design

Our implementation can be strongly coupled with that of an upstream component in order to reuse existing unimodal components (e.g., to avoid reinventing the wheel / code duplication across PyTorch domain libraries). In this case, the layers component of torchtext.models.roberta.modules.TransformerEncoder changed from nn.ModuleList to nn.TransformerEncoder. In CLIPTextEncoder, num_layers was previously derived from len(layers) and was switched to use the num_layers attribute of nn.TransformerEncoder:

proj_std = (self.width ** -0.5) * ((2 * self.encoder.layers.num_layers) ** -0.5)
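
For context, the previous form of that line would have looked roughly like this (reconstructed from the description above, not copied from the old source):

# layers used to be an nn.ModuleList, so the layer count came from its length
proj_std = (self.width ** -0.5) * ((2 * len(self.encoder.layers)) ** -0.5)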

Note that upstreaming components is a strategic decision we've made from the beginning. This last point is not meant to be corrective but rather a case study of what can go wrong as a result.

cc @ankitade @ebsmothers @kartikayk @RdoubleA

OOM while finetuning flava

🚀 The feature, motivation and pitch

Hi guys,
I am trying to finetune flava on my custom dataset. I am also using mixed precision but still getting OOM. I am running this using ddp on a single node with multiple GPUs (NVIDIA A10), where each GPU has 24G of memory.
These are the configs:
Config:
training:
  _target_: flava.definitions.TrainingArguments
  lightning:
    max_steps: 30000
    gpus: -1
    val_check_interval: 1.0
    num_sanity_val_steps: 0
    strategy: ddp
    precision: 16
  lightning_checkpoint:
    dirpath: "./test_ckpt/"
    filename: flava-{epoch:02d}-{step}
    save_last: true
    every_n_train_steps: null
    save_on_train_epoch_end: true
    verbose: true
  lightning_load_from_checkpoint: null
  seed: -1
  batch_size: 128
  num_workers: 10
  learning_rate: 2e-4
  adam_eps: 1e-8
  adam_weight_decay: 1e-2
  warmup_steps: 2000

Errror:

attn = attn / torch.sqrt(torch.tensor(q.shape[-1]))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 3; 22.06 GiB total capacity; 20.40 GiB already allocated; 28.44 MiB free; 20.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Epoch 0: : 0it [00:14, ?it/s]

The paper mentions flava-full is trained using a batch_size of 8192. Can you tell me which GPU was used for training the flava-full model?
Also, if you can suggest any workarounds other than reducing batch_size that could help me here, it would be very helpful.
Thank you!

Alternatives

No response

Additional context

No response

Tutorial/reference to finetune FLAVA on custom dataset

🚀 The feature, motivation and pitch

Hey guys,
I wanted to finetune FLAVA on a custom dataset (image + text + image-text pairs). Could you please provide any reference/steps for doing so? Also, it would be really helpful if you could provide examples of finetuning FLAVA in its entirety, i.e. finetuning on MLM + MIM + MMM + ITM + contrastive learning.

Thank you!

Alternatives

No response

Additional context

No response

ALBEF: Train from scratch

🚀 The feature, motivation and pitch

Hi, thanks for your great efforts on this excellent work! I want to train ALBEF from scratch, but I only found the fine-tuning code. In the ALBEF paper, they use a pre-trained ViT, and also use BERT to initialize the weights for the text encoder and the multimodal encoder (except the cross-attention modules). But I didn't find these initializations in this code. Could you please let me know where you do that initialization?

Many thanks!

Alternatives

No response

Additional context

No response

How to perform multimodal multitask instance segmentation in torchmultimodal?

I have a dataset with 4-channel input imagery and 2 label sets - one label set is in MS COCO JSON format and the other is in TIF format.
The dataset is designed for multi-modal, multi-task instance segmentation.
Which model can be used for this task?
How do I give the 2 label sets as input to the instance segmentation model?
How do I train for it?

Can someone please provide guidance on this task?

'FlavaModelOutput' object has no attribute 'contrastive_logits_per_image'

Hi,
I have tried to run the following code:

from PIL import Image
import requests
from transformers import FlavaProcessor, FlavaModel

model = FlavaModel.from_pretrained("facebook/flava-full")
processor = FlavaProcessor.from_pretrained("facebook/flava-full")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.contrastive_logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

But I am getting an error that

'FlavaModelOutput' object has no attribute 'contrastive_logits_per_image'

Can you please help me in this regard. Thank you.
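
A hedged guess based on the Hugging Face transformers API rather than this repo: the contrastive heads are part of the pretraining model, so something like the following may be needed instead of the base FlavaModel.

# Hedged guess: the contrastive logits are produced by the pretraining model class.
from transformers import FlavaForPreTraining

model = FlavaForPreTraining.from_pretrained("facebook/flava-full")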

Add iteration strategy to multidatamodule

🚀 The feature, motivation and pitch

The FLAVA author has implemented MultiDataLoader and MultiDataModule to iterate over different datasets for multitask learning. We want to somewhat generalize this approach for other MTL examples as a stopgap solution for now (the ideal one should probably be a torch data solution or a domain-agnostic one). To this end:

  • Separate out the modules from https://github.com/facebookresearch/multimodal/blob/main/examples/flava/data/multitask.py into a folder called multimodal/examples/common/dataset_utils. Add unit tests for this without any logic changes to the modules (so it's a guardrail for the next step)
  • Add an option to the multidataloader to take in an iteration strategy. Implement the sampling strategy as a concrete example of the strategy
  • Hook it up with the FLAVA training script based on the sampling ratios mentioned in the paper and test E2E. (As a sanity check, we can punt on pushing this till we get the original author to sign off on it)

Alternatives

No response

Additional context

No response

AttributeError: 'MultiDataLoader' object has no attribute '__code__'

Issue description

At the execution of trainer.fit (line 87 of train.py),

trainer.fit(model, datamodule=datamodule)

I am getting the following error,

Code example

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:768, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    749 r"""
    750 Runs the full optimization routine.
    751 
   (...)
    765     datamodule: An instance of :class:`~pytorch_lightning.core.datamodule.LightningDataModule`.
    766 """
    767 self.strategy.model = model
--> 768 self._call_and_handle_interrupt(
    769     self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    770 )

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:721, in Trainer._call_and_handle_interrupt(self, trainer_fn, *args, **kwargs)
    719         return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
    720     else:
--> 721         return trainer_fn(*args, **kwargs)
    722 # TODO: treat KeyboardInterrupt as BaseException (delete the code below) in v1.7
    723 except KeyboardInterrupt as exception:

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:809, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    805 ckpt_path = ckpt_path or self.resume_from_checkpoint
    806 self._ckpt_path = self.__set_ckpt_path(
    807     ckpt_path, model_provided=True, model_connected=self.lightning_module is not None
    808 )
--> 809 results = self._run(model, ckpt_path=self.ckpt_path)
    811 assert self.state.stopped
    812 self.training = False

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1234, in Trainer._run(self, model, ckpt_path)
   1230 self._checkpoint_connector.restore_training_state()
   1232 self._checkpoint_connector.resume_end()
-> 1234 results = self._run_stage()
   1236 log.detail(f"{self.__class__.__name__}: trainer tearing down")
   1237 self._teardown()

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1321, in Trainer._run_stage(self)
   1319 if self.predicting:
   1320     return self._run_predict()
-> 1321 return self._run_train()

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1351, in Trainer._run_train(self)
   1349 self.fit_loop.trainer = self
   1350 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
-> 1351     self.fit_loop.run()

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/loops/base.py:203, in Loop.run(self, *args, **kwargs)
    201 while not self.done:
    202     try:
--> 203         self.on_advance_start(*args, **kwargs)
    204         self.advance(*args, **kwargs)
    205         self.on_advance_end()

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py:254, in FitLoop.on_advance_start(self)
    251 self.trainer._call_callback_hooks("on_epoch_start")
    252 self.trainer._call_lightning_module_hook("on_epoch_start")
--> 254 self.trainer._call_callback_hooks("on_train_epoch_start")
    255 self.trainer._call_lightning_module_hook("on_train_epoch_start")
    257 self.epoch_progress.increment_started()

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1634, in Trainer._call_callback_hooks(self, hook_name, *args, **kwargs)
   1632         if callable(fn):
   1633             with self.profiler.profile(f"[Callback]{callback.state_key}.{hook_name}"):
-> 1634                 fn(self, self.lightning_module, *args, **kwargs)
   1636 if pl_module:
   1637     # restore current_fx when nested context
   1638     pl_module._current_fx_name = prev_fx_name

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/tqdm_progress.py:259, in TQDMProgressBar.on_train_epoch_start(self, trainer, *_)
    257 def on_train_epoch_start(self, trainer: "pl.Trainer", *_: Any) -> None:
    258     total_train_batches = self.total_train_batches
--> 259     total_val_batches = self.total_val_batches
    260     if total_train_batches != float("inf") and total_val_batches != float("inf"):
    261         # val can be checked multiple times per epoch
    262         val_checks_per_epoch = total_train_batches // trainer.val_check_batch

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/callbacks/progress/base.py:173, in ProgressBarBase.total_val_batches(self)
    167 """The total number of validation batches, which may change from epoch to epoch for all val dataloaders.
    168 
    169 Use this to set the total number of iterations in the progress bar. Can return ``inf`` if the predict dataloader
    170 is of infinite size.
    171 """
    172 assert self._trainer is not None
--> 173 return sum(self.trainer.num_val_batches) if self._trainer.fit_loop.epoch_loop._should_check_val_epoch() else 0

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py:505, in TrainingEpochLoop._should_check_val_epoch(self)
    503 def _should_check_val_epoch(self):
    504     return (
--> 505         self.trainer.enable_validation
    506         and (self.trainer.current_epoch + 1) % self.trainer.check_val_every_n_epoch == 0
    507     )

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:2278, in Trainer.enable_validation(self)
   2274 @property
   2275 def enable_validation(self) -> bool:
   2276     """Check if we should run validation during training."""
   2277     return (
-> 2278         self._data_connector._val_dataloader_source.is_defined()
   2279         and is_overridden("validation_step", self.lightning_module)
   2280         and self.limit_val_batches > 0
   2281     )

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:536, in _DataLoaderSource.is_defined(self)
    531 def is_defined(self) -> bool:
    532     """Returns whether the source dataloader can be retrieved or not.
    533 
    534     If the source is a module it checks that the method with given :attr:`name` is overridden.
    535     """
--> 536     return not self.is_module() or is_overridden(self.name, self.instance)

File ~/miniconda3/envs/torch-multimodal/lib/python3.8/site-packages/pytorch_lightning/utilities/model_helpers.py:56, in is_overridden(method_name, instance, parent)
     53 if parent_attr is None:
     54     raise ValueError("The parent should define the method")
---> 56 return instance_attr.__code__ != parent_attr.__code__

AttributeError: 'MultiDataLoader' object has no attribute '__code__'

Are there any unimplemented methods in torchmultimodal/examples/flava/data/multitask.py?

  • How you installed TorchMultimodal (conda, pip, source): conda
  • Build command you used (if compiling from source): pip install -e .
  • OS: linux (ubuntu)
  • TorchMultimodal version: torchmultimodal-0.1.0a0
  • Python version: 3.8
  • CUDA/cuDNN version: running in cpu
  • GPU models and configuration: N/A
  • Versions of any other relevant libraries:

Train diffusion on MNIST

Tried the diffusion_labs tutorial "Train diffusion on MNIST" after watching @pbontrager's presentation at the recent PyTorch Conference. The code works well. On Google Colab with a GPU (free tier) it takes 7 minutes per epoch. However, it would be nice if the tutorial could be extended to cover how to save the trained model and use it for inference.
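
In the meantime, a generic PyTorch pattern for this (not specific to diffusion_labs; model is whatever module the tutorial builds, and the filename is hypothetical):

import torch

# Save the trained weights at the end of training
torch.save(model.state_dict(), "mnist_diffusion.pt")

# Later: rebuild the same model architecture, then load the weights for inference
model.load_state_dict(torch.load("mnist_diffusion.pt", map_location="cpu"))
model.eval()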

GPU Tests Failed with AttributeError: Can't pickle local object 'ArgumentParser.__init__.<locals>.identity'

Issue description

GPU tests failed when running on a GPU cluster:

test/modules/losses/test_contrastive_loss_with_temperature.py::TestContrastiveLossWithTemperature::test_multi_gpu_loss FAILED [ 66%]
test/modules/losses/test_contrastive_loss_with_temperature.py::TestContrastiveLossWithTemperature::test_single_gpu_loss FAILED [100%]

Possible solution:

While this is limitation of the spawn and forkserver modes, I don't see how this is a bug. Also, as you point out, making the method a classmethod or staticmethod works around the limitation.

See: https://bugs.python.org/issue33884

Code example

$ gpurun pytest test/modules/losses/test_contrastive_loss_with_temperature.py -vv 

============================= test session starts ==============================
platform linux -- Python 3.9.12, pytest-7.1.1, pluggy-1.0.0 -- /fsx/users/langong/conda/envs/torchmmcuda39/bin/python
cachedir: .pytest_cache
rootdir: /fsx/users/langong/work/fix_build
collecting ... collected 3 items

test/modules/losses/test_contrastive_loss_with_temperature.py::TestContrastiveLossWithTemperature::test_local_loss PASSED [ 33%]
test/modules/losses/test_contrastive_loss_with_temperature.py::TestContrastiveLossWithTemperature::test_multi_gpu_loss FAILED [ 66%]
test/modules/losses/test_contrastive_loss_with_temperature.py::TestContrastiveLossWithTemperature::test_single_gpu_loss FAILED [100%]

=================================== FAILURES ===================================
____________ TestContrastiveLossWithTemperature.test_multi_gpu_loss ____________

self = <test_contrastive_loss_with_temperature.TestContrastiveLossWithTemperature testMethod=test_multi_gpu_loss>

    @gpu_test(gpu_count=2)
    def test_multi_gpu_loss(self):
        with with_temp_files(count=1) as sync_file:
            world_size = 2
>           mp.spawn(
                self._model_worker,
                (self, sync_file, world_size),
                nprocs=world_size,
            )

test/modules/losses/test_contrastive_loss_with_temperature.py:138:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../conda/envs/torchmmcuda39/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:240: in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
../../conda/envs/torchmmcuda39/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:189: in start_processes
    process.start()
../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/process.py:121: in start
    self._popen = self._Popen(self)
../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/context.py:284: in _Popen
    return Popen(process_obj)
../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/popen_spawn_posix.py:32: in __init__
    super().__init__(process_obj)
../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/popen_fork.py:19: in __init__
    self._launch(process_obj)
../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/popen_spawn_posix.py:47: in _launch
    reduction.dump(process_obj, fp)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = <SpawnProcess name='SpawnProcess-1' parent=25159 initial>
file = <_io.BytesIO object at 0x7f5c7956d9f0>, protocol = None

    def dump(obj, file, protocol=None):
        '''Replacement for pickle.dump() using ForkingPickler.'''
>       ForkingPickler(file, protocol).dump(obj)
E       AttributeError: Can't pickle local object 'ArgumentParser.__init__.<locals>.identity'

../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/reduction.py:60: AttributeError
___________ TestContrastiveLossWithTemperature.test_single_gpu_loss ____________

self = <test_contrastive_loss_with_temperature.TestContrastiveLossWithTemperature testMethod=test_single_gpu_loss>

    @gpu_test(gpu_count=1)
    def test_single_gpu_loss(self):
        with with_temp_files(count=1) as sync_file:
            world_size = 1
>           mp.spawn(
                self._model_worker,
                (self, sync_file, world_size),
                nprocs=world_size,
            )

test/modules/losses/test_contrastive_loss_with_temperature.py:128:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../conda/envs/torchmmcuda39/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:240: in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
../../conda/envs/torchmmcuda39/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:189: in start_processes
    process.start()
../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/process.py:121: in start
    self._popen = self._Popen(self)
../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/context.py:284: in _Popen
    return Popen(process_obj)
../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/popen_spawn_posix.py:32: in __init__
    super().__init__(process_obj)
../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/popen_fork.py:19: in __init__
    self._launch(process_obj)
../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/popen_spawn_posix.py:47: in _launch
    reduction.dump(process_obj, fp)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = <SpawnProcess name='SpawnProcess-2' parent=25159 initial>
file = <_io.BytesIO object at 0x7f5c79498db0>, protocol = None

    def dump(obj, file, protocol=None):
        '''Replacement for pickle.dump() using ForkingPickler.'''
>       ForkingPickler(file, protocol).dump(obj)
E       AttributeError: Can't pickle local object 'ArgumentParser.__init__.<locals>.identity'

../../conda/envs/torchmmcuda39/lib/python3.9/multiprocessing/reduction.py:60: AttributeError
=========================== short test summary info ============================
FAILED test/modules/losses/test_contrastive_loss_with_temperature.py::TestContrastiveLossWithTemperature::test_multi_gpu_loss
FAILED test/modules/losses/test_contrastive_loss_with_temperature.py::TestContrastiveLossWithTemperature::test_single_gpu_loss
========================= 2 failed, 1 passed in 1.39s ==========================

System Info

PyTorch version: 1.12.0.dev20220407
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.20.4
Libc version: glibc-2.27

Python version: 3.9.12 (main, Apr  5 2022, 06:56:58)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-1051-aws-x86_64-with-glibc2.27
Is CUDA available: False
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.12.0.dev20220407
[pip3] torchmultimodal==0.1.0a0
[pip3] torchtext==0.13.0.dev20220407
[pip3] torchvision==0.13.0.dev20220407
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               11.3.1               ha36c431_9    nvidia
[conda] mkl                       2021.4.0           h06a4308_640
[conda] mkl-service               2.4.0            py39h7f8727e_0
[conda] mkl_fft                   1.3.1            py39hd3c417c_0
[conda] mkl_random                1.2.2            py39h51133e4_0
[conda] numpy                     1.21.2           py39h20f2e39_0
[conda] numpy-base                1.21.2           py39h79a1101_0
[conda] pytorch                   1.12.0.dev20220407 py3.9_cuda11.3_cudnn8.3.2_0    pytorch-nightly
[conda] pytorch-mutex             1.0                        cuda    pytorch-nightly
[conda] torchmultimodal           0.1.0a0                   dev_0    <develop>
[conda] torchtext                 0.13.0.dev20220407            py39    pytorch-nightly
[conda] torchvision               0.13.0.dev20220407      py39_cu113    pytorch-nightly

  • How you installed TorchMultimodal (conda, pip, source): source
  • Build command you used (if compiling from source): python setup.py develop

Potential bug in CLIP transform implementation

self.text_transform = transforms.Compose(
    [
        tokenizer,
        text_transforms.AddToken(self.text_start_token, begin=True),
        text_transforms.AddToken(self.text_end_token, begin=False),
        text_transforms.Truncate(self.text_max_length),
        StrToIntTransform(),
        text_transforms.ToTensor(padding_value=0),
        PadTransform(max_length=self.text_max_length),
    ]
)

In the above implementation, the truncation should ideally happen before adding the start/end tokens. If the sentence is long enough, the current order removes the end token.
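
A tiny standalone illustration of why the current order drops the end token (toy tokens, not the real tokenizer):

tokens = ["hello", "world", "foo", "bar"]  # pretend tokenizer output
max_length = 5

wrong = (["<start>"] + tokens + ["<end>"])[:max_length]
# -> ['<start>', 'hello', 'world', 'foo', 'bar']   (the '<end>' token is truncated away)

right = ["<start>"] + tokens[: max_length - 2] + ["<end>"]
# -> ['<start>', 'hello', 'world', 'foo', '<end>']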

correct implementation

self.text_transform = transforms.Compose(
    [
        tokenizer,
        text_transforms.Truncate(self.text_max_length - 2),
        text_transforms.AddToken(self.text_start_token, begin=True),
        text_transforms.AddToken(self.text_end_token, begin=False),
        StrToIntTransform(),
        text_transforms.ToTensor(padding_value=0),
        PadTransform(max_length=self.text_max_length),
    ]
)

cc: @ebsmothers
