
meta-module-network's Introduction

Meta-Module-Network

Code for WACV 2021 Paper "Meta Module Network for Compositional Visual Reasoning"

Data Downloading

Download all the question files, scene graph files, and bottom-up features from the web server; this can take up to 300 GB of disk space.

  bash get_data.sh

This script will download the questions/ folder. "trainval_all_programs.json" is used for bootstrapping and "trainval_unbiased_programs.json" is used for fine-tuning in the paper; "trainval_unbiased_programs.json" and "testdev_pred_programs.json" are both generated by the program generator model.

Meta Module Network Implementation

To understand the implementation of MMN in more detail, please refer to the README.

Description of different files

  • sceneGraphs/trainval_bounding_box.json: the scene graph provided by the original GQA dataset
      {
        imageId:
        {
          bounding_box_id:
          {
            x: number,
            y: number,
            w: number,
            h: number,
            relations: [{object: "bounding_box_id", name: "relation_name"} ... ],
            name: object_class,
            attributes: [attr1, attr2, ... ]
          },
          bounding_box_id:
          {
            ...
          },
        }
      }
    
  • questions/: the question-program pairs and their associated images. Each file is a list of entries of the form below (a minimal loading sketch follows this list):
    [
      [
        "ImageId",
        "Question",
        "Programs": [f1, f2, ..., fn],
        "QuestionId",
        "Answer"
      ]
    ]
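
As a quick sanity check, the snippet below loads both files and prints one record from each. This is a minimal sketch: the paths and the five-field question layout are taken from the descriptions above, so adjust them to wherever get_data.sh actually placed the data.

  import json

  # Paths follow the descriptions above -- adjust if your layout differs.
  with open("sceneGraphs/trainval_bounding_box.json") as f:
      scene_graphs = json.load(f)   # {imageId: {bounding_box_id: {...}}}

  with open("questions/trainval_all_programs.json") as f:
      entries = json.load(f)        # [[ImageId, Question, Programs, QuestionId, Answer], ...]

  # Peek at one scene graph: each bounding box carries coordinates, a class
  # name, attributes, and relations to other bounding boxes.
  image_id, boxes = next(iter(scene_graphs.items()))
  box_id, box = next(iter(boxes.items()))
  print(image_id, box_id, box["name"], box["attributes"], box["relations"])

  # Peek at one question-program entry, assuming the five-field layout above.
  image_id, question, programs, question_id, answer = entries[0]
  print(question_id, question, "->", programs, "=>", answer)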
    

Data Preprocessing [Optional]:

If you want to know how the programs and training data are generated, follow these steps:

Preprocessing Question-Program Pairs:

Download the questions from the original GQA website and put them in the parent folder '../gqa-questions/'. The following steps convert the questions into the program format (a hypothetical example of the result follows the numbered steps below):

  1. Preprocess the trainval_all questions into trainval_all_programs.json:
     python preprocess.py trainval_all
  2. Preprocess the "balanced" questions into their program forms:
     python preprocess.py create_balanced_programs
  3. Convert the programs in trainval_all_programs.json into the "input" form:
     python preprocess.py create_all_inputs
  4. Convert the programs in *balanced.json into the "input" form:
     python preprocess.py create_inputs
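
For intuition, here is a purely hypothetical example of what an entry looks like after the questions have been converted into program form. The image/question IDs, the function names, and the dependency notation below are illustrative assumptions; the real function vocabulary and encoding are defined by preprocess.py.

  # Hypothetical entry (illustrative only; see preprocess.py for the real
  # function vocabulary and dependency encoding).
  example = [
      "2370799",                                   # ImageId (made up)
      "Is there a cup to the left of the plate?",  # Question (made up)
      [                                            # Programs: one function call per reasoning step
          "select(plate)",                         # step 1: find the plate
          "relate(left, [1])",                     # step 2: objects to the left of step 1's output
          "filter(cup, [2])",                      # step 3: keep the cups among them
          "exist([3])",                            # step 4: produce a yes/no answer
      ],
      "03158921",                                  # QuestionId (made up)
      "yes",                                       # Answer
  ]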

Using the NL2Program Model to Predict Test-Dev Programs from Input Questions:

  1. Train the sequence-to-sequence model:
     python generate_program.py --do_preprocess
  2. Evaluate the NL2Program model:
     python generate_program.py --do_testdev
  3. Prepare the generated programs for the modular transformer:
     python generate_program.py --do_trainval_unbiased

Meta Module Network Training and Evaluation

  • Prepare the inputs for the modular transformer:
      python preprocess.py create_pred_inputs
    
  • Start the bootstrap training of the modular transformer, or download the pre-trained models directly from Google Drive. The bootstrap process can take quite a long time, so please be patient if you are training on your own:
     python run_experiments.py --do_train_all --model TreeSparsePostv2 --id TreeSparsePost2Full --stacking 2 --batch_size 1024
    
  • Start the fine-tuning on the balanced split:
      python run_experiments.py --do_finetune --id FinetuneTreeSparseStack2RemovalFullValSeed6999 --model TreeSparsePostv2 --load_from models/TreeSparsePost2Full --seed 6999 --stacking 2
    
  • Test the model on the testdev split:
      python run_experiments.py --do_testdev_pred --id FinetuneTreeSparseStack2RemovalValSeed6777 --load_from [MODEL_NAME]  --model TreeSparsePostv2 --stacking 2
    

Citation

If you find this paper useful, please cite it as follows:

  @inproceedings{chen2019meta,
    title={Meta Module Network for Compositional Visual Reasoning},
    author={Chen, Wenhu and Gan, Zhe and Li, Linjie and Cheng, Yu and Wang, William and Liu, Jingjing},
    booktitle={Proceedings of WACV},
    year={2021}
  }


meta-module-network's Issues

Question about coordinate projection in TreeTransformerSparsePostv2

This layer in the TreeTransformerSparsePostv2 class in modular.py:
self.coordinate_proj = nn.Linear(coordinate_dim, hidden_dim)

apparently expects a 6-dimensional bounding-box coordinate input, and the pretrained model is trained accordingly. See also args.additional_dim in run_experiments.py, which is set to 6 by default. Could you please explain why your bboxes have 6 dimensions, or whether I'm misinterpreting this? (I can't download your provided data to check what content, if any, is in dims 5 and 6.)
Thanks!
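
For context, one common convention (an assumption here, not confirmed for this repo) is to encode each box as its normalized corner coordinates plus its normalized width and height, which gives exactly 6 dimensions. A minimal sketch under that assumption:

  import torch
  import torch.nn as nn

  def encode_box(x, y, w, h, img_w, img_h):
      # Assumed 6-d encoding: normalized corners plus normalized width/height.
      x1, y1 = x / img_w, y / img_h
      x2, y2 = (x + w) / img_w, (y + h) / img_h
      return torch.tensor([x1, y1, x2, y2, w / img_w, h / img_h])

  coordinate_dim, hidden_dim = 6, 768        # hidden_dim value is illustrative
  coordinate_proj = nn.Linear(coordinate_dim, hidden_dim)

  box_feat = encode_box(x=30, y=40, w=100, h=60, img_w=640, img_h=480)
  box_embedding = coordinate_proj(box_feat)  # shape: (hidden_dim,)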

Code for teacher model

Hello, thanks for sharing. I could not find the training code for the symbolic executor (the teacher model). Could you please give some references or open-source the code for it? Thanks.

missing en_emb.npy

Hi, thanks for your quick reply. The file en_emb.npy is also missing; it is required at line 218 of generate_program.py. Could you provide this file as well?

Complete code release

Hi, do you plan to update the code?
I can't find the implementations of three core parts of the model: the visual encoder, the program generator, and the meta modules. For example, where is the Networks.py referenced in generate_program.py?

GQA_hypernym missing.

Hi, Constants.py uses GQA_hypernym.json at line 84, but I could not find this file. How can I resolve this?

Pretrained models

Thanks for open-sourcing the code!
I think your work on preprocessing GQA programs is very valuable.
It is a crucial step toward practical visual reasoning on natural images, and I have decided to build on this work.
For the convenience of followers, could you please provide your pretrained models (especially the program generator)?

Failing to download gqa_features.zip continually

Hi, I'm a student in South Korea.
I want to reproduce your experimental results, but I keep getting errors while downloading gqa_features.zip.

This is the wget command in get_data.sh:
wget https://convaisharables.blob.core.windows.net/meta-module-network/gqa_visual_features.zip

I have already read closed issue #3; the same thing happens to me, and the download stops with a "peer connection reset".
How can I download gqa_features.zip? Or could you tell me how to extract the features myself?

where does gqa_visual_features.zip come from?

Hi, I notice that you use a customized version of the GQA visual features. Where do they come from?
I cannot download the file because of: Read error at byte 38950928384/174855185539 (Connection reset by peer). Giving up.

Coarse-to-fine Parser

Hi,

Thank you for your awesome work.

From generate_program.py, I understand that you are using a standard Seq2Seq parser. Could you let me know whether you have open-sourced the Coarse2Fine parser implementation, so that I can replicate the parsing results described in the paper on my own data?

Thanks

PS: I have downloaded the Google Drive folder that contains the model files and data.

Reproducing results from scratch not working

I want to train this model with new visual features from a different object detector.

I've now trained this model from scratch using bootstrapping with all of the training data, as described. I'm using the program files provided in this repo. The only (major) difference is that I'm using different visual features (their quality should not be worse than the bottom-up features). After training for a few epochs (bootstrapping followed by fine-tuning), it's obvious that I'm nowhere close to the numbers in the paper, reaching only ~40% accuracy on both the balanced val set and the test-dev set.

Any ideas what the problem could be? Is there any non-obvious tuning done with respect to your visual features?

Implement details of MMN w/o BS

The GQA test-dev accuracy of MMN without bootstrapping is 58.4% in your paper; however, I only reached 57.7% with your provided visual features under the default setup.
My primary hyperparameters are as follows:

  • model=TreeSparsePostv2
  • pre_layers=3
  • stacking=2
  • lr_default=1e-4
  • batch_size=256

Could you please share more details about this experiment?

Code for CLEVR experiments?

Hi, I wanted to ask, is there any chance that the code for the preliminary CLEVR experiments mentioned in the paper could be released? Thanks for your time!
