Giter VIP home page Giter VIP logo

iais's Introduction

IAIS: Inter-modal Alignment for Intra-modal Self-attentions

This repository contains the code for our paper Learning Relation Alignment for Calibrated Cross-modal Retrieval (ACL-IJCNLP 2021 main conference).

Overview of IAIS

Some code in this repo are copied/modified from UNITER, and other opensource implementations made available by PyTorch, HuggingFace, OpenNMT, and Nvidia. The image features are extracted using BUTD.

Overview

  1. We propose a Relation Consistency Hypothesis: Given a matched image-text pair, the linguistic relation should agree with the visual relation.
  2. We design a novel metric: Intra-modal Self-attention Distance with annotation (ISDa) to measure the consistency between textual and visual relations.
  3. We propose a new regularized training method called Inter-modal Alignment on Intra-modal Self-attentions (IAIS) to calibrate two intra-modal attention distributions mutually via inter-modal alignment, which helps learn better contextualized representations for image-text pairs.

Update

Please refer to Google Drive for our fine-tuned checkpoints and logs for MS COCO and Flickr30k.

Requirements

We provide Docker image for easier reproduction. Please install the following:

Our scripts require the user to have the docker group membership so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training hence GPUs with Tensor Cores are recommended.

Getting Started

  1. Download processed data and pretrained models with the following command.

    bash scripts/download_itm.sh $PATH_TO_STORAGE

    After downloading you should see the following folder structure:

    ├── img_db
    │   ├── coco_train2014
    │   ├── coco_train2014.tar
    │   ├── coco_val2014
    │   ├── coco_val2014.tar
    │   ├── flickr30k
    │   └── flickr30k.tar
    ├── pretrained
    │   ├── uniter-base.pt
    │   ├── uniter-large.pt
    └── txt_db
        ├── itm_coco_train.db
        ├── itm_coco_train.db.tar
        ├── itm_coco_val.db
        ├── itm_coco_val.db.tar
        ├── itm_coco_restval.db
        ├── itm_coco_restval.db.tar
        ├── itm_coco_test.db
        ├── itm_coco_test.db.tar
        ├── itm_flickr30k_train.db
        ├── itm_flickr30k_train.db.tar
        ├── itm_flickr30k_val.db
        ├── itm_flickr30k_val.db.tar
        ├── itm_flickr30k_test.db
        └── itm_flickr30k_test.db.tar
    
  2. Launch the Docker container for running the experiments.

    # docker image should be automatically pulled
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained

    The launch script respects $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /src instead of built into the image so that user modification will be reflected without re-building the image. (Data folders are mounted into the container separately for flexibility on folder structures.)

  3. Run finetuning for the ITM task.

    All experiments in the paper are conducted on 8 NVIDIA V100 GPUs.

    • Image-Text Retrieval (Flickr30k)
      • finetune with hard negatives
        horovodrun -np 8 python train_itm_hard_negatives.py \
            --config config/train-itm-flickr-base-8gpu-hn.jgon
        
      • finetune with hard negatives + IAIS
        horovodrun -np 8 python train_itm_hard_negatives.py \
            --config config/train-itm-flickr-base-8gpu-hn.jgon --IAIS singular
        
    • Image-Text Retrieval (COCO)
      • finetune with hard negatives
        horovodrun -np 8 python train_itm_hard_negatives.py \
            --config config/train-itm-coco-base-8gpu-hn.json
        
      • finetune with hard negatives + IAIS
        horovodrun -np 8 python train_itm_hard_negatives.py \
            --config config/train-itm-coco-base-8gpu-hn.json --IAIS singular
        

    The argument --IAIS indicates incorporating the auxiliary IAIS loss to the fine-tuning phase. We support

    • singular: Singular Alignment that establishes a one-to-one mapping between linguistic and visual attention weight (Section 3.1 in the paper). Overview of IAIS
    • distributed: Distributed Alignment that establishes a distributed mapping (Section 3.2 in the paper). Overview of IAIS

    The main code for the IAIS method is in the UniterEncoder class in model/model.py.

Contact

If you have any questions related to the code or the paper, feel free to email Shuhuai (renshuhuai007 [AT] gmail [DOT] com).

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{ren2021iais,
    title = "Learning Relation Alignment for Calibrated Cross-modal Retrieval",
    author = "Ren, Shuhuai and Lin, Junyang and Zhao, Guangxiang and Men, Rui and Yang, An and Zhou, Jingren and Sun, Xu and Yang, Hongxia",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    year = "2021",
}

License

MIT

iais's People

Contributors

renshuhuai-andy avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.