Introduction

This is Consensus-Aware Visual-Semantic Embedding (CVSE), the official source code for the paper Consensus-Aware Visual-Semantic Embedding for Image-Text Matching (ECCV 2020). It is built on top of VSE++ in PyTorch.

Abstract:

Image-text matching plays a central role in bridging vision and language. Most existing approaches only rely on the image-text instance pair to learn their representations, thereby exploiting their matching relationships and making the corresponding alignments. Such approaches just exploit the superficial associations contained with the instance pairwise data, with no consideration of any external commonsense knowledge, which may hinder their capabilities to reason the higher-level relationships between image and text. In this paper, we propose a Consensus-Aware Visual-Semantic Embedding (CVSE) model to incorporate the consensus information, namely the commonsense knowledge shared between both modalities, into image-text matching. Specifically, the consensus information is exploited by computing statistical co-occurrence correlations between the semantic concepts from the image captioning corpus and deploying the constructed concept correlation graph to yield the consensus-aware concept (CAC) representations. Afterwards, CVSE learns the associations and alignments between image and text based on the exploited consensus as well as the instance-level representations for both modalities. Extensive experiments conducted on two public datasets verify that the exploited consensus makes significant contributions to constructing more meaningful visual-semantic embeddings, with the superior performances over the state-of-the-art approaches on the bidirectional image and text retrieval task.
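
The abstract describes building a concept correlation graph from co-occurrence statistics over a captioning corpus. The following is a minimal sketch of that idea only, not the authors' implementation: it estimates the conditional probability that one concept appears in a caption given that another does. The function name `cooccurrence_graph`, the whitespace tokenization, and the toy captions are all assumptions made for illustration; the paper's actual graph construction may apply further processing.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_graph(captions, concepts):
    """Estimate P(b in caption | a in caption) for every concept pair (a, b).

    Hypothetical helper: captions are plain strings, concepts are
    lowercase single-word terms, and tokenization is a simple split.
    """
    concept_set = set(concepts)
    occurrences = Counter()      # number of captions containing each concept
    co_occurrences = Counter()   # number of captions containing both a and b

    for caption in captions:
        # Unique concepts present in this caption (no punctuation handling).
        present = sorted(concept_set.intersection(caption.lower().split()))
        occurrences.update(present)
        co_occurrences.update(permutations(present, 2))

    graph = {}
    for a in concepts:
        if occurrences[a] == 0:
            continue
        graph[a] = {
            b: co_occurrences[(a, b)] / occurrences[a]
            for b in concepts
            if b != a and co_occurrences[(a, b)] > 0
        }
    return graph


toy_captions = [
    "a man rides a horse",
    "a man walks a dog",
    "a dog chases a horse",
]
graph = cooccurrence_graph(toy_captions, ["man", "horse", "dog"])
print(graph["man"])
```

The edge weights are directional: the weight from "man" to "horse" is normalized by how often "man" occurs, so the matrix is not symmetric in general.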

The framework of CVSE:

The results on the MSCOCO and Flickr30K datasets (the inference/evaluation code has been corrected for fair comparison):

            Image-to-Text        Text-to-Image
Dataset     R@1   R@5   R@10     R@1   R@5   R@10    mR
MSCOCO      74.8  95.1  98.3     59.9  89.4  95.2    85.5
Flickr30K   73.5  92.1  95.8     52.9  80.4  87.8    80.4
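
For readers unfamiliar with the metrics, R@K is the percentage of queries whose ground-truth match is ranked in the top K, and the mR column is consistent with the plain average of the six recall values in each row. The sketch below illustrates both under a simplifying assumption that query i has exactly one ground-truth candidate, candidate i; it is not the repository's evaluation code, and the names `recall_at_k` and `mean_recall` are hypothetical.

```python
def recall_at_k(sims, ks=(1, 5, 10)):
    """Recall@K in percent for a square similarity matrix.

    sims[i][j] is the similarity between query i and candidate j.
    Simplifying assumption: the ground truth for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sims):
        ground_truth_score = row[i]
        # Rank = number of candidates scoring strictly higher than the match.
        ranks.append(sum(1 for score in row if score > ground_truth_score))
    n = len(ranks)
    return {k: 100.0 * sum(1 for r in ranks if r < k) / n for k in ks}


def mean_recall(image_to_text, text_to_image):
    """Average of all recall values from both retrieval directions."""
    values = list(image_to_text.values()) + list(text_to_image.values())
    return sum(values) / len(values)


# Checking the MSCOCO row of the table: the six values average to about 85.45.
mscoco_row = [74.8, 95.1, 98.3, 59.9, 89.4, 95.2]
mscoco_mr = sum(mscoco_row) / len(mscoco_row)
print(mscoco_mr)
```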

Requirements and Installation

We recommend the following dependencies.

  • Python 3.6
  • PyTorch 1.1.0
  • NumPy (>1.12.1)
  • TensorBoard
  • torchtext
  • pycocotools

Download data

Download the dataset files. We use the image features created by SCAN. All the data needed for reproducing the experiments in the paper, including image features and vocabularies, can be downloaded with:

wget https://scanproject.blob.core.windows.net/scan-data/data.zip
wget https://scanproject.blob.core.windows.net/scan-data/vocab.zip

In this implementation, we refer to the path of the files extracted from data.zip as $DATA_PATH, and the files from vocab.zip are extracted to the ./vocab_path directory.
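
If wget is not available, the same download-and-extract step can be done from Python. This is a minimal sketch using only the standard library; the URLs are the ones listed above, while the helper names `download` and `extract` and the destination directories in the usage comment are assumptions, so adjust them to your own layout.

```python
import urllib.request
import zipfile
from pathlib import Path

# URLs taken from the wget commands above.
DATA_URLS = {
    "data.zip": "https://scanproject.blob.core.windows.net/scan-data/data.zip",
    "vocab.zip": "https://scanproject.blob.core.windows.net/scan-data/vocab.zip",
}


def download(url, target):
    """Download url to target unless the file already exists (hypothetical helper)."""
    target = Path(target)
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def extract(zip_path, dest_dir):
    """Extract a zip archive into dest_dir and return the top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(str(zip_path)) as archive:
        archive.extractall(str(dest))
    return sorted(p.name for p in dest.iterdir())


# Example usage (not run here because the archives are large):
#   extract(download(DATA_URLS["data.zip"], "data.zip"), ".")
#   extract(download(DATA_URLS["vocab.zip"], "vocab.zip"), ".")
```

After extraction, check the resulting directory names and point $DATA_PATH at the directory that contains the precomputed feature folders.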

Training

  • Train MSCOCO models: Run train_coco.py:
python train_coco.py --data_path "$DATA_PATH"
  • Train Flickr30K models: Run train_f30k.py:
python train_f30k.py --data_path "$DATA_PATH"

Evaluate trained models

  • Test on MSCOCO dataset:

    • Test on MSCOCO 1K test set
    python evaluate.py --data_path "$DATA_PATH" --data_name 'coco_precomp' --model_path './runs/coco/CVSE_COCO/model_best.pth.tar' --data_name_vocab coco_precomp --split test 
    • Test on MSCOCO 5K test set
    python evaluate.py --data_path "$DATA_PATH" --data_name 'coco_precomp' --model_path './runs/coco/CVSE_COCO/model_best.pth.tar' --data_name_vocab coco_precomp --split testall 
  • Test on Flickr30K dataset:

python evaluate.py --data_path "$DATA_PATH" --data_name 'f30k_precomp' --model_path './runs/f30k/CVSE_f30k/model_best.pth.tar' --data_name_vocab f30k_precomp --split test 
  • Test on dataset transfer (MSCOCO-to-Flickr30K):
python evaluate.py --data_path "$DATA_PATH" --data_name 'f30k_precomp' --model_path './runs/coco/CVSE_COCO/model_best.pth.tar' --data_name_vocab coco_precomp --transfer_test --concept_path 'data/coco_to_f30k_annotations/Concept_annotations/'
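
The evaluation commands above differ only in a handful of flags, so they can be generated from one place when running several tests in a row. The sketch below assumes evaluate.py accepts exactly the flags shown above and nothing more; `build_eval_command` and `format_command` are hypothetical helper names, not part of the repository, and the data path is a placeholder.

```python
import shlex


def build_eval_command(data_path, data_name, model_path, vocab_name,
                       split=None, concept_path=None):
    """Build the argument list for one evaluate.py run (hypothetical helper).

    Only flags that appear in the commands above are used. Passing
    concept_path switches to the dataset-transfer test.
    """
    args = [
        "python", "evaluate.py",
        "--data_path", data_path,
        "--data_name", data_name,
        "--model_path", model_path,
        "--data_name_vocab", vocab_name,
    ]
    if split is not None:
        args += ["--split", split]
    if concept_path is not None:
        args += ["--transfer_test", "--concept_path", concept_path]
    return args


def format_command(args):
    """Render an argument list as a shell-safe command string."""
    return " ".join(shlex.quote(a) for a in args)


DATA_PATH = "/path/to/data"  # placeholder, replace with your own $DATA_PATH
coco_model = "./runs/coco/CVSE_COCO/model_best.pth.tar"

# MSCOCO 5K test set, mirroring the command listed above.
command = build_eval_command(DATA_PATH, "coco_precomp", coco_model,
                             "coco_precomp", split="testall")
print(format_command(command))
```

The returned list can be passed directly to `subprocess.run(command, check=True)` from the repository root, which avoids shell quoting issues with paths that contain spaces.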

Reference

If CVSE is useful for your research, please cite our paper:

@inproceedings{Wang2020CVSE,
  title={Consensus-Aware Visual-Semantic Embedding for Image-Text Matching},
  author={Wang, Haoran and Zhang, Ying and Ji, Zhong and Pang, Yanwei and Ma, Lin},
  booktitle={ECCV},
  year={2020}
}

Acknowledgement

Part of our code is borrowed from SCAN. We thank the authors for releasing their code.

License

MIT License
