
3SHNet

The official repository of the 3SHNet project.

"3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting"

by Xuri Ge*, Songpei Xu*, Fuhai Chen#, Jie Wang, Guoxin Wang, Shan An, Joemon M. Jose#

Information Processing and Management (IP&M 2024)

Introduction

In this paper, we propose a novel visual Semantic-Spatial Self-Highlighting Network (termed 3SHNet) for high-precision, high-efficiency and high-generalization image-sentence retrieval. 3SHNet highlights the identification of prominent objects and their spatial locations within the visual modality, allowing the integration of visual semantic-spatial interactions while maintaining independence between the two modalities. This integration effectively combines object regions with the corresponding semantic and position layouts derived from segmentation to enhance the visual representation, and the modality independence guarantees efficiency and generalization. Additionally, 3SHNet utilizes the structured contextual visual scene information from segmentation to conduct local (region-based) or global (grid-based) guidance and achieve accurate hybrid-level retrieval. Extensive experiments conducted on the MS-COCO and Flickr30K benchmarks substantiate the superior performance, inference efficiency and generalization of the proposed 3SHNet compared with contemporary state-of-the-art methods. Specifically, on the larger MS-COCO 5K test set, we achieve 16.3%, 24.8% and 18.3% improvements in rSum score compared with the state-of-the-art methods using different image representations, while maintaining optimal retrieval efficiency.
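The rSum score used above is the standard retrieval metric: the sum of Recall@{1, 5, 10} over both retrieval directions (image-to-text and text-to-image). A minimal sketch, assuming one matching caption per image on the matrix diagonal:

```python
import numpy as np

def recall_at_k(sims, k):
    # sims[i, j]: similarity of query i to gallery item j; ground truth is j == i
    ranks = np.argsort(-sims, axis=1)
    hits = (ranks[:, :k] == np.arange(len(sims))[:, None]).any(axis=1)
    return 100.0 * hits.mean()

def rsum(sims):
    # image->text uses sims; text->image uses its transpose
    return sum(recall_at_k(s, k) for s in (sims, sims.T) for k in (1, 5, 10))

sims = np.eye(5)  # perfect retrieval: each recall is 100, so rSum = 600
print(rsum(sims))  # 600.0
```

In practice each model produces one such similarity matrix per test split, and rSum is reported per split (1K or 5K on MS-COCO).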

Prerequisites

Basic environment: python=3.7; pytorch=1.8.0 (CUDA 11); tensorflow=2.11.0; tensorboard, etc. You can create it directly by running:

conda env create -n 3SHNet --file env.yaml

Data Preparation

To run the code, you need the annotations, the region-level and global-level image features, and the corresponding segmentation results for the MS-COCO and Flickr30K datasets.

First, the basic annotations and region-level image features can be downloaded from SCN.

For the global-level image features, we used the pre-trained ResNeXt-101 from grid-feats-vqa to extract grid features for all images in MS-COCO and Flickr30K. After that, we stored them in independent .npy files (mainly for MS-COCO, due to the large number of images).
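The per-image storage scheme above can be sketched as follows; the directory layout, file naming by image id, and the (2048, 7, 7) grid-feature shape are illustrative assumptions, not the repo's exact conventions:

```python
import os
import numpy as np

out_dir = "grid_feats"  # hypothetical output directory
os.makedirs(out_dir, exist_ok=True)

# One independent .npy file per image, so features can be loaded lazily
# instead of holding the whole dataset in memory.
for image_id in [101, 102]:
    feat = np.zeros((2048, 7, 7), dtype=np.float32)  # placeholder grid feature
    np.save(os.path.join(out_dir, f"{image_id}.npy"), feat)

loaded = np.load(os.path.join(out_dir, "101.npy"))
print(loaded.shape)  # (2048, 7, 7)
```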

For the segmentation extractions, features are computed with the code provided by UPSNet. They include three types:

  • segmentation semantic features (%s_segmentations.npy, dims=(N, 133, 7, 7));
  • segmentation maps (%s_seg_maps.npy, dims=(N, 64, 64), downsampled to reduce computation);
  • category one-hots (%s_cat_onehots.npy, dims=(N, 133)); slightly different from the paper, these are embedded by a linear layer, which does not affect the conclusions.

Here we provide the segmentation results of the test set as examples, which can also be used to reproduce our reported results.
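A minimal sketch of sanity-checking the three segmentation inputs described above; the `%s` prefix value and `N` (number of test images) are assumptions, and the dummy arrays stand in for `np.load(f"{prefix}_segmentations.npy")` etc. on the real files:

```python
import numpy as np

prefix, N = "coco_test", 5000  # hypothetical prefix and split size

seg_feats = np.zeros((N, 133, 7, 7), dtype=np.float32)  # semantic features
seg_maps = np.zeros((N, 64, 64), dtype=np.int64)        # downsampled seg maps
cat_onehots = np.zeros((N, 133), dtype=np.float32)      # category one-hots

# 133 is the panoptic category count; 7x7 matches the region feature grid.
assert seg_feats.shape[1:] == (133, 7, 7)
assert seg_maps.shape[1:] == (64, 64)
assert cat_onehots.shape[1:] == (133,)
```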

Training

We separate the global (grid-based) and local (region-based) training processes. To train the region-level 3SHNet model, run train_rgn_seg_sp_se_coco.sh under the main folder:

sh train_rgn_seg_sp_se_coco.sh

To train the global-level 3SHNet model, run train_grid_seg_sp_se_coco.sh under the main folder:

sh train_grid_seg_sp_se_coco.sh

A similar training process can be applied to Flickr30K.

Testing the model

To test the trained models, you can directly run the eval scripts as:

sh eval_rgn_seg_sp_se_coco.sh

or

sh eval_gird_seg_sp_se_coco.sh

To obtain the ensemble results, refer to the code in eval_ensemble.py. To obtain the cross-dataset testing results, refer to the code in eval_cross_ensemble.py.
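A common ensemble scheme for two retrieval models is to average their similarity matrices before ranking; this is a sketch of that idea, not necessarily the exact method in eval_ensemble.py:

```python
import numpy as np

def ensemble_sims(sim_region, sim_grid, alpha=0.5):
    """Fuse two image-text similarity matrices (hypothetical helper).
    alpha weights the region-level model; 0.5 is a plain average."""
    return alpha * sim_region + (1.0 - alpha) * sim_grid

rng = np.random.default_rng(0)
sim_region = rng.standard_normal((4, 4))  # region-level model similarities
sim_grid = rng.standard_normal((4, 4))    # grid-level model similarities

fused = ensemble_sims(sim_region, sim_grid)
print(fused.shape)  # (4, 4)
```

Ranking and recall are then computed on the fused matrix exactly as for a single model.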

To obtain the reported results, we have released the single region-level and grid-level pre-trained models on Google Drive. Modify the pre-trained model paths in the evaluation scripts and then follow the testing processes above. To ensure reproducibility, we ran the code again and obtained similar or even higher results than reported in the paper!

The global-level results will be uploaded after we sort them out. If you don't mind the untidy code, please email me.

Citation

  @article{ge20243shnet,
  title={3SHNet: Boosting image-sentence retrieval via visual semantic-spatial self-highlighting},
  author={Ge, Xuri and Xu, Songpei and Chen, Fuhai and Wang, Jie and Wang, Guoxin and An, Shan and Jose, Joemon M},
  journal={Information Processing and Management},
  year={2024},
  publisher={Elsevier}
  }

Acknowledgement: We referred to the implementations of vse_infty and SCAN to build our codebase. Thanks to all.


