
Awesome Video-Text Retrieval by Deep Learning

A curated list of deep learning resources for video-text retrieval.

Contributing

Please feel free to open pull requests to add papers.

Markdown format:

- `[Author Journal/Booktitle Year]` Title. Journal/Booktitle, Year. [[paper]](link) [[code]](link) [[homepage]](link)
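
For instance, an entry for the Dual Encoding paper listed below would look like this (the links here are placeholders):

```markdown
- `[Dong et al. CVPR19]` Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019. [[paper]](link) [[code]](link)
```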

Table of Contents

  • Implementations
    • PyTorch
    • TensorFlow
    • Others
  • Useful Toolkit

Papers

2023

  • [Pei et al. CVPR23] CLIPPING: Distilling CLIP-Based Models with a Student Base for Video-Language Retrieval. CVPR, 2023. [paper]
  • [Li et al. CVPR23] SViTT: Temporal Learning of Sparse Video-Text Transformers. CVPR, 2023. [paper] [code]
  • [Wu et al. CVPR23] Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval. CVPR, 2023. [paper] [code]
  • [Ko et al. CVPR23] MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models. CVPR, 2023. [paper] [code]
  • [Wang et al. CVPR23] All in One: Exploring Unified Video-Language Pre-Training. CVPR, 2023. [paper] [code]
  • [Girdhar et al. CVPR23] IMAGEBIND: One Embedding Space To Bind Them All. CVPR, 2023. [paper] [code]
  • [Huang et al. CVPR23] VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval. CVPR, 2023. [paper] [code]
  • [Li et al. CVPR23] LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling. CVPR, 2023. [paper] [code]
  • [Huang et al. CVPR23] Clover: Towards a Unified Video-Language Alignment and Fusion Model. CVPR, 2023. [paper] [code]
  • [Ji et al. CVPR23] Seeing What You Miss: Vision-Language Pre-Training With Semantic Completion Learning. CVPR, 2023. [paper]
  • [Gan et al. CVPR23] CNVid-3.5M: Build, Filter, and Pre-train the Large-scale Public Chinese Video-text Dataset. CVPR, 2023. [paper] [code]
  • [Zhao et al. CVPRW23] Cali-NCE: Boosting Cross-Modal Video Representation Learning With Calibrated Alignment. CVPR Workshop, 2023. [paper]
  • [Ma et al. TCSVT23] Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval. TCSVT, 2023. [paper]

2022

  • [Dong et al. ACMMM22] Partially Relevant Video Retrieval. ACM Multimedia, 2022. [homepage] [paper] [code] (a new text-to-video retrieval subtask)
  • [Wang et al. ACMMM22] Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning. ACM Multimedia, 2022. [paper] [code]
  • [Wang et al. ACMMM22] Learn to Understand Negation in Video Retrieval. ACM Multimedia, 2022. [paper] [code]
  • [Falcon et al. ACMMM22] A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval. ACM Multimedia, 2022. [paper] [code]
  • [Ma et al. ACMMM22] X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval. ACM Multimedia, 2022. [paper]
  • [Hu et al. ECCV22] Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval. ECCV, 2022. [paper] [code]
  • [Liu et al. ECCV22] TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval. ECCV, 2022. [paper] [code]
  • [Dong et al. TCSVT22] Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval. TCSVT, 2022. [paper] [code]
  • [Li et al. CVPR22] Align and Prompt: Video-and-Language Pre-training with Entity Prompts, CVPR, 2022. [paper] [code]
  • [Shvetsova et al. CVPR22] Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval. CVPR, 2022. [paper] [code]
  • [Ge et al. CVPR22] Bridging Video-text Retrieval with Multiple Choice Questions. CVPR, 2022. [paper] [code]
  • [Han et al. CVPR22] Temporal Alignment Networks for Long-term Video. CVPR, 2022. [paper] [code]
  • [Gorti et al. CVPR22] X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval. CVPR, 2022. [paper] [code]
  • [Lu et al. NeurIPS22] LGDN: Language-Guided Denoising Network for Video-Language Modeling. NeurIPS, 2022. [paper]
  • [Liu et al. SIGIR22] Animating Images to Transfer CLIP for Video-Text Retrieval. SIGIR, 2022. [paper]
  • [Zhao et al. SIGIR22] CenterCLIP: Token Clustering for Efficient Text-Video Retrieval. SIGIR, 2022. [paper]
  • [Liu et al. ACL22] Cross-Modal Discrete Representation Learning. ACL, 2022. [paper]
  • [Gabeur et al. WACV22] Masking Modalities for Cross-modal Video Retrieval. WACV, 2022. [paper]
  • [Cao et al. AAAI22] Visual Consensus Modeling for Video-Text Retrieval. AAAI, 2022. [paper] [code]
  • [Cheng et al. AAAI22] Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss. AAAI, 2022. [paper][code]
  • [Wang et al. TMM22] Many Hands Make Light Work: Transferring Knowledge from Auxiliary Tasks for Video-Text Retrieval. IEEE Transactions on Multimedia, 2022. [paper]
  • [Park et al. NAACL22] Exposing the Limits of Video-Text Models through Contrast Sets. NAACL, 2022. [paper] [code]
  • [Song et al. TOMM22] Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval. TOMM, 2022. [paper]
  • [Bai et al. ARXIV22] LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval. arXiv:2207.04858, 2022. [paper]
  • [Bain et al. ARXIV22] A CLIP-Hitchhiker's Guide to Long Video Retrieval. arXiv:2205.08508, 2022. [paper]
  • [Gao et al. ARXIV22] CLIP2TV: Align, Match and Distill for Video-Text Retrieval. arXiv:2111.05610, 2022. [paper]
  • [Jiang et al. ARXIV22] Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations. arXiv:2204.03382, 2022. [paper]

2021

  • [Dong et al. TPAMI21] Dual Encoding for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [paper] [code]
  • [Wei et al. TPAMI21] Universal Weighting Metric Learning for Cross-Modal Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. [paper]
  • [Lei et al. CVPR21] Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling. CVPR, 2021. [paper] [code]
  • [Wray et al. CVPR21] On Semantic Similarity in Video Retrieval. CVPR, 2021. [paper] [code]
  • [Chen et al. CVPR21] Learning the Best Pooling Strategy for Visual Semantic Embedding. CVPR, 2021. [paper][code]
  • [Wang et al. CVPR21] T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval. CVPR, 2021. [paper]
  • [Miech et al. CVPR21] Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers. CVPR, 2021. [paper]
  • [Liu et al. CVPR21] Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval. CVPR, 2021. [paper]
  • [Chen et al. ICCV21] Multimodal Clustering Networks for Self-Supervised Learning from Unlabeled Videos. ICCV, 2021. [paper]
  • [Ioana et al. ICCV21] TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval. ICCV, 2021. [paper][code]
  • [Yang et al. ICCV21] TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment. ICCV, 2021. [paper]
  • [Bain et al. ICCV21] Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval. ICCV, 2021. [paper][code]
  • [Wen et al. ICCV21] COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-Training for Vision-Language Representation. ICCV, 2021. [paper][code]
  • [Luo et al. ACMMM21] CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising. ACM Multimedia, 2021. [paper]
  • [Wu et al. ACMMM21] HANet: Hierarchical Alignment Networks for Video-Text Retrieval. ACM Multimedia, 2021. [paper][code]
  • [Liu et al. ACMMM21] Progressive Semantic Matching for Video-Text Retrieval. ACM Multimedia, 2021. [paper]
  • [Han et al. ACMMM21] Fine-grained Cross-modal Alignment Network for Text-Video Retrieval. ACM Multimedia, 2021. [paper]
  • [Wei et al. ACMMM21] Meta Self-Paced Learning for Cross-Modal Matching. ACM Multimedia, 2021. [paper]
  • [Patrick et al. ICLR21] Support-set Bottlenecks for Video-text Representation Learning. ICLR, 2021. [paper]
  • [Qi et al. TIP21] Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval. IEEE Transactions on Image Processing, 2021. [paper]
  • [Song et al. TMM21] Spatial-temporal Graphs for Cross-modal Text2Video Retrieval. IEEE Transactions on Multimedia, 2021. [paper]
  • [Dong et al. NEUCOM21] Multi-level Alignment Network for Domain Adaptive Cross-modal Retrieval. Neurocomputing, 2021. [paper] [code]
  • [Jin et al. SIGIR21] Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval. SIGIR, 2021. [paper]
  • [He et al. SIGIR21] Improving Video Retrieval by Adaptive Margin. SIGIR, 2021. [paper]
  • [Wang et al. IJCAI21] Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment. IJCAI, 2021. [paper]
  • [Chen et al. AAAI21] Mind-the-Gap! Unsupervised Domain Adaptation for Text-Video Retrieval. AAAI, 2021. [paper]
  • [Hao et al. ICME21] What Matters: Attentive and Relational Feature Aggregation Network for Video-Text Retrieval. ICME, 2021. [paper]
  • [Wu et al. ICME21] Multi-Dimensional Attentive Hierarchical Graph Pooling Network for Video-Text Retrieval. ICME, 2021. [paper]
  • [Song et al. ICIP21] Semantic-Preserving Metric Learning for Video-Text Retrieval. IEEE International Conference on Image Processing, 2021. [paper]
  • [Hao et al. ICMR21] Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval. ICMR, 2021. [paper]
  • [Liu et al. ARXIV21] HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval. arXiv:2103.15049, 2021. [paper]
  • [Akbari et al. ARXIV21] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. arXiv:2104.11178, 2021. [paper] [code]
  • [Fang et al. ARXIV21] CLIP2Video: Mastering Video-Text Retrieval via Image CLIP. arXiv:2106.11097, 2021. [paper] [code]
  • [Luo et al. ARXIV21] CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. arXiv:2104.08860, 2021. [paper][code]
  • [Li et al. ARXIV21] Align and Prompt: Video-and-Language Pre-training with Entity Prompts. arXiv:2112.09583, 2021. [paper][code]

2020

  • [Yang et al. SIGIR20] Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval. SIGIR, 2020. [paper]
  • [Ging et al. NeurIPS20] COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. NeurIPS, 2020. [paper] [code]
  • [Gabeur et al. ECCV20] Multi-modal Transformer for Video Retrieval. ECCV, 2020. [paper] [code] [homepage]
  • [Li et al. TMM20] SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries. IEEE Transactions on Multimedia, 2020. [paper]
  • [Wang et al. TMM20] Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval. IEEE Transactions on Multimedia, 2020. [paper]
  • [Chen et al. TMM20] Interclass-Relativity-Adaptive Metric Learning for Cross-Modal Matching and Beyond. IEEE Transactions on Multimedia, 2020. [paper]
  • [Wu et al. ACMMM20] Interpretable Embedding for Ad-Hoc Video Search. ACM Multimedia, 2020. [paper]
  • [Feng et al. IJCAI20] Exploiting Visual Semantic Reasoning for Video-Text Retrieval. IJCAI, 2020. [paper]
  • [Wei et al. CVPR20] Universal Weighting Metric Learning for Cross-Modal Retrieval. CVPR, 2020. [paper]
  • [Doughty et al. CVPR20] Action Modifiers: Learning from Adverbs in Instructional Videos. CVPR, 2020. [paper]
  • [Chen et al. CVPR20] Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. CVPR, 2020. [paper]
  • [Zhu et al. CVPR20] ActBERT: Learning Global-Local Video-Text Representations. CVPR, 2020. [paper]
  • [Miech et al. CVPR20] End-to-End Learning of Visual Representations From Uncurated Instructional Videos. CVPR, 2020. [paper] [code] [homepage]
  • [Zhao et al. ICME20] Stacked Convolutional Deep Encoding Network For Video-Text Retrieval. ICME, 2020. [paper]
  • [Luo et al. ARXIV20] UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation. arXiv:2002.06353, 2020. [paper]

2019

  • [Dong et al. CVPR19] Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019. [paper] [code]
  • [Song et al. CVPR19] Polysemous visual-semantic embedding for cross-modal retrieval. CVPR, 2019. [paper]
  • [Wray et al. ICCV19] Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings. ICCV, 2019. [paper]
  • [Xiong et al. ICCV19] A Graph-Based Framework to Bridge Movies and Synopses. ICCV, 2019. [paper]
  • [Li et al. ACMMM19] W2VV++: Fully Deep Learning for Ad-hoc Video Search. ACM Multimedia, 2019. [paper] [code]
  • [Liu et al. BMVC19] Use What You Have: Video Retrieval Using Representations From Collaborative Experts. BMVC, 2019. [paper] [code]
  • [Choi et al. BigMM19] From Intra-Modal to Inter-Modal Space: Multi-Task Learning of Shared Representations for Cross-Modal Retrieval. International Conference on Multimedia Big Data, 2019. [paper]

2018

  • [Dong et al. TMM18] Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia, 2018. [paper] [code]
  • [Zhang et al. ECCV18] Cross-Modal and Hierarchical Modeling of Video and Text. ECCV, 2018. [paper] [code]
  • [Yu et al. ECCV18] A Joint Sequence Fusion Model for Video Question Answering and Retrieval. ECCV, 2018. [paper]
  • [Shao et al. ECCV18] Find and focus: Retrieve and localize video events with natural language queries. ECCV, 2018. [paper]
  • [Mithun et al. ICMR18] Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval. ICMR, 2018. [paper] [code]
  • [Miech et al. arXiv18] Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv preprint arXiv:1804.02516, 2018. [paper] [code]

Before

  • [Yu et al. CVPR17] End-to-end concept word detection for video captioning, retrieval, and question answering. CVPR, 2017. [paper] [code]
  • [Otani et al. ECCVW16] Learning joint representations of videos and sentences with web image search. ECCV Workshop, 2016. [paper]
  • [Xu et al. AAAI15] Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. AAAI, 2015. [paper]

Ad-hoc Video Search

  • For papers targeting ad-hoc video search in the context of TRECVID, please refer to here.

Other Related

  • [Rouditchenko et al. INTERSPEECH21] AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. Interspeech, 2021. [paper] [code]
  • [Li et al. arXiv20] Learning Spatiotemporal Features via Video and Text Pair Discrimination. arXiv preprint arXiv:2001.05691, 2020. [paper]

Datasets

  • [MSVD] Chen et al. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011. [paper] [dataset]
  • [MSRVTT] Xu et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR, 2016. [paper] [dataset]
  • [TGIF] Li et al. TGIF: A new dataset and benchmark on animated GIF description. CVPR, 2016. [paper] [homepage]
  • [AVS] Awad et al. TRECVID 2016: Evaluating video search, video event detection, localization, and hyperlinking. TRECVID Workshop, 2016. [paper] [dataset]
  • [LSMDC] Rohrbach et al. Movie description. IJCV, 2017. [paper] [dataset]
  • [ActivityNet Captions] Krishna et al. Dense-captioning events in videos. ICCV, 2017. [paper] [dataset]
  • [DiDeMo] Hendricks et al. Localizing Moments in Video with Natural Language. ICCV, 2017. [paper] [code]
  • [HowTo100M] Miech et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. ICCV, 2019. [paper] [homepage]
  • [VATEX] Wang et al. VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. ICCV, 2019. [paper] [homepage]
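
Many of the methods listed above (e.g. Dual Encoding, CLIP4Clip, X-Pool) share a common dual-encoder paradigm: a text encoder and a video encoder map queries and videos into a joint embedding space, and retrieval ranks videos by cosine similarity to the query. The toy sketch below illustrates only that scoring step, with made-up embeddings standing in for real encoder outputs; it is not the implementation of any specific paper.

```python
# Illustrative sketch of dual-encoder text-to-video retrieval scoring.
# The "embeddings" here are toy vectors; real systems produce them with
# learned text and video encoders.
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Normalize along the last axis so cosine similarity is a dot product."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def rank_videos(text_emb, video_embs):
    """Return video indices sorted from best to worst match for the query."""
    sims = l2_normalize(video_embs) @ l2_normalize(text_emb)
    return np.argsort(-sims)

# One query embedding and three candidate video embeddings (toy values).
query = np.array([1.0, 0.0, 0.5])
videos = np.array([
    [0.9, 0.1, 0.4],    # points roughly the same way as the query
    [0.0, 1.0, 0.0],    # nearly orthogonal to the query
    [-1.0, 0.0, -0.5],  # opposite direction
])
print(rank_videos(query, videos))  # best match listed first
```

Evaluation in the papers above typically reports Recall@K over such a ranking (whether the ground-truth video appears in the top K results).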

Licenses

CC0

To the extent possible under law, danieljf24 has waived all copyright and related or neighboring rights to this repository.
