Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

We are excited to release a new video-text benchmark and extensible code for multi-shot video understanding. The 20k version of our dataset includes detailed long summaries for 20k videos and shot captions for 80k video shots.

Stay tuned for more data releases and new features!


What's new 👀

🌟 Update (16/12/2023): Paper and demo for the SUM-shot model. It showcases the power and versatility of detailed and grounded video summaries. Dive into the demo and share your experiences with us! Chat-SUM-shot is on the way! Stay tuned! 🎥📝🚀

🌟 Update (12/12/2023): Code for video summarization and shot captioning, available in the code sub-directory of this repo. Dive into these new features and share your experiences with us! 🎥📝🚀

🌟 Update (30/11/2023): Data of Shot2Story-20K. Check it out and stay tuned for more exciting updates! 💫🚀


Demo

We built a demo for the SUM-shot model, hosted on Space. Please have a look and explore what it is capable of. Issues are welcome! The Chat-SUM-shot model is on the way!

Some hints to play with our demo:

  • 🎉 Start with our provided demo videos, some of which are sampled from ActivityNet and are not included in our training data.
  • 🚀 Please upload videos smaller than 20MB. Enjoy!
  • 😄 For a more comprehensive understanding, try specifying reasonable starting and ending timestamps for the shots. Enjoy!
  • 😄 Set the temperature to 0.1 for the most grounded understanding and question-answering.
  • 😄 Set the temperature to a greater value for more creative grounded understanding and question-answering.
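
To illustrate why a low temperature gives more grounded output, here is a minimal sketch of temperature-scaled softmax, the standard way sampling temperature reshapes a next-token distribution. The function name and the example logits are made up for illustration; this is not code from the demo.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]
low = softmax_with_temperature(logits, 0.1)   # nearly deterministic
high = softmax_with_temperature(logits, 1.5)  # flatter, more varied sampling

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At temperature 0.1 almost all probability mass falls on the top-scoring token, so repeated runs give nearly identical answers. A larger value spreads the mass and lets less likely tokens be sampled.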

Multi-round conversation analyzing a humorous video:

demo3.mp4

Multiple-step minutes-long video analysis:

demo_multistep.mov

Table of Contents

  1. 🌟 What's new 👀
  2. Demo
  3. Introduction
  4. Dataset Glance
  5. Baselines and Tasks
  6. License
  7. Citation
  8. Contact

Introduction

A short video clip may contain a progression of multiple events and an interesting storyline. A viewer needs to capture the event in every shot and associate the shots with one another to understand the story behind them. In this work, we present Shot2Story, a new multi-shot video understanding benchmark with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments show that generating a long and comprehensive video summary remains challenging.


Dataset Glance


Our dataset comprises 20k video clips sourced from HD-VILA-100M. Each clip is meticulously annotated with single-shot video captions, narration captions, video summaries, extracted ASR texts, and shot transitions. Please refer to DATA.md for video and annotation preparation.

The dataset includes an average of 4.0 shots per video, resulting in a total of 80k video shots, each with detailed video caption and narration caption annotations. The average length of our video summaries is 201.8, while the average length of a video is 16s.
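
As a rough illustration of how shot transitions relate to shot counts, the sketch below turns a list of cut timestamps into (start, end) shot segments. The function name, the input representation (cut times in seconds), and the example values are assumptions made for illustration; the actual annotation format is described in DATA.md and may differ.

```python
def shots_from_transitions(transitions, duration):
    """Split a clip into (start, end) shot segments at the given cut times (seconds)."""
    cuts = sorted(transitions)
    if any(t <= 0 or t >= duration for t in cuts):
        raise ValueError("each transition must lie strictly inside the clip")
    bounds = [0.0] + cuts + [duration]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


# Hypothetical 16-second clip with three cuts, giving four shots,
# in line with the reported averages (16s per video, 4.0 shots per video).
segments = shots_from_transitions([3.5, 8.0, 12.25], 16.0)
print(segments)
print(len(segments))

# Sanity check on the dataset-level figure: 20k videos x 4.0 shots = 80k shots.
print(20_000 * 4.0)
```

Segments in this form are also a convenient way to supply starting and ending timestamps for each shot.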

For more comprehensive details, please refer to the plots below.




Baselines and Tasks

To benchmark advances in multi-modal video understanding, we designed several distinct tasks using our dataset, including single-shot captioning, multi-shot summarization, and video retrieval with shot descriptions. We designed and implemented several baseline models using a frozen vision encoder and an LLM, prompting the LLM with frame tokens and ASR (Automatic Speech Recognition) text.
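
The idea of prompting an LLM with frame tokens and ASR text can be sketched as below. This is a text-only illustration, not the repository's implementation: the `<frame>` placeholder, the `Shot` class, and the prompt layout are all hypothetical. In a real model, each placeholder would be replaced by embeddings produced by the frozen vision encoder.

```python
from dataclasses import dataclass

# Hypothetical marker; a real model would substitute visual embeddings here.
FRAME_PLACEHOLDER = "<frame>"


@dataclass
class Shot:
    num_frames: int  # number of sampled frames for this shot
    asr_text: str    # speech transcript for this shot, possibly empty


def build_summary_prompt(shots, instruction):
    """Interleave per-shot frame placeholders with ASR text, then append the instruction."""
    parts = []
    for index, shot in enumerate(shots, start=1):
        frames = " ".join([FRAME_PLACEHOLDER] * shot.num_frames)
        asr = shot.asr_text.strip() or "(no speech)"
        parts.append(f"Shot {index}: {frames}\nASR: {asr}")
    parts.append(instruction)
    return "\n".join(parts)


prompt = build_summary_prompt(
    [Shot(2, "Welcome back to the kitchen."), Shot(1, "")],
    "Summarize the video in detail.",
)
print(prompt)
```

Keeping visual placeholders and ASR text grouped per shot gives the LLM an explicit shot structure to follow when producing a multi-shot summary.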

The code for running the project is available here.




License

Our code is licensed under the Apache 2.0 License.

Our text annotations are released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. They are available strictly for non-commercial research. More dataset guidelines can be found here.


Citation

If you find this repo useful for your research, please consider citing the paper:

@article{han2023shot2story20k,
      title={Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos}, 
      author={Mingfei Han and Linjie Yang and Xiaojun Chang and Heng Wang},
      journal={arXiv preprint arXiv:2311.17043},
      year={2023}
}

Contact

If you have any questions or concerns about our dataset, please don't hesitate to contact us. You can raise an issue or reach us at [email protected]. We welcome feedback and are always looking to improve our dataset.


We extend our thanks to the teams behind HD-VILA-100M, BLIP2, Whisper, MiniGPT-4, Vicuna and LLaMA. Our work builds upon their valuable contributions. Please acknowledge these resources in your work.
