Summary of Video-to-Text datasets. This repository is part of the review paper *Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review*.


Datasets for Video Captioning

In this repository, we organize information about more than 25 datasets of (video, text) pairs that have been used for training and evaluating video captioning models. With this repository, we want to make it easier for researchers to obtain these datasets and replicate results. We are open to collaboration for including new datasets and related concepts.

Even though there are several online video platforms to collect data from, constructing a balanced labeled dataset is a costly task that requires significant, time-consuming human effort. Although most of the datasets listed below are not directly available online, some of them can be obtained from the authors upon request.

A detailed description of all these datasets can be found in our comprehensive review of the video-to-text research area:


@article{PerezMartin2021BridgingReview,
	title   = {A Comprehensive Review of the Video-to-Text Problem},
	author  = {Perez-Martin, Jesus and Bustos, Benjamin and Guimar\~{a}es, Silvio Jamil F. and Sipiran, Ivan and Perez, Jorge and Coello Said, Grethel},
	journal = {Artificial Intelligence Review},
	doi     = {10.1007/s10462-021-10104-1},
	year    = {2022}
}

Video + Language Datasets

List of datasets of video-caption/description pairs for training video-to-text models. The last three columns indicate:

  • whether the dataset contains temporal-localization information for each description,
  • whether the videos of the dataset contain audio, and
  • whether the corpus's descriptions are, on average, longer than eleven words.
| Dataset | Year | Type | Caps. Source | Localization | Audio | Long Caps. |
|---|---|---|---|:-:|:-:|:-:|
| MSVD [1] | 2011 | Open (YouTube) | MTurk | | | |
| MPII Cooking Act. [2] | 2012 | Cooking | in-house actors | | | |
| MPII Cooking C. Act. [3] | 2012 | Cooking | in-house actors | | | |
| YouCook [4] | 2013 | Cooking | MTurk | | | |
| TACoS [5] | 2013 | Cooking | MTurk | ✔️ | | |
| TACoS-Multilevel [6] | 2014 | Cooking | MTurk | | | |
| M-VAD [7] | 2015 | Movie | DVS | | ✔️ | ✔️ |
| MPII-MD [8] | 2015 | Movie | Script + DVS | | | ✔️ |
| LSMDC [web] | 2015 | Movie | Script + DVS | | ✔️ | |
| MSR-VTT [9] | 2016 | Open | MTurk | | ✔️ | |
| MPII Cooking 2 [10] | 2016 | Cooking | in-house actors | ✔️ | | |
| Charades [11] | 2016 | Daily indoor act. | MTurk | ✔️ | ✔️ | ✔️ |
| VTW-full [12] | 2016 | Open (YouTube) | Owner/Editor | | | |
| TGIF [13] | 2016 | Open | crowdworkers | | | ✔️ |
| TRECVID-VTT'16 [14] | 2016 | Open | MTurk | | ✔️ | ✔️ |
| Co-ref+Gender [15] | 2017 | Movie | DVS | | | ✔️ |
| ActivityNet Caps. [16] | 2017 | Human act. | MTurk | ✔️ | ✔️ | ✔️ |
| TRECVID-VTT'17 [17] | 2017 | Open (Twitter) | MTurk | | ✔️ | |
| YouCook2 [18] | 2018 | Cooking | viewer/annotator | ✔️ | ✔️ | |
| Charades-Ego [19] | 2018 | Daily indoor act. | MTurk | ✔️ | ✔️ | ✔️ |
| 20BN-s-s V2 [20] | 2018 | Human-obj. interact. | MTurk | | | |
| TRECVID-VTT'18 [21] | 2018 | Open (Twitter) | MTurk | | ✔️ | ✔️ |
| TRECVID-VTT'19 [22] | 2019 | Open (Twitter+Flickr) | MTurk | | ✔️ | ✔️ |
| VATEX [23] | 2019 | Open | MTurk | | ✔️ | ✔️ |
| HowTo100M [24] | 2019 | Instructional (YouTube) | subtitles | ✔️ | ✔️ | ✔️ |
| TRECVID-VTT'20 [25] | 2020 | Open (Twitter+Flickr) | MTurk | | ✔️ | ✔️ |


Corpus Details

The next table details the vocabulary composition of each corpus, i.e., the number of tokens, nouns, verbs, adjectives, and adverbs. We also report the average caption length and the percentage of tokens that appear in the GloVe-6B dictionary. The word categorization was computed by POS-tagging with the universal tagset mapping. The scripts we used to obtain this categorization for each corpus are provided in the preprocess_corpus directory.
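As a reference for how such statistics can be computed, below is a minimal sketch in Python with NLTK. It is not the repository's preprocess_corpus code; the helper names and the GloVe file path (glove.6B.300d.txt, one word per line followed by its vector) are assumptions made for illustration.

```python
# Minimal sketch (not the repository's preprocess_corpus scripts):
# POS-tag captions with the universal tagset and count the vocabulary per category.
from collections import defaultdict
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("universal_tagset", quiet=True)

def corpus_stats(captions):
    """captions: list of caption strings from one corpus."""
    vocab_by_pos = defaultdict(set)  # universal tag -> set of word types
    total_words, vocab = 0, set()
    for caption in captions:
        tokens = nltk.word_tokenize(caption.lower())
        total_words += len(tokens)
        vocab.update(tokens)
        # universal tagset maps fine-grained tags to NOUN, VERB, ADJ, ADV, ...
        for word, tag in nltk.pos_tag(tokens, tagset="universal"):
            vocab_by_pos[tag].add(word)
    return {
        "avg_caption_len": total_words / len(captions),
        "tokens": len(vocab),
        "nouns": len(vocab_by_pos["NOUN"]),
        "verbs": len(vocab_by_pos["VERB"]),
        "adjectives": len(vocab_by_pos["ADJ"]),
        "adverbs": len(vocab_by_pos["ADV"]),
    }

def glove6b_coverage(vocab, glove_path="glove.6B.300d.txt"):
    """% of word types found in the GloVe-6B vocabulary (path/format assumed)."""
    with open(glove_path, encoding="utf-8") as f:
        glove_vocab = {line.split(" ", 1)[0] for line in f}
    return 100.0 * len(vocab & glove_vocab) / len(vocab)

print(corpus_stats(["a man is playing a guitar", "someone slices a tomato"]))
```

Running corpus_stats over the full caption list of a corpus yields the kind of counts shown in the table below.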

| Corpus | Avg. caption len. | Tokens | Nouns | Verbs | Adjectives | Adverbs | % of tokens in GloVe-6B |
|---|---:|---:|---:|---:|---:|---:|---:|
| MSVD [1] | 7.14 | 9,629 | 6,057 | 3,211 | 1,751 | 378 | 83.44 |
| TACoS-Multilevel [6] | 8.27 | 2,863 | 1,475 | 1,178 | 609 | 207 | 91.86 |
| M-VAD [7] | 12.44 | 17,609 | 9,512 | 2,571 | 3,560 | 857 | 94.99 |
| MPII-MD [8] | 11.05 | 20,650 | 11,397 | 6,100 | 3,952 | 1,162 | 88.96 |
| LSMDC [web] | 10.66 | 24,267 | 15,095 | 7,726 | 7,078 | 1,545 | 88.57 |
| MSR-VTT [9] | 9.27 | 23,527 | 19,703 | 8,862 | 7,329 | 1,195 | 80.74 |
| TGIF [13] | 11.28 | 10,646 | 6,658 | 3,993 | 2,496 | 626 | 97.85 |
| Charades [11] | 23.91 | 4,010 | 2,199 | 1,780 | 651 | 265 | 90.00 |
| Charades-Ego [19] | 26.30 | 2,681 | 1,460 | 1,179 | 358 | 190 | 91.12 |
| VTW-full [12] | 6.40 | 23,059 | 13,606 | 6,223 | 3,967 | 846 | - |
| ActivityNet Caps. [16] | 14.72 | 10,162 | 4,671 | 3,748 | 2,131 | 493 | 94.00 |
| 20BN-s-s V2 [20] | 6.76 | 7,433 | 6,087 | 1,874 | 1,889 | 361 | 74.51 |
| TRECVID-VTT'20 [25] | 18.90 | 11,634 | 7,316 | 4,038 | 2,878 | 542 | 93.36 |
| VATEX-en [23] | 15.25 | 28,634 | 23,288 | 12,796 | 10,639 | 1,924 | 76.84 |
| HowTo100M [24] | 4.16 | 593,238 | 491,488 | 198,859 | 219,719 | 76,535 | 36.64 |

Standard Splits

The next table shows the number of video clips and captions in the standard splits of each video-caption/description dataset.

| Dataset | clips (train) | captions (train) | clips (val) | captions (val) | clips (test) | captions (test) | clips (total) | captions (total) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| MSVD [1] | 1,200 | 48,779 | 100 | 4,291 | 670 | 27,768 | 1,970 | 80,838 |
| TACoS [5] | - | - | - | - | - | - | 7,206 | 18,227 |
| TACoS-Multilevel [6] | - | - | - | - | - | - | 14,105 | 52,593 |
| M-VAD [7] | 36,921 | 36,921 | 4,717 | 4,717 | 4,951 | 4,951 | 46,589 | 46,589 |
| MPII-MD [8] [15] | 56,822 | 56,861 | 4,927 | 4,930 | 6,578 | 6,584 | 68,327 | 68,375 |
| LSMDC [web] | 91,908 | 91,941 | 6,542 | 6,542 | 10,053 | 10,053 | 108,503 | 108,536 |
| MSR-VTT, 2016 [9] | 6,512 | 130,260 | 498 | 9,940 | 2,990 | 59,800 | 10,000 | 200,000 |
| MSR-VTT, 2017 [web] | 10,000 | 200,000 | - | - | 3,000 | 60,000 | 13,000 | 260,000 |
| Charades [11] | 7,985 | 18,167 | 1,863 | 6,865 | - | - | 9,848 | 25,032 |
| Charades-Ego [19] | 6,167 | 12,346 | 1,693 | 1,693 | - | - | 7,860 | 14,039 |
| TGIF [13] | 80,850 | 80,850 | 10,831 | 10,831 | 34,101 | 34,101 | 125,782 | 125,781 |
| ActivityNet Caps. [16] | 10,024 | 36,587 | 4,926 | 17,979 | 5,044 | 18,410 | 19,994 | 72,976 |
| 20BN-s-s V2 [20] | 168,913 | 1,689,913 | 24,777 | 24,777 | 27,157 | - | 220,847 | 1,714,690 |
| TRECVID-VTT'20 [25] | 7,485 | 28,183 | - | - | 1,700 | - | 9,185 | 28,183 |
| VATEX [23] | 25,991 | 259,910 | 3,000 | 30,000 | 6,000 | 60,000 | 34,991 | 349,910 |
| HowTo100M [24] | - | - | - | - | - | - | 139,668,840 | 139,668,840 |

References

[1] D. L. Chen and W. B. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, vol. 1, pp. 190–200.

[2] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele, “A database for fine grained activity detection of cooking activities,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2012, pp. 1194–1201.

[3] M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, and B. Schiele, “Script Data for Attribute-Based Recognition of Composite Activities,” in Computer Vision -- ECCV 2012, 2012, pp. 144–157.

[4] P. Das, C. Xu, R. F. Doell, and J. J. Corso, “A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2013, pp. 2634--2641.

[5] M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal, “Grounding Action Descriptions in Videos,” Trans. Assoc. Comput. Linguist., vol. 1, pp. 25–36, Dec. 2013.

[6] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele, “Coherent multi-sentence video description with variable level of detail,” in Pattern Recognition, 2014, pp. 184--195.

[7] A. Torabi, C. Pal, H. Larochelle, and A. Courville, “Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research,” CoRR, vol. abs/1503.01070, Mar. 2015, [Online]. Available: http://arxiv.org/abs/1503.01070.

[8] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, “A Dataset for Movie Description,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 3202–3212.

[9] J. Xu, T. Mei, T. Yao, and Y. Rui, “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language,” 2016 IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5288–5296, 2016, doi: 10.1109/CVPR.2016.571.

[10] M. Rohrbach et al., “Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data,” Int. J. Comput. Vis., vol. 119, no. 3, pp. 346–373, Sep. 2016.

[11] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, “Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding,” in Computer Vision -- ECCV 2016, 2016, pp. 510–526, doi: 10.1007/978-3-319-46448-0_31.

[12] K.-H. Zeng, T.-H. Chen, J. C. Niebles, and M. Sun, “Title Generation for User Generated Videos,” in Computer Vision -- ECCV 2016, 2016, pp. 609–625, doi: 10.1007/978-3-319-46475-6_38.

[13] Y. Li et al., “TGIF: A New Dataset and Benchmark on Animated GIF Description,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 4641–4650.

[14] G. Awad et al., “TRECVID 2016: Evaluating video search, video event detection, localization, and hyperlinking,” 2016.

[15] A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh, and B. Schiele, “Generating Descriptions with Grounded and Co-Referenced People,” 2017.

[16] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, “Dense-Captioning Events in Videos,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 706–715.

[17] G. Awad et al., “TRECVID 2017: Evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking,” 2017.

[18] L. Zhou, C. Xu, and J. J. Corso, “Towards Automatic Learning of Procedures from Web Instructional Videos,” in Proceedings of the AAAI Conference on Artificial Intelligence, Mar. 2018, pp. 7590--7598.

[19] G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari, “Actor and Observer: Joint Modeling of First and Third-Person Videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7396–7404.

[20] F. Mahdisoltani, G. Berger, W. Gharbieh, D. Fleet, and R. Memisevic, “Fine-grained Video Classification and Captioning,” CoRR, vol. abs/1804.09235, Apr. 2018. Available: http://arxiv.org/abs/1804.09235.

[21] G. Awad et al., “TRECVID 2018: Benchmarking Video Activity Detection, Video Captioning and Matching, Video Storytelling Linking and Video Search,” 2018.

[22] G. Awad et al., “TRECVID 2019: An evaluation campaign to benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & retrieval,” 2019.

[23] X. Wang, J. Wu, J. Chen, L. Li, Y.-F. Wang, and W. Y. Wang, “VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research,” in The IEEE International Conference on Computer Vision (ICCV), 2019, pp. 4581–4591.

[24] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic, “HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2019, pp. 2630–2640.

[25] G. Awad et al., “TRECVID 2020: comprehensive campaign for evaluating video retrieval tasks across multiple application domains,” 2020.
